Tuesday, August 4, 2015

How to write a "filter" stream wrapper for XML?

I have some large XML feed files that contain illegal characters (0x1, etc.). The files come from a third party, and I cannot change the process that writes them.

I would like to process these files using an XmlReader, but it blows up on these illegal characters.

I could read the files, filter out the bad characters, save them, then process them... but this is a lot of I/O, and it seems like it should be unnecessary.

What I would like to do is something like this:

using(var origStream = File.OpenRead(fileName))
using(var cleanStream = new CleansedXmlStream(origStream))
using(var streamReader = new StreamReader(cleanStream))
using(var xmlReader = XmlReader.Create(streamReader))
{
    //do stuff with reader
}

I tried inheriting from Stream, but I lost some confidence when I got to implementing Read(byte[] buffer, int offset, int count). Since I was planning on removing characters, it seemed the count would be off. I would also have to translate each byte to a char, which seemed expensive (especially on large files), and I was unclear how this would work with a Unicode encoding. The answers to these questions were not obvious to me.
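On the count concern: as far as I can tell from the Stream documentation, Read is allowed to return fewer bytes than requested, and only a return value of 0 signals the end of the stream. Here is a minimal sketch of that idea. It assumes the feed is UTF-8 or another ASCII-compatible encoding, where bytes below 0x80 never occur inside a multi-byte sequence, so control bytes can be dropped without decoding to char. It would not be safe for UTF-16. The class and method names are hypothetical.

```csharp
using System.IO;

// Hypothetical helper: the body a filtering Stream.Read override could delegate to.
// Assumption: the data is UTF-8 (or another ASCII-compatible encoding).
public static class Utf8ControlByteFilter
{
    public static int ReadFiltered(Stream source, byte[] buffer, int offset, int count)
    {
        while (true)
        {
            int read = source.Read(buffer, offset, count);
            if (read <= 0)
            {
                return 0; // end of the underlying stream (or count was 0)
            }

            // Compact the legal bytes toward the front of the caller's buffer.
            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[offset + i];
                // Control bytes other than tab, LF and CR are not legal in XML 1.0.
                bool illegal = b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
                if (!illegal)
                {
                    buffer[offset + kept++] = b;
                }
            }

            // Returning 0 would look like end of stream, so read again
            // if every byte in this chunk was filtered out.
            if (kept > 0)
            {
                return kept;
            }
        }
    }
}
```

With a filter like this, Length and Position on the wrapper no longer correspond to what the consumer has read, so the wrapper probably should not claim to be seekable.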

When googling for "c# stream wrapper" or "c# filter stream" I am not getting satisfactory results. It's possible I'm using the wrong words or describing the wrong concept, so I'm hoping the SO community can square me away.

Using the example above, what would CleansedXmlStream look like?

Here's what my first attempt looked like:

public class CleansedXmlStream : Stream
{
    private readonly Stream _baseStream;

    public CleansedXmlStream(Stream stream)
    {
        this._baseStream = stream;
    }

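    protected override void Dispose(bool disposing)
    {
        // Override Dispose(bool) instead of hiding Dispose() with "new";
        // using blocks and Close() go through Stream.Dispose(), which calls this.
        if (disposing && this._baseStream != null)
        {
            this._baseStream.Dispose();
        }
        base.Dispose(disposing);
    }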

    public override bool CanRead
    {
        get { return this._baseStream.CanRead; }
    }

    public override bool CanSeek
    {
        get { return this._baseStream.CanSeek; }
    }

    public override bool CanWrite
    {
        get { return false; } // Write() below throws NotSupportedException
    }

    public override long Length
    {
        get { return this._baseStream.Length; }
    }

    public override long Position
    {
        get { return this._baseStream.Position; }
        set { this._baseStream.Position = value; }
    }

    public override void Flush()
    {
        this._baseStream.Flush();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        //what does this look like?

        throw new NotImplementedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        return this._baseStream.Seek(offset, origin);
    }

    public override void SetLength(long value)
    {
        this._baseStream.SetLength(value);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }
}
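
One direction that might sidestep the byte-versus-char question is to filter at the character level instead: let StreamReader handle the decoding and wrap it in a TextReader that drops illegal characters before XmlReader sees them. The following is only a sketch under that assumption. CleansedXmlReader is a hypothetical name, and the character test follows the XML 1.0 Char ranges, with UTF-16 surrogates passed through without checking that they are correctly paired.

```csharp
using System;
using System.IO;

// Hypothetical character-level filter; not an existing framework class.
public sealed class CleansedXmlReader : TextReader
{
    private readonly TextReader _inner;

    public CleansedXmlReader(TextReader inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        _inner = inner;
    }

    // XML 1.0 legal characters; surrogates are passed through unverified.
    private static bool IsAllowed(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || char.IsSurrogate(c)
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    public override int Peek()
    {
        int c;
        // Discard illegal characters until a legal one (or the end) is next.
        while ((c = _inner.Peek()) != -1 && !IsAllowed((char)c))
        {
            _inner.Read();
        }
        return c;
    }

    public override int Read()
    {
        int c;
        while ((c = _inner.Read()) != -1 && !IsAllowed((char)c))
        {
            // skip the illegal character
        }
        return c;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        while (true)
        {
            int read = _inner.Read(buffer, index, count);
            if (read <= 0) return 0; // end of input (or count was 0)

            // Compact the legal characters toward the front of the buffer.
            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                char c = buffer[index + i];
                if (IsAllowed(c)) buffer[index + kept++] = c;
            }

            // 0 means end of input, so read again if the whole chunk was dropped.
            if (kept > 0) return kept;
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

In the using chain above, this would take the place of CleansedXmlStream: wrap the StreamReader in a CleansedXmlReader and pass that to XmlReader.Create. It only removes raw illegal characters; character references such as &#x1; in the text would still reach the parser.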
