FDO TextReader Enhancements
Overview
The FDO API provides a number of classes to support streamed I/O to files and in-memory buffers. Clients can add derived classes to support additional devices. A number of Reader and Writer classes are provided to read and write formatted data from and to these devices. One of these classes is the FdoIoTextReader, which reads data from a stream in text format.
The FdoIoTextReader class is incomplete. Streaming support was added to FDO to satisfy the requirement to read and write XML documents. Although XML documents tend to be in text format, FdoIoTextReader was not actually needed to read XML; the FdoXmlReader uses Xerces instead. Therefore, FdoIoTextReader has a constructor and a destructor but no actual text reading functions.
The FdoIoTextReader should be completed for the following reasons:
- It rounds out the API. The converse class (FdoIoTextWriter) is fully implemented, so finishing FdoIoTextReader would make the API more consistent
- Developers using FDO 3.2 could have used FdoIoTextReader to convert stream contents to a string. Instead, they had to write workaround code since this class was not complete
- The API changes can be made in a way that will provide performance benefits for writing XML documents. This will be seen later in this document when the API changes are discussed
API Changes
FdoIoTextReader
The text reader can be enhanced by adding functions to read formatted (delimited) text.
class FdoIoTextReader : public FdoIDisposable
{
public:
    FDO_API_COMMON FdoBoolean ReadLine(FdoStringP& outString);
    FDO_API_COMMON FdoBoolean Read(FdoStringP& outString);
    FDO_API_COMMON FdoBoolean Read( FdoStringP& outString, FdoString* leftDelimiters );
    FDO_API_COMMON FdoBoolean Read( FdoStringP& outString, FdoString* leftDelimiters, FdoString* rightDelimiters );
    FDO_API_COMMON FdoBoolean Read( FdoStringP& outString, FdoString* leftDelimiters, FdoString* rightDelimiters, FdoString* skipChars );
    FDO_API_COMMON FdoBoolean Read( FdoStringP& outString, FdoString* leftDelimiters, FdoString* rightDelimiters, FdoString* skipChars, FdoString* separators );
};
Parameters:
outString: the string that was read
leftDelimiters: list of left delimiter characters; defaults to empty string (L"").
rightDelimiters: list of right delimiter characters; defaults to empty string (L""). Position is important: 1st left delimiter corresponds to 1st right delimiter and so on. If a right delimiter is not explicitly specified for a particular left delimiter then the right delimiter is the same as the left.
skipChars: list of characters that can be skipped while looking for the left delimiter; defaults to blank character plus end-of-line (L" \n").
separators: list of separator characters. These are similar to skip characters except that a separator can only occur once between each string that is read; defaults to empty string (L"").
Description:
ReadLine() reads the current line (from the current position to the next end-of-line character or end-of-stream).
Read() provides the ability to read delimited strings.
Reading is done according to the following steps:
- starting at the current position, any characters in the skipChars set are skipped.
- if the next character is a left delimiter, outString becomes all characters between but not including the left and corresponding right delimiter
- otherwise, outString becomes all characters from the current character, up to but not including the next skipChar or separator (or end-of-stream, whichever comes first).
- Any subsequent skipChars or separators are skipped. However, Read() stops if a second separator character is encountered. In this case, the 2nd separator is not skipped but the current position is set to it.
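The steps above can be sketched against an in-memory string. This is only an illustration under stated assumptions, not the FDO implementation: TextCursor and ReadToken are hypothetical names, a narrow std::string stands in for FdoStringP and the stream, and std::runtime_error stands in for FdoException. The sketch follows the four steps literally, including the rule that a missing right delimiter defaults to its left delimiter.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a stream plus its current position.
struct TextCursor {
    std::string data;
    std::size_t pos;
};

// Hypothetical model of the proposed Read(); returns false at end of input.
bool ReadToken(TextCursor& in, std::string& out,
               const std::string& leftDelims = "",
               const std::string& rightDelims = "",
               const std::string& skipChars = " \n",
               const std::string& separators = "")
{
    const std::string& s = in.data;
    std::size_t p = in.pos;

    // Step 1: skip leading skip characters.
    while (p < s.size() && skipChars.find(s[p]) != std::string::npos) ++p;
    if (p >= s.size()) { in.pos = p; return false; }

    const std::size_t d = leftDelims.find(s[p]);
    if (d != std::string::npos) {
        // Step 2: delimited string. The right delimiter is matched by
        // position and falls back to the left delimiter when unspecified.
        const char right = d < rightDelims.size() ? rightDelims[d] : leftDelims[d];
        const std::size_t close = s.find(right, p + 1);
        if (close == std::string::npos)
            throw std::runtime_error("right delimiter not found before end of input");
        out = s.substr(p + 1, close - p - 1);
        p = close + 1;
    } else {
        // Step 3: undelimited string, ends at a skip character or separator.
        const std::size_t start = p;
        while (p < s.size() && skipChars.find(s[p]) == std::string::npos
                            && separators.find(s[p]) == std::string::npos) ++p;
        out = s.substr(start, p - start);
    }

    // Step 4: consume trailing skip characters and at most one separator;
    // a second separator is left as the new current position.
    bool seenSeparator = false;
    while (p < s.size()) {
        if (separators.find(s[p]) != std::string::npos) {
            if (seenSeparator) break;
            seenSeparator = true;
        } else if (skipChars.find(s[p]) == std::string::npos) {
            break;
        }
        ++p;
    }
    in.pos = p;
    return true;
}
```

Because a second separator is never consumed, two adjacent separators produce an empty string on the following call, which is how empty fields in comma-separated input are reported.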
Returns:
ReadLine() returns true if a line was read and false if at the end of the underlying stream.
Read() returns true if a string was read and false otherwise. false is returned if the end of the stream is reached before a string can be extracted.
It is possible for both of the above to return true with outString being an empty string. For ReadLine(), this happens when the character at the current position is the end-of-line character. For Read(), this can happen when there are no characters between the left and right delimiters.
Exceptions:
Read() throws an FdoException if a left delimiter is encountered but end-of-stream is reached before the right delimiter is found.
Example 1:
If the data at and after the input stream's current position looks like this:
" {'T – true', 'F – false'} {'red' 'green', 'blue'}"
then the following table shows the return value and string that is read. Current position on return from Read() is also shown by the remainder which shows the part of the stream at and after the new current position:
call | returns | outString | remainder |
Read(outString, "", "", " {'" ) | true | "T" | " – true', 'F – false'} {'red' 'green', 'blue'}" |
Read(outString) | true | "{'T" | " – true', 'F – false'} {'red' 'green', 'blue'}" |
Read(outString, "'", "", "{ " ) | true | "T – true" | ", 'F – false'} {'red' 'green', 'blue'}" |
Read(outString, "'", "", "{" ) | false(*) | n/a | " {'T – true', 'F – false'} {'red' 'green', 'blue'}" |
Read(outString,"", "", "" ) | true | " {'T – true', 'F – false'} {'red' 'green', 'blue'}" | "" (next call to Read would return false) |
Read(outString,"'{","'}"," ") | true | "'T – true', 'F – false'" (**) | " {'red' 'green', 'blue'}" |
(*) - " " is not a skip character so the left delimiter is not reached
(**) - the 2nd left delimiter ("{") is encountered before "'", so the read continues until the 2nd right delimiter ("}") is reached
Example 2:
If a stream contains comma separated values where:
- values are separated by commas and any number of spaces
- values with embedded separators are enclosed in either single or double quotes
- other values are optionally delimited
e.g.:
"abc def", , 123.45, 'ghi', jkl, "", 9876
then each value can be read by a call to:
Read(outString, "'\"", "", " ",",")
Multiple calls to the above Read() on the above stream will read the following for each call:
"abc def"
""
"123.45"
"ghi"
"jkl"
""
"9876"
Example 3:
ReadLine is actually just a convenience function. It is equivalent to:
Read(outString, "", "", "", "\n" )
FdoIoCachedStream
A new stream class will be added:
class FdoIoCachedStream : public FdoIoStream
{
public:
    FDO_API_COMMON static FdoIoCachedStream* Create( FdoIoStream* baseStream, FdoInt32 bufferSize=4096 );
    FDO_API_COMMON FdoIoStream* GetStream();
    FDO_API_COMMON void Flush();
};

typedef FdoPtr<FdoIoCachedStream> FdoIoCachedStreamP;
Parameters:
baseStream: underlying stream where data will ultimately be read from or written to.
bufferSize: size in bytes of data cache. The cache is used to provide better read/write performance
Description:
GetStream() returns the underlying stream.
Flush() writes any cached data that is waiting to be written to the underlying stream.
This class provides buffered reads and writes from and to an FdoIoStream by adding an in-memory cache.
See Design Discussion below for the rationale behind this class.
FdoIoCachedFileStream
The following class will be added for convenience:
class FdoIoCachedFileStream : public FdoIoCachedStream
{
public:
    FDO_API_COMMON static FdoIoCachedFileStream* Create( FdoString* fileName, FdoString* accessModes, FdoInt32 bufferSize=4096 );
    FDO_API_COMMON static FdoIoCachedFileStream* Create( FILE* fp, FdoInt32 bufferSize=4096 );
};

typedef FdoPtr<FdoIoCachedFileStream> FdoIoCachedFileStreamP;
The parameters are identical to those of FdoIoFileStream and FdoIoCachedStream. The behaviour is also identical except that reads and writes are buffered.
FdoIoCachedFileStream serves as a convenience class for wrapping an FdoIoCachedStream around an FdoIoFileStream to provide buffered access to a file. The following:
FdoIoCachedFileStreamP cfs = FdoIoCachedFileStream::Create( L"myfile.txt", L"r" );
is equivalent to:
FdoIoFileStreamP fs = FdoIoFileStream::Create( L"myfile.txt", L"r" );
FdoIoCachedStreamP cfs = FdoIoCachedStream::Create( fs );
It might seem odd that the Create( FILE*, FdoInt32 ) function is needed, since a FILE already provides buffered access. However, FdoIoFileStream performs reads and writes through the underlying device for the FILE, so the buffering provided by FILE is bypassed. An alternative approach would have been to leverage the buffering provided by FILE. However, once FdoIoCachedStream is in place, it can also be used to implement buffered I/O for other types of devices.
FdoIoTextReader and FdoIoTextWriter
The following functions will have a slight behavioural change:
FdoIoTextReader::Create( FdoString* fileName )
FdoIoTextWriter::Create( FdoString* fileName )
These functions used to automatically create an underlying stream of type FdoIoFileStream. They will change to create the stream as an FdoIoCachedFileStream. This will provide performance benefits.
Design Discussion
The Read functions, being added to FdoIoTextReader, do not know in advance how much data they will read. Therefore, these functions must do one of two things:
- read the data one byte at a time.
- read fixed sized chunks of data until the end of the string to read is reached.
For some devices (e.g. files) the second option provides better performance. However, it is much more complicated to implement since the end of the string to read can be in the middle of the current chunk. Therefore, the current read position must be reset from the end of the chunk to the middle. This cannot be done by changing the position on the underlying stream since not all streams support rewinding (some only support forward-only reading). Therefore the text reader would have to cache the remainder of the current chunk, to be read when the next string to read is requested.
The first option is much simpler but could be slow when reading from a file (the FdoIoFileStream class does not provide any buffering on read). Therefore, the second option is preferred.
Rather than complicate FdoIoTextReader with the managing of a read buffer, this complication can be pushed down to a new class, called FdoIoCachedStream. An added benefit is that doing the caching at the stream level allows buffered writing to file to be supported, providing performance benefits for streamed writing as well.
When the first read request is made to FdoIoCachedStream, it will read enough data from its base stream to fill its cache. For this read and subsequent reads, the data will be retrieved from the cache. When the end of the cache is reached, it will be flushed and filled again from the base stream.
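The read path described above can be sketched as follows. This is a simplified model under stated assumptions, not the proposed class: CachedReader and the BaseRead callback are hypothetical names, and the callback stands in for a read on the base stream (it returns the number of bytes copied, 0 at end-of-stream).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical read-side model of a cached stream.
class CachedReader {
public:
    using BaseRead = std::function<std::size_t(unsigned char* dest, std::size_t count)>;

    explicit CachedReader(BaseRead baseRead, std::size_t bufferSize = 4096)
        : baseRead_(std::move(baseRead)), cache_(bufferSize), begin_(0), end_(0)
    {
        assert(bufferSize > 0);
    }

    // Copies up to count bytes into dest; returns the number actually copied.
    std::size_t Read(unsigned char* dest, std::size_t count)
    {
        std::size_t copied = 0;
        while (copied < count) {
            if (begin_ == end_) {
                // Cache is empty or exhausted: refill it from the base stream.
                end_ = baseRead_(cache_.data(), cache_.size());
                begin_ = 0;
                if (end_ == 0) break;  // base stream is at end-of-stream
            }
            const std::size_t n = std::min(count - copied, end_ - begin_);
            std::memcpy(dest + copied, cache_.data() + begin_, n);
            begin_ += n;
            copied += n;
        }
        return copied;
    }

private:
    BaseRead baseRead_;
    std::vector<unsigned char> cache_;
    std::size_t begin_;  // next unread byte in the cache
    std::size_t end_;    // one past the last valid byte in the cache
};
```

With the default 4096-byte cache, reading 10,000 bytes one byte at a time reaches the base stream only three times in this model, which is the effect a text reader scanning for a delimiter relies on.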
When the first write request is made, the data is added to the cache. Subsequent writes will append to the cache. When the cache is full, it will be flushed by writing its contents to the base stream.
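The write path can be sketched the same way. Again this is an illustration only: CachedWriter and the BaseWrite callback are hypothetical names, with the callback standing in for a write on the base stream. Mixed reads and writes are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical write-side model of a cached stream.
class CachedWriter {
public:
    using BaseWrite = std::function<void(const unsigned char* src, std::size_t count)>;

    explicit CachedWriter(BaseWrite baseWrite, std::size_t bufferSize = 4096)
        : baseWrite_(std::move(baseWrite)), capacity_(bufferSize)
    {
        assert(capacity_ > 0);
        cache_.reserve(capacity_);
    }

    // Appends to the cache, flushing to the base stream whenever it fills up.
    void Write(const unsigned char* src, std::size_t count)
    {
        while (count > 0) {
            const std::size_t n = std::min(count, capacity_ - cache_.size());
            cache_.insert(cache_.end(), src, src + n);
            src += n;
            count -= n;
            if (cache_.size() == capacity_) Flush();
        }
    }

    // Writes any pending cached bytes to the base stream.
    void Flush()
    {
        if (!cache_.empty()) {
            baseWrite_(cache_.data(), cache_.size());
            cache_.clear();
        }
    }

private:
    BaseWrite baseWrite_;
    std::size_t capacity_;
    std::vector<unsigned char> cache_;
};
```

In this model the caller must call Flush() before discarding the writer, otherwise the last partial buffer never reaches the base stream.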
The actual algorithm for buffered reading and writing would be more complicated than described above, especially when mixed reads and writes are performed. However, this document concentrates on the API, so the exact algorithm is beyond the scope of the document. To the caller, FdoIoCachedStream will behave as if the reads and writes had been done directly against the base stream, except that they will be faster.
Of the current streaming implementations, the file stream will benefit the most from caching. Caching would provide no benefit to the memory stream. For convenience, an FdoIoCachedFileStream is provided. It allows a file stream, with cached stream wrapper, to be created in one step.
More Examples
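The following reads each line of text from "myfile.txt":
// Create the reader directly from the file name.
FdoIoTextReaderP rdr = FdoIoTextReader::Create( L"myfile.txt" );
FdoStringP line;
// ReadLine() returns false once the end of the stream is reached.
while ( rdr->ReadLine(line) ) {
    printf( "%ls\n", (FdoString*) line );
}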
The following does the same thing, but with a 50,000 byte read buffer:
// Open the file with an explicit 50,000 byte cache.
FdoIoCachedFileStreamP stream = FdoIoCachedFileStream::Create( L"myfile.txt", L"rt", 50000 );
// Wrap the cached stream in a text reader.
FdoIoTextReaderP rdr = FdoIoTextReader::Create( stream );
FdoStringP line;
while ( rdr->ReadLine(line) ) {
    printf( "%ls\n", (FdoString*) line );
}
Performance Stats
The addition of the FdoIoCachedStream class will help performance when reading or writing files. This will also translate into performance improvements for writing XML files. The reading of XML files will not be affected, since this is delegated to Xerces, which already does buffered reads.
The improvements on writing were verified by running tests against the FdoIoTextWriter class. This class was used to write a 10 MB file (10,000,000 characters) to local disk through multiple calls to FdoIoTextWriter::Write(). Several tests were run, each passing a different number of characters to each call to Write(). The following table shows the results:
# characters per call | # calls to Write | time (secs) | char/sec |
1 | 10000000 | 45 | 222222 |
10 | 1000000 | 5.07 | 1972387 |
20 | 500000 | 2.64 | 3787878 |
100 | 100000 | 0.69 | 14492754 |
4000 | 2500 | 0.28 | 35714286 |
8000 | 1250 | 0.28 | 35714286 |
50000 | 200 | 0.34 | 29411765 |
- the one character per call case is an extreme one which wouldn't commonly occur.
- the 10 and 20 char per call cases would simulate the writing of an XML file. The FdoXmlWriter invokes FdoIoTextWriter::Write() for each XML fragment and most fragments would be in the neighbourhood of 20 char.
- larger XML fragments could be as large as 100 characters, so this case was also tested.
- 4000 characters is close to the FdoIoCachedStream default buffer size.
- the 8000 and 50000 character cases were tried to see if any benefits could be realized by increasing the default buffer size.
If we assume that the average XML fragment size is 20 bytes, then a 10X improvement in writing XML to files would be realized. Even in the extreme case, where all fragments are large (100 char), there is an approximate 2.5X improvement.
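The ratios quoted above follow directly from the timings in the table. The helpers below are a small sketch of that arithmetic; CharsPerSecond and Speedup are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

// Throughput for a test run: total characters written divided by elapsed time.
double CharsPerSecond(double totalChars, double seconds)
{
    assert(seconds > 0.0);
    return totalChars / seconds;
}

// Speedup of an improved run over a baseline run of the same total size.
double Speedup(double baselineSeconds, double improvedSeconds)
{
    assert(improvedSeconds > 0.0);
    return baselineSeconds / improvedSeconds;
}
```

Using the table's timings, Speedup(2.64, 0.28) is about 9.4 for 20-character fragments and Speedup(0.69, 0.28) is about 2.5 for 100-character fragments, which is where the rounded 10X and 2.5X figures come from.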
There does not appear to be any benefit to raising the default buffer size. The 8K buffer would provide the same performance. The 50K buffer is actually slower. FdoIoTextWriter::Write() takes a wide-char string and converts it to UTF8 before writing it. This conversion overhead might start to be significant at the higher buffer sizes, which may explain the slowdown.
In conclusion, FdoIoCachedStream provides significant performance benefits when writing XML to files.