Performance parsing CSV data
I no longer remember exactly where, but I ran across an article that claimed parsing of CSV data was necessarily CPU-bound. I was pretty sure that with reasonably efficient code there was no reason this had to be true. Still, proof is better than opinion, so I took the feed-readers code from a prior exercise and adapted it to parse CSV files.
You can grab the sources for the test CSV-parser-1 program from Subversion. The test program does a full CSV parse as described in RFC 4180 (including handling quoted fields with embedded line breaks) and a bit more, but does nothing with the parsed data.
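For illustration, here is a minimal sketch (in modern C++, and emphatically not the actual CSV-parser-1 source) of the kind of state machine an RFC 4180 parser needs; parseCsv and the state names are my own invention:

#include <cstddef>
#include <string>
#include <vector>

// States for an RFC 4180 field scanner.
enum class State { FieldStart, Unquoted, Quoted, QuoteInQuoted };

// Parse a buffer of CSV text into records of fields.
std::vector<std::vector<std::string>> parseCsv(const char* data, std::size_t size)
{
    std::vector<std::vector<std::string>> records;
    std::vector<std::string> record;
    std::string field;
    State state = State::FieldStart;

    auto endField = [&] { record.push_back(field); field.clear(); };
    auto endRecord = [&] { endField(); records.push_back(record); record.clear(); };

    for (std::size_t i = 0; i < size; ++i) {
        const char c = data[i];
        switch (state) {
        case State::FieldStart:
            if (c == '"')        state = State::Quoted;
            else if (c == ',')   endField();
            else if (c == '\n')  endRecord();
            else if (c != '\r')  { field += c; state = State::Unquoted; }
            break;
        case State::Unquoted:
            if (c == ',')        { endField(); state = State::FieldStart; }
            else if (c == '\n')  { endRecord(); state = State::FieldStart; }
            else if (c != '\r')  field += c;
            break;
        case State::Quoted:
            if (c == '"')  state = State::QuoteInQuoted;
            else           field += c;  // commas and line breaks are literal here
            break;
        case State::QuoteInQuoted:
            if (c == '"')        { field += '"'; state = State::Quoted; }  // doubled quote
            else if (c == ',')   { endField(); state = State::FieldStart; }
            else if (c == '\n')  { endRecord(); state = State::FieldStart; }
            break;  // anything else after a closing quote is ignored here
        }
    }
    // Flush a final record that lacks a trailing line break.
    if (state != State::FieldStart || !record.empty())
        endRecord();
    return records;
}

A production parser chasing the numbers below would avoid the per-character std::string appends and instead track field boundaries within a large read buffer, but the states and transitions are the same.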
Results from test runs - first on my HP laptop (Intel Core 2 Duo T9300 CPU @ 2.5GHz with 4GB memory):
$ time Release/CSV-parser-1 -n 0 in/1g.txt
TIME Sun Dec 21 16:45:43 2008
Scanning: in/1g.txt
Done with: in/1g.txt
TIME Sun Dec 21 16:45:47 2008
Elapsed (ms): 4249, total (MB): 981
Scanned 230 MB/s
real 0m4.254s
user 0m3.500s
sys 0m0.656s
The above is for a ~1GB file, fully cached in memory (do repeated test runs until the times stabilize).
$ time Release/CSV-parser-1 -n 0 in/4g.txt
TIME Sun Dec 21 16:49:31 2008
Scanning: in/4g.txt
Done with: in/4g.txt
TIME Sun Dec 21 16:51:02 2008
Elapsed (ms): 91449, total (MB): 3944
Scanned 43 MB/s
real 1m31.455s
user 0m41.271s
sys 0m10.549s
The above is for a ~4GB file, not cached in memory. The result is very clear - an efficient CSV file parser can ingest data much faster than the data can be read off ordinary disks (230 MB/s parsed from cache versus 43 MB/s when limited by disk reads, a bit over five times faster). Even a fast RAID would be hard-pressed to deliver data faster than it could be parsed.
Of course, in "real" applications, any processing performed on the parsed CSV data will likely dominate the runtime. Application-specific processing could easily saturate more than one CPU. The problem partitions into most-efficient read-and-parse of CSV data from disk (which is what this example does), and distribution of application-specific processing across multiple CPUs (which this example can do in the same manner as feed-workers ... and which may or may not suit your application). A sketch of that arrangement appears at the end of this post.
Insert the usual caveats here. The example program has seen only basic testing. There were other applications (minimally) active. The C++ code was written for reuse, and has not been run through a profiler. You could tweak the code to get slightly better performance, but probably not any large improvements.
The same test run on a desktop (slower CPUs, faster disk):
preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
TIME Sun Dec 21 17:21:19 2008
Scanning: in/1g.txt
Done with: in/1g.txt
TIME Sun Dec 21 17:21:27 2008
Elapsed (ms): 7530, total (MB): 981
Scanned 130 MB/s
real 0m7.535s
user 0m6.136s
sys 0m1.388s
The above times are for a file cached in memory.
preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
TIME Sun Dec 21 17:23:31 2008
Scanning: in/4g.txt
Done with: in/4g.txt
TIME Sun Dec 21 17:25:00 2008
Elapsed (ms): 89020, total (MB): 3944
Scanned 44 MB/s
real 1m29.084s
user 0m26.994s
sys 0m7.004s
The above times are for a file not cached in memory. The results are entirely consistent with the first set of runs.
Clearly, an efficient CSV file parser can process data faster than a single disk can deliver it. The SSDs (solid-state disks) currently on the market seem to manage sustained read rates in the range of 40-100 MB/s, so a single-process parser should still be able to fully saturate the disk.
If you are doing large-scale processing of CSV data, your most efficient approach is likely a single (efficient!) reader-parser thread feeding roughly as many application-specific processing threads (or processes) as you have CPUs.
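To make that shape concrete, here is a minimal sketch of the arrangement, assuming C++11 threads are available; RecordQueue, kMax, and the commented-out process() call are illustrative stand-ins, not code from CSV-parser-1 or feed-workers:

// One reader-parser thread feeds a bounded queue of parsed records
// to a pool of worker threads - roughly one worker per CPU.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

typedef std::vector<std::string> Record;

class RecordQueue {
public:
    void push(Record r) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return queue_.size() < kMax; });
        queue_.push(std::move(r));
        notEmpty_.notify_one();
    }
    // Returns false once the queue is drained and closed.
    bool pop(Record& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty())
            return false;
        out = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }
private:
    static const std::size_t kMax = 10000;  // bound memory use
    std::queue<Record> queue_;
    std::mutex mutex_;
    std::condition_variable notEmpty_, notFull_;
    bool closed_ = false;
};

int main() {
    RecordQueue queue;

    // Single reader-parser thread: all file reads and CSV parsing happen here.
    std::thread reader([&queue] {
        // for each record parsed from the input file:
        //     queue.push(std::move(record));
        queue.close();  // tell the workers the input is exhausted
    });

    // Application-specific processing: roughly one worker per CPU.
    unsigned count = std::thread::hardware_concurrency();
    if (count == 0) count = 4;  // fallback when the CPU count is unknown
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < count; ++i) {
        workers.emplace_back([&queue] {
            Record record;
            while (queue.pop(record)) {
                // process(record);  // application-specific work goes here
            }
        });
    }

    reader.join();
    for (auto& worker : workers)
        worker.join();
    return 0;
}

In practice you would pass batches of records (a few thousand at a time) through the queue rather than single records, so that locking overhead does not eat the parser's throughput advantage.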