I do not recall exactly how, but I ran across an article that claimed parsing of CSV data was necessarily CPU-bound. I was pretty sure that with reasonably efficient code there was no reason this had to be true. Still, proof is better than opinion, so I took the feed-readers code from a prior exercise and adapted it to parse CSV files.

You can grab the sources for the test CSV-parser-1 program from Subversion. The test program does a full CSV parse as described in RFC 4180 (including handling quoted fields with embedded line breaks) and a bit more, but does nothing with the parsed data.
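
The actual parser lives in the Subversion sources; purely for illustration, here is a minimal (and unoptimized, character-at-a-time) sketch of the record scan RFC 4180 implies. This is not the CSV-parser-1 code, just a sketch of the cases the real parser has to handle: quoted fields, doubled quotes ("") inside quotes, and quoted fields containing embedded line breaks.

// Minimal sketch of an RFC 4180-style record scan. NOT the CSV-parser-1 code;
// character-at-a-time and unoptimized, but it covers quoted fields, doubled
// quotes ("") inside quotes, and quoted fields with embedded line breaks.
#include <istream>
#include <string>
#include <vector>

typedef std::vector<std::string> Record;

bool read_record(std::istream& in, Record& out)
{
    out.clear();
    std::string field;
    bool quoted = false;
    int c;
    while ((c = in.get()) != std::char_traits<char>::eof()) {
        if (quoted) {
            if (c == '"') {
                if (in.peek() == '"') { in.get(); field += '"'; }   // "" is a literal quote
                else quoted = false;                                // closing quote
            } else {
                field += (char) c;    // commas and line breaks are data inside quotes
            }
        } else if (c == '"') {
            quoted = true;            // opening quote
        } else if (c == ',') {
            out.push_back(field);     // field separator
            field.clear();
        } else if (c == '\n') {
            if (!field.empty() && field[field.size() - 1] == '\r')
                field.erase(field.size() - 1);     // tolerate CRLF line ends
            out.push_back(field);     // end of record
            return true;
        } else {
            field += (char) c;
        }
    }
    // The last record may lack a trailing line break.
    if (!field.empty() || !out.empty()) { out.push_back(field); return true; }
    return false;
}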

Results from test runs on my HP laptop (Intel Core 2 Duo T9300 @ 2.5GHz, 4GB memory):

preston@mercury:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
TIME Sun Dec 21 16:45:43 2008
Scanning: in/1g.txt
Done with: in/1g.txt
TIME Sun Dec 21 16:45:47 2008
Elapsed (ms): 4249, total (MB): 981
Scanned 230 MB/s

real    0m4.254s
user    0m3.500s
sys 0m0.656s

The above is for a ~1GB file, fully cached in memory (repeat the run until the times stabilize, so the file is being served from the page cache rather than the disk).

preston@mercury:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
TIME Sun Dec 21 16:49:31 2008
Scanning: in/4g.txt
Done with: in/4g.txt
TIME Sun Dec 21 16:51:02 2008
Elapsed (ms): 91449, total (MB): 3944
Scanned 43 MB/s

real    1m31.455s
user    0m41.271s
sys 0m10.549s

The above is for a ~4GB file, not cached in memory. The result is very clear: an efficient CSV file parser can ingest data much faster than the data can be read off an ordinary disk (230 MB/s from cache versus 43 MB/s when limited by the disk, a bit over five times faster). Even a fast RAID would be hard-pressed to deliver data faster than it could be parsed.

Of course, in “real” applications, any processing performed on the parsed CSV data will likely dominate the runtime. Application-specific processing could easily saturate more than one CPU. The problem partitions into the most-efficient read-and-parse of CSV data from disk (which is what this example does), and distribution of application-specific processing across multiple CPUs (which this example can do in the same manner as feed-workers … and which may or may not suit your application).
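
For concreteness, here is one way that partitioning can look. This is not the feed-workers code, and it uses C++11 threads (which post-date the original program); the queue depth, batch size, and names are illustrative only. It builds on read_record and Record from the sketch above: a single reader-parser thread fills batches of records and pushes them onto a bounded queue, and worker threads pull batches off for the application-specific work.

// One possible partitioning, sketched with C++11 threads. A single
// reader-parser thread pushes batches of parsed records onto a bounded
// queue; worker threads pull batches and do the application-specific
// processing. Queue depth and batch size are illustrative, not tuned.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <istream>
#include <mutex>
#include <thread>
#include <vector>

typedef std::vector<Record> Batch;      // Record as in the parse sketch above

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t limit) : limit_(limit), done_(false) {}

    void push(Batch batch) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < limit_; });
        q_.push_back(std::move(batch));
        not_empty_.notify_one();
    }

    bool pop(Batch& out) {              // returns false once drained and closed
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return true;
    }

    void close() {
        std::lock_guard<std::mutex> lock(m_);
        done_ = true;
        not_empty_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<Batch> q_;
    std::size_t limit_;
    bool done_;
};

void run_pipeline(std::istream& in, unsigned workers)
{
    BoundedQueue queue(16);             // a few batches in flight at once
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&queue] {
            Batch batch;
            while (queue.pop(batch)) {
                // application-specific processing of the batch goes here
            }
        });
    }
    Batch batch;
    Record record;
    while (read_record(in, record)) {   // read_record from the sketch above
        batch.push_back(record);
        if (batch.size() == 1024) {     // hand off in batches to cut lock traffic
            queue.push(std::move(batch));
            batch.clear();
        }
    }
    if (!batch.empty()) queue.push(std::move(batch));
    queue.close();
    for (std::size_t i = 0; i < pool.size(); ++i) pool[i].join();
}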

Insert the usual caveats here. The example program has seen only basic testing. Other applications were (minimally) active during the runs. The C++ code was written for reuse and has not been run through a profiler. You could tweak the code for slightly better performance, but probably not gain any large improvement.

The same test run on a desktop (slower CPUs, faster disk):

preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
TIME Sun Dec 21 17:21:19 2008
Scanning: in/1g.txt
Done with: in/1g.txt
TIME Sun Dec 21 17:21:27 2008
Elapsed (ms): 7530, total (MB): 981
Scanned 130 MB/s

real    0m7.535s
user    0m6.136s
sys 0m1.388s

The above times are for the ~1GB file, cached in memory.

preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
TIME Sun Dec 21 17:23:31 2008
Scanning: in/4g.txt
Done with: in/4g.txt
TIME Sun Dec 21 17:25:00 2008
Elapsed (ms): 89020, total (MB): 3944
Scanned 44 MB/s

real    1m29.084s
user    0m26.994s
sys 0m7.004s

The above times are for the ~4GB file, not cached in memory. The results are entirely consistent with the first set of runs.

Clearly, an efficient CSV file parser can process data faster than a single disk can deliver it. The SSDs (solid-state disks) currently on the market seem to manage sustained read rates in the range of 40-100MB/s, so a single-process parser should still be able to fully saturate the disk.

If you are doing large-scale processing of CSV data, the most efficient approach is likely a single (efficient!) reader-parser thread, feeding roughly as many application-specific processing threads (or processes) as you have CPUs.
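
How many workers is "roughly as many as you have CPUs"? A small hedged sketch (again C++11; reserving one core for the reader-parser thread is my assumption, not something measured above):

#include <thread>

// Roughly one worker per CPU, leaving one core for the reader-parser thread.
// hardware_concurrency() may return 0 when the count cannot be determined.
unsigned pick_worker_count()
{
    unsigned hw = std::thread::hardware_concurrency();
    return (hw > 1) ? (hw - 1) : 1;
}

With the pipeline sketch above, run_pipeline(file, pick_worker_count()) wires the two together.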