Not sure how exactly, but I ran across an article that claimed parsing of CSV data was necessarily CPU-bound. I was pretty sure that, with reasonably efficient code, there was no reason this had to be true. Still, proof is better than opinion, so I took the feed-readers code from a prior exercise and adapted it to parse CSV files.
You can grab the sources for the test CSV-parser-1 program from Subversion. The test program does a full CSV parse as described in RFC 4180 (including handling quoted fields with embedded line breaks) and a bit more, but does nothing with the parsed data.
Results from test runs on my HP laptop (Intel Core 2 Duo T9300 @ 2.5GHz, 4GB memory):
    preston@mercury:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
    TIME Sun Dec 21 16:45:43 2008
    Scanning: in/1g.txt
    Done with: in/1g.txt
    TIME Sun Dec 21 16:45:47 2008
    Elapsed (ms): 4249, total (MB): 981
    Scanned 230 MB/s

    real    0m4.254s
    user    0m3.500s
    sys     0m0.656s
The above is for a ~1GB file, fully cached in memory (do repeated test runs until the times stabilize).
    preston@mercury:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
    TIME Sun Dec 21 16:49:31 2008
    Scanning: in/4g.txt
    Done with: in/4g.txt
    TIME Sun Dec 21 16:51:02 2008
    Elapsed (ms): 91449, total (MB): 3944
    Scanned 43 MB/s

    real    1m31.455s
    user    0m41.271s
    sys     0m10.549s
The above is for a ~4GB file, not cached in memory. The result is very clear: an efficient CSV file parser can ingest data much faster than the data can be read off an ordinary disk (here, a bit over five times faster). Even a fast RAID would be hard-pressed to deliver data faster than it could be parsed.
Of course, in “real” applications, any processing performed on the parsed CSV data will likely dominate the runtime. Application-specific processing could easily saturate more than one CPU. The problem partitions into the most-efficient read-and-parse of CSV data from disk (which is what this example does), and distribution of application-specific processing across multiple CPUs (which this example can do in the same manner as feed-workers … and which may or may not suit your application).
Insert the usual caveats here. The example program has seen only basic testing. Other applications were (minimally) active during the runs. The C++ code was written for reuse and has not been run through a profiler. You could tweak the code for slightly better performance, but probably not any large improvement.
The same test run on a desktop (slower CPUs, faster disk):
    preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
    TIME Sun Dec 21 17:21:19 2008
    Scanning: in/1g.txt
    Done with: in/1g.txt
    TIME Sun Dec 21 17:21:27 2008
    Elapsed (ms): 7530, total (MB): 981
    Scanned 130 MB/s

    real    0m7.535s
    user    0m6.136s
    sys     0m1.388s
The above times are for a file cached in memory.
    preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
    TIME Sun Dec 21 17:23:31 2008
    Scanning: in/4g.txt
    Done with: in/4g.txt
    TIME Sun Dec 21 17:25:00 2008
    Elapsed (ms): 89020, total (MB): 3944
    Scanned 44 MB/s

    real    1m29.084s
    user    0m26.994s
    sys     0m7.004s
The above times are for a file not cached in memory. The results are entirely consistent with the first set of runs.
Clearly, an efficient CSV file parser can process data faster than a single disk can deliver it. The SSDs (solid-state drives) currently on the market seem to manage sustained read rates in the range of 40-100 MB/s, so a single-process parser should be able to fully saturate even those.
If you are doing large-scale processing of CSV data, the most efficient approach is most likely a single (efficient!) reader-parser thread feeding roughly as many application-specific processing threads (or processes) as you have CPUs.