Not sure how exactly, but I ran across an article that claimed parsing of CSV data was necessarily CPU-bound. I was pretty sure that, with reasonably efficient code, there was no reason this had to be true. Still, proof is better than opinion, so I took the feed-readers code from a prior exercise and adapted it to parse CSV files.
You can grab the sources for the test CSV-parser-1 program from Subversion. The test program does a full CSV parse as described in RFC 4180 (including handling quoted fields with embedded line breaks) and a bit more, but does nothing with the parsed data.
Results from test runs on my HP laptop (Intel Core 2 Duo T9300 @ 2.5GHz, 4GB memory):
    preston@mercury:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
    TIME Sun Dec 21 16:45:43 2008
    Scanning: in/1g.txt
    Done with: in/1g.txt
    TIME Sun Dec 21 16:45:47 2008
    Elapsed (ms): 4249, total (MB): 981
    Scanned 230 MB/s

    real    0m4.254s
    user    0m3.500s
    sys     0m0.656s
The above is for a ~1GB file, fully cached in memory (do repeated test runs until the times stabilize).
    preston@mercury:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
    TIME Sun Dec 21 16:49:31 2008
    Scanning: in/4g.txt
    Done with: in/4g.txt
    TIME Sun Dec 21 16:51:02 2008
    Elapsed (ms): 91449, total (MB): 3944
    Scanned 43 MB/s

    real    1m31.455s
    user    0m41.271s
    sys     0m10.549s
The above is for a ~4GB file, not cached in memory. The result is very clear: an efficient CSV file parser can ingest data much faster than the data can be read off an ordinary disk (here, a bit over five times faster). Even a fast RAID would be hard-pressed to deliver data faster than it could be parsed.
Of course, in “real” applications, any processing performed on the parsed CSV data will likely dominate the runtime. Application-specific processing could easily saturate more than one CPU. The problem partitions into the most-efficient read-and-parse of CSV data from disk (which is what this example does), and distribution of application-specific processing across multiple CPUs (which this example can do in the same manner as feed-workers … and which may or may not suit your application).
Insert the usual caveats here. The example program has seen only basic testing. Other applications were (minimally) active during the runs. The C++ code was written for reuse and has not been run through a profiler. You could tweak the code for slightly better performance, but probably not any large improvement.
The same test run on a desktop (slower CPUs, faster disk):
    preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
    TIME Sun Dec 21 17:21:19 2008
    Scanning: in/1g.txt
    Done with: in/1g.txt
    TIME Sun Dec 21 17:21:27 2008
    Elapsed (ms): 7530, total (MB): 981
    Scanned 130 MB/s

    real    0m7.535s
    user    0m6.136s
    sys     0m1.388s
The above times are for a file cached in memory.
    preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
    TIME Sun Dec 21 17:23:31 2008
    Scanning: in/4g.txt
    Done with: in/4g.txt
    TIME Sun Dec 21 17:25:00 2008
    Elapsed (ms): 89020, total (MB): 3944
    Scanned 44 MB/s

    real    1m29.084s
    user    0m26.994s
    sys     0m7.004s
The above times are for a file not cached in memory. The results are entirely consistent with the first set of runs.
Clearly, an efficient CSV file parser can process data faster than a single disk can deliver it. The SSDs (solid-state drives) currently on the market seem to manage sustained read rates in the range of 40-100 MB/s, so a single-process parser should be able to fully saturate even those.
If you are doing large-scale processing of CSV data, the most efficient approach is most likely a single (efficient!) reader-parser thread feeding roughly as many application-specific processing threads (or processes) as you have CPUs.