Performance parsing CSV data
I no longer remember exactly where, but I ran across an article that claimed parsing of CSV data was necessarily CPU-bound. I was pretty sure that with reasonably efficient code there was no reason this had to be true. Still, proof is better than opinion, so I took the feed-readers code from a prior exercise and adapted it to parse CSV files.
You can grab the sources for the test CSV-parser-1 program from Subversion. The test program does a full CSV parse as described in RFC 4180 (including handling quoted fields with embedded line breaks) and a bit more, but does nothing with the parsed data.
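For illustration, here is a minimal sketch (in modern C++, and emphatically not the actual CSV-parser-1 source) of the kind of state machine an RFC 4180 parser needs; parseCsv and the state names are my own invention:

#include <cstddef>
#include <string>
#include <vector>

// States for an RFC 4180 field scanner.
enum class State { FieldStart, Unquoted, Quoted, QuoteInQuoted };

// Parse a buffer of CSV text into records of fields.
std::vector<std::vector<std::string>> parseCsv(const char* data, std::size_t size)
{
    std::vector<std::vector<std::string>> records;
    std::vector<std::string> record;
    std::string field;
    State state = State::FieldStart;

    auto endField = [&] { record.push_back(field); field.clear(); };
    auto endRecord = [&] { endField(); records.push_back(record); record.clear(); };

    for (std::size_t i = 0; i < size; ++i) {
        const char c = data[i];
        switch (state) {
        case State::FieldStart:
            if (c == '"')        state = State::Quoted;
            else if (c == ',')   endField();
            else if (c == '\n')  endRecord();
            else if (c != '\r')  { field += c; state = State::Unquoted; }
            break;
        case State::Unquoted:
            if (c == ',')        { endField(); state = State::FieldStart; }
            else if (c == '\n')  { endRecord(); state = State::FieldStart; }
            else if (c != '\r')  field += c;
            break;
        case State::Quoted:
            if (c == '"')  state = State::QuoteInQuoted;
            else           field += c;  // commas and line breaks are literal here
            break;
        case State::QuoteInQuoted:
            if (c == '"')        { field += '"'; state = State::Quoted; }  // doubled quote
            else if (c == ',')   { endField(); state = State::FieldStart; }
            else if (c == '\n')  { endRecord(); state = State::FieldStart; }
            break;  // anything else after a closing quote is ignored here
        }
    }
    // Flush a final record that lacks a trailing line break.
    if (state != State::FieldStart || !record.empty())
        endRecord();
    return records;
}

A production parser chasing the numbers below would avoid the per-character std::string appends and instead track field boundaries within a large read buffer, but the states and transitions are the same.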
Results from test runs - first on my HP laptop (Intel Core 2 Duo T9300 CPU @ 2.5GHz with 4GB memory):
$ time Release/CSV-parser-1 -n 0 in/1g.txt
TIME Sun Dec 21 16:45:43 2008
Scanning: in/1g.txt
Done with: in/1g.txt
TIME Sun Dec 21 16:45:47 2008
Elapsed (ms): 4249, total (MB): 981
Scanned 230 MB/s
real 0m4.254s
user 0m3.500s
sys 0m0.656s
The above is for a ~1GB file, fully cached in memory (do repeated test runs until the times stabilize).
$ time Release/CSV-parser-1 -n 0 in/4g.txt
TIME Sun Dec 21 16:49:31 2008
Scanning: in/4g.txt
Done with: in/4g.txt
TIME Sun Dec 21 16:51:02 2008
Elapsed (ms): 91449, total (MB): 3944
Scanned 43 MB/s
real 1m31.455s
user 0m41.271s
sys 0m10.549s
The above is for a ~4GB file, not cached in memory. The result is very clear - an efficient CSV file parser can ingest data much faster than the data can be read off ordinary disks (230 MB/s parsed from cache versus 43 MB/s when limited by disk reads, a bit over five times faster). Even a fast RAID would be hard-pressed to deliver data faster than it could be parsed.
Of course, in "real" applications, any processing performed on the parsed CSV data will likely dominate the runtime. Application-specific processing could easily saturate more than one CPU. The problem partitions into most-efficient read-and-parse of CSV data from disk (which is what this example does), and distribution of application-specific processing across multiple CPUs (which this example can do in the same manner as feed-workers ... and which may or may not suit your application). A sketch of that arrangement appears at the end of this post.
Insert the usual caveats here. The example program has seen only basic testing. There were other applications (minimally) active. The C++ code was written for reuse, and has not been run through a profiler. You could tweak the code to get slightly better performance, but probably not any large improvements.
The same test run on a desktop (slower CPUs, faster disk):
preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/1g.txt
TIME Sun Dec 21 17:21:19 2008
Scanning: in/1g.txt
Done with: in/1g.txt
TIME Sun Dec 21 17:21:27 2008
Elapsed (ms): 7530, total (MB): 981
Scanned 130 MB/s
real 0m7.535s
user 0m6.136s
sys 0m1.388s
The above times are for a file cached in memory.
preston@brutus:~/workspace/CSV-parser-1$ time Release/CSV-parser-1 -n 0 in/4g.txt
TIME Sun Dec 21 17:23:31 2008
Scanning: in/4g.txt
Done with: in/4g.txt
TIME Sun Dec 21 17:25:00 2008
Elapsed (ms): 89020, total (MB): 3944
Scanned 44 MB/s
real 1m29.084s
user 0m26.994s
sys 0m7.004s
The above times are for a file not cached in memory. The results are entirely consistent with the first set of runs.
Clearly, an efficient CSV file parser can process data faster than a single disk can deliver it. The SSDs (solid-state disks) currently on the market seem to manage sustained read rates in the range of 40-100 MB/s, so a single-process parser should still be able to fully saturate the disk.
If you are doing large-scale processing of CSV data, your most efficient approach is likely a single (efficient!) reader-parser thread feeding roughly as many application-specific processing threads (or processes) as you have CPUs.
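To make that shape concrete, here is a minimal sketch of the arrangement, assuming C++11 threads are available; RecordQueue, kMax, and the commented-out process() call are illustrative stand-ins, not code from CSV-parser-1 or feed-workers:

// One reader-parser thread feeds a bounded queue of parsed records
// to a pool of worker threads - roughly one worker per CPU.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

typedef std::vector<std::string> Record;

class RecordQueue {
public:
    void push(Record r) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return queue_.size() < kMax; });
        queue_.push(std::move(r));
        notEmpty_.notify_one();
    }
    // Returns false once the queue is drained and closed.
    bool pop(Record& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty())
            return false;
        out = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }
private:
    static const std::size_t kMax = 10000;  // bound memory use
    std::queue<Record> queue_;
    std::mutex mutex_;
    std::condition_variable notEmpty_, notFull_;
    bool closed_ = false;
};

int main() {
    RecordQueue queue;

    // Single reader-parser thread: all file reads and CSV parsing happen here.
    std::thread reader([&queue] {
        // for each record parsed from the input file:
        //     queue.push(std::move(record));
        queue.close();  // tell the workers the input is exhausted
    });

    // Application-specific processing: roughly one worker per CPU.
    unsigned count = std::thread::hardware_concurrency();
    if (count == 0) count = 4;  // fallback when the CPU count is unknown
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < count; ++i) {
        workers.emplace_back([&queue] {
            Record record;
            while (queue.pop(record)) {
                // process(record);  // application-specific work goes here
            }
        });
    }

    reader.join();
    for (auto& worker : workers)
        worker.join();
    return 0;
}

In practice you would pass batches of records (a few thousand at a time) through the queue rather than single records, so that locking overhead does not eat the parser's throughput advantage.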