Preston L. Bannister { random memes }

2008.06.22

Why FileInputStream is slow.

Filed under: Software — Preston @ 7:58 pm

On my first crack at a Java-based Wide Finder implementation, got a surprisingly slow result for using java.io.FileInputStream to read a file. Dug into the sources to track down the cause.

(Is there a better public-reference to the OpenJDK code than here?)

First, the FileInputStream Java class read() call turns into a native call to readBytes().

public int read(byte b[]) throws IOException {
    return readBytes(b, 0, b.length);
}
private native int readBytes(byte b[], int off, int len) throws IOException;

Next, the FileInputStream native code bounces from the JNI function to a local readBytes function in io_util.c.

JNIEXPORT jint JNICALL
Java_java_io_FileInputStream_readBytes(JNIEnv *env, jobject this,
        jbyteArray bytes, jint off, jint len) {
    return readBytes(env, this, bytes, off, len, fis_fd);
}
/* The maximum size of a stack-allocated buffer.
 */
#define BUF_SIZE 8192

int
readBytes(JNIEnv *env, jobject this, jbyteArray bytes,
          jint off, jint len, jfieldID fid)
{
    int nread, datalen;
    char stackBuf[BUF_SIZE];
    char *buf = 0;
    FD fd;

    if (IS_NULL(bytes)) {
        JNU_ThrowNullPointerException(env, 0);
        return -1;
    }
    datalen = (*env)->GetArrayLength(env, bytes);

    if ((off < 0) || (off > datalen) ||
        (len < 0) || ((off + len) > datalen) || ((off + len) < 0)) {
        JNU_ThrowByName(env, "java/lang/IndexOutOfBoundsException", 0);
        return -1;
    }

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == 0) {
            JNU_ThrowOutOfMemoryError(env, 0);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

    fd = GET_FD(this, fid);
    if (fd == -1) {
        JNU_ThrowIOException (env, "Stream Closed");
        return  -1;
    }

    nread = IO_Read(fd, buf, len);
    if (nread > 0) {
        (*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);
    } else if (nread == JVM_IO_ERR) {
        JNU_ThrowIOExceptionWithLastError(env, "Read error");
    } else if (nread == JVM_IO_INTR) { /* EOF */
        JNU_ThrowByName(env, "java/io/InterruptedIOException", 0);
    } else { /* EOF */
        nread = -1;
    }

    if (buf != stackBuf) {
        free(buf);
    }
    return nread;
}

A couple bits in the above code of which I am not fond:

  1. The I/O operation does not read directly into the final buffer.
  2. A temporary buffer is allocated off the C runtime heap for reads bigger than 8192 bytes.

The extra buffer copy is wasteful, but not a huge deal (given all the call overhead we paid to get to this point). The dynamic allocation of the temporary buffer is much more of a big deal. Heap allocations are not free, and allocating a temporary buffer on every call to java.io.FileInputStream.read(buffer) has got to cost something. Note that there is also a risk of heap fragmentation (rare, but something I have seen).

Ideally we would like to do the full I/O operation in one call directly into the final buffer. My guess is that the extra copy is somehow required (once?) by Java garbage collection. Wonder if this is in fact required, or just an artifact of some earlier version?

Cannot seem to find the sources for SetByteArrayRegion. Must be initialized somewhere

At the very least, we want to drop the heap allocation, and get on the “knee” of to I/O performance curve:

    /* Use a buffer big enough to get close to the "knee" of the I/O curve. */
    char buffer[32 * 1024];
    int iRead = off;
    int nWant = len;
    int nRead;
    for (;;) {
        nRead = IO_Read(fd, buffer, (nWant < sizeof(buffer)) ? nWant : sizeof(buffer));
        if (nRead <= 0) {
            break;
        }
        (*env)->SetByteArrayRegion(env, bytes, iRead, nRead, (jbyte *)buffer);
        iRead += nRead;
        nWant -= nRead;
    }

As a later exercise, will try building the JDK with the above change, comparing performance to the stock JDK. Of course, what I would really like to do is read directly into the Java byte array. Need to find the implementation of SetByteArrayRegion first, to see what sort of magic might be involved.

The baseline number measured on my local Linux x86 box and stock Java 1.6.0_06.

Wide Finder in Java

FILE /home/preston/workspace/wide-finder-j/logs/_400MB

Time for READ - to get file cache consistent
File: /home/preston/workspace/wide-finder-j/logs/_400MB
Size: 444945620
Read: 444945620
Time: 6833
Rate: 62 MB/s

Time for READ - for measure
File: /home/preston/workspace/wide-finder-j/logs/_400MB
Size: 444945620
Read: 444945620
Time: 6767
Rate: 62 MB/s

FILE /home/preston/workspace/wide-finder-j/logs/_3.5G

Time for READ - to get file cache consistent
File: /home/preston/workspace/wide-finder-j/logs/_3.5G
Size: 3559564960
Read: 3559564960
Time: 74856
Rate: 45 MB/s

Time for READ - for measure
File: /home/preston/workspace/wide-finder-j/logs/_3.5G
Size: 3559564960
Read: 3559564960
Time: 72839
Rate: 46 MB/s

Looks like the Java test program on this box can manage 62M/s when reading from cache (the first file is small enough to cache in memory), and 45Mb/s reading from disk.

For comparison the (roughly) equivalent C++ code clocks in at 332MB/s when reading from cache, and 51MB/s reading from disk (which we can assume is as fast as data can be read from this machine’s SATA disk).

Clearly we need more headroom in the Java I/O performance, to allow for faster disks and slower CPUs.

2008.06.18

Wide Finder 2 – over to the dark side

Filed under: Software — Preston @ 8:29 pm

My first solution for Tim Bray’s Wide Finder 2 problem was aimed at ease of re-use before performance. Simple Perl scripts (run in a large number of worker processes) did the bulk of the processing.

Got my times, posted the results, and was going to leave it at that.

The damn problem got stuck in my head. Clearly a multi-threaded shared-memory solution would be much more efficient. How would a more specialized solution perform? Found I was thinking about the specialized solution, how to minimize synchronization issues, and was curious how it would perform.

My preference would be to use Java for a multi-threaded shared-memory solution. I have not had cause to use Java for anything similar, so did not know exactly what to expect. Turned out I was in for some surprises.

The first surprise was file I/O. Open a FileInputStream, loop on read() passing a 1MB buffer each call … and I only get ~32MB/s (from cache) on the Sun test box. Not good. The native code feed-workers reader manages ~124MB/s (from cache) and ~85MB/s (from disk) while piping through cat. Java should manage about the same, but does not – am I missing something?

The second surprise was byte to character transformation. Seems you have to create a lot of garbage to do a simple UTF-8 to Unicode conversion. The Java garbage collector may be pretty good, but better not to create huge masses of garbage in the first place. Maybe I missed an API somewhere? Wrote a simple buffer-to-buffer byte to character converter, to stay away from large GC’s. This sort of byte-bashing would be far better in native code.

The third surprise was java.util.regex performance (or lack). Add in regex matching, and suddenly we are down to ~1MB/s processing. Not just a problem with performance – on one long input line regex processing overflowed the Java stack! Ouch.

Not really a surprise – the standard Java general purpose hash tables are not well suited for the required usage: adds and update, but no deletes. Wrote a specialized hash table for which synchronization is only needed on leaf nodes when doing an add (very little chance of one thread blocking another). Still need to properly benchmark this bit.

The poor performance of Java regular expression processing was a disappointment. (I think we have Henry Spencer to blame … indirectly.) Found the Java Regular expression library benchmarks interesting. Sorting out the java.util.regex or better entries:

Regular expression library Time taken for 10,000 iterations
java.util.regex.Pattern 1.4 609ms
kmy.regex.util.Regex 0.1.2 468ms
com.karneim.util.collection.regex.Pattern1.1.1 297ms
dk.brics.automaton.RegExp1.7-1 172ms

The results are a few years old, so timings may have changed. Also the log parsing regex may benchmark differently. Any sort of definite statement would require a benchmark.

The two fastest listed regular expression libraries: dk.brics.automaton and Jrexx (com.karneim.util.collection.regex), do matching only, no group extractions – and so are not useful for my purpose. For the all the faster entries, documentation is sparse.

There is recent traffic among the JRuby folk about regex performance, the upshot of which is the recent integration of Joni into JRuby. Without a proper benchmark, there is no way of knowing how Joni compares, and the code is pretty much undocumented.

Regex processing can be spread over many concurrent threads, and is thus less of a bottleneck. The worry is the poor single-thread file read performance through FileInputStream. Hunted around the Internet looking to see if anyone got better results. Pretty dismaying how much garbage information is out there. Lots of programmers who think single-char read() calls are a good idea (or small-buffer reads). Did not find any substantially better results.

I expect the overhead for calling from Java to be a higher, but not enough to matter when reading a megabyte in a single call. Somewhat disappointed that no one has worked through the simple exercise of allowing Java to process bulk data efficiently. Tempted to dig into the Java sources – there must be something funky in the FileInputStream.readBytes() native method.

From the test run with 32 thread threads scanning the log and updating the hash table:

Time for - grok (32 workers)
File: /export/home/bannister/wfj/logs/O.10m
Size: 2029140422
Read: 2029140422
Time: 57216
Rate: 33 MB/s

2008.06.13

Wrapping up Wide Finder 2

Filed under: Software — Preston @ 10:43 am

The main point of the Tim Bray’s Wide Finder 2 exercise was to solve a common problem (processing web server logs files) in a way that took full advantage of a large number of CPUs. As Sun is shipping machines with many cores per chip, and as Tim works for Sun – he has good reason to use his soapbox to take a public poke at the problem.

My approach to solving the Wide Finder problem is meant to be re-useable. As such, I am not interested in exotic languages, or solutions minutely specialized to this exact problem. I do not expect to get the highest possible performance, but I do expect to get reasonably close – close enough that the difference in runtime is not significant compared to the programmer time saved through easy re-use and adaptation.

Large log file processing is something I have done quite a lot of in the past. Each time I tended to be looking for slightly different information and correlations. In that sort of usage you want simple, clearly written scripts that can be easily adapted to the present need. My solution to Tim’s problem is written for just that sort of usage.

Getting back to the original notion – just how does the benchmark compare when run on the “wide” Sun box to a generic x86 box? From the prior test got about 22MB/s running on the Sun box. My local x86 box (Athlon64 x2 4800) gets about 15MB/s. This is a bit unfair to x86 as the Sun box is current generation, and my x86 box is a few years old. I would guess that a current generation quad processor x86 box would more than double the processing rate, and out-perform the Sun box by a fair margin.

In fact, this is not a bad showing for Sun. This particular solution to the problem becomes much “fatter” (more CPU cycles used) when distributed across many CPUs. For usage that can be more cheaply distributed (like web serving) we could expect proportionately better performance from the Sun box. A more specialized solution – with multiple threads updating shared in-memory data – should perform better on this problem, but would likely also prove harder to re-use (and less interesting for my usage).

Upshot – for this sort of solution, the Sun box is not going to out-perform an equivalent generation x86 box.

Prior “Wide Finder 2″ articles:

2008.06.12

Wider finder – final result

Filed under: General — Preston @ 9:21 pm

[wrapup - added later]

From running my implementation of Tim Bray’s Wide Finder 2 on the Sun test box.

bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
    time scripts/combine.pl > _x30_reduce_combine
Fri Jun 13 04:03:32 2008
Scanning: logs/O.all
Done with: logs/O.all
Worker #21875 ended with status: 0
Worker #21874 ended with status: 0
Worker #21873 ended with status: 0
Worker #21872 ended with status: 0
Worker #21871 ended with status: 0
Worker #21870 ended with status: 0
Worker #21869 ended with status: 0
Worker #21868 ended with status: 0
Worker #21867 ended with status: 0
Worker #21866 ended with status: 0
Worker #21865 ended with status: 0
Worker #21864 ended with status: 0
Worker #21863 ended with status: 0
Worker #21862 ended with status: 0
Worker #21861 ended with status: 0
Worker #21860 ended with status: 0
Worker #21859 ended with status: 0
Worker #21858 ended with status: 0
Worker #21857 ended with status: 0
Worker #21856 ended with status: 0
Worker #21855 ended with status: 0
Worker #21854 ended with status: 0
Worker #21853 ended with status: 0
Worker #21852 ended with status: 0
Worker #21851 ended with status: 0
Worker #21850 ended with status: 0
Worker #21849 ended with status: 0
Worker #21848 ended with status: 0
Worker #21847 ended with status: 0
Worker #21846 ended with status: 0
Fri Jun 13 04:36:03 2008
Elapsed (ms): 1950971, total (MB): 43178
Scanned 22 MB/s

real    34:06.9
user    10:32.7
sys        11.1

real    34m6.930s
user    609m12.577s
sys     12m33.298s

Total elapsed time was a bit over 34 minutes to process the full 45GB log file.

The emphasis here is on a more general-purpose and re-useable solution, rather than something over-specialized to this one example problem, as described in the first round. Implementation in brief:

  1. The feed-workers process reads data from disk (at the fastest possible rate),
    and writes the data to a specified number of child processes.
  2. The reduce process parses the log file and computes subtotals for the fields of interest.
    This is the heaviest processing, and is best suited for distributing across many CPUs.
  3. The combine process reads the reduce subtotals
    and computes the final totals for each value of interest.

If you wanted to adapt this “wide-finder” to another purpose, you need only look at reduce and combine. Approximate line counts for each component:

lines component language
349 feed-workers C++
24 reduce Perl
36 combine Perl

Final result (that might even be correct):

Top 10 URIs by total response bytes
        919814823566: /ongoing/ongoing.atom
        393012328499: /ongoing/potd.png
        297110748615: /ongoing/ongoing.rss
        95967470509: /ongoing/rsslogo.jpg
        70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        46373582976: /talks/php.de.pdf
        43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
        42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
        38415215289: /ongoing/
        35603054785: /ongoing/moss60.jpg

Top 10 URIs returning 404 (Not Found)
        54271: /ongoing/ongoing.atom.xml
        28030: /ongoing/ongoing.pie
        27365: /ongoing/favicon.ico
        26084: /ongoing/Browser-Market-Share.png
        24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
        24078: /ongoing/Browsers-via-search.png
        24004: /ongoing/Search-Engines.png
        22637: /ongoing/ongoing.atom'
        22619: //ongoing/ongoing.atom'
        20587: /ongoing/Feeds.png

Top 10 URIs by hits on articles
        614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
        561720: /ongoing/When/200x/2003/07/17/noIE.gif
        321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
        252828: /ongoing/When/200x/2004/02/18/Bump.png
        242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
        241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
        219569: /ongoing/When/200x/2003/09/18/NXML
        204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        168652: /ongoing/When/200x/2003/03/16/XML-Prog
        137457: /ongoing/When/200x/2006/03/30/IMG_4613.png

Top 10 client IPs by hits on articles
        366634: msnbot.msn.com
        192147: cmbg-cache-2.server.ntli.net
        161867: crawler14.googlebot.com
        145264: crawl-66-249-72-173.googlebot.com
        132805: crawl-66-249-72-172.googlebot.com
        131051: cmbg-cache-1.server.ntli.net
        100298: crawl-66-249-72-72.googlebot.com
        95580: wfp2.almaden.ibm.com
        90831: sv-crawlfw3.looksmart.com
        84546: crawler10.googlebot.com

Top 10 referrers by hits on articles
        993394: http://www.google.com/reader/view/
        243013: http://planet.xmlhack.com/
        195861: http://tbray.org/ongoing/
        194726: http://planetsun.org/
        181280: http://planetjava.org/
        158613: http://slashdot.org/
        117228: http://www.chat.kg/
        112469: http://planet.intertwingly.net/
        89177: http://www.planetjava.org/
        55593: http://www.bloglines.com/myblogs_display?all=1

Wide finder – combine for top N

Filed under: Software — Preston @ 8:37 pm

[wrapup - added later]

As the last exercise showed, doing the full combine of all the reduced data was just too fat to get acceptable performance. This iteration is to re-write combine.pl to limit the amount of data retained to not much more than the final “top N” result.

The notion here is to only keep a sample of the largest values found, adding to the sample only values above a minimum. When the pool of samples grows too large, sort the sample values, and trim to just N (the final number of samples), and choose the smallest as the new minimum value. This keeps sort operations small, and few. The vast majority of values (well over 99%) are rejected with a simple comparison, so this pass is essentially O(N) and very cheap per value.

sub sampleTop {
	my ( $n, $kv ) = @_;
	my $iLast  = $n - 1;
	my $nLimit = 100 + $n;
	my $min    = 0;
	my @sample = ();
	while ( my ( $k, $v ) = each(%$kv) ) {
		next if $v < $min;
		push( @sample, [ $v, $k ] );
		next if ( scalar @sample ) < $nLimit;
		@sample = sort { $b->[0] <=> $a->[0] } @sample;
		$#sample = $iLast;
		$min = $sample[$#sample]->[0];
	}
	@sample = sort { $b->[0] < => $a->[0] } @sample;
	$#sample = $iLast;
	return @sample;
}

Unfortunately, this step is necessarily a single process.

The single-process combine result and time for the full (reduced) data on the Sun test box:

bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only

Top 10 URIs by total response bytes
        919814823566: /ongoing/ongoing.atom
        393012328499: /ongoing/potd.png
        297110748615: /ongoing/ongoing.rss
        95967470509: /ongoing/rsslogo.jpg
        70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        46373582976: /talks/php.de.pdf
        43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
        42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
        38415215289: /ongoing/
        35603054785: /ongoing/moss60.jpg

Top 10 URIs returning 404 (Not Found)
        54271: /ongoing/ongoing.atom.xml
        28030: /ongoing/ongoing.pie
        27365: /ongoing/favicon.ico
        26084: /ongoing/Browser-Market-Share.png
        24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
        24078: /ongoing/Browsers-via-search.png
        24004: /ongoing/Search-Engines.png
        22637: /ongoing/ongoing.atom'
        22619: //ongoing/ongoing.atom'
        20587: /ongoing/Feeds.png

Top 10 URIs by hits on articles
        614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
        561720: /ongoing/When/200x/2003/07/17/noIE.gif
        321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
        252828: /ongoing/When/200x/2004/02/18/Bump.png
        242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
        241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
        219569: /ongoing/When/200x/2003/09/18/NXML
        204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        168652: /ongoing/When/200x/2003/03/16/XML-Prog
        137457: /ongoing/When/200x/2006/03/30/IMG_4613.png

Top 10 client IPs by hits on articles
        366634: msnbot.msn.com
        192147: cmbg-cache-2.server.ntli.net
        161867: crawler14.googlebot.com
        145264: crawl-66-249-72-173.googlebot.com
        132805: crawl-66-249-72-172.googlebot.com
        131051: cmbg-cache-1.server.ntli.net
        100298: crawl-66-249-72-72.googlebot.com
        95580: wfp2.almaden.ibm.com
        90831: sv-crawlfw3.looksmart.com
        84546: crawler10.googlebot.com

Top 10 referrers by hits on articles
        993394: http://www.google.com/reader/view/
        243013: http://planet.xmlhack.com/
        195861: http://tbray.org/ongoing/
        194726: http://planetsun.org/
        181280: http://planetjava.org/
        158613: http://slashdot.org/
        129256:
        117228: http://www.chat.kg/
        112469: http://planet.intertwingly.net/
        89177: http://www.planetjava.org/

real    10m50.437s
user    10m32.404s
sys     0m17.107s

The above time may be slightly pessimistic, as there was some fairly heavy processing (a gcc-build apparently) going on at the same time. With so many CPUs there was no competition for CPU. There might have been some competition for disk usage.

This same test took 3 minutes (elapsed) on my local x86 box – about a 3x difference. Since the earlier test showed the reduce processing will take ~28 minutes, the ~11 minutes taken by combine is reasonably similar.

Good enough. On to the full test run… :)

2008.06.11

property.import task for Ant

Filed under: General, Javascript — Preston @ 8:05 am

I have Ant build scripts that can be run on either Windows or Linux. There are a few properties that need to be set differently, depending on the platform is use. Been using this little snippet for a while now.

import.js

Array.prototype.map = function(f) {
    var a = [];
    for (var i=0; i<this.length; ++i) {
        a.push(f(this[i],i));
    }
    return a;
};
Array.as = function(v) {
    var a = [];
    for (var i=0; i<v.length; ++i) {
        a.push(v[i]);
    }
    return a;
}

var PROPERTY = (function(){
    var t = {};
    var _get = function(k) {
        return project.getProperty(k);
    };
    t.get = _get;
    t.verbose = function(n) {
        n = 1 * n;
        if (0 < n) {
            t.get = function(k) {
                var v = project.getProperty(k);
                self.log("get " + k + " = " + v);
                return v;
            };
        } else {
            t.get = _get;
        }
    };
    t.low = function(s) { var v = t.get(s); return v && v.toLowerCase(); };
    var os_name = t.low("os.name");
    var user_name = t.low("user.name");
    if (os_name.match(/^windows/)) {
        os_name = "win32";
    }
    t.byUser = function(s) {
        return t.get(s + "." + user_name);
    };
    t.byOS = function(s) {
        return t.get(s + "." + os_name);
    };
    t.byPlatform = function(s) {
        return t.get(s + "." + os_name + "." + user_name);
    };
    t.pick = function() {
        return Array.as(arguments).map(function(k){
            var v = t.get(k) || t.byPlatform(k) || t.byUser(k) || t.byOS(k);
            project.setProperty(k,v);
            self.log("set " + k + " = " + v);
            return v;
        });
    };
    return t;
})();

PROPERTY.verbose(attributes.get("verbose"));
PROPERTY.pick(attributes.get("name"));

Task definition from Ant build file:

    <scriptdef
        name        = "property.import"
        language    = "javascript"
        src         = "import.js"
        >
        <attribute name="name" />
        <attribute name="verbose" />
    </scriptdef>

Example contents of build.properties:

nsis.home.win32=c:/Program Files/NSIS
launch4j.home.win32=c:/tools/launch4j
launch4j.home.linux=${user.home}/tools/launch4j

Usage is simple:

    <property.import
        name        = "launch4j.home"
        verbose     = "0"
    />

The property.import task looks for variants of the named property suffixed with user.name, os.name or both. For the above import of launch4j.home the property.import task when run by user “preston” on Windows gets the value from the first of:

launch4j.home.win32.preston
launch4j.home.win32
launch4j.home.preston

This allows checkin of build.properties while picking up appropriate per-user and per-platform property values when run.

2008.06.10

Wide finder – combine and sort

Filed under: Software — Preston @ 9:30 am

[wrapup - added later]

Took another crack at the combine process for my implementation of wide finder. My theme is to try and come up with the something easily re-used and re-purposed, rather than the fastest possible specialized solution. The first round used the generic Perl sort operation, with an inline (two key) sort function. For small sizes the built-in Perl sort works well. When fed the full 45GB example data and attempting to sort the counts of client IPs – between a large memory footprint and a too-slow sort – for data that large Perl’s sort is not the right solution.

Tried the old trick of transforming multiple keys into a single key for sorting. Back in the 1980’s the problem I occasionally faced was how to efficiently sort data extracted from large sets of log files. The Unix sort command was (and is) pretty efficient, but the data was often not in a form sort could handle. On Unix the answer was simple – use awk and/or sed to re-shape the data to be fed into sort. Seemed like a pretty obvious approach at the time. Seems very unlikely I was the first to use a transform to turn awkward and/or multi-key data into a single-key sortable form.

This was long before the first version of Perl was posted to USENET. Funny thing – someone wrote a paper on “efficient Perl sorting”. Funny because this “new” approach is rather familiar.

Tried and discarded a transformed-to-single-key variant for Perl sort – as it did not perform well.

The data to be sorted for problems like the wide finder exercise can be large enough to often-but-not-always fit in memory. Counting on an in-memory sort makes me uncomfortable (keeping to the “general purpose” meme). Perhaps better to use the Unix sort command. The Unix sort command does a smart disk-based sort-merge for large data, and is quite lean in it’s use of memory. Also, we can pick up a small amount of concurrent execution by piping from combine to sort.

The relevant code from combine (which may change later):

sub processSection {
    my ($c,$kv) = @_;
    local *OUT;
    open( OUT, "| sort -k 2nr,2 -k 3 | head -n 10" )
        or die "Cannot sort results!\n";
    while ( my ( $k, $v ) = each(%{$kv}) ) {
        print OUT $c . " " . $v . " " . $k . "\n";
    }
    close OUT;
}

while ( my ( $c, $kv ) = each %v ) {
    processSection($c,$kv);
}

Got acceptable (though not inspiring) times on an x86 box, so ran the already-reduced data through combine on the Sun box.

bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only
H 614255 /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
H 561720 /ongoing/When/200x/2003/07/17/noIE.gif
H 321873 /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
H 252828 /ongoing/When/200x/2004/02/18/Bump.png
H 242520 /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
H 241340 /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
H 219569 /ongoing/When/200x/2003/09/18/NXML
H 204202 /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
H 168652 /ongoing/When/200x/2003/03/16/XML-Prog
H 137457 /ongoing/When/200x/2006/03/30/IMG_4613.png
N 54271 /ongoing/ongoing.atom.xml
N 28030 /ongoing/ongoing.pie
N 27365 /ongoing/favicon.ico
N 26084 /ongoing/Browser-Market-Share.png
N 24631 /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
N 24078 /ongoing/Browsers-via-search.png
N 24004 /ongoing/Search-Engines.png
N 22637 /ongoing/ongoing.atom'
N 22619 //ongoing/ongoing.atom'
N 20587 /ongoing/Feeds.png
R 993394 http://www.google.com/reader/view/
R 243013 http://planet.xmlhack.com/
R 195861 http://tbray.org/ongoing/
R 194726 http://planetsun.org/
R 181280 http://planetjava.org/
R 158613 http://slashdot.org/
R 129256
R 117228 http://www.chat.kg/
R 112469 http://planet.intertwingly.net/
R 89177 http://www.planetjava.org/
I 366634 msnbot.msn.com
I 192147 cmbg-cache-2.server.ntli.net
I 161867 crawler14.googlebot.com
I 145264 crawl-66-249-72-173.googlebot.com
I 132805 crawl-66-249-72-172.googlebot.com
I 131051 cmbg-cache-1.server.ntli.net
I 100298 crawl-66-249-72-72.googlebot.com
I 95580 wfp2.almaden.ibm.com
I 90831 sv-crawlfw3.looksmart.com
I 84546 crawler10.googlebot.com
B 919814823566 /ongoing/ongoing.atom
B 393012328499 /ongoing/potd.png
B 297110748615 /ongoing/ongoing.rss
B 95967470509 /ongoing/rsslogo.jpg
B 70619295535 /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
B 46373582976 /talks/php.de.pdf
B 43559176904 /ongoing/When/200x/2006/05/16/J1d0.mov
B 42428609673 /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
B 38415215289 /ongoing/
B 35603054785 /ongoing/moss60.jpg

real    79m59.721s
user    82m9.882s
sys     0m42.510s

Given the same run completed in under 7 minutes on my x86 box, the lack of concurrency and slower Sun CPU is not(!) working in our favor. Not sure why this came in so slow (I doubt the Sun chip is in general the 10x slower this seems to show), but clearly performance is too far in the wrong direction.

So much for general purpose – we need to specialize combine for better performance. :)

As an aside, I tried a using fork() to run the sort for each section of the report in parallel. Given that the client IP count data is much larger than the data for the other sections, for this problem the gain would not be great, but at least we would get a bit more concurrency. The Perl code to process each report section in a child process:

sub forkWorker {
    return unless 0 == fork;
    processSection(@_);
    exit;
}

while ( my ( $c, $kv ) = each %v ) {
    forkWorker($c,$kv);
}

Oddly enough this did not work on either Linux or Solaris. I was counting on copy-on-write fork() to keep the physical memory use of each Perl child process quite small. On my Linux box (with 3GB real memory), execution slowed to a crawl as the disk chattered away madly. Seems Linux does not do copy-on-write. I expected better from Solaris, but instead got:

bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only
Cannot sort results!
[snip]

Oh well … this wasn’t going to get the performance we wanted in any case …

Next iteration will have to specialize combine for the “top 10″ nature of this problem. Will have to limit the size of the data sorted to get acceptable (and reliable) performance.

Prior: Wide finder – first round

2008.06.08

Wide finder – first round

Filed under: Software — Preston @ 4:48 pm

[wrapup - added later]

Took my feed-workers tool from my example of solving the prior wide-finder exercise, and updated the scripts to match(?) the new strawman wide-finder benchmark. Updated the scripts over the weekend (yes, I’m sufficiently warped to find this sort of exercise amusing). Got onto the Sun 32-CPU box on Tuesday, and did some full scale runs on the 45GB data. Caught a couple minor bugs on my part, and got results mostly in line with my expectations, with one major exception.

Note that my aim here is not to get the fastest possible time, or to illustrate the use of some new exotic programming language or technique. Rather I am interested in using the simplest possible general purpose tool (feed-workers) in combination with commonly available Unix tools (Perl in this instance) to solve the problem in the least programmer time while taking advantage of a large number of available CPUs (as in the Sun Fire T2000 box). Note that the reduce process fully parses each line of the log – no shortcuts assuming log lines are space-delimited – which means the script can easily be re-used for other log-scanning tasks (and should be dead easy to read and modify).

Counting the (approximate) lines of feed-workers code:

$ cat *.cpp *.h | tr -cd ';}' | wc -c
349

Counting the lines of one-pass report Perl code:

$ tr -cd ';' < scripts/report.pl | wc -c
23

Counting the lines of reduce Perl code:

$ tr -cd ';' < scripts/reduce.pl | wc -c
23

Counting the lines of combine Perl code:

$ tr -cd ';' < scripts/combine.pl | wc -c
29

Once written, you never have to look at the sources for feed-workers again. When changing the log scanner you need only read reduce and combine, each of which are amounts only to a rather sparse page of code - thus easy to modify and easy to re-use.

In the first set of measurements I was interested in CPU usage, not I/O speed, so I made sure the input files were cached in memory before collecting times. I also did more than one run for each test, just to verify that times were consistent (as they were).

First run - single process Perl script processing a small (100K line) log file.

bannister@wfind01$ time scripts/report.pl logs/O.100k > _report

real    0m13.767s
user    0m13.656s
sys     0m0.108s

Next, split the Perl script into two scripts. The first reduces the data to a smaller form, and the second generates a report from the reduced form. We are using two CPUs here, but we get slightly slower performance – and this makes sense. The reduce script only generates output after completely processing the input, so the two processes are essentially sequential. The generation and consumption of the intermediate (reduced) form should cost us a little, and it does.

First we time the reduce script:

bannister@wfind01$ time scripts/reduce.pl logs/O.100k | scripts/combine.pl > _reduce_combine

real    0m14.584s
user    0m14.552s
sys     0m0.125s

Next we time the combine script:

bannister@wfind01$ scripts/reduce.pl logs/O.100k | time scripts/combine.pl > _reduce_combine

real       14.5
user        1.3
sys         0.0

As expected, the majority of the processing is in the reduce script, and the time needed for the combine script is small.

Note that splitting the original script into two scripts requires the generation and consumption of an intermediate form, but both scripts are still quite short. If you have spent a considerable amount of time analyzing large log files (as have I) then this is a common technique used, not for performance, but to break a complex transformation into more manageable chunks.

Put simply, if you do this sort of work, the approach used here should be familiar.

Next, use the same two scripts as the prior run, but invoke the first via feed-workers. There are three active processes now, but we expect and get pretty much the same times as the prior test.

bannister@wfind01$ time ./feed-workers -n 1 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x1_reduce_combine
Tue Jun  3 01:28:50 2008
Scanning: logs/O.100k

real    0m13.241s
user    0m12.547s
sys     0m0.402s

Up to this point we really have not used more than one CPU. The prior measurements were to give us a baseline, and to show the processing load was in line with expectations. Now we want to show the effect of spreading out the heavy reduce processing over a greater number of CPUs.

Times for reduce processing on 4 worker processes:

bannister@wfind01$ time ./feed-workers -n 4 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x4_reduce_combine
Tue Jun  3 01:30:16 2008
Scanning: logs/O.100k

real    0m5.046s
user    0m13.063s
sys     0m0.408s

Times for reduce processing on 10 worker processes:

bannister@wfind01$ time ./feed-workers -n 10 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x10_reduce_combine
Tue Jun  3 01:31:08 2008
Scanning: logs/O.100k

real    0m3.669s
user    0m14.378s
sys     0m0.535s

Times for reduce processing on 20 worker processes:

bannister@wfind01$ time ./feed-workers -n 20 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x20_reduce_combine
Tue Jun  3 01:31:46 2008
Scanning: logs/O.100k

real    0m3.322s
user    0m17.300s
sys     0m0.747s

Times for reduce processing on 30 worker processes:

bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x30_reduce_combine
Tue Jun  3 01:32:16 2008
Scanning: logs/O.100k

real    0m3.086s
user    0m20.378s
sys     0m0.997s

Times for reduce processing on 40 worker processes:

bannister@wfind01$ time ./feed-workers -n 40 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x40_reduce_combine
Tue Jun  3 01:32:47 2008
Scanning: logs/O.100k

real    0m3.090s
user    0m21.265s
sys     0m1.250s

Times for reduce processing on 50 worker processes:

bannister@wfind01$ time ./feed-workers -n 50 -r `which perl` -s scripts/reduce.pl logs/O.100k |
    scripts/combine.pl > _feed_x50_reduce_combine
Tue Jun  3 01:33:27 2008
Scanning: logs/O.100k

real    0m3.140s
user    0m21.528s
sys     0m1.459s

To be honest I was not exactly sure what sort of result we would get for the above on a Sun Fire T2000. Sun claims the single-chip UltraSPARC T1 processor contains 8 cores and supports 32 threads. That might mean 32 active threads in a multi-threaded application like Java, but what does it mean for a multi-process application? Guess we have an answer. :)

The above is pretty much exactly what you might expect. We get large improvements in elapsed time spreading the processing over 4 and 10 CPUs, and smaller improvements up to 30 CPUs. As the number of CPUs used increases, the amount of “wasted” processing also increases (as reflected in the “user” times). Elapsed times are slightly worse for 40 and 50 reduce processes – past the number of “threads” supported by the 32-way UltraSPARC T1.

Upshot: scaling the number of CPUs used improves elapsed time up to the number of physical CPUs, netting about a 3x overall speedup (for this problem) over the original single-process version. (Remember to add in the ~1.3 seconds needed for the combine processing, which is largely sequential to the reduce processing.)

Note also that for larger input data we might expect the overhead for many-processes to matter less. For the above 100K line input, we are very much into diminishing returns above 10-way replication on reduce.

The above is all about distributing the same problem over a varying number of processors. The next set of tests is about increasing the amount of data processed.

Bumping the amount of data processed from 100 thousand to 1 million lines, we get:

bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.1m |
    scripts/combine.pl > _x30_1m
Tue Jun  3 01:35:10 2008
Scanning: logs/O.1m

real    0m21.246s
user    3m5.751s
sys     0m4.180s

Elapsed time for reduce processing with 10x data takes only 7x as long. This makes sense as the overhead for the startup of a larger number of processes gets amortized over a longer run for each sub-process. Looks good thus far, so tried the full data set…

bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
    scripts/combine.pl > _reduce_all
Tue Jun  3 21:34:20 2008
Scanning: logs/O.all
... manually aborted ...

I had to manually abort this test run. The multi-process reduce processing was complete, but the combine processing was running … and running … and running ……. so apparently something in combine was not scaling well to large data sets. Something to address in the next iteration.

Measured just the reduce processing on the full data set:

$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all > _reduce_all
Tue Jun  3 21:34:20 2008
Scanning: logs/O.all
Tue Jun  3 22:02:00 2008
Elapsed (ms): 1660139, total (MB): 43178
Scanned 26 MB/s

real    27m40.180s
user    597m45.533s
sys     12m1.064s

Note that the input data is no longer cached in memory for the full data set, so the processing rate now includes the time needed for reading the data off physical disk. Clearly the reduce processing is nicely distributable across multiple processes (and CPUs).

Object of the next iteration will be to find where combine processing is expensive.

Next Page »