Why FileInputStream is slow.

2008-06-22

On my first crack at a Java-based Wide Finder implementation, got a surprisingly slow result for using java.io.FileInputStream to read a file. Dug into the sources to track down the cause.

(Is there a better public-reference to the OpenJDK code than here?)

First, the FileInputStream Java class read() call turns into a native call to readBytes().

public int read(byte b[]) throws IOException { return readBytes(b, 0, b.length); }

private native int readBytes(byte b[], int off, int len) throws IOException;

Next, the FileInputStream native code bounces from the JNI function to a local readBytes function in io_util.c.

JNIEXPORT jint JNICALL Java_java_io_FileInputStream_readBytes(JNIEnv *env, jobject this, jbyteArray bytes, jint off, jint len) { return readBytes(env, this, bytes, off, len, fis_fd); }

/* The maximum size of a stack-allocated buffer. */ #define BUF_SIZE 8192

int readBytes(JNIEnv *env, jobject this, jbyteArray bytes, jint off, jint len, jfieldID fid) { int nread, datalen; char stackBuf[BUF_SIZE]; char *buf = 0; FD fd;

if (IS_NULL(bytes)) { JNU_ThrowNullPointerException(env, 0); return -1; } datalen = (*env)->GetArrayLength(env, bytes);

if ((off < 0) || (off > datalen) || (len < 0) || ((off + len) > datalen) || ((off + len) < 0)) { JNU_ThrowByName(env, "java/lang/IndexOutOfBoundsException", 0); return -1; }

if (len == 0) { return 0; } else if (len > BUF_SIZE) { buf = malloc(len); if (buf == 0) { JNU_ThrowOutOfMemoryError(env, 0); return 0; } } else { buf = stackBuf; }

fd = GET_FD(this, fid); if (fd == -1) { JNU_ThrowIOException (env, "Stream Closed"); return -1; }

nread = IO_Read(fd, buf, len); if (nread > 0) { (*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte )buf); } else if (nread == JVM_IO_ERR) { JNU_ThrowIOExceptionWithLastError(env, "Read error"); } else if (nread == JVM_IO_INTR) { / EOF / JNU_ThrowByName(env, "java/io/InterruptedIOException", 0); } else { / EOF */ nread = -1; }

if (buf != stackBuf) { free(buf); } return nread; }

A couple bits in the above code of which I am not fond:

The I/O operation does not read directly into the final buffer.
A temporary buffer is allocated off the C runtime heap for reads bigger than 8192 bytes.

The extra buffer copy is wasteful, but not a huge deal (given all the call overhead we paid to get to this point). The dynamic allocation of the temporary buffer is much more of a big deal. Heap allocations are not free, and allocating a temporary buffer on every call to java.io.FileInputStream.read(buffer) has got to cost something. Note that there is also a risk of heap fragmentation (rare, but something I have seen).

Ideally we would like to do the full I/O operation in one call directly into the final buffer. My guess is that the extra copy is somehow required (once?) by Java garbage collection. Wonder if this is in fact required, or just an artifact of some earlier version?

Cannot seem to find the sources for SetByteArrayRegion. Must be initialized somewhere...

At the very least, we want to drop the heap allocation, and get on the "knee" of to I/O performance curve:

/* Use a buffer big enough to get close to the "knee" of the I/O curve. */ char buffer[32 * 1024]; int iRead = off; int nWant = len; int nRead; for (;;) { nRead = IO_Read(fd, buffer, (nWant < sizeof(buffer)) ? nWant : sizeof(buffer)); if (nRead <= 0) { break; } (*env)->SetByteArrayRegion(env, bytes, iRead, nRead, (jbyte *)buffer); iRead += nRead; nWant -= nRead; }

As a later exercise, will try building the JDK with the above change, comparing performance to the stock JDK. Of course, what I would really like to do is read directly into the Java byte array. Need to find the implementation of SetByteArrayRegion first, to see what sort of magic might be involved.

The baseline number measured on my local Linux x86 box and stock Java 1.6.0_06.

Wide Finder in Java

FILE /home/preston/workspace/wide-finder-j/logs/_400MB

Time for READ - to get file cache consistent File: /home/preston/workspace/wide-finder-j/logs/_400MB Size: 444945620 Read: 444945620 Time: 6833 Rate: 62 MB/s

Time for READ - for measure File: /home/preston/workspace/wide-finder-j/logs/_400MB Size: 444945620 Read: 444945620 Time: 6767 Rate: 62 MB/s

FILE /home/preston/workspace/wide-finder-j/logs/_3.5G

Time for READ - to get file cache consistent File: /home/preston/workspace/wide-finder-j/logs/_3.5G Size: 3559564960 Read: 3559564960 Time: 74856 Rate: 45 MB/s

Time for READ - for measure File: /home/preston/workspace/wide-finder-j/logs/_3.5G Size: 3559564960 Read: 3559564960 Time: 72839 Rate: 46 MB/s

Looks like the Java test program on this box can manage 62M/s when reading from cache (the first file is small enough to cache in memory), and 45Mb/s reading from disk.

For comparison the (roughly) equivalent C++ code clocks in at 332MB/s when reading from cache, and 51MB/s reading from disk (which we can assume is as fast as data can be read from this machine's SATA disk).

Clearly we need more headroom in the Java I/O performance, to allow for faster disks and slower CPUs.