Monthly Archives: November 2007

Wrapping up “Wide Finder”

For Tim Bray’s Wide Finder, what you need is something that will lift data off the disk at the maximum possible rate, and distribute the processing across as many CPUs as needed. The end-to-end result cannot be generated faster than the data can be read from disk. The first part of the problem is to get the data off disk as fast as possible.

My notion was to find the smallest/simplest bit of code that would allow any reasonable scripting language (Perl, AWK, Python, Ruby, whatever) to hit optimal elapsed times on a many-CPU box (my take on Tim’s notion). Ripped out most of the unneeded bits from the previous test program (wide-finder-0), wrote a more efficient buffer rather than line-oriented reader, added in getopt() (so the script run is no longer fixed), and created a simple feed-workers utility.

Usage: feed-workers [ options ] files...
Where options are:
  -n N        number of worker processes
  -r run      name of executable to run as a worker (default is /usr/bin/perl)
  -s script   name of script (passed as argument to worker)
The lines read from given list of files are distributed equally among the workers.

The feed-workers program is the one bit of help you need to efficiently read from the disk and distribute processing across CPUs running scripts written in your favorite higher-order language. The full “wide finder” looks like:

feed-workers -n 4 -s scripts/reduce.pl logs/*| scripts/combine.pl

The results of a test run on a dual-CPU box:

preston@brutus:~/feed-workers$ time ./feed-workers -n 4 -s scripts/reduce.pl logs/* | scripts/combine.pl > _counts
Thu Nov 29 17:22:09 2007
Scanning: logs/log-00
Scanning: logs/log-01
Scanning: logs/log-02
Scanning: logs/log-03
Scanning: logs/log-04
Scanning: logs/log-05
Scanning: logs/log-06
Scanning: logs/log-07
Scanning: logs/log-08
Scanning: logs/log-09
Scanning: logs/log-10
Scanning: logs/log-11
Scanning: logs/log-12
Scanning: logs/log-13
Scanning: logs/log-14
Scanning: logs/log-15
Scanning: logs/log-16
Scanning: logs/log-17
Scanning: logs/log-18
Scanning: logs/log-19
Worker #29298 ended with status: 0
Worker #29297 ended with status: 0
Worker #29296 ended with status: 0
Worker #29295 ended with status: 0
Thu Nov 29 17:23:36 2007
Elapsed (ms): 86565, total (MB): 4096
Scanned 47 MB/s

real    1m26.738s
user    0m48.311s
sys     0m14.649s

This is run on a vanilla AMD Athlon 64 X2 4200 box with ~3GB usable memory . Since the test files total 4GB, the files cannot be cached.

preston@brutus:~/feed-workers$ uname -a
Linux brutus 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64 GNU/Linux

The following measurements are to get some notion of overhead and performance limits.

Just running the same input through /bin/cat without feed-workers.

preston@brutus:~/feed-workers$ time cat logs/* > /dev/null

real    1m23.065s
user    0m0.096s
sys     0m3.960s

Running just the Perl scripts:

preston@brutus:~/feed-workers$ time scripts/reduce.pl logs/* | scripts/combine.pl > __counts

real    1m27.342s
user    0m42.895s
sys     0m4.288s

Feeding the same input to /bin/cat instead of a Perl script.

preston@brutus:~/feed-workers$ time ./feed-workers -n 4 -r /bin/cat logs/* > /dev/null
Thu Nov 29 17:26:22 2007
[snip]
Thu Nov 29 17:27:50 2007
Elapsed (ms): 88373, total (MB): 4096
Scanned 46 MB/s

real    1m28.698s
user    0m0.440s
sys     0m14.601s

Running the “wide finder” with only a single worker process.

preston@brutus:~/feed-workers$ time ./feed-workers -n 1 -s scripts/reduce.pl logs/* | scripts/combine.pl > _counts
Fri Nov 30 08:14:59 2007
[snip]
Fri Nov 30 08:16:22 2007
Elapsed (ms): 82939, total (MB): 4096
Scanned 49 MB/s

real    1m23.009s
user    0m43.723s
sys     0m14.681s

Looked at together we get a consistent picture.

real u+s command
83 4 cat
87 47 reduce.pl
89 15 feed-workers + cat (x 4)
87 63 feed-workers + reduce.pl (x 4)
83 58 feed-workers + reduce.pl (x 1)

All times rounded to whole seconds. Given the limited tests, variance up to 10% may not be significant.

Since the elapsed times for single-process ‘cat’ and ‘reduce.pl’ are nearly identical, we can conclude that we are disk-limited, and throwing more CPUs at the problem will not help. Given the user+system CPU time is about half the elapsed time, as a benchmark, this example “wide finder” problem is a little too lightweight. To really show off a “wide finder”, we want a processing task heavy enough that no one single CPU can process that data at disk rates. We probably want a task 3 or 4 times “heavier” in terms of CPU usage.

Looking at the ‘feed-workers + cat’ and ‘feed-workers + reduce.pl’ times, the elapsed times are about the same as the single process times. Looks like feed-workers disk reader is using about 13% of one CPU, so we could likely handle eight times the I/O rate before a single process feed-workers became a bottleneck (without any further optimization). This is a good thing, as you would need a very high-end disk subsystem before you would have to think about multiple readers. Also the reader would only become a bottleneck if the worker processes could keep up with the increased rate. Taken together we can expect a single reader to be optimal in almost all interesting cases.

Looks like starting a Perl (via fork/exec) costs us (very roughly) about 4/3 of second. A bit fat, but not too significant out of the entire time.

Interesting that the ‘feed-workers + reduce.pl (x 1)’ time slightly beats the ‘reduce.pl’ time, and matches the ‘cat’ time. This is exactly the result you hope for on a dual-CPU box.

There might be some room for optimization in the feed-workers code, but looks like we are already into diminishing returns for time invested. The overhead added feeding multiple workers is small, and can keep up with much higher than usual I/O rates. A solution based on feed-workers – given enough CPUs – can process data as fast as it can be read off disk (in the general case), and perform efficiently for file sizes much, much larger than memory.

Conclusions

  1. The example “wide finder” problem is too lightweight to show advantage on a multi-core CPU.
  2. A single instance of the generic feed-workers reader should prove optimal for almost all cases.
  3. The initial reader implementation is efficient enough, as we get a small performance improvement even on a dual-CPU box.
  4. For one sort of problem, processing line-oriented records where order does not matter, this a sufficient solution for distributing across many CPUs without specialized code.
Posted in Software | Leave a comment

What you need for a “Wide Finder”, and no more.

[Replaced initial section with an update for feed-reader results]

Guess I want to hit this part once more, before leaving it alone.

All the solutions that depend on “parallel I/O” can only yield fast results when the file is already in memory. The are some (few) scenarios where you can expect the file contents to be already in the disk cache, but this is not the general case.

Do I really have to explain how horribly the “wide finder” examples that count on “parallel” I/O will degrade? If the file contents is not in the disk cache, attempting to read in “parallel” will force a lot of disk seeks. For test files – where the files are likely adjacent and unfragmented – the degradation from many small seeks will be significant. For real log files – accumulated over a long period of time. the files are more likely non-adjacent and somewhat-fragmented – the degradation from many longer seeks could prove profoundly painful.

Not a nice thing to do to your customers.

Posted in Software | Leave a comment

Dumb software design – overloading Windows Startup

Visiting my father in Colorado and – as usual – resolving any problems his computers have developed since my last visit. For context, my father is an engineer, spent most of his career in aerospace, defense electronics, more recently chip design from modems (2400 baud through 56K) and more recently cell phone chip sets. Not a software guy, but definitely a techie.

I happened to be watching while he booted his “old” laptop (Winbook J4, 2Ghz Pentium, 512MB memory). The disk rattled. Icons appeared and flickered. I think it was ten minutes later, and startup was still incomplete. Took a minute(!) or so to get Windows Task Manager running.

Main problems:

  • Too much shit loaded at startup.
  • Not enough memory (though 512MB should be enough).
  • A badly fragmented hard disk.
  • Norton’s fat/slow/bloated “security” software. (verified)

First obvious problem was too many “quick start” helper applications loaded at startup helped make startup slow (ironic that). Adobe, QuickTime, Star Office, and AutoCAD were preloading junk into free memory (of which there is a shortage, not an excess). A bit absurd that none of these “helpers” seem to note the shortage of memory. Turned them all off.

Still too damn slow.

Next, went through the “startup” list. Lots of stuff starting every time that is not used. Makes the startup of the applications that are used very slow. Turned off everything not needed.

Memory use is a bit better. Still too damn slow.

Looked at the process list. What the heck are all these processes!? A good number are “update” processes – HP, RealPlayer, Norton (and more) all have processes launched for the sole purpose of checking for product updates. Dumb. Windows has support for “Scheduled Tasks”. Processes taking up CPU and memory for a task that Windows could do with far less overhead – is not brilliant.

Started up the disk defragmenter. We spent the afternoon in Telluride (as we are on vacation), got the kids season ski passes, wandered around, ate lunch, and drove back. Found the defragmenter still running!

Yes, the disk was badly fragmented … but still this seems excessive. Too little free memory, 17% free disk, and … possibly … Norton’s “security” software hogging far too many resources?

Will get to the bottom of this, eventually.

Update: I have a long history of distrust of anti-virus software, as I have seen bad effects on performance, and bad effects on system reliability. Looking at all the processes used by Norton, and knowing that some of the antivirus software overhead may be incurred in other processes, I strongly suspected Norton could be part of the performance problem. Looks like someone else asked the same question, and got an answer.

Norton is history. Installed AVG Anti-virus, which did indeed make a notable improvement, both in Windows startup and in later application startup.

Posted in Personal, Software | Leave a comment

Sitemap Protocol

This is a big deal. Digging information out of government sites, I have found, is often obscure and difficult. A framework for making information easily accessible can be both of general benefit, and help oversight on government activities.

Sitemap Protocol | Sunlight Foundation
Google has been working with federal agencies to help them ensure that their data are accessible through search engines.

The solution to this problem is a non-proprietary standard championed by Google, called the sitemap protocol. Implementing this standard helps automated web-crawlers (the stuff of search engines) find their way around your entire site.

From the bill:

`(i) GUIDELINES- Not later than 1 year after the date of enactment of the E-Government Reauthorization Act of 2007, the Director shall promulgate guidance and best practices to ensure that publicly available online Federal Government information and services are made more accessible to external search capabilities, including commercial and governmental search capabilities. The guidance and best practices shall include guidelines for each agency to test the accessibility of the websites of that agency to external search capabilities.

Posted in Politics, Web | Leave a comment

Dumb GUI example – menubars

Back in the 1980′s, the menubar was brilliant. Applications loaded off floppy disk had to be small. Small applications could not have too many functions. Functions could be completely enumerated in the menubar. Browsing through the menubar pull-down menus could pretty much tell you everything the program could do. On-line help was limited or non-existent – as there just was not space – so application functions needed to be fairly obvious. Generic support for menubars could built into the base system, saving precious RAM for the application, and saving programmer time. All around a simple and clean solution to a common problem.

Menubars made so much sense that they became a habit. Whole new generations of programmers picked up the habit, without quite knowing the original basis. Everyone “knows” that real GUIs must have menubars….

In 20-odd years applications have grown huge. The number of easily-supported functions in applications has exploded. Crowded menubars, long pull-downs, cascaded menus (another really bad idea), menu choices that in turn lead to paged/tabbed/indexed dialogs … all are symptoms of a growth problem.

Which leads to today’s annoyance. Was working with NetBeans and for some reason javadoc and sources were not hooked up for the Swing libraries. Right … must be an IDE option somewhere, so pulled up:

(from NetBeans)

Right. Yep, I am sure it’s in there somewhere.

We have all been down this path before. You know the option/setting/command is there, and you are going to SEARCH until you find it. Maybe you remember roughly where to look. How many times have you opened a dialog, then realised oh, it’s not here…. Maybe you fire up a web browser, and SEARCH for “how to do…”. Then once you find the command/option/setting, it may not obvious how it works. Maybe you hit the F1 key hopefully … I say hopefully because even $$$ billions-per-year Microsoft software has no or useless on-line help (at least in the places where you really hoped F1 would work).

Um, hello? Anyone paying attention?

The old habit no longer makes sense. Time (way past time) to flip the design completely around. Applications do a lot. Lots of users are going to use only a small subset of an application. Push the stuff they do not need to see into the background. Pull forward the bits they need to see. Make search a part of the user interface design. Users are going to hunt-and-peck at the UI, doing their own ad-hoc search – whether you help them or not. Help them.

The really sad part is when web developers feel compelled to incorporate pull-down menus into web pages (so it looks like a “real” application). Talk about going in entirely the wrong direction…..

Posted in Software | Leave a comment

Netbeans love-hate

Have to love Matisse – the GUI composer in Netbeans – and there are many other interesting features, but I find too many sequences like the following when using Netbeans:

click, drag or type … !@#$%??
(go back) click … ?!? … (hold mouse still over object) press mouse button down (wait) release button … (now it worked, continue)

So my desktop box is “only” an Athlon XP 3400 (no X2, no x64), and there is “only” 2GB memory … but still. Mouse tracking, hit detection, event queuing – should not be that difficult. Having in a past life written GUI code on the original sub-5Mhz PC, there is just no way I can excuse Netbeans for not keeping up on a box that is thousands of times faster.

Love Matisse … but starting to hate Netbeans.

Posted in Software | Leave a comment

I Want My iTV

I Want My iTV
“We know that Apple has destroyed the music business, in terms of pricing, and if we don’t take control they’ll do the same thing on the video side,” NBC Universal (GE ) chief Jeff Zucker told an audience at Syracuse University’s S.I. Newhouse School of Public Communications on Oct. 29.

Nonsense. Apple succeeded in creating a market where others had failed. At 99¢ a song, the “music business” is making as much or more per song than they make with CDs. Downloads have none of the costs associated with manufacturing physical CDs.

The “music business” failed to create an online market before iTunes, and failed to create viable alternatives after iTunes succeeded. Jobs created a viable market by offering a good deal to both consumers and producers. This is capitalism at it’s best. Apple earned their money because Jobs is smart, and did the right things.

Zucker’s comments could be more accurately re-cast as:
“We know that Apple created a viable online music business, and we don’t want him to do the same thing on the video side.”
After all, if this keeps up outfits like NBC might wonder if folk like Zucker are worth their salary.

This entire sideshow is so dreadfully predictable. The notion of on-demand video is not new – but for a long time the costs were too high. With cheap/fast hardware, common availability of internet connections, and dropping prices for bandwidth – “iTV” is possible now. The old models for video – created by then-current technology – are breaking down in the face of new technology. The existing TV/cable industry “channels” are about to become a lot less relevant. Eventually the “industry” will adjust to reality. Lots of folks getting fat checks off the status-quo will find their money has gone elsewhere. Outfits that figure out how to fit the new reality will do well.

The landmark will be the first successful video produced, marketed, and distributed entirely outside the existing TV/cable industry “channels”. Distribution is currently possible but still awkward. It will take a little while for distribution – enhanced Tivo, or software like Miro – to become easy. My guess is we are looking at 5-10 years out. There is an outside chance (way outside) that the “industry” will get a clue, and with Jobs help could make things arrive earlier.

Posted in Web | Leave a comment

The Wide Finder Project

[wrap up - added later]

I must be getting cranky in my old age. The discussion around Tim Bray’s “Wide Finder” bugs me.

ongoing · The Wide Finder Project
In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It’s a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.

First, the example problem chosen by Tim Bray is crap, but the underlying question is good. The example can be made to run at near-full I/O speed, and this fact is distracting too many folks. The underlying question – how to spread a common problem over lots of CPUs – is a worthy topic.

Second, given the above context, optimizing the regex use is irrelevant (though fun for folks with a twisted sense of fun – of which I am one).

Third, there are an amazing number of folk who still seem to be clueless when it comes to optimizing disk I/O. Memory-mapped I/O is not magic. Operating Systems optimize sequential I/O – have seen writings from both Sun and Microsoft (and older), but have no references to offer. Very likely stdio is faster than C++ streams (some history there). Seeks are not free – so reading off more than one place on disk is usually bad for performance.

In sum we can expect a single process/thread using stdio/fopen/fgets could suck up all (or near all) the I/O throughput and deliver near-optimal results. More than one thread doing disk reads is non-optimal for most machines (like 90%+). Any approach that starts with more than one thread doing disk reads is probably a Really Bad Idea. (Think about the physics of a disk drive – it is not complicated.)

Still, a guess is not as good as a measurement. As a first cut, wrote a simple program to measure reading a large file using stdio (no other processing), and to measure the effect of applying a regex as in Tim’s example.

preston@mite:~/workspace/wide-finder-0$ time Debug/wide-finder-0 log-*
Scanning: log-00
Scanning: log-01
Scanning: log-02
Scanning: log-03
Scanning: log-04
Scanning: log-05
Scanning: log-06
Scanning: log-07
Scanning: log-08
Scanning: log-09
Scanning: log-10
Scanning: log-11
Scanning: log-12
Scanning: log-13
Scanning: log-14
Scanning: log-15
Scanning: log-16
Scanning: log-17
Scanning: log-18
Scanning: log-19
Elapsed (ms): 26120 total (MB): 4077 lines: 20000000
Scanned ?? MB/s
Scanning: log-00
Scanning: log-01
Scanning: log-02
Scanning: log-03
Scanning: log-04
Scanning: log-05
Scanning: log-06
Scanning: log-07
Scanning: log-08
Scanning: log-09
Scanning: log-10
Scanning: log-11
Scanning: log-12
Scanning: log-13
Scanning: log-14
Scanning: log-15
Scanning: log-16
Scanning: log-17
Scanning: log-18
Scanning: log-19
Elapsed (ms): 258260 total (MB): 4077 lines: 20000000 matched: 14571780
Scanned 15 MB/s
Final elapsed (ms): 284380

real    6m20.018s
user    4m18.864s
sys     0m25.522s
preston@mite:~/workspace/wide-finder-0$

Sources are at: http://svn.bannister.us/public/wide-finder-0

Using stdio with a simple fgets() loop is able to suck in over ??MB/s on my not-cutting-edge hardware. Looks I/O not CPU bound. The input files total bigger than memory, so this is real not cached disk reads. My guess is that a single CPU on a Niagara box will do at least as well (or my faith in the Sun folk is misplaced).

What this tells us is – as expected – a single CPU doing disk reads is right for this problem.

The second part adds applying a regex as in Tim’s example. Throughput drops to 15MB/s and process is (very!) CPU bound. For the purpose of Tim’s exercise, this is excellent! The next logical step would be to find a way to distribute the CPU-hungry regex processing across multiple CPUs.

Just to be clear … for the purpose of Tim’s topic, none of the following is relevant.

  • Comparisons between the performance of my box and another box.
  • Optimizing the file reader. (It is already I/O bound.)
  • Optimizing the regex processing. (Do I have to explain this?)

What is relevant is distributing the regex processing across multiple CPUs. My next iteration would be for the reader to place lines into some sort of in-memory queue (or simple pipes), with a scalable number of threads (~1 per CPU or as needed for the task) reading from the queue. For this problem, outputs from each regex-thread by a single(?) post-processing thread (merging the counted request strings). This is where the problem gets interesting. It is easy to come up with very specialized solutions. What would be more generally useful is a pipeline something like:

(one instance of reader ) -> (N instances of regex/counter) -> (one instance of merge process)

The reader should probably be C++ code (so we can get full I/O throughput), but I’m tempted – just to prove Tim’s point – to write the intermediate and backend processes in AWK. :)

Lacking access to a T2 (or the like), I can write the code, but not do any really interesting experiments. Oh well.

Posted in Software | Leave a comment