Counting k-mers on EC2 – huh?

(This blog post was mightily helped by Qingpeng Zhang, the first author of the paper; he wrote the pipeline. I just ran it a bunch 🙂)

We have been benchmarking k-mer counters in a variety of ways, in preparation for an upcoming paper. As with the diginorm paper we are automating everything, so I thought heck, why not try running it on a bunch of different EC2 machines to see how variable their performance is? Then, I ruined that idea by varying the machine configuration instead of using identical machines :).

The overall pipeline takes about 30 hours to run, and for this blog post I am focusing on one particular benchmark — the length of time it takes the various programs to generate and count the abundance distribution of the 22-mers present in 48.7m short reads, or about 5 GB of data. We used Jellyfish, DSK, khmer, and Tallymer; we're planning to try out KMC, also, but didn't get to it for this post.
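To make the task concrete: each program reads the FASTQ file, counts every 22-mer, and then tabulates how many distinct k-mers occur once, twice, three times, and so on. Here is a minimal pure-Python sketch of that task (for illustration only; it is nowhere near as fast or as memory-frugal as the programs benchmarked here, and the FASTQ parsing is deliberately simplistic):

```python
# Toy illustration of the benchmark task: count every 22-mer in a FASTQ
# file and print the k-mer abundance histogram. This is NOT one of the
# benchmarked programs, just a slow, memory-hungry reference sketch.
import sys
from collections import Counter

K = 22

def read_fastq_sequences(fp):
    """Yield sequences from a FASTQ file (assumes plain 4-line records)."""
    while True:
        header = fp.readline()
        if not header:
            break
        seq = fp.readline().strip()
        fp.readline()            # '+' separator line
        fp.readline()            # quality line
        yield seq

def main(filename):
    kmer_counts = Counter()
    with open(filename) as fp:
        for seq in read_fastq_sequences(fp):
            for i in range(len(seq) - K + 1):
                kmer_counts[seq[i:i + K]] += 1

    # Abundance histogram: how many distinct k-mers occur once, twice, ...
    histogram = Counter(kmer_counts.values())
    for abundance in sorted(histogram):
        print(abundance, histogram[abundance])

if __name__ == '__main__':
    main(sys.argv[1])
```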

I ran the counting on four machines: our local server, which is your standard reasonably high performance Linux box; two m2.2xlarge Amazon EC2 instances (34 GB RAM), one with the default setup and one with a 1 TB EBS disk with 100 IOPS configuration; and an m2.4xlarge Amazon EC2 instance, with 68 GB RAM. I chose different zones for all three EC2 machines. The max memory required was about 24 GB, I think.

I analyzed everything within an IPython Notebook, which is available here. If you want to play with the data, grab the master branch of https://github.com/ged-lab/2013-khmer-counting.git, go to the notebooks/ subdirectory, run the ipython notebook server, and open the 'khmer-counting-compare' notebook. All the data necessary to run the notebook is there.

The results are a bit weird!

First, let's look at the overall walltime it took to count (Figure 1). Jellyfish did a really nice job, outperforming everything else handily. Tallymer (the oldest of the programs) was by far the slowest; DSK and khmer were in the middle, depending on machine configuration.

Fig 1. The time (in seconds) to count the k-mers and generate a k-mer abundance histogram for 48.7m short reads from a soil metagenome, using several different k-mer counting packages.

A few points about Figure 1 —

    Why no errorbars? Time is money, baby — this already cost quite enough, thankyouverymuch.

    Doesn't this mean Jellyfish is just plain better? Well, read on (this and other blog posts).

    Why did everything perform worse on the IOPS configured EC2 instance? Heck if I know. Note that khmer has the least disk access of anything, which suggests that disk performance just downright sucked on the IOPS instance.

Now let's take a look at how efficiently the programs were using compute. Figure 2 shows the ratio of user time (which is approximately seconds spent by each core, summed, minus time spent in the OS critical sections) to walltime (how long the whole process took).
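Concretely, the ratio plotted below is just user time divided by walltime. If you collect timings with /usr/bin/time -p, which prints real, user, and sys in seconds, the arithmetic is a one-liner; here is a hypothetical little helper for illustration, not the actual pipeline code:

```python
# Compute the user-time / walltime ratio from POSIX-format timings, i.e.
# the output of "/usr/bin/time -p <command>", which looks like:
#   real 170.00
#   user 1234.56
#   sys 12.34
# Hypothetical helper; the numbers in the figures come from the benchmark
# pipeline itself, not from this snippet.
def parse_time_p(text):
    times = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ('real', 'user', 'sys'):
            times[fields[0]] = float(fields[1])
    return times

def user_to_wall_ratio(times):
    """Roughly how many cores were kept busy; 8 busy threads => ~8."""
    return times['user'] / times['real']

example = "real 170.00\nuser 1234.56\nsys 12.34"
print(user_to_wall_ratio(parse_time_p(example)))   # -> about 7.26
```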

Fig 2. The ratio of user time (seconds x cores, omitting system time) to walltime (seconds) when generating the k-mer abundance histogram for 48.7m reads. Note that Jellyfish and khmer were both run with 8 threads, while Tallymer is unthreaded and DSK was run with 1 thread by mistake.

A few points about figure 2:

    Wowsers! We ran both Jellyfish (red) and khmer (blue) with 8 threads, and the results suggest that they both used them very efficiently on our own server — a factor of about 8 suggests that they were merrily blasting along doing computing, hindered little if at all by disk access! Since our local server has great I/O (I guess?), that probably accounts for it. Note: I think this also means our locking and multithreading implementations are really good (read this and this for more information; this is a general threaded API for sequence reading, hint hint; see the toy sketch after this list).

    DSK and Tallymer both did a poor job of using multiple CPUs. Well, to be fair, Tallymer doesn't support threads. And while DSK does, we forgot to run it with 8 threads. Oops. Betcha performance increases!

    If I/O is what matters here, m2.4xlarge has what appears to be the next best I/O — khmer got up to a ratio of 7.09. Even on the IOPS system, khmer did OK.

    In general, I think these benchmarks show that I/O is the Achilles heel of the various k-mer counting systems. I don't know why the IOPS configuration would be worse for that, though.
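Since I hinted at the threaded sequence-reading API above, here is a toy producer/consumer sketch of that pattern: one reader thread streams sequences into a queue, and several worker threads count k-mers out of it. It is pure Python for readability (so the GIL keeps it from showing a real speedup), and none of the names in it come from khmer itself; the real implementation is multithreaded C++.

```python
# Toy producer/consumer sketch of the "one reader feeds several counting
# threads" pattern. Illustration of the structure only, not khmer code.
import threading
import queue
from collections import Counter

K = 22
N_WORKERS = 8
SENTINEL = None

def reader(sequences, q):
    """Single reader: stream sequences into the work queue."""
    for seq in sequences:
        q.put(seq)
    for _ in range(N_WORKERS):
        q.put(SENTINEL)                  # tell each worker to stop

def worker(q, counts, lock):
    local = Counter()                    # count locally, merge once at the end
    while True:
        seq = q.get()
        if seq is SENTINEL:
            break
        for i in range(len(seq) - K + 1):
            local[seq[i:i + K]] += 1
    with lock:                           # one coarse-grained merge
        counts.update(local)

def count_kmers_threaded(sequences):
    q = queue.Queue(maxsize=10000)
    counts, lock = Counter(), threading.Lock()
    workers = [threading.Thread(target=worker, args=(q, counts, lock))
               for _ in range(N_WORKERS)]
    for w in workers:
        w.start()
    reader(sequences, q)                 # main thread plays producer
    for w in workers:
        w.join()
    return counts

if __name__ == '__main__':
    demo = ['ACGTACGTACGTACGTACGTACGT'] * 4
    print(count_kmers_threaded(demo).most_common(3))
```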


Finally, let's look at system time. Figure 3 shows total system time (in seconds) for each program/machine configuration. System time includes all disk access, but not, I think, cache invalidation or other things like that.

Fig 3. System time (primarily disk access) for generating the k-mer abundance histogram for 48.7m reads.

Thoughts:

    This more or less confirms what we inferred from the other graphs: I/O is a bottleneck. Jellyfish, for whatever reason, disagrees with that statement, so they must be doing something clever 🙂

Some concluding thoughts for this initial blog post —

    Don't go around claiming that one k-mer counter is better, based on this! We omitted at least one good lookin' published k-mer counter (KMC) and may go take a look at BFCounter and Turtle too. Plus, we screwed up our DSK benchmarking.

    Note we’ve said nothing about memory or disk usage here. Indeed.

    At the end of the day, I don't understand what's going on with the IOPS-optimized EBS instances. Did I choose too low a number (100 IOPS)? Did I pick too big a hard drive? Is our access pattern lousy? Or what?

    Note that this post from Garantia Data ended up with similar questions :).

    Here, I think there are probably a variety of access patterns, but the basic thing that's going on is (a) reading a steady stream of data sequentially, and (b) for most of the programs, writing stuff to disk steadily. (khmer does not do any disk access beyond reading in the sequence file here.)
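If you want a picture of what that pattern looks like at the file level, here is an entirely hypothetical sketch, not taken from any of the benchmarked programs: one sequential, block-by-block read of the input, with a steady trickle of writes going out the other side and no seeks anywhere.

```python
# Sketch of the access pattern described above: a steady sequential read
# of the input plus, for most of the counters, a steady stream of writes
# to an output file. Hypothetical illustration only.
BLOCK = 1 << 20   # read 1 MB at a time

def stream_and_write(in_path, out_path):
    bytes_read = bytes_written = 0
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        while True:
            block = src.read(BLOCK)        # sequential read, no seeks
            if not block:
                break
            bytes_read += len(block)
            # stand-in for "write intermediate k-mer data as you go";
            # khmer would skip this write entirely.
            dst.write(block[:len(block) // 2])
            bytes_written += len(block) // 2
    return bytes_read, bytes_written
```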


Anyway, that's the first of what will probably be several blog posts on k-mer counting performance. This is a real data set, and a real set of well-used programs, so I think it's a pretty good benchmark; let me know if you disagree and want to see something else…

–titus
