Data intensive biology in the cloud: instrumenting ALL the things

Here’s a draft PyCon ’14 proposal. Comments and suggestions welcome!


Title: Data intensive biology in the cloud: instrumenting ALL the things

Description: (400 ch)

Cloud computing offers some great opportunities for science, but most cloud computing platforms are both I/O and memory limited, and hence are poor matches for data-intensive computing. After four years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source, of course.

Audience: People interested in data-intensive computing, cloud performance, and scientific software development.

Python level: Beginner/intermediate.

Objectives:

Attendees will

- learn a bit about I/O and big-memory performance in demanding situations;
- see performance numbers for various cloud platforms;
- hear about why some people can't use Hadoop to process large amounts of data;
- gain some insight into the sad state of open science.

Detailed abstract:

The cloud provides great opportunities for a variety of important computational science challenges, including reproducible science, standardized computational workflows, comparative benchmarking, and focused optimization. It can also be a disruptive force for the betterment of science, by eliminating the need for large infrastructure investments and by supporting exploratory computational science at previously challenging scales. However, most cloud computing use in science so far has focused on relatively mundane "pleasantly parallel" problems.

Our lab has spent many moons addressing a large, non-parallelizable "big data/big graph" problem (sequence assembly) with a mixture of Python and C++, some fun new data structures and algorithms, and a lot of cloud computing. Most recently we have been working on open computational "protocols", workflows, and pipelines for democratizing certain kinds of sequence analysis. As part of this work we are tackling issues of standardized test data sets to support comparative benchmarking, targeted optimization, reproducible science, and computational standardization in biology.

In this talk I'll discuss our efforts to understand where our computational bottlenecks are, what kinds of optimization and parallelization efforts make sense financially, and how the cloud is enabling us to be usefully disruptive. As a bonus, I'll talk about how the focus on pleasantly parallelizable tasks has warped everyone's brains and convinced them that engineering, not research, is really interesting.
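To make "instrumenting the pipeline" concrete, here is a minimal, illustrative sketch (not the lab's actual tooling) of wrapping a pipeline stage to record wall-clock time and peak Python heap usage, the two numbers that matter most on I/O- and memory-limited cloud instances. The `count_kmers` stage is a hypothetical toy stand-in for a real assembly step.

```python
import functools
import time
import tracemalloc


def instrumented(stage):
    """Decorator: report elapsed time and peak Python heap use of a stage."""
    @functools.wraps(stage)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        t0 = time.perf_counter()
        try:
            return stage(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - t0
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print("%s: %.2f s, peak %.1f MB"
                  % (stage.__name__, elapsed, peak / 1e6))
    return wrapper


@instrumented
def count_kmers(seq, k=4):
    # Toy stand-in for a real pipeline stage: count k-mers in a sequence.
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

In practice you would also want to capture I/O wait and resident set size at the process level (e.g. via `/proc` or `getrusage`), since a C++ extension's memory use is invisible to `tracemalloc`.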

Outline:

1. Defining the terms: cloud computing; data intensive; compute intensive.

2. Our data-intensive problem: sequence assembly and the big graph problem. The scale of the problem. A complete analysis protocol.

3. Predicted bottlenecks, including computation and I/O.

4. Actual bottlenecks, including NUMA architecture and I/O.

5. A cost-benefit analysis of various approaches, including buying more memory; striping data across multiple volumes; increasing I/O performance; focusing on software development; "pipelining" across multiple machines; theory vs. practice in terms of implementation.

6. A discussion of solutions that won't work, including parallelization and GPUs.

7. Making analysis "free" and using low-cost compute to analyze other people's data. Trying to be disruptive.
