Schedule link
Description
Random algorithms and probabilistic data structures arealgorithmically efficient and can provide shockingly good practicalresults. I will give a practical introduction, with live demos and badjokes, to this fascinating algorithmic niche. I will conclude withsome discussions of how our group has applied this to large sequencingdata sets (although this will not be the focus of the talk).
Abstract:
I propose to start with Python implementations of most of the DS & A mentioned in this excellent blog post:
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
and also discuss skip lists and any other random algorithms that catchmy fancy. I’ll put everything together in an IPython notebook and addvisualizations as appropriate.
I’ll finish with some discussion of how we’ve put these approaches towork in my lab’s research, which focuses on compressive approaches tolarge data sets (and is regularly featured in my Python-ic blog,http://ivory.idyll.org/blog/).
Slides
Github repo with IPython Notebooks in it
Video link
Wikipedia’s category page for Probabilistic Data Structures
The Highly Scalable Blog on Probabilistic Data Structures for WebAnalytics and Data Mining
skiplist IPython Notebook
Wikipedia page on SkipLists
John Shipman’s excellent writeup
William Pugh’s original article
The HackerNews (oops!) reference for my reddit-attributed quote aboutputting a gun to someone’s head and asking them to write a log-timealgorithm for storing stuff:https://news.ycombinator.com/item?id=2670632
coinflips IPython Notebook
HyperLogLog counter IPython Notebook
Aggregate Knowledge’s EXCELLENT blog post on HyperLogLog.The section on Big Pattern Observables is truly fantastic 🙂
Flajolet etal. isthe original paper. It gets a bit technical in the middle but thediscussions are great.
Nick Johnson’s blog post on cardinality estimation
MetaMarkets’ blog post on cardinality counting
More High Scalability blog posts, this one by Matt Abrams
The obligatory Stack Overflow Q&A
Vasily Evseenko’s git repo https://github.com/svpcom/hyperloglog,forked from Nelson Goncalves’s git repo,https://github.com/goncalvesnelson/Log-Log-Sketch, served as thesource for my IPython Notebook.
Bloom filter notebook
The Wikipedia page is prettygood.
Everything I know about Bloom filters comes from my research.
I briefly mentioned the CountMin Sketch, which is anextension of the basic Bloom filter approach, for counting frequencydistributions of objects.
Quotient filters
Rapidly-exploring random trees
Random binary trees
Treaps
StackOverflow question on Bloom-filter like structures that go the other way
A survey of probabilistic data structures
K-Minimum Values over at Aggregate Knowledge again
Set operations on HyperLogLog counters, again over at Aggregate Knowledge.
Using SkipLists to calculate an efficient running median
A fairly readable (?) grant on Big Data in sequencing data sets
A less readable 😉 grant on "infinite assembly"
In addition to our published paper on using Bloom filters to storede Bruijn graphs, you might be interested in:
Our preprint on streaming lossy compression of sequencing data (aka Digital Normalization)
Our use of these techniques to assemble the heck out of large metagenomic data from soil
A chapter on optimizing our khmer software.
–titus