The recent improvements of full genome sequencing technologies, commonly subsumed under the term NGS (Next Generation Sequencing), have tremendously increased the sequencing throughput. Within 10 years it rose from 21 billion base pairs collected over months
to about 400 billion base pairs per day (current throughput of Illumina's HiSeq 4000).
The costs for producing one million base pairs could also be reduced from 140,000 dollars to a few cents.
As a result of this dramatic development, the number of new data submissions, generated by various biotechnological protocols (ChIP-Seq, RNA-Seq, etc.), to genomic databases has grown dramatically and is expected to continue to increase faster than the cost and capacity of storage devices will decrease.
The main task in analyzing NGS data is to search sequencing reads or short sequence patterns (i.e. exon/intron boundary read-through patterns) or expression profiles in large collections of sequences (i.e. a database).
Searching the entirety of such databases mentioned above is usually only possible by searching the metadata or a set of results initially obtained from the experiment. Searching (approximately) for specific genomic sequence in all the data has not been possible in reasonable computational time.
In this work we describe results of our new data structure, called binning directory that can distribute approximate search queries based on an extension of our recently introduced Interleaved Bloom Filters (IBF) called x-partitioned IBF (x-PIBF). The results presented here make use of Intel's Optane DC Persistent Memory architecture and achieves significant speedups compared to a disk based solution.