Indexing raw sequencing data to better decipher living organisms

Lettre de l'INSU - Scientific results

 

Major sequencing projects are fundamental to our understanding of living organisms in fields such as health, agronomy and ecology. Technological advances have made it possible to obtain a considerable amount of raw sequencing data (sequence reads). The European Nucleotide Archive currently contains almost 50 petabytes of public raw data.

A team of researchers from CNRS Terre & Univers (MIO-OSU Pythéas), in collaboration with several research laboratories, used k-mers (words of size k) to define the equivalent of words in raw sequencing data, so that the data can be indexed and searched much like text. This indexing solution made it possible to query several tens of terabytes of sequence data from the Tara Oceans project. The Ocean Read Atlas (ORA) public web server, developed for this purpose, enables direct queries of several Tara Oceans consortium datasets collected from all the world's oceans.
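
As a rough illustration of the k-mer idea described above, the sketch below cuts each read into overlapping words of size k, records which reads contain each word, and then scores a query sequence by the fraction of its words found in each read. It is a toy, in-memory example only: the function names (`kmers`, `build_index`, `query`), the sample reads and the choice of k = 4 are all hypothetical, and it does not represent the data structures or the implementation behind the Ocean Read Atlas, which operates on tens of terabytes.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping word of size k found in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(reads, k):
    """Map each k-mer to the set of read identifiers that contain it."""
    index = defaultdict(set)
    for read_id, sequence in reads.items():
        for word in kmers(sequence, k):
            index[word].add(read_id)
    return index


def query(index, sequence, k):
    """For each read, return the fraction of the query's distinct k-mers it shares."""
    words = set(kmers(sequence, k))
    if not words:
        return {}
    hits = defaultdict(int)
    for word in words:
        # .get() avoids creating empty entries for unseen k-mers.
        for read_id in index.get(word, ()):
            hits[read_id] += 1
    return {read_id: count / len(words) for read_id, count in hits.items()}


# Hypothetical toy reads; real reads are far longer and far more numerous.
reads = {
    "read_1": "ACGTACGTGA",
    "read_2": "TTGACCGTAC",
    "read_3": "GGGGCCCCAA",
}
index = build_index(reads, k=4)
scores = query(index, "ACGTAC", k=4)
print(scores)
```

Here `read_1` contains all three distinct 4-mers of the query, `read_2` contains two of them, and `read_3` shares none and is therefore absent from the result. A plain dictionary is used for readability; indexing at the scale mentioned in the text would call for far more compact structures.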

 
