مؤتمر
Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing
العنوان: | Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing |
---|---|
المؤلفون: | Fan J., Khan J., Pibiri G. E., Patro R. |
المساهمون: | Fan, J., Khan, J., Pibiri, G. E., Patro, R. |
بيانات النشر: | Springer Science and Business Media Deutschland GmbH |
سنة النشر: | 2023 |
المجموعة: | Università Ca’ Foscari Venezia: ARCA (Archivio Istituzionale della Ricerca) |
مصطلحات موضوعية: | Minimal Perfect Hashing, Pufferfish2, Reference Indexing, Spectrum Preserving Tilings, Settore INF/01 - Informatica |
الوصف: | The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences so that the position of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominates that of the k-mer-to-tile mapping since the former depends on the amount of total sequence while the latter depends on the number of unique k-mers in. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6 slowdown when querying k-mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2Test. |
نوع الوثيقة: | conference object |
اللغة: | English |
العلاقة: | info:eu-repo/semantics/altIdentifier/isbn/978-3-031-29118-0; info:eu-repo/semantics/altIdentifier/isbn/978-3-031-29119-7; ispartofbook:Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 27th International Conference on Research in Computational Molecular Biology, RECOMB 2023; volume:13976; firstpage:21; lastpage:40; numberofpages:20; serie:LECTURE NOTES IN COMPUTER SCIENCE; https://hdl.handle.net/10278/5023385Test; info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-85152558770 |
DOI: | 10.1007/978-3-031-29119-7_2 |
الإتاحة: | https://doi.org/10.1007/978-3-031-29119-7_2Test https://hdl.handle.net/10278/5023385Test |
حقوق: | info:eu-repo/semantics/openAccess |
رقم الانضمام: | edsbas.74D3EC9F |
قاعدة البيانات: | BASE |
DOI: | 10.1007/978-3-031-29119-7_2 |
---|