A spaced seed is a pattern of match positions interleaved with jokers (“don’t care” positions); a local match under this pattern triggers an alignment.
A spaced seed is characterized by its weight (the number of match positions) and its span (its total length). Below is an example of a spaced seed of weight 28 and span 40:
In the arXiv paper linked below, we show through a series of computational experiments that spaced seeds significantly improve the accuracy of metagenomic classification of short NGS reads.
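To make the definitions of weight and span concrete, here is a minimal Python sketch. It assumes a seed is written as a string in which `1` marks a match position and `0` marks a joker; that notation, the function names, and the toy seed `1101011` are illustrative assumptions, not taken from seed-Kraken.

```python
def seed_weight(seed: str) -> int:
    """Weight: the number of match positions ('1') in the seed."""
    return seed.count("1")


def seed_span(seed: str) -> int:
    """Span: the total length of the seed, jokers included."""
    return len(seed)


def apply_seed(seed: str, window: str) -> str:
    """Keep only the bases of `window` that fall on match positions."""
    if len(window) != len(seed):
        raise ValueError("window length must equal the seed span")
    return "".join(base for mark, base in zip(seed, window) if mark == "1")


def spaced_kmers(seed: str, sequence: str) -> list[str]:
    """Slide the seed along `sequence` and collect every spaced k-mer."""
    span = len(seed)
    return [
        apply_seed(seed, sequence[i:i + span])
        for i in range(len(sequence) - span + 1)
    ]


# Hypothetical toy seed, chosen only to keep the example short.
TOY_SEED = "1101011"

if __name__ == "__main__":
    print(seed_weight(TOY_SEED), seed_span(TOY_SEED))
    print(spaced_kmers(TOY_SEED, "ACGTACGTA"))
```

A contiguous k-mer is the special case in which the seed has no jokers, so its weight equals its span.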
Choosing a spaced seed
Choosing an optimal seed for a given task is a difficult problem. Iedera provides a versatile tool for designing good spaced seeds for various purposes. Luckily, random seeds work well enough in most tasks.
In our experiments, seed-Kraken with the weight 28, span 40 spaced seed above performs better than Kraken with its default k-mer length of 31 on data with a low error rate (plot below). On other data, it has higher sensitivity at the cost of slightly lower precision.
Some optimized seeds with weights up to 24 can be found at http://www.lifl.fr/~noe/gk/. These seeds were selected according to a methodology described in .
Improved sensitivity vs precision trade-off
Classification performance of seed-Kraken (the spaced seed modification) and the original Kraken on two metagenomes: simulated MiSeq (10 bacterial genomes, average error rate, primarily described and used in ), and HMPtongue (50,000 sequences from the SRS011086 tongue dorsum metagenomic sample of the Human Microbiome Project). Genus precision (positive predictive value) is plotted against genus sensitivity (rate of correct assignments). The k-mer length and its spaced seed equivalent, the seed weight, are varied, while the seed span ranges from 31 to 40. seed-Kraken outperforms the original Kraken in the sensitivity/precision trade-off, most evidently when classification is done at the genus or family taxonomic rank.
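The two plotted quantities can be sketched as follows. This is a simplified reading of the definitions given in the caption: precision is taken as correct genus calls over all genus calls made, and sensitivity as correct genus calls over all reads with a known genus. The function name, the dictionary layout, and the toy data are assumptions for illustration; the exact accounting used in the experiments may differ.

```python
def genus_precision_sensitivity(
    truth: dict[str, str],
    predicted: dict[str, str],
) -> tuple[float, float]:
    """Return (precision, sensitivity) at genus level.

    `truth` maps read id -> true genus.
    `predicted` maps read id -> called genus; unclassified reads are absent.
    """
    total = len(truth)
    classified = 0
    correct = 0
    for read_id, true_genus in truth.items():
        call = predicted.get(read_id)
        if call is None:
            continue  # read left unclassified at genus level
        classified += 1
        if call == true_genus:
            correct += 1
    precision = correct / classified if classified else 0.0
    sensitivity = correct / total if total else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    # Toy data, invented for this example only.
    truth = {"r1": "Escherichia", "r2": "Escherichia",
             "r3": "Bacillus", "r4": "Bacillus"}
    predicted = {"r1": "Escherichia", "r2": "Salmonella", "r3": "Bacillus"}
    print(genus_precision_sensitivity(truth, predicted))
```

Under these definitions, leaving a read unclassified lowers sensitivity but not precision, which is why the two are plotted against each other.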
Currently, seed-Kraken uses an algorithm very similar to Kraken's; it differs in that it considers the sense and anti-sense strands separately.
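One way to see why the strands need separate treatment: with contiguous k-mers, a k-mer and its reverse complement can be merged into a single canonical key, but masking with an asymmetric spaced seed does not commute with reverse complementation. The sketch below illustrates this with a hypothetical toy seed in `1`/`0` notation; it is not seed-Kraken's implementation, and all names in it are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_seed(seed: str, window: str) -> str:
    """Keep only the bases of `window` at the seed's match positions ('1')."""
    if len(window) != len(seed):
        raise ValueError("window length must equal the seed span")
    return "".join(base for mark, base in zip(seed, window) if mark == "1")


def strand_keys(seed: str, window: str) -> tuple[str, str]:
    """Spaced k-mers of the window read on the sense and anti-sense strands."""
    return apply_seed(seed, window), apply_seed(seed, reverse_complement(window))


if __name__ == "__main__":
    seed = "1101"      # hypothetical asymmetric toy seed
    window = "ACGG"
    sense, antisense = strand_keys(seed, window)
    print(sense, antisense)
    # The anti-sense key is not the reverse complement of the sense key,
    # so the two cannot be collapsed into one canonical key.
    print(reverse_complement(sense) == antisense)
    # Applying the reversed seed to the anti-sense strand restores the symmetry.
    print(apply_seed(seed[::-1], reverse_complement(window)) == reverse_complement(sense))
```

For a symmetric (palindromic) seed the two keys are reverse complements of each other, and the usual canonical-key trick applies again.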
- Spaced seeds improve metagenomic classification. Karel Brinda, Maciej Sykulski, Gregory Kucherov http://arxiv.org/abs/1502.06256
- Kraken: ultrafast metagenomic sequence classification using exact alignments. Wood DE, Salzberg SL: Genome Biology 2014, 15:R46.
- Iedera: subset seed design tool. G. Kucherov, L. Noé, and M. Roytberg, 2009
- A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k-Mer Distances L Noé, DEK Martin - Journal of Computational Biology, 2014