Spaced seeds

A spaced seed is a local match triggering an alignment, encoded with a pattern of matches interleaved with spaces – jokers (“don’t care” positions).

Spaced is seed is characterized by its weight (number of matching characters) and its span (total length). Below is an example of a spaced seed of weight 28 and span 40:

SEED="######-##-#-#-##-###-#-##---###--#######"

In the arXiv paper [1] linked below, through a series of computational experiments, we show that spaced seeds significantly improve the accuracy of metagenomic classification of short NGS reads.

Choosing a spaced seed

Choosing an optimal seed for a given task is a difficult problem. Iedera provides a versatile tool for designing good spaced seeds for various purposes. Luckily, random seeds work well enough in most tasks.

In our experiments we've found that seed-Kraken with the use of the above weight 28 span 40 spaced seed performs better on data with low error rate (plot below) than Kraken with default setting of k-mer length 31, and has higher sensitivity at the cost of slightly lower precision on other data.

Some optimized seeds with weights up to 24 can be found at http://www.lifl.fr/~noe/gk/. These seeds were selected according to a methodology described in [4].

Improved sensitivity vs precision trade-off

Performance of classification of seed-Kraken (spaced seed modification), and original Kraken, on metagenomes: simulated MiSeq (10 bacterial genomes, average error rate, primarily described and used in [2]), and HMPtongue (50,000 sequences from SRS011086 Tongue dorsum metagenomic sample from Human Microbiome Project). Charted are genus precision (positive predictive value) against genus sensitivity (rate of correct assignments). Varying are k-mer length, and its spaced seed equivalent seed weight, while the seed span varies from 31 to 40. seed-Kraken outperforms original Kraken in sensitivity/precision trade off, most evidently when classification is done at the level of genus, or family, taxonomic rank.

seed-Kraken performance

Classification algorithm

Right now seed-Kraken employs an algorithm very similar to Kraken's, different in the aspect of separately considering sense/anti-sense strands.

Scientific publications

  1. Spaced seeds improve metagenomic classification. Karel Brinda, Maciej Sykulski, Gregory Kucherov http://arxiv.org/abs/1502.06256
  2. Kraken: ultrafast metagenomic sequence classification using exact alignments. Wood DE, Salzberg SL: Genome Biology 2014, 15:R46.
  3. Iedera: subset seed design tool G. Kucherov, L. Noé, and M. Roytberg, , 2009
  4. A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k-Mer Distances L Noé, DEK Martin - Journal of Computational Biology, 2014