CS-SCORE: Rapid identification and removal of human genome contaminants from metagenomic datasets

Haque, Mohammed Monzoorul; Bose, Tungadri; Dutta, Anirban; Reddy, Chennareddy Venkata Siva Kumar; Mande, Sharmila S.
Genomics 2015 v.106 pp. 116-121
algorithms, data collection, genome, humans, metagenomics, microbial communities, nucleotide sequences
Metagenomic sequencing data obtained from host-associated microbial communities are usually contaminated with host genome sequence fragments. Prior to performing any downstream analyses, it is necessary to identify and remove such contaminating sequence fragments. The time and memory requirements of available host-contamination detection techniques are enormous, which makes processing large metagenomic datasets a challenging task. This study presents CS-SCORE, a novel algorithm that can rapidly identify host sequences contaminating metagenomic datasets. Validation results indicate that CS-SCORE is 2–6 times faster than the current state-of-the-art methods. Furthermore, the memory footprint of CS-SCORE is in the range of 2–2.5 GB, which is significantly lower than that of other available tools. CS-SCORE achieves this efficiency by incorporating (1) a heuristic pre-filtering mechanism and (2) a directed-mapping approach that utilizes a novel sequence composition metric (cs-score). CS-SCORE is expected to be a handy ‘pre-processing’ utility for researchers analyzing metagenomic datasets. For academic users, an implementation of CS-SCORE is freely available at: (or)