Development of TBSPG Pipelines for Refining Unique Mapping and Repetitive Sequence Detection Using the Two Halves of Each Illumina Sequence Read

Xiang, Heng; Li, Xiu-Qing
Plant Molecular Biology Reporter 2016 v.34 no.1 pp. 172-181
chromosomes, computer software, genome, pipelines, potatoes, repetitive sequences, sequence analysis
We developed six pipelines (TBSPG) for mapping Illumina sequence reads to reference genomes, refining unique mapping, and computing the number of mapped reads and the coverage. The pipelines offer a choice between multi-mapping and unique mapping, between paired-end read files and a single-end read file as input, between removing and retaining sequences shared by the nucleus and the organelles, and between mapping full-length reads and mapping the two halves of each read to refine the detection of unique and non-unique sequences. The TBSPG pipelines are based on (and named after) publicly available tools: Trimmomatic, the Burrows–Wheeler Aligner (BWA), SAMtools, Picard, and the Genome Analysis Toolkit (GATK). We developed several Perl scripts to fill the gaps between the tools, connect the tools, recognize half-length reads, select uniquely mapped reads, and compute and output data in a Microsoft Excel-recognizable format for studying the read number and the coverage per chromosome and per organellar genome. In a potato 100-bp paired-end sequence file (Illumina TruSeq), approximately 6.75% of the full-length reads that mapped uniquely were found to contain non-unique sequences at the half-length-read level. These freely available TBSPG pipelines can be used for many read-based applications, including repetitive sequence analysis and organellar genome copy number estimation.
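The half-read idea can be illustrated with a minimal Python sketch. It is not the authors' Perl code and does not call BWA: exact substring matching against a toy reference stands in for alignment, and the names `split_read`, `count_hits`, and `classify_read`, as well as the label strings, are hypothetical.

```python
def split_read(seq):
    """Split a read into its two halves (the first half is shorter if the length is odd)."""
    mid = len(seq) // 2
    return seq[:mid], seq[mid:]


def count_hits(query, reference):
    """Count exact, possibly overlapping, occurrences of query in reference.

    Assumption: exact matching replaces a real aligner purely for illustration.
    """
    count = 0
    start = reference.find(query)
    while start != -1:
        count += 1
        start = reference.find(query, start + 1)
    return count


def classify_read(seq, reference):
    """Label a read by its uniqueness at full length and at half length."""
    full_hits = count_hits(seq, reference)
    if full_hits == 0:
        return "unmapped"
    if full_hits > 1:
        return "multi_mapped"
    # The full-length read maps exactly once; now test each half separately.
    left, right = split_read(seq)
    if count_hits(left, reference) == 1 and count_hits(right, reference) == 1:
        return "unique"
    return "unique_full_but_non_unique_half"


# Toy reference in which the 8-mer ACGTACGT occurs twice.
reference = "GGACGTACGTTTGACCAGCCTAACGTACGTGA"
reads = ["ACGTACGTTTGACCAG", "TTGACCAGCCTAACGT"]
labels = [classify_read(r, reference) for r in reads]
```

In this toy case the first read maps once at full length, but its left half occurs twice in the reference, so it gets the label `unique_full_but_non_unique_half`. The second read is unique at both levels. The 6.75% figure in the abstract corresponds to the share of uniquely mapped full-length reads that fall into the first category.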