Main content area

Identifying suitable tools for variant detection and differential gene expression using RNA-seq data

Dharshini, S. Akila Parvathy, Taguchi, Y.-H., Gromiha, M. Michael
Genomics 2019
algorithms, gene expression regulation, genes, genome assembly, genome-wide association study, genomics, hippocampus, messenger RNA, microarray technology, neurodegenerative diseases, pipelines, quantitative trait loci, sequence alignment, transcriptomics
Neurodegenerative diseases are the most predominate brain disorders around the globe and the affected populations are rapidly increasing. Recently, these diseases have been addressed using the data obtained from RNA-sequencing technology to reveal the changes in gene/transcript expression, effect of variants, and pathways involved in disease mechanisms. However, the observations mainly depend on the aligners/tools and the performance of existing RNA-seq tools on hg38 genome assembly has not yet been documented. In this study, we performed a systematic analysis of various spliced aligners, transcript assembling and variant calling tools based on both genomic assemblies (hg19/hg38) from hippocampus brain tissue. This helps to identify the best possible combination tools for hg38 annotation. In order to evaluate the identified variants from various pipelines, we compared them with expression Quantitative Trait Loci (eQTL) and Genome-Wide Association Study (GWAS). In addition, the identified differentially expressed genes (DG) were compared with microarray studies. From our analysis of variant calling, the combination of GATK (Genome Analysis Tool-kit) and STAR (Spliced Transcripts Alignment to a Reference) protocol yields a larger number of GWAS/eQTL variants compared to SAMtools (Sequence Alignment Map). We also identified a higher number of non-coding variants in hg38 compared to hg19 due to enhanced annotation. In the case of various DG pipelines, we found that the Salmon-based hg38 transcriptomic quantification yields a higher number of reported DG compared to other genome-based quantification methods. This study revealed that higher number of reads maps to multiple location of the genome with hg38 compared to hg19, and these spurious multi-mapped reads may affect the gene quantification techniques. We suggest that it is necessary to develop efficient algorithms, which can handle the multi-mapped reads and improve the performance of genome-based alignment quantification.