A gene–phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach

Xing, Wenhui, Qi, Junsheng, Yuan, Xiaohui, Li, Lin, Zhang, Xiaoyu, Fu, Yuhua, Xiong, Shengwu, Hu, Lun, Peng, Jing
Bioinformatics 2018 v.34 no.13 pp. i386
Arabidopsis thaliana, bioinformatics, data collection, genes, genetic analysis, models, phenotype
A fundamental challenge of modern genetic analysis is to establish the gene-phenotype correlations that are reported across a large body of publications. Because the lexical features of genes are relatively regular in text, the main challenge in this kind of relation extraction is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants. We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combined with abbreviation revision and sentence template extraction, we improved an unsupervised word-embedding-to-sentence-embedding cascaded approach as the representation learning step to recognize the wide variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated the well-known information extraction system OLLIE to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we set up two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR manual gene-phenotype relationship dataset to test its validity. The results showed that the pipeline covers 70.94% of the original dataset and adds 373 new relations that expand it. The source code is available at 82/ Supplementary data are available at Bioinformatics online.
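
The word-embedding-to-sentence-embedding idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how word vectors are combined, so this example simply assumes averaging followed by cosine similarity against a known phenotype phrase; it is not the authors' implementation. The vectors, the vocabulary, and the function names (`sentence_embedding`, `cosine`) are all hypothetical and chosen only for illustration.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, illustration only).
# A real system would load vectors trained on a biomedical corpus.
WORD_VECTORS = {
    "dwarf": [0.9, 0.1, 0.0],
    "short": [0.8, 0.2, 0.1],
    "plants": [0.1, 0.9, 0.0],
    "seedlings": [0.2, 0.8, 0.1],
    "kinase": [0.0, 0.1, 0.9],
    "activity": [0.1, 0.0, 0.8],
}


def sentence_embedding(tokens, vectors=WORD_VECTORS):
    """Average the vectors of known tokens (assumed composition rule).

    Returns None when no token has a vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# A known phenotype phrase acts as the seed; candidates are scored against it.
seed = sentence_embedding(["dwarf", "plants"])
phenotype_like = sentence_embedding(["short", "seedlings"])
unrelated = sentence_embedding(["kinase", "activity"])

score_phenotype = cosine(seed, phenotype_like)
score_unrelated = cosine(seed, unrelated)
```

In this toy setting the phrase "short seedlings" scores much closer to the seed than "kinase activity", which is the kind of signal a similarity-based phenotype recognizer could threshold on. The choice of threshold and of seed phrases is left open here.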