Cocoa origin classifiability through LC-MS data: A statistical approach for large and long-term datasets

Kumar, Santhust, D'Souza, Roy N., Behrends, Britta, Corno, Marcello, Ullrich, Matthias S., Kuhnert, Nikolai, Hütt, Marc-Thorsten
Food research international 2021 v.140 pp. 109983
beans, chemical composition, classification, data collection, discriminant analysis, flavor, food composition, food industry, food research, information, liquid chromatography, multivariate analysis, normal distribution, principal component analysis, quality control, sampling, separation, spectroscopy, transportation
Classification of food samples by country of origin is an important task in the food industry for quality assurance and the development of fine-flavor products. Liquid chromatography-mass spectrometry (LC-MS) provides a fast technique for obtaining in-depth information about the chemical composition of foods. However, in a large dataset gathered over a period of a few years, multiple, incoherent, and hard-to-avoid sources of variation (e.g., experimental conditions, transportation, and batch and instrumental effects) pose technical challenges that make origin classification a difficult problem. Here, we use a large dataset gathered over a period of four years, containing 297 LC-MS profiles of cocoa sourced from 10 countries, to demonstrate these challenges using two popular multivariate analysis methods: principal component analysis (PCA) and linear discriminant analysis (LDA). We show that PCA provides only limited separation by bean origin, while LDA suffers from a strong non-linear dependence on the set of compounds. Further, we show for LDA that a compound selection criterion based on the Gaussian distribution of intensities across samples dramatically enhances origin clustering of samples, suggesting that this approach can be used to study marker compounds in such a disparate dataset. In essence, we develop a new approach that maximizes the utility of multivariate analysis in a highly complex dataset while avoiding overfitting.
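
The abstract does not spell out the exact Gaussianity criterion, so the following is only a minimal sketch of what such a compound filter might look like, not the authors' method. It assumes a samples-by-compounds intensity table and uses the Jarque-Bera statistic (based on skewness and excess kurtosis) as a stand-in normality score. The function names, the toy data, and the threshold of 5.99 (roughly the 95% quantile of a chi-square distribution with two degrees of freedom) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def jarque_bera(values):
    """Jarque-Bera statistic of one compound's intensities across samples.

    Close to 0 for a Gaussian-shaped sample, larger for skewed or
    heavy-tailed ones. Used here as an assumed stand-in criterion.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return math.inf  # a constant column cannot be scored; treat as non-Gaussian
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)


def select_gaussian_compounds(intensities, threshold=5.99):
    """Return column indices whose Jarque-Bera statistic is below `threshold`.

    `intensities` is a list of rows (samples), each a list of compound
    intensities. The default threshold is an illustrative assumption.
    """
    n_compounds = len(intensities[0])
    keep = []
    for j in range(n_compounds):
        column = [row[j] for row in intensities]
        if jarque_bera(column) < threshold:
            keep.append(j)
    return keep


def toy_table(n_samples=297):
    """Synthetic table (not real LC-MS data): column 0 follows normal
    quantiles, column 1 is their exponential, i.e. strongly right-skewed."""
    z = [NormalDist().inv_cdf((i + 0.5) / n_samples) for i in range(n_samples)]
    return [[zi, math.exp(zi)] for zi in z]


selected = select_gaussian_compounds(toy_table())
print(selected)
```

On this toy table the filter keeps the normally distributed column and drops the skewed one. In a workflow like the one the abstract describes, the retained columns would then be passed on to LDA. Whether a hard threshold or some other selection rule is appropriate for real data is a design choice this sketch does not settle.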