Partition-based feature screening for categorical data via RKHS embeddings

Jun Lu, Lu Lin, WenWu Wang
Computational statistics & data analysis 2021 v.157 pp. 107176
byproducts, data analysis, paper, screening, seeds, statistics
This paper proposes a new screening procedure for ultrahigh-dimensional data with a categorical response. By exploiting the group structure among predictors, a partition-based screening approach is developed via reproducing kernel Hilbert space (RKHS) embeddings in the maximum mean discrepancy framework. Consequently, the new method can identify influential groups of predictors that may be overlooked by marginal screening methods. Moreover, because it uses the RKHS embedding, the new ranking index has a very simple form and can be evaluated easily. As a by-product, the method is model-free, requiring no specification of the relationship between the predictors and the response. The sure screening property of the proposed method is proved, and its effectiveness is illustrated via numerical studies and a real data analysis.
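
The idea of ranking predictor groups by a maximum mean discrepancy can be sketched as follows. This is only an illustrative sketch, not the paper's ranking index: it assumes a Gaussian kernel with a fixed bandwidth, the standard biased (V-statistic) estimate of the squared MMD, and an unweighted sum over all pairs of response classes. The function names (`gaussian_gram`, `mmd2`, `group_scores`), the bandwidth value, and the simulated data are all hypothetical.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd2(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        gaussian_gram(a, a, bandwidth).mean()
        + gaussian_gram(b, b, bandwidth).mean()
        - 2.0 * gaussian_gram(a, b, bandwidth).mean()
    )


def group_scores(x, y, groups, bandwidth=1.0):
    """Score each predictor group by summing squared MMDs over class pairs.

    Assumption: an unweighted pairwise sum; the paper's index may differ.
    """
    classes = np.unique(y)
    scores = []
    for cols in groups:
        xg = x[:, cols]  # all predictors in this group, taken jointly
        total = 0.0
        for i, r in enumerate(classes):
            for s in classes[i + 1:]:
                total += mmd2(xg[y == r], xg[y == s], bandwidth)
        scores.append(total)
    return np.array(scores)


# Hypothetical toy data: only the first group depends on the response.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, 6))
x[y == 1, :2] += 1.5  # mean shift in predictors 0 and 1 for class 1
groups = [[0, 1], [2, 3], [4, 5]]  # assumed partition of the predictors

scores = group_scores(x, y, groups)
ranking = np.argsort(scores)[::-1]  # groups ordered from most to least relevant
print(scores, ranking)
```

Because each group is scored jointly rather than one predictor at a time, a score of this kind can react to differences in the joint distribution of a group across classes, which is the motivation the abstract gives for partition-based screening over marginal screening.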