Event Date
Speaker: Qi Xu, Post-Doctoral Researcher, Carnegie Mellon University
Title: "Statistical Inference for Multi-Modality Data in the AI Era"
Abstract: Multi-modality data, such as imaging, text, sensor, and genomic data, are increasingly common across science, medicine, and technology. These modalities are often high-dimensional or unstructured and naturally exhibit blockwise (nonmonotone) missingness, where different samples observe different subsets of modalities. Such missingness is a major obstacle for statistical analysis, since classical methods either discard large portions of the data or rely on strong modeling assumptions. Recent advances in AI make it possible to generate or predict unobserved modalities from observed ones, opening new opportunities for data integration.
In this talk, I will focus on statistical inference for blockwise-missing multi-modality data while rigorously incorporating modern AI tools. A long-standing open problem in semiparametric theory is that the theoretically optimal estimating function under nonmonotone missingness is computationally intractable, even under the missing-completely-at-random mechanism. I introduce a tractable approximation to the optimal estimating equation through a novel Restricted ANOVA hierarchY (RAY) decomposition and its almost-eigen-operator property. This leads to a new class of estimators that leverage predictive or generative AI models to borrow information across datasets while remaining unbiased and asymptotically normal. Motivated by the properties of the RAY estimator, we extend it to a class of unbiased, consistent, and computationally tractable estimators. We then derive the most efficient estimator in this class, called the Adaptive RAY estimator, which optimally integrates all available data and AI predictions. Simulation studies and a single-cell multi-omics application demonstrate that the proposed framework enables stable and efficient inference for complex multi-modality data in the AI era.
This is joint work with Lorenzo Testa, Jing Lei, and Kathryn Roeder. The paper is available on arXiv: https://arxiv.org/abs/2509.24158
_____
This talk is part of the STA 290 Seminar series.