Speaker: Tracy Ke, Associate Professor, Statistics, Harvard University
Title: Recent Advances in Topic Modeling
Abstract: Topic modeling is a widely used technique in text analysis, with classical models relying on an approximate low-rank factorization of the word count matrix. In the first part of this talk, we introduce Topic-SCORE, a spectral algorithm for estimating classical topic models. The core innovation of this algorithm lies in exploiting a simplex structure in the spectral domain. Using precise entry-wise eigenvector analysis, we demonstrate that Topic-SCORE achieves the minimax optimal rate for both relatively long and relatively short documents.
In the second part, we extend the classical topic model to capture the distribution of word embeddings from pre-trained large language models (LLMs), enabling the incorporation of word context. We propose a flexible algorithm that integrates traditional topic modeling with nonparametric estimation. We showcase the effectiveness of our methods using MADStat, a dataset comprising 83,000 paper abstracts from statistics-related journals.
Bio: Tracy Ke is currently Associate Professor of Statistics at Harvard University. She received her PhD from Princeton University in 2014, advised by Professor Jianqing Fan. Prior to joining Harvard, she was Assistant Professor of Statistics at The University of Chicago from 2014 to 2018. Her research interests include network data analysis, high-dimensional statistics, text mining, and machine learning. Most of her recent work focuses on community analysis for network data and topic modeling for text data. She received the NSF CAREER award, ASA Noether Young Scholar Award, IMS Peter Hall Prize, and COPSS Emerging Leader Award, and is currently a Sloan Research Fellow.
Faculty web page (links to Harvard): https://statistics.fas.harvard.edu/people/tracy-ke
Seminar Date/Time: Thursday, November 14th, 4:10pm
Location: MSB 1147 (Refreshments 3:30pm, MSB Courtyard, or MSB 4229 if raining)