Speaker: Guang Cheng (Professor, Department of Statistics and Data Science, University of California, Los Angeles)
Title: "Golden Ratio Weighting Prevents Model Collapse"
Abstract: In recent years, synthetic data have been widely used to train generative models such as large language models. This trend is driven largely by the limited supply of real data relative to the demands that neural scaling laws place on training ever-larger models. However, over successive training iterations, the trained generative models gradually lose information about the real data distribution, a phenomenon known as model collapse.
In this talk, we investigate this phenomenon theoretically by training generative models iteratively on a combination of newly collected real data and synthetic data from the previous training step. We conduct our theoretical studies in various scenarios, including Gaussian distribution estimation and linear regression. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression. Notably, in some cases, the optimal weight assigned to real data corresponds to the reciprocal of the golden ratio.
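As an illustration of the kind of result described above, the sketch below works out a simplified (hypothetical) version of the Gaussian setting: at each round, n fresh real samples are mixed with n synthetic samples drawn from the previous round's fitted model, and the mean estimate is updated as a weighted average with weight w on real data. Under these assumptions the steady-state error variance is v(w) = (w² + (1−w)²) / (n·(1 − (1−w)²)), whose minimizer is (√5 − 1)/2, the reciprocal of the golden ratio. This is a toy derivation for intuition, not the speaker's exact setup.

```python
import numpy as np

# Hypothetical toy model: each round, n real samples (variance 1/n for the
# sample mean) are mixed with n synthetic samples drawn from the previous
# round's model. Recursion for the estimator's error variance:
#   v_t = w^2 / n + (1 - w)^2 * (v_{t-1} + 1/n)
# At the fixed point v_t = v_{t-1} = v(w):
#   v(w) = (w^2 + (1 - w)^2) / (n * (1 - (1 - w)^2))

def steady_state_variance(w, n=100):
    """Steady-state error variance of the weighted estimator."""
    return (w**2 + (1 - w)**2) / (n * (1 - (1 - w)**2))

# Minimize v(w) numerically over a fine grid of admissible weights.
ws = np.linspace(0.01, 0.999, 100_000)
w_opt = ws[np.argmin(steady_state_variance(ws))]

golden_reciprocal = (np.sqrt(5) - 1) / 2  # 1/phi ~ 0.6180
print(f"numerical optimum: {w_opt:.4f}, 1/phi: {golden_reciprocal:.4f}")
```

Setting the derivative of v(w) to zero gives (1−w)² − 3(1−w) + 1 = 0, whose admissible root yields w = (√5 − 1)/2 ≈ 0.618, matching the grid search.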
Faculty website (links to UCLA): https://datatheory.ucla.edu/person/guang-cheng/