Linguistics/Statistics Seminar: Xinyi Wang

Location: Social Sciences & Humanities 2203

Speaker: Xinyi Wang (UC Santa Barbara)

Title: "Understanding Large Language Models from Pretraining Data Distribution"

Abstract: Large Language Models (LLMs) exhibit remarkable capabilities across diverse tasks, yet their inner workings remain opaque. This talk investigates the mechanisms behind LLM generalization by analyzing the distribution of their pretraining data. We explore whether LLMs primarily rely on surface-level text frequency for memorization or infer deeper generative processes to achieve generalization. Through empirical studies, we show that LLMs memorize pretraining data for knowledge-intensive tasks while generalizing for reasoning-intensive tasks. Additionally, we examine LLMs' few-shot generalization as an emergent capability arising from a latent-variable-based data-generation process. Finally, we analyze how LLMs extrapolate from existing knowledge and draw novel conclusions by aggregating random-walk paths sampled from a latent knowledge graph governing text generation. Our findings provide insight into the fundamental mechanisms behind LLM generalization and offer strategies for enhancing their performance and reliability.