Scaling Large Language Models: The Data Problem
Speaker: Kaifeng Lyu (Tsinghua University)
Time: 15:30–16:30, April 16, 2026
Venue: Conference Room 211, Jingyuan Courtyard 6 (静园六院211会议室)
Abstract:
As the scale of pretraining continues to grow, acquiring high-quality data has become the primary bottleneck. In this talk, I will present several recent works from our group that aim to better understand and address this challenge. First, multi-epoch training can improve data efficiency, but how many epochs can we train for before overfitting sets in? We present a scaling-law analysis of this question. Beyond simply repeating data, another common practice for improving data efficiency is to augment existing data, for example by rephrasing it or generating question-answer pairs. While prior work suggests that more complex augmentation methods perform better, we show that a simple yet strong baseline can outperform all of them in the context of knowledge injection. Looking forward, we ask: what is the optimal training data when the test distribution or downstream tasks are well understood? Combining theoretical analysis with empirical results, we demonstrate that in many cases, even when the test distribution of an LLM is fully known, the optimal training distribution can still differ significantly from the test distribution itself.
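
To make the multi-epoch question concrete, below is a minimal, hypothetical sketch of fitting a scaling law to validation losses measured at different epoch counts. The functional form (repeated passes over the data contribute diminishing "effective" data) follows the published data-constrained scaling law of Muennighoff et al. (2023) and is not necessarily the formulation presented in this talk; the corpus size, data points, and initial parameter guesses are all toy values chosen for illustration.

# Hypothetical sketch: fit a data-constrained scaling law to validation
# losses measured after different numbers of epochs. The functional form
# (extra epochs yield diminishing effective data) follows Muennighoff
# et al. (2023); all numbers below are toy values, not results from the talk.
import numpy as np
from scipy.optimize import curve_fit

UNIQUE_TOKENS = 1e9  # size of the unique pretraining corpus (assumed)

def effective_data(epochs, r_star):
    """Effective token count: the first pass counts fully, and each
    additional pass decays in value at a rate set by r_star."""
    repeats = epochs - 1
    return UNIQUE_TOKENS * (1 + r_star * (1 - np.exp(-repeats / r_star)))

def loss(epochs, l_inf, a, alpha, r_star):
    """Validation loss as a power law in effective data."""
    return l_inf + a * effective_data(epochs, r_star) ** (-alpha)

# Toy measurements: validation loss after 1, 2, 4, 8, and 16 epochs.
epochs = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
val_loss = np.array([3.10, 2.95, 2.87, 2.84, 2.83])

popt, _ = curve_fit(loss, epochs, val_loss,
                    p0=[2.5, 5.0, 0.1, 5.0], maxfev=20000)
l_inf, a, alpha, r_star = popt
print(f"fitted r* = {r_star:.1f}: epochs beyond ~{r_star:.0f} add little")

Under this form, the fitted r* summarizes how quickly repeated epochs stop paying off, which is one way a scaling-law analysis can answer "how many epochs before overfitting?".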
Bio:
Kaifeng Lyu is an Assistant Professor at the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University. His research focuses on the theoretical foundations of deep learning and large language models, with particular emphasis on training dynamics, generalization mechanisms, and the role of data in large models. Before joining Tsinghua University, he was a postdoctoral researcher at the Simons Institute for the Theory of Computing at UC Berkeley. He received his Ph.D. from Princeton University in 2024, advised by Prof. Sanjeev Arora, and his bachelor's degree from Tsinghua University in 2019. His work has been published at top machine learning conferences such as NeurIPS, ICML, and ICLR, and he received a Best Paper Award at ICLR 2025.
