I focus on building next-generation LLM evaluation systems and large-scale, data-centric AI. I believe LLM evaluation should be reliable, efficient, and anchored in realistic real-world tasks, and that it must rest on a deep understanding of data. I contribute to the Seed Evaluation System at ByteDance Seed (some of my insights are demonstrated in the Seed1.8 Model Card).
Specific topics include:
- Reliable LLM Eval: We design new perturbation algorithms and future-prediction tasks with zero data contamination to reliably assess the true capabilities of agentic LLMs. Together with Prof. Mengdi Wang, our FutureX project, described by Elon Musk as the "Best Measure of Intelligence" (media), has received over 100 million views on X (Twitter). Following this line, we released CryptoBench for expert-level tasks in the cryptocurrency domain. We also designed LLM Swiss Round to give a holistic ranking of leading LLMs.
- Efficient LLM Eval: We design algorithms to evaluate LLM performance efficiently and accurately.
- Realistic LLM Eval: We use realistic, professional, and valuable tasks to benchmark LLMs' true value in assisting with daily work. Together with Prof. Hongseok Namkoong, our FinSearchComp has received over 20 million views on X (Twitter) and was recently adopted in the MiniMax-M2 and Kimi-K2-Thinking reports.
- LLM Calibration: We design Behaviorally Calibrated RL to calibrate LLMs' confidence estimates, enabling LLMs to say "I don't know".