
Chenglin Li

李成林

Ph.D. Student

Zhejiang University & Shanghai Innovation Institute

I work on large language models, long-video understanding, multimodal evaluation, and agentic LLM systems.

About

I am a Ph.D. student jointly affiliated with Zhejiang University and Shanghai Innovation Institute. My research focuses on multimodal large language models, long-video understanding, and agentic reasoning.

I am advised by Jiaqi Wang and Yin Zhang.

Education

Zhejiang University & Shanghai Innovation Institute

Ph.D. in Artificial Intelligence, College of Computer Science and Technology

Zhejiang University

M.S. in Artificial Intelligence, School of Software Technology

Research areas: large language models, video understanding, and agents

Northeastern University

B.S. in Computer Science and Technology

GPA: 4.05/5.00, Rank: 13/221, CET-4: 561, CET-6: 479

Publications [Google Scholar]

  1. CVPR 2026 Findings First Author
  2. ACL 2026 Main First Author
  3. EMNLP 2024 Findings First Author
  4. EMNLP 2024 Findings First Author
  5. Under Review First Author
  6. Under Review First Author
  7. EMNLP 2024 Main Third Author
  8. NeurIPS 2025 Main Third Author

Research Experience

Agentic VideoLLMs for Long Video Understanding

Developed VideoThinker, a VideoLLM framework that turns long-video understanding into an agentic retrieval-and-zoom reasoning problem.

  • Introduced a unified retrieval-and-zoom mechanism for temporal localization and fine-grained evidence inspection.
  • Built synthetic tool-use supervision for VideoLLMs, improving accuracy on MLVU and LVBench by 6.8%–10.6% (the agentic loop is sketched below).
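
Below is a minimal sketch of the retrieval-and-zoom loop, assuming hypothetical interfaces (model.plan, tools.retrieve_segments, tools.zoom_frames); the names are illustrative, not the released VideoThinker API.

    from typing import NamedTuple

    class Action(NamedTuple):
        kind: str           # "retrieve", "zoom", or "answer"
        query: str = ""     # retrieval query (for "retrieve")
        start: float = 0.0  # segment bounds in seconds (for "zoom")
        end: float = 0.0
        text: str = ""      # final answer (for "answer")

    def answer_long_video(video, question, model, tools, max_steps=5):
        evidence = []  # frames and captions gathered so far
        for _ in range(max_steps):
            act = model.plan(question, evidence)  # model picks the next tool call
            if act.kind == "retrieve":
                # Coarse temporal localization over the whole video.
                evidence += tools.retrieve_segments(video, act.query, top_k=3)
            elif act.kind == "zoom":
                # Fine-grained inspection: densely re-sample one segment.
                evidence += tools.zoom_frames(video, act.start, act.end, fps=4)
            else:
                return act.text  # enough evidence collected
        return model.answer(question, evidence)  # budget exhausted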

Adaptive Program Reasoning for Long Video Understanding

Studied how VideoLLMs can combine direct multimodal reasoning with program-based tool use for complex video queries.

  • Designed an adaptive routing strategy that selects between fast VideoLLM reasoning and slow executable workflows according to query difficulty.
  • Built a code-based workflow planning pipeline that unifies model inference with external tool execution; the routing logic is sketched below.
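
A sketch of the fast/slow routing idea under stated assumptions: estimate_difficulty stands in for any lightweight difficulty scorer, and the 0.5 threshold is illustrative.

    def route_query(video, query, videollm, planner, executor, threshold=0.5):
        # Estimate difficulty in [0, 1]; estimate_difficulty is a hypothetical
        # stand-in for any lightweight difficulty classifier.
        difficulty = videollm.estimate_difficulty(query)
        if difficulty < threshold:
            # Fast path: a single direct multimodal forward pass.
            return videollm.answer(video, query)
        # Slow path: plan an executable workflow, run its tool calls,
        # then answer conditioned on the collected results.
        program = planner.write_program(query)
        results = executor.run(program, video)
        return videollm.answer(video, query, context=results)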

Benchmarking Advanced Multimodal Video Cognition

Built a controllable benchmark for evaluating symbolic, abstract, and high-level cognitive reasoning in video understanding.

  • Constructed an automated pipeline for scalable benchmark generation and task synthesis.
  • Covered object tracking, action perception, spatio-temporal reasoning, and cross-modal understanding with controllable difficulty (see the synthesis sketch below).
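
A sketch of how a difficulty config can drive task synthesis; the config fields and question template below are assumptions, not the benchmark's actual generator.

    import random
    from dataclasses import dataclass

    @dataclass
    class DifficultyConfig:
        num_objects: int = 3  # distractor objects placed in the scene
        num_events: int = 2   # temporal events the question chains over

    def synthesize_task(objects, events, cfg, seed=0):
        rng = random.Random(seed)
        objs = rng.sample(objects, cfg.num_objects)
        evts = rng.sample(events, cfg.num_events)
        # Difficulty scales with how many entities and events the template uses.
        question = (f"After {' and then '.join(evts)}, "
                    f"which of {', '.join(objs)} changed state?")
        # Ground truth is known by construction during generation
        # (placeholder choice here).
        return {"question": question, "answer": objs[0]}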

Instruction Evolution with Monte Carlo Tree Search

Studied instruction data synthesis with tree search to improve data quality for low-resource alignment.

  • Applied Monte Carlo Tree Search to explore and evaluate prompt rewriting actions.
  • Generated higher-quality synthetic instructions and obtained consistent gains on OpenLLM, AlpacaEval, and Wizard-eval (the search loop is sketched below).
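
A compact UCT-style sketch of the search: nodes hold instruction texts, actions are rewriting operators, and a quality scorer supplies the reward. The operators and scorer are assumed interfaces, not the paper's exact components.

    import math
    import random

    class Node:
        def __init__(self, text, parent=None):
            self.text, self.parent = text, parent
            self.children, self.visits, self.value = [], 0, 0.0

    def uct(node, c=1.4):  # standard UCB1 selection score
        return (node.value / node.visits
                + c * math.sqrt(math.log(node.parent.visits) / node.visits))

    def evolve(seed_text, rewrite_ops, score, iters=100):
        root = Node(seed_text)
        for _ in range(iters):
            node = root
            # Selection: descend while every child has been visited.
            while node.children and all(ch.visits for ch in node.children):
                node = max(node.children, key=uct)
            # Expansion: apply each rewriting operator once.
            if not node.children:
                node.children = [Node(op(node.text), node) for op in rewrite_ops]
            fresh = [ch for ch in node.children if ch.visits == 0]
            child = random.choice(fresh or node.children)
            # Simulation: score the rewritten instruction (reward in [0, 1]).
            reward = score(child.text)
            # Backpropagation.
            while child:
                child.visits += 1
                child.value += reward
                child = child.parent
        best = max(root.children, key=lambda ch: ch.value / max(ch.visits, 1))
        return best.text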

Distilling Reasoning Ability into Small Language Models

Proposed a mixed distillation framework for transferring reasoning supervision from strong LLMs to smaller, deployable language models.

  • Used Program-of-Thought (PoT) and Chain-of-Thought (CoT) signals as complementary supervision for numerical reasoning tasks.
  • Combined filtered reasoning traces with mixed-task distillation, lifting LLaMA-7B above GPT-3.5-turbo on some benchmarks (data construction sketched below).
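
A sketch of the data-construction step under stated assumptions: teacher_cot and teacher_pot stand in for calls to a strong teacher model, and traces are kept only when their final answer verifies.

    def run_pot(code):
        """Execute a PoT program and return its `answer` variable, or None."""
        scope = {}
        try:
            exec(code, {}, scope)  # teacher-generated code only; sandbox in practice
            return scope.get("answer")
        except Exception:
            return None

    def build_mixed_set(problems, teacher_cot, teacher_pot, parse_final):
        data = []
        for p in problems:
            cot = teacher_cot(p["question"])  # natural-language reasoning trace
            if parse_final(cot) == p["answer"]:  # keep only verified CoT
                data.append({"input": p["question"], "target": cot, "type": "cot"})
            pot = teacher_pot(p["question"])  # executable reasoning trace
            if run_pot(pot) == p["answer"]:  # keep only PoT that runs correctly
                data.append({"input": p["question"], "target": pot, "type": "pot"})
        return data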

Industry Experience

JD Explore Research

Worked on the full research pipeline for multimodal foundation models, including data, training, and evaluation.

ByteDance CapCut

Worked on video understanding systems for user intent modeling in video creation scenarios.

  • Built an end-to-end multimodal prompting pipeline over video, image, and text inputs.
  • Reduced hallucinations with decoding strategies and improved efficiency through visual token pruning (a pruning sketch follows).
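
A sketch of attention-based visual token pruning, one common way to realize the efficiency gain mentioned above; the 30% keep-ratio is an illustrative choice, not the production setting.

    import numpy as np

    def prune_visual_tokens(visual_tokens, attn_to_visual, keep_ratio=0.3):
        """visual_tokens: (n_vis, d); attn_to_visual: (n_txt, n_vis)."""
        # Score each visual token by the attention mass it receives from text.
        scores = attn_to_visual.sum(axis=0)      # shape (n_vis,)
        k = max(1, int(len(scores) * keep_ratio))
        keep = np.sort(np.argsort(scores)[-k:])  # top-k, kept in original order
        return visual_tokens[keep], keep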

Alibaba Quark

Worked on experience-aware content modeling for search scenarios requiring subjective or experiential evidence.

  • Led the project from investigation and sample mining to model training, achieving 75% accuracy with a BERT-based model.

Alibaba AiCheng

Worked on post-training and evaluation for enterprise dialogue models, with a focus on data quality and efficient deployment.

  • Improved data quality with reward modeling and diversity control, enabling Qwen-14B to match Qwen-72B on internal evaluations (curation sketch below).
  • Built subjective and objective evaluation sets for enterprise dialogue tasks and supported efficient deployment.
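
A sketch of the curation idea behind the first bullet, assuming a reward model and an embedding function as interfaces; both thresholds are illustrative.

    import numpy as np

    def curate(samples, reward_model, embed, min_reward=0.7, max_sim=0.9):
        kept, kept_embs = [], []
        for s in samples:
            # Quality filter: drop low-reward responses.
            if reward_model.score(s["prompt"], s["response"]) < min_reward:
                continue
            e = embed(s["response"])
            e = e / np.linalg.norm(e)
            # Diversity control: drop near-duplicates of anything already kept.
            if any(float(e @ k) > max_sim for k in kept_embs):
                continue
            kept.append(s)
            kept_embs.append(e)
        return kept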