Post-training · Reasoning · Multimodal Agents
Wei Shen
沈蔚
Senior ML Research Scientist at Skywork AI, working on post-training, reasoning systems, and multimodal agents for frontier foundation models.
Skywork AI · ex-Baichuan · ex-ByteDance Seed / AI Lab · Fudan NLP · Beijing
My work spans RLHF, reward modeling, long-horizon agentic reinforcement learning, and multimodal reasoning. It draws on experience at leading industry labs and has earned academic recognition and open-source impact.
Research Profile
I work on post-training for frontier foundation models, with a focus on reasoning systems, RLHF, and multimodal agents. Most of my recent work sits at the boundary between research ideas and production-facing model development: reward design, scalable reinforcement learning, curriculum construction, and evaluation for reliable reasoning behavior.
I am especially interested in training pipelines that remain robust under noisy, sparse, or gameable feedback, and in systems that transfer across general reasoning, coding, and multimodal tasks. The goal is not only stronger benchmark results but also models that are harder to exploit and more dependable in real use.
Research Interests
- RLHF and scalable post-training for reasoning models
- Reward modeling, GRPO-style optimization, and anti-reward-gaming
- Long-horizon agentic reinforcement learning
- Multimodal reasoning and VLM agents
- Reliable alignment under noisy or shifted feedback
Career Path
Senior ML Research Scientist
Leading long-horizon agentic RL and multimodal post-training for frontier reasoning systems and production agents.
Research Scientist
Worked in the RL team led by Dong Yan on reasoning-oriented post-training, medical reasoning, coding RL, and reward modeling for the Baichuan family.
Research Intern
Worked with Yang Liu and Hang Li on robust RLHF, noisy-reward dynamics, and PPO variants for more stable and generalizable alignment.
Early Research Training
Started formal research training in the Fudan NLP community, building the foundation for later work on language understanding, alignment, and reasoning systems.
Selected Impact
- Led or contributed to major reasoning and multimodal releases including Skywork-R1V3, Skywork-OR1, and MOSS-RLHF.
- Published award-winning and spotlight papers on RLHF, alignment, and post-training.
- Built end-to-end experience from reward design and evaluation to training pipelines and public releases.
- Collaborated across top industry labs and Fudan NLP on alignment, reasoning, and multimodal learning.
Current Focus
Reasoning and Agentic RL
Developing long-horizon RL training pipelines for general, coding, and multimodal agent tasks.
Multimodal Systems
Building multimodal reasoning models and production browser / search agents with stronger reliability.
Reliable Alignment
Studying reward design, post-training stability, and robustness under noisy, sparse, or gameable feedback.
Selected Projects
Skywork-R1V3-38B
Led the July 2025 release, which raised MMMU accuracy from 64.3% to 76.0% using GRPO and curriculum learning.
Skywork-OR1
Contributed to scalable reasoning models that reached 82.2 on AIME24 and 73.3 on AIME25.
MOSS-RLHF
Helped build one of the earliest open-source RLHF frameworks in China, which became a foundation for later reasoning post-training stacks.
Education and Highlights
Fudan University
M.S. in Computer Science, September 2021 – January 2024
Fudan NLP Lab. Advisors: Prof. Xuanjing Huang, Prof. Xipeng Qiu, and Prof. Tao Gui.
Huazhong University of Science and Technology
B.S. in Computer Science, September 2016 – May 2020
Recognition
NeurIPS Best Paper, ICLR Spotlight, and sustained open-source impact across RLHF and reasoning systems.