InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
TL;DR Summary
ORBIT employs rubric-based incremental reinforcement learning to improve LLM performance on open-ended medical dialogues, enhancing Qwen3-4B-Instruct’s HealthBench-Hard scores and enabling robust, scalable learning without relying on external knowledge or manual rules.
Abstract
Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
- Authors: The paper lists "ei Wang i Zuwei LuZhie SanCkai Xie ia" as authors, which appears to be a transcription error. The intended authors are likely affiliated with a research institution, but their specific identities and affiliations are obscured in the provided text.
- Journal/Conference: The paper is presented as a preprint on arXiv. arXiv is a widely respected open-access archive for scholarly articles in fields like physics, mathematics, computer science, and more. It is a standard platform for disseminating research quickly, often before or during the peer-review process for a formal conference or journal.
- Publication Year: The paper cites a future year (2025) for many references and for its own arXiv identifier (2510.15859v1), suggesting the date is a placeholder or that the work is very recent. For the purpose of this analysis, we treat the content as contemporary.
- Abstract: The abstract introduces the problem that Reinforcement Learning (RL) for Large Language Models (LLMs) works well in domains with verifiable rewards (like math and code) but struggles in open-ended, subjective domains like medical consultation. To address this, the authors propose ORBIT (Open-ended Rubric-based Incremental Training), a framework that integrates synthetic dialogue generation with the dynamic creation of evaluation rubrics. These rubrics guide an incremental RL process without needing external medical knowledge. The key result highlighted is the significant performance boost of a Qwen3-4B-Instruct model on the HealthBench-Hard benchmark, from a score of 7.0 to 27.2, using only 2,000 training samples. This demonstrates that rubric-based feedback is a scalable strategy for complex tasks.
- Original Source Link:
  - PDF: https://arxiv.org/pdf/2510.15859v1.pdf
  - Source: https://arxiv.org/abs/2510.15859v1
  - Status: This appears to be a preprint on arXiv. The listed date is in the future, indicating it may be a placeholder or example.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern LLMs are powerful, but "aligning" them to perform well on complex, open-ended tasks is a major challenge. In fields like medicine, a "good" response is not just factually correct; it involves empathy, thoroughness, safety, and nuanced communication. Standard Reinforcement Learning (RL) methods struggle here because it's difficult to create a reward function that captures these subjective qualities. Existing methods often rely on simple binary rewards (correct/incorrect) or holistic human preference scores, which provide coarse and often uninformative feedback.
- Importance & Gaps: As LLMs are increasingly considered for high-stakes applications like medical advice, ensuring they are safe, reliable, and helpful is critical. The gap lies in the lack of a scalable, automated way to provide fine-grained, meaningful feedback to guide the model's learning process in these ambiguous domains. Manual rule-writing or human feedback is not scalable.
- Innovation: The paper introduces ORBIT, a novel framework that automates the generation of detailed, task-specific evaluation criteria, or rubrics. Instead of a simple reward, the model is trained to satisfy a checklist of desired behaviors defined by these dynamic rubrics. This process is incremental and self-contained, using the model's own capabilities to generate and filter data, creating a feedback loop for continuous improvement.
- Main Contributions / Findings (What):
  - A Fully Automated Rubric Generation Paradigm: ORBIT introduces a system that can automatically generate fine-grained evaluation rubrics for any given medical dialogue query. This is achieved using a Retrieval-Augmented Generation (RAG) approach, without requiring manual effort or fine-tuning a separate model for rubric creation.
  - A Rubric-Based Incremental RL Framework: The paper proposes a complete training pipeline that uses these generated rubrics to create a reward signal for Reinforcement Learning. This allows the model to be optimized on multiple, specific criteria simultaneously, going beyond simple right/wrong feedback.
  - Advanced Data Filtering Strategies: ORBIT incorporates a two-stage pass@k filtering mechanism at both the sample level (to select moderately difficult problems) and the rubric level (to select challenging criteria). This significantly improves training efficiency and focuses the model on the most informative examples.
  - State-of-the-Art Performance for Small Models: The most significant finding is that by applying ORBIT, a relatively small 4-billion-parameter model (Qwen3-4B) dramatically improved its performance on the difficult HealthBench-Hard benchmark from a score of 7.0 to 27.2. This result not only establishes a new state-of-the-art for models of its size but also surpasses the performance of much larger proprietary models like GPT-4.1 (score 13.2).
This image is a figure from the InfiMed-ORBIT paper illustrating how the rubric-based incremental training framework ORBIT improves LLM performance on medical dialogue tasks, in particular raising Qwen3-4B-Instruct from 7.0 to 27.2 on the HealthBench-Hard benchmark.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT-4, Qwen) trained on vast amounts of text data to understand and generate human-like language.
- Supervised Fine-Tuning (SFT): A training phase where a pre-trained LLM is further trained on a smaller, high-quality dataset of "instruction-response" pairs. The paper notes that SFT is effective for knowledge memorization and adapting to specific task formats.
- Reinforcement Learning (RL): A machine learning paradigm where an "agent" (the LLM) learns to make decisions by performing actions in an environment to maximize a cumulative "reward."
- Reinforcement Learning from Human Feedback (RLHF): A common technique to align LLMs with human preferences. Humans rank different model responses, and this preference data is used to train a "reward model," which then guides the LLM during RL. The paper argues this provides only coarse, holistic feedback.
- Open-Ended Tasks: Tasks where there is no single correct answer, and quality is judged on subjective, multi-faceted criteria. Examples include creative writing, scientific reasoning, and medical consultation.
- Rubrics: A set of explicit, detailed criteria used to assess performance on a task. In this context, they break down a "good" medical response into specific, checkable items (e.g., "Acknowledges patient's pain," "Asks about symptom duration"); a structural sketch of such items appears at the end of this section.
- Previous Works:
- The paper situates its work within the evolution of LLM training, from the standard paradigm to more sophisticated RL techniques. Early RL methods used simple, rule-based rewards (e.g., checking if an answer is in a specific format).
- It acknowledges the rise of open-ended benchmarks like HealthBench, WildBench, and PaperBench, which use rubrics for more nuanced evaluation than traditional metrics. These benchmarks revealed a huge gap between models' performance on simple Q&A tasks and their ability to handle complex, real-world conversations.
- The paper also references prior work on rubric-based RL, but notes that key challenges remain, such as how to create scalable rubric data and design effective training pipelines.
- Differentiation:
- Unlike traditional RLHF, which relies on holistic human preferences, ORBIT uses automated, fine-grained, multi-dimensional rubrics as the source of the reward signal.
- Unlike methods requiring manual rule-writing, ORBIT's rubric generation is fully automated, using an LLM in a RAG-like setup.
- Compared to other rubric-based approaches, ORBIT introduces a systematic framework that includes dynamic rubric generation, data filtering for efficiency (pass@k), and a specific RL algorithm (GRPO), making the entire process reproducible and self-contained.
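To make the rubric format concrete, the snippet below shows how such checkable items might be represented as structured records. The fields (criterion, points) loosely follow HealthBench's style, but the schema and the example contents are illustrative assumptions, not taken from the paper.

```python
# Illustrative rubric items for one medical dialogue (hypothetical content and schema).
example_rubrics = [
    {"criterion": "Acknowledges the patient's pain and responds with empathy", "points": 3},
    {"criterion": "Asks about symptom duration, severity, and associated symptoms", "points": 5},
    {"criterion": "Recommends an emergency referral if red-flag symptoms are present", "points": 8},
    {"criterion": "Avoids giving a definitive diagnosis without sufficient context", "points": 4},
]
```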
4. Methodology (Core Technology & Implementation)
The ORBIT framework is a multi-stage pipeline designed to align an LLM for complex medical dialogues. It consists of three main parts, as illustrated in Figure 1.
This image is Figure 1 of the paper, a three-part schematic showing (a) medical dialogue simulation based on the Qwen3-4B model, (b) the policy-gradient update workflow, and (c) the dynamic rubric generation and filtering module, detailing the rubric-driven incremental training framework.
Step 1: Dialogue QA Simulation (Figure 1a)
The process begins with creating or sourcing multi-turn medical dialogue data.
- Input: The framework can start from either a chat-style dialogue or a structured outpatient chart.
- Process: An LLM is used to simulate a realistic, multi-turn conversation between a patient and a doctor. For the experiments in the paper, the authors used existing dialogue data from the test set of DoctorAgent-RL to simplify this step.
- Output: The result is a (query, response) pair, where the query is the multi-turn dialogue history and the response is the doctor's final answer or advice. This forms the basis for the subsequent training steps.
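As a minimal sketch of this output format (the turn structure and field names are assumptions, not the paper's exact schema):

```python
# Minimal sketch: turn a simulated multi-turn consultation into a (query, response)
# training pair as described above. Roles and field names are illustrative.
def to_training_pair(turns):
    """turns: list of {"role": "patient" | "doctor", "content": str}, ending with the doctor's final advice."""
    query = turns[:-1]                    # full dialogue history up to the final doctor turn
    response = turns[-1]["content"]       # the doctor's final answer / advice
    return {"query": query, "response": response}

example = to_training_pair([
    {"role": "patient", "content": "I've had a dull headache for three days."},
    {"role": "doctor", "content": "Is the pain on one side? Any nausea or vision changes?"},
    {"role": "patient", "content": "Both sides, and no nausea."},
    {"role": "doctor", "content": "This sounds like a tension-type headache; rest, hydrate, and seek care if it worsens."},
])
```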
Step 2: Dynamic Rubrics Generator (Figure 1c)
This is the core innovation for creating the reward signal. It automatically generates a set of relevant evaluation rubrics for a given dialogue query without human intervention.
- 3.2.1. Diagnostic Database Construction:
  - A "seed dataset" of high-quality dialogue-rubric pairs is required. The authors use the rubrics from the HealthBench benchmark for this.
  - An embedding model converts all dialogues and individual rubrics from the seed dataset into numerical vectors (embeddings).
  - Two vector pools are created:
    - Case-Rubric Pair Pool: stores entries of (dialogue, all of its associated rubrics, dialogue embedding, summed rubric embedding).
    - Rubric Pool: stores all unique individual rubrics and their embeddings.
- 3.2.2. Diagnostic Candidates Searching (RAG):
  - When a new dialogue query arrives, it is embedded into a vector with the same embedding model.
  - This vector is used to perform a similarity search against the two pools, retrieving the most semantically similar items: the top-k related cases from the Case-Rubric Pair Pool and the top-k related individual rubrics from the Rubric Pool.
  - A reranker model then refines this initial candidate list to improve relevance.
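A sketch of this two-stage retrieval, assuming cosine similarity over the pools from the previous sketch and a generic cross-encoder reranker; the model names and top-k values are illustrative, not the paper's settings.

```python
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")                 # same embedder as in the pool sketch
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")    # illustrative reranker

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_candidates(query, case_pool, rubric_pool, k_cases=3, k_rubrics=20, k_final=10):
    q_emb = embedder.encode(query)
    # Stage 1: embedding-similarity retrieval from both pools.
    top_cases = sorted(case_pool, key=lambda c: cosine(q_emb, c["dialogue_emb"]), reverse=True)[:k_cases]
    top_rubrics = sorted(rubric_pool, key=lambda r: cosine(q_emb, r["emb"]), reverse=True)[:k_rubrics]
    # Stage 2: rerank the individual rubric candidates against the query.
    scores = reranker.predict([(query, r["criterion"]) for r in top_rubrics])
    order = np.argsort(-np.asarray(scores))
    reranked = [top_rubrics[i] for i in order]
    return top_cases, reranked[:k_final]
```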
- 3.2.3. Rubrics Generation Process:
  - The retrieved similar cases and rubrics are compiled into a prompt for a powerful generative LLM.
  - This prompt provides the LLM with in-context examples of relevant criteria, guiding it to generate a new, tailored set of rubric candidates specifically for the input query. The prompt structure is detailed in Image 4.
This image is Figure 4, showing the full system prompt used for rubric generation, including explicit instructions for clinical analysis, scoring rules, and output format.
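The sketch below shows how the retrieved cases and rubrics could be assembled into a generation prompt; the wording is illustrative and is not the paper's actual system prompt shown in Image 4.

```python
def build_rubric_prompt(query, similar_cases, similar_rubrics):
    """Assemble an in-context prompt for the rubric-generator LLM (illustrative wording only)."""
    examples = "\n\n".join(
        f"Example dialogue:\n{c['dialogue']}\nExample rubrics:\n"
        + "\n".join(f"- ({r['points']} pts) {r['criterion']}" for r in c["rubrics"])
        for c in similar_cases
    )
    related = "\n".join(f"- {r['criterion']}" for r in similar_rubrics)
    return (
        "You are a clinical evaluation expert. Analyze the consultation below and write a "
        "tailored set of grading rubrics as a JSON list of {criterion, points}.\n\n"
        f"Reference cases and their rubrics:\n{examples}\n\n"
        f"Related individual criteria:\n{related}\n\n"
        f"Consultation to grade:\n{query}\n\n"
        "Output only the JSON list."
    )
```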
- 3.2.4. Difficulty Filter with Pass@k: To improve training efficiency, the generated data is filtered to keep only the most useful samples and rubrics.
  - Sample-Level Filtering: This removes dialogues that are either too easy or too hard for the current model.
    - The model generates k responses for a query, and a Judge Model scores each response against the generated rubrics.
    - An average score is calculated for the query.
    - Only queries with scores within a certain range are kept. This targets the "zone of proximal development" where the model can learn most effectively.
  - Rubric-Level Filtering: This removes rubrics that are too easy for the model to satisfy.
    - For each rubric, its pass rate $P(r, q)$ is calculated as the fraction of the model's responses that satisfy it: $P(r, q) = \frac{1}{k}\sum_{i=1}^{k}\mathbb{1}\big[s(y_i, r) \ge \tau\big]$, where $y_1, \dots, y_k$ are the sampled responses, $s(y_i, r)$ is the Judge Model's score of response $y_i$ on rubric $r$, $\mathbb{1}[\cdot]$ is the indicator function (1 if true, 0 if false), and $\tau$ is a score threshold for satisfaction.
    - Rubrics with a pass rate higher than a threshold are discarded, ensuring that training focuses on criteria the model still struggles with.
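A minimal sketch of the two-stage pass@k filter under the definitions above; the `judge` callable stands in for the Judge Model, and the thresholds are illustrative rather than the paper's exact settings.

```python
def filter_samples_and_rubrics(query, rubrics, responses, judge,
                               sample_lo=0.0, sample_hi=0.75, rubric_pass_max=0.9):
    """Two-stage pass@k filtering. `judge(response, rubric) -> bool` stands in for the
    Judge Model; thresholds are illustrative."""
    total_points = sum(r["points"] for r in rubrics)
    # Sample level: average normalized rubric score over the k sampled responses.
    per_response = []
    for resp in responses:
        earned = sum(r["points"] for r in rubrics if judge(resp, r))
        per_response.append(earned / max(total_points, 1))
    avg_score = sum(per_response) / len(per_response)
    if not (sample_lo <= avg_score <= sample_hi):
        return None  # too easy or too hard for the current policy: drop the sample
    # Rubric level: pass rate P(r, q) = fraction of responses satisfying rubric r.
    kept = [r for r in rubrics
            if sum(judge(resp, r) for resp in responses) / len(responses) <= rubric_pass_max]
    return {"query": query, "rubrics": kept}
```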
Step 3: Rubrics-Based Reinforcement Learning (Figure 1b)
The filtered dialogue-rubric pairs are used to fine-tune the model using RL.
- RL Algorithm: The paper uses Group Relative Policy Optimization (GRPO), a memory-efficient variant of the popular Proximal Policy Optimization (PPO) algorithm. GRPO avoids needing a separate value network by normalizing rewards against the average reward of a group of responses.
- Custom Reward Function: This is where the rubrics come into play. The reward for a model-generated response is not a simple binary value but a cumulative score over the generated rubrics.
  - A Judge Model evaluates the response against each criterion in the filtered rubric set.
  - The total reward is the sum of points for all satisfied criteria: $R(y) = \sum_{r \in \mathcal{R}_q} \mathrm{match}(y, r)\cdot \mathrm{point}(r)$, where $\mathcal{R}_q$ is the filtered rubric set for the query, match(y, r) is a function (executed by the Judge Model) that returns 1 if criterion r is met and 0 otherwise, and point(r) is the score associated with that rubric criterion.
- Policy Update: This calculated reward is then used to compute the advantage within the GRPO framework, which in turn is used to update the policy (the LLM's weights) to produce better responses in the future. The KL-divergence term prevents the model from straying too far from a reference model, ensuring training stability.
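A minimal sketch of the rubric-based reward and the GRPO-style group-relative advantage described above (the clipped surrogate objective and KL penalty are omitted); `judge` again stands in for the Judge Model, and the helper names are assumptions.

```python
import numpy as np

def rubric_reward(response, rubrics, judge):
    """Cumulative rubric reward: sum of points over criteria the Judge Model marks as met
    (judge plays the role of match() above)."""
    return float(sum(r["points"] for r in rubrics if judge(response, r)))

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward against the
    mean and standard deviation of its group (responses to the same query)."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: advantages for a group of four sampled responses to one query.
advantages = group_relative_advantages([12.0, 7.0, 15.0, 9.0])
```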
5. Experimental Setup
- Datasets: The training data consists of 2,082 multi-turn medical dialogue samples curated from three Chinese medical benchmarks: IMCS21, CHIP-MDCFNPC, and MedDG.
- Evaluation Metrics:
  - The primary benchmark is HealthBench, specifically the challenging HealthBench Hard subset of 1,000 cases.
  - HealthBench uses a multi-dimensional, rubric-based evaluation system. A powerful LLM (officially GPT-4.1) acts as the evaluator, grading responses across various themes and axes.
  - Conceptual Definition:
    - Themes: Evaluate performance on specific medical aspects like Emergency referrals, Context seeking (asking follow-up questions), Global health advice, and Communication.
    - Axes: Evaluate general qualities of the response like Accuracy, Completeness, Communicative quality, Context awareness, and Instruction following.
    - Total Score: An aggregate score calculated from the weighted combination of all themes and axes, reflecting the overall quality of the medical consultation. The paper does not provide a specific formula, as it is determined by the benchmark's evaluation script (a plausible per-example aggregation is sketched after the baselines list below).
- Baselines: A wide range of models were compared, including:
  - Proprietary Models: GPT-4.1, GPT-5 (thinking).
  - Open-Source Models (<10B params): Qwen3-4B-Instruct (the base model for ORBIT), Qwen-2.5-7B-Instruct.
  - Open-Source Models (>10B params): Qwen3-30B-Instruct, Baichuan-M2-32B, GPT-oss-120B.
  - The paper's own trained models are Qwen3-4B-ORBIT (RL on the base model) and SFT-4B-ORBIT (SFT first, then RL).
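Since the paper does not reproduce the benchmark's aggregation formula, the sketch below shows one plausible convention for a per-example rubric score (earned points over the maximum achievable positive points, clipped to [0, 1]); it is an assumption for illustration, not the official HealthBench evaluation script.

```python
def example_score(graded_rubrics):
    """graded_rubrics: list of {"points": int, "met": bool}; points may be negative for penalties.
    Per-example score as earned points over maximum positive points, clipped to [0, 1].
    This is a plausible convention, not necessarily the official HealthBench aggregation."""
    earned = sum(r["points"] for r in graded_rubrics if r["met"])
    max_positive = sum(r["points"] for r in graded_rubrics if r["points"] > 0)
    if max_positive == 0:
        return 0.0
    return min(max(earned / max_positive, 0.0), 1.0)
```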
6. Results & Analysis
Core Results
The main results, presented in Table 1, demonstrate the remarkable effectiveness of the ORBIT framework.
- Manual Transcription of Table 1: Overall model performance on HealthBench Hard (the first seven numeric columns are the "By Theme" scores, the next five are the "By Axis" scores).

| Models | Emergency referrals | Context seeking | Global health | Health data tasks | Communication | Hedging | Response depth | Accuracy | Completeness | Communicat. quality | Context awareness | Instruction following | Total Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | | | | | |
| GPT-4.1 | 20.5 | 12.3 | 12.1 | 9.7 | 14.9 | 12.3 | 17.5 | 30.5 | 0 | 70.6 | 0 | 60.5 | 13.2 |
| GPT-5 (thinking) | - | - | - | - | - | - | - | - | - | - | - | - | 46.2 |
| Open-source Models (< 10B) | | | | | | | | | | | | | |
| Qwen-3-4B-Instruct | 9.3 | 8.5 | 7.1 | 0 | 8.6 | 12.2 | 5.1 | 24.1 | 0.8 | 57.5 | 0 | 45.0 | 7.0 |
| Qwen3-4B-Thinking | 14.4 | 8.5 | 7.1 | 0 | 8.6 | 12.2 | 0 | 23.2 | 0 | 42.5 | 0 | 39.6 | 5.2 |
| Qwen-2.5-7B-Instruct | 0 | 12.5 | 2.4 | 0 | 3.5 | 8.5 | 0 | 6.4 | 0 | 45.2 | 0 | 33.7 | 0 |
| Qwen3-4B-SFT | 19.7 | 12.8 | 13.3 | 0 | 9.5 | 16.4 | 4.3 | 25.5 | 9.6 | 55.5 | 0 | 43.6 | 11.4 |
| Qwen3-4B-ORBIT | 31.2 | 27.3 | 22.4 | 3.4 | 19.4 | 31.5 | 8.2 | 31.9 | 22.7 | 52.6 | 11.8 | 51.5 | 21.6 |
| SFT-4B-ORBIT | 36.1 | 34.8 | 30.7 | 5.0 | 23.9 | 36.1 | 8.4 | 32.5 | 35.1 | 44.5 | 19.0 | 45.4 | 27.2 |
| Open-source Models (> 10B) | | | | | | | | | | | | | |
| Qwen3-30B-Instruct | 18.3 | 12.9 | 14.7 | 17.9 | 9.5 | 28.5 | - | 28.5 | 0 | 45.2 | 0 | 33.7 | 13.1 |
| Qwen3-30B-A3B-Thinking | 21.4 | 20.4 | 17.9 | 19.4 | 20.4 | 6.5 | - | 33.7 | 11.6 | 53.0 | 0 | 45.5 | 16.1 |
| GPT-oss-120B (high) | - | 15.0 | 8.9 | - | 16.7 | - | - | - | - | - | - | - | 30.0 |
| Baichuan-M2-32B | 45.6 | 39.5 | 35.6 | 21.3 | 32.0 | 40.9 | 19.9 | 41.3 | 44.6 | 51.6 | 19.3 | 48.0 | 34.5 |

- Key Findings:
  - The base Qwen3-4B-Instruct model scored a low 7.0.
  - Simply applying Supervised Fine-Tuning (Qwen3-4B-SFT) improved the score to 11.4.
  - Applying the ORBIT RL framework (Qwen3-4B-ORBIT) boosted the score to 21.6.
  - The best performing model, SFT-4B-ORBIT (SFT followed by ORBIT's RL), achieved a remarkable 27.2, a nearly 4x improvement over the base model.
  - This score of 27.2 for a 4B-parameter model is significantly higher than that of the much larger proprietary GPT-4.1 (13.2) and the 30B open-source model Qwen3-30B-A3B-Thinking (16.1).
- Multi-dimensional Analysis (Figure 2):
  - Figure 2 is a set of four radar charts comparing the models' multi-dimensional performance on medical dialogue tasks by theme and by axis, with each model shown in a different color.
  - The radar charts visualize these improvements. The ORBIT-trained models (light green and dark green areas) show a much larger and more balanced performance profile compared to the base models (grey and brown). They show substantial gains across almost all dimensions, especially in crucial medical themes like Global Health, Context Seeking, and Emergency Referrals, as well as axes like Completeness and Context awareness.
Ablation Experiments
The paper conducts extensive ablation studies to validate each component of the ORBIT system.
- 4.3.1. Rubrics Generation Model:
  - The quality of the generated rubrics depends heavily on the power of the LLM used to create them. The authors tested DeepSeek-R1, Gemini-2.5-Pro, GPT-OSS-120B, and GPT-5-Chat.
  - Finding: As shown in Table 2, DeepSeek-R1 and Gemini-2.5-Pro produced the best rubrics, leading to the highest final model scores (20.2 and 20.3, respectively, evaluated by GPT-OSS-120B). GPT-5-Chat performed poorly, likely due to its strong safety filters preventing it from generating stringent medical criteria. DeepSeek-R1 was chosen as the default for its strong, balanced performance.
- 4.3.3. SFT + RL vs. Zero-RL:
  - This experiment confirms the value of an initial SFT phase before RL. SFT provides the model with a good "cold start," teaching it the basic structure and style of desired responses.
  - Finding: Table 3 shows that models initialized with SFT (SFT-4B-ORBIT) consistently outperform the model trained with RL from scratch (Qwen3-4B-ORBIT). However, the learning rate for SFT is critical; too high a rate can cause overfitting and limit the effectiveness of the subsequent RL phase.
- 4.3.4. Pass@k Filtering Settings:
  - This study investigates the impact of the data filtering strategy on training efficiency.
  - Finding (Figure 3 & Table 4):
    - Filtering at both the sample level (removing too-easy/too-hard dialogues) and the rubric level (removing too-easy criteria) significantly speeds up training (lower runtime) without sacrificing, and in some cases even improving, final performance.
    - For example, filtering samples to a moderate difficulty range (0 ~ 0.75) achieved a score of 19.7, close to the 20.2 score of the unfiltered model, but with substantially less training time.
    - This demonstrates that strategic data selection is a powerful tool for making RL more efficient.
This image is Figure 4 of the paper, showing the effect of the Rubrics Filter and Samples Filter on response length and runtime during training, presented as both line charts and bar charts.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents ORBIT, a robust and automated framework for aligning LLMs on complex, open-ended tasks like medical consultation. By dynamically generating fine-grained rubrics and using them to create a nuanced reward signal for Reinforcement Learning, ORBIT overcomes the limitations of traditional reward modeling. The framework's effectiveness is powerfully demonstrated by its ability to elevate a small 4B-parameter model to performance superior to much larger models on a challenging medical benchmark. The inclusion of systematic data filtering further enhances the efficiency of the training process, making it a practical and scalable solution.
- Limitations & Future Work (as stated by authors):
  - Seed Data Dependency: The automated rubric generation process still relies on an initial set of human-crafted seed rubrics (HealthBench). The quality of the entire pipeline depends on the quality of this initial seed data.
  - Domain-Specificity: The current work is focused on the medical domain. A key future direction is to generalize the ORBIT framework to other open-ended domains and evaluate its effectiveness on a wider range of tasks.
  - Improving Seed Rubrics: The authors suggest future work could involve building more meaningful seed rubrics directly from established medical guidelines and best practices, rather than relying on existing benchmarks.
- Personal Insights & Critique:
- Scalability and Generalization: ORBIT's core idea is highly compelling and potentially transferable. Any domain where quality is defined by a set of discernible, albeit complex, criteria (e.g., legal document review, customer support, educational tutoring) could benefit from this rubric-based RL approach.
  - Dependency on Judge/Generator Models: The framework's performance is inherently tied to the capabilities of the LLMs used as the Rubrics Generator and the Judge Model. As these "teacher" models improve, the effectiveness of the ORBIT pipeline will likely increase, but this also creates a dependency and a potential for cascading biases from the teacher model to the student model.
  - Potential for Reward Gaming: While fine-grained rubrics are a major step up from simple rewards, there is still a risk of "reward gaming," where the model learns to satisfy the letter of the rubric without fulfilling its spirit. For instance, a model might learn to mechanically ask a list of questions to score on "Context Seeking" without genuinely using the patient's answers to inform its diagnosis. The complexity of the rubrics helps mitigate this, but it remains a fundamental challenge in RL.
  - Cost and Efficiency: While the pass@k filtering improves efficiency, the overall process (multiple rollouts, a powerful judge model, and a generator model) is computationally intensive. The paper demonstrates a path to efficiency, but the absolute cost remains high, which could be a barrier to wider adoption.
  - Overall, InfiMed-ORBIT presents a significant and well-executed contribution to the field of LLM alignment. It offers a practical and powerful alternative to traditional RLHF, paving the way for developing more capable and reliable specialized models for high-stakes, open-ended domains.