II-Medical
Thu 15 May 2025

Medical AI continues to advance at a rapid pace - and our II-Medical-8B is a testament to just how far we’ve come. Despite its compact size, our new model outperforms systems over 10 times larger on key clinical reasoning benchmarks. Designed for precision, transparency, and real-world applicability, II-Medical-8B builds on our commitment to creating trustworthy AI for healthcare and education. With cutting-edge supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines, it brings powerful, step-by-step reasoning to complex medical tasks - setting a new standard for open-source medical intelligence.
Medical AI presents unique challenges: it demands structured, step-by-step inference, accuracy grounded in real-world clinical knowledge, and outputs that can be audited by experts. II-Medical was developed to meet these demands, with a focus on decision support, medical education, and safe research applications.
Efficiency at Zero Cost

As shown in the HealthBench performance-cost frontier, II-Medical-8B delivers strong benchmark performance at zero inference cost. It outperforms larger models like GPT-4.5 and o4-mini in both efficiency and accessibility. II-Medical-8B is also compact enough to run locally on consumer hardware, putting doctor-level medical reasoning directly in your pocket. This enables clinicians, researchers, and individuals to use high-quality models without expensive cloud infrastructure, unlocking a fast, private, and affordable path forward for medical AI.
Disclaimer: II-Medical is not intended for clinical use at this moment in time. It should only be used for research and development purposes.
Why Specialized Medical Models Matter
Enhancing the medical reasoning capabilities of large language models (LLMs) is a significant and ongoing area of research driven by the inherent complexity and domain-specific challenges in medical problem-solving. Several key methodologies have been developed to address these challenges, notably test-time scaling, supervised fine-tuning, reinforcement learning, and knowledge graph integration.
II-Medical integrates methods that have proven essential in recent research:
- Test-Time Scaling (Inference-Time Scaling): Boosting performance with larger compute budgets during inference [2].
- Supervised Fine-Tuning (SFT): Training on curated reasoning paths and detailed explanations [2][3][4].
- Reinforcement Learning (RL): Fine-tuning with verifier feedback for stepwise reasoning quality [3].
- Knowledge Graph Integration: Used in MedReason for structured clinical logic [4].
- Self-Evolution Frameworks: Like MedS3, combining tree search and reward shaping [5].
Various models and datasets underline these methodologies:
- HuatuoGPT-o1: leverages verifiable medical problems alongside a combined SFT and RL training strategy, outperforming both general-purpose and earlier medical-specific models.
- MedReason-8B: sets a new standard among 7–8B parameter models, achieving state-of-the-art results on complex clinical benchmarks through knowledge-graph-enhanced chain-of-thought (CoT) explanations.
- M1: demonstrates the power of inference-time scaling, delivering strong performance even with limited data and smaller model size, rivaling much larger specialized systems.
These innovative approaches and models leverage specialized datasets and benchmarks, including MedQA [6], PubMedQA [7], and MedReason datasets [4], to rigorously evaluate medical reasoning capabilities, underscoring the essential role of detailed, transparent reasoning processes in advancing medical AI systems.
II-Medical Dataset Design
The II-Medical Reasoning Dataset includes 581,204 samples divided into four main categories:
- Public Reasoning Datasets: 103,031 samples from the following sources:
  - General Medical Reasoning: 40,544 samples
  - Medical-R1-Distill-Data (English): 22,000 samples
  - Medical-R1-Distill-Data (Chinese): 17,000 samples
  - UCSC-VLAA m23k-tokenized: 23,487 samples
- Synthetic Medical QA Data Enhanced with QwQ: 225,700 samples from MedMCQA [9], MedQA [6], and MedReason [4].
- Curated Reasoning Traces: 338,055 samples from public reasoning trace datasets filtered with Qwen2.5-32B-Instruct [12].
This extensive subset aggregates publicly available R1 reasoning traces from diverse sources:
- PrimeIntellect/SYNTHETIC-1
- GeneralReasoning/GeneralThought-430K
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M
- open-thoughts/OpenThoughts2-1M
- Nvidia/Llama-Nemotron-Post-Training-Dataset (Science subset)
- Cognitivecomputations/dolphin-r1, ServiceNow-AI/R1-Distill-SFT, and other sources.
A specialized pipeline ensures medical domain relevance (a code sketch follows the list below):
- Embedding Generation: Utilizing the sentence-transformers/all-MiniLM-L6-v2 model.
- Clustering: K-means clustering into 50,000 clusters.
- Domain Classification: Employing Qwen2.5-32B-Instruct [12] to classify clusters based on medical or biological content, retaining relevant clusters only.
- Math Supplement: 15,000 examples sourced from Light-R1 [11], included to reinforce the general reasoning abilities of the models.
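For illustration, the domain-relevance pipeline can be sketched as follows. The embedding model and cluster count match the steps above, while `classify_cluster_with_llm` and the `filter_medical` wrapper are hypothetical placeholders rather than our production code.

```python
# Minimal sketch of the domain-relevance pipeline described above:
# embed samples -> K-means clustering -> LLM-classify clusters -> keep medical clusters.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

def classify_cluster_with_llm(representative_texts):
    """Hypothetical placeholder: in practice this prompts Qwen2.5-32B-Instruct with a few
    representative samples and returns True for medical/biological clusters."""
    raise NotImplementedError

def filter_medical(texts, n_clusters=50_000):
    # 1) Embedding generation with sentence-transformers/all-MiniLM-L6-v2
    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = np.asarray(encoder.encode(texts, batch_size=256, show_progress_bar=True))

    # 2) K-means clustering into 50,000 clusters (MiniBatch keeps this tractable)
    labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

    # 3) Classify each cluster from a handful of representative samples,
    #    retaining only clusters judged medical or biological.
    kept_clusters = set()
    for c in range(n_clusters):
        members = np.where(labels == c)[0][:5]
        if len(members) and classify_cluster_with_llm([texts[i] for i in members]):
            kept_clusters.add(c)

    # 4) Keep only samples belonging to retained clusters
    return [t for t, c in zip(texts, labels) if c in kept_clusters]
```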
Data Preprocessing:
Rigorous preprocessing techniques were employed to optimize data quality (a code sketch follows the list below):
- Complete Generation Filtering: Excluded incomplete or truncated reasoning traces.
- Length Filtering: Kept only samples with prompts longer than three words.
- Wait Token Filtering: Removed samples containing more than 47 occurrences of "Wait" (above the 97th percentile threshold).
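These filters are straightforward to express in code. A minimal sketch, assuming each sample is a dict with `prompt` and `response` fields (the field names and the completeness check are illustrative, not the released schema):

```python
# Minimal sketch of the preprocessing filters described above.
def is_complete(sample):
    # Completeness proxy (an assumption): a closed reasoning block and a non-empty answer.
    return "</think>" in sample["response"] and sample["response"].rstrip() != ""

def keep_sample(sample, max_wait=47):
    if not is_complete(sample):                      # 1) drop truncated reasoning traces
        return False
    if len(sample["prompt"].split()) <= 3:           # 2) keep prompts longer than three words
        return False
    if sample["response"].count("Wait") > max_wait:  # 3) drop traces above the "Wait" threshold
        return False
    return True

# Usage: cleaned = [s for s in dataset if keep_sample(s)]
```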
Data Decontamination:
A two-step decontamination approach was implemented:
- 10-gram Decontamination: Followed the open-r1 methodology to eliminate overlap with evaluation datasets.
- Fuzzy Decontamination: Utilized the s1 [13] method at a stringent 90% threshold to further ensure dataset purity.
These meticulous steps guarantee minimal overlap with evaluation datasets, preserving dataset integrity and reliability.
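For reference, 10-gram decontamination reduces to checking whether any 10-token window of a training prompt also appears in an evaluation question. A minimal word-level sketch (the open-r1 and s1 [13] implementations differ in tokenization and matching details):

```python
# Minimal sketch of 10-gram decontamination: drop any training sample whose
# prompt shares a 10-gram with an evaluation question.
def ngrams(text, n=10):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_prompts, eval_questions, n=10):
    # Build the set of all 10-grams appearing in the evaluation sets.
    eval_grams = set()
    for q in eval_questions:
        eval_grams |= ngrams(q, n)
    # Keep only training prompts with no overlapping 10-gram.
    return [p for p in train_prompts if not (ngrams(p, n) & eval_grams)]
```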
RL Dataset
Training II-Medical with RL required a comprehensive, high-quality dataset, given the intricate nature of medical reasoning. We ran numerous filtering methodologies and experiments to refine candidate data, and ultimately found that the MedReason dataset [4] delivered the best performance thanks to its robust structure and its alignment with the specific reasoning challenges encountered in medical applications.
The original MedReason dataset contains benchmark samples, so we applied the same decontamination process before training to prevent data leakage and ensure an unbiased evaluation of II-Medical.
Model Training Overview
Supervised Fine-Tuning (SFT)
We fine-tuned the Qwen3-8B model on the SFT dataset using this configuration:
- Max Length: 16378
- Batch Size: 32
- Learning Rate: 5e-5
- Number of Epochs: 8
- Total Tokens / Batch: 16378 × 4
To optimize training efficiency, we employed a dynamic batching strategy. In each batch, we accumulate samples until the predefined token limit is reached, enabling better GPU utilization and faster training.
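A minimal sketch of this token-budget batching, assuming pre-tokenized samples and the 16378 × 4 per-batch token limit listed above:

```python
# Minimal sketch of dynamic batching: accumulate samples until the
# predefined token limit is reached, then start a new batch.
def dynamic_batches(samples, max_tokens_per_batch=16378 * 4):
    batch, batch_tokens = [], 0
    for sample in samples:                      # sample = list of token ids (assumed)
        n = len(sample)
        if batch and batch_tokens + n > max_tokens_per_batch:
            yield batch                         # emit the filled batch
            batch, batch_tokens = [], 0
        batch.append(sample)
        batch_tokens += n
    if batch:
        yield batch                             # emit the final partial batch
```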
Reinforcement Learning (RL)
To elevate reasoning capabilities even further, we applied the DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) [8] algorithm, a state-of-the-art RL approach designed to address the specific challenges of long-chain reasoning. DAPO introduces several key innovations:
- Clip-Higher: Separately controls the lower and upper clipping bounds to encourage exploration and prevent entropy collapse.
- Dynamic Sampling: Filters out trivial or overly easy prompts to ensure that training focuses on meaningful learning signals.
- Token-Level Policy Gradient Loss: Improves gradient updates by assigning more precise credit to individual tokens, especially important in long responses.
- Overlong Reward Shaping: Reduces noise from excessively long generations by applying length-aware penalties.
The reward signal combines automatic scoring on multiple-choice tasks with evaluation by an LLM-as-a-judge (GPT-4o).
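To make Clip-Higher and the token-level loss concrete, here is a minimal PyTorch sketch of the clipped objective with decoupled bounds, using the 0.2/0.28 clip ratios listed in the RL parameters below. It illustrates the idea rather than reproducing the full DAPO implementation:

```python
import torch

def dapo_token_loss(logprobs, old_logprobs, advantages, mask,
                    eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient loss with decoupled (Clip-Higher) bounds, averaged over
    tokens (token-level loss) rather than over sequences. All tensors are shaped
    [batch, seq_len]; `mask` is 1.0 for response tokens and 0.0 elsewhere."""
    ratio = torch.exp(logprobs - old_logprobs)            # per-token importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token = -torch.minimum(unclipped, clipped)        # pessimistic (PPO-style) objective
    return (per_token * mask).sum() / mask.sum()          # token-mean over all valid tokens
```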
RL parameters:
- Prompt Length: 2048 tokens
- Response Length: up to 12288 tokens + 4096 buffer
- Clip Ratios: 0.2 (low), 0.28 (high)
- Batch Sizes: 512 (train), 1536 (gen), 32 (mini-batch)
- 16 generations per prompt, Temp: 1.0, Top-p: 1.0, Top-k: -1
- Learning Rate: 1e-6, Warmup: 10 steps, Weight Decay: 0.1
- Loss: Token-mean, Gradient Clipping: 1.0, Entropy Coef: 0
Rewards come from two sources (a code sketch of the MCQ scorer follows this list):
- Automatic MCQ scoring (with \boxed{} labels)
- GPT-4o judgment on open-ended tasks
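The multiple-choice component is a simple exact-match check on the final \boxed{...} answer. A minimal sketch (the GPT-4o judging of open-ended answers is omitted):

```python
import re

def mcq_reward(response, gold_choice):
    """Return 1.0 if the last \\boxed{...} span matches the gold choice, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0                                    # no boxed answer -> no reward
    predicted = matches[-1].strip().upper()
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0

# Example: mcq_reward("... therefore the answer is \\boxed{C}.", "C") -> 1.0
```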
The chart shows the mean critic score (%) across 250 RL steps. The high initial score indicates that RL builds on a well-optimized SFT model, allowing it to further refine and stabilize performance.
RL training also exhibits gradually increasing response lengths, suggesting the model is learning to produce more elaborate outputs while sustaining performance, despite some fluctuations.
Open-ended Task:
Experimental results suggest that the model learns more effectively on open-ended answer tasks than on multiple-choice tasks, based on evaluations conducted by GPT-4o. However, experimentation remains limited due to the absence of a reliable automated evaluation system. Current efforts focus on expanding this research and developing a robust reward model trained on GPT-4o-generated data.
Benchmark Evaluation
Our II-Medical-8B model also achieved a 40% score on HealthBench, an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date. We provide a comparison to models available in ChatGPT below.

For transparency, we have published our complete HealthBench results here.
II-Medical was evaluated on ten leading medical benchmarks: MedMCQA, MedQA, PubMedQA, MMLU-Pro, GPQA, Lancet QA, MedB-4, MedB-5, MedX, and NEJM.
| Model | MedMCQA | MedQA | PubMedQA | MMLU-Pro | GPQA | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| II-Medical-8B | 71.57 | 87.82 | 78.20 | 80.46 | 67.18 | 70.38 | 78.25 | 72.07 | 25.26 | 73.13 | 70.49 |
| | 76.76 | 88.85 | 79.90 | 80.46 | 64.36 | 70.87 | 77.27 | 73.05 | 23.53 | 76.29 | 71.13 |
| | 69.73 | 87.03 | 88.50 | 79.86 | 69.17 | 71.30 | 72.07 | 69.01 | 24.98 | 75.12 | 70.68 |
| | 63.97 | 74.78 | 80.10 | 63.71 | 55.38 | 64.32 | 58.44 | 51.95 | 15.79 | 64.84 | 59.32 |
| | 61.67 | 71.87 | 77.40 | 64.10 | 50.51 | 59.70 | 60.06 | 54.22 | 22.87 | 66.80 | 59.92 |
| | 62.54 | 75.81 | 75.80 | 65.86 | 53.08 | 62.62 | 63.64 | 59.74 | 19.59 | 64.34 | 60.30 |
| | 66.53 | 81.38 | 73.90 | 77.85 | 64.87 | 66.26 | 68.83 | 62.66 | 19.59 | 69.65 | 65.15 |
II-Medical demonstrates strong performance in the 7–8B class, outperforming several larger models while offering entirely open-source accessibility.
Getting Started with II-Medical
II-Medical is a major step forward in our mission to bring trustworthy, high-performance AI to the world of healthcare. Whether you are a researcher, developer, healthcare professional, or educator, we invite you to explore II-Medical and help us push the frontiers of medical AI.
- Run it on vLLM:
vllm serve Intelligent-Internet/II-Medical-8B
- Run it on SGLang:
python -m sglang.launch_server --model Intelligent-Internet/II-Medical-8B
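Both servers expose an OpenAI-compatible API, so any OpenAI client can query the model. A minimal sketch assuming vLLM's default port 8000 (adjust the base URL for SGLang):

```python
# Query a locally served II-Medical-8B through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # no real key needed locally

response = client.chat.completions.create(
    model="Intelligent-Internet/II-Medical-8B",
    messages=[{
        "role": "user",
        "content": "A 45-year-old presents with crushing chest pain radiating to the left arm. "
                   "What is the most likely diagnosis and the first diagnostic step?",
    }],
    temperature=0.6,   # illustrative sampling settings, not an official recommendation
    max_tokens=2048,
)
print(response.choices[0].message.content)
```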
Resources
References
[1] Intelligent Internet. (2025). II-Medical-7B-Preview: Medical reasoning model.
[2] Huang, X., Wu, J., Liu, H., Tang, X., & Zhou, Y. (2025). m1: Unleash the potential of test-time scaling for medical reasoning in large language models.
[3] Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., & Wang, B. (2024). HuatuoGPT-o1, towards medical complex reasoning with LLMs.
[4] Wu, J., Deng, W., Li, X., Liu, S., Mi, T., Peng, Y., Xu, Z., Liu, Y., Cho, H., Choi, C.-I., Cao, Y., Ren, H., Li, X., Li, X., & Zhou, Y. (2025). MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs.
[5] Jiang, S., Liao, Y., Chen, Z., Zhang, Y., Wang, Y., & Wang, Y. (2025). MedS³: Towards medical small language models with self-evolved slow thinking.
[6] Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., & Szolovits, P. (2020). What disease does this patient have? A large-scale open domain question answering dataset from medical exams.
[7] Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., & Lu, X. (2019). PubMedQA: A dataset for biomedical research question answering.
[8] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Dai, W., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.-Y., Zhang, Y.-Q., Yan, L., Qiao, M., Wu, Y., & Wang, M. (2025). DAPO: An open-source LLM reinforcement learning system at scale.
[9] Pal, A., Umapathi, L. K., & Sankarasubbu, M. (2022). MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering.
[10] Zhao, H., Wang, H., Peng, Y., Zhao, S., Tian, X., Chen, S., Ji, Y., & Li, X. (2025). 1.4 million open-source distilled reasoning dataset to empower large language model training.
[11] Wen, L., Cai, Y., Xiao, F., He, X., An, Q., Duan, Z., Du, Y., Liu, J., Tang, L., Lv, X., Zou, H., Deng, Y., Jia, S., & Zhang, X. (2025). Light-R1: Curriculum SFT, DPO and RL for long CoT from scratch and beyond.
[12] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., & Qiu, Z. (2024). Qwen2.5 technical report.
[13] Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Li, F.-F., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., & Hashimoto, T. (2025). s1: Simple test-time scaling.