
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

July 11, 2025 · 3 min read

📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, plus task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, with superior performance in both general and spatial reasoning.
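
As a purely hypothetical illustration of the step-wise dynamic part of such a strategy (this is a generic sketch written for this post, not the authors' actual scheme), one can re-draw the mix of general-reasoning and spatial-reasoning batches at each step based on recent per-task progress, so that no single data source dominates training:

```python
# Generic illustration of step-wise dynamic multi-task sampling (assumed scheme,
# not the exact one used to train M2-Reasoning-7B).
import random
from collections import deque

class DynamicTaskSampler:
    def __init__(self, tasks, window=100, temperature=1.0):
        self.tasks = tasks                                             # e.g. ["general", "spatial"]
        self.recent_reward = {t: deque(maxlen=window) for t in tasks}  # sliding window of rewards
        self.temperature = temperature

    def record(self, task, reward):
        """Log the reward observed for a rollout drawn from `task`."""
        self.recent_reward[task].append(reward)

    def sample_task(self):
        """Pick the next task, favoring tasks whose recent reward is lower,
        i.e. tasks the model is currently struggling with."""
        def mean(xs):
            return sum(xs) / len(xs) if xs else 0.5
        weights = [(1.0 - mean(self.recent_reward[t])) / self.temperature + 1e-3
                   for t in self.tasks]
        return random.choices(self.tasks, weights=weights, k=1)[0]

sampler = DynamicTaskSampler(["general", "spatial"])
sampler.record("general", 0.8)
sampler.record("spatial", 0.3)
print(sampler.sample_task())  # more likely to return "spatial"
```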

📌 Updates

Key Features

  • A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates large volumes of high-quality reasoning data (294.2K samples in total).
  • A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation that provides tailored incentive signals (a rough illustration follows this list).
  • Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
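
As a rough, hypothetical illustration of the task-specific reward idea (the exact reward functions are defined in the technical report, not here; the names and tolerances below are ours), a verifiable-reward setup can score general math answers by exact match on the \boxed{} content while scoring numeric spatial estimates with a relative-error tolerance:

```python
# Illustrative sketch only: task-specific verifiable rewards in the RLVR style.
# These are NOT the exact reward functions used to train M2-Reasoning; they show
# how incentive signals can be tailored per task.
import re

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a model response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def general_reward(response: str, gold: str) -> float:
    """Exact-match reward for general (math/logic) questions."""
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

def spatial_reward(response: str, gold: float, tol: float = 0.1) -> float:
    """Tolerance-based reward for numeric spatial estimates (e.g. distances):
    full credit within `tol` relative error, decaying linearly to zero by 2*tol."""
    pred = extract_boxed(response)
    try:
        pred_val = float(pred)
    except (TypeError, ValueError):
        return 0.0
    rel_err = abs(pred_val - gold) / max(abs(gold), 1e-6)
    return 1.0 if rel_err <= tol else max(0.0, 1.0 - (rel_err - tol) / tol)

print(general_reward(r"<answer>\boxed{22.6}</answer>", "22.6"))  # 1.0
print(spatial_reward(r"<answer>\boxed{3.1}</answer>", 3.0))      # 1.0
```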

Evaluation

We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:

  • General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista. In the table below, Avg. (Δ) is the average score over the six benchmarks; for reasoning models, Δ denotes the gain over the corresponding base model (for M2-Reasoning-7B, 45.0 - 35.5 = +9.5 over our base model).

| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
|---|---|---|---|---|---|---|---|
| *Base-Scale General Models* |  |  |  |  |  |  |  |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| *Base-Scale Reasoning Models* |  |  |  |  |  |  |  |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| *Our Models* |  |  |  |  |  |  |  |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |
  • Spatial Reasoning: We assess this skill using 2 benchmarks: CV-Bench and VSI-Bench.

    • CV-Bench:

| Models | Count | Relation | Depth | Distance | Avg. |
|---|---|---|---|---|---|
| *Large-Scale Models* |  |  |  |  |  |
| GPT-4o | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| *Base-Scale Models* |  |  |  |  |  |
| InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLaVA-NeXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| *Our Models* |  |  |  |  |  |
| M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |
    • VSI-Bench (OC = object count, AD = absolute distance, OS = object size, RS = room size, RDs = relative distance, RDr = relative direction, RP = route planning, AO = appearance order):

| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *Large-Scale Models* |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4o | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| *Base-Scale Models* |  |  |  |  |  |  |  |  |  |
| InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLaVA-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
| *Our Models* |  |  |  |  |  |  |  |  |  |
| M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

If you’re in mainland China, we strongly recommend downloading our model from ModelScope.

Example Usage

The basic environment is python=3.10, torch=2.6.0+cu124, and transformers=4.49.0.

Below is a small example of how to use this repo.

```python
import os
import torch
from transformers import (
    AutoProcessor,
    AutoTokenizer,
)
import warnings
import argparse

from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(
        self,
        model_name_or_path,
        device="cuda",
        max_pixels=None,
        min_pixels=None,
        video_max_pixels=768 * 28 * 28,
        video_min_pixels=128 * 28 * 28,
        generation_config=None,
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device
        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels

        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9,
            }
        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention-2, plus its tokenizer and processor.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2",
        ).eval()
        tokenizer = AutoTokenizer.from_pretrained(
            self.model_name_or_path, add_bos_token=True, trust_remote_code=True
        )
        processor = Bailing_qwen2_5Processor.from_pretrained(
            self.model_name_or_path, trust_remote_code=True
        )
        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Build the chat-formatted prompt and pack the text/image/video inputs.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )
        image_inputs, video_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))
        inputs = inputs.to(self.device)
        # Cast vision tensors to bfloat16 to match the model weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )
        # Strip the prompt tokens so only the newly generated tokens are decoded.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]
        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)
    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels,
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)

'''
[Output]:
<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the
area of a rhombus, which is given by:
\[ \text{Area} = \frac{1}{2} \times d_1 \times d_2 \]
where \( d_1 \) and \( d_2 \) are the lengths of the diagonals.
In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals,
'''
```
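
The model's output wraps its reasoning in <think>...</think> tags and the final answer in <answer>...</answer> tags, with the key result in \boxed{} (the example output above is truncated; finishing the calculation by hand, QS = 2 × 137.9 / 12.2 ≈ 22.6 m, i.e. option C). A small post-processing helper like the following (written for this post, not part of the repo) can pull those pieces out of output_text:

```python
# Hypothetical helper (not part of the repo): split a response that follows the
# system prompt above into reasoning, answer, and the \boxed{} result.
import re

def parse_response(output_text: str) -> dict:
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", output_text)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
        "boxed": boxed[-1].strip() if boxed else None,
    }

# Example: parse_response(output_text)["boxed"] returns the boxed value,
# e.g. "22.6" for the rhombus question above.
```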