
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

July 11, 2025 · 3 min read

📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, plus task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, with superior performance in both general and spatial reasoning.
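
As a purely hypothetical illustration of the step-wise dynamic part of such a strategy (this is a generic sketch written for this post, not the authors' actual scheme), one can re-draw the mix of general-reasoning and spatial-reasoning batches at each step based on recent per-task progress, so that no single data source dominates training:

```python
# Generic illustration of step-wise dynamic multi-task sampling (assumed scheme,
# not the exact one used to train M2-Reasoning-7B).
import random
from collections import deque

class DynamicTaskSampler:
    def __init__(self, tasks, window=100, temperature=1.0):
        self.tasks = tasks                                             # e.g. ["general", "spatial"]
        self.recent_reward = {t: deque(maxlen=window) for t in tasks}  # sliding window of rewards
        self.temperature = temperature

    def record(self, task, reward):
        """Log the reward observed for a rollout drawn from `task`."""
        self.recent_reward[task].append(reward)

    def sample_task(self):
        """Pick the next task, favoring tasks whose recent reward is lower,
        i.e. tasks the model is currently struggling with."""
        def mean(xs):
            return sum(xs) / len(xs) if xs else 0.5
        weights = [(1.0 - mean(self.recent_reward[t])) / self.temperature + 1e-3
                   for t in self.tasks]
        return random.choices(self.tasks, weights=weights, k=1)[0]

sampler = DynamicTaskSampler(["general", "spatial"])
sampler.record("general", 0.8)
sampler.record("spatial", 0.3)
print(sampler.sample_task())  # more likely to return "spatial"
```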

📌 Updates

Key Features

  • A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates large volumes of high-quality reasoning data (294.2K samples in total).
  • A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation that provides tailored incentive signals (a rough illustration follows this list).
  • Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
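
As a rough, hypothetical illustration of the task-specific reward idea (the exact reward functions are defined in the technical report, not here; the names and tolerances below are ours), a verifiable-reward setup can score general math answers by exact match on the \boxed{} content while scoring numeric spatial estimates with a relative-error tolerance:

```python
# Illustrative sketch only: task-specific verifiable rewards in the RLVR style.
# These are NOT the exact reward functions used to train M2-Reasoning; they show
# how incentive signals can be tailored per task.
import re

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a model response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def general_reward(response: str, gold: str) -> float:
    """Exact-match reward for general (math/logic) questions."""
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

def spatial_reward(response: str, gold: float, tol: float = 0.1) -> float:
    """Tolerance-based reward for numeric spatial estimates (e.g. distances):
    full credit within `tol` relative error, decaying linearly to zero by 2*tol."""
    pred = extract_boxed(response)
    try:
        pred_val = float(pred)
    except (TypeError, ValueError):
        return 0.0
    rel_err = abs(pred_val - gold) / max(abs(gold), 1e-6)
    return 1.0 if rel_err <= tol else max(0.0, 1.0 - (rel_err - tol) / tol)

print(general_reward(r"<answer>\boxed{22.6}</answer>", "22.6"))  # 1.0
print(spatial_reward(r"<answer>\boxed{3.1}</answer>", 3.0))      # 1.0
```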

Evaluation

We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:

  • General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista. In the table below, Avg. (Δ) is the average score over the six benchmarks; for reasoning models, Δ denotes the gain over the corresponding base model (for M2-Reasoning-7B, 45.0 - 35.5 = +9.5 over our base model).

| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
|---|---|---|---|---|---|---|---|
| *Base-Scale General Models* |  |  |  |  |  |  |  |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| *Base-Scale Reasoning Models* |  |  |  |  |  |  |  |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| *Our Models* |  |  |  |  |  |  |  |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |
  • Spatial Reasoning: We assess this skill using 2 benchmarks: CV-Bench and VSI-Bench.

    • CV-Bench:

| Models | Count | Relation | Depth | Distance | Avg. |
|---|---|---|---|---|---|
| *Large-Scale Models* |  |  |  |  |  |
| GPT-4o | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| *Base-Scale Models* |  |  |  |  |  |
| InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLaVA-NeXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| *Our Models* |  |  |  |  |  |
| M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |
    • VSI-Bench (OC = object count, AD = absolute distance, OS = object size, RS = room size, RDs = relative distance, RDr = relative direction, RP = route planning, AO = appearance order):

| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *Large-Scale Models* |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4o | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| *Base-Scale Models* |  |  |  |  |  |  |  |  |  |
| InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLaVA-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
| *Our Models* |  |  |  |  |  |  |  |  |  |
| M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

If you’re in mainland China, we strongly recommend downloading our model from ModelScope.

Example Usage

The basic environment is python=3.10, torch=2.6.0+cu124, and transformers=4.49.0.

Below is a small example of how to use this repo.

```python
import os
import torch
from transformers import (
    AutoProcessor,
    AutoTokenizer,
)
import warnings
import argparse

from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(
        self,
        model_name_or_path,
        device="cuda",
        max_pixels=None,
        min_pixels=None,
        video_max_pixels=768 * 28 * 28,
        video_min_pixels=128 * 28 * 28,
        generation_config=None,
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device
        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels

        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9,
            }
        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention-2, plus its tokenizer and processor.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2",
        ).eval()
        tokenizer = AutoTokenizer.from_pretrained(
            self.model_name_or_path, add_bos_token=True, trust_remote_code=True
        )
        processor = Bailing_qwen2_5Processor.from_pretrained(
            self.model_name_or_path, trust_remote_code=True
        )
        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Build the chat-formatted prompt and pack the text/image/video inputs.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )
        image_inputs, video_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))
        inputs = inputs.to(self.device)
        # Cast vision tensors to bfloat16 to match the model weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )
        # Strip the prompt tokens so only the newly generated tokens are decoded.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]
        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)
    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels,
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)

'''
[Output]:
<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the
area of a rhombus, which is given by:
\[ \text{Area} = \frac{1}{2} \times d_1 \times d_2 \]
where \( d_1 \) and \( d_2 \) are the lengths of the diagonals.
In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals,
'''
```
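
The model's output wraps its reasoning in <think>...</think> tags and the final answer in <answer>...</answer> tags, with the key result in \boxed{} (the example output above is truncated; finishing the calculation by hand, QS = 2 × 137.9 / 12.2 ≈ 22.6 m, i.e. option C). A small post-processing helper like the following (written for this post, not part of the repo) can pull those pieces out of output_text:

```python
# Hypothetical helper (not part of the repo): split a response that follows the
# system prompt above into reasoning, answer, and the \boxed{} result.
import re

def parse_response(output_text: str) -> dict:
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", output_text)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
        "boxed": boxed[-1].strip() if boxed else None,
    }

# Example: parse_response(output_text)["boxed"] returns the boxed value,
# e.g. "22.6" for the rhombus question above.
```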