Ming-Omni: A Unified Multimodal Model for Perception and Generation
GitHub | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
Introduction
Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite with 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby supporting diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
📌 Updates
- [2025.06.12] 🔥 Our Technical Report is now publicly available on arXiv.
- [2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image generation support.
- [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.
Key Features
- Unified Omni-Modality Perception: Built on Ling, an MoE-architecture LLM, Ming-lite-omni resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers (a toy illustration of this routing idea follows this list).
- Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which enhances generation quality and improves usability across multiple tasks.
- Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and simultaneously generate high-quality text, real-time speech, and vivid images, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.
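The snippet below is a minimal, self-contained sketch of the modality-specific routing idea mentioned above. All names and sizes are made up for illustration and this is not Ming's actual implementation: each modality gets its own router (gate), so image, audio, and text tokens can be dispatched to different experts of the same MoE layer.
import torch
import torch.nn as nn

class ModalitySpecificRouterMoE(nn.Module):
    # Toy MoE layer with one router per modality (0 = text, 1 = image, 2 = audio here).
    def __init__(self, dim=64, num_experts=4, num_modalities=3, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.routers = nn.ModuleList([nn.Linear(dim, num_experts) for _ in range(num_modalities)])
        self.top_k = top_k

    def forward(self, tokens, modality_id):
        # tokens: [num_tokens, dim]; route each token with the gate of its modality.
        logits = self.routers[modality_id](tokens)
        weights, expert_idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out

moe = ModalitySpecificRouterMoE()
image_tokens = torch.randn(16, 64)
print(moe(image_tokens, modality_id=1).shape)  # torch.Size([16, 64])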
Evaluation
Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks. Specifically, in image perception, Ming-lite-omni attains performance comparable to that of Qwen2.5-VL-7B while activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64 and outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new state of the art among existing methods.
Image benchmark
| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |
Encyclopedia Benchmarks
| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |
Video benchmark
| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |
Note: All models are evaluated with 128 uniformly sampled frames.
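For reference, a minimal sketch of such uniform sampling (our own illustration, not the evaluation script) simply picks evenly spaced frame indices across the clip:
import numpy as np

def uniform_frame_indices(total_frames: int, num_frames: int = 128):
    # Evenly spaced positions over the whole video, rounded to valid frame indices.
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int)

print(uniform_frame_indices(1000, num_frames=8))  # 8 indices spread evenly from frame 0 to 999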
Audio benchmark
SpeechQA
| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
ASR
| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |
Information-Seeking Benchmark
| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-VL-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |
OCR
| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |
GUI
| Benchmarks | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct |
|---|---|---|---|
| ScreenSpot | 82.1 | 79.5 | 78.9* |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ(EM) | 66.6 | - | 57.6* |
Note: * denotes reproduced results.
Unified Generation Benchmark
| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |
Please refer to our technical report for more comprehensive evaluation results.
Model Downloads
You can download the model from both Hugging Face and ModelScope.
| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace 🤖 ModelScope |
If you’re in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.
pip install modelscope
modelscope download --model inclusionAI/Ming-Lite-Omni --local_dir inclusionAI/Ming-Lite-Omni --revision master
Note: This download process will take several minutes to several hours, depending on your network conditions.
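If you prefer Hugging Face, the standard huggingface-cli download works as well (a sketch, assuming the Hugging Face repo id mirrors the ModelScope one shown above):
pip install -U huggingface_hub
huggingface-cli download inclusionAI/Ming-Lite-Omni --local-dir inclusionAI/Ming-Lite-Omni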
Use Cases
Additional demonstration cases are available on our project page .
Environment Preparation
Installation with pip
pip install -r requirements.txt
# for python 3.10
pip install data/matcha_tts-0.0.5.1-cp310-cp310-linux_x86_64.whl
# for python 3.8
# pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8 # for H20 GPU
Installation with docker
You can also initialize the environment by building the docker image. First clone this repository:
git clone --depth 1 https://github.com/inclusionAI/Ming.git
cd Ming
Then build the docker image with the provided Dockerfile in docker/docker-py310-cu121. This step might take a while:
docker build -t ming:py310-cu121 docker/docker-py310-cu121
Finally, start the container with the current repo directory mounted:
docker run -it --gpus all -v "$(pwd)":/workspace/Ming ming:py310-cu121 /bin/bash
You can run the model with the Python interface. You may download the Hugging Face model into the repo directory first (.../Ming/) or mount the downloaded model path when starting the container.
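For example, a minimal sketch of mounting a locally downloaded checkpoint into the container (the host path below is a placeholder, not something shipped with this repo):
docker run -it --gpus all \
  -v "$(pwd)":/workspace/Ming \
  -v /path/to/inclusionAI/Ming-Lite-Omni:/workspace/Ming/inclusionAI/Ming-Lite-Omni \
  ming:py310-cu121 /bin/bash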
Example Usage
We provide a step-by-step running example:
Step 1 - Download the source code
git clone https://github.com/inclusionAI/Ming.git
cd Ming
Step 2 - Download the model weights and create a soft link to the source code directory
Download our model following Model Downloads
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-Lite-Omni inclusionAI/Ming-Lite-Omni
Step 3 - Enter the code directory, and refer to the following code to run the Ming-Lite-Omni model.
jupyter notebook cookbook.ipynb
We also provide a simple example of how to use this repo below. For detailed usage, please refer to cookbook.ipynb.
import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration
# load model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
"inclusionAI/Ming-Lite-Omni",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True
).to("cuda")
# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]
# 1. Format inputs using chat template
text = processor.apply_chat_template(messages, add_generation_prompt=True)
# 2. Extract vision/audio data
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
# 3. Prepare tensor inputs
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
inputs[k] = inputs[k].to(dtype=torch.bfloat16)
# 4. Configure generation
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# 5. Decode output
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# Output:
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
Note: We test the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 62 GB of GPU memory.
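Beyond plain text QA, the same chat-template flow covers multimodal inputs. As a rough sketch only (the content keys below are assumptions modeled on the text entry above; see cookbook.ipynb for the formats the processor actually supports), an image question might be built like this:
# Hedged sketch: the "type": "image" entry and its path key are assumptions, not verified against this repo.
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": "/path/to/example.jpg"},  # hypothetical local image path
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
# The remaining steps (apply_chat_template, process_vision_info, processor(...), generate) are unchanged.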
License and Legal Disclaimer
This code repository is licensed under the MIT License , and the Legal Disclaimer is located in the LEGAL.md file under the project’s root directory.
Citation
If you find our work helpful, please feel free to cite us:
@misc{Mingomni2025,
title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
author = {Inclusion AI},
year = {2025},
eprint = {2506.09344},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2506.09344}
}