
Ming-Omni: A Unified Multimodal Model for Perception and Generation

June 11, 2025 · 3 min read

GitHub | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite with 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, facilitating diverse tasks without separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
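To make the modality-specific routing idea concrete, the sketch below shows one way such a design can be expressed: a shared pool of experts scored by a separate lightweight router per modality. This is a minimal illustrative sketch with assumed names and sizes (ModalityMoE, num_experts, top_k), not Ming-lite-omni's actual implementation:

```python
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """Toy MoE layer: a shared expert pool gated by one router per modality (illustration only)."""

    def __init__(self, d_model=512, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        # Modality-specific routers: each modality learns its own gating over the same experts.
        self.routers = nn.ModuleDict({m: nn.Linear(d_model, num_experts) for m in modalities})

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq, d_model); every token is gated by its modality's router.
        probs = self.routers[modality](tokens).softmax(dim=-1)   # (B, S, num_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)            # keep the top-k experts per token
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).to(tokens.dtype)
                out = out + mask * weights[..., k:k + 1] * expert(tokens)  # dense loop for clarity
        return out

# Image and text tokens share the same experts but are routed differently.
layer = ModalityMoE()
image_tokens = torch.randn(1, 16, 512)
print(layer(image_tokens, modality="image").shape)  # torch.Size([1, 16, 512])
```

In the real model the experts and routers live inside the Ling MoE backbone and routing is computed sparsely; the loop above trades efficiency for readability.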

📌 Updates

  • [2025.06.12] 🔥 Our Technical Report is now publicly available on arXiv.
  • [2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image generation support.
  • [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

  • Unified Omni-Modality Perception: Built on Ling, an MoE-architecture LLM, Ming-lite-omni resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers.

  • Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which helps enhance generation quality and improves usability across multiple tasks.

  • Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.

Evaluation

Ming-lite-omni delivers exceptional cross-modal performance, validated across image perception, audio-visual interaction, and image generation tasks. In image perception, Ming-lite-omni attains performance comparable to Qwen2.5-VL-7B while activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64 and outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new state of the art among existing methods.

Image benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |

Encyclopedia Benchmarks

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |

Video benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |

Note: All models are evaluated based on 128 uniformly sampled frames.
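For context, "uniformly sampled frames" simply means frame indices spaced evenly over the clip. A minimal sketch of this sampling step (an illustration of the protocol, not the actual evaluation code):

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 128) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly across a clip of `total_frames` frames."""
    return np.linspace(0, total_frames - 1, num=num_samples).round().astype(int)

# e.g. sample 128 of the 18,000 frames in a 10-minute 30 fps clip
indices = uniform_frame_indices(18_000)
print(indices[:3], indices[-1])  # [  0 142 283] 17999
```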

Audio benchmark

SpeechQA

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |

ASR

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |

Information-Seeking Benchmark

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-VL-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |

OCR

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |

GUI

| Benchmarks | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct |
|---|---|---|---|
| ScreenSpot | 82.1 | 79.5 | 78.9* |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ (EM) | 66.6 | - | 57.6* |

Note: * denotes the reproduced results.

Unified Generation Benchmark

| Model | single_object | two_object | counting | colors | position | color_attr | GenEval | DPGBench | FID↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |

Please refer to our technical report for more comprehensive evaluation results.

Model Downloads

You can download the model from both Hugging Face and ModelScope.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace · 🤖 ModelScope |
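Alternatively, the model repository can be fetched programmatically from the Hugging Face Hub with huggingface_hub; a minimal sketch (the local directory path is just an example):

```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local directory.
snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni",
                  local_dir="inclusionAI/Ming-Lite-Omni")
```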

If you are in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.

```bash
pip install modelscope
modelscope download --model inclusionAI/Ming-Lite-Omni --local_dir inclusionAI/Ming-Lite-Omni --revision master
```

Note: This download process will take several minutes to several hours, depending on your network conditions.

Use Cases

Additional demonstration cases are available on our project page.

Environment Preparation

Installation with pip

```bash
pip install -r requirements.txt
# for python 3.10
pip install data/matcha_tts-0.0.5.1-cp310-cp310-linux_x86_64.whl
# for python 3.8
# pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8  # for H20 GPU
```

Installation with docker

You can also initialize the environment by building the docker image. First clone this repository:

```bash
git clone --depth 1 https://github.com/inclusionAI/Ming.git
cd Ming
```

Then build the docker image with the provided Dockerfile in docker/docker-py310-cu121. This step might take a while:

```bash
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```

Finally, start the container with the current repo directory mounted:

```bash
docker run -it --gpus all -v "$(pwd)":/workspace/Ming ming:py310-cu121 /bin/bash
```

You can run the model via the Python interface. You may download the Hugging Face model into the repo directory first (.../Ming/) or mount the downloaded model path when starting the container.
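Whichever setup you choose (pip or docker), it is worth confirming that PyTorch can see the GPU before loading the checkpoint; a quick, optional sanity check (not part of the original instructions):

```python
import torch

# The bfloat16 checkpoint needs roughly 62 GB of GPU memory, so check the GPU first.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```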

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

```bash
git clone https://github.com/inclusionAI/Ming.git
cd Ming
```

Step 2 - Download the model weights and create a soft link to the source code directory

Download our model following the Model Downloads section, then:

```bash
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-Lite-Omni inclusionAI/Ming-Lite-Omni
```

Step 3 - Enter the code directory and refer to the following code to run the Ming-Lite-Omni model.

```bash
jupyter notebook cookbook.ipynb
```

We also provide a simple example of how to use this repo. For detailed usage, please refer to cookbook.ipynb.

```python
import torch
from transformers import AutoProcessor, GenerationConfig

from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# load model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to("cuda")

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            # "Please give a detailed introduction to the living habits of parrots."
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]

# 1. Format inputs using chat template
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# 2. Extract vision/audio data
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

# 3. Prepare tensor inputs
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# 4. Configure generation
generation_config = GenerationConfig.from_dict({"no_repeat_ngram_size": 10})
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# 5. Decode output
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)

# Output:
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
```

Note: We tested the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 62 GB of GPU memory.

This code repository is licensed under the MIT License, and the legal disclaimer is located in the LEGAL.md file in the project's root directory.

Citation

If you find our work helpful, feel free to cite us:

```bibtex
@misc{Mingomni2025,
  title         = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
  author        = {Inclusion AI},
  year          = {2025},
  eprint        = {2506.09344},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2506.09344}
}
```