Introducing Ming-Lite-Omni V1.5

Overview

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni(Github). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has a total of 20.3 billion parameters, with 3 billion active parameters in its MoE (Mixture-of-Experts) section. It demonstrates highly competitive results in various modal benchmarks compared to industry-leading models.

Performance Comparison

Introduce Ming-lite-omni v1.5

Controllable Image Generation: Pixel-Level Control, Infinite Creativity

Ming-lite-omni v1.5 significantly optimizes Scene Consistency and ID Consistency (Character / Style Consistency) in image editing. When editing human figures, it demonstrates a clear advantage in maintaining scene and character ID. Furthermore, it expands support for perceptual tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

Depth and Edge Detection

Original Image	Generated Depth Map	Generated Bounding Boxes	Generated Edge Contours

Audio-Video Interactive Understanding

Model Architecture Upgrade and Capability Evaluation

The Ming-lite-omni v1.5 model architecture is outlined below. The core design references the structure of Ming-lite-omni V1. However, a key distinction is the upgrade of the Vision head to support reference image feature input, specifically to enhance character and scene consistency in image editing.

Mode Architecture

The model’s capabilities have been significantly optimized and upgraded across three key areas: enhanced Omni-modal comprehension, precise visual editing control, and improved user experience.

Thanks to optimized data quality, Ming-lite-omni v1.5 shows significant improvements in tasks such as vision-text comprehension (including image-text, document, and video understanding) and speech understanding. It has reached an industry-leading level for models of comparable scale.

Vision-text Comprehension

Task Type	Dataset	Qwen2.5-VL-7B	Ming-lite-omni	Ming-lite-omni v1.5
Image-text Understanding	AI2D	84.36	83.1	84.91
	HallusionBench	55.77	55.0	54.59
	MMBench_TEST_V11	82.75	80.8	80.73
	MMMU	56.56	56.3	54.33
	MMStar	65.27	64.7	65.07
	MMVet	71.61	71.3	73.99
	MathVista	68.10	71.6	72.00
	OCRBench	87.80	88.4	88.90
	Average	71.5	71.4	71.8
Video Understanding	VideoMME(w/o subs)	65.10	63.4	67.07
	VideoMME(w/ subs)	71.60	66.01	72.59
	VideoMME(avg)	68.35	67.7	69.83
	MVBench	69.60	67.7	69.43
	LongVideoBench	56.00	56.6	59.54
	OvOBench	51.10	48.48	52.17
	Average	61.26	58.89	62.74
Document Understanding	ChartQA_test	87.24	85.1	88.84
	DocVQA_test	95.57	93	93.68
	TextVQA_val	85.06	82.8	82.27
	OCRBench	87.8	88.4	88.9
	Average	88.91	87.32	88.42

Speech Understanding

Model	Average(Open-ended QA)	AlpacaEval	CommonEval	SD-QA	MMSU	OpenBookQA	IFEval	AdvBench
Ming-lite-omni v1.5	4.474	4.648	4.3	61.16	45.77	65.934	55.599	98.076
Ming-lite-omni	4.34	4.63	4.06	58.84	47.53	61.98	58.36	99.04
MiniCPM-o	4.285	4.42	4.15	50.72	54.78	78.02	49.25	97.69
Kimi-Audio	4.215	4.46	3.97	63.12	62.17	83.52	61.10	100.00
Qwen2.5-Omni	4.21	4.49	3.93	55.71	61.32	81.10	52.87	99.42
GLM-4-Voice	3.77	4.06	3.48	43.31	40.11	52.97	24.91	88.08

Precise Visual Editing Control

Ming-lite-omni v1.5 employs the following optimization strategies to address the issues of character ID and scene ID consistency during image editing:

ID and Scene Consistency Loss: This is achieved by increasing the weight of the edited region in the target image and the reference strength of the non-edited region in the reference image, while simultaneously decreasing the reference strength of the edited region in the reference image. This approach enhances image editing consistency.
Incorporating Generative Detection and Segmentation Tasks to Boost Perceptual Capabilities: By supporting generative segmentation and keypoint detection, the model’s understanding of image details and spatial relationships is improved. This enhances the structural controllability of the editing and generation processes, leading to significant increases in evaluation metrics related to position, structure, and quantity.
Multi-Task Collaborative Learning Strategy: Through a joint training pipeline, generation and editing mutually reinforce each other. Segmentation tasks are transformed into colorization editing tasks, which significantly improves segmentation metrics and the precision and controllability of local image editing, resulting in smoother edges for edited regions.

Based on these optimizations, Ming-lite-omni v1.5 shows a significant improvement in image editing capabilities, achieving a GenEval score of 0.87.

	1-Obj	2-Obj	Counting	Colors	Position	Color Attr	Avg.
Ming-lite-omni	0.99	0.77	0.68	0.78	0.46	0.42	0.64
Ming-lite-omni v1.5	0.99	0.93	0.86	0.87	0.90	0.66	0.87

Optimized User Experience

Thanks to the construction of high-quality alignment preference data, Ming-lite-omni v1.5 demonstrates a certain advantage over leading models in terms of correctness, relevance, format aesthetics, and fluency of expression for image-text Q&A. Ming-lite-omni v1.5 achieved a win rate of 87.07% against Ming-lite-omni V1 on internal adversarial evaluation sets, indicating a significant optimization in user experience.

Evaluation Dimension	Qwen2.5-VL-7B	Ming-lite-omni V1.5
Relevance	4.308	4.5
Fluency	4.765	4.91
Richness of Content	3.828	3.69
Format aesthetics	4.727	4.8
Correctness	3.741	3.92
Average Score	4.274	4.365

Get Started with Ming-lite-omni v1.5

The model and code for Ming-lite-omni v1.5 are now open-source, and we invite everyone to try it out, share feedback, and join the discussion. Looking ahead, we’re excited to announce that a quantized and accelerated version is on the way. This future release will not only further enhance omni-modal performance but also make the model even more lightweight, all while strengthening its multimodal reasoning and generation capabilities. Stay tuned for more updates!

Github: https://github.com/inclusionAI/Ming
Hugging Face: https://huggingface.co/inclusionAI/Ming-Lite-Omni-1.5
ModelScope: https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni-1.5