
Introducing Ming-Lite-Omni V1.5

July 21, 2025 · 3 min read

GitHub | 🤗 Hugging Face | 🤖 ModelScope

Overview

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni (GitHub). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has 20.3 billion total parameters, of which 3 billion are active in its MoE (Mixture-of-Experts) layers. It delivers highly competitive results across modal benchmarks compared to industry-leading models.
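The 20.3B-total / 3B-active split is characteristic of sparse MoE layers, where a router activates only a few experts per token. A minimal sketch of top-k expert routing (the expert count, logits, and k below are illustrative assumptions, not Ming-lite-omni's actual configuration):

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.
    Only the selected experts run for a given token, so active parameters
    stay far below the total parameter count."""
    # numerically stable softmax over all expert logits
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep the top-k experts and renormalize their probabilities
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# hypothetical router scores for 4 experts; only experts 1 and 3 are activated
weights = top_k_route([1.0, 3.0, 0.5, 2.0], k=2)
```

Because only the selected experts' feed-forward weights execute per token, compute scales with the roughly 3B active parameters rather than the 20.3B total.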

[Figure: performance comparison with industry-leading models across modal benchmarks]

Introducing Ming-lite-omni v1.5

Controllable Image Generation: Pixel-Level Control, Infinite Creativity

Ming-lite-omni v1.5 significantly optimizes Scene Consistency and ID Consistency (Character / Style Consistency) in image editing. When editing human figures, it demonstrates a clear advantage in maintaining scene and character ID. Furthermore, it expands support for perceptual tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

[Figure: depth and edge detection — original image, generated depth map, generated bounding boxes, generated edge contours]

Audio-Video Interactive Understanding

Model Architecture Upgrade and Capability Evaluation

The Ming-lite-omni v1.5 model architecture is outlined below. The core design follows the structure of Ming-lite-omni V1; the key distinction is an upgraded vision head that supports reference-image feature input, specifically to enhance character and scene consistency in image editing.
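One common way to feed a reference image into such a head is to prepend its features to the edit target's features, with marker tokens separating the two streams so attention can condition edits on the reference. A toy sketch of that idea (the marker tokens and layout are hypothetical illustrations, not the model's actual input format):

```python
def build_vision_input(ref_tokens, target_tokens,
                       ref_marker="<ref>", tgt_marker="<img>"):
    """Concatenate reference-image features ahead of the edit-target
    features, with marker tokens so downstream attention can tell the
    two streams apart. Purely illustrative token layout."""
    return [ref_marker] + ref_tokens + [tgt_marker] + target_tokens

# two reference-feature tokens followed by three target-feature tokens
seq = build_vision_input(["r0", "r1"], ["t0", "t1", "t2"])
```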

[Figure: model architecture]

The model’s capabilities have been significantly optimized and upgraded across three key areas: enhanced Omni-modal comprehension, precise visual editing control, and improved user experience.

Enhanced Omni-Modal Comprehension

Thanks to optimized data quality, Ming-lite-omni v1.5 shows significant improvements in tasks such as vision-text comprehension (including image-text, document, and video understanding) and speech understanding. It has reached an industry-leading level for models of comparable scale.

Vision-text Comprehension

| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni | Ming-lite-omni v1.5 |
| --- | --- | --- | --- | --- |
| Image-text Understanding | AI2D | 84.36 | 83.1 | 84.91 |
|  | HallusionBench | 55.77 | 55.0 | 54.59 |
|  | MMBench_TEST_V11 | 82.75 | 80.8 | 80.73 |
|  | MMMU | 56.56 | 56.3 | 54.33 |
|  | MMStar | 65.27 | 64.7 | 65.07 |
|  | MMVet | 71.61 | 71.3 | 73.99 |
|  | MathVista | 68.10 | 71.6 | 72.00 |
|  | OCRBench | 87.80 | 88.4 | 88.90 |
|  | Average | 71.5 | 71.4 | 71.8 |
| Video Understanding | VideoMME (w/o subs) | 65.10 | 63.4 | 67.07 |
|  | VideoMME (w/ subs) | 71.60 | 66.01 | 72.59 |
|  | VideoMME (avg) | 68.35 | 67.7 | 69.83 |
|  | MVBench | 69.60 | 67.7 | 69.43 |
|  | LongVideoBench | 56.00 | 56.6 | 59.54 |
|  | OvOBench | 51.10 | 48.48 | 52.17 |
|  | Average | 61.26 | 58.89 | 62.74 |
| Document Understanding | ChartQA_test | 87.24 | 85.1 | 88.84 |
|  | DocVQA_test | 95.57 | 93 | 93.68 |
|  | TextVQA_val | 85.06 | 82.8 | 82.27 |
|  | OCRBench | 87.8 | 88.4 | 88.9 |
|  | Average | 88.91 | 87.32 | 88.42 |

Speech Understanding

| Model | Average (Open-ended QA) | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni v1.5 | 4.474 | 4.648 | 4.3 | 61.16 | 45.77 | 65.934 | 55.599 | 98.076 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
| MiniCPM-o | 4.285 | 4.42 | 4.15 | 50.72 | 54.78 | 78.02 | 49.25 | 97.69 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |

Precise Visual Editing Control

Ming-lite-omni v1.5 employs the following optimization strategies to address the issues of character ID and scene ID consistency during image editing:

  1. ID and Scene Consistency Loss: This is achieved by increasing the weight of the edited region in the target image and the reference strength of the non-edited region in the reference image, while simultaneously decreasing the reference strength of the edited region in the reference image. This approach enhances image editing consistency.
  2. Incorporating Generative Detection and Segmentation Tasks to Boost Perceptual Capabilities: By supporting generative segmentation and keypoint detection, the model’s understanding of image details and spatial relationships is improved. This enhances the structural controllability of the editing and generation processes, leading to significant increases in evaluation metrics related to position, structure, and quantity.
  3. Multi-Task Collaborative Learning Strategy: Through a joint training pipeline, generation and editing mutually reinforce each other. Segmentation tasks are transformed into colorization editing tasks, which significantly improves segmentation metrics and the precision and controllability of local image editing, resulting in smoother edges for edited regions.
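Strategy 1 can be read as a spatially weighted reconstruction objective: the edited region is weighted more heavily against the target image, while the reference image mainly constrains the non-edited region. A toy per-pixel sketch under that reading (the squared-error form and all weight values are illustrative assumptions, not the published loss):

```python
def consistency_loss(pred, target, reference, edit_mask,
                     w_edit=2.0, w_ref_keep=1.0, w_ref_edit=0.1):
    """Toy pixel-wise consistency loss. The edited region (mask == 1) is
    pulled more strongly toward the target, while the reference image
    constrains mostly the non-edited region. All weights are illustrative.
    pred/target/reference: flat lists of pixel values; edit_mask: 1 = edited."""
    loss = 0.0
    for p, t, r, m in zip(pred, target, reference, edit_mask):
        target_w = w_edit if m else 1.0          # up-weight edits vs. the target
        ref_w = w_ref_edit if m else w_ref_keep  # down-weight edits vs. the reference
        loss += target_w * (p - t) ** 2 + ref_w * (p - r) ** 2
    return loss / len(pred)

# pixel 0 is edited (matches reference, misses target); pixel 1 is untouched
loss = consistency_loss([0.5, 0.5], [1.0, 0.5], [0.5, 0.0], [1, 0])
```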

Based on these optimizations, Ming-lite-omni v1.5 shows a significant improvement in image editing capabilities, achieving a GenEval score of 0.87.

| Model | 1-Obj | 2-Obj | Counting | Colors | Position | Color Attr | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni | 0.99 | 0.77 | 0.68 | 0.78 | 0.46 | 0.42 | 0.64 |
| Ming-lite-omni v1.5 | 0.99 | 0.93 | 0.86 | 0.87 | 0.90 | 0.66 | 0.87 |

Optimized User Experience

Thanks to the construction of high-quality alignment preference data, Ming-lite-omni v1.5 shows an advantage over leading models in the correctness, relevance, format aesthetics, and fluency of its image-text Q&A responses. It achieved a win rate of 87.07% against Ming-lite-omni V1 on internal adversarial evaluation sets, indicating a significant improvement in user experience.
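For reference, a pairwise win rate like the 87.07% above is typically computed from side-by-side judge preferences. A minimal sketch (counting ties as half a win is a common convention assumed here, not necessarily the one used for this evaluation):

```python
def win_rate(outcomes):
    """Compute a pairwise win rate (%) from a list of 'win'/'tie'/'loss'
    judgments against a baseline model; ties count as half a win
    (a common convention, assumed here)."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return 100.0 * score / len(outcomes)

# four hypothetical side-by-side judgments
rate = win_rate(["win", "win", "tie", "loss"])
```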

| Evaluation Dimension | Qwen2.5-VL-7B | Ming-lite-omni v1.5 |
| --- | --- | --- |
| Relevance | 4.308 | 4.5 |
| Fluency | 4.765 | 4.91 |
| Richness of Content | 3.828 | 3.69 |
| Format Aesthetics | 4.727 | 4.8 |
| Correctness | 3.741 | 3.92 |
| Average Score | 4.274 | 4.365 |

Get Started with Ming-lite-omni v1.5

The model and code for Ming-lite-omni v1.5 are now open source, and we invite everyone to try it out, share feedback, and join the discussion. Looking ahead, a quantized and accelerated version is on the way: it will further enhance omni-modal performance and make the model even more lightweight, while strengthening its multimodal reasoning and generation capabilities. Stay tuned for more updates!
