MiniCPM-V is a series of end-side multimodal large language models (MLLMs) developed by the OpenBMB organization, specifically designed for visual-language understanding. These models accept images and text as input and provide high-quality text output.


Key Features and Use Cases

Versatile Applications

The MiniCPM-V series models are suitable for a wide range of scenarios, including but not limited to:

  • Multimodal understanding and interaction between images and text
  • Efficient end-side deployment on mobile devices and personal computers
  • Multilingual support for global applications
  • Image recognition, scene understanding, and text generation tasks

Flagship Models

Since February 2024, four model versions have been released, aiming for strong performance and efficient end-side deployment. The most notable models in the MiniCPM-V series include:

MiniCPM-Llama3-V 2.5

The latest and most powerful model in the MiniCPM-V series, with 8B parameters. It surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3 in overall performance. The model also supports multimodal conversations in over 30 languages, including English, Chinese, French, Spanish, and German.

MiniCPM-V 2.0

The lightest model in the MiniCPM-V series, with 2B parameters. It outperforms larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It accepts image inputs at any aspect ratio, up to 1.8 million pixels (e.g., 1344×1344), matches Gemini Pro in understanding scene text, and matches GPT-4V in low hallucination rates.

Usage

The project provides both online demos and a locally runnable WebUI demo.

Online Demos

The project offers two online demos on Hugging Face Spaces:

  • MiniCPM-Llama3-V 2.5: https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5
  • MiniCPM-V 2.0: https://huggingface.co/spaces/openbmb/MiniCPM-V-2

Local Installation

  1. Clone the repository and open the folder:
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
  2. Create a conda environment:
conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V
  3. Install dependencies:
pip install -r requirements.txt
  4. Run the WebUI:
# For NVIDIA GPU, run:
python web_demo_2.5.py --device cuda

# For Mac with MPS (Apple chip or AMD GPU), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps

Inference

  • Model Zoo: Various versions of MiniCPM-V models are provided for different hardware and memory requirements.
  • Multi-turn Conversation: Example code is provided for multi-turn conversation inference (a minimal Transformers-based sketch follows this list, and a full example appears under Project Examples below).
  • Inference on Mac: An example of using MPS for inference on Mac is provided.
  • Mobile Phone Deployment: MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 can be deployed on Android phones.
  • Inference with llama.cpp: MiniCPM-Llama3-V 2.5 supports inference using llama.cpp.
  • Inference with vLLM: An example and steps for MiniCPM-V 2.0 inference using vLLM are provided.
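
Beyond the repository's own chat.py wrapper shown under Project Examples below, the model card also documents a plain Hugging Face Transformers interface. The following is a minimal sketch assuming that interface; the chat() method comes from the model's remote code, and its exact keyword arguments may differ between model revisions, so treat this as illustrative rather than authoritative.

# Minimal single-image inference sketch via Hugging Face Transformers.
# chat() is provided by the model's remote code (trust_remote_code=True),
# not by the Transformers library itself; argument names may vary by revision.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16)
model = model.to(device='cuda')
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('./assets/airplane.jpeg').convert('RGB')
msgs = [{'role': 'user', 'content': 'Tell me the model of this aircraft.'}]

# Single round of chat; multi-turn works by appending assistant/user turns to msgs.
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)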

Fine-tuning

  • Simple fine-tuning of MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5 with Hugging Face is supported (an illustrative LoRA sketch follows this list).
  • Fine-tuning the MiniCPM-V series using the SWIFT framework is supported.
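
As a rough illustration of what parameter-efficient fine-tuning of these models involves, the sketch below wraps the model with a LoRA adapter using the PEFT library. This is a generic sketch, not the repository's own finetune scripts or the SWIFT commands (see the repo for those); the target module names are an assumption based on the Llama 3 backbone.

# Illustrative LoRA setup with the PEFT library; a generic sketch,
# not the official finetune script shipped with MiniCPM-V.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # assumed attention projection names for the Llama 3 backbone
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable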

Project Examples

For a concrete inference example, see the sample image and the inference code below.

Image Examples

Airbus A380

Inference Code

# chat.py and the assets/ folder ship with the MiniCPM-V repository root
from chat import MiniCPMVChat, img2base64
import torch
import json

torch.manual_seed(0)

chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/airplane.jpeg')

# First round chat 
msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat 
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

Output:

"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."

"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."  

Other Multimodal Projects

  • VisCPM
  • RLHF-V
  • LLaVA-UHD
  • RLAIF-V

Note: The content in this article is for reference only. For the latest project features, please refer to the official GitHub page.

Thank you for reading! Feel free to like, share, and follow for more content.

Resources

  • GitHub Project: https://github.com/OpenBMB/MiniCPM-V
  • MiniCPM-Llama3-V 2.5 Hugging Face Space: https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5
  • MiniCPM-V 2.0 Hugging Face Space: https://huggingface.co/spaces/openbmb/MiniCPM-V-2
  • VisCPM: https://github.com/OpenBMB/VisCPM
  • RLHF-V: https://github.com/RLHF-V/RLHF-V
  • LLaVA-UHD: https://github.com/thunlp/LLaVA-UHD
  • RLAIF-V: https://github.com/RLHF-V/RLAIF-V
