MiniCPM-V is a series of end-side multimodal large language models (MLLMs) developed by the OpenBMB organization for vision-language understanding. The models take images and text as input and produce high-quality text output.
Key Features and Use Cases
Versatile Applications
The MiniCPM-V series models are suitable for a wide range of scenarios, including but not limited to:
- Multimodal understanding and interaction between images and text
- Efficient end-side deployment on mobile devices and personal computers
- Multilingual support for global applications
- Image recognition, scene understanding, and text generation tasks
Flagship Models
Since February 2024, four versions have been released, all aimed at strong performance and efficient end-side deployment. The most notable models in the MiniCPM-V series include:
MiniCPM-Llama3-V 2.5
The latest and most powerful model in the MiniCPM-V series, with 8B parameters. It surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3 in overall performance. The model also supports multimodal conversations in over 30 languages, including English, Chinese, French, Spanish, and German.
MiniCPM-V 2.0
The lightest model in the MiniCPM-V series, with 2B parameters. It outperforms larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It accepts image inputs at any aspect ratio with up to 1.8 million pixels (e.g., 1344×1344), matches Gemini Pro in scene-text understanding, and matches GPT-4V in keeping hallucination rates low.
Usage
The project provides both online demos and instructions for running a demo locally.
Online Demos
The project offers two online demos on Hugging Face Spaces, one for MiniCPM-Llama3-V 2.5 and one for MiniCPM-V 2.0 (links in the Resources section below).
Local Installation
- Clone the repository and change into its directory:
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
- Create a conda environment:
conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V
- Install dependencies:
pip install -r requirements.txt
- Run the WebUI:
# For NVIDIA GPU, run:
python web_demo_2.5.py --device cuda
# For Mac with MPS (Apple chip or AMD GPU), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
Inference
- Model Zoo: Various versions of MiniCPM-V models are provided for different hardware and memory requirements.
- Multi-turn Conversation: Example code is provided for multi-turn conversation inference; a minimal sketch based on the Hugging Face model card follows this list, and a full example appears under Inference Code below.
- Inference on Mac: An example of using MPS for inference on Mac (Apple silicon or AMD GPU) is provided.
- Mobile Phone Deployment: MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 can be deployed on Android phones.
- Inference with llama.cpp: MiniCPM-Llama3-V 2.5 supports inference using llama.cpp.
- Inference with vLLM: An example and steps for MiniCPM-V 2.0 inference using vLLM are provided.
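As a rough illustration of the Model Zoo and multi-turn conversation items above, the sketch below loads MiniCPM-Llama3-V 2.5 directly through Hugging Face transformers. It follows the usage shown on the model card, but the exact chat() keyword arguments can differ between model versions, and the image path and questions here are placeholders; check the model card for the version you deploy.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the 2.5 checkpoint; trust_remote_code is required because the chat logic ships with the model
model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device='cuda')  # on a Mac with Apple silicon, use device='mps' instead
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()

image = Image.open('./assets/airplane.jpeg').convert('RGB')  # placeholder image path

# First round: ask about the image
msgs = [{'role': 'user', 'content': 'Tell me the model of this aircraft.'}]
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(answer)

# Second round: append the history and ask a follow-up question
msgs.append({'role': 'assistant', 'content': answer})
msgs.append({'role': 'user', 'content': 'Introduce something about Airbus A380.'})
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(answer)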
Fine-tuning
- Simple fine-tuning of MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5 using Hugging Face is supported.
- Fine-tuning the MiniCPM-V series using the SWIFT framework is supported.
Project Examples
For inference examples, see the image examples and inference code below.
Image Examples
(Example images are available on the project's GitHub page.)
Inference Code
from chat import MiniCPMVChat, img2base64
import torch
import json
torch.manual_seed(0)
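# initialize the chat wrapper defined in the repo's chat.py with the MiniCPM-Llama3-V 2.5 checkpoint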
chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')
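# encode the example image as a base64 string for the chat wrapper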
im_64 = img2base64('./assets/airplane.jpeg')
# First round chat
msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
Output:
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
Other Multimodal Projects
- VisCPM
- RLHF-V
- LLaVA-UHD
- RLAIF-V
Note: The content in this article is for reference only. For the latest project features, please refer to the official GitHub page.
Resources
GitHub Project: https://github.com/OpenBMB/MiniCPM-V
MiniCPM-Llama3-V 2.5 Online Demo (Hugging Face Spaces): https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5
MiniCPM-V 2.0 Online Demo (Hugging Face Spaces): https://huggingface.co/spaces/openbmb/MiniCPM-V-2
VisCPM: https://github.com/OpenBMB/VisCPM
RLHF-V: https://github.com/RLHF-V/RLHF-V
LLaVA-UHD: https://github.com/thunlp/LLaVA-UHD
RLAIF-V: https://github.com/RLHF-V/RLAIF-V