Today, we’re diving into a new AI model called CodeGeeX4-ALL-9B. For simplicity’s sake, let’s refer to it as CodeG.

CodeG is a specialized model trained for programming tasks, built on the GLM-4-9B base model from the same team (Zhipu AI / Tsinghua's THUDM). With only 9 billion parameters, it's designed to run efficiently on local devices, making it an intriguing option for developers and AI enthusiasts alike.


Model Overview and Capabilities

CodeGeeX4-ALL-9B is the latest open-source model in the CodeGeeX4 series. It’s a multilingual code generation model that has undergone extensive training on top of the GLM4-9B base, significantly enhancing its code generation abilities.

The model boasts a comprehensive feature set, including:

  • Code completion and generation
  • Code interpretation
  • Web search functionality
  • Function calling
  • Repository-level code Q&A

These features cover a wide range of software development scenarios, making it a versatile tool for programmers.

According to the developers, CodeG is currently the strongest code generation model with fewer than 10 billion parameters. They claim it outperforms many larger, general-purpose models, striking an optimal balance between inference speed and model performance.

An additional noteworthy feature is its 128k context window, allowing it to process and understand larger chunks of code or text at once.
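
To get a feel for what fits in that window, here is a minimal sketch that counts how many tokens a concatenated repository dump occupies. The file path is a placeholder, and 131,072 is the nominal 128K limit:

from transformers import AutoTokenizer

# Count tokens in a hypothetical repo dump to see whether it fits the 128K window.
tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex4-all-9b", trust_remote_code=True)

with open("my_repo_dump.txt") as f:  # placeholder path: concatenated repo files
    text = f.read()

n_tokens = len(tokenizer(text)["input_ids"])
print(f"{n_tokens:,} tokens ({n_tokens / 131072:.0%} of the nominal 128K window)")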

Model List

| Model | Type | Seq Length | Download |
| --- | --- | --- | --- |
| codegeex4-all-9b | Chat | 128K | 🤗 Huggingface · 🤖 ModelScope · 🟣 WiseModel |

Get Started

Use a transformers version between 4.39.0 and 4.40.2 (inclusive) to quickly launch codegeex4-all-9b:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex4-all-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/codegeex4-all-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "write a quick sort"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(device)
with torch.no_grad():
    # Cap the generation length explicitly; the default can cut answers short.
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens so only the model's reply is decoded.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
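
If you want to run the model on more modest local hardware, 4-bit quantization cuts the weight memory to roughly a quarter. Here is a minimal sketch, assuming a CUDA GPU and the bitsandbytes package; this is a common transformers pattern, not an official recipe:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization: a 9B model drops from ~18 GB (bf16) to roughly 5-6 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex4-all-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/codegeex4-all-9b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
).eval()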

Tutorials

CodeGeeX4-ALL-9B provides three user guides to help users quickly understand and use the model:

  1. System Prompt Guideline: This guide introduces how to use system prompts in CodeGeeX4-ALL-9B, including the VSCode extension official system prompt, customized system prompts, and some tips for maintaining multi-turn dialogue history (a minimal sketch follows after this list).
  2. Infilling Guideline: This guide explains the VSCode extension official infilling format, covering general infilling, cross-file infilling, and generating a new file in a repository.
  3. Repository Tasks Guideline: This guide demonstrates how to use repository tasks in CodeGeeX4-ALL-9B, including QA tasks at the repository level and how to trigger the aicommiter capability of CodeGeeX4-ALL-9B to perform deletions, additions, and changes to files at the repository level.

These guides aim to provide a comprehensive understanding and facilitate efficient use of the model.
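
As a concrete example of the first guide, here is a minimal sketch of a system-prompted call, continuing from the Get Started snippet above (it reuses that tokenizer and device). The system prompt below is a stand-in of my own, not the official VS Code extension prompt, which is listed in the guide:

# Placeholder system prompt; the official one is in the System Prompt Guideline.
messages = [
    {"role": "system", "content": "You are an intelligent programming assistant. Reply with code first, then a brief explanation."},
    {"role": "user", "content": "write a binary search in Python"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(device)
# For multi-turn history, append the model's reply as an "assistant" message
# and the next question as a "user" message, then re-apply the template.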

Evaluation

According to the developers' evaluation, CodeGeeX4-ALL-9B ranks as the most capable code model under 10 billion parameters, surpassing general-purpose models several times its size and striking the best balance between inference performance and model effectiveness.

| Model | Seq Length | HumanEval | MBPP | NCB | LCB | HumanEval-FIM | CRUXEval-O |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama3-70B-Instruct | 8K | 77.4 | 82.3 | 37.0 | 27.4 | - | - |
| DeepSeek Coder 33B Instruct | 16K | 81.1 | 80.4 | 39.3 | 29.3 | 78.2 | 49.9 |
| Codestral-22B | 32K | 81.1 | 78.2 | 46.0 | 35.3 | 91.6 | 51.3 |
| CodeGeeX4-All-9B | 128K | 82.3 | 75.7 | 40.4 | 28.5 | 85.0 | 47.1 |

CodeGeeX4-ALL-9B scored 48.9 and 40.4 on the complete and instruct tasks of BigCodeBench, the highest scores among models with fewer than 20 billion parameters. In CRUXEval, a benchmark for testing code reasoning, understanding, and execution capabilities, CodeGeeX4-ALL-9B presented remarkable results through its chain-of-thought (CoT) abilities.

From easy code generation tasks in HumanEval and MBPP to very challenging tasks in NaturalCodeBench, CodeGeeX4-ALL-9B achieved outstanding performance at its scale. According to the team, it is currently the only code model that supports function calling, and it even achieves a better execution success rate than GPT-4.

Furthermore, in the "Code Needle In A Haystack" (NIAH) evaluation, CodeGeeX4-ALL-9B demonstrated the ability to retrieve code within contexts up to 128K, achieving 100% retrieval accuracy across all Python scripts.
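
To make that function-calling claim concrete, here is a hypothetical sketch of the round trip: a tool schema goes into the prompt, the model answers with a JSON call, and your code executes it. The schema style and reply format below are illustrative assumptions of mine; CodeGeeX4's exact format is documented in its repository.

import json

# Hypothetical tool schema (OpenAI-style); purely illustrative.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# This would be sent as the system message, as in the snippet above.
system_prompt = (
    "You may call a tool by replying with JSON of the form "
    '{"name": ..., "arguments": ...}.\n'
    f"Available tools: {json.dumps(tools)}"
)

# Suppose the model replied with this string:
reply = '{"name": "get_weather", "arguments": {"city": "Beijing"}}'

call = json.loads(reply)
if call["name"] == "get_weather":
    # Stub executor; a real one would hit a weather API.
    print({"city": call["arguments"]["city"], "temp_c": 21})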


Full details of the evaluation results can be found in the project's evaluation documentation.

Benchmark Performance

Let’s examine how CodeG performs on various industry-standard benchmarks:

HumanEval

In the HumanEval benchmark, CodeG reportedly outperformed models like CodeT5, DeepSeek Coder 33B, and even Llama 3-70B, despite those models being several times its size.
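
For context, HumanEval gives the model a function signature plus docstring and checks the completion against hidden unit tests. Below is a sketch paraphrasing the benchmark's first problem, with a completion that would pass:

# HumanEval-style prompt: signature + docstring only; the model writes the body.
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer than threshold."""
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )

# The benchmark then runs hidden unit tests like these:
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False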

MBPP and NCB

On the MBPP (Mostly Basic Python Problems) and NCB (NaturalCodeBench) tests, CodeG's performance was slightly behind some competitors. However, considering its smaller size, the results are still impressive.

HumanEval-FIM

In the fill-in-the-middle (FIM) variant of HumanEval, CodeG scored 85.0, surpassing DeepSeek Coder 33B Instruct (78.2) and showcasing its ability to complete code in the middle of an existing file.

BigCodeBench

CodeG achieved the best performance among models of similar size in the BigCodeBench evaluation. However, it's worth noting that DeepSeek Coder V2, a larger model, still outperformed it.

CRUXEval

In the CRUXEval benchmark, which tests code reasoning, understanding, and execution capabilities, CodeG showcased impressive results, leveraging its chain-of-thought (CoT) abilities.

Needle in a Haystack (NIAH)

CodeG demonstrated 100% retrieval accuracy for Python scripts in contexts up to 128K, highlighting its strong code retrieval capabilities in large contexts.
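
The official NIAH harness isn't reproduced here, but the idea is simple: bury one distinctive function among thousands of filler functions and ask the model to recover it from the long context. A minimal sketch of constructing such a probe (function names and question wording are my own):

# Build a long "haystack" of filler functions with one needle buried inside.
filler = "\n".join(f"def util_{i}(x):\n    return x + {i}\n" for i in range(3000))
needle = "def secret_answer():\n    return 'codegeex-needle-42'\n"
mid = len(filler) // 2
haystack = filler[:mid] + "\n" + needle + "\n" + filler[mid:]

question = "What does secret_answer() return? Reply with the literal value."
prompt = haystack + "\n\n" + question
print(f"prompt length: {len(prompt):,} characters")
# Feed `prompt` to the model via the Get Started snippet and check the reply.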

Availability and Testing

The model is available on Hugging Face, with an Ollama release expected in the coming days. There's also a Hugging Face Space where you can try the model directly in your browser.

Personal Evaluation

To assess CodeG’s real-world performance, I conducted a series of tests ranging from simple language tasks to more complex programming challenges. Here are the results:

  1. Word association and rhyming: Passed
  2. Basic math word problem: Failed
  3. Logic puzzle: Failed
  4. Geometry calculation: Failed
  5. Interactive HTML/CSS/JS coding: Failed
  6. Python function writing: Passed
  7. SVG generation: Failed
  8. HTML landing page creation: Passed (with some leniency)
  9. Terminal-based Game of Life in Python: Failed (a reference solution follows this list)
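
For reference, test 9 asked for something like the following: a terminal Game of Life with wrap-around edges. This is my own minimal passing solution, not the model's output:

import os
import random
import time

W, H = 40, 20
# Random initial board; True = live cell.
grid = [[random.random() < 0.25 for _ in range(W)] for _ in range(H)]

def step(g):
    new = [[False] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Count the eight neighbors with toroidal wrap-around.
            n = sum(
                g[(y + dy) % H][(x + dx) % W]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            new[y][x] = n == 3 or (g[y][x] and n == 2)
    return new

for _ in range(100):
    os.system("cls" if os.name == "nt" else "clear")
    print("\n".join("".join("#" if c else "." for c in row) for row in grid))
    grid = step(grid)
    time.sleep(0.1)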

Conclusion and Personal Thoughts

After testing, I found CodeG’s performance to be somewhat underwhelming. It only passed three out of nine tests, and one of those passes was given with some leniency. Its performance was comparable to more general-purpose language models like GPT-3.5, despite being marketed as a specialized coding model.

While CodeG shows promise in certain areas, particularly in Python function writing and basic HTML tasks, it struggled with more complex programming challenges and general reasoning tasks. This raises questions about its practical utility compared to other available models like DeepSeek Coder V2 or Qwen 2, which have demonstrated superior performance in similar tests.

It’s important to note that benchmark results don’t always translate directly to real-world performance, and your experience may vary depending on specific use cases. If you’re considering using CodeG for your projects, I recommend thoroughly testing it against your specific requirements before making a decision.

For those interested in diving deeper into CodeG’s capabilities or discussing alternative models, I encourage you to join our community chat group where we can exchange ideas and experiences.


Remember, the field of AI and code generation is rapidly evolving. While CodeG may not have met all expectations in this evaluation, it represents another step forward in the ongoing development of AI-assisted coding tools. As always, the best tool for you will depend on your specific needs and use cases.
