Llama 3.1 405B vs GPT-4o: Which model is better?

Evaluating Llama 3.1-405B and GPT-4o Across Key Performance Metrics to Determine the Superior AI Model for Users and Developers.

View this article on Notion. This article is also published at j000e.com.


Introduction

To compare the strengths and intelligence of Llama-3.1-405B and GPT-4o from a practical-usage perspective, we designed test cases covering five scenarios: mathematics, coding, tool usage, JSON extraction, and creative writing, pitting the strongest open-source model against the strongest closed-source one.

Due to resource limitations, we tested Llama-3.1-405B-Instruct and GPT-4o-2024-08-06 on the publicly available platforms lmsys.org and huggingface.co/chat/.
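The tests in this article were run through those web UIs. For readers who would rather script a similar side-by-side comparison, here is a minimal sketch using the openai Python client. It assumes an OpenAI API key for GPT-4o and an OpenAI-compatible endpoint serving Llama 3.1 405B; the LLAMA_BASE_URL, API key, and Llama model name are placeholders for whatever provider you use:

```python
from openai import OpenAI

PROMPT = "What is 7 * 8 + 12 / 4?"  # example test prompt

def ask(client: OpenAI, model: str, prompt: str) -> str:
    """Send one prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# GPT-4o via the official OpenAI API (reads OPENAI_API_KEY from the environment).
openai_client = OpenAI()
print(ask(openai_client, "gpt-4o-2024-08-06", PROMPT))

# Llama 3.1 405B via any OpenAI-compatible provider (placeholder URL and key).
llama_client = OpenAI(base_url="https://LLAMA_BASE_URL/v1", api_key="YOUR_KEY")
print(ask(llama_client, "meta-llama/Meta-Llama-3.1-405B-Instruct", PROMPT))
```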

Introduction to the Llama 3.1 Model

On July 24, 2024, Meta officially open-sourced Llama 3.1 in three sizes: 8B, 70B, and 405B parameters. The specific configuration of each model is detailed in the table below. In addition to the BF16 weights, the 405B model is also available in an FP8-quantized version, and Meta open-sourced an extra content-safety classifier, Llama-Guard-3-8B, alongside the 8B version. All models support a 128K context window and are proficient in eight languages, including English, German, and French.

The training process for Llama 3.1 comprises two key stages: pre-training and post-training. During pre-training, Llama 3.1 was trained on over 15 trillion tokens using a custom-built GPU cluster. Post-training involves Supervised Fine-Tuning (SFT), rejection sampling, and Direct Preference Optimization (DPO).

| | 8B | 70B | 405B |
| --- | --- | --- | --- |
| Layers | 32 | 80 | 126 |
| Model Dimension | 4,096 | 8,192 | 16,384 |
| FFN Dimension | 14,336 | 28,672 | 53,248 |
| Attention Heads | 32 | 64 | 128 |
| Key/Value Heads | 8 | 8 | 8 |
| Peak Learning Rate | 3 × 10⁻⁴ | 1.5 × 10⁻⁴ | 8 × 10⁻⁵ |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Positional Embeddings | RoPE (θ = 500,000) | RoPE (θ = 500,000) | RoPE (θ = 500,000) |

Specific parameters of the Llama 3.1 models
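To make the table concrete, here is a minimal PyTorch sketch (not Meta's implementation) of the two architectural choices named in its last rows: a SwiGLU feed-forward block and the RoPE rotation factors with θ = 500,000, using the 8B dimensions from the table:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dimensions for the 8B model, taken from the table above.
D_MODEL = 4_096         # model dimension
D_FFN = 14_336          # FFN dimension
ROPE_THETA = 500_000.0  # RoPE base frequency (theta)

class SwiGLUFeedForward(nn.Module):
    """Llama-style FFN: down( silu(gate(x)) * up(x) )."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def rope_rotations(head_dim: int, seq_len: int, theta: float = ROPE_THETA) -> torch.Tensor:
    """Complex rotation factors e^(i * m * theta^(-2k/d)) for rotary embeddings."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)            # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)  # unit-magnitude complex factors

# 8B: 32 attention heads over a 4,096-dim model -> 128-dim heads; 128K context.
ffn = SwiGLUFeedForward(D_MODEL, D_FFN)
print(ffn(torch.randn(1, 8, D_MODEL)).shape)                # torch.Size([1, 8, 4096])
print(rope_rotations(head_dim=128, seq_len=131_072).shape)  # torch.Size([131072, 64])
```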

Notably, the SFT phase utilized high-quality synthetic data to enhance the model's capabilities in coding, mathematical reasoning, and tool usage.
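Rejection sampling, mentioned above as one of the post-training steps, is the usual mechanism for filtering such synthetic data: sample several candidate completions per prompt and keep only the best-scoring one for SFT. A minimal sketch, where generate and score are hypothetical stand-ins for the model and a reward model or verifier:

```python
from typing import Callable, List

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one completion from the model
    score: Callable[[str, str], float],  # hypothetical: reward model or verifier
    n_samples: int = 8,
) -> str:
    """Best-of-n filtering: sample n candidates and keep the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda c: score(prompt, c))
```

For coding data, for example, the scorer can be as simple as checking whether the candidate passes unit tests.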

Excerpt from the "Data Processing and Quality Control" section of the paper.

Llama 3.1 vs Llama 3

Comparing the Llama 3.1 and Llama 3 papers makes the improvements evident: the Llama 3.1 8B model outperforms the previous Llama 3 70B model on math (GSM8K) and coding (HumanEval) benchmarks.

| Benchmark | 3-8B | 3-70B | 3.1-8B | 3.1-70B | 3.1-405B |
| --- | --- | --- | --- | --- | --- |
| GSM8K | 57.2 | 83.3 | 84.4 | 94.8 | 96.8 |
| HumanEval | 34.1 | 39.0 | 68.3 | 79.3 | 85.3 |
| MMLU | 64.3 | 77.5 | 67.9 | 82.4 | 85.5 |

Performance metrics of different Llama models

Llama 3.1 vs GPT-4o vs Claude 3.5 Sonnet

Llama 3.1 was evaluated on more than 50 datasets, supplemented by human evaluations. The experiments show that the largest 405B model performs on par with the industry's best closed-source models, such as GPT-4, GPT-4o, and Claude 3.5 Sonnet, while the smaller 8B and 70B models are competitive with closed-source models of similar parameter size.