To compare the strengths of Llama-3.1-405B and GPT-4o from a practical usage perspective, we designed test cases covering five scenarios: mathematics, coding, tool usage, JSON extraction, and creative writing, and ran a head-to-head comparison between the strongest open-source and closed-source models.
Due to resource limitations, we tested Llama-3.1-405B-Instruct and GPT-4o-2024-08-06 on the publicly available platforms lmsys.org and huggingface.co/chat/.
On July 24, Meta officially open-sourced Llama 3.1 in 8B, 70B, and 405B parameter versions; the specifics of the three models are detailed in the table below. In addition to BF16 precision, the 405B model is also available in an FP8 quantized version, and Meta additionally released Llama-Guard-3-8B, an 8B model fine-tuned for content-safety classification. The models support a 128K context window and are proficient in eight languages, including English, German, and French. Training Llama 3.1 involved two key stages: pre-training, in which the model was trained on over 15 trillion tokens using a custom GPU cluster, and post-training, which combined Supervised Fine-Tuning (SFT), rejection sampling, and Direct Preference Optimization (DPO).
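To make the preference-optimization step concrete, here is a minimal sketch of the per-example DPO objective: the loss pushes the policy to assign a larger log-probability margin to the preferred response than the reference model does. This is an illustration only, not Meta's implementation; the log-probabilities and β value below are made-up inputs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example Direct Preference Optimization loss:
    -log sigmoid(beta * (chosen log-ratio minus rejected log-ratio)),
    where each log-ratio compares the policy to the reference model."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy exactly matches the reference, the margin is 0
# and the loss equals log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

Training on preference pairs with this loss nudges the model toward chosen responses without a separate reward model, which is part of why DPO is attractive for large-scale post-training.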
| | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Model Dimension | 4,096 | 8,192 | 16,384 |
| FFN Dimension | 14,336 | 28,672 | 53,248 |
| Attention Heads | 32 | 64 | 128 |
| Key/Value Heads | 8 | 8 | 8 |
| Peak Learning Rate | 3 × 10⁻⁴ | 1.5 × 10⁻⁴ | 8 × 10⁻⁵ |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Positional Embeddings | RoPE (θ = 500,000) | RoPE (θ = 500,000) | RoPE (θ = 500,000) |
Architecture parameters of the Llama 3.1 models
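As a sanity check, the headline parameter counts can be roughly reproduced from the architecture values reported in the Llama 3.1 paper. The back-of-the-envelope sketch below counts the embedding, grouped-query attention, and SwiGLU FFN weights and ignores small terms such as normalization layers; it assumes untied input and output embeddings.

```python
def approx_params(layers, d_model, d_ffn, n_heads, n_kv_heads, vocab):
    """Rough transformer parameter count from the architecture table."""
    head_dim = d_model // n_heads
    # Grouped-query attention: full-size Q and output projections,
    # smaller K/V projections shared across groups of query heads.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    # SwiGLU FFN uses three projection matrices (gate, up, down).
    ffn = 3 * d_model * d_ffn
    # Untied input embedding plus output head.
    embed = 2 * vocab * d_model
    return layers * (attn + ffn) + embed

print(f"{approx_params(32, 4096, 14336, 32, 8, 128_000) / 1e9:.1f}B")      # ~8.0B
print(f"{approx_params(126, 16384, 53248, 128, 8, 128_000) / 1e9:.1f}B")   # ~405.8B
```

The estimates land close to the advertised 8B and 405B sizes, which also shows how little grouped-query attention adds relative to the FFN blocks at these widths.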
Notably, the SFT phase utilized high-quality synthetic data to enhance the model's capabilities in coding, mathematical reasoning, and tool usage.
Excerpt from the "Data Processing and Quality Control" section of the paper.
Comparing the Llama 3.1 and Llama 3 papers makes the improvements evident: the 3.1 8B model outperforms the previous 70B model on math (GSM8K) and coding (HumanEval) benchmarks.
| Benchmark | 3-8B | 3-70B | 3.1-8B | 3.1-70B | 3.1-405B |
|---|---|---|---|---|---|
| GSM8K | 57.2 | 83.3 | 84.4 | 94.8 | 96.8 |
| HumanEval | 34.1 | 39.0 | 68.3 | 79.3 | 85.3 |
| MMLU | 64.3 | 77.5 | 67.9 | 82.4 | 85.5 |
Performance metrics of different Llama models
Llama 3.1 was evaluated on more than 50 datasets as well as through human evaluations. The experiments show that the largest 405B model performs on par with the industry's best closed-source models, such as GPT-4, GPT-4o, and Claude 3.5 Sonnet, while the smaller 8B and 70B models remain competitive with closed-source models of similar parameter sizes.