Benchmarking multi-lingual open-source LLMs
People: David
Idea: Based on conversations with Keao about machine-translation benchmarking, I ran a set of open-source LLMs through the MMLU-ProX (lite) French benchmark (the biology section plus the full 14-topic suite) to see which ones actually deliver the best mix of speed and accuracy in French. This started because we hypothesized that DeepSeek-R1 could be a better translator than OSS-120B, and because we were curious how Mistral's recent Magistral and Large 2 models (announced at the same time as Pixtral Large) compared.
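For context, the harness was essentially a loop like the sketch below. This is a minimal reconstruction, not the exact script: the Hugging Face dataset ID (`li-lab/MMLU-ProX-Lite`), the column names, the local OpenAI-compatible endpoint, and the answer-extraction regex are all assumptions.

```python
# Minimal sketch of the eval loop. Dataset ID, column names, and the local
# OpenAI-compatible endpoint are assumptions, not the exact setup from these notes.
import re
import time

from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local serving endpoint
MODEL = "Qwen3-235B-Instruct-2507"  # whatever name the server exposes

def ask(question: str, options: list[str]) -> str:
    """Send one multiple-choice question in French and return the letter the model picks."""
    letters = "ABCDEFGHIJ"[: len(options)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
        + "\nRéponds uniquement par la lettre de la bonne réponse."
    )
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    ).choices[0].message.content or ""
    match = re.search(r"\b([A-J])\b", reply)
    return match.group(1) if match else ""

# Assumed layout: one config per language, MMLU-Pro-style columns (question / options / answer).
questions = load_dataset("li-lab/MMLU-ProX-Lite", "fr", split="test")
start, correct = time.time(), 0
for row in questions:
    correct += ask(row["question"], row["options"]) == row["answer"]
print(f"accuracy={correct / len(questions):.1%}  runtime={(time.time() - start) / 60:.1f} min")
```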
Details:
- Qwen3-235B-Instruct-2507 (Q2 quantization) won overall: 88.9% on biology, 80% on the full suite
- The AWQ-4bit version of Qwen3-Next-80B was shockingly fast and pretty accurate: 12.5 minutes for all 14 topics (~0.89 min/task)
- Quantization isn't always straightforward: Q2 beat Q4 by 2.8 points and ran 10× faster
- DeepSeek-R1 tied for top biology score (88.9%) but crawled through the full suite at 22.5 min/task
- The Mistral models surprisingly underperformed on accuracy (59-67% on the full suite), and since they are dense models they couldn't compete on speed either
- Deployment stack/hardware matters enormously: the same 80B model ran 10× faster on AWQ than on MLX (see the serving sketch after this list)
- Biology was the easiest domain (84.5% average); law was brutal (43.2% average)
- Weirdly, spending more time didn't correlate with better scores; sometimes it was the opposite
- The Pareto frontier is pretty clear: Llama-4-Scout if you need speed (0.5 min), Qwen3-Next-80B AWQ for balance, Qwen3-235B if you need accuracy (see the frontier sketch after this list)
- Most models cluster around 75-80% on the full suite, but runtime varies wildly (12 min to 5+ hours)
- If you're picking a model for production, look at per-task speed; averages hide a lot
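For the AWQ numbers in the deployment-stack bullet, the serving side can be as simple as the sketch below, assuming vLLM as the engine (the notes don't record which engine was actually used) and a hypothetical Hub path for the AWQ build of Qwen3-Next-80B.

```python
# Sketch: serving an AWQ-quantized Qwen3-Next-80B with vLLM for offline benchmarking.
# The model path is a hypothetical AWQ build; adjust tensor_parallel_size to your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-Next-80B-A3B-Instruct-AWQ",  # hypothetical local/Hub path for the AWQ build
    quantization="awq",                        # tell vLLM to use its AWQ kernels
    tensor_parallel_size=2,                    # split the 80B across 2 GPUs
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=16)
outputs = llm.generate(
    ["Quelle est la capitale de la France ? Réponds par un seul mot."], params
)
print(outputs[0].outputs[0].text)
```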
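To make the Pareto-frontier claim concrete, a small dominance filter over (min/task, full-suite accuracy) pairs does the job. In the example data below, only the 0.5 and 0.89 min/task runtimes and the 80% accuracy figure come from these notes; every value marked as a placeholder is invented purely to show the shape of the computation.

```python
# Sketch: extracting the speed/accuracy Pareto frontier from benchmark results.
# Values marked "placeholder" are NOT measured numbers, just stand-ins for illustration.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    min_per_task: float  # lower is better
    accuracy: float      # higher is better (full-suite %)

def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep every model that no other model strictly beats on both speed and accuracy."""
    frontier = []
    for r in results:
        dominated = any(
            o.min_per_task <= r.min_per_task
            and o.accuracy >= r.accuracy
            and (o.min_per_task < r.min_per_task or o.accuracy > r.accuracy)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.min_per_task)

results = [
    Result("Llama-4-Scout",          0.50, 75.0),  # accuracy: placeholder
    Result("Qwen3-Next-80B-AWQ",     0.89, 78.0),  # accuracy: placeholder
    Result("Qwen3-235B-Instruct-Q2", 4.00, 80.0),  # runtime: placeholder; 80% from the notes
    Result("DeepSeek-R1",           22.50, 79.0),  # accuracy: placeholder
]
for r in pareto_frontier(results):
    print(f"{r.model:<24} {r.min_per_task:>5.2f} min/task  {r.accuracy:.1f}%")
```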