Benchmarking multi-lingual open-source LLMs
People: David
Idea: Based on conversations with Keao about machine-translation benchmarking, I ran a set of open-source LLMs through the MMLU-ProX (lite) French benchmark (the biology section plus the full 14-topic suite) to see which ones actually deliver the best mix of speed and accuracy in French. This started because we hypothesized that DeepSeek-R1 could be a better translator than OSS-120B, and because we were curious how Mistral's recent Magistral and Large 2 models (announced at the same time as Pixtral Large) compared.
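For context, the harness was essentially a loop like the sketch below. This is a minimal reconstruction, not the exact script: the Hugging Face dataset ID (`li-lab/MMLU-ProX-Lite`), the column names, the local OpenAI-compatible endpoint, and the answer-extraction regex are all assumptions.

```python
# Minimal sketch of the eval loop. Dataset ID, column names, and the local
# OpenAI-compatible endpoint are assumptions, not the exact setup from these notes.
import re
import time

from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local serving endpoint
MODEL = "Qwen3-235B-Instruct-2507"  # whatever name the server exposes

def ask(question: str, options: list[str]) -> str:
    """Send one multiple-choice question in French and return the letter the model picks."""
    letters = "ABCDEFGHIJ"[: len(options)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
        + "\nRéponds uniquement par la lettre de la bonne réponse."
    )
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    ).choices[0].message.content or ""
    match = re.search(r"\b([A-J])\b", reply)
    return match.group(1) if match else ""

# Assumed layout: one config per language, MMLU-Pro-style columns (question / options / answer).
questions = load_dataset("li-lab/MMLU-ProX-Lite", "fr", split="test")
start, correct = time.time(), 0
for row in questions:
    correct += ask(row["question"], row["options"]) == row["answer"]
print(f"accuracy={correct / len(questions):.1%}  runtime={(time.time() - start) / 60:.1f} min")
```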
Details:
- Qwen3-235B-Instruct-2507 (Q2 quantization) won overall: 88.9% on biology, 80% on the full suite
- The AWQ-4bit version of Qwen3-Next-80B was shockingly fast and pretty accurate: 12.5 minutes for all 14 topics (~0.89 min/task)
- Quantization isn't always straightforward: Q2 beat Q4 by 2.8 points and ran 10× faster
- DeepSeek-R1 tied for top biology score (88.9%) but crawled through the full suite at 22.5 min/task
- The Mistral models surprisingly underperformed on accuracy (59-67% on the full suite), and since they are dense models they couldn't compete on speed either
- Deployment stack/hardware matters enormously: the same 80B model ran 10× faster on AWQ than on MLX (see the serving sketch after this list)
- Biology was the easiest domain (84.5% average); law was brutal (43.2% average)
- Weirdly, spending more time didn't correlate with better scores; sometimes it was the opposite
- The Pareto frontier is pretty clear: Llama-4-Scout if you need speed (0.5 min), Qwen3-Next-80B AWQ for balance, Qwen3-235B if you need accuracy (see the frontier sketch after this list)
- Most models cluster around 75-80% on the full suite, but runtime varies wildly (12 min to 5+ hours)
- If you're picking a model for production, look at per-task speed; averages hide a lot
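For the AWQ numbers in the deployment-stack bullet, the serving side can be as simple as the sketch below, assuming vLLM as the engine (the notes don't record which engine was actually used) and a hypothetical Hub path for the AWQ build of Qwen3-Next-80B.

```python
# Sketch: serving an AWQ-quantized Qwen3-Next-80B with vLLM for offline benchmarking.
# The model path is a hypothetical AWQ build; adjust tensor_parallel_size to your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-Next-80B-A3B-Instruct-AWQ",  # hypothetical local/Hub path for the AWQ build
    quantization="awq",                        # tell vLLM to use its AWQ kernels
    tensor_parallel_size=2,                    # split the 80B across 2 GPUs
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=16)
outputs = llm.generate(
    ["Quelle est la capitale de la France ? Réponds par un seul mot."], params
)
print(outputs[0].outputs[0].text)
```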
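To make the Pareto-frontier claim concrete, a small dominance filter over (min/task, full-suite accuracy) pairs does the job. In the example data below, only the 0.5 and 0.89 min/task runtimes and the 80% accuracy figure come from these notes; every value marked as a placeholder is invented purely to show the shape of the computation.

```python
# Sketch: extracting the speed/accuracy Pareto frontier from benchmark results.
# Values marked "placeholder" are NOT measured numbers, just stand-ins for illustration.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    min_per_task: float  # lower is better
    accuracy: float      # higher is better (full-suite %)

def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep every model that no other model strictly beats on both speed and accuracy."""
    frontier = []
    for r in results:
        dominated = any(
            o.min_per_task <= r.min_per_task
            and o.accuracy >= r.accuracy
            and (o.min_per_task < r.min_per_task or o.accuracy > r.accuracy)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.min_per_task)

results = [
    Result("Llama-4-Scout",          0.50, 75.0),  # accuracy: placeholder
    Result("Qwen3-Next-80B-AWQ",     0.89, 78.0),  # accuracy: placeholder
    Result("Qwen3-235B-Instruct-Q2", 4.00, 80.0),  # runtime: placeholder; 80% from the notes
    Result("DeepSeek-R1",           22.50, 79.0),  # accuracy: placeholder
]
for r in pareto_frontier(results):
    print(f"{r.model:<24} {r.min_per_task:>5.2f} min/task  {r.accuracy:.1f}%")
```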