From Benchmarks to Builders: Running MiniMax M2.1 on Our Mac Studio
Last week I wrote about two paths to vLLM on Apple Silicon - comparing vllm-metal and vllm-mlx as options for local inference. This week the picture changed. LM Studio shipped concurrent request support, and suddenly the simplest option became the most practical one.
People
David
Idea
We spent the week stress-testing local LLM inference on our Mac Studio M3 Ultra (512GB) using Harbor's terminal-bench benchmark - trying to find the right configuration to serve MiniMax M2.1 to our builders for agentic coding work on our own hardware.
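In practice, "serving MiniMax M2.1 to our builders" just means pointing any OpenAI-compatible client at the Mac Studio instead of a hosted API. Here's a minimal sketch, assuming LM Studio's local server on its default port and a placeholder model id for however the loaded quant is named:

```python
# Minimal sketch: talking to LM Studio's OpenAI-compatible local server.
# The model id below is a placeholder - use whatever name LM Studio shows
# for the loaded MiniMax M2.1 quant.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="minimax-m2.1-5bit-mlx",  # placeholder model id
    messages=[{"role": "user", "content": "Write a shell one-liner that counts lines of Python in this repo."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

Anything that already speaks the OpenAI API - agents, editors, scripts - can be switched over by changing the base URL.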
Details
- LM Studio 0.4 added concurrent inference, which was the missing piece - prior to this, every request through LM Studio had to wait in line
- We ran 25 test configurations across vllm-mlx, LM Studio (MLX and GGUF), and vLLM on a Blackwell GPU, tracking memory pressure, timeouts, accuracy, and server crashes
- vllm-mlx is fast early on but degrades badly as conversation context grows - it crashed the Metal GPU with an unrecoverable OOM error at 4 concurrent tasks (note: this was a special branch for MiniMax; the main line may have fixes now)
- LM Studio never crashed once across all our tests, even when we pushed it to 6 concurrent conversations and 2.3 million tokens over an hour
- The concurrency sweet spot on 512GB is 3-4 simultaneous tasks (see the concurrency smoke-test sketch after this list) - going to 6 caused swap thrashing, and going to 5 with vllm-mlx froze the whole system (at least without quantizing the KV cache)
- MLX 5-bit and GGUF Q5_K_XL quants performed surprisingly similarly on LM Studio - about 26 vs 30 minutes for the same three tasks, both completing cleanly (we would expect Q5_K_XL to have higher accuracy, though)
- On our 18-task accuracy baseline, MiniMax M2.1 5-bit solved 11 tasks (61%), including code generation, optimization, and security challenges
- 8-bit KV cache quantization cut memory use but added a per-call slowdown that wiped out any concurrency gains - not worth it at current settings (we also noticed an accuracy drop)
- Keep "separate reasoning from content" turned OFF in LM Studio - turning it on caused a 10x slowdown and 5 of 10 tasks to time out in our tests (this may just be a weird interaction with the Terminus 2.0 reference agent from Harbor)
- Our projection for a full 89-task run on LM Studio at -n 4 concurrency (5-bit MLX, without quantizing the KV cache) is roughly 18-20 passes at about 20% accuracy - a 30% relative drop from the full-precision API model results, which is a reasonable trade for running everything on-prem on a single Mac (rough math sketched below)
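For the concurrency point above, this is the kind of smoke test we mean - not the terminal-bench harness itself, just a sketch assuming the same local endpoint and placeholder model id. If the server is genuinely concurrent, wall time for four parallel requests should land near the slowest single request, not near the sum of all four:

```python
# Rough concurrency smoke test against LM Studio's local server.
# Fires N simultaneous chat completions and compares wall time to
# per-request latencies - the behavior LM Studio 0.4 added is that
# these requests interleave instead of queuing one behind another.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="minimax-m2.1-5bit-mlx",  # placeholder model id
        messages=[{"role": "user", "content": f"Task {i}: explain what a KV cache is in two sentences."}],
        max_tokens=200,
    )
    return time.perf_counter() - start

async def main(n: int = 4) -> None:
    wall_start = time.perf_counter()
    latencies = await asyncio.gather(*(one_request(i) for i in range(n)))
    wall = time.perf_counter() - wall_start
    per_request = ", ".join(f"{t:.1f}s" for t in latencies)
    print(f"n={n}  per-request: {per_request}  wall: {wall:.1f}s")

asyncio.run(main())
```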
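And the back-of-envelope math behind the last bullet, with the caveat that the implied API baseline is derived from the 30% relative-drop figure rather than measured here:

```python
# Reading of the 89-task projection: 18-20 passes -> ~20% local accuracy,
# and undoing a ~30% relative drop gives the implied full-precision API baseline.
total_tasks = 89

for passes in range(18, 21):
    local_acc = passes / total_tasks
    implied_api_acc = local_acc / (1 - 0.30)
    print(f"{passes} passes -> local {local_acc:.1%}, implied API baseline ~{implied_api_acc:.1%}")
```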