(Post-WIP) How to evaluate inference engines
- People
  - David
- Idea
  - For a given set of local LLM models that a set of users (or a community) wants to run, how do you measure the performance and UX of the different ways to serve them, based on ease of use, processing and generation speed, output accuracy, and features such as multi-user context? (A minimal probe sketch follows the list below.)
- Details/tools to test
  - vLLM's benchmark scripts (serving throughput/latency)
  - LLMPerf from the Ray project (load-testing; time to first token, inter-token latency)
  - Aider's benchmark suite (code-editing accuracy)
  - LiveBench (model capability, with regularly refreshed questions)
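
As a rough illustration of what the serving-side harnesses above automate, here is a minimal sketch that streams one completion from an OpenAI-compatible endpoint (which vLLM, llama.cpp's server, Ollama, and others expose) and records time-to-first-token and streaming throughput. The base URL, model name, and the `probe` helper are placeholders, and counting one SSE chunk as one token is only an approximation, not how the listed tools count.

```python
"""Minimal latency/throughput probe against an OpenAI-compatible endpoint.

Assumptions: an engine is serving at BASE_URL, and each streamed SSE chunk
roughly corresponds to one generated token (an approximation).
"""
import time

import requests  # third-party: pip install requests

BASE_URL = "http://localhost:8000/v1"  # placeholder: wherever the engine serves its API
MODEL = "my-local-model"               # placeholder model name


def probe(prompt: str, max_tokens: int = 128) -> dict:
    """Stream one completion; record time-to-first-token and chunk throughput."""
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={
            "model": MODEL,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=120,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streaming sends "data: {...}" lines, ending with "data: [DONE]".
        if not line or not line.startswith(b"data: "):
            continue
        if line == b"data: [DONE]":
            break
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        chunks += 1
    total = time.perf_counter() - start
    decode_time = (total - (first_chunk_at - start)) if first_chunk_at else 0.0
    return {
        "ttft_s": (first_chunk_at - start) if first_chunk_at else None,
        "chunks_per_s": chunks / decode_time if decode_time > 0 else None,
        "total_s": total,
    }


if __name__ == "__main__":
    print(probe("Explain KV-cache reuse in one paragraph."))
```

Tools like LLMPerf add concurrent load, token-accurate counting, and percentile statistics on top of this basic loop; running the same probe against each engine with the same model is the simplest apples-to-apples comparison.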