Testing Qwen3-Omni Audio Inputs
People
David
Idea
Part of our work in the Kumubot cluster involves handling both text and audio recordings - the idea was to figure out the capabilities of Qwen3-Omni on our Blackwell hardware
Details
- Blackwell software support (at least for the RTX 6000 Pro) is still pretty early
- Qwen describes compatibility with vLLM for high-throughput inference - though support for workstation Blackwell cards only arrived in the latest releases
- Using the nightly Docker images (tested with the latest from Nov 2nd 2025), vLLM is now on PyTorch 2.9 and CUDA 12.9, so I was able to get Qwen3-Omni working with both audio and visual inputs
- Our workstation card is a pretty good fit for the current optimizations on Qwen3-Omni-30b-AWQ-4bit - with the text, audio, and video caching included, our GPU is expected to be able to process 10 concurrent requests
- Example of the type of request that can be sent to the model and example response (image and audio):
- Input:
LLM_IMAGE_URL="https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Cow-on_pole%2C_with_antlers.jpeg/960px-Cow-on_pole%2C_with_antlers.jpeg"
LLM_AUDIO_URL="https://upload.wikimedia.org/wikipedia/commons/7/72/Whiskers%27_purr_edit.ogg"
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "$LLM_IMAGE_URL"}},
{"type": "audio_url", "audio_url": {"url": "$LLM_AUDIO_URL"}},
{"type": "text", "text": "What is the image showing and separately what it the sound in the audio? Answer each question with just one sentence"}
]}
]
- Output example:
"content": "The image shows a large, black and white statue of a cow with deer antlers perched on top of a utility pole. The audio contains the distinct sound of a cat purring."
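For anyone wanting to reproduce the request above, here is a minimal Python sketch of how it could be sent to a vLLM OpenAI-compatible chat endpoint. The server address (`http://localhost:8000/v1`), the model name passed in the payload, and the helper names `build_payload`, `send_chat`, and `send_many` are all assumptions for illustration and not part of the original test setup; adjust them to match however the server was actually launched. The `max_workers=10` default only mirrors the 10 concurrent requests estimated above and is not a measured limit.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: vLLM's OpenAI-compatible server is reachable here; change host/port as needed.
BASE_URL = "http://localhost:8000/v1"
# Assumption: placeholder model name; use the name the server was started with.
MODEL = "Qwen3-Omni-30b-AWQ-4bit"

LLM_IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Cow-on_pole%2C_with_antlers.jpeg/960px-Cow-on_pole%2C_with_antlers.jpeg"
LLM_AUDIO_URL = "https://upload.wikimedia.org/wikipedia/commons/7/72/Whiskers%27_purr_edit.ogg"


def build_payload(image_url, audio_url, question, model=MODEL):
    """Build a chat request with one image, one audio clip, and a text question."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "audio_url", "audio_url": {"url": audio_url}},
                {"type": "text", "text": question},
            ]}
        ],
    }


def send_chat(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def send_many(payloads, max_workers=10):
    """Send several requests in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_chat, payloads))


payload = build_payload(
    LLM_IMAGE_URL,
    LLM_AUDIO_URL,
    "What is the image showing and separately what is the sound in the audio? "
    "Answer each question with just one sentence",
)
# Print the request body only; no network call happens until send_chat is invoked.
print(json.dumps(payload, indent=2))
```

With a server running, `send_chat(payload)` should return the text found under `"content"` in the response, and `send_many([payload] * 10)` is one way to exercise the concurrent case.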