Testing Qwen3-Omni Audio Inputs

People

David

Idea

Part of our work in the Kumubot cluster involves handling both text and audio recordings. The idea was to figure out the capabilities of Qwen3-Omni on our Blackwell hardware.

Details

  • Blackwell software support (at least for the RTX 6000 Pro) is still pretty early
  • Qwen describes compatibility with vLLM for high-throughput inference, though support for workstation Blackwell cards only arrived in the latest releases
  • The nightly Docker images (tested with the latest from Nov 2nd 2025) now ship vLLM with PyTorch 2.9 and CUDA 12.9, so I was able to get Qwen3-Omni working with both audio and visual inputs
  • Our workstation card is a pretty good fit for the current optimizations on Qwen3-Omni-30b-AWQ-4bit: including the text, audio, and video caching, our GPU is expected to be able to process 10 concurrent requests
  • Example of the type of request that can be sent to the model and example response (image and audio):
    • Input:
LLM_IMAGE_URL="https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Cow-on_pole%2C_with_antlers.jpeg/960px-Cow-on_pole%2C_with_antlers.jpeg"
LLM_AUDIO_URL="https://upload.wikimedia.org/wikipedia/commons/7/72/Whiskers%27_purr_edit.ogg"
"messages": [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "$LLM_IMAGE_URL"}},
        {"type": "audio_url", "audio_url": {"url": "$LLM_AUDIO_URL"}},
        {"type": "text", "text": "What is the image showing and separately what is the sound in the audio? Answer each question with just one sentence"}
    ]}
  ]
    • Output example:
"content": "The image shows a large, black and white statue of a cow with deer antlers perched on top of a utility pole. The audio contains the distinct sound of a cat purring."
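
For anyone who wants to reproduce the request above from a script, here is a minimal Python sketch. It assumes the model is served through vLLM's OpenAI-compatible chat completions endpoint. The server address (`VLLM_URL`) and the served model name (`MODEL_NAME`) are placeholders that are not taken from these notes, and the helper names `build_payload` and `send_request` are made up for illustration, so adjust them to match the actual deployment.

```python
import json
import urllib.request

# Assumptions (not from the notes above): where the server listens and
# what name the model is served under. Replace both with real values.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "Qwen3-Omni-30B-AWQ-4bit"  # placeholder served-model name


def build_payload(image_url, audio_url, question, model=MODEL_NAME):
    """Build a chat-completions body with one image, one audio clip, and a text prompt."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def send_request(payload, url=VLLM_URL, timeout=120):
    """POST the payload as JSON and return the assistant's text reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Calling `send_request(build_payload(image_url, audio_url, question))` with the two Wikimedia URLs and the question from the example should return a reply along the lines of the output shown, though the exact wording will vary between runs.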
