Cloud-hosted LLMs are convenient, but they come with three real costs. Every prompt and document leaves your infrastructure, monthly bills scale unpredictably with usage, and the entire workflow stops the moment an API key gets revoked or a regional block kicks in.
Qwen, the open-source model family from Alibaba, removes all three. It is licensed under Apache 2.0, which permits royalty-free commercial use, and the lineup covers everything from a 1.7B model that runs on a laptop to a 235B mixture-of-experts that competes with GPT-4 class systems. Combined with Docker, the whole stack becomes portable, reproducible, and ready to deploy on any Linux box with an internet connection.
In this guide we will walk through the full setup, from a fresh server to a working Qwen instance with a ChatGPT-style interface, in under fifteen minutes.
What we will build
The stack has two containers managed by a single Docker Compose file:
- Ollama runs the model and exposes an OpenAI-compatible API on port 11434.
- Open WebUI is a browser interface that talks to Ollama and gives us conversation history, document upload, and multi-user access on port 8080.
Why this combination works well: Ollama handles model management with a single command, supports GPU acceleration out of the box, and the API mirrors OpenAI's format so any existing client code keeps working with a base URL change. Open WebUI adds the user-facing layer without any custom code on our part.
Choosing the right Qwen model for your hardware
Qwen models are labeled by parameter count, where B stands for billions. More parameters generally mean better answers but require more memory. The second factor is quantization, a compression technique that shrinks the model with minimal quality loss. The Q4_K_M variant is the practical default: it cuts memory needs to roughly a quarter of the 16-bit footprint and is what we will use here.
The table below shows the most common Qwen3 sizes on Q4_K_M with an 8K context window:
| Model | On-disk size | Memory needed | Best for |
|---|---|---|---|
| Qwen3 1.7B | 1.4 GB | 2 to 3 GB RAM | Edge devices, light testing |
| Qwen3 4B | 2.5 GB | 4 to 5 GB RAM | CPU-only servers |
| Qwen3 8B | 5 GB | 6 to 7 GB VRAM/RAM | Daily chat, RAG, balanced choice |
| Qwen3 14B | 9 GB | 10 to 12 GB VRAM | Higher-quality reasoning |
| Qwen3 32B | 20 GB | 22 to 24 GB VRAM | RTX 4090 or A6000 |
| Qwen3 30B-A3B (MoE) | 18 GB | 20+ GB VRAM | Speed via active-only params |
For most teams, Qwen3 8B is the sweet spot. It runs comfortably on a 12 GB GPU or on CPU with 16 GB RAM, and quality is high enough for chat, code help, and document analysis. Note that Qwen3.5 and Qwen3.6 are not yet supported in Ollama because they ship multimodal vision components in separate files; for those, vLLM or llama.cpp are the routes to take.
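The memory column above can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes roughly 0.55 bytes per parameter for Q4_K_M weights plus a fixed overhead for the runtime and a small KV cache; both figures are rough assumptions, not measurements:

```python
# Back-of-the-envelope memory estimate for a Q4_K_M-quantized model.
# 0.55 bytes/parameter and the 1.5 GB overhead are rough assumptions.
def estimate_gb(params_billions: float,
                bytes_per_param: float = 0.55,
                overhead_gb: float = 1.5) -> float:
    """Approximate memory footprint in GB for a quantized model."""
    weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3
    return round(weights_gb + overhead_gb, 1)

for size in (1.7, 4, 8, 14, 32):
    print(f"Qwen3 {size}B: ~{estimate_gb(size)} GB")
```

The estimates land in the same ballpark as the table; real usage varies with context length and runtime settings.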
Prerequisites: server, Docker, and the NVIDIA toolkit
We need three things before starting: a Linux server, Docker with Compose, and (for GPU users) the NVIDIA Container Toolkit.
For the server, Ubuntu 22.04 or 24.04 is the cleanest base. Memory requirements depend on the model; 16 GB RAM is the minimum for Qwen3 8B on CPU, and a GPU with at least 8 GB VRAM unlocks fast inference. If you do not have a machine handy, a GPU-enabled cloud server from Serverspace covers the full Qwen lineup from 1.7B to 32B without any local hardware investment.
Install Docker and Compose on Ubuntu:
sudo apt update
sudo apt install -y docker.io docker-compose-v2
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
Log out and back in for the group change to apply. Verify with:
docker --version
docker compose version
If you have an NVIDIA GPU, install the Container Toolkit. Without it, Docker cannot pass the GPU into containers:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sudo sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Confirm GPU passthrough works:
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
A successful run prints the GPU model and driver version. On macOS, GPU passthrough in Docker Desktop is not available, so the Mac path is CPU inference or running Ollama natively outside Docker.
Step 1: Writing the docker-compose.yml
Create a working directory and a compose file:
mkdir ~/qwen-stack && cd ~/qwen-stack
nano docker-compose.yml
Paste the following:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=replace-with-a-long-random-string
    volumes:
      - openwebui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  openwebui_data:
A few details worth understanding:
- Volumes keep downloaded models and user data on disk between container restarts. Without them, every docker compose down wipes the model and we re-download several gigabytes.
- OLLAMA_BASE_URL=http://ollama:11434 uses Docker's internal DNS so Open WebUI finds the Ollama container by service name, not by localhost.
- WEBUI_SECRET_KEY signs the session tokens Open WebUI issues. Replace the placeholder with a long random string before going live.
- For CPU-only servers, remove the entire deploy block (the GPU reservation) from the ollama service.
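A quick way to produce a value for WEBUI_SECRET_KEY is openssl, which any Ubuntu server already has:

```shell
# 32 random bytes, hex-encoded: a 64-character secret for WEBUI_SECRET_KEY
openssl rand -hex 32
```

Paste the output into the compose file in place of the placeholder.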
Step 2: Launching the stack
Bring everything up:
docker compose up -d
The first run pulls both images, about 2 GB total. Verify they started:
docker compose ps
Expected output shows two containers in the Up state:
NAME         IMAGE                                STATUS
ollama       ollama/ollama:latest                 Up 30 seconds
open-webui   ghcr.io/open-webui/open-webui:main   Up 30 seconds
Check that GPU detection worked:
docker compose logs ollama | grep -i "cuda\|gpu"
You should see lines confirming CUDA driver discovery. If nothing matches, the toolkit step likely needs revisiting.
Step 3: Pulling the Qwen model
Models are pulled inside the running Ollama container:
docker exec -it ollama ollama pull qwen3:8b
The download takes a few minutes depending on bandwidth. After it finishes, list installed models:
docker exec -it ollama ollama list
NAME        ID              SIZE      MODIFIED
qwen3:8b    abc123def456    5.2 GB    About a minute ago
Run a quick sanity check from the CLI:
docker exec -it ollama ollama run qwen3:8b "Explain Docker volumes in two sentences."
If the model responds, the stack is working end-to-end.
Step 4: First chat through Open WebUI
Open http://your-server-ip:8080 in a browser. The first visit shows a registration form.
Important: the first registered account becomes the administrator automatically. Every signup after that goes into a queue and needs admin approval. Register your own account immediately so a stranger does not claim admin on a public IP.
After signup, the chat interface appears. Pick qwen3:8b from the model selector at the top and send a message. From here, Open WebUI gives us conversation history, document upload for retrieval-augmented generation, system prompt configuration per chat, and team access management.
Using the OpenAI-compatible API
Ollama exposes an API that mirrors OpenAI's chat completions format. Any existing client code switches over by changing the base URL.
Quick test with curl:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
The same call from Python using the official openai library:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
The api_key value is ignored by Ollama; we just need it present so the OpenAI client does not refuse to send the request.
A safety note: do not expose port 11434 to the public internet without authentication. Ollama has no built-in auth, so anyone reaching the port can use the model freely. Put a reverse proxy with TLS and basic auth in front of it, or restrict access by firewall to known IPs only.
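One possible shape for that reverse proxy is an nginx server block with TLS and basic auth. This is a sketch: the domain, certificate paths, and htpasswd file below are placeholders to adapt to your setup:

```nginx
server {
    listen 443 ssl;
    server_name llm.example.com;                    # placeholder domain

    ssl_certificate     /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location / {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;  # create with: htpasswd -c
        proxy_pass           http://127.0.0.1:11434;
        proxy_read_timeout   300s;                  # long generations need time
    }
}
```

With this in place, keep 11434 bound to localhost or blocked by firewall so only the proxy can reach it.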
Common pitfalls and how to avoid them
A few issues come up frequently:
- GPU not detected. If inference is slow and docker compose logs ollama mentions CPU only, the NVIDIA Container Toolkit is likely not configured. Re-run the nvidia-ctk step and restart Docker.
- Out-of-memory on model load. Qwen3 14B does not fit into 8 GB VRAM. Pick a smaller model or a more aggressive quantization with ollama pull qwen3:8b-q4_0.
- Open WebUI shows no models. The model dropdown is empty when OLLAMA_BASE_URL is wrong. Inside Compose, the value must be http://ollama:11434, not http://localhost:11434.
- Models disappear after docker compose down. The volumes section is missing or misnamed. Check that ollama_data is declared at the bottom of the compose file.
- Default 2048 context is too short. Open WebUI lets us raise num_ctx per model in settings. 8192 or 16384 covers most long-document workflows; higher values cost more memory.
- CUDA 13.2 produces gibberish on Qwen3.6. A known driver bug as of April 2026; stay on CUDA 12.x for now.
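The memory cost of a larger num_ctx comes mostly from the KV cache, which grows linearly with context length. A rough sketch of that growth, assuming Qwen3 8B's architecture (36 layers, 8 KV heads, head dimension 128, FP16 cache); these numbers are assumptions here, so verify them against the model card:

```python
# Approximate FP16 KV-cache size as context length grows.
# Layer/head/dim values are assumptions for Qwen3 8B; check the model card.
def kv_cache_gb(ctx_len: int, n_layers: int = 36, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return round(ctx_len * per_token / 1024**3, 2)

for ctx in (2048, 8192, 16384):
    print(f"num_ctx={ctx}: ~{kv_cache_gb(ctx)} GB KV cache")
```

Doubling the context doubles this cost, which is why jumping straight to very large windows can push a model that fit comfortably over its memory budget.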
When Ollama is not enough: vLLM as a production alternative
Ollama is ideal for solo developers and small teams. When concurrent load grows past a dozen users or batched inference becomes a bottleneck, vLLM is the better backend. It runs Qwen with higher throughput, supports tensor parallelism across multiple GPUs, and serves FP8 or AWQ pre-quantized weights.
The minimal vLLM Docker invocation looks like this:
docker run --gpus all -p 8000:8000 --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest --model Qwen/Qwen3-8B
The OpenAI-compatible API lives on port 8000, and Open WebUI works with it just as well; since vLLM speaks the OpenAI API rather than the Ollama one, the connection goes through Open WebUI's OpenAI settings (the OPENAI_API_BASE_URL environment variable or the admin panel), not OLLAMA_BASE_URL.
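For reference, a hypothetical compose fragment for the vLLM route could look like the following. The service names are our own choices, and OPENAI_API_BASE_URL is Open WebUI's variable for OpenAI-compatible backends; treat this as a sketch rather than a tested config:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: --model Qwen/Qwen3-8B
    ports:
      - "8000:8000"
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "8080:8080"
    environment:
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=unused  # vLLM ignores it unless started with --api-key
```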
Wrapping up
In about fifteen minutes we went from a fresh server to a fully private Qwen deployment with a chat interface and an OpenAI-compatible API. No external services, no per-token billing, no data leaving the box. From here, useful next steps are loading internal documents into Open WebUI for retrieval-augmented generation, wiring the API into existing applications, and scaling up to a larger Qwen model if quality demands it.
The model lives in your volume, the code is one compose file, and the whole thing rebuilds anywhere with docker compose up -d. That is what self-hosted AI looks like in 2026.