27.04.2026

Deploying Qwen with Docker: Private AI Without External APIs

Cloud-hosted LLMs are convenient, but they come with three real costs. Every prompt and document leaves your infrastructure, monthly bills scale unpredictably with usage, and the entire workflow stops the moment an API key gets revoked or a regional block kicks in.

Qwen, the open-source model family from Alibaba, removes all three. It is licensed under Apache 2.0, which allows free commercial use, and the lineup covers everything from a 1.7B model that runs on a laptop to a 235B mixture-of-experts that competes with GPT-4 class systems. Combined with Docker, the whole stack becomes portable, reproducible, and ready to deploy on any Linux box with an internet connection.

In this guide we will walk through the full setup, from a fresh server to a working Qwen instance with a ChatGPT-style interface, in under fifteen minutes.

What we will build

The stack has two containers managed by a single Docker Compose file:

- Ollama, the inference backend that downloads, stores, and serves Qwen models and exposes its API on port 11434.
- Open WebUI, a ChatGPT-style browser interface on port 8080 that talks to Ollama over the internal Docker network.

Why this combination works well: Ollama handles model management with a single command, supports GPU acceleration out of the box, and the API mirrors OpenAI's format so any existing client code keeps working with a base URL change. Open WebUI adds the user-facing layer without any custom code on our part.

Choosing the right Qwen model for your hardware

Qwen models are labeled by parameter count, where B stands for billions. More parameters generally mean better answers but also require more memory. The second factor is quantization, a compression technique that stores the weights in fewer bits with minimal quality loss. The Q4_K_M variant is the practical default: it cuts the memory footprint to roughly a quarter of the full-precision weights and is what we will use here.
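
As a rough sanity check rather than an official sizing formula, weight memory is roughly the parameter count times bits per weight divided by eight, plus a gigabyte or two for the KV cache and runtime. For Qwen3 8B at Q4_K_M (about 4.5 bits per weight on average):

awk 'BEGIN { printf "about %.1f GB of weights, plus 1-2 GB for cache and runtime\n", 8e9 * 4.5 / 8 / 1e9 }'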

The table below shows the most common Qwen3 sizes on Q4_K_M with an 8K context window:

Model                  On-disk size   Memory needed        Best for
Qwen3 1.7B             1.4 GB         2 to 3 GB RAM        Edge devices, light testing
Qwen3 4B               2.5 GB         4 to 5 GB RAM        CPU-only servers
Qwen3 8B               5 GB           6 to 7 GB VRAM/RAM   Daily chat, RAG, balanced choice
Qwen3 14B              9 GB           10 to 12 GB VRAM     Higher-quality reasoning
Qwen3 32B              20 GB          22 to 24 GB VRAM     RTX 4090 or A6000
Qwen3 30B-A3B (MoE)    18 GB          20+ GB VRAM          Speed via active-only params

For most teams, Qwen3 8B is the sweet spot. It runs comfortably on a 12 GB GPU or on CPU with 16 GB RAM, and quality is high enough for chat, code help, and document analysis. Note that Qwen3.5 and Qwen3.6 are not yet supported in Ollama because they ship multimodal vision components in separate files; for those, vLLM or llama.cpp are the routes to take.

Prerequisites: server, Docker, and the NVIDIA toolkit

We need three things before starting: a Linux server, Docker with Compose, and (for GPU users) the NVIDIA Container Toolkit.

For the server, Ubuntu 22.04 or 24.04 is the cleanest base. Memory requirements depend on the model; 16 GB RAM is the minimum for Qwen3 8B on CPU, and a GPU with at least 8 GB VRAM unlocks fast inference. If you do not have a machine handy, a GPU-enabled cloud server from Serverspace covers the full Qwen lineup from 1.7B to 32B without any local hardware investment.

Install Docker and Compose on Ubuntu:

sudo apt update
sudo apt install -y docker.io docker-compose-v2
sudo systemctl enable --now docker
sudo usermod -aG docker $USER

 

Log out and back in for the group change to apply. Verify with:

docker --version

docker compose version

 

If you have an NVIDIA GPU, install the Container Toolkit. Without it, Docker cannot pass the GPU into containers:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sudo sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

 

Confirm GPU passthrough works:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

 

A successful run prints the GPU model and driver version. On macOS, GPU passthrough in Docker Desktop is not available, so the Mac path is CPU inference or running Ollama natively outside Docker.
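
If you do take the Mac route, a minimal sketch (assuming Homebrew is installed) is to run Ollama natively and let the Open WebUI container reach it through Docker Desktop's host alias:

brew install ollama
ollama serve &
ollama pull qwen3:8b
# then set OLLAMA_BASE_URL=http://host.docker.internal:11434 in docker-compose.yml
# so the Open WebUI container talks to the native Ollama process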

Step 1: Writing the docker-compose.yml

Create a working directory and a compose file:

mkdir ~/qwen-stack && cd ~/qwen-stack

nano docker-compose.yml

 

Paste the following:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=replace-with-a-long-random-string
    volumes:
      - openwebui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  openwebui_data:

 

A few details worth understanding:

- The named volumes ollama_data and openwebui_data keep downloaded models and chat history across container restarts and image upgrades.
- OLLAMA_BASE_URL points at the service name ollama; Compose gives containers DNS entries on a shared network, so no IP addresses are needed.
- The deploy.resources block passes the GPU through to Ollama. On a CPU-only machine, remove that block and the stack still runs, just slower.
- WEBUI_SECRET_KEY signs Open WebUI sessions, so replace the placeholder with a real random value (a one-liner for generating one follows below).
- restart: unless-stopped brings both containers back automatically after a reboot or a Docker restart.
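
The secret key only needs to be a long random value; one quick way to generate it on the server (openssl ships with Ubuntu) is:

openssl rand -hex 32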

Step 2: Launching the stack

Bring everything up:

docker compose up -d

 

The first run pulls both images, about 2 GB total. Verify they started:

docker compose ps

 

Expected output shows two containers in the Up state:

NAME         IMAGE                                STATUS

ollama       ollama/ollama:latest                 Up 30 seconds

open-webui   ghcr.io/open-webui/open-webui:main   Up 30 seconds

 

Check that GPU detection worked:

docker compose logs ollama | grep -i "cuda\|gpu"

 

You should see lines confirming CUDA driver discovery. If nothing matches, the toolkit step likely needs revisiting.
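
GPU or not, a quick way to confirm the API itself is reachable from the host is the version endpoint Ollama exposes on the published port:

curl -s http://localhost:11434/api/version
# expected: a small JSON object along the lines of {"version":"0.x.y"}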

Step 3: Pulling the Qwen model

Models are pulled inside the running Ollama container:

docker exec -it ollama ollama pull qwen3:8b

 

The download takes a few minutes depending on bandwidth. After it finishes, list installed models:

docker exec -it ollama ollama list

 

NAME        ID              SIZE    MODIFIED

qwen3:8b    abc123def456    5.2 GB  About a minute ago

 

Run a quick sanity check from the CLI:

docker exec -it ollama ollama run qwen3:8b "Explain Docker volumes in two sentences."

 

If the model responds, the stack is working end-to-end.
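
If 8B is too heavy or too light for your hardware, the other sizes from the table are pulled the same way; the tags follow the same qwen3:<size> pattern in the Ollama library:

docker exec -it ollama ollama pull qwen3:4b    # lighter option for CPU-only servers
docker exec -it ollama ollama pull qwen3:14b   # more quality if you have 12+ GB of VRAM
docker exec -it ollama ollama rm qwen3:8b      # remove a model you no longer need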

Step 4: First chat through Open WebUI

Open http://your-server-ip:8080 in a browser. The first visit shows a registration form.

Important: the first registered account becomes the administrator automatically. Every signup after that goes into a queue and needs admin approval. Register your own account immediately so a stranger does not claim admin on a public IP.

After signup, the chat interface appears. Pick qwen3:8b from the model selector at the top and send a message. From here, Open WebUI gives us conversation history, document upload for retrieval-augmented generation, system prompt configuration per chat, and team access management.

Using the OpenAI-compatible API

Ollama exposes an API that mirrors OpenAI's chat completions format. Any existing client code switches over by changing the base URL.

Quick test with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

 

The same call from Python using the official openai library:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(resp.choices[0].message.content)

 

The api_key value is ignored by Ollama; we just need it present so the OpenAI client does not refuse to send the request.
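
Streaming goes through the same endpoint; adding "stream": true to the payload returns tokens as server-sent events, and curl's -N flag disables buffering so they print as they arrive:

curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'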

A safety note: do not expose port 11434 to the public internet without authentication. Ollama has no built-in auth, so anyone reaching the port can use the model freely. Put a reverse proxy with TLS and basic auth in front of it, or restrict access by firewall to known IPs only.
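
One low-effort hardening step with this particular compose file is to publish the Ollama port on the loopback interface only; Open WebUI still reaches Ollama over the internal Docker network, but the API is no longer visible from outside the host. Shown here as a quick sed edit of the file from Step 1 (adjust the path if yours lives elsewhere):

cd ~/qwen-stack
sed -i 's#"11434:11434"#"127.0.0.1:11434:11434"#' docker-compose.yml
docker compose up -d   # recreates the ollama container with the new port binding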

Common pitfalls and how to avoid them

A few issues come up frequently:

- GPU not detected inside the container. Usually the NVIDIA Container Toolkit step was skipped or Docker was not restarted afterwards; rerun nvidia-ctk runtime configure --runtime=docker, restart Docker, and check the Ollama logs again.
- Out-of-memory errors or very slow generation. The model is too big for the available VRAM or RAM; drop down a size from the table above, for example qwen3:4b instead of qwen3:8b.
- "permission denied" when running docker. The usermod group change from the prerequisites only takes effect after logging out and back in.
- Port 8080 or 11434 already in use. Change the host side of the port mapping in docker-compose.yml, for example "8081:8080".
- Open WebUI shows no models in the selector. Either the model has not been pulled yet, or OLLAMA_BASE_URL does not point at the ollama service.

When Ollama is not enough: vLLM as a production alternative

Ollama is ideal for solo developers and small teams. When concurrent load grows past a dozen users or batched inference becomes a bottleneck, vLLM is the better backend. It runs Qwen with higher throughput, supports tensor parallelism across multiple GPUs, and serves FP8 or AWQ pre-quantized weights.

The minimal vLLM Docker invocation looks like this:

docker run --gpus all -p 8000:8000 --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest --model Qwen/Qwen3-8B

 

The OpenAI-compatible API lives on port 8000, and Open WebUI works with it just as well; because vLLM speaks the OpenAI protocol rather than Ollama's, point the OPENAI_API_BASE_URL variable (not OLLAMA_BASE_URL) at the vLLM endpoint, as sketched below.
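
A minimal sketch of that wiring, assuming vLLM is running on the same host on port 8000 (the --add-host flag makes host.docker.internal resolve on Linux):

docker run -d -p 8080:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=none \
  -v openwebui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main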

Wrapping up

In about fifteen minutes we went from a fresh server to a fully private Qwen deployment with a chat interface and an OpenAI-compatible API. No external services, no per-token billing, no data leaving the box. From here, useful next steps are loading internal documents into Open WebUI for retrieval-augmented generation, wiring the API into existing applications, and scaling up to a larger Qwen model if quality demands it.

The model lives in your volume, the code is one compose file, and the whole thing rebuilds anywhere with docker compose up -d. That is what self-hosted AI looks like in 2026.