27.04.2026

Deploying Qwen with Docker: Private AI Without External APIs

Cloud-hosted LLMs are convenient, but they come with three real costs. Every prompt and document leaves your infrastructure, monthly bills scale unpredictably with usage, and the entire workflow stops the moment an API key gets revoked or a regional block kicks in.

Qwen, the open-source model family from Alibaba, removes all three. It is licensed under Apache 2.0, which allows free commercial use, and the lineup covers everything from a 1.7B model that runs on a laptop to a 235B mixture-of-experts that competes with GPT-4 class systems. Combined with Docker, the whole stack becomes portable, reproducible, and ready to deploy on any Linux box with an internet connection.

In this guide we will walk through the full setup, from a fresh server to a working Qwen instance with a ChatGPT-style interface, in under fifteen minutes.

What we will build

The stack has two containers managed by a single Docker Compose file:

- Ollama, the inference backend that downloads, stores, and serves Qwen models and exposes its API on port 11434.
- Open WebUI, a ChatGPT-style browser interface on port 8080 that talks to Ollama over the internal Docker network.

Why this combination works well: Ollama handles model management with a single command, supports GPU acceleration out of the box, and the API mirrors OpenAI's format so any existing client code keeps working with a base URL change. Open WebUI adds the user-facing layer without any custom code on our part.

Choosing the right Qwen model for your hardware

Qwen models are labeled by parameter count, where B stands for billions. More parameters generally mean better answers but also require more memory. The second factor is quantization, a compression technique that stores the weights in fewer bits with minimal quality loss. The Q4_K_M variant is the practical default: it cuts the memory footprint to roughly a quarter of the full-precision weights and is what we will use here.
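
As a rough sanity check rather than an official sizing formula, weight memory is roughly the parameter count times bits per weight divided by eight, plus a gigabyte or two for the KV cache and runtime. For Qwen3 8B at Q4_K_M (about 4.5 bits per weight on average):

awk 'BEGIN { printf "about %.1f GB of weights, plus 1-2 GB for cache and runtime\n", 8e9 * 4.5 / 8 / 1e9 }'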

The table below shows the most common Qwen3 sizes on Q4_K_M with an 8K context window:

Model                  On-disk size   Memory needed        Best for
Qwen3 1.7B             1.4 GB         2 to 3 GB RAM        Edge devices, light testing
Qwen3 4B               2.5 GB         4 to 5 GB RAM        CPU-only servers
Qwen3 8B               5 GB           6 to 7 GB VRAM/RAM   Daily chat, RAG, balanced choice
Qwen3 14B              9 GB           10 to 12 GB VRAM     Higher-quality reasoning
Qwen3 32B              20 GB          22 to 24 GB VRAM     RTX 4090 or A6000
Qwen3 30B-A3B (MoE)    18 GB          20+ GB VRAM          Speed via active-only params

For most teams, Qwen3 8B is the sweet spot. It runs comfortably on a 12 GB GPU or on CPU with 16 GB RAM, and quality is high enough for chat, code help, and document analysis. Note that Qwen3.5 and Qwen3.6 are not yet supported in Ollama because they ship multimodal vision components in separate files; for those, vLLM or llama.cpp are the routes to take.

Prerequisites: server, Docker, and the NVIDIA toolkit

We need three things before starting: a Linux server, Docker with Compose, and (for GPU users) the NVIDIA Container Toolkit.

For the server, Ubuntu 22.04 or 24.04 is the cleanest base. Memory requirements depend on the model; 16 GB RAM is the minimum for Qwen3 8B on CPU, and a GPU with at least 8 GB VRAM unlocks fast inference. If you do not have a machine handy, a GPU-enabled cloud server from Serverspace covers the full Qwen lineup from 1.7B to 32B without any local hardware investment.

Install Docker and Compose on Ubuntu:

sudo apt update
sudo apt install -y docker.io docker-compose-v2
sudo systemctl enable --now docker
sudo usermod -aG docker $USER

 

Log out and back in for the group change to apply. Verify with:

docker --version

docker compose version

 

If you have an NVIDIA GPU, install the Container Toolkit. Without it, Docker cannot pass the GPU into containers:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sudo sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

 

Confirm GPU passthrough works:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

 

A successful run prints the GPU model and driver version. On macOS, GPU passthrough in Docker Desktop is not available, so the Mac path is CPU inference or running Ollama natively outside Docker.
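
If you do take the Mac route, a minimal sketch (assuming Homebrew is installed) is to run Ollama natively and let the Open WebUI container reach it through Docker Desktop's host alias:

brew install ollama
ollama serve &
ollama pull qwen3:8b
# then set OLLAMA_BASE_URL=http://host.docker.internal:11434 in docker-compose.yml
# so the Open WebUI container talks to the native Ollama process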

Step 1: Writing the docker-compose.yml

Create a working directory and a compose file:

mkdir ~/qwen-stack && cd ~/qwen-stack

nano docker-compose.yml

 

Paste the following:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=replace-with-a-long-random-string
    volumes:
      - openwebui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  openwebui_data:

 

A few details worth understanding:

- The named volumes ollama_data and openwebui_data keep downloaded models and chat history across container restarts and image upgrades.
- OLLAMA_BASE_URL points at the service name ollama; Compose gives containers DNS entries on a shared network, so no IP addresses are needed.
- The deploy.resources block passes the GPU through to Ollama. On a CPU-only machine, remove that block and the stack still runs, just slower.
- WEBUI_SECRET_KEY signs Open WebUI sessions, so replace the placeholder with a real random value (a one-liner for generating one follows below).
- restart: unless-stopped brings both containers back automatically after a reboot or a Docker restart.
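
The secret key only needs to be a long random value; one quick way to generate it on the server (openssl ships with Ubuntu) is:

openssl rand -hex 32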

Step 2: Launching the stack

Bring everything up:

docker compose up -d

 

The first run pulls both images, about 2 GB total. Verify they started:

docker compose ps

 

Expected output shows two containers in the Up state:

NAME         IMAGE                                STATUS

ollama       ollama/ollama:latest                 Up 30 seconds

open-webui   ghcr.io/open-webui/open-webui:main   Up 30 seconds

 

Check that GPU detection worked:

docker compose logs ollama | grep -i "cuda\|gpu"

 

You should see lines confirming CUDA driver discovery. If nothing matches, the toolkit step likely needs revisiting.
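
GPU or not, a quick way to confirm the API itself is reachable from the host is the version endpoint Ollama exposes on the published port:

curl -s http://localhost:11434/api/version
# expected: a small JSON object along the lines of {"version":"0.x.y"}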

Step 3: Pulling the Qwen model

Models are pulled inside the running Ollama container:

docker exec -it ollama ollama pull qwen3:8b

 

The download takes a few minutes depending on bandwidth. After it finishes, list installed models:

docker exec -it ollama ollama list

 

NAME        ID              SIZE    MODIFIED

qwen3:8b    abc123def456    5.2 GB  About a minute ago

 

Run a quick sanity check from the CLI:

docker exec -it ollama ollama run qwen3:8b "Explain Docker volumes in two sentences."

 

If the model responds, the stack is working end-to-end.
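
If 8B is too heavy or too light for your hardware, the other sizes from the table are pulled the same way; the tags follow the same qwen3:<size> pattern in the Ollama library:

docker exec -it ollama ollama pull qwen3:4b    # lighter option for CPU-only servers
docker exec -it ollama ollama pull qwen3:14b   # more quality if you have 12+ GB of VRAM
docker exec -it ollama ollama rm qwen3:8b      # remove a model you no longer need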

Step 4: First chat through Open WebUI

Open http://your-server-ip:8080 in a browser. The first visit shows a registration form.

Important: the first registered account becomes the administrator automatically. Every signup after that goes into a queue and needs admin approval. Register your own account immediately so a stranger does not claim admin on a public IP.

After signup, the chat interface appears. Pick qwen3:8b from the model selector at the top and send a message. From here, Open WebUI gives us conversation history, document upload for retrieval-augmented generation, system prompt configuration per chat, and team access management.

Using the OpenAI-compatible API

Ollama exposes an API that mirrors OpenAI's chat completions format. Any existing client code switches over by changing the base URL.

Quick test with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

 

The same call from Python using the official openai library:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(resp.choices[0].message.content)

 

The api_key value is ignored by Ollama; we just need it present so the OpenAI client does not refuse to send the request.
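
Streaming goes through the same endpoint; adding "stream": true to the payload returns tokens as server-sent events, and curl's -N flag disables buffering so they print as they arrive:

curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'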

A safety note: do not expose port 11434 to the public internet without authentication. Ollama has no built-in auth, so anyone reaching the port can use the model freely. Put a reverse proxy with TLS and basic auth in front of it, or restrict access by firewall to known IPs only.
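
One low-effort hardening step with this particular compose file is to publish the Ollama port on the loopback interface only; Open WebUI still reaches Ollama over the internal Docker network, but the API is no longer visible from outside the host. Shown here as a quick sed edit of the file from Step 1 (adjust the path if yours lives elsewhere):

cd ~/qwen-stack
sed -i 's#"11434:11434"#"127.0.0.1:11434:11434"#' docker-compose.yml
docker compose up -d   # recreates the ollama container with the new port binding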

Common pitfalls and how to avoid them

A few issues come up frequently:

- GPU not detected inside the container. Usually the NVIDIA Container Toolkit step was skipped or Docker was not restarted afterwards; rerun nvidia-ctk runtime configure --runtime=docker, restart Docker, and check the Ollama logs again.
- Out-of-memory errors or very slow generation. The model is too big for the available VRAM or RAM; drop down a size from the table above, for example qwen3:4b instead of qwen3:8b.
- "permission denied" when running docker. The usermod group change from the prerequisites only takes effect after logging out and back in.
- Port 8080 or 11434 already in use. Change the host side of the port mapping in docker-compose.yml, for example "8081:8080".
- Open WebUI shows no models in the selector. Either the model has not been pulled yet, or OLLAMA_BASE_URL does not point at the ollama service.

When Ollama is not enough: vLLM as a production alternative

Ollama is ideal for solo developers and small teams. When concurrent load grows past a dozen users or batched inference becomes a bottleneck, vLLM is the better backend. It runs Qwen with higher throughput, supports tensor parallelism across multiple GPUs, and serves FP8 or AWQ pre-quantized weights.

The minimal vLLM Docker invocation looks like this:

docker run --gpus all -p 8000:8000 --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest --model Qwen/Qwen3-8B

 

The OpenAI-compatible API lives on port 8000, and Open WebUI works with it just as well; because vLLM speaks the OpenAI protocol rather than Ollama's, point the OPENAI_API_BASE_URL variable (not OLLAMA_BASE_URL) at the vLLM endpoint, as sketched below.
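
A minimal sketch of that wiring, assuming vLLM is running on the same host on port 8000 (the --add-host flag makes host.docker.internal resolve on Linux):

docker run -d -p 8080:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=none \
  -v openwebui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main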

Wrapping up

In about fifteen minutes we went from a fresh server to a fully private Qwen deployment with a chat interface and an OpenAI-compatible API. No external services, no per-token billing, no data leaving the box. From here, useful next steps are loading internal documents into Open WebUI for retrieval-augmented generation, wiring the API into existing applications, and scaling up to a larger Qwen model if quality demands it.

The model lives in your volume, the code is one compose file, and the whole thing rebuilds anywhere with docker compose up -d. That is what self-hosted AI looks like in 2026.