Ollama on a dedicated NVMe with performance tuning
Repurpose a second NVMe as Ollama model storage, install Ollama, and tune it for a 24 GB RTX 3090 — KV cache quantization, FlashAttention, model context windows, and a curated model stack with routing guidance.
- Step 1
Prerequisites
This guide assumes you've already finished NVIDIA GPU passthrough to an Ubuntu VM on Proxmox. That guide covers the underlying setup this one builds on. If you're picking up mid-stack, scan its 'Overview' step to confirm your environment matches.
- Step 2
Set up the model NVMe
Dedicate a fast NVMe to Ollama's model store. The original build used a second 1.8 TB NVMe at /dev/nvme1n1; substitute your own device path throughout.
If the drive has any existing partitions or filesystems, deal with them first using your preferred tools (lvremove/vgremove/pvremove for LVM, mdadm --zero-superblock for old mdraid, etc.) before wiping.
# Wipe the drive
wipefs -a /dev/nvme1n1
sgdisk --zap-all /dev/nvme1n1
# Partition and format
parted /dev/nvme1n1 mklabel gpt
parted /dev/nvme1n1 mkpart primary ext4 0% 100%
mkfs.ext4 /dev/nvme1n1p1
# Mount permanently
mkdir -p /mnt/models
blkid /dev/nvme1n1p1 # note UUID
echo 'UUID=<uuid> /mnt/models ext4 defaults 0 2' >> /etc/fstab
mount -a
systemctl daemon-reload
# Register with Proxmox
pvesm add dir model-storage --path /mnt/models --content images,rootdir
- Step 3
Allocate Disk to Ollama VM
Add a 500 GB virtual disk backed by the model-storage NVMe. Directory-backed storage doesn't support snapshots — exclude this disk from snapshots and backups explicitly so neither operation fails on the VM.
# Add 500GB virtual disk on model-storage to the VM
qm set 100 --scsi1 model-storage:500
# Exclude from snapshots (dir storage doesn't support snapshots)
qm set 100 --scsi1 model-storage:100/vm-100-disk-0.raw,size=500G,backup=0,snapshot=0
- Step 4
Format and Mount Inside VM
Inside the VM, format the new virtual disk as ext4 and mount it permanently at /mnt/models. This is where all Ollama model weights will live.
# SSH into VM
ssh ollama@192.168.20.110
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/models
sudo blkid /dev/sdb # note UUID
echo 'UUID=<uuid> /mnt/models ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount -a
sudo systemctl daemon-reload
df -h /mnt/models # should show ~467GB available
- Step 5
Install Ollama
Run the official Ollama install script. It detects the NVIDIA GPU automatically and installs a systemd service that starts on boot.
curl -fsSL https://ollama.com/install.sh | sh
# Auto-detects NVIDIA GPU
# Creates ollama systemd service
# Verify
systemctl status ollama
ollama list
- Step 6
Point Ollama at the Model NVMe
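By default, Ollama stores models in ~/.ollama/models; for the systemd service installed above, that is /usr/share/ollama/.ollama/models under the ollama user's home. Override this to use the 500 GB disk on the dedicated NVMe: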
sudo mkdir -p /mnt/models/ollama
sudo chown ollama:ollama /mnt/models/ollama
# Create systemd override
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_MODELS=/mnt/models/ollama"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify override
sudo cat /etc/systemd/system/ollama.service.d/override.conf
- Step 7
Performance Tuning
Three systemd drop-in files configure Ollama performance. All are created in /etc/systemd/system/ollama.service.d/ (system service, not user service). Effect of each:
- override.conf: sets the OLLAMA_MODELS path (created in Step 6).
- kv-cache.conf: OLLAMA_KV_CACHE_TYPE=q8_0 stores the KV cache in 8-bit instead of fp16, roughly halving KV VRAM usage with negligible quality loss. Critical for fitting 32K+ context on 24 GB. Ollama applies a quantized KV cache only when FlashAttention is enabled, so this setting depends on performance.conf.
- performance.conf: OLLAMA_FLASH_ATTENTION=1 enables FlashAttention for a 10–30% throughput improvement. OLLAMA_KEEP_ALIVE=20m keeps models loaded for 20 minutes after the last request. Keeping the timeout short helps when multiple runtimes share the GPU: VRAM is freed quickly so a different agent can load its model without waiting.
Note: deepseek-r1:32b is capped at 12K context. At 32K the model weights (19 GB) plus KV cache exceed 24 GB VRAM and the model OOMs. Set num_ctx 12288 in its Modelfile.
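The 12K cap on deepseek-r1:32b can be sanity-checked with rough arithmetic. The sketch below estimates KV cache size from the usual formula (2 tensors x layers x KV heads x head dimension x context length, times storage per value). The architecture numbers are assumptions, not values from this guide: 64 layers, 8 KV heads, and a head dimension of 128, taken from the Qwen2.5-32B base that the distill is built on. Check them against the model's published configuration before relying on the result. The helper name kv_cache_mib is made up for this example.

```shell
#!/bin/sh
# Rough KV cache size estimate in MiB (integer arithmetic only).
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX BYTES_PER_32_VALUES
#   f16  stores 32 values in 64 bytes
#   q8_0 stores 32 values in 34 bytes (32 int8 values plus a 2-byte scale)
kv_cache_mib() {
    layers=$1; kv_heads=$2; head_dim=$3; ctx=$4; block_bytes=$5
    # K and V each hold layers * kv_heads * head_dim values per token.
    values=$((2 * layers * kv_heads * head_dim * ctx))
    echo $((values / 32 * block_bytes / 1048576))
}

# ASSUMED architecture for deepseek-r1:32b: 64 layers, 8 KV heads, head dim 128.
echo "f16  @ 32K: $(kv_cache_mib 64 8 128 32768 64) MiB"
echo "q8_0 @ 32K: $(kv_cache_mib 64 8 128 32768 34) MiB"
echo "q8_0 @ 12K: $(kv_cache_mib 64 8 128 12288 34) MiB"
```

Under these assumptions the cache comes to about 8 GiB at fp16 and 4.25 GiB at q8_0 for 32K, versus about 1.6 GiB at q8_0 for 12K. Added to 19 GB of weights, the 32K case leaves under 1 GB for compute buffers and everything else on the GPU, which is consistent with the OOM described above.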
# /etc/systemd/system/ollama.service.d/kv-cache.conf
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# /etc/systemd/system/ollama.service.d/performance.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=20m"
# Apply the changes
sudo systemctl daemon-reload && sudo systemctl restart ollama
- Step 8
Model Stack
All models use Q4_K_M quantization unless noted and are configured with the context windows below via Modelfile. The RTX 3090 has 24 GB VRAM; leave at least 2–3 GB headroom and do not load multiple large models simultaneously.
| Model | Use | Context | VRAM | Notes |
|---|---|---|---|---|
| llama3.2:1b | Sanity check / GPU test | 32K | 1.3 GB | Q8_0 |
| qwen3-coder:30b | Default / coding / tool calls | 32K | 18 GB | MoE, ~147 tok/s |
| qwen3.5:27b | General reasoning | 64K | 17 GB | Q4_K_M, thinking adaptive |
| deepseek-r1:32b | Deep reasoning | 12K | 19 GB | Q4_K_M, avoid tool-use, ~38 tok/s |
| dolphin3 | Uncensored / fast | 32K | 5 GB | Q4_K_M, ~149 tok/s |
| huihui_ai/qwen3.5-abliterated:27b | Uncensored reasoning | 64K | 17 GB | Q4_K_M |
- Step 9
Model Routing Guidance
When routing tasks from Claude Code to local models via the Ollama MCP server, use this as a guide:
- Default local route: qwen3-coder:30b — best tool calling reliability, RL-trained on SWE-bench
- General reasoning: qwen3.5:27b — use for non-coding tasks; supports adaptive thinking
- Complex reasoning: deepseek-r1:32b — capped at 12K context (OOMs at 32K); avoid for tool-use tasks
- Uncensored general: dolphin3
- Uncensored reasoning: huihui_ai/qwen3.5-abliterated:27b
- Quick tests: llama3.2:1b
Note: Route to uncensored models only when content filters on standard models are blocking a legitimate task, for creative writing with mature themes, or for security research/red-teaming. Never use them as the default.
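The routing list above could be encoded as a small lookup so that scripts pick a model tag consistently. This is only a sketch: the function name route_model and the task category names are hypothetical, invented for this example, and the fallback to qwen3-coder:30b simply mirrors the "default local route" entry.

```shell
#!/bin/sh
# Hypothetical helper: map a task category to a model tag from the routing list.
# The category names are made up for this sketch.
route_model() {
    case "$1" in
        code|tools)           echo "qwen3-coder:30b" ;;
        general)              echo "qwen3.5:27b" ;;
        deep-reasoning)       echo "deepseek-r1:32b" ;;   # 12K context, no tool use
        uncensored)           echo "dolphin3" ;;
        uncensored-reasoning) echo "huihui_ai/qwen3.5-abliterated:27b" ;;
        test)                 echo "llama3.2:1b" ;;
        *)                    echo "qwen3-coder:30b" ;;   # default local route
    esac
}

route_model general
```

A caller might then run something like ollama run "$(route_model deep-reasoning)" 'your prompt', keeping the uncensored categories as explicit opt-ins rather than defaults.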
- Step 10
Pull and Configure
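Each model needs a Modelfile that bakes in its context window. The pattern below uses 32K; substitute the value from the Model Stack table where it differs (for example, num_ctx 12288 for deepseek-r1:32b). The pattern for each: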
# Check disk space first
df -h /mnt/models
# Pull base model
ollama pull <model-name>
# Create Modelfile with 32K context
cat > /tmp/modelfile << 'EOF'
FROM <model-name>
PARAMETER num_ctx 32768
EOF
# Recreate under same name with context baked in
ollama create <model-name>-32k -f /tmp/modelfile
ollama rm <model-name>
ollama create <model-name> -f /tmp/modelfile
ollama rm <model-name>-32k
# Verify GPU inference
ollama run llama3.2:1b 'say hello in one sentence'
watch -n1 nvidia-smi # monitor GPU usage in separate session
# Confirm final model list
ollama list
- Step 11
Verify Model Storage Location
Confirm that the model files landed on the NVMe-backed disk rather than in the default location.
ls /mnt/models/ollama/
# Should show: blobs manifests
ollama list
# Models listed here are stored on the NVMe
- Step 12
Next: continue building the stack
With this layer in place, the next guide in the series is Self-host OpenClaw with HTTPS, Brave search, and GitHub access. Run the OpenClaw agent gateway on your Ollama VM behind an nginx TLS reverse proxy. Wire up the Brave search skill, give it authenticated GitHub access for code tasks, and configure a CLAUDE.md so remote Claude Code sessions have persistent context.