I just finished setting up my first self-hosted AI playground, and I have to say—seeing a large language model running entirely on my own hardware was incredibly satisfying. No API calls, no rate limits, no cloud dependencies. Just pure, local AI goodness.
The Hardware
My setup is running on Proxmox, which gave me the flexibility to spin up a dedicated VM for this project. The star of the show is my NVIDIA GeForce RTX 3060 with 12GB of VRAM—perfectly sized for running smaller language models without breaking the bank.
The Stack
After some research, I settled on a straightforward setup:
- Ollama: The backend that actually runs the models
- Open WebUI: A beautiful ChatGPT-like interface
- Docker: Everything containerized for easy management
- NVIDIA Container Toolkit: The magic that lets Docker containers access my GPU
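For anyone who wants the concrete starting point, this is roughly how I bring the two containers up. The ports, volume names, and the OLLAMA_BASE_URL wiring are just the common defaults from the Ollama and Open WebUI docs, nothing exotic, and the whole thing assumes the NVIDIA Container Toolkit (more on that below) is already installed.

```bash
# Ollama backend with GPU access; model weights live in the "ollama" volume
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --restart always \
  --name ollama \
  ollama/ollama

# Open WebUI frontend, pointed at the Ollama API on the Docker host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --restart always \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```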
The trickiest part was getting the containers to see the GPU at all. The NVIDIA Container Toolkit was essential here—without it, Docker has no idea how to talk to your GPU. Once I got that configured and saw --gpus all actually work, everything else fell into place.
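For reference, the toolkit setup itself is only a few commands on a Debian/Ubuntu guest. This follows NVIDIA's documented flow (their apt repository has to be added first per the install guide), and the CUDA image tag in the sanity check is just one I'd expect to exist, so substitute a current one.

```bash
# Install the toolkit (assumes NVIDIA's apt repository has been added)
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: a throwaway CUDA container should see the GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```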
First Models
I started conservatively with some 7B-parameter models:
Qwen2.5-Coder was a revelation. This model is specifically trained for coding tasks, and watching it generate Python functions or explain complex code snippets—all running locally on my hardware—felt like science fiction. No waiting for API responses, no concerns about sending proprietary code to the cloud.
Llama 3.2 was equally impressive for general conversation. The responses were fast, coherent, and surprisingly nuanced for a model running on consumer hardware.
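Getting those models in the first place is a single pull each. The tags below are the ones I'd expect from the Ollama model library; browse the library if they've changed.

```bash
# Download the weights into the ollama volume
docker exec -it ollama ollama pull qwen2.5-coder:7b
docker exec -it ollama ollama pull llama3.2

# Quick interactive chat straight from the terminal (Open WebUI also works)
docker exec -it ollama ollama run qwen2.5-coder:7b
```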
The “Wow” Moment
There’s something special about opening nvidia-smi in another terminal and watching your GPU utilization spike to 100% as the model loads. Seeing the VRAM usage jump from idle to 5-6GB as Ollama loads the model weights into memory made everything click. This wasn’t happening on someone else’s server farm—this was my machine doing real AI inference.
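If you want to watch that yourself, a second terminal with something like this is all it takes:

```bash
# Refresh GPU utilization and VRAM usage every second while a prompt runs
watch -n 1 nvidia-smi
```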
The first time I asked the model a question through Open WebUI and got an instant response, I just sat there grinning. No network latency. No usage caps. Just my homelab doing what multi-million dollar data centers do, at a scale that fits my needs.
Lessons Learned
GPU memory is everything. With 12GB of VRAM, I can comfortably run 7B models and push into 13B territory with quantized versions. The 32GB of system RAM helps too, especially when the model doesn’t fully fit in VRAM.
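Ollama will also report how a loaded model is split between GPU and CPU, which is handy when you're flirting with that 12GB ceiling:

```bash
# List loaded models with their size and GPU/CPU split
docker exec -it ollama ollama ps
# The PROCESSOR column shows "100% GPU" or a CPU/GPU percentage split
```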
Docker makes this so much easier. Being able to tear down and rebuild my entire setup with a couple of commands meant I could experiment without fear. The --restart always flag ensures everything comes back up after a reboot.
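Tearing down really is low-stakes because the model weights and chat history live in named volumes (ollama and open-webui in my sketch above), which outlive the containers:

```bash
# Remove both containers; the named volumes and downloaded models remain
docker rm -f ollama open-webui

# Re-run the original docker run commands to rebuild the stack.
# To truly start over, delete the volumes too:
# docker volume rm ollama open-webui
```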
Proxmox is perfect for this. Running this in a VM means I can snapshot before major changes, allocate resources dynamically, and keep my AI experiments isolated from other homelab projects.
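On the Proxmox host that translates to one command before, and if needed one after, a bad experiment; the VM ID 100 here is just a placeholder for whatever yours is.

```bash
# Snapshot the AI VM before a risky change (run on the Proxmox host)
qm snapshot 100 pre-llm-upgrade

# Roll back if the experiment goes sideways
qm rollback 100 pre-llm-upgrade
```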
What’s Next
I'm already planning my next project: training a model of my own. Nothing huge: probably fine-tuning an existing model on my own dataset, or training a small specialized model from scratch. I want to understand the full pipeline, from data preparation to inference.
The infrastructure is ready. The GPU is warmed up. Time to dive deeper.
For Anyone Considering This
If you have a spare GPU and even basic Linux knowledge, I can’t recommend this enough. The feeling of running AI models on your own hardware, with complete control and privacy, is worth the learning curve. Plus, you’ll actually understand how this technology works instead of just consuming it through APIs.
Start small. Get Ollama running. Pull a 3B model. Then scale up as you get comfortable. Your homelab will thank you for giving it something interesting to do.
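Concretely, the smallest possible start skips the containers entirely: Ollama's documented Linux install script plus one small model. The 3B tag here is my assumption, so check the model library for current names.

```bash
# Install Ollama natively (official Linux install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and chat with a small 3B model to confirm everything works
ollama run llama3.2:3b
```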
Hardware: Proxmox VM, NVIDIA RTX 3060 (12GB), 32GB RAM
Software: Ollama, Open WebUI, Docker, NVIDIA Container Toolkit
Models tested: Qwen2.5-Coder 7B, Llama 3.2