Choosing the right model for your hardware can be tricky. This guide breaks down 10 excellent Ollama models, what they excel at, and, most importantly, whether they'll actually run on your GPU.
Understanding VRAM Requirements
Before we dive in, here’s a quick reference:
- 8GB VRAM: Focus on 3B-7B models, some quantized 8B models
- 12GB VRAM: Comfortable with 7B-13B models, some quantized 30B models
- 16GB+ VRAM: Can handle 13B-30B models, some quantized 70B models
Models are typically loaded entirely into VRAM for best performance. If a model doesn’t fit, Ollama offloads some of its layers to system RAM and runs them on the CPU, which is much slower.
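As a rule of thumb, you can estimate a model's VRAM footprint from its parameter count and quantization level: parameters × bits-per-parameter ÷ 8, plus some overhead for the KV cache and runtime. A quick sketch (the ~1GB overhead figure is a rough assumption; actual usage grows with context length):

```shell
# Rough VRAM estimate: params (billions) * bits-per-param / 8 + ~1GB overhead.
# The 1GB overhead is an assumption; the KV cache grows with context length.
estimate_vram() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "~%.1f GB\n", p * bits / 8 + 1 }'
}

estimate_vram 7 4    # 7B model at 4-bit quantization -> ~4.5 GB
estimate_vram 8 16   # 8B model at fp16               -> ~17.0 GB
```

Once a model is loaded, `ollama ps` shows its actual memory footprint and whether it is running on the GPU, the CPU, or split between the two.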
The Top 10 Models
1. Llama 3.2 (3B) – The Efficient All-Rounder
Best for: General conversation, quick questions, everyday tasks
Why it’s great: Meta’s latest small model punches well above its weight. It’s fast, coherent, and handles most general-purpose tasks surprisingly well for its size.
Example use cases:
- Quick Q&A sessions
- Summarizing text
- Basic coding help
- Creative brainstorming
VRAM Requirements:
- 3B version: ~2-3GB ✅ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull llama3.2
# Or the 1B version for even faster responses
ollama pull llama3.2:1b
2. Qwen 2.5 Coder (7B) – The Code Specialist
Best for: Programming, code explanation, debugging, technical documentation
Why it’s great: Specifically trained on code, this model understands multiple programming languages and can generate, explain, and debug code remarkably well. It often outperforms larger general-purpose models on coding tasks.
Example use cases:
- Writing functions in Python, JavaScript, Go, Rust
- Explaining complex code
- Converting between programming languages
- Finding bugs and suggesting fixes
- Writing unit tests
VRAM Requirements:
- 7B version: ~5-6GB ✅ 8GB | ✅ 12GB | ✅ 16GB
- 14B version: ~9-10GB ❌ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull qwen2.5-coder:7b
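One-shot prompts work well for quick coding questions, and you can splice a file into the prompt with command substitution (the `main.py` path below is just a placeholder):

```shell
# Ask a one-off coding question without entering the interactive REPL
ollama run qwen2.5-coder:7b "Write a Python function that reverses a linked list"

# Feed an existing file in for review (main.py is a placeholder path)
ollama run qwen2.5-coder:7b "Find any bugs in this code: $(cat main.py)"
```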
3. Llama 3.1 (8B) – The Balanced Powerhouse
Best for: General conversation, reasoning, longer context tasks
Why it’s great: Llama 3.1’s 8B variant offers excellent reasoning capabilities with a 128K token context window. It’s the sweet spot between performance and resource usage.
Example use cases:
- Long-form content analysis
- Complex reasoning tasks
- Detailed explanations
- Following multi-step instructions
VRAM Requirements:
- 8B version: ~5-6GB ✅ 8GB | ✅ 12GB | ✅ 16GB
- 70B version: ~40GB+ (quantized versions available) ❌ 8GB | ❌ 12GB | ❌ 16GB
Pull command:
ollama pull llama3.1:8b
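Note that the 128K window isn't used by default; Ollama starts with a much smaller context. To work with long documents, raise `num_ctx` explicitly, either in the interactive REPL or through the REST API. Keep in mind that a larger context window increases VRAM usage (the prompt text here is illustrative):

```shell
# In the interactive REPL:
#   ollama run llama3.1:8b
#   >>> /set parameter num_ctx 32768

# Or via the REST API (the server listens on port 11434 by default):
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the key points of the following report: ...",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'
```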
4. Mistral (7B) – The Speed Demon
Best for: Fast responses, chat applications, real-time interactions
Why it’s great: Mistral is optimized for speed without sacrificing quality. It’s one of the fastest 7B models available and produces coherent, helpful responses consistently.
Example use cases:
- Chatbots requiring quick responses
- Interactive applications
- Real-time assistance
- Customer service scenarios
VRAM Requirements:
- 7B version: ~5GB ✅ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull mistral
5. LLaVA (7B) – The Vision Expert
Best for: Image analysis, visual Q&A, describing images
Why it’s great: One of the best open-source vision-language models. LLaVA can analyze images and answer questions about them, making it perfect for multimodal applications.
Example use cases:
- Describing images for accessibility
- Analyzing charts and diagrams
- Identifying objects in photos
- Answering questions about visual content
- OCR and document analysis
VRAM Requirements:
- 7B version: ~6-7GB ✅ 8GB | ✅ 12GB | ✅ 16GB
- 13B version: ~8-9GB ❌ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull llava:7b
Usage example:
ollama run llava:7b
>>> What's in this image? /path/to/image.jpg
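If you're building an application rather than using the CLI, the REST API accepts base64-encoded images via the `images` field (the `photo.jpg` path is a placeholder):

```shell
# Send an image to LLaVA through the REST API; images are base64-encoded.
# base64 -w0 is GNU coreutils syntax; on macOS use: base64 -i photo.jpg
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llava:7b\",
  \"prompt\": \"Describe this image in one sentence.\",
  \"stream\": false,
  \"images\": [\"$(base64 -w0 photo.jpg)\"]
}"
```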
6. Gemma 2 (9B) – Google’s Instruction Follower
Best for: Following specific instructions, structured outputs, task completion
Why it’s great: Google’s Gemma 2 excels at understanding and following detailed instructions. It’s particularly good at producing structured outputs and staying on task.
Example use cases:
- Following complex multi-step instructions
- Generating structured data (JSON, XML)
- Template-based content generation
- Precise task execution
VRAM Requirements:
- 2B version: ~2GB ✅ 8GB | ✅ 12GB | ✅ 16GB
- 9B version: ~6-7GB ✅ 8GB | ✅ 12GB | ✅ 16GB
- 27B version: ~16-18GB ❌ 8GB | ❌ 12GB | ✅ 16GB
Pull command:
ollama pull gemma2:9b
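For structured output, the Ollama API supports a `format: "json"` option that constrains the model to emit valid JSON. It helps to also describe the shape you want in the prompt itself:

```shell
# Ask Gemma 2 for valid JSON via the REST API
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "List three programming languages as a JSON array of objects with fields name and year.",
  "format": "json",
  "stream": false
}'
```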
7. DeepSeek Coder V2 (16B) – The Advanced Code Generator
Best for: Complex coding tasks, system design, algorithm development
Why it’s great: For those with more VRAM, DeepSeek Coder V2 is one of the most capable code models available. It handles complex architectural decisions and can work across large codebases.
Example use cases:
- Designing system architectures
- Complex algorithm implementation
- Code refactoring
- Full-stack development assistance
- Database query optimization
VRAM Requirements:
- 16B version: ~10-11GB ❌ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull deepseek-coder-v2:16b
8. Dolphin Mistral – The Uncensored Assistant
Best for: Creative writing, unrestricted exploration, roleplay
Why it’s great: Based on Mistral but with reduced safety filters, Dolphin is great for creative tasks where you want less corporate-speak and more natural, unrestricted responses.
Example use cases:
- Creative fiction writing
- Exploring controversial topics objectively
- Roleplay scenarios
- Honest, direct answers without hedging
VRAM Requirements:
- 7B version: ~5GB ✅ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull dolphin-mistral
Note: Use responsibly. Less filtering means it’s more important to apply your own judgment.
9. Phi-3 (Mini) – The Tiny Titan
Best for: Resource-constrained environments, fast experimentation, embedded systems
Why it’s great: Microsoft’s Phi-3 is shockingly capable for its tiny size. It’s perfect when you need decent performance with minimal resources.
Example use cases:
- Running on older GPUs
- Edge devices
- Quick prototyping
- Learning and experimentation
- Multiple models running simultaneously
VRAM Requirements:
- Mini (3.8B): ~2-3GB ✅ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
ollama pull phi3:mini
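By default the Ollama server keeps only one model loaded at a time, but small models like Phi-3 make it practical to keep several resident at once. Recent Ollama versions read an `OLLAMA_MAX_LOADED_MODELS` environment variable at server startup (check your version's documentation to confirm support):

```shell
# Allow up to 3 models resident in memory at once (set before starting the server)
OLLAMA_MAX_LOADED_MODELS=3 ollama serve
```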
10. Mixtral (8x7B) – The Mixture of Experts
Best for: Complex reasoning, diverse knowledge tasks, high-quality outputs
Why it’s great: Mixtral uses a “mixture of experts” architecture, routing each token through only 2 of its 8 expert networks. This gives you quality approaching 70B-class models at roughly the speed of a 13B model, and with a smaller memory footprint than a dense 70B, though all of its parameters must still be loaded.
Example use cases:
- Complex problem-solving
- Multi-domain knowledge tasks
- High-quality content generation
- Advanced reasoning
VRAM Requirements:
- 8x7B version: ~26-30GB (but quantized versions available)
- Quantized (Q4): ~13-15GB ❌ 8GB | ✅ 12GB | ✅ 16GB
Pull command:
# Full version (requires ~30GB)
ollama pull mixtral
# Quantized version (much smaller)
ollama pull mixtral:8x7b-instruct-v0.1-q4_0
Quick Reference Table
| Model | Size | Best For | 8GB | 12GB | 16GB |
|---|---|---|---|---|---|
| Llama 3.2 | 3B | General use | ✅ | ✅ | ✅ |
| Qwen 2.5 Coder | 7B | Programming | ✅ | ✅ | ✅ |
| Llama 3.1 | 8B | Reasoning | ✅ | ✅ | ✅ |
| Mistral | 7B | Speed | ✅ | ✅ | ✅ |
| LLaVA | 7B | Vision | ✅ | ✅ | ✅ |
| Gemma 2 | 9B | Instructions | ✅ | ✅ | ✅ |
| DeepSeek Coder V2 | 16B | Advanced coding | ❌ | ✅ | ✅ |
| Dolphin Mistral | 7B | Creative | ✅ | ✅ | ✅ |
| Phi-3 Mini | 3.8B | Efficiency | ✅ | ✅ | ✅ |
| Mixtral (Q4) | 8x7B | Complex tasks | ❌ | ✅* | ✅ |
*Tight fit, quantized version required
Choosing the Right Model
For 8GB VRAM:
Best all-around setup:
ollama pull llama3.2 # General use
ollama pull qwen2.5-coder:7b # Coding
ollama pull mistral # Fast responses
For 12GB VRAM:
Recommended collection:
ollama pull llama3.1:8b # Strong reasoning
ollama pull qwen2.5-coder:7b # Coding expert
ollama pull llava:7b # Vision tasks
ollama pull deepseek-coder-v2:16b # Advanced coding
For 16GB+ VRAM:
Power user setup:
ollama pull llama3.1:8b # Fast general use
ollama pull deepseek-coder-v2:16b # Top-tier coding
ollama pull mixtral:8x7b-instruct-v0.1-q4_0 # Complex reasoning
ollama pull llava:13b # Advanced vision
ollama pull gemma2:27b # Instruction following
Pro Tips
1. Use quantized models to fit more in VRAM:
# Look for Q4, Q5, or Q8 tags on a model's page at ollama.com
ollama pull llama3.1:8b-instruct-q4_0
2. Check actual VRAM usage:
# While model is loaded
watch -n 1 nvidia-smi
3. Switch between models easily:
# A model stays loaded for a few minutes after use (5 by default),
# so switching back and forth is fast
ollama run qwen2.5-coder:7b # For coding
ollama run llama3.1:8b # Switch to general use
4. Combine models for different tasks:
- Use Qwen Coder for programming questions
- Use LLaVA when you need image analysis
- Use Llama 3.1 for general reasoning
- Use Mistral when you need speed
5. Test before committing:
# Models are cached, so you can try and remove easily
ollama pull model-name
ollama run model-name
# If you don't like it:
ollama rm model-name
Conclusion
The beauty of running Ollama locally is that you can experiment freely. Download a few models, test them on your actual use cases, and keep the ones that work best for you.
Remember: Bigger isn’t always better. A well-chosen 7B model that fits comfortably in your VRAM will outperform a 70B model that’s constantly swapping to system RAM.
Start with the models that fit your hardware, test them on real tasks, and build your personal AI toolkit from there.
Happy modeling! 🚀