# Using Local AI Models with Ollama: Building Effective System Prompts
As AI development becomes more accessible, many developers are turning to local AI models through tools like Ollama to avoid the costs, privacy concerns, and dependency on cloud services. While this approach gives you more control and better privacy, it also comes with important constraints that affect how you design your AI interactions.
In this post, I’ll walk you through how to set up local AI models with Ollama and provide some tips for crafting effective system prompts tailored for this environment.
## What is Ollama?
Ollama is a tool that makes it easy to run large language models locally on your computer. It simplifies downloading, running, and managing AI models on your own hardware—no cloud credits required, and once a model is downloaded, no internet access either. This is particularly valuable for developers who want to experiment with AI without external dependencies.
## Getting Started with Ollama
First, install Ollama following the instructions for your operating system at ollama.com. Once installed, you can pull models directly from the command line:
```shell
# Pull a model (examples)
ollama pull llama3
ollama pull mistral
ollama pull phi3
```
After installation, you can start a model. `ollama run` opens an interactive session, while the Ollama server (which runs in the background) loads models into memory on demand:

```shell
# Start an interactive session with a model
ollama run llama3
ollama run mistral
```
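Beyond the interactive CLI, Ollama also exposes a local HTTP API (by default at `http://localhost:11434`), which is handy for scripting. Here is a minimal sketch using only the Python standard library; the `/api/generate` endpoint and its `model`/`prompt`/`system`/`stream` fields are part of Ollama's documented API, while the helper function is my own:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str, system: str = "") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }
    if system:
        payload["system"] = system  # per-request system prompt override
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

# To actually send the request (requires a running Ollama server):
# with urllib.request.urlopen(build_generate_request("llama3", "Hello!")) as resp:
#     print(json.loads(resp.read())["response"])
```

Keeping the actual network call commented out makes the payload construction easy to test without a running server.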
## Setting Up Custom System Prompts
To use custom system prompts with your local models, you need to place the prompt file in the correct location. The file must be named after the exact model name as reported by `ollama list`, including the tag, with a `.md` extension.
- Linux: `~/.config/zblade/prompts/`
- macOS: `~/Library/Application Support/zblade/prompts/`
- Windows: `%APPDATA%\zblade\prompts\`

Example filename: `qwen3-coder:30b.md`
This means if you’re using the model `qwen3-coder:30b` (as shown by running `ollama list`), you’d name your prompt file `qwen3-coder:30b.md`.
The prompt file should contain your system prompt in markdown format—the same content you’d normally include as a system message when interacting with the model.
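As a sketch, the platform-specific locations above can be computed programmatically. The directory layout is taken straight from the list above; the helper name and its `system` parameter (matching `platform.system()` values) are my own:

```python
import os
from pathlib import Path

def zblade_prompt_path(model: str, system: str) -> Path:
    """Return the zblade prompt file path for a model on the given OS.

    `system` is "Linux", "Darwin", or "Windows", as returned by platform.system().
    """
    if system == "Linux":
        base = Path.home() / ".config" / "zblade" / "prompts"
    elif system == "Darwin":
        base = Path.home() / "Library" / "Application Support" / "zblade" / "prompts"
    elif system == "Windows":
        base = Path(os.environ["APPDATA"]) / "zblade" / "prompts"
    else:
        raise ValueError(f"unsupported platform: {system}")
    # File is named exactly after the model, e.g. qwen3-coder:30b.md
    return base / f"{model}.md"
```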
## Optimizing System Prompts for Local AI Models
When using local AI models, especially on systems with limited resources (like my RTX 3070 8GB laptop GPU), the system prompt becomes even more critical. Let me explain how to build effective prompts for this scenario.
### The Limitation of Context Windows
Local AI models typically run with smaller context windows than their cloud counterparts—often as low as 4,000 or 8,000 tokens (in Ollama, the window is configurable via the `num_ctx` model parameter). When working with your own hardware, you’re also limited by:
- GPU memory (particularly important for 8GB cards)
- CPU resources
- Memory constraints
As a result, the model sees only a limited amount of context per query, which can lead to incomplete or inaccurate responses unless you structure your prompts carefully.
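There's no exact local token count without the model's tokenizer, but a common rough heuristic (roughly 4 characters per token for English text) is enough to sanity-check whether a prompt plus input fits in the window. Both the heuristic and the helper below are assumptions of mine, not Ollama behavior:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, user_input: str,
                   window: int = 4096, reply_budget: int = 512) -> bool:
    """Check that prompt + input leave enough room for the model's reply."""
    used = estimate_tokens(system_prompt) + estimate_tokens(user_input)
    return used + reply_budget <= window
```

A check like this won't be exact, but it catches the common failure mode where a long pasted file silently pushes the system prompt out of the window.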
### Why Smaller System Prompts Are Better
A more concise system prompt works better in this environment because:
- **Reduced Memory Overhead**: Shorter prompts use less memory during processing, important for constrained hardware.
- **Faster Inference**: Smaller prompts lead to faster responses, reducing perceived latency.
- **Clearer Instructions**: Less clutter in the prompt helps the AI focus on the essential elements.
- **Better Efficiency**: A limited context window forces you to be more precise about what you want from the AI.
Here’s an example of a system prompt optimized for local use with an 8GB GPU:
```markdown
You are a professional software developer with expertise in multiple programming languages. Your task is to help users write code and resolve issues in their projects.

## Context
You have access to the user's current file and are working within a project context.

## Guidelines
- Answer the question concisely and directly
- Focus on correctness and clarity
- If a request requires multiple steps, explain each clearly
- Keep responses within 200 words (or roughly 500 tokens)
- Be respectful and helpful

## Style
- Use clear and precise language
- Prioritize technical accuracy
- When suggesting code, always include a brief explanation

## Commands
You have access to special tools that the user can activate:
- `read_file(path)` - Read the complete contents of a file
- `read_file_range(path, start_line, end_line, context_lines)` - Read specific lines with context

For code snippets:
- Enclose code in triple backticks with a language identifier
```
### Adapting the Prompt to Your Hardware
Given that I’m working on an RTX 3070 with 8GB of VRAM, here’s what I’ve learned about prompt optimization:
- **Keep system prompts under 300 words**: I’ve found this works well with my hardware without causing out-of-memory errors.
- **Be specific about context window limits**: Explicitly state that the prompt is within a constrained environment.
- **Use clear instructions**: Avoid ambiguous phrasing that the AI might interpret differently.
- **Include resource-aware tools**: Mention tools that can handle the file reading limitations gracefully.
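The first point above lends itself to automation. A trivial check like the following (a sketch; the 300-word budget is just my rule of thumb from the list above) can be dropped into a build or pre-commit step:

```python
from pathlib import Path

WORD_BUDGET = 300  # rule of thumb for an 8GB GPU, per the list above

def check_prompt_size(path: Path, budget: int = WORD_BUDGET) -> tuple[int, bool]:
    """Return (word_count, within_budget) for a system prompt file."""
    words = path.read_text(encoding="utf-8").split()
    return len(words), len(words) <= budget
```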
## Practical Example: A Working Prompt for Ollama

Here’s a prompt that works well for local models, especially on low-end hardware:
````markdown
You are an expert software developer and problem solver, working in a constrained local environment.

## Role
You are an AI assistant that helps developers write code and solve problems. You work in a local AI environment with a limited context window and limited hardware resources.

## Limitations
- Context window is limited to approximately 4,000 tokens
- Working with 8GB of GPU memory (RTX 3070)
- Processing may be slower than cloud models
- No internet access or external API access
- Focus on code that works within local hardware constraints

## Tasks
1. Understand the user's request
2. Create efficient solutions that work within the constraints of local hardware
3. Be precise in your responses
4. When providing code, consider runtime efficiency (avoid slow operations, use appropriate data structures for limited memory)

## Output
- Keep responses concise and to the point
- Use technical terms accurately
- When giving code examples, make them efficient and memory-conscious
- When explaining concepts, be clear and practical

## Tools Available
You can read specific files:

```json
{
  "name": "read_file",
  "description": "Read the complete contents of a file",
  "parameters": {
    "path": "string"
  }
}
{
  "name": "read_file_range",
  "description": "Read a specific line range from a file with optional context",
  "parameters": {
    "path": "string",
    "start_line": "integer",
    "end_line": "integer",
    "context_lines": "integer"
  }
}
```

## Response Format
- Begin with a clear answer to the question
- Include code snippets when relevant, with language tags
- Keep explanations concise
- If there are multiple approaches, present the one best suited to local hardware
````
This prompt takes advantage of several constraints and capabilities of my hardware:
1. Explicitly mentions the 8GB GPU and context limitations
2. Focuses on practical, efficient solutions
3. Provides clear tools for file management that don't require heavy context parsing
4. Encourages efficiency and memory-conscious code design
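On the host side, the two tools declared in the prompt have to be implemented and dispatched when the model emits a tool call. Here is a minimal sketch, assuming the model returns JSON matching the schemas above; the dispatch wiring itself is my own, not part of Ollama:

```python
import json
from pathlib import Path

def read_file(path: str) -> str:
    """Read the complete contents of a file."""
    return Path(path).read_text(encoding="utf-8")

def read_file_range(path: str, start_line: int, end_line: int,
                    context_lines: int = 0) -> str:
    """Read lines start_line..end_line (1-based, inclusive) plus optional context."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    lo = max(0, start_line - 1 - context_lines)
    hi = min(len(lines), end_line + context_lines)
    return "\n".join(lines[lo:hi])

TOOLS = {"read_file": read_file, "read_file_range": read_file_range}

def dispatch(tool_call_json: str) -> str:
    """Run a tool call of the form {"name": ..., "parameters": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["parameters"])
```

Keeping the tools this small matters on constrained hardware: `read_file_range` lets the model pull in a handful of relevant lines instead of an entire file that would eat the context window.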
## Tips for Effective Local AI Usage
1. **Keep Files Accessible**: Structure your system prompts to reference only necessary files with clear, specific paths.
2. **Use Tool Commands Wisely**: Leverage the `read_file` and `read_file_range` tools instead of pasting large code snippets into prompts.
3. **Test Response Lengths**: Pay attention to response length when using local models; long responses increase generation time and memory pressure on constrained hardware.
4. **Optimize Your Local Model**: Choose models that perform well on your hardware (like phi3, llama3, or mistral) and test different configurations. The model you select matters as much as your prompt.
5. **Monitor Memory Usage**: Keep an eye on GPU memory usage when doing extensive local AI tasks on systems with limited resources.
## Conclusion
Utilizing local AI models with Ollama is an excellent approach for developers who want control and privacy. However, it requires more thoughtful prompt design, particularly when working with constrained hardware like my laptop’s 8GB RTX 3070.
By designing system prompts that respect context window limitations and hardware capabilities, you can achieve better performance and more consistent results. Remember that the goal isn't to replicate cloud AI capabilities, but to make the most of your local resources in an effective way.
With careful design, local AI models can be just as useful as their larger, cloud-based counterparts—especially when you take their constraints into account and optimize your interactions accordingly.
Happy coding!