# AI Skill Report Card: Deploying Local AI
```yaml
---
name: deploying-local-ai
description: Deploys open source AI models locally on Linux using Ollama, transformers, and related tools. Use when setting up local inference servers, running models offline, or building AI applications without cloud dependencies.
---
```

# Local AI Model Deployment
## Quick Start (15 / 15)
```bash
# Install Ollama and run a model
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama2:7b-chat
ollama run llama2:7b-chat
```
**Recommendation:** State specific hardware requirements (GPU memory, CPU cores, disk space) upfront instead of scattering them throughout.
## Workflow (15 / 15)
Progress:
- Step 1: Install model runtime (Ollama/Transformers)
- Step 2: Deploy base model
- Step 3: Set up API endpoints
- Step 4: Configure resource limits
- Step 5: Test inference pipeline
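The five steps above can be sketched as a sequential runner that stops at the first failing step (a hypothetical sketch; the step functions are placeholders you would replace with real install/deploy logic):

```python
def run_workflow(steps):
    """Run (name, func) deployment steps in order; stop at the first failure.

    Returns the completed step names and the name of the failed step
    (or None if everything succeeded).
    """
    completed = []
    for name, step in steps:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None

# Example with stubbed steps:
steps = [
    ("install runtime", lambda: True),
    ("deploy base model", lambda: True),
    ("set up API endpoints", lambda: False),  # pretend this one fails
    ("configure resource limits", lambda: True),
    ("test inference pipeline", lambda: True),
]
```

Running `run_workflow(steps)` with these stubs stops after the second step and reports `"set up API endpoints"` as the failure, so you know exactly where to resume.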
### Detailed Steps

#### Step 1: Runtime Setup
```bash
# Ollama for easy model management
curl -fsSL https://ollama.ai/install.sh | sh

# Alternative: Python transformers stack
pip install torch transformers accelerate bitsandbytes
```
#### Step 2: Model Deployment
```bash
# Ollama approach
ollama pull codellama:7b
ollama pull mistral:7b

# Transformers approach
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-medium')
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-medium')
"
```
#### Step 3: API Server
```python
# Simple Flask API for transformers
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
pipe = pipeline("text-generation", model="gpt2", device=0)

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    result = pipe(prompt, max_length=100)
    return jsonify(result[0])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```
#### Step 4: Resource Configuration
```bash
# Limit GPU memory for Ollama
export OLLAMA_GPU_MEMORY_FRACTION=0.8

# Monitor resources
watch -n 2 "nvidia-smi; echo; free -h"
```
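Rather than hard-coding 0.8, the fraction can be derived from the model footprint and total VRAM (a hypothetical helper; the 2 GB headroom for CUDA context and KV cache is an assumption to tune):

```python
def gpu_memory_fraction(model_gb, total_vram_gb, headroom_gb=2.0):
    """Fraction of total VRAM needed for the model plus fixed headroom,
    rounded to two decimals and capped at 1.0."""
    needed = model_gb + headroom_gb
    return min(round(needed / total_vram_gb, 2), 1.0)

# A ~4 GB 4-bit 7B model on a 24 GB GPU:
# gpu_memory_fraction(4.0, 24.0) -> 0.25
```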
**Recommendation:** Include a troubleshooting section listing common error messages with exact solutions, rather than monitoring commands alone.
## Examples (20 / 20)
### Example 1: Code Generation Setup

Input: "Set up local code completion server"

Output:
```bash
ollama pull codellama:7b-code
ollama serve &
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-code",
  "prompt": "def fibonacci(n):",
  "stream": false
}'
```
### Example 2: Multi-Model Chat System

Input: "Deploy both chat and code models with switching"

Output:
```python
import requests

def query_model(prompt, model_type="chat"):
    models = {"chat": "llama2:7b-chat", "code": "codellama:7b"}
    response = requests.post(
        "http://localhost:11434/api/generate",
        # Disable streaming so the reply is a single JSON object;
        # otherwise response.json() fails on the NDJSON stream.
        json={"model": models[model_type], "prompt": prompt, "stream": False},
    )
    return response.json()

# Usage: query_model("Hello", "chat") or query_model("def sort():", "code")
```
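Ollama's generate endpoint streams by default: each line of the reply is a JSON object carrying a `response` fragment, with `done: true` on the final one. A small assembler for that NDJSON format (a sketch, fed literal lines here rather than a live connection):

```python
import json

def assemble_stream(lines):
    """Concatenate the 'response' fragments of an Ollama NDJSON stream."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals completion
            break
    return "".join(parts)

# assemble_stream(['{"response": "Hel"}', '{"response": "lo", "done": true}'])
# -> 'Hello'
```

With a live request you would pass `stream=True` to `requests.post` and feed `response.iter_lines()` to the same function.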
### Example 3: RAG Document Search

Input: "Add document search to local model"

Output:
```python
from sentence_transformers import SentenceTransformer
import faiss

# Set up the embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Index documents; normalize so inner product equals cosine similarity
docs = ["Document 1 text", "Document 2 text"]
embeddings = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype('float32'))

# Search function
def search(query, k=3):
    query_emb = embedder.encode([query], normalize_embeddings=True)
    scores, indices = index.search(query_emb.astype('float32'), k)
    return [docs[i] for i in indices[0]]
```
**Recommendation:** Provide a complete Docker deployment example, with a Dockerfile and a Compose file, for production use.
## Best Practices
- Model Selection: Use 7B parameters max on 16GB RAM, 13B on 32GB+
- Quantization: Enable 4-bit quantization for larger models on limited hardware
- Concurrent Requests: Use async frameworks (FastAPI) for production APIs
- Model Switching: Keep one model loaded at a time to conserve memory
- Persistence: Use Docker containers for consistent deployments across systems
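The RAM guidance above follows from simple arithmetic: bytes per parameter times parameter count, plus runtime overhead. A rough estimator (a sketch; the 20% overhead factor for activations and KV cache is an assumption to tune for your workload):

```python
def estimate_model_memory_gb(params_billion, bits_per_param=16, overhead=1.2):
    """Rough RAM/VRAM needed to hold a model's weights plus runtime overhead."""
    weight_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB
    return round(weight_gb * overhead, 1)

# estimate_model_memory_gb(7)                    -> 16.8  (fp16: tight on 16 GB)
# estimate_model_memory_gb(7, bits_per_param=4)  -> 4.2   (4-bit quantized)
```

This also shows why 4-bit quantization is the practical path for 7B+ models on consumer hardware.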
## Common Pitfalls
- OOM Errors: Don't load multiple large models simultaneously; use model switching instead
- CUDA Issues: Install PyTorch with the correct CUDA version: `pip install torch --index-url https://download.pytorch.org/whl/cu118`
- Slow Inference: Enable GPU acceleration and verify the model is actually using the GPU with `nvidia-smi`
- API Timeouts: Increase timeout values for large text generation requests
- Memory Leaks: Restart services periodically when running continuous inference
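For the timeout pitfall, a retry wrapper with exponential backoff keeps long generations from failing outright (a hypothetical sketch; `send` is any callable that performs the actual HTTP request, which keeps the logic testable without a live server):

```python
import time

def generate_with_retry(send, payload, retries=3, base_delay=1.0):
    """Call send(payload); on exception, retry with exponential backoff."""
    for attempt in range(retries):
        try:
            return send(payload)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

In practice `send` would wrap `requests.post` with a generous `timeout=` argument; the backoff then covers transient failures while the model is busy.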
## Troubleshooting Commands
```bash
# Check GPU utilization
nvidia-smi dmon -s pucvmet -d 2

# Test Ollama API
curl http://localhost:11434/api/tags

# List your Python processes with CPU/memory usage
# (htop is interactive and cannot be piped; use ps instead)
ps -u "$USER" -o pid,%cpu,%mem,cmd | grep '[p]ython'
```