AI Skill Report Card

Deploying Local AI

A-82·Apr 14, 2026·Source: Web
YAML
---
name: deploying-local-ai
description: Deploys open source AI models locally on Linux using Ollama, transformers, and related tools. Use when setting up local inference servers, running models offline, or building AI applications without cloud dependencies.
---
# Local AI Model Deployment
Bash
# Install Ollama and run a model
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama2:7b-chat
ollama run llama2:7b-chat
Recommendation
Add specific hardware requirements upfront (GPU memory, CPU cores, disk space) instead of scattering them throughout
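In the spirit of that recommendation, a deployment guide could open with a preflight script that reports the machine's resources. A minimal sketch, assuming Linux and the standard `nvidia-smi` query flags; the RAM/disk thresholds are illustrative assumptions, not values from the skill:

```python
import os
import shutil
import subprocess

def preflight(min_ram_gb=16, min_disk_gb=20):
    """Report CPU cores, RAM, free disk, and (if present) GPU memory.

    Thresholds are illustrative; adjust them per model size.
    """
    report = {"cpu_cores": os.cpu_count()}
    # Total physical RAM via sysconf (Linux)
    report["ram_gb"] = round(
        os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9, 1
    )
    report["disk_free_gb"] = round(shutil.disk_usage("/").free / 1e9, 1)
    try:
        # Per-GPU total memory in MiB, one line per GPU
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        report["gpu_mem_mb"] = [int(x) for x in out.split()]
    except (FileNotFoundError, subprocess.CalledProcessError):
        report["gpu_mem_mb"] = []  # no NVIDIA GPU / driver detected
    report["ok"] = (report["ram_gb"] >= min_ram_gb
                    and report["disk_free_gb"] >= min_disk_gb)
    return report
```

Running `preflight()` before Step 1 makes the hardware requirements explicit instead of implicit.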

Progress:

  • Step 1: Install model runtime (Ollama/Transformers)
  • Step 2: Deploy base model
  • Step 3: Set up API endpoints
  • Step 4: Configure resource limits
  • Step 5: Test inference pipeline

Detailed Steps

Step 1: Runtime Setup

Bash
# Ollama for easy model management
curl -fsSL https://ollama.ai/install.sh | sh

# Alternative: Python transformers stack
pip install torch transformers accelerate bitsandbytes
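After installing either stack, it can be useful to verify what is actually available before proceeding. A small sketch that checks for the Ollama CLI and the Python packages without importing the heavy libraries (the function name is illustrative):

```python
import importlib.util
import shutil

def detect_runtimes():
    """Report which local-inference stacks are installed.

    Uses find_spec so the check is cheap: nothing heavy is imported.
    """
    return {
        "ollama": shutil.which("ollama") is not None,            # Ollama CLI on PATH
        "torch": importlib.util.find_spec("torch") is not None,  # PyTorch importable
        "transformers": importlib.util.find_spec("transformers") is not None,
    }
```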

Step 2: Model Deployment

Bash
# Ollama approach
ollama pull codellama:7b
ollama pull mistral:7b

# Transformers approach
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-medium')
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-medium')
"
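Once models are pulled, Ollama's `/api/tags` endpoint lists what is available locally. A stdlib-only sketch (assumes a running `ollama serve`; the parsing helper is split out so it can be exercised without a live server):

```python
import json
import urllib.request

def parse_tags(raw):
    """Extract model names from an /api/tags JSON payload.

    Payload shape: {"models": [{"name": "codellama:7b", ...}, ...]}
    """
    return [m["name"] for m in json.loads(raw).get("models", [])]

def list_local_models(base_url="http://localhost:11434"):
    """Return names of models already pulled, via Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_tags(resp.read())
```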

Step 3: API Server

Python
# Simple Flask API for transformers
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
pipe = pipeline("text-generation", model="gpt2", device=0)

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    result = pipe(prompt, max_length=100)
    return jsonify(result[0])

app.run(host="0.0.0.0", port=8000)
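A matching client for this endpoint can be written with only the standard library; no extra dependencies are needed to smoke-test the server. A sketch (the generous timeout reflects that generation can be slow on CPU; function names are illustrative):

```python
import json
import urllib.request

def build_request(prompt, url="http://localhost:8000/generate"):
    """Build a POST request matching the endpoint's expected JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, timeout=120):
    """Call the local /generate endpoint and return the parsed JSON result."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())
```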

Step 4: Resource Configuration

Bash
# Limit GPU memory for Ollama
export OLLAMA_GPU_MEMORY_FRACTION=0.8

# Monitor resources
watch -n 2 "nvidia-smi; echo; free -h"
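Before loading a larger model, a rough headroom check helps avoid OOM kills: an fp16 model needs about 2 bytes per parameter, plus overhead for the KV cache and activations. A sketch assuming Linux (`/proc/meminfo`); the 1.5x headroom factor is an illustrative assumption:

```python
def available_ram_gb(meminfo_path="/proc/meminfo"):
    """Read MemAvailable from /proc/meminfo (Linux-only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1e6  # kB -> GB (decimal)
    raise RuntimeError("MemAvailable not found")

def fits_in_ram(params_billions, bytes_per_param=2, headroom=1.5, avail_gb=None):
    """Rough check: e.g. a 7B model at fp16 is ~14 GB, ~21 GB with headroom."""
    if avail_gb is None:
        avail_gb = available_ram_gb()
    return params_billions * bytes_per_param * headroom <= avail_gb
```

For example, `fits_in_ram(7, avail_gb=32.0)` passes while `fits_in_ram(13, avail_gb=16.0)` does not, matching the rule of thumb of 7B models on 16 GB machines and 13B on 32 GB+.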
Recommendation
Include a troubleshooting section with common error messages and exact solutions rather than just monitoring commands

Example 1: Code Generation Setup
Input: "Set up local code completion server"
Output:

Bash
ollama pull codellama:7b-code
ollama serve &
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-code",
  "prompt": "def fibonacci(n):",
  "stream": false
}'

Example 2: Multi-Model Chat System
Input: "Deploy both chat and code models with switching"
Output:

Python
import requests

def query_model(prompt, model_type="chat"):
    models = {"chat": "llama2:7b-chat", "code": "codellama:7b"}
    response = requests.post(
        "http://localhost:11434/api/generate",
        # stream=False is required: otherwise Ollama returns newline-delimited
        # JSON chunks and response.json() fails
        json={"model": models[model_type], "prompt": prompt, "stream": False},
    )
    return response.json()

# Usage: query_model("Hello", "chat") or query_model("def sort():", "code")

Example 3: RAG Document Search
Input: "Add document search to local model"
Output:

Python
from sentence_transformers import SentenceTransformer
import faiss

# Setup embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Index documents (normalize so inner product = cosine similarity)
docs = ["Document 1 text", "Document 2 text"]
embeddings = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype('float32'))

# Search function
def search(query, k=3):
    query_emb = embedder.encode([query], normalize_embeddings=True)
    scores, indices = index.search(query_emb.astype('float32'), k)
    return [docs[i] for i in indices[0]]
Recommendation
Provide a complete Docker deployment example with a Dockerfile and Compose file for production use
Best Practices

  • Model Selection: Use 7B parameters max on 16GB RAM, 13B on 32GB+
  • Quantization: Enable 4-bit quantization for larger models on limited hardware
  • Concurrent Requests: Use async frameworks (FastAPI) for production APIs
  • Model Switching: Keep one model loaded at a time to conserve memory
  • Persistence: Use Docker containers for consistent deployments across systems

Troubleshooting

  • OOM Errors: Don't load multiple large models simultaneously; use model switching instead
  • CUDA Issues: Install PyTorch built against the correct CUDA version: pip install torch --index-url https://download.pytorch.org/whl/cu118
  • Slow Inference: Enable GPU acceleration and confirm the model is actually using the GPU with nvidia-smi
  • API Timeouts: Increase timeout values for large text generation requests
  • Memory Leaks: Restart services periodically when running continuous inference
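The timeout advice above can be made concrete with a retry wrapper around the Ollama API. A stdlib-only sketch with exponential backoff (the function names and retry/timeout values are illustrative assumptions):

```python
import json
import time
import urllib.error
import urllib.request

def backoff_schedule(retries=3, base=2.0):
    """Exponential backoff delays in seconds: 1, 2, 4 for the defaults."""
    return [base ** i for i in range(retries)]

def generate_with_retry(prompt, model="llama2:7b-chat", timeout=300, retries=3):
    """Call Ollama's /api/generate with a long timeout, retrying on failure."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    for delay in backoff_schedule(retries):
        try:
            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return json.loads(resp.read())
        except (urllib.error.URLError, TimeoutError):
            time.sleep(delay)  # back off, then retry
    raise RuntimeError("Ollama API unreachable after retries")
```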
Bash
# Check GPU utilization
nvidia-smi dmon -s pucvmet -d 2

# Test Ollama API
curl http://localhost:11434/api/tags

# List Python inference processes (htop is interactive and can't be piped)
ps aux | grep [p]ython
Grade: A-

AI Skill Framework Scorecard

Criteria Breakdown

  • Quick Start: 15/15
  • Workflow: 15/15
  • Examples: 20/20
  • Completeness: 17/20
  • Format: 15/15
  • Conciseness: 13/15