AI Skill Report Card

Implementing Prompt Caching

Grade: A- (88) · Feb 5, 2026
Bash
curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are an expert analyst...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Analyze this data..."}
    ]
  }'
Recommendation
Add specific numerical examples showing token counts and cost calculations (e.g., 'For a 10,000 token document cached with 5 follow-up questions: Creation cost = 12,500 tokens × rate, Read cost = 5 × 10,000 tokens × 0.1 × rate')
  1. Structure your prompt - Place static content first (tools, system, context)
  2. Add cache breakpoints - Mark reusable content with cache_control: {"type": "ephemeral"}
  3. Monitor performance - Check cache_read_input_tokens and cache_creation_input_tokens (see the sketch after this list)
  4. Optimize placement - Adjust breakpoints based on hit rates
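
A minimal sketch of steps 1-3 using the Anthropic Python SDK, assuming a recent SDK version that surfaces the cache counters on response.usage; the model name, system text, and question are placeholders:

Python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert analyst...",  # static content first
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    messages=[{"role": "user", "content": "Analyze this data..."}],
)

# Step 3: check cache performance in the usage block
print("created:", response.usage.cache_creation_input_tokens)
print("read:", response.usage.cache_read_input_tokens)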

Progress checklist for implementation:

  • Identify static vs dynamic content
  • Structure prompt with static content first
  • Add cache_control to appropriate blocks
  • Test with sample requests
  • Monitor cache hit rates
  • Adjust strategy based on performance
Recommendation
Include a troubleshooting section with specific error messages and solutions (e.g., 'Error: cache_control not supported' → 'Check model compatibility and API version')
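
In the same spirit, a hedged error-handling sketch with the Python SDK; anthropic.BadRequestError is the SDK's exception class for 400-level validation failures, and the message text printed here comes from the API, not from this sketch:

Python
import anthropic

client = anthropic.Anthropic()

try:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are an expert analyst...",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Analyze this data..."}],
    )
except anthropic.BadRequestError as err:
    # Typical causes: malformed cache_control placement or an unsupported model
    print("Request rejected:", err)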

Example 1: Document Analysis Input:

JSON
{
  "system": [
    {
      "type": "text",
      "text": "<entire document content>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "What are the main themes?"}]
}

Output: Document cached; subsequent questions reuse the cached content at 10% of the base input-token rate

Example 2: Code Assistant Input:

JSON
{
  "system": [
    {
      "type": "text",
      "text": "You are a coding assistant. Here's the codebase:\n<large codebase>",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}

Output: Codebase cached for multiple queries about functions, bugs, improvements

Recommendation
Provide a complete working example with actual API response showing cache_read_input_tokens and cache_creation_input_tokens values
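
One way to satisfy this, sketched with the Python SDK; the printed values depend on the actual run, but within the 5-minute lifetime the first call should show cache_creation_input_tokens > 0 and the second cache_read_input_tokens > 0:

Python
import anthropic

client = anthropic.Anthropic()

system_block = [
    {
        "type": "text",
        "text": "<entire document content>",  # placeholder for a large static document
        "cache_control": {"type": "ephemeral"},
    }
]

for question in ["What are the main themes?", "What evidence supports them?"]:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_block,
        messages=[{"role": "user", "content": question}],
    )
    u = response.usage
    print(f"{question!r}: created={u.cache_creation_input_tokens} "
          f"read={u.cache_read_input_tokens} uncached={u.input_tokens}")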

Pricing optimization:

  • 5-minute cache: 1.25x cost to write, 0.1x cost to read
  • 1-hour cache: 2x cost to write, 0.1x cost to read
  • Break-even point: one read for the 5-minute cache (the 0.25x write premium is less than the 0.9x saved per cached read); about two reads for the 1-hour cache (worked through in the sketch below)
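
The arithmetic behind that break-even claim, as a small sketch; the 10,000-token cached prefix is hypothetical, and the rates are the multipliers listed above:

Python
PREFIX_TOKENS = 10_000
WRITE_5MIN, WRITE_1HR, READ, BASE = 1.25, 2.0, 0.1, 1.0

def cached_total(write_mult: float, reads: int) -> float:
    """Relative cost: one cache write plus `reads` cached reads of the prefix."""
    return PREFIX_TOKENS * (write_mult + reads * READ)

def uncached_total(requests: int) -> float:
    """Relative cost of resending the full prefix on every request."""
    return PREFIX_TOKENS * BASE * requests

for reads in range(4):
    requests = reads + 1  # the write request also processes the prefix once
    print(reads, cached_total(WRITE_5MIN, reads), uncached_total(requests))
# reads=1: 13,500 cached vs 20,000 uncached, so the 5-minute cache already
# wins on its first read; the 1-hour write rate crosses over at the second read.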

Strategic breakpoint placement:

  • End of stable content (system instructions, context)
  • Before frequently changing content
  • Maximum 4 breakpoints per request
  • Consider the 20-block lookback window (a multi-turn sketch follows this list)
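
For multi-turn conversations, a common pattern (sketched below; the message contents are placeholders) is one breakpoint ending the stable system prefix and a second on the latest user turn, so the growing history is re-cached incrementally:

Python
conversation = [
    {"role": "user", "content": "First question about the document..."},
    {"role": "assistant", "content": "First answer..."},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question...",
                "cache_control": {"type": "ephemeral"},  # second breakpoint: end of history
            }
        ],
    },
]
# Passed as messages=conversation; the first breakpoint sits on the system
# block as in the earlier examples, keeping well under the 4-breakpoint limit.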

Content structure:

  • Tools → System → Messages (hierarchical order; see the sketch after this list)
  • Minimum tokens: 1024-4096 depending on model
  • Cache lifetime: 5 minutes (default) or 1 hour
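
That ordering, sketched in one request; the get_weather tool is hypothetical, and the breakpoint closes the stable tools-plus-system prefix:

Python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=[  # tools come first in the cache hierarchy
        {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    system=[  # then the system prompt, with the breakpoint at its end
        {
            "type": "text",
            "text": "You are a helpful assistant...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],  # dynamic content last
)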

Don't:

  • Cache frequently changing content
  • Place dynamic content before static content
  • Ignore the 20-block lookback limitation
  • Cache prompts below minimum token threshold

Cache invalidation triggers:

  • Modifying tool definitions (invalidates entire cache)
  • Changing system messages (invalidates system + messages)
  • Adding/removing images anywhere
  • Enabling/disabling web search or citations

Monitoring mistakes:

  • input_tokens only shows tokens after last breakpoint
  • Total tokens = cache_read_input_tokens + cache_creation_input_tokens + input_tokens (see the helper after this list)
  • Cache only available after first response completes (affects parallel requests)
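
As a quick self-check of the second bullet, a tiny helper; usage would be response.usage from a real call:

Python
def total_input_tokens(usage) -> int:
    # input_tokens counts only tokens after the last breakpoint, so the full
    # prompt size is the sum of all three counters.
    return (
        usage.cache_read_input_tokens
        + usage.cache_creation_input_tokens
        + usage.input_tokens
    )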

Model support:

  • Supported: All Claude 4.x, 3.x models
  • Check model compatibility before implementing
Grade: A- (AI Skill Framework)

Scorecard

Criteria Breakdown:

  • Quick Start: 11/15
  • Workflow: 11/15
  • Examples: 15/20
  • Completeness: 15/20
  • Format: 11/15
  • Conciseness: 11/15