AI Skill Report Card
Implementing Prompt Caching
Quick Start
```bash
curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are an expert analyst...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Analyze this data..."}
    ]
  }'
```
Recommendation
Add specific numerical examples showing token counts and cost calculations (e.g., 'For a 10,000 token document cached with 5 follow-up questions: Creation cost = 10,000 tokens × 1.25 × rate = 12,500 token-equivalents; Read cost = 5 × 10,000 tokens × 0.1 × rate')
Workflow
- Structure your prompt - Place static content first (tools, system, context)
- Add cache breakpoints - Mark reusable content with `cache_control: {"type": "ephemeral"}`
- Monitor performance - Check `cache_read_input_tokens` and `cache_creation_input_tokens`
- Optimize placement - Adjust breakpoints based on hit rates
Progress checklist for implementation:
- Identify static vs dynamic content
- Structure prompt with static content first
- Add cache_control to appropriate blocks
- Test with sample requests
- Monitor cache hit rates
- Adjust strategy based on performance
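The structuring steps above can be sketched as a small helper that builds the request body with static content in `system` and a cache breakpoint on its last block. This is a minimal illustration, not an SDK call; the field names match the Messages API shown in the Quick Start, but the helper itself is hypothetical.

```python
def build_cached_request(model, static_context, user_message, max_tokens=1024):
    """Place static content first (in `system`) and mark it as cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": static_context,
                # Breakpoint: everything up to and including this block is cacheable
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes after the breakpoint
        "messages": [{"role": "user", "content": user_message}],
    }

body = build_cached_request(
    "claude-sonnet-4-5",
    "You are an expert analyst...",
    "Analyze this data...",
)
```

Sending this body on each follow-up question keeps the cached prefix byte-identical, which is what allows cache reads instead of re-creation.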
Recommendation
Include a troubleshooting section with specific error messages and solutions (e.g., 'Error: cache_control not supported' → 'Check model compatibility and API version')
Examples
Example 1: Document Analysis

Input:

```json
{
  "system": [
    {
      "type": "text",
      "text": "<entire document content>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "What are the main themes?"}]
}
```
Output: Document cached, subsequent questions reuse cached content at 10% cost
Example 2: Code Assistant

Input:

```json
{
  "system": [
    {
      "type": "text",
      "text": "You are a coding assistant. Here's the codebase:\n<large codebase>",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}
```
Output: Codebase cached for multiple queries about functions, bugs, improvements
Recommendation
Provide a complete working example with actual API response showing cache_read_input_tokens and cache_creation_input_tokens values
Best Practices
Pricing optimization:
- 5-minute cache: 1.25x cost to write, 0.1x cost to read
- 1-hour cache: 2x cost to write, 0.1x cost to read
- Break-even point: ~3 reads for 5-minute cache
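The pricing multipliers above can be turned into a simple cost comparison. The sketch below works in input-token equivalents (multiply by your model's per-token rate for dollars) and uses the 5-minute cache multipliers: 1.25x to write, 0.1x to read.

```python
def uncached_cost(prompt_tokens, num_requests):
    """Every request re-sends the full prompt at the normal input rate (1.0x)."""
    return prompt_tokens * num_requests

def cached_cost(prompt_tokens, num_requests, write_mult=1.25, read_mult=0.1):
    """First request writes the cache; the remaining requests read it."""
    return prompt_tokens * write_mult + prompt_tokens * read_mult * (num_requests - 1)

# 10,000-token document, 6 requests total (1 cache creation + 5 follow-ups):
# uncached: 10,000 * 6 = 60,000 token-equivalents
# cached:   10,000 * 1.25 + 10,000 * 0.1 * 5 = 12,500 + 5,000 = 17,500
```

Passing `write_mult=2.0` models the 1-hour cache instead; the read multiplier stays at 0.1.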
Strategic breakpoint placement:
- End of stable content (system instructions, context)
- Before frequently changing content
- Maximum 4 breakpoints per request
- Consider 20-block lookback window
Content structure:
- Tools → System → Messages (hierarchical order)
- Minimum tokens: 1024-4096 depending on model
- Cache lifetime: 5 minutes (default) or 1 hour
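The hierarchical order above can be made concrete as a request skeleton. Because the cache prefix is built in tools → system → messages order, a breakpoint at the end of the system block also covers the tool definitions before it. The tool name and schema here are placeholders, not part of any real API.

```python
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    # 1. Tools: stable definitions, cached as part of the prefix
    "tools": [
        {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "...",
            "input_schema": {"type": "object", "properties": {}},
        }
    ],
    # 2. System: stable instructions and context, with the breakpoint at the end
    "system": [
        {
            "type": "text",
            "text": "Stable instructions and reference context...",
            "cache_control": {"type": "ephemeral"},  # end of stable content
        }
    ],
    # 3. Messages: dynamic, per-request content after the breakpoint
    "messages": [
        {"role": "user", "content": "Dynamic question goes here"},
    ],
}
```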
Common Pitfalls
Don't:
- Cache frequently changing content
- Place dynamic content before static content
- Ignore the 20-block lookback limitation
- Cache prompts below minimum token threshold
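One of the pitfalls above, exceeding the 4-breakpoint maximum, is easy to catch before sending a request. This is an illustrative pre-flight check, not an official validator; it counts `cache_control` markers across system and message content blocks.

```python
MAX_BREAKPOINTS = 4  # maximum cache breakpoints per request

def count_breakpoints(request):
    """Count cache_control markers in system and message content blocks."""
    blocks = list(request.get("system", []))
    for msg in request.get("messages", []):
        if isinstance(msg.get("content"), list):
            blocks.extend(msg["content"])
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)

req = {
    "system": [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "plain string content"}],
}
# count_breakpoints(req) -> 1, well under MAX_BREAKPOINTS
```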
Cache invalidation triggers:
- Modifying tool definitions (invalidates entire cache)
- Changing system messages (invalidates system + messages)
- Adding/removing images anywhere
- Enabling/disabling web search or citations
Monitoring mistakes:
- `input_tokens` only shows tokens after the last breakpoint
- Total tokens = `cache_read_input_tokens` + `cache_creation_input_tokens` + `input_tokens`
- Cache only available after first response completes (affects parallel requests)
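The total-token arithmetic above can be wrapped in a small helper. The field names match the API's usage object; the sample values below are made up for illustration.

```python
def total_input_tokens(usage: dict) -> int:
    """Reconstruct total input tokens: `input_tokens` alone only counts
    tokens after the last cache breakpoint."""
    return (
        usage.get("cache_read_input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("input_tokens", 0)
    )

# Illustrative usage payload from a cache-hit request:
usage = {
    "cache_read_input_tokens": 10000,   # cached prefix, billed at 0.1x
    "cache_creation_input_tokens": 0,   # nothing new written this turn
    "input_tokens": 37,                 # only the new user message
}
# total_input_tokens(usage) -> 10037
```

Logging this total alongside `cache_read_input_tokens` makes cache hit rates visible: a healthy setup shows most tokens landing in the read bucket after the first request.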
Model support:
- Supported: All Claude 4.x, 3.x models
- Check model compatibility before implementing