AI Skill Report Card
Implementing Prompt Caching
Quick Start
```bash
curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are an expert assistant with access to this knowledge base.",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "What can you tell me about this document?"}
    ]
  }'
```
Recommendation:
Add specific model token requirements (e.g., 'Claude 3.5 Sonnet: 1024 tokens, Claude 3 Opus: 4096 tokens') instead of ranges
Workflow
Setup Phase:
- Structure your prompt - Place static content first (tools → system → messages)
- Identify cacheable content - System instructions, context, examples, tool definitions
- Add cache breakpoints - Use cache_control: {"type": "ephemeral"} after reusable sections
- Set minimum token threshold - Ensure cached sections meet model requirements
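The setup steps above can be sketched as a request body. This is an illustrative shape only, following the Quick Start example: static content first (tools, then system), a cache_control marker closing the reusable section, and dynamic content after it.

```python
# Sketch of a request body ordered for caching: static content first,
# with a cache breakpoint after the last reusable section.
request_body = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [
        {"name": "calculator", "description": "Evaluate arithmetic expressions"},
    ],
    "system": [
        {
            "type": "text",
            "text": "Long, stable instructions and reference context go here...",
            # Breakpoint: everything up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Dynamic content stays after the final breakpoint.
    "messages": [{"role": "user", "content": "Summarize the reference context."}],
}
```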
Implementation Checklist:
- Move static content to prompt beginning
- Add cache_control to end of reusable sections
- Verify minimum token requirements (1024-4096 depending on model)
- Test with initial request to populate cache
- Monitor cache performance via response tokens
Optimization Phase:
- Analyze cache hit rates using response fields
- Adjust breakpoint placement based on content change frequency
- Consider multiple breakpoints for complex scenarios (max 4)
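As a sketch of the analysis step, a cache hit rate can be derived from the usage fields in the API response (field names as shown in the Response Analysis example later in this report; the helper function itself is illustrative):

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of cache-eligible input tokens served from the cache."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    eligible = read + created
    return read / eligible if eligible else 0.0

# First request populates the cache; an identical repeat reads from it.
first = {"cache_creation_input_tokens": 50000, "cache_read_input_tokens": 0, "input_tokens": 25}
repeat = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 50000, "input_tokens": 25}
```

A sustained hit rate near zero suggests the breakpoint sits after content that changes on every request.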
Recommendation:
Include cost calculation examples showing actual dollar savings for typical use cases
Examples
Example 1: Document Analysis
```json
{
  "system": [
    {"type": "text", "text": "Analyze documents thoroughly..."},
    {
      "type": "text",
      "text": "<entire_document_content>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "Summarize key themes"}]
}
```
Example 2: Tool-Heavy Assistant
```json
{
  "tools": [
    {"name": "calculator", "description": "...", "cache_control": {"type": "ephemeral"}},
    {"name": "web_search", "description": "..."}
  ],
  "system": [{"type": "text", "text": "You are a research assistant"}],
  "messages": [{"role": "user", "content": "Calculate 15% of 250"}]
}
```
Response Analysis:
```json
{
  "usage": {
    "cache_creation_input_tokens": 50000,  // New content written to the cache
    "cache_read_input_tokens": 0,          // Content read from the cache
    "input_tokens": 25                     // Uncached tokens after the last breakpoint
  }
}
```
Recommendation:
Add troubleshooting section with specific error messages and solutions (e.g., what happens when minimum tokens not met)
Best Practices
Content Organization:
- Place most stable content first (tools, system instructions, context)
- Set breakpoints after content that changes infrequently
- Keep dynamic content (user messages) after final breakpoint
Strategic Breakpoint Placement:
- In multi-turn conversations, set a breakpoint at the end of the conversation so the cache advances with each turn
- Add breakpoints before potentially editable sections
- Use multiple breakpoints when content changes at different frequencies
- Maximum 4 breakpoints per request
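The four-breakpoint limit can be checked before sending. A minimal sketch that counts cache_control markers across tools, system, and message content blocks (the block layout is assumed from the examples earlier in this report):

```python
def count_breakpoints(body: dict) -> int:
    """Count cache_control markers across tools, system, and messages."""
    count = sum("cache_control" in t for t in body.get("tools", []))
    count += sum("cache_control" in b for b in body.get("system", []))
    for msg in body.get("messages", []):
        content = msg.get("content", [])
        if isinstance(content, list):  # plain-string content carries no marker
            count += sum("cache_control" in b for b in content)
    return count

body = {
    "tools": [{"name": "calculator", "cache_control": {"type": "ephemeral"}}],
    "system": [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "Calculate 15% of 250"}],
}
assert count_breakpoints(body) <= 4  # API limit: maximum 4 breakpoints per request
```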
Cost Optimization:
- Cache writes cost 1.25x the base input-token price (5-minute TTL) or 2x (1-hour TTL)
- Cache reads cost 0.1x the base input-token price
- No cost for breakpoints themselves
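A worked example using the multipliers above. The $3-per-million-input-token base price is an assumption for illustration only; check current pricing for your model:

```python
BASE = 3.00 / 1_000_000  # assumed base input price, $ per token (illustrative)

def request_cost(usage: dict, write_multiplier: float = 1.25) -> float:
    """Cost of one request under the 1.25x write / 0.1x read multipliers."""
    return (usage.get("cache_creation_input_tokens", 0) * write_multiplier * BASE
            + usage.get("cache_read_input_tokens", 0) * 0.10 * BASE
            + usage.get("input_tokens", 0) * BASE)

# A 50,000-token cached prefix with 25 dynamic tokens per request:
first = request_cost({"cache_creation_input_tokens": 50_000, "input_tokens": 25})
repeat = request_cost({"cache_read_input_tokens": 50_000, "input_tokens": 25})
uncached = request_cost({"input_tokens": 50_025})
```

The first request pays a write premium over the uncached baseline, but every subsequent hit costs roughly a tenth of resending the prefix, so the premium amortizes quickly.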
Performance Monitoring:
- Track cache_read_input_tokens for hit rate
- Monitor cache_creation_input_tokens for new writes
- Calculate total input tokens: cache_read_input_tokens + cache_creation_input_tokens + input_tokens
Common Pitfalls
Token Requirements:
- Don't cache content below the minimum threshold (1024-4096 tokens, depending on model)
- Requests below the threshold are processed without caching, even when cache_control is set
Content Modifications:
- Any change to cached content invalidates that section and all subsequent sections
- Tool definition changes invalidate entire cache hierarchy
- Image additions/removals affect message cache validity
Concurrency Issues:
- Cache becomes available only after first response begins
- For parallel requests, send one request first and wait for its response before fanning out, so the later requests can hit the cache
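A sketch of that warm-up pattern; send_request is a hypothetical stand-in for whatever client call your application actually makes:

```python
from concurrent.futures import ThreadPoolExecutor

def warm_then_fan_out(send_request, first_body, remaining_bodies):
    """Send one request to populate the cache, then parallelize the rest."""
    results = [send_request(first_body)]  # cache entry exists once this returns
    with ThreadPoolExecutor() as pool:
        results.extend(pool.map(send_request, remaining_bodies))
    return results
```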
Breakpoint Limitations:
- The system checks at most 20 content blocks backwards from each breakpoint when looking for cache hits
- Content modified more than 20 blocks before a breakpoint won't hit the cache without an additional explicit breakpoint
- Empty text blocks cannot be cached
Unsupported Elements:
- Thinking blocks cannot be cached directly (but count as input tokens when read)
- Sub-content blocks like citations cannot be cached individually
- Cache top-level blocks containing sub-elements instead