AI Skill Report Card

Implementing Prompt Caching

Grade: A- (85) · Feb 5, 2026
Bash

  curl https://api.anthropic.com/v1/messages \
    -H "content-type: application/json" \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -d '{
      "model": "claude-sonnet-4-5",
      "max_tokens": 1024,
      "system": [
        {
          "type": "text",
          "text": "You are an expert assistant with access to this knowledge base.",
          "cache_control": {"type": "ephemeral"}
        }
      ],
      "messages": [
        {"role": "user", "content": "What can you tell me about this document?"}
      ]
    }'
Recommendation
Add specific model token requirements (e.g., 'Claude 3.5 Sonnet: 1024 tokens, Claude 3 Opus: 4096 tokens') instead of ranges

Setup Phase:

  1. Structure your prompt - Place static content first (tools → system → messages)
  2. Identify cacheable content - System instructions, context, examples, tool definitions
  3. Add cache breakpoints - Use cache_control: {"type": "ephemeral"} after reusable sections
  4. Set minimum token threshold - Ensure cached sections meet model requirements
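The setup steps above can be sketched as a payload builder that enforces the static-first ordering (tools → system → messages) with a breakpoint closing the reusable block. The function name and placeholder strings are illustrative, not part of any SDK:

```python
def build_cached_request(tools, system_text, context, user_message):
    """Assemble a request body in cache-friendly order:
    tools -> system (static, cached) -> messages (dynamic)."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "tools": tools,  # most stable content first
        "system": [
            {"type": "text", "text": system_text},
            {
                "type": "text",
                "text": context,
                # breakpoint: everything up to and including this block is cacheable
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # dynamic content stays after the final breakpoint
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_cached_request([], "You are an assistant.", "<large context>", "Hi")
```

Because Python dicts preserve insertion order, the serialized body keeps the tools → system → messages layout.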

Implementation Checklist:

  • Move static content to prompt beginning
  • Add cache_control to end of reusable sections
  • Verify minimum token requirements (1024-4096 depending on model)
  • Test with initial request to populate cache
  • Monitor cache performance via response tokens
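The last two checklist items can be verified directly from the first response's usage block; the dict below mimics the response shape rather than a real API call:

```python
# Usage block as returned on the first (cache-populating) request
first_usage = {
    "cache_creation_input_tokens": 50_000,  # content written to the cache
    "cache_read_input_tokens": 0,           # nothing read yet on request one
    "input_tokens": 25,                     # dynamic tokens after the breakpoint
}

# A successful cache write shows creation tokens > 0 on the first request
cache_populated = first_usage["cache_creation_input_tokens"] > 0
```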

Optimization Phase:

  • Analyze cache hit rates using response fields
  • Adjust breakpoint placement based on content change frequency
  • Consider multiple breakpoints for complex scenarios (max 4)
Recommendation
Include cost calculation examples showing actual dollar savings for typical use cases
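As a rough illustration of the savings, here is a back-of-the-envelope calculation using the multipliers cited later in this report (1.25x for a 5-minute cache write, 0.1x for reads); the $3-per-million-token base price is an assumed example, not a quoted rate:

```python
def caching_cost(base_price_per_mtok, cached_tokens, dynamic_tokens, num_requests):
    """Compare total cost with and without a 5-minute cache across
    num_requests requests sharing the same cached prefix. All prices
    here are illustrative assumptions."""
    price = base_price_per_mtok / 1_000_000  # dollars per token
    without = num_requests * (cached_tokens + dynamic_tokens) * price
    with_cache = (
        cached_tokens * price * 1.25                        # one cache write
        + (num_requests - 1) * cached_tokens * price * 0.1  # subsequent reads
        + num_requests * dynamic_tokens * price             # uncached tail
    )
    return without, with_cache

plain, cached = caching_cost(3.0, cached_tokens=50_000,
                             dynamic_tokens=100, num_requests=20)
```

Under these assumptions, 20 requests over a 50K-token prefix cost about $3.01 uncached versus roughly $0.48 cached, around an 84% reduction.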

Example 1: Document Analysis

JSON

  {
    "system": [
      {"type": "text", "text": "Analyze documents thoroughly..."},
      {
        "type": "text",
        "text": "<entire_document_content>",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [{"role": "user", "content": "Summarize key themes"}]
  }

Example 2: Tool-Heavy Assistant

JSON

  {
    "tools": [
      {"name": "calculator", "description": "...", "cache_control": {"type": "ephemeral"}},
      {"name": "web_search", "description": "..."}
    ],
    "system": [{"type": "text", "text": "You are a research assistant"}],
    "messages": [{"role": "user", "content": "Calculate 15% of 250"}]
  }

Response Analysis:

JSON

  {
    "usage": {
      "cache_creation_input_tokens": 50000,  // New content cached
      "cache_read_input_tokens": 0,          // Content read from cache
      "input_tokens": 25                     // Tokens after cache breakpoint
    }
  }
Recommendation
Add troubleshooting section with specific error messages and solutions (e.g., what happens when minimum tokens not met)

Content Organization:

  • Place most stable content first (tools, system instructions, context)
  • Set breakpoints after content that changes infrequently
  • Keep dynamic content (user messages) after final breakpoint

Strategic Breakpoint Placement:

  • In multi-turn conversations, set a breakpoint at the end of the conversation history
  • Add breakpoints before potentially editable sections
  • Use multiple breakpoints when content changes at different frequencies
  • Maximum 4 breakpoints per request
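A sketch of how multiple breakpoints can separate content that changes at different frequencies, staying within the 4-breakpoint limit; the section texts are placeholders:

```python
payload = {
    "system": [
        # Breakpoint 1: instructions that almost never change
        {"type": "text", "text": "<stable instructions>",
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 2: reference documents updated occasionally
        {"type": "text", "text": "<reference documents>",
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 3: context refreshed once per session
        {"type": "text", "text": "<session context>",
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        # Dynamic turn-by-turn content stays after the final breakpoint
        {"role": "user", "content": "latest question"},
    ],
}

breakpoints = sum(1 for block in payload["system"] if "cache_control" in block)
assert breakpoints <= 4  # hard limit per request
```

When the session context changes, only the third section is rewritten; the first two cached sections remain valid.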

Cost Optimization:

  • Cache writes cost 1.25x base tokens (5min) or 2x (1hr)
  • Cache reads cost 0.1x base tokens
  • No cost for breakpoints themselves

Performance Monitoring:

  • Track cache_read_input_tokens for hit rate
  • Monitor cache_creation_input_tokens for new writes
  • Calculate total tokens: cache_read + cache_creation + input_tokens
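The monitoring bullets above can be rolled into a small aggregator; the usage dicts mirror the response shape shown in the Response Analysis example, and the function name is illustrative:

```python
def cache_stats(usages):
    """Aggregate usage dicts from several responses into totals and a hit rate
    (fraction of all input tokens served from cache)."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + created + uncached  # total input tokens processed
    hit_rate = read / total if total else 0.0
    return {"total_tokens": total, "hit_rate": hit_rate}

stats = cache_stats([
    {"cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0,
     "input_tokens": 25},   # first request: cache write
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000,
     "input_tokens": 30},   # second request: cache hit
])
```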

Token Requirements:

  • Don't cache content below minimum thresholds (1024-4096 tokens per model)
  • Requests below threshold process without caching despite cache_control
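Because under-threshold sections silently skip caching, a rough pre-flight check can flag them early. The 4-characters-per-token ratio below is a crude English-text approximation, not a real tokenizer:

```python
def likely_cacheable(text, min_tokens=1024, chars_per_token=4):
    """Heuristic: estimate whether a section probably meets the model's
    minimum cacheable size. chars_per_token=4 is a rough approximation."""
    approx_tokens = len(text) / chars_per_token
    return approx_tokens >= min_tokens

short_section = "You are a helpful assistant."
long_section = "reference material " * 1000  # ~19,000 characters
```

For an accurate decision, count tokens with the model's real tokenizer instead of this character heuristic.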

Content Modifications:

  • Any change to cached content invalidates that section and all subsequent sections
  • Tool definition changes invalidate entire cache hierarchy
  • Image additions/removals affect message cache validity

Concurrency Issues:

  • Cache becomes available only after first response begins
  • For parallel requests, wait for initial response before sending subsequent ones
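The warm-up pattern above can be sketched as follows; send_request is a hypothetical placeholder standing in for a real API call:

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(payload):
    """Hypothetical stand-in for an API call; returns a usage-like dict."""
    return {"cache_read_input_tokens": 0}

def warm_then_fan_out(payload, follow_ups):
    """Send one warm-up request first so the cache entry exists,
    then fan out the remaining requests in parallel."""
    first = send_request(payload)       # cache is written by this request
    with ThreadPoolExecutor() as pool:  # later requests can now hit it
        rest = list(pool.map(send_request, follow_ups))
    return [first] + rest

results = warm_then_fan_out({"id": 0}, [{"id": 1}, {"id": 2}])
```

Firing all requests simultaneously instead would make each one pay the full uncached cost, since none would find an existing cache entry.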

Breakpoint Limitations:

  • System checks max 20 blocks backwards from breakpoints
  • Content modified beyond 20-block window won't hit cache without explicit breakpoints
  • Empty text blocks cannot be cached

Unsupported Elements:

  • Thinking blocks cannot be cached directly (but count as input tokens when read)
  • Sub-content blocks such as citations cannot be cached individually
  • Instead, cache the top-level block that contains those sub-elements
Grade: A-

AI Skill Framework Scorecard

Criteria Breakdown

  • Quick Start: 11/15
  • Workflow: 11/15
  • Examples: 15/20
  • Completeness: 15/20
  • Format: 11/15
  • Conciseness: 11/15