AI Skill Report Card
Generating Test Data
```yaml
---
name: generating-test-data
description: Generates realistic placeholder data for development and testing. Use when you need sample datasets, mock API responses, or test fixtures.
---
```

# Quick Start

```python
from faker import Faker
import random

fake = Faker()

# Generate test users
users = [
    {
        "id": i,
        "name": fake.name(),
        "email": fake.email(),
        "created_at": fake.date_this_year().isoformat()
    }
    for i in range(1, 6)
]
```
Workflow
- Identify data schema: define the fields, types, and relationships needed
- Choose generation method: Faker library, manual patterns, or real data sampling
- Generate base dataset: create core records with realistic values
- Add variations and edge cases: include nulls, extremes, and special scenarios
- Export in target format: CSV, JSON, SQL, or XML as needed
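The five steps above can be sketched end to end using only the standard library. This is a minimal sketch, not a prescribed implementation: the `SCHEMA` mapping, the `generate_records` and `add_edge_cases` helpers, the name lists, and the 255-character limit are all hypothetical, and the simple generators stand in for Faker calls such as `fake.name()`.

```python
import json
import random

rng = random.Random(42)  # seeded so the output is reproducible

# Hypothetical value pools standing in for a library such as Faker.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["Lee", "Okafor", "Silva"]
CITIES = ["Lisbon", "Osaka", "Denver"]

# Steps 1-2: the schema maps each field to a generator function.
SCHEMA = {
    "name": lambda: f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
    "age": lambda: rng.randint(18, 80),
    "city": lambda: rng.choice(CITIES),
}


def generate_records(schema, count):
    """Step 3: build `count` base records from the schema's generators."""
    return [
        {"id": i, **{field: make() for field, make in schema.items()}}
        for i in range(1, count + 1)
    ]


def add_edge_cases(records):
    """Step 4: append records with a null, extremes, and special characters."""
    next_id = len(records) + 1
    edge_cases = [
        {"id": next_id, "name": None, "age": 18, "city": "Lisbon"},
        {"id": next_id + 1, "name": "X" * 255, "age": 120, "city": ""},
        {"id": next_id + 2, "name": "Zoë O'Brien-Ñuñez", "age": 80, "city": "São Paulo"},
    ]
    return records + edge_cases


records = add_edge_cases(generate_records(SCHEMA, 5))

# Step 5: export to the target format (JSON here).
payload = json.dumps(records, ensure_ascii=False, indent=2)
```

Keeping generation, edge cases, and export as separate steps means the export format can change without touching the generators.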
Progress:
- Map required fields and data types
- Install generation tools (faker, mimesis)
- Create base data generation script
- Add edge cases and variants
- Export to required format
Examples
Example 1:
Input: E-commerce product catalog (50 items)
Output:

```python
import random

from faker import Faker

fake = Faker()

# 50 products with sequential SKUs, random prices, and a fixed category set
products = [
    {
        "sku": f"PROD-{1000+i}",
        "name": fake.catch_phrase(),
        "price": round(random.uniform(9.99, 299.99), 2),
        "category": random.choice(["Electronics", "Clothing", "Home"]),
        "in_stock": random.choice([True, False])
    }
    for i in range(50)
]
```
Example 2:
Input: CSV for user testing (100 rows)
Output:

```python
import csv
import random

from faker import Faker

fake = Faker()

# Write a header row followed by 100 generated user rows
with open('test_users.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name', 'age', 'city'])
    writer.writeheader()
    for i in range(100):
        writer.writerow({
            'id': i + 1,
            'name': fake.name(),
            'age': random.randint(18, 80),
            'city': fake.city()
        })
```
Example 3:
Input: API response with nested data
Output:

```python
import random

from faker import Faker

fake = Faker()

# 10 orders, each with 1 to 4 line items nested under "items"
api_response = {
    "status": "success",
    "data": {
        "orders": [
            {
                "order_id": fake.uuid4(),
                "customer": fake.name(),
                "items": [
                    {"product": fake.word(), "qty": random.randint(1, 5)}
                    for _ in range(random.randint(1, 4))
                ],
                "total": round(random.uniform(25.00, 500.00), 2)
            }
            for _ in range(10)
        ]
    }
}
```
Best Practices
- Use Faker library for realistic personal data (names, addresses, emails)
- Use mimesis for performance-critical large datasets
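- Seed random generators for reproducible test data: `fake.seed_instance(42)` seeds the Faker instance, and `random.seed(42)` seeds the standard `random` module used in the examples
- Create data templates for common schemas (users, products, transactions)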
- Include realistic distributions (80/20 rule, bell curves for ages/prices)
- Add intentional edge cases: empty strings, max lengths, special characters
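The seeding, distribution, and edge-case advice above can be illustrated with the standard `random` module. This is a minimal sketch under stated assumptions: the function names, the mean and standard deviation for ages, the customer count, and the 255-character maximum are illustrative choices, not values defined by this skill.

```python
import random


def bell_curve_ages(rng, count, mean=38.0, stddev=12.0, low=18, high=80):
    """Draw ages from a normal distribution, clamped to [low, high]."""
    return [
        min(high, max(low, round(rng.gauss(mean, stddev))))
        for _ in range(count)
    ]


def pareto_customer_ids(rng, order_count, customer_count=100):
    """Assign orders so the first 20% of customers receive about 80% of them."""
    top = customer_count // 5  # the "heavy" 20% of customers
    rest = customer_count - top
    # Each top customer gets weight 0.8/top, each remaining one 0.2/rest.
    weights = [0.8 / top] * top + [0.2 / rest] * rest
    return rng.choices(range(1, customer_count + 1), weights=weights, k=order_count)


# Intentional edge cases for string fields: empty, max length, special characters.
EDGE_CASE_STRINGS = ["", "a" * 255, "O'Brien; DROP TABLE--", "名前 🙂"]

# Two generators created with the same seed produce identical data.
ages_a = bell_curve_ages(random.Random(42), 1000)
ages_b = bell_curve_ages(random.Random(42), 1000)

# Share of orders that went to customers 1-20 (the top 20%).
orders = pareto_customer_ids(random.Random(42), 10_000)
top_share = sum(1 for cid in orders if cid <= 20) / len(orders)
```

Passing an explicit `random.Random` instance into each function keeps the seed local, so one test's data does not shift when another test consumes random numbers. With Faker, `fake.seed_instance(42)` plays the same role for that instance's generator.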
Common Pitfalls
- Using obviously fake values such as "Test User 1", which make tests less realistic
- Forgetting to handle foreign key relationships in relational data
- Creating datasets too small to reveal performance issues
- Using production data patterns that leak sensitive information
- Including too little data variety to catch edge-case bugs
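The foreign key pitfall above is usually avoided by generating parent rows first and drawing every child reference from the IDs that actually exist. The following is a minimal sketch with hypothetical `users` and `orders` tables; the field names, name lists, and table sizes are assumptions for illustration.

```python
import random

rng = random.Random(7)  # seeded for reproducible output

# Hypothetical name pools standing in for a generator such as Faker.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["Lee", "Okafor", "Silva"]

# Parent table: generate users first so their IDs are known.
users = [
    {"id": i, "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"}
    for i in range(1, 21)
]
user_ids = [u["id"] for u in users]

# Child table: each order's user_id is sampled from existing user IDs,
# so no order can reference a user that was never generated.
orders = [
    {
        "order_id": n,
        "user_id": rng.choice(user_ids),
        "total": round(rng.uniform(25.00, 500.00), 2),
    }
    for n in range(1, 101)
]

# Referential integrity check: every foreign key resolves to a parent row.
known_ids = set(user_ids)
orphans = [o for o in orders if o["user_id"] not in known_ids]
```

Running the integrity check after generation catches mistakes early, for example if the parent table is later filtered or regenerated with a different size.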