AI Skill Report Card
Generating Test Data
```yaml
---
name: generating-test-data
description: Generates realistic test datasets and mock content using systematic approaches. Use when you need sample data for development, testing, or demonstrations.
---
```
Quick Start
```python
from faker import Faker
import random

fake = Faker()

users = [
    {
        "id": i,
        "name": fake.name(),
        "email": fake.email(),
        "created_at": fake.date_time_between(start_date='-2y').isoformat(),
        "status": random.choice(["active", "inactive", "pending"])
    }
    for i in range(1, 101)
]
```
Workflow
- Define schema - Specify data types and relationships needed
- Choose generation method - Faker for personal data, random for IDs/dates
- Apply constraints - Realistic ranges, valid formats, business rules
- Generate in batches - Create manageable chunks for large datasets
- Export format - JSON, CSV, SQL inserts based on target system
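The workflow steps above can be sketched end to end as follows. This is a minimal illustration that uses only the standard library so it runs without extra dependencies; the `SCHEMA` mapping and the helpers `generate_rows`, `batched`, and `to_csv` are hypothetical names introduced here, and in practice the field generators could just as well call Faker.

```python
import csv
import io
import json
import random
from itertools import islice

rng = random.Random(42)  # seeded so the sketch is reproducible

# Step 1-3: hypothetical schema mapping each field to a generator
# that already respects its constraints (ranges, allowed values).
SCHEMA = {
    "id": lambda: rng.randint(100_000, 999_999),
    "age": lambda: rng.randint(18, 90),
    "status": lambda: rng.choice(["active", "inactive", "pending"]),
}


def generate_rows(schema, count):
    """Yield `count` rows lazily, one dict per row."""
    for _ in range(count):
        yield {field: make() for field, make in schema.items()}


def batched(rows, size):
    """Step 4: group an iterator of rows into lists of at most `size` rows."""
    rows = iter(rows)
    while batch := list(islice(rows, size)):
        yield batch


def to_csv(rows, fieldnames):
    """Step 5: serialize one batch of rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


batches = list(batched(generate_rows(SCHEMA, 250), 100))
csv_text = to_csv(batches[0], fieldnames=list(SCHEMA))
json_text = json.dumps(batches[0][:2], indent=2)
```

Because `generate_rows` is a generator, only one batch needs to be held in memory at a time when each batch is written out as it is produced; the `list(...)` call here is only for demonstration.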
Progress:
- Map required fields and data types
- Set up Faker with appropriate locales/providers
- Define realistic value ranges and distributions
- Generate sample batch and validate
- Scale to full dataset size
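The "generate sample batch and validate" step might look something like the sketch below. The rules shown (positive integer IDs, a loose email pattern, a fixed set of statuses) are assumptions chosen for illustration, and `validate_row` and `validate_batch` are hypothetical helper names; real checks should mirror the target schema.

```python
import random
import re

# Deliberately loose pattern: something@something.tld with no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_STATUSES = {"active", "inactive", "pending"}


def validate_row(row):
    """Return a list of human-readable problems found in one row."""
    problems = []
    if not isinstance(row.get("id"), int) or row["id"] <= 0:
        problems.append("id must be a positive integer")
    if not EMAIL_PATTERN.match(str(row.get("email", ""))):
        problems.append("email is not in a valid format")
    if row.get("status") not in ALLOWED_STATUSES:
        problems.append("status is not an allowed value")
    return problems


def validate_batch(rows):
    """Map row index to its problems, including only rows that fail."""
    return {i: p for i, row in enumerate(rows) if (p := validate_row(row))}


rng = random.Random(7)
sample = [
    {
        "id": i,
        "email": f"user{i}@example.com",
        "status": rng.choice(sorted(ALLOWED_STATUSES)),
    }
    for i in range(1, 11)
]
# Deliberately broken row to show what a failure report looks like.
sample.append({"id": -1, "email": "not-an-email", "status": "deleted"})

report = validate_batch(sample)
```

An empty `report` suggests the sample batch is safe to scale up; a non-empty one points at the generator rule that needs fixing before producing the full dataset.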
Examples
Example 1:
Input: E-commerce product catalog with 500 items
Output:
```python
from faker import Faker
import random

fake = Faker()

products = [
    {
        "sku": f"PROD-{fake.random_int(10000, 99999)}",  # not guaranteed unique
        "name": fake.catch_phrase(),
        "price": round(random.uniform(9.99, 999.99), 2),
        "category": fake.random_element(["Electronics", "Clothing", "Books"]),
        "stock": fake.random_int(0, 100),
        "description": fake.text(max_nb_chars=200)
    }
    for _ in range(500)
]
```
Example 2:
Input: API response simulation with nested relationships
Output:
```python
from faker import Faker
import random

fake = Faker()

api_response = {
    "user": {
        "id": 1001,
        "profile": fake.profile(),
        # Each user gets between 0 and 10 orders
        "orders": [
            {
                "order_id": fake.uuid4(),
                "items": fake.random_int(1, 5),  # number of items in the order
                "total": round(random.uniform(25.00, 500.00), 2),
                "status": fake.random_element(["pending", "shipped", "delivered"])
            }
            for _ in range(fake.random_int(0, 10))
        ]
    }
}
```
Example 3:
Input: SQL test data for user authentication table
Output:
```sql
INSERT INTO users (username, email, password_hash, created_at) VALUES
  ('john_doe_123', 'john.doe@example.com', '$2b$12$hash...', '2023-05-15 14:30:22'),
  ('jane_smith_456', 'j.smith@company.com', '$2b$12$hash...', '2023-06-01 09:15:11');
```
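Rows like these can also be generated and loaded programmatically rather than written by hand. The sketch below assumes an in-memory SQLite database and a hypothetical `users` table definition; the password hash is a fixed placeholder string, not a real bcrypt hash. Parameterized inserts (`?` placeholders) avoid quoting problems when generated values contain apostrophes.

```python
import random
import sqlite3
from datetime import datetime, timedelta

rng = random.Random(42)
base = datetime(2023, 1, 1)

rows = []
for i in range(1, 6):
    # The trailing index keeps usernames unique even if the random part repeats.
    username = f"user_{rng.randint(100, 999)}_{i}"
    created = base + timedelta(minutes=rng.randint(0, 365 * 24 * 60 - 1))
    rows.append((
        username,
        f"{username}@example.com",
        "$2b$12$placeholder",  # placeholder only, not a valid hash
        created.strftime("%Y-%m-%d %H:%M:%S"),
    ))

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,
        email TEXT NOT NULL,
        password_hash TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO users (username, email, password_hash, created_at) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

inserted = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```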
Best Practices
- Use Faker library for realistic personal/business data
- Set consistent seeds (`Faker.seed(42)`) for reproducible datasets
- Include edge cases (empty strings, null values, boundary conditions)
- Match real data distributions (for example, 80% active users and 20% inactive)
- Use appropriate locales (`Faker('en_US')` vs `Faker('de_DE')`)
- Generate related data with logical consistency (order dates after user creation)
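Several of these practices can be combined in one generator. The sketch below is an illustration using the standard library's `random.Random` in place of Faker so it runs anywhere; seeding it plays the same role as `Faker.seed(42)`. The `make_users` function, the field names, and the 10% rate of missing middle names are assumptions made for the example.

```python
import random
from datetime import datetime, timedelta


def make_users(count, seed=42):
    """Generate users reproducibly from a seed."""
    rng = random.Random(seed)
    now = datetime(2024, 1, 1)  # fixed reference point keeps output stable
    users = []
    for _ in range(count):
        created_at = now - timedelta(days=rng.randint(1, 730))
        # Logical consistency: the first order never precedes account creation.
        days_since_creation = (now - created_at).days
        first_order_at = created_at + timedelta(
            days=rng.randint(0, days_since_creation)
        )
        users.append({
            # Weighted choice approximates an 80/20 active/inactive split.
            "status": rng.choices(["active", "inactive"], weights=[80, 20])[0],
            "created_at": created_at,
            "first_order_at": first_order_at,
            # Edge case: roughly 1 in 10 users has no middle name.
            "middle_name": None if rng.random() < 0.1 else "Q",
        })
    return users


users = make_users(1000)
active_share = sum(u["status"] == "active" for u in users) / len(users)
```

Calling `make_users` twice with the same seed returns identical data, which makes failing tests repeatable.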
Common Pitfalls
- Don't use sequential IDs that reveal dataset size
- Avoid unrealistic data combinations (future birth dates, negative prices)
- Don't ignore foreign key relationships in related tables
- Don't generate all data at once for large datasets (memory issues)
- Don't forget to check generated data against the target environment's constraints (for example, column lengths, character encoding, and unique keys)
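A few of these pitfalls can be avoided by construction. The following sketch, which assumes a simple users-and-orders relationship, uses random UUIDs instead of sequential IDs, draws every foreign key from the set of existing users, and keeps prices positive. The `random_uuid` helper is a hypothetical name; it builds version-4 UUIDs from a seeded generator so runs stay repeatable.

```python
import random
import uuid

rng = random.Random(42)


def random_uuid(rng):
    """Build a version-4 UUID from the seeded generator."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


# Non-sequential IDs do not reveal how many rows exist.
users = [{"id": random_uuid(rng)} for _ in range(20)]
user_ids = [u["id"] for u in users]

orders = [
    {
        "id": random_uuid(rng),
        "user_id": rng.choice(user_ids),  # foreign key: always an existing user
        "total": round(rng.uniform(0.01, 500.00), 2),  # never zero or negative
    }
    for _ in range(100)
]

# Sanity check: no order should reference a user that does not exist.
known_ids = set(user_ids)
orphans = [o for o in orders if o["user_id"] not in known_ids]
```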