AI Skill Report Card
Debugging ML Pipelines
```yaml
---
name: debugging-ml-pipelines
description: Debugs machine learning pipelines, especially motion detection systems, by identifying root causes and providing targeted fixes without requiring architectural changes. Use when ML models underperform, training fails, or motion detection gives poor results.
---
```

# ML Pipeline Debugging
Quick Start (15 / 15)
```python
import torch

# First, run this diagnostic to identify the issue
def diagnose_pipeline(model, data_loader, device):
    print("=== PIPELINE DIAGNOSIS ===")

    # Check data flow (assumes each batch is an (inputs, labels) pair)
    batch = next(iter(data_loader))
    print(f"Batch shape: {batch[0].shape}")
    print(f"Label shape: {batch[1].shape}")
    print(f"Data type: {batch[0].dtype}")
    print(f"Value range: [{batch[0].min():.3f}, {batch[0].max():.3f}]")

    # Check model forward pass (note: leaves the model in eval mode)
    model.eval()
    with torch.no_grad():
        output = model(batch[0].to(device))
    print(f"Output shape: {output.shape}")
    print(f"Output range: [{output.min():.3f}, {output.max():.3f}]")

    return batch, output
```
Recommendation:
Add more concrete input/output pairs showing before/after states of broken vs fixed pipelines
Workflow (14 / 15)
Progress:
- Run diagnostics to identify symptoms
- Isolate the failing component
- Apply targeted fix
- Validate fix without breaking existing code
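One way to approach the "isolate the failing component" step is to attach forward hooks and report the first submodule whose output stops being finite. The sketch below is an illustration only: `find_first_nonfinite_layer` and the deliberately broken `BrokenLog` layer are hypothetical names, not part of the skill, and the check only inspects plain tensor outputs.

```python
import torch
import torch.nn as nn


def find_first_nonfinite_layer(model, sample_input):
    """Return the name of the first submodule whose output has NaN/Inf, else None."""
    offenders = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Only plain tensor outputs are inspected in this sketch
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                offenders.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is ""
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        model.eval()  # note: leaves the model in eval mode
        with torch.no_grad():
            model(sample_input)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again

    return offenders[0] if offenders else None


class BrokenLog(nn.Module):
    """Hypothetical faulty layer: log of a negative number yields NaN."""

    def forward(self, x):
        return torch.log(x - 10.0)


model = nn.Sequential(nn.Linear(4, 4), nn.Sigmoid(), BrokenLog(), nn.Linear(4, 1))
first_bad = find_first_nonfinite_layer(model, torch.randn(2, 4))
print(first_bad)  # "2", the index of BrokenLog inside the Sequential
```

Hooks fire in forward order, so the first recorded name is the earliest layer that produced the bad values; later layers merely propagate them.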
Step 1: Identify the Problem Type
```python
# Motion detection specific checks
def check_motion_detection_issues(frames, detector):
    # Check frame preprocessing
    if frames.max() > 1.0:
        print("❌ Frames not normalized - fix: frames = frames / 255.0")

    # Check temporal consistency
    if len(frames.shape) != 4:
        print("❌ Wrong frame dimensions - expected (batch, time, H, W)")

    # Check detection sensitivity
    motion_pixels = detector.detect_motion(frames)
    if motion_pixels.sum() == 0:
        print("❌ No motion detected - threshold too high")
    elif motion_pixels.sum() > 0.8 * motion_pixels.numel():
        print("❌ Too much motion detected - threshold too low")
```
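The checks above assume a detector object with a `detect_motion` method and a `threshold` attribute, but the skill does not show one. The following is a minimal sketch of what such an object might look like, using plain frame differencing; `FrameDiffDetector` is a hypothetical stand-in, and the synthetic clip exists only to make the numbers easy to verify.

```python
import torch


class FrameDiffDetector:
    """Hypothetical detector: flags pixels whose frame-to-frame change exceeds a threshold."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold

    def detect_motion(self, frames):
        # frames: (batch, time, H, W), values expected in [0, 1]
        diffs = torch.diff(frames, dim=1).abs()
        return diffs > self.threshold  # boolean mask, shape (batch, time - 1, H, W)


# Synthetic clip: black background, one bright 4x4 square moving 4 pixels per frame
frames = torch.zeros(1, 5, 32, 32)
for t in range(5):
    frames[0, t, 10:14, 4 * t:4 * t + 4] = 1.0

detector = FrameDiffDetector(threshold=0.5)
mask = detector.detect_motion(frames)
motion_fraction = mask.float().mean().item()
print(f"mask shape: {tuple(mask.shape)}, motion fraction: {motion_fraction:.4f}")

# A threshold above the largest possible difference reproduces the "no motion" symptom
detector.threshold = 2.0
silent_count = detector.detect_motion(frames).sum().item()
print(f"pixels flagged with threshold 2.0: {silent_count}")
```

Each transition changes 32 pixels (16 where the square left, 16 where it arrived), so 128 of the 4096 mask entries are set, well inside the range the sensitivity check treats as healthy.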
Step 2: Apply Minimal Fixes
Data Issues:
```python
from torchvision import transforms

# Fix normalization in place by extending the dataset's existing transform
dataset = data_loader.dataset
if getattr(dataset, 'transform', None) is not None:
    # Normalize expects a tensor, so the existing transform must already produce one
    dataset.transform = transforms.Compose([
        dataset.transform,
        transforms.Normalize(mean=[0.485], std=[0.229]),  # single-channel statistics
    ])
```
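The values 0.485 and 0.229 match the red-channel statistics commonly used with ImageNet-pretrained models, which may not suit grayscale motion frames. A hedged alternative is to measure the statistics of the actual data first. The sketch below assumes the loader yields `(images, labels)` pairs with images shaped `(N, C, H, W)`; `channel_stats` is a hypothetical helper and the random tensors stand in for real frames.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def channel_stats(loader):
    """Per-channel mean and (population) std over every image the loader yields."""
    count = 0
    total = None
    total_sq = None
    for images, _ in loader:
        images = images.double()  # accumulate in float64 to limit rounding error
        if total is None:
            total = torch.zeros(images.shape[1], dtype=torch.float64)
            total_sq = torch.zeros(images.shape[1], dtype=torch.float64)
        total += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
        count += images.shape[0] * images.shape[2] * images.shape[3]
    mean = total / count
    std = (total_sq / count - mean ** 2).clamp(min=0).sqrt()
    return mean.float(), std.float()


torch.manual_seed(0)
images = torch.rand(64, 1, 16, 16)  # stand-in for grayscale frames in [0, 1]
labels = torch.zeros(64, dtype=torch.long)
loader = DataLoader(TensorDataset(images, labels), batch_size=16)

mean, std = channel_stats(loader)
normalized = (images - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)
print(f"mean: {mean.tolist()}, std: {std.tolist()}")
```

The resulting `mean` and `std` can be passed to `transforms.Normalize` as lists; after normalization the data should have roughly zero mean and unit variance.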
Model Issues:
```python
# Fix gradient flow without changing architecture
# Call after loss.backward() and before optimizer.step()
def fix_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"❌ No gradient: {name}")
        elif param.grad.abs().max() < 1e-7:
            print(f"❌ Vanishing gradient: {name}")
        elif param.grad.abs().max() > 100:
            print(f"❌ Exploding gradient: {name} - clipping")
            param.grad.clamp_(-10, 10)  # element-wise clamp, in place
```
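Element-wise clamping changes the direction of the gradient. A common alternative, shown here as a sketch rather than as the skill's own method, is norm-based clipping with `torch.nn.utils.clip_grad_norm_`, which rescales all gradients by one shared factor. The toy model, the inflated targets, and `max_norm = 1.0` are assumptions chosen only to trigger large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Deliberately huge targets so that the gradients are large
inputs = torch.randn(16, 8)
targets = 1e4 * torch.ones(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

max_norm = 1.0
# clip_grad_norm_ rescales every gradient by the same factor and
# returns the total norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"grad norm before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
optimizer.step()
```

Logging the returned pre-clipping norm on every step also serves as a cheap diagnostic for when gradients start to explode.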
Step 3: Motion Detection Specific Fixes
```python
import torch

# Fix motion detection parameters
def tune_motion_detector(detector, sample_frames):
    # Auto-tune threshold: mean plus two standard deviations of frame differences
    diffs = torch.diff(sample_frames, dim=1).abs()
    threshold = diffs.mean() + 2 * diffs.std()
    detector.threshold = threshold.item()

    # Fix kernel size for noise reduction (about 1% of frame width, at least 3)
    if hasattr(detector, 'kernel_size'):
        detector.kernel_size = max(3, sample_frames.shape[-1] // 100)
```
Recommendation:
Include specific diagnostic outputs (error messages, loss curves, performance metrics) that trigger each fix
Examples (18 / 20)
Example 1:
Input: "Model loss stuck at 0.693, not learning"
Output:
```python
import torch.nn as nn

# Issue: Binary classification with wrong loss function
# Fix: Change sigmoid + BCELoss to raw logits + BCEWithLogitsLoss
criterion = nn.BCEWithLogitsLoss()
# Remove sigmoid from model output
```
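The figure 0.693 is ln(2), the binary cross-entropy of a model that predicts probability 0.5 for every sample, which is why a loss stuck at that value suggests the model is not learning at all. The sketch below checks this numerically and compares the two loss formulations. The tensors are made-up illustrations.

```python
import math

import torch
import torch.nn as nn

# A model that outputs logit 0 (probability 0.5) for every sample, whatever the labels
logits = torch.zeros(8)
labels = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0])

loss_at_chance = nn.BCEWithLogitsLoss()(logits, labels).item()
print(f"{loss_at_chance:.4f} vs ln(2) = {math.log(2):.4f}")  # 0.6931 vs ln(2) = 0.6931

# On moderate logits the fused loss and sigmoid + BCELoss agree
z = torch.tensor([-2.0, 0.5, 3.0])
y = torch.tensor([0.0, 1.0, 1.0])
fused = nn.BCEWithLogitsLoss()(z, y)
split = nn.BCELoss()(torch.sigmoid(z), y)

# The fused loss stays accurate even where sigmoid saturates in float32
extreme = nn.BCEWithLogitsLoss()(torch.tensor([-200.0]), torch.tensor([1.0])).item()
print(f"fused: {fused.item():.6f}, split: {split.item():.6f}, extreme: {extreme:.1f}")
```

Since both formulations agree on well-behaved inputs, swapping the loss mainly helps numerical stability. If the plateau at 0.693 persists afterwards, the cause probably lies elsewhere, such as the labels, the learning rate, or the gradient flow.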
Example 2:
Input: "Motion detection too sensitive, flags everything as motion"
Output:
```python
# Issue: Threshold too low
# Fix: Adaptive threshold based on scene complexity
scene_noise = frames[:, :-1].std().item()  # convert the 0-dim tensor to a float
detector.threshold = max(0.01, scene_noise * 3)
```
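Example 3:
Input: "Training crashes with CUDA out of memory"
Output:
```python
# Issue: Batch size too large
# Fix: Load smaller batches and accumulate gradients, so the effective
# batch size (loader batch size * accumulation_steps) stays the same
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    # Assumes the model returns the loss directly; scale it so the sum averages correctly
    loss = model(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```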
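Gradient accumulation only saves memory when each step processes a smaller micro-batch. The sketch below, built on a toy linear model and random data that are assumptions for illustration, checks that four micro-batches of 8 produce the same gradient as one batch of 32 when each loss is divided by the number of steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)
criterion = nn.MSELoss()  # mean reduction


def make_model():
    torch.manual_seed(1)  # identical initial weights for both runs
    return nn.Linear(10, 1)


# Reference: one backward pass over the full batch of 32
full_model = make_model()
criterion(full_model(inputs), targets).backward()

# Accumulated: four micro-batches of 8, each loss divided by the number of steps
accumulation_steps = 4
accum_model = make_model()
micro_batches = zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for micro_inputs, micro_targets in micro_batches:
    loss = criterion(accum_model(micro_inputs), micro_targets) / accumulation_steps
    loss.backward()  # gradients add up in .grad until zero_grad() is called

max_diff = (full_model.weight.grad - accum_model.weight.grad).abs().max().item()
print(f"largest gradient difference: {max_diff:.2e}")
```

The equivalence holds for equally sized micro-batches and a mean-reduced loss. Layers that depend on batch statistics, such as batch normalization, would still see the smaller micro-batch.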
Recommendation:
Provide a troubleshooting decision tree or flowchart to guide users from symptoms to solutions more systematically
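As a starting point for such a decision aid, the symptoms and fixes already listed in this skill can be arranged as an ordered lookup. This is a rough sketch: `TRIAGE_RULES` and `triage` are hypothetical names, the keyword matching is deliberately naive, and the rules cover only cases the skill itself mentions.

```python
# Symptom keywords mapped to the first fix to try, taken from this skill's own examples
TRIAGE_RULES = [
    (("out of memory",), "Use a smaller per-step batch with gradient accumulation"),
    (("0.693", "not learning"), "Check the loss function: pass raw logits to BCEWithLogitsLoss"),
    (("too sensitive", "flags everything"), "Raise the motion threshold based on measured scene noise"),
    (("no motion",), "Lower the motion threshold and confirm frames are scaled to [0, 1]"),
    (("exploding gradient",), "Inspect per-parameter gradients and clip them"),
]

DEFAULT_ADVICE = "Run the pipeline diagnostic first to collect shapes, dtypes and value ranges"


def triage(symptom):
    """Return the first suggested fix whose keywords appear in the symptom text."""
    text = symptom.lower()
    for keywords, suggestion in TRIAGE_RULES:
        if any(keyword in text for keyword in keywords):
            return suggestion
    return DEFAULT_ADVICE


print(triage("Training crashes with CUDA out of memory"))
print(triage("Model loss stuck at 0.693, not learning"))
print(triage("Accuracy dropped after the last data refresh"))
```

Rules are checked in order, so more specific symptoms should be placed before broader ones. Anything unmatched falls back to running the diagnostic, in line with the skill's "diagnose first" advice.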
Best Practices
- Always run diagnostics first - Don't guess the problem
- Make minimal changes - Fix the root cause, not symptoms
- Preserve existing interfaces - Don't break downstream code
- Test incrementally - Validate each fix before moving to the next
- Monitor resource usage - Memory, GPU utilization, processing time
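For the resource-monitoring practice, a small timing and memory wrapper is often enough to compare a pipeline before and after a fix. The sketch below is one possible shape for it: `measure_step` is a hypothetical helper, it falls back to timing only when no CUDA device is available, and the frame-differencing workload is a placeholder.

```python
import time

import torch


def measure_step(fn, device):
    """Run fn() once; return (result, seconds, peak GPU bytes or None when not on CUDA)."""
    use_cuda = device.type == "cuda" and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)  # keep earlier kernels out of the timing
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        torch.cuda.synchronize(device)  # wait for asynchronous GPU work to finish
    seconds = time.perf_counter() - start
    peak_bytes = torch.cuda.max_memory_allocated(device) if use_cuda else None
    return result, seconds, peak_bytes


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
frames = torch.rand(2, 8, 64, 64, device=device)
result, seconds, peak_bytes = measure_step(
    lambda: torch.diff(frames, dim=1).abs().mean(), device
)
print(f"mean diff: {result.item():.4f}, time: {seconds * 1000:.2f} ms, peak bytes: {peak_bytes}")
```

The explicit synchronization matters because CUDA kernels run asynchronously; without it the timer may stop before the GPU work has finished.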
Common Pitfalls
- Don't change data loaders unnecessarily - Fix transforms instead
- Don't rebuild models from scratch - Modify weights or layers
- Don't ignore preprocessing - 90% of motion detection issues are here
- Don't over-tune hyperparameters - Fix architectural issues first
- Don't assume GPU issues are hardware - Usually memory management
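```python
import torch

# Quick fix template for motion detection
def emergency_motion_fix(detector, frames):
    # Normalize frames (assumes 8-bit input in [0, 255])
    frames = frames.float() / 255.0
    # Reduce noise with 2x2 average pooling over the spatial (H, W) dimensions
    frames = torch.nn.functional.avg_pool2d(frames, 2, 2)
    # Adaptive threshold: 95th percentile of frame differences, stored as a float
    detector.threshold = frames.diff(dim=1).abs().quantile(0.95).item()
    return detector
```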
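One property of the percentile rule is worth checking before relying on it: by construction it flags about 5% of difference pixels, even when the scene contains no motion. The sketch below illustrates this on a synthetic static scene with sensor noise only (an assumption for the demonstration) and compares it with the mean-plus-two-standard-deviations rule from Step 3.

```python
import torch

torch.manual_seed(0)
# Static scene plus sensor noise only: nothing actually moves
background = torch.rand(1, 1, 24, 24)
frames = (background + 0.01 * torch.randn(1, 6, 24, 24)).clamp(0, 1)

diffs = frames.diff(dim=1).abs()

# Rule from the quick-fix template: 95th percentile of the differences
quantile_threshold = diffs.quantile(0.95).item()
flagged_by_quantile = (diffs > quantile_threshold).float().mean().item()

# Rule from Step 3: mean plus two standard deviations
stat_threshold = (diffs.mean() + 2 * diffs.std()).item()
flagged_by_stat = (diffs > stat_threshold).float().mean().item()

print(f"quantile rule flags {flagged_by_quantile:.1%} of pixels")
print(f"mean + 2*std rule flags {flagged_by_stat:.1%} of pixels")
```

Both rules adapt to the noise level, so both report some "motion" in a perfectly still scene. Combining either rule with an absolute floor, as Example 2 does with `max(0.01, ...)`, is one way to guard against this.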