AI Skill Report Card

Debugging ML Pipelines

B+ 78 · Apr 20, 2026 · Source: Web
YAML
---
name: debugging-ml-pipelines
description: Debugs machine learning pipelines, especially motion detection systems, by identifying root causes and providing targeted fixes without requiring architectural changes. Use when ML models underperform, training fails, or motion detection gives poor results.
---

# ML Pipeline Debugging
15 / 15
Python
# First, run this diagnostic to identify the issue
import torch


def diagnose_pipeline(model, data_loader, device):
    print("=== PIPELINE DIAGNOSIS ===")

    # Check data flow (assumes each batch is an (inputs, labels) pair)
    batch = next(iter(data_loader))
    print(f"Batch shape: {batch[0].shape}")
    print(f"Label shape: {batch[1].shape}")
    print(f"Data type: {batch[0].dtype}")
    print(f"Value range: [{batch[0].min():.3f}, {batch[0].max():.3f}]")

    # Check model forward pass
    # Note: this leaves the model in eval mode; call model.train() before resuming training
    model.eval()
    with torch.no_grad():
        output = model(batch[0].to(device))
    print(f"Output shape: {output.shape}")
    print(f"Output range: [{output.min():.3f}, {output.max():.3f}]")

    return batch, output
Recommendation
Add more concrete input/output pairs showing before/after states of broken vs fixed pipelines
14 / 15

Progress:

  • Run diagnostics to identify symptoms
  • Isolate the failing component
  • Apply targeted fix
  • Validate fix without breaking existing code

Step 1: Identify the Problem Type

Python
# Motion detection specific checks
def check_motion_detection_issues(frames, detector):
    # Check frame preprocessing
    if frames.max() > 1.0:
        print("❌ Frames not normalized - fix: frames = frames / 255.0")

    # Check temporal consistency
    if frames.ndim != 4:
        print("❌ Wrong frame dimensions - expected (batch, time, H, W)")

    # Check detection sensitivity
    motion_pixels = detector.detect_motion(frames)
    if motion_pixels.sum() == 0:
        print("❌ No motion detected - threshold too high")
    elif motion_pixels.sum() > 0.8 * motion_pixels.numel():
        print("❌ Too much motion detected - threshold too low")
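The sensitivity check above compares the number of flagged pixels against 0 and against 80% of the mask. The following is a minimal sketch of that measurement on synthetic data. `FrameDiffDetector` and `motion_fraction` are hypothetical names invented for illustration; the source only assumes a detector object with a `detect_motion(frames)` method and a `threshold` attribute.

```python
import torch


class FrameDiffDetector:
    """Hypothetical detector: flags pixels whose frame-to-frame change exceeds a threshold."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold

    def detect_motion(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, H, W); result: boolean mask of shape (batch, time - 1, H, W)
        return torch.diff(frames, dim=1).abs() > self.threshold


def motion_fraction(frames: torch.Tensor, detector: FrameDiffDetector) -> float:
    """Fraction of mask pixels flagged as motion."""
    mask = detector.detect_motion(frames)
    return mask.sum().item() / mask.numel()


# Synthetic clip: black background with one bright 4x4 square moving right by 4 px per frame.
frames = torch.zeros(1, 3, 16, 16)
for t in range(3):
    frames[0, t, 4:8, 4 * t : 4 * t + 4] = 1.0

# No pixel difference can exceed 2.0, so this reproduces the "threshold too high" symptom.
too_high = motion_fraction(frames, FrameDiffDetector(threshold=2.0))
reasonable = motion_fraction(frames, FrameDiffDetector(threshold=0.5))
print(f"threshold=2.0 -> {too_high:.3f}, threshold=0.5 -> {reasonable:.3f}")
```

Logging the fraction itself, rather than only the pass/fail message, makes it easier to see how far a detector is from either extreme.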

Step 2: Apply Minimal Fixes

Data Issues:

Python
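# Fix normalization in-place
from torchvision import transforms

# Only wrap a transform that actually exists (the attribute may be present but None)
if getattr(data_loader.dataset, 'transform', None) is not None:
    # Add normalization to existing transform
    # Normalize expects a tensor, so old_transform must already convert images to tensors
    old_transform = data_loader.dataset.transform
    data_loader.dataset.transform = transforms.Compose([
        old_transform,
        # Single-channel values; RGB input needs three entries per list
        transforms.Normalize(mean=[0.485], std=[0.229]),
    ])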
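`transforms.Normalize` computes `(x - mean) / std` per channel. The sketch below reproduces that arithmetic in plain Python with the `mean=0.485`, `std=0.229` values used above, to show what the output range looks like for inputs already scaled to [0, 1] and what happens if raw 0-255 pixel values slip through. The `normalize` helper is illustrative only and not part of torchvision.

```python
def normalize(value: float, mean: float = 0.485, std: float = 0.229) -> float:
    """Per-value version of the (x - mean) / std formula."""
    return (value - mean) / std


lo = normalize(0.0)     # darkest pixel of a [0, 1]-scaled image
hi = normalize(1.0)     # brightest pixel of a [0, 1]-scaled image
raw = normalize(255.0)  # brightest pixel if the image was never scaled to [0, 1]

print(f"scaled input range: [{lo:.4f}, {hi:.4f}]")
print(f"unscaled 255 maps to: {raw:.1f}")
```

A "Value range" of roughly [-2.1, 2.2] in the diagnostic output is therefore expected after this fix, while values in the hundreds or thousands suggest the scaling step is missing.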

Model Issues:

Python
# Fix gradient flow without changing architecture
# Call after loss.backward() and before optimizer.step()
def fix_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"❌ No gradient: {name}")
        elif param.grad.abs().max() < 1e-7:
            print(f"❌ Vanishing gradient: {name}")
        elif param.grad.abs().max() > 100:
            print(f"❌ Exploding gradient: {name} - clipping")
            param.grad.clamp_(-10, 10)
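The `clamp_` call above clips each gradient element independently, which can change the direction of the update. A common alternative, shown here as a sketch rather than as the skill's own method, is norm-based clipping with `torch.nn.utils.clip_grad_norm_`, which rescales all gradients by one factor. The toy model, the input scale, and `max_norm = 1.0` are assumptions chosen only to provoke large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)

# Deliberately huge inputs so the backward pass produces very large gradients.
inputs = torch.randn(8, 4) * 1e4
targets = torch.zeros(8, 1)
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

max_norm = 1.0
# clip_grad_norm_ rescales gradients in place and returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"norm before clipping: {norm_before.item():.3e}")
print(f"norm after clipping:  {norm_after.item():.3f}")
```

The returned pre-clipping norm is also a convenient single number to log per step when watching for exploding gradients.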

Step 3: Motion Detection Specific Fixes

Python
# Fix motion detection parameters
import torch


def tune_motion_detector(detector, sample_frames):
    # Auto-tune threshold: mean + 2 std of frame-to-frame differences along the time axis
    diffs = torch.diff(sample_frames, dim=1).abs()
    threshold = diffs.mean() + 2 * diffs.std()
    detector.threshold = threshold.item()

    # Fix kernel size for noise reduction (about 1% of frame width, at least 3)
    if hasattr(detector, 'kernel_size'):
        detector.kernel_size = max(3, sample_frames.shape[-1] // 100)
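To see what the mean + 2·std rule does in practice, the sketch below applies it to a synthetic clip: low-level noise plus one moving bright square. The clip dimensions, noise level, and square size are made-up values for illustration, not figures from the source.

```python
import torch

torch.manual_seed(0)

# Synthetic clip in (batch, time, H, W) layout: sensor-like noise plus one moving bright square.
frames = 0.01 * torch.rand(1, 5, 32, 32)
for t in range(5):
    frames[0, t, 10:14, 4 * t : 4 * t + 4] += 0.8

# Same rule as tune_motion_detector: mean + 2 * std of absolute frame differences.
diffs = torch.diff(frames, dim=1).abs()
threshold = (diffs.mean() + 2 * diffs.std()).item()

# Fraction of difference pixels that exceed the tuned threshold.
flagged = (diffs > threshold).float().mean().item()

print(f"threshold: {threshold:.3f}")
print(f"flagged fraction: {flagged:.5f}")
```

On this clip the threshold lands well above the noise floor (differences of at most 0.01) and well below the square's step of about 0.8, so only the pixels the square leaves or enters are flagged. Real footage with lighting changes or camera shake may need a different multiplier.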
Recommendation
Include specific diagnostic outputs (error messages, loss curves, performance metrics) that trigger each fix
18 / 20

Example 1:
Input: "Model loss stuck at 0.693, not learning"
Output:

Python
# Issue: Binary classification with wrong loss function
# Fix: Change sigmoid + BCELoss to raw logits + BCEWithLogitsLoss
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
# Remove sigmoid from model output
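The value 0.693 is recognizable because it is ln 2, the binary cross-entropy of a model that outputs probability 0.5 for every sample. The short sketch below checks that arithmetic with the standard library only; `binary_cross_entropy` is a hand-written helper for illustration, not a PyTorch function.

```python
import math


def binary_cross_entropy(p: float, y: int) -> float:
    """Cross-entropy for one sample with predicted probability p and label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


uninformed = binary_cross_entropy(0.5, 1)  # model that always predicts 0.5
confident = binary_cross_entropy(0.9, 1)   # model that is mostly right

print(f"loss at p=0.5: {uninformed:.4f}")
print(f"loss at p=0.9: {confident:.4f}")
```

A loss pinned at this value suggests the outputs carry no information about the labels. The loss-function mismatch above is one possible cause; others, such as a learning rate of zero or shuffled labels, would produce the same number.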

Example 2:
Input: "Motion detection too sensitive, flags everything as motion"
Output:

Python
# Issue: Threshold too low
# Fix: Adaptive threshold based on scene complexity
scene_noise = frames[:, :-1].std().item()  # .item() keeps the threshold a plain float
detector.threshold = max(0.01, scene_noise * 3)

Example 3:
Input: "Training crashes with CUDA out of memory"
Output:

Python
# Issue: Batch size too large
# Fix: Use a smaller per-step batch and accumulate gradients to keep the effective batch size
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps  # assumes model(batch) returns the loss
    loss.backward()  # gradients add up across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
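Dividing each loss by `accumulation_steps` is what makes the accumulated gradient match a single large-batch step when the loss uses mean reduction and the micro-batches are equal in size. The sketch below checks this on a toy linear model; the model, data shapes, and `MSELoss` are assumptions made for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
inputs, targets = torch.randn(16, 4), torch.randn(16, 1)
loss_fn = nn.MSELoss()  # mean reduction
model = nn.Linear(4, 1)

# Reference: one backward pass over the full batch of 16 samples.
loss_fn(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()
model.zero_grad()

# Accumulated: four micro-batches of 4 samples, each loss divided by the number of steps.
accumulation_steps = 4
micro_batches = zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for micro_inputs, micro_targets in micro_batches:
    loss = loss_fn(model(micro_inputs), micro_targets) / accumulation_steps
    loss.backward()  # gradients add up in .grad across calls
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-5))
```

Only one micro-batch of activations is held in memory at a time, which is where the memory saving comes from. Layers that depend on batch statistics, such as batch normalization, will still see the smaller micro-batch.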
Recommendation
Provide a troubleshooting decision tree or flowchart to guide users from symptoms to solutions more systematically
  • Always run diagnostics first - Don't guess the problem
  • Make minimal changes - Fix the root cause, not symptoms
  • Preserve existing interfaces - Don't break downstream code
  • Test incrementally - Validate each fix before moving to the next
  • Monitor resource usage - Memory, GPU utilization, processing time
  • Don't change data loaders unnecessarily - Fix transforms instead
  • Don't rebuild models from scratch - Modify weights or layers
  • Don't ignore preprocessing - 90% of motion detection issues are here
  • Don't over-tune hyperparameters - Fix architectural issues first
  • Don't assume GPU issues are hardware - Usually memory management
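For the "monitor resource usage" and "test incrementally" points, a small timing and memory wrapper around each pipeline stage can show whether a fix changed cost. The sketch below uses only the standard library; `track_resources` is a hypothetical helper name, and `tracemalloc` sees Python heap allocations only, so GPU memory would need a separate counter such as `torch.cuda.max_memory_allocated()`.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track_resources(label: str, stats: dict):
    """Record wall-clock time and peak Python heap usage for the enclosed block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats[label] = {"seconds": elapsed, "peak_bytes": peak}


stats = {}
with track_resources("preprocess", stats):
    # Stand-in workload; replace with one pipeline stage at a time.
    frames = [[float(i % 256) / 255.0 for i in range(10_000)] for _ in range(10)]

print(stats)
```

Recording the same labels before and after a fix gives a simple comparison without touching the pipeline's interfaces.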
Python
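# Quick fix template for motion detection
import torch


def emergency_motion_fix(detector, frames):
    # Normalize frames (assumes 8-bit input in [0, 255])
    frames = frames.float() / 255.0
    # Reduce noise with 2x2 average pooling over H and W
    frames = torch.nn.functional.avg_pool2d(frames, 2, 2)
    # Adaptive threshold: 95th percentile of frame-to-frame differences
    # The value is in normalized, pooled units, so feed the detector frames preprocessed the same way
    detector.threshold = frames.diff(dim=1).abs().quantile(0.95).item()
    return detector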
Grade B+ · AI Skill Framework
Scorecard
Criteria Breakdown
Quick Start: 15/15
Workflow: 14/15
Examples: 18/20
Completeness: 18/20
Format: 15/15
Conciseness: 13/15