AI Skill Report Card

Generated Skill

B- · 70 · Feb 3, 2026
```yaml
---
name: monitoring-kubernetes-clusters
description: Sets up comprehensive Kubernetes monitoring with Prometheus, Grafana, and alerting. Use when deploying monitoring infrastructure or troubleshooting cluster visibility issues.
---
```

Kubernetes Monitoring

```bash
# Deploy Prometheus stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=admin123
```

Access Grafana: `kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80`

Recommendation: Consider adding more specific examples.

Progress:

  • Install Prometheus Operator and stack
  • Configure ServiceMonitors for custom apps
  • Set up essential dashboards
  • Configure alerting rules
  • Test alert delivery
  • Set up log aggregation (optional)

1. Deploy Core Stack

```bash
# Create monitoring namespace with resource quotas
kubectl create namespace monitoring
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: monitoring-quota
  namespace: monitoring
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
EOF
```
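The quota above only helps if the stack's own pods declare requests and limits. A possible `values.yaml` fragment for the kube-prometheus-stack chart (field paths follow that chart; the sizes here are illustrative assumptions, not recommendations):

```yaml
# Illustrative resource settings, passed with: helm install ... -f values.yaml
# Sizes are assumptions; tune to your ingestion volume.
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
grafana:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
```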

2. Configure Custom Application Monitoring

```yaml
# ServiceMonitor for custom app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
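Note that the selector matches Service labels (not pod labels), and without a `namespaceSelector` the ServiceMonitor only looks in its own namespace. A matching Service might look like this (a sketch; the name and port number are assumptions):

```yaml
# Hypothetical Service the ServiceMonitor above would select. The port *name*
# must equal the endpoint's `port` field; without spec.namespaceSelector, only
# Services in the `monitoring` namespace are considered.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: monitoring
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
```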

3. Essential Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.critical
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 0
          for: 2m
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
        - alert: NodeDiskSpaceHigh
          expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
          for: 5m
          annotations:
            summary: "Node {{ $labels.instance }} disk space < 10%"
```
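Rules like these can be unit-tested offline with `promtool test rules`. A sketch, assuming the `groups:` section above has been extracted into a plain Prometheus rule file named `critical-alerts-rules.yaml` (promtool reads rule files, not Kubernetes custom resources; file names here are hypothetical):

```yaml
# rules-test.yaml — run with: promtool test rules rules-test.yaml
rule_files:
  - critical-alerts-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # one container restart per minute -> restart rate > 0
      - series: 'kube_pod_container_status_restarts_total{pod="api-0"}'
        values: '0+1x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: PodCrashLooping
        exp_alerts:
          - exp_labels:
              pod: api-0
            exp_annotations:
              summary: "Pod api-0 is crash looping"
```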

4. Configure Slack Notifications

```yaml
# In values.yaml for AlertManager
alertmanager:
  config:
    global:
      slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
    route:
      group_by: ['alertname', 'cluster', 'service']
      receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            title: 'K8s Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```
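To keep noise down, the route tree can split on severity so only critical alerts page a dedicated channel. A sketch (the nested receiver and channel names are hypothetical, and it assumes your rules attach a `severity` label):

```yaml
# Hypothetical nested route: critical alerts go to a louder channel,
# everything else falls through to the default receiver.
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'cluster', 'service']
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'slack-critical'
receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
```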

Recommendation: Include edge cases.

Example 1: Monitor Custom Microservice

Input: Spring Boot app exposing metrics on /actuator/prometheus

```yaml
# Service exposing the app port. Spring Boot serves /actuator/prometheus on the
# application port (8080) by default, so one named port is enough; two ports
# with the same number and protocol would fail Service validation.
apiVersion: v1
kind: Service
metadata:
  name: order-service
  labels:
    app: order-service
spec:
  selector:
    app: order-service
  ports:
    - name: http
      port: 8080
```
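A matching ServiceMonitor would then point at the named port with the actuator path (a sketch reusing the names from the example above; the monitor's own name is hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: order-service
  endpoints:
    - port: http
      interval: 30s
      path: /actuator/prometheus
```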

Output: With a matching ServiceMonitor in place, the target is scraped every 30s and appears under Prometheus targets

Example 2: High Memory Alert

Input: Want alert when pod memory > 80%

```yaml
- alert: PodMemoryHigh
  expr: (container_memory_working_set_bytes / container_spec_memory_limit_bytes) * 100 > 80
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} memory usage > 80%"
```
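One edge case: `container_spec_memory_limit_bytes` is 0 for containers with no memory limit, making the ratio meaningless there. A guarded variant (a sketch; in PromQL, the `> 0` comparison filters out the limit-less series before the division):

```yaml
- alert: PodMemoryHigh
  expr: |
    (container_memory_working_set_bytes
      / (container_spec_memory_limit_bytes > 0)) * 100 > 80
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} memory usage > 80% of its limit"
```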

Output: Alert fires after 3 minutes above threshold

Best Practices:

  • Resource limits: Always set CPU/memory limits for Prometheus and Grafana
  • Retention policy: 30 days for metrics, longer for critical SLI data
  • Label consistency: Use consistent labeling across services (app, version, environment)
  • Alert fatigue: Start with critical alerts only, add warnings gradually
  • Dashboard organization: Group by team/service, not by metric type
  • Backup: Export Grafana dashboards and AlertManager config to Git

Essential Dashboards to Import:

  • Kubernetes Cluster Overview (ID: 7249)
  • Node Exporter Full (ID: 1860)
  • Kubernetes Pod Overview (ID: 6417)

Common Pitfalls:

  • Over-alerting: Don't alert on every metric deviation
  • Missing resource quotas: Prometheus can consume unbounded resources
  • Scrape interval mismatch: Keep intervals consistent (30s recommended)
  • No service discovery: Don't hardcode target IPs, use ServiceMonitors
  • Ignoring cardinality: High-cardinality labels (user IDs, timestamps) kill performance
  • Missing backup: Losing dashboard configs during cluster rebuilds
  • Default passwords: Change Grafana admin password immediately
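For the cardinality pitfall, offending labels can be dropped at scrape time with `metricRelabelings` on a ServiceMonitor endpoint. A fragment (the `user_id` label is a hypothetical offender):

```yaml
# Endpoint fragment for a ServiceMonitor: drop a high-cardinality label
# before the samples are ingested.
endpoints:
  - port: metrics
    interval: 30s
    metricRelabelings:
      - action: labeldrop
        regex: user_id
```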

Performance Notes:

  • 1M samples/sec = ~8GB RAM for Prometheus
  • Use recording rules for frequently queried complex expressions
  • Federate multiple Prometheus instances for large clusters (>100 nodes)
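The first rule of thumb above can be turned into a quick sizing helper. A sketch that simply scales the stated 8 GB per 1M samples/sec figure linearly; actual usage also depends on series churn and label cardinality:

```shell
# Rough Prometheus RAM estimate from ingestion rate, using the
# ~8 GB per 1,000,000 samples/sec heuristic noted above.
prom_mem_gb() {
  awk -v s="$1" 'BEGIN { printf "%.1f", s / 1000000 * 8 }'
}

prom_mem_gb 250000   # a quarter of the reference load
```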
AI Skill Framework Scorecard · Grade: B-

Criteria Breakdown:

  • Quick Start: 11/15
  • Workflow: 11/15
  • Examples: 15/20
  • Completeness: 15/20
  • Format: 11/15
  • Conciseness: 11/15