# AI Skill Report Card

## Generated Skill
```yaml
---
name: monitoring-kubernetes-clusters
description: Sets up comprehensive Kubernetes monitoring with Prometheus, Grafana, and alerting. Use when deploying monitoring infrastructure or troubleshooting cluster visibility issues.
---
```
# Kubernetes Monitoring

## Quick Start
```bash
# Deploy Prometheus stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=admin123
```
Access Grafana: `kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80`
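The `--set` flags above can also be captured in a values file for repeatable installs; a minimal sketch (the password is a placeholder and should be changed, per Common Pitfalls):

```yaml
# values.yaml for kube-prometheus-stack (sketch)
prometheus:
  prometheusSpec:
    retention: 30d
grafana:
  adminPassword: admin123  # placeholder; replace before installing
```

Install with `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f values.yaml`.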
## Workflow
Progress:
- Install Prometheus Operator and stack
- Configure ServiceMonitors for custom apps
- Set up essential dashboards
- Configure alerting rules
- Test alert delivery
- Set up log aggregation (optional)
### 1. Deploy Core Stack
```bash
# Create monitoring namespace with resource quotas
kubectl create namespace monitoring
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: monitoring-quota
  namespace: monitoring
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
EOF
```
### 2. Configure Custom Application Monitoring
```yaml
# ServiceMonitor for custom app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
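Note that a ServiceMonitor's `selector` matches Service labels (not Pod labels), and by default it only discovers Services in its own namespace. If `my-app` runs elsewhere, add a `namespaceSelector`; a sketch, with the namespace name assumed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - my-app-namespace   # hypothetical namespace where the app's Service lives
  selector:
    matchLabels:
      app: my-app          # must match the Service's labels, not the Pod's
  endpoints:
    - port: metrics        # must match a named port on the Service
      interval: 30s
      path: /metrics
```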
### 3. Essential Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.critical
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 0
          for: 2m
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
        - alert: NodeDiskSpaceHigh
          expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
          for: 5m
          annotations:
            summary: "Node {{ $labels.instance }} disk space < 10%"
```
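The same group can also cover unready nodes; a sketch of one more rule, using the `kube_node_status_condition` metric exposed by kube-state-metrics (which ships with the stack):

```yaml
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
```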
### 4. Configure Slack Notifications
```yaml
# In values.yaml for AlertManager
alertmanager:
  config:
    global:
      slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
    route:
      group_by: ['alertname', 'cluster', 'service']
      receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            title: 'K8s Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```
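To keep `#alerts` from drowning in warnings, the route can fan out by severity; a sketch of the `route` section (the `slack-critical` receiver name is an assumption and would need its own entry under `receivers`):

```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  receiver: 'slack-notifications'
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'slack-critical'   # hypothetical paging-level receiver
      repeat_interval: 1h
```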
## Examples

### Example 1: Monitor Custom Microservice
Input: Spring Boot app exposing metrics on /actuator/prometheus
```yaml
# Service with metrics port (Spring Boot serves metrics on its main HTTP port,
# so one named port suffices; duplicate port numbers are rejected by the API server)
apiVersion: v1
kind: Service
metadata:
  name: order-service
  labels:
    app: order-service
spec:
  ports:
    - name: metrics
      port: 8080
  selector:
    app: order-service
```
Output: Automatic scraping every 30s, metrics appear in Prometheus targets
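Because Spring Boot serves metrics at `/actuator/prometheus` rather than `/metrics`, the ServiceMonitor endpoint from step 2 needs its `path` adjusted; a sketch of the endpoint fragment:

```yaml
endpoints:
  - port: metrics
    interval: 30s
    path: /actuator/prometheus
```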
### Example 2: High Memory Alert

Input: Want an alert when pod memory usage exceeds 80% of its limit
```yaml
- alert: PodMemoryHigh
  expr: (container_memory_working_set_bytes / container_spec_memory_limit_bytes) * 100 > 80
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} memory usage > 80%"
```
Output: Alert fires after 3 minutes above threshold
## Best Practices
- Resource limits: Always set CPU/memory limits for Prometheus and Grafana
- Retention policy: 30 days for metrics, longer for critical SLI data
- Label consistency: Use consistent labeling across services (app, version, environment)
- Alert fatigue: Start with critical alerts only, add warnings gradually
- Dashboard organization: Group by team/service, not by metric type
- Backup: Export Grafana dashboards and AlertManager config to Git
Essential Dashboards to Import:
- Kubernetes Cluster Overview (ID: 7249)
- Node Exporter Full (ID: 1860)
- Kubernetes Pod Overview (ID: 6417)
## Common Pitfalls
- Over-alerting: Don't alert on every metric deviation
- Missing resource quotas: Prometheus can consume unbounded resources
- Scrape interval mismatch: Keep intervals consistent (30s recommended)
- No service discovery: Don't hardcode target IPs, use ServiceMonitors
- Ignoring cardinality: High-cardinality labels (user IDs, timestamps) kill performance
- Missing backup: Losing dashboard configs during cluster rebuilds
- Default passwords: Change Grafana admin password immediately
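To spot cardinality offenders before they hurt, a query like the following (run in the Prometheus expression browser; it scans every series, so expect it to be slow) lists the metrics with the most time series:

```
topk(10, count by (__name__)({__name__=~".+"}))
```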
Performance Notes:
- Rough rule of thumb: ~8GB RAM for Prometheus at 1M samples/sec ingested (actual usage varies with series churn and label cardinality)
- Use recording rules for frequently queried complex expressions
- Federate multiple Prometheus instances for large clusters (>100 nodes)
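A recording rule precomputes a hot aggregation so dashboards read one series instead of re-running the expression; a sketch following the Prometheus `level:metric:operations` naming convention:

```yaml
groups:
  - name: recording.rules
    rules:
      - record: namespace:container_cpu_usage_seconds:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
```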