diff --git a/.claude-plugin/README.md b/.claude-plugin/README.md new file mode 100644 index 000000000..78dcc7c64 --- /dev/null +++ b/.claude-plugin/README.md @@ -0,0 +1,33 @@ +# Kubeshark Claude Code Plugin + +This directory contains the [Claude Code plugin](https://docs.anthropic.com/en/docs/claude-code/plugins) configuration for Kubeshark. + +## What's here + +| File | Purpose | +|------|---------| +| `plugin.json` | Plugin manifest — name, version, description, metadata | +| `marketplace.json` | Marketplace index — allows discovery via `/plugin marketplace add` | + +## Installing the plugin + +``` +/plugin marketplace add kubeshark/kubeshark +/plugin install kubeshark +``` + +This loads the Kubeshark AI skills and MCP configuration. Skills appear as +`/kubeshark:network-rca` and `/kubeshark:kfl`. + +## What the plugin includes + +- **Skills** from [`skills/`](../skills/) — network root cause analysis and KFL filter expertise +- **MCP configuration** from [`.mcp.json`](../.mcp.json) — connects to the Kubeshark MCP server + +## Local development + +Test the plugin without installing: + +```bash +claude --plugin-dir /path/to/kubeshark +``` diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json new file mode 100644 index 000000000..e103cc6c4 --- /dev/null +++ b/.claude-plugin/marketplace.json @@ -0,0 +1,15 @@ +{ + "name": "kubeshark", + "description": "Kubeshark network observability skills for Kubernetes", + "plugins": [ + { + "name": "kubeshark", + "description": "Network observability skills powered by Kubeshark MCP — root cause analysis, KFL traffic filtering, snapshot forensics, PCAP extraction.", + "source": { + "source": "github", + "owner": "kubeshark", + "repo": "kubeshark" + } + } + ] +} diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json new file mode 100644 index 000000000..b4c427acb --- /dev/null +++ b/.claude-plugin/plugin.json @@ -0,0 +1,24 @@ +{ + "name": "kubeshark", + "version": "1.0.0", + "description": 
"Kubernetes network observability skills powered by Kubeshark MCP. Root cause analysis, traffic filtering, snapshot forensics, PCAP extraction, and more.", + "author": { + "name": "Kubeshark", + "url": "https://kubeshark.com" + }, + "homepage": "https://kubeshark.com", + "repository": "https://github.com/kubeshark/kubeshark", + "license": "Apache-2.0", + "keywords": [ + "kubeshark", + "kubernetes", + "network", + "observability", + "traffic", + "mcp", + "rca", + "pcap", + "kfl", + "ebpf" + ] +} diff --git a/.mcp.json b/.mcp.json new file mode 100644 index 000000000..e098d28a5 --- /dev/null +++ b/.mcp.json @@ -0,0 +1,8 @@ +{ + "mcpServers": { + "kubeshark": { + "command": "kubeshark", + "args": ["mcp"] + } + } +} diff --git a/mcp/README.md b/mcp/README.md index 0d69f102e..6eea18aba 100644 --- a/mcp/README.md +++ b/mcp/README.md @@ -2,6 +2,18 @@ [Kubeshark](https://kubeshark.com) MCP (Model Context Protocol) server enables AI assistants like Claude Desktop, Cursor, and other MCP-compatible clients to query real-time Kubernetes network traffic. +## AI Skills + +The MCP provides the tools — [AI skills](../skills/) teach agents how to use them. +Skills turn raw MCP capabilities into domain-specific workflows like root cause +analysis, traffic filtering, and forensic investigation. See the +[skills README](../skills/README.md) for installation and usage. 
+ +| Skill | Description | +|-------|-------------| +| [`network-rca`](../skills/network-rca/) | Network Root Cause Analysis — snapshot-based retrospective investigation with PCAP and dissection routes | +| [`kfl`](../skills/kfl/) | KFL2 filter expert — write, debug, and optimize traffic queries across all supported protocols | + ## Features - **L7 API Traffic Analysis**: Query HTTP, gRPC, Redis, Kafka, DNS transactions @@ -34,20 +46,20 @@ Add to your Claude Desktop configuration: **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json` **Windows**: `%APPDATA%\Claude\claude_desktop_config.json` -#### URL Mode (Recommended for existing deployments) +#### Default (requires kubectl access / kube context) ```json { "mcpServers": { "kubeshark": { "command": "kubeshark", - "args": ["mcp", "--url", "https://kubeshark.example.com"] + "args": ["mcp"] } } } ``` -#### Proxy Mode (Requires kubectl access) +With an explicit kubeconfig path: ```json { @@ -59,14 +71,18 @@ Add to your Claude Desktop configuration: } } ``` -or: + +#### URL Mode (no kubectl required) + +Use this when the machine doesn't have kubectl access or a kube context. +Connect directly to an existing Kubeshark deployment: ```json { "mcpServers": { "kubeshark": { "command": "kubeshark", - "args": ["mcp"] + "args": ["mcp", "--url", "https://kubeshark.example.com"] } } } diff --git a/skills/README.md b/skills/README.md new file mode 100644 index 000000000..140c8ff5b --- /dev/null +++ b/skills/README.md @@ -0,0 +1,120 @@ +# Kubeshark AI Skills + +Open-source AI skills that work with the [Kubeshark MCP](https://github.com/kubeshark/kubeshark). +Skills teach AI agents how to use Kubeshark's MCP tools for specific workflows +like root cause analysis, traffic filtering, and forensic investigation. + +Skills use the open [Agent Skills](https://github.com/anthropics/skills) format +and work with Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor, and other +compatible agents. 
+ +## Available Skills + +| Skill | Description | +|-------|-------------| +| [`network-rca`](network-rca/) | Network Root Cause Analysis. Retrospective traffic analysis via snapshots, with two investigation routes: PCAP (for Wireshark/compliance) and Dissection (for AI-driven API-level investigation). | +| [`kfl`](kfl/) | KFL2 (Kubeshark Filter Language) expert. Complete reference for writing, debugging, and optimizing CEL-based traffic filters across all supported protocols. | + +## Prerequisites + +All skills require the Kubeshark MCP: + +```bash +# Claude Code +claude mcp add kubeshark -- kubeshark mcp + +# Without kubectl access (direct URL) +claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com +``` + +For Claude Desktop, add to `claude_desktop_config.json`: + +```json +{ + "mcpServers": { + "kubeshark": { + "command": "kubeshark", + "args": ["mcp"] + } + } +} +``` + +## Installation + +### Option 1: Plugin (recommended) + +Install as a Claude Code plugin directly from GitHub: + +``` +/plugin marketplace add kubeshark/kubeshark +/plugin install kubeshark +``` + +Skills appear as `/kubeshark:network-rca` and `/kubeshark:kfl`. The plugin +also bundles the Kubeshark MCP configuration automatically. + +### Option 2: Clone and run + +```bash +git clone https://github.com/kubeshark/kubeshark +cd kubeshark +claude +``` + +Skills trigger automatically based on your conversation. 
+
+### Option 3: Manual installation
+
+Clone the repo (if you haven't already), then symlink or copy the skills:
+
+```bash
+git clone https://github.com/kubeshark/kubeshark
+mkdir -p ~/.claude/skills
+
+# Symlink to stay in sync with the repo (recommended).
+# Use an absolute target, or the links will dangle.
+ln -s "$PWD/kubeshark/skills/network-rca" ~/.claude/skills/network-rca
+ln -s "$PWD/kubeshark/skills/kfl" ~/.claude/skills/kfl
+
+# Or copy to your project (project scope only)
+mkdir -p .claude/skills
+cp -r kubeshark/skills/network-rca .claude/skills/
+cp -r kubeshark/skills/kfl .claude/skills/
+
+# Or copy for personal use (all your projects)
+cp -r kubeshark/skills/network-rca ~/.claude/skills/
+cp -r kubeshark/skills/kfl ~/.claude/skills/
+```
+
+## Contributing
+
+We welcome contributions — whether improving an existing skill or proposing a new one.
+
+- **Suggest improvements**: Open an issue or PR with changes to an existing skill's `SKILL.md`
+  or reference docs. Better examples, clearer workflows, and additional filter patterns
+  are always appreciated.
+- **Add a new skill**: Open an issue describing the use case first. New skills should
+  follow the structure below and reference Kubeshark MCP tools by exact name.
+
+### Skill structure
+
+```
+skills/
+└── <skill-name>/
+    ├── SKILL.md       # Required. YAML frontmatter + markdown body.
+    └── references/    # Optional. Detailed reference docs.
+        └── *.md
+```
+
+### Guidelines
+
+- Keep `SKILL.md` under 500 lines. Use `references/` for detailed content.
+- Use imperative tone. Reference MCP tools by exact name.
+- Include realistic example tool responses.
+- The `description` frontmatter should be generous with trigger keywords.
+
+### Planned skills
+
+- `api-security` — OWASP API Top 10 assessment against live or snapshot traffic.
+- `incident-response` — 7-phase forensic incident investigation methodology.
+- `network-engineering` — Real-time traffic analysis, latency debugging, dependency mapping.
diff --git a/skills/kfl/SKILL.md b/skills/kfl/SKILL.md new file mode 100644 index 000000000..466424627 --- /dev/null +++ b/skills/kfl/SKILL.md @@ -0,0 +1,331 @@ +--- +name: kfl +user-invocable: false +description: > + KFL2 (Kubeshark Filter Language) reference. This skill MUST be loaded before + writing, constructing, or suggesting any KFL filter expression. KFL is statically + typed — incorrect field names or syntax will fail silently or error. Do not guess + at KFL syntax without this skill loaded. Trigger on any mention of KFL, CEL filters, + traffic filtering, display filters, query syntax, filter expressions, write a filter, + construct a query, build a KFL, create a filter expression, "how do I filter", + "show me only", "find traffic where", protocol-specific queries (HTTP status codes, + DNS lookups, Redis commands, Kafka topics), Kubernetes-aware filtering (by namespace, + pod, service, label, annotation), L4 connection/flow filters, time-based queries, + or any request to slice/search/narrow network traffic in Kubeshark. Also trigger + when other skills need to construct filters — KFL is the query language for all + Kubeshark traffic analysis. +--- + +# KFL2 — Kubeshark Filter Language + +You are a KFL2 expert. KFL2 is built on Google's CEL (Common Expression Language) +and is the query language for all Kubeshark traffic analysis. It operates as a +**display filter** — it doesn't affect what's captured, only what you see. + +Think of KFL the way you think of SQL for databases or Google search syntax for +the web. Kubeshark captures and indexes all cluster traffic; KFL is how you +search it. + +For the complete variable and field reference, see `references/kfl2-reference.md`. + +## Core Syntax + +KFL expressions are boolean CEL expressions. An empty filter matches everything. 
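+
+For example, each of these is a complete, valid filter on its own (the fields used here are covered in the sections below):
+
+```
+http                                // all HTTP entries
+http && status_code >= 500          // only HTTP server errors
+dst.pod.namespace == "production"   // any traffic into the production namespace
+```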
+ +### Operators + +| Category | Operators | +|----------|-----------| +| Comparison | `==`, `!=`, `<`, `<=`, `>`, `>=` | +| Logical | `&&`, `\|\|`, `!` | +| Arithmetic | `+`, `-`, `*`, `/`, `%` | +| Membership | `in` | +| Ternary | `condition ? true_val : false_val` | + +### String Functions + +``` +str.contains(substring) // Substring search +str.startsWith(prefix) // Prefix match +str.endsWith(suffix) // Suffix match +str.matches(regex) // Regex match +size(str) // String length +``` + +### Collection Functions + +``` +size(collection) // List/map/string length +key in map // Key existence +map[key] // Value access +map_get(map, key, default) // Safe access with default +value in list // List membership +``` + +### Time Functions + +``` +timestamp("2026-03-14T22:00:00Z") // Parse ISO timestamp +duration("5m") // Parse duration +now() // Current time (snapshot at filter creation) +``` + +### Negation + +``` +!http // Everything that is NOT HTTP +http && status_code != 200 // HTTP responses that aren't 200 +http && !path.contains("/health") // Exclude health checks +!(src.pod.namespace == "kube-system") // Exclude system namespace +``` + +## Protocol Detection + +Boolean flags that indicate which protocol was detected. Use these as the first +filter term — they're fast and narrow the search space immediately. + +| Flag | Protocol | Flag | Protocol | +|------|----------|------|----------| +| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis | +| `dns` | DNS | `kafka` | Kafka | +| `tls` | TLS/SSL | `amqp` | AMQP | +| `tcp` | TCP | `ldap` | LDAP | +| `udp` | UDP | `ws` | WebSocket | +| `sctp` | SCTP | `gql` | GraphQL (v1+v2) | +| `icmp` | ICMP | `gqlv1` / `gqlv2` | GraphQL version-specific | +| `radius` | RADIUS | `conn` / `flow` | L4 connection/flow tracking | +| `diameter` | Diameter | `tcp_conn` / `udp_conn` | Transport-specific connections | + +## Kubernetes Context + +The most common starting point. Filter by where traffic originates or terminates. 
+ +### Pod and Service Fields + +``` +src.pod.name == "orders-594487879c-7ddxf" +dst.pod.namespace == "production" +src.service.name == "api-gateway" +dst.service.namespace == "payments" +``` + +Pod fields fall back to service data when pod info is unavailable, so +`dst.pod.namespace` works even for service-level entries. + +### Aggregate Collections + +Match against any direction (src or dst): + +``` +"production" in namespaces // Any namespace match +"orders" in pods // Any pod name match +"api-gateway" in services // Any service name match +``` + +### Labels and Annotations + +``` +map_get(local_labels, "app", "") == "checkout" // Safe access with default +map_get(remote_labels, "version", "") == "canary" +"tier" in local_labels // Label existence check +``` + +Always use `map_get()` for labels and annotations — direct access like +`local_labels["app"]` errors if the key doesn't exist. + +### Node and Process + +``` +node_name == "ip-10-0-25-170.ec2.internal" +local_process_name == "nginx" +remote_process_name.contains("postgres") +``` + +### DNS Resolution + +``` +src.dns == "api.example.com" +dst.dns.contains("redis") +``` + +## HTTP Filtering + +HTTP is the most common protocol for API-level investigation. 
+ +### Fields + +| Field | Type | Example | +|-------|------|---------| +| `method` | string | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"` | +| `url` | string | Full path + query: `"/api/users?id=123"` | +| `path` | string | Path only: `"/api/users"` | +| `status_code` | int | `200`, `404`, `500` | +| `http_version` | string | `"HTTP/1.1"`, `"HTTP/2"` | +| `request.headers` | map | `request.headers["content-type"]` | +| `response.headers` | map | `response.headers["server"]` | +| `request.cookies` | map | `request.cookies["session"]` | +| `response.cookies` | map | `response.cookies["token"]` | +| `query_string` | map | `query_string["id"]` | +| `request_body_size` | int | Request body bytes | +| `response_body_size` | int | Response body bytes | +| `elapsed_time` | int | Duration in **microseconds** | + +### Common Patterns + +``` +// Error investigation +http && status_code >= 500 // Server errors +http && status_code == 429 // Rate limiting +http && status_code >= 400 && status_code < 500 // Client errors + +// Endpoint targeting +http && method == "POST" && path.contains("/orders") +http && url.matches(".*/api/v[0-9]+/users.*") + +// Performance +http && elapsed_time > 5000000 // > 5 seconds +http && response_body_size > 1000000 // > 1MB responses + +// Header inspection +http && "authorization" in request.headers +http && request.headers["content-type"] == "application/json" + +// GraphQL (subset of HTTP) +gql && method == "POST" && status_code >= 400 +``` + +## DNS Filtering + +DNS issues are often the hidden root cause of outages. 
+ +| Field | Type | Description | +|-------|------|-------------| +| `dns_questions` | []string | Question domain names | +| `dns_answers` | []string | Answer domain names | +| `dns_question_types` | []string | Record types: A, AAAA, CNAME, MX, TXT, SRV, PTR | +| `dns_request` | bool | Is request | +| `dns_response` | bool | Is response | +| `dns_request_length` | int | Request size | +| `dns_response_length` | int | Response size | + +``` +dns && "api.external-service.com" in dns_questions +dns && dns_response && status_code != 0 // Failed lookups +dns && "A" in dns_question_types // A record queries +dns && size(dns_questions) > 1 // Multi-question +``` + +## Database and Messaging Protocols + +### Redis + +``` +redis && redis_type == "GET" // Command type +redis && redis_key.startsWith("session:") // Key pattern +redis && redis_command.contains("DEL") // Command search +redis && redis_total_size > 10000 // Large operations +``` + +### Kafka + +``` +kafka && kafka_api_key_name == "PRODUCE" // Produce operations +kafka && kafka_client_id == "payment-processor" // Client filtering +kafka && kafka_request_summary.contains("orders") // Topic filtering +kafka && kafka_size > 10000 // Large messages +``` + +### AMQP, LDAP, RADIUS, Diameter + +``` +amqp && amqp_method == "basic.publish" // AMQP publish +ldap && ldap_type == "bind" // LDAP bind requests +radius && radius_code_name == "Access-Request" // RADIUS auth +diameter && diameter_method.contains("Credit") // Diameter credit control +``` + +For the full variable list for these protocols, see `references/kfl2-reference.md`. 
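+
+Protocol filters compose with the Kubernetes context fields above. For example, to scope
+these queries to a specific workload (names here are illustrative):
+
+```
+redis && redis_type == "GET" && dst.pod.namespace == "production"
+kafka && kafka_api_key_name == "PRODUCE" && src.service.name == "payment-processor"
+```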
+ +## Transport Layer (L4) + +### TCP/UDP Fields + +``` +tcp && tcp_error_type != "" // TCP errors +udp && udp_length > 1000 // Large UDP packets +``` + +### Connection Tracking + +``` +conn && conn_state == "open" // Active connections +conn && conn_local_bytes > 1000000 // High-volume +conn && "HTTP" in conn_l7_detected // L7 protocol detection +tcp_conn && conn_state == "closed" // Closed TCP connections +``` + +### Flow Tracking (with Rate Metrics) + +``` +flow && flow_local_pps > 1000 // High packet rate +flow && flow_local_bps > 1000000 // High bandwidth +flow && flow_state == "closed" && "TLS" in flow_l7_detected +tcp_flow && flow_local_bps > 5000000 // High-throughput TCP +``` + +## Network Layer + +``` +src.ip == "10.0.53.101" +dst.ip.startsWith("192.168.") +src.port == 8080 +dst.port >= 8000 && dst.port <= 9000 +``` + +## Time-Based Filtering + +``` +timestamp > timestamp("2026-03-14T22:00:00Z") +timestamp >= timestamp("2026-03-14T22:00:00Z") && timestamp <= timestamp("2026-03-14T23:00:00Z") +timestamp > now() - duration("5m") // Last 5 minutes +elapsed_time > 2000000 // Older than 2 seconds +``` + +## Building Filters: Progressive Narrowing + +The most effective investigation technique — start broad, add constraints: + +``` +// Step 1: Protocol + namespace +http && dst.pod.namespace == "production" + +// Step 2: Add error condition +http && dst.pod.namespace == "production" && status_code >= 500 + +// Step 3: Narrow to service +http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" + +// Step 4: Narrow to endpoint +http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") + +// Step 5: Add timing +http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") && elapsed_time > 2000000 +``` + +## Performance Tips + +1. 
**Protocol flags first** — `http && ...` is faster than `... && http` +2. **`startsWith`/`endsWith` over `contains`** — prefix/suffix checks are faster +3. **Specific ports before string ops** — `dst.port == 80` is cheaper than `url.contains(...)` +4. **Use `map_get` for labels** — avoids errors on missing keys +5. **Keep filters simple** — CEL short-circuits on `&&`, so put cheap checks first + +## Type Safety + +KFL2 is statically typed. Common gotchas: + +- `status_code` is `int`, not string — use `status_code == 200`, not `"200"` +- `elapsed_time` is in **microseconds** — 5 seconds = `5000000` +- `timestamp` requires `timestamp()` function — not a raw string +- Map access on missing keys errors — use `key in map` or `map_get()` first +- List membership uses `value in list` — not `list.contains(value)` diff --git a/skills/kfl/references/kfl2-reference.md b/skills/kfl/references/kfl2-reference.md new file mode 100644 index 000000000..45b49128c --- /dev/null +++ b/skills/kfl/references/kfl2-reference.md @@ -0,0 +1,407 @@ +# KFL2 Complete Variable and Field Reference + +This is the exhaustive reference for every variable available in KFL2 filters. +KFL2 is built on Google's CEL (Common Expression Language) and evaluates against +Kubeshark's protobuf-based `BaseEntry` structure. 
+ +## Most Commonly Used Variables + +These are the variables you'll reach for in 90% of investigations: + +| Variable | Type | What it's for | +|----------|------|---------------| +| `status_code` | int | HTTP response status (200, 404, 500) | +| `method` | string | HTTP method (GET, POST, PUT, DELETE) | +| `path` | string | URL path without query string | +| `dst.pod.namespace` | string | Where traffic is going (namespace) | +| `dst.service.name` | string | Where traffic is going (service) | +| `src.pod.name` | string | Where traffic comes from (pod) | +| `elapsed_time` | int | Request duration in microseconds | +| `dns_questions` | []string | DNS domains being queried | +| `namespaces` | []string | All namespaces involved (src + dst) | + +## Network-Level Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `src.ip` | string | Source IP address | `"10.0.53.101"` | +| `dst.ip` | string | Destination IP address | `"192.168.1.1"` | +| `src.port` | int | Source port number | `43210` | +| `dst.port` | int | Destination port number | `8080` | +| `protocol` | string | Detected protocol type | `"HTTP"`, `"DNS"` | + +## Identity and Metadata Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `id` | int | BaseEntry unique identifier (assigned by sniffer) | +| `node_id` | string | Node identifier (assigned by hub) | +| `index` | int | Entry index for stream uniqueness | +| `stream` | string | Stream identifier (hex string) | +| `timestamp` | timestamp | Event time (UTC), use with `timestamp()` function | +| `elapsed_time` | int | Age since timestamp in microseconds | +| `worker` | string | Worker identifier | + +## Cross-Reference Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `conn_id` | int | L7 to L4 connection cross-reference ID | +| `flow_id` | int | L7 to L4 flow cross-reference ID | +| `has_pcap` | bool | Whether PCAP data is available for this 
entry | + +## Capture Source Variables + +| Variable | Type | Description | Values | +|----------|------|-------------|--------| +| `capture_source` | string | Canonical capture source | `"unspecified"`, `"af_packet"`, `"ebpf"`, `"ebpf_tls"` | +| `capture_backend` | string | Backend family | `"af_packet"`, `"ebpf"` | +| `capture_source_code` | int | Numeric enum | 0=unspecified, 1=af_packet, 2=ebpf, 3=ebpf_tls | +| `capture` | map | Nested map access | `capture["source"]`, `capture["backend"]` | + +## Protocol Detection Flags + +Boolean variables indicating detected protocol. Use as first filter term for performance. + +| Variable | Protocol | Variable | Protocol | +|----------|----------|----------|----------| +| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis | +| `dns` | DNS | `kafka` | Kafka | +| `tls` | TLS/SSL handshake | `amqp` | AMQP messaging | +| `tcp` | TCP transport | `ldap` | LDAP directory | +| `udp` | UDP transport | `ws` | WebSocket | +| `sctp` | SCTP streaming | `gql` | GraphQL (v1 or v2) | +| `icmp` | ICMP | `gqlv1` | GraphQL v1 only | +| `radius` | RADIUS auth | `gqlv2` | GraphQL v2 only | +| `diameter` | Diameter | `conn` | L4 connection tracking | +| `flow` | L4 flow tracking | `tcp_conn` | TCP connection tracking | +| `tcp_flow` | TCP flow tracking | `udp_conn` | UDP connection tracking | +| `udp_flow` | UDP flow tracking | | | + +## HTTP Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `method` | string | HTTP method | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"`, `"PATCH"` | +| `url` | string | Full URL path and query string | `"/api/users?id=123"` | +| `path` | string | URL path component (no query) | `"/api/users"` | +| `status_code` | int | HTTP response status code | `200`, `404`, `500` | +| `http_version` | string | HTTP protocol version | `"HTTP/1.1"`, `"HTTP/2"` | +| `query_string` | map[string]string | Parsed URL query parameters | `query_string["id"]` → `"123"` | +| `request.headers` | 
map[string]string | Request HTTP headers | `request.headers["content-type"]` | +| `response.headers` | map[string]string | Response HTTP headers | `response.headers["server"]` | +| `request.cookies` | map[string]string | Request cookies | `request.cookies["session"]` | +| `response.cookies` | map[string]string | Response cookies | `response.cookies["token"]` | +| `request_headers_size` | int | Request headers size in bytes | | +| `request_body_size` | int | Request body size in bytes | | +| `response_headers_size` | int | Response headers size in bytes | | +| `response_body_size` | int | Response body size in bytes | | + +GraphQL requests have `gql` (or `gqlv1`/`gqlv2`) set to true and all HTTP +variables available. + +**Example**: `http && method == "POST" && status_code >= 500 && path.contains("/api")` + +## DNS Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `dns_questions` | []string | Question domain names (request + response) | `["example.com"]` | +| `dns_answers` | []string | Answer domain names | `["1.2.3.4"]` | +| `dns_question_types` | []string | Record types in questions | `["A"]`, `["AAAA"]`, `["CNAME"]` | +| `dns_request` | bool | Is DNS request message | | +| `dns_response` | bool | Is DNS response message | | +| `dns_request_length` | int | DNS request size in bytes (0 if absent) | | +| `dns_response_length` | int | DNS response size in bytes (0 if absent) | | +| `dns_total_size` | int | Sum of request + response sizes | | + +Supported question types: A, AAAA, NS, CNAME, SOA, MX, TXT, SRV, PTR, ANY. 
+ +**Example**: `dns && dns_response && status_code != 0` (failed DNS lookups) + +## TLS Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `tls` | bool | TLS payload detected | | +| `tls_summary` | string | TLS handshake summary | `"ClientHello"`, `"ServerHello"` | +| `tls_info` | string | TLS connection details | `"TLS 1.3, AES-256-GCM"` | +| `tls_request_size` | int | TLS request size in bytes | | +| `tls_response_size` | int | TLS response size in bytes | | +| `tls_total_size` | int | Sum of request + response (computed if not provided) | | + +## TCP Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `tcp` | bool | TCP payload detected | +| `tcp_method` | string | TCP method information | +| `tcp_payload` | bytes | Raw TCP payload data | +| `tcp_error_type` | string | TCP error type (empty if none) | +| `tcp_error_message` | string | TCP error message (empty if none) | + +## UDP Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `udp` | bool | UDP payload detected | +| `udp_length` | int | UDP packet length | +| `udp_checksum` | int | UDP checksum value | +| `udp_payload` | bytes | Raw UDP payload data | + +## SCTP Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `sctp` | bool | SCTP payload detected | +| `sctp_checksum` | int | SCTP checksum value | +| `sctp_chunk_type` | string | SCTP chunk type | +| `sctp_length` | int | SCTP chunk length | + +## ICMP Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `icmp` | bool | ICMP payload detected | +| `icmp_type` | string | ICMP type code | +| `icmp_version` | int | ICMP version (4 or 6) | +| `icmp_length` | int | ICMP message length | + +## WebSocket Variables + +| Variable | Type | Description | Values | +|----------|------|-------------|--------| +| `ws` | bool | WebSocket payload detected | | +| `ws_opcode` | string | 
WebSocket operation code | `"text"`, `"binary"`, `"close"`, `"ping"`, `"pong"` | +| `ws_request` | bool | Is WebSocket request | | +| `ws_response` | bool | Is WebSocket response | | +| `ws_request_payload_data` | string | Request payload (safely truncated) | | +| `ws_request_payload_length` | int | Request payload length in bytes | | +| `ws_response_payload_length` | int | Response payload length in bytes | | + +## Redis Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `redis` | bool | Redis payload detected | | +| `redis_type` | string | Redis command verb | `"GET"`, `"SET"`, `"DEL"`, `"HGET"` | +| `redis_command` | string | Full Redis command line | `"GET session:1234"` | +| `redis_key` | string | Key (truncated to 64 bytes) | `"session:1234"` | +| `redis_request_size` | int | Request size (0 if absent) | | +| `redis_response_size` | int | Response size (0 if absent) | | +| `redis_total_size` | int | Sum of request + response | | + +**Example**: `redis && redis_type == "GET" && redis_key.startsWith("session:")` + +## Kafka Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `kafka` | bool | Kafka payload detected | | +| `kafka_api_key` | int | Kafka API key number | 0=FETCH, 1=PRODUCE | +| `kafka_api_key_name` | string | Human-readable API operation | `"PRODUCE"`, `"FETCH"` | +| `kafka_client_id` | string | Kafka client identifier | `"payment-processor"` | +| `kafka_size` | int | Message size (request preferred, else response) | | +| `kafka_request` | bool | Is Kafka request | | +| `kafka_response` | bool | Is Kafka response | | +| `kafka_request_summary` | string | Request summary/topic | `"orders-topic"` | +| `kafka_request_size` | int | Request size (0 if absent) | | +| `kafka_response_size` | int | Response size (0 if absent) | | + +**Example**: `kafka && kafka_api_key_name == "PRODUCE" && kafka_request_summary.contains("orders")` + +## AMQP Variables 
+ +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `amqp` | bool | AMQP payload detected | | +| `amqp_method` | string | AMQP method name | `"basic.publish"`, `"channel.open"` | +| `amqp_summary` | string | Operation summary | | +| `amqp_request` | bool | Is AMQP request | | +| `amqp_response` | bool | Is AMQP response | | +| `amqp_request_length` | int | Request length (0 if absent) | | +| `amqp_response_length` | int | Response length (0 if absent) | | +| `amqp_total_size` | int | Sum of request + response | | + +## LDAP Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `ldap` | bool | LDAP payload detected | +| `ldap_type` | string | LDAP operation type (request preferred) | +| `ldap_summary` | string | Operation summary | +| `ldap_request` | bool | Is LDAP request | +| `ldap_response` | bool | Is LDAP response | +| `ldap_request_length` | int | Request length (0 if absent) | +| `ldap_response_length` | int | Response length (0 if absent) | +| `ldap_total_size` | int | Sum of request + response | + +## RADIUS Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `radius` | bool | RADIUS payload detected | | +| `radius_code` | int | RADIUS code (request preferred) | | +| `radius_code_name` | string | Code name | `"Access-Request"` | +| `radius_request` | bool | Is RADIUS request | | +| `radius_response` | bool | Is RADIUS response | | +| `radius_request_authenticator` | string | Request authenticator (hex) | | +| `radius_request_length` | int | Request size (0 if absent) | | +| `radius_response_length` | int | Response size (0 if absent) | | +| `radius_total_size` | int | Sum of request + response | | + +## Diameter Variables + +| Variable | Type | Description | +|----------|------|-------------| +| `diameter` | bool | Diameter payload detected | +| `diameter_method` | string | Method name (request preferred) | +| `diameter_summary` | 
string | Operation summary | +| `diameter_request` | bool | Is Diameter request | +| `diameter_response` | bool | Is Diameter response | +| `diameter_request_length` | int | Request size (0 if absent) | +| `diameter_response_length` | int | Response size (0 if absent) | +| `diameter_total_size` | int | Sum of request + response | + +## L4 Connection Tracking Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `conn` | bool | Connection tracking entry | | +| `conn_state` | string | Connection state | `"open"`, `"in_progress"`, `"closed"` | +| `conn_local_pkts` | int | Packets from local peer | | +| `conn_local_bytes` | int | Bytes from local peer | | +| `conn_remote_pkts` | int | Packets from remote peer | | +| `conn_remote_bytes` | int | Bytes from remote peer | | +| `conn_l7_detected` | []string | L7 protocols detected on connection | `["HTTP", "TLS"]` | +| `conn_group_id` | int | Connection group identifier | | + +**Example**: `conn && conn_state == "open" && conn_local_bytes > 1000000` (high-volume open connections) + +## L4 Flow Tracking Variables + +Flows extend connections with rate metrics (packets/bytes per second). 
+ +| Variable | Type | Description | +|----------|------|-------------| +| `flow` | bool | Flow tracking entry | +| `flow_state` | string | Flow state (`"open"`, `"in_progress"`, `"closed"`) | +| `flow_local_pkts` | int | Packets from local peer | +| `flow_local_bytes` | int | Bytes from local peer | +| `flow_remote_pkts` | int | Packets from remote peer | +| `flow_remote_bytes` | int | Bytes from remote peer | +| `flow_local_pps` | int | Local packets per second | +| `flow_local_bps` | int | Local bytes per second | +| `flow_remote_pps` | int | Remote packets per second | +| `flow_remote_bps` | int | Remote bytes per second | +| `flow_l7_detected` | []string | L7 protocols detected on flow | +| `flow_group_id` | int | Flow group identifier | + +**Example**: `tcp_flow && flow_local_bps > 5000000` (high-bandwidth TCP flows) + +## Kubernetes Variables + +### Pod and Service (Directional) + +| Variable | Type | Description | +|----------|------|-------------| +| `src.pod.name` | string | Source pod name | +| `src.pod.namespace` | string | Source pod namespace | +| `dst.pod.name` | string | Destination pod name | +| `dst.pod.namespace` | string | Destination pod namespace | +| `src.service.name` | string | Source service name | +| `src.service.namespace` | string | Source service namespace | +| `dst.service.name` | string | Destination service name | +| `dst.service.namespace` | string | Destination service namespace | + +**Fallback behavior**: Pod namespace/name fields automatically fall back to +service data when pod info is unavailable. This means `dst.pod.namespace` works +even when only service-level resolution exists. 
+ +**Example**: `src.service.name == "api-gateway" && dst.pod.namespace == "production"` + +### Aggregate Collections (Non-Directional) + +| Variable | Type | Description | +|----------|------|-------------| +| `namespaces` | []string | All namespaces (src + dst, pod + service) | +| `pods` | []string | All pod names (src + dst) | +| `services` | []string | All service names (src + dst) | + +### Labels and Annotations + +| Variable | Type | Description | +|----------|------|-------------| +| `local_labels` | map[string]string | Kubernetes labels of local peer | +| `local_annotations` | map[string]string | Kubernetes annotations of local peer | +| `remote_labels` | map[string]string | Kubernetes labels of remote peer | +| `remote_annotations` | map[string]string | Kubernetes annotations of remote peer | + +Use `map_get(local_labels, "key", "default")` for safe access that won't error +on missing keys. + +**Example**: `map_get(local_labels, "app", "") == "checkout" && "production" in namespaces` + +### Node Information + +| Variable | Type | Description | +|----------|------|-------------| +| `node` | map | Nested: `node["name"]`, `node["ip"]` | +| `node_name` | string | Node name (flat alias) | +| `node_ip` | string | Node IP (flat alias) | +| `local_node_name` | string | Node name of local peer | +| `remote_node_name` | string | Node name of remote peer | + +### Process Information + +| Variable | Type | Description | +|----------|------|-------------| +| `local_process_name` | string | Process name on local peer | +| `remote_process_name` | string | Process name on remote peer | + +### DNS Resolution + +| Variable | Type | Description | +|----------|------|-------------| +| `src.dns` | string | DNS resolution of source IP | +| `dst.dns` | string | DNS resolution of destination IP | +| `dns_resolutions` | []string | All DNS resolutions (deduplicated) | + +### Resolution Status + +| Variable | Type | Values | +|----------|------|--------| +| `local_resolution_status` 
| string | `""` (resolved), `"no_node_mapping"`, `"rpc_error"`, `"rpc_empty"`, `"cache_miss"`, `"queue_full"` | +| `remote_resolution_status` | string | Same as above | + +## Default Values + +When a variable is not present in an entry, KFL2 uses these defaults: + +| Type | Default | +|------|---------| +| string | `""` | +| int | `0` | +| bool | `false` | +| list | `[]` | +| map | `{}` | +| bytes | `[]` | + +## Protocol Variable Precedence + +For protocols with request/response pairs (Kafka, RADIUS, Diameter), merged +fields prefer the **request** side. If no request exists, the response value +is used. Size totals are always computed as `request_size + response_size`. + +## CEL Language Features + +KFL2 supports the full CEL specification: + +- **Short-circuit evaluation**: `&&` stops on first false, `||` stops on first true +- **Ternary**: `condition ? value_if_true : value_if_false` +- **Regex**: `str.matches("pattern")` uses RE2 syntax +- **Type coercion**: Timestamps require `timestamp()`, durations require `duration()` +- **Null safety**: Use `in` operator or `map_get()` before accessing map keys + +For the full CEL specification, see the +[CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md). diff --git a/skills/network-rca/SKILL.md b/skills/network-rca/SKILL.md new file mode 100644 index 000000000..df82854eb --- /dev/null +++ b/skills/network-rca/SKILL.md @@ -0,0 +1,338 @@ +--- +name: network-rca +description: > + Kubernetes network root cause analysis skill powered by Kubeshark MCP. Use this skill + whenever the user wants to investigate past incidents, perform retrospective traffic + analysis, take or manage traffic snapshots, extract PCAPs, dissect L7 API calls from + historical captures, compare traffic patterns over time, detect drift or anomalies + between snapshots, or do any kind of forensic network analysis in Kubernetes. 
+ Also trigger when the user mentions snapshots, raw capture, PCAP extraction, + traffic replay, postmortem analysis, "what happened yesterday/last week", + root cause analysis, RCA, cloud snapshot storage, snapshot dissection, or KFL filters + for historical traffic. Even if the user just says "figure out what went wrong" + or "compare today's traffic to yesterday" in a Kubernetes context, use this skill. +--- + +# Network Root Cause Analysis with Kubeshark MCP + +You are a Kubernetes network forensics specialist. Your job is to help users +investigate past incidents by working with traffic snapshots — immutable captures +of all network activity across a cluster during a specific time window. + +Kubeshark is a search engine for network traffic. Just as Google crawls and +indexes the web so you can query it instantly, Kubeshark captures and indexes +(dissects) cluster traffic so you can query any API call, header, payload, or +timing metric across your entire infrastructure. Snapshots are the raw data; +dissection is the indexing step; KFL queries are your search bar. + +Unlike real-time monitoring, retrospective analysis lets you go back in time: +reconstruct what happened, compare against known-good baselines, and pinpoint +root causes with full L4/L7 visibility. + +## Prerequisites + +Before starting any analysis, verify the environment is ready. + +### Kubeshark MCP Health Check + +Confirm the Kubeshark MCP is accessible and tools are available. Look for tools +like `list_api_calls`, `list_l4_flows`, `create_snapshot`, etc. + +**Tool**: `check_kubeshark_status` + +If tools like `list_api_calls` or `list_l4_flows` are missing from the response, +something is wrong with the MCP connection. Guide the user through setup +(see Setup Reference at the bottom). + +### Raw Capture Must Be Enabled + +Retrospective analysis depends on raw capture — Kubeshark's kernel-level (eBPF) +packet recording that stores traffic at the node level. 
Without it, snapshots +have nothing to work with. + +Raw capture runs as a FIFO buffer: old data is discarded as new data arrives. +The buffer size determines how far back you can go. Larger buffer = wider +snapshot window. + +```yaml +tap: + capture: + raw: + enabled: true + storageSize: 10Gi # Per-node FIFO buffer +``` + +If raw capture isn't enabled, inform the user that retrospective analysis +requires it and share the configuration above. + +### Snapshot Storage + +Snapshots are assembled on the Hub's storage, which is ephemeral by default. +For serious forensic work, persistent storage is recommended: + +```yaml +tap: + snapshots: + local: + storageClass: gp2 + storageSize: 1000Gi +``` + +## Core Workflow + +Every investigation starts with a snapshot. After that, you choose one of two +investigation routes depending on your goal: + +1. **Determine time window** — When did the issue occur? Use `get_data_boundaries` + to see what raw capture data is available. +2. **Create or locate a snapshot** — Either take a new snapshot covering the + incident window, or find an existing one with `list_snapshots`. +3. **Choose your investigation route** — PCAP or Dissection (see below). + +### Choosing the Right Route + +| | PCAP Route | Dissection Route | +|---|---|---| +| **Speed** | Immediate — no indexing needed | Takes time to index | +| **Filtering** | Nodes, time window, BPF filters | Kubernetes & API-level (pods, labels, paths, status codes) | +| **Output** | Cluster-wide PCAP files | Structured query results | +| **Investigation by** | Human (Wireshark) | AI agent or human (queryable database) | +| **Best for** | Compliance, sharing with network teams, Wireshark deep-dives | Root cause analysis, API-level debugging, automated investigation | + +Both routes are valid and complementary. Use PCAP when you need raw packets +for human analysis or compliance. Use Dissection when you want an AI agent +to search and analyze traffic programmatically. 
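+
+To make the contrast concrete, here is the same hypothetical incident scoped
+in each route's filter language (the IP, port, and service name below are
+illustrative, not taken from a real cluster):
+
+**PCAP route** — BPF, packet-level:
+
+```
+host 10.0.53.101 and tcp port 8080
+```
+
+**Dissection route** — KFL, Kubernetes- and API-level:
+
+```
+http && dst.service.name == "payment-service" && status_code >= 500
+```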
+ +## Snapshot Operations + +Both routes start here. A snapshot is an immutable freeze of all cluster traffic +in a time window. + +### Check Data Boundaries + +**Tool**: `get_data_boundaries` + +Check what raw capture data exists across the cluster. You can only create +snapshots within these boundaries — data outside the window has been rotated +out of the FIFO buffer. + +**Example response**: +``` +Cluster-wide: + Oldest: 2026-03-14 16:12:34 UTC + Newest: 2026-03-14 18:05:20 UTC + +Per node: + ┌─────────────────────────────┬──────────┬──────────┐ + │ Node │ Oldest │ Newest │ + ├─────────────────────────────┼──────────┼──────────┤ + │ ip-10-0-25-170.ec2.internal │ 16:12:34 │ 18:03:39 │ + │ ip-10-0-32-115.ec2.internal │ 16:13:45 │ 18:05:20 │ + └─────────────────────────────┴──────────┴──────────┘ +``` + +If the incident falls outside the available window, the data has been rotated +out. Suggest increasing `storageSize` for future coverage. + +### Create a Snapshot + +**Tool**: `create_snapshot` + +Specify nodes (or cluster-wide) and a time window within the data boundaries. +Snapshots include raw capture files, Kubernetes pod events, and eBPF cgroup events. + +Snapshots take time to build. Check status with `get_snapshot` — wait until +`completed` before proceeding with either route. + +### List Existing Snapshots + +**Tool**: `list_snapshots` + +Shows all snapshots on the local Hub, with name, size, status, and node count. + +### Cloud Storage + +Snapshots on the Hub are ephemeral. Cloud storage (S3, GCS, Azure Blob) +provides long-term retention. Snapshots can be downloaded to any cluster +with Kubeshark — not necessarily the original one. + +**Check cloud status**: `get_cloud_storage_status` +**Upload to cloud**: `upload_snapshot_to_cloud` +**Download from cloud**: `download_snapshot_from_cloud` + +--- + +## Route 1: PCAP + +The PCAP route does **not** require dissection. 
It works directly with the raw +snapshot data to produce filtered, cluster-wide PCAP files. Use this route when: + +- You need raw packets for Wireshark analysis +- You're sharing captures with network teams +- You need evidence for compliance or audit +- A human will perform the investigation (not an AI agent) + +### Filtering a PCAP + +**Tool**: `export_snapshot_pcap` + +Filter the snapshot down to what matters using: +- **Nodes** — specific cluster nodes only +- **Time** — sub-window within the snapshot +- **BPF filter** — standard Berkeley Packet Filter syntax (e.g., `host 10.0.53.101`, + `port 8080`, `net 10.0.0.0/16`) + +These filters are combinable — select specific nodes, narrow the time range, +and apply a BPF expression all at once. + +### Workload-to-BPF Workflow + +When you know the workload names but not their IPs, resolve them from the +snapshot's metadata. Snapshots preserve pod-to-IP mappings from capture time, +so resolution is accurate even if pods have been rescheduled since. + +**Tool**: `resolve_workload` + +**Example workflow** — extract PCAP for specific workloads: + +1. Resolve IPs: `resolve_workload` for `orders-594487879c-7ddxf` → `10.0.53.101` +2. Resolve IPs: `resolve_workload` for `payment-service-6b8f9d-x2k4p` → `10.0.53.205` +3. Build BPF: `host 10.0.53.101 or host 10.0.53.205` +4. Export: `export_snapshot_pcap` with that BPF filter + +This gives you a cluster-wide PCAP filtered to exactly the workloads involved +in the incident — ready for Wireshark or long-term storage. + +--- + +## Route 2: Dissection + +The Dissection route indexes raw packets into structured L7 API calls, building +a queryable database from the snapshot. 
Use this route when: + +- An AI agent is performing the investigation +- You need to search by Kubernetes context (pods, namespaces, labels, services) +- You need to search by API elements (paths, status codes, headers, payloads) +- You want structured responses you can analyze programmatically +- You need to drill into the payload of a specific API call + +**KFL requirement**: The Dissection route uses KFL filters for all queries +(`list_api_calls`, `get_api_stats`, etc.). Before constructing any KFL filter, +load the KFL skill (`skills/kfl/`). KFL is statically typed — incorrect field +names or syntax will fail silently or error. If the KFL skill is not available, +suggest the user install it: + +```bash +ln -s /path/to/kubeshark/skills/kfl ~/.claude/skills/kfl +``` + +**If the KFL skill cannot be loaded**, only use the exact filter examples shown +in this skill. Do not improvise or guess at field names, operators, or syntax. +KFL field names differ from what you might expect (e.g., `status_code` not +`response.status`, `src.pod.namespace` not `src.namespace`). Using incorrect +fields produces wrong results without warning. + +### Activate Dissection + +**Tool**: `start_snapshot_dissection` + +Dissection takes time proportional to snapshot size — it parses every packet, +reassembles streams, and builds the index. After completion, these tools +become available: +- `list_api_calls` — Search API transactions with KFL filters +- `get_api_call` — Drill into a specific call (headers, body, timing, payload) +- `get_api_stats` — Aggregated statistics (throughput, error rates, latency) + +### Investigation Strategy + +Start broad, then narrow: + +1. `get_api_stats` — Get the overall picture: error rates, latency percentiles, + throughput. Look for spikes or anomalies. +2. `list_api_calls` filtered by error codes (4xx, 5xx) or high latency — find + the problematic transactions. +3. 
`get_api_call` on specific calls — inspect headers, bodies, timing, and + full payload to understand what went wrong. +4. Use KFL filters to slice by namespace, service, protocol, or any combination. + +**Example `list_api_calls` response** (filtered to `http && status_code >= 500`): +``` +┌──────────────────────┬────────┬──────────────────────────┬────────┬───────────┐ +│ Timestamp │ Method │ URL │ Status │ Elapsed │ +├──────────────────────┼────────┼──────────────────────────┼────────┼───────────┤ +│ 2026-03-14 17:23:45 │ POST │ /api/v1/orders/charge │ 503 │ 12,340 ms │ +│ 2026-03-14 17:23:46 │ POST │ /api/v1/orders/charge │ 503 │ 11,890 ms │ +│ 2026-03-14 17:23:48 │ GET │ /api/v1/inventory/check │ 500 │ 8,210 ms │ +│ 2026-03-14 17:24:01 │ POST │ /api/v1/payments/process │ 502 │ 30,000 ms │ +└──────────────────────┴────────┴──────────────────────────┴────────┴───────────┘ +Src: api-gateway (prod) → Dst: payment-service (prod) +``` + +Use the pattern of repeated failures and high latency to identify the failing +service chain, then drill into individual calls with `get_api_call`. 
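+
+As a worked example, the repeated 503s in the table above could be isolated
+with a single filter before drilling in (field names follow the KFL reference;
+the service name and latency threshold are illustrative — `elapsed_time` is in
+microseconds, so 10,000,000 ≈ 10 s):
+
+```
+http && status_code == 503 && dst.service.name == "payment-service" && elapsed_time > 10000000
+```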
+ +### KFL Filters for Dissected Traffic + +Layer filters progressively when investigating: + +``` +// Step 1: Protocol + namespace +http && dst.pod.namespace == "production" + +// Step 2: Add error condition +http && dst.pod.namespace == "production" && status_code >= 500 + +// Step 3: Narrow to service +http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" + +// Step 4: Narrow to endpoint +http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") +``` + +Other common RCA filters: + +``` +dns && dns_response && status_code != 0 // Failed DNS lookups +src.service.namespace != dst.service.namespace // Cross-namespace traffic +http && elapsed_time > 5000000 // Slow transactions (> 5s) +conn && conn_state == "open" && conn_local_bytes > 1000000 // High-volume connections +``` + +--- + +## Combining Both Routes + +The two routes are complementary. A common pattern: + +1. Start with **Dissection** — let the AI agent search and identify the root cause +2. Once you've pinpointed the problematic workloads, use `resolve_workload` + to get their IPs +3. Switch to **PCAP** — export a filtered PCAP of just those workloads for + Wireshark deep-dive, sharing with the network team, or compliance archival + +## Use Cases + +### Post-Incident RCA + +1. Identify the incident time window from alerts, logs, or user reports +2. Check `get_data_boundaries` — is the window still in raw capture? +3. `create_snapshot` covering the incident window (add 15 minutes buffer) +4. **Dissection route**: `start_snapshot_dissection` → `get_api_stats` → + `list_api_calls` → `get_api_call` → follow the dependency chain +5. 
**PCAP route**: `resolve_workload` → `export_snapshot_pcap` with BPF → + hand off to Wireshark or archive + +### Other Use Cases + +- **Trend analysis** — Take snapshots at regular intervals and compare + `get_api_stats` across them to detect latency drift, error rate changes, + or new service-to-service connections. +- **Forensic preservation** — `create_snapshot` + `upload_snapshot_to_cloud` + for immutable, long-term evidence. Downloadable to any cluster months later. +- **Production-to-local replay** — Upload a production snapshot to cloud, + download it on a local KinD cluster, and investigate safely. + +## Setup Reference + +For CLI installation, MCP configuration, verification, and troubleshooting, +see `references/setup.md`. diff --git a/skills/network-rca/references/setup.md b/skills/network-rca/references/setup.md new file mode 100644 index 000000000..ae797d34e --- /dev/null +++ b/skills/network-rca/references/setup.md @@ -0,0 +1,70 @@ +# Kubeshark MCP Setup Reference + +## Installing the CLI + +**Homebrew (macOS)**: +```bash +brew install kubeshark +``` + +**Linux**: +```bash +sh <(curl -Ls https://kubeshark.com/install) +``` + +**From source**: +```bash +git clone https://github.com/kubeshark/kubeshark +cd kubeshark && make +``` + +## MCP Configuration + +**Claude Desktop / Cowork** (`claude_desktop_config.json`): +```json +{ + "mcpServers": { + "kubeshark": { + "command": "kubeshark", + "args": ["mcp"] + } + } +} +``` + +**Claude Code (CLI)**: +```bash +claude mcp add kubeshark -- kubeshark mcp +``` + +**Without kubectl access** (direct URL mode): +```json +{ + "mcpServers": { + "kubeshark": { + "command": "kubeshark", + "args": ["mcp", "--url", "https://kubeshark.example.com"] + } + } +} +``` + +```bash +# Claude Code equivalent: +claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com +``` + +## Verification + +- Claude Code: `/mcp` to check connection status +- Terminal: `kubeshark mcp --list-tools` +- Cluster: `kubectl get 
pods -l app=kubeshark-hub` + +## Troubleshooting + +- **Binary not found** → Install via Homebrew or the install script above +- **Connection refused** → Deploy Kubeshark first: `kubeshark tap` +- **No L7 data** → Check `get_dissection_status` and `enable_dissection` +- **Snapshot creation fails** → Verify raw capture is enabled in Kubeshark config +- **Empty snapshot** → Check `get_data_boundaries` — the requested window may + fall outside available data