mirror of
https://github.com/kubeshark/kubeshark.git
synced 2026-05-14 21:27:03 +00:00
Add KFL and Network RCA skills (#1875)

* Add KFL and Network RCA skills

  Introduce the skills/ directory with two Kubeshark MCP skills:

  - network-rca: Retrospective traffic analysis via snapshots, dissection, KFL queries, PCAP extraction, and trend comparison
  - kfl: Complete KFL2 (Kubeshark Filter Language) reference covering all supported protocols, variables, operators, and filter patterns

  Update CLAUDE.md with skill authoring guidelines, structure conventions, and the list of available Kubeshark MCP tools.

* Optimize skills and add shared setup reference

  - network-rca: cut repeated metaphor, add list_api_calls example response, consolidate use cases, remove unbuilt composability section, extract setup reference to references/setup.md (409 → 306 lines)
  - kfl: merge thin protocol sections, fix map_get inconsistency, add negation examples, move capture source to reference doc
  - kfl2-reference: add most-commonly-used variables table, add inline filter examples per protocol section
  - Add skills/README.md with usage and contribution guidelines

* Add plugin infrastructure and update READMEs

  - Add .claude-plugin/plugin.json and marketplace.json for Claude Code plugin distribution
  - Add .mcp.json bundling the Kubeshark MCP configuration
  - Update skills/README.md with plugin install, manual install, and agent compatibility sections
  - Update mcp/README.md with AI skills section and install instructions
  - Restructure network-rca skill into two distinct investigation routes: PCAP (no dissection, BPF filters, Wireshark/compliance) and Dissection (indexed queries, AI-driven analysis, payload inspection)

* Remove CLAUDE.md from tracked files

  Content now lives in skills/README.md, mcp/README.md, and the skills themselves.

* Add README to .claude-plugin directory
* Reorder MCP config: default mode first, URL mode for no-kubectl
* Move AI Skills section to top of MCP README
* Reorder manual install: symlink first
* Streamline skills README: focus on usage and contributing
* Enforce KFL skill loading before writing filters

  - network-rca: require loading KFL skill before constructing filters, suggest installation if unavailable
  - kfl: set user-invocable: false (background knowledge skill), strengthen description to mandate loading before any filter construction

* Move KFL requirement to top of Dissection route
* Add strict fallback: only use exact examples if KFL skill unavailable
* Add clone step to manual installation
* Use $PWD/kubeshark paths in manual install examples
* Add mkdir before symlinks, simplify paths
* Move prerequisites before installation

---------

Co-authored-by: Alon Girmonsky <alongir@Alons-Mac-Studio.local>
33 .claude-plugin/README.md Normal file
@@ -0,0 +1,33 @@
# Kubeshark Claude Code Plugin

This directory contains the [Claude Code plugin](https://docs.anthropic.com/en/docs/claude-code/plugins) configuration for Kubeshark.

## What's here

| File | Purpose |
|------|---------|
| `plugin.json` | Plugin manifest — name, version, description, metadata |
| `marketplace.json` | Marketplace index — allows discovery via `/plugin marketplace add` |

## Installing the plugin

```
/plugin marketplace add kubeshark/kubeshark
/plugin install kubeshark
```

This loads the Kubeshark AI skills and MCP configuration. Skills appear as
`/kubeshark:network-rca` and `/kubeshark:kfl`.

## What the plugin includes

- **Skills** from [`skills/`](../skills/) — network root cause analysis and KFL filter expertise
- **MCP configuration** from [`.mcp.json`](../.mcp.json) — connects to the Kubeshark MCP server

## Local development

Test the plugin without installing:

```bash
claude --plugin-dir /path/to/kubeshark
```
15 .claude-plugin/marketplace.json Normal file
@@ -0,0 +1,15 @@
{
  "name": "kubeshark",
  "description": "Kubeshark network observability skills for Kubernetes",
  "plugins": [
    {
      "name": "kubeshark",
      "description": "Network observability skills powered by Kubeshark MCP — root cause analysis, KFL traffic filtering, snapshot forensics, PCAP extraction.",
      "source": {
        "source": "github",
        "owner": "kubeshark",
        "repo": "kubeshark"
      }
    }
  ]
}
24 .claude-plugin/plugin.json Normal file
@@ -0,0 +1,24 @@
{
  "name": "kubeshark",
  "version": "1.0.0",
  "description": "Kubernetes network observability skills powered by Kubeshark MCP. Root cause analysis, traffic filtering, snapshot forensics, PCAP extraction, and more.",
  "author": {
    "name": "Kubeshark",
    "url": "https://kubeshark.com"
  },
  "homepage": "https://kubeshark.com",
  "repository": "https://github.com/kubeshark/kubeshark",
  "license": "Apache-2.0",
  "keywords": [
    "kubeshark",
    "kubernetes",
    "network",
    "observability",
    "traffic",
    "mcp",
    "rca",
    "pcap",
    "kfl",
    "ebpf"
  ]
}
8 .mcp.json Normal file
@@ -0,0 +1,8 @@
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
mcp/README.md
@@ -2,6 +2,18 @@
 
 [Kubeshark](https://kubeshark.com) MCP (Model Context Protocol) server enables AI assistants like Claude Desktop, Cursor, and other MCP-compatible clients to query real-time Kubernetes network traffic.
 
+## AI Skills
+
+The MCP provides the tools — [AI skills](../skills/) teach agents how to use them.
+Skills turn raw MCP capabilities into domain-specific workflows like root cause
+analysis, traffic filtering, and forensic investigation. See the
+[skills README](../skills/README.md) for installation and usage.
+
+| Skill | Description |
+|-------|-------------|
+| [`network-rca`](../skills/network-rca/) | Network Root Cause Analysis — snapshot-based retrospective investigation with PCAP and dissection routes |
+| [`kfl`](../skills/kfl/) | KFL2 filter expert — write, debug, and optimize traffic queries across all supported protocols |
+
 ## Features
 
 - **L7 API Traffic Analysis**: Query HTTP, gRPC, Redis, Kafka, DNS transactions
@@ -34,20 +46,20 @@ Add to your Claude Desktop configuration:
 
 **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
 **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
 
-#### URL Mode (Recommended for existing deployments)
+#### Default (requires kubectl access / kube context)
 
 ```json
 {
   "mcpServers": {
     "kubeshark": {
       "command": "kubeshark",
-      "args": ["mcp", "--url", "https://kubeshark.example.com"]
+      "args": ["mcp"]
     }
   }
 }
 ```
 
-#### Proxy Mode (Requires kubectl access)
+With an explicit kubeconfig path:
 
 ```json
 {
@@ -59,14 +71,18 @@ Add to your Claude Desktop configuration:
     }
   }
 }
 ```
-or:
 
+#### URL Mode (no kubectl required)
+
+Use this when the machine doesn't have kubectl access or a kube context.
+Connect directly to an existing Kubeshark deployment:
+
 ```json
 {
   "mcpServers": {
     "kubeshark": {
       "command": "kubeshark",
-      "args": ["mcp"]
+      "args": ["mcp", "--url", "https://kubeshark.example.com"]
     }
   }
 }
120 skills/README.md Normal file
@@ -0,0 +1,120 @@
# Kubeshark AI Skills

Open-source AI skills that work with the [Kubeshark MCP](https://github.com/kubeshark/kubeshark).
Skills teach AI agents how to use Kubeshark's MCP tools for specific workflows
like root cause analysis, traffic filtering, and forensic investigation.

Skills use the open [Agent Skills](https://github.com/anthropics/skills) format
and work with Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor, and other
compatible agents.

## Available Skills

| Skill | Description |
|-------|-------------|
| [`network-rca`](network-rca/) | Network Root Cause Analysis. Retrospective traffic analysis via snapshots, with two investigation routes: PCAP (for Wireshark/compliance) and Dissection (for AI-driven API-level investigation). |
| [`kfl`](kfl/) | KFL2 (Kubeshark Filter Language) expert. Complete reference for writing, debugging, and optimizing CEL-based traffic filters across all supported protocols. |

## Prerequisites

All skills require the Kubeshark MCP:

```bash
# Claude Code
claude mcp add kubeshark -- kubeshark mcp

# Without kubectl access (direct URL)
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```

For Claude Desktop, add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
```

## Installation

### Option 1: Plugin (recommended)

Install as a Claude Code plugin directly from GitHub:

```
/plugin marketplace add kubeshark/kubeshark
/plugin install kubeshark
```

Skills appear as `/kubeshark:network-rca` and `/kubeshark:kfl`. The plugin
also bundles the Kubeshark MCP configuration automatically.

### Option 2: Clone and run

```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark
claude
```

Skills trigger automatically based on your conversation.

### Option 3: Manual installation

Clone the repo (if you haven't already), then symlink or copy the skills:

```bash
git clone https://github.com/kubeshark/kubeshark
mkdir -p ~/.claude/skills

# Symlink to stay in sync with the repo (recommended).
# Use absolute paths: a relative symlink target would resolve
# relative to ~/.claude/skills and dangle.
ln -s "$PWD"/kubeshark/skills/network-rca ~/.claude/skills/network-rca
ln -s "$PWD"/kubeshark/skills/kfl ~/.claude/skills/kfl

# Or copy to your project (project scope only)
mkdir -p .claude/skills
cp -r kubeshark/skills/network-rca .claude/skills/
cp -r kubeshark/skills/kfl .claude/skills/

# Or copy for personal use (all your projects)
cp -r kubeshark/skills/network-rca ~/.claude/skills/
cp -r kubeshark/skills/kfl ~/.claude/skills/
```

## Contributing

We welcome contributions — whether improving an existing skill or proposing a new one.

- **Suggest improvements**: Open an issue or PR with changes to an existing skill's `SKILL.md`
  or reference docs. Better examples, clearer workflows, and additional filter patterns
  are always appreciated.
- **Add a new skill**: Open an issue describing the use case first. New skills should
  follow the structure below and reference Kubeshark MCP tools by exact name.

### Skill structure

```
skills/
└── <skill-name>/
    ├── SKILL.md         # Required. YAML frontmatter + markdown body.
    └── references/      # Optional. Detailed reference docs.
        └── *.md
```

### Guidelines

- Keep `SKILL.md` under 500 lines. Use `references/` for detailed content.
- Use imperative tone. Reference MCP tools by exact name.
- Include realistic example tool responses.
- The `description` frontmatter should be generous with trigger keywords.

### Planned skills

- `api-security` — OWASP API Top 10 assessment against live or snapshot traffic.
- `incident-response` — 7-phase forensic incident investigation methodology.
- `network-engineering` — Real-time traffic analysis, latency debugging, dependency mapping.
331 skills/kfl/SKILL.md Normal file
@@ -0,0 +1,331 @@
---
name: kfl
user-invocable: false
description: >
  KFL2 (Kubeshark Filter Language) reference. This skill MUST be loaded before
  writing, constructing, or suggesting any KFL filter expression. KFL is statically
  typed — incorrect field names or syntax will fail silently or error. Do not guess
  at KFL syntax without this skill loaded. Trigger on any mention of KFL, CEL filters,
  traffic filtering, display filters, query syntax, filter expressions, write a filter,
  construct a query, build a KFL, create a filter expression, "how do I filter",
  "show me only", "find traffic where", protocol-specific queries (HTTP status codes,
  DNS lookups, Redis commands, Kafka topics), Kubernetes-aware filtering (by namespace,
  pod, service, label, annotation), L4 connection/flow filters, time-based queries,
  or any request to slice/search/narrow network traffic in Kubeshark. Also trigger
  when other skills need to construct filters — KFL is the query language for all
  Kubeshark traffic analysis.
---

# KFL2 — Kubeshark Filter Language

You are a KFL2 expert. KFL2 is built on Google's CEL (Common Expression Language)
and is the query language for all Kubeshark traffic analysis. It operates as a
**display filter** — it doesn't affect what's captured, only what you see.

Think of KFL the way you think of SQL for databases or Google search syntax for
the web. Kubeshark captures and indexes all cluster traffic; KFL is how you
search it.

For the complete variable and field reference, see `references/kfl2-reference.md`.

## Core Syntax

KFL expressions are boolean CEL expressions. An empty filter matches everything.

### Operators

| Category | Operators |
|----------|-----------|
| Comparison | `==`, `!=`, `<`, `<=`, `>`, `>=` |
| Logical | `&&`, `\|\|`, `!` |
| Arithmetic | `+`, `-`, `*`, `/`, `%` |
| Membership | `in` |
| Ternary | `condition ? true_val : false_val` |
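
The membership and ternary operators have no dedicated examples later in this skill, so a brief sketch (values are illustrative; list literals are assumed to work as in standard CEL):

```
http && status_code in [301, 302, 307, 308]   // Membership: any redirect status
dst.port == 443 ? tls : true                  // Ternary: expect TLS on port 443
```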

### String Functions

```
str.contains(substring)      // Substring search
str.startsWith(prefix)       // Prefix match
str.endsWith(suffix)         // Suffix match
str.matches(regex)           // Regex match
size(str)                    // String length
```

### Collection Functions

```
size(collection)             // List/map/string length
key in map                   // Key existence
map[key]                     // Value access
map_get(map, key, default)   // Safe access with default
value in list                // List membership
```

### Time Functions

```
timestamp("2026-03-14T22:00:00Z")   // Parse ISO timestamp
duration("5m")                      // Parse duration
now()                               // Current time (snapshot at filter creation)
```

### Negation

```
!http                                  // Everything that is NOT HTTP
http && status_code != 200             // HTTP responses that aren't 200
http && !path.contains("/health")      // Exclude health checks
!(src.pod.namespace == "kube-system")  // Exclude system namespace
```

## Protocol Detection

Boolean flags that indicate which protocol was detected. Use these as the first
filter term — they're fast and narrow the search space immediately.

| Flag | Protocol | Flag | Protocol |
|------|----------|------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL | `amqp` | AMQP |
| `tcp` | TCP | `ldap` | LDAP |
| `udp` | UDP | `ws` | WebSocket |
| `sctp` | SCTP | `gql` | GraphQL (v1+v2) |
| `icmp` | ICMP | `gqlv1` / `gqlv2` | GraphQL version-specific |
| `radius` | RADIUS | `conn` / `flow` | L4 connection/flow tracking |
| `diameter` | Diameter | `tcp_conn` / `udp_conn` | Transport-specific connections |

## Kubernetes Context

The most common starting point. Filter by where traffic originates or terminates.

### Pod and Service Fields

```
src.pod.name == "orders-594487879c-7ddxf"
dst.pod.namespace == "production"
src.service.name == "api-gateway"
dst.service.namespace == "payments"
```

Pod fields fall back to service data when pod info is unavailable, so
`dst.pod.namespace` works even for service-level entries.

### Aggregate Collections

Match against any direction (src or dst):

```
"production" in namespaces    // Any namespace match
"orders" in pods              // Any pod name match
"api-gateway" in services     // Any service name match
```

### Labels and Annotations

```
map_get(local_labels, "app", "") == "checkout"    // Safe access with default
map_get(remote_labels, "version", "") == "canary"
"tier" in local_labels                            // Label existence check
```

Always use `map_get()` for labels and annotations — direct access like
`local_labels["app"]` errors if the key doesn't exist.

### Node and Process

```
node_name == "ip-10-0-25-170.ec2.internal"
local_process_name == "nginx"
remote_process_name.contains("postgres")
```

### DNS Resolution

```
src.dns == "api.example.com"
dst.dns.contains("redis")
```

## HTTP Filtering

HTTP is the most common protocol for API-level investigation.

### Fields

| Field | Type | Example |
|-------|------|---------|
| `method` | string | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"` |
| `url` | string | Full path + query: `"/api/users?id=123"` |
| `path` | string | Path only: `"/api/users"` |
| `status_code` | int | `200`, `404`, `500` |
| `http_version` | string | `"HTTP/1.1"`, `"HTTP/2"` |
| `request.headers` | map | `request.headers["content-type"]` |
| `response.headers` | map | `response.headers["server"]` |
| `request.cookies` | map | `request.cookies["session"]` |
| `response.cookies` | map | `response.cookies["token"]` |
| `query_string` | map | `query_string["id"]` |
| `request_body_size` | int | Request body bytes |
| `response_body_size` | int | Response body bytes |
| `elapsed_time` | int | Duration in **microseconds** |

### Common Patterns

```
// Error investigation
http && status_code >= 500                        // Server errors
http && status_code == 429                        // Rate limiting
http && status_code >= 400 && status_code < 500   // Client errors

// Endpoint targeting
http && method == "POST" && path.contains("/orders")
http && url.matches(".*/api/v[0-9]+/users.*")

// Performance
http && elapsed_time > 5000000                    // > 5 seconds
http && response_body_size > 1000000              // > 1MB responses

// Header inspection
http && "authorization" in request.headers
http && request.headers["content-type"] == "application/json"

// GraphQL (subset of HTTP)
gql && method == "POST" && status_code >= 400
```

## DNS Filtering

DNS issues are often the hidden root cause of outages.

| Field | Type | Description |
|-------|------|-------------|
| `dns_questions` | []string | Question domain names |
| `dns_answers` | []string | Answer domain names |
| `dns_question_types` | []string | Record types: A, AAAA, CNAME, MX, TXT, SRV, PTR |
| `dns_request` | bool | Is request |
| `dns_response` | bool | Is response |
| `dns_request_length` | int | Request size |
| `dns_response_length` | int | Response size |

```
dns && "api.external-service.com" in dns_questions
dns && dns_response && status_code != 0   // Failed lookups
dns && "A" in dns_question_types          // A record queries
dns && size(dns_questions) > 1            // Multi-question
```

## Database and Messaging Protocols

### Redis

```
redis && redis_type == "GET"                // Command type
redis && redis_key.startsWith("session:")   // Key pattern
redis && redis_command.contains("DEL")      // Command search
redis && redis_total_size > 10000           // Large operations
```

### Kafka

```
kafka && kafka_api_key_name == "PRODUCE"           // Produce operations
kafka && kafka_client_id == "payment-processor"    // Client filtering
kafka && kafka_request_summary.contains("orders")  // Topic filtering
kafka && kafka_size > 10000                        // Large messages
```

### AMQP, LDAP, RADIUS, Diameter

```
amqp && amqp_method == "basic.publish"           // AMQP publish
ldap && ldap_type == "bind"                      // LDAP bind requests
radius && radius_code_name == "Access-Request"   // RADIUS auth
diameter && diameter_method.contains("Credit")   // Diameter credit control
```

For the full variable list for these protocols, see `references/kfl2-reference.md`.

## Transport Layer (L4)

### TCP/UDP Fields

```
tcp && tcp_error_type != ""   // TCP errors
udp && udp_length > 1000      // Large UDP packets
```

### Connection Tracking

```
conn && conn_state == "open"           // Active connections
conn && conn_local_bytes > 1000000     // High-volume
conn && "HTTP" in conn_l7_detected     // L7 protocol detection
tcp_conn && conn_state == "closed"     // Closed TCP connections
```

### Flow Tracking (with Rate Metrics)

```
flow && flow_local_pps > 1000          // High packet rate
flow && flow_local_bps > 1000000       // High bandwidth
flow && flow_state == "closed" && "TLS" in flow_l7_detected
tcp_flow && flow_local_bps > 5000000   // High-throughput TCP
```

## Network Layer

```
src.ip == "10.0.53.101"
dst.ip.startsWith("192.168.")
src.port == 8080
dst.port >= 8000 && dst.port <= 9000
```

## Time-Based Filtering

```
timestamp > timestamp("2026-03-14T22:00:00Z")
timestamp >= timestamp("2026-03-14T22:00:00Z") && timestamp <= timestamp("2026-03-14T23:00:00Z")
timestamp > now() - duration("5m")   // Last 5 minutes
elapsed_time > 2000000               // Requests that took longer than 2 seconds
```

## Building Filters: Progressive Narrowing

The most effective investigation technique — start broad, add constraints:

```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")

// Step 5: Add timing
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") && elapsed_time > 2000000
```

## Performance Tips

1. **Protocol flags first** — `http && ...` is faster than `... && http`
2. **`startsWith`/`endsWith` over `contains`** — prefix/suffix checks are faster
3. **Specific ports before string ops** — `dst.port == 80` is cheaper than `url.contains(...)`
4. **Use `map_get` for labels** — avoids errors on missing keys
5. **Keep filters simple** — CEL short-circuits on `&&`, so put cheap checks first
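
Applying these tips together, a single filter ordered cheapest-check-first (names and values are illustrative):

```
http && dst.port == 8080 && status_code >= 500 && path.startsWith("/api/orders")
// protocol flag → port compare → int compare → prefix match
```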

## Type Safety

KFL2 is statically typed. Common gotchas:

- `status_code` is `int`, not string — use `status_code == 200`, not `"200"`
- `elapsed_time` is in **microseconds** — 5 seconds = `5000000`
- `timestamp` requires `timestamp()` function — not a raw string
- Map access on missing keys errors — use `key in map` or `map_get()` first
- List membership uses `value in list` — not `list.contains(value)`
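
The gotchas above, restated as filters (illustrative values):

```
http && status_code == 500                       // correct: int comparison
http && status_code == "500"                     // wrong: string vs int type error
timestamp > timestamp("2026-03-14T22:00:00Z")    // correct: parsed timestamp, not a raw string
map_get(local_labels, "app", "") == "checkout"   // correct: safe map access
"orders" in pods                                 // correct: value in list
```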
407 skills/kfl/references/kfl2-reference.md Normal file
@@ -0,0 +1,407 @@
# KFL2 Complete Variable and Field Reference

This is the exhaustive reference for every variable available in KFL2 filters.
KFL2 is built on Google's CEL (Common Expression Language) and evaluates against
Kubeshark's protobuf-based `BaseEntry` structure.

## Most Commonly Used Variables

These are the variables you'll reach for in 90% of investigations:

| Variable | Type | What it's for |
|----------|------|---------------|
| `status_code` | int | HTTP response status (200, 404, 500) |
| `method` | string | HTTP method (GET, POST, PUT, DELETE) |
| `path` | string | URL path without query string |
| `dst.pod.namespace` | string | Where traffic is going (namespace) |
| `dst.service.name` | string | Where traffic is going (service) |
| `src.pod.name` | string | Where traffic comes from (pod) |
| `elapsed_time` | int | Request duration in microseconds |
| `dns_questions` | []string | DNS domains being queried |
| `namespaces` | []string | All namespaces involved (src + dst) |
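
**Example** (combining the common variables above; names are illustrative): `status_code >= 500 && dst.service.name == "checkout" && elapsed_time > 1000000`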

## Network-Level Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `src.ip` | string | Source IP address | `"10.0.53.101"` |
| `dst.ip` | string | Destination IP address | `"192.168.1.1"` |
| `src.port` | int | Source port number | `43210` |
| `dst.port` | int | Destination port number | `8080` |
| `protocol` | string | Detected protocol type | `"HTTP"`, `"DNS"` |

## Identity and Metadata Variables

| Variable | Type | Description |
|----------|------|-------------|
| `id` | int | BaseEntry unique identifier (assigned by sniffer) |
| `node_id` | string | Node identifier (assigned by hub) |
| `index` | int | Entry index for stream uniqueness |
| `stream` | string | Stream identifier (hex string) |
| `timestamp` | timestamp | Event time (UTC), use with `timestamp()` function |
| `elapsed_time` | int | Request duration in microseconds |
| `worker` | string | Worker identifier |

## Cross-Reference Variables

| Variable | Type | Description |
|----------|------|-------------|
| `conn_id` | int | L7 to L4 connection cross-reference ID |
| `flow_id` | int | L7 to L4 flow cross-reference ID |
| `has_pcap` | bool | Whether PCAP data is available for this entry |
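
**Example** (illustrative): `http && has_pcap` selects L7 entries whose raw packets are available for PCAP export.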

## Capture Source Variables

| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `capture_source` | string | Canonical capture source | `"unspecified"`, `"af_packet"`, `"ebpf"`, `"ebpf_tls"` |
| `capture_backend` | string | Backend family | `"af_packet"`, `"ebpf"` |
| `capture_source_code` | int | Numeric enum | 0=unspecified, 1=af_packet, 2=ebpf, 3=ebpf_tls |
| `capture` | map | Nested map access | `capture["source"]`, `capture["backend"]` |
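
**Example** (illustrative): `tls && capture_source == "ebpf_tls"` selects traffic captured by the eBPF TLS probe.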

## Protocol Detection Flags

Boolean variables indicating detected protocol. Use as first filter term for performance.

| Variable | Protocol | Variable | Protocol |
|----------|----------|----------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL handshake | `amqp` | AMQP messaging |
| `tcp` | TCP transport | `ldap` | LDAP directory |
| `udp` | UDP transport | `ws` | WebSocket |
| `sctp` | SCTP streaming | `gql` | GraphQL (v1 or v2) |
| `icmp` | ICMP | `gqlv1` | GraphQL v1 only |
| `radius` | RADIUS auth | `gqlv2` | GraphQL v2 only |
| `diameter` | Diameter | `conn` | L4 connection tracking |
| `flow` | L4 flow tracking | `tcp_conn` | TCP connection tracking |
| `tcp_flow` | TCP flow tracking | `udp_conn` | UDP connection tracking |
| `udp_flow` | UDP flow tracking | | |

## HTTP Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `method` | string | HTTP method | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"`, `"PATCH"` |
| `url` | string | Full URL path and query string | `"/api/users?id=123"` |
| `path` | string | URL path component (no query) | `"/api/users"` |
| `status_code` | int | HTTP response status code | `200`, `404`, `500` |
| `http_version` | string | HTTP protocol version | `"HTTP/1.1"`, `"HTTP/2"` |
| `query_string` | map[string]string | Parsed URL query parameters | `query_string["id"]` → `"123"` |
| `request.headers` | map[string]string | Request HTTP headers | `request.headers["content-type"]` |
| `response.headers` | map[string]string | Response HTTP headers | `response.headers["server"]` |
| `request.cookies` | map[string]string | Request cookies | `request.cookies["session"]` |
| `response.cookies` | map[string]string | Response cookies | `response.cookies["token"]` |
| `request_headers_size` | int | Request headers size in bytes | |
| `request_body_size` | int | Request body size in bytes | |
| `response_headers_size` | int | Response headers size in bytes | |
| `response_body_size` | int | Response body size in bytes | |

GraphQL requests have `gql` (or `gqlv1`/`gqlv2`) set to true and all HTTP
variables available.

**Example**: `http && method == "POST" && status_code >= 500 && path.contains("/api")`

## DNS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `dns_questions` | []string | Question domain names (request + response) | `["example.com"]` |
| `dns_answers` | []string | Answer domain names | `["1.2.3.4"]` |
| `dns_question_types` | []string | Record types in questions | `["A"]`, `["AAAA"]`, `["CNAME"]` |
| `dns_request` | bool | Is DNS request message | |
| `dns_response` | bool | Is DNS response message | |
| `dns_request_length` | int | DNS request size in bytes (0 if absent) | |
| `dns_response_length` | int | DNS response size in bytes (0 if absent) | |
| `dns_total_size` | int | Sum of request + response sizes | |

Supported question types: A, AAAA, NS, CNAME, SOA, MX, TXT, SRV, PTR, ANY.

**Example**: `dns && dns_response && status_code != 0` (failed DNS lookups)

## TLS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `tls` | bool | TLS payload detected | |
| `tls_summary` | string | TLS handshake summary | `"ClientHello"`, `"ServerHello"` |
| `tls_info` | string | TLS connection details | `"TLS 1.3, AES-256-GCM"` |
| `tls_request_size` | int | TLS request size in bytes | |
| `tls_response_size` | int | TLS response size in bytes | |
| `tls_total_size` | int | Sum of request + response (computed if not provided) | |
|
||||
|
||||
## TCP Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `tcp` | bool | TCP payload detected |
|
||||
| `tcp_method` | string | TCP method information |
|
||||
| `tcp_payload` | bytes | Raw TCP payload data |
|
||||
| `tcp_error_type` | string | TCP error type (empty if none) |
|
||||
| `tcp_error_message` | string | TCP error message (empty if none) |
|
||||
|
||||
## UDP Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `udp` | bool | UDP payload detected |
|
||||
| `udp_length` | int | UDP packet length |
|
||||
| `udp_checksum` | int | UDP checksum value |
|
||||
| `udp_payload` | bytes | Raw UDP payload data |
|
||||
|
||||
## SCTP Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `sctp` | bool | SCTP payload detected |
|
||||
| `sctp_checksum` | int | SCTP checksum value |
|
||||
| `sctp_chunk_type` | string | SCTP chunk type |
|
||||
| `sctp_length` | int | SCTP chunk length |
|
||||
|
||||
## ICMP Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `icmp` | bool | ICMP payload detected |
|
||||
| `icmp_type` | string | ICMP type code |
|
||||
| `icmp_version` | int | ICMP version (4 or 6) |
|
||||
| `icmp_length` | int | ICMP message length |
|
||||
|
||||
## WebSocket Variables
|
||||
|
||||
| Variable | Type | Description | Values |
|
||||
|----------|------|-------------|--------|
|
||||
| `ws` | bool | WebSocket payload detected | |
|
||||
| `ws_opcode` | string | WebSocket operation code | `"text"`, `"binary"`, `"close"`, `"ping"`, `"pong"` |
|
||||
| `ws_request` | bool | Is WebSocket request | |
|
||||
| `ws_response` | bool | Is WebSocket response | |
|
||||
| `ws_request_payload_data` | string | Request payload (safely truncated) | |
|
||||
| `ws_request_payload_length` | int | Request payload length in bytes | |
|
||||
| `ws_response_payload_length` | int | Response payload length in bytes | |
|
||||
|
||||
## Redis Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `redis` | bool | Redis payload detected | |
|
||||
| `redis_type` | string | Redis command verb | `"GET"`, `"SET"`, `"DEL"`, `"HGET"` |
|
||||
| `redis_command` | string | Full Redis command line | `"GET session:1234"` |
|
||||
| `redis_key` | string | Key (truncated to 64 bytes) | `"session:1234"` |
|
||||
| `redis_request_size` | int | Request size (0 if absent) | |
|
||||
| `redis_response_size` | int | Response size (0 if absent) | |
|
||||
| `redis_total_size` | int | Sum of request + response | |
|
||||
|
||||
**Example**: `redis && redis_type == "GET" && redis_key.startsWith("session:")`
|
||||
|
||||
## Kafka Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `kafka` | bool | Kafka payload detected | |
|
||||
| `kafka_api_key` | int | Kafka API key number | 0=FETCH, 1=PRODUCE |
|
||||
| `kafka_api_key_name` | string | Human-readable API operation | `"PRODUCE"`, `"FETCH"` |
|
||||
| `kafka_client_id` | string | Kafka client identifier | `"payment-processor"` |
|
||||
| `kafka_size` | int | Message size (request preferred, else response) | |
|
||||
| `kafka_request` | bool | Is Kafka request | |
|
||||
| `kafka_response` | bool | Is Kafka response | |
|
||||
| `kafka_request_summary` | string | Request summary/topic | `"orders-topic"` |
|
||||
| `kafka_request_size` | int | Request size (0 if absent) | |
|
||||
| `kafka_response_size` | int | Response size (0 if absent) | |
|
||||
|
||||
**Example**: `kafka && kafka_api_key_name == "PRODUCE" && kafka_request_summary.contains("orders")`
|
||||
|
||||
## AMQP Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `amqp` | bool | AMQP payload detected | |
|
||||
| `amqp_method` | string | AMQP method name | `"basic.publish"`, `"channel.open"` |
|
||||
| `amqp_summary` | string | Operation summary | |
|
||||
| `amqp_request` | bool | Is AMQP request | |
|
||||
| `amqp_response` | bool | Is AMQP response | |
|
||||
| `amqp_request_length` | int | Request length (0 if absent) | |
|
||||
| `amqp_response_length` | int | Response length (0 if absent) | |
|
||||
| `amqp_total_size` | int | Sum of request + response | |
|
||||
|
||||
## LDAP Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `ldap` | bool | LDAP payload detected |
|
||||
| `ldap_type` | string | LDAP operation type (request preferred) |
|
||||
| `ldap_summary` | string | Operation summary |
|
||||
| `ldap_request` | bool | Is LDAP request |
|
||||
| `ldap_response` | bool | Is LDAP response |
|
||||
| `ldap_request_length` | int | Request length (0 if absent) |
|
||||
| `ldap_response_length` | int | Response length (0 if absent) |
|
||||
| `ldap_total_size` | int | Sum of request + response |
|
||||
|
||||
## RADIUS Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `radius` | bool | RADIUS payload detected | |
|
||||
| `radius_code` | int | RADIUS code (request preferred) | |
|
||||
| `radius_code_name` | string | Code name | `"Access-Request"` |
|
||||
| `radius_request` | bool | Is RADIUS request | |
|
||||
| `radius_response` | bool | Is RADIUS response | |
|
||||
| `radius_request_authenticator` | string | Request authenticator (hex) | |
|
||||
| `radius_request_length` | int | Request size (0 if absent) | |
|
||||
| `radius_response_length` | int | Response size (0 if absent) | |
|
||||
| `radius_total_size` | int | Sum of request + response | |
|
||||
|
||||
## Diameter Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `diameter` | bool | Diameter payload detected |
|
||||
| `diameter_method` | string | Method name (request preferred) |
|
||||
| `diameter_summary` | string | Operation summary |
|
||||
| `diameter_request` | bool | Is Diameter request |
|
||||
| `diameter_response` | bool | Is Diameter response |
|
||||
| `diameter_request_length` | int | Request size (0 if absent) |
|
||||
| `diameter_response_length` | int | Response size (0 if absent) |
|
||||
| `diameter_total_size` | int | Sum of request + response |
|
||||
|
||||
## L4 Connection Tracking Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `conn` | bool | Connection tracking entry | |
|
||||
| `conn_state` | string | Connection state | `"open"`, `"in_progress"`, `"closed"` |
|
||||
| `conn_local_pkts` | int | Packets from local peer | |
|
||||
| `conn_local_bytes` | int | Bytes from local peer | |
|
||||
| `conn_remote_pkts` | int | Packets from remote peer | |
|
||||
| `conn_remote_bytes` | int | Bytes from remote peer | |
|
||||
| `conn_l7_detected` | []string | L7 protocols detected on connection | `["HTTP", "TLS"]` |
|
||||
| `conn_group_id` | int | Connection group identifier | |
|
||||
|
||||
**Example**: `conn && conn_state == "open" && conn_local_bytes > 1000000` (high-volume open connections)
|
||||
|
||||
## L4 Flow Tracking Variables
|
||||
|
||||
Flows extend connections with rate metrics (packets/bytes per second).
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `flow` | bool | Flow tracking entry |
|
||||
| `flow_state` | string | Flow state (`"open"`, `"in_progress"`, `"closed"`) |
|
||||
| `flow_local_pkts` | int | Packets from local peer |
|
||||
| `flow_local_bytes` | int | Bytes from local peer |
|
||||
| `flow_remote_pkts` | int | Packets from remote peer |
|
||||
| `flow_remote_bytes` | int | Bytes from remote peer |
|
||||
| `flow_local_pps` | int | Local packets per second |
|
||||
| `flow_local_bps` | int | Local bytes per second |
|
||||
| `flow_remote_pps` | int | Remote packets per second |
|
||||
| `flow_remote_bps` | int | Remote bytes per second |
|
||||
| `flow_l7_detected` | []string | L7 protocols detected on flow |
|
||||
| `flow_group_id` | int | Flow group identifier |
|
||||
|
||||
**Example**: `tcp_flow && flow_local_bps > 5000000` (high-bandwidth TCP flows)
|
||||
|
||||
## Kubernetes Variables
|
||||
|
||||
### Pod and Service (Directional)
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `src.pod.name` | string | Source pod name |
|
||||
| `src.pod.namespace` | string | Source pod namespace |
|
||||
| `dst.pod.name` | string | Destination pod name |
|
||||
| `dst.pod.namespace` | string | Destination pod namespace |
|
||||
| `src.service.name` | string | Source service name |
|
||||
| `src.service.namespace` | string | Source service namespace |
|
||||
| `dst.service.name` | string | Destination service name |
|
||||
| `dst.service.namespace` | string | Destination service namespace |
|
||||
|
||||
**Fallback behavior**: Pod namespace/name fields automatically fall back to
|
||||
service data when pod info is unavailable. This means `dst.pod.namespace` works
|
||||
even when only service-level resolution exists.
|
||||
|
||||
**Example**: `src.service.name == "api-gateway" && dst.pod.namespace == "production"`
|
||||
|
||||
### Aggregate Collections (Non-Directional)
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `namespaces` | []string | All namespaces (src + dst, pod + service) |
|
||||
| `pods` | []string | All pod names (src + dst) |
|
||||
| `services` | []string | All service names (src + dst) |
|
||||
|
||||
### Labels and Annotations
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `local_labels` | map[string]string | Kubernetes labels of local peer |
|
||||
| `local_annotations` | map[string]string | Kubernetes annotations of local peer |
|
||||
| `remote_labels` | map[string]string | Kubernetes labels of remote peer |
|
||||
| `remote_annotations` | map[string]string | Kubernetes annotations of remote peer |
|
||||
|
||||
Use `map_get(local_labels, "key", "default")` for safe access that won't error
|
||||
on missing keys.
|
||||
|
||||
**Example**: `map_get(local_labels, "app", "") == "checkout" && "production" in namespaces`
|
||||
|
||||
### Node Information
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `node` | map | Nested: `node["name"]`, `node["ip"]` |
|
||||
| `node_name` | string | Node name (flat alias) |
|
||||
| `node_ip` | string | Node IP (flat alias) |
|
||||
| `local_node_name` | string | Node name of local peer |
|
||||
| `remote_node_name` | string | Node name of remote peer |
|
||||
|
||||
### Process Information
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `local_process_name` | string | Process name on local peer |
|
||||
| `remote_process_name` | string | Process name on remote peer |
|
||||
|
||||
### DNS Resolution
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `src.dns` | string | DNS resolution of source IP |
|
||||
| `dst.dns` | string | DNS resolution of destination IP |
|
||||
| `dns_resolutions` | []string | All DNS resolutions (deduplicated) |
|
||||
|
||||
### Resolution Status
|
||||
|
||||
| Variable | Type | Values |
|
||||
|----------|------|--------|
|
||||
| `local_resolution_status` | string | `""` (resolved), `"no_node_mapping"`, `"rpc_error"`, `"rpc_empty"`, `"cache_miss"`, `"queue_full"` |
|
||||
| `remote_resolution_status` | string | Same as above |
|
||||
|
||||
## Default Values
|
||||
|
||||
When a variable is not present in an entry, KFL2 uses these defaults:
|
||||
|
||||
| Type | Default |
|
||||
|------|---------|
|
||||
| string | `""` |
|
||||
| int | `0` |
|
||||
| bool | `false` |
|
||||
| list | `[]` |
|
||||
| map | `{}` |
|
||||
| bytes | `[]` |
|
||||
|
||||
## Protocol Variable Precedence
|
||||
|
||||
For protocols with request/response pairs (Kafka, RADIUS, Diameter), merged
|
||||
fields prefer the **request** side. If no request exists, the response value
|
||||
is used. Size totals are always computed as `request_size + response_size`.
|
||||
|
||||
## CEL Language Features
|
||||
|
||||
KFL2 supports the full CEL specification:
|
||||
|
||||
- **Short-circuit evaluation**: `&&` stops on first false, `||` stops on first true
|
||||
- **Ternary**: `condition ? value_if_true : value_if_false`
|
||||
- **Regex**: `str.matches("pattern")` uses RE2 syntax
|
||||
- **Type coercion**: Timestamps require `timestamp()`, durations require `duration()`
|
||||
- **Null safety**: Use `in` operator or `map_get()` before accessing map keys
|
||||
|
||||
For the full CEL specification, see the
|
||||
[CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md).
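
Several of these features compose in a single filter. A sketch (the path
pattern, label value, and thresholds are illustrative, not prescriptive):

```
// Regex on the path, safe label access, and a ternary threshold
http && path.matches("^/api/v[0-9]+/") &&
  map_get(local_labels, "app", "") == "checkout" &&
  response_body_size > (method == "GET" ? 100000 : 10000)
```

The ternary lets one filter apply different size thresholds per method without
writing two separate queries.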
`skills/network-rca/SKILL.md` (new file, 338 lines):

---
name: network-rca
description: >
  Kubernetes network root cause analysis skill powered by Kubeshark MCP. Use this skill
  whenever the user wants to investigate past incidents, perform retrospective traffic
  analysis, take or manage traffic snapshots, extract PCAPs, dissect L7 API calls from
  historical captures, compare traffic patterns over time, detect drift or anomalies
  between snapshots, or do any kind of forensic network analysis in Kubernetes.
  Also trigger when the user mentions snapshots, raw capture, PCAP extraction,
  traffic replay, postmortem analysis, "what happened yesterday/last week",
  root cause analysis, RCA, cloud snapshot storage, snapshot dissection, or KFL filters
  for historical traffic. Even if the user just says "figure out what went wrong"
  or "compare today's traffic to yesterday" in a Kubernetes context, use this skill.
---

# Network Root Cause Analysis with Kubeshark MCP

You are a Kubernetes network forensics specialist. Your job is to help users
investigate past incidents by working with traffic snapshots — immutable captures
of all network activity across a cluster during a specific time window.

Kubeshark is a search engine for network traffic. Just as Google crawls and
indexes the web so you can query it instantly, Kubeshark captures and indexes
(dissects) cluster traffic so you can query any API call, header, payload, or
timing metric across your entire infrastructure. Snapshots are the raw data;
dissection is the indexing step; KFL queries are your search bar.

Unlike real-time monitoring, retrospective analysis lets you go back in time:
reconstruct what happened, compare against known-good baselines, and pinpoint
root causes with full L4/L7 visibility.

## Prerequisites

Before starting any analysis, verify the environment is ready.

### Kubeshark MCP Health Check

Confirm the Kubeshark MCP is accessible and tools are available. Look for tools
like `list_api_calls`, `list_l4_flows`, `create_snapshot`, etc.

**Tool**: `check_kubeshark_status`

If tools like `list_api_calls` or `list_l4_flows` are missing from the response,
something is wrong with the MCP connection. Guide the user through setup
(see Setup Reference at the bottom).

### Raw Capture Must Be Enabled

Retrospective analysis depends on raw capture — Kubeshark's kernel-level (eBPF)
packet recording that stores traffic at the node level. Without it, snapshots
have nothing to work with.

Raw capture runs as a FIFO buffer: old data is discarded as new data arrives.
The buffer size determines how far back you can go. Larger buffer = wider
snapshot window.

```yaml
tap:
  capture:
    raw:
      enabled: true
      storageSize: 10Gi  # Per-node FIFO buffer
```

If raw capture isn't enabled, inform the user that retrospective analysis
requires it and share the configuration above.

### Snapshot Storage

Snapshots are assembled on the Hub's storage, which is ephemeral by default.
For serious forensic work, persistent storage is recommended:

```yaml
tap:
  snapshots:
    local:
      storageClass: gp2
      storageSize: 1000Gi
```

## Core Workflow

Every investigation starts with a snapshot. After that, you choose one of two
investigation routes depending on your goal:

1. **Determine time window** — When did the issue occur? Use `get_data_boundaries`
   to see what raw capture data is available.
2. **Create or locate a snapshot** — Either take a new snapshot covering the
   incident window, or find an existing one with `list_snapshots`.
3. **Choose your investigation route** — PCAP or Dissection (see below).

### Choosing the Right Route

| | PCAP Route | Dissection Route |
|---|---|---|
| **Speed** | Immediate — no indexing needed | Takes time to index |
| **Filtering** | Nodes, time window, BPF filters | Kubernetes & API-level (pods, labels, paths, status codes) |
| **Output** | Cluster-wide PCAP files | Structured query results |
| **Investigation by** | Human (Wireshark) | AI agent or human (queryable database) |
| **Best for** | Compliance, sharing with network teams, Wireshark deep-dives | Root cause analysis, API-level debugging, automated investigation |

Both routes are valid and complementary. Use PCAP when you need raw packets
for human analysis or compliance. Use Dissection when you want an AI agent
to search and analyze traffic programmatically.

## Snapshot Operations

Both routes start here. A snapshot is an immutable freeze of all cluster traffic
in a time window.

### Check Data Boundaries

**Tool**: `get_data_boundaries`

Check what raw capture data exists across the cluster. You can only create
snapshots within these boundaries — data outside the window has been rotated
out of the FIFO buffer.

**Example response**:

```
Cluster-wide:
  Oldest: 2026-03-14 16:12:34 UTC
  Newest: 2026-03-14 18:05:20 UTC

Per node:
┌─────────────────────────────┬──────────┬──────────┐
│ Node                        │ Oldest   │ Newest   │
├─────────────────────────────┼──────────┼──────────┤
│ ip-10-0-25-170.ec2.internal │ 16:12:34 │ 18:03:39 │
│ ip-10-0-32-115.ec2.internal │ 16:13:45 │ 18:05:20 │
└─────────────────────────────┴──────────┴──────────┘
```

If the incident falls outside the available window, the data has been rotated
out. Suggest increasing `storageSize` for future coverage.
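
Deciding whether a snapshot can cover the incident is a simple interval
containment check. A minimal sketch using the timestamps from the example
response above (the incident window itself is hypothetical):

```python
from datetime import datetime

def window_covered(oldest, newest, start, end):
    """True if the incident window [start, end] lies fully within the capture boundaries."""
    return oldest <= start and end <= newest

fmt = "%Y-%m-%d %H:%M:%S"
oldest = datetime.strptime("2026-03-14 16:12:34", fmt)
newest = datetime.strptime("2026-03-14 18:05:20", fmt)

# Incident reported between 17:00 and 17:30 -- inside the boundaries
print(window_covered(oldest, newest,
                     datetime.strptime("2026-03-14 17:00:00", fmt),
                     datetime.strptime("2026-03-14 17:30:00", fmt)))
# True
```

If the check fails on the `oldest` side, the data has already rotated out of
the FIFO buffer and no snapshot can recover it.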
### Create a Snapshot

**Tool**: `create_snapshot`

Specify nodes (or cluster-wide) and a time window within the data boundaries.
Snapshots include raw capture files, Kubernetes pod events, and eBPF cgroup events.

Snapshots take time to build. Check status with `get_snapshot` — wait until
`completed` before proceeding with either route.

### List Existing Snapshots

**Tool**: `list_snapshots`

Shows all snapshots on the local Hub, with name, size, status, and node count.

### Cloud Storage

Snapshots on the Hub are ephemeral. Cloud storage (S3, GCS, Azure Blob)
provides long-term retention. Snapshots can be downloaded to any cluster
with Kubeshark — not necessarily the original one.

**Check cloud status**: `get_cloud_storage_status`
**Upload to cloud**: `upload_snapshot_to_cloud`
**Download from cloud**: `download_snapshot_from_cloud`

---

## Route 1: PCAP

The PCAP route does **not** require dissection. It works directly with the raw
snapshot data to produce filtered, cluster-wide PCAP files. Use this route when:

- You need raw packets for Wireshark analysis
- You're sharing captures with network teams
- You need evidence for compliance or audit
- A human will perform the investigation (not an AI agent)

### Filtering a PCAP

**Tool**: `export_snapshot_pcap`

Filter the snapshot down to what matters using:
- **Nodes** — specific cluster nodes only
- **Time** — sub-window within the snapshot
- **BPF filter** — standard Berkeley Packet Filter syntax (e.g., `host 10.0.53.101`,
  `port 8080`, `net 10.0.0.0/16`)

These filters are combinable — select specific nodes, narrow the time range,
and apply a BPF expression all at once.

### Workload-to-BPF Workflow

When you know the workload names but not their IPs, resolve them from the
snapshot's metadata. Snapshots preserve pod-to-IP mappings from capture time,
so resolution is accurate even if pods have been rescheduled since.

**Tool**: `resolve_workload`

**Example workflow** — extract PCAP for specific workloads:

1. Resolve IPs: `resolve_workload` for `orders-594487879c-7ddxf` → `10.0.53.101`
2. Resolve IPs: `resolve_workload` for `payment-service-6b8f9d-x2k4p` → `10.0.53.205`
3. Build BPF: `host 10.0.53.101 or host 10.0.53.205`
4. Export: `export_snapshot_pcap` with that BPF filter

This gives you a cluster-wide PCAP filtered to exactly the workloads involved
in the incident — ready for Wireshark or long-term storage.
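
Step 3 is mechanical once the IPs are resolved. A minimal sketch of building
the BPF expression (the IPs are the ones from the example workflow):

```python
def build_bpf(ips):
    """Combine resolved pod IPs into one BPF expression for PCAP export."""
    return " or ".join(f"host {ip}" for ip in ips)

# IPs obtained via resolve_workload in steps 1-2
bpf = build_bpf(["10.0.53.101", "10.0.53.205"])
print(bpf)
# host 10.0.53.101 or host 10.0.53.205
```

The same pattern scales to any number of workloads; for whole subnets, swap
`host` clauses for a single `net` clause.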
---

## Route 2: Dissection

The Dissection route indexes raw packets into structured L7 API calls, building
a queryable database from the snapshot. Use this route when:

- An AI agent is performing the investigation
- You need to search by Kubernetes context (pods, namespaces, labels, services)
- You need to search by API elements (paths, status codes, headers, payloads)
- You want structured responses you can analyze programmatically
- You need to drill into the payload of a specific API call

**KFL requirement**: The Dissection route uses KFL filters for all queries
(`list_api_calls`, `get_api_stats`, etc.). Before constructing any KFL filter,
load the KFL skill (`skills/kfl/`). KFL is statically typed — incorrect field
names or syntax will fail silently or error. If the KFL skill is not available,
suggest the user install it:

```bash
ln -s /path/to/kubeshark/skills/kfl ~/.claude/skills/kfl
```

**If the KFL skill cannot be loaded**, only use the exact filter examples shown
in this skill. Do not improvise or guess at field names, operators, or syntax.
KFL field names differ from what you might expect (e.g., `status_code` not
`response.status`, `src.pod.namespace` not `src.namespace`). Using incorrect
fields produces wrong results without warning.

### Activate Dissection

**Tool**: `start_snapshot_dissection`

Dissection takes time proportional to snapshot size — it parses every packet,
reassembles streams, and builds the index. After completion, these tools
become available:
- `list_api_calls` — Search API transactions with KFL filters
- `get_api_call` — Drill into a specific call (headers, body, timing, payload)
- `get_api_stats` — Aggregated statistics (throughput, error rates, latency)

### Investigation Strategy

Start broad, then narrow:

1. `get_api_stats` — Get the overall picture: error rates, latency percentiles,
   throughput. Look for spikes or anomalies.
2. `list_api_calls` filtered by error codes (4xx, 5xx) or high latency — find
   the problematic transactions.
3. `get_api_call` on specific calls — inspect headers, bodies, timing, and
   full payload to understand what went wrong.
4. Use KFL filters to slice by namespace, service, protocol, or any combination.

**Example `list_api_calls` response** (filtered to `http && status_code >= 500`):

```
┌──────────────────────┬────────┬──────────────────────────┬────────┬───────────┐
│ Timestamp            │ Method │ URL                      │ Status │ Elapsed   │
├──────────────────────┼────────┼──────────────────────────┼────────┼───────────┤
│ 2026-03-14 17:23:45  │ POST   │ /api/v1/orders/charge    │ 503    │ 12,340 ms │
│ 2026-03-14 17:23:46  │ POST   │ /api/v1/orders/charge    │ 503    │ 11,890 ms │
│ 2026-03-14 17:23:48  │ GET    │ /api/v1/inventory/check  │ 500    │ 8,210 ms  │
│ 2026-03-14 17:24:01  │ POST   │ /api/v1/payments/process │ 502    │ 30,000 ms │
└──────────────────────┴────────┴──────────────────────────┴────────┴───────────┘
Src: api-gateway (prod) → Dst: payment-service (prod)
```

Use the pattern of repeated failures and high latency to identify the failing
service chain, then drill into individual calls with `get_api_call`.
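
When the response comes back as structured data rather than a rendered table,
the same triage can be scripted. A sketch of ranking endpoints by failure
count (the dict field names are illustrative, not the MCP response schema):

```python
from collections import Counter

# Hypothetical structured form of the example response above
calls = [
    {"path": "/api/v1/orders/charge",    "status": 503, "elapsed_ms": 12340},
    {"path": "/api/v1/orders/charge",    "status": 503, "elapsed_ms": 11890},
    {"path": "/api/v1/inventory/check",  "status": 500, "elapsed_ms": 8210},
    {"path": "/api/v1/payments/process", "status": 502, "elapsed_ms": 30000},
]

# Count server errors per endpoint to decide where to drill in first
failures = Counter(c["path"] for c in calls if c["status"] >= 500)
print(failures.most_common(1))
# [('/api/v1/orders/charge', 2)]
```

The endpoint with the most repeated failures is usually the right place to
start `get_api_call` drilling.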
### KFL Filters for Dissected Traffic

Layer filters progressively when investigating:

```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")
```

Other common RCA filters:

```
dns && dns_response && status_code != 0                     // Failed DNS lookups
src.service.namespace != dst.service.namespace              // Cross-namespace traffic
http && elapsed_time > 5000000                              // Slow transactions (> 5s)
conn && conn_state == "open" && conn_local_bytes > 1000000  // High-volume connections
```

---

## Combining Both Routes

The two routes are complementary. A common pattern:

1. Start with **Dissection** — let the AI agent search and identify the root cause
2. Once you've pinpointed the problematic workloads, use `resolve_workload`
   to get their IPs
3. Switch to **PCAP** — export a filtered PCAP of just those workloads for
   Wireshark deep-dive, sharing with the network team, or compliance archival

## Use Cases

### Post-Incident RCA

1. Identify the incident time window from alerts, logs, or user reports
2. Check `get_data_boundaries` — is the window still in raw capture?
3. `create_snapshot` covering the incident window (add a 15-minute buffer)
4. **Dissection route**: `start_snapshot_dissection` → `get_api_stats` →
   `list_api_calls` → `get_api_call` → follow the dependency chain
5. **PCAP route**: `resolve_workload` → `export_snapshot_pcap` with BPF →
   hand off to Wireshark or archive

### Other Use Cases

- **Trend analysis** — Take snapshots at regular intervals and compare
  `get_api_stats` across them to detect latency drift, error rate changes,
  or new service-to-service connections.
- **Forensic preservation** — `create_snapshot` + `upload_snapshot_to_cloud`
  for immutable, long-term evidence. Downloadable to any cluster months later.
- **Production-to-local replay** — Upload a production snapshot to cloud,
  download it on a local KinD cluster, and investigate safely.

## Setup Reference

For CLI installation, MCP configuration, verification, and troubleshooting,
see `references/setup.md`.
`skills/network-rca/references/setup.md` (new file, 70 lines):

# Kubeshark MCP Setup Reference

## Installing the CLI

**Homebrew (macOS)**:
```bash
brew install kubeshark
```

**Linux**:
```bash
sh <(curl -Ls https://kubeshark.com/install)
```

**From source**:
```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark && make
```

## MCP Configuration

**Claude Desktop / Cowork** (`claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
```

**Claude Code (CLI)**:
```bash
claude mcp add kubeshark -- kubeshark mcp
```

**Without kubectl access** (direct URL mode):
```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp", "--url", "https://kubeshark.example.com"]
    }
  }
}
```

```bash
# Claude Code equivalent:
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```

## Verification

- Claude Code: `/mcp` to check connection status
- Terminal: `kubeshark mcp --list-tools`
- Cluster: `kubectl get pods -l app=kubeshark-hub`

## Troubleshooting

- **Binary not found** → Install via Homebrew or the install script above
- **Connection refused** → Deploy Kubeshark first: `kubeshark tap`
- **No L7 data** → Check `get_dissection_status` and `enable_dissection`
- **Snapshot creation fails** → Verify raw capture is enabled in Kubeshark config
- **Empty snapshot** → Check `get_data_boundaries` — the requested window may
  fall outside available data