mirror of https://github.com/kubeshark/kubeshark.git (synced 2026-03-30 07:17:57 +00:00)

Compare commits — 1 commit: update-rca ... update/rea

| Author | SHA1 | Date |
|---|---|---|
| | 7464087798 | |
@@ -1,33 +0,0 @@
# Kubeshark Claude Code Plugin

This directory contains the [Claude Code plugin](https://docs.anthropic.com/en/docs/claude-code/plugins) configuration for Kubeshark.

## What's here

| File | Purpose |
|------|---------|
| `plugin.json` | Plugin manifest — name, version, description, metadata |
| `marketplace.json` | Marketplace index — allows discovery via `/plugin marketplace add` |

## Installing the plugin

```
/plugin marketplace add kubeshark/kubeshark
/plugin install kubeshark
```

This loads the Kubeshark AI skills and MCP configuration. Skills appear as
`/kubeshark:network-rca` and `/kubeshark:kfl`.

## What the plugin includes

- **Skills** from [`skills/`](../skills/) — network root cause analysis and KFL filter expertise
- **MCP configuration** from [`.mcp.json`](../.mcp.json) — connects to the Kubeshark MCP server

## Local development

Test the plugin without installing:

```bash
claude --plugin-dir /path/to/kubeshark
```
@@ -1,15 +0,0 @@
{
  "name": "kubeshark",
  "description": "Kubeshark network observability skills for Kubernetes",
  "plugins": [
    {
      "name": "kubeshark",
      "description": "Network observability skills powered by Kubeshark MCP — root cause analysis, KFL traffic filtering, snapshot forensics, PCAP extraction.",
      "source": {
        "source": "github",
        "owner": "kubeshark",
        "repo": "kubeshark"
      }
    }
  ]
}
@@ -1,24 +0,0 @@
{
  "name": "kubeshark",
  "version": "1.0.0",
  "description": "Kubernetes network observability skills powered by Kubeshark MCP. Root cause analysis, traffic filtering, snapshot forensics, PCAP extraction, and more.",
  "author": {
    "name": "Kubeshark",
    "url": "https://kubeshark.com"
  },
  "homepage": "https://kubeshark.com",
  "repository": "https://github.com/kubeshark/kubeshark",
  "license": "Apache-2.0",
  "keywords": [
    "kubeshark",
    "kubernetes",
    "network",
    "observability",
    "traffic",
    "mcp",
    "rca",
    "pcap",
    "kfl",
    "ebpf"
  ]
}
@@ -1,8 +0,0 @@
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
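An MCP client consumes a file like this by resolving each server entry to a command line it can launch. A minimal stdlib-only sketch of that resolution (the helper name is hypothetical, and the embedded JSON mirrors the config above):

```python
import json

raw = """
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
"""

def mcp_launch_argv(config_text: str, server: str) -> list:
    """Return the argv an MCP client would exec for the named server entry."""
    cfg = json.loads(config_text)
    entry = cfg["mcpServers"][server]
    # "args" is optional in practice, so default to an empty list
    return [entry["command"], *entry.get("args", [])]

print(mcp_launch_argv(raw, "kubeshark"))  # → ['kubeshark', 'mcp']
```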
51
README.md
@@ -17,13 +17,20 @@

---

Kubeshark indexes cluster-wide network traffic at the kernel level using eBPF — delivering instant answers to any query using network, API, and Kubernetes semantics.
Kubeshark captures cluster-wide network traffic at the speed and scale of Kubernetes, continuously, at the kernel level using eBPF. It consolidates a highly fragmented picture — dozens of nodes, thousands of workloads, millions of connections — into a single, queryable view with full Kubernetes and API context.

**What you can do:**
Network data is available to **AI agents via [MCP](https://docs.kubeshark.com/en/mcp)** and to **human operators via a [dashboard](https://docs.kubeshark.com/en/v2)**.

- **Download Retrospective PCAPs** — cluster-wide packet captures filtered by nodes, time, workloads, and IPs. Store PCAPs for long-term retention and later investigation.
- **Visualize Network Data** — explore traffic matching queries with API, Kubernetes, or network semantics through a real-time dashboard.
- **Integrate with AI** — connect your favorite AI assistant (e.g. Claude, Copilot) to include network data in AI-driven workflows like incident response and root cause analysis.
**Kubeshark captures, processes, and retains cluster-wide network traffic:**

- **PCAP Retention** — continuous raw packet capture with point-in-time snapshots, exportable for Wireshark ([Snapshots →](https://docs.kubeshark.com/en/v2/traffic_snapshots))
- **L7 API Dissection** — real-time request/response matching with full payload parsing: HTTP, gRPC, GraphQL, Redis, Kafka, DNS ([API dissection →](https://docs.kubeshark.com/en/v2/l7_api_dissection))
- **Kubernetes Context** — every packet and API call resolved to pod, service, namespace, and node

**Additional benefits:**

- **Decrypted TLS** — eBPF-based TLS decryption without key management
- **L4 TCP Insights** — retransmissions, RTT, window saturation, connection lifecycle, packet loss across every node-to-node path ([TCP insights →](https://docs.kubeshark.com/en/mcp/tcp_insights))

![Kubeshark overview]()
@@ -34,12 +41,9 @@ Kubeshark indexes cluster-wide network traffic at the kernel level using eBPF

```bash
helm repo add kubeshark https://helm.kubeshark.com
helm install kubeshark kubeshark/kubeshark
kubectl port-forward svc/kubeshark-front 8899:80
```

Open `http://localhost:8899` in your browser. You're capturing traffic.

> For production use, we recommend using an [ingress controller](https://docs.kubeshark.com/en/ingress) instead of port-forward.
Dashboard opens automatically. You're capturing traffic.

**Connect an AI agent** via MCP:
@@ -52,9 +56,9 @@ claude mcp add kubeshark -- kubeshark mcp

---

### Network Data for AI Agents
### AI-Powered Network Analysis

Kubeshark exposes cluster-wide network data via [MCP](https://docs.kubeshark.com/en/mcp) — enabling AI agents to query traffic, investigate API calls, and perform root cause analysis through natural language.
Kubeshark exposes all cluster-wide network data via MCP (Model Context Protocol). AI agents can query L4 metrics, investigate L7 API calls, analyze traffic patterns, and run root cause analysis — through natural language. Use cases include incident response, root cause analysis, troubleshooting, debugging, and reliability workflows.

> *"Why did checkout fail at 2:15 PM?"*
> *"Which services have error rates above 1%?"*
@@ -69,36 +73,39 @@ Works with Claude Code, Cursor, and any MCP-compatible AI.

---

### Network Traffic Indexing
### L7 API Dissection

Kubeshark indexes cluster-wide network traffic by parsing it according to protocol specifications, with support for HTTP, gRPC, Redis, Kafka, DNS, and more. This enables queries using Kubernetes semantics (e.g. pod, namespace, node), API semantics (e.g. path, headers, status), and network semantics (e.g. IP, port). No code instrumentation required.
Cluster-wide request/response matching with full payloads, parsed according to protocol specifications. HTTP, gRPC, Redis, Kafka, DNS, and more. Every API call resolved to source and destination pod, service, namespace, and node. No code instrumentation required.

![L7 API dissection]()

[Learn more →](https://docs.kubeshark.com/en/v2/l7_api_dissection)
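For example, each family of query semantics maps naturally to a KFL expression. The filters below are illustrative sketches only; exact field names should be checked against the Kubeshark KFL documentation:

```
http and response.status == 500
dns and src.namespace == "payments"
dst.ip == "10.0.0.12" and dst.port == "6379"
```

The first filters by API semantics (HTTP status), the second by Kubernetes semantics (source namespace), and the third by network semantics (destination IP and port).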
### Workload Dependency Map
### Cluster-wide PCAP

A visual map of how workloads communicate, showing dependencies, traffic volume, and protocol usage across the cluster.
Generate a cluster-wide PCAP file from any point in time. Filter by time range, specific nodes, and BPF expressions (e.g. `net`, `ip`, `port`, `host`) to capture exactly the traffic you need — across the entire cluster, in a single file. Download and analyze with Wireshark, tshark, or any PCAP-compatible tool — or let your AI agent download and analyze programmatically via MCP.

![Cluster-wide PCAP]()

[Learn more →](https://docs.kubeshark.com/en/v2/service_map)
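A downloaded snapshot is a standard PCAP file, so it can be sanity-checked without Wireshark using only the Python standard library. A minimal sketch; the synthetic header bytes below stand in for a real downloaded file:

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap, microsecond timestamps
PCAP_MAGIC_NSEC = 0xA1B23C4D  # pcap variant with nanosecond timestamps

def read_pcap_header(data: bytes) -> dict:
    """Parse the 24-byte pcap global header and return basic metadata."""
    if len(data) < 24:
        raise ValueError("not a pcap file: header too short")
    magic = struct.unpack("<I", data[:4])[0]
    if magic in (PCAP_MAGIC_USEC, PCAP_MAGIC_NSEC):
        endian = "<"
    else:
        # the file may have been written big-endian
        magic = struct.unpack(">I", data[:4])[0]
        if magic not in (PCAP_MAGIC_USEC, PCAP_MAGIC_NSEC):
            raise ValueError("not a pcap file: bad magic number")
        endian = ">"
    major, minor, _tz, _sig, snaplen, linktype = struct.unpack(endian + "HHiIII", data[4:24])
    return {"version": f"{major}.{minor}", "snaplen": snaplen, "linktype": linktype}

# Synthetic global header standing in for the first 24 bytes of a snapshot
sample = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
print(read_pcap_header(sample))  # → {'version': '2.4', 'snaplen': 65535, 'linktype': 1}
```

Linktype 1 is Ethernet; anything that passes this check should open cleanly in Wireshark or tshark.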
### Traffic Retention & PCAP Export

Capture and retain raw network traffic cluster-wide. Download PCAPs scoped by time range, nodes, workloads, and IPs — ready for Wireshark or any PCAP-compatible tool.
Store snapshots locally or in S3/Azure Blob for long-term retention.

![Traffic retention]()

[Snapshots guide →](https://docs.kubeshark.com/en/v2/traffic_snapshots)
### L4/L7 Workload Map

Cluster-wide view of service communication: dependencies, traffic flow, and anomalies across all nodes and namespaces.

![Workload map]()

[Learn more →](https://docs.kubeshark.com/en/v2/service_map)

---

## Features

| Feature | Description |
|---------|-------------|
| [**Raw Capture**](https://docs.kubeshark.com/en/v2/raw_capture) | Continuous cluster-wide packet capture with minimal overhead |
| [**Traffic Snapshots**](https://docs.kubeshark.com/en/v2/traffic_snapshots) | Point-in-time snapshots, export as PCAP for Wireshark |
| [**L7 API Dissection**](https://docs.kubeshark.com/en/v2/l7_api_dissection) | Request/response matching with full payloads and protocol parsing |
| [**Protocol Support**](https://docs.kubeshark.com/en/protocols) | HTTP, gRPC, GraphQL, Redis, Kafka, DNS, and more |
@@ -40,11 +40,9 @@ type Readiness struct {
}

var ready *Readiness
var proxyOnce sync.Once

func tap() {
	ready = &Readiness{}
	proxyOnce = sync.Once{}
	state.startTime = time.Now()
	log.Info().Str("registry", config.Config.Tap.Docker.Registry).Str("tag", config.Config.Tap.Docker.Tag).Msg("Using Docker:")
@@ -149,21 +147,11 @@ func printNoPodsFoundSuggestion(targetNamespaces []string) {
	log.Warn().Msg(fmt.Sprintf("Did not find any currently running pods that match the regex argument, %s will automatically target matching pods if any are created later%s", misc.Software, suggestionStr))
}

func isPodReady(pod *core.Pod) bool {
	for _, condition := range pod.Status.Conditions {
		if condition.Type == core.PodReady {
			return condition.Status == core.ConditionTrue
		}
	}
	return false
}

func watchHubPod(ctx context.Context, kubernetesProvider *kubernetes.Provider, cancel context.CancelFunc) {
	podExactRegex := regexp.MustCompile(fmt.Sprintf("^%s", kubernetes.HubPodName))
	podWatchHelper := kubernetes.NewPodWatchHelper(kubernetesProvider, podExactRegex)
	eventChan, errorChan := kubernetes.FilteredWatch(ctx, podWatchHelper, []string{config.Config.Tap.Release.Namespace}, podWatchHelper)
	podReady := false
	podRunning := false
	isPodReady := false

	timeAfter := time.After(120 * time.Second)
	for {
@@ -195,30 +183,26 @@ func watchHubPod(ctx context.Context, kubernetesProvider *kubernetes.Provider, c
				Interface("containers-statuses", modifiedPod.Status.ContainerStatuses).
				Msg("Watching pod.")

			if isPodReady(modifiedPod) && !podReady {
				podReady = true
			if modifiedPod.Status.Phase == core.PodRunning && !isPodReady {
				isPodReady = true

				ready.Lock()
				ready.Hub = true
				ready.Unlock()
				log.Info().Str("pod", kubernetes.HubPodName).Msg("Ready.")
			} else if modifiedPod.Status.Phase == core.PodRunning && !podRunning {
				podRunning = true
				log.Info().Str("pod", kubernetes.HubPodName).Msg("Waiting for readiness...")
			}

			ready.Lock()
			proxyDone := ready.Proxy
			hubPodReady := ready.Hub
			frontPodReady := ready.Front
			ready.Unlock()

			if hubPodReady && frontPodReady {
				proxyOnce.Do(func() {
					ready.Lock()
					ready.Proxy = true
					ready.Unlock()
					postFrontStarted(ctx, kubernetesProvider, cancel)
				})
			if !proxyDone && hubPodReady && frontPodReady {
				ready.Lock()
				ready.Proxy = true
				ready.Unlock()
				postFrontStarted(ctx, kubernetesProvider, cancel)
			}
		case kubernetes.EventBookmark:
			break
@@ -239,7 +223,7 @@ func watchHubPod(ctx context.Context, kubernetesProvider *kubernetes.Provider, c
			cancel()

		case <-timeAfter:
			if !podReady {
			if !isPodReady {
				log.Error().
					Str("pod", kubernetes.HubPodName).
					Msg("Pod was not ready in time.")
@@ -258,8 +242,7 @@ func watchFrontPod(ctx context.Context, kubernetesProvider *kubernetes.Provider,
	podExactRegex := regexp.MustCompile(fmt.Sprintf("^%s", kubernetes.FrontPodName))
	podWatchHelper := kubernetes.NewPodWatchHelper(kubernetesProvider, podExactRegex)
	eventChan, errorChan := kubernetes.FilteredWatch(ctx, podWatchHelper, []string{config.Config.Tap.Release.Namespace}, podWatchHelper)
	podReady := false
	podRunning := false
	isPodReady := false

	timeAfter := time.After(120 * time.Second)
	for {
@@ -291,29 +274,25 @@ func watchFrontPod(ctx context.Context, kubernetesProvider *kubernetes.Provider,
				Interface("containers-statuses", modifiedPod.Status.ContainerStatuses).
				Msg("Watching pod.")

			if isPodReady(modifiedPod) && !podReady {
				podReady = true
			if modifiedPod.Status.Phase == core.PodRunning && !isPodReady {
				isPodReady = true
				ready.Lock()
				ready.Front = true
				ready.Unlock()
				log.Info().Str("pod", kubernetes.FrontPodName).Msg("Ready.")
			} else if modifiedPod.Status.Phase == core.PodRunning && !podRunning {
				podRunning = true
				log.Info().Str("pod", kubernetes.FrontPodName).Msg("Waiting for readiness...")
			}

			ready.Lock()
			proxyDone := ready.Proxy
			hubPodReady := ready.Hub
			frontPodReady := ready.Front
			ready.Unlock()

			if hubPodReady && frontPodReady {
				proxyOnce.Do(func() {
					ready.Lock()
					ready.Proxy = true
					ready.Unlock()
					postFrontStarted(ctx, kubernetesProvider, cancel)
				})
			if !proxyDone && hubPodReady && frontPodReady {
				ready.Lock()
				ready.Proxy = true
				ready.Unlock()
				postFrontStarted(ctx, kubernetesProvider, cancel)
			}
		case kubernetes.EventBookmark:
			break
@@ -333,7 +312,7 @@ func watchFrontPod(ctx context.Context, kubernetesProvider *kubernetes.Provider,
				Msg("Failed creating pod.")

		case <-timeAfter:
			if !podReady {
			if !isPodReady {
				log.Error().
					Str("pod", kubernetes.FrontPodName).
					Msg("Pod was not ready in time.")
@@ -450,6 +429,9 @@ func postFrontStarted(ctx context.Context, kubernetesProvider *kubernetes.Provid
		watchScripts(ctx, kubernetesProvider, false)
	}

	if config.Config.Scripting.Console {
		go runConsoleWithoutProxy()
	}
}

func updateConfig(kubernetesProvider *kubernetes.Provider) {
@@ -153,7 +153,6 @@ func CreateDefaultConfig() ConfigStruct {
		},
		Dashboard: configStructs.DashboardConfig{
			CompleteStreamingEnabled: true,
			ClusterWideMapEnabled:    false,
		},
		Capture: configStructs.CaptureConfig{
			Dissection: configStructs.DissectionConfig{
@@ -202,7 +202,6 @@ type RoutingConfig struct {
type DashboardConfig struct {
	StreamingType            string `yaml:"streamingType" json:"streamingType" default:"connect-rpc"`
	CompleteStreamingEnabled bool   `yaml:"completeStreamingEnabled" json:"completeStreamingEnabled" default:"true"`
	ClusterWideMapEnabled    bool   `yaml:"clusterWideMapEnabled" json:"clusterWideMapEnabled" default:"false"`
}

type FrontRoutingConfig struct {
@@ -210,9 +209,9 @@ type FrontRoutingConfig struct {
}

type ReleaseConfig struct {
	Repo          string `yaml:"repo" json:"repo" default:"https://helm.kubeshark.com"`
	Name          string `yaml:"name" json:"name" default:"kubeshark"`
	Namespace     string `yaml:"namespace" json:"namespace" default:"default"`
	Repo          string `yaml:"repo" json:"repo" default:"https://helm.kubeshark.com"`
	Name          string `yaml:"name" json:"name" default:"kubeshark"`
	Namespace     string `yaml:"namespace" json:"namespace" default:"default"`
	HelmChartPath string `yaml:"helmChartPath" json:"helmChartPath" default:""`
}
@@ -412,6 +411,7 @@ type TapConfig struct {
	Gitops                       GitopsConfig `yaml:"gitops" json:"gitops"`
	Sentry                       SentryConfig `yaml:"sentry" json:"sentry"`
	DefaultFilter                string       `yaml:"defaultFilter" json:"defaultFilter" default:""`
	LiveConfigMapChangesDisabled bool         `yaml:"liveConfigMapChangesDisabled" json:"liveConfigMapChangesDisabled" default:"false"`
	GlobalFilter                 string       `yaml:"globalFilter" json:"globalFilter" default:""`
	EnabledDissectors            []string     `yaml:"enabledDissectors" json:"enabledDissectors"`
	PortMapping                  PortMapping  `yaml:"portMapping" json:"portMapping"`
@@ -232,6 +232,7 @@ Example for overriding image names:
| `tap.sentry.enabled` | Enable sending of error logs to Sentry | `false` |
| `tap.sentry.environment` | Sentry environment to label error logs with | `production` |
| `tap.defaultFilter` | Sets the default dashboard KFL filter (e.g. `http`). By default, this value is set to filter out noisy protocols such as DNS, UDP, ICMP and TCP. The user can easily change this, **temporarily**, in the Dashboard. For a permanent change, set this value in the `values.yaml` or `config.yaml` file. | `""` |
| `tap.liveConfigMapChangesDisabled` | If set to `true`, all user functionality involving dynamic ConfigMap changes from the UI (scripting, targeting settings, global & default KFL modification, traffic recording, traffic capturing on/off, protocol dissectors) will be disabled | `false` |
| `tap.globalFilter` | Prepended to any KFL filter; can be used to limit what is visible in the dashboard. For example, `redact("request.headers.Authorization")` will redact the appropriate field. Another example: `!dns` will not show any DNS traffic. | `""` |
| `tap.metrics.port` | Pod port used to expose Prometheus metrics | `49100` |
| `tap.enabledDissectors` | An array of strings representing the list of supported protocols. Remove or comment out redundant protocols (e.g. `dns`). | The default list excludes `udp` and `tcp` |
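Taken together, the filter-related values above can be set in `values.yaml`. An illustrative fragment (the `http` and `!dns` examples come from the table; the dissector list is hypothetical):

```yaml
tap:
  defaultFilter: "http"
  globalFilter: "!dns"
  liveConfigMapChangesDisabled: false
  enabledDissectors:
    - http
    - dns
    - kafka
```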
@@ -95,85 +95,7 @@ helm install kubeshark kubeshark/kubeshark \

### Example: IRSA (recommended for EKS)

[IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) lets EKS pods assume an IAM role without static credentials. EKS injects a short-lived token into the pod automatically.

**Prerequisites:**

1. Your EKS cluster must have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) associated with it.
2. An IAM role with a trust policy that allows the Kubeshark service account to assume it.

**Step 1 — Create an IAM policy scoped to your bucket:**

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObjectVersion",
        "s3:DeleteObjectVersion",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation",
        "s3:GetBucketVersioning"
      ],
      "Resource": [
        "arn:aws:s3:::my-kubeshark-snapshots",
        "arn:aws:s3:::my-kubeshark-snapshots/*"
      ]
    }
  ]
}
```

> For read-only access, remove `s3:PutObject`, `s3:DeleteObject`, and `s3:DeleteObjectVersion`.
**Step 2 — Create an IAM role with IRSA trust policy:**

```bash
# Get your cluster's OIDC provider URL
OIDC_PROVIDER=$(aws eks describe-cluster --name CLUSTER_NAME \
  --query "cluster.identity.oidc.issuer" --output text | sed 's|https://||')

# Create a trust policy
# The default K8s SA name is "<release-name>-service-account" (e.g. "kubeshark-service-account")
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:NAMESPACE:kubeshark-service-account",
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF

# Create the role and attach your policy
aws iam create-role \
  --role-name KubesharkS3Role \
  --assume-role-policy-document file://trust-policy.json

aws iam put-role-policy \
  --role-name KubesharkS3Role \
  --policy-name KubesharkSnapshotsBucketAccess \
  --policy-document file://bucket-policy.json
```
**Step 3 — Create a ConfigMap with bucket configuration:**
Create a ConfigMap with bucket configuration:

```yaml
apiVersion: v1
@@ -185,12 +107,10 @@ data:
  SNAPSHOT_AWS_REGION: us-east-1
```
**Step 4 — Set Helm values with `tap.annotations` to annotate the service account:**
Set Helm values:

```yaml
tap:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/KubesharkS3Role
  snapshots:
    cloud:
      provider: "s3"
@@ -198,17 +118,7 @@ tap:
      - kubeshark-s3-config
```

Or via `--set`:

```bash
helm install kubeshark kubeshark/kubeshark \
  --set tap.snapshots.cloud.provider=s3 \
  --set tap.snapshots.cloud.s3.bucket=my-kubeshark-snapshots \
  --set tap.snapshots.cloud.s3.region=us-east-1 \
  --set tap.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT_ID:role/KubesharkS3Role
```
No `accessKey`/`secretKey` is needed — EKS injects credentials automatically via the IRSA token.
The hub pod's service account must be annotated for IRSA with an IAM role that has S3 access to the bucket.

### Example: Static Credentials
@@ -26,15 +26,15 @@ spec:
        - env:
            - name: REACT_APP_AUTH_ENABLED
              value: '{{- if or (and .Values.cloudLicenseEnabled (not (empty .Values.license))) (not .Values.internetConnectivity) -}}
                {{ (default false .Values.demoModeEnabled) | ternary true ((and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false) }}
                {{ (and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false }}
                {{- else -}}
                {{ .Values.cloudLicenseEnabled | ternary "true" ((default false .Values.demoModeEnabled) | ternary "true" .Values.tap.auth.enabled) }}
                {{ .Values.cloudLicenseEnabled | ternary "true" .Values.tap.auth.enabled }}
                {{- end }}'
            - name: REACT_APP_AUTH_TYPE
              value: '{{- if and .Values.cloudLicenseEnabled (not (eq .Values.tap.auth.type "dex")) -}}
                default
                {{- else -}}
                {{ (default false .Values.demoModeEnabled) | ternary "default" .Values.tap.auth.type }}
                {{ .Values.tap.auth.type }}
                {{- end }}'
            - name: REACT_APP_COMPLETE_STREAMING_ENABLED
              value: '{{- if and (hasKey .Values.tap "dashboard") (hasKey .Values.tap.dashboard "completeStreamingEnabled") -}}
@@ -55,22 +55,30 @@ spec:
                false
                {{- end }}'
            - name: REACT_APP_SCRIPTING_DISABLED
              value: '{{ default false .Values.demoModeEnabled }}'
              value: '{{- if .Values.tap.liveConfigMapChangesDisabled -}}
                {{- if .Values.demoModeEnabled -}}
                {{ .Values.demoModeEnabled | ternary false true }}
                {{- else -}}
                true
                {{- end }}
                {{- else -}}
                false
                {{- end }}'
            - name: REACT_APP_TARGETED_PODS_UPDATE_DISABLED
              value: '{{ default false .Values.demoModeEnabled }}'
              value: '{{ .Values.tap.liveConfigMapChangesDisabled }}'
            - name: REACT_APP_PRESET_FILTERS_CHANGING_ENABLED
              value: '{{ not (default false .Values.demoModeEnabled) }}'
              value: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
            - name: REACT_APP_BPF_OVERRIDE_DISABLED
              value: '{{ eq .Values.tap.packetCapture "af_packet" | ternary "false" "true" }}'
            - name: REACT_APP_RECORDING_DISABLED
              value: '{{ default false .Values.demoModeEnabled }}'
              value: '{{ .Values.tap.liveConfigMapChangesDisabled }}'
            - name: REACT_APP_DISSECTION_ENABLED
              value: '{{ .Values.tap.capture.dissection.enabled | ternary "true" "false" }}'
            - name: REACT_APP_DISSECTION_CONTROL_ENABLED
              value: '{{- if and (not .Values.demoModeEnabled) (not .Values.tap.capture.dissection.enabled) -}}
              value: '{{- if and .Values.tap.liveConfigMapChangesDisabled (not .Values.tap.capture.dissection.enabled) -}}
                true
                {{- else -}}
                {{ (default false .Values.demoModeEnabled) | ternary false true }}
                {{ not .Values.tap.liveConfigMapChangesDisabled | ternary "true" "false" }}
                {{- end -}}'
            - name: 'REACT_APP_CLOUD_LICENSE_ENABLED'
              value: '{{- if or (and .Values.cloudLicenseEnabled (not (empty .Values.license))) (not .Values.internetConnectivity) -}}
@@ -83,13 +91,7 @@ spec:
            - name: REACT_APP_BETA_ENABLED
              value: '{{ default false .Values.betaEnabled | ternary "true" "false" }}'
            - name: REACT_APP_DISSECTORS_UPDATING_ENABLED
              value: '{{ not (default false .Values.demoModeEnabled) }}'
            - name: REACT_APP_SNAPSHOTS_UPDATING_ENABLED
              value: '{{ not (default false .Values.demoModeEnabled) }}'
            - name: REACT_APP_DEMO_MODE_ENABLED
              value: '{{ default false .Values.demoModeEnabled }}'
            - name: REACT_APP_CLUSTER_WIDE_MAP_ENABLED
              value: '{{ default false (((.Values).tap).dashboard).clusterWideMapEnabled }}'
              value: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
            - name: REACT_APP_RAW_CAPTURE_ENABLED
              value: '{{ .Values.tap.capture.raw.enabled | ternary "true" "false" }}'
            - name: REACT_APP_SENTRY_ENABLED
@@ -19,14 +19,14 @@ data:
  INGRESS_HOST: '{{ .Values.tap.ingress.host }}'
  PROXY_FRONT_PORT: '{{ .Values.tap.proxy.front.port }}'
  AUTH_ENABLED: '{{- if and .Values.cloudLicenseEnabled (not (empty .Values.license)) -}}
    {{ (default false .Values.demoModeEnabled) | ternary true ((and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false) }}
    {{ and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex") | ternary true false }}
    {{- else -}}
    {{ .Values.cloudLicenseEnabled | ternary "true" ((default false .Values.demoModeEnabled) | ternary "true" .Values.tap.auth.enabled) }}
    {{ .Values.cloudLicenseEnabled | ternary "true" (.Values.tap.auth.enabled | ternary "true" "") }}
    {{- end }}'
  AUTH_TYPE: '{{- if and .Values.cloudLicenseEnabled (not (eq .Values.tap.auth.type "dex")) -}}
    default
    {{- else -}}
    {{ (default false .Values.demoModeEnabled) | ternary "default" .Values.tap.auth.type }}
    {{ .Values.tap.auth.type }}
    {{- end }}'
  AUTH_SAML_IDP_METADATA_URL: '{{ .Values.tap.auth.saml.idpMetadataUrl }}'
  AUTH_SAML_ROLE_ATTRIBUTE: '{{ .Values.tap.auth.saml.roleAttribute }}'
@@ -44,14 +44,22 @@ data:
    false
    {{- end }}'
  TELEMETRY_DISABLED: '{{ not .Values.internetConnectivity | ternary "true" (not .Values.tap.telemetry.enabled | ternary "true" "false") }}'
  SCRIPTING_DISABLED: '{{ default false .Values.demoModeEnabled }}'
  TARGETED_PODS_UPDATE_DISABLED: '{{ default false .Values.demoModeEnabled }}'
  PRESET_FILTERS_CHANGING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
  RECORDING_DISABLED: '{{ (default false .Values.demoModeEnabled) | ternary true false }}'
  DISSECTION_CONTROL_ENABLED: '{{- if and (not .Values.demoModeEnabled) (not .Values.tap.capture.dissection.enabled) -}}
  SCRIPTING_DISABLED: '{{- if .Values.tap.liveConfigMapChangesDisabled -}}
    {{- if .Values.demoModeEnabled -}}
    {{ .Values.demoModeEnabled | ternary false true }}
    {{- else -}}
    true
    {{- end }}
    {{- else -}}
    false
    {{- end }}'
  TARGETED_PODS_UPDATE_DISABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "true" "" }}'
  PRESET_FILTERS_CHANGING_ENABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
  RECORDING_DISABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "true" "" }}'
  DISSECTION_CONTROL_ENABLED: '{{- if and .Values.tap.liveConfigMapChangesDisabled (not .Values.tap.capture.dissection.enabled) -}}
    true
    {{- else -}}
    {{ (default false .Values.demoModeEnabled) | ternary false true }}
    {{ not .Values.tap.liveConfigMapChangesDisabled | ternary "true" "false" }}
    {{- end }}'
  GLOBAL_FILTER: {{ include "kubeshark.escapeDoubleQuotes" .Values.tap.globalFilter | quote }}
  DEFAULT_FILTER: {{ include "kubeshark.escapeDoubleQuotes" .Values.tap.defaultFilter | quote }}
@@ -68,9 +76,7 @@ data:
  DUPLICATE_TIMEFRAME: '{{ .Values.tap.misc.duplicateTimeframe }}'
  ENABLED_DISSECTORS: '{{ gt (len .Values.tap.enabledDissectors) 0 | ternary (join "," .Values.tap.enabledDissectors) "" }}'
  CUSTOM_MACROS: '{{ toJson .Values.tap.customMacros }}'
  DISSECTORS_UPDATING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
  SNAPSHOTS_UPDATING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
  DEMO_MODE_ENABLED: '{{ default false .Values.demoModeEnabled }}'
  DISSECTORS_UPDATING_ENABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
  DETECT_DUPLICATES: '{{ .Values.tap.misc.detectDuplicates | ternary "true" "false" }}'
  PCAP_DUMP_ENABLE: '{{ .Values.pcapdump.enabled }}'
  PCAP_TIME_INTERVAL: '{{ .Values.pcapdump.timeInterval }}'
@@ -185,7 +185,6 @@ tap:
  dashboard:
    streamingType: connect-rpc
    completeStreamingEnabled: true
    clusterWideMapEnabled: false
  telemetry:
    enabled: true
  resourceGuard:
@@ -198,6 +197,7 @@ tap:
    enabled: false
    environment: production
  defaultFilter: ""
  liveConfigMapChangesDisabled: false
  globalFilter: ""
  enabledDissectors:
    - amqp
@@ -2,18 +2,6 @@

[Kubeshark](https://kubeshark.com) MCP (Model Context Protocol) server enables AI assistants like Claude Desktop, Cursor, and other MCP-compatible clients to query real-time Kubernetes network traffic.

## AI Skills

The MCP provides the tools — [AI skills](../skills/) teach agents how to use them.
Skills turn raw MCP capabilities into domain-specific workflows like root cause
analysis, traffic filtering, and forensic investigation. See the
[skills README](../skills/README.md) for installation and usage.

| Skill | Description |
|-------|-------------|
| [`network-rca`](../skills/network-rca/) | Network Root Cause Analysis — snapshot-based retrospective investigation with PCAP and dissection routes |
| [`kfl`](../skills/kfl/) | KFL2 filter expert — write, debug, and optimize traffic queries across all supported protocols |

## Features

- **L7 API Traffic Analysis**: Query HTTP, gRPC, Redis, Kafka, DNS transactions
@@ -46,20 +34,20 @@ Add to your Claude Desktop configuration:
|
||||
**macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
|
||||
**Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
|
||||
|
||||
#### Default (requires kubectl access / kube context)
|
||||
#### URL Mode (Recommended for existing deployments)
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kubeshark": {
|
||||
"command": "kubeshark",
|
||||
"args": ["mcp"]
|
||||
"args": ["mcp", "--url", "https://kubeshark.example.com"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
With an explicit kubeconfig path:
|
||||
#### Proxy Mode (Requires kubectl access)
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -71,18 +59,14 @@ With an explicit kubeconfig path:
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### URL Mode (no kubectl required)
|
||||
|
||||
Use this when the machine doesn't have kubectl access or a kube context.
|
||||
Connect directly to an existing Kubeshark deployment:
|
||||
or:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kubeshark": {
|
||||
"command": "kubeshark",
|
||||
"args": ["mcp", "--url", "https://kubeshark.example.com"]
|
||||
"args": ["mcp"]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
skills/README.md
@@ -1,120 +0,0 @@
# Kubeshark AI Skills

Open-source AI skills that work with the [Kubeshark MCP](https://github.com/kubeshark/kubeshark).
Skills teach AI agents how to use Kubeshark's MCP tools for specific workflows
like root cause analysis, traffic filtering, and forensic investigation.

Skills use the open [Agent Skills](https://github.com/anthropics/skills) format
and work with Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor, and other
compatible agents.

## Available Skills

| Skill | Description |
|-------|-------------|
| [`network-rca`](network-rca/) | Network Root Cause Analysis. Retrospective traffic analysis via snapshots, with two investigation routes: PCAP (for Wireshark/compliance) and Dissection (for AI-driven API-level investigation). |
| [`kfl`](kfl/) | KFL2 (Kubeshark Filter Language) expert. Complete reference for writing, debugging, and optimizing CEL-based traffic filters across all supported protocols. |

## Prerequisites

All skills require the Kubeshark MCP:

```bash
# Claude Code
claude mcp add kubeshark -- kubeshark mcp

# Without kubectl access (direct URL)
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```

For Claude Desktop, add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
```

## Installation

### Option 1: Plugin (recommended)

Install as a Claude Code plugin directly from GitHub:

```
/plugin marketplace add kubeshark/kubeshark
/plugin install kubeshark
```

Skills appear as `/kubeshark:network-rca` and `/kubeshark:kfl`. The plugin
also bundles the Kubeshark MCP configuration automatically.

### Option 2: Clone and run

```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark
claude
```

Skills trigger automatically based on your conversation.

### Option 3: Manual installation

Clone the repo (if you haven't already), then symlink or copy the skills:

```bash
git clone https://github.com/kubeshark/kubeshark
mkdir -p ~/.claude/skills

# Symlink (with an absolute target) to stay in sync with the repo (recommended)
ln -s "$(pwd)/kubeshark/skills/network-rca" ~/.claude/skills/network-rca
ln -s "$(pwd)/kubeshark/skills/kfl" ~/.claude/skills/kfl

# Or copy to your project (project scope only)
mkdir -p .claude/skills
cp -r kubeshark/skills/network-rca .claude/skills/
cp -r kubeshark/skills/kfl .claude/skills/

# Or copy for personal use (all your projects)
cp -r kubeshark/skills/network-rca ~/.claude/skills/
cp -r kubeshark/skills/kfl ~/.claude/skills/
```

## Contributing

We welcome contributions — whether improving an existing skill or proposing a new one.

- **Suggest improvements**: Open an issue or PR with changes to an existing skill's `SKILL.md`
  or reference docs. Better examples, clearer workflows, and additional filter patterns
  are always appreciated.
- **Add a new skill**: Open an issue describing the use case first. New skills should
  follow the structure below and reference Kubeshark MCP tools by exact name.

### Skill structure

```
skills/
└── <skill-name>/
    ├── SKILL.md       # Required. YAML frontmatter + markdown body.
    └── references/    # Optional. Detailed reference docs.
        └── *.md
```

### Guidelines

- Keep `SKILL.md` under 500 lines. Use `references/` for detailed content.
- Use imperative tone. Reference MCP tools by exact name.
- Include realistic example tool responses.
- The `description` frontmatter should be generous with trigger keywords.

### Planned skills

- `api-security` — OWASP API Top 10 assessment against live or snapshot traffic.
- `incident-response` — 7-phase forensic incident investigation methodology.
- `network-engineering` — Real-time traffic analysis, latency debugging, dependency mapping.
@@ -1,331 +0,0 @@
---
name: kfl
user-invocable: false
description: >
  KFL2 (Kubeshark Filter Language) reference. This skill MUST be loaded before
  writing, constructing, or suggesting any KFL filter expression. KFL is statically
  typed — incorrect field names or syntax will fail silently or error. Do not guess
  at KFL syntax without this skill loaded. Trigger on any mention of KFL, CEL filters,
  traffic filtering, display filters, query syntax, filter expressions, write a filter,
  construct a query, build a KFL, create a filter expression, "how do I filter",
  "show me only", "find traffic where", protocol-specific queries (HTTP status codes,
  DNS lookups, Redis commands, Kafka topics), Kubernetes-aware filtering (by namespace,
  pod, service, label, annotation), L4 connection/flow filters, time-based queries,
  or any request to slice/search/narrow network traffic in Kubeshark. Also trigger
  when other skills need to construct filters — KFL is the query language for all
  Kubeshark traffic analysis.
---

# KFL2 — Kubeshark Filter Language

You are a KFL2 expert. KFL2 is built on Google's CEL (Common Expression Language)
and is the query language for all Kubeshark traffic analysis. It operates as a
**display filter** — it doesn't affect what's captured, only what you see.

Think of KFL the way you think of SQL for databases or Google search syntax for
the web. Kubeshark captures and indexes all cluster traffic; KFL is how you
search it.

For the complete variable and field reference, see `references/kfl2-reference.md`.

## Core Syntax

KFL expressions are boolean CEL expressions. An empty filter matches everything.

### Operators

| Category | Operators |
|----------|-----------|
| Comparison | `==`, `!=`, `<`, `<=`, `>`, `>=` |
| Logical | `&&`, `\|\|`, `!` |
| Arithmetic | `+`, `-`, `*`, `/`, `%` |
| Membership | `in` |
| Ternary | `condition ? true_val : false_val` |
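A couple of combined expressions illustrate the membership and ternary operators. These are illustrative sketches using fields defined later in this document:

```
// Membership over a list literal
http && status_code in [429, 500, 503]

// Ternary inside a boolean filter: stricter latency threshold for one namespace
http && (dst.pod.namespace == "payments" ? elapsed_time > 1000000 : elapsed_time > 5000000)
```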

### String Functions

```
str.contains(substring)    // Substring search
str.startsWith(prefix)     // Prefix match
str.endsWith(suffix)       // Suffix match
str.matches(regex)         // Regex match
size(str)                  // String length
```

### Collection Functions

```
size(collection)             // List/map/string length
key in map                   // Key existence
map[key]                     // Value access
map_get(map, key, default)   // Safe access with default
value in list                // List membership
```

### Time Functions

```
timestamp("2026-03-14T22:00:00Z")   // Parse ISO timestamp
duration("5m")                      // Parse duration
now()                               // Current time (snapshot at filter creation)
```

### Negation

```
!http                                    // Everything that is NOT HTTP
http && status_code != 200               // HTTP responses that aren't 200
http && !path.contains("/health")        // Exclude health checks
!(src.pod.namespace == "kube-system")    // Exclude system namespace
```

## Protocol Detection

Boolean flags that indicate which protocol was detected. Use these as the first
filter term — they're fast and narrow the search space immediately.

| Flag | Protocol | Flag | Protocol |
|------|----------|------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL | `amqp` | AMQP |
| `tcp` | TCP | `ldap` | LDAP |
| `udp` | UDP | `ws` | WebSocket |
| `sctp` | SCTP | `gql` | GraphQL (v1+v2) |
| `icmp` | ICMP | `gqlv1` / `gqlv2` | GraphQL version-specific |
| `radius` | RADIUS | `conn` / `flow` | L4 connection/flow tracking |
| `diameter` | Diameter | `tcp_conn` / `udp_conn` | Transport-specific connections |

## Kubernetes Context

The most common starting point. Filter by where traffic originates or terminates.

### Pod and Service Fields

```
src.pod.name == "orders-594487879c-7ddxf"
dst.pod.namespace == "production"
src.service.name == "api-gateway"
dst.service.namespace == "payments"
```

Pod fields fall back to service data when pod info is unavailable, so
`dst.pod.namespace` works even for service-level entries.

### Aggregate Collections

Match against any direction (src or dst):

```
"production" in namespaces     // Any namespace match
"orders" in pods               // Any pod name match
"api-gateway" in services      // Any service name match
```

### Labels and Annotations

```
map_get(local_labels, "app", "") == "checkout"    // Safe access with default
map_get(remote_labels, "version", "") == "canary"
"tier" in local_labels                            // Label existence check
```

Always use `map_get()` for labels and annotations — direct access like
`local_labels["app"]` errors if the key doesn't exist.

### Node and Process

```
node_name == "ip-10-0-25-170.ec2.internal"
local_process_name == "nginx"
remote_process_name.contains("postgres")
```

### DNS Resolution

```
src.dns == "api.example.com"
dst.dns.contains("redis")
```

## HTTP Filtering

HTTP is the most common protocol for API-level investigation.

### Fields

| Field | Type | Example |
|-------|------|---------|
| `method` | string | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"` |
| `url` | string | Full path + query: `"/api/users?id=123"` |
| `path` | string | Path only: `"/api/users"` |
| `status_code` | int | `200`, `404`, `500` |
| `http_version` | string | `"HTTP/1.1"`, `"HTTP/2"` |
| `request.headers` | map | `request.headers["content-type"]` |
| `response.headers` | map | `response.headers["server"]` |
| `request.cookies` | map | `request.cookies["session"]` |
| `response.cookies` | map | `response.cookies["token"]` |
| `query_string` | map | `query_string["id"]` |
| `request_body_size` | int | Request body bytes |
| `response_body_size` | int | Response body bytes |
| `elapsed_time` | int | Duration in **microseconds** |

### Common Patterns

```
// Error investigation
http && status_code >= 500                         // Server errors
http && status_code == 429                         // Rate limiting
http && status_code >= 400 && status_code < 500    // Client errors

// Endpoint targeting
http && method == "POST" && path.contains("/orders")
http && url.matches(".*/api/v[0-9]+/users.*")

// Performance
http && elapsed_time > 5000000           // > 5 seconds
http && response_body_size > 1000000     // > 1MB responses

// Header inspection
http && "authorization" in request.headers
http && request.headers["content-type"] == "application/json"

// GraphQL (subset of HTTP)
gql && method == "POST" && status_code >= 400
```

## DNS Filtering

DNS issues are often the hidden root cause of outages.

| Field | Type | Description |
|-------|------|-------------|
| `dns_questions` | []string | Question domain names |
| `dns_answers` | []string | Answer domain names |
| `dns_question_types` | []string | Record types: A, AAAA, CNAME, MX, TXT, SRV, PTR |
| `dns_request` | bool | Is request |
| `dns_response` | bool | Is response |
| `dns_request_length` | int | Request size |
| `dns_response_length` | int | Response size |

```
dns && "api.external-service.com" in dns_questions
dns && dns_response && status_code != 0    // Failed lookups
dns && "A" in dns_question_types           // A record queries
dns && size(dns_questions) > 1             // Multi-question
```

## Database and Messaging Protocols

### Redis

```
redis && redis_type == "GET"                 // Command type
redis && redis_key.startsWith("session:")    // Key pattern
redis && redis_command.contains("DEL")       // Command search
redis && redis_total_size > 10000            // Large operations
```

### Kafka

```
kafka && kafka_api_key_name == "PRODUCE"             // Produce operations
kafka && kafka_client_id == "payment-processor"      // Client filtering
kafka && kafka_request_summary.contains("orders")    // Topic filtering
kafka && kafka_size > 10000                          // Large messages
```

### AMQP, LDAP, RADIUS, Diameter

```
amqp && amqp_method == "basic.publish"            // AMQP publish
ldap && ldap_type == "bind"                       // LDAP bind requests
radius && radius_code_name == "Access-Request"    // RADIUS auth
diameter && diameter_method.contains("Credit")    // Diameter credit control
```

For the full variable list for these protocols, see `references/kfl2-reference.md`.

## Transport Layer (L4)

### TCP/UDP Fields

```
tcp && tcp_error_type != ""    // TCP errors
udp && udp_length > 1000       // Large UDP packets
```

### Connection Tracking

```
conn && conn_state == "open"          // Active connections
conn && conn_local_bytes > 1000000    // High-volume
conn && "HTTP" in conn_l7_detected    // L7 protocol detection
tcp_conn && conn_state == "closed"    // Closed TCP connections
```

### Flow Tracking (with Rate Metrics)

```
flow && flow_local_pps > 1000       // High packet rate
flow && flow_local_bps > 1000000    // High bandwidth
flow && flow_state == "closed" && "TLS" in flow_l7_detected
tcp_flow && flow_local_bps > 5000000    // High-throughput TCP
```

## Network Layer

```
src.ip == "10.0.53.101"
dst.ip.startsWith("192.168.")
src.port == 8080
dst.port >= 8000 && dst.port <= 9000
```

## Time-Based Filtering

```
timestamp > timestamp("2026-03-14T22:00:00Z")
timestamp >= timestamp("2026-03-14T22:00:00Z") && timestamp <= timestamp("2026-03-14T23:00:00Z")
timestamp > now() - duration("5m")    // Last 5 minutes
elapsed_time > 2000000                // Older than 2 seconds
```

## Building Filters: Progressive Narrowing

The most effective investigation technique — start broad, add constraints:

```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")

// Step 5: Add timing
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") && elapsed_time > 2000000
```

## Performance Tips

1. **Protocol flags first** — `http && ...` is faster than `... && http`
2. **`startsWith`/`endsWith` over `contains`** — prefix/suffix checks are faster
3. **Specific ports before string ops** — `dst.port == 80` is cheaper than `url.contains(...)`
4. **Use `map_get` for labels** — avoids errors on missing keys
5. **Keep filters simple** — CEL short-circuits on `&&`, so put cheap checks first
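Applying these tips, the same predicate can be reordered so the cheap checks run first. Both forms match identical traffic; only the evaluation cost differs:

```
// Slower: regex scan evaluated before the protocol flag and port check
url.matches(".*/api/v[0-9]+/.*") && dst.port == 8080 && http

// Faster: protocol flag and port comparison short-circuit most entries
http && dst.port == 8080 && url.matches(".*/api/v[0-9]+/.*")
```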

## Type Safety

KFL2 is statically typed. Common gotchas:

- `status_code` is `int`, not string — use `status_code == 200`, not `"200"`
- `elapsed_time` is in **microseconds** — 5 seconds = `5000000`
- `timestamp` requires `timestamp()` function — not a raw string
- Map access on missing keys errors — use `key in map` or `map_get()` first
- List membership uses `value in list` — not `list.contains(value)`
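Side by side, a failing and a passing form for the two most common gotchas:

```
http && status_code == "500"                      // WRONG: int field compared to string
http && status_code == 500                        // correct

local_labels["app"] == "checkout"                 // WRONG: errors when "app" is missing
map_get(local_labels, "app", "") == "checkout"    // correct
```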
@@ -1,407 +0,0 @@
# KFL2 Complete Variable and Field Reference

This is the exhaustive reference for every variable available in KFL2 filters.
KFL2 is built on Google's CEL (Common Expression Language) and evaluates against
Kubeshark's protobuf-based `BaseEntry` structure.

## Most Commonly Used Variables

These are the variables you'll reach for in 90% of investigations:

| Variable | Type | What it's for |
|----------|------|---------------|
| `status_code` | int | HTTP response status (200, 404, 500) |
| `method` | string | HTTP method (GET, POST, PUT, DELETE) |
| `path` | string | URL path without query string |
| `dst.pod.namespace` | string | Where traffic is going (namespace) |
| `dst.service.name` | string | Where traffic is going (service) |
| `src.pod.name` | string | Where traffic comes from (pod) |
| `elapsed_time` | int | Request duration in microseconds |
| `dns_questions` | []string | DNS domains being queried |
| `namespaces` | []string | All namespaces involved (src + dst) |

## Network-Level Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `src.ip` | string | Source IP address | `"10.0.53.101"` |
| `dst.ip` | string | Destination IP address | `"192.168.1.1"` |
| `src.port` | int | Source port number | `43210` |
| `dst.port` | int | Destination port number | `8080` |
| `protocol` | string | Detected protocol type | `"HTTP"`, `"DNS"` |

## Identity and Metadata Variables

| Variable | Type | Description |
|----------|------|-------------|
| `id` | int | BaseEntry unique identifier (assigned by sniffer) |
| `node_id` | string | Node identifier (assigned by hub) |
| `index` | int | Entry index for stream uniqueness |
| `stream` | string | Stream identifier (hex string) |
| `timestamp` | timestamp | Event time (UTC), use with `timestamp()` function |
| `elapsed_time` | int | Age since timestamp in microseconds |
| `worker` | string | Worker identifier |

## Cross-Reference Variables

| Variable | Type | Description |
|----------|------|-------------|
| `conn_id` | int | L7 to L4 connection cross-reference ID |
| `flow_id` | int | L7 to L4 flow cross-reference ID |
| `has_pcap` | bool | Whether PCAP data is available for this entry |

## Capture Source Variables

| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `capture_source` | string | Canonical capture source | `"unspecified"`, `"af_packet"`, `"ebpf"`, `"ebpf_tls"` |
| `capture_backend` | string | Backend family | `"af_packet"`, `"ebpf"` |
| `capture_source_code` | int | Numeric enum | 0=unspecified, 1=af_packet, 2=ebpf, 3=ebpf_tls |
| `capture` | map | Nested map access | `capture["source"]`, `capture["backend"]` |
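For example, either the canonical string or the nested map form isolates eBPF-captured traffic. This is an illustrative pairing of values taken from the table above:

```
capture_source == "ebpf_tls"
capture["backend"] == "ebpf"
```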
## Protocol Detection Flags

Boolean variables indicating detected protocol. Use as first filter term for performance.

| Variable | Protocol | Variable | Protocol |
|----------|----------|----------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL handshake | `amqp` | AMQP messaging |
| `tcp` | TCP transport | `ldap` | LDAP directory |
| `udp` | UDP transport | `ws` | WebSocket |
| `sctp` | SCTP streaming | `gql` | GraphQL (v1 or v2) |
| `icmp` | ICMP | `gqlv1` | GraphQL v1 only |
| `radius` | RADIUS auth | `gqlv2` | GraphQL v2 only |
| `diameter` | Diameter | `conn` | L4 connection tracking |
| `flow` | L4 flow tracking | `tcp_conn` | TCP connection tracking |
| `tcp_flow` | TCP flow tracking | `udp_conn` | UDP connection tracking |
| `udp_flow` | UDP flow tracking | | |

## HTTP Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `method` | string | HTTP method | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"`, `"PATCH"` |
| `url` | string | Full URL path and query string | `"/api/users?id=123"` |
| `path` | string | URL path component (no query) | `"/api/users"` |
| `status_code` | int | HTTP response status code | `200`, `404`, `500` |
| `http_version` | string | HTTP protocol version | `"HTTP/1.1"`, `"HTTP/2"` |
| `query_string` | map[string]string | Parsed URL query parameters | `query_string["id"]` → `"123"` |
| `request.headers` | map[string]string | Request HTTP headers | `request.headers["content-type"]` |
| `response.headers` | map[string]string | Response HTTP headers | `response.headers["server"]` |
| `request.cookies` | map[string]string | Request cookies | `request.cookies["session"]` |
| `response.cookies` | map[string]string | Response cookies | `response.cookies["token"]` |
| `request_headers_size` | int | Request headers size in bytes | |
| `request_body_size` | int | Request body size in bytes | |
| `response_headers_size` | int | Response headers size in bytes | |
| `response_body_size` | int | Response body size in bytes | |

GraphQL requests have `gql` (or `gqlv1`/`gqlv2`) set to true and all HTTP
variables available.

**Example**: `http && method == "POST" && status_code >= 500 && path.contains("/api")`

## DNS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `dns_questions` | []string | Question domain names (request + response) | `["example.com"]` |
| `dns_answers` | []string | Answer domain names | `["1.2.3.4"]` |
| `dns_question_types` | []string | Record types in questions | `["A"]`, `["AAAA"]`, `["CNAME"]` |
| `dns_request` | bool | Is DNS request message | |
| `dns_response` | bool | Is DNS response message | |
| `dns_request_length` | int | DNS request size in bytes (0 if absent) | |
| `dns_response_length` | int | DNS response size in bytes (0 if absent) | |
| `dns_total_size` | int | Sum of request + response sizes | |

Supported question types: A, AAAA, NS, CNAME, SOA, MX, TXT, SRV, PTR, ANY.

**Example**: `dns && dns_response && status_code != 0` (failed DNS lookups)

## TLS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `tls` | bool | TLS payload detected | |
| `tls_summary` | string | TLS handshake summary | `"ClientHello"`, `"ServerHello"` |
| `tls_info` | string | TLS connection details | `"TLS 1.3, AES-256-GCM"` |
| `tls_request_size` | int | TLS request size in bytes | |
| `tls_response_size` | int | TLS response size in bytes | |
| `tls_total_size` | int | Sum of request + response (computed if not provided) | |
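An illustrative filter built only from the fields above (unusually large handshakes):

```
tls && tls_summary == "ClientHello" && tls_total_size > 4096
```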
## TCP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `tcp` | bool | TCP payload detected |
| `tcp_method` | string | TCP method information |
| `tcp_payload` | bytes | Raw TCP payload data |
| `tcp_error_type` | string | TCP error type (empty if none) |
| `tcp_error_message` | string | TCP error message (empty if none) |

## UDP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `udp` | bool | UDP payload detected |
| `udp_length` | int | UDP packet length |
| `udp_checksum` | int | UDP checksum value |
| `udp_payload` | bytes | Raw UDP payload data |

## SCTP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `sctp` | bool | SCTP payload detected |
| `sctp_checksum` | int | SCTP checksum value |
| `sctp_chunk_type` | string | SCTP chunk type |
| `sctp_length` | int | SCTP chunk length |

## ICMP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `icmp` | bool | ICMP payload detected |
| `icmp_type` | string | ICMP type code |
| `icmp_version` | int | ICMP version (4 or 6) |
| `icmp_length` | int | ICMP message length |

## WebSocket Variables

| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `ws` | bool | WebSocket payload detected | |
| `ws_opcode` | string | WebSocket operation code | `"text"`, `"binary"`, `"close"`, `"ping"`, `"pong"` |
| `ws_request` | bool | Is WebSocket request | |
| `ws_response` | bool | Is WebSocket response | |
| `ws_request_payload_data` | string | Request payload (safely truncated) | |
| `ws_request_payload_length` | int | Request payload length in bytes | |
| `ws_response_payload_length` | int | Response payload length in bytes | |
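An illustrative filter using the fields above (large text frames):

```
ws && ws_opcode == "text" && ws_request_payload_length > 10000
```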
## Redis Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `redis` | bool | Redis payload detected | |
|
||||
| `redis_type` | string | Redis command verb | `"GET"`, `"SET"`, `"DEL"`, `"HGET"` |
|
||||
| `redis_command` | string | Full Redis command line | `"GET session:1234"` |
|
||||
| `redis_key` | string | Key (truncated to 64 bytes) | `"session:1234"` |
|
||||
| `redis_request_size` | int | Request size (0 if absent) | |
|
||||
| `redis_response_size` | int | Response size (0 if absent) | |
|
||||
| `redis_total_size` | int | Sum of request + response | |
|
||||
|
||||
**Example**: `redis && redis_type == "GET" && redis_key.startsWith("session:")`
|
||||
|
||||
## Kafka Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `kafka` | bool | Kafka payload detected | |
|
||||
| `kafka_api_key` | int | Kafka API key number | 0=FETCH, 1=PRODUCE |
|
||||
| `kafka_api_key_name` | string | Human-readable API operation | `"PRODUCE"`, `"FETCH"` |
|
||||
| `kafka_client_id` | string | Kafka client identifier | `"payment-processor"` |
|
||||
| `kafka_size` | int | Message size (request preferred, else response) | |
|
||||
| `kafka_request` | bool | Is Kafka request | |
|
||||
| `kafka_response` | bool | Is Kafka response | |
|
||||
| `kafka_request_summary` | string | Request summary/topic | `"orders-topic"` |
|
||||
| `kafka_request_size` | int | Request size (0 if absent) | |
|
||||
| `kafka_response_size` | int | Response size (0 if absent) | |
|
||||
|
||||
**Example**: `kafka && kafka_api_key_name == "PRODUCE" && kafka_request_summary.contains("orders")`
|
||||
|
||||
## AMQP Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `amqp` | bool | AMQP payload detected | |
|
||||
| `amqp_method` | string | AMQP method name | `"basic.publish"`, `"channel.open"` |
|
||||
| `amqp_summary` | string | Operation summary | |
|
||||
| `amqp_request` | bool | Is AMQP request | |
|
||||
| `amqp_response` | bool | Is AMQP response | |
|
||||
| `amqp_request_length` | int | Request length (0 if absent) | |
|
||||
| `amqp_response_length` | int | Response length (0 if absent) | |
|
||||
| `amqp_total_size` | int | Sum of request + response | |
|
||||
|
||||
## LDAP Variables
|
||||
|
||||
| Variable | Type | Description |
|
||||
|----------|------|-------------|
|
||||
| `ldap` | bool | LDAP payload detected |
|
||||
| `ldap_type` | string | LDAP operation type (request preferred) |
|
||||
| `ldap_summary` | string | Operation summary |
|
||||
| `ldap_request` | bool | Is LDAP request |
|
||||
| `ldap_response` | bool | Is LDAP response |
|
||||
| `ldap_request_length` | int | Request length (0 if absent) |
|
||||
| `ldap_response_length` | int | Response length (0 if absent) |
|
||||
| `ldap_total_size` | int | Sum of request + response |
|
||||
|
||||
## RADIUS Variables
|
||||
|
||||
| Variable | Type | Description | Example |
|
||||
|----------|------|-------------|---------|
|
||||
| `radius` | bool | RADIUS payload detected | |
|
||||
| `radius_code` | int | RADIUS code (request preferred) | |
|
||||
| `radius_code_name` | string | Code name | `"Access-Request"` |
|
||||
| `radius_request` | bool | Is RADIUS request | |
|
||||
| `radius_response` | bool | Is RADIUS response | |
|
||||
| `radius_request_authenticator` | string | Request authenticator (hex) | |
|
||||
| `radius_request_length` | int | Request size (0 if absent) | |
|
||||
| `radius_response_length` | int | Response size (0 if absent) | |
|
||||
| `radius_total_size` | int | Sum of request + response | |
|
||||
|
||||
## Diameter Variables

| Variable | Type | Description |
|----------|------|-------------|
| `diameter` | bool | Diameter payload detected |
| `diameter_method` | string | Method name (request preferred) |
| `diameter_summary` | string | Operation summary |
| `diameter_request` | bool | Is Diameter request |
| `diameter_response` | bool | Is Diameter response |
| `diameter_request_length` | int | Request size (0 if absent) |
| `diameter_response_length` | int | Response size (0 if absent) |
| `diameter_total_size` | int | Sum of request + response |

## L4 Connection Tracking Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `conn` | bool | Connection tracking entry | |
| `conn_state` | string | Connection state | `"open"`, `"in_progress"`, `"closed"` |
| `conn_local_pkts` | int | Packets from local peer | |
| `conn_local_bytes` | int | Bytes from local peer | |
| `conn_remote_pkts` | int | Packets from remote peer | |
| `conn_remote_bytes` | int | Bytes from remote peer | |
| `conn_l7_detected` | []string | L7 protocols detected on connection | `["HTTP", "TLS"]` |
| `conn_group_id` | int | Connection group identifier | |

**Example**: `conn && conn_state == "open" && conn_local_bytes > 1000000` (high-volume open connections)

## L4 Flow Tracking Variables

Flows extend connections with rate metrics (packets/bytes per second).

| Variable | Type | Description |
|----------|------|-------------|
| `flow` | bool | Flow tracking entry |
| `flow_state` | string | Flow state (`"open"`, `"in_progress"`, `"closed"`) |
| `flow_local_pkts` | int | Packets from local peer |
| `flow_local_bytes` | int | Bytes from local peer |
| `flow_remote_pkts` | int | Packets from remote peer |
| `flow_remote_bytes` | int | Bytes from remote peer |
| `flow_local_pps` | int | Local packets per second |
| `flow_local_bps` | int | Local bytes per second |
| `flow_remote_pps` | int | Remote packets per second |
| `flow_remote_bps` | int | Remote bytes per second |
| `flow_l7_detected` | []string | L7 protocols detected on flow |
| `flow_group_id` | int | Flow group identifier |

**Example**: `tcp_flow && flow_local_bps > 5000000` (high-bandwidth TCP flows)

## Kubernetes Variables

### Pod and Service (Directional)

| Variable | Type | Description |
|----------|------|-------------|
| `src.pod.name` | string | Source pod name |
| `src.pod.namespace` | string | Source pod namespace |
| `dst.pod.name` | string | Destination pod name |
| `dst.pod.namespace` | string | Destination pod namespace |
| `src.service.name` | string | Source service name |
| `src.service.namespace` | string | Source service namespace |
| `dst.service.name` | string | Destination service name |
| `dst.service.namespace` | string | Destination service namespace |

**Fallback behavior**: Pod namespace/name fields automatically fall back to
service data when pod info is unavailable. This means `dst.pod.namespace` works
even when only service-level resolution exists.

**Example**: `src.service.name == "api-gateway" && dst.pod.namespace == "production"`

### Aggregate Collections (Non-Directional)

| Variable | Type | Description |
|----------|------|-------------|
| `namespaces` | []string | All namespaces (src + dst, pod + service) |
| `pods` | []string | All pod names (src + dst) |
| `services` | []string | All service names (src + dst) |

### Labels and Annotations

| Variable | Type | Description |
|----------|------|-------------|
| `local_labels` | map[string]string | Kubernetes labels of local peer |
| `local_annotations` | map[string]string | Kubernetes annotations of local peer |
| `remote_labels` | map[string]string | Kubernetes labels of remote peer |
| `remote_annotations` | map[string]string | Kubernetes annotations of remote peer |

Use `map_get(local_labels, "key", "default")` for safe access that won't error
on missing keys.

**Example**: `map_get(local_labels, "app", "") == "checkout" && "production" in namespaces`

### Node Information

| Variable | Type | Description |
|----------|------|-------------|
| `node` | map | Nested: `node["name"]`, `node["ip"]` |
| `node_name` | string | Node name (flat alias) |
| `node_ip` | string | Node IP (flat alias) |
| `local_node_name` | string | Node name of local peer |
| `remote_node_name` | string | Node name of remote peer |

### Process Information

| Variable | Type | Description |
|----------|------|-------------|
| `local_process_name` | string | Process name on local peer |
| `remote_process_name` | string | Process name on remote peer |

### DNS Resolution

| Variable | Type | Description |
|----------|------|-------------|
| `src.dns` | string | DNS resolution of source IP |
| `dst.dns` | string | DNS resolution of destination IP |
| `dns_resolutions` | []string | All DNS resolutions (deduplicated) |

### Resolution Status

| Variable | Type | Values |
|----------|------|--------|
| `local_resolution_status` | string | `""` (resolved), `"no_node_mapping"`, `"rpc_error"`, `"rpc_empty"`, `"cache_miss"`, `"queue_full"` |
| `remote_resolution_status` | string | Same as above |

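For instance, a filter that surfaces every entry whose remote peer could not be resolved to a Kubernetes identity (a sketch using only the variables above):

```
remote_resolution_status != ""
```
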
## Default Values

When a variable is not present in an entry, KFL2 uses these defaults:

| Type | Default |
|------|---------|
| string | `""` |
| int | `0` |
| bool | `false` |
| list | `[]` |
| map | `{}` |
| bytes | `[]` |

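These defaults make absence checks simple: comparing a field against its zero value matches entries where the field was never set. For example, a sketch matching entries whose source was never resolved to a service (the missing string falls back to `""`):

```
src.service.name == ""
```
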
## Protocol Variable Precedence

For protocols with request/response pairs (Kafka, RADIUS, Diameter), merged
fields prefer the **request** side. If no request exists, the response value
is used. Size totals are always computed as `request_size + response_size`.

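As a consequence of the size rule, the following invariant holds for every RADIUS entry (the Kafka and Diameter counterparts behave the same way):

```
radius && radius_total_size == radius_request_length + radius_response_length
```
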
## CEL Language Features

KFL2 supports the full CEL specification:

- **Short-circuit evaluation**: `&&` stops on first false, `||` stops on first true
- **Ternary**: `condition ? value_if_true : value_if_false`
- **Regex**: `str.matches("pattern")` uses RE2 syntax
- **Type coercion**: Timestamps require `timestamp()`, durations require `duration()`
- **Null safety**: Use `in` operator or `map_get()` before accessing map keys

For the full CEL specification, see the
[CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md).
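A sketch combining several of these features in one filter; the fields come from the tables above, and the thresholds are purely illustrative:

```
path.matches("^/api/v1/.*")
  && (status_code >= 500 ? true : elapsed_time > 5000000)
  && map_get(local_labels, "app", "") != ""
```
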
---
name: network-rca
description: >
  Kubernetes network root cause analysis skill powered by Kubeshark MCP. Use this skill
  whenever the user wants to investigate past incidents, perform retrospective traffic
  analysis, take or manage traffic snapshots, extract PCAPs, dissect L7 API calls from
  historical captures, compare traffic patterns over time, detect drift or anomalies
  between snapshots, or do any kind of forensic network analysis in Kubernetes.
  Also trigger when the user mentions snapshots, raw capture, PCAP extraction,
  traffic replay, postmortem analysis, "what happened yesterday/last week",
  root cause analysis, RCA, cloud snapshot storage, snapshot dissection, or KFL filters
  for historical traffic. Even if the user just says "figure out what went wrong"
  or "compare today's traffic to yesterday" in a Kubernetes context, use this skill.
---
# Network Root Cause Analysis with Kubeshark MCP

You are a Kubernetes network forensics specialist. Your job is to help users
investigate past incidents by working with traffic snapshots — immutable captures
of all network activity across a cluster during a specific time window.

Kubeshark is a search engine for network traffic. Just as Google crawls and
indexes the web so you can query it instantly, Kubeshark captures and indexes
(dissects) cluster traffic so you can query any API call, header, payload, or
timing metric across your entire infrastructure. Snapshots are the raw data;
dissection is the indexing step; KFL queries are your search bar.

Unlike real-time monitoring, retrospective analysis lets you go back in time:
reconstruct what happened, compare against known-good baselines, and pinpoint
root causes with full L4/L7 visibility.

## Timezone Handling

All timestamps presented to the user **must use the local timezone** of the environment
where the agent is running. Users think in local time ("this happened around 3pm"), and
UTC-only output adds friction during incident response when speed matters.

### Rules

1. **Detect the local timezone** at the start of every investigation. Use the system
   clock or environment (e.g., `date +%Z` or equivalent) to determine the timezone.
2. **Present local time as the primary reference** in all output — summaries, event
   correlations, time-range references, and tables.
3. **Show UTC in parentheses** for clarity, e.g., `15:03:22 IST (12:03:22 UTC)`.
4. **Convert tool responses** — Kubeshark MCP tools return timestamps in UTC. Always
   convert these to local time before presenting to the user.
5. **Use local time in natural language** — when describing events, say "the spike at
   3:23 PM" not "the spike at 12:23 UTC".

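The conversion in rules 3 and 4 can be sketched as a small helper; this is a minimal illustration (not part of the Kubeshark tooling), shown here with a fixed UTC+3 offset standing in for the detected system timezone:

```python
from datetime import datetime, timezone, timedelta

def to_local(utc_str: str, local_tz: timezone) -> str:
    """Convert an ISO-8601 UTC timestamp (as returned by the MCP tools)
    into the 'local (UTC)' display form used throughout this skill."""
    utc_dt = datetime.fromisoformat(utc_str.replace("Z", "+00:00"))
    local_dt = utc_dt.astimezone(local_tz)
    return f"{local_dt:%H:%M:%S} ({utc_dt:%H:%M:%S} UTC)"

# Example: a UTC+3 local zone; swap in the detected system timezone in practice
print(to_local("2026-03-14T12:03:22Z", timezone(timedelta(hours=3))))
# → 15:03:22 (12:03:22 UTC)
```

In real use, resolve the local zone from the system (e.g., `datetime.now().astimezone().tzinfo`) rather than hard-coding an offset.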
### Snapshot Creation

When creating snapshots, Kubeshark MCP tools accept UTC timestamps. Convert the user's
local time references to UTC before passing them to tools like `create_snapshot` or
`export_snapshot_pcap`. Confirm the converted window with the user if there's any
ambiguity.

## Prerequisites

Before starting any analysis, verify the environment is ready.

### Kubeshark MCP Health Check

Confirm the Kubeshark MCP is accessible and tools are available. Look for tools
like `list_api_calls`, `list_l4_flows`, `create_snapshot`, etc.

**Tool**: `check_kubeshark_status`

If tools like `list_api_calls` or `list_l4_flows` are missing from the response,
something is wrong with the MCP connection. Guide the user through setup
(see Setup Reference at the bottom).

### Raw Capture Must Be Enabled

Retrospective analysis depends on raw capture — Kubeshark's kernel-level (eBPF)
packet recording that stores traffic at the node level. Without it, snapshots
have nothing to work with.

Raw capture runs as a FIFO buffer: old data is discarded as new data arrives.
The buffer size determines how far back you can go. Larger buffer = wider
snapshot window.

```yaml
tap:
  capture:
    raw:
      enabled: true
      storageSize: 10Gi  # Per-node FIFO buffer
```

If raw capture isn't enabled, inform the user that retrospective analysis
requires it and share the configuration above.

### Snapshot Storage

Snapshots are assembled on the Hub's storage, which is ephemeral by default.
For serious forensic work, persistent storage is recommended:

```yaml
tap:
  snapshots:
    local:
      storageClass: gp2
      storageSize: 1000Gi
```

## Core Workflow

Every investigation starts with a snapshot. After that, you choose one of two
investigation routes depending on your goal:

1. **Determine time window** — When did the issue occur? Use `get_data_boundaries`
   to see what raw capture data is available.
2. **Create or locate a snapshot** — Either take a new snapshot covering the
   incident window, or find an existing one with `list_snapshots`.
3. **Choose your investigation route** — PCAP or Dissection (see below).

### Choosing the Right Route

| | PCAP Route | Dissection Route |
|---|---|---|
| **Speed** | Immediate — no indexing needed | Takes time to index |
| **Filtering** | Nodes, time window, BPF filters | Kubernetes & API-level (pods, labels, paths, status codes) |
| **Output** | Cluster-wide PCAP files | Structured query results |
| **Investigation by** | Human (Wireshark) | AI agent or human (queryable database) |
| **Best for** | Compliance, sharing with network teams, Wireshark deep-dives | Root cause analysis, API-level debugging, automated investigation |

Both routes are valid and complementary. Use PCAP when you need raw packets
for human analysis or compliance. Use Dissection when you want an AI agent
to search and analyze traffic programmatically.

**Default to Dissection.** Unless the user explicitly asks for a PCAP file or
Wireshark export, assume Dissection is needed. Any question about workloads,
APIs, services, pods, error rates, latency, or traffic patterns requires
dissected data.

## Snapshot Operations

Both routes start here. A snapshot is an immutable freeze of all cluster traffic
in a time window.

### Check Data Boundaries

**Tool**: `get_data_boundaries`

Check what raw capture data exists across the cluster. You can only create
snapshots within these boundaries — data outside the window has been rotated
out of the FIFO buffer.

**Example response** (raw tool output is in UTC — convert to local time before presenting):
```
Cluster-wide:
  Oldest: 2026-03-14 18:12:34 IST (16:12:34 UTC)
  Newest: 2026-03-14 20:05:20 IST (18:05:20 UTC)

Per node:
┌─────────────────────────────┬─────────────────────────────┬─────────────────────────────┐
│ Node                        │ Oldest                      │ Newest                      │
├─────────────────────────────┼─────────────────────────────┼─────────────────────────────┤
│ ip-10-0-25-170.ec2.internal │ 18:12:34 IST (16:12:34 UTC) │ 20:03:39 IST (18:03:39 UTC) │
│ ip-10-0-32-115.ec2.internal │ 18:13:45 IST (16:13:45 UTC) │ 20:05:20 IST (18:05:20 UTC) │
└─────────────────────────────┴─────────────────────────────┴─────────────────────────────┘
```

If the incident falls outside the available window, the data has been rotated
out. Suggest increasing `storageSize` for future coverage.

### Create a Snapshot

**Tool**: `create_snapshot`

Specify nodes (or cluster-wide) and a time window within the data boundaries.
Snapshots include raw capture files, Kubernetes pod events, and eBPF cgroup events.

Snapshots take time to build. Check status with `get_snapshot` — wait until
`completed` before proceeding with either route.

### List Existing Snapshots

**Tool**: `list_snapshots`

Shows all snapshots on the local Hub, with name, size, status, and node count.

### Cloud Storage

Snapshots on the Hub are ephemeral. Cloud storage (S3, GCS, Azure Blob)
provides long-term retention. Snapshots can be downloaded to any cluster
with Kubeshark — not necessarily the original one.

**Check cloud status**: `get_cloud_storage_status`
**Upload to cloud**: `upload_snapshot_to_cloud`
**Download from cloud**: `download_snapshot_from_cloud`

---

## Route 1: PCAP

The PCAP route does **not** require dissection. It works directly with the raw
snapshot data to produce filtered, cluster-wide PCAP files. Use this route when:

- You need raw packets for Wireshark analysis
- You're sharing captures with network teams
- You need evidence for compliance or audit
- A human will perform the investigation (not an AI agent)

### Filtering a PCAP

**Tool**: `export_snapshot_pcap`

Filter the snapshot down to what matters using:

- **Nodes** — specific cluster nodes only
- **Time** — sub-window within the snapshot
- **BPF filter** — standard Berkeley Packet Filter syntax (e.g., `host 10.0.53.101`,
  `port 8080`, `net 10.0.0.0/16`)

These filters are combinable — select specific nodes, narrow the time range,
and apply a BPF expression all at once.

### Workload-to-BPF Workflow

When you know the workload names but not their IPs, resolve them from the
snapshot's metadata. Snapshots preserve pod-to-IP mappings from capture time,
so resolution is accurate even if pods have been rescheduled since.

**Tool**: `list_workloads`

Use `list_workloads` with `name` + `namespace` for a singular lookup (works
live and against snapshots), or with `snapshot_id` + filters for a broader
scan.

**Example workflow — singular lookup** — extract PCAP for specific workloads:

1. Resolve IPs: `list_workloads` with `name: "orders-594487879c-7ddxf"`, `namespace: "prod"` → IPs: `["10.0.53.101"]`
2. Resolve IPs: `list_workloads` with `name: "payment-service-6b8f9d-x2k4p"`, `namespace: "prod"` → IPs: `["10.0.53.205"]`
3. Build BPF: `host 10.0.53.101 or host 10.0.53.205`
4. Export: `export_snapshot_pcap` with that BPF filter

**Example workflow — filtered scan** — extract PCAP for all workloads
matching a pattern in a snapshot:

1. List workloads: `list_workloads` with `snapshot_id`, `namespaces: ["prod"]`,
   `name_regex: "payment.*"` → returns all matching workloads with their IPs
2. Collect all IPs from the response
3. Build BPF: `host 10.0.53.205 or host 10.0.53.210 or ...`
4. Export: `export_snapshot_pcap` with that BPF filter

This gives you a cluster-wide PCAP filtered to exactly the workloads involved
in the incident — ready for Wireshark or long-term storage.

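The "build BPF" step in both workflows is pure string assembly. A minimal sketch (the helper name is illustrative, not part of the Kubeshark API; it dedupes and sorts so the filter is stable across runs):

```python
def bpf_for_ips(ips):
    """Join resolved workload IPs into one BPF expression suitable for
    the export_snapshot_pcap tool's filter parameter."""
    return " or ".join(f"host {ip}" for ip in sorted(set(ips)))

print(bpf_for_ips(["10.0.53.205", "10.0.53.101", "10.0.53.101"]))
# → host 10.0.53.101 or host 10.0.53.205
```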
### IP-to-Workload Resolution

When you have an IP address (e.g., from a PCAP or L4 flow) and need to
identify the workload behind it:

**Tool**: `list_ips`

Use `list_ips` with `ip` for a singular lookup (works live and against
snapshots), or with `snapshot_id` + filters for a broader scan.

**Example — singular lookup**: `list_ips` with `ip: "10.0.53.101"`,
`snapshot_id: "snap-abc"` → returns pod/service identity for that IP.

**Example — filtered scan**: `list_ips` with `snapshot_id: "snap-abc"`,
`namespaces: ["prod"]`, `labels: {"app": "payment"}` → returns all IPs
associated with workloads matching those filters.

---

## Route 2: Dissection

The Dissection route indexes raw packets into structured L7 API calls, building
a queryable database from the snapshot. Use this route when:

- An AI agent is performing the investigation
- You need to search by Kubernetes context (pods, namespaces, labels, services)
- You need to search by API elements (paths, status codes, headers, payloads)
- You want structured responses you can analyze programmatically
- You need to drill into the payload of a specific API call

**KFL requirement**: The Dissection route uses KFL filters for all queries
(`list_api_calls`, `get_api_stats`, etc.). Before constructing any KFL filter,
load the KFL skill (`skills/kfl/`). KFL is statically typed — incorrect field
names or syntax will fail silently or error. If the KFL skill is not available,
suggest the user install it:

```bash
ln -s /path/to/kubeshark/skills/kfl ~/.claude/skills/kfl
```

**If the KFL skill cannot be loaded**, only use the exact filter examples shown
in this skill. Do not improvise or guess at field names, operators, or syntax.
KFL field names differ from what you might expect (e.g., `status_code` not
`response.status`, `src.pod.namespace` not `src.namespace`). Using incorrect
fields produces wrong results without warning.

### Dissection Is Required — Do Not Skip This

**Any question about workloads, Kubernetes resources, services, pods, namespaces,
or API calls requires dissection.** Only the PCAP route works without it. If the
user asks anything about traffic content, API behavior, error rates, latency,
or service-to-service communication, you **must** ensure dissection is active
before attempting to answer.

**Do not wait for dissection to complete on its own — it will not start by itself.**

Follow this sequence every time before using `list_api_calls`, `get_api_call`,
or `get_api_stats`:

1. **Check status**: Call `get_snapshot_dissection_status` (or `list_snapshot_dissections`)
   to see if a dissection already exists for this snapshot.
2. **If dissection exists and is completed** — proceed with your query. No further
   action needed.
3. **If dissection is in progress** — wait for it to complete, then proceed.
4. **If no dissection exists** — you **must** call `start_snapshot_dissection` to
   trigger it. Then monitor progress with `get_snapshot_dissection_status` until
   it completes.

Never assume dissection is running. Never wait for a dissection that was not started.
The agent is responsible for triggering dissection when it is missing.

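The check-start-poll sequence can be sketched as a small control loop. The two callables are stand-ins for the `get_snapshot_dissection_status` and `start_snapshot_dissection` MCP tools (their real call shapes are not specified here), and the status strings are illustrative:

```python
import time

def ensure_dissection(get_status, start_dissection, poll_seconds=5):
    """Check first, trigger only if missing, then poll until completed.
    get_status() returns "missing", "in_progress", or "completed"."""
    if get_status() == "missing":
        start_dissection()        # never assume it was already started
    while get_status() != "completed":
        time.sleep(poll_seconds)  # dissection time scales with snapshot size
```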
**Tool**: `start_snapshot_dissection`

**Tool**: `start_snapshot_dissection`

Dissection takes time proportional to snapshot size — it parses every packet,
reassembles streams, and builds the index. After completion, these tools
become available:

- `list_api_calls` — Search API transactions with KFL filters
- `get_api_call` — Drill into a specific call (headers, body, timing, payload)
- `get_api_stats` — Aggregated statistics (throughput, error rates, latency)

### Every Question Is a Query

**Every user prompt that involves APIs, workloads, services, pods, namespaces,
or Kubernetes semantics should translate into a `list_api_calls` call with an
appropriate KFL filter.** Do not answer from memory or prior results — always
run a fresh query that matches what the user is asking.

Examples of user prompts and the queries they should trigger:

| User says | Action |
|---|---|
| "Show me all 500 errors" | `list_api_calls` with KFL: `http && status_code == 500` |
| "What's hitting the payment service?" | `list_api_calls` with KFL: `dst.service.name == "payment-service"` |
| "Any DNS failures?" | `list_api_calls` with KFL: `dns && status_code != 0` |
| "Show traffic from namespace prod to staging" | `list_api_calls` with KFL: `src.pod.namespace == "prod" && dst.pod.namespace == "staging"` |
| "What are the slowest API calls?" | `list_api_calls` with KFL: `http && elapsed_time > 5000000` |

The user's natural language maps to KFL. Your job is to translate intent into
the right filter and run the query — don't summarize old results or speculate
without fresh data.

### Investigation Strategy

Start broad, then narrow:

1. `get_api_stats` — Get the overall picture: error rates, latency percentiles,
   throughput. Look for spikes or anomalies.
2. `list_api_calls` filtered by error codes (4xx, 5xx) or high latency — find
   the problematic transactions.
3. `get_api_call` on specific calls — inspect headers, bodies, timing, and
   full payload to understand what went wrong.
4. Use KFL filters to slice by namespace, service, protocol, or any combination.

**Example `list_api_calls` response** (filtered to `http && status_code >= 500`,
timestamps converted from UTC to local):
```
┌────────────────────────────────────────┬────────┬──────────────────────────┬────────┬───────────┐
│ Timestamp                              │ Method │ URL                      │ Status │ Elapsed   │
├────────────────────────────────────────┼────────┼──────────────────────────┼────────┼───────────┤
│ 2026-03-14 19:23:45 IST (17:23:45 UTC) │ POST   │ /api/v1/orders/charge    │ 503    │ 12,340 ms │
│ 2026-03-14 19:23:46 IST (17:23:46 UTC) │ POST   │ /api/v1/orders/charge    │ 503    │ 11,890 ms │
│ 2026-03-14 19:23:48 IST (17:23:48 UTC) │ GET    │ /api/v1/inventory/check  │ 500    │ 8,210 ms  │
│ 2026-03-14 19:24:01 IST (17:24:01 UTC) │ POST   │ /api/v1/payments/process │ 502    │ 30,000 ms │
└────────────────────────────────────────┴────────┴──────────────────────────┴────────┴───────────┘
Src: api-gateway (prod) → Dst: payment-service (prod)
```

Use the pattern of repeated failures and high latency to identify the failing
service chain, then drill into individual calls with `get_api_call`.

### KFL Filters for Dissected Traffic

Layer filters progressively when investigating:

```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")
```

Other common RCA filters:

```
dns && dns_response && status_code != 0                     // Failed DNS lookups
src.service.namespace != dst.service.namespace              // Cross-namespace traffic
http && elapsed_time > 5000000                              // Slow transactions (> 5s)
conn && conn_state == "open" && conn_local_bytes > 1000000  // High-volume connections
```

---

## Combining Both Routes

The two routes are complementary. A common pattern:

1. Start with **Dissection** — let the AI agent search and identify the root cause
2. Once you've pinpointed the problematic workloads, use `list_workloads`
   to get their IPs (singular lookup by name+namespace, or filtered scan
   by namespace/regex/labels against the snapshot)
3. Switch to **PCAP** — export a filtered PCAP of just those workloads for
   Wireshark deep-dive, sharing with the network team, or compliance archival

## Use Cases
### Post-Incident RCA

1. Identify the incident time window from alerts, logs, or user reports
2. Check `get_data_boundaries` — is the window still in raw capture?
3. `create_snapshot` covering the incident window (add a 15-minute buffer)
4. **Dissection route**: `start_snapshot_dissection` → `get_api_stats` →
   `list_api_calls` → `get_api_call` → follow the dependency chain
5. **PCAP route**: `list_workloads` → `export_snapshot_pcap` with BPF →
   hand off to Wireshark or archive

### Other Use Cases

- **Trend analysis** — Take snapshots at regular intervals and compare
  `get_api_stats` across them to detect latency drift, error rate changes,
  or new service-to-service connections.
- **Forensic preservation** — `create_snapshot` + `upload_snapshot_to_cloud`
  for immutable, long-term evidence. Downloadable to any cluster months later.
- **Production-to-local replay** — Upload a production snapshot to cloud,
  download it on a local KinD cluster, and investigate safely.

## Setup Reference

For CLI installation, MCP configuration, verification, and troubleshooting,
see `references/setup.md`.

# Kubeshark MCP Setup Reference

## Installing the CLI

**Homebrew (macOS)**:
```bash
brew install kubeshark
```

**Linux**:
```bash
sh <(curl -Ls https://kubeshark.com/install)
```

**From source**:
```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark && make
```

## MCP Configuration

**Claude Desktop / Cowork** (`claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
```

**Claude Code (CLI)**:
```bash
claude mcp add kubeshark -- kubeshark mcp
```

**Without kubectl access** (direct URL mode):
```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp", "--url", "https://kubeshark.example.com"]
    }
  }
}
```

```bash
# Claude Code equivalent:
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```

## Verification

- Claude Code: `/mcp` to check connection status
- Terminal: `kubeshark mcp --list-tools`
- Cluster: `kubectl get pods -l app=kubeshark-hub`

## Troubleshooting

- **Binary not found** → Install via Homebrew or the install script above
- **Connection refused** → Deploy Kubeshark first: `kubeshark tap`
- **No L7 data** → Check `get_dissection_status` and `enable_dissection`
- **Snapshot creation fails** → Verify raw capture is enabled in Kubeshark config
- **Empty snapshot** → Check `get_data_boundaries` — the requested window may
  fall outside available data