This is important for two reasons:
* It prevents nasty false-equality bugs when two different services from different ECS clusters
are present in the same report
* It allows us to retrieve the cluster and service name - all the info we need to look up the service -
using only the node ID. This matters, for example, when trying to handle a control request.
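The steps above can be sketched roughly as follows. The separator and helper names here are illustrative, not Scope's actual ID scheme: the point is that encoding both cluster and service into the ID makes IDs from different clusters unequal, and makes the ID alone sufficient for lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// makeServiceNodeID and parseServiceNodeID are hypothetical helpers
// demonstrating the idea; the real ID format may differ.
func makeServiceNodeID(cluster, service string) string {
	return fmt.Sprintf("%s;%s", cluster, service)
}

func parseServiceNodeID(id string) (cluster, service string, ok bool) {
	parts := strings.SplitN(id, ";", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	// Same service name in two clusters yields two distinct IDs.
	a := makeServiceNodeID("cluster-1", "frontend")
	b := makeServiceNodeID("cluster-2", "frontend")
	fmt.Println(a != b)

	// The ID alone is enough to recover cluster and service.
	c, s, ok := parseServiceNodeID(a)
	fmt.Println(c, s, ok)
}
```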
With net.netfilter.nf_conntrack_acct = 1, conntrack adds the following
fields in the output: packets=3 bytes=164
And with SELinux (e.g. Fedora), conntrack adds: secctx=...
The parsing with fmt.Sscanf introduced in #2095 was unfortunately
rejecting lines with those fields. This patch fixes that by switching
decodeFlowKeyValues() to more flexible parsing with FieldsFunc and SplitN.
Fixes #2117
Regression from #2095
The header checking code was unsafe because:
1. It was accessing the byteslice at [2] without ensuring a length >= 3
2. It was assuming that the indentation of the 'sl' header is always 2 spaces. That seems to hold in recent kernels (8f18e4d03e/net/ipv4/tcp_ipv4.c (L2304) and 8f18e4d03e/net/ipv6/tcp_ipv6.c (L1831)), but it's more robust to simply trim the byteslice.
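A minimal sketch of the safer check (function name is illustrative): trimming leading spaces and comparing a prefix avoids both problems at once, since it never indexes a fixed offset and makes no assumption about the exact indentation.

```go
package main

import (
	"bytes"
	"fmt"
)

// isTCPHeaderLine (hypothetical name) reports whether a line from
// /proc/net/tcp{,6} is the "sl ..." header row. Unlike indexing line[2],
// this cannot panic on short input, and it works for any indentation.
func isTCPHeaderLine(line []byte) bool {
	return bytes.HasPrefix(bytes.TrimLeft(line, " "), []byte("sl"))
}

func main() {
	fmt.Println(isTCPHeaderLine([]byte("  sl  local_address rem_address")))
	fmt.Println(isTCPHeaderLine([]byte("sl  local_address"))) // no-indent case
	fmt.Println(isTCPHeaderLine([]byte("")))                  // short input: no panic
}
```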
- describeServices wasn't describing the partial page left over at the end,
which would cause incorrect results
- the shim between listServices and describeServices was closing the channel every iteration,
which would cause panic for write to closed channel
- client was not being saved when created, so it was recreated on every call
- we were describeTasks'ing even if we had no tasks to describe
Due to AWS API rate limits, we need to minimize API calls as much as possible.
Our stated objectives:
* for all displayed tasks and services to have up-to-date metadata
* for all tasks to be mapped to services where possible
My approach here:
* Tasks only contain immutable fields (that we care about). We cache tasks forever.
We only DescribeTasks the first time we see a new task.
* We attempt to match tasks to services with what info we have. Any "referenced" services,
i.e. services with at least one matching task, need to be updated to refresh changing data.
* In the event that a task doesn't match any of the (updated) services, i.e. an entirely new service
needs to be found, we do a full list and detail of all services (we don't re-detail ones we just refreshed).
* To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use.
This should be long enough for things like temporary failures to be glossed over.
This gives us exactly one call per task, and one call per referenced service per report,
which is unavoidable to maintain fresh data. Expensive "describe all" service queries are kept
to only when newly-referenced services appear, which should be rare.
We could make a few very minor improvements here, such as trying to refresh unreferenced but known
services before doing a list query, or getting details one by one when "describing all" and stopping
when all matches have been found, but I believe these would produce very minor, if any, gains in
number of calls while having an unjustifiable effect on latency, since we wouldn't be able to issue
requests as concurrently.
Speaking of which, this change has a minor performance impact.
Even though we're now making fewer calls, we can't make them as concurrently.
Old code:
concurrently:
describe tasks (1 call)
sequentially:
list services (1 call)
describe services (N calls concurrently)
Assuming full concurrency, total latency: 2 end-to-end calls
New code (worst case):
sequentially:
describe tasks (1 call)
describe services (N calls concurrently)
list services (1 call)
describe services (N calls concurrently)
Assuming full concurrency, total latency: 4 end-to-end calls
In practical terms, I don't expect this to matter.