590 Commits

Author SHA1 Message Date
Mike Lang
fad3e88269 Rename ECS Service node ids to be cluster;serviceName
This is important for two reasons:
* It prevents nasty false-equality bugs when two different services from different ECS clusters
  are present in the same report
* It allows us to retrieve the cluster and service name - all the info we need to look up the service -
  using only the node ID. This matters, for example, when trying to handle a control request.
2017-02-03 13:45:18 -08:00
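
As a rough illustration of the `cluster;serviceName` scheme described in the commit above, the following Go sketch builds such an ID and recovers both parts from it. The helper names and the error text are hypothetical and are not taken from the repository; only the `;`-separated form comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceNodeIDSeparator is assumed from the "cluster;serviceName" form
// named in the commit message.
const serviceNodeIDSeparator = ";"

// makeServiceNodeID is a hypothetical helper that joins the cluster and
// service name into a single node ID.
func makeServiceNodeID(cluster, serviceName string) string {
	return cluster + serviceNodeIDSeparator + serviceName
}

// parseServiceNodeID is a hypothetical helper that recovers the cluster and
// service name from a node ID. It splits on the first separator only.
func parseServiceNodeID(id string) (cluster, serviceName string, err error) {
	parts := strings.SplitN(id, serviceNodeIDSeparator, 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("malformed ECS service node ID: %q", id)
	}
	return parts[0], parts[1], nil
}
```

With IDs of this shape, two services that share a name but live in different clusters get different node IDs, and a control request handler can look the service up from the ID alone.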
Alfonso Acosta
6347238f10 Review feedback 2017-01-27 13:05:50 +00:00
Alfonso Acosta
7ae94a8c8a DNSSnooper: Support Dot1Q and limit decoding errors 2017-01-27 10:59:33 +00:00
Mike Lang
dee274e438 Merge pull request #2065 from weaveworks/mike/ecs/caching
ECS reporter: Minimize API calls by caching task and service data
2017-01-24 11:03:51 -08:00
Mike Lang
c4eb0960f9 awsecs client: simplify list/describe services
by removing ability to stream results between them, since this is such a minor optimization
and greatly complicates the code.
2017-01-23 12:48:50 -08:00
Mike Lang
baffe94538 awsecs caching: Minor review changes 2017-01-20 14:31:41 -08:00
Alfonso Acosta
7aff988929 Simplify kubelet test 2017-01-20 18:23:11 +00:00
Alfonso Acosta
87f1c0f0f5 Merge pull request #2132 from weaveworks/2049-get-local-pods-from-kubelet
Obtain local pods from kubelet
2017-01-19 12:57:54 +01:00
Mike Lang
79a83e3656 awsecs: Appease linter 2017-01-17 12:17:34 -08:00
Alban Crequy
f1e2b5d93a probe: conntrack: fix output parsing
With net.netfilter.nf_conntrack_acct = 1, conntrack adds the following
fields in the output: packets=3 bytes=164

And with SELinux (e.g. Fedora), conntrack adds: secctx=...

The parsing with fmt.Sscanf introduced in #2095 was unfortunately
rejecting lines with those fields. This patch fixes that by adding more
complicated parsing in decodeFlowKeyValues() with FieldsFunc and SplitN.

Fixes #2117
Regression from #2095
2017-01-17 19:30:56 +01:00
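
The commit above mentions parsing with `FieldsFunc` and `SplitN` so that extra fields do not cause a line to be rejected. A minimal sketch of that idea follows; `parseKeyValues` is a hypothetical helper and not the project's `decodeFlowKeyValues()`, and it only handles the fields of a single direction of a flow.

```go
package main

import (
	"strings"
	"unicode"
)

// parseKeyValues is a hypothetical helper. It collects every key=value token
// in the line and skips tokens without "=", so additional fields such as
// packets=, bytes= or secctx= are kept rather than rejected.
func parseKeyValues(line string) map[string]string {
	kv := make(map[string]string)
	for _, field := range strings.FieldsFunc(line, unicode.IsSpace) {
		// Split on the first "=" only, so values may themselves contain "=".
		parts := strings.SplitN(field, "=", 2)
		if len(parts) != 2 {
			continue // not a key=value token, e.g. "[UNREPLIED]"
		}
		kv[parts[0]] = parts[1]
	}
	return kv
}
```

A caller can then read only the keys it needs (for example `src`, `dst`, `sport`, `dport`) and ignore the rest, which is more tolerant than a fixed `fmt.Sscanf` format string.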
Mike Lang
2b7662a3c6 Make reporter tests a separate package to appease linter
This requires making All The Things public. Yuck.
2017-01-17 03:02:47 -08:00
Alfonso Acosta
496e3f2072 Merge pull request #2114 from weaveworks/1972-non-established-proc-conns
Report persistent connections in states other than ESTABLISHED
2017-01-17 10:45:53 +01:00
Alfonso Acosta
c6f7bdc78e Obtain local pods from kubelet 2017-01-16 18:50:03 +00:00
Filip Barl
d3466b5454 Refactored the table component/model and wrote the tests
Backward-compatibility fix
2017-01-16 17:05:36 +01:00
Filip Barl
6888108b83 Made the searching of generic tables work on the UI
Extracted table headers common code on the frontend

Fixed the search matching and extracted further common code in the UI
2017-01-16 12:22:10 +01:00
Filip Barl
e475a09ee6 Rendering sortable generic tables in the UI
Rendering generic table columns

Made Type a required attribute for TableTemplate

Made generic table sortable on the UI
2017-01-16 12:22:10 +01:00
Filip Barl
31be525bd2 Created generic table model on backend
Replaced MetadataRow with generic Row in Table model

Sending through multicolumn tables from the backend
2017-01-16 12:22:10 +01:00
Mike Lang
5c19dc792e ecs probe: add tests for reporter 2017-01-13 17:31:29 -08:00
Mike Lang
685af493bf ecs probe: Allow cache settings to be tweaked 2017-01-12 11:37:23 -08:00
Mike Lang
513977081d aws ecs probe: Use a size and time bound LRU gcache for caching
instead of our own hand-rolled size-unbound cache
2017-01-12 10:34:41 -08:00
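
To illustrate what "size and time bound LRU" means in the commit above, here is a small standard-library sketch. It is not the gcache library and not the probe's code; `boundedCache` and its methods are hypothetical names, and the sketch is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"time"
)

// entry is one cached item together with its expiry time.
type entry struct {
	key     string
	value   string
	expires time.Time
}

// boundedCache is a hypothetical LRU cache limited by both size and age.
type boundedCache struct {
	maxSize int
	ttl     time.Duration
	order   *list.List // front = most recently used
	items   map[string]*list.Element
}

func newBoundedCache(maxSize int, ttl time.Duration) *boundedCache {
	return &boundedCache{
		maxSize: maxSize,
		ttl:     ttl,
		order:   list.New(),
		items:   make(map[string]*list.Element),
	}
}

// Set stores a value and evicts the least recently used entry if the
// cache has grown past maxSize.
func (c *boundedCache) Set(key, value string) {
	if el, ok := c.items[key]; ok {
		c.order.Remove(el)
		delete(c.items, key)
	}
	e := &entry{key: key, value: value, expires: time.Now().Add(c.ttl)}
	c.items[key] = c.order.PushFront(e)
	if c.order.Len() > c.maxSize {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Get returns a value if it is present and not expired, and marks it as
// recently used.
func (c *boundedCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	e := el.Value.(*entry)
	if !time.Now().Before(e.expires) {
		c.order.Remove(el)
		delete(c.items, key)
		return "", false
	}
	c.order.MoveToFront(el)
	return e.value, true
}
```

The size bound caps memory use regardless of how many tasks or services are seen, while the time bound makes sure stale entries are eventually re-fetched.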
Mike Lang
e220ae822f wip: 2017-01-12 07:11:12 -08:00
Alfonso Acosta
2be26e2be4 Limit connections to established and half-closed 2017-01-10 15:35:32 +00:00
Alfonso Acosta
89a0ab6799 Fix test data and improve /proc/net/tcp header parsing
The header checking code was unsafe because:

1. It was accessing the byteslice at [2] without ensuring a length >= 3
2. It was assuming that the indentation of the 'sl' header is always 2 (which seems to be the case in recent kernels 8f18e4d03e/net/ipv4/tcp_ipv4.c (L2304) and 8f18e4d03e/net/ipv6/tcp_ipv6.c (L1831) ) but it's more robust to simply trim the byteslice.
2017-01-04 00:27:16 +00:00
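
The two points in the commit above (no unchecked indexing, no assumption about indentation) can be sketched as follows. `isHeaderLine` is a hypothetical function written for illustration, assuming the header is the line that starts with `sl` once leading whitespace is removed.

```go
package main

import "bytes"

// isHeaderLine is a hypothetical check for the /proc/net/tcp header line.
// It trims leading whitespace instead of assuming a fixed indentation, and
// uses HasPrefix so that short or empty lines cannot cause an out-of-range
// access.
func isHeaderLine(line []byte) bool {
	trimmed := bytes.TrimLeft(line, " \t")
	return bytes.HasPrefix(trimmed, []byte("sl"))
}
```

Because `bytes.HasPrefix` checks the length itself, an empty or one-byte line simply returns false.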
Alfonso Acosta
99a7dc3b9a Fix tests 2017-01-03 23:34:32 +00:00
Alfonso Acosta
a8b4e65b5c Make linter happy 2017-01-03 22:55:28 +00:00
Alfonso Acosta
7716d96810 Report persistent connections in states other than ESTABLISHED
This aligns the `/proc` connection tracking (persistent connections) with
conntrack (short-lived connections).
2017-01-03 18:38:02 +00:00
Alfonso Acosta
b4e1fc7074 Merge pull request #2112 from weaveworks/2032-ensure-conntrack-events
Check that conntrack events are enabled in the kernel
2017-01-02 23:11:52 +01:00
Alfonso Acosta
5c3ea83846 Fix minor typo 2017-01-02 14:28:22 +00:00
Alfonso Acosta
dfb52f0d93 Clarify even further that proc/PID/net/tcp varies by namespace 2017-01-02 14:27:37 +00:00
Alfonso Acosta
64f1a5d0f5 Check that conntrack events are enabled in the kernel 2017-01-02 09:22:26 +00:00
Alfonso Acosta
2cd76130a1 Merge pull request #2095 from weaveworks/1991-conntrack-parsing
Disable XML in conntrack parsing
2016-12-22 11:00:51 +01:00
Alfonso Acosta
9d352e96f5 Review feedback 2016-12-22 09:33:52 +00:00
Alfonso Acosta
d22d64c710 Cleanup
* Remove XML traces
* Improve performance
* Fix tests
2016-12-21 19:35:37 +00:00
Alfonso Acosta
06ff64d477 Forward OS/Kernel version to checkpoint
Useful to prioritize ebpf testing

Also:
* Make treatment of kernel release and version consistent across Darwin/Linux
2016-12-19 20:08:08 +00:00
Alfonso Acosta
f19889f63c Reduce garbage 2016-12-19 19:30:23 +00:00
Alfonso Acosta
5c02dfcbd2 Complete hacky manual parser 2016-12-19 11:30:00 +00:00
Alfonso Acosta
710c3bf82e [WIP] Disable XML in conntrack parsing
Not working yet
2016-12-19 11:30:00 +00:00
Mike Lang
49d3e7bbd3 wip: 2016-12-16 17:00:57 -08:00
Mike Lang
0fb74d6781 ecs client: more refactoring for nice code
pulls the inner function of describeServices into its own top-level function,
makes the lock part of the client object as a result
2016-12-15 14:11:58 -08:00
Mike Lang
adb6f9d4a1 Appease linter 2016-12-15 14:11:58 -08:00
Mike Lang
6f2efca968 more review feedback 2016-12-15 14:11:58 -08:00
Mike Lang
7d845f9130 ecs reporter: Review feedback, some trivial renames 2016-12-15 14:11:58 -08:00
Mike Lang
7ebb76d0a3 ecs reporter: Move some code around to break up large function 2016-12-15 14:11:58 -08:00
Mike Lang
1d63830792 awsecs reporter: Add lots of debug logging and fix bugs
- describeServices wasn't describing the partial page left over at the end,
  which would cause incorrect results
- the shim between listServices and describeServices was closing the channel every iteration,
  which would cause panic for write to closed channel
- client was not being saved when created, so it gets recreated each time
- we were describeTasks'ing even if we had no tasks to describe
2016-12-15 14:11:57 -08:00
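
The first bug in the list above (the partial page left over at the end was never described) is a common batching mistake. A minimal sketch of batching that keeps the final partial page, assuming a hypothetical `batches` helper and a caller-chosen page size:

```go
package main

// batches is a hypothetical helper that splits names into pages of at most
// size elements. The last page may be shorter and is still returned.
func batches(names []string, size int) [][]string {
	if size <= 0 {
		return nil // nothing sensible to do with a non-positive page size
	}
	var pages [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names) // final, partial page
		}
		pages = append(pages, names[start:end])
	}
	return pages
}
```

With 23 names and a page size of 10, this yields pages of 10, 10 and 3 names, so the last three services are described as well. An empty input yields no pages, which also avoids issuing a describe call when there is nothing to describe.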
Mike Lang
4234888bf4 ecs: Linter fixes 2016-12-15 14:11:57 -08:00
Mike Lang
357136721d Fix compile errors and go fmt 2016-12-15 14:11:57 -08:00
Mike Lang
9d1e46f81b ECS reporter: Use persistent client objects across reports
Not only does this allow us to re-use connections, but vitally it allows us
to make use of the new task and service caching within the client object.
2016-12-15 14:11:57 -08:00
Mike Lang
6b19bc2da9 Changes to how ECS AWS API is used to minimize API calls
Due to AWS API rate limits, we need to minimize API calls as much as possible.

Our stated objectives:
* for all displayed tasks and services to have up-to-date metadata
* for all tasks to map to services if able

My approach here:
* Tasks only contain immutable fields (that we care about). We cache tasks forever.
  We only DescribeTasks the first time we see a new task.
* We attempt to match tasks to services with what info we have. Any "referenced" services,
  i.e. a service with at least one matching task, needs to be updated to refresh changing data.
* In the event that a task doesn't match any of the (updated) services, i.e. a new service entirely
  needs to be found, we do a full list and detail of all services (we don't re-detail ones we just refreshed).
* To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use.
  This should be long enough for things like temporary failures to be glossed over.

This gives us exactly one call per task, and one call per referenced service per report,
which is unavoidable to maintain fresh data. Expensive "describe all" service queries are kept
to only when newly-referenced services appear, which should be rare.

We could make a few very minor improvements here, such as trying to refresh unreferenced but known
services before doing a list query, or getting details one by one when "describing all" and stopping
when all matches have been found, but I believe these would produce very minor, if any, gains in
number of calls while having an unjustifiable effect on latency since we wouldn't be able to do requests
as concurrently.

Speaking of which, this change has a minor performance impact.
Even though we're now doing fewer calls, we can't do them as concurrently.

Old code:
	concurrently:
		describe tasks (1 call)
		sequentially:
			list services (1 call)
			describe services (N calls concurrently)
Assuming full concurrency, total latency: 2 end-to-end calls

New code (worst case):
	sequentially:
		describe tasks (1 call)
		describe services (N calls concurrently)
		list services (1 call)
		describe services (N calls concurrently)
Assuming full concurrency, total latency: 4 end-to-end calls

In practical terms, I don't expect this to matter.
2016-12-15 14:11:57 -08:00
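
The "describe a task only the first time we see it" part of the approach above can be sketched roughly as follows. All names here (`task`, `taskCache`, `ensure`, the `describe` callback) are hypothetical; the callback stands in for a DescribeTasks request and the sketch leaves out eviction and service matching.

```go
package main

// task holds the fields of a task that are treated as immutable.
// The field names are assumptions made for this sketch.
type task struct {
	ARN    string
	Family string
}

// taskCache remembers tasks that have already been described.
type taskCache struct {
	tasks    map[string]task
	describe func(arns []string) []task // stands in for one DescribeTasks call
}

// ensure describes only the ARNs that are not cached yet, and makes no call
// at all when every ARN is already known.
func (c *taskCache) ensure(arns []string) {
	var missing []string
	for _, arn := range arns {
		if _, ok := c.tasks[arn]; !ok {
			missing = append(missing, arn)
		}
	}
	if len(missing) == 0 {
		return
	}
	for _, t := range c.describe(missing) {
		c.tasks[t.ARN] = t
	}
}
```

Across successive reports with a mostly stable set of tasks, the callback is invoked only for newly appeared ARNs, which is the source of the reduction in API calls the commit describes.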
Mike Lang
5ed63de306 Merge pull request #2060 from weaveworks/mike/awsecs/fix-log-formatting
ecs reporter: Fix some log lines that were passing *string instead of string
2016-12-14 11:20:59 -08:00
Alfonso Acosta
07aee0ed97 Merge pull request #2020 from kinvolk/alban/fix-getWalkedProcPid
procspy: use a Reader to copy the background reader buffer
2016-12-07 12:53:53 +01:00