
536 Commits

Author SHA1 Message Date
Bryan Boreham
f41b90a7d8 Clean up 'import' ordering 2021-04-04 13:47:27 +01:00
Bryan Boreham
2cf48f2bdd Don't call Fatal() on background thread in test
It doesn't fail the test.
2021-04-04 13:46:58 +01:00
Bryan Boreham
103ea2095f Fix lint warnings in Go code
All cosmetic.
2020-12-30 18:30:34 +00:00
Bryan Boreham
18acfcefe1 Run go fmt on various files
Seems that go fmt has changed behaviour since these files were last
checked in.  Changes are all cosmetic.
2020-12-30 18:30:34 +00:00
Bryan Boreham
e6faa2ba4b Merge pull request #3796 from weaveworks/billing-spy-interval
billing: cope with spy-interval set longer than publish-interval
2020-06-11 14:12:37 +01:00
Bryan Boreham
5318498d9a improvement: make command-line parsing more robust 2020-06-11 11:15:35 +00:00
Bryan Boreham
70240fc82d billing: cope with spy-interval set longer than publish-interval 2020-06-11 11:15:35 +00:00
Bryan Boreham
5264b61951 improvement: stop rendering if Context is cancelled
Typically this means the HTTP caller has closed the connection,
so no point responding to them.

Also check at the point we send a response back, and log to OpenTracing.
2020-06-11 11:13:38 +00:00
Bryan Boreham
a20c51e94d Higher limit on topology size for merged reports.
Where a report has been merged from several probes, give it a higher
limit before dropping topologies.

We will already have applied the limit on each single-probe report as
it came in, except for historical ones.
2020-05-25 13:12:47 +00:00
Bryan Boreham
fd65155cd6 Register the metric for dropped topologies
Missed earlier.
2020-05-25 13:12:27 +00:00
Bryan Boreham
b117b1a5ef cosmetic: move misplaced import 2020-05-19 10:09:52 +00:00
Bryan Boreham
ad82fafde8 multitenant: scan container command-lines as well as process 2020-05-19 10:09:28 +00:00
Bryan Boreham
f83ad517d8 multitenant: extract command-line parsing function and add test 2020-05-19 10:07:23 +00:00
Bryan Boreham
323aa46d1c fix (pipes): check websocket errors inside CopyToWebsocket()
Previously we were treating EOF on the reader as no-error, meaning
that operations like Kubernetes Describe would retry endlessly when
finished.
2020-05-06 10:04:40 +00:00
Bryan Boreham
fa4d1c4c2b Upgrade reports before merging
In case they came from an older or an overloaded probe.
2020-04-16 19:27:40 +00:00
Bryan Boreham
b772fa83b3 Add a metric for topologies dropped because they are over limit
Need to modify DropTopologiesOver() to report what it dropped, and
plumb through the userid so the metric can show who has a problem.
2020-04-16 19:27:28 +00:00
Bryan Boreham
b1fc59819a comment: clarify memcached error cases 2020-04-15 16:49:02 +00:00
Bryan Boreham
9a739fda46 Parallelise sending merged reports to store
Writes to DynamoDB and S3 can be done in parallel, which will reduce
the overall flush time.
2020-04-13 19:11:34 +00:00
Bryan Boreham
2629d13780 Add a histogram for flush times 2020-04-13 19:11:34 +00:00
Bryan Boreham
ccf031b8a9 enhancement(multitenant): merge incoming reports in a time window
This means we store fewer, bigger, reports, which reduces cost of
storage and time to render when data is viewed.
2020-04-13 19:11:34 +00:00
Bryan Boreham
104b9cba50 refactor: Call Close() on collector
Doesn't do anything at present, but will be used later.

Change the signature of BillingEmitter.Close() to match. Note that the returned error was not used.
2020-04-13 19:11:34 +00:00
Bryan Boreham
777ff07e19 refactor(multitenant): break report storage code out into sub-functions
So the main Add() function isn't so long.
2020-04-13 19:11:34 +00:00
Bryan Boreham
8c46367808 fix(multitenant): move use of rounding map inside lock 2020-04-13 16:16:20 +00:00
Bryan Boreham
3f11352435 enhancement(multitenant): Track rounding error in billing calculation
Billing takes an integer number of seconds, so keep track of the
amount lost to rounding when the publish interval is not an integer.
2020-04-10 19:04:50 +00:00
Bryan Boreham
c784acc20d Revert change to use report timestamp
This reverts commit 6b72246fe6.

The app merges reports within a 15-second window of its own time, so
if one or more probes have a time that is several seconds different
they will get excluded from the window.
2020-03-28 13:58:34 +00:00
Bryan Boreham
6b72246fe6 fix (multitenant collector): Use consistent report timestamp
Previously the code called `time.Now()` in two different places so the
timestamps didn't match. Now we use the timestamp of the report itself.

Add the collector's local time to the report if it didn't have one.
2020-03-26 19:15:34 +00:00
Bryan Boreham
ba9ecdd9e2 Merge pull request #3752 from weaveworks/report-window
Set timestamp and window on each report
2020-03-11 21:12:02 +00:00
Bryan Boreham
53701aca1f Cache the last-known report interval per user
Delta reports don't contain the string we are looking for, so remember
it from the last full report.
2020-03-06 18:03:51 +00:00
Bryan Boreham
a47cf0a2aa Remove copying Merge() on Report
It was only used in a few places, and all of those were better off
using the Unsafe variant.
2020-03-06 15:03:43 +00:00
Bryan Boreham
329023b7c5 Improve calculation of usage in multitenant code
Use the duration supplied, if there is one.

It was looking for a process named "scope-probe", whereas the
executable is just named "scope".
2020-03-06 13:29:54 +00:00
Bryan Boreham
634e8f1158 Add tracing for pipe operations 2020-01-21 15:51:00 +00:00
Bryan Boreham
d516ed9883 performance(aws_collector): don't persist shortcut reports
Shortcut reports are sent to update the UI quickly, on events like a
container starting. We don't need to persist them in the time-travel
data since the same information will be covered by a regular report a
few seconds later.
2019-10-17 18:02:36 +00:00
Bryan Boreham
8d9e337a75 chore: fix typos in debugging format strings 2019-09-25 20:08:29 +00:00
Akash Srivastava
ca420b07aa Merge pull request #3687 from weaveworks/refactor-reading
Refactor report reading
2019-09-24 12:26:01 +05:30
Bryan Boreham
13af359bcf refactor: eliminate report.ReadBinary() in favour of MakeFromBinary()
The signature of MakeFromFile changed to return a pointer for
consistency.
2019-09-23 10:01:43 +00:00
Bryan Boreham
6ee9738581 Merge pull request #3686 from weaveworks/analyze-reports
feature(app): Add a debugging summary function, exposed via http
2019-09-18 15:54:50 +01:00
Bryan Boreham
a7d3cbedb5 lint: make lint happy 2019-09-18 14:42:47 +00:00
Bryan Boreham
938d59489c Merge pull request #3682 from weaveworks/websocket-tracing
Websocket tracing spans
2019-09-17 16:36:48 +01:00
Bryan Boreham
2bbd4a3f0d refactor: remove MakeFromBytes() function which is almost the same as ReadBinary() 2019-09-17 10:55:44 +00:00
Bryan Boreham
3f8ba95bea refactor: pass msgpack flag into ReadBinary
instead of a codec.Handle.  This is a cleaner dependency.
2019-09-17 10:55:44 +00:00
Bryan Boreham
11e76f1740 feature(app): Add a debugging summary function, exposed via http
URL is /admin/summary
2019-09-17 10:48:23 +00:00
Bryan Boreham
4e8000cbba review feedback: better tracing info 2019-09-16 11:08:42 +00:00
Bryan Boreham
b0915519df refactor: move websocket state out to a struct to neaten up the send loop 2019-09-16 11:03:02 +00:00
Akash Srivastava
0203757cf5 Merge pull request #3675 from weaveworks/reduce-probe-dependency
Stop render package depending on probe
2019-09-16 12:56:56 +05:30
Bryan Boreham
04af634065 tracing(app): set a tag for userid on awsCollector.Report 2019-09-15 19:22:08 +00:00
Bryan Boreham
852b7cd4c0 tracing(app): spans for report rendering via websocket 2019-09-15 19:08:20 +00:00
Bryan Boreham
15467d7310 Move host-related names out of probe code
Reduce the dependency on low-level libraries
2019-09-13 11:41:09 +00:00
Bryan Boreham
74b6a292d5 Use time.Duration instead of nanoseconds for constants 2019-09-13 07:31:07 +00:00
Bryan Boreham
b5376facf2 Cache merged groups of reports, to reduce the number we handle in parallel
Previously we would merge all reports in a 15-second window.
Now we use a 'quantum' of 3 seconds, similar to the single-user app.

E.g. a 30-node cluster will have 150 individual reports over 15
seconds, but the new code will merge 5 pre-merged reports plus 20-ish
very recent individual ones.

This limits the max heap size used for deserialising, since we only do
3 seconds at once per instance.

Individual reports are still put into the cache, but should get
displaced by the pre-merged ones under LRU.
2019-09-09 10:00:26 +00:00
Bryan Boreham
70550ca34a Refactor: pull userid fetch up out of getReportKeys() 2019-09-09 08:19:55 +00:00