We fell victim to variable shadowing here. Each store would be fed the
original list of report keys, instead of only the ones that weren't
found in the previous store. So if a single report was missing from the
in-process cache, we would then fetch all reports from memcache. And if
that in turn was missing a single report, we would fetch all reports from
S3.
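The exact code differs, but the bug boils down to something like this pattern (a sketch with made-up, simplified types rather than the real query-service code):

```go
package store

// Hypothetical, simplified types for illustration.
type Report struct{ Key string }

type ReportStore interface {
	// FetchReports returns the reports it holds and the keys it is missing.
	FetchReports(keys []string) (reports []Report, missing []string)
}

func fetchChained(stores []ReportStore, keys []string) []Report {
	var all []Report
	for _, store := range stores {
		// BUG: ":=" declares a new 'keys' scoped to this iteration, shadowing
		// the parameter. The next iteration calls FetchReports with the
		// original, full key list again. The fix is to assign the missing
		// keys back to the outer variable ('keys = missing') instead.
		reports, keys := store.FetchReports(keys)
		all = append(all, reports...)
		if len(keys) == 0 { // the shadowed 'keys' holds this store's missing keys
			break
		}
	}
	return all
}
```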
We chain report stores for a reason - to reduce latency and, in the case of
the in-process cache, to eliminate decoding costs. So this bug has a huge
impact on query service performance.
To make matters worse, we returned *all* the reports - now possibly in
triplicate. Fortunately, the SmartMerger filters the duplicates out, so at
least we were not incurring extra merge costs.
A lot of time could pass between recording the request count and the hit
count for a particular report-fetching batch, which skewed the cache hit
ratio calculations.
Fix that by deferring the request count recording to the end, which is
when we record the hit count.
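A rough sketch of the shape of the fix, assuming Prometheus-style counters (the metric, type, and function names below are hypothetical):

```go
package cache

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counters; the real metric names differ.
var (
	requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "report_cache_requests_total",
		Help: "Total number of report keys requested from the cache.",
	})
	hitsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "report_cache_hits_total",
		Help: "Total number of report keys found in the cache.",
	})
)

type Report struct{ Key string }

// FetchReports records the request count in the same deferred call that
// records the hit count, so the two counters always move together and the
// hit ratio cannot be skewed by a slow batch.
func FetchReports(keys []string, lookup func(string) (Report, bool)) []Report {
	var hits []Report
	defer func() {
		requestsTotal.Add(float64(len(keys)))
		hitsTotal.Add(float64(len(hits)))
	}()
	for _, k := range keys {
		if r, ok := lookup(k); ok {
			hits = append(hits, r)
		}
	}
	return hits
}
```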
Problem: Decoding a corrupt report grows the 'missing' list. Since we
are waiting for 'len(keys)-len(missing)' decoder goroutines, this
results in waiting for fewer goroutines than we should. The surplus
goroutines leak and we ignore their reports. And since the keys of the
ignored reports are not included in 'missing', we won't attempt to fetch
them from S3 either. Oops.
Fix: calculate the number of goroutines once, at the beginning.
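Roughly (with made-up names and channel plumbing, not the actual decoder code), the fix is to snapshot the goroutine count before 'missing' can grow:

```go
package fetch

type Report struct{ Key string }

// decodeAll decodes one report per goroutine. The number of goroutines is
// captured once, up front, so appending corrupt keys to 'missing' while we
// wait does not change how many results we expect.
func decodeAll(keys []string, missing []string, decode func(string) (Report, error)) ([]Report, []string) {
	numDecoders := len(keys) // fixed before 'missing' can grow
	type result struct {
		report Report
		key    string
		err    error
	}
	ch := make(chan result, numDecoders)
	for _, k := range keys {
		go func(k string) {
			r, err := decode(k)
			ch <- result{report: r, key: k, err: err}
		}(k)
	}
	var reports []Report
	for i := 0; i < numDecoders; i++ {
		res := <-ch
		if res.err != nil {
			// Corrupt report: remember the key so it can still be fetched
			// from the next store in the chain.
			missing = append(missing, res.key)
			continue
		}
		reports = append(reports, res.report)
	}
	return reports, missing
}
```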
The new probe will convert every node's LatestControls to Controls, so
the old app can consume them. Also, the new app will convert every
node's Controls to LatestControls, so it can consume the reports from
old probes.
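A heavily simplified sketch of the two shims; the real Node and control types in Scope are richer than this, so treat the field types below as placeholders:

```go
package compat

// Node is a stand-in for the real report node; only the two control fields
// relevant here are shown.
type Node struct {
	Controls       []string        // legacy field, read by old apps
	LatestControls map[string]bool // new field, read by new apps
}

// downgradeControls is what the new probe does before publishing: copy
// LatestControls into Controls so an old app can still see them.
func downgradeControls(n Node) Node {
	for id := range n.LatestControls {
		n.Controls = append(n.Controls, id)
	}
	return n
}

// upgradeControls is what the new app does when it reads a report from an
// old probe: copy Controls into LatestControls.
func upgradeControls(n Node) Node {
	if n.LatestControls == nil {
		n.LatestControls = map[string]bool{}
	}
	for _, id := range n.Controls {
		n.LatestControls[id] = true
	}
	return n
}
```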
Also:
* Remove Gob encoder/decoder
* Stop using custom encoders/decoders for Timestamps (both ugorji and the Golang JSON codecs use nanosecond precision).
* Use idiomatic way to check for existence in metric.LastSample()
* Rework Scope metrics according to Prometheus conventions (see the sketch after this list).
- counters should end with _total
- elaborated and added units to help strings
- recommended for cache hit/miss metrics: track only the total and the
hits, in separate metrics, since the most common query will be
"hits / total"
- track all times in seconds (base units), which has become the standard
recommendation
- other small changes
There could be more changes that would require more thinking (what
dimensions to use, summaries vs. histograms, etc.), but this is probably
enough controversial material already :)
* Use timeRequestStatus() in sqs_control_router.go.
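For concreteness, here is a sketch of metric definitions following those conventions, with hypothetical names rather than the exact ones used in Scope:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counters end with _total and the help strings spell out what is counted.
	cacheRequestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "scope_report_cache_requests_total",
		Help: "Total number of reports requested from the report cache.",
	})
	// Hits and total requests live in separate metrics, so the hit ratio is
	// simply hits / total at query time.
	cacheHitsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "scope_report_cache_hits_total",
		Help: "Total number of reports served from the report cache.",
	})
	// Durations are tracked in seconds, the Prometheus base unit for time.
	requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "scope_request_duration_seconds",
		Help: "Time taken to serve a request, in seconds.",
	}, []string{"method", "status_code"})
)

func init() {
	prometheus.MustRegister(cacheRequestsTotal, cacheHitsTotal, requestDuration)
}
```

The hit ratio then falls out of a PromQL query like
rate(scope_report_cache_hits_total[5m]) / rate(scope_report_cache_requests_total[5m]).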