Sampling
Sampling in MQL reduces the volume of data a query has to process and return. There are two sampling strategies: Random and Sticky.
Random Sampling
Random sampling uniformly downsamples the stream to a percentage of its original volume. You enable random sampling with a SAMPLE clause like the following:
select * from stream SAMPLE {"strategy": "RANDOM", "threshold": 200, "factor": 10000}
For each item in the stream, a numeric hash is generated. That hash is taken modulo factor to produce a result between 0 (inclusive) and factor (exclusive). If that result is less than threshold, the item is sampled; otherwise it is ignored.
You can determine the sampling percentage by remembering that a fraction $\frac{\text{threshold}}{\text{factor}}$ of the items will be sampled: 2% in the example above, with a threshold of 200 and a factor of 10000.
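The hash, modulo, and threshold steps described above can be sketched in Python. This is only an illustration: the hash MQL actually uses is not specified in this section, so SHA-256 is an assumed stand-in, and the names `bucket` and `is_sampled` are hypothetical.

```python
import hashlib


def bucket(item: str, factor: int) -> int:
    """Map an item to a stable bucket in [0, factor)."""
    # Assumption: SHA-256 stands in for whatever hash MQL uses internally.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % factor


def is_sampled(item: str, threshold: int, factor: int) -> bool:
    """Keep the item when its bucket falls below the threshold."""
    return bucket(item, factor) < threshold


threshold, factor = 200, 10000
items = [f"event-{i}" for i in range(50_000)]  # synthetic stream items
kept = [item for item in items if is_sampled(item, threshold, factor)]
rate = len(kept) / len(items)

print(f"expected {threshold / factor:.2%}, observed {rate:.2%}")
```

Because the decision depends only on the hash of the item, the same item always gets the same outcome, and the observed rate should land close to threshold / factor for a reasonably large stream.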
You can also set a salt value, which changes how the hash is calculated. This is helpful if you want to sample over the same set of values but retrieve a different sample of those values. For example:
select * from stream SAMPLE {"strategy": "RANDOM", "threshold": 200, "factor": 10000, "salt": 123}
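To see why a salt yields a different sample of the same values, here is a small Python sketch. How MQL mixes the salt into the hash is not described here, so prefixing the salt to the hashed bytes is an assumption, as are SHA-256 and the function name `is_sampled`.

```python
import hashlib


def is_sampled(item: str, threshold: int, factor: int, salt: int = 0) -> bool:
    # Assumption: the salt is prefixed to the hashed bytes; MQL may mix it differently.
    data = f"{salt}:{item}".encode("utf-8")
    hashed = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return hashed % factor < threshold


items = [f"event-{i}" for i in range(50_000)]  # synthetic stream items

# Same items, same threshold and factor, two different salts.
sample_a = {i for i in items if is_sampled(i, 200, 10000, salt=123)}
sample_b = {i for i in items if is_sampled(i, 200, 10000, salt=456)}

print(len(sample_a), len(sample_b), len(sample_a & sample_b))
```

Both samples should be roughly 2% of the items, but they select largely different items, which is the behavior the salt is meant to provide.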
Sticky Sampling
Sticky sampling “sticks” to certain values for the provided keys. That is, if you are sampling on “zipcode” (as in the following example) and you observe a specific zipcode in the stream, you will observe all events that contain that specific zipcode. Sticky sampling can be achieved with a query like the following:
select * from stream SAMPLE {"strategy": "STICKY", "keys":["zipcode"], "threshold": 200, "factor": 10000, "salt": 1}
The query above should retrieve 2% of the total stream (see Random Sampling above for why), provided the events are uniformly distributed over the zipcodes.
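The "sticky" property can be illustrated with a short Python sketch in which the sampling decision depends only on the zipcode. The hash (SHA-256), the way the salt is mixed in, and the name `keep_zipcode` are assumptions made for illustration, not MQL's actual implementation.

```python
import hashlib
from collections import defaultdict

EVENTS_PER_ZIP = 5


def keep_zipcode(zipcode: str, threshold: int = 200, factor: int = 10000, salt: int = 1) -> bool:
    # Assumption: SHA-256 over "salt:zipcode" stands in for MQL's hash.
    data = f"{salt}:{zipcode}".encode("utf-8")
    hashed = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return hashed % factor < threshold


# Synthetic stream: 2,000 zipcodes with the same number of events each.
events = [
    {"zipcode": f"{z:05d}", "seq": n}
    for z in range(10000, 12000)
    for n in range(EVENTS_PER_ZIP)
]
kept = [e for e in events if keep_zipcode(e["zipcode"])]

# Count how many events survived for each zipcode that appears in the sample.
kept_per_zip = defaultdict(int)
for e in kept:
    kept_per_zip[e["zipcode"]] += 1

print(f"{len(kept_per_zip)} zipcodes kept, {len(kept)} of {len(events)} events")
```

Every zipcode that appears in the sample appears with all of its events. This also shows why the 2% figure depends on a uniform distribution: if one zipcode carried most of the traffic, the sampled volume would swing depending on whether that zipcode happened to be selected.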
How sticky sampling works
Sticky sampling extracts the values from the event that correspond to the fields specified in the keys array. null values are filtered out, and the remaining values are concatenated to generate a hash. Uniform random sampling is then performed on that hash, so events with the same key values are always selected or ignored together.
For example, for keys: ["zipcode", "zipcode_alt", "state"], the following concatenated results are passed to the hash-based sampler:
| zipcode | zipcode_alt | state | concat-result |
|---|---|---|---|
| 100001 | null | NY | 100001NY |
| null | 100001 | NY | 100001NY |
| 100001 | 100001 | NY | 100001100001NY |
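The key extraction and concatenation shown in the table can be sketched in Python as follows. The helper names `sticky_key` and `is_sampled` are hypothetical, SHA-256 is an assumed stand-in for MQL's hash, and treating a missing field the same as null is also an assumption.

```python
import hashlib
from typing import Any, Mapping, Sequence


def sticky_key(event: Mapping[str, Any], keys: Sequence[str]) -> str:
    """Concatenate the non-null values of the key fields, in key order."""
    # Assumption: a field missing from the event is treated like null.
    values = (event.get(k) for k in keys)
    return "".join(str(v) for v in values if v is not None)


def is_sampled(event: Mapping[str, Any], keys: Sequence[str],
               threshold: int, factor: int, salt: int = 0) -> bool:
    # Assumption: SHA-256 over "salt:concatenated-key" stands in for MQL's hash.
    data = f"{salt}:{sticky_key(event, keys)}".encode("utf-8")
    hashed = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return hashed % factor < threshold


keys = ["zipcode", "zipcode_alt", "state"]
rows = [
    {"zipcode": 100001, "zipcode_alt": None, "state": "NY"},
    {"zipcode": None, "zipcode_alt": 100001, "state": "NY"},
    {"zipcode": 100001, "zipcode_alt": 100001, "state": "NY"},
]

concatenated = [sticky_key(row, keys) for row in rows]
print(concatenated)  # ['100001NY', '100001NY', '100001100001NY']
```

Note that the first two rows produce the same concatenated value, so under this scheme they always receive the same sampling decision even though the zipcode arrives in different fields.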