Skip to content

Sampling

Sampling in MQL mitigates data volume issues. There are two sampling strategies — Random and Sticky:

Random Sampling

Random sampling uniformly downsamples the stream to a percentage of its original volume. You establish random sampling through a sampling clause like the following:

select * from stream SAMPLE {"strategy": "RANDOM", "threshold": 200, "factor": 10000}

For each item in the stream, a numeric hash is generated. That hash is modded by factor to produce a result between 0 (inclusive) and factor (exclusive). If that result is less than threshold the item will be sampled, otherwise it will be ignored.

You can determine the sampling percentage by remembering that \frac{threshold}{factor} values will be sampled — 2% in the example above with a threshold of 200 and factor of 10000.

You can also set a salt value, which will change the calculation of the hash. This is helpful if you want to sample over the same set of values but retrieve a different sample of those values. For example:

select * from stream SAMPLE {"strategy": "RANDOM", "threshold": 200, "factor": 10000, "salt": 123}

Sticky Sampling

Sticky sampling “sticks” to certain values for the provided keys. That is if you are sampling on “zipcode” (as in the following example) and you observe a specific zipcode in the stream, you will observe all events which contain that specific zipcode. Sticky sampling can be achieved with a query like such:

select * from stream SAMPLE {"strategy": "STICKY", "keys":["zipcode"], "threshold": 200, "factor": 10000, "salt": 1}

The query above should retrieve 2% of the total stream (see Random Sampling above for why), predicated on the events being uniformly distributed over the zipcodes.

How sticky sampling works

Sticky sampling extracts values from the event that correspond to fields specified in keys array. null values are filtered out and remaining values concatenated to generate a hash. A uniform random sampling is performed on the hash to select values consistently.

For eg: For keys: ["zipcode", "zipcode_alt", "state"], following will be concatenated result to hash-based-sampler:

zipcode zipcode_alt state concat-result
100001 null NY 100001NY
null 100001 NY 100001NY
100001 100001 NY 100001100001NY