Sampling

Sampling in MQL reduces the volume of data a query returns. There are two sampling strategies: Random and Sticky.

Random Sampling

Random sampling uniformly downsamples the stream to a percentage of its original volume. You establish random sampling through a sampling clause like the following:

select * from stream SAMPLE {'strategy': 'RANDOM', 'threshold': 200, 'factor': 10000}

For each item in the stream, a numeric hash is generated. That hash is taken modulo factor to produce a result between 0 (inclusive) and factor (exclusive). If that result is less than threshold, the item is sampled; otherwise it is ignored.

The sampling fraction is threshold / factor. In the example above, a threshold of 200 and a factor of 10000 sample 200 / 10000 = 2% of the stream.
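The hash-and-modulo rule described above can be sketched in Python. This is only an illustration: the hash function MQL actually uses is not specified here, so numeric_hash (built on SHA-256) and is_sampled are hypothetical stand-ins, and the events are synthetic strings.

```python
import hashlib


def numeric_hash(value: str) -> int:
    # Assumption: a stand-in hash, not the one MQL uses internally.
    # Takes the first 8 bytes of SHA-256 as an unsigned integer.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def is_sampled(item: str, threshold: int, factor: int) -> bool:
    # hash % factor falls in [0, factor); keep the item if it is below threshold.
    return numeric_hash(item) % factor < threshold


# Synthetic stream: 100,000 distinct items.
events = [f"event-{i}" for i in range(100_000)]
kept = sum(is_sampled(e, threshold=200, factor=10_000) for e in events)
fraction = kept / len(events)
print(f"kept {kept} of {len(events)} events ({fraction:.2%})")
```

With a well-mixed hash, the printed fraction should land close to threshold / factor, which is 2% for these values.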

You can also set a salt value, which will change the calculation of the hash. This is helpful if you want to sample over the same set of values but retrieve a different sample of those values. For example:

select * from stream SAMPLE {'strategy': 'RANDOM', 'threshold': 200, 'factor': 10000, 'salt': 123}
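To see why a salt yields a different sample of the same size, consider the following Python sketch. How MQL mixes the salt into its hash is not documented here, so prefixing the salt to the hashed string is an assumption, and salted_hash and sample are hypothetical helpers.

```python
import hashlib


def salted_hash(value: str, salt: int = 0) -> int:
    # Assumption: the salt is prefixed to the hashed value.
    # MQL's actual mixing scheme may differ.
    data = f"{salt}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def sample(items, threshold: int, factor: int, salt: int = 0) -> set:
    # Keep the items whose salted hash, modulo factor, is below threshold.
    return {x for x in items if salted_hash(x, salt) % factor < threshold}


items = [f"event-{i}" for i in range(50_000)]
sample_a = sample(items, 200, 10_000, salt=123)
sample_b = sample(items, 200, 10_000, salt=456)
overlap = len(sample_a & sample_b)
print(len(sample_a), len(sample_b), overlap)
```

Both samples should cover roughly 2% of the items, but they share only a small number of them, because changing the salt changes which items fall below the threshold.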

Sticky Sampling

Sticky sampling “sticks” to certain values for the provided keys. That is, if you are sampling on “zipcode” (as in the following example) and you observe a specific zipcode in the stream, you will observe all events that contain that zipcode. You can set up sticky sampling with a query like the following:

select * from stream SAMPLE {'strategy':'STICKY', 'keys':['zipcode'], 'threshold':200, 'factor':10000, 'salt': 1}

The query above should retrieve about 2% of the total stream (see Random Sampling above for why), provided the events are uniformly distributed over the zipcodes.
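The sticky behavior can be illustrated with a Python sketch in which the hash is computed over the key values only, so every event sharing a zipcode receives the same decision. This is an assumption about the mechanism rather than MQL's actual implementation; key_hash and sticky_sampled are hypothetical names, and the events are synthetic.

```python
import hashlib


def key_hash(event: dict, keys: list, salt: int) -> int:
    # Assumption: only the values of the sampling keys (plus the salt) are hashed.
    key_part = "|".join(str(event.get(k)) for k in keys)
    data = f"{salt}:{key_part}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def sticky_sampled(event: dict, keys: list, threshold: int, factor: int, salt: int = 0) -> bool:
    # Events with identical key values hash identically, so they are kept or dropped together.
    return key_hash(event, keys, salt) % factor < threshold


# Synthetic stream: 10,000 zipcodes with two events each (uniform over zipcodes).
events = [{"zipcode": f"{z:05d}", "id": i} for z in range(10_000) for i in range(2)]
kept = [e for e in events if sticky_sampled(e, ["zipcode"], 200, 10_000, salt=1)]
kept_zipcodes = {e["zipcode"] for e in kept}
print(f"kept {len(kept)} events across {len(kept_zipcodes)} zipcodes")
```

Because the decision depends only on the zipcode, about 2% of the zipcodes are selected. That corresponds to about 2% of the events only when each zipcode carries a similar number of events, as it does in this synthetic stream.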