Pre-aggregate data sketches to speed up Snowflake Approximate Count Distincts

 

A common technique for speeding up queries that use aggregation functions is to pre-aggregate the data, reducing the amount of data scanned and improving query response time.

Creating a pre-aggregated table generally involves:  

  • filtering to a subset of data (e.g. source_id != 'test')

  • grouping data to a base time granularity (e.g. group by date_trunc('hour', event_time))

  • grouping by additional dimensions of interest (e.g. group by browser, country, url_path)

  • aggregating the metric of interest (e.g. COUNT(ip))

Querying the pre-aggregated table is done by:

  • grouping by the dimensions of interest (e.g. group by country)

  • rolling up the pre-aggregates with SUM(pre_aggregated_value)
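
Putting these steps together, here is a minimal sketch of what the plain-count version might look like (the table and column names are illustrative):

CREATE OR REPLACE TABLE hourly_web_traffic
AS
SELECT
 DATE_TRUNC('hour', event_time) AS event_hour,
 browser,
 country,
 url_path,
 COUNT(ip) AS ip_count -- pre-aggregated metric
FROM raw_web_traffic
WHERE source_id != 'test' -- filter to a subset of data
GROUP BY DATE_TRUNC('hour', event_time), browser, country, url_path;

-- Querying the pre-aggregated table: roll the pre-aggregates back up with SUM
SELECT
 country,
 SUM(ip_count) AS ip_count
FROM hourly_web_traffic
GROUP BY country;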

Depending on the size, density, and time granularity of the data, the pre-aggregated table can be more than 600x smaller than the raw table. If it is queried often enough, this approach can yield significant cost and time savings.

Using a similar approach for distincts or percentiles is more involved because there is no clear rollup function to apply (i.e. you can't really do a SUM on top of intermediate count distinct values). In cases where approximations are acceptable, there are some excellent algorithms, such as HyperLogLog for distincts and t-digest and q-digest for percentiles, that can help us achieve this.
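
To make the rollup problem concrete, here is a toy illustration (hypothetical data) of why summing per-day distinct counts over-counts repeated values:

WITH events AS (
 SELECT * FROM (VALUES
  ('2023-01-01', 'a'),
  ('2023-01-01', 'b'),
  ('2023-01-02', 'b'), -- 'b' also appeared on day 1
  ('2023-01-02', 'c')
 ) AS t(event_date, ip)
),
daily AS (
 SELECT event_date, COUNT(DISTINCT ip) AS daily_distinct
 FROM events
 GROUP BY event_date
)
SELECT
 (SELECT SUM(daily_distinct) FROM daily) AS summed_daily_distincts, -- 4, over-counts 'b'
 (SELECT COUNT(DISTINCT ip) FROM events) AS true_distinct;          -- 3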

In this post, we will step through the process of leveraging HyperLogLog to pre-aggregate data for approximate distinct count queries in Snowflake.

 

Intro to HyperLogLog

HyperLogLog is an approximation algorithm that estimates the number of distinct elements in a multiset. The intuition behind the algorithm is to keep track of the maximum number of leading zeros in the binary representation of the hashed values. Seeing a hash with a large number of leading zeros is unlikely unless a large number of distinct elements has been observed, so that maximum can be converted into a cardinality estimate.

There are several great blog posts that dig into the implementation details of the algorithm, linked at the end of this section.

Something to note about HyperLogLog is that it uses significantly less memory than a regular COUNT(DISTINCT ...), whose memory usage scales linearly with the number of distinct elements. Each HyperLogLog binary in Snowflake's implementation uses at most 4096 bytes, regardless of input distribution.

Since HyperLogLog is a randomized algorithm, the values it outputs have an error bound associated with them. The average error varies by implementation; Snowflake's algorithm has an average relative error of 1.62338%. A really important property of modern HLL implementations is that you can merge binary states together to get an overall estimate of distinct values without compounding the error.
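
As a point of reference, Snowflake already exposes this directly: APPROX_COUNT_DISTINCT (with HLL as an alias) can be compared against an exact COUNT(DISTINCT ...). A quick sketch against the raw table used later in this post:

-- Exact distinct count: memory scales with the number of distinct values
SELECT COUNT(DISTINCT ip) AS exact_distinct_ips
FROM raw_web_traffic;

-- HyperLogLog estimate: bounded memory, ~1.6% average relative error
SELECT APPROX_COUNT_DISTINCT(ip) AS approx_distinct_ips -- HLL(ip) is an alias
FROM raw_web_traffic;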

Given the error associated with HyperLogLog, it is best not to use it in situations where users expect the numbers to tie out to a source of truth. Nevertheless, this approach can be quite effective when absolute accuracy isn't necessary to glean insight, e.g. understanding relative panel sizes, comparative data quality monitoring on large datasets, and spotting macro trends.

Links to resources about HyperLogLog:

Creating the Pre-aggregated Table

For the purposes of this example, we will be using a table tracking web traffic. As the name suggests, it tracks web traffic by IP address, timestamp, and a few other dimensions.
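
Here is a minimal sketch of what such a table might look like; the exact schema is illustrative and only the columns used below matter:

CREATE OR REPLACE TABLE raw_web_traffic (
 visit_timestamp TIMESTAMP_NTZ, -- when the visit happened
 ip VARCHAR,                    -- visitor IP address
 country VARCHAR,
 url_path VARCHAR,
 browser VARCHAR
);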

Here is a query that takes the raw web traffic data and creates a pre-aggregated table at the day granularity:

CREATE OR REPLACE TABLE day_aggregated_web_traffic
AS
SELECT
 YEAR(visit_timestamp) as year,
 MONTH(visit_timestamp) as month,
 DAY(visit_timestamp) as day,
 country,
 url_path,
 browser,
 hll_export(hll_accumulate(ip)) AS encoded_ip_sketch
FROM raw_web_traffic
GROUP BY YEAR(visit_timestamp), MONTH(visit_timestamp), DAY(visit_timestamp), country, url_path, browser;

The day_aggregated_web_traffic table can now be used to create a variety of reports using distinct counts of IP addresses as a metric.

The following query will grab the distinct count of IPs grouped by country and rolled up to the yearly granularity:

SELECT
 year,
 country,
 hll_estimate(hll_combine(hll_import(encoded_ip_sketch))) AS distinct_ip
FROM day_aggregated_web_traffic
GROUP BY year, country;

The hll_import operation converts the exported sketch back into Snowflake's internal HLL state. The hll_combine operation then merges all the sketches within each group, and hll_estimate returns the approximate distinct count of IPs by country.
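
Because merged sketches behave as if they had been built over the union of the underlying rows, the same table can also answer broader questions, for example the overall distinct IP count across all days and dimensions:

SELECT
 hll_estimate(hll_combine(hll_import(encoded_ip_sketch))) AS distinct_ip
FROM day_aggregated_web_traffic;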

Performance considerations

For the sake of illustration, tests were run against 2 months of raw web traffic data with roughly 10B events per month; the pre-aggregated table had 1,500x less data than the raw web traffic table.

A large warehouse was used to run queries against a month's worth of data. Comparisons were run between HLL approximations on top of the pre-aggregated table and HLL approximations on the raw data for a couple of scenarios (example queries are sketched after the list below). The compute cost is calculated based on the current Standard On-Demand price in us-east-1 of $2.00/credit.

  • Group by Day, Country

  • Group by Day, Country, URL Path, Browser
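
The two variants compared for the first scenario looked roughly like this (filters on the month of data and warehouse sizing are omitted):

-- Pre-aggregated variant: group by day, country
SELECT
 year,
 month,
 day,
 country,
 hll_estimate(hll_combine(hll_import(encoded_ip_sketch))) AS distinct_ip
FROM day_aggregated_web_traffic
GROUP BY year, month, day, country;

-- Raw-data variant: the same report straight off the raw table
SELECT
 DATE_TRUNC('day', visit_timestamp) AS day,
 country,
 APPROX_COUNT_DISTINCT(ip) AS distinct_ip
FROM raw_web_traffic
GROUP BY DATE_TRUNC('day', visit_timestamp), country;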

As expected, the pre-aggregated table vastly outperforms the raw data version. Beyond lower execution time, this approach further reduced cost because the pre-aggregated table returned data fast enough for this specific use case on a smaller warehouse size, while the approximate distinct on the raw table did not.

These results align with Snowflake’s post about this in the Examples section on this page.

Exporting sketches outside of Snowflake

Saving the encoded sketches into a Snowflake table might not be feasible in some situations. For instance, if similar metrics are being pulled from raw and intermediate data outside of Snowflake, the service that compares metrics across those data sources might be vendor-agnostic, or might simply not have access to Snowflake.

In this case, the binaries exported by the sketch queries above will not be portable, because the code required to serialize and deserialize Snowflake's version of HLL isn't open source. Fear not: there are a few open source sketching libraries that you can use in conjunction with Snowflake's UDF functionality.

Here is a useful post that wraps the ZetaSketch implementation of HyperLogLog++ in a Snowflake UDTF.

Other popular open source sketching libraries include:

Lariat Data uses a similar approach to track data quality metrics that leverage distinct counts on batch and streaming data sources. If you're interested in how efficient data processing unlocks a host of continuous data observability use cases, check out Lariat Data!

 