AggregatingMergeTree Materialized Views no longer needing GROUP BY specified

When seeing examples of using AggregatingMergeTree tables, invariably the data is inserted via a materialized view which looks like:

CREATE MATERIALIZED VIEW mv TO agg AS
  SELECT ...
  FROM src
  GROUP BY ...

Whilst it would work to leave out the GROUP BY clause, it means that every row is inserted straight into the aggregate table, and only when a merge is run on it will it finally be aggregated and stored as a single row. This step can cause significant storage- and computational overhead.

This was what used to be true anyway. In ClickHouse 21.1 a great new setting called optimize_on_insert was introduced (and set to true by default) meaning that any block of inserted data gets acted on as if the merge process was already happening. This can really simplify your materialized views. If you are using SimpleAggregateFunctions (and why wouldn’t you?) this is especially true – in that case you can basically just insert the data in as it is read with no changes.

We’ve been running this in various production environments for a while now and there is very discernible difference in performance between the old and new style of doing things, apart from perhaps a bit more CPU overhead and the materialized views being a lot less complicated.

However, it should be noted that when inserting data from a large select query the process can be (sometimes significantly) slower than the old recommended way of inserting into a Null table and running a materialized view off of that. If max_insert_threads is set to a large value and you don’t have too many rows to insert then it’s fine to do this, but if you are rebuilding complicated aggregates with hundreds of billions of rows then it may well be quicker and easier to use the old approach of GROUP BY.

Leave a Reply Cancel reply