Category Archives: ClickHouse

Extracting timestamps from UUIDv6

If you have a UUIDv6 in ClickHouse such as 1ec90684-5c11-62f8-8d4a-a9fc15074a15 and you want to convert it to an epoch timestamp, you can use the following query:

select
    (reinterpretAsUInt32(substr(reinterpretAsString(toUUID('1ec90684-5c11-62f8-8d4a-a9fc15074a15')) AS uuid, 5, 4)) * pow(2, 28)
    + reinterpretAsUInt16(substr(uuid, 3, 2)) * pow(2, 12)
    + reinterpretAsUInt16(substr(uuid, 1, 2)) - 0x6000
    - 0x01b21dd213814000) / 10000000;

This looks pretty weird, so let’s break it down:

reinterpretAsString(toUUID('1ec90684-5c11-62f8-8d4a-a9fc15074a15')) AS uuid

First we take the UUID and reinterpret its raw 16 bytes as a string. If you look at the hex version of this you see that each 8-byte half has been reversed according to little-endian rules, so it reads F862115C8406C91E154A0715FCA94A8D.

We then take the timestamp sections of this (bytes 5–8 are time_high, bytes 3–4 are time_mid, and bytes 1–2 hold the version plus time_low) and convert them to numbers according to the spec in order to reassemble the 60-bit timestamp.

After this, we subtract the UUID version, as the top 4 bits of the 16-bit section holding time_low contain the UUID version (6), hence the - 0x6000.

However, the 60-bit timestamp is in 100-nanosecond increments since 00:00:00.00, 15 October 1582 (the start of the Gregorian calendar). The magic number of 100-nanosecond increments between then and the Unix epoch (1 Jan 1970) is 0x01b21dd213814000, so we subtract this. (As a sanity check: there are 141,427 days between those two dates, and 141,427 × 86,400 seconds × 10,000,000 = 122,192,928,000,000,000 = 0x01b21dd213814000.)

Finally, to convert from 100-nanosecond increments to seconds, we divide by 10 million (10000000).
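If you want a human-readable timestamp rather than a raw epoch number, one option is to wrap the same expression in toDateTime() (a quick sketch; toUInt32() just truncates the fractional seconds):

select toDateTime(toUInt32(
    -- same expression as above, truncated to whole seconds before conversion
    (reinterpretAsUInt32(substr(reinterpretAsString(toUUID('1ec90684-5c11-62f8-8d4a-a9fc15074a15')) AS uuid, 5, 4)) * pow(2, 28)
    + reinterpretAsUInt16(substr(uuid, 3, 2)) * pow(2, 12)
    + reinterpretAsUInt16(substr(uuid, 1, 2)) - 0x6000
    - 0x01b21dd213814000) / 10000000));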

ClickHouse container permissions and faster startup

If you’re using ClickHouse via the Docker image, which is how we’d usually recommend customers install it for testing and perhaps even for live deployments, you might have noticed that the startup time is a bit slow. A related issue that we saw with some CentOS-based customers is that in the 21.1 release in January the uid/gid of the database files changed, meaning that upgraded containers would sometimes not start. This all comes down to the owner/group that the clickhouse-server instance in the container runs as.

Since 21.1 the uid/gid has been forced to 101 (previously it may have been 999 in some situations). To avoid permissions issues, when the container starts up it runs a script that changes the ownership of all the data files before starting the clickhouse-server process. For large instances this can take several minutes, even on SSDs.

Our recommendations for customers are to run the ClickHouse server container as follows:

Continue reading

Storing combined IPv4 and IPv6 addresses in ClickHouse

In Postgres, when dealing with network addresses, you can typically use the inet type, which supports both IPv4 and IPv6 and uses 7 and 19 bytes of storage per row respectively. Because of the way ClickHouse is designed, it has two separate ‘domain’ types (i.e. functionality wrappers around more standard underlying data types), called IPv4 and IPv6. The underlying storage for IPv4 is UInt32 and for IPv6 is FixedString(16).

As the internet increasingly transitions to dual-stack networking, new services should be designed to accommodate both IPv4 and IPv6 addresses when storing them in the database. There are several alternative ways to do this in ClickHouse.
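One possible approach (sketched here with made-up table and column names, and not necessarily the one we’d recommend in every case) is to use a single IPv6 column and map IPv4 addresses into it with toIPv6(), which stores them as IPv4-mapped IPv6 addresses (::ffff:a.b.c.d):

CREATE TABLE connections
(
    ts   DateTime,
    addr IPv6
)
ENGINE = MergeTree
ORDER BY ts;

-- IPv4 addresses become ::ffff:192.168.1.1; native IPv6 addresses are stored as-is
INSERT INTO connections VALUES
    (now(), toIPv6('192.168.1.1')),
    (now(), toIPv6('2001:db8::1'));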

Continue reading

AggregatingMergeTree Materialized Views no longer needing GROUP BY specified

In examples of using AggregatingMergeTree tables, the data is almost invariably inserted via a materialized view which looks like:

CREATE MATERIALIZED VIEW mv TO agg AS
  SELECT ...
  FROM src
  GROUP BY ...

Whilst it would work to leave out the GROUP BY clause, it would mean that every row is inserted straight into the aggregate table, and only when a merge runs on that table would the rows finally be aggregated and stored as a single row per key. This can cause significant storage and computational overhead.

At least, that used to be the case. In ClickHouse 21.1 a great new setting called optimize_on_insert was introduced (and set to true by default), meaning that each block of inserted data is treated as if the merge process had already run on it. This can really simplify your materialized views. If you are using SimpleAggregateFunctions (and why wouldn’t you?) this is especially true – in that case you can basically just insert the data as it is read, with no changes.
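As a rough sketch of what this can look like (the table and column names below are purely illustrative), an AggregatingMergeTree table with a SimpleAggregateFunction column can be fed from a materialized view that is a plain SELECT, relying on optimize_on_insert to collapse each inserted block:

CREATE TABLE agg
(
    day    Date,
    user   UInt64,
    visits SimpleAggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (day, user);

-- No GROUP BY needed: optimize_on_insert pre-aggregates each inserted block by the ORDER BY key
CREATE MATERIALIZED VIEW mv TO agg AS
    SELECT
        toDate(ts) AS day,
        user,
        toUInt64(1) AS visits
    FROM src;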

Continue reading

Designing a flexible ClickHouse data ingest process flow

If you’re uploading data straight into ClickHouse, it works best ingesting data in large chunks (a good starting point is 100,000 rows per chunk) in a single query. There are also many other ways of ingesting data, for example via Kafka or from external sources such as S3. The standard process is to ingest the data into a single table for long-term storage and then use materialized views to transform it in real time into aggregate tables, for easier display to customers and smaller storage requirements. But even if this is working fine in production, you may still have times when you need to manipulate tables individually.
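For example, here is a rough sketch (the bucket URL, table name and column structure are made up for illustration) of pulling one chunk of data from S3 into a long-term storage table using the s3() table function:

-- Illustrative only: the URL, table and columns here are hypothetical
INSERT INTO events_raw
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/events/2021-01-01.csv.gz',
        'CSVWithNames',
        'ts DateTime, user UInt64, action String',
        'gzip');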

Continue reading