Storing combined IPv4 and IPv6 addresses in ClickHouse

In Postgres when dealing with network addresses, you can typically use the inet type which supports both IPv4 and IPv6 and uses 7 and 19 bytes of storage per row respectively. Because of the way ClickHouse is designed it has two separate ‘domain’ types (ie functionality wrappers around more standardized underlying data types). These domain types are called IPv4 and IPv6. The underlying storage for IPv4 is UInt32 and for IPv6 is FixedString(16).

As the internet is increasingly transitioning to dual-stack networking, new services should be designed to accommodate both IPv4 and IPv6 addresses when storing in the database. There are several alternative ways to do this in Clickhouse.

Approach 1 – two separate columns

A very simplistic approach might be to use 2 different columns to store these addresses, for example:

request_address_ipv4 IPv4,
request_address_ipv6 IPv6,

This shouldn’t affect storage requirement too much, as a sequence of empty values (don’t use NULL) in either column will naturally compress down very well with the default compression codec, but I’d recommend ZSTD for this as it produces much better compression ratios at very little additional CPU usage. However the downside of this approach is that your analytics queries will be more tricky to write because you would need to aggregate over both columns and strip out empty values.

Approach 2 – store using a single String column which may contain IPv4 or IPv6 addresses in their native formats

Until the ClickHouse 21.2 release this was probably the best solution providing optimal storage and querying at the expense of some quite complex SQL. Basically

Addr String

You can insert data by either toIPv6('…') or reinterpretAsString(toIPv4('1.2.3.4')) for IPv4 addresses.

You can extract the data as a string by:

IF(length(Addr) = 4,
    IPv4NumToString(reinterpretAsUInt32(reverse(Addr))),
    IPv6NumToString(toFixedString(Addr, 16))) AS AddrStr

To search:

 Addr = reinterpretAsString(toIPv4('1.2.3.4'))

Or CIDR-based search for 1.2.3.4/24:

 Addr BETWEEN reinterpretAsString((IPv4CIDRToRange(toIPv4('1.2.3.4'), 24) AS addr_range).1)
    AND reinterpretAsString(addr_range.2)

Approach 3 – store everything as IPv6 type

Starting in ClickHouse 21.2 the toIPv6() function accepts IPv4-format addresses such as toIPv6('1.2.3.4'). This is internally translated to the designated IPv6 format for IPv4 addresses, which is of the form ‘::ffff:1.2.3.4’ (ie the first ten bytes are 0, then two bytes of 255 and then the four bytes of the IPv4 address). This does cause a bit more storage overhead as it stores a full 16 bytes, rather than approach 2 which is only storing 4-byte strings for IPv4 addresses. But in real-world deployments with column compression, the storage requirement difference is not significant and can probably be mitigated further using different codecs on the columns – but to find the optimal storage for this approach needs experimenting and tailoring to your specific data storage requirements.

To use this approach, first create an IPv6 column:

Addr IPv6

Insert data with Addr = toIPv6(ipv4 or ipv6 address)

To access the data you should use all the IPv6-related functions such as IPv6CIDRToRange, and to extract the IP address as a string in either IPv4 or IPv6 format you can use replaceRegexpOne(IPv6NumToString(Addr), '^::ffff:', ''). There is an issue open to create a function which does this correctly.

Leave a Reply

Your email address will not be published. Required fields are marked *