Unlock ClickHouse Speed: Basic Query Optimisation

17/12/2020


When I first joined the product marketing engineering team at ClickHouse, transitioning from a search-focused role at Elastic, I had a significant learning curve ahead, particularly concerning OLAP databases. One of my initial projects involved bringing the new ClickHouse Playground to life. This environment, packed with diverse datasets, proved to be an invaluable learning ground for understanding ClickHouse. As we launched the Playground, I became increasingly curious about the user experience and how we could enhance the performance of the example queries provided. Thanks to the wealth of learning materials available – from on-demand training and videos to comprehensive documentation and blogs – I’ve absorbed a great deal about optimising ClickHouse queries. This article marks the first in a two-part series where I’ll share practical tips for achieving just that, starting with fundamental concepts.

What data does ClickHouse collect?
By default, ClickHouse collects and logs information about each executed query in the query logs. This data is stored in the table system.query_log. The queries presented in this section have been executed on ClickHouse Cloud.

In this first part, we will explore the essential tooling ClickHouse provides to investigate slow queries. We’ll then delve into basic optimisation strategies, emphasising the critical role of primary keys. The subsequent article will cover more advanced techniques, such as projections, materialized views, and data-skipping indexes. My aim is to equip you with the knowledge to make your ClickHouse queries fly.

Understanding Query Performance

The ideal time to consider performance optimisation is during the initial data schema setup, even before ingesting any data into ClickHouse. However, let’s be realistic: it's incredibly challenging to predict future data growth or the exact types of queries that will be executed. So, if you're just embarking on your ClickHouse journey, you might consider skipping ahead to the next section on basic optimisations and primary keys. But for those with existing deployments and a few queries crying out for improvement, the first crucial step is to understand how these queries are performing and, more importantly, why some execute in milliseconds while others take considerably longer.

ClickHouse offers a rich suite of tools designed to help you dissect query execution and the resources consumed. In this section, we'll examine these tools and learn how to wield them effectively.

General Considerations

To truly grasp query performance, let’s briefly look at what happens when a query is executed within ClickHouse. The following explanation is deliberately simplified to provide a high-level understanding without getting bogged down in excessive detail. For a more exhaustive exploration, you can consult the official query analyser documentation.

From a very high-level perspective, a ClickHouse query typically follows these stages:

  1. Query Parsing and Analysis: The incoming query is parsed and analysed, leading to the creation of a generic query execution plan.
  2. Query Optimisation: This generic plan is then optimised. Unnecessary data is pruned, and a highly efficient query pipeline is constructed from the refined plan.
  3. Query Pipeline Execution: This is where the magic happens. Data is read and processed in parallel. ClickHouse executes the core query operations, such as filtering, aggregations, and sorting, at this stage.
  4. Final Processing: The results from parallel processing are merged, sorted (if required), and formatted into the final output before being sent back to the client.

While many more intricate optimisations occur behind the scenes, these fundamental concepts provide a solid foundation for understanding ClickHouse's execution flow. With this understanding, let's explore the specific tools ClickHouse offers for tracking query performance metrics.

Spotting Slow Queries

By default, ClickHouse diligently collects and logs information about every executed query in its system tables, specifically system.query_log. This table is your first port of call when diagnosing performance issues. On ClickHouse Cloud, where the query_log table is distributed across multiple nodes, you might use clusterAllReplicas(default, system.query_log). For local deployments, FROM system.query_log suffices.

For each query, ClickHouse logs vital statistics such as execution time, the number of rows read, and resource usage (CPU, memory, filesystem cache hits). This makes the query log an excellent starting point for identifying performance bottlenecks. You can easily pinpoint long-running queries and inspect their resource consumption.

Let's find the top five long-running queries on our NYC taxi dataset, which we'll be using as our demo environment. This dataset, available on the ClickHouse Playground, was ingested without any initial optimisations.

-- Create table with inferred schema
CREATE TABLE trips_small_inferred
ORDER BY ()
EMPTY AS
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/clickhouse-academy/nyc_taxi_2009-2010.parquet');

-- Insert data into table with inferred schema
INSERT INTO trips_small_inferred
SELECT *
FROM s3Cluster('default', 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/clickhouse-academy/nyc_taxi_2009-2010.parquet');

The schema was automatically inferred, which is convenient but not always optimal for performance:

SHOW CREATE TABLE trips_small_inferred;

/*
CREATE TABLE nyc_taxi.trips_small_inferred
(
    `vendor_id` Nullable(String),
    `pickup_datetime` Nullable(DateTime64(6, 'UTC')),
    `dropoff_datetime` Nullable(DateTime64(6, 'UTC')),
    `passenger_count` Nullable(Int64),
    `trip_distance` Nullable(Float64),
    `ratecode_id` Nullable(String),
    `pickup_location_id` Nullable(String),
    `dropoff_location_id` Nullable(String),
    `payment_type` Nullable(Int64),
    `fare_amount` Nullable(Float64),
    `extra` Nullable(Float64),
    `mta_tax` Nullable(Float64),
    `tip_amount` Nullable(Float64),
    `tolls_amount` Nullable(Float64),
    `total_amount` Nullable(Float64)
)
ORDER BY tuple()
*/

Now, let’s identify those sluggish queries:

-- Find top 5 long running queries from nyc_taxi database in the last 1 hour
SELECT
    type,
    event_time,
    query_duration_ms,
    query,
    read_rows,
    tables
FROM clusterAllReplicas(default, system.query_log)
WHERE has(databases, 'nyc_taxi')
  AND (event_time >= (now() - toIntervalMinute(60)))
  AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 5
FORMAT VERTICAL;

A snippet of the results reveals queries with significant execution times:

Row 1:
──────
type:              QueryFinish
event_time:        2024-11-27 11:12:36
query_duration_ms: 2967
query:             WITH dateDiff('s', pickup_datetime, dropoff_datetime) as trip_time, trip_distance / trip_time * 3600 AS speed_mph SELECT quantiles(0.5, 0.75, 0.9, 0.99)(trip_distance) FROM nyc_taxi.trips_small_inferred WHERE speed_mph > 30 FORMAT JSON
read_rows:         329044175
tables:            ['nyc_taxi.trips_small_inferred']

Row 2:
──────
type:              QueryFinish
event_time:        2024-11-27 11:11:33
query_duration_ms: 2026
query:             SELECT payment_type, COUNT() AS trip_count, formatReadableQuantity(SUM(trip_distance)) AS total_distance, AVG(total_amount) AS total_amount_avg, AVG(tip_amount) AS tip_amount_avg FROM nyc_taxi.trips_small_inferred WHERE pickup_datetime >= '2009-01-01' AND pickup_datetime < '2009-04-01' GROUP BY payment_type ORDER BY trip_count DESC;
read_rows:         329044175
tables:            ['nyc_taxi.trips_small_inferred']

Row 3:
──────
type:              QueryFinish
event_time:        2024-11-27 11:12:17
query_duration_ms: 1860
query:             SELECT avg(dateDiff('s', pickup_datetime, dropoff_datetime)) FROM nyc_taxi.trips_small_inferred WHERE passenger_count = 1 or passenger_count = 2 FORMAT JSON
read_rows:         329044175
tables:            ['nyc_taxi.trips_small_inferred']

The query_duration_ms field clearly indicates the execution time. The first query, for instance, took almost 3 seconds, which is a significant duration in the ClickHouse world. You can also use the query log to identify queries stressing the system by memory or CPU usage, using fields like memory_usage and ProfileEvents.
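A small variation of the query above surfaces the heaviest memory consumers instead; this sketch assumes the same ClickHouse Cloud setup (for a local deployment, read FROM system.query_log directly):

-- Find top 5 memory-hungry queries in the last hour
-- (memory_usage is logged per query in system.query_log)
SELECT
    type,
    event_time,
    formatReadableSize(memory_usage) AS memory,
    query
FROM clusterAllReplicas(default, system.query_log)
WHERE has(databases, 'nyc_taxi')
  AND (event_time >= (now() - toIntervalMinute(60)))
  AND type = 'QueryFinish'
ORDER BY memory_usage DESC
LIMIT 5
FORMAT VERTICAL;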

To establish a baseline for our optimisations, it’s crucial to rerun these long-running queries a few times. For reproducible results, we must first disable the filesystem cache. Caching can mask underlying I/O bottlenecks or poor schema design during troubleshooting.

-- Disable filesystem cache
SET enable_filesystem_cache = 0;

-- Run query 1
WITH
    dateDiff('s', pickup_datetime, dropoff_datetime) AS trip_time,
    trip_distance / trip_time * 3600 AS speed_mph
SELECT quantiles(0.5, 0.75, 0.9, 0.99)(trip_distance)
FROM nyc_taxi.trips_small_inferred
WHERE speed_mph > 30
FORMAT JSON;

-- Run query 2
SELECT
    payment_type,
    COUNT() AS trip_count,
    formatReadableQuantity(SUM(trip_distance)) AS total_distance,
    AVG(total_amount) AS total_amount_avg,
    AVG(tip_amount) AS tip_amount_avg
FROM nyc_taxi.trips_small_inferred
WHERE pickup_datetime >= '2009-01-01' AND pickup_datetime < '2009-04-01'
GROUP BY payment_type
ORDER BY trip_count DESC;

-- Run query 3
SELECT avg(dateDiff('s', pickup_datetime, dropoff_datetime))
FROM nyc_taxi.trips_small_inferred
WHERE passenger_count = 1 OR passenger_count = 2
FORMAT JSON;

Here’s a summary of their baseline performance:

Name    | Elapsed (Run 1) | Rows Processed | Peak Memory
Query 1 | 1.699 sec       | 329.04 million | 440.24 MiB
Query 2 | 1.419 sec       | 329.04 million | 546.75 MiB
Query 3 | 1.414 sec       | 329.04 million | 451.53 MiB

Let's briefly understand what these queries aim to achieve:

  • Query 1: Calculates the distance distribution for rides with an average speed exceeding 30 miles per hour.
  • Query 2: Determines the number and average cost of rides per payment type within a specific date range.
  • Query 3: Computes the average trip duration for rides with one or two passengers.

None of these queries is overly complex, except perhaps Query 1, which calculates trip time on the fly. Yet each takes over a second, which is a long time in the ClickHouse world. Memory usage is also noticeable (around 400-500 MiB per query), and, crucially, each query processes the same number of rows: 329.04 million. A quick check confirms this is the total row count for the table, meaning each query is performing a full table scan.

-- Count number of rows in table
SELECT count()
FROM nyc_taxi.trips_small_inferred;

/*
┌───count()─┐
│ 329044175 │
└───────────┘
*/

For ClickHouse Cloud users, the Query Insight UI offers a rich visual representation of query logs, providing another avenue for identifying problematic queries.

The EXPLAIN Statement

Once we've identified slow queries, the next step is to understand *how* they are executed. The ClickHouse EXPLAIN statement is an incredibly powerful tool for this, providing a detailed view of all query execution stages without actually running the query. While its output can initially seem overwhelming, it's indispensable for gaining insight into query behaviour.

EXPLAIN indexes = 1

Let’s begin with EXPLAIN indexes = 1 to inspect the query plan. This plan is a tree structure showing the order of execution for query clauses, read from bottom to top. Applying it to our first long-running query:

EXPLAIN indexes = 1
WITH
    dateDiff('s', pickup_datetime, dropoff_datetime) AS trip_time,
    (trip_distance / trip_time) * 3600 AS speed_mph
SELECT quantiles(0.5, 0.75, 0.9, 0.99)(trip_distance)
FROM nyc_taxi.trips_small_inferred
WHERE speed_mph > 30;

The output is quite revealing:

┌─explain───────────────────────────────────────────────────┐
1. │ Expression ((Projection + Before ORDER BY))             │
2. │   Aggregating                                           │
3. │     Expression (Before GROUP BY)                        │
4. │       Filter (WHERE)                                    │
5. │         ReadFromMergeTree (nyc_taxi.trips_small_inferred) │
└───────────────────────────────────────────────────────────┘

The query starts by reading data from the nyc_taxi.trips_small_inferred table. Then, the WHERE clause filters rows based on computed values. The filtered data is prepared for aggregation, and the quantiles are computed. Crucially, we observe that no primary keys are used. This makes perfect sense, as we didn't define any when creating the table. The consequence? ClickHouse performs a full scan of the entire table for this query, which is a major contributor to its slow performance.

EXPLAIN PIPELINE

The EXPLAIN PIPELINE command goes a step further, illustrating the concrete execution strategy. It shows how ClickHouse actually translates the generic query plan into parallelisable operations:

EXPLAIN PIPELINE
WITH
    dateDiff('s', pickup_datetime, dropoff_datetime) AS trip_time,
    (trip_distance / trip_time) * 3600 AS speed_mph
SELECT quantiles(0.5, 0.75, 0.9, 0.99)(trip_distance)
FROM nyc_taxi.trips_small_inferred
WHERE speed_mph > 30;

The output reveals the parallel nature of execution:

┌─explain─────────────────────────────────────────────────────────────────────┐
 1. │ (Expression)                                                            │
 2. │ ExpressionTransform × 59                                                │
 3. │   (Aggregating)                                                         │
 4. │   Resize 59 → 59                                                        │
 5. │     AggregatingTransform × 59                                           │
 6. │       StrictResize 59 → 59                                              │
 7. │         (Expression)                                                    │
 8. │         ExpressionTransform × 59                                        │
 9. │           (Filter)                                                      │
10. │           FilterTransform × 59                                          │
11. │             (ReadFromMergeTree)                                         │
12. │             MergeTreeSelect(pool: PrefetchedReadPool, algorithm: Thread) × 59 0 → 1 │
└─────────────────────────────────────────────────────────────────────────────┘

Here, we can clearly see the high degree of parallelisation: 59 threads are used to execute the query. While this parallelisation helps speed up queries that would otherwise take much longer on smaller machines, it also explains the substantial memory usage. Each thread requires its own memory allocation, contributing to the overall peak memory consumption.
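The degree of parallelism follows the max_threads setting, which defaults to the number of available CPU cores. If memory is at a premium, capping it trades some speed for a smaller footprint; a sketch, not a recommendation for this dataset:

-- Cap the query pipeline at 8 threads for the current session;
-- fewer threads generally means lower peak memory but longer runtimes
SET max_threads = 8;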

Ideally, you would investigate all your slow queries using this structured approach: identify them from the query logs, then use EXPLAIN to understand their execution plan, the number of rows read, and the resources consumed. On a production deployment, this can be challenging due to the sheer volume of queries. However, you can narrow your search using fields like user, tables, or databases in system.query_log if you suspect issues with specific entities.

Once identified, the temptation is to change multiple things at once. This often leads to mixed results and a lack of understanding regarding which specific change truly improved performance. Query optimisation demands a structured approach. You don't need advanced benchmarking, but a simple process to track how your changes affect performance is crucial. Always disable the filesystem cache (set enable_filesystem_cache = 0;) when testing, as caching can obscure underlying I/O bottlenecks. Implement optimisations one by one to accurately measure their impact. Finally, be wary of outliers; a single slow query might be an anomaly. Use normalized_query_hash to identify frequently executed, expensive queries – these are your prime candidates for optimisation.
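One way to find those candidates is to group the log by normalized_query_hash, which collapses queries that differ only in literal values; a possible sketch:

-- Group queries by their normalised shape and rank by total time spent
SELECT
    normalized_query_hash,
    count() AS executions,
    round(avg(query_duration_ms)) AS avg_duration_ms,
    sum(query_duration_ms) AS total_duration_ms,
    any(query) AS sample_query
FROM clusterAllReplicas(default, system.query_log)
WHERE type = 'QueryFinish'
GROUP BY normalized_query_hash
ORDER BY total_duration_ms DESC
LIMIT 10
FORMAT VERTICAL;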

Basic Optimisation Techniques

With our framework for testing established, we can now dive into optimising our queries. The most fundamental principle in database performance is simple: the less data you read, the faster your query will execute. Therefore, the best place to start is by examining how your data is stored.

Data Schema Refinement

While ClickHouse's ability to infer table schemas from ingested data is incredibly convenient for getting started, it's rarely the optimal choice for performance. To truly optimise your query performance, you must review and refine your data schema to best fit your specific use case and query patterns.

Avoiding Nullable Columns

As highlighted in ClickHouse's best practices documentation, you should avoid Nullable columns whenever possible. While they offer flexibility in data ingestion, they negatively impact performance because an additional column (for nullability information) must be processed every time. You can easily identify columns that genuinely require Nullable by counting null values:

-- Find non-null values columns
SELECT
    countIf(vendor_id IS NULL) AS vendor_id_nulls,
    countIf(pickup_datetime IS NULL) AS pickup_datetime_nulls,
    countIf(dropoff_datetime IS NULL) AS dropoff_datetime_nulls,
    countIf(passenger_count IS NULL) AS passenger_count_nulls,
    countIf(trip_distance IS NULL) AS trip_distance_nulls,
    countIf(fare_amount IS NULL) AS fare_amount_nulls,
    countIf(mta_tax IS NULL) AS mta_tax_nulls,
    countIf(tip_amount IS NULL) AS tip_amount_nulls,
    countIf(tolls_amount IS NULL) AS tolls_amount_nulls,
    countIf(total_amount IS NULL) AS total_amount_nulls,
    countIf(payment_type IS NULL) AS payment_type_nulls,
    countIf(pickup_location_id IS NULL) AS pickup_location_id_nulls,
    countIf(dropoff_location_id IS NULL) AS dropoff_location_id_nulls
FROM trips_small_inferred
FORMAT VERTICAL;

For our NYC taxi dataset, the results showed:

Row 1:
──────
vendor_id_nulls:           0
pickup_datetime_nulls:     0
dropoff_datetime_nulls:    0
passenger_count_nulls:     0
trip_distance_nulls:       0
fare_amount_nulls:         0
mta_tax_nulls:             137946731
tip_amount_nulls:          0
tolls_amount_nulls:        0
total_amount_nulls:        0
payment_type_nulls:        69305
pickup_location_id_nulls:  0
dropoff_location_id_nulls: 0

Only mta_tax and payment_type have null values. All other fields should ideally not be Nullable.
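Since payment_type has relatively few nulls, one option when re-ingesting is to map them to a sentinel value so the column no longer needs to be Nullable; a sketch, assuming 0 is free to act as an "unknown" marker in your data:

-- Replace NULLs with 0 at read time so the target column can be plain UInt8
SELECT ifNull(payment_type, 0) AS payment_type
FROM trips_small_inferred
LIMIT 5;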

Leveraging LowCardinality

An easy yet powerful optimisation for string columns is to utilise the LowCardinality data type. ClickHouse applies dictionary encoding to LowCardinality columns, which significantly boosts query performance and reduces storage. A simple rule of thumb: any column with fewer than 10,000 unique values is an excellent candidate for LowCardinality.

You can identify such columns with the following query:

-- Identify low cardinality columns
SELECT
    uniq(ratecode_id),
    uniq(pickup_location_id),
    uniq(dropoff_location_id),
    uniq(vendor_id)
FROM trips_small_inferred
FORMAT VERTICAL;

Our results for the taxi data:

Row 1:
──────
uniq(ratecode_id):         6
uniq(pickup_location_id):  260
uniq(dropoff_location_id): 260
uniq(vendor_id):           3

With such low cardinalities, ratecode_id, pickup_location_id, dropoff_location_id, and vendor_id are perfect candidates for the LowCardinality(String) type.
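For an existing table, the conversion can also be done in place with ALTER TABLE, though be aware this mutation rewrites the column on disk (shown here against the inferred table, where the column is still Nullable):

-- Convert vendor_id to a dictionary-encoded column in place
ALTER TABLE trips_small_inferred
    MODIFY COLUMN vendor_id LowCardinality(Nullable(String));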

Optimising Data Types

ClickHouse supports a wide array of data types. Always select the smallest possible data type that accurately fits your use case. This not only optimises performance but also significantly reduces data storage space on disk. For numerical fields, check the min/max values in your dataset to ensure the chosen precision is appropriate:

-- Find min/max values for the payment_type and passenger_count fields
SELECT
    min(payment_type),
    max(payment_type),
    min(passenger_count),
    max(passenger_count)
FROM trips_small_inferred;

Results:

┌─min(payment_type)─┬─max(payment_type)─┬─min(passenger_count)─┬─max(passenger_count)─┐
│                 1 │                 4 │                    0 │                   24 │
└───────────────────┴───────────────────┴──────────────────────┴──────────────────────┘

For dates, choose a precision that aligns with your dataset and is best suited for the queries you plan to run. For instance, DateTime (seconds precision) might be sufficient instead of DateTime64 (microseconds) if your queries don't require such granularity.

Applying the Optimisations

Let’s create a new table, trips_small_no_pk, incorporating these schema optimisations and re-ingest the data:

-- Create table with optimized data types and no primary key
CREATE TABLE trips_small_no_pk
(
    `vendor_id` LowCardinality(String),
    `pickup_datetime` DateTime,
    `dropoff_datetime` DateTime,
    `passenger_count` UInt8,
    `trip_distance` Float32,
    `ratecode_id` LowCardinality(String),
    `pickup_location_id` LowCardinality(String),
    `dropoff_location_id` LowCardinality(String),
    `payment_type` Nullable(UInt8),
    `fare_amount` Decimal32(2),
    `extra` Decimal32(2),
    `mta_tax` Nullable(Decimal32(2)),
    `tip_amount` Decimal32(2),
    `tolls_amount` Decimal32(2),
    `total_amount` Decimal32(2)
)
ORDER BY tuple();

-- Insert the data
INSERT INTO trips_small_no_pk
SELECT * FROM trips_small_inferred;

Now, we rerun our test queries against this new, optimised table:

Name    | Run 1 - Elapsed (inferred schema) | Run 2 - Elapsed (optimised schema) | Rows Processed | Peak Memory (Run 2)
Query 1 | 1.699 sec                         | 1.353 sec                          | 329.04 million | 337.12 MiB
Query 2 | 1.419 sec                         | 1.171 sec                          | 329.04 million | 531.09 MiB
Query 3 | 1.414 sec                         | 1.188 sec                          | 329.04 million | 265.05 MiB

We immediately observe improvements in both query execution time and memory usage. By refining the data schema, we’ve reduced the overall data volume, leading to lower memory consumption and faster processing times. Let’s confirm the storage reduction:

SELECT
    `table`,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    sum(rows) AS rows
FROM system.parts
WHERE (active = 1) AND ((`table` = 'trips_small_no_pk') OR (`table` = 'trips_small_inferred'))
GROUP BY
    database,
    `table`
ORDER BY size DESC;

The results clearly show the benefit:

┌─table────────────────┬─compressed─┬─uncompressed─┬──────rows─┐
1. │ trips_small_inferred │ 7.38 GiB   │ 37.41 GiB    │ 329044175 │
2. │ trips_small_no_pk    │ 4.89 GiB   │ 15.31 GiB    │ 329044175 │
└──────────────────────┴────────────┴──────────────┴───────────┘

The new table is considerably smaller, showing a reduction of about 34% in compressed disk space (from 7.38 GiB to 4.89 GiB). This demonstrates the significant impact of careful data type selection and `LowCardinality` usage.
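You can drill one level deeper and see where the savings come from per column via the system.columns table; a possible query:

-- Per-column on-disk footprint for the optimised table
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE (database = 'nyc_taxi') AND (`table` = 'trips_small_no_pk')
ORDER BY data_compressed_bytes DESC;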

The Importance of Primary Keys

Primary keys in ClickHouse function quite differently from those in most traditional database systems. In conventional databases, primary keys primarily enforce uniqueness and data integrity, often backed by B-tree or hash-based indexes for rapid lookups. In ClickHouse, however, the primary key's objective is solely to optimise query performance. It does not enforce uniqueness or guarantee data integrity.

Instead, the primary key defines the physical order in which data is stored on disk. It's implemented as a sparse index that holds pointers to the first row of each 'granule'. Granules are ClickHouse's smallest units of data read during query execution. Each granule contains a fixed number of rows, typically 8192 (determined by index_granularity), and crucially, granules are stored contiguously and sorted by the primary key. Selecting an effective set of primary keys is paramount for performance. It's even common practice in ClickHouse to store the same data in multiple tables, each with a different set of primary keys, to accelerate specific query patterns. Other advanced ClickHouse features, like Projections or Materialized Views, also allow for different primary keys on the same underlying data, a topic we'll explore in the next part of this series.
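As a sanity check, the granule count follows directly from the row count and index_granularity; for our 329 million row table:

-- 329,044,175 rows at 8,192 rows per granule gives roughly 40,167 granules
SELECT ceil(329044175 / 8192) AS approx_granules;

This matches the total granule count reported by EXPLAIN later in this article.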

Choosing Primary Keys Wisely

Selecting the optimal set of primary keys can be a complex task, often requiring trade-offs and experimentation. However, for a solid starting point, we can follow these simple practices:

  • Prioritise fields that are frequently used in WHERE clauses for filtering.
  • Choose columns with lower cardinality first, as they offer better compression and more distinct ranges for indexing.
  • Consider including a time-based component in your primary key, as filtering by timestamp is a very common operation on analytical datasets.

For our NYC taxi dataset, we’ll experiment with passenger_count, pickup_datetime, and dropoff_datetime. passenger_count has a low cardinality (24 unique values) and is used in one of our slow queries. The timestamp fields (pickup_datetime and dropoff_datetime) are excellent choices for filtering common time-based queries.

Let's create a new table, trips_small_pk, with these primary keys and re-ingest the data:

CREATE TABLE trips_small_pk
(
    `vendor_id` UInt8,
    `pickup_datetime` DateTime,
    `dropoff_datetime` DateTime,
    `passenger_count` UInt8,
    `trip_distance` Float32,
    `ratecode_id` LowCardinality(String),
    `pickup_location_id` UInt16,
    `dropoff_location_id` UInt16,
    `payment_type` Nullable(UInt8),
    `fare_amount` Decimal32(2),
    `extra` Decimal32(2),
    `mta_tax` Nullable(Decimal32(2)),
    `tip_amount` Decimal32(2),
    `tolls_amount` Decimal32(2),
    `total_amount` Decimal32(2)
)
PRIMARY KEY (passenger_count, pickup_datetime, dropoff_datetime);

-- Insert the data
INSERT INTO trips_small_pk
SELECT * FROM trips_small_inferred;

Now, we rerun our queries for the third time, this time against the table with the defined primary keys. Let's compile the results from all three experiments (inferred schema, optimised schema, and primary key) to see the full impact on elapsed time, rows processed, and memory consumption.

Performance Comparison: Before and After Optimisation

Query 1: Distance distribution for fast rides

Metric         | Run 1 (Inferred Schema) | Run 2 (Optimised Schema) | Run 3 (With Primary Key)
Elapsed        | 1.699 sec               | 1.353 sec                | 0.765 sec
Rows Processed | 329.04 million          | 329.04 million           | 329.04 million
Peak Memory    | 440.24 MiB              | 337.12 MiB               | 444.19 MiB

Query 2: Ride counts and averages by payment type

Metric         | Run 1 (Inferred Schema) | Run 2 (Optimised Schema) | Run 3 (With Primary Key)
Elapsed        | 1.419 sec               | 1.171 sec                | 0.248 sec
Rows Processed | 329.04 million          | 329.04 million           | 41.46 million
Peak Memory    | 546.75 MiB              | 531.09 MiB               | 173.50 MiB

Query 3: Average trip duration by passenger count

Metric         | Run 1 (Inferred Schema) | Run 2 (Optimised Schema) | Run 3 (With Primary Key)
Elapsed        | 1.414 sec               | 1.188 sec                | 0.431 sec
Rows Processed | 329.04 million          | 329.04 million           | 276.99 million
Peak Memory    | 451.53 MiB              | 265.05 MiB               | 197.38 MiB

We can see significant improvements across the board in execution time and memory used. Query 2, in particular, benefits most from the primary key. Let’s examine how its query plan differs now:

EXPLAIN indexes = 1
SELECT
    payment_type,
    COUNT() AS trip_count,
    formatReadableQuantity(SUM(trip_distance)) AS total_distance,
    AVG(total_amount) AS total_amount_avg,
    AVG(tip_amount) AS tip_amount_avg
FROM nyc_taxi.trips_small_pk
WHERE (pickup_datetime >= '2009-01-01') AND (pickup_datetime < '2009-04-01')
GROUP BY payment_type
ORDER BY trip_count DESC;

The updated explain plan for Query 2:

┌─explain──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 1. │ Expression ((Projection + Before ORDER BY [lifted up part]))                                                 │
 2. │   Sorting (Sorting for ORDER BY)                                                                             │
 3. │     Expression (Before ORDER BY)                                                                             │
 4. │       Aggregating                                                                                            │
 5. │         Expression (Before GROUP BY)                                                                         │
 6. │           Expression                                                                                         │
 7. │             ReadFromMergeTree (nyc_taxi.trips_small_pk)                                                      │
 8. │             Indexes:                                                                                         │
 9. │               PrimaryKey                                                                                     │
10. │                 Keys:                                                                                        │
11. │                   pickup_datetime                                                                            │
12. │                 Condition: and ((pickup_datetime in (-Inf, 1238543999]), (pickup_datetime in [1230768000, +Inf))) │
13. │                 Parts: 9 / 9                                                                                 │
14. │                 Granules: 5061 / 40167                                                                       │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Thanks to the primary key, ClickHouse was able to prune a significant amount of data, reading only 41.46 million rows instead of the full 329.04 million! The Granules: 5061 / 40167 line indicates that only a small subset of the total granules needed to be read. This alone dramatically improves query performance, as ClickHouse processes significantly less data.
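If you only want the read estimates without the full plan, EXPLAIN ESTIMATE reports the expected parts, rows, and marks per table; a quick sketch for the same filter:

-- Estimate how much data the filter will touch, without running the query
EXPLAIN ESTIMATE
SELECT count()
FROM nyc_taxi.trips_small_pk
WHERE (pickup_datetime >= '2009-01-01') AND (pickup_datetime < '2009-04-01');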

Conclusion

ClickHouse is an incredibly performant analytical database, incorporating a vast array of internal optimisations to achieve its speed. However, to truly unlock its full power, it’s essential to understand how the database works and how to best utilise its features. By leveraging the techniques discussed in this article – identifying less performant queries, understanding their execution plans, and applying fundamental yet powerful changes to your data schema and primary keys – you will witness substantial improvements in your query performance.

This guide serves as an excellent starting point for those new to ClickHouse. For experienced users, while some topics might not be new, the structured approach to diagnosis and optimisation remains invaluable. In the next instalment of this series, we will delve into more advanced topics, including projections, materialized views, and data-skipping indexes, to further enhance your ClickHouse query performance. Stay tuned!

Frequently Asked Questions (FAQs)

What is a 'granule' in ClickHouse?

A granule is the smallest contiguous block of data that ClickHouse reads from disk during query execution. It typically contains a fixed number of rows (defaulting to 8192) and is sorted according to the table's primary key. The sparse primary index points to the beginning of these granules, allowing ClickHouse to skip irrelevant data blocks efficiently.

How do ClickHouse primary keys differ from traditional databases?

Unlike traditional relational databases where primary keys enforce uniqueness and data integrity, ClickHouse primary keys are primarily designed for query performance optimisation. They define the physical sort order of data on disk and serve as a sparse index to enable fast data skipping, rather than guaranteeing uniqueness or referential integrity.

Why should I disable the filesystem cache when benchmarking or troubleshooting queries?

Disabling the filesystem cache (set enable_filesystem_cache = 0;) is crucial during benchmarking or troubleshooting to ensure reproducible results and accurately identify I/O bottlenecks. Caching can mask the true performance impact of your queries by serving data from memory, potentially hiding inefficiencies in your table schema or query design.

When should I use the LowCardinality data type in ClickHouse?

You should use the LowCardinality data type for string columns (or other data types) that have a relatively small number of unique values, typically less than 10,000. It applies dictionary encoding, which significantly reduces storage space and boosts query performance, especially for filtering, grouping, and aggregation operations.

Is using Nullable data types always detrimental to ClickHouse query performance?

While Nullable data types provide flexibility for handling missing data, they generally incur a performance overhead in ClickHouse because an additional column is used internally to track nullability. It's recommended to avoid them where possible. However, if a column truly contains a significant number of nulls and this is intrinsic to your data, using Nullable is appropriate. Always check the actual null count in your data before deciding.
