# Indexes

Append Streams support both **primary** and **skipping** indexes to accelerate historical queries.

## Primary Index
The primary key of an Append Stream determines the physical order of rows in the historical columnar store. The **primary index** is automatically built on top of it. In Append Streams, the primary index is sparse.

Choosing an effective primary key can greatly improve query performance, especially when `WHERE` predicates align with the primary index.

**Example**:

Take the **(counter_id, date)** sorting key as an example. In this case, the sorting and index can be illustrated as follows:

```
Whole data: [-------------------------------------------------------------------------]
CounterID:  [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date:       [1111111222222233331233211111222222333211111112122222223111112223311122333]
Marks:       |      |      |      |      |      |      |      |      |      |      |
             a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
Marks numbers:   0      1      2      3      4      5      6      7      8      9      10
```

If the data query specifies:

- **counter_id IN ('a', 'h')**, the server reads the data in the ranges of marks [0, 3) and [6, 8).
- **counter_id IN ('a', 'h') AND date = 3**, the server reads the data in the ranges of marks [1, 3) and [7, 8).
- **date = 3**, the server reads the data in the range of marks [1, 10).

The examples above show that it is always more effective to use an index than a full scan.
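The pruning above can be pictured with a toy simulation (a conceptual sketch, not the engine's actual implementation): a granule between two adjacent marks may contain a given `counter_id` only if that value falls between the mark boundaries.

```python
# Toy sketch of sparse-index pruning (not the engine's implementation).
# Each mark records the (counter_id, date) key at a granule boundary;
# granule g spans marks[g] .. marks[g+1].
marks = [('a', 1), ('a', 2), ('a', 3), ('b', 3), ('e', 2), ('e', 3),
         ('g', 1), ('h', 2), ('i', 1), ('i', 3), ('l', 3)]

def granules_to_read(marks, counter_ids):
    """Return granule numbers that may contain any of the given counter_ids."""
    keep = []
    for g in range(len(marks) - 1):
        lo, hi = marks[g][0], marks[g + 1][0]
        if any(lo <= c <= hi for c in counter_ids):
            keep.append(g)
    return keep

print(granules_to_read(marks, {'a', 'h'}))  # → [0, 1, 2, 6, 7], i.e. mark ranges [0, 3) and [6, 8)
```

Note how the answer matches the first bullet above: only granules inside the mark ranges [0, 3) and [6, 8) are read.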

A sparse index allows extra data to be read. When reading a single range of the primary key, up to `index_granularity * 2` extra rows in each data block can be read.

Sparse indexes allow you to work with a very large number of table rows, because in most cases such indexes fit in the computer's RAM.
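A rough back-of-the-envelope calculation shows why the index stays so small (the granularity and bytes-per-mark figures below are assumptions for illustration, not documented constants):

```python
rows = 1_000_000_000          # one billion rows in the historical store
index_granularity = 8192      # rows per granule (assumed value)
bytes_per_mark = 16           # assumed: e.g. one uint64 counter id + one 64-bit date

# One mark per granule, so the index grows with rows / index_granularity,
# not with the row count itself.
marks = rows // index_granularity
print(marks)                            # 122070 marks
print(marks * bytes_per_mark / 2**20)   # under 2 MiB -- easily fits in RAM
```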

:::info
Once an Append Stream is created, its `sorting key` **cannot be changed**.
:::

## Skipping Indexes

You can create different skipping indexes on an Append Stream to accelerate queries when the primary key index alone is insufficient.

### Create Skipping Indexes

The index declaration is in the columns section of the `CREATE` query.

```sql
INDEX index_name <expr> TYPE type(...) [GRANULARITY granularity_value]
```

These indices aggregate some information about the specified expression on blocks, which consist of `granularity_value` granules (the size of a granule is specified by the `index_granularity` setting of the stream). These aggregates are then used in historical queries to reduce the amount of data read from disk, by skipping big blocks of data where the `WHERE` clause cannot be satisfied.

The `GRANULARITY` clause can be omitted; the default value of `granularity_value` is **1**.

```sql
CREATE STREAM test
(
    u64 uint64,
    i32 int32,
    s string,
    ...
    INDEX idx1 u64 TYPE bloom_filter GRANULARITY 3,
    INDEX idx2 u64 * i32 TYPE minmax GRANULARITY 3,
    INDEX idx3 u64 * length(s) TYPE set(1000) GRANULARITY 4
)
...
```

Indices from the example above can be used to reduce the amount of data read from disk in the following queries:

```sql
SELECT count() FROM table(test) WHERE u64 == 10;
SELECT count() FROM table(test) WHERE u64 * i32 >= 1234;
SELECT count() FROM table(test) WHERE u64 * length(s) == 1234;
```

Data skipping indexes can also be created on composite columns:

```sql
-- on columns of type map:
INDEX map_key_index map_keys(map_column) TYPE bloom_filter
INDEX map_value_index map_values(map_column) TYPE bloom_filter

-- on columns of type tuple:
INDEX tuple_1_index tuple_column.1 TYPE bloom_filter
INDEX tuple_2_index tuple_column.2 TYPE bloom_filter

-- on columns of type nested:
INDEX nested_1_index col.nested_col1 TYPE bloom_filter
INDEX nested_2_index col.nested_col2 TYPE bloom_filter
```

### Skip Index Types

Append Streams support the following types of skip indexes.

- **minmax** index
- **set** index
- **bloom_filter** index
- **ngrambf_v1** index
- **tokenbf_v1** index

#### MinMax

For each index granule, the minimum and maximum values of an expression are stored. (If the expression is of type **tuple**, it stores the minimum and maximum for each tuple element.)

```sql
TYPE minmax
```

**Example:**

```sql
INDEX idx2 u64 TYPE minmax GRANULARITY 3
```
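The skipping behavior can be pictured with a small simulation (a sketch of the concept, not the actual storage format): per block of rows, only the min and max are kept, and a block is pruned when the queried constant falls outside that range.

```python
# Sketch of minmax skipping: keep (min, max) per block and prune blocks
# whose range cannot satisfy the predicate `value == 10`.
def minmax_stats(values, rows_per_block):
    return [(min(values[i:i + rows_per_block]), max(values[i:i + rows_per_block]))
            for i in range(0, len(values), rows_per_block)]

vals  = [3, 7, 12, 9,  25, 31, 10, 11,  2, 4, 5, 6]
stats = minmax_stats(vals, rows_per_block=4)
keep  = [b for b, (lo, hi) in enumerate(stats) if lo <= 10 <= hi]
print(stats)  # [(3, 12), (10, 31), (2, 6)]
print(keep)   # blocks 0 and 1 are read; block 2 is skipped entirely
```

In the engine, `rows_per_block` corresponds to `granularity_value * index_granularity` rows.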

#### Set

For each index granule, at most **max_rows** unique values of the specified expression are stored. **max_rows = 0** means "store all unique values".

```sql
TYPE set(max_rows)
```

**Example:**
```sql
INDEX idx3 u64 TYPE set(1000) GRANULARITY 4
```
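The role of `max_rows` can be sketched as follows (a toy model, not the on-disk format): once a block exceeds `max_rows` distinct values, the index stores nothing for it, so that block can never be skipped.

```python
# Sketch of a set skipping index with a per-block cardinality cap.
def set_stats(values, rows_per_block, max_rows):
    stats = []
    for i in range(0, len(values), rows_per_block):
        uniq = set(values[i:i + rows_per_block])
        # Too many distinct values: the index gives up on this block.
        stats.append(uniq if len(uniq) <= max_rows else None)
    return stats

vals  = [1, 1, 2, 2,  7, 8, 9, 10,  3, 3, 3, 3]
stats = set_stats(vals, rows_per_block=4, max_rows=3)
# For `value == 7`: a block must be read if its set is unknown (None) or contains 7.
keep = [b for b, s in enumerate(stats) if s is None or 7 in s]
print(keep)  # [1] -- blocks 0 and 2 are provably free of 7 and get skipped
```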

#### Bloom filter

For each index granule, a bloom filter for the specified columns is stored.

```sql
TYPE bloom_filter([false_positive_rate])
```

**Example:**
```sql
INDEX idx1 u64 TYPE bloom_filter GRANULARITY 3
```

The **`false_positive_rate`** parameter can take a value between **0** and **1** (by default: **0.025**) and specifies the probability of a false positive (which increases the amount of data to be read).
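The trade-off can be illustrated with a minimal Bloom filter (a conceptual sketch, unrelated to the engine's internal layout): membership tests never miss an inserted value, but may spuriously report an absent value as present, which only costs extra reads, never wrong results.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter: k hash positions over a single bitmask."""
    def __init__(self, size_bits=64, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # True may be a false positive; False is always certain.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = TinyBloom()
bf.add("u64=10")
print(bf.might_contain("u64=10"))  # True -- a Bloom filter never false-negatives
```

A larger bitmask lowers the false positive rate at the cost of index size, which is exactly the knob `false_positive_rate` exposes.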

The following data types are supported:

- **`(u)int*`**
- **`float*`**
- **`enum`**
- **`date`**
- **`date_time`**
- **`string`**
- **`fixed_string`**
- **`array`**
- **`low_cardinality`**
- **`nullable`**
- **`uuid`**
- **`map`**

:::info
**`map`** data type: specifying index creation with keys or values
For the **`map`** data type, the client can specify if the index should be created for keys or for values using the **`map_keys`** or **`map_values`** functions.
:::

#### N-gram bloom filter

For each index granule, a **bloom filter** for the **n-grams** of the specified columns is stored.

```sql
TYPE ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)
```

Parameters:
- **`n`**: ngram size.
- **`size_of_bloom_filter_in_bytes`**: Bloom filter size in bytes. You can use a large value here, for example 256 or 512, because it compresses well.
- **`number_of_hash_functions`**: The number of hash functions used in the bloom filter.
- **`random_seed`**: Seed for the bloom filter hash functions.

This index only works with the following data types:
- **`string`**
- **`fixed_string`**
- **`map`**
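What the index stores per granule are the overlapping n-grams of each string value; extracting them looks like this (a sketch of the concept, not the engine's tokenizer):

```python
def ngrams(s, n):
    """All overlapping n-grams of a string, as an n-gram Bloom filter would index them."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(ngrams("query", 3))  # ['que', 'uer', 'ery']
```

Because a `LIKE '%uer%'` substring also decomposes into the same n-grams, the filter can prune granules for substring searches, not just exact matches.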

#### Token bloom filter

The token bloom filter is the same as `ngrambf_v1`, but stores tokens (sequences separated by non-alphanumeric characters) instead of ngrams.

```sql
TYPE tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)
```
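The tokens mentioned above, maximal runs of alphanumeric characters, can be extracted with a simple split (a conceptual sketch, not the engine's tokenizer):

```python
import re

def tokens(s):
    """Maximal runs of alphanumeric characters, as a token Bloom filter would index them."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", s) if t]

print(tokens("GET /index.html HTTP/1.1"))  # ['GET', 'index', 'html', 'HTTP', '1', '1']
```

This makes `tokenbf_v1` a good fit for log-like text searched by whole words, while `ngrambf_v1` suits arbitrary substring matching.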