Unlocking the Power of PostgreSQL Indexes

Shahar Shokrani
Jul 30, 2024


Indexes are essential for improving the performance of your PostgreSQL queries. They allow the database to find and retrieve specific rows much faster than a full table scan. In this article, we will see dramatic improvements in some cases, while in others we may be surprised by the lack of improvement due to factors like low cardinality. Let's dive in!

Setup: Database, Table, and Data

First, we need to set up our environment, which includes creating a database, a table, and populating it with data.

CREATE DATABASE IndexDemoDatabase;

CREATE TABLE names (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT NOT NULL
);

We will insert 1,000,000 rows into the names table. The name column will have high cardinality, meaning it will contain many unique values, while the category column will have low cardinality, meaning it will contain a few distinct values repeated many times:

INSERT INTO names (name, category)
SELECT 'name' || generate_series, 'category' || (generate_series % 5 + 1)
FROM generate_series(1, 500000)
UNION ALL
SELECT 'name' || generate_series, 'category' || (generate_series % 5 + 1)
FROM generate_series(1, 500000);
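As a quick sanity check, we can count the distinct values in each column. Note that because the data is loaded in two identical batches of 500,000, each name actually appears twice, so the name column has 500,000 distinct values (still high cardinality), while category has only 5:

SELECT COUNT(*)                 AS total_rows,
       COUNT(DISTINCT name)     AS distinct_names,
       COUNT(DISTINCT category) AS distinct_categories
FROM names;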

Quick Note on EXPLAIN ANALYZE

ANALYZE collects statistics about the contents of tables in the database and stores the results in the pg_statistic system catalog. The query planner then uses these statistics to choose the most efficient execution plan for a query, so running ANALYZE regularly keeps the planner's picture of your data distribution up to date.

EXPLAIN displays the execution plan that the PostgreSQL planner generates for the supplied statement: how each referenced table will be scanned — by plain sequential scan, index scan, etc. — and, if multiple tables are referenced, which join algorithms will be used to bring together the required rows. Combined as EXPLAIN ANALYZE, it also executes the statement and reports actual row counts and timing, which is exactly what we need to identify performance issues.
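Before comparing timings, it is worth refreshing the statistics for our freshly loaded table so the planner works with accurate numbers:

ANALYZE names;  -- update pg_statistic entries for the names table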

Demonstrating Query Performance

Query Without Index (High Cardinality)

First, let’s execute a query on the name column without using an index and measure the execution time.

In the Query Tool, run the following command:

SET enable_seqscan TO on;
EXPLAIN ANALYZE SELECT * FROM names WHERE name = 'name250000';

The results:

Execution time: 28.340 ms, Scans: 1,000,000 rows.

We can see the execution time is 28.340 ms, but the most important figure to look at is the total number of rows scanned.

Since loops=3 indicates that the parallel sequential scan was executed by three workers, and each worker filtered roughly 333,333 rows (Filter: 333333), the total number of rows processed is 3 × 333,333 = 999,999, which closely matches the table's 1,000,000 rows.

Query With Index (High Cardinality)

Next, let’s create an index on the name column and execute the same query to measure the execution time.

CREATE INDEX idx_name ON names(name);
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * FROM names WHERE name = 'name250000';

The results:

Execution Time: 0.042 ms, Index Scan using idx_name

This output shows a dramatic improvement: with the idx_name index in place, the execution time drops to 0.042 ms.

Query Without Index (Low Cardinality)

Now, let’s execute a query on the category column without using an index and measure the execution time.

SET enable_seqscan TO on;
EXPLAIN ANALYZE SELECT * FROM names WHERE category = 'category3';

The results:

Execution Time: 328.667 ms, Scans: 1,000,000

Query With Index (Low Cardinality)

Finally, let’s create an index on the category column and execute the same query to measure the execution time.

CREATE INDEX idx_category ON names(category);
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * FROM names WHERE category = 'category3';

Results:

Execution Time: 29.941 ms

Although an index was created on the category column, the query performance did not significantly improve. This can be surprising, but it is important to consider the following points:

  • Low Cardinality: The category column has a low cardinality (few distinct values), meaning many rows share the same value. This reduces the selectivity of the index.
  • Bitmap Index Scan: The query planner chose a Bitmap Index Scan, which is efficient when many rows match, but the query still has to fetch a large share of the table's pages, so the gain over a plain sequential scan is modest.
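We can see why the index on category is so unselective by looking at the value distribution. Given how the data was generated, each of the five categories should cover roughly 200,000 rows, i.e. about 20% of the table:

SELECT category, COUNT(*) AS row_count
FROM names
GROUP BY category
ORDER BY category;

When a single value matches a fifth of all rows, the planner knows that visiting that many heap pages through an index costs nearly as much as scanning the whole table.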

Results and Analysis

High Cardinality (name column): The execution time improved significantly from 28.340 ms without an index to 0.042 ms with an index.

Low Cardinality (category column): The index did not provide a significant improvement, with an execution time of 29.941 ms even with the index.

Indexes Are Not Free

While indexes can significantly improve query performance, they come with a cost. Indexes consume additional disk space and can affect the performance of write operations since the index must be updated whenever the data in the indexed column is modified.

SELECT pg_size_pretty(pg_relation_size('idx_name')) AS idx_name_size,
       pg_size_pretty(pg_relation_size('names'))    AS table_size;

Index size is about half the size of the actual table.
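To see the write-side cost in action, you can time an insert while both indexes exist and compare it against the same insert after dropping them. The exact numbers will vary by machine, but every extra index adds work to each INSERT, UPDATE, and DELETE:

-- Every write must now maintain idx_name and idx_category as well as the heap
EXPLAIN ANALYZE
INSERT INTO names (name, category)
VALUES ('name1000001', 'category1');

-- Clean up the demo indexes when you are done experimenting
DROP INDEX idx_name;
DROP INDEX idx_category;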

Conclusion

Indexes are a powerful tool for optimizing the performance of your PostgreSQL database. By creating indexes on frequently queried columns, you can greatly improve the speed of your queries. However, for columns with low cardinality, indexes may not provide significant benefits. Remember to use ANALYZE regularly to keep your statistics up to date, ensuring that the query planner can make the best decisions.
