Execution plan node types

Use node types in an AlloyDB for PostgreSQL execution plan to help you understand how the database system processes a query. Each node represents a specific operation or step in the query execution, and its type provides insight into the strategy that the planner chose.

Understanding the nodes of an explain plan is important for performance tuning. Each node performs a different operation and obtains data differently, so each node type has its own performance implications. For more information, see Generate and analyze execution plans.

Table and index node types

Table- and index-related node types are important for understanding data access strategies.

Seq Scan node

A Seq Scan node reads every row of a table. Unlike an index, it can't jump directly to specific rows; any filter is applied by checking each row in turn. For an example of a Seq Scan, see WAL option.
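As a minimal sketch (the orders table and note column are hypothetical), a predicate on a column that has no index typically produces a Seq Scan:

-- No index exists on "note", so the planner must read every row
-- and apply the filter to each one.
EXPLAIN SELECT * FROM orders WHERE note = 'rush';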

Bitmap Heap Scan

A Bitmap Heap Scan node works with a Bitmap Index Scan node and prefetches heap blocks based on the bitmap that the index scan builds. This node type is useful when a query returns a large number of rows, because the heap blocks can be prefetched. In open source PostgreSQL versions 18 and earlier, the bitmap scan is the only scan type that prefetches; Index Scan and Index Only Scan don't prefetch.
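As a hedged sketch, the prefetch depth for bitmap heap scans is controlled by the effective_io_concurrency parameter; the value below is only an example, and the best setting depends on your storage:

-- Allow more concurrent prefetch requests during bitmap heap scans.
SET effective_io_concurrency = 200;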

Bitmap Index Scan

A Bitmap Index Scan node precedes a Bitmap Heap Scan. The Bitmap Index Scan node scans the index, finds all matches, and builds a bitmap. It then passes this bitmap to the Bitmap Heap Scan to return the relevant data.

This bitmap allows page skipping and prefetch, which results in faster data access. work_mem is a useful tuning lever for this scan type because memory is required to build the bitmap. If work_mem is too small, the bitmap stores only whole pages rather than exact row pointers, and every row on those pages must be rechecked.

You can identify inefficient bitmap scans by the keyword "Lossy" on the bitmap scan node. To prevent this situation, increase work_mem (see the sketch after the following example) or make sure that the table was vacuumed recently.

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE, COSTS, SETTINGS, BUFFERS, WAL)
SELECT * FROM public.effective_io_concurrency_test eict
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200;

                                   QUERY PLAN
---------------------------------------------------------------------------------
Bitmap Heap Scan on public.effective_io_concurrency_test eict  (cost=10318.30..488198.00 rows=619435 width=31) (actual time=113.696..2443.712 rows=589389 loops=1)
   Output: id, value, product_id, effective_date
   Recheck Cond: (((eict.id >= 10000) AND (eict.id <= 100000)) OR ((eict.product_id >= 100) AND (eict.product_id <= 200)))
   Heap Blocks: exact=378308
   Buffers: shared hit=1784 read=379535 written=4115
   I/O Timings: shared read=1201.096 write=37.558
   ->  BitmapOr  (cost=10318.30..10318.30 rows=619925 width=0) (actual time=68.895..68.896 rows=0 loops=1)
         Buffers: shared hit=501 read=2510
         I/O Timings: shared read=11.655
         ->  Bitmap Index Scan on effective_io_concurrency_test_pkey  (cost=0.00..1807.88 rows=92951 width=0) (actual time=2.513..2.513 rows=90001 loops=1)
               Index Cond: ((eict.id >= 10000) AND (eict.id <= 100000))
               Buffers: shared hit=496
         ->  Bitmap Index Scan on effective_io_concurrency_test_groupby  (cost=0.00..8200.71 rows=526974 width=0) (actual time=66.380..66.380 rows=499871 loops=1)
               Index Cond: ((eict.product_id >= 100) AND (eict.product_id <= 200))
               Buffers: shared hit=5 read=2510
               I/O Timings: shared read=11.655
 Settings: effective_cache_size = '19690120kB', random_page_cost = '1.1', work_mem = '256MB'
 Query Identifier: -5140071079400709055
 Planning:
   Buffers: shared hit=36
 Planning Time: 0.133 ms
 Execution Time: 2477.216 ms
 AlloyDB query id: 18229116469546507386
 AlloyDB plan id: 17462269545806790969
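If a similar plan reports lossy heap blocks, one hedged remediation is to raise work_mem for the session and re-run the statement; the value is an example only:

-- A larger work_mem lets the bitmap keep exact row pointers instead of
-- degrading to whole-page (lossy) entries.
SET work_mem = '512MB';
-- Re-run the EXPLAIN (ANALYZE, BUFFERS) statement and confirm that
-- "Heap Blocks" no longer reports a lossy count.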

Index Scan

An Index Scan node uses an index to access the specific records that match the predicate, and then fetches any other needed columns from the heap. This node type supports forward and backward ordering. Heap access is required because the index doesn't store visibility information or the columns outside the index. This access method might be slower than others when it returns a large number of rows because it doesn't prefetch the heap.
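As a brief sketch with hypothetical names, an index on the filtered column typically yields an Index Scan for a selective predicate:

-- Index on the filtered column; the heap is still visited to fetch the
-- remaining columns and to check row visibility.
CREATE INDEX orders_id_idx ON orders (id);
EXPLAIN SELECT * FROM orders WHERE id = 42;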

Index Only Scan

An Index Only Scan node occurs when the index completely covers the predicate and any returned columns. This node type relies heavily on the visibility map, which vacuum maintains, to avoid heap access. If the visibility map isn't up to date, some heap access is required to return the correct data.
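As a hedged sketch with hypothetical names, a covering index plus an up-to-date visibility map lets the planner choose an Index Only Scan:

-- The index covers both the predicate (id) and the returned column (total).
CREATE INDEX orders_id_total_idx ON orders (id) INCLUDE (total);
-- VACUUM refreshes the visibility map, which reduces heap visits.
VACUUM orders;
EXPLAIN SELECT id, total FROM orders WHERE id < 1000;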

Join node types

A Join node type is the method that the query planner chooses to combine rows from two or more tables (or other relations) based on a join condition.

Nested Loop Join

A Nested Loop Join node steps through one table and, for each row, looks for a match in the other table or subquery. Ideally, each lookup uses an index, but that depends on the volume of accessed data. This access type works well for smaller datasets but not for bulk operations.
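As an illustrative sketch with hypothetical tables, a selective outer row set combined with an index on the inner join key is the classic Nested Loop pattern:

-- For each qualifying row of the small outer table, the planner can
-- probe an index on big_table.small_id.
EXPLAIN SELECT *
FROM small_table s
JOIN big_table b ON b.small_id = s.id
WHERE s.id < 10;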

Merge Join

A Merge Join node requires sorted input from both tables, or from one table and a subquery. An index access or an explicit sort step provides the ordering. The join then advances through both sorted inputs, discarding rows that don't match. This join works well for bulk operations, but if an explicit sort provides the input, this access method might be slower than others. The Sort steps in the following example serve as input to the Merge Join.

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE, COSTS, SETTINGS, BUFFERS, WAL)
SELECT * FROM pgbench_accounts JOIN pgbench_branches USING (bid);

                                   QUERY PLAN
---------------------------------------------------------------------------------
Merge Join  (cost=848464.58..923464.83 rows=5000000 width=457) (actual time=1359.470..2482.524 rows=5000000 loops=1)
   Output: pgbench_accounts.bid, pgbench_accounts.aid, pgbench_accounts.abalance, pgbench_accounts.filler, pgbench_branches.bbalance, pgbench_branches.filler
   Inner Unique: true
   Merge Cond: (pgbench_accounts.bid = pgbench_branches.bid)
   Buffers: shared hit=81927 read=42, temp read=65437 written=65440, ultra fast cache hit=42
   I/O Timings: shared read=0.949, temp read=77.181 write=186.481
   ->  Sort  (cost=848461.67..860961.67 rows=5000000 width=97) (actual time=1359.427..1741.924 rows=5000000 loops=1)
         Output: pgbench_accounts.bid, pgbench_accounts.aid, pgbench_accounts.abalance, pgbench_accounts.filler
         Sort Key: pgbench_accounts.bid
         Sort Method: external merge  Disk: 523496kB
         Buffers: shared hit=81926 read=42, temp read=65437 written=65440, ultra fast cache hit=42
         I/O Timings: shared read=0.949, temp read=77.181 write=186.481
         ->  Seq Scan on public.pgbench_accounts  (cost=0.00..131968.00 rows=5000000 width=97) (actual time=0.006..476.386 rows=5000000 loops=1)
               Output: pgbench_accounts.bid, pgbench_accounts.aid, pgbench_accounts.abalance, pgbench_accounts.filler
               Buffers: shared hit=81926 read=42, ultra fast cache hit=42
               I/O Timings: shared read=0.949
               Columnar Check: table is not in the columnar store
   ->  Sort  (cost=2.91..3.04 rows=50 width=364) (actual time=0.038..0.050 rows=50 loops=1)
         Output: pgbench_branches.bbalance, pgbench_branches.filler, pgbench_branches.bid
         Sort Key: pgbench_branches.bid
         Sort Method: quicksort  Memory: 27kB
         Buffers: shared hit=1
         ->  Seq Scan on public.pgbench_branches  (cost=0.00..1.50 rows=50 width=364) (actual time=0.016..0.022 rows=50 loops=1)
               Output: pgbench_branches.bbalance, pgbench_branches.filler, pgbench_branches.bid
               Buffers: shared hit=1
               Columnar Check: table is too small
 Settings: effective_cache_size = '19690120kB', enable_hashjoin = 'off', enable_nestloop = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '256MB'
 Query Identifier: 6650290151587259687
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.105 ms
 Execution Time: 2786.403 ms

Hash Join

A Hash Join node is a common join type whose performance depends on available memory. This join type is typically slower to start but much faster after it begins. Hash Join builds a hash table from one side of the join (usually the smaller side) and then probes that hash table with rows from the other side.

Because hash tables can be large, make sure that you provide enough work_mem to support the operation. work_mem in conjunction with hash_mem_multiplier determines the total memory available for building hash tables. If the node reports batching, there is insufficient memory to hold the entire hash table in memory. The explain plan also shows the memory that the hash table used. Batching is how a query is processed when it exceeds available memory: the data is processed in smaller chunks, or batches, often on disk. For a tuning sketch, see after the following example.

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE, COSTS, SETTINGS, BUFFERS, WAL)
SELECT * FROM nation n JOIN (SELECT c_nationkey, count(*) FROM customer GROUP BY c_nationkey) c ON n.n_nationkey = c.c_nationkey;

                                   QUERY PLAN
---------------------------------------------------------------------------------
Hash Join  (cost=5909.88..5911.20 rows=25 width=58) (actual time=86.467..86.493 rows=25 loops=1)
   Hash Cond: (n.n_nationkey = c.c_nationkey)
   ->  Seq Scan on nation n  (cost=0.00..1.25 rows=25 width=30) (actual time=0.012..0.016 rows=25 loops=1)
   ->  Hash  (cost=5909.56..5909.56 rows=25 width=36) (actual time=86.447..86.448 rows=25 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 10kB
         ->  Subquery Scan on c  (cost=5909.00..5909.56 rows=25 width=36) (actual time=86.421..86.433 rows=25 loops=1)
               ->  HashAggregate  (cost=5909.00..5909.31 rows=25 width=36) (actual time=86.420..86.427 rows=25 loops=1)
                     Group Key: customer.c_nationkey
                     Batches: 1  Memory Usage: 32kB
                     ->  Seq Scan on customer  (cost=0.00..5159.00 rows=150000 width=10) (actual time=0.006..28.228 rows=150000 loops=1)
 Planning Time: 0.179 ms
 Execution Time: 86.551 ms
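If the Hash node in a plan like this reports more than one batch, the hash table spilled to disk. A hedged tuning sketch follows; the values are examples only:

-- work_mem multiplied by hash_mem_multiplier bounds the memory available
-- for a hash table; raising either can avoid batching.
SET work_mem = '256MB';
SET hash_mem_multiplier = 2.0;
-- Re-run EXPLAIN ANALYZE and check that the Hash node shows "Batches: 1".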

Aggregate node types

Aggregate node types combine multiple input rows into a single result row. They typically appear with aggregate functions like COUNT, SUM, AVG, MAX, or MIN, or when a GROUP BY clause is present.

GroupAggregate

The GroupAggregate node performs all aggregation operations but requires sorted input. Because the data is sorted, this node type requires less memory and returns sorted data. Indexes that cover the GROUP BY clause help speed up this aggregation type (see the sketch after the following example). This aggregation type supports all PostgreSQL aggregate features, including the following:

  • count(distinct ...)
  • array_agg(...)
  • order by ...

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE, COSTS, SETTINGS, BUFFERS, WAL)
SELECT ship_month,count(*) FROM (select date_trunc('month',l_shipdate) as ship_month, l_partkey from lineitem) a GROUP BY ship_month;

                                   QUERY PLAN
---------------------------------------------------------------------------------
GroupAggregate  (cost=1085309.69..1130355.32 rows=2508 width=16) (actual time=4769.489..8799.882 rows=83 loops=1)
   Group Key: (date_trunc('month'::text, l_shipdate))
   ->  Sort  (cost=1085309.69..1100314.45 rows=6001903 width=14) (actual time=4763.972..5441.191 rows=6001589 loops=1)
         Sort Key: (date_trunc('month'::text, l_shipdate))
         Sort Method: external merge  Disk: 146264kB
         ->  Seq Scan on lineitem  (cost=0.00..204436.79 rows=6001903 width=14) (actual time=0.065..2061.266 rows=6001589 loops=1)
 Planning Time: 0.114 ms
 Execution Time: 8827.120 ms
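As a hedged sketch, an expression index that matches the GROUP BY expression can provide pre-sorted input so that the planner can skip the explicit Sort step:

-- Expression index matching date_trunc('month', l_shipdate) in the query.
CREATE INDEX lineitem_ship_month_idx
    ON lineitem (date_trunc('month', l_shipdate));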

HashAggregate

The HashAggregate node performs basic aggregation operations and uses unsorted data from all input nodes. Because the data is unsorted, the node requires more memory and returns unsorted data.

Because hash tables can be large, make sure that you provide enough work_mem to support the operation. work_mem in conjunction with hash_mem_multiplier determines the total memory available for building hash tables. If you notice batching in the node, this indicates that there is insufficient memory to support the entire hash table in memory. The explain plan also shows the memory that the hash table used.

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE, COSTS, SETTINGS, BUFFERS, WAL)
SELECT * FROM nation n JOIN (SELECT c_nationkey, count(*) FROM customer GROUP BY c_nationkey) c ON n.n_nationkey = c.c_nationkey;

                                   QUERY PLAN
---------------------------------------------------------------------------------
Hash Join  (cost=5909.88..5911.20 rows=25 width=58) (actual time=86.467..86.493 rows=25 loops=1)
   Hash Cond: (n.n_nationkey = c.c_nationkey)
   ->  Seq Scan on nation n  (cost=0.00..1.25 rows=25 width=30) (actual time=0.012..0.016 rows=25 loops=1)
   ->  Hash  (cost=5909.56..5909.56 rows=25 width=36) (actual time=86.447..86.448 rows=25 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 10kB
         ->  Subquery Scan on c  (cost=5909.00..5909.56 rows=25 width=36) (actual time=86.421..86.433 rows=25 loops=1)
               ->  HashAggregate  (cost=5909.00..5909.31 rows=25 width=36) (actual time=86.420..86.427 rows=25 loops=1)
                     Group Key: customer.c_nationkey
                     Batches: 1  Memory Usage: 32kB
                     ->  Seq Scan on customer  (cost=0.00..5159.00 rows=150000 width=10) (actual time=0.006..28.228 rows=150000 loops=1)
 Planning Time: 0.179 ms
 Execution Time: 86.551 ms

Parallel node types

Parallel node types facilitate parallel query execution. These nodes work together to distribute work among multiple worker processes and then consolidate the results.

Gather

The Gather node collects and assembles data from worker processes.

Parallel Sequential Scan

Based on the number of workers that the planner determines it can use, the leader assigns blocks to individual workers. The workers scan their assigned blocks sequentially for relevant data and pass it back to the Gather node, which combines the results.
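A minimal sketch that reuses the pgbench_accounts table from the earlier example; the worker count is an example value:

-- Allow up to four parallel workers per Gather node.
SET max_parallel_workers_per_gather = 4;
-- A large table with a filter typically plans as
-- Gather -> Parallel Seq Scan.
EXPLAIN SELECT * FROM pgbench_accounts WHERE abalance > 100;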

Parallel Bitmap Heap Scan

A Parallel Bitmap Heap Scan gathers data like a Parallel Seq Scan, but it emulates the behavior of a Bitmap Heap Scan. The leader process performs the Bitmap Index Scan and builds the bitmap. The leader then assigns portions of that bitmap to workers to perform the Parallel Bitmap Heap Scan. A Gather step is required to assemble data from the worker processes.

Parallel Index Scan

A Parallel Index Scan works like an Index Scan, except that the workers take turns reading the index. Each worker outputs its data in sorted order for that worker. After the data passes back to the leader, the system merges it a final time to support ordered index scans. Parallel index scans are supported only for B-tree indexes.

Parallel Index Only Scan

A Parallel Index Only Scan works like a Parallel Index Scan. In certain circumstances, the heap doesn't require a visit.

Parallel aggregate node types

A parallel aggregate is a mechanism for speeding up the execution of aggregate functions by distributing the aggregation work across multiple worker processes.

Partial Aggregate

When a parallel sequential or index scan executes with an aggregation, each worker aggregates its data, which results in a partial aggregate.

Finalize Aggregate

The Finalize Aggregate node runs after the partial aggregates pass to the leader, which combines them into the final result. Not all aggregations can use parallel worker processes.
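As a hedged sketch, an aggregate over a large table with parallelism enabled typically produces both stages in one plan:

-- Typical parallel aggregate plan shape:
--   Finalize Aggregate
--     -> Gather
--          -> Partial Aggregate
--               -> Parallel Seq Scan
EXPLAIN SELECT count(*) FROM pgbench_accounts;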

Other node types

The following node types support the scan, join, and aggregate operations described earlier by combining, caching, sorting, or limiting intermediate results.

Bitmap And/Or node

The Bitmap And/Or node combines bitmaps from multiple Bitmap Index Scans. The combined bitmap then passes to the Bitmap Heap Scan, which removes the need to run the heap scan more than once. This node has the same work_mem limitations as the Bitmap Index Scan and might require more work_mem to reduce lossiness.

Lossiness means that an index might return more results than necessary, forcing the database to perform an extra check to filter out the incorrect results. The "loss" refers to the index losing some of the detail that's needed to identify the correct rows on its own.

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE, COSTS, SETTINGS, BUFFERS, WAL)
SELECT * FROM public.effective_io_concurrency_test eict
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200;

                                   QUERY PLAN
---------------------------------------------------------------------------------
Bitmap Heap Scan on public.effective_io_concurrency_test eict  (cost=10318.30..488198.00 rows=619435 width=31) (actual time=113.696..2443.712 rows=589389 loops=1)
   Output: id, value, product_id, effective_date
   Recheck Cond: (((eict.id >= 10000) AND (eict.id <= 100000)) OR ((eict.product_id >= 100) AND (eict.product_id <= 200)))
   Heap Blocks: exact=378308
   Buffers: shared hit=1784 read=379535 written=4115
   I/O Timings: shared read=1201.096 write=37.558
   ->  BitmapOr  (cost=10318.30..10318.30 rows=619925 width=0) (actual time=68.895..68.896 rows=0 loops=1)
         Buffers: shared hit=501 read=2510
         I/O Timings: shared read=11.655
         ->  Bitmap Index Scan on effective_io_concurrency_test_pkey  (cost=0.00..1807.88 rows=92951 width=0) (actual time=2.513..2.513 rows=90001 loops=1)
               Index Cond: ((eict.id >= 10000) AND (eict.id <= 100000))
               Buffers: shared hit=496
         ->  Bitmap Index Scan on effective_io_concurrency_test_groupby  (cost=0.00..8200.71 rows=526974 width=0) (actual time=66.380..66.380 rows=499871 loops=1)
               Index Cond: ((eict.product_id >= 100) AND (eict.product_id <= 200))
               Buffers: shared hit=5 read=2510
               I/O Timings: shared read=11.655
 Settings: effective_cache_size = '19690120kB', random_page_cost = '1.1', work_mem = '256MB'
 Query Identifier: -5140071079400709055
 Planning:
   Buffers: shared hit=36
 Planning Time: 0.133 ms
 Execution Time: 2477.216 ms
 AlloyDB query id: 18229116469546507386
 AlloyDB plan id: 17462269545806790969

Materialize

Typically seen with Common Table Expressions (CTEs), the Materialize node builds an in-memory tuple store so that results can be reused. If a CTE isn't materialized but is referenced multiple times, the system re-executes the CTE for each reference instead of reusing a stored result. Materializing a CTE that's referenced multiple times typically improves performance.

Sort

The Sort node supports any ORDER BY options in the query. All records must be visited before the step completes. The Sort node depends on work_mem: if work_mem isn't large enough, the sort spills to disk and completes as an external merge sort, with multiple round trips to disk.
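The earlier Merge Join example shows Sort Method: external merge, which indicates such a spill. A hedged remediation sketch follows; the value is an example only:

-- With enough work_mem, the same sort can complete in memory and the
-- plan reports "Sort Method: quicksort" instead.
SET work_mem = '1GB';
EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts ORDER BY bid;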

Memoize

The Memoize node uses work_mem and hash_mem_multiplier to build a hash table that caches the results of the parameterized inner scans of a Nested Loop Join. The hash table must fit within those memory constraints for the system to use this node. Memoize can significantly speed up nested loops because the system doesn't re-execute the inner scan for repeated parameter values.
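A hedged sketch with hypothetical tables; Memoize is considered when the inner side of a nested loop is probed repeatedly with the same key values, and the enable_memoize parameter (on by default) controls it:

SET enable_memoize = on;
-- If many orders share the same customer_id, a Memoize node can cache
-- the inner index lookup for each repeated value.
EXPLAIN SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id;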

Append

The Append node concatenates the results of two or more child nodes. It handles operations like UNION or UNION ALL.
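A minimal sketch with hypothetical tables:

-- Each branch of the UNION ALL is planned separately; an Append node
-- concatenates their results.
EXPLAIN SELECT id FROM orders_2023
UNION ALL
SELECT id FROM orders_2024;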

Limit

The Limit node returns the subset of records that the LIMIT clause specifies, sometimes together with an OFFSET clause. With an OFFSET, the cost to return the first row might be higher than expected because the skipped rows must still be produced and discarded.
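As a hedged illustration with a hypothetical table:

-- The 100000 skipped rows are still produced and discarded before the
-- first row returns, so this costs far more than LIMIT alone.
EXPLAIN SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;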

CTE Scan

The CTE Scan node reads the results of a Common Table Expression (CTE), often to join them to another node. Depending on how many times you reference a CTE, adding the MATERIALIZED keyword to the CTE definition might be beneficial.
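A hedged sketch with hypothetical tables that shows the MATERIALIZED keyword:

-- MATERIALIZED computes the CTE once and stores the result, which can
-- pay off when the CTE is referenced more than once.
WITH totals AS MATERIALIZED (
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id
)
SELECT *
FROM totals t1
JOIN totals t2 ON t2.customer_id = t1.customer_id;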

Custom scan

A Custom Scan (columnar scan) node is specific to AlloyDB and indicates that the operation runs on the columnar engine.

(postgres@10.3.1.17:5432) [postgres] > EXPLAIN (ANALYZE,VERBOSE, COLUMNAR_ENGINE) select * from public.index_advisor_test where product_id = 1;
                                   QUERY PLAN
---------------------------------------------------------------------------------
Append  (cost=20.00..27438.78 rows=1166668 width=27) (actual time=0.066..377.029 rows=1000290 loops=1)
   ->  Custom Scan (columnar scan) on public.index_advisor_test  (cost=20.00..27437.66 rows=1166667 width=27) (actual time=0.065..296.904 rows=1000290 loops=1)
         Output: id, value, product_id, effective_date
         Filter: (index_advisor_test.product_id = 1)
         Rows Removed by Columnar Filter: 98999711
         Bytes fetched from storage cache: 774835915
         Columnar cache search mode: native
         Swap-in Time: 92.708 ms
   ->  Seq Scan on public.index_advisor_test  (cost=0.00..1.11 rows=1 width=27) (never executed)
         Output: id, value, product_id, effective_date
         Filter: (index_advisor_test.product_id = 1)
 Query Identifier: -4660018746142248761
 Planning Time: 0.217 ms
 Execution Time: 421.114 ms
 AlloyDB query id: 13855683355620344431
 AlloyDB plan id: 2126918133221480510

This plan output includes the following information:

  • Filter (predicate): the filter applied during the scan, if you use one.
  • Rows Removed by Columnar Filter: the number of rows that the columnar filter removed.
  • Bytes fetched from storage cache: the number of bytes retrieved from the storage cache.
  • Swap-in Time: the time required to swap data in from the columnar spill cache (SSD) when the relation doesn't fit into memory.
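If a plan reports Columnar Check: table is not in the columnar store, you can add the table manually, assuming the columnar engine is enabled on the instance:

-- Adds the table to the columnar store so that later queries can use a
-- Custom Scan (columnar scan).
SELECT google_columnar_engine_add('public.index_advisor_test');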
