PostgreSQL uses MVCC as its primary concurrency control mechanism. While MVCC has many benefits, it also has side effects, one of which is relation bloat. Bloat of relations (tables and indexes) degrades database performance and wastes disk space. To keep PostgreSQL performing at its best, bloated relations need timely garbage collection, and severely bloated relations need to be rebuilt periodically.
In practice, however, garbage collection is not that simple; a series of questions arise:
- What causes relation bloat?
- How to measure relation bloat?
- How to monitor relation bloat?
- How to handle relation bloat?
This article will explain these issues in detail.
Relation Bloat Overview#
Suppose a relation actually occupies 100G of storage, but much of that space is wasted on dead tuples, fragments, and free space, and compacting it into a new relation would bring it down to 60G. Then we can say this relation has an approximate bloat rate of (100 - 60) / 100 = 40%.
Regular VACUUM cannot solve table bloat. Dead tuples themselves can be reclaimed by concurrent (lazy) VACUUM, but the fragments and holes they leave behind cannot: even after a large number of dead tuples are removed, the table file does not shrink. Over time, relation files fill up with holes and waste substantial disk space.
The VACUUM FULL command can reclaim this space: it rewrites the entire table, copying live tuples from the old table file into a new one and compacting it. However, in production this operation holds an AccessExclusiveLock on the table and blocks normal business access, making it unsuitable for services that cannot stop. pg_repack is a practical third-party extension that achieves the effect of VACUUM FULL online, while business traffic continues normally.
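A small illustration (a sketch with a throwaway table t; sizes are approximate and vary by version and layout):
-- Deleting rows leaves holes that plain VACUUM cannot give back to the OS;
-- VACUUM FULL rewrites the file but takes an AccessExclusiveLock while doing so.
CREATE TABLE t AS SELECT generate_series(1, 1000000) AS id;
SELECT pg_size_pretty(pg_relation_size('t'));   -- roughly 35 MB
DELETE FROM t WHERE id % 2 = 0;                 -- half of the tuples become dead
VACUUM t;                                       -- dead tuples reclaimed, file size unchanged
SELECT pg_size_pretty(pg_relation_size('t'));   -- still roughly 35 MB
VACUUM FULL t;                                  -- compacts by rewriting the whole table
SELECT pg_size_pretty(pg_relation_size('t'));   -- roughly halved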
Unfortunately, there's no universal best practice for when to perform VACUUM FULL to handle bloat. DBAs need to formulate cleanup strategies for their specific business scenarios. However, regardless of the strategy adopted, the mechanisms for implementing it are similar:
- Monitor, detect, and measure relation bloat levels
- Handle relation bloat based on bloat level, timing, and other factors
This raises the first key question: how do we define the relation bloat rate?
Measuring Relation Bloat#
To measure relation bloat levels, we first need to define a metric: bloat rate.
The idea behind calculating the bloat rate is to use statistics to estimate the space the target relation would occupy in a fully compact state; the fraction of actually used space exceeding this compact size is the bloat rate. The bloat rate can therefore be defined as 1 - (total bytes occupied by live tuples / total bytes occupied by the relation).
For example, if a table actually occupies 100G of storage, but much space is wasted by dead tuples, fragments, and free areas, and if compressed into a new table it would occupy 60G, then the bloat rate is 1 - 60/100 = 40%.
Getting relation size is relatively simple and can be obtained directly from system catalogs. So the key issue is how to obtain total bytes of live tuples.
Precise Calculation of Bloat Rate#
PostgreSQL ships with the pgstattuple module, which can be used to compute table bloat rates precisely. In the example below, the tuple_percent field is the percentage of live tuple bytes in the total relation size; one minus that ratio (computed here as 1 - tuple_len / table_len) gives the bloat rate.
vonng@[local]:5432/bench# select *,
1.0 - tuple_len::numeric / table_len as bloat
from pgstattuple('pgbench_accounts');
┌─[ RECORD 1 ]───────┬────────────────────────┐
│ table_len │ 136642560 │
│ tuple_count │ 1000000 │
│ tuple_len │ 121000000 │
│ tuple_percent │ 88.55 │
│ dead_tuple_count │ 16418 │
│ dead_tuple_len │ 1986578 │
│ dead_tuple_percent │ 1.45 │
│ free_space │ 1674768 │
│ free_percent │ 1.23 │
│ bloat │ 0.11447794889088729017 │
└────────────────────┴────────────────────────┘
pgstattuple is very useful for precisely determining table and index bloat. For details, see the official documentation: https://www.postgresql.org/docs/current/static/pgstattuple.html
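The module also includes a pgstatindex function for B-tree indexes; as a rough reading, 100 - avg_leaf_density can be treated as the index bloat percentage (a sketch against the pgbench primary key index):
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('pgbench_accounts_pkey');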
Additionally, PostgreSQL provides two built-in extensions, pg_freespacemap and pageinspect. The former examines the amount of free space in each page, while the latter shows precisely the physical contents of each data page in a relation. These two extensions are very practical for inspecting the internal state of relations. Detailed usage can be found in the official documentation:
https://www.postgresql.org/docs/current/static/pgfreespacemap.html
https://www.postgresql.org/docs/current/static/pageinspect.html
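A brief sketch of what these two extensions expose, again using pgbench_accounts:
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- free space recorded in the free space map for the first few pages
SELECT * FROM pg_freespace('pgbench_accounts') LIMIT 5;

-- page header of block 0: the gap between lower and upper is the unused space inside the page
SELECT lower, upper, pagesize
FROM page_header(get_raw_page('pgbench_accounts', 0));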
However, in most cases, we don’t care too much about the precision of bloat rates. In actual production, the requirements for bloat rates aren’t high: having the first significant digit accurate is generally sufficient. On the other hand, to know precisely the total bytes occupied by live tuples, a full scan of the entire relation is needed, which puts pressure on the online system’s I/O. If you want to monitor bloat rates for all tables, this approach isn’t suitable.
For example, a precise bloat-rate estimate of a 200G relation with the pgstattuple extension takes roughly 5 minutes. In version 9.5 and later, the pgstattuple extension also provides the pgstattuple_approx function, which trades precision for speed, but even the approximate version still takes seconds.
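For example, a sketch against the same pgbench_accounts table:
-- approximate version: skips pages marked all-visible in the visibility map,
-- so it only scans a fraction of the table
SELECT approx_tuple_percent, dead_tuple_percent, approx_free_percent
FROM pgstattuple_approx('pgbench_accounts');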
For monitoring bloat rates, the most important requirement is fast speed and low impact. Therefore, when we need to monitor many tables across many databases simultaneously, we need to perform fast estimation of bloat rates to avoid impacting business operations.
Estimating Bloat Rate#
PostgreSQL maintains a rich set of statistics for every relation. Using these statistics, we can estimate the bloat rate of all tables in the database quickly and cheaply. The estimate relies on both table-level and column-level statistics; three metrics are used directly:
- Average tuple width avgwidth: calculated from column-level statistics, used to estimate the space occupied in a compact state
- Tuple count pg_class.reltuples: used to estimate the space occupied in a compact state
- Page count pg_class.relpages: used to measure the space actually used
The calculation formula is also simple:
1 - (reltuples * avgwidth) / (block_size - pageheader) / relpages
Here block_size is the page size (default 8192 bytes) and pageheader is the page header overhead (24 bytes). Page size minus header size gives the space actually available for storing tuples in each page. (reltuples * avgwidth) is the estimated total size of all tuples; dividing it by the usable space per page gives the number of pages expected to hold all tuples compactly. Finally, the expected page count divided by the actual page count gives the utilization rate, and one minus the utilization rate is the bloat rate.
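As a minimal sketch (assuming a public.pgbench_accounts table that has been analyzed), the formula can be evaluated directly against pg_class and pg_stats, ignoring the header and padding details handled in the following sections:
-- rough bloat estimate for one table straight from statistics (no I/O against the table itself)
SELECT 1 - c.reltuples * s.avgwidth / (8192 - 24) / c.relpages AS bloat_estimate
FROM pg_class c,
     LATERAL (SELECT sum((1 - null_frac) * avg_width) AS avgwidth
              FROM pg_stats
              WHERE schemaname = 'public' AND tablename = 'pgbench_accounts') s
WHERE c.oid = 'public.pgbench_accounts'::regclass
  AND c.relpages > 0;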
Difficulties#
The key here is how to use statistical information to estimate average tuple length. To achieve this, we need to overcome three difficulties:
- When tuples contain null values, headers will have null bitmaps
- There’s padding between headers and data sections, requiring boundary alignment consideration
- Some field types also have alignment requirements
Fortunately, bloat rate itself is an estimation, so being roughly correct is sufficient.
Calculating Average Tuple Length#
To understand the estimation process, we first need to understand PostgreSQL’s internal layout of data pages and tuples.
First, let's look at the average tuple length. A tuple in PostgreSQL is laid out as follows.
Space occupied by a tuple can be divided into three parts:
- Fixed-length line pointer (4 bytes; strictly speaking this isn't part of the tuple itself, but there is exactly one per tuple)
- Variable-length header
  - Fixed part: 23 bytes
  - When the tuple contains null values, a null bitmap appears, one bit per field, so its length is the number of fields divided by 8
  - After the null bitmap, padding to MAXALIGN (usually 8) is required
  - If the table was created WITH OIDS, the tuple also carries a 4-byte OID, but we don't consider that case here
- Data section
Therefore, a tuple’s average length (including corresponding line pointer) can be calculated as:
avg_size_tuple = 4 + avg_size_hdr + avg_size_data
The key is finding average header length and average data section length.
Calculating Average Header Length#
The main variables in average header length are null bitmap and padding alignment. To estimate average tuple header length, we need several parameters:
- Average header length without null bitmap (with padding): normhdr
- Average header length with null bitmap (with padding): nullhdr
- Proportion of tuples containing null values: nullfrac
The formula for estimating average header length is also very simple:
avg_size_hdr = nullhdr * nullfrac + normhdr * (1 - nullfrac)
Since a header without a null bitmap is 23 bytes long and pads to 24 bytes on an 8-byte boundary, the formula above becomes:
avg_size_hdr = nullhdr * nullfrac + 24 * (1 - nullfrac)
To calculate the length of a value padded to 8-byte boundaries, use this formula for efficient computation:
padding = lambda x : x + 7 >> 3 << 3
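The same MAXALIGN rounding can be expressed with plain integer arithmetic in SQL, for example padding the 23-byte header up to 24:
SELECT (23 + 7) / 8 * 8 AS padded;  -- integer division: 30 / 8 = 3, times 8 = 24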
Calculating Average Data Section Length#
Average data section length mainly depends on each field’s average width and null rate, plus trailing alignment.
The following SQL can calculate average tuple data section width for all tables using statistical information:
SELECT schemaname, tablename, sum((1 - null_frac) * avg_width)
FROM pg_stats GROUP BY (schemaname, tablename);
For example, this SQL retrieves the statistics needed to estimate the average tuple length of the table app.apple from the pg_stats system view:
SELECT
count(*), -- number of fields
ceil(count(*) / 8.0), -- bytes occupied by null bitmap
max(null_frac), -- maximum null rate
sum((1 - null_frac) * avg_width) -- average width of data section
FROM pg_stats
where schemaname = 'app' and tablename = 'apple';
-[ RECORD 1 ]-----------
count | 47
ceil | 6
max | 1
sum | 1733.76873471724
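Plugging these numbers into the formulas above (a rough sketch that uses max(null_frac) as a stand-in for nullfrac): nullhdr = 23 + 6 = 29, which pads to 32; with nullfrac = 1 this gives avg_size_hdr = 32 * 1 + 24 * (1 - 1) = 32; the data section averages about 1734 bytes, padding to 1736; so avg_size_tuple is roughly 4 + 32 + 1736 = 1772 bytes, and the expected compact page count is about reltuples * 1772 / 8168.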
Integration#
Integrating the logic from the above three sections, we get the following stored procedure that returns bloat rate for a given table:
CREATE OR REPLACE FUNCTION public.pg_table_bloat(relation regclass)
    RETURNS double precision
    LANGUAGE plpgsql
AS $function$
DECLARE
    _schemaname text;
    _relname    text;
    tuples      BIGINT  := 0;
    pages       INTEGER := 0;
    nullheader  INTEGER := 0;
    nullfrac    FLOAT   := 0;
    datawidth   INTEGER := 0;
    avgtuplelen FLOAT   := 24;
BEGIN
    -- table-level statistics: schema, name, estimated tuple count, actual page count
    SELECT relnamespace::regnamespace, relname, reltuples, relpages
    INTO _schemaname, _relname, tuples, pages
    FROM pg_class
    WHERE oid = relation;

    IF pages IS NULL OR pages = 0 THEN
        RETURN 0; -- empty or never-analyzed relation: nothing to estimate
    END IF;

    -- column-level statistics: header length with null bitmap, null fraction, data width
    SELECT 23 + ceil(count(*) / 8.0),               -- 23-byte fixed header + null bitmap (1 bit per column)
           max(null_frac),
           ceil(sum((1 - null_frac) * avg_width))
    INTO nullheader, nullfrac, datawidth
    FROM pg_stats
    WHERE schemaname = _schemaname AND tablename = _relname;

    -- average tuple length: padded data section + weighted average of padded headers
    -- (the 4-byte line pointer per tuple is ignored here as an approximation)
    SELECT (datawidth + 8 - (CASE WHEN datawidth % 8 = 0 THEN 8 ELSE datawidth % 8 END))
           + (1 - nullfrac) * 24
           + nullfrac * (nullheader + 8 - (CASE WHEN nullheader % 8 = 0 THEN 8 ELSE nullheader % 8 END))
    INTO avgtuplelen;

    raise notice '% %', nullfrac, datawidth; -- debug output

    RETURN 1 - (ceil(tuples * avgtuplelen / 8168)) / pages; -- 8168 = 8192-byte page - 24-byte page header
END;
$function$;
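Usage sketch, assuming the pgbench_accounts table from the earlier example has been analyzed:
-- statistics-only estimate; compare with the precise pgstattuple result above
SELECT public.pg_table_bloat('pgbench_accounts');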
Batch Calculation#
For monitoring, we usually care about every table in the database, not just one. The bloat-rate logic above can therefore be rewritten as a batch query and wrapped in views for easy use. The example below defines an index bloat view; a table bloat view (monitor.pg_bloat_tables, used by the scripts later in this article) can be built the same way:
DROP VIEW IF EXISTS monitor.pg_bloat_indexes CASCADE;
CREATE OR REPLACE VIEW monitor.pg_bloat_indexes AS
WITH btree_index_atts AS (
SELECT
pg_namespace.nspname,
indexclass.relname AS index_name,
indexclass.reltuples,
indexclass.relpages,
pg_index.indrelid,
pg_index.indexrelid,
indexclass.relam,
tableclass.relname AS tablename,
(regexp_split_to_table((pg_index.indkey) :: TEXT, ' ' :: TEXT)) :: SMALLINT AS attnum,
pg_index.indexrelid AS index_oid
FROM ((((pg_index
JOIN pg_class indexclass ON ((pg_index.indexrelid = indexclass.oid)))
JOIN pg_class tableclass ON ((pg_index.indrelid = tableclass.oid)))
JOIN pg_namespace ON ((pg_namespace.oid = indexclass.relnamespace)))
JOIN pg_am ON ((indexclass.relam = pg_am.oid)))
WHERE ((pg_am.amname = 'btree' :: NAME) AND (indexclass.relpages > 0))
), index_item_sizes AS (
SELECT
ind_atts.nspname,
ind_atts.index_name,
ind_atts.reltuples,
ind_atts.relpages,
ind_atts.relam,
ind_atts.indrelid AS table_oid,
ind_atts.index_oid,
(current_setting('block_size' :: TEXT)) :: NUMERIC AS bs,
8 AS maxalign,
24 AS pagehdr,
CASE
WHEN (max(COALESCE(pg_stats.null_frac, (0) :: REAL)) = (0) :: FLOAT)
THEN 2
ELSE 6
END AS index_tuple_hdr,
sum((((1) :: FLOAT - COALESCE(pg_stats.null_frac, (0) :: REAL)) *
(COALESCE(pg_stats.avg_width, 1024)) :: FLOAT)) AS nulldatawidth
FROM ((pg_attribute
JOIN btree_index_atts ind_atts
ON (((pg_attribute.attrelid = ind_atts.indexrelid) AND (pg_attribute.attnum = ind_atts.attnum))))
JOIN pg_stats ON (((pg_stats.schemaname = ind_atts.nspname) AND (((pg_stats.tablename = ind_atts.tablename) AND
((pg_stats.attname) :: TEXT =
pg_get_indexdef(pg_attribute.attrelid,
(pg_attribute.attnum) :: INTEGER,
TRUE))) OR
((pg_stats.tablename = ind_atts.index_name) AND
(pg_stats.attname = pg_attribute.attname))))))
WHERE (pg_attribute.attnum > 0)
GROUP BY ind_atts.nspname, ind_atts.index_name, ind_atts.reltuples, ind_atts.relpages, ind_atts.relam,
ind_atts.indrelid, ind_atts.index_oid, (current_setting('block_size' :: TEXT)) :: NUMERIC, 8 :: INTEGER
), index_aligned_est AS (
SELECT
index_item_sizes.maxalign,
index_item_sizes.bs,
index_item_sizes.nspname,
index_item_sizes.index_name,
index_item_sizes.reltuples,
index_item_sizes.relpages,
index_item_sizes.relam,
index_item_sizes.table_oid,
index_item_sizes.index_oid,
COALESCE(ceil((((index_item_sizes.reltuples * ((((((((6 + index_item_sizes.maxalign) -
CASE
WHEN ((index_item_sizes.index_tuple_hdr %
index_item_sizes.maxalign) = 0)
THEN index_item_sizes.maxalign
ELSE (index_item_sizes.index_tuple_hdr %
index_item_sizes.maxalign)
END)) :: FLOAT + index_item_sizes.nulldatawidth)
+ (index_item_sizes.maxalign) :: FLOAT) - (
CASE
WHEN (((index_item_sizes.nulldatawidth) :: INTEGER %
index_item_sizes.maxalign) = 0)
THEN index_item_sizes.maxalign
ELSE ((index_item_sizes.nulldatawidth) :: INTEGER %
index_item_sizes.maxalign)
END) :: FLOAT)) :: NUMERIC) :: FLOAT) /
((index_item_sizes.bs - (index_item_sizes.pagehdr) :: NUMERIC)) :: FLOAT) +
(1) :: FLOAT)), (0) :: FLOAT) AS expected
FROM index_item_sizes
), raw_bloat AS (
SELECT
current_database() AS dbname,
index_aligned_est.nspname,
pg_class.relname AS table_name,
index_aligned_est.index_name,
(index_aligned_est.bs * ((index_aligned_est.relpages) :: BIGINT) :: NUMERIC) AS totalbytes,
index_aligned_est.expected,
CASE
WHEN ((index_aligned_est.relpages) :: FLOAT <= index_aligned_est.expected)
THEN (0) :: NUMERIC
ELSE (index_aligned_est.bs *
((((index_aligned_est.relpages) :: FLOAT - index_aligned_est.expected)) :: BIGINT) :: NUMERIC)
END AS wastedbytes,
CASE
WHEN ((index_aligned_est.relpages) :: FLOAT <= index_aligned_est.expected)
THEN (0) :: NUMERIC
ELSE (((index_aligned_est.bs * ((((index_aligned_est.relpages) :: FLOAT -
index_aligned_est.expected)) :: BIGINT) :: NUMERIC) * (100) :: NUMERIC) /
(index_aligned_est.bs * ((index_aligned_est.relpages) :: BIGINT) :: NUMERIC))
END AS realbloat,
pg_relation_size((index_aligned_est.table_oid) :: REGCLASS) AS table_bytes,
stat.idx_scan AS index_scans
FROM ((index_aligned_est
JOIN pg_class ON ((pg_class.oid = index_aligned_est.table_oid)))
JOIN pg_stat_user_indexes stat ON ((index_aligned_est.index_oid = stat.indexrelid)))
), format_bloat AS (
SELECT
raw_bloat.dbname AS database_name,
raw_bloat.nspname AS schema_name,
raw_bloat.table_name,
raw_bloat.index_name,
round(
raw_bloat.realbloat) AS bloat_pct,
round((raw_bloat.wastedbytes / (((1024) :: FLOAT ^
(2) :: FLOAT)) :: NUMERIC)) AS bloat_mb,
round((raw_bloat.totalbytes / (((1024) :: FLOAT ^ (2) :: FLOAT)) :: NUMERIC),
3) AS index_mb,
round(
((raw_bloat.table_bytes) :: NUMERIC / (((1024) :: FLOAT ^ (2) :: FLOAT)) :: NUMERIC),
3) AS table_mb,
raw_bloat.index_scans
FROM raw_bloat
)
SELECT
format_bloat.database_name as datname,
format_bloat.schema_name as nspname,
format_bloat.table_name as relname,
format_bloat.index_name as idxname,
format_bloat.index_scans as idx_scans,
format_bloat.bloat_pct as bloat_pct,
format_bloat.table_mb,
format_bloat.index_mb - format_bloat.bloat_mb as actual_mb,
format_bloat.bloat_mb,
format_bloat.index_mb as total_mb
FROM format_bloat
ORDER BY format_bloat.bloat_mb DESC;
COMMENT ON VIEW monitor.pg_bloat_indexes IS 'index bloat monitor';
Although it looks long, querying this view for the bloat rates of all indexes in an entire database (3TB) takes only about 50ms of computation. And it reads only statistics, never the relations themselves, so it consumes no instance I/O.
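A typical way to consume the view, for instance from a monitoring job or an ad-hoc check:
-- indexes with more than 30% bloat, largest wasted space first
SELECT datname, nspname, relname, idxname, bloat_pct, bloat_mb, total_mb
FROM monitor.pg_bloat_indexes
WHERE bloat_pct > 30
ORDER BY bloat_mb DESC
LIMIT 20;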
Handling Table Bloat#
If it's just a toy database, or the business can tolerate a long daily maintenance window, then simply running VACUUM FULL is enough. But VACUUM FULL requires an exclusive lock that blocks both reads and writes on the table. For databases that must keep running, we need pg_repack to handle table bloat instead.
- Homepage: http://reorg.github.io/pg_repack/
pg_repack is included in PostgreSQL's official yum repository, so it can be installed directly with yum; the package name carries the PostgreSQL major version, e.g. for version 10:
yum install pg_repack10
Using pg_repack#
Like most PostgreSQL client programs, pg_repack connects to the PostgreSQL server using the usual connection parameters.
Before using pg_repack, you need to create the pg_repack extension in the database to be reorganized:
CREATE EXTENSION pg_repack;
Then you can use it normally. Several typical usage patterns:
# Complete cleanup of entire database, 5 concurrent tasks, 10 second timeout
pg_repack -d <database> -j 5 -T 10
# Clean specific table mytable in mydb, 10 second timeout
pg_repack mydb -t public.mytable -T 10
# Clean specific index myschema.myindex, must use full name with schema
pg_repack mydb -i myschema.myindex
Detailed usage can be found in the official documentation.
pg_repack Strategy#
Usually, if the business has peak and off-peak cycles, schedule reorganization during the off-peak periods. pg_repack runs quickly but is resource-intensive; running it during peak hours may affect overall database performance and cause replication lag.
For example, you can use the bloat-rate monitoring views from the previous two sections to pick the most severely bloated tables and indexes each day and reorganize them automatically.
#--------------------------------------------------------------#
# Name: repack_tables
# Desc: repack table via fullname
# Arg1: database_name
# Argv: list of table full name
# Deps: psql
#--------------------------------------------------------------#
# repack a list of tables in the given database
function repack_tables(){
local db=$1
shift
log_info "repack ${db} tables begin"
log_info "repack table list: $@"
for relname in $@
do
old_size=$(psql ${db} -Atqc "SELECT pg_size_pretty(pg_relation_size('${relname}'));")
# kill_queries ${db}
log_info "repack table ${relname} begin, old size: ${old_size}"
pg_repack ${db} -T 10 -t ${relname}
new_size=$(psql ${db} -Atqc "SELECT pg_size_pretty(pg_relation_size('${relname}'));")
log_info "repack table ${relname} done , new size: ${old_size} -> ${new_size}"
done
log_info "repack ${db} tables done"
}
#--------------------------------------------------------------#
# Name: get_bloat_tables
# Desc: find bloat tables in given database match some condition
# Arg1: database_name
# Echo: list of full table name
# Deps: psql, monitor.pg_bloat_tables
#--------------------------------------------------------------#
function get_bloat_tables(){
echo $(psql ${1} -Atq <<-'EOF'
WITH bloat_tables AS (
SELECT
nspname || '.' || relname as relname,
actual_mb,
bloat_pct
FROM monitor.pg_bloat_tables
WHERE nspname NOT IN ('dba', 'monitor', 'trash')
ORDER BY 2 DESC,3 DESC
)
-- 64 small + 16 medium + 4 large
(SELECT relname FROM bloat_tables WHERE actual_mb < 256 AND bloat_pct > 40 ORDER BY bloat_pct DESC LIMIT 64) UNION
(SELECT relname FROM bloat_tables WHERE actual_mb BETWEEN 256 AND 1024 AND bloat_pct > 30 ORDER BY bloat_pct DESC LIMIT 16) UNION
(SELECT relname FROM bloat_tables WHERE actual_mb BETWEEN 1024 AND 4096 AND bloat_pct > 20 ORDER BY bloat_pct DESC LIMIT 4);
EOF
)
}
Here, three rules are set:
- From small tables < 256MB with bloat rate > 40%, select TOP64
- From medium tables 256MB to 1GB with bloat rate > 30%, select TOP16
- From large tables 1GB to 4GB with bloat rate > 20%, select TOP4
Select these tables for automatic reorganization during early morning valleys. Tables over 4GB are handled manually.
But when to perform reorganization still depends on specific business patterns.
pg_repack Principles#
The principle behind pg_repack is quite simple: it builds a compact copy of the table to be rebuilt. It first takes a full snapshot and writes all live tuples into the new table, while changes made to the original table are captured by triggers and replayed onto the copy. Finally, the old table is replaced with the compact copy by renaming. For indexes, the same is accomplished with PostgreSQL's CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY.
Reorganizing Tables
- Create an empty table with the same schema as the original table, but without indexes
- Create a log table corresponding to the original table, to record changes made to it while pg_repack is running
- Add a row trigger to the original table that records every INSERT, DELETE, and UPDATE in the log table
- Copy the data from the old table into the new empty table
- Create the same indexes on the new table
- Apply the incremental changes from the log table to the new table
- Swap the new and old tables by renaming them
- Drop the old, renamed table
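A drastically simplified, blocking sketch of this copy-and-swap idea (hypothetical table t with an assumed index on id; real pg_repack stays online by capturing concurrent changes in the log table and only takes brief locks for the final swap):
BEGIN;
CREATE TABLE t_new (LIKE t);           -- same columns as t, no indexes yet
INSERT INTO t_new SELECT * FROM t;     -- copy only the live tuples, compactly
CREATE INDEX ON t_new (id);            -- recreate the indexes (assumed index on id)
ALTER TABLE t RENAME TO t_old;         -- swap by renaming
ALTER TABLE t_new RENAME TO t;
DROP TABLE t_old;                      -- drop the bloated original
COMMIT;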
Reorganizing Indexes
- Use CREATE INDEX CONCURRENTLY to create a new index on the original table with the same definition as the old index
- Analyze the new index, mark the old index as invalid, and swap the new and old index files in the data directory
- Drop the old index
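For a plain (non-constraint) index, the same idea can be sketched by hand with hypothetical names t, id and t_id_idx; PostgreSQL 12 later added REINDEX CONCURRENTLY, which automates this:
CREATE INDEX CONCURRENTLY t_id_idx_new ON t (id);  -- same definition as the old index
DROP INDEX CONCURRENTLY t_id_idx;                  -- drop the bloated old index
ALTER INDEX t_id_idx_new RENAME TO t_id_idx;       -- keep the original name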
pg_repack Considerations#
- Before starting a reorganization, it's best to cancel any ongoing VACUUM tasks
- Before reorganizing an index, it's best to manually clean up queries that might be using that index
- If something goes wrong (for example, pg_repack is killed midway), garbage might be left behind that needs manual cleanup. This may include:
  - Temporary tables and temporary indexes created in the same schema as the original table/index
    - Temporary table name: ${schema_name}.table_${table_oid}
    - Temporary index name: ${schema_name}.index_${table_oid}
  - Triggers that remain on the original table and need to be dropped manually
- When reorganizing a particularly large table, reserve at least as much free disk space as the table plus its indexes; this requires special care and manual checking
- When the reorganization completes and the rename swap is performed, a large amount of WAL is generated, which may cause unavoidable replication lag
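If a run was interrupted, a couple of catalog queries along these lines (a sketch based on the naming patterns listed above; adjust to your environment) can help spot leftovers:
-- leftover temporary tables / indexes / log tables matching the naming pattern above
SELECT n.nspname, c.relname, c.relkind
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname ~ '^(table|index|log)_[0-9]+$';

-- leftover repack triggers remaining on original tables
SELECT tgrelid::regclass AS table_name, tgname
FROM pg_trigger
WHERE NOT tgisinternal AND tgname ~ 'repack';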