Incident Report: Patroni Failure Due to Time Travel

Author: Ruohang Feng (Pigsty Founder, @Vonng)

Summary: A machine restarted after a failure. The NTP service corrected the system clock after PostgreSQL had already started, and as a result Patroni failed to start.

Patroni reports the failure with the following message:

Process %s is not postmaster, too much difference between PID file start time %s and process start time %s

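Patroni compares the start time recorded in the postmaster PID file with the start time of the process that owns that PID. When the two are inconsistent, it concludes that the process is not the postmaster, that is, that postgres is not running.

If the two times differ by more than 3 seconds (the threshold in the code quoted below), Patroni rejects the process and cannot start.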

The code that prints the error message is:

start_time = int(self._postmaster_pid.get('start_time', 0))
if start_time and abs(self.create_time() - start_time) > 3:
    logger.info('Process %s is not postmaster, too much difference between PID file start time %s and process start time %s', self.pid, self.create_time(), start_time)
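
The check can be sketched as a standalone function to show how a clock correction trips it. This is a minimal illustration, not Patroni's actual API: the function name `looks_like_postmaster`, the constant `TOLERANCE_SECONDS`, and the 60-second clock step are all assumptions made for the example.

```python
# Hypothetical standalone sketch of the comparison; not Patroni's real interface.
TOLERANCE_SECONDS = 3  # threshold taken from the snippet quoted above


def looks_like_postmaster(pid_file_start_time: int,
                          process_create_time: float,
                          tolerance: int = TOLERANCE_SECONDS) -> bool:
    """Return False when the two start times disagree by more than `tolerance` seconds."""
    if not pid_file_start_time:
        # A missing or zero start time skips the comparison, as in the quoted snippet.
        return True
    return abs(process_create_time - pid_file_start_time) <= tolerance


# Assumed scenario: postgres wrote its PID file before the clock was corrected,
# and the process start time observed afterwards is shifted by the correction.
pid_file_start_time = 1_700_000_000
clock_step = 60  # assumed size of the NTP correction, in seconds

print(looks_like_postmaster(pid_file_start_time, pid_file_start_time + 1))
print(looks_like_postmaster(pid_file_start_time, pid_file_start_time + clock_step))
```

Under these assumptions the first call passes the check and the second fails it, which is the situation described in this report.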

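This incident also uncovered a bug in Patroni: https://github.com/zalando/patroni/issues/811. The two timestamps in the error message are reversed: the message names the PID file start time first and the process start time second, but the logging call passes self.create_time() (the process start time) before start_time (the PID file value).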
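
The effect of the argument order can be seen by formatting the message by hand. The sketch below uses made-up values and two hypothetical helper functions. The corrected order is one possible fix and is not necessarily how the upstream issue was resolved.

```python
# Message text copied from the logging call quoted earlier.
MESSAGE = ('Process %s is not postmaster, too much difference between '
           'PID file start time %s and process start time %s')


def format_as_logged(pid, create_time, start_time):
    # Same argument order as the quoted call: process create time comes first.
    return MESSAGE % (pid, create_time, start_time)


def format_corrected(pid, create_time, start_time):
    # Hypothetical fix: pass the PID file value first so it matches the wording.
    return MESSAGE % (pid, start_time, create_time)


# Made-up example values: PID file says 1000, the process reports 1060.
print(format_as_logged(1234, create_time=1060, start_time=1000))
print(format_corrected(1234, create_time=1060, start_time=1000))
```

With the original order, the value labeled "PID file start time" is really the process start time, which can mislead anyone reading the log during an incident.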

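Lesson learned: NTP time synchronization is very important. A clock correction that lands after PostgreSQL has started can be enough to keep Patroni from starting.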
