PGCOPYDB CLONE(1)
pgcopydb clone - Clone an entire database from source to target
The main pgcopydb operation is the clone operation, and for historical and user-friendliness reasons the following aliases are available that implement the same operation:
pgcopydb clone  Clone an entire database from source to target
pgcopydb fork   Clone an entire database from source to target
The command pgcopydb clone copies a database from the given source Postgres instance to the target Postgres instance.
pgcopydb clone: Clone an entire database from source to target
usage: pgcopydb clone --source ... --target ... [ --table-jobs ... --index-jobs ... ]

  --source                      Postgres URI to the source database
  --target                      Postgres URI to the target database
  --dir                         Work directory to use
  --table-jobs                  Number of concurrent COPY jobs to run
  --index-jobs                  Number of concurrent CREATE INDEX jobs to run
  --restore-jobs                Number of concurrent jobs for pg_restore
  --large-objects-jobs          Number of concurrent Large Objects jobs to run
  --split-tables-larger-than    Same-table concurrency size threshold
  --drop-if-exists              On the target database, clean-up from a previous run first
  --roles                       Also copy roles found on source to target
  --no-role-passwords           Do not dump passwords for roles
  --no-owner                    Do not set ownership of objects to match the original database
  --no-acl                      Prevent restoration of access privileges (grant/revoke commands)
  --no-comments                 Do not output commands to restore comments
  --skip-large-objects          Skip copying large objects (blobs)
  --skip-extensions             Skip restoring extensions
  --skip-ext-comments           Skip restoring COMMENT ON EXTENSION
  --skip-collations             Skip restoring collations
  --skip-vacuum                 Skip running VACUUM ANALYZE
  --requirements <filename>     List extensions requirements
  --filters <filename>          Use the filters defined in <filename>
  --fail-fast                   Abort early in case of error
  --restart                     Allow restarting when temp files exist already
  --resume                      Allow resuming operations after a failure
  --not-consistent              Allow taking a new snapshot on the source database
  --snapshot                    Use snapshot obtained with pg_export_snapshot
  --follow                      Implement logical decoding to replay changes
  --plugin                      Output plugin to use (test_decoding, wal2json)
  --wal2json-numeric-as-string  Print numeric data type as string when using wal2json output plugin
  --slot-name                   Use this Postgres replication slot name
  --create-slot                 Create the replication slot
  --origin                      Use this Postgres replication origin node name
  --endpos                      Stop replaying changes when reaching this LSN
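For instance, a minimal invocation could look like the following, where the connection strings and job counts are placeholders to adapt to your environment:

$ pgcopydb clone \
    --source "postgres://user@source.host:5432/dbname" \
    --target "postgres://user@target.host:5432/dbname" \
    --table-jobs 4 \
    --index-jobs 4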
The command pgcopydb fork copies a database from the given source Postgres instance to the target Postgres instance. This command is an alias to the command pgcopydb clone seen above.
The pgcopydb clone command implements both a base copy of a source database into a target database and also a full Logical Decoding client for the test_decoding and wal2json logical decoding plugins.
The pgcopydb clone command implements the following steps:
When filtering is used, the list of object OIDs that are meant to be filtered out is built during this step.
When filtering is used, the pg_restore --use-list feature is used to filter the list of objects to restore in this step.
This step uses as many as --restore-jobs jobs for pg_restore to share the workload and restore the objects in parallel.
A Postgres connection and a SQL query to the Postgres catalog table pg_class are used to get the list of tables with data to copy, and the reltuples statistic is used to start with the tables with the greatest number of rows first, in an attempt to minimize the copy time.
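As an illustration only, and not the exact query that pgcopydb runs, that kind of catalog lookup can be sketched as follows:

$ psql -d "$PGCOPYDB_SOURCE_PGURI" -c "
    select c.oid::regclass as table_name, c.reltuples
      from pg_class c
      join pg_namespace n on n.oid = c.relnamespace
     where c.relkind = 'r'
       and n.nspname not in ('pg_catalog', 'information_schema')
  order by c.reltuples desc;"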
This step is much like pg_dump | pg_restore for large objects data parts, except that there isn't a good way to do just that with the tooling.
The primary indexes are created as UNIQUE indexes at this stage.
For each sequence, pgcopydb then calls pg_catalog.setval() on the target database with the information obtained on the source database.
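For illustration, the effect on a given sequence is equivalent to running something like the following against the target database, where the sequence name and values are hypothetical and would actually be read from the source database:

$ psql -d "$PGCOPYDB_TARGET_PGURI" \
       -c "select pg_catalog.setval('public.some_sequence', 1234, true);"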
The post-data script is filtered out using the pg_restore --use-list option so that indexes and primary key constraints already created in steps 6 and 7 are properly skipped now.
This step uses as many as --restore-jobs jobs for pg_restore to share the workload and restore the objects in parallel.
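To illustrate the mechanism only (pgcopydb computes and filters its own list files, the paths here are just examples), the pg_restore archive list feature works like this:

# produce the archive table of contents
$ pg_restore --list /tmp/pgcopydb/schema/post.dump > post.list

# comment out entries to skip by prefixing them with a semi-colon,
# then restore only the remaining entries
$ pg_restore --dbname "$PGCOPYDB_TARGET_PGURI" \
             --use-list post.list \
             /tmp/pgcopydb/schema/post.dump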
Postgres has a notion of a superuser status that can be assigned to any role in the system, and the default role postgres has this status. From the Role Attributes documentation page we see that:
A database superuser bypasses all permission checks, except the right to log in. This is a dangerous privilege and should not be used carelessly; it is best to do most of your work as a role that is not a superuser. To create a new database superuser, use CREATE ROLE name SUPERUSER. You must do this as a role that is already a superuser.
Some Postgres objects can only be created by superusers, and some read and write operations are only allowed to superuser roles, such as the following non-exclusive list:
It is possible to implement a pgcopydb migration that skips the passwords entirely when using the option --no-role-passwords. In that case, though, authentication might fail until passwords have been set up again correctly.
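For example, a password could then be set again manually on the target, with a role name that is of course specific to your environment:

$ psql -d "$PGCOPYDB_TARGET_PGURI" \
       -c "alter role app_user with password 'choose-a-new-password';"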
When such an extension contains Extension Configuration Tables and has been created with a role having superuser status, then the same superuser status is needed again to pg_dump and pg_restore that extension and its current configuration.
When using pgcopydb it is possible to split your migration in privileged and non-privileged parts, like in the following examples:
$ coproc ( pgcopydb snapshot )

# first two commands would use a superuser role to connect
$ pgcopydb copy roles --source ... --target ...
$ pgcopydb copy extensions --source ... --target ...

# now it's possible to use a non-superuser role to connect
$ pgcopydb clone --skip-extensions --source ... --target ...

$ kill -TERM ${COPROC_PID}
$ wait ${COPROC_PID}
In such a script, the calls to pgcopydb copy roles and pgcopydb copy extensions would be done with connection strings that connect with a role having superuser status; and then the call to pgcopydb clone would be done with a non-privileged role, typically the role that owns the source and target databases.
WARNING:
That's because pg_dump filtering (here, the --exclude-table option) does not apply to extension members, and pg_dump does not provide a mechanism to exclude extensions.
When using the --follow option the steps from the pgcopydb follow command are also run concurrently to the main copy. The Change Data Capture is then automatically driven from a prefetch-only phase to the prefetch-and-catchup phase, which is enabled as soon as the base copy is done.
See the command pgcopydb stream sentinel set endpos to remotely control the follow parts of the command even while the command is already running.
The command pgcopydb stream cleanup must be used to free resources created to support the change data capture process.
IMPORTANT:
A simple approach to applying changes after the initial base copy has been done follows:
$ pgcopydb clone --follow &

# later when the application is ready to make the switch
$ pgcopydb stream sentinel set endpos --current

# later when the migration is finished, clean-up both source and target
$ pgcopydb stream cleanup
In some cases, it might be necessary to have more control over some of the steps taken here. Given pgcopydb flexibility, it's possible to implement the following steps:
In case of crash or other problems with the main operations, it's then possible to resume processing of the base copy and the applying of the changes with the same snapshot again.
This step is also implemented when using pgcopydb clone --follow. That said, if the command was interrupted (or crashed), then the snapshot would be lost.
The following SQL objects are then created:
This step is also implemented when using pgcopydb clone --follow. There is no way to implement Change Data Capture with pgcopydb and skip creating those SQL objects.
Sequences are not handled by Postgres logical decoding, so extra care needs to be taken manually here.
IMPORTANT:
If the command pgcopydb clone --follow fails it's then possible to start it again. It will automatically discover what was done successfully and what needs to be done again because it failed or was interrupted (table copy, index creation, resuming replication slot consuming, resuming applying changes at the right LSN position, etc).
Here is an example implementing the previous steps:
$ pgcopydb snapshot &

$ pgcopydb stream setup

$ pgcopydb clone --follow &

# later when the application is ready to make the switch
$ pgcopydb stream sentinel set endpos --current

# when the follow process has terminated, re-sync the sequences
$ pgcopydb copy sequences

# later when the migration is finished, clean-up both source and target
$ pgcopydb stream cleanup

# now stop holding the snapshot transaction (adjust PID to your environment)
$ kill %1
The following options are available to pgcopydb clone:
This limit only applies to the COPY operations. While the CREATE INDEX operations are in progress, more sub-processes than this limit will be running at the same time, though those processes are only waiting for the target Postgres instance to do all the work.
If this value is not set, we reuse the --index-jobs value. If that value is not set either, we use the default value for --index-jobs.
This option is useful when the same command is run several times in a row, either to fix a previous mistake or for instance when used in a continuous integration system.
This option causes DROP TABLE and DROP INDEX and other DROP commands to be used. Make sure you understand what you're doing here!
The pg_dumpall --roles-only option is used to fetch the list of roles from the source database, and this command includes support for passwords. As a result, this operation requires superuser privileges.
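As a sketch of what happens under the hood (the output file name is illustrative, pgcopydb manages its own files), this is roughly equivalent to:

$ pg_dumpall --roles-only \
             --dbname "$PGCOPYDB_SOURCE_PGURI" \
             --file /tmp/pgcopydb/roles.dump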
See also pgcopydb copy roles.
When used, schemas that extensions depend on are also skipped: it is expected that creating needed extensions on the target system is then the responsibility of another command (such as pgcopydb copy extensions), and schemas that extensions depend on are part of that responsibility.
Because creating extensions requires superuser privileges, this allows a multi-step approach where extensions are dealt with using superuser privileges, and then the rest of the pgcopydb operations are done without superuser privileges.
The command pgcopydb list extensions --requirements --json produces such a JSON file and can be used on the target database instance to get started.
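A possible workflow, sketched with placeholder connection strings and an arbitrary file name:

# on the source side, export the extensions and their versions
$ pgcopydb list extensions --requirements --json \
    --source "$PGCOPYDB_SOURCE_PGURI" > requirements.json

# then use that file when cloning into the target
$ pgcopydb clone --requirements requirements.json \
    --source "$PGCOPYDB_SOURCE_PGURI" \
    --target "$PGCOPYDB_TARGET_PGURI"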
See also the command pgcopydb list extensions --available-versions.
See also pgcopydb list extensions.
In some scenarios the list of collations provided by the Operating System on the source and target system might be different, and a mapping then needs to be manually installed before calling pgcopydb.
Then this option allows pgcopydb to skip over collations and assume all the needed collations have been deployed on the target database already.
See also pgcopydb list collations.
In that case, the --restart option can be used to allow pgcopydb to delete traces from a previous run.
When resuming activity from a previous run, table data that was fully copied over to the target server is not sent again. Table data that was interrupted during the COPY has to be started from scratch even when using --resume: the COPY command in Postgres is transactional and was rolled back.
The same reasoning applies to the CREATE INDEX commands and ALTER TABLE commands that pgcopydb issues; those commands are skipped on a --resume run only if known to have run through to completion on the previous one.
Finally, using --resume requires the use of --not-consistent.
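For example, restarting an interrupted migration while keeping the work already done could look like this:

$ pgcopydb clone --resume --not-consistent \
    --source "$PGCOPYDB_SOURCE_PGURI" \
    --target "$PGCOPYDB_TARGET_PGURI"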
Per the Postgres documentation about pg_export_snapshot:

   The snapshot is available for import only until the end of the transaction that exported it.
Now, when the pgcopydb process was interrupted (or crashed) on a previous run, it is possible to resume operations, but the snapshot that was exported does not exist anymore. The pgcopydb command can only resume operations with a new snapshot, and thus cannot ensure consistency of the whole data set, because each run is now using its own snapshot.
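One way to keep a single snapshot available for the whole migration is to hold it in a separate process, as sketched below; the snapshot identifier is the string printed by pgcopydb snapshot (see also the PGCOPYDB_SNAPSHOT environment variable):

# hold a snapshot for the whole migration in a background process
$ coproc ( pgcopydb snapshot --source "$PGCOPYDB_SOURCE_PGURI" )

# reuse the exported snapshot in one or several pgcopydb commands
$ pgcopydb clone --snapshot "<snapshot identifier printed above>" \
    --source "$PGCOPYDB_SOURCE_PGURI" \
    --target "$PGCOPYDB_TARGET_PGURI"

# when done, terminate the snapshot-holding process
$ kill -TERM ${COPROC_PID}
$ wait ${COPROC_PID}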
The replication slot is created using the same snapshot as the main database copy operation, and the changes to the source database are prefetched only during the initial copy, then prefetched and applied in a catchup process.
It is possible to give pgcopydb clone --follow a termination point (the LSN endpos) while the command is running with the command pgcopydb stream sentinel set endpos.
It is possible to use wal2json instead. The support for wal2json is mostly historical in pgcopydb; it should not make a user-visible difference whether you use the default test_decoding or wal2json.
To use this option you need a wal2json plugin version on the source database that supports the --numeric-data-types-as-string option.
See also the documentation for wal2json regarding this option for details.
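For example, using wal2json and printing numeric values as strings could look like the following sketch, with placeholder connection strings:

$ pgcopydb clone --follow \
    --plugin wal2json \
    --wal2json-numeric-as-string \
    --source "$PGCOPYDB_SOURCE_PGURI" \
    --target "$PGCOPYDB_TARGET_PGURI"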
The --endpos option is not aware of transaction boundaries and may truncate output partway through a transaction. Any partially output transaction will not be consumed and will be replayed again when the slot is next read from. Individual messages are never truncated.
See also documentation for pg_recvlogical.
Postgres uses a notion of an origin node name as documented in Replication Progress Tracking. This option allows picking your own node name and defaults to "pgcopydb". Picking a different name is useful in some advanced scenarios like migrating several sources into the same target, where each source should have its own unique origin node name.
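For instance, when migrating two different source databases into the same target, each run could be given its own origin node name; the names below are arbitrary:

$ pgcopydb clone --follow --origin pgcopydb_source_a \
    --source "postgres://user@source-a/db_a" \
    --target "$PGCOPYDB_TARGET_PGURI"

$ pgcopydb clone --follow --origin pgcopydb_source_b \
    --source "postgres://user@source-b/db_b" \
    --target "$PGCOPYDB_TARGET_PGURI"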
PGCOPYDB_SOURCE_PGURI
PGCOPYDB_TARGET_PGURI
PGCOPYDB_TABLE_JOBS
PGCOPYDB_INDEX_JOBS
PGCOPYDB_RESTORE_JOBS
PGCOPYDB_LARGE_OBJECTS_JOBS
PGCOPYDB_SPLIT_TABLES_LARGER_THAN
When --split-tables-larger-than is omitted from the command line, then this environment variable is used.
PGCOPYDB_OUTPUT_PLUGIN
PGCOPYDB_WAL2JSON_NUMERIC_AS_STRING
When --wal2json-numeric-as-string is omitted from the command line, then this environment variable is used.
PGCOPYDB_DROP_IF_EXISTS
When --drop-if-exists is omitted from the command line, then this environment variable is used.
PGCOPYDB_FAIL_FAST
When --fail-fast is omitted from the command line, then this environment variable is used.
PGCOPYDB_SKIP_VACUUM
PGCOPYDB_SNAPSHOT
TMPDIR
PGCOPYDB_LOG_TIME_FORMAT
See documentation for strftime(3) for details about the format string. See documentation for isatty(3) for details about detecting if pgcopydb is run in an interactive terminal.
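For example, the following strftime(3) format string (chosen here only as an illustration) adds the date to each log line:

$ export PGCOPYDB_LOG_TIME_FORMAT="%Y-%m-%d %H:%M:%S"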
PGCOPYDB_LOG_JSON
{ "timestamp": "2023-04-13 16:53:14", "pid": 87956, "error_level": 4, "error_severity": "INFO", "file_name": "main.c", "file_line_num": 165, "message": "Running pgcopydb version 0.11.19.g2290494.dirty from \"/Users/dim/dev/PostgreSQL/pgcopydb/src/bin/pgcopydb/pgcopydb\"" }
PGCOPYDB_LOG_FILENAME
If the file already exists, its content is overwritten. In other words the previous content would be lost when running the same command twice.
PGCOPYDB_LOG_JSON_FILE
XDG_DATA_HOME
When using Change Data Capture (through the --follow option and Postgres logical decoding with wal2json), pgcopydb pre-fetches changes in JSON files and transforms them into SQL files to apply to the target database.
These files are stored at the following location, tried in this order:
$ export PGCOPYDB_SOURCE_PGURI=postgres://pagila:0wn3d@source/pagila
$ export PGCOPYDB_TARGET_PGURI=postgres://pagila:0wn3d@target/pagila
$ export PGCOPYDB_DROP_IF_EXISTS=on

$ pgcopydb clone --table-jobs 8 --index-jobs 12
14:49:01 22 INFO  Running pgcopydb version 0.13.38.g22e6544.dirty from "/usr/local/bin/pgcopydb"
14:49:01 22 INFO  [SOURCE] Copying database from "postgres://pagila@source/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
14:49:01 22 INFO  [TARGET] Copying database into "postgres://pagila@target/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
14:49:01 22 INFO  Exported snapshot "00000003-00000022-1" from the source database
14:49:01 24 INFO  STEP 1: fetch source database tables, indexes, and sequences
14:49:01 24 INFO  Fetched information for 3 extensions
14:49:01 24 INFO  Splitting source candidate tables larger than 200 kB
14:49:01 24 INFO  Table public.rental is 1224 kB large, 7 COPY processes will be used, partitioning on rental_id.
14:49:01 24 INFO  Table public.film is 472 kB large, 3 COPY processes will be used, partitioning on film_id.
14:49:01 24 INFO  Table public.film_actor is 264 kB large which is larger than --split-tables-larger-than 200 kB, and does not have a unique column of type integer: splitting by CTID
14:49:01 24 INFO  Table public.film_actor is 264 kB large, 2 COPY processes will be used, partitioning on ctid.
14:49:01 24 INFO  Table public.inventory is 264 kB large, 2 COPY processes will be used, partitioning on inventory_id.
14:49:01 24 INFO  Fetched information for 21 tables, with an estimated total of 0 tuples and 3816 kB
14:49:01 24 INFO  Fetched information for 54 indexes
14:49:01 24 INFO  Fetching information for 13 sequences
14:49:01 24 INFO  STEP 2: dump the source database schema (pre/post data)
14:49:01 24 INFO  /usr/bin/pg_dump -Fc --snapshot 00000003-00000022-1 --section pre-data --file /tmp/pgcopydb/schema/pre.dump 'postgres://pagila@source/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
14:49:01 24 INFO  /usr/bin/pg_dump -Fc --snapshot 00000003-00000022-1 --section post-data --file /tmp/pgcopydb/schema/post.dump 'postgres://pagila@source/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
14:49:02 24 INFO  STEP 3: restore the pre-data section to the target database
14:49:02 24 INFO  /usr/bin/pg_restore --dbname 'postgres://pagila@target/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/pre.dump
14:49:02 24 INFO  STEP 6: starting 12 CREATE INDEX processes
14:49:02 24 INFO  STEP 7: constraints are built by the CREATE INDEX processes
14:49:02 24 INFO  STEP 8: starting 8 VACUUM processes
14:49:02 24 INFO  STEP 9: reset sequences values
14:49:02 51 INFO  STEP 5: starting 4 Large Objects workers
14:49:02 30 INFO  STEP 4: starting 8 table data COPY processes
14:49:02 52 INFO  Reset sequences values on the target database
14:49:02 51 INFO  Added 0 large objects to the queue
14:49:04 24 INFO  STEP 10: restore the post-data section to the target database
14:49:04 24 INFO  /usr/bin/pg_restore --dbname 'postgres://pagila@target/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --use-list /tmp/pgcopydb/schema/post-filtered.list /tmp/pgcopydb/schema/post.dump

  OID | Schema |             Name | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+---------------+-------------------+---------+----------------------
16880 | public |           rental |         160ms |            188 kB |       3 |                 230ms
16880 | public |           rental |          77ms |            189 kB |       0 |                   0ms
16880 | public |           rental |         105ms |            189 kB |       0 |                   0ms
16880 | public |           rental |         107ms |            189 kB |       0 |                   0ms
16880 | public |           rental |          97ms |            190 kB |       0 |                   0ms
16880 | public |           rental |          82ms |            189 kB |       0 |                   0ms
16880 | public |           rental |          81ms |            189 kB |       0 |                   0ms
16758 | public |             film |         136ms |            112 kB |       5 |                 462ms
16758 | public |             film |          52ms |            110 kB |       0 |                   0ms
16758 | public |             film |          74ms |            111 kB |       0 |                   0ms
16770 | public |       film_actor |          74ms |            5334 B |       0 |                   0ms
16770 | public |       film_actor |          77ms |            156 kB |       0 |                   0ms
16825 | public |        inventory |         106ms |             74 kB |       2 |                 586ms
16825 | public |        inventory |         107ms |             76 kB |       0 |                   0ms
16858 | public | payment_p2022_03 |          86ms |            137 kB |       4 |                 468ms
16866 | public | payment_p2022_05 |          98ms |            136 kB |       4 |                 663ms
16870 | public | payment_p2022_06 |         106ms |            134 kB |       4 |                 571ms
16862 | public | payment_p2022_04 |         125ms |            129 kB |       4 |                 775ms
16854 | public | payment_p2022_02 |         117ms |            121 kB |       4 |                 684ms
16874 | public | payment_p2022_07 |         255ms |            118 kB |       1 |                 270ms
16724 | public |         customer |         247ms |             55 kB |       4 |                 1s091
16785 | public |          address |         128ms |             47 kB |       2 |                 132ms
16795 | public |             city |         163ms |             23 kB |       2 |                 270ms
16774 | public |    film_category |         172ms |             28 kB |       1 |                  47ms
16850 | public | payment_p2022_01 |         166ms |             36 kB |       4 |                 679ms
16738 | public |            actor |         399ms |            7999 B |       2 |                 116ms
16748 | public |         category |         170ms |             526 B |       1 |                 200ms
16805 | public |          country |          63ms |            3918 B |       1 |                 226ms
16900 | public |            staff |         170ms |             272 B |       1 |                 114ms
16832 | public |         language |         115ms |             276 B |       1 |                  68ms
16911 | public |            store |          88ms |              58 B |       2 |                 185ms

   Step                                               Connection    Duration    Transfer   Concurrency
 --------------------------------------------------   ----------  ----------  ----------  ------------
   Dump Schema                                             source        98ms                         1
   Catalog Queries (table ordering, filtering, etc)        source       687ms                         1
   Prepare Schema                                          target       667ms                         1
   COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)             both       1s256                    8 + 20
   COPY (cumulative)                                         both       4s003     2955 kB             8
   Large Objects (cumulative)                                both       877ms                         4
   CREATE INDEX, CONSTRAINTS (cumulative)                  target       7s837                        12
   Finalize Schema                                         target       487ms                         1
 --------------------------------------------------   ----------  ----------  ----------  ------------
   Total Wall Clock Duration                                 both       3s208                    8 + 20
 --------------------------------------------------   ----------  ----------  ----------  ------------
Dimitri Fontaine
2022-2024, Dimitri Fontaine