Skip to content

Add opt-in URL replacement mode for faster, safer database migrations#232

Draft
swissspidy wants to merge 3 commits into
mainfrom
fix/search-replace-url
Draft

Add opt-in URL replacement mode for faster, safer database migrations#232
swissspidy wants to merge 3 commits into
mainfrom
fix/search-replace-url

Conversation

@swissspidy

Copy link
Copy Markdown
Member

Why

URL migrations (like moving from a staging to a production domain) are by far the most common use case for wp search-replace. However, standard search-replace has two main pain points for URLs:

  1. Performance (Smarter replacement checks for columns #194): It wastes massive amounts of time scanning core columns that never contain URLs (like post_type, post_status, user_pass, etc.), making large database migrations painfully slow.
  2. Safety (wp search-replace silently writes malformed URLs causing fatal error on next page load #231): It does not validate replacement strings. A simple typo in the replacement URL (e.g. http;//example.com) creates malformed entries in wp_options (siteurl, home). In PHP 8.0+, this causes an immediate fatal ValueError across the entire site when WordPress attempts to use setcookie().

What this does

This PR introduces a dedicated, opt-in URL mode via the new --type=url flag (addressing the request in #186).

  1. Performance Optimization: When --type=url is active, the command automatically skips ~75 WordPress core columns known not to contain URLs. For large databases, this can reduce scan times from ~30 minutes to under 10 minutes (as noted in Smarter replacement checks for columns #194).
  2. Advanced Table Analysis: Adds an --analyze-tables flag that can be used alongside --type=url. This dynamically inspects the MySQL table schemas to skip non-text columns (integers, enums, dates, blobs) and columns matching common patterns (like *_id or *_count) in custom plugin tables.
  3. Fatal Error Prevention: In URL mode, the command validates the replacement string against illegal cookie-path characters (;, ,, spaces, etc.). If detected, it immediately aborts with a helpful error message before modifying the database.

Architecture & Difference from #209

This builds upon the excellent groundwork laid out in PR #209, but makes crucial architectural changes to address the maintainer feedback in that thread:

  • No auto-detection: The command does not automatically force URL mode just because a search string starts with http://. It preserves standard WP-CLI behavior entirely. Users must explicitly opt-in. This ensures backward compatibility.
  • Relaxed Search Validation: It allows users to search for domains without protocols (e.g. example.com) by removing the strict FILTER_VALIDATE_URL requirement on the search string, making it much more user-friendly for standard migration workflows.
  • Future-proof API: Replaced the specific --smart-url flag with a generic --type=<type> flag, paving the way for future profiles such as --type=email or --type=username.

Testing

Includes a comprehensive 600+ line BDD feature test suite (search-replace-url.feature) covering multisite setups, serialized data, dynamic datatype skips, and validation errors.


Fixes #194
Fixes #186
Fixes #231
Supersedes #209

@swissspidy swissspidy added the command:search-replace Related to 'search-replace' command label Apr 29, 2026
@github-actions github-actions Bot added the scope:testing Related to testing label Apr 29, 2026
@swissspidy swissspidy requested a review from Copilot April 29, 2026 18:52

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a 'smart URL mode' for the search-replace command, which optimizes performance by automatically skipping columns unlikely to contain URLs, such as numeric, date, or status-related fields. It also adds validation to prevent malformed URL replacements that could cause issues in PHP 8.0+. My feedback highlights that the current regex validation for URLs might be overly restrictive and could lead to false positives for legitimate URLs containing certain characters in their path or query parameters; I recommend exploring a more robust validation approach.

Comment on lines +357 to +359
if ( preg_match( '/[;,\s\t\r\n]/', $new, $matches ) ) {
WP_CLI::error( sprintf( "The replacement string contains characters that are invalid in a URL (e.g., '%s'). This can cause fatal errors in PHP 8.0+.", $matches[0] ) );
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regex used for validation is slightly restrictive and might flag valid URLs that happen to contain these characters in the path or query parameters. While it prevents fatal errors, it may cause false positives for legitimate URL replacements. Consider if a more robust URL validation approach is possible.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in URL-focused mode to wp search-replace intended to speed up common URL migrations and prevent dangerous malformed URL replacements.

Changes:

  • Introduces --type=url (smart URL mode) plus --analyze-tables for schema-based column skipping.
  • Adds a new Non_URL_Columns helper to define static and dynamic “non-URL” column detection.
  • Expands Behat coverage with a new search-replace-url.feature suite and adds a malformed-replacement scenario.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
src/WP_CLI/SearchReplace/Non_URL_Columns.php New helper for static core skip list + datatype/pattern-based column skipping.
src/Search_Replace_Command.php Wires --type=url / --analyze-tables into command flow, adds validation + skip merging and table analysis logic.
features/search-replace.feature Adds a scenario asserting malformed replacement prevention behavior.
features/search-replace-url.feature Adds URL-mode BDD suite covering skipping behavior, multisite, analysis mode, and serialization.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

'link_rating',
'link_updated',
'link_rel',
'link_rss',

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_core_columns() includes to_ping, pinged, and link_rss as “non-URL” columns, but these fields can store URLs in WordPress core (ping URLs and RSS URLs). In --type=url mode this would skip legitimate URL replacements. These columns should be removed from the non-URL skip list (or otherwise handled so URL-containing values are not skipped).

Suggested change
'link_rss',

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +166
/**
* Get the list of columns that never contain URLs in WordPress core tables.
*
* @return string[] List of column names to skip.
*/
public static function get_core_columns() {
return array(
// wp_posts table - Status, type, and metadata columns
'ID',
'post_author',
'post_date',
'post_date_gmt',
'post_status',
'comment_status',
'ping_status',
'post_password',
// Note: post_name is a slug (not a full URL) in normal WordPress usage.
// In rare edge cases (e.g. imports) it may contain URL-like strings, but we
// still treat it as non-URL for search/replace to keep this optimization simple.
'post_name',
'to_ping',
'pinged',
'post_modified',
'post_modified_gmt',
'post_parent',
'menu_order',
'post_type',
'post_mime_type',
'comment_count',

// wp_postmeta table
'meta_id',
'post_id',

// wp_comments table - IDs, status, type, and dates
'comment_ID',
'comment_post_ID',
'comment_date',
'comment_date_gmt',
'comment_karma',
'comment_approved',
'comment_type',
'comment_parent',
'user_id',

// wp_commentmeta table
'comment_id',

// wp_users table - User metadata and status
'user_login',
'user_pass',
'user_nicename',
'user_email',
'user_registered',
'user_status',
'display_name',

// wp_usermeta table
'umeta_id',

// wp_terms table
'term_id',
'slug',
'term_group',

// wp_term_taxonomy table
'term_taxonomy_id',
'taxonomy',
'parent',
'count',

// wp_term_relationships table
'object_id',
'term_order',

// wp_options table
'option_id',
'autoload',

// wp_links table

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_core_columns() is documented as “columns … in WordPress core tables”, but the values are unqualified column names (e.g. ID, parent, count, email). Because --skip-columns applies globally unless table-qualified, enabling --type=url will also skip columns with these names in custom/plugin tables, which can lead to missed replacements outside core. Consider returning table-qualified names for core tables (e.g. wp_posts.ID) or adjusting the implementation/docs to make the scope explicit and avoid skipping non-core columns unintentionally.

Copilot uses AI. Check for mistakes.
Comment thread src/Search_Replace_Command.php
Comment on lines +347 to +349
$is_url_mode = 'url' === Utils\get_flag_value( $assoc_args, 'type' );
$analyze_tables = Utils\get_flag_value( $assoc_args, 'analyze-tables', false );

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user passes --type=<type> with an unsupported value (anything other than url), the command silently falls back to normal behavior. Given the help text lists valid options, it would be clearer to error out when --type is set but unrecognized, to avoid users thinking they’re in an optimized/validated mode when they aren’t.

Suggested change
$is_url_mode = 'url' === Utils\get_flag_value( $assoc_args, 'type' );
$analyze_tables = Utils\get_flag_value( $assoc_args, 'analyze-tables', false );
$type = Utils\get_flag_value( $assoc_args, 'type', null );
$is_url_mode = 'url' === $type;
$analyze_tables = Utils\get_flag_value( $assoc_args, 'analyze-tables', false );
if ( null !== $type && ! $is_url_mode ) {
WP_CLI::error( sprintf( "Invalid value for --type: '%s'. Supported values: url.", $type ) );
}

Copilot uses AI. Check for mistakes.
// Issue #231: Validate replacement URL for illegal cookie path characters.
// We do not strictly validate the search string, as users often search for non-schemed domains (e.g. 'example.com').
if ( preg_match( '/[;,\s\t\r\n]/', $new, $matches ) ) {
WP_CLI::error( sprintf( "The replacement string contains characters that are invalid in a URL (e.g., '%s'). This can cause fatal errors in PHP 8.0+.", $matches[0] ) );

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URL-mode validation error text says the character is “invalid in a URL”, but the rationale given is preventing PHP 8+ setcookie() failures (cookie-path invalid characters). Also, interpolating $matches[0] directly can produce hard-to-read output for whitespace/control characters (e.g. newline). Consider updating the message to reference cookie path / WordPress breakage, and formatting the matched character in an escaped/printable form (e.g. JSON-escaped or showing its ASCII code).

Suggested change
WP_CLI::error( sprintf( "The replacement string contains characters that are invalid in a URL (e.g., '%s'). This can cause fatal errors in PHP 8.0+.", $matches[0] ) );
$invalid_character = json_encode( $matches[0] );
if ( false === $invalid_character ) {
$invalid_character = sprintf( 'ASCII %d', ord( $matches[0] ) );
}
WP_CLI::error(
sprintf(
'The replacement string contains a character that is invalid in a cookie path used by WordPress cookies (%s). This can break WordPress and cause setcookie() failures in PHP 8.0+.',
$invalid_character
)
);

Copilot uses AI. Check for mistakes.
Comment thread features/search-replace.feature
@codecov

codecov Bot commented Apr 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@martinkrcho

Copy link
Copy Markdown

@swissspidy could you take a look at two issues in the --type=url skip-column
handling when you get a chance?

  1. Skip names aren't table-qualified. Non_URL_Columns::get_core_columns() returns bare column names, and they're matched by the first arm of in_array( $col, $this->skip_columns ) at Search_Replace_Command.php:582, which fires in every table. So --type=url also skips identically-named columns in custom/plugin tables (slug, count, parent, email, …), silently missing URL replacements there. (Applies even without --analyze-tables, and to the --export path too.) Qualifying the entries to their core tables would scope it correctly.

  2. A few skipped columns actually hold URLs. to_ping, pinged and link_rss are in the skip list, but all three store URLs in core (the trackback ping-lists and the Links-Manager RSS feed URL), so URL mode skips legitimate data. They should probably come off the static list.

Thanks!

@MarkBerube

Copy link
Copy Markdown
Contributor

Hi @martinkrcho I'm the person that put together the original PR this is based off of and I'd like to point out these were known before and I discussed them already before the prior PR was closed because of prior reviews.

Skip names aren't table-qualified.

Yep and this is 100% on purpose. Column names are semantically meaningful in WP DBs. A slug in WC is likely not gonna have data we want to search on either. While I could see an extra flag to stay scoped within that is not something you'd regularly do during a data migration.

A few skipped columns actually hold URLs.

Also entirely intentional. The point of these command changes is to make data migrations from site to site faster. Yes to_ping and the others do hold URLs, BUT they are URLs for other WordPress installs, not URLs for site we'd like to change. If you do want to replace those as well, this is not the param for you and you are better off using vanilla search-replace where it would have total freedom.

@MarkBerube

Copy link
Copy Markdown
Contributor

@swissspidy I'm glad to see my original work carried forward, but this PR doesn't appear to address the core architectural feedback from #209, which was the thought that URL-specific logic doesn't belong in the general search-replace command and should either be a separate command or a generic column-type filter from the other reviewer. The changes here are mostly surface-level (flag rename, validation tweaks) on top of the same architecture that was already blocked. It also removes my authorship from the commits entirely, despite the implementation being substantially the same code.

@swissspidy

Copy link
Copy Markdown
Member Author

Yeah I never got to continue this early draft, so there are still a lot of unfinished things in it. Which is why it's still a draft :)

As I mentioned, that's something I'm still debating the naming. There are pros and cons to both wp search-replace-url and a --type=url arg.

Re: attribution, this would be done in the final squashed commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

command:search-replace Related to 'search-replace' command scope:testing Related to testing

Projects

None yet

4 participants