[MVP] gprebalance by bimboterminator1 · Pull Request #1198 · arenadata/gpdb

bimboterminator1 · 2025-01-29T11:51:09Z

MVP for gprebalance utility

Implement cluster validation possibility

This is the first commit toward an MVP of a new rebalance utility, gprebalance. The utility is intended for the situation when a cluster is left in an unbalanced state after a resize (expand or shrink). The balanced state is defined very simply: if the number of segments per host is equal across all hosts, the cluster is balanced. A proper implementation of an optimal rebalance algorithm has many other aspects, which will be implemented in the next patches.

This patch adds the skeleton of the future utility and provides initial validation of whether a rebalance is possible. It includes checks that validate some basic aspects: whether segments can be distributed uniformly and whether the target mirroring strategy can be achieved. Validation is provided through separate classes, which is a different approach from the gpexpand utility. Some unit tests have also been added. Validation of available disk space is not implemented, since it cannot be done at this initial validation step.
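The balanced-state definition and the uniform-distribution check can be illustrated with a minimal sketch. The function names and the input shape (a list with one host name per segment) are assumptions made for illustration, not the validation classes from this patch.

```python
from collections import Counter


def can_balance_uniformly(segment_count, host_count):
    """Hypothetical check: segments can be spread evenly only if the
    segment count is divisible by the number of hosts."""
    return host_count > 0 and segment_count % host_count == 0


def is_balanced(segment_hosts, all_hosts):
    """Hypothetical check: the cluster is balanced when every host in
    all_hosts holds the same number of segments.

    segment_hosts contains one host name per segment."""
    counts = Counter(segment_hosts)
    per_host = {counts.get(host, 0) for host in all_hosts}
    return len(per_host) <= 1


balanced = is_balanced(["sdw1", "sdw1", "sdw2", "sdw2"], ["sdw1", "sdw2"])
unbalanced = is_balanced(["sdw1", "sdw1", "sdw1", "sdw2"], ["sdw1", "sdw2"])
```

A host that holds no segments at all counts as zero, so a freshly added empty host makes the cluster unbalanced under this definition.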

The gprebalance skeleton is complemented with additional options from the MVP specification.

This code proposes the rebalance algorithm. GpRebalance.createPlan() returns a Plan represented by the list of Moves. The algorithm itself produces an intuitive greedy solution by manually setting the final balanced state.

The proposed code contains the main framework for rebalance execution. Some options are not fully implemented and are expected to be finished in later tasks.

The code describes the following segment movement approach. First, we create a movement plan: simple steps telling which segment to move to which host. Steps in the plan can be of different types:

1. Mirror-only moves.
2. Both primary and mirror are moved to different hosts.
3. Primary-only moves.
4. Primary and mirror are swapped.

For each type of movement we determine target directories and ports on the target hosts that can hold the moved segment. The DiskFree and DiskUsage commands are used for that.

The movements, in turn, are composite and imply extra actions, including segment role switching:

- Mirror-only moves use a single gprecoverseg call to perform the movement.
- If we move a primary and mirror pair, the strategy is as follows. The mirror is first moved via gprecoverseg to the primary's target host. Then the roles are switched. Then the ex-primary (new mirror) is moved to the mirror's target host.
- Primary-only moves imply two role switches: switch, move, switch.
- A primary-mirror swap is executed similarly to the second type. The mirror is moved to the primary's directory on its own host. Switch. The ex-primary is moved to the mirror's directory on its own host.

The status management is written in a general form and may contain errors. Cleanup is prepared by RekGRpth.

Co-authored-by: Georgy Shelkovy <g.shelkovy@arenadata.io>
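The composite nature of the moves can be sketched as a function that expands one planned move into ordered low-level actions. This is only an illustration: `expand_move`, the action names, and the move kind strings are hypothetical, and the swap case is left out because its directory handling is not specified here in enough detail.

```python
def expand_move(kind, primary_target=None, mirror_target=None):
    """Expand a planned move into ordered (action, target_host) steps.

    "move_mirror" stands for one gprecoverseg-based mirror move and
    "switch_roles" for a primary/mirror role switch (names are assumptions).
    """
    if kind == "mirror_only":
        return [("move_mirror", mirror_target)]
    if kind == "primary_and_mirror":
        return [
            ("move_mirror", primary_target),  # mirror goes to the primary's target host
            ("switch_roles", None),           # the moved mirror becomes the primary
            ("move_mirror", mirror_target),   # ex-primary (new mirror) goes to the mirror's target
        ]
    if kind == "primary_only":
        return [
            ("switch_roles", None),           # primary becomes a mirror
            ("move_mirror", primary_target),  # move it as a mirror
            ("switch_roles", None),           # switch back so it is primary again
        ]
    raise ValueError("unsupported move kind: %s" % kind)


pair_steps = expand_move("primary_and_mirror", primary_target="sdw3", mirror_target="sdw4")
```

The point of the sketch is that only mirrors are ever physically moved; a primary is relocated by first turning it into a mirror.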

This PR introduces the rollback handler in the gprebalance MVP. The rollback function creates a new plan of movements by calculating the difference between the current configuration and the original state loaded from the previously pickled plan.
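The idea of deriving a rollback plan from a configuration difference can be sketched as follows. The state format (a dict mapping dbid to host) and the function name are assumptions; the real pickled plan holds richer objects.

```python
import pickle


def rollback_moves(original, current):
    """Return (dbid, from_host, to_host) for each segment that is not on
    the host recorded in the original state."""
    moves = []
    for dbid in sorted(original):
        if dbid in current and current[dbid] != original[dbid]:
            moves.append((dbid, current[dbid], original[dbid]))
    return moves


# Simulate the previously pickled original state being loaded back.
saved_plan = pickle.dumps({2: "sdw1", 3: "sdw1", 4: "sdw2"})
original_state = pickle.loads(saved_plan)

current_state = {2: "sdw1", 3: "sdw2", 4: "sdw3"}
moves = rollback_moves(original_state, current_state)
# moves == [(3, "sdw2", "sdw1"), (4, "sdw3", "sdw2")]
```

Segments that are already in their original location produce no move, so a rollback after a partially executed plan only touches what actually changed.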

The changes in this patch provide a prototype for status tracking of mirror moves during rebalance. First, this patch removes the usage of a gpdb table for the whole execution status. Second, the status manager is rewritten to track the execution process with a status file only. If a movement step, represented by a gprecoverseg process, fails, the corresponding status (FAILED) is first written to the internal status struct and then flushed to disk.

The other main purpose of these changes is to determine the gprecoverseg state. The code in analyze_gprecoverseg_states() tries to implement the SRS diagram for gprecoverseg status definition. It processes the following scenarios:

1. A mirror move failed after pg_hba.conf had been updated on the primary. In this case the primary marks the mirror as down.
2. A mirror move failed after gp_segment_configuration had been updated. Here our code tries to determine whether pg_basebackup was executed successfully or not. Depending on the basebackup state, the algorithm tries either to start up the backed-up mirror or to roll back the configuration changes and recover the old mirror.

Problem description: There was no way to provide a segment shrink feature to the 'gprebalance' tool.

Fix: Add a new command 'ALTER TABLE <table_name> REBALANCE' (MVP level).

Details:

1. 'ALTER TABLE <table_name> REBALANCE' supports an optional parameter, the target number of segments (e.g. 'ALTER TABLE <table_name> REBALANCE 2;').
2. If the target number of segments is greater than the number of segments in the table's distribution policy, the rebalance command invokes the existing functionality of 'ALTER TABLE <table_name> EXPAND TABLE' (meaning that the expand is always done to the current number of segments in the cluster, even if a smaller number was specified).
3. If the target number of segments is less than the number of segments in the table's distribution policy, the table is shrunk to the target number of segments. For hash-distributed or randomly distributed tables, data from the excess segments is inserted into the target segments, and then for all table types the distribution policy is updated for the target number of segments. Data on the excess segments is not removed (we do not want to spend time on it, as those segments will most likely be excluded from the cluster soon anyway).
4. A new GUC 'gp_target_numsegments' is added. If the target number of segments is not specified in the 'ALTER TABLE <table_name> REBALANCE' command, the value of 'gp_target_numsegments' is used.
5. If 'gp_target_numsegments' is set, all new tables are created using this number of segments.
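The decision rules in items 1 to 4 can be modelled with a small Python sketch. It only mirrors the described semantics and is not the server-side C implementation. The function name is hypothetical, and two behaviours are assumptions not stated above: a target equal to the policy's segment count is treated as a no-op, and a missing target with an unset GUC raises an error.

```python
def rebalance_action(policy_numsegments, cluster_numsegments,
                     target=None, guc_target=None):
    """Return (action, resulting_numsegments) for ALTER TABLE ... REBALANCE."""
    # Item 4: fall back to gp_target_numsegments when no target is given.
    effective = target if target is not None else guc_target
    if effective is None:
        raise ValueError("no target number of segments available")  # assumption
    if effective > policy_numsegments:
        # Item 2: expand always goes to the current cluster size.
        return ("expand", cluster_numsegments)
    if effective < policy_numsegments:
        # Item 3: shrink to the requested target.
        return ("shrink", effective)
    return ("noop", policy_numsegments)  # assumption: equal target does nothing


shrink = rebalance_action(policy_numsegments=4, cluster_numsegments=4, target=2)
expand = rebalance_action(policy_numsegments=2, cluster_numsegments=6, target=3)
from_guc = rebalance_action(policy_numsegments=4, cluster_numsegments=4, guc_target=3)
```

Note the asymmetry shown by `expand`: a table on 2 segments with a requested target of 3 in a 6-segment cluster ends up on 6 segments, because the existing EXPAND TABLE path is reused.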

Commit 5b3f506 introduced the new command ALTER TABLE REBALANCE with shrink support. The target number of segments (if not specified in the ALTER command) is taken from the GP_POLICY_DEFAULT_NUMSEGMENTS() macro. Therefore, we need a way to set and maintain the creation number across all backends.

This patch introduces a mechanism for managing the default number of segments used in table creation during a rebalance operation in GPDB. A new shared variable gp_create_table_rebalance_numsegments is introduced in gpexpand.h to track the number of segments to use during table creation while a rebalancing operation is in progress. The shared variable is initialized in shared memory with appropriate size and get functionality. Corresponding SQL functions are created in the gp_toolkit extension.

The system now checks whether a rebalancing operation is active by verifying locks before allowing modifications to the number of segments. If the lock is not already acquired in the current transaction (indicating that no rebalancing is underway), an appropriate error message is returned.

Tests from 5b3f506 are updated to support the new functionality. The gp_debug_numsegments extension preserves its behaviour, but modifying the local numsegments value is disallowed when gp_create_table_rebalance_numsegments is set.

This patch implements a state machine skeleton for a basic shrink scenario based on the 'transitions' library. It consists of a new 'ggrebalance' tool, which will be a single entry point for shrink, expand, and cluster rebalance functionality, and 'shrink.py', which contains the state machine itself with the shrink logic. The main purpose of this half-MVP is to evaluate the suitability of the state machine pattern. Therefore it implements only a limited set of shrink requirements, enough to support a basic shrink workflow.

This patch adds a check for a probable scenario in which the cluster is restarted while ggrebalance is interrupted. In this case the shared variable gp_rebalance_numsegments is unset, and a new table may be created with the old segment count. Thus, when recovering the shrink process, the STATE_CHECK_PREVIOUS_RUN callback calls the get_state_after_interrupt() function, which checks for this situation. If the cluster was restarted, the state machine transitions to the STATE_BACKUP_CATALOG_AND_UPDATE_TARGET_SEGMENT_COUNT_STARTED state.

The interface for the gp_rebalance_numsegments variable is extended with the gp_rebalance_numsegments_is_set() SQL function to provide a convenient way to monitor the variable's status. Before that, a comparison with the INT_MAX value was required.

Additionally, the fault injection interface was brought back to the behave tests to cause workflow interruptions. The behave test utility code was also adjusted to support some of the shrink scenarios. The code related to table population is fixed to make it follow the declared semantics. The gpaddmirrors test is updated as well.

Co-Authored-By: Roman Eskin r.eskin@arenadata.io

In this patch:

1. The new option '--clean' is added for the cluster shrink by the ggrebalance tool.
2. The new option '--rollback' is added for the cluster shrink by the ggrebalance tool.
3. The new option '--non-interactive-mode' is added for the ggrebalance tool. It is essential for automated testing of some cleanup scenarios that would otherwise expect user confirmation.
4. As the existing 'main' and the new 'rollback' shrink workflows use similar functionality, the shrink code is reorganized to reduce code duplication:
   a. New functions that are used in both the 'main' and 'rollback' workflows are introduced (like 'prepare_shrink_schema()', 'rebalance_tables()').
   b. All logic related to ggrebalance schema handling is moved to a separate class named 'RebalanceSchema' in 'rebalance_commons.py'.
5. A new entity, 'Plan', is added. It is used to pass information about the required shrink configuration of the target cluster to the shrink engine. It is stored in the rebalance schema and used for the 'rollback' workflow and when we recover from an interrupted shrink state. It is added for the following reasons:
   a. As stated above, we need it during rollback. When the user starts the rollback operation, they do not specify the target segment count that was used in the preceding shrink operation. Thus we need to store this information at shrink time for later use.
   b. When the user tries to re-enter the shrink procedure from an interrupted state, we need to restart with the same target segment count that was specified originally. Otherwise we may leave the cluster in an invalid configuration where tables are shrunk to different segment counts. Giving the user the ability to specify the target segment count for the re-entry launch opens the way for such error-prone scenarios. So we simply forbid specifying the segment count configuration when we re-enter the interrupted state or start the rollback, and use the saved plan information from the very first operation start.
   c. According to the current design, at a later phase we will introduce a Planner entity that will perform planning for all shrink/expand/rebalance operations. Its output Plan will be the input to the shrink engine. So this change is aligned with the overall design.
6. New behave test cases are added. The test cases cover not only the 'cleanup' and 'rollback' flows, but also the existing 'main' shrink flow, as we cannot guarantee the correctness of rollback without proving that the 'main' flow works correctly. The existing test case is renamed to 'test 2.4' and moved next to the new tests that cover similar functionality.
7. New steps are added to mgmt_utils.py that are used to verify that the shrunk segments are actually down. Also, a small change is made in 'SegmentIsShutDown': it is required to check that the mirror is down.
8. In order to recover properly if we are interrupted in the middle of stopping shrunk segments, a new class 'SegmentStopAfterShrink' is introduced. It wraps 'SegmentStop' with a check of whether the segment is actually still running. Without it, if shrink was re-entered and some segments had already been shut down by the preceding interrupted launch, we got an error when trying to shut down such segments.

This patch adds the foundations of the shrink/rebalance planner. Some extra planning details and proper integration of the planning stage into the ggrebalance state machine are going to be considered in separate tickets. The main feature of the provided code is an abstract balancing algorithm, which represents manual primary/mirror host assignment following a greedy strategy. In short, the algorithm consists of several phases:

1. Primary assignment. Sort segments by relocation priority: first, must-move segments, those lying on decommissioned hosts, encoded in initial_primary as indexes >= n_target_hosts. Then move from overloaded to underloaded hosts. Assign each segment to the least-loaded host, preferring the original placement when possible.
2. Mirror assignment. It is built according to simple logic: prefer the original mirror hosts, use the least-loaded mirror hosts.
3. Optional improvement. It uses adaptive large neighborhood search, where we try to build nearby solutions by destroying and reassigning parts of the initial one. It is quite volatile, but in some cases it can bring a better solution.

It is proposed for use in the ggrebalance utility. Reentrancy could be achieved by saving the first plan into the database. Unit tests are moved from gppylib into gprebalance_modules in order to achieve better test granularity and the possibility to import separate modules.
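The primary assignment phase can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the planner's code: it reuses the `initial_primary` and `n_target_hosts` names from the description, caps each host at the ceiling of the average load, and ignores mirrors, disk space, and the optional search phase.

```python
def assign_primaries(initial_primary, n_target_hosts):
    """initial_primary[i] is the host index of segment i; indexes
    >= n_target_hosts denote decommissioned hosts. Returns the new host
    index for every segment."""
    capacity = -(-len(initial_primary) // n_target_hosts)  # ceiling division
    load = [0] * n_target_hosts
    result = [None] * len(initial_primary)

    # Keep a segment in place while its host is a target host with spare capacity.
    for seg, host in enumerate(initial_primary):
        if host < n_target_hosts and load[host] < capacity:
            result[seg] = host
            load[host] += 1

    # Relocate the rest: must-move segments first, then overflow segments.
    pending = [s for s in range(len(initial_primary)) if result[s] is None]
    pending.sort(key=lambda s: (initial_primary[s] < n_target_hosts, s))
    for seg in pending:
        target = min(range(n_target_hosts), key=lambda h: (load[h], h))
        result[seg] = target
        load[target] += 1
    return result


# Host 2 is decommissioned: its three segments must move to hosts 0 and 1.
plan = assign_primaries([0, 0, 1, 2, 2, 2], n_target_hosts=2)
# plan == [0, 0, 1, 1, 0, 1]
```

Ties between equally loaded hosts are broken by the lower host index, which keeps the result deterministic.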

This patch implements the following changes:

1. Support for IP addresses in 'target-hosts', 'add-hosts', and 'remove-hosts' is added. Their validation requires hostname resolution, so the HostResolver() class is added in rebalance_commons.py. Without validation we may face the case when an IP address passed through the options corresponds to an existing host but is interpreted by ggrebalance as a new one.
2. Support for hosts files is added.
3. The target directories handling is reworked. The TemplateParser() class is added to support several placeholders. Now, if the 'target-datadirs' option is not passed, all moves choose the default template directories as target ones.
4. Port planning is added in a simple form (since doing network communication is overhead here) via the PortAllocator() class. It forms port patterns per host and per segment type and assigns them incrementally to moves.
5. Storage estimation is implemented. The DiskUsage and DiskFree commands are used. The source datadirs and tablespaces are taken into account, and the main datadirs and tablespaces are validated against the available disk space on the corresponding filesystems.

Corresponding unit tests are added for basic scenarios.
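Incremental port assignment per host and segment type (item 4) could look roughly like the sketch below. `SimplePortAllocator` is a hypothetical stand-in, not the PortAllocator() class from this patch, and the base ports are example values.

```python
from collections import defaultdict


class SimplePortAllocator:
    """Hands out ports per host, starting from a base port for each role
    and skipping ports already known to be in use on that host."""

    def __init__(self, base_ports, used_ports=None):
        self._base = dict(base_ports)   # role -> first port to try
        self._used = defaultdict(set)   # host -> ports already taken
        for host, ports in (used_ports or {}).items():
            self._used[host].update(ports)

    def allocate(self, host, role):
        port = self._base[role]
        while port in self._used[host]:
            port += 1
        self._used[host].add(port)
        return port


allocator = SimplePortAllocator(
    base_ports={"p": 7000, "m": 7050},
    used_ports={"sdw1": {7000, 7001}},
)
first = allocator.allocate("sdw1", "p")   # 7002: 7000 and 7001 are taken
second = allocator.allocate("sdw1", "p")  # 7003
other = allocator.allocate("sdw2", "p")   # 7000: nothing used on sdw2 yet
```

Because the allocator works only from known configuration, it needs no network calls, which matches the stated reason for keeping port planning simple.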

List of changes:

1. This patch adds rebalance functionality. The main part of the related logic is located in the 'RebalanceSM' class. Rebalance is done according to the list of moves from the supplied plan and includes the following steps:
   - move (via gpmovemirrors) all mirrors from the list of moves;
   - for all primaries from the list of moves, switch them with their mirrors;
   - move (via gpmovemirrors) all the segments that were primaries;
   - switch all these segments back to the primary role.
2. As the rebalance functionality should be correctly coordinated with the existing shrink logic, this patch adds a high-level state machine implementation in the 'GGRebalanceMainSM' class. It is responsible for the proper flow of high-level states like planning, rebalance schema creation and deletion, invocation of shrink and rebalance execution, and invocation of cleanup and shrink. Therefore:
   - some states and logic are moved from the existing shrink state machine to 'GGRebalanceMainSM';
   - temp code is removed from the planner;
   - code in 'ggrebalance' is updated to call only 'GGRebalanceMainSM', which does the rest.
3. As we now need to handle states from the shrink, rebalance, and main state machines, the 'RebalanceSchema' code is updated to store and access these state categories.
4. New behave tests for rebalance functionality are added. As the ggrebalance test suite became too large and took too long to execute, it is split into 3 files:
   - 'ggrebalance_basics.feature' contains the existing basic checks from the old file;
   - 'ggrebalance_shrink.feature' contains the existing checks for shrink from the old file;
   - 'ggrebalance_rebalance.feature' contains the new tests for the rebalance.

Also, some notes about changes related to tests:

- The old test named 'test 2.2. shrink' is merged into the test with the new name 'test 1.3. shrink', as the usage of the new top-level state machine now allows shrink execution to continue in this test case.
- A new step definition is added to 'mgmt_util.py' that returns the number of segments which satisfy a certain condition. It is used in the new tests.
- A new step definition is added to 'mgmt_util.py' that sets a delay for a fault to happen. The respective changes are added to the fault injector code. It is used in the new tests, when we test interruption during the work of gpmovemirrors or gprecoverseg.

Problem description: The rebalance execution flow needs to be updated so that it can support parallel segment movement, and at the same time the flow must consider the following limitations:

- ggrebalance should save every move step and its status in persistent storage so that failed steps may be retried, rolled back, or cancelled (rollback, retry, or cancel of a particular movement will be implemented later in a separate patch);
- switchover actions (primary to mirror, mirror to primary) will require user approval once we implement interactive mode (later in a separate patch);
- ggrebalance should respect the order of the planned movements in the primary-mirror swap scenario that uses a third, intermediate (transitional) host. It means that the executor cannot swap the order of mirror and primary movements.

Therefore, this patch:

1. Adds a RebalanceStep entity that contains the state of execution together with the movement definition. The list of such steps is now saved to the rebalance schema.
2. Updates the state machine of the rebalance execution. New states, where approval will later be requested from the user, are added. The state machine can switch between segment processing and approval request as many times as required, until all steps are processed. Execution of the rebalance steps is performed in batches. Each batch consists of rebalance steps of the same type, without duplication of dbids.
3. Updates the code to use the '--parallel' option to configure 'gpmovemirrors'/'gprecoverseg'.
4. Updates behave tests according to the changes described above.
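The batching rule (same step type, no repeated dbid, plan order preserved) can be sketched as follows. Representing a step as a `(kind, dbid)` tuple is an assumption made for brevity; the patch stores richer RebalanceStep objects.

```python
def make_batches(steps):
    """Group consecutive steps into batches without reordering them.

    A new batch starts whenever the step kind changes or the dbid is
    already present in the current batch."""
    batches = []
    for kind, dbid in steps:
        current = batches[-1] if batches else None
        same_kind = current is not None and current[0][0] == kind
        if same_kind and all(d != dbid for _, d in current):
            current.append((kind, dbid))
        else:
            batches.append([(kind, dbid)])
    return batches


steps = [
    ("move_mirror", 10),
    ("move_mirror", 11),
    ("move_mirror", 10),  # same dbid again, so it must wait for the next batch
    ("switch", 4),
    ("move_mirror", 4),
]
batches = make_batches(steps)
# batches == [[("move_mirror", 10), ("move_mirror", 11)],
#             [("move_mirror", 10)],
#             [("switch", 4)],
#             [("move_mirror", 4)]]
```

Only consecutive steps are grouped, so the relative order of mirror and primary movements from the plan is never changed.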

This patch adds a new 'ggrebalance_misc_options' test suite, which currently has checks for:

1. '--target-hosts-file' option;
2. '--target-hosts' option;
3. '--target-datadirs-file' option;
4. '--target-datadirs' option;
5. '--mirror-mode' option;
6. '--add-hosts-file' option;
7. '--remove-hosts-file' option;
8. scenario with no mirrors in the cluster;
9. scenario when the cluster can't be rebalanced with the given parameters;
10. scenario when the cluster is in coordinator-only mode;
11. scenario when another instance of ggrebalance is running;
12. scenario when another conflicting tool is running.

Also, this patch updates and adds some new step definitions required by the new tests. Noticeable change: now we can bring up a test cluster with a configurable number of segments (before, it was hardcoded to 2 segments).

This patch also adds a set of small fixes in the ggrebalance code to support the tested scenarios:

- Move the validation that the cluster has mirrors to an earlier stage. Without this check, ggrebalance crashed on accessing the non-existing mirror information before it actually checked the mirrors' presence.
- Fix the function 'get_hosts_from_file()'. Before this change, it tried to split a hostname into letters (for example, instead of 'sdw1', it returned 4 hosts: 's', 'd', 'w', '1'). Also, a validation that the file is not empty is added.
- Add checks for the 'gpexpand' and 'pg_basebackup' tools running in parallel.
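The class of bug described for 'get_hosts_from_file()' usually comes from iterating over a string instead of over lines. The sketch below shows the symptom and one line-based way to parse a hosts file, including the empty-file validation. `parse_hosts` and `read_hosts_file` are hypothetical helpers, not the fixed function from this patch, and the one-host-per-line file format is an assumption.

```python
def parse_hosts(text):
    """Return one host per non-empty line, with surrounding whitespace removed."""
    hosts = [line.strip() for line in text.splitlines() if line.strip()]
    if not hosts:
        raise ValueError("hosts file is empty")
    return hosts


def read_hosts_file(path):
    """Read a hosts file from disk and parse it line by line."""
    with open(path) as f:
        return parse_hosts(f.read())


# Iterating over a string yields characters, which reproduces the symptom.
broken = list("sdw1")                    # ['s', 'd', 'w', '1']
fixed = parse_hosts("sdw1\n\nsdw2\n")    # ['sdw1', 'sdw2']
```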

This patch adds support for the following options:

- '--hba-hostnames'. It determines whether to use hostnames in pg_hba.conf. Passed directly to the 'gpmovemirrors' tool.
- '--replay-lag <replay_lag>'. It determines the replay lag (in GBs) allowed on the mirror when rebalancing the segments. Passed directly to the 'gprecoverseg' tool.
- '--log-dir <log_dir>'. It determines the directory to store logs of the tool and all tools that are called by it.
- '--analyze'. It determines whether to run ANALYZE after rebalancing table redistribution.

Also, this patch adds:

- tests for the mentioned options;
- definition of new steps required by the tests;
- a small fix in the 'gpmovemirrors' tool to support log-dir with spaces in the name;
- definition of STATE_ERROR in the rebalance executor SM.

Problem description: Attempts to rebalance a materialized view via the 'ALTER MATERIALIZED VIEW ... REBALANCE' command (or via 'ALTER TABLE ... REBALANCE', which works equivalently for materialized views) ended with an error: 'ERROR: cannot change materialized view ...'

Root cause: The table rebalance logic tried to insert the data directly into the materialized view as if it were an ordinary table. This is prohibited for materialized views.

Fix: Skip the call of 'ATExecShrinkTable()' for materialized views. So during 'ALTER ... REBALANCE' only the distribution policy of the materialized view is updated, and the user needs to perform 'REFRESH MATERIALIZED VIEW ...' after the rebalance.

Problem description: Before this patch, rebalancing a materialized view required 2 steps: the actual rebalance, where the distribution policy was updated, and the refresh step to update the data in the materialized view. This approach had 2 problems with respect to usage in the 'ggrebalance' tool for cluster shrink:

1. If the view was not up-to-date, the refresh could make the data in the materialized view after the shrink differ from the data before the shrink. We intend to keep the logical data in the cluster unaltered.
2. If a materialized view depends on another materialized view, there could be a race condition during the refresh, when we try to refresh based on a not-yet-refreshed one.

Fix: Use the CTAS approach from EXPAND TABLE specifically when we are rebalancing a materialized view. It creates a temp table with the correct distribution policy, into which all data from the materialized view is copied, and then the relfilenode of the materialized view is swapped with the temp table. It keeps the data as it was before the rebalance, even if it was not up-to-date (so we do not surprise the user with unexpected view content), and it eliminates dependencies on other objects besides the materialized view itself.

List of changes:

- Add support for redistribution of materialized views, external writable tables, partitioned tables, and unlogged tables. Skip processing of temp tables. It is done to comply with the requirements.
- Add checks that the database and the table exist before we actually start to rebalance the table. It is needed because one could drop it in parallel after we have created the rebalance table list.
- Add retry logic to the table rebalance worker. It is needed when, for example, another session opens a transaction after we have created the rebalance table list, drops the table before we start to rebalance it, and commits the transaction once we have started to rebalance the table (and are waiting on the table's locks).
- Change the order in which shrunk segment processes are stopped. Now mirrors are stopped strictly after primaries in order to avoid hanging replication processes.
- Do not stop the tool execution if we could not stop some of the shrunk segments. Now we only emit a warning. It is done to comply with the requirements.
- Rework fault injection when stopping a segment due to the item above, as we no longer stop in case of an exception inside the 'SegmentStopAfterShrink' worker. So now, when a fault is injected, SIGINT is sent to the ggrebalance process to halt its work.
- Improve logging inside 'SegmentStopAfterShrink'.
- Remove the unused flag 'needs_repopulate'.
- Add new behave test cases and update old ones to cover the new functionality.
- Add new behave step definitions to support the updates in the tests.
- Fix behave test steps for view/matview creation: they opened a connection but did not use it. Instead, they tried to use the connection from the context, which was not properly configured.
- Update code in the behave utils to support new test step definitions for materialized views and unlogged tables.
- Add to the fault injector the ability to suspend execution instead of crashing it.

Previously, a primary and its mirror could coexist on the same host during the execution of moves where segments simply swap their hosts. This violates the HA rule for the whole cluster. When a suboptimal rebalance plan requires swapping the locations of a primary segment and its mirror, the planner now decomposes this into three safe phases using an intermediate host to prevent primary-mirror coexistence violations. planner.py now detects swap moves in form_moves() and chooses an appropriate third host for the mirror movement. The search is based on available space, considering other moves, host status, and other swap counts. Thus, a plan which previously looked like:

```
---------------------------------BALANCE MOVES----------------------------------
Total moves planned: 2

[1] Move Segment(content=3, dbid=5, role=p) [254.73 MB]
    From: sdw1:7005 → /home/gpadmin/.data/primary/gpseg3
    To: sdw2:7005 → /home/gpadmin/.data/primary/gpseg3

[2] Move Segment(content=3, dbid=11, role=m) [190.44 MB]
    From: sdw2:7053 → /home/gpadmin/.data/mirror/gpseg3
    To: sdw1:7053 → /home/gpadmin/.data/mirror/gpseg3
```

now expands into three moves:

```
---------------------------------BALANCE MOVES----------------------------------
Total moves planned: 3

[1] Move Segment(content=2, dbid=10, role=m) [190.45 MB]
    From: sdw2:7052 → /home/gpadmin/.data/mirror/gpseg2
    To: sdw3:7052 → /home/gpadmin/.data/mirror/gpseg2

[2] Move Segment(content=2, dbid=4, role=p) [254.74 MB]
    From: sdw1:7004 → /home/gpadmin/.data/primary/gpseg2
    To: sdw2:7004 → /home/gpadmin/.data/primary/gpseg2

[3] Move Segment(content=2, dbid=10, role=m) [190.45 MB]
    From: sdw3:7052 → /home/gpadmin/.data/mirror/gpseg2
    To: sdw1:7054 → /home/gpadmin/.data/mirror/gpseg2
```

Moreover, the available space check for the intermediate host now uses cached filesystem info. Thus, the ResourceEstimator class is refactored, and its unit tests are adjusted. Additionally, some unit tests were fixed, because we had forgotten to check them in previous patches.
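The three-phase decomposition can be sketched as below, together with a check that the primary and mirror never share a host after any phase. The `Move` type and `decompose_swap` are hypothetical and model hosts only; choosing the intermediate host by available space, host status, and swap counts is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    content: int
    role: str  # 'p' for primary, 'm' for mirror
    src: str
    dst: str


def decompose_swap(content, primary_host, mirror_host, spare_host):
    """Split a primary/mirror host swap into three moves via a spare host."""
    if spare_host in (primary_host, mirror_host):
        raise ValueError("spare host must differ from both swap hosts")
    return [
        Move(content, "m", mirror_host, spare_host),    # park the mirror on the spare host
        Move(content, "p", primary_host, mirror_host),  # primary takes the mirror's old host
        Move(content, "m", spare_host, primary_host),   # mirror takes the primary's old host
    ]


def never_colocated(moves, primary_host, mirror_host):
    """Replay the moves and verify the pair never ends up on one host."""
    location = {"p": primary_host, "m": mirror_host}
    for move in moves:
        location[move.role] = move.dst
        if location["p"] == location["m"]:
            return False
    return True


moves = decompose_swap(2, "sdw1", "sdw2", "sdw3")
safe = never_colocated(moves, "sdw1", "sdw2")
```

With these inputs the host sequence matches the three-move plan shown above: the mirror goes sdw2 to sdw3, the primary sdw1 to sdw2, and the mirror sdw3 to sdw1.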

bimboterminator1 and others added 23 commits December 20, 2024 05:53

Initial commit

9fdd5dd

Merge branch 'adb-7.2.0' into feature/ADBDEV-6608

4668879

gprebalance skeleton (#1201)

9a64377

The gprebalance skeleton is complemented with additional options from the MVP specification.

Rebalance algorithm (#1204)

47b225a

This code proposes the rebalance algorithm. GpRebalance.createPlan() returns a Plan represented by the list of Moves. The algorithm itself produces an intuitive greedy solution by manually setting the final balanced state.

Rollback handler (#1265)

fcc28c4

This PR introduces the rollback handler in the gprebalance MVP. The rollback function creates a new plan of movements by calculating the difference between the current configuration and the original state loaded from the previously pickled plan.

bimboterminator1 wants to merge 23 commits into adb-7.2.0 from feature/ADBDEV-6608
