
[MVP] gprebalance#1198

Draft
bimboterminator1 wants to merge 23 commits into adb-7.2.0 from
feature/ADBDEV-6608

Conversation

@bimboterminator1
Member

MVP for the gprebalance utility

bimboterminator1 and others added 23 commits December 20, 2024 05:53
Implement cluster validation possibility

This is the first commit toward an MVP of a new rebalance utility, gprebalance.
The utility is intended for the situation when the cluster is left in an
unbalanced state after a resize (expand or shrink). The balanced state is
defined very simply: the cluster is balanced if the number of segments per host
is equal across all hosts. A proper implementation of an optimal rebalance
algorithm involves many other aspects, which will be addressed in later
patches.
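
The balanced-state definition above can be sketched in a few lines. This is only
an illustration under stated assumptions, not the utility's code: the function
names `segments_per_host` and `is_balanced` and the input shapes are
hypothetical.

```python
from collections import Counter


def segments_per_host(segment_hosts):
    """Count segments per host from a list with one host name per segment."""
    return Counter(segment_hosts)


def is_balanced(segment_hosts, all_hosts):
    """Return True if every host in all_hosts holds the same number of segments.

    A host that holds no segments counts as holding zero.
    """
    counts = segments_per_host(segment_hosts)
    per_host = [counts.get(host, 0) for host in all_hosts]
    return len(set(per_host)) <= 1
```

Passing the full host list separately matters after an expand: a freshly added
host with zero segments makes the cluster unbalanced even if the old hosts are
even among themselves.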

This patch adds the skeleton of the future utility and provides initial
validation of whether a rebalance is possible. It includes checks that validate
some basic aspects: whether segments can be distributed uniformly and whether
the target mirroring strategy can be achieved. Validation is provided through
separate classes, which differs from the approach in the gpexpand utility. Some
unit tests have also been added. Validation of available disk space is not
implemented, since it cannot be done at this initial validation step.

The gprebalance skeleton is complemented with additional
options from the MVP specification.

This code proposes the rebalance algorithm. GpRebalance.createPlan() returns a
Plan represented by a list of Moves. The algorithm itself produces an
intuitive greedy solution by manually setting the final balanced state.

The proposed code contains the main framework for rebalance execution.
Some options are not fully implemented and are expected to be finished in later
tasks.

The code describes the following segment movement approach. First, we create
a movement plan: simple steps stating which segment moves to which host.
The steps in a plan can be of different types:

1. Mirror-only moves.
2. Both primary and mirror are moved to different hosts.
3. Primary-only moves.
4. Primary and mirror are swapped.

For each type of movement we determine the target dirs and ports on the target
hosts that can hold the moved segment. To do that, the DiskFree and DiskUsage
commands are used.

The movements, in turn, are composite and imply extra actions, including
segment role switching.

1. Mirror-only moves use a single gprecoverseg call to perform the movement.
2. If we move a primary and mirror pair, the strategy is as follows. The mirror
is first moved via gprecoverseg to the primary's target host. Then the roles are
switched. Then the ex-primary (new mirror) is moved to the mirror's target host.
3. Primary-only moves imply two role switches: switch, move, switch.
4. A primary-mirror swap is executed similarly to the 2nd type. The mirror is
moved to the primary dir on its own host. Switch. The ex-primary is moved to the
mirror dir on its own host.
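
The four movement types above can be read as fixed sequences of two low-level
actions. The sketch below encodes that reading; `MoveType` and `expand_move`
are hypothetical names, and "move" and "switch" are illustrative labels for a
gprecoverseg-based relocation and a primary/mirror role switch.

```python
from enum import Enum


class MoveType(Enum):
    MIRROR_ONLY = "mirror_only"
    PRIMARY_AND_MIRROR = "primary_and_mirror"
    PRIMARY_ONLY = "primary_only"
    SWAP = "swap"


def expand_move(move_type):
    """Return the ordered low-level actions for one planned move.

    "move" relocates the segment that currently holds the mirror role;
    "switch" exchanges the primary and mirror roles.
    """
    if move_type is MoveType.MIRROR_ONLY:
        return ["move"]
    if move_type in (MoveType.PRIMARY_AND_MIRROR, MoveType.SWAP):
        # Move the mirror, switch roles, then move the ex-primary (new mirror).
        return ["move", "switch", "move"]
    if move_type is MoveType.PRIMARY_ONLY:
        # Switch so the primary becomes a mirror, move it, then switch back.
        return ["switch", "move", "switch"]
    raise ValueError(f"unknown move type: {move_type!r}")
```

In every sequence only the segment that is currently a mirror is physically
moved, which is why primary moves are wrapped in role switches.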
The status management is written only in general terms and may contain errors.

Cleanup is prepared by RekGRpth

Co-authored-by: Georgy Shelkovy <g.shelkovy@arenadata.io>
This PR introduces the rollback handler in the gprebalance MVP. The rollback
function creates a new plan of movements by calculating the difference between
the current configuration and the original state loaded from a previously
pickled plan.
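
The idea of building a rollback plan from a configuration difference can be
sketched as follows. This is an assumption-laden illustration: `Location`,
`Move` and `build_rollback_plan` are hypothetical names, and both states are
modelled as plain dicts keyed by dbid rather than as the pickled plan the patch
actually loads.

```python
from typing import NamedTuple


class Location(NamedTuple):
    host: str
    port: int
    datadir: str


class Move(NamedTuple):
    dbid: int
    source: Location
    target: Location


def build_rollback_plan(original, current):
    """Return the moves that bring every relocated segment back.

    original and current map dbid -> Location. Segments whose location
    did not change, or that are unknown to the original state, are skipped.
    """
    moves = []
    for dbid in sorted(current):
        if dbid in original and current[dbid] != original[dbid]:
            moves.append(Move(dbid, current[dbid], original[dbid]))
    return moves
```

Because the plan is derived from the current configuration, a rollback started
after a partially executed rebalance only covers the segments that actually
moved.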
The changes in this patch provide a prototype for status tracking of mirror
moves during rebalance. Firstly, this patch removes the usage of a gpdb table
for the whole execution status. Secondly, the status manager is rewritten in
order to track the execution process with a status file only. If a movement
step, represented by a gprecoverseg process, fails, the corresponding status
(FAILED) is first written to the internal status struct and then flushed to
disk.
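
A file-only status tracker of this kind might look like the sketch below. The
class name `StatusFile`, the JSON format and the write-then-rename flush are
all assumptions made for illustration; the patch only states that the status is
updated in memory first and then flushed to a status file.

```python
import json
import os
import tempfile


class StatusFile:
    """Keep step statuses in memory and persist them to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.statuses = {}

    def set_status(self, step_id, status):
        # Update the in-memory struct first, then flush it to disk.
        self.statuses[str(step_id)] = status
        self.flush()

    def flush(self):
        # Write to a temp file in the same directory and rename it over the
        # target, so readers never see a partially written file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            json.dump(self.statuses, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, self.path)

    @classmethod
    def load(cls, path):
        """Rebuild the tracker from a previously flushed file."""
        tracker = cls(path)
        with open(path) as f:
            tracker.statuses = json.load(f)
        return tracker
```

Reloading the file after an interruption gives the last flushed status of each
step, for example `FAILED` for a step whose gprecoverseg process did not
finish.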

Another main purpose of these changes is to determine the state of
gprecoverseg. The code in analyze_gprecoverseg_states() tries to implement
the SRS diagram for gprecoverseg status definition. It processes the following
scenarios:
1. A mirror move failed after pg_hba.conf had been updated on the primary. In
this case the primary marks the mirror as down.
2. A mirror move failed after gp_segment_configuration had been updated. Here
our code tries to determine whether pg_basebackup was executed successfully or
not.

Depending on the basebackup state, the algorithm tries either to start up the
backed-up mirror or to roll back the configuration changes and recover the old
mirror.
Problem description:
There was no way to provide the segment shrink feature to the 'gprebalance'
tool.

Fix:
Add new command 'ALTER TABLE <table_name> REBALANCE' (MVP level). Details:
1. 'ALTER TABLE <table_name> REBALANCE' supports an optional parameter - target
number of segments (ex. 'ALTER TABLE <table_name> REBALANCE 2;').
2. If the target number of segments is greater than the number of segments in
the table's distribution policy, the rebalance command invokes the existing
functionality of 'ALTER TABLE <table_name> EXPAND TABLE' (meaning that the
expand is always done to the current number of segments in the cluster, even if
a smaller target was specified).
3. If the target number of segments is less than the number of segments in the 
table's distribution policy, the table will be shrunk into the target number
of segments. For hashed or randomly distributed tables, data from the excessive
segments is inserted into the target segments, and then for all table types the
distribution policy is updated for the target number of segments. Data from the
excessive segments is not removed (we do not want to spend time on it, as most
likely they will be excluded from the cluster soon anyway).
4. New GUC 'gp_target_numsegments' is added. If the target number of segments is
not specified for the 'ALTER TABLE <table_name> REBALANCE' command, value of
'gp_target_numsegments' is used.
5. If 'gp_target_numsegments' is set, all new tables are created using this
number of segments.
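
The decision rules in items 1 to 4 can be summarised in a small model. This is
a Python sketch of the described behaviour, not the server code; the function
`resolve_rebalance_action` and its return values are hypothetical, and the
handling of an equal target and of a missing target are assumptions the commit
message does not cover.

```python
def resolve_rebalance_action(policy_numsegments, cluster_numsegments,
                             target=None, gp_target_numsegments=None):
    """Return (action, resulting_numsegments) for one table.

    policy_numsegments: segments in the table's current distribution policy.
    cluster_numsegments: current number of segments in the cluster.
    target: optional value given in ALTER TABLE ... REBALANCE <n>.
    gp_target_numsegments: GUC value used when target is omitted.
    """
    if target is None:
        target = gp_target_numsegments
    if target is None:
        # Assumption: the source does not say what happens in this case.
        raise ValueError("no target number of segments given")
    if target > policy_numsegments:
        # Expansion reuses EXPAND TABLE, which always goes to the full
        # cluster size, even if a smaller target was requested.
        return ("expand", cluster_numsegments)
    if target < policy_numsegments:
        return ("shrink", target)
    # Assumption: an equal target leaves the table untouched.
    return ("noop", policy_numsegments)
```

For example, a table spread over 4 segments in a 6-segment cluster with a
requested target of 5 ends up on all 6 segments, because the expand path
ignores the smaller target.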
Commit 5b3f506 introduced a new command, ALTER
TABLE REBALANCE, with shrink support. The target number of segments (if not
specified in the ALTER command) is taken from the
GP_POLICY_DEFAULT_NUMSEGMENTS() macro. Therefore, we need a way to set and
maintain the creation number across all backends.

This patch introduces a mechanism for managing the default number of segments
used in table creation during a rebalance operation in GPDB. A new shared
variable, gp_create_table_rebalance_numsegments, is introduced in gpexpand.h to
track the number of segments to use during table creation while a rebalancing
operation is in progress. The shared variable is initialized in shared memory,
together with the corresponding size and getter functionality.

Corresponding SQL functions are created in gp_toolkit extension.
The system now checks if a rebalancing operation is active by verifying locks
before allowing modifications to the number of segments. If a lock is not
already acquired in current transaction (indicating that no rebalancing is
underway), an appropriate error message is returned.

Tests from 5b3f506
are updated to support the new functionality.

The gp_debug_numsegments extension preserves its behaviour. However, modifying
the local numsegments value is disallowed when
gp_create_table_rebalance_numsegments is set.
This patch implements a state machine skeleton for a basic shrink scenario based
on the 'transitions' library. It consists of a new 'ggrebalance' tool, which
will be a single entry point for shrink, expand, and cluster rebalance
functionality, and 'shrink.py', which contains the state machine itself with the
shrink logic.

The main purpose of this half-MVP is to evaluate the suitability of the state
machine pattern. Therefore it implements only a limited set of the shrink
requirements, enough to support a basic shrink workflow.
This patch adds a check for the scenario in which the cluster is restarted
while ggrebalance is interrupted. In this case the shared variable
gp_rebalance_numsegments is unset, and a new table may be created with the old
segment count. Thus, when recovering the shrink process, the
STATE_CHECK_PREVIOUS_RUN callback calls the get_state_after_interrupt()
function, which checks for this situation. If the cluster was restarted, the
state machine transitions to the
STATE_BACKUP_CATALOG_AND_UPDATE_TARGET_SEGMENT_COUNT_STARTED state.

The interface for the gp_rebalance_numsegments variable is extended with the
gp_rebalance_numsegments_is_set() SQL function in order to provide a convenient
way to monitor the variable's status. Before that, a comparison with the INT_MAX
value was required.

Additionally, the fault injection interface was restored in the behave tests to
cause workflow interruptions. The behave tests utility code was also adjusted to
support some of the shrink scenarios. The code related to table population
is fixed to make it follow the declared semantics. The gpaddmirrors test is
updated as well.

Co-authored-by: Roman Eskin <r.eskin@arenadata.io>
In this patch:
1. The new option '--clean' is added for the cluster shrink by the ggrebalance
tool.
2. The new option '--rollback' is added for the cluster shrink by the
ggrebalance tool.
3. The new option '--non-interactive-mode' is added for the ggrebalance tool. It
is essential to allow auto testing of some cleanup scenarios that would expect
user confirmation without such an option.
4. As the existing 'main' and the new 'rollback' shrink workflows use similar
functionality, the shrink code is reorganized to reduce code duplication:
a. New functions that are used in both 'main' and 'rollback' workflows are
introduced (like 'prepare_shrink_schema()', 'rebalance_tables()').
b. All logic related to the ggrebalance schema handling is moved to a separate
class named 'RebalanceSchema' in 'rebalance_commons.py'.
5. A new entity, 'Plan', is added. It is used to pass information about the
required shrink configuration of the target cluster to the shrink engine. It is
stored in the rebalance schema and used for the 'rollback' workflow and when
recovering from an interrupted shrink state. It is added for the following
reasons:
a. As already stated above, we need it during rollback. When the user starts the
rollback operation, they don't specify the target segment count that was used
in the preceding shrink operation. Thus we need to store this information at
shrink time for later use.
b. When the user tries to re-enter the shrink procedure from an interrupted
state, we need to re-start with the same target segment count that was specified
originally. Otherwise we may get the cluster in some invalid configuration where
tables are shrunk to different segment counts. Giving the user the ability
to specify target segment count for the re-enter launch opens the way for such
error prone scenarios. So we just forbid specifying segment count configuration
if we re-enter the interrupted state or start the rollback, and use the saved
plan information that we got at the very first operation start.
c. According to the current design, at the later phase we'll introduce a Planner
entity, that will perform planning for all shrink/expand/rebalance operations.
And its output Plan will be the input to the shrink engine. So this change is
aligned with the overall design.
6. New behave test cases are added. The test cases cover not only the 'cleanup'
and 'rollback' flows, but also the existing 'main' shrink flow, as we can't
guarantee the correctness of rollback without proving that the 'main' flow
works correctly.
The existing test case is renamed to 'test 2.4' and moved to be near the new
tests that cover similar functionality.
7. New steps are added to mgmt_utils.py that are used to verify that the
shrunk segments are actually down. Also, a small change is made in
'SegmentIsShutDown': it is required to check that the mirror is down.
8. In order to recover properly if we are interrupted in the middle of stopping
shrunk segments, a new class 'SegmentStopAfterShrink' is introduced. It wraps
'SegmentStop' with a check of whether the segment is actually still
running. Without it, if the shrink was re-entered and some segments had already
been shut down by the preceding interrupted launch, we got an error when trying
to shut down such segments.
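
The wrapping idea behind 'SegmentStopAfterShrink' (item 8) is a plain
idempotent stop. A minimal sketch, assuming the running check and the stop
command are passed in as callables; `stop_segment_if_running` is a hypothetical
helper, not the class from the patch.

```python
def stop_segment_if_running(segment, is_running, stop):
    """Stop a segment only when it is still running.

    is_running and stop are callables that stand in for the real status
    check and the real stop command. Returns True if a stop was issued and
    False if the segment was already down.
    """
    if not is_running(segment):
        return False
    stop(segment)
    return True
```

With this guard a re-entered shrink can iterate over all shrunk segments again
without failing on the ones that the interrupted run already shut down.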
This patch adds the foundations of the shrink/rebalance planner. Some extra
planning details and proper integration of the planning stage into the
ggrebalance state machine are going to be considered in separate tickets.

The main feature of the provided code is an abstract balancing algorithm, which
performs manual primary/mirror host assignment following a greedy strategy.
In short, the algorithm consists of several phases:

1) Primary assignment. Sort segments by relocation priority: first come the
must-move segments, those located on decommissioned hosts, encoded in
initial_primary as indexes >= n_target_hosts. Then come moves from overloaded to
underloaded hosts. Assign each segment to the least-loaded host, preferring the
original placement when possible.

2) Mirror assignment. It follows simple logic: prefer the original mirror
hosts, then use the least-loaded mirror hosts.

3) Optional improvement. It uses adaptive large neighborhood search, where we
try to build nearby solutions by destroying and reassigning parts of the initial
one. It is quite volatile, but in some cases can produce a better solution. It
is proposed for use in the ggrebalance utility. Reentrancy could be achieved by
saving the first plan into the database.
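
Phase 1 can be illustrated with a small greedy routine. This is a simplified
sketch under stated assumptions, not the planner's code: `assign_primaries` is
a hypothetical function, hosts are plain integer indexes as in
`initial_primary`, and disk space, mirrors and the ALNS improvement phase are
ignored.

```python
def assign_primaries(initial_primary, n_target_hosts):
    """Greedy primary placement.

    initial_primary[i] is the host index of segment i; indexes
    >= n_target_hosts denote decommissioned hosts.
    Returns a list with the new host index of every segment.
    """
    if n_target_hosts <= 0:
        raise ValueError("n_target_hosts must be positive")
    base = len(initial_primary) // n_target_hosts
    load = [0] * n_target_hosts
    assignment = list(initial_primary)

    # Keep segments in place while their host is below the base share.
    pending = []
    for seg, host in enumerate(initial_primary):
        if host < n_target_hosts and load[host] < base:
            load[host] += 1
        else:
            pending.append(seg)

    # Must-move segments (decommissioned hosts) first, then overflow ones.
    pending.sort(key=lambda seg: initial_primary[seg] < n_target_hosts)
    for seg in pending:
        lightest = min(load)
        original = initial_primary[seg]
        if original < n_target_hosts and load[original] == lightest:
            target = original  # prefer the original placement on ties
        else:
            target = load.index(lightest)
        assignment[seg] = target
        load[target] += 1
    return assignment
```

For `assign_primaries([0, 1, 2, 2], 2)` the two segments on the decommissioned
host 2 are spread over hosts 0 and 1, while the segments already on target
hosts stay where they are.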

Unit tests are moved from gppylib into gprebalance_modules in order to achieve
better test granularity and the ability to import separate modules.
This patch implements the following changes:

1. Support for IP addresses in 'target-hosts', 'add-hosts' and 'remove-hosts' is
added. Their validation requires hostname resolution, so the HostResolver()
class is added in rebalance_commons.py. Without validation we may face the case
where an IP address passed through the options corresponds to an existing host
but is interpreted by ggrebalance as a new one.

2. Support for hosts files is added.

3. The target directories handling is reworked. The TemplateParser() class is
added to support several placeholders. Now, if the 'target-datadirs' option is
not passed, all moves choose the default template directories as targets.

4. Port planning is added in a simple form (since network communication would
be excessive overhead here) via the PortAllocator() class. It forms port
patterns per host and per segment type and assigns them incrementally to moves.

5. Storage estimation is implemented. The DiskUsage and DiskFree commands are
used. The source datadirs and tablespaces are taken into account, and the main
datadirs and tablespaces are validated against the available disk space on the
corresponding filesystems.

Corresponding unit tests are added for basic scenarios.
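
Incremental port planning without network probing (item 4) could work roughly
as in the sketch below. `SimplePortAllocator` is a hypothetical stand-in for
PortAllocator(), and the base ports per role are illustrative values, not
defaults taken from the patch.

```python
class SimplePortAllocator:
    """Hand out ports per host and segment role without touching the network."""

    def __init__(self, base_ports, used_ports=None):
        # base_ports: first candidate port per role, e.g. {"p": 7000, "m": 7050}.
        # used_ports: ports already known to be taken, as {host: {port, ...}}.
        self.base_ports = dict(base_ports)
        self.used = {host: set(ports) for host, ports in (used_ports or {}).items()}

    def allocate(self, host, role):
        """Return the lowest free port for this role on the host and reserve it."""
        port = self.base_ports[role]
        taken = self.used.setdefault(host, set())
        while port in taken:
            port += 1
        taken.add(port)
        return port
```

The allocator only knows about ports recorded in the cluster configuration and
ports it has handed out itself, so a port held by an unrelated process on the
target host would not be detected by such a scheme.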
List of changes:
1. This patch adds rebalance functionality. Main part of the related logic is
located in the 'RebalanceSM' class. Rebalance is done according to the list of
moves from the supplied plan, and includes following steps:
 - move (via gpmovemirrors) all mirrors from the list of moves;
 - for all primaries from the list of moves switch them with their mirrors;
 - move (via gpmovemirrors) all these segments which were primaries;
 - switch all these segments back to primaries roles.
2. As the rebalance functionality should be correctly coordinated with the
existing shrink logic, this patch adds the high level state-machine
implementation in 'GGRebalanceMainSM' class. It is responsible for proper flow
of high level states like planning, rebalance schema creation and deletion,
invocation of shrink and rebalance execution, invocation of cleanup and shrink.
Therefore:
 - some states and logic are moved from the existing shrink state-machine to
 'GGRebalanceMainSM';
 - temp code is removed from the planner;
 - code in 'ggrebalance' is updated to call only 'GGRebalanceMainSM', that will
 do the rest.
3. As now we need to handle states from shrink, rebalance and main
state-machines, 'RebalanceSchema' code is updated to store and access these
state categories.
4. New behave tests for rebalance functionality are added. As the ggrebalance
test suite became too large and took too long to execute, it is split into 3
files:
 - 'ggrebalance_basics.feature' - contains the existing basic checks from the
 old file;
 - 'ggrebalance_shrink.feature' - contains the existing checks for shrink from
 the old file;
 - 'ggrebalance_rebalance.feature' - contains the new tests for the rebalance.

Also, some notes about changes related to tests:
 - Old test named 'test 2.2. shrink' is merged into the test with a new name
 'test 1.3. shrink', as the usage of the new top-level state-machine allows now
 to continue shrink execution in this test case;
 - New step definition is added into 'mgmt_util.py', that allows to get the
 number of segments which satisfy a certain condition. It is used in the new
 tests.
 - New step definition is added into 'mgmt_util.py', that allows to set a delay
 for a fault to happen. The respective changes are added into the fault
 injector code. It is used in the new tests, when we test interruption during
 the work of gpmovemirrors or gprecoverseg.
Problem description:
The rebalance execution flow needs to be updated so that it supports parallel
segment movement, while respecting the following limitations:
 - ggrebalance should save every move step and its status in persistent
storage so that failed steps may be retried, rolled back or cancelled (rollback,
retry or cancel of a particular movement will be implemented later in a separate
patch);
 - switchover actions (primary to mirror, mirror to primary) will require user
approval once we implement interactive mode (later in a separate patch);
 - ggrebalance should respect the order of the planned movements in the
primary-mirror swap scenario that uses a third, intermediate (transitional)
host. It means that the executor can't swap the order of mirror and primary
movements.

Therefore, this patch:
1. Adds a RebalanceStep entity that contains the state of execution together
with the movement definition. The list of such steps is now saved to the
rebalance schema.
2. Updates the state machine of the rebalance execution. New states are added
where approval will later be requested from the user. The state machine
can switch between segment processing and approval request as many times as
required, until all steps are processed. Execution of the rebalance steps is
performed in batches. Each batch consists of rebalance steps of the same type,
without duplicate dbids.
3. Updates the code to use the '--parallel' option to configure
'gpmovemirrors'/'gprecoverseg'.
4. Updates behave tests according to the changes described above.
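
The batching rule from item 2 (same step type, no repeated dbid, original order
preserved) can be sketched like this. `Step` and `make_batches` are
hypothetical names, and the `kind` labels are illustrative, not the state names
used by the patch.

```python
from typing import NamedTuple


class Step(NamedTuple):
    kind: str   # illustrative labels, e.g. "move_mirror" or "switch_roles"
    dbid: int


def make_batches(steps):
    """Split an ordered list of steps into consecutive batches.

    A new batch starts whenever the step kind changes or the dbid was
    already used in the current batch, so the planned order is preserved.
    """
    batches = []
    current = []
    seen = set()
    for step in steps:
        if current and (step.kind != current[0].kind or step.dbid in seen):
            batches.append(current)
            current, seen = [], set()
        current.append(step)
        seen.add(step.dbid)
    if current:
        batches.append(current)
    return batches
```

Cutting batches only at consecutive boundaries keeps the two mirror moves of a
three-phase swap on opposite sides of the primary move, at the cost of smaller
batches than a reordering scheduler could form.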
This patch adds a new 'ggrebalance_misc_options' test suite, which currently
has checks for:
1. '--target-hosts-file' option;
2. '--target-hosts' option;
3. '--target-datadirs-file' option;
4. '--target-datadirs' option;
5. '--mirror-mode' option;
6. '--add-hosts-file' option;
7. '--remove-hosts-file' option;
8. scenario with no mirrors in the cluster;
9. scenario when the cluster can't be rebalanced with the given parameters;
10. scenario when the cluster is in coordinator-only mode;
11. scenario when another instance of ggrebalance is running;
12. scenario when another conflicting tool is running;

Also, this patch updates some step definitions and adds new ones required by
the new tests. A notable change: now we can bring up a test cluster with a
configurable number of segments (before, it was hardcoded to 2 segments).

And this patch adds a set of small fixes in the ggrebalance code to support the
tested scenarios:
 - Move the validation that the cluster has mirrors to an earlier stage.
Otherwise, without this check, ggrebalance crashed on accessing the non-existing
mirror information, before it actually checked the mirror's presence.
 - Fix function 'get_hosts_from_file()'. Before this change, it tried to split
hostname into letters (for ex., instead of 'sdw1', it returned 4 hosts:
's', 'd', 'w', '1'). Also, added a validation that the file is not empty.
 - Add checks for 'gpexpand' and 'pg_basebackup' tools running in parallel.
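
The 'get_hosts_from_file()' symptom described above (one host per character) is
what Python produces when a string is iterated or passed to `list.extend()`.
The commit message does not state the actual cause, so the sketch below only
shows one plausible correct shape; `read_hosts` is a hypothetical function.

```python
def read_hosts(lines):
    """Return host names from an iterable of lines, one host per line.

    Blank lines are skipped. Raises ValueError if no host is found.
    """
    hosts = []
    for line in lines:
        name = line.strip()
        if not name:
            continue
        # append() keeps the name whole; extend(name) would add each
        # character separately ('s', 'd', 'w', '1').
        hosts.append(name)
    if not hosts:
        raise ValueError("hosts file is empty")
    return hosts
```

Since an open file object is an iterable of lines, `read_hosts(open(path))`
would work the same way as passing a list.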
This patch adds support for the following options:

 - '--hba-hostnames'
It determines whether to use hostnames in pg_hba.conf. Passed directly to
'gpmovemirrors' tool.

 - '--replay-lag <replay_lag>'
It determines replay lag (in GBs) allowed on the mirror when rebalancing the
segments. Passed directly to the 'gprecoverseg' tool.

 - '--log-dir <log_dir>'
It determines the directory to store logs of the tool and all tools that are
called by it.

 - '--analyze'
It determines whether to run ANALYZE after rebalancing table redistribution.

Also, this patch adds:
 - tests for the mentioned options;
 - definition of new steps required by the tests;
 - a small fix in the 'gpmovemirrors' tool to support log-dir with spaces in the
name;
 - a definition of STATE_ERROR in the rebalance executor SM.
Problem description:
Attempts to rebalance a materialized view via the
'ALTER MATERIALIZED VIEW ... REBALANCE' command (or via
'ALTER TABLE ... REBALANCE', which works equivalently for materialized views)
ended with an error:
'ERROR:  cannot change materialized view ...'

Root cause:
The table rebalance logic tried to insert the data directly into the
materialized view as if it were an ordinary table. It is prohibited for
materialized views.

Fix:
Skip the call of 'ATExecShrinkTable()' for the materialized views. So during
'ALTER ... REBALANCE' only the distribution policy for the materialized view is
updated. And the user needs to perform 'REFRESH MATERIALIZED VIEW ...' after
the rebalance.
Problem description:
Before this patch, in order to rebalance a materialized view, 2 steps were
required: the actual rebalance where distribution policy was updated, and the
refresh step to update the data in the materialized view. This approach had 2
problems with respect to usage in 'ggrebalance' tool for cluster shrink:
1. The actual data in the materialized view could differ before and after the
cluster shrink if the view was not up to date. We intend to keep the logical
data in the cluster unaltered.
2. If a materialized view depends on another materialized view, there could be
a race condition during the refresh, when we try to refresh based on a
not-yet-refreshed one.

Fix:
Use the CTAS approach from EXPAND TABLE specifically when we are rebalancing
a materialized view. It creates a temp table with the correct distribution
policy, copies all data from the materialized view into it, and then swaps the
relfilenode of the materialized view with that of the temp table. This keeps the
data as it was before the rebalance, even if it was not up to date (so we will
not surprise the user with unexpected view content), and it eliminates
dependencies on other objects besides the materialized view itself.
List of changes:

 - Add support for redistribution of materialized views, external writable
tables, partitioned tables, unlogged tables. Skip processing of temp tables.
It is done to comply with the requirements.
 - Add checks that the database and the table exist before we actually start
to rebalance the table. This is needed because one could drop them in parallel
after we have created the rebalance table list.
 - Add retry logic to the table rebalance worker. It is needed when, e.g.,
another session opens a transaction after we have created the rebalance table
list, drops the table before we start to rebalance it, and commits the
transaction once we have started to rebalance the table (and are waiting on the
table's locks).
 - Change the order in which shrunk segment processes are stopped. Now mirrors
are stopped strictly after primaries in order to avoid hanging replication
processes.
 - Do not stop the tool execution if we couldn't stop some of the shrunk
segments. Now we only emit a warning. It is done to comply with the
requirements.
 - Rework fault injection when stopping a segment due to the item above, as now
we will not stop in case of an exception inside the 'SegmentStopAfterShrink'
worker. So now, when a fault is injected, send SIGINT to the ggrebalance
process to halt its work.
 - Improve logging inside 'SegmentStopAfterShrink'.
 - Remove the unused flag 'needs_repopulate'.
 - Add new behave test cases and update old ones to cover the new functionality.
 - Add new behave step definitions to support the updates in the tests.
 - Fix behave test steps for view/matview creation - they opened a connection,
but didn't use it. Instead, they tried to use the connection from the context,
which was not properly configured.
 - Update code in the behave utils to support new test step definitions for
materialized views and unlogged tables.
 - Add into the fault injector the ability to suspend execution instead of
crashing it.
Previously, a primary and its mirror could coexist on the same host during
the execution of moves where segments just swap their hosts. This violates
the HA rule for the whole cluster.

When a suboptimal rebalance plan requires swapping the locations
of a primary segment and its mirror, the planner now decomposes
this into three safe phases using an intermediate host to prevent
primary-mirror coexistence violations.

planner.py now detects swap moves in form_moves() and
chooses an appropriate third host for the mirror movement.
The search takes into account available space, other moves,
host status, and the number of other swaps.
Thus, a plan that previously looked like:
```

---------------------------------BALANCE MOVES----------------------------------
Total moves planned: 2

  [1] Move Segment(content=3, dbid=5, role=p) [254.73 MB]
      From: sdw1:7005 → /home/gpadmin/.data/primary/gpseg3
      To:   sdw2:7005 → /home/gpadmin/.data/primary/gpseg3

  [2] Move Segment(content=3, dbid=11, role=m) [190.44 MB]
      From: sdw2:7053 → /home/gpadmin/.data/mirror/gpseg3
      To:   sdw1:7053 → /home/gpadmin/.data/mirror/gpseg3
```
now expands into three moves:
```

---------------------------------BALANCE MOVES----------------------------------
Total moves planned: 3

  [1] Move Segment(content=2, dbid=10, role=m) [190.45 MB]
      From: sdw2:7052 → /home/gpadmin/.data/mirror/gpseg2
      To:   sdw3:7052 → /home/gpadmin/.data/mirror/gpseg2

  [2] Move Segment(content=2, dbid=4, role=p) [254.74 MB]
      From: sdw1:7004 → /home/gpadmin/.data/primary/gpseg2
      To:   sdw2:7004 → /home/gpadmin/.data/primary/gpseg2

  [3] Move Segment(content=2, dbid=10, role=m) [190.45 MB]
      From: sdw3:7052 → /home/gpadmin/.data/mirror/gpseg2
      To:   sdw1:7054 → /home/gpadmin/.data/mirror/gpseg2
```
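
The three-phase decomposition shown in the plan output can be expressed as a
small function. This is a sketch only: `PlannedMove` and `decompose_swap` are
hypothetical names, the intermediate host is taken as an input, and the
planner's search for that host (space, other moves, host status, swap counts)
is not modelled.

```python
from typing import NamedTuple


class PlannedMove(NamedTuple):
    content: int
    role: str      # "p" for primary, "m" for mirror
    source: str
    target: str


def decompose_swap(content, primary_host, mirror_host, intermediate_host):
    """Turn a primary/mirror host swap into three moves via a third host.

    The primary and its mirror never share a host at any point.
    """
    if intermediate_host in (primary_host, mirror_host):
        raise ValueError("intermediate host must differ from both swap hosts")
    return [
        # 1. Park the mirror on the intermediate host.
        PlannedMove(content, "m", mirror_host, intermediate_host),
        # 2. Move the primary to the host the mirror has just left.
        PlannedMove(content, "p", primary_host, mirror_host),
        # 3. Bring the mirror to the primary's former host.
        PlannedMove(content, "m", intermediate_host, primary_host),
    ]
```

Calling `decompose_swap(2, "sdw1", "sdw2", "sdw3")` yields the same host
sequence as the three-move plan above: mirror sdw2 to sdw3, primary sdw1 to
sdw2, mirror sdw3 to sdw1.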

Moreover, the available space check for the intermediate host now uses
cached filesystem info. To support this, the ResourceEstimator class is
refactored and its unit tests are adjusted.

Additionally, some unit tests were fixed, because we had forgotten to check
them in previous patches.

2 participants