Checksums #502 (Open)
314eter wants to merge 37 commits into incubaid:1.8 from 314eter:checksum

Commits
All 37 commits are by 314eter:

- 2c04738 Prepared implementation of rolling checksums
- 1e6c6f3 Checksum validation
- 33e16c5 Removed commented code
- 843a72e Changed nodenames to make tests independent
- 1849998 Removed unused variable
- 4a194a7 Raise ChecksumError
- d04a32f Compatibility with old format (no checksums)
- 5784c07 Use old tlog format for values with no checksums
- 209b373 Fixed TODO: _previous_i_entry = _previous_entry at start
- b8956be Moved checksum validation to log_value_explicit
- 849edb7 Added OUnit test
- c081051 New system test and better error handling
- 21ebbae Better error msg in test
- 738c031 Design document
- 6eab894 collapser_test leaves head.db in root dir
- 74c8ce9 Fixed wrong choice of previous_i_entry
- 7ae2956 Removed options from Checksum module
- e12f217 Fixed some comments from pull request
- e6f5229 New magic and upgrade path
- f0e08d2 Merge remote-tracking branch 'upstream/1.7' into checksum
- 6cfc819 Disable checksum validation during catchup
- 3fe9179 Save checksum in store, validation during catchup
- 3dc63fe Validate checksum in store
- 84acf7a Merge remote-tracking branch 'upstream/1.8' into checksum
- 325e3a0 Fix merge with 1.8
- 9426483 Updated design document
- f0fba76 Moved store validation to Store module
- 787c955 Replaced LAST_ENTRIES and LAST_ENTRIES2 with LAST_ENTRIES3
- 9ba4eea Ensure i and checksum are updated simultaneously in store
- 65f2401 Fixed some comments on pull request
- 1235157 Add timeout to test power_failure
- c253991 Remove duplicate test_large_catchup_while_running
- e595752 Fix set_previous_checksum
- 85659cc Updated design document
- fdd6dd3 Validate checksums during catchup_store
- e34c6af Use SSE4.2 in update_crc32c
- dc5cf62 Restored buildInSandbox.sh
=================
Rolling Checksums
=================

Problem
=======
If a node crashes and fails to write some tlog entries to disk, Arakoon does not detect this. The node announces that it is in sync up to the last entry in its tlogs, even if the other nodes diverged while it was offline. Note that this should not happen if fsync is set to true (the default) and none of the layers below (file system, hardware) lie about fsync behaviour.

Another problematic situation occurs when, for some reason, tlog files, databases or even whole nodes end up in the wrong cluster. The nodes continue as if nothing happened, unaware that they have diverged.

Example
-------
Consider this situation:

+----------------------------------+------------------------------------+------------------------------------+
| node0                            | node1                              | node2                              |
+==================================+====================================+====================================+
| 0:(Vm (node0,0.000000))          | 0:(Vm (node0,0.000000))            | 0:(Vm (node0,0.000000))            |
+----------------------------------+------------------------------------+------------------------------------+
| 1:(Vc ([Set;"a";1;"...";],false) | 1:(Vc ([Set;"a";1;"...";],false)   | 1:(Vc ([Set;"a";1;"...";],false)   |
+----------------------------------+------------------------------------+------------------------------------+
| 2:(Vc ([Set;"b";1;"...";],false) | *2:(Vc ([Set;"b";1;"...";],false)* | *2:(Vc ([Set;"b";1;"...";],false)* |
+----------------------------------+------------------------------------+------------------------------------+
| 3:(Vc ([Set;"c";1;"...";],false) | *3:(Vc ([Set;"c";1;"...";],false)* |                                    |
+----------------------------------+------------------------------------+------------------------------------+

Node1 and node2 crashed, and their last tlog entries (marked) were lost. They are restarted while node0 is still offline. When node0 comes back, this results in the following situation:

+--------------------------------------+----------------------------------+----------------------------------+
| node0                                | node1                            | node2                            |
+======================================+==================================+==================================+
| 0:(Vm (node0,0.000000))              | 0:(Vm (node0,0.000000))          | 0:(Vm (node0,0.000000))          |
+--------------------------------------+----------------------------------+----------------------------------+
| 1:(Vc ([Set;"a";1;"...";],false)     | 1:(Vc ([Set;"a";1;"...";],false) | 1:(Vc ([Set;"a";1;"...";],false) |
+--------------------------------------+----------------------------------+----------------------------------+
| **2:(Vc ([Set;"b";1;"...";],false)** | **2:(Vm (node1,0.000000))**      | **2:(Vm (node1,0.000000))**      |
+--------------------------------------+----------------------------------+----------------------------------+
| 3:(Vc ([Set;"d";1;"...";],false)     | 3:(Vc ([Set;"d";1;"...";],false) | 3:(Vc ([Set;"d";1;"...";],false) |
+--------------------------------------+----------------------------------+----------------------------------+
Checksums
=========
This problem can be solved with a rolling checksum, computed over all the entries in the tlogs. This checksum should be the same on all nodes. The checksum is part of the value that is synced with multi-paxos:

1. The client sends a request to the master node.
2. The master computes the rolling checksum and builds a value out of this checksum and the update commands.
3. This value is sent to the slaves in an accept request.
4. The slaves compute the rolling checksum and compare it with the checksum in the value.
5. If the checksums are equal, the tlogs are in sync, the value is written to the tlogs, and the algorithm proceeds as usual.
6. If the checksums differ, something bad happened. The node halts, and the tlogs need to be inspected manually.

Catchup consists of two phases. In the first phase, the missing tlog entries are received from another node; the checksum of the first of these entries is validated, to prevent a catchup from a diverged node. In the second phase, the tlog entries are replayed to the store, and all checksums are validated.
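The master/slave checksum exchange can be sketched in Python. This is a minimal illustration, not the OCaml implementation: the ``roll`` helper and the byte encoding of entries are hypothetical, and plain CRC-32 (via ``zlib``) stands in for whatever checksum the implementation uses.

```python
import zlib

def roll(checksum, entry):
    # Extend the rolling checksum with the next serialized entry.
    # zlib.crc32 accepts a starting value, which is what makes it rolling.
    return zlib.crc32(entry, checksum) & 0xFFFFFFFF

# Master side: fold the checksum over all agreed entries, then ship it
# inside the value sent with the accept request.
entries = [b"Set a 1", b"Set b 1", b"Set c 1"]
master = 0
for e in entries:
    master = roll(master, e)

# Slave side: recompute independently over the local tlogs and compare
# with the checksum carried in the value.
slave = 0
for e in entries:
    slave = roll(slave, e)

assert slave == master  # equal: in sync; different: the node would halt
```

The key property is that each checksum depends on the entire history of agreed entries, so two tlogs that ever diverged produce different values from that point on.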

Remark
------
When several consecutive entries in the tlogs have the same number, only the last one is agreed upon by multi-paxos, and thus only that entry is used in the computation of the checksum.
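That selection rule can be sketched as follows (a toy illustration; the function name and the ``(i, entry)`` pair representation are invented here):

```python
def agreed_entries(tlog):
    # tlog is a list of (i, entry) pairs in file order. When entries
    # share the same i, only the last one was agreed by multi-paxos,
    # so later duplicates overwrite earlier ones.
    last = {}
    for i, entry in tlog:
        last[i] = entry
    return [last[i] for i in sorted(last)]

agreed_entries([(0, "a"), (1, "b"), (1, "b'"), (2, "c")])
# keeps "a", "b'", "c" -- the retried entry "b" is skipped
```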
Tlog Specification
==================
* Serial number (int64)
* Crc-32 checksum of Cmd (int32)
* Cmd

  - Value
  - Marker, optional (string option)

Older value format
------------------
* Update

  - Update type (int32 between 1 and 16)
  - Update details (depends on type)

* Synced (bool)

Old value format
----------------
* 0xff (int32)
* Value type (char 'c' or 'm')
* Value details (depends on type)

New value format
----------------
* 0x100 (int32)
* Checksum (int32 if crc-32 is used)
(Contributor comment on the checksum field: "maybe you want to dabble into MICs and MACs.")
* Value type (char 'c' or 'm')
* Value details (depends on type)
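The old/new magic dispatch described above can be sketched like this. The field layout and big-endian byte order are assumptions for illustration; the actual serializer lives in the OCaml code.

```python
import struct

OLD_MAGIC = 0xFF    # old value format: no checksum field
NEW_MAGIC = 0x100   # new value format: an int32 checksum follows

def read_value_header(buf):
    # Read the leading int32 and dispatch on the magic.
    (magic,) = struct.unpack_from(">i", buf, 0)
    if magic == NEW_MAGIC:
        (checksum,) = struct.unpack_from(">i", buf, 4)
        value_type = chr(buf[8])        # 'c' or 'm'
        return checksum, value_type
    if magic == OLD_MAGIC:
        # Pre-upgrade value: no checksum stored, modeled as None.
        return None, chr(buf[4])
    raise ValueError("unknown magic %#x" % magic)

new_value = struct.pack(">ii", NEW_MAGIC, 1234) + b"c"
old_value = struct.pack(">i", OLD_MAGIC) + b"m"
```

Because 0xff and 0x100 are distinct int32 values at the same offset, old and new values can coexist in one tlog file and be told apart at read time.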
Checksums in store
==================
The current tlog index and checksum are stored in the local store. When a node starts, they are compared with the values in the tlog. Likewise, after a collapse, the checksum of the last collapsed value is saved in the head database, so the rolling tlog checksum is a continuation of the checksum in the head database.
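One of the commits above ensures i and the checksum are updated simultaneously in the store. A toy in-memory sketch of that invariant (the store class, key names and transaction API are all invented for illustration):

```python
from contextlib import contextmanager

class ToyStore:
    # Invented stand-in for the node's local store.
    def __init__(self):
        self.data = {}

    @contextmanager
    def transaction(self):
        # All-or-nothing: on failure, roll back, so i and the checksum
        # can never be observed half-updated.
        snapshot = dict(self.data)
        try:
            yield self.data
        except Exception:
            self.data = snapshot
            raise

def apply_value(store, i, checksum, updates):
    # Apply the update commands and bump i and the checksum in the
    # same transaction, so a crash leaves them mutually consistent.
    with store.transaction() as d:
        d.update(updates)
        d["@i"] = i
        d["@checksum"] = checksum

store = ToyStore()
apply_value(store, 7, 0xCAFE, {"a": "1"})
```

If i and the checksum could be updated separately, a crash between the two writes would make the startup comparison against the tlog meaningless.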
Upgrade Path
============
To upgrade Arakoon to the new version, with the new tlog format, the nodes need to be restarted. A node that restarts after the upgrade can still read the old tlogs; the checksums of these values are set to None, and values with checksum None are never validated.

Nodes that need to do a catchup do this as usual. The first received values have checksum None and are written to the tlogs in the old format, until all values from before the upgrade are synced. All values created after the upgrade get a checksum. The checksum of the first such value is a plain checksum (not depending on previous values); all following checksums are rolling.
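The None rule amounts to a small guard in the validation path; a hedged sketch (the helper name is invented, the real check is in the OCaml Checksum module):

```python
def validate(expected, computed):
    # Values carrying checksum None predate the upgrade and are
    # accepted unconditionally; real checksums must match exactly.
    if expected is None:
        return True
    return expected == computed

assert validate(None, 12345)   # pre-upgrade value: never validated
assert validate(42, 42)        # rolling checksums agree
assert not validate(42, 43)    # diverged: the node halts
```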
New and old nodes cannot and will not communicate (they have a different magic). If the nodes are restarted one by one, the old nodes keep going as long as possible, while the new nodes cannot make progress because they lack a majority. When the critical point is reached, the new nodes do a catchup and take over.
New file (system test, Python):
"""
Copyright (2010-2014) INCUBAID BVBA

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import time
import shutil

from .. import system_tests_common as C
from nose.tools import *

from Compat import X

# Disabled test, kept as a string literal in the original:
"""
@C.with_custom_setup(C.setup_3_nodes, C.basic_teardown)
def test_diverge():
    C.iterate_n_times(100, C.simple_set)
    C.stop_all()

    C.remove_node(1)
    C.remove_node(2)
    C.regenerateClientConfig(C.cluster_id)
    C.start_all()
    C.iterate_n_times(5, C.simple_set, 100)
    C.stop_all()

    C.remove_node(0)
    C.add_node(1)
    C.add_node(2)
    C.regenerateClientConfig(C.cluster_id)
    C.start_all()
    C.iterate_n_times(10, C.simple_set)
    C.stop_all()

    C.add_node(0)
    C.regenerateClientConfig(C.cluster_id)
    C.start_all()
    time.sleep(3.0)
"""

@C.with_custom_setup(C.setup_2_nodes_forced_master, C.basic_teardown)
def test_power_failure():
    cluster = C._getCluster()
    C.iterate_n_times(50, C.simple_set)

    # Back up each node's home directory while that node is stopped.
    for i in range(2):
        node_id = C.node_names[i]
        C.stopOne(node_id)
        home = cluster.getNodeConfig(node_id)['home']
        backup = '/'.join([X.tmpDir, 'backup_' + node_id])
        shutil.copytree(home, backup)
        C.startOne(node_id)

    C.iterate_n_times(10, C.simple_set)
    C.stop_all()

    # Restore the backups, simulating a power failure that lost the
    # tlog entries written after the backup was taken.
    for i in range(2):
        node_id = C.node_names[i]
        home = cluster.getNodeConfig(node_id)['home']
        backup = '/'.join([X.tmpDir, 'backup_' + node_id])
        shutil.rmtree(home)
        shutil.move(backup, home)
        C.startOne(node_id)

    C.iterate_n_times(10, C.simple_set)
    C.startOne(C.node_names[2])
Review comment:
So, node1 lost the 2 last tlog entries and replaced them with others. This should not happen when you have an fsync between the writing of each entry, unless your mount options, file system, or hardware are wanting. So is this whole set of changes some kind of runtime detection of bad configuration, or a hardware lie detector?