Commit c3e47ae

Add repo for terraform-aws-percona-server (#233)
1 parent 713a424

2 files changed: 256 additions & 0 deletions

File tree

.claude/plans/percona-server.md
repos.tf

Lines changed: 249 additions & 0 deletions
# terraform-aws-percona-server

Terraform module for Percona Server replica set with GTID replication, Orchestrator HA, and automated failover.

GitHub Project: https://github.com/orgs/infrahouse/projects/7

---

## Requirements Summary

### Infrastructure

- Single ASG, odd number of instances (minimum 3)
- ASG health check type: ELB (uses NLB target group health)
- NLB with separate write/read target groups
- DynamoDB table for locks and topology
- S3 bucket for backups and binlogs

### Database

- Percona Server with GTID replication
- One master, N replicas

### Orchestrator

- Sidecar on each node
- Raft cluster with SQLite backend
- Post-failover hooks:
  - Update NLB target groups
  - Update scale-in protection
  - Update DynamoDB topology record

### Provisioning

- Puppet configures Percona and Orchestrator
- Puppet registers instances with the appropriate NLB target group
- Custom facts provide cluster_id, dynamodb_table, s3_bucket

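One way to deliver these facts is a Facter external fact: an executable dropped into `/etc/facter/facts.d/` that prints `key=value` pairs. A minimal sketch, assuming the three settings are exposed as EC2 instance tags with exactly those names (an assumption; the plan does not say where the values come from):

```python
#!/usr/bin/env python3
"""External fact sketch: print cluster settings as key=value pairs.

Assumes an instance profile and default region are configured, and that
cluster_id, dynamodb_table, and s3_bucket exist as instance tags.
"""
import urllib.request

import boto3

IMDS = "http://169.254.169.254/latest/meta-data/instance-id"
WANTED = {"cluster_id", "dynamodb_table", "s3_bucket"}

# IMDSv1 shown for brevity; production code would use IMDSv2 tokens.
with urllib.request.urlopen(IMDS, timeout=2) as resp:
    instance_id = resp.read().decode()

tags = boto3.client("ec2").describe_tags(
    Filters=[{"Name": "resource-id", "Values": [instance_id]}]
)["Tags"]

for tag in tags:
    if tag["Key"] in WANTED:
        # Facter external fact protocol: one key=value pair per line.
        print(f"{tag['Key']}={tag['Value']}")
```
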
### Bootstrap (Master Election)

- Instance acquires DynamoDB lock (`lock-{cluster_id}`)
- Reads topology record (`topology-{cluster_id}`)
- If no master: become master, write topology, take full backup, release lock
- If master exists: restore from backup, configure as replica, release lock
- Uses existing DynamoDB lock class: https://github.com/infrahouse/infrahouse-core/blob/main/src/infrahouse_core/aws/dynamodb.py#L15

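The election itself is a thin layer over DynamoDB conditional writes. A minimal sketch using plain boto3 (the `key` partition-key name and single-item lock are illustrative assumptions; the real module would reuse the infrahouse-core lock class linked above):

```python
"""Bootstrap election sketch; retry/backoff on a lost election is elided."""
import boto3

dynamodb = boto3.client("dynamodb")

def bootstrap_role(table: str, cluster_id: str,
                   instance_id: str, private_ip: str) -> str:
    # Acquire lock-{cluster_id}: the conditional put fails if another
    # instance already holds it (losers retry until the lock is free).
    dynamodb.put_item(
        TableName=table,
        Item={"key": {"S": f"lock-{cluster_id}"},
              "holder": {"S": instance_id}},
        ConditionExpression="attribute_not_exists(#k)",
        ExpressionAttributeNames={"#k": "key"},
    )
    try:
        topology = dynamodb.get_item(
            TableName=table, Key={"key": {"S": f"topology-{cluster_id}"}}
        )
        if "Item" not in topology:
            # No master yet: claim the role and record ourselves.
            dynamodb.put_item(
                TableName=table,
                Item={"key": {"S": f"topology-{cluster_id}"},
                      "instance_id": {"S": instance_id},
                      "private_ip": {"S": private_ip}},
            )
            # The full XtraBackup to S3 runs here, while the lock is
            # still held, so replicas never see a master without a backup.
            return "master"
        # Master exists: restore its backup and join as a replica.
        return "replica"
    finally:
        dynamodb.delete_item(
            TableName=table, Key={"key": {"S": f"lock-{cluster_id}"}}
        )
```
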
### Backups

- Tool: XtraBackup
- Schedule: Weekly full + daily incremental, 00:{random_minute} UTC
- Source: One replica (lock-based election via `backup-lock-{cluster_id}`)
- Destination: S3
- Retention: Configurable, default 4 weeks
- Dataset size: Up to 5 TB (expect 4-6 hours for a full backup/restore)

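A sketch of the job the winning replica runs (leader election via `backup-lock-{cluster_id}` is elided; the local staging directory and the use of `aws s3 sync` for the upload are illustrative choices, not decisions made by this plan):

```python
"""Weekly full backup sketch; runs only on the lock-holding replica."""
import subprocess
from datetime import datetime, timezone

def full_backup(bucket: str, cluster_id: str) -> None:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = "/var/backups/percona/full"
    # Take a full physical backup with XtraBackup.
    subprocess.run(
        ["xtrabackup", "--backup", f"--target-dir={target}"], check=True
    )
    # Upload under the S3 layout shown in the architecture diagram.
    subprocess.run(
        ["aws", "s3", "sync", target,
         f"s3://{bucket}/{cluster_id}/full/{timestamp}/"],
        check=True,
    )
```

The daily incremental job is the same shape, with `--incremental-basedir` pointing at the previous backup and an `incremental/{timestamp}/` destination.
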
### Binlog Archival

- Real-time streaming via mysqlbinlog
- One replica holds `binlog-lock-{cluster_id}`
- Position tracked in DynamoDB (`binlog-position-{cluster_id}`)
- GTID-based tracking for minimal data loss
- Synced to S3 continuously

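The streaming side is one long-running `mysqlbinlog` process on whichever replica holds the lock; a sketch (credentials would come from an option file, and the spool directory is illustrative):

```python
"""Binlog streaming sketch for the binlog-lock-{cluster_id} holder."""
import subprocess

def stream_binlogs(master_host: str, user: str,
                   spool_dir: str, first_binlog: str) -> None:
    # Connects to the master as a replica, writes raw binlog files into
    # spool_dir, and blocks indefinitely thanks to --stop-never.
    subprocess.run(
        [
            "mysqlbinlog",
            "--read-from-remote-server",
            f"--host={master_host}",
            f"--user={user}",
            "--raw",
            "--stop-never",
            f"--result-file={spool_dir}/",
            first_binlog,
        ],
        check=True,
    )
```

A companion loop syncs the spool directory to `s3://{bucket}/{cluster_id}/binlogs/` and records the last uploaded GTID set in `binlog-position-{cluster_id}`, so a newly elected streamer knows where to resume.
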
### Health & Replacement

- NLB health check detects a dead Percona Server (TCP port 3306)
- ASG uses NLB health status to trigger instance replacement
- Puppet on the new instance handles re-registration

### Target Group Registration

| Event | Who Handles | Action |
|-------|-------------|--------|
| Initial master boot | Puppet | Register to write TG |
| Initial replica boot | Puppet | Register to read TG |
| Master dies (failover) | Orchestrator hook | Deregister old master from write TG, register new master to write TG |
| Replica dies (replacement) | Puppet on new instance | Register to read TG |

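Both actors reduce to the same two ELBv2 calls; a minimal sketch of the shared primitive (function names are illustrative):

```python
"""Target group registration primitive shared by Puppet and the hook."""
import boto3

elbv2 = boto3.client("elbv2")

def register(instance_id: str, tg_arn: str) -> None:
    # Boot case: Puppet registers the instance with its role's group.
    elbv2.register_targets(TargetGroupArn=tg_arn,
                           Targets=[{"Id": instance_id}])

def move(instance_id_out: str, instance_id_in: str, tg_arn: str) -> None:
    # Failover case: swap the write TG membership to the new master.
    elbv2.deregister_targets(TargetGroupArn=tg_arn,
                             Targets=[{"Id": instance_id_out}])
    elbv2.register_targets(TargetGroupArn=tg_arn,
                           Targets=[{"Id": instance_id_in}])
```
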
### DynamoDB Keys

| Key | Purpose |
|-----|---------|
| `lock-{cluster_id}` | Master election lock |
| `topology-{cluster_id}` | Master info (instance_id, private_ip) |
| `backup-lock-{cluster_id}` | Backup leader election |
| `binlog-lock-{cluster_id}` | Binlog streaming leader election |
| `binlog-position-{cluster_id}` | Last synced GTID position |

---

## Project Items (Todo)

### Infrastructure

1. ASG configuration - single ASG, odd number validation, ELB health check type
2. NLB with separate write/read target groups
3. DynamoDB table for locks and topology
4. S3 bucket for backups and binlogs

### Percona Server

5. Install Percona repository via Puppet
6. Configure Percona Server with GTID replication
7. Master election bootstrap logic with DynamoDB lock
8. Scale-in protection for master instance

### Orchestrator

9. Orchestrator sidecar deployment (Raft + SQLite)
10. Post-failover hook: Update NLB target groups
11. Post-failover hook: Update scale-in protection
12. Post-failover hook: Update DynamoDB topology record

### Backups

13. XtraBackup installation and configuration
14. Weekly full backup with lock-based leader election
15. Daily incremental backup
16. S3 backup storage with configurable retention (default 4 weeks)
17. Bootstrap backup from master (lock held until complete)

### Binlog Archival

18. Real-time binlog streaming setup (mysqlbinlog)
19. Binlog streaming leader election with lock
20. GTID position tracking in DynamoDB
21. S3 binlog sync

### Puppet Integration

22. Custom facts for cluster_id, dynamodb_table, s3_bucket
23. NLB target group registration logic
24. Percona Server configuration management
25. Orchestrator configuration management

---

## Module Inputs (Draft)

```hcl
variable "cluster_id" {
  description = "Unique identifier for the Percona cluster"
  type        = string
}

variable "instance_count" {
  description = "Number of instances (must be odd, minimum 3)"
  type        = number
  default     = 3

  validation {
    condition     = var.instance_count >= 3 && var.instance_count % 2 == 1
    error_message = "Instance count must be an odd number >= 3."
  }
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "vpc_id" {
  description = "VPC ID"
  type        = string
}

variable "subnet_ids" {
  description = "Subnet IDs for the ASG"
  type        = list(string)
}

variable "backup_retention_weeks" {
  description = "Number of weeks to retain backups"
  type        = number
  default     = 4
}

variable "percona_version" {
  description = "Percona Server version (e.g., ps80, ps57)"
  type        = string
  default     = "ps80"
}
```

---

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────┐
│                       Single ASG                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │   Percona   │  │   Percona   │  │   Percona   │      │
│  │   Master    │  │   Replica   │  │   Replica   │      │
│  │ (protected) │  │             │  │             │      │
│  │ Orchestrator│  │ Orchestrator│  │ Orchestrator│      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────────────────────────────────────────────────────┘
          │                            │
          │ write TG                   │ read TG
          ▼                            ▼
┌─────────────────────────────────────────────────────────┐
│                           NLB                           │
│   db-write.internal              db-read.internal       │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                        DynamoDB                         │
│  lock-{cluster_id}            topology-{cluster_id}     │
│  backup-lock-{cluster_id}     binlog-lock-{cluster_id}  │
│  binlog-position-{cluster_id}                           │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                            S3                           │
│  s3://{bucket}/{cluster_id}/full/{timestamp}/           │
│  s3://{bucket}/{cluster_id}/incremental/{timestamp}/    │
│  s3://{bucket}/{cluster_id}/binlogs/                    │
└─────────────────────────────────────────────────────────┘
```

---

## Failover Flow

1. Master dies
2. NLB health check fails and traffic to the master stops
3. ASG marks the instance unhealthy (ELB health check type)
4. Orchestrator (on replicas) detects master failure
5. Orchestrator Raft consensus elects a new master
6. Orchestrator post-failover hook (sketched after this list):
   - Deregisters old master from write TG
   - Registers new master to write TG
   - Removes scale-in protection from old master
   - Adds scale-in protection to new master
   - Updates DynamoDB topology record
7. ASG terminates old master, launches replacement
8. New instance boots, runs Puppet
9. Puppet reads topology from DynamoDB, configures as replica
10. Puppet registers with read TG

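A sketch of the step-6 hook, assuming Orchestrator supplies the old and new master identities (how they arrive, e.g. via hook environment variables, is not pinned down by this plan):

```python
"""Post-failover hook sketch; argument plumbing is illustrative."""
import boto3

def promote(asg: str, table: str, cluster_id: str, write_tg: str,
            old_id: str, new_id: str, new_ip: str) -> None:
    elbv2 = boto3.client("elbv2")
    autoscaling = boto3.client("autoscaling")
    dynamodb = boto3.client("dynamodb")

    # Swing the write target group over to the new master.
    elbv2.deregister_targets(TargetGroupArn=write_tg,
                             Targets=[{"Id": old_id}])
    elbv2.register_targets(TargetGroupArn=write_tg,
                           Targets=[{"Id": new_id}])

    # Move scale-in protection so the ASG reaps the dead master,
    # never the live one.
    autoscaling.set_instance_protection(
        AutoScalingGroupName=asg, InstanceIds=[old_id],
        ProtectedFromScaleIn=False)
    autoscaling.set_instance_protection(
        AutoScalingGroupName=asg, InstanceIds=[new_id],
        ProtectedFromScaleIn=True)

    # Record the new master so replacement instances can find it.
    dynamodb.put_item(
        TableName=table,
        Item={"key": {"S": f"topology-{cluster_id}"},
              "instance_id": {"S": new_id},
              "private_ip": {"S": new_ip}},
    )
```
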
---

## Bootstrap Flow (First Cluster Creation)

1. ASG launches 3 instances simultaneously
2. Each instance runs Puppet
3. Each tries to acquire `lock-{cluster_id}`
4. Winner:
   - Reads `topology-{cluster_id}`: not found
   - Configures self as master
   - Writes topology record (instance_id, private_ip)
   - Takes full XtraBackup to S3
   - Releases lock
   - Registers with write TG
   - Sets scale-in protection on self
5. Losers (one at a time):
   - Acquire lock
   - Read topology: master found
   - Restore from S3 backup
   - Configure as replica with `MASTER_AUTO_POSITION=1` (see the sketch after this list)
   - Release lock
   - Register with read TG
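The replica-configuration step boils down to a GTID auto-positioning `CHANGE MASTER` issued against the freshly restored server; a minimal sketch shelling out to the `mysql` client (credentials are illustrative):

```python
"""Replica configuration sketch for step 5 of the bootstrap flow."""
import subprocess

def configure_replica(master_ip: str, user: str, password: str) -> None:
    sql = (
        f"CHANGE MASTER TO MASTER_HOST='{master_ip}', "
        f"MASTER_USER='{user}', MASTER_PASSWORD='{password}', "
        "MASTER_AUTO_POSITION=1; "
        "START SLAVE;"
    )
    # With MASTER_AUTO_POSITION=1 the replica negotiates the missing
    # GTID ranges itself; no binlog file/offset bookkeeping is needed.
    subprocess.run(["mysql", "-e", sql], check=True)
```

On recent 8.0 releases the `CHANGE REPLICATION SOURCE TO ... SOURCE_AUTO_POSITION=1` spelling is the non-deprecated equivalent.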

repos.tf

Lines changed: 7 additions & 0 deletions
```diff
@@ -217,6 +217,13 @@ locals {
       OPENVPN_CLIENT_SECRET : module.openvpn-oauth-client-id.secret_value
     }
   }
+  "terraform-aws-percona-server" = {
+    "description" = <<-EOT
+      Terraform module for Percona Server replica set with GTID replication,
+      Orchestrator HA, and automated failover.
+    EOT
+    "type" = "terraform_module"
+  }
   "terraform-aws-pmm-ecs" = {
     "description" = <<-EOT
       Terraform module for deploying Percona Monitoring and Management (PMM) server
```
