Commit f41e094

init
1 parent 1c5e67e commit f41e094

8 files changed: +421 -36 lines changed
EXERCISE_SUMMARY.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+# Lab Automation Interview Exercise - Quick Reference
+
+## For the Interviewer
+
+This is a 60-90 minute paired programming exercise for a Staff Full-Stack Engineer role.
+
+### Setup (Before Interview)
+```bash
+docker-compose up --build
+```
+
+Verify all bugs are present by checking that Bug 2 prevents workflows from starting.
+
+### Exercise Flow
+
+**Phase 1: Debugging (30-40 min)**
+Candidate fixes 4 bugs:
+1. Missing `DEVICE_API_URL` env var → service crashes
+2. Wrong API endpoint path → workflows fail to start
+3. Race condition in device booking → run `./test-race-condition.sh` to demonstrate
+4. Frontend status cache → device status doesn't update in UI
+
+**Phase 2: Feature Implementation (30-50 min)**
+Implement pause/resume workflow feature (backend + frontend)
+
+### Quick Debugging
+
+If candidate gets stuck:
+
+**Bug 1**: Check `docker-compose logs workflow-service`
+**Bug 2**: Look at the endpoint path in `workflow-service/app.py` line ~127
+**Bug 3**: Run `./test-race-condition.sh` and check logs for multiple "successfully booked" messages
+**Bug 4**: Check `DeviceList.js` - notice the status caching logic
+
+### Evaluation Focus
+
+✅ Systematic debugging approach (logs, tools)
+✅ Full-stack competency (React + Python)
+✅ Understanding of distributed systems
+✅ Clean code and testing
+✅ Communication and problem-solving
+
+### Files to Review
+- `README.md` - Candidate instructions
+- `SOLUTION_GUIDE.md` - Complete solutions and fixes
+- `test-race-condition.sh` - Script to reproduce Bug 3
+
+### Key Artifacts
+- Candidate should fix bugs in code
+- Candidate should implement pause/resume endpoints + UI
+- Check git commits for quality
+
+---
+
+## For the Candidate
+
+See [README.md](README.md) for full instructions.
+
+### Quick Start
+```bash
+docker-compose up --build
+# Open http://localhost:3000
+# Fix bugs, then implement pause/resume feature
+```
+
+### Key Commands
+```bash
+docker-compose logs -f workflow-service # View service logs
+./test-race-condition.sh # Test concurrency bug
+docker-compose restart workflow-service # Restart after changes
+```
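
Bug 3 in the summary above is a check-then-set race: the booking handler reads the device status, pauses, then writes "busy". The contents of `test-race-condition.sh` are not shown in this commit, so the following is only a minimal in-process sketch of the same pattern, with no Flask or Redis involved. The names `run_bookings`, `check_then_set`, and the `state` dict are hypothetical stand-ins for illustration, not code from the repository.

```python
import threading
import time


def run_bookings(use_lock: bool, workers: int = 5) -> int:
    """Simulate concurrent check-then-set bookings and return how many succeeded."""
    # Hypothetical stand-in for the Redis-backed device status.
    state = {"status": "available"}
    lock = threading.Lock()
    # The barrier releases all workers at the same moment to provoke the race.
    start = threading.Barrier(workers)
    successes = []

    def check_then_set(workflow_id: str) -> None:
        if state["status"] != "available":
            return
        # Widen the window between the check and the set, like time.sleep(0.1)
        # does in the booking handler.
        time.sleep(0.05)
        state["status"] = "busy"
        successes.append(workflow_id)

    def book(workflow_id: str) -> None:
        start.wait()
        if use_lock:
            # Holding the lock makes the check and the set one atomic step.
            with lock:
                check_then_set(workflow_id)
        else:
            check_then_set(workflow_id)

    threads = [
        threading.Thread(target=book, args=(f"wf-{i}",)) for i in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(successes)


if __name__ == "__main__":
    print("bookings without lock:", run_bookings(use_lock=False))
    print("bookings with lock:", run_bookings(use_lock=True))
```

With the lock held, exactly one booking succeeds. Without it, the count usually exceeds one, although the exact number depends on thread scheduling.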

README.md

Lines changed: 18 additions & 1 deletion
@@ -61,11 +61,14 @@ The system currently has several bugs that prevent it from working correctly. Yo
 - Enable workflows to be created and started successfully
 - Ensure device status updates correctly in the UI
 - Identify and fix the race condition in device booking
+  - Run `./test-race-condition.sh` to reproduce the concurrency bug
+  - Check logs to see if multiple workflows book the same device

 **Tips**:
 - Check service logs: `docker-compose logs [service-name]`
 - Use browser DevTools to inspect API calls
 - Test the system end-to-end after each fix
+- For the race condition, look for multiple "successfully booked" messages in device-service logs

 ### Phase 2: Feature Implementation (30-50 minutes)

@@ -115,8 +118,19 @@ We're looking for:
 # View logs for all services
 docker-compose logs -f

-# View logs for specific service
+# View logs for specific service (recommended for debugging)
 docker-compose logs -f workflow-service
+docker-compose logs -f device-service
+docker-compose logs -f sample-service
+
+# View recent logs without following
+docker-compose logs --tail=50 workflow-service
+
+# Test for race condition in device booking
+./test-race-condition.sh
+
+# Reset all data (clear workflows, device statuses, samples)
+docker-compose restart redis

 # Restart a specific service
 docker-compose restart workflow-service
@@ -127,6 +141,9 @@ docker-compose up --build
 # Stop all services
 docker-compose down

+# Stop all services and remove volumes (complete clean slate)
+docker-compose down -v
+
 # Check if services are healthy
 curl http://localhost:5001/health
 curl http://localhost:5002/health
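
Phase 2 asks for a pause/resume feature, but its requirements live in the part of README.md that this diff does not show. As a rough sketch of one way a backend could guard those transitions, the snippet below models them as a small lookup table. The status names `running` and `paused`, the `ALLOWED_TRANSITIONS` table, and `apply_action` are all assumptions made for illustration; the real workflow-service may use different states.

```python
# Hypothetical transition table: action -> (required current status, new status).
ALLOWED_TRANSITIONS = {
    "pause": ("running", "paused"),
    "resume": ("paused", "running"),
}


class InvalidTransition(Exception):
    """Raised when an action is not valid for the workflow's current status."""


def apply_action(current_status: str, action: str) -> str:
    """Return the new status for a pause/resume action, or raise if not allowed."""
    if action not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown action: {action}")
    required_status, new_status = ALLOWED_TRANSITIONS[action]
    if current_status != required_status:
        raise InvalidTransition(
            f"cannot {action} a workflow in status '{current_status}'"
        )
    return new_status


if __name__ == "__main__":
    status = apply_action("running", "pause")
    print(status)
    status = apply_action(status, "resume")
    print(status)
```

An endpoint built on this could map `InvalidTransition` to a 409 response, mirroring how the device service already answers a booking conflict.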

SOLUTION_GUIDE.md

Lines changed: 75 additions & 17 deletions
@@ -61,13 +61,65 @@ response = requests.post(

 ### Bug 3: Race Condition in Device Booking

-**Location**: `services/device-service/app.py:70-91`
+**Location**: `services/device-service/app.py:93-125`

 **Symptom**: Multiple workflows can book the same device simultaneously when requests arrive concurrently

 **Root Cause**: No atomic check-and-set operation. The code reads status, waits (simulating processing), then sets status - allowing race conditions.

-**Fix**: Use Redis atomic operations. Replace the book_device function:
+**How to Reproduce**: Run `./test-race-condition.sh` and check the logs. You'll see multiple "successfully booked" messages for the same device.
+
+**Fix Option 1 - Python Threading Lock** (Simpler, but only works when the service runs as a single process):
+
+```python
+from threading import Lock
+
+# Add at module level, after DEVICES is defined: one lock per known device
+device_locks = {device_id: Lock() for device_id in DEVICES}
+
+@app.route('/devices/<device_id>/book', methods=['POST'])
+def book_device(device_id):
+    """Book a device for a workflow"""
+    if device_id not in DEVICES:
+        logger.warning(f"Device not found: {device_id}")
+        return jsonify({'error': 'Device not found'}), 404
+
+    data = request.json
+    workflow_id = data.get('workflow_id')
+
+    if not workflow_id:
+        logger.error("Booking request missing workflow_id")
+        return jsonify({'error': 'workflow_id required'}), 400
+
+    # Locks are created once at import time; creating them lazily here
+    # would itself be a check-then-set race between concurrent requests
+    device_lock = device_locks[device_id]
+
+    logger.info(f"Attempting to book device {device_id} for workflow {workflow_id}")
+
+    # Use lock to ensure atomic check-and-set
+    with device_lock:
+        current_status = get_device_status(device_id)
+
+        if current_status != 'available':
+            logger.warning(f"Device {device_id} is not available (status: {current_status})")
+            return jsonify({'error': 'Device is not available'}), 409
+
+        time.sleep(0.1)
+        set_device_status(device_id, 'busy', workflow_id)
+
+    logger.info(f"Device {device_id} successfully booked by workflow {workflow_id}")
+    return jsonify({
+        'device_id': device_id,
+        'status': 'busy',
+        'workflow_id': workflow_id,
+        'booked_at': datetime.utcnow().isoformat()
+    })
+```
+
+**Fix Option 2 - Redis Atomic Operations** (Better for distributed systems):
+
+Replace the book_device function:

 ```python
 @app.route('/devices/<device_id>/book', methods=['POST'])
@@ -124,15 +176,17 @@ def release_device(device_id):
 ```

 **Evaluation Points**:
-- ✅ Identifies the race condition (may require testing or code review)
-- ✅ Understands distributed systems challenges
-- ✅ Knows about atomic operations (SETNX, compare-and-swap, etc.)
-- ✅ Implements proper locking mechanism
-
-**Alternative Solutions** (also acceptable):
-- Database transactions with SELECT FOR UPDATE
-- Distributed locks (Redis SETNX, Redlock)
-- Optimistic locking with version numbers
+- ✅ Runs the test script to reproduce the issue
+- ✅ Identifies the race condition from logs or code review
+- ✅ Understands the need for atomic operations
+- ✅ Implements proper locking mechanism (either threading.Lock or Redis)
+- ✅ Verifies fix by running test script again
+
+**Acceptable Solutions**:
+- Threading Lock (shown above) - simple, works for single instance
+- Redis SETNX (shown above) - better for multiple instances
+- Remove the `time.sleep(0.1)` - narrows the race window so the bug rarely reproduces (acceptable but doesn't address root cause)
+- Database transactions with SELECT FOR UPDATE (if they add a database)

 ---

@@ -492,7 +546,7 @@ function WorkflowList({ workflows, onStart, onComplete, onPause, onResume }) {
 - [ ] Workflows can be created
 - [ ] Workflows fail to start (Bug 2 present)
 - [ ] Device status doesn't update in UI (Bug 4 present)
-- [ ] Race condition can be triggered with concurrent requests (Bug 3)
+- [ ] Race condition can be triggered: `./test-race-condition.sh` shows multiple bookings (Bug 3)

 ### Post-Exercise Validation

@@ -504,7 +558,7 @@ Run through this flow to verify all bugs are fixed:
 4. Check device status in UI → should show "busy"
 5. Complete the workflow → should succeed
 6. Check device status → should show "available"
-7. Test concurrent bookings (optional, for Bug 3)
+7. Test concurrent bookings: `./test-race-condition.sh` → should show only ONE successful booking

 ---

@@ -514,21 +568,24 @@ Run through this flow to verify all bugs are fixed:

 **Excellent Candidate**:
 - Systematically checks logs and identifies issues
+- Runs the race condition test script proactively
 - Fixes bugs in logical order (environment → API → race condition)
 - Implements clean, well-tested pause/resume feature
 - Asks clarifying questions about requirements
 - Explains their thought process clearly
-- Shows understanding of distributed systems
+- Shows understanding of distributed systems and concurrency

 **Good Candidate**:
 - Finds and fixes most bugs
+- Uses test script when prompted
 - Implements working pause/resume feature
-- May need hints for race condition
+- May need hints for race condition fix approach
 - Code works but could be cleaner

 **Needs Improvement**:
 - Struggles to identify bugs without significant help
 - Random debugging approach (trial and error)
+- Doesn't use provided tools (test script, logs)
 - Incomplete feature implementation
 - Doesn't test their changes

@@ -538,13 +595,14 @@ Typical breakdown for a strong candidate:
 - Bug 1 (env var): 5 minutes
 - Bug 2 (wrong endpoint): 5-10 minutes
 - Bug 4 (frontend cache): 10-15 minutes
-- Bug 3 (race condition): 10-20 minutes
+- Bug 3 (race condition): 15-20 minutes (including running test script and implementing fix)
 - Pause/Resume feature: 25-40 minutes

 If time is running short, you can:
-- Skip Bug 3 (race condition) - it's the most complex
+- Skip Bug 3 (race condition) or show them the test script output
 - Reduce scope of pause/resume (backend only)
 - Provide hints more liberally
+- Accept simpler fix for Bug 3 (just remove the `time.sleep(0.1)`)

 ### Discussion Questions
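
The body of Fix Option 2 falls in unchanged context that this diff hides, so the following is only a hedged sketch of the general SET-if-not-exists idea, not the guide's actual solution. It relies on redis-py's `set(name, value, nx=True)`, which returns a truthy value only when the key did not already exist. The key name `device:{device_id}:owner`, the helpers `try_book` and `release`, and the `InMemoryClient` stand-in (included so the example runs without a Redis server) are all hypothetical. The service itself stores `device:{device_id}:status` and `device:{device_id}:workflow`.

```python
class InMemoryClient:
    """Hypothetical single-threaded stand-in for the few redis-py calls used below."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False):
        # Mirrors redis-py: with nx=True, returns None when the key already exists.
        if nx and name in self._data:
            return None
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)

    def delete(self, name):
        return 1 if self._data.pop(name, None) is not None else 0


def try_book(client, device_id: str, workflow_id: str) -> bool:
    """Claim the device in one atomic step; only the first caller wins."""
    claimed = client.set(f"device:{device_id}:owner", workflow_id, nx=True)
    return bool(claimed)


def release(client, device_id: str, workflow_id: str) -> bool:
    """Release the device if this workflow owns it."""
    key = f"device:{device_id}:owner"
    owner = client.get(key)
    if isinstance(owner, bytes):  # a real Redis client returns bytes by default
        owner = owner.decode("utf-8")
    if owner != workflow_id:
        return False
    client.delete(key)
    return True


if __name__ == "__main__":
    client = InMemoryClient()
    print(try_book(client, "device-1", "wf-1"))  # first claim
    print(try_book(client, "device-1", "wf-2"))  # competing claim
    print(release(client, "device-1", "wf-1"))
```

Against a real server, passing a `redis.Redis` client instead of the stand-in should behave the same way for `try_book`. The `release` helper still does a separate get and delete, so a stricter version would need a transaction or a server-side script to make that step atomic as well.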

services/device-service/app.py

Lines changed: 28 additions & 5 deletions
@@ -5,10 +5,21 @@
 import os
 import time
 from datetime import datetime
+import logging
+import sys

 app = Flask(__name__)
 CORS(app)

+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    handlers=[logging.StreamHandler(sys.stdout)]
+)
+logger = logging.getLogger(__name__)
+
+app.logger.setLevel(logging.INFO)
+
 # Redis connection
 redis_client = redis.from_url(os.getenv('REDIS_URL', 'redis://localhost:6379'))

@@ -87,24 +98,29 @@ def get_device(device_id):
 def book_device(device_id):
     """Book a device for a workflow"""
     if device_id not in DEVICES:
+        logger.warning(f"Device not found: {device_id}")
         return jsonify({'error': 'Device not found'}), 404

     data = request.json
     workflow_id = data.get('workflow_id')

     if not workflow_id:
+        logger.error("Booking request missing workflow_id")
         return jsonify({'error': 'workflow_id required'}), 400

+    logger.info(f"Attempting to book device {device_id} for workflow {workflow_id}")
+
     current_status = get_device_status(device_id)

     if current_status != 'available':
+        logger.warning(f"Device {device_id} is not available (status: {current_status})")
         return jsonify({'error': 'Device is not available'}), 409

-    # Simulate some processing time that makes race condition more likely
     time.sleep(0.1)

     set_device_status(device_id, 'busy', workflow_id)

+    logger.info(f"Device {device_id} successfully booked by workflow {workflow_id}")
     return jsonify({
         'device_id': device_id,
         'status': 'busy',
@@ -116,18 +132,22 @@ def book_device(device_id):
 def release_device(device_id):
     """Release a device from a workflow"""
     if device_id not in DEVICES:
+        logger.warning(f"Device not found: {device_id}")
         return jsonify({'error': 'Device not found'}), 404

     data = request.json
     workflow_id = data.get('workflow_id')

-    # Verify the workflow owns this device
+    logger.info(f"Attempting to release device {device_id} from workflow {workflow_id}")
+
     current_workflow = redis_client.get(f'device:{device_id}:workflow')
     if current_workflow and current_workflow.decode('utf-8') != workflow_id:
+        logger.warning(f"Device {device_id} is booked by another workflow")
         return jsonify({'error': 'Device is booked by another workflow'}), 403

     set_device_status(device_id, 'available')

+    logger.info(f"Device {device_id} released successfully")
     return jsonify({
         'device_id': device_id,
         'status': 'available',
@@ -138,20 +158,23 @@ def release_device(device_id):
 def execute_operation(device_id):
     """Execute an operation on a device"""
     if device_id not in DEVICES:
+        logger.warning(f"Device not found: {device_id}")
         return jsonify({'error': 'Device not found'}), 404

     data = request.json
     operation = data.get('operation')
     workflow_id = data.get('workflow_id')

-    # Verify the workflow owns this device
+    logger.info(f"Executing operation '{operation}' on device {device_id} for workflow {workflow_id}")
+
     current_workflow = redis_client.get(f'device:{device_id}:workflow')
     if not current_workflow or current_workflow.decode('utf-8') != workflow_id:
+        logger.warning(f"Device {device_id} not booked by workflow {workflow_id}")
         return jsonify({'error': 'Device not booked by this workflow'}), 403

-    # Simulate operation execution
     time.sleep(0.5)

+    logger.info(f"Operation '{operation}' completed on device {device_id}")
     return jsonify({
         'device_id': device_id,
         'operation': operation,
@@ -165,4 +188,4 @@ def execute_operation(device_id):
         if not redis_client.exists(f'device:{device_id}:status'):
             set_device_status(device_id, 'available')

-    app.run(host='0.0.0.0', port=5001, debug=True)
+    app.run(host='0.0.0.0', port=5001, debug=False)
