@@ -61,13 +61,65 @@ response = requests.post(
6161
6262### Bug 3: Race Condition in Device Booking
6363
64- **Location**: `services/device-service/app.py:70-91`
64+ **Location**: `services/device-service/app.py:93-125`
6565
6666**Symptom**: Multiple workflows can book the same device simultaneously when requests arrive concurrently
6767
6868**Root Cause**: No atomic check-and-set operation. The code reads status, waits (simulating processing), then sets status - allowing race conditions.
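
The read, wait, write sequence described above can be reproduced without the service. The following is a minimal sketch, not the service code: the in-memory `status` dict, the `run_bookings` helper, and the `wf-N` workflow IDs are all hypothetical stand-ins used only to illustrate the race.

```python
import threading
import time


def run_bookings(use_lock, workers=5):
    """Simulate concurrent bookings of one device; return the workflows that 'won'."""
    status = {"device-1": "available"}  # hypothetical stand-in for the device store
    lock = threading.Lock()
    start = threading.Barrier(workers)  # release all threads at the same moment
    winners = []

    def check_and_set(workflow_id):
        if status["device-1"] != "available":  # 1. read
            return
        time.sleep(0.05)                       # 2. simulated processing
        status["device-1"] = "busy"            # 3. write
        winners.append(workflow_id)

    def book(workflow_id):
        start.wait()
        if use_lock:
            with lock:  # makes steps 1-3 one indivisible unit
                check_and_set(workflow_id)
        else:
            check_and_set(workflow_id)

    threads = [
        threading.Thread(target=book, args=(f"wf-{i}",)) for i in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners


print("without lock:", len(run_bookings(use_lock=False)), "successful bookings")
print("with lock:", len(run_bookings(use_lock=True)), "successful bookings")
```

Without the lock, several workflows typically pass the availability check before any of them writes `busy`, so more than one booking succeeds. With the lock, exactly one does.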
6969
70- **Fix**: Use Redis atomic operations. Replace the book_device function:
70+ **How to Reproduce**: Run `./test-race-condition.sh` and check the logs. You'll see multiple "successfully booked" messages for the same device.
71+
72+ **Fix Option 1 - Python Threading Lock** (Simpler, works for single instance):
73+
74+ ```python
75+ from threading import Lock
76+
77+ # Add at module level
78+ device_locks = {}
79+
80+ @app.route('/devices/<device_id>/book', methods=['POST'])
81+ def book_device(device_id):
82+     """Book a device for a workflow"""
83+     if device_id not in DEVICES:
84+         logger.warning(f"Device not found: {device_id}")
85+         return jsonify({'error': 'Device not found'}), 404
86+
87+     data = request.json
88+     workflow_id = data.get('workflow_id')
89+
90+     if not workflow_id:
91+         logger.error("Booking request missing workflow_id")
92+         return jsonify({'error': 'workflow_id required'}), 400
93+
94+     # Get or create this device's lock. dict.setdefault is a single atomic call in CPython,
95+     # so two concurrent requests cannot end up holding different Lock objects.
96+     device_lock = device_locks.setdefault(device_id, Lock())
97+
98+     logger.info(f"Attempting to book device {device_id} for workflow {workflow_id}")
99+
100+     # Use lock to ensure atomic check-and-set
101+     with device_lock:
102+         current_status = get_device_status(device_id)
103+
104+         if current_status != 'available':
105+             logger.warning(f"Device {device_id} is not available (status: {current_status})")
106+             return jsonify({'error': 'Device is not available'}), 409
107+
108+         time.sleep(0.1)  # simulated processing, kept from the original handler
109+         set_device_status(device_id, 'busy', workflow_id)
110+
111+     logger.info(f"Device {device_id} successfully booked by workflow {workflow_id}")
112+     return jsonify({
113+         'device_id': device_id,
114+         'status': 'busy',
115+         'workflow_id': workflow_id,
116+         'booked_at': datetime.utcnow().isoformat()
117+     })
118+ ```
119+
120+ **Fix Option 2 - Redis Atomic Operations** (Better for distributed systems):
121+
122+ Replace the `book_device` function:
71123
72124```python
73125@app.route('/devices/<device_id>/book', methods=['POST'])
@@ -124,15 +176,17 @@ def release_device(device_id):
124176```
125177
126178**Evaluation Points**:
127- - ✅ Identifies the race condition (may require testing or code review)
128- - ✅ Understands distributed systems challenges
129- - ✅ Knows about atomic operations (SETNX, compare-and-swap, etc.)
130- - ✅ Implements proper locking mechanism
131-
132- **Alternative Solutions** (also acceptable):
133- - Database transactions with SELECT FOR UPDATE
134- - Distributed locks (Redis SETNX, Redlock)
135- - Optimistic locking with version numbers
179+ - ✅ Runs the test script to reproduce the issue
180+ - ✅ Identifies the race condition from logs or code review
181+ - ✅ Understands the need for atomic operations
182+ - ✅ Implements proper locking mechanism (either threading.Lock or Redis)
183+ - ✅ Verifies fix by running test script again
184+
185+ **Acceptable Solutions**:
186+ - Threading Lock (shown above) - simple, works for single instance
187+ - Redis SETNX (shown above) - better for multiple instances
188+ - Remove the `time.sleep(0.1)` - narrows the race window so the race becomes hard to trigger, but the check-and-set is still not atomic (acceptable, but doesn't address root cause)
189+ - Database transactions with SELECT FOR UPDATE (if they add a database)
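
If a candidate does add a database, one way to get an atomic check-and-set is a single conditional `UPDATE`. The sketch below uses SQLite from the standard library so it runs anywhere. SQLite has no `SELECT FOR UPDATE`, so this illustrates a related compare-and-set technique rather than the row-lock approach named in the list. The `devices` table, its columns, and this standalone `book_device` helper are hypothetical.

```python
import sqlite3


def book_device(conn, device_id, workflow_id):
    """Return True if this call booked the device, False if it was already taken."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE devices SET status = 'busy', workflow_id = ? "
            "WHERE id = ? AND status = 'available'",
            (workflow_id, device_id),
        )
    # The WHERE clause only matches an available device, so at most one
    # caller ever sees rowcount == 1 for a given device.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE devices (id TEXT PRIMARY KEY, status TEXT NOT NULL, workflow_id TEXT)"
)
conn.execute("INSERT INTO devices VALUES ('device-1', 'available', NULL)")
conn.commit()

first = book_device(conn, "device-1", "wf-a")
second = book_device(conn, "device-1", "wf-b")
print(first, second)  # True False
```

The availability check and the write happen in one statement, so there is no window between them for a second request to slip through.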
136190
137191---
138192
@@ -492,7 +546,7 @@ function WorkflowList({ workflows, onStart, onComplete, onPause, onResume }) {
492546- [ ] Workflows can be created
493547- [ ] Workflows fail to start (Bug 2 present)
494548- [ ] Device status doesn't update in UI (Bug 4 present)
495- - [ ] Race condition can be triggered with concurrent requests (Bug 3)
549+ - [ ] Race condition can be triggered: `./test-race-condition.sh` shows multiple bookings (Bug 3)
496550
497551### Post-Exercise Validation
498552
@@ -504,7 +558,7 @@ Run through this flow to verify all bugs are fixed:
5045584. Check device status in UI → should show "busy"
5055595. Complete the workflow → should succeed
5065606. Check device status → should show "available"
507- 7. Test concurrent bookings (optional, for Bug 3)
561+ 7. Test concurrent bookings: `./test-race-condition.sh` → should show only ONE successful booking
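
The contents of `test-race-condition.sh` are not shown here. As a rough Python equivalent of the concurrent-booking check in step 7, a sketch might look like the following. The endpoint path, the `workflow_id` field, and the 409 response come from the handler above. The helper names, base URL, device ID, and workflow IDs are assumptions.

```python
import json
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def post_booking(base_url, device_id, workflow_id):
    """POST one booking request and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/devices/{device_id}/book",
        data=json.dumps({"workflow_id": workflow_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 409 when the device is already booked


def concurrent_status_counts(send, attempts=5):
    """Call send(workflow_id) from `attempts` threads at once; count status codes."""
    workflow_ids = [f"wf-{i}" for i in range(attempts)]  # hypothetical IDs
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        return Counter(pool.map(send, workflow_ids))
```

Against a running device service this could be called as `concurrent_status_counts(lambda wf: post_booking("http://localhost:5001", "device-1", wf))`, where the URL and device ID are placeholders to adjust. After the fix, the result should contain exactly one success status and the rest 409.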
508562
509563---
510564
@@ -514,21 +568,24 @@ Run through this flow to verify all bugs are fixed:
514568
515569**Excellent Candidate**:
516570- Systematically checks logs and identifies issues
571+ - Runs the race condition test script proactively
517572- Fixes bugs in logical order (environment → API → race condition)
518573- Implements clean, well-tested pause/resume feature
519574- Asks clarifying questions about requirements
520575- Explains their thought process clearly
521- - Shows understanding of distributed systems
576+ - Shows understanding of distributed systems and concurrency
522577
523578**Good Candidate**:
524579- Finds and fixes most bugs
580+ - Uses test script when prompted
525581- Implements working pause/resume feature
526- - May need hints for race condition
582+ - May need hints for race condition fix approach
527583- Code works but could be cleaner
528584
529585**Needs Improvement**:
530586- Struggles to identify bugs without significant help
531587- Random debugging approach (trial and error)
588+ - Doesn't use provided tools (test script, logs)
532589- Incomplete feature implementation
533590- Doesn't test their changes
534591
@@ -538,13 +595,14 @@ Typical breakdown for a strong candidate:
538595- Bug 1 (env var): 5 minutes
539596- Bug 2 (wrong endpoint): 5-10 minutes
540597- Bug 4 (frontend cache): 10-15 minutes
541- - Bug 3 (race condition): 10-20 minutes
598+ - Bug 3 (race condition): 15-20 minutes (including running test script and implementing fix)
542599- Pause/Resume feature: 25-40 minutes
543600
544601If time is running short, you can:
545- - Skip Bug 3 (race condition) - it's the most complex
602+ - Skip Bug 3 (race condition) or show them the test script output
546603- Reduce scope of pause/resume (backend only)
547604- Provide hints more liberally
605+ - Accept simpler fix for Bug 3 (just remove the `time.sleep(0.1)`)
548606
549607### Discussion Questions
550608