Hi, thanks for sharing this ambitious and fascinating project.
I have been trying to set up a local environment using 4x ESP32-S3 boards to accurately detect the number of people, their positions, and estimate 17-keypoint joint poses. Despite trying various configurations and testing multiple methods, I haven't been successful. Because there is currently a lack of comprehensive documentation on how to achieve this specific multi-node setup, I decided to dive deep into the source code, architecture, and existing issues.
Based on my technical analysis, I’d like to share my findings and ask a few questions regarding the current state of the project.
- The "Real" Infrastructure vs. The "Mocked" AI Layer
From my code review, the project currently feels more like a Proof of Concept (PoC) scaffold than a fully functional pose-estimation system:
- The Real Part (Hardware & Data Pipeline): The infrastructure layer is genuine. The ESP32-S3 firmware correctly uses the ESP-IDF APIs to extract CSI (Channel State Information) subcarrier amplitude/phase data and transmits the binary frames via UDP (I include a rough sketch of how I parse these frames after this list). The Rust backend and WebSocket transport layers are also solidly implemented.
- The Missing Part (AI Inference): The core AI layer appears incomplete. While the network architecture for DensePoseHead is defined in the code, there are no pre-trained weights (.pth or .onnx files) anywhere in the repository.
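To make the data-pipeline point concrete, this is a minimal Python sketch of the kind of receiver I used while testing. The frame layout (a small header followed by interleaved int8 imaginary/real pairs per subcarrier) and the port number are purely my assumptions from reading the firmware, not a documented format:

```python
import socket
import numpy as np

UDP_PORT = 5005          # example port; the real port is set in the firmware config
HEADER_BYTES = 12        # assumed header (node id, sequence number, timestamp)

def parse_csi_frame(payload: bytes) -> tuple[np.ndarray, np.ndarray]:
    """Parse one UDP payload into per-subcarrier amplitude and phase.

    Assumes HEADER_BYTES of metadata followed by interleaved int8
    (imaginary, real) pairs, which is how ESP-IDF exposes raw CSI.
    """
    raw = np.frombuffer(payload[HEADER_BYTES:], dtype=np.int8)
    iq = raw.reshape(-1, 2).astype(np.float32)
    complex_csi = iq[:, 1] + 1j * iq[:, 0]      # (real + j*imag) per subcarrier
    return np.abs(complex_csi), np.angle(complex_csi)

def main() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", UDP_PORT))
    while True:
        payload, addr = sock.recvfrom(2048)
        amplitude, phase = parse_csi_frame(payload)
        print(f"{addr[0]}: {amplitude.shape[0]} subcarriers, "
              f"mean amplitude {amplitude.mean():.2f}")

if __name__ == "__main__":
    main()
```

With a script like this I can confirm that real CSI frames arrive from all four boards, which is why I believe the transport layer itself is sound.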
2. Hardware Limitations (ESP32 vs. Research NICs)
The concept of "DensePose From WiFi" is scientifically valid (e.g., CMU's pioneering research). However, bridging the gap to ESP32s seems highly challenging:
- CMU's research heavily relies on multi-antenna commercial NICs (such as the Intel 5300 or Atheros cards with 3x3 MIMO), which provide rich spatial resolution.
- ESP32s only have a 1x1 SISO antenna. While a 4-node mesh could theoretically compensate for this, achieving the microsecond/nanosecond-level clock synchronization required for accurate 3D spatial field reconstruction and triangulation across discrete ESP32s is notoriously difficult (see the back-of-the-envelope numbers after this list).
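To put rough numbers on the synchronization problem (my own back-of-the-envelope math, not anything from the repo): a clock offset between nodes maps directly to a path-length error at the speed of light, and at 2.4 GHz the carrier wavelength is only about 12.5 cm, so phase-coherent fusion needs sub-nanosecond alignment.

```python
# Back-of-envelope: how a clock offset between nodes maps to spatial error.
C = 3.0e8            # speed of light, m/s
FREQ = 2.4e9         # Wi-Fi carrier frequency, Hz

wavelength = C / FREQ                    # ~0.125 m at 2.4 GHz
for offset_s in (1e-6, 1e-7, 1e-9):     # 1 us, 100 ns, 1 ns
    range_error_m = C * offset_s
    print(f"clock offset {offset_s:.0e} s -> {range_error_m:8.2f} m path error "
          f"({range_error_m / wavelength:8.1f} wavelengths)")
# 1 us -> 300 m of ambiguity; even 1 ns is still ~2.4 wavelengths at 2.4 GHz.
```

This is why I am skeptical that ordinary NTP-style synchronization between ESP32s can substitute for the multi-antenna phase coherence that a single 3x3 NIC gets for free.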
3. Why the 17-keypoint pose estimation cannot be reproduced
I noticed the same behavior mentioned in Issue #506. The reason the skeletons sometimes animate even when hardware is disconnected (or fail to animate properly when connected) is due to:
- Simulation Fallback: The backend/UI relies heavily on a "simulation mode". When the system lacks real UDP data (or real inference outputs), it renders pre-generated skeletal animations to populate the dashboard.
- Hardcoded Logic: When real data is received, because no actual neural network weights are loaded, older code paths (such as pose_service.py) appear to fall back on simplistic if-else thresholds over signal norms (e.g., if feature_norm > 2.0, return "walking") instead of computing the 17 COCO keypoint coordinates via tensor operations (I sketch this contrast below).
- Multi-node Fusion: The logic to fuse tensors from the 4 distinct nodes into a single coherent pose prediction appears to exist mostly at the schema/structural level and lacks a trained implementation.
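To illustrate the gap I mean, here is a simplified Python sketch. The threshold branch mirrors what I believe pose_service.py is doing today; the model branch is only my assumption of what a real DensePoseHead-style inference path would have to look like (the weights filename, shapes, and function names are hypothetical):

```python
import numpy as np

POSE_WEIGHTS = "densepose_head.pth"   # hypothetical filename: no such file exists in the repo
NUM_KEYPOINTS = 17                    # COCO keypoint convention

def classify_by_threshold(csi_features: np.ndarray) -> str:
    """Roughly what the current heuristic path seems to do: threshold a signal norm."""
    feature_norm = float(np.linalg.norm(csi_features))
    return "walking" if feature_norm > 2.0 else "idle"

def infer_keypoints(csi_features: np.ndarray) -> np.ndarray:
    """What a real inference path would need: trained weights that regress a
    (NUM_KEYPOINTS, 2) tensor. This cannot run today, because no weights ship
    with the repository."""
    import torch
    model = torch.load(POSE_WEIGHTS)               # would fail: weights are missing
    model.eval()
    with torch.no_grad():
        x = torch.from_numpy(csi_features).float().unsqueeze(0)
        keypoints = model(x)                       # expected shape: (1, NUM_KEYPOINTS, 2)
    return keypoints.squeeze(0).numpy()
```

In other words, the first function can always return something plausible-looking, while the second one has no weights to load, which matches the behavior reported in Issue #506.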
My Questions
I really love the vision of RuView and the idea of edge-based CSI sensing. To help contributors and users align their expectations:
1. Is the 17-keypoint pose estimation using multiple ESP32s genuinely achievable with the current state of this codebase?
2. Are there plans to release the pre-trained weights for the neural network, or is the intention for users to collect their own ground-truth data and train from scratch? If the latter, could you provide a pipeline and documentation for training a model on the multi-ESP32 setup?
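For reference, the kind of training pipeline I have in mind for the "train from scratch" case looks roughly like the sketch below. This is entirely my own assumption (the model, shapes, and the idea of using camera-derived keypoints as ground truth are hypothetical), and it is exactly the part I would hope to see documented:

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical sketch: regress 17 COCO keypoints from fused CSI windows
# collected by 4 ESP32 nodes, using camera-derived keypoints as labels.
NUM_NODES, SUBCARRIERS, WINDOW = 4, 64, 100   # assumed dimensions
NUM_KEYPOINTS = 17

class CsiPoseRegressor(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        flat = NUM_NODES * SUBCARRIERS * WINDOW
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, NUM_KEYPOINTS * 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, NUM_KEYPOINTS, 2)

def train(csi_windows: np.ndarray, keypoints: np.ndarray) -> CsiPoseRegressor:
    """csi_windows: (N, 4, 64, 100) amplitude windows; keypoints: (N, 17, 2)
    labels produced by a camera pose estimator recorded alongside the CSI."""
    ds = TensorDataset(torch.from_numpy(csi_windows).float(),
                       torch.from_numpy(keypoints).float())
    loader = DataLoader(ds, batch_size=32, shuffle=True)
    model = CsiPoseRegressor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(10):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

If the project already envisions a different labeling or fusion strategy, even a short note on that would help contributors collect useful data.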
Thank you for your time and for open-sourcing this project! Looking forward to your insights.