2026.05.27 - #71 TacO, VITRA

## TacO: Tactile Sensors for Object Manipulation

<img width="641" height="179" alt="Image" src="https://github.com/user-attachments/assets/c37f8a01-e0a6-44ce-ad80-0b98c4fd24bd" />

Link: [arXiv](https://arxiv.org/abs/2605.21976)

- A real-world benchmark that compares six tactile sensors within the same robot manipulation pipeline
- Motivation: there are many kinds of tactile sensors, but it is hard to compare which sensor actually helps on which manipulation task
- Sensors compared
	- [FSR](https://www.interlinkelectronics.com/fsr-400): inexpensive single force sensor; its resistance changes with the applied force, giving a single normal-force value
	- [FlexiTac](https://arxiv.org/pdf/2604.28156): resistive taxel array; distributed normal force over a 12×32 taxel grid
	- [eGain](https://publications.ri.cmu.edu/storage/publications/pub_files/2012/8/Park_YL_IEEE_Sensors_2012.pdf): liquid-metal resistive sensor; measures the resistance change of EGaIn inside elastomer microchannels
	- Contact Mic: piezo contact microphone that listens to contact vibration/sound; picks up the high-frequency vibration/sound produced by contact, slip, and impact
	- [Daimon](https://www.dmrobot.com/en/product/p1/dm-tac-w.html): camera-based visual tactile sensor (an internal camera images the soft membrane deformation to estimate force/deformation/shear) [Digit360](https://ai.meta.com/blog/fair-robotics-open-source/)
	- [eFlesh](https://e-flesh.com/): magnetic tactile sensor (Hall sensors read the magnet displacement to estimate normal/shear force)

<img width="996" height="400" alt="Image" src="https://github.com/user-attachments/assets/4f2a4f6f-19e7-4e5e-acde-bef5621c4555" />
 
- Method:
	- RGB camera images, proprioception, and tactile observations are fed into an ACT-based policy, and a vision-only policy and a visuotactile policy are compared on the same data.
        - FSR: scalar -> linear projection
        - FlexiTac: taxel array -> MLP
        - eGain: resistive values -> MLP
        - eFlesh: magnetic/force array -> MLP
        - Daimon: tactile image -> ResNet18
        - Contact Mic: waveform -> mel-spectrogram -> MLP
	- The loss is $\mathcal{L}=\sum_{\tau=0}^{H-1}\lVert\hat{a}_{t+\tau}-a_{t+\tau}\rVert_1+\lambda_{KL}D_{KL}(q(z|a)\Vert p(z))$, where $\hat{a}$ is the predicted action chunk, $a$ is the demo action, $H=64$ is the chunk horizon, $z$ is the CVAE latent, and $p(z)=\mathcal{N}(0,I)$.
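
The loss above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' training code: it assumes the posterior $q(z|a)$ is a diagonal Gaussian parameterized by `mu` and `logvar` (the usual CVAE choice, not stated in these notes), and the values of `kl_weight` and `action_dim` are hypothetical.

```python
import math

def act_loss(pred_chunk, demo_chunk, mu, logvar, kl_weight):
    """L1 error summed over the action chunk plus weighted KL(q(z|a) || N(0, I))."""
    # Sum of absolute errors over all H steps and all action dimensions.
    l1 = sum(
        abs(p - a)
        for pred_step, demo_step in zip(pred_chunk, demo_chunk)
        for p, a in zip(pred_step, demo_step)
    )
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    # Assumption: q(z|a) is a diagonal Gaussian.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))
    return l1 + kl_weight * kl

H = 64          # chunk horizon from the notes
action_dim = 7  # hypothetical action dimension, for illustration only
pred = [[0.1] * action_dim for _ in range(H)]
demo = [[0.0] * action_dim for _ in range(H)]

# With mu = 0 and logvar = 0 the posterior equals the prior, so the KL term vanishes.
loss = act_loss(pred, demo, mu=[0.0, 0.0], logvar=[0.0, 0.0], kl_weight=10.0)
print(loss)
```

In this toy call only the L1 term contributes, which makes it easy to check the reconstruction part in isolation before adding a non-trivial posterior.
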
<img width="793" height="290" alt="Image" src="https://github.com/user-attachments/assets/0ea5df7c-d64b-4d8d-8a7f-e22495907a40" />

<img width="1568" height="456" alt="Image" src="https://github.com/user-attachments/assets/024b25e9-4119-42d7-9a4c-4f19db5c87d8" />



	
- Data collection
  	- Teleoperated demonstrations on a Franka Panda robot
  		- wrist camera image
  		- third-person camera image
  		- robot proprioception
  		- tactile sensor reading
  		- robot action
  	- Two policies are trained separately on the same demonstration data
  		- vision-only: tactile input removed
  		- vision + tactile: tactile input included
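
A minimal sketch of how one recorded frame could feed both policies, so that the only difference between the two training sets is the tactile input. The dictionary keys and the `make_observation` helper are hypothetical names chosen for illustration, not taken from the paper's code.

```python
TACTILE_KEY = "tactile"  # hypothetical key name

def make_observation(frame, use_tactile):
    """Build the policy input from one demonstration frame.

    The action is the training target, so it is never part of the observation.
    """
    obs = {k: v for k, v in frame.items() if k not in ("action", TACTILE_KEY)}
    if use_tactile:
        obs[TACTILE_KEY] = frame[TACTILE_KEY]
    return obs

# One made-up frame with the five recorded streams listed above.
frame = {
    "wrist_rgb": "wrist_000.png",
    "third_person_rgb": "scene_000.png",
    "proprio": [0.0] * 7,
    "tactile": [0.42],
    "action": [0.0] * 7,
}

vision_only = make_observation(frame, use_tactile=False)
visuotactile = make_observation(frame, use_tactile=True)
print(sorted(vision_only), sorted(visuotactile))
```
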
	
	<img width="1854" height="766" alt="Image" src="https://github.com/user-attachments/assets/7830e301-1f06-4d36-ab56-62ad3c8083b3" />
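- Experiments
  - Reorientation task
      - As in Figure 3(a), the gripper holds the object and changes its orientation while the object stays in contact with the table surface
      - Success means finishing the reorientation without lifting the object or sliding it off the table
      - The key is not to eliminate slip entirely but to regulate contact and force so that the desired rotation happens
  - Pick-and-place with unknown mass
      - As in Figure 3(b), visually identical cans are picked up and moved to a fixed target position
      - Half of the cans are empty and half are filled with beads to make them heavy
      - The initial position is randomized and the target position is fixed
      - The mass difference cannot be seen from RGB alone, so this task tests whether tactile input helps regulate grasp force and gripper width.
  - Insertion task
      - As in Figure 3(c), a 3D-printed plug is inserted into a socket
      - The socket position is randomized and the plug starts from a fixed position. The prongs are occluded at the moment of insertion, so the contact geometry is hard to see from vision alone
      - Success is full insertion of the plug; cases where the prongs go in only halfway are recorded as partial success
  - Repeatability test
      - Figure 3(d) is not a policy task but a setup for measuring sensor repeatability
      - A Dynamixel motor and a 3D-printed rack-and-pinion indenter press the sensor repeatedly to see how consistent the sensor readings are from episode to episode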
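
One way to put a number on "how consistent the readings are" is the coefficient of variation of the peak reading across repeated presses. The notes do not say which statistic the paper uses, so the metric, the `repeatability_cv` name, and the sample values below are all assumptions.

```python
import statistics

def repeatability_cv(peak_readings):
    """Coefficient of variation (sample std / |mean|) of per-press peak readings.

    Lower values mean the sensor responds more consistently to the same press.
    """
    mean = statistics.fmean(peak_readings)
    if mean == 0:
        raise ValueError("mean reading is zero; CV is undefined")
    return statistics.stdev(peak_readings) / abs(mean)

# Made-up peak readings from five identical indentations.
peaks = [1.00, 1.02, 0.98, 1.01, 0.99]
cv = repeatability_cv(peaks)
print(f"CV = {cv:.4f}")
```
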
  

- Results
  - Tactile input helped in most cases, but not every sensor was good on every task
  - The effect is large on tasks such as plug insertion where the contact state is occluded (vision-only -> vision + tactile)
      - Contact Mic: 0.20 -> 0.70
      - eFlesh: 0.30 -> 0.70
  - In pick-and-place, tactile feedback mostly raised performance, helping in particular when handling heavy objects
      - FlexiTac: 0.75 -> 0.85
      - eGain: 0.50 -> 0.75
      - Contact Mic: 0.65 -> 0.90
      - eFlesh: 0.85 -> 0.90
  - In reorientation, tactile input helped because the task requires force regulation
  - A high-resolution sensor such as Daimon is not always the best; the inexpensive Contact Mic or eFlesh was effective enough depending on the task
- Significance and contribution: TacO compares tactile sensors by policy success rather than by hardware specs. Tactile signals compensate for what vision-only misses: mass, occluded insertion, and continuous force regulation
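
The per-sensor gains follow directly from the reported pairs. The snippet below only redoes that arithmetic on the numbers quoted in these notes (vision-only, then vision + tactile); the `results` and `gains` names are just for illustration.

```python
# (vision-only, vision + tactile) pairs as quoted in the notes above.
results = {
    "pick-and-place": {
        "FlexiTac": (0.75, 0.85),
        "eGain": (0.50, 0.75),
        "Contact Mic": (0.65, 0.90),
        "eFlesh": (0.85, 0.90),
    },
    "plug insertion": {
        "Contact Mic": (0.20, 0.70),
        "eFlesh": (0.30, 0.70),
    },
}

# Absolute improvement from adding tactile input, rounded to two decimals.
gains = {
    task: {sensor: round(tactile - vision, 2) for sensor, (vision, tactile) in sensors.items()}
    for task, sensors in results.items()
}

for task, per_sensor in gains.items():
    for sensor, gain in per_sensor.items():
        print(f"{task:15s} {sensor:12s} +{gain:.2f}")
```

The largest jumps are on plug insertion, which matches the observation that tactile input matters most when the contact state is occluded.
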
<img width="1836" height="326" alt="Image" src="https://github.com/user-attachments/assets/a6aced46-65bc-4ab8-8e34-2c8c89970840" />

<img width="1884" height="380" alt="Image" src="https://github.com/user-attachments/assets/eef21251-80e8-4c8d-bd3b-9445c2a6d233" />

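- Notes
	The vision-only policy does not use the tactile readings, but **the sensor hardware is still attached to the gripper fingertips.** So the following differ from sensor to sensor:
	- fingertip surface friction, compliance, thickness / shape, contact area with the object, and how the gripper actually makes contact
	-> analyzed separately under "Sensor Material and Form Factor"
	- FSR / FlexiTac: low-friction, slippery surface
	- eFlesh / Daimon / Contact Mic: more compliant, high-friction surface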

<img width="1894" height="802" alt="Image" src="https://github.com/user-attachments/assets/93019d45-e544-43a0-b360-45c79a7489f9" />

- Comparison of the vision-only policies alone

<img width="1862" height="298" alt="Image" src="https://github.com/user-attachments/assets/1a24bc1c-ae81-491a-ae5f-84f57d94650c" />

- For pick-and-place / insertion, a high-friction fingertip is an advantage in itself.
- Reorientation needs controlled slipping, so low friction is actually an advantage.
- The absolute success-rate gaps between sensors therefore mix the effect of the tactile modality with embodiment/material effects.


## VITRA: Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

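Links: [arXiv](https://arxiv.org/abs/2510.21571), [Project](https://microsoft.github.io/VITRA/), [GitHub](https://github.com/microsoft/VITRA/), [HF model](https://huggingface.co/microsoft/VITRA-VLA-3B)
- Goal: turn unscripted real-life egocentric human videos into pretraining data for robot VLAs in order to improve dexterous manipulation performance
- Motivation: robot VLA data is expensive to collect and tied to lab environments, so its coverage is narrow. Human activity videos, by contrast, are highly diverse in objects, skills, and environments. The problem is that these videos are not segmented into action units, have no descriptions, and have no robot action labels
- Method: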

<img width="1698" height="762" alt="Image" src="https://github.com/user-attachments/assets/473b5e38-72ea-4948-aa51-ea74eb8563ca" />

1. 3D motion labeling 
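	- Background optical flow is used to decide whether the camera is static or moving. After the camera intrinsics and distortion are calibrated, the 3D hand pose of both hands and the camera pose are recovered for every frame. The resulting camera-frame hand trajectories are converted to world-frame hand trajectories, which are then used for atomic action segmentation and instruction generation.
		- Camera intrinsics:
		  - Moving camera: intrinsics estimated with DroidCalib
		  - Static camera: intrinsics/distortion estimated first with DeepCalib; if the distortion is small, the pinhole focal length is refined with MoGe-2
		  - Videos with large distortion are undistorted and then fit to a pinhole camera model
		- Per-frame camera-space 3D hand reconstruction: [HaWoR](https://github.com/ThunderVVV/HaWoR)
		- Moving camera trajectory:
		  - MegaSAM estimates metric-scale camera poses (the depth prior used inside MegaSAM is the MoGe-2 output instead of DepthAnything/UniDepth)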
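
The camera-frame to world-frame conversion in step 1 is a rigid transform per frame. A minimal sketch, assuming the estimated camera pose is available as a camera-to-world rotation `R_wc` and translation `t_wc` (the actual pose convention of the pipeline is not given in these notes, and the function name is hypothetical):

```python
def cam_to_world(points_cam, R_wc, t_wc):
    """Apply p_world = R_wc @ p_cam + t_wc to every 3D point of one frame."""
    return [
        [sum(R_wc[i][j] * p[j] for j in range(3)) + t_wc[i] for i in range(3)]
        for p in points_cam
    ]

# Made-up pose: camera rotated 90 degrees about the world z-axis, shifted 1 unit along x.
R_wc = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
t_wc = [1.0, 0.0, 0.0]

wrist_cam = [[1.0, 0.0, 0.0]]  # one wrist position in camera coordinates
wrist_world = cam_to_world(wrist_cam, R_wc, t_wc)
print(wrist_world)
```

Applying this with each frame's own pose removes the camera's ego-motion, which is what makes wrist speed meaningful for the segmentation step.
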
<img width="1995" height="776" alt="Image" src="https://github.com/user-attachments/assets/d1be290a-a6aa-46c3-a054-c60fd1eca538" />

2. atomic action segmentation 
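	- Local minima of the world-frame wrist speed are used as cut points. This relies on the observation that hand speed briefly drops when a person moves from one action to the next. The left and right hands are segmented independently, which yields short atomic-action clips defined per hand
3. instruction labeling
	- Eight frames are sampled uniformly from each action clip, and the palm trajectory from the current frame to the end of the clip is overlaid on the image. GPT-4.1 looks at this set of images and trajectories and attaches an imperative language label such as “Right hand: pick up ...”. Clips without a meaningful action are marked N/A
4. An episode consists of a language instruction, video frames, and per-frame 3D action chunks
	- The policy has the form $\pi:(l,o_t,s_t)\rightarrow(a_t,a_{t+1},...,a_{t+N})$: it takes the language instruction $l$, the visual observation $o_t$, and the state $s_t$, and predicts the upcoming actions as a chunk. Human hand actions are represented as $a_t=[\Delta t^l,\Delta r^l,\theta_h^l,\Delta t^r,\Delta r^r,\theta_h^r]\in\mathbb{R}^{102}$, where $\Delta t$ is the change in wrist position, $\Delta r$ is the change in wrist rotation, $\theta_h$ is the MANO hand joint angles, and $l/r$ denote the left and right hands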
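
The speed-minimum segmentation in step 2 can be sketched as follows. This is a simplified illustration rather than the released pipeline: it applies no smoothing or minimum-length filtering (a real implementation likely needs both), and the function names and speed values are made up. It would be run once per hand.

```python
import math

def wrist_speed(positions, fps):
    """Speed between consecutive world-frame wrist positions (distance per second)."""
    return [math.dist(a, b) * fps for a, b in zip(positions, positions[1:])]

def cut_points(speed):
    """Indices where the speed is a local minimum."""
    return [
        i
        for i in range(1, len(speed) - 1)
        if speed[i] < speed[i - 1] and speed[i] <= speed[i + 1]
    ]

def split_clips(num_samples, cuts):
    """Turn cut indices into half-open (start, end) clip ranges."""
    bounds = [0] + list(cuts) + [num_samples]
    return list(zip(bounds, bounds[1:]))

# Made-up speed profile for the right hand: two slow-downs between three motions.
right_speed = [0.1, 0.5, 0.9, 0.4, 0.05, 0.3, 0.8, 0.6, 0.1, 0.4]
right_cuts = cut_points(right_speed)
right_clips = split_clips(len(right_speed), right_cuts)
print(right_cuts, right_clips)
```

Because each hand gets its own speed profile and its own cuts, the two hands can end up with different clip boundaries over the same video.
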

<img width="1692" height="774" alt="Image" src="https://github.com/user-attachments/assets/9b9989f9-ff1e-4351-84e6-8282a6bc1300" />
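
The 102-dimensional hand action from step 4 can be laid out as two 51-dimensional halves. The per-component sizes below are an assumption that is consistent with the stated total: 3 values for $\Delta t$, 3 for $\Delta r$, and 15 MANO joints × 3 = 45 for $\theta_h$, per hand. The helper names are hypothetical.

```python
# Assumed per-hand layout: 3 + 3 + 45 = 51, so two hands give 102.
HAND_DIMS = (("delta_t", 3), ("delta_r", 3), ("theta_h", 45))
PER_HAND = sum(size for _, size in HAND_DIMS)

def pack_action(left, right):
    """Concatenate [left hand, right hand] in the order used by a_t."""
    action = []
    for hand in (left, right):
        for name, size in HAND_DIMS:
            if len(hand[name]) != size:
                raise ValueError(f"{name} must have {size} values")
            action.extend(hand[name])
    return action

def unpack_action(action):
    """Split a flat action vector back into named per-hand components."""
    hands, offset = {}, 0
    for side in ("left", "right"):
        hands[side] = {}
        for name, size in HAND_DIMS:
            hands[side][name] = action[offset:offset + size]
            offset += size
    return hands

left = {"delta_t": [0.01, 0.0, 0.0], "delta_r": [0.0, 0.0, 0.0], "theta_h": [0.0] * 45}
right = {"delta_t": [0.0, 0.02, 0.0], "delta_r": [0.0, 0.0, 0.0], "theta_h": [0.0] * 45}

action = pack_action(left, right)
print(len(action))
```
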

- Experimental results:

<img width="1666" height="584" alt="Image" src="https://github.com/user-attachments/assets/12ad6e17-440c-4e49-bb54-e8ff5f5147f2" />

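  - Hand action prediction is evaluated in two ways
  - Grasping: measures how close the predicted finger trajectory gets to the RGB-D point cloud of the target object. Lower is better: it means the prediction approaches the target object plausibly.
	- VITRA reaches a mean/median hand-object distance of 8.8/6.2 cm, versus 19.1/18.4 cm for [Being-H0](https://research.beingbeyond.com/being-h0), so VITRA gets closer to the target object.
	- This number is neither an error against a ground-truth hand trajectory nor a grasp success rate. It is a plausibility metric for how close the predicted finger trajectory gets to the target object point cloud.
	- The target object point cloud is built from a human-specified target location, a SAM-2 mask, depth, and the camera intrinsics, so it can be affected by object size, visible surface, and mask/depth quality.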
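
A sketch of a hand-object distance in the spirit of the Grasping metric. The notes do not give the exact definition, so this assumes the distance is the minimum over all frames, fingertips, and object points; the function name and coordinates are made up.

```python
import math

def min_hand_object_distance(finger_trajectory, object_points):
    """Smallest distance between any predicted fingertip position and the object cloud.

    finger_trajectory: list of frames, each a list of 3D fingertip positions.
    object_points: list of 3D points from the target object's point cloud.
    """
    return min(
        math.dist(tip, point)
        for frame in finger_trajectory
        for tip in frame
        for point in object_points
    )

# Made-up example in metres: one fingertip approaching an object over two frames.
trajectory = [
    [(0.0, 0.0, 0.3)],
    [(0.0, 0.0, 0.1)],
]
object_cloud = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0)]

distance = min_hand_object_distance(trajectory, object_cloud)
print(f"{distance * 100:.1f} cm")
```

Under this definition, a partial or noisy point cloud changes the result directly, which is one reason the metric is sensitive to mask and depth quality.
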
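  - General action: hard to evaluate with a single hand-object distance, so it is evaluated with a user study.
	- 23 participants watched anonymized videos of the hand motions predicted by several models in unseen scenes and picked their top 3.
	- Ranks 1/2/3 receive 3/2/1 points, and everything else receives 0.
	- VITRA averages 1.91 points, higher than the human annotation baseline at 0.96 and Being-H0 at 0.15.
	- This score is not a robot execution success rate. It is a relative preference for how natural and task-aligned the predicted hand motion looked given the scene and the instruction.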
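
The 3/2/1 scoring rule can be written out as below. The ballots and the model names `A` to `D` are invented for illustration, and averaging over ballots is an assumption about how the reported means are formed.

```python
POINTS = {1: 3, 2: 2, 3: 1}  # rank -> points; unranked models get 0

def mean_scores(ballots, models):
    """Average points per model over all top-3 ballots."""
    totals = {model: 0 for model in models}
    for top3 in ballots:
        for rank, model in enumerate(top3[:3], start=1):
            totals[model] += POINTS[rank]
    return {model: totals[model] / len(ballots) for model in models}

models = ["A", "B", "C", "D"]
ballots = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "D"],
]

scores = mean_scores(ballots, models)
print(scores)
```

With this rule a model's mean lies between 0 and 3, so a reported 1.91 means the model was ranked near the top on most ballots.
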

<img width="1706" height="1008" alt="Image" src="https://github.com/user-attachments/assets/694592cb-a713-4c1b-a915-33ca1c5a048b" />
