## 📋 Contents

1. [About](#topic1)
2. [Getting Started](#topic2)
3. [MMScan API Tutorial](#topic3)
4. [MMScan Benchmark](#topic4)
5. [TODO List](#topic5)

## 🏠 About
<span id='topic1' />

<!-- -->
Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding models and LLMs and obtain remarkable performance improvements both on existing benchmarks and in-the-wild evaluation.

## 🚀 Getting Started
<span id='topic2' />

### Installation

Please refer to the [guide](data_preparation/README.md).

## 👓 MMScan API Tutorial
<span id='topic3' />

The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation across its tasks.

Each dataset item is a dictionary containing the following key elements:

(1) 3D Modality

- **"ori_pcds"** (tuple\[tensor\]): Original point cloud data extracted from the `.pth` file.
- **"pcds"** (np.ndarray): Point cloud data with dimensions \[n_points, 6 (xyz+rgb)\], representing the coordinates and color of each point.
- **"instance_labels"** (np.ndarray): Instance ID assigned to each point in the point cloud.
- **"class_labels"** (np.ndarray): Class ID assigned to each point in the point cloud.
- **"bboxes"** (dict): Information about the bounding boxes within the scan.

(2) Language Modality

- **"sub_class"**: The category of the sample.
- **"ID"**: A unique identifier for the sample.
- **"scan_id"**: Identifier of the corresponding scan.

*For the Visual Grounding Task*

- **"target_id"** (list\[int\]): IDs of the target objects.
- **"text"** (str): The text used for grounding.
- **"target"** (list\[str\]): Types of the target objects.
- **"anchors"** (list\[str\]): Types of the anchor objects.
- **"anchor_ids"** (list\[int\]): IDs of the anchor objects.
- **"tokens_positive"** (dict): Indices of the positions where the mentioned objects appear in the text.

*For the Question Answering Task*

- **"question"** (str): The text of the question.
- **"answers"** (list\[str\]): List of possible answers.
- **"object_ids"** (list\[int\]): IDs of the objects referenced in the question.
- **"object_names"** (list\[str\]): Types of the referenced objects.
- **"input_bboxes_id"** (list\[int\]): IDs of the input bounding boxes.
- **"input_bboxes"** (list\[np.ndarray\]): Input bounding boxes with 9 degrees of freedom.

(3) 2D Modality

- **'img_path'** (str): File path to the RGB image.
- **'depth_img_path'** (str): File path to the depth image.
- **'intrinsic'** (np.ndarray): Intrinsic parameters of the camera for RGB images.
- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for depth images.
- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera.
- **'visible_instance_id'** (list): IDs of the objects visible in the image.

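To make the item layout concrete, the sketch below builds a toy sample containing a few of the fields above and unpacks them. All values and shapes, as well as the assumed 9-DoF layout (3 center + 3 size + 3 rotation parameters), are illustrative only, not taken from the real dataset.

```python
import numpy as np

# Toy dataset item mirroring the documented keys (all values are made up).
n_points = 4
sample = {
    # 3D modality: [n_points, 6] = xyz coordinates + rgb colors per point
    "pcds": np.random.rand(n_points, 6),
    "instance_labels": np.array([0, 0, 1, 1]),
    "class_labels": np.array([3, 3, 7, 7]),
    # Language modality (visual grounding fields; IDs are hypothetical)
    "ID": "sample_0001",
    "scan_id": "scene_0000",
    "target_id": [1],
    "target": ["chair"],
    "text": "the chair next to the table",
    # Assumed 9-DoF layout: center (3) + size (3) + rotation parameters (3)
    "input_bboxes": [np.array([1.0, 2.0, 0.5, 0.8, 0.8, 1.2, 0.0, 0.0, 1.57])],
}

xyz, rgb = sample["pcds"][:, :3], sample["pcds"][:, 3:]  # split coordinates / colors
box = sample["input_bboxes"][0]
center, size, angles = box[:3], box[3:6], box[6:9]       # unpack the 9 DoF

print(xyz.shape, rgb.shape)  # (4, 3) (4, 3)
```

In practice these fields are produced by the MMScan dataset class rather than constructed by hand.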
### MMScan Evaluator

For the visual grounding task, our evaluator computes multiple metrics, including:

- **AP and AR**: These metrics compute precision and recall by treating each sample as an individual category.
- **AP_C and AR_C**: These variants group samples belonging to the same subclass and compute the metrics over each group.
- **gTop-k**: An expanded metric that generalizes the traditional Top-k metric, offering insights into broader performance aspects.

*Note:* Here, AP corresponds to AP<sub>sample</sub> in the paper, and AP_C corresponds to AP<sub>box</sub>.

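For intuition about the Top-k family that gTop-k generalizes, here is a minimal sketch that scores a sample as correct when any of its top-k predicted boxes reaches IoU ≥ 0.25 with a ground-truth box. The axis-aligned corner-format boxes and the function names are simplifying assumptions; the actual evaluator works with 9-DoF boxes and the generalized gTop-k criterion described in the paper.

```python
import numpy as np

def iou_aabb(a, b):
    """IoU of two axis-aligned 3D boxes [x_min, y_min, z_min, x_max, y_max, z_max]."""
    lo = np.maximum(a[:3], b[:3])              # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])              # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def topk_correct(pred_boxes, gt_boxes, k, thr=0.25):
    """A sample counts as correct if any top-k prediction matches some GT box."""
    return any(iou_aabb(p, g) >= thr for p in pred_boxes[:k] for g in gt_boxes)

gt = [np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])]
preds = [
    np.array([2.0, 2.0, 2.0, 3.0, 3.0, 3.0]),  # rank 1: no overlap with GT
    np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0]),  # rank 2: large overlap with GT
]
print(topk_correct(preds, gt, k=1))  # False
print(topk_correct(preds, gt, k=3))  # True
```

Averaging this per-sample decision over the dataset yields a Top-k accuracy; gTop-k extends the criterion to samples with multiple target objects.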
Below is an example of how to utilize the Visual Grounding Evaluator:

188196
@@ -301,11 +309,38 @@ The input structure remains the same as for the question answering evaluator:
301309]
302310```
303311
## 🏆 MMScan Benchmark
<span id='topic4' />

### MMScan Visual Grounding Benchmark

| Methods | gTop-1 | gTop-3 | AP<sub>sample</sub> | AP<sub>box</sub> | AR | Release | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |
| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | ~ | ~ |
| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | ~ | ~ |
| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | ~ | ~ |
| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |
| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | ~ | ~ |
| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | ~ | ~ |

### MMScan Question Answering Benchmark

| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR | Advanced | Release | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
| LEO | 54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |
| LLaVA-3D | **61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5 | ~ | ~ |

*Note:* These two tables only report the main metrics; see the paper for complete results.

We have released the code of some models under [./models](./models/README.md).

## 📝 TODO List
<span id='topic5' />

- [ ] MMScan annotation and samples for ARKitScenes.
- [ ] Online evaluation platform for the MMScan benchmark.
- [ ] Code for more MMScan Visual Grounding and Question Answering baselines.
- [ ] Full release and further updates.