<!DOCTYPE html>
<html>

<head>
	<meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="X-UA-Compatible" content="ie=edge" />
    <title>RoScenes</title>
	<link rel="icon" href="https://gw.alicdn.com/imgextra/i3/O1CN01n9wUIv26XN415nWwN_!!6000000007671-0-tps-703-704.jpg">
    <!-- <link href="https://fonts.googleapis.com/css?family=Open+Sans:400" rel="stylesheet" />     -->
	<link href="css/templatemo-style.css" rel="stylesheet" />
	<link href="css/new-style.css" rel="stylesheet" />
	<link href="css/two_style.css" rel="stylesheet" />

</head>

<body>

	<div class="container">
		<div class="placeholder">
			<div class="parallax-window" data-parallax="scroll" data-image-src="assets/hero.jpg">
				<div class="tm-header">
					<div class="row tm-header-inner">
						<!-- <div class="col-md-6 col-12" style="display: inline-flex;">
							<img src="https://gw.alicdn.com/imgextra/i3/O1CN01Dko8nq1a3XZ41bOe8_!!6000000003274-2-tps-60-60.png" alt="Logo" class="tm-site-logo" />
							<div class="tm-site-text-box">
								<h1 class="tm-site-title" style="font-family: system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;"><b>RoScenes</b></h1>
								<h6 class="tm-site-description"> A large-scale multi-view 3D dataset for roadside perception.</h6>
							</div>
						</div> -->
					</div>
				</div>
			</div>
		</div>

		<main class="font-serif" style="margin-top: -3em;">
			<header class="row tm-welcome-section">
				<h2 class="col-12 text-center tm-section-title" style="font-size:3em !important; font-family: Times, 'Times New Roman', serif; font-variant: small-caps;"><b>RoScenes</b><br/></h2>
				<h6 class="col-12 text-center tm-site-description" style="font-size:1.8em !important; font-family: Times, 'Times New Roman', serif;">A Large-scale Multi-view 3D Dataset for Roadside Perception<br/></h6>
				<p class="col-12 text-center author-section font-serif" style="font-size:1em;">
					<a class="author-section-a" href="#">Xiaosu Zhu<sup>1*</sup></a>,
					<a class="author-section-a" href="#">Hualian Sheng<sup>1*</sup></a>,
					<a class="author-section-a" href="#">Sijia Cai<sup>1†</sup></a>,
					<a class="author-section-a" href="#">Bing Deng<sup>1</sup></a>,
					<a class="author-section-a" href="#">Shaopeng Yang<sup>1</sup></a>,
					<br/>
					<a class="author-section-a" href="#">Qiao Liang<sup>1</sup></a>,
					<a class="author-section-a" href="#">Ken Chen<sup>2</sup></a>,
					<a class="author-section-a" href="#">Lianli Gao<sup>3</sup></a>,
					<a class="author-section-a" href="#">Jingkuan Song<sup>4‡</sup></a>,
					<a class="author-section-a" href="#">Jieping Ye<sup>1‡</sup></a>
				</p>
				<br/>
				<br/>
				<br/>

				<p class="col-12 text-center affiliation-section font-serif" style="font-size:1em;">
					<a class="author-section-a" style="color: black !important;"><sup>1</sup>Alibaba Cloud</a>,
					<a class="author-section-a" style="color: black !important;"><sup>2</sup>Sichuan Digital Transportation Technology Co., Ltd</a>,
					<br/>
					<a class="author-section-a" style="color: black !important;"><sup>3</sup>Independent Researcher</a>,
					<a class="author-section-a" style="color: black !important;"><sup>4</sup>Tongji University</a>
				</p>

				<p class="col-12 text-center affiliation-section font-serif" style="font-size:1em;">
					<a class="author-section-a" style="color: black !important;"><sup>*</sup>Equal contribution</a>,
					<a class="author-section-a" style="color: black !important;"><sup>†</sup>Project lead</a>,
					<a class="author-section-a" style="color: black !important;"><sup>‡</sup>Corresponding authors</a>
				</p>


				<p class="col-12 abstract-section font-serif" style="word-break:break-all;word-wrap:break-word;font-family: Times, 'Times New Roman', serif; font-size:1em; text-indent: 2em;">
					We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset reaches 21.13M 3D annotations within 64,000 m<sup>2</sup>. To relieve the expensive costs of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. We then conduct a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. The tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin on the validation set without extra computational overhead.
				</p>


				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
				<br/>
			<div class="col-12 affiliation-section font-serif" style="font-size:1em;color: gray;">
				<b>ECCV 2024. We sincerely thank
				<br/>
				Sichuan Expressway Construction & Development Group Co., Ltd.
				<br/>
				Western Sichuan Expressway Co., Ltd.
				<br/>
				Sichuan Intelligent Expressway Technology Co., Ltd.
				<br/>
				for their invaluable assistance with data acquisition.
				</b>
			</div>

			</header>

			<div class="tm-paging-links" style="margin-bottom: 1rem; font-weight:bold;">
				<nav>
					<ul>
						<li class="tm-paging-item"><a class="sw_style_link active" href="https://arxiv.org/abs/2405.09883" target="_blank">Paper</a></li>
						<li class="tm-paging-item"><a class="sw_style_link active" href="https://github.com/roscenes/RoScenes" target="_blank">DevKit</a></li>
						<li class="tm-paging-item"><a class="sw_style_link active" href="https://modelscope.cn/datasets/Apsara_Lab_Multimodal_Intelligence/RoScenes-release">Download</a></li>
					</ul>
				</nav>
			</div>



			<div class="gray_div font-serif rw-r-container">
				<div class="text-2xl lg:text-4xl leading-none pb-2 text-center">
					Overview
				</div>
				<div class="video-center">

					<img src="assets/summary.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 65%;" />
					 <div class="text-center">Comparison between vehicle-side (V) and infrastructure-side (I) 3D datasets. “Cam” is the number of synchronized cameras adopted per scene.</div>
				</div>

				<br/>
				<br/>
				<br/>

				<div class="video-center">

					<img src="assets/overview.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 80%;" />
					 <div class="text-center">Demonstration of our RoScenes dataset. The annotated truck is difficult to recognize in A, B, C, E, F, G, but is clear in D.</div>

				</div>
				<div class="block_bottom">
				</div>
			</div>

			<div class=" font-serif rw-r-container">
				<div class="text-2xl lg:text-4xl leading-none pb-2 text-center">
					Characteristics
				</div>
				<!-- <div class="text-base lg:text-lg pb-5 text-center ">
					You can generate videos flexibly in any style that you can imagine.
				</div> -->


				<div class="text-xl lg:text-2xl leading-none pb-2 text-center">
					I: Large Perception Range
				</div>
				<div class="row video-center">
					<img src="assets/large-perception.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 60%;" />
				</div>
				<div class="text-center">RoScenes has a ~6× larger perception range than other public datasets.</div>

				<br/>
				<div class="text-xl lg:text-2xl leading-none pb-2 text-center">
					II: Full Scene Coverage
				</div>
				<div class="row video-center">
					<img src="assets/conditions.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 50%;" />
				</div>
				<div class="text-center">RoScenes covers a wide variety of roadside cameras and conditions.</div>

				<br/>
				<div class="text-xl lg:text-2xl leading-none pb-2 text-center">
					III: Crowded Scenes
				</div>
				<div class="row video-center">
					<img src="assets/stat.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 70%;" />
				</div>
				<div class="text-center">RoScenes averages 123 boxes per scene sample (3× more).</div>

				<br/>
				<br/>
				<br/>

			</div>


			<div class="gray_div font-serif rw-r-container" >
				<br/>
				<div class="text-2xl lg:text-4xl leading-none pb-2 text-center">
					BEV-to-3D Joint Annotation
				</div>
				<div class="text-base lg:text-lg pb-5 text-center ">
					An extremely large annotation volume requires an extremely efficient data pipeline.
				</div>

				<br/>
				<div class="row video-center">
					<img src="assets/pipeline.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 95%;" />
				</div>
				<div class="text-center">We propose a BEV-to-3D joint annotation pipeline based on a pre-built 3D scene reconstruction model and time-synchronized image data among roadside cameras and Unmanned Aerial Vehicles (UAVs).</div>

				<br/>
				<div class="row video-center">
					<img src="assets/error_cal_1.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 70%;" />
				</div>
				<div class="text-center">The BEV-to-3D projection has high definition and low error.
				<br/>
				</div>
			<div>
				<b>(a): Static scene error visualization.</b> We use a high-definition map as the background and overlay red points sampled from the 3D reconstruction.
				<br/>
				<b>(b): Calibration and projection error visualization.</b> We select a camera, pick a single frame as the background, and overlay white points sampled from the 3D reconstruction, projected into this perspective view.
				<br/>
				<b>(c): Vehicles' location and height error.</b> To avoid temporal misalignment and height mismatch, we manually check how well the projected boxes fit across adjacent frames.
			</div>


			<br/>
			<div class="row video-center">
				<img src="assets/error_cal_2.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 85%;" />
			</div>
			<div class="text-center">We visualize the vehicles' length and width error in UAV view. Green boxes indicate human annotations, while red boxes indicate model predictions.
			</div>


			</div>

			<div class=" font-serif rw-r-container">
				<br/>
				<div class="text-2xl lg:text-4xl leading-none pb-2 text-center">
					Visualizations
				</div>
				<br/>
				<div class="row video-center">
					<img src="assets/vis_1.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 65%;" />
				</div>
				<br/>
			</div>
			<div class="gray_div font-serif rw-r-container">
				<br/>
				<div class="row video-center">
					<img src="assets/vis_2.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 65%;" />
				</div>
				<br/>
			</div>

			<div class=" font-serif rw-r-container">
				<br/>
				<div class="row video-center">
					<img src="assets/vis_3.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 65%;" />
				</div>
				<br/>
			</div>

			<div class="gray_div font-serif rw-r-container">
				<br/>
				<div class="row video-center">
					<img src="assets/vis_4.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 65%;" />
				</div>
				<br/>
			</div>

			<div class=" font-serif rw-r-container">
				<br/>
				<div class="row video-center">
					<img src="assets/vis_night.jpg" alt="Image" class="video-center img-fluid tm-gallery-img" style="width: 65%;" />
				</div>
				<br/>
			</div>

		</main>

		<footer class="tm-footer text-center">
			<p>Copyright &copy; Alibaba Cloud 2024</p>
		</footer>
	</div>
	<script src="js/jquery.min.js"></script>
	<script src="js/parallax.min.js"></script>
	<script>
		$(document).ready(function(){
			// Handle click on paging links
			$('.tm-paging-link').click(function(e){
				e.preventDefault();
				var page = $(this).text().toLowerCase();
				$('.tm-gallery-page').addClass('hidden');
				$('#tm-gallery-page-' + page).removeClass('hidden');
				$('.tm-paging-link').removeClass('active');
				$(this).addClass("active");
			});
		});
	</script>
</body>
</html>