TRANSCRIPT
Towards a Comprehensive Super-Pixel Representation
of Traffic Scenes
Uwe Franke et al., Daimler R&D
Autonomous Vehicles: it’s all about Sensors
[Sensor setups: 2007 vs. 2017]
2
What’s a good Representation of the Scene?
A good representation
should be
1. Compact
2. Complete
3. Efficient
4. Explicit
5. Accurate
6. Robust
On which level should sensor fusion act?
1. Object level (boxes)?
2. Low level or even raw data?
3. Some intermediate level?
3
The Stixel-Representation
D. Pfeiffer and U. Franke: „Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data”, BMVC 2011
500,000 3D points → 500…1,000 Stixels
4
S. Gehrig, F. Eberli, and T. Meyer: “A Real-time Low-Power Stereo Vision Engine Using Semi-Global Matching”,
ICVS 2009 (Best Paper Award)
The first Attempt: Stixel 1.0
[Depth color scale: far … close]
1. Compute Disparities
2. Compute Freespace
3. Compute Stixel Height
4. Refine Stixel Distance
H. Badino, U. Franke, and D. Pfeiffer: “The Stixel World – A Compact Medium Level Representation of the 3D-World”, DAGM Symposium 2009
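The four steps can be made concrete with a small per-column sketch in NumPy. This is only an illustration, not the original Daimler pipeline: the helper name, the ground-disparity model, the thresholds and the toy disparity column are all assumptions, and the real system estimates free space and height with dynamic programming rather than simple thresholding.

```python
import numpy as np

def stixel_column_v1(disp_col, ground_disp, obj_tol=1.0, min_support=3):
    """Simplified Stixel 1.0 for one image column (hypothetical helper).

    disp_col    : disparities of one column, index 0 = top image row
    ground_disp : expected ground-plane disparity per row (same length)
    Returns (v_base, v_top, stixel_disp) or None if the column is free space.
    """
    h = len(disp_col)
    # 2. Freespace: scan upwards from the bottom; the first run of rows whose
    #    disparity clearly exceeds the ground model marks the obstacle base.
    v_base, run = None, 0
    for v in range(h - 1, -1, -1):
        if disp_col[v] > ground_disp[v] + obj_tol:
            run += 1
            if run >= min_support:
                v_base = v + min_support - 1
                break
        else:
            run = 0
    if v_base is None:
        return None
    stixel_disp = np.median(disp_col[max(0, v_base - 10):v_base + 1])
    # 3. Height: walk upwards while the disparity stays close to the obstacle.
    v_top = v_base
    while v_top > 0 and abs(disp_col[v_top - 1] - stixel_disp) < obj_tol:
        v_top -= 1
    # 4. Refine the Stixel distance using all rows of the segment.
    stixel_disp = np.median(disp_col[v_top:v_base + 1])
    return v_base, v_top, stixel_disp

# Toy example: ground ramp with a fronto-parallel obstacle between rows 40..70.
h = 100
ground = np.linspace(0.0, 30.0, h)     # disparity grows towards the bottom row
col = ground.copy()
col[40:71] = 21.0                       # obstacle at constant disparity
print(stixel_column_v1(col, ground))    # -> (base row, top row, ~21.0)
```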
5
Stixel anno 2009
6
Stixel-based Focus of Attention
The Stixel World significantly reduces the computational burden of classification schemes
while at the same time reducing the false positive rate.
[Diagram: HOG classifier with 5,000 sliding-window hypotheses vs. SGM stereo with 500 hypotheses (5× fewer false positives) vs. SGM + Stixels with a Stixel-based classifier (another 8× fewer false positives); 500 car hypotheses centered at the Stixels]
M. Enzweiler, M. Hummel, and U. Franke: „Efficient Stixel-Based Object Recognition“, IEEE Intelligent Vehicles Symposium IV 2012
7
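A minimal sketch of the focus-of-attention idea: instead of exhaustive sliding windows, one detection window is generated per Stixel and scaled by its disparity. The camera constants, the Stixel tuple layout and the window aspect ratio below are illustrative assumptions, not the interface of the cited system.

```python
import numpy as np

# Hypothetical camera constants (focal length in px, stereo baseline in m).
FOCAL_PX, BASELINE_M = 1000.0, 0.22

def stixel_hypotheses(stixels, obj_height_m=1.7, aspect=0.4):
    """Generate one ROI per Stixel: (u_left, v_top, width, height) in pixels.

    stixels: list of (u_center, v_base, disparity) tuples.
    The ROI height follows from the expected object height and the Stixel's
    depth; the width is a fixed fraction of the height (pedestrian-like).
    """
    rois = []
    for u, v_base, d in stixels:
        if d <= 0:
            continue
        depth_m = FOCAL_PX * BASELINE_M / d        # Z = f * B / disparity
        h_px = FOCAL_PX * obj_height_m / depth_m   # projected object height
        w_px = aspect * h_px
        rois.append((int(u - w_px / 2), int(v_base - h_px), int(w_px), int(h_px)))
    return rois

# Toy example: ~500 Stixels across a 1024 px wide image instead of
# thousands of sliding-window positions and scales.
stixels = [(u, 600, 20.0) for u in range(0, 1024, 2)]
print(len(stixel_hypotheses(stixels)), "hypotheses")
```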
Stixel 1.0: Mission accomplished?
Goals
1. Compact
2. Complete
3. Efficient
4. Explicit
5. Accurate
6. Robust
Only the closest objects are represented.
The integrated free-space computation takes a lot of time.
Stixels only encode geometry.
Strong regularization reduces disparity noise significantly,
BUT: close objects may hide relevant parts of the scene.
8
Stixel Segmentation as an Optimization Problem
9
D. Pfeiffer and U. Franke: „Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data”, BMVC 2011
D. Gallup, M. Pollefeys, and J.-M. Frahm: “3D Reconstruction Using an n-Layer Heightmap”, DAGM 2010
How do we expect the scene to be arranged?
10
- Large objects are preferred
- Changes between the label types: P(ground → object) ≠ P(object → ground)
- In general, objects are earthbound
- Usually, just a small number of Stixels per column (see the sketch below)
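A simplified single-column stand-in for this energy minimization, assuming a small label set (ground, object, sky), an invented data term and hand-picked asymmetric transition costs; the cited BMVC 2011 formulation optimizes over full Stixel segments and real disparity likelihoods.

```python
import numpy as np

LABELS = ("ground", "object", "sky")

# Asymmetric transition costs encode the priors above: ground -> object (an
# object standing on the ground) is cheap, the implausible object -> ground
# transition within one column is expensive. All numbers are made up.
TRANS = {("ground", "ground"): 0.0, ("object", "object"): 0.0, ("sky", "sky"): 0.0,
         ("ground", "object"): 2.0, ("object", "ground"): 8.0,
         ("object", "sky"): 2.0,    ("sky", "object"): 8.0,
         ("ground", "sky"): 6.0,    ("sky", "ground"): 12.0}

def segment_column(data_cost):
    """Viterbi-style DP over one column (row index 0 = bottom row).

    data_cost: (h, 3) array with the cost of assigning each row to
    ground/object/sky. Returns one label per row; label changes mark cuts.
    """
    h, n = data_cost.shape
    cost = data_cost[0].astype(float).copy()
    back = np.zeros((h, n), dtype=int)
    for v in range(1, h):
        new_cost = np.empty(n)
        for j in range(n):
            steps = [cost[i] + TRANS[(LABELS[i], LABELS[j])] for i in range(n)]
            back[v, j] = int(np.argmin(steps))
            new_cost[j] = data_cost[v, j] + min(steps)
        cost = new_cost
    labels = [0] * h
    labels[h - 1] = int(np.argmin(cost))          # best label in the top row
    for v in range(h - 1, 0, -1):                 # backtrack to the bottom
        labels[v - 1] = back[v, labels[v]]
    return [LABELS[i] for i in labels]

# Toy column (bottom-first): 30 ground rows, 30 object rows, 40 sky rows.
h = 100
dc = np.full((h, 3), 5.0)
dc[:30, 0] = dc[30:60, 1] = dc[60:, 2] = 0.5
seg = segment_column(dc)
print(seg[0], seg[45], seg[99])   # -> ground object sky
```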
Robustness: Challenge Darkness
11
Robustness: Challenge Strong Rain
12
Robustness: Challenge Reflections
13
Stixel Motion and Segmentation
D. Pfeiffer and U. Franke: „Efficient Representation of Traffic Scenes by Means of Dynamic Stixels”, IEEE Int. Veh. Symposium 2010 (Best Paper Award)
F. Erbs, B. Schwarz, and U. Franke: „From Stixels to Objects - A Conditional Random Field based Approach “, IEEE Intelligent Vehicles Symposium 2013
Stixels are optimally grouped based on their depth & motion.
GraphCut execution time: 1 ms on a single core of an Intel i7.
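The CRF/graph-cut grouping itself does not fit in a few lines, but the underlying idea — neighbouring Stixels with similar depth and motion belong to the same object — can be sketched with a greedy stand-in. The thresholds and the (depth, velocity) tuple layout are assumptions for illustration.

```python
def group_stixels(stixels, max_dz=1.0, max_dv=0.5):
    """Greedy stand-in for the CRF/graph-cut grouping: merge neighbouring
    Stixels whose depth (m) and lateral motion (m/s) are similar.

    stixels: list of (depth, velocity) tuples, ordered left to right.
    Returns one object id per Stixel.
    """
    ids, current = [], 0
    for i, (z, v) in enumerate(stixels):
        if i > 0:
            pz, pv = stixels[i - 1]
            if abs(z - pz) > max_dz or abs(v - pv) > max_dv:
                current += 1            # start a new object segment
        ids.append(current)
    return ids

# Toy scene: a slow car at 10 m next to a faster car at 20 m.
scene = [(10.0, 1.0)] * 5 + [(20.0, 5.0)] * 5
print(group_stixels(scene))  # -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```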
14
Application 1: May I Enter the Round-About?
M. Muffert, T. Milbich, D. Pfeiffer, and U. Franke: “May I Enter the Roundabout? A Time-To-Contact Computation based on Stereo Vision”,
IEEE Intelligent Vehicles Symposium IV 2012 (best paper award)
[Panoramic view spanning −180° to +180°]
2 objects · ~560,000 pixels → ~600 Stixels
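The quantity behind this application is the time-to-contact of approaching traffic. A minimal sketch of the idea, assuming the TTC is approximated from two consecutive depth measurements; the cited system estimates it per tracked Stixel cluster with proper filtering.

```python
def time_to_contact(z_prev_m, z_curr_m, dt_s):
    """Time-to-contact of an approaching object from two consecutive stereo
    depth measurements: TTC = Z / (-dZ/dt). Returns None if not approaching."""
    closing_speed = (z_prev_m - z_curr_m) / dt_s     # > 0 when approaching
    if closing_speed <= 0:
        return None
    return z_curr_m / closing_speed

# Object observed at 25.0 m, then 24.2 m one frame (40 ms) later:
print(time_to_contact(25.0, 24.2, 0.04))   # -> ~1.2 s until contact
```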
15
Application 1: May I Enter the Round-About?
16
Stixel 2.0: Mission accomplished?
Goals
1. Compact
2. Complete
3. Efficient
4. Explicit
5. Accurate
6. Robust
Stixels only encode geometry and motion.
Problems with non-planar roads.
Strong regularization reduces disparity noise significantly,
BUT: still too many ghost objects per hour of driving.
17
There is more than Depth
So far, Stixels are estimated based on depth information only.
Some work has already addressed this, e.g. with online color modeling [1]:
18
Images taken from Sanberg et al. [1]
[1] W. P. Sanberg, G. Dubbelman, and P. H. de With, “Extending the Stixel World with Online Self-supervised Color Modeling for
Road-versus-Obstacle Segmentation”, ITSC 2014
There is more than Depth
19
T. Scharwächter and U. Franke: “Low-Level Fusion of Color, Texture and Depth for Robust Road Scene Understanding”,
IEEE Intelligent Vehicles Symposium 2015
Feature Channels
Fast to compute and complementary feature transformations:
rg chromaticity: r = R / (R + G + B), g = G / (R + G + B), b = 1 − r − g
filter bank
3D height above the ground plane
vertical disparity gradient
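A small sketch of two of these channels in NumPy (the filter bank and the height above the ground plane need more machinery and are omitted); the function names and the toy inputs are illustrative, not part of the cited system.

```python
import numpy as np

def rg_chromaticity(rgb):
    """Illumination-normalized rg chromaticity for an HxWx3 float image."""
    s = rgb.sum(axis=2, keepdims=True) + 1e-6   # R + G + B, avoid divide by 0
    r = rgb[..., 0:1] / s
    g = rgb[..., 1:2] / s
    b = 1.0 - r - g                              # redundant third channel
    return np.concatenate([r, g, b], axis=2)

def vertical_disparity_gradient(disp):
    """Vertical gradient of the disparity image (axis 0 = image rows)."""
    return np.gradient(disp, axis=0)

# Toy 2x2 image and disparity map:
img = np.array([[[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]],
                [[0.3, 0.3, 0.4], [1.0, 1.0, 1.0]]])
disp = np.array([[10.0, 10.0], [12.0, 12.5]])
print(rg_chromaticity(img)[0, 0])           # mostly red pixel -> r ~ 0.8
print(vertical_disparity_gradient(disp))
```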
20
Pixel-level Results
21
Ground Obstacle Vegetation Sky
Stixel Extension Results
Ground Obstacle Vegetation Sky Grass
22
The Cityscapes Dataset
50 major German Cities
5000 precisely labeled frames
>4000 downloads
www.cityscapes-dataset.com
23
M. Cordts et al.: “The Cityscapes Dataset for Semantic Urban
Scene Understanding”, CVPR 2016
The Cityscapes Benchmark
50 major German Cities
5000 precisely labeled frames
>4000 downloads
www.cityscapes-dataset.com
24
M. Cordts et al.: “The Cityscapes Dataset for Semantic Urban
Scene Understanding”, CVPR 2016
Benchmark Challenges
• pixel-level semantic labeling
• instance-level semantic labeling
Properties
• evaluation server
• non-public test set
• prevent overfitting
• public evaluation scripts
• ranking website
• initial set of baselines
The Cityscapes Benchmark
Name                | Reference                | Classes IoU | Classes iIoU | Categories IoU | Categories iIoU | sub | runtime
ResNet-38           | Wu et al., 2016          | 80.6 | 57.8 | 91.0 | 79.1 |   |
PSPNet              | Zhao et al., 2017        | 80.2 | 58.1 | 90.6 | 78.2 |   |
TuSimple            | Wang et al., 2017        | 80.1 | 56.9 | 90.7 | 77.8 |   |
RefineNet           | Lin et al., 2017         | 73.6 | 47.2 | 87.9 | 70.6 |   |
LRR-4x              | Ghiasi and Fowlkes, 2016 | 71.8 | 47.9 | 88.4 | 73.9 |   |
FRRN                | Pohlen et al., 2017      | 71.8 | 45.5 | 88.9 | 75.1 | 2 |
Adelaide Context    | Lin et al., 2016         | 71.6 | 51.7 | 87.3 | 74.1 |   |
Deep Layer Cascade  | Li et al., 2017          | 71.1 | 47.0 | 88.1 | 74.1 |   |
DeepLab v2 CRF      | Chen et al., 2016        | 70.4 | 42.6 | 86.4 | 67.7 |   |
Dilation 10         | Yu and Koltun, 2016      | 67.1 | 42.0 | 86.5 | 71.1 |   | 4 s
Scale invariant CNN | Kreso et al., 2016       | 66.3 | 44.9 | 85.0 | 71.2 |   |
SQ                  | Treml et al., 2016       | 59.8 | 32.3 | 84.3 | 66.0 |   | 60 ms
ENet                | Paszke et al., 2016      | 58.3 | 34.4 | 80.4 | 64.0 | 2 | 13 ms
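The benchmark ranks methods by the class-level IoU and the instance-weighted iIoU. A minimal sketch of the IoU computation from a confusion matrix; the iIoU additionally weights each instance by its size and needs instance masks, so it is omitted here.

```python
import numpy as np

def class_iou(conf):
    """Per-class IoU from a confusion matrix conf[gt, pred]:
    IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1e-9)

# Toy 3-class confusion matrix (rows = ground truth, cols = prediction):
conf = np.array([[50,  2,  0],
                 [ 3, 40,  5],
                 [ 0,  4, 20]])
print(class_iou(conf), class_iou(conf).mean())
```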
25
26
Each layer computes one of a few simple mathematical operations that can be highly parallelized (GPUs).
Low-level features: the first layers typically learn simple edge-detection and color filters.
Mid-level features: typically detect simple shapes like corners, circles, patterns, …
High-level features: complex shapes, parts of larger objects (wheels), …
Deep Learning in Fully Convolutional Neural Networks (2015)
J. Long, E. Shelhamer, and T. Darrell: “Fully convolutional networks for semantic segmentation,” CVPR 2015
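A toy fully convolutional network in PyTorch in the spirit of Long et al.: every layer is a convolution, so the network accepts arbitrary image sizes and predicts a class-score map that is upsampled back to the input resolution. The layer widths and depth are arbitrary choices for illustration, not the original FCN-8s architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Toy fully convolutional network: conv backbone + 1x1 classifier +
    bilinear upsampling back to the input resolution."""
    def __init__(self, num_classes=19):          # 19 = Cityscapes classes
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        scores = self.classifier(self.backbone(x))        # 1/8 resolution
        return F.interpolate(scores, size=(h, w), mode="bilinear",
                             align_corners=False)

# Any input size works because there are no fully connected layers:
logits = TinyFCN()(torch.randn(1, 3, 256, 512))
print(logits.shape)          # -> torch.Size([1, 19, 256, 512])
```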
Getting closer to Human Perception
27
Real-Time Scene Labeling
28
FCNs at Rainy Nights
29
Stixel and Semantics
Semantic Labeling: 2 million labeled points
Stereo Matching: 2 million 3D points
3D Stixel Representation: 1,000 Stixels
Semantic Stixels: 1,000 Stixels
L. Schneider et al.: “Semantic Stixel: Depth is not Enough”, IEEE Intelligent Vehicles Symposium 2016 (Best Paper Award)
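A minimal sketch of the fusion idea: every Stixel carries a semantic class from the FCN scores and a depth from the SGM disparities of its pixels. The aggregation below (sum of scores, median disparity) is a simplified stand-in; in the cited paper both cues enter one joint MAP estimation rather than being aggregated afterwards.

```python
import numpy as np

def semantic_stixel_attrs(class_scores, disparities):
    """Fuse per-pixel semantics and depth for the pixels of one Stixel.

    class_scores: (n_pixels, n_classes) per-pixel scores from the FCN.
    disparities : (n_pixels,) disparities from SGM.
    Returns (class id, disparity) for the whole Stixel.
    """
    label = int(np.argmax(class_scores.sum(axis=0)))
    disparity = float(np.median(disparities))
    return label, disparity

# Toy Stixel covering 6 pixels, 4 classes (e.g. road, car, person, sky):
scores = np.array([[0.1, 0.7, 0.1, 0.1]] * 5 + [[0.6, 0.2, 0.1, 0.1]])
disps = np.array([19.8, 20.1, 20.0, 20.2, 19.9, 5.0])   # one outlier pixel
print(semantic_stixel_attrs(scores, disps))              # -> (1, ~20.0)
```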
30
Semantic Stixels: Results
[Panels: image · semantic input · semantic representation · depth input · depth representation]
31
Semantic Stixel World in Downtown Stuttgart
32
Semantic Stixel in 3D
33
Lost! … and Found by CNN
34
Even small objects on the road can cause damage to the car and must be avoided by all means.
Humans are brilliant at detecting such objects.
False Positive Rates for Lost Cargo Fusion
CNN-based Lost Cargo Detection · Stereo-based Lost Cargo Detection
Lost Cargo Fusion
The false positive detections (above) disappeared.
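One plausible fusion rule, sketched below: keep a detection only if the CNN-based and the stereo-based detectors fire at roughly the same image location. This is an assumption for illustration; the slide does not spell out the actual fusion logic.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def fuse_detections(cnn_dets, stereo_dets, min_iou=0.3):
    """Keep only CNN detections confirmed by an overlapping stereo detection."""
    return [d for d in cnn_dets
            if any(box_iou(d, s) >= min_iou for s in stereo_dets)]

# Toy example: the unconfirmed CNN detection is rejected, the confirmed
# small obstacle survives.
cnn = [(100, 400, 140, 430), (600, 300, 650, 340)]
stereo = [(105, 405, 138, 428)]
print(fuse_detections(cnn, stereo))   # -> [(100, 400, 140, 430)]
```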
Stixel 3.0: Mission accomplished?
Goals
1. Compact
2. Complete
3. Efficient
4. Explicit
5. Accurate
6. Robust
Reconstruction error is high in hilly areas due to the simple ground model.
Semantics seems to solve all problems with ghost objects, but
the geometric model fights against the semantics too much in San Francisco (SFO).
36
Slanted Stixels: Solving the SFO-Problem
37
Original Stixels
Slanted Stixels
New model to represent all classes: the disparity of Stixel s_i at image row v is a plane, μ(s_i, v) = b_i · v + a_i
With priors according to the class c_i:
E_plane(s_i) = ((a − μ_a(c_i)) / σ_a(c_i))² + ((b − μ_b(c_i)) / σ_b(c_i))² − log Z
Optimized jointly within the Semantic Stixel probabilistic framework
D. Hernandez et al.: “Slanted Stixels: Representing San Francisco's Steepest Street”, BMVC 2017 (Best Paper Award)
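A small sketch evaluating this class-conditional plane prior; the Gaussian parameters per class are invented for illustration, only the formula structure follows the slide.

```python
# Hypothetical class-conditional Gaussian priors on the plane parameters
# (a: disparity offset, b: slope of disparity over the image row v).
# Object classes prefer fronto-parallel planes (slope ~ 0), the ground
# class prefers the typical road slope. All numbers are made up.
PLANE_PRIORS = {
    "object": {"mu_a": 20.0,  "sig_a": 10.0, "mu_b": 0.0, "sig_b": 0.05},
    "ground": {"mu_a": -30.0, "sig_a": 15.0, "mu_b": 0.3, "sig_b": 0.10},
}

def plane_prior_energy(a, b, cls, log_z=0.0):
    """E_plane = ((a - mu_a(c))/sig_a(c))^2 + ((b - mu_b(c))/sig_b(c))^2 - log Z."""
    p = PLANE_PRIORS[cls]
    return ((a - p["mu_a"]) / p["sig_a"]) ** 2 + \
           ((b - p["mu_b"]) / p["sig_b"]) ** 2 - log_z

def stixel_disparity(a, b, v):
    """Slanted Stixel disparity model: mu(s_i, v) = b_i * v + a_i."""
    return b * v + a

# A slightly slanted "object" Stixel is cheap; interpreting the same plane
# as ground is expensive:
a, b = 21.0, 0.02
print(plane_prior_energy(a, b, "object"))   # small energy
print(plane_prior_energy(a, b, "ground"))   # large energy
print(stixel_disparity(a, b, v=400))        # disparity at image row 400
```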
New Dataset: SYNTHIA-San Francisco
38
• Generated with the SYNTHIA toolkit to evaluate our algorithm; features slanted roads
• Photorealistic virtual sequence (2,224 images), pixel-level depth and semantic ground truth
• Expensive to generate an equivalent real-data sequence
• Will be available on the SYNTHIA website soon: http://www.synthia-dataset.net
Results: Frame-Rate
39
• Our version is slightly slower because of the increased complexity
Higher is better. Stixels time measured on a 6-core Intel i7.
Metric          | Dataset    | Stixel 3.0 | Slanted S.
Disp Err (%)    | Ladicky    | 17.3 | 16.9
Disp Err (%)    | KITTI 15   | 10.9 | 11.0
Disp Err (%)    | SYNTHIA-SF | 30.9 | 12.9
IoU (%)         | Ladicky    | 63.5 | 63.4
IoU (%)         | Cityscapes | 65.7 | 65.8
IoU (%)         | SYNTHIA-SF | 46.0 | 48.5
Frame-rate (Hz) | KITTI 15   | 113  | 61
Frame-rate (Hz) | Cityscapes | 20.9 | 6.6
Frame-rate (Hz) | SYNTHIA-SF | 19.4 | 4.7
Stixel Computation Complexity: Pre-Segmentation
40
[Pipeline: disparity image + semantic segmentation → pre-segmentation over h rows → dynamic programming over h′ × h′ candidate cuts, h′ << h; classes: ground, object, sky]
• Infer possible Stixel cuts (pre-segmentation) from the image
• Avoid checking all possible Stixel combinations
• If given the correct Stixel cuts, same accuracy (or better!)
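A minimal sketch of the pre-segmentation idea for one column: candidate cut rows are proposed wherever the semantic label or the disparity changes, and the dynamic programming afterwards only has to consider these h′ rows. The thresholds and the toy column are assumptions.

```python
import numpy as np

def candidate_cuts(sem_col, disp_col, disp_jump=2.0):
    """Infer candidate Stixel cut rows for one column (pre-segmentation).

    A cut is proposed wherever the semantic label changes or the disparity
    jumps; the subsequent dynamic programming then only considers these
    h' rows instead of all h rows.
    """
    sem_change = sem_col[1:] != sem_col[:-1]
    disp_change = np.abs(np.diff(disp_col)) > disp_jump
    cuts = np.flatnonzero(sem_change | disp_change) + 1
    return np.concatenate(([0], cuts, [len(sem_col)]))   # segment boundaries

# Toy column with 100 rows: sky (label 2) above a car (1) above the road (0),
# listed top to bottom.
sem = np.array([2] * 30 + [1] * 40 + [0] * 30)
disp = np.array([0.0] * 30 + [20.0] * 40 + list(np.linspace(20, 40, 30)))
print(candidate_cuts(sem, disp))   # -> [0 30 70 100]: 4 boundary rows instead of 100
```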
Pre-Segmentation Results: Frame-rate
41
• Pre-segmentation speeds up both original and Slanted Stixels
Metric          | Dataset    | Stixel 3.0 | Slanted S. | Stixel 3.0 + Preseg. | Slanted S. + Preseg.
Disp Err (%)    | Ladicky    | 17.3 | 16.9 | 18.5 | 17.8
Disp Err (%)    | KITTI 15   | 10.9 | 11.0 | 11.8 | 11.7
Disp Err (%)    | SYNTHIA-SF | 30.9 | 12.9 | 33.9 | 15.4
IoU (%)         | Ladicky    | 63.5 | 63.4 | 63.9 | 63.7
IoU (%)         | Cityscapes | 65.7 | 65.8 | 65.7 | 65.8
IoU (%)         | SYNTHIA-SF | 46.0 | 48.5 | 46.9 | 48.5
Frame-rate (Hz) | KITTI 15   | 113  | 61   | 120  | 116
Frame-rate (Hz) | Cityscapes | 20.9 | 6.6  | 36.6 | 27.5
Frame-rate (Hz) | SYNTHIA-SF | 19.4 | 4.7  | 38.9 | 33.1
Higher is better
Visual Examples
42
[Panels: left image · original Stixels · slanted Stixels]
Goals
1. Compact
2. Complete
3. Efficient
4. Explicit
5. Accurate
6. Robust