TRANSCRIPT
Smart Surveillance System
Tai-Pang Wu
Principal Researcher
Enterprise & Consumer Electronics Group
ASTRI
Fact
We have too many videos, and they last too long
◦ ~500,000 cameras in London, ~4,200,000 in the UK
◦ ~1,000,000 cameras deployed in Shenzhen (SZ)
◦ Many surveillance videos are captured 24 hours a day, 7 days a week
Current Situation (excerpt from Wikipedia, CCTV)
◦ There is little evidence that CCTV deters crime; in fact, there is considerable evidence that it does not. According to a Liberal Democrat analysis, in London "Police are no more likely to catch offenders in areas with hundreds of cameras than in those with hardly any." A 2008 report by UK Police Chiefs concluded that only 3% of crimes were solved by CCTV. In London, a Metropolitan Police report showed that in 2008, only one crime was solved per 1000 cameras.
◦ Full text here: http://en.wikipedia.org/wiki/Closed-circuit_television
2
An Example
3
This video is playing (length of the video: 2:06)
The first subject will appear at 0:70
Challenges
Most of the videos are not monitored
◦ In many places, police still rely on human eyes, together with fast-forwarding, to chase suspects/targets
◦ There is not enough human labor
It is difficult to search the "sea of video" even when we know something happened
Solution?
◦ Find cues in the massive volume of video
4
A System in Demand
How to improve?
◦ We want short videos
Where all the useful content is preserved
◦ We need something to replace tedious human labor
Help us find the suspects/targets
Chase the suspects across camera views
Analyze the videos and extract the useful information needed to improve decision making
5
Our Solution: Smart Surveillance System
Existing Techniques
◦ Scattered over different research topics, each focusing on a different component
E.g., Tracking, Pedestrian Detection, Face Detection, Re-identification, Trajectory Grouping, Camera Topology Estimation, etc.
Our solution
◦ A synergy of selected cutting-edge technologies that focuses on the demands mentioned on the previous slide
6
Our Solution: Smart Surveillance System
Solution
◦ Video Summarization
Reduce the length of a video
Decompose the video into components and save them as a database for analysis
◦ Object and Face Detection
Detect the type of object
Face recognition and verification
Identification
◦ Re-identification
Find the related videos in which the same object appears
Analyze the paths and activities of the objects
◦ Clustering/Grouping and Searching
Classification
Query
7
Video Summarization
Objective
◦ Shorten the video
◦ Retain the content
Side product
◦ ‘Primitives’ called Tubes
How?
◦ Derived from [Pritch et al., TPAMI 2008]
◦ But we are using a different algorithm
8
Video Summarization: An Example
9
Original Video (2:07) Video Summary (0:16)
Video Summarization: How
The whole process can be done in either offline or online mode
◦ Offline mode: the videos are captured in advance
◦ Online mode: video streams are input directly
10
Background Modeling → Foreground Extraction → Tube Extraction → Tube Reordering → Video Summary Generation
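Read as a pipeline, the five stages above chain naturally. The following pure-Python toy shows how they fit together; all function names and the deliberately tiny stand-in algorithms (median background, interval tubes, shift-to-zero reordering) are illustrative, not the actual system.

```python
# Toy sketch of the five-stage summarization pipeline; names and the tiny
# stand-in algorithms are illustrative only. A "video" is a list of frames;
# a frame is a list of pixel values.

def model_background(frames):
    # Background Modeling: per-pixel median over time.
    return [sorted(f[i] for f in frames)[len(frames) // 2]
            for i in range(len(frames[0]))]

def extract_foreground(frames, bg, thresh=10):
    # Foreground Extraction: flag pixels that deviate from the background.
    return [[abs(p - b) > thresh for p, b in zip(f, bg)] for f in frames]

def extract_tubes(masks):
    # Tube Extraction: group consecutive foreground frames into
    # (start_frame, end_frame) intervals.
    tubes, start = [], None
    for t, m in enumerate(masks):
        if any(m) and start is None:
            start = t
        elif not any(m) and start is not None:
            tubes.append((start, t - 1))
            start = None
    if start is not None:
        tubes.append((start, len(masks) - 1))
    return tubes

def reorder_tubes(tubes):
    # Tube Reordering: shift every tube to start at t = 0.
    return [(0, e - s) for s, e in tubes]

def generate_summary(bg, tubes):
    # Video Summary Generation: summary length = longest reordered tube.
    return max((e + 1 for s, e in tubes), default=0)

video = [[0]*4, [0, 50, 50, 0], [0, 50, 50, 0], [0]*4,
         [0, 0, 80, 0], [0]*4, [0]*4]
bg = model_background(video)
tubes = extract_tubes(extract_foreground(video, bg))
short = generate_summary(bg, reorder_tubes(tubes))
print(tubes, short)  # [(1, 2), (4, 4)] 2 -- 7 frames summarized into 2
```

The two events (frames 1-2 and frame 4) become tubes that, once reordered to play concurrently, compress seven frames of video into two.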
Background Modeling
We have several background modeling methods for different situations
◦ Running Average
◦ GMM [Stauffer and Grimson, CVPR'98]
◦ Adaptive GMM [Zivkovic and Heijden, PRL'06]
◦ Adaptive Median [McFarlane and Schofield, MVA'95]
◦ SILTP [Liao et al., CVPR'10]
◦ Reliable Background Suppression for Complex Scenes [Calderara et al., VSSN'06]
◦ Foreground Object Detection from Videos Containing Complex Background [Li et al., ACM MM'03]
11
Background Modeling
The following three are selected
◦ Indoor scenes: SILTP
◦ Outdoor natural scenes: Adaptive GMM
◦ Dynamic backgrounds: Adaptive GMM / ACM MM'03
12
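As a concrete example, the simplest method on the list, Running Average, maintains a per-pixel estimate B ← (1−α)·B + α·F and flags pixels far from it as foreground. A minimal pure-Python sketch (the parameter values are illustrative; for the adaptive GMM, OpenCV's `cv2.createBackgroundSubtractorMOG2` implements Zivkovic's method):

```python
# Running-average background model: a minimal pure-Python sketch.
# alpha controls adaptation speed; thresh flags foreground pixels.
# Real systems operate on image arrays (e.g. NumPy/OpenCV).

def update_background(bg, frame, alpha=0.05):
    # Blend the new frame into the running background estimate.
    return [(1 - alpha) * b + alpha * p for b, p in zip(bg, frame)]

def foreground_mask(bg, frame, thresh=25):
    # A pixel is foreground if it deviates strongly from the background.
    return [abs(p - b) > thresh for p, b in zip(frame, bg)]

frames = [[10, 10, 10]] * 20 + [[10, 200, 10]]  # static scene, then an object
bg = frames[0]
for f in frames[:-1]:
    bg = update_background(bg, f)
mask = foreground_mask(bg, frames[-1])
print(mask)  # [False, True, False]
```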
Foreground Extraction
Background Cut [Sun et al., ECCV'06]
13
Tube Extraction
Stack the extracted foregrounds together
The same object / event is considered a single tube
14
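One common way to decide that foreground boxes in consecutive frames belong to the same tube is bounding-box overlap (intersection over union). A hedged sketch, with illustrative names and a made-up IoU threshold:

```python
# Sketch: link per-frame foreground boxes into tubes by IoU overlap
# between consecutive frames. Boxes are (x1, y1, x2, y2); the names
# and the 0.3 threshold are illustrative assumptions.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link_tubes(dets, min_iou=0.3):
    # dets[t] = list of boxes in frame t; a tube = list of (t, box).
    tubes, open_tubes = [], []
    for t, boxes in enumerate(dets):
        next_open = []
        for box in boxes:
            match = None
            for tube in open_tubes:  # extend an overlapping open tube
                if iou(tube[-1][1], box) >= min_iou:
                    match = tube
                    break
            if match is not None:
                match.append((t, box))
                open_tubes.remove(match)
            else:                    # otherwise start a new tube
                match = [(t, box)]
                tubes.append(match)
            next_open.append(match)
        open_tubes = next_open       # unmatched tubes are closed
    return tubes

dets = [[(0, 0, 10, 10)], [(1, 0, 11, 10)], [], [(50, 50, 60, 60)]]
linked = link_tubes(dets)
print(len(linked), len(linked[0]))  # 2 tubes; the first spans 2 frames
```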
Tube Reordering
Re-arrange the tubes along the temporal axis so that the amount of empty space is minimized
Different from [Pritch et al., TPAMI 2008], which uses pixel-wise optimization
Our method re-arranges the tubes using polygon collision detection, which is much faster
15
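The idea can be approximated with axis-aligned boxes instead of polygons: greedily shift each tube to the earliest start time at which it does not collide, in both time and space, with any already-placed tube. A simplified stand-in for the polygon collision detection mentioned above, with illustrative names:

```python
# Greedy tube-reordering sketch: place each tube as early as possible
# without spatio-temporal collision. Axis-aligned boxes are a crude
# stand-in for the polygon collision detection the slides describe.

def boxes_overlap(a, b):
    # Axis-aligned overlap test for (x1, y1, x2, y2) boxes.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def reorder(tubes):
    # tubes: list of (duration, box); returns list of (start, duration, box).
    placed = []
    for dur, box in tubes:
        start = 0
        # Advance until no placed tube overlaps in both time and space.
        while any(start < s + d and s < start + dur and boxes_overlap(box, b)
                  for s, d, b in placed):
            start += 1
        placed.append((start, dur, box))
    return placed

packed = reorder([(3, (0, 0, 10, 10)),    # tube A
                  (2, (5, 5, 15, 15)),    # tube B: overlaps A spatially
                  (4, (50, 50, 60, 60))]) # tube C: far away
print(packed)  # B waits for A to finish; C plays concurrently from t=0
```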
Video Summary Generation
Stitch the reordered tubes back to the
estimated background to produce the video
summary
16
The Side Product
A database of background models and tubes
from different videos
18
A database, ready for analysis
Second Phase
Tube / BG Database → Tracking → Feature Extraction → Classification → Clustering → Final Database
◦ Tube / BG Database: the database which contains all the extracted tubes, extracted background models, and original videos. A kind of raw data for analysis.
◦ Tracking: responsible for object tracking. At this stage, we may have no (or limited) information about the type/class of the object being tracked.
◦ Feature Extraction: designed to extract the high-level features from the objects obtained in the previous stage.
◦ Classification: used to classify the objects into three main types: i) human, ii) face, and iii) vehicle.
◦ Clustering: clusters the objects in some convenient form (using some level of features) so that mining or searching can be done easily and efficiently.
◦ Final Database: the final database in which all the necessary information has been extracted, clustered, and analyzed. It is ready for query.
Feature Extraction
Pedestrian Detection
◦ "Histograms of Oriented Gradients (HOG) for Human Detection" [Dalal and Triggs, CVPR'05]
◦ Speed up the search by utilizing the geometry of the environment, which is obtained from "Single View Metrology" [Criminisi et al., IJCV'00]
20
[Figure: the human HOG descriptor, showing its positive and negative components.]
Assumptions: the ground plane is known, and the range of human height is known.
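The heart of the HOG descriptor is a per-cell histogram of gradient orientations, weighted by gradient magnitude. A minimal pure-Python illustration of one cell; real detectors (e.g. OpenCV's `cv2.HOGDescriptor` with `getDefaultPeopleDetector()`) add block normalization and a sliding-window linear SVM on top:

```python
# One HOG cell in pure Python: a histogram of unsigned gradient
# orientations (0..180 degrees), weighted by gradient magnitude.
import math

def cell_hog(patch, bins=9):
    # patch: 2D list of grayscale values; border pixels are skipped
    # because central differences need both neighbors.
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180
            hist[int(ang // (180 / bins)) % bins] += mag
    return hist

# A vertical edge: every gradient is horizontal, so all the weight
# lands in the first orientation bin.
hist = cell_hog([[0, 0, 100, 100]] * 4)
print(hist)  # 400.0 in bin 0, zeros elsewhere
```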
Feature Extraction
Shape and Appearance Modeling
◦ "Shape and Appearance Context Modeling" [Wang et al., ICCV'07]
◦ Computes the similarity between deformable objects of a given class
◦ For person re-identification
We can search for a person across a network of cameras
21
[Figure: shape labels and appearance labels computed for the same person in View 1 and View 2.]
Feature Extraction
Face Recognition
◦ "Face Recognition with Learning-based Descriptor" [Cao et al., CVPR'10]
◦ A learning-based approach
Learning-based encoding and histogram representation
Pose-adaptive matching
22
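The encoding-plus-histogram idea can be illustrated with a fixed, hand-crafted code standing in for the learned encoder of [Cao et al., CVPR'10]: code each pixel by thresholding its 8 neighbors against it (LBP-style), histogram the codes over the face, and compare faces by histogram similarity. All names here are illustrative:

```python
# Illustrative encode-then-histogram face descriptor. The fixed 8-neighbor
# binary code is a stand-in for the *learned* encoding in the cited paper.

def encode(img):
    # img: 2D grayscale list; each inner pixel becomes a code in [0, 255]
    # built from which of its 8 neighbors are >= the center value.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            c = sum(1 << i for i, (dy, dx) in enumerate(offs)
                    if img[y + dy][x + dx] >= img[y][x])
            codes.append(c)
    return codes

def histogram(codes, bins=256):
    h = [0] * bins
    for c in codes:
        h[c] += 1
    return h

def similarity(h1, h2):
    # Histogram intersection in [0, 1]; identical faces score 1.0.
    return sum(min(a, b) for a, b in zip(h1, h2)) / max(1, sum(h1))

face = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = histogram(encode(face))
print(similarity(h, h))  # 1.0
```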
Feature Extraction
23
Feature Extraction
Vehicle Detection
◦ Possible work: "Real-time Recognition and Re-identification of Vehicles from Video Data with high Re-identification Rate" [Woesler, SCI'04]
◦ Uses 3D proxies/models to detect vehicles
Classification is also done
◦ License plate and container detection
24
[Figure: the 3D models used for vehicle detection; region of interest where the vehicle is searched.]
Clustering and Retrieval
Detailed Features
◦ Human: trajectory / moving direction, size, dressing (color/texture/logo), activity
◦ Vehicle: trajectory / moving direction, type, speed, dominant color
◦ Face: normal / abnormal (e.g., sheltered), expression, gender
25
Clustering and Retrieval
Retrieve video content by inputting text, images, or sketches
◦ E.g., "Object-based Surveillance Video Retrieval System with Real-Time Indexing Methodology" [Yuk et al., ICIAR'07]
26
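Once tubes carry feature vectors, query-by-example reduces to nearest-neighbor search over the database. A toy sketch with made-up tube IDs and features; the cited system uses real-time object-based indexing rather than this linear scan:

```python
# Toy query-by-example retrieval: rank indexed tubes by cosine
# similarity to a query feature vector. IDs and vectors are invented.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def query(index, feat, k=2):
    # index: {tube_id: feature_vector}; return the k best-matching IDs.
    ranked = sorted(index, key=lambda tid: cosine(index[tid], feat),
                    reverse=True)
    return ranked[:k]

index = {"tube_red_car": [1, 0, 0],
         "tube_blue_car": [0, 1, 0],
         "tube_person": [0, 0, 1]}
print(query(index, [0.9, 0.1, 0.0], k=1))  # ['tube_red_car']
```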
Possible Services
A complete system allows us to
◦ Minimize the time spent watching surveillance videos
◦ Index the original video based on the video summary
◦ Find a person by inputting a face image
◦ Find the associated videos in which the subject appeared
◦ Extract different target objects based on information of interest (e.g., moving direction, dressing, etc.)
27
Possible Services
Some examples:
28
◦ Add Video Summarization to existing software
◦ Monitor traffic: detect abnormal activities, record vehicle information
◦ Monitor the entrance: record human faces