
Multimodal Sensing Enabled Real-time Intelligent Wireless Camera Networks for Secure Spaces

Development and implementation of consensus development and data fusion algorithms

Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology Lahore


Contents

1 Report Summary

2 Distributive Data Fusion and Consensus Development
  2.1 Purpose
  2.2 Data Fusion Entities and Consensus Development
    2.2.1 Layer 1 Entities
      2.2.1.1 Local Fusion Unit
  2.3 Layer 2 Entities
    2.3.1 Optimal Data Fusion Entity

3 Layer 1 Consensus Development Algorithms
  3.1 Purpose
    3.1.1 k-Nearest Neighbor Classifier (KNN)
    3.1.2 Naive Bayesian Classifier
    3.1.3 Decision Tree
    3.1.4 Gaussian Markov Model
  3.2 Effect of Audio Features Selection on Classification Problem
    3.2.1 Temporal Features
      3.2.1.1 Zero Crossing Rate
      3.2.1.2 Peaks Count
      3.2.1.3 Ratio of Peaks-Count to Zero-Crossings-Rate
      3.2.1.4 Short Time Signal Energy
    3.2.2 Comparison of Acoustic Features

4 Layer 2 Consensus Development Algorithms
  4.1 Purpose
  4.2 Object Detection
    4.2.1 Histogram of Oriented Gradients
    4.2.2 Deformable Parts Model
  4.3 Object Tracking
    4.3.1 Kalman Filter

5 Optimal Data Fusion (ODF) Unit Performance Evaluation
  5.1 Purpose
  5.2 Test Case 1
    5.2.1 Object Detection
    5.2.2 Object Tracking
  5.3 Test Case 2
    5.3.1 Object Detection
    5.3.2 Object Tracking
  5.4 Test Case 3
    5.4.1 Object Detection
    5.4.2 Object Tracking
  5.5 Conclusions

A Appendix A
  A.1 Peak Counts
  A.2 Feature Extraction Functions
  A.3 Compute Feature Statistics
  A.4 Compute Features of All Classes
  A.5 Tracking and Kalman Filter


Report Summary

This report describes the consensus development and data fusion algorithms used in the proposed hierarchical model. The objective is to perform system-level consensus development and fusion. The visual and acoustic domains have been brought into focus for object classification, visual detection and visual tracking.

Chapter 2 presents the proposed architecture, with a modular design for system-level data fusion and consensus development, and explains the design of the proposed fusion model. Chapter 3 covers acoustic consensus development in the context of data fusion at layer-1 of the proposed architecture; different data fusion techniques have been explored to arrive at an accurate system-level implementation for object detection and classification. Chapter 4 describes the layer-2 consensus development, which directly targets object detection and tracking on the proposed system entity; various approaches are included with results and implementation. The need for a suitable system platform, and its performance evaluation, are justified and compared through the testing discussed in Chapter 5. The report also includes algorithmic details for feature extraction, object classification, detection and tracking.


Distributive Data Fusion and Consensus Development

2.1 Purpose

To implement seamless tracking on this distributed network and to meet the processing requirements, consensus had to be developed on the basis of fused, processed data. The development is bounded by the low cost and low processing capability of the embedded resources available in our distributed network. We have therefore divided the resources into multiple stages and connected each stage to the next by following the designed network architecture. Since we target object detection and tracking, robustness must be built into the consensus development using these limited processing resources within the same distributed network. To tackle this, we have further divided the network into multiple processing layers, where each layer performs a dedicated task. The main fusion processing components addressed for consensus development are described below.

2.2 Data Fusion Entities and Consensus Development

The distributed data fusion architecture proposed in report 6 has been used for consensus development and its implementation. The fusion model has been divided into modules as shown in figure 2.1.

Figure 2.1: Fusion Architecture (local fusion units feed DFU Level-1 entities, which in turn feed the ODF and CDF entities)

The layered approach depicts the components involved, leading toward object tracking in the visual domain. The initial local fusion unit consists of an acoustic sensor node, which has been categorized as a layer-1 fusion entity. The DFU level-1 is the next hierarchical step toward the detection and localization of the acoustic event. The fusion model carrying out object detection and localization has been divided into two different layers, ODF and CDF, where event-based video detection is performed on the ODF only, and object tracking across single and multiple cameras is made possible by involving the CDF in parallel with the ODF.

2.2.1 Layer 1 Entities

2.2.1.1 Local Fusion Unit

The local fusion unit and its components are shown in figure 2.2. The algorithmic model for consensus development, covering feature extraction and classification, is handled at this layer.

Figure 2.2: Local Fusion Unit (sensors S1 ... Sn, data acquisition, feature extraction, classification)

Various techniques have been used for detection and classification in support of the consensus and its measurements. The techniques used for this purpose are discussed in later chapters of this report.

2.3 Layer 2 Entities

2.3.1 Optimal Data Fusion Entity

Object detection and tracking with the help of the visual sensing modality is handled by the fusion layers named ODF and CDF, as shown in figure 2.3.

Figure 2.3: Optimal Data Fusion Unit (object detection, classification, continuous tracking)

A complete modular diagram of the layer-2 ODF is shown in figure 2.3. This layer contains the working model and the developed approaches for implementing detection and tracking based on the built consensus. Consensus development for detection has been investigated using the techniques discussed later in this report.


Layer 1 Consensus Development Algorithms

3.1 Purpose

This chapter discusses the classification of acoustic features into gunshot and non-gunshot categories using inference-based methods, classification techniques and artificial intelligence. The selection of the audio feature vector was briefly discussed in Report-6. Here we show the experimental results for the optimal feature set on an audio signal database consisting of gunshot acoustic signals and non-gunshot signals such as street noise and bird chirping.

Next, we discuss the classification methods used to declare an audio signature as a gunshot in a noisy environment, along with the feasibility of these methods on a constrained embedded platform. Extensive real-time evaluation of these methods in real-world (noisy) settings will be the subject of a future report.

The classification methods that we have considered for our embedded setup are k-nearest neighbor classification, the decision tree classifier, the naive Bayes approach and the Gaussian Markov model.

3.1.1 k-Nearest Neighbor Classifier (KNN)

Nearest neighbor classification can be explained by the analogy of learning by example. It compares a to-be-classified test tuple with the training tuples. In our case the training tuple is the acoustic feature vector:

$$\mathbf{x} = (zcr,\ ste,\ spbw,\ spro,\ ber_6,\ ber_7,\ cep,\ sprc,\ sprf)^T$$

where

zcr: Zero Crossing Rate

ste: Short Time Energy

spbw: Spectral bandwidth

spro: Spectral roll-off

ber6: Band energy ratio of the 6th subband

ber7: Band energy ratio of the 7th subband


cep: Cepstral coefficients

sprc: Spectral Centroid

sprf: Spectral flux

The attributes of the training tuple are the optimal features selected for audio classification. Each tuple represents a point in an n-dimensional space, with n being the number of features. All of the training tuples are stored in an n-dimensional pattern space. When an unknown tuple is given for classification, a k-nearest neighbor (k-NN) classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.

Closeness is defined in terms of a distance metric, such as the Euclidean distance. The Euclidean distance between two points or tuples $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$ is

$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

The basic steps of the k-NN algorithm are:

• Compute the distances between the new sample and all previous samples that have already been classified into clusters.

• Sort the distances in increasing order and select the k samples with the smallest distance values.

• Apply the voting principle: the new sample is added (classified) to the class holding the majority among the k selected samples.
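As an illustration of these steps, a minimal MATLAB sketch is given below; the training matrix Xtrain, the label vector ytrain, the test vector xtest and the value of k are assumed inputs for illustration and are not part of the deployed node software.

function label = knnClassify(Xtrain, ytrain, xtest, k)
% Minimal k-NN classifier using the Euclidean distance.
%   Xtrain : N x n matrix of stored training feature vectors (one per row)
%   ytrain : N x 1 vector of class labels (e.g. 1 = gunshot, 0 = noise)
%   xtest  : 1 x n feature vector to classify
%   k      : number of nearest neighbours that vote

    % Euclidean distance from the test tuple to every training tuple
    diffs = Xtrain - repmat(xtest, size(Xtrain, 1), 1);
    dists = sqrt(sum(diffs .^ 2, 2));

    % Sort distances and keep the k closest training tuples
    [~, order] = sort(dists, 'ascend');
    nearestLabels = ytrain(order(1:k));

    % Majority vote among the k nearest neighbours
    label = mode(nearestLabels);
end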

Figure 3.1: k-NN classification example. The test tuple (green circle) is to be classified either to the class of blue squares or to the class of red triangles. If k = 3 (solid circle) it is assigned to the red triangle class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed circle) it is assigned to the blue square class.

The k-nearest neighbor approach is a non-parametric classification method, as it works without knowledge of the underlying probability distributions of the member features. Its performance


is tuned by choosing the appropriate k (number of considered nearest neighbors). The choice for

k usually depends on the specific application. Several heuristics are applied to select a suitable

k. The major distance metrics are typically Minkowski distance (Lm norm) and Mahalanobis

distance. In contrast to the Minkowski (Euclidean) distance, the Mahalanobis distance addi-

tionally computes the inverse covariance matrix for each class as weight matrix. Therefore, the

computational complexity of the Mahalanobis distance is higher than with Minkowski distances.

Therefore, a Minkowski distance such as the Euclidean distance is preferred for embedded and real-time processing. The k-nearest neighbor classification cannot be divided into separate training and classification phases.

Therefore, a major drawback of this algorithm concerning embedded and real-time constraints

is that it cannot be effectively applied to large data sets. Classifying each sample requires the complete training data set. If the data set is large, many distances have to be calculated, and hence it is generally not feasible to apply k-nearest neighbor to embedded real-time fusion.

Initially we plan to detect gun sounds from a limited set of possible guns. Because our sensor nodes will mainly be deployed outdoors, reverberation disturbances are minimized. In this controlled setting the Minkowski-distance-based classifier can be used effectively.

Another distance-based classifier, the Mahalanobis distance classifier, is more popular for embedded real-time classification. Its advantage over the k-nearest neighbor approach is that it separates the training and classification tasks, so an on-line implementation of the algorithm is feasible and memory requirements are reduced during the training of the statistics.
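A sketch of this separation in MATLAB, with illustrative variable names (Xgun and Xnoise holding the per-class training vectors as rows, and x a row feature vector), is:

% Off-line training: per-class mean and inverse covariance (weight matrix)
mu_gun     = mean(Xgun, 1);
invC_gun   = inv(cov(Xgun));
mu_noise   = mean(Xnoise, 1);
invC_noise = inv(cov(Xnoise));

% On-line classification: squared Mahalanobis distance to each class
dGun   = (x - mu_gun)   * invC_gun   * (x - mu_gun)';
dNoise = (x - mu_noise) * invC_noise * (x - mu_noise)';
isGunshot = dGun < dNoise;           % assign to the closer class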

3.1.2 Naive Bayesian Classifier

Bayesian classifiers are statistical classifiers. They work by predicting class membership probabilities. Naive Bayes (NB) probabilistic classifiers are based on Bayes' theorem with a strong (naive) independence assumption between features. The basic idea of NB approaches is to use the joint probabilities of the feature set of an audio signal of some category; the audio signal category is then estimated from the feature probability distribution.

The naive part of NB methods is the assumption of feature independence, i.e. the conditional probability of a feature given an audio signal class is assumed to be independent of the conditional probabilities of the other features given that same class. Due to this assumption, the computation of NB classifiers is far more efficient than the exponential complexity of non-naive Bayes approaches, as it does not use feature combinations as predictors. The technique is particularly attractive for embedded implementation because of computing power constraints.

With independent feature values $x_j$ of the feature vector $\mathbf{x}$, the conditional probability of $\mathbf{x}$ given a class $c_i$ is the result of multiplying the probabilities of each feature $x_j$ given the class $c_i$; this is the product of the likelihood functions of class $i$. The joint conditional probability of $\mathbf{x}$ given class $c_i$ is:


$$p(\mathbf{x}|c_i) = \prod_j p(x_j|c_i)$$

To classify the feature vector $\mathbf{x}$, the posterior class probabilities are computed as

$$p(c_i|\mathbf{x}) \propto p(c_i) \prod_j p(x_j|c_i)$$

Finally, the vector $\mathbf{x}$ is classified using the maximum-a-posteriori (MAP) estimate of the class label:

$$\hat{c} = D(\mathbf{x}) = \arg\max_i p(c_i|\mathbf{x}) = \arg\max_i p(c_i) \prod_j p(x_j|c_i)$$

The computational complexity is low as it is just a multiplication or summation of density

function values (summation in case the logarithms of the probabilities are considered).
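A minimal MATLAB sketch of the corresponding Gaussian naive Bayes decision is given below; the per-class means mu, variances sigma2 and priors prior are assumed to have been estimated beforehand from the training set, and the variable names are illustrative only.

% mu, sigma2 : C x n matrices of per-class feature means and variances
% prior      : C x 1 vector of class priors;  x : 1 x n feature vector
C = size(mu, 1);
logPost = zeros(C, 1);
for i = 1:C
    % log of the Gaussian likelihood of each feature, summed over features
    logLik = -0.5 * log(2 * pi * sigma2(i, :)) ...
             - (x - mu(i, :)) .^ 2 ./ (2 * sigma2(i, :));
    logPost(i) = log(prior(i)) + sum(logLik);
end
[~, classEstimate] = max(logPost);   % MAP class label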

3.1.3 Decision Tree

A decision tree (DT) is a flowchart-like tree structure, where each internal node denotes a test

on a feature, each branch represents an outcome of the feature bounds test, and each leaf holds

a class label. The topmost node in a tree is the root node. During tree construction, attribute

selection measures are used to select the attribute that best partitions the tuples into distinct

classes. When decision trees are built, many of the branches may reflect noise or outliers in the

training data. Tree pruning (cutting) attempts to identify and remove these branches so that

classification accuracy on unseen data is improved.

After the decision tree has been built from training data, a given feature vector is classified by moving from the root node to a leaf, evaluating the boolean test at each internal node. This approach is therefore simple to implement on embedded hardware.
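A hand-built threshold tree of the kind described above might look as follows in MATLAB; the features and thresholds are placeholders for illustration, not values trained or measured in this report.

function isGunshot = simpleTree(zcr, ste, peakRatio)
% Illustrative two-level threshold tree for a gunshot / non-gunshot decision.
% The thresholds below are placeholders; real values come from training data.
    if zcr > 700                       % root test: very high ZCR suggests wind/ambient noise
        isGunshot = false;
    elseif ste < 0.05                  % low short-time energy: no loud event present
        isGunshot = false;
    else
        isGunshot = (peakRatio > 1.5); % leaf test on the peaks-to-ZCR ratio
    end
end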

3.1.4 Gaussian Markov Model

For each sound class, the statistical behavior of the features (Probability Density Functions,

pdf) can be modeled with a mixture of Gaussians. This model is characterized by the number

of Gaussians, their relative weights, and their mean / covariance parameters. During a training

process, the system learns the GMM parameters, by analyzing a subset of the sound database.

To find the best model for each class of sounds, the likelihood is maximized using 20 iterations

of the Expectation Maximization (EM) algorithm. In the recognition process, the signal to be

classified is compared to the models of each class to find the most probable one.
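A minimal MATLAB sketch of this train-and-score procedure, assuming the Statistics and Machine Learning Toolbox is available and using an illustrative number of mixture components, is:

% Train one GMM per sound class with 20 EM iterations
opts    = statset('MaxIter', 20);
gmGun   = fitgmdist(Xgun,   3, 'Options', opts);   % 3 components, illustrative
gmNoise = fitgmdist(Xnoise, 3, 'Options', opts);

% Recognition: pick the class whose model gives the higher likelihood for x
if pdf(gmGun, x) > pdf(gmNoise, x)
    label = 'gunshot';
else
    label = 'noise';
end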


3.2 Effect of Audio Features Selection on Classification Problem

Threshold-based gunshot detection, though effective in calm indoor scenarios, has been found to be impractical in noisy environments such as busy roads. We observed that a node falsely identifies gunshots for sounds such as bus horns, engine noise and loud bird song near the sensor node. One approach to resolving such false alarms is to exploit the fact that gunshot noise, by its very nature, is not local: it can be heard up to hundreds of meters away in the daytime and up to several miles on silent nights. Other disturbing noises that cross the noise floor threshold, on the other hand, are inherently local; engine noise, for example, is a highly local sonic activity with an area of influence of a few meters. Therefore, if a particular event of interest is detected at the same time at several nodes distributed at distances of up to a few hundred meters, a gunshot sound can be distinguished from rival noise events. This is spatial classification and does not require conventional classification methods.

Spatial filtering is one of the earliest techniques for classifying gunshots in an urban environment. However, this solution is possible only if the acoustic sensor node density is above the bare minimum needed to ensure that multiple sensors register a gunshot. Another problem concerns the time after an event has been detected: the nodes need to share their sound signatures across the network so that a consensus can be reached that the same source was recorded on the other nodes as well. This poses further problems for sensor nodes with limited power, limited RAM and narrow communication bandwidth. To optimize node power consumption, bandwidth utilization and memory requirements, a number of algorithms have been proposed that maintain a minimum spanning tree of the network at each node to communicate data and audio signatures efficiently. This in turn limits network flexibility, because changing environmental conditions and node movement disturb the optimal routing tables.

To avoid flooding the network with data and still classify gunshots efficiently, it suffices to select the audio features that best represent the gunshot audio class, keeping the number of features as small as possible for good classification. After extensive analysis of about 36 audio features, we have selected the following as the best under the computational and power constraints of embedded sensor nodes.

3.2.1 Temporal Features

Temporal features are the easiest to calculate on embedded platforms; low-end microcontrollers with only summation and simple arithmetic operations can compute them easily. The most important temporal features are the Zero Crossing Rate, the Short Time Energy and the Peaks Count. We also experimented with a new feature consisting of the ratio of the ZCR count to the Peaks Count, which gave better classification results than using the two parameters independently.


3.2.1.1 Zero Crossing Rate

Figure 3.2: Zero Crossing Rate for 5000 sample window as function of time.

The Zero Crossing Rate (ZCR) is one of the most commonly used temporal measures in audio feature collections. It measures how often the signal changes direction across the mean signal value. The scheme is particularly attractive for embedded implementation because it is more feasible than spectrum (FFT) based techniques. The zero crossing rate detector works by counting the number of signal transitions in a time window of fixed size; it essentially provides information about the most dominant frequency present in the signal. ZCR counting starts at the moment a thud or other threshold-crossing sound is captured by the sensor node, and from this point the signal is also recorded for local classification purposes. The signal is low-pass filtered using a weighted average filter and the number of times the output changes its sign is counted. The ZCR count after 500 samples is observed. It is noted that the number of zero crossings is high for outdoor ambient noise near busy roads; the ambient noise signal often crosses the threshold and has been a major cause of false alarms in field tests of the sensor nodes. Similarly, when there is a strong wind, the microphones capture noise with large amplitude as well as a large ZCR count. With zero crossing detectors these ambient noises can easily be rejected at the root node of the decision tree for gunshot classification, since the typical gunshot ZCR mean is much lower than the ambient ZCR mean.

Figure 3.2 shows the ZCR count as a function of time (sample number) for various signals. The top-left signal was recorded in the NWN Lab, simulating a high-wind environment by placing the sensor node near a fan. The next five windows show the ZCR graphs for gunshots in indoor and outdoor environments; the gunshot signal has its maximum signature in the first 1000 samples. The bottom three windows are other stray signals: the bottom-left and bottom-right are bird songs, while the middle one is city traffic near a busy road. The maximum ZCR value remains below 100 for all three of these signals. It can be observed from the gunshot graphs that the ZCR lies in the middle range, from 150 to 600, while the gunshot sound is active.

3.2.1.2 Peaks Count

Figure 3.3: Peaks Count in a 5000-sample frame of the audio test pool. On the embedded platform it is implemented in real time by updating a counter whenever an audio sample's amplitude is higher than both the previous sample and the next sample.

3.2.1.3 Ratio of Peaks-Count to Zero-Crossings-Rate

Figure 3.4: Ratio of Peaks-Counts to Zero-Crossings-Rate per 5000 samples of audio

3.2.1.4 Short Time Signal Energy

The results for Short Time Signal Energy are shown in figure 3.5.


Figure 3.5: Short Time Signal Energy (STE) as function of time
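As a minimal illustration, the four temporal features above can be computed for a single analysis window as follows. The 5000-sample window length follows the figures; the variable names and the omission of the low-pass pre-filtering are simplifications for the sketch.

% x : one analysis window of audio samples (column vector), e.g. 5000 samples
zcr       = sum(x(1:end-1) .* x(2:end) < 0);                       % sign changes (zero crossings)
peaks     = sum(x(2:end-1) > x(1:end-2) & x(2:end-1) > x(3:end));  % samples above both neighbours
peakRatio = peaks / max(zcr, 1);                                   % peaks-count to ZCR ratio
ste       = sum(x .^ 2) / length(x);                               % short-time signal energy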

3.2.2 Comparison of Acoustic Features

To compare the effectiveness of the acoustic features for classifying gunshot versus all other signals, we used 11 gunshot sounds recorded from weapons of different calibers in a multitude of environments. For the noise class we used eight media files containing a large number of possible sounds that can interfere with gunshots, including thunderstorms, clapping, white noise, night-time sounds of crickets and other insects, and various other kinds of loud noise. Each parameter was observed over a window of 50 milliseconds, sliding by 50 milliseconds. The histogram of all the windows was then computed, and the standard deviation of the acoustic parameter is plotted.
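A condensed MATLAB sketch of this windowing-and-statistics procedure is given below (featureFun is an illustrative function handle standing for any of the features above); a fuller implementation is listed in Appendix A.3.

% x, fs : audio samples and sample rate; featureFun : handle, e.g. @(w) sum(w.^2)
winLen = round(0.050 * fs);                          % 50 ms window, 50 ms step
nWin   = floor(length(x) / winLen);
values = zeros(nWin, 1);
for w = 1:nWin
    frame     = x((w-1)*winLen + 1 : w*winLen);
    values(w) = featureFun(frame);                   % one feature value per window
end
featureStd = std(values);                            % quantity plotted in Figure 3.6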

Figure 3.6: Acoustic feature comparison for gunshot and ambient noise signals. The standard deviation of the parameter among windows of 50 ms duration is taken as the quantity of interest.


Layer 2 Consensus Development Algorithms

4.1 Purpose

Object detection and object tracking become possible when an initial event is generated by the DFU. This event is taken as the initial information for performing fusion for visual detection: the corresponding frame has to be processed and fusion performed for the initial detection. Accurate detection of the required object is only possible when the computation is kept minimal and a suitable fusion mechanism has been designed; a suitable algorithmic model therefore has to be implemented, and consensus has to be developed, for the detections made by the system. Object tracking is one of the desired processes executed by the layer-2 modality. Frame-by-frame object tracking is possible with the implemented algorithmic model, but complexity arises when accuracy is required in continuous tracking. Various techniques have therefore been introduced to build the consensus that enables object tracking in a controlled way.

4.2 Object Detection

4.2.1 Histogram of Oriented Gradients

Human detection in a video frame is done using a classification methodology. The Open Computer Vision (OpenCV) library has been used for this classification. The method used in [3] and [1] is the HOG descriptor with an SVM-based model, integrated with multiple detection libraries.

Figure 4.1: Human Detection Using HOG (input image, gradient computation and normalization, weighted voting, contrast normalization, HOG collection over the detection window, linear SVM, person / no-person)

Figure 4.1 shows the approach adopted for finding and classifying humans in a given video frame. The detection method is based on normalized histograms of image gradients: the object is characterized by the gradient directions computed over small groups of pixels, or cells, and the normalized block descriptor is referred to as the Histogram of Oriented Gradients. Human detection is performed by tiling the detection window and feeding the combined feature vector into a conventional SVM.

For classification a set of images has been used as positive training examples, including variations with right and left reflections. In the same way, a set of negative images has been used to train the classifier on negative samples. The method is then applied iteratively until it converges to a final detector.
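The report's own implementation uses the OpenCV HOG person detector (see Appendix A.5). For reference, an equivalent check in MATLAB, assuming the Computer Vision Toolbox pre-trained HOG+SVM people detector and a hypothetical frame file, could be sketched as:

% Requires the Computer Vision Toolbox; the detector is a pre-trained HOG + SVM model
detector = vision.PeopleDetector;                    % default upright-person model
frame    = imread('frame.png');                      % hypothetical captured video frame
[bboxes, scores] = step(detector, frame);            % person / no-person detection windows
annotated = insertObjectAnnotation(frame, 'rectangle', bboxes, scores);
imshow(annotated);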

4.2.2 Deformable Parts Model

Deformable part models [2] provide an elegant framework for object detection and recognition and are considered state-of-the-art among efficient algorithms for matching models to images. A deformable part model is a discriminatively trained, multiscale model that aims to make effective use of latent information such as hierarchical (grammar) models and models involving latent three-dimensional pose. The model includes both a coarse global template covering the entire object and higher-resolution part templates. The templates represent the histogram-of-oriented-gradients features discussed above.

Figure 4.2: Deformable Part Model

Figure 4.3 illustrates the placement of such a model in a HOG pyramid. The root filter location defines the detection window (the pixels

inside the cells covered by the filter). The part filters are placed several levels down in the

pyramid, so the HOG cells at that level have half the size of cells in the root filter level. The

score of a placement is given by the scores of each filter (the data term) plus a score of the

placement of each part relative to the root (the spatial term),

$$\sum_{i=0}^{n} F_i \cdot \phi(H, p_i) + \sum_{i=1}^{n} a_i \cdot (\tilde{x}_i, \tilde{y}_i) + b_i \cdot (\tilde{x}_i^2, \tilde{y}_i^2) \qquad (4.1)$$

where $F_i$ is the $w \times h \times 9 \times 4$ weight vector, $\phi(H, p_i)$ are the features in a $w \times h$ subwindow of a HOG pyramid, and $(\tilde{x}_i, \tilde{y}_i) = ((x_i, y_i) - 2(x, y) + v_i)/s_i$ gives the location of the $i$th part relative to the root location. $a_i$ and $b_i$ are two-dimensional vectors of coefficients measuring a score for each possible placement of the $i$th part.

Figure 4.3: Pyramids of Deformable Part Model

4.3 Object Tracking

4.3.1 Kalman Filter

During the prediction phase, we use what we know to figure out where we expect the system to be before we attempt to integrate a new measurement. In practice, the prediction phase is done immediately after a new measurement is made, but before the new measurement is incorporated into our estimation of the state of the system. An example of this might be when we measure the position of a car at time t, then again at time t + dt. If the car has some velocity v, then we do not just incorporate the second measurement directly. We first fast-forward our model based on what we knew at time t so that we have a model not only of the system at time t but also of the system at time t + dt, the instant before the new information is incorporated. In this way, the new information, acquired at time t + dt, is fused not with the old model of the system, but with the old model of the system projected forward to time t + dt. This is the meaning of the Kalman filter's predict/update cycle. In the context of Kalman filters, there are three kinds of motion that we would like to consider. The first is dynamical motion. This is motion that we expect as a direct result of the state of the system when we last measured it. If we measured the system to be at position x with some velocity v at time t, then at time t + dt we would expect the system to be located at position x + v dt, possibly still with velocity v. The second form of motion is called control motion. Control motion is motion that we expect because of some external influence applied to the system of which, for whatever reason, we happen to be aware. As the name implies, the most common example of control motion is when we are estimating the state of a system that we ourselves have some control over, and we know what we did to bring about the motion. This is particularly the case for robotic systems where the control is


the system telling the robot to (for example) accelerate or go forward. Clearly, in this case, if the robot was at x and moving with velocity v at time t, then at time t + dt we expect it to have moved not only to x + v dt (as it would have done without the control), but also a little farther, since we did tell it to accelerate. The final important class of motion is random motion. Even in our simple one-dimensional example, if whatever we were looking at had a possibility of moving on its own for whatever reason, we would want to include random motion in our prediction step. The effect of such random motion is simply to increase the variance of our state estimate with the passage of time. Random motion includes any motions that are not known or under our control. As with everything else in the Kalman filter framework, however, there is an assumption that this random motion is either Gaussian (i.e., a kind of random walk) or that it can at least be modeled effectively as Gaussian. Thus, to include dynamics in our simulation model, we would first do an update step before including a new measurement. This update step would include first applying any knowledge we have about the motion of the object according to its prior state, applying any additional information resulting from actions that we ourselves have taken or that we know to have been taken on the system by another outside agent, and, finally, incorporating our notion of random events that might have changed the state of the system since we last measured it. Once those factors have been applied, we can then incorporate our next new measurement. In practice, the dynamical motion is particularly important when the state of the system is more complex than our simulation model. Often when an object is moving, there are multiple components to the state, such as the position as well as the velocity. In this case, the state evolves according to the velocity that we believe it to have. Handling systems with multiple state components requires a slightly more sophisticated notation. Consider the realistic situation of taking measurements of a car driving in a parking lot. The state of the car can be summarized by two position variables, x and y, and two velocities, vx and vy. These four variables are the elements of the state vector xk. This suggests that the correct form for F is:

$$x_k = \begin{bmatrix} x \\ y \\ v_x \\ v_y \end{bmatrix}, \qquad F = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (4.2)$$

However, when using a camera to make measurements of the car's state, we probably measure only the position variables:

$$z_k = \begin{bmatrix} z_x \\ z_y \end{bmatrix}_k \qquad (4.3)$$

This implies that the structure of H is:

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \qquad (4.4)$$


In this case, we might not really believe that the velocity of the car is constant, and so we would assign a value of $Q_k$ to reflect this. We would choose $R_k$ based on our estimate of how accurately we have measured the car's position using (for example) our image analysis techniques on a video stream. All that remains is to plug these expressions into the generalized forms of the update equations. The basic idea is the same, however. First we compute the a priori estimate $x_k^-$ of the state. It is relatively common (though not universal) in the literature to use the superscript minus sign to mean the time immediately prior to the new measurement; we adopt that convention here as well. This a priori estimate is given by:

$$x_k^- = F x_{k-1} + B u_{k-1} + w_k \qquad (4.5)$$

Using $P_k^-$ to denote the error covariance, the a priori estimate of this covariance at time $k$ is obtained from the value at time $k-1$ by:

$$P_k^- = F P_{k-1} F^T + Q_{k-1} \qquad (4.6)$$

This equation forms the basis of the predictive part of the estimator, and it tells us what we expect based on what we have already seen. From here we state (without derivation) what is often called the Kalman gain or blending factor, which tells us how to weight new information against what we think we already know:

$$K_k = P_k^- H_k^T \left( H_k P_k^- H_k^T + R_k \right)^{-1} \qquad (4.7)$$
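A minimal MATLAB sketch of one predict/update cycle with the constant-velocity model above is given below; the time step and the noise covariances Q and R are placeholders, and x, P and z denote the current state estimate, its covariance and the new position measurement.

dt = 1/25;                                     % frame period, placeholder
F  = [1 0 dt 0; 0 1 0 dt; 0 0 1 0; 0 0 0 1];   % state transition (Eq. 4.2)
H  = [1 0 0 0; 0 1 0 0];                       % measure position only (Eq. 4.4)
Q  = 0.01 * eye(4);                            % process noise covariance, placeholder
R  = 1.0  * eye(2);                            % measurement noise covariance, placeholder

% Predict: a priori state and covariance (Eqs. 4.5 and 4.6, no control input)
x_prior = F * x;
P_prior = F * P * F' + Q;

% Update: Kalman gain (Eq. 4.7), then blend in the new measurement z
K = P_prior * H' / (H * P_prior * H' + R);
x = x_prior + K * (z - H * x_prior);
P = (eye(4) - K * H) * P_prior;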

The figure below shows the result of different object tracking algorithms carried out in a university environment. The object has been marked with two colored boxes: the red box is the Kalman filter tracking result, while the blue one is the result of optical flow tracking.

Figure 4.4: Object Tracking results


Optimal Data Fusion (ODF) Unit Performance Evaluation

Real-time object detection and tracking is challenging because of the computational and timeliness demands it places on our distributed model. The ODF is responsible for handling object detection and tracking in the visual domain. The initial data fusion for object detection is done at the ODF entity, based on the acoustic event generated by DFU Level-1; post-event detection is then performed in the video frame using the processing entity chosen for the ODF. Information fusion is then performed for object tracking, classification and detection using the limited resources of the processing entity involved in the data fusion. The question is whether accurate object detection and continuous object tracking at a fixed resolution can be performed while meeting the real-time requirement and achieving a suitable frame rate.

5.1 Purpose

In order to finalize the processing entity for the Optimal Data Fusion (ODF) unit, a performance evaluation was carried out using multiple testing procedures designed to measure the quality and computational power provided by each candidate processing entity at the ODF. The measurements were performed for the platforms described in table 5.1, and a set of experiments was run to analyze the real-time behavior and computational cost of each candidate ODF entity.

Table 5.1: Candidate processing platforms evaluated for the ODF entity

                   Test Case 1            Test Case 2          Test Case 3
Processing Board   Beagle Board xM        Beagle Bone Black    Odroid U3
RAM                256 MB                 256 MB               2 GB
Processor          1 GHz ARM Cortex A8    1 GHz Sitara         1.7 GHz Quad-Core
Cost               $250                   $70                  $80

5.2 Test Case 1

The evaluation was done on an embedded Linux based platform with the specifications given in table 5.1. The output generated by the DFU was provided as input to the ODF over a USB interface. FOV mapping was performed with the help of the initial information taken from the DFU, which was then used to trigger an initial input into the video frame captured at the ODF.

This test was performed on the Beagle Board xM with the specifications described in table 5.1. The camera was interfaced over USB and the input video resolution was set to 320 x 240.

Figure 5.1: Beagle Board xM

5.2.1 Object Detection

• The RAM utilization was 73% of the total available 235 MB.

• The CPU utilization was observed at 75% of the total available.

• The total delay in object detection for the FOV was approximately 4 seconds or more.

5.2.2 Object Tracking

• The RAM utilization was 70% of the total available 235 MB.

• The CPU utilization was observed at 100% of the total available.

• The achievable frame rate was 0.25 to 1 frames per second.

5.3 Test Case 2

This test was performed on the Beagle Bone Black with the specifications described in table 5.1. The camera was interfaced over USB and the input video resolution was set to 320 x 240.


Figure 5.2: Beagle Bone Black

5.3.1 Object Detection

• The RAM utilization was 69% of the total available 235 MB.

• The CPU utilization was observed at 75% of the total available.

• The total delay in object detection for the FOV was approximately 4 seconds.

5.3.2 Object Tracking

• The RAM utilization was 80% of the total available 235 MB.

• The CPU utilization was observed at 100% of the total available.

• The achievable frame rate was 1 frame per second.

5.4 Test Case 3

This test was performed on the Odroid-U3 with the specifications described in table 5.1. The camera was interfaced over USB and the input video resolution was set to 640 x 480.

Figure 5.3: Odroid-U3

5.4.1 Object Detection

• The RAM utilization was 30% of the total available 1.7 GB.

• The CPU utilization was observed at 35% of the total available on a dedicated single core.

• The total delay in object detection was approximately 1 second.

5.4.2 Object Tracking

• The RAM utilization was 40% of the total available 1.7 GB.

• The CPU utilization varied between 75% and 79% of the total available.

• The achievable frame rate was 6 frames per second.

• Tracking was performed at a video resolution of 640 x 480.

5.5 Conclusions

The test cases were performed to finalize a processing module that can perform object detection with minimum delay and object tracking with the highest possible frame rate and quality. The results show the performance and resource utilization of each processing module. Object detection and tracking with the lowest delay, the highest quality and the best frame rate was achieved using the low-cost Odroid-U3 board. The decision was therefore taken to replace the initially proposed board (Beagle Board xM) with the Odroid-U3 to meet the objective optimally.


Appendix A

A.1 Peak Counts

function main
clear all;
path = 'J:\abubakar\Project Multinodal Hammad\wav gunshotsounds\'; % Directory path here

gun1 = 'gunshot handgun firing range.wav';
gun2 = 'gunshot rifle exterior 004.wav';
gun3 = 'gunshot rifle exterior 006.wav';
gun4 = 'gunshot x3 handgun on firing range.wav';
gun5 = 'handgun 22 caliber single shot interior shooting range pistol.wav';
gun6 = 'handgun 40 caliber single shot distant 50 feet away springfield xdm40.wav';
bird1 = 'bird chirping.wav';
bird2 = 'bird chirping2.wav';
city1 = 'city or town street ambience pedestrians walking with some traffic noise in background.wav';
lab1  = 'HelloHelloHelloOK STM.wav';

a = [1 -3 2 0 1 -7 -6 6 3]';

[T1 Fs1] = audioread([path lab1]); x = getPeaks2(T1);
subplot(331); plot(x);
title('NWN Lab Hello Hello.wav');

[T1 Fs1] = audioread([path gun2]); T1 = T1(:,2);
x = getPeaks2(T1);
subplot(332); plot(x);
title('gunshot rifle exterior 004.wav');

[T1 Fs1] = audioread([path gun3]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(333); plot(x);
title('gunshot rifle exterior 006.wav');

[T1 Fs1] = audioread([path gun4]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(334); plot(x);
title('gunshot x3 handgun on firing range.wav');

[T1 Fs1] = audioread([path gun5]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(335); plot(x);
title('handgun 22 caliber single shot interior.wav');

[T1 Fs1] = audioread([path gun6]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(336); plot(x);
title('handgun 40 caliber single shot 50 feet.wav');

[T1 Fs1] = audioread([path bird2]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(337); plot(x);
title('bird chirping2.wav');

[T1 Fs1] = audioread([path city1]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(338); plot(x);
title('city pedestrians walking with traffic noise.wav');

[T1 Fs1] = audioread([path bird1]); T1 = T1(:,2); x = getPeaks2(T1);
subplot(339); plot(x);
title('bird chirping.wav');

end

function y = getPeaks2(T1)
% Sliding-window peak count: a sample counts as a peak if it exceeds both neighbours.
WindLen = 5000;
lt1 = length(T1);
np = [];
for i = 1:lt1 - WindLen - 1
    a  = T1(i:i+WindLen);
    a1 = [0; 0; a];
    a2 = [0; a; 0];
    a3 = [a; 0; 0];
    numPeaks = sum(a2 > a1 & a2 > a3);
    np = [np; numPeaks];
end
y = np;
end

function y = getPeaks(T1)
% Alternative peak count per window using the built-in findpeaks function.
WindLen = 5000;
lt1 = length(T1);
zcrA = [];
for i = 1:lt1 - WindLen - 1
    NumZeroCross = numel(findpeaks(T1(i:i+WindLen)));
    zcrA = [zcrA; NumZeroCross];
end
y = zcrA;
end

A.2 Feature Extraction Functions

function y = peak2zcrRatio(T1, WindLen)
% Ratio feature: windowed zero-crossing count divided by windowed peak count.
x1 = getPeaks2(T1, WindLen);
x2 = getZcr2(T1, WindLen);
a = length(x1);
b = length(x2);
if (a < b)
    y = x1 ./ x2(1:a);
else
    y = x1(1:b) ./ x2;
end
y = 1 ./ y;
end

function y = getPeaks2(T1, WindLen)
% Peak count per sliding window: a sample higher than both of its neighbours.
lt1 = length(T1);
np = [];
for i = 1:lt1 - WindLen - 1
    a  = T1(i:i+WindLen);
    a1 = [0; 0; a];
    a2 = [0; a; 0];
    a3 = [a; 0; 0];
    numPeaks = sum(a2 > a1 & a2 > a3);
    np = [np; numPeaks];
end
y = np;
end

function y = getZcr2(T1, WindLen)
% Zero-crossing count per sliding window.
lt1 = length(T1);
zcrA = [];
for i = 1:lt1 - WindLen - 1
    NumZeroCross = vectorZcr(T1(i:i+WindLen));
    zcrA = [zcrA; NumZeroCross];
end
y = zcrA;
end

function y = vectorZcr(T1)
% Count sign changes: adjacent samples with a negative product cross zero.
% T1 = T1 / max(T1);
% minPeak = 0.01;                 % Minimum signal level threshold
%                                 % to consider for zero crossings
% T1 = T1 .* (abs(T1) > minPeak);
a = [0; T1];
b = [T1; 0];
c = a .* b;
y = sum(c < 0);
end


A.3 Compute Feature Statistics

function FF = computeAllStatistics(fileName, win, step)

% This function computes the average and std values for the following audio
% features:
%  - energy entropy
%  - short time energy
%  - spectral rolloff
%  - spectral centroid
%  - spectral flux
%
% ARGUMENTS:
%  fileName: the name of the .wav file in which the signal is stored
%  win:      the processing window (in seconds)
%  step:     the processing step (in seconds)
%
% RETURN VALUE:
%  F: a 12x1 array containing the 12 feature statistics
%

[x, fs] = wavread(fileName);

EE = Energy_Entropy_Block(x, win*fs, step*fs, 10);
E  = ShortTimeEnergy(x, win*fs, step*fs);
Z  = zcr(x, win*fs, step*fs, fs);
R  = SpectralRollOff(x, win*fs, step*fs, 0.80, fs);
C  = SpectralCentroid(x, win*fs, step*fs, fs);
F  = SpectralFlux(x, win*fs, step*fs, fs);

FF(1) = statistic(EE, 1, length(EE), 'std');
FF(2) = statistic(Z,  1, length(Z),  'stdbymean');
FF(3) = statistic(R,  1, length(R),  'std');
FF(4) = statistic(C,  1, length(C),  'std');
FF(5) = statistic(F,  1, length(F),  'std');
FF(6) = statistic(E,  1, length(E),  'stdbymean');

% X = 1:length(EE);
% plot(X, EE/max(EE), 'r', X, E/max(E), 'b', X, Z/max(Z), X, R/max(R), X, C/max(C), X, F/max(F))
% plot(X, EE, 'r', X, E, 'b', X, Z, X, R, X, C, X, F)
% legend('Energy Entropy Block', 'ShortTimeEnergy', 'zcr', 'SpectralRollOff', 'SpectralCentroid', 'SpectralFlux')
%

A.4 Compute Features of All Classes

function main

classNames = ({'J:\AbuBakar\Project Multinodal Hammad\wav gunshotsounds\audioFeatureExtraction\GunShot\', 'J:\AbuBakar\Project Multinodal Hammad\wav gunshotsounds\audioFeatureExtraction\Noise\'});

% function Features = computeFeaturesDirectory(classNames)
%
% This function computes the audio features (6-D vector) for each .wav
% file in all directories (given in classNames)
% classNames = ({'GunShot', 'Noise'});

close all
clc;
fprintf('--------------------------------------------------------------------------\n')
fprintf('Real Time Microphone and Camera acquisition and audio-video processing.\n\n');
fprintf('Theodoros Giannakopoulos\n');
fprintf('http://www.di.uoa.gr/~tyiannak\n');
fprintf('Dep. of Informatics and Telecommunications,\n');
fprintf('University of Athens, Greece\n');
fprintf('--------------------------------------------------------------------------\n')

Dim = 6;

win = 0.050; step = 0.050;
win = 0.20;  step = 0.050;
% win = 0.20; step = 0.020;

FeaturesNames = {'Std Energy Entropy', 'Std/mean ZCR', 'Std Rolloff', 'Std Spectral Centroid', 'Std Spectral Flux', 'Std/mean Energy'};

%
% STEP A: Feature Calculation:
%

for (c = 1:length(classNames))      % for each class (and for respective directory):
    fprintf('Computing features for class %s...\n', classNames{c});
    D = dir([classNames{c} '//*.wav'])

    tempF = zeros(length(D), Dim);
    for (i = 1:length(D))           % for each .wav file in the current directory:
        % compute statistics (6-D array)
        F = computeAllStatistics([classNames{c} '//' D(i).name], win, step);
        [classNames{c} '//' D(i).name]
        % store statistics in the current row:
        tempF(i, :) = F';
    end
    % keep a different cell element for each feature matrix:
    Features{c} = tempF
end
Features{c}

%
% STEP B:
% calculate and plot histograms:
%

Colors = [0 0 0;
          0 0 1;
          0 1 0;
          0 1 1;
          1 0 0;
          1 0 1;
          1 1 0;
          0 0.25 1;
          0.25 0 1;
          0 1 0.25;
          0.25 1 0];

figure;
for (f = 1:Dim)
    subplot(3, 2, f);
    hold on;
    for (c = 1:length(classNames))
        tempF = Features{c}(:, f);
        [H, X] = hist(tempF, length(tempF));
        p = plot(X, H, '.-');
        set(p, 'Color', Colors(c, :));

        % get the 'others':
        tempFOthers = [];
        for (cc = 1:length(classNames))
            if (cc ~= c)
                tempFOthers = [tempFOthers; Features{cc}(:, f)];
            end
        end
        [E1, E2] = computeHistError(tempF, tempFOthers);
        Errors(f, c) = 100 * (E1 + E2) / 2;
        hM(c) = max(H);
    end
    [EMin, MMin] = min(Errors(f, :));
    [EMax, MMax] = max(Errors(f, :));
    EMean = mean(Errors(f, :));
    str = ['legend(' '''' 'GunShot' ''''];
    for (c = 2:length(classNames))
        str = [str ',' '''' 'All Other Noise' ''''];
    end
    str = [str ');'];
    eval(str);
    text(0, max(hM)*0.80, FeaturesNames{f});
end

A.5 Tracking and Kalman Filter

#include <opencv2/opencv.hpp>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

using namespace cv;
using namespace std;

int main(int argc, char *argv[]) {

    VideoCapture capture;
    capture.open(0);
    namedWindow("Output", CV_WINDOW_AUTOSIZE);
    Mat frame;
    HOGDescriptor hog;
    hog.setSVMDetector(HOGDescriptor::getDefaultPeopleDetector());

    size_t i, j;

    if (!capture.isOpened())   // Init camera
    {
        cout << "capture device failed to open!" << endl;
        return 1;
    }

    capture.set(CV_CAP_PROP_FRAME_WIDTH, 320);
    capture.set(CV_CAP_PROP_FRAME_HEIGHT, 240);

    while (1)
    {
        capture >> frame;

        vector<Rect> found, found_filtered;
        hog.detectMultiScale(frame, found, 0, Size(8, 8), Size(32, 32), 1.05, 2);

        // Keep only detections that are not fully contained inside another detection
        for (i = 0; i < found.size(); i++)
        {
            Rect r = found[i];
            for (j = 0; j < found.size(); j++)
                if (j != i && (r & found[j]) == r)
                    break;
            if (j == found.size())
                found_filtered.push_back(r);
        }
        // Shrink the boxes slightly and draw them on the frame
        for (i = 0; i < found_filtered.size(); i++)
        {
            Rect r = found_filtered[i];
            r.x += cvRound(r.width * 0.1);
            r.width = cvRound(r.width * 0.8);
            r.y += cvRound(r.height * 0.06);
            r.height = cvRound(r.height * 0.9);
            rectangle(frame, r.tl(), r.br(), cv::Scalar(0, 255, 0), 2);
        }
        imshow("Output", frame);
        if (cvWaitKey(10) == 'q')
            return 0;
    }

}


Bibliography

[1] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.

[2] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[3] SD Lin, Y Liu, and Y Jhu. A robust image descriptor for human detection based on HOG and Weber's law. International Journal of Innovative Computing, Information and Control, 9(10):3887–3901, 2013.
