
Chapter

Multi-Person Tracking Based on Faster R-CNN and Deep Appearance Features

Gulraiz Khan, Zeeshan Tariq and Muhammad Usman Ghani Khan

Abstract

Most computer vision problems related to crowd analytics depend heavily on multi-object tracking (MOT) systems. Designing an MOT system involves two major steps: object detection and association. In the first step, desired objects are detected in every frame of the video stream; detection quality directly influences tracking performance. The second step establishes correspondence between objects detected in the current frame and those in previous frames to obtain their trajectories. Higher detection accuracy yields fewer missed detections and ultimately less fragmented tracks, while better object association increases the affinity between objects across frames. This paper presents a novel algorithm for improved object detection followed by enhanced object tracking. Detection accuracy is increased by employing the deep learning-based Faster region convolutional neural network (Faster R-CNN) algorithm, and object association is carried out using appearance and improved motion features. Evaluation results show that we enhance the performance of the current state of the art by reducing identity switches and fragmentation.

Keywords: face-based tracking, target tracking, object detection, tracking

1. Introduction

We witness the truth of the Greek philosopher Aristotle's saying, "Man is by nature a social animal," in our daily life, as thousands of humans walk on roads, in terminals, shopping malls, and other public places every day. They intentionally or unintentionally keep interacting with each other. They also decide where to go and how to reach their destinations, so their movement is not always straightforward; it changes with external environmental factors. The study and analysis of human dynamics play an important role in public security, public space management, architecture, and design. These tasks depend heavily on proper multi-person tracking (MPT) and trajectory extraction procedures, which motivated us to contribute to the development of a system that performs these tasks with real-time speed and high accuracy.

Before moving further, we should define multi-object tracking (MOT). Multiple object tracking is the process of localizing multiple moving objects over time using a camera as the input or capturing device. A unique identity is assigned to every detected object, and this identity remains specific to that object for a certain time period. Based on these identities, we draw the motion trajectories of the tracked objects and can also analyze their behavior. Figure 1 depicts the steps involved in multi-object tracking. Note that multi-object tracking is different from multi-object detection: multi-object detection is a sub-part of multi-object tracking. In short, object detection is the process of locating objects of interest in a single frame, while MOT is concerned with detecting multiple objects of interest across a series of frames and relating each object detected in the next frame to the same object detected in the previous frame.

MOT has a variety of uses, including human-computer interaction, surveillance and security, video communication and compression, augmented reality, traffic control, medical imaging, and video editing. Apart from these uses, there are specific reasons why tracking is useful.

• First, if there are many objects detected in a video frame, tracking makes it possible to establish the identity of particular objects across all the frames.

• Second, if object detection fails to detect an object, it may still be possible to track it, because the tracking system extracts and stores the location and appearance features of detected objects from previous frames.

• Third, tracking methods can be very efficient because they perform a local search in place of a global one, so we can achieve good performance in terms of high frame rate for our proposed hybrid system. The proposed system performs object detection every nth frame and tracks the target objects in intermediate frames based on their position in the frame and their appearance features, as sketched below.
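A minimal Python sketch of this hybrid loop is shown below. It assumes hypothetical `detector` and `tracker` objects standing in for the Faster R-CNN detector and the appearance/motion tracker described later in this chapter; only the every-nth-frame scheduling is the point here.

```python
import cv2  # OpenCV, used only for reading the video stream

DETECT_EVERY_N = 5  # run the expensive detector on every nth frame

def run_hybrid_tracker(video_path, detector, tracker):
    """Detect every nth frame; track through the intermediate frames.

    `detector` and `tracker` are hypothetical placeholders, not part of
    this chapter's published code.
    """
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % DETECT_EVERY_N == 0:
            # Full detection pass: fresh bounding boxes for all persons.
            tracker.update(frame, detector(frame))
        else:
            # Intermediate frames: propagate tracks using position and
            # appearance features only (a local search, no detector call).
            tracker.predict(frame)
        frame_idx += 1
    cap.release()
```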

The proposed work has many applications in different fields, such as surveillance, entertainment, gaming, and autonomous vehicles, as shown in Figure 2. The system also has applications in crowded scenes, enabling the analysis of an individual moving against the group movement. Visual surveillance usually requires the detailed activity of each individual separately, and detecting activity for each person separately demands that people be tracked. Precise information about a person can be obtained from the previous trajectories of each subject. The analysis of the trajectories drawn for an area can help determine whether a person has been in a forbidden area and what type of activity they were performing (running, walking, or fighting). Combining the tracking information of two or more characters can precisely describe the interaction between them.

Figure 1. Flow of steps involved in multi-object tracking.


Multi-human tracking is also useful for multiplayer interactive games: two humans playing a fighting game can easily be tracked using MOT. Similarly, in the autonomous vehicle industry, self-driving cars can employ a tracking system to follow a specific vehicle and take decisions based on the tracked vehicle.

We can track an object continuously after its first detection, but a tracking algorithm may sometimes lose track of the objects it is tracking; for instance, when the movement of the target object is too large, the algorithm may not be able to maintain the track. The solution is to use detection and tracking algorithms together.

In recent years, there has been a lot of focus on MOT because of advancements in object detection techniques, which have increased the robustness of tracking algorithms, so state-of-the-art techniques are far better than traditional ones. Most traditional techniques do not perform well in real-time environments. For example, some batch-based movement tracking methodologies [1, 2] require the complete batch for tracking a human, and others are probability-based systems for finding the track of the subject [3–5]. These methods require a complete batch of visual frames for processing and tracking the target object. But in real-time scenarios there is a continuous stream of frames being fed to the system, growing over time, so it is impossible to convert it into a batch and perform tracking in real time.

1.1 Challenges in tracking systems

Recent advancements in object detection have made MOT more realistic. Multiple object tracking requires precise tracking of multiple objects based on apparent identity and relative position. The classic MOT paradigm demands the entire video batch at once and applies global optimization to find the associations, whereas individuals in a live stream need to be tracked based on the history of each individual in the stream. The strong emphasis of this work is on time efficiency and a low number of identity switches.

The first and foremost concern in the development of a tracking system is time efficiency. Due to the limitations of batch-based tracking algorithms, there are certain challenges to implementing these systems in real-time scenarios. A plethora of work has been done on these challenges to make multi-person tracking and trajectory drawing real time and robust, and many efforts aim to move tracking systems from batch based to real time.

Figure 2. Applications of multi-object tracking system.


The second important area of interest is the problem of identity switches. An identity switch occurs when a particular object changes its identity number across frames or within a specific time duration. Identity switches mostly occur due to missed detections of an object between frames; another cause is occlusion between different objects.

The third issue is fragmentation. Fragmentation occurs when no identity switch takes place but the detection of an object is missing in some frames, so a fragmented trajectory is generated, meaning the track breaks wherever the human is not detected.

As discussed, identity switches stem from missing detections, so there are two possible solutions to this problem. The first is to reduce or eliminate missed object detections in frames, which can be done by improving the object detection algorithm. We propose a solution that uses a deep learning-based algorithm to detect a person by detecting their shoulders, head, or complete body.

But the problem of identity switches will still persist due to occlusion. Here comes the second solution, which is to make use of the appearance and localization features of objects. As every object has a different location in the frame and a different appearance relative to the others, we can build a tracking algorithm that tracks multiple objects across a series of frames based on these features. Face recognition can also be used to reduce the number of identity switches. Based on appearance and motion features, we can relate one trajectory fragment of an object to another fragment of the same object and complete the trajectory path. The complete process is shown in Figure 3.

One of the recent available tracking systems that tried to overcome these challenges is simple online and real-time tracking (SORT) [6], a simple framework to track persons in real time. SORT applies Kalman filter features to the input frames, and the Hungarian algorithm is employed to find the association between visual tracks. The system is only applicable to human tracking in scenes of varying appearance, and it still suffers from the identity switch problem.

Another tracking system, simple online and real-time tracking with a deep association metric (Deep SORT) [7], utilized appearance features extracted from a deep convolution neural network (CNN) for tracking individuals. Deep SORT generates a cost matrix based on motion information and appearance features to avoid losing tracks because of occlusion or missed detection of persons. The system includes a convolution neural network, trained on a person reidentification dataset, for a person's appearance features. Deep SORT has a high rate of missed detections for elevated, crowded, and distant views because the detected humans are obtained from pre-trained object detection models.

In this paper, we propose a novel technique for individual human detection and tracking. We provide unique real-time tracking using motion and appearance information. Our system employs a convolution neural network to provide the visual appearance features for tracking each individual. The CNN for visual appearance is trained on a person reidentification dataset [8]. Appearance features allow us to improve the tracking results by making the tracker robust to occlusion and to detection after multiple frames.

Figure 3. Fragmentation removal steps.

1.2 Major contribution

The proposed system improves overall detection and tracking in the MOT problem. Furthermore, we improve human detection by re-training the Faster region convolutional neural network (Faster R-CNN) on human body parts. Given better hardware resources, the proposed system outperforms existing state-of-the-art systems in terms of accuracy and performance. The major contributions of our system are as follows.

• Improved detection by Faster R-CNN trained on a human pedestrian dataset with different views.

• An improved feature set for tracking the subjects, which includes color features, area, mutual distance, and HSV histograms for each region of interest.

• The system behaves better than both SORT and Deep SORT in real-time scenarios for pedestrian tracking.

The rest of the paper is divided into sections as follows. Section 2 provides a detailed background of previous methodologies for MOT systems. The methodology is described in Section 3 along with the complete architecture. Section 4 covers the experimentation, and the last section concludes the proposed system for human tracking in different environments.

2. Background

With the improvement in multi-object detection, the research community has started focusing on tracking every single object in different environments. The complete MOT problem can be considered an association problem in which the basic objective is to associate the detected objects. Tracking is carried out after object detection using some object detector. In this section, we focus on the background of the following systems:

• Object detection algorithms, as humans need to be detected before person tracking

• Face recognition systems for target tracking based on recognized faces

• Tracking algorithms for reviewing already available tracking algorithms

2.1 Object detection in past years

In the early 1990s, object detection was carried out using template matching-based algorithms [9], where a template of the specific object is slid over the input image to find the best possible match. In the late 1990s, the focus shifted toward geometric appearance-based object detection [10, 11], where the emphasis was on height, width, angles, and other geometric properties.


In the 2000s, the object detection paradigm moved to low-level features combined with statistical classifiers, such as local binary patterns (LBP) [12], histograms of oriented gradients [13], the scale-invariant feature transform [14], and covariance features [15]. Feature extraction-based object detection and classification involved training a machine learning classifier on the extracted features.

For many years in the computer vision field, handcrafted traditional features were used for object detection. But with the progress in deep learning after its remarkable performance in the 2012 image classification challenge [16], convolution neural networks are now used for this purpose. After the success of object classification in [16], researchers turned their attention toward object detection and classification. Deep convolution neural networks work exceptionally well for extracting local and global features in terms of edges, texture, and appearance.

In recent years, the research community has moved in the direction of region-based networks for object detection. This type of object detection is used in different applications such as video description [17]. In region-based algorithms, convolution features are extracted over proposed regions, followed by categorization of each region into a specific class.

Following the attractive performance of AlexNet [16], Girshick et al. [18] proposed the idea of object detection using a convolution neural network. They employed selective search to propose the areas where potential objects can be found [19] and called their network the region convolution neural network (R-CNN). The basic flow of R-CNN can be described as follows:

• Regions are proposed for each object in the input image using selective search [19].

• Proposed regions are resized to the same consistent size for classification of each proposal into predefined classes based on the extracted CNN features of the regions.

• A linear SVM classifier replaces the softmax layer for training the system on fixed-length CNN features.

• Finally, a bounding box regressor is utilized for precise localization of each object.

Although R-CNN was a major breakthrough in the field of object detection, it has some significant weaknesses:

• Training is quite slow because R-CNN has several separate stages to train.

• Regions are proposed by selective search, which is itself a slow process.

• Training the separate SVM classifier is expensive, as CNN features are extracted for each individual region, which makes the training of the SVM even more challenging.

• Object detection is slow because CNN features are extracted for each individual proposal of each test image.

To overcome the feature extraction issue for each proposal, Kaiming He et al. [20] proposed spatial pyramid pooling (SPP). The basic idea is that the convolution layers accept input of any size; it is the fully connected layers that force the input to be of fixed size to make matrix multiplication possible. They used an SPP layer after the last convolution layer to obtain fixed-size features to feed into the fully connected layers. Using SPPNet, R-CNN performance improved considerably: SPPNet extracts convolution features on the input image only once for proposals of different sizes. This network improves testing performance, but it does not improve the training of R-CNN. Furthermore, the weights of convolution layers before the SPP layer cannot be changed, which limits the fine-tuning process.

The main contributor of R-CNN, Girshick et al. [21], proposed Fast R-CNN to address some problems of R-CNN and SPPNet. Fast R-CNN employs the idea of sharing convolution computation across the different proposed regions. It adds a region of interest (ROI)-pooling layer after the last convolution layer to generate fixed-size features for individual proposals. The fixed-size features from the ROI-pooling layer are fed to a stack of fully connected layers that then splits into two branch networks: one acts as the object classification network, and the other performs bounding box regression. They claimed that the overall performance of R-CNN is enhanced three times for training and ten times for testing.

Although Fast R-CNN improved the performance of R-CNN notably, it still used selective search as the region proposal method. The region proposal step consumes considerable time and acts as the bottleneck in Fast R-CNN. Modern advancements in object localization using deep neural networks [22] motivated Ren et al. [23] to employ a CNN to replace the slow region proposal process of selective search. They proposed an efficient region proposal network (RPN) for generating object proposals. In Faster R-CNN, the RPN and Fast R-CNN share convolution layers for region proposal and region classification, respectively. Faster R-CNN is a purely convolutional neural network without any handcrafted features that employs a fully convolutional network (FCN) for region proposal. They claimed that Faster R-CNN can run at 5 fps in the testing phase.

Redmon et al. [24] proposed You Only Look Once (YOLO) for object detection. They completely dropped the region proposal step: YOLO splits the complete image into grids and predicts detections on the basis of candidate regions. YOLO divides the complete image into S × S grid cells, where each cell predicts class probabilities C, B bounding box locations, and a confidence for each box. Removing the RPN step enhances the performance of detection; YOLO can detect objects while running in real time at about 45 fps.

2.2 Face recognition over past years

In the current era, biometric identification systems are required more than ever because of increased security requirements around the globe. Researchers have invested a great deal of effort in face recognition technology (FRT). FRT can be broadly divided into traditional handcrafted feature-based identification and deep learning-based identification.

2.2.1 Handcrafted feature-based identification

Eigenface [25] and Fisherface [26] were commonly used approaches for face identification in the last decade. Eigenfaces reduce the feature points, measuring the maximum variation in face features using a minimum set of features; for reducing the features, they use principal component analysis (PCA). A face can be recognized based on the linear structure of the face using Eigenfaces. In contrast, Fisherfaces are a supervised learning-based face identification method based on traditional texture features; they employ linear discriminant analysis to find the uniquely describing data points. Both of these methodologies compare features in terms of Euclidean distance to identify a face.

Researchers have also used LBP for facial recognition [27, 28]. Hadid et al. [27] exploited LBP features for face detection and recognition. Face detection was achieved by training a second-degree support vector machine on the extracted features, and face recognition was achieved using an LBP-based texture descriptor on which the machine was trained.

2.2.2 Deep learning-based identification

Advancements in convolution neural networks have achieved remarkable performance by increasing accuracy and efficiency. A basic assumption in deep neural networks is that feeding as much data as possible yields better results; this requirement for huge amounts of data makes deep learning-based approaches data hungry.

Lu et al. [29] implemented a residual network (ResNet)-based model for face recognition. They divided their complete network into three parts: one backbone called the trunk network and two branch networks that emerge from it. The trunk network is trained once to learn the deep features for face identification and is built from residual blocks. Resolution-specific coupled mapping is employed in the branch networks for training: the input image and the comparison image from the gallery are transformed into the same representation for comparison, and the decision about the identified face is made based on distance.

Schroff et al. [30] developed a deep neural network based on a convolution neural network, named FaceNet, which learns a feature embedding in Euclidean space. They optimized the feature mapping of facial structure using a deep convolution neural network. FaceNet generates a feature vector of 128 dimensions that is optimized using a triplet loss. The triplet loss takes three face images: two from the same person and one from a different individual. The loss function is trained to minimize the distance between faces of the same identity and maximize the distance between different identities. An Inception model with slight modifications is employed in FaceNet for extracting convolution features. They tested their system on the LFW dataset [31].

A research group from Facebook, Taigman et al. [32], developed a state-of-the-art system for face alignment and face recognition named DeepFace. They used a deep convolution neural network with nine convolution layers for extracting facial features. Facial landmarks, estimated using a support vector regressor (SVR), are used for face alignment. Extracted features from the nine-layer network are passed to a softmax layer for classification, with cross-entropy used to reduce the loss on correct labels. They also proposed a huge face recognition dataset, the Social Face Dataset [32], which they used to train the system for face identification.

2.3 Multi-object tracking

Multiple researchers have focused on movement and spatial features for tracking multiple objects [33, 34], while others have focused on appearance features for capturing the associations between different detections [2, 35].


Some traditional methods make predictions on a frame-by-frame basis. These traditional approaches include multiple hypothesis tracking (MHT) [36] and the joint probabilistic data association filter (JPDAF) [37]. Both of these older methodologies require a lot of computation for tracking the detected objects, and their complexity increases exponentially with the number of trackable objects, making them too slow for online applications in complex environments. In JPDAF, a single-state hypothesis is generated based on the relation between individual measurements and association likelihoods. In MHT, a complete set of hypotheses is considered for tracking, followed by post-pruning for tractability.

Rezatofighi et al. [1] made an effort to improve JPDAF performance by approximating JPDA, exploiting recent advances in solving the m-best solutions of an integer program. The main advantage of this system is that it makes JPDA less complex and more tractable; they redefined the computation of individual JPDAF assignments as the solution to a linear program. Another group of researchers, Kim et al. [2], used appearance-based features for tracking targets. They improved MHT by pruning the MHT graph to achieve state-of-the-art performance and employed regularized least squares to increase the efficiency of the MHT methodology.

These two improvements perform quite well compared to the legacy implementations, but both methods still incur a significant delay in the decision-making step, which makes them inappropriate for real-time applications, and they require large computational resources as the number of individuals grows.

Some researchers have applied graph theory to human tracking. Kayumbi et al. [38] proposed an algorithm to find football players' trajectories based on a distributed sensing algorithm in multi-camera views. Their algorithm starts by mapping each camera view plane to a virtual top view of the ground plane; finally, they exploit graph theory to track each individual in the ground plane.

Some online tracking methods utilize appearance features of individuals for tracking [39, 40]. These models extract apparent look features of individuals, and both systems provide accurate appearance descriptors to guide data association. The first system incorporates the temporal appearance of individuals along with spatial appearance features; its appearance model is learned by incremental evaluation, tuning the parameters in each iteration. In the second system, a Markov decision process (MDP) is employed to map the age of a detected object in terms of a Markov chain, and the MDP decides the tracks based on the current status and history of the target.

Recently, some researchers have worked on simple online tracking and tried to make tracking real time on live streams [6, 7]. These systems, simple online and real-time tracking (SORT) and simple online and real-time tracking with a deep association metric (Deep SORT), are two successive versions of the same methodology. In both systems, a Kalman filter is employed to obtain the movement features of the target, using intersection over union, central position, height, width, and velocity as the core features for tracking. In Deep SORT, convolution features of the target's appearance are also used alongside motion features to reduce lost tracks after occlusions and missed detections over multiple frames. Despite their real-time performance, these systems lose tracks after changes of posture and when detection is missed in a large number of frames.

Our proposed system reduces missed detections of the human body and also reduces lost tracks by incorporating extra features and a better human detection system.


3. Methodology and framework

Multi-object tracking (MOT) in real time with good accuracy has been a challenge for decades. Many systems have been developed for this task using traditional computer vision techniques, but with the rebirth of deep learning, object tracking has become robust. As object detection is the backbone of object tracking systems, and deep learning techniques handle the object detection problem with real-time speed and accuracy, a deep learning algorithm is the better choice for the detection step.

We have solved the MOT problem using state-of-the-art techniques. The proposed method is explained through its key components: human detection, position prediction of objects in future frames, tracklet association, and managing the life span of identities for tracked objects. The most widely used state-of-the-art object detection algorithm is YOLO [24], which is fast enough to detect multiple objects in real time, but it suffers from missed detections, which lead to fragmentation and identity switch problems. We therefore conducted a proper survey to choose the best detection algorithm for the problem. We have divided the methodology into sub-components for detection, track handling, and association as follows:

• Faster R-CNN for human detection

• Kalman filter

• CNN for appearance features

• Hungarian algorithm for tracking nearby rectangles

• Additional features like area, relative distance, color, nearest color, and HSV

The basic modules of the proposed system are described in the following sections.

3.1 Person detection using Faster R-CNN

As mentioned above, advancements in deep learning-based algorithms have made real-world object detection a lot easier, so we have employed the Faster region convolutional neural network (Faster R-CNN) detection network [23].

Faster R-CNN has two stages. In the first stage, the region proposal network (RPN) generates anchors on the regions of the image where there is a high possibility of the presence of an object. This process is further divided into three steps:

1. The first step involves feature extraction using a convolution neural network; convolution feature maps are generated at the end of the last layer.

2. In the second step, a sliding window approach is used on these feature maps to generate anchor boxes. These anchor boxes are further refined in the next step to indicate the presence of objects.

3. Finally, in the third step, the generated anchors are refined using a smaller network that calculates a loss function to select the top anchors containing objects.


A prerequisite for the region proposal network is the extraction of convolution features, which are produced by the backbone network.

3.1.1 Residual Network-30 (backbone network)

The object detection problem depends on the feature extraction process to produce good-quality proposals, so we used a ResNet-based model containing 30 layers, named ResNet-30. ResNets, or residual networks, are a special type of convolution neural network with residual connections between layers. The benefit of these residual connections is that the network is able to learn local, global, and intermediate features in parallel, making it more efficient than a plain CNN. Residual connections also help avoid the vanishing gradient problem, a major issue in networks with a large number of layers. ResNet-30 is thus able to learn more patterns than a plain CNN by grasping more information. Two types of shortcut connections are used in ResNet in different scenarios, as described below:

1. In the first case, when the inputs and outputs have the same dimensions, the shortcut x can be used directly, as illustrated in Eq. (1):

l = F(x, W_i) + x   (1)

2. In the second case, the dimensions change, and the identity mapping is performed by padding extra zero entries to make the dimensions suitable. Another option is to use a projection shortcut to match the dimensions (done by a 1 × 1 convolution), as in Eq. (2):

l = F(x, W_i) + W_j x   (2)

where W is the weight matrix, x is the feature vector from the previous layer, and F is the convolution function. A pictorial representation of the residual block is given in Figure 4.
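The following TensorFlow/Keras sketch illustrates the two shortcut variants of Eqs. (1) and (2); the layer arrangement is a common residual block pattern and an assumption, since the chapter does not list the exact per-layer configuration of ResNet-30.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Residual block: output = F(x, W_i) + shortcut, per Eqs. (1)/(2)."""
    f = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, strides=1, padding="same")(f)
    f = layers.BatchNormalization()(f)

    if stride != 1 or x.shape[-1] != filters:
        # Eq. (2): dimensions change, so a 1x1 projection shortcut (W_j x) is used.
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    else:
        # Eq. (1): dimensions match, so the identity shortcut x is used directly.
        shortcut = x
    return layers.ReLU()(layers.Add()([f, shortcut]))
```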

3.1.2 Anchor generation

Figure 4. Basic building block of residual learning.

Now, to propose the regions of the image with a high probability of containing objects, a sliding window approach is used: a sliding window moves across the feature maps to generate anchors. The sliding window has the size n × n; in our case n = 3, meaning a 3 × 3 window is used. A set of nine anchors is generated for each pixel, all having the same center (x, y). The nine anchors cover three aspect ratios and three scales. Figure 5 represents the nine anchors sharing the same center point; anchors with the same color have the same aspect ratio but different scaling. Intersection over union (IoU) is used to determine how much these anchors overlap with the ground-truth bounding boxes, and a threshold on the IoU value decides which anchors are kept: anchors with IoU > 0.7 are considered object-containing regions, and anchors with IoU < 0.3 are considered background. Eq. (3) gives the rule for labeling an anchor based on its IoU value.

IoU = (Anchor ∩ Gt) / (Anchor ∪ Gt);  IoU > 0.7 → object,  IoU < 0.3 → not object   (3)

3.1.3 Loss function

The selected anchors are further fine-tuned using the loss function at the end of the region proposal network. A shallow network is used for this purpose, which performs two tasks: classification and regression. The classification performed here is binary, assigning each anchor to one of two classes: object or background.

The output of the regressor determines the position of the predicted bounding box in terms of four parameters (x, y, w, h), where x and y indicate the center point of the anchor, w stands for width, and h represents the height of the anchor box. The loss function, which combines the regression and classification losses, is given in Eq. (4):

L(p_i, r_i) = (1/N_cls) Σ_i G_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* G_reg(r_i, r_i*)   (4)

where p_i is the predicted objectness probability of anchor i, p_i* its ground-truth label, r_i the predicted box parameters, r_i* the ground-truth box parameters, and N_cls and N_reg the respective normalization terms.
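The numpy sketch below shows how Eq. (4) combines the two terms. Binary cross-entropy and smooth-L1 are used here for G_cls and G_reg as plausible stand-ins (they are the standard Faster R-CNN choices), since the chapter does not define them explicitly.

```python
import numpy as np

def smooth_l1(d):
    d = np.abs(d)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)

def rpn_loss(p, p_star, r, r_star, lam=1.0):
    """Eq. (4): classification over all anchors, regression over positives.

    p: predicted objectness probabilities in (0, 1), shape (N,)
    p_star: 0/1 ground-truth anchor labels, shape (N,)
    r, r_star: predicted and target box parameters, shape (N, 4)
    """
    n_cls = len(p)
    n_reg = max(int(p_star.sum()), 1)  # normalizer choice is an assumption
    eps = 1e-7
    cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    reg = smooth_l1(r - r_star).sum(axis=1) * p_star  # positives only
    return cls.sum() / n_cls + lam * reg.sum() / n_reg
```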

The RPN is trained to propose regions of interest (ROIs) on the feature maps obtained from the input image. These ROIs are enclosed in bounding boxes; the RPN outputs bounding boxes of different scales on the feature maps, each with a high probability of containing an object.

Now comes the second stage of Faster R-CNN, which is the classification of the ROIs obtained from the RPN. To bring the ROIs into a feedable format for the classifier, ROI pooling is used, which shapes all ROIs to the same scale: it performs max pooling on inputs of nonuniform sizes to obtain fixed-size feature maps for each ROI.

Figure 5. Nine different proposed anchors for each single point.


3.1.4 ROI classification and regression

The same-sized feature maps obtained from ROI pooling are then processed for classification and regression; this step runs two heads in parallel. Bounding box classification and regression losses are calculated and optimized jointly: the classification head produces the class score for each individual category, and the regression head adjusts the bounding box values (x, y, w, h) to cover the complete object. The overall performance and accuracy of Faster R-CNN are better than those of all the traditional object detectors. A diagram of Faster R-CNN is given in Figure 6.

We trained Faster R-CNN on 4000 annotated images of human heads, shoulders, and complete bodies, which improved detection accuracy considerably, leaving only a small miss rate.

3.2 Track handling and state estimation of future frames

The Kalman filter is used in its standard form as proposed in [41]. We define the tracking scenario on the multidimensional state space (x, y, γ, h, s, t, u, v), which consists of the bounding box center location (x, y), the height h, the aspect ratio γ, and their respective velocities (s, t, u, v) in image coordinates. For every track T, we count the number of frames since the last correct association S_T: the counter is incremented whenever the Kalman filter predicts without a matching detection and is reset to 0 when the track is assigned to a detection. New detections that cannot be assigned to any current track initiate new track hypotheses, which are classified as tentative for their first three frames. If an associated measurement is found at every time step t, the track is kept for further processing; otherwise it is deleted.
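A minimal sketch of this track state using the filterpy library is given below; the constant-velocity transition matches the state definition above, while the noise magnitudes are illustrative assumptions.

```python
import numpy as np
from filterpy.kalman import KalmanFilter

def make_track_filter(x, y, gamma, h, dt=1.0):
    """8-dim state (x, y, gamma, h) plus velocities; 4-dim measurement."""
    kf = KalmanFilter(dim_x=8, dim_z=4)
    kf.F = np.eye(8)                 # constant-velocity transition model
    for i in range(4):
        kf.F[i, i + 4] = dt          # position += velocity * dt
    kf.H = np.eye(4, 8)              # we observe (x, y, gamma, h) only
    kf.x[:4, 0] = [x, y, gamma, h]   # initial state, zero initial velocity
    kf.P *= 10.0                     # illustrative initial uncertainty
    kf.R *= 1.0                      # illustrative measurement noise
    kf.Q *= 0.01                     # illustrative process noise
    return kf

# Per frame: kf.predict(); when a detection z = (x, y, gamma, h) is matched:
# kf.update(np.array(z).reshape(4, 1))
```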

Figure 6. Complete framework diagram of Faster R-CNN.


3.2.1 Association of newly predicted states and current states

A traditional approach to finding associations between the current Kalman states and newly arrived detections is the Hungarian algorithm. We integrate spatial displacement and appearance features by creating two different metrics. For motion information, we use the Mahalanobis distance between the current list of states and the newly arrived detections; the Mahalanobis distance accounts for state estimation uncertainty by measuring deviations between a detection and the mean track location. False associations can further be excluded by thresholding at a 90% confidence interval computed from the inverse χ² distribution; we set the threshold t to 9 for decisions based on the Mahalanobis distance.
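A sketch of this gated assignment using scipy is shown below: squared Mahalanobis distances form the cost matrix, pairs beyond the threshold t = 9 are forbidden, and the Hungarian algorithm (scipy's linear_sum_assignment) solves the matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 9.0  # Mahalanobis gating threshold used in this work

def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance between detection z and a track."""
    d = z - mean
    return float(d @ np.linalg.inv(cov) @ d)

def associate(tracks, detections):
    """Hungarian assignment on gated Mahalanobis costs.

    tracks: list of (mean, covariance) pairs from the Kalman filter
    detections: list of measurement vectors
    """
    cost = np.zeros((len(tracks), len(detections)))
    for i, (mean, cov) in enumerate(tracks):
        for j, z in enumerate(detections):
            cost[i, j] = mahalanobis_sq(z, mean, cov)
    cost[cost > GATE] = 1e6  # forbid associations outside the gate
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
```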

The Mahalanobis distance matrix provides a robust association metric when the overall motion is small and the Kalman filter framework supplies only a vague approximation of the object position. However, rapid movement of the capturing device can lead to large displacements, making it an uninformative metric for tracking in the presence of occlusions. Therefore, we integrate a second metric for tracking pedestrians: a pre-trained convolution network extracts the bounding box appearance features. The complete architecture of the proposed convolution neural network is shown in Table 1.

In combination, both techniques support each other by handling different aspects of the association problem. In particular, the Mahalanobis distance matrix contributes information about object positions based on short-term motion, while the convolution appearance descriptor handles long-term occluded detections that cannot be captured through motion features. Moreover, we use some additional features, namely the area of the human, the relative distance between tracked humans, the color or nearest color of the object, and HSV histograms, to handle occlusions efficiently.

The area of a human accommodates the occlusion problem quite well because it mostly remains the same during the whole tracking period; for example, a short person remains short before and after an occlusion, which makes it easy to reidentify them after a long occlusion. Similarly, if someone is sitting in a wheelchair, their area will remain the same throughout the tracking period.

Name              Filter size  Stride  Output size
Conv 1            3 × 3        1       32 × 128 × 64
Conv 2            3 × 3        1       32 × 128 × 64
Max pool 1        3 × 3        2       32 × 64 × 32
Residual block 1  3 × 3        1       32 × 64 × 32
Residual block 2  3 × 3        1       32 × 64 × 32
Residual block 3  3 × 3        2       64 × 32 × 16
Residual block 4  3 × 3        1       64 × 32 × 16
Residual block 5  3 × 3        2       128 × 16 × 8
Residual block 6  3 × 3        1       128 × 16 × 8
Dense layer 1     —            —       128
Batch norm        —            —       128

Table 1. Complete architecture for appearance feature extractor network.
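A Keras sketch corresponding to Table 1 is given below, reusing the residual_block helper from Section 3.1.1. The global average pooling before the dense layer and the final L2 normalization of the 128-dim embedding are assumptions (common practice for reidentification descriptors), as the table does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def appearance_net(input_shape=(128, 64, 3)):
    """Appearance descriptor per Table 1: person crop -> 128-dim embedding."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)   # Conv 1
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)     # Conv 2
    x = layers.MaxPool2D(3, strides=2, padding="same")(x)              # Max pool 1
    x = residual_block(x, 32)             # Residual blocks 1 and 2
    x = residual_block(x, 32)
    x = residual_block(x, 64, stride=2)   # Residual blocks 3 and 4
    x = residual_block(x, 64)
    x = residual_block(x, 128, stride=2)  # Residual blocks 5 and 6
    x = residual_block(x, 128)
    x = layers.GlobalAveragePooling2D()(x)  # assumed pooling before the head
    x = layers.Dense(128)(x)                # Dense layer 1
    x = layers.BatchNormalization()(x)      # Batch norm
    x = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return Model(inp, x)
```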


Relative distance is another important feature for keeping a track updated. If the targeted person is moving in a group with a few other people, their relative distance can be used to keep track of the target even after a long occlusion.

Color or nearest color can help reidentify a person after occlusion, because people mostly keep the same clothes within a session. Similarly, if the nearest color is predicted, the person can be reidentified even after a sudden change of lighting or contrast.

HSV and RGB histograms are employed for comparing histograms based on object appearance. We compare the histogram appearance model in both color spaces using the cumulative brightness transfer function (CBTF) as the mapping function between the two fields of view, which helps us handle occlusions in a better way.
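A minimal OpenCV sketch of the histogram comparison step follows; histogram correlation is used here as an illustrative similarity measure, and the CBTF mapping between camera views is omitted for brevity.

```python
import cv2
import numpy as np

def hsv_histogram(person_crop, bins=(30, 32)):
    """Normalized 2D hue-saturation histogram of a person's bounding box."""
    hsv = cv2.cvtColor(person_crop, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def appearance_similarity(crop_a, crop_b):
    """Compare two person crops by HSV histogram correlation (1.0 = identical)."""
    h_a = hsv_histogram(crop_a).astype(np.float32)
    h_b = hsv_histogram(crop_b).astype(np.float32)
    return cv2.compareHist(h_a, h_b, cv2.HISTCMP_CORREL)
```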

3.3 Tracking options

In this system, we provide multiple options for enabling tracking. The major tracking options are (1) face recognition-based tracking and (2) target tracking.

3.3.1 Face recognition-based tracking

As the object detection system we trained also detects faces, we take advantage of this and use the detected faces in our tracking system. We use the face recognition approach proposed by Lu et al. [29] to recognize the detected faces. When any new face enters the frame, the system extracts and stores its features, assigning a specific ID for later use. These feature maps and IDs are subsequently used to associate newly detected faces with saved faces, which increases data association accuracy and helps tracking. The limitation of this approach is that it works only when the detected face is visible enough: it works within a distance of 15 to 20 feet from the camera and also depends on camera resolution, since higher resolution yields better features.

3.3.2 Target tracking

The developed system can also track a specifically selected object: we select the object we want to track by clicking on the detected object. This enables analysis of only the desired object by hiding the tracking details of other objects; it is a practical implementation of user-facilitated human-computer interaction. Overall pedestrian tracking is improved through robust human detection using multiple views and body parts (head, shoulders, and complete body), and the problems of identity switches and fragmentation are addressed with appearance features and additional spatial features (area, relative distance, color, and histograms).

4. Evaluation

Training of the proposed detection system is performed on a self-generated dataset, and for tracking we employ a standard tracking dataset to evaluate the overall system.


4.1 Environment setup

For the environment setup, we used a GeForce GTX 1080 Ti GPU on a system running Ubuntu. We chose the Python programming language for the experimentation, and the system is built using the TensorFlow framework. We evaluated five different ResNet models integrated with the Faster R-CNN architecture on the self-generated dataset; the results are discussed in Section 4.4.

4.2 Self-generated dataset

For training the detection system, we utilized a self-generated dataset of 4000 images of humans in different postures. Each image contains on average five different subjects, with limited repetition across images. The dataset includes different environmental conditions (rain, snow, and shadow) and different lighting conditions (day and night). Some of the images were collected from the Internet, and some were generated within the university premises using surveillance cameras. The dataset is comprehensive in terms of human densities, view angles, postures, and scales, and it covers different scenes: streets, bazaars, buildings, malls, parks, roads, and stadiums. Figure 7 shows some images from the self-generated dataset.

We annotated human body parts in different categories. The visible human body can be categorized into three classes based on occlusion and the density of the crowd: head, shoulder, and complete body. We annotated the images based on the visible category; the details of each annotated category are shown in Table 2.

Figure 7. Sample images from self-generated dataset.

Category        Instances  Number of images
Head            4325       987
Shoulder        3130       2031
Complete body   12,690     3411

Table 2. Each category division in proposed dataset.

4.3 MOT benchmark dataset

For evaluating our proposed tracking system, we employed the multiple object tracking (MOTChallenge) benchmark dataset [42], which contains a variety of sequences with dynamic and static cameras. This dataset combines 22 different available datasets. Some sample images from the MOTChallenge dataset are shown in Figure 8.

The dataset provides a 10-minute video of individuals with 61,440 human detection rectangles. It is composed of 14 different sequences, properly annotated by expert annotators; other objects like chairs and cars are also annotated for better representation of occlusions. Three different aspects of MOTChallenge are described below:

1. Dynamic or static stream: while capturing, a camera can be in multiple states, such as placed on a stroller, mounted in a car, or held by a person, making the stream dynamic or static.

2. Viewpoint variation: the video camera can be elevated, at the same height as the pedestrians, or at a low position.

3. Weather conditions: the weather condition of each captured stream is also provided with the sequences, to give an idea of the lighting, shadows, and blurring of the pedestrians.

4.4 Results

We integrated and tested multiple ResNet backbone architectures in Faster R-CNN-based object detection, evaluating the networks on detection accuracy and performance. After proper evaluation, we found that ResNet-30 performed better in our case, as our major concern is minimum runtime while maintaining accuracy. The response time of our system is 25 frames per second, so we decided to use ResNet-30 in the designed system for better speed and accuracy. The evaluation matrix for the different tested models is given in Table 3.

Figure 8. Samples from MOT dataset. Top, training images; bottom, test sequences.

As Table 3 depicts, the runtime of ResNet-30 is lower than that of the other ResNet models, and its Top-5 error rate is not significantly higher. Based on this evaluation, ResNet-30 was our ultimate choice as the backbone architecture of Faster R-CNN.

Table 4 provides the evaluation results of our complete tracking system on the MOT dataset, covering seven challenging test sequences with camera scenes at human eye level and elevated views. Since the tracking system relies heavily on the detection mechanism (better detection yields better tracking), we used Faster R-CNN trained on our self-collected dataset. We reran Deep SORT on the same evaluation dataset for a fair comparison.

We set a threshold of 0.7 for the detection confidence score and further fine-tuned the other parameters of the network to produce a better model. The following metrics are used for comparison:

Model (residual networks)  Layers  Top-1 error  Top-5 error (avg.)  Runtime (ms)
ResNet-34                  34      25.27        8.51                52.09
ResNet-50                  50      23.93        7.82                104.13
ResNet-101                 101     22.81        7.11                158.35
ResNet-152                 152     22.52        6.63                219.06
ResNet-30 (proposed)       30      26.02        8.04                48.93

Table 3. Comparison of different ResNet architectures as backbone network for object detection.

Method           Type         MOTA  MOTP  MT     ML     ID    FM    FP      FN      Runtime
KDNT [43]        Batch-based  68.2  79.4  41.0%  19.0%  933   1093  11,479  45,605  0.7 Hz
LMP p [44]       Batch-based  71.0  80.2  46.9%  21.9%  434   587   7880    44,564  0.5 Hz
MCMOT HDM [45]   Batch-based  62.4  78.3  31.5%  24.2%  1394  1318  9855    57,257  35 Hz
NOMTwSDP16 [46]  Batch-based  62.2  79.6  32.5%  31.1%  406   642   5119    63,352  3 Hz
EAMTT [47]       Real-time    52.5  78.8  19.0%  34.9%  910   1321  4407    81,223  12 Hz
POI [43]         Real-time    66.1  79.5  34.0%  20.8%  805   3093  5061    55,914  10 Hz
SORT [6]         Real-time    59.8  79.6  25.4%  22.7%  1423  1835  8698    63,245  60 Hz
Deep SORT [7]    Real-time    61.4  79.1  32.8%  18.2%  781   2008  12,852  56,668  40 Hz
Proposed system  Real-time    75.2  81.3  33.2%  17.5%  825   1225  4123    52,524  42 Hz

Table 4. Evaluation table.


• Multi-object tracking accuracy (MOTA): the complete accuracy of the system, incorporating false negatives, identity switches, and false positives.

• Multi-object tracking precision (MOTP): the complete tracking precision in terms of the bounding box overlap between the predicted location and the ground-truth value.

• Mostly tracked (MT): the percentage of ground-truth tracks that keep the same label for at least 80% of their life span.

• Mostly lost (ML): the percentage of ground-truth tracks that are tracked by the system for at most 20% of their life span.

• Identity switches (ID): the total number of reported identity changes of ground-truth tracks.

• Fragmentation (FM): the number of times a track is interrupted by a missed detection of a person.
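For reference, these metrics follow the CLEAR MOT definitions, in which accuracy aggregates the three error sources over all frames t:

MOTA = 1 − Σ_t (FN_t + FP_t + IDSW_t) / Σ_t GT_t

where FN_t, FP_t, IDSW_t, and GT_t are the false negatives, false positives, identity switches, and ground-truth objects in frame t; MOTP averages the bounding box overlap (or distance) over all matched detection-to-ground-truth pairs.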

The results of our evaluation are shown in Table 4. The number of identity switches is reduced due to our alterations to the detection network, and compared to Deep SORT [7], MOTA increases from 61.4 to 75.2.

5. Conclusion

The goal of this work was to implement a fast and competitive MOT system. We presented a multiple object tracker that combines a deep learning-based object detection network, Faster R-CNN, with a tracking algorithm. The proposed system performs tracking by detecting multiple objects, assigning each object a unique ID, and generating their tracklets. In the case of fragmentation in any object's tracklet, the system uses a tracklet association mechanism to generate a complete trajectory. Tracking is performed based on the appearance and motion features of objects: whenever the object detection network fails to detect objects in a frame, these features are used to track each object again with the same ID. The Kalman filter and Hungarian algorithm are used together to predict the position of objects in the frame, and other features like area, color, relative distance, nearest color, and HSV histograms are also used to increase the tracking accuracy. Overall, the system performed very well, showing improvement in the MOTA, MOTP, ML, and FP fields, as compared in Table 4. But considering environmental constraints and hardware limitations, our system has some pros and cons, listed below.

5.1 Pros

• Efficient in terms of response time because of the smaller number of layers in the residual network

• Minimal missed detections because of the improved object detection process

• Less fragmentation in the drawn trajectories because of continuous detection of persons in consecutive frames


• Introduction of visual and motion feature-based tracking to reduce identity switches

• Trajectory completion in the case of fragmentation

5.2 Cons

• It requires GPU-based hardware for enabling real-time tracking.

• In highly dense crowds, identity switches may occur because of the similar head features of each person.

• Performance reduces in dark environmental conditions.

Acknowledgements

This work was carried out at UET Lahore under the Intelligent Criminology Lab, National Center of Artificial Intelligence.

Author details

Gulraiz Khan1, Zeeshan Tariq1 and Muhammad Usman Ghani Khan2*

1 Al-Khawarizmi Institute of Computer Science, UET Lahore, Pakistan

2 Department of Computer Science and Engineering, UET Lahore, Pakistan

*Address all correspondence to: [email protected]

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


References

[1] Rezatofighi SH, Milan A, Zhang Z, Shi Q, Dick A, Reid I. Joint probabilistic data association revisited. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 3047-3055

[2] Kim C, Li F, Ciptadi A, Rehg JM. Multiple hypothesis tracking revisited. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 4696-4704

[3] Yang B, Nevatia R. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE. 2012. pp. 1918-1925

[4] Andriyenko A, Schindler K, Roth S. Discrete-continuous optimization for multi-target tracking. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE. 2012. pp. 1926-1933

[5] Milan A, Schindler K, Roth S. Detection- and trajectory-level exclusion in multiple object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013. pp. 3682-3689

[6] Bewley A, Ge Z, Ott L, Ramos F, Upcroft B. Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP); IEEE. 2016. pp. 3464-3468

[7] Wojke N, Bewley A, Paulus D. Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP); IEEE. 2017. pp. 3645-3649

[8] Zheng L, Bie Z, Sun Y, Wang J, Su C, Wang S, et al. MARS: A video benchmark for large-scale person re-identification. In: European Conference on Computer Vision; Springer. 2016. pp. 868-884

[9] Jain AK, Zhong Y, Lakshmanan S. Object matching using deformable templates. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1996;18(3):267-278

[10] Mundy JL. Object recognition in the geometric era: A retrospective. In: Toward Category-Level Object Recognition. Springer; 2006. pp. 3-28

[11] Ponce J, Hebert M, Schmid C, Zisserman A. Toward Category-Level Object Recognition. Vol. 4170. Springer; 2007

[12] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24(7):971-987

[13] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005; Volume 1; IEEE. 2005. pp. 886-893

[14] Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004;60(2):91-110

[15] Tuzel O, Porikli F, Meer P. Region covariance: A fast descriptor for detection and classification. In: European Conference on Computer Vision; Springer. 2006. pp. 589-600

[16] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. 2012. pp. 1097-1105

[17] Khan G, Ghani MU, Siddiqi A, Seo S, Baik SW, Mehmood I, et al. Egocentric visual scene description based on human-object interaction and deep spatial relations among objects. Multimedia Tools and Applications. 2018:1-22

[18] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 580-587

[19] Uijlings JRR, Van De Sande KEA, Gevers T, Smeulders AWM. Selective search for object recognition. International Journal of Computer Vision. 2013;104(2):154-171

[20] He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European Conference on Computer Vision; Springer. 2014. pp. 346-361

[21] Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 1440-1448

[22] Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 2921-2929

[23] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. 2015. pp. 91-99

[24] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 779-788

[25] Turk MA, Pentland AP. Face recognition using eigenfaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991. Proceedings CVPR'91; IEEE. 1991. pp. 586-591

[26] Kwak K-C, Pedrycz W. Face recognition using a fuzzy Fisherface classifier. Pattern Recognition. 2005;38(10):1717-1732

[27] Ahonen T, Hadid A, Pietikainen M. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006;28(12):2037-2041

[28] Hadid A, Pietikainen M, Ahonen T.A discriminative feature space fordetecting and recognizing faces.In: Proceedings of the 2004 IEEEComputer Society Conference onComputer Vision and PatternRecognition, 2004. CVPR 2004; volume2; IEEE. 2004

[29] Lu Z, Jiang X, Kot ACC. Deepcoupled resnet for low-resolution facerecognition. IEEE Signal ProcessingLetters. 2018

[30] Schroff F, Kalenichenko D, PhilbinJ. Facenet: A unified embedding for facerecognition and clustering. In:Proceedings of the IEEE Conference onComputer Vision and PatternRecognition. 2015. pp. 815-823

[31] Huang GB, Mattar M, Berg T,Learned-Miller E. Labeled faces in thewild: A database for studying facerecognition in unconstrainedenvironments. In:Workshop on Faces in‘Real-Life’ Images: Detection,Alignment, and Recognition. 2008

[32] Taigman Y, Yang M, Ranzato MA,Wolf L. Deepface: Closing the gap tohuman-level performance in faceverification. In: Proceedings of theIEEE Conference on Computer Visionand Pattern Recognition; 2014.pp. 1701–1708

22

Visual Object Tracking with Deep Neural Networks

[33] Dicle C, Camps OI, Sznaier M. Theway they move: Tracking multipletargets with similar appearance. In:Proceedings of the IEEE InternationalConference on Computer Vision. 2013.pp. 2304-2311

[34] Yoon JH, Yang M-H, Lim J, YoonK-J. Bayesian multiobject tracking usingmotion context from multiple objects.In: 2015 IEEE Winter Conference onApplications of Computer Vision(WACV); IEEE. 2015. pp. 33-40

[35] Bewley A, Ott L, Ramos F, UpcroftB. Alextrac: Affinity learning byexploring temporal reinforcementwithin association chains. In: 2016 IEEEInternational Conference on Roboticsand Automation (ICRA); IEEE. 2016.pp. 2212-2218

[36] Reid D et al. An algorithm fortracking multiple targets. IEEETransactions on Automatic Control.1979;24(6):843-854

[37] Fortmann T, Bar-Shalom Y, ScheffeM. Sonar tracking of multiple targetsusing joint probabilistic data association.IEEE Journal of Oceanic Engineering.1983;8(3):173-184

[38] Kayumbi G, Mazzeo PL, SpagnoloP, Taj M, Cavallaro A. Distributed visualsensing for virtual top-view trajectorygeneration in football videos. In:Proceedings of the 2008 InternationalConference on Content-Based Imageand Video Retrieval; ACM. 2008.pp. 535-542

[39] Yang M, Jia Y. Temporal dynamicappearance modeling for online multi-person tracking. Computer Vision andImage Understanding. 2016;153:16-28

[40] Xiang Y, Alahi A, Savarese S.Learning to track: Online multi-objecttracking by decision making. In:Proceedings of the IEEE InternationalConference on Computer Vision. 2015.pp. 4705-4713

[41] Kalman RE. A new approach tolinear filtering and prediction problems.Journal of Basic Engineering. 1960;82(1):35-45

[42] Leal-Taixé L, Milan A, Reid I, RothS, Schindler K. Motchallenge 2015:Towards a benchmark for multi-targettracking. 2015; arXiv preprint arXiv:1504.01942

[43] Yu F, Li W, Li Q, Liu Y, Shi X, YanJ. Poi: Multiple object tracking with highperformance detection and appearancefeature. In: European Conference onComputer Vision; Springer. 2016.pp. 36-42

[44] Keuper M, Tang S, Zhongjie Y,Andres B, Brox T, Schiele B. A multi-cutformulation for joint segmentation andtracking of multiple objects. arXivpreprint arXiv:1607.06317; 2016

[45] Lee B, Erdenee E, Jin S, Nam MY,Jung YG, Rhee PK. Multi-class multi-object tracking using changing pointdetection. In: European Conference onComputer Vision; Springer. 2016.pp. 68-83

[46] Choi W. Near-online multi-targettracking with aggregated local flowdescriptor. In: Proceedings of the IEEEInternational Conference on ComputerVision. 2015. pp. 3029-3037

[47] Sanchez-Matilla R, Poiesi F,Cavallaro A. Online multi-targettracking with strong and weakdetections. In: European Conference onComputer Vision; Springer. 2016.pp. 84-99

23

Multi-Person Tracking Based on Faster R-CNN and Deep Appearance FeaturesDOI: http://dx.doi.org/10.5772/intechopen.85215