
Indoor detection and classification of objects for an Unmanned Aerial System

Andras PALFFY

Supervisor at Cranfield University: Prof. Al SAVVARIS
Supervisor at Pázmány Péter Catholic University: Andras HORVATH, PhD

Faculty of Information Technology and Bionics
Pázmány Péter Catholic University
Budapest, Hungary

A thesis submitted for a Master's degree

December, 2015

Pázmány Péter Catholic University
Faculty of Information Technology and Bionics

Thesis topic declaration form

Student

Name: Pálffy András; Neptun code: W19DOC
Programme: Mérnökinformatikus MSc (Computer Science Engineering), IMNI-MI

Supervisor

Name: Al Savvaris
Institute / company: Cranfield University
Position, academic degree: Professor

Name of internal consultant: Dr. Horváth András

Thesis

Title of the thesis (in Hungarian): Beltéri objektumfelismerés és -osztályozás pilóta nélküli járművel

Title of the thesis: Indoor detection and classification of objects for an Unmanned Aerial System

Topic of the thesis

In this project a quadrotor based airborne tracking system will be developed with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localising a ground robot that will serve as a landing platform for recharging the quadrotor. The project's final aim is the development of the aerial platform together with the tracking/detection system, working in near real time, and carrying out experimental testing. To achieve this, the system will need to have a continuous update of the relative positioning of the ground robot.

Tasks of the student

The student's first task will be to carry out a literature search to gather information about unmanned aerial systems, modelling and hardware integration (aerial platform, flight control systems, hardware and sensor integration, and communication).

The following task will be to develop the algorithms and the image processing tools using the OpenCV and Dlib libraries in a C++ environment.

Image processing techniques will mainly focus on the classifier development to identify the landing target (i.e. the recharge station) for the quadrotor to dock with (please note that orientation is also an important factor). The detection method has to work at reasonable speed on an embedded system on board (ARM, Nitrogen6x); thus the developed algorithms need to be ported and optimised for the chosen on-board platform.

If time permits, the student will also consider sensor fusion (data association) to enable the system to operate robustly in different lighting conditions.

I undertake the supervision of this thesis.

Signature of supervisor

I undertake the supervision of the student as internal consultant.

Signature of internal consultant

I request the approval of the thesis topic.

Budapest,

Signature of student

The thesis topic has been approved by the Faculty of Information Technology and Bionics.

Budapest,

Dr. Szolgay Péter
dean

The student attended the consultations and completed the tasks set out in the assignment.

Budapest,

Signature of supervisor

The thesis complies with the content and formal requirements set out in Annex 3 of the Study and Examination Regulations (TVSZ).

Budapest,

Signature of internal consultant

I, the undersigned, Pálffy András, student of the Faculty of Information Technology and Bionics of Pázmány Péter Catholic University, declare that I have prepared this thesis myself, without any unauthorised help, and that I have used only the listed sources. Every part taken from another source, either verbatim or rephrased with the same meaning, is clearly marked with a reference to that source. I have submitted this thesis only within the framework of an Erasmus exchange, on the "Digital Signal and Image Processing" option of the "Computational & Software Techniques in Engineering" MSc course of Cranfield University, in the 2014/2015 academic year.

Undersigned András PÁLFFY, student of Pázmány Péter Catholic University's Faculty of Information Technology and Bionics, I state that I have written this thesis on my own, without any prohibited sources, and that I used only the described references. Every transcription, citation or paraphrase is marked with the exact source. This thesis has only been submitted at Cranfield University's Computational & Software Techniques in Engineering MSc course, on the Digital Signal and Image Processing option, in 2015.

Budapest, 2015.12.20.

Pálffy András


Contents

List of Figures vii

Absztrakt viii

Abstract ix

List of Abbreviations x

1 Introduction and project description 1
  1.1 Project description and requirements 1
  1.2 Type of vehicle 2
  1.3 Aims and objectives 3

2 Literature Review 5
  2.1 UAVs and applications 5
    2.1.1 Fixed-wing UAVs 5
    2.1.2 Rotary-wing UAVs 6
    2.1.3 Applications 8
  2.2 Object detection on conventional 2D images 9
    2.2.1 Classical detection methods 10
      2.2.1.1 Background subtraction 10
      2.2.1.2 Template matching algorithms 11
    2.2.2 Feature descriptors, classifiers and learning methods 13
      2.2.2.1 SIFT features 14
      2.2.2.2 Haar-like features 15
      2.2.2.3 HOG features 16
      2.2.2.4 Learning models in computer vision 17
      2.2.2.5 AdaBoost 18
      2.2.2.6 Support Vector Machine 19

3 Development 21
  3.1 Hardware resources 21
    3.1.1 Nitrogen board 21
    3.1.2 Sensors 21
      3.1.2.1 Pixhawk autopilot 22
      3.1.2.2 Camera 22
      3.1.2.3 LiDar 23
  3.2 Chosen software 23
    3.2.1 Matlab 23
    3.2.2 Robotic Operating System (ROS) 24
    3.2.3 OpenCV 24
    3.2.4 Dlib 24

4 Designing and implementing the algorithm 26
  4.1 Challenges in the task 26
  4.2 Architecture of the detection system 29
  4.3 2D image processing methods 31
    4.3.1 Chosen methods and the training algorithm 31
    4.3.2 Sliding window method 34
    4.3.3 Pre-filtering 36
    4.3.4 Tracking 37
    4.3.5 Implemented detector 39
      4.3.5.1 Mode 1: Sliding window with all the classifiers 41
      4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier 41
      4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs 42
      4.3.5.4 Mode 4: Tracking based approach 44
  4.4 3D image processing methods 46
    4.4.1 3D recording method 46
    4.4.2 Android based recording set-up 48
    4.4.3 Final set-up with Pixhawk flight controller 51
    4.4.4 3D reconstruction 53

5 Results 56
  5.1 2D image detection results 56
    5.1.1 Evaluation 56
      5.1.1.1 Definition of True positive and negative 56
      5.1.1.2 Definition of False positive and negative 56
      5.1.1.3 Reducing number of errors 57
      5.1.1.4 Annotation and database building 57
    5.1.2 Frame-rate measurement and analysis 59
  5.2 3D image detection results 60
  5.3 Discussion of results 61

6 Conclusion and recommended future work 65
  6.1 Conclusion 65
  6.2 Recommended future work 67

References 70


List of Figures

1.1 Image of the ground robot 3

2.1 Fixed wing consumer drone 6
2.2 Example for consumer drones 7
2.3 Example for people detection with background subtraction 11
2.4 Example of template matching 12
2.5 Two example Haar-like features 16
2.6 Illustration of the discriminative and generative models 17
2.7 Example of a separable problem 20

3.1 Image of Pixhawk flight controller 22
3.2 The chosen LIDAR sensor: Hokuyo UTM-30LX 23
3.3 Elements of Dlib's machine learning toolkit 25

4.1 A diagram of the designed architecture 28
4.2 Visualization of the trained HOG detectors 35
4.3 Representation of the sliding window method 36
4.4 Example image for the result of edge detection 38
4.5 Example of the detector's user interface 41
4.6 Presentation of mode 3 43
4.7 Mode 4 example output frames 45
4.10 Screenshot of the Android application for Lidar recordings 48
4.8 Schematic figure to represent the 3D recording set-up 49
4.9 Example representation of the output of the Lidar Sensor 50
4.11 Picture about the laboratory and the recording set-up 52
4.12 Presentation of the axes used in the android application 53

5.1 Figure of the Vatic user interface 58
5.2 Photo of the recording process 61
5.3 Example of the 3D images built 62


List of Abbreviations

SATM  School of Aerospace Technology and Manufacturing
UAV   Unmanned Aerial Vehicle
UAS   Unmanned Aerial System
UA    Unmanned Aircraft
UGV   Unmanned Ground Vehicle
HOG   Histogram of Oriented Gradients
RC    Radio Controlled
ROS   Robotic Operating System
IMU   Inertial Measurement Unit
DoF   Degree of Freedom
SLAM  Simultaneous Localization And Mapping
ROI   Region Of Interest
Vatic Video Annotation Tool from Irvine, California


Absztrakt

This thesis presents a tracking system mounted on an unmanned aerial vehicle, intended to detect indoor objects; its main goal is to find the ground unit, which will serve as a landing and recharging station.

First, the aims of the project and of this thesis are listed and detailed.

This is followed by a detailed literature review, which presents the existing solutions to similar challenges. A short summary of unmanned aerial vehicles and their application fields is given, then the best-known object detection methods are presented. The review discusses their advantages and disadvantages, with special attention to their applicability in the current project.

The next part describes the development environment, including the available software and hardware.

After presenting the challenges of the task, the design of a modular architecture is introduced, taking into account the objectives, the resources and the problems encountered.

One of the most important modules of this architecture, the latest version of the detection algorithm, is also detailed in the following chapter, together with its capabilities, modes and user interface.

To measure the efficiency of this module, an evaluation environment was created, which is able to compute several metrics related to the detection. Both the environment and the metrics are detailed in the following chapter, followed by the results achieved by the latest algorithm.

Although this thesis focuses mainly on detection methods operating on conventional (2D) images, 3D imaging and processing methods were also considered. An experimental system was built which is able to create spectacular and precise 3D maps of the environment using a 2D laser scanner. Several recordings were made to test the solution, and these are presented together with the system.

Finally, a summary of the implemented methods and the results concludes the thesis.


Abstract

In this paper, an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localising a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed, along with their suitability for the discussed project.

Then the development environment is described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced, along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluation tools are introduced to analyse efficiency and speed.

The first version of the ground robot detection algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two-dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV, and these are presented in this paper as well.

Finally, all implemented methods and relevant results are summarised.


Chapter 1

Introduction and project description

In this chapter, an introduction is given to the whole project which this thesis is part of. Afterwards, a structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system, based on one or more unmanned autonomous vehicles, which is able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications, such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast. Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly in real time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate. Reasonably accurate recording is required so that the map is suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous. The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision to coordinate the process (for example, when mapping a dangerous area which would be too far for real-time control). Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete. Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all significant challenges for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location), and handle multi-level buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial vehicle or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air; thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, and therefore longer operation time. On the other hand, it cannot elevate from the ground, so all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles closing its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture.

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as the solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable to the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors, or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air; thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial vehicle. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution to this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, the design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction to unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief overview of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes: the smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, unmanned aerial vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control; for comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of their much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs, made by the SenseFly company. Source: [1]

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurance using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore, flying around between and inside buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with a heavy payload [4].

Fixed-wing UAVs are an excellent choice for long-endurance, high-altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus, rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly like the wing on fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of DJI's popular Phantom series. This consumer drone is easy to fly even for beginner pilots, due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side produce more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason why no large-scale multirotor is used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shortens operational time and increases the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing, and especially multirotor, UAVs seem like an excellent choice for this project, since there is no need to cover large areas, and the shorter operational time can be solved with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German V-1 rocket ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools in many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available both as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example of this application field is the location of an earthquake, or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have admittedly proven useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks, for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside; thus other methods are needed.

Some research has tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions, see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of a SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them by:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus the accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: is a particular item present in an image patch?

• Detection and Localization: given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information for the given object.

• Classification: given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: given a large complex image (instead of an image patch, as in the classification problem), determine the location and labels of the objects present in that image.

• Description: given a complex image, name all the objects present in it, and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts have been made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research into artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for good overviews), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only the objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.
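To illustrate the concept, the sketch below shows how such an adaptive background model can be used in practice. It is a minimal example assuming OpenCV 3.x and a hypothetical video file name, not code from this project: the MOG2 model segments the foreground, and large blobs are kept as candidate moving objects.

```cpp
// Minimal sketch, assuming OpenCV 3.x and a placeholder video file "input.avi".
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("input.avi");              // hypothetical input
    if (!cap.isOpened()) return 1;

    cv::Ptr<cv::BackgroundSubtractor> model = cv::createBackgroundSubtractorMOG2();

    cv::Mat frame, foreground;
    while (cap.read(frame)) {
        model->apply(frame, foreground);            // subtract the learned background model
        cv::medianBlur(foreground, foreground, 5);  // remove isolated noisy pixels

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(foreground.clone(), contours,
                         cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        for (const auto& c : contours)
            if (cv::contourArea(c) > 500)           // large blobs = possible objects of interest
                cv::rectangle(frame, cv::boundingRect(c), cv::Scalar(0, 0, 255), 2);

        cv::imshow("foreground objects", frame);
        if (cv::waitKey(30) == 27) break;           // Esc quits
    }
    return 0;
}
```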

These solutions can give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measurement can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector: after the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object: for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern in the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them and looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the dark headlines of the calendar (on the wall, left), which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the comparison function will return the highest value at the correct position [43].
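As an illustration of this shifting-and-comparing scheme, the following minimal sketch uses OpenCV's built-in matcher with a normalized cross-correlation score; the file names are placeholders and the snippet is not part of the project's code.

```cpp
// Minimal sketch with placeholder file names: the template is shifted over the image
// and compared by normalized cross-correlation; the best-scoring position is the match.
#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);
    cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);
    if (image.empty() || templ.empty()) return 1;

    // result(u, v) holds the correlation coefficient for the template shifted to (u, v).
    cv::Mat result;
    cv::matchTemplate(image, templ, result, cv::TM_CCOEFF_NORMED);

    double maxVal = 0.0;
    cv::Point maxLoc;
    cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);

    // The method assumes the highest value marks the correct position of the pattern.
    cv::rectangle(image, maxLoc, maxLoc + cv::Point(templ.cols, templ.rows), cv::Scalar(255), 2);
    std::printf("best match at (%d, %d), score %.3f\n", maxLoc.x, maxLoc.y, maxVal);
    cv::imwrite("match.png", image);
    return 0;
}
```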

See figure 2.4 for an example input, the template (part a) and the output of the algorithm (part b). Notice how the output image is darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts have been published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, and [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are its poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.

2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem to be two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method; in other words, for deciding whether the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors, or descriptors, are often grouped as well. [48] recognizes edge based and patch based features. [31] has three groups, according to the locality of the features:

• Pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• Patch-level feature description: descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• Region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area are not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: the first stage searches over the image at multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point, the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable, based on the model, are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated at the selected scale.

The biggest advantage of the descriptor is its scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is achieved by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus, SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
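A minimal sketch of this match-the-known-object use of SIFT is given below. It assumes an OpenCV 3.x build with the contrib xfeatures2d module (where the SIFT implementation lives) and uses placeholder file names; it is illustrative only, not the method used in this thesis.

```cpp
// Minimal sketch, assuming OpenCV 3.x + contrib (xfeatures2d) and placeholder files:
// SIFT key-points of a training image of the object are matched against a test image.
#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <cstdio>

int main() {
    cv::Mat object = cv::imread("object_train.png", cv::IMREAD_GRAYSCALE);
    cv::Mat scene  = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);
    if (object.empty() || scene.empty()) return 1;

    cv::Ptr<cv::Feature2D> sift = cv::xfeatures2d::SIFT::create();
    std::vector<cv::KeyPoint> kpObject, kpScene;
    cv::Mat descObject, descScene;
    sift->detectAndCompute(object, cv::noArray(), kpObject, descObject);
    sift->detectAndCompute(scene,  cv::noArray(), kpScene,  descScene);

    // Ratio test [50]: keep a match only if it is clearly better than the second best.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descObject, descScene, knn, 2);
    int good = 0;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance) ++good;
    std::printf("good SIFT matches: %d\n", good);
    return 0;
}
```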

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• Two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• Three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• Four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for examples of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of the pixel intensities above and to the left of it" [45] at any given image location. The integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values at the corners of the rectangles.
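The following short sketch illustrates the integral image trick: any rectangle sum, and therefore any Haar-like feature, is obtained from four look-ups. It assumes OpenCV for the integral computation and a placeholder input image; the rectangle coordinates are arbitrary illustrative values.

```cpp
// Minimal sketch: rectSum() returns the sum of pixel intensities inside a rectangle
// using only four look-ups in the integral image, as used to evaluate Haar-like features.
#include <opencv2/opencv.hpp>
#include <cstdio>

// Integral image ii is (rows+1) x (cols+1); ii(y, x) = sum of pixels above and to the left of (x, y).
static double rectSum(const cv::Mat& ii, const cv::Rect& r) {
    return ii.at<double>(r.y, r.x)
         + ii.at<double>(r.y + r.height, r.x + r.width)
         - ii.at<double>(r.y, r.x + r.width)
         - ii.at<double>(r.y + r.height, r.x);
}

int main() {
    cv::Mat img = cv::imread("face.png", cv::IMREAD_GRAYSCALE);  // placeholder input
    if (img.empty()) return 1;

    cv::Mat ii;
    cv::integral(img, ii, CV_64F);

    // A two-rectangle feature: difference of the sums of two vertically adjacent,
    // equally sized regions (positions below are illustrative only).
    cv::Rect upper(30, 40, 60, 20), lower(30, 60, 60, 20);
    double feature = rectSum(ii, upper) - rectSum(ii, lower);
    std::printf("two-rectangle feature value: %.1f\n", feature);
    return 0;
}
```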

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and has inspired several studies; [54], for example, extended the Haar-like feature set with rotated rectangles, by calculating the integral images diagonally as well.

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves; the bottom row presents the features overlaid on a typical face. The first feature corresponds to the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are widely used in computer vision for various purposes: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is rotation variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization, produces a histogram of the different orientations (called bins) in small portions of the image (called cells). After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptor (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to orientation; thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
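As an illustration of the cell/block/bin structure, the sketch below computes one HOG feature vector with OpenCV's HOGDescriptor, using the 64×128 window and the parameters from [53]. The input file name is a placeholder and this is not the detector developed in this thesis.

```cpp
// Minimal sketch, assuming OpenCV and a placeholder input patch.
#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    cv::Mat patch = cv::imread("patch.png", cv::IMREAD_GRAYSCALE);
    if (patch.empty()) return 1;
    cv::resize(patch, patch, cv::Size(64, 128));   // detection window size

    cv::HOGDescriptor hog(cv::Size(64, 128),        // window
                          cv::Size(16, 16),         // block (normalization unit)
                          cv::Size(8, 8),           // block stride
                          cv::Size(8, 8),           // cell
                          9);                       // orientation bins
    std::vector<float> descriptor;
    hog.compute(patch, descriptor);

    // 7 x 15 blocks, 4 cells per block, 9 bins per cell = 3780 locally normalized values.
    std::printf("HOG feature vector length: %zu\n", descriptor.size());
    return 0;
}
```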

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well. [32, 48] find two main philosophies: generative and discriminative models.

Let us denote the test data (images) by x_i, the corresponding labels by c_i, and the feature descriptors (extracted from the data x_i) by θ_i, where i = 1 to N, and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given a test datum, they predict the probability that an instance x was generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) · P(c)

is ideally 1 if x contains an instance of class c, and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the probability

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to be detected, two of the most well-known discriminative methods will be presented: Adaptive Boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing computational expense caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights w_{t,i}, where N = number of training examples and T = number of features in the stage.

• For t = 1, ..., T:

  1. Normalize the weights.

  2. Select the best classifier using only a single feature, by minimising the detection error ε_t = Σ_i w_i |h(x_i, f, p, θ) − y_i|, where h(x_i) is a classifier output and y_i is the correct label (both with a range of 0 for negative, 1 for positive).

  3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

  4. Update the weights: w_{t+1,i} = w_{t,i} · (ε_t / (1 − ε_t))^{1 − e_i}, where e_i = 0 if example x_i is classified correctly and e_i = 1 otherwise.

• The final classifier for the stage is based on the sum of the weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
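The weight update above is the heart of the method. The toy sketch below runs one boosting round on hand-made labels to show how misclassified samples gain relative weight; it is purely illustrative and not related to the project's code.

```cpp
// Toy sketch of one AdaBoost round (Viola-Jones style weight update).
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> y = {1, 1, 0, 0, 1};              // true labels
    std::vector<int> h = {1, 0, 0, 1, 1};              // chosen weak classifier (errs on samples 1 and 3)
    std::vector<double> w(y.size(), 1.0 / y.size());   // normalized weights

    // Weighted error of the weak classifier: eps_t = sum_i w_i * |h(x_i) - y_i|.
    double eps = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
        if (h[i] != y[i]) eps += w[i];

    // Update: w_{t+1,i} = w_{t,i} * beta^(1 - e_i) with beta = eps / (1 - eps),
    // e_i = 0 for correctly classified samples, 1 otherwise.
    double beta = eps / (1.0 - eps), sum = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        int e = (h[i] == y[i]) ? 0 : 1;
        w[i] *= std::pow(beta, 1 - e);                  // correct samples are down-weighted
        sum += w[i];
    }
    for (double& wi : w) wi /= sum;                     // re-normalize for the next round

    // Misclassified samples now carry a larger share of the weight.
    for (std::size_t i = 0; i < w.size(); ++i) std::printf("w[%zu] = %.3f\n", i, w[i]);
    return 0;
}
```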

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the elements of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in such a way that no points from different classes remain in the same part (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest possible margin). A larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and, due to that, determining the width of the margin) are called support vectors, hence the name of the method. See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.

The algorithm was extended to tasks that are not linearly separable in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.
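A minimal sketch of training such a maximum-margin classifier is shown below, using Dlib (one of the libraries chosen for this project) with a linear kernel on a hand-made, linearly separable toy set; the samples and the C value are arbitrary illustrative choices, not the thesis's classifier.

```cpp
// Minimal sketch: Dlib's C-SVM trainer with a linear kernel on a toy 2D problem.
#include <dlib/svm.h>
#include <iostream>
#include <vector>

int main() {
    typedef dlib::matrix<double, 2, 1> sample_type;          // 2D feature vectors
    typedef dlib::linear_kernel<sample_type> kernel_type;

    std::vector<sample_type> samples;
    std::vector<double> labels;                              // +1 = object class, -1 = everything else
    const double pts[8][2] = {{2,2},{2,3},{3,2},{2.5,2.5},{-2,-2},{-2,-3},{-3,-2},{-2.5,-2.5}};
    for (int i = 0; i < 8; ++i) {
        sample_type s;
        s = pts[i][0], pts[i][1];                            // dlib's comma initializer
        samples.push_back(s);
        labels.push_back(i < 4 ? +1.0 : -1.0);
    }

    // The trainer finds the separating hyperplane with the largest margin,
    // penalising margin violations by the constant C.
    dlib::svm_c_trainer<kernel_type> trainer;
    trainer.set_c(10);
    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

    sample_type test;
    test = 2.5, 3.0;
    std::cout << "decision value: " << df(test) << std::endl; // positive -> the +1 class
    return 0;
}
```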


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares; they define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or the "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].


Chapter 3

Development

In this chapter, the circumstances of the development which influenced the research, especially with respect to the objectives defined in section 1.3, will be presented. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints regarding the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarises the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multirotors during flight. Therefore it was suitable to use for mapping purposes. See section 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide-angle field of view.


Figure 3.2: The chosen LIDAR sensor, a Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level, interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping to develop unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
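As an illustration of how such drivers are consumed, a minimal ROS node in C++ could subscribe to the laser scan and the attitude topics published by the respective drivers. This is only a sketch: the topic names "/scan" and "/mavros/imu/data" are typical defaults of common driver packages and are assumptions here, not settings taken from the project.

#include <ros/ros.h>
#include <sensor_msgs/Imu.h>
#include <sensor_msgs/LaserScan.h>

// Called for every 2D scan published by the lidar driver.
void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan)
{
    ROS_INFO("Received %zu range readings", scan->ranges.size());
}

// Called for every attitude message published by the flight controller driver.
void imuCallback(const sensor_msgs::Imu::ConstPtr& imu)
{
    ROS_INFO("Orientation quaternion w = %f", imu->orientation.w);
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "sensor_listener");
    ros::NodeHandle nh;
    ros::Subscriber scanSub = nh.subscribe("/scan", 10, scanCallback);
    ros::Subscriber imuSub  = nh.subscribe("/mavros/imu/data", 10, imuCallback);
    ros::spin();   // hand control to ROS; callbacks run as messages arrive
    return 0;
}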

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate image), image processing tools (e.g. different kinds of edge detection, threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, examples and support are widely available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.
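A minimal sketch of this division of labour is shown below: OpenCV reads the frames, while dlib::cv_image wraps each frame (without copying) so that the Dlib based detectors can process it. The file name is illustrative.

#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("testVideo1.avi");   // or cv::VideoCapture cap(0); for a live camera
    cv::Mat frame;
    while (cap.read(frame))
    {
        // Wrap the BGR frame for the Dlib based parts of the pipeline.
        dlib::cv_image<dlib::bgr_pixel> dlibFrame(frame);
        // ... hand dlibFrame to the detector or tracker here ...
    }
    return 0;
}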

3.2.4 Dlib

Dlib is a general purpose, cross-platform C++ library. It has an excellent machine learning library, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly.


Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that besides the features above, Dlib also contains components to handle linear algebra, threading, network IO, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given to the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks, like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, the cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: a small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually makes it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera. See sub-subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various size of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm specially designed for the task. Thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, with at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


Figure 4.1: Diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system (camera, 2D lidar, trainer and Vatic annotation server, front and side SVMs, detector algorithm, regions of interest, 3D map, other preprocessing such as edge and colour detection, tracking and evaluation). Arrows represent the dependencies and the direction of the information flow.


4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed which is suitable for the objectives defined. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used. Neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. This production has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle were located in this map, accurate estimations could be made on the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the concerns, the methods and the development process of the 2D camera image processing will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library, with image handling or even machine learning algorithms implemented, can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable to find patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable to recognise a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also caused by moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises with no a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4). This means that even if the camera is levelled, the object itself could seem rotated, caused by the perspective. To overcome this issue, solutions will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, heavily optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentation pages are available for both training and detection, which simplified the implementation.

As a classifier, support vector machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. Also, an SVM is implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an XML file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might otherwise get into the negative training images.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar XML file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to the disk with serialization.
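The core of such a training run can be sketched with Dlib's fHOG object detector interface, closely following the library's own example. The file names, the detection window size and the value of the C parameter below are illustrative assumptions, not the exact settings used in the project.

#include <dlib/data_io.h>
#include <dlib/image_processing.h>
#include <dlib/svm_threaded.h>
#include <iostream>

int main()
{
    using scanner_t = dlib::scan_fhog_pyramid<dlib::pyramid_down<6>>;

    // Images and object boxes are loaded from a Dlib-style annotation XML file.
    dlib::array<dlib::array2d<unsigned char>> images;
    std::vector<std::vector<dlib::rectangle>> boxes;
    dlib::load_image_dataset(images, boxes, "side_view_training.xml");

    scanner_t scanner;
    scanner.set_detection_window_size(80, 80);   // sliding window size in pixels

    dlib::structural_object_detection_trainer<scanner_t> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                            // SVM regularization parameter
    trainer.be_verbose();

    // Negative samples are taken automatically from the non-annotated image areas.
    dlib::object_detector<scanner_t> detector = trainer.train(images, boxes);

    // Quick sanity check on the training set, then the classifier is serialized.
    std::cout << "precision, recall, average precision: "
              << dlib::test_object_detection_function(detector, images, boxes) << std::endl;
    dlib::serialize("groundrobotside.svm") << detector;
    return 0;
}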

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and from the right, while the front and rear views are also very similar. This is especially true if the working principle of the HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from the left and the right, or from the front and the rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front or the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they not only detect the position of the robot but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

On figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector; both the wheels and the body are easy to recognize, thanks to their strong edges. On 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front-view detector; notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that usually their training datasets (the images of the robot or any other object) are cropped, only containing the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.


Figure 4.2: (a) A typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with a predefined step-size and at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
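Conceptually, the search can be sketched as the nested loop below (in the actual detector Dlib's scan_fhog_pyramid performs this internally, including the image pyramid). The classify callback stands for the trained classifier applied to a single cropped window; all names and parameter values are illustrative.

#include <functional>
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> slidingWindowSearch(const cv::Mat& image, cv::Size window,
                                          int step, double scaleFactor,
                                          const std::function<bool(const cv::Mat&)>& classify)
{
    std::vector<cv::Rect> hits;
    cv::Mat scaled = image.clone();
    double scale = 1.0;
    while (scaled.cols >= window.width && scaled.rows >= window.height)
    {
        for (int y = 0; y + window.height <= scaled.rows; y += step)
            for (int x = 0; x + window.width <= scaled.cols; x += step)
            {
                cv::Rect r(x, y, window.width, window.height);
                if (classify(scaled(r)))        // crop one window and hand it to the classifier
                    hits.push_back(cv::Rect(int(x * scale), int(y * scale),
                                            int(window.width * scale),
                                            int(window.height * scale)));
            }
        // Shrink the image so that larger instances of the object fit into the fixed-size window.
        cv::resize(scaled, scaled, cv::Size(), 1.0 / scaleFactor, 1.0 / scaleFactor);
        scale *= scaleFactor;
    }
    return hits;   // detection boxes mapped back to the original image coordinates
}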

It is worth mentioning that multiple instances of the sought object may be present on the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors are executed in a shorter time than by scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module with another or a newly developed one for further development.

Based on intuition, a good separating feature is the colour: anything which does not have the same colour as the ground vehicle should be ignored. To do this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good base of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor, which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. On figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus these areas are still scanned.

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was presented.


Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing an own 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Besides its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
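The basic usage of Dlib's correlation tracker is sketched below: it is started from a detection box and then updated frame by frame; the variable names are illustrative.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

// Follows the object marked by detectionBox through the rest of the video.
void trackObject(cv::VideoCapture& cap, const dlib::rectangle& detectionBox)
{
    cv::Mat frame;
    if (!cap.read(frame))
        return;

    dlib::correlation_tracker tracker;
    tracker.start_track(dlib::cv_image<dlib::bgr_pixel>(frame), detectionBox);

    while (cap.read(frame))
    {
        tracker.update(dlib::cv_image<dlib::bgr_pixel>(frame));   // follow the "same set of pixels"
        dlib::drectangle position = tracker.get_position();       // current estimate of the object's box
        // ... periodically validate position with the HOG + SVM detectors ...
    }
}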

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow the tracking of objects in 3D, relative to the environment. The UAV will have an estimation of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool, with which different methods and their combinations (e.g. using 2 SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development four different approaches were implemented. All of them are based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1) or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are rather modifications of the same code than separate solutions, they were not implemented as new software. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 4.1: Table of the available parameters

Name                | Valid values    | Function
input               | path to video   | video used as input for detection
svm                 | path to SVMs    | these SVMs will be used
mode                | [1, 2, 3, 4]    | selects which mode is used
saveFrames          | [0, 1]          | turns on video frame export
saveDetections      | [0, 1]          | turns on detection box export
saveFPS             | [0, 1]          | turns on frame-rate measurement
displayVideo        | [0, 1]          | turns on video display
DetectionsFileName  | string          | sets the filename for saved detections
FramesFolderName    | string          | sets the folder name used for saving video frames
numberOfLoops       | integer (> 0)   | sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarizes all the implemented parameters with the possible values and a short description.

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm
input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would save neither the detections (saveDetections) nor the video frames (saveFrames), and the video is not shown either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detecting mode and the purpose of the execution (producing video, efficiency statistics or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.1), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple: it checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this extensive search is very computationally heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two used classifiers were trained. On figure 4.2 it can be seen that one of them represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, that these two classifiers detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but resulting from the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again, until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
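The classifier "memory" of mode 2 can be sketched as follows. The detector_t type matches Dlib's fHOG detectors used above; the variable names and the default tolerance value are illustrative.

#include <dlib/image_processing.h>
#include <vector>

using detector_t = dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6>>>;

// lastGood remembers which SVM detected the robot most recently (-1 if none yet);
// missedFrames counts the sequential frames without any detection.
template <typename image_type>
std::vector<dlib::rectangle> detectMode2(const image_type& frame,
                                         std::vector<detector_t>& detectors,
                                         int& lastGood, int& missedFrames,
                                         int toleranceLimit = 5)
{
    std::vector<dlib::rectangle> hits;
    if (lastGood >= 0 && missedFrames < toleranceLimit)
        hits = detectors[lastGood](frame);                 // only the previously successful SVM runs
    else
        for (std::size_t i = 0; i < detectors.size() && hits.empty(); ++i)
        {
            hits = detectors[i](frame);                    // fall back to scanning with every classifier
            if (!hits.empty())
                lastGood = int(i);
        }
    missedFrames = hits.empty() ? missedFrames + 1 : 0;
    return hits;
}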

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection along with the detector which returned it.


Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

A new rectangle, named ROI (region of interest), was introduced, which determines the area in which the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, caused by the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returned detections inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if in any direction the ROI reaches the size of the image, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
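The ROI bookkeeping described above can be sketched with Dlib's rectangle utilities. The helper below is illustrative: the growth percentages follow the defaults mentioned in the text, while the function and variable names are assumptions.

#include <dlib/geometry.h>

// detection is a pointer so that a missed frame can be signalled with nullptr.
dlib::rectangle updateRoi(dlib::rectangle roi, const dlib::rectangle* detection,
                          const dlib::rectangle& fullFrame,
                          double growOnHit = 0.50, double growOnMiss = 0.03)
{
    if (detection != nullptr)
        // A detection occurred: the ROI restarts from the detection box, enlarged by growOnHit.
        roi = dlib::grow_rect(*detection,
                              long(detection->width()  * growOnHit / 2),
                              long(detection->height() * growOnHit / 2));
    else
        // No detection inside the ROI: enlarge the previous ROI slightly.
        roi = dlib::grow_rect(roi,
                              long(roi.width()  * growOnMiss / 2),
                              long(roi.height() * growOnMiss / 2));
    return roi.intersect(fullFrame);   // the ROI never grows beyond the image
}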

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case, it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.


Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", due to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm, based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or a part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors, and in appropriate conditions it can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3), and the algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returned detections inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards, the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See figure 4.7(a) for a representation of the processing method of mode 4.


Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area, and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the used lidar sensor is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section like, one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the set-up are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first, when the recording plane is completely flat; secondly, when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical axis (capital Z). To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as an angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space, defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited, but still very wide, field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (between 135° and −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the lidar sensor is needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and is marked by blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself. On parts a and b its schematic side and top view is shown, to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file name is user defined.

Since both the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this has to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.


Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross; green areas were reached by the laser. Source: screenshot of own recording, made by the official recording tool [77].


However, since the two systems were not connected, the timestamps had to be synchronized manually.
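Once the two clocks have been aligned by hand, pairing each scan with the orientation sample recorded closest in time can be automated. The sketch below only illustrates this nearest-timestamp matching; the structure and field names are hypothetical, not taken from the actual log formats.

#include <cmath>
#include <cstddef>
#include <vector>

struct OrientationSample { double timestamp; double pitch, roll, yaw; };

// Returns the index of the orientation sample recorded closest to the given scan time.
std::size_t nearestOrientation(const std::vector<OrientationSample>& imuLog, double scanTimestamp)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < imuLog.size(); ++i)
        if (std::fabs(imuLog[i].timestamp - scanTimestamp) <
            std::fabs(imuLog[best].timestamp - scanTimestamp))
            best = i;
    return best;
}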

On figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar, and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. We used a Motorola Moto X (2013) for these experiments.

The coordinate system used is fixed to the phone. The x axis is parallel with the shorter side of the screen, pointing from the left to the right. The y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. The angles were calculated in the following way:

• pitch is the rotation around the x axis; it is 0 if the phone is lying on a horizontal surface

• roll is the rotation around the y axis

• yaw is the rotation around the z axis

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with the Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszár managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes; 4 is a chair; 5 is a student at his desk; 6, 7 are windows; 8 is the entrance; 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the obtained point coordinates were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections. However, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• x axis is the vector product of y and z (x = y × z), pointing to the right
• y axis is the facing direction of the scanner, parallel to the ground, pointing forward

Figure 4.12: Presentation of the axes used in the Android application.



• z axis is pointing up and is perpendicular to the ground

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three represent its position along the x, y, z axes. The other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change. Thus only pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of a recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory, the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen on figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change, and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated as the sum of dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)


where dy is the translation required along the y axis and dz is along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8 part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

[x, y, z]ᵀ = distance · [sin(−yaw), cos(yaw) · sin(pitch), cos(yaw) · cos(pitch)]ᵀ + offset · [0, sin(pitch), cos(pitch)]ᵀ    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (Simultaneous Localization and Mapping) or the ground robot detection.
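As an illustration, a single lidar measurement could be converted into the ground-fixed frame with equation 4.6 as in the following C++ sketch. The function and type names are illustrative only and do not correspond to the actual implementation.

// Minimal sketch (not the thesis code): converting one lidar point given as
// (distance, yaw) in the scan plane, plus the measured pitch of the tilt,
// into the ground-fixed coordinate system following equation 4.6.
#include <cmath>

struct Point3D { double x, y, z; };

// 'offset' is the distance between the light source and the rotation axis
// (figure 4.8, part c, item 10), in the same unit as 'distance'.
Point3D lidarToGround(double distance, double yaw, double pitch, double offset) {
    Point3D p;
    p.x = distance * std::sin(-yaw);                                              // eq. 4.1
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);  // eq. 4.2 + 4.4
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);  // eq. 4.3 + 4.5
    return p;
}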


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check if it meets the predefined requirements.

To understand what better means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive. Similarly, a rejected one is called negative.

5.1.1.1 Definition of True positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of False positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample although it is not. In other words, the system believes


the object is present at that location, although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (rejected) although it should be. In the current task, a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if conditions are loosened in the detection method, false negative errors may be reduced. However, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid occasional mistakes. Such a project can be a manually supervised classification, where appearing false positives can be eliminated by the operator (for example, illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would be able to detect the robot again with great probability. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset whose samples are already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation is only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kind of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object occluded, obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only task remaining is to correct the interpolations between them if necessary.
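The interpolation idea can be illustrated with a simplified sketch: both the position and the size of a bounding box are interpolated linearly between two annotated key-frames. Vatic's actual algorithm is more sophisticated; this example only shows the principle, and the names used are not Vatic's own.

// Simplified illustration of key-frame interpolation: position and size of a
// bounding box are linearly interpolated between two annotated key-frames.
struct Box { double x, y, w, h; };

Box interpolateBox(const Box& a, const Box& b, int frameA, int frameB, int frame) {
    double t = double(frame - frameA) / double(frameB - frameA);  // 0..1 between key-frames
    return { a.x + t * (b.x - a.x),
             a.y + t * (b.y - a.y),
             a.w + t * (b.w - a.w),
             a.h + t * (b.h - a.h) };
}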

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including XML, JSON, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
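A minimal sketch of such a measurement is shown below, using std::chrono to time each frame and to write the elapsed seconds to a log for later analysis. The file name and the surrounding loop are assumptions for illustration; the real detector exports its own log format.

// Illustrative sketch (not the actual detector code): timing each processed
// frame and logging the elapsed seconds for later analysis.
#include <chrono>
#include <fstream>

int main() {
    std::ofstream log("frame_times.csv");            // hypothetical log file name
    for (int frame = 0; frame < 1080; ++frame) {      // e.g. one pass over the test video
        auto start = std::chrono::steady_clock::now();

        // processFrame(frame);                       // the detector call would go here

        auto end = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(end - start).count();
        log << frame << "," << seconds << "\n";       // exported for the analysis script
    }
}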

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every other unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test due to the operating system did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time; variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813 s
Shortest processing time: 0.007 s
Average elapsed seconds: 0.032446
Variance of elapsed seconds per frame:
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS
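For reference, the statistics appearing in this output (longest, shortest and average processing time, variance and average frame-rate) can be computed from the per-frame times as in the sketch below. The actual analysis was done with a Matlab script; this C++ equivalent only restates the formulas and assumes a non-empty list of timings.

// Rough C++ equivalent of the statistics computed by the Matlab analysis script.
#include <vector>
#include <numeric>
#include <algorithm>

struct FrameStats { double longest, shortest, mean, variance, fps; };

FrameStats analyse(const std::vector<double>& secondsPerFrame) {
    FrameStats s{};
    s.longest  = *std::max_element(secondsPerFrame.begin(), secondsPerFrame.end());
    s.shortest = *std::min_element(secondsPerFrame.begin(), secondsPerFrame.end());
    s.mean = std::accumulate(secondsPerFrame.begin(), secondsPerFrame.end(), 0.0)
             / secondsPerFrame.size();
    double sq = 0.0;
    for (double t : secondsPerFrame) sq += (t - s.mean) * (t - s.mean);
    s.variance = sq / secondsPerFrame.size();   // variance of elapsed seconds per frame
    s.fps = 1.0 / s.mean;                       // average frame-rate
    return s;
}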

The changes of average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and thus the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to gather the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they are still scanning in a plane and have this mentioned processing built in. Also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this, and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
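A minimal sketch of the simpler, height-based idea is given below: only points whose height (the z coordinate in the ground-fixed frame, where the origin is at the lidar and the floor therefore has negative altitude) falls inside the expected range of the ground robot are kept as candidates. The names and thresholds are assumptions for illustration, not part of the thesis implementation.

// Sketch of a height-threshold candidate filter on the recorded point-cloud.
#include <vector>

struct Point3D { double x, y, z; };   // z is the height in the ground-fixed frame

std::vector<Point3D> filterByHeight(const std::vector<Point3D>& cloud,
                                    double zMin, double zMax) {
    std::vector<Point3D> kept;
    for (const auto& p : cloud)
        if (p.z >= zMin && p.z <= zMax)   // e.g. between the floor and the robot's top
            kept.push_back(p);
    return kept;
}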

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos for all modes can be found here¹. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹users.itk.ppke.hu/~palan1/Cranfield/Videos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The detection efficiency can be characterised by several values; here recall and precision will be used. Recall is defined by

Recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the number of detected positive samples to the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

Precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, precision can be improved at the cost of more false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on position and the overlapping areas. For the latter a minimum of 50% was defined (ratio of the area of the intersection and the union of the two rectangles).
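The evaluation logic described above can be sketched as follows. The 50% threshold mirrors the text, while the function names are illustrative and OpenCV's cv::Rect is used only for convenience; the actual evaluation tool may be organised differently.

// Illustrative sketch of the overlap criterion and the recall/precision formulas.
#include <opencv2/core.hpp>

// Ratio of the intersection area and the union area of two rectangles.
double overlapRatio(const cv::Rect& det, const cv::Rect& gt) {
    double inter = (det & gt).area();              // intersection rectangle area
    double uni = det.area() + gt.area() - inter;   // union area
    return uni > 0.0 ? inter / uni : 0.0;
}

// A detection counts as a hit (true positive) if it overlaps the annotation by at least 50%.
bool isHit(const cv::Rect& det, const cv::Rect& gt) {
    return overlapRatio(det, gt) >= 0.5;
}

double recall(int tp, int fn)    { return double(tp) / (tp + fn); }
double precision(int tp, int fp) { return double(tp) / (tp + fp); }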

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfying. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged). Therefore the annotated object will not be covered by any of the detections.

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first three modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 was proven to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 was proven to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (95% recall) and speed (average frame-rate 30.8 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreases the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and the challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 concludes the chosen feature extraction (HOG) and classifier (SVM) methods and presents the implemented training software along with the


produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features like exporting detections, demonstration videos and frame rate measurements were implemented. To make the development simpler, an easy-to-use interface was included which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by detectors.

Special attention was given to the evaluation of the system. Two software tools were developed as additional aids: one for evaluating the efficiency of the detections and another for analysing the processing time and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detecting algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping, along with the applied mathematics. An


experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely the front), will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging can be carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient way to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr. Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks goes to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and who I could always turn to in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and that you stood by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to András Horváth, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview - senseFly," [Online]. Available: https://www.sensefly.com/drones/overview.html [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications," [Online]. Available: http://www.questuav.com/news/fixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store - Phantom 3 Standard," [Online]. Available: http://store.dji.com/product/phantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History," [Online]. Available: http://militaryhistory.about.com/od/artillerysiegeweapons/p/v1.htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science,


2007, pp. 1–13. [Online]. Available: http://www.springerlink.com/content/t361252205328408/fulltext.pdf

[10] A. Jaimes, S. Kota and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: http://www.mdpi.com/2072-4292/4/11/3390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: http://link.springer.com/10.1007/s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: http://www.ifp.uni-stuttgart.de/publications/phowo11/140Eisenbeiss.pdf

[20] C. Bills, J. Chen and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: http://www.cse.unsw.edu.au/~acra2005/proceedings/papers/bath.pdf

[22] K. Çelik, S. J. Chung, M. Clausman and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2522968.2522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-01/08, 2008. [Online].


Available: http://www.icg.tu-graz.ac.at/Members/pmroth/pub_pmroth/TR_OR/at_download/file

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: http://www.researchgate.net/publication/257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2, SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: https://www.vision.caltech.edu/publications/perona-chapter-Dec07.pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: http://link.aip.org/link/?PSI/4387/95/1&Agg=doi


[44] H. Schweitzer, J. W. Bell and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: http://www.springerlink.com/index/H584WVN93312V4LT.pdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Robust+Real-time+Object+Detection

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] C. M. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics (M. Jordan and J. Kleinberg, eds.). Springer, 2006.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: http://www.cscjournals.org/csc/manuscript/Journals/IJIP/volume6/Issue6/IJIP-702.pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997.

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2005.177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.9433


[55] S. Han, Y. Han and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. Chen, N. Georganas and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: http://cyberc3.sjtu.edu.cn/CyberC3/doc/paper/Robotica2006.pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: http://www.springerlink.com/index/10.1007/978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Jolliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: http://www.springerlink.com/content/978-0-387-95442-4

[61] A. Hyvärinen, J. Karhunen and E. Oja, "Independent Component Analysis," vol. 10, 2002.

[62] K. Mikolajczyk, B. Leibe and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: http://link.springer.com/10.1007/BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - i.MX6 Single Board Computer," [Online]. Available: http://boundarydevices.com/product/nitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller," [Online]. Available: http://rover.ardupilot.com/wiki/pixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW," [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/utm_30lx_ew.html [Accessed at 2015-07-22]

[72] "OpenCV manual, Release 2.4.9," [Online]. Available: http://docs.opencv.org/opencv2refman.pdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: http://jmlr.csail.mit.edu/papers/v10/king09a.html

[74] "dlib C++ Library," [Online]. Available: http://dlib.net [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1502.00046

[76] M. Danelljan, G. Häger and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page," [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/download/data/UrgBenri.htm [Accessed at 2015-07-20]

[78] "Vatic - Video Annotation Tool - UC Irvine," [Online]. Available: http://web.mit.edu/vondrick/vatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk," [Online]. Available: https://www.mturk.com/mturk/welcome [Accessed at 2015-07-26]




Alulírott Pálffy András, a Pázmány Péter Katolikus Egyetem Információs Technológiai és Bionikai Karának hallgatója kijelentem, hogy jelen diplomatervet meg nem engedett segítség nélkül, saját magam készítettem, és a diplomatervben csak a megadott forrásokat használtam fel. Minden olyan részt, melyet szó szerint vagy azonos értelemben, de átfogalmazva más forrásból átvettem, egyértelműen a forrás megadásával megjelöltem. Ezt a diplomatervet csakis Erasmus tanulmányút keretén belül, a Cranfield University „Computational & Software Techniques in Engineering" MSc képzésén, „Digital Signal and Image Processing" szakirányon nyújtottam be a 2014/2015. tanévben.

I, the undersigned András PÁLFFY, student of Pázmány Péter Catholic University's Faculty of Information Technology and Bionics, state that I have written this thesis on my own, without any prohibited sources, and I used only the described references. Every transcription, citation or paraphrasing is marked with the exact source. This thesis has only been submitted at Cranfield University's Computational & Software Techniques in Engineering MSc course, on the Digital Signal and Image Processing option, in 2015.

Budapest, 2015.12.20.

Pálffy András


Contents

List of Figures

Absztrakt

Abstract

List of Abbreviations

1 Introduction and project description
  1.1 Project description and requirements
  1.2 Type of vehicle
  1.3 Aims and objectives

2 Literature Review
  2.1 UAVs and applications
    2.1.1 Fixed-wing UAVs
    2.1.2 Rotary-wing UAVs
    2.1.3 Applications
  2.2 Object detection on conventional 2D images
    2.2.1 Classical detection methods
      2.2.1.1 Background subtraction
      2.2.1.2 Template matching algorithms
    2.2.2 Feature descriptors, classifiers and learning methods
      2.2.2.1 SIFT features
      2.2.2.2 Haar-like features
      2.2.2.3 HOG features
      2.2.2.4 Learning models in computer vision
      2.2.2.5 AdaBoost
      2.2.2.6 Support Vector Machine

iv

CONTENTS

3 Development 2131 Hardware resources 21

311 Nitrogen board 21312 Sensors 21

3121 Pixhawk autopilot 223122 Camera 223123 LiDar 23

32 Chosen software 23321 Matlab 23322 Robotic Operating System (ROS) 24323 OpenCV 24324 Dlib 24

4 Designing and implementing the algorithm 2641 Challenges in the task 2642 Architecture of the detection system 2943 2D image processing methods 31

431 Chosen methods and the training algorithm 31432 Sliding window method 34433 Pre-filtering 36434 Tracking 37435 Implemented detector 39

4351 Mode 1 Sliding window with all the classifiers 414352 Mode 2 Sliding window with intelligent choice of

classifier 414353 Mode 3 Intelligent choice of classifiers and ROIs 424354 Mode 4 Tracking based approach 44

44 3D image processing methods 46441 3D recording method 46442 Android based recording set-up 48443 Final set-up with Pixhawk flight controller 51444 3D reconstruction 53

5 Results 5651 2D image detection results 56

511 Evaluation 565111 Definition of True positive and negative 565112 Definition of False positive and negative 565113 Reducing number of errors 575114 Annotation and database building 57

512 Frame-rate measurement and analysis 59

v

CONTENTS

52 3D image detection results 6053 Discussion of results 61

6 Conclusion and recommended future work 6561 Conclusion 6562 Recommended future work 67

References 70

vi

List of Figures

1.1 Image of the ground robot
2.1 Fixed wing consumer drone
2.2 Example for consumer drones
2.3 Example for people detection with background subtraction
2.4 Example of template matching
2.5 Two example Haar-like features
2.6 Illustration of the discriminative and generative models
2.7 Example of a separable problem
3.1 Image of Pixhawk flight controller
3.2 The chosen LIDAR sensor, Hokuyo UTM-30LX
3.3 Elements of Dlib's machine learning toolkit
4.1 A diagram of the designed architecture
4.2 Visualization of the trained HOG detectors
4.3 Representation of the sliding window method
4.4 Example image for the result of edge detection
4.5 Example of the detector's user interface
4.6 Presentation of mode 3
4.7 Mode 4 example output frames
4.8 Schematic figure to represent the 3D recording set-up
4.9 Example representation of the output of the Lidar sensor
4.10 Screenshot of the Android application for Lidar recordings
4.11 Picture about the laboratory and the recording set-up
4.12 Presentation of the axes used in the Android application
5.1 Figure of the Vatic user interface
5.2 Photo of the recording process
5.3 Example of the 3D images built

List of Abbreviations

SATM School of Aerospace Technology and Manufacturing
UAV Unmanned Aerial Vehicle
UAS Unmanned Aerial System
UA Unmanned Aircraft
UGV Unmanned Ground Vehicle
HOG Histogram of Oriented Gradients
RC Radio Controlled
ROS Robotic Operating System
IMU Inertial Measurement Unit
DoF Degree of Freedom
SLAM Simultaneous Localization And Mapping
ROI Region Of Interest
Vatic Video Annotation Tool from Irvine, California

Absztrakt

Ezen dolgozatban bemutatásra kerül egy pilóta nélküli repülő járműre szerelt követő rendszer, ami beltéri objektumokat hivatott detektálni, és legfőbb célja a földi egység megtalálása, amely leszálló- és újratöltő állomásként fog szolgálni.

Először a projekt, illetve a tézis céljai lesznek felsorolva és részletezve. Ezt követi egy részletes irodalomkutatás, amely bemutatja a hasonló kihívásokra létező megoldásokat. Szerepel egy rövid összefoglalás a pilóta nélküli járművekről és alkalmazási területeikről, majd a legismertebb objektumdetektáló módszerek kerülnek bemutatásra. A kritika tárgyalja előnyeiket és hátrányaikat, különös tekintettel a jelenlegi projektben való alkalmazhatóságukra.

A következő rész a fejlesztési körülményekről szól, beleértve a rendelkezésre álló szoftvereket és hardvereket.

A feladat kihívásainak bemutatása után egy moduláris architektúra terve kerül bemutatásra, figyelembe véve a célokat, erőforrásokat és a felmerülő problémákat.

Ezen architektúra egyik legfontosabb modulja a detektáló algoritmus; legfrissebb változata részletezve is szerepel a következő fejezetben, képességeivel, módjaival és felhasználói felületével együtt.

A modul hatékonyságának mérésére létrejött egy kiértékelő környezet, mely képes számos metrikát kiszámolni a detekcióval kapcsolatban. Mind a környezet, mind a metrikák részletezve lesznek a következő fejezetben, melyet a legfrissebb algoritmus által elért eredmények követnek.

Bár ez a dolgozat főként a hagyományos (2D) képeken operáló detekciós módszerekre koncentrál, 3D képalkotási és feldolgozó módszerek szintén megfontolásra kerültek. Elkészült egy kísérleti rendszer, amely képes látványos és pontos 3D térképek létrehozására egy 2D lézer szkenner használatával. Számos felvétel készült a megoldás kipróbálására, amelyek a rendszerrel együtt bemutatásra kerülnek.

Végül az implementált módszerek és az eredmények összefoglalója zárja a dolgozatot.

Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging of the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed, along with their suitability for the discussed project.

Then the environment of the development is described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluation tools are introduced to analyse efficiency and speed.

The first version of the ground robot detection algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two-dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV and are presented in this paper as well.

Finally, all implemented methods and relevant results are summarised.

Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards, the structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system, based on one or more unmanned autonomous vehicles, which is able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly in real time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required so the map would be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision (for example when mapping a dangerous area which would be too far for real-time control) to coordinate the process. Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-layer buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air; thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means more batteries, therefore longer operation time. On the other hand, it cannot elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles blocking its way or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable to the ground vehicle. With sophisticated route planning the areas scanned twice can be minimized, resulting in faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution to this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction to unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief overview of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes: the smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, unmanned aerial vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurance using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore, flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long-endurance, high-altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly as the wing on fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is why no large-scale multirotors are used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas, and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, as many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example of this application field is the location of an earthquake or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have proven useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low-cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wildlife monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs; [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles aiming to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item

• Recognition: localization of all the items present in the stimulus

• Understanding: recognition plus the role of the stimulus in the context of the scene

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large, complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts have been made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object, as for example in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would cause a change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.
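For illustration only, the sketch below shows how such an adaptive background model is typically used in practice. It assumes the OpenCV 3.x C++ API (cv::createBackgroundSubtractorMOG2) and a static camera; the parameters are illustrative and this is not part of the system implemented in this thesis.

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);                 // any static camera stream
    if (!cap.isOpened()) return 1;

    // Gaussian-mixture background model: 500-frame history, variance threshold 16.
    auto subtractor = cv::createBackgroundSubtractorMOG2(500, 16.0, true);

    cv::Mat frame, foreground;
    while (cap.read(frame)) {
        // Updates the background model online and returns the foreground mask;
        // connected foreground patches can then be treated as object candidates.
        subtractor->apply(frame, foreground);
        cv::imshow("foreground mask", foreground);
        if (cv::waitKey(30) == 27) break;    // ESC to quit
    }
    return 0;
}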

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern in the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].
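As an illustration, the normalized cross-correlation used for such comparisons (e.g. in [43]) can be written as follows (notation ours: T is the template, I the image, and the bars denote means over the template window at offset (u, v)):

R(u,v) = \frac{\sum_{x,y}\big(T(x,y)-\bar{T}\big)\big(I(u+x,v+y)-\bar{I}_{u,v}\big)}{\sqrt{\sum_{x,y}\big(T(x,y)-\bar{T}\big)^{2}\;\sum_{x,y}\big(I(u+x,v+y)-\bar{I}_{u,v}\big)^{2}}}

The position with the highest R(u, v) is taken as the best match.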

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross-correlation has to be calculated at every position. Several efforts have been published to reduce the processing time: [43] uses normalized cross-correlation for faster comparison, while [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method; in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well. [48] recognizes edge based and patch based features; [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: Descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: As patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition and the amount of relevant literature, examples and implementations. All of them are well known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: The first stage searches over the image at multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: At each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: After selecting the interesting points, the gradients around them are calculated at a selected scale.

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching or to recognise a concrete object instead of the general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
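As a sketch of how such keypoint matching is commonly done in practice, the following C++ fragment detects SIFT keypoints in a template of the robot and in a camera frame, and keeps matches passing Lowe's ratio test. It assumes an OpenCV build in which SIFT is available (e.g. cv::SIFT::create() in OpenCV 4.4+); the file names are hypothetical and this is not the detector developed in this thesis.

#include <opencv2/opencv.hpp>
#include <opencv2/features2d.hpp>

int main() {
    // Hypothetical inputs: a template of the object and a test frame.
    cv::Mat object = cv::imread("robot_template.png", cv::IMREAD_GRAYSCALE);
    cv::Mat scene  = cv::imread("camera_frame.png",  cv::IMREAD_GRAYSCALE);
    if (object.empty() || scene.empty()) return 1;

    cv::Ptr<cv::Feature2D> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(object, cv::noArray(), kp1, desc1);
    sift->detectAndCompute(scene,  cv::noArray(), kp2, desc2);

    // Brute-force matching with Lowe's ratio test to discard ambiguous matches.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);
    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            good.push_back(m[0]);

    // The surviving matches could then be used (e.g. with a homography fit)
    // to estimate the object's location in the frame.
    return 0;
}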

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values at the corners of the rectangles.
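In formulas (notation ours), the integral image of an input image i is

ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y'),

and the sum of intensities over any upright rectangle with top-left corner (x_1, y_1) and bottom-right corner (x_2, y_2) needs only four look-ups:

\sum_{x_1 \le x \le x_2,\; y_1 \le y \le y_2} i(x, y) = ii(x_2, y_2) - ii(x_1 - 1, y_2) - ii(x_2, y_1 - 1) + ii(x_1 - 1, y_1 - 1).

A two-rectangle Haar-like feature is therefore the difference of two such sums, i.e. a handful of additions and subtractions regardless of the rectangle size.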

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and has inspired several research works. [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves. The bottom row presents the features overlaid on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are used widely in computer vision for various purposes: vehicle detection ([55]), hand gesture recognition ([56]) and pedestrian detection ([57]), just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation variance, which in spite of several attempts (e.g. [54]) is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization, produces a histogram of the different orientations (called bins) over small portions of the image (called cells). After a location-based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to the orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
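A minimal sketch of using HOG features in practice is shown below; it uses OpenCV's HOGDescriptor with the built-in Dalal-Triggs pedestrian model (64×128 window, 8×8 cells, 9 bins by default). The file name and parameters are illustrative only, and this is not the HOG-based detector trained in this thesis.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("frame.png");          // hypothetical input frame
    if (img.empty()) return 1;

    cv::HOGDescriptor hog;                          // default Dalal-Triggs layout
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    std::vector<cv::Rect> detections;
    // Slides the detection window over an image pyramid and returns the hits.
    hog.detectMultiScale(img, detections, 0, cv::Size(8, 8),
                         cv::Size(32, 32), 1.05, 2);

    for (const auto& r : detections)
        cv::rectangle(img, r, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detections.png", img);
    return 0;
}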

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x|(θ, c)) P(c)

is ideally 1 if x contains an instance of class c and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the probability

P(c|(θ, x))

which is expected to be ideally 1 if x contains an instance of class c and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to be detected, two of the most well-known discriminative methods will be presented: Adaptive Boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expense caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights wt,i, where N = number of training examples and T = number of features in the stage.

• For t = 1, ..., N:

  1. Normalize the weights.

  2. Select the best classifier using only a single feature, by minimising the detection error εt = Σi wi |h(xi, f, p, θ) − yi|, where h(xi) is the classifier output and yi is the correct label (both with a range of 0 for negative, 1 for positive).

  3. Define ht(x) = h(x, ft, pt, θt), where ft, pt and θt are the minimizers of the error above.

  4. Update the weights: wt+1,i = wt,i · (εt / (1 − εt))^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise.

• The final classifier for the stage is based on the sum of the weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (see also sub-subsection 2.2.2.2).

2.2.2.6 Support Vector Machine

Support vector machines (SVM, also called support vector networks) are a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the elements of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and thereby determining the width of the margin) are called support vectors (hence the name of the method). See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.
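For the linearly separable case, the construction above can be written compactly (notation ours): given training pairs (x_i, y_i) with labels y_i \in \{-1, +1\}, the SVM solves

\min_{w, b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1, \dots, N,

which yields a margin of width 2 / \lVert w \rVert; a new sample x is then classified by \operatorname{sign}(w \cdot x + b).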

The algorithm was extended to not linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].
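Since Dlib (see subsection 3.2.4) is the machine learning library used in this project, a minimal, hedged sketch of training a binary linear SVM with it is given below. The two-dimensional toy samples are made up purely for illustration; the real detectors operate on feature vectors extracted from images.

#include <dlib/svm.h>
#include <vector>

int main() {
    typedef dlib::matrix<double, 2, 1> sample_type;   // 2D toy feature vectors
    typedef dlib::linear_kernel<sample_type> kernel_type;

    std::vector<sample_type> samples;
    std::vector<double> labels;                       // +1 / -1
    for (int i = 0; i < 20; ++i) {
        sample_type s;
        if (i % 2 == 0) { s(0) =  1.0; s(1) =  2.0; labels.push_back(+1); }
        else            { s(0) = -1.0; s(1) = -2.0; labels.push_back(-1); }
        samples.push_back(s);
    }

    dlib::svm_c_linear_trainer<kernel_type> trainer;
    trainer.set_c(10);                                // soft-margin penalty (illustrative)

    // The learned decision function returns a signed score; > 0 means positive class.
    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);
    double score = df(samples[0]);
    return score > 0 ? 0 : 1;
}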


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in Section 1.3. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarises the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Also, the Pixhawk will be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multirotors during flight. Therefore it was suitable to use it for mapping purposes. See 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide-angle field of view.


Figure 3.2: The chosen LIDAR sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 LiDar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. The basic concept of these sensors is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level, interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
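As a small illustration of this message passing, the following roscpp node subscribes to laser scans; the topic name "scan" and the queue size are assumptions that depend on the actual driver configuration, and the snippet is not part of the system described later.

#include <ros/ros.h>
#include <sensor_msgs/LaserScan.h>

// Called for every 2D sweep published by the lidar driver.
void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan) {
    ROS_INFO("Received %zu range readings", scan->ranges.size());
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "lidar_listener");
    ros::NodeHandle nh;
    ros::Subscriber sub = nh.subscribe("scan", 10, scanCallback);
    ros::spin();   // hand control to ROS; the callback fires as messages arrive
    return 0;
}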

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate image), image processing tools (e.g. different kinds of edge detection, threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, examples and support are widely available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.

3.2.4 Dlib

Dlib is a general purpose, cross-platform C++ library. It has an excellent machine learning library, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

It is worth mentioning that aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].
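To show how such a custom detector can be trained with Dlib, a condensed sketch based on Dlib's HOG ("fhog") object detection pipeline is given below. The annotation file name, the 80×80 window and the C value are placeholders, not the settings used for the ground robot detector described later.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

int main() {
    using namespace dlib;

    // Images and their hand-labelled bounding boxes (hypothetical XML dataset).
    dlib::array<array2d<unsigned char>> images;
    std::vector<std::vector<rectangle>> boxes;
    load_image_dataset(images, boxes, "training.xml");

    // HOG feature pyramid scanner with a fixed-size detection window.
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);

    // Structural SVM trainer that learns a single HOG template and threshold.
    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);

    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Running the detector returns the candidate rectangles on an image.
    std::vector<rectangle> dets = detector(images[0]);
    return dets.empty() ? 1 : 0;
}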


Chapter 4

Designing and implementing the algorithm

In this chapter, challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given of the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced with example images and the mathematical calculations.

41 Challenges in the taskThe biggest challenges of the task are listed below

1 Unique object Maybe the biggest challenge of this task is the fact that therobot itself is a completely unique ground vehicle designed as part of theproject Thus no ready-to-use solutions are available which could partly orcompletely solve the problem For frequently researched object recognitiontasks like face detection several open-source codes examples and articlesare available The closest topic for a ground robot detection is car recogni-tion but those vehicles are too different to use any ready detectors Thusthe solution of this task have to be completely self-made and new

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are significantly limited. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which has to be considered when designing and assigning the processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Furthermore, they add noise and blur to the recordings. Finally, no background-foreground separation algorithms are available for the moving camera. See subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, on the UAV platform the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various size of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it appears huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing manoeuvre, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms, which is provided by another algorithm specially designed for the task. Thus, the software discussed here does not have hard real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


Figure 4.1: A diagram of the designed architecture, showing the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow.


4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a complete modular architecture was designed which is suitable for the defined objectives. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate for the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system. The following enumeration lists the main parts of the architecture; every module already implemented will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an essential part of the system, since it produces the classifiers for the detector. Its output has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Due to the current state of the project the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus, methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. Due to the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative, the currently seen 3D sub-map (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects of more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the concerns, methods and development process of the 2D camera image processing will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more dominant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable for recognising a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also due to moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4 in section 4.1). This means that even if the camera is levelled, the object itself could appear rotated due to the perspective. To overcome this issue, solutions will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational cost are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, seriously optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentation pages are available for both training and detection, which simplified the implementation.

As a classifier, Support Vector Machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular classifiers. Also, SVMs are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++ and can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since otherwise a positive sample might get into the negative training set.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
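The described pipeline closely follows Dlib's fHOG object detector example; a condensed sketch is given below. The xml file name, the detection window size and the SVM C parameter are illustrative values, not the ones used for the final classifiers.

#include <iostream>
#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

int main()
{
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    // training images and the annotated bounding boxes listed in the xml file
    dlib::array<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    load_image_dataset(images, boxes, "training.xml");

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);      // illustrative window size

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                               // SVM regularization parameter

    // non-annotated areas of the training images act as negative samples
    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // quick test on a labelled set, then save the classifier by serialization
    std::cout << test_object_detection_function(detector, images, boxes) << std::endl;
    serialize("groundrobotside.svm") << detector;
    return 0;
}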

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and the right, while the front and rear views are also very similar. This is especially true if the working principle of HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they not only detect the position of the robot but, depending on which one of them detects it, the system also gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at ground level. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even when viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector; both the wheels and the body are easy to recognize thanks to their strong edges. In 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front-view detector; notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that usually their training datasets (the images of the robot or any other object) are cropped, only containing the object and some margin around it. As a result, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus, a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.


Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with a predefined step-size and at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes) eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
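The sketch below only illustrates the enumeration of window positions and scales; in the implemented system Dlib's HOG scanner performs an optimized version of this search internally, so this code is conceptual rather than part of the project, and the step and scale values are illustrative.

#include <opencv2/opencv.hpp>
#include <vector>

// Enumerate candidate rectangles for a given image size
std::vector<cv::Rect> slidingWindows(const cv::Size& image, cv::Size window,
                                     int step = 8, double scaleFactor = 1.25,
                                     int numScales = 5)
{
    std::vector<cv::Rect> candidates;
    for (int s = 0; s < numScales; ++s)
    {
        for (int y = 0; y + window.height <= image.height; y += step)
            for (int x = 0; x + window.width <= image.width; x += step)
                candidates.push_back(cv::Rect(x, y, window.width, window.height));
        // grow the window for the next scale so larger robots are covered too
        window.width  = static_cast<int>(window.width  * scaleFactor);
        window.height = static_cast<int>(window.height * scaleFactor);
    }
    return candidates;   // each rectangle is cropped, resized and classified
}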

It is worth mentioning that multiple instances of the sought object may be present on the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus, methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3 in section 4.2) in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module with another or a newly developed one for further development.

Intuitively, a good separating feature is colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus these areas are still scanned.

Another idea is to filter the image by the detections themselves on the previous frames. In other words, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means following a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was presented.


Figure 4.4: The result of an edge detection on a typical input image of the system. Note how the ground robot has significantly more edges than its environment; on the other hand, chairs and other objects have a lot of edges as well.

Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
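A minimal usage sketch of this tracker is shown below; the frames are assumed to arrive as OpenCV matrices and the initial rectangle is assumed to come from one of the trained SVM detectors.

#include <opencv2/opencv.hpp>
#include <dlib/opencv.h>
#include <dlib/image_processing.h>

// Start following the area found by a detector and update it on the next frame
void trackRobot(const cv::Mat& firstFrame, const cv::Mat& nextFrame,
                const dlib::rectangle& detectionBox)
{
    dlib::correlation_tracker tracker;
    tracker.start_track(dlib::cv_image<dlib::bgr_pixel>(firstFrame), detectionBox);

    // update() returns a confidence-like value; get_position() is the new estimate
    double confidence = tracker.update(dlib::cv_image<dlib::bgr_pixel>(nextFrame));
    dlib::drectangle position = tracker.get_position();
    (void)confidence; (void)position;   // consumed by the detector logic of mode 4
}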

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame, since its position will be mapped into the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development four different approaches were implemented. Each of them is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below:

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as separate software. Instead, all were included in the same code, which decides which mode to execute at run-time based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 4.1: The available parameters

Name                 Valid values      Function
input                path to video     Video used as input for detection
svm                  path to SVMs      These SVMs will be used
mode                 1, 2, 3, 4        Selects which mode is used
saveFrames           0/1               Turns on video frame export
saveDetections       0/1               Turns on detection box export
saveFPS              0/1               Turns on frame-rate measurement
displayVideo         0/1               Turns on video display
DetectionsFileName   string            Sets the filename for saved detections
FramesFolderName     string            Sets the folder name used for saving video frames
numberOfLoops        integer (>0)      Sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarizes all the implemented parameters with the possible values and a short description.

Note that all these parameters have default values, thus it is possible but not compulsory to include every one of them. The text shown below is an example input parameter file:

Example parameter file of the detector algorithm
input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections) nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful for simulating longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.4), both "slide" across the input image. Both return a vector of rectangles where they found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this extensive search is very computationally heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient because of the way the two used classifiers were trained: in figure 4.2 it can be seen that one of them represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is viewed either from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but due to the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
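The selection logic described above can be summarized with the following sketch; the variable names and the default tolerance value are illustrative and do not come from the project source.

#include <vector>
#include <dlib/image_processing.h>

typedef dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6> > > detector_type;

// Run the remembered classifier first and fall back to all of them after
// too many consecutive frames without a detection
template <typename image_type>
std::vector<dlib::rectangle> detectMode2(std::vector<detector_type>& detectors,
                                         const image_type& frame,
                                         int& lastUsed, int& missedFrames,
                                         int toleranceLimit = 5)
{
    std::vector<dlib::rectangle> hits;
    if (lastUsed >= 0 && missedFrames <= toleranceLimit)
    {
        hits = detectors[lastUsed](frame);
        if (hits.empty()) ++missedFrames; else missedFrames = 0;
        return hits;
    }
    for (size_t i = 0; i < detectors.size(); ++i)   // no reliable memory: try every classifier
    {
        hits = detectors[i](frame);
        if (!hits.empty()) { lastUsed = static_cast<int>(i); missedFrames = 0; break; }
    }
    return hits;
}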

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. The possible movement of the camera is involved as well: for example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection, alongside the detector which returned it.


Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest; the blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only within it. Note the ratio between the ROI and the full size of the image.

A new rectangle named ROI (region of interest) was introduced, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other words, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, due to the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default) and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returns a detection inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missed detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
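A sketch of this ROI update rule is given below; the 50% and 3% values follow the defaults quoted above, while the way the growth is split between the two sides and the clamping against the frame are assumptions made for the illustration.

#include <algorithm>
#include <dlib/geometry.h>

dlib::rectangle updateRoi(const dlib::rectangle& roi, bool hadDetection,
                          const dlib::rectangle& lastDetection,
                          long imgWidth, long imgHeight)
{
    // grow the last detection by ~50% of its size after a hit,
    // or the current ROI by ~3% after a missed frame
    dlib::rectangle grown = hadDetection
        ? dlib::grow_rect(lastDetection, (long)lastDetection.width() / 4,
                                         (long)lastDetection.height() / 4)
        : dlib::grow_rect(roi, std::max(1L, (long)roi.width() * 3 / 200),
                               std::max(1L, (long)roi.height() * 3 / 200));
    // once the ROI covers the whole frame, mode 3 behaves exactly like mode 2
    return grown.intersect(dlib::rectangle(0, 0, imgWidth - 1, imgHeight - 1));
}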

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.


Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", due to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracking algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, under appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot than the detected sides (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues about its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears, and then the tracker is reinitialized.
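The validation step of mode 4 could be sketched as follows. For brevity the classifiers are run on the full frame and filtered by the ROI here, while the real implementation restricts the search area itself; the growth values and the tolerance default are also illustrative.

#include <vector>
#include <opencv2/opencv.hpp>
#include <dlib/opencv.h>
#include <dlib/image_processing.h>

typedef dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6> > > detector_type;

// Returns false once the object has to be labelled as lost
bool validateTrack(dlib::correlation_tracker& tracker, const cv::Mat& frame,
                   std::vector<detector_type>& detectors,
                   int& failedValidations, int toleranceLimit = 10)
{
    dlib::cv_image<dlib::bgr_pixel> img(frame);
    tracker.update(img);

    // build a ROI around the tracked box (the enlargement is illustrative)
    dlib::drectangle pos = tracker.get_position();
    dlib::rectangle tracked((long)pos.left(), (long)pos.top(),
                            (long)pos.right(), (long)pos.bottom());
    dlib::rectangle roi = dlib::grow_rect(tracked, 40);

    for (size_t i = 0; i < detectors.size(); ++i)
    {
        std::vector<dlib::rectangle> hits = detectors[i](img);
        for (size_t j = 0; j < hits.size(); ++j)
        {
            if (!roi.contains(hits[j]))
                continue;
            // a detection confirms the track: reinitialize on an enlarged box
            tracker.start_track(img, dlib::grow_rect(hits[j], (long)hits[j].width() / 4));
            failedValidations = 0;
            return true;
        }
    }
    return ++failedValidations <= toleranceLimit;
}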

See figure 4.7(a) for a representation of the processing method of mode 4.


Figure 4.7: (a) A presentation of mode 4: the red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area, and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the used lidar sensor is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

In figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). In part a it is shown in two different orientations: first when the recording plane is completely flat, secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. The red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (from 135° to −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and it is marked in blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself. In parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification. The file name is user defined.

Since the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two datasets were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.


Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross; green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


However, since the two systems were not connected, the timestamps had to be synchronized manually.

In figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The x axis is parallel with the shorter side of the screen, pointing from left to right. The y axis is parallel to the longer side of the screen, pointing from bottom to top. The z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. The angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface.

• roll is the rotation around the y axis.

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. In figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity in both hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


Figure 4.11: (a) Picture of the laboratory, the recorded scene and the mounting of the phone; (b) and (c) the 3D map of the scan from different viewpoints ((b) is an elevation view). Some objects are marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6 and 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the obtained point coordinates were in the lidar sensor's own coordinate system. As shown in figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z (x = y × z), pointing to the right;

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward;

• the z axis points up and is perpendicular to the ground.

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three represent its position along the x, y, z axes; the other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen in figure 4.8, the roll and yaw angles are not able to change, thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen in figure 4.8, part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change, and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated from dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)


where dy is the translation required along the y axis and dz along the z axis. Offset is the distance between the light source and the axis of rotation, presented in figure 4.8, part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get:

[x, y, z]ᵀ = distance · [sin(−yaw), cos(yaw)·sin(pitch), cos(yaw)·cos(pitch)]ᵀ + offset · [0, sin(pitch), cos(pitch)]ᵀ    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These are suitable for further work, such as SLAM (simultaneous localization and mapping) or the ground robot detection.
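Equation 4.6 translates directly into a small helper function; the structure and function names below are illustrative, and the angles are assumed to be given in radians.

#include <cmath>

struct Point3D { double x, y, z; };

// Map one lidar measurement (distance, yaw) taken at a given pitch angle
// into the ground-fixed coordinate system, following equation 4.6
Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
{
    Point3D p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}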


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check whether it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of True positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of False positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite it not being one. In other words, the system believes the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (rejected) although it should be. In this current task, a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if conditions are loosened in the detection method, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid some mistakes. Such a project can be a manually supervised classification where false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project, minimising the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset with its samples already labelled (either fully manually or using a semi-supervised classifier) or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use on-line video annotation tool designed specially for computer vision research [78].

After successful installation and connection to the server, a clear interface is visible; see an example screenshot in figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kind of object was requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, drawing a bounding box around them) on every frame of the videos (there are options to label an object as occluded/obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only task remaining is to correct the interpolations between them if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
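A minimal sketch of such a measurement loop is given below, assuming OpenCV for the video input. The processFrame function is only a placeholder standing in for the detection call, and the file names are illustrative, not taken from the project.

#include <chrono>
#include <fstream>
#include <opencv2/opencv.hpp>

// Placeholder for the actual detection step on one frame.
void processFrame(const cv::Mat& frame)
{
    cv::Mat blurred;
    cv::GaussianBlur(frame, blurred, cv::Size(5, 5), 0);  // stand-in workload
}

int main()
{
    cv::VideoCapture video("test_video.avi");   // illustrative input file
    std::ofstream log("frame_times.txt");       // one elapsed time (in seconds) per line
    cv::Mat frame;
    while (video.read(frame))
    {
        auto start = std::chrono::steady_clock::now();
        processFrame(frame);
        auto end = std::chrono::steady_clock::now();
        log << std::chrono::duration<double>(end - start).count() << "\n";
    }
    return 0;
}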

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every unnecessary process was terminated so that their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes caused by the operating system during the test did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as in modes 2, 3 and 4 the processing method changes during the execution by definition, which results in a varying processing time; variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average elapsed seconds: 0.032446
Variance of elapsed seconds per frame:
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average frames per second rate of the processing: 30.8204 FPS
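The analysis itself was done in a Matlab script, which is not reproduced here; the following sketch only illustrates, in C++, the statistics it reports (shortest, longest and mean processing time, variance, and the implied average frame-rate), reading the hypothetical log file produced by the timing sketch above.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> t;                     // per-frame processing times in seconds
    std::ifstream log("frame_times.txt");      // assumed log file name
    for (double v; log >> v; ) t.push_back(v);
    if (t.empty()) return 1;

    double mean = std::accumulate(t.begin(), t.end(), 0.0) / t.size();
    double sq = 0.0;
    for (size_t i = 0; i < t.size(); ++i) sq += (t[i] - mean) * (t[i] - mean);
    double variance = sq / t.size();

    std::cout << "frames:   " << t.size() << "\n"
              << "longest:  " << *std::max_element(t.begin(), t.end()) << " s\n"
              << "shortest: " << *std::min_element(t.begin(), t.end()) << " s\n"
              << "mean:     " << mean << " s, variance: " << variance << "\n"
              << "average FPS: " << 1.0 / mean << "\n";
    return 0;
}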

The change of the average frame-rate between video loops was also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and thus the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that 3D lidar scanners exist, but they still scan in a plane and have this mentioned processing built in. Also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras; it contains valuable information for a project like this and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

In figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot in 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
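As an illustration of the simpler option (a threshold based on height), such a filter could look like the following sketch. The Point3D structure and the height limits are assumptions for the example, not values used in the thesis.

#include <vector>

struct Point3D { double x, y, z; };   // z is the altitude in the ground-fixed frame

// Keep only points whose altitude falls inside the height band expected for the
// ground robot (minZ and maxZ are illustrative limits, not measured values).
std::vector<Point3D> filterByHeight(const std::vector<Point3D>& cloud,
                                    double minZ, double maxZ)
{
    std::vector<Point3D> kept;
    for (size_t i = 0; i < cloud.size(); ++i)
        if (cloud[i].z >= minZ && cloud[i].z <= maxZ)
            kept.push_back(cloud[i]);
    return kept;
}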

5.3 Discussion of results

In general it can be said that the detector modes worked as expected on the recorded test videos with suitable efficiency. Example videos for all modes can be found online¹. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be characterised by several values; here recall and precision will be used. Recall is defined by

Recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

Precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, precision can be improved at the cost of an increased number of false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4) based on position and the overlapping areas. For the latter a minimum of 50% was defined (ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, because the overlap of the rectangles is not sufficient. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.
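A minimal sketch of this overlap criterion is shown below: a detection is accepted as a true positive only if the intersection of the detected and annotated rectangles covers at least 50% of their union. The Rect structure and the function names are illustrative, not taken from the evaluation tool.

#include <algorithm>

struct Rect { double x, y, w, h; };   // top-left corner, width and height

// Ratio of the intersection area to the union area of two axis-aligned rectangles.
double overlapRatio(const Rect& a, const Rect& b)
{
    double ix = std::max(0.0, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    double iy = std::max(0.0, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    double inter = ix * iy;
    double uni   = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0 ? inter / uni : 0.0;
}

// A detection matches an annotation only if the overlap is at least 50%.
bool isTruePositive(const Rect& detection, const Rect& annotation)
{
    return overlapRatio(detection, annotation) >= 0.5;
}

Counting the matched and unmatched rectangles per frame in this way yields the TP, FP and FN values used in the recall and precision formulas above.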

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first three modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the area of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (95% recall) and speed (30 FPS average frame-rate) using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed from the aspect of suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or to replace modules in. The currently implemented and used modules were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 summarises the chosen feature extraction (HOG) and classifier (SVM) methods and presents the implemented training software along with the produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features, like exporting detections, demonstration videos and frame rate measurements, were implemented. To make the development simpler, an easy-to-use interface was included which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace this with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well, if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.

Special attention was given to the evaluation of the system. Two software tools were developed: one for evaluating the efficiency of the detections and another for analysing the processing time and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detecting algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 summarises the progress made related to 3D mapping, along with the applied mathematics. An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front) but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely frontal angles) will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Besides basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Besides the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so that corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time; theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks goes to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview senseFly" [Online]. Available: httpswwwsenseflycomdronesoverviewhtml [Accessed at 2015-08-08].

[2] H. Chao, Y. Cao and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine and J. How, "Hover, Transition and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: httparcaiaaorgdoiabs10251462007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications" [Online]. Available: httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08].

[5] "DJI Store: Phantom 3 Standard" [Online]. Available: httpstoredjicomproductphantom-3-standard [Accessed at 2015-08-08].

[6] "World War II V-1 Flying Bomb - Military History" [Online]. Available: httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed at 2015-08-08].

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A. Jaimes, S. Kota and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04 2004, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge and R. W. Beard, "Target acquisition, localization and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: httparcaiaaorgdoiabs10251462010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: httpwwwmdpicom2072-42924113390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: httplinkspringercom101007s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: httpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C. Bills, J. Chen and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-t. Nghiem, F. Bremond, I.-s. Antipolis and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951ampAgg=doi


[44] H. Schweitzer, J. W. Bell and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997. [Online]. Available: papers2publicationuuid9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: httpdxdoiorg101109CVPR2005177

[54] J. M. Rainer Lienhart, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S. Han, Y. Han and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. C. Q. Chen, N. Georganas and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer" [Online]. Available: httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10].

[70] "Pixhawk flight controller" [Online]. Available: httproverardupilotcomwikipixhawk-wiring-rover [Accessed at 2015-08-10].

[71] "Scanning range finder UTM-30LX-EW" [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed at 2015-07-22].

[72] "openCV manual, Release 2.4.9" [Online]. Available: httpdocsopencvorgopencv2refmanpdf [Accessed at 2015-07-20].

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: httpjmlrcsailmitedupapersv10king09ahtml

[74] "dlib C++ Library" [Online]. Available: httpdlibnet [Accessed at 2015-07-21].

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: httparxivorgabs150200046

[76] M. Danelljan, G. Häger and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference BMVC, 2014.

[77] "UrgBenri Information Page" [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed at 2015-07-20].

[78] "vatic - Video Annotation Tool - UC Irvine" [Online]. Available: httpwebmiteduvondrickvatic [Accessed at 2015-07-24].

[79] "Amazon Mechanical Turk" [Online]. Available: httpswwwmturkcommturkwelcome [Accessed at 2015-07-26].


Page 3: Indoor localisation and classification of objects for an Unmanned …users.itk.ppke.hu/~palan1/Theses/Andras_Palffy_MSc... · 2015. 12. 21. · Image processing techniques will mainly

reasonable speed on an embedded system onboard (ARM Nitrogen6x) thus the

developed algorithms need to be ported and optimised for the chosen onboard platform

If time permits the student will also consider sensor fusion (data associations) to enable

the system to operate robustly in different lightining conditions

A teacutemavezeteacutest vaacutellalom

A teacutemavezető alaacuteiacuteraacutesa

Signature of Supervisor

A hallgatoacute teacutemavezeteacuteseacutet belső konzulenskeacutent vaacutellalom

Belső konzulens alaacuteiacuteraacutesa

Keacuterem a diplomaterv teacutemaacutejaacutenak joacutevaacutehagyaacutesaacutet

Budapest

A hallgatoacute alaacuteiacuteraacutesa

A diplomaterv teacutemaacutejaacutet az Informaacutecioacutes Technoloacutegiai eacutes Bionikai Kar joacutevaacutehagyta

Budapest

Dr Szolgay Peacuteter

deacutekaacuten

A hallgatoacute a konzultaacutecioacutekon reacuteszt vett eacutes a kiiacuteraacutesban foglalt feladatokat teljesiacutetette

Budapest

A teacutemavezető alaacuteiacuteraacutesa

A dolgozat a TVSZ 3 sz melleacutekleteacuteben foglalt tartalmi eacutes formai koumlvetelmeacutenyeknek megfelel

Budapest

Belső konzulens alaacuteiacuteraacutesa

Aluliacuterott Paacutelffy Andraacutes a Paacutezmaacuteny Peacuteter Katolikus Egyetem Informaacutecioacutes Technoloacute-giai eacutes Bionikai Karaacutenak hallgatoacuteja kijelentem hogy jelen diplomatervet meg nem en-gedett segiacutetseacuteg neacutelkuumll sajaacutet magam keacutesziacutetettem eacutes a diplomatervben csak a megadott for-raacutesokat hasznaacuteltam fel Minden olyan reacuteszt melyet szoacute szerint vagy azonos eacutertelembende aacutetfogalmazva maacutes forraacutesboacutel aacutetvettem egyeacutertelműen a forraacutes megadaacutesaacuteval megjeloumlltemEzt a diplomatervet csakis Erasmus tanulmaacutenyuacutet kereteacuten beluumll a Cranfield UniversitybdquoComputational amp Software Techniques in Engineeringrdquo MSc keacutepzeacuteseacuten bdquoDigital Signaland Image Processingrdquo szakiraacutenyon nyuacutejtottam be a 20142015taneacutevben

Undersigned Andraacutes PAacuteLFFY student of Paacutezmaacuteny Peacuteter Catholic Universityrsquos Fac-ulty of Information Technology and Bionics I state that I have written this thesis onmy own without any prohibited sources and I used only the described references Everytranscription citation or paraphrasing is marked with the exact source This thesis hasonly been submitted at Cranfield Universityrsquos Computational amp Software Techniques inEngineering MSc course on the Digital Signal and Image Processing option in 2015

Budapest 20151220

Paacutelffy Andraacutes

iii

Contents

List of Figures vii

Absztrakt viii

Abstract ix

List of Abbreviations x

1 Introduction and project description 111 Project description and requirements 112 Type of vehicle 213 Aims and objectives 3

2 Literature Review 521 UAVs and applications 5

211 Fixed-wing UAVs 5212 Rotary-wing UAVs 6213 Applications 8

22 Object detection on conventional 2D images 9221 Classical detection methods 10

2211 Background subtraction 102212 Template matching algorithms 11

222 Feature descriptors classifiers and learning methods 132221 SIFT features 142222 Haar-like features 152223 HOG features 162224 Learning models in computer vision 172225 AdaBoost 182226 Support Vector Machine 19

iv

CONTENTS

3 Development 2131 Hardware resources 21

311 Nitrogen board 21312 Sensors 21

3121 Pixhawk autopilot 223122 Camera 223123 LiDar 23

32 Chosen software 23321 Matlab 23322 Robotic Operating System (ROS) 24323 OpenCV 24324 Dlib 24

4 Designing and implementing the algorithm 2641 Challenges in the task 2642 Architecture of the detection system 2943 2D image processing methods 31

431 Chosen methods and the training algorithm 31432 Sliding window method 34433 Pre-filtering 36434 Tracking 37435 Implemented detector 39

4351 Mode 1 Sliding window with all the classifiers 414352 Mode 2 Sliding window with intelligent choice of

classifier 414353 Mode 3 Intelligent choice of classifiers and ROIs 424354 Mode 4 Tracking based approach 44

44 3D image processing methods 46441 3D recording method 46442 Android based recording set-up 48443 Final set-up with Pixhawk flight controller 51444 3D reconstruction 53

5 Results 5651 2D image detection results 56

511 Evaluation 565111 Definition of True positive and negative 565112 Definition of False positive and negative 565113 Reducing number of errors 575114 Annotation and database building 57

512 Frame-rate measurement and analysis 59

v

CONTENTS

52 3D image detection results 6053 Discussion of results 61

6 Conclusion and recommended future work 6561 Conclusion 6562 Recommended future work 67

References 70

vi

List of Figures

11 Image of the ground robot 3

21 Fixed wing consumer drone 622 Example for consumer drones 723 Example for people detection with background subtraction 1124 Example of template matching 1225 2 example Haar-like features 1626 Illustration of the discriminative and generative models 1727 Example of a separable problem 20

31 Image of Pixhawk flight controller 2232 The chosen LIDAR sensor Hokuyo UTM-30LX 2333 Elements of Dlibrsquos machine learning toolkit 25

41 A diagram of the designed architecture 2842 Visualization of the trained HOG detectors 3543 Representation of the sliding window method 3644 Example image for the result of edge detection 3845 Example of the detectorrsquos user interface 4146 Presentation of mode 3 4347 Mode 4 example output frames 45410 Screenshot of the Android application for Lidar recordings 4848 Schematic figure to represent the 3D recording set-up 4949 Example representation of the output of the Lidar Sensor 50411 Picture about the laboratory and the recording set-up 52412 Presentation of the axes used in the android application 53

51 Figure of the Vatic user interface 5852 Photo of the recording process 6153 Example of the 3D images built 62

vii

List of Abbreviations

SATM School of Aerospace Technology and ManufacturingUAV Unmanned Aerial VehicleUAS Unmanned Aerial SystemUA Unmanned AircraftUGV Unmanned Ground VehicleHOG Histogram of Oriented GradientsRC Radio ControlledROS Robotic Operating SystemIMU Inertial Measurement UnitDoF Degree of FreedomSLAM Simultaneous Localization And MappingROI Region Of InterestVatic Video Annotation Tool from Irvine California

viii

Absztrakt

Ezen dolgozatban bemutataacutesra keruumll egy piloacuteta neacutelkuumlli repuumllő jaacuterműre szereltkoumlvető rendszer ami belteacuteri objektumokat hivatott detektaacutelni eacutes legfőbb ceacutelja afoumlldi egyseacuteg megtalaacutelaacutesa amely leszaacutelloacute eacutes uacutejratoumlltő aacutellomaacuteskeacutent fog szolgaacutelni

Előszoumlr a projekt illetve a teacutezis ceacuteljai lesznek felsorolva eacutes reacuteszletezveEzt koumlveti egy reacuteszletes irodalomkutataacutes amely bemutatja a hasonloacute kihiacutevaacute-

sokra leacutetező megoldaacutesokat Szerepel egy roumlvid oumlsszefoglalaacutes a piloacuteta neacutelkuumlli jaacuter-művekről eacutes alkalmazaacutesi teruumlleteikről majd a legismertebb objektumdetektaacuteloacutemoacutedszerek keruumllnek bemutataacutesra A kritika taacutergyalja előnyeiket eacutes haacutetraacutenyaikatkuumlloumlnoumls tekintettel a jelenlegi projektben valoacute alkalmazhatoacutesaacutegukra

A koumlvetkező reacutesz a fejleszteacutesi koumlruumllmeacutenyekről szoacutel beleeacutertve a rendelkezeacutesreaacutelloacute szoftvereket eacutes hardvereket

A feladat kihiacutevaacutesai bemutataacutesa utaacuten egy modulaacuteris architektuacutera terve keruumllbemutataacutesra figyelembe veacuteve a ceacutelokat erőforraacutesokat eacutes a felmeruumllő probleacutemaacutekat

Ezen architektuacutera egyik legfontosabb modulja a detektaacuteloacute algoritmus legfris-sebb vaacuteltozata reacuteszletezve is szerepel a koumlvetkező fejezetben keacutepesseacutegeivel moacuted-jaival eacutes felhasznaacuteloacutei feluumlleteacutevel egyuumltt

A modul hateacutekonysaacutegaacutenak meacutereacuteseacutere leacutetrejoumltt egy kieacuterteacutekelő koumlrnyezet melykeacutepes szaacutemos metrikaacutet kiszaacutemolni a detekcioacuteval kapcsolatban Mind a koumlrnyezetmind a metrikaacutek reacuteszletezve lesznek a koumlvetkező fejezetben melyet a legfrissebbalgoritmus aacuteltal eleacutert eredmeacutenyek koumlvetnek

Baacuter ez a dolgozat főkeacutent a hagyomaacutenyos (2D) keacutepeken operaacuteloacute detekcioacutesmoacutedszerekre koncentraacutel 3D keacutepalkotaacutesi eacutes feldolgozoacute moacutedszerek szinteacuten meg-fontolaacutesra keruumlltek Elkeacuteszuumllt egy kiacuteseacuterleti rendszer amely keacutepes laacutetvaacutenyos eacutespontos 3D teacuterkeacutepek leacutetrehozaacutesaacutera egy 2D leacutezer szkenner hasznaacutelataacuteval Szaacutemosfelveacutetel keacuteszuumllt a megoldaacutes kiproacutebaacutelaacutesa amelyek a rendszerrel egyuumltt bemutataacutesrakeruumllnek

Veacuteguumll az implementaacutelt moacutedszerek eacutes az eredmeacutenyek oumlsszefoglaloacuteja zaacuterja adolgozatot

ix

Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is pre-sented with the aim of detecting objects indoor (potentially extended to outdooroperations as well) and localize a ground robot that will serve as a landing plat-form for recharging of the UAV

First the project and the aims and objectives of this thesis are introducedand discussed

Then an extensive literature review is presented to give overview of the ex-isting solutions for similar problems Unmanned Aerial Vehicles are presentedwith examples of their application fields After that the most relevant objectrecognition methods are reviewed and their suitability for the discussed project

Then the environment of the development will be described including theavailable software and hardware resources

Afterwards the challenges of the task are collected and discussed Consideringthe objectives resources and the challenges a modular architecture is designedand introduced

As one of the most important module the currently used detector is intro-duced along with its features modes and user interface

Special attention is given to the evaluation of the system Additional evalu-ating tools are introduced to analyse efficiency and speed

The ground robot detecting algorithmrsquos first version is evaluated giving promis-ing results in simulated experiments

While this thesis focuses on two dimensional image processing and objectdetection methods 3D image inputs are considered as well An experimentalsetup is introduced with the capability to create spectacular and precise 3Dmaps about the environments with a 2D laser scanner To test the idea of usingthe 3D image as an input for the ground robot detection several recordings weremade about the UGV and presented in this paper as well

Finally all implemented methods and relevant results are concluded

x

Chapter 1

Introduction and projectdescription

In this chapter an introduction is given to the whole project which this thesis ispart of Afterwards a structure of the sub-tasks in the project is presented alongwith the recognized challenges Then the aims and objectives of this thesis arelisted

1.1 Project description and requirements

The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications, such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, in nearly real-time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required so the map would be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision to coordinate the process (for example when mapping a dangerous area which would be too far for real-time control). Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-storey buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has a lot more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, therefore longer operation time. On the other hand, it can't elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles closing its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture.

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable for the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in a faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As it can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors, or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution for this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction of unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief overview of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, unmanned aerial vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurances using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore, flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long endurance, high altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus, rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly as the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.

Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known: UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason why no large scale multirotor is used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shortens operational time and increases the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].

In spite of the disadvantages listed above, rotary-wing, and especially multirotor, UAVs seem like an excellent choice for this project, since there is no need to cover large areas, and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example for this application field is the location of an earthquake or searching for missing people in the wilderness; see [7] and [8] for examples. [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs were admittedly proven useful is surveillance. [10] is an interesting example for this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks, for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above focus rather on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples for this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus, the topic has extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles aiming to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them by:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch, as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it, and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As it can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Beside the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example for background subtraction used for people detection. The red patches show the object in the segmented foreground. Source: [39]

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus, all possible objects of interest are very well separated. The fixed position of the camera means fixed perspectives, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.
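As an illustration of the basic technique (not part of the final system, since the on-board camera moves), the minimal sketch below uses OpenCV's MOG2 background subtractor to segment moving foreground blobs seen by a static camera; the history length and threshold values are assumptions chosen only for demonstration.

    #include <opencv2/opencv.hpp>

    int main() {
        cv::VideoCapture cap(0);   // a static camera attached to the computer
        // Learns and maintains the background model: 500-frame history, shadow detection on
        cv::Ptr<cv::BackgroundSubtractorMOG2> subtractor =
            cv::createBackgroundSubtractorMOG2(500, 16.0, true);

        cv::Mat frame, foreground;
        while (cap.read(frame)) {
            subtractor->apply(frame, foreground);   // update the model, get the foreground mask
            // Shadows are marked with value 127 by MOG2; keep only confident foreground
            cv::threshold(foreground, foreground, 200, 255, cv::THRESH_BINARY);
            cv::imshow("foreground mask", foreground);
            if (cv::waitKey(30) == 27) break;       // ESC quits
        }
        return 0;
    }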

It has to be mentioned that not every moving (changing) patch is an object, as for example in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in a change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].
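A minimal sketch of this sliding comparison with OpenCV's matchTemplate function is shown below; the file names are placeholders, and the normalized correlation coefficient used here is only one of the available comparison methods.

    #include <opencv2/opencv.hpp>

    int main() {
        cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);      // input image
        cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);   // pattern to find
        cv::Mat response;

        // Shift the template over the image and compute a similarity score at every (u, v)
        cv::matchTemplate(image, templ, response, cv::TM_CCOEFF_NORMED);

        // The best match is at the maximum of the response map
        double minVal, maxVal;
        cv::Point minLoc, maxLoc;
        cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);
        cv::rectangle(image, maxLoc,
                      cv::Point(maxLoc.x + templ.cols, maxLoc.y + templ.rows),
                      cv::Scalar(255), 2);
        cv::imwrite("match.png", image);
        return 0;
    }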

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, while [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, for deciding whether the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well: [48] recognizes edge based and patch based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: the first stage searches over the image in multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point, the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated at the selected scale.

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus, SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of the general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
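To make the key-point matching idea concrete, the hedged sketch below detects SIFT key-points on a training image of the robot and on a test frame and matches their descriptors; the file names are placeholders, and depending on the OpenCV version the SIFT class may live in the xfeatures2d contrib module rather than cv::SIFT.

    #include <opencv2/opencv.hpp>
    #include <opencv2/features2d.hpp>
    #include <algorithm>
    #include <vector>

    int main() {
        cv::Mat object = cv::imread("robot_train.png", cv::IMREAD_GRAYSCALE);
        cv::Mat scene  = cv::imread("test_frame.png",  cv::IMREAD_GRAYSCALE);

        // Detect key-points and compute SIFT descriptors on both images
        cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
        std::vector<cv::KeyPoint> kpObject, kpScene;
        cv::Mat descObject, descScene;
        sift->detectAndCompute(object, cv::noArray(), kpObject, descObject);
        sift->detectAndCompute(scene,  cv::noArray(), kpScene,  descScene);

        // Match descriptors and keep the closest pairs
        cv::BFMatcher matcher(cv::NORM_L2);
        std::vector<cv::DMatch> matches;
        matcher.match(descObject, descScene, matches);
        std::sort(matches.begin(), matches.end());   // cv::DMatch orders by distance
        matches.resize(std::min<std::size_t>(30, matches.size()));

        cv::Mat visualisation;
        cv::drawMatches(object, kpObject, scene, kpScene, matches, visualisation);
        cv::imwrite("matches.png", visualisation);
        return 0;
    }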

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two and three rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
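A minimal sketch of this constant-time rectangle sum is given below, using OpenCV's integral image; the rectangle coordinates are arbitrary example values, and a two-rectangle Haar feature is simply the difference of two such sums.

    #include <opencv2/opencv.hpp>
    #include <cstdio>

    // Sum of pixel intensities inside r, using only four look-ups in the integral image.
    // ii must be the (rows+1) x (cols+1) CV_32S matrix produced by cv::integral.
    static int rectangleSum(const cv::Mat& ii, const cv::Rect& r) {
        return ii.at<int>(r.y, r.x)
             + ii.at<int>(r.y + r.height, r.x + r.width)
             - ii.at<int>(r.y, r.x + r.width)
             - ii.at<int>(r.y + r.height, r.x);
    }

    int main() {
        cv::Mat gray = cv::imread("face.png", cv::IMREAD_GRAYSCALE);
        cv::Mat ii;
        cv::integral(gray, ii, CV_32S);   // one pass over the image

        // Example two-rectangle (vertical) feature: eye region vs. cheek region
        cv::Rect top(10, 10, 24, 12), bottom(10, 22, 24, 12);
        int feature = rectangleSum(ii, top) - rectangleSum(ii, bottom);
        std::printf("feature value: %d\n", feature);
        return 0;
    }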

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and inspired several researchers. [54], for example, extended the Haar-like feature set with rotated rectangles, by calculating the integral images diagonally as well.

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves; the bottom row presents the features overlayed on a typical face. The first feature corresponds with the observation that eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

Resulting from their nature, Haar-like features are suitable to find patterns of combined bright and dark patches (like faces on grayscale images, for example).

Despite the original task, Haar-like features are used widely in computer vision for various tasks: vehicle detection ([55]), hand gesture recognition ([56]) and pedestrian detection ([57]), just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization, produces a histogram of the different orientations (called bins) in small portions (called cells) of the image. After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to the orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in a superior robustness against illumination changes over SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
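The sketch below computes the HOG feature vector of a single image window with OpenCV's HOGDescriptor; the 64×128 window and the block, cell and bin settings are the defaults from the pedestrian detection use case, and would need to be adapted to the ground robot's aspect ratio.

    #include <opencv2/opencv.hpp>
    #include <vector>
    #include <cstdio>

    int main() {
        cv::Mat window = cv::imread("candidate_window.png", cv::IMREAD_GRAYSCALE);
        cv::resize(window, window, cv::Size(64, 128));    // match the descriptor's window size

        // 64x128 window, 16x16 blocks, 8x8 block stride, 8x8 cells, 9 orientation bins
        cv::HOGDescriptor hog(cv::Size(64, 128), cv::Size(16, 16),
                              cv::Size(8, 8), cv::Size(8, 8), 9);

        std::vector<float> descriptor;
        hog.compute(window, descriptor);   // concatenated, block-normalized histograms
        std::printf("HOG descriptor length: %zu\n", descriptor.size());
        return 0;
    }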

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well. [32, 48] find two main philosophies: generative and discriminative models.

Let us denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given a test data, they predict the probability of an x instance being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) · P(c)

is ideally 1 if x contains an instance of class c, and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to be detected, two of the most well-known discriminative methods will be presented: Adaptive Boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights w_t,i, where N = number of training examples and T = number of features in the stage.

• For t = 1, ..., T:

  1. Normalize the weights.

  2. Select the best classifier using only a single feature, by minimising the detection error ε_t = Σ_i w_i |h(x_i, f, p, θ) − y_i|, where h(x_i) is the classifier output and y_i is the correct label (both with a range of 0 for negative, 1 for positive).

  3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

  4. Update the weights: w_t+1,i = w_t,i · β_t^(1−e_i), where β_t = ε_t / (1 − ε_t) and e_i = 0 if example x_i is classified correctly, 1 otherwise.

• The final classifier for the stage is based on the sum of the weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
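As a toy illustration of the weight update described above (not the full Viola-Jones cascade), the sketch below performs one re-weighting round on hard-coded weak classifier outputs; the example labels and outputs are invented purely for demonstration.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        // Invented toy data: 6 training examples with labels y_i in {0, 1}
        std::vector<int> y = {1, 1, 1, 0, 0, 0};
        // Outputs of the selected weak classifier h_t(x_i); it misclassifies examples 2 and 5
        std::vector<int> h = {1, 1, 0, 0, 0, 1};

        std::vector<double> w(y.size(), 1.0 / y.size());   // normalized weights

        // Weighted error of the weak classifier: eps_t = sum_i w_i |h(x_i) - y_i|
        double eps = 0.0;
        for (std::size_t i = 0; i < y.size(); ++i)
            eps += w[i] * (h[i] != y[i] ? 1.0 : 0.0);

        // Weight update: w_{t+1,i} = w_{t,i} * beta^(1 - e_i) with beta = eps / (1 - eps),
        // where e_i = 0 for correctly classified examples and 1 otherwise.
        const double beta = eps / (1.0 - eps);
        double sum = 0.0;
        for (std::size_t i = 0; i < y.size(); ++i) {
            const int e = (h[i] == y[i]) ? 0 : 1;
            w[i] *= std::pow(beta, 1 - e);
            sum += w[i];
        }
        for (double& wi : w) wi /= sum;   // re-normalize for the next round

        std::printf("eps = %.3f, beta = %.3f\n", eps, beta);
        for (std::size_t i = 0; i < w.size(); ++i)
            std::printf("w[%zu] = %.3f\n", i, w[i]);   // misclassified examples end up with larger weight
        return 0;
    }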

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the features of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the train and test data samples as vectors in the feature space. Then, one or more hyperplanes (a linear decision surface) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and, due to that, determining the width of the margin) are called support vectors (hence the name of the method). See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.

The algorithm was extended to not linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.

Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods were proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or the "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].
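For a concrete picture of how such a binary classifier is trained, the sketch below fits a C-SVM with a linear kernel using Dlib on a small invented 2D point set; the feature vectors, labels and the C parameter are assumptions chosen only for illustration.

    #include <dlib/svm.h>
    #include <iostream>
    #include <vector>

    int main() {
        using sample_type = dlib::matrix<double, 2, 1>;    // 2D feature vectors
        using kernel_type = dlib::linear_kernel<sample_type>;

        std::vector<sample_type> samples;
        std::vector<double> labels;                        // +1 / -1 class labels

        // Invented training data: class +1 around (2, 2), class -1 around (-2, -2)
        for (int i = -2; i <= 2; ++i) {
            sample_type p;
            p = 2.0 + 0.1 * i, 2.0 - 0.1 * i;
            samples.push_back(p); labels.push_back(+1);
            p = -2.0 + 0.1 * i, -2.0 - 0.1 * i;
            samples.push_back(p); labels.push_back(-1);
        }

        // Train a maximum-margin separating hyperplane
        dlib::svm_c_trainer<kernel_type> trainer;
        trainer.set_c(10);
        dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

        sample_type test;
        test = 1.5, 1.8;                                   // should land on the +1 side
        std::cout << "decision value: " << df(test) << std::endl;
        return 0;
    }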


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in section 1.3. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarises the sensors integrated on-board the UAV, including the flight controller, the camera and the laser range finder.

Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Also, Pixhawk will be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable to use it for mapping purposes. See 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide angle field of view.

Figure 3.2: The chosen lidar sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
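As an illustration of this message passing, the sketch below is a minimal ROS node that subscribes to the laser scanner's topic; the node and topic names (/scan) are assumptions, since the actual names depend on the driver configuration.

    #include <ros/ros.h>
    #include <sensor_msgs/LaserScan.h>

    // Called every time the lidar driver publishes a new 2D scan
    void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan) {
        const float first = scan->ranges.empty() ? 0.0f : scan->ranges[0];
        ROS_INFO("Received %zu range readings, first range: %.2f m",
                 scan->ranges.size(), first);
    }

    int main(int argc, char** argv) {
        ros::init(argc, argv, "lidar_listener");
        ros::NodeHandle nh;
        // The Hokuyo driver typically publishes sensor_msgs/LaserScan on /scan
        ros::Subscriber sub = nh.subscribe("/scan", 10, scanCallback);
        ros::spin();   // process incoming messages until shutdown
        return 0;
    }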

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions, like basic image manipulations (e.g. load, write, resize, rotate image), image processing tools (e.g. different kinds of edge detection, threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, examples and support are widely available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.
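A minimal sketch of this input handling is shown below: OpenCV's VideoCapture reads frames either from a camera index or from a recorded video file, and each frame can then be handed over to the detector module; the file name is a placeholder.

    #include <opencv2/opencv.hpp>
    #include <string>

    int main(int argc, char** argv) {
        // Open the first attached camera, or a recorded video file if a path is given
        std::string source = (argc > 1) ? argv[1] : "0";
        cv::VideoCapture cap;
        if (source == "0") cap.open(0);
        else               cap.open(source);   // e.g. "flight_recording.avi"
        if (!cap.isOpened()) return 1;

        cv::Mat frame;
        while (cap.read(frame)) {              // returns false at end of file or camera error
            // ... pass 'frame' on to the detector module here ...
            cv::imshow("input", frame);
            if (cv::waitKey(1) == 27) break;   // ESC stops the loop
        }
        return 0;
    }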

3.2.4 Dlib

Dlib is a general purpose, cross-platform C++ library. It has an excellent machine learning component, with the aim to provide a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

It is worth mentioning that, aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter, the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given to the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus, no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks, like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus, the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: a small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually makes it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera; see subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually, in the field of computer vision, the point of view of the required object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360 degrees and from above.

5. Various size of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm, specially designed for that task. Thus, the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, with at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


(Figure 4.1 diagram blocks: Sensors – camera, 2D lidar, video reader / current frame; other preprocessing (edge/colour detection); 3D map; regions of interest; trainer with Vatic annotation server; front SVM and side SVM (Support Vector Machines); detector algorithm; tracking; detections; evaluation.)

Figure 4.1: A diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow.


4.2 Architecture of the detection system

In the previous chapters, the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed, which is suitable for the objectives defined. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. These products have to be reproducible, compatible with the other parts, and effective. To make the development more convenient, the training software is completely separated from the testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture, with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus, methods which could reduce the amount of areas to process, and as such increase the speed of the detection, were required.

Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle were located in this map, accurate estimations could be made on the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time (a minimal sketch of this gating idea is shown after this list). Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the input from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the 2D camera image processing methods and the process of their development will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable to find patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points do not change between inputs, they are more suitable to recognise a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also due to moving parts, like rotating and steering wheels, for example. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises with no a-priori information about the lighting.

Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are predicted (see 4). This means that even if the camera is levelled, the object itself could appear rotated because of the perspective. Solutions to overcome this issue will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent HOG feature extractor with serious optimization, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentations are available for both training and detection, which simplified the implementation.

As a classifier, Support Vector Machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. Also, an SVM is implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the objects have to be annotated carefully, since otherwise a positive sample might get into the negative training images.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered during the training process already. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
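
The core of such a training tool can be sketched with Dlib's HOG scanner and structural SVM trainer. The snippet below is only a minimal, simplified outline of this workflow; the detection window size, the C parameter and the file names are placeholders, not the exact values used in the project.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

int main()
{
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;

    // Images and the annotated bounding boxes listed in the xml file
    dlib::array<array2d<unsigned char>> images;
    std::vector<std::vector<rectangle>> boxes;
    load_image_dataset(images, boxes, "training.xml");

    // HOG sliding-window scanner and structural SVM trainer
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);      // placeholder size
    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                               // placeholder regularization

    // Train and serialize the resulting detector (classifier)
    object_detector<image_scanner_type> detector = trainer.train(images, boxes);
    serialize("groundrobot_side.svm") << detector;
    return 0;
}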

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and from the right, while the front and rear views are also very similar. This is especially true if the working principle of the HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front or the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they not only can detect the position of the robot but, depending on which one of them detects it, the system also gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while the side and front views are well defined, it is hard to frame and train on a diagonal view.

On figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector. Notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize, thanks to their strong edges. On 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers turn out to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that their training datasets (the images of the robot or any other object) are usually cropped, containing only the object and some margin around it. As a result, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contained only the robot, there would be no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus, a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.


Figure 4.2: (a) A typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with a predefined step-size and at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
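
As an illustration of the concept only (the project itself relies on Dlib's built-in image pyramid rather than this code), a sliding window over an OpenCV image could be sketched as follows; the window size, step and scale factor are arbitrary placeholder values.

#include <opencv2/opencv.hpp>
#include <vector>

// Collect candidate crops by sliding a fixed-size window over an image pyramid.
std::vector<cv::Mat> slidingWindows(const cv::Mat& input, cv::Size window,
                                    int step, double scaleFactor, int levels)
{
    std::vector<cv::Mat> crops;
    cv::Mat img = input.clone();
    for (int level = 0; level < levels; ++level) {
        for (int y = 0; y + window.height <= img.rows; y += step)
            for (int x = 0; x + window.width <= img.cols; x += step)
                crops.push_back(img(cv::Rect(x, y, window.width, window.height)).clone());
        // Shrink the image and slide again, so larger objects are also covered.
        cv::resize(img, img, cv::Size(), 1.0 / scaleFactor, 1.0 / scaleFactor);
    }
    return crops;
}

// Example usage (hypothetical parameters): 96x64 windows, 16 pixel steps, 3 levels.
// std::vector<cv::Mat> candidates = slidingWindows(frame, cv::Size(96, 64), 16, 1.2, 3);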

It is worth mentioning that multiple instances of the sought object may be present in the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs in the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus, methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors are executed in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Based on intuition, a good feature for separation is the colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory), and many other premises reviewed, has a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. On figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus those areas are still scanned.
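
A minimal sketch of such an edge-based pre-filter, assuming OpenCV and purely illustrative threshold values, could look like the following: regions whose edge density stays below a chosen limit would simply be skipped by the sliding window.

#include <opencv2/opencv.hpp>

// Return true if the given region contains enough edge pixels to be worth
// scanning with the (expensive) HOG detector. All thresholds are placeholders.
bool isInterestingRegion(const cv::Mat& frame, const cv::Rect& region,
                         double minEdgeRatio = 0.05)
{
    cv::Mat gray, edges;
    cv::cvtColor(frame(region), gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 50, 150);                 // placeholder Canny thresholds
    double edgeRatio = cv::countNonZero(edges) / double(region.area());
    return edgeRatio >= minEdgeRatio;
}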

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was presented.

Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing an own 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area that needs to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
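
The basic use of Dlib's correlation tracker follows the pattern below. This is a simplified sketch: the variable names, the initial bounding box and the OpenCV-to-Dlib conversion are illustrative, not taken from the project code.

#include <dlib/image_processing.h>   // dlib::correlation_tracker
#include <dlib/opencv.h>             // dlib::cv_image wrapper
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap(0);
    cv::Mat frame;
    cap >> frame;

    dlib::cv_image<dlib::bgr_pixel> img(frame);
    dlib::correlation_tracker tracker;

    // Initialize the tracker with a bounding box, e.g. one returned by a detector.
    tracker.start_track(img, dlib::rectangle(100, 100, 200, 180));

    while (cap.read(frame)) {
        dlib::cv_image<dlib::bgr_pixel> next(frame);
        double confidence = tracker.update(next);      // follow the object
        dlib::drectangle pos = tracker.get_position(); // current position estimate
        cv::rectangle(frame,
                      cv::Rect(pos.left(), pos.top(), pos.width(), pos.height()),
                      cv::Scalar(0, 255, 255), 2);
        (void)confidence; // could be used to decide when re-detection is needed
    }
    return 0;
}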

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimation of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using 2 SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development four different approaches were implemented. Each of them is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are rather modifications of the same code than separate solutions, they were not implemented as new software. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 4.1: The available parameters.

Name                | Valid values   | Function
input               | path to video  | video as input for detection
svm                 | path to SVMs   | these SVMs will be used
mode                | [1, 2, 3, 4]   | selects which mode is used
saveFrames          | [0, 1]         | turns on video frame export
saveDetections      | [0, 1]         | turns on detection box export
saveFPS             | [0, 1]         | turns on frame-rate measurement
displayVideo        | [0, 1]         | turns on video display
DetectionsFileName  | string         | sets the filename for saved detections
FramesFolderName    | string         | sets the folder name used for saving video
numberOfLoops       | integer (> 0)  | sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarises all the implemented parameters with their possible values and a short description.

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would save neither the detections (saveDetections) nor the video frames (saveFrames). The video is not shown either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detecting mode and the purpose of the execution (producing video, efficiency statistics or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.1), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.
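
Loading the serialized classifiers and running them over a frame with Dlib takes only a few lines. The following sketch reuses the detector file names from the example parameter file above and is not the project's exact code.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/image_io.h>

int main()
{
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;

    // Load the two serialized SVM+HOG classifiers (front and side view).
    object_detector<image_scanner_type> frontDetector, sideDetector;
    deserialize("groundrobotfront.svm") >> frontDetector;
    deserialize("groundrobotside.svm") >> sideDetector;

    array2d<unsigned char> frame;
    load_image(frame, "frame.png");

    // Mode 1: slide every classifier over the whole frame and merge the hits.
    std::vector<rectangle> hits = frontDetector(frame);
    std::vector<rectangle> sideHits = sideDetector(frame);
    hits.insert(hits.end(), sideHits.begin(), sideHits.end());
    return 0;
}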

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all three of them.

As can be seen, mode 1 is rather simple: it checks every possible position with all the classifiers, on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this exhaustive search is very computationally heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained. On figure 4.2 it can be seen that one of them represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but due to the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again, until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
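
The classifier-memory logic of mode 2 can be summarised by a sketch along these lines; the structure and the tolerance value are illustrative assumptions, not the project's exact implementation.

#include <vector>

// Index of the currently "remembered" classifier; -1 means none is remembered.
int activeClassifier = -1;
int missedFrames = 0;
const int toleranceLimit = 10;   // placeholder: frames tolerated without detection

// Decide which classifiers to run on the current frame (mode 2).
std::vector<int> classifiersToRun(int numClassifiers)
{
    std::vector<int> toRun;
    if (activeClassifier >= 0 && missedFrames <= toleranceLimit)
        toRun.push_back(activeClassifier);           // only the remembered one
    else
        for (int i = 0; i < numClassifiers; ++i)     // fall back to all of them
            toRun.push_back(i);
    return toRun;
}

// Call after running the selected classifiers on a frame.
void reportResult(bool found, int classifierIndex)
{
    if (found) { activeClassifier = classifierIndex; missedFrames = 0; }
    else       { ++missedFrames; }
}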

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection, along with the detector which returned it.

Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only within it. Note the ratio between the ROI and the full size of the image.

A new rectangle, named ROI (region of interest), was introduced, which determines the area in which the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, because of the reasons mentioned above (movement of the camera and of the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described for mode 2 is included here as well.

In the unfortunate case that none of the detectors returns a detection inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image. In this case mode 3 works exactly like mode 2.
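
The ROI update rule can be sketched as follows with OpenCV rectangles. The growth factors mirror the defaults mentioned above (50% and 3%), while the function name and usage are illustrative rather than the project's actual code.

#include <opencv2/opencv.hpp>

// Grow a rectangle around its centre by the given fraction of its own size,
// clamped to the image borders.
cv::Rect grow(const cv::Rect& r, double fraction, const cv::Size& imageSize)
{
    int dw = int(r.width * fraction / 2.0);
    int dh = int(r.height * fraction / 2.0);
    cv::Rect grown(r.x - dw, r.y - dh, r.width + 2 * dw, r.height + 2 * dh);
    return grown & cv::Rect(0, 0, imageSize.width, imageSize.height);
}

// On a detection: the ROI becomes the detection box grown by 50%.
// cv::Rect roi = grow(detectionBox, 0.50, frame.size());
// On a miss: the ROI is enlarged again by a small percentage, until it covers
// the whole frame, at which point mode 3 behaves like mode 2.
// roi = grow(roi, 0.03, frame.size());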

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.


Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", thanks to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors, and under appropriate conditions it can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues about its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See figure 4.7(a) for a representation of the processing method of mode 4.


Figure 4.7: (a) A presentation of mode 4: the red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area, and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the lidar sensor used is a 2D type, which means it only scans in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical axis (capital Z). To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined by polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind spot located exactly behind the sensor (135°, −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured, and it is marked with blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself. On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user-defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file name is user defined.

Since both the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data sets were recorded completely separately, this has to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.


Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


However, since the two systems were not connected, the timestamps had to be synchronized manually.

On figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar, and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. We used a Motorola Moto X (2013) for these experiments.

The coordinate system used is fixed to the phone. The x axis is parallel with the shorter side of the screen, pointing from left to right. The y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface.

• roll is the rotation around the y axis.

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robot Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


Figure 4.11: (a) Picture of the laboratory and the mounting of the phone; (b) and (c) the 3D map of the scan from different viewpoints ((b) shows the elevation of the produced 3D map). Some objects are marked on all images for easier understanding: 1, 2, 3 are boxes; 4 is a chair; 5 is a student at his desk; 6, 7 are windows; 8 is the entrance; 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections. However, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• x axis is the vector product of y and z (x = y × z), pointing to the right

• y axis is the facing direction of the scanner, parallel to the ground, pointing forward

• z axis is pointing up and is perpendicular to the ground

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three DOF represent its position along the x, y, z axes; the other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change. Thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

\[ x = \text{distance} \cdot \sin(-\text{yaw}) \tag{4.1} \]

\[ y = \text{distance} \cdot \cos(\text{yaw}) \cdot \sin(\text{pitch}) \tag{4.2} \]

\[ z = \text{distance} \cdot \cos(\text{yaw}) \cdot \cos(\text{pitch}) \tag{4.3} \]

Where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5), and pitch is the angle between the (lower-case) z axis and the plane of the lidar (4.8, points 3 and 4).

The position of the lidar in the ground coordinate system did not change in theory, only its orientation. However, as can be seen on figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change, and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The vector is calculated as the sum of dy and dz:

\[ dy = \text{offset} \cdot \sin(\text{pitch}) \tag{4.4} \]

\[ dz = \text{offset} \cdot \cos(\text{pitch}) \tag{4.5} \]


Where dy is the translation required along the y axis and dz is the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8 part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

\[ \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \text{distance} \cdot \begin{bmatrix} \sin(-\text{yaw}) \\ \cos(\text{yaw}) \cdot \sin(\text{pitch}) \\ \cos(\text{yaw}) \cdot \cos(\text{pitch}) \end{bmatrix} + \text{offset} \cdot \begin{bmatrix} 0 \\ \sin(\text{pitch}) \\ \cos(\text{pitch}) \end{bmatrix} \tag{4.6} \]

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work, such as SLAM (simultaneous localization and mapping) or the ground robot detection.
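
For reference, equation 4.6 translates directly into code. The following helper is a sketch of that mapping (angles in radians, distance and offset in the same unit); the names are chosen for illustration.

#include <cmath>

struct Point3D { double x, y, z; };

// Map one lidar return, given by its distance and yaw in the scan plane and the
// current pitch of that plane, into the ground-fixed frame (equation 4.6).
// 'offset' is the measured distance between the light source and the tilt axis.
Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
{
    Point3D p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}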


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system, to track the improvements of the code and to check if it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive. Similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite it not being one. In other words, the system believes the object is present at that location, although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (is rejected) although it should be. In this current task a false negative error occurs when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions are loosened in the detection method, false negative errors may be reduced. However, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than the fact that some mistakes will occur. Such a project can be a manually supervised classification, where appearing false positives can be eliminated by the operator (for example, illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position. After a few frames, the detector would most likely be able to detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface, and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset with its samples already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or a ROI. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. For this the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free-to-use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible. See an example screenshot on figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object as occluded, obstructed or outside of the frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames, the only task remaining is to correct the interpolations between them, if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned to the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.
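
One common way to decide whether a detection box matches an annotated box is their overlap ratio (intersection over union). The sketch below illustrates this criterion with an arbitrary 0.5 threshold; it is not necessarily the exact rule used in this evaluation.

#include <opencv2/opencv.hpp>

// Intersection-over-union of a detection and an annotated ground-truth box.
double iou(const cv::Rect& detection, const cv::Rect& annotation)
{
    double inter = (detection & annotation).area();
    double uni = detection.area() + annotation.area() - inter;
    return uni > 0.0 ? inter / uni : 0.0;
}

// A detection counts as a true positive if it overlaps the annotation enough.
bool isTruePositive(const cv::Rect& detection, const cv::Rect& annotation,
                    double threshold = 0.5)
{
    return iou(detection, annotation) >= threshold;
}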

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
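
Such a per-frame measurement can be sketched with the standard C++ clock; the loop body and the file name below are placeholders standing in for the actual detection code.

#include <chrono>
#include <fstream>
#include <thread>

int main()
{
    std::ofstream log("processing_times.txt");
    for (int frame = 0; frame < 1080; ++frame) {
        auto start = std::chrono::steady_clock::now();
        // placeholder for the per-frame detection work
        std::this_thread::sleep_for(std::chrono::milliseconds(30));
        auto stop = std::chrono::steady_clock::now();
        log << std::chrono::duration<double>(stop - start).count() << "\n";
    }
    return 0;
}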

The measurements were carried out on a laptop PC (Lenovo z580, i5-3210m, 8GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every other unnecessary process was terminated, so that their additional influence was minimized.

The exact same software was used for every test; only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test, due to the operating system, did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time. The variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average elapsed seconds: 0.032446
Variance of elapsed seconds per frame:
  between video loops: 8.0297e-07
  across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS

The changes of the average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer; in that case the tests had to be restarted.
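For completeness, the sketch below shows how such statistics can be aggregated from the exported per-frame times. The real analysis tool is a Matlab script; this stand-in is written in C++ for consistency with the other examples, assumes the one-value-per-line log format used above, and only computes the across-the-video statistics.

    // Hypothetical sketch: aggregating the exported per-frame processing times
    // into the statistics reported by the analysis tool.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<double> times;                   // elapsed seconds, all loops concatenated
        std::ifstream log("frame_times.txt");        // assumed log file name
        for (double t; log >> t; ) times.push_back(t);
        if (times.empty()) return 1;

        double mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
        double sq   = 0.0;
        for (double t : times) sq += (t - mean) * (t - mean);

        std::cout << "Longest processing time:   " << *std::max_element(times.begin(), times.end()) << "\n"
                  << "Shortest processing time:  " << *std::min_element(times.begin(), times.end()) << "\n"
                  << "Average elapsed seconds:   " << mean << "\n"
                  << "Variance across the video: " << sq / times.size() << "\n"
                  << "Average FPS:               " << 1.0 / mean << "\n";
        return 0;
    }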

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 x 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to gather the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that 3D lidar scanners exist, but they are still scanning in a plane and have the mentioned processing built-in. Also, they are even more expensive than the 2D versions.


Figure 5.2: On this figure the recording setup is presented. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras; it contains valuable information for a project like this and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recordings from one point are not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos can be found here¹ for all modes. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be measured by several values; here, recall and precision will be used. Recall is defined by

    Recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

    Precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation, and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, a very conservative detector can improve its precision at the cost of more false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on position and the overlapping areas. For the latter, a minimum of 50% was defined (ratio of the area of the intersection and the union of the two rectangles).
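A minimal sketch of this overlap test is shown below, using OpenCV's cv::Rect. The 50% threshold follows the text; the function name and the example rectangles are illustrative assumptions.

    // Hypothetical sketch: deciding whether a detection matches an annotation
    // based on the overlap (intersection over union) of the two rectangles.
    #include <iostream>
    #include <opencv2/core.hpp>

    bool isTruePositive(const cv::Rect& detection, const cv::Rect& annotation,
                        double minOverlap = 0.5)
    {
        double inter = (detection & annotation).area();               // intersection area
        double uni   = detection.area() + annotation.area() - inter;  // union area
        return uni > 0.0 && (inter / uni) >= minOverlap;
    }

    int main()
    {
        cv::Rect det(100, 100, 80, 60), ann(110, 105, 80, 60);        // x, y, width, height
        std::cout << std::boolalpha << isTruePositive(det, ann) << "\n";  // prints true
        return 0;
    }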

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not sufficient. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speeds of the methods introduced in subsection 4.3.5 were analysed as well, with the tools described in subsection 5.1.2.

Mode 1 was proven to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 was proven to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

    mode      Recall   Precision   FPS       Variance
    mode 1    0.623    1           4.2599    0.00000123
    mode 2    0.622    1           6.509     0.0029557
    mode 3    0.645    1           12.06     0.0070877
    mode 4    0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall of 95%) and speed (average frame-rate of 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper, an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed, and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 concluded the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software, along with the produced detectors.


As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore, additional debugging features like exporting detections, demonstration videos and frame-rate measurements were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and, under appropriate conditions, can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by detectors.
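To make the structure of the mode 4 approach easier to follow, a simplified detect-then-track loop is sketched below. It uses dlib's correlation tracker, which is the kind of tracker referenced in this work, but the runDetectors() placeholder, the validation interval and the overall control flow are assumptions for illustration rather than the actual implementation.

    // Hypothetical sketch of a mode 4 style loop: detect once, then track the
    // found box, and re-validate it with the detectors every N-th frame.
    #include <dlib/image_processing.h>
    #include <dlib/opencv.h>
    #include <opencv2/opencv.hpp>

    // Stub standing in for the HOG+SVM sliding-window detectors of the real system.
    bool runDetectors(const cv::Mat& frame, cv::Rect& found)
    {
        (void)frame; (void)found;
        return false;   // the real function would search the frame and fill 'found'
    }

    int main()
    {
        cv::VideoCapture video("test_video.avi");       // assumed input file
        dlib::correlation_tracker tracker;
        bool tracking = false;
        const int validateEvery = 30;                   // assumed validation interval

        cv::Mat frame;
        for (int i = 0; video.read(frame); ++i)
        {
            dlib::cv_image<dlib::bgr_pixel> img(frame); // wrap the OpenCV frame for dlib

            if (!tracking)
            {
                cv::Rect box;
                if (runDetectors(frame, box))           // slow sliding-window search
                {
                    tracker.start_track(img, dlib::drectangle(
                        box.x, box.y, box.x + box.width, box.y + box.height));
                    tracking = true;
                }
            }
            else
            {
                tracker.update(img);                    // fast frame-to-frame tracking
                dlib::drectangle pos = tracker.get_position();
                (void)pos;                              // estimated bounding box of the robot

                if (i % validateEvery == 0)
                {
                    cv::Rect box;
                    // drop the track if the detectors cannot confirm the object;
                    // a fuller version would also re-initialise the tracker here
                    tracking = runDetectors(frame, box);
                }
            }
        }
        return 0;
    }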

Special attention was given to the evaluation of the system. Two software tools were developed as additional aids: one for evaluating the efficiency of the detections and another for analysing the processing times and frame-rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the ground robot detecting algorithm's first version is ready-to-use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping, along with the applied mathematics.


An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object not completely from the front), will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangular box which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well; thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so that corresponding areas can be found.
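As an illustration of such a pre-filter, the sketch below keeps only the points of a cloud whose height falls into a band roughly matching the UGV. The Point3D structure, the height limits and the idea of projecting the surviving points into the camera image are assumptions for demonstration.

    // Hypothetical sketch: using a height band in the point-cloud as a cheap
    // pre-filter, keeping only points that could belong to the ground robot.
    #include <iostream>
    #include <vector>

    struct Point3D { double x, y, z; };     // z is height above the floor (assumed)

    std::vector<Point3D> filterByHeight(const std::vector<Point3D>& cloud,
                                        double minZ, double maxZ)
    {
        std::vector<Point3D> kept;
        for (const Point3D& p : cloud)
            if (p.z >= minZ && p.z <= maxZ)  // plausible UGV height band
                kept.push_back(p);
        return kept;
    }

    int main()
    {
        std::vector<Point3D> cloud = { {1.0, 2.0, 0.05}, {1.1, 2.0, 0.35}, {0.9, 2.1, 1.80} };
        // assumed band: the UGV body lies roughly between 0.1 m and 0.6 m above the floor
        std::vector<Point3D> candidates = filterByHeight(cloud, 0.1, 0.6);
        std::cout << candidates.size() << " candidate point(s)\n";   // prints 1
        // The surviving points could then be projected into the camera image
        // to form regions of interest for the 2D detectors.
        return 0;
    }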

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time: theoretically, it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field-of-view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or simply ignore (not process) inputs from other areas.
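A small sketch of this reachability check is given below; the maximal speed, the elapsed time and the function name are assumed values for demonstration.

    // Hypothetical sketch: discard map areas the UGV cannot have reached since
    // its last known position (radius = maximal speed * elapsed time).
    #include <cmath>
    #include <iostream>

    bool couldBeAt(double lastX, double lastY, double queryX, double queryY,
                   double maxSpeed, double elapsedSeconds)
    {
        double radius = maxSpeed * elapsedSeconds;            // reachable circle
        double dx = queryX - lastX, dy = queryY - lastY;
        return std::sqrt(dx * dx + dy * dy) <= radius;
    }

    int main()
    {
        // assumed: the UGV moves at most 0.5 m/s and was last seen 4 s ago
        std::cout << std::boolalpha
                  << couldBeAt(0.0, 0.0, 1.5, 1.0, 0.5, 4.0) << "\n";  // true (1.80 m <= 2 m)
        return 0;
    }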

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview, senseFly." [Online]. Available: httpswwwsenseflycomdronesoverviewhtml [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: httparcaiaaorgdoiabs10251462007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: httpstoredjicomproductphantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04 2004, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: httparcaiaaorgdoiabs10251462010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: httpwwwmdpicom2072-42924113390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: httplinkspringercom101007s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: httpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," of the Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2, SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951ampAgg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997. [Online]. Available: papers2publicationuuid9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: citeulike-article-id3047126$delimiter026E30F$nhttpdxdoiorg101109CVPR2005177

[54] J. M. Rainer Lienhart, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. C. Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: httproverardupilotcomwikipixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed at 2015-07-22]

[72] "openCV manual, Release 2.4.9." [Online]. Available: httpdocsopencvorgopencv2refmanpdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: httpjmlrcsailmitedupapersv10king09ahtml

[74] "dlib C++ Library." [Online]. Available: httpdlibnet [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: httparxivorgabs150200046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed at 2015-07-20]

[78] "vatic - Video Annotation Tool - UC Irvine." [Online]. Available: httpwebmiteduvondrickvatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: httpswwwmturkcommturkwelcome [Accessed at 2015-07-26]


Aluliacuterott Paacutelffy Andraacutes a Paacutezmaacuteny Peacuteter Katolikus Egyetem Informaacutecioacutes Technoloacute-giai eacutes Bionikai Karaacutenak hallgatoacuteja kijelentem hogy jelen diplomatervet meg nem en-gedett segiacutetseacuteg neacutelkuumll sajaacutet magam keacutesziacutetettem eacutes a diplomatervben csak a megadott for-raacutesokat hasznaacuteltam fel Minden olyan reacuteszt melyet szoacute szerint vagy azonos eacutertelembende aacutetfogalmazva maacutes forraacutesboacutel aacutetvettem egyeacutertelműen a forraacutes megadaacutesaacuteval megjeloumlltemEzt a diplomatervet csakis Erasmus tanulmaacutenyuacutet kereteacuten beluumll a Cranfield UniversitybdquoComputational amp Software Techniques in Engineeringrdquo MSc keacutepzeacuteseacuten bdquoDigital Signaland Image Processingrdquo szakiraacutenyon nyuacutejtottam be a 20142015taneacutevben

Undersigned Andraacutes PAacuteLFFY student of Paacutezmaacuteny Peacuteter Catholic Universityrsquos Fac-ulty of Information Technology and Bionics I state that I have written this thesis onmy own without any prohibited sources and I used only the described references Everytranscription citation or paraphrasing is marked with the exact source This thesis hasonly been submitted at Cranfield Universityrsquos Computational amp Software Techniques inEngineering MSc course on the Digital Signal and Image Processing option in 2015

Budapest 20151220

Paacutelffy Andraacutes

iii

Contents

List of Figures vii

Absztrakt viii

Abstract ix

List of Abbreviations x

1 Introduction and project description 111 Project description and requirements 112 Type of vehicle 213 Aims and objectives 3

2 Literature Review 521 UAVs and applications 5

211 Fixed-wing UAVs 5212 Rotary-wing UAVs 6213 Applications 8

22 Object detection on conventional 2D images 9221 Classical detection methods 10

2211 Background subtraction 102212 Template matching algorithms 11

222 Feature descriptors classifiers and learning methods 132221 SIFT features 142222 Haar-like features 152223 HOG features 162224 Learning models in computer vision 172225 AdaBoost 182226 Support Vector Machine 19

iv

CONTENTS

3 Development 2131 Hardware resources 21

311 Nitrogen board 21312 Sensors 21

3121 Pixhawk autopilot 223122 Camera 223123 LiDar 23

32 Chosen software 23321 Matlab 23322 Robotic Operating System (ROS) 24323 OpenCV 24324 Dlib 24

4 Designing and implementing the algorithm 2641 Challenges in the task 2642 Architecture of the detection system 2943 2D image processing methods 31

431 Chosen methods and the training algorithm 31432 Sliding window method 34433 Pre-filtering 36434 Tracking 37435 Implemented detector 39

4351 Mode 1 Sliding window with all the classifiers 414352 Mode 2 Sliding window with intelligent choice of

classifier 414353 Mode 3 Intelligent choice of classifiers and ROIs 424354 Mode 4 Tracking based approach 44

44 3D image processing methods 46441 3D recording method 46442 Android based recording set-up 48443 Final set-up with Pixhawk flight controller 51444 3D reconstruction 53

5 Results 5651 2D image detection results 56

511 Evaluation 565111 Definition of True positive and negative 565112 Definition of False positive and negative 565113 Reducing number of errors 575114 Annotation and database building 57

512 Frame-rate measurement and analysis 59

v

CONTENTS

52 3D image detection results 6053 Discussion of results 61

6 Conclusion and recommended future work 6561 Conclusion 6562 Recommended future work 67

References 70

vi

List of Figures

11 Image of the ground robot 3

21 Fixed wing consumer drone 622 Example for consumer drones 723 Example for people detection with background subtraction 1124 Example of template matching 1225 2 example Haar-like features 1626 Illustration of the discriminative and generative models 1727 Example of a separable problem 20

31 Image of Pixhawk flight controller 2232 The chosen LIDAR sensor Hokuyo UTM-30LX 2333 Elements of Dlibrsquos machine learning toolkit 25

41 A diagram of the designed architecture 2842 Visualization of the trained HOG detectors 3543 Representation of the sliding window method 3644 Example image for the result of edge detection 3845 Example of the detectorrsquos user interface 4146 Presentation of mode 3 4347 Mode 4 example output frames 45410 Screenshot of the Android application for Lidar recordings 4848 Schematic figure to represent the 3D recording set-up 4949 Example representation of the output of the Lidar Sensor 50411 Picture about the laboratory and the recording set-up 52412 Presentation of the axes used in the android application 53

51 Figure of the Vatic user interface 5852 Photo of the recording process 6153 Example of the 3D images built 62

vii

List of Abbreviations

SATM School of Aerospace Technology and ManufacturingUAV Unmanned Aerial VehicleUAS Unmanned Aerial SystemUA Unmanned AircraftUGV Unmanned Ground VehicleHOG Histogram of Oriented GradientsRC Radio ControlledROS Robotic Operating SystemIMU Inertial Measurement UnitDoF Degree of FreedomSLAM Simultaneous Localization And MappingROI Region Of InterestVatic Video Annotation Tool from Irvine California

viii

Absztrakt

Ezen dolgozatban bemutataacutesra keruumll egy piloacuteta neacutelkuumlli repuumllő jaacuterműre szereltkoumlvető rendszer ami belteacuteri objektumokat hivatott detektaacutelni eacutes legfőbb ceacutelja afoumlldi egyseacuteg megtalaacutelaacutesa amely leszaacutelloacute eacutes uacutejratoumlltő aacutellomaacuteskeacutent fog szolgaacutelni

Előszoumlr a projekt illetve a teacutezis ceacuteljai lesznek felsorolva eacutes reacuteszletezveEzt koumlveti egy reacuteszletes irodalomkutataacutes amely bemutatja a hasonloacute kihiacutevaacute-

sokra leacutetező megoldaacutesokat Szerepel egy roumlvid oumlsszefoglalaacutes a piloacuteta neacutelkuumlli jaacuter-művekről eacutes alkalmazaacutesi teruumlleteikről majd a legismertebb objektumdetektaacuteloacutemoacutedszerek keruumllnek bemutataacutesra A kritika taacutergyalja előnyeiket eacutes haacutetraacutenyaikatkuumlloumlnoumls tekintettel a jelenlegi projektben valoacute alkalmazhatoacutesaacutegukra

A koumlvetkező reacutesz a fejleszteacutesi koumlruumllmeacutenyekről szoacutel beleeacutertve a rendelkezeacutesreaacutelloacute szoftvereket eacutes hardvereket

A feladat kihiacutevaacutesai bemutataacutesa utaacuten egy modulaacuteris architektuacutera terve keruumllbemutataacutesra figyelembe veacuteve a ceacutelokat erőforraacutesokat eacutes a felmeruumllő probleacutemaacutekat

Ezen architektuacutera egyik legfontosabb modulja a detektaacuteloacute algoritmus legfris-sebb vaacuteltozata reacuteszletezve is szerepel a koumlvetkező fejezetben keacutepesseacutegeivel moacuted-jaival eacutes felhasznaacuteloacutei feluumlleteacutevel egyuumltt

A modul hateacutekonysaacutegaacutenak meacutereacuteseacutere leacutetrejoumltt egy kieacuterteacutekelő koumlrnyezet melykeacutepes szaacutemos metrikaacutet kiszaacutemolni a detekcioacuteval kapcsolatban Mind a koumlrnyezetmind a metrikaacutek reacuteszletezve lesznek a koumlvetkező fejezetben melyet a legfrissebbalgoritmus aacuteltal eleacutert eredmeacutenyek koumlvetnek

Baacuter ez a dolgozat főkeacutent a hagyomaacutenyos (2D) keacutepeken operaacuteloacute detekcioacutesmoacutedszerekre koncentraacutel 3D keacutepalkotaacutesi eacutes feldolgozoacute moacutedszerek szinteacuten meg-fontolaacutesra keruumlltek Elkeacuteszuumllt egy kiacuteseacuterleti rendszer amely keacutepes laacutetvaacutenyos eacutespontos 3D teacuterkeacutepek leacutetrehozaacutesaacutera egy 2D leacutezer szkenner hasznaacutelataacuteval Szaacutemosfelveacutetel keacuteszuumllt a megoldaacutes kiproacutebaacutelaacutesa amelyek a rendszerrel egyuumltt bemutataacutesrakeruumllnek

Veacuteguumll az implementaacutelt moacutedszerek eacutes az eredmeacutenyek oumlsszefoglaloacuteja zaacuterja adolgozatot

ix

Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is pre-sented with the aim of detecting objects indoor (potentially extended to outdooroperations as well) and localize a ground robot that will serve as a landing plat-form for recharging of the UAV

First the project and the aims and objectives of this thesis are introducedand discussed

Then an extensive literature review is presented to give overview of the ex-isting solutions for similar problems Unmanned Aerial Vehicles are presentedwith examples of their application fields After that the most relevant objectrecognition methods are reviewed and their suitability for the discussed project

Then the environment of the development will be described including theavailable software and hardware resources

Afterwards the challenges of the task are collected and discussed Consideringthe objectives resources and the challenges a modular architecture is designedand introduced

As one of the most important module the currently used detector is intro-duced along with its features modes and user interface

Special attention is given to the evaluation of the system Additional evalu-ating tools are introduced to analyse efficiency and speed

The ground robot detecting algorithmrsquos first version is evaluated giving promis-ing results in simulated experiments

While this thesis focuses on two dimensional image processing and objectdetection methods 3D image inputs are considered as well An experimentalsetup is introduced with the capability to create spectacular and precise 3Dmaps about the environments with a 2D laser scanner To test the idea of usingthe 3D image as an input for the ground robot detection several recordings weremade about the UGV and presented in this paper as well

Finally all implemented methods and relevant results are concluded

x

Chapter 1

Introduction and projectdescription

In this chapter an introduction is given to the whole project which this thesis ispart of Afterwards a structure of the sub-tasks in the project is presented alongwith the recognized challenges Then the aims and objectives of this thesis arelisted

11 Project description and requirementsThe projectrsquos main aim is to build a complete system based on one or moreunmanned autonomous vehicle which are able to carry out an indoor 3D mappingof a building (possible outdoor operations are kept in mind as well) A maplike this would be very useful for many applications such as surveillance rescuemissions architecture renovation etc Furthermore if a 3D model exists otherautonomous vehicles can navigate through the building with ease Later on inthis project after the map building is ready a thermal camera is planned to beattached to the vehicle to seek find and locate heat leaks as a demonstration ofthe system A system like this should be

1 fast Building a 3D map requires a lot of measurements and processingpower not to mention the essential functions stabilization navigationroute planning collision avoidance However to build a useful tool allthese functions should be executed simultaneously mostly on-board nearlyreal-time Furthermore the mapping process itself should be finished inreasonable time (depending on the size of the building)

2 accurate Reasonably accurate recording is required so the map would besuitable for the execution of further tasks Usually this means a maximum

1

12 Type of vehicle

error of 5-6cm Errors with similar magnitude can be corrected later de-pending on the application In the case of architecture for example aligninga 3D room model to the recorded point cloud could eliminate this noise

3 autonomous The solution should be as autonomous as possible The aimis to build a system which does not require human supervision (for examplemapping a dangerous area which would be too far for real time control) andcoordinate the process Although remote control is acceptable and shouldbe implemented it is desired to be minimal This should allow a remoteoperator to coordinate even more than one unit a time (since none of themneeds continuous attention) while keeping the ability to change the routeor priority of premises

4 complete Depending on the size and layout of the building the mappingprocess can be very complicated It may have more than one floors loops(the same location reached via an other route) or opened areas which areall a significant challenge for both the vehicle (plan route to avoid collisionsand cover every desired area) and the mapping software The latter shouldrecognize previously seen locations and rdquocloserdquo the loop (that is assignthe two separately recorded area to one location) and handle multi-layerbuildings The system have to manage these tasks and provide a mapwhich contains all the layers closes loops and represents all walls (includingceiling and floor)

12 Type of vehicleThe first question of such a project is what type of vehicle should be used Theoptions are either some kind of aerial or a ground vehicle Both have their ad-vantages and disadvantages

An aerial vehicle has a lot more degree of freedom since it can elevate fromthe ground It can reach positions and perform measurements which would beimpossible for a ground vehicle However it consumes a lot of energy to stayin the air thus such a system has significant limitations in weight payload andoperating time (since the batteries themselves weigh a lot)

A ground vehicle overcomes all these problems since it is able to carry alot more payload than an aerial one This means a lot of batteries thereforelonger operation time On the other hand it canrsquot elevate from the ground thusall the measurements will be performed from a relatively close position to theground This can be a big disadvantage since taller objects will not have theirtops recorded Also a ground robot would not be able to overcome big obstaclesclosing its way or to get to another floor

2

13 Aims and objectives

Figure 11 Image of the ground robot Source own picture

Considering these and many more arguments (price complexity human andmaterial resources) a combination of an aerial and a ground vehicle was chosenas a solution The idea is that both vehicles enter the building at the sametime and start mapping the environment simultaneously working on the samemap (which is unknown when the process starts) Using the advantage of theaerial vehicle the system scans parts of the rooms which are unavailable for theground vehicle With sophisticated route planning the areas scanned twice canbe minimized resulting in a faster map building Also great load capacity of theground robot makes it able to serve as a charging station for the aerial vehicleThis option will be discussed in details in section 13

13 Aims and objectivesAs it can be seen the scope of the project is extremely wide To cover the wholemore than fifteen students (visiting MSc and PhD) are working on it Thereforethe project has been divided to numerous subtasks

During the planning period of the project several challenges were recognizedfor example the lack of GPS signal indoor the vibration of the on-board sensorsor the short flight time of the aerial platform

This thesis addresses the last one As mentioned in section 12 the UAVconsumes a lot of energy to stay in the air thus such a system has significantlylimited operating time In contrast the ground robot is able to carry a lot morepayload (eg batteries) than an aerial one The idea to overcome this limitationof the UAV is to use the ground robot as a landing and recharging platform for

3

13 Aims and objectives

the aerial vehicleTo achieve this the system will need to have continuous update of the relative

position of the ground robot The aim of this thesis is to give a solution for thisproblem The following objectives were defined to structure the research

First possible solutions have to be collected and discussed in an extensive lit-erature review Then after considering the existing solutions and the constraintsand requirements of the discussed project a design of a new system is neededwhich is able to detect the ground robot using the available sensors Finally thefirst version of the software have to be implemented and tested

This paper will describe the process of the research introduce the designedsystem and discuss the experiences

4

Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous researches andarticles related to the topic of this thesis First a short introduction of unmannedaerial vehicles will be given along with their most important advantages andapplication fields

Afterwards a brief conclusion of the science of object recognition in imageprocessing is provided The most often used feature extraction and classificationmethods are presented

21 UAVs and applicationsThe abbreviation UAV stands for Unmanned Aerial Vehicle Any aircraft whichis controlled or piloted by remotely andor by on-board computers They are alsoreferred as UAS for Unmanned Aerial System

They come in various sizes The smallest ones fit a man palms while even afull-size aeroplane can be an UAV by definition Similarly to traditional aircraftUnmanned Aerial vehicles are usually grouped into two big classes fixed-wingUAVs and rotary wing UAVs

211 Fixed-wing UAVsFixed-wing UAVs have a rigid wing which generates lift as the UAV moves for-ward They maneuver with control surfaces called ailerons rudder and elevatorGenerally speaking fixed-wing UAVs are easier to stabilize and control Forcomparison radio controlled aeroplanes can fly without any autopilot functionimplemented relying only on the pilotrsquos input This is a result of the muchsimpler structure compared to rotary wing ones due to the fact that a glidingmovement is easier to stabilize However they still have a quite extended market

5

21 UAVs and applications

Figure 21 One of the most popular consumer level fixed-wing mapping UAV ofthe SenseFly company Source [1]

of flight controllers whose aim is to add autonomous flying and pilot assistancefeatures [2]

One of the biggest advantage of fixed wing UAVs is that due to their naturalgliding capabilities they can stay airborne using no or small amount of powerFor the same reason fixed-wing UAVs are also able to carry heavier or morepayload for longer endurances using less energy which would be very useful inany mapping task

The drawback of the fixed wing aircraft in the case of this project is the factthat they have to keep moving to generate lift and stay airborne Therefore flyingaround between and in buildings is very complicated since there is limited spaceto move around Although there have been approaches to prepare a fixed wingUAV to hover and translate indoors [3] generally they are not optimal for indoorflights especially with heavy payload [4]

Fixed-wing UAVs are excellent choice for long endurance high altitude tasksThe long flight times and higher speed make it possible to cover larger areaswith one take off On figure 21 a consumer fixed-wing UAV is shown designedspecially for mapping

This project needs a vehicle which is able to fly between buildings and indoorswithout any complication Thus rotary-wing UAVs were reviewed

212 Rotary-wing UAVsRotary-wing UAVs generate lift with rotor blades attached to and rotated aroundan axis These blades work exactly as the wing on the fixed-wing vehicles butinstead of the vehicle moving forward the blades are moving constantly Thismakes the vehicle able to hover and hold its position Also due to the same prop-erty they can take off and land vertically These capabilities are very importantaspects for the project since both of them are crucial for indoor flight where spaceis limited


Figure 2.2: The newest version of DJI's popular Phantom series. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side produce more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote-controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is why no large-scale multirotors are used: as the size increases, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of a project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing (especially multirotor) UAVs seem like an excellent choice for this project, since there is no need to cover large areas, and the shorter operational times can be compensated for with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German V-1 rocket ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available both as simple remote-controlled toys and as advanced aerial photography platforms [5]. See Figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. Good examples for this application field are surveying the location of an earthquake or searching for missing people in the wilderness; see [7] and [8] for examples. [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have proven useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low-cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wildlife monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above focus rather on topographical, larger-scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use indoors, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera,


using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions, see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks appeared, the field has witnessed several approaches and methods. Thus, the topic has an extensive literature. To get an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as follows:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item

• Recognition: localization of all the items present in the stimulus

• Understanding: recognition plus the role of the stimulus in the context of the scene

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.


• Naming: Given a large complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in Section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector, respectively.

Besides the grouping of computer vision tasks, several attempts were made to classify the approaches themselves. The biggest distinguishing property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research on artificial intelligence, methods not using machine learning are becoming neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity often makes them attractive for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of a scene by differencing a few frames. Although several methods exist (see [36] or [37] for good overviews), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where


Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.
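As an illustration of the concept, a minimal OpenCV-based sketch is shown below (it assumes a static camera feed and uses OpenCV's MOG2 background subtractor; it is not part of the project's pipeline, since the on-board camera moves):

    #include <opencv2/opencv.hpp>

    int main() {
        cv::VideoCapture cap(0);  // any static-camera video source
        // MOG2 learns and regularly updates the background model (adaptive detection)
        auto subtractor = cv::createBackgroundSubtractorMOG2(500, 16.0, true);

        cv::Mat frame, foreground;
        while (cap.read(frame)) {
            subtractor->apply(frame, foreground);   // update model, get foreground mask
            // Moving regions are white (255); MOG2 marks detected shadows as grey (127)
            cv::imshow("foreground", foreground);
            if (cv::waitKey(30) == 27) break;       // ESC to quit
        }
        return 0;
    }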

These solutions can give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as one object. Thus, all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measurement can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object: for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar sizes. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in changes across the whole image. In this case, the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern in the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them and looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the


Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the dark headlines of the calendar (on the wall, left), which indeed look similar to the handle. Source: [43]

image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].
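A minimal sketch of this idea with OpenCV's template matching function is given below (the use of normalized cross-correlation here follows [43]; the function and variable names are only illustrative):

    #include <opencv2/opencv.hpp>

    // Locate a known template in a larger image by normalized cross-correlation.
    cv::Point matchByCorrelation(const cv::Mat& image, const cv::Mat& templ) {
        cv::Mat response;
        // One correlation value for every (u, v) shift of the template
        cv::matchTemplate(image, templ, response, cv::TM_CCORR_NORMED);

        double minVal, maxVal;
        cv::Point minLoc, maxLoc;
        cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);

        // With a correlation-based score the best match is the global maximum
        return maxLoc;  // top-left corner of the best matching window
    }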

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross-correlation has to be calculated at every position. Several efforts have been published to reduce the processing time: [43] uses normalized cross-correlation for faster comparison, and [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it becomes easier to compare the two images; for example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are its poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem to be two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, for deciding whether the feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well. [48] recognizes edge-based and patch-based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. A region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower-level features presented above. However, they do not try to apply higher-level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well-known descriptors


of this group are BOF (Bag of Features or Bag of Words, [52]) and HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: the first stage searches over the image at multiple scales using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated at the selected scale.

The biggest advantage of the descriptor is its scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation-invariant (frontal)


face detector should be able to detect faces upside down, but not faces photographed from a side view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is achieved by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus, SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus, no general class recognition is required; recognizing the ground robot used would be enough.
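For completeness, a minimal sketch of SIFT key-point extraction with OpenCV is shown below; note that, depending on the OpenCV version, the SIFT implementation lives either in the main features2d module or in the contrib (xfeatures2d/nonfree) module, so the exact class location is an assumption here:

    #include <opencv2/features2d.hpp>   // cv::SIFT in OpenCV >= 4.4
    #include <vector>

    // Extract SIFT key-points and 128-dimensional descriptors from a grayscale image.
    void extractSift(const cv::Mat& gray,
                     std::vector<cv::KeyPoint>& keypoints,
                     cv::Mat& descriptors) {
        cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
        sift->detectAndCompute(gray, cv::noArray(), keypoints, descriptors);
        // Matching these descriptors between a training image of the object and a
        // test image (e.g. brute-force matching with a ratio test) localizes the object.
    }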

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See Figure 2.5 for an example of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left" [45] of any given image location. The integral values for each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values at the corners of the rectangles.
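A minimal sketch of this computation with OpenCV's integral image is given below (the rectangle positions are only illustrative; in a real detector the integral image is computed once per frame and reused for every feature):

    #include <opencv2/opencv.hpp>

    // Sum of the pixel intensities inside rect in O(1), using the integral image
    // produced by cv::integral (its size is (rows+1) x (cols+1), type CV_32S).
    static int rectSum(const cv::Mat& integral, const cv::Rect& r) {
        return integral.at<int>(r.y, r.x)
             + integral.at<int>(r.y + r.height, r.x + r.width)
             - integral.at<int>(r.y, r.x + r.width)
             - integral.at<int>(r.y + r.height, r.x);
    }

    // Two-rectangle Haar-like feature: difference of two adjacent regions.
    int twoRectFeature(const cv::Mat& integralImage,
                       const cv::Rect& brightRegion, const cv::Rect& darkRegion) {
        return rectSum(integralImage, brightRegion) - rectSum(integralImage, darkRegion);
    }

    // Usage: cv::Mat ii; cv::integral(grayImage, ii, CV_32S);
    //        int value = twoRectFeature(ii, rectA, rectB);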

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in


Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves. The bottom row presents the features overlaid on a typical face. The first feature corresponds to the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

consumer cameras, for example) and inspired several pieces of research. [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are widely used in computer vision for various purposes: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is rotation variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization, produces a histogram of the different orientations (called bins) over small portions of the image (called cells). After a location-based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since


Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

HOG is essentially a dense version of the SIFT descriptor (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to orientation; thus, HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
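As an illustration, a minimal sketch of computing a HOG descriptor with OpenCV, using the original Dalal-Triggs parameters (64×128 window, 8×8 cells, 16×16 blocks, 9 bins), is shown below; these parameter values are only the common defaults from [53], not the ones used later in this project:

    #include <opencv2/objdetect.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    std::vector<float> computeHog(const cv::Mat& gray) {
        cv::HOGDescriptor hog(cv::Size(64, 128),   // detection window
                              cv::Size(16, 16),    // block size
                              cv::Size(8, 8),      // block stride
                              cv::Size(8, 8),      // cell size
                              9);                  // orientation bins
        cv::Mat window;
        cv::resize(gray, window, cv::Size(64, 128));
        std::vector<float> descriptor;
        hog.compute(window, descriptor);           // 3780 values with these settings
        return descriptor;
    }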

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by x_i, the corresponding labels by c_i, and the feature descriptors (extracted from the data x_i) by θ_i, where i = 1 ... N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | (θ, c)) · P(c)

is ideally 1 if x contains an instance of class c and 0 if not. The most famous examples of generative models are Principal Component

Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | (θ, x))

which is expected to be ideally 1 if x contains an instance of class c and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See Figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to be detected, two of the most well-known discriminative methods will be presented: Adaptive Boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expense caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weak classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights w_{t,i}, where N is the number of training examples and T is the number of features in the stage.

• For t = 1, ..., T:

1. Normalize the weights.


2. Select the best classifier using only a single feature, by minimising the detection error ε_t = Σ_i w_i |h(x_i, f, p, θ) − y_i|, where h(x_i) is the classifier output and y_i is the correct label (both 0 for negative and 1 for positive).

3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

4. Update the weights: w_{t+1,i} = w_{t,i} · (ε_t / (1 − ε_t))^{1−e_i}, where e_i = 0 if example x_i is classified correctly and e_i = 1 otherwise.

• The final classifier for the stage is based on the (weighted) sum of the weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
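A minimal sketch of the weight update of one boosting round is given below, following the Viola-Jones formulation quoted above (labels and weak-classifier outputs are assumed to be 0/1; the function name is illustrative):

    #include <vector>
    #include <cmath>
    #include <cstdlib>
    #include <numeric>

    // Down-weights the samples that the chosen weak classifier got right, so the
    // next weak classifier concentrates on the previously misclassified samples.
    void adaboostWeightUpdate(std::vector<double>& w,
                              const std::vector<int>& prediction,  // h_t(x_i), 0 or 1
                              const std::vector<int>& label) {     // y_i, 0 or 1
        double eps = 0.0;                                          // weighted error
        for (std::size_t i = 0; i < w.size(); ++i)
            eps += w[i] * std::abs(prediction[i] - label[i]);

        const double beta = eps / (1.0 - eps);
        for (std::size_t i = 0; i < w.size(); ++i) {
            const int e = (prediction[i] == label[i]) ? 0 : 1;     // 0 if correct
            w[i] *= std::pow(beta, 1 - e);                         // shrink correct ones
        }
        const double sum = std::accumulate(w.begin(), w.end(), 0.0);
        for (double& wi : w) wi /= sum;                            // re-normalize
    }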

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the elements of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in such a way that no points from different classes remain in the same part (in other words, to separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, to define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and therefore determining the width of the margin) are called support vectors, hence the name of the method. See Figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.

The algorithm was extended to tasks that are not linearly separable in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.
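To make the training interface concrete, a minimal sketch of a linear two-class SVM trained with Dlib (the library used later in this project) is shown below; the sample type and the value of the C parameter are only illustrative:

    #include <dlib/svm.h>
    #include <vector>

    typedef dlib::matrix<double, 0, 1> sample_type;          // feature vector (column)
    typedef dlib::linear_kernel<sample_type> kernel_type;

    dlib::decision_function<kernel_type>
    trainLinearSvm(const std::vector<sample_type>& samples,  // extracted feature vectors
                   const std::vector<double>& labels) {      // +1 / -1 class labels
        dlib::svm_c_linear_trainer<kernel_type> trainer;
        trainer.set_c(10);                                    // soft-margin penalty
        return trainer.train(samples, labels);
    }

    // Usage: the sign of trainLinearSvm(samples, labels)(test_sample) gives the
    // predicted class; its magnitude is the distance from the separating hyperplane.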


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in Section 1.3. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the software libraries used will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single-board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarizes the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well-supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In Figure 3.1 the unit is shown with optional accessories.

The Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integration both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multirotors during flight. Therefore, it was suitable for mapping purposes. See 4.4 for the details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays, consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aims of the project as well, resulting from their light weight and wide-angle field of view.


Figure 3.2: The chosen lidar sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size and weight (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level, interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and


scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software or an excellent general message passing system. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

ROS also contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
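As an illustration of how such a driver is consumed, a minimal roscpp subscriber for the lidar's scan messages is sketched below; the topic name "/scan" is the driver's usual default and is an assumption here, as the project's actual launch configuration may differ:

    #include <ros/ros.h>
    #include <sensor_msgs/LaserScan.h>

    void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan) {
        // Each message carries one 2D sweep of range readings from the scanner
        ROS_INFO("Received %zu range readings", scan->ranges.size());
    }

    int main(int argc, char** argv) {
        ros::init(argc, argv, "lidar_listener");
        ros::NodeHandle nh;
        ros::Subscriber sub = nh.subscribe("/scan", 10, scanCallback);
        ros::spin();
        return 0;
    }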

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate), image processing tools (e.g. different kinds of edge detection and threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, plenty of examples and support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read such inputs.
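A minimal sketch of this input handling is shown below (the helper name is illustrative); both branches hand identical cv::Mat frames to the rest of the system:

    #include <opencv2/opencv.hpp>
    #include <string>

    // Open either a recorded video file or the camera attached to the laptop.
    cv::VideoCapture openSource(const std::string& path) {
        if (path.empty())
            return cv::VideoCapture(0);   // default camera
        return cv::VideoCapture(path);    // video file simulating the live sensor
    }

    // Typical read loop:
    //   cv::Mat frame;
    //   while (capture.read(frame)) { /* hand the frame to the detector */ }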

3.2.4 Dlib

Dlib is a general-purpose, cross-platform C++ library. It has an excellent machine learning component, with the aim of providing a machine learning software development toolkit for the C++ language (see Figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it


Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

easy to develop and test custom-trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that, aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter, the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus, no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus, the solution of this task has to be completely self-made and new.

2. Limited resources: since the objective of this thesis is detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller


aerial vehicles can carry one and record high quality videos. The lidar used (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of the processing methods.

3. Moving platform: no matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera; see sub-subsection 2.2.1.1 for details.

4. Various viewpoints of the object: usually in the field of computer vision, the point of view of the required object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° around and from above.

5. Various sizes of the object: similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: the algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms, which is provided by another algorithm specially designed for that task. Thus, the software discussed here does not have hard real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


[Figure 4.1 block diagram: sensors (camera, 2D lidar), video reader, current frame, regions of interest, other preprocessing (edge/colour detection), 3D map, trainer with Vatic annotation server producing the front and side SVMs (support vector machines), detector algorithm, tracking, detections, evaluation]

Figure 4.1: A diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, detection algorithms and other parts of the system. Arrows represent the dependencies and the direction of the information flow.


4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a complete modular architecture was designed which is suitable for the objectives defined. See Figure 4.1 for an overview of the design. The main idea of the structure is to compensate for the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The following enumeration lists the main parts of the architecture; every module already implemented will be discussed later in this chapter in detail.

1. Training algorithm: as explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore, it is not required to be included in the final release. However, it is an essential part of the system, since it produces the classifiers for the detector. Its outputs have to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from the testing. See subsection 4.3.1 for details.

2. Sensor inputs: sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore, the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus, methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. Resulting from the complex structure of the project, this can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time; theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned (a minimal sketch of this idea is given after this list). As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented; all of them will be presented in section 4.3.

4. Evaluation: to make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and


debugging purposes. However, it had to be developed in parallel with every other part to make sure it is compatible and up to date; thus, evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: the detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.
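The velocity-based region of interest idea mentioned in point 3 can be sketched as follows (a minimal, illustrative sketch: the function name, the map frame and the numeric values are assumptions, not implemented parts of the system):

    #include <opencv2/core.hpp>

    // Everything outside a circle of radius v_max * dt around the last known
    // position can be excluded from the search; returned here as an axis-aligned
    // box for easy cropping of the map or the registered camera image.
    cv::Rect2f searchRegion(const cv::Point2f& lastPosition,  // metres, map frame
                            float maxSpeed,                   // metres per second
                            float elapsedSeconds) {
        const float r = maxSpeed * elapsedSeconds;
        return cv::Rect2f(lastPosition.x - r, lastPosition.y - r, 2 * r, 2 * r);
    }

    // Example: a robot capped at 1 m/s, last seen 0.25 s ago (a 4 Hz update rate),
    // can only be within 0.25 m of its previous position.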

4.3 2D image processing methods

In this section the concerns, methods and development process of the 2D camera image processing will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the image processing libraries and toolkits used were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are much more significant. Variance in the relative positions of these patches (e.g. at different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable for recognising


a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different), but also because of moving parts, like rotating and steering wheels, for example. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to move between premises with no a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus, it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore, the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4). This means that even if the camera is levelled, the object itself could seem rotated because of the perspective. To overcome this issue, solutions will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, seriously optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and pieces of documentation are available for both training and detection, which simplified the implementation.

As classifiers, support vector machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection


problems (like this one) they outperform the other popular methods. Also, they are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transfer of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an XML file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might get into the negative training images otherwise.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar XML file. This is a very convenient feature, since major errors are discovered during the training process already. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
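A minimal sketch of what such a training run does with Dlib's HOG/SVM tools is shown below, based on the library's fhog object detector interface; the file name, detection window size and the C parameter are illustrative and not necessarily the values used by the implemented tool:

    #include <dlib/svm_threaded.h>
    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>

    int main() {
        typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;

        // Images and annotated boxes are read from the XML file described above
        dlib::array<dlib::array2d<unsigned char>> images;
        std::vector<std::vector<dlib::rectangle>> boxes;
        dlib::load_image_dataset(images, boxes, "side_view_training.xml");

        image_scanner_type scanner;
        scanner.set_detection_window_size(80, 80);

        dlib::structural_object_detection_trainer<image_scanner_type> trainer(scanner);
        trainer.set_num_threads(4);
        trainer.set_c(1);   // SVM regularization parameter

        // Negative samples are mined automatically from the un-annotated image areas
        dlib::object_detector<image_scanner_type> detector = trainer.train(images, boxes);

        dlib::serialize("side_view_detector.svm") << detector;   // save to disk
        return 0;
    }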

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore, multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and the right, while the front and rear views are also very similar. This is especially true if the working principle of HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore, two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front or the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of


the project. Due to the way the SVMs are trained, they can not only detect the position of the robot, but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so the charging contacts will be connected properly.
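A minimal sketch of how the two trained detectors can be evaluated together, with the firing detector giving a rough orientation label, is shown below (the enum and function names are illustrative):

    #include <dlib/image_processing.h>
    #include <vector>
    #include <utility>

    typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;
    typedef dlib::object_detector<image_scanner_type> detector_type;

    enum class View { Front, Side };

    // Which of the two detectors fired tells whether the UGV is seen (roughly)
    // head-on/from the rear or from one of its sides.
    std::vector<std::pair<dlib::rectangle, View>>
    detectRobot(const dlib::array2d<unsigned char>& frame,
                detector_type& frontDetector, detector_type& sideDetector) {
        std::vector<std::pair<dlib::rectangle, View>> result;
        for (const dlib::rectangle& r : frontDetector(frame))
            result.emplace_back(r, View::Front);
        for (const dlib::rectangle& r : sideDetector(frame))
            result.emplace_back(r, View::Side);
        return result;
    }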

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Anything aside from the side or front of the robot (the top, for example) was hidden or cropped. This method eliminated the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even when viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections; the reason is that while side and front views are well defined, it is hard to frame and train on a diagonal view.

On figure 4.2 a comparison of training images and the produced HOG detectors is shown. Figure 4.2(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. Figure 4.2(b) is the visualized final side-view detector: both the wheels and the body are easy to recognize thanks to their strong edges. On figure 4.2(c) a training image is displayed from the front-view detector training image set, and 4.2(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers turn out to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive; please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that their training datasets (the images of the robot or any other object) are usually cropped, containing only the object and some margin around it. As a result, a detector trained this way will only work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contained only the robot there would be no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus, a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.


Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with predefined step sizes and scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step sizes), eventually every robot (or other sought object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
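As an illustration, a minimal sliding-window generator is sketched below in C++ with OpenCV; the window size, step and scale parameters are assumptions for the example, not the values used in the project (Dlib's scan_fhog_pyramid performs this scan internally).

#include <opencv2/opencv.hpp>
#include <vector>

// Enumerates candidate windows over an image pyramid; each rectangle is
// expressed in the coordinates of the original image.
std::vector<cv::Rect> slidingWindows(const cv::Mat& frame,
                                     cv::Size window  = cv::Size(80, 80),
                                     int step         = 16,
                                     double scaleStep = 1.25,
                                     int numScales    = 4)
{
    std::vector<cv::Rect> candidates;
    double scale = 1.0;
    for (int s = 0; s < numScales; ++s) {
        cv::Mat scaled;
        cv::resize(frame, scaled, cv::Size(), 1.0 / scale, 1.0 / scale);
        for (int y = 0; y + window.height <= scaled.rows; y += step)
            for (int x = 0; x + window.width <= scaled.cols; x += step)
                candidates.emplace_back(cvRound(x * scale), cvRound(y * scale),
                                        cvRound(window.width * scale),
                                        cvRound(window.height * scale));
        scale *= scaleStep;
    }
    return candidates;
}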

It is worth mentioning that multiple instances of the sought object may be present on the image; for face or pedestrian detection algorithms, for example, this is a typical situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are completely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus, methods which could reduce the amount of area to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see 3) in which the HOG (or other) computationally heavy detectors are executed in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture; this practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Based on intuition, a good separating feature is colour: anything which does not have the same colour as the ground vehicle could be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory), along with many other premises reviewed, has a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. On figure 4.4 the result of an edge detection on a typical input image of the system is presented; note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges therefore often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, so these areas are still scanned.
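A minimal sketch of such an edge-density pre-filter is shown below; the Canny thresholds and the dilation radius are illustrative assumptions and would need tuning on the project's own footage.

#include <opencv2/opencv.hpp>

// Marks image areas that contain enough edges to be worth scanning with the
// HOG+SVM detector; homogeneous regions (empty floor, walls) are masked out.
cv::Mat edgeRoiMask(const cv::Mat& frame)
{
    cv::Mat gray, edges, mask;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 80, 160);                          // assumed thresholds
    cv::dilate(edges, mask, cv::Mat(), cv::Point(-1, -1), 5); // merge sparse edges
    return mask;   // non-zero pixels: candidate areas for the sliding window
}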

Another idea is to filter the image by the detections made on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments; see subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information about the object (like appearance, previous location, speed and trajectory) and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was presented.


Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment; on the other hand, chairs and other objects have a lot of edges as well.

Tracking is the extension of this concept: if the position of the robot is already known, there is no need to find it again. To achieve the objectives, it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing an own 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]; the method was evaluated extensively and outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed; since the classifiers return bounding boxes, these can be used as the input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
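The snippet below sketches how Dlib's correlation tracker can be initialised from a detection box and updated on later frames; the OpenCV-to-Dlib image wrapping and the function names are illustrative, not the project's exact code.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

// Start following the region returned by the HOG+SVM detector.
dlib::correlation_tracker startTracking(const cv::Mat& firstFrame,
                                        const cv::Rect& detection)
{
    dlib::cv_image<dlib::bgr_pixel> img(firstFrame);
    dlib::correlation_tracker tracker;
    tracker.start_track(img, dlib::rectangle(detection.x, detection.y,
                                             detection.x + detection.width,
                                             detection.y + detection.height));
    return tracker;
}

// Called on every subsequent frame; returns the current position estimate.
dlib::drectangle updateTracking(dlib::correlation_tracker& tracker,
                                const cv::Mat& frame)
{
    dlib::cv_image<dlib::bgr_pixel> img(frame);
    tracker.update(img);            // also returns a confidence score if needed
    return tracker.get_position();
}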

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment: the UAV will have an estimate of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then, different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development, four different approaches were implemented. Each builds on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as separate programs; instead, all were included in the same software, which decides at run-time which mode to execute, based on a parameter.

Aside from switching between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 4.1: The available parameters

Name                 Valid values       Function
input                path to video      video as input for detection
svm                  path to SVMs       these SVMs will be used
mode                 [1, 2, 3, 4]       selects which mode is used
saveFrames           [0, 1]             turns on video frame export
saveDetections       [0, 1]             turns on detection box export
saveFPS              [0, 1]             turns on frame-rate measurement
displayVideo         [0, 1]             turns on video display
DetectionsFileName   string             sets the filename for saved detections
FramesFolderName     string             sets the folder name used for saving video frames
numberOfLoops        positive integer   sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is also an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarizes all the implemented parameters with their possible values and a short description.

Note that all these parameters have default values, thus it is possible but not compulsory to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would save neither the detections (saveDetections) nor the video frames (saveFrames).


The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful for simulating longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.
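A parameter file of this form can be read with a few lines of C++; the sketch below (with keys matching table 4.1) simply collects whitespace-separated key/value pairs into a map and is not the project's actual parser; comment or description lines would additionally need to be skipped.

#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Reads "key value [value ...]" lines into a lookup table; unknown keys are
// kept so that new modules can define their own parameters.
std::map<std::string, std::vector<std::string>> loadParameters(const std::string& path)
{
    std::map<std::string, std::vector<std::string>> params;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream iss(line);
        std::string key, value;
        if (!(iss >> key)) continue;              // skip empty lines
        while (iss >> value) params[key].push_back(value);
    }
    return params;
}

// Example use:
//   auto p = loadParameters("detector.params");
//   int mode = p.count("mode") ? std::stoi(p["mode"][0]) : 1;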

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.1), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object; these are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all three.

As can be seen, mode 1 is rather simple: it checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this exhaustive search is computationally very heavy, especially with two classifiers, which results in the lowest frame per second rate of all the methods.

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained: on figure 4.2 it can be seen that one of them represents the front/rear view, while the other was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but due to the perspective distortion these detectors will not recognize both of them (and that is not needed either, since there is no reason to find the robot twice).

Therefore, it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. Afterwards, only that one is used on the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect, mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact like motion blur, a change in exposure, etc.
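The classifier-memory logic can be sketched as follows; the variable names and the tolerance value are illustrative assumptions, not the project's exact implementation.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> scanner_t;
typedef dlib::object_detector<scanner_t> detector_t;

// Runs only the classifier that detected the robot last time; falls back to
// all classifiers after 'toleranceLimit' consecutive misses.
std::vector<dlib::rectangle> detectMode2(std::vector<detector_t>& detectors,
                                         const dlib::cv_image<dlib::bgr_pixel>& frame,
                                         int& activeDetector, int& missedFrames,
                                         int toleranceLimit = 5)   // assumed value
{
    std::vector<dlib::rectangle> hits;
    if (activeDetector >= 0 && missedFrames <= toleranceLimit) {
        hits = detectors[activeDetector](frame);
        if (hits.empty()) ++missedFrames; else missedFrames = 0;
    }
    if (hits.empty() && (activeDetector < 0 || missedFrames > toleranceLimit)) {
        for (std::size_t i = 0; i < detectors.size(); ++i) {   // try every SVM again
            hits = detectors[i](frame);
            if (!hits.empty()) { activeDetector = static_cast<int>(i); missedFrames = 0; break; }
        }
    }
    return hits;
}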

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one: theoretically, it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. The possible movement of the camera is involved as well: for example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection alongside the detector which returned it.


Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest; the blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that area. Note the ratio between the ROI and the full size of the image.

A new rectangle, named ROI (region of interest), was introduced, which determines the area in which the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, due to the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers described for mode 2 is included here as well.

In the unfortunate case that none of the detectors returns a detection inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
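A possible sketch of this ROI update is given below; the growth percentages are the defaults quoted above, while the function names are illustrative.

#include <opencv2/opencv.hpp>

// Grows a rectangle symmetrically by 'ratio' of its size and clips it to the image.
cv::Rect growRect(const cv::Rect& r, double ratio, const cv::Size& imageSize)
{
    int dw = cvRound(r.width  * ratio / 2.0);
    int dh = cvRound(r.height * ratio / 2.0);
    cv::Rect grown(r.x - dw, r.y - dh, r.width + 2 * dw, r.height + 2 * dh);
    return grown & cv::Rect(0, 0, imageSize.width, imageSize.height);
}

// ROI update rule of mode 3: re-centre on a hit, slowly expand on misses.
cv::Rect updateRoi(cv::Rect roi, const cv::Rect& lastDetection,
                   bool detectedThisFrame, const cv::Size& imageSize)
{
    if (detectedThisFrame)
        return growRect(lastDetection, 0.50, imageSize);  // 50% growth after a hit
    return growRect(roi, 0.03, imageSize);                // 3% growth after a miss
}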

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame, the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the new region of interest.


Please note that while the ROI is currently updated from the previous detections, it is very easy to swap this for another "pre-filter", thanks to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and, in appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. If the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3), and the algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm continues to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards, the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears, and then the tracker is reinitialized.
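A simplified sketch of this validation loop is given below; for brevity the detectors are run on the whole frame and their hits are filtered by the enlarged tracked box, whereas the real implementation scans only the ROI. All names and the tolerance value are assumptions.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <vector>

typedef dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6>>> detector_t;

// Returns true while the track is considered valid; re-anchors the tracker
// whenever a detector confirms the object inside the enlarged tracked box.
bool validateTrack(dlib::correlation_tracker& tracker,
                   const dlib::cv_image<dlib::bgr_pixel>& frame,
                   std::vector<detector_t>& detectors,
                   int& failedValidations, int toleranceLimit = 10)  // assumed
{
    tracker.update(frame);
    dlib::drectangle pos = tracker.get_position();

    // enlarge the tracked box into a ROI-like acceptance region (50% margin assumed)
    double mx = pos.width() * 0.5, my = pos.height() * 0.5;
    dlib::drectangle roi(pos.left() - mx, pos.top() - my,
                         pos.right() + mx, pos.bottom() + my);

    for (auto& det : detectors)
        for (const dlib::rectangle& hit : det(frame))
            if (roi.contains(dlib::dcenter(hit))) {
                tracker.start_track(frame, hit);   // reinitialize on a confirmed hit
                failedValidations = 0;
                return true;
            }

    return ++failedValidations <= toleranceLimit;  // labelled as lost after too many misses
}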

See figure 4.7(a) for a representation of the processing method of mode 4.


Figure 4.7: (a) A presentation of mode 4: the red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the lidar used is a 2D type, which means it scans only in a plane: the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of such a scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated task, since to cover the whole room the lidar needs to be moved, and this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into three parts: part a is a side view of the room being recorded, including the scanner and its dependences; part b is a top view of the very same room, displaying the field of view of the scanner; part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar. This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c); on part a it is shown in two different orientations: first when the recording plane is completely flat, and second when it is tilted.

2. Visualisation of the laser ray. The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye; red was chosen only for visualization.)

3. Pitch angle of the first orientation. While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. It is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation. This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray. As mentioned before, the sensor operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined by polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field; part b represents a state when the lidar is horizontal.

6. Field of view of the sensor. Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind spot of the sensor. As discussed in the previous point, the scanner can register distances over 270°. The remaining 90° is a blind spot located exactly behind the sensor (from 135° to −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit. All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this plane to the 3D map's coordinate system, the orientation and position of the lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation. To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation; this means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source. As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and is marked in blue on the figure.

11. Ground vehicle. The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself. On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification. The file name is user defined.

Since both the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. As the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurement. The lidar recorder tool does this automatically; this function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.


Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross; green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


However, since the two systems were not connected, the timestamps had to be synchronized manually.

On figure 4.10 the user interface of the application can be seen; note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The x axis is parallel with the shorter side of the screen, pointing from left to right. The y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The z axis points up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. The angles were calculated in the following way:

• pitch is the rotation around the x axis; it is 0 if the phone is lying on a horizontal surface,

• roll is the rotation around the y axis,

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons arose for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft: it is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robot Operating System (3.2.2); this way the two types of sensors were recorded with the same software.


Figure 4.11: (a) picture of the laboratory and the mounting of the phone; (b) and (c) the 3D map produced from the scan, viewed from different viewpoints ((b) shows its elevation). Some objects were marked on all images for easier understanding: 1, 2 and 3 are boxes, 4 is a chair, 5 is a student at his desk, 6 and 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.


Thus, the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described, while subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the obtained point coordinates were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z (x = y × z), pointing to the right,

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward,

Figure 4.12: Presentation of the axes used in the Android application.


• the z axis points up and is perpendicular to the ground.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system: three values represent its position along the x, y and z axes, while the other three describe its orientation (yaw, pitch and roll).

First, the rotation of the scanner is considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change, thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory, the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen on figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated as the sum of dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)


where dy is the translation required along the y axis and dz the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8, part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

(x, y, z) = distance · (sin(−yaw), cos(yaw)·sin(pitch), cos(yaw)·cos(pitch)) + offset · (0, sin(pitch), cos(pitch))    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work, such as SLAM (simultaneous localization and mapping) or the ground robot detection.
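Equation 4.6 translates directly into code; the small helper below is a sketch of that mapping (angles in radians, names chosen to mirror the equation).

#include <cmath>

struct Point3 { double x, y, z; };

// Maps one lidar measurement (distance, yaw) taken at a given pitch angle of
// the tilted scanner into the ground-fixed frame, following equation 4.6.
// 'offset' is the measured distance between the light source and the tilt axis.
Point3 lidarToWorld(double distance, double yaw, double pitch, double offset)
{
    Point3 p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}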


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system, to track the improvements of the code and to check whether it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample; in this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite it not being one. In other words, the system believes the object is present at that location although that is not the case; in this project, that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (it is rejected) although it should be. In the current task, a false negative error means that the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If more strict rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions of the detection method are loosened, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it can be more important to find every possible sought object than to avoid occasional mistakes; such a project can be a manually supervised classification, where the appearing false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be the more important one. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is on the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset whose samples are already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2); their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless, and the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. For this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use, on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player, which makes it possible to review the video with the annotation at normal speed or even frame by frame. The system was designed so that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object as occluded, obstructed or outside of the frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames, the only remaining task is to correct the interpolations between them where necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic is explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way, large datasets with several videos are easy to build for a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary; all the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned to the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1 speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every other unnecessary process was terminated, so their influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes caused by the operating system during the test did not influence the results significantly.


To analyse the exported processing times, a Matlab script was implemented. This tool loads the log files from the test and calculates statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change their behaviour during execution by definition, which results in a varying processing time; the variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813 s
Shortest processing time: 0.007 s
Average elapsed seconds: 0.032446
Variance of elapsed seconds per frame:
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS

The changes of the average frame-rates between video loops were also monitored; if this value was too big, it probably indicated some unexpected load on the computer, and the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter it takes to "slide" the detector across it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready to use". On the other hand, data from lidar sensors has to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that 3D lidar scanners exist, but they still scan in a plane and simply have this processing built in; also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras; it contains valuable information for a project like this and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown; in the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map, viewed from different points. As can be seen, recording from a single point is not enough to create complete maps, as "shadows" and obscured parts still occur; note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos of all modes can be found at users.itk.ppke.hu/~palan1/Cranfield/Videos. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.


Figure 5.3: Examples of the 3D images built: (a) example result of the 3D scan of the ground robot; (b) example of the "shadow" of the ground robot. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be measured by several values; here, recall and precision will be used. Recall is defined as

recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined as

precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation, and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives; similarly, a stricter detector that produces more false negatives can still achieve a higher precision.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on position and the overlapping areas; for the latter, a minimum of 50% was defined (the ratio of the area of the intersection and of the union of the two rectangles).
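This overlap criterion is the usual intersection-over-union test; a minimal sketch of it is given below (the 0.5 threshold is the value quoted above).

#include <opencv2/opencv.hpp>

// A detection matches an annotation if the intersection-over-union of the two
// rectangles reaches 50%.
bool isMatch(const cv::Rect& detection, const cv::Rect& annotation)
{
    double intersection = (detection & annotation).area();
    double unionArea    = detection.area() + annotation.area() - intersection;
    return unionArea > 0.0 && intersection / unionArea >= 0.5;
}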

Note that this also means that even if a detection is at the correct location, it may be logged as a false positive error if the overlap of the rectangles is not satisfactory. Unfortunately, registering such a box as a false positive also generates a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged), and therefore the annotated object will not be covered by any of the detections.

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers; the results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first three modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at viewpoints where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned; it can process around 12 frames per second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS      Variance
mode 1   0.623    1           4.2599   0.00000123
mode 2   0.622    1           6.509    0.0029557
mode 3   0.645    1           12.06    0.0070877
mode 4   0.955    0.898       30.82    0.0021144

In conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (95% recall) and speed (average frame-rate of 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased: as mentioned before, even a correctly positioned but incorrectly scaled detection decreases the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2): unmanned aerial vehicles were presented and examples were shown for their application fields, then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then, the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend, or to replace modules. The currently implemented and used modules were listed and explained; some unsuccessful experiments were mentioned as well.

Subsection 4.3.4 concluded the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software along with the produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features, like exporting detections, demonstration videos and frame rate measurements, were implemented. To make development simpler, an easy-to-use interface was included which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections. This resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.
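A minimal sketch of this detect-then-track logic, assuming the dlib FHOG detectors and correlation tracker used in the project; the detector file name, the frame file names, the validation period and the confidence threshold are illustrative assumptions:

    #include <vector>
    #include <string>
    #include <dlib/image_processing.h>
    #include <dlib/image_io.h>
    #include <dlib/serialize.h>

    typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6> > image_scanner_type;

    int main()
    {
        // Assumed inputs: a trained detector and a sequence of frames on disk.
        dlib::object_detector<image_scanner_type> detector;
        dlib::deserialize("robot_detector.svm") >> detector;

        std::vector<std::string> frames;
        for (int i = 1; i <= 1000; ++i)
            frames.push_back("frame" + std::to_string(i) + ".png");

        dlib::correlation_tracker tracker;
        bool tracking = false;
        int framesSinceCheck = 0;
        const int REVALIDATE_EVERY = 30;    // assumed validation period (frames)
        const double MIN_CONFIDENCE = 7.0;  // assumed tracker confidence threshold

        dlib::array2d<unsigned char> img;
        for (size_t i = 0; i < frames.size(); ++i)
        {
            dlib::load_image(img, frames[i]);

            if (!tracking)
            {
                // Sliding window detection until the robot is found.
                std::vector<dlib::rectangle> dets = detector(img);
                if (!dets.empty())
                {
                    tracker.start_track(img, dets[0]);
                    tracking = true;
                    framesSinceCheck = 0;
                }
            }
            else
            {
                // Follow the robot; update() returns a confidence value.
                double confidence = tracker.update(img);
                ++framesSinceCheck;

                // Re-run the detector periodically or when confidence drops,
                // so a drifting box (e.g. onto a chair) is caught quickly.
                if (framesSinceCheck >= REVALIDATE_EVERY || confidence < MIN_CONFIDENCE)
                    tracking = false;
            }
        }
        return 0;
    }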

Special attention was given to the evaluation of the system. Two pieces of software were developed as additional tools: one for evaluating the efficiency of the detections and another for analysing the processing time and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detecting algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping along with the applied mathematics. An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. thresholding based on height) or more complex object detection methods (e.g. registering 3D keypoints) on them in the future.
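As an illustration of the simpler, height-based idea, the following sketch keeps only the points whose height above the floor falls into the band where the UGV is expected; the plain point structure and the thresholds are assumptions, not the project's actual point-cloud format:

    #include <vector>

    struct Point3D { float x, y, z; };   // assumed simple point type (metres)

    // Keep only points whose height above the floor plane lies in the band
    // where the ground robot is expected (e.g. 0.05 m - 0.60 m).
    std::vector<Point3D> filterByHeight(const std::vector<Point3D>& cloud,
                                        float floorZ, float minH, float maxH)
    {
        std::vector<Point3D> candidates;
        for (size_t i = 0; i < cloud.size(); ++i)
        {
            const float h = cloud[i].z - floorZ;
            if (h >= minH && h <= maxH)
                candidates.push_back(cloud[i]);
        }
        return candidates;
    }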

In conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely frontal angles), will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Besides the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.
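A minimal sketch of such a validation step, assuming the confidence value returned by the correlation tracker; the threshold, the expected aspect ratio and the allowed scale change are illustrative assumptions:

    #include <cmath>
    #include <dlib/geometry.h>

    // Accept the tracker's box only if the confidence is high enough and the box
    // still has a plausible shape and size compared to the last accepted one.
    bool validateTrackedBox(const dlib::drectangle& box, double confidence,
                            double expectedAspect, double lastArea)
    {
        const double MIN_CONFIDENCE = 7.0;   // assumed threshold for the tracker score
        const double aspect = box.width() / box.height();
        const double areaRatio = (box.width() * box.height()) / lastArea;

        return confidence > MIN_CONFIDENCE
            && std::fabs(aspect - expectedAspect) < 0.3 * expectedAspect // shape check
            && areaRatio > 0.5 && areaRatio < 2.0;                       // reject scale jumps
    }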



Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects of roughly the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully so that corresponding areas can be found.
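Such a registration could, for example, project a 3D candidate point from the laser scanner's frame into the camera image with a pinhole model; the extrinsic rotation and translation and the intrinsic parameters below are assumed to come from a separate calibration step:

    #include <opencv2/opencv.hpp>

    // Project a point given in the laser scanner's frame into the camera image,
    // assuming known extrinsics (R, t) and pinhole intrinsics (fx, fy, cx, cy).
    cv::Point2f projectToImage(const cv::Point3f& pLaser,
                               const float R[3][3], const cv::Point3f& t,
                               float fx, float fy, float cx, float cy)
    {
        // Transform into the camera frame.
        const float xc = R[0][0]*pLaser.x + R[0][1]*pLaser.y + R[0][2]*pLaser.z + t.x;
        const float yc = R[1][0]*pLaser.x + R[1][1]*pLaser.y + R[1][2]*pLaser.z + t.y;
        const float zc = R[2][0]*pLaser.x + R[2][1]*pLaser.y + R[2][2]*pLaser.z + t.z;

        // Pinhole projection; the resulting pixel can seed a 2D region of interest.
        return cv::Point2f(fx * xc / zc + cx, fy * yc / zc + cy);
    }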

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically, it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way that covers the possible locations of the UGV with its sensors, or simply ignore (not process) inputs from other areas.
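The reachable-circle argument reduces to a single bound, r = v_max · Δt; a small sketch of the corresponding test (coordinates in metres, time in seconds, all values assumed):

    // The UGV cannot have travelled further than its maximal speed allows
    // since it was last seen, so only a circle of radius r has to be scanned.
    bool insideSearchCircle(double candX, double candY,
                            double lastX, double lastY,
                            double maxSpeed, double elapsedSeconds)
    {
        const double r = maxSpeed * elapsedSeconds;   // r = v_max * dt
        const double dx = candX - lastX;
        const double dy = candY - lastY;
        return dx * dx + dy * dy <= r * r;
    }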

It is strongly recommended to give high priority to evaluation and testing during further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections it is not suitable, but a similar solution is needed to keep up the progress towards the planned, ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview senseFly." [Online]. Available: httpswwwsenseflycomdronesoverviewhtml [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: httparcaiaaorgdoiabs10251462007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store Phantom 3 Standard." [Online]. Available: httpstoredjicomproductphantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: httparcaiaaorgdoiabs10251462010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: httpwwwmdpicom2072-42924113390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: httplinkspringercom101007s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: httpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," in Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951ampAgg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997.

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: httpdxdoiorg101109CVPR2005177

[54] J. M. Rainer Lienhart, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. C. Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: httproverardupilotcomwikipixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed at 2015-07-22]

[72] "OpenCV manual, Release 2.4.9." [Online]. Available: httpdocsopencvorgopencv2refmanpdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: httpjmlrcsailmitedupapersv10king09ahtml

[74] "dlib C++ Library." [Online]. Available: httpdlibnet [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: httparxivorgabs150200046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed at 2015-07-20]

[78] "Vatic - Video Annotation Tool - UC Irvine." [Online]. Available: httpwebmiteduvondrickvatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: httpswwwmturkcommturkwelcome [Accessed at 2015-07-26]



Contents

List of Figures

Absztrakt

Abstract

List of Abbreviations

1 Introduction and project description
  1.1 Project description and requirements
  1.2 Type of vehicle
  1.3 Aims and objectives

2 Literature Review
  2.1 UAVs and applications
    2.1.1 Fixed-wing UAVs
    2.1.2 Rotary-wing UAVs
    2.1.3 Applications
  2.2 Object detection on conventional 2D images
    2.2.1 Classical detection methods
      2.2.1.1 Background subtraction
      2.2.1.2 Template matching algorithms
    2.2.2 Feature descriptors, classifiers and learning methods
      2.2.2.1 SIFT features
      2.2.2.2 Haar-like features
      2.2.2.3 HOG features
      2.2.2.4 Learning models in computer vision
      2.2.2.5 AdaBoost
      2.2.2.6 Support Vector Machine

3 Development
  3.1 Hardware resources
    3.1.1 Nitrogen board
    3.1.2 Sensors
      3.1.2.1 Pixhawk autopilot
      3.1.2.2 Camera
      3.1.2.3 LiDar
  3.2 Chosen software
    3.2.1 Matlab
    3.2.2 Robotic Operating System (ROS)
    3.2.3 OpenCV
    3.2.4 Dlib

4 Designing and implementing the algorithm
  4.1 Challenges in the task
  4.2 Architecture of the detection system
  4.3 2D image processing methods
    4.3.1 Chosen methods and the training algorithm
    4.3.2 Sliding window method
    4.3.3 Pre-filtering
    4.3.4 Tracking
    4.3.5 Implemented detector
      4.3.5.1 Mode 1: Sliding window with all the classifiers
      4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      4.3.5.4 Mode 4: Tracking based approach
  4.4 3D image processing methods
    4.4.1 3D recording method
    4.4.2 Android based recording set-up
    4.4.3 Final set-up with Pixhawk flight controller
    4.4.4 3D reconstruction

5 Results
  5.1 2D image detection results
    5.1.1 Evaluation
      5.1.1.1 Definition of True positive and negative
      5.1.1.2 Definition of False positive and negative
      5.1.1.3 Reducing number of errors
      5.1.1.4 Annotation and database building
    5.1.2 Frame-rate measurement and analysis
  5.2 3D image detection results
  5.3 Discussion of results

6 Conclusion and recommended future work
  6.1 Conclusion
  6.2 Recommended future work

References

List of Figures

1.1 Image of the ground robot

2.1 Fixed wing consumer drone
2.2 Example for consumer drones
2.3 Example for people detection with background subtraction
2.4 Example of template matching
2.5 2 example Haar-like features
2.6 Illustration of the discriminative and generative models
2.7 Example of a separable problem

3.1 Image of Pixhawk flight controller
3.2 The chosen LIDAR sensor Hokuyo UTM-30LX
3.3 Elements of Dlib's machine learning toolkit

4.1 A diagram of the designed architecture
4.2 Visualization of the trained HOG detectors
4.3 Representation of the sliding window method
4.4 Example image for the result of edge detection
4.5 Example of the detector's user interface
4.6 Presentation of mode 3
4.7 Mode 4 example output frames
4.8 Schematic figure to represent the 3D recording set-up
4.9 Example representation of the output of the Lidar Sensor
4.10 Screenshot of the Android application for Lidar recordings
4.11 Picture about the laboratory and the recording set-up
4.12 Presentation of the axes used in the android application

5.1 Figure of the Vatic user interface
5.2 Photo of the recording process
5.3 Example of the 3D images built

List of Abbreviations

SATM   School of Aerospace Technology and Manufacturing
UAV    Unmanned Aerial Vehicle
UAS    Unmanned Aerial System
UA     Unmanned Aircraft
UGV    Unmanned Ground Vehicle
HOG    Histogram of Oriented Gradients
RC     Radio Controlled
ROS    Robotic Operating System
IMU    Inertial Measurement Unit
DoF    Degree of Freedom
SLAM   Simultaneous Localization And Mapping
ROI    Region Of Interest
Vatic  Video Annotation Tool from Irvine, California

Absztrakt

Ezen dolgozatban bemutatásra kerül egy pilóta nélküli repülő járműre szerelt követő rendszer, ami beltéri objektumokat hivatott detektálni, és legfőbb célja a földi egység megtalálása, amely leszálló és újratöltő állomásként fog szolgálni.

Először a projekt, illetve a tézis céljai lesznek felsorolva és részletezve. Ezt követi egy részletes irodalomkutatás, amely bemutatja a hasonló kihívásokra létező megoldásokat. Szerepel egy rövid összefoglalás a pilóta nélküli járművekről és alkalmazási területeikről, majd a legismertebb objektumdetektáló módszerek kerülnek bemutatásra. A kritika tárgyalja előnyeiket és hátrányaikat, különös tekintettel a jelenlegi projektben való alkalmazhatóságukra.

A következő rész a fejlesztési körülményekről szól, beleértve a rendelkezésre álló szoftvereket és hardvereket.

A feladat kihívásainak bemutatása után egy moduláris architektúra terve kerül bemutatásra, figyelembe véve a célokat, erőforrásokat és a felmerülő problémákat.

Ezen architektúra egyik legfontosabb modulja, a detektáló algoritmus legfrissebb változata részletezve is szerepel a következő fejezetben, képességeivel, módjaival és felhasználói felületével együtt.

A modul hatékonyságának mérésére létrejött egy kiértékelő környezet, mely képes számos metrikát kiszámolni a detekcióval kapcsolatban. Mind a környezet, mind a metrikák részletezve lesznek a következő fejezetben, melyet a legfrissebb algoritmus által elért eredmények követnek.

Bár ez a dolgozat főként a hagyományos (2D) képeken operáló detekciós módszerekre koncentrál, 3D képalkotási és feldolgozó módszerek szintén megfontolásra kerültek. Elkészült egy kísérleti rendszer, amely képes látványos és pontos 3D térképek létrehozására egy 2D lézer szkenner használatával. Számos felvétel készült a megoldás kipróbálására, amelyek a rendszerrel együtt bemutatásra kerülnek.

Végül az implementált módszerek és az eredmények összefoglalója zárja a dolgozatot.


Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed along with their suitability for the discussed project.

Then the development environment is described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluation tools are introduced to analyse efficiency and speed.

The first version of the ground robot detecting algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV, which are presented in this paper as well.

Finally, all implemented methods and relevant results are concluded.


Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards, the structure of the sub-tasks in the project is presented along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly in real-time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required so the map will be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision to coordinate the process (for example when mapping a dangerous area which would be too far for real-time control). Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-storey buildings. The system has to manage these tasks and provide a map which contains all the floors, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, and therefore a longer operation time. On the other hand, it cannot elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles blocking its way or to get to another floor.



Figure 1.1: Image of the ground robot. Source: own picture

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable to the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution to this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot based on the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction to unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief summary of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, Unmanned Aerial Vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of their much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize.



Figure 2.1: One of the most popular consumer level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

However, they still have quite an extensive market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurance using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore, flying around between and inside buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with a heavy payload [4].

Fixed-wing UAVs are an excellent choice for long endurance, high altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly like the wing on fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight where space is limited.



Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since merely hovering requires constant compensation. Multirotors must individually adjust the thrust of each motor: if the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason why no large scale multirotor is used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].



In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example of this application field is the location of an earthquake, or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have proven admittedly useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors into the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers. They can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To get an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item

• Recognition: localization of all the items present in the stimulus

• Understanding: recognition plus the role of the stimulus in the context of the scene

while [35] sets up five classes defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image and provide accurate location information for the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest distinguishing property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

As a result of the rapid development of hardware resources, along with the extensive research on artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity often makes them attractive for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image.



Figure 2.3: Example of background subtraction used for people detection. The red patches show the object in the segmented foreground. Source: [39]

This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.
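A minimal sketch of this idea with OpenCV's Gaussian-mixture background subtractor (the OpenCV 2.4-style API is assumed; the video source, the model parameters and the morphological clean-up are illustrative):

    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::VideoCapture cap(0);   // assumed camera / video source
        cv::BackgroundSubtractorMOG2 subtractor(500, 16.0, true); // history, varThreshold, shadows

        cv::Mat frame, fgMask;
        while (cap.read(frame))
        {
            // Update the background model and get the foreground mask in one step.
            subtractor(frame, fgMask);

            // Simple clean-up: remove isolated noise pixels and fill small holes.
            cv::erode(fgMask, fgMask, cv::Mat());
            cv::dilate(fgMask, fgMask, cv::Mat());

            cv::imshow("foreground", fgMask);
            if (cv::waitKey(30) == 27)   // stop on ESC
                break;
        }
        return 0;
    }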

These solutions can give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object, as for example in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in a change across the whole image. In this case, the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task since the camera will be mounted on a moving platform.

2212 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filters technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 24 Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].
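A short, hedged sketch of this comparison step using OpenCV's matchTemplate with normalized cross-correlation; the images and the chosen comparison method are only illustrative.

#include <opencv2/opencv.hpp>

// Returns the top-left corner of the best match of templ inside image.
cv::Point matchByCorrelation(const cv::Mat& image, const cv::Mat& templ) {
    cv::Mat response;
    cv::matchTemplate(image, templ, response, cv::TM_CCORR_NORMED);  // correlation for every (u, v) shift
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);
    return maxLoc;                                                   // highest correlation = assumed object position
}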

See Figure 24 for an example input template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross-correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross-correlation for faster comparison, and [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.

222 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm arefeature extraction and feature classification The extraction method determineswhat type of features will be the base of the classifier and how they are extractedThe classification part is responsible for the machine learning method In otherwords how to decide if the extracted feature array (extracted by the featureextraction method described above) represents an object of interest

Since the latter part (defining the learning and classifier system) is less relatedto the vision studies this paper will only discuss a few learning methods brieflyBefore that some of the most famous and widely used feature descriptors arepresented

Feature extractors or descriptors are often grouped as well. [48] recognizes edge-based and patch-based features; [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2221, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2222, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2223, [53]).

Due to the limited scope of this paper it is not possible to give a deeperoverview about the features used in computer vision Instead three of the meth-ods will be presented here These were selected based on fame performanceon general object recognition and the amount of relevant literature examplesand implementation All of them are well-known and both the articles and themethods are part of the computer vision history

2221 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 222) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion about his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]

1 Scale-space extrema detection: the first stage searches over the image in multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2 Key-point localization: at each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3 Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4 Key-point descriptor: after selecting the interesting points, the gradients around them are calculated in a selected scale.

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility) but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of the general class. However, this is not a problem in the case of this project, since only one ground-robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.

2222 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of the features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 25 for an example of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of the pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
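To illustrate why the integral image makes this fast, the sketch below evaluates a two-rectangle feature with four array look-ups per rectangle; the rectangle coordinates are hypothetical and OpenCV is used only to compute the integral image itself.

#include <opencv2/opencv.hpp>

// Sum of pixel intensities inside r, read from an integral image ii of size (rows+1) x (cols+1).
static int rectangleSum(const cv::Mat& ii, const cv::Rect& r) {
    return ii.at<int>(r.y, r.x)
         + ii.at<int>(r.y + r.height, r.x + r.width)
         - ii.at<int>(r.y, r.x + r.width)
         - ii.at<int>(r.y + r.height, r.x);
}

// Two-rectangle Haar-like feature: difference of the sums of two adjacent regions.
int twoRectangleFeature(const cv::Mat& gray, const cv::Rect& left, const cv::Rect& right) {
    cv::Mat ii;
    cv::integral(gray, ii);                       // in practice computed once per image and reused
    return rectangleSum(ii, left) - rectangleSum(ii, right);
}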

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and inspired several researches: [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Figure 25 Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves; the bottom row presents the features overlayed on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

Resulting from their nature, Haar-like features are suitable to find patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond their original task, Haar-like features are used widely in computer vision for various purposes: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation-variance, which, in spite of several attempts (eg [54]), is still not solved completely.

2223 HOG features

The abbreviation Hog stands for Histogram of Oriented Gradients Unlike theHaar features (2222) Hog aims to gather information from the gradient image

The technique calculates the gradient of every pixel and produces a histogram of the different orientations (called bins) after quantization, in small portions (called cells) of the image. After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different since not only sharp gradient changes (known as edges) are considered. The number of optimal bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 26 Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference in the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2221 and [31]). However, it is not normalized with respect to the orientation; thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in a superior robustness against illumination changes over SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
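As a hedged illustration of a HOG plus trained classifier pipeline (using OpenCV's HOGDescriptor and its bundled pedestrian SVM in the spirit of [53], not the detector developed in this thesis; the file names are placeholders):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("street.png");                       // placeholder test image
    cv::HOGDescriptor hog;                                        // default: 64x128 window, 8x8 cells, 9 bins
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());
    std::vector<cv::Rect> found;
    hog.detectMultiScale(img, found);                             // sliding window over an image pyramid
    for (size_t i = 0; i < found.size(); ++i)
        cv::rectangle(img, found[i], cv::Scalar(0, 255, 0), 2);   // draw the detections
    cv::imwrite("detections.png", img);
    return 0;
}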

2224 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1 to N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given a test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) · P(c)

is ideally 1 if x contains an instance of class c and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).

Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 26 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), the discriminative methods usually perform better for single class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires only one class to be detected, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2225) and support vector machines (SVM, [64], see sub-subsection 2226).

2225 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2222) is over 160,000 in case of a 24×24 pixel window (see subsection 432) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifiers only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45] [65]

• Initialize N · T weights wt,i, where N is the number of training examples and T is the number of features in the stage

• For t = 1, . . . , T:

1 Normalize the weights

2 Select the best classifier using only a single feature, by minimising the detection error εt = Σi wi · |h(xi, f, p, θ) − yi|, where h(xi) is the classifier output and yi is the correct label (both with a range of 0 for negative and 1 for positive)

3 Define ht(x) = h(x, ft, pt, θt), where ft, pt and θt are the minimizers of the error above

4 Update the weights: wt+1,i = wt,i · (εt / (1 − εt))^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise

• The final classifier for the stage is based on the sum of the weak classifiers (a common formulation is shown below)
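For completeness, in the formulation of [45] this final stage classifier is a weighted majority vote of the selected weak classifiers, which can be written as

C(x) =
\begin{cases}
1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\
0 & \text{otherwise}
\end{cases}
\qquad \text{with } \alpha_t = \log\frac{1-\varepsilon_t}{\varepsilon_t}.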

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2222).

2226 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the features of the method were known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the set of features extracted from the training and test data samples as vectors in the feature space. Then it constructs one or more hyperplanes (a linear decision surface) in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The closest vectors to the hyperplane (and, due to that, the ones determining the width of the margin) are called support vectors, hence the name of the method. See figure 27 for an example of a separable problem, the chosen hyperplane and the margin.
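These two constraints are commonly written as the following optimisation problem (a standard textbook formulation rather than a quotation from [64]), with class labels yi ∈ {−1, +1}:

\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \, (w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N

The width of the margin is 2/‖w‖, so minimising ‖w‖ maximises the margin; the support vectors are exactly the xi for which the constraint holds with equality.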

The algorithm was extended to non-linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.

Figure 27 Example of a separable problem in 2D The support vectors aremarked with grey squares They define the margin of the largest separation betweenthe two classes Source [64]

Although support vector machines were designed for binary classification, several methods were proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].

Chapter 3

Development

In this chapter the circumstances of the development which influenced the research, especially the objectives defined in Section 13, will be presented. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

31 Hardware resources

311 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 322) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this boardlater depending on the available processing units and the achieved communicationbandwidth between the vehicles and the ground station

312 Sensors

This subsection covers the sensors integrated on-board the UAV, including the flight controller, the camera and the laser range finder.

Figure 31 image of Pixhawk flight controller Source [70]

3121 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. On figure 31 the unit is shown with optional accessories.

Also Pixhawk will be used as an Inertial Measurement Unit (IMU) for build-ing 3D maps This means less weight (since no additional sensor is needed) andbrings integrity both in hardware and software Although the Pixhawk is not adedicated IMU its built-in sensors are very accurate since they are used to sta-bilize multi-rotors during flight Therefore it was suitable to use it for mappingpurposes See 44 for details of its application

3122 Camera

Cameras are optical sensors capable of recording images They are the mostfrequently used sensors for object detection due to the fact that they are easy tointegrate and provide significant amount of data

Also another big advantage of these sensors is the fact that they are availablein several sizes with different resolutions and other features for a relatively cheapprice

Nowadays consumer UAVs are often equipped by some kind of action camerausually to provide an aerial video platform These sensors are suitable for theaim of the project as well resulting from their light weight and wide angle fieldof view

Figure 32 The chosen LIDAR sensor Hokuyo UTM-30LX It is a 2D rangescanner with a maximum range of 60 m and 3-5 cm accuracy Source [71]

3123 LiDar

The other most important sensor used in this project is the lidar These sensorsare designed for remote sensing and distance measurement The basic conceptof these sensors is to do this by illuminating objects with a laser beam Thereflected light is analysed and the distances are determined

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

32 Chosen software

321 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

322 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed to help developing unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3123) and the flight controller (3121) have ready-to-use drivers available.

323 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (eg load, write, resize, rotate image), image processing tools (eg different kinds of edge detection, threshold methods) and advanced machine learning based solutions (eg feature extractors and training tools) [72]. Due to its high popularity, examples and support are widely available.

In this project it was used to handle inputs (either from a file or from a cameraattached to the laptop) since Dlib the main library chosen has no functions toread inputs like that

324 Dlib

Dlib is a general purpose cross-platform C++ library. It has an excellent machine learning library with the aim of providing a machine learning software development toolkit for the C++ language (see figure 33 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

Figure 33 Elements of Dlib's machine learning toolkit. Source: [73]

It is worth mentioning that, aside from the features above, Dlib also contains components to handle linear algebra, threading, network IO, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].

Chapter 4

Designing and implementing the algorithm

In this chapter, challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given to the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced with example images and the mathematical calculations.

41 Challenges in the task

The biggest challenges of the task are listed below.

1 Unique object: maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus the solution of this task has to be completely self-made and new.

2 Limited resources: since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3123) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: a small and light hardware was needed. While the Nitrogen6x board (311) would fulfil these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the designing and assigning of processing methods.

3 Moving platform: no matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. Third, no background-foreground separation algorithms are available for the camera; see subsection 2211 for details.

4 Various viewpoints of the object: usually in the field of computer vision the point of view of the required object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5 Various size of the object: similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6 Speed requirements: the algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm specially designed for the task. Thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot with at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.

Figure 41 Diagram of the designed architecture. The blocks include the sensors (camera and 2D lidar), the video reader, the trainer with the front and side SVMs and the Vatic annotation server, the regions of interest and other preprocessing (edge and colour detection), the 3D map, the detector algorithm, tracking, the detections and the evaluation. The aim of the figure is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system; arrows represent the dependencies and the direction of the information flow.

42 Architecture of the detection system

In the previous chapters the project (11) and the objectives of this thesis (13) were introduced. The advantages and disadvantages of the different available sensors were presented (312). Some of the most often used feature extraction and classification methods (222) were examined with respect to their suitability for the project. Finally, in subsection 41 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed which is suitable for the objectives defined. See figure 41 for an overview of the design. The main idea of the structure is to compensate for the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1 Training algorithm: as explained in subsection 222, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used. Neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. These productions have to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 431 for details.

2 Sensor inputs: sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 312). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are handled in the architecture already, with the help of the Robotic Operating System (322). This unified interface will make it possible to integrate further sensors (eg infra-red cameras, ultrasound sensors, etc) later with ease.

Resulting from the scope of the project, the vehicle itself is not ready, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system can't tell that it is not a live video feed from a camera).

3 Region of interest estimators and tracking: recognized as two of the most important challenges (see 41), speed and the limitations of the available resources were a great concern. Thus methods which could reduce the amount of areas to process, and as such increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made on the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 43.

4 Evaluation: to make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 511.

5 Detector: the detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

43 2D image processing methods

In this section the concerns, the chosen methods and the process of the development of the 2D camera image processing will be presented.

431 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 22. In section 32 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated they have to be considered togethersince a good software library with image handling or even machine learningalgorithms implemented can save a lot of time during the development

Considering the reviewed methods the Histogram of Oriented Gradients (HOGsee sub-subsection 2223) was chosen as the main feature descriptor The reasonsfor this choice are discussed here

Haar-like features are suitable to find patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance of the relative positions of these patches (eg different view angles) can confuse the detector as well. Also, using OpenCV's (323) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2221) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable to recognise a concrete object instead of the general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also because of moving parts, like rotating and steering wheels, for example. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2223) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in a robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (eg a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are predicted (see 4). This means that even if the camera is levelled, the object itself could seem rotated, caused by the perspective. To overcome this issue, solutions will be presented in this subsection and in 434.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was searched for the extraction of HOG features. Finally, the Dlib library (324) was chosen, since it includes an excellent HOG feature extractor with serious optimization, along with training tools (discussed in detail later) and tracking features (see subsection 434). Also, several examples and documentation are available for both training and detection, which simplified the implementation.

As classifiers, support vector machines (SVM, see sub-subsection 2226) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. Also, they are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might get into the negative training images otherwise.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered during the training process already. If the first tests do not show any unexpected behaviours, the finished SVM is saved to the disk with serialization.
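The core of such a training tool, in a hedged form, follows Dlib's own fhog object detector example; the XML format is the one handled by dlib::load_image_dataset, and the file names, detection window size and SVM C value below are placeholders rather than the values used in the thesis software.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>
#include <iostream>

int main() {
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    // Images and object boxes listed in the xml file described above.
    dlib::array<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    load_image_dataset(images, boxes, "training.xml");          // placeholder file name

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);                   // illustrative window size

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                                            // SVM regularisation, tuned per dataset

    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Quick sanity check on an annotated set (here the training set itself), then serialize.
    std::cout << test_object_detection_function(detector, images, boxes) << std::endl;
    serialize("groundrobotside.svm") << detector;                // placeholder output name
    return 0;
}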

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical horizontally and vertically as well. In other terms, the vehicle looks almost exactly the same from the left and from the right, while the front and rear views are also very similar. This is especially true if the working principle of the HOG (2223) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front or the rear, and while the side-views are similar, they are mirrored.

The second advantage of these classifiers is very important in the future of the project. Due to the way the SVMs are trained, they not only can detect the position of the robot, but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 434). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), therefore the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although training an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

On figure 42 a comparison of training images and the produced HOG detectors is shown. 42(a) is a typical training image for the side-view HOG detector. Notice that only the side of the robot is visible, since the camera was held very low. 42(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize thanks to their strong edges. On 42(c) a training image is displayed from the front-view detector training image set. 42(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers seem to be inefficient, training additional ones is easy to do with the training software presented here. Also, including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 435 for more details.

432 Sliding window method

A very important property of these trained methods (222) is that usually their training datasets (the images of the robot or any other object) are cropped, only containing the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.

Figure 42 (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front detector visualized. Notice the strong lines around the typical edges of the training images.

Figure 43 Representation of the sliding window method

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with predefined step-sizes and scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 43 for a representation; a schematic code sketch is also shown below.
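For illustration only (Dlib's scan_fhog_pyramid performs this scanning internally), a generic sliding window could look like the sketch below; the window size, step and scale factor are arbitrary example values, and `classify` stands for any trained detector working on fixed-size patches.

#include <opencv2/opencv.hpp>
#include <vector>

// Run `classify` on every window position over several scales and collect the hits,
// mapped back to the coordinate system of the original image.
template <typename Classifier>
std::vector<cv::Rect> slidingWindow(const cv::Mat& input, Classifier classify,
                                    cv::Size window = cv::Size(96, 64),
                                    int step = 8, double scaleFactor = 1.25) {
    std::vector<cv::Rect> hits;
    cv::Mat img = input.clone();
    double scale = 1.0;
    while (img.cols >= window.width && img.rows >= window.height) {
        for (int y = 0; y + window.height <= img.rows; y += step)
            for (int x = 0; x + window.width <= img.cols; x += step)
                if (classify(img(cv::Rect(x, y, window.width, window.height))))
                    hits.push_back(cv::Rect(cvRound(x * scale), cvRound(y * scale),
                                            cvRound(window.width * scale),
                                            cvRound(window.height * scale)));
        scale *= scaleFactor;                              // a smaller image corresponds to a larger window
        cv::resize(input, img, cv::Size(), 1.0 / scale, 1.0 / scale);
    }
    return hits;
}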

It is worth mentioning that it is possible that multiple instances of the sought object are present on the image. For face or pedestrian detection algorithms, for example, this is a potential situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

433 Pre-filtering

As listed in section 41, two of the most important challenges are speed and the limitations of the available resources. In subsection 432 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (eg positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.

Thus methods which could reduce the amount of areas to process, and as such increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. Also, this practice makes it possible to swap the currently used ROI detector module with another or a newly developed one for further development.

Based on intuition, a good feature for separation is the colour: anything which does not have the same colour as the ground vehicle should be ignored. To do this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good base of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor, which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. On figure 44 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus these areas are still scanned. A possible form of such a filter is sketched below.
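A hedged sketch of an edge-density based region-of-interest proposal (the Canny thresholds, the dilation and the minimum area are illustrative values, not the tuned parameters of the project):

#include <opencv2/opencv.hpp>
#include <vector>

// Propose regions that contain "enough" edges; the expensive detector is then run only inside them.
std::vector<cv::Rect> edgeRegionsOfInterest(const cv::Mat& frame) {
    cv::Mat gray, edges;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 80, 160);
    cv::dilate(edges, edges, cv::Mat(), cv::Point(-1, -1), 3);   // merge nearby edges into blobs

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> rois;
    for (size_t i = 0; i < contours.size(); ++i) {
        cv::Rect box = cv::boundingRect(contours[i]);
        if (box.area() > 2000)                                    // drop tiny, noisy regions
            rois.push_back(box);
    }
    return rois;
}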

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 435 for details.

434 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (433) the idea of filtering by prior detections was presented. Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Figure 44 The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing an own 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (324) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area that needs to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 435 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
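A minimal, hedged example of Dlib's correlation tracker being initialized from one detection box and updated on later frames; the video source and the initial rectangle are placeholders for the output of the SVM detector.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("testVideo1.avi");                  // placeholder video source
    cv::Mat frame;
    if (!cap.read(frame)) return 1;

    dlib::cv_image<dlib::bgr_pixel> first(frame);
    dlib::rectangle detection(100, 100, 260, 220);           // placeholder: a box returned by the detector
    dlib::correlation_tracker tracker;
    tracker.start_track(first, detection);

    while (cap.read(frame)) {
        dlib::cv_image<dlib::bgr_pixel> img(frame);
        tracker.update(img);                                  // follow the same patch on the new frame
        dlib::drectangle pos = tracker.get_position();
        cv::rectangle(frame, cv::Rect((int)pos.left(), (int)pos.top(),
                                      (int)pos.width(), (int)pos.height()),
                      cv::Scalar(0, 0, 255), 2);
        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;
    }
    return 0;
}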

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimation of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real time mapping is not ready yet, thus this feature has yet to be implemented.

435 Implemented detector

In subsection 431 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 432, 433 and 434.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (eg using 2 SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or video which is read frame byframe The output of the algorithm is one or more rectangles bounding the areasbelieved to contain the sought object These are often called detection boxeshits or simply detections

During the development four different approaches were implemented All ofthem are based on the previous ones (incremental improvements) but all broughta new idea to the detector which made it faster (reach higher frame-rates see512) more accurate (see 511) or both The methods are listed below

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as separate software. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from switching between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.

Table 4.1: Table of the available parameters

Name | Valid values | Function
input | path to video | video used as input for detection
svm | path to SVMs | these SVMs will be used
mode | [1, 2, 3, 4] | selects which mode is used
saveFrames | [0, 1] | turns on video frame export
saveDetections | [0, 1] | turns on detection box export
saveFPS | [0, 1] | turns on frame-rate measurement
displayVideo | [0, 1] | turns on video display
DetectionsFileName | string | sets the filename for saved detections
FramesFolderName | string | sets the folder name used for saving video frames
numberOfLoops | integer (> 0) | sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarises all the implemented parameters with their possible values and a short description.

Note that all these parameters have default values, thus it is possible but not compulsory to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections) nor the video frames (saveFrames), and the video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.
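
As a rough illustration of how such a whitespace-separated parameter file might be parsed, a minimal sketch is given below; the key names follow table 4.1, while the function and struct names are illustrative and not the project's actual code.

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct DetectorParams {
    std::string input;                       // input video path
    std::vector<std::string> svms;           // one or more classifier files
    int  mode = 1;                           // default values, as noted above
    int  numberOfLoops = 1;
    bool saveFrames = false, saveDetections = false, saveFPS = false, displayVideo = true;
};

DetectorParams loadParams(const std::string& fileName)
{
    DetectorParams p;
    std::ifstream file(fileName);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream ss(line);
        std::string key;
        if (!(ss >> key)) continue;          // skip empty lines
        if      (key == "input")          ss >> p.input;
        else if (key == "svm")            { std::string s; while (ss >> s) p.svms.push_back(s); }
        else if (key == "mode")           ss >> p.mode;
        else if (key == "numberOfLoops")  ss >> p.numberOfLoops;
        else if (key == "saveFrames")     ss >> p.saveFrames;
        else if (key == "saveDetections") ss >> p.saveDetections;
        else if (key == "saveFPS")        ss >> p.saveFPS;
        else if (key == "displayVideo")   ss >> p.displayVideo;
        // unrecognised lines (e.g. free-text comments) are simply ignored
    }
    return p;
}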

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.4), both "slide" across the input image. Each returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as input). On the other hand, this exhaustive search is computationally heavy, especially with two classifiers. This results in the lowest frames-per-second rate of all the methods.
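
The following sketch illustrates this kind of scanning with Dlib FHOG/SVM detectors; the classifier file names follow the example parameter file above, while the rest (variable names, video source) is illustrative rather than the project's exact code.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;
typedef dlib::object_detector<image_scanner_type> fhog_detector;

int main()
{
    // Load every classifier listed in the parameter file (two in this project).
    std::vector<fhog_detector> detectors(2);
    dlib::deserialize("groundrobotfront.svm") >> detectors[0];
    dlib::deserialize("groundrobotside.svm")  >> detectors[1];

    cv::VideoCapture cap("testVideo1.avi");
    cv::Mat frame;
    while (cap.read(frame))
    {
        dlib::cv_image<dlib::bgr_pixel> img(frame);

        // Mode 1: slide every classifier over the whole frame, at multiple scales.
        std::vector<dlib::rectangle> hits;
        for (auto& det : detectors)
        {
            std::vector<dlib::rectangle> d = det(img);
            hits.insert(hits.end(), d.begin(), d.end());
        }
        // 'hits' now holds all detection boxes of this frame (no further filtering).
    }
    return 0;
}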

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained: as can be seen on figure 4.2, one of them represents the front/rear view, while the other one was trained for the side-view of the robot.

The idea of mode 2 is based on the assumption that it is very rare and usually unnecessary for these two classifiers to detect something at the same time, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but due to the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find the robot twice).

Therefore, it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the consecutive frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again, until one of them finds the object. In every other aspect, mode 2 is very similar to mode 1 presented above.

The limit was introduced because the viewpoint is very unlikely to change so drastically between two frames that the other detector would have a better chance to detect the robot. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in much faster processing, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
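
A minimal sketch of this selection logic is shown below; the tolerance value and the helper names are illustrative, and fhog_detector is the same typedef as in the mode 1 sketch.

// Sketch of the mode 2 classifier-selection memory (illustrative names and values):
typedef dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6>>> fhog_detector;

int activeDetector = -1;      // -1 means "no detector chosen yet": use all of them
int missedFrames   = 0;
const int toleranceLimit = 5; // frames tolerated without detection (example value)

std::vector<dlib::rectangle> detectMode2(dlib::cv_image<dlib::bgr_pixel>& img,
                                         std::vector<fhog_detector>& detectors)
{
    std::vector<dlib::rectangle> hits;
    if (activeDetector >= 0)
    {
        // Only the detector that found the robot last time is used.
        hits = detectors[activeDetector](img);
        if (hits.empty() && ++missedFrames > toleranceLimit)
            activeDetector = -1;               // too many misses: fall back to all classifiers
    }
    if (activeDetector < 0)
    {
        for (size_t i = 0; i < detectors.size() && hits.empty(); ++i)
        {
            hits = detectors[i](img);
            if (!hits.empty())
                activeDetector = static_cast<int>(i);   // remember the successful classifier
        }
    }
    if (!hits.empty())
        missedFrames = 0;
    return hits;
}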

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the image area searched was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. The possible movement of the camera is involved as well: for example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection alongside the detector which returned it. A new rectangle, named ROI (region of interest), was included, which determines in which area the detectors should search.

Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest; the blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

The rectangle is initialized with the size of the full image (in other words, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, due to the reasons mentioned above (movement of the camera and the object). Therefore, the ROI is grown by a percentage of its original size (set by a variable, 50% by default) and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returned detections inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction anymore. Eventually, after a series of missed detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle shows that the side-view detector found something there (in this case, it correctly recognized the robot). On the next frame, the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.

Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", due to the modular architecture introduced in 4.2.
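
The ROI bookkeeping described above could look roughly like the following sketch (OpenCV types; the 50% and 3% growth factors are the defaults mentioned above, everything else is illustrative).

#include <opencv2/core.hpp>

// Enlarge a rectangle around its centre by 'ratio' of its own size,
// then clamp it to the image borders so it cannot grow outside.
cv::Rect growRect(const cv::Rect& r, double ratio, const cv::Size& imgSize)
{
    int dx = static_cast<int>(r.width  * ratio / 2.0);
    int dy = static_cast<int>(r.height * ratio / 2.0);
    cv::Rect grown(r.x - dx, r.y - dy, r.width + 2 * dx, r.height + 2 * dy);
    return grown & cv::Rect(0, 0, imgSize.width, imgSize.height);
}

struct RoiState {
    cv::Rect roi;   // starts as the full image

    void init(const cv::Size& imgSize)  { roi = cv::Rect(0, 0, imgSize.width, imgSize.height); }

    void onDetection(const cv::Rect& detection, const cv::Size& imgSize) {
        roi = growRect(detection, 0.50, imgSize);   // last known position, enlarged by 50 %
    }
    void onMiss(const cv::Size& imgSize) {
        roi = growRect(roi, 0.03, imgSize);         // keep enlarging by 3 % until full image
    }
};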

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm, based on the very recent research presented in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, in appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same view-point.

On the other hand, tracking can be misleading as well. If the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot rather than only the detected side (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of consecutive failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards, the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See figure 4.7(a) for a representation of the processing method of mode 4.
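
Putting the pieces together, the per-frame logic of mode 4 can be sketched as follows; the validation interval and tolerance are illustrative values, fhog_detector is the typedef from the mode 1 sketch, and the ROI cropping around the tracked box is omitted for brevity.

// Rough per-frame skeleton of mode 4 (illustrative, not the project's exact code):
bool tracked = false;
int framesSinceValidation = 0, failedValidations = 0;
const int validationInterval = 10, lostTolerance = 5;   // example values

void processFrame(dlib::cv_image<dlib::bgr_pixel>& img,
                  std::vector<fhog_detector>& detectors,
                  dlib::correlation_tracker& tracker)
{
    if (!tracked)
    {
        // Not tracking yet (or object lost): fall back to mode 3 style scanning.
        for (auto& det : detectors)
        {
            std::vector<dlib::rectangle> hits = det(img);
            if (!hits.empty()) { tracker.start_track(img, hits[0]); tracked = true; break; }
        }
        return;
    }

    tracker.update(img);   // cheap: follow the robot without any sliding window

    if (++framesSinceValidation >= validationInterval)
    {
        framesSinceValidation = 0;
        // Periodic validation: check the area around the tracked box with the classifier(s).
        std::vector<dlib::rectangle> hits = detectors[0](img);
        if (!hits.empty())
        {
            failedValidations = 0;
            tracker.start_track(img, hits[0]);   // re-initialise the tracker on the detection
        }
        else if (++failedValidations > lostTolerance)
        {
            tracked = false;                     // object labelled as lost
        }
    }
}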

Figure 4.7: (a) Example output frame of mode 4: the red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area, and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.

4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement will have a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way, the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations:

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. It is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.

5. Yaw angle of the laser ray: As mentioned before, the sensor operates with one single light source rotated around its vertical (capital Z) axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space, defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (from 135° to −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the Lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and it is marked in blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself. On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experimental set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the Lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for Lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification. The file name is user defined.

Since the Lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this has to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.

Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.

Figure 4.9: Example representation of the output of the Lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].

However, since the two systems were not connected, the timestamps had to be synchronized manually.
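
A sketch of this offline association is given below: each lidar scan is paired with the orientation sample nearest in time, assuming both logs have already been loaded, sorted by timestamp and corrected for the manually determined clock offset (the data structures and function name are illustrative).

#include <algorithm>
#include <cmath>
#include <vector>

struct ImuSample { double t; double pitch; };   // timestamp [s], pitch [rad]
struct LidarScan { double t; /* ranges and yaw angles of the scan ... */ };

// Return the pitch of the IMU sample closest in time to the given scan.
// 'imu' is assumed to be non-empty and sorted by timestamp.
double pitchForScan(const LidarScan& scan, const std::vector<ImuSample>& imu)
{
    auto it = std::lower_bound(imu.begin(), imu.end(), scan.t,
                               [](const ImuSample& s, double t) { return s.t < t; });
    if (it == imu.begin()) return it->pitch;
    if (it == imu.end())   return (it - 1)->pitch;
    const ImuSample& after  = *it;
    const ImuSample& before = *(it - 1);
    return (std::abs(after.t - scan.t) < std::abs(scan.t - before.t)) ? after.pitch
                                                                      : before.pitch;
}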

On figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar, and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar; thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel with the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The Z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface.

• roll is the rotation around the y axis.

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons arose for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus, the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integration both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszár managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robot Operating System (3.2.2). This way the two types of sensors were recorded with the same software.

Figure 4.11: (a) Picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints ((b) elevation view of the produced 3D map, (c) the produced 3D map). Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6 and 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.

Thus, the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted together with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions: no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z (x = y × z), pointing to the right;

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward;

• the z axis is pointing up and is perpendicular to the ground.

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three values represent its position along the x, y, z axes; the other three describe its orientation: yaw, pitch and roll.

First, the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change; thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory, the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen on figure 4.8, part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation is calculated from dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)

where dy is the translation required along the y axis and dz is the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8, part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

[x, y, z]^T = distance · [sin(−yaw), cos(yaw) · sin(pitch), cos(yaw) · cos(pitch)]^T + offset · [0, sin(pitch), cos(pitch)]^T    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work, such as SLAM (simultaneous localization and mapping) or the ground robot detection.
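
For reference, equation 4.6 translates directly into a few lines of code; the sketch below (illustrative names, not the project's actual implementation) converts one measured point, given the pitch of the scan plane and the mounting offset, into the ground-fixed frame.

#include <cmath>

struct Point3D { double x, y, z; };

// Transform one lidar measurement into the ground-fixed frame (equation 4.6).
// distance, yaw: polar coordinates of the point in the lidar's scan plane;
// pitch: tilt of the scan plane; offset: light source to rotation axis distance.
Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
{
    Point3D p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}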


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system, to track the improvements of the code and to check whether it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample although it is not one; in other words, the system believes the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (is rejected) although it should be. In this current task, a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions of the detection method are loosened, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it can be more important to find every possible desired object than to avoid occasional mistakes. Such a project can be a manually supervised classification, where false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-cautious and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project, the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset whose samples are already labelled (fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.

Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information passed between frames, such as previous positions or a ROI. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable to frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free-to-use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player, which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.

The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around each one) on every frame of the videos (there are options to label an object as occluded, obstructed or outside of the frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames, the only task remaining is to correct the interpolations between them if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way, large datasets with several videos are easy to build at relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned to the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges. To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every unnecessary process was terminated, so that their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test due to the operating system did not influence the results significantly.

To analyse the exported processing times, a Matlab script was implemented. This tool can load the log files from the tests and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during execution by definition, which results in a varying processing time; the variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average elapsed seconds: 0.032446
Variance of elapsed seconds per frame:
  between video loops: 8.0297e-07
  across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS

The changes of the average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detectors over it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that 3D lidar scanners exist, but they still scan in a plane and have the mentioned processing built-in; also, they are even more expensive than the 2D versions.

Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors cannot be obtained with cameras; it contains valuable information for a project like this and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. thresholding based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos of all modes can be found here.1 However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

1 users.itk.ppke.hu/~palan1/CranfieldVideos

Figure 5.3: Examples of the 3D images built: (a) example result of the 3D scan of the ground robot; (b) example of the 'shadow' of the ground robot. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.

The latter can be measured by several values; here recall and precision will be used. Recall is defined by

Recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

Precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different role in validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, the precision of the algorithm can be improved at the cost of more false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on position and overlapping area. For the latter, a minimum of 50% was defined (the ratio of the area of the intersection and the union of the two rectangles).
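
This overlap criterion corresponds to the common intersection-over-union test; a minimal sketch (OpenCV types, 0.5 threshold as stated above) is given below.

#include <opencv2/core.hpp>

// Returns true if a detection box matches an annotated box with at least 50 % overlap
// (area of intersection divided by area of union).
bool isMatch(const cv::Rect& detection, const cv::Rect& annotation)
{
    double interArea = (detection & annotation).area();                  // intersection
    double unionArea = detection.area() + annotation.area() - interArea; // union
    return unionArea > 0 && (interArea / unionArea) >= 0.5;
}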

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.

Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames per second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode | Recall | Precision | FPS | Variance
mode 1 | 0.623 | 1 | 4.2599 | 0.00000123
mode 2 | 0.622 | 1 | 6.509 | 0.0029557
mode 3 | 0.645 | 1 | 12.06 | 0.0070877
mode 4 | 0.955 | 0.898 | 30.82 | 0.0021144

In conclusion, it can be seen that mode 4 outperformed all the others, both in detection rate (95% recall) and speed (30 FPS average frame-rate), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased: as mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).

Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper, an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples of application fields were shown. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the development environment was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend, or to replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 summarised the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software along with the produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features, like exporting detections, demonstration videos and frame-rate measurements, were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of regions of interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace this with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well, if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by detectors.

Special attention was given to the evaluation of the system. Two pieces of software were developed as additional tools: one for evaluating the efficiency of the detections and another for analysing the processing times and frame-rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95%, with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two-dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 summarises the progress made related to 3D mapping, along with the applied mathematics. An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. thresholding based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

In conclusion, it can be said that the objectives defined in 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely frontal views), will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangle which marks the estimated position of the object, the tracker returns a confidence value, which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which have "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.

Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects of more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so that corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time; theoretically, it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned, ready-to-deploy 3D mapping system.

Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.

References

[1] "Overview, senseFly." [Online]. Available: https://www.sensefly.com/drones/overview.html [Accessed: 2015-08-08]

[2] H. Chao, Y. Cao and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144-3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine and J. How, "Hover, Transition and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1-43, 2007. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: http://www.questuav.com/news/fixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed: 2015-08-08]

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: http://store.dji.com/product/phantom-3-standard [Accessed: 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: http://militaryhistory.about.com/od/artillerysiegeweapons/p/v1.htm [Accessed: 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709-714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89-110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1-13. [Online]. Available: http://www.springerlink.com/content/t361252205328408/fulltext.pdf

[10] A. Jaimes, S. Kota and J. Gomez, "An approach to surveillance of an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600-2606.

[13] E. Semsch, M. Jakob, D. Pavlíček and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82-85.

[14] M. Israel, "A UAV-based roe deer fawn detection system," pp. 51-55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319-5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390-3416, 2012. [Online]. Available: http://www.mdpi.com/2072-4292/4/11/3390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1-15, 2013. [Online]. Available: http://link.springer.com/10.1007/s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135-145, 2011. [Online]. Available: http://www.ifp.uni-stuttgart.de/publications/phowo11/140Eisenbeiss.pdf

[20] C. Bills, J. Chen and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776-5783.

71

REFERENCES

[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566-1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57-84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575-579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173-178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089-2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181-198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426-430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694-701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701-708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1-53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827-891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2, SPEC. ISS., pp. 3-40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1-12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4, pp. 3099-3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147-176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847-1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189-196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1-5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95-102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951ampAgg=doi

[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358-372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137-154, 2001. [Online]. Available: httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139-2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441-466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404-417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143-151, 1997.

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886-893. [Online]. Available: httpdxdoiorg101109CVPR2005177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433

[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455-459.

[56] Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130-143, 2007.

[60] I. T. Jolliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvärinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26-36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119-139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818

[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241-245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed: 2015-08-10].

[70] "Pixhawk flight controller." [Online]. Available: httproverardupilotcomwikipixhawk-wiring-rover [Accessed: 2015-08-10].

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed: 2015-07-22].

[72] "OpenCV manual, Release 2.4.9." [Online]. Available: httpdocsopencvorgopencv2refmanpdf [Accessed: 2015-07-20].

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755-1758, 2009. [Online]. Available: httpjmlrcsailmitedupapersv10king09ahtml

[74] "dlib C++ Library." [Online]. Available: httpdlibnet [Accessed: 2015-07-21].

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: httparxivorgabs150200046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed: 2015-07-20].

[78] "Vatic - Video Annotation Tool - UC Irvine." [Online]. Available: httpwebmiteduvondrickvatic [Accessed: 2015-07-24].

[79] "Amazon Mechanical Turk." [Online]. Available: httpswwwmturkcommturkwelcome [Accessed: 2015-07-26].

Contents

List of Figures
Absztrakt
Abstract
List of Abbreviations

1 Introduction and project description
  1.1 Project description and requirements
  1.2 Type of vehicle
  1.3 Aims and objectives

2 Literature Review
  2.1 UAVs and applications
    2.1.1 Fixed-wing UAVs
    2.1.2 Rotary-wing UAVs
    2.1.3 Applications
  2.2 Object detection on conventional 2D images
    2.2.1 Classical detection methods
      2.2.1.1 Background subtraction
      2.2.1.2 Template matching algorithms
    2.2.2 Feature descriptors, classifiers and learning methods
      2.2.2.1 SIFT features
      2.2.2.2 Haar-like features
      2.2.2.3 HOG features
      2.2.2.4 Learning models in computer vision
      2.2.2.5 AdaBoost
      2.2.2.6 Support Vector Machine

3 Development
  3.1 Hardware resources
    3.1.1 Nitrogen board
    3.1.2 Sensors
      3.1.2.1 Pixhawk autopilot
      3.1.2.2 Camera
      3.1.2.3 LiDar
  3.2 Chosen software
    3.2.1 Matlab
    3.2.2 Robotic Operating System (ROS)
    3.2.3 OpenCV
    3.2.4 Dlib

4 Designing and implementing the algorithm
  4.1 Challenges in the task
  4.2 Architecture of the detection system
  4.3 2D image processing methods
    4.3.1 Chosen methods and the training algorithm
    4.3.2 Sliding window method
    4.3.3 Pre-filtering
    4.3.4 Tracking
    4.3.5 Implemented detector
      4.3.5.1 Mode 1: Sliding window with all the classifiers
      4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      4.3.5.4 Mode 4: Tracking based approach
  4.4 3D image processing methods
    4.4.1 3D recording method
    4.4.2 Android based recording set-up
    4.4.3 Final set-up with Pixhawk flight controller
    4.4.4 3D reconstruction

5 Results
  5.1 2D image detection results
    5.1.1 Evaluation
      5.1.1.1 Definition of True positive and negative
      5.1.1.2 Definition of False positive and negative
      5.1.1.3 Reducing number of errors
      5.1.1.4 Annotation and database building
    5.1.2 Frame-rate measurement and analysis
  5.2 3D image detection results
  5.3 Discussion of results

6 Conclusion and recommended future work
  6.1 Conclusion
  6.2 Recommended future work

References

List of Figures

1.1 Image of the ground robot
2.1 Fixed wing consumer drone
2.2 Example for consumer drones
2.3 Example for people detection with background subtraction
2.4 Example of template matching
2.5 2 example Haar-like features
2.6 Illustration of the discriminative and generative models
2.7 Example of a separable problem
3.1 Image of Pixhawk flight controller
3.2 The chosen LIDAR sensor Hokuyo UTM-30LX
3.3 Elements of Dlib's machine learning toolkit
4.1 A diagram of the designed architecture
4.2 Visualization of the trained HOG detectors
4.3 Representation of the sliding window method
4.4 Example image for the result of edge detection
4.5 Example of the detector's user interface
4.6 Presentation of mode 3
4.7 Mode 4 example output frames
4.8 Schematic figure to represent the 3D recording set-up
4.9 Example representation of the output of the Lidar Sensor
4.10 Screenshot of the Android application for Lidar recordings
4.11 Picture about the laboratory and the recording set-up
4.12 Presentation of the axes used in the android application
5.1 Figure of the Vatic user interface
5.2 Photo of the recording process
5.3 Example of the 3D images built

List of Abbreviations

SATM  School of Aerospace Technology and Manufacturing
UAV   Unmanned Aerial Vehicle
UAS   Unmanned Aerial System
UA    Unmanned Aircraft
UGV   Unmanned Ground Vehicle
HOG   Histogram of Oriented Gradients
RC    Radio Controlled
ROS   Robotic Operating System
IMU   Inertial Measurement Unit
DoF   Degree of Freedom
SLAM  Simultaneous Localization And Mapping
ROI   Region Of Interest
Vatic Video Annotation Tool from Irvine, California

Absztrakt

In this thesis a tracking system mounted on an unmanned aerial vehicle is presented, whose task is to detect indoor objects and whose main goal is to find the ground unit that will serve as a landing and recharging station.

First, the goals of the project and of the thesis are listed and detailed.

This is followed by a detailed literature review, which presents existing solutions to similar challenges. A short summary of unmanned aerial vehicles and their application fields is given, then the best-known object detection methods are presented. The critique discusses their advantages and disadvantages, with special attention to their applicability in the current project.

The next part covers the development environment, including the available software and hardware.

After presenting the challenges of the task, the design of a modular architecture is introduced, taking into account the goals, the resources and the problems encountered.

One of the most important modules of this architecture, the latest version of the detection algorithm, is also detailed in the following chapter, together with its capabilities, modes and user interface.

To measure the efficiency of this module, an evaluation environment was created, which is able to compute numerous metrics related to the detection. Both the environment and the metrics are detailed in the following chapter, followed by the results achieved by the latest algorithm.

Although this thesis mainly focuses on detection methods operating on conventional (2D) images, 3D imaging and processing methods were also considered. An experimental system was built which is able to create spectacular and precise 3D maps using a 2D laser scanner. Several recordings were made to test the solution, and these are presented together with the system.

Finally, the thesis is closed by a summary of the implemented methods and the results.

Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging of the UAV.

First the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that the most relevant object recognition methods are reviewed, along with their suitability for the discussed project.

Then the environment of the development is described, including the available software and hardware resources.

Afterwards the challenges of the task are collected and discussed. Considering the objectives, resources and the challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluating tools are introduced to analyse efficiency and speed.

The first version of the ground robot detection algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two-dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV, and these are presented in this paper as well.

Finally, all implemented methods and relevant results are summarized.

Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards a structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly in real time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required so the map would be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision to coordinate the process (for example, mapping a dangerous area which would be too far for real-time control). Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (plan a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-layer buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, therefore longer operation time. On the other hand, it cannot elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles closing its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture.

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable for the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in a faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As it can be seen, the scope of the project is extremely wide. To cover the whole, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors, or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution to this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First a short introduction of unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards a brief summary of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, Unmanned Aerial Vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

Figure 2.1: One of the most popular consumer level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurances using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long endurance, high altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly as the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.

Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation. Multirotors must individually adjust the thrust of each motor: if the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason for the fact that no large-scale multirotor is used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shortens operational time and increases the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increasing operational costs [4].

In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, as many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms as well [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example for this application field is the location of an earthquake or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs were admittedly proven useful is surveillance. [10] is an interesting example for this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks, for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers. They can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera using monocular SLAM (Simultaneous Localization and Mapping). See [20-22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23-25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples for this are [26-28].

However, cameras and laser range finders are often used together. [29, 30] for example manage to integrate the two sensors together and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles which aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31-33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them by:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch, as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it, and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As it can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Beside the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor to perform easy detection tasks or to pre-filter the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good conclusion), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example for background subtraction used for people detection. The red patches show the object in the segmented foreground. Source: [39]

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means fixed perspectives, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.
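To make the above concrete, the following minimal sketch illustrates the frame-differencing idea with a running-average background model, written with OpenCV (the library used later in this project). It is only an illustration under assumed settings: the camera index, the threshold and the learning rate are placeholder values, not parameters of any implementation discussed in this thesis.

// Minimal background subtraction sketch (OpenCV, C++): keep a running-average
// background model, subtract it from every frame and threshold the difference.
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture capture(0);              // any static camera stream
    cv::Mat frame, gray, background, diff, foreground;
    const double learningRate = 0.01;         // how fast the model adapts
    const double diffThreshold = 30.0;        // minimum intensity difference

    while (capture.read(frame))
    {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        gray.convertTo(gray, CV_32F);

        if (background.empty())
            gray.copyTo(background);          // first frame initialises the model

        // Foreground = |current frame - background model| > threshold
        cv::absdiff(gray, background, diff);
        cv::threshold(diff, foreground, diffThreshold, 255, cv::THRESH_BINARY);

        // Blend the current frame slowly into the model so that the
        // detection adapts to gradual changes in illumination.
        cv::accumulateWeighted(gray, background, learningRate);

        foreground.convertTo(foreground, CV_8U);
        cv::imshow("foreground mask", foreground);
        if (cv::waitKey(30) == 27) break;     // ESC quits
    }
    return 0;
}

More elaborate adaptive models, such as the mixture-of-Gaussians approach of [41], are available in OpenCV as ready-made background subtractor classes and follow the same input-output pattern.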

It has to be mentioned that not every moving (changing) patch is an object, as for example in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds up with the detection's time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in a change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filters technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

To perform the comparison, the template is shifted (u, v) discrete steps in the (x, y) directions respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.
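As an illustration of the shifting-and-comparing process described above, the short OpenCV sketch below slides a template over an image using normalized cross-correlation and reads out the best scoring position. The file names are placeholders, and the snippet is only a minimal example of the technique, not part of the detector developed later in this thesis.

// Template matching sketch (OpenCV, C++): slide the template over the image,
// score every (u, v) position with normalized cross-correlation and pick the
// global maximum as the detected location.
#include <opencv2/opencv.hpp>
#include <cstdio>

int main()
{
    cv::Mat image = cv::imread("scene.png", 0);       // grayscale test image
    cv::Mat templ = cv::imread("template.png", 0);    // pattern to look for
    cv::Mat response;

    // One similarity score per valid template position.
    cv::matchTemplate(image, templ, response, cv::TM_CCORR_NORMED);

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);

    printf("best match at (%d, %d) with score %.3f\n",
           maxLoc.x, maxLoc.y, maxVal);
    return 0;
}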

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.

2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to the vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well. [48] recognizes edge based and patch based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see subsection 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion about his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: The first stage searches over the image in multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: At each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable, based on the model, are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: After selecting the interesting points, the gradients around them are calculated at the selected scale.

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of the general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
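The sketch below illustrates how such keypoints could be extracted and matched with OpenCV's SIFT implementation, assuming the non-free module that ships it is available (the 2.4-style API is used here). The file names and the 0.75 ratio threshold are illustrative assumptions, not values taken from this project.

// SIFT keypoint matching sketch (OpenCV 2.4-style API, C++): detect keypoints
// and descriptors on a training image of the object and on a test frame, then
// keep only matches that pass Lowe's ratio test.
#include <opencv2/opencv.hpp>
#include <opencv2/nonfree/features2d.hpp>
#include <cstdio>

int main()
{
    cv::Mat object = cv::imread("ground_robot.png", 0);  // training image
    cv::Mat scene  = cv::imread("test_frame.png", 0);    // test image

    cv::SIFT sift;                                        // detector + extractor
    std::vector<cv::KeyPoint> objKp, sceneKp;
    cv::Mat objDesc, sceneDesc;
    sift(object, cv::noArray(), objKp, objDesc);
    sift(scene,  cv::noArray(), sceneKp, sceneDesc);

    // Brute-force matching with L2 distance; the two best candidates are kept
    // so ambiguous correspondences can be filtered by the ratio test.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knnMatches;
    matcher.knnMatch(objDesc, sceneDesc, knnMatches, 2);

    int good = 0;
    for (size_t i = 0; i < knnMatches.size(); ++i)
        if (knnMatches[i].size() == 2 &&
            knnMatches[i][0].distance < 0.75f * knnMatches[i][1].distance)
            ++good;

    printf("%d keypoint matches survived the ratio test\n", good);
    return 0;
}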

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two and three rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
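This corner arithmetic can be made concrete with a few lines of code. The sketch below (OpenCV, C++) evaluates one horizontal two-rectangle feature using four look-ups per rectangle; the rectangle coordinates and the input file are arbitrary example values, not taken from any detector used in this thesis.

// Worked example of the integral-image trick behind Haar-like features.
// The sum of any rectangle is read with four look-ups, so a two-rectangle
// feature costs only a handful of additions regardless of its size.
#include <opencv2/opencv.hpp>
#include <cstdio>

// Sum of pixel intensities inside r, using the integral image 'sum'
// (cv::integral produces a matrix one pixel larger than the input).
static double rectSum(const cv::Mat& sum, const cv::Rect& r)
{
    return sum.at<double>(r.y, r.x)
         + sum.at<double>(r.y + r.height, r.x + r.width)
         - sum.at<double>(r.y, r.x + r.width)
         - sum.at<double>(r.y + r.height, r.x);
}

int main()
{
    cv::Mat gray = cv::imread("face.png", 0);
    cv::Mat sum;
    cv::integral(gray, sum, CV_64F);

    // Horizontal two-rectangle feature: difference of the pixel sums of two
    // vertically adjacent rectangles, in the spirit of the "eyes versus
    // cheeks" feature shown in figure 2.5.
    cv::Rect top(10, 10, 40, 10);      // example coordinates only
    cv::Rect bottom(10, 20, 40, 10);
    double feature = rectSum(sum, bottom) - rectSum(sum, top);

    printf("two-rectangle feature value: %f\n", feature);
    return 0;
}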

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves, the bottom row presents the features overlayed on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and it inspired several researchers. [54] for example extended the Haar-like feature set with rotated rectangles, by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Despite the original task, Haar-like features are used widely in computer vision for various tasks: vehicle detection ([55]), hand gesture recognition ([56]) and pedestrian detection ([57]), just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation-variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2223 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2222), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and produces a histogram of the different orientations (called bins) after quantization, in small portions (called cells) of the image. After a location-based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since


Figure 26: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2221 and [31]). However, it is not normalized with respect to the orientation; thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
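
As an illustration of the core idea only, the sketch below accumulates 9-bin unsigned orientation histograms over 8×8-pixel cells of a grayscale buffer. The block normalization and bin interpolation performed by the full Dalal-Triggs descriptor are deliberately omitted, and this function is not the extractor used in the project.

#include <vector>
#include <cmath>
#include <cstdint>
#include <algorithm>

// Simplified HOG-style cell histograms: gradients are computed with central
// differences and voted (by magnitude) into 9 unsigned orientation bins per cell.
std::vector<float> cellHistograms(const std::vector<uint8_t>& img, int w, int h,
                                  int cell = 8, int bins = 9) {
    const float kPi = 3.14159265f;
    const int cx = w / cell, cy = h / cell;
    std::vector<float> hist(cx * cy * bins, 0.f);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            float gx = float(img[y * w + x + 1]) - float(img[y * w + x - 1]);
            float gy = float(img[(y + 1) * w + x]) - float(img[(y - 1) * w + x]);
            float mag = std::sqrt(gx * gx + gy * gy);
            float ang = std::atan2(gy, gx) * 180.f / kPi;   // -180..180 degrees
            if (ang < 0.f) ang += 180.f;                    // unsigned gradient: 0..180
            int bin = std::min(bins - 1, int(ang / (180.f / bins)));
            int ci = x / cell, cj = y / cell;
            if (ci < cx && cj < cy)
                hist[(cj * cx + ci) * bins + bin] += mag;   // vote weighted by magnitude
        }
    }
    return hist;
}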

2224 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1, . . . , N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much information as possible. Afterwards, given test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) P(c)

is ideally 1 if x contains an instance of class c and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the probability

P(c | θ, x),

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 26 for a representation of the direction of the flow of information in both models.
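
For completeness, the two quantities are connected by Bayes' rule (a standard identity, not a claim taken from [59]):

P(c | θ, x) = P(x | θ, c) P(c) / P(x) ∝ P(x | θ, c) P(c),

so a generative model of P(x | θ, c) together with the prior P(c) can, in principle, be turned into a posterior by normalizing over the classes, while a discriminative model estimates that posterior directly.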

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to detect, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2225) and support vector machines (SVM, [64], see sub-subsection 2226).

2225 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing computational expense caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2222) is over 160,000 in the case of a 24×24 pixel window (see subsection 432) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights w_{t,i}, where N = number of training examples and T = number of features in the stage.

• For t = 1, . . . , T:

1. Normalize the weights.


2. Select the best classifier using only a single feature, by minimising the weighted detection error

ε_t = Σ_i w_i |h(x_i, f, p, θ) − y_i|,

where h(x_i) is the classifier output and y_i is the correct label (both with a range of 0 for negative and 1 for positive).

3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

4. Update the weights:

w_{t+1,i} = w_{t,i} · (ε_t / (1 − ε_t))^{1 − e_i},

where e_i = 0 if example x_i was classified correctly and e_i = 1 otherwise.

• The final classifier of the stage is based on the (weighted) sum of the selected weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by the previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2222).
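
The listing below is a minimal, self-contained C++ sketch of one boosting round over a pool of decision stumps, following the weighted-error and weight-update rules quoted above. The Stump type and the candidate pool are illustrative placeholders; the full Viola-Jones training additionally searches the threshold and parity for every Haar feature.

#include <vector>
#include <numeric>
#include <cmath>

struct Stump { int feature; float threshold; int parity; };   // weak classifier

static int stumpPredict(const Stump& s, const std::vector<float>& sample) {
    return (s.parity * sample[s.feature] < s.parity * s.threshold) ? 1 : 0;
}

// X: N samples (feature vectors), y: labels in {0,1}, w: sample weights (updated
// in place). Returns the selected stump; alphaOut is its vote weight in the stage.
Stump adaboostRound(const std::vector<std::vector<float>>& X,
                    const std::vector<int>& y,
                    std::vector<double>& w,
                    const std::vector<Stump>& candidates,
                    double& alphaOut) {
    // 1. Normalize the weights so they form a distribution.
    double total = std::accumulate(w.begin(), w.end(), 0.0);
    for (double& wi : w) wi /= total;

    // 2. Pick the weak classifier with the smallest weighted error.
    Stump best = candidates.front();
    double bestErr = 1.0;
    for (const Stump& s : candidates) {
        double err = 0.0;
        for (std::size_t i = 0; i < X.size(); ++i)
            if (stumpPredict(s, X[i]) != y[i]) err += w[i];
        if (err < bestErr) { bestErr = err; best = s; }
    }

    // 3./4. Down-weight the correctly classified samples:
    //       w_i <- w_i * (eps / (1 - eps)) when sample i was classified correctly.
    double beta = bestErr / (1.0 - bestErr);
    for (std::size_t i = 0; i < X.size(); ++i)
        if (stumpPredict(best, X[i]) == y[i]) w[i] *= beta;

    alphaOut = std::log(1.0 / beta);   // contribution of this weak classifier to the stage sum
    return best;
}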

2226 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the components of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, to separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, to define the largest possible margin). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (which thereby determine the width of the margin) are called support vectors (hence the name of the method). See figure 27 for an example of a separable problem, the chosen hyperplane and the margin.
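
These two constraints correspond to the textbook hard-margin formulation (stated here for completeness, with labels y_i ∈ {−1, +1}):

minimize (1/2) ||w||²   subject to   y_i (w · x_i + b) ≥ 1,   i = 1, . . . , N,

where w is the normal vector of the hyperplane and b its offset. The resulting margin width is 2 / ||w||, and the training vectors that satisfy the constraint with equality are exactly the support vectors.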

The algorithm was extended to tasks that are not linearly separable in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher-dimensional space, where the classes become linearly separable and the method above can be used.


Figure 27: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or the "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].


Chapter 3

Development

In this chapter the circumstances of the development which influenced the research will be presented, especially with regard to the objectives defined in Section 13. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

31 Hardware resources

311 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single-board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robot Operating System (subsection 322) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

312 Sensors

This subsection summarizes the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 31: Image of the Pixhawk flight controller. Source: [70]

3121 Pixhawk autopilot

Pixhawk is an open-source, well-supported autopilot system manufactured by the 3D Robotics company. It is the chosen flight controller of the project's UAV. In figure 31 the unit is shown with optional accessories.

The Pixhawk will also be used as an inertial measurement unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable for mapping purposes. See 44 for details of its application.

3122 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is that they are available in several sizes, with different resolutions and other features, for a relatively low price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, owing to their light weight and wide-angle field of view.


Figure 32: The chosen lidar sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3123 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available, with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size and weight (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information they provide is very valuable for indoor flights.

32 Chosen software

321 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and


scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. For the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based; therefore most non-mathematical calculations are slightly slower than in a C++ environment.

322 Robot Operating System (ROS)

The Robot Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

ROS also contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3123) and the flight controller (3121) have ready-to-use drivers available.

323 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate), image processing tools (e.g. different kinds of edge detection and threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, plenty of examples and support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.

324 Dlib

Dlib is a general-purpose cross-platform C++ library. It has an excellent machine learning component, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 33 for an architecture overview). Dlib is completely open source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it


Figure 33: Elements of Dlib's machine learning toolkit. Source: [73]

easy to develop and test custom-trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that, aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well known as OpenCV, it is extremely well documented, supported and updated regularly. As a result, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm is introduced, with detailed explanations given to the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

41 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks, like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller


aerial vehicles can carry one and record high-quality videos. The used lidar (3123) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (311) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. They also add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the moving camera. See subsection 2211 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various sizes of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm specially designed for the task. Thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


[Figure 41 diagram components: sensors (camera, 2D lidar), video reader, current frame, other preprocessing (edge/colour detection), regions of interest, 3D map, Vatic annotation server, trainer, support vector machines (front SVM, side SVM), detector algorithm, tracking, detections, evaluation]

Figure 41: A diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow.


42 Architecture of the detection system

In the previous chapters the project (11) and the objectives of this thesis (13) were introduced. The advantages and disadvantages of the different available sensors were presented (312). Some of the most often used feature extraction and classification methods (222) were examined with respect to their suitability for the project. Finally, in section 41 the challenges of the task were discussed.

After considering the aspects above, a complete modular architecture was designed which is suitable for the defined objectives. See figure 41 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 222, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used. Neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an essential part of the system, since it produces the classifiers for the detector. These have to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from the testing. See subsection 431 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself; see subsection 312). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robot Operating System (322). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered a sensor module as well, simulating real input from the sensor (the other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 41), speed and the limitations of the available resources were a great concern. Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented; thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 43.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and


debugging purposes. However, it had to be developed in parallel with every other part to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 511.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

43 2D image processing methods

In this section the 2D camera image processing concerns, the chosen methods and the process of the development will be presented.

431 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 22. In section 32 the used image processing libraries and toolkits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2223) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance in the relative positions of these patches (e.g. due to different view angles) can confuse the detector as well. Also, using OpenCV's (323) training algorithm, the learning process was significantly slower than with the final training software (presented later), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2221) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable for recognising


a concrete object instead of the general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also because of moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to move between premises without any a-priori information about the lighting.

Histogram of Oriented Gradients (HOG, see sub-subsection 2223) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4). This means that even if the camera is levelled, the object itself could seem rotated due to the perspective. To overcome this issue, solutions will be presented in this subsection and in 434.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (324) was chosen, since it includes an excellent, seriously optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 434). Also, several examples and documentation pages are available for both training and detection, which simplified the implementation.

As classifier, support vector machines (SVM, see sub-subsection 2226) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection


problems (like this one) they outperform the other popular methods. Also, it is implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an XML file as input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might get into the negative training images otherwise.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar XML file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to the disk with serialization.
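
A condensed sketch of such a training run, modelled on Dlib's public fhog object-detector example, is shown below. The XML file name, the detection window size and the C parameter are placeholders rather than the settings actually used for the ground robot.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

using namespace dlib;

int main() {
    // HOG feature pyramid scanner combined with Dlib's structural SVM trainer.
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    dlib::array<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    // XML file listing the training images and the annotated object boxes
    // (file name is a placeholder).
    load_image_dataset(images, boxes, "training.xml");

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);           // placeholder window size

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                                     // SVM regularization, placeholder
    trainer.be_verbose();

    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Serialize the trained classifier so the detector software can load it later.
    serialize("groundrobot_side.svm") << detector;
    return 0;
}

The serialized file produced this way is what the detector software described later loads as one of its classifiers.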

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and the right, while the front and rear views are also very similar. This is especially true if the working principle of HOG (2223) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and the rear, and while the side views are similar, they are mirrored.

The second advantage of these classifiers is very important in the future of


the project. Due to the way the SVMs are trained, they not only detect the position of the robot but, depending on which one of them detects it, the system also gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 434). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way). Therefore the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything apart from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even when viewed from above or sideways. Although training an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is that while the side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 42 a comparison of training images and the produced HOG detectors is shown. 42(a) is a typical training image for the side-view HOG detector. Notice that only the side of the robot is visible, since the camera was held very low. 42(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize, thanks to their strong edges. In 42(c) a training image is displayed from the front-view detector training image set. 42(d) shows the visualized final front-view detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 435 for more details.

432 Sliding window method

A very important property of these trained methods (222) is that usually their training datasets (the images of the robot or any other object) are cropped, only containing the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contained only the robot, there would be no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the


(a) side-view training image example; (b) side-view HOG detector; (c) front-view training image example; (d) front-view HOG detector

Figure 42: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 43: Representation of the sliding window method.

detector, cropped and resized from the original large input.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with a predefined step size and at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 43 for a representation.
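
A minimal generic sketch of this scan is given below. The Image type (assumed to provide width(), height() and resized()) and the Classifier callable are illustrative helpers, not the project's actual interfaces.

#include <vector>

struct Det { int x, y, w, h; };   // detection box in input-image coordinates

template <typename Image, typename Classifier>
std::vector<Det> slidingWindow(const Image& input, Classifier classify,
                               int winW, int winH,
                               int step = 8, double scaleStep = 1.2) {
    std::vector<Det> hits;
    Image img = input;
    double scale = 1.0;                        // maps window positions back to the input
    while (img.width() >= winW && img.height() >= winH) {
        for (int y = 0; y + winH <= img.height(); y += step)
            for (int x = 0; x + winW <= img.width(); x += step)
                if (classify(img, x, y, winW, winH))          // crop evaluated here
                    hits.push_back({ int(x * scale), int(y * scale),
                                     int(winW * scale), int(winH * scale) });
        img = img.resized(1.0 / scaleStep);    // shrink the image for the next scale
        scale *= scaleStep;
    }
    return hits;
}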

It is worth mentioning that multiple instances of the sought object may be present in the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

433 Pre-filtering

As listed in section 41, two of the most important challenges are speed and the limitations of the available resources. In subsection 432 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors are executed in a shorter time than by scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Based on intuition, a good separating feature is the colour: anything which does not have the same colour as the ground vehicle should be ignored. To do this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 44 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well; thus these areas are still scanned.

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 435 for details.

434 Tracking

In computer vision, tracking means following a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (433) the idea of filtering by prior detections was


Figure 44: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

presented. Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives, it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (324) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as input for the tracking. See subsection 435 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
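
The snippet below sketches how such a tracker can be driven from an OpenCV video feed using Dlib's correlation_tracker; the video path and the initial rectangle are placeholders, and in the real system the initial box would come from the HOG+SVM detector.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("testVideo1.avi");      // placeholder input video
    cv::Mat frame;
    dlib::correlation_tracker tracker;
    bool initialized = false;

    while (cap.read(frame)) {
        dlib::cv_image<dlib::bgr_pixel> dlibFrame(frame);   // wrap the OpenCV frame, no copy

        if (!initialized) {
            // In the real pipeline this rectangle is the detector's bounding box.
            tracker.start_track(dlibFrame, dlib::rectangle(100, 100, 200, 180));
            initialized = true;
        } else {
            tracker.update(dlibFrame);                      // follow the object to this frame
            dlib::drectangle pos = tracker.get_position();
            cv::rectangle(frame,
                          cv::Rect(int(pos.left()), int(pos.top()),
                                   int(pos.width()), int(pos.height())),
                          cv::Scalar(0, 255, 255), 2);      // draw the tracked box
        }
        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;                    // Esc quits
    }
    return 0;
}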

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame,


since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

435 Implemented detector

In subsection 431 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 432, 433 and 434.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using 2 SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development four different approaches were implemented. Each of them is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame rates, see 512), more accurate (see 511), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking-based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as new software. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 41: The available parameters

Name | Valid values | Function
input | path to video | video used as input for the detection
svm | path to SVMs | these SVMs will be used
mode | [1, 2, 3, 4] | selects which mode is used
saveFrames | [0, 1] | turns on video frame export
saveDetections | [0, 1] | turns on detection box export
saveFPS | [0, 1] | turns on frame-rate measurement
displayVideo | [0, 1] | turns on video display
DetectionsFileName | string | sets the filename for saved detections
FramesFolderName | string | sets the folder name used for saving video frames
numberOfLoops | integer (0<) | sets how many times the video is looped

To make the development more convenient these features can be switchedon or off via parameters which are imported from an input file every time thedetector is started There is an option to change the name of the file where thedetections are exported or the name of the folder where the video frames aresaved

Table 41 concludes all the implemented parameters with the possible valuesand a short description

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections)


nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful for simulating longer inputs. See figure 45 for a screenshot of the interface after a parameter file is loaded.

Figure 45: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring the processing frame rate).

4351 Mode 1 Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 434), both are "slid" across the input image. Both return a vector of rectangles where they found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as input). On the other hand, this exhaustive search is very computationally heavy, especially with two classifiers. This results in the lowest frame-per-second rate of all the methods.

4352 Mode 2 Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two used classifiers were trained. In figure 42 it can be seen that one of them


represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but resulting from the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect mode 2 is very similar to mode 1, presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in much faster processing, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.

4353 Mode 3 Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 433 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. The possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the


Figure 46: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

position of the detection, besides the detector which returned it. A new rectangle, named ROI (region of interest), was included, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, for the reasons mentioned above (movement of the camera and of the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returned detections inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image. In this case mode 3 works exactly like mode 2.
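
The ROI update described above can be summarized by the following illustrative sketch; the Box type, function names and clamping details are assumptions, while the 50% and 3% growth factors are the defaults quoted in the text.

#include <algorithm>

struct Box { int x, y, w, h; };

static Box grow(Box b, double factor, int imgW, int imgH) {
    int dw = int(b.w * factor), dh = int(b.h * factor);
    b.x = std::max(0, b.x - dw / 2);
    b.y = std::max(0, b.y - dh / 2);
    b.w = std::min(imgW - b.x, b.w + dw);   // clamp so the ROI never leaves the image
    b.h = std::min(imgH - b.y, b.h + dh);
    return b;
}

// Called once per frame: rebuild the ROI around a fresh detection, or keep
// enlarging the previous ROI until it covers the whole image (mode 2 behaviour).
Box updateRoi(bool hasDetection, const Box& detection, const Box& previousRoi,
              int imgW, int imgH) {
    if (hasDetection)
        return grow(detection, 0.50, imgW, imgH);   // 50% growth around the hit
    return grow(previousRoi, 0.03, imgW, imgH);     // 3% growth after a miss
}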

See figure 46 for an example of the method described above. The green rectangle marks the region of interest (ROI). The classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.


Please note that while the ROI is updated from the previous detections, it is very easy to change it to another "pre-filter", due to the modular architecture introduced in 42.

4354 Mode 4 Tracking based approach

As mentioned in 434 Dlib library has a built-in tracker algorithm based on thevery resent research [76]

Mode 4 is very similar to mode 3 but includes the tracking algorithm men-tioned above Using a tracker makes it unnecessary to scan every frame (or partof it as mode 3 does) with at least one detector

Instead once the robot was found it is followed across the scene Certainlyperiodical validations are still needed Tracking is significantly faster than thesliding window detectors and in appropriate conditions can perform above realtime speed

It also improved the efficiency of the detector since on many frames the robotwas tracked and its position was marked correctly while the detectors missed itfrom the same view-point

On the other hand tracking can be misleading as well In case of the trackedregion rdquoslidesrdquo off the robot incorrect detections will be provided Also thetracker often estimates the position of the object correctly while the scale of therectangle does not fit which can result much larger or smaller detection boxesthan the object itself See figure 47(b) for an example Yellow rectangle marksthe tracked region which clearly includes the robot but itrsquos size is unacceptable

To avoid these phenomena the bounding box returned by the tracker is reg-ularly checked by detectors This is done in the very same way as in mode 3The output of the tracker is the base of the ROI which is than enlarged by pa-rameters (similarly to mode 3) The algorithm checks this area with the selectedclassifier(s)

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides only (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returned detections inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would do: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.
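A minimal sketch of this validate-and-reinitialize loop is given below. It assumes the dlib correlation tracker and a HOG+SVM object_detector, as used elsewhere in this chapter, and a BGR input frame; the 30% margin, the default tolerance and the helper name trackAndValidate are illustrative choices, not the exact parameters of the implemented module.

#include <dlib/image_processing.h>
#include <dlib/image_processing/scan_fhog_pyramid.h>
#include <dlib/opencv.h>
#include <opencv2/core.hpp>
#include <vector>

using Scanner  = dlib::scan_fhog_pyramid<dlib::pyramid_down<6>>;
using Detector = dlib::object_detector<Scanner>;

// One iteration of the mode 4 loop: follow the robot with the tracker, verify
// the tracked box with the detector inside an enlarged ROI, re-initialise the
// tracker on success and count misses so the object can be declared lost.
bool trackAndValidate(const cv::Mat& frame, dlib::correlation_tracker& tracker,
                      Detector& detector, int& misses, int tolerance = 10)
{
    dlib::cv_image<dlib::bgr_pixel> img(frame);
    tracker.update(img);
    const dlib::drectangle pos = tracker.get_position();

    // Enlarge the tracked box (illustrative 30% margin) and clip it to the frame.
    cv::Rect roi(static_cast<int>(pos.left() - 0.15 * pos.width()),
                 static_cast<int>(pos.top()  - 0.15 * pos.height()),
                 static_cast<int>(pos.width()  * 1.3),
                 static_cast<int>(pos.height() * 1.3));
    roi &= cv::Rect(0, 0, frame.cols, frame.rows);
    if (roi.area() == 0)
        return false;

    // Run the selected classifier only on the ROI.
    cv::Mat sub = frame(roi);
    dlib::cv_image<dlib::bgr_pixel> subImg(sub);
    std::vector<dlib::rectangle> dets = detector(subImg);

    if (!dets.empty()) {
        // Translate back to full-image coordinates and restart the tracker there.
        dlib::rectangle hit = dlib::translate_rect(dets[0], roi.x, roi.y);
        tracker.start_track(img, hit);
        misses = 0;
        return true;
    }
    return ++misses <= tolerance;  // false once the object is labelled as lost
}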

See figure 4.7(a) for a representation of the processing method of mode 4.


(a) Example output frame of mode 4

(b) Typical error of the tracker

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier. The yellow box is the currently tracked area. The green rectangle marks the estimated region of interest. (b) A typical error of the tracker. The yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of such a scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, and secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (135°, −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the Lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and it is marked in blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.


On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the Lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for Lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file-name is user defined.

Since both the Lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data sets were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved


Figure 4.8: Schematic figure to represent the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its accessories. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the Lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


to the log file as well. However, since the two systems were not connected, the timestamps had to be synchronized manually.
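Once a common time offset between the two clocks has been estimated manually, pairing the two logs reduces to a nearest-time-stamp lookup. The sketch below illustrates this matching step only; the structure and the function name are illustrative and not taken from the actual processing scripts.

#include <algorithm>
#include <cmath>
#include <iterator>
#include <vector>

struct OrientationSample { double t; double pitch; };  // one line of the IMU log

// Return the logged pitch whose time-stamp is closest to the time 't' of a lidar
// scan. The log is assumed to be sorted by time and already shifted by the
// manually determined offset between the two clocks.
double pitchAt(const std::vector<OrientationSample>& samples, double t)
{
    if (samples.empty())
        return 0.0;                         // no orientation data available
    auto it = std::lower_bound(samples.begin(), samples.end(), t,
        [](const OrientationSample& s, double v) { return s.t < v; });
    if (it == samples.begin()) return it->pitch;
    if (it == samples.end())   return std::prev(it)->pitch;
    auto prev = std::prev(it);
    return (std::abs(it->t - t) < std::abs(prev->t - t)) ? it->pitch : prev->pitch;
}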

On figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. We used a Motorola Moto X (2013) for these experiments.

The coordinate system used is fixed to the phone. The x axis is parallel with the shorter side of the screen, pointing from the left to the right. The y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface;

• roll is the rotation around the y axis;

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) Picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6, 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problem of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections. However, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z (x = y × z), pointing to the right;

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward;

• the z axis is pointing up and is perpendicular to the ground.

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground fixed coordinate system. Three values represent its position along the x, y, z axes. The other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change. Thus only pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

Where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

The position of the lidar in the ground coordinate system did not change in theory, only its orientation. However, as can be seen on figure 4.8, part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated as the sum of dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)


Where dy is the translation required along the y axis and dz is the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8, part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get:

[x, y, z]ᵀ = distance · [sin(−yaw), cos(yaw)·sin(pitch), cos(yaw)·cos(pitch)]ᵀ + offset · [0, sin(pitch), cos(pitch)]ᵀ    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (Simultaneous Localization and Mapping) or the ground robot detection.
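For reference, the mapping of equation 4.6 can be written directly as a short routine. The sketch below is a straightforward transcription of the formula; the function name and the use of std::array are illustrative choices.

#include <array>
#include <cmath>

// Map one lidar return (distance and yaw measured in the scan plane) into the
// ground-fixed frame, given the pitch of the scan plane and the offset between
// the light source and the rotation axis (equation 4.6).
std::array<double, 3> toGroundFrame(double distance, double yaw,
                                    double pitch, double offset)
{
    const double x = distance * std::sin(-yaw);
    const double y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    const double z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return {x, y, z};
}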


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check if it meets the predefined requirements.

To understand what "better" means in case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture. In other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive. Similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called type I error) if the system classifies an input as a positive sample although it is not. In other words, the system believes


the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called type II error) occurs when an input is not recognized (rejected) although it should be. In this current task a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if conditions are loosened in the detection method, false negative errors may be reduced. However, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid the mistakes that will occur. Such a project can be a manually supervised classification, where appearing false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position. After a few frames the detector would be able to detect the robot again with high probability. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset with its samples already labelled (either fully manually or using a semi-supervised classifier) or on a video set where the object's (or objects') position is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or the ROI. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use on-line video annotation tool designed especially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible. See an example screenshot on figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, drawing a bounding box around them) on every frame of the videos (there are options to label an object occluded, obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only remaining task is to correct the interpolations between them if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This will guarantee that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
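A per-frame timer of this kind can be wrapped around the processing call, for example with std::chrono. The sketch below is illustrative only: the file name, the log format and the surrounding function are placeholders, not the actual implementation.

#include <chrono>
#include <fstream>
#include <opencv2/core.hpp>

// Append the processing time of each frame to a csv file that the Matlab
// analysis script can later load.
static std::ofstream timeLog("frame_times.csv");

void processAndTime(const cv::Mat& frame)
{
    const auto start = std::chrono::steady_clock::now();
    // ... run the selected detection mode on 'frame' here ...
    const auto stop = std::chrono::steady_clock::now();
    timeLog << std::chrono::duration<double>(stop - start).count() << "\n";
}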

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every other unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test due to the operating system did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during execution by definition, which results in a varying processing time. Variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average ellapsed seconds: 0.032446
Variance of ellapsed seconds per frame:
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS

The changes of average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer; thus the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to gather the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they are still scanning in a plane and have this mentioned processing built-in. Also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this, and makes indoor navigation easier. Therefore experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map, viewed from different points. As can be seen, recordings from one point are not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
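As an illustration of the simpler, height-based idea, a point-cloud can be reduced to the candidate points of the robot in a single pass; the band limits in the sketch below are illustrative values, not measured parameters of the UGV.

#include <array>
#include <vector>

// Keep only the points whose height falls inside the band expected for the
// ground robot. Note that the origin is at the lidar, so points below the
// scanner have negative z (see subsection 4.4.4).
std::vector<std::array<double, 3>>
filterByHeight(const std::vector<std::array<double, 3>>& cloud,
               double zMin, double zMax)
{
    std::vector<std::array<double, 3>> kept;
    for (const auto& p : cloud)
        if (p[2] >= zMin && p[2] <= zMax)
            kept.push_back(p);
    return kept;
}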

5.3 Discussion of results

In general it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos can be found here1 for all modes. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

1 users.itk.ppke.hu/~palan1/Cranfield/Videos


(a) Example result of the 3D scan of the ground robot

(b) Example of the rsquoshadowrsquo of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be measured by several values; here recall and precision will be used. Recall is defined by

Recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples. Thus it is also called sensitivity.

Precision is defined by

Precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, the precision of the algorithm can be improved at the cost of an increased number of false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4) based on position and the overlapping areas. For the latter a minimum of 50% was defined (the ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged). Therefore the annotated object will not be covered by any of the detections.
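The overlap test itself is a simple intersection-over-union check. The sketch below shows the criterion described above, assuming the boxes are stored as cv::Rect; the helper name is illustrative.

#include <opencv2/core.hpp>

// A detection counts as a match for an annotated box when the ratio of the
// intersection and the union of the two rectangles reaches 50%.
bool isMatch(const cv::Rect& detection, const cv::Rect& annotation)
{
    const double intersection = (detection & annotation).area();
    const double unionArea    = detection.area() + annotation.area() - intersection;
    return unionArea > 0 && intersection / unionArea >= 0.5;
}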

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 was proven to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 was proven to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests. Thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall 95%) and speed (average frame-rate 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validations of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed with the aspect of suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and the challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 concludes the chosen feature extraction (HOG) and classifier (SVM) methods and presents the implemented training software along with the


produced detectors.

As one of the most important parts, the currently used detector was introduced

in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work, therefore additional debugging features like exporting detections, demonstration videos and frame rate measurements were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, which is chosen based on previous detections. This resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well, if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by detectors.

Special attention was given to the evaluation of the system. Two additional software tools were developed: one for evaluating the efficiency of the detections and another for analysing the processing time and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the ground robot detecting algorithm's first version is ready-to-use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping, along with the applied mathematics. An


experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front) but including more images from slightly rotated view-points (both rolling the camera and capturing the object from a not completely frontal direction) will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods were introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangle which marks the estimated position of the object, the tracker returns a confidence value, which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information to eliminate detections which "slipped" off the object. Also, further processing of the returned position is recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are problems as well, thus shrinking of the rectangles has to be considered too.
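dlib's correlation tracker already exposes such a confidence value as the return value of update(), so the first of these suggestions could be realised with a simple guard like the one sketched below; the threshold is an illustrative number that would have to be tuned experimentally.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/core.hpp>

// Update the tracker and accept the new position only when the confidence
// score returned by update() is high enough; otherwise fall back to detectors.
bool confidentUpdate(dlib::correlation_tracker& tracker, const cv::Mat& frame,
                     double threshold = 7.0)
{
    dlib::cv_image<dlib::bgr_pixel> img(frame);
    return tracker.update(img) >= threshold;
}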


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging can be carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field-of-view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and who I could always turn to in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview senseFly." [Online]. Available: https://www.sensefly.com/drones/overview.html [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: http://www.questuav.com/news/fixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: http://store.dji.com/product/phantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: http://militaryhistory.about.com/od/artillerysiegeweapons/p/v1.htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: http://www.springerlink.com/content/t361252205328408/fulltext.pdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: http://www.mdpi.com/2072-4292/4/11/3390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: http://link.springer.com/10.1007/s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: http://www.ifp.uni-stuttgart.de/publications/phowo11/140Eisenbeiss.pdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," ... of the Australasian Conference on Robotics, 2004. [Online]. Available: http://www.cse.unsw.edu.au/~acra2005/proceedings/papers/bath.pdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2522968.2522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: http://www.icg.tu-graz.ac.at/Members/pmroth/pub_pmroth/TR_OR/at_download/file

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: http://www.researchgate.net/publication/257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: https://www.vision.caltech.edu/publications/perona-chapter-Dec07.pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-t. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: http://link.aip.org/link/?PSI/4387/95/1&Agg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: http://www.springerlink.com/index/H584WVN93312V4LT.pdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Robust+Real-time+Object+Detection#0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: http://www.cscjournals.org/csc/manuscript/Journals/IJIP/volume6/Issue6/IJIP-702.pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997.

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2005.177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.9433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. C. Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: http://cyberc3.sjtu.edu.cn/CyberC3/doc/paper/Robotica2006.pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: http://www.springerlink.com/index/10.1007/978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: http://www.springerlink.com/content/978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: http://link.springer.com/10.1007/BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: http://boundarydevices.com/product/nitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: http://rover.ardupilot.com/wiki/pixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/utm_30lx_ew.html [Accessed at 2015-07-22]

[72] "openCV manual, Release 2.4.9." [Online]. Available: http://docs.opencv.org/opencv2refman.pdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: http://jmlr.csail.mit.edu/papers/v10/king09a.html

[74] "dlib C++ Library." [Online]. Available: http://dlib.net [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1502.00046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/download/data/UrgBenri.htm [Accessed at 2015-07-20]

[78] "vatic - Video Annotation Tool - UC Irvine." [Online]. Available: http://web.mit.edu/vondrick/vatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: https://www.mturk.com/mturk/welcome [Accessed at 2015-07-26]


  • List of Figures
  • Absztrakt
  • Abstract
  • List of Abbreviations
  • 1 Introduction and project description
    • 1.1 Project description and requirements
    • 1.2 Type of vehicle
    • 1.3 Aims and objectives
  • 2 Literature Review
    • 2.1 UAVs and applications
      • 2.1.1 Fixed-wing UAVs
      • 2.1.2 Rotary-wing UAVs
      • 2.1.3 Applications
    • 2.2 Object detection on conventional 2D images
      • 2.2.1 Classical detection methods
        • 2.2.1.1 Background subtraction
        • 2.2.1.2 Template matching algorithms
      • 2.2.2 Feature descriptors, classifiers and learning methods
        • 2.2.2.1 SIFT features
        • 2.2.2.2 Haar-like features
        • 2.2.2.3 HOG features
        • 2.2.2.4 Learning models in computer vision
        • 2.2.2.5 AdaBoost
        • 2.2.2.6 Support Vector Machine
  • 3 Development
    • 3.1 Hardware resources
      • 3.1.1 Nitrogen board
      • 3.1.2 Sensors
        • 3.1.2.1 Pixhawk autopilot
        • 3.1.2.2 Camera
        • 3.1.2.3 LiDar
    • 3.2 Chosen software
      • 3.2.1 Matlab
      • 3.2.2 Robotic Operating System (ROS)
      • 3.2.3 OpenCV
      • 3.2.4 Dlib
  • 4 Designing and implementing the algorithm
    • 4.1 Challenges in the task
    • 4.2 Architecture of the detection system
    • 4.3 2D image processing methods
      • 4.3.1 Chosen methods and the training algorithm
      • 4.3.2 Sliding window method
      • 4.3.3 Pre-filtering
      • 4.3.4 Tracking
      • 4.3.5 Implemented detector
        • 4.3.5.1 Mode 1: Sliding window with all the classifiers
        • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
        • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
        • 4.3.5.4 Mode 4: Tracking based approach
    • 4.4 3D image processing methods
      • 4.4.1 3D recording method
      • 4.4.2 Android based recording set-up
      • 4.4.3 Final set-up with Pixhawk flight controller
      • 4.4.4 3D reconstruction
  • 5 Results
    • 5.1 2D image detection results
      • 5.1.1 Evaluation
        • 5.1.1.1 Definition of True positive and negative
        • 5.1.1.2 Definition of False positive and negative
        • 5.1.1.3 Reducing number of errors
        • 5.1.1.4 Annotation and database building
      • 5.1.2 Frame-rate measurement and analysis
    • 5.2 3D image detection results
    • 5.3 Discussion of results
  • 6 Conclusion and recommended future work
    • 6.1 Conclusion
    • 6.2 Recommended future work
  • References

CONTENTS

5.2 3D image detection results
5.3 Discussion of results

6 Conclusion and recommended future work
6.1 Conclusion
6.2 Recommended future work

References


List of Figures

1.1 Image of the ground robot

2.1 Fixed wing consumer drone
2.2 Example for consumer drones
2.3 Example for people detection with background subtraction
2.4 Example of template matching
2.5 2 example Haar-like features
2.6 Illustration of the discriminative and generative models
2.7 Example of a separable problem

3.1 Image of Pixhawk flight controller
3.2 The chosen LIDAR sensor Hokuyo UTM-30LX
3.3 Elements of Dlib's machine learning toolkit

4.1 A diagram of the designed architecture
4.2 Visualization of the trained HOG detectors
4.3 Representation of the sliding window method
4.4 Example image for the result of edge detection
4.5 Example of the detector's user interface
4.6 Presentation of mode 3
4.7 Mode 4 example output frames
4.10 Screenshot of the Android application for Lidar recordings
4.8 Schematic figure to represent the 3D recording set-up
4.9 Example representation of the output of the Lidar Sensor
4.11 Picture about the laboratory and the recording set-up
4.12 Presentation of the axes used in the android application

5.1 Figure of the Vatic user interface
5.2 Photo of the recording process
5.3 Example of the 3D images built


List of Abbreviations

SATM   School of Aerospace Technology and Manufacturing
UAV    Unmanned Aerial Vehicle
UAS    Unmanned Aerial System
UA     Unmanned Aircraft
UGV    Unmanned Ground Vehicle
HOG    Histogram of Oriented Gradients
RC     Radio Controlled
ROS    Robotic Operating System
IMU    Inertial Measurement Unit
DoF    Degree of Freedom
SLAM   Simultaneous Localization And Mapping
ROI    Region Of Interest
Vatic  Video Annotation Tool from Irvine California


Absztrakt

This thesis presents a tracking system mounted on an unmanned aerial vehicle, whose purpose is to detect indoor objects and, above all, to find the ground unit that will serve as a landing and recharging station.

First, the aims of the project and of this thesis are listed and detailed.

This is followed by a detailed literature review presenting the existing solutions to similar challenges. A short summary of unmanned aerial vehicles and their application fields is given, then the best-known object detection methods are presented. The review discusses their advantages and disadvantages, with special regard to their applicability in the present project.

The next part describes the development environment, including the available software and hardware.

After presenting the challenges of the task, the design of a modular architecture is introduced, taking into account the objectives, the resources and the problems encountered.

One of the most important modules of this architecture, the latest version of the detection algorithm, is detailed in the following chapter, together with its capabilities, modes and user interface.

To measure the efficiency of this module, an evaluation environment was created which can compute several metrics related to the detection. Both the environment and the metrics are detailed in the corresponding chapter, followed by the results achieved by the latest algorithm.

Although this thesis focuses mainly on detection methods operating on conventional (2D) images, 3D imaging and processing methods were also considered. An experimental system was built which is capable of creating spectacular and accurate 3D maps using a 2D laser scanner. Several recordings were made to test the solution, and they are presented together with the system.

Finally, a summary of the implemented methods and the results concludes the thesis.


Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging of the UAV.

First the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that the most relevant object recognition methods are reviewed, along with their suitability for the discussed project.

Then the environment of the development will be described, including the available software and hardware resources.

Afterwards the challenges of the task are collected and discussed. Considering the objectives, resources and the challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluating tools are introduced to analyse efficiency and speed.

The first version of the ground robot detection algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two-dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV and are presented in this paper as well.

Finally, all implemented methods and relevant results are summarised.


Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards a structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly real-time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required, so the map would be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision (for example when mapping a dangerous area which would be too far for real time control) to coordinate the process. Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or opened areas, which are all a significant challenge for both the vehicle (plan a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-layer buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has a lot more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, and therefore longer operation time. On the other hand, it cannot elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles closing its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture.

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable for the ground vehicle. With sophisticated route planning the areas scanned twice can be minimized, resulting in a faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As it can be seen, the scope of the project is extremely wide. To cover the whole, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution for this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First a short introduction to unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards a brief overview of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, Unmanned Aerial Vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize.


Figure 2.1: One of the most popular consumer level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurances using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long endurance, high altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly as the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation. Multirotors must individually adjust the thrust of each motor: if motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electric stabilization) and less energy efficient than helicopters. That is the reason for the fact that no large scale multirotor is used, since as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increasing operational costs [4].


In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, as many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example for this application field is the location of an earthquake, or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs were admittedly proven useful is surveillance. [10] is an interesting example for this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11], [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers. They can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20-22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23-25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples for this are [26-28].

However, cameras and laser range finders are often used together. [29, 30] for example manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31-33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them by:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.


• Naming: Given a large complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it, and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As it can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Beside the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. re-train them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity often makes them an attractive option to perform easy detection tasks or to pre-filter the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differentiating a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown.


Figure 2.3: Example for background subtraction used for people detection. The red patches show the object in the segmented foreground. Source: [39]

The background model is usually updated regularly, which means the detection is adaptive.

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means fixed perspectives, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object, as for example in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.
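Although it was ruled out for the airborne platform, the basic usage pattern is worth illustrating. The sketch below is not part of the thesis code; it assumes the OpenCV 2.4-series API mentioned in section 3.2.3 and a webcam as the static camera:

    // Minimal sketch: adaptive background subtraction with OpenCV 2.4's MOG2 model.
    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::VideoCapture cap(0);                // static camera assumed
        cv::BackgroundSubtractorMOG2 mog2;      // learns and updates the background model
        cv::Mat frame, foreground;

        while (cap.read(frame))
        {
            // Update the model and extract the foreground mask; moving pixels become white.
            mog2(frame, foreground);

            // Connected white patches are candidate moving objects (cf. figure 2.3).
            cv::imshow("foreground", foreground);
            if (cv::waitKey(30) == 27) break;   // stop on ESC
        }
        return 0;
    }

Because the model is updated with every frame, gradual illumination changes are absorbed automatically, which corresponds to the adaptivity discussed above.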

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.


Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

To perform the comparison, the template is shifted (u, v) discrete steps in the (x, y) directions respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, while [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.
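To make the shifting-and-comparing procedure concrete, the following minimal sketch (illustration only, not the thesis detector; the file names are placeholders) performs template matching with normalized cross correlation in OpenCV and reads out the best-scoring position:

    // Illustration: template matching with normalized cross correlation.
    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);     // placeholder inputs
        cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);
        cv::Mat response;

        // Slide the template over the image and score every (u, v) position.
        cv::matchTemplate(image, templ, response, cv::TM_CCOEFF_NORMED);

        // The best match is where the response is maximal.
        double minVal, maxVal;
        cv::Point minLoc, maxLoc;
        cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);

        cv::rectangle(image, maxLoc, maxLoc + cv::Point(templ.cols, templ.rows),
                      cv::Scalar(255), 2);
        cv::imwrite("match.png", image);
        return 0;
    }

Note that the response map has to be computed for every position (and every scale of interest), which is exactly the computational burden discussed above.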


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to the vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well. [48] recognizes edge based and patch based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels. The size and shape of this area is not defined. Region-level features are often constructed from lower level features presented above; however, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50] (a usage sketch is given after the list):

1. Scale-space extrema detection: the first stage searches over the image in multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable, based on the model, are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated in a selected scale.
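As a usage sketch of such patch-level descriptors (illustration only; SIFT ships in the non-free module of the OpenCV 2.4 series, and the file names are placeholders), key-points can be extracted from a training image of the object and matched against a test image:

    // Illustration: extracting and matching SIFT key-points with OpenCV 2.4 (nonfree module).
    #include <opencv2/opencv.hpp>
    #include <opencv2/nonfree/features2d.hpp>

    int main()
    {
        cv::Mat object = cv::imread("object.png", cv::IMREAD_GRAYSCALE);
        cv::Mat scene  = cv::imread("scene.png",  cv::IMREAD_GRAYSCALE);

        cv::SIFT sift;
        std::vector<cv::KeyPoint> kpObject, kpScene;
        cv::Mat descObject, descScene;

        // Detect key-points and compute their descriptors in one call.
        sift(object, cv::noArray(), kpObject, descObject);
        sift(scene,  cv::noArray(), kpScene,  descScene);

        // Nearest-neighbour matching; consistent matches indicate the object's location.
        cv::BFMatcher matcher(cv::NORM_L2);
        std::vector<cv::DMatch> matches;
        matcher.match(descObject, descScene, matches);
        return 0;
    }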

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixels intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two and three rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
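To make the speed argument concrete (the notation below is added here for illustration and is not quoted from [45]): writing i for the input image and ii for its integral image,

    ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')
    \qquad
    \sum_{(x, y) \in \mathrm{rect}(A, B, C, D)} i(x, y) = ii(D) - ii(B) - ii(C) + ii(A)

where A, B, C and D are the top-left, top-right, bottom-left and bottom-right corners of the rectangle. Any rectangle sum therefore costs only four look-ups, independently of the rectangle's size, which is why the features can be evaluated so rapidly.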


Figure 2.5: 2 example Haar-like features that are proven to be effective separators. The top row shows the features themselves. The bottom row presents the features overlayed on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of areas around the eyes compared to the nose. Source: [45]

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and has inspired several researches. [54] for example extended the Haar-like feature set with rotated rectangles, by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable to find patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are used widely in computer vision for various tasks: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation-variance, which in spite of several attempts (e.g. [54]) is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and produces a histogram of the different orientations (called bins) after quantization, in small portions of the image (called cells). After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered.


Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference in the two philosophies. Source: [59]

The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to the orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in a superior robustness against illumination changes over SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
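As a brief illustration of this descriptor-plus-classifier pipeline (not the detector developed in this thesis), OpenCV ships the Dalal-Triggs pedestrian detector as a HOG descriptor with a pre-trained linear SVM; the image name below is a placeholder:

    // Illustration: the stock HOG + linear SVM pedestrian detector in OpenCV.
    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::Mat image = cv::imread("street.png");            // placeholder input

        cv::HOGDescriptor hog;                                // default 64x128 window, 9 bins
        hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

        // Scan the image at multiple scales; each hit is a rectangle around a person.
        std::vector<cv::Rect> detections;
        hog.detectMultiScale(image, detections);

        for (size_t i = 0; i < detections.size(); ++i)
            cv::rectangle(image, detections[i], cv::Scalar(0, 255, 0), 2);

        cv::imwrite("detections.png", image);
        return 0;
    }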

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well. [32, 48] find two main philosophies: generative and discriminative models.

Let us denote the test data (images) by x_i, the corresponding labels by c_i, and the feature descriptors (extracted from the data x_i) by θ_i, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given a test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) P(c)

is ideally 1 if x contains an instance of class c, and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to detect, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means the output of the classifiers only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights w_(t,i), where N is the number of training examples and T is the number of features in the stage.

• For t = 1, ..., N:

1. Normalize the weights.


2. Select the best weak classifier using only a single feature, by minimising the detection error ε_t = Σ_i w_i · |h(x_i, f, p, θ) − y_i|, where h(x_i) is the classifier output and y_i is the correct label (both with a range of 0 for negative, 1 for positive).

3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

4. Update the weights: w_(t+1,i) = w_(t,i) · β_t^(1−e_i), where β_t = ε_t / (1 − ε_t) and e_i = 0 if the example x_i is classified correctly, 1 otherwise.

• The final classifier for the stage is based on the weighted sum of the selected weak classifiers (see below).

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
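For completeness, the strong classifier of a stage can be summarised in the form used by [45] (stated here with the β_t defined in step 4 above):

    C(x) =
    \begin{cases}
      1 & \text{if } \sum_{t} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t} \alpha_t \\
      0 & \text{otherwise}
    \end{cases}
    \qquad \text{where } \alpha_t = \log\frac{1}{\beta_t}

so weak classifiers that achieved a low error on the weighted training set receive a larger vote in the final decision.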

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the features of the method were known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the set of features extracted from the train and test data samples as vectors in the feature space. Then it constructs one or more hyperplanes (a linear decision surface) in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same one (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The closest vectors to the hyperplane (which thereby determine the width of the margin) are called support vectors (hence the name of the method). See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin. A compact formulation of this optimisation problem is given below.

The algorithm was extended to not linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.
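As a compact summary of the margin construction described above (a standard formulation, e.g. in [64, 66]; the labels y_i in {-1, +1} are introduced here for brevity), the separable case solves

    \min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^2
    \quad \text{subject to} \quad
    y_i \left( w \cdot x_i + b \right) \ge 1, \qquad i = 1, \dots, N

and classifies a new sample by the sign of w · x + b. The soft-margin extension of [64] relaxes the constraints with slack variables penalised by a constant C, trading a few training errors for a wider margin.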


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods were proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or the "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on the comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in Section 1.3. First the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally the used software libraries will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarises the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable to use it for mapping purposes. See section 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide angle field of view.


Figure 3.2: The chosen LIDAR sensor: Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3123 LiDar

The other most important sensor used in this project is the lidar These sensorsare designed for remote sensing and distance measurement The basic conceptof these sensors is to do this by illuminating objects with a laser beam Thereflected light is analysed and the distances are determined

Several versions are available with different range accuracy size and otherfeatures The chosen scanner for this project is the Hokuyo UTM-30LX It isa 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]These features combined with its compact size (62mmtimes62mmtimes875mm210g [71])makes it an ideal scanner for autonomous vehicles for example UAVs

Generally speaking lidars are much more expensive sensors then camerasHowever the depth information provided by them is very valuable for indoorflights

32 Chosen software

321 MatlabMATLAB is a high-level interpreted programming language and environmentdesigned for numerical computation It is a very useful tool for engineers and

It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. For these reasons, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are somewhat slower than in a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
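
As a minimal illustration of how such a driver output is consumed, the sketch below subscribes to a laser scan topic with ROS's C++ client library. This is only a sketch: the node name and the topic name "scan" are illustrative assumptions, as the actual names depend on the driver configuration.

    #include <ros/ros.h>
    #include <sensor_msgs/LaserScan.h>

    // Print the number of range readings in every incoming scan message.
    void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan)
    {
        ROS_INFO("Received %zu range readings", scan->ranges.size());
    }

    int main(int argc, char** argv)
    {
        ros::init(argc, argv, "lidar_listener");   // node name is illustrative
        ros::NodeHandle nh;
        ros::Subscriber sub = nh.subscribe("scan", 10, scanCallback);
        ros::spin();                               // keep processing callbacks
        return 0;
    }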

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. loading, writing, resizing and rotating images), image processing tools (e.g. different kinds of edge detection and thresholding methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, plenty of examples and support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.
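
A minimal sketch of this input handling with OpenCV is shown below: it opens either a video file or an attached camera and reads it frame by frame (the window title is illustrative).

    #include <opencv2/opencv.hpp>

    int main(int argc, char** argv)
    {
        cv::VideoCapture cap;
        if (argc > 1) cap.open(argv[1]);   // video file given as argument
        else          cap.open(0);         // otherwise the first attached camera
        if (!cap.isOpened()) return 1;

        cv::Mat frame;
        while (cap.read(frame))
        {
            // the frame would be handed over to the detector at this point
            cv::imshow("input", frame);
            if (cv::waitKey(1) == 27) break;   // ESC stops the playback
        }
        return 0;
    }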

3.2.4 Dlib

Dlib is a general purpose cross-platform C++ library. It has an excellent machine learning library with the aim to provide a machine learning software development toolkit for the C++ language (see Figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

These make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. As a result, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. As a result, its popularity is increasing. For more information see [74] and [73].

Chapter 4

Designing and implementing the algorithm

In this chapter the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object. Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks, like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus the solution of this task had to be completely self-made and new.

2. Limited resources. Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are significantly limited. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller
aerial vehicles can carry one and record high quality videos. The lidar used (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (3.1.1) would fulfil these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform. No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. Third, no background-foreground separation algorithms are available for the camera. See subsection 2.2.1.1 for details.

4. Various viewpoints of the object. Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various size of the object. Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements. The algorithm's purpose is to support the landing manoeuvre, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. That feedback is provided by another algorithm specially designed for the task. Thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, with at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.

Figure 4.1: Diagram of the designed architecture, presenting the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. The diagram contains the following blocks: sensors (camera, 2D lidar, video reader), the trainer producing the support vector machines, the detector algorithm (front SVM, side SVM), regions of interest and other preprocessing (edge and colour detection), the 3D map, tracking, the detections, the Vatic annotation server and the evaluation. Arrows represent the dependencies and the direction of the information flow.

4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed which is suitable for the defined objectives. See Figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system. On the other hand, it provides a more robust detection system for the robot. The following enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm. As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns 'on the fly' or pre-trained detectors are used. Neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. These classifiers have to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs. Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture, with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.

Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (the other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking. Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so that corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation. To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it 'better' and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and
debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector. The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the methods of 2D camera image processing and the process of their development will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable to find patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more dominant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later in this section), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points do not change on the inputs, they are more suitable to recognise
a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class 'looks' different) but also because of moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3 in section 4.2), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera level) or in software (by rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4 in section 4.1). This means that even if the camera is levelled, the object itself can appear rotated, caused by the perspective. Solutions to overcome this issue will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally the Dlib library (3.2.4) was chosen, since it includes an excellent, heavily optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentation pages are available for both training and detection, which simplified the implementation.

As a classifier, Support Vector Machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection
problems (like this one) they outperform the other popular methods. Also, SVMs are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are 'generated' from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since otherwise a positive sample might get into the negative training images.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
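
The following sketch outlines how such a training step can be written with Dlib's HOG scanner and structural SVM trainer, following Dlib's public examples. It is only a sketch: the detection window size, the C parameter and the file names are illustrative assumptions, not the exact values used in the project.

    #include <dlib/svm_threaded.h>
    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>
    #include <iostream>

    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    int main()
    {
        // images and object boxes listed in an annotation xml file
        dlib::array<array2d<unsigned char> > images;
        std::vector<std::vector<rectangle> > boxes;
        load_image_dataset(images, boxes, "training.xml");

        image_scanner_type scanner;
        scanner.set_detection_window_size(80, 80);   // assumed window size

        structural_object_detection_trainer<image_scanner_type> trainer(scanner);
        trainer.set_num_threads(4);
        trainer.set_c(1);
        trainer.be_verbose();

        object_detector<image_scanner_type> detector = trainer.train(images, boxes);

        // quick sanity check on the training set, then serialize to disk
        std::cout << test_object_detection_function(detector, images, boxes) << std::endl;
        serialize("groundrobotside.svm") << detector;
        return 0;
    }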

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical 'shape' of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and from the right, while the front and rear views are also very similar. This is especially true if the working principle of the HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and from the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project.
Due to the way the SVMs are trained, they not only detect the position of the robot but, depending on which one of them detects it, the system also gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although training an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

On Figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector. Notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize, thanks to their strong edges. On 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers turn out to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that their training datasets (the images of the robot or any other object) are usually cropped, containing only the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.

Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.

Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method.

This algorithm takes the large image and slides a window across it with a predefined step-size and at several scales. This 'window' behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other sought object) will be cropped and handed to the trained detector. See Figure 4.3 for a representation.
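
A conceptual sketch of the window generation is given below (in the implemented system Dlib's scanner performs this internally; the step size and scale factor here are illustrative assumptions). Each returned rectangle would then be cropped, resized and passed to the trained detector.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Enumerate the sub-windows that a sliding window scan would hand to the classifier.
    std::vector<cv::Rect> slidingWindows(const cv::Size& image, cv::Size window,
                                         int step = 8, double scale = 1.2)
    {
        std::vector<cv::Rect> candidates;
        while (window.width <= image.width && window.height <= image.height)
        {
            for (int y = 0; y + window.height <= image.height; y += step)
                for (int x = 0; x + window.width <= image.width; x += step)
                    candidates.push_back(cv::Rect(x, y, window.width, window.height));
            window.width  = cvRound(window.width  * scale);   // move on to a coarser scale
            window.height = cvRound(window.height * scale);
        }
        return candidates;
    }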

It is worth mentioning that multiple instances of the sought object may be present on the image. For face or pedestrian detection algorithms, for example, this is a likely situation. Since only one ground robot has been built so far, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are completely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.

Thus methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3 in section 4.2) in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Based on intuition, a good separating feature is the colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed 'colour' of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good base of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor, which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. On Figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus those areas are still scanned.
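
A possible edge-based pre-filter is sketched below with OpenCV: it keeps only those grid cells of the frame that contain enough edge pixels. The Canny thresholds, the cell size and the density limit are illustrative assumptions, not tuned project values.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Return the grid cells of the frame that contain a noticeable amount of edges.
    std::vector<cv::Rect> edgeRoiCandidates(const cv::Mat& frame)
    {
        cv::Mat gray, edges;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::Canny(gray, edges, 80, 160);           // illustrative thresholds

        std::vector<cv::Rect> rois;
        const int cell = 64;                       // grid cell size in pixels
        for (int y = 0; y + cell <= edges.rows; y += cell)
            for (int x = 0; x + cell <= edges.cols; x += cell)
            {
                cv::Rect r(x, y, cell, cell);
                // keep only cells with enough edge pixels; empty floor or wall is skipped
                if (cv::countNonZero(edges(r)) > cell * cell / 20)
                    rois.push_back(r);
            }
        return rois;
    }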

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was presented.

Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the 'same set of pixels' in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of 'traditional' classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
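
A minimal sketch of driving this tracker from OpenCV frames is shown below. The initial rectangle is hard-coded here for illustration; in the implemented detector it would come from one of the trained classifiers.

    #include <dlib/image_processing.h>
    #include <dlib/opencv.h>
    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::VideoCapture cap(0);
        cv::Mat frame;
        if (!cap.read(frame)) return 1;

        dlib::cv_image<dlib::bgr_pixel> first(frame);
        dlib::correlation_tracker tracker;
        // start following an (assumed) initial bounding box of the robot
        tracker.start_track(first, dlib::centered_rect(dlib::point(320, 240), 80, 60));

        while (cap.read(frame))
        {
            dlib::cv_image<dlib::bgr_pixel> img(frame);
            tracker.update(img);
            dlib::drectangle pos = tracker.get_position();   // current estimate
            cv::rectangle(frame,
                          cv::Rect((int)pos.left(), (int)pos.top(),
                                   (int)pos.width(), (int)pos.height()),
                          cv::Scalar(0, 255, 255), 2);
            cv::imshow("tracking", frame);
            if (cv::waitKey(1) == 27) break;
        }
        return 0;
    }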

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow the tracking of objects in 3D, relative to the environment. The UAV will have an estimation of the UGV's position even if it is out of the frame,
since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits, or simply detections.

During the development four different approaches were implemented. Each of them is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as new software. Instead, all were included in the same code, which decides at run-time which mode to execute, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.

Table 4.1: The available parameters

Name                 Valid values     Function
input                path to video    video as input for detection
svm                  path to SVMs     these SVMs will be used
mode                 [1, 2, 3, 4]     selects which mode is used
saveFrames           [0, 1]           turns on video frame export
saveDetections       [0, 1]           turns on detection box export
saveFPS              [0, 1]           turns on frame-rate measurement
displayVideo         [0, 1]           turns on video display
DetectionsFileName   string           sets the filename for saved detections
FramesFolderName     string           sets the folder name used for saving video frames
numberOfLoops        integer (> 0)    sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarizes all the implemented parameters with their possible values and a short description.

Note that all these parameters have default values, thus it is possible but not compulsory to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

    input testVideo1.avi
    list all the detectors you want to use
    svm groundrobotfront.svm groundrobotside.svm
    saveDetections 0
    saveFrames 0
    displayVideo 0
    numberOfLoops 100
    mode 2
    saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections)
nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See Figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detecting mode and the purpose of the execution (producing video, efficiency statistics or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.1), both 'slide' across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this extensive search is computationally very heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.
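
A minimal sketch of this mode with Dlib is given below: both serialized classifiers are loaded and slid over the frame, and every returned rectangle is kept as a detection. The image and classifier file names are illustrative assumptions.

    #include <dlib/image_processing.h>
    #include <dlib/image_io.h>
    #include <dlib/serialize.h>
    #include <iostream>

    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    int main()
    {
        object_detector<image_scanner_type> front, side;
        deserialize("groundrobotfront.svm") >> front;
        deserialize("groundrobotside.svm")  >> side;

        array2d<unsigned char> img;
        load_image(img, "frame.png");

        // every position and scale is checked by both classifiers
        std::vector<rectangle> hits = front(img);
        const std::vector<rectangle> sideHits = side(img);
        hits.insert(hits.end(), sideHits.begin(), sideHits.end());

        for (unsigned long i = 0; i < hits.size(); ++i)
            std::cout << "detection: " << hits[i] << std::endl;
        return 0;
    }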

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained. On Figure 4.2 it can be seen that one of them
represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but owing to the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
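
The selection logic can be sketched as follows. This is a simplified illustration only; the function, the variable names and the tolerance value are assumptions, not the project's actual code.

    #include <dlib/image_processing.h>
    #include <vector>

    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;
    typedef object_detector<image_scanner_type> fhog_detector;

    // One frame of the mode 2 logic: keep using the classifier that fired last,
    // and fall back to all classifiers after too many consecutive misses.
    std::vector<rectangle> detectMode2(const array2d<unsigned char>& frame,
                                       std::vector<fhog_detector>& detectors,
                                       int& activeDetector, int& missedFrames,
                                       const int toleranceLimit = 5)
    {
        if (activeDetector >= 0)
        {
            std::vector<rectangle> hits = detectors[activeDetector](frame);
            if (!hits.empty()) { missedFrames = 0; return hits; }
            if (++missedFrames <= toleranceLimit) return hits;   // tolerate a short gap
            activeDetector = -1;                                 // forget the remembered classifier
        }
        for (unsigned long i = 0; i < detectors.size(); ++i)     // try every classifier again
        {
            std::vector<rectangle> hits = detectors[i](frame);
            if (!hits.empty()) { activeDetector = (int)i; missedFrames = 0; return hits; }
        }
        return std::vector<rectangle>();
    }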

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it 'moving' across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection along with the detector which returned it.

Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

A new rectangle named ROI (region of interest) was introduced, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, owing to the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returned detections inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image. In this case mode 3 works exactly like mode 2.
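
The ROI update itself can be summarized in a few lines; the sketch below uses Dlib's rectangle helpers, with growth amounts chosen to roughly match the defaults quoted above (50% after a detection, 3% after a miss). The function and its exact arithmetic are illustrative, not the project code.

    #include <dlib/geometry.h>
    #include <vector>

    // Update the region of interest from the detections of the current frame.
    dlib::rectangle updateRoi(dlib::rectangle roi,
                              const std::vector<dlib::rectangle>& hits,
                              const dlib::rectangle& fullImage)
    {
        if (!hits.empty())
            // restart from the detection, enlarged by about 50% of its size
            roi = dlib::grow_rect(hits.front(), hits.front().width() / 4,
                                                hits.front().height() / 4);
        else
            // no detection: enlarge the previous ROI by about 3%
            roi = dlib::grow_rect(roi, roi.width() * 3 / 200, roi.height() * 3 / 200);

        return roi.intersect(fullImage);   // the ROI never grows beyond the frame
    }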

See Figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI). The classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.

Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another 'pre-filter', thanks to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, under appropriate conditions, it can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. In case the tracked region 'slides' off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See Figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3. The output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returned detections inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See Figure 4.7(a) for a representation of the processing method of mode 4.

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.

4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated task, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On Figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar. This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, secondly when it is tilted.

2. Visualisation of the laser ray. The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for the visualization.)

3. Pitch angle of the first orientation. While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. It is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation. This is exactly the same variable as described before, but recorded for the second set-up of the recording system.

5. Yaw angle of the laser ray. As mentioned before, the sensor operates with one single light source rotated around its vertical axis (capital Z). To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates together with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor. Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind spot of the sensor. As discussed in the previous point, the scanner can register distances over 270°. The remaining 90° is a blind spot, located exactly behind the sensor (135°, −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit. All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this plane to the 3D map's coordinate system, the orientation and position of the Lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record it, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation. To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source. As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and it is marked in blue on the figure.

11. Ground vehicle. The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.

On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the Lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for Lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file name is user defined.

Since both the Lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp had to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.

Figure 4.8: Schematic figure to represent the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.

Figure 4.9: Example representation of the output of the Lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].

However, since the two systems were not connected, the timestamps had to be synchronized manually.

On Figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar; thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel to the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from bottom to top. The Z axis is pointing up and is perpendicular to the ground. See Figure 4.12 for a presentation of the axes. The angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface;

• roll is the rotation around the y axis;

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On Figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons arose for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.

Figure 4.11: (a) picture of the laboratory and the mounting of the phone; (b) and (c) the produced 3D map of the scan from different viewpoints (elevation and full map). Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6 and 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.

Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the obtained point coordinates were in the lidar sensor's own coordinate system. As shown on Figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as the following

bull x axis is the vector product of y and z x = ytimes z pointing to the right

• y axis is the facing direction of the scanner, parallel to the ground, pointing forward

• z axis is pointing up and is perpendicular to the ground

Figure 4.12: Presentation of the axes used in the Android application

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three values represent its position along the x, y, z axes; the other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change; thus only pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

Where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (4.8, points 3 and 4).

The position of the lidar in the ground coordinate system did not change in theory, only its orientation. However, as can be seen on figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The vector is calculated as the sum of dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)


Where dy is the translation required along the y axis and dz is the translation required along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8 part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

[x, y, z]^T = distance · [sin(−yaw), cos(yaw) · sin(pitch), cos(yaw) · cos(pitch)]^T + offset · [0, sin(pitch), cos(pitch)]^T    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (Simultaneous Localization and Mapping) or the ground robot detection.
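For illustration, a minimal C++ sketch of equation 4.6 is given below. The names are illustrative and not taken from the project code; the function only shows how one lidar measurement (distance, yaw), the measured pitch angle and the fixed sensor offset could be mapped into the ground-fixed frame.

#include <cmath>

struct Point3D { double x, y, z; };

// Transforms one lidar measurement into the ground-fixed frame (equation 4.6).
// distance, yaw : polar coordinates of the point in the lidar's scanning plane
// pitch         : tilt angle of the scanning plane, taken from the IMU
// offset        : distance between the light source and the rotation axis
Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
{
  Point3D p;
  p.x = distance * std::sin(-yaw);                                              // (4.1)
  p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);  // (4.2) + (4.4)
  p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);  // (4.3) + (4.5)
  return p;
}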


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision-based detection system, to track the improvements of the code and to check if it meets the predefined requirements.

To understand what better means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection; note that this is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite it not being one. In other words, the system believes the object is present at that location, although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (rejected) although it should be. In this task a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if conditions are loosened in the detection method, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid the mistakes that will occur. Such a project can be a manually supervised classification where appearing false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset whose samples are already labelled (fully manually or using a semi-supervised classifier) or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.

Figure 5.1: The Vatic user interface

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2); their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation is only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot was built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, drawing a bounding box around them) on every frame of the videos (there are options to label an object occluded, obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only task remaining is to correct the interpolations between them, if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This will guarantee that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1 speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.

The measurements were carried out on a laptop PC (Lenovo z580, i5-3210m, 8GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every other unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test caused by the operating system did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time; variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813 s
Shortest processing time: 0.007 s
Average elapsed seconds: 0.032446
Variance of elapsed seconds per frame:
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS
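The analysis tool itself is a Matlab script; purely as an illustration of the statistics it reports, an equivalent C++ sketch is shown below. The assumed log format (one elapsed-seconds value per line, one file per execution) is a simplification and may differ from the actual logs.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <numeric>
#include <vector>

int main(int argc, char** argv)
{
  // Read one elapsed-seconds value per line from every log file given as argument.
  std::vector<double> t;
  for (int i = 1; i < argc; ++i) {
    std::ifstream in(argv[i]);
    double v;
    while (in >> v) t.push_back(v);
  }
  if (t.empty()) { std::cerr << "no samples\n"; return 1; }

  const double longest  = *std::max_element(t.begin(), t.end());
  const double shortest = *std::min_element(t.begin(), t.end());
  const double mean     = std::accumulate(t.begin(), t.end(), 0.0) / t.size();

  double var = 0.0;                       // variance of elapsed seconds per frame
  for (double v : t) var += (v - mean) * (v - mean);
  var /= t.size();

  std::cout << "frames processed: " << t.size() << "\n"
            << "longest:  " << longest  << " s\n"
            << "shortest: " << shortest << " s\n"
            << "average:  " << mean     << " s\n"
            << "variance: " << var      << "\n"
            << "average FPS: " << 1.0 / mean << "\n";
  return 0;
}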

The changes of the average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter time it takes to "slide" through it with the detector. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to gather the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they are still scanning in a plane and have this mentioned processing built-in; also, they are even more expensive than the 2D versions.


Figure 5.2: On this figure the recording setup is presented. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this, and makes indoor navigation easier. Therefore experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recordings from one point are not enough to create complete maps, as "shadows" and obscured parts still occur; note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
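As an illustration of the first, simpler idea, a height-based pre-filter over such a point-cloud could look like the sketch below. The point type, the handling of the floor level and the 5 cm margin are assumptions made only for this example.

#include <vector>

struct Point3D { double x, y, z; };   // ground-fixed coordinates, z pointing up

// Keeps only points that could belong to the ground robot: above the floor
// but below an assumed maximal robot height. Remember that the origin is at
// the lidar, so the floor itself has a negative z value (floorZ).
std::vector<Point3D> filterByHeight(const std::vector<Point3D>& cloud,
                                    double floorZ, double maxRobotHeight)
{
  std::vector<Point3D> kept;
  for (const Point3D& p : cloud) {
    const double heightAboveFloor = p.z - floorZ;
    if (heightAboveFloor > 0.05 && heightAboveFloor < maxRobotHeight)
      kept.push_back(p);               // the 5 cm margin drops the floor returns
  }
  return kept;
}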

5.3 Discussion of results

In general it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos can be found here¹ for all modes. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos


Figure 5.3: Example of the 3D images built: (a) example result of the 3D scan of the ground robot; (b) example of the 'shadow' of the ground robot. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be done by several values; here recall and precision will be used. Recall is defined by

Recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the number of detected positive samples to the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

Precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, the precision of the algorithm can be improved at the price of additional false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on position and the overlapping areas. For the latter a minimum of 50% was defined (the ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.
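The evaluation logic described above can be summarised by the following sketch. The rectangle type and the 50% overlap threshold follow the text; the function names and the per-frame matching strategy are illustrative and are not taken from the actual evaluation script.

#include <cstddef>
#include <algorithm>
#include <vector>

struct Rect { double x, y, w, h; };            // top-left corner, width, height

// Overlap measure used above: area of intersection over area of union.
double overlapRatio(const Rect& a, const Rect& b)
{
  const double ix = std::max(0.0, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
  const double iy = std::max(0.0, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
  const double inter = ix * iy;
  const double uni   = a.w * a.h + b.w * b.h - inter;
  return uni > 0.0 ? inter / uni : 0.0;
}

// Counts TP, FP and FN for one frame. A detection is a true positive if it
// overlaps a not-yet-matched annotation by at least 50%; annotations left
// unmatched become false negatives.
void evaluateFrame(const std::vector<Rect>& detections,
                   const std::vector<Rect>& annotations,
                   int& TP, int& FP, int& FN)
{
  std::vector<bool> matched(annotations.size(), false);
  for (const Rect& det : detections) {
    bool hit = false;
    for (std::size_t i = 0; i < annotations.size(); ++i) {
      if (!matched[i] && overlapRatio(det, annotations[i]) >= 0.5) {
        matched[i] = true;
        hit = true;
        break;
      }
    }
    if (hit) ++TP; else ++FP;
  }
  for (bool m : matched) if (!m) ++FN;
}

// After all frames: Recall = TP / (TP + FN), Precision = TP / (TP + FP).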

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first three modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 was proven to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 was proven to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall 95%) and speed (average frame-rate 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased; as mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for their application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and the challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or replace modules. The currently implemented and used ones were listed and explained; also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 concludes the chosen feature extraction (HOG) and classifier (SVM) methods and presents the implemented training software along with the produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features like exporting detections, demonstration videos and frame rate measurements were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace this with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well, if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.

Special attention was given to the evaluation of the system. Two pieces of software were developed as additional tools: one for evaluating the efficiency of the detections and another for analysing the processing time and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping, along with the applied mathematics. An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front) but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely the front) will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods were introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. The further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging will be carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field-of-view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during further experiments. For 2D images the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned, ready to deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr. Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks goes to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and that you stood by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to András Horváth, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview, senseFly." [Online]. Available: httpswwwsenseflycomdronesoverviewhtml [Accessed at 2015-08-08].

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: httparcaiaaorgdoiabs10251462007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08].

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: httpstoredjicomproductphantom-3-standard [Accessed at 2015-08-08].

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed at 2015-08-08].

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004, Proceedings, ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-based roe deer fawn detection system," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: httparcaiaaorgdoiabs10251462010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: httpwwwmdpicom2072-42924113390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: httplinkspringercom101007s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: httpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951&Agg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: httpscholargooglecomscholarhl=en&btnG=Search&q=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997. [Online]. Available: papers2publicationuuid9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: citeulike-article-id:3047126; httpdxdoiorg101109CVPR2005177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. C. Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10].

[70] "Pixhawk flight controller." [Online]. Available: httproverardupilotcomwikipixhawk-wiring-rover [Accessed at 2015-08-10].

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed at 2015-07-22].

[72] "OpenCV manual, Release 2.4.9." [Online]. Available: httpdocsopencvorgopencv2refmanpdf [Accessed at 2015-07-20].

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: httpjmlrcsailmitedupapersv10king09ahtml

[74] "dlib C++ Library." [Online]. Available: httpdlibnet [Accessed at 2015-07-21].

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: httparxivorgabs150200046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed at 2015-07-20].

[78] "Vatic - Video Annotation Tool - UC Irvine." [Online]. Available: httpwebmiteduvondrickvatic [Accessed at 2015-07-24].

[79] "Amazon Mechanical Turk." [Online]. Available: httpswwwmturkcommturkwelcome [Accessed at 2015-07-26].


• List of Figures
• Absztrakt
• Abstract
• List of Abbreviations
• 1 Introduction and project description
  • 1.1 Project description and requirements
  • 1.2 Type of vehicle
  • 1.3 Aims and objectives
• 2 Literature Review
  • 2.1 UAVs and applications
    • 2.1.1 Fixed-wing UAVs
    • 2.1.2 Rotary-wing UAVs
    • 2.1.3 Applications
  • 2.2 Object detection on conventional 2D images
    • 2.2.1 Classical detection methods
      • 2.2.1.1 Background subtraction
      • 2.2.1.2 Template matching algorithms
    • 2.2.2 Feature descriptors, classifiers and learning methods
      • 2.2.2.1 SIFT features
      • 2.2.2.2 Haar-like features
      • 2.2.2.3 HOG features
      • 2.2.2.4 Learning models in computer vision
      • 2.2.2.5 AdaBoost
      • 2.2.2.6 Support Vector Machine
• 3 Development
  • 3.1 Hardware resources
    • 3.1.1 Nitrogen board
    • 3.1.2 Sensors
      • 3.1.2.1 Pixhawk autopilot
      • 3.1.2.2 Camera
      • 3.1.2.3 LiDar
  • 3.2 Chosen software
    • 3.2.1 Matlab
    • 3.2.2 Robotic Operating System (ROS)
    • 3.2.3 OpenCV
    • 3.2.4 Dlib
• 4 Designing and implementing the algorithm
  • 4.1 Challenges in the task
  • 4.2 Architecture of the detection system
  • 4.3 2D image processing methods
    • 4.3.1 Chosen methods and the training algorithm
    • 4.3.2 Sliding window method
    • 4.3.3 Pre-filtering
    • 4.3.4 Tracking
    • 4.3.5 Implemented detector
      • 4.3.5.1 Mode 1: Sliding window with all the classifiers
      • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      • 4.3.5.4 Mode 4: Tracking based approach
  • 4.4 3D image processing methods
    • 4.4.1 3D recording method
    • 4.4.2 Android based recording set-up
    • 4.4.3 Final set-up with Pixhawk flight controller
    • 4.4.4 3D reconstruction
• 5 Results
  • 5.1 2D image detection results
    • 5.1.1 Evaluation
      • 5.1.1.1 Definition of true positive and negative
      • 5.1.1.2 Definition of false positive and negative
      • 5.1.1.3 Reducing number of errors
      • 5.1.1.4 Annotation and database building
    • 5.1.2 Frame-rate measurement and analysis
  • 5.2 3D image detection results
  • 5.3 Discussion of results
• 6 Conclusion and recommended future work
  • 6.1 Conclusion
  • 6.2 Recommended future work
• References

List of Figures

1.1  Image of the ground robot   3

2.1  Fixed wing consumer drone   6
2.2  Example for consumer drones   7
2.3  Example for people detection with background subtraction   11
2.4  Example of template matching   12
2.5  2 example Haar-like features   16
2.6  Illustration of the discriminative and generative models   17
2.7  Example of a separable problem   20

3.1  Image of Pixhawk flight controller   22
3.2  The chosen LIDAR sensor Hokuyo UTM-30LX   23
3.3  Elements of Dlib's machine learning toolkit   25

4.1  A diagram of the designed architecture   28
4.2  Visualization of the trained HOG detectors   35
4.3  Representation of the sliding window method   36
4.4  Example image for the result of edge detection   38
4.5  Example of the detector's user interface   41
4.6  Presentation of mode 3   43
4.7  Mode 4 example output frames   45
4.10 Screenshot of the Android application for Lidar recordings   48
4.8  Schematic figure to represent the 3D recording set-up   49
4.9  Example representation of the output of the Lidar Sensor   50
4.11 Picture about the laboratory and the recording set-up   52
4.12 Presentation of the axes used in the android application   53

5.1  Figure of the Vatic user interface   58
5.2  Photo of the recording process   61
5.3  Example of the 3D images built   62


List of Abbreviations

SATM   School of Aerospace Technology and Manufacturing
UAV    Unmanned Aerial Vehicle
UAS    Unmanned Aerial System
UA     Unmanned Aircraft
UGV    Unmanned Ground Vehicle
HOG    Histogram of Oriented Gradients
RC     Radio Controlled
ROS    Robotic Operating System
IMU    Inertial Measurement Unit
DoF    Degree of Freedom
SLAM   Simultaneous Localization And Mapping
ROI    Region Of Interest
Vatic  Video Annotation Tool from Irvine, California


Absztrakt

Ezen dolgozatban bemutatásra kerül egy pilóta nélküli repülő járműre szerelt követő rendszer, ami beltéri objektumokat hivatott detektálni, és legfőbb célja a földi egység megtalálása, amely leszálló és újratöltő állomásként fog szolgálni.

Először a projekt, illetve a tézis céljai lesznek felsorolva és részletezve. Ezt követi egy részletes irodalomkutatás, amely bemutatja a hasonló kihívásokra létező megoldásokat. Szerepel egy rövid összefoglalás a pilóta nélküli járművekről és alkalmazási területeikről, majd a legismertebb objektumdetektáló módszerek kerülnek bemutatásra. A kritika tárgyalja előnyeiket és hátrányaikat, különös tekintettel a jelenlegi projektben való alkalmazhatóságukra.

A következő rész a fejlesztési körülményekről szól, beleértve a rendelkezésre álló szoftvereket és hardvereket.

A feladat kihívásai bemutatása után egy moduláris architektúra terve kerül bemutatásra, figyelembe véve a célokat, erőforrásokat és a felmerülő problémákat.

Ezen architektúra egyik legfontosabb modulja, a detektáló algoritmus legfrissebb változata részletezve is szerepel a következő fejezetben, képességeivel, módjaival és felhasználói felületével együtt.

A modul hatékonyságának mérésére létrejött egy kiértékelő környezet, mely képes számos metrikát kiszámolni a detekcióval kapcsolatban. Mind a környezet, mind a metrikák részletezve lesznek a következő fejezetben, melyet a legfrissebb algoritmus által elért eredmények követnek.

Bár ez a dolgozat főként a hagyományos (2D) képeken operáló detekciós módszerekre koncentrál, 3D képalkotási és feldolgozó módszerek szintén megfontolásra kerültek. Elkészült egy kísérleti rendszer, amely képes látványos és pontos 3D térképek létrehozására egy 2D lézer szkenner használatával. Számos felvétel készült a megoldás kipróbálására, amelyek a rendszerrel együtt bemutatásra kerülnek.

Végül az implementált módszerek és az eredmények összefoglalója zárja a dolgozatot.


Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging of the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed with respect to their suitability for the discussed project.

Then the environment of the development will be described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and the challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluation tools are introduced to analyse efficiency and speed.

The first version of the ground robot detecting algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV and are presented in this paper as well.

Finally, all implemented methods and relevant results are concluded.


Chapter 1

Introduction and projectdescription

In this chapter an introduction is given to the whole project which this thesis ispart of Afterwards a structure of the sub-tasks in the project is presented alongwith the recognized challenges Then the aims and objectives of this thesis arelisted

11 Project description and requirementsThe projectrsquos main aim is to build a complete system based on one or moreunmanned autonomous vehicle which are able to carry out an indoor 3D mappingof a building (possible outdoor operations are kept in mind as well) A maplike this would be very useful for many applications such as surveillance rescuemissions architecture renovation etc Furthermore if a 3D model exists otherautonomous vehicles can navigate through the building with ease Later on inthis project after the map building is ready a thermal camera is planned to beattached to the vehicle to seek find and locate heat leaks as a demonstration ofthe system A system like this should be

1 fast Building a 3D map requires a lot of measurements and processingpower not to mention the essential functions stabilization navigationroute planning collision avoidance However to build a useful tool allthese functions should be executed simultaneously mostly on-board nearlyreal-time Furthermore the mapping process itself should be finished inreasonable time (depending on the size of the building)

2. accurate: Reasonably accurate recording is required so that the map is suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision to coordinate the process (for example when mapping a dangerous area which would be too far for real-time control). Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. The building may have more than one floor, loops (the same location reached via another route) or open areas, which are all significant challenges for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-storey buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means more batteries and therefore longer operation time. On the other hand, it cannot elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles blocking its way or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as the solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable to the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project, several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution to this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of previous research and articles related to the topic of this thesis. First, a short introduction to unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief overview of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes: the smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, unmanned aerial vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio-controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of their much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have quite an extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using little or no power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurance using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long-endurance, high-altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly like the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of DJI's popular Phantom series. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is why no large-scale multirotors are used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shortens operational time and increases the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing (especially multirotor) UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be compensated with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German V-1 rocket ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available both as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. Good examples of this application field are the location of an earthquake or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have proven useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low-cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs; [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above focus rather on topographical, larger-scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles aiming to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as follows:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information for the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch as in the classification problem), determine the locations and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources along with the extensive research of artificial intelligence, methods not using machine learning are becoming neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity often makes them attractive for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the basis of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.
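
To illustrate the basic idea, a minimal sketch of foreground segmentation follows, using OpenCV's MOG2 background subtractor (assuming the OpenCV 3 API; the video file name is a placeholder, and this is not code from the project):

#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("traffic.avi");          // hypothetical input recording
    cv::Ptr<cv::BackgroundSubtractorMOG2> bg = cv::createBackgroundSubtractorMOG2();

    cv::Mat frame, fgMask;
    while (cap.read(frame))
    {
        bg->apply(frame, fgMask);                 // update the model, get the foreground mask
        cv::erode(fgMask, fgMask, cv::Mat());     // suppress small noise blobs
        cv::dilate(fgMask, fgMask, cv::Mat());
        cv::imshow("foreground", fgMask);
        if (cv::waitKey(30) == 27) break;         // stop on ESC
    }
    return 0;
}

Connected components of the resulting mask would then be the moving-object candidates mentioned above.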

It has to be mentioned that not every moving (changing) patch is an object: for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in a change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern in the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them and looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means a higher value). Another interesting point of the picture is the high return values around the calendar's dark headlines (on the wall, left), which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that this function (the comparison method) will return the highest value at the correct position [43].

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts have been published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it becomes easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.
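
As a point of reference, the normalized cross-correlation variant mentioned above is available directly in OpenCV; a minimal sketch (not project code, file names are placeholders, OpenCV 3 naming assumed):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat image = cv::imread("scene.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat templ = cv::imread("handle.png", cv::IMREAD_GRAYSCALE);

    cv::Mat response;
    cv::matchTemplate(image, templ, response, cv::TM_CCOEFF_NORMED);

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);

    // For TM_CCOEFF_NORMED the best match is at the maximum of the response map.
    cv::rectangle(image, maxLoc,
                  maxLoc + cv::Point(templ.cols, templ.rows), cv::Scalar(255), 2);
    cv::imwrite("match.png", image);
    return 0;
}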

Other disadvantages of template matching are its poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem to be two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, for deciding whether the extracted feature array (produced by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well. [48] recognizes edge-based and patch-based features; [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: Descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: As patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower-level features presented above. However, they do not try to apply higher-level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are BOF (Bag of Features, or Bag of Words [52]) and HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition and the amount of relevant literature, examples and implementations. All of them are well known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see subsection 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: The first stage searches over the image at multiple scales using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: At each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: After selecting the interesting points, the gradients around them are calculated at a selected scale.

The biggest advantage of the descriptor is its scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is achieved by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility) but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
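
For illustration, a minimal sketch of SIFT extraction and matching with OpenCV is given below. It assumes an OpenCV 3 build that includes the non-free xfeatures2d module, and the file names are placeholders; it is not code from this project:

#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

int main()
{
    cv::Mat object = cv::imread("robot_train.png", cv::IMREAD_GRAYSCALE);
    cv::Mat scene  = cv::imread("test_frame.png",  cv::IMREAD_GRAYSCALE);

    cv::Ptr<cv::xfeatures2d::SIFT> sift = cv::xfeatures2d::SIFT::create();

    std::vector<cv::KeyPoint> kpObj, kpScene;
    cv::Mat descObj, descScene;
    sift->detectAndCompute(object, cv::noArray(), kpObj, descObj);
    sift->detectAndCompute(scene,  cv::noArray(), kpScene, descScene);

    // Brute-force matching with L2 distance; cross-checking removes weak matches.
    cv::BFMatcher matcher(cv::NORM_L2, true);
    std::vector<cv::DMatch> matches;
    matcher.match(descObj, descScene, matches);

    // A real detector would now estimate a homography (e.g. with RANSAC)
    // from the matched key-points to locate the object in the scene.
    return 0;
}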

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of the pixel intensities above and to the left of it" [45] at any given image location. The integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values at the corners of the rectangles.
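
The four-corner computation can be sketched as follows (an illustrative example, not project code; the rectangles and the image name are arbitrary):

#include <opencv2/opencv.hpp>

// Sum of pixel intensities inside rectangle r, read from the integral image ii
// (CV_32S, one row and one column larger than the source, as produced by cv::integral).
static int rectSum(const cv::Mat& ii, const cv::Rect& r)
{
    int A = ii.at<int>(r.y,            r.x);             // top-left corner
    int B = ii.at<int>(r.y,            r.x + r.width);   // top-right
    int C = ii.at<int>(r.y + r.height, r.x);             // bottom-left
    int D = ii.at<int>(r.y + r.height, r.x + r.width);   // bottom-right
    return D - B - C + A;
}

int main()
{
    cv::Mat gray = cv::imread("face.png", cv::IMREAD_GRAYSCALE);
    cv::Mat ii;
    cv::integral(gray, ii);                               // build the integral image once

    // A two-rectangle (vertical) Haar-like feature: left half minus right half.
    cv::Rect left(10, 10, 12, 24), right(22, 10, 12, 24);
    int feature = rectSum(ii, left) - rectSum(ii, right);
    (void)feature;                                        // would be fed to a classifier
    return 0;
}

Each feature therefore costs only a handful of array lookups, regardless of the rectangle size, which is the source of the speed advantage described above.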

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the basis of many implemented face detection methods (in consumer cameras, for example) and has inspired several researchers. [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves. The bottom row presents the features overlaid on a typical face. The first feature corresponds to the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are used widely in computer vision for various purposes: vehicle detection ([55]), hand gesture recognition ([56]) and pedestrian detection ([57]), just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is rotation variance, which in spite of several attempts (e.g. [54]) is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization of the orientations (into so-called bins), produces a histogram of the different orientations over small portions of the image (called cells). After a location-based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
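
A minimal sketch of extracting such a feature vector with OpenCV's HOGDescriptor is given below; the window, block, cell and bin settings are the class defaults used for pedestrian detection, not the parameters chosen in this project, and the image name is a placeholder:

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::Mat img = cv::imread("patch.png", cv::IMREAD_GRAYSCALE);
    cv::resize(img, img, cv::Size(64, 128));   // HOGDescriptor's default window size

    cv::HOGDescriptor hog;                     // default: 16x16 blocks, 8x8 cells, 9 bins
    std::vector<float> descriptor;
    hog.compute(img, descriptor);              // concatenated, block-normalized histograms

    // 'descriptor' would now be fed to a classifier such as a linear SVM.
    return 0;
}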

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well. [32, 48] find two main philosophies: generative and discriminative models.

Let us denote the test data (images) by x_i, the corresponding labels by c_i, and the feature descriptors (extracted from the data x_i) by θ_i, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | (θ, c)) P(c)

is ideally 1 if x contains an instance of class c, and 0 if not.

The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | (θ, x))

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.
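
The two views are linked by Bayes' rule: a generative model can, in principle, be converted into a posterior over the classes (this identity is standard probability theory rather than a result of the cited works):

\[
P(c \mid \theta, x) \;=\; \frac{P(x \mid \theta, c)\, P(c)}{\sum_{c'} P(x \mid \theta, c')\, P(c')}.
\]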

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to be detected, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing computational expense caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weak classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights w_{t,i}, where N is the number of training examples and T is the number of features in the stage.

• For t = 1, ..., T:

1. Normalize the weights.

2. Select the best classifier using only a single feature, by minimising the detection error ε_t = Σ_i w_i |h(x_i, f, p, θ) − y_i|, where h(x_i, f, p, θ) is the classifier output and y_i is the correct label (both with a range of 0 for negative and 1 for positive).

3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

4. Update the weights: w_{t+1,i} = w_{t,i} · (ε_t / (1 − ε_t))^(1−e_i), where e_i = 0 if example x_i is classified correctly and e_i = 1 otherwise.

• The final classifier for the stage is based on the weighted sum of the weak classifiers (see the formula below).
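
For reference, the strong classifier of a stage can be written in the standard Viola-Jones form (this formula is reproduced from common descriptions of [45], not from the thesis text above):

\[
C(x) = \begin{cases}
1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \tfrac{1}{2} \sum_{t=1}^{T} \alpha_t \\
0 & \text{otherwise,}
\end{cases}
\qquad \alpha_t = \log \frac{1 - \varepsilon_t}{\varepsilon_t}.
\]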

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the elements of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, to separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, to define the largest possible margin). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and therefore determining the width of the margin) are called support vectors (hence the name of the method). See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.
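
The two constraints above correspond to the textbook hard-margin formulation (given here as a standard reference form, not quoted from [64]): for training vectors x_i with labels y_i ∈ {−1, +1},

\[
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N,
\]

where the width of the margin is 2/‖w‖, so minimising ‖w‖ maximises the margin.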

The algorithm was extended to tasks that are not linearly separable in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in Section 1.3. First, the available hardware resources (that is, the processing unit and sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarizes the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Also, the Pixhawk will be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multirotors during flight. Therefore it was suitable to use for mapping purposes. See 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of this project as well, resulting from their light weight and wide-angle field of view.


Figure 3.2: The chosen lidar sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. The basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size and weight (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping to develop unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate an image), image processing tools (e.g. different kinds of edge detection, threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, plenty of examples and support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.

3.2.4 Dlib

Dlib is a general purpose cross-platform C++ library. It has an excellent machine learning component, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

It is worth mentioning that aside from the features above, Dlib also contains components to handle linear algebra, threading, network IO, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter, challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready-made detectors. Thus the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are significantly limited. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera; see sub-subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various sizes of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object in the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm specially designed for the task. Thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


[Figure 4.1 block diagram; modules shown: Sensors (camera, 2D lidar), Video reader, Current frame, Regions of Interest, Other preprocessing (edge/colour detection), 3D map, Trainer, Support Vector Machines (front SVM, side SVM), Detector algorithm, Tracking, Detections, Vatic annotation server, Evaluation]

Figure 4.1: A diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow.


4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a complete modular architecture was designed which is suitable for the defined objectives. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither of these is the case in this project.)

As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. This production has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself; see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus, methods which could reduce the number of areas to process and thereby increase the speed of the detection were required.

Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle were located in this map, accurate estimations could be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time; theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned (a simple sketch of this estimate is given after this list). As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully so that corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.
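
As referenced in point 3, a minimal sketch of the position-bound region of interest estimate follows. The function names, the simple pinhole camera model and the parameter values are illustrative assumptions only, not project code:

#include <algorithm>
#include <cmath>

struct Roi { int x, y, width, height; };

// lastX, lastY, lastZ: last known UGV position in the camera frame [m]
// vMax: assumed maximal UGV speed [m/s], dt: time since the last detection [s]
// fx, fy: focal lengths [px], cx, cy: principal point [px]
Roi estimateRoi(double lastX, double lastY, double lastZ,
                double vMax, double dt,
                double fx, double fy, double cx, double cy,
                int imgW, int imgH)
{
    const double r  = vMax * dt;                 // radius the robot may have covered
    const double u  = fx * lastX / lastZ + cx;   // projected centre of the search area
    const double v  = fy * lastY / lastZ + cy;
    const double ru = fx * r / lastZ;            // radius in pixels (horizontal)
    const double rv = fy * r / lastZ;            // radius in pixels (vertical)

    Roi roi;
    roi.x      = std::max(0, (int)std::floor(u - ru));
    roi.y      = std::max(0, (int)std::floor(v - rv));
    roi.width  = std::min(imgW - roi.x, (int)std::ceil(2.0 * ru));
    roi.height = std::min(imgH - roi.y, (int)std::ceil(2.0 * rv));
    return roi;                                  // the detector only scans this window
}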

4.3 2D image processing methods

In this section the 2D camera image processing concerns, the chosen methods and the process of the development will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the image processing libraries and tool-kits used were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms already implemented can save a lot of time during development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable to recognise


a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also because of moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to move between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see 4). This means that even if the camera is levelled, the object itself could appear rotated because of the perspective. To overcome this issue, solutions will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, highly optimized HOG feature extractor along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and pieces of documentation are available for both training and detection, which simplified the implementation.

As a classifier, Support Vector Machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection


problems (like this one) they outperform the other popular methods. Also, they are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors, for example for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since otherwise a positive sample might get into the negative training images.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
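To illustrate how such a training tool can be put together with Dlib, a minimal sketch is given below, assuming the library's standard scan_fhog_pyramid interface; the file names, the detection window size and the C parameter are only placeholders and do not reproduce the exact settings of the project.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>
#include <iostream>

int main()
{
    using namespace dlib;
    // HOG sliding-window scanner over an image pyramid
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;

    // Load images and hand-annotated boxes from the xml produced by the annotation step
    dlib::array<array2d<unsigned char>> images;
    std::vector<std::vector<rectangle>> boxes;
    load_image_dataset(images, boxes, "training.xml");   // file name is an assumption

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);            // placeholder window size

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1.0);                                   // SVM regularization, tuned in practice
    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Quick self-test on an annotated set, then serialize the finished classifier
    std::cout << test_object_detection_function(detector, images, boxes) << std::endl;
    serialize("groundrobot_side.svm") << detector;
    return 0;
}

A classifier saved this way can later be restored in the detector software with Dlib's deserialize() call.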

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and the right, while the front and rear views are also very similar. This is especially true if the working principle of HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important in the future of


the project. Due to the way the SVMs are trained, they not only can detect the position of the robot, but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found in pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even when viewed from above or sideways. Although training an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector. Notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize thanks to their strong edges. In 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that their training datasets (the images of the robot or any other object) are usually cropped, containing only the object and some margin around it. As a result, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the


(a) side-view training image example

(b) side-view HOG detector

(c) front-view training image example

(d) front-view HOG detector

Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

detector, cropped and resized from the original large input. This can be achieved in several ways, although the most common is the so-called

sliding window method. This algorithm takes the large image and slides a window across it with a predefined step-size and at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
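A minimal sketch of this scan, written with OpenCV types, is shown below; the window size, step size and scale factor are illustrative values only, and in the implemented detector the scan itself is performed internally by Dlib's image pyramid scanner.

#include <opencv2/opencv.hpp>
#include <vector>

// Illustrative sliding window scan: returns every candidate window that a
// classifier would be asked to evaluate.
std::vector<cv::Rect> slidingWindows(const cv::Mat& frame,
                                     cv::Size window = cv::Size(80, 80),
                                     int step = 16, double scaleFactor = 1.25)
{
    std::vector<cv::Rect> candidates;
    for (double scale = 1.0; ; scale *= scaleFactor)
    {
        cv::Size scaled(static_cast<int>(frame.cols / scale),
                        static_cast<int>(frame.rows / scale));
        if (scaled.width < window.width || scaled.height < window.height)
            break;                                  // image smaller than the window: stop
        cv::Mat resized;
        cv::resize(frame, resized, scaled);
        for (int y = 0; y + window.height <= resized.rows; y += step)
            for (int x = 0; x + window.width <= resized.cols; x += step)
                // map the window back to the original image coordinates
                candidates.emplace_back(static_cast<int>(x * scale),
                                        static_cast<int>(y * scale),
                                        static_cast<int>(window.width * scale),
                                        static_cast<int>(window.height * scale));
    }
    return candidates;
}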

It is worth mentioning that multiple instances of the sought object may be present in the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs in the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors can be executed in a shorter time than scanning the whole input image. The algorithms had to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Intuitively, a good separating feature is colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis for segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges therefore often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus these areas are still scanned.
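A possible edge-density pre-filter of this kind is sketched below with OpenCV; the Canny thresholds, the kernel size and the minimum area are illustrative values and not the parameters used in the project.

#include <opencv2/opencv.hpp>
#include <vector>

// Return coarse regions of interest: areas where the edge density is high
// enough that a vehicle-like object could plausibly be present.
std::vector<cv::Rect> edgeBasedROIs(const cv::Mat& frameBGR, double minArea = 2000.0)
{
    cv::Mat gray, edges;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 80, 160);                       // illustrative thresholds

    // Close gaps between nearby edges so that one object forms one blob
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(15, 15));
    cv::dilate(edges, edges, kernel);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> rois;
    for (const auto& c : contours)
    {
        cv::Rect box = cv::boundingRect(c);
        if (box.area() >= minArea)                         // ignore tiny edge clusters
            rois.push_back(box);
    }
    return rois;
}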

Another idea is to filter the image by the detections on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means following a (moving) object over time using the camera image input. This is usually done by storing information about the object (like appearance, previous location, speed and trajectory) and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was


Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

presented. Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model in the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
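Dlib exposes this tracker through a very small interface; the snippet below is a minimal sketch of how a detection box can initialize the tracker and how it is then updated frame by frame (the video file name and the initial box are placeholders).

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("testVideo1.avi");      // placeholder input
    cv::Mat frame;
    dlib::correlation_tracker tracker;
    bool tracking = false;

    while (cap.read(frame))
    {
        dlib::cv_image<dlib::bgr_pixel> dlibFrame(frame);
        if (!tracking)
        {
            // In the real detector this box comes from the HOG+SVM classifier
            dlib::rectangle detection(100, 100, 260, 220);   // placeholder box
            tracker.start_track(dlibFrame, detection);
            tracking = true;
        }
        else
        {
            tracker.update(dlibFrame);                      // follow the object
            dlib::drectangle pos = tracker.get_position();  // current estimate
            cv::rectangle(frame, cv::Rect(pos.left(), pos.top(),
                                          pos.width(), pos.height()),
                          cv::Scalar(0, 255, 255), 2);
        }
        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;                    // Esc to quit
    }
    return 0;
}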

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame,


since its position will be mapped into the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits, or simply detections.

During the development four different approaches were implemented. Each of them builds on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as separate programs. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from switching between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 4.1: Table of the available parameters

Name                 Valid values       Function
input                path to video      video as input for detection
svm                  path to SVMs       these SVMs will be used
mode                 [1, 2, 3, 4]       selects which mode is used
saveFrames           [0, 1]             turns on video frame export
saveDetections       [0, 1]             turns on detection box export
saveFPS              [0, 1]             turns on frame-rate measurement
displayVideo         [0, 1]             turns on video display
DetectionsFileName   string             sets the filename for saved detections
FramesFolderName     string             sets the folder name used for saving video frames
numberOfLoops        integer (>0)       sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarizes all the implemented parameters with their possible values and a short description.

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm
input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections)


nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.4), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as input). On the other hand, this exhaustive search is very computationally heavy, especially with two classifiers. This results in the lowest frame-per-second rate of all the methods.
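With Dlib's object detectors this exhaustive mode essentially reduces to running every loaded classifier on every frame, roughly as in the sketch below; the detector file names are placeholders, and the real software adds the saving and display features described earlier.

#include <dlib/image_processing.h>
#include <dlib/image_processing/scan_fhog_pyramid.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> scanner_t;

int main()
{
    // Load all classifiers listed in the parameter file (names are placeholders)
    std::vector<dlib::object_detector<scanner_t>> detectors(2);
    dlib::deserialize("groundrobotfront.svm") >> detectors[0];
    dlib::deserialize("groundrobotside.svm")  >> detectors[1];

    cv::VideoCapture cap("testVideo1.avi");
    cv::Mat frame;
    while (cap.read(frame))
    {
        dlib::cv_image<dlib::bgr_pixel> img(frame);
        std::vector<dlib::rectangle> hits;
        for (auto& det : detectors)                  // Mode 1: every classifier, every frame
        {
            std::vector<dlib::rectangle> d = det(img);
            hits.insert(hits.end(), d.begin(), d.end());
        }
        // ... save/display the hits as configured in the parameter file ...
    }
    return 0;
}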

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained. In figure 4.2 it can be seen that one of them


represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at the same time, since the robot is viewed either from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but because of the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
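The classifier-memory logic of mode 2 can be summarized in a few lines; the sketch below is a simplified illustration, and the tolerance value and the function name are placeholders rather than the project's actual identifiers.

#include <dlib/image_processing.h>
#include <dlib/image_processing/scan_fhog_pyramid.h>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> scanner_t;
typedef dlib::object_detector<scanner_t> detector_t;

// Mode 2: remember which classifier fired last and keep using only that one
// until it has failed for 'tolerance' consecutive frames.
template <typename image_type>
std::vector<dlib::rectangle> detectMode2(const image_type& img,
                                         std::vector<detector_t>& detectors,
                                         int& lastDetector, int& missedFrames,
                                         int tolerance = 5)
{
    std::vector<dlib::rectangle> hits;
    if (lastDetector >= 0 && missedFrames <= tolerance)
    {
        hits = detectors[lastDetector](img);          // only the remembered classifier
        if (!hits.empty()) { missedFrames = 0; return hits; }
        ++missedFrames;
        if (missedFrames <= tolerance) return hits;   // tolerate a few misses
    }
    for (size_t i = 0; i < detectors.size(); ++i)     // fall back to all classifiers
    {
        hits = detectors[i](img);
        if (!hits.empty()) { lastDetector = static_cast<int>(i); missedFrames = 0; break; }
    }
    return hits;
}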

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the image area searched was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. The possible movement of the camera is involved as well: for example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the


Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

position of the detection alongside the detector which returned it. A new rectangle named ROI (region of interest) was introduced, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, because of the reasons mentioned above (movement of the camera and of the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returns a detection inside the ROI, the region of interest is enlarged again by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missed detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI). The classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.


Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", thanks to the modular architecture introduced in 4.2.
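The ROI bookkeeping of mode 3 can be illustrated with the small helper below, written with OpenCV rectangles; the 50% and 3% growth factors follow the defaults mentioned above, while the function names are chosen here only for illustration.

#include <opencv2/opencv.hpp>

// Grow a rectangle around its centre by the given fraction of its own size,
// clamped to the image borders.
static cv::Rect grow(const cv::Rect& r, double fraction, const cv::Size& imageSize)
{
    int dx = static_cast<int>(r.width  * fraction / 2.0);
    int dy = static_cast<int>(r.height * fraction / 2.0);
    cv::Rect enlarged(r.x - dx, r.y - dy, r.width + 2 * dx, r.height + 2 * dy);
    return enlarged & cv::Rect(cv::Point(0, 0), imageSize);   // clamp to the frame
}

// Mode 3 ROI update: after a hit the ROI is the detection grown by 50%,
// after a miss the current ROI keeps growing by 3% until it fills the frame.
cv::Rect updateROI(const cv::Rect& currentROI, const cv::Rect* detection,
                   const cv::Size& imageSize)
{
    if (detection != nullptr)
        return grow(*detection, 0.50, imageSize);
    return grow(currentROI, 0.03, imageSize);
}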

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or a part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors, and in appropriate conditions it can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. If the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides only (note that the detectors are trained for the sides, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues about its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See figure 4.7(a) for a representation of the processing method of mode 4.
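Put together, the per-frame logic of mode 4 can be outlined as in the sketch below; this is a simplified illustration under the assumptions of this subsection (the validate() callback stands for the mode 3 style classifier check inside the ROI), not the project's exact implementation.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>
#include <functional>

// validate() runs the selected classifier(s) inside the (enlarged) ROI and
// returns the best detection, or an empty rectangle if nothing was found.
void processFrameMode4(const cv::Mat& frame,
                       dlib::correlation_tracker& tracker,
                       bool& objectLost, int& failedValidations, int tolerance,
                       cv::Rect& roi,
                       const std::function<dlib::rectangle(const cv::Mat&, const cv::Rect&)>& validate)
{
    dlib::cv_image<dlib::bgr_pixel> img(frame);
    if (!objectLost)
    {
        tracker.update(img);                              // follow the robot between validations
        dlib::drectangle p = tracker.get_position();
        roi = cv::Rect(p.left(), p.top(), p.width(), p.height());
        // in the real detector the roi is enlarged here, as in mode 3
    }
    dlib::rectangle det = validate(frame, roi);           // periodic check by the classifiers
    if (!det.is_empty())
    {
        // reinitialize on a somewhat bigger patch, which the tracker follows better
        tracker.start_track(img, dlib::grow_rect(det, det.width() / 4));
        objectLost = false;
        failedValidations = 0;
    }
    else if (!objectLost && ++failedValidations > tolerance)
    {
        objectLost = true;                                // from here on behave like mode 3
    }
}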


(a) Example output frame of mode 4

(b) Typical error of the tracker

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area, and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

In figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its attachments. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). In part a it is shown in two different orientations: first when the recording plane is completely flat, and secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis in the figure. (Note: the sensor uses a class 1 laser which is invisible and harmless to the human eye; red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. It is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space, defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (135° to −135°). This area is marked with dark grey in the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle in the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured, and it is marked in blue in the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.


In parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user-defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification. The file-name is user defined.

Since the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurement. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved


Figure 4.8: Schematic figure to represent the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its attachments. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


to the log file as well. However, since the two systems were not connected, the timestamps had to be synchronized manually.

In figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel with the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The Z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. The angles were calculated in the following way:

• pitch is the rotation around the x axis; it is 0 if the phone is lying on a horizontal surface,

• roll is the rotation around the y axis,

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. In figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons arose for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.
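In a ROS based set-up of this kind the two streams can be received and time-aligned in a single node, for example with an approximate-time synchronizer as sketched below; the topic names (/scan and /mavros/imu/data) are assumptions and may differ from the actual configuration used in the project.

#include <ros/ros.h>
#include <sensor_msgs/LaserScan.h>
#include <sensor_msgs/Imu.h>
#include <message_filters/subscriber.h>
#include <message_filters/synchronizer.h>
#include <message_filters/sync_policies/approximate_time.h>
#include <boost/bind.hpp>

// Called with a laser scan and the closest IMU sample in time, so every scan
// can later be transformed with the orientation that was valid when it was taken.
void callback(const sensor_msgs::LaserScanConstPtr& scan,
              const sensor_msgs::ImuConstPtr& imu)
{
    ROS_INFO("scan stamp %.3f, imu stamp %.3f",
             scan->header.stamp.toSec(), imu->header.stamp.toSec());
    // ... store the pair for the 3D reconstruction ...
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "lidar_imu_recorder");
    ros::NodeHandle nh;

    message_filters::Subscriber<sensor_msgs::LaserScan> scanSub(nh, "/scan", 10);
    message_filters::Subscriber<sensor_msgs::Imu> imuSub(nh, "/mavros/imu/data", 100);

    typedef message_filters::sync_policies::ApproximateTime<
        sensor_msgs::LaserScan, sensor_msgs::Imu> SyncPolicy;
    message_filters::Synchronizer<SyncPolicy> sync(SyncPolicy(50), scanSub, imuSub);
    sync.registerCallback(boost::bind(&callback, _1, _2));

    ros::spin();
    return 0;
}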


(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6 and 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown in figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• x axis is the vector product of y and z (x = y × z), pointing to the right,

• y axis is the facing direction of the scanner, parallel to the ground, pointing

Figure 4.12: Presentation of the axes used in the Android application.


forward

• z axis is pointing up and is perpendicular to the ground.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three represent its position along the x, y, z axes. The other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen in figure 4.8, the roll and yaw angles are not able to change, thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

    x = distance · sin(−yaw)    (4.1)

    y = distance · cos(yaw) · sin(pitch)    (4.2)

    z = distance · cos(yaw) · cos(pitch)    (4.3)

where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory, the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen in figure 4.8, part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated from dy and dz:

    dy = offset · sin(pitch)    (4.4)

    dz = offset · cos(pitch)    (4.5)


where dy is the translation required along the y axis and dz is the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented in figure 4.8, part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get:

    [x, y, z]ᵀ = distance · [sin(−yaw), cos(yaw)·sin(pitch), cos(yaw)·cos(pitch)]ᵀ + offset · [0, sin(pitch), cos(pitch)]ᵀ    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work, such as SLAM (simultaneous localization and mapping) or the ground robot detection.
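Equation 4.6 translates directly into code; the helper below is a straightforward sketch of it (angles in radians, distance and offset in the unit of the lidar output), with the type and function names chosen here only for illustration.

#include <cmath>
#include <utility>
#include <vector>

struct Point3D { double x, y, z; };

// Equation 4.6: map a lidar measurement (distance, yaw in the scan plane) taken
// at a given pitch of the tilted scanner into the ground-fixed coordinate system.
Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
{
    Point3D p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}

// Example: convert one full scan recorded at a single pitch angle.
std::vector<Point3D> convertScan(const std::vector<std::pair<double, double>>& scan, // (distance, yaw)
                                 double pitch, double offset)
{
    std::vector<Point3D> cloud;
    cloud.reserve(scan.size());
    for (const auto& m : scan)
        cloud.push_back(lidarToGround(m.first, m.second, pitch, offset));
    return cloud;
}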


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check whether it meets the predefined requirements.

To understand what better means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite it not being one. In other words, the system believes


the object is present at that location, although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (is rejected) although it should be. In this current task a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions are loosened in the detection method, false negative errors may be reduced. However, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it can be more important to find every possible desired object than to avoid some mistakes. Such a project can be a manually supervised classification, where any false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be the more important one. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested on an image dataset whose samples are already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable to frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. For this the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free-to-use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot in figure 5.1. The interface has a built-in video player, which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object as occluded, obstructed or outside of the frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames, the only task remaining is to correct the interpolations between them if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build at a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1 speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every unnecessary process was terminated, so that their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test due to the operating system did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time; the variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software
Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average ellapsed seconds: 0.032446
variance of ellapsed seconds per frame
    between video loops: 8.0297e-07
    across the video: 0.0021144

Average Frame per second rate of the processing: 30.8204 FPS

The change of the average frame-rate between video loops was also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and thus the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they still scan in a plane and simply have this processing built in. Also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this, and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

In figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) show the same map viewed from different points. As can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot in 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
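As an illustration of the simpler, height-threshold style detection mentioned above, the following sketch filters a plain array of 3D points to a band of heights roughly matching the ground robot. The coordinate convention (z as height above the floor, in metres) and the band limits are assumptions for the example, not values from the thesis.

#include <iostream>
#include <vector>

struct Point3 { double x, y, z; };   // z is assumed to be height above the floor (metres)

// Keep only points whose height falls inside the band expected for the ground
// robot; floor/carpet points and tall furniture are discarded.
std::vector<Point3> heightFilter(const std::vector<Point3>& cloud,
                                 double minZ, double maxZ) {
    std::vector<Point3> out;
    for (const Point3& p : cloud)
        if (p.z >= minZ && p.z <= maxZ)
            out.push_back(p);
    return out;
}

int main() {
    std::vector<Point3> cloud = {{0.1, 0.2, 0.02}, {0.5, 0.1, 0.25}, {1.0, 0.3, 1.8}};
    std::vector<Point3> candidates = heightFilter(cloud, 0.10, 0.60);  // assumed UGV height band
    std::cout << candidates.size() << " candidate point(s)\n";
}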

5.3 Discussion of results
In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos for all modes can be found here¹. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be measured by several values; here, recall and precision will be used. Recall is defined as

recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples. Thus it is also called sensitivity.

Precision is defined as

precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation, and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, the precision of the algorithm can remain high even with an increasing number of false negative errors.
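For reference, the two measures translate directly into code; the small helper below is only an illustration with made-up counts, not the evaluation tool used in the thesis.

#include <iostream>

struct Counts { int tp, fp, fn; };   // true positives, false positives, false negatives

double recall(const Counts& c)    { return double(c.tp) / (c.tp + c.fn); }
double precision(const Counts& c) { return double(c.tp) / (c.tp + c.fp); }

int main() {
    Counts c{62, 0, 38};   // illustrative counts only
    std::cout << "recall "     << recall(c)
              << ", precision " << precision(c) << "\n";
}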

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4) based on position and the overlapping areas. For the latter, a minimum of 50% was defined (ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged). Therefore the annotated object will not be covered by any of the detections.
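The overlap criterion described above corresponds to the following check, sketched here with OpenCV's cv::Rect; the 50% threshold is the one stated in the text, while the function and variable names are illustrative.

#include <opencv2/core/core.hpp>
#include <iostream>

// A detection matches an annotation if the area of their intersection is at
// least half of the area of their union.
bool matches(const cv::Rect& detection, const cv::Rect& annotation,
             double threshold = 0.5) {
    double interArea = (detection & annotation).area();           // intersection rectangle
    double unionArea = detection.area() + annotation.area() - interArea;
    return unionArea > 0 && interArea / unionArea >= threshold;
}

int main() {
    cv::Rect det(100, 100, 80, 60), gt(110, 105, 80, 60);          // illustrative boxes
    std::cout << (matches(det, gt) ? "true positive" : "no match") << "\n";
}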

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first three modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of over 30 FPS during the tests. Thus mode 4 can be called a "real-time" object detection algorithm.
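Mode 4's tracking builds on the correlation tracker of [76] available in Dlib; a heavily simplified sketch of how such a tracker can be driven from OpenCV frames is shown below. The input file name, the initial rectangle and the re-validation policy are placeholders, not the actual implementation of the detector module.

#include <dlib/image_processing.h>   // dlib::correlation_tracker
#include <dlib/geometry.h>
#include <dlib/opencv.h>             // dlib::cv_image wrapper around cv::Mat
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    cv::VideoCapture cap("test_video.avi");   // illustrative input
    cv::Mat frame;
    if (!cap.read(frame)) return 1;

    dlib::correlation_tracker tracker;
    dlib::cv_image<dlib::bgr_pixel> first(frame);
    // In the real system this rectangle would come from an SVM detection.
    tracker.start_track(first, dlib::centered_rect(dlib::point(480, 270), 120, 90));

    while (cap.read(frame)) {
        dlib::cv_image<dlib::bgr_pixel> img(frame);
        double confidence = tracker.update(img);      // peak response of the filter
        dlib::drectangle pos = tracker.get_position();

        // A low confidence would trigger a re-detection with the SVM classifiers.
        (void)confidence;
        cv::rectangle(frame, cv::Point((int)pos.left(), (int)pos.top()),
                      cv::Point((int)pos.right(), (int)pos.bottom()),
                      cv::Scalar(0, 255, 0), 2);
        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;              // Esc to quit
    }
}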

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

In conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall of 95%) and speed (average frame-rate of 30.8 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion
In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples of application fields were shown. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend, or to replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 summarised the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software along with the


produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features like exporting detections, demonstration videos and frame-rate measurements were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections. This resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.

Special attention was given to the evaluation of the system. Two pieces of software were developed as additional tools: one for evaluating the efficiency of the detections and another for analysing the processing time and frame-rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two-dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 summarised the progress made related to 3D mapping, along with the applied mathematics. An


experimental setup was created, with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

In conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work
In spite of the satisfactory results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front) but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely frontal angles) will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Besides basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.
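One possible simple-feature ROI estimator of the kind suggested here is sketched below with OpenCV: threshold a colour range, find the connected patches and return their bounding boxes as regions of interest. The colour range and the minimum area are placeholders that would have to be tuned to the ground robot; this is not part of the implemented system.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <vector>

// Propose regions of interest by looking for patches close to the robot's
// (assumed) dominant colour; only these boxes would then be scanned by the
// sliding-window SVM detectors.
std::vector<cv::Rect> colourRois(const cv::Mat& bgrFrame) {
    cv::Mat hsv, mask;
    cv::cvtColor(bgrFrame, hsv, cv::COLOR_BGR2HSV);
    // Placeholder range: dark, low-saturation patches.
    cv::inRange(hsv, cv::Scalar(0, 0, 0), cv::Scalar(180, 60, 80), mask);

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> rois;
    for (size_t i = 0; i < contours.size(); ++i) {
        cv::Rect box = cv::boundingRect(contours[i]);
        if (box.area() > 400)          // drop tiny patches (arbitrary threshold)
            rois.push_back(box);
    }
    return rois;
}

int main() {
    cv::Mat frame = cv::imread("frame.png");          // illustrative input image
    if (frame.empty()) return 1;
    std::cout << colourRois(frame).size() << " ROI(s) proposed\n";
}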

Tracking can be tuned more precisely as well. Besides the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.
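A possible shape of this validation logic is sketched below: trust the tracker only above a confidence threshold, and rescale its box around the centre when a check suggests it is too tight or too loose. The threshold value and the helper names are assumptions, not part of the current implementation.

#include <opencv2/core/core.hpp>

struct TrackerOutput {
    cv::Rect box;
    double confidence;    // value returned by the tracker's update step
};

// Decide whether the tracked box can be trusted or a full re-detection is needed.
bool trackerIsReliable(const TrackerOutput& t, double minConfidence = 7.0) {
    return t.confidence >= minConfidence;   // threshold would be tuned experimentally
}

// Grow or shrink the box around its centre, e.g. after a validation step
// indicated that the tracked region is too tight or too loose.
cv::Rect rescaleAroundCentre(const cv::Rect& box, double factor) {
    cv::Point2d c(box.x + box.width / 2.0, box.y + box.height / 2.0);
    int w = (int)(box.width * factor), h = (int)(box.height * factor);
    return cv::Rect((int)(c.x - w / 2.0), (int)(c.y - h / 2.0), w, h);
}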


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time: theoretically it is not possible for the robot to be outside of the resulting circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.
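The search-area idea reduces to a simple geometric test; a sketch is given below, assuming planar map coordinates in metres and an illustrative maximal speed.

#include <cmath>
#include <iostream>

// The UGV cannot leave a circle of radius maxSpeed * elapsedSeconds around its
// last known position, so only map cells inside that circle need to be scanned.
bool worthScanning(double lastX, double lastY, double cellX, double cellY,
                   double maxSpeed, double elapsedSeconds) {
    double radius = maxSpeed * elapsedSeconds;
    double dx = cellX - lastX, dy = cellY - lastY;
    return dx * dx + dy * dy <= radius * radius;
}

int main() {
    // Illustrative numbers: robot last seen at (2, 3) m, 0.5 m/s top speed, 4 s elapsed.
    std::cout << std::boolalpha
              << worthScanning(2.0, 3.0, 3.5, 3.5, 0.5, 4.0) << "\n";   // prints true
}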

It is strongly recommended to give high priority to evaluation and testing during further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections it is not suitable, but a similar solution is needed to keep up the progress towards the planned, ready-to-deploy 3D mapping system.


Acknowledgements

First of all I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks goes to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to András Horváth, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview senseFly." [Online]. Available: https://www.sensefly.com/drones/overview.html [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: http://www.questuav.com/news/fixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store, Phantom 3 Standard." [Online]. Available: http://store.dji.com/product/phantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: http://militaryhistory.about.com/od/artillerysiegeweapons/p/v1.htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science,


2007, pp. 1–13. [Online]. Available: http://www.springerlink.com/content/t361252205328408/fulltext.pdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-based roe deer fawn detection system," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: http://www.mdpi.com/2072-4292/4/11/3390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: http://link.springer.com/10.1007/s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: http://www.ifp.uni-stuttgart.de/publications/phowo11/140Eisenbeiss.pdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: http://www.cse.unsw.edu.au/~acra2005/proceedings/papers/bath.pdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2522968.2522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Technical Report ICG-TR-0108, 2008. [Online].


Available: http://www.icg.tu-graz.ac.at/Members/pmroth/pub_pmroth/TR_OR/at_download/file

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: http://www.researchgate.net/publication/257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: https://www.vision.caltech.edu/publications/perona-chapter-Dec07.pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: http://link.aip.org/link/?PSI/4387/95/1&Agg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," pp. 358–372, 2002. [Online]. Available: http://www.springerlink.com/index/H584WVN93312V4LT.pdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Robust+Real-time+Object+Detection

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] C. M. Bishop, Pattern Recognition and Machine Learning (M. Jordan and J. Kleinberg, Eds.).

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: http://www.cscjournals.org/csc/manuscript/Journals/IJIP/volume6/Issue6/IJIP-702.pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997. [Online]. Available: papers2://publication/uuid/9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2005.177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=1011869433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: http://cyberc3.sjtu.edu.cn/CyberC3/doc/paper/Robotica2006.pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: http://www.springerlink.com/index/10.1007/978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Jolliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: http://www.springerlink.com/content/978-0-387-95442-4

[61] A. Hyvärinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: http://link.springer.com/10.1007/BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: http://boundarydevices.com/product/nitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: http://rover.ardupilot.com/wiki/pixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/utm_30lx_ew.html [Accessed at 2015-07-22]

[72] "OpenCV manual, Release 2.4.9." [Online]. Available: http://docs.opencv.org/opencv2refman.pdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: http://jmlr.csail.mit.edu/papers/v10/king09a.html

[74] "dlib C++ Library." [Online]. Available: http://dlib.net [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1502.00046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/download/data/UrgBenri.htm [Accessed at 2015-07-20]

[78] "Vatic - Video Annotation Tool - UC Irvine." [Online]. Available: http://web.mit.edu/vondrick/vatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: https://www.mturk.com/mturk/welcome [Accessed at 2015-07-26]


• List of Figures
• Absztrakt
• Abstract
• List of Abbreviations
• 1 Introduction and project description
  • 1.1 Project description and requirements
  • 1.2 Type of vehicle
  • 1.3 Aims and objectives
• 2 Literature Review
  • 2.1 UAVs and applications
    • 2.1.1 Fixed-wing UAVs
    • 2.1.2 Rotary-wing UAVs
    • 2.1.3 Applications
  • 2.2 Object detection on conventional 2D images
    • 2.2.1 Classical detection methods
      • 2.2.1.1 Background subtraction
      • 2.2.1.2 Template matching algorithms
    • 2.2.2 Feature descriptors, classifiers and learning methods
      • 2.2.2.1 SIFT features
      • 2.2.2.2 Haar-like features
      • 2.2.2.3 HOG features
      • 2.2.2.4 Learning models in computer vision
      • 2.2.2.5 AdaBoost
      • 2.2.2.6 Support Vector Machine
• 3 Development
  • 3.1 Hardware resources
    • 3.1.1 Nitrogen board
    • 3.1.2 Sensors
      • 3.1.2.1 Pixhawk autopilot
      • 3.1.2.2 Camera
      • 3.1.2.3 LiDar
  • 3.2 Chosen software
    • 3.2.1 Matlab
    • 3.2.2 Robotic Operating System (ROS)
    • 3.2.3 OpenCV
    • 3.2.4 Dlib
• 4 Designing and implementing the algorithm
  • 4.1 Challenges in the task
  • 4.2 Architecture of the detection system
  • 4.3 2D image processing methods
    • 4.3.1 Chosen methods and the training algorithm
    • 4.3.2 Sliding window method
    • 4.3.3 Pre-filtering
    • 4.3.4 Tracking
    • 4.3.5 Implemented detector
      • 4.3.5.1 Mode 1: Sliding window with all the classifiers
      • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      • 4.3.5.4 Mode 4: Tracking based approach
  • 4.4 3D image processing methods
    • 4.4.1 3D recording method
    • 4.4.2 Android based recording set-up
    • 4.4.3 Final set-up with Pixhawk flight controller
    • 4.4.4 3D reconstruction
• 5 Results
  • 5.1 2D image detection results
    • 5.1.1 Evaluation
      • 5.1.1.1 Definition of True positive and negative
      • 5.1.1.2 Definition of False positive and negative
      • 5.1.1.3 Reducing number of errors
      • 5.1.1.4 Annotation and database building
    • 5.1.2 Frame-rate measurement and analysis
  • 5.2 3D image detection results
  • 5.3 Discussion of results
• 6 Conclusion and recommended future work
  • 6.1 Conclusion
  • 6.2 Recommended future work
• References

List of Abbreviations

SATM   School of Aerospace Technology and Manufacturing
UAV    Unmanned Aerial Vehicle
UAS    Unmanned Aerial System
UA     Unmanned Aircraft
UGV    Unmanned Ground Vehicle
HOG    Histogram of Oriented Gradients
RC     Radio Controlled
ROS    Robotic Operating System
IMU    Inertial Measurement Unit
DoF    Degree of Freedom
SLAM   Simultaneous Localization And Mapping
ROI    Region Of Interest
Vatic  Video Annotation Tool from Irvine, California


Absztrakt

In this thesis, a tracking system mounted on an unmanned aerial vehicle is presented, intended to detect indoor objects, with the main goal of locating the ground unit that will serve as a landing and recharging station.

First, the aims of the project and of this thesis are listed and detailed.

This is followed by a detailed literature review, which presents the existing solutions for similar challenges. A short summary of unmanned aerial vehicles and their application fields is given, then the best-known object detection methods are presented. The review discusses their advantages and disadvantages, with special regard to their applicability in the current project.

The next part describes the development environment, including the available software and hardware.

After presenting the challenges of the task, the design of a modular architecture is introduced, taking into account the objectives, the resources and the problems that arise.

One of the most important modules of this architecture, the latest version of the detection algorithm, is also detailed in the following chapter, together with its capabilities, modes and user interface.

To measure the efficiency of the module, an evaluation environment was created which can compute several metrics related to the detection. Both the environment and the metrics are detailed in the following chapter, followed by the results achieved by the latest algorithm.

Although this thesis focuses mainly on detection methods operating on conventional (2D) images, 3D imaging and processing methods were also considered. An experimental system was built which is capable of creating spectacular and precise 3D maps using a 2D laser scanner. Several recordings were made to test the solution, and these are presented together with the system.

Finally, a summary of the implemented methods and the results closes the thesis.


Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed, along with their suitability for the discussed project.

Then the environment of the development is described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced, along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluation tools are introduced to analyse efficiency and speed.

The first version of the ground robot detection algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two-dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV, which are presented in this paper as well.

Finally, all implemented methods and relevant results are summarised.


Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards, a structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements
The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications, such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly in real time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required, so that the map is suitable for the execution of further tasks. Usually this means a maximum


error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision to coordinate the process (for example when mapping a dangerous area which would be too far away for real-time control). Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. The building may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-level buildings. The system has to manage these tasks and provide a map which contains all the levels, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle
The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial vehicle or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can rise from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, and therefore a longer operation time. On the other hand, it cannot rise from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles blocking its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture.

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable to the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option is discussed in detail in section 1.3.

1.3 Aims and objectives
As can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors, or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for


the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution for this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper describes the process of the research, introduces the designed system and discusses the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction to unmanned aerial vehicles is given, along with their most important advantages and application fields.

Afterwards, a brief summary of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications
The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, Unmanned Aerial Vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs
Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extensive market


Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the senseFly company. Source: [1]

of flight controllers whose aim is to add autonomous flying and pilot assistance features [2].

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurance using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long-endurance, high-altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs
Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly like the wing on fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason why no large-scale multirotor is used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shortens operational time and increases the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be solved with the charging platform.

2.1.3 Applications
UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools in many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available both as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. Good examples of this application field are the location of an earthquake or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have proven useful is surveillance. [10] is an interesting example for this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors into the air with low-cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger-scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers. They can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera,


using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images
Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To get an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch, as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it, and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research on artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

These solutions can give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus, all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.
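
To illustrate the concept, the following minimal sketch (assuming OpenCV 3 and a placeholder video file name; neither is part of the project's code base) learns a Gaussian mixture background model and produces a foreground mask that could serve as such a region of interest detector.

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("traffic.avi");          // placeholder input video
    if (!cap.isOpened()) return 1;

    // MOG2 maintains a per-pixel Gaussian mixture background model that is
    // updated with every frame, so the detection stays adaptive.
    cv::Ptr<cv::BackgroundSubtractor> subtractor = cv::createBackgroundSubtractorMOG2();

    cv::Mat frame, foreground;
    while (cap.read(frame)) {
        subtractor->apply(frame, foreground);      // foreground mask of moving pixels
        cv::medianBlur(foreground, foreground, 5); // suppress single-pixel noise
        cv::imshow("foreground mask", foreground);
        if (cv::waitKey(30) == 27) break;          // ESC stops the loop
    }
    return 0;
}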

It has to be mentioned that not every moving (changing) patch is an object: for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection's time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in a change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means a higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].
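
The following sketch illustrates this search with OpenCV's normalized cross-correlation (the file names are placeholders): the correlation map is computed for every template position and the global maximum is taken as the match.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat scene = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);     // placeholder
    cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);  // placeholder
    if (scene.empty() || templ.empty()) return 1;

    // One correlation coefficient per (u, v) template position.
    cv::Mat result;
    cv::matchTemplate(scene, templ, result, cv::TM_CCOEFF_NORMED);

    // The method assumes the highest value marks the correct position.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);

    cv::rectangle(scene, maxLoc,
                  cv::Point(maxLoc.x + templ.cols, maxLoc.y + templ.rows),
                  cv::Scalar(255), 2);
    cv::imwrite("match.png", scene);
    return 0;
}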

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, and [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method; in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to the vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well: [48] recognizes edge based and patch based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on their fame, their performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: the first stage searches over the image at multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point, the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated at a selected scale.

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus, SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of the general class. However, this is not a problem in the case of this project, since only one ground-robot is part of it. Thus, no general class recognition is required; recognizing the used ground robot would be enough.
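
As a brief illustration of how SIFT is typically used to recognise a concrete object, the sketch below detects and matches keypoints between an object image and a scene image. It assumes an OpenCV 3 build with the opencv_contrib xfeatures2d module; the file names and the 0.75 ratio threshold are illustrative assumptions.

#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Mat object = cv::imread("robot_side.png", cv::IMREAD_GRAYSCALE);  // placeholder
    cv::Mat scene  = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);       // placeholder
    if (object.empty() || scene.empty()) return 1;

    // Detect scale-space extrema and compute the local descriptors.
    cv::Ptr<cv::Feature2D> sift = cv::xfeatures2d::SIFT::create();
    std::vector<cv::KeyPoint> kpObject, kpScene;
    cv::Mat descObject, descScene;
    sift->detectAndCompute(object, cv::noArray(), kpObject, descObject);
    sift->detectAndCompute(scene,  cv::noArray(), kpScene,  descScene);

    // Brute-force matching with a ratio test to keep only distinctive matches.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knnMatches;
    matcher.knnMatch(descObject, descScene, knnMatches, 2);

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knnMatches.size(); ++i)
        if (knnMatches[i][0].distance < 0.75f * knnMatches[i][1].distance)
            good.push_back(knnMatches[i][0]);

    std::cout << good.size() << " good matches found" << std::endl;
    return 0;
}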

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixels intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two and three rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
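
The sketch below shows this lookup trick for a single two-rectangle feature; the window file name, the rectangle coordinates and the feature itself are illustrative assumptions rather than a trained feature.

#include <opencv2/opencv.hpp>
#include <cstdio>

// Sum of pixel intensities inside r, using four references into the
// (rows+1) x (cols+1) integral image produced by cv::integral().
static double rectSum(const cv::Mat& ii, const cv::Rect& r) {
    return ii.at<double>(r.y, r.x)
         + ii.at<double>(r.y + r.height, r.x + r.width)
         - ii.at<double>(r.y, r.x + r.width)
         - ii.at<double>(r.y + r.height, r.x);
}

int main() {
    cv::Mat window = cv::imread("window.png", cv::IMREAD_GRAYSCALE);  // placeholder
    if (window.empty()) return 1;

    cv::Mat ii;
    cv::integral(window, ii, CV_64F);   // summed-area table

    // Two vertically adjacent rectangles, e.g. "eye region darker than cheeks".
    cv::Rect top(4, 4, 16, 8), bottom(4, 12, 16, 8);
    double feature = rectSum(ii, bottom) - rectSum(ii, top);
    std::printf("two-rectangle feature value: %f\n", feature);
    return 0;
}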

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves. The bottom row presents the features overlayed on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and inspired several researches. [54], for example, extended the Haar-like feature set with rotated rectangles, by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable to find patterns of combined bright and dark patches (like faces on grayscale images, for example).

Despite the original task, Haar-like features are used widely in computer vision for various tasks: vehicle detection ([55]), hand gesture recognition ([56]) and pedestrian detection ([57]), just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation-variance, which in spite of several attempts (e.g. [54]) is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and produces a histogram of the different orientations (called bins), after quantization, in small portions (called cells) of the image. After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to Edge Orientation Histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to the orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in a superior robustness against illumination changes over SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
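
For illustration, the sketch below computes a Dalal-Triggs style descriptor for one detection window with OpenCV's HOGDescriptor (default 64x128 window, 16x16 blocks, 8x8 cells, 9 bins); note that the detector developed later in this thesis uses Dlib's HOG implementation instead, and the file name here is a placeholder.

#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    cv::Mat window = cv::imread("window.png", cv::IMREAD_GRAYSCALE);  // placeholder patch
    if (window.empty()) return 1;
    cv::resize(window, window, cv::Size(64, 128));   // default HOG window size

    cv::HOGDescriptor hog;            // default parameters: 9 bins, 8x8 cells, 16x16 blocks
    std::vector<float> descriptor;    // concatenation of the block-normalized histograms
    hog.compute(window, descriptor);

    std::printf("HOG descriptor length: %zu\n", descriptor.size());  // 3780 with defaults
    return 0;
}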

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let's denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1 to N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given a test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | (θ, c)) · P(c)

is ideally 1 if x contains an instance of class c, and 0 if it does not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | (θ, x))

which is expected to be ideally 1 if x contains an instance of class c, and 0 if it does not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), the discriminative methods usually perform better for single class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to detect, two of the most well-known discriminative methods will be presented: Adaptive Boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means the output of the classifiers only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights wt,i, where N = number of training examples and T = number of features in the stage.

• For t = 1, . . . , T:

1. Normalize the weights.

2. Select the best classifier using only a single feature, by minimising the detection error εt = Σi wi |h(xi, f, p, θ) − yi|, where h(xi) is the classifier output and yi is the correct label (both with a range of 0 for negative, 1 for positive).

3. Define ht(x) = h(x, ft, pt, θt), where ft, pt and θt are the minimizers of the error above.

4. Update the weights: wt+1,i = wt,i · (εt / (1 − εt))^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise.

• The final classifier for the stage is based on the sum of the weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework, presented in 2001 [45] (also see sub-subsection 2.2.2.2).
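
The toy sketch below runs one such round over a hand-made data set with a single threshold ("stump") weak classifier; the data, the stump and the thresholds are illustrative assumptions, the point is only the weighted error and the weight update described above.

#include <cmath>
#include <cstdio>
#include <vector>

// A weak classifier: threshold one feature, with a polarity.
struct Stump { int feature; double threshold; int polarity; };

static int classify(const Stump& s, const std::vector<double>& x) {
    return (s.polarity * x[s.feature] < s.polarity * s.threshold) ? 1 : 0;
}

int main() {
    // Toy training set: 4 samples with 2 features each, labels 1/0.
    std::vector<std::vector<double> > X = {{0.2, 1.0}, {0.8, 0.3}, {0.9, 0.9}, {0.1, 0.2}};
    std::vector<int> y = {0, 1, 1, 1};
    const size_t N = X.size();

    std::vector<double> w(N, 1.0 / N);   // uniform initial weights
    Stump h = {0, 0.5, -1};              // candidate weak classifier (assumed)

    // Weighted error of the candidate.
    double eps = 0.0;
    for (size_t i = 0; i < N; ++i)
        if (classify(h, X[i]) != y[i]) eps += w[i];

    // Down-weight correctly classified samples by beta = eps / (1 - eps),
    // so the next round concentrates on the current mistakes.
    double beta = eps / (1.0 - eps), sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        if (classify(h, X[i]) == y[i]) w[i] *= beta;
        sum += w[i];
    }
    for (size_t i = 0; i < N; ++i) w[i] /= sum;   // renormalize

    std::printf("weighted error %.3f, classifier weight alpha %.3f\n",
                eps, std::log(1.0 / beta));
    return 0;
}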

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the features of the method were known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the train and test data samples as vectors in the feature space. Then it constructs one or more hyperplanes (a linear decision surface) in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and, due to that, determining the width of the margin) are called support vectors (hence the name of the method). See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.
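
A minimal sketch of training such a maximum-margin classifier with Dlib's C-SVM trainer is shown below on a toy 2-D data set (the samples, the linear kernel and the value of C are assumptions for illustration, not the configuration used in this project).

#include <dlib/svm.h>
#include <iostream>
#include <vector>

int main() {
    typedef dlib::matrix<double, 2, 1> sample_type;
    typedef dlib::linear_kernel<sample_type> kernel_type;

    std::vector<sample_type> samples;
    std::vector<double> labels;   // +1 / -1 class labels

    // Toy training set: class +1 clustered around (2, 2), class -1 around (-2, -2).
    for (int i = -2; i <= 2; ++i) {
        sample_type p;
        p = 2.0 + 0.1 * i, 2.0 - 0.1 * i;
        samples.push_back(p); labels.push_back(+1);
        p = -2.0 + 0.1 * i, -2.0 - 0.1 * i;
        samples.push_back(p); labels.push_back(-1);
    }

    // C controls the trade-off between a wide margin and training errors.
    dlib::svm_c_trainer<kernel_type> trainer;
    trainer.set_c(10);

    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

    sample_type test;
    test = 1.5, 1.8;
    std::cout << "decision value: " << df(test) << std::endl;  // > 0 means class +1
    return 0;
}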

The algorithm was extended to not linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods were proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection, presented by Navneet Dalal and Bill Triggs in 2005 [53].


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in Section 1.3. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the software libraries used will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection describes the sensors integrated on-board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system, manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable to use it for mapping purposes. See 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide angle field of view.


Figure 3.2: The chosen lidar sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available, with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed to help developing unmanned vehicles, like simulation and visualization software, and excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
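
As an illustration of this interface, a minimal roscpp node subscribing to the laser scanner's output might look like the sketch below; the topic name "/scan" and the node name are common defaults and are assumptions here.

#include <ros/ros.h>
#include <sensor_msgs/LaserScan.h>

// Called for every 2D sweep published by the lidar driver.
void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan) {
    ROS_INFO("received %lu range readings, angle increment %f rad",
             (unsigned long)scan->ranges.size(), scan->angle_increment);
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "scan_listener");
    ros::NodeHandle nh;
    ros::Subscriber sub = nh.subscribe("/scan", 10, scanCallback);
    ros::spin();   // hand control to ROS; the callback runs as messages arrive
    return 0;
}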

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions, like basic image manipulations (e.g. load, write, resize, rotate image), image processing tools (e.g. different kinds of edge detection and threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, many examples and much support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.
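
A sketch of this unified input handling is given below: the same OpenCV capture loop serves either a camera index or a video file path given on the command line (the file name in the comment is only an example).

#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
    cv::VideoCapture cap;
    if (argc > 1)
        cap.open(argv[1]);   // e.g. ./reader flight_test.avi
    else
        cap.open(0);         // default camera attached to the laptop
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    while (cap.read(frame)) {
        // From here the frame is handed to the detector, regardless of its source.
        cv::imshow("input", frame);
        if (cv::waitKey(1) == 27) break;   // ESC to quit
    }
    return 0;
}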

3.2.4 Dlib

Dlib is a general purpose cross-platform C++ library. It has an excellent machine learning library, with the aim to provide a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that aside from the features above, Dlib also contains components to handle linear algebra, threading, network IO, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below:

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus, no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks, like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus, the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: a small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually makes it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera; see subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various size of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm, specially designed for the task. Thus, the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, with at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.

[Figure 4.1 diagram: blocks for the sensors (camera, 2D lidar), the video reader and current frame, the trainer with the Vatic annotation server producing the front and side SVMs, the regions of interest and other preprocessing (edge/colour detection), the 3D map, the detector algorithm, tracking, detections and evaluation]

Figure 4.1: On this figure a diagram of the designed architecture is shown. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow.


4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed, which is suitable for the objectives defined. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used. Neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. The production of these has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture, with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus, methods which could reduce the amount of areas to process, and as such increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle were located in this map, accurate estimations can be made on the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date. Thus, evaluation had to be mentioned in this list. For more details of the evaluation, please see 5.1.1.

5. Detector: The detector module is the core algorithm, which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1), and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the 2D camera image processing concerns, the methods and the process of the development will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library, with image handling or even machine learning algorithms implemented, can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable to find patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance of the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable to recognise a concrete object instead of the general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different), but also caused by moving parts, like rotating and steering wheels, for example. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus, it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized, either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are predicted (see 4). This means that even if the camera is levelled, the object itself could seem rotated, caused by the perspective. Solutions to overcome this issue will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53], and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, seriously optimised HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentations are available for both training and detection, which simplified the implementation.

As a classifier, Support Vector Machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. Also, they are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib resulted in some very impressive detectors, for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++ and can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might get into the negative training images otherwise.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to the disk with serialization.
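
The core of such a training tool, following Dlib's fhog object detector interface, is sketched below; the xml and output file names, the detection window size and the value of C are placeholders, not the exact parameters used in the project.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>
#include <iostream>

int main() {
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    dlib::array<array2d<unsigned char> > images_train, images_test;
    std::vector<std::vector<rectangle> > boxes_train, boxes_test;

    // The xml files list the images and the annotated bounding boxes of the
    // robot; negative samples are taken from the un-annotated areas.
    load_image_dataset(images_train, boxes_train, "training.xml");
    load_image_dataset(images_test,  boxes_test,  "testing.xml");

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);    // assumed window size

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                             // SVM regularization parameter
    trainer.be_verbose();

    object_detector<image_scanner_type> detector =
        trainer.train(images_train, boxes_train);

    // Quick sanity check before saving: precision, recall, average precision.
    std::cout << "test results: "
              << test_object_detection_function(detector, images_test, boxes_test)
              << std::endl;

    serialize("side_view_detector.svm") << detector;
    return 0;
}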

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical horizontally and vertically as well. In other terms, the vehicle looks almost exactly the same from the left and right, while the front and rear-views are also very similar. This is especially true if the working principle of the HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front or the rear, and while the side-views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they not only can detect the position of the robot but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector; both the wheels and the body are easy to recognize, thanks to their strong edges. In 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front detector; notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy to do with the training software presented here. Also, including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that usually their training datasets (the images of the robot or any other object) are cropped, only containing the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus, a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.


Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it, with a predefined step-size and at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
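
A simplified version of such a scan is sketched below: the frame is rescaled a few times and a fixed-size window is stepped over each scale, every window being a candidate patch for the classifier (the window size, step and scale factor are illustrative assumptions).

#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

// Collect candidate windows over several scales, mapped back to the
// coordinates of the original frame.
std::vector<cv::Rect> slidingWindows(const cv::Mat& frame, cv::Size window,
                                     int step, double scaleFactor, int levels) {
    std::vector<cv::Rect> candidates;
    double scale = 1.0;
    for (int level = 0; level < levels; ++level, scale *= scaleFactor) {
        cv::Mat resized;
        cv::resize(frame, resized, cv::Size(), 1.0 / scale, 1.0 / scale);
        for (int y = 0; y + window.height <= resized.rows; y += step)
            for (int x = 0; x + window.width <= resized.cols; x += step)
                candidates.push_back(cv::Rect(cvRound(x * scale), cvRound(y * scale),
                                              cvRound(window.width * scale),
                                              cvRound(window.height * scale)));
    }
    return candidates;   // each rectangle would be cropped and handed to the detector
}

int main() {
    cv::Mat frame = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // placeholder
    if (frame.empty()) return 1;
    std::vector<cv::Rect> windows = slidingWindows(frame, cv::Size(80, 80), 16, 1.25, 4);
    std::printf("%zu candidate windows\n", windows.size());
    return 0;
}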

It is worth mentioning that it is possible that multiple instances of the sought object are present on the image. For face or pedestrian detection algorithms, for example, this is a potential situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus, methods which could reduce the amount of areas to process, and as such increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3), in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms had to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module with another or a newly developed one for further development.

Based on intuition, a good separating feature is the colour: anything which does not have the same colour as the ground vehicle should be ignored. To do this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus, it is very hard to define a colour (and a region of colours around it) which could be a good base of segmentation on every input video. Also, one of the test environments used (the laboratory), and many other premises reviewed, has a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus those areas are still scanned.
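One possible realization of such an edge-based ROI module is sketched below. It is illustrative only; the Canny thresholds, dilation count and minimum area are arbitrary assumptions rather than values used in the project.

// Illustrative edge-density pre-filter (not the exact module used in the thesis).
// Regions with enough edge pixels become ROIs for the expensive HOG scan.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> edgeBasedROIs(const cv::Mat& frameBGR)
{
    cv::Mat gray, edges;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 50, 150);                           // thresholds chosen ad hoc
    cv::dilate(edges, edges, cv::Mat(), cv::Point(-1, -1), 3); // connect nearby edges

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> rois;
    for (const auto& c : contours) {
        cv::Rect box = cv::boundingRect(c);
        if (box.area() > 2000)        // ignore tiny edge blobs (arbitrary limit)
            rois.push_back(box);
    }
    return rois;
}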

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot in the previous input image is an excellent estimate of the vehicle's position in the current one. Theoretically it is enough to search for the robot around its last known position within a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments; see subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was



Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

presented. Tracking is the extension of this concept: if the position of the robot is already known, there is no need to find it again. To achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing our own 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
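The sketch below shows how the Dlib correlation tracker is typically initialized from a detection box and updated on later frames; it follows Dlib's documented interface, while the function and variable names are placeholders rather than the thesis code.

// Sketch of using Dlib's correlation tracker (based on its documented interface;
// variable names and the initial box are placeholders).
#include <dlib/image_processing.h>   // dlib::correlation_tracker
#include <dlib/opencv.h>             // dlib::cv_image wrapper around cv::Mat
#include <opencv2/opencv.hpp>

void trackExample(cv::Mat firstFrame, cv::Mat nextFrame, cv::Rect detection)
{
    dlib::correlation_tracker tracker;

    // Initialize the tracker with the bounding box returned by the classifier.
    dlib::cv_image<dlib::bgr_pixel> img0(firstFrame);
    tracker.start_track(img0, dlib::rectangle(detection.x, detection.y,
                                              detection.x + detection.width,
                                              detection.y + detection.height));

    // On each new frame, update() follows the object and returns a confidence
    // score; get_position() gives the current estimated bounding box.
    dlib::cv_image<dlib::bgr_pixel> img1(nextFrame);
    double confidence = tracker.update(img1);
    dlib::drectangle pos = tracker.get_position();
    (void)confidence; (void)pos;     // use these to validate or draw the result
}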

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame,



since its position will be mapped to the 3D map as well. Unfortunately the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using 2 SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination was chosen, this software would be the base of the final detector used in the project.

The input of the software is a camera feed or video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development four different approaches were implemented. All of them are based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as new software. Instead, all were included in the same code, which decides which mode to execute at run-time based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.



Table 4.1: Table of the available parameters

Name                Valid values      Function
input               path to video     video as input for detection
svm                 path to SVMs      these SVMs will be used
mode                [1, 2, 3, 4]      selects which mode is used
saveFrames          [0, 1]            turns on video frame export
saveDetections      [0, 1]            turns on detection box export
saveFPS             [0, 1]            turns on frame-rate measurement
displayVideo        [0, 1]            turns on video display
DetectionsFileName  string            sets the filename for saved detections
FramesFolderName    string            sets the folder name used for saving video
numberOfLoops       integer (>0)      sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported or the name of the folder where the video frames are saved.

Table 4.1 summarises all the implemented parameters with the possible values and a short description.

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections)



nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detecting mode and the purpose of the execution (producing video, efficiency statistics or measuring processing frame-rate).
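Such a whitespace-separated key/value parameter file could be read, for example, with a simple parser like the sketch below. This is illustrative only; loadParameters, its return type and the handling of defaults are assumptions, not the thesis code.

// Simplified sketch of parsing the parameter file (not the thesis code).
// Each line starts with a parameter name followed by one or more values;
// unknown names and default handling are omitted here.
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <map>

std::map<std::string, std::vector<std::string>> loadParameters(const std::string& path)
{
    std::map<std::string, std::vector<std::string>> params;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream iss(line);
        std::string key, value;
        if (!(iss >> key)) continue;          // skip empty lines
        while (iss >> value)                  // e.g. "svm" can list several classifiers
            params[key].push_back(value);
    }
    return params;
}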

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.1), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple: it checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this exhaustive search is very computationally heavy, especially with two classifiers. This results in the lowest frames per second rate of all the methods.
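For illustration, a minimal sketch of mode 1's exhaustive scan using Dlib's FHOG object detector interface is given below. The scanner typedef follows Dlib's published examples; the file and image names are taken from the example parameter file above and are otherwise assumptions, not a verbatim copy of the thesis code.

// Sketch of mode 1: run every loaded FHOG/SVM detector on the whole frame.
// Follows Dlib's documented fhog object detector usage; file names match the
// example parameter file and may differ in the real project.
#include <dlib/image_processing.h>
#include <dlib/image_io.h>
#include <dlib/serialize.h>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;

int main()
{
    std::vector<dlib::object_detector<image_scanner_type>> detectors(2);
    dlib::deserialize("groundrobotfront.svm") >> detectors[0];
    dlib::deserialize("groundrobotside.svm")  >> detectors[1];

    dlib::array2d<unsigned char> img;
    dlib::load_image(img, "frame.png");      // one frame of the input video

    std::vector<dlib::rectangle> hits;
    for (auto& det : detectors) {            // every classifier scans the full image
        std::vector<dlib::rectangle> d = det(img);
        hits.insert(hits.end(), d.begin(), d.end());
    }
    // `hits` now holds all detection boxes for this frame.
    return 0;
}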

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient because of the way the two classifiers were trained. In figure 4.2 it can be seen that one of them



represents the front/rear view, while the other one was trained for the side-view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but owing to the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact like motion blur, a change in exposure, etc.
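A compact sketch of this classifier-memory logic is given below. It is illustrative only: the Detector type, the class name and the tolerance value are assumptions, not the thesis code.

// Sketch of mode 2's classifier selection (illustrative; not the thesis code).
// The index of the classifier that produced the last hit is remembered and used
// alone until it misses more than `toleranceLimit` consecutive frames.
#include <opencv2/opencv.hpp>
#include <functional>
#include <vector>

using Detector = std::function<std::vector<cv::Rect>(const cv::Mat&)>;

class Mode2Selector {
public:
    Mode2Selector(std::vector<Detector> detectors, int toleranceLimit)
        : dets_(std::move(detectors)), limit_(toleranceLimit) {}

    std::vector<cv::Rect> detect(const cv::Mat& frame) {
        // If one classifier is "remembered", try it alone first.
        if (active_ >= 0) {
            auto hits = dets_[active_](frame);
            if (!hits.empty()) { misses_ = 0; return hits; }
            if (++misses_ <= limit_) return {};   // tolerate a few missed frames
        }
        // Otherwise (or after too many misses) fall back to all classifiers.
        for (size_t i = 0; i < dets_.size(); ++i) {
            auto hits = dets_[i](frame);
            if (!hits.empty()) { active_ = int(i); misses_ = 0; return hits; }
        }
        active_ = -1;
        return {};
    }
private:
    std::vector<Detector> dets_;
    int limit_;
    int active_ = -1;   // index of the classifier that found the robot last
    int misses_ = 0;    // consecutive frames without a detection
};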

This modification of the original code resulted in much faster processing, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the image area searched was introduced as a way to increase the detection speed. The position of the robot in the previous input image is an excellent estimate of the vehicle's position in the current one. Theoretically it is enough to search for the robot around its last known position within a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the



Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that. Note the ratio between the ROI and the full size of the image.

position of the detection alongside the detector which returned it. A new rectangle, named ROI (region of interest), was included, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged because of the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returns detections inside the ROI, the region of interest is enlarged again by a smaller percentage of its original size (set by a variable, 3% by default). Note that if in any case the ROI reaches the size of the image, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
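The ROI update could look like the following sketch. It is illustrative only: the growRect helper and the clamping against the image rectangle are assumptions, while the 50% and 3% growth factors are the defaults quoted above.

// Sketch of mode 3's ROI update (illustrative; the 50% / 3% growth factors are
// the defaults quoted in the text and are configurable in the real detector).
#include <opencv2/opencv.hpp>

cv::Rect growRect(const cv::Rect& r, double factor, const cv::Size& imageSize)
{
    int dw = int(r.width  * factor / 2.0);
    int dh = int(r.height * factor / 2.0);
    cv::Rect grown(r.x - dw, r.y - dh, r.width + 2 * dw, r.height + 2 * dh);
    // Never let the ROI leave the image.
    return grown & cv::Rect(0, 0, imageSize.width, imageSize.height);
}

// On a detection: the ROI becomes the detection box grown by 50%.
//   roi = growRect(detectionBox, 0.50, frame.size());
// On a miss: the current ROI is grown again by 3% of its size, eventually
// reaching the full image (then mode 3 behaves like mode 2):
//   roi = growRect(roi, 0.03, frame.size());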

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case, it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.



Please note that while the ROI is updated from the previous detections, it is very easy to change it to another "pre-filter" due to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, in appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides only (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns detections inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See figure 4.7(a) for a representation of the processing method of mode 4.
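A condensed, illustrative sketch of mode 4's track-and-validate loop is shown below. The Mode4 struct and its callback members (detect, validate, startTrack, updateTrack) are placeholders standing in for the mode 3 search, the SVM validation and the Dlib tracker calls; the tolerance value is an assumption.

// Condensed sketch of mode 4's track-and-validate logic (illustrative only).
#include <opencv2/opencv.hpp>
#include <functional>

struct Mode4 {
    std::function<bool(const cv::Mat&, cv::Rect&)> detect;     // mode-3 style search
    std::function<bool(const cv::Mat&, const cv::Rect&, cv::Rect&)> validate;
    std::function<void(const cv::Mat&, const cv::Rect&)> startTrack;
    std::function<cv::Rect(const cv::Mat&)> updateTrack;

    bool tracking = false;
    int failedValidations = 0;
    const int lostLimit = 15;       // assumed tolerance before the object is "lost"

    bool processFrame(const cv::Mat& frame, cv::Rect& out) {
        if (!tracking) {                       // search as mode 3 would
            if (!detect(frame, out)) return false;
            startTrack(frame, out);            // initialise tracker from the detection
            tracking = true;
            failedValidations = 0;
            return true;
        }
        out = updateTrack(frame);              // follow the object
        cv::Rect corrected;
        if (validate(frame, out, corrected)) { // periodic check with the SVMs
            startTrack(frame, corrected);      // reinitialise on a confirmed hit
            failedValidations = 0;
        } else if (++failedValidations > lostLimit) {
            tracking = false;                  // object lost, fall back to searching
        }
        return true;
    }
};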



Figure 4.7: (a) A presentation of mode 4: the red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.



4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved; however, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

In figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). In part a it is shown in two different orientations: first when the recording plane is completely flat, and secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis in the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.



5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical axis (capital Z). To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined by polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (135°, −135°). This area is marked with dark grey in the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the Lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle in the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and it is marked in blue in the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.



In parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the Lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for Lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file name is user defined.

Since both the Lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved



Figure 4.8: Schematic figure to represent the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.



Figure 4.9: Example representation of the output of the Lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording by the official recording tool [77].



to the log file as well. However, since the two systems were not connected, the timestamps had to be synchronized manually.
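The off-line matching itself can be done by looking up, for each lidar scan timestamp, the closest orientation sample in the IMU log. A minimal sketch is shown below; the ImuSample structure, the pitchAt helper and the manually determined clockOffset are assumptions, not the recording tools' actual code.

// Sketch of matching each lidar scan to the closest IMU orientation sample by
// timestamp (illustrative; the constant clock offset between the two logs was
// determined manually in the experiments).
#include <vector>
#include <algorithm>
#include <cmath>

struct ImuSample { double t; double pitch; };   // seconds, radians

double pitchAt(const std::vector<ImuSample>& imu, double scanTime, double clockOffset)
{
    double t = scanTime + clockOffset;          // align the two time bases
    auto it = std::lower_bound(imu.begin(), imu.end(), t,
                               [](const ImuSample& s, double v) { return s.t < v; });
    if (it == imu.begin()) return it->pitch;
    if (it == imu.end())   return (it - 1)->pitch;
    // Return the sample closest in time (linear interpolation would also work).
    return (std::fabs(it->t - t) < std::fabs((it - 1)->t - t)) ? it->pitch
                                                               : (it - 1)->pitch;
}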

In figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. We used a Motorola Moto X (2013) for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel with the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The Z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis; it is 0 if the phone is lying on a horizontal surface,

• roll is the rotation around the y axis,

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently and produced spectacular 3D images. In figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszár managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robot Operating System (3.2.2). This way the two types of sensors were recorded with the same software.



Figure 4.11: (a) The laboratory and the recorded scene, showing the mounting of the phone; (b) elevation view of the produced 3D map; (c) the produced 3D map from another viewpoint. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6, 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.



Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown in figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• x axis is the vector product of y and z (x = y × z), pointing to the right,

• y axis is the facing direction of the scanner, parallel to the ground, pointing forward,

• z axis is pointing up and is perpendicular to the ground.

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground fixed coordinate system. Three represent its position along the x, y, z axes; the other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen in figure 4.8, the roll and yaw angles are not able to change, thus only pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

Here distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5), and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

The position of the lidar in the ground coordinate system did not change in theory, only its orientation. However, as can be seen in figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device in another way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The vector is calculated as the sum of dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)



Here dy is the translation required along the y axis and dz is the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented in figure 4.8 part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

(x, y, z) = distance · (sin(−yaw), cos(yaw) · sin(pitch), cos(yaw) · cos(pitch)) + offset · (0, sin(pitch), cos(pitch))    (4.6)

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (simultaneous localization and mapping) or the ground robot detection.
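For reference, equation 4.6 translates directly into code. The sketch below is illustrative (the Point3 struct and lidarToWorld name are placeholders), with angles in radians and distances in the lidar's units.

// Direct implementation of equation 4.6: a point measured by the lidar as
// (distance, yaw) in its own plane, tilted by `pitch`, is mapped to the ground
// fixed coordinate system; `offset` is the light source / rotation axis distance.
#include <cmath>

struct Point3 { double x, y, z; };

Point3 lidarToWorld(double distance, double yaw, double pitch, double offset)
{
    Point3 p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}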


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check if it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection; note that this is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called type I error) if the system classifies an input as a positive sample even though it is not. In other words, the system believes



the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called type II error) occurs when an input is not recognized (rejected) although it should be. In this current task a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if conditions are loosened in the detection method, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than the fact that some mistakes will occur. Such a project can be a manually supervised classification, where appearing false positives can be eliminated by the operator (for example, illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset with its samples already labelled (either fully manually or using a semi-supervised classifier) or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.



Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2); their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free-to-use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot in figure 5.1. The interface has a built-in video player, which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.



The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object occluded, obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only remaining task is to correct the interpolations between them, if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way, large datasets with several videos are easy to build for relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including XML, JSON, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed processing each frame.
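A minimal sketch of such per-frame timing with std::chrono is shown below. It is illustrative only; the log file name and the loop termination are placeholders, not the thesis code.

// Sketch of per-frame time measurement (illustrative; the real detector exports
// these values to a log file that the Matlab analysis script loads).
#include <chrono>
#include <fstream>

int main()
{
    std::ofstream log("fps_log.txt");           // hypothetical output file name
    bool moreFrames = true;
    while (moreFrames) {
        auto t0 = std::chrono::steady_clock::now();
        // ... grab a frame and run the selected detection mode here ...
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        log << seconds << "\n";                 // one processing time per frame
        moreFrames = false;                     // placeholder loop termination
    }
    return 0;
}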

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every other unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test due to the operating system did not influence the results significantly.



To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time; variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average ellapsed seconds: 0.032446
variance of ellapsed seconds per frame
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average Frame per second rate of the processing: 30.8204 FPS

The changes of average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they are still scanning in a plane and have this mentioned processing built in; also, they are even more expensive than the 2D versions.



Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this and makes indoor navigation easier. Therefore experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

In figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recordings from one point are not enough to create complete maps, as "shadows" and obscured parts still occur; note the "empty" area behind the robot in 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos can be found here¹ for all modes. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos



Figure 5.3: Example of the 3D images built. (a) Example result of the 3D scan of the ground robot; (b) example of the 'shadow' of the ground robot. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.



The latter can be measured by several values; here, recall and precision will be used. Recall is defined by

recall = TP / (TP + FN)

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

precision = TP / (TP + FP)

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, the precision of the algorithm can be improved at the cost of more false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4) based on position and the overlapping areas. For the latter, a minimum of 50% was defined (the ratio of the area of the intersection and the union of the two rectangles).
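A minimal sketch of this evaluation step is shown below. It is illustrative only (the iou and evaluateFrame helpers are assumptions, not the actual evaluation tool): a detection counts as a true positive when its intersection-over-union with an annotated box reaches 0.5.

// Sketch of the overlap criterion and per-frame TP/FP/FN counting (illustrative).
#include <opencv2/opencv.hpp>
#include <vector>

double iou(const cv::Rect& a, const cv::Rect& b)
{
    double inter = (a & b).area();
    double uni   = a.area() + b.area() - inter;
    return uni > 0 ? inter / uni : 0.0;
}

void evaluateFrame(const std::vector<cv::Rect>& detections,
                   const std::vector<cv::Rect>& annotations,
                   int& TP, int& FP, int& FN)
{
    std::vector<bool> matched(annotations.size(), false);
    for (const auto& d : detections) {
        bool hit = false;
        for (size_t i = 0; i < annotations.size(); ++i) {
            if (!matched[i] && iou(d, annotations[i]) >= 0.5) {
                matched[i] = true; hit = true; break;
            }
        }
        hit ? ++TP : ++FP;
    }
    for (bool m : matched) if (!m) ++FN;   // unmatched annotations are misses
}

// Over all frames: recall = TP / double(TP + FN); precision = TP / double(TP + FP);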

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged). Therefore the annotated object will not be covered by any of the detections.

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.



Similarly, the processing speeds of the methods introduced in subsection 4.3.5 were analysed as well, with the tools described in subsection 5.1.2.

Mode 1 was proven to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames per second.

Finally, mode 4 was proven to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall 95%) and speed (average frame-rate 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased; as mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and the challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or replace modules. The currently implemented and used ones were listed and explained; also, some unsuccessful experiments were mentioned.

Subsection 4.3.1 summarised the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software along with the



produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features, like exporting detections, demonstration videos and frame rate measurements, were implemented. To make the development simpler, an easy-to-use interface was included which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of regions of interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, in appropriate conditions, can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by detectors.

Special attention was given to the evaluation of the system. Two software tools were developed: one for evaluating the efficiency of the detections and another for analysing the processing times and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detecting algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 summarised the progress made related to 3D mapping along with the applied mathematics. An



experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in 1.3,

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front) but including more images from slightly rotated view-points (both rolling the camera and capturing the object not completely from the front) will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.



Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically, it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field-of-view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.
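The search-space reduction described above amounts to a simple reachability check; a sketch under the stated assumptions (known last map position, an assumed maximal speed of the UGV, measured elapsed time) follows.

```cpp
#include <cmath>

// Illustrative reachability check: given the UGV's last known map position,
// its (assumed) maximal speed and the elapsed time, any candidate position
// outside the resulting circle can be discarded without scanning.
struct Position2D { double x, y; };

bool couldUgvBeHere(const Position2D& lastKnown,
                    const Position2D& candidate,
                    double maxSpeedMps,       // e.g. 1.0 m/s, an assumed limit
                    double elapsedSeconds)
{
    const double radius = maxSpeedMps * elapsedSeconds;  // maximal travelled distance
    const double dx = candidate.x - lastKnown.x;
    const double dy = candidate.y - lastKnown.y;
    return std::sqrt(dx * dx + dy * dy) <= radius;
}
```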

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years, and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr. Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and who I could always turn to in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview senseFly" [Online]. Available: httpswwwsenseflycomdronesoverviewhtml [Accessed at 2015-08-08].

[2] H. Chao, Y. Cao and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: httparcaiaaorgdoiabs10251462007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications" [Online]. Available: httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08].

[5] "DJI Store: Phantom 3 Standard" [Online]. Available: httpstoredjicomproductphantom-3-standard [Accessed at 2015-08-08].

[6] "World War II V-1 Flying Bomb - Military History" [Online]. Available: httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed at 2015-08-08].

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A. Jaimes, S. Kota and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: httparcaiaaorgdoiabs10251462010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: httpwwwmdpicom2072-42924113390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: httplinkspringercom101007s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: httpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C. Bills, J. Chen and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951ampAgg=doi


[44] H. Schweitzer, J. W. Bell and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997.

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: httpdxdoiorg101109CVPR2005177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S. Han, Y. Han and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. Chen, N. Georganas and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen and E. Oja, "Independent Component Analysis," vol. 10, 2002.

[62] K. Mikolajczyk, B. Leibe and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer" [Online]. Available: httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10].

[70] "Pixhawk flight controller" [Online]. Available: httproverardupilotcomwikipixhawk-wiring-rover [Accessed at 2015-08-10].

[71] "Scanning range finder UTM-30LX-EW" [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed at 2015-07-22].

[72] "OpenCV manual, Release 2.4.9" [Online]. Available: httpdocsopencvorgopencv2refmanpdf [Accessed at 2015-07-20].

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: httpjmlrcsailmitedupapersv10king09ahtml

[74] "dlib C++ Library" [Online]. Available: httpdlibnet [Accessed at 2015-07-21].

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: httparxivorgabs150200046

[76] M. Danelljan, G. Häger and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference BMVC, 2014.

[77] "UrgBenri Information Page" [Online]. Available: httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed at 2015-07-20].

[78] "vatic - Video Annotation Tool - UC Irvine" [Online]. Available: httpwebmiteduvondrickvatic [Accessed at 2015-07-24].

[79] "Amazon Mechanical Turk" [Online]. Available: httpswwwmturkcommturkwelcome [Accessed at 2015-07-26].


  • List of Figures
  • Absztrakt
  • Abstract
  • List of Abbreviations
  • 1 Introduction and project description
    • 1.1 Project description and requirements
    • 1.2 Type of vehicle
    • 1.3 Aims and objectives
  • 2 Literature Review
    • 2.1 UAVs and applications
      • 2.1.1 Fixed-wing UAVs
      • 2.1.2 Rotary-wing UAVs
      • 2.1.3 Applications
    • 2.2 Object detection on conventional 2D images
      • 2.2.1 Classical detection methods
        • 2.2.1.1 Background subtraction
        • 2.2.1.2 Template matching algorithms
      • 2.2.2 Feature descriptors, classifiers and learning methods
        • 2.2.2.1 SIFT features
        • 2.2.2.2 Haar-like features
        • 2.2.2.3 HOG features
        • 2.2.2.4 Learning models in computer vision
        • 2.2.2.5 AdaBoost
        • 2.2.2.6 Support Vector Machine
  • 3 Development
    • 3.1 Hardware resources
      • 3.1.1 Nitrogen board
      • 3.1.2 Sensors
        • 3.1.2.1 Pixhawk autopilot
        • 3.1.2.2 Camera
        • 3.1.2.3 LiDar
    • 3.2 Chosen software
      • 3.2.1 Matlab
      • 3.2.2 Robotic Operating System (ROS)
      • 3.2.3 OpenCV
      • 3.2.4 Dlib
  • 4 Designing and implementing the algorithm
    • 4.1 Challenges in the task
    • 4.2 Architecture of the detection system
    • 4.3 2D image processing methods
      • 4.3.1 Chosen methods and the training algorithm
      • 4.3.2 Sliding window method
      • 4.3.3 Pre-filtering
      • 4.3.4 Tracking
      • 4.3.5 Implemented detector
        • 4.3.5.1 Mode 1: Sliding window with all the classifiers
        • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
        • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
        • 4.3.5.4 Mode 4: Tracking based approach
    • 4.4 3D image processing methods
      • 4.4.1 3D recording method
      • 4.4.2 Android based recording set-up
      • 4.4.3 Final set-up with Pixhawk flight controller
      • 4.4.4 3D reconstruction
  • 5 Results
    • 5.1 2D image detection results
      • 5.1.1 Evaluation
        • 5.1.1.1 Definition of True positive and negative
        • 5.1.1.2 Definition of False positive and negative
        • 5.1.1.3 Reducing number of errors
        • 5.1.1.4 Annotation and database building
      • 5.1.2 Frame-rate measurement and analysis
    • 5.2 3D image detection results
    • 5.3 Discussion of results
  • 6 Conclusion and recommended future work
    • 6.1 Conclusion
    • 6.2 Recommended future work
  • References

Absztrakt

This thesis presents a tracking system mounted on an unmanned aerial vehicle, intended to detect indoor objects, and whose main goal is to find the ground unit that will serve as a landing and recharging station.

First, the aims of the project and of this thesis are listed and detailed.

This is followed by a detailed literature review presenting the existing solutions to similar challenges. A short summary of unmanned aerial vehicles and their application fields is given, then the best-known object detection methods are presented. The critique discusses their advantages and disadvantages, with special regard to their applicability in the current project.

The next part describes the development environment, including the available software and hardware.

After presenting the challenges of the task, the design of a modular architecture is introduced, taking into account the objectives, the resources and the arising problems.

One of the most important modules of this architecture, the latest version of the detection algorithm, is also detailed in the following chapter, together with its capabilities, modes and user interface.

To measure the efficiency of the module, an evaluation environment was created which can compute several metrics related to the detection. Both the environment and the metrics are detailed in the following chapter, followed by the results achieved by the latest algorithm.

Although this thesis focuses mainly on detection methods operating on conventional (2D) images, 3D imaging and processing methods were considered as well. An experimental system was built which is capable of creating spectacular and precise 3D maps using a 2D laser scanner. Several recordings were made to test the solution, and these are presented together with the system.

Finally, a summary of the implemented methods and the results closes the thesis.


Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed, along with their suitability for the discussed project.

Then the environment of the development is described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and the challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced, along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluation tools are introduced to analyse efficiency and speed.

The first version of the ground robot detection algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV and presented in this paper as well.

Finally, all implemented methods and relevant results are summarized.


Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards, a structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications, such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly real-time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required, so the map would be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors with similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision (for example, mapping a dangerous area which would be too far for real time control) to coordinate the process. Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (plan a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location), and handle multi-layer buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has a lot more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, therefore longer operation time. On the other hand, it cannot elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles blocking its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable for the ground vehicle. With sophisticated route planning, the areas scanned twice can be minimized, resulting in a faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in Section 1.3.

1.3 Aims and objectives

As it can be seen, the scope of the project is extremely wide. To cover the whole, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore, the project has been divided into numerous subtasks.

During the planning period of the project, several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors, or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in Section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution to this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed, which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction of unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief overview of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes: the smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, unmanned aerial vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurances using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore, flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long endurance, high altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take off. Figure 2.1 shows a consumer fixed-wing UAV designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus, rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly as the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known: UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electric stabilization) and less energy efficient than helicopters. That is why no large scale multirotor is used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.
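To illustrate the per-motor thrust adjustment described above, a deliberately simplified motor-mixing sketch for a quadrotor in X configuration is shown below. The sign conventions and scaling are assumptions that vary between flight stacks; real flight controllers (such as the Pixhawk used later in this project) implement this with calibrated gains, feedback loops and saturation handling.

```cpp
#include <algorithm>

// Simplified, illustrative motor mixer for a quadrotor in X configuration:
// the flight controller turns the desired throttle/roll/pitch/yaw corrections
// into four individual motor thrust commands. Signs are only a convention.
struct MotorCommands { double frontLeft, frontRight, rearLeft, rearRight; };

MotorCommands mixQuadX(double throttle, double roll, double pitch, double yaw)
{
    MotorCommands m;
    m.frontLeft  = throttle + roll + pitch - yaw;
    m.frontRight = throttle - roll + pitch + yaw;
    m.rearLeft   = throttle + roll - pitch + yaw;
    m.rearRight  = throttle - roll - pitch - yaw;

    // clamp each command to the valid actuator range [0, 1]
    double* all[4] = { &m.frontLeft, &m.frontRight, &m.rearLeft, &m.rearRight };
    for (int i = 0; i < 4; ++i)
        *all[i] = std::min(1.0, std::max(0.0, *all[i]));
    return m;
}
```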

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas, and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, as many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See Figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example for this application field is the location of an earthquake, or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs were admittedly proven useful is surveillance. [10] is an interesting example for this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks, for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions, see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples for this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus, the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles aiming to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them by:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item

• Recognition: localization of all the items present in the stimulus

• Understanding: recognition plus the role of the stimulus in the context of the scene

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As it can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in Section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that, although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differentiating a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example of background subtraction used for people detection. The red patches show the object in the segmented foreground. Source: [39]

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus, all possible objects of interest are very well separated. The fixed position of the camera means fixed perspectives, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.
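A minimal sketch of such a region-of-interest stage built on background subtraction is shown below, assuming the OpenCV 2.4-style MOG2 API and its default shadow handling; the area threshold is an illustrative parameter.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/video/background_segm.hpp>
#include <vector>

// Illustrative ROI detector: each sufficiently large foreground blob returned
// by the adaptive background model becomes a candidate rectangle for a more
// expensive classifier.
std::vector<cv::Rect> movingRegions(cv::BackgroundSubtractorMOG2& subtractor,
                                    const cv::Mat& frame,
                                    double minArea = 500.0)
{
    cv::Mat foreground;
    subtractor(frame, foreground);                    // update model + get foreground mask
    cv::threshold(foreground, foreground, 200, 255,   // drop low-valued (shadow) pixels
                  cv::THRESH_BINARY);

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(foreground, contours, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> candidates;
    for (size_t i = 0; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > minArea)
            candidates.push_back(cv::boundingRect(contours[i]));
    return candidates;
}
```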

It has to be mentioned that not every moving (changing) patch is an object, as, for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection's time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in a change across the whole image. In this case, the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filter (see [42]) technique of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.
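For illustration, a minimal normalized cross-correlation matching sketch using OpenCV is given below; the function name is illustrative, and the best-match location mirrors the behaviour discussed around Figure 2.4.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Minimal normalized cross-correlation template matching sketch: the response
// map is largest where the template fits best.
cv::Rect matchTemplateNcc(const cv::Mat& image, const cv::Mat& templ)
{
    cv::Mat response;
    cv::matchTemplate(image, templ, response, CV_TM_CCORR_NORMED);

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);

    // maxLoc is the top-left corner of the best match for the normalized correlation
    return cv::Rect(maxLoc, templ.size());
}
```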

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to the vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well: [48] recognizes edge based and patch based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels. The size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see point 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion about his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50] (a minimal extraction and matching sketch is given after the list):

1. Scale-space extrema detection: The first stage searches over the image in multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: At each selected point, the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: After selecting the interesting points, the gradients around them are calculated at a selected scale.
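The sketch below illustrates how SIFT key-points extracted from a training image of the UGV could be matched against the current frame, assuming the OpenCV 2.4 non-free module; the ratio threshold follows Lowe's ratio test [50] and the function name is illustrative only.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SIFT lives in the non-free module in OpenCV 2.4
#include <vector>

// Illustrative SIFT extraction + matching: many good matches between the
// training image and the frame suggest that the object is present.
int countSiftMatches(const cv::Mat& trainGray, const cv::Mat& frameGray,
                     float ratioThreshold = 0.7f)
{
    cv::SIFT sift;
    std::vector<cv::KeyPoint> kpTrain, kpFrame;
    cv::Mat descTrain, descFrame;
    sift(trainGray, cv::Mat(), kpTrain, descTrain);   // detect + describe
    sift(frameGray, cv::Mat(), kpFrame, descFrame);

    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knn;
    matcher.knnMatch(descTrain, descFrame, knn, 2);

    int good = 0;                                     // Lowe's ratio test [50]
    for (size_t i = 0; i < knn.size(); ++i)
        if (knn[i].size() == 2 && knn[i][0].distance < ratioThreshold * knn[i][1].distance)
            ++good;
    return good;
}
```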

The biggest advantage of the descriptor is its scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is achieved by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus, SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See Figure 2.5 for an example of two and three rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
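A minimal sketch of this corner arithmetic is given below for a horizontal two-rectangle feature, using OpenCV's integral image; the helper names are illustrative only.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Illustrative computation of a horizontal two-rectangle Haar-like feature
// using an integral image: the sum inside any rectangle needs only four
// look-ups, regardless of the rectangle's size.
static int rectSum(const cv::Mat& integral, const cv::Rect& r)
{
    // the integral image has one extra row/column; entry (y, x) holds the sum above-left of (y, x)
    return integral.at<int>(r.y, r.x)
         + integral.at<int>(r.y + r.height, r.x + r.width)
         - integral.at<int>(r.y, r.x + r.width)
         - integral.at<int>(r.y + r.height, r.x);
}

int twoRectHaarFeature(const cv::Mat& gray, const cv::Rect& window)
{
    cv::Mat integralImg;
    cv::integral(gray, integralImg, CV_32S);          // size: (rows+1) x (cols+1)

    // split the window into a left and a right half of equal size
    cv::Rect left(window.x, window.y, window.width / 2, window.height);
    cv::Rect right(window.x + window.width / 2, window.y, window.width / 2, window.height);

    return rectSum(integralImg, left) - rectSum(integralImg, right);
}
```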

Figure 25: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves; the bottom row presents the features overlaid on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and has inspired several further research efforts; [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are widely used in computer vision for various purposes: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is rotation variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2223 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2222), HOG gathers information from the gradient image.

The technique calculates the gradient of every pixel and produces a histogram of the different orientations (called bins), after quantization, over small portions of the image (called cells). After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 26: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2221 and [31]). However, it is not normalized with respect to orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
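To make the cell/block/bin terminology concrete, the sketch below computes a HOG feature vector for one image window with OpenCV's HOGDescriptor. The window, cell and block sizes are the pedestrian-detection defaults from [53], not the values chosen for the ground robot, and the input file is a placeholder.

#include <opencv2/opencv.hpp>
#include <opencv2/objdetect.hpp>

int main()
{
    cv::Mat window = cv::imread("window.png", cv::IMREAD_GRAYSCALE); // placeholder
    cv::resize(window, window, cv::Size(64, 128));   // detection window size

    // 64x128 window, 16x16 blocks, 8x8 block stride, 8x8 cells, 9 orientation bins
    cv::HOGDescriptor hog(cv::Size(64, 128), cv::Size(16, 16),
                          cv::Size(8, 8), cv::Size(8, 8), 9);

    std::vector<float> descriptor;                   // concatenated, block-normalized histograms
    hog.compute(window, descriptor);
    // 7 x 15 block positions * 4 cells per block * 9 bins = 3780 values
}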

2224 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by x_i, the corresponding labels by c_i, and the feature descriptors (extracted from the data x_i) by θ_i, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much information as possible. Afterwards, given test data, they predict the probability that an instance x (with features θ) was generated, conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) P(c)

is ideally 1 if x contains an instance of class c, and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).

Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the probability

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 26 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this thesis requires only one class to be detected, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2225) and support vector machines (SVM, [64], see sub-subsection 2226).

2225 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2222) is over 160,000 in the case of a 24×24 pixel window (see subsection 432) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weak classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize the N · T weights w_{t,i}, where N is the number of training examples and T is the number of features (boosting rounds) in the stage.

• For t = 1, ..., T:

1. Normalize the weights.

2. Select the best weak classifier using only a single feature, by minimising the detection error

\[ \varepsilon_t = \sum_i w_i \, \left| h(x_i, f, p, \theta) - y_i \right| \]

where h(x_i, f, p, θ) is the classifier output and y_i is the correct label (both 0 for negative and 1 for positive samples).

3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

4. Update the weights:

\[ w_{t+1,i} = w_{t,i} \left( \frac{\varepsilon_t}{1 - \varepsilon_t} \right)^{1 - e_i} \]

where e_i = 0 if example x_i is classified correctly and e_i = 1 otherwise.

• The final classifier of the stage is based on the weighted sum of the weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2222).
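For reference, the strong classifier produced by such a stage in the Viola-Jones formulation [45] is a thresholded weighted vote of the selected weak classifiers:

\[
C(x) =
\begin{cases}
1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t, \\
0 & \text{otherwise,}
\end{cases}
\qquad
\alpha_t = \log \frac{1 - \varepsilon_t}{\varepsilon_t},
\]

so weak classifiers with a lower training error receive a larger weight in the final decision.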

2226 Support Vector Machine

Support vector machines (SVM, also called support vector networks) are a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the features of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the set of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, to separate the two classes perfectly). The second constraint is to keep the distance from the hyperplane to the nearest training sample's vector from each class as large as possible (in other words, to define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and therefore determining the width of the margin) are called support vectors (hence the name of the method). See figure 27 for an example of a separable problem, the chosen hyperplane and the margin.
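Written formally (a standard formulation consistent with [64], added here only for reference), for training vectors x_i with labels y_i ∈ {−1, +1} the maximal-margin hyperplane w · x + b = 0 is found by solving

\[
\min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w \cdot x_i + b \right) \ge 1, \qquad i = 1, \dots, N,
\]

where the margin width equals 2/||w||, so minimising ||w|| maximises the margin; the samples for which the constraint holds with equality are exactly the support vectors.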

The algorithm was extended to tasks that are not linearly separable in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.

Figure 27: Example of a separable problem in 2D. The support vectors are marked with grey squares; they define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or the "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].

Chapter 3

Development

In this chapter the circumstances of the development which influenced the research will be presented, especially with respect to the objectives defined in Section 13. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

31 Hardware resources

311 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robot Operating System (subsection 322) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

312 Sensors

This subsection summarises the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.

Figure 31: Image of the Pixhawk flight controller. Source: [70]

3121 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3D Robotics company. It is the chosen flight controller of the project's UAV. In figure 31 the unit is shown with optional accessories.

The Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and keeps both the hardware and the software integrated. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable for mapping purposes. See 44 for details of its application.

3122 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is that they are available in several sizes, with different resolutions and other features, for a relatively low price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aims of this project as well, owing to their light weight and wide-angle field of view.

Figure 32: The chosen LIDAR sensor, a Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3123 LiDar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic working principle is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm) and weight (210 g) [71], make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

32 Chosen software

321 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

322 Robot Operating System (ROS)

The Robot Operating System is an extensive open-source framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software, and excellent general-purpose message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

ROS also contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3123) and the flight controller (3121) have ready-to-use drivers available.

323 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions: basic image manipulations (e.g. loading, writing, resizing and rotating images), image processing tools (e.g. different kinds of edge detection and thresholding methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, plenty of examples and support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read such inputs.
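A minimal sketch of this input handling (file names are placeholders, not the project's actual paths): OpenCV reads the frames, and Dlib's cv_image wrapper exposes them to the Dlib-based detectors without copying.

#include <opencv2/opencv.hpp>
#include <dlib/opencv.h>
#include <dlib/image_processing.h>

int main(int argc, char** argv)
{
    // Open either a video file or the first camera attached to the laptop
    cv::VideoCapture cap;
    if (argc > 1) cap.open(argv[1]);   // e.g. ./detector testVideo1.avi
    else          cap.open(0);

    cv::Mat frame;
    while (cap.read(frame)) {
        // Wrap the OpenCV frame so Dlib functions can operate on it
        dlib::cv_image<dlib::bgr_pixel> dlibFrame(frame);
        // ... run the HOG+SVM detectors / tracker on dlibFrame here ...
    }
}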

324 Dlib

Dlib is a general purpose cross-platform C++ library. It has an excellent machine learning component, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 33 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Figure 33: Elements of Dlib's machine learning toolkit. Source: [73]

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].

Chapter 4

Designing and implementing the algorithm

In this chapter, challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm is introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced with example images and the mathematical calculations.

41 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are significantly limited. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The used lidar (3123) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (311) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of the processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera. See subsection 2211 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360 degrees and from above.

5. Various size of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. That is provided by another algorithm specially designed for the task, thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.

Figure 41: Diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system (the trainer with the Vatic annotation server, the front and side SVMs of the detector algorithm, the video reader, the region of interest estimation based on the 3D map and other preprocessing such as edge and colour detection, the tracking, and the evaluation). Arrows represent the dependencies and the direction of the information flow.

42 Architecture of the detection system

In the previous chapters the project (11) and the objectives of this thesis (13) were introduced. The advantages and disadvantages of the different available sensors were presented (312). Some of the most often used feature extraction and classification methods (222) were examined with respect to their suitability for the project. Finally, in section 41 the challenges of the task were discussed.

After considering the aspects above, a complete modular architecture was designed which is suitable for the defined objectives. See figure 41 for an overview of the design. The main idea of the structure is to compensate for the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The following enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 222, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. The production of these has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from the testing. See subsection 431 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 312). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robot Operating System (322). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.

Resulting from the scope of the project, the vehicle itself is not ready, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 41), speed and the limitations of the available resources were a great concern. Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 43.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 511.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the input from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1), and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

43 2D image processing methods

In this section the 2D camera image processing considerations and methods, and the process of their development, will be presented.

431 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 22. In section 32 the used image processing libraries and toolkits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2223) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more dominant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (323) training algorithm, the learning process was significantly slower than with the final training software (presented later), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2221) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable for recognising a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different), but also caused by moving parts, like rotating and steering wheels, for example. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises with no a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2223) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized, either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are predicted (see point 4). This means that even if the camera is levelled, the object itself could seem rotated, caused by the perspective. To overcome this issue, solutions will be presented in this subsection and in 434.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (324) was chosen, since it includes an excellent, seriously optimised HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 434). Also, several examples and pieces of documentation are available for both training and detection, which simplified the implementation.

As a classifier, support vector machines (SVM, see sub-subsection 2226) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. They are also implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which makes the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperforms the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an XML file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might get into the negative training images otherwise.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar XML file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
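The thesis's own training code is not reproduced here; the sketch below only illustrates how such a HOG+SVM detector is typically trained, tested and serialized with Dlib's C++ API from an imglab-style XML annotation file. The file names, detection window size and the C parameter are placeholder assumptions.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>
#include <iostream>

int main()
{
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;

    std::vector<array2d<unsigned char>> images, test_images;
    std::vector<std::vector<rectangle>> boxes, test_boxes;
    // load_image_dataset reads an imglab-style XML annotation file
    load_image_dataset(images, boxes, "training.xml");
    load_image_dataset(test_images, test_boxes, "testing.xml");

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);          // task dependent

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                                    // SVM regularisation parameter

    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // quick sanity check on the test set: precision, recall, average precision
    std::cout << test_object_detection_function(detector, test_images, test_boxes) << std::endl;

    serialize("groundrobotside.svm") << detector;        // save the finished classifier
}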

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and the right, while the front and rear views are also very similar. This is especially true if the working principle of HOG (2223) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front or the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they can not only detect the position of the robot but, depending on which one of them detects it, the system also gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 434). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Anything aside from the side or front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even when viewed from above or sideways. Although training SVMs for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 42 a comparison of training images and the produced HOG detectors is shown. 42(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. 42(b) is the visualized final side-view detector; both the wheels and the body are easy to recognize, thanks to their strong edges. In 42(c) a training image from the front-view detector training image set is displayed. 42(d) shows the visualized final front-view detector; notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers turn out to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 435 for more details.

Figure 42: (a) A typical training image for the side-view HOG detector. (b) The final side-view HOG descriptor visualized. (c) A training image from the front-view HOG descriptor training image set. (d) The final front-view detector visualized. Notice the strong lines around the typical edges of the training images.

432 Sliding window method

A very important property of these trained methods (222) is that their training datasets (the images of the robot or any other object) are usually cropped, only containing the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contains only the robot, there is no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.

Figure 43: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with predefined step sizes and scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 43 for a representation.
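As an illustration of the principle (in the actual system Dlib's image pyramid scanner performs this search internally), a naive sliding window loop could look like the following sketch; the step size, scale factor and window size are example values only, and the classify callback stands in for the trained detector.

#include <opencv2/opencv.hpp>
#include <vector>
#include <functional>

std::vector<cv::Rect> slidingWindow(const cv::Mat& image,
                                    const cv::Size& window,
                                    const std::function<bool(const cv::Mat&)>& classify)
{
    std::vector<cv::Rect> hits;
    const int step = 8;                               // pixels between window positions
    for (double scale = 1.0; ; scale *= 1.25) {       // image pyramid
        cv::Mat resized;
        cv::resize(image, resized, cv::Size(), 1.0 / scale, 1.0 / scale);
        if (resized.cols < window.width || resized.rows < window.height) break;

        for (int y = 0; y + window.height <= resized.rows; y += step)
            for (int x = 0; x + window.width <= resized.cols; x += step) {
                cv::Mat patch = resized(cv::Rect(x, y, window.width, window.height));
                if (classify(patch))                  // crop handed to the trained detector
                    hits.emplace_back(cvRound(x * scale), cvRound(y * scale),
                                      cvRound(window.width * scale),
                                      cvRound(window.height * scale));
            }
    }
    return hits;                                      // detections in original image coordinates
}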

It is worth mentioning that it is possible that multiple instances of the sought object are present in the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

433 Pre-filtering

As listed in section 41, two of the most important challenges are speed and the limitations of the available resources. In subsection 432 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.

Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Based on intuition, a good separating feature is colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure setting and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 44 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus these areas are still scanned.
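A possible form of such an edge-based pre-filter is sketched below; the Canny thresholds, dilation amount and minimum area are illustrative assumptions, not the values used in the thesis.

#include <opencv2/opencv.hpp>
#include <vector>

// Returns coarse regions of interest based on edge density.
std::vector<cv::Rect> edgeBasedRois(const cv::Mat& frameBgr)
{
    cv::Mat gray, edges;
    cv::cvtColor(frameBgr, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 80, 160);                        // edge map

    // Merge nearby edges into blobs so that edge-rich areas form regions
    cv::dilate(edges, edges, cv::Mat(), cv::Point(-1, -1), 5);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> rois;
    for (const auto& c : contours) {
        cv::Rect box = cv::boundingRect(c);
        if (box.area() > 2000)                              // drop tiny clutter
            rois.push_back(box);
    }
    return rois;                                            // scan only these with HOG
}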

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 435 for details.

434 Tracking

In computer vision, tracking means following a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

Figure 44: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

In the previous subsection (433) the idea of filtering by prior detections was presented. Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives, it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (324) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area that needs to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 435 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
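A minimal sketch of how Dlib's correlation tracker is typically initialised from a detection box and updated on later frames; the file name and the box coordinates are placeholders, not values from the project.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("testVideo1.avi");           // placeholder input
    cv::Mat frame;
    cap.read(frame);
    dlib::cv_image<dlib::bgr_pixel> firstFrame(frame);

    // Suppose the HOG+SVM detector returned this bounding box on the first frame
    dlib::rectangle detection(100, 150, 260, 280);

    dlib::correlation_tracker tracker;
    tracker.start_track(firstFrame, detection);

    while (cap.read(frame)) {
        dlib::cv_image<dlib::bgr_pixel> img(frame);
        tracker.update(img);                           // follow the object
        dlib::drectangle pos = tracker.get_position();
        cv::rectangle(frame,
                      cv::Rect(static_cast<int>(pos.left()),  static_cast<int>(pos.top()),
                               static_cast<int>(pos.width()), static_cast<int>(pos.height())),
                      cv::Scalar(0, 255, 255), 2);     // draw the tracked region
        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;               // Esc to quit
    }
}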

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

435 Implemented detector

In subsection 431 the advantages and disadvantages of the chosen object recognition methods were considered, and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 432, 433 and 434.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits, or simply detections.

During the development, four different approaches were implemented. Each is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame rates, see 512), more accurate (see 511), or both. The methods are listed below:

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as separate software. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.

Table 41: Table of the available parameters

Name                 Valid values        Function
input                path to video       video used as input for the detection
svm                  path(s) to SVMs     these SVMs will be used
mode                 [1, 2, 3, 4]        selects which mode is used
saveFrames           [0, 1]              turns on video frame export
saveDetections       [0, 1]              turns on detection box export
saveFPS              [0, 1]              turns on frame-rate measurement
displayVideo         [0, 1]              turns on video display
DetectionsFileName   string              sets the filename for saved detections
FramesFolderName     string              sets the folder name used for saving video frames
numberOfLoops        positive integer    sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 41 summarises all the implemented parameters, with the possible values and a short description.

Note that all these parameters have default values, thus it is possible but not compulsory to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections) nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 45 for a screenshot of the interface after a parameter file is loaded.

Figure 45: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring the processing frame rate).

4351 Mode 1 Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 431), both are "slid" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this exhaustive search is very computationally heavy, especially with two classifiers. This results in the lowest frames-per-second rate of all the methods.

4352 Mode 2 Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two used classifiers were trained: in figure 42 it can be seen that one of them represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but resulting from the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again, until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in much faster processing, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
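A hypothetical illustration of this classifier memory is sketched below; the function, the variable names and the tolerance value are invented for the sketch and do not come from the thesis's code.

#include <dlib/image_processing.h>
#include <vector>

typedef dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6>>> fhog_detector;

template <typename image_type>
std::vector<dlib::rectangle> detectMode2(const image_type& frame,
                                         std::vector<fhog_detector>& detectors,
                                         int& lastUsed, int& missedFrames,
                                         const int toleranceLimit = 5)
{
    // If one detector found the robot recently, try only that one
    if (lastUsed >= 0 && missedFrames <= toleranceLimit) {
        std::vector<dlib::rectangle> dets = detectors[lastUsed](frame);
        if (!dets.empty()) { missedFrames = 0; return dets; }
        ++missedFrames;
        return dets;                           // empty: keep counting the misses
    }
    // Otherwise fall back to running every classifier (as in mode 1)
    for (size_t i = 0; i < detectors.size(); ++i) {
        std::vector<dlib::rectangle> dets = detectors[i](frame);
        if (!dets.empty()) { lastUsed = static_cast<int>(i); missedFrames = 0; return dets; }
    }
    return {};
}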

4353 Mode 3 Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 433 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Figure 46: An example output of mode 3. The green rectangle marks the region of interest; the blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection alongside the detector which returned it. A new rectangle named ROI (region of interest) was introduced, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, for the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default) and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returned detections inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction anymore. Eventually, after a series of missing detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
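The ROI update itself can be illustrated with the following sketch; the helper name is invented, while the growth percentages match the defaults mentioned above.

#include <dlib/geometry.h>

// Enlarge the ROI around its centre by the given fraction of its size,
// then clip it to the image borders so it never leaves the frame.
dlib::rectangle growRoi(const dlib::rectangle& roi, double fraction,
                        long imgWidth, long imgHeight)
{
    const long dx = static_cast<long>(roi.width()  * fraction / 2.0);
    const long dy = static_cast<long>(roi.height() * fraction / 2.0);
    dlib::rectangle grown = dlib::grow_rect(roi, dx, dy);
    return grown.intersect(dlib::rectangle(0, 0, imgWidth - 1, imgHeight - 1));
}

// Usage idea:
//   on a successful detection: roi = growRoi(detectionBox, 0.50, w, h);  // +50%
//   on a missed frame:         roi = growRoi(roi,          0.03, w, h);  // +3%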

See figure 46 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case, it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.

Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", due to the modular architecture introduced in 42.

4354 Mode 4 Tracking based approach

As mentioned in 434 Dlib library has a built-in tracker algorithm based on thevery resent research [76]

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, in appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides only (note that the detectors are trained for the sides, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards, the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.
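The validation loop can be sketched as follows, assuming the Dlib correlation tracker mentioned in subsection 4.3.4 is used on 8-bit BGR frames. The helper structure, the tolerance value and the enlargement factor are illustrative assumptions rather than the exact implementation.

    #include <dlib/image_processing.h>   // dlib::correlation_tracker
    #include <dlib/opencv.h>             // dlib::cv_image wrapper around cv::Mat
    #include <opencv2/core.hpp>

    // Illustrative sketch of the mode 4 bookkeeping (names and values are assumptions).
    struct TrackedTarget
    {
        dlib::correlation_tracker tracker;
        int  missedValidations = 0;
        int  tolerance         = 10;     // assumed limit of failed validations before "lost"
        bool lost              = true;

        // Called when a detector fires inside the ROI: restart the tracker on a patch
        // somewhat larger than the detected side, because bigger patches track better.
        void reinit(const cv::Mat& frame, const dlib::rectangle& detection)
        {
            dlib::rectangle patch = dlib::grow_rect(detection, detection.width() / 4);
            tracker.start_track(dlib::cv_image<dlib::bgr_pixel>(frame), patch);
            missedValidations = 0;
            lost = false;
        }

        // Called on every frame; 'validated' tells whether a detector confirmed the box.
        dlib::drectangle step(const cv::Mat& frame, bool validated)
        {
            tracker.update(dlib::cv_image<dlib::bgr_pixel>(frame));
            if (validated)                            missedValidations = 0;
            else if (++missedValidations > tolerance) lost = true;
            return tracker.get_position();
        }
    };

Once the lost flag is set, the caller would fall back to the mode 3 behaviour (growing the ROI every frame) and call reinit() again on the next successful detection.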

See figure 4.7(a) for a representation of the processing method of mode 4.


(a) Example output frame of mode 4

(b) Typical error of the tracker

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier. The yellow box is the currently tracked area. The green rectangle marks the estimated region of interest. (b) A typical error of the tracker. The yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of such a scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement will have a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. It is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (from 135° to −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the Lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and is marked in blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.


On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the Lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for Lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file-name is user defined.

Since the Lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved


Figure 4.8: Schematic figure to represent the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the Lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


to the log file as well. However, since the two systems were not connected, the timestamps had to be synchronized manually.
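A possible way to carry out this off-line synchronization is a nearest-neighbour match between the two time-stamped logs, sketched below in C++. The structures and the manually determined clock offset are assumptions for illustration; the real log files contain more fields than shown here.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Hypothetical log records; the real csv and orientation logs contain more fields.
    struct ImuSample { double t; double pitch; };   // time-stamp [s], pitch [rad]
    struct LidarScan { double t; /* range readings ... */ };

    // Returns the orientation sample recorded closest in time to a given scan,
    // after applying a manually determined offset between the two clocks.
    // Assumes the IMU log is non-empty and sorted by time-stamp.
    const ImuSample& matchOrientation(const std::vector<ImuSample>& imu,
                                      const LidarScan& scan, double clockOffset)
    {
        const double t = scan.t + clockOffset;
        auto it = std::lower_bound(imu.begin(), imu.end(), t,
                                   [](const ImuSample& s, double v) { return s.t < v; });
        if (it == imu.begin()) return *it;
        if (it == imu.end())   return imu.back();
        return (std::fabs(it->t - t) < std::fabs((it - 1)->t - t)) ? *it : *(it - 1);
    }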

On figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel with the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The Z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface.

• roll is the rotation around the y axis.

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszár managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robot Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) Picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes; 4 is a chair; 5 is a student at his desk; 6, 7 are windows; 8 is the entrance; 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections. However, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z (x = y × z), pointing to the right;

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward;

Figure 4.12: Presentation of the axes used in the Android application.



• the z axis is pointing up and is perpendicular to the ground.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three represent its position along the x, y, z axes. The other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change. Thus only pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)    (4.1)

y = distance · cos(yaw) · sin(pitch)    (4.2)

z = distance · cos(yaw) · cos(pitch)    (4.3)

Where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory, the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen on figure 4.8, part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device in another way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change, and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated from dy and dz:

dy = offset · sin(pitch)    (4.4)

dz = offset · cos(pitch)    (4.5)


Where dy is the translation required along the y axis and dz is along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8, part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

[x; y; z] = distance · [sin(−yaw); cos(yaw) · sin(pitch); cos(yaw) · cos(pitch)] + offset · [0; sin(pitch); cos(pitch)]    (4.6)

where the three components of each bracketed column vector correspond to the x, y and z axes respectively.

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (Simultaneous Localization and Mapping) or the ground robot detection.
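For illustration, equation 4.6 translates directly into a short routine; the sketch below assumes angles in radians and distances in the lidar's own units, and the function name is only illustrative.

    #include <cmath>

    struct Point3D { double x, y, z; };

    // Transcription of equation 4.6: 'yaw' is the beam angle inside the scan plane,
    // 'pitch' is the tilt of that plane and 'offset' is the measured distance between
    // the light source and the tilt axis.
    Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
    {
        Point3D p;
        p.x = distance * std::sin(-yaw);
        p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
        p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
        return p;
    }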


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system, to track the improvements of the code and to check if it meets the predefined requirements.

To understand what 'better' means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive. Similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite the fact that it is not. In other words, the system believes


the object is present at that location, although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (it is rejected) although it should be. In this current task a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions of the detection method are loosened, false negative errors may be reduced. However, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid some mistakes. Such a project can be a manually supervised classification, where appearing false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is on the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset with its samples already labelled (fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or a ROI. These algorithms are usually sliding window based (see subsection 4.3.2); their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. For this purpose the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player, which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object as occluded, obstructed or outside of the frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only remaining task is to correct the interpolations between them, if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This will guarantee that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.
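As an illustration, a simple loader for such an export could look like the sketch below. The assumed column layout (a frame id followed by the four box coordinates) is a simplification; the real Vatic dump contains further fields (labels, occlusion flags) that would have to be handled as well.

    #include <fstream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <opencv2/core.hpp>

    // Loads annotated boxes into a frame-id -> box map. The column layout is an
    // assumption for this sketch, not the exact Vatic export format.
    std::map<int, cv::Rect> loadAnnotations(const std::string& path)
    {
        std::map<int, cv::Rect> boxes;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line))
        {
            std::istringstream row(line);
            int frame, xmin, ymin, xmax, ymax;
            if (row >> frame >> xmin >> ymin >> xmax >> ymax)
                boxes[frame] = cv::Rect(cv::Point(xmin, ymin), cv::Point(xmax, ymax));
        }
        return boxes;
    }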

5.1.2 Frame-rate measurement and analysis

In section 4.1 speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
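A minimal sketch of such a measurement in C++ is shown below. The exported format (one elapsed time in seconds per line) and the function and file names are assumptions for illustration, not the project's exact logging code.

    #include <chrono>
    #include <fstream>
    #include <functional>
    #include <string>

    // Measures and exports the processing time of each frame, one value per line.
    // 'processFrame' stands for one iteration of the detector.
    void timeLoop(const std::function<void()>& processFrame, int frameCount,
                  const std::string& logPath)
    {
        std::ofstream log(logPath);
        for (int i = 0; i < frameCount; ++i)
        {
            const auto t0 = std::chrono::steady_clock::now();
            processFrame();
            const auto t1 = std::chrono::steady_clock::now();
            log << std::chrono::duration<double>(t1 - t0).count() << '\n';
        }
    }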

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes caused by the operating system during the tests did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the tests and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time; the variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

    Loaded 100 files
    Number of frames in video: 1080
    Total number of frames processed: 108000
    Longest processing time: 0.813
    Shortest processing time: 0.007
    Average elapsed seconds: 0.032446
    Variance of elapsed seconds per frame:
        between video loops: 8.0297e-07
        across the video: 0.0021144
    Average frame per second rate of the processing: 30.8204 FPS

The changes of the average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and thus the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors has to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that 3D lidar scanners exist, but they still scan in a plane and simply have this processing built in. Also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras; it contains valuable information for a project like this and makes indoor navigation easier. Therefore experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

5.3 Discussion of results

In general it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos of all modes can be found here¹. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/Cranfield/Videos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be measured by several values; here recall and precision will be used. Recall is defined as

Recall = TP / (TP + FN)

Where TP is the number of true positive detections and FN is the number of false negatives. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined as

Precision = TP / (TP + FP)

Where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, precision can be improved at the cost of an increased number of false negative errors.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on position and the overlapping areas. For the latter a minimum of 50% was defined (the ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.
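The overlap criterion and the two measures translate into a few lines of code. The sketch below uses OpenCV rectangles and is only an illustration of the rule described above, not the evaluation tool itself.

    #include <opencv2/core.hpp>

    // A detection counts as a hit if the intersection of the two rectangles covers
    // at least 50% of their union (the threshold used in this evaluation).
    bool isHit(const cv::Rect& detection, const cv::Rect& annotation)
    {
        const double inter = (detection & annotation).area();
        const double uni   = detection.area() + annotation.area() - inter;
        return uni > 0.0 && inter / uni >= 0.5;
    }

    // The two measures computed from the accumulated counters.
    double recall   (int tp, int fn) { return tp / double(tp + fn); }
    double precision(int tp, int fp) { return tp / double(tp + fp); }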

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way, except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames per second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode      Recall    Precision    FPS       Variance
mode 1    0.623     1            4.2599    0.00000123
mode 2    0.622     1            6.509     0.0029557
mode 3    0.645     1            12.06     0.0070877
mode 4    0.955     0.898        30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall 95%) and speed (average frame-rate 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased: as mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown of their application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the development environment was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend, or to replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 summarises the chosen feature extraction (HOG) and classifier (SVM) methods and presents the implemented training software along with the


produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features like exporting detections, demonstration videos and frame rate measurements were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well, if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.

Special attention was given to the evaluation of the system. Two additional software tools were developed: one for evaluating the efficiency of the detections and another for analysing the processing times and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with a smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 summarises the progress made related to 3D mapping, along with the applied mathematics. An


experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely frontal views), will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Besides the rectangle which marks the estimated position of the object, the tracker returns a confidence value, which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Also, further processing of the returned position is recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so that corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr. Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview – senseFly." [Online]. Available: httpswwwsenseflycomdronesoverviewhtml [Accessed 2015-08-08].

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: httparcaiaaorgdoiabs10251462007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed 2015-08-08].

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: httpstoredjicomproductphantom-3-standard [Accessed 2015-08-08].

[6] "World War II V-1 Flying Bomb – Military History." [Online]. Available: httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed 2015-08-08].

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science,


2007, pp. 1–13. [Online]. Available: httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004, Proceedings, ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings – IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings – 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-based roe deer fawn detection system," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: httparcaiaaorgdoiabs10251462010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: httpwwwmdpicom2072-42924113390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: httplinkspringercom101007s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: httpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings – IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: httpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 – Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: httpdlacmorgcitationcfmid=25229682522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online].


Available: httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2, SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: httpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection – A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 – Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: httplinkaiporglinkPSI4387951ampAgg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop – Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997.

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05) – Volume 1, 2005, pp. 886–893. [Online]. Available: httpdxdoiorg101109CVPR2005177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: httpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: httplinkspringercom101007BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: httpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: http://boundarydevices.com/product/nitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: http://rover.ardupilot.com/wiki/pixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/utm_30lx_ew.html [Accessed at 2015-07-22]

[72] "OpenCV manual, Release 2.4.9." [Online]. Available: http://docs.opencv.org/opencv2refman.pdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: http://jmlr.csail.mit.edu/papers/v10/king09a.html

[74] "dlib C++ Library." [Online]. Available: http://dlib.net [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1502.00046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/download/data/UrgBenri.htm [Accessed at 2015-07-20]

[78] "vatic - Video Annotation Tool - UC Irvine." [Online]. Available: http://web.mit.edu/vondrick/vatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: https://www.mturk.com/mturk/welcome [Accessed at 2015-07-26]


• List of Figures
• Absztrakt
• Abstract
• List of Abbreviations
• 1 Introduction and project description
  • 1.1 Project description and requirements
  • 1.2 Type of vehicle
  • 1.3 Aims and objectives
• 2 Literature Review
  • 2.1 UAVs and applications
    • 2.1.1 Fixed-wing UAVs
    • 2.1.2 Rotary-wing UAVs
    • 2.1.3 Applications
  • 2.2 Object detection on conventional 2D images
    • 2.2.1 Classical detection methods
      • 2.2.1.1 Background subtraction
      • 2.2.1.2 Template matching algorithms
    • 2.2.2 Feature descriptors, classifiers and learning methods
      • 2.2.2.1 SIFT features
      • 2.2.2.2 Haar-like features
      • 2.2.2.3 HOG features
      • 2.2.2.4 Learning models in computer vision
      • 2.2.2.5 AdaBoost
      • 2.2.2.6 Support Vector Machine
• 3 Development
  • 3.1 Hardware resources
    • 3.1.1 Nitrogen board
    • 3.1.2 Sensors
      • 3.1.2.1 Pixhawk autopilot
      • 3.1.2.2 Camera
      • 3.1.2.3 LiDar
  • 3.2 Chosen software
    • 3.2.1 Matlab
    • 3.2.2 Robotic Operating System (ROS)
    • 3.2.3 OpenCV
    • 3.2.4 Dlib
• 4 Designing and implementing the algorithm
  • 4.1 Challenges in the task
  • 4.2 Architecture of the detection system
  • 4.3 2D image processing methods
    • 4.3.1 Chosen methods and the training algorithm
    • 4.3.2 Sliding window method
    • 4.3.3 Pre-filtering
    • 4.3.4 Tracking
    • 4.3.5 Implemented detector
      • 4.3.5.1 Mode 1: Sliding window with all the classifiers
      • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      • 4.3.5.4 Mode 4: Tracking based approach
  • 4.4 3D image processing methods
    • 4.4.1 3D recording method
    • 4.4.2 Android based recording set-up
    • 4.4.3 Final set-up with Pixhawk flight controller
    • 4.4.4 3D reconstruction
• 5 Results
  • 5.1 2D image detection results
    • 5.1.1 Evaluation
      • 5.1.1.1 Definition of True positive and negative
      • 5.1.1.2 Definition of False positive and negative
      • 5.1.1.3 Reducing number of errors
      • 5.1.1.4 Annotation and database building
    • 5.1.2 Frame-rate measurement and analysis
  • 5.2 3D image detection results
  • 5.3 Discussion of results
• 6 Conclusion and recommended future work
  • 6.1 Conclusion
  • 6.2 Recommended future work
• References

Abstract

In this paper an unmanned aerial vehicle based airborne tracking system is presented, with the aim of detecting objects indoor (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging of the UAV.

First, the project and the aims and objectives of this thesis are introduced and discussed.

Then an extensive literature review is presented to give an overview of the existing solutions for similar problems. Unmanned Aerial Vehicles are presented with examples of their application fields. After that, the most relevant object recognition methods are reviewed and their suitability for the discussed project is assessed.

Then the environment of the development will be described, including the available software and hardware resources.

Afterwards, the challenges of the task are collected and discussed. Considering the objectives, resources and the challenges, a modular architecture is designed and introduced.

As one of the most important modules, the currently used detector is introduced along with its features, modes and user interface.

Special attention is given to the evaluation of the system. Additional evaluating tools are introduced to analyse efficiency and speed.

The first version of the ground robot detecting algorithm is evaluated, giving promising results in simulated experiments.

While this thesis focuses on two dimensional image processing and object detection methods, 3D image inputs are considered as well. An experimental setup is introduced with the capability to create spectacular and precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV and presented in this paper as well.

Finally, all implemented methods and relevant results are concluded.


Chapter 1

Introduction and project description

In this chapter an introduction is given to the whole project which this thesis is part of. Afterwards, a structure of the sub-tasks in the project is presented, along with the recognized challenges. Then the aims and objectives of this thesis are listed.

1.1 Project description and requirements

The project's main aim is to build a complete system based on one or more unmanned autonomous vehicles which are able to carry out an indoor 3D mapping of a building (possible outdoor operations are kept in mind as well). A map like this would be very useful for many applications, such as surveillance, rescue missions, architecture, renovation, etc. Furthermore, if a 3D model exists, other autonomous vehicles can navigate through the building with ease. Later on in this project, after the map building is ready, a thermal camera is planned to be attached to the vehicle to seek, find and locate heat leaks as a demonstration of the system. A system like this should be:

1. fast: Building a 3D map requires a lot of measurements and processing power, not to mention the essential functions: stabilization, navigation, route planning, collision avoidance. However, to build a useful tool, all these functions should be executed simultaneously, mostly on-board, nearly real-time. Furthermore, the mapping process itself should be finished in reasonable time (depending on the size of the building).

2. accurate: Reasonably accurate recording is required so the map would be suitable for the execution of further tasks. Usually this means a maximum error of 5-6 cm. Errors of similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision (for example, mapping a dangerous area which would be too far for real-time control) to coordinate the process. Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all a significant challenge for both the vehicle (plan a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-layer buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, therefore longer operation time. On the other hand, it can't elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles closing its way or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable for the ground vehicle. With sophisticated route planning the areas scanned twice can be minimized, resulting in a faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As it can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.

To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution for this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction of unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief overview of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, unmanned aerial vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of the much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize.


Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

However, they still have a quite extended market of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using little or no power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurance using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long endurance, high altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In Figure 2.1 a consumer fixed-wing UAV designed specially for mapping is shown.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly as the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight where space is limited.


Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation. Multirotors must individually adjust the thrust of each motor: if motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason why no large-scale multirotors are used: as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increasing operational costs [4].


In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, as many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See Figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example for this application field is the location of an earthquake or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs were admittedly proven useful is surveillance. [10] is an interesting example for this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers. They can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples for this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles aimed to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them by:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item

• Recognition: localization of all the items present in the stimulus

• Understanding: recognition plus the role of the stimulus in the context of the scene

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.

• Naming: Given a large complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As it can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as a detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Beside the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for easy detection tasks or for pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differentiating a few frames. Although several methods exist (see [36] or [37] for a good overview), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example for background subtraction used for people detection. The red patches show the object in the segmented foreground. Source: [39]

These solutions could give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means fixed perspectives, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measuring can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object, as for example in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.
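Although it was ruled out for the airborne camera, a minimal sketch shows how simply such a pre-filter can be built. The snippet below uses OpenCV's Gaussian-mixture background model (the BackgroundSubtractorMOG2 class of the 2.4 API referenced in [72]; newer releases create it via a factory function); the file name and the morphological clean-up are illustrative assumptions, not part of this project.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("static_camera.avi");   // hypothetical recording from a fixed camera
    cv::BackgroundSubtractorMOG2 subtractor;      // adaptive Gaussian-mixture background model
    cv::Mat frame, fgMask;

    while (cap.read(frame)) {
        subtractor(frame, fgMask);                // update the model and get the foreground mask
        // remove small noise blobs; the remaining connected components are moving-object candidates
        cv::erode(fgMask, fgMask, cv::Mat());
        cv::dilate(fgMask, fgMask, cv::Mat());
        cv::imshow("foreground", fgMask);
        if (cv::waitKey(30) == 27) break;         // ESC stops the loop
    }
    return 0;
}
```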

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern on the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross correlation has to be calculated at every position. Several efforts were published to reduce the processing time: [43] uses normalized cross correlation for faster comparison, while [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.
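For illustration, a minimal sketch of the shift-and-compare idea using OpenCV's matchTemplate with normalized cross-correlation is given below; the image and template file names are assumptions, and no claim is made that this configuration was used in the project.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);     // hypothetical input image
    cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);  // pattern to search for

    // Normalized cross-correlation computed for every (u, v) shift of the template
    cv::Mat response;
    cv::matchTemplate(image, templ, response, cv::TM_CCORR_NORMED);

    // The highest response is assumed to mark the correct position of the pattern
    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(response, 0, &maxVal, 0, &maxLoc);

    cv::rectangle(image, maxLoc, maxLoc + cv::Point(templ.cols, templ.rows), cv::Scalar(255), 2);
    cv::imwrite("match.png", image);
    return 0;
}
```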


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to the vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well: [48] recognizes edge based and patch based features, while [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels. The size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see subsection 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion about his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: The first stage searches over the image in multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: At each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: After selecting the interesting points, the gradients around them are calculated in a selected scale.

The biggest advantage of the descriptor is the scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side-view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is reached by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
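As a hedged illustration of this keypoint-matching workflow, the sketch below extracts SIFT keypoints with OpenCV's non-free module (available in the 2.4 series referenced in [72]) and keeps matches passing Lowe's ratio test; the file names and the 0.75 ratio threshold are assumptions, not values taken from the project.

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SIFT lives in the non-free module in OpenCV 2.4

int main() {
    cv::Mat object = cv::imread("robot_view.png", cv::IMREAD_GRAYSCALE);  // training image of the object
    cv::Mat scene  = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);       // current camera frame

    cv::SIFT sift;
    std::vector<cv::KeyPoint> kpObject, kpScene;
    cv::Mat descObject, descScene;
    sift(object, cv::noArray(), kpObject, descObject);   // detect keypoints and compute descriptors
    sift(scene,  cv::noArray(), kpScene,  descScene);

    // Match each object descriptor to its two nearest neighbours in the scene
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knnMatches;
    matcher.knnMatch(descObject, descScene, knnMatches, 2);

    // Lowe's ratio test: keep a match only if it is clearly better than the second best
    std::vector<cv::DMatch> goodMatches;
    for (size_t i = 0; i < knnMatches.size(); ++i)
        if (knnMatches[i][0].distance < 0.75f * knnMatches[i][1].distance)
            goodMatches.push_back(knnMatches[i][0]);

    // A sufficient number of good matches suggests the object is present;
    // their keypoint coordinates give its location in the scene.
    return 0;
}
```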

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See Figure 2.5 for an example of two and three rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.

Figure 2.5: Two example Haar-like features that are proven to be effective separators. The top row shows the features themselves. The bottom row presents the features overlayed on a typical face. The first feature corresponds with the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of areas around the eyes compared to the nose. Source: [45]

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the base of many implemented face detection methods (in consumer cameras, for example) and inspired several researches. [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are used widely in computer vision for various tasks: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is the rotation-variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.
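To make the integral-image computation concrete, the following self-contained sketch (an illustration, not code from the project) builds the integral image and evaluates a horizontal two-rectangle feature with a handful of additions and subtractions.

```cpp
#include <cstdint>
#include <vector>

// Integral image with one extra row/column of zeros: ii[y][x] holds the sum of all
// pixels above and to the left of (x, y), so any rectangle sum needs only 4 lookups.
std::vector<std::vector<long> > buildIntegralImage(const std::vector<std::vector<uint8_t> >& img) {
    const size_t h = img.size(), w = img[0].size();
    std::vector<std::vector<long> > ii(h + 1, std::vector<long>(w + 1, 0));
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x];
    return ii;
}

// Sum of pixel intensities inside the w-by-h rectangle whose top-left corner is (x, y).
long rectSum(const std::vector<std::vector<long> >& ii, int x, int y, int w, int h) {
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
}

// A horizontal two-rectangle Haar-like feature: left half minus right half.
long twoRectFeature(const std::vector<std::vector<long> >& ii, int x, int y, int w, int h) {
    return rectSum(ii, x, y, w / 2, h) - rectSum(ii, x + w / 2, y, w / 2, h);
}
```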

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and produces a histogram of the different orientations (called bins) after quantization, in small portions (called cells) of the image. After a location based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference in the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptors (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to the orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in a superior robustness against illumination changes over SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many kinds of objects as well, as long as the class has a characteristic shape with significant edges.
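A brief, hedged sketch of the descriptor in use: OpenCV ships a HOG descriptor with the Dalal-Triggs pedestrian detector [53] already trained, so a few lines are enough to run it over an image. The file name is an assumption, and the parameters (64×128 window, 16×16 blocks, 8×8 cells, 9 bins) are OpenCV's defaults, not necessarily those used later in this project.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat frame = cv::imread("street.png");          // hypothetical test image

    // Default HOG parameters: 64x128 detection window, 16x16 blocks, 8x8 cells, 9 bins
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    // Slide the window over the image at several scales and collect detections
    std::vector<cv::Rect> people;
    hog.detectMultiScale(frame, people);

    for (size_t i = 0; i < people.size(); ++i)
        cv::rectangle(frame, people[i], cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detections.png", frame);
    return 0;
}
```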

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given a test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) P(c)

is ideally 1 if x contains an instance of class c, and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the probability

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See Figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this paper requires one class to detect, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing calculation expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifiers only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize the N · T weights wt,i, where N = number of training examples, T = number of features in the stage.

• For t = 1, ..., T:

  1. Normalize the weights.

  2. Select the best classifier using only a single feature by minimising the detection error εt = Σi wi |h(xi, f, p, θ) − yi|, where h(xi, f, p, θ) is the classifier output and yi is the correct label (both with a range of 0 for negative, 1 for positive).

  3. Define ht(x) = h(x, ft, pt, θt), where ft, pt and θt are the minimizers of the error above.

  4. Update the weights: wt+1,i = wt,i · (εt / (1 − εt))^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise.

• The final classifier for the stage is based on the weighted sum of the selected weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
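A minimal sketch of the re-weighting step above is given below, written for illustration only; the per-sample correctness flags are assumed to come from the weak classifier chosen in step 2.

```cpp
#include <cmath>
#include <vector>

// One boosting round, Viola-Jones style: given the weighted error eps_t of the selected
// weak classifier and a per-sample flag telling whether it classified that sample correctly,
// re-weight the training set so that the next round concentrates on the mistakes.
void updateWeights(std::vector<double>& w, const std::vector<bool>& correct, double eps_t) {
    const double beta = eps_t / (1.0 - eps_t);
    double sum = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        // correctly classified samples are down-weighted by beta; mistakes keep their weight
        if (correct[i]) w[i] *= beta;
        sum += w[i];
    }
    for (size_t i = 0; i < w.size(); ++i)
        w[i] /= sum;                // normalize so the weights form a distribution again
    // The vote of this weak classifier in the final strong classifier is alpha_t = log(1 / beta).
}
```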

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the features of the method were known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the set of features extracted from the train and test data samples as vectors in the feature space. Then one or more hyperplanes (a linear decision surface) are constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same one (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The closest vectors to the hyperplane (and, due to that, the ones determining the width of the margin) are called support vectors (hence the name of the method). See Figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.

The algorithm was extended to not linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares. They define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods were proposed to apply their outstanding performance to multi-class tasks. The most often used approach to do this is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on the comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].
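Since Dlib (see subsection 3.2.4) provides ready-made SVM trainers, a hedged toy example of training a linear two-class SVM is sketched below; the 2D samples, the C value and the class layout are invented purely for illustration and are not the features or settings used in this project.

```cpp
#include <dlib/svm.h>
#include <iostream>

int main() {
    typedef dlib::matrix<double, 2, 1> sample_type;       // 2D feature vectors, for illustration
    typedef dlib::linear_kernel<sample_type> kernel_type;

    std::vector<sample_type> samples;
    std::vector<double> labels;                            // +1 / -1 class labels

    // Toy training set: class +1 clustered around (2, 2), class -1 around (-2, -2)
    for (int i = -2; i <= 2; ++i) {
        sample_type p;
        p = 2.0 + 0.1 * i, 2.0 - 0.1 * i;
        samples.push_back(p);  labels.push_back(+1);
        p = -2.0 + 0.1 * i, -2.0 - 0.1 * i;
        samples.push_back(p);  labels.push_back(-1);
    }

    // Train a soft-margin linear SVM; C controls the margin/error trade-off
    dlib::svm_c_linear_trainer<kernel_type> trainer;
    trainer.set_c(10);
    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

    sample_type query;
    query = 1.5, 1.5;
    std::cout << "decision value: " << df(query) << std::endl;  // positive -> the +1 side
    return 0;
}
```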


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in Section 1.3. First, the available hardware resources (that is, the processing unit and sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the used software libraries will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarizes the sensors integrated on-board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In Figure 3.1 the unit is shown with optional accessories.

Also, the Pixhawk will be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable to use it for mapping purposes. See section 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide angle field of view.


Figure 3.2: The chosen LIDAR sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 LiDar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. The basic concept of these sensors is to do this by illuminating objects with a laser beam. The reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed to help developing unmanned vehicles, like simulation and visualization software or excellent general message passing. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.
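As a small, hedged illustration of the message passing mentioned above, the roscpp node below subscribes to the laser scans that the Hokuyo driver typically publishes on a topic named scan; the topic name and the node name are assumptions, not the project's actual configuration.

```cpp
#include <ros/ros.h>
#include <sensor_msgs/LaserScan.h>

// Called every time the lidar driver publishes a new 2D scan
void scanCallback(const sensor_msgs::LaserScan::ConstPtr& scan) {
    ROS_INFO("Received a scan with %lu range readings", (unsigned long)scan->ranges.size());
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "scan_listener");
    ros::NodeHandle nh;
    ros::Subscriber sub = nh.subscribe("scan", 10, scanCallback);  // queue size of 10 messages
    ros::spin();                                                   // hand control to ROS
    return 0;
}
```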

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate image), image processing tools (e.g. different kinds of edge detection, threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, examples and support are widely available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.
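A hedged sketch of this division of labour is shown below: OpenCV grabs the frames and Dlib wraps them without copying via dlib::cv_image. The device index and window handling are illustrative assumptions, and the detector call is only indicated by a comment.

```cpp
#include <opencv2/opencv.hpp>
#include <dlib/opencv.h>

int main() {
    cv::VideoCapture capture(0);                 // camera attached to the laptop (device 0 assumed)
    if (!capture.isOpened()) return 1;

    cv::Mat frame;
    while (capture.read(frame)) {
        // Wrap the OpenCV frame as a Dlib image without copying the pixel data
        dlib::cv_image<dlib::bgr_pixel> dlibFrame(frame);

        // ... run the Dlib-based detector on dlibFrame here ...

        cv::imshow("input", frame);
        if (cv::waitKey(1) == 27) break;         // ESC quits
    }
    return 0;
}
```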

3.2.4 Dlib

Dlib is a general purpose cross-platform C++ library. It has an excellent machine learning library, with the aim to provide a machine learning software development toolkit for the C++ language (see Figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice to use for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that aside from the features above, Dlib also contains components to handle linear algebra, threading, network IO, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter, the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm is introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus the solution of this task had to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is a detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are limited significantly. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller

aerial vehicles can carry one and record high quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: a small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. Also, they add noise and blur to the recordings. And third, no background-foreground separation algorithms are available for the camera. See subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision, the point of view of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various size of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing manoeuvre, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms. This is provided by another algorithm specially designed for that task. Thus the software discussed here does not have real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.

[Figure 4.1 is an architecture diagram. Its blocks are: Sensors (camera, 2D lidar), Video reader, Current frame, Regions of Interest, Other preprocessing (edge/colour detection), 3D map, Vatic annotation server, Trainer (Support Vector Machines), Detector algorithm (front SVM, side SVM), Tracking, Detections and Evaluation.]

Figure 4.1: Diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow.

4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed, which is suitable for the objectives defined. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system. On the other hand, it provides a more robust detection system for the robot. The next enumeration lists the main parts of the architecture; every already implemented module will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used. Neither of these is the case in this project.)
As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. This production has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture, with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.

Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered as a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required.
Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D map. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented. All of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and

debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date. Thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the 2D camera image processing concerns, the methods and the process of the development are presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and tool-kits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable to find patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance of the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable to recognise

a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also because of moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized, either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4). This means that even if the camera is levelled, the object itself could appear rotated, caused by the perspective. Solutions to overcome this issue will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, seriously optimised HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentation are available for both training and detection, which simplified the implementation.

As a classifier, Support Vector Machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection

problems (like this one) they outperform the other popular methods. Also, they are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++. It can be executed from the command line. It needs an xml file as an input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since otherwise a positive sample might get into the negative training images.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar xml file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to the disk with serialization.
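A minimal sketch of such a training run using Dlib's HOG scanner and structural SVM trainer is shown below. It is not the project's exact training code: the xml format is the one Dlib's load_image_dataset expects, and the window size, C parameter and file names are illustrative assumptions.

#include <dlib/svm_threaded.h>
#include <dlib/data_io.h>
#include <dlib/image_processing.h>
#include <iostream>

int main()
{
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    // Images and annotated object boxes loaded from an xml file
    // (file names here are only placeholders).
    dlib::array<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    load_image_dataset(images, boxes, "training.xml");

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);   // rough object size in pixels

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                            // SVM regularisation parameter

    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Quick self-test on the training set, then save the classifier to disk.
    std::cout << test_object_detection_function(detector, images, boxes) << std::endl;
    serialize("groundrobot_side.svm") << detector;
    return 0;
}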

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and right, while the front and rear views are also very similar. This is especially true if the working principle of the HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too. Therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important in the future of

the project. Due to the way the SVMs are trained, they not only can detect the position of the robot, but depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Everything apart from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while the side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector. Notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize, thanks to their strong edges. In 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers turn out to be insufficient, training additional ones is easy to do with the training software presented here. Also, including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that usually their training datasets (the images of the robot or any other object) are cropped, only containing the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contained only the robot, there would be no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the

(a) side-view training image example

(b) side-view HOG detector

(c) front-view training image example

(d) front-view HOG detector

Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front detector visualized. Notice the strong lines around the typical edges of the training images.

Figure 4.3: Representation of the sliding window method.

detector, cropped and resized from the original large input. This can be achieved in several ways, although the most common is the

so-called sliding window method. This algorithm takes the large image and slides a window across it with a predefined step-size and at multiple scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step-sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.
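The sketch below illustrates the idea of generating candidate windows over an image pyramid; in practice Dlib's scan_fhog_pyramid performs this scanning internally, and the window size, step and scale factor used here are arbitrary illustration values.

#include <opencv2/opencv.hpp>
#include <vector>

// Generate candidate windows over the image at several scales, mapped back
// to the coordinates of the original image.
std::vector<cv::Rect> slidingWindows(const cv::Mat& image,
                                     cv::Size window = cv::Size(80, 80),
                                     int step = 16, double scaleFactor = 1.2)
{
    std::vector<cv::Rect> candidates;
    cv::Mat scaled = image.clone();
    double scale = 1.0;
    while (scaled.cols >= window.width && scaled.rows >= window.height)
    {
        for (int y = 0; y + window.height <= scaled.rows; y += step)
            for (int x = 0; x + window.width <= scaled.cols; x += step)
                candidates.push_back(cv::Rect(cvRound(x * scale), cvRound(y * scale),
                                              cvRound(window.width * scale),
                                              cvRound(window.height * scale)));
        scale *= scaleFactor;                         // next, a coarser scale
        cv::resize(image, scaled, cv::Size(), 1.0 / scale, 1.0 / scale);
    }
    return candidates;
}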

It is worth mentioning that multiple instances of the sought object may be present on the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.

Thus methods which could reduce the amount of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module with another or a newly developed one for further development.

Based on intuition, a good feature for separation is the colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good base of segmentation on every input video. Also, one of the test environments used (the laboratory) and many other premises reviewed have a bright floor, which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus those areas are still scanned.

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was

Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

presented. Tracking is the extension of this concept. If the position of the robot is already known, there is no need to find it again; to achieve the objectives it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing an own 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area that needs to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
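A minimal sketch of how Dlib's correlation tracker is driven (not the project's exact code) is shown below; "firstFrame" and "detection" stand for an OpenCV frame and a bounding box returned by one of the SVM detectors.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

void exampleTracking(cv::Mat firstFrame, dlib::rectangle detection,
                     cv::VideoCapture& cap)
{
    dlib::correlation_tracker tracker;
    tracker.start_track(dlib::cv_image<dlib::bgr_pixel>(firstFrame), detection);

    cv::Mat frame;
    while (cap.read(frame))
    {
        // update() returns a confidence value; get_position() is the new box.
        double confidence = tracker.update(dlib::cv_image<dlib::bgr_pixel>(frame));
        dlib::drectangle position = tracker.get_position();
        // ... periodically validate "position" with the HOG+SVM detectors ...
        (void)confidence;
        (void)position;
    }
}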

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimation of the UGV's position even if it is out of the frame,

since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered, and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using 2 SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits, or simply detections.

During the development four different approaches were implemented. Each of them is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1) or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are rather modifications of the same code than separate solutions, they were not implemented as separate software. Instead, all were included in the same code, which decides which mode to execute at run-time, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.

Table 4.1: The available parameters

Name                  Valid values       Function
input                 path to video      video as input for detection
svm                   path to SVMs       these SVMs will be used
mode                  [1, 2, 3, 4]       selects which mode is used
saveFrames            [0/1]              turns on video frame export
saveDetections        [0/1]              turns on detection box export
saveFPS               [0/1]              turns on frame-rate measurement
displayVideo          [0/1]              turns on video display
DetectionsFileName    string             sets the filename for saved detections
FramesFolderName      string             sets the folder name used for saving video
numberOfLoops         integer (>0)       sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There is an option to change the name of the file where the detections are exported, or the name of the folder where the video frames are saved.

Table 4.1 summarises all the implemented parameters, with the possible values and a short description.

Note that all these parameters have default values, thus it is possible but not compulsory to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections)

nor the video frames (saveFrames). The video is not displayed either (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.4), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other mode can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all three.

As can be seen, mode 1 is rather simple. It checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this extensive search is very computationally heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient because of the way the two used classifiers were trained. In figure 4.2 it can be seen that one of them

represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is either viewed from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but resulting from the perspective distortion these detectors will not recognize both sides (and that is not needed either, since there is no reason to find it twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again, until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
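A compact sketch of this classifier "memory" is given below; the variable names and the tolerance value are made up for illustration and are not the project's actual code.

#include <dlib/image_processing.h>
#include <vector>

typedef dlib::object_detector<dlib::scan_fhog_pyramid<dlib::pyramid_down<6> > > detector_type;

template <typename image_type>
std::vector<dlib::rectangle> detectMode2(std::vector<detector_type>& detectors,
                                         const image_type& frame,
                                         int& lastUsed, int& missedFrames,
                                         const int tolerance = 5)
{
    // If one detector found the robot recently, try only that one.
    if (lastUsed >= 0 && missedFrames <= tolerance)
    {
        std::vector<dlib::rectangle> hits = detectors[lastUsed](frame);
        if (!hits.empty()) missedFrames = 0;
        else               ++missedFrames;
        return hits;
    }
    // Otherwise fall back to scanning with every classifier.
    for (size_t i = 0; i < detectors.size(); ++i)
    {
        std::vector<dlib::rectangle> hits = detectors[i](frame);
        if (!hits.empty()) { lastUsed = (int)i; missedFrames = 0; return hits; }
    }
    return std::vector<dlib::rectangle>();
}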

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimation of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, with a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well. For example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the

Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest. The blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that. Note the ratio between the ROI and the full size of the image.

position of the detection, alongside the detector which returned it. A new rectangle, named ROI (Region of Interest), was introduced, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, for the reasons mentioned above (movement of the camera and the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default) and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returns a detection inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image. In this case mode 3 works exactly like mode 2.

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame, the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the region of interest.

43

43 2D image processing methods

Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", thanks to the modular architecture introduced in 4.2.
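An illustrative sketch of how such a ROI could be maintained is shown below; the growth percentages mirror the defaults mentioned above, while the function itself is a simplified stand-in for the project's code rather than a copy of it.

#include <dlib/geometry.h>

dlib::rectangle updateRoi(const dlib::rectangle& lastDetection,
                          bool detectionFound,
                          dlib::rectangle roi,
                          const dlib::rectangle& fullImage,
                          double growOnHit = 0.5, double growOnMiss = 0.03)
{
    if (detectionFound)
    {
        // Re-centre the ROI on the detection and enlarge it by 50% of its size.
        roi = dlib::grow_rect(lastDetection,
                              (long)(lastDetection.width()  * growOnHit / 2),
                              (long)(lastDetection.height() * growOnHit / 2));
    }
    else
    {
        // No detection inside the ROI: enlarge it slightly (3% of its size).
        roi = dlib::grow_rect(roi,
                              (long)(roi.width()  * growOnMiss / 2),
                              (long)(roi.height() * growOnMiss / 2));
    }
    // Never let the ROI leave the image; at worst it equals the full frame.
    return roi.intersect(fullImage);
}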

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm, based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or a part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and, under appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly, while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in detection boxes much larger or smaller than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides only (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards, the algorithm works exactly as mode 3 would do: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.

See figure 4.7(a) for a representation of the processing method of mode 4.

(a) Example output frame of mode 4

(b) Typical error of the tracker

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier. The yellow box is the currently tracked area. The green rectangle marks the estimated region of interest. (b) A typical error of the tracker. The yellow rectangle marks the tracked region, which clearly includes the robot, but is too big to accept as a correct detection.

4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the used lidar sensor is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

In figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). In part a it is shown in two different orientations: first, when the recording plane is completely flat; secondly, when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.

5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined by polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (135°, −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and is marked in blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.

In parts a and b its schematic side and top views are shown, to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification needed. The file name is user defined.

Since both the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved

Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.

Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording, made with the official recording tool [77].

to the log file as well. However, since the two systems were not connected, the timestamps had to be synchronized manually.

In figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. We used a Motorola Moto X (2013) for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel with the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The Z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. The angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface.

• roll is the rotation around the y axis.

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently and produced spectacular 3D images. In figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons arose for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system were designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.

(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects are marked on all images for easier understanding: 1, 2, 3 are boxes; 4 is a chair; 5 is a student at his desk; 6, 7 are windows; 8 is the entrance; 9 is a UAV wing on a cupboard.

Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the obtained point coordinates were in the lidar sensor's own coordinate system. As shown in figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the lidar in the ground fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections. However, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z (x = y × z), pointing to the right;

• the y axis is the facing direction of the scanner, parallel to the ground, pointing

Figure 4.12: Presentation of the axes used in the Android application.

forward;

• the z axis is pointing up and is perpendicular to the ground.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground fixed coordinate system. Three DOF represent its position along the x, y, z axes. The other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner is considered (the latter three DOF). As can be seen in figure 4.8, the roll and yaw angles are not able to change, thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

x = distance · sin(−yaw)                                        (4.1)

y = distance · cos(yaw) · sin(pitch)                            (4.2)

z = distance · cos(yaw) · cos(pitch)                            (4.3)

where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5), and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

The position of the lidar in the ground coordinate system did not change in theory, only its orientation. However, as can be seen in figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device another way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change, and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, in addition to the rotation, a translation is needed as well. The translation vector is calculated from dy and dz:

dy = offset · sin(pitch)                                        (4.4)

dz = offset · cos(pitch)                                        (4.5)

where dy is the translation required along the y axis and dz is the translation along the z axis. Offset is the distance between the light source and the axis of rotation, presented in figure 4.8 part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get:

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \text{distance} \cdot
\begin{bmatrix} \sin(-\text{yaw}) \\ \cos(\text{yaw}) \cdot \sin(\text{pitch}) \\ \cos(\text{yaw}) \cdot \cos(\text{pitch}) \end{bmatrix}
+ \text{offset} \cdot
\begin{bmatrix} 0 \\ \sin(\text{pitch}) \\ \cos(\text{pitch}) \end{bmatrix}
\tag{4.6}
\]

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (Simultaneous Localization and Mapping) or the ground robot detection.
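To make the transformation concrete, a minimal C++ sketch of equation 4.6 is given below. It assumes the angles are already expressed in radians; the function and type names are illustrative only and are not taken from the project code.

    #include <cmath>

    struct Point3D { double x, y, z; };

    // Illustrative helper, not part of the project code.
    // Transforms one lidar return into the ground-fixed frame (equation 4.6).
    // 'distance' and 'yaw' are the polar coordinates measured in the scanner's
    // plane, 'pitch' is the tilt angle reported by the IMU and 'offset' is the
    // lever arm between the light source and the rotation axis.
    Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
    {
        Point3D p;
        p.x = distance * std::sin(-yaw);
        p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
        p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
        return p;
    }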


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check if it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of True positive and negative

A detection is true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of False positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called type I error) if the system classifies an input as a positive sample even though it is not. In other words, the system believes the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called type II error) occurs when an input is not recognized (rejected) although it should be. In this current task, a false negative error is when the ground robot is not detected.

5.1.1.3 Reducing number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chances of false negatives will increase. On the other hand, if conditions are loosened in the detection method, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid occasional mistakes. Such a project can be a manually supervised classification where appearing false positives can be eliminated by the operator (for example, illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project, the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is on the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position. After a few frames the detector would be able to detect the robot again with high probability. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested on an image dataset with its samples already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the object's (or objects') position is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROI. These algorithms are usually sliding window based (see subsection 4.3.2); their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free to use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, drawing a bounding box around them) on every frame of the videos (there are options to label an object occluded, obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames the only task remaining is to correct the interpolations between them if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build for relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned for the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.

The measurements were carried out on a laptop PC (Lenovo z580, i5-3210m, 8GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test; only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes during the test due to the operating system did not influence the results significantly.
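The per-frame timing itself can be done with standard C++ facilities; the snippet below is a minimal sketch of such a measurement loop. The processFrame() call stands in for the real detector and is hypothetical, as are the other names; this is not the actual measurement code used in the project.

    #include <chrono>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <opencv2/core.hpp>

    void processFrame(const cv::Mat& frame);   // hypothetical detector entry point

    // Runs the detector on every frame and logs the elapsed seconds per frame,
    // one value per line, so the analysis script can compute the statistics.
    void timeDetector(const std::vector<cv::Mat>& frames, const std::string& logPath)
    {
        std::ofstream log(logPath);
        for (const cv::Mat& frame : frames) {
            auto start = std::chrono::steady_clock::now();
            processFrame(frame);
            auto stop = std::chrono::steady_clock::now();
            log << std::chrono::duration<double>(stop - start).count() << "\n";
        }
    }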


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as in modes 2, 3 and 4 the processing methods change during the execution by definition, which results in a varying processing time; variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

    Loaded 100 files
    Number of frames in video: 1080
    Total number of frames processed: 108000
    Longest processing time: 0.813
    Shortest processing time: 0.007
    Average ellapsed seconds: 0.032446
    Variance of ellapsed seconds per frame:
        between video loops: 8.0297e-07
        across the video: 0.0021144
    Average Frame per second rate of the processing: 30.8204 FPS

The changes of average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer; in that case the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to gather the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they are still scanning in a plane and have this mentioned processing built-in. Also, they are even more expensive than the 2D versions.


Figure 5.2: On this figure the recording setup is presented. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As it can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
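As an illustration of the simpler, height-based direction, the C++ sketch below filters a reconstructed point cloud to the height band in which the UGV is expected; the structure, function name and threshold parameters are hypothetical and only illustrate the idea.

    #include <vector>

    struct Point3D { double x, y, z; };   // ground-fixed coordinates, z is height

    // Hypothetical pre-filter: keeps only the points whose height falls inside
    // the expected height range of the ground robot. The result could serve as
    // a region-of-interest pre-filter for further 3D processing.
    std::vector<Point3D> filterByHeight(const std::vector<Point3D>& cloud,
                                        double zMin, double zMax)
    {
        std::vector<Point3D> kept;
        for (const Point3D& p : cloud)
            if (p.z >= zMin && p.z <= zMax)
                kept.push_back(p);
        return kept;
    }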

5.3 Discussion of results

In general it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos for all modes can be found at users.itk.ppke.hu/~palan1/CranfieldVideos. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be done by several values; here recall and precision will be used. Recall is defined by

\[ \frac{TP}{TP + FN} \]

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the detected to the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

\[ \frac{TP}{TP + FP} \]

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, precision can be improved at the cost of an increased number of false negative errors.
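For reference, both measures can be computed directly from the raw counts; the small C++ helper below is only an illustration and its names are not taken from the evaluation tool.

    // Illustrative only: true positive, false positive and false negative
    // counts of one test run, and the two derived measures.
    struct EvalCounts { int tp = 0, fp = 0, fn = 0; };

    double recall(const EvalCounts& c)    { return c.tp / double(c.tp + c.fn); }
    double precision(const EvalCounts& c) { return c.tp / double(c.tp + c.fp); }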

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4) based on position and the overlapping areas. For the latter, a minimum of 50% was defined (ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfying. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.
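The overlap criterion itself is straightforward to express with OpenCV's rectangle type; the following sketch of such a check is illustrative (the function name is hypothetical and not from the evaluation software).

    #include <opencv2/core.hpp>

    // Illustrative sketch: returns true if a detection matches an annotation
    // under the 50% rule described above, i.e. intersection area over union
    // area of the two rectangles.
    bool isMatch(const cv::Rect& detection, const cv::Rect& annotation,
                 double minOverlap = 0.5)
    {
        double inter = (detection & annotation).area();
        double uni   = detection.area() + annotation.area() - inter;
        return uni > 0.0 && (inter / uni) >= minOverlap;
    }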

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall 95%) and speed (average frame-rate 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or replace modules. The currently implemented and used ones were listed and explained; also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 concluded the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software along with the produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features, like exporting detections, demonstration videos and frame rate measurements, were implemented. To make the development simpler, an easy-to-use interface was included which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.

Special attention was given to the evaluation of the system. Two software tools were developed as additional aids: one for evaluating the efficiency of the detections and another for analysing the processing time and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready-to-use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping, along with the applied mathematics. An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in 1.3:

1. explore possible solutions in a literature review

2. design a system which is able to detect the ground robot based on the available sensors

3. implement and test the first version of the algorithms

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front) but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely the front) will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Besides the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information to eliminate detections which "slipped" off the object. Also, further processing of the returned position is recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time: theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field-of-view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.
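The proposed search-space restriction reduces to a simple geometric test; the sketch below illustrates it with hypothetical names, assuming positions in metres and a known maximal speed of the UGV.

    #include <cmath>

    // Illustrative sketch: true if a candidate position is reachable from the
    // last known position of the UGV within the elapsed time, i.e. it lies
    // inside the circle of radius vMax * elapsed. Candidates outside this
    // circle can be ignored.
    bool insideSearchRadius(double lastX, double lastY,
                            double candX, double candY,
                            double vMax, double elapsed)
    {
        const double radius = vMax * elapsed;
        const double dx = candX - lastX;
        const double dy = candY - lastY;
        return std::hypot(dx, dy) <= radius;
    }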

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images, the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr. Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview senseFly." [Online]. Available: https://www.sensefly.com/drones/overview.html [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: http://www.questuav.com/news/fixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: http://store.dji.com/product/phantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: http://militaryhistory.about.com/od/artillerysiegeweapons/p/v1.htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: http://www.springerlink.com/content/t361252205328408/fulltext.pdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004, Proceedings, ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: http://www.mdpi.com/2072-4292/4/11/3390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: http://link.springer.com/10.1007/s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: http://www.ifp.uni-stuttgart.de/publications/phowo11/140Eisenbeiss.pdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.


[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," in Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: http://www.cse.unsw.edu.au/~acra2005/proceedings/papers/bath.pdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2522968.2522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: http://www.icg.tu-graz.ac.at/Members/pmroth/pub_pmroth/TR_OR/at_download/file

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: http://www.researchgate.net/publication/257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: https://www.vision.caltech.edu/publications/perona-chapter-Dec07.pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: http://link.aip.org/link/?PSI/4387/95/1&Agg=doi


[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: http://www.springerlink.com/index/H584WVN93312V4LT.pdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Robust+Real-time+Object+Detection#0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: http://www.cscjournals.org/csc/manuscript/Journals/IJIP/volume6/Issue6/IJIP-702.pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997. [Online]. Available: papers2://publication/uuid/9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2005.177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=1011869433


[55] S. Han, Y. Han, and H. Hahn, "Vehicle Detection Method using Haar-like Feature on Real Time System," in World Academy of Science, Engineering and Technology, 2009, pp. 455–459.

[56] Q. C. Q. Chen, N. Georganas, and E. Petriu, "Real-time Vision-based Hand Gesture Recognition Using Haar-like Features," 2007 IEEE Instrumentation & Measurement Technology Conference, IMTC 2007, 2007.

[57] G. Monteiro, P. Peixoto, and U. Nunes, "Vision-based pedestrian detection using Haar-like features," Robotica, 2006. [Online]. Available: http://cyberc3.sjtu.edu.cn/CyberC3/doc/paper/Robotica2006.pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça, and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online]. Available: http://www.springerlink.com/index/10.1007/978-3-540-72847-4

[59] A. E. C. Pece, "On the computational rationale for generative models," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 130–143, 2007.

[60] I. T. Joliffe, Principal Component Analysis, 2002, vol. 2. [Online]. Available: http://www.springerlink.com/content/978-0-387-95442-4

[61] A. Hyvarinen, J. Karhunen, and E. Oja, "Independent Component Analysis," vol. 10, p. 2002, 2002.

[62] K. Mikolajczyk, B. Leibe, and B. Schiele, "Multiple Object Class Detection with a Generative Model," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 26–36, 2006.

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computing Systems and Science, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: http://link.springer.com/10.1007/BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: http://boundarydevices.com/product/nitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: http://rover.ardupilot.com/wiki/pixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/utm_30lx_ew.html [Accessed at 2015-07-22]

[72] "openCV manual, Release 2.4.9." [Online]. Available: http://docs.opencv.org/opencv2refman.pdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: http://jmlr.csail.mit.edu/papers/v10/king09a.html

[74] "dlib C++ Library." [Online]. Available: http://dlib.net [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1502.00046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] "UrgBenri Information Page." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/download/data/UrgBenri.htm [Accessed at 2015-07-20]

[78] "vatic - Video Annotation Tool - UC Irvine." [Online]. Available: http://web.mit.edu/vondrick/vatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: https://www.mturk.com/mturk/welcome [Accessed at 2015-07-26]


• List of Figures
• Absztrakt
• Abstract
• List of Abbreviations
• 1 Introduction and project description
  • 1.1 Project description and requirements
  • 1.2 Type of vehicle
  • 1.3 Aims and objectives
• 2 Literature Review
  • 2.1 UAVs and applications
    • 2.1.1 Fixed-wing UAVs
    • 2.1.2 Rotary-wing UAVs
    • 2.1.3 Applications
  • 2.2 Object detection on conventional 2D images
    • 2.2.1 Classical detection methods
      • 2.2.1.1 Background subtraction
      • 2.2.1.2 Template matching algorithms
    • 2.2.2 Feature descriptors, classifiers and learning methods
      • 2.2.2.1 SIFT features
      • 2.2.2.2 Haar-like features
      • 2.2.2.3 HOG features
      • 2.2.2.4 Learning models in computer vision
      • 2.2.2.5 AdaBoost
      • 2.2.2.6 Support Vector Machine
• 3 Development
  • 3.1 Hardware resources
    • 3.1.1 Nitrogen board
    • 3.1.2 Sensors
      • 3.1.2.1 Pixhawk autopilot
      • 3.1.2.2 Camera
      • 3.1.2.3 LiDar
  • 3.2 Chosen software
    • 3.2.1 Matlab
    • 3.2.2 Robotic Operating System (ROS)
    • 3.2.3 OpenCV
    • 3.2.4 Dlib
• 4 Designing and implementing the algorithm
  • 4.1 Challenges in the task
  • 4.2 Architecture of the detection system
  • 4.3 2D image processing methods
    • 4.3.1 Chosen methods and the training algorithm
    • 4.3.2 Sliding window method
    • 4.3.3 Pre-filtering
    • 4.3.4 Tracking
    • 4.3.5 Implemented detector
      • 4.3.5.1 Mode 1: Sliding window with all the classifiers
      • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      • 4.3.5.4 Mode 4: Tracking based approach
  • 4.4 3D image processing methods
    • 4.4.1 3D recording method
    • 4.4.2 Android based recording set-up
    • 4.4.3 Final set-up with Pixhawk flight controller
    • 4.4.4 3D reconstruction
• 5 Results
  • 5.1 2D image detection results
    • 5.1.1 Evaluation
      • 5.1.1.1 Definition of True positive and negative
      • 5.1.1.2 Definition of False positive and negative
      • 5.1.1.3 Reducing number of errors
      • 5.1.1.4 Annotation and database building
    • 5.1.2 Frame-rate measurement and analysis
  • 5.2 3D image detection results
  • 5.3 Discussion of results
• 6 Conclusion and recommended future work
  • 6.1 Conclusion
  • 6.2 Recommended future work
• References
Page 12: Indoor localisation and classification of objects for an Unmanned …users.itk.ppke.hu/~palan1/Theses/Andras_Palffy_MSc... · 2015. 12. 21. · Image processing techniques will mainly

Chapter 1

Introduction and projectdescription

In this chapter an introduction is given to the whole project which this thesis ispart of Afterwards a structure of the sub-tasks in the project is presented alongwith the recognized challenges Then the aims and objectives of this thesis arelisted

11 Project description and requirementsThe projectrsquos main aim is to build a complete system based on one or moreunmanned autonomous vehicle which are able to carry out an indoor 3D mappingof a building (possible outdoor operations are kept in mind as well) A maplike this would be very useful for many applications such as surveillance rescuemissions architecture renovation etc Furthermore if a 3D model exists otherautonomous vehicles can navigate through the building with ease Later on inthis project after the map building is ready a thermal camera is planned to beattached to the vehicle to seek find and locate heat leaks as a demonstration ofthe system A system like this should be

1 fast Building a 3D map requires a lot of measurements and processingpower not to mention the essential functions stabilization navigationroute planning collision avoidance However to build a useful tool allthese functions should be executed simultaneously mostly on-board nearlyreal-time Furthermore the mapping process itself should be finished inreasonable time (depending on the size of the building)

2 accurate Reasonably accurate recording is required so the map would besuitable for the execution of further tasks Usually this means a maximum

1

12 Type of vehicle

error of 5-6cm Errors with similar magnitude can be corrected later de-pending on the application In the case of architecture for example aligninga 3D room model to the recorded point cloud could eliminate this noise

3 autonomous The solution should be as autonomous as possible The aimis to build a system which does not require human supervision (for examplemapping a dangerous area which would be too far for real time control) andcoordinate the process Although remote control is acceptable and shouldbe implemented it is desired to be minimal This should allow a remoteoperator to coordinate even more than one unit a time (since none of themneeds continuous attention) while keeping the ability to change the routeor priority of premises

4 complete Depending on the size and layout of the building the mappingprocess can be very complicated It may have more than one floors loops(the same location reached via an other route) or opened areas which areall a significant challenge for both the vehicle (plan route to avoid collisionsand cover every desired area) and the mapping software The latter shouldrecognize previously seen locations and rdquocloserdquo the loop (that is assignthe two separately recorded area to one location) and handle multi-layerbuildings The system have to manage these tasks and provide a mapwhich contains all the layers closes loops and represents all walls (includingceiling and floor)

12 Type of vehicleThe first question of such a project is what type of vehicle should be used Theoptions are either some kind of aerial or a ground vehicle Both have their ad-vantages and disadvantages

An aerial vehicle has a lot more degree of freedom since it can elevate fromthe ground It can reach positions and perform measurements which would beimpossible for a ground vehicle However it consumes a lot of energy to stayin the air thus such a system has significant limitations in weight payload andoperating time (since the batteries themselves weigh a lot)

A ground vehicle overcomes all these problems since it is able to carry alot more payload than an aerial one This means a lot of batteries thereforelonger operation time On the other hand it canrsquot elevate from the ground thusall the measurements will be performed from a relatively close position to theground This can be a big disadvantage since taller objects will not have theirtops recorded Also a ground robot would not be able to overcome big obstaclesclosing its way or to get to another floor

2

13 Aims and objectives

Figure 11 Image of the ground robot Source own picture

Considering these and many more arguments (price complexity human andmaterial resources) a combination of an aerial and a ground vehicle was chosenas a solution The idea is that both vehicles enter the building at the sametime and start mapping the environment simultaneously working on the samemap (which is unknown when the process starts) Using the advantage of theaerial vehicle the system scans parts of the rooms which are unavailable for theground vehicle With sophisticated route planning the areas scanned twice canbe minimized resulting in a faster map building Also great load capacity of theground robot makes it able to serve as a charging station for the aerial vehicleThis option will be discussed in details in section 13

13 Aims and objectivesAs it can be seen the scope of the project is extremely wide To cover the wholemore than fifteen students (visiting MSc and PhD) are working on it Thereforethe project has been divided to numerous subtasks

During the planning period of the project several challenges were recognizedfor example the lack of GPS signal indoor the vibration of the on-board sensorsor the short flight time of the aerial platform

This thesis addresses the last one As mentioned in section 12 the UAVconsumes a lot of energy to stay in the air thus such a system has significantlylimited operating time In contrast the ground robot is able to carry a lot morepayload (eg batteries) than an aerial one The idea to overcome this limitationof the UAV is to use the ground robot as a landing and recharging platform for

3

13 Aims and objectives

the aerial vehicleTo achieve this the system will need to have continuous update of the relative

position of the ground robot The aim of this thesis is to give a solution for thisproblem The following objectives were defined to structure the research

First possible solutions have to be collected and discussed in an extensive lit-erature review Then after considering the existing solutions and the constraintsand requirements of the discussed project a design of a new system is neededwhich is able to detect the ground robot using the available sensors Finally thefirst version of the software have to be implemented and tested

This paper will describe the process of the research introduce the designedsystem and discuss the experiences

4

Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous researches andarticles related to the topic of this thesis First a short introduction of unmannedaerial vehicles will be given along with their most important advantages andapplication fields

Afterwards a brief conclusion of the science of object recognition in imageprocessing is provided The most often used feature extraction and classificationmethods are presented

21 UAVs and applicationsThe abbreviation UAV stands for Unmanned Aerial Vehicle Any aircraft whichis controlled or piloted by remotely andor by on-board computers They are alsoreferred as UAS for Unmanned Aerial System

They come in various sizes The smallest ones fit a man palms while even afull-size aeroplane can be an UAV by definition Similarly to traditional aircraftUnmanned Aerial vehicles are usually grouped into two big classes fixed-wingUAVs and rotary wing UAVs

211 Fixed-wing UAVsFixed-wing UAVs have a rigid wing which generates lift as the UAV moves for-ward They maneuver with control surfaces called ailerons rudder and elevatorGenerally speaking fixed-wing UAVs are easier to stabilize and control Forcomparison radio controlled aeroplanes can fly without any autopilot functionimplemented relying only on the pilotrsquos input This is a result of the muchsimpler structure compared to rotary wing ones due to the fact that a glidingmovement is easier to stabilize However they still have a quite extended market

5

21 UAVs and applications

Figure 21 One of the most popular consumer level fixed-wing mapping UAV ofthe SenseFly company Source [1]

of flight controllers whose aim is to add autonomous flying and pilot assistancefeatures [2]

One of the biggest advantage of fixed wing UAVs is that due to their naturalgliding capabilities they can stay airborne using no or small amount of powerFor the same reason fixed-wing UAVs are also able to carry heavier or morepayload for longer endurances using less energy which would be very useful inany mapping task

The drawback of the fixed wing aircraft in the case of this project is the factthat they have to keep moving to generate lift and stay airborne Therefore flyingaround between and in buildings is very complicated since there is limited spaceto move around Although there have been approaches to prepare a fixed wingUAV to hover and translate indoors [3] generally they are not optimal for indoorflights especially with heavy payload [4]

Fixed-wing UAVs are excellent choice for long endurance high altitude tasksThe long flight times and higher speed make it possible to cover larger areaswith one take off On figure 21 a consumer fixed-wing UAV is shown designedspecially for mapping

This project needs a vehicle which is able to fly between buildings and indoorswithout any complication Thus rotary-wing UAVs were reviewed

212 Rotary-wing UAVsRotary-wing UAVs generate lift with rotor blades attached to and rotated aroundan axis These blades work exactly as the wing on the fixed-wing vehicles butinstead of the vehicle moving forward the blades are moving constantly Thismakes the vehicle able to hover and hold its position Also due to the same prop-erty they can take off and land vertically These capabilities are very importantaspects for the project since both of them are crucial for indoor flight where spaceis limited

Figure 2.2: The newest version of DJI's popular Phantom series. This consumer drone is easy to fly even for beginner pilots, due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known: UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general, they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side produce more thrust than the other, the multirotor tilts to the other side. That is why every remote-controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is why virtually no large-scale multirotors are used: as the size increases, the mechanical simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs require many additional flights to survey any large area, which further increases operational costs [4].


In spite of the disadvantages listed above, rotary-wing (especially multirotor) UAVs seem like an excellent choice for this project, since there is no need to cover large areas, and the shorter operational times can be compensated for with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German V-1 rocket ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available both as simple remote-controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. Good examples of this application field are surveying the location of an earthquake or searching for missing people in the wilderness; see [7] and [8] for examples. [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have proven useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low-cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger-scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use indoors, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera, using monocular SLAM (Simultaneous Localization and Mapping). See [20-22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions, see [23-25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26-28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while also navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks appeared, the field has witnessed several approaches and methods. Thus, the topic has an extensive literature. To get an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31-33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as follows:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus the role of the stimulus in the context of the scene.

while [35] sets up five classes, defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information for the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.


• Naming: Given a large, complex image (instead of an image patch as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector, respectively.

Besides the grouping of computer vision tasks, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research into artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of the scene by differencing a few frames. Although several methods exist (see [36] or [37] for good overviews), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where only the objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.

Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

These solutions can give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as one object. Thus, all possible objects of interest are very well separated. The fixed position of the camera means a fixed perspective, which makes size and distance estimations accurate; with fixed lengths, reliable velocity measurement can be implemented. These and further properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector: after the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.
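As an illustration of how such a foreground-based region-of-interest pre-filter can be put together, a minimal sketch is given below. It assumes an OpenCV 3-style API (createBackgroundSubtractorMOG2) and a placeholder video file name; it is only an example of the technique, not the project's code.

#include <opencv2/opencv.hpp>

// Minimal sketch: learn a background model from a static camera feed and keep
// only the moving foreground blobs as candidate regions of interest.
int main() {
    cv::VideoCapture cap("static_camera.avi");                       // placeholder input file
    cv::Ptr<cv::BackgroundSubtractorMOG2> subtractor =
        cv::createBackgroundSubtractorMOG2(500, 16.0, true);         // history, threshold, detect shadows

    cv::Mat frame, foregroundMask;
    while (cap.read(frame)) {
        subtractor->apply(frame, foregroundMask);                    // update model and segment foreground
        cv::threshold(foregroundMask, foregroundMask, 200, 255,
                      cv::THRESH_BINARY);                            // drop the shadow label (value 127)

        std::vector<std::vector<cv::Point> > contours;
        cv::findContours(foregroundMask, contours,
                         cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        for (size_t i = 0; i < contours.size(); ++i) {
            cv::Rect roi = cv::boundingRect(contours[i]);            // candidate region for a slower classifier
            if (roi.area() > 500)
                cv::rectangle(frame, roi, cv::Scalar(0, 0, 255), 2);
        }
        cv::imshow("foreground candidates", frame);
        if (cv::waitKey(30) == 27) break;
    }
    return 0;
}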

It has to be mentioned that not every moving (changing) patch is an object: for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]); certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would result in changes across the whole image. In this case, the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern in the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them and looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

Figure 2.4: Example of template matching. Notice how the output image is darkest at the correct position of the template (darker means a higher value). Another interesting point of the picture is the high return values around the dark headlines of the calendar (on the wall, left), which indeed look similar to the handle. Source: [43]

To perform the comparison, the template is shifted by (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that this comparison function will return the highest value at the correct position [43].
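A minimal sketch of this idea using OpenCV's matchTemplate with normalized cross-correlation is shown below; the image and template file names are placeholders.

#include <opencv2/opencv.hpp>

// Minimal sketch: slide a template over the image with normalized cross
// correlation and take the location of the best score as the match.
int main() {
    cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);
    cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);

    cv::Mat response;
    cv::matchTemplate(image, templ, response, cv::TM_CCORR_NORMED);  // correlation map over all (u, v)

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);     // best (u, v) position

    cv::rectangle(image, maxLoc,
                  maxLoc + cv::Point(templ.cols, templ.rows),
                  cv::Scalar(255), 2);
    cv::imwrite("match.png", image);
    return 0;
}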

See figure 2.4 for an example input and template (part a) and the output of the algorithm (part b). Notice how the output image is darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross-correlation has to be calculated at every position. Several efforts have been published to reduce the processing time: [43] uses normalized cross-correlation for faster comparison, while [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is then easier to compare the two images; for example, [46] proposed an algorithm which approximates the template image with polynomials.

Other disadvantages of template matching are its poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem to be two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, for deciding whether the extracted feature array represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this thesis will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well: [48] recognizes edge-based and patch-based features, while [31] defines three groups according to the locality of the features:

• Pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• Patch-level feature description: descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• Region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower-level features presented above; however, they do not try to apply higher-level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well-known descriptors of this group are BOF (Bag of Features or Bag of Words, [52]) and HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this thesis, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see subsection 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]; the same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: the first stage searches over the image at multiple scales, using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated at the selected scale.

The biggest advantage of the descriptor is its scale and rotation invariance. Note that rotation here means rotation in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation-invariant (frontal) face detector should be able to detect faces upside down, but not faces photographed from a side view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is achieved by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus, SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of a general class. However, this is not a problem in the case of this project, since only one ground robot is part of it; no general class recognition is required, recognizing the specific ground robot used is enough.
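For illustration, a minimal SIFT matching sketch is given below. It assumes an OpenCV 3 build with the contrib xfeatures2d module; the file names and the 0.75 ratio-test threshold are illustrative choices, not values used in this project.

#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>   // assumes an opencv_contrib build
#include <iostream>

// Minimal sketch: extract SIFT key-points from a reference image of the robot
// and match them against a test frame using Lowe's ratio test.
int main() {
    cv::Mat object = cv::imread("robot_reference.png", cv::IMREAD_GRAYSCALE);
    cv::Mat scene  = cv::imread("test_frame.png",      cv::IMREAD_GRAYSCALE);

    cv::Ptr<cv::Feature2D> sift = cv::xfeatures2d::SIFT::create();
    std::vector<cv::KeyPoint> kpObject, kpScene;
    cv::Mat descObject, descScene;
    sift->detectAndCompute(object, cv::noArray(), kpObject, descObject);
    sift->detectAndCompute(scene,  cv::noArray(), kpScene,  descScene);

    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knn;
    matcher.knnMatch(descObject, descScene, knn, 2);

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knn.size(); ++i)          // keep only clearly better-than-second matches
        if (knn[i].size() == 2 && knn[i][0].distance < 0.75f * knn[i][1].distance)
            good.push_back(knn[i][0]);

    std::cout << good.size() << " good matches" << std::endl;
    return 0;
}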

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• Two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• Three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• Four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of the pixel intensities above and to the left" [45] of any given image location. The integral values at each pixel location are kept in a structure, and the Haar features can then be computed by summing and subtracting the integral values at the corners of the rectangles.
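The following minimal sketch illustrates the computation of a single two-rectangle feature from an OpenCV integral image; the window coordinates and the input file name are arbitrary example values.

#include <opencv2/opencv.hpp>
#include <iostream>

// Sum of pixel intensities inside r, from the four corners of the summed-area table.
static double rectSum(const cv::Mat& integralImg, cv::Rect r) {
    return integralImg.at<double>(r.y, r.x)
         + integralImg.at<double>(r.y + r.height, r.x + r.width)
         - integralImg.at<double>(r.y, r.x + r.width)
         - integralImg.at<double>(r.y + r.height, r.x);
}

int main() {
    cv::Mat gray = cv::imread("face.png", cv::IMREAD_GRAYSCALE);   // placeholder input
    cv::Mat integralImg;
    cv::integral(gray, integralImg, CV_64F);                       // (rows+1) x (cols+1) table

    // Two vertically adjacent rectangles, e.g. "eyes darker than cheeks".
    cv::Rect top(40, 40, 24, 12), bottom(40, 52, 24, 12);
    double feature = rectSum(integralImg, bottom) - rectSum(integralImg, top);
    std::cout << "two-rectangle feature value: " << feature << std::endl;
    return 0;
}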

Figure 2.5: Two example Haar-like features that have proven to be effective separators. The top row shows the features themselves; the bottom row presents the features overlaid on a typical face. The first feature corresponds to the observation that the eyes are usually darker than the cheeks, while the second feature compares the intensity of the areas around the eyes to that of the nose. Source: [45]

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the basis of many implemented face detection methods (in consumer cameras, for example) and has inspired several further research efforts; [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are widely used in computer vision for various purposes: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is rotation variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike the Haar features (2.2.2.2), HOG gathers information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization, produces a histogram of the different orientations (called bins) over small portions of the image (called cells). After a location-based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms ([58]), it is different, since not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow, presenting the main difference between the two philosophies. Source: [59]

HOG is essentially a dense version of the SIFT descriptor (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to orientation, thus HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application field is still the most common usage of this feature descriptor, it can be used for many other kinds of objects as well, as long as the class has a characteristic shape with significant edges.
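As a short illustration, the sketch below extracts a HOG feature vector for one 64×128 window with OpenCV's HOGDescriptor, using its default Dalal-Triggs parameters; the input file name is a placeholder.

#include <opencv2/opencv.hpp>
#include <iostream>

// Minimal sketch: compute a HOG feature vector for a 64x128 window with the
// default parameters (8x8 cells, 2x2-cell blocks, 9 orientation bins).
int main() {
    cv::Mat patch = cv::imread("window.png", cv::IMREAD_GRAYSCALE);
    cv::resize(patch, patch, cv::Size(64, 128));       // detection window size

    cv::HOGDescriptor hog;                             // defaults: winSize 64x128, 9 bins
    std::vector<float> descriptor;
    hog.compute(patch, descriptor);

    // 7 x 15 block positions x 4 cells x 9 bins = 3780 values
    std::cout << "HOG vector length: " << descriptor.size() << std::endl;
    return 0;
}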

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by x_i, the corresponding labels by c_i, and the feature descriptors (extracted from the data x_i) by θ_i, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much data as possible. Afterwards, given test data, they predict the probability of an instance x being generated (with features θ) conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) · P(c)

is ideally 1 if x contains an instance of class c, and 0 if not. The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).


Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the following probability:

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and information flow: discriminative models map images to class labels, while generative models use a map from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this thesis requires one class to be detected, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing computational expenses caused by the growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24×24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weak classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize the N · T weights w_{t,i}, where N is the number of training examples and T is the number of features in the stage.

• For t = 1, ..., T:

1. Normalize the weights.

2. Select the best classifier using only a single feature by minimising the detection error ε_t = Σ_i w_i |h(x_i, f, p, θ) - y_i|, where h(x_i) is the classifier output and y_i is the correct label (both 0 for negative and 1 for positive samples).

3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the minimizers of the error above.

4. Update the weights: w_{t+1,i} = w_{t,i} · β_t^(1 - e_i), where β_t = ε_t / (1 - ε_t) and e_i = 0 if example x_i is classified correctly and 1 otherwise.

• The final classifier of the stage is a weighted sum of the selected weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
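To make the weight update concrete, the sketch below runs one illustrative boosting round on made-up labels and weak-classifier outputs, following the formulation above; all numbers are toy values.

#include <cmath>
#include <vector>
#include <iostream>

// Minimal sketch of one AdaBoost round, assuming the best weak classifier of
// this round has already been selected. Labels follow the 0/1 convention used
// in the Viola-Jones formulation; the data here are purely illustrative.
int main() {
    std::vector<double> weights = {0.25, 0.25, 0.25, 0.25}; // normalized sample weights
    std::vector<int>    labels  = {1, 1, 0, 0};             // ground-truth labels
    std::vector<int>    h       = {1, 0, 0, 0};             // outputs of the chosen weak classifier

    // Weighted error of the selected weak classifier.
    double epsilon = 0.0;
    for (size_t i = 0; i < weights.size(); ++i)
        epsilon += weights[i] * std::fabs(h[i] - labels[i]);

    // Down-weight correctly classified samples: w_{t+1,i} = w_{t,i} * beta^(1 - e_i),
    // where beta = eps / (1 - eps) and e_i = 0 for a correct decision, 1 otherwise.
    double beta = epsilon / (1.0 - epsilon);
    double sum = 0.0;
    for (size_t i = 0; i < weights.size(); ++i) {
        int e_i = (h[i] == labels[i]) ? 0 : 1;
        weights[i] *= std::pow(beta, 1 - e_i);
        sum += weights[i];
    }
    for (size_t i = 0; i < weights.size(); ++i)
        weights[i] /= sum;                                   // re-normalize for the next round

    // The weak classifier enters the strong one with weight alpha = log(1 / beta).
    std::cout << "alpha = " << std::log(1.0 / beta) << std::endl;
    return 0;
}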

2.2.2.6 Support Vector Machine

Support vector machines (SVM, also called support vector networks) are a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the elements of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. Then one or more hyperplanes (linear decision surfaces) are constructed in this space to separate the classes.

The surface is constructed under two constraints. The first is to divide the space into two parts in a way that no points from different classes remain in the same part (in other words, separate the two classes perfectly). The second constraint is to keep the distance to the nearest training sample's vector from each class as large as possible (in other words, define the largest margin possible). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (and therefore determining the width of the margin) are called support vectors, hence the name of the method. See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.

The algorithm was extended to tasks that are not linearly separable in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space, where the classes become linearly separable and the method above can be used.


Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares; they define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most often used approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all the possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision of the classifier. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].
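For illustration, the following minimal sketch trains a linear two-class SVM on toy 2D points with Dlib; in the project the samples would be HOG feature vectors instead, and the C value here is arbitrary.

#include <dlib/svm.h>
#include <iostream>

// Minimal sketch: train a linear two-class SVM on made-up 2D samples.
int main() {
    typedef dlib::matrix<double, 2, 1> sample_type;
    typedef dlib::linear_kernel<sample_type> kernel_type;

    std::vector<sample_type> samples;
    std::vector<double> labels;
    for (int i = 0; i < 20; ++i) {
        sample_type s;
        s(0) = i;                        // toy, linearly separable data
        s(1) = i;
        samples.push_back(s);
        labels.push_back(i < 10 ? -1.0 : +1.0);
    }

    dlib::svm_c_linear_trainer<kernel_type> trainer;
    trainer.set_c(10);                   // soft-margin penalty

    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

    sample_type test;
    test(0) = 15;
    test(1) = 15;
    std::cout << "decision value: " << df(test) << std::endl;   // > 0 means class +1
    return 0;
}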


Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research, especially the objectives defined in section 1.3. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the software libraries used will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single-board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robot Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection summarizes the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.


Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well-supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

The Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multirotors during flight. Therefore, it was suitable for mapping purposes. See section 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is the fact that they are available in several sizes, with different resolutions and other features, for a relatively cheap price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aim of the project as well, resulting from their light weight and wide-angle field of view.


Figure 3.2: The chosen lidar sensor, the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. The basic concept is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different range, accuracy, size and other features. The chosen scanner for this project is the Hokuyo UTM-30LX, a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm × 62 mm × 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information provided by them is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level, interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Also, excellent visualization tools are ready to use, which are necessary to easily debug and develop image processing methods or 3D map construction and visualization. Resulting from the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are slightly slow compared to a C++ environment.

3.2.2 Robot Operating System (ROS)

The Robot Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools aimed at helping the development of unmanned vehicles, like simulation and visualization software, or an excellent general message-passing mechanism. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

Also, ROS contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions like basic image manipulations (e.g. load, write, resize, rotate), image processing tools (e.g. different kinds of edge detection and threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, plenty of examples and support are available.

In this project it was used to handle inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read such inputs.

3.2.4 Dlib

Dlib is a general-purpose, cross-platform C++ library. It has an excellent machine learning component, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it easy to develop and test custom-trained object detectors rapidly. Another recently added feature is an easy-to-use 3D point cloud visualization function. Resulting from this, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

It is worth mentioning that, aside from the features above, Dlib also contains components to handle linear algebra, threading, network I/O, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. Resulting from these, its popularity is increasing. For more information see [74] and [73].


Chapter 4

Designing and implementing the algorithm

In this chapter the challenges related to the topic of this thesis are listed. After the consideration of these, the architecture of the designed algorithm will be introduced, with detailed explanations given for the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus, no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready detectors. Thus, the solution of this task has to be completely self-made and new.

2. Limited resources: Since the objective of this thesis is detection executed on a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are significantly limited. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high-quality videos. The used lidar (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which should be considered during the design and assignment of processing methods.

3. Moving platform: No matter what kind of sensors are used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object, it adds noise and blur to the recordings, and, third, no background-foreground separation algorithms are available for the camera; see sub-subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the point of view of the sought object is more or less defined; for example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various sizes of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing maneuver, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms, which is provided by another algorithm specially designed for that task. Thus, the software discussed here does not have hard real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.


Figure 4.1: A diagram of the designed architecture. Its aim is to present the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow. (The diagram's blocks are: sensors (camera, 2D lidar), video reader, current frame, regions of interest, other preprocessing (edge/colour detection), 3D map, trainer with the Vatic annotation server, front and side SVMs, detector algorithm, tracking, detections and evaluation.)


4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a complete modular architecture was designed which is suitable for the defined objectives. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy to the system; on the other hand, it provides a more robust detection system for the robot. The following enumeration lists the main parts of the architecture; every module already implemented will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, when the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither of these is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an essential part of the system, since it produces the classifiers for the detector. These products have to be reproducible, compatible with the other parts, and effective. To make the development more convenient, the training software is completely separated from testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself; see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in an experimental state. However, both sensors are already handled in the architecture with the help of the Robot Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.


Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore, the system is able to manage images both from a camera and from a video file. The latter is considered a sensor module as well, simulating real input from the sensor (other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus, methods which could reduce the number of areas to process, and as such increase the speed of the detection, were required. Resulting from the complex structure of the project, both can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D maps. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time; theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but more rapid alternative to this, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found. Unfortunately, the real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented; all of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part to make sure it is compatible and up to date; thus evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the inputs from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1), and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards, the detector is able to pass on the detected objects' coordinates, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the concerns and methods of 2D camera image processing, and the process of the development, will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the used image processing libraries and toolkits were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are much more significant. Variance of the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable for recognising a concrete object instead of the general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different) but also because of moving parts, like rotating and steering wheels, for example. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus, it is expected to perform better than the other methods in case of changes in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project, the orientation of the camera will be stabilized either in hardware (with a gimbal holding the camera levelled) or in software (rotating the image based on the orientation sensor). Therefore, the input image of the detector system is expected to be levelled. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4). This means that even if the camera is levelled, the object itself could seem rotated because of the perspective. To overcome this issue, solutions will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational expenses are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally, the Dlib library (3.2.4) was chosen, since it includes an excellent, heavily optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentation pages are available for both training and detection, which simplified the implementation.

As a classifier, support vector machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. Also, they are implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which made the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces; furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++ and can be executed from the command line. It needs an XML file as input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might otherwise get into the negative training set.

After the classifier (also called detector) is produced, the training software tests it on an image database defined by a similar XML file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to disk with serialization.
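A minimal sketch of this kind of training pipeline is shown below, loosely following Dlib's public fhog object-detector example; the XML file name, detection window size and C value are placeholders, not the settings actually used in the project.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

// Minimal sketch: train a HOG + SVM detector from an annotated XML dataset
// and serialize the result for the detector module.
int main() {
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    std::vector<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    load_image_dataset(images, boxes, "side_view_training.xml");   // annotated positives

    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);

    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                  // trades training error against margin width
    trainer.be_verbose();

    // Negative windows are sampled automatically from un-annotated image areas.
    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    serialize("side_view_detector.svm") << detector;               // save to disk
    return 0;
}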

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore, multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and right, while the front and rear views are also very similar. This is especially true if the working principle of HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too; therefore, two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they not only detect the position of the robot, but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.
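As an illustration of how the two serialized classifiers might be applied to a frame, a hypothetical sketch follows; the file names and the simple decision rule are only examples, not the implemented detector logic.

#include <dlib/image_processing.h>
#include <dlib/image_io.h>
#include <iostream>

// Minimal sketch: run the front- and side-view detectors on the same frame.
// Which detector fires also tells the rough orientation of the robot.
int main() {
    using namespace dlib;
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

    object_detector<image_scanner_type> front_det, side_det;
    deserialize("front_view_detector.svm") >> front_det;
    deserialize("side_view_detector.svm")  >> side_det;

    array2d<unsigned char> frame;
    load_image(frame, "current_frame.png");                 // placeholder input

    std::vector<rectangle> front_hits = front_det(frame);   // robot seen from front/rear
    std::vector<rectangle> side_hits  = side_det(frame);    // robot seen from left/right

    // A simple illustrative decision: the view with a detection wins.
    if (!front_hits.empty())
        std::cout << "robot facing the camera at " << front_hits[0] << std::endl;
    else if (!side_hits.empty())
        std::cout << "robot seen from the side at " << side_hits[0] << std::endl;
    else
        std::cout << "no robot detected" << std::endl;
    return 0;
}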

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Anything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even when viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections; the reason is that while side and front views are well defined, it is hard to frame and train on a diagonal view.

On Figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector. Both the wheels and the body are easy to recognize thanks to their strong edges. On 4.2(c) a training image is displayed from the front-view detector training image set, while 4.2(d) shows the visualized final front detector. Notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that their training datasets (the images of the robot or any other object) are usually cropped, containing only the object and some margin around it. As a result, a detector trained this way will only work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contained only the robot there would be no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the detector, cropped and resized from the original large input.


(a) side-view training image example

(b) side-view HOG detector

(c) front-view training image example

(d) front-view HOG detector

Figure 4.2: (a) a typical training image for the side-view HOG detector; (b) the final side-view HOG descriptor visualized; (c) a training image from the front-view HOG descriptor training image set; (d) the final front-view detector visualized. Notice the strong lines around the typical edges of the training images.


Figure 4.3: Representation of the sliding window method.

This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with predefined step sizes and scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See Figure 4.3 for a representation.
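A minimal illustration of the concept, assuming an OpenCV cv::Mat frame and a hypothetical classify() callback; in the actual system Dlib's object_detector performs this scan internally over an image pyramid:

#include <opencv2/opencv.hpp>
#include <functional>
#include <vector>

// Slide a fixed-size window over several scales of the input image and
// collect the windows accepted by the classifier, mapped back to the
// coordinates of the original image.
std::vector<cv::Rect> slidingWindowScan(
    const cv::Mat& frame,
    const cv::Size& window,                      // e.g. an 80x80 detection window
    int step,                                    // step size in pixels
    const std::vector<double>& scales,           // e.g. {1.0, 0.75, 0.5}
    const std::function<bool(const cv::Mat&)>& classify)
{
    std::vector<cv::Rect> hits;
    for (double s : scales)
    {
        cv::Mat scaled;
        cv::resize(frame, scaled, cv::Size(), s, s);
        for (int y = 0; y + window.height <= scaled.rows; y += step)
        {
            for (int x = 0; x + window.width <= scaled.cols; x += step)
            {
                cv::Rect roi(x, y, window.width, window.height);
                if (classify(scaled(roi)))
                {
                    // Map the hit back to full-frame coordinates.
                    hits.emplace_back(int(x / s), int(y / s),
                                      int(window.width / s), int(window.height / s));
                }
            }
        }
    }
    return hits;
}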

It is worth mentioning that multiple instances of the sought object may be present on the image. For face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are completely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.


Thus, methods which could reduce the number of areas to process, and so increase the speed of the detection, were required. In other words, these algorithms find regions of interest (ROI, see 3) in which the HOG (or other) computationally heavy detectors are executed, in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another, or a newly developed one, during further development.

Based on intuition, a good separating feature is colour: anything which does not have the same colour as the ground vehicle should be ignored. Doing this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory), and many other premises reviewed, has a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. On Figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well, thus those areas are still scanned.
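A minimal sketch of such an edge-based pre-filter, assuming OpenCV and hypothetical threshold values; it keeps only those grid cells whose Canny edge density exceeds a limit and returns them as candidate regions of interest:

#include <opencv2/opencv.hpp>
#include <vector>

// Return grid cells whose edge density suggests they may contain the robot.
std::vector<cv::Rect> edgeBasedRois(const cv::Mat& frameBgr,
                                    int cellSize = 80,          // assumed cell size
                                    double minEdgeRatio = 0.05) // assumed threshold
{
    cv::Mat gray, edges;
    cv::cvtColor(frameBgr, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 80, 160);                            // assumed Canny thresholds

    std::vector<cv::Rect> rois;
    for (int y = 0; y < edges.rows; y += cellSize)
    {
        for (int x = 0; x < edges.cols; x += cellSize)
        {
            cv::Rect cell(x, y,
                          std::min(cellSize, edges.cols - x),
                          std::min(cellSize, edges.rows - y));
            double ratio = cv::countNonZero(edges(cell)) / double(cell.area());
            if (ratio > minEdgeRatio)
                rois.push_back(cell);   // this cell is worth scanning with HOG
        }
    }
    return rois;
}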

Another idea is to filter the image by the detections on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments; see subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was presented.


Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment. On the other hand, chairs and other objects have a lot of edges as well.

Tracking is the extension of this concept: if the position of the robot is already known, there is no need to find it again; it is enough to follow the found object(s). This makes a huge difference in computational cost, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]; the method was evaluated extensively and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed. Since the classifiers return bounding boxes, those can be used as an input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that it also uses HOG descriptors as a part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
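A minimal usage sketch of Dlib's correlation tracker, assuming an OpenCV capture loop and a hypothetical detection rectangle used for the initialization:

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("testVideo1.avi");          // assumed input video
    cv::Mat frame;
    cap >> frame;

    dlib::cv_image<dlib::bgr_pixel> dlibFrame(frame);
    dlib::correlation_tracker tracker;

    // Initialize with a bounding box, e.g. one returned by the SVM detector.
    dlib::rectangle detection(100, 150, 260, 280);   // hypothetical detection box
    tracker.start_track(dlibFrame, detection);

    while (cap.read(frame))
    {
        dlib::cv_image<dlib::bgr_pixel> img(frame);
        tracker.update(img);                           // follow the object
        dlib::drectangle pos = tracker.get_position(); // current estimate

        cv::rectangle(frame,
                      cv::Rect(int(pos.left()), int(pos.top()),
                               int(pos.width()), int(pos.height())),
                      cv::Scalar(0, 255, 255), 2);
        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;               // ESC to quit
    }
    return 0;
}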

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then, different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits or simply detections.

During the development, four different approaches were implemented. Each of them builds on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame-rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below.

1. Mode 1: Sliding window with all the classifiers

2. Mode 2: Sliding window with intelligent choice of classifier

3. Mode 3: Intelligent choice of classifiers and ROIs

4. Mode 4: Tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as new software. Instead, all were included in the same code, which decides at run-time which mode to execute, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage is different in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.


Table 4.1: The available parameters

Name                Valid values        Function
input               path to video       video used as input for detection
svm                 path(s) to SVMs     these SVMs will be used
mode                [1, 2, 3, 4]        selects which mode is used
saveFrames          [0/1]               turns on video frame export
saveDetections      [0/1]               turns on detection box export
saveFPS             [0/1]               turns on frame-rate measurement
displayVideo        [0/1]               turns on video display
DetectionsFileName  string              sets the filename for saved detections
FramesFolderName    string              sets the folder name used for saving video
numberOfLoops       integer (0<)        sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There are also options to change the name of the file where the detections are exported, and the name of the folder where the video frames are saved.

Table 4.1 summarises all the implemented parameters with their possible values and a short description.

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm:

input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute a detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections)


nor the video frames (saveFrames), and the video is not displayed (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful to simulate longer inputs. See Figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing videos, efficiency statistics, or measuring the processing frame-rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.1), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed positions, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all of them.

As can be seen, mode 1 is rather simple: it checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as an input). On the other hand, this exhaustive search is computationally very heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.
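A minimal sketch of this exhaustive mode, assuming the serialized classifiers produced by the training tool; dlib::evaluate_detectors runs every loaded fhog detector over the whole frame while sharing the HOG pyramid computation:

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> scanner_t;
typedef dlib::object_detector<scanner_t> detector_t;

int main()
{
    // Load all classifiers listed in the parameter file (assumed file names).
    std::vector<detector_t> detectors(2);
    dlib::deserialize("groundrobotfront.svm") >> detectors[0];
    dlib::deserialize("groundrobotside.svm")  >> detectors[1];

    cv::VideoCapture cap("testVideo1.avi");            // assumed input video
    cv::Mat frame;
    while (cap.read(frame))
    {
        dlib::cv_image<dlib::bgr_pixel> img(frame);

        // Mode 1: slide every classifier over the whole frame.
        std::vector<dlib::rectangle> hits = dlib::evaluate_detectors(detectors, img);

        for (const auto& r : hits)
            cv::rectangle(frame,
                          cv::Rect(int(r.left()), int(r.top()),
                                   int(r.width()), int(r.height())),
                          cv::Scalar(255, 0, 0), 2);
        cv::imshow("mode 1", frame);
        if (cv::waitKey(1) == 27) break;
    }
    return 0;
}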

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained: on Figure 4.2 it can be seen that one of them


represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is viewed either from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but due to the perspective distortion these detectors will not recognize both of them (and that is not needed either, since there is no reason to find the robot twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again until one of them finds the object. In every other aspect mode 2 is very similar to mode 1 presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in a much faster processing speed, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
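A minimal sketch of this classifier-selection memory, assuming the detector types from the previous sketch and a hypothetical tolerance value; the real implementation wraps this logic around the sliding-window scan:

#include <dlib/image_processing.h>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> scanner_t;
typedef dlib::object_detector<scanner_t> detector_t;

// Remembers which classifier fired last and only falls back to using all of
// them after 'tolerance' consecutive frames without a detection.
class ClassifierSelector
{
public:
    ClassifierSelector(std::vector<detector_t> dets, int tolerance = 10)  // assumed limit
        : detectors(std::move(dets)), tolerance(tolerance) {}

    template <typename image_type>
    std::vector<dlib::rectangle> detect(const image_type& img)
    {
        std::vector<dlib::rectangle> hits;
        if (lastDetector >= 0 && missedFrames <= tolerance)
        {
            hits = detectors[lastDetector](img);       // use the remembered classifier
            if (hits.empty()) { ++missedFrames; return hits; }
            missedFrames = 0;
            return hits;
        }
        // Tolerance exceeded (or first frame): try every classifier.
        for (std::size_t i = 0; i < detectors.size(); ++i)
        {
            hits = detectors[i](img);
            if (!hits.empty()) { lastDetector = int(i); missedFrames = 0; break; }
        }
        if (hits.empty()) ++missedFrames;
        return hits;
    }

private:
    std::vector<detector_t> detectors;
    int tolerance;
    int lastDetector = -1;   // index of the classifier that fired last
    int missedFrames = 0;
};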

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically, it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well: for example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented: the memory introduced in mode 2 was extended to store


Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest; the blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

the position of the detection, alongside the detector which returned it. A new rectangle named ROI (region of interest) was introduced, which determines in which area the detectors should search.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, for the reasons mentioned above (movement of the camera and of the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described in mode 2 is included here as well.

In the unfortunate case that none of the detectors returned detections inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
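A minimal sketch of this ROI bookkeeping, assuming OpenCV rectangles and the default growth percentages quoted above; the classifiers are then run only on the frame(roi) sub-image:

#include <opencv2/opencv.hpp>

// Keeps the region of interest used by mode 3: reset to the last detection
// (enlarged by 'growOnHit') and slowly expanded after missed detections.
class RoiEstimator
{
public:
    RoiEstimator(cv::Size frameSize, double growOnHit = 0.5, double growOnMiss = 0.03)
        : frame(cv::Rect(cv::Point(0, 0), frameSize)),
          roi(frame), growOnHit(growOnHit), growOnMiss(growOnMiss) {}

    // Call when a detector returned a hit inside the ROI.
    void onDetection(const cv::Rect& detection) { roi = grow(detection, growOnHit); }

    // Call when no detector found anything inside the ROI.
    void onMiss() { roi = grow(roi, growOnMiss); }

    cv::Rect current() const { return roi; }

private:
    cv::Rect grow(const cv::Rect& r, double factor) const
    {
        int dx = int(r.width  * factor / 2.0);
        int dy = int(r.height * factor / 2.0);
        cv::Rect enlarged(r.x - dx, r.y - dy, r.width + 2 * dx, r.height + 2 * dy);
        return enlarged & frame;   // never grow past the image borders
    }

    cv::Rect frame;      // full image rectangle
    cv::Rect roi;        // current region of interest
    double growOnHit;    // e.g. 50% after a detection
    double growOnMiss;   // e.g. 3% after a missed frame
};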

See Figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame, the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the new region of interest.


Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", thanks to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding window detectors and, in appropriate conditions, can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same viewpoint.

On the other hand, tracking can be misleading as well. In case the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See Figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3), and the algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides (note that the detectors are trained for the sides only, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returned detections inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears, then the tracker is reinitialized.
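A minimal sketch of the per-frame logic of mode 4, assuming the ClassifierSelector sketch above, Dlib's correlation tracker and hypothetical validation parameters; it only illustrates the track/validate/reinitialize cycle, with the mode-3 ROI handling hinted at in the comments:

#include <dlib/image_processing.h>
#include <dlib/opencv.h>

class Mode4Detector
{
public:
    Mode4Detector(ClassifierSelector sel, int validationPeriod = 10, int tolerance = 15)
        : selector(std::move(sel)),
          validationPeriod(validationPeriod), tolerance(tolerance) {}

    void processFrame(const dlib::cv_image<dlib::bgr_pixel>& img)
    {
        if (objectLost)
        {
            // Mode-3 behaviour: scan the (growing) ROI with the classifiers.
            auto hits = selector.detect(img);
            if (!hits.empty())
            {
                tracker.start_track(img, hits[0]);   // (re)initialize the tracker
                objectLost = false;
                failedValidations = 0;
            }
            return;
        }

        tracker.update(img);                         // cheap frame-to-frame following
        lastPosition = tracker.get_position();

        if (++framesSinceValidation >= validationPeriod)
        {
            framesSinceValidation = 0;
            // Validate the (enlarged) tracked box with the classifiers.
            auto hits = selector.detect(img);
            if (!hits.empty())
            {
                tracker.start_track(img, hits[0]);   // snap the tracker back onto the robot
                failedValidations = 0;
            }
            else if (++failedValidations > tolerance)
                objectLost = true;                   // label the object as lost
        }
    }

    dlib::drectangle position() const { return lastPosition; }

private:
    ClassifierSelector selector;
    dlib::correlation_tracker tracker;
    dlib::drectangle lastPosition;
    int validationPeriod, tolerance;
    int framesSinceValidation = 0, failedValidations = 0;
    bool objectLost = true;
};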

See Figure 4.7(a) for a representation of the processing method of mode 4.


(a) Example output frame of mode 4

(b) Typical error of the tracker

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier, the yellow box is the currently tracked area, and the green rectangle marks the estimated region of interest. (b) A typical error of the tracker: the yellow rectangle marks the tracked region, which clearly includes the robot, but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of such a scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated task, since to cover the whole room the lidar needs to be moved; however, this movement or displacement has a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On Figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: this is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, and second when it is tilted.

2. Visualisation of the laser ray: the scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser, which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: while the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. It is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: this is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: as mentioned before, the sensor operates with one single light source rotated around its vertical axis (capital Z). To know where a recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space, defined by polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind spot of the sensor: as discussed in the previous point, the scanner can register distances over 270°. The remaining 90° is a blind spot, located exactly behind the sensor (from 135° to −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: all the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this plane to the 3D map's coordinate system, the orientation and position of the lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: to scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: as described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and is marked in blue on the figure.

11. Ground vehicle: the main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection, but the localization of the UAV itself.


On parts a and b its schematic side and top view is shown, to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification. The file name is user defined.

Since both the lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data streams were recorded completely separately, this had to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurement. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved


Figure 4.8: Schematic figure representing the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its dependences. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


to the log file as well. However, since the two systems were not connected, the timestamps had to be synchronized manually.

On Figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar, and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone: the X axis is parallel with the shorter side of the screen, pointing from left to right; the Y axis is parallel to the longer side of the screen, pointing from bottom to top; the Z axis is pointing up and is perpendicular to the ground. See Figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis; it is 0 if the phone is lying on a horizontal surface,

• roll is the rotation around the y axis,

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On Figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robotic Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes, 4 is a chair, 5 is a student at his desk, 6 and 7 are windows, 8 is the entrance, 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described, while subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the obtained point coordinates were in the lidar sensor's own coordinate system. As shown on Figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions: no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z, x = y × z, pointing to the right,

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward,

• the z axis is pointing up and is perpendicular to the ground.

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system: three values represent its position along the x, y, z axes, and the other three describe its orientation (yaw, pitch and roll).

First, the rotation of the scanner is considered (the latter three DOF). As can be seen on Figure 4.8, the roll and yaw angles cannot change, thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

\[ x = \text{distance} \cdot \sin(-\text{yaw}) \tag{4.1} \]

\[ y = \text{distance} \cdot \cos(\text{yaw}) \cdot \sin(\text{pitch}) \tag{4.2} \]

\[ z = \text{distance} \cdot \cos(\text{yaw}) \cdot \cos(\text{pitch}) \tag{4.3} \]

where distance and yaw are the two coordinates of the point in the plane of the lidar (Figure 4.8, point 5), and pitch is the angle between the (lower-case) z axis and the plane of the lidar (Figure 4.8, points 3 and 4).

The position of the lidar in the ground coordinate system did not change in theory, only its orientation. However, as can be seen on Figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change, and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated from dy and dz:

\[ dy = \text{offset} \cdot \sin(\text{pitch}) \tag{4.4} \]

\[ dz = \text{offset} \cdot \cos(\text{pitch}) \tag{4.5} \]


where dy is the translation required along the y axis and dz along the z axis, and offset is the distance between the light source and the axis of rotation, presented on Figure 4.8 part c, marked with 10.

Combining the five equations (4.1-4.5) we get

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \text{distance} \cdot
\begin{bmatrix} \sin(-\text{yaw}) \\ \cos(\text{yaw}) \cdot \sin(\text{pitch}) \\ \cos(\text{yaw}) \cdot \cos(\text{pitch}) \end{bmatrix}
+ \text{offset} \cdot
\begin{bmatrix} 0 \\ \sin(\text{pitch}) \\ \cos(\text{pitch}) \end{bmatrix}
\tag{4.6}
\]

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work, such as SLAM (simultaneous localization and mapping) or the ground robot detection.
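A minimal sketch of this transformation as C++ code, assuming the angles are given in radians and the offset in the same unit as the measured distance; it simply evaluates equation 4.6 for one lidar sample:

#include <cmath>

struct Point3D { double x, y, z; };

// Map one lidar sample (distance and yaw in the scan plane), taken at a given
// pitch of the scanner, into the ground-fixed coordinate system (eq. 4.6).
Point3D lidarToGround(double distance, double yaw, double pitch, double offset)
{
    Point3D p;
    p.x = distance * std::sin(-yaw);
    p.y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    p.z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return p;
}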


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. It helps determine whether a given modification of the algorithm actually makes it better and/or faster than other versions. This information is extremely important in every vision based detection system, to track the improvements of the code and check whether it meets the predefined requirements.

To understand what better means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of true positive and negative

A detection is a true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection; note that this is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of false positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample although it is not. In other words, the system believes


the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (is rejected) although it should be. In this task, a false negative error means that the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of error at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions of the detection method are loosened, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid some mistakes. Such a project can be a manually supervised classification, where appearing false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure, and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is on the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would very likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested either on an image dataset whose samples are already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: The Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or ROIs. These algorithms are usually sliding window based (see subsection 4.3.2); their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. For this the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free-to-use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on Figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object occluded, obstructed or outside of frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames, the only remaining task is to correct the interpolations between them, if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This guarantees that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way, large datasets with several videos are easy to build for a relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned to the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations, and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
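A minimal sketch of such per-frame timing, assuming std::chrono and a hypothetical processFrame() call; the measured durations are appended to a log file for the analysis script described below:

#include <chrono>
#include <fstream>

// Measure the processing time of one frame and append it (in seconds) to a log.
template <typename Frame, typename Detector>
void timeFrame(Detector& detector, const Frame& frame, std::ofstream& log)
{
    auto start = std::chrono::steady_clock::now();
    detector.processFrame(frame);                       // hypothetical detector call
    auto stop  = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    log << elapsed.count() << "\n";                     // one value per frame
}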

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends the video to a ground station for further processing. During the tests every other unnecessary process was terminated, so that their influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes on the computer during the test, due to the operating system, did not influence the results significantly.


To analyse the exported processing times, a Matlab script was implemented. This tool can load the log files from the test and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change their behaviour during the execution by definition, which results in a varying processing time; the variance measures these changes. An example output of the analysis software is presented below.

Example output of the frame-rate analysis software:

Loaded 100 files
Number of frames in video: 1080
Total number of frames processed: 108000
Longest processing time: 0.813
Shortest processing time: 0.007
Average ellapsed seconds: 0.032446
Variance of ellapsed seconds per frame:
    between video loops: 8.0297e-07
    across the video: 0.0021144
Average frame per second rate of the processing: 30.8204 FPS

The changes of the average frame-rate between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since for a smaller image it takes a shorter time to "slide" the detector over it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that lidar 3D scanners exist, but they still scan in a plane and have this processing built in; also, they are even more expensive than the 2D versions.


Figure 5.2: The recording setup. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras, contains valuable information for a project like this, and makes indoor navigation easier. Therefore experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On Figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recordings from one point are not enough to create complete maps, as "shadows" and obscured parts still occur; note the "empty" area behind the robot on 5.3(b).

The created point clouds were found to be detailed enough to apply simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) to them in the future.

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos can be found here¹ for all modes. However, to judge the performance of the algorithm, more objective measurements were needed.

The 2D image processing results were evaluated by speed and detection efficiency.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The latter can be measured by several values; here recall and precision will be used. Recall is defined by

\[ \frac{TP}{TP + FN} \]

where TP is the number of true positive and FN the number of false negative detections. In other words, it is the ratio of the detected and the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

\[ \frac{TP}{TP + FP} \]

where TP is the number of true positive and FP the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different role in the validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, the precision of the algorithm can improve even while the number of false negative errors increases.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4), based on their position and their overlapping area. For the latter a minimum of 50% was defined (ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not sufficient. Unfortunately, registering this box as a false positive also generates a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged), therefore the annotated object will not be covered by any of the detections.
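A minimal sketch of this overlap criterion using OpenCV rectangles; an annotation and a detection are matched only if the intersection-over-union ratio reaches the assumed 0.5 threshold:

#include <opencv2/core.hpp>

// Ratio of the intersection area and the union area of two rectangles.
double overlapRatio(const cv::Rect& detection, const cv::Rect& annotation)
{
    double interArea = (detection & annotation).area();
    double unionArea = detection.area() + annotation.area() - interArea;
    return unionArea > 0.0 ? interArea / unionArea : 0.0;
}

// A detection counts as a true positive if it overlaps the annotation enough.
bool isTruePositive(const cv::Rect& detection, const cv::Rect& annotation,
                    double minOverlap = 0.5)
{
    return overlapRatio(detection, annotation) >= minOverlap;
}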

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This follows from the fact that they work exactly the same way, except for the selection of the used classifiers; the results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first three modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at viewpoints where the SVM detectors no longer detected the robot. See Table 5.1 for all the recall and precision values.


Similarly, the processing speed of the methods introduced in subsection 4.3.5 was analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames per second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests; thus mode 4 can be called a "real-time" object detection algorithm.

See Table 5.1 for the frame-rates of all modes, displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

mode     Recall   Precision   FPS      Variance
mode 1   0.623    1           4.2599   0.00000123
mode 2   0.622    1           6.509    0.0029557
mode 3   0.645    1           12.06    0.0070877
mode 4   0.955    0.898       30.82    0.0021144

In conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (95% recall) and speed (average frame-rate of 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased: as mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validation of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed from the aspect of suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend or to replace modules. The currently implemented and used ones were listed and explained; some unsuccessful experiments were mentioned as well.

Subsection 4.3.1 summarised the chosen feature extraction (HOG) and classifier (SVM) methods and presented the implemented training software along with the


As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work; therefore additional debugging features like exporting detections, demonstration videos and frame-rate measurements were implemented. To make development simpler, an easy-to-use interface was included which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, chosen based on previous detections. This resulted in an increased frame-rate. Mode 3 introduces the concept of Regions Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections; however, it is easy to replace this with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm, which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodic validations are still needed. Tracking is significantly faster than the sliding-window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.
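
To make the mode 4 control flow above concrete, the following is a minimal sketch of a detect-then-track loop built around dlib's correlation tracker [76]. The detect_robot() wrapper, the re-validation period and the confidence threshold are illustrative assumptions, not the exact values of the implemented detector.

```cpp
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

// Hypothetical wrapper around the trained HOG+SVM detectors of subsection 4.3.5.
bool detect_robot(const cv::Mat &frame, dlib::rectangle &box);

int main()
{
    cv::VideoCapture cap("test_video.avi");   // placeholder input
    dlib::correlation_tracker tracker;        // DSST-style tracker [76]
    bool tracking = false;
    const int revalidate_every = 30;          // assumed validation period (frames)
    const double min_confidence = 7.0;        // assumed threshold on the tracker response

    cv::Mat frame;
    for (int i = 0; cap.read(frame); ++i)
    {
        dlib::cv_image<dlib::bgr_pixel> img(frame);

        if (!tracking)
        {
            dlib::rectangle box;
            if (detect_robot(frame, box))     // full sliding-window scan
            {
                tracker.start_track(img, box);
                tracking = true;
            }
        }
        else
        {
            double confidence = tracker.update(img);
            dlib::drectangle pos = tracker.get_position();

            // Periodically re-check the tracked box with the detectors so that
            // the tracker cannot drift onto background objects (e.g. a chair).
            if (i % revalidate_every == 0 || confidence < min_confidence)
            {
                dlib::rectangle verified;
                if (detect_robot(frame, verified))
                    tracker.start_track(img, verified);   // re-anchor the tracker
                else
                    tracking = false;                     // fall back to scanning mode
            }

            cv::rectangle(frame,
                          cv::Point((int)pos.left(), (int)pos.top()),
                          cv::Point((int)pos.right(), (int)pos.bottom()),
                          cv::Scalar(0, 255, 0), 2);
        }
    }
    return 0;
}
```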

Special attention was given to the evaluation of the system. Two software tools were developed as additional aids: one for evaluating the efficiency of the detections and another for analysing the processing time and frame-rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95% with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two-dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 summarizes the progress made on 3D mapping, along with the applied mathematics.


An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. thresholding based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
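
A height-based threshold of the kind mentioned above could be as simple as the sketch below; the point type, the height band and the assumption that z is measured upwards from the floor plane are illustrative.

```cpp
#include <vector>
#include <algorithm>
#include <iterator>

struct Point3D { float x, y, z; };   // z: assumed height above the floor plane (metres)

// Keep only the points whose height falls into the band occupied by the UGV,
// discarding the floor, walls and tall furniture in one cheap pass.
std::vector<Point3D> filter_by_height(const std::vector<Point3D> &cloud,
                                      float min_z = 0.05f, float max_z = 0.40f)
{
    std::vector<Point3D> candidates;
    std::copy_if(cloud.begin(), cloud.end(), std::back_inserter(candidates),
                 [min_z, max_z](const Point3D &p) { return p.z >= min_z && p.z <= max_z; });
    return candidates;
}
```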

In conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object from not completely frontal angles), will result in better performance.
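
A minimal sketch of such a retraining run with dlib's fHOG scanner (the same HOG+SVM pipeline used by the detectors, cf. [75]) is shown below; the dataset file name, the detection window size and the C value are illustrative assumptions.

```cpp
#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>
#include <vector>

int main()
{
    // Images and hand-labelled boxes of the robot's side view, including the new,
    // slightly rotated view-points (the XML file name is only an assumption).
    dlib::array<dlib::array2d<unsigned char>> images;
    std::vector<std::vector<dlib::rectangle>> boxes;
    dlib::load_image_dataset(images, boxes, "robot_side_training.xml");

    typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);            // assumed sliding-window size

    dlib::structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                                      // SVM regularisation, tuned on validation data
    trainer.be_verbose();

    dlib::object_detector<image_scanner_type> detector = trainer.train(images, boxes);
    dlib::serialize("robot_side_detector.svm") << detector;
    return 0;
}
```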

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Aside from the rectangular box which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. Further processing of the returned position is also recommended. For example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.
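
One possible post-processing step of the returned position, assuming the robot contrasts well enough with its background for a simple threshold, is to snap the (slightly enlarged) tracker box to the largest thresholded blob inside it:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Snap a tracker box to the largest thresholded blob around it. The 20% margin
// and the Otsu threshold are illustrative choices.
cv::Rect refine_box(const cv::Mat &gray, const cv::Rect &box)
{
    // Enlarge the box a little so that a too-small box can grow back onto the robot.
    int dx = box.width / 5, dy = box.height / 5;
    cv::Rect search(box.x - dx, box.y - dy, box.width + 2 * dx, box.height + 2 * dy);
    search &= cv::Rect(0, 0, gray.cols, gray.rows);        // clip to the image

    cv::Mat mask;
    cv::threshold(gray(search), mask, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty())
        return box;                                        // nothing to snap to, keep the tracker box

    size_t best = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[best]))
            best = i;

    // Bounding rectangle of the largest blob, mapped back to full-image coordinates.
    return cv::boundingRect(contours[best]) + search.tl();
}
```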


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects of more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully so that corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time. Theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field-of-view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.
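
The reachable-circle argument above reduces to a one-line bound; a hypothetical helper could look like this (the maximal speed would come from the UGV's specification):

```cpp
#include <cmath>

struct Position2D { double x, y; };   // metres, in the common map frame

// Radius of the circle the UGV can have reached since its last known position.
double search_radius(double max_speed_mps, double elapsed_s)
{
    return max_speed_mps * elapsed_s;
}

// True if a map location still has to be scanned for the robot.
bool needs_scanning(const Position2D &candidate, const Position2D &last_known,
                    double max_speed_mps, double elapsed_s)
{
    double r = search_radius(max_speed_mps, elapsed_s);
    return std::hypot(candidate.x - last_known.x, candidate.y - last_known.y) <= r;
}
```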

It is strongly recommended to give high priority to evaluation and testing during further experiments. For 2D images the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned, ready-to-deploy 3D mapping system.


Acknowledgements

First of all, I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr. Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks go to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me, and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] ldquoOverview senseFlyrdquo [Online] Available httpswwwsenseflycomdronesoverviewhtml [Accessed at 2015-08-08]

[2] H Chao Y Cao and Y Chen ldquoAutopilots for small fixed-wing unmanned airvehicles A surveyrdquo in Proceedings of the 2007 IEEE International Conference onMechatronics and Automation ICMA 2007 2007 pp 3144ndash3149

[3] A Frank J McGrew M Valenti D Levine and J How ldquoHover Transitionand Level Flight Control Design for a Single-Propeller Indoor AirplanerdquoAIAA Guidance Navigation and Control Conference and Exhibit pp 1ndash43 2007[Online] Available httparcaiaaorgdoiabs10251462007-6318

[4] ldquoFixed Wing Versus Rotary Wing For UAV Mapping Ap-plicationsrdquo [Online] Available httpwwwquestuavcomnewsfixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] ldquo DJI Store Phantom 3 Standardrdquo [Online] Available httpstoredjicomproductphantom-3-standard [Accessed at 2015-08-08]

[6] ldquoWorld War II V-1 Flying Bomb - Military Historyrdquo [Online] Available httpmilitaryhistoryaboutcomodartillerysiegeweaponspv1htm [Accessed at 2015-08-08]

[7] L Lin and M A Goodrich ldquoUAV intelligent path planning for wilderness searchand rescuerdquo in 2009 IEEERSJ International Conference on Intelligent Robotsand Systems IROS 2009 2009 pp 709ndash714

[8] M A Goodrich B S Morse D Gerhardt J L Cooper M Quigley J A Adamsand C Humphrey ldquoSupporting wilderness search and rescue using a camera-equipped mini UAVrdquo Journal of Field Robotics vol 25 no 1-2 pp 89ndash1102008

[9] P Doherty and P Rudol ldquoA UAV Search and Rescue Scenario with HumanBody Detection and Geolocalizationrdquo in Lecture Notes in Computer Science


2007 pp 1ndash13 [Online] Available httpwwwspringerlinkcomcontentt361252205328408fulltextpdf

[10] A Jaimes S Kota and J Gomez ldquoAn approach to surveillance an area usingswarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)rdquo 2008 IEEEInternational Conference on System of Systems Engineering SoSE 2008 2008

[11] M Kontitsis K Valavanis and N Tsourveloudis ldquoA UAV vision system for air-borne surveillancerdquo IEEE International Conference on Robotics and Automation2004 Proceedings ICRA rsquo04 2004 vol 1 2004

[12] M Quigley M A Goodrich S Griffiths A Eldredge and R W Beard ldquoTargetacquisition localization and surveillance using a fixed-wing mini-UAV and gim-baled camerardquo in Proceedings - IEEE International Conference on Robotics andAutomation vol 2005 2005 pp 2600ndash2606

[13] E. Semsch, M. Jakob, D. Pavlíček and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M Israel ldquoA UAV-BASED ROE DEER FAWN DETECTION SYSTEMrdquo pp51ndash55 2012

[15] G Zhou and D Zang ldquoCivil UAV system for earth observationrdquo in InternationalGeoscience and Remote Sensing Symposium (IGARSS) 2007 pp 5319ndash5321

[16] W DeBusk ldquoUnmanned Aerial Vehicle Systems for Disaster Relief TornadoAlleyrdquo AIAA InfotechAerospace 2010 2010 [Online] Available httparcaiaaorgdoiabs10251462010-3506

[17] S DrsquoOleire-Oltmanns I Marzolff K Peter and J Ries ldquoUnmannedAerial Vehicle (UAV) for Monitoring Soil Erosion in Moroccordquo RemoteSensing vol 4 no 12 pp 3390ndash3416 2012 [Online] Available httpwwwmdpicom2072-42924113390

[18] F Nex and F Remondino ldquoUAV for 3D mapping applications a reviewrdquoApplied Geomatics vol 6 no 1 pp 1ndash15 2013 [Online] Availablehttplinkspringercom101007s12518-013-0120-x

[19] H Eisenbeiss ldquoThe Potential of Unmanned Aerial Vehicles for MappingrdquoPhotogrammetrische Woche Heidelberg pp 135ndash145 2011 [Online] Availablehttpwwwifpuni-stuttgartdepublicationsphowo11140Eisenbeisspdf

[20] C Bills J Chen and A Saxena ldquoAutonomous MAV flight in indoor environ-ments using single image perspective cuesrdquo in Proceedings - IEEE InternationalConference on Robotics and Automation 2011 pp 5776ndash5783


[21] W Bath and J Paxman ldquoUAV localisation amp control through computer visionrdquo of the Australasian Conference on Robotics 2004 [Online] Availablehttpwwwcseunsweduau~acra2005proceedingspapersbathpdf

[22] K. Çelik, S. J. Chung, M. Clausman and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H Oh D Y Won S S Huh D H Shim M J Tahk and A Tsourdos ldquoIndoorUAV control using multi-camera visual feedbackrdquo in Journal of Intelligent andRobotic Systems Theory and Applications vol 61 no 1-4 2011 pp 57ndash84

[24] Y M Mustafah A W Azman and F Akbar ldquoIndoor UAV Positioning UsingStereo Vision Sensorrdquo pp 575ndash579 2012

[25] P Jongho and K Youdan ldquoStereo vision based collision avoidance of quadrotorUAVrdquo in Control Automation and Systems (ICCAS) 2012 12th InternationalConference on 2012 pp 173ndash178

[26] J Weingarten and R Siegwart ldquoEKF-based 3D SLAM for structured environmentreconstructionrdquo in 2005 IEEERSJ International Conference on Intelligent Robotsand Systems IROS 2005 pp 2089ndash2094

[27] H. Surmann, A. Nüchter and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F Wang J Cui S K Phang B M Chen and T H Lee ldquoA mono-camera andscanning laser range finder based UAV indoor navigation systemrdquo in 2013 Inter-national Conference on Unmanned Aircraft Systems ICUAS 2013 - ConferenceProceedings 2013 pp 694ndash701

[30] M Nagai T Chen R Shibasaki H Kumagai and A Ahmed ldquoUAV-borne 3-Dmapping system by multisensor integrationrdquo IEEE Transactions on Geoscienceand Remote Sensing vol 47 no 3 pp 701ndash708 2009

[31] X Zhang Y-H Yang Z Han H Wang and C Gao ldquoObject class detectionrdquoACM Computing Surveys vol 46 no 1 pp 1ndash53 Oct 2013 [Online] Availablehttpdlacmorgcitationcfmid=25229682522978

[32] P M Roth and M Winter ldquoSurvey of Appearance-Based Methodsfor Object Recognitionrdquo Transform no ICG-TR-0108 2008 [Online]


Available httpwwwicgtu-grazacatMemberspmrothpub_pmrothTR_ORat_downloadfile

[33] A Andreopoulos and J K Tsotsos ldquo50 Years of object recognition Directionsforwardrdquo Computer Vision and Image Understanding vol 117 no 8 pp 827ndash891 Aug 2013 [Online] Available httpwwwresearchgatenetpublication257484936_50_Years_of_object_recognition_Directions_forward

[34] J K Tsotsos Y Liu J C Martinez-Trujillo M Pomplun E Simine andK Zhou ldquoAttending to visual motionrdquo Computer Vision and Image Understand-ing vol 100 no 1-2 SPEC ISS pp 3ndash40 2005

[35] P Perona ldquoVisual Recognition Circa 2007rdquo pp 1ndash12 2007 [Online] Availablehttpswwwvisioncaltechedupublicationsperona-chapter-Dec07pdf

[36] M Piccardi ldquoBackground subtraction techniques a reviewrdquo 2004 IEEE Interna-tional Conference on Systems Man and Cybernetics (IEEE Cat No04CH37583)vol 4 pp 3099ndash3104 2004

[37] T Bouwmans ldquoRecent Advanced Statistical Background Modeling for ForegroundDetection - A Systematic Surveyrdquo Recent Patents on Computer Sciencee vol 4no 3 pp 147ndash176 2011

[38] L Xu and W Bu ldquoTraffic flow detection method based on fusion of frames dif-ferencing and background differencingrdquo in 2011 2nd International Conference onMechanic Automation and Control Engineering MACE 2011 - Proceedings 2011pp 1847ndash1850

[39] A-t Nghiem F Bremond I-s Antipolis and R Lucioles ldquoBackground subtrac-tion in people detection framework for RGB-D camerasrdquo 2004

[40] J C S Jacques C R Jung and S R Musse ldquoBackground subtraction andshadow detection in grayscale video sequencesrdquo in Brazilian Symposium of Com-puter Graphic and Image Processing vol 2005 2005 pp 189ndash196

[41] P Kaewtrakulpong and R Bowden ldquoAn Improved Adaptive Background MixtureModel for Real- time Tracking with Shadow Detectionrdquo Advanced Video BasedSurveillance Systems pp 1ndash5 2001

[42] G Turin ldquoAn introduction to matched filtersrdquo IRE Transactions on InformationTheory vol 6 no 3 1960

[43] K Briechle and U Hanebeck ldquoTemplate matching using fast normalized crosscorrelationrdquo Proceedings of SPIE vol 4387 pp 95ndash102 2001 [Online] AvailablehttplinkaiporglinkPSI4387951ampAgg=doi


[44] H Schweitzer J W Bell and F Wu ldquoVery Fast Template Matchingrdquo Programno 009741 pp 358ndash372 2002 [Online] Available httpwwwspringerlinkcomindexH584WVN93312V4LTpdf

[45] P Viola and M Jones ldquoRobust real-time object detectionrdquo Interna-tional Journal of Computer Vision vol 57 pp 137ndash154 2001 [Online]Available httpscholargooglecomscholarhl=enampbtnG=Searchampq=intitleRobust+Real-time+Object+Detection0

[46] S Omachi and M Omachi ldquoFast template matching with polynomialsrdquo IEEETransactions on Image Processing vol 16 no 8 pp 2139ndash2149 2007

[47] M Jordan and J Kleinberg Bishop - Pattern Recognition and Machine Learning

[48] D Prasad ldquoSurvey of the problem of object detection in real imagesrdquoInternational Journal of Image Processing (IJIP) no 6 pp 441ndash466 2012[Online] Available httpwwwcscjournalsorgcscmanuscriptJournalsIJIPvolume6Issue6IJIP-702pdf

[49] D Lowe ldquoObject Recognition fromLocal Scale-Invariant Featuresrdquo IEEE Inter-national Conference on Computer Vision 1999

[50] D G Lowe ldquoDistinctive image features from scale-invariant keypointsrdquo Interna-tional Journal of Computer Vision vol 60 no 2 pp 91ndash110 2004

[51] H Bay T Tuytelaars and L Van Gool ldquoSURF Speeded up robust featuresrdquo inLecture Notes in Computer Science (including subseries Lecture Notes in ArtificialIntelligence and Lecture Notes in Bioinformatics) vol 3951 LNCS 2006 pp 404ndash417

[52] T Joachims ldquoA probabilistic analysis of the Rocchio algorithmwith TFIDF for text categorizationrdquo International Conference on Ma-chine Learning pp 143ndash151 1997 [Online] Available papers2publicationuuid9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N Dalal and B Triggs ldquoHistograms of Oriented Gradients for Human Detectionrdquoin CVPR rsquo05 Proceedings of the 2005 IEEE Computer Society Conference onComputer Vision and Pattern Recognition (CVPRrsquo05) - Volume 1 2005 pp 886ndash893 [Online] Available citeulike-article-id3047126$delimiter026E30F$nhttpdxdoiorg101109CVPR2005177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online] Available httpciteseerxistpsueduviewdocsummarydoi=1011869433


[55] S Han Y Han and H Hahn ldquoVehicle Detection Method using Haar-like Featureon Real Time Systemrdquo inWorld Academy of Science Engineering and Technology2009 pp 455ndash459

[56] Q C Q Chen N Georganas and E Petriu ldquoReal-time Vision-based Hand Ges-ture Recognition Using Haar-like Featuresrdquo 2007 IEEE Instrumentation amp Mea-surement Technology Conference IMTC 2007 2007

[57] G Monteiro P Peixoto and U Nunes ldquoVision-based pedestrian detection usingHaar-like featuresrdquo Robotica 2006 [Online] Available httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J. Martí, J. M. Benedí, A. M. Mendonça and J. Serrat, Eds., Pattern Recognition and Image Analysis, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, vol. 4477. [Online] Available httpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A E C Pece ldquoOn the computational rationale for generative modelsrdquo ComputerVision and Image Understanding vol 106 no 1 pp 130ndash143 2007

[60] I T Joliffe Principle Component Analysis 2002 vol 2 [Online] Availablehttpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A Hyvarinen J Karhunen and E Oja ldquoIndependent Component Analysisrdquovol 10 p 2002 2002

[62] K Mikolajczyk B Leibe and B Schiele ldquoMultiple Object Class Detection with aGenerative Modelrdquo IEEE Computer Society Conference on Computer Vision andPattern Recognition (CVPR) vol 1 pp 26ndash36 2006

[63] Y Freund and R E Schapire ldquoA Decision-theoretic Generalization of On-lineLearning and an Application to Boostingrdquo Journal of Computing Systems andScience vol 55 no 1 pp 119ndash139 1997

[64] C Cortes and V Vapnik ldquoSupport-vector networksrdquo Machine Learningvol 20 no 3 pp 273ndash297 Sep 1995 [Online] Available httplinkspringercom101007BF00994018

[65] P Viola and M Jones ldquoRobust real-time face detectionrdquo International Journalof Computer Vision vol 57 no 2 pp 137ndash154 2004

[66] B E Boser I M Guyon and V N Vapnik ldquoA Training Algorithmfor Optimal Margin Classifiersrdquo Proceedings of the 5th Annual ACM Workshopon Computational Learning Theory pp 144ndash152 1992 [Online] Availablehttpciteseerxistpsueduviewdocsummarydoi=1011213818


[67] C-W Hsu and C-J Lin ldquoA comparison of methods for multiclass support vectormachinesrdquo IEEE Transactions on Neural Networks vol 13 no 2 pp 415ndash4252002

[68] A Mathur and G M Foody ldquoMulticlass and binary SVM classification Implica-tions for training and classification usersrdquo IEEE Geoscience and Remote SensingLetters vol 5 no 2 pp 241ndash245 2008

[69] ldquoNitrogen6X - iMX6 Single Board Computerrdquo [Online] Available httpboundarydevicescomproductnitrogen6x-board-imx6-arm-cortex-a9-sbc [Ac-cessed at 2015-08-10]

[70] ldquoPixhawk flight controllerrdquo [Online] Available httproverardupilotcomwikipixhawk-wiring-rover [Accessed at 2015-08-10]

[71] ldquoScanning range finder utm-30lx-ewrdquo [Online] Available httpswwwhokuyo-autjp02sensor07scannerutm_30lx_ewhtml [Accessed at 2015-07-22]

[72] ldquoopenCV manual Release 249rdquo [Online] Available httpdocsopencvorgopencv2refmanpdf [Accessed at 2015-07-20]

[73] D E King ldquoDlib-ml A Machine Learning Toolkitrdquo Journal of MachineLearning Research vol 10 pp 1755ndash1758 2009 [Online] Available httpjmlrcsailmitedupapersv10king09ahtml

[74] ldquodlib C++ Libraryrdquo [Online] Available httpdlibnet [Accessed at 2015-07-21]

[75] D E King ldquoMax-Margin Object Detectionrdquo Jan 2015 [Online] Availablehttparxivorgabs150200046

[76] M. Danelljan, G. Häger and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," Proceedings of the British Machine Vision Conference, BMVC, 2014.

[77] ldquoUrgBenri Information Pagerdquo [Online] Available httpswwwhokuyo-autjp02sensor07scannerdownloaddataUrgBenrihtm [Accessed at 2015-07-20]

[78] ldquovatic - Video Annotation Tool - UC Irvinerdquo [Online] Available httpwebmiteduvondrickvatic [Accessed at 2015-07-24]

[79] ldquoAmazon Mechanical Turkrdquo [Online] Available httpswwwmturkcommturkwelcome [Accessed at 2015-07-26]


• List of Figures
• Absztrakt
• Abstract
• List of Abbreviations
• 1 Introduction and project description
  • 1.1 Project description and requirements
  • 1.2 Type of vehicle
  • 1.3 Aims and objectives
• 2 Literature Review
  • 2.1 UAVs and applications
    • 2.1.1 Fixed-wing UAVs
    • 2.1.2 Rotary-wing UAVs
    • 2.1.3 Applications
  • 2.2 Object detection on conventional 2D images
    • 2.2.1 Classical detection methods
      • 2.2.1.1 Background subtraction
      • 2.2.1.2 Template matching algorithms
    • 2.2.2 Feature descriptors, classifiers and learning methods
      • 2.2.2.1 SIFT features
      • 2.2.2.2 Haar-like features
      • 2.2.2.3 HOG features
      • 2.2.2.4 Learning models in computer vision
      • 2.2.2.5 AdaBoost
      • 2.2.2.6 Support Vector Machine
• 3 Development
  • 3.1 Hardware resources
    • 3.1.1 Nitrogen board
    • 3.1.2 Sensors
      • 3.1.2.1 Pixhawk autopilot
      • 3.1.2.2 Camera
      • 3.1.2.3 LiDar
  • 3.2 Chosen software
    • 3.2.1 Matlab
    • 3.2.2 Robotic Operating System (ROS)
    • 3.2.3 OpenCV
    • 3.2.4 Dlib
• 4 Designing and implementing the algorithm
  • 4.1 Challenges in the task
  • 4.2 Architecture of the detection system
  • 4.3 2D image processing methods
    • 4.3.1 Chosen methods and the training algorithm
    • 4.3.2 Sliding window method
    • 4.3.3 Pre-filtering
    • 4.3.4 Tracking
    • 4.3.5 Implemented detector
      • 4.3.5.1 Mode 1: Sliding window with all the classifiers
      • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      • 4.3.5.4 Mode 4: Tracking based approach
  • 4.4 3D image processing methods
    • 4.4.1 3D recording method
    • 4.4.2 Android based recording set-up
    • 4.4.3 Final set-up with Pixhawk flight controller
    • 4.4.4 3D reconstruction
• 5 Results
  • 5.1 2D image detection results
    • 5.1.1 Evaluation
      • 5.1.1.1 Definition of True positive and negative
      • 5.1.1.2 Definition of False positive and negative
      • 5.1.1.3 Reducing number of errors
      • 5.1.1.4 Annotation and database building
    • 5.1.2 Frame-rate measurement and analysis
  • 5.2 3D image detection results
  • 5.3 Discussion of results
• 6 Conclusion and recommended future work
  • 6.1 Conclusion
  • 6.2 Recommended future work
• References

error of 5–6 cm. Errors with similar magnitude can be corrected later, depending on the application. In the case of architecture, for example, aligning a 3D room model to the recorded point cloud could eliminate this noise.

3. autonomous: The solution should be as autonomous as possible. The aim is to build a system which does not require human supervision (for example, mapping a dangerous area which would be too far for real-time control) and coordinates the process. Although remote control is acceptable and should be implemented, it is desired to be minimal. This should allow a remote operator to coordinate even more than one unit at a time (since none of them needs continuous attention), while keeping the ability to change the route or the priority of premises.

4. complete: Depending on the size and layout of the building, the mapping process can be very complicated. It may have more than one floor, loops (the same location reached via another route) or open areas, which are all significant challenges for both the vehicle (planning a route to avoid collisions and cover every desired area) and the mapping software. The latter should recognize previously seen locations and "close" the loop (that is, assign the two separately recorded areas to one location) and handle multi-storey buildings. The system has to manage these tasks and provide a map which contains all the layers, closes loops and represents all walls (including ceiling and floor).

1.2 Type of vehicle

The first question of such a project is what type of vehicle should be used. The options are either some kind of aerial or a ground vehicle. Both have their advantages and disadvantages.

An aerial vehicle has many more degrees of freedom, since it can elevate from the ground. It can reach positions and perform measurements which would be impossible for a ground vehicle. However, it consumes a lot of energy to stay in the air, thus such a system has significant limitations in weight, payload and operating time (since the batteries themselves weigh a lot).

A ground vehicle overcomes all these problems, since it is able to carry a lot more payload than an aerial one. This means a lot of batteries, therefore longer operation time. On the other hand, it can't elevate from the ground, thus all the measurements will be performed from a position relatively close to the ground. This can be a big disadvantage, since taller objects will not have their tops recorded. Also, a ground robot would not be able to overcome big obstacles closing its way, or to get to another floor.


Figure 1.1: Image of the ground robot. Source: own picture.

Considering these and many more arguments (price, complexity, human and material resources), a combination of an aerial and a ground vehicle was chosen as a solution. The idea is that both vehicles enter the building at the same time and start mapping the environment simultaneously, working on the same map (which is unknown when the process starts). Using the advantage of the aerial vehicle, the system scans parts of the rooms which are unavailable for the ground vehicle. With sophisticated route planning the areas scanned twice can be minimized, resulting in faster map building. Also, the great load capacity of the ground robot makes it able to serve as a charging station for the aerial vehicle. This option will be discussed in detail in section 1.3.

1.3 Aims and objectives

As it can be seen, the scope of the project is extremely wide. To cover the whole of it, more than fifteen students (visiting, MSc and PhD) are working on it. Therefore the project has been divided into numerous subtasks.

During the planning period of the project several challenges were recognized, for example the lack of GPS signal indoors, the vibration of the on-board sensors or the short flight time of the aerial platform.

This thesis addresses the last one. As mentioned in section 1.2, the UAV consumes a lot of energy to stay in the air, thus such a system has a significantly limited operating time. In contrast, the ground robot is able to carry a lot more payload (e.g. batteries) than an aerial one. The idea to overcome this limitation of the UAV is to use the ground robot as a landing and recharging platform for the aerial vehicle.


To achieve this, the system will need to have a continuous update of the relative position of the ground robot. The aim of this thesis is to give a solution for this problem. The following objectives were defined to structure the research.

First, possible solutions have to be collected and discussed in an extensive literature review. Then, after considering the existing solutions and the constraints and requirements of the discussed project, a design of a new system is needed which is able to detect the ground robot using the available sensors. Finally, the first version of the software has to be implemented and tested.

This paper will describe the process of the research, introduce the designed system and discuss the experiences.


Chapter 2

Literature Review

The aim of this chapter is to provide an overview of the previous research and articles related to the topic of this thesis. First, a short introduction to unmanned aerial vehicles will be given, along with their most important advantages and application fields.

Afterwards, a brief summary of the science of object recognition in image processing is provided. The most often used feature extraction and classification methods are presented.

2.1 UAVs and applications

The abbreviation UAV stands for Unmanned Aerial Vehicle: any aircraft which is controlled or piloted remotely and/or by on-board computers. They are also referred to as UAS, for Unmanned Aerial System.

They come in various sizes. The smallest ones fit in a man's palm, while even a full-size aeroplane can be a UAV by definition. Similarly to traditional aircraft, Unmanned Aerial Vehicles are usually grouped into two big classes: fixed-wing UAVs and rotary-wing UAVs.

2.1.1 Fixed-wing UAVs

Fixed-wing UAVs have a rigid wing which generates lift as the UAV moves forward. They maneuver with control surfaces called ailerons, rudder and elevator. Generally speaking, fixed-wing UAVs are easier to stabilize and control. For comparison, radio controlled aeroplanes can fly without any autopilot function implemented, relying only on the pilot's input. This is a result of their much simpler structure compared to rotary-wing ones, due to the fact that a gliding movement is easier to stabilize. However, they still have a quite extensive market


Figure 2.1: One of the most popular consumer-level fixed-wing mapping UAVs of the SenseFly company. Source: [1]

of flight controllers, whose aim is to add autonomous flying and pilot assistance features [2].

One of the biggest advantages of fixed-wing UAVs is that, due to their natural gliding capabilities, they can stay airborne using no or only a small amount of power. For the same reason, fixed-wing UAVs are also able to carry heavier or more payload for longer endurances using less energy, which would be very useful in any mapping task.

The drawback of fixed-wing aircraft in the case of this project is the fact that they have to keep moving to generate lift and stay airborne. Therefore flying around between and in buildings is very complicated, since there is limited space to move around. Although there have been approaches to prepare a fixed-wing UAV to hover and translate indoors [3], generally they are not optimal for indoor flights, especially with heavy payload [4].

Fixed-wing UAVs are an excellent choice for long-endurance, high-altitude tasks. The long flight times and higher speed make it possible to cover larger areas with one take-off. In figure 2.1 a consumer fixed-wing UAV is shown, designed specially for mapping.

This project needs a vehicle which is able to fly between buildings and indoors without any complication. Thus rotary-wing UAVs were reviewed.

2.1.2 Rotary-wing UAVs

Rotary-wing UAVs generate lift with rotor blades attached to and rotated around an axis. These blades work exactly like the wing on the fixed-wing vehicles, but instead of the vehicle moving forward, the blades are moving constantly. This makes the vehicle able to hover and hold its position. Also, due to the same property, they can take off and land vertically. These capabilities are very important aspects for the project, since both of them are crucial for indoor flight, where space is limited.


Figure 2.2: The newest version of the popular Phantom series of DJI. This consumer drone is easy to fly even for beginner pilots due to its many advanced autopilot features. Source: [5]

Based on the number of rotors, several sub-classes are known. UAVs with one rotor are helicopters, while anything with more than one rotor is called a multirotor or multicopter.

The latter group controls its motion by modifying the speed of multiple down-thrusting motors. In general they are more complicated to control than fixed-wing UAVs, since even hovering requires constant compensation: multirotors must individually adjust the thrust of each motor. If the motors on one side are producing more thrust than the other, the multirotor tilts to the other side. That is the reason why every remote controlled multicopter is equipped with some kind of flight controller system. They are also less stable (without electronic stabilization) and less energy efficient than helicopters. That is the reason for the fact that no large-scale multirotor is used, since as the size is increased, the simplicity becomes less important than the inefficiency. On the other hand, they are mechanically simpler and cheaper, and their architecture is suitable for a wider range of payloads, which makes them appealing for many fields.

Rotary-wing UAVs' mechanical and electrical structures are generally more complex than fixed-wing ones. This can result in more complicated and expensive repairs and general maintenance, which shorten operational time and increase the cost of the project. Finally, due to their lower speeds and shorter flight ranges, rotary-wing UAVs will require many additional flights to survey any large area, which is another cause of increased operational costs [4].


In spite of the disadvantages listed above, rotary-wing, especially multirotor UAVs seem like an excellent choice for this project, since there is no need to cover large areas and the shorter operational times can be solved with the charging platform.

2.1.3 Applications

UAVs, like many technological inventions, got their biggest motivation from military goals and applications. For example, the German rocket V-1 ("the flying bomb") had a huge impact on the history of UAS [6]. An unmanned aerial vehicle makes it possible to observe, spy on and attack the enemy without risking human lives.

Fortunately, non-military development is catching up too, and UAVs are becoming essential tools for many application fields. As rotary-wing UAVs, especially multicopters, are getting cheaper, they have appeared in the consumer categories as well. They are available as simple remote controlled toys and as advanced aerial photography platforms [5]. See figure 2.2 for a popular quadcopter.

UAVs are gaining popularity in search and rescue operations, since they provide a fast and relatively cheap tool to get an overview of the involved area. A good example of this application field is the location of an earthquake or searching for missing people in the wilderness. See [7] and [8] for examples; [9] is another related article, which includes approaches to identify possibly injured people autonomously.

Another application field where UAVs have been proven useful is surveillance. [10] is an interesting example in this field, since it proposes to use a combined swarm of rotary and fixed-wing UAVs. [11] and [12] address the same problem, while [13] tries to overcome the challenges of the task in an urban area.

Getting sensors in the air with low cost vehicles is an amazing opportunity for several research areas. UAVs have been used for wild-life monitoring [14], earth observation [15] and even tornado watch missions [16]. [17] used UAVs to reduce "the data gap between field scale and satellite scale in soil erosion monitoring".

UAVs are often used for 3D mapping and reconstruction tasks for topography, architecture or other purposes. [18] is a review of current mapping methods based on UAVs. [19] is another survey, which also compares photogrammetry with other methods to define the optimal application fields.

Both surveys mentioned above rather focus on topographical, larger-scale mapping tasks. Indoor flying and mapping is more complicated in a sense, since the danger of collision is significantly higher. Indoor positioning of UAVs has been addressed in several papers; they can be grouped by the sensor used. Note that the GPS signal is usually too weak to use inside, thus other methods are needed.

Some researchers tried to localize the aerial platform based on a single camera


using monocular SLAM (Simultaneous Localization and Mapping). See [20–22] for UAVs navigating based on one camera only.

Often the platform is equipped with multiple cameras to achieve stereo vision and create a depth map. For examples of such solutions see [23–25].

A popular approach for indoor mapping is using 3D laser distance sensors as the input of the SLAM algorithm. Examples of this are [26–28].

However, cameras and laser range finders are often used together. [29, 30], for example, manage to integrate the two sensors and carry out 3D mapping while controlling and navigating the UAV itself. This project will use a similar setup, that is, a combination of one or more monocular cameras with a 2D laser scanner.

2.2 Object detection on conventional 2D images

Object class detection is one of the most researched areas of computer vision. As the available resources (computational power, resolution of cameras, etc.) improved and became cheaper, and more and more complex tasks occurred, the field has witnessed several approaches and methods. Thus the topic has an extensive literature. To give an overview, general surveys can be a good starting point.

There are several articles that aim to provide a summary of the existing methods, trying to categorize and group them (e.g. [31–33]).

The terms and definitions of computer vision tasks are defined in several ways in the literature. [33] and [34] group them as:

• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of the item

• Recognition: localization of all the items present in the stimulus

• Understanding: recognition plus the role of the stimulus in the context of the scene

while [35] sets up five classes defined by the following (quoted from [33]):

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in the image, and provide accurate location information of the given object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in it. Note that no location is determined.


• Naming: Given a large complex image (instead of an image patch, as in the classification problem), determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in it and describe the actions and relationships of the various objects within the context of the image.

[31] defines object (class or category) detection as the combination of two goals: object categorization and object localization. The former decides if any instance of the categories of interest is present in the input image. The latter task determines the positions and dimensions (projected size) of the objects that are found.

As can be seen from the categories defined above, the boundaries of these subclasses are not clear and they often overlap. The aims and objectives listed in section 1.3 classify this task as detection or localization (or a combination of the two). For the sake of simplicity, the aim of this thesis and the provided algorithm will be referred to as detection and detector.

Besides the grouping of the tasks of computer vision, several attempts were made to classify the approaches themselves. The biggest disjunctive property of the methods might be whether or not they involve machine learning.

2.2.1 Classical detection methods

Resulting from the rapid development of hardware resources, along with the extensive research of artificial intelligence, methods not using machine learning are getting neglected. Generally speaking, the ones using it show better performance and are simpler to adapt if the circumstances change (e.g. by re-training them).

It has to be noted that although these classical approaches are not very popular nowadays, their simplicity is often an attractive factor for performing easy detection tasks or pre-filtering the image.

2.2.1.1 Background subtraction

Background subtraction methods are often used to detect moving objects from a standing platform (static cameras). Such an algorithm is able to learn the "background model" of an image by differencing a few frames. Although several methods exist (see [36] or [37] for a good summary), the basic idea is common: separate the foreground from the background by subtracting the background model from the input image. This subtraction should result in an output image where


Figure 2.3: Example of background subtraction used for people detection. The red patches show the objects in the segmented foreground. Source: [39]

only objects in the foreground are shown. The background model is usually updated regularly, which means the detection is adaptive.
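
Although background subtraction was not adopted for this project (see below), the idea can be illustrated with a short OpenCV sketch using the MOG2 background model available in recent OpenCV releases; the input file name is a placeholder.

```cpp
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("corridor.avi");                     // placeholder input video
    cv::Ptr<cv::BackgroundSubtractorMOG2> subtractor =
        cv::createBackgroundSubtractorMOG2();                 // learns the background model online

    cv::Mat frame, foreground;
    while (cap.read(frame))
    {
        subtractor->apply(frame, foreground);                 // update the model and segment the frame

        // Remove single-pixel noise; the remaining blobs are candidate moving objects.
        cv::morphologyEx(foreground, foreground, cv::MORPH_OPEN,
                         cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));

        cv::imshow("foreground mask", foreground);
        if (cv::waitKey(30) == 27)                            // Esc stops the demo
            break;
    }
    return 0;
}
```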

These solutions can give an excellent base for tasks like traffic flow monitoring [38] or human detection [39], as moving points are separated, and connected or close ones can be managed as an object. Thus all possible objects of interest are very well separated. The fixed position of the camera means fixed perspectives, which makes size and distance estimations accurate. With fixed lengths, reliable velocity measurement can be implemented. These and further additional properties (colour, texture, etc. of the moving object) can be the base of an object classification system. This method is often combined with more sophisticated recognition methods, serving as a region of interest detector. After the selection of interesting areas, those methods can operate on a smaller input, resulting in a faster response time.

It has to be mentioned that not every moving (changing) patch is an object; for example, in certain lighting circumstances a vehicle and its shadow move almost identically and have similar size parameters. Effective methods to remove shadows from the input have been described (e.g. [40, 41]). Certainly, the processing time of these filters adds to the detection time.

Another concern is the possibility of a sudden change in illumination (such as switching the light on or off), which would cause a change across the whole image. In this case the background model should be adapted automatically and quickly.

Although background subtraction has many benefits, it is clearly not suitable for this task, since the camera will be mounted on a moving platform.

2.2.1.2 Template matching algorithms

Object class detection can be described as a search for a specific pattern in the input image. This definition is related to the matched filter technique (see [42]) of digital signal processing, whose basic idea is to find a known wavelet in an unknown signal. This is usually achieved by cross-correlating them, looking for high correlation coefficients.

Template matching algorithms rely on the same concept, interpreting the


Figure 2.4: Example of template matching. Notice how the output image is the darkest at the correct position of the template (darker means higher value). Another interesting point of the picture is the high return values around the calendar's (on the wall, left) dark headlines, which indeed look similar to the handle. Source: [43]

image as a 2D digital signal. They determine the position of the given pattern by comparing a template (that contains the pattern) with the image.

To perform the comparison, the template is shifted in (u, v) discrete steps in the (x, y) directions, respectively. The comparison itself is usually calculated by some kind of correlation over the area of the template for each (u, v) position. The method assumes that the function (the comparison method) will return the highest value at the correct position [43].

See Figure 2.4 for an example input, template (part a) and the output of the algorithm (part b). Notice how the output image is the darkest at the correct position of the template.

The biggest drawback of the method is its computational complexity, since the cross-correlation has to be calculated at every position. Several efforts have been published to reduce the processing time. [43] uses normalized cross-correlation for faster comparison; [44] applies integral images (based on the idea of [45]) for the same purpose. Note that simplifying the template can reduce the complexity as well, since it is easier to compare the two images. For example, [46] proposed an algorithm which approximates the template image with polynomials.
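
For illustration, a normalized cross-correlation search of this kind can be written in a few lines with OpenCV; the file names are placeholders.

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);      // search image
    cv::Mat templ = cv::imread("template.png", cv::IMREAD_GRAYSCALE);   // pattern to find

    // Normalized cross-correlation response for every template position.
    cv::Mat response;
    cv::matchTemplate(image, templ, response, cv::TM_CCORR_NORMED);

    // The best match is where the response peaks.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(response, &minVal, &maxVal, &minLoc, &maxLoc);

    std::cout << "best score " << maxVal << " at " << maxLoc << std::endl;
    cv::rectangle(image, cv::Rect(maxLoc, templ.size()), cv::Scalar(255), 2);
    cv::imwrite("match.png", image);
    return 0;
}
```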

Other disadvantages of template matching are the poor performance in case of changes in scale, rotation or perspective. Resulting from these, template matching is not a suitable concept for this project.


2.2.2 Feature descriptors, classifiers and learning methods

While computer vision and machine learning might seem to be two well-separated areas of computer engineering, computer vision is one of the biggest application fields of machine learning. [47] puts it this way: "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."

The two most important parts of a trained object detection algorithm are feature extraction and feature classification. The extraction method determines what type of features will be the base of the classifier and how they are extracted. The classification part is responsible for the machine learning method, in other words, how to decide if the extracted feature array (extracted by the feature extraction method described above) represents an object of interest.

Since the latter part (defining the learning and classifier system) is less related to vision studies, this paper will only discuss a few learning methods briefly. Before that, some of the most famous and widely used feature descriptors are presented.

Feature extractors or descriptors are often grouped as well. [48] recognizes edge-based and patch-based features; [31] has three groups according to the locality of the features:

• pixel-level feature description: these features can be calculated for each pixel separately. Examples are the pixel's intensity (grayscale) or its colour vector.

• patch-level feature description: descriptors based on patch-level features concentrate only on points of interest (and their neighbourhood) instead of describing the whole image. The name "patch" refers to the area around the point, which is also considered. Since these regions are much smaller than the image itself, patch-level feature descriptions are also called local feature descriptors. Some of the most famous ones are SIFT (sub-subsection 2.2.2.1, [49, 50]) and SURF [51]. In a sense, Haar-like features (see sub-subsection 2.2.2.2, [45]) belong here as well.

• region-level feature description: as patches are sometimes too small to describe the object appropriately, a larger scale is often needed. The region is a general expression for a set of connected pixels; the size and shape of this area is not defined. Region-level features are often constructed from the lower level features presented above. However, they do not try to apply higher level geometric models or structures. Resulting from that, region-level features are also called mid-level representations. The most well known descriptors


of this group are the BOF (Bag of Features or Bag of Words, [52]) and the HOG (Histogram of Oriented Gradients, sub-subsection 2.2.2.3, [53]).

Due to the limited scope of this paper, it is not possible to give a deeper overview of the features used in computer vision. Instead, three of the methods will be presented here. These were selected based on fame, performance on general object recognition, and the amount of relevant literature, examples and implementations. All of them are well-known, and both the articles and the methods are part of computer vision history.

2.2.2.1 SIFT features

Scale-invariant feature transform (or SIFT) is an image processing algorithm used to detect and extract local features. SIFT descriptors are the most well-known patch-level feature descriptors (see the grouping in subsection 2.2.2) [31]. They were proposed by David Lowe in 1999 [49]. The same author gave a more in-depth discussion of his earlier work and presented improvements in stability and feature invariance in 2004 [50]. His idea was to extract interesting points as features from the object (training image) and try to match these on the test images to locate the object.

The main steps of the algorithm are the following [50]:

1. Scale-space extrema detection: the first stage searches over the image at multiple scales using a difference-of-Gaussian function to identify potential interest points that are scale and orientation invariant.

2. Key-point localization: at each selected point the algorithm tries to fit a model to determine location and scale. Key-points are selected based on their stability: points which are expected to be more stable based on the model are kept.

3. Orientation assignment: "One or more orientations are assigned to each key-point based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations." [50]

4. Key-point descriptor: after selecting the interesting points, the gradients around them are calculated at a selected scale.

The biggest advantage of the descriptor is its scale and rotation invariance. Note that rotation here means rotating in the plane of the image, not capturing the object from a rotated viewpoint. For example, a rotation invariant (frontal)


face detector should be able to detect faces upside down, but not faces photographed from a side view. SIFT is also partly invariant to illumination and 3D camera viewpoint. This is achieved by the representation of the calculated descriptors, which allows for significant levels of local shape distortion and change in illumination. The algorithm assumes that these points' relative positions will not change on the test images. Thus SIFT descriptors tend to show worse performance on flexible objects or ones with moving parts. Also, the relative positions of the key-points can be significantly different not only if the object is distorted (flexibility), but also if they are extracted from another instance of the same class (two people, for example). Resulting from this property, SIFT is more often used for object tracking, image stitching, or to recognise a concrete object instead of the general class. However, this is not a problem in the case of this project, since only one ground robot is part of it. Thus no general class recognition is required; recognizing the used ground robot would be enough.
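
A minimal SIFT extraction and matching sketch is shown below; note that, depending on the OpenCV version, SIFT may live in the nonfree/xfeatures2d module instead of cv::SIFT, and the 0.75 ratio threshold follows Lowe's suggestion [50]. The image names are placeholders.

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    cv::Mat object = cv::imread("robot_front.png", cv::IMREAD_GRAYSCALE);  // training view of the robot
    cv::Mat scene  = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);        // current camera frame

    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kpObject, kpScene;
    cv::Mat descObject, descScene;
    sift->detectAndCompute(object, cv::noArray(), kpObject, descObject);
    sift->detectAndCompute(scene,  cv::noArray(), kpScene,  descScene);

    // Match descriptors and keep only the distinctive ones (Lowe's ratio test).
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descObject, descScene, knn, 2);

    std::vector<cv::DMatch> good;
    for (const std::vector<cv::DMatch> &m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            good.push_back(m[0]);

    std::cout << good.size() << " good matches" << std::endl;
    return 0;
}
```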

2.2.2.2 Haar-like features

Haar-like features were introduced by Viola and Jones in 2001 [45]. They proposed simple features inspired by the Haar wavelet basis, which is also used in signal processing. They can be seen as "templates" defining rectangular regions in the grayscale image. Viola and Jones described three types of features:

• two-rectangle features: "The value of a two-rectangle feature is the difference between the sum of the pixel intensities within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent." [45]

• three-rectangle features: "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle." [45]

• four-rectangle features: "A four-rectangle feature computes the difference between diagonal pairs of rectangles." [45]

See figure 2.5 for an example of two- and three-rectangle features. The main advantage of this approach is the fact that these features can be computed very rapidly using so-called integral images. The integral image "contains the sum of pixel intensities above and to the left of it" [45] at any given image location. The references to the integral values at each pixel location are kept in a structure. Finally, the Haar features can be computed by summing and subtracting the integral values in the corners of the rectangles.
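
The constant-time rectangle sums behind this can be illustrated with OpenCV's integral image; the example feature location and sizes are arbitrary.

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>

// Sum of pixel intensities inside rectangle r, computed in O(1) from the
// (rows+1) x (cols+1) CV_32S matrix returned by cv::integral.
static int rectSum(const cv::Mat &ii, const cv::Rect &r)
{
    return ii.at<int>(r.y, r.x)
         + ii.at<int>(r.y + r.height, r.x + r.width)
         - ii.at<int>(r.y, r.x + r.width)
         - ii.at<int>(r.y + r.height, r.x);
}

int main()
{
    cv::Mat gray = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // placeholder input
    cv::Mat ii;
    cv::integral(gray, ii, CV_32S);

    // A two-rectangle Haar-like feature: intensity contrast between two
    // vertically adjacent 24x12 regions (cf. the eye/cheek example in figure 2.5).
    cv::Rect top(40, 30, 24, 12), bottom(40, 42, 24, 12);
    int feature = rectSum(ii, top) - rectSum(ii, bottom);
    std::cout << "feature value: " << feature << std::endl;
    return 0;
}
```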

[45] was motivated by the task of face detection. Although it is more than 14 years old, it is still the basis of many implemented face detection methods (in

Figure 2.5: Two example Haar-like features that have proven to be effective separators. The top row shows the features themselves; the bottom row presents the features overlaid on a typical face. The first feature corresponds to the observation that the eyes are usually darker than the cheeks. The second feature measures the intensity of the areas around the eyes compared to the nose. Source: [45]

consumer cameras, for example) and has inspired several further studies; [54], for example, extended the Haar-like feature set with rotated rectangles by calculating the integral images diagonally as well.

Resulting from their nature, Haar-like features are suitable for finding patterns of combined bright and dark patches (like faces on grayscale images, for example).

Beyond the original task, Haar-like features are used widely in computer vision for various purposes: vehicle detection [55], hand gesture recognition [56] and pedestrian detection [57], just to name a few.

Haar-like features are very popular and have a lot of advantages. Their main drawback is rotation variance, which, in spite of several attempts (e.g. [54]), is still not solved completely.

2.2.2.3 HOG features

The abbreviation HOG stands for Histogram of Oriented Gradients. Unlike Haar features (2.2.2.2), HOG aims to gather information from the gradient image.

The technique calculates the gradient of every pixel and, after quantization into a number of orientation bins, produces a histogram of the different orientations over small portions of the image (called cells). After a location-based normalization of these cells (within blocks), the feature vector is the concatenation of these histograms. Note that while this is similar to edge orientation histograms [58], it is different since

Figure 2.6: Illustration of the discriminative and generative models. The boxes mark the probabilities calculated and used by the models. Note the arrows, which display the information flow and present the main difference between the two philosophies. Source: [59]

not only sharp gradient changes (known as edges) are considered. The optimal number of bins, cells and blocks depends on the actual task and resolution, and is widely researched.

HOG is essentially a dense version of the SIFT descriptor (see sub-subsection 2.2.2.1 and [31]). However, it is not normalized with respect to orientation; thus, HOG is not rotation invariant. On the other hand, it is normalized locally with respect to the image contrast (over the blocks), which results in superior robustness against illumination changes compared to SIFT. The technique was first described in the well-known article [53] by Navneet Dalal and Bill Triggs, in which it was used for pedestrian detection. While this application is still the most common use of the descriptor, it can be applied to many other kinds of objects as well, as long as the class has a characteristic shape with significant edges.
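
For illustration, the sketch below computes a HOG feature vector with OpenCV's HOGDescriptor, using its default pedestrian detection parameters (64x128 window, 8x8 cells, 16x16 blocks, 9 bins); this is not the configuration used later in this project, and the file name is a placeholder.

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::Mat img = cv::imread("patch.png", cv::IMREAD_GRAYSCALE);
    cv::resize(img, img, cv::Size(64, 128));      // window size expected by the descriptor

    cv::HOGDescriptor hog;                        // defaults: 9 bins, 8x8 cells, 16x16 blocks
    std::vector<float> features;
    hog.compute(img, features);                   // concatenated, block-normalized histograms

    // features.size() == 3780 for the default parameters (105 blocks x 36 values).
}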

2.2.2.4 Learning models in computer vision

Learning model approaches can be grouped as well: [32, 48] find two main philosophies, generative and discriminative models.

Let us denote the test data (images) by xi, the corresponding labels by ci, and the feature descriptors (extracted from the data xi) by θi, where i = 1, ..., N and N is the number of inputs.

Generative models look for a representation of the image (or any other input data) by approximating it, keeping as much information as possible. Afterwards, given a test datum, they predict the probability that an instance x (with features θ) was generated, conditioned on class c [48]. In other terms, their aim is to produce a model in which the probability

P(x | θ, c) P(c)

is ideally 1 if x contains an instance of class c, and 0 if not.

The most famous examples of generative models are Principal Component Analysis (PCA, [60]) and Independent Component Analysis (ICA, [61]).

Discriminative models try to find the best decision boundaries for the given training data. As the name suggests, they try to discriminate the occurrence of one class from everything else [48]. They do this by defining the probability

P(c | θ, x)

which is expected to be ideally 1 if x contains an instance of class c, and 0 if not.

Note that the biggest difference is the direction of the mapping and of the information flow: discriminative models map images to class labels, while generative models use a mapping from labels to the images (or "observables" [59]). See figure 2.6 for a representation of the direction of the flow of information in both models.

Resulting from the approaches presented above (especially the probability models), discriminative methods usually perform better for single-class detection, while generative methods are more suitable for multi-class object detection (also called object recognition) [62]. Since the task described in this thesis requires only one class to be detected, two of the most well-known discriminative methods will be presented: adaptive boosting (AdaBoost, [63], see sub-subsection 2.2.2.5) and support vector machines (SVM, [64], see sub-subsection 2.2.2.6).

2.2.2.5 AdaBoost

AdaBoost (short for Adaptive Boosting) is a discriminative machine learning model from the class of boosting algorithms. It was first proposed by Yoav Freund and Robert Schapire in 1997 [63]. AdaBoost's aim is to overcome the increasing computational cost caused by a growing set of features (also called dimensions). For example, the number of possible Haar-like features (2.2.2.2) is over 160,000 in the case of a 24x24 pixel window (see subsection 4.3.2) [65]. Although Haar-like features can be extracted efficiently using the integral image, computing the full set would be very expensive.

To reduce the dimensionality of the machine learning problem, boosting algorithms try to combine several weaker classifiers to create a stronger one. Weak means that the output of the classifier only slightly correlates with the true labels of the data. Usually this is an iterative method.

A single stage of AdaBoost can be described as follows [45], [65]:

• Initialize N · T weights wt,i, where N is the number of training examples and T is the number of features in the stage.

• For t = 1, ..., T:

1. Normalize the weights.

2. Select the best classifier using only a single feature, by minimising the weighted detection error εt = Σi wi |h(xi, f, p, θ) − yi|, where h(xi) is the classifier output and yi is the correct label (both 0 for negative and 1 for positive samples).

3. Define ht(x) = h(x, ft, pt, θt), where ft, pt and θt are the minimizers of the error above.

4. Update the weights: wt+1,i = wt,i · (εt / (1 − εt))^(1−ei), where ei = 0 if example xi was classified correctly and ei = 1 otherwise.

• The final classifier for the stage is based on a weighted sum of the selected weak classifiers.

By repeating these steps, AdaBoost selects those features (and only those) which improve the prediction. Resulting from the way the weights are updated, subsequent classifiers minimise the error mostly for the inputs which were misclassified by the previous ones. A famous application of AdaBoost is the Viola-Jones object detection framework presented in 2001 [45] (also see sub-subsection 2.2.2.2).
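
The following sketch illustrates a single boosting round as described above, operating on precomputed 0/1 outputs of every candidate single-feature classifier; the data structures and names are illustrative and not taken from any library.

#include <vector>
#include <cmath>
#include <limits>

struct RoundResult { std::size_t best; double beta; };

// One AdaBoost round in the Viola-Jones style. "stumps" holds the 0/1 outputs of
// every candidate single-feature classifier on every sample; "labels" holds y_i.
RoundResult adaboost_round(const std::vector<std::vector<int>>& stumps, // [feature][sample]
                           const std::vector<int>& labels,              // y_i in {0, 1}
                           std::vector<double>& w)                      // sample weights
{
    // 1. Normalize the weights so they form a distribution.
    double sum = 0;
    for (double v : w) sum += v;
    for (double& v : w) v /= sum;

    // 2. Select the weak classifier with the lowest weighted error
    //    eps = sum_i w_i * |h(x_i) - y_i|.
    std::size_t best = 0;
    double best_eps = std::numeric_limits<double>::max();
    for (std::size_t f = 0; f < stumps.size(); ++f) {
        double eps = 0;
        for (std::size_t i = 0; i < labels.size(); ++i)
            eps += w[i] * std::abs(stumps[f][i] - labels[i]);
        if (eps < best_eps) { best_eps = eps; best = f; }
    }

    // 3./4. Update the weights: correctly classified samples (e_i = 0) are
    //       down-weighted by beta = eps / (1 - eps); misclassified ones keep
    //       their weight, so the next round focuses on them.
    double beta = best_eps / (1.0 - best_eps);
    for (std::size_t i = 0; i < labels.size(); ++i) {
        int e = (stumps[best][i] == labels[i]) ? 0 : 1;
        w[i] *= std::pow(beta, 1 - e);
    }
    return { best, beta };
}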

2.2.2.6 Support Vector Machine

The support vector machine (SVM, also called support vector network) is a machine learning algorithm for two-group classification problems, proposed by Vladimir N. Vapnik, Bernhard E. Boser and Isabelle M. Guyon [66]. Although the building blocks of the method had been known and used since the 1960s, they were not put together until 1992.

The idea of the support vector machine is to handle the sets of features extracted from the training and test data samples as vectors in the feature space. One or more hyperplanes (linear decision surfaces) are then constructed in this space to separate the classes.

The surface is constructed with two constraints. The first is to divide the space into two parts in such a way that no points from different classes remain on the same side (in other words, to separate the two classes perfectly). The second constraint is to keep the distance from the hyperplane to the nearest training sample of each class as large as possible (in other words, to define the largest possible margin). Creating a larger margin is proven to decrease the error of the classifier. The vectors closest to the hyperplane (which therefore determine the width of the margin) are called support vectors, hence the name of the method. See figure 2.7 for an example of a separable problem, the chosen hyperplane and the margin.

The algorithm was extended to not linearly separable tasks in 1995 by Vapnik and Corinna Cortes [64]. To create a non-linear classifier, the feature space is mapped to a much higher dimensional space in which the classes become linearly separable, so the method above can be used.
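
As a toy illustration (not the configuration used in this project), the following sketch trains a two-class SVM with Dlib, the library chosen later in this work, on made-up, linearly separable samples; swapping the linear kernel for a radial basis function kernel would give a non-linear classifier of the kind described above.

#include <dlib/svm.h>
#include <vector>

int main()
{
    typedef dlib::matrix<double, 2, 1> sample_type;
    typedef dlib::linear_kernel<sample_type> kernel_type;

    // Made-up, linearly separable training data with +1 / -1 labels.
    std::vector<sample_type> samples;
    std::vector<double> labels;
    for (int i = 0; i < 20; ++i) {
        sample_type s;
        s(0) = i;
        s(1) = i % 5;
        samples.push_back(s);
        labels.push_back(i < 10 ? +1.0 : -1.0);
    }

    dlib::svm_c_trainer<kernel_type> trainer;
    trainer.set_c(10);                                // soft-margin penalty
    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

    // df(x) > 0 classifies x into the +1 class; the sign of the margin decides.
}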

Figure 2.7: Example of a separable problem in 2D. The support vectors are marked with grey squares; they define the margin of the largest separation between the two classes. Source: [64]

Although support vector machines were designed for binary classification, several methods have been proposed to apply their outstanding performance to multi-class tasks. The most common approach is to combine binary classifiers to obtain a multi-class one. The binary classifiers can be of the "one against all" or "one against one" type. The former means that each class is examined against all the points of the other classes; the class whose classifier shows the greatest distance from the margin is chosen (in other terms, the test sample's vector is farther from the other classes' points than in any other case). The latter is based on comparisons of all possible pairings of the classifiers (each is compared with each); the class "winning" the most comparisons is chosen as the final decision. Good surveys on multi-class SVMs are [67] and [68].

A famous application of SVMs is the HOG descriptor based pedestrian detection presented by Navneet Dalal and Bill Triggs in 2005 [53].

Chapter 3

Development

In this chapter the circumstances of the development will be presented, which influenced the research and especially the objectives defined in Section 1.3. First, the available hardware resources (that is, the processing unit and the sensors) will be discussed, along with their limitations and constraints with respect to the defined objectives. Finally, the software libraries used will be presented: why they were chosen and what they are used for in this project.

3.1 Hardware resources

3.1.1 Nitrogen board

The Nitrogen board (officially called Nitrogen6x) is the chosen on-board processing unit of the UAV. It is a highly integrated single board computer using an ARM Cortex-A9 processor [69].

The Nitrogen board will be responsible for managing the on-board sensors and collecting the recorded data. After some preprocessing, part of the data will be transmitted to the ground robot and to the ground station; this communication is also handled by this unit. Fortunately, the Robotic Operating System (subsection 3.2.2) is compatible with the unit, which will make this task easier.

The detection method introduced in this thesis may be ported to this board later, depending on the available processing units and the achieved communication bandwidth between the vehicles and the ground station.

3.1.2 Sensors

This subsection covers the sensors integrated on board the UAV, including the flight controller, the camera and the laser range finder.

Figure 3.1: Image of the Pixhawk flight controller. Source: [70]

3.1.2.1 Pixhawk autopilot

Pixhawk is an open-source, well supported autopilot system manufactured by the 3DRobotics company. It is the chosen flight controller of the project's UAV. In figure 3.1 the unit is shown with optional accessories.

Pixhawk will also be used as an Inertial Measurement Unit (IMU) for building 3D maps. This means less weight (since no additional sensor is needed) and brings integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Therefore it was suitable for mapping purposes. See 4.4 for details of its application.

3.1.2.2 Camera

Cameras are optical sensors capable of recording images. They are the most frequently used sensors for object detection, due to the fact that they are easy to integrate and provide a significant amount of data.

Another big advantage of these sensors is that they are available in several sizes, with different resolutions and other features, for a relatively low price.

Nowadays consumer UAVs are often equipped with some kind of action camera, usually to provide an aerial video platform. These sensors are suitable for the aims of this project as well, thanks to their light weight and wide-angle field of view.

Figure 3.2: The chosen lidar sensor, a Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy. Source: [71]

3.1.2.3 Lidar

The other most important sensor used in this project is the lidar. These sensors are designed for remote sensing and distance measurement. Their basic working principle is to illuminate objects with a laser beam; the reflected light is analysed and the distances are determined.

Several versions are available with different ranges, accuracies, sizes and other features. The scanner chosen for this project is the Hokuyo UTM-30LX. It is a 2D range scanner with a maximum range of 60 m and 3-5 cm accuracy [71]. These features, combined with its compact size (62 mm x 62 mm x 87.5 mm, 210 g [71]), make it an ideal scanner for autonomous vehicles, for example UAVs.

Generally speaking, lidars are much more expensive sensors than cameras. However, the depth information they provide is very valuable for indoor flights.

3.2 Chosen software

3.2.1 Matlab

MATLAB is a high-level, interpreted programming language and environment designed for numerical computation. It is a very useful tool for engineers and scientists alike, as thousands of functions from different fields of science are implemented and included in the software. Excellent visualization tools are also ready to use, which are necessary to conveniently debug and develop image processing methods or 3D map construction and visualization. For the reasons above, developing and testing 2D or 3D image processing algorithms in Matlab is extremely convenient. On the other hand, Matlab is Java based, therefore most non-mathematical calculations are somewhat slow compared to a C++ environment.

3.2.2 Robotic Operating System (ROS)

The Robotic Operating System is an extensive open-source operating system and framework designed for robotic development. It is equipped with several tools intended to help the development of unmanned vehicles, such as simulation and visualization software and an excellent general message passing mechanism. This latter property is a suitable solution to connect the UAV, the UGV and the ground control station of the project.

ROS also contains several packages developed to handle sensors. For example, both the chosen lidar sensor (3.1.2.3) and the flight controller (3.1.2.1) have ready-to-use drivers available.

3.2.3 OpenCV

OpenCV is probably the most popular open-source image processing library, containing a wide range of functions such as basic image manipulations (e.g. load, write, resize, rotate), image processing tools (e.g. different kinds of edge detection and threshold methods) and advanced machine learning based solutions (e.g. feature extractors and training tools) [72]. Due to its high popularity, many examples and good support are available.

In this project it was used to handle the inputs (either from a file or from a camera attached to the laptop), since Dlib, the main library chosen, has no functions to read inputs like that.

3.2.4 Dlib

Dlib is a general purpose, cross-platform C++ library. It has an excellent machine learning component, with the aim of providing a machine learning software development toolkit for the C++ language (see figure 3.3 for an architecture overview). Dlib is completely open-source and is intended to be useful for both scientists and engineers.

Dlib also has many computer vision related parts, including ready-to-use feature extractors, training tools and object tracking solutions, which make it

Figure 3.3: Elements of Dlib's machine learning toolkit. Source: [73]

easy to rapidly develop and test custom trained object detectors. Another recently added feature is an easy-to-use 3D point cloud visualization function. As a result, Dlib was a reasonable choice for this project, since its collection of machine learning, image processing and 3D point cloud handling functions made the development significantly easier.

It is worth mentioning that, aside from the features above, Dlib also contains components for linear algebra, threading, networking, IO, graph visualisation and management, and numerous other tasks, which makes it one of the most versatile C++ libraries. Though it is not as popular and well-known as OpenCV, it is extremely well documented, supported and updated regularly. As a result, its popularity is increasing. For more information see [74] and [73].

Chapter 4

Designing and implementing the algorithm

In this chapter the challenges related to the topic of this thesis are listed. After considering these, the architecture of the designed algorithm is introduced, with detailed explanations of the important parts. The connection between the 2D and 3D image inputs is defined. Afterwards, the base of the detection algorithm is discussed in detail, along with all the aiding systems and inputs. Finally, the 3D imaging experiments and set-ups are introduced, with example images and the mathematical calculations.

4.1 Challenges in the task

The biggest challenges of the task are listed below.

1. Unique object: Maybe the biggest challenge of this task is the fact that the robot itself is a completely unique ground vehicle, designed as part of the project. Thus no ready-to-use solutions are available which could partly or completely solve the problem. For frequently researched object recognition tasks, like face detection, several open-source codes, examples and articles are available. The closest topic to ground robot detection is car recognition, but those vehicles are too different to use any ready-made detectors. Thus the solution of this task has to be completely new and self-made.

2. Limited resources: Since the objective of this thesis is detection executed on board a UAV, the limitations of this platform have to be considered. This means that both the dimensions and the weight of the on-board sensors and computers are significantly limited. Fortunately, cameras nowadays have very compact sizes with impressive resolutions, thus even smaller aerial vehicles can carry one and record high quality videos. The lidar used (3.1.2.3) is heavier and larger, but still possible to carry. The on-board computer had to be chosen similarly: small and light hardware was needed. While the Nitrogen6x board (3.1.1) fulfils these requirements, its processing power is not comparable to a laptop or desktop PC, which has to be considered when designing and assigning the processing methods.

3. Moving platform: No matter what kind of sensor is used for an object detection task, movement and vibration of the sensor usually make it more challenging. Unfortunately, both are present on a UAV platform. This causes problems because it changes the perspective and the viewpoint of the object. It also adds noise and blur to the recordings. Finally, no background-foreground separation algorithms are available for a moving camera. See subsection 2.2.1.1 for details.

4. Various viewpoints of the object: Usually in the field of computer vision the viewpoint of the sought object is more or less defined. For example, face detectors are usually applied to images where people are facing the camera. However, resulting from the UAV platform, the sensors will gain or lose height rapidly, causing drastic perspective changes. Also, there are no constraints on the relative orientation of the two vehicles, which means that the robot should be recognized from 360° and from above.

5. Various sizes of the object: Similarly to the previous point (4), the movement of the camera will cause significant changes in the scale of the object on the input image. In other words, the algorithm has to detect the ground robot from a great distance (when it occupies a small part of the input) and also during landing, from relatively close (when it seems huge and almost fills the entire image).

6. Speed requirements: The algorithm's purpose is to support the landing manoeuvre, which is a crucial part of the mission. The most critical part is the touch-down itself. This (contrary to the other parts of the flight) requires real-time and very accurate position feedback from the sensors and the algorithms, which is provided by another algorithm specially designed for that task. Thus the software discussed here does not have hard real-time requirements. On the other hand, the system still needs continuous reports on the estimated position of the robot, at a rate of at least 4 Hz. This means that every chosen method should execute fast enough to make this possible.

Figure 4.1: Diagram of the designed architecture, presenting the connections between the 2D and 3D sensors, the detection algorithms and the other parts of the system. Arrows represent the dependencies and the direction of the information flow. (Blocks of the diagram: sensors (camera, 2D lidar), video reader, current frame, trainer with the Vatic annotation server, front and side SVMs, detector algorithm, regions of interest and other preprocessing (edge/colour detection), 3D map, tracking, detections, evaluation.)

4.2 Architecture of the detection system

In the previous chapters the project (1.1) and the objectives of this thesis (1.3) were introduced. The advantages and disadvantages of the different available sensors were presented (3.1.2). Some of the most often used feature extraction and classification methods (2.2.2) were examined with respect to their suitability for the project. Finally, in section 4.1 the challenges of the task were discussed.

After considering the aspects above, a modular, complete architecture was designed which is suitable for the defined objectives. See figure 4.1 for an overview of the design. The main idea of the structure is to compensate the disadvantages of every module with another part connected to it. In some cases this principle brings redundancy into the system; on the other hand, it provides a more robust detection system for the robot. The following enumeration lists the main parts of the architecture; every module already implemented will be discussed later in this chapter in detail.

1. Training algorithm: As explained in subsection 2.2.2, object detection methods using machine learning are very popular and superior in efficiency. To apply such an algorithm to a task, first a learning (also called training or teaching) period is needed, during which the software extracts features from the training images and trains a classifier to separate the two or more classes. (Note that this learning step can be skipped if the algorithm learns "on the fly" or pre-trained detectors are used; neither is the case in this project.) As mentioned above, training usually happens off-line, before the detections. Therefore it is not required to be included in the final release. However, it is an inevitable part of the system, since it produces the classifiers for the detector. This production has to be reproducible, compatible with the other parts and effective. To make the development more convenient, the training software is completely separated from the testing. See subsection 4.3.1 for details.

2. Sensor inputs: Sensors are crucial for every autonomous vehicle, since they provide the essential information about the surrounding environment. In this project a lidar and a camera were used as the main sensors for the task (along with several others needed to stabilize the vehicle itself, see subsection 3.1.2). The designed system relies more on the camera for now, since the 3D imaging is in its experimental state. However, both sensors are already handled in the architecture with the help of the Robotic Operating System (3.2.2). This unified interface will make it possible to integrate further sensors (e.g. infra-red cameras, ultrasound sensors, etc.) later with ease.

Resulting from the scope of the project, the vehicle itself is not ready yet, thus live recordings and testing were not possible during the development. Therefore the system is able to manage images both from a camera and from a video file. The latter is considered a sensor module as well, simulating real input from the sensor (the other parts of the system cannot tell that it is not a live video feed from a camera).

3. Region of interest estimators and tracking: Recognized as two of the most important challenges (see 4.1), speed and the limitations of the available resources were a great concern. Thus methods were required which could reduce the number of areas to process and so increase the speed of the detection. Resulting from the complex structure of the project, this can be done based on either the camera, the lidar, or both. The idea of the architecture is to give a common interface (creating a type of module) to these methods, making their integration easy. The most general and sophisticated approach will be based on the continuously built and updated 3D map. That is, if both the aerial and the ground vehicle are located in this map, accurate estimations can be made of the current position of the robot, excluding (3D) areas where it is unnecessary to search for the unmanned ground vehicle (UGV). In other words, once the real 3D position of the robot is known, its next one can be estimated based on its maximal velocity and the elapsed time; theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. As a less sophisticated but faster alternative, the currently seen 3D sub-map (the point cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so that corresponding areas can be found. Unfortunately, real-time 3D map building is not yet implemented, thus these methods have yet to be developed. On the other hand, several approaches to find regions of interest based on 2D images were implemented; all of them will be presented in section 4.3.

4. Evaluation: To make the development easier and the results more objective, an evaluation system was needed. This helps to determine whether a given modification of the algorithm actually made it "better" and/or faster than other versions. This part of the architecture is only loosely connected to the whole system (similarly to training, point 1) and is not required to be included in the final release, since it was designed for development and debugging purposes. However, it had to be developed in parallel with every other part, to make sure it is compatible and up to date; thus the evaluation had to be mentioned in this list. For more details of the evaluation please see 5.1.1.

5. Detector: The detector module is the core algorithm which connects all the other parts listed above. Receiving the input from the camera sensor (point 2), using the classifiers trained by the training algorithm (point 1) and considering the information provided by the ROI estimators (point 3), it executes the detection. Afterwards the detector is able to pass on the coordinates of the detected objects, so the evaluation module (point 4) can measure the efficiency.

4.3 2D image processing methods

In this section the 2D camera image processing concerns, methods and the process of the development will be presented.

4.3.1 Chosen methods and the training algorithm

Before the design and implementation of the algorithm started, several object detection methods were reviewed in section 2.2. In section 3.2 the image processing libraries and toolkits used were presented.

While these two topics seem unrelated, they have to be considered together, since a good software library with image handling or even machine learning algorithms already implemented can save a lot of time during the development.

Considering the reviewed methods, the Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) was chosen as the main feature descriptor. The reasons for this choice are discussed here.

Haar-like features are suitable for finding patterns of combined bright and dark patches. While the ground robot's main colours are black (wheels) and white (the main body of the vehicle), their proportion is not ideal, since the brighter parts are more significant. Variance in the relative positions of these patches (e.g. different view angles) can confuse the detector as well. Also, using OpenCV's (3.2.3) training algorithm, the learning process was significantly slower than with the final training software (presented later here), making it very hard to develop and test.

SIFT descriptors (see sub-subsection 2.2.2.1) have the advantage of rotation invariance, which is very desirable. Since they assume that the relative positions of the key-points will not change on the inputs, they are more suitable for recognising a concrete object instead of a general class. This is not an issue in this case, since only one ground robot is present, with known properties (in other words, no generalization is needed). However, the relative positions of the key-points can change not only because of inter-class differences (another instance of the class "looks" different), but also because of moving parts, like rotating and steering wheels. Also, SIFT tends to perform worse under changing illumination conditions, which is a drawback in the case of this project, since both vehicles are planned to transfer between premises without any a-priori information about the lighting.

The Histogram of Oriented Gradients (HOG, see sub-subsection 2.2.2.3) seemed to be an appropriate choice, as it performs well on objects with strong characteristic edges. Since the UGV has a cuboid-shaped body and relatively big wheels, it fulfils this condition. Also, HOG is normalized locally with respect to the image contrast, which results in robustness against illumination changes. Thus it is expected to perform better than the other methods in case of a change in the lighting (like moving to another room or even outside).

Certainly, HOG has its disadvantages as well. First, it is computationally expensive, especially compared to the Haar-like features. However, with a proper implementation (e.g. a good choice of library) and the use of ROI detectors (see point 3), this issue can be solved. Secondly, HOG, in contrast to SIFT, is not rotation invariant. In practice, on the other hand, it can handle a significant rotation of the object. Also, in the final version of the project the orientation of the camera will be stabilized, either in hardware (with a gimbal holding the camera level) or in software (rotating the image based on the orientation sensor). Therefore the input image of the detector system is expected to be level. Unfortunately, due to the aerial vehicle used in this project, different viewpoints are expected (see point 4 in section 4.1). This means that even if the camera is level, the object itself could appear rotated, caused by the perspective. Solutions to overcome this issue will be presented in this subsection and in 4.3.4.

Although the method of HOG feature extraction is well described and documented in [53] and is not exceedingly complicated, the choice of optimal parameters and the optimization of the computational cost are time-consuming and complex tasks. To make the development faster, an appropriate library was sought for the extraction of HOG features. Finally the Dlib library (3.2.4) was chosen, since it includes an excellent, seriously optimized HOG feature extractor, along with training tools (discussed in detail later) and tracking features (see subsection 4.3.4). Also, several examples and documentation pages are available for both training and detection, which simplified the implementation.

As a classifier, support vector machines (SVM, see sub-subsection 2.2.2.6) were chosen. This is because nowadays SVMs are widely used in computer vision with great results (Dalal used them as well in [53]), and for two-class detection problems (like this one) they outperform the other popular methods. Also, an SVM is implemented in Dlib as a ready-to-use solution [75]. SVMs (and almost everything in Dlib) can be saved and loaded by serializing and de-serializing them, which makes the transportation of classifiers relatively easy. Dlib's SVMs are compatible with the HOG features extracted with the same library. The combination of the two methods within Dlib has resulted in some very impressive detectors for speed limit signs or faces. Furthermore, the face detector mentioned outperformed the Haar-like feature based face detector included in OpenCV [74]. After studying these examples, the combination of Dlib's SVM and HOG extractor was chosen as the base of the detection.

The implemented training software itself was written in C++ and can be executed from the command line. It needs an XML file as input, which includes a list of the training images with the annotations of the objects (coordinates in pixels). After importing the positive samples, the negative samples are "generated" from the training images as well, cut from areas where no annotated object is present. Note that this feature means the object has to be annotated carefully, since a positive sample might otherwise get into the negative training images.

After the classifier (also called the detector) is produced, the training software tests it on an image database defined by a similar XML file. This is a very convenient feature, since major errors are discovered already during the training process. If the first tests do not show any unexpected behaviour, the finished SVM is saved to the disk with serialization.
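
For illustration, the sketch below follows Dlib's documented HOG plus structural SVM training flow, which the training software described here builds on; it is not the project's actual code, and the detection window size, the C value and the file names are assumptions.

#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

int main()
{
    using namespace dlib;

    // Training images and annotated object boxes, loaded from an XML file in the
    // format described above ("training.xml" is a placeholder name).
    dlib::array<array2d<unsigned char>> images;
    std::vector<std::vector<rectangle>> boxes;
    load_image_dataset(images, boxes, "training.xml");

    // HOG feature pyramid scanned by a sliding window; the 80x80 detection
    // window is an assumed value, not the one used in the project.
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);

    // Structural SVM trainer; regions without annotation are mined as negatives.
    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);                 // SVM regularisation parameter
    trainer.be_verbose();

    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Serialize the finished detector so the detection software can load it later.
    serialize("groundrobot_side.svm") << detector;
}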

Due to the fact that the robot looks completely different from different viewpoints, one classifier was not sufficient, because it could only match one typical "shape" of the object. Similarly, the face detector mentioned above does not work for people photographed from a side view. Therefore multiple classifiers were considered for the task.

Two SVMs were trained to overcome this issue: one for the front view and one for the side view of the ground vehicle. This approach has two significant advantages.

First, the robot is symmetrical both horizontally and vertically. In other terms, the vehicle looks almost exactly the same from the left and from the right, while the front and rear views are also very similar. This is especially true if the working principle of the HOG (2.2.2.3) is considered, since the shape of the robot (which defines the edges and gradients detected by the HOG descriptor) is identical viewed from left and right, or from front and rear. This property of the UGV means that the same classifier will recognise it from the opposite direction too; therefore two classifiers are enough for all four sides. This is not a trivial simplification, since cars, for example, look completely different from the front and the rear, and while their side views are similar, they are mirrored.

The second advantage of these classifiers is very important for the future of the project. Due to the way the SVMs are trained, they not only detect the position of the robot, but, depending on which one of them detects it, the system gains information about the orientation of the vehicle. This is crucial, since it can help predict the next position of the ground robot (see 4.3.4). Even more importantly, it helps the UAV to land on the UGV with the correct orientation (facing the right way), so that the charging outputs will be connected properly.

Another important aspect of the training is how to choose the training images. For this, pictures were selected where the robot is captured directly from the side or from the front, with the camera almost at the level of the ground. Anything aside from the side/front of the robot (the top, for example) was hidden or cropped. This method eliminated all the distortions found on pictures taken from above, which would decrease the efficiency. Although the camera attached to the UAV will seldom cruise at this altitude (otherwise it would touch the ground), the HOG descriptor is robust enough to find the patterns even viewed from above or sideways. Although teaching an SVM for diagonal views as well seemed like a logical idea, it did not improve the detections. The reason for this is the fact that while side and front views are well defined, it is hard to frame and train on a diagonal view.

In figure 4.2 a comparison of training images and the produced HOG detectors is shown. 4.2(a) is a typical training image for the side-view HOG detector; notice that only the side of the robot is visible, since the camera was held very low. 4.2(b) is the visualized final side-view detector; both the wheels and the body are easy to recognize, thanks to their strong edges. In 4.2(c) a training image is displayed from the front-view detector training image set. 4.2(d) shows the visualized final front-view detector; notice the tangential lines along the wheels and the characteristic top edge of the body.

Please note that if the current two classifiers prove to be insufficient, training additional ones is easy to do with the training software presented here. Including them in the detector is simple as well, since it was designed to be adaptive. Please see subsection 4.3.5 for more details.

4.3.2 Sliding window method

A very important property of these trained methods (2.2.2) is that usually their training datasets (the images of the robot or any other object) are cropped, containing only the object and some margin around it. Resulting from this, a detector trained this way will only be able to work on input images which contain the object in a similarly cropped way. Certainly, such an image is not the usual input for an application, since if the input contained only the robot, there would be no need for localization algorithms. In other words, the input image is usually much larger than the cropped training images and may contain other kinds of objects as well. Thus a method is required which provides correctly sized input images for the

Figure 4.2: (a) A typical training image for the side-view HOG detector. (b) The final side-view HOG descriptor, visualized. (c) A training image from the front-view HOG descriptor training image set. (d) The final front-view detector, visualized. Notice the strong lines around the typical edges of the training images.

Figure 4.3: Representation of the sliding window method.

detector, cropped and resized from the original, larger input. This can be achieved in several ways, although the most common is the so-called sliding window method. This algorithm takes the large image and slides a window across it with a predefined step size, at several scales. This "window" behaves like a mask, cropping out smaller parts of the image. With an appropriate choice of these parameters (enough scales and small step sizes), eventually every robot (or other desired object) will be cropped and handed to the trained detector. See figure 4.3 for a representation.

It is worth mentioning that multiple instances of the sought object may be present on the image; for face or pedestrian detection algorithms, for example, this is a common situation. Since only one ground robot has been built yet, this is not likely to happen in this project (although reflections of the robot might occur, which are not handled explicitly). However, if the system is expanded in the future, multiple UGVs on the same input image will not cause a problem for the designed algorithm.

4.3.3 Pre-filtering

As listed in section 4.1, two of the most important challenges are speed and the limitations of the available resources. In subsection 4.3.2 the sliding window concept was introduced, which makes it possible to cover every potential location of the sought object. On the other hand, this means a lot of positions which are absolutely unnecessary to check with the HOG detector, since it is very unlikely that the robot is there (e.g. positions above the ground, or homogeneous surfaces like the wall or the clearly empty ground). These areas could be excluded from the search with cheaper algorithms.

Thus methods were required which could reduce the number of areas to process and so increase the speed of the detection. In other words, these algorithms find regions of interest (ROI, see point 3) in which the HOG (or other) computationally heavy detectors can be executed in a shorter time than scanning the whole input image. The algorithms have to be implemented with the right interface to keep the modular structure of the architecture. This practice also makes it possible to swap the currently used ROI detector module for another or a newly developed one during further development.

Based on intuition, a good separating feature is colour: anything which does not have the same colour as the ground vehicle should be ignored. To do this, however, is rather complicated, since the observed "colour" of the object (and of anything else in the scene) depends on the lighting, the exposure settings and the dynamic range of the camera. Thus it is very hard to define a colour (and a region of colours around it) which could be a good basis of segmentation on every input video. Also, one of the test environments used (the laboratory), and many other premises reviewed, has a bright floor which is surprisingly similar to the colour of the robot. This made the filtering virtually useless.

Another good hint for interesting areas can be the edges. In this project this seems like a logical approach, since the chosen feature descriptor operates on gradient images, which are strongly connected to the edge map. In figure 4.4 the result of an edge detection on a typical input image of the system is presented. Note how the ground robot has significantly more edges than its environment, especially the ground. Filtering based on edges often eliminates the majority of the ground. On the other hand, chairs and other objects have a lot of edges as well; thus those areas are still scanned.
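
One possible form of such an edge-based pre-filter is sketched below using OpenCV's Canny detector; the thresholds and the minimum edge ratio are illustrative values rather than tuned project parameters.

#include <opencv2/opencv.hpp>

// Regions with very few edges (empty floor, walls) are unlikely to contain the
// robot, so they can be skipped before the expensive HOG scan.
bool region_worth_scanning(const cv::Mat& gray_roi,
                           double min_edge_ratio = 0.02)   // assumed tuning value
{
    cv::Mat edges;
    cv::Canny(gray_roi, edges, 80, 160);                   // assumed Canny thresholds
    double edge_ratio = cv::countNonZero(edges) /
                        static_cast<double>(edges.total());
    return edge_ratio >= min_edge_ratio;                   // skip nearly edge-free areas
}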

Another idea is to filter the image by the detections themselves on the previous frames. In other terms, the position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. Please note that this distance is in image coordinates, thus the possible movement of the camera is involved as well. In practice this distance was set as a parameter after experiments. See subsection 4.3.5 for details.

4.3.4 Tracking

In computer vision, tracking means the following of a (moving) object over time using the camera image input. This is usually done by storing information (like appearance, previous location, speed and trajectory) about the object and passing it on between frames.

In the previous subsection (4.3.3) the idea of filtering by prior detections was

Figure 4.4: The result of an edge detection on a typical input image of the system. Please note how the ground robot has significantly more edges than its environment; on the other hand, chairs and other objects have a lot of edges as well.

presented. Tracking is the extension of this concept: if the position of the robot is already known, there is no need to find it again; it is enough to follow the found object(s). This makes a huge difference in calculation costs, since the algorithm does not look for an object described by a general model on the whole image, but searches for the "same set of pixels" in a significantly reduced region. Certainly, tracking itself is not enough to fulfil the requirements of the task, hence some kind of combination of "traditional" classifiers and tracking was needed.

Tracking algorithms are widely researched and have their own extensive literature. Considering the complexity of implementing a custom 2D tracking algorithm, it was decided to choose and include a ready-to-use solution. Fortunately, Dlib (3.2.4) has an excellent object tracker included as well. The code is based on the very recent research presented in [76]. The method was evaluated extensively, and it outperformed state-of-the-art methods while being computationally efficient. Aside from its outstanding performance, it is easy to include in the code once Dlib is installed. The initializing parameter of the tracker is a bounding box around the area to be followed; since the classifiers return bounding boxes, those can be used as input for the tracking. See subsection 4.3.5 for an overview of the implementation. An interesting fact is that the tracker also uses HOG descriptors as part of the algorithm, which shows the versatility of HOG and confirms its suitability for this task.
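
The sketch below shows the basic usage pattern of Dlib's correlation tracker: it is initialized with a bounding box and updated frame by frame; the video file name and the detection rectangle are made-up values.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("testVideo1.avi");
    cv::Mat frame;
    cap >> frame;

    dlib::cv_image<dlib::bgr_pixel> img(frame);
    dlib::rectangle detection(100, 100, 180, 160);     // box returned by a classifier

    dlib::correlation_tracker tracker;
    tracker.start_track(img, detection);               // initialize on the detection

    while (cap.read(frame)) {
        dlib::cv_image<dlib::bgr_pixel> next(frame);
        double confidence = tracker.update(next);      // follow the same image patch
        dlib::drectangle pos = tracker.get_position(); // current estimate of the object
        (void)confidence; (void)pos;                    // e.g. feed pos back as the ROI
    }
}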

The final system of this project will have the advantage of a 3D map of its surroundings. This will allow tracking of objects in 3D, relative to the environment. The UAV will have an estimate of the UGV's position even if it is out of the frame, since its position will be mapped to the 3D map as well. Unfortunately, the real-time mapping is not ready yet, thus this feature has yet to be implemented.

4.3.5 Implemented detector

In subsection 4.3.1 the advantages and disadvantages of the chosen object recognition methods were considered and the implemented training algorithm was introduced. Then different approaches to speed up the process were presented in subsections 4.3.2, 4.3.3 and 4.3.4.

After considering the points above, a detector algorithm was implemented with two aims. First, to provide an easy-to-use development and testing tool with which different methods and their combinations (e.g. using two SVMs with ROI) can be tried without having to change and rebuild the code itself. Second, after the optimal combination is chosen, this software will be the base of the final detector used in the project.

The input of the software is a camera feed or a video, which is read frame by frame. The output of the algorithm is one or more rectangles bounding the areas believed to contain the sought object. These are often called detection boxes, hits, or simply detections.

During the development four different approaches were implemented. Each of them is based on the previous ones (incremental improvements), but each brought a new idea to the detector which made it faster (reaching higher frame rates, see 5.1.2), more accurate (see 5.1.1), or both. The methods are listed below:

1. Mode 1: sliding window with all the classifiers

2. Mode 2: sliding window with intelligent choice of classifier

3. Mode 3: intelligent choice of classifiers and ROIs

4. Mode 4: tracking based approach

They will all be presented in the next four subsections. Since they are modifications of the same code rather than separate solutions, they were not implemented as separate programs. Instead, all were included in the same code, which decides at run time which mode to execute, based on a parameter.

Aside from changing between the implemented modes, the detector is able to load as many SVMs as needed. Their usage differs in every mode and will be presented later.

The software can display and/or save the video frames with all the detections marked, measure and export the processing time for every frame, save the detections to data files for evaluation, or even process the same video input several times.

Table 4.1: The available parameters.

Name                 Valid values       Function
input                path to video      Video used as input for detection
svm                  path(s) to SVMs    These SVMs will be used
mode                 1, 2, 3, 4         Selects which mode is used
saveFrames           0/1                Turns on video frame export
saveDetections       0/1                Turns on detection box export
saveFPS              0/1                Turns on frame-rate measurement
displayVideo         0/1                Turns on video display
DetectionsFileName   string             Sets the filename for saved detections
FramesFolderName     string             Sets the folder name used for saving video frames
numberOfLoops        integer (> 0)      Sets how many times the video is looped

To make the development more convenient, these features can be switched on or off via parameters, which are imported from an input file every time the detector is started. There are also options to change the name of the file where the detections are exported and the name of the folder where the video frames are saved.

Table 4.1 summarises all the implemented parameters, with the possible values and a short description.

Note that all these parameters have default values, thus it is possible, but not compulsory, to include every one of them. The text shown below is an example input parameter file.

Example parameter file of the detector algorithm
input testVideo1.avi
list all the detectors you want to use
svm groundrobotfront.svm groundrobotside.svm
saveDetections 0
saveFrames 0
displayVideo 0
numberOfLoops 100
mode 2
saveFPS 1

Given this input file, the software would execute detection on testVideo1.avi (input) one hundred times (numberOfLoops), using the groundrobotfront.svm and groundrobotside.svm classifiers (svm). It would neither save the detections (saveDetections) nor the video frames (saveFrames), and the video is not displayed (displayVideo). However, the processing time is saved for every frame (saveFPS). Note that the parameter numberOfLoops is really useful for simulating longer inputs. See figure 4.5 for a screenshot of the interface after a parameter file is loaded.

Figure 4.5: Example of the detector's user interface after loading the parameters. The software can be started from the command line with a parameter file input, which defines the detection mode and the purpose of the execution (producing video, efficiency statistics, or measuring the processing frame rate).

4.3.5.1 Mode 1: Sliding window with all the classifiers

Mode 1 is the simplest approach of the four. After loading the two classifiers trained by the training software (front-view and side-view, see 4.3.4), both "slide" across the input image. Each of them returns a vector of rectangles where it found the sought object. These are handled as detections (saved, displayed, etc.). There are no filters implemented for the maximum number of detections, overlapping, allowed position, etc.

Please note that although currently two SVMs are used, this and every other method can handle fewer or more of them. For example, given three classifiers, mode 1 loads and slides all three.

As can be seen, mode 1 is rather simple: it checks every possible position with all the classifiers on every frame, at multiple scales. This means that it will not miss the object (assuming that one of the classifiers would recognize the object if a cropped image of it were given as input). On the other hand, this extensive search is computationally very heavy, especially with two classifiers. This results in the lowest frame per second rate of all the methods.
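
A rough sketch of the core loop of mode 1 is given below, assuming detectors serialized as in the training sketch of subsection 4.3.1; the file and variable names are illustrative.

#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;
typedef dlib::object_detector<image_scanner_type> detector_type;

// Runs every loaded classifier on one frame and collects all detection boxes.
std::vector<dlib::rectangle> detect_all(const cv::Mat& frame,
                                        std::vector<detector_type>& detectors)
{
    dlib::cv_image<dlib::bgr_pixel> img(frame);   // wrap the OpenCV frame for Dlib
    std::vector<dlib::rectangle> hits;
    for (detector_type& det : detectors) {        // slide every loaded classifier
        std::vector<dlib::rectangle> d = det(img);
        hits.insert(hits.end(), d.begin(), d.end());
    }
    return hits;
}

int main()
{
    std::vector<detector_type> detectors(2);
    dlib::deserialize("groundrobotfront.svm") >> detectors[0];
    dlib::deserialize("groundrobotside.svm") >> detectors[1];

    cv::VideoCapture cap("testVideo1.avi");
    cv::Mat frame;
    while (cap.read(frame)) {
        std::vector<dlib::rectangle> hits = detect_all(frame, detectors);
        (void)hits;                               // saved / displayed / evaluated
    }
}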

4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier

As mentioned in the previous subsection, mode 1 checks every possible position with all the classifiers on every frame. This is not efficient, because of the way the two classifiers were trained: as figure 4.2 shows, one of them represents the front/rear view, while the other one was trained for the side view of the robot.

The idea of mode 2 is based on the assumption that it is very rare, and usually unnecessary, for these two classifiers to detect something at once, since the robot is viewed either from the front or from the side. Certainly, looking at it diagonally will show both mentioned sides, but resulting from the perspective distortion these detectors will not recognize both (and that is not needed either, since there is no reason to find the robot twice).

Therefore it seemed logical to implement a basic memory in the code, which stores which one of the detectors found the object last time. After that, only that one is used for the next frames. If it succeeds in finding the object again, it remains the chosen detector. If not, the algorithm starts counting the sequential frames on which it was unable to find anything. If this count exceeds a predefined tolerance limit (that is, the number of frames tolerated without detection), all of the classifiers are used again, until one of them finds the object. In every other aspect mode 2 is very similar to mode 1, presented above.

The limit was introduced because it is very unlikely that the viewpoint changes so drastically between two frames that the other detector would have a better chance to detect. A couple of frames without detection is more likely the result of some kind of image artefact, like motion blur, a change in exposure, etc.

This modification of the original code resulted in much faster processing, since for a significant amount of the time only one classifier was used instead of two. Fortunately, after finding an optimal limit, the efficiency of the code remained the same.
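
The classifier memory of mode 2 can be sketched roughly as follows; the structure, its member names and the tolerance value are illustrative and not taken from the project's code.

#include <vector>

// Remembers which classifier produced the last detection. Only that classifier
// is slid across the next frames; after too many consecutive misses, all
// classifiers are used again.
struct ClassifierMemory {
    int active = -1;          // index of the last successful classifier, -1 = none
    int missed = 0;           // consecutive frames without a detection
    int tolerance = 5;        // assumed tolerance limit

    std::vector<int> to_run(int num_classifiers) const {
        std::vector<int> ids;
        if (active >= 0 && missed <= tolerance) ids.push_back(active);
        else for (int i = 0; i < num_classifiers; ++i) ids.push_back(i);
        return ids;
    }
    void report(int detecting_classifier) {        // -1 if nothing was found
        if (detecting_classifier >= 0) { active = detecting_classifier; missed = 0; }
        else ++missed;
    }
};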

4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs

While mode 2 brought a significant improvement to the detector's speed, it did not fulfil the requirements of the project.

In subsection 4.3.3 the idea of reducing the searched image area was introduced as a way to increase the detection speed. The position of the robot on the previous input image is an excellent estimate of the vehicle's position on the current one. Theoretically it is enough to search for the robot around its last known position, within a radius of the maximum distance the object can move during the time between frames. However, this is not an obvious calculation, since the distance is in image coordinates, hence it is not trivial to use the actual speed of the robot. Also, the possible movement of the camera is involved as well: for example, even if the UGV holds its position, the rotating camera of the UAV will capture it "moving" across the input image.

Instead of estimating the distance mentioned above, a simpler approach was implemented. The memory introduced in mode 2 was extended to store the position of the detection, besides the detector which returned it. A new rectangle, named ROI (region of interest), was introduced, which determines in which area the detectors should search.

Figure 4.6: An example output of mode 3. The green rectangle marks the region of interest; the blue rectangle is the detection returned by the side-view detector. Mode 3 tries to estimate a ROI based on the previous detections and searches for the robot only in that region. Note the ratio between the ROI and the full size of the image.

The rectangle is initialized with the size of the full image (in other terms, the whole image is included in the ROI). Then, every time a detection occurs, the ROI is set to that rectangle, since that is the last known position of the robot. However, this rectangle has to be enlarged, for the reasons mentioned above (movement of the camera and of the object). Therefore the ROI is grown by a percentage of its original size (set by a variable, 50% by default), and every detector searches in this area. Note that the same intelligent selection of classifiers which was described for mode 2 is included here as well.

In the unfortunate case that none of the detectors returns a detection inside the ROI, the region of interest is enlarged again, by a smaller percentage of its original size (set by a variable, 3% by default). Note that if the ROI reaches the size of the image in any direction, it will not grow in that direction any more. Eventually, after a series of missing detections, the ROI will reach the size of the input image; in this case mode 3 works exactly like mode 2.
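
The ROI update rule can be sketched as follows; the helper name is hypothetical, while the growth percentages mirror the defaults mentioned above.

#include <opencv2/opencv.hpp>

// Updates the region of interest: on a hit the ROI becomes the detection box,
// enlarged by grow_on_hit; on a miss the current ROI keeps growing by
// grow_on_miss. The result is clamped to the image borders.
cv::Rect update_roi(cv::Rect roi, const cv::Rect* detection, const cv::Size& frame,
                    double grow_on_hit = 0.50, double grow_on_miss = 0.03)
{
    if (detection) roi = *detection;                        // last known position
    double g = detection ? grow_on_hit : grow_on_miss;
    int dw = static_cast<int>(roi.width  * g / 2);
    int dh = static_cast<int>(roi.height * g / 2);
    roi -= cv::Point(dw, dh);                               // enlarge around the centre
    roi += cv::Size(2 * dw, 2 * dh);
    return roi & cv::Rect(0, 0, frame.width, frame.height); // clamp to the image
}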

See figure 4.6 for an example of the method described above. The green rectangle marks the region of interest (ROI); the classifiers will search for the robot in this area. The blue rectangle marks that the side-view detector found something there (in this case it correctly recognized the robot). On the next frame the blue rectangle will be the base of the ROI, as it is enlarged in the following steps to define the new region of interest.

Please note that while the ROI is currently updated from the previous detections, it is very easy to change it to another "pre-filter", thanks to the modular architecture introduced in 4.2.

4.3.5.4 Mode 4: Tracking based approach

As mentioned in 4.3.4, the Dlib library has a built-in tracker algorithm based on the very recent research in [76].

Mode 4 is very similar to mode 3, but includes the tracking algorithm mentioned above. Using a tracker makes it unnecessary to scan every frame (or part of it, as mode 3 does) with at least one detector.

Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed.

It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point.

On the other hand, tracking can be misleading as well. If the tracked region "slides" off the robot, incorrect detections will be provided. Also, the tracker often estimates the position of the object correctly while the scale of the rectangle does not fit, which can result in much larger or smaller detection boxes than the object itself. See figure 4.7(b) for an example: the yellow rectangle marks the tracked region, which clearly includes the robot, but its size is unacceptable.

To avoid these phenomena, the bounding box returned by the tracker is regularly checked by the detectors. This is done in the very same way as in mode 3: the output of the tracker is the base of the ROI, which is then enlarged by parameters (similarly to mode 3). The algorithm checks this area with the selected classifier(s).

If any of them finds the object, the tracker is reinitialized based on the detection box. The rectangle is enlarged and translated to include a bigger part of the robot, in contrast to the detected sides only (note that the detectors are trained for the sides, see subsection 4.3.1). This is done because the tracker tends to follow bigger patches of the image better.

If none of the detectors returns a detection inside the ROI, the algorithm will continue to track the object based on previous clues of its position. However, after the number of sequential failed validations exceeds a tolerance limit (a parameter), the object is labelled as lost. Afterwards the algorithm works exactly as mode 3 would: it enlarges the ROI on every frame until a new detection appears. Then the tracker is reinitialized.
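The following simplified C++ sketch illustrates this validation loop around dlib's correlation tracker. The detect callback stands for the HOG+SVM detectors of the previous modes, the tolerance value is a made-up example, and the enlargement of the validation ROI is omitted for brevity; it is a sketch under these assumptions, not the thesis implementation.

```cpp
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/core.hpp>
#include <functional>

// Simplified sketch of the mode 4 logic. 'detect' is assumed to return true
// and fill 'box' when the robot is found in the given image.
struct Mode4Sketch
{
    dlib::correlation_tracker tracker;
    int  failed = 0;
    int  toleranceLimit = 15;      // hypothetical value of the tolerance parameter
    bool lost = true;

    void processFrame(const cv::Mat& frame,   // assumed to be an 8-bit BGR image
                      const std::function<bool(const cv::Mat&, cv::Rect&)>& detect)
    {
        dlib::cv_image<dlib::bgr_pixel> img(frame);
        cv::Rect box;

        if (lost) {                // behave like mode 3: search, then start tracking
            if (detect(frame, box)) {
                tracker.start_track(img, dlib::drectangle(box.x, box.y,
                                                          box.x + box.width,
                                                          box.y + box.height));
                lost = false; failed = 0;
            }
            return;
        }

        tracker.update(img);       // follow the robot between validations
        dlib::drectangle p = tracker.get_position();
        cv::Rect roi = cv::Rect(cv::Point((int)p.left(),  (int)p.top()),
                                cv::Point((int)p.right(), (int)p.bottom()))
                       & cv::Rect(0, 0, frame.cols, frame.rows);

        if (detect(frame(roi), box)) {          // validation by the detectors
            box.x += roi.x; box.y += roi.y;     // back to full-image coordinates
            tracker.start_track(img, dlib::drectangle(box.x, box.y,
                                                      box.x + box.width,
                                                      box.y + box.height));
            failed = 0;
        } else if (++failed > toleranceLimit) {
            lost = true;                        // labelled as lost, fall back to mode 3
        }
    }
};
```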

See figure 4.7(a) for a representation of the processing method of mode 4.


(a) Example output frame of mode 4

(b) Typical error of the tracker

Figure 4.7: (a) A presentation of mode 4. The red rectangle marks the detection of the front-view classifier. The yellow box is the currently tracked area. The green rectangle marks the estimated region of interest. (b) A typical error of the tracker. The yellow rectangle marks the tracked region, which clearly includes the robot but is too big to accept as a correct detection.


4.4 3D image processing methods

4.4.1 3D recording method

The first challenge was how to create a 3D image. As explained in sub-subsection 3.1.2.3, the Lidar sensor used is a 2D type, which means it scans only in a plane. In other words, the sensor returns the distance of the closest reflection point while rotating around one (vertical) axis. The result of this scan looks like a cross-section of a 3D image. To produce a real and complete 3D image, several of these cross-section-like one-plane scans have to be combined. This is a rather complicated method, since to cover the whole room the lidar needs to be moved. However, this movement or displacement will have a significant effect on the recorded distances.

To simplify the initial experiments, the position of the laser scanner was fixed on a tripod and only its orientation was changed. This way the transformation between the coordinate systems and the validation of the recordings were easier.

On figure 4.8 a schematic diagram is given to introduce the 3D recording set-up. The picture is divided into 3 parts. Part a is a side view of the room being recorded, including the scanner and its related equipment. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout. The main parts of the setup are listed below with brief explanations.

1. Lidar: This is the 2D range scanner which was used in all of the experiments. It is mounted on a tripod to fix its position. The sensor is displayed on every sub-figure (a, b, c). On part a it is shown in two different orientations: first when the recording plane is completely flat, secondly when it is tilted.

2. Visualisation of the laser ray: The scanner measures distances in a plane while rotating around its vertical axis, which is called the capital Z axis on the figure. (Note: the sensor uses a class 1 laser which is invisible and harmless to the human eye. Red colour was chosen only for visualization.)

3. Pitch angle of the first orientation: While the scanner is tilted, its orientation is recorded relative to a fixed coordinate system. For this experiment the pitch angle is the most relevant, since roll and yaw did not change. This is calculated as a rotation from the (lower-case) z axis.

4. Pitch angle of the second orientation: This is exactly the same variable as described before, but recorded for the second set-up of the recording system.


5. Yaw angle of the laser ray: As mentioned before, the sensor used operates with one single light source rotated around its vertical (capital) Z axis. To know where the recorded point is located relative to the sensor's own coordinate system, the actual rotation of the beam needs to be saved as well. This is calculated as the angle between the forward direction of the sensor (marked with blue) and the laser ray. All the points are recorded in this 2D space defined with polar coordinates: a distance from the origin (the lidar) and a yaw angle. The angle increases counter-clockwise. Please note that this angle rotates with the lidar and its field of view. Part b represents a state when the lidar is horizontal.

6. Field of view of the sensor: Part b of the picture visualizes the limited but still very wide field of view of the sensor. The view angle is marked with a green coloured curve and the covered area is yellowish. The scanner is able to record 270°, or in other terms [−135°, 135°].

7. Blind-spot of the sensor: As discussed in the previous point, the scanner can register distances in 270°. The remaining 90° is a blind-spot and is located exactly behind the sensor (from 135° to −135°). This area is marked with dark grey on the figure.

8. Inertial measurement unit: All the points are recorded in the 2D space of the lidar, which is a tilted and translated plane. To map every recorded point from this to the 3D map's coordinate system, the orientation and position of the Lidar sensor are needed. Assuming that the latter is fixed (the scanner is mounted on a tripod), only the rotation remains variable. To record this, some kind of inertial measurement unit (IMU) is needed, which is marked with a grey rectangle on the image.

9. Axis of rotation: To scan the room, a continuous tilting of the scanner was needed. It was not possible to place the light source on the axis of the rotation. This means an offset between the sensor and the axis, which was a source of error. The black circle represents the axis the tilting was executed around.

10. Offset between the axis and the light source: As described in the previous point, an offset was present between the range sensor and the axis of rotation. This distance was carefully measured and is marked in blue on the figure.

11. Ground vehicle: The main objective of this thesis is to localize the UGV, thus several 3D maps were made of this vehicle itself. However, the lidar's main task is not the robot detection but the localization of the UAV itself.


On parts a and b its schematic side and top views are shown to visualize its shape, relative size and location.

4.4.2 Android based recording set-up

To test the sensor's capabilities, simplified experiment set-ups were introduced, as described in subsection 4.4.1. The scanned distances were recorded in csv format by the official recording tool provided by the lidar's manufacturer [77]. As described in subsection 4.4.4 and in point 8, the orientation and position of the Lidar sensor are needed to determine the recorded point coordinates relative to the coordinate system fixed to the ground.

Figure 4.10: Screenshot of the Android application for Lidar recordings. It is able to detect, display and log the rotation of the device in a log file with a user defined filename. Source: own application and image.

Assuming that the position is fixed (the scanner is mounted on a tripod), only the rotation can change. To record this, some kind of inertial measurement unit (IMU) was needed.

Since at the time these experiments started no such device was available, the author of this thesis created an Android application. The aim of this program was to use the phone's sensors (accelerometer, compass and gyroscope) as an IMU. The software is able to detect, display and log the rotation of the device along the three axes.

The log file is structured in a way that Matlab (see subsection 3.2.1) or other development tools can import it without any modification. The file-name is user defined.

Since both the Lidar and the application operate at different and not sufficiently consistent frequencies, synchronization was necessary. Since the two data sets were recorded completely separately, this has to be done off-line. To achieve this, a time-stamp has to be recorded for both types of measurements. The lidar recorder tool does this automatically. This function was added to the Android application too, thus the time-stamps of the recorded orientations are saved to the log file as well.


Figure 4.8: Schematic figure to represent the 3D recording set-up. Part a is a side view of the room being recorded, including the scanner and its related equipment. Part b is a top view of the very same room, displaying the field of view of the scanner. Part c is a closer overview of the scanner device, the inertial measurement unit (IMU) and the mounting layout.


Figure 4.9: Example representation of the output of the Lidar sensor. As can be seen, the data returned by the scanner is basically a scan in one plane and can be interpreted as a cross-section of a 3D image. The sensor is positioned in the middle of the cross. Green areas were reached by the laser. Source: screenshot of own recording made with the official recording tool [77].


However, since the two systems were not connected, the timestamps had to be synchronized manually.
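As an illustration of the off-line matching, the sketch below pairs each lidar scan with the nearest IMU sample in time after a manually determined clock offset has been applied; the names and the data layout are assumptions, not the actual recording format.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct ImuSample { double t; double pitch; };   // time-stamp and pitch angle

// Return the pitch belonging to one lidar scan. 'imu' must be sorted by t
// and non-empty; 'clockOffset' maps the lidar clock onto the phone clock.
double pitchForScan(double scanTime, double clockOffset,
                    const std::vector<ImuSample>& imu)
{
    const double t = scanTime + clockOffset;
    auto it = std::lower_bound(imu.begin(), imu.end(), t,
                               [](const ImuSample& s, double v) { return s.t < v; });
    if (it == imu.begin()) return it->pitch;
    if (it == imu.end())   return (it - 1)->pitch;
    // choose the neighbour that is nearer in time
    return (std::fabs(it->t - t) < std::fabs((it - 1)->t - t)) ? it->pitch
                                                               : (it - 1)->pitch;
}
```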

On Figure 4.10 the user interface of the application can be seen. Note that both the displayed and the saved angles are in radians. Figure 4.11(a) presents how the phone was mounted next to the lidar and also shows the test environment which was recorded. The phone was attached to a common metal base with the lidar, thus the rotation of the phone is expected to be very close to the rotation of the lidar. A Motorola Moto X (2013) was used for these experiments.

The coordinate system used is fixed to the phone. The X axis is parallel with the shorter side of the screen, pointing from left to right. The Y axis is parallel to the longer side of the screen, pointing from the bottom to the top. The Z axis is pointing up and is perpendicular to the ground. See figure 4.12 for a presentation of the axes. Angles were calculated in the following way:

• pitch is the rotation around the x axis. It is 0 if the phone is lying on a horizontal surface;

• roll is the rotation around the y axis;

• yaw is the rotation around the z axis.

Although the sensors in a phone are not designed for such precise tasks, the set-up presented in this subsection worked sufficiently well and produced spectacular 3D images. On figures 4.11(b) and 4.11(c) an example is shown, along with the recorded scene captured by a camera (4.11(a)).

4.4.3 Final set-up with Pixhawk flight controller

Although the Android application (see subsection 4.4.2) gave satisfactory results, several reasons emerged for replacing it. First, the synchronization was complicated, manual and off-line, which is not acceptable in a real-time UAV environment. Second, neither the phone's sensors nor its operating system was designed to be mounted on an unmanned aircraft. It is heavier, less precise and more expensive than a dedicated IMU. Thus the Android phone had to be replaced.

As a replacement, the Pixhawk (3.1.2.1) flight controller was chosen, because it is open-source, well supported and, most importantly, it is the chosen controller of the project's UAV. This meant less weight and brought integrity both in hardware and software. Although the Pixhawk is not a dedicated IMU, its built-in sensors are very accurate, since they are used to stabilize multi-rotors during flight. Fellow student Domonkos Huszar managed to connect both the Pixhawk and the lidar (3.1.2.3) to the Robot Operating System (3.2.2). This way the two types of sensors were recorded with the same software.


(a) Laboratory and recorded scene

(b) Elevation of the produced 3D map

(c) The produced 3D map

Figure 4.11: (a) Picture of the laboratory and the mounting of the phone; (b) and (c) 3D map of the scan from different viewpoints. Some objects were marked on all images for easier understanding: 1, 2, 3 are boxes; 4 is a chair; 5 is a student at his desk; 6, 7 are windows; 8 is the entrance; 9 is a UAV wing on a cupboard.


Thus the change from the Android software to the Pixhawk solved the problems of synchronization and accuracy. All the other parts of the experiment (the mounting method, the coordinate systems, etc.) remained unchanged.

4.4.4 3D reconstruction

In subsection 4.4.1 the recording method was described. Subsections 4.4.2 and 4.4.3 discussed the inertial measurement units and the software used. Processing and visualizing the recorded data as a 3D map was the next step.

All the point coordinates obtained were in the lidar sensor's own coordinate system. As shown on figure 4.8 (part b), this is a 2D space represented by polar coordinates: a distance from the origin (the scanner) and an angle between the line pointing to the point and the zero vector. This space was tilted with the scanner during the recordings. To calculate the positions recorded by the Lidar in the ground-fixed coordinate system, this 2D space had to be transformed. Transformation here means rigid motions; no shape or size changes were desired. Rigid motions are reflections, rotations, translations and glide reflections; however, in this experiment only translations and rotations were used.

The 3D coordinate system was defined as follows:

• the x axis is the vector product of y and z: x = y × z, pointing to the right;

• the y axis is the facing direction of the scanner, parallel to the ground, pointing forward;

• the z axis is pointing up and is perpendicular to the ground.

Figure 4.12: Presentation of the axes used in the Android application.

Note that the origin ([0, 0, 0]) is at the lidar's location (at the top of the tripod, at the height of the rotation axis). This means that points below the scanner have a negative altitude.

The lidar and its own 2D space have a 6 degrees of freedom (DOF) state in the ground-fixed coordinate system. Three represent its position along the x, y, z axes. The other three describe its orientation: yaw, pitch and roll.

First the rotation of the scanner will be considered (the latter three DOF). As can be seen on figure 4.8, the roll and yaw angles are not able to change, thus only the pitch has to be considered as a variable in this experiment (see point 3). As a result, the transformed coordinates of the recorded point are the following:

\[ x = distance \cdot \sin(-yaw) \tag{4.1} \]

\[ y = distance \cdot \cos(yaw) \cdot \sin(pitch) \tag{4.2} \]

\[ z = distance \cdot \cos(yaw) \cdot \cos(pitch) \tag{4.3} \]

Where distance and yaw are the two coordinates of the point in the plane of the lidar (figure 4.8, point 5) and pitch is the angle between the (lower-case) z axis and the plane of the lidar (figure 4.8, points 3 and 4).

In theory, the position of the lidar in the ground coordinate system did not change, only its orientation. However, as can be seen on figure 4.8 part c, the rotation axis (marked by 9, see point 9) is not at the same level as the range scanner of the lidar. Since it was not possible to mount the device any other way, a significant offset (marked and explained in point 10) occurred between the light source and the axis itself. This resulted in an undesired movement of the sensor along the z and y axes (note that the x axis was not involved, since the roll angle did not change and only that would move the scanner along the x axis). While the resulting errors were no greater than the offset itself, they are significant. Thus, along with the rotation, a translation is needed as well. The translation vector is calculated from dy and dz:

\[ dy = offset \cdot \sin(pitch) \tag{4.4} \]

\[ dz = offset \cdot \cos(pitch) \tag{4.5} \]


Where dy is the translation required along the y axis and dz is along the z axis. Offset is the distance between the light source and the axis of rotation, presented on figure 4.8 part c, marked with 10.

Combining the five equations (4.1, 4.2, 4.3, 4.4, 4.5) we get

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= distance \cdot
\begin{bmatrix} \sin(-yaw) \\ \cos(yaw) \cdot \sin(pitch) \\ \cos(yaw) \cdot \cos(pitch) \end{bmatrix}
+ offset \cdot
\begin{bmatrix} 0 \\ \sin(pitch) \\ \cos(pitch) \end{bmatrix}
\tag{4.6}
\]

Using the transformation defined in equation 4.6, spectacular and, more importantly, precise 3D maps were created. These were suitable for further work such as SLAM (Simultaneous Localization and Mapping) or the ground robot detection.
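A small routine implementing equation 4.6 could look like the sketch below (the function and parameter names are illustrative); it maps one measurement (distance, yaw) taken at a given pitch angle into the ground-fixed coordinate system.

```cpp
#include <array>
#include <cmath>

// Equation 4.6: transform one lidar measurement into the ground fixed frame.
// 'offset' is the measured distance between the light source and the tilting
// axis; all angles are in radians.
std::array<double, 3> lidarToGround(double distance, double yaw,
                                    double pitch, double offset)
{
    const double x = distance * std::sin(-yaw);
    const double y = distance * std::cos(yaw) * std::sin(pitch) + offset * std::sin(pitch);
    const double z = distance * std::cos(yaw) * std::cos(pitch) + offset * std::cos(pitch);
    return { x, y, z };
}
```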


Chapter 5

Results

5.1 2D image detection results

5.1.1 Evaluation

To make the development easier and the results more objective, an evaluation system was needed. This helps determine whether a given modification of the algorithm actually made it better and/or faster than other versions. This information is extremely important in every vision based detection system to track the improvements of the code and check if it meets the predefined requirements.

To understand what "better" means in the case of a system like this, some definitions need to be clarified. Generally, a detection system can either identify an input or reject it. Here the input means a cropped image of the original picture; in other words, it is a position (coordinates) and dimensions (height and width). Please refer to subsection 4.3.2 for a more detailed description. An identified input is referred to as positive; similarly, a rejected one is called negative.

5.1.1.1 Definition of True positive and negative

A detection is true positive if the system correctly identifies an input as a positive sample. In this case it means that the detector marks the ground robot's position as a detection. Note that it is also often called a hit. Similarly, a true negative means that an input which does not contain the object is correctly rejected.

5.1.1.2 Definition of False positive and negative

In the case of most decision systems, errors can be organised into two main groups. A mistake is a false positive error (also called a type I error) if the system classifies an input as a positive sample despite the fact that it is not.


In other words, the system believes the object is present at that location although that is not the case. In this project that typically means that something that is not the ground robot (a chair, a box or even a shadow) is still recognized as the UGV. A false negative error (also called a type II error) occurs when an input is not recognized (rejected) although it should be. In the current task a false negative error means that the ground robot is not detected.

5.1.1.3 Reducing the number of errors

Unfortunately, it is really complicated to decrease both kinds of errors at the same time. If stricter rules and thresholds are introduced in the decision making, the rate of false positives will decrease, but simultaneously the chance of false negatives will increase. On the other hand, if the conditions of the detection method are loosened, false negative errors may be reduced; however, caused by the exact same thing (less strict conditions), more false positives could appear.

The actual task determines which kind of error should be minimised. In certain projects it could be more important to find every possible desired object than to avoid occasional mistakes. Such a project can be a manually supervised classification, where false positives can be eliminated by the operator (for example, detecting illegal usage of bus lanes). Also, a system with an accident avoidance purpose should be over-secure and rather recognise non-dangerous situations as a hazard than miss a real one.

In this project the false positive rate was decided to be more important. The reason for this is that the position of the UGV is not crucial during the whole flight. If an input frame is processed incorrectly and a false negative error occurs (the robot is in the picture but the algorithm misses it), the tracking system (see subsection 4.3.4) would still be able to estimate its position, and after a few frames the detector would most likely detect the robot again. However, a false positive detection at an unexpected position may confuse the system, since it would have to choose from multiple possible charging platforms. This may lead to a landing attempt on a dangerous surface and to the possibility of losing the aircraft.

5.1.1.4 Annotation and database building

To test the efficiency of the 2D detection algorithm, its detections have to be tested on an image dataset with its samples already labelled (either fully manually or using a semi-supervised classifier), or on a video set where the position of the object (or objects) is known on every frame. This way the true positive/negative and false positive/negative values can be determined.


Figure 5.1: Figure of the Vatic user interface.

Testing on image datasets is suitable for methods which do not use any information transmitted between frames, such as previous positions or the ROI. These algorithms are usually sliding window based (see subsection 4.3.2). Their cropped inputs can be simulated with image datasets without complications.

For this project the video based test is more appropriate, since, as described in subsections 4.3.4 and 4.3.3, the detector uses tracking and pre-filtering. Testing the tracking feature without consecutive frames is pointless. Also, the pre-filtering and ROI generation are only applicable on frames containing more than just the object itself.

Testing on videos requires annotation, which means that every object has to be labelled on every frame. This is a tiring and time-consuming task, but with suitable software it can be simplified. To do this, the Vatic environment was chosen.

The abbreviation Vatic stands for Video Annotation Tool from Irvine, California. It is a free-to-use on-line video annotation tool designed specially for computer vision research [78].

After the successful installation and connection to the server, a clear interface is visible; see an example screenshot on figure 5.1. The interface has a built-in video player which makes it possible to review the video with the annotation at normal speed or even frame-by-frame. The system was designed in a way that multiple objects can be annotated, even from different classes (like pedestrians, cars and trucks, for example). Since only one ground robot has been built yet and no other kinds of objects were requested, only one item was annotated on the videos.


The annotator's task is to mark the position and the size of the objects (that is, to draw a bounding box around them) on every frame of the videos (there are options to label an object as occluded/obstructed or outside of the frame). Although this seems like a very time-consuming task, Vatic uses sophisticated algorithms to interpolate both the positions and the sizes of the objects on frames that are not annotated yet. This is a significant help for the annotator, since after marking the object on some key-frames, the only remaining task is to correct the interpolations between them if necessary.

Note that it is crucial that the bounding boxes are very tight around the object. This will guarantee that during the evaluation of the detectors the statistics are based on true and objective annotations. This topic will be explained in more detail in section 5.3.

Vatic is also able to crowdsource the annotation tasks using Amazon Mechanical Turk [79], which is Amazon's solution for renting human resources. This way large datasets with several videos are easy to build at relatively low cost. However, this feature of the tool was not used in this project, since the number of test videos and their complexity did not make it necessary. All the test videos were annotated by the author.

After successfully annotating a video, Vatic is able to export the bounding boxes in several formats, including xml, json, etc. The exported file contains the id of the frames and the coordinates of every bounding box, along with the label assigned to the object's type.

Using these files it is possible to compare the detections of an algorithm (which are also bounding boxes) with the annotations and thus evaluate the efficiency of the code.

5.1.2 Frame-rate measurement and analysis

In section 4.1, speed was recognised as one of the most important challenges.

To address this issue, the detector was prepared to measure and export the time elapsed while processing each frame.
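A minimal sketch of such a per-frame measurement, assuming the processing of one frame is wrapped in a callable (the file name is hypothetical):

```cpp
#include <chrono>
#include <fstream>

std::ofstream timeLog("frame_times.csv");    // hypothetical output file

// Measure the wall-clock time of processing one frame and append it (in
// seconds) to the log file for later, off-line analysis.
template <typename ProcessFrame>
void timeFrame(ProcessFrame&& process)
{
    const auto start = std::chrono::steady_clock::now();
    process();                               // detector / tracker work for one frame
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    timeLog << elapsed.count() << "\n";
}
```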

The measurements were carried out on a laptop PC (Lenovo Z580, i5-3210M, 8 GB RAM), simulating a setup where the UAV sends video to a ground station for further processing. During the tests every unnecessary process was terminated, so their additional influence was minimized.

The exact same software was used for every test, only the parameter file was changed (switching between the modes). Every method was executed on the same video one hundred times, which means more than 100,000 frames. This amount ensured that temporary resource changes caused by the operating system during the test did not influence the results significantly.


To analyse the exported processing times, a Matlab script has been implemented. This tool can load the log files from the tests and calculate statistics of the data. Measurements like the shortest, longest and average processing time per frame are displayed after execution, along with the average frame-rate and its variance. The latter is very important, as the mode 2, 3 and 4 processing methods change during the execution by definition, which results in a varying processing time; the variance measures these changes. An example output of the analysis software is presented below.

Example output of the Frame-rate analysis software:

    Loaded 100 files
    Number of frames in video: 1080
    Total number of frames processed: 108000
    Longest processing time: 0.813
    Shortest processing time: 0.007
    Average ellapsed seconds: 0.032446
    Variance of ellapsed seconds per frame
        between video loops: 8.0297e-07
        across the video: 0.0021144
    Average Frame per second rate of the processing: 30.8204 FPS

The changes of the average frame-rates between video loops were also monitored. If this value was too big, it probably indicated some unexpected load on the computer, and thus the tests had to be restarted.

It should be mentioned that the frame-rate of the software is strongly correlated with the resolution of the input images, since the smaller the image, the shorter the time it takes to "slide" the detector through it. This is especially true for the first two methods (4.3.5.1, 4.3.5.2). The test videos' resolution is 960 × 540.

5.2 3D image detection results

This thesis focused on the 2D object detection methods, since they are easier to implement and have been widely researched for decades. Traditional camera sensors are significantly cheaper too, and their output is "ready-to-use". On the other hand, data from lidar sensors have to be processed to obtain the image, since the scanner records points in a 2D plane, as explained in section 4.4. Note that 3D lidar scanners exist, but they still scan in a plane and have this mentioned processing built in. Also, they are even more expensive than the 2D versions.


Figure 5.2: On this figure the recording setup is presented. In the background the ground robot is also visible, which was the subject of the measurements.

That being said, the 3D depth map provided by these sensors is not possible to obtain with cameras; it contains valuable information for a project like this and makes indoor navigation easier. Therefore, experiments were carried out to record 3D images by tilting a 2D laser scanner (3.1.2.3) and recording its orientation with an IMU (3.1.2.1).

On figure 5.2 a photo of the experimental setup is shown. In the background the ground robot is visible, which was the subject of the recordings.

Figure 5.3 presents an example 3D point-cloud recorded of the ground robot. Both 5.3(a) and 5.3(b) are images of the same map viewed from different points. As can be seen, recording from one point is not enough to create complete maps, as "shadows" and obscured parts still occur. Note the "empty" area behind the robot on 5.3(b).

The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.
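As a minimal illustration of the height-threshold idea (an assumption about possible future work, not something implemented in the thesis), such a filter could keep only the points whose altitude falls into the band where the ground robot is expected:

```cpp
#include <array>
#include <vector>

using Point3 = std::array<double, 3>;   // {x, y, z} in the ground fixed frame

// Keep the points whose altitude (z) lies between zMin and zMax; the result
// can serve as a set of candidate regions for further object detection.
std::vector<Point3> filterByHeight(const std::vector<Point3>& cloud,
                                   double zMin, double zMax)
{
    std::vector<Point3> candidates;
    for (const Point3& p : cloud)
        if (p[2] >= zMin && p[2] <= zMax)
            candidates.push_back(p);
    return candidates;
}
```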

5.3 Discussion of results

In general, it can be said that the detector modes worked as expected on the recorded test videos, with suitable efficiency. Example videos can be found here¹ for all modes. However, to judge the performance of the algorithm, more objective measurements were needed.

¹ users.itk.ppke.hu/~palan1/CranfieldVideos


(a) Example result of the 3D scan of the ground robot

(b) Example of the 'shadow' of the ground robot

Figure 5.3: Example of the 3D images built. Both images show the same 3D map viewed from different angles. Note that the carpet is visible underneath the vehicle.


The 2D image processing results were evaluated by speed and detection efficiency. The latter can be described by several values; here, recall and precision will be used. Recall is defined by

\[ \frac{TP}{TP + FN} \]

where TP is the number of true positive and FN is the number of false negative detections. In other words, it is the ratio of the number of detected positive samples to the total number of positive samples; thus it is also called sensitivity.

Precision is defined by

\[ \frac{TP}{TP + FP} \]

where TP is the number of true positive and FP is the number of false positive detections. In other terms, precision is a measurement of the correctness of the returned detections.

Both values have a different meaning for validation and neither of them can be used alone. This is because, as mentioned in sub-subsection 5.1.1.3, both false positive and false negative errors are undesirable, but neither recall nor precision measures both. For example, it is possible to reach an outstanding recall rate with an extreme amount of false positives. Similarly, an increase of false negative errors (e.g. due to stricter thresholds) will improve the precision of the algorithm.

The detections (the coordinates of the rectangles) were exported by the detection software for further analysis. The rectangles were compared to the ones annotated with the Vatic video annotation tool (5.1.1.4) based on position and the overlapping areas. For the latter a minimum of 50% was defined (the ratio of the area of the intersection and the union of the two rectangles).

Note that this also means that even if a detection is at the correct location, it is possible that it will be logged as a false positive error, since the overlap of the rectangles is not satisfactory. Unfortunately, registering this box as a false positive will also generate a false negative error, since only one detection is returned by the detector for a location (overlapping detections are merged); therefore the annotated object will not be covered by any of the detections.
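A sketch of this overlap check, assuming OpenCV rectangles are used (the 0.5 threshold corresponds to the 50% criterion above):

```cpp
#include <opencv2/core.hpp>

// A detection matches an annotation if the intersection-over-union of the
// two rectangles is at least 'minOverlap'.
bool isTruePositive(const cv::Rect& detection, const cv::Rect& annotation,
                    double minOverlap = 0.5)
{
    const double inter = (detection & annotation).area();               // intersection
    const double uni   = detection.area() + annotation.area() - inter;  // union
    return uni > 0.0 && inter / uni >= minOverlap;
}
```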

As expected, mode 1 and mode 2 produced very similar results: a decent 62% recall rate was achieved. This is a result of the fact that they work exactly the same way except for the selection of the used classifiers. The results show that introducing this idea did not decrease the performance.

Mode 3 managed to improve the recall rate by 2% with the introduction of regions of interest. None of the first 3 modes had false positive detections, as a result of the strict thresholds of the SVMs.

Mode 4 had an outstanding recall rate of 95%. This is a result of the tracker algorithm, which was able to follow the object even at view-points where the SVM detectors no longer detected the robot. See table 5.1 for all the recall and precision values.


Similarly, the processing speeds of the methods introduced in subsection 4.3.5 were analysed as well, with the tools described in subsection 5.1.2.

Mode 1 proved to be the slowest, with a frame-rate of 4.2 FPS. This is not surprising, as this version of the algorithm scans the whole image with all the detectors.

Mode 2 raised this speed to 6.5 FPS by introducing an intelligent way of selecting the applied detector.

Mode 3 brought another significant increase by implementing region of interest estimators, which reduce the amount of the image to be scanned. It can process around 12 frames in a second.

Finally, mode 4 proved to be the fastest solution by far. As a result of its tracking algorithm, there is no need to use the SVM classifiers as often as before. This is a very important result, since tracking is significantly faster than scanning with the detectors. Mode 4 reached a frame-rate of 30 FPS during the tests, thus mode 4 can be called a "real-time" object detection algorithm.

See table 5.1 for the frame-rates of all modes displayed along with their recall and precision values. Note that all frame-rates are averages of 100 executions.

Table 5.1: Table of results

Mode     Recall   Precision   FPS       Variance
mode 1   0.623    1           4.2599    0.00000123
mode 2   0.622    1           6.509     0.0029557
mode 3   0.645    1           12.06     0.0070877
mode 4   0.955    0.898       30.82     0.0021144

As a conclusion, it can be seen that mode 4 outperformed all the others both in detection rate (recall 95%) and speed (average frame-rate 30 FPS), using ROI estimators and a tracking algorithm. Due to the nature of the tracker and the evaluation method, the number of false positives increased. As mentioned before, even a correctly positioned but incorrectly scaled detection decreased the precision rate. However, this should be simple to improve by adding further validations of the output of the tracker (both for position and scale).


Chapter 6

Conclusion and recommended future work

6.1 Conclusion

In this paper an unmanned aerial vehicle based airborne tracking system was presented, with the aim of detecting objects indoors (potentially extended to outdoor operations as well) and localizing a ground robot that will serve as a landing platform for recharging the UAV.

First, the project and the aims and objectives of this thesis were introduced in chapter 1. This introduction was followed by an extensive literature review to give an overview of the existing solutions for similar problems (chapter 2). Unmanned Aerial Vehicles were presented and examples were shown for application fields. Then the most important object recognition methods were introduced and reviewed with respect to their suitability for the discussed project.

Then the environment of the development was described, including the available software and hardware resources (chapter 3). Their advantages and disadvantages were analysed and their applications in the project were explained.

Afterwards, the recognized challenges of the project were collected and discussed with respect to the object detection task in chapter 4. Considering the objectives, resources and challenges, a modular architecture was designed and introduced (section 4.2). The types of modules were listed and discussed in detail, including their purposes and some possible realizations.

The modular structure of the system makes it very flexible and easy to extend, or to replace modules. The currently implemented and used ones were listed and explained. Also, some unsuccessful experiments were mentioned.

Subsection 4.3.4 concludes the chosen feature extraction (HOG) and classifier (SVM) methods.


It also presents the implemented training software along with the produced detectors.

As one of the most important parts, the currently used detector was introduced in subsection 4.3.5. This module was developed with special attention paid to validation and possible future work, therefore additional debugging features, like exporting detections, demonstration videos and frame rate measurements, were implemented. To make the development simpler, an easy-to-use interface was included, which lets the most important parameters be set without modification of the code. Four related but different detection methods were developed. Mode 1 scans the whole image with both trained detectors to locate the robot. Mode 2 does the same, but instead of all classifiers only one is used most of the time, which is chosen based on previous detections; this resulted in an increased frame-rate. Mode 3 introduces the concept of Region Of Interest (ROI) and only scans this area. The ROI is defined based on previous detections, however it is easy to replace it with other methods due to the modular structure. The reduced search space made the processing significantly faster. Finally, mode 4 includes a tracking algorithm which makes it unnecessary to scan every frame (or part of it) with a detector. Instead, once the robot is found, it is followed across the scene. Certainly, periodical validations are still needed. Tracking is significantly faster than the sliding window detectors and in appropriate conditions can perform above real-time speed (the average frame-rate was 30.8 FPS). It also improved the efficiency of the detector, since on many frames the robot was tracked and its position was marked correctly while the detectors missed it from the same view-point. However, tracking can be misleading as well, if the tracked region drifts away from the robot (e.g. a chair next to the robot is followed). To avoid this, the bounding box returned by the tracker is regularly checked by the detectors.

Special attention was given to the evaluation of the system. Two pieces of software were developed as additional tools: one for evaluating the efficiency of the detections and another for analysing the processing times and frame rates.

Although the whole project (especially the autonomous UAV) still needs a lot of improvement and is not ready for testing yet, the first version of the ground robot detection algorithm is ready to use and has shown promising results in simulated experiments.

The detector based on mode 4 had a recall rate of 95%, with an average frame-rate of 30.8 FPS on the test videos. Although this processing speed was achieved on a laptop (simulating a setup where the UAV sends video to a ground station for further processing), porting this code to an on-board computer with smaller computational capacity should result in a slower but still acceptable frame-rate which satisfies the objectives (5 Hz).

While this thesis focused on two dimensional image processing and object detection methods, 3D image inputs were considered as well. Section 4.4 concludes the progress made related to 3D mapping, along with the applied mathematics.


An experimental setup was created with which it was possible to create spectacular and, more importantly, precise 3D maps of the environment with a 2D laser scanner. To test the idea of using the 3D image as an input for the ground robot detection, several recordings were made of the UGV. The created point-clouds were found to be detailed enough to carry out simpler (e.g. threshold based on height) or more complex object detection methods (e.g. registering 3D key-points) on them in the future.

As a conclusion, it can be said that the objectives defined in section 1.3:

1. explore possible solutions in a literature review,

2. design a system which is able to detect the ground robot based on the available sensors,

3. implement and test the first version of the algorithms,

were all addressed and completed.

6.2 Recommended future work

In spite of the satisfying results of the current setup, several modifications and improvements can and should be implemented to increase the efficiency and speed of the algorithm.

Focusing on the 2D processing methods first, experiments with retrained SVMs are recommended. It is expected that training on the same two sides (side and front), but including more images from slightly rotated view-points (both rolling the camera and capturing the object not completely from the front), will result in better performance.

Region of interest estimators have huge potential in the project, since they can significantly speed up the process. Aside from basic algorithms based on simple features (e.g. similar colour patches, edges, etc.), several more complex segmentation methods have been introduced in the literature.

Tracking can be tuned more precisely as well. Apart from the rectangle which marks the estimated position of the object, the tracker returns a confidence value which measures how confident the algorithm is that the object is inside the rectangle. This is very valuable information for eliminating detections which "slipped" off the object. The further processing of the returned position is also recommended: for example, growing or shrinking the rectangle based on simple features (e.g. thresholded patches, intensity, edges, etc.) can be profitable, since the tracker tends to perform better when it tracks the whole object. On the other hand, too big bounding boxes are a problem as well, thus shrinking of the rectangles has to be considered too.


Using the 3D image for the detection of the robot is another very interesting opportunity that should be examined in more detail. Certainly, this is only possible if the 3D imaging is carried out at near real-time speed (currently all the maps were built offline). The currently seen 3D image (the point-cloud in front of the sensor) can be interpreted as a pre-filter for the 2D image processing methods. For example, filtering for objects with more or less the same size as the UGV could be a very effective region of interest detector. To achieve this, the fields of view of the two sensors need to be registered carefully, so corresponding areas can be found.

Finally, when the continuous 3D map building (note that this is more than the imaging, since the 3D images have to be aligned and combined) is finished, several more improvements will be possible. Assuming that both the aerial and the ground vehicle are located in this map, the real 3D position of the robot will be known. Its next position can be estimated based on its maximal velocity and the elapsed time; theoretically it is not possible for the robot to be outside of this circle, thus other parts of the map do not need to be scanned. Since the location and orientation of the UAV will be known as well, the field of view of the sensors can be estimated. After that, the UAV can either move and rotate in a way to cover the possible locations of the UGV with its sensors, or just ignore (not process) inputs from other areas.

It is strongly recommended to give high priority to evaluation and testing during the further experiments. For 2D images the Vatic annotation tool was introduced in this thesis as a very useful and convenient tool to evaluate detectors. For 3D detections this is not suitable, but a similar solution is needed to keep up the progress towards the planned ready-to-deploy 3D mapping system.


Acknowledgements

First of all I would like to thank Pázmány Péter Catholic University and my teachers there, who helped me to advance and grow through the years and later made it possible to continue my studies at Cranfield University, where I was welcomed with warm hospitality and was supported throughout the development of this thesis.

I would like to thank Dr Al Savvaris for the professional help and encouragement he provided while we worked together.

Special thanks goes to my friends Domonkos Huszár and Lóránt Kovács, with whom I shared all the happiness, sadness, worries, troubles and the weight of this past year, and to whom I could always turn in need or joy.

Thanks to my friend Roberto Opromolla for helping us to make the necessary measurements.

My family never stopped supporting me, loving me and ensuring the right environment for improvement and growth.

Many thanks to Fanni Melles for the support and love you gave me and for standing by me even at the hardest times.

Thanks to my friends Márton Hunyady and Dóra Babicz for the funny and sometimes cruel comments during the revision of this thesis.

And finally, many thanks to Andras Horvath, who agreed to be my 'second' supervisor and mentor at my home university.


References

[1] "Overview - senseFly." [Online]. Available: https://www.sensefly.com/drones/overview.html [Accessed at 2015-08-08]

[2] H. Chao, Y. Cao, and Y. Chen, "Autopilots for small fixed-wing unmanned air vehicles: A survey," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, 2007, pp. 3144–3149.

[3] A. Frank, J. McGrew, M. Valenti, D. Levine, and J. How, "Hover, Transition, and Level Flight Control Design for a Single-Propeller Indoor Airplane," AIAA Guidance, Navigation and Control Conference and Exhibit, pp. 1–43, 2007. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2007-6318

[4] "Fixed Wing Versus Rotary Wing For UAV Mapping Applications." [Online]. Available: http://www.questuav.com/news/fixed-wing-versus-rotary-wing-for-uav-mapping-applications [Accessed at 2015-08-08]

[5] "DJI Store: Phantom 3 Standard." [Online]. Available: http://store.dji.com/product/phantom-3-standard [Accessed at 2015-08-08]

[6] "World War II V-1 Flying Bomb - Military History." [Online]. Available: http://militaryhistory.about.com/od/artillerysiegeweapons/p/v1.htm [Accessed at 2015-08-08]

[7] L. Lin and M. A. Goodrich, "UAV intelligent path planning for wilderness search and rescue," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 709–714.

[8] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini UAV," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.

[9] P. Doherty and P. Rudol, "A UAV Search and Rescue Scenario with Human Body Detection and Geolocalization," in Lecture Notes in Computer Science, 2007, pp. 1–13. [Online]. Available: http://www.springerlink.com/content/t361252205328408/fulltext.pdf

[10] A. Jaimes, S. Kota, and J. Gomez, "An approach to surveillance an area using swarm of fixed wing and quad-rotor unmanned aerial vehicles UAV(s)," 2008 IEEE International Conference on System of Systems Engineering, SoSE 2008, 2008.

[11] M. Kontitsis, K. Valavanis, and N. Tsourveloudis, "A UAV vision system for airborne surveillance," IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, vol. 1, 2004.

[12] M. Quigley, M. A. Goodrich, S. Griffiths, A. Eldredge, and R. W. Beard, "Target acquisition, localization, and surveillance using a fixed-wing mini-UAV and gimbaled camera," in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, 2005, pp. 2600–2606.

[13] E. Semsch, M. Jakob, D. Pavlíček, and M. Pěchouček, "Autonomous UAV surveillance in complex urban environments," in Proceedings - 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2009, vol. 2, 2009, pp. 82–85.

[14] M. Israel, "A UAV-BASED ROE DEER FAWN DETECTION SYSTEM," pp. 51–55, 2012.

[15] G. Zhou and D. Zang, "Civil UAV system for earth observation," in International Geoscience and Remote Sensing Symposium (IGARSS), 2007, pp. 5319–5321.

[16] W. DeBusk, "Unmanned Aerial Vehicle Systems for Disaster Relief: Tornado Alley," AIAA Infotech@Aerospace 2010, 2010. [Online]. Available: http://arc.aiaa.org/doi/abs/10.2514/6.2010-3506

[17] S. D'Oleire-Oltmanns, I. Marzolff, K. Peter, and J. Ries, "Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco," Remote Sensing, vol. 4, no. 12, pp. 3390–3416, 2012. [Online]. Available: http://www.mdpi.com/2072-4292/4/11/3390

[18] F. Nex and F. Remondino, "UAV for 3D mapping applications: a review," Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2013. [Online]. Available: http://link.springer.com/10.1007/s12518-013-0120-x

[19] H. Eisenbeiss, "The Potential of Unmanned Aerial Vehicles for Mapping," Photogrammetrische Woche Heidelberg, pp. 135–145, 2011. [Online]. Available: http://www.ifp.uni-stuttgart.de/publications/phowo11/140Eisenbeiss.pdf

[20] C. Bills, J. Chen, and A. Saxena, "Autonomous MAV flight in indoor environments using single image perspective cues," in Proceedings - IEEE International Conference on Robotics and Automation, 2011, pp. 5776–5783.

[21] W. Bath and J. Paxman, "UAV localisation & control through computer vision," Proceedings of the Australasian Conference on Robotics, 2004. [Online]. Available: http://www.cse.unsw.edu.au/~acra2005/proceedings/papers/bath.pdf

[22] K. Çelik, S. J. Chung, M. Clausman, and A. K. Somani, "Monocular vision SLAM for indoor aerial vehicles," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 1566–1573.

[23] H. Oh, D. Y. Won, S. S. Huh, D. H. Shim, M. J. Tahk, and A. Tsourdos, "Indoor UAV control using multi-camera visual feedback," in Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 61, no. 1-4, 2011, pp. 57–84.

[24] Y. M. Mustafah, A. W. Azman, and F. Akbar, "Indoor UAV Positioning Using Stereo Vision Sensor," pp. 575–579, 2012.

[25] P. Jongho and K. Youdan, "Stereo vision based collision avoidance of quadrotor UAV," in Control, Automation and Systems (ICCAS), 2012 12th International Conference on, 2012, pp. 173–178.

[26] J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for structured environment reconstruction," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2005, pp. 2089–2094.

[27] H. Surmann, A. Nüchter, and J. Hertzberg, "An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments," Robotics and Autonomous Systems, vol. 45, no. 3-4, pp. 181–198, 2003.

[28] Y. Lin, J. Hyyppä, and A. Jaakkola, "Mini-UAV-borne LIDAR for fine-scale mapping," IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 426–430, 2011.

[29] F. Wang, J. Cui, S. K. Phang, B. M. Chen, and T. H. Lee, "A mono-camera and scanning laser range finder based UAV indoor navigation system," in 2013 International Conference on Unmanned Aircraft Systems, ICUAS 2013 - Conference Proceedings, 2013, pp. 694–701.

[30] M. Nagai, T. Chen, R. Shibasaki, H. Kumagai, and A. Ahmed, "UAV-borne 3-D mapping system by multisensor integration," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 701–708, 2009.

[31] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, and C. Gao, "Object class detection," ACM Computing Surveys, vol. 46, no. 1, pp. 1–53, Oct. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2522968.2522978

[32] P. M. Roth and M. Winter, "Survey of Appearance-Based Methods for Object Recognition," Transform, no. ICG-TR-0108, 2008. [Online]. Available: http://www.icg.tu-graz.ac.at/Members/pmroth/pub_pmroth/TR_OR/at_download/file

[33] A. Andreopoulos and J. K. Tsotsos, "50 Years of object recognition: Directions forward," Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013. [Online]. Available: http://www.researchgate.net/publication/257484936_50_Years_of_object_recognition_Directions_forward

[34] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou, "Attending to visual motion," Computer Vision and Image Understanding, vol. 100, no. 1-2 SPEC. ISS., pp. 3–40, 2005.

[35] P. Perona, "Visual Recognition Circa 2007," pp. 1–12, 2007. [Online]. Available: https://www.vision.caltech.edu/publications/perona-chapter-Dec07.pdf

[36] M. Piccardi, "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), vol. 4, pp. 3099–3104, 2004.

[37] T. Bouwmans, "Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey," Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, 2011.

[38] L. Xu and W. Bu, "Traffic flow detection method based on fusion of frames differencing and background differencing," in 2011 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011 - Proceedings, 2011, pp. 1847–1850.

[39] A.-T. Nghiem, F. Bremond, I.-S. Antipolis, and R. Lucioles, "Background subtraction in people detection framework for RGB-D cameras," 2004.

[40] J. C. S. Jacques, C. R. Jung, and S. R. Musse, "Background subtraction and shadow detection in grayscale video sequences," in Brazilian Symposium of Computer Graphic and Image Processing, vol. 2005, 2005, pp. 189–196.

[41] P. Kaewtrakulpong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," Advanced Video Based Surveillance Systems, pp. 1–5, 2001.

[42] G. Turin, "An introduction to matched filters," IRE Transactions on Information Theory, vol. 6, no. 3, 1960.

[43] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," Proceedings of SPIE, vol. 4387, pp. 95–102, 2001. [Online]. Available: http://link.aip.org/link/?PSI/4387/95/1&Agg=doi

[44] H. Schweitzer, J. W. Bell, and F. Wu, "Very Fast Template Matching," Program, no. 009741, pp. 358–372, 2002. [Online]. Available: http://www.springerlink.com/index/H584WVN93312V4LT.pdf

[45] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2001. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Robust+Real-time+Object+Detection#0

[46] S. Omachi and M. Omachi, "Fast template matching with polynomials," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2139–2149, 2007.

[47] M. Jordan and J. Kleinberg, Bishop - Pattern Recognition and Machine Learning.

[48] D. Prasad, "Survey of the problem of object detection in real images," International Journal of Image Processing (IJIP), no. 6, pp. 441–466, 2012. [Online]. Available: http://www.cscjournals.org/csc/manuscript/Journals/IJIP/volume6/Issue6/IJIP-702.pdf

[49] D. Lowe, "Object Recognition from Local Scale-Invariant Features," IEEE International Conference on Computer Vision, 1999.

[50] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[51] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3951 LNCS, 2006, pp. 404–417.

[52] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," International Conference on Machine Learning, pp. 143–151, 1997. [Online]. Available: papers2://publication/uuid/9FC2122D-6D49-4DC5-AC03-E353D5B3D1D1

[53] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, 2005, pp. 886–893. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2005.177

[54] R. Lienhart and J. Maydt, "An Extended Set of Haar-Like Features for Rapid Object Detection." [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=1011869433

[55] S Han Y Han and H Hahn ldquoVehicle Detection Method using Haar-like Featureon Real Time Systemrdquo inWorld Academy of Science Engineering and Technology2009 pp 455ndash459

[56] Q C Q Chen N Georganas and E Petriu ldquoReal-time Vision-based Hand Ges-ture Recognition Using Haar-like Featuresrdquo 2007 IEEE Instrumentation amp Mea-surement Technology Conference IMTC 2007 2007

[57] G Monteiro P Peixoto and U Nunes ldquoVision-based pedestrian detection usingHaar-like featuresrdquo Robotica 2006 [Online] Available httpcyberc3sjtueducnCyberC3docpaperRobotica2006pdf

[58] J Martiacute J M Benediacute A M Mendonccedila and J Serrat Eds PatternRecognition and Image Analysis ser Lecture Notes in Computer Science BerlinHeidelberg Springer Berlin Heidelberg 2007 vol 4477 [Online] Availablehttpwwwspringerlinkcomindex101007978-3-540-72847-4

[59] A E C Pece ldquoOn the computational rationale for generative modelsrdquo ComputerVision and Image Understanding vol 106 no 1 pp 130ndash143 2007

[60] I T Joliffe Principle Component Analysis 2002 vol 2 [Online] Availablehttpwwwspringerlinkcomcontent978-0-387-95442-4

[61] A Hyvarinen J Karhunen and E Oja ldquoIndependent Component Analysisrdquovol 10 p 2002 2002

[62] K Mikolajczyk B Leibe and B Schiele ldquoMultiple Object Class Detection with aGenerative Modelrdquo IEEE Computer Society Conference on Computer Vision andPattern Recognition (CVPR) vol 1 pp 26ndash36 2006

[63] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[64] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: http://link.springer.com/10.1007/BF00994018

[65] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[66] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144–152. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=1011213818


[67] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[68] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008.

[69] "Nitrogen6X - iMX6 Single Board Computer." [Online]. Available: http://boundarydevices.com/product/nitrogen6x-board-imx6-arm-cortex-a9-sbc [Accessed at 2015-08-10]

[70] "Pixhawk flight controller." [Online]. Available: http://rover.ardupilot.com/wiki/pixhawk-wiring-rover [Accessed at 2015-08-10]

[71] "Scanning range finder UTM-30LX-EW." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/utm_30lx_ew.html [Accessed at 2015-07-22]

[72] "OpenCV manual, Release 2.4.9." [Online]. Available: http://docs.opencv.org/opencv2refman.pdf [Accessed at 2015-07-20]

[73] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [Online]. Available: http://jmlr.csail.mit.edu/papers/v10/king09a.html

[74] "dlib C++ Library." [Online]. Available: http://dlib.net [Accessed at 2015-07-21]

[75] D. E. King, "Max-Margin Object Detection," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1502.00046

[76] M. Danelljan, G. Häger, and M. Felsberg, "Accurate Scale Estimation for Robust Visual Tracking," in Proceedings of the British Machine Vision Conference (BMVC), 2014.

[77] "UrgBenri Information Page." [Online]. Available: https://www.hokuyo-aut.jp/02sensor/07scanner/download/data/UrgBenri.htm [Accessed at 2015-07-20]

[78] "vatic - Video Annotation Tool - UC Irvine." [Online]. Available: http://web.mit.edu/vondrick/vatic [Accessed at 2015-07-24]

[79] "Amazon Mechanical Turk." [Online]. Available: https://www.mturk.com/mturk/welcome [Accessed at 2015-07-26]


• List of Figures
• Absztrakt
• Abstract
• List of Abbreviations
• 1 Introduction and project description
  • 1.1 Project description and requirements
  • 1.2 Type of vehicle
  • 1.3 Aims and objectives
• 2 Literature Review
  • 2.1 UAVs and applications
    • 2.1.1 Fixed-wing UAVs
    • 2.1.2 Rotary-wing UAVs
    • 2.1.3 Applications
  • 2.2 Object detection on conventional 2D images
    • 2.2.1 Classical detection methods
      • 2.2.1.1 Background subtraction
      • 2.2.1.2 Template matching algorithms
    • 2.2.2 Feature descriptors, classifiers and learning methods
      • 2.2.2.1 SIFT features
      • 2.2.2.2 Haar-like features
      • 2.2.2.3 HOG features
      • 2.2.2.4 Learning models in computer vision
      • 2.2.2.5 AdaBoost
      • 2.2.2.6 Support Vector Machine
• 3 Development
  • 3.1 Hardware resources
    • 3.1.1 Nitrogen board
    • 3.1.2 Sensors
      • 3.1.2.1 Pixhawk autopilot
      • 3.1.2.2 Camera
      • 3.1.2.3 LiDar
  • 3.2 Chosen software
    • 3.2.1 Matlab
    • 3.2.2 Robotic Operating System (ROS)
    • 3.2.3 OpenCV
    • 3.2.4 Dlib
• 4 Designing and implementing the algorithm
  • 4.1 Challenges in the task
  • 4.2 Architecture of the detection system
  • 4.3 2D image processing methods
    • 4.3.1 Chosen methods and the training algorithm
    • 4.3.2 Sliding window method
    • 4.3.3 Pre-filtering
    • 4.3.4 Tracking
    • 4.3.5 Implemented detector
      • 4.3.5.1 Mode 1: Sliding window with all the classifiers
      • 4.3.5.2 Mode 2: Sliding window with intelligent choice of classifier
      • 4.3.5.3 Mode 3: Intelligent choice of classifiers and ROIs
      • 4.3.5.4 Mode 4: Tracking based approach
  • 4.4 3D image processing methods
    • 4.4.1 3D recording method
    • 4.4.2 Android based recording set-up
    • 4.4.3 Final set-up with Pixhawk flight controller
    • 4.4.4 3D reconstruction
• 5 Results
  • 5.1 2D image detection results
    • 5.1.1 Evaluation
      • 5.1.1.1 Definition of True positive and negative
      • 5.1.1.2 Definition of False positive and negative
      • 5.1.1.3 Reducing number of errors
      • 5.1.1.4 Annotation and database building
    • 5.1.2 Frame-rate measurement and analysis
  • 5.2 3D image detection results
  • 5.3 Discussion of results
• 6 Conclusion and recommended future work
  • 6.1 Conclusion
  • 6.2 Recommended future work
• References