
B.E.PROJECT REPORT

On

ESTIMATING THE COUNT OF PEOPLE FROM A VIDEO

Submitted by,

Prasad Sunil Udawant (B120023170)

Tejas Suresh Pandhare (B120023102)

Pratik Gahininath Kekane (B120023117)

Project Guide

Prof. P. Mahajani

(Internal Guide)

Sponsored by

College

Year: 2015-2016

Maharashtra Institute of Technology, Pune – 38

Department of Electronics and Telecommunication


MAEER’s

MAHARASHTRA INSTITUTE OF TECHNOLOGY, PUNE.

CERTIFICATE

This is to certify that the Project entitled

ESTIMATING THE COUNT OF PEOPLE FROM A VIDEO

has been carried out successfully by

Prasad Sunil Udawant (B120023170)

Tejas Suresh Pandhare (B120023102)

Pratik Gahininath Kekane (B120023117)

during the Academic Year 2015-2016 in partial fulfilment of their course of study for

Bachelor’s Degree in

Electronics and Telecommunication

as per the syllabus prescribed by the

Savitribai Phule Pune University.

Prof. Mrs. P. Mahajani Dr. G.N. Mulay

Internal Guide Head of Department

(Electronics and Telecommunications)

MIT, Pune.


ABSTRACT

There is ever-increasing pressure to provide services for an ever-growing human population. On many occasions, managing people becomes critical, especially at public places like pilgrimage centers, malls and tourist spots. This is where technology comes in to help us out.

There are existing technologies to detect people in an enclosed environment. These technologies use sensors such as infrared and thermal sensors, and they have varying accuracies and drawbacks. Nowadays, there is an increasing trend towards video-based solutions. With image processing providing a variety of processing techniques and efficient algorithms, it promises pin-point accuracy. We are making use of the Da Vinci video processor (DM6437), which features the very long instruction word (VLIW) architecture developed by Texas Instruments. The video processing back end and front end (VPBE and VPFE) are video-specific peripherals that enable easy processing of real-time video. The features of the Da Vinci combined with image segmentation help detect the number of people. We use techniques like histograms and K-means along with edge detection to ensure the reliability of the count.

On implementation of our proposed model, we will be able to detect the number of people in a specific environment in response to a real-time video input. This will open up a number of controlling actions depending on the application at hand.


ACKNOWLEDGEMENT

We sincerely thank our final-year mentor, Prof. (Mrs.) P. Mahajani, for all her support and help. She gave shape to our abstract idea with stimulating suggestions and encouragement, resulting in a successful project. Her timely guidance was the reason for the systematic progress of the project.

We also appreciate the role of all the other staff members, the departmental facilities and the HoD, who helped in approving our project and conducted several review sessions to track our progress and direct us along the correct path.


LIST OF ABBREVIATIONS

APL Application Layer

DVDP Digital Video Development Platform

DVSDK Digital Video Software Development Kit

EPSI Embedded Peripheral Software Interface

EVM Evaluation Module

GPP General Purpose Processor

HD High Definition

IOL Input Output Layer

NTSC National Television System Committee

PAL Phase Alternating Line

SD Standard Definition

SPL Signal Processing Layer

VISA Video Image Speech Audio

VPBE Video Processing Back End

VPFE Video Processing Front End

VPSS Video Processing Sub System

xDAIS eXpressDSP Algorithm Interface Standard

xDM eXpressDSP Digital Media


LIST OF FIGURES

Serial no. Figure Name

2.1 People count using Face Recognition

2.2 People count using PIR Sensor

3.1 System Block Diagram

3.2 Video Scanning Methods

3.3 Block Diagram of DaVinci Processor

3.4 Configuration Switch S3 Summary

3.5 Operating system layers in DaVinci Processor

3.6 Functional Block Diagram

3.7 VPFE Functional Block Diagram

3.8 VPBE Functional Block Diagram

4.1 System Flowchart

4.2 Background Subtraction

4.3 Erosion of a binary image with a disk structuring element

4.4 Dilation of a binary image with a disk structuring element

4.5 Opening of binary image

5.1 Color Bar Output

5.2 Edge Segmentation Output

A.1 YCbCr Sampling


CONTENTS

Chapter 1. Introduction

1.1 Scope of Project

1.2 Organization of the Report

Chapter 2. Literature Survey

2.1 Present Scenario

Chapter 3. System Development

3.1 System Specifications

3.2 System Block Diagram & Description

3.3 System Block Components

    Video Standards

    Color CCD Camera

    TMS320DM6437 Da Vinci Video Processor

3.4 Complexities Involved

Chapter 4. System Design

4.1 Image Preprocessing

4.2 Background Subtraction

4.3 Image Segmentation

4.4 Morphological Operations

Chapter 5. Implementation of System & Results

5.1 Test Code for Color Bars

5.2 Edge Detection using IMGLIB

5.3 Background Subtraction on DM6437

Chapter 6. References

Appendix A: Video File Format


Chapter 1 INTRODUCTION

Surveillance systems are used for monitoring, screening and tracking of activities in public places such as banks, in order to ensure security. Various aspects like screening objects and people, biometric identification, video surveillance, and maintaining a database of potential threats are used for monitoring the activity. Moving object tracking in video has attracted a great deal of interest in computer vision. For object recognition, navigation systems and surveillance systems, object tracking is the first step. Object tracking methods may broadly be categorized as segmentation-based, template-based, probabilistic and pixel-wise methods. In segmentation-based tracking, or "blob detection", the basic idea is to detect points and/or regions in the image that are either brighter or darker than their surroundings. These methods are easy to implement and fast to compute but may lack accuracy for some applications. Template-based methods match the direct appearance from frame to frame; they offer a great deal of accuracy but are computationally expensive. Probabilistic methods use an intelligent searching strategy for tracking the target object. Similarly, similarity-matching techniques are used for tracking the target object in pixel-based methods.

Most tracking algorithms are based on difference evaluation between

the current image and a previous image or a background image. However, algorithms

based on the difference of images have problems in the following cases.

(1) Still objects included in the tracking task exist.

(2) Multiple moving objects are present in the same frame.

(3) The camera is moving.

(4) Occlusion of objects occurs.

This can be solved by using an algorithm for object tracking based on image segmentation and pattern matching. We use a novel image segmentation algorithm in order to extract all objects in the input image.

The background subtraction method uses the difference between the current image and a background image to detect moving objects. The algorithm is simple, but it is very sensitive to changes in the external environment and has poor anti-interference ability. However, it can provide the most complete object information when the background is known, and it receives the most attention due to its computationally affordable implementation and its accurate detection of moving entities. In this project, under a single static camera condition, we combine dynamic background modelling with a threshold selection method based on background subtraction, and update the current frame on the basis of exact detection of the object. This method is effective in improving moving object detection.


Any motion detection system based on background subtraction needs to handle a

number of critical situations such as:

1. Image noise, due to a poor-quality image source.

2. Gradual variations of the lighting conditions in the scene.

3. Small movements of non-static objects such as tree branches and bushes

blowing in the wind.

4. Long-lasting variations of the objects in the scene, such as cars that park (or depart after a long period).

5. Sudden changes in the light conditions (e.g. sudden rain), or the presence of a light switch (the change from daylight to artificial light in the evening).

6. Movements of objects in the background that leave parts of it different from the background model.

7. Shadow regions that are projected by foreground objects and are detected as

moving objects.

8. Multiple objects moving in the scene both for long and short periods.

The main objective of this project is to develop an algorithm that can detect human motion at a certain distance for object tracking applications. We carry out various tasks such as motion detection, background modelling and subtraction, and foreground detection.


1.1 SCOPE OF PROJECT

The motivation for such a system is to gather data about how many people are inside a building at a given time. This helps building owners decide on fire-extinguishing equipment or on the size and placement of fire exits; owners are required by law to have enough of this equipment based on how many people can gather inside. A computer-vision counting system has the advantage of not disrupting the flow of traffic the way contact-based systems might, and it is more robust than simple photoelectric cells. Knowing when and how many customers are inside a shopping mall could also help optimize labor scheduling and system controls, and monitor the effectiveness of promotional events. Optimization of security measures is another possible benefit: knowing how many security guards should be assigned, and which hot-spots inside the mall they should patrol.

Our work will hopefully answer questions like:

1. Can we achieve better separation of groups into individuals?

Exploring algorithms and methods so that each individual can be

detected, tracked and counted.

2. Can we find features to discriminate people from objects? Features need to be found and combined to make good decisions about foreground objects.

3. Can our proposed algorithm detect moving as well as stationary

humans?


1.2 ORGANIZATION OF THE REPORT

Chapter 1 :Introduction and scope

This chapter provides an overview of the basic functionality of the system and describes its scope of expansion.

Chapter 2: Literature Survey and present scenario.

We present the Literature Survey of the work done in this field so far as well

as the present scenario.

Chapter 3 : System Block Diagram and Flow Chart

Explains in detail the design and development process of the system. Includes system specifications, block diagram, description of each block, and system flow chart.

Chapter 4:System Design

It includes all the algorithms used for image pre-processing, feature

extraction and classification and recognition.

Chapter 5 : Result and Conclusion

It includes all test codes and functions run on MATLAB & CCSv3.3 to verify

accurate working of hardware & the various stages of the proposed algorithm

of a test video. It also includes future scope of project based on the conclusion.

Chapter 6 : References and Appendix

Includes important references related to the work done on the topic: datasheets, books and websites referred to for implementation of the proposed system as doubts and queries arose.


Chapter 2 LITERATURE SURVEY

[A] "HUMAN DETECTION IN VIDEO", Muhammad Usman Ghani Khan and Atif Saeed, Journal of Theoretical and Applied Information Technology, May 2007.

Their proposed algorithm comprises the following steps:

1. Converting a video sequence in to individual images.

2. Accessing the sequential images and detecting the important features.

3. Allocating those regions (if any) giving indications of human presence; one such indication is having a human-skin-like color.

4. Applying movement detection test for all of the allocated regions.

5. Applying a face detector to the detected moving objects to check whether each one is a face or not.

[B] "Fast People Counting Using Head Detection from Skeleton Graph", IEEE International Conference on Advanced Video and Signal Based Surveillance.

In this paper, they present a new method for counting people. The method is based on head detection after a segmentation of the human body by a skeleton graph process. The skeleton silhouette is computed and decomposed into a set of segments corresponding to the head, torso and limbs. This structure captures the minimal information about the skeleton shape.

[C] "Real-time people counting system using video camera" by Roy-Erland Berg.

In this thesis, experiments were carried out on a people counting system in an effort to enhance the accuracy when separating and counting groups of people and non-human objects. The system features automatic color equalization, adaptive background subtraction, a shadow detection algorithm and Kalman tracking.

[D] Real-Time Video and Image Processing for Object Tracking using Da

Vinci Processor, Dissertation submitted by Badri Narayan Patro, M.Tech , IIT

Bombay under the guidance of Prof. V. Rajbabu

In this project they developed and demonstrated a framework for real-time implementation of image and video processing algorithms, such as object tracking and image inversion, using the DaVinci processor. More specifically, they track a single object and two objects present in a scene captured by a CCD camera that acts as the video input device, with the output shown on an LCD display. The tracking happens in real time, consuming 30 frames per second (fps), and is robust to background and illumination changes. The performance of single-object tracking using background subtraction and blob detection was very efficient in speed and accuracy compared to a PC (MATLAB) implementation of a similar algorithm. Execution times for the different blocks of single-object tracking were estimated using the profiler, and the accuracy of the detection was verified using the debugger provided by TI Code Composer Studio (CCS). They demonstrate that the TMS320DM6437 processor provides at least a ten-times speed-up and is able to track a moving object in real time.

2.1 PRESENT SCENARIO


Presently, a good number of different algorithms are available to make this kind of system work. Some of them make use of multiple cameras, while some make use of a single camera.

A very obvious way to get the count of people in a video is by recognizing the face of each individual. But this puts a lot of limitations on the system and makes it impractical to work with. Some systems implement a new method to detect human skin and faces from colored images. These systems are based on the detection of all pixels in colored images which are probably human skin, via a reference skin-color matrix. The image then goes through some modifications to enhance the face detection.

Figure 2.1: People count using Face Recognition

Pyroelectric infrared (PIR) sensors are well-known occupancy detectors. They have been widely employed for human tracking systems due to their low cost and power consumption, small form factor, and unobtrusive and privacy-preserving interaction. In particular, a dense array of PIR sensors having digital output and the modulated visibility of Fresnel lenses can provide capabilities for tracking human motion, identifying walking subjects and counting people entering or leaving the entrance of a room or building. However, the analog output signal of PIR sensors involves more aspects beyond simple people presence, including the distance of the body from the PIR sensor, the velocity of the movement (i.e., direction and speed), body shape and gait (i.e., a particular way or manner of walking). Thus, we can leverage discriminative features of the analog output signal of PIR sensors in order to develop various applications for indoor human tracking and localization.

A number of systems are based upon a similar approach, i.e. feature-based regression. This involves detection of humans based upon the features extracted from the background and foreground of the image. The interpretation of these features varies from system to system, and many follow a complex mathematical model to retrieve meaningful information from them.

Figure 2.2: People count using PIR Sensor

Chapter 3 SYSTEM DEVELOPMENT


3.1 System Specifications

Our proposed algorithm takes video as input via a CCD-based camera. Its features, like extraordinary dynamic range, spatial resolution, spectral bandwidth and acquisition speed, serve our purpose of getting a noise-free input. One of the obtained video frames is used for processing on a specialized video processing board popularly known as 'Da Vinci'. While the primary software we have used to support our hardware is Code Composer Studio, we use MATLAB only to test and verify the concept before actually applying it to our main system.

Software:

Code Composer Studio v3.3

MATLAB 2014a (for testing purposes only)

Hardware:

TMS320DM6437 Da Vinci video processor

CRT TV Box

Color CCD camera for video surveillance

3.2 SYSTEM BLOCK DIAGRAM & DESCRIPTION


Figure 3.1: System Block Diagram

System Block Description:-

A CCD camera will be installed in an area which is to be brought under video surveillance. It will be installed in the specific area at a minimum height and angle that still allows people of varying heights to be detected.

The video stream, which is in PAL form, is then fed to the DA VINCI video

processor. This device includes a Video Processing Sub-System (VPSS) with

two configurable video/imaging peripherals:

1) Video processing front-end input (VPFE) used for video capture,

2) Video processing back-end output (VPBE) output.

The video processing front end comprises a CCD controller (CCDC), a preview engine, a histogram module, an auto-exposure/white-balance module, a focus module, and a resizer. Common video decoders, CMOS sensors and CCDs can be easily interfaced to the CCDC. The previewer is a real-time image processing engine that takes raw image data from either a CMOS sensor or a CCD and converts it from RGB to YUV422 format. The resizer accepts image data for vertical and horizontal resizing.

The video processing back end comprises the on-screen display and the video encoder (VENC). The VENC provides four analog DACs running at 54 MHz, providing a means for composite NTSC/PAL and/or component video output. It also provides a digital output interface to RGB devices.

Thus the features of the processor, such as its third-generation, high-performance, advanced VelociTI VLIW architecture developed by TI, make it an excellent choice for digital media applications.

The CRT TV box, on obtaining the video, converts the processed frames into a monitor-compatible video format (VGA).


3.3 SYSTEM BLOCK COMPONENTS

1) Video Standards:

Progressive scan captures, transmits and displays an image in a way similar to text on a page: line by line, top to bottom.

The interlaced scan in a CRT display also completes such a scan, but in

two phases (or two fields-viz. odd and even). The first field displays the first and all

odd numbered lines from top left corner to bottom right corner. The second pass

displays the second and all even numbered lines, filling in the gaps in first (odd field)

scan. This scanning by alternate lines is called interlacing.

A field is an image that contains only half of the lines needed to make a complete picture; it therefore saves bandwidth.

Figure 3.2: Video Scanning Methods


The two video standards currently employed in television systems are as follows:

A] NTSC (National Television System Committee)

> 29.97 interlaced frames of video per second

> Scans 525 lines per frame. Of these 525 lines, 480 form the visible raster and the rest are used for synchronization and vertical retrace

> Gives higher temporal resolution than PAL

> Screen updates more frequently and hence motion is rendered better in NTSC than

in PAL video

B] PAL (Phase Alternating Line)

> PAL alternates the Chroma phase between each line of video such that if there are

any drifts in Chroma decoding they average out between lines. NTSC doesn’t have

this protection and as a result its Chroma reproduction can be wrong, however PAL

can be accused of having less Chroma detail.

> PAL specifies 786 pixels per line, 625 lines or 50 fields (25 frames) per second

> PAL gives higher spatial resolution than NTSC

> PAL video is of higher resolution than NTSC video


Difference between NTSC and PAL:

NTSC is the video system or standard used in North America or most of South

America. In NTSC, 30 frames are transmitted each second. Each frame is

made up of individual 525 scan lines. PAL is a predominant video system or

standard mostly used overseas. In PAL, 25 frames are transmitted each

second. Each frame is made up of 625 individual scan lines.

720*576=414,720 for 4:3 aspect ratio using PAL.

720*480=345,600 for 4:3 aspect ratio using NTSC.

2) Color CCD Camera

DESCRIPTION

Model No.: MCB2200
1/3" Color Camera, PAL, Audio
12 V DC, 3 W max

SPECIFICATIONS

Pick-up device: SONY 1/3" interline transfer color CCD
Picture elements: NTSC 510x492, PAL 500x582 (Standard Res.); NTSC 768x494, PAL 752x582 (High Res.)
Horizontal resolution: 380 TV lines (Standard Resolution); 480 TV lines (High Resolution)
Sensitivity: 0.3 lux / F1.2 (Standard Resolution); 0.5 lux / F1.2 (High Resolution)
S/N ratio: Over 48 dB
Electronic shutter: 1/60 (1/50) to 1/100,000 s
Auto iris: Video / Direct Drive switch
Auto gain control: On/Off switch
Gamma correction: 0.45
Video output: BNC, VBS 1.0 Vp-p, 75 ohm
Power source: DC 12 V only, or AC 24 V / DC 12 V
Sync. mode: Internal sync.
Lens mount: C/CS mount
Power consumption: 3 W max.

SPECIAL FEATURES

1) Electronic Shutter ON/OFF :

ES ON:

The camera continuously adjusts the shutter speed from 1/60(NTSC),

1/50(PAL) second to 1/100,000 second according to the luminance conditions

of the scene.

ES OFF:

The shutter speed is fixed at 1/60 (NTSC) or 1/50 (PAL) second. Set ES OFF when an auto-iris lens is used or when flicker is observed under a very bright fluorescent lamp. Otherwise, turn ES ON for optimum performance.

2) Back Light Compensation ON/OFF:

When BLC is turned on, the AGC, ES and IRIS operating point is determined by averaging over the center area instead of the entire field of view, so that a dimly lit foreground object in the center area can be clearly distinguished from brightly lit backgrounds.

BLC should not be used unless it is needed to compensate for backlighting.

3) Automatic Gain Control ON/OFF :

AGC ON:

The sensitivity increases automatically when light is low.

AGC OFF:

A low noise picture is obtained under a low light condition.


3) TMS320DM6437 Da Vinci Video Processor

Figure 3.3: Block Diagram of DaVinci Processor

Description

The DaVinci EVM is a development board that enables evaluation of, and design with, the DaVinci processors. The EVM serves both as a working platform and as a reference design.

The DaVinci family consists of DSP-based system-on-a-chip processors designed to handle today's video- and connectivity-driven applications. The DaVinci EVM is a reference platform that highlights the on-chip capabilities. Board features include:

256 MB of SDRAM

16 MB of linear Flash memory


Composite video inputs (1 decoder)

Composite and component video outputs

AIC33 stereo codec

Stereo analog audio inputs and outputs

S/PDIF digital audio outputs

USB 2.0 host connector

10/100 Ethernet PHY

Infrared remote interface

9-pin UART

SD/MMC/MS serial media card support

CompactFlash/SM/xD parallel media card support

ATA hard disc interface

FIGURE 3.4 : Configuration Switch S3 Summary


Da Vinci Processor and Family

Da Vinci technology is a family of processors integrated with a software and hardware tools package, giving a flexible solution for a host of applications from cameras to phones to handheld devices to automotive gadgets. DaVinci technology is the combination of raw processing power and the software needed to simplify and speed up the production of digital multimedia and video equipment.

Da Vinci technology consists of:

Da Vinci Processors: Scalable, programmable DSPs and DSP-based SoCs (systems on chip) built from DSP cores, accelerators, peripherals and ARM processors, optimized to match the performance, price and feature requirements of a spectrum of digital video end equipment, e.g. the TMS320DM6437 and TMS320DM6467.

Da Vinci Software: Interoperable, optimized video and audio codecs supporting common standards, leveraging the DSP and integrated accelerators, together with APIs within operating systems (Linux) for rapid software implementation, e.g. the Codec Engine, DSP/BIOS, NDK, and audio and video codecs.

Da Vinci Development Tools/Kits: Complete development kits along with reference designs: the DM6437 DVSDK, Code Composer Studio, Green Hills tools and virtual Linux. The Da Vinci video processor solutions are tailored for digital video, image and vision applications. The Da Vinci platform includes a general-purpose processor (GPP), video accelerators and a DSP.


Basic working functionality of the Da Vinci processor

Consider a video capture driver as an example: it reads data from a video port or peripheral and starts filling a memory buffer. When this input buffer is full, an interrupt is generated by the IOL to the APL and a pointer to this full buffer is passed to the APL. The APL picks up this buffer pointer and in turn generates an interrupt to the SPL and passes the pointer. The SPL now processes the data in this input buffer and, when complete, generates an interrupt back to the APL and passes the pointer of the output buffer that it created. The APL passes this output buffer pointer to the IOL, commanding it to display it or send it out on the network. Note that only pointers are passed while the buffers remain in place; the overhead of passing the pointers is negligible. These three layers, with their APIs, drivers and components, are shown in Figure 3.5.

Figure 3.5: Operating system layers in DaVinci Processor
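The data flow described above can be pictured with a minimal sketch in plain C. This is our own illustration, not the DVSDK API: the function names, the static buffer and the toy processing step are hypothetical, but it shows how only the buffer pointer travels between the layers while the frame data stays in shared memory.

#include <stddef.h>

#define FRAME_BYTES (720 * 576 * 2)             /* one PAL YCbCr 4:2:2 frame   */

static unsigned char frameBuf[FRAME_BYTES];     /* buffer shared by all layers */

/* IOL: capture driver fills the buffer and hands back its address. */
static unsigned char *iol_capture(void)
{
    /* ...peripheral fills frameBuf here... */
    return frameBuf;
}

/* SPL: algorithm works on the buffer in place through the pointer it is
 * given (toy operation: invert the luma bytes, which sit at odd offsets). */
static void spl_process(unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 1; i < len; i += 2)
        buf[i] = (unsigned char)(255 - buf[i]);
}

/* IOL: display driver consumes the same buffer. */
static void iol_display(const unsigned char *buf)
{
    (void)buf;  /* ...hand the address to the display peripheral... */
}

/* APL: the master thread only moves pointers between the layers. */
int main(void)
{
    for (;;) {
        unsigned char *p = iol_capture();       /* IOL -> APL: full buffer     */
        spl_process(p, FRAME_BYTES);            /* APL -> SPL: same pointer    */
        iol_display(p);                         /* APL -> IOL: same pointer    */
    }
}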

>>Signal Processing Layer (SPL):

The SPL consists of all the signal processing functions or algorithms that run on the device. For example, a video codec, such as MPEG-4 SP or H.264, runs in this layer. These algorithms are wrapped with the eXpressDSP Digital Media (xDM) API. Between xDM and the VISA (Video, Image, Speech, Audio) APIs sit the Codec Engine, Link and DSP/BIOS. Memory buffers, along with their pointers, provide input and output to the xDM functions. This decouples the SPL from all other layers. The Signal Processing Layer (SPL) presents VISA APIs to all other layers. The main components of the SPL are xDM, xDAIS, the VISA APIs and the Codec Engine interface.

>> Input output Layer (IOL):

The Input Output Layer (IOL) covers all the peripheral drivers and generates buffers for input or output data. Whenever a buffer is full or empty, an interrupt is generated to the APL. Typically, these buffers reside in shared memory, and only pointers are passed from the IOL to the APL and eventually to the SPL. The IOL is delivered as drivers integrated into an operating system such as Linux or WinCE. In the case of Linux, these drivers reside in the kernel space of the Linux OS. The Input Output Layer presents the OS-provided APIs as well as the EPSI APIs to all other layers. The IOL contains the Video Processing Subsystem (VPSS) device driver used for video capture and display, the USB driver used to capture video to USB-based media, and the UART serial port driver used for debugging through a console application. When the captured video is to be sent over the network, the Ethernet (EMAC) driver is needed; the VPFE driver internally uses the I2C driver as its communication protocol; the Multichannel Audio Serial Port (McASP) driver is used for audio processing; and the Multichannel Buffered Serial Port (McBSP) driver is used for buffering stream data.

>>Application Layer (APL):

The Application Layer interacts with the IOL and the SPL. It makes calls to the IOL for data input and output, and to the SPL for processing. The Sample Application Thread (SAT) is a sample application component that shows how to call the EPSI and VISA APIs and interfaces with the SPL and IOL as built-in library functions. All other application components are left to the developer, who may develop them or leverage the vast open-source community software. These include, but are not limited to, graphical user interfaces (GUIs), middleware, networking stacks, etc. The master thread is the highest-level thread, such as an audio or video thread, that handles the opening of I/O resources (through the EPSI API), the creation of processing algorithm instances (through the VISA API), as well as the freeing of these resources. Once the necessary resources for a given task are acquired, the master thread specifies an input source for the data (usually a driver or file), the processing to be performed on the input data (such as compression or decompression) and an output destination for the processed data (usually a driver or file).

The Network Developer's Kit (NDK) provides services such as an HTTP server, DHCP client/server, DNS server, etc. that reside in the application layer.


Note that these services use the socket interface of the NDK, which resides in the

I/O layer, so the NDK spans both layers.

Figure 3.6: Functional Block Diagram


Video Processing Sub Systems :

1. Video Processing Front End (VPFE):

The VPFE block comprises a charge-coupled device (CCD) controller (CCDC), a preview engine image pipe (IPIPE), a hardware 3A statistics generator (H3A), a resizer and a histogram module. The CCD controller is responsible for accepting raw, unprocessed image/video data from a sensor (CMOS or CCD). The preview engine image pipe (IPIPE) is responsible for transforming raw (unprocessed) image/video data from a sensor (CMOS or CCD) into YCbCr 4:2:2 data, which is easily handled for compression or display. Typically, the output of the preview engine is used both for video compression and for display on an external display device, such as an NTSC/PAL analog encoder or a digital LCD. The output of the preview engine or DDR2 is the input to the resizer, which can resize it to 720x480 pixels per frame. The output of the resizer module is sent to the SDRAM/DDRAM, after which the resizer is again free to serve the preview engine pipe for further processing. The H3A module is designed to support the control loops for auto focus (AF), auto white balance (AWB) and auto exposure (AE) by collecting metrics about the image/video data: the AF engine extracts and filters RGB data from the input image/video data and provides either the accumulation or the peaks of the data in a specified region, while the AE/AWB engine accumulates values and checks for saturated values in a sub-sampling of the video data. The histogram module allows the luminance intensity distribution of the pixels of the image/frame to be represented.


FIGURE 3.7: VPFE BLOCK DIAGRAM

2. Video Processing Back End (VPBE):

The VPBE is responsible for displaying the processed image on different display devices such as a TV, LCD or HDTV. The VPBE block comprises the on-screen display (OSD) and video encoder (VENC) modules. The OSD is a graphics accelerator responsible for resizing images to either NTSC or PAL format (640x480 to 720x576) on the output devices; it combines display windows into a single display frame, which helps the VENC module to output the video data. The primary function of the OSD module is to gather and combine video data and display/bitmap data and then pass it to the video encoder (VENC) in YCbCr format. The VENC takes the display frame from the OSD and formats it into the desired output format and output signals (including data, clocks, sync, etc.) required to interface to display devices. The VENC consists of three primary sub-blocks: the analog video encoder, which generates the signals required to interface to NTSC/PAL systems and includes the video D/A converters; the timing generator, responsible for generating the specific timing required for analog video output; and the digital LCD controller, which supports various LCD display formats and YUV outputs for interfacing to high-definition video encoders and/or DVI/HDMI interface devices.

Figure 3.8: VPBE Functional block diagram


3.4 COMPLEXITIES INVOLVED

1) Positioning camera:

Getting a full skeleton of all the people under surveillance without much ambiguity using a single camera is not that simple. It may happen that the camera is installed in such a fashion that the captured video misses a person. So proper precautions must be taken to bring the area under surveillance in a way that does not hide people in close proximity.

2) Identifying human beings from the input video:

There are a number of approaches which people have worked upon, such as face detection algorithms. But these put many restrictions on the expected system, one of them being capturing the face of every human under video surveillance, which is impractical. Thus, we opted for a physique-based algorithm which detects the skeleton of a human after subtracting the background from the input frame.

3) Counting stationary people:

We started off with identifying the foreground by subtracting the background frame from an input video frame. This gave us all the moving objects in the foreground, and among those we were able to count people. But this left stationary people uncounted. To overcome this flaw, we use a reference image of the area under surveillance, which facilitates detection of both moving and stationary people.

4) Separating people from group of people:

The most ambiguous situation in video processing applications is overlapping edges. These overlapping edges arise on account of people in close proximity. In our system, this leads to overlapping of the respective skeletons, which in turn leads to an erroneous count of people. To handle such situations we have included head detection and pose estimation in our algorithm.


Chapter 4 SYSTEM DESIGN

Introduction

People counting algorithms are applied to different applications such as automated video surveillance, traffic monitoring, stampede management etc. They involve various image processing algorithms such as image segmentation and morphological image processing.

FIGURE 4.1: SYSTEM FLOWCHART

The major steps for object tracking are as shown in Figure 4.1. The different steps in people counting are:

1. Image preprocessing.

2. Background subtraction.

3. Image segmentation (thresholding).

4. Morphological operation (opening).

5. Blob detection and analysis (connected component labeling).

6. Counting the number of people.


4.1 Image Preprocessing

The image captured by a surveillance camera is affected by various system noises, and the output data format may be uncompressed or compressed. In order to remove the noise, preprocessing of the image is essential. Preprocessing of the image includes filtering and noise removal.

4.2 Background Subtraction

Background subtraction is a widely used approach for detecting moving objects in

videos from static cameras. The rationale in this approach is that of detecting the

moving objects from the difference between the current frame and a reference frame,

often called the “background image”, or “background model”. It is required that the

background image must be a representation of the scene with no moving objects and

must be kept regularly updated so as to adapt to the varying luminance conditions and

geometry settings.

The main motivation for the background subtraction is to detect all the foreground

objects in a frame sequence from fixed camera. In order to detect the foreground

objects, the difference between the current frame and an image of the scene’s static

background is compared with a threshold. The detection equation is expressed as:

|frame(i) - background(i)| > Threshold (4.1)

The background image varies due to many factors such as illumination changes

(gradual or sudden changes due to clouds in the background), changes due to camera

oscillations, changes due to high-frequencies background objects (such as tree

branches, sea waves etc.).

The basic methods for background subtraction are:

1. Frame difference

|frame(i) - frame(i - 1)| > Threshold (4.2)

Here the previous frame is used as the background estimate. This evidently works only under particular conditions of object speed and frame rate, and it is very sensitive to the threshold.


2. Average or median

The background image is obtained as the average or the median of the previous n frames. This method is rather fast, but needs a large memory: the memory requirement is n * size(frame).

3. Background obtained as the running average

B(i + 1) = α * F(i) + (1 - α) * B(i) (4.3)

where α, the learning rate, is typically 0.05; this method needs no additional memory. (A small C sketch of this update is given after Figure 4.2.)

We are using an adaptive background subtraction algorithm:

FIGURE 4.2: BACKGROUND SUBTRACTION
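As a rough illustration of equations (4.1) and (4.3), the following sketch (plain C, with our own buffer layout and names rather than the DM6437 driver code) keeps a per-pixel background model, updates it with the running average, and produces the absolute-difference image that is later thresholded in the segmentation step.

#include <math.h>

#define WIDTH  720
#define HEIGHT 576
#define NPIX   (WIDTH * HEIGHT)

static float background[NPIX];                  /* background model B(i)       */

/* Running-average update of equation (4.3):
 * B(i+1) = alpha * F(i) + (1 - alpha) * B(i), with alpha typically 0.05. */
void update_background(const unsigned char *frame, float alpha)
{
    int i;
    for (i = 0; i < NPIX; i++)
        background[i] = alpha * frame[i] + (1.0f - alpha) * background[i];
}

/* Absolute difference |frame(i) - background(i)| of equation (4.1); the
 * result is thresholded later to decide foreground vs. background. */
void frame_difference(const unsigned char *frame, unsigned char *diff)
{
    int i;
    for (i = 0; i < NPIX; i++)
        diff[i] = (unsigned char)fabsf((float)frame[i] - background[i]);
}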


Other methods are:

The median and running average give the fastest speed [14]. Mixture of Gaussians, KDE, eigenbackgrounds, SKDA and optimized mean-shift give intermediate speed, while standard mean-shift gives the slowest speed.

For memory requirements, average, median, KDE [14] and mean-shift consume the most memory; mixture of Gaussians, eigenbackgrounds and SKDA consume an intermediate amount; and the running average consumes very little memory.

For the accuracy parameter, mixture of Gaussians and eigenbackgrounds provide good accuracy, and simple methods such as the standard average, running average and median can provide acceptable accuracy in specific applications.

4.3 Image Segmentation

Thresholding:

Thresholding means classifying the image histogram by one or several thresholds. The pixels are classified based on gray-scale values lying within a gray-scale class. The process of thresholding involves deciding a gray-scale value to distinguish different classes, and this gray-scale value is called the "threshold". Threshold-based classification can be divided into global-threshold dividing and local-threshold dividing. Global-threshold dividing involves obtaining a threshold from the entire image information and dividing the entire image. Local-threshold dividing involves obtaining thresholds in different regions and dividing each region based on its own threshold.

In threshold segmentation, selecting the threshold is the key step. In traditional segmentation, the threshold is determined from a one-dimensional histogram. However, a one-dimensional histogram only reflects the distribution of image gray scale, without the spatial correlation between image pixels. This may lead to errors in segmentation and unsatisfactory results. Other image segmentation algorithms include region growing, edge detection, clustering etc. Among these, thresholding and region growing are generally not used alone, but as part of a series of treatment processes. The disadvantage is the inherent dependence on the selection of the seed region and the order in which pixels and regions are examined; the segments resulting from region splitting may appear too square due to the splitting scheme.

The background subtraction image is a gray-scale image, so it has to be transformed into a binary image to obtain the segmented image (i.e. the separation of the foreground and the background). To transform a gray-scale image (255 values) into a binary image (2 values), a threshold must be applied. All pixel values smaller than this threshold are viewed as the background of the scene (value 0). This eliminates a lot of "noisy" pixels, which most of the time have a value close to the background, and it also eliminates some of the pixels which represent the shadows made by the moving objects. In fact, in a gray-scale image the shadow of an object most of the time does not change the feature (color) of the pixel very much, so this shadow has a small value in the background subtraction image.
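A minimal sketch of this binarization step (our own helper, assuming an 8-bit gray-scale difference image held in a flat array):

/* Global thresholding of a gray-scale difference image into a binary image.
 * Pixels below the threshold become background (0); the rest become
 * foreground (255). */
void threshold_image(const unsigned char *gray, unsigned char *binary,
                     int npix, unsigned char threshold)
{
    int i;
    for (i = 0; i < npix; i++)
        binary[i] = (gray[i] < threshold) ? 0 : 255;
}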

4.4 Morphological Operations :

1. Erosion :

The first morphological operation used is erosion. It is a basic operation and its primary effect is to erode away the boundaries of the foreground regions. Thus the foreground objects become smaller (some of them vanish entirely) and holes in objects become bigger. Let X be a subset of E and let B denote the structuring element. The morphological erosion is defined by

X ⊖ B = { z ∈ E | B_z ⊆ X },

where B_z is the structuring element translated to position z.

In outline, all the pixels of the foreground object at which the structuring element B can be totally contained will be contained in the eroded object. For example, consider a 3x3 square structuring element (a 3x3 block of ones) having its morphological center the same as its geometrical center.


To compute a binary erosion, all the pixels of the foreground must be processed. For each pixel of the foreground, the algorithm places the structuring element (the center of the structuring element coincides with the pixel) and tests whether the structuring element is completely contained in the foreground. If it is not, the current pixel is treated as background; if it is, the current pixel is contained in the eroded foreground.

Figure 4.3: Erosion of a binary image with a disk structuring element


2. Dilation:

Like erosion, dilation is the second basic operation, and its primary effect is to dilate the boundaries of the foreground regions. Thus the foreground objects become bigger and holes in objects become smaller (some of them disappear entirely).

Let X be a subset of E and let B denote the structuring element. For a symmetric structuring element, the morphological dilation is defined by

X ⊕ B = { z ∈ E | B_z ∩ X ≠ ∅ },

i.e. the set of positions z at which the translated structuring element touches X.

Figure 4.4: Dilation of binary image

In outline, all the pixels of the background which can touch the foreground regions, by placing the structuring element B on them, will be contained in the dilated object.

For example, consider again the 3x3 square structuring element having its morphological center the same as its geometrical center. To compute a binary dilation, all the pixels of the background must be processed. For each pixel of the background, the algorithm places the structuring element (the center of the structuring element coincides with the pixel) and tests whether the structuring element touches at least one pixel of the foreground. If it does, the current pixel is treated as foreground; if it does not, the pixel stays a background pixel.


3) Opening:

Figure 4.5: Opening of binary image

The opening operation is a combination of the two basic operations (erosion and dilation). It is the dilation of the erosion, and its primary purpose is to eliminate noise (small objects). This operation also separates blobs which are linked by a thin layer. Let X be a subset of E and let B denote the structuring element. The morphological opening is defined by

X ∘ B = (X ⊖ B) ⊕ B.
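The three operations can be sketched in C as follows. This is a simplified illustration under our own assumptions (a 0/255 binary image stored row by row, a 3x3 square structuring element, and image borders left for the caller to initialize), not the code used on the board.

#define IMG_W 720                  /* assumed image width  */
#define IMG_H 576                  /* assumed image height */

/* Erosion: a pixel stays foreground only if its whole 3x3 neighbourhood
 * is foreground. */
void erode3x3(const unsigned char *in, unsigned char *out)
{
    int x, y, i, j;
    for (y = 1; y < IMG_H - 1; y++) {
        for (x = 1; x < IMG_W - 1; x++) {
            int keep = 1;
            for (j = -1; j <= 1; j++)
                for (i = -1; i <= 1; i++)
                    if (in[(y + j) * IMG_W + (x + i)] == 0)
                        keep = 0;
            out[y * IMG_W + x] = keep ? 255 : 0;
        }
    }
}

/* Dilation: a pixel becomes foreground if any pixel of its 3x3
 * neighbourhood is foreground. */
void dilate3x3(const unsigned char *in, unsigned char *out)
{
    int x, y, i, j;
    for (y = 1; y < IMG_H - 1; y++) {
        for (x = 1; x < IMG_W - 1; x++) {
            int hit = 0;
            for (j = -1; j <= 1; j++)
                for (i = -1; i <= 1; i++)
                    if (in[(y + j) * IMG_W + (x + i)] != 0)
                        hit = 1;
            out[y * IMG_W + x] = hit ? 255 : 0;
        }
    }
}

/* Opening: erosion followed by dilation, using a caller-provided scratch buffer. */
void open3x3(const unsigned char *in, unsigned char *out, unsigned char *tmp)
{
    erode3x3(in, tmp);
    dilate3x3(tmp, out);
}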

Blob Analysis :

Once the segmentation is done, another image processing step must be applied to the binary image. In fact, in order to count objects, the first step is to identify all the objects in the scene and calculate their features. This process is called blob analysis. It consists of analyzing the binary image, finding all the blobs present and computing statistics for each one. Typically, the blob features calculated are the area (the number of pixels which compose the blob), the perimeter, the location and the blob shape. In this process, it is possible to filter the different blobs by their features. For example, if the blobs of interest must have a minimum area, some blobs can be eliminated by this algorithm if they do not respect this constraint (this limits the number of blobs and thus reduces the computing operations). Two different ways of connection can be defined in the blob analysis algorithm depending on the application: one takes only the adjacent pixels along the vertical and the horizontal as touching pixels (4-connectivity), and the other also includes diagonally adjacent pixels (8-connectivity).
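A minimal sketch of such a blob analysis (our own illustration, not the code used on the DM6437): it visits 4-connected foreground blobs with an iterative flood fill and counts only those whose area exceeds a minimum size.

#include <stdlib.h>

/* Count connected foreground blobs (pixel value 255) in a binary image,
 * ignoring blobs smaller than min_area.  Uses 4-connectivity and an
 * iterative flood fill with an explicit stack; visited pixels are
 * re-marked with the value 128. */
int count_blobs(unsigned char *img, int w, int h, int min_area)
{
    int count = 0;
    int *stack = (int *)malloc(sizeof(int) * w * h);
    int x, y;

    if (stack == NULL)
        return -1;

    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            int area = 0, top = 0;
            if (img[y * w + x] != 255)
                continue;
            stack[top++] = y * w + x;              /* seed a new blob           */
            img[y * w + x] = 128;                  /* mark as visited           */
            while (top > 0) {
                int p  = stack[--top];
                int px = p % w, py = p / w;
                area++;
                /* Push unvisited 4-connected foreground neighbours. */
                if (px > 0     && img[p - 1] == 255) { img[p - 1] = 128; stack[top++] = p - 1; }
                if (px < w - 1 && img[p + 1] == 255) { img[p + 1] = 128; stack[top++] = p + 1; }
                if (py > 0     && img[p - w] == 255) { img[p - w] = 128; stack[top++] = p - w; }
                if (py < h - 1 && img[p + w] == 255) { img[p + w] = 128; stack[top++] = p + w; }
            }
            if (area >= min_area)                  /* filter out noise blobs    */
                count++;
        }
    }
    free(stack);
    return count;
}

In the people-counting context, the returned count of sufficiently large blobs, possibly refined by head detection when people merge into one blob, is the estimated number of people in the frame.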


Chapter 5 RESULTS & CONCLUSION

5.1 TEST CODE FOR COLOR BARS

The following function generates a color-bars test pattern. The height and width parameters define the size of the generated color-bar buffer (YCbCr 4:2:2, two bytes per pixel), and the return value is a pointer to the buffer.

#include <stdlib.h>   /* malloc() */

/* Generate a color-bars test buffer of height x width pixels (YCbCr 4:2:2,
 * 2 bytes per pixel).  Returns a pointer to the allocated buffer, or NULL
 * if the allocation fails. */
void* generate_colorbars(int height,   /* height of colorbar buffer */
                         int width)    /* width of colorbar buffer  */
{
    /* Cb, Y0, Cr, Y1 values for the eight bars:
     * white, yellow, cyan, green, magenta, red, blue, black. */
    static const unsigned char bar[8][4] = {
        { 128, 180, 128, 180 },   /* white   */
        {  44, 162, 142, 162 },   /* yellow  */
        { 156, 131,  44, 131 },   /* cyan    */
        {  72, 112,  58, 112 },   /* green   */
        { 184,  84, 198,  84 },   /* magenta */
        { 100,  65, 212,  65 },   /* red     */
        { 212,  35, 114,  35 },   /* blue    */
        { 128,  16, 128,  16 }    /* black   */
    };
    int xx, yy;                               /* horizontal/vertical counters */
    int lineBytes = width * 2;                /* bytes per video line          */
    int barBytes  = lineBytes / 8;            /* bytes per color bar           */
    unsigned char *buf = (unsigned char *)malloc(height * lineBytes);

    if (buf == NULL)
        return NULL;

    /* Initialize the new buffer with a clear value. */
    for (xx = 0; xx < height * lineBytes; xx++)
        buf[xx] = 0x01;

    /* Fill every line with the eight bars in Cb Y0 Cr Y1 byte order. */
    for (yy = 0; yy < height; yy++) {
        for (xx = 0; xx < lineBytes; xx++) {
            int b = xx / barBytes;            /* which bar this byte falls in  */
            if (b > 7)
                b = 7;                        /* guard against rounding        */
            buf[(yy * lineBytes) + xx] = bar[b][xx % 4];
        }
    }
    return buf;
}   /* End generate_colorbars() */
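For reference, a small hedged usage sketch of the function above (the 720x480 frame size and the plain memcpy() into a display frame are assumptions; in the actual project the display frame is obtained through FVID_exchange() as shown in section 5.2):

#include <stdlib.h>
#include <string.h>

/* Fill an existing display frame with the generated color bars. */
void show_colorbars(unsigned char *displayFrame, int height, int width)
{
    void *bars = generate_colorbars(height, width);      /* e.g. 480, 720 */
    if (bars != NULL) {
        memcpy(displayFrame, bars, (size_t)height * width * 2);
        free(bars);                                       /* release temporary buffer */
    }
}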


FIGURE 5.1 : COLOR BAR OUTPUT


5.2 EDGE DETECTION USING IMGLIB

The Texas Instruments C64x+ IMGLIB is an optimized image/video processing function library for C programmers using TMS320C64x+ devices. It includes many C-callable, assembly-optimized, general-purpose image/video processing routines. These routines are used in real-time applications where optimal execution speed is critical. Using these routines assures execution speeds considerably faster than equivalent code written in standard ANSI C. In addition, by providing ready-to-use DSP functions, TI IMGLIB can significantly shorten image/video processing application development time.

In Code Composer Studio, IMGLIB can be added by selecting Add Files to Project from the Project menu and choosing imglib2.l64P from the list of libraries under the c64plus folder within the imglib_v2xx folder. Also, ensure that the project is linked to the correct run-time support library (rts64plus.lib). An alternative way to include the above two libraries in your project is to add the following lines to your linker command file: -lrts64plus.lib -limglib2.l64P. The include directory contains the header files that must be included in the C code when you call an IMGLIB2 function from C code, and it should be added to the "include path" in the CCS build options. The Image and Video Processing Library (IMGLIB) has about 70 building-block kernels that can be used for image and video processing applications.

IMGLIB includes:

Compression and decompression: DCT, motion estimation, quantization, wavelet processing

Image analysis: boundary and perimeter estimation, morphological operations, edge detection, image histogram, image thresholding

Image filtering and format conversion: image convolution, image correlation, median filtering, color space conversion

VLIB is a software library from TI with more than 40 kernels that accelerates video analytics development and increases performance by up to 10 times. These 40+ kernels provide the ability to perform:

Background Modeling & Subtraction

Object Feature Extraction

Tracking & Recognition

Low-level Pixel Processing

Step 1: Open the video preview project, video_preview.pjt.

Step 2: Add these two headers for the Sobel and median filter functions.


#include <C:\dvsdk_1_01_00_15\include\IMG_sobel_3x3_8.h>

#include <C:\dvsdk_1_01_00_15\include\IMG_median_3x3_8.h>

Step 3: Add the two calls below with the parameters shown. frameBuffPtr points to a "frame" structure; the frame buffer pointer is accessed as frameBuffPtr->frame.frameBufferPtr, and the last two arguments are the height and the width (in bytes) of the frame.

FVID_exchange(hGioVpfeCcdc, &frameBuffPtr);

IMG_sobel_3x3_8((frameBuffPtr->frame.frameBufferPtr),
                (frameBuffPtr->frame.frameBufferPtr), 480, 1440);

IMG_median_3x3_8((frameBuffPtr->frame.frameBufferPtr), 8,
                 (frameBuffPtr->frame.frameBufferPtr));

FVID_exchange(hGioVpbeVid0, &frameBuffPtr);

FIGURE 5.2: EDGE DETECTION OUTPUT WITH WHITE

BACKGROUND


5.3 Background Subtraction On DM6437

The simple way to access the frame buffer itself is to reach into the structure

with frameBuffPtr->frame.frameBufferPtr. This will return the address of the

current frame you recently swapped or plan to swap with FVID_exchange(). You

can also type cast it to a type you find more useful, below is a small example of

extracting the frame buffer pointer from a FVID_exchange() call.

int* framepointer;

FVID_exchange(hGioVpfeCcdc, &frameBuffPtr);

framepointer = (int*)(frameBuffPtr->frame.frameBufferPtr);

Note the frame you are now pointing to is an interleaved YCbCr 4:2:2 stream, so

every other byte is a Y, every 4th byte is a Cb, and every other 4th byte is a Cr (i.e.

Cb Y Cr Y Cb Y Cr Y Cb Y...), which also means the size of the buffer will be your

frame width x height x 2 bytes per pixel

/* Frame size: 720 x 576 pixels x 2 bytes per pixel (PAL, YCbCr 4:2:2). */
#define FRAME_SIZE 829440

static unsigned char arr[FRAME_SIZE];    /* stored background frame        */
static unsigned char arr1[FRAME_SIZE];   /* difference (foreground) frame  */

/* Overwrite every other byte of the frame (starting at byte 1) with 0x80. */
void imagebw(void *currentFrame, int x, int y)
{
    int xx;
    for (xx = 1; xx < (x * y) * 2; xx += 2)
        *(((unsigned char *)currentFrame) + xx) = 0x80;
}

/* Store the current frame as the background model. */
void copyframe(void *currentFrame)
{
    int xx;
    for (xx = 0; xx < FRAME_SIZE; xx++)
        arr[xx] = *(((unsigned char *)currentFrame) + xx);
}

/* Copy the difference frame back into the frame buffer for display. */
void writeframe(void *currentFrame)
{
    int xx;
    for (xx = 0; xx < FRAME_SIZE; xx++)
        *(((unsigned char *)currentFrame) + xx) = arr1[xx];
}

/* Subtract the stored background from the current frame. */
void subtract(void *currentFrame)
{
    int xx;
    for (xx = 0; xx < FRAME_SIZE; xx++)
        arr1[xx] = *(((unsigned char *)currentFrame) + xx) - arr[xx];
}
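A hedged sketch of how these helpers might be combined inside the main task, following the FVID_exchange() pattern of section 5.2 (whether the background frame is captured only once at start-up, as assumed here, depends on the application):

/* Assumed flow: grab one frame as the background, then repeatedly
 * subtract it from each new frame and display the difference. */
FVID_exchange(hGioVpfeCcdc, &frameBuffPtr);          /* capture background   */
copyframe(frameBuffPtr->frame.frameBufferPtr);       /* store it in arr[]    */

while (1) {
    FVID_exchange(hGioVpfeCcdc, &frameBuffPtr);      /* capture next frame   */
    subtract(frameBuffPtr->frame.frameBufferPtr);    /* arr1 = frame - arr   */
    writeframe(frameBuffPtr->frame.frameBufferPtr);  /* copy arr1 back       */
    FVID_exchange(hGioVpbeVid0, &frameBuffPtr);      /* display the result   */
}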

5.4 CONCLUSION

We have used the background subtraction method together with blob analysis and edge detection algorithms to estimate the count of people from a video. This technique gives accurate results in an environment where people are not carrying objects and are not too close to each other.

So, in a nutshell, we can successfully count people in a restricted environment, but to make the system more generic we will have to resort to other techniques.


5.5 FUTURE SCOPE

The system can be improved to make sure that objects carried by people in a video are not counted as separate people.

Also, if people are close to each other in the input video, then with the present system their count will not be accurate, as the blob analysis will be faulty. This can be improved in future systems.


CHAPTER 6 REFERENCES

[1] Roy-Erland Berg, "Real-time people counting system using video camera", presented 30 May 2007.

[2] Damien Lefloch, "Real-Time People Counting System Using Video Camera", Master of Computer Science thesis, Image and Artificial Intelligence, 2007. Supervisors: Jon Y. Hardeberg, Faouzi Alaya Cheikh, Pierre Gouton.

[3] TI E2E Community - https://e2e.ti.com/

[4] https://processors.wiki.ti.com

[5] Djamel Merad, Kheir-Eddine Aziz and Nicolas Thome, "Fast People Counting Using Head Detection from Skeleton Graph", 2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance.

[6] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing.

[7] Al Bovik (Ed.), Handbook of Image and Video Processing.

[8] TMS320C64x+ DSP Image/Video Processing Library (v2.0.1), Programmer's Guide, Texas Instruments.

[9] TMS320DM6437 DVDP Getting Started Guide.

[10] TMS320DM6437 Datasheet.

[11] Spectrum Digital Technical Reference Manual, DM6437.


APPENDIX

A) Video File Formats

The three most popular color models are RGB (used in computer

graphics), YIQ, YUV, or YCbCr (used in video systems) and CMYK (used in color

printing). The YUV color space is used by the PAL, NTSC and SECAM (Séquentiel Couleur Avec Mémoire, or Sequential Color with Memory) composite color video

standards. The black-and-white system used only luma (Y) information; color

information (U and V) was added in such a way that a black-and-white receiver

would still display a normal black-and-white picture. For digital RGB values with a

range of 0:255, Y has a range of 0:255, U a range of 0 to +/-112, and V a range of 0

to +/-157. YCbCr, or its close relatives, Y’UV, YUV, Y’CbCr, and Y’PbPr are

designed to be efficient at encoding RGB values so they consume less space while

retaining the full perceptual value. YCbCr is a scaled and offset version of YUV

color space. Y is defined to have a nominal 8-bit range of 16-235; Cb and Cr are

defined to have a nominal range of 16-240. There are several YCbCr sampling

formats, such as 4:4:4, 4:2:2, 4:1:1, and 4:2:0.

Now, if we filter a 4:4:4 YCbCr signal by subsampling the Chroma by a factor of two horizontally, we end up with "4:2:2", which implies that there are four luma values for every two pairs of Chroma values on a given video line. Each (Y, Cb) or (Y, Cr) pair represents one pixel value. Another way to say this is that a Chroma pair coincides spatially with every other luma value. 4:2:2 YCbCr qualitatively shows little loss in image quality compared with its 4:4:4 YCbCr source, even though it represents a saving of 33% in bandwidth over 4:4:4 YCbCr. 4:2:2 YCbCr is a foundation for the ITU-R BT.601 video recommendation, and it is the most common format for transferring digital video between subsystem components.

FIGURE A.1 : YCbCr Sampling
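To make the 4:2:2 packing concrete, the following small sketch (plain C, our own helper, not part of the project code) reads one Cb Y0 Cr Y1 group from an interleaved line and reconstructs the two pixels it carries:

/* One pixel with its (shared) chroma values. */
typedef struct { unsigned char y, cb, cr; } PixelYCbCr;

/* Read one 4:2:2 group (Cb Y0 Cr Y1 = two pixels sharing a Cb/Cr pair).
 * 'line' points to the start of an interleaved video line and 'pair'
 * selects which two-pixel group to read. */
void read_422_pair(const unsigned char *line, int pair,
                   PixelYCbCr *left, PixelYCbCr *right)
{
    const unsigned char *p = line + pair * 4;   /* 4 bytes per 2 pixels */
    left->cb  = right->cb = p[0];               /* shared Cb            */
    left->y   = p[1];                           /* Y0                   */
    left->cr  = right->cr = p[2];               /* shared Cr            */
    right->y  = p[3];                           /* Y1                   */
}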