
Challenges and Opportunities in DNN-Based Video Analytics: A Demonstration of the BlazeIt Video Query Engine

Daniel Kang, Peter Bailis, Matei Zaharia
Stanford DAWN Project, InfoLab

ABSTRACT

As video volumes grow, analysts are increasingly able to query the real world. Since manually watching these growing volumes of video is infeasible, analysts have increasingly turned to deep learning to perform automatic analyses. However, these methods are costly (running up to 10x slower than real time, i.e., 3 fps) and cumbersome to deploy, requiring complex, imperative code written against many low-level libraries (e.g., OpenCV, MXNet). There is an incredible opportunity to leverage techniques from the data management community to automate and optimize these analytics pipelines. In this paper, we describe our ongoing work in the Stanford DAWN lab on BlazeIt, an analytics engine for scalable and usable video analytics that currently contains an optimizing query engine. We propose a demonstration of BlazeIt's query language, FrameQL, its use cases, and our preliminary work on debugging machine learning, which will show the feasibility of video analytics at scale. We further describe the challenges that arise from large-scale video, the progress we have made in automating and optimizing video analytics pipelines, and our plans to extend BlazeIt.

1 Introduction

Video can be used to answer queries about the real world, both over large historical datasets (e.g., how many people were in Times Square in July?) and in real time (e.g., which streets are the busiest now?), in domains ranging from urban planning to autonomous vehicles to ornithology. For example, an urban planner working on traffic meter setting or city planning [5] may be interested in whether Mondays have notably different traffic volumes than Tuesdays, and may count the number of cars that enter and exit various highway exits. An analyst at an autonomous car company may notice that the car behaves strangely at lane dividers with poor markings and search for events at such lane dividers [20]. An ornithologist may be interested in the feeding patterns of birds and monitor various bird feeders.

It is not cost-effective, and is too time-consuming, to manually watch these growing quantities of video (London alone has over 500k CCTVs [1]) to answer these questions. Automated methods of video analysis are increasingly important in answering such queries.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2019. 9th Biennial Conference on Innovative Data Systems Research (CIDR '19), January 13-16, 2019, Asilomar, California, USA.

Figure 1: Architecture diagram for BlazeIt. (Front end: a Web UI / SQL console accepts a FrameQL query and user-defined invariants. Back end: query parsing determines the necessary fields and query type; model tuning automatically tunes a small model using the invariants; rule-based optimization produces an optimized plan for the query type, i.e., aggregation, scrubbing, or content-based selection; query execution returns a relation or visualization.)

Fortunately, modern computer vision (CV) techniques have made great strides in automating these tasks, with near-human levels of accuracy for some restricted tasks [9] and rapid progress on others [8], in the form of deep neural networks.

These trends present an incredible opportunity for visual data management research. We believe that applying data management techniques to video will enable analyses over the real world, much as relational databases have enabled analyses over structured data.

Unfortunately, it is prohibitively expensive to naively deploy these deep models at scale. While these models could be run exhaustively over video to extract relevant information that is subsequently used to answer queries via a traditional query engine, they come at great computational cost. Running a deep model over a single frame of video can take up to hundreds of billions of FLOPs (and run up to 10x slower than real time), making it infeasible to exhaustively extract information from video. Systems builders have created custom pipelines for specific tasks (e.g., binary detection) [11, 15], but these systems cannot adapt to users' queries. As a result, we believe it is necessary to rethink analytics for video.

In addition to the computational expense, analysis with deep networks poses several other challenges:

• Scalability: We believe the trend of deeper and more computationally expensive models will continue. As networks today are already infeasible to run at scale, we view scalability as a major obstacle to video analytics.

• Usability: Using these deep networks requires knowledge of CV, deep learning, and programming, and often requires writing complex, ad-hoc code.

• Integration with the data ecosystem: Many video queries (e.g., searching for actions or events) are not easily expressed in standard SQL.

• Storage and indexing: As running deep models over every frame of video is infeasible, we can only index a small fraction of the video. Additionally, existing video storage formats are not well suited for indexing.

• Real-time analysis and actuation: As cameras proliferate, so will the demand for real-time analysis and actuation (e.g., decision making). Due to the size of video, streaming the data back to a central server is infeasible.

We have begun to build a system called BlazeIt for usable and scalable video analytics, to demonstrate the feasibility of video analytics at scale and to address the challenges listed above. BlazeIt currently contains an optimizing query engine; its system diagram is shown in Figure 1.

In the remainder of this paper, we describe our current progress on BlazeIt, our plans to extend BlazeIt, and our proposed demonstration of BlazeIt.

2 Related Work

BlazeIt builds on a long tradition of data management for multimedia and video, and on recent advances in CV. We outline the most relevant parts of the literature below.

Visual data management. Visual data management systems have aimed to organize and query visual data [6, 24]. These systems were followed by a range of "multimedia" databases for storing [4, 19], querying [3, 18], and managing [13, 27] video data. These systems use classic CV techniques such as low-level image features (e.g., colors, textures) and/or rely on textual annotations for semantic queries. However, recent advances in CV allow the automatic population of semantic data, and we believe it is critical to reinvestigate these systems.

Many query languages for visual data have been developed [12, 22]. However, much of this prior work has assumed that the data is provided (e.g., actors in a movie) or that low-level image features are queried. In this work, we describe how these fields can be automatically populated and how queries over them can be optimized.

Modern video analytics. Systems builders have created specific pipelines for video analytics, but have yet to build a complete query system with a rich query language. For example, [2, 11, 15] optimize binary detection (detecting the presence or absence of a particular object class in video). However, these systems cannot adapt to users' queries. In BlazeIt, we augment these pipelines with a rich query language and novel optimizations that these systems do not support.

Other systems reduce the latency of live queries [28] or increase the throughput of batch analytics [25] when the computation is pre-defined as a black-box computation graph. These systems cannot perform certain optimizations that BlazeIt can, as they do not have access to the computation semantics. BlazeIt could be run on top of such systems for live queries or scale-out.

A range of contemporary systems aim to optimize complementary functionality, including arbitrary UDFs [23], visualization [26], and patch-based visual analytics [17]. To the best of our knowledge, BlazeIt is the first system with a declarative query language and query optimizer for visual analytics that automatically extracts information directly from pixel data, without requiring knowledge of the neural networks used to process that data.

SELECT FCOUNT(*)
FROM taipei
WHERE class = 'car'
ERROR WITHIN 0.1
AT CONFIDENCE 95%

(a) The FrameQL query for counting the frame-averaged number of cars.

SELECT timestamp
FROM taipei
GROUP BY timestamp
HAVING SUM(class='bus') >= 1
   AND SUM(class='car') >= 5
LIMIT 10 GAP 300

(b) The FrameQL query for selecting 10 frames with at least one bus and at least five cars, with each frame at least 10 seconds apart.

SELECT *
FROM taipei
WHERE class = 'bus'
  AND redness(content) >= 17.5
  AND area(mask) > 100000
GROUP BY trackid
HAVING COUNT(*) > 15

(c) The FrameQL query for selecting all the information about red buses at least 100,000 pixels large that appear in the scene for at least 0.5 s (at 30 fps, 0.5 s is 15 frames). The last constraint is for noise reduction.

Figure 2: Three FrameQL example queries.

3 BlazeIt Overview

Over the past two years, we have built systems for scalable and usable video analytics in the DAWN lab at Stanford University, beginning with NoScope [15]. NoScope leveraged model specialization, in which a smaller, cheaper model is trained to mimic a larger reference model on a reduced task (e.g., counting cars in a frame vs. full object detection). However, the initial NoScope system only optimizes binary detection.

In response, we have initiated work on BlazeIt [14], which significantly expands NoScope with an optimizing query engine and a query language, FrameQL. BlazeIt's primary goal is to execute FrameQL queries efficiently, as materializing the records is the primary computational bottleneck. Despite our progress, several challenges remain. We describe BlazeIt below and describe our plans to extend BlazeIt in Section 4.

3.1 FrameQL Overview

BlazeIt queries are specified in FrameQL, a SQL-like language. FrameQL allows users to query the objects appearing in a video feed over space and time, by content and location, at the frame level. We chose a declarative language for two reasons. First, by providing a table-like schema using standard relational algebra, we enable users familiar only with SQL to query videos, whereas manually extracting the data necessary for these queries would require expertise in deep learning, CV, and programming. Second, the separation of specification and implementation enables new forms of optimization.

In FrameQL's data model, each video (stored and compressed in formats such as H.264) is represented as a relation, so videos can be composed with relational operators. FrameQL supports selection, projection, and aggregation. FrameQL's data schema contains fields relating to the time, location, and class of objects, scene and global identifiers, the box contents, and the features from the object detection method. BlazeIt can populate these fields automatically. We show three example queries in Figure 2, with full details in [14].
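As an illustration of this data model, the sketch below encodes one FrameQL record as a Python dataclass. The field names mirror those used in the Figure 2 queries (timestamp, class, mask, trackid, content); the exact types, and the use of a dataclass at all, are assumptions for illustration rather than BlazeIt's actual representation.

from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class FrameQLRecord:
    # One row of a video's relation: a single detected object in a single frame.
    timestamp: int                        # frame number in the video
    class_: str                           # object class, e.g., 'car' ('class' is reserved in Python)
    mask: Tuple[int, int, int, int]       # bounding box (xmin, ymin, xmax, ymax)
    trackid: Optional[int] = None         # identifier linking one object across frames
    content: Optional[np.ndarray] = None  # raw pixels inside the box
    features: Optional[np.ndarray] = None # features from the object detection method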

3.2 BlazeIt’s Architecture

To materialize FrameQL records, BlazeIt uses object detection, which, given a frame of video, returns the set of bounding boxes and class information for the objects in that frame. These object detectors come in the form of deep networks.

In traditional query processing engines, records are typically cheap to process (e.g., hundreds of cycles to apply a predicate), but materializing the records for a frame of video, i.e., performing object detection, is extremely expensive (e.g., billions of FLOPs for Mask R-CNN [8]). This computational cost makes the naive solution of materializing all the records infeasible.

When the user issues a FrameQL query, BlazeIt's query engine optimizes and executes it. BlazeIt's primary challenge is executing the query efficiently: naive methods, such as performing object detection on every frame or using NoScope [15] as a filter, are often prohibitively slow. To optimize and execute the query, BlazeIt inspects the query contents to see if optimizations can be applied. For example, BlazeIt cannot optimize SELECT *, but it can optimize computing the average number of cars per frame by sampling and applying model specialization [15].

Because object detection is the major computational bottleneck, BlazeIt's optimizer primarily attempts to reduce the number of object detection calls while achieving the target accuracy. BlazeIt leverages three novel optimizations, plus existing techniques from NoScope, to reduce the computational cost of object detection. One key primitive we use in BlazeIt is model specialization, which we developed in NoScope [15].
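To make the specialization primitive concrete, the following is a minimal sketch, in PyTorch, of training a tiny counting network on labels produced by the expensive reference detector. This is an illustrative reconstruction, not BlazeIt's actual code; the architecture, hyperparameters, and the inputs frames and reference_counts are all assumptions.

import torch
import torch.nn as nn

class SpecializedCounter(nn.Module):
    # A tiny CNN mapping a downsampled frame to a car count, treated as
    # classification over counts 0..max_count. Orders of magnitude cheaper
    # per frame than full object detection.
    def __init__(self, max_count=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, max_count + 1))

    def forward(self, x):
        return self.net(x)

def specialize(frames, reference_counts, epochs=5):
    # frames: (N, 3, H, W) float tensor of downsampled video frames.
    # reference_counts: (N,) long tensor of per-frame car counts produced
    # offline by the expensive reference detector.
    model = SpecializedCounter()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        loss = loss_fn(model(frames), reference_counts)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model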

BlazeIt currently targets aggregation (e.g., Figure 2a), scrubbing (e.g., Figure 2b), and content-based selection (e.g., Figure 2c):

• Aggregation: To optimize aggregation, BlazeIt uses specialized networks to directly compute the answer or to reduce the variance in sampling (see the sketch after this list).

• Scrubbing: To optimize scrubbing, BlazeIt trains specialized networks to search for frames matching the predicate and adapts importance sampling for fast search.

• Content-based selection: To optimize content-based selection, BlazeIt infers, from the query, a set of filters that discards irrelevant frames.

These optimizations can provide up to three orders of magnitude speedup; [14] provides full details.
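As one illustration of the aggregation optimization, the sketch below treats the specialized model's cheap per-frame counts as a control variate: the expensive detector runs only on a random sample of frames, and the specialized counts, which are available for every frame, correct the sampled estimate. This follows the general control-variates idea; the exact estimator BlazeIt uses is described in [14], and the function names here are assumptions.

import numpy as np

def estimate_avg_count(specialized_counts, detect, sample_size, seed=0):
    # specialized_counts: np.ndarray of cheap per-frame counts from the
    # specialized model, one entry per frame of the video.
    # detect: expensive callable frame_index -> true object count,
    # i.e., running the full object detector on that frame.
    rng = np.random.default_rng(seed)
    n = len(specialized_counts)
    idx = rng.choice(n, size=sample_size, replace=False)
    true = np.array([detect(i) for i in idx], dtype=float)
    proxy = specialized_counts[idx].astype(float)
    # Control variate: the specialized model's mean over ALL frames is
    # essentially free, so shift the sampled estimate by the proxy's
    # observed sampling error, scaled by the optimal coefficient c.
    var = proxy.var()
    c = np.cov(true, proxy)[0, 1] / var if var > 0 else 0.0
    return true.mean() - c * (proxy.mean() - specialized_counts.mean())

Because the specialized counts correlate strongly with the true counts, this corrected estimator has lower variance than plain random sampling at the same number of detector calls.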

4 Ongoing Work

While the core ideas in FrameQL and in BlazeIt's current optimizations offer significant speedups for video analytics, many challenges remain. We describe several of these challenges, as well as how we are incorporating solutions into BlazeIt.

Scalability. While model specialization can currently speed up restricted tasks (e.g., counting, classification), some queries require the full functionality of object detection, the overwhelming bottleneck for many queries. We are exploring ways of automatically specializing object detection methods.

Off-the-shelf object detectors are trained on generic datasets but are often deployed in specific scenarios, e.g., on a street corner in New York or inside a store. While powerful object detection methods (e.g., Mask R-CNN [8]) may be deployable in all such scenarios, weaker models (e.g., SSD [21] with a MobileNet [10] backbone) may not be as accurate. However, similar to how specialized models forego full generality for inference speed, we plan to explore fast methods of specializing these weaker object detection models to restricted scenarios (e.g., a model inside a store may not need to detect cars).
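One plausible starting point, sketched below under stated assumptions, is distillation: run the expensive detector offline on frames from the target camera and use its high-confidence boxes as pseudo-labels to fine-tune the weaker model. Here strong_detector, weak_detector, and fine_tune are hypothetical placeholders, not APIs of any particular library.

def specialize_detector(frames, strong_detector, weak_detector, fine_tune,
                        classes_of_interest=frozenset({'car', 'bus', 'person'})):
    # Distill a strong detector into a weak one for a fixed camera.
    # strong_detector(frame) -> list of (box, cls, score) detections;
    # fine_tune(model, labeled) updates the weak model on (frame, boxes)
    # pairs. All three callables are hypothetical placeholders.
    pseudo_labels = []
    for frame in frames:
        boxes = [(box, cls) for box, cls, score in strong_detector(frame)
                 if score > 0.8 and cls in classes_of_interest]
        pseudo_labels.append((frame, boxes))
    fine_tune(weak_detector, pseudo_labels)
    return weak_detector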

Storage and indexing. While video storage has been extensively studied for efficient compression, there are new opportunities in storage and indexing for neural network-based analytics engines.

For efficient storage, videos are compressed by exploiting spatial and temporal redundancy. For example, many storage formats (including H.264) largely consist of two types of frames: I-frames, which capture full frames of video, and P-frames, which capture deltas between frames. We believe learning format-specific neural networks for compressed data is promising and can avoid decoding P-frames in many situations, similar to [7].
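As a concrete look at the structure such format-specific methods could exploit, the snippet below lists per-frame types using ffprobe (a standard ffmpeg tool); a format-aware network would aim to consume the P-frame deltas directly instead of paying to decode them into full frames. This illustrates the frame structure only and is not a sketch of the proposed method.

import subprocess

def frame_types(path):
    # Return the sequence of frame types (e.g., 'I', 'P', 'B') in a video.
    # I-frames are self-contained images; P-frames encode deltas from
    # previously decoded frames.
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-select_streams', 'v:0',
         '-show_entries', 'frame=pict_type', '-of', 'csv=p=0', path],
        capture_output=True, text=True, check=True)
    return [line.strip().rstrip(',') for line in out.stdout.splitlines() if line.strip()]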

Additionally, we have found that specialized neural networks can operate on significantly reduced representations of the video frames (e.g., scaled-down frames or the outputs of specific convolution layers). We plan to explore methods of indexing via specialized networks for exploratory queries, while maintaining high accuracy for downstream queries.

Real-time analytics and actuation. While batch analytics can be useful for understanding aggregate behavior, many scenarios require real-time analytics and actuation, e.g., real-time traffic meter setting.

Deep learning raises a new set of concerns for real-time analytics. Many robots (e.g., autonomous vehicles, drones) and sensors (e.g., street cameras) will contain accelerators that run deep networks for a restricted set of tasks, possibly at reduced accuracy. For example, an autonomous vehicle may have an object detection network but no model for collecting demographic information about the pedestrians it sees. Additionally, it will become increasingly costly to transfer data to a central server for storage and processing.

We believe model specialization for real-time analytics on edge devices, combined with bandwidth reduction, is promising. For example, edge devices face a number of competing constraints (e.g., power, bandwidth, accuracy), and we believe extending inference-aware model search to account for multiple constraints is a promising direction.

Usability via debugging. Despite advances in deep learning and specialization, deep and specialized models are not always accurate, and they often fail in predictable ways. The accuracy of these methods is a major obstacle to accurate analysis.

We have begun work on a model debugging system that will help identify and fix model issues via model assertions [16]. Model assertions are boolean statements over the output of models, similar to program assertions. For example, an analyst may write an assertion that there should not be three car bounding boxes nested within another car's bounding box (a common error for lower-quality object detection methods). These assertions can be used to fix behavior at run time, or to collect data to improve models.
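To make this concrete, the sketch below expresses that example assertion as a plain Python predicate over one frame's detections. The (box, class) record format is an assumption carried over from the schema sketch in Section 3.1, and the containment test is deliberately simplistic.

def box_contains(outer, inner):
    # True if box `inner` lies inside box `outer`.
    # Boxes are (xmin, ymin, xmax, ymax).
    return (inner[0] >= outer[0] and inner[1] >= outer[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

def no_nested_cars(detections, max_nested=3):
    # Model assertion: returns False (assertion violated) if at least
    # `max_nested` 'car' boxes fall within another car's bounding box,
    # a common failure mode of lower-quality object detectors.
    # detections: list of (box, class) pairs for a single frame.
    cars = [box for box, cls in detections if cls == 'car']
    for outer in cars:
        nested = sum(1 for inner in cars
                     if inner is not outer and box_contains(outer, inner))
        if nested >= max_nested:
            return False
    return True

Frames where such an assertion returns False can be surfaced to the analyst at run time or logged as training data, matching the two uses described above.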

5 Demonstration

To highlight the ease of use of FrameQL for video analytics, we will present a demonstration of our UI and accompanying query engine. Our demonstration will contain two parts: a textual interface for users to issue FrameQL queries, and a visualization tool to display the frames and metadata of the returned results.

The textual interface will accept FrameQL queries such as those in Figure 2. For queries that return summary statistics (i.e., aggregation), the demo will display the output as-is. For queries that return frame numbers and box information, we will provide a visualization tool that can show frames, or a sequence of frames, with the boxes (and metadata) drawn on them. An example of the interface is shown in Figure 3 and an example of a visualized frame in Figure 4b.

We will demonstrate two use cases of BlazeIt in the urban planning scenario: 1) exploratory queries and 2) model debugging via user-defined invariants.

Exploratory query use case. We will first demonstrate the ease of use of FrameQL for an urban planning use case. The analyst has access to a set of cameras at street corners.


Figure 3: An example query with example answer and visualized frame. (The demo UI at http://ddkang.github.io/blazeit lets the user ingest a video by name, issue a FrameQL query such as the bus-and-car query of Figure 2b with LIMIT 2 GAP 300, and view the SQL output, a list of matching timestamps, alongside a visualization of a matching frame.)

(a) An example when the assertion triggers. (b) Simply collecting instances of objects is not sufficient.

Figure 4: An example of an error an object detection method can make on night-street.


First, we demonstrate how the urban planner can count the number of cars in the video, which can be accomplished with the query in Figure 2a. The query specifies an error bound via ERROR for fast execution. This count information can be used to set traffic metering or to understand traffic patterns.

Second, we demonstrate how the urban planner can look for rare events by querying for events with at least five cars and at least one bus, which can be accomplished with the query in Figure 2b. The urban planner may be interested in such events to understand how public transit interacts with congestion.

Finally, we demonstrate how to perform content-based selection by searching for red buses, which can be accomplished with the query shown in Figure 2c. The urban planner may be interested in searching for such events to understand how tourism affects traffic, and looks for red buses as a proxy for tour buses. This query shows how to exhaustively select frames with red buses. Here, redness and area return the measure of redness of an image and the area of a box, respectively.
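For illustration, the following is one plausible implementation of these two UDFs, assuming the box contents arrive as an RGB NumPy array and the mask as a bounding box; the actual redness measure (and thus the meaning of the 17.5 threshold in Figure 2c) is not specified in this paper, so this definition is an assumption.

import numpy as np

def redness(content):
    # One plausible redness score (an assumption, not BlazeIt's definition):
    # the mean of the red channel minus the mean of the other channels,
    # over the pixels of the box contents (H x W x 3, RGB).
    content = content.astype(float)
    return content[..., 0].mean() - content[..., 1:].mean()

def area(mask):
    # Area in pixels of a bounding box (xmin, ymin, xmax, ymax).
    return (mask[2] - mask[0]) * (mask[3] - mask[1])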

Model debugging use case. We will additionally demonstrate ongoing work [16] to extend BlazeIt with debugging capabilities via user-defined invariants.

An analyst viewing video of a street corner may notice that a common error is multiple overlapping boxes for a single car when there should be only one. Collecting instances of this error is prohibitively expensive with random sampling or by collecting instances of objects, as shown in Figure 4b.

The analyst writes an assertion that several car boxes should not overlap. By running this assertion, the analyst is able to collect training instances (Figure 4a) to improve the model. We will show the improved model performance after it is trained on the results of the triggered assertions.

6 Conclusions

Video is a rich, rapidly growing source of high-value, high-volume data. We have outlined several key challenges in querying video, ranging from scalability to storage and indexing. In response, we are developing BlazeIt to demonstrate the feasibility of usable video analytics at scale. We show that a large class of queries regarding the spatiotemporal information of objects in video can be answered in FrameQL, and that BlazeIt can deliver several orders of magnitude speedup over baselines by leveraging model specialization, sampling, and learned filters. However, much work remains to make video analytics usable and scalable. We believe BlazeIt will be a valuable platform on which to build scalable and usable video analytics engines.

Acknowledgments

This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Google, Intel, Microsoft, NEC, Teradata, and VMware), as well as by DARPA under No. FA8750-17-2-0095 (D3M), industrial gifts and support from Toyota Research Institute, Juniper Networks, Keysight Technologies, Hitachi, Facebook, Northrop Grumman, and NetApp, and the NSF under grants DGE-1656518 and CNS-1651570.

7 References

[1] CCTV: Too many cameras useless, warns surveillance watchdog Tony Porter, 2015.

[2] M. R. Anderson, M. Cafarella, T. F. Wenisch, and G. Ros. Predicate optimization for a visual analytics database. SysML, 2018.

[3] W. Aref, M. Hammad, A. C. Catlin, I. Ilyas, T. Ghanem, A. Elmagarmid, and M. Marzouk. Video query processing in the VDBMS testbed for video database research. In ACM-MMDB. ACM, 2003.

[4] F. Arman, A. Hsu, and M.-Y. Chiu. Image processing on compressed data for large video databases. In ACM International Conference on Multimedia. ACM, 1993.

[5] J. A. Burke, D. Estrin, M. Hansen, A. Parker, N. Ramanathan, S. Reddy, and M. B. Srivastava. Participatory sensing. In WSW, 2006.

[6] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, et al. Query by image and video content: The QBIC system. Computer, 28(9), 1995.

[7] L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski. Faster neural networks straight from JPEG. In NeurIPS, 2018.

[8] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV. IEEE, 2017.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[11] K. Hsieh, G. Ananthanarayanan, P. Bodik, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. OSDI, 2018.

[12] E. Hwang and V. Subrahmanian. Querying video libraries. Journal of Visual Communication and Image Representation, 7(1), 1996.

[13] R. Jain and A. Hampapur. Metadata in video databases. SIGMOD, 1994.

[14] D. Kang, P. Bailis, and M. Zaharia. BlazeIt: Fast exploratory video queries using neural networks. arXiv preprint arXiv:1805.01046, 2018.

[15] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. PVLDB, 10(11), 2017.

[16] D. Kang, D. Raghavan, P. Bailis, and M. Zaharia. Model assertions for debugging machine learning. NeurIPS MLSys Workshop, 2018.

[17] S. Krishnan, A. Dziedzic, and A. J. Elmore. DeepLens: Towards a visual data management system. In CIDR, 2019.

[18] M. La Cascia and E. Ardizzone. JACOB: Just a content-based query system for video databases. In ICASSP, volume 2. IEEE, 1996.

[19] J. Lee, J. Oh, and S. Hwang. STRG-Index: Spatio-temporal region graph indexing for large video databases. In SIGMOD. ACM, 2005.

[20] T. Lee. There's growing evidence Tesla's Autopilot handles lane dividers poorly. Ars Technica, 2018.

[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV. Springer, 2016.

[22] C. Lu, M. Liu, and Z. Wu. SVQL: A SQL extended query language for video databases. IJDTA, 8(3), 2015.

[23] Y. Lu, A. Chowdhery, S. Kandula, and S. Chaudhuri. Accelerating machine learning inference with probabilistic predicates. In SIGMOD, 2018.

[24] V. E. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. Computer, 28(9), 1995.

[25] A. Poms, W. Crichton, P. Hanrahan, and K. Fatahalian. Scanner: Efficient video analysis at scale. SIGGRAPH, 2018.

[26] W. Tao, X. Liu, C. Demiralp, R. Chang, and M. Stonebraker. Kyrix: Interactive visual data exploration at scale. In CIDR, 2019.

[27] A. Yoshitaka and T. Ichikawa. A survey on content-based retrieval for multimedia databases. TKDE, 11(1), 1999.

[28] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, and M. J. Freedman. Live video analytics at scale with approximation and delay-tolerance. In NSDI, volume 9, 2017.