
Detection of missing and corrupted data from a huge volume of traffic data
AVV Transport Research Centre, the Netherlands
K.F. Chan
Viking workshop, Copenhagen, Oct. 2005

Content of the presentation
  1. Background
  2. The information chain
  3. System quality & chain quality
  4. Current situation
  5. Lessons learned from a pilot
  6. The way ahead

Background: What are the problems?
  Huge volume of data
  Different information chains
  Different systems adopted in a chain
  Different users with different requirements
  Few users validate the data (limited time and budget)
  Only "yes/no" validation to select "clean" data at the end
  No mechanism to trace and eliminate the sources of errors
  No complete validation of an entire chain

Information chain
  Information chain with many processing stations
  Independent development/maintenance per station
  Data delivery at different aggregation levels
  Delivery to on-line & off-line users

Information chain: on-line
  collection, processing, distribution, use

Information chain: off-line
  collection, processing, history, use
  Statistical research
  Traffic engineering research
  DB

Sources of error
  Detectors: physical error (e.g. distortion); error in adjustment
  OS: disturbances in electronics
  Communication: down, disturbance, interference
  Measurement technique: inaccuracy, e.g. stop-and-go traffic; extremely low intensity; alternating traffic flow
  Temporary changes in road situations (e.g. road works, incidents)
  Software: bugs, configuration errors
  Human errors
  Others

How to measure quality?
  Inspection of an individual station:
    1. Availability: is any data missing?
    2. Correctness: is the data correct?
  Inspection of the entire information chain:
    3. Is data missing in a single station or in multiple stations?
    4. Is incorrect data contaminating a single station or the entire information chain?

How to measure data quality? (step 1)
  Availability: is any data missing?
  Inspection: where are the missing data?
  Analysis: how serious are they? (location, frequency, duration, etc.)
  Actions: what to do with them? (remove origin, estimation, postpone, ...)
  [Figures: V (time vs space); V (day vs time); I (time vs space)]

How to measure data quality? (step 2)
  Correctness: is the data correct?
  Detection: how to trace suspected data?
  Analysis: is the suspected data really incorrect?
  Actions: what to do with them? (ignore them, disable them, replace them with estimated values, ...)
  [Figures: normal daily pattern; suspected daily pattern]

How to determine chain quality? (step 3)
  Effect of missing data for the entire chain
  Inspection: which station causes the missing data?
  Analysis: what are the cause and the consequence for the entire chain?
  Problem: how to integrate different data sources and how to navigate between them?

How to determine chain quality? (step 4)
  Effect of incorrect data for the entire chain
  Detection: what is the overall system quality?
  Analysis: is the effect negligible?
  Problem: how to validate the data of the entire chain? What is the investment, and is it cost-effective?
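The detection part of step 2 (telling a normal daily pattern from a suspected one) can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used at AVV: the sample intensities, the band width `K`, and the helper names `reference_band` and `suspected_slots` are all hypothetical, and the band is simply the historical mean plus or minus `K` standard deviations per time slot.

```python
from statistics import mean, stdev

# Hypothetical intensities (vehicles/hour) at one location on earlier,
# trusted days; each inner list is one day, shortened to 6 time slots.
history = [
    [200, 900, 1500, 1100, 1400, 600],
    [220, 950, 1450, 1050, 1500, 640],
    [180, 880, 1550, 1150, 1450, 580],
    [210, 920, 1480, 1080, 1420, 610],
]

K = 3.0  # assumed band width, in standard deviations


def reference_band(days, k=K):
    """Return one (lower, upper) bound per time slot from historical days."""
    band = []
    for slot_values in zip(*days):  # all days' values for the same slot
        m, s = mean(slot_values), stdev(slot_values)
        band.append((m - k * s, m + k * s))
    return band


def suspected_slots(day, band):
    """Return the indices of time slots whose value lies outside the band."""
    return [i for i, (value, (low, high)) in enumerate(zip(day, band))
            if not low <= value <= high]


band = reference_band(history)
normal_day = [205, 910, 1490, 1090, 1440, 605]
suspect_day = [205, 910, 300, 1090, 1440, 605]  # intensity collapses in slot 2

print(suspected_slots(normal_day, band))   # no slot flagged
print(suspected_slots(suspect_day, band))  # slot 2 flagged
```

A flagged slot is only a suspect. As the slide notes, the analysis step still has to decide whether the value is really incorrect or reflects a real event such as road works or an incident.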
Current NL situation
  Sub-optimal: incoherent validations by different users
  Different methods, different concepts, different aggregations
  Overlap, complementary, limited budget and time
  Restricted to own use
  Restricted to a single dimension, a single system, no chain validation
  COTS products are seldom used

Example 1: INTENS, validation in time
  Validation using the daily intensity pattern
  Fully operational application (10 years in service)
  Collect historical data
  Construct reference day-curves per location (e.g. 2 )
  Check whether a new day-curve lies within the historical spectrum
  (Partially automated)

Example 2: Validation in space
  Conservation of flow (COF) principle
  Determine the COF bias for each node
  COF equation applied to the nodes of the entire network
  Compare the day-intensities of neighbouring nodes
  Disqualify unreliable nodes
  [Figure labels: A, B]

Pilot
  Orientation pilot on data inspection
  Proof of concept for chain data inspection of multiple stations
  Experiments with COTS products
  Emphasis on high performance and high accessibility
  OLAP functions: drill-up; drill-down through different data
  Generic data model
  Short time-to-product

Lessons learned from a pilot
  [Figure labels: Technical users; Management users; Business users]
  COTS products vs own development
    OLAP products are commercially available for visual inspection
    Algorithms for automatic detection and estimation need to be developed
  Different users, different requirements
    Management users: high abstraction, concise information
    Business users: medium abstraction, just enough detail
    Technical users: concrete information, high level of detail

The foundation!
  A generic, multi-dimensional data model
    x: geography (point, cross-section, link, segment, trajectory)
    d: type of day
    t: time of day
    c: categories
    M(x,d,t,c): data contents [V, I, rt, availability, etc.]

Example: chain station 3, velocity per segment
  x = road segments; T = time of day; M = speed; D = day
  x = road segments; D = days of a month; M = speed; T = 8:00
  T = time of day; D = days of a month; M = speed; X = road segment
  x = road segments / measurement sections / measurement cross-sections, etc.; T = time of day; M = speed, intensity, rt, etc.; d = days

Example: chain station 1, intensity per location
  x = measurement cross-sections; T = time of day; M = intensity; D = day
  x = measurement cross-sections; D = days of a month; M = intensity; T = 8:00
  T = time of day; D = days of a month; M = intensity; X = measurement cross-section
  x = road segments / measurement sections / measurement cross-sections, etc.; T = time of day; M = speed, intensity, rt, etc.; d = days

Example: comparison of raw & estimated data
  [Figures: raw data; data with estimation]

The way ahead
  Project Da Vinci ( )
  (Data Validation and Inspection for Corporate Information chain)
  Planned functionalities
    Validation in time
    Validation in space
    Validation of the relationships between measures
    Automatic detection of outliers
    Optional: outlier detection using Kalman filters
    Qualification of outliers using external data, e.g. incident DB, congestion DB, weather DB
    Multi-station drill-down and drill-up through the data

Best practices
  What best practice cases can be identified?
    Method to trace missing and corrupted data through the entire chain
    Possibility to trace the source of errors in the information chain
  What specific aspects can be regarded as best practice?
    Improve overall chain quality
    Reduce analysis time, enhancing SLA
    Reduce costs & time for software development and maintenance: generic data model, effective bug tracking, uniform GUI
  Are the best practices specific to the country in question, to a certain region, or global?
    Currently national use; potential for global use
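
The validation-in-space idea from Example 2, which also appears among the planned Da Vinci functionalities, can be illustrated with a small conservation-of-flow check. This is a sketch under assumptions, not the COF equation used in the Dutch network: the site names, the day-intensities, the node layout, the relative definition of the bias, and the 2% `TOLERANCE` are all hypothetical.

```python
# Hypothetical day-intensities (vehicles per day) per measurement site.
day_intensity = {"A": 41000, "B": 9000, "C": 50200, "D": 30000, "E": 21500}

# Hypothetical network: each node lists the sites feeding it and leaving it.
nodes = {
    "merge_1": {"inflow": ["A", "B"], "outflow": ["C"]},
    "split_2": {"inflow": ["C"], "outflow": ["D", "E"]},
}

TOLERANCE = 0.02  # assumed maximum acceptable relative bias


def cof_bias(node, intensity):
    """Relative difference between outgoing and incoming daily flow.

    With conservation of flow the bias should be close to zero; a large
    value suggests that at least one neighbouring site is unreliable.
    """
    total_in = sum(intensity[site] for site in node["inflow"])
    total_out = sum(intensity[site] for site in node["outflow"])
    return (total_out - total_in) / total_in


biases = {name: cof_bias(node, day_intensity) for name, node in nodes.items()}
unreliable = [name for name, bias in biases.items() if abs(bias) > TOLERANCE]

for name, bias in biases.items():
    print(f"{name}: bias {bias:+.2%}")
print("nodes to disqualify:", unreliable)
```

A node that fails the check only shows that the sites around it disagree. Deciding which site is at fault needs the neighbouring nodes as well, which is why the slide applies the COF equation to the entire network.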