lukaszgolab$ - university of waterloolgolab/sigmod13_tutorial.pdf ·...

72
Data Stream Warehousing Lukasz Golab [email protected] University of Waterloo Theodore Johnson [email protected] AT&T Labs Research

Upload: dinhhuong

Post on 08-May-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Data  Stream  Warehousing  

Lukasz  Golab  [email protected]  University  of  Waterloo  

 

Theodore  Johnson  [email protected]  

AT&T  Labs  -­‐  Research  

Page 2: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Big  Data  

•  Every  2  days  we  create  as  much  informaKon  as  we  did  up  to  2003  (Eric  Schmidt)  

•  Becoming  easier  to  produce/collect  – Sensors,  Web,  cheap  bandwidth  

•  Becoming  easier/cheaper  to  store  – Cheap  hard  disks,  commodity  hardware  

 

Page 3: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

TradiKonal  Big  Data  Workflow  •  Wait  for  data  to  arrive  •  Prepare  and  load  data  

–  Into  HDFS,  key-­‐value  store,  …    – or  into  a  database,  then  index  

•  Compute  result  •  Start  over  

Page 4: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

But  •  Many  interesKng  data  sets  are  “streaming”  

– Monitoring  (IP  networks,  infrastructure,  smart  transportaKon  systems  and  power  grids,  RFID,  system  logs,  manufacturing)  

– TransacKons  (stock  Kckers,  credit  card  purchases)  – User  behaviour  logs  (Web,  social  media)  

Page 5: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Stream  Data  Workflow  

•  For  each  item  or  batch  of  items  – Do  some  processing  – Compute/update  results  

•  Now  feasible  due  to  cheap  RAM,  mulK-­‐cores,  etc.  

Page 6: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

“Fast  Data”  Systems  •  Data  Stream  Management  Systems  (DSMS)  

– Borealis,  StreamBase,  Gigascope  – Simple  queries  over  fast  append-­‐only  data  – Results  streamed  out,  usually  not  stored  

•  Key-­‐value  stores  have  fast  transacKonal  response,  but  analyKcs  are  difficult  – Put/get  interface  makes  correlaKon  difficult  – AnalyKcs  are  inefficient  on  distributed  stores  

Page 7: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

In  This  Tutorial  •  Big  Data  Management  

– Focus  on  scalability  and  deep  analyKcs,  but  high  latency  

•  Fast  Data  Management  – Low  latency,  but  limited  capability  and  no  persistent  storage  

•  Can  we  do  both?  – Data  Stream  Warehousing  

Page 8: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Five  “V”s  of  Big  Data  •  Volume  •  Velocity  •  Variety  

– Data  integraKon  •  VerificaKon  

– Data  cleaning  •  Value  

– Data  mining  

Page 9: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Outline  •  Why?  •  What?  •  Detailed  example  •  How?  

– Common  elements  – System  architectures  – Performance  opKmizaKons  – Data  stream  quality  

•  Open  Problems  

Page 10: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Why?  •  Could  have  2  separate  systems,  but  

– Not  clear  where  to  divide  the  systems  – Overhead  of  moving  data  from  one  system  to  the  other  

– Harder  to  develop  applicaKons  •  Different  SQL  dialects,  etc.  

– Historical  data  provides  context  for  real-­‐Kme  data  – Even  tradiKonal  analyKcs/reporKng  is  becoming  more  real-­‐Kme  

•  Reduce  Kme  from  ingest  to  insight  

Page 11: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

What?  •  Load  data  from  a  mulKtude  of  streaming  sources  – Wide  variaKon  in  data  latencies  

•  Provide  transparent  access  to  both  real-­‐Kme  and  historical  data  

•  Gracefully  handle  late-­‐arriving  data  •  Schedule  queries  and  updates  to  materialized  views  in  spite  of  highly  variable  workloads  – Load  shedding  by  dropping  data  is  not  an  opKon  

Page 12: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

MulKtude  of  streaming  sources  •  Data  becomes  most  useful  when  you  can  correlate  results  from  many  sources  – Hundreds  to  thousands  of  disKnct  data  feeds  

•  Network  monitoring  –  Correlate  twiCer  feeds,  acKve  monitoring  streams,  and  link  uKlizaKons  to  idenKfy  trouble  spots  

•  Smart  Grid  –  Correlate  smart  meter  readings,  line  temperature  measurements,  and  phasor  measurement  units  to  proacKvely  react  to  overloads  and  avoid  blackouts  

Page 13: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

0  

2  

4  

6  

8  

10  

12  

0   100000   200000   300000   400000   500000   600000  

Num

ber  o

f  Windo

ws  

Time  (  seconds)  

Late-­‐arriving  data  •  Late  arriving  data  is  a  

common  problem  for  streaming  systems.  

•  DSMS  :  data  arrives  minutes  late  

•  Stream  Warehouse  :  data  can  arrive  days  late  

•  Load  all  data  and  propagate  their  results  in  spite  of  lateness.  

Page 14: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

•  AlerKng,  troubleshooKng,  real-­‐Kme  data  mining  all  depend  on  access  to  real-­‐Kme  and  historical  data  

•  Hard  to  draw  a  boundary  between  new  and  old  

Transparent  Access  

Scheduling  •  Ensure  that  the  most  Kme  criKcal  applicaKons/views    get  priority  service.  

•  Ensure  that  no  applicaKon  is  starved  •  In  spite  of  temporary  overload  

Page 15: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Network  Monitoring  •  Darkstar  project  at  AT&T  Labs  •  MoKvaKng  applicaKon  for  the  Data  Depot  stream  warehouse  system  

•  Data  collected:  –  Passive  and  acKve  probe  measurements,  route  monitoring,  system  logs,  configuraKon  data,  customer  service  Kckets  and  notes  

•  For:  – Networking  research,  data  mining,  alerKng,  troubleshooKng  

Page 16: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Darkstar:  Mining  Vast  Amounts  of  Data  

Network  

Route  monitors  (OSPFmon,  BGPmon)  

Device  service    monitoring  (CIQ,  MTANet,  STREAM)  

AcKve  service  and    connecKvity  monitoring  

Syslog  Config  

SNMP  Polling  (router,  link)  Nenlow  

Deep  Packet    InspecKon  (DPI)  

Alarms  

Tickets  

AuthenKcaKon/  logging  (tacacs)  

Customer  feedback  –  IVR,  Kckets,  MTS  

IP  Backhaul  Enterprise  IP,  VPNs  

Ethernet  Access  

IPTV  

Layer  one  

Mobility  

Page 17: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

ARGUS:  DetecKng  Service  Issues…  • Goal:  detect  and  isolate  ac#onable  anomaly  events  using  comprehensive  end-­‐to-­‐end  performance  measurements  (e.g.  GS  tool)  •  SophisKcated  anomaly  detecKon  and  heurisKcs  •  SpaKal  localizaKon  •  Accurately  accounts  for  service  performance  that  varies  considerably  by  Kme-­‐of-­‐day  

and  locaKon  •  Impact:  •  Reduced  detecKon  Kme  from  days  to  approx.  15  mins  for  detecKng  data  service  issues  

•  OperaKonal  naKon-­‐wide  monitoring  data  service  performance  for  3G  and  LTE  (TCP  retransmission,  RTT,  throughput  from    GS  Tool)  

Page 18: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Market  

Sub-­‐Market   Sub-­‐Market  …  

SGSN   SGSN  

…  RNC   RNC  

…  

SITE  SITE   …  

SITE  

SITE  

RNC  

SITE  

SITE  

RNC  

SITE  

SITE  RNC  

SGSN  

SGSN   GGSN  

GGSN  

Collect  end-­‐to-­‐

end  Performance  

Data  

Approach:  Mobility  LocalizaKon  Hierarchy  

Page 19: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Case  Example:  Silent  CE  Overload  CondiKon    • ARGUS  detected  event:  2  Columbia  3G  Ericsson  SGSN’s  impacKng  RNC’s  in  West  Virginia,  Norfolk,  and  Richmond  •  No  other  indicaKon  of  issue  •  Topology  highlighted  CE  used  by  only  impacted  SGSNs  

         

 •  RCA:  “6148  48  port  1gig  card  is  limited  to  a  shared  1  gig  bus  for  each  set  

of  8  gig  ports”  

ARGUS  alarm:  clmamdorpn2  (TCP  retransmissions)   CE  UGlizaGon  flaJening    

Page 20: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

ARGUS  As  a  General  Capability…  Spike  in  call  drop  rate  on  MSC  hrndvacxca1    RTT  anomalies  (SGSN  level)    

Outage  start  5:30  GMT  

First  Anomaly  5:40  GMT  

CTS  Ticket  Created  08:21  GMT  

Social  media  (TwiCer)  NY  outage  

LA  outage  

Node  metrics,    acKve  measurements  (CBB,  IPAG  WIPM  delay)…  

 

Mobility  customer  Kckets  (Boston  market  –  PE  isolaKon)  

Page 21: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

• 1.  At-­‐a-­‐glance  view  of  network  topology  and  state    

• VisualizaKon  to  summarize  important  informaKon  on  network  health  •  Color-­‐coded    

• Complimentary  to  KckeKng  system  –  reporKng  issues  below  “alarming”  status  

Page 21

hCp://ptolemy.research.aC.com/  

Use  network  visualiza9on  and  convenient  data  explora9on  to  help  network  operators  with  network  health  monitoring  and  service  problem  troubleshoo9ng  

Ptolemy  

hCp://ptolemy.research.aC.com/mobility  

Page 22: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Assess  damage,  idenKfy  remaining  capacity  

Page 22

Loss  of  many  links  out  of  Japan.  What’s  ler?  

Example  1:  Japan  Earthquake,  March  11th  2011  

Page 23: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

IdenKfy  traffic  shirs,  no  congesKon  

Page 23

Increase  in  link  load  as  traffic  re-­‐routed  

Link  load  

Example  1:  Japan  Earthquake,  March  11th  2011  

Page 24: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Recap  •  Load  data  from  mulKple  diverse  sources  •  Transparent  access  to  real-­‐Kme  and  historical  data  

•  Schedule  queries/updates  – And  materialized  views  

•  Handle  late/out-­‐of-­‐order  data  •  Could  have  two  separate  systems,  but  …  

Page 25: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Architectures  •  DSMS-­‐based  •  DBMS-­‐based  •  Hadoop-­‐based  

Page 26: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DSMS-­‐based  •  Add  ability  to  store  data  (e.g.,  Aurora/Borealis)  

Output  stream  

“StaKc”  data  set  

ConnecKon  point  

Page 27: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DSMS-­‐based  

•  Example2:  Moirae:  History-­‐enhanced  monitoring  

Postgres  

Borealis  

SQL  query  Data  export  

Page 28: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DSMS-­‐based  •  Example  3:  Dejavu:  paCern  matching  over  live  and  historical  streams  – Actually  DBMS-­‐based  (MySQL)  

MySQL  

PaCern  matching  engine  

PaCern  match  query  Data  export  

Page 29: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DSMS-­‐based  •  Pros  

– Enables  real-­‐Kme  processing  with  context  

•  Cons  – Does  not  enable  complex  analyKcs  

•  Must  keep  up  with  live  data  

– Stores  limited  history  

Page 30: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DBMS-­‐based  •  Use  the  query  processing  and  storage  engine  of  a  DBMS  

•  Add  layers  for  addiKonal  services  – Fast  data  load  – Temporal  parKKoning  – Update  propagaKon  – Scheduling  

•  Add  stream  warehouse-­‐specific  features  and  opKmizaKons  

Page 31: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DBMS-­‐based  

•  Design  decisions:  – Row  store  (Data  Depot/Daytona,  Truviso/Postgres)  vs.  column  store  (DataCell/MonetDB,  SAP  HANA,  VerKca)  

– Disk  (Data  Depot,  Truviso)  vs.  main  memory  (DBToaster,  SAP  HANA)  

Page 32: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DBMS-­‐based  •  Pros:  

– Leverage  SQL,  query  opKmizaKon,  data  storage  

•  Cons:  – Not  quite  real-­‐Kme  

Page 33: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Hadoop  /  Map-­‐Reduce  based  •  HOP  (Hadoop  Online  Prototype)  •  Idea:  instead  of  waiKng  for  all  mappers  to  finish,  send  output  incrementally  from  mappers  to  reducers  – periodically  invoke  reducers  on  the  available  data  

Page 34: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Hadoop  /  Map-­‐Reduce  based  

•  MapUpdate/Muppet  (Walmart  Labs),  similar  ideas  in:  Incoop,  SCALLA  – Reduce:  for  each  key,  process  all  values  and  return  a  single  output  value  

– Update:  given  a  new  (k,v)  pair,  return  an  updated  output  value  using  the  new  pair  and  state  of  k  

•  And  update  the  state  

Page 35: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Hadoop  /  Map-­‐Reduce  based  •  Nova  (Yahoo)    

– “Pipelining”  between  jobs  in  a  workflow  (in  large  batches)  

– Pass  a  “delta”  to  the  next  job  in  a  workflow  

Page 36: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Hadoop  /  Map-­‐Reduce  based  •  Pros:  

– Leverage  scale-­‐out  and  fault  tolerance  •  Cons:  

– Again,  not  quite  real-­‐Kme  

Page 37: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

How?  •  Common  elements  in  a  stream  warehouse    – Temporal  parKKoning  – Update  propagaKon  /  workflow  – Temporal  dimension  tables  – Temporal  consistency  management  

Page 38: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Temporal  ParKKoning  

•  The  primary  parKKoning  field  is  the  record  Kmestamp  •  Stream  data  is  mostly  sorted  •  Most  new  data  loads  into  a  new  parKKon  

–  Avoid  rebuilding  indices  •  Simplified  data  expiraKon  –  roll  off  oldest  parKKons  

Time  

Data  

Index  

New  data  

Page 39: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Update  PropagaKon  /  Workflow  •  Streaming  analyKcs  –  maintain  a  system  of  complex  materialized  views  

•  Push  new  data  through  base  tables  to  all  dependent  tables  –  Create  new  parKKons  – Update  exisKng  parKKons  as  needed  

TwiCer  feeds  

AcKve  measure  

Link  uKl  

Customer  complaint  

Service  alerts  

SenKment  analysis  

Hourly  aggregate  

Daily  aggregate  

Page 40: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Temporal  Dimension  Tables  •  Most  streaming  data  describes  events  

–  Occurs  in  a  point  in  Kme,  or  is  a  measurement  during  a  well-­‐defined  interval  

•  Some  streaming  data  defines  condi#ons  –  ProperKes  of  an  enKty  that  endures  for  a  Kme  interval  –  Temporal  dimension  tables  –  Kmestamp  is  valid  Kme  interval.  

•  Pervasive  use  –  You  can’t  evaluate  an  event  without  knowing  about  the  environment  

–  Link  speeds,  cell  tower  locaKons,  power  grid  organizaKon  •  Snapshot  tables  don’t  work  

–  Late  arriving  data,  recomputaKon,  new  long-­‐term  analyses.  

Page 41: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Temporal  Dimension  Table  Example  SNMP_BytesTransferred  

Ip_address   Timestamp   Bytes_xfered  

4.3.2.1   1:05   1,000,000  

4.3.2.1   1:10   1,200,000  

4.3.2.1   1:15   2,200,000  

LinkSpeed  Ip_address   Tlo   Thi   Speed  

4.3.2.1   12:15   1:15   1,000,000  B/min  

4.3.2.1   1:15   -­‐   2,000,000  B/min  

Ip_address   Timestamp   UKlizaKon  

4.3.2.1   1:05   .2  

4.3.2.1   1:10   .24  

4.3.2.1   1:15   .22  

LinkUKlizaKon  

Page 42: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Temporal  Dimension  Tables  •  Updates  

– Snapshots  of  current  status,  deltas.  •  Snapshot  windows  in  StreamInsight  •  Compute  from  the  stream  

– Frames  –  based  on  a  condiKon  of  records  in  a  stream  

–  Interval  punctuaKon  

Page 43: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

OpKmizaKons  

•  DB-­‐toaster  •  MulK-­‐version  Concurrency  Control  •  ParKKon  Restructuring  •  ParKKon  Revisions  •  Temporal  Consistency  Management  •  Scheduling  

Page 44: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

DB-­‐toaster  •  Maintain  complex  

aggregate  views  over  streaming  data.  

•  In-­‐memory  architecture  :  all  storage  is  via  hash  table.  –  1TB  main  memory  servers  are  

inexpensive  •  Uses  novel  recursive-­‐delta  

technique  to  accelerate  maintenance  –  CollecKon  of  support  views  

that  can  significantly  reduce  update  Kme.  

Join(R,S,T))  

Join(S,T))   Join(R,T))   Join(R,S))  

T)   S)   R)  

Page 45: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

MulK-­‐version  Concurrency  Control  •  MVCC  allows  queries  and  updates  to  proceed  concurrently  – Read  isolaKon  – Long  analyKc  queries  do  not  block  real-­‐Kme  updates  

•  Single-­‐updater  MVCC  is  cheap  and  easy  – Use  a  directory-­‐swap  algorithms  

•  Encourages  use  of  cloud-­‐friendly  write-­‐once  files.  

Page 46: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

ParKKon  Restructuring  •  As  data  ages,  its  best  representaKon  changes  

– Most  recent  data  :  opKmize  for  fast  ingest  – Stable  data  :  opKmize  for  queries  – Historical  data  :  minimize  storage  cost  

•  Restructure  parKKons  as  the  data  ages  – MVCC  allows  data  maintenance  to  occur  as  a  non-­‐interfering  background  task  

Page 47: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

ParKKon  Size  

•  New  parKKons  should  match  the  update  increment  

•  Problem  :  parKKon  explosion  –  1  minute  parKKons,  1440  per  day,  525,600  per  year  

•  Merge  parKKons  as  they  age  

Time  

Data  

Index  

Indexes  opKonal  

Page 48: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Data  Layout  •  Write-­‐opKmized  data  

–  Row-­‐oriented,  lightly  indexed,  uncompressed  •  Read-­‐opKmized  data  

– Highly  indexed,  lightly  compressed,  column  storage  if  beneficial  

•  Transform  as  a  background  task  when  the  data  becomes  stable  –  Combine  with  parKKon  merging  

•  Aggressive  compression  for  archival  data  •  ImplementaKons  in  SAP  HANA  and  VerKca  

Page 49: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

ParKKon  Revisions  

•  Some  data  always  arrives  late  •  Problem  :  need  to  recompute  exisKng  parKKons  – Disk  prefers  sequenKal  access  – Write-­‐once  files  :  need  to  recompute  the  enKre  parKKon  

•  SoluKon:  chain  updates  to  the  parKKon  – Value  of  the  parKKon  is  the  sum  of  the  primary  (anchor)  contents  plus  the  updates  (revisions).  

Page 50: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

ParKKon  Revisions    

•  Problem:  Don’t  change  old  parKKons,  but  what  if  data  arrives  out-­‐of-­‐order?  

•  SoluKon:  Overflow  chains  (Truviso)  

Time  

anchor  

revisions  

Packet_Stream  

Packets  

Page 51: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

•  Works  with  “raw”  and  derived/aggregated  data  

•  E.g.,  packet  counts:  

Data  Layout  

1000   1200   1150   1400  

Time  

25  

Packet_Stream  

Packets  

Packet_counts  

Page 52: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Temporal  Consistency  Management  

•  TradiKonal  noKon  of  consistency  :  a  snapshot  of  the  system.  

•  Doesn’t  apply  in  a  stream  warehouse  – Late-­‐arriving  data  is  common  – Different  data  sources  have  different  Kme  lags  and  different  likelihoods  of  late  data  

•  Instead,  label  data  by  its  degree  of  completeness  

Page 53: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

0  

2  

4  

6  

8  

10  

12  

0   100000   200000   300000   400000   500000   600000  

Num

ber  o

f  Windo

ws  

Time  (  seconds)  

Number  of  windows  per  package  

Page 54: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Query  Stability  •  How  do  I  know  when  the  data  is  stable  enough  to  query?  

•  What  is  stable  enough?  – Data  will  never  change  – Data  won’t  change  much.  –  I’ll  take  whatever  is  there.  

Page 55: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Consistency  Levels  •  PunctuaKons  on  parKKons  that  indicate  completeness.  

•  Example  (simple)  collecKon  of  consistency  levels  – Open  :  The  parKKon  should  have  some  data  in  it.  –  Closed  :  The  parKKon  will  not  change.  –  Complete  :  the  parKKon  will  not  change,  and  all  data  has  been  received.  

•  Closed  is  a  guess  – WeaklyClosed,  StronglyClosed  

•  Infer  at  base  tables,  propagate  inferences  to  materialized  views.  

Page 56: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Workflow  Scheduling  •  Need  to  limit  resource  use  to  avoid  thrashing.  

– Hundreds  of  tables  to  update,  limited  (CPU,  memory,  cache,  network)  resources.  

–  Exclusive  resources:  non-­‐preempKve  scheduling.  •  Ensure  that  high-­‐priority  jobs  can  execute  

–  Real-­‐Kme  scheduling  •  Measures  of  lateness:  

–  Staleness  :  difference  between  current  Kme  and  most  recent  data.  

–  Tardiness  :  the  difference  between  a  task  deadline  and  task  compleKon.  

Page 57: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Workflow  Scheduling  •  Staleness  funcKon:  

difference  between  current  Kme  and  most  recent  data  loaded  

•  Hierarchies  of  views  with  highly  varying  execuKon  Kmes.  

9:30   9:45   10:00   10:15  

TwiCer  feeds  

AcKve  measure  

Link  uKl  

Customer  complaint  

Service  alerts  

SenKment  analysis  

Hourly  aggregate  

Daily  aggregate  

fast  frequent  

slow  infrequent  

Page 58: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bounded  Tardiness  Scheduling  •  Bound  on  the  maximum  tardiness  of  any  task  in  a  task  set.  

•  If  update  jobs  are  scheduled  regularly,  bounded  tardiness  =>  bounded  staleness  

•  Most  real-­‐Kme  scheduling  algorithms  have  bounded  tardiness  –  EDF,  minimum  slack,  etc.  –  There  can  be  differences  in  the  tardiness  bounds  

•  Pick  a  heurisKc  that  works  well  –  E.g.  pick  the  task  that  provides  the  largest  marginal  reducKon  in  staleness.  

Page 59: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Track  Scheduling  •  ComplicaKon:  Large  differences  in  task  execuKon  Kme  – Update  a  base  table  with  1  minute  of  data  vs.  compute  a  daily  aggregate.  

•  Tardiness  bounds  depend  on  the  largest  task  execuKon  Kmes.  –  Long  tasks  block  short  criKcal  tasks.  

•  Track  Scheduling  :    –  parKKon  tasks  by  execuKon  Kme.      –  Restrict  the  number  of  long  tasks  that  can  execute  concurrently  

–  Reserve  resources  for  short  criKcal  tasks  

Page 60: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Transient  Overload  •  Common  source  of  overload  :  catch-­‐up  processing.  – A  feed  breaks  for  a  day,  then  is  restored.  –  The  source  schema  changes,  requiring  a  pause  in  processing  to  change  update  procedures.  

– New  tables  load  a  long  history  •  Update  Chopping  

–  Break  a  (temporally)  long  update  into  short  segments.  •  Update  period  adjustment  

– Decrease  the  period  of  backlogged  tables  to  use  up  (but  not  oversubscribe)  available  resources.  

Page 61: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Data  Stream  Quality  

•  New  data  quality  problems  – SystemaKc  errors  in  machine-­‐generated  streams  – Correlated  glitches  – Missing/delayed  data  

•  New  semanKcs  – RelaKonal  data:  keys,  FDs,  CFDs  – Streaming/temporal  data:  order,  arrival  frequency  (sequenKal  dependencies),  conservaKon  laws  

Page 62: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Open  Problems  •  Hybrid  system  architectures  and  cross-­‐system  opKmizaKons  

•  Big  and  fast  analyKcs  as  a  cloud  service  •  Big/fast  data  mining  •  Data  stream  quality/profiling  •  Complexity  management  and  administraKon  of  a  big/fast  data  management  system  

Page 63: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  ApplicaKons  

–  Smart  Grid  •  hCp://energy.gov/oe/technology-­‐development/smart-­‐grid  

–  Semiconductor  Manufacturing  •  hCp://www.appliedmaterials.com/technologies/library/techedge-­‐prizm  

•  hCp://www.extremetech.com/extreme/155588-­‐applied-­‐materials-­‐designs-­‐tools-­‐to-­‐leverage-­‐big-­‐data-­‐and-­‐build-­‐beCer-­‐chips  

•  Networking  ApplicaKons  –  C.  Kalmanek  et  al.,  Darkstar:  Using  Exploratory  Data  Mining  to  Raise  the  

Bar  on  Network  Reliability  and  Performance,  DRCN  2009.  –  H.  Yan,  A.  Flavel,  Z.  Ge,  A.  Gerber,  D.  Massey,  C.  Papadopoulos,  H.  Shah,  J.  

Yates:  Argus:  End-­‐to-­‐end  service  anomaly  detecKon  and  localizaKon  from  an  ISP's  point  of  view.  INFOCOM  2012:2756-­‐2760  

Page 64: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  DSMS-­‐based  systems  

– D.  J.  Abadi,  D.  Carney,  U.  ÇeKntemel,  M.  Cherniack,  C.  Convey,  S.  Lee,  M.  Stonebraker,  N.  Tatbul,  S.  B.  Zdonik:  Aurora:  a  new  model  and  architecture  for  data  stream  management.  VLDB  J.  12(2):  120-­‐139  (2003)  

– M.  Balazinska,  Y.  C.  Kwon,  N.  Kuchta,  D.  Lee:  Moirae:  History-­‐Enhanced  Monitoring.  CIDR  2007:  375-­‐386  

– N.  Dindar,  P.  M.  Fischer,  M.  Soner,  N.  Tatbul:  Efficiently  correlaKng  complex  events  over  live  and  archived  data  streams.  DEBS  2011:  243-­‐254  

Page 65: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  

•  DBMS-­‐based  systems  –  Truviso  :  S.  Krishnamurthy,  M.  J.  Franklin,  J.  Davis,  D.  Farina,  P.  Golovko,  A.  Li,  N.  Thombre:  ConKnuous  analyKcs  over  disconKnuous  streams.  SIGMOD  2010:1081-­‐1092  

– DataCell:  E.  Liarou,  R.  Goncalves,  S.  Idreos:  ExploiKng  the  power  of  relaKonal  databases  for  efficient  stream  processing.  EDBT  2009:  323-­‐334  

–  L.  Golab,  T.  Johnson,  J.  S.  Seidel,  V.  Shkapenyuk:  Stream  warehousing  with  DataDepot.  SIGMOD  Conference  2009:  847-­‐854  

Page 66: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  Hadoop  /  Map-­‐Reduce  Based  Systems  

–  T.  Condie,  N.  Conway,  P.  Alvaro,  J.  M.  Hellerstein,  K.  Elmeleegy,  R.  Sears:  MapReduce  Online.  NSDI  2010:  313-­‐328  

–  W.  Lam,  L.  Liu,  S.  T.  S.  Prasad,  A.  Rajaraman,  Z.  Vacheri,  A.  H.i  Doan:  Muppet:  MapReduce-­‐Style  Processing  of  Fast  Data.  PVLDB  5(12):  1814-­‐1825  (2012)  

–  C.  Olston,  G.  Chiou,  L.  Chitnis,  F.  Liu,  Y.  Han,  M.  Larsson,  A.  Neumann,  V.  B.  N.  Rao,  V.  Sankarasubramanian,  S.  Seth,  C.  Tian,  T.  ZiCornell,  X.  Wang:  Nova:  conKnuous  Pig/Hadoop  workflows.  SIGMOD  Conference  2011:  1081-­‐1090  

–  B.  Li,  E.  Mazur,  Y.  Diao,  A.  McGregor,  P.  J.  Shenoy:  SCALLA:  A  Planorm  for  Scalable  One-­‐Pass  AnalyKcs  Using  MapReduce.  ACM  Trans.  Database  Syst.  37(4):  27  (2012)  

–  P.  BhatoKa,  A.  Wieder,  R.  Rodrigues,  U.  A.  Acar,  R.  Pasquin:  Incoop:  MapReduce  for  incremental  computaKons.  SoCC  2011:  7  

Page 67: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  

•  Late  Arriving  Data  –  S.  Krishnamurthy  et  al.,  ConKnuous  analyKcs  over  disconKnuous  

streams,  SIGMOD  2010,  1081-­‐1092  –  J.  Li.  K.Ture,  V.  Shkapenyuk,  V.  Papadimos,  T.  Johnson,  D.  Maier,  Out-­‐

of-­‐order  processing:  a  new  architecture  for  high-­‐performance  stream  systems,  PVLDB  1(1):  274-­‐288  (2008).  

–  Lukasz  Golab,  Theodore  Johnson:  Consistency  in  a  Stream  Warehouse.  CIDR  2011:  114-­‐122  

Page 68: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  Update  PropagaKon  /  Workflow  

–  T.  Johnson,  V.  Shkapenyuk:  Update  PropagaKon  in  a  Streaming  Warehouse.  SSDBM  2011:  129-­‐149  

–  C.  Olston  et  al.  Nova:  conKnuous  Pig/Hadoop  workflows.  SIGMOD  Conference  2011:  1081-­‐1090  

•  Temporal  Dimension  Tables  –  Interval  Event  Stream  Processing,  M.  Li,  M.  Mani,  E.  A.  Rundensteiner.,  D.  Wang,  T  Lin,  DEBS  2008  

–  David  Maier,  Michael  Grossniklaus,  Sharmadha  Moorthy,  KrisKn  Ture:  Capturing  episodes:  may  the  frame  be  with  you.  DEBS  2012:1-­‐11  

–  Snapshot  windows:  hCp://msdn.microsor.com/en-­‐us/library/ff518550.aspx  

Page 69: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  MulK-­‐Version  Concurrency  Control  

–  D.  Quass,  J.  Widom:  On-­‐Line  Warehouse  View  Maintenance.  SIGMOD  Conference  1997:  393-­‐404  

–  V.  Sikka,  F.  Färber,  W.  Lehner,  S.  K.  Cha,  T.  Peh,  Christof  B.:  Efficient  transacKon  processing  in  SAP  HANA  database:  the  end  of  a  column  store  myth.  SIGMOD  Conference  2012:  731-­‐742  

•  Data  ParKKon  TransformaKons  –  V.  Sikka,  F.  Färber,  W.  Lehner,  S.  K.  Cha,  T.  Peh,  B.  Christof:  Efficient  transacKon  processing  in  SAP  HANA  database:  the  end  of  a  column  store  myth.  SIGMOD  Conference  2012:  731-­‐742  

–  A.  Lamb,  M.  Fuller,  R.  Varadarajan,  N.  Tran,  B.  Vandier,  L.  Doshi,  C.  Bear:  The  VerKca  AnalyKc  Database:  C-­‐Store  7  Years  Later  .  PVLDB  5(12):  1790-­‐1801  (2012)  

Page 70: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  DB  Toaster  

–  DBToaster:  Higher-­‐order  Delta  Processing  for  Dynamic,  Frequently  Fresh  Views,  Y.  Ahmad  O.  Kennedy,  C.  Koch,  .  M.  Nikolic,  Proc  VLDB  2012  

•  ParKKon  Revisions  –  S.  Krishnamurthy,  M.  J.  Franklin,  J.  Davis,  D.  Farina,  P.  Golovko,  A.  Li,  N.  Thombre:  ConKnuous  analyKcs  over  disconKnuous  streams.  SIGMOD  2010:1081-­‐1092  

•  Temporal  Consistency  Management  –  Lukasz  Golab,  Theodore  Johnson:  Consistency  in  a  Stream  Warehouse.  CIDR  2011:114-­‐122  

•  Bounded  Tardiness  Scheduling  –  H.  Leontyev,  J.  H.  Anderson:  Generalized  tardiness  bounds  for  global  mulKprocessor  scheduling.  Real-­‐Time  Systems  44(1-­‐3):  26-­‐71  (2010)  

Page 71: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  Stream  Warehouse  Scheduling  

– Lukasz  Golab,  Theodore  Johnson,  Vladislav  Shkapenyuk:  Scalable  Scheduling  of  Updates  in  Streaming  Data  Warehouses.  IEEE  Trans.  Knowl.  Data  Eng.  24(6):  1092-­‐1105  (2012)  

– S.  Guirguis,  M.  A.  Sharaf,  P.  K.  Chrysanthis,  A.  Labrinidis,  K.  Pruhs,  AdapKve  Scheduling  of  Web  TransacKons.    Proc.  2009  Intl.  Conf.  on  Data  Engineering  

Page 72: LukaszGolab$ - University of Waterloolgolab/sigmod13_tutorial.pdf · •ARGUS&detected&event:&2$Columbia3G$Ericsson$SGSN’s$impacKng$RNC’s$in$ WestVirginia,$Norfolk,$and$Richmond$

Bibliography  •  Data  stream  quality  

–  Lukasz  Golab,  Howard  J.  Karloff,  Flip  Korn,  Avishek  Saha,  Divesh  Srivastava:  SequenKal  Dependencies.  PVLDB  2(1):  574-­‐585  (2009)  

–  Lukasz  Golab,  Howard  J.  Karloff,  Flip  Korn,  Barna  Saha,  Divesh  Srivastava:  Discovering  ConservaKon  Rules.  ICDE  2012:  738-­‐749  

–  Tamraparni  Dasu,  Ji  Meng  Loh:  StaKsKcal  DistorKon:  Consequences  of  Data  Cleaning.  PVLDB  5(11):  1674-­‐1683  (2012)  

–  Lukasz  Golab,  Data  Warehouse  Quality:  Summary  and  Outlook,  In:  S.  Sadiq  (ed.),  Handbook  of  Data  Quality  -­‐  Research  and  PracKce,  Springer-­‐Verlag  Berlin  Heidelberg  2013