TRANSCRIPT
Managing Incompleteness, Complexity and Scale in Big Data
Nick Duffield
Electrical and Computer Engineering, Texas A&M University
http://nickduffield.net/work
Three Challenges for Big Data
• Complexity
– Problem: high-dimensional data with complex dependence between variables; difficult to model
– Solution: machine-learn the dominant relationships
• Incompleteness
– Problem: not all quantities can be directly measured
– Solution: statistically infer what we want from what we have
• Scale
– Problem: huge datasets are costly to store and slow to compute on
– Solution: smart data reduction retains the ability to answer the most important queries
Big Data Complexity: Customer Experience
• Which objective metrics are closely associated with customer dissatisfaction?
– If known, remediate and prevent future troubles
• Solution:
– (Machine) learn the metrics (and values) and service settings most associated with the occurrence of customer calls. Set action thresholds.
– Monitor metrics; take action when thresholds are exceeded
• Operational savings
– Reduce call volume to the customer care center; reduce churn
• Reverse problem
– Learn the calling patterns and keywords most predictive of network problems
[Diagram: objective metrics of network performance (packet loss and delay; line quality; service parameters) versus noisy measures of customer experience (customer care calls; social media; keyword analysis), with the association between the two unknown.]
Incompleteness: Internet Tomography
• What ISPs want
– Origin-Destination (OD) traffic rates between any two routers
• What ISPs have
– Measured traffic rates on each link
• Linear relation
– Link_Rates = A · OD_Rates
– A = routing matrix, encoding which links each OD flow traverses
• Solve? An under-constrained problem
– Different possible sets of OD_Rates yield the same set of measured Link_Rates, as the sketch below illustrates
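To make the under-constraint concrete, here is a minimal sketch with toy numbers (not from the talk): two distinct OD rate vectors produce identical link rates under the same routing matrix.

```python
import numpy as np

# Toy routing matrix: 2 links, 3 OD flows (illustrative values only).
# A[l, f] = 1 if OD flow f traverses link l.
A = np.array([[1, 1, 0],
              [0, 1, 1]])

od_rates_1 = np.array([10.0, 5.0, 20.0])
od_rates_2 = np.array([12.0, 3.0, 22.0])  # a different traffic matrix

# Both yield the same observed link rates: the system is under-constrained.
print(A @ od_rates_1)  # [15. 25.]
print(A @ od_rates_2)  # [15. 25.]
```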
Internet Tomography
• Gravity model?
– OD_Rate(A→B) = const. × Rate(A→ALL) × Rate(ALL→B)
– Can measure Rate(A→ALL) at the links emanating from A
• Problem with gravity!
– The gravity model is not, in general, a solution of Link_Rates = A · OD_Rates
• Solution: tomogravity
– Use the solution closest to the gravity model!
– Penalized likelihood solution
– Quick to compute, with good accuracy
• In daily use in ISPs
[Diagram: in the space of candidate traffic matrices (axes M1, M2), the constraint subspace L = A·M; the tomogravity estimate is the least-squares projection of the gravity model solution onto that subspace.]
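The diagram's projection can be written down directly. Below is a minimal sketch of the least-squares variant: among all traffic matrices consistent with the observed link rates, find the one closest (in Euclidean distance) to the gravity solution. The production tomogravity algorithm uses a penalized-likelihood formulation; this simpler projection is for illustration only.

```python
import numpy as np

def tomogravity_ls(A, link_rates, gravity):
    """Least-squares projection of the gravity solution onto the
    constraint subspace {x : A x = link_rates}.

    Solves min ||x - gravity|| subject to A x = link_rates, using the
    Moore-Penrose pseudoinverse of the routing matrix A."""
    correction = np.linalg.pinv(A) @ (link_rates - A @ gravity)
    return gravity + correction

# Toy example reusing the 2-link, 3-flow routing matrix from above.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
link_rates = np.array([15.0, 25.0])
gravity = np.array([9.0, 7.0, 21.0])  # gravity estimate (made-up numbers)

x = tomogravity_ls(A, link_rates, gravity)
print(x, A @ x)  # x satisfies the link-rate constraints exactly
```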
Big Data Scale
• ISP operations generate hundreds of terabytes of usage measurement data daily
• Passive traffic measurements by (core) routers
– Session-level traffic summaries (flow records)
– Each flow record reports IP source and destination, number of packets, bytes, timing, ... (see the sketch after this slide)
– Core routers stream flow records to collectors for analysis
• Used widely in network management
– Timescales from months (planning) down to seconds (security)
• Still need tomogravity outside the core!
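As an illustration of the fields a flow record carries, here is a minimal sketch; the field names are hypothetical and simplified relative to real export formats such as NetFlow or IPFIX.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One session-level traffic summary, as exported by a router.
    Field names are illustrative, not an actual NetFlow/IPFIX schema."""
    src_ip: str       # IP source address
    dst_ip: str       # IP destination address
    packets: int      # number of packets in the flow
    byte_count: int   # number of bytes in the flow
    start_ms: int     # flow start time (epoch milliseconds)
    end_ms: int       # flow end time (epoch milliseconds)

rec = FlowRecord("10.0.0.1", "192.0.2.7", packets=42, byte_count=61_000,
                 start_ms=1_700_000_000_000, end_ms=1_700_000_012_000)
```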
Managing Data Scale through Sampling
• Turn Big Data into smaller data
– Savings in storage and bandwidth; speeds up queries
• Reference sampling
– Reuse samples over multiple retrospective queries
– The query class is known in advance, but not the specific query
• "Smart" sampling
– Matches data characteristics to analysis requirements
– E.g., uniform sampling is useless on heavy tails
• Streaming constraints
– The sample must be computable in a small amount of time per item
– A big-data constraint often not met by classical methods
Statistically Optimal Stream Sampling
• Aim
– Sample a fraction of the flow records
– Use the sample to answer queries approximately
• Problem: heavy tails
– 10% of the flow records report 90% of the bytes
– Uniform sampling misses most of that 10%: a big hit on accuracy
• Solution (see the sketch after this list)
– Statistically optimal non-uniform sampling algorithms (minimal estimation variance)
– Computationally feasible for stream sampling
– In use in ISPs
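One concrete scheme in this family is priority sampling (Duffield, Lund, Thorup), which keeps the k stream items of largest "priority" weight/uniform and yields unbiased subset-sum estimates. Below is a minimal sketch, assuming the stream holds more than k items.

```python
import heapq
import random

def priority_sample(weights, k):
    """Priority sampling of a weighted stream (Duffield, Lund, Thorup).

    Each item gets priority q = w / u with u uniform on (0, 1]; the k
    largest-priority items are kept. With tau the (k+1)-st largest
    priority, max(w, tau) is an unbiased estimator of each sampled
    item's weight, so heavy items are kept with high probability."""
    heap = []  # min-heap of (priority, weight), size at most k + 1
    for w in weights:
        u = 1.0 - random.random()          # uniform on (0, 1]
        heapq.heappush(heap, (w / u, w))
        if len(heap) > k + 1:
            heapq.heappop(heap)            # drop the lowest priority
    tau, _ = heapq.heappop(heap)           # threshold: (k+1)-st priority
    return [(w, max(w, tau)) for _, w in heap]

# Heavy-tailed demo: the estimated total tracks the true total.
random.seed(1)
stream = [random.paretovariate(1.2) for _ in range(100_000)]
sample = priority_sample(stream, k=1_000)
print(sum(stream), sum(est for _, est in sample))
```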
Next: Streaming ISP Graph Data
• ISP communications graph from flow records
– node = IP address
– edge = flow from source to destination
[Diagram: botnet traffic in the communications graph, staged as compromise, control, and flooding.]
• Attack traffic is hard to detect against the background
• Known attacks:
– Signature matching based on subgraphs, flow features, and timing
• Unknown attacks:
– Exploratory and retrospective analysis
• Smart sampling of subgraphs (a minimal sketch follows)
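As a hedged illustration of the pipeline (not the talk's specific algorithm), the sketch below builds the communications graph from flow records and keeps a weight-dependent edge sample, so heavy edges are retained preferentially.

```python
import random
from collections import defaultdict

def sample_comm_graph(flows, p_base):
    """Build a communications graph from (src_ip, dst_ip, bytes) flow
    tuples and keep each edge with probability min(1, p_base * bytes):
    a simple weight-dependent edge sample, illustrative only."""
    edge_bytes = defaultdict(int)
    for src, dst, nbytes in flows:        # node = IP, edge = src -> dst
        edge_bytes[(src, dst)] += nbytes
    return {e: b for e, b in edge_bytes.items()
            if random.random() < min(1.0, p_base * b)}

flows = [("10.0.0.1", "10.0.0.2", 5_000),
         ("10.0.0.1", "10.0.0.3", 120),
         ("10.0.0.2", "10.0.0.3", 900_000)]  # heavy edge: almost surely kept
print(sample_comm_graph(flows, p_base=1e-4))
```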
Sampling + Knowledge Discovery
• The interplay between sampling and data mining is not well understood
– Need to understand how ML/DM algorithms are affected by sampling
– E.g., how big a sample is needed to build an accurate classifier?
– E.g., what sampling strategy optimizes cluster quality?
• Expect results to be method-specific
– I.e., "smart sampling + k-means" (see the experiment sketch below)
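One way to probe the sample-size question empirically is the sketch below: fit k-means on uniform samples of increasing size and score the resulting centroids on the full dataset. The synthetic data and sample sizes are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three Gaussian clusters (illustrative, not from the talk).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(size=(10_000, 2)) for c in centers])

for m in (100, 1_000, 30_000):  # the last size is the full dataset
    idx = rng.choice(len(X), size=m, replace=False)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[idx])
    # Score sample-trained centroids on the *full* data.
    cost = ((X - km.cluster_centers_[km.predict(X)]) ** 2).sum()
    print(f"sample size {m:6d}: full-data clustering cost {cost:,.0f}")
```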
Sampling and Privacy
• Current focus on privacy-preserving data mining
– Opportunity for sampling to be part of the solution
• Naïve sampling provides "privacy in expectation"
– Your data remains private if you aren't included in the sample...
• Intuition: the uncertainty introduced by sampling contributes to privacy
– This intuition can be formalized under different privacy models
• Sampling can be analyzed in the context of differential privacy (DP)
– Sampling alone does not provide differential privacy
– But applying a DP method to sampled data does guarantee privacy
– There is a tradeoff between the sampling rate and the privacy parameters (quantified in the sketch below)
• Understand the benefits as well as the risks of information flows
• A network calculus of the risk/reward tradeoff from sharing and joining information
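The tradeoff between sampling rate and privacy parameters is captured by the standard "amplification by subsampling" bound (a known result, not stated explicitly in the talk): running an ε-DP mechanism on a Poisson subsample with rate q satisfies roughly ln(1 + q(e^ε − 1))-DP. A minimal calculation:

```python
import math

def amplified_epsilon(eps, q):
    """Standard amplification-by-subsampling bound: an eps-DP mechanism
    applied to a Poisson subsample with sampling rate q satisfies
    eps'-DP with eps' = ln(1 + q * (exp(eps) - 1))."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))

for q in (1.0, 0.1, 0.01):
    print(f"sampling rate {q:>4}: epsilon 1.0 -> {amplified_epsilon(1.0, q):.3f}")
```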
Outlook
• Big data challenges
– Incompleteness, complexity, scale
• Generic problems; transferable solutions
– Find causal relations in high-dimensional data
• Use machine learning for discovery and prediction
– Big Data tomography
• Solve ill-posed inverse problems with constraints from models and side data
– Smart sampling
• Speed up computations and save on resources
• Tune sampling to mediate between data and queries
– The role of sampling in ML/DM, privacy, ...