machine learning for machine data - info.sumologic.cominfo.sumologic.com/rs/sumologic/images/... ·...
TRANSCRIPT
Machine learning for machine data
1
David Andrzejewski -‐ @davidandrzej Data Sciences, Sumo Logic Strata Conference – Machine Data Track February 13, 2014
This talk: Machine Learning + Machine Data = Awesome!
2
! YES – overview of log data – solving log data problems with machine learning – specific examples
• (mostly) Sumo Logic-‐related • customer use cases
– general lessons learned
This talk: Machine Learning + Machine Data = Awesome!
3
! NO (or, not much) – Sumo Logic deep dive – Tech stack talk
• In-‐memory Hadoop for real-‐Ume Cassandra SQL in hybrid clouds
– Big data “shock and awe” • 800 yo[abytes / second ZOMG!!11!!
– Algorithm shootout • Deep learning vs random forests vs SVMs vs coin flips vs ...
– Extreme math
HyperLogLog: analysis of a near-optimal cardinality algorithm 131
Definition 1 An ideal multiset of cardinality n is a sequence obtained by arbitrary replications and per-mutations applied to n uniform identically distributed random variables over the real interval [0, 1].In the analytical part of our paper (Sections 2 and 3), we postulate that the collection of hashed valuesh(M), which the algorithm processes constitutes an ideal multiset. This assumption is a natural way tomodel the outcome of well designed hash functions. Note that the number of distinct elements of such anideal multiset equals n with probability 1. We henceforth let E
n
and Vn
be the expectation and varianceoperators under this model.
Theorem 1 Let the algorithm HYPERLOGLOG of Figure 2 be applied to an ideal multiset of (unknown)cardinality n, using m � 3 registers, and let E be the resulting cardinality estimate.
(i) The estimate E is asymptotically almost unbiased in the sense that
1
nE
n
(E) =
n!11 + �
1
(n) + o(1), where |�1
(n)| < 5 · 10
�5 as soon as m � 16.
(ii) The standard error defined as 1
n
pV
n
(E) satisfies as n!1,
1
n
pV
n
(E) =
n!1
�mpm
+ �2
(n) + o(1), where |�2
(n)| < 5 · 10
�4 as soon as m � 16,
the constants �m
being bounded, with �16
.= 1.106, �
32
.= 1.070, �
64
.= 1.054, �
128
.= 1.046, and �1 =p
3 log(2)� 1
.= 1.03896.
The standard error measures in relative terms the typical error to be observed (in a mean quadraticsense). The functions �
1
(n), �2
(n) represent oscillating functions of a tiny amplitude, which are com-putable, and whose effect could in theory be at least partly compensated—they can anyhow be safelyneglected for all practical purposes.
Plan of the paper. The bulk of the paper is devoted to the proof of Theorem 1. We determine theasymptotic behaviour of E
n
(Z) and Vn
(Z), where Z is the indicator 1/P
2
�M
(j). The value of ↵
m
inEquation (3), which makes E an asymptotically almost unbiased estimator, is derived from this analysis,as is the value of the standard error. The mean value analysis forms the subject of Section 2. In fact,the exact expression of E
n
(Z) being hard to manage, we first “poissonize” the problem and examineEP(�)
(Z), which represents the expected value of the indicator Z when the total number of elements is notfixed, but rather obeys a Poisson law of parameter �. We then prove that, asymptotically, the behaviours ofE
n
(Z) and EP(�)
(Z) are close, when one chooses � := n: this is the depoissonization step. The varianceanalysis of the indicator Z, hence of the standard error, is sketched in Section 3 and is entirely parallel tothe mean value analysis. Finally, Section 4 examines how to implement the HYPERLOGLOG algorithmin real-life contexts, presents simulations, and discusses optimality issues.
2 Mean value analysis
Our starting point is the random variable Z (the “indicator”) defined in (2). We recall that En
refers toexpectations under the ideal multiset model, when the (unknown) cardinality n is fixed. The analysisstarts from the exact expression of E
n
(Z) in Proposition 1, continues with an asymptotic analysis of thecorresponding Poisson expectation summarized by Proposition 2, and concludes with the depoissonizationargument of Proposition 3.
! Data sciences @ Sumo Logic ! Co-‐organizer @ SF ML Meetup ! Previous
– Post-‐doc in knowledge discovery
! Even more previous machine data research projects
– University of Wisconsin-‐-‐Madison – Microsog Research
Context: me
4
−1
0
1
−0.6−0.4−0.200.20.40.60.8
−1
−0.5
0
0.5
PCA3
PCA2PCA1
254 bug1 runs106 bug3 runs147 bug4 runs329 bug5 runs206 bug8 runs186 other runs
Context: Sumo Logic
5
“Turning Machine Data Into IT and Business Insights”
Learn, classify, predict
Search, monitor, visualize
Context: Sumo Logic
6
“Turning Machine Data Into IT and Business Insights”
Learn, classify, predict
Search, monitor, visualize
Anatomy of a log message: Five W’s
7
Anatomy of a log message: Five W’s
8
! When? Timestamp with Ume zone
Anatomy of a log message: Five W’s
9
! When? Timestamp with Ume zone ! Where? Host, module, code locaUon
Anatomy of a log message: Five W’s
10
! When? Timestamp with Ume zone ! Where? Host, module, code locaUon ! Who? AuthenUcaUon context
Anatomy of a log message: Five W’s
11
! When? Timestamp with Ume zone ! Where? Host, module, code locaUon ! Who? AuthenUcaUon context ! What? Log level and key-‐value pairs
What’s missing
12
Traversing the stack
Sumo Logic ConfidenUal 13
Custom App Code
Server / OS
VirtualizaUon
Databases
Network
Open Source Sogware
Middleware
12/20/2011 17:23:44 PST [user=234fsf] failed transaction, sessionid:2F0A232324, [host=pay002.sjc] amount=1725.00
12/20/11 17:23:34 AMQ7163: WebSphere MQ job number 18429 started FOR client_session=2F0A232324.
12202011 17:23:27 /usr/local/build/mysql/libexec/mysqld: Abnormal shutdown [18429]
20-12-2011 17:23:19 database-host login[3866]: DEAD_PROCESS: 18429 ttys000
Dec 20, 2011 17:22:14,,, message=Created virtual machine user-3 on esxi01.office.thedomain.com
<134>Dec 20 2011 17:22:12: %PIX-6-106100: access-list inside_access_out denied tcp inside/68.162.72.163(4326) -> outside/45.200.244.124(3127) hit-cnt 1(first hit)
66.249.67.24 - - [20/Dec/2011:17:23:40 -0700] ”POST /APP/Order.php HTTP/1.1" 304 146 "-" SESSION=2F0A232324
Customer ID Session ID
Job number
Process ID
Root cause!
Enhanced visibility into machine behaviors ! Compliance
– OperaUonal (SLA) – Regulatory (audits) – Security
! Availability / performance – Faster MTTR
! Business insights ($$$)
Log use cases – “organizaUonal percepUon”
14
! (wildly) varying formats – prinq, JSON, XML, Windows, X-‐delimited, ...
! Specialized knowledge
! Noise ! Cascading failures
Log challenges
15
[2008-05-07 09:50:08.450 'App' 3560 verbose] [VpxdHeartbeat] Invalid heartbeat from 10.17.218.46
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” -‐ Leslie Lamport
Complexity
16
Transport for London December 2013
Key to symbols Explanation of zones
1
3
45
6
2
789
Station in both zones
Station in both zones
Station in both zones
Station in Zone 9
Station in Zone 6
Station in Zone 5
Station in Zone 3Station in Zone 2
Station in Zone 1
Station in Zone 4
Station in Zone 8
Station in Zone 7
National Rail
Riverboat services
Airport
Tramlink
Interchange stations
Step-free access from street to platform
Step-free access from street to train
Emirates Air Line
Check before you travel
Key to lines
Metropolitan
Victoria
Circle
Central
Bakerloo
DLR
London Overground
Piccadilly
Waterloo & City
Jubilee
Hammersmith & City
Northern
DistrictDistrict open weekends, public holidays and some Olympia events
Emirates Air Line
Bank Waterloo & City line open between Bank and Waterloo 0621-0030 Mondays to Fridays and 0802-0030 Saturdays. Between Waterloo and Bank 0615-0030 Mondays to Fridays and 0800-0030 Saturdays. Closed Sundays and Public Holidays.---------------------------------------------------------------------------------Camden Town Sunday 1300-1730 open for interchange and exit only.---------------------------------------------------------------------------------Canary Wharf Step-free interchange between Underground, Canary Wharf DLR and Heron Quays DLR stations at street level.---------------------------------------------------------------------------------Cannon Street Open until 2100 Mondays to Fridays and 0730-1930 Saturdays. Closed Sundays.---------------------------------------------------------------------------------Embankment Bakerloo and Northern line trains will not stop at this station from early January 2014 until early November 2014. ---------------------------------------------------------------------------------Emirates Greenwich Peninsula and Emirates Royal DocksSpecial fares apply. Open 0700-2000 Mondays to Fridays, 0800-2000 Saturdays, 0900-2000 Sundays and 0800-2000 Public Holidays. Opening hours are extended by one hour in the evening after 1 April 2014 and may be extended on certain events days. Please check close to the time of travel. ---------------------------------------------------------------------------------Heron Quays Step-free interchange between Heron Quays and Canary Wharf Underground station at street level.---------------------------------------------------------------------------------Hounslow WestStep-free access for manual wheelchairs only.---------------------------------------------------------------------------------Kilburn No step-free access from late January 2014 until mid May 2014.---------------------------------------------------------------------------------Stanmore Step-free access via a steep ramp. ---------------------------------------------------------------------------------Turnham Green Served by Piccadilly line trains until 0650 Mondays to Saturdays, 0745 Sundays and after 2230 every evening. At other times use District line.---------------------------------------------------------------------------------Waterloo Waterloo & City line open between Bank and Waterloo 0621-0030 Mondays to Fridays and 0802-0030 Saturdays. Between Waterloo and Bank 0615-0030 Mondays to Fridays and 0800-0030 Saturdays. Closed Sundays and Public Holidays.No step-free access from late January 2014 until late July 2014.---------------------------------------------------------------------------------West India QuayNot served by DLR trains from Bank towards Lewisham before 2100 on Mondays to Fridays.---------------------------------------------------------------------------------
River Thames
A
B
C
D
E
F
1 2 3 4 5 6 7 8 9
1 2 3 4 5 76 8 9
A
B
C
D
E
F
2 2
22
2
5
8 8 6
2
4
4
65
41
3
2
43
3
36 3 1
1
3
3
59 7 7Special fares apply
5
5
4
4
4
AmershamChorleywood
Mill Hill East
Rickmansworth
Perivale
KentishTown West
CamdenRoad
Dalston Kingsland
Wanstead Park
Vauxhall
Hanger Lane
Edgware
Burnt Oak
Colindale
Hendon Central
Brent Cross
Golders Green
WestSilvertown
EmiratesRoyal Docks
EmiratesGreenwichPeninsula Pontoon Dock
LondonCity Airport
WoolwichArsenal
King George V
Hampstead
Belsize Park
Chalk Farm
Chalfont &Latimer
Chesham
New CrossGate
Moor Park
NorthwoodNorthwoodHills
Pinner
North Harrow
Custom House for ExCeL
Prince Regent
Royal Albert
Beckton Park
Cyprus
GallionsReach
Beckton
Watford
Croxley
Fulham Broadway
LambethNorth
HeathrowTerminal 4
Harrow-on-the-Hill
KensalRise
BethnalGreen
Westferry
SevenSisters
Blackwall
BrondesburyPark
HampsteadHeath
HarringayGreen Lanes
LeytonstoneHigh Road
LeytonMidland Road
HackneyCentral
NorthwickPark
PrestonRoad
RoyalVictoria
WembleyPark
Rayners Lane
Watford High Street
RuislipGardens
South Ruislip
Greenford
Northolt
South Harrow
Sudbury Hill
Sudbury Town
Alperton
Pimlico
Park Royal
North Ealing
Acton Central
South Acton
Ealing Broadway
Watford Junction
West Ruislip
Bushey
Carpenders Park
Hatch End
North Wembley
West Brompton
Ealing Common
South Kenton
Kenton
Wembley Central
Kensal Green
Queen’s Park
Gunnersbury
Kew Gardens
Richmond
Stockwell
Bow Church
Stonebridge Park
Harlesden
Camden Town
Willesden Junction
Headstone Lane
Parsons Green
Putney Bridge
East Putney
Southfields
Wimbledon Park
Wimbledon
Island Gardens
Greenwich
Deptford Bridge
South Quay
Crossharbour
Mudchute
Heron Quays
West India Quay
Elverson Road
Oakwood
Cockfosters
Southgate
Arnos Grove
Bounds Green
Theydon Bois
Epping
Debden
Loughton
Buckhurst Hill
WalthamstowQueen’s Road
Woodgrange Park
Leytonstone
Leyton
Wood Green
Turnpike Lane
Manor House
Stanmore
Canons Park
Queensbury
Kingsbury
High Barnet
Totteridge & Whetstone
Woodside Park
West Finchley
Finchley CentralWoodford
South Woodford
Snaresbrook
Hainault
Fairlop
Barkingside
Newbury Park
East Finchley
Highgate
Archway
Devons Road
Langdon Park
All Saints
Tufnell Park
Kentish Town
Neasden
Dollis Hill
Willesden Green
South Tottenham
Swiss Cottage
ImperialWharf
Brixton
Kilburn
West Hampstead
Blackhorse Road
Acton Town
CanningTown
Finchley Road
Highbury &Islington
Canary Wharf
Stratford
StratfordInternational
FinsburyPark
Elephant & Castle
Stepney Green
Barking
East Ham
Plaistow
Upton Park
Poplar
West Ham
Upper Holloway
PuddingMill Lane
Kennington
Borough
Elm ParkDagenham
East
DagenhamHeathway
Becontree
Upney
Heathrow Terminal 5
Finchley Road& Frognal
CrouchHill
Northfields
Boston Manor
South Ealing
Osterley
Hounslow Central
Hounslow East
Clapham North
Clapham High Street
Oval
Clapham Common
Clapham South
Balham
Tooting Bec
Tooting Broadway
Colliers Wood
South Wimbledon
Arsenal
Holloway Road
Caledonian Road
Morden
West Croydon
HounslowWest
Hatton Cross
HeathrowTerminals 1, 2, 3
ClaphamJunction
WestHarrow
Brondesbury CaledonianRoad &
Barnsbury
TottenhamHale
WalthamstowCentral
HackneyWick
Homerton
WestActon
Limehouse EastIndia
Crystal Palace
ChiswickPark
RodingValley
GrangeHill
Chigwell
Redbridge
GantsHill
Wanstead
Ickenham
TurnhamGreen
Uxbridge
Hillingdon Ruislip
GospelOak
Mile End
Bow Road
Bromley-by-Bow
Upminster
Upminster Bridge
Hornchurch
Norwood Junction
Sydenham
Forest Hill
Anerley
Penge West
Honor Oak Park
Brockley
Harrow &Wealdstone
Cutty Sark for Maritime Greenwich
Ruislip Manor
Eastcote
Wapping
New Cross
Queens RoadPeckham
Peckham Rye
Denmark Hill
Surrey Quays
Whitechapel
Lewisham
Kilburn Park
Regent’s Park
KilburnHigh Road
EdgwareRoad
SouthHampstead
GoodgeStreet
Shepherd’s BushMarket
Goldhawk Road
Hammersmith
Bayswater
Warren Street
Aldgate
Euston
Farringdon
BarbicanRussellSquare
Kensington(Olympia)
MorningtonCrescent
High StreetKensington
Old Street
St. John’s Wood
Green Park
BakerStreet
NottingHill Gate
Victoria
AldgateEast
Blackfriars
Mansion House
Temple
Cannon Street
OxfordCircus
BondStreet
TowerHill
Westminster
PiccadillyCircus
CharingCross
Holborn
Tower Gateway
Monument
Moorgate
Leicester Square
London Bridge
St. Paul’s
Hyde Park Corner
Knightsbridge
StamfordBrook
RavenscourtPark
WestKensington
NorthActon
HollandPark
Marylebone
Angel
Queensway MarbleArch
SouthKensington
SloaneSquare
WandsworthRoad
Covent Garden
LiverpoolStreet
GreatPortland
Street
Bank
EastActon
ChanceryLane
LancasterGate
Warwick AvenueMaida Vale
Fenchurch Street
Paddington
BaronsCourt
GloucesterRoad St. James’s
Park
Latimer RoadLadbroke Grove
Royal Oak
Westbourne Park
Bermondsey
Rotherhithe
ShoreditchHigh Street
Dalston Junction
Haggerston
Hoxton
Wood Lane
Shepherd’sBush
WhiteCity
King’s CrossSt. Pancras
EustonSquareEdgware
Road
Southwark
Embankment
Stratford High Street
Abbey Road
Star Lane
Waterloo
TottenhamCourt Road
Canonbury
Shadwell
Earl’sCourt
NorthGreenwich
CanadaWater
! Logs: like “computer tweets” ! Twi[er 2013*
– Peak @ ~144k TPS – Avg ~6k tweets / second
! Log data – Example: 1 TB / day – Avg ~25k logs / second
“OMG java.lang.NullPointerExcepUon #fail”
17
* https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
Systems that learn from experience
18
19
Unsupervised clustering ! Given: set of items ! Do: group similar items
20
Unsupervised clustering ! Given: set of items ! Do: group similar items
Too many logs! “data disorientaUon”
~60k results: 30 minutes, one component
DisUll logs down to underlying structure
LogReduce: results "compressed” ~1000x
printf("Health status check: %s is %s”,
hostid, hoststatus)
Health status check: zim-5 is OK
Health status check: gir-3 is OK
Health status check: gir-2 is TIMED OUT
Health status check: dib-1 is OK
In the beginning, there was the prinq()
Log generaUon
printf("Health status check: %s is %s”,
hostid, hoststatus)
Health status check: zim-5 is OK
Health status check: gir-3 is OK
Health status check: gir-2 is TIMED OUT
Health status check: dib-1 is OK
Health status check: *** is ***
Reverse engineering prinq()
Log generaUon
“magic”
26
1. Define string distance funcKon (e.g., Левенште́йн)
2. Do distance-‐based clustering
���������
��������
Unsupervised clustering ! Given: log messages ! Do: group by “signature”
Drill-‐down into the original raw logs
28
ParKally supervised clustering ! Given: set of items + side info ! Do: group similar items
29
ParKally supervised clustering ! Given: set of items + side info ! Do: group similar items
Too many wildcards!
30
“Hint” from human user
31
Not enough wildcards!
32
“Hint” from human user
33
34
Learning to rank ! Given: set of items, historical data ! Do: rank by “relevance”
Two pages is sUll too many!
35
36
Learning to rank ! Given: signatures, user acUvity
! Do: rank by “relevance”
! Site “acUng weird” ! InvesUgaUon with LogReduce
– error logs è issue with content push/publish workflow • root cause
– 50 minutes later: “object missing” errors serving content • user-‐visible outage
! Benefits – rapidly “skim” the logs – create an alert
True story: troubleshooUng @ digital media co.
37
! Domains
– financial services – SaaS vendors
! Use cases – Availability / performance
• Mean Ume to invesUgaUon (MTTI) = “hours to minutes”
– Security • Quickly bubble up unusual logs
More true stories
38
General lessons
39
! Combat “data disorientaUon”
General lessons
40
! Combat “data disorientaUon” ! Surface latent structure
General lessons
41
! Combat “data disorientaUon” ! Surface latent structure ! Link to underlying raw data
General lessons
42
! Combat “data disorientaUon” ! Surface latent structure ! Link to underlying raw data ! Empower user to improve results
General lessons
43
unknown unknowns
44
45
Outlier detecKon ! Given: data points ! Do: idenUfy outliers
46
Outlier detecKon ! Given: data points ! Do: idenUfy outliers
47
Health check OK
Request processed
Txn timeout, retry
Anomaly detecKon ! Given: log data ! Do: flag anomalies
48
Health check OK
Request processed
Txn timeout, retry
Anomaly detecKon ! Given: log data ! Do: flag anomalies
InvesUgate and annotate events
49
logs
signatures
RAW DATA
HUMAN
InvesUgate and annotate events
50
logs
signatures
RAW DATA
HUMAN
InvesUgate and annotate events
51
logs
signatures
event
RAW DATA
HUMAN
InvesUgate and annotate events
52
logs
signatures
event
RAW DATA
HUMAN
Umeline / alerts
53
Supervised classificaKon ! Given: labeled data points ! Do: predict future labels
54
Supervised classificaKon ! Given: labeled data points ! Do: predict future labels
55
Supervised classificaKon ! Given: log data, annotated events
! Do: classify new occurrences
event
Umeline / alerts
True stories
56
! SSH problems – configuraUon errors – script user auth failures
! PotenUal security events – surge of failed logins
! Unhappy infrastructure – Oracle – VMware
! Explain algorithm “decisions”
– why was this flagged as anomaly?
More general lessons
57
! Explain algorithm “decisions”
– why was this flagged as anomaly?
! Link to underlying raw data – (AGAIN)
More general lessons
58
! Explain algorithm “decisions”
– why was this flagged as anomaly?
! Link to underlying raw data – (AGAIN)
! Empower user to improve results – (AGAIN)
More general lessons
59
Numerical Ume-‐series data
60
61
Signal decomposiKon ! Given: Ume-‐series ! Do: extract model components
62
Outlier detecKon ! Given: data points ! Do: idenUfy outliers
63
Outlier detecKon ! Given: data points ! Do: idenUfy outliers
Raw
Smoothed
Windowed model
65
Raw
Smoothed
Smoothed vs +3σ
Windowed model
True stories
66
! Financial services: bad data points ! Security: misbehavior ! OperaUons: alerUng
Yet more general lessons
67
! Time-‐series data analysis well-‐studied
! Read the literature(s)!
< OBLIGATORY PLUGS >
68
freesumo.com
BONUS: scale-‐out, streaming architecture
69
logs
logs
logs
BONUS: approximaUng with Count-‐Min Sketch
70
BONUS: approximaUng with Count-‐Min Sketch
71
Approximate counUng with Count-‐Min Sketch
72