Data Mining with Splunk
Data Mining and Exploration
David Carasso, Office of CTO, Chief Mind
AGENDA
What is data mining?
What’s the plan of attack?
What type of events do I have?
How do I mine fields?
How do I detect anomalous events?
Why do I need to visualize my data?
What is Data Mining?
Is this data mining?
This is an orange
What is Data Mining?
Extracting implicit, previously unknown, and potentially useful information from data.
Better
Data Preparation
Data Exploration
Data Mining
Understanding
What’s the plan of attack?
Preparing the data
You've been thrown data you aren't familiar with…
Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user 'root'
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
....
Anomalies (unexpected address)
Transactions (open-close)
Fields (pid)
Eventtypes (closed sessions)
Is Understanding Linear?
Event Groups
Events
Fields
Anomalies
Reports
No.
What type of events do I have?
Given Some Unknown Data
Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user 'root'
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration ...
Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
....
Find Broad Categories of Events
Group Events by Content, Format, and Time
Group Events by Content
Cluster events with similar values.
Show 3 examples from each cluster, from the most common cluster to the least:
…| cluster labelonly=t showcount=t | dedup 3 cluster_label sortby -cluster_count, cluster_label, _time
Events By Content
count  label  _raw
-----  -----  ------------------------------------------------------------------------------------
1339   3      Mar 7 11:05:01 willLaptop crond(pam_unix)[6785]: session opened for user root by…
1339   3      Mar 7 11:10:01 willLaptop crond(pam_unix)[1769]: session opened for user root by …
1339   3      Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session opened for user root by …
1324   2      Mar 7 11:05:02 willLaptop crond(pam_unix)[6785]: session closed for user root
1324   2      Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session closed for user root
1324   2      Mar 7 11:10:02 willLaptop crond(pam_unix)[1769]: session closed for user root
136    13     Mar 7 20:05:08 willLaptop kernel: SELinux: initialized (dev selinuxfs, type selinuxfs)…
136    13     Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev usbfs, type usbfs), uses …
136    13     Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev sysfs, type sysfs), uses …
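The idea of grouping events by content can be sketched outside Splunk. The following is a minimal illustration, not Splunk's `cluster` algorithm: it assumes a Jaccard similarity over alphabetic terms and a greedy single pass, and the names `terms`, `jaccard`, `cluster_events` and the 0.8 threshold are all hypothetical choices made for this example.

```python
import re

def terms(event):
    # Keep alphabetic terms only, so timestamps and pids do not split a cluster.
    return frozenset(re.findall(r"[A-Za-z_]+", event))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_events(events, threshold=0.8):
    # Greedy pass: join the first cluster whose representative is similar enough.
    clusters = []
    for event in events:
        for members in clusters:
            if jaccard(terms(event), terms(members[0])) >= threshold:
                members.append(event)
                break
        else:
            clusters.append([event])
    # Most common cluster first, as in the search above.
    return sorted(clusters, key=len, reverse=True)

events = [
    "Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)",
    "Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root",
    "Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root",
    "Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user 'root'",
    "Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)",
    "Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root",
]

clusters = cluster_events(events)
for members in clusters:
    print(len(members), members[0])
```

Reversing the sort order would surface the smallest clusters first, which is the view used later for anomaly hunting.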
Group by $%#! Format
Cluster events by first 7 punctuation chars:
…| rex field=punct "(?<smallpunct>.{7})" | eventstats count by smallpunct | sort -count, smallpunct | dedup 3 smallpunct
Events by Format
count  smallpunct  raw
-----  ----------  -------------------------------------------------------------------------------
637    __::__(     Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
637    __::__(     Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
637    __::__(     Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by …
367    __::__:     Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
367    __::__:     Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
367    __::__:     Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
57     __::__[     Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126, stratum 2
57     __::__[     Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum 10
57     __::__[     Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567 s
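To see why the first few punctuation characters separate these sources, here is a rough Python approximation of the idea. It is a sketch under assumptions: `punct` below simply drops letters and digits and turns whitespace into underscores, which may differ in detail from how Splunk derives its `punct` field, and `small_punct` is a hypothetical helper name.

```python
from collections import Counter

def punct(event):
    # Approximation: whitespace becomes "_", letters and digits are dropped,
    # every other character is kept as-is.
    out = []
    for ch in event:
        if ch.isspace():
            out.append("_")
        elif not ch.isalnum():
            out.append(ch)
    return "".join(out)

def small_punct(event, width=7):
    # Only the first few punctuation characters, like the rex in the search above.
    return punct(event)[:width]

events = [
    "Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root",
    "Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root",
    "Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50",
    "Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126, stratum 2",
]

counts = Counter(small_punct(e) for e in events)
for pattern, count in counts.most_common():
    print(count, pattern)
```

The timestamp and host contribute the shared `__::__` prefix; the seventh character is where the process name ends, which is what tells crond, dhclient and ntpd apart.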
Group by Time
Look for bursts of events
• Turn on computer
• Load a web page
• Detects speeding car
• Print document
• Scan security badge
Group by Time Bursts
… | transaction maxpause=2s | search eventcount>1

Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session opened for user root by (uid=0)
Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by (uid=0)
Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root

Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.

Mar 10 16:45:01 willLaptop crond(pam_unix)[9553]: session opened for user root by (uid=0)
Mar 10 16:45:02 willLaptop crond(pam_unix)[9553]: session closed for user root
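Grouping by time bursts amounts to cutting the event stream wherever the gap between consecutive events exceeds the allowed pause. A minimal Python sketch of that idea follows; it is not how `transaction` is implemented, it assumes all events fall on the same day, and the helper names `seconds` and `bursts` are hypothetical.

```python
def seconds(hms):
    # "16:50:01" -> seconds since midnight (assumes a single day of data).
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def bursts(events, maxpause=2):
    # events: list of (time string, text). A new burst starts whenever the gap
    # to the previous event is longer than maxpause seconds.
    groups = []
    for stamp, text in sorted(events, key=lambda e: seconds(e[0])):
        if groups and seconds(stamp) - seconds(groups[-1][-1][0]) <= maxpause:
            groups[-1].append((stamp, text))
        else:
            groups.append([(stamp, text)])
    return groups

events = [
    ("16:50:01", "crond(pam_unix)[9638]: session opened for user root by (uid=0)"),
    ("16:50:01", "crond(pam_unix)[9639]: session opened for user root by (uid=0)"),
    ("16:50:01", "crond(pam_unix)[9638]: session closed for user root"),
    ("16:50:02", "crond(pam_unix)[9639]: session closed for user root"),
    ("15:30:25", "dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67"),
    ("15:30:25", "dhclient: DHCPACK from 10.1.1.50"),
    ("15:30:25", "dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds."),
    ("16:45:01", "crond(pam_unix)[9553]: session opened for user root by (uid=0)"),
    ("16:45:02", "crond(pam_unix)[9553]: session closed for user root"),
    ("16:46:32", "ntpd[2544]: synchronized to 138.23.180.126, stratum 2"),
    ("16:46:27", "ntpd[2544]: synchronized to LOCAL(0), stratum 10"),
    ("16:42:09", "ntpd[2544]: time reset -0.236567 s"),
]

groups = bursts(events)
multi = [g for g in groups if len(g) > 1]    # like "search eventcount>1"
single = [g for g in groups if len(g) == 1]  # isolated events
print([len(g) for g in multi], len(single))
```

The same grouping also yields the isolated events (bursts of one), which are often the more interesting ones.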
Multiple Sources
(not really correct)
Now what?
1. ✓ group your data
2. tell Splunk!
Telling Splunk
(about your groups of events)
Add eventtypes and tags
Huh?
SURPRISE TANGENT!
What is an eventtype?
Eventtype
A dynamic “tag” added to events, if they would match the search that defines the eventtype.
Eventtype:
    Name: “closed_root”
    Definition: “session closed” root
Event: … session closed for user root …
=> eventtype=closed_root
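An eventtype behaves like a named, saved predicate that is evaluated against every event at search time. The sketch below illustrates that idea in Python. It is a simplification: plain case-insensitive substring matching stands in for Splunk's search-term matching, and `make_eventtype`, `EVENTTYPES` and `tag` are hypothetical names.

```python
def make_eventtype(required):
    # An eventtype is defined by a search; here the "search" is a list of
    # phrases and terms that must all appear in the event text.
    lowered = [item.lower() for item in required]

    def matches(event):
        text = event.lower()
        return all(item in text for item in lowered)

    return matches

EVENTTYPES = {
    # Definition from the slide: "session closed" root
    "closed_root": make_eventtype(["session closed", "root"]),
}

def tag(event):
    # Return the names of every eventtype whose definition matches the event.
    return [name for name, matches in EVENTTYPES.items() if matches(event)]

closed = "Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root"
opened = "Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)"
print(tag(closed), tag(opened))
```

Because the tag is computed when the event is searched, changing the definition immediately changes which events carry it.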
Create an Eventtype
Independent searches will return events tagged with previous eventtypes that help classify events.
Create reports on the classifications you’ve made
Ok, it wasn’t a tangent.
How do I mine fields?
Fields Correlation
Discover correlations to remove uninteresting fields and narrow in on promising reports.
haiku
Fields Correlation Haiku
Discover patterns
in fields with a correlation:
co-occurring fields.
indulgence
Splunkd.log Sample File
09-05-2012 15:34:11.886 -0700 INFO ExecProcessor - Ran script: python /opt/splunk/etc/apps/...
09-05-2012 15:34:02.467 -0700 ERROR TcpOutputProc - Can't find or illegal IP address or ...
09-05-2012 15:32:03.397 -0700 INFO ProcessTracker - Process ran long; type=SplunkOptimize ...
09-05-2012 15:30:20.016 -0700 WARN DispatchCommand - The system is approaching the maximum ...
fascinating
Field Correlation
… | correlate
RowField   C     CN    Component  Context  L     ...
---------  ----  ----  ---------  -------  ----
C          1.00  1.00  0.00       0.00     1.00
CN         1.00  1.00  0.00       0.00     1.00
Component  0.00  0.00  1.00       0.06     0.00
Context    0.00  0.00  0.06       1.00     0.00
L          1.00  1.00  0.00       0.00     1.00
Log_Level  0.00  0.00  1.00       0.06     0.00
…
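A co-occurrence matrix like the one above can be approximated in a few lines. This sketch assumes one plausible definition, the share of events containing both fields out of those containing either one; whether Splunk's `correlate` uses exactly this ratio is an assumption, and the sample events and the `cooccurrence` helper are hypothetical.

```python
from itertools import product

def cooccurrence(events):
    # events: list of dicts mapping field name -> value.
    fields = sorted({name for event in events for name in event})
    table = {}
    for a, b in product(fields, repeat=2):
        both = sum(1 for e in events if a in e and b in e)
        either = sum(1 for e in events if a in e or b in e)
        table[(a, b)] = both / either if either else 0.0
    return fields, table

# Hypothetical events: certificate-style fields never appear alongside
# logging fields, so the two groups show up as separate blocks.
events = [
    {"C": "US", "CN": "host-a", "L": "SF"},
    {"C": "US", "CN": "host-b", "L": "SF"},
    {"Component": "loader", "Log_Level": "INFO"},
    {"Component": "HotDBManager", "Log_Level": "INFO", "Context": "source"},
]

fields, table = cooccurrence(events)
for row in fields:
    print(row.ljust(10), " ".join(f"{table[(row, col)]:.2f}" for col in fields))
```

Fields that always travel together score 1.00, which is a hint that reporting on one of them by the other will not reveal much.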
Field Associations
Automatically deduce correlations and implications of field values:
…| associate Log_Level Component
Field Association Summary
Ref_Key    Ref_Value                 Target_Key  Support  Uncond Entropy  Cond Entropy  Increase  Top_Conditional_Value
---------  ------------------------  ----------  -------  --------------  ------------  --------  ------------------------
Component  DatabaseDirectoryManager  Log_Level   34.67%   1.182           0.000         1.182201  WARN (62.25% -> 100.00%)
Component  HotDBManager              Log_Level   38.25%   1.182           0.000         1.182201  INFO (33.15% -> 100.00%)
Component  SavedSplunker             Log_Level   394.31%  1.182           0.000         1.182201  WARN (62.25% -> 100.00%)
Component  databasePartitionPolicy   Log_Level   95.50%   1.182           0.417         0.765017  INFO (33.15% -> 91.57%)
Component  loader                    Log_Level   79.17%   1.182           0.050         1.131883  INFO (33.15% -> 99.44%)
Component  timeinvertedIndex         Log_Level   44.28%   1.182           0.000         1.182201  INFO (33.15% -> 100.00%)
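The entropy columns compare how unpredictable the target field is overall with how unpredictable it is once the reference field is fixed to a value. A small Python sketch of that comparison follows. It assumes Shannon entropy in bits (base 2), which may not match the exact convention `associate` uses, and the events and the `association` helper are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy in bits of the empirical distribution of values.
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def association(events, ref_key, ref_value, target_key):
    # Entropy of the target overall, and among events where ref_key == ref_value.
    overall = [e[target_key] for e in events if target_key in e]
    conditioned = [
        e[target_key]
        for e in events
        if e.get(ref_key) == ref_value and target_key in e
    ]
    uncond = entropy(overall)
    cond = entropy(conditioned)
    return uncond, cond, uncond - cond

# Hypothetical events in which each component always logs at one level.
events = [
    {"Component": "loader", "Log_Level": "INFO"},
    {"Component": "loader", "Log_Level": "INFO"},
    {"Component": "HotDBManager", "Log_Level": "WARN"},
    {"Component": "HotDBManager", "Log_Level": "WARN"},
]

uncond, cond, gain = association(events, "Component", "loader", "Log_Level")
print(uncond, cond, gain)
```

A conditional entropy of zero means that knowing the component fully determines the log level, as in the rows above that read "-> 100.00%".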
Top Fields by Fields
Most common Log_Level by Component:
... | top Log_Level by Component
Component                 Log_Level  count  percent
------------------------  ---------  -----  ----------
AdminManager              WARN       1      100.000000
DatabaseDirectoryManager  WARN       153    100.000000
DateParserVerbose         WARN       262    100.000000
DedupProcessor            ERROR      1      100.000000
DeploymentClient          DEBUG      60     85.714286
DeploymentClient          WARN       5      7.142857
How do I detect anomalous events?
Types of Anomalies
Anomalies you know about
Anomalies you don’t know about
Handling Known Anomalies
Easy. Define a search for the anomalous condition and make an alert to detect it.
ip=10.* NOT domain=mycompany.com … | stats perc99(spent)
500ms.
Alert on “spent>500”
Finding Unknown Anomalies
Look for Abnormal
• Single-Field Values
• Multi-Field Values
• Contexts
• Visual Inspections…
Anomalies by Single Field Values
Identify anomalous values in a given field either by frequency of occurrence or number of standard deviations from the mean.
… | anomalousvalue action=summary pthresh=0.02 | search isNum=YES
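The two criteria named above, rarity and distance from the mean, can each be sketched in a few lines of Python. This is an illustration only and not the logic of `anomalousvalue`: the parameter name `pthresh` is borrowed from the search above but is used here as a plain relative-frequency cutoff, and `rare_values`, `far_values`, `max_sigma` and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def rare_values(values, pthresh=0.02):
    # Values whose relative frequency falls below the threshold.
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, n in counts.items() if n / total < pthresh)

def far_values(numbers, max_sigma=2.0):
    # Numbers more than max_sigma standard deviations away from the mean.
    mu = mean(numbers)
    sigma = pstdev(numbers)
    if sigma == 0:
        return []
    return [x for x in numbers if abs(x - mu) / sigma > max_sigma]

statuses = ["200"] * 99 + ["503"]   # hypothetical categorical field
latencies = [10] * 9 + [11, 100]    # hypothetical numeric field

print(rare_values(statuses))
print(far_values(latencies))
```

Frequency works for any field; the standard-deviation test only makes sense for numeric ones, which is why the search above narrows the summary to numeric fields.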
Anomalous by Many Values
Look for small clusters – by content, format, and time – to find anomalies. For example…
…| cluster …| sort cluster_count
Smallest Clusters by Content
count  label  uri
-----  -----  -----------------------------------------------------------
1      7      /img/skins/default/bolt.png
1      37     /en-US/search/inspector?sid=1345075042.125&namespace=search
1      45     /services/admin/summarization?count=10
1      53     /services/pdfgen/is_available?viewId=index_status_health&...
1      57     /static/splunkrc_cmds.xml
Small Clusters: Bursts of One
Find bursts of just a single event, where a pause of more than 2 seconds occurred on either side of it.
… | transaction maxpause=2s | search eventcount = 1
Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126…
Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum…
Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567…
Burst of One
Same idea, different data source: Splunk
[11:58:08] "POST /services/search/jobs/export HTTP/1.1" 200 201630 …
[11:12:51] "POST /services/search/jobs/export HTTP/1.1" 200 459441 …
[10:00:58] "GET /servicesNS/nobody/SplunkDeploymentMonitor/backfill/…
Anomalous by Context
Identify values not expected by the context of other events.
… | anomalies field=file labelonly=true maxvalues=10
Unexpectedness  file
--------------  ----------------
0.00            shelper
0.16            shelper
0.00            1345502591.356
0.00            1345502591.356
0.00            1345074401.191
0.00            1345074031.153
0.03            1345074328.186
0.00            1345502591.356
0.35            conf-dm_backfill
0.00            1345074309.185
0.00            1345502591.356
time
Surprise Eventtype: Part Deux!
Classified major categories of your data with eventtypes? Just search for things that don’t match those eventtypes.
Once you can describe anomalous behavior as a search…
Other mining commands
• kmeans: Performs k-means clustering on selected fields.
• outlier: Removes outlying numerical values.
• af (analyze fields): Analyzes numerical fields for their ability to predict another discrete field.
• fieldsummary: Generates summary information for fields.
• shape: Produces a symbolic 'shape' attribute describing the shape of a numeric multivalued field.
Why do I need to visualize my data?
Data Mining by Visualization
Visualization can capture nuances in the data that numerical or linguistic summaries cannot easily capture.
These data points are radically different.
*Source: Anscombe’s Quartet (Anscombe 1973)
Why visualize?
Because they all have the exact same
• average (7.50)
• standard deviation (2.03)
• least-squares fit (3 + 0.5x).
Do not just rely on numerical summarization.
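The claim is easy to check. The sketch below recomputes the summary statistics for the four data sets of Anscombe's quartet (values as published in Anscombe 1973) using only the Python standard library; the `least_squares` helper and the `QUARTET` layout are choices made for this example.

```python
from statistics import mean, stdev

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def least_squares(xs, ys):
    # Ordinary least-squares line y = intercept + slope * x.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

summary = {}
for name, (xs, ys) in QUARTET.items():
    intercept, slope = least_squares(xs, ys)
    # Mean and sample standard deviation of y, then the fitted line.
    summary[name] = (mean(ys), stdev(ys), intercept, slope)
    print(f"{name:>3}: mean={summary[name][0]:.2f} sd={summary[name][1]:.2f} "
          f"fit={intercept:.2f} + {slope:.3f}x")
```

Only a scatter plot of each set reveals the curve, the outlier and the single high-leverage point that these identical numbers hide.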
But I already have charts!
You don’t graph enough.
Data Exploration
Don’t decide ahead of time what graphs you want.
Regularly do out-of-the-box scenarios with graphs.
Variations:
• Subsets of Events (paying customers vs lookers)
• Fields by Fields (including eventtypes and tags)
• Ignored fields
• Min/max/avg/count
• Compare to other time windows
• Transactions
Data Exploration
Visual Arrangement
Sorting data, changing scales (linear/log), and setting min/max can make a huge difference when looking at the same data.
Visual Considerations
Pick representations that make obvious the distinctions you need to care about.
Summary
• Discovery is an iterative process.
• Group events by content, format, and time, and define classifications with eventtypes and tags.
• Focus on promising fields with correlations.
• Discover unknown anomalies with small clusters.
• Visualize your data, from a dozen angles.
But wait!
More to come: Predictive Analytics
… | forecast foo
The End
.,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...
.,`......_.,`...,`...,`...,`...,`...,`...,`...,`...,`....._..
...___..|.|...__._..._.__.,`..._.__.,`..___...__.,`...__.|.|.
../.__|.|.|../._`.|.|.'_.\....|.'_.\.../._.\..\.\./\././.|.|.
.|.(__..|.|.|.(_|.|.|.|_).|...|.|.|.|.|.(_).|..\.V..V./..|_|.
..\___|.|_|..\__,_|.|..__/....|_|.|_|..\___/....\_/\_/...(_).
.,`...,`...,`...,`..|_|.,`...,`...,`...,`...,`...,`...,`.....
.,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...
Golf clapping at #datamining
Mine the Gap.