TRANSCRIPT
Copyright © 2014 Splunk Inc.
Fred Wilmot (CISSP), Director, Global Security Practice
Sebastien Tricaud, Principal Strategist, Global Security Practice
Machine Learning, Entropy and Fraud in
Splunk
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Agenda
• What is Machine Learning?
• Use cases
• Results
• Lessons learned
WARNING
Do not visit the URLs in this presentation; they will make your computer sick!
Machine Learning Goal
Program computers to use example data or past experience to solve a given problem
Some Machine Learning Use Cases
• User behavior profiling and baselining
• Asset and application modeling
• Finding new security threats
  – SQLi
  – Network proxy/DNS/evaluation
  – Sentiment from SLA (semantic language analysis)
  – Exfiltration
  – C2 channels / malware
• Fraud
Master Machine Learning in 2 slides!
Machine that Learns
• Algorithms: types of learning
• Input vectors
• Outputs
• Training regimes
• Noise
• Performance evaluation
Learn – Classify – Cluster
• Learning:
  – Is "Subject: Fais grandir ton machin" ("make your thing grow") spam?
  – Is "jet-machinery.com" a valid URL?
  – Store what we know in a good or a bad dataset
• Classify (supervised/semi-supervised learning):
  – Based on what was learned, tries to put new things into the good or the bad dataset and re-evaluates the model.
• Cluster (unsupervised learning):
  – Group objects in a geometric space
Use Cases
• Domain analysis for threat detection
• SQL injection attack detection
• Web-based financial fraud
Use case: Threat detection via Domain Analysis
• www.google.com: known good URL
• www.g0ogle.com: really close to a known good URL… probably malicious!
Accelerate your Hunting, Shannon!
URLs from web logs and email → ML: Levenshtein distance and Shannon entropy → anomalous URLs
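As a sketch of what those two measures do, here are plain-Python stand-ins (our own helper functions, not the actual Splunk commands from the talk):

```python
import math

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

# A candidate one edit away from a whitelisted domain is suspicious.
print(levenshtein("www.g0ogle.com", "www.google.com"))  # → 1
# Random-looking names tend to score higher entropy than dictionary words.
print(shannon_entropy("qx7zk2vw9p"), shannon_entropy("google"))
```

A tiny edit distance to a known-good domain flags typosquats like g0ogle; high entropy flags machine-generated names.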
Working with Data
• Rule #1: be sure to ingest the data properly
  – 'CIM' the data
  – Make sure fields are extracted
  – Make sure events are sourcetyped appropriately
• Rule #2: make sure you understand your data's context
• Rule #3: choose an algorithm you understand to evaluate the data
• Rule #4: have a general idea of what your outcome should be
• Rule #5: see rule #1
Example: how do we get the entropy of a subdomain properly? Consume/extract URLs → apply Shannon entropy → validate the results
Detecting the No. 1 Programming Error
Detecting SQLi
Web proxy logs + web access logs → stochastic gradient descent: Bayesian, naive Bayesian and bag of words → 92% true positives
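The exact models behind that number aren't shown, but the bag-of-words plus naive Bayes idea can be illustrated with a toy incremental classifier (the training strings and labels below are made up for illustration):

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(s: str):
    """Bag of words: lowercase alphanumeric tokens plus SQL metacharacters."""
    return re.findall(r"[a-z0-9_]+|['\"=;()-]", s.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing, trained incrementally."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of examples

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        tokens = tokenize(text)
        vocab = {t for c in self.word_counts.values() for t in c}
        total_docs = sum(self.doc_counts.values())
        best, best_lp = None, -math.inf
        for label in self.doc_counts:
            lp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(vocab)
            for t in tokens:
                lp += math.log((self.word_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.train("id=42&page=home", "benign")
nb.train("q=shoes&sort=price", "benign")
nb.train("id=1' or '1'='1", "sqli")
nb.train("id=1 union select password from users", "sqli")
print(nb.predict("user=1 union select * from accounts"))  # → sqli
```

Because training is incremental (each `train` call just updates counts), the same shape of model gets more accurate as more labeled data is applied, which is the property the talk attributes to its SGD approach.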
Why is Fraud detection so slow?
• Authenticated transactions are, well… authenticated :(
• Slight variations in user behavior are hard to detect
• Manual processes require multiple people
Math saves Bank$
Web logs with session keys, screen resolution, user name → randomness of the key sizes and the n-grams of keys; clustering to find outliers → discover hijacked, proxied sessions
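A minimal sketch of that idea, assuming the session key is the field being scored (the key values and the z-score threshold below are illustrative, not from the talk):

```python
import math
from collections import Counter

def ngrams(s, n=2):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_entropy(key, n=2):
    """Shannon entropy over character bigrams of a session key."""
    grams = ngrams(key, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def outliers(keys, threshold=1.5):
    """Flag keys whose bigram entropy deviates > `threshold` std devs from the mean."""
    scores = [ngram_entropy(k) for k in keys]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0
    return [k for k, s in zip(keys, scores) if abs(s - mean) / std > threshold]

# Legitimate keys come from the same generator; a forged or replayed key
# with a very different structure stands out.
good = ["a9f3c2e17b4d", "c4b8e1f92a3d", "e2d7a4c91f3b", "b3f9c7a24e1d"]
bad = ["aaaaaaaaaaaa"]  # hypothetical forged key with near-zero randomness
print(outliers(good + bad))  # → ['aaaaaaaaaaaa']
```

Real session hijack detection would cluster on several features at once (key entropy, key length, screen resolution, user name), but the one-dimensional version shows the mechanics.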
So how does all this work??
Short answer… you install a couple of apps and train the models for a bit… and that's it.
No really, what's under the hood?
Aah…
Our Data Journey: ML Exploration Scope
Questions:
• How much data will this evaluation require?
• What kind of data can we apply our learning to?
• What data sources will we need to work with to get a valuable result?
• Can we understand good/bad using algorithms?
Assumptions:
• Scaled test infrastructure
• High-quality data
• Machine learning functions written in Splunk
• Our approach will get results
• Iteration and collaboration on training sets
Splunk + ML Flow
Data → label + data → index label+data → search → Machine Learning Framework → (results+tag) + ML → K/V store results
Design Decisions
• Search time? Index time?
• Data stores and choices?
• How would we relate calculated values at search time back to raw data at ingest time?
• Do we have reference data?
• Batch or near-real-time ML evaluation?
We made two different choices: index-time and search-time ML for testing.
Index time requirements
• We need a unique identifier for each event, or we can't relate features evaluated back to the raw data.
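A minimal sketch of that requirement: tag every event with a UUID at ingest so search-time ML results can be joined back to the raw event (the field name `uuid` and the JSON wrapping are our choices, not Menage's actual format):

```python
import json
import uuid

def tag_event(raw: str) -> str:
    """Wrap a raw log line with a UUID so search-time ML results
    can be joined back to the original event."""
    return json.dumps({"uuid": str(uuid.uuid4()), "raw": raw})

evt = json.loads(tag_event('1.2.3.4 GET /index.php?id=1'))
print(evt["uuid"], evt["raw"])
```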
Machine Learning Iteration and Algorithms
Requirements:
• KV store for labels and raw data
• Methodology for interchangeable algorithms interacting with the KV store
• Iterative, scalable method for creating a reference data set
• Ability to label data, and operate on it
Tools:
• MLSET/MLGET
• Levenshtein – new
• Bayes – new
• Shannon entropy – new
• WordCount – new SPL
• Fast Fourier – new
• (Perceptron) – coming soon
• (Gradient descent) – coming soon
ML Architecture – Data Acquisition
Menage (proxy thread, adds UUID) → Forwarder → Indexers
ML Architecture – Data Evaluation
Menage (proxy thread, adds UUID) → Indexers
| anomalies field=file labelonly=true maxvalues=10 | bayes field=* | output entropy
• User uses ML to evaluate data
• Label::value adds a calculated field to the data
• Label::value is added to the event stream
Using a Key/Value Persistent Cache
• Populate the Redis KV store based on ML search output
• Label the event with the new Label::value mapped to its UUID
• Pass Label::value → index time to Menage
• Import a Redis module into Splunk as a lookup for a value given a key (or use a key store of your choice)
Redis is an open source, advanced key-value store.
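A sketch of the label cache; a plain dict stands in for Redis here, though with redis-py the same scheme would use `r.set(key, value)` / `r.get(key)` (the `mlresult:` key prefix and the method names are our inventions, not the talk's):

```python
class LabelStore:
    """Map event UUIDs to ML labels, mimicking the Label::value scheme.
    A dict stands in for Redis; swap in a redis.Redis client for real use."""
    def __init__(self):
        self._kv = {}

    def put_label(self, event_uuid: str, label: str, value: str):
        # Key scheme (our choice): one key per event, "label::value" payload.
        self._kv[f"mlresult:{event_uuid}"] = f"{label}::{value}"

    def lookup(self, event_uuid: str):
        entry = self._kv.get(f"mlresult:{event_uuid}")
        if entry is None:
            return None
        label, _, value = entry.partition("::")
        return label, value

store = LabelStore()
store.put_label("e1d2c3", "anomaly", "high_entropy")
print(store.lookup("e1d2c3"))  # → ('anomaly', 'high_entropy')
```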
Evaluating Events with Reference Data
• Generate a list of the top 5 whitelist domains to use the words as the key list for the Levenshtein calculation; we want a reference known-good entropy list!
  – top_accepted_domains.csv
  – top_sites.txt
• Create a whitelist of users for all data (we may want to rate their risk at some point ;)
  – proxy_users.csv
  index=bluecoat cs_username=* cs_categories="whitelist*" | lookup
• Pull down a PhishTank verified phishing mail list; we want a reference blacklist lookup:
  – phishtank_verified.csv
Extracting a URL properly
Sample URL | TLD | Comments
http://www.brit.croydon.sch.uk | croydon.sch.uk | Third-level TLD allocated by the Local Education Authority
192.168.0.42 | (none) | IPv4 address, no TLD
www.splunk.42 | 42 | This is not an IP address; 42 is correct
www.example.paris | paris | gTLD extracted smoothly
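The longest-suffix matching behind those extractions can be sketched as follows. The suffix set here is a tiny hand-picked sample, not the real Public Suffix List that a library like faup works from, and IP-address detection is omitted:

```python
# Toy suffix set; real extraction uses the full Public Suffix List.
SUFFIXES = {"com", "paris", "sch.uk", "croydon.sch.uk", "co.uk", "uk"}

def split_domain(host: str):
    """Return (domain_without_tld, tld) using longest-suffix match."""
    labels = host.split(".")
    for i in range(len(labels)):           # longest candidate suffix first
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            rest = labels[:i]              # labels left of the matched suffix
            return (rest[-1] if rest else "", candidate)
    # Fallback when no known suffix matches.
    return (labels[-1] if len(labels) == 1 else labels[-2], labels[-1])

print(split_domain("www.splunk.com"))            # → ('splunk', 'com')
print(split_domain("www.brit.croydon.sch.uk"))   # → ('brit', 'croydon.sch.uk')
```

Matching the longest suffix first is what makes multi-label TLDs like croydon.sch.uk come out right instead of just "uk".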
FAUP
http://www.splunk.com/view/enterprise-security-app/SP-CAAAE8Z#tab_2
Web server → lua input modules → Faup library → lua output modules → Splunk state store
domain_without_tld: splunk
tld: com
• How many TLDs are "com"?
• How many domains are "splunk"?
Using Evaluated Data for ML Features: MLSET/MLGET
Each event has a UUID, which is expected by the ML search commands MLSET and MLGET.
• This calculates and populates field values which we'll use as ML features to graph or represent the data.
• These calculations create the labels that distinguish 'anomalies' or 'outliers' in the grouping of data we are evaluating.
Search-time operation on Splunk data to put into the K/V stash:
index=bluecoat cs_host=* | lookup webfaup url as cs_host | lookup wordstats word as url_domain | rename url_domain as domain ws_entropy as entropy | mlset algo="listlevenshtein" fields="domain,entropy"
Pulling the machine learning results back at search:
index=bluecoat cs_host=* | mlget algo="listlevenshtein" | table in.domain,in.entropy,levenscores.*
Then we investigate results, and graph!
Results
• Wrote 4 algorithms for evaluating URLs for these use cases: malware, exfiltration, insider-threat detection, phishing attacks
• Created a method to build ML into Splunk using a KV store
• Identified fraud and SQLi in proxy logs
• Make as few index-time decisions as possible to stay as close to real time as possible
Get the URL Parser app: http://apps.splunk.com/app/1545/
Another approach to the same data…
For Security + Data Science N00bs
ML for Proxy Logs
The Approach
• Apply a machine learning framework to evaluate proxy data and classify it at index time, based on specific features of the data
• Performs intelligent analysis on incoming data and classifies it
• Focus on identifying SQL injection
• Because of the incremental training approach (stochastic gradient descent), it gets more accurate as more data is applied
What It Does
• Allows monitoring of calculated attributes
• Allows training on specific data fields for accuracy and feature isolation
• Seamlessly distributes trained models to all instances of Menage
Why It Matters
• ML for Proxy allows for multiple levels of automatic analysis
• Machine learning models installed by default adapt to your data and get better over time (stochastic gradient descent)
• Incoming data is enriched via trained models and Menage before index time
• The ModelPipeline Framework allows you to create custom models to fit your needs
How To Use It
• Step 1: Follow the instructions to configure Menage in the Menage Specification document.
• Step 2: Configure regular expressions in props.conf if needed.
• Step 3: Train models from the "Train Models" dashboard.
  – bow(php), where php is the PHP-arguments field of the URL, gives good results for SQL injection
  – Index your reference data, and evaluate change over time
• Step 4: Forward new data through Menage to have data classification appended.
• Step 5: Analyze enriched data and periodically re-train models.
Step 1
• Menage must be configured on any indexer you want data enrichment and classification on.
• The necessary conf files can either be pushed out in a distributed scenario or modified manually.
• Menage is actually started by executing handler_server.py and menage.go.
• Authentication is stored in a configuration file in that directory; more info can be found in the Menage Python Handler document.
Step 2
• The current regular expressions are designed for SGOS proxy data.
• Regular expressions and parameter names can be changed as needed; just remember to put the new parameter name(s) in the train command as well.
• The contents of the MLFramework folder can also be extracted into the bin directory of any app for machine learning capabilities.
Step 3
• Training the models is probably the most important step!
• Be careful which parameters you choose to train on; too many features will decrease accuracy, as will too few.
• Be sure to only train on features relevant to what you're looking for
  – E.g. PHP arguments if you're looking for SQL injection
• The extra parameter functions are really useful for specific tasks:
  – E.g. a bag-of-words approach applied to PHP arguments can be really useful for SQL injection detection
Step 4
• Forwarders must be configured to send all data to a port Menage is listening on to get classification on new data.
• Ideally there should be an instance of Menage running on every indexer so all of your data is enriched.
• The ports Menage listens on and sends to can be modified in the menage.ini file in the bin directory of Menage.
Step 5
• When Menage classifies incoming data, labels are appended to the metadata of the event, which can then be searched and evaluated on.
  – The screenshot at the beginning of the slideshow shows the number of events classified by Menage as having SQL content, by semantic analysis and by Snort signature detection.
• Most models support incremental training and should be trained frequently on new incoming data to improve accuracy.
  – This also allows the models to adapt to your network.
Constraints
• Assuming independent features and algorithms, false positives will not go up when using a cascade.
• However, true positives will decrease,
• unless we keep the detection specialised and simple, and are therefore able to make P(A|M) = 1.0 or very close.
Assumptions
• Perfect detection is impossible.
• Threat coverage is less than 100%.
• Log feeds can fail sometimes.
• Something that is malicious *might* cause an alarm.
• The entire set of malicious events includes those we can detect, those we might detect, and some we don't even know about.
• Of those we don't know about, given the right circumstances, we have a chance of discovering them through statistical analysis.
• Even when we should be able to detect an event, the above constraints make this less than certain.
What can we control?
• The effectiveness of the IDS;
• Coverage;
• Noisy events;
• Correlation algorithms.
Lessons Learned
"A pessimist sees the difficulty in every opportunity; an optimist sees the opportunity in every difficulty."
– Winston Churchill
THANK YOU