TRANSCRIPT
Copyright © 2014 Splunk Inc.
Fred Wilmot (CISSP), Director, Global Security Practice
Sebastien Tricaud, Principal Strategist, Global Security Practice
Machine Learning, Entropy and Fraud in
Splunk
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Agenda
• What is Machine Learning?
• Use cases
• Results
• Lessons learned
WARNING
Do not visit the URLs in this presentation; they will make your computer sick!
Machine Learning Goal
Program computers to use example data or past experience to solve a given problem
Some Machine Learning Use Cases
• User behavior profiling and baselining
• Asset and application modeling
• Finding new security threats
  – SQLi
  – Network proxy/DNS/evaluation
  – Sentiment from SLA (semantic language analysis)
  – Exfiltration
  – C2 channels / malware
• Fraud
Master Machine Learning in 2 slides!
Machine that Learns
• Algorithms: types of learning
• Input vectors
• Outputs
• Training regimes
• Noise
• Performance evaluation
Learn – Classify – Cluster
• Learning:
  – Is "Subject: Fais grandir ton machin" ("make your thing grow") spam?
  – Is "jet-machinery.com" a valid URL?
  – Store what we know in a good or a bad dataset
• Classify (supervised/semi-supervised learning):
  – Based on what was learned, tries to put new things into the good or the bad dataset and re-evaluates the model.
• Cluster (unsupervised learning):
  – Group objects in a geometric space
Use Cases
• Domain analysis for threat detection
• SQL injection attack detection
• Web-based financial fraud
Use case: Threat detection via Domain Analysis
• www.google.com: known good URL
• www.g0ogle.com: really close to a known good URL… probably malicious!
Accelerate your Hunting, Shannon!
URLs from web logs and email → ML: Levenshtein distance and Shannon entropy → anomalous URLs
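As a sketch of what those two measures do, here are plain-Python stand-ins (our own helper functions, not the actual Splunk commands from the talk):

```python
import math

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

# A candidate one edit away from a whitelisted domain is suspicious.
print(levenshtein("www.g0ogle.com", "www.google.com"))  # → 1
# Random-looking names tend to score higher entropy than dictionary words.
print(shannon_entropy("qx7zk2vw9p"), shannon_entropy("google"))
```

A tiny edit distance to a known-good domain flags typosquats like g0ogle; high entropy flags machine-generated names.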
Working with Data
• Rule #1: be sure to ingest the data properly
  – 'CIM' the data
  – Make sure fields are extracted
  – Make sure events are sourcetyped appropriately
• Rule #2: make sure you understand your data's context
• Rule #3: choose an algorithm you understand to evaluate the data
• Rule #4: have a general idea of what your outcome should be
• Rule #5: see rule #1
Example: how do we get the entropy of a subdomain properly? Consume/extract URLs → apply Shannon entropy → validate the results
Detecting the No. 1 Programming Error
Detecting SQLi
Web proxy logs + web access logs → stochastic gradient descent: Bayesian, naive Bayesian and bag of words → 92% true positives
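The exact models behind that number aren't shown, but the bag-of-words plus naive Bayes idea can be illustrated with a toy incremental classifier (the training strings and labels below are made up for illustration):

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(s: str):
    """Bag of words: lowercase alphanumeric tokens plus SQL metacharacters."""
    return re.findall(r"[a-z0-9_]+|['\"=;()-]", s.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing, trained incrementally."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of examples

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        tokens = tokenize(text)
        vocab = {t for c in self.word_counts.values() for t in c}
        total_docs = sum(self.doc_counts.values())
        best, best_lp = None, -math.inf
        for label in self.doc_counts:
            lp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(vocab)
            for t in tokens:
                lp += math.log((self.word_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.train("id=42&page=home", "benign")
nb.train("q=shoes&sort=price", "benign")
nb.train("id=1' or '1'='1", "sqli")
nb.train("id=1 union select password from users", "sqli")
print(nb.predict("user=1 union select * from accounts"))  # → sqli
```

Because training is incremental (each `train` call just updates counts), the same shape of model gets more accurate as more labeled data is applied, which is the property the talk attributes to its SGD approach.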
Why is Fraud detection so slow?
• Authenticated transactions are, well… authenticated :(
• Slight variations in user behavior are hard to detect
• Manual processes require multiple people
Math saves Bank$
Web logs with session keys, screen resolution, user name → randomness of the key sizes and the n-grams of keys; clustering to find outliers → discover hijacked, proxied sessions
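A minimal sketch of that idea, assuming the session key is the field being scored (the key values and the z-score threshold below are illustrative, not from the talk):

```python
import math
from collections import Counter

def ngrams(s, n=2):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_entropy(key, n=2):
    """Shannon entropy over character bigrams of a session key."""
    grams = ngrams(key, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def outliers(keys, threshold=1.5):
    """Flag keys whose bigram entropy deviates > `threshold` std devs from the mean."""
    scores = [ngram_entropy(k) for k in keys]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0
    return [k for k, s in zip(keys, scores) if abs(s - mean) / std > threshold]

# Legitimate keys come from the same generator; a forged or replayed key
# with a very different structure stands out.
good = ["a9f3c2e17b4d", "c4b8e1f92a3d", "e2d7a4c91f3b", "b3f9c7a24e1d"]
bad = ["aaaaaaaaaaaa"]  # hypothetical forged key with near-zero randomness
print(outliers(good + bad))  # → ['aaaaaaaaaaaa']
```

Real session hijack detection would cluster on several features at once (key entropy, key length, screen resolution, user name), but the one-dimensional version shows the mechanics.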
So how does all this work??
Short answer… you install a couple of apps and train the models for a bit… and that's it.
No really, what's under the hood?
Aah…
Our Data Journey: ML Exploration Scope
Questions:
• How much data will this evaluation require?
• What kind of data can we apply our learning to?
• What data sources will we need to work with to get a valuable result?
• Can we understand good/bad using algorithms?
Assumptions:
• Scaled test infrastructure
• High-quality data
• Machine learning functions written in Splunk
• Our approach will get results
• Iteration and collaboration on training sets
Splunk + ML Flow
Data → label + data → index label+data → search → Machine Learning Framework → (results+tag) + ML → K/V store results
Design Decisions
• Search time? Index time?
• Data stores and choices?
• How would we relate calculated values at search time back to raw data at ingest time?
• Do we have reference data?
• Batch or near-real-time ML evaluation?
We made two different choices: index-time and search-time ML for testing.
Index time requirements
• We need a unique identifier for each event, or we can't relate features evaluated back to the raw data.
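A minimal sketch of that requirement: tag every event with a UUID at ingest so search-time ML results can be joined back to the raw event (the field name `uuid` and the JSON wrapping are our choices, not Menage's actual format):

```python
import json
import uuid

def tag_event(raw: str) -> str:
    """Wrap a raw log line with a UUID so search-time ML results
    can be joined back to the original event."""
    return json.dumps({"uuid": str(uuid.uuid4()), "raw": raw})

evt = json.loads(tag_event('1.2.3.4 GET /index.php?id=1'))
print(evt["uuid"], evt["raw"])
```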
Machine Learning Iteration and Algorithms
Requirements:
• KV store for labels and raw data
• Methodology for interchangeable algorithms interacting with the KV store
• Iterative, scalable method for creating a reference data set
• Ability to label data, and operate on it
Tools:
• MLSET/MLGET
• Levenshtein – new
• Bayes – new
• Shannon entropy – new
• WordCount – new SPL
• Fast Fourier – new
• (Perceptron) – coming soon
• (Gradient descent) – coming soon
ML Architecture – Data Acquisition
Menage (proxy thread, adds UUID) → Forwarder → Indexers
ML Architecture – Data Evaluation
Menage (proxy thread, adds UUID) → Indexers
| anomalies field=file labelonly=true maxvalues=10 | bayes field=* | output entropy
• User uses ML to evaluate data
• Label::value adds a calculated field to the data
• Label::value is added to the event stream
Using a Key/Value Persistent Cache
• Populate the Redis KV store based on ML search output
• Label the event with the new Label::value mapped to its UUID
• Pass Label::value → index time to Menage
• Import a Redis module into Splunk as a lookup for a value given a key (or use a key store of your choice)
Redis is an open source, advanced key-value store.
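A sketch of the label cache; a plain dict stands in for Redis here, though with redis-py the same scheme would use `r.set(key, value)` / `r.get(key)` (the `mlresult:` key prefix and the method names are our inventions, not the talk's):

```python
class LabelStore:
    """Map event UUIDs to ML labels, mimicking the Label::value scheme.
    A dict stands in for Redis; swap in a redis.Redis client for real use."""
    def __init__(self):
        self._kv = {}

    def put_label(self, event_uuid: str, label: str, value: str):
        # Key scheme (our choice): one key per event, "label::value" payload.
        self._kv[f"mlresult:{event_uuid}"] = f"{label}::{value}"

    def lookup(self, event_uuid: str):
        entry = self._kv.get(f"mlresult:{event_uuid}")
        if entry is None:
            return None
        label, _, value = entry.partition("::")
        return label, value

store = LabelStore()
store.put_label("e1d2c3", "anomaly", "high_entropy")
print(store.lookup("e1d2c3"))  # → ('anomaly', 'high_entropy')
```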
Evaluating Events with Reference Data
• Generate a list of the top 5 whitelist domains to use the words as the key list for the Levenshtein calculation; we want a reference known-good entropy list!
  – top_accepted_domains.csv
  – top_sites.txt
• Create a whitelist of users for all data (we may want to rate their risk at some point ;)
  – proxy_users.csv
  index=bluecoat cs_username=* cs_categories="whitelist*" | lookup
• Pull down a PhishTank verified phishing mail list; we want a reference blacklist lookup:
  – phishtank_verified.csv
Extracting a URL properly
Sample URL | TLD | Comments
http://www.brit.croydon.sch.uk | croydon.sch.uk | Third-level TLD allocated by the Local Education Authority
192.168.0.42 | (none) | IPv4 address, no TLD
www.splunk.42 | 42 | This is not an IP address; 42 is correct
www.example.paris | paris | gTLD extracted smoothly
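The longest-suffix matching behind those extractions can be sketched as follows. The suffix set here is a tiny hand-picked sample, not the real Public Suffix List that a library like faup works from, and IP-address detection is omitted:

```python
# Toy suffix set; real extraction uses the full Public Suffix List.
SUFFIXES = {"com", "paris", "sch.uk", "croydon.sch.uk", "co.uk", "uk"}

def split_domain(host: str):
    """Return (domain_without_tld, tld) using longest-suffix match."""
    labels = host.split(".")
    for i in range(len(labels)):           # longest candidate suffix first
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            rest = labels[:i]              # labels left of the matched suffix
            return (rest[-1] if rest else "", candidate)
    # Fallback when no known suffix matches.
    return (labels[-1] if len(labels) == 1 else labels[-2], labels[-1])

print(split_domain("www.splunk.com"))            # → ('splunk', 'com')
print(split_domain("www.brit.croydon.sch.uk"))   # → ('brit', 'croydon.sch.uk')
```

Matching the longest suffix first is what makes multi-label TLDs like croydon.sch.uk come out right instead of just "uk".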
FAUP
http://www.splunk.com/view/enterprise-security-app/SP-CAAAE8Z#tab_2
Web server → lua input modules → Faup library → lua output modules → Splunk state store
domain_without_tld: splunk
tld: com
• How many TLDs are "com"?
• How many domains are "splunk"?
Using Evaluated Data for ML Features: MLSET/MLGET
Each event has a UUID, which is expected by the ML search commands MLSET and MLGET.
• This calculates and populates field values which we'll use as ML features to graph or represent the data.
• These calculations create the labels that distinguish 'anomalies' or 'outliers' in the grouping of data we are evaluating.
Search-time operation on Splunk data to put into the K/V stash:
index=bluecoat cs_host=* | lookup webfaup url as cs_host | lookup wordstats word as url_domain | rename url_domain as domain ws_entropy as entropy | mlset algo="listlevenshtein" fields="domain,entropy"
Pulling the machine learning results back at search:
index=bluecoat cs_host=* | mlget algo="listlevenshtein" | table in.domain,in.entropy,levenscores.*
Then we investigate results, and graph!
Results
• Wrote 4 algorithms for evaluating URLs for these use cases: malware, exfiltration, insider-threat detection, phishing attacks
• Created a method to build ML into Splunk using a KV store
• Identified fraud and SQLi in proxy logs
• Make as few index-time decisions as possible to stay as close to real time as possible
Get the URL Parser app: http://apps.splunk.com/app/1545/
Another approach to the same data…
For Security + Data Science N00bs
ML for Proxy Logs
The Approach
• Apply a machine learning framework to evaluate proxy data and classify it at index time, based on specific features of the data
• Performs intelligent analysis on incoming data and classifies it
• Focus on identifying SQL injection
• Because of the incremental training approach (stochastic gradient descent), it gets more accurate as more data is applied
What It Does
• Allows monitoring of calculated attributes
• Allows training on specific data fields for accuracy and feature isolation
• Seamlessly distributes trained models to all instances of Menage
Why It Matters
• ML for Proxy allows for multiple levels of automatic analysis
• Machine learning models installed by default adapt to your data and get better over time (stochastic gradient descent)
• Incoming data is enriched via trained models and Menage before index time
• The ModelPipeline Framework allows you to create custom models to fit your needs
How To Use It
• Step 1: Follow the instructions to configure Menage in the Menage Specification document.
• Step 2: Configure regular expressions in props.conf if needed.
• Step 3: Train models from the "Train Models" dashboard.
  – bow(php), where php is the PHP-arguments field of the URL, gives good results for SQL injection
  – Index your reference data, and evaluate change over time
• Step 4: Forward new data through Menage to have data classification appended.
• Step 5: Analyze enriched data and periodically re-train models.
Step 1
• Menage must be configured on any indexer you want data enrichment and classification on.
• The necessary conf files can either be pushed out in a distributed scenario or modified manually.
• Menage is actually started by executing handler_server.py and menage.go.
• Authentication is stored in a configuration file in that directory; more info can be found in the Menage Python Handler document.
Step 2
• The current regular expressions are designed for SGOS proxy data.
• Regular expressions and parameter names can be changed as needed; just remember to put the new parameter name(s) in the train command as well.
• The contents of the MLFramework folder can also be extracted into the bin directory of any app for machine learning capabilities.
Step 3
• Training the models is probably the most important step!
• Be careful which parameters you choose to train on; too many features will decrease accuracy, as will too few.
• Be sure to only train on features relevant to what you're looking for
  – E.g. PHP arguments if you're looking for SQL injection
• The extra parameter functions are really useful for specific tasks:
  – E.g. a bag-of-words approach applied to PHP arguments can be really useful for SQL injection detection
Step 4
• Forwarders must be configured to send all data to a port Menage is listening on to get classification on new data.
• Ideally there should be an instance of Menage running on every indexer so all of your data is enriched.
• The ports Menage listens on and sends to can be modified in the menage.ini file in the bin directory of Menage.
Step 5
• When Menage classifies incoming data, labels are appended to the metadata of the event, which can then be searched and evaluated on.
  – The screenshot at the beginning of the slideshow shows the number of events classified by Menage as having SQL content, by semantic analysis and by Snort signature detection.
• Most models support incremental training and should be trained frequently on new incoming data to improve accuracy.
  – This also allows the models to adapt to your network.
Constraints
• Assuming independent features and algorithms, false positives will not go up when using a cascade.
• However, true positives will decrease,
• unless we keep the detection specialised and simple, and are therefore able to make P(A|M) = 1.0 or very close.
Assumptions
• Perfect detection is impossible.
• Threat coverage is less than 100%.
• Log feeds can fail sometimes.
• Something that is malicious *might* cause an alarm.
• The entire set of malicious events includes those we can detect, those we might detect, and some we don't even know about.
• Of those we don't know about, given the right circumstances, we have a chance of discovering them through statistical analysis.
• Even when we should be able to detect an event, the above constraints make this less than certain.
What can we control?
• The effectiveness of the IDS;
• Coverage;
• Noisy events;
• Correlation algorithms.
Lessons Learned
"A pessimist sees the difficulty in every opportunity; an optimist sees the opportunity in every difficulty."
– Winston Churchill
THANK YOU