Machine Learning, Entropy and Fraud in Splunk

Copyright © 2014 Splunk Inc.

Fred Wilmot (CISSP), Director, Global Security Practice
Sebastien Tricaud, Principal Strategist, Global Security Practice


Post on 19-Apr-2020



Disclaimer

During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Agenda

• What is Machine Learning?
• Use cases
• Results
• Lessons learned

WARNING

Do not visit the URLs in this presentation; they will make your computer sick!


Machine  Learning  Goal  

Program  computers  to  use  example  data  or  past  experience  to  solve  a  given  problem  

Some Machine Learning Use Cases

• User behavior profiling and baselining
• Asset and application modeling
• Finding new security threats
  – SQLi
  – Network proxy/DNS/evaluation
  – Sentiment from SLA (semantic language analysis)
  – Exfiltration
  – C2 channels / malware
• Fraud


Master  Machine  Learning  in  2  slides!  

Machine that Learns

• Algorithms: types of learning
• Input vectors
• Outputs
• Training regimes
• Noise
• Performance evaluation

Learn – Classify – Cluster

• Learning:
  – Is “Subject: Fais grandir ton machin” (French, roughly “Make your thing grow”) spam?
  – Is “jet-machinery.com” a valid URL?
  – Store what we know in a good or bad dataset
• Classify (supervised/semi-supervised learning):
  – Based on what was learned, try to put things in the good or bad dataset, then re-evaluate the model.
• Cluster (unsupervised learning):
  – Group objects in a geometrical space


Use  Cases  

Use Cases

• Domain analysis for threat detection
• SQL injection attack detection
• Web-based financial fraud

Use case: Threat detection via Domain Analysis

• www.google.com: known good URL
• www.g0ogle.com: really close to a known good URL… probably malicious!

Use case: Threat detection via URL Analysis

• www.google.com: known good URL
• www.g0ogle.com: really close to a known good URL… probably malicious!

Accelerate your Hunting Shannon!

URLs from web logs and email → ML: Levenshtein distance and Shannon entropy → Anomalous URLs
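To make the www.g0ogle.com example concrete, here is a minimal Python sketch of Levenshtein distance used to flag look-alike domains. It is an illustration only, not the Splunk command described in the talk; the helper names `closest_known_good` and `looks_like_typosquat` and the `max_distance=2` cutoff are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_known_good(domain, known_good):
    """Return (reference, distance) for the nearest known-good domain."""
    return min(((ref, levenshtein(domain, ref)) for ref in known_good),
               key=lambda pair: pair[1])


def looks_like_typosquat(domain, known_good, max_distance=2):
    """Hypothetical rule: close to, but not equal to, a known-good domain."""
    _, distance = closest_known_good(domain, known_good)
    return 0 < distance <= max_distance
```

A distance of zero means the domain is on the known-good list, so only small non-zero distances are treated as suspicious.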

Working with Data

• #1 rule: be sure to ingest the data properly
  – ‘CIM’ the data (map it to the Splunk Common Information Model)
  – Make sure fields are extracted
  – Make sure the data is sourcetyped appropriately
• #2 rule: make sure you understand your data’s context
• #3 rule: choose an algorithm you understand to evaluate the data
• #4 rule: have a general idea of what your outcome should be
• #5 rule: see #1 rule

Example: how to get the entropy of a subdomain properly?
Consume/extract URLs → apply Shannon entropy → validate with results
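The entropy step of that example can be sketched in a few lines of Python. This is a generic character-level Shannon entropy, not the implementation used in the talk; it assumes the subdomain label has already been extracted correctly.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Labels made of many distinct, evenly spread characters (as in machine-generated names) score higher than short dictionary-like labels, which is why the result still needs validating against known-good data.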

Detecting the No. 1 Programming Error

Detecting SQLi

Web proxy logs and web access logs → Stochastic gradient descent, Bayesian, naive Bayesian and bag of words → 92% true positive

Why is Fraud detection so slow?

• Authenticated transactions are, well… authenticated :(
• Slight variations in user behavior are hard to detect
• Manual processes require multiple people

Math saves Bank$

Web logs with session keys, screen resolution, user name → Randomness of the key sizes and the n-grams of keys, clustering to find outliers → Discover hijacked, proxied sessions
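As a rough illustration of the idea, the sketch below scores each session key by its length and by the entropy of its character bigrams, then flags keys that sit far from the rest. A simple z-score check stands in for the clustering step here; the function names and the threshold of 2.0 are assumptions, not what the talk used.

```python
import math
import statistics
from collections import Counter


def char_ngrams(key, n=2):
    """All overlapping character n-grams of a session key."""
    return [key[i:i + n] for i in range(len(key) - n + 1)]


def ngram_entropy(key, n=2):
    """Shannon entropy (bits) of the key's n-gram distribution."""
    grams = char_ngrams(key, n)
    if not grams:
        return 0.0
    total = len(grams)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(grams).values())


def zscores(values):
    """Population z-scores; all zeros when the values do not vary."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


def outlier_keys(keys, threshold=2.0):
    """Flag keys whose length or bigram entropy is unusual for this population."""
    length_z = zscores([len(k) for k in keys])
    entropy_z = zscores([ngram_entropy(k) for k in keys])
    return {k for k, zl, ze in zip(keys, length_z, entropy_z)
            if abs(zl) > threshold or abs(ze) > threshold}
```

With very small samples a single outlier cannot exceed a z-score of about the square root of the sample size, so the threshold needs tuning to the volume of sessions being compared.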

You know you want in :)

[email protected]


So  how  does  all  this  work??  

Short answer…

You install a couple of apps and train the models for a bit… and that’s it

No really, what’s under the hood?

Aah…

Our Data Journey: ML Exploration Scope

Questions
• How much data will this evaluation require?
• What kind of data can we apply our learning to?
• What data sources will we need to work with to get a valuable result?
• Can we understand good/bad using algorithms?

Assumptions
• Scaled test infrastructure
• High-quality data
• Machine learning functions written in Splunk
• Our approach will get results
• Iteration and collaboration on training sets

Splunk + ML Flow

[Diagram: Data; Label + Data; Index Label+Data; Search; Machine Learning Framework; (Results+Tag) + ML; K/V Stores results]

Design Decisions

• Search time?
• Index time?
• Data stores and choices?
• How would we relate calculated values at search time back to raw data at ingest time?
• Do we have reference data?
• Batch or near-real-time ML evaluation?
• We made two different choices for testing: index-time and search-time ML.

Index time requirements

• We need a unique identifier for each event, or we can’t relate the evaluated features back to the raw data.
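A small Python sketch of why the identifier matters: once every event carries its own UUID, features computed later can be joined back to the raw event. The `uuid` and `features` field names and both helper functions are hypothetical, chosen only for illustration.

```python
import uuid


def add_uuid(event: dict) -> dict:
    """Return a copy of the event with a new 'uuid' field (field name is an assumption)."""
    tagged = dict(event)
    tagged["uuid"] = str(uuid.uuid4())
    return tagged


def join_features(events, features_by_uuid):
    """Attach separately computed features to their raw events via the shared UUID."""
    return [
        {**event, "features": features_by_uuid.get(event["uuid"], {})}
        for event in events
    ]
```

Without the shared key, a feature such as an entropy score would have no reliable way back to the event it was computed from.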

Machine Learning Iteration and Algorithms

Requirements
• KV store for labels and raw data
• Methodology for interchangeable algorithms interacting with the KV store
• Iterative, scalable method for creating a reference data set
• Ability to label data and operate on it

Tools
• MLSET/MLGET
• Levenshtein – New
• Bayes – New
• Shannon Entropy – New
• WordCount – New SPL
• Fast Fourier – New
• (Perceptron) – coming soon
• (Gradient Descent) – coming soon

ML Architecture – Data Acquisition

[Diagram: Forwarder, Menage (Proxy Thread, Add UUID), Indexes]

ML Architecture – Data Evaluation

[Diagram: Menage (Proxy Thread, Add UUID), Indexes]

• User uses ML to evaluate data: | anomalies field=file labelonly=true maxvalues=10 | bayes field=* | output entropy
• Adds a calculated field (Label::value) to data
• Label::value added to event stream

Using Key Value Persistent Cache

• Populate Redis KV store based on ML search output.
• Label event with new Label::value mapped to UUID
• Pass Label::value → index time to Menage
• Import Redis module to Splunk as a lookup for a value given a key (or use key store of choice)

Redis is an open source, advanced key-value store.
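The lookup pattern described above can be sketched without any external service. In this illustration a plain in-memory dictionary stands in for Redis (or whichever key store is chosen); the `LabelStore` class, the `enrich` function and the `labels` field are all hypothetical names.

```python
class LabelStore:
    """In-memory stand-in for a key-value store such as Redis (an assumption)."""

    def __init__(self):
        self._labels = {}

    def set_label(self, event_uuid, label, value):
        """Store one label/value pair under the event's UUID."""
        self._labels.setdefault(event_uuid, {})[label] = value

    def get_labels(self, event_uuid):
        """Return a copy of all labels known for this UUID (empty if none)."""
        return dict(self._labels.get(event_uuid, {}))


def enrich(event, store):
    """Return a copy of the event with its stored Label::value pairs appended."""
    labels = store.get_labels(event["uuid"])
    enriched = dict(event)
    enriched["labels"] = [f"{name}::{value}"
                          for name, value in sorted(labels.items())]
    return enriched
```

Keying everything on the UUID is what lets a label computed at search time be applied to the matching event later in the pipeline.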

Evaluating Events with Reference Data

• Generate a list of the top 5 whitelist domains to use the words as the key list for the Levenshtein calculation. We want a reference known-good entropy list!
  – top_accepted_domains.csv
  – top_sites.txt
• Create a whitelist of users for all data (we may want to rate their risk at some point ;)
  – proxy_users.csv
  – index=bluecoat cs_username=* cs_categories="whitelist*" | lookup
• Pull down a PhishTank verified phishing mail list; we want a reference blacklist lookup:
  – phishtank_verified.csv

Extracting a URL properly

Sample URL                       TLD               Comments
http://www.brit.croydon.sch.uk   croydon.sch.uk    Third-level TLD allocated by the Local Education Authority
192.168.0.42                     (none)            IPv4 address, no TLD
www.splunk.42                    42                This is not an IP address, 42 is correct
www.example.paris                paris             GTLD extracted smoothly
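The four rows above can be reproduced with a short Python sketch. It is not the Faup library used in the talk; `KNOWN_SUFFIXES` is a tiny hand-written suffix set chosen for this example, where a real deployment would load a complete suffix list.

```python
import ipaddress
from urllib.parse import urlsplit

# Tiny illustrative suffix set (an assumption); not a complete suffix list.
KNOWN_SUFFIXES = {"com", "paris", "uk", "sch.uk", "croydon.sch.uk"}


def extract_tld(url):
    """Return the longest known suffix of the host, or None for IP addresses."""
    # urlsplit only fills in the host when a scheme or leading "//" is present.
    host = urlsplit(url if "://" in url else "//" + url).hostname
    if not host:
        return None
    try:
        ipaddress.ip_address(host)
        return None  # a literal IP address has no TLD
    except ValueError:
        pass
    labels = host.split(".")
    # Try the longest candidate first so multi-level suffixes win.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in KNOWN_SUFFIXES:
            return candidate
    return labels[-1]  # unknown suffix: fall back to the last label
```

Checking for an IP address before splitting on dots is what keeps 192.168.0.42 from being reported with a TLD of 42, while www.splunk.42 still is.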

Using Evaluated Data for ML Features

http://www.splunk.com/view/enterprise-security-app/SP-CAAAE8Z#tab_2

[Diagram: FAUP (Faup Library, Web Server, lua input modules, lua output modules, f4E) splits the URL into domain_without_tld: splunk and tld: com; the Splunk State Store answers “How many TLDs are ‘com’?” and “How many domains are ‘splunk’?”]

MLSET/MLGET

Each event has a UUID, which is expected by the ML search commands MLSET and MLGET.

• This calculates and populates field values which we’ll use as ML features to graph or represent the data
• These calculations create the labels that distinguish ‘anomalies’ or ‘outliers’ in the grouping of data we are evaluating.

Search-time operation on Splunk data to put into K/V stash:
index=bluecoat cs_host=* | lookup webfaup url as cs_host | lookup wordstats word as url_domain | rename url_domain as domain ws_entropy as entropy | mlset algo="listlevenshtein" fields="domain,entropy"

Pulling the Machine Learning results back at search:
index=bluecoat cs_host=* | mlget algo="listlevenshtein" | table in.domain,in.entropy,levenscores.*

Then we investigate results, and graph!

Results

• Wrote 4 algorithms for evaluating URLs for these use cases: malware, exfiltration, insider threat detection, phishing attacks
• Created a method to build ML into Splunk using a KV store
• Identified fraud and SQLi in proxy logs
• Make as few index-time decisions as possible to stay as close to real time as possible.


Get URL Parser app

http://apps.splunk.com/app/1545/


Another  approach  to  the  same  data…  


For  Security  +  Data  Science  N00bs    

ML  for  Proxy  logs  

The Approach

• Apply a Machine Learning Framework to proxy data in order to classify the data at index time, based on specific features of the data.
• Performs intelligent analysis on incoming data and classifies it
• Focus on identifying SQL injection
• Because of the incremental training approach (Stochastic Gradient Descent), it gets more accurate as more data is applied

What It Does

• Allows monitoring of calculated attributes
• Allows training on specific data fields for accuracy and feature isolation
• Seamlessly distributes trained models to all instances of Menage

Why It Matters

• ML for Proxy allows for multiple levels of automatic analysis
• Machine learning models installed by default adapt to your data and get better over time (Stochastic Gradient Descent)
• Incoming data is enriched via trained models and Menage before index time
• ModelPipeline Framework allows you to create custom models to fit your needs

How To Use It

• Step 1: Follow the instructions in the Menage Specification document to configure Menage.
• Step 2: Configure regular expressions in props.conf if needed.
• Step 3: Train models from the “Train Models” dashboard.
  – bow(php), where php is the PHP arguments field of the URL, gives good results for SQL injection
  – Index your reference data, and evaluate change over time
• Step 4: Forward new data through Menage to have data classification appended.
• Step 5: Analyze enriched data and periodically re-train models.

Step 1

• Menage must be configured on any indexer where you want data enrichment and classification.
• Necessary conf files can either be pushed out in a distributed scenario or modified manually.
• Menage is started by executing handler_server.py and menage.go.
• Authentication is stored in a configuration file in that directory; more info can be found in the Menage Python Handler document.

Step 2

• Current regular expressions are designed for SGOS proxy data.
• Regular expressions and parameter names can be changed as needed; just remember to put the new parameter name(s) in the train command as well.
• Contents of the MLFramework folder can also be extracted into the bin directory of any app for machine learning capabilities.

Step 3

• Training the models is probably the most important step!
• Be careful with the parameters you choose to train on: too many features will decrease accuracy, as will too few.
• Be sure to train only on features relevant to what you’re looking for
  – E.g. PHP arguments if you’re looking for SQL injection
• The extra parameter functions are really useful for specific tasks:
  – E.g. a bag-of-words approach applied to PHP arguments can be really useful for SQL injection detection
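To show what a bag of words over PHP arguments might look like, here is a minimal Python sketch. It is not the `bow()` function from the framework; the tokenisation rule (words, digit runs, and single punctuation characters as separate tokens) is an assumption made for this example.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

# Words, runs of digits, and single punctuation characters become tokens.
TOKEN = re.compile(r"[a-z_]+|\d+|[^\sa-z_\d]")


def bag_of_words(php_args: str) -> Counter:
    """Count tokens in a URL-decoded, lower-cased query string."""
    decoded = unquote_plus(php_args).lower()
    return Counter(TOKEN.findall(decoded))
```

For example, `bag_of_words("id=1%27+OR+%271%27%3D%271")` counts four single quotes and one `or`, the kind of token mix that rarely appears in benign arguments.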

Step 4

• Forwarders must be configured to send all data to a port Menage is listening on to get classification on new data.
• Ideally there should be an instance of Menage running on every indexer so all of your data is enriched.
• The ports Menage is listening on and sending to can be modified in the menage.ini file in the bin directory of Menage.

Step 5

• When Menage classifies incoming data, labels are appended to the metadata of the event, which can then be searched and evaluated.
  – The screenshot at the beginning of the slideshow shows the number of events classified by Menage as having SQL content by semantic analysis and by Snort signature detection.
• Most models support incremental training and should be trained frequently on new incoming data to improve accuracy
  – This also allows the models to adapt to your network
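Incremental training can be illustrated with an online logistic regression that takes one stochastic gradient descent step per labelled event. This is a generic sketch, not the model shipped with the framework; the class name, the `partial_fit` method and the learning rate of 0.5 are assumptions.

```python
import math


class OnlineLogisticModel:
    """Logistic regression over token-presence features, trained one event at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba(self, tokens):
        """Estimated probability that the event is malicious (label 1)."""
        score = self.bias + sum(self.weights.get(t, 0.0) for t in set(tokens))
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def partial_fit(self, tokens, label):
        """One SGD step on a single labelled event (1 = SQLi, 0 = benign)."""
        error = self.predict_proba(tokens) - label
        for t in set(tokens):
            self.weights[t] = self.weights.get(t, 0.0) - self.learning_rate * error
        self.bias -= self.learning_rate * error
```

Because each call to `partial_fit` only nudges the existing weights, the model can keep being updated as new labelled events arrive instead of being retrained from scratch.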

Constraints

• Assuming independent features and algorithms, false positives will not go up when using a cascade.
• However, true positives will decrease.
• Unless we keep the detection specialised and simple, and are therefore able to make P(A|M) = 1.0 or very close.
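One way to read these constraints, assuming P(A|M) denotes the probability of an alarm given a malicious event and that the cascade raises an alarm only when every one of its n independent stages does, is the following:

```latex
P(A \mid M) = \prod_{i=1}^{n} P(A_i \mid M),
\qquad
P(A \mid \neg M) = \prod_{i=1}^{n} P(A_i \mid \neg M)
```

Every factor is at most 1, so neither rate can rise as stages are added. As an illustrative calculation with made-up numbers, two stages that each detect 90% of malicious events and each fire on 5% of benign ones give a combined true positive rate of 0.9 × 0.9 = 0.81 and a false positive rate of 0.05 × 0.05 = 0.0025. The true positive rate only holds up if each stage's P(A_i|M) is 1.0 or very close to it.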

Assumptions

• Perfect detection is impossible.
• Threat coverage is less than 100%.
• Log feeds can fail sometimes.
• Something that is malicious *might* cause an alarm.
• The entire set of malicious events includes those we can detect, those we might detect, and some we don’t even know about.
• Of those we don’t know about, given the right circumstances, we have a chance of discovering some through statistical analysis.
• Even when we should be able to detect an event, the above constraints make this less than certain.

What can we control?

• The effectiveness of the IDS
• Coverage
• Noisy events
• Correlation algorithms


Lessons  Learned  

“A pessimist sees the difficulty in every opportunity; an optimist sees the opportunity in every difficulty.”
- Winston Churchill


THANK  YOU