practical machine learning agenda - the punch · 2020-04-07 · 4 searching visualising reporting...

18
Practical Machine Learning PunchPlatform team Agenda 1 Moving To Machine Learning Starting From Log Management Challenges Thanks Thales

Upload: others

Post on 20-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Practical Machine Learning

PunchPlatform team

Agenda

1

Moving To Machine Learning

Starting From Log Management

Challenges

ThanksThales

Page 2: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Starting From Log Management

2

Page 3: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

3

Starting From Log Management

YY

Y

Y YBatch Processing

Speed Processing

Alerting

Searching Visualising Reporting

logs, documents, events, …Data

logs, documents, events, …Data

Y

- Solve all the operational issues to collect, transport and transform the data - Make it easy for the user to compose its functional pipeline - Provide end-to-end deployment, supervision and monitoring

Page 4: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

4

Searching Visualising Reporting Alerting

ElasticSearchKafkaKafka Storm Storm

Data Processing

Raw Data IndexingForwardingParsing

Normalising

Log Management Technical Pipeline

Ceph

Long Term Storage

- Powered by well-known COTS, user adoption is key - Make it possible to plug in arbitrary processing - No loss of data ever

Page 5: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

5

End User Experience

- Forensics : searching, reporting, aggregating - Real-time, dynamic - Already a data scientist starter tool

Page 6: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

6

CyberSecurity Platforms Architecture

Boston Paris Sydney Singapore

dns/ldap/etc storm zookeeper kafka elasticsearch ceph

- automatically deployed - yearly updates, with no service

interruption - No loss of data ever - 16 production platforms

Page 7: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

7

Data is at hand

1-3 months are replicated : Each log is stored on 2 (or more) Elasticsearch servers. 4-12 months are not replicated : but online and indexed in ElasticSearch. 1-12 months are replicated. Each log is stored on 2 (or more) servers.

jan feb mar apr may jun jul aug sep oct nov dec

- 1 year of Indexed and normalised logs in Elasticsearch - 1 year of compressed raw logs in distributed object storage - days (up to a month) of normalised data in Kafka

The Punchplatform architecture and connectors make it simple and safe to deploy arbitrary processing on the data. Either batch or real-time (streaming).

Page 8: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Moving To Machine Learning

8

Page 9: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

9

Live Demo

Page 10: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

10

Live Demo Explained

Searching Visualising Reporting Alerting

ElasticSearchKafkaKafka Storm Storm

Data Processing

Raw Data

Data Analytics: Machine Learning Anomaly Detection

IndexingAlertingParsing

Normalising

Page 11: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Challenges

11

Page 12: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

12

Train/Test

Predict

Our Process

Raw Data CyberSecurity Analyst

Data Scientist

model

feature

Feature Engineering

clean/normalise

- solve the data access and ownership issue - small, agile, integrated team - well defined process …… and something like a - clear and shared vision of achievable steps : MVPs

Page 13: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Making (Spark) ML simpler

13

Fit

Model

Transform

Input Data

Detection Data

Scored Data

Plan

Fit Every Night on last day of data

Transform last hour of Data every half an hour

View Evaluate

Alert

design

- By configuration, not by code - Leverage data normalisation, off the shelves libraries - Quick to setup and test

Page 14: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Fit Job Example

14

Field selection, enrichment, ..Punch Stage

K-Means, Regression, …Spark Stage

Save the computed ModelModel Output

ElasticSearch last day of firewall logs

Data Input

SparkQLSpark Stage

FilterSpark Stage

- This is defined in a plain Json configuration file

Page 15: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Transform Job Example

15

Field selection, filtering, …Punch Stage

(say) K-Means/Regression/…Spark Stage

Save the scored data to ElasticSearch

Data Output

Real Time Firewall LogsLive Input

SparkQLSpark Stage

select fieldsSpark Stage

from Fit Job Model Input

- This is defined in a plain Json configuration file

Page 16: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Leveraging Spark MLib

16

https://spark.apache.org/mllib/

Classification: logistic regression, naive Bayes,...

Regression: generalized linear regression, survival regression,...

Decision trees, random forests, gradient-boosted trees

Recommendation: alternating least squares (ALS)

Clustering: K-means, Gaussian mixtures (GMMs),...

Topic modeling: latent Dirichlet allocation (LDA)

Frequent itemsets, association rules, sequential pattern mining

Feature transformations: standardization, normalization, hashing,...

ML Pipeline construction Model evaluation and hyper-parameter tuning ML persistence:

saving and loading models and Pipelines

Distributed linear algebra: SVD, PCA,... Statistics: summary statistics, hypothesis testing,...

ML Workflow

Utilities

ML Algorithms

Page 17: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

17

Conclusions

When embarking on AI projects you dramatically improve your chances of producing value by :

• Operating in a build now, learn as you go fashion. Truly sophisticated products are arrived at via iteration and variation; not naive designs steeped in theory;

• Using nascent discoveries only in the context of a working product;

• Encouraging Agility from your Data Scientists as much as your developers and product managers;

• Closing the gap between lab and factory wherever possible, favoring quick and lean solutions that grow more valid with time;

• Leveraging the machine learning already available in open source tools, only coding from the ground up when absolutely necessary;

• Passing user feedback into your data pipelines by exposing imperfect models to end users early.

(Sean McClure)

Page 18: Practical Machine Learning Agenda - The Punch · 2020-04-07 · 4 Searching Visualising Reporting Alerting Kafka Storm Kafka Storm ElasticSearch Data Processing Raw Data Forwarding

Thanks !

http://punchplatfom.io

http://kibana.punchplatform.com

http://doc.punchplatform.com