Proofpoint: Fraud Detection and Security on Social Media

© 2015 Proofpoint, Inc. threat protection | compliance | archiving & governance | secure communication Cassandra at Proofpoint (& Nexgate) Harold Nguyen, Data Scientist, Nexgate division of Proofpoint Slides created with the help of Proofpoint colleagues: Bryan Burns, Brian Hawkins, Wayne Lewis, Andy Maas, Anand Somani, Grey Saylor, and Rich Sutton

Upload: datastax-academy

Post on 15-Apr-2017

TRANSCRIPT

Page 1: Proofpoint: Fraud Detection and Security on Social Media

© 2015 Proofpoint, Inc.

threat protection | compliance | archiving & governance | secure communication

Cassandra at Proofpoint (& Nexgate)

Harold Nguyen, Data Scientist, Nexgate division of Proofpoint

Slides created with the help of Proofpoint colleagues: Bryan Burns, Brian Hawkins, Wayne Lewis, Andy Maas, Anand Somani, Grey Saylor, and Rich Sutton

Page 2: Proofpoint: Fraud Detection and Security on Social Media

Outline of This Talk

• whoami
• About Proofpoint
• Cassandra use cases for Proofpoint email security
    • Targeted Attack Protection
    • Threat-event correlation
    • General-purpose infrastructure
    • Clustering email topics
• Cassandra use cases for Proofpoint Nexgate social media security and compliance
    • Spam multiplicity
    • Trending topics
    • Archive search
    • Data integrity and connectedness across the globe

Page 3: Proofpoint: Fraud Detection and Security on Social Media

Few Words About Me

• Data engineer/scientist
• Responsible for content classification, fraud detection, and security research
• Work with engineering, marketing, and research teams

Page 4: Proofpoint: Fraud Detection and Security on Social Media

About Proofpoint

• Security and compliance for enterprise messaging (email, social, and mobile)
• Founded 2002
• 1,100 employees worldwide
• $2.5B public company: PFPT
• $200M revenue
• Cassandra used all over the organization

Page 5: Proofpoint: Fraud Detection and Security on Social Media

Cassandra for Email Security and Compliance

Use cases of Cassandra for Email Security

Page 6: Proofpoint: Fraud Detection and Security on Social Media

Targeted Attack Protection (TAP)

What is a targeted attack?
• An attack aimed at a specific user or organization, designed to breach a specific target

What is TAP?
• Combats targeted threats by monitoring suspicious messages containing malicious URLs and attachments, and by analyzing user clicks
• Predictive defense: uses machine learning techniques to determine what ‘could likely’ be malicious and takes preemptive steps
• Insight into threats: determines whether an organization is under attack, who is being targeted, what threats are received, and whether they are still valid threats

Page 7: Proofpoint: Fraud Detection and Security on Social Media

Cassandra with TAP - (Use Case 1)

C* use case with TAP
• Uses Cassandra as an indexer: indexes URLs (row key) to the email messages (columns) that contain them
• Stores a blob of each email message to display on the dashboard for malicious-content alerting

C* infrastructure
• 40-node cluster in AWS, c3.2xlarge nodes
• About 2 TB of EBS storage on each node
• Replication factor of 4
• Data has increased by 100% since a year ago

KairosDB and C*
• JMX metrics are inserted into KairosDB, where they are read and monitored
• Across the 3 clusters (9, 6, 6), 14 billion metrics a month from thousands of machines
• Has become critical to Proofpoint's ability to track metrics from its systems
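
The URL-to-message index described under "C* use case with TAP" can be pictured with a small in-memory model. This is only a sketch of the wide-row idea (row key = URL, one column per message), not Proofpoint's schema; the names `url_index`, `index_message`, and `messages_for_url`, and the example URLs and message IDs, are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the wide-row index:
# row key = URL, column name = message ID, column value = message blob.
url_index = defaultdict(dict)

def index_message(message_id, body, urls):
    """Record the message under every URL it contains."""
    for url in urls:
        url_index[url][message_id] = body

def messages_for_url(url):
    """Return the IDs of all indexed messages that contain the URL."""
    return sorted(url_index.get(url, {}))

index_message("msg-1", b"...", ["http://example.test/a"])
index_message("msg-2", b"...", ["http://example.test/a", "http://example.test/b"])
hits = messages_for_url("http://example.test/a")
```

Once a URL is judged malicious, a single row lookup returns every message that carried it, which is the access pattern a dashboard alert needs.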

Page 8: Proofpoint: Fraud Detection and Security on Social Media

Threat Database (Use Case 2)

Problem:
• Proofpoint collects billions of threat data points a day that aren’t being correlated

Solution:
• Build a custom graph database on top of C*
• Key is vertex, wide rows are edges
• 18 nodes, 24 TB of data, ingest peaks of 1M events per second

Benefits:
• Security researchers can now identify relationships between hosts, actors, and threats that they couldn’t before
• Dridex campaign, detection of numerous targeted attacks
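
To make "key is vertex, wide rows are edges" concrete, here is a minimal in-memory sketch. It assumes each edge is stored as one column named by an (edge label, neighbour) pair; that naming, the helper functions, and the example vertices are illustrative assumptions, not the actual Proofpoint layout.

```python
from collections import defaultdict

# Hypothetical model: partition key = vertex ID; each "column" in the wide
# row is one edge, named by (edge label, neighbour ID), holding properties.
graph = defaultdict(dict)

def add_edge(src, label, dst, **props):
    """Write one edge column into the source vertex's row."""
    graph[src][(label, dst)] = props

def neighbours(src, label=None):
    """Read a vertex's row and return neighbour IDs, optionally by label."""
    row = graph.get(src, {})
    return sorted(dst for (lbl, dst) in row if label is None or lbl == label)

add_edge("host:203.0.113.7", "serves", "url:http://example.test/x")
add_edge("host:203.0.113.7", "used_by", "actor:hypothetical-actor")
# Writing the same edge again overwrites the column instead of duplicating it.
add_edge("host:203.0.113.7", "serves", "url:http://example.test/x")
served = neighbours("host:203.0.113.7", "serves")
```

Because all edges of a vertex live in one row, a traversal step is a single-partition read.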

Page 9: Proofpoint: Fraud Detection and Security on Social Media

Why not TitanDB ?

The Proofpoint security research team created a graph database on top of Cassandra (a CQL application).

Why didn’t we use TitanDB or another existing graph DB?
• These DBs want to generate their own IDs, which causes unnecessary querying for us
• This killed insert performance

• Created our own ID generation scheme so an ID could be deterministically generated without querying the DB
• Cassandra allowed us to overwrite the same data multiple times if needed, without querying the DB to reconcile duplicates
• Titan could be “hacked” to use a hash-based ID and not call Cassandra for ID generation, but its keys were constrained to 64-bit integers (too small for us)

Other design differences from Titan:
• A key cache is used in the import application, so we avoid having to write the vertex key over and over
• Data is sharded into many subgraphs, so queries can include time ranges and compaction overhead is reduced

Edge design is similar to Titan: the edges of a vertex are kept in the same data partition
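
A deterministic ID scheme of the kind described here might look like the sketch below. The slides do not say which hash or ID width was used, so SHA-256 truncated to 128 bits is purely an assumption, as are the names `vertex_id`, `upsert_vertex`, `store`, and `key_cache`.

```python
import hashlib

def vertex_id(kind, value):
    """Derive a 128-bit ID from the vertex's own content (assumed scheme)."""
    digest = hashlib.sha256(f"{kind}\x00{value}".encode("utf-8")).digest()
    return digest[:16].hex()

store = {}          # stand-in for the Cassandra table
key_cache = set()   # IDs this importer has already written
writes = 0

def upsert_vertex(kind, value):
    """Blind write: no read is needed to find or reconcile an existing ID."""
    global writes
    vid = vertex_id(kind, value)
    if vid not in key_cache:        # the key cache skips redundant writes
        store[vid] = {"kind": kind, "value": value}
        key_cache.add(vid)
        writes += 1
    return vid

first = upsert_vertex("host", "203.0.113.7")
second = upsert_vertex("host", "203.0.113.7")
other = upsert_vertex("url", "http://example.test/x")
```

The same input always yields the same ID, so a repeated event resolves to the same row without a lookup, and even without the cache a second write would only overwrite identical data.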

Page 10: Proofpoint: Fraud Detection and Security on Social Media

Email General Purpose Infrastructure (Use Case 3)

• We also have a 6-node cluster in 2 datacenters (3 nodes in each DC)
• Stores email and attachments as large encrypted blobs (from 20 MB to 2 GB) for “SecureShare”, a product that securely shares emails
• Also serves as an identity database (users, customers, etc.)
• Chosen over SQL because of its distributed / multi-DC nature

Page 11: Proofpoint: Fraud Detection and Security on Social Media

Email Topics Clustering (Use case 4)

• Uses Cassandra as a store for clustered email topics
• Uses the Word2Vec algorithm with 100-dimensional vectors, and applies the Spark Streaming MLlib k-means clustering algorithm to the incoming stream of email subjects
• Tried k = 20, 50, and 100
• Word2Vec maps words into a shared vector space in which synonymous words lie close together
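
The pipeline above (embed each subject line, then cluster the stream) can be illustrated with a toy sketch. This is not Spark's streaming k-means, which also applies a decay factor; it is a simplified online k-means in plain Python, and the 2-dimensional `EMBEDDINGS` table, the initial centroids, and the sample subjects are all made-up stand-ins for the 100-dimensional Word2Vec vectors.

```python
import math

# Toy 2-dimensional "embeddings" (hypothetical values); the talk used
# 100-dimensional Word2Vec vectors.
EMBEDDINGS = {
    "invoice": (1.0, 0.1), "payment": (0.9, 0.2), "receipt": (0.95, 0.0),
    "meeting": (0.1, 1.0), "schedule": (0.0, 0.9), "agenda": (0.2, 0.95),
}

def subject_vector(subject):
    """Average the vectors of the known words in a subject line."""
    vecs = [EMBEDDINGS[w] for w in subject.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return None
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

class OnlineKMeans:
    """Assign each point to its nearest centroid, then nudge that centroid."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.counts = [1] * len(centroids)

    def nearest(self, vec):
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(self.centroids[i], vec))

    def update(self, vec):
        k = self.nearest(vec)
        self.counts[k] += 1
        rate = 1.0 / self.counts[k]     # running-mean update
        self.centroids[k] = [c + rate * (x - c)
                             for c, x in zip(self.centroids[k], vec)]
        return k

model = OnlineKMeans(centroids=[(1.0, 0.0), (0.0, 1.0)])
subjects = ["Invoice payment", "Meeting agenda", "Payment receipt", "Schedule meeting"]
labels = [model.update(subject_vector(s)) for s in subjects]
```

In this toy run the billing-related subjects fall into one cluster and the scheduling-related ones into the other, because their averaged vectors sit near different centroids.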

Page 12: Proofpoint: Fraud Detection and Security on Social Media

Use Cases for Social Media Security and Compliance

Page 13: Proofpoint: Fraud Detection and Security on Social Media

Why Cassandra

Content classification is what we do. The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.

We wanted to keep everything. Forever.

Page 14: Proofpoint: Fraud Detection and Security on Social Media

Deployment: Current production

2.1 TB of data, 150 writes & 15 reads per second

Page 15: Proofpoint: Fraud Detection and Security on Social Media

Deployment: Evolution

Start – 2012:
• Cassandra 1.1.6, Ubuntu 10.04
• One datacenter, three nodes

Finish – today:
• DataStax Enterprise 4.5.7, Ubuntu 12.04
• Three datacenters
• Nine nodes
• Solr deployed
• Half a billion content items stored

Page 16: Proofpoint: Fraud Detection and Security on Social Media

Never Gone Down

Page 17: Proofpoint: Fraud Detection and Security on Social Media

Going Global (Use Case 5)

Objectives:
• Scale system horizontally
• Allow customers to keep data in a region (EU), while benefitting from other data centers (spammy users)

Dedicated “Global” C* cluster, with each Nexgate app instance having its own “Local” C* cluster

us-west, us-east, eu-central

Page 18: Proofpoint: Fraud Detection and Security on Social Media

Global data center

2 nodes in each datacenter across the globe, 60 writes/sec, 1 read/sec

Page 19: Proofpoint: Fraud Detection and Security on Social Media

Spam multiplicity (Use Case 6)

Problem: • Spammers on social media repeat messages across accounts

Solution: • Cassandra data model to query repeat-content spammers in real time

Benefits: Efficiently get a count of times we’ve seen content, while retaining detail data, supporting real-time analysis

Page 20: Proofpoint: Fraud Detection and Security on Social Media

Spam Multiplicity Data Modeling

CREATE TABLE item_cnt (
    content text,
    column1 text,
    value text,
    PRIMARY KEY ((content), column1)
)

Hash of Content (Partition Key) | Column1            | Value
d131dd02c5e6eec4…               | property_native_id | “itemId_timestamp”

• Look up content quickly (by hitting hashed index)
• Number of columns = number of times content was seen
• Value provides information for offline analysis (time series, patterns in content, etc.)
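
The counting idea behind `item_cnt` can be sketched in memory as follows. The slide shows only a hex digest prefix, so using MD5 for the content hash is an assumption, and `record_sighting`, `multiplicity`, and the sample values are hypothetical names for illustration.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory stand-in for item_cnt:
# partition key = hash of content, clustering column = column1, plus value.
item_cnt = defaultdict(dict)

def content_key(text):
    # MD5 is an assumption; the slide only shows a hex digest prefix.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def record_sighting(text, property_native_id, item_id, timestamp):
    """Add one column to the content's partition (same column1 overwrites)."""
    item_cnt[content_key(text)][property_native_id] = f"{item_id}_{timestamp}"

def multiplicity(text):
    """Number of columns in the partition = times the content was seen."""
    return len(item_cnt.get(content_key(text), {}))

spam = "Win a free phone, click here"
record_sighting(spam, "account-a", 101, 1442916000)
record_sighting(spam, "account-b", 102, 1442916060)
record_sighting(spam, "account-c", 103, 1442916120)
seen = multiplicity(spam)
```

The count comes from a single-partition read, while the stored values keep the per-sighting detail for offline analysis.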

Page 21: Proofpoint: Fraud Detection and Security on Social Media

Trending topics (Use Case 7)

Problem: • Detect when the conversation radically changes on a social account

Solution: • Use a Cassandra data model to detect a social mob and alert when it occurs

Benefits: Efficiently get bi-gram counts from adjacent date ranges and analyze them

Page 22: Proofpoint: Fraud Detection and Security on Social Media

Trending Topics Data Modeling

CREATE TABLE trending_topics (
    account_id int,
    year_month text,
    minute_bucket timestamp,
    topic text,
    item_id int,
    counter_value counter,
    PRIMARY KEY ((account_id, year_month), minute_bucket, topic, item_id)
)

Goal: Get back the number of times a topic has been seen for any range of minutes

• (account_id, year_month) is the “composite partition key”: data with the same account ID and year-month lives on the same node
• Sorted by minute_bucket, topic, and item_id
• account_id and year_month are the only values required for a query
• Flexibility for analysis
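
A rough in-memory model of this table shows how counts for adjacent minute ranges could be read and compared. It is a sketch under assumptions: minute buckets are ISO-formatted strings, and the `shifted_topics` rule with its `factor=3` threshold is invented for illustration, since the slides do not describe the actual alerting logic.

```python
from collections import Counter, defaultdict

# (account_id, year_month) -> {(minute_bucket, topic, item_id): count}
partitions = defaultdict(Counter)

def increment(account_id, minute_bucket, topic, item_id):
    """Bump the counter; the partition is chosen by account and year-month."""
    year_month = minute_bucket[:7]      # "2015-09-22T10:05" -> "2015-09"
    partitions[(account_id, year_month)][(minute_bucket, topic, item_id)] += 1

def topic_counts(account_id, year_month, start, end):
    """Sum counters per topic for start <= minute_bucket < end."""
    totals = Counter()
    rows = partitions.get((account_id, year_month), {})
    for (minute, topic, _item), n in rows.items():
        if start <= minute < end:
            totals[topic] += n
    return totals

def shifted_topics(before, after, factor=3):
    """Hypothetical rule: flag topics that grew by `factor` between windows."""
    return sorted(t for t, n in after.items() if n >= factor * max(before[t], 1))

events = [
    ("2015-09-22T10:01", "great product", 1),
    ("2015-09-22T10:03", "great product", 2),
    ("2015-09-22T10:06", "data breach", 3),
    ("2015-09-22T10:06", "data breach", 4),
    ("2015-09-22T10:07", "data breach", 5),
    ("2015-09-22T10:08", "data breach", 6),
    ("2015-09-22T10:09", "great product", 7),
]
for minute, topic, item in events:
    increment(42, minute, topic, item)

before = topic_counts(42, "2015-09", "2015-09-22T10:00", "2015-09-22T10:05")
after = topic_counts(42, "2015-09", "2015-09-22T10:05", "2015-09-22T10:10")
alerts = shifted_topics(before, after)
```

Both windows are served from the same partition, and because rows are sorted by `minute_bucket`, each window corresponds to one contiguous slice of that partition.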

Page 23: Proofpoint: Fraud Detection and Security on Social Media

Archive search (Use Case 8)

Problem:
• Allow customers to identify arbitrary compliance problems in social content with an open-ended search feature

Solution:
• Cassandra column family that contains the content
• DataStax Enterprise Solr with a core on that column family

Benefits:
• Near real-time index updates make new content available via search from the same infrastructure
• Combined with trending topics, can be used to easily look up and remove inappropriate content from a social media account

Page 24: Proofpoint: Fraud Detection and Security on Social Media

Summary

8 use cases that take advantage of Cassandra:
• Data modeling
• Distributed nature
• Other tools can easily plug in (Solr, Spark)
• Ease of use
• Community’s amazing support

Page 25: Proofpoint: Fraud Detection and Security on Social Media

Q&A

threat protection | compliance | archiving & governance | secure communication