proofpoint: fraud detection and security on social media

© 2015 Proofpoint, Inc.

threat protection | compliance | archiving & governance | secure communication

Cassandra at Proofpoint (& Nexgate)

Harold Nguyen, Data Scientist, Nexgate division of Proofpoint

Slides created with the help of Proofpoint colleagues: Bryan Burns, Brian Hawkins, Wayne Lewis, Andy Maas, Anand Somani, Grey Saylor, and Rich Sutton

Outline of This Talk

• whoami
• About Proofpoint
• Cassandra use cases for Proofpoint email security
    • Targeted Attack Protection
    • Threat-event correlation
    • General-purpose infrastructure
    • Clustering email topics
• Cassandra use cases for Proofpoint Nexgate social media security and compliance
    • Spam multiplicity
    • Trending topics
    • Archive search
    • Data integrity and connectedness across the globe


A Few Words About Me

• Data engineer/scientist
• Responsible for content classification, fraud detection, and security research
• Work with engineering, marketing, and research teams


About Proofpoint

• Security and compliance for enterprise messaging (email, social, and mobile)
• Founded 2002
• 1100 employees worldwide
• $2.5B public company: PFPT
• $200M revenue
• Cassandra used all over the organization


Cassandra for Email Security and Compliance

Use cases of Cassandra for Email Security


Targeted Attack Protection (TAP)

What is a targeted attack?
• An attack aimed at a specific user or organization, designed to breach a specific target

What is TAP?
• Combats targeted threats by monitoring suspicious messages containing malicious URLs and attachments, and by analyzing user clicks
• Predictive defense: uses machine learning techniques to determine what is likely to be malicious and takes preemptive steps
• Insight into threats: determines whether an organization is under attack, who is being targeted, what threats are received, and whether they are still valid threats


Cassandra with TAP - (Use Case 1)

C* use case with TAP
• Uses Cassandra as an indexer: maps URLs (row key) to the email messages (columns) that contain them
• Stores a blob of each email message for display on the dashboard when alerting on malicious content
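The indexer layout above can be sketched in a few lines of Python. This is only an in-memory illustration of the "URL as row key, messages as columns" idea, not Proofpoint's implementation; the class name, method names, message IDs, and URLs are all hypothetical.

```python
from collections import defaultdict


class UrlIndex:
    """In-memory stand-in for a wide-row table: URL -> {message_id: blob}."""

    def __init__(self):
        # Outer key plays the role of the row (partition) key,
        # inner keys play the role of column names.
        self._rows = defaultdict(dict)

    def index_message(self, message_id, urls, blob):
        """Add one column per URL found in the message."""
        for url in urls:
            self._rows[url][message_id] = blob

    def messages_for(self, url):
        """Return every message that contained the URL (a single-row read)."""
        return dict(self._rows.get(url, {}))


index = UrlIndex()
index.index_message("msg-1", ["http://bad.example/a"], b"raw email 1")
index.index_message("msg-2", ["http://bad.example/a", "http://ok.example/"], b"raw email 2")
hits = index.messages_for("http://bad.example/a")
```

When a URL is later judged malicious, a single lookup on that URL returns every affected message, which is what makes the row-key choice convenient for alerting.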

C* infrastructure
• 40-node cluster in AWS, c3.2xlarge nodes
• About 2 TB of EBS storage on each node
• Replication factor of 4
• Data volume has doubled over the past year

KairosDB and C*
• JMX metrics are inserted into KairosDB, where they are read and monitored
• Across the 3 clusters (9, 6, and 6 nodes), 14 billion metrics a month from thousands of machines
• Has become critical to Proofpoint's ability to track metrics from its systems


Threat Database (Use Case 2)

Problem:
• Proofpoint collects billions of threat data points a day that aren’t being correlated

Solution:
• Build a custom graph database on top of C*
• Key is vertex, wide rows are edges
• 18 nodes, 24 TB of data, ingest peaks of 1M events per second

Benefits:
• Security researchers can now identify relationships between hosts, actors, and threats that they couldn’t before
• Dridex campaign, detection of numerous targeted attacks
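A rough sketch of "key is vertex, wide rows are edges", using a Python dictionary in place of a Cassandra table. The vertex names, edge types, and helper functions are hypothetical and chosen only to illustrate the layout.

```python
from collections import defaultdict

# Partition key (vertex) -> {(edge_type, neighbor): properties}.
# Each inner entry stands in for one column of the vertex's wide row.
edges = defaultdict(dict)


def add_edge(src, edge_type, dst, **props):
    """Write one edge as a column in the source vertex's row."""
    edges[src][(edge_type, dst)] = props


def neighbors(src, edge_type=None):
    """Read a vertex's edges from a single row, optionally filtered by type."""
    return sorted(
        dst
        for (etype, dst) in edges.get(src, {})
        if edge_type is None or etype == edge_type
    )


add_edge("host:203.0.113.7", "serves", "threat:dridex", first_seen="2015-05-01")
add_edge("host:203.0.113.7", "operated_by", "actor:example-group")
add_edge("host:198.51.100.9", "serves", "threat:dridex")

served = neighbors("host:203.0.113.7", "serves")
```

Because all edges of a vertex share one partition, walking from a host to its threats and actors is a single-partition read.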


Why not TitanDB ?

The Proofpoint security research team created a graph database on top of Cassandra (a CQL application).

Why didn’t we use TitanDB or another existing graph DB?
• These DBs want to generate their own IDs, which causes unnecessary querying for us
• This killed insert performance

What we did instead:
• Created our own ID generation scheme so an ID could be deterministically generated without querying the DB
• Cassandra allowed us to overwrite the same data multiple times if needed, without querying the DB to reconcile duplicates
• Titan could be “hacked” to use a hash-based ID and not call Cassandra for ID generation, but its keys were constrained to 64-bit integers (too small for us)

Other design differences from Titan:
• A key cache is used in the import application, so we avoid writing the vertex key over and over
• Data is sharded into many subgraphs, so queries can include time ranges and compaction overhead is reduced

The edge design is similar to Titan: the edges of a vertex are kept in the same data partition.
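The ID scheme itself is not described in the slides. A minimal sketch, assuming a SHA-256 hash truncated to 128 bits, month-based subgraph shards, and an in-process key cache, could look like the following; every name and parameter here is hypothetical.

```python
import hashlib
from datetime import datetime


def vertex_id(kind: str, value: str) -> str:
    """Derive a 128-bit ID purely from the vertex content (no DB lookup)."""
    normalized = f"{kind}:{value.strip().lower()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def shard_for(ts: datetime) -> str:
    """Assumed sharding rule: one subgraph per calendar month."""
    return ts.strftime("%Y-%m")


class Importer:
    def __init__(self):
        self.store = {}         # (shard, vertex_id) -> record; stands in for a table
        self.key_cache = set()  # keys this importer has already written
        self.vertex_writes = 0

    def add_event(self, kind, value, ts):
        key = (shard_for(ts), vertex_id(kind, value))
        if key not in self.key_cache:
            # Upsert: writing the same key again would simply overwrite it,
            # so no read is needed to reconcile duplicates.
            self.store[key] = {"kind": kind, "value": value.strip().lower()}
            self.key_cache.add(key)
            self.vertex_writes += 1
        return key


importer = Importer()
k1 = importer.add_event("host", "Evil.Example.com", datetime(2015, 6, 1))
k2 = importer.add_event("host", "evil.example.com ", datetime(2015, 6, 20))
k3 = importer.add_event("host", "evil.example.com", datetime(2015, 7, 2))
```

The first two events collapse to the same key and cost one vertex write; the third lands in a different monthly subgraph, which is what lets a query restrict itself to a time range.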


Email General Purpose Infrastructure (Use Case 3)

• We also have a 6-node cluster in 2 datacenters (3 nodes in each DC)
• Stores email and attachments as large encrypted blobs (from 20 MB to 2 GB) for “SecureShare”, a product that securely shares emails
• Also serves as an identity database (users, customers, etc.)

Chosen over SQL because of its distributed / multi-DC nature


Email Topics Clustering (Use Case 4)

• Uses Cassandra as a store for clustered email topics
• Uses the Word2Vec algorithm with 100-dimensional vectors, and applies the Spark Streaming MLlib k-means clustering algorithm to the incoming stream of email subjects
• Tried k = 20, 50, and 100
• Word2Vec maps words into a shared vector space in which synonymous words lie close together
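To make the pipeline concrete, here is a toy version in plain Python. It is a stand-in, not the Spark Streaming MLlib job from the talk: the 2-dimensional word vectors are made up (the talk uses 100-dimensional Word2Vec vectors), each subject is embedded as the mean of its word vectors (an assumption), and the clustering is ordinary batch k-means with a deterministic start.

```python
import math

# Hypothetical 2-D word vectors standing in for trained Word2Vec output.
WORD_VECTORS = {
    "invoice": (1.0, 0.1), "payment": (0.9, 0.2), "overdue": (0.8, 0.0),
    "meeting": (0.0, 1.0), "agenda": (0.1, 0.9), "schedule": (0.2, 0.8),
}


def embed(subject):
    """Average the vectors of known words; None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in subject.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


subjects = ["Invoice payment overdue", "Meeting agenda", "Overdue invoice", "Schedule meeting"]
points = [embed(s) for s in subjects]
labels, centroids = kmeans(points, k=2)
```

In this toy data the two billing subjects end up in one cluster and the two scheduling subjects in the other; the cluster assignments are the kind of result that would then be written to Cassandra.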


Use Cases for Social Media Security and Compliance


Why Cassandra

Content classification is what we do. The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.

We wanted to keep everything. Forever.


Deployment: Current production

2.1 TB of data, 150 writes & 15 reads per second


Deployment: Evolution

Start – 2012:
• Cassandra 1.1.6, Ubuntu 10.04
• One datacenter, three nodes

Finish – today:
• Datastax Enterprise 4.5.7, Ubuntu 12.04
• Three datacenters
• Nine nodes
• Solr deployed
• Half a billion content items stored


Never Gone Down


Going Global (Use Case 5)

Objectives:
• Scale the system horizontally
• Allow customers to keep data in a region (EU), while benefiting from data gathered in other data centers (e.g. spammy users)

Design:
• Dedicated “Global” C* cluster, with each Nexgate app instance having its own “Local” C* cluster
• Regions: us-west, us-east, eu-central


Global data center

2 nodes in each datacenter across the globe, 60 writes/sec, 1 read/sec


Spam multiplicity (Use Case 6)

Problem: • Spammers on social media repeat messages across accounts

Solution: • Cassandra data model to query repeat-content spammers in real time

Benefits: Efficiently get a count of times we’ve seen content, while retaining detail data, supporting real-time analysis


Spam Multiplicity Data Modeling

CREATE TABLE item_cnt (
    content text,
    column1 text,
    value text,
    PRIMARY KEY ((content), column1)
)

Hash of Content (Partition Key) | Column1            | Value
d131dd02c5e6eec4…               | property_native_id | “itemId_timestamp”

• Look up content quickly (by hitting the hashed index)
• Number of columns = number of times content was seen
• Value provides information for offline analysis (time series, patterns in content, etc.)
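The counting idea can be mimicked in Python with a dictionary per content hash. This is a sketch only: the slides do not say which hash or normalization is used, so MD5 over lower-cased, whitespace-collapsed text is an assumption, and the function names and sample IDs are hypothetical.

```python
import hashlib
from collections import defaultdict

# content hash (partition key) -> {column1: value}, mirroring item_cnt.
item_cnt = defaultdict(dict)


def content_hash(text):
    """Assumed normalization: lower-case and collapse whitespace, then MD5."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def record(text, property_native_id, item_id, timestamp):
    """Add one column for this sighting; the same column1 would overwrite (upsert)."""
    item_cnt[content_hash(text)][property_native_id] = f"{item_id}_{timestamp}"


def multiplicity(text):
    """Number of columns in the row = number of times the content was seen."""
    return len(item_cnt.get(content_hash(text), {}))


record("Win a FREE phone now!", "page-1001", "item-1", 1430000000)
record("win a free   phone now!", "page-2002", "item-2", 1430000060)
record("Lovely weather today", "page-1001", "item-3", 1430000120)

repeats = multiplicity("Win a free phone now!")
```

The values kept alongside each column ("itemId_timestamp") are what make the later offline analysis possible, since the count alone would discard when and where each copy appeared.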


Trending topics (Use Case 7)

Problem: • Detect when the conversation radically changes on a social account

Solution: • Use Cassandra data model to detect social mob and alert when it occurs

Benefits: Efficiently get bi-gram counts from adjacent date ranges and analyze them


Trending Topics Data Modeling

CREATE TABLE trending_topics (
    account_id int,
    year_month text,
    minute_bucket timestamp,
    topic text,
    item_id int,
    counter_value counter,
    PRIMARY KEY ((account_id, year_month), minute_bucket, topic, item_id)
)

Goal: Get back the number of times a topic has been seen for any range of minutes

• (account_id, year_month) is the “composite partition key”: data with the same account ID and month lives on the same node
• Rows are sorted by minute_bucket, topic, and item_id
• A query needs only account_id and year_month at minimum; the clustering columns can narrow it further
• Flexibility for analysis
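A small Python sketch of how this model supports comparing adjacent time ranges. It keeps the same key structure as the table (partition by account and month, then minute bucket, topic, and item) but runs in memory; the function names, the account ID, and the sample topics are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# (account_id, year_month) -> Counter keyed by (minute_bucket, topic, item_id).
partitions = {}


def increment(account_id, seen_at, topic, item_id):
    """Equivalent of a counter update on one clustering key."""
    bucket = seen_at.replace(second=0, microsecond=0)
    key = (account_id, bucket.strftime("%Y-%m"))
    partitions.setdefault(key, Counter())[(bucket, topic, item_id)] += 1


def topic_counts(account_id, year_month, start, end):
    """Sum counters per topic for minute buckets in [start, end)."""
    totals = Counter()
    rows = partitions.get((account_id, year_month), {})
    for (bucket, topic, _item_id), n in rows.items():
        if start <= bucket < end:
            totals[topic] += n
    return totals


t0 = datetime(2015, 6, 1, 12, 0)
increment(42, t0, "great product", 1)
increment(42, t0 + timedelta(minutes=10), "data breach", 2)
increment(42, t0 + timedelta(minutes=11), "data breach", 3)
increment(42, t0 + timedelta(minutes=11, seconds=30), "data breach", 4)

before = topic_counts(42, "2015-06", t0, t0 + timedelta(minutes=5))
after = topic_counts(42, "2015-06", t0 + timedelta(minutes=5), t0 + timedelta(minutes=15))
```

Comparing `before` and `after` shows a topic that was absent in one window and dominant in the next, which is the kind of shift the alerting looks for.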


Archive search (Use Case 8)

Problem:
• Allow customers to identify arbitrary compliance problems in social content with an open-ended search feature

Solution:
• Cassandra column family that contains the content
• Datastax Enterprise Solr with a core on that column family

Benefits:
• Near real-time index updates make new content available via search from the same infrastructure
• Combined with trending topics, can be used to easily look up and remove inappropriate content from a social media account


Summary

8 use cases that take advantage of Cassandra:
• Data modeling
• Distributed nature
• Other tools can easily plug in (Solr, Spark)
• Ease of use
• The community’s amazing support


Q&A