Mozscape: NoSQL at Terabyte Scale Phil Smith Software Engineer

Upload: philhsmith

Post on 22-Jan-2018

TRANSCRIPT

Page 1: Mozscape no sql-at-terabyte-scale

Mozscape: NoSQL at Terabyte Scale

Phil Smith, Software Engineer

Page 2: Mozscape no sql-at-terabyte-scale

What We Do

SEO & Inbound Marketing Metrics

www.opensiteexplorer.org

Page 5: Mozscape no sql-at-terabyte-scale

What We Do

Collect back links across the web

www.opensiteexplorer.org

Compute metrics estimating value

Serve links and metrics with API and OSE

Page 8: Mozscape no sql-at-terabyte-scale

How We Do

Crawl the Web

~25-30 billion pages per month

20 Crawler machines

~256 MB/sec aggregate download rate
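A rough back-of-envelope check of how the crawl figures relate to each other. This is only a sketch: the 30-day month, the idea that the download rate is sustained around the clock, and the decimal units (1 TB = 1e6 MB) are assumptions, not statements from the slides.

```python
# Figures taken from the slide.
AGGREGATE_MB_PER_SEC = 256
CRAWLERS = 20

# Assumption: a 30-day month with the rate sustained the whole time.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Average download rate each crawler machine would need to sustain.
per_machine_mb_per_sec = AGGREGATE_MB_PER_SEC / CRAWLERS

# Total volume downloaded in a month, in decimal terabytes.
monthly_tb = AGGREGATE_MB_PER_SEC * SECONDS_PER_MONTH / 1e6


def kb_per_page(pages_per_month):
    """Average downloaded kilobytes per page implied by the aggregate rate."""
    pages_per_sec = pages_per_month / SECONDS_PER_MONTH
    return AGGREGATE_MB_PER_SEC * 1000 / pages_per_sec


print(f"{per_machine_mb_per_sec:.1f} MB/sec per crawler")
print(f"{monthly_tb:.0f} TB downloaded per month")
for pages in (25e9, 30e9):
    print(f"{pages:.0e} pages/month -> {kb_per_page(pages):.1f} KB per page")
```

Under these assumptions each crawler averages about 12.8 MB/sec, and the implied average page size falls somewhere in the low tens of kilobytes.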

Page 11: Mozscape no sql-at-terabyte-scale

How We Do

Compute Aggregates and Metrics

1:5 to 1:50 Compression Ratios

Aggregates are Parallelized Linear Scans

Communication Avoided where Possible
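One way to picture an aggregate computed as a parallelized linear scan: each worker scans its own chunk and produces a partial result, and the only communication is the final merge. The sketch below counts inbound links per target. The names `scan_chunk` and `count_inlinks` and the sample data are hypothetical, and threads stand in for the separate processes or machines a real batch system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk):
    """Linear scan over one chunk of (source, target) links.

    Returns partial inbound-link counts; no data is shared with other workers.
    """
    counts = Counter()
    for _source, target in chunk:
        counts[target] += 1
    return counts


def count_inlinks(chunks, workers=4):
    """Scan all chunks in parallel and merge the partial counts at the end."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(scan_chunk, chunks):
            total.update(partial)  # the merge is the only point of communication
    return total


# Hypothetical sample data: two chunks of links.
chunks = [
    [("a.com", "b.com"), ("c.com", "b.com")],
    [("a.com", "c.com"), ("d.com", "b.com")],
]
inlinks = count_inlinks(chunks)
print(inlinks.most_common())
```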

Page 14: Mozscape no sql-at-terabyte-scale

How We Do

Surface with a Read-Only API

~12 TB per Release in Amazon S3

6 m2.4xlarge Instances for Cache

~28k Requests per Minute

Page 15: Mozscape no sql-at-terabyte-scale

Observations and Strategy

Billions of Small, Similar Records

De-normalization Avoids Complex Joins

Batch-style Emphasizes Spatial Locality

Page 16: Mozscape no sql-at-terabyte-scale

Data Layout

Column-Orientation exploits Locality

Broken into 5GB chunks for S3

~64KB Compression Runs within
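A minimal sketch of the layout idea: values from a single column are stored together and cut into runs that are compressed independently, so a reader can decompress one run without touching the rest. Whether the ~64KB refers to compressed or uncompressed size is not stated, so the uncompressed reading here is an assumption, as are the fixed-width 64-bit values. `zlib` stands in for the real codec, and `column_to_runs` and `read_run` are hypothetical names.

```python
import struct
import zlib

RUN_BYTES = 64 * 1024  # assumption: ~64KB of uncompressed values per run


def column_to_runs(values, run_bytes=RUN_BYTES):
    """Pack unsigned 64-bit column values into independently compressed runs."""
    per_run = run_bytes // 8  # 8 bytes per value
    runs = []
    for start in range(0, len(values), per_run):
        block = values[start:start + per_run]
        raw = struct.pack(f"<{len(block)}Q", *block)
        runs.append(zlib.compress(raw))
    return runs


def read_run(run):
    """Decompress a single run back into its list of values."""
    raw = zlib.decompress(run)
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


# Hypothetical column of 20,000 sequential IDs.
values = list(range(20000))
runs = column_to_runs(values)
print(len(runs), "runs; first value of run 1:", read_run(runs[1])[0])
```

Many such runs would then be concatenated into the 5GB chunks stored in S3.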

Page 17: Mozscape no sql-at-terabyte-scale

Compression

Tuned to Overcome Disk Read Bound

By-Column, Run & Gap Encoding on LZO

Customized Pipelines per Column
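To illustrate gap and run encoding feeding a general-purpose compressor, the sketch below encodes a sorted ID column as differences, collapses repeated differences into (value, count) pairs, and then compresses the result. It is not the actual Mozscape pipeline: `zlib` is used in place of LZO because it ships with Python, the text serialization is chosen only for readability, and all function names are hypothetical.

```python
import zlib


def gap_encode(sorted_ids):
    """Replace each ID with its difference from the previous ID."""
    gaps, prev = [], 0
    for value in sorted_ids:
        gaps.append(value - prev)
        prev = value
    return gaps


def gap_decode(gaps):
    """Rebuild the IDs by summing the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def run_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_decode(runs):
    """Expand [value, count] pairs back into the full sequence."""
    return [value for value, count in runs for _ in range(count)]


def compress_id_column(sorted_ids):
    """Pipeline for one column: gap encode, run encode, then compress."""
    pairs = run_encode(gap_encode(sorted_ids))
    text = ",".join(f"{v}:{c}" for v, c in pairs)
    return zlib.compress(text.encode("ascii"))


def decompress_id_column(blob):
    """Invert compress_id_column."""
    text = zlib.decompress(blob).decode("ascii")
    if not text:
        return []
    pairs = [[int(v), int(c)] for v, c in (p.split(":") for p in text.split(","))]
    return gap_decode(run_decode(pairs))


ids = [100, 101, 102, 103, 110, 120, 130]
blob = compress_id_column(ids)
print(run_encode(gap_encode(ids)), decompress_id_column(blob) == ids)
```

A different column (for example, one holding text) would plausibly use a different chain of steps, which is one reading of "customized pipelines per column".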

Page 18: Mozscape no sql-at-terabyte-scale

Job Control

Each Stage has Parallel, Idempotent Tasks

Tasks are Processes with an easy Command Line

stdout, exit code are logged to track state
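A small sketch of this style of job control: a task is an ordinary child process, and its stdout and exit code are written to a log that records its state. Skipping a task whose log already shows success is an assumption about how that state might be used, and is only safe because tasks are idempotent. The `run_task` helper, the JSON log format, and the task name are hypothetical.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(name, argv, log_dir):
    """Run one task as a child process and log its stdout and exit code."""
    log_path = Path(log_dir) / f"{name}.json"
    if log_path.exists():
        previous = json.loads(log_path.read_text())
        if previous["exit_code"] == 0:
            return previous  # already succeeded; no need to run it again
    result = subprocess.run(argv, capture_output=True, text=True)
    record = {"argv": argv, "exit_code": result.returncode, "stdout": result.stdout}
    log_path.write_text(json.dumps(record))
    return record


log_dir = tempfile.mkdtemp()
# Hypothetical task: any program with a simple command line would do.
task = [sys.executable, "-c", "print('scanned chunk 7')"]
first = run_task("scan-chunk-7", task, log_dir)
second = run_task("scan-chunk-7", task, log_dir)  # answered from the log
print(first["exit_code"], first["stdout"].strip())
```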

Page 19: Mozscape no sql-at-terabyte-scale

Checkpoints

[Diagram along a Time axis, with labels: S3, Table Scan, Checkpoint, Barrier, Table Scan, Barrier]
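One possible reading of the diagram, sketched in code: the tasks of a stage run in parallel, a barrier waits for all of them, and only then is a checkpoint recorded so that a restart can skip completed stages. A local directory stands in for S3, threads stand in for separate processes, and `run_pipeline` and the stage names are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_pipeline(stages, checkpoint_dir, workers=4):
    """Run stages in order; each stage is (name, [zero-argument callables]).

    Returns the names of the stages actually executed in this call.
    """
    executed = []
    for name, tasks in stages:
        marker = Path(checkpoint_dir) / f"{name}.done"
        if marker.exists():
            continue  # resume: this stage was checkpointed earlier
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(task) for task in tasks]
            for future in futures:
                future.result()  # barrier: wait for every task, re-raise failures
        marker.write_text("ok")  # checkpoint is written only after the barrier
        executed.append(name)
    return executed


results = []
stages = [
    ("scan-1", [lambda i=i: results.append(("scan-1", i)) for i in range(3)]),
    ("scan-2", [lambda i=i: results.append(("scan-2", i)) for i in range(3)]),
]
checkpoint_dir = tempfile.mkdtemp()
first_run = run_pipeline(stages, checkpoint_dir)
second_run = run_pipeline(stages, checkpoint_dir)  # nothing left to do
print(first_run, second_run)
```

Because of the barrier, no task of the second scan starts until every task of the first has finished.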

Page 20: Mozscape no sql-at-terabyte-scale

Indexing

Columns have BDBs indexing by ID

Subset of IDs map to Compression Runs

Decompress Run and Scan to find Record
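The lookup path described above resembles a sparse index: only some IDs are indexed, each pointing at a compression run, and the record is found by decompressing that run and scanning it. In this sketch a sorted list searched with `bisect` stands in for the BDB index, `zlib` and JSON stand in for the real run format, and the tiny run size and function names are hypothetical.

```python
import bisect
import json
import zlib


def build_column(records, run_size=4):
    """Build a sparse index over records sorted by ID.

    records: list of (id, value) pairs. Returns (index_ids, runs), where
    index_ids[i] is the first ID stored in the compressed run runs[i].
    """
    index_ids, runs = [], []
    for start in range(0, len(records), run_size):
        block = records[start:start + run_size]
        index_ids.append(block[0][0])  # only the first ID of each run is indexed
        runs.append(zlib.compress(json.dumps(block).encode("utf-8")))
    return index_ids, runs


def lookup(index_ids, runs, record_id):
    """Find the run that could hold record_id, decompress it, and scan it."""
    pos = bisect.bisect_right(index_ids, record_id) - 1
    if pos < 0:
        return None  # smaller than every indexed ID
    block = json.loads(zlib.decompress(runs[pos]))
    for rid, value in block:
        if rid == record_id:
            return value
    return None


# Hypothetical column: IDs 0, 10, ..., 90.
records = [(i * 10, f"page-{i}") for i in range(10)]
index_ids, runs = build_column(records)
print(index_ids, lookup(index_ids, runs, 60))
```

The trade-off is a much smaller index in exchange for decompressing and scanning one run per lookup.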

Page 21: Mozscape no sql-at-terabyte-scale

Physical Deployment

Crawlers run in Colo for white-listed IPs

Batch Process and API layer in EC2

The API might be in a colo too, but ELB + Autoscaling are nice

Page 22: Mozscape no sql-at-terabyte-scale

Questions?

We’re Hiring!