Post on 22-Jan-2018


Mozscape: NoSQL at Terabyte Scale

Phil Smith
Software Engineer

What We Do

SEO & Inbound Marketing Metrics

Collect backlinks across the web

Compute metrics estimating value

Serve links and metrics with API and OSE

www.opensiteexplorer.org

How We Do

Crawl the Web

~25-30 billion pages per month

20 Crawler machines

~256 MB/sec aggregate download rate
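A quick sanity check that the crawl figures hang together. This is a back-of-the-envelope sketch: the monthly midpoint and the derived average page size are my assumptions, not numbers from the slides.

```python
# Derived rates from the stated crawl figures (~25-30B pages/month,
# 20 crawlers, ~256 MB/sec aggregate download rate).
PAGES_PER_MONTH = 27.5e9            # assumed midpoint of 25-30 billion
SECONDS_PER_MONTH = 30 * 24 * 3600
AGGREGATE_RATE_MB_S = 256
CRAWLERS = 20

pages_per_sec = PAGES_PER_MONTH / SECONDS_PER_MONTH
per_machine_mb_s = AGGREGATE_RATE_MB_S / CRAWLERS
avg_page_kb = AGGREGATE_RATE_MB_S * 1024 / pages_per_sec

print(f"{pages_per_sec:,.0f} pages/sec")            # roughly 10,600
print(f"{per_machine_mb_s:.1f} MB/sec per crawler") # 12.8
print(f"~{avg_page_kb:.0f} KB average page")        # roughly 25 KB
```

At ~12.8 MB/sec per machine, a single crawler is well within commodity network capacity; the bottleneck is elsewhere in the pipeline.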

How We Do

Compute Aggregates and Metrics

1:5 to 1:50 Compression Ratios

Aggregates are Parallelized Linear Scans

Communication Avoided where Possible
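The "parallelized linear scans with communication avoided" pattern can be sketched as follows: each worker makes one pass over its own chunk and returns a small partial aggregate, and the partials are merged only once at the end. All names here are hypothetical, not Mozscape's actual code.

```python
# Each worker scans its chunk independently; workers never exchange data.
from multiprocessing import Pool

def scan_chunk(chunk):
    # One linear pass over the records in this chunk -> (count, sum).
    count, total = 0, 0
    for value in chunk:
        count += 1
        total += value
    return count, total

def aggregate(chunks, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(scan_chunk, chunks)   # embarrassingly parallel
    count = sum(c for c, _ in partials)           # merge once, at the end
    total = sum(t for _, t in partials)
    return total / count                          # e.g. a mean link metric

if __name__ == "__main__":
    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
    print(aggregate(chunks))   # mean of 0..7999 = 3999.5
```

Because partial aggregates are tiny relative to the scanned data, the merge step is negligible and throughput scales with the number of scanning workers.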

How We Do

Surface with a Read-Only API

~12 TB per Release in Amazon S3

6 m2.4xlarge Instances for Cache

~28k Requests per Minute

Observations and Strategy

Billions of Small, Similar Records

De-normalization Avoids Complex Joins

Batch-style Emphasizes Spatial Locality

Data Layout

Column-Orientation exploits Locality

Broken into 5GB chunks for S3

~64KB Compression Runs within
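A minimal sketch of the layout above, using the sizes from the slides (~5 GB S3 chunks, ~64 KB compression runs). The record format and function names are assumptions; the point is that consecutive records are packed into small runs so a point lookup only ever decompresses one ~64 KB unit.

```python
CHUNK_BYTES = 5 * 1024**3   # one S3 object
RUN_BYTES = 64 * 1024       # one compression run within a chunk

def plan_runs(record_sizes):
    """Group consecutive records into runs of at most RUN_BYTES each."""
    runs, current, current_bytes = [], [], 0
    for i, size in enumerate(record_sizes):
        if current and current_bytes + size > RUN_BYTES:
            runs.append(current)          # close the full run
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        runs.append(current)
    return runs

# 1000 records of ~100 bytes each: 655 fit in the first 64 KB run.
runs = plan_runs([100] * 1000)
print(len(runs), len(runs[0]))   # 2 655
```

Small, similar records are what make this work: with ~100-byte records, a single run holds hundreds of neighbors, so scans get excellent spatial locality.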

Compression

Tuned to Overcome Disk Read Bound

By-Column, Run & Gap Encoding on LZO

Customized Pipelines per Column
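A hedged sketch of the per-column pipeline: gap-encode sorted IDs (small deltas compress far better than raw 64-bit IDs), then feed the byte stream to a fast block compressor. The slides name LZO, which is not in the Python standard library, so zlib stands in here; the structure of the pipeline is the point, not the codec.

```python
import struct
import zlib

def gap_encode(sorted_ids):
    prev, gaps = 0, []
    for i in sorted_ids:
        gaps.append(i - prev)   # store the delta, not the absolute ID
        prev = i
    return gaps

def compress_column(sorted_ids):
    gaps = gap_encode(sorted_ids)
    raw = struct.pack(f"<{len(gaps)}Q", *gaps)
    return zlib.compress(raw)   # LZO in production, per the slides

def decompress_column(blob, n):
    gaps = struct.unpack(f"<{n}Q", zlib.decompress(blob))
    ids, prev = [], 0
    for g in gaps:
        prev += g
        ids.append(prev)
    return ids

ids = list(range(10_000_000, 10_001_000, 3))   # dense, sorted IDs
blob = compress_column(ids)
assert decompress_column(blob, len(ids)) == ids
print(len(ids) * 8, "->", len(blob), "bytes")
```

Because each column gets its own pipeline, a column of near-constant gaps can hit the 1:50 end of the stated compression range while a noisier column settles nearer 1:5.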

Job Control

Each Stage has Parallel, Idempotent Tasks

Tasks are Procs with easy Command Line

stdout, exit code are logged to track state
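The job-control model above reduces to something like this sketch: a task is just a process with a command line, and the coordinator records its stdout and exit code as the task's state. All names are assumptions, not Mozscape's actual tooling.

```python
import subprocess

def run_task(cmd, log):
    """Run one task; persist its stdout and exit code as the state record."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    log[" ".join(cmd)] = {"stdout": result.stdout, "exit": result.returncode}
    return result.returncode == 0

log = {}
ok = run_task(["echo", "chunk-0042 done"], log)
failed = [cmd for cmd, state in log.items() if state["exit"] != 0]
# Tasks are idempotent, so anything in `failed` can simply be re-run.
```

Idempotence is what keeps this simple: there is no distributed state to reconcile, only a list of command lines whose exit codes say whether to retry.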

Checkpoints

[Diagram: timeline of alternating Table Scan stages and Barriers, with a checkpoint written to S3 after each Table Scan]
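One way to read the checkpoint/barrier scheme, sketched here under assumptions: every task of a table-scan stage writes a completion marker (to S3 in production; a plain dict stands in below), and the next stage starts only once all markers for the previous stage exist.

```python
def checkpoint(store, stage, task_id):
    # In production this would be a small object written to S3.
    store[f"{stage}/task-{task_id}.done"] = True

def barrier_passed(store, stage, n_tasks):
    # The barrier holds until every task of the stage has checkpointed.
    return all(f"{stage}/task-{i}.done" in store for i in range(n_tasks))

store = {}
for task_id in range(4):                      # stage 1: parallel table scan
    checkpoint(store, "scan-1", task_id)
assert barrier_passed(store, "scan-1", 4)     # safe to start scan-2
```

Marker objects double as restart points: after a failure, the coordinator re-runs only the tasks whose markers are missing.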

Indexing

Columns have BDBs indexing by ID

Subset of IDs map to Compression Runs

Decompress Run and Scan to find Record
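The lookup path above can be sketched like this. The sparse index is a Berkeley DB in the slides; a sorted list stands in here. A lookup bisects the index to pick the right compression run, decompresses that one run, and scans it linearly for the record.

```python
from bisect import bisect_right

def find_record(index_ids, runs, wanted_id):
    # index_ids[i] is the smallest ID stored in runs[i].
    run_no = bisect_right(index_ids, wanted_id) - 1
    if run_no < 0:
        return None                            # below the first indexed ID
    for record_id, record in runs[run_no]:     # decompressed run, ID order
        if record_id == wanted_id:
            return record
    return None

index_ids = [0, 100, 200]
runs = [[(i, f"rec-{i}") for i in range(start, start + 100, 10)]
        for start in index_ids]
print(find_record(index_ids, runs, 120))   # rec-120
```

Indexing only a subset of IDs keeps the index small enough to cache in memory, at the cost of decompressing and scanning one ~64 KB run per lookup.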

Physical Deployment

Crawlers run in Colo for white-listed IPs

Batch Process and API layer in EC2

The API might be in a colo too, but ELB + Autoscaling are nice

Questions?

We’re Hiring!
