Mozscape: NoSQL at Terabyte Scale

Phil Smith, Software Engineer

What We Do

SEO & Inbound Marketing Metrics

http://www.opensiteexplorer.org

Collect backlinks across the web

Compute metrics estimating value

Serve links and metrics via the API and Open Site Explorer (OSE)



How We Do

Crawl the Web

~25-30 billion pages per month

20 Crawler machines

~256 MB/sec aggregate download rate
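As a rough consistency check on the crawl figures above, the sketch below derives the per-crawler bandwidth and the implied average download size per page. It assumes a 30-day month, 1 MB = 10**6 bytes, and crawlers running at the aggregate rate around the clock; none of these assumptions come from the slides.

```python
# Back-of-envelope check of the crawl figures quoted on the slides.
# Assumptions (not from the slides): 30-day month, 1 MB = 10**6 bytes,
# crawlers running at the aggregate rate around the clock.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
CRAWLERS = 20
AGGREGATE_BYTES_PER_SEC = 256 * 10**6


def crawl_estimates(pages_per_month):
    """Return (pages/sec, MB/sec per crawler, average KB per page)."""
    pages_per_sec = pages_per_month / SECONDS_PER_MONTH
    mb_per_crawler = AGGREGATE_BYTES_PER_SEC / CRAWLERS / 10**6
    kb_per_page = AGGREGATE_BYTES_PER_SEC / pages_per_sec / 1000
    return pages_per_sec, mb_per_crawler, kb_per_page


for pages in (25 * 10**9, 30 * 10**9):
    rate, per_crawler, kb = crawl_estimates(pages)
    print(f"{pages:.1e} pages/month: {rate:,.0f} pages/sec, "
          f"{per_crawler:.1f} MB/sec per crawler, ~{kb:.0f} KB/page")
```

Under these assumptions the figures imply roughly 22 to 27 KB downloaded per page.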

Compute Aggregates and Metrics

1:5 to 1:50 Compression Ratios

Aggregates are Parallelized Linear Scans

Communication Avoided where Possible
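A minimal sketch of an aggregate computed as parallelized linear scans: each worker scans its own partition and returns a small partial result, and only those partials are merged. The record shape (source_id, target_id), the inbound-link-count metric, and the use of threads in place of separate machines are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_partition(links):
    """Linear scan over one partition: count inbound links per target."""
    counts = Counter()
    for _source, target in links:
        counts[target] += 1
    return counts


def inbound_link_counts(partitions, workers=4):
    """Scan partitions independently, then merge the small partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(scan_partition, partitions):
            total.update(partial)  # the only cross-partition communication
    return total


# Hypothetical (source_id, target_id) link records, pre-split into partitions.
partitions = [
    [(1, 10), (2, 10), (3, 11)],
    [(4, 10), (5, 12)],
]
print(inbound_link_counts(partitions))
```

The scans never exchange records with each other; the merge step sees only one counter per partition.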


Surface with a Read-Only API

~12 TB per Release in Amazon S3

6 m2.4xlarge Instances for Cache

~28k Requests per Minute
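To put the serving figures in perspective, the arithmetic below converts them to per-second and per-instance rates and estimates how much of a release could sit in RAM. It assumes load is spread evenly, that an m2.4xlarge has about 68.4 GiB of memory, and that 1 TB is treated as 1024 GiB; these are assumptions, not figures from the slides.

```python
# Rough load figures for the read-only API tier.
# From the slides: 6 cache instances, ~28k requests/minute, ~12 TB/release.
# Assumptions (not from the slides): load is spread evenly across instances,
# an m2.4xlarge has about 68.4 GiB of RAM, and 1 TB is treated as 1024 GiB.
INSTANCES = 6
REQUESTS_PER_MINUTE = 28_000
RELEASE_GIB = 12 * 1024
RAM_PER_INSTANCE_GIB = 68.4

requests_per_sec = REQUESTS_PER_MINUTE / 60
per_instance_rps = requests_per_sec / INSTANCES
cacheable_fraction = INSTANCES * RAM_PER_INSTANCE_GIB / RELEASE_GIB

print(f"{requests_per_sec:.0f} req/sec total, "
      f"{per_instance_rps:.0f} req/sec per instance")
print(f"RAM could hold at most {cacheable_fraction:.1%} of a release")
```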


Observations and Strategy

Billions of Small, Similar Records

De-normalization Avoids Complex Joins

Batch-style Emphasizes Spatial Locality

Data Layout

Column-Orientation exploits Locality

Broken into 5GB chunks for S3

~64KB Compression Runs within Each Chunk
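The layout can be pictured with a small sketch: pivot rows into columns, then cut each column into runs that are compressed independently. The field names are hypothetical, values are packed as unsigned 64-bit integers for simplicity, and the 64 KB figure is treated as the uncompressed run size, which the slides do not specify.

```python
import struct

# Target run size, per the slide; treated here as the uncompressed size
# (an assumption).
RUN_BYTES = 64 * 1024


def to_columns(records, fields):
    """Pivot row-oriented records into one value list per column."""
    return {name: [rec[name] for rec in records] for name in fields}


def split_into_runs(values, run_bytes=RUN_BYTES):
    """Pack unsigned 64-bit values and cut them into fixed-size runs."""
    per_run = run_bytes // 8  # 8 bytes per value
    runs = []
    for start in range(0, len(values), per_run):
        chunk = values[start:start + per_run]
        runs.append(struct.pack(f"<{len(chunk)}Q", *chunk))
    return runs


# Hypothetical records: many small, similar rows.
records = [{"url_id": i, "inbound_links": i % 7} for i in range(20_000)]
columns = to_columns(records, ["url_id", "inbound_links"])
runs = split_into_runs(columns["url_id"])
print(len(runs), [len(r) for r in runs])
```

Because each column is stored contiguously, a scan that needs only one field reads only that column's runs.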

Compression

Tuned to Overcome Disk Read Bound

By-Column Run-Length and Gap Encoding on top of LZO

Customized Pipelines per Column
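One way such a per-column pipeline might look for a column of sorted IDs: gap encoding, then run-length encoding, then a general-purpose byte compressor. This is a sketch under stated assumptions, not the actual Mozscape pipeline; zlib stands in for LZO because LZO is not in the Python standard library, and the 64-bit packing is chosen only for simplicity.

```python
import struct
import zlib  # stand-in for LZO (assumption: any byte-level compressor works here)


def gap_encode(sorted_ids):
    """Replace each sorted ID with its distance from the previous one."""
    gaps, prev = [], 0
    for value in sorted_ids:
        gaps.append(value - prev)
        prev = value
    return gaps


def gap_decode(gaps):
    """Invert gap_encode by accumulating the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def run_length_encode(values):
    """Collapse repeats into [value, count] pairs."""
    pairs = []
    for value in values:
        if pairs and pairs[-1][0] == value:
            pairs[-1][1] += 1
        else:
            pairs.append([value, 1])
    return pairs


def compress_id_column(sorted_ids):
    """Pipeline for one column: gap -> run-length -> byte compressor."""
    pairs = run_length_encode(gap_encode(sorted_ids))
    flat = [n for pair in pairs for n in pair]
    return zlib.compress(struct.pack(f"<{len(flat)}Q", *flat))


def decompress_id_column(blob):
    """Reverse the pipeline: decompress, expand runs, accumulate gaps."""
    raw = zlib.decompress(blob)
    flat = struct.unpack(f"<{len(raw) // 8}Q", raw)
    gaps = []
    for value, count in zip(flat[0::2], flat[1::2]):
        gaps.extend([value] * count)
    return gap_decode(gaps)


ids = list(range(1_000, 11_000))  # dense, sorted IDs
blob = compress_id_column(ids)
print(len(ids) * 8, "bytes raw ->", len(blob), "bytes compressed")
```

A different column, such as one holding free-form text, would likely call for a different chain of stages, which is the point of customizing pipelines per column.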

Job Control

Each Stage has Parallel, Idempotent Tasks

Tasks are Processes with a simple Command Line

stdout and exit code are logged to track state
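A minimal sketch of this style of job control, assuming a JSON-lines log file and hypothetical task names: each task is a plain command line, its stdout and exit code are appended to the log, and a rerun skips tasks already logged as successful. For brevity the tasks run one after another here, whereas the slides describe them as parallel.

```python
import json
import subprocess
import sys
from pathlib import Path


def completed_tasks(log_path):
    """Names of tasks whose last logged exit code was 0."""
    done = set()
    if log_path.exists():
        for line in log_path.read_text().splitlines():
            entry = json.loads(line)
            if entry["exit_code"] == 0:
                done.add(entry["task"])
            else:
                done.discard(entry["task"])
    return done


def run_stage(tasks, log_path):
    """Run each task as a child process; record stdout and exit code."""
    done = completed_tasks(log_path)
    for name, argv in tasks.items():
        if name in done:
            continue  # safe to skip or re-run because tasks are idempotent
        proc = subprocess.run(argv, capture_output=True, text=True)
        entry = {"task": name, "exit_code": proc.returncode,
                 "stdout": proc.stdout}
        with log_path.open("a") as log:
            log.write(json.dumps(entry) + "\n")
    return completed_tasks(log_path)


# Hypothetical stage: two tasks, each just a command line.
tasks = {
    "scan-part-0": [sys.executable, "-c", "print('scanned part 0')"],
    "scan-part-1": [sys.executable, "-c", "print('scanned part 1')"],
}
```

Calling `run_stage(tasks, Path("stage.log"))` twice runs each task once; the second call finds both in the log and does nothing.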

Checkpoints

[Diagram: a timeline of stages (Table Scan, Barrier, Table Scan, Barrier), with a Table Scan checkpoint written to S3]
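One reading of the diagram is that each table scan runs to a barrier, its output is checkpointed, and a restarted job resumes from the last checkpoint. The sketch below illustrates that idea only; a local directory stands in for S3, threads stand in for separate tasks, and the stage names and JSON checkpoint format are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_pipeline(stages, partitions, checkpoint_dir):
    """Run stages in order; checkpoint after each barrier, resume if present."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for index, (name, scan) in enumerate(stages):
        checkpoint = checkpoint_dir / f"{index:02d}-{name}.json"
        if checkpoint.exists():
            # Resume: reuse the stored output instead of rescanning.
            partitions = json.loads(checkpoint.read_text())
            continue
        with ThreadPoolExecutor() as pool:
            # list() waits for every task: this is the barrier.
            partitions = list(pool.map(scan, partitions))
        checkpoint.write_text(json.dumps(partitions))
        executed.append(name)
    return partitions, executed


# Hypothetical table scans over two partitions of integers.
stages = [
    ("double", lambda part: [x * 2 for x in part]),
    ("total", lambda part: [sum(part)]),
]
```

Running `run_pipeline(stages, [[1, 2], [3]], "checkpoints")` a second time executes no stages, because both checkpoints already exist.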

Indexing

Each Column has a Berkeley DB (BDB) Index Keyed by ID

Subset of IDs map to Compression Runs

Decompress Run and Scan to find Record
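The lookup path described above can be sketched as a sparse index: only the first ID of each compression run is indexed, and a lookup decompresses that one run and scans it. In this sketch a sorted list with `bisect` stands in for the Berkeley DB index, zlib stands in for LZO, and the JSON record encoding and tiny run size are purely illustrative assumptions.

```python
import bisect
import json
import zlib


def build_column(records, run_size=4):
    """Compress sorted (id, value) records in runs; index each run's first ID."""
    index_ids, runs = [], []
    for start in range(0, len(records), run_size):
        run = records[start:start + run_size]
        index_ids.append(run[0][0])
        runs.append(zlib.compress(json.dumps(run).encode()))
    return index_ids, runs


def lookup(record_id, index_ids, runs):
    """Find the run that could hold record_id, decompress it, and scan."""
    pos = bisect.bisect_right(index_ids, record_id) - 1
    if pos < 0:
        return None  # smaller than every indexed ID
    for rid, value in json.loads(zlib.decompress(runs[pos])):
        if rid == record_id:
            return value
    return None


# Hypothetical column: IDs 0, 3, 6, ... with a small metric value each.
records = [(i, i * i % 11) for i in range(0, 30, 3)]
index_ids, runs = build_column(records)
print(index_ids, lookup(9, index_ids, runs), lookup(10, index_ids, runs))
```

Indexing only a subset of IDs keeps the index small; the cost is a short scan inside one decompressed run per lookup.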

Physical Deployment

Crawlers run in Colo for white-listed IPs

Batch Process and API layer in EC2

The API might be in a colo too, but ELB + Autoscaling are nice

Questions?

We’re Hiring!
