Post on 22-Jan-2018


Mozscape: NoSQL at Terabyte Scale

Phil Smith
Software Engineer

What We Do

SEO & Inbound Marketing Metrics

Collect backlinks across the web

Compute metrics estimating value

Serve links and metrics with API and OSE

www.opensiteexplorer.org

How We Do

Crawl the Web

~25-30 billion pages per month

20 Crawler machines

~256 MB/sec aggregate download rate
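A quick sanity check that the crawl figures hang together. This is a back-of-the-envelope sketch: the monthly midpoint and the derived average page size are my assumptions, not numbers from the slides.

```python
# Derived rates from the stated crawl figures (~25-30B pages/month,
# 20 crawlers, ~256 MB/sec aggregate download rate).
PAGES_PER_MONTH = 27.5e9            # assumed midpoint of 25-30 billion
SECONDS_PER_MONTH = 30 * 24 * 3600
AGGREGATE_RATE_MB_S = 256
CRAWLERS = 20

pages_per_sec = PAGES_PER_MONTH / SECONDS_PER_MONTH
per_machine_mb_s = AGGREGATE_RATE_MB_S / CRAWLERS
avg_page_kb = AGGREGATE_RATE_MB_S * 1024 / pages_per_sec

print(f"{pages_per_sec:,.0f} pages/sec")            # roughly 10,600
print(f"{per_machine_mb_s:.1f} MB/sec per crawler") # 12.8
print(f"~{avg_page_kb:.0f} KB average page")        # roughly 25 KB
```

At ~12.8 MB/sec per machine, a single crawler is well within commodity network capacity; the bottleneck is elsewhere in the pipeline.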

How We Do

Compute Aggregates and Metrics

1:5 to 1:50 Compression Ratios

Aggregates are Parallelized Linear Scans

Communication Avoided where Possible
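The "parallelized linear scans with communication avoided" pattern can be sketched as follows: each worker makes one pass over its own chunk and returns a small partial aggregate, and the partials are merged only once at the end. All names here are hypothetical, not Mozscape's actual code.

```python
# Each worker scans its chunk independently; workers never exchange data.
from multiprocessing import Pool

def scan_chunk(chunk):
    # One linear pass over the records in this chunk -> (count, sum).
    count, total = 0, 0
    for value in chunk:
        count += 1
        total += value
    return count, total

def aggregate(chunks, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(scan_chunk, chunks)   # embarrassingly parallel
    count = sum(c for c, _ in partials)           # merge once, at the end
    total = sum(t for _, t in partials)
    return total / count                          # e.g. a mean link metric

if __name__ == "__main__":
    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
    print(aggregate(chunks))   # mean of 0..7999 = 3999.5
```

Because partial aggregates are tiny relative to the scanned data, the merge step is negligible and throughput scales with the number of scanning workers.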

How We Do

Surface with a Read-Only API

~12 TB per Release in Amazon S3

6 m2.4xlarge Instances for Cache

~28k Requests per Minute

Observations and Strategy

Billions of Small, Similar Records

De-normalization Avoids Complex Joins

Batch-style Emphasizes Spatial Locality

Data Layout

Column-Orientation exploits Locality

Broken into 5GB chunks for S3

~64KB Compression Runs within
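A minimal sketch of the layout above, using the sizes from the slides (~5 GB S3 chunks, ~64 KB compression runs). The record format and function names are assumptions; the point is that consecutive records are packed into small runs so a point lookup only ever decompresses one ~64 KB unit.

```python
CHUNK_BYTES = 5 * 1024**3   # one S3 object
RUN_BYTES = 64 * 1024       # one compression run within a chunk

def plan_runs(record_sizes):
    """Group consecutive records into runs of at most RUN_BYTES each."""
    runs, current, current_bytes = [], [], 0
    for i, size in enumerate(record_sizes):
        if current and current_bytes + size > RUN_BYTES:
            runs.append(current)          # close the full run
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        runs.append(current)
    return runs

# 1000 records of ~100 bytes each: 655 fit in the first 64 KB run.
runs = plan_runs([100] * 1000)
print(len(runs), len(runs[0]))   # 2 655
```

Small, similar records are what make this work: with ~100-byte records, a single run holds hundreds of neighbors, so scans get excellent spatial locality.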

Compression

Tuned to Overcome Disk Read Bound

By-Column, Run & Gap Encoding on LZO

Customized Pipelines per Column
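A hedged sketch of the per-column pipeline: gap-encode sorted IDs (small deltas compress far better than raw 64-bit IDs), then feed the byte stream to a fast block compressor. The slides name LZO, which is not in the Python standard library, so zlib stands in here; the structure of the pipeline is the point, not the codec.

```python
import struct
import zlib

def gap_encode(sorted_ids):
    prev, gaps = 0, []
    for i in sorted_ids:
        gaps.append(i - prev)   # store the delta, not the absolute ID
        prev = i
    return gaps

def compress_column(sorted_ids):
    gaps = gap_encode(sorted_ids)
    raw = struct.pack(f"<{len(gaps)}Q", *gaps)
    return zlib.compress(raw)   # LZO in production, per the slides

def decompress_column(blob, n):
    gaps = struct.unpack(f"<{n}Q", zlib.decompress(blob))
    ids, prev = [], 0
    for g in gaps:
        prev += g
        ids.append(prev)
    return ids

ids = list(range(10_000_000, 10_001_000, 3))   # dense, sorted IDs
blob = compress_column(ids)
assert decompress_column(blob, len(ids)) == ids
print(len(ids) * 8, "->", len(blob), "bytes")
```

Because each column gets its own pipeline, a column of near-constant gaps can hit the 1:50 end of the stated compression range while a noisier column settles nearer 1:5.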

Job Control

Each Stage has Parallel, Idempotent Tasks

Tasks are Procs with easy Command Line

stdout, exit code are logged to track state
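The job-control model above reduces to something like this sketch: a task is just a process with a command line, and the coordinator records its stdout and exit code as the task's state. All names are assumptions, not Mozscape's actual tooling.

```python
import subprocess

def run_task(cmd, log):
    """Run one task; persist its stdout and exit code as the state record."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    log[" ".join(cmd)] = {"stdout": result.stdout, "exit": result.returncode}
    return result.returncode == 0

log = {}
ok = run_task(["echo", "chunk-0042 done"], log)
failed = [cmd for cmd, state in log.items() if state["exit"] != 0]
# Tasks are idempotent, so anything in `failed` can simply be re-run.
```

Idempotence is what keeps this simple: there is no distributed state to reconcile, only a list of command lines whose exit codes say whether to retry.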

Checkpoints

[Diagram: timeline of alternating Table Scan stages and Barriers, with a checkpoint written to S3 after each Table Scan]
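One way to read the checkpoint/barrier scheme, sketched here under assumptions: every task of a table-scan stage writes a completion marker (to S3 in production; a plain dict stands in below), and the next stage starts only once all markers for the previous stage exist.

```python
def checkpoint(store, stage, task_id):
    # In production this would be a small object written to S3.
    store[f"{stage}/task-{task_id}.done"] = True

def barrier_passed(store, stage, n_tasks):
    # The barrier holds until every task of the stage has checkpointed.
    return all(f"{stage}/task-{i}.done" in store for i in range(n_tasks))

store = {}
for task_id in range(4):                      # stage 1: parallel table scan
    checkpoint(store, "scan-1", task_id)
assert barrier_passed(store, "scan-1", 4)     # safe to start scan-2
```

Marker objects double as restart points: after a failure, the coordinator re-runs only the tasks whose markers are missing.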

Indexing

Columns have BDBs indexing by ID

Subset of IDs map to Compression Runs

Decompress Run and Scan to find Record
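The lookup path above can be sketched like this. The sparse index is a Berkeley DB in the slides; a sorted list stands in here. A lookup bisects the index to pick the right compression run, decompresses that one run, and scans it linearly for the record.

```python
from bisect import bisect_right

def find_record(index_ids, runs, wanted_id):
    # index_ids[i] is the smallest ID stored in runs[i].
    run_no = bisect_right(index_ids, wanted_id) - 1
    if run_no < 0:
        return None                            # below the first indexed ID
    for record_id, record in runs[run_no]:     # decompressed run, ID order
        if record_id == wanted_id:
            return record
    return None

index_ids = [0, 100, 200]
runs = [[(i, f"rec-{i}") for i in range(start, start + 100, 10)]
        for start in index_ids]
print(find_record(index_ids, runs, 120))   # rec-120
```

Indexing only a subset of IDs keeps the index small enough to cache in memory, at the cost of decompressing and scanning one ~64 KB run per lookup.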

Physical Deployment

Crawlers run in Colo for white-listed IPs

Batch Process and API layer in EC2

The API might be in a colo too, but ELB + Autoscaling are nice

Questions?

We’re Hiring!
