
Page 1: Petabyte-Scale Text Processing with Spark

Petabyte-Scale Text Processing with Spark

Oleksii Sliusarenko, Grammarly Inc. E-mail: aliaxey90 (at) gmail (dot) com

Read the full article in Grammarly tech blog

Page 2: Petabyte-Scale Text Processing with Spark

Modern error correcting

"depending from the weather" → "depending on the weather"

Page 3: Petabyte-Scale Text Processing with Spark

Common Crawl = internet dump

Size: 3 petabytes

Format: WARC - raw HTTP protocol dump

We need: 1 PB of storage, i.e. 2000 x 480 GB SSD disks (2000 × 480 GB ≈ 1 PB)

Page 4: Petabyte-Scale Text Processing with Spark

High-level pipeline view

Extract texts → Filter English → Deduplicate → Break into words → Count frequencies
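A minimal sketch of this pipeline in Spark (Scala). The input path, the WARC text extraction, and the language check are hypothetical placeholders, not the talk's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CommonCrawlPipeline {
  // Hypothetical placeholders: real WARC parsing and language
  // detection are far more involved than this.
  def extractText(record: String): String = record
  def isEnglish(text: String): Boolean = true

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cc-pipeline"))

    sc.textFile("s3a://my-bucket/warc-records/")  // hypothetical input path
      .map(extractText)                           // extract texts
      .filter(isEnglish)                          // filter English
      .distinct()                                 // deduplicate
      .flatMap(_.split("""\s+"""))                // break into words
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                         // count frequencies
      .saveAsTextFile("s3a://my-bucket/word-counts/")

    sc.stop()
  }
}
```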

Page 5: Petabyte-Scale Text Processing with Spark

Typical processing step example: count each n-gram's frequency

Input data example (<sentence> <tab> <frequency>):

  My name is Bob.      12
  Kiev is a capital.   25

Output data example (<n-gram> <tab> <frequency>):

  name is   12
  is        37

("is" occurs in both input sentences, so its frequency is 12 + 25 = 37.)
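A minimal sketch of this step in Spark (Scala), producing unigrams and bigrams. The input/output paths are hypothetical, and the tokenization is deliberately naive:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NgramFrequencies {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ngram-count"))

    sc.textFile("s3a://my-bucket/sentences/")      // hypothetical input path
      .map(_.split("\t"))
      // keep only well-formed "<sentence>\t<frequency>" lines
      .collect { case Array(sentence, freq) => (sentence, freq.toLong) }
      .flatMap { case (sentence, freq) =>
        val words = sentence.toLowerCase.split("""[^\p{L}]+""").filter(_.nonEmpty)
        // every n-gram occurrence inherits its sentence's frequency
        (1 to 2).flatMap { n =>
          words.sliding(n).filter(_.length == n).map(g => (g.mkString(" "), freq))
        }
      }
      .reduceByKey(_ + _)                          // e.g. "is": 12 + 25 = 37
      .map { case (ngram, freq) => s"$ngram\t$freq" }
      .saveAsTextFile("s3a://my-bucket/ngram-counts/")

    sc.stop()
  }
}
```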

Page 6: Petabyte-Scale Text Processing with Spark

Classic and modern approaches

Page 7: Petabyte-Scale Text Processing with Spark

Our alternatives

[Chart: cost comparison of the alternatives: $12000 vs $3000 vs $1000]

Page 8: Petabyte-Scale Text Processing with Spark

Default choice: Amazon EMR

[Chart: EMR runs costing $12000 and $24000, failing with OOM and segfault]

Page 9: Petabyte-Scale Text Processing with Spark

Our MapReduce

◈ 12x faster than Hadoop
◈ Easy to learn
◈ Full support

Page 10: Petabyte-Scale Text Processing with Spark

Our MapReduce

Distributed failsafe difficulties:

◈ Hardware failures
◈ Network failures

Page 11: Petabyte-Scale Text Processing with Spark

Fixing Spark

3 months!

Page 12: Petabyte-Scale Text Processing with Spark

First of all

◈ Use the latest stable Spark and Hadoop
◈ Build Spark with patches
◈ Don’t forget Hadoop native libraries

Page 13: Petabyte-Scale Text Processing with Spark

The hardest button

S3 HEAD request failed for "file path" - ResponseCode=403, ResponseMessage=Forbidden.

Why???

Page 14: Petabyte-Scale Text Processing with Spark

HTTP HEAD request

The HTTP response body contains the error description, but it's never fetched: a HEAD response has no body!
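One way to surface the hidden error, sketched in Scala. This illustrates the idea, not the actual patch from the talk: when a HEAD fails, repeat the request as a GET, whose body carries S3's XML error document with the real reason:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch: a HEAD response carries no body, so S3's error description is
// lost. Re-issuing the failed request as a GET retrieves the XML error
// document that names the actual cause (e.g. RequestTimeTooSkewed).
def explainHeadFailure(url: String): Option[String] = {
  val head = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  head.setRequestMethod("HEAD")
  if (head.getResponseCode < 400) None  // HEAD succeeded; nothing to explain
  else {
    val get = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    get.setRequestMethod("GET")
    get.getResponseCode                 // force the request to execute
    Option(get.getErrorStream).map(Source.fromInputStream(_).mkString)
  }
}
```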

Page 15: Petabyte-Scale Text Processing with Spark

Possible reasons:

◈ AccessDenied
◈ AccountProblem
◈ CrossLocationLoggingProhibited
◈ InvalidAccessKeyId
◈ InvalidObjectState
◈ InvalidPayer
◈ InvalidSecurity
◈ NotSignedUp
◈ RequestTimeTooSkewed
◈ SignatureDoesNotMatch

Page 16: Petabyte-Scale Text Processing with Spark

We need to go deeper!

Spark → Hadoop → JetS3t → HttpClient (fix here)

Page 17: Petabyte-Scale Text Processing with Spark

Fixing Spark: Fixing S3

◈ Choose the latest filesystem: S3A, not S3 or S3N

◈ conf.setInt("fs.s3a.connection.maximum", 100)

◈ Use DirectOutputCommitter

◈ --conf spark.hadoop.fs.s3a.access.key=…
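The same S3 fixes collected in one place, as a hedged sketch in Scala. The bucket path is a placeholder, credentials come from the environment, and the DirectOutputCommitter class name is build-specific, so it is only noted in a comment:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3a-tuned-job")
  // raise the S3A connection pool limit
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")
  // credentials from the environment, never hard-coded
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  // the DirectOutputCommitter from the slide is wired in through the
  // "spark.hadoop.mapred.output.committer.class" setting; its class
  // name depends on your Spark/EMR build

val sc = new SparkContext(conf)

// address data through the newer s3a:// scheme, not s3:// or s3n://
val lines = sc.textFile("s3a://my-bucket/input/")  // hypothetical path
```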

Page 18: Petabyte-Scale Text Processing with Spark

Fixing Spark: Fixing OOM

◈ spark.default.parallelism = cores * 3

◈ spark_mb = system_ram_mb * 4 // 5 (give Spark ~80% of system RAM)

◈ set("spark.akka.frameSize", "2047")
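A sketch of the OOM settings above in Scala. The core and RAM numbers are illustrative only; substitute your instance type's values:

```scala
import org.apache.spark.SparkConf

val cores = 32           // illustrative: total executor cores
val systemRamMb = 61440  // illustrative: RAM per node, in MB

val conf = new SparkConf()
  // ~3 tasks per core keeps every core busy without tiny partitions
  .set("spark.default.parallelism", (cores * 3).toString)
  // give Spark ~80% of system RAM; leave the rest to the OS
  .set("spark.executor.memory", s"${systemRamMb * 4 / 5}m")
  // 2047 MB is the maximum Akka frame size in Spark 1.x
  .set("spark.akka.frameSize", "2047")
```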

Page 19: Petabyte-Scale Text Processing with Spark

Fixing Spark: Fixing miscellaneous

◈ Don’t force Kryo class registration

◈ Use bzip2 compression for input files (bzip2 is splittable, so reads parallelize)
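The Kryo point as a short Scala sketch: keep Kryo for speed, but leave registration optional, so an unregistered class degrades to writing its full class name instead of failing the job:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Kryo is much faster than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // don't force registration: unregistered classes are serialized with
  // their fully-qualified name instead of crashing the job
  .set("spark.kryo.registrationRequired", "false")
```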

Page 20: Petabyte-Scale Text Processing with Spark

Our Ultimate Spark Recipe

See Grammarly tech blog for more info

Page 21: Petabyte-Scale Text Processing with Spark

Use spot instances

Spot instance: cheap (80% cheaper!), but transient

Regular instance: expensive, but safe

Page 22: Petabyte-Scale Text Processing with Spark

Was It All Worth It?

◈ We spent the same amount of money

◈ Further experiments will be cheaper

◈ You can save three months!

Page 23: Petabyte-Scale Text Processing with Spark

Take-aways

◈ Don’t reinvent the wheel

◈ New technology will eat a lot of time

◈ Don’t be afraid to dive into code

◈ Look at problems from various angles

◈ Use spot instances

Page 24: Petabyte-Scale Text Processing with Spark

Thanks! Any questions? You can find me at aliaxey90 (at) gmail (dot) com

Read the full article in Grammarly tech blog