
Page 1: Petabyte-Scale Text Processing with Spark

Petabyte-Scale Text Processing with Spark

Oleksii Sliusarenko, Grammarly Inc. E-mail: aliaxey90 (at) gmail (dot) com

Read the full article in Grammarly tech blog

Page 2: Petabyte-Scale Text Processing with Spark

Modern error correcting

"depending from the weather" → "depending on the weather"

Page 3: Petabyte-Scale Text Processing with Spark

Common Crawl = internet dump

Size: 3 petabytes

Format: WARC - raw HTTP protocol dump

We need: 1 PB of storage, i.e. 2000 x 480 GB SSD disks (2000 × 480 GB ≈ 1 PB)

Page 4: Petabyte-Scale Text Processing with Spark

High-level pipeline view

Extract texts → Filter English → Deduplicate → Break into words → Count frequencies
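A minimal sketch of this pipeline in Spark (Scala). The input path, the WARC text extraction, and the language check are hypothetical placeholders, not the talk's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CommonCrawlPipeline {
  // Hypothetical placeholders: real WARC parsing and language
  // detection are far more involved than this.
  def extractText(record: String): String = record
  def isEnglish(text: String): Boolean = true

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cc-pipeline"))

    sc.textFile("s3a://my-bucket/warc-records/")  // hypothetical input path
      .map(extractText)                           // extract texts
      .filter(isEnglish)                          // filter English
      .distinct()                                 // deduplicate
      .flatMap(_.split("""\s+"""))                // break into words
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                         // count frequencies
      .saveAsTextFile("s3a://my-bucket/word-counts/")

    sc.stop()
  }
}
```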

Page 5: Petabyte-Scale Text Processing with Spark

Typical processing step example: count each n-gram's frequency

Input data example (<sentence> <tab> <frequency>):

  My name is Bob.      12
  Kiev is a capital.   25

Output data example (<n-gram> <tab> <frequency>):

  name is   12
  is        37

("is" occurs in both input sentences, so its frequency is 12 + 25 = 37.)
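A minimal sketch of this step in Spark (Scala), producing unigrams and bigrams. The input/output paths are hypothetical, and the tokenization is deliberately naive:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NgramFrequencies {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ngram-count"))

    sc.textFile("s3a://my-bucket/sentences/")      // hypothetical input path
      .map(_.split("\t"))
      // keep only well-formed "<sentence>\t<frequency>" lines
      .collect { case Array(sentence, freq) => (sentence, freq.toLong) }
      .flatMap { case (sentence, freq) =>
        val words = sentence.toLowerCase.split("""[^\p{L}]+""").filter(_.nonEmpty)
        // every n-gram occurrence inherits its sentence's frequency
        (1 to 2).flatMap { n =>
          words.sliding(n).filter(_.length == n).map(g => (g.mkString(" "), freq))
        }
      }
      .reduceByKey(_ + _)                          // e.g. "is": 12 + 25 = 37
      .map { case (ngram, freq) => s"$ngram\t$freq" }
      .saveAsTextFile("s3a://my-bucket/ngram-counts/")

    sc.stop()
  }
}
```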

Page 6: Petabyte-Scale Text Processing with Spark

Classic and modern approaches

Page 7: Petabyte-Scale Text Processing with Spark

Our alternatives

[Chart: cost comparison of the alternatives: $12000 vs $3000 vs $1000]

Page 8: Petabyte-Scale Text Processing with Spark

Default choice: Amazon EMR

[Chart: EMR runs costing $12000 and $24000, failing with OOM and segfault]

Page 9: Petabyte-Scale Text Processing with Spark

Our MapReduce

◈ 12x faster than Hadoop
◈ Easy to learn
◈ Full support

Page 10: Petabyte-Scale Text Processing with Spark

Our MapReduce

Distributed failsafe difficulties:

◈ Hardware failures
◈ Network failures

Page 11: Petabyte-Scale Text Processing with Spark

Fixing Spark

3 months!

Page 12: Petabyte-Scale Text Processing with Spark

First of all

◈ Use the latest stable Spark and Hadoop
◈ Build Spark with patches
◈ Don’t forget Hadoop native libraries

Page 13: Petabyte-Scale Text Processing with Spark

The hardest button

S3 HEAD request failed for "file path" - ResponseCode=403, ResponseMessage=Forbidden.

Why???

Page 14: Petabyte-Scale Text Processing with Spark

HTTP HEAD request

The HTTP response body contains the error description, but it's never fetched: a HEAD response has no body!
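One way to surface the hidden error, sketched in Scala. This illustrates the idea, not the actual patch from the talk: when a HEAD fails, repeat the request as a GET, whose body carries S3's XML error document with the real reason:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch: a HEAD response carries no body, so S3's error description is
// lost. Re-issuing the failed request as a GET retrieves the XML error
// document that names the actual cause (e.g. RequestTimeTooSkewed).
def explainHeadFailure(url: String): Option[String] = {
  val head = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  head.setRequestMethod("HEAD")
  if (head.getResponseCode < 400) None  // HEAD succeeded; nothing to explain
  else {
    val get = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    get.setRequestMethod("GET")
    get.getResponseCode                 // force the request to execute
    Option(get.getErrorStream).map(Source.fromInputStream(_).mkString)
  }
}
```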

Page 15: Petabyte-Scale Text Processing with Spark

Possible reasons:

◈ AccessDenied
◈ AccountProblem
◈ CrossLocationLoggingProhibited
◈ InvalidAccessKeyId
◈ InvalidObjectState
◈ InvalidPayer
◈ InvalidSecurity
◈ NotSignedUp
◈ RequestTimeTooSkewed
◈ SignatureDoesNotMatch

Page 16: Petabyte-Scale Text Processing with Spark

We need to go deeper!

Spark → Hadoop → JetS3t → HttpClient (fix here)

Page 17: Petabyte-Scale Text Processing with Spark

Fixing Spark: Fixing S3

◈ Choose the latest filesystem: S3A, not S3 or S3N

◈ conf.setInt("fs.s3a.connection.maximum", 100)

◈ Use DirectOutputCommitter

◈ --conf spark.hadoop.fs.s3a.access.key=…
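The same S3 fixes collected in one place, as a hedged sketch in Scala. The bucket path is a placeholder, credentials come from the environment, and the DirectOutputCommitter class name is build-specific, so it is only noted in a comment:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3a-tuned-job")
  // raise the S3A connection pool limit
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")
  // credentials from the environment, never hard-coded
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  // the DirectOutputCommitter from the slide is wired in through the
  // "spark.hadoop.mapred.output.committer.class" setting; its class
  // name depends on your Spark/EMR build

val sc = new SparkContext(conf)

// address data through the newer s3a:// scheme, not s3:// or s3n://
val lines = sc.textFile("s3a://my-bucket/input/")  // hypothetical path
```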

Page 18: Petabyte-Scale Text Processing with Spark

Fixing Spark: Fixing OOM

◈ spark.default.parallelism = cores * 3

◈ spark_mb = system_ram_mb * 4 // 5 (give Spark ~80% of system RAM)

◈ set("spark.akka.frameSize", "2047")
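A sketch of the OOM settings above in Scala. The core and RAM numbers are illustrative only; substitute your instance type's values:

```scala
import org.apache.spark.SparkConf

val cores = 32           // illustrative: total executor cores
val systemRamMb = 61440  // illustrative: RAM per node, in MB

val conf = new SparkConf()
  // ~3 tasks per core keeps every core busy without tiny partitions
  .set("spark.default.parallelism", (cores * 3).toString)
  // give Spark ~80% of system RAM; leave the rest to the OS
  .set("spark.executor.memory", s"${systemRamMb * 4 / 5}m")
  // 2047 MB is the maximum Akka frame size in Spark 1.x
  .set("spark.akka.frameSize", "2047")
```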

Page 19: Petabyte-Scale Text Processing with Spark

Fixing Spark: Fixing miscellaneous

◈ Don’t force Kryo class registration

◈ Use bzip2 compression for input files (bzip2 is splittable, so reads parallelize)
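The Kryo point as a short Scala sketch: keep Kryo for speed, but leave registration optional, so an unregistered class degrades to writing its full class name instead of failing the job:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Kryo is much faster than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // don't force registration: unregistered classes are serialized with
  // their fully-qualified name instead of crashing the job
  .set("spark.kryo.registrationRequired", "false")
```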

Page 20: Petabyte-Scale Text Processing with Spark

Our Ultimate Spark Recipe

See Grammarly tech blog for more info

Page 21: Petabyte-Scale Text Processing with Spark

Use spot instances

Spot instance: cheap (80% cheaper!), but transient

Regular instance: expensive, but safe

Page 22: Petabyte-Scale Text Processing with Spark

Was It All Worth It?

◈ We spent the same amount of money

◈ Further experiments will be cheaper

◈ You can save three months!

Page 23: Petabyte-Scale Text Processing with Spark

Take-aways

◈ Don’t reinvent the wheel

◈ New technology will eat a lot of time

◈ Don’t be afraid to dive into code

◈ Look at problems from various angles

◈ Use spot instances

Page 24: Petabyte-Scale Text Processing with Spark

Thanks! Any questions? You can find me at aliaxey90 (at) gmail (dot) com

Read the full article in Grammarly tech blog