relink - tech talk

Post on 09-Apr-2017

Category: Data & Analytics

relink

www.relinklabs.com

Bjarne Ørum Fruergaard, Lead Data Scientist

bf@relinklabs.com

Machine Learning Match Making

1000 employees

10% growth & 10% churn

Hire 200 people per year

Manually assess 15.000 profiles

1.200.000 EURO

Solution

Job CV & cover letter

Job <—> Applicant analyses empowering the recruiter

Large graphs connect jobs, educations & skills

Augment job descriptions and profiles

Relevant skills, job experience and education for the job

Probable skills with confidence on profiles

Ranking

Select Top N

Interesting challenge:

Given batch of J job descriptions

Score 5M profiles (all jobs simultaneously)

For each job (sequentially):

Top N in each partition

Merge Top N from each partition

Sequential is a nuisance

Collect on driver

Results are not distributed
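The exact baseline above can be sketched in plain Python. This is an illustration only, not the code from the talk: partitions are simulated as lists, and the names `top_n_for_job`, `profiles` and `partitions` are hypothetical.

```python
import heapq
import random

def top_n_for_job(partitions, job, n):
    """Exact top N for one job: local top N per partition, then merge."""
    # Each partition keeps only its n best (score, profile_id) pairs.
    local = [
        heapq.nlargest(n, ((scores[job], pid) for pid, scores in part))
        for part in partitions
    ]
    # The merge runs in one place (the "driver"), so the result is a
    # plain local list and is no longer distributed.
    return heapq.nlargest(n, (pair for part in local for pair in part))

random.seed(0)
J, N = 3, 5  # toy sizes; the talk uses 5 jobs, ~5M profiles, N=5000
profiles = [(pid, [random.random() for _ in range(J)]) for pid in range(1000)]
partitions = [profiles[i::4] for i in range(4)]  # 4 simulated partitions

# Jobs are handled one after another, which is the nuisance noted above.
top = {job: top_n_for_job(partitions, job, N) for job in range(J)}
for job, pairs in top.items():
    print(job, pairs[0])
```

Each job needs its own pass and its own merge on the driver, so the cost grows with J.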

Tree Digests

[1] Dunning, T. "Computing Extremely Accurate Quantiles Using t-Digests". https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf

Images shamelessly copied from here (thanks Cam Davidson-Pilon!): [2] https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

Compressing the CDF

Estimate quantiles or percentiles with low error

Associative and commutative

“I streamed 8mb of pareto-distributed data into a t-Digest. The resulting size was 5kb, and I could estimate any percentile or quantile desired. Accuracy was on the order of 0.002%.”
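To make the idea concrete, here is a much simplified centroid summary in the spirit of a t-digest. It is a sketch, not Dunning's reference implementation: the function names, the greedy merge pass and the default `delta` are assumptions. It keeps centroids tiny in the tails and lets them grow near the median, and two summaries are merged by concatenating and compressing again.

```python
import random

def compress(centroids, delta=100):
    """Greedily merge sorted (mean, weight) centroids.

    A centroid sitting at quantile q may hold at most
    4 * total * q * (1 - q) / delta points.
    """
    centroids = sorted(centroids)
    total = sum(w for _, w in centroids)
    out, seen = [], 0.0  # seen = weight already emitted to `out`
    mean, weight = centroids[0]
    for m, w in centroids[1:]:
        q = (seen + (weight + w) / 2) / total
        if weight + w <= 4 * total * q * (1 - q) / delta:
            mean = (mean * weight + m * w) / (weight + w)
            weight += w
        else:
            out.append((mean, weight))
            seen += weight
            mean, weight = m, w
    out.append((mean, weight))
    return out

def digest(values, delta=100):
    """Summarize raw values as a compressed list of centroids."""
    return compress([(v, 1.0) for v in values], delta)

def merge(a, b, delta=100):
    """Combine two summaries; merge(a, b) equals merge(b, a)."""
    return compress(a + b, delta)

def quantile(centroids, q):
    """Interpolate between the mid-ranks of neighbouring centroids."""
    target = q * sum(w for _, w in centroids)
    cum, prev_mid, prev_mean = 0.0, 0.0, centroids[0][0]
    for mean, w in centroids:
        mid = cum + w / 2
        if target <= mid:
            frac = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + frac * (mean - prev_mean)
        prev_mid, prev_mean = mid, mean
        cum += w
    return centroids[-1][0]

random.seed(0)
values = [random.random() for _ in range(20000)]
left, right = digest(values[:10000]), digest(values[10000:])
merged = merge(left, right)
print(len(merged), quantile(merged, 0.5), quantile(merged, 0.99))
```

In this sketch swapping the arguments of `merge` gives the identical result, while regrouping merges gives only approximately the same summary. The real t-digest library handles this, and the accuracy guarantees, far more carefully.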

Given batch of J job descriptions

Score 5M profiles (all jobs simultaneously)

Compute t-Digests locally on executors

Sum t-Digests

Broadcast t-Digests

Filter partitions where score >= percentile

Approximate Top N remain in partitions

t-Digests are small

Collect on driver is comparatively small

Results remain distributed

Approximates Top N
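The flow above can be sketched without Spark. To keep the example short, a fixed-bin histogram stands in for the t-digest as the mergeable summary, which is a deliberate substitution, and scores are assumed to lie in [0, 1]. All names here (`BINS`, `summarize`, `add`, `threshold_bin`) are hypothetical, and partitions are simulated as lists.

```python
import random

BINS = 1000  # resolution of the stand-in summary (assumption)

def bin_of(score):
    """Bin index for a score in [0, 1]."""
    return min(int(score * BINS), BINS - 1)

def summarize(partition, num_jobs):
    """Executor side: one small histogram per job for this partition."""
    hist = [[0] * BINS for _ in range(num_jobs)]
    for _, scores in partition:
        for job, s in enumerate(scores):
            hist[job][bin_of(s)] += 1
    return hist

def add(a, b):
    """Sum two summaries; order and grouping do not matter."""
    return [[x + y for x, y in zip(ha, hb)] for ha, hb in zip(a, b)]

def threshold_bin(hist, n):
    """Highest bin such that at least n scores fall in it or above."""
    count = 0
    for b in range(BINS - 1, -1, -1):
        count += hist[b]
        if count >= n:
            return b
    return 0

random.seed(1)
J, N = 3, 50
profiles = [(pid, [random.random() for _ in range(J)]) for pid in range(20000)]
partitions = [profiles[i::8] for i in range(8)]  # 8 simulated partitions

summaries = [summarize(p, J) for p in partitions]  # computed locally
total = summaries[0]
for s in summaries[1:]:
    total = add(total, s)  # only the small summaries are combined
cutoffs = [threshold_bin(total[job], N) for job in range(J)]  # "broadcast"

# Each partition filters itself for all jobs at once; results stay partitioned.
filtered = [
    [(pid, job, s[job]) for pid, s in part for job in range(J)
     if bin_of(s[job]) >= cutoffs[job]]
    for part in partitions
]
print([sum(1 for part in filtered for _, j, _ in part if j == job) for job in range(J)])
```

With this stand-in, every partition keeps a slight superset of the exact Top N per job. A t-digest gives a similar cutoff with a smaller summary and without assuming a fixed score range, at the price of the cutoff being an estimate.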

Let’s try it!

5 jobs simultaneously

~5M scored profiles

N=5000

Two methods:

getTopScoresWithSortAndLimit

getTopScoresWithTDigests

~28 seconds

~8 seconds

Thank you!
