relink - tech talk

relink www.relinklabs.com Bjarne Ørum Fruergaard Lead Data Scientist [email protected] Machine Learning Match Making

Upload: relink

Post on 09-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Relink - Tech Talk

relink

www.relinklabs.com

Bjarne Ørum Fruergaard
Lead Data Scientist

[email protected]

Machine Learning Match Making

Page 2: Relink - Tech Talk

1000 employees

Page 3: Relink - Tech Talk

10% growth & 10% churn

Page 4: Relink - Tech Talk

Hire 200 people per year

Page 5: Relink - Tech Talk

Manually assess 15.000 profiles

Page 6: Relink - Tech Talk

1.200.000 EURO
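The figures on the preceding slides can be tied together with a little arithmetic. The sketch below assumes that yearly hires equal growth plus churn applied to the headcount, and that the 1.200.000 EURO figure is the yearly cost of the manual assessment; the slides imply both but state neither. The per-hire and per-profile ratios are derived here and do not appear in the talk.

```python
# Figures taken from the slides.
employees = 1000
growth_pct = 10            # yearly growth, in percent
churn_pct = 10             # yearly churn, in percent
profiles_assessed = 15_000
yearly_cost_eur = 1_200_000

# Assumption: hires = positions added by growth + positions vacated by churn.
hires_per_year = employees * (growth_pct + churn_pct) // 100

# Derived ratios (not on the slides).
profiles_per_hire = profiles_assessed // hires_per_year
cost_per_profile_eur = yearly_cost_eur // profiles_assessed
cost_per_hire_eur = yearly_cost_eur // hires_per_year

print(hires_per_year, profiles_per_hire, cost_per_profile_eur, cost_per_hire_eur)
# prints: 200 75 80 6000
```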

Page 7: Relink - Tech Talk

Solution

Page 8: Relink - Tech Talk

Job CV & cover letter

Job <-> Applicant analyses empowering the recruiter

Large graphs connect jobs, educations & skills
Augment job descriptions and profiles

Relevant skills, job experience and education for the job
Probable skills with confidence on profiles

Page 9: Relink - Tech Talk
Page 10: Relink - Tech Talk

Ranking

relink

Page 11: Relink - Tech Talk
Page 12: Relink - Tech Talk
Page 13: Relink - Tech Talk
Page 14: Relink - Tech Talk

Select Top N

Page 15: Relink - Tech Talk

Interesting challenge:
Given batch of J job descriptions
Score 5M profiles (all jobs simultaneously)
For each job (sequentially):
  Top N in each partition
  Merge Top N from each partition

Sequential is a nuisance
Collect on driver
Results are not distributed
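The baseline described on this slide can be sketched in plain Python, with lists standing in for Spark partitions. This is an illustration only, not the code from the talk: the function names, the toy sizes and the `(profile_id, scores)` layout are all assumptions.

```python
import heapq
import random

def top_n_in_partition(partition, job, n):
    """Local step: the n best (score, profile_id) pairs of one partition for one job."""
    return heapq.nlargest(n, ((scores[job], pid) for pid, scores in partition))

def top_n_sequential(partitions, num_jobs, n):
    """Baseline: loop over the jobs one at a time and merge on the 'driver'."""
    results = {}
    for job in range(num_jobs):                      # sequential over jobs
        local_tops = [top_n_in_partition(p, job, n) for p in partitions]
        # Merge step: all local Top N lists are collected in one place,
        # so the final result is no longer distributed.
        results[job] = heapq.nlargest(
            n, (pair for tops in local_tops for pair in tops)
        )
    return results

# Toy data (hypothetical sizes): every profile has one score per job.
random.seed(0)
NUM_JOBS, NUM_PROFILES, NUM_PARTITIONS, N = 3, 10_000, 8, 50
profiles = [
    (pid, [random.random() for _ in range(NUM_JOBS)])
    for pid in range(NUM_PROFILES)
]
partitions = [profiles[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

top = top_n_sequential(partitions, NUM_JOBS, N)
```

The result is exact, but each job needs its own pass over the data and its own collect, which is the nuisance the slide points out.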

Page 16: Relink - Tech Talk

Tree Digests

[1] Dunning, T. "Computing Extremely Accurate Quantiles Using t-Digests". https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf

Images shamelessly copied from here (thanks Cam Davidson-Pilon!): [2] https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest

Compressing the CDF
Estimate quantile or percentiles with low error
Associative and commutative

“I streamed 8mb of pareto-distributed data into a t-Digest. The resulting size was 5kb, and I could estimate any percentile or quantile desired. Accuracy was on the order of 0.002%.”
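The property that matters for distributed use is that summaries can be merged in any order and grouping. The sketch below illustrates this with a fixed-bin histogram, which is a much simpler stand-in and not a t-digest: it assumes scores in [0, 1) and has uniform resolution, whereas a t-digest adapts its resolution towards the tails. All names are hypothetical.

```python
import random

NUM_BINS = 1000  # assumption: all values lie in [0, 1)

def summarize(values):
    """Compress a list of values into per-bin counts (a coarse CDF)."""
    counts = [0] * NUM_BINS
    for v in values:
        counts[min(int(v * NUM_BINS), NUM_BINS - 1)] += 1
    return counts

def merge(a, b):
    """Merging is element-wise addition, hence associative and commutative."""
    return [x + y for x, y in zip(a, b)]

def quantile(counts, q):
    """Upper edge of the first bin at which the cumulative count reaches q * total."""
    target = q * sum(counts)
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= target:
            return (i + 1) / NUM_BINS
    return 1.0

random.seed(1)
chunks = [[random.random() for _ in range(20_000)] for _ in range(3)]
a, b, c = (summarize(chunk) for chunk in chunks)
all_values = [v for chunk in chunks for v in chunk]

# Order and grouping of merges do not matter, and the merged summary
# equals the summary of the concatenated data.
merged = merge(merge(a, b), c)
same_any_order = merged == merge(a, merge(c, b)) == summarize(all_values)

estimate = quantile(merged, 0.99)
exact = sorted(all_values)[int(0.99 * len(all_values))]
```

Here the estimate can be off by roughly one bin width (1 / NUM_BINS); the accuracy figures quoted on the slide apply to the real t-digest, not to this stand-in.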

[1]

[2]

Page 17: Relink - Tech Talk

Given batch of J job descriptions
Score 5M profiles (all jobs simultaneously)
Compute t-Digests locally on executors
Sum t-Digests
Broadcast t-Digests
Filter partitions where score >= percentile
Approximate Top N remain in partitions

t-Digests are small
Collect on driver is comparatively small
Results remain distributed
Approximates Top N
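The steps on this slide can be sketched end to end in plain Python. Lists again stand in for Spark partitions, and a per-job histogram (a `Counter` of score buckets) stands in for the t-digest, so this is a simplified illustration under the assumption that scores lie in [0, 1), not the implementation from the talk. All function names are hypothetical.

```python
import random
from collections import Counter
from functools import reduce

NUM_BINS = 10_000  # resolution of the stand-in summary

def bucket(score):
    return min(int(score * NUM_BINS), NUM_BINS - 1)

def local_summary(partition, num_jobs):
    """Executor side: one small histogram per job, built in a single pass."""
    hists = [Counter() for _ in range(num_jobs)]
    for _pid, scores in partition:
        for job, score in enumerate(scores):
            hists[job][bucket(score)] += 1
    return hists

def sum_summaries(a, b):
    """Summaries for the same job are added together."""
    return [x + y for x, y in zip(a, b)]

def cutoff_bucket(hist, n):
    """Highest bucket such that at least n scores fall in it or above it."""
    remaining = n
    for b in sorted(hist, reverse=True):
        remaining -= hist[b]
        if remaining <= 0:
            return b
    return 0

def approximate_top_n(partitions, num_jobs, n):
    # Only the small summaries travel to the driver.
    total = reduce(sum_summaries, (local_summary(p, num_jobs) for p in partitions))
    # One cutoff per job; in Spark this would be the broadcast value.
    cutoffs = [cutoff_bucket(total[job], n) for job in range(num_jobs)]
    # Filter inside each partition: the result keeps its partitioning.
    return [
        [
            [(s[job], pid) for pid, s in p if bucket(s[job]) >= cutoffs[job]]
            for p in partitions
        ]
        for job in range(num_jobs)
    ]

# Toy data (hypothetical sizes): every profile has one score per job.
random.seed(2)
NUM_JOBS, NUM_PROFILES, NUM_PARTITIONS, N = 3, 10_000, 8, 50
profiles = [
    (pid, [random.random() for _ in range(NUM_JOBS)])
    for pid in range(NUM_PROFILES)
]
partitions = [profiles[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

result = approximate_top_n(partitions, NUM_JOBS, N)  # result[job][partition]
```

All jobs are handled in one pass over the data, and the filter keeps at least N profiles per job, usually a few more, which is why the slide calls the result approximate.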

Page 18: Relink - Tech Talk

Let’s try it!

Page 19: Relink - Tech Talk

5 jobs simultaneously
~5M scored profiles
N=5000
Two methods:
getTopScoresWithSortAndLimit
getTopScoresWithTDigests

~28 seconds

~8 seconds

Page 20: Relink - Tech Talk

relink

www.relinklabs.com

Thank you!