gitrecruit final 1

GitRecruit: Recruit the right tech talent using GitHub

Insight Data Science Fellow: Yinghan Fu

Uploaded by: yinghan-fu

Posted on 24-Jan-2017

Tech companies compete for talent

• Recruiting is difficult and expensive.

• Companies use GitHub for code repositories.

• GitRecruit can automate talent search.

Algorithm: using README files to find similar repositories

NLTK README

"NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing."

README → tf-idf vector

[Figure: the NLTK README converted to a tf-idf vector. The term labels were lost in extraction; the surviving weights are 1.8, 2.4, 2.4 and 22.]

Other repository READMEs (excerpts, truncated on the slide):

scikit-learn: "Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of ..."

Matplotlib: "Matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python ..."

NumPy: "NumPy is the fundamental package needed for scientific computing with Python. This package contains: a powerful N-dimensional array object . sophisticated (broadcasting) functions ..."

~110,000 repository README files

~70% pull requests

[Figure: tf-idf matrix built from the repository README files. Rows appear to be terms and columns README files; the term labels were lost in extraction. Visible values:]

0    0    0
2.5  2.3  2.8
0    0    0
0    3.1  0
5.4  4.3  6.7
⋮    ⋮    ⋮

Cosine similarity

[Figure: cosine similarity scores for three README vectors: 0.5, 0.3, 0.8. The row label was lost in extraction.]
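The cosine similarity step can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the slides: the vectors are stored as {term: weight} dictionaries, and the repository names, terms and weights below are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_similar(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical tf-idf weights, for illustration only.
query = {"language": 2.4, "python": 1.8, "toolkit": 2.4}
candidates = {
    "repo_a": {"python": 2.5, "learning": 3.1},
    "repo_b": {"language": 2.0, "python": 1.5, "toolkit": 2.0},
    "repo_c": {"plotting": 4.3},
}
ranking = rank_similar(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because cosine similarity depends only on the angle between vectors, a long README and a short README with the same mix of terms score as identical.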

Optimization and benchmark show the search engine is effective

Categories: Database Driver, Audio, Text Processing

Training

• 133 repository README files

• Optimize mean average precision

Testing

• 123 repository README files

• Mean average precision: 0.39 (random: 0.09, p-value < 0.00001)
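Mean average precision (MAP) can be computed as in the sketch below. The slides do not say how relevance was defined or how average precision was normalized, so two assumptions are made here: a result counts as relevant when it belongs to the same category as the query, and each query's average precision is divided by its total number of relevant items. The item names are placeholders.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumption: normalize by all relevant items, found or not.
    return total / len(relevant)


def mean_average_precision(queries):
    """Mean of the average precision over (ranked, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Placeholder data: two queries with their ranked results and relevant sets.
queries = [
    (["a", "x", "b", "y"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "c"], {"c"}),                 # AP = 1/2
]
map_score = mean_average_precision(queries)
print(round(map_score, 4))
```

A score close to 1 means relevant repositories sit at the top of every result list; a random ordering gives a low score, which is what the 0.09 baseline on the slide represents.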

About me, Yinghan Fu

• GitRecruit

– MySQL, Python, NLTK, scikit-learn etc.

• Machine learning

– C++, Python

• Dynamic programming algorithm

– C++

• User-user/item-item collaborative filtering

– Python, mrjob, MongoDB

The search engine can concentrate Python packages from the same category

GitHub activity is highly concentrated

16 million repos

6 million users

2.7 million pull requests merged

0.4 million repos have any pull requests merged
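The concentration can be made concrete with a little arithmetic on the rounded figures above. This is only a back-of-the-envelope check; the variable names are illustrative.

```python
# Rounded figures from the slide.
repos_total = 16_000_000
repos_with_merged_prs = 400_000
merged_prs = 2_700_000

# Share of repositories that have at least one merged pull request.
share_with_merged_prs = repos_with_merged_prs / repos_total

# Average number of merged pull requests among those repositories.
avg_merged_prs = merged_prs / repos_with_merged_prs

print(f"{share_with_merged_prs:.1%} of repos have any merged pull request")
print(f"{avg_merged_prs:.2f} merged pull requests per such repo on average")
```

On these numbers, only about one repository in forty has any merged pull request.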

tf-idf vectorization

$$\mathrm{tf} = 1 + \log(f_{t,d})$$

$$\mathrm{idf} = \log\frac{N}{n_t}$$
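Here f_{t,d} is the number of times term t occurs in document d, N is the number of documents, and n_t is the number of documents that contain t. The sketch below applies these two formulas to tokenized documents. It assumes natural logarithms and takes the weight of a term to be the product tf × idf; the slide specifies neither, and the toy documents are invented. Note that scikit-learn's default idf is a smoothed variant, so its numbers would differ from this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into a list of {term: weight} dicts.

    Uses tf = 1 + log(f_td) and idf = log(N / n_t), with natural logs (an assumption).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (1 + math.log(f)) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        })
    return vectors


# Toy documents, already tokenized.
docs = [
    ["python", "language", "language"],
    ["python", "plotting"],
    ["python", "array"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document, like "python" here, gets idf = log(1) = 0 and so carries no weight, while a term repeated within one README grows only logarithmically.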

Parameters optimized

• Number of words included

• Minimum document frequency

• Maximum document frequency

• Sublinear term frequency

• Cosine similarity

• Maximum n-gram
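One way to search over such parameters is an exhaustive grid, sketched below. The key names match keyword arguments of scikit-learn's TfidfVectorizer (max_features, min_df, max_df, sublinear_tf, ngram_range), but the slides do not confirm that this class was used, and the candidate values are invented. The "Cosine similarity" item is left out because the slide does not say what was varied. The scoring function is a stand-in; in practice it would build the search engine with the given settings and return mean average precision on the training set.

```python
from itertools import product

# Hypothetical candidate values; only the key names come from scikit-learn.
param_grid = {
    "max_features": [5000, 20000],   # number of words included
    "min_df": [1, 5],                # minimum document frequency
    "max_df": [0.5, 0.9],            # maximum document frequency
    "sublinear_tf": [True, False],   # use 1 + log(tf) instead of raw tf
    "ngram_range": [(1, 1), (1, 2)], # maximum n-gram length of 1 or 2
}


def iter_settings(grid):
    """Yield every combination of parameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Return the settings with the highest score, and that score."""
    best_settings, best_score = None, float("-inf")
    for settings in iter_settings(grid):
        score = score_fn(settings)
        if score > best_score:
            best_settings, best_score = settings, score
    return best_settings, best_score


def toy_score(settings):
    """Stand-in for 'mean average precision on the training set'."""
    return settings["min_df"] + (1 if settings["sublinear_tf"] else 0)


best, best_score = grid_search(param_grid, toy_score)
print(best, best_score)
```

With five parameters and two candidate values each, the grid has 32 combinations, which is small enough to evaluate exhaustively on 133 training READMEs.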