gitrecruit final 1
Tech companies compete for talent
• Recruiting is difficult and expensive.
• Companies use GitHub for code repositories.
• GitRecruit can automate talent search.
Algorithm: using README files to find similar repositories
NLTK README:
NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.
[Figure: NLTK README → tf-idf vector (⋮, 1.8, 2.4, 2.4, 22, ⋮); term labels unreadable in this transcript]
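The README-to-vector step can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the tokenizer, the smoothed idf formula and the shortened README strings are all assumptions, and plain Python is used only to keep the sketch self-contained.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of README strings.

    Weights are raw term counts times a smoothed idf,
    ln((1 + N) / (1 + df)) + 1. This is one common convention;
    the slides do not say which weighting was used.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of READMEs containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


# Shortened, hypothetical README texts for illustration only.
readmes = [
    "NLTK is a suite of open source Python modules for natural language processing",
    "NumPy is the fundamental package for scientific computing with Python",
]
vocab, vectors = tfidf_vectors(readmes)
```

Terms that occur in only one README (such as "nltk") receive a higher weight than terms shared by both (such as "python"), which is what lets the vector characterize a repository.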
Corpus: ~110,000 repository README files (~70% pull requests)
Example README excerpts:
• Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of ...
• Matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python ...
• NumPy is the fundamental package needed for scientific computing with Python. This package contains: a powerful N-dimensional array object, sophisticated (broadcasting) functions ...
[Figure: the NLTK README tf-idf vector shown next to the tf-idf matrix for the ~110,000 repository README files (~70% pull requests); row labels unreadable in this transcript. Visible matrix entries:]
⋮
0    0    0
2.5  2.3  2.8
0    0    0
0    3.1  0
5.4  4.3  6.7
⋮    ⋮    ⋮
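One way to compare a query vector with the corpus matrix is cosine similarity. The slides do not name the similarity measure, so cosine is an assumption here. The sketch also assumes each matrix column is one repository. The three columns reuse the entries visible on the slide, while the repository names and the query vector are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(query, corpus):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine(query, vec)) for name, vec in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Columns of the visible matrix entries, read top to bottom.
# The names repo_a, repo_b, repo_c are hypothetical.
corpus = {
    "repo_a": [0, 2.5, 0, 0, 5.4],
    "repo_b": [0, 2.3, 0, 3.1, 4.3],
    "repo_c": [0, 2.8, 0, 0, 6.7],
}
# Hypothetical query vector; not taken from the slides.
query = [0, 1.0, 0, 2.0, 1.0]

ranked = rank_similar(query, corpus)
```

With this query, repo_b ranks first because it is the only column with a nonzero weight in the fourth row, where the query puts most of its weight.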
Optimization and benchmark show the search engine is effective
Categories: Database Driver, Audio, Text Processing
Training
• 133 repository README files
• Optimize mean average precision
Testing
• 123 repository README files
• Mean average precision 0.39 (random 0.09, p-value < 0.00001)
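Mean average precision, the metric reported above, can be computed as in the sketch below. It assumes binary relevance (for example, a result counts as relevant when it shares the query's category, which the slides do not state explicitly) and assumes every relevant item appears somewhere in the ranked list.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance is a list of booleans, best-ranked result first.
    Precision is sampled at each rank that holds a relevant result.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


# Two hypothetical queries: relevant results at ranks 1 and 3, then at rank 2.
example = [[True, False, True], [False, True]]
score = mean_average_precision(example)
```

For the first query the average precision is (1/1 + 2/3) / 2, for the second it is 1/2, so the mean is about 0.67.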
About me, Yinghan Fu
• GitRecruit
– MySQL, Python, NLTK, scikit-learn etc.
• Machine learning
– C++, Python
• Dynamic programming algorithm
– C++
• User-user/item-item collaborative filtering
– Python, mrjob, MongoDB
GitHub activity is highly concentrated
• 16 million repos
• 6 million users
• 2.7 million pull requests merged
• 0.4 million repos have any pull requests merged
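The concentration can be made concrete with a little arithmetic on the rounded figures above. The variable names are illustrative.

```python
# Rounded figures from the slide.
repos_total = 16_000_000
repos_with_merged_pr = 400_000
merged_prs = 2_700_000

# Share of all repositories that have at least one merged pull request.
share_with_merged_pr = repos_with_merged_pr / repos_total

# Average number of merged pull requests among those repositories.
merged_prs_per_active_repo = merged_prs / repos_with_merged_pr

print(f"{share_with_merged_pr:.1%} of repos have a merged pull request")   # 2.5%
print(f"{merged_prs_per_active_repo:.2f} merged pull requests per such repo")  # 6.75
```

So roughly 1 repository in 40 accounts for all merged pull requests.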