Big Data: the weakest link

Vivek Nair, Tim Menzies
{vivekaxl,tim.menzies}@gmail.com
HPCC Eng. Summit - Sept 29, 2015

Where is the weakest link?

Premise of Big Data

Analysis is a “systems” task?

• Better conclusions = same algorithms + more data + more CPU
• If so, then …
  – No role for human error
  – All insight is auto-generated from CPUs.

Analysis is a “human” task?

• Current results on “software analytics”
  – A human-intensive process

Q: Is Big Data a “Systems” or “Human”-task?
A: Yes

Code used in my last paper

(1100 LOC of Python calling scikit-learn)

Use a Higher-Level Language?

• ECL solves this problem?

• But if you can write it quick,
  – you can write it wrong, quick.

Is this really a problem?

• Q: What would we expect to see if…
  – Top experts, publishing in top journals
  – Many of the same data sets
  – 8 years of trying
• A:
  – Perhaps some upward progress
  – Perhaps a little less variance

So, what do we see?

• Software analytics
  – Defect prediction
  – Many of the same learners
  – Many of the same data sets
• 42 papers, top journals
• 23 author groups
• 2002 to 2010
• Y-axis measures mean performance

Martin Shepperd, David Bowes, and Tracy Hall. "Researcher Bias: The Use of Machine Learning in Software Defect Prediction." IEEE Transactions on Software Engineering, 40(6), June 2014.

A little theory

• James D. Herbsleb, CMU
• Socio-Technical Coordination
• A predictor for higher defects:
  – Groups of programmers working on similar functions,
  – but not sharing that expertise

Q: How to find expertise groups within the HPCC community?

A: using data mining

Static features and commit history can act as a cue for expertise

● Our motivation
  o “relation between embodiment and language acquisition by locating the ‘minimal set of necessary features’ that enable language of any kind to be learned” - The Philosophy of Expertise

Software analytics results: learn predictors for expertise

● “...counts of the cumulative number of different developers changing a file over its lifetime can help to improve defect predictions…”[1]

● “Quantify person's experience with a part of code using change history of the code”[2]

● “RevFinder, a file location-based code-reviewer recommendation approach” [3]

● “30% of its code entities has more than 0.3 of similarity with at least one developer vocabulary” [4]
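As a rough illustration of the first kind of feature quoted above (how many different developers have changed a file), here is a minimal Python sketch. The commit records, author names, and the helper `developers_per_file` are hypothetical; the cited papers work from real version-control histories and define their measures in more detail.

```python
from collections import defaultdict


def developers_per_file(commits):
    """Map each file path to the number of distinct authors who changed it.

    `commits` is an iterable of (author, [paths changed]) pairs; this record
    format is an assumption made for the sketch.
    """
    authors = defaultdict(set)
    for author, paths in commits:
        for path in paths:
            authors[path].add(author)
    return {path: len(names) for path, names in authors.items()}


# Made-up commit history, for illustration only.
commits = [
    ("alice", ["a.py", "b.py"]),
    ("bob", ["a.py"]),
    ("alice", ["a.py"]),
    ("carol", ["b.py", "c.py"]),
]

dev_counts = developers_per_file(commits)
print(dev_counts)
```

Repeat commits by the same author count once per file, so the measure tracks how widely a file has been touched, not how often.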

[1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell. "Programmer-based fault prediction." Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 2010.

[2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a quantitative approach to identifying expertise." Proceedings of the 24th international conference on software engineering. ACM, 2002.

[3] Thongtanunam, Patanamon, et al. "Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review." Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.

[4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de Figueiredo. "Using Developers Contributions on Software Vocabularies to Identify Experts." Information Technology-New Generations (ITNG), 2015 12th International Conference on. IEEE, 2015.

Q: And what data mining suite will we use to mine data about programmers?

• A: need you ask?

Source Code

But what are we clustering?

Developer products

• Lightweight parsing of source code
• Developer profiles, accessed via LinkedIn

Languages Used

Skill Set (self promotion)

Data processing

1. GitHub repos (for code), LinkedIn (for years of work)
2. Static code analysis: frequency counts of AST features (e.g. count loops, returns, var comparisons, map, etc.)
3. Bayes classifier
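Step 2 can be sketched as follows. This is only an illustration under assumptions: it uses Python's standard `ast` module on Python source, while the slides do not say which languages or parser the study used. The function name `ast_feature_counts` and the sample code are hypothetical.

```python
import ast
from collections import Counter


def ast_feature_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


# A made-up snippet standing in for one developer's code.
sample = '''
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
'''

counts = ast_feature_counts(sample)

# Pick out the kinds of features the slide mentions: loops, returns, comparisons.
features = {name: counts.get(name, 0) for name in ("For", "While", "Return", "Compare")}
print(features)
```

Each source file then becomes one row of frequency counts, which is the input shape a Bayes classifier expects.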

Early career

Later career

Classification

- Features: Nodes of AST
- Algorithms used: Simple CART, Random Forest, Naive Bayes, etc.
- Can distinguish expert from novice programmers
  • precision = 78% early career
  • precision = 74% later career

* Using Weka
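To make the classification step and the precision figures concrete, here is a small standard-library sketch of a multinomial Naive Bayes over AST feature counts, followed by a precision calculation (true positives divided by all predicted positives). It is not the Weka pipeline used for the results above; the function names, the training rows, and the test rows are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Train on (feature_counts, label) pairs; returns the counts the model needs."""
    label_counts = Counter()
    feature_totals = defaultdict(Counter)
    vocab = set()
    for feats, label in samples:
        label_counts[label] += 1
        feature_totals[label].update(feats)
        vocab.update(feats)
    return label_counts, feature_totals, vocab


def predict_nb(model, feats):
    """Return the label with the highest log-posterior for one feature-count dict."""
    label_counts, feature_totals, vocab = model
    n = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        total = sum(feature_totals[label].values())
        score = math.log(count / n)  # log prior
        for name, k in feats.items():
            if name not in vocab:
                continue  # ignore features never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (feature_totals[label][name] + 1) / (total + len(vocab))
            score += k * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision(actual, predicted, positive):
    """Precision for one class: TP / (TP + FP)."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive and a == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive and a != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up AST feature counts, for illustration only.
train = [
    ({"For": 5, "Return": 1}, "novice"),
    ({"For": 4, "Return": 2}, "novice"),
    ({"Lambda": 4, "Return": 3}, "expert"),
    ({"Lambda": 5, "For": 1}, "expert"),
]
test = [
    ({"Lambda": 3, "Return": 1}, "expert"),
    ({"For": 3}, "novice"),
    ({"For": 2, "Lambda": 1}, "expert"),
]

model = train_nb(train)
actual = [label for _, label in test]
predicted = [predict_nb(model, feats) for feats, _ in test]
print(predicted, precision(actual, predicted, "expert"), precision(actual, predicted, "novice"))
```

Precision is reported per class here, which matches how the slide quotes one figure for early career and one for later career.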

Current status

The good news
• Can auto-find groups of better programmers
• Can do that for very large data sets
  – The ECL advantages

The other news
• Seeking larger data sets
• Talking to HackerRank
• Looking at ways to instrument the HPCC forums
  – Matchmaker tools
  – Affinity groups

We can make that link stronger

Acknowledgements: Thanks to funding from LexisNexis
