Clustering Recurrent and Semantically Cohesive Program Statements in Introductory Programming Assignments

By Marin and Rivero, in CIKM 2019

Presenters: Barış Ardıç, Ege Berkay Gülcan, Mert Kara

Outline

1. Introduction

2. Related Work

3. Methodology

3.1. Approximate Graph Alignment

3.2. Discovering Core Statements

4. Experimental Study

5. Conclusion

Introduction

● There is strong demand for computing education (universities, MOOCs, etc.)

● Universities report 1,000+ students taking introductory programming courses.

● Manual grading is becoming infeasible.

● Alternatives to traditional grading are an emerging research field.

Related Work on Automatic Grading

● Data-Driven Approaches:

Require pre-existing knowledge or constraints such as “a fixed number of variables”.

● Non-Data-Driven Approaches:

Require reference solutions or functional testing.

The Idea

● Core Statement: Recurrent associations, across programs, between program statements with similar semantics.

○ Entails a group of individual program statements that share a higher-level intent.

● Example: s.contains(c) or s.indexOf(c) != -1

● Both would belong to the same core statement.
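
The example on the slide is written in Java. As a minimal sketch of the same idea, the Python analogue below shows two syntactically different statements that express one intent ("does s contain c?"). The function names are hypothetical and exist only for illustration.

```python
# Two syntactically different ways to ask "does s contain c?".
# A core statement would group both, since they share one intent.

def contains_via_membership(s: str, c: str) -> bool:
    # Python analogue of Java's s.contains(c)
    return c in s

def contains_via_index(s: str, c: str) -> bool:
    # Python analogue of Java's s.indexOf(c) != -1
    return s.find(c) != -1

# Hypothetical sample inputs used to compare the two forms.
samples = [("hello", "e"), ("hello", "z"), ("", "a"), ("abc", "bc")]
agree = all(
    contains_via_membership(s, c) == contains_via_index(s, c)
    for s, c in samples
)
```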

Methodology - Approximate Graph Alignment

● Each student program is translated into a program dependence graph

● From the graphs, graphlets are generated

○ Graphlets are small connected non-isomorphic induced subgraphs of a large network

● Topological and semantic similarities between graphlets of the program dependence graphs are calculated to find an alignment 𝜙
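
To make the notion of graphlets concrete, here is a minimal sketch that enumerates the connected induced subgraphs of a given size in a small graph; graphlets are the distinct shapes such subgraphs can take. The toy graph, its node names, and the undirected treatment of edges are assumptions for illustration, not the procedure used in the paper.

```python
from itertools import combinations

def is_connected(nodes, adj):
    """Check whether the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        # Only follow edges whose other endpoint is inside the node subset.
        for y in adj[x] & nodes:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == nodes

def connected_induced_subgraphs(adj, k):
    """Return every k-node subset whose induced subgraph is connected."""
    return [
        frozenset(subset)
        for subset in combinations(sorted(adj), k)
        if is_connected(subset, adj)
    ]

# Hypothetical dependence graph over four statements (edges undirected here).
toy_graph = {
    "s1": {"s2"},
    "s2": {"s1", "s3", "s4"},
    "s3": {"s2"},
    "s4": {"s2"},
}
triples = connected_induced_subgraphs(toy_graph, 3)
```

In this toy graph every connected triple contains s2, because s2 is the only statement linking the others.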

Methodology - Approximate Graph Alignment

● 𝜙(U1) = V1

● 𝜙(U6) = V9

● 𝜙(U7) = V5

● 𝜙(U10) = V8

● 𝜙(U8i) = V6i

● 𝜙(U8c) = V6c

● 𝜙(U8u) = V6u

● 𝜙(U9) = V7

Methodology - Discovering Core Statements

● The approximate graph alignment phase yields G = (V, E), where V is the set of all statements and E is the set of alignments between statements.

● Assumption: Statements that are consistently aligned should have similar semantics.

● Uses the SCAN algorithm [2]
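
As a rough sketch of how G could be assembled, the snippet below merges pairwise alignments into one undirected adjacency structure. The first mapping reuses part of the 𝜙 example from the alignment slide; the second mapping and the node W10 are hypothetical, added only to show alignments from more than one program pair being combined. The function name is also an assumption.

```python
from collections import defaultdict

def build_alignment_graph(alignments):
    """Merge pairwise alignments (dicts u -> phi(u)) into an adjacency map."""
    adj = defaultdict(set)
    for phi in alignments:
        for u, v in phi.items():
            # Each aligned pair becomes an undirected edge in E.
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

# Alignment between programs U and V, taken from the slide example.
phi_uv = {"U1": "V1", "U6": "V9", "U7": "V5", "U10": "V8", "U9": "V7"}
# Hypothetical alignment between program V and a third program W.
phi_vw = {"V9": "W10"}

graph = build_alignment_graph([phi_uv, phi_vw])
```

Only statements that take part in at least one alignment appear in this sketch; isolated statements would have to be added to V separately.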

Methodology - SCAN

● Ɛ: Density threshold, i.e., the connectedness (structural similarity) threshold

● μ: Structure threshold, i.e., the minimum size of each cluster

● NƐ[u]: Ɛ-neighborhood of a node u: the set of nodes that are reachable within K hops of u and at least Ɛ-similar to u

● Similarity is the Jaccard similarity of N[u] and N[v]; v belongs to NƐ[u] when it is at least Ɛ, with Ɛ ∈ (0,1].
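
Written out, the definitions above could read as follows. This is a sketch under two assumptions: N[u] is the closed neighborhood (u itself plus the nodes within K hops), and the threshold comparison is inclusive. The core condition follows the SCAN paper [2].

```latex
\[
\sigma(u, v) = \frac{|N[u] \cap N[v]|}{|N[u] \cup N[v]|},
\qquad
N_{\varepsilon}[u] = \{\, v \in N[u] : \sigma(u, v) \ge \varepsilon \,\},
\qquad
u \text{ is a core} \iff |N_{\varepsilon}[u]| \ge \mu
\]
```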

Methodology - SCAN

Example with μ = 3, Ɛ = ½

● NƐ[v9] = {v9, w10, x8}

● NƐ[x8] = {v9, w10, x8}

● NƐ[w10] = {v9, w10, x8}

● u6 is not included since sim(w10,u6) = 2/5 < ½
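
The clustering step can be sketched in a few lines of Python. This is a simplified illustration, not the implementation from [1] or [2]: it assumes 1-hop closed neighborhoods, Jaccard similarity, and treats every unclustered node as an outlier (SCAN itself also distinguishes hubs). The graph below is hypothetical and is not the one shown on the slide.

```python
from collections import deque

def closed_neighborhood(adj, u):
    # N[u]: u together with its direct neighbors (1-hop assumption).
    return adj[u] | {u}

def jaccard(adj, u, v):
    nu, nv = closed_neighborhood(adj, u), closed_neighborhood(adj, v)
    return len(nu & nv) / len(nu | nv)

def eps_neighborhood(adj, u, eps):
    # Members of N[u] that are at least eps-similar to u.
    return {v for v in closed_neighborhood(adj, u) if jaccard(adj, u, v) >= eps}

def scan(adj, eps, mu):
    """Return {node: cluster_id}; nodes left out are treated as outliers."""
    labels, cluster_id = {}, 0
    for u in adj:
        if u in labels or len(eps_neighborhood(adj, u, eps)) < mu:
            continue  # already clustered, or not a core
        cluster_id += 1
        labels[u] = cluster_id
        queue = deque([u])
        while queue:
            x = queue.popleft()
            neighborhood = eps_neighborhood(adj, x, eps)
            if len(neighborhood) < mu:
                continue  # border node: joins the cluster but does not expand it
            for y in neighborhood:
                if y not in labels:
                    labels[y] = cluster_id
                    queue.append(y)
    return labels

# Hypothetical graph: a 4-clique {a, b, c, d} plus e attached only to d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
labels = scan(adj, eps=0.5, mu=3)
```

Here a, b, c and d end up in one cluster, while e stays unclustered because its similarity to d falls below the threshold.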

Experimental Study

● Dataset: Assignments are mined from CodeChef, a platform to improve programming skills through assignments and contests

● Setup:

○ SCAN algorithm for the discovery of core statements and outliers

○ SourceDG to compute program dependence graphs

Experimental Study - Results

Results with 𝝐 = 0.8 and μ = 5%

Experimental Study - Metrics

● Control dependence distance (CDD)

○ Compares the control edges of two nodes and measures how similar they are

● Data dependence distance (DDD)

○ Compares the data edges of two nodes and measures how similar they are

● Node label distance (NLD)

○ Compares the labels of two nodes and measures how similar they are


CDD, 𝝐 = 0.8 and μ = 5%


DDD, 𝝐 = 0.8 and μ = 5%


NLD, 𝝐 = 0.8 and μ = 5%

Conclusion

● Increases grader productivity

● Pre-existing knowledge is not required

● Applicable for new assignments

● Allows for feedback propagation besides grading

● Provides a high level overview of possible solutions

References

[1] Victor J. Marin and Carlos R. Rivero. 2019. Clustering Recurrent and Semantically Cohesive Program Statements in Introductory Programming Assignments. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). ACM, New York, NY, USA, 911-920. DOI: https://doi.org/10.1145/3357384.3357960

[2] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. 2007. SCAN: A Structural Clustering Algorithm for Networks. In KDD. 824-833.

Q&A
