presentation layout popularity and challenges of …migod/846/current/projects/04...social network...




Popularity and Challenges of Graph Cypher Queries
Sheik Shameer, Shivasurya Sankarapandian (CS846 Fall 2019)

Presentation Layout

• Introduction
• Motivation
• Dataset Details
• Methodology for Data Extraction
• Results and Implications
• Threats to Validity

Introduction

• A graph database is a NoSQL database that uses graph structures, with nodes and edges, for semantic queries.
• Graph databases allow fast retrieval of complex hierarchical structures that are difficult to model in relational database systems.
• They are commonly used in fraud detection analysis, network and database infrastructure monitoring, recommendation engines, social networks, knowledge graphs, and privacy and risk management.

Social Network Experiment: Finding Friends of Friends

Database of 1,000,000 users, searching for 1,000 users


Cypher Queries

• Some of the most popular graph database management systems are Neo4j, Microsoft Azure Cosmos DB, OrientDB, ArangoDB and Virtuoso.
• For this study we look at Neo4j.
• Some of the popular graph query languages are Cypher, SPARQL, GraphQL and Gremlin.
• For this study we look at the Cypher query language.

MATCH (:Movie{ title: 'Wall Street' })<-[:ACTED_IN|:DIRECTED]-(person)
RETURN person.name

Oliver Stone
Michael Douglas
Charlie Sheen
Martin Sheen

Motivation

• Can version control systems provide relevant information on the problems developers face in open source repositories?
• Can we use abstract syntax trees to mine Cypher queries from the repositories?
• Can we start building a corpus of Cypher queries that others can use for further analysis?
• Can the information we gain help others make useful contributions to the open source community?

Research Questions

• RQ1: "What type of graph Cypher queries are popular among developers now?"

• RQ2: "What type of graph Cypher queries do developers have trouble with?"


Dataset

Language     Repositories   Cypher Queries Mined
Java         2579           4159
JavaScript   1212           832
Total        3791           4991

Methodology for data extraction

Dataset: Java and JavaScript GitHub repositories

Two approaches for extracting graph database queries from source code:

• Pattern matching with regular expressions
• Abstract syntax tree (AST) parsing
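The pattern-matching approach can be illustrated with a minimal Python sketch. The slides do not give the regular expressions used in the study, so the keyword list and the `extract_queries` helper below are assumptions made for illustration only: a quoted string literal is treated as a candidate query when it begins with a Cypher clause keyword.

```python
import re

# Assumption: a string literal counts as a Cypher query when it starts with
# one of these clause keywords. The study's real patterns are not shown.
CYPHER_START = r"(?:MATCH|CREATE|MERGE|CALL|UNWIND|WITH)\b"

# A single- or double-quoted literal whose content begins with a keyword.
QUERY_LITERAL = re.compile(
    r"""(["'])\s*(""" + CYPHER_START + r""".*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def extract_queries(source: str) -> list:
    """Return candidate Cypher strings found in the given source text."""
    return [m.group(2).strip() for m in QUERY_LITERAL.finditer(source)]


sample = '''
session.run("MATCH (p:Person) RETURN p.name");
const label = "not a query";
tx.run('CREATE (n:User {id: $id})', {id: 1});
'''
print(extract_queries(sample))
```

A pattern like this is easy to apply across many files, but it cannot follow queries built from variables or concatenation and can produce false positives.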

AST Approach

• Follows the visitor pattern
• Parse source code and modules into a tree representation
• Traverse the tree for Identifier and CallExpression nodes that match official driver method calls
• Extract parameters and variables within the enclosing block
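The steps above can be sketched with Python's standard `ast` module. This is a stand-in, not the study's tooling: the study parsed JavaScript and Java with Esprima and JavaParser, whereas this sketch parses Python source to show the same visitor idea. The method name `run` and the flat variable lookup are assumptions for illustration.

```python
import ast

# Assumption: queries are passed as the first argument to a method named
# `run` (as in `session.run(...)`); the slides do not list the method names.
DRIVER_METHODS = {"run"}


class QueryVisitor(ast.NodeVisitor):
    """Collects string arguments passed to driver-style method calls."""

    def __init__(self):
        self.assignments = {}  # variable name -> string literal
        self.queries = []

    def visit_Assign(self, node):
        # Remember simple `name = "literal"` bindings for later lookup.
        value = node.value
        if isinstance(value, ast.Constant) and isinstance(value.value, str):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.assignments[target.id] = value.value
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        if (isinstance(func, ast.Attribute)
                and func.attr in DRIVER_METHODS and node.args):
            arg = node.args[0]
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                self.queries.append(arg.value)       # inline literal
            elif isinstance(arg, ast.Name) and arg.id in self.assignments:
                self.queries.append(self.assignments[arg.id])  # via variable
        self.generic_visit(node)


source = '''
query = "MATCH (n) RETURN count(n)"
session.run(query)
session.run("CREATE (a:Person {name: $name})", name="Ada")
print("MATCH is only a word here")
'''
visitor = QueryVisitor()
visitor.visit(ast.parse(source))
print(visitor.queries)
```

Unlike a regular expression, the visitor only picks up strings that reach a driver call, so the string passed to `print` is ignored. The variable lookup here ignores scoping, which a real miner would need to handle.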


Mining and Tools

• 1212 JavaScript repositories from GitHub that use Neo4j
• Verify the existence of neo4j-driver in the repository
• Esprima: parses source code and constructs the AST
• esprima-walk: efficiently traverses the AST and filters queries
• NodeGit: fetches commit logs and code changes for the extracted graph database queries
• Shell scripts for automation

Mining and Tools

• 2579 Java repositories mined from GitHub
• javalang Python module for AST parsing
• JavaParser library: we were able to mine the queries with this library
• We looked for official driver methods and the variables used in them
• We were able to mine queries from the same file, totaling around 4159 queries
• Commit messages for 836 Java queries were mined using a combination of git log and grep commands

RQ1: "What type of graph Cypher queries are popular among developers now?"

• CALL, MATCH and CREATE queries were the most popular among developers.

• Cypher queries are therefore predominantly used for creating data, fetching data and calling procedures.
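One simple way to arrive at counts like these is to group each mined query by its leading clause keyword. The slides do not describe how queries were categorized, so the `query_type` helper below is a hypothetical sketch rather than the study's method, and the sample queries are invented.

```python
import re
from collections import Counter


def query_type(query: str) -> str:
    """Return the leading clause keyword of a Cypher query, upper-cased."""
    match = re.match(r"\s*([A-Za-z]+)", query)
    return match.group(1).upper() if match else "UNKNOWN"


# Invented sample queries, for illustration only.
queries = [
    "MATCH (n:Person) RETURN n",
    "create (m:Movie {title: $title})",
    "CALL db.labels()",
    "MATCH (a)-[:KNOWS]->(b) RETURN b",
]
counts = Counter(query_type(q) for q in queries)
print(counts.most_common())
```

Classifying by the first keyword is coarse: a query that starts with MATCH and ends with CREATE is counted only as MATCH.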

Types of Cypher Queries (Java)

Types of Cypher Queries (JavaScript)

Inferences RQ1

• CALL procedures were very popular.

• We also applied tokenization and stemming from NLP to find the most frequently used words in the messages of the commits that created the queries.

• The word "procedure" was used very frequently.

Word Token   Count
Procedure    654
Initial      626
Commit       611
Fixes        388
Neo4j        295
Annotation   259
Change       226
Sparkles     224
Branch       216
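The word counting described above can be sketched as follows. The slides do not name the NLP library or stemmer used, so this sketch substitutes a deliberately naive suffix-stripping function for a real stemmer; the stop-word list and the commit messages are also invented for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, not the study's.
STOP_WORDS = {"the", "a", "an", "for", "to", "of", "and", "in"}


def stem(token: str) -> str:
    """Naive stemmer: strips a few common English suffixes."""
    if token.endswith(("xes", "ches", "shes", "sses")):
        return token[:-2]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def token_counts(messages):
    """Count stemmed, lower-cased word tokens across commit messages."""
    counts = Counter()
    for message in messages:
        for token in re.findall(r"[a-z0-9]+", message.lower()):
            if token not in STOP_WORDS:
                counts[stem(token)] += 1
    return counts


# Invented commit messages, for illustration only.
messages = [
    "Initial commit",
    "Fixes procedure annotation",
    "Added tests for procedures",
]
counts = token_counts(messages)
print(counts.most_common(3))
```

Stemming folds "procedure" and "procedures" into one token, which is what lets a single word rise to the top of the ranking.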


Inferences RQ1

• Neo4j default procedures were used: 516.
• Other procedures worth mentioning came from APOC repositories and machine learning procedures.
• After tokenizing the CALL queries, we also found that users were writing their own procedures.
• Neo4jversioner is a repository that deals with network and database infrastructure; its procedures can also be used by others working in related domains.

Procedure                                     Count
org.neo4j.procedure.simpleArgument            42
org.neo4j.procedure.writingProcedure          40
org.neo4j.procedure.defaultValues             30
org.neo4j.procedure.node                      28
org.neo4j.procedure.integrationTestMe         24
org.neo4j.procedure.schemaProcedure           20
org.neo4j.procedure.genericListWithDefault    18
org.neo4j.procedure.recursiveSum              18
org.neo4j.procedure.sideEffect                16
org.neo4j.procedure.createNode                12

Procedure                                     Count
graph.versioner.diff                          4
graph.versioner.diff.from.current             3
graph.versioner.diff.from.previous            3
graph.versioner.get.all                       2
graph.versioner.get.by.date                   1
graph.versioner.get.by.label                  2
graph.versioner.get.current.path              1
graph.versioner.get.current.state             1
graph.versioner.get.nth.state                 2
graph.versioner.init                          6
graph.versioner.patch                         6
graph.versioner.patch.from                    4
graph.versioner.rollback                      4
graph.versioner.rollback.nth                  2
graph.versioner.rollback.to                   4
graph.versioner.update                        4

Procedure                                     Count
regression.linear.addM                        2
regression.linear.create                      8
regression.linear.delete                      2
regression.linear.info                        3
regression.linear.load                        3
regression.linear.test                        1
regression.linear.train                       3
regression.logistic.add                       1
regression.logistic.delete                    1
regression.linear.add                         2
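Tables like the ones above can be produced by pulling the dotted procedure name out of each CALL clause and counting occurrences. The slides do not show how the CALL queries were tokenized, so the pattern below is an assumption; the procedure names come from the tables, but the argument lists in the sample queries are invented.

```python
import re
from collections import Counter

# Assumption: a procedure name is the dotted identifier that follows CALL.
CALL_PATTERN = re.compile(r"\bCALL\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def procedure_names(queries):
    """Return every procedure name invoked via CALL in the given queries."""
    names = []
    for query in queries:
        names.extend(CALL_PATTERN.findall(query))
    return names


# Procedure names are taken from the tables; arguments are made up.
queries = [
    "CALL graph.versioner.init('Person', {name: 'Ada'})",
    "MATCH (n) CALL graph.versioner.update(n, {age: 36}) YIELD node RETURN node",
    "CALL regression.linear.create('model')",
    "MATCH (n) RETURN n",
]
names = procedure_names(queries)
counts = Counter(names)

# Grouping by everything before the last dot separates built-in,
# library and user-defined procedure namespaces.
namespaces = Counter(name.rsplit(".", 1)[0] for name in names)
print(counts)
print(namespaces)
```

Grouping by namespace is one way to tell default procedures (for example `org.neo4j.procedure`) apart from third-party or user-written ones.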

RQ2

• What type of graph Cypher queries do developers have trouble with?

• The 832 queries extracted from JavaScript and the 4159 from Java were checked for false positives.

• git log was run with the corresponding line numbers and file names to obtain commit information.

• 100 random queries were sampled from JavaScript and Java.
• The code changes and commit information were verified manually.

Refactored Neo4j types of queries - JavaScript Repo Commits for Sample 100 Queries


Refactored Neo4j types of queries - Java Repo Commits for Sample 100 Queries

RQ2 Results

• Transaction, MERGE and MATCH queries have a large number of refactoring changes, whereas other types of queries change infrequently.

• Both rare and common edits were found in MATCH, CALL and CREATE queries, such as:
  • Adding and renaming aliases
  • Adding and removing attributes
  • Adding and removing conditions
  • Adopting new versions of procedures and libraries

Threats to Validity

• We collected JavaScript and Java source code from open source repositories, which may not be representative of all projects.

• Developers may use object-relational mapping or runtime query generation, which static techniques such as AST parsing can miss.

• We generalize our results from the Java and JavaScript repositories we mined. Repositories in other programming languages, such as Python, may provide further insights.