Graph Processing with Apache TinkerPop (incubating)
Jason Plurad
Software Engineer, IBM | Committer, Apache TinkerPop
• Project Update
• Graph Landscape
• A Graph Problem
• Hands-On Graph
http://tinkerpop.apache.org
About Me
• Twitter @pluradj
• GitHub @pluradj
• Open channels
  – TinkerPop mailing lists
  – Titan mailing list
  – Stack Overflow
(Apache) TinkerPop (incubating)
• 2009: Inception
• 2012: TinkerPop 2
• 2015: Apache Incubator
• 2016: Top Level Project?
  – TLP VOTE passed!
  – Waiting on board meeting to establish TLP
Podling Releases
• 3.0
  – Major refactor, Java 8 lambda expressions, Gremlin Server, OLAP graph computers
• 3.1
  – Hadoop 2 support, persisted RDDs
• 3.2
  – OLAP job chaining, OLAP graph filters, performance improvements
Common graph data domains
• Social Network Analysis
• Configuration Management Database
• Master Data Management
• Recommendation Engines
• Knowledge Graphs
• Internet of Things
Property Graph and Gremlin
• Structure
  – Vertex
  – Edge
  – Properties
• Gremlin
  – Domain specific language (DSL) for graph
  – Data flow: forward and backward
  – Traversal Steps
  – Bindings for non-JVM languages
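As a rough illustration of the structure listed above, the sketch below models a property graph with plain JavaScript objects. The helper names (`addVertex`, `addEdge`, `outgoing`, `incoming`) and the sample data are hypothetical and are not part of the TinkerPop API; in Gremlin, the comparable traversal steps would be `out('dependency')` and `in('dependency')`.

```javascript
// Minimal in-memory property graph: a sketch, not the TinkerPop API.
const vertices = new Map(); // id -> { id, label, properties }
const edges = [];           // { label, outId, inId, properties }

// A vertex has an id, a label, and key/value properties.
function addVertex(id, label, properties = {}) {
  const vertex = { id, label, properties };
  vertices.set(id, vertex);
  return vertex;
}

// An edge is directed (outId -> inId), labeled, and can carry properties too.
function addEdge(label, outId, inId, properties = {}) {
  const edge = { label, outId, inId, properties };
  edges.push(edge);
  return edge;
}

// Follow outgoing edges with the given label to the adjacent vertices.
function outgoing(id, edgeLabel) {
  return edges
    .filter((e) => e.outId === id && e.label === edgeLabel)
    .map((e) => vertices.get(e.inId));
}

// Follow incoming edges with the given label back to the source vertices.
function incoming(id, edgeLabel) {
  return edges
    .filter((e) => e.inId === id && e.label === edgeLabel)
    .map((e) => vertices.get(e.outId));
}

// Hypothetical sample data.
addVertex('my-app', 'Package', { version: '1.0.0' });
addVertex('left-pad', 'Package', { license: 'WTFPL' });
addEdge('dependency', 'my-app', 'left-pad', { range: '^1.0.0' });
```

With this data, `outgoing('my-app', 'dependency')` yields the `left-pad` vertex and `incoming('left-pad', 'dependency')` yields `my-app`.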
Apache TinkerPop Graph Computing Framework
Graph Landscape
• Graph database vs Graph processor
  – OLTP vs OLAP
  – Neighborhood vs whole graph
• Multi-model: not the only store in your app
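One way to picture the neighborhood vs whole graph distinction is the sketch below. It uses a hypothetical adjacency list (`dependsOn`) and two illustrative functions: the first answers a question about a single vertex and only reads its immediate neighborhood (the OLTP style), while the second has to visit every vertex and edge to produce a global answer (the OLAP style).

```javascript
// Hypothetical dependency data: package name -> packages it depends on.
const dependsOn = {
  app: ['left-pad', 'express'],
  cli: ['left-pad'],
  'left-pad': [],
  express: [],
};

// OLTP style: start at one vertex and read only its neighborhood.
function directDependencies(name) {
  return dependsOn[name] || [];
}

// OLAP style: scan the whole graph to count dependents of every package.
function dependentCounts() {
  const counts = {};
  for (const name of Object.keys(dependsOn)) counts[name] = 0;
  for (const deps of Object.values(dependsOn)) {
    for (const dep of deps) counts[dep] = (counts[dep] || 0) + 1;
  }
  return counts;
}
```

The cost of `directDependencies` depends on the size of one neighborhood; the cost of `dependentCounts` grows with the size of the entire graph, which is why the second kind of job is handed to a graph processor.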
IBM Graph (Beta)
• Managed Graph-as-a-Service (OLTP)
• Focus on your data, not install and operations
• #sleepMore
http://ibm.biz/IBMGraph
What is this?
module.exports = xxxxxxx;
function xxxxxxx (str, len, ch) {
  str = String(str);
  var i = -1;
  if (!ch && ch !== 0) ch = ' ';
  len = len - str.length;
  while (++i < len) {
    str = ch + str;
  }
  return str;
}
A Graph Problem: Dependency Management
• On March 22, 2016, npm broke the Internet
• Left-pad was unpublished
  – 11 lines of code
  – WTFPL license
  – Hundreds of breaking builds per minute
  – http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
• Are we safe with Apache?
Questions for the graph
• Which dependencies are at risk?
• Which ones should be refactored to avoid?
• Risk factors
  – Unsuitable license
  – Single developer
  – Too little code / Too much code
  – Changes too frequently / Code is stagnant
  – Nobody else is using it
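The risk factors above could be turned into a simple check per package, as in the sketch below. Everything specific here is an assumption made for illustration: the field names on `pkg`, the numeric thresholds, and the license policy passed in by the caller. None of it comes from the talk.

```javascript
// Returns the list of risk factors that apply to one package.
// Thresholds and field names are illustrative assumptions, not recommendations.
function riskFactors(pkg, unsuitableLicenses) {
  const risks = [];
  if (!pkg.license || unsuitableLicenses.has(pkg.license)) {
    risks.push('unsuitable license');
  }
  if (pkg.maintainers.length === 1) risks.push('single developer');
  if (pkg.linesOfCode < 20) risks.push('too little code');
  if (pkg.linesOfCode > 100000) risks.push('too much code');
  if (pkg.releasesLastMonth > 10) risks.push('changes too frequently');
  if (pkg.daysSinceLastRelease > 730) risks.push('code is stagnant');
  if (pkg.dependents === 0) risks.push('nobody else is using it');
  return risks;
}

// Hypothetical policy and package record.
const policy = new Set(['GPL-3.0', 'AGPL-3.0']);
const tiny = {
  name: 'tiny-helper',
  license: 'GPL-3.0',
  maintainers: ['alice'],
  linesOfCode: 11,
  releasesLastMonth: 0,
  daysSinceLastRelease: 30,
  dependents: 0,
};
const tinyRisks = riskFactors(tiny, policy);
```

In a graph, several of these inputs (number of maintainers, number of dependents) fall out of counting edges around the package vertex, which is what makes the questions a good fit for traversals.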
Let’s go for a ride!
Titan (Aurelius)
• Pick a graph database for OLTP…
  – Apache license but not in ASF
• Code has stagnated in the open
  – DataStax Enterprise (DSE) Graph
  – Wide open opportunities
    • Genesis Graph is up next!
    • Apache S2Graph (incubating)
    • Apache Flink (Gelly)
    • Apache Solr (Graph Query)
Apache Spark or Apache Giraph
• Pick a graph processor for OLAP…
  – Spark is the new hotness
  – Giraph is better suited for gigantic graphs
• By using Apache TinkerPop and Gremlin, we can use either one seamlessly
Vagrant and VirtualBox
• Developers don’t always get keys to the cloud
• Virtual machines to the rescue
  – Host: 16 GB RAM or more
  – 3-4 VMs with 3 GB RAM
• Prove out your graph algorithms on a small data set before wasting time on a big data set
Apache Ambari
• Simple install for Apache Hadoop and related Apache big data packages
  – HDFS, YARN, MapReduce, HBase, Spark, etc.
• Management and monitoring dashboard
• Enables integration of other software
Getting the data
• NPM registry runs on Apache CouchDB
• Replication in Apache CouchDB is awesome
  – https://skimdb.npmjs.com/registry
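CouchDB replication is started by posting a JSON document with `source` and `target` fields to the `_replicate` endpoint of the receiving server. The sketch below only builds such a request and does not send it. The local server URL and the target database name are assumptions, the helper name `buildReplicationRequest` is hypothetical, and whether the registry URL from the slide still accepts replication has not been verified.

```javascript
// Builds (but does not send) a request for CouchDB's POST /_replicate endpoint.
function buildReplicationRequest(localCouchUrl, source, target) {
  return {
    url: localCouchUrl.replace(/\/+$/, '') + '/_replicate',
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // create_target asks CouchDB to create the target database if it is missing.
    body: JSON.stringify({ source, target, create_target: true }),
  };
}

// Assumed local CouchDB on the default port and a local database named "registry".
const replicationRequest = buildReplicationRequest(
  'http://localhost:5984/',
  'https://skimdb.npmjs.com/registry',
  'registry'
);
```

The resulting object can be handed to any HTTP client; a full registry copy is large, so trying it against a small test database first fits the "small data set" advice from the Vagrant slide.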
Transform the data
• Apache CouchDB is a document store
• Dependencies are graph data
• Other things can be too
  – Users
  – Keywords
  – License
• Graph model depends on the questions you want to ask of the graph
NPM Graph Schema
[Schema diagram]
• Vertex labels (with counts): Document 250K, Package 1.5M, Keyword 81K, License 2K, Person 125K
• Edge labels: license, dependency, devDependency
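To make the document-to-graph transformation concrete, the sketch below turns one simplified package document into vertices and edges using the `Package` and `License` vertex labels and the `license`, `dependency`, and `devDependency` edge labels from the schema. The document shape, the id scheme, the helper name `toGraphElements`, and the choice of which vertices each edge connects are all assumptions; the real registry documents are richer, and the slide does not spell out the edge endpoints.

```javascript
// Converts one simplified package document into graph elements.
// Sketch only: a real loader would also de-duplicate vertices by id.
function toGraphElements(doc) {
  const packageId = `package:${doc.name}`;
  const vertices = [{ id: packageId, label: 'Package', name: doc.name }];
  const edges = [];

  if (doc.license) {
    const licenseId = `license:${doc.license}`;
    vertices.push({ id: licenseId, label: 'License', name: doc.license });
    edges.push({ label: 'license', from: packageId, to: licenseId });
  }

  // Each key of dependencies/devDependencies names another package.
  const fields = [
    ['dependencies', 'dependency'],
    ['devDependencies', 'devDependency'],
  ];
  for (const [field, edgeLabel] of fields) {
    for (const depName of Object.keys(doc[field] || {})) {
      const depId = `package:${depName}`;
      vertices.push({ id: depId, label: 'Package', name: depName });
      edges.push({ label: edgeLabel, from: packageId, to: depId });
    }
  }
  return { vertices, edges };
}

// Hypothetical input document.
const elements = toGraphElements({
  name: 'my-app',
  license: 'MIT',
  dependencies: { 'left-pad': '^1.0.0' },
  devDependencies: { mocha: '^2.4.5' },
});
```

For this input the function produces four vertices and three edges, one per edge label.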
Hands-On: Gremlin Console
https://asciinema.org/a/21qk1rn9yt6tt7sour9w9ynxn
The Graph Computer
Anatomy of a VertexProgram
• Vertex-centric graph logic
• Parallel execution (BSP)
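The vertex-centric, bulk synchronous parallel (BSP) idea can be sketched with a small PageRank loop. This is a sequential simulation written for illustration and is not TinkerPop's PageRankVertexProgram: the function name, the damping factor of 0.85, and the sample graph are assumptions, every edge target must also appear as a key in `adjacency`, and rank from vertices without outgoing edges is simply dropped rather than redistributed.

```javascript
// Simulated BSP PageRank: each superstep has a message phase and a compute phase.
function pageRank(adjacency, iterations = 20, damping = 0.85) {
  const ids = Object.keys(adjacency);
  const n = ids.length;
  let rank = {};
  for (const id of ids) rank[id] = 1 / n;

  for (let step = 0; step < iterations; step++) {
    // Message phase: every vertex sends rank / outDegree along its out-edges.
    const inbox = {};
    for (const id of ids) inbox[id] = 0;
    for (const id of ids) {
      const targets = adjacency[id];
      for (const target of targets) inbox[target] += rank[id] / targets.length;
    }
    // Barrier: all messages are delivered before any vertex updates.
    // Compute phase: each vertex updates using only its own inbox.
    const next = {};
    for (const id of ids) next[id] = (1 - damping) / n + damping * inbox[id];
    rank = next;
  }
  return rank;
}

// Hypothetical graph: two packages depend on left-pad.
const ranks = pageRank({ app: ['left-pad'], cli: ['left-pad'], 'left-pad': [] });
```

Because each vertex reads only its own inbox during the compute phase, the per-vertex work inside a superstep can run in parallel across workers, which is the property a graph computer exploits.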
Out of the box VertexPrograms
• Traversal
• BulkLoader
• BulkDumper
• PageRank
• PeerPressure
Hands-On: Graph Program
OLAP Traversal Sources
> graph = GraphFactory.open('conf/npmgraph-olap.properties')
> g = graph.traversal().withComputer(SparkGraphComputer)
> g = graph.traversal().withComputer(GiraphGraphComputer)
Graph Statistics via TraversalVertexProgram
> g.V().count() // vertex count
> g.E().count() // edge count
> g.V().label().groupCount() // vertex label distribution
> g.E().label().groupCount() // edge label distribution
> g.V().properties().key().groupCount() // vertex property distribution
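For readers new to `groupCount()`, the sketch below shows the kind of result the label-distribution queries produce: a map from each distinct value to the number of times it occurs. The helper and the sample labels are hypothetical and stand in for the stream of labels a traversal would emit.

```javascript
// Counts how many times each distinct value occurs in a sequence.
function groupCount(values) {
  const counts = new Map();
  for (const value of values) counts.set(value, (counts.get(value) || 0) + 1);
  return counts;
}

// Hypothetical stream of vertex labels.
const labelCounts = groupCount(['Package', 'Package', 'License', 'Person']);
```

Here `labelCounts` maps `Package` to 2 and each of `License` and `Person` to 1.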
Next stop? More data!
• Graphs are for connecting data!
• Consume data from GitHub
  – User data
  – Static code analysis
  – Code usage analysis
• Consume data from Twitter
  – Trending news
  – Security alerts
Summary
• Apache TinkerPop is for graph computing
• OLTP vs OLAP is an important distinction
  – Gremlin allows you to seamlessly bridge the two
• Graph thinking is different than relational
  – Is the future multi-model?
• Many opportunities to innovate in this space
Acknowledgements
• Marko Rodriguez
  – Gremlin language, Gremlin OLAP
• Ketrina Yim
  – Illustrator, creator of Gremlin and friends
• Stephen Mallette
  – TinkerPop release manager, Gremlin applications
• Daniel Kuppitz
  – Gremlin language guru
• David Robinson
  – Big data, multi-model architect/developer
Questions?
Thank you!