Data Analytics: From Conceptual Modelling to Logical Representation
Qing Wang and Minjian Liu
Research School of Computer Science, Australian National University, Canberra, Australia
{qing.wang,minjian.liu}@anu.edu.au
Abstract. In recent years, data analytics has been studied in a broad range of areas, such as health-care, social sciences, and commerce. To accurately capture user requirements and enhance communication between analysts, domain experts and users, it is increasingly important to conceptualise data analytics tasks at a high level of modelling abstraction. In this paper, we discuss the modelling of data analytics and how a conceptual framework for data analytics applications can be transformed into a logical framework that supports a simple yet expressive query language for specifying data analytics tasks. We have also implemented our modelling method in a unified data analytics platform, which allows analytics algorithms to be incorporated as plug-ins in a flexible and open manner. We present case studies on three real-world data analytics applications and our experimental results on this unified data analytics platform.
Keywords: Data analytics · Conceptual modelling · Logical model · Query language
1 Introduction
Data analytics is rapidly growing in popularity, with a variety of applications in many areas, e.g., health-care, social sciences and commerce. This has led to the recent development of a large number of data analytics tools and systems, most of which are built upon graph models, such as GraphLab [11] and Pregel [12]. Nonetheless, in practice, many data analytics applications are still conducted in an ad-hoc way, due to the lack of general principles to design, develop and implement data analytics applications. For example, the decision on choosing data models for data analytics applications often relies on individuals' own expertise, rather than a systematic consideration of requirements. This calls for a formal design paradigm that can provide a high level of modelling abstraction to support users in understanding their data analytics requirements. In particular, with the increasing complexity of data analytics applications, there is a pressing need to explicitly represent data analytics requirements in a conceptual model [6].

Q. Wang and M. Liu contributed equally to this work.
© Springer International Publishing AG 2016. I. Comyn-Wattiau et al. (Eds.): ER 2016, LNCS 9974, pp. 415–429, 2016. DOI: 10.1007/978-3-319-46397-1_32
Recently, several methods for conceptually modelling data analytics applications have been reported [2,15,16]. A conceptual modelling paradigm for network analytics applications, called the Network Analytics ER model (NAER), was proposed in [15]. In a nutshell, the NAER model extends the concepts of the traditional ER models [4] in three aspects: (a) the structural aspect - analytical entity and relationship types are added to represent first-class entities and relationships from the data analytics perspective; (b) the manipulation aspect - topological constructs are added to explicitly represent different topological structures of interest; and (c) the integrity aspect - constraints are added for governing integrity among different data analytics tasks. Based on this conceptual modelling paradigm, a set of design guidelines has further been provided in [15], through which users can benefit from establishing a conceptual framework that provides a coherent and comprehensive view on data analytics applications. As depicted in Fig. 1, such a conceptual framework may consist of a core schema, which has basic entity and relationship types to capture data requirements as in the traditional ER modelling, topology schemas, which have analytical entity and relationship types to capture query requirements of data analytics applications, and query topics, which describe the structure of queries in query requirements and are associated with both the core schema and the topology schemas.
Fig. 1. A general process of modelling data analytics applications
Nonetheless, how can we transform such a conceptual framework into a logical framework which is well suited to model the logical structure of data analytics applications without ambiguities? Although basic entity and relationship types in the core schema can be easily transformed into relation schemas following the existing rules [14], it is not yet clear: (1) How can analytical entity and relationship types be accurately defined at the logical level? (2) What logical structure can topology schemas be translated into? (3) How can topological constructs be specified using a query language, ideally in a declarative way? These questions are left unanswered in the previous works [15,16]. This paper aims to answer these questions by exploring the connections between such a conceptual framework for data analytics and its corresponding logical representation.
Contributions. We have the following contributions in this paper:
– We discuss how a conceptual framework for data analytics as introduced in [15] can be effectively transformed into a logical framework.
– We introduce a novel query language for data analytics, which extends SQL with the ability to query topological properties of interest.
– We have implemented our modelling method in a unified data analytics platform, which allows analytics algorithms to be incorporated as plug-ins in a flexible and open manner.
– We present three real-world data analytics applications to illustrate the expressive power and simplicity of our modelling method, and the experimental results of evaluating the performance of our data analytics platform.
Outline. In the following, Sect. 2 discusses the modelling of data analytics and Sect. 3 introduces our query language for data analytics. We discuss three data analytics applications in Sect. 4, and present our experimental results in Sect. 5. The paper is concluded in Sect. 6.
2 Modelling Data Analytics
In this section, we discuss data analytics from a modelling perspective. In practice, many organisations face the challenge of managing data analytics tasks in a complex environment, and modelling techniques bring several advantages in addressing this challenge, including: enhancing communication among multiple stakeholders, understanding connections among complex analysis requirements, and detecting design flaws early, before any code is implemented.
We first recall the Network Analytics ER (NAER) modelling method [15], then elaborate on the transformation from a conceptual model into the logical representation for data analytics applications.
Generally, the NAER modelling method supports two kinds of entities and relationships [15]: (1) base entities and relationships which specify first-class entities and relationships that should be stored in a database system from a data management perspective, as in the traditional ER modelling; (2) analytical entities and relationships which specify first-class entities and relationships used for the data analytics purpose. In the NAER model, base types are the ground from which analytical types can be derived, and the base types that define an analytical type are called the support of the analytical type.
To conceptualise data analytics tasks, not only data requirements (i.e., what kind of data is needed) but also query requirements (i.e., what kinds of queries are used) are considered in the conceptual modelling process. Base entity and relationship types are used to capture data requirements, leading to a core schema, while analytical entity and relationship types are used to capture query requirements, which yields a number of topology schemas. That is, a core schema contains a set of base types, each topology schema contains a set of analytical types, and the support of each analytical type in a topology schema is a subset of the base types in the core schema. In general, each conceptual framework for data analytics applications contains a core schema which is relatively large, and a number of topology schemas which are often small. Although small, topology schemas can be flexibly composed into larger schemas if needed [16].
After a conceptual framework has been established as previously discussed, the question that arises is: how can such a conceptual framework be transformed into a logical framework? Although, in principle, it is possible to choose any logical data model, e.g., the relational data model, a graph model or a combination of several data models, data in many real-world applications is stored in relational databases. Moreover, data analytics tasks often require sophisticated analysis on both relational and topological properties of data. For these reasons, we develop the following data model at the logical level:
– Transform basic entities and relationships in the core schema into a set of relations for storage, as in the traditional ER modelling approach [14];
– Transform analytical entities and relationships in the topology schemas into a set of entity-relationship (ER) graphs for analytics [9]. That is, one topology schema corresponds to one type of ER graph in which each vertex represents an analytical entity and each edge between two vertices represents an analytical relationship between two analytical entities.
Fig. 2. A hybrid data model at the logical level. [Figure labels: Core schema, Relation schemas, Relation; Topology schemas, Graph schemas, ER-graph; Graph mapper between relations and ER-graphs.]
Figure 2 illustrates a hybrid data model, in which a collection of ER graphs is constructed on top of relations through graph mappers either on the fly or in a materialized manner, as will be formally defined in Sect. 3. Accordingly, the core schema and topology schemas in a conceptual model are transformed into a set of relation schemas and graph schemas in a logical model. In practice, such a hybrid data model can be easily built by applying the above transformation rules to a conceptual model that describes data analytics tasks. Since data analytics tasks often require additional querying capability over graphs, for example, finding paths, detecting communities, clustering and ranking, implementing such a hybrid data model at the logical level requires a query language that can support joint analytics of relations and graphs.
3 A Query Language for Data Analytics
We present a SQL-like query language for data analytics, called RG-SQL, which extends the standard SQL with new features to facilitate joint analytics of relations and graphs. More specifically, RG-SQL provides data definition statements that can create graphs from relations in a flexible way, and data manipulation statements to conduct various data analytics operations over relations and graphs. In the following, we explain these new features of RG-SQL in detail.
Creating graphs. RG-SQL can create two types of graphs: undirected graphs and directed graphs, through the specification on graph types using UNGRAPH and DIGRAPH, respectively. Graphs can be created either on the fly or in a materialized manner with the following syntax:
– Graphs on the fly
    SELECT <attribute list>
    FROM <relations | graphs>
    WHERE <graph name> IS <graph type> AS (graph mapper);
– Materialized graphs
    CREATE <graph type> <graph name> AS (graph mapper);
where <graph type> := UNGRAPH | DIGRAPH, and a graph mapper is a SQL query that extracts an edge list (i.e., a list of edges of a graph, which is a common data structure for representing a graph) from relations in the underlying databases for graph construction.
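As a minimal sketch of what a graph mapper produces, the following Python snippet runs an ordinary SQL query over a toy relation and turns the resulting edge list into an undirected adjacency structure. SQLite is used here only as a stand-in for the underlying relational database; the helper name build_ungraph, the table name write_rel (standing in for WRITE) and the sample rows are hypothetical and are not part of RG-SQL.

```python
import sqlite3
from collections import defaultdict

def build_ungraph(conn, graph_mapper_sql):
    """Run a graph mapper (a SQL query returning two columns) and
    build an undirected adjacency map from the resulting edge list."""
    adjacency = defaultdict(set)
    for source, target in conn.execute(graph_mapper_sql):
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)

# Toy relation: author aid wrote paper pid (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE write_rel (aid INTEGER, pid INTEGER)")
conn.executemany("INSERT INTO write_rel VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 11), (2, 11)])

# The graph mapper: a plain SQL query that extracts an edge list.
mapper = """
    SELECT w1.aid, w2.aid AS coaid
    FROM write_rel AS w1, write_rel AS w2
    WHERE w1.aid != w2.aid AND w1.pid = w2.pid
"""
coauthorship = build_ungraph(conn, mapper)
```

In this sketch a materialized graph would correspond to keeping the coauthorship structure around for later queries, whereas a graph on the fly would be rebuilt from the mapper each time a query needs it.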
Ranking. To assess the importance of vertices within a graph, RG-SQL provides a RANK operator with the following syntax:
    RANK( <graph name>, <measure>)
    <measure> := degree | indegree | outdegree | betweenness | closeness | pagerank
A number of measures are available for determining the importance of vertices [3]. One may choose the most suitable measure for a specific query based on the type of the graph and desired properties. Each RANK( <graph name>, <measure>) yields a relation with two attributes: vertexid and value.
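To illustrate the shape of the relation that a ranking operation returns, the sketch below computes the three degree-based measures over an edge list and emits (vertexid, value) rows. It is an illustration only, not the RANK implementation: the function name rank, the convention that degree is the sum of indegree and outdegree for a directed graph, and the descending sort order are assumptions of this example.

```python
from collections import Counter

def rank(edges, measure="degree"):
    """Return (vertexid, value) rows, highest value first.
    `edges` is a list of (source, target) pairs of a directed graph."""
    if measure not in ("degree", "indegree", "outdegree"):
        raise ValueError("unsupported measure: " + measure)
    counts = Counter()
    for source, target in edges:
        # Touch both endpoints so that every vertex gets a row.
        counts[source] += 0
        counts[target] += 0
        if measure in ("degree", "outdegree"):
            counts[source] += 1
        if measure in ("degree", "indegree"):
            counts[target] += 1
    # Sort by value (descending), then by vertex id for a stable order.
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))

edges = [("a", "b"), ("c", "b"), ("b", "d")]
by_indegree = rank(edges, "indegree")
by_degree = rank(edges, "degree")
```

Because the result is an ordinary two-column relation, it can be filtered, joined and ordered like any other table.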
Clustering. To explore the clustering structure of vertices over a graph, RG-SQL provides a CLUSTER operator with the following syntax:
    CLUSTER( <graph name>, <algorithm>)
    <algorithm> := CC | SCC | GN | CNM | MC
where CC refers to an algorithm for finding connected components, SCC to an algorithm for finding strongly connected components, and GN, CNM and MC to three algorithms for community detection, which respectively correspond to the Girvan-Newman algorithm [7], the Clauset-Newman-Moore algorithm [5] and Peixoto's modified Monte Carlo algorithm [13]. Each CLUSTER( <graph name>, <algorithm>) yields a relation with three attributes: clusterid, size and members.
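As a sketch of the simplest of these algorithms, the snippet below finds connected components of an undirected graph by depth-first traversal and returns rows with the three attributes clusterid, size and members. The function name cluster_cc, the numbering of clusters from 1 and the ordering by decreasing size are assumptions made for this illustration.

```python
from collections import defaultdict

def cluster_cc(edges):
    """Return (clusterid, size, members) rows for the connected
    components of an undirected graph given as an edge list."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting one component.
        seen.add(start)
        stack, members = [start], []
        while stack:
            vertex = stack.pop()
            members.append(vertex)
            for neighbour in adjacency[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(members))

    # Largest cluster first; ties broken by member list.
    components.sort(key=lambda m: (-len(m), m))
    return [(cid, len(m), m) for cid, m in enumerate(components, start=1)]

clusters = cluster_cc([(1, 2), (2, 3), (4, 5)])
```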
Path finding. To find paths among two or more vertices, RG-SQL provides a PATH operator with the following syntax:
    PATH( <graph name>, <path expression>)
    <path expression> := . | V | <path expression>/<path expression> | <path expression>//<path expression>
where V is a vertex expression that imposes a certain condition on the vertices of a path, . is a do-not-care symbol indicating that any vertex is allowed in its position, / represents one edge, and // represents any number of edges. A path expression is valid if it contains a vertex expression in the first and last positions. For example, an expression V1//V2 specifies a path between two vertices V1 and V2, regardless of the length of the path. Each PATH( <graph name>, <path expression>) yields a relation with three attributes: pathid, length and path.
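For intuition, the shortest match of an expression of the form V1//V2 can be found by breadth-first search. The sketch below assumes that V1 and V2 each resolve to a single vertex and that the graph is given as an adjacency map; the function name shortest_path and the sample graph are hypothetical, and the sketch does not cover the general path expression grammar.

```python
from collections import deque

def shortest_path(adjacency, source, target):
    """Breadth-first search; returns the vertex list of one shortest
    path from source to target, or None if target is unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            # Walk the parent links back to the source.
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = parents[vertex]
            return path[::-1]
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in parents:
                parents[neighbour] = vertex
                queue.append(neighbour)
    return None

graph = {"v1": ["a", "b"], "a": ["v2"], "b": ["c"], "c": ["v2"]}
path = shortest_path(graph, "v1", "v2")
row = (1, len(path) - 1, path)  # (pathid, length, path)
```

The length attribute counts edges, so a path visiting three vertices has length 2.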
3.1 Discussion
We now briefly discuss the expressive power of RG-SQL in comparison with the relational query language SQL and the graph query language Cypher used in Neo4j (http://neo4j.com). Since RG-SQL extends the standard SQL with additional operations, such as ranking, clustering and path finding, RG-SQL is strictly more expressive than SQL and has expressive power beyond first-order logic [1]. For example, recursion in a path finding expression V1//V2 cannot be expressed by SQL but can be expressed by RG-SQL. Cypher is a query language designed to express graph patterns, which can nonetheless be expressed by RG-SQL or its variations through a combination of path finding operations. However, not all operations of RG-SQL can be expressed by Cypher, e.g., ranking operations using betweenness and clustering operations using GN.
4 Data Analytics Applications
In this section, we study data analytics tasks in three real-world applications and explain how data analytics requirements can be conceptualized in our work.
4.1 ACM Digital Library
ACM Digital Library (http://dl.acm.org/) is a bibliographical network containing a collection of articles, authors, and publishers. Each article is written by one or more authors, one article may cite a number of other articles, and articles are included in conference proceedings or journals published by publishers. Figure 3 depicts a conceptual schema for this data analytics application, which includes the topology schemas Sa1 and Sa2 required by the following queries:
Q1: [Collaborative communities] Find the communities that consist of authors who collaborate with each other to publish articles together.
Q2: [Influential articles] Find the top 3 most influential articles.
For Q1, we may use RG-SQL to create a materialized coauthorship graph over Sa1, then find the collaborative communities in the coauthorship graph by applying the MC algorithm in CLUSTER.
    CREATE UNGRAPH coauthorship AS (
        SELECT w1.aid, w2.aid AS coaid
        FROM WRITE AS w1, WRITE AS w2
        WHERE w1.aid != w2.aid AND w1.pid = w2.pid);

    SELECT clusterid, size, members
    FROM CLUSTER(coauthorship, MC);
For Q2, we may create a citation graph over Sa2 on the fly and then find influential articles in the citation graph using the measure betweenness.
    SELECT vertexid, value
    FROM RANK(citation, betweenness)
    WHERE citation IS DIGRAPH AS (SELECT aid, citedaid FROM CITE)
    LIMIT 3;
Fig. 3. A conceptual schema for ACM Digital Library. [Figure labels: Core Schema with ARTICLE, AUTHOR, JOURNAL, PROCEEDINGS, PUBLISHER, WRITE, CITE, PUBLISH, PUBLISHED_BY; Topology Schemas Sa1, Sa2, Sa3 with COAUTHORSHIP, CITATION, COCITATION, AUTHOR*, ARTICLE*, JOURNAL*.]
4.2 Twitter
Twitter (https://twitter.com/) is a social network which enables users to post tweets. Users may follow one another. A tweet can mention one or more users and be labelled by one or more tags. Figure 4 depicts a conceptual schema for data analytics in Twitter. Typical data analytics tasks in Twitter include analysing how users follow each other and finding the most followed people, as described by the following queries:
Q3: [Shortest path] Find the shortest path between Jack and Max.
Q4: [Most followed people] Find the most followed people who have posted at least one tweet about ANU.
First, a graph named following is created over the topology schema St1, based on entities of user* and their relationships in following. Then, for Q3, we may find the shortest path between Jack and Max using the RG-SQL query below:
    SELECT *
    FROM PATH(following, v1//v2)
    WHERE v1 AS (SELECT uid FROM USER WHERE name = 'Jack')
      AND v2 AS (SELECT uid FROM USER WHERE name = 'Max')
    ORDER BY length ASC LIMIT 1;
For Q4, we need to find not only the most followed people in the following graph but also, from the relations over the core schema, the people who have posted a tweet tagged by @ANU, as illustrated by the following RG-SQL query.
    SELECT uid, value
    FROM RANK(following, pagerank) AS p1, POST AS p2, LABELLED_BY AS l
    WHERE p1.vertexid=p2.uid AND p2.twid=l.tid AND l.label='ANU'
    ORDER BY value DESC;
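The combination of a graph ranking with relational joins used for Q4 can be imitated in a few lines of Python. The sketch below is not how Rogas evaluates the query: it computes PageRank with a plain power iteration, stores the scores in an ordinary table, and lets SQLite perform the join. The column names of follow, the table name rank_result, the damping factor, the iteration count and all sample rows are assumptions made for illustration.

```python
import sqlite3

def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a directed edge list."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for source, target in edges:
        out_links[source].append(target)
    n = len(vertices)
    score = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Mass held by vertices without outgoing edges is spread evenly.
        dangling = sum(score[v] for v in vertices if not out_links[v])
        updated = {v: (1 - damping) / n + damping * dangling / n
                   for v in vertices}
        for v in vertices:
            for target in out_links[v]:
                updated[target] += damping * score[v] / len(out_links[v])
        score = updated
    return score

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follow (follower INTEGER, followee INTEGER);
    CREATE TABLE post (uid INTEGER, twid INTEGER);
    CREATE TABLE labelled_by (tid INTEGER, label TEXT);
    CREATE TABLE rank_result (vertexid INTEGER, value REAL);
    INSERT INTO follow VALUES (2, 1), (3, 1), (4, 1), (4, 3), (1, 2);
    INSERT INTO post VALUES (1, 100), (3, 101), (2, 102);
    INSERT INTO labelled_by VALUES (100, 'ANU'), (101, 'ANU'), (102, 'ER');
""")

# Graph side: rank the vertices of the following graph.
edges = conn.execute("SELECT follower, followee FROM follow").fetchall()
conn.executemany("INSERT INTO rank_result VALUES (?, ?)",
                 pagerank(edges).items())

# Relational side: keep only users with a tweet labelled 'ANU'.
rows = conn.execute("""
    SELECT DISTINCT r.vertexid, r.value
    FROM rank_result AS r, post AS p, labelled_by AS l
    WHERE r.vertexid = p.uid AND p.twid = l.tid AND l.label = 'ANU'
    ORDER BY r.value DESC
""").fetchall()
most_followed = [vertexid for vertexid, _ in rows]
```

In this toy data, user 2 has a high score but no tweet labelled ANU, so the join removes that user from the answer; this is the kind of interplay between graph and relational conditions that Q4 relies on.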
4.3 Stack Overflow
Stack Overflow (http://stackoverflow.com/) is a collaboratively edited question and answer site for programmers. Users may ask questions or post answers. A question may have zero or more answers and be labelled by tags. For each question, one answer can be accepted as the accepted answer. A conceptual schema for data analytics in Stack Overflow is presented in Fig. 5.
Q5: [Python experts] Find the top 10 Python experts in Stack Overflow (i.e., users who often answer Python questions and whose answers are often accepted).
Q6: [Most influential expert] Find the influential expert in Stack Overflow who is involved in one of the top 3 largest question-answer communities.
Similarly, we first create the getting_answers graph over the topology schema Ss1. Then the RG-SQL query for Q5 is as follows:
Fig. 4. A conceptual schema for Twitter
    SELECT * FROM RANK(getting_answers, pagerank) WHERE vertexid IN
        (SELECT owner_id FROM ANSWER AS a, LABELLED_BY AS l, TAG AS t
         WHERE a.parent_qid=l.qid AND l.tid=t.tid AND tag_label = 'python')
    LIMIT 10;
For Q6, we have the following RG-SQL query, in which both RANK and CLUSTER operators are applied over two different graphs and their results can be flexibly combined to support further analytics.
    SELECT r.vertexid
    FROM RANK(getting_answers, pagerank) AS r,
         (SELECT members FROM CLUSTER(co-answering, MC)
          ORDER BY size DESC LIMIT 3) AS c
    WHERE r.vertexid=ANY(c.members);
5 Experiments
We have implemented our modelling method in a unified data analytics platform, called Rogas, which allows analytics algorithms to be incorporated as plug-ins in a flexible and open manner [10]. To understand how well Rogas performs in comparison with other database systems, we have conducted experiments to compare the expressive power of query languages and the time efficiency of query execution in three different systems: PostgreSQL (http://www.postgresql.org/), Neo4j (http://neo4j.com) and Rogas. These experiments were performed on a Dell Optiplex 9020 desktop computer with an Intel(R) Core(TM) i7-4790 CPU (3.6 GHz, 8 cores), 16 GB of memory and a 256 GB disk. Rogas extends the query engine of PostgreSQL 9.4.4, with additional functionalities implemented using Python 2.7.6. The version of Neo4j we used is community 2.2.5.
Fig. 5. A conceptual schema for Stack Overflow
In our experiments, we used the data sets from the data analytics applications discussed in Sect. 4: (1) the ACM Digital Library (ACM DL) data set provided by the ACM Digital Library (http://dl.acm.org/), (2) the Stack Overflow data set from the Stanford Network Analysis Platform (http://snap.stanford.edu/proj/snap-icwsm/), and (3) the Twitter data set provided by Haewoon Kwak (http://an.kaist.ac.kr/traces/WWW2010.html). Table 1 presents more details about these three data sets.
Table 2 depicts the queries used in our experiments, which can be generally divided into three categories: (1) Q1–Q3 are relational queries including join, sorting, and aggregate operations; (2) Q4–Q10 are queries about graph properties, including pattern matching, triangle counting, pagerank centrality, path finding and community detection; (3) Q11–Q12 are sophisticated queries that may combine several graph properties, e.g., Q11 combines pagerank centrality with finding connected components and Q12 combines pagerank centrality with path finding.
Our first experiment illustrates the expressive power of the three query languages SQL (as supported by PostgreSQL), RG-SQL and Cypher in terms of the queries Q1–Q12. As shown in Table 3, the three languages have different expressive powers. SQL cannot be used to specify Q6–Q12 and Cypher cannot be used to specify Q10–Q12. Nonetheless, RG-SQL is expressive enough to specify all these queries.
Table 1. Three data sets used in our experiments

| Data set | Raw data size | No. of vertices in graphs (Neo4j) | No. of edges in graphs (Neo4j) | No. of records in relations (PostgreSQL) |
|---|---|---|---|---|
| ACM DL | 14.9 GB (XML) | 1,128,243 | 2,488,849 | publisher 50; journal 128; proceedings 6,421; article 337,006; author 784,638; write 932,400; cite 1,212,894 |
| Stack Overflow | 30.6 GB (XML) | 21,713,109 | 31,747,662 | question 7,990,787; answer 13,684,117; tag 38,205; labelled_by 13,466,686 |
| Twitter | 29.7 GB (TXT) | 13,250,196 | 264,368,797 | tweet 10,762,104; tag 210,121; user 2,277,971; follow 259,602,970; mentioned_in 3,108,776; labelled_by 1,657,051 |

Our second experiment evaluates the time efficiency of query execution in Rogas, PostgreSQL and Neo4j. As not all queries can be expressed by PostgreSQL and Neo4j, we compared Q1–Q5 over the three systems, and Q6–Q9 only over Rogas and Neo4j. Note that, for Q6 and Q7, Neo4j needs to use an extension, called Neo4j Mazerunner, to run graph analytics algorithms at scale with Hadoop HDFS and Apache Spark (http://neo4j.com/developer/apache-spark/#mazerunner), and is thus required to send an HTTP GET request to Neo4j Mazerunner. In such cases, the time of executing queries in Neo4j includes the time for sending and receiving the requests. We ran each query 5 times in each system and took the average time for plotting. Figure 6 presents our experimental results. The key observations are as follows:
– For Q1–Q5, Rogas performed as well as PostgreSQL, and better than Neo4j on all of these queries except Q4. This is because Q4 is about pattern matching, which requires navigating hyper-connectivity on graphs, and Neo4j has been particularly optimised for such queries whereas we have not yet implemented any query optimisation techniques. For Q5, it is not surprising that Rogas performed better than Neo4j: Q5 is a triangle counting problem, and the study in [8] has experimentally verified that relational databases can perform triangle counting very efficiently by expressing it as a three-way self-join.
– For Q6–Q7, as Neo4j needs to use Neo4j Mazerunner, it spends time on sending and receiving the requests. Thus, Rogas performed better than Neo4j. However, for Q8–Q9, similar to Q4, these queries need to navigate hyper-connectivity on graphs and Neo4j performed better than Rogas.
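The three-way self-join formulation of triangle counting mentioned for Q5 can be sketched as follows. SQLite is used as a stand-in relational engine, and the table name edge, its columns and the sample rows are hypothetical; the sketch also assumes that each undirected edge is stored exactly once with src < dst, so that every triangle is counted exactly once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
# Each undirected edge is stored once, with src < dst (an assumption).
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (4, 5)])

# A triangle a < b < c consists of the edges (a, b), (b, c) and (a, c).
triangle_count = conn.execute("""
    SELECT COUNT(*)
    FROM edge AS e1, edge AS e2, edge AS e3
    WHERE e1.dst = e2.src
      AND e1.src = e3.src
      AND e2.dst = e3.dst
""").fetchone()[0]
```

The sample graph contains the triangles {1, 2, 3} and {2, 3, 4}.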
In addition to Q1–Q12, we have also run several queries about closeness centrality over Twitter using Rogas and Neo4j. Rogas successfully completed the queries and returned the query results, while Neo4j failed and the system reported an "OutOfMemory" error. The reason is that the graphs created from the Twitter data set are so large that processing these queries exceeded the memory limit of Neo4j.
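The measurement protocol described above (each query run 5 times, with the average time reported) can be sketched as a small timing helper. The paper does not state how the time was measured, so the use of a wall-clock timer and the function name average_runtime are assumptions of this illustration; run_query stands for any callable that submits a query to the system under test.

```python
import time

def average_runtime(run_query, repetitions=5):
    """Execute run_query() `repetitions` times and return the mean
    wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        started = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - started)
    return sum(timings) / len(timings)

# Usage with a stand-in workload instead of a real database query.
calls = []
mean_seconds = average_runtime(lambda: calls.append(sum(range(1000))))
```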
Table 2. Queries used in our experiments

| Query | Category | Data set | Query description |
|---|---|---|---|
| Q1 | Join operation + sorting operation | Stack Overflow | Show the question id, the owner id and the tag label of the top 10 questions that have the highest view count. |
| Q2 | Join operation + sorting operation + aggregate operation | Stack Overflow | Show the top 5 answerers and their latest reputation score in descending order of the number of their answers that were accepted by questions. |
| Q3 | Join operation + sorting operation + aggregate operation | ACM DL | Show the number of articles of each journal and proceedings along with the journal name and the proceedings title in descending order. |
| Q4 | Pattern matching | Twitter | Recommend 10 Twitter users for Jack whom Jack does not currently follow but who are followed by somebody Jack follows. |
| Q5 | Triangle counting | ACM DL | Count the number of triangles of the co-authorship network. |
| Q6 | PageRank centrality | ACM DL | Find the top 10 influential authors according to the pagerank centrality in the co-authorship network. |
| Q7 | Connected component | ACM DL | Count the number of connected components of the co-authorship network. |
| Q8 | Path finding | ACM DL | Find paths with length less than 2 which connect two authors V1 and V2 in the co-authorship network, where author V1 is affiliated with ANU and author V2 is affiliated with UNSW. |
| Q9 | Shortest path | ACM DL | Find a shortest path between the two authors Michael Norrish and Kevin Elphinstone in the co-authorship network. |
| Q10 | Community detection | Stack Overflow | Find a group of tags that are often used together to label a question. |
| Q11 | PageRank centrality + connected component | ACM DL | According to the pagerank centrality, find the top 3 authors of the biggest collaborative community in the co-authorship network. |
| Q12 | PageRank centrality + path finding | ACM DL | According to the pagerank centrality, show how the top 2 authors connect with each other in the co-authorship network. |
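Table 3. Comparison of the expressive power of the query languages PostgreSQL, RG-SQL and Cypher over the queries Q1–Q12, where √ and × indicate "expressible" and "not expressible", respectively

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | Q12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PostgreSQL | √ | √ | √ | √ | √ | × | × | × | × | × | × | × |
| RG-SQL | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ |
| Cypher | √ | √ | √ | √ | √ | √ | √ | √ | √ | × | × | × |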
Fig. 6. Comparison of the time efficiency of query execution in Rogas, PostgreSQL and Neo4j over the queries Q1–Q9
6 Conclusions
In this paper, we have discussed how data analytics tasks can be conceptualised by a conceptual model and then transformed into a logical model. We have also proposed a query language for data analytics, and implemented the proposed methods in a data analytics platform that can unify various data analytics tasks and algorithms. This work was based on our case studies on several real-world data analytics applications.
In the future, we plan to add query topics into our data analytics platform and investigate the development of a query language at a higher level through query topics. We will also study network dynamics and develop techniques to analyse and visualise networks that dynamically change over time.
Acknowledgement. We thank the ACM Digital Library for providing the data set of the ACM bibliographical network.
References
1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading (1995)
2. Bao, Z., Tay, Y., Zhou, J.: sonSchema: a conceptual schema for social networks. In: Conceptual Modeling, pp. 197–211 (2013)
3. Brandes, U., Erlebach, T.: Network Analysis: Methodological Foundations. Springer Science & Business Media, New York (2005)
4. Chen, P.: The entity-relationship model - toward a unified view of data. ACM TODS 1(1), 9–36 (1976)
5. Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70(6), 066111 (2004)
6. Embley, D.W., Liddle, S.W.: Big data - conceptual modeling to the rescue. In: Ng, W., Storey, V.C., Trujillo, J.C. (eds.) ER 2013. LNCS, vol. 8217, pp. 1–8. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41924-9_1
7. Girvan, M., Newman, M.E.: Community structure in social and biological networks. PNAS 99(12), 7821–7826 (2002)
8. Jindal, A., Madden, S.: GRAPHiQL: a graph intuitive query language for relational databases. In: IEEE International Conference on Big Data, pp. 441–450 (2014)
9. Kasneci, G., Ramanath, M., Sozio, M., Suchanek, F.M., Weikum, G.: Star: steiner-tree approximation in relationship graphs. In: ICDE, pp. 868–879 (2009)
10. Liu, M., Wang, Q.: Rogas: a declarative framework for network analysis. In: VLDB (2016)
11. Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C.E., Hellerstein, J.: Graphlab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041 (2014)
12. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: ACM SIGMOD, pp. 135–146 (2010)
13. Peixoto, T.P.: Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89(1), 012804 (2014)
14. Thalheim, B.: Entity-relationship Modeling: Foundations of Database Technology. Springer Science & Business Media, New York (2013)
15. Wang, Q.: Network analytics ER model-towards a conceptual view of network analytics. In: ER, pp. 158–171 (2014)
16. Wang, Q.: A conceptual modeling framework for network analytics. Data Knowl. Eng. 99, 59–71 (2015)