
SNQL: A Social Networks Query and Transformation Language

Mauro San Martín¹, Claudio Gutierrez², and Peter T. Wood³

¹Dept. of Mathematics, U. de La Serena - ²Dept. of Computer Science, U. de Chile - ³Dept. of Computer Science and Information Systems, Birkbeck, Univ. of London

Abstract. Social Network (SN) data has become ubiquitous, demanding advanced and flexible means to represent, transform and query such data. In addition to the intrinsic challenges of querying graph data is the requirement that networks be restructured, and thus that new values be created. To address these, we introduce a dedicated data model and query language, SNQL, founded on previous research on graph databases and on the experience of SN researchers. Technically, it is based on GraphLog and second-order tuple generating dependencies, allowing expressiveness for graph querying and node creation as required by SN, while keeping the complexity of query evaluation in NLOGSPACE.

1 Introduction

The widespread embracing of social networking services (such as Facebook, MySpace and LinkedIn) as the indispensable tools to manage people’s digital and physical lives has resulted in rapidly growing amounts of social activity data. Furthermore, the social information represented by many sites usually includes not only people but a great variety of objects (usually referred to generically as actors in the social networking literature) and relations [4]: photographs (Flickr), other sites (del.icio.us), places (Yahoo! Travel), goods (Amazon), and so on. This diversity produces complex social networks (SN) which require managing, querying and transforming.

In the context of these new applications, users have access to an increasing amount of SN data, and consequently the need arises for them to manage and build their own networks based on relevant “pieces” of the huge networks available. For example, it is common to find applicable online information about published research linked with departmental and local scientific networks. This calls for more advanced and flexible means to query and transform SN data. Along the same lines, scientific experimentation with social network analysis (SNA) tools (e.g. Pajek [10]) calls for data management tools to extract parts of SN from different environments and to integrate, filter and transform them.

A further requirement is that SN data representation needs to be flexible enough to incorporate on-the-fly attributes (e.g. for data curation). In addition, data is used and seen from diverse points of view by different users. Hence classical modeling in terms of a fixed set of entities, attributes and relationships does not work well. For example, in SNA it is common for attributes to become actors, for aggregated data of actors to become attributes, and for relations to have arity greater than two.

[Figure 1, three panels: a) Friendship Network among Persons; b) Friendship Network among Persons (Cities Promoted from Attributes to Actors); c) Friendship Network between Cities.]

Fig. 1. Friendship Network and Transformations. a) A social network representing the friendship relation (square node) between Mary and John, who were introduced by Ann (actors as round nodes, and attribute values as grey dots). b) The same social network after promoting city attributes to actors. c) The social network result after grouping persons by city and computing aggregate attributes: inhabitants of each city, and number of friendships between cities.

Example 1. Consider a SN of friendship relations among people, including which other person, if any, introduced them; this implies having relations of variable arity (solved by representing relations as nodes). People are described by the attributes ‘name’ and ‘city’ (see Fig. 1(a)). The study of the relevance of city of residence to friendship might require promoting cities to actors and linking people and cities with a new type of relation, e.g. ‘lives-in’ (see Fig. 1(b)). Another type of transformation would be to group people by city of residence, thus defining a network of cities, where relations summarize friendships among residents of cities. Additionally, one might like to describe in the network the population (person count) of each city, and label the relations between them with the number of friendship relations between people (see Fig. 1(c)). □

The requirements suggested by Example 1 together with the integration of diverse types of information as described above call for a simple and flexible model of data for SN. Furthermore, due to the growing volume of SN data, any transformation and query language for SN should be scalable. This implies that a transformation and query language should conform, from a theoretical perspective, to low complexity bounds, and, because of practical concerns, be simple and modular while being sufficiently expressive.

Several of these challenges have been already voiced. Fifteen years ago Freeman defined the maximal structure experiment that extended the basic network representation to include attributes as well as to accommodate changes over time [13]. More recently, the need to improve both network data formats in the context of the social web as well as data management services for large and dynamic social networks has been identified [8, 17].

To date, the above challenges have been addressed with ad-hoc approaches, and to the best of our knowledge there is no generic data management solution in spite of the wide agreement on the urgent necessity of addressing this problem [16, 20]. In the appendix on related work we review widely-used tools in SNA, such as Pajek and UCINET [15], which focus on the final analysis of SN data. Some proposals for graph databases (e.g. GraphDB [14]) have features to deal with SN data, but with ad-hoc developer-oriented languages. In the same spirit some SN services provide APIs (e.g. Facebook’s Graph API¹ and the Open Graph Protocol).

The three most comprehensive proposals on the challenges presented above are BiQL, SocialScope, and SoQL. SocialScope [3] is a logical architecture for discovering and managing social information, which includes an algebraic query language. It does not provide the capability to construct new data nor deal with the complexity of data dynamics (particularly transformation of actors into attributes). SoQL [18] is an SQL-like query language for SN, focused on identifying groups and paths over a classical network. It does not incorporate data transformation features. Finally, BiQL [11] is also an SQL-like language with a rich set of features. However, efficiency is still not a concern in these proposals, and none of them analyze computational complexity aspects of the language.

Our transformation and query language, named SNQL, is inspired by two earlier languages: GraphLog [9] and second-order tuple-generating dependencies (SO tgds) [12]. SNQL comprises, both syntactically and conceptually, two modules, one for SN matching and one for SN construction, which essentially correspond to GraphLog and SO tgds, respectively.

GraphLog was a seminal query language for graph data, designed to be expressive while at the same time having low computational complexity. Apart from standard features, it includes aggregation and transitive closure, making it suitable for many SN queries. However, GraphLog does not provide functionality to create new objects/actors, a crucial requirement for SN.

Example 2. Consider again the SN from Example 1. The following SNQL query produces the network depicted in Fig. 1(b) from that in Fig. 1(a) by promoting the ‘city’ attribute to a new type of actor (city) and producing a new type of relation (lives-in) to associate people with cities.

    CONSTRUCT CP IF R2 = f(A1, A2) AND A2 = g(L1)
    WHERE EP
    FROM FriendshipNetwork

Patterns EP and CP, depicted in Fig. 2, denote an extraction pattern and a construction pattern, respectively. FriendshipNetwork is the SN shown in Fig. 1(a).

1 http://developers.facebook.com/docs/reference/api/

[Figure 2, two panels: Extraction Pattern: EP; Construction Pattern: CP.]

Fig. 2. Attribute Promotion. Patterns EP and CP transform attribute ‘city’ to a new actor whose id is functionally created from its literal value (bound to variable L1). Also new relations are required to link persons to the newly created city actors.

Note that in the result, cities become hubs that connect all people living in each of them, and that the new actors require creation of new ids from the data: the oids of cities are produced by applying a function g to the literal values bound to variable L1. Similarly, new relation identifiers for the ‘lives-in’ relation are created, one for each (person, city) pair matched by (A1, A2). □

Creation of values was addressed by database query languages such as IQL [2] and ILOG [7], which allowed oid or value invention. By relying on rules that use both recursion and oid invention, it can be shown that such languages can express all computable database queries [7]. However, it turns out that when recursion and invention are not allowed to interact, as in our proposal, queries can be evaluated in PTIME [2]. In our language, we consider another formalism used to invent values, namely, SO tgds [12]. SO tgds use existentially quantified function symbols (and equalities) to specify the composition of schema mappings, where values in a target schema (output) may need to be existentially quantified (invented). Since SO tgds are not recursive, materializing the result of a schema mapping (which corresponds to creating an SN in our setting) can be done in PTIME [12].

Our contributions are as follows:

1. A data structure capable of representing the informational richness and malleability of social networks.

2. A transformation and query language satisfying the data management requirements of social networks with good properties: adequate expressiveness, and accessible to social networks field practitioners.

3. A query language that includes object creation but maintains low complexity.

4. A corresponding evaluation algorithm whose complexity scales adequately.

The outline of the rest of the paper is as follows. Section 2 briefly presents the requirements and SNA use cases. Section 3 defines the syntax (via examples) and semantics of SNQL. Section 4 presents results on the expressiveness and complexity of SNQL. We include an Appendix with the complete syntax and extended related work.

2 SNQL Requirements and Use Cases

To identify the general requirements and operations needed for SNQL, we collected use cases and archetypical operations from standard SNA literature [22, 19, 10], surveyed relevant publications, mainly from the journal Social Networks, and studied the operations available in SNA software tools such as Pajek [10].

Table 1. Selected social network data management use cases.

| Use Case | Description | Required Query Features |
|---|---|---|
| 1. Selecting Groups | Select a subnetwork of actors and relations that satisfy conditions on their attribute values and/or participation in certain relations. | Pattern matching, filtering by attribute values. |
| 2. Promoting Attributes to Actors | From an actor A1 and one of its attributes (att, v) produce a new actor A2 = f(v) and a new relation R = g(A1, v) (f and g functions): all actors with value v for att will be connected to A2. | Pattern matching, creation of new objects, pattern production. |
| 3. Identifying Brokers | Each time a characteristic brokerage pattern is found, label the broker in the output accordingly. Some brokerage patterns require that certain relations do not exist. | Pattern matching, negation, pattern production. |
| 4. Counting Binary Relations | Select all relations of a given type, group by participant actors, count. Produce only one relation per group with the new attribute count. | Pattern matching, aggregation, pattern production. |
| 5. Ego-network selection | Select an actor along with all its direct neighbors, and the relations between them. | Pattern matching, induced subgraphs. |
| 6. k-neighborhood | Same as above but to distance k instead of one. | Pattern matching, transitive closure, induced subgraphs. |

2.1 Data structure requirements and definition

The natural and traditional choice for representing social networks is to use graphs where actors are nodes and relations are edges. However, doing so limits the representation power to that of binary relations and forbids attributes on relations. Thus, our logical data structure, the social networks data model (SNDM), is a graph where actors, relations and attributes are all modeled as nodes, and edges associate attributes with the actors or relations they describe, and actors with the relations in which they participate. This structure is implemented using three sets of triples:

– A typing set N. Each triple (oid, [isa|isr], family) ∈ N indicates that the actor or relation oid belongs to (is of type) family.

– A set R indicating roles. Each triple (a_oid, role, r_oid) ∈ R indicates that the actor a_oid participates with role in the relation r_oid.

– A set M describing attributes. Each triple (oid, pred, v) ∈ M indicates that the actor or relation oid has the attribute pred with value v.
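As an illustration of the three triple sets, the following is a minimal Python sketch, not the authors' implementation. The class name SocialNetwork, its helper methods, and the oids a1, a2, a3 and r1 are assumptions made for the example; the data is the friendship network of Fig. 1(a) as far as it can be read from the figure description (Mary and John are friends, introduced by Ann).

```python
from dataclasses import dataclass, field


@dataclass
class SocialNetwork:
    """Hypothetical container for an SNDM instance: three sets of triples."""
    N: set = field(default_factory=set)  # typing: (oid, "isa" | "isr", family)
    R: set = field(default_factory=set)  # roles: (a_oid, role, r_oid)
    M: set = field(default_factory=set)  # attributes: (oid, pred, value)

    def add_actor(self, oid, family, **attrs):
        """Register an actor and its attribute values."""
        self.N.add((oid, "isa", family))
        for pred, value in attrs.items():
            self.M.add((oid, pred, value))

    def add_relation(self, oid, family, participants, **attrs):
        """Register a relation node; participants is a list of
        (a_oid, role) pairs, so the arity is not fixed to two."""
        self.N.add((oid, "isr", family))
        for a_oid, role in participants:
            self.R.add((a_oid, role, oid))
        for pred, value in attrs.items():
            self.M.add((oid, pred, value))


def friendship_network():
    """Build the network of Fig. 1(a) (oids are assumed for illustration)."""
    sn = SocialNetwork()
    sn.add_actor("a1", "person", name="Mary", city="CentralCity")
    sn.add_actor("a2", "person", name="John", city="CapitalCity")
    sn.add_actor("a3", "person", name="Ann", city="CentralCity")
    sn.add_relation("r1", "friendship",
                    [("a1", "friend"), ("a2", "friend"), ("a3", "introducer")])
    return sn
```

Because relations are nodes, the ternary friendship r1 (two friends plus an introducer) needs no special treatment: it is one typing triple in N and three role triples in R.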


2.2 Requirements of SNQL

Table 1 summarizes the use cases from the list gathered from the sources mentioned above. Each use case is selected for its relevance and justifies the inclusion of a query language feature.

Social networks operations can be divided into two groups: data management operations that return a social network, and measure operations that return values or sets of values, such as centrality. Today there are various tools that deal with measure operations (e.g. Pajek, R, and UCINET) and clearly belong more to the SNA tools field than to the data management one.

SNQL focuses on the first type, data management operations, which produce networks from networks. Through the use of aggregate functions, we assume that node, group and network measures are available to the language but do not need to be expressible in the language itself. Brandes [5] offers a survey of such measures and the corresponding evaluation algorithms.

Each SNQL query must be able to filter and/or transform a given SN into a new SN. Filter queries are used to reduce the size of a SN and to focus on relevant groups. Transformations produce a new SN where some implicit structural element has been made explicit.
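To make the notion of a filter query concrete, here is a small Python sketch of use case 1 (Selecting Groups) over the three triple sets N, R and M of Section 2.1. The function name select_group and the rule that a relation is kept only when all of its participants are selected are assumptions of this sketch, not part of SNQL's definition; the point is only that the result is again a triple-set network.

```python
def select_group(N, R, M, pred, value):
    """Return the subnetwork of actors whose attribute `pred` equals `value`,
    together with the relations whose participants were all selected.

    N, R, M are sets of triples (typing, roles, attributes).
    """
    # Actors that satisfy the attribute condition.
    actors = {oid for (oid, kind, _) in N
              if kind == "isa" and (oid, pred, value) in M}

    # Relations with at least one participant, all of them selected
    # (assumption: this is how "relations among the group" is read here).
    relations = set()
    for (oid, kind, _) in N:
        if kind != "isr":
            continue
        participants = [a for (a, _, r) in R if r == oid]
        if participants and all(a in actors for a in participants):
            relations.add(oid)

    keep = actors | relations
    out_N = {t for t in N if t[0] in keep}
    out_R = {t for t in R if t[0] in actors and t[2] in relations}
    out_M = {t for t in M if t[0] in keep}
    return out_N, out_R, out_M
```

The output has the same shape as the input, so a filter like this can be followed by a further filter or by a transformation.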

3 The SNQL Query Language

The design of SNQL addresses two issues: being expressive enough, and keeping the evaluation cost within practical bounds. The challenge was to identify a few generic operations to cover the required expressiveness, and to be closed, that is, to be able to construct (and transform) social networks into social networks. The solution was to use GraphLog and second-order tuple generating dependencies, allowing expressiveness for graph querying (see Figure 3).

3.1 Query Syntax

The language should be friendly enough for both the lay-user and the programmer. For the former, a visual language close to the SN graphical representation is ideal: in the simpler cases, one extraction pattern and one construction pattern cover many use cases; in the general case, the extraction pattern should resemble the DAG of query graphs that exists in GraphLog [9]. For the latter, an SQL-like language would be familiar to developers and advanced database users (for searching text, writing, pasting, debugging, etc.). Thus, our language has both syntaxes.

At the abstract level, and for the purpose of formally studying and analyzing its semantics and complexity, we use a translation to a more formal representation, based on Datalog/GraphLog [1, 9], and the data exchange formalism called second-order tuple generating dependencies [12].

An SNQL construct query Q follows the standard SELECT|CONSTRUCT – WHERE – FROM structure of languages like SQL and SPARQL. It receives as input social networks (the FROM clause), extracts information using patterns (the WHERE clause), and outputs a new social network, possibly with new values, using the CONSTRUCT clause. For space reasons, we present the syntax of SNQL by example, using the SN of friendship relations among people presented in Example 1 (see Fig. 1) (the complete syntax is in Appendix A).

[Figure 3: flow diagram. An SNQL query splits into extraction patterns and construction patterns; (1) Datalog query evaluation over D (source social network) yields variable bindings X1, X2, ..., Xn, one table for each extraction pattern; (2) SO tgd evaluation produces and gathers partial results (subnets) into D′ (result social network).]

Fig. 3. Overview of the SNQL language. An SNQL query is composed of an extraction (Datalog-like) plus a construction (SO tgd-like) pattern. The evaluation process has two stages: (1) the Datalog program is evaluated over the data source network D producing intermediate tables of variable bindings; (2) partial results are produced from the obtained tuples (a table for each distinguished predicate) using the SO tgd. All partial results are gathered in the final result network D′.

Example 3. (Promoting Attributes to Actors) Recall the query from Example 2, where the patterns EP and CP were depicted in Fig. 2. Expanding the patterns as lists of triples, the query is as follows:

    CONSTRUCT {(A1, isa, person), (A2, isa, city), (R1, isr, friendship),
               (R2, isr, lives-in), (A1, inhabitant, R2), (A1, P1, R1),
               (A1, name, L2), (A2, place, R2), (A2, name, L1)}
    IF R2=f(A1, A2) AND A2=g(L1)
    WHERE {(A1, isa, person), (R1, isr, friendship),
           (A1, city, L1), (A1, P1, R1), (A1, name, L2)}
    FROM FriendshipNetwork
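The query of Example 3 can be read operationally: enumerate the bindings of the WHERE pattern, compute the new oids with f and g, and instantiate every CONSTRUCT triple. The Python sketch below follows that reading under stated assumptions: the function name promote_city is hypothetical, the network is given as three triple sets N, R, M, and the functions f and g are modeled as tagged tuples (Skolem terms), which is only one possible way to generate fresh oids.

```python
def promote_city(N, R, M):
    """Sketch of Example 3: promote the 'city' attribute to an actor.

    N, R, M are the typing, role and attribute triple sets of the source
    network. Returns the three triple sets of the constructed network.
    """
    # Assumption: new oids are represented as tagged tuples.
    def g(l1):
        return ("g", l1)        # oid of the city actor with name l1

    def f(a1, a2):
        return ("f", a1, a2)    # oid of the lives-in relation for (a1, a2)

    out_N, out_R, out_M = set(), set(), set()

    # WHERE clause: enumerate bindings of (A1, R1, P1, L1, L2).
    for (a1, kind, family) in N:
        if (kind, family) != ("isa", "person"):
            continue
        cities = [v for (o, p, v) in M if o == a1 and p == "city"]
        names = [v for (o, p, v) in M if o == a1 and p == "name"]
        roles = [(p1, r1) for (o, p1, r1) in R
                 if o == a1 and (r1, "isr", "friendship") in N]
        for l1 in cities:
            for l2 in names:
                for (p1, r1) in roles:
                    # IF clause: A2 = g(L1) and R2 = f(A1, A2).
                    a2 = g(l1)
                    r2 = f(a1, a2)
                    # CONSTRUCT clause: instantiate each output triple.
                    out_N |= {(a1, "isa", "person"), (a2, "isa", "city"),
                              (r1, "isr", "friendship"),
                              (r2, "isr", "lives-in")}
                    out_R |= {(a1, "inhabitant", r2), (a1, p1, r1),
                              (a2, "place", r2)}
                    out_M |= {(a1, "name", l2), (a2, "name", l1)}
    return out_N, out_R, out_M
```

Since g depends only on the city name, every person with the same value of ‘city’ is linked to the same city actor, while f yields one lives-in relation per (person, city) pair. Note that, as in the query itself, a person who takes part in no friendship relation does not match the extraction pattern and is therefore absent from the result.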

Example 4. (Grouping and Aggregation) The following SNQL query produces the network depicted in Fig. 1(c) from that in Fig. 1(a), grouping people by city and counting friendship relations between cities.

    CONSTRUCT CP1 IF A4 = f(L1) AS SN1
    WHERE AGG({L1}, COUNT AS L4, EP1)
    FROM FriendshipNetwork
    UNION
    CONSTRUCT CP2 IF A5 = f(L2) AND A6 = f(L3) AND R2 = g(A5, A6) AS SN2
    WHERE AGG({L2,L3}, COUNT AS L5, EP2 FILTER (L2 != L3))
    FROM FriendshipNetwork

[Figure 4, four panels: Extraction Pattern 1: EP1; Extraction Pattern 2: EP2; Construction Pattern 1: CP1; Construction Pattern 2: CP2.]

Fig. 4. Grouping and Aggregation. Patterns EP1 and CP1 group people by ‘city’ and count the number of inhabitants in each city. Patterns EP2 and CP2 group and count friendship relations between pairs of cities.

[Figure 5, four panels: Extraction Pattern 1: EP1 (TC); Extraction Pattern 2: EP2; Extraction Pattern 3: EP3 (TC); Construction Pattern 1: CP1.]

Fig. 5. Transitive Closure. Patterns EP1 and EP3 identify all transitively reachable actors from person named ‘John’ through ‘friendship’ relations. Patterns EP2 and CP1 produce the relations induced by pairs of actors in the set of reachable actors.

Patterns EP1, EP2, CP1, and CP2 are depicted in Fig. 4. Note that each new group (actor) requires a new oid functionally produced from the value of attribute ‘city’. Also the number of inhabitants bound to L4, and the number of friendships between cities bound to L5, must be computed with the aggregate function COUNT. The first argument of AGG is the set of grouping variables, the second is the aggregation function required, and the third is an extraction pattern. The results of the two construct queries are combined using UNION to produce the desired result. □

Example 5. (Transitive Closure) The following SNQL query (whose patterns are shown in Fig. 5) produces the network comprising an actor that meets a given criterion (his name is ‘John’), along with all other actors that can be reached transitively by matching the given pattern (of friendship relations). Additionally all induced relations between pairs of reachable actors are included in the result.

    CONSTRUCT CP1
    WHERE EP2 FILTER ((A3 != A4) AND (A3 = A1 OR A3 = A2) AND
                      (A4 = A1 OR A4 = A5))
          AND (TC(A1, A2, EP1) WITH L1='John')
          AND (TC(A1, A5, EP3) WITH L1='John')
    FROM FriendshipNetwork

TC returns the transitive closure of the binary relation formed by all instantiations of the variables appearing as its first and second arguments when matching the extraction pattern of its third argument. A starting condition is specified after WITH. As a result, variables A2 and A5 are bound to the people reachable from John through transitive friendship relations. Pattern EP2, along with the FILTER conditions above, is then used to match all the induced friendship relations between distinct pairs of people reachable from ‘John’ (along with ‘John’ himself). □

0. Each triple t of the form (A, B, C) is translated as $t(A, B, C)$, where t is n, r or m according to the type of the triple.

1. A list of triples (basic pattern) { t1, ..., tn }: $p(z) \leftarrow \bigwedge_{i \in 1..n} t_i(A_i, B_i, C_i)$.

2. PATT1 AND PATT2: $p(z) \leftarrow p_1(x), p_2(y)$.

3. PATT1 OR PATT2: $p(z) \leftarrow p_1(x)$ and $p(z) \leftarrow p_2(y)$.

4. PATT1 AND-NOT PATT2: $p(z) \leftarrow p_1(x), \neg p_2(y)$.

5. PATT1 FILTER C: $p(z) \leftarrow p_1(x), c(x)$ (assuming condition C is simulated by predicate c).

6. TC(Vs, Vt, PATT1) WITH <start-condition>: $p(U, V) \leftarrow p_1(\ldots U \ldots V \ldots), \mathit{start\_cond}(\ldots U \ldots V \ldots)$ and $p(U, V) \leftarrow p_1(\ldots U \ldots W \ldots), p(W, V)$ (assuming variable Vs corresponds to variable U and Vt to variable V of $p_1(x)$).

7. AGG(VList, AggF, PATT1): $p(z, A(y)) \leftarrow p_1(z, y)$ (assuming VList is the set of variables Z, Y = X − Z and AggF is the aggregate function A).

Fig. 6. Translation of Extraction Pattern to Datalog.
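The behaviour of TC described in Example 5 can be sketched as a simple fixpoint computation. The Python code below is an illustration of the prose description (all actors reachable from those satisfying the start condition), not a literal transcription of the Datalog rules; the function names friend_edges and reachable_from are hypothetical, and reading EP1 as "two distinct persons playing the role ‘friend’ in the same friendship relation" is an assumption of the sketch.

```python
def friend_edges(N, R):
    """Binary relation assumed to be matched by a pattern like EP1:
    pairs of distinct persons that are 'friend' in the same friendship."""
    persons = {o for (o, k, fam) in N if (k, fam) == ("isa", "person")}
    friendships = {o for (o, k, fam) in N if (k, fam) == ("isr", "friendship")}
    members = [(a, r) for (a, role, r) in R
               if role == "friend" and a in persons and r in friendships]
    return {(a, b) for (a, r1) in members for (b, r2) in members
            if r1 == r2 and a != b}


def reachable_from(edges, start):
    """Pairs (u, v) such that start(u) holds and v is reachable from u
    through one or more edges. Iterates until no new pair is derived."""
    reach = {(u, v) for (u, v) in edges if start(u)}
    while True:
        new = {(u, w) for (u, v) in reach
               for (v2, w) in edges if v == v2} - reach
        if not new:
            return reach
        reach |= new
```

A start condition such as WITH L1='John' would be passed as a predicate on the source actor, for example `lambda u: (u, "name", "John") in M`. Each round only extends pairs that are already derived, so the loop stops after at most as many rounds as there are actors.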

3.2 Query Semantics

An SNQL query Q of the form CONSTRUCT <T> WHERE <PATT> FROM <S> transforms social networks into social networks, and is evaluated as shown in Figure 3.

The formal semantics can be expressed in standard formalisms (Datalog and tuple-generating dependencies) as follows. Let D be a social network, Q an SNQL query, and Q(D) the result of applying Q to D. For a set of variables X, let x be the tuple comprising all variables in X.

An extraction pattern is recursively decomposed and simulated by a Datalog program as follows. Let PATT be the pattern to be simulated by predicate p and assume that patterns PATT1 and PATT2 are simulated by p1 and p2, respectively. Let z, x and y contain the projected variables of PATT, PATT1 and PATT2, respectively. Depending on the structure of PATT, the translation is as shown in Fig. 6.

The predicate p obtained from pattern PATT in the previous translation is now used to produce the query result. Here the list of triples <list-of-pattern-triples> of the CONSTRUCT clause, along with the corresponding list of equalities <expr> = <expr>, plays a central role. The equalities are of two types. One type defines each variable: $v_i = \mathit{term}_i$, $1 \le i \le k$; the other is of the form $\mathit{term}_i = \mathit{term}_l$, where each term may contain variables (from p), constants and functions.

    output-n(A4,isa,city) :- construct1(A4,L1,L4)
    output-m(A4,name,L1) :- construct1(A4,L1,L4)
    output-m(A4,inhabitants,L4) :- construct1(A4,L1,L4)
    output-n(A5,isa,city) :- construct2(A5,R2,A6,L5)
    output-n(R2,isr,friendship-between-cities) :- construct2(A5,R2,A6,L5)
    output-n(A6,isa,city) :- construct2(A5,R2,A6,L5)
    output-r(A5,friend,R2) :- construct2(A5,R2,A6,L5)
    output-r(A6,friend,R2) :- construct2(A5,R2,A6,L5)
    output-m(R2,number,L5) :- construct2(A5,R2,A6,L5)
    construct1(A4,L1,L4) :- ag1(L1,N), A4=f(L1), L4=N
    ag1(L1,count(A1)) :- ep1(A1,L1)
    ep1(A1,L1) :- n(A1,isa,person), m(A1,city,L1)
    construct2(A5,R2,A6,L5) :- ag2(L2,L3,M), A5=f(L2), A6=f(L3),
                               R2=g(A5,A6), L5=M
    ag2(L2,L3,count(A2,R1,A3)) :- ep2(A2,A3,R1,L2,L3)
    ep2(A2,A3,R1,L2,L3) :- n(A2,isa,person), n(R1,isr,friendship),
                           n(A3,isa,person), r(A2,friend,R1),
                           r(A3,friend,R1), m(A2,city,L2),
                           m(A3,city,L3), L2 != L3

Fig. 7. Translation of query in Example 4 to Datalog.

For a given CONSTRUCT trList IF eqList, the construction process takes the result of the extraction process, the p(z) predicate, plus the list of equalities eqList translated as $\bigwedge_j eq_j$, to produce the following rule:

$$\mathit{construct}(v_1, \ldots, v_k) \leftarrow p(z) \wedge \bigwedge_j eq_j. \qquad (1)$$

Finally, the resulting social network SN is the set of instantiations of each triple t in the list of triples trList in the CONSTRUCT clause using the values in the construct predicate:

$$SN = \bigcup \{\, t(u_1, u_2, u_3) : \exists (\ldots u_1 \ldots u_2 \ldots u_3 \ldots) \in \mathit{construct} \text{ and } t \text{ in } \mathit{trList} \,\}. \qquad (2)$$

Example 6. Consider the SNQL query Q of Example 4 that involved grouping and aggregation. The translation of Q to Datalog is shown in Figure 7. □
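To see the rules of Fig. 7 at work, the following Python sketch evaluates them bottom-up over a network given as three triple sets N, R, M. It is an illustrative reading, not the evaluation algorithm of the paper: the function name cities_network is hypothetical, f and g are modeled as tagged tuples, and COUNT is implemented as a count of distinct matches of the extraction pattern per group.

```python
from collections import Counter


def cities_network(N, R, M):
    """Sketch of the Datalog program of Fig. 7 (query of Example 4)."""
    # Assumption: new oids are represented as tagged tuples.
    def f(city_name):
        return ("f", city_name)

    def g(a5, a6):
        return ("g", a5, a6)

    persons = {o for (o, k, fam) in N if (k, fam) == ("isa", "person")}
    friendships = {o for (o, k, fam) in N if (k, fam) == ("isr", "friendship")}

    # ep1(A1,L1) :- n(A1,isa,person), m(A1,city,L1)
    ep1 = {(a1, l1) for (a1, p, l1) in M if p == "city" and a1 in persons}
    # ag1(L1,count(A1)) :- ep1(A1,L1)
    ag1 = Counter(l1 for (_, l1) in ep1)

    # ep2(A2,A3,R1,L2,L3): two persons that are 'friend' in the same
    # friendship relation and live in different cities (L2 != L3).
    friend_of = {(a, r) for (a, role, r) in R
                 if role == "friend" and a in persons and r in friendships}
    ep2 = set()
    for (a2, r1) in friend_of:
        for (a3, r1b) in friend_of:
            if r1 != r1b:
                continue
            for (x, l2) in ep1:
                for (y, l3) in ep1:
                    if x == a2 and y == a3 and l2 != l3:
                        ep2.add((a2, a3, r1, l2, l3))
    # ag2(L2,L3,count(A2,R1,A3)) :- ep2(A2,A3,R1,L2,L3)
    ag2 = Counter((l2, l3) for (_, _, _, l2, l3) in ep2)

    out_N, out_R, out_M = set(), set(), set()
    # construct1 and its output rules.
    for l1, l4 in ag1.items():
        a4 = f(l1)
        out_N.add((a4, "isa", "city"))
        out_M |= {(a4, "name", l1), (a4, "inhabitants", l4)}
    # construct2 and its output rules.
    for (l2, l3), l5 in ag2.items():
        a5, a6 = f(l2), f(l3)
        r2 = g(a5, a6)
        out_N |= {(a5, "isa", "city"), (a6, "isa", "city"),
                  (r2, "isr", "friendship-between-cities")}
        out_R |= {(a5, "friend", r2), (a6, "friend", r2)}
        out_M.add((r2, "number", l5))
    return out_N, out_R, out_M
```

The union of the two construct queries is obtained here simply by adding both sets of output triples to the same result. In this literal reading, ep2 matches each friendship in both directions, so the sketch produces one relation per ordered pair of distinct cities.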

In practice, the evaluation process may proceed more efficiently by avoiding intermediate materialization: each match of the extraction pattern produces a collection of tuples corresponding to Datalog predicates, and this collection of tuples is processed by the construction pattern to produce a subnetwork of the output (see Section 4).


4 Complexity and Expressiveness

The main goal of this paper was to introduce a sufficiently flexible data modeland expressive query language that meets the data manipulation requirements ofsocial networks. In this section we state—without proof due to space constraints—results that show the good behaviour of the language regarding complexity andexpressive power.

SNQL is composed of two modules: one for extraction of information and onefor construction of a new network. In the design, consideration has been givento providing the maximum expressiveness possible while keeping the complex-ity of processing within reasonable bounds. First, for extraction, we consideredGraphLog (possibly with summarization functions), which is a graph query lan-guage designed to be simple, graphical, oriented to graphs, and be as expressiveas possible while staying within the LOGSPACE complexity bound [9]. Sec-ond, for the construction module, whose main purpose is the creation of newidentifiers in the process of creating the new network, the language is mod-eled after second-order tuple-generating dependencies, which are known to be afamily of transformations between tables of tuples with the “right” expressive-ness/complexity tradeoff [12].

Knowing this, it is not surprising that SNQL covers all use cases being usedin current practice by SN researchers. (There are still some queries defined the-oretically by SN scientists which are not covered by SNQL; but it can be provedthat they fall out of the scope of a reasonably efficient complexity bound. A typ-ical example is cohesive subgroups defined using shortest paths lengths betweenmembers, for instance k-cores.) Formally stated, this result can be presented asfollows:

Claim. SNQL solves all use cases presented in SN practice that fall within the LOGSPACE complexity bound.

A formal proof of this claim relies on the list of use cases in current practice. The column “Required Query Features” of Table 1 collects the features needed for the classical use cases from the SN community. All of them, except induced subgraph, are incorporated directly in the language. For the induced subgraph, Example 5 gives the idea of how this is done.

As for the expressive power compared to classical database languages, we can prove the following two results:

Theorem 1. The SNQL extraction module has the same expressive power as GraphLog.

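Theorem 2. The construction module can be specified by one SOtgd of the form:

$$\exists f_1 \ldots f_m \big( \forall x_1 (\varphi_1 \rightarrow \psi_1) \wedge \ldots \wedge \forall x_n (\varphi_n \rightarrow \psi_n) \big),$$

where each $(\varphi_i \rightarrow \psi_i)$ has the form:

$$\Big( p(x) \wedge \bigwedge_k eq_k \Big) \rightarrow (t_1 \wedge \ldots \wedge t_r), \qquad (3)$$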


(a) For an expression CONSTRUCT trList_1 IF eqList_1; ... trList_n IF eqList_n; translated into the form of eq. (3), define n clauses: $aux_j(x_j) \leftarrow p(x) \wedge \bigwedge_k eq_k$, for $j = 1, \ldots, n$.

(b) For each clause trList_j IF eqList_j; and each triple (x, y, z) in trList_j, define a rule $t(x, y, z) \leftarrow aux_j(x_j)$.

(c) Add the clauses generated by (a) and (b) to the original Datalog program generated from the extraction pattern (see Fig. 6).

(d) Obtain the values of the triples to be generated by running the new program.

Fig. 8. Evaluation Algorithm.

where $p(x)$ and $eq_k$ follow the notation of equations (1) and (2); that is, predicate $p(x)$ is the result of processing the extraction pattern, $t_j$ and $eq_k$ are predicates resulting from the translation of the triples and equalities in the CONSTRUCT clause, and each tuple $x_i$ includes all variables in $p$ and in the $eq_k$'s.
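As an illustration of the shape in eq. (3), consider a hypothetical construction (not taken from the use cases of this paper) that creates one new group actor for every pair of actors returned by the extraction pattern. The predicate $p(a, b)$, the constant $\mathit{member}$ and the function symbol $f$ are assumptions made for this sketch; here $m = 1$, $n = 1$, $r = 2$, and the conjunction of equalities is empty:

```latex
\exists f \; \forall a \, \forall b \;
  \Big( p(a, b) \rightarrow
        t\big(f(a, b), \mathit{member}, a\big) \wedge
        t\big(f(a, b), \mathit{member}, b\big) \Big)
```

The function term $f(a, b)$ plays the role of the newly created identifier: the same pair always yields the same new actor, while different pairs may yield different ones.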

A naive implementation of the semantics presented in Fig. 6 would materialize intermediate results. This can be avoided by using the algorithm in Fig. 8.
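Steps (a) to (c) of the algorithm in Fig. 8 can be sketched as a small rule generator. This is only an illustration under stated assumptions: the `ConstructClause` representation, the textual Datalog syntax (`:-`, a trailing period) and the example predicate `p(A, B)` are hypothetical, and step (d), running the resulting program, is left to whatever Datalog engine is available.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class ConstructClause:
    """One 'trList_j IF eqList_j;' clause (hypothetical representation)."""
    triples: List[Triple]               # pattern triples of trList_j
    equalities: List[Tuple[str, str]]   # (lhs, rhs) pairs of eqList_j
    variables: List[str]                # the tuple x_j: variables of p and the equalities


def construct_rules(p_atom: str, clauses: List[ConstructClause]) -> List[str]:
    """Generate the Datalog rules of steps (a) and (b) as strings."""
    rules = []
    for j, clause in enumerate(clauses, start=1):
        aux = f"aux{j}({', '.join(clause.variables)})"
        body = [p_atom] + [f"{lhs} = {rhs}" for lhs, rhs in clause.equalities]
        # Step (a): aux_j(x_j) <- p(x) and all equalities of the clause.
        rules.append(f"{aux} :- {', '.join(body)}.")
        # Step (b): one rule per triple of the clause.
        for s, p, o in clause.triples:
            rules.append(f"t({s}, {p}, {o}) :- {aux}.")
    return rules


def extend_program(extraction_program: List[str], p_atom: str,
                   clauses: List[ConstructClause]) -> List[str]:
    """Step (c): append the generated rules to the extraction program."""
    return list(extraction_program) + construct_rules(p_atom, clauses)


# Hypothetical clause: {(N, member, A), (N, member, B)} IF N = f(A, B);
clause = ConstructClause(
    triples=[("N", "member", "A"), ("N", "member", "B")],
    equalities=[("N", "f(A, B)")],
    variables=["A", "B", "N"],
)
rules = construct_rules("p(A, B)", [clause])
```

Each clause contributes one auxiliary rule plus one rule per triple, so the extended program grows linearly with the size of the CONSTRUCT expression.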

Lemma 1. The Evaluation Algorithm is correct.

It is possible to show that the above evaluation computes queries efficiently from a database perspective. As is customary when studying the complexity of the evaluation problem for a query language [21], we consider its associated decision problem. We denote this problem by Evaluation and define it as follows.

INPUT: A Social Network S, a query Q and a triple t = (a, b, c).

QUESTION: Is $t \in [[Q]]_D$?

Theorem 3. The complexity of Evaluation is in NLOGSPACE.
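The decision-problem framing can be made concrete with a short sketch. Here a network is modelled as a set of triples and a query as a Python function from networks to sets of triples; both modelling choices, and the sample query `mutual_friends`, are assumptions for illustration. The sketch computes the whole answer before testing membership, so it shows what is being decided, not how a space-bounded procedure would decide it.

```python
from typing import Callable, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Network = FrozenSet[Triple]
Query = Callable[[Network], FrozenSet[Triple]]


def evaluation(network: Network, query: Query, t: Triple) -> bool:
    """Decide whether triple t belongs to the answer of query on network."""
    return t in query(network)


def mutual_friends(network: Network) -> FrozenSet[Triple]:
    """Hypothetical query: keep 'friend' triples that hold in both directions."""
    return frozenset(
        (a, rel, b)
        for (a, rel, b) in network
        if rel == "friend" and (b, "friend", a) in network
    )


# Hypothetical input network.
S = frozenset({
    ("ana", "friend", "bob"),
    ("bob", "friend", "ana"),
    ("bob", "friend", "eva"),
})
yes_instance = evaluation(S, mutual_friends, ("ana", "friend", "bob"))
no_instance = evaluation(S, mutual_friends, ("bob", "friend", "eva"))
```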

5 Conclusions

Based on social network practice, we have presented the design of a data model and query language for SN. Of particular novelty is the ability of our language to transform one network into another, in the process creating new actors and new attributes based on aggregation, features that are crucial for social network researchers.

We presented the syntax, semantics and complexity analysis. The language is based on classical database results to obtain a good balance between expressiveness and complexity. Presenting both a graphical and an SQL-like syntax, we designed it to have the most encompassing expressiveness while staying tractable. In fact, we show that it includes all tractable operations found in our survey of current SN data management practice, and we prove that networks can be transformed efficiently from a computational point of view.

References

1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995)

2. Abiteboul, S., Kanellakis, P.C.: Object identity as a query language primitive. J. ACM 45(5), 798–842 (1998)

3. Amer-Yahia, S., Lakshmanan, L.V.S., Yu, C.: Socialscope: Enabling information discovery on social content sites. In: CIDR. www.crdrdb.org (2009)

4. Amer-Yahia, S., Markl, V., Halevy, A.Y., Doan, A., Alonso, G., Kossmann, D., Weikum, G.: Databases and web 2.0 panel at vldb 2007. SIGMOD Record 37(1), 49–52 (2008)

5. Brandes, U., Erlebach, T. (eds.): Network Analysis: Methodological Foundations [outcome of a Dagstuhl seminar, 13-16 April 2004], Lecture Notes in Computer Science, vol. 3418. Springer (2005)

6. Breiger, R., Carley, K., Pattison, P. (eds.): Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers. The National Academies Press (2003)

7. Cabibbo, L.: The expressive power of stratified logic programs with value invention. Inf. Comput. 147(1), 22–56 (1998)

8. Carley, K.M.: Linking capabilities to needs. In: Breiger et al. [6], pp. 363–370

9. Consens, M.P., Mendelzon, A.O.: Graphlog: a visual formalism for real life recursion. In: PODS. pp. 404–416. ACM Press (1990)

10. de Nooy, W., Mrvar, A., Batagelj, V.: Exploratory Social Network Analysis with Pajek. Cambridge University Press (2005)

11. Dries, A., Nijssen, S., Raedt, L.D.: A query language for analyzing networks. In: Cheung, D.W.L., Song, I.Y., Chu, W.W., Hu, X., Lin, J.J. (eds.) CIKM. pp. 485–494. ACM (2009)

12. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005)

13. Freeman, L., Romney, A.K., White, D.R. (eds.): Research Methods in Social Network Analysis. Transaction Publishers (1992)

14. Guting, R.: Graphdb: modeling and querying graphs in databases. In: 20th VLDB Conference. pp. 297–308 (1994)

15. Huisman, M., van Duijn, M.: Software for Social Network Analysis, chap. Software for Social Network Analysis, pp. 270–316. Cambridge (2005)

16. Jagadish, H.V., Olken, F.: Database management for life science research: Summary report of the workshop on data management for molecular and cell biology. OMICS 7(1), 131–137 (2003)

17. Mika, P.: Social Networks and the Semantic Web, Semantic Web And Beyond Computing for Human Experience, vol. 5. Springer (2007)

18. Ronen, R., Shmueli, O.: Soql: A language for querying and creating data in social networks. In: ICDE. pp. 1595–1602. IEEE (2009)

19. Scott, J.: Social Network Analysis. SAGE Publications, second edn. (2000)


20. Topaloglou, T., Davidson, S.B., Jagadish, H.V., Markowitz, V.M., Steeg, E.W., Tyers, M.: Biological data management: Research, practice and opportunities. In: VLDB. pp. 1233–1236 (2004)

21. Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: STOC. pp. 137–146. ACM (1982)

22. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences, Cambridge University Press (1994)

A SNQL Syntax

<construct-query> ::= CONSTRUCT <list-of-construct-patt>
                      WHERE <extract-patt>
                      FROM <list-of-social-networks>

<list-of-construct-patt> ::= <construct-patt>[ AS <sn-id>
                             [, <construct-patt> AS <sn-id>]*]

<construct-patt> ::= <list-of-pattern-triples>
                     [IF <list-of-equalities>]

<list-of-equalities> ::= <expr> = <expr> [ AND <expr> = <expr>]*

<extract-patt> ::= <list-of-pattern-triples> [MATCH <sn-id>]
                 | <extract-patt> AND <extract-patt>
                 | <extract-patt> OR <extract-patt>
                 | <extract-patt> AND-NOT <extract-patt>
                 | <extract-patt> FILTER <condition>
                 | TC(<startV>, <endV>, <extract-patt>)
                   WITH <start-condition>
                 | AGG(<group-vars>, <aggr-func>, <extract-patt>)

<list-of-pattern-triples> ::= {<pattern-triple>[, (<pattern-triple>)]*}

<pattern-triple> ::= (<term>, <term>, <term>)

<list-of-social-networks> ::= <social-network>[ AS <sn-id>
                              [, <social-network> AS <sn-id>]*]

<social-network> ::= <sn-id> | <list-of-instance-triples>

<list-of-instance-triples> ::= {<instance-triple>[, (<instance-triple>)]*}

<instance-triple> ::= <n-triple> | <r-triple> | <m-triple>

<n-triple> ::= (<oid>, <constant>, <constant>)

<r-triple> ::= (<oid>, <constant>, <oid>)

<m-triple> ::= (<oid>, <constant>, <constant>)

<constant> ::= <object-id> | <literal>

<condition> ::= <logic-expression>

<term> ::= <variable> | <constant> | <expr> AS <alias>

<expr> ::= <variable> | <constant>
         | <function>(<expr>[, <expr> ]*)

<alias> ::= <variable> | <null>
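One way to read the <extract-patt> production is as a recursive data type with one case per alternative. The Python sketch below is an assumption-laden illustration, not part of SNQL: the class names, the use of plain strings for terms and conditions, and the `?x` spelling of variables are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Term = str
PatternTriple = Tuple[Term, Term, Term]


@dataclass(frozen=True)
class Match:
    """<list-of-pattern-triples> [MATCH <sn-id>]"""
    triples: Tuple[PatternTriple, ...]
    sn_id: Optional[str] = None


@dataclass(frozen=True)
class Binary:
    """<extract-patt> AND | OR | AND-NOT <extract-patt>"""
    op: str
    left: "ExtractPatt"
    right: "ExtractPatt"


@dataclass(frozen=True)
class Filter:
    """<extract-patt> FILTER <condition>"""
    patt: "ExtractPatt"
    condition: str


@dataclass(frozen=True)
class TC:
    """TC(<startV>, <endV>, <extract-patt>) WITH <start-condition>"""
    start_v: Term
    end_v: Term
    patt: "ExtractPatt"
    start_condition: str


@dataclass(frozen=True)
class Agg:
    """AGG(<group-vars>, <aggr-func>, <extract-patt>)"""
    group_vars: Tuple[Term, ...]
    aggr_func: str
    patt: "ExtractPatt"


ExtractPatt = Union[Match, Binary, Filter, TC, Agg]


def pattern_triples(patt: ExtractPatt) -> List[PatternTriple]:
    """Collect every pattern triple that occurs in an extraction pattern."""
    if isinstance(patt, Match):
        return list(patt.triples)
    if isinstance(patt, Binary):
        return pattern_triples(patt.left) + pattern_triples(patt.right)
    return pattern_triples(patt.patt)  # Filter, TC and Agg wrap one sub-pattern


# Hypothetical pattern: {(?x, knows, ?y)} MATCH S1 AND ({(?y, age, ?a)} FILTER ?a > 30)
example = Binary(
    "AND",
    Match((("?x", "knows", "?y"),), "S1"),
    Filter(Match((("?y", "age", "?a"),)), "?a > 30"),
)
found = pattern_triples(example)
```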