Information Sciences and Technologies
Bulletin of the ACM Slovakia

December 2016
Volume 8, Number 2

J. Papán: IP Fast Reroute
I. Srba: Promoting Sustainability and Transferability of Community Question Answering
K. Rástočný: Metadata Management for Large Information Spaces
E. Kuric: Automatic Estimation of Software Developer's Expertise
Š. Dlugolinský: Combining Named Entity Recognition Methods for Concept Extraction
J. Balažia: Seamless Handover in Networks Based on IEEE 802.11 Standard
J. Mojžiš: Visualization, Navigation and Relationship Discovery in Graphs
P. Helebrandt: Architecture for Core Networks Utilizing Software Defined Networking
M. Vojtko: Formal Description of Embedded Operating Systems
Š. Sabo: Social Insect Inspired Algorithm to Detect and Track Topics in Dynamic Documents

Published by Slovak University of Technology Press, Vazovova 5, 812 43 Bratislava, IČO: 00397687
on behalf of the ACM Slovakia Chapter

ISSN 1338-1237 (printed edition)
ISSN 1338-6654 (online)
Registration number: MK SR EV 3929/09

Aim and Scope of the Information Sciences and Technologies Bulletin of the ACM Slovakia

ACM Slovakia offers a forum for rapid dissemination of research results in the area of computing/informatics and more broadly of information and communication sciences and technologies. It is primarily a web based bulletin publishing results of dissertations submitted at any university in Slovakia or elsewhere, perhaps also results of outstanding master's theses. Besides that, conferences that meet the bulletin's expectations with regard to scientific rigor are invited to consider publishing their papers in the bulletin in the form of special issues. Besides the web version of the bulletin, a paper version is available, too.

The Bulletin aims:

• To advance and to increase knowledge and interest in the science, design, development, construction, languages, management and applications of modern computing a.k.a. informatics, and more broadly of information and communication sciences and technologies.

• To facilitate communication between persons having an interest in information and communication sciences and technologies by providing a forum for rapid dissemination of scholarly articles.

Scope of the Bulletin is:

• original research in an area within the broader family of information sciences and technologies, with a particular focus on computer science, computer engineering, software engineering and information systems, and also other similarly well established fields such as artificial intelligence or information science.

Types of contributions:

• Extended abstracts of doctoral dissertations. This is the primary type of article in the Bulletin. It presents the main contributions of the dissertation in the form of a journal paper together with a separate section listing the published works of the author. In Slovakia and the Czech Republic, it corresponds to the typical length of the so-called autoreferat. In fact, it is envisaged that publishing the extended abstract in the Bulletin makes the autoreferat obsolete and eventually can replace it completely. It should be noted that by publishing it in the Bulletin, the extended abstract will receive a much wider dissemination. Exceptionally, at the discretion of the Editorial Board, the Bulletin may accept extended abstracts of other than doctoral theses, e.g. Master theses, when the research results reported are sufficiently worthy of publishing in this forum. Rules and procedures of publishing are similar.

• Conference papers. The Bulletin offers organizers of interesting scientific events in some area within the scope of the Bulletin the possibility to publish papers of the Conference in the Bulletin as its special issue. Any such proposal will be subject to discussion with the Editorial Board, which will ultimately decide. From the scientific merit point of view, the method of peer reviewing, the acceptance ratio etc. are issues that will be raised in the discussion.

Besides that, the Bulletin may include other types of contributions that will contribute to fulfilling its aims, so that it best serves the professional community in the area of information and communication sciences and technologies. There are four regular issues annually.

Editorial Board

Editor in Chief
Pavol Navrat, Slovak University of Technology in Bratislava, Slovakia

Associate Editor in Chief
Maria Bielikova, Slovak University of Technology in Bratislava, Slovakia

Members:
Andras Benczur, Eotvos Lorand University, Budapest, Hungary
Johann Eder, University of Vienna, Austria
Viliam Geffert, P. J. Safarik University, Kosice, Slovakia
Tomas Hruska, Brno University of Technology, Czech Republic
Mirjana Ivanovic, University of Novi Sad, Serbia
Robert Lorencz, Czech Technical University, Prague, Czech Republic
Karol Matiasko, University of Zilina, Slovakia
Yannis Manolopoulos, Aristotle University, Thessaloniki, Greece
Tadeusz Morzy, Poznan University of Technology, Poland
Valerie Novitzka, Technical University in Kosice, Slovakia
Jaroslav Pokorny, Charles University in Prague, Czech Republic
Lubos Popelinsky, Masaryk University, Brno, Czech Republic
Branislav Rovan, Comenius University, Bratislava, Slovakia
Vaclav Snasel, VSB-Technical University of Ostrava, Czech Republic
Jiri Safarik, University of West Bohemia, Plzen, Czech Republic

Executive Editor: Dominik Macko
Cover Design: Peter Lacko
Typeset in LaTeX using style based on ACM SIG Proceedings Template.

IP Fast Reroute

Jozef Papán∗

Department of InfoComm networks
Faculty of Management Science and Informatics
University of Žilina
Univerzitná 1, 010 26 Žilina, [email protected]

Abstract
In this work, a new Multicast Repair (M-REP) IPFRR mechanism, which uses IP multicast technology, is presented. The proposed M-REP mechanism uses Protocol Independent Multicast - Dense Mode (PIM-DM) with a modified Reverse Path Forwarding (RPF) algorithm. The key contribution of this work is that the proposed M-REP IPFRR mechanism is independent of link-state routing protocols and its internal algorithm does not explicitly calculate the alternative path.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General - Security and protection; C.2.3 [Computer-Communication Networks]: Network Operations - Network management

Keywords
IP Fast Reroute; IPFRR; multicast; RPF; PIM-DM

1. IP Fast Reroute
After a link or node failure, a process of network convergence starts in the network, during which routers must update their routing tables. Network convergence can take from a few milliseconds up to tens of seconds. During this process, several destinations in the network might become unavailable, packet loss might increase, or routing loops might even occur. Several solutions have been developed to mitigate these negative effects; they are known collectively as Fast Reroute (FRR) mechanisms.

The first FRR mechanism was Multiprotocol Label Switching (MPLS) FRR, which uses explicit backup routes.

∗Recommended by thesis supervisors: Assoc. Prof. Pavel Segec, Dr. Peter Paluch
Defended at Faculty of Management Science and Informatics, University of Zilina on August 25, 2015.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Papán, J. IP Fast Reroute. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 1-9

However, since MPLS is not deployed in every network and is not sufficiently scalable, subsequent development led towards the IPFRR mechanisms.

The main goal of all IPFRR mechanisms is to minimize the network recovery time after a node or link failure. The key feature of these mechanisms is the calculation of an alternative route before the failure occurs [13, 9]. The computation of an alternative route requires network topology information, and therefore most of the existing IPFRR mechanisms strongly depend on the use of link-state routing protocols.

When a link fails in the network, the IPFRR mechanism routes packets over a pre-computed alternative route until the network converges (see Figure 1). During this time, the routing protocol propagates updates about the topology change in the background. After this update completes, the routing protocol takes back control over the routing of packets.

An important factor in IPFRR is the recovery time after the node or link failure. This time should be one of the key factors when evaluating IPFRR mechanisms. The average reaction time of current IPFRR mechanisms for fast recovery is 50 ms [10, 3].

Many IPFRR mechanisms have been proposed. They can be categorized into three main groups:
- Loop Free Alternates (LFA) mechanisms [5],
- Equal Cost Multiple Paths (ECMP) mechanisms [11],
- Multihop solutions: Tunnels, Multiple Routing Configurations (MRC), Maximally Redundant Trees (MRT), PQ-Space, U-Turn Alternative, Remote LFA (rLFA) [6, 4].

Modern network routing protocols use a relatively slow and complex hello mechanism. The failure detection time of routing protocols alone is insufficient for rapid rerouting requirements. Therefore, fast failure detection is an important part of an IPFRR mechanism. A number of alternative failure detection mechanisms can be used [13]:
- Physical detection mechanism (loss of carrier, loss of light, increase in bit error rate, etc.),
- Independent detection mechanism (Bidirectional Forwarding Detection protocol) [12],
- Routing protocol detection (Hello mechanisms).

Figure 1: Basic IPFRR principle.

Figure 2: LFA.

1.1 Loop Free Alternates
When the source router S detects a link failure, it sends traffic to an alternate backup router, also called an LFA (see Figure 2). The LFA router is selected in advance. An LFA router must be directly connected to the source router S and must provide a loop-free path for forwarding packets to the destination D. The source router may have more than one precomputed next-hop LFA router [5].

The selection of the LFA router is defined by two criteria. These conditions guarantee that the LFA router provides a loop-free path:

Loop-Free Criterion:

Cost(N,D) < Cost(N,S) + Cost(S,D) (1)

Downstream Path Criterion:

Cost(N,D) < Cost(S,D) (2)

where S is the source router, N is a potential LFA router, D is the destination router, Cost(N,D) is the cost of the shortest path from N to D, Cost(N,S) is the cost of the shortest path from N to S, and Cost(S,D) is the cost of the shortest path from S to D. The LFA mechanism provides good basic protection against a link or node failure. Extensions of the LFA mechanism allow the LFA backup router to be located more than one hop away from the source router (for example, the Remote LFA mechanism).
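The two criteria can be evaluated directly from shortest-path costs. The following is a minimal sketch, not the implementation of any router: the topology `TOPOLOGY`, its link costs, and the function names are hypothetical and chosen only so that one neighbor passes both criteria, one passes only criterion (1), and one fails.

```python
import heapq


def shortest_costs(graph, source):
    """Dijkstra: cost of the shortest path from source to each reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return dist


def lfa_candidates(graph, s, d, primary_next_hop):
    """Neighbors of s that satisfy criterion (1).

    The value is True when the neighbor also satisfies criterion (2).
    """
    cost_s_d = shortest_costs(graph, s)[d]
    result = {}
    for n in graph[s]:
        if n == primary_next_hop:
            continue
        cost_n = shortest_costs(graph, n)
        loop_free = cost_n[d] < cost_n[s] + cost_s_d   # criterion (1)
        downstream = cost_n[d] < cost_s_d              # criterion (2)
        if loop_free:
            result[n] = downstream
    return result


# Hypothetical topology (not taken from Figure 2): symmetric link costs.
TOPOLOGY = {
    "S": {"E": 1, "N1": 1, "N2": 5, "N3": 1},
    "E": {"S": 1, "D": 1},
    "N1": {"S": 1, "D": 2},
    "N2": {"S": 5, "D": 1},
    "N3": {"S": 1},
    "D": {"E": 1, "N1": 2, "N2": 1},
}

print(lfa_candidates(TOPOLOGY, "S", "D", "E"))  # {'N1': False, 'N2': True}
```

In this example N3 is rejected because its only path to D leads back through S, N1 is loop-free but not downstream, and N2 satisfies both criteria.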

Figure 3: U-Turn.

1.2 U-Turn Alternative
LFA mechanisms usually use directly connected neighboring routers to send data traffic around the failed link. When an LFA router is not available, the U-Turn mechanism can be used instead [9]. The mechanism allows the source router S to send traffic to a so-called U-Turn alternative router N (see Figure 3).

The U-Turn router N then recognizes the special traffic from the source router S and does not drop it. When the U-Turn router receives packets from the source router S, it forwards them to its own LFA router. The LFA router then sends these packets to the destination.

If the U-Turn router does not implement the U-Turn mechanism, packets from router S can be blackholed, misrouted or looped.

The U-Turn alternative candidate must satisfy the following condition [9]:

Node Selection Criterion:

Cost(N,D) ≥ Cost(N,S) + Cost(S,D) (3)

where S is the source router, N is a potential U-Turn router, D is the destination router, Cost(N,D) is the cost of the shortest path from N to D, Cost(N,S) is the cost of the shortest path from N to S, and Cost(S,D) is the cost of the shortest path from S to D.
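Criterion (3) is the logical complement of the loop-free criterion (1): a neighbor qualifies as a U-Turn candidate exactly when it is not loop-free. A small sketch makes this explicit. The cost table `COST` and the router names N and M are hypothetical illustration values, not data from the paper.

```python
def is_loop_free(cost, n, s, d):
    """Criterion (1): N reaches D without going back through S."""
    return cost[(n, d)] < cost[(n, s)] + cost[(s, d)]


def is_uturn_candidate(cost, n, s, d):
    """Criterion (3): N's shortest path to D is no shorter than the path via S."""
    return cost[(n, d)] >= cost[(n, s)] + cost[(s, d)]


# Hypothetical shortest-path costs, keyed by (from, to).
COST = {
    ("S", "D"): 2,
    ("N", "S"): 1, ("N", "D"): 3,   # N would route to D through S
    ("M", "S"): 3, ("M", "D"): 1,   # M has its own short path to D
}

for router in ("N", "M"):
    print(router, is_loop_free(COST, router, "S", "D"),
          is_uturn_candidate(COST, router, "S", "D"))
# N False True
# M True False
```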

A U-Turn router must be able to recognise the traffic from the source router S, either implicitly or explicitly. Implicit detection means that the U-Turn router has a special algorithm for recognizing IPFRR traffic sent from a source router S. The algorithm tells the router which traffic from source router S is sent over a backup path as opposed to normal routing.

Explicit detection occurs when the source router S modifies the packet headers and the U-Turn router is able to receive and recognize these modified packets. Modifying packet headers may cause compatibility problems with other routers in the network.

In the next section, we focus on analyzing the disadvantages of existing IPFRR mechanisms.

2. Problem Specification
2.1 Pre-computing
The basic principle of IPFRR mechanisms is fast detection of the link failure combined with precomputed alternative routes. These pre-calculations are not trivial.

The computational complexity increases with the number of routers in the network. The computations need to be performed again after each topology change in order to update the alternative routes. Routers usually perform these calculations as low-priority processes during the idle time of the router CPU. The additional alternative route calculations thus consume time and system resources of the router. We therefore consider these pre-computations to be one of the problematic areas of the existing IPFRR mechanisms.

2.2 Dependence on Link-State Routing Protocols
Another important factor is that many of the existing IPFRR algorithms require topology information about the network in order to pre-compute the alternative route. This limits the use of these IPFRR mechanisms to networks running a link-state routing protocol. The majority of existing IPFRR mechanisms depend on link-state routing protocols.

2.3 Research Direction
The analysis of existing solutions shows that the existing mechanisms meet the basic IPFRR requirements, but they are complicated. Our goal was therefore to develop a new, simpler mechanism that would meet the basic IPFRR requirements. One possibility that has not been used in current IPFRR mechanisms is multicast technology [7]. This was the starting point of our search for a mechanism that would bring a new principle into the IPFRR area. After deciding to use multicast technology, the question was which multicast protocol to use. We focused our efforts mostly on the PIM protocol, which can work either in sparse [8] or dense [2] mode.

3. Protocol Independent Multicast - Dense Mode (PIM-DM)

The PIM-DM protocol assumes that all routers in the network want to receive multicast traffic. At the beginning of a multicast transmission, routers with PIM-DM enabled send multicast packets to all other routers in the network. This process is called flooding [2]. PIM-DM uses the Reverse Path Forwarding (RPF) mechanism as protection against micro-loops, which can occur during the initial flooding of multicast communication. If a router with PIM-DM enabled does not want to receive a specific multicast communication, it sends a Prune message to its upstream router. This process is called pruning.

Figure 4: Protocol PIM-DM.

Interfaces of routers that send the Prune message enter the pruned state. The pruned state is valid for a limited period of time. After this period, the routers receive the multicast communication again. The pruned state is related to a specific multicast (S, G) pair. If a new receiver appears in the pruned area, the PIM-DM protocol uses the PIM Graft message to cancel the pruned state. The PIM Graft message is sent by the corresponding router to its upstream router.

In order to minimize the number of pruning and flooding processes, the PIM-DM protocol uses a State Refresh message, which extends the pruned state. Flooding and pruning processes cause unwanted traffic in the network. PIM-DM is more efficient when it is used in a network with dense multicast traffic.
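The lifetime of the pruned state described above can be pictured as a simple timer per interface and (S, G) pair. The sketch below is a simplified model under stated assumptions, not the PIM-DM state machine: the class name `PruneState`, the `hold_time` parameter and its value are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class PruneState:
    """Pruned state of one interface for one multicast (S, G) pair (simplified)."""

    def __init__(self, hold_time):
        self.hold_time = hold_time  # hypothetical lifetime of the pruned state, in seconds
        self.expires_at = None      # None means the interface is forwarding

    def is_pruned(self, now):
        return self.expires_at is not None and now < self.expires_at

    def on_prune(self, now):
        # A Prune message starts the limited-lifetime pruned state.
        self.expires_at = now + self.hold_time

    def on_state_refresh(self, now):
        # In this model a State Refresh only extends an existing pruned state.
        if self.is_pruned(now):
            self.expires_at = now + self.hold_time

    def on_graft(self):
        # A Graft message cancels the pruned state immediately.
        self.expires_at = None


state = PruneState(hold_time=10)
state.on_prune(now=0)
state.on_state_refresh(now=8)      # pruned state now lasts until t = 18
print(state.is_pruned(now=15))     # True
print(state.is_pruned(now=18))     # False, flooding resumes
```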

The PIM-DM protocol uses RPF protection against micro-loops. A multicast packet is accepted by a router only if it passes an RPF check. The RPF check in PIM-DM means that a multicast packet is accepted only if it is received via the interface that is used in unicast communication to reach the source of the multicast transmission. In other words, multicast packets are accepted only if they arrive via the interface that lies on the shortest path, according to the unicast routing table, to the source of the multicast transmission.
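The classical RPF check can be sketched as a longest-prefix lookup of the multicast source in the unicast routing table, followed by a comparison with the arrival interface. This is an illustrative model only: the routing table `ROUTES`, the interface names, and the function names are assumptions made for the example.

```python
import ipaddress


def rpf_interface(unicast_routes, source):
    """Interface the unicast routing table would use to reach the multicast source."""
    address = ipaddress.ip_address(source)
    matches = [(net, ifc) for net, ifc in unicast_routes.items() if address in net]
    if not matches:
        return None
    # Longest-prefix match decides which route is used.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def rpf_check(unicast_routes, source, arrival_interface):
    """Accept the packet only if it arrived on the RPF interface for its source."""
    expected = rpf_interface(unicast_routes, source)
    return expected is not None and expected == arrival_interface


# Hypothetical unicast routing table: prefix -> outgoing interface.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "eth0",
    ipaddress.ip_network("10.1.0.0/16"): "eth1",
}

print(rpf_check(ROUTES, "10.1.2.3", "eth1"))  # True, packet accepted
print(rpf_check(ROUTES, "10.1.2.3", "eth0"))  # False, packet dropped
```

The dependence on `ROUTES` is the point of interest here: when the unicast table is stale, the check can reject packets that arrive over a working detour.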

4. Proposal of New M-REP IPFRR Mechanism
At the beginning of the flooding of a multicast communication, PIM-DM sends packets to all routers with the PIM protocol enabled. This means that multicast packets will reach the destination router independently of any failures. We want to use this specific behavior of the PIM-DM protocol to develop a new IPFRR mechanism.


4.1 Modification of RPF
The original behavior of the RPF mechanism is not compatible with our intended use of RPF in IPFRR. Under some circumstances, a router with the original RPF mechanism may drop our IPFRR communication.

The original RPF mechanism uses information from the unicast routing table to select the correct RPF interface for a specific multicast (S, G) flow. However, in a network where a router or link failure has occurred, the information in the unicast routing table on the routers affected by the failure may not be correct until the process of network convergence is complete. This means that some router on the original path to the destination can drop our IPFRR multicast flow because of the RPF check.

Using a simple modification of the original RPF mechanism, we can flood IPFRR communication around the failed link or router. Our new IPFRR mechanism still uses the RPF mechanism in PIM-DM, but it modifies the part of the RPF mechanism that selects the RPF interface for a specific multicast communication.

4.2 Description of New IPFRR Mechanism
Our new IPFRR mechanism is not designed to provide link or node protection, but to protect a specific unicast flow (flow protection).

We assume that a customer sends an important flow of data to the destination D. If any router on the path to the destination detects a connection failure (link or node), it becomes the source router S.

The source router encapsulates the protected unicast flow into a specific multicast flow (a specific (Source, G) pair), which is immediately flooded to all active interfaces with PIM Dense Mode enabled. The router performs this tunneling of the unicast flow until the process of convergence in the network is complete.

From the moment a link or node failure occurs on the original shortest path between the source and the destination, the routing information in the routing tables is outdated. As a result, the routers do not have current information about the correct RPF interface until the convergence process in the network is complete.

If we retained the original RPF mechanism (the packet must enter via the interface that, according to the routing table, is on the shortest path back to the sender of the packet), one of the routers on the original shortest path could drop our specific multicast flow because of an RPF check failure.

When the first multicast packet for a group G arrives on a specific interface of a router other than S, this interface becomes the RPF interface for our IPFRR multicast flow. The term "first packet" denotes a multicast packet whose processing leads to the creation of a new route record in the multicast routing table for a specific (Source, G) pair. In other words, the RPF interface of every router will be the interface of the first arrival of the specific IPFRR multicast packets for a specific (Source, G) pair (Figure 6). After the selection of the RPF interface, routers forward multicast packets to all other PIM-enabled interfaces (Figure 5).

Each router can have only one RPF interface. It can be proved that, for a network with point-to-point links, our modified RPF mechanism provides loop-free paths to all other routers in the network, including the destination router D. The dissertation contains a mathematical proof that our modification of the original RPF mechanism does not cause micro-loops during the initial flooding of multicast packets.

Figure 5: M-REP IPFRR mechanism.

Figure 6: The rule of first arrival.
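The rule of first arrival can be illustrated with a small event-driven simulation of one flooded packet over point-to-point links. This is a sketch under stated assumptions rather than the simulated implementation: the link list `LINKS` is inferred from the hops that appear in the Table 2 trace (it is not copied from a figure), all link delays are assumed equal, and simultaneous arrivals are ordered by router name.

```python
import heapq


def flood_first_arrival(links, source, failed=()):
    """Flood one packet from source; return {router: upstream neighbor of first arrival}."""
    neighbors = {}
    for (a, b), delay in links.items():
        if a in failed or b in failed:
            continue  # links of a failed router are down
        neighbors.setdefault(a, {})[b] = delay
        neighbors.setdefault(b, {})[a] = delay

    rpf = {}
    events = [(delay, source, nbr) for nbr, delay in neighbors.get(source, {}).items()]
    heapq.heapify(events)
    while events:
        now, sender, receiver = heapq.heappop(events)
        if receiver == source or receiver in rpf:
            continue  # copy arrives on a non-RPF interface and is dropped
        rpf[receiver] = sender  # interface of first arrival becomes the RPF interface
        for nbr, delay in neighbors[receiver].items():
            if nbr != sender:  # forward on all other interfaces
                heapq.heappush(events, (now + delay, receiver, nbr))
    return rpf


def route(rpf, source, destination):
    """Follow RPF interfaces back from destination to source."""
    path = [destination]
    while path[-1] != source:
        path.append(rpf[path[-1]])  # KeyError if the destination was never reached
    return path[::-1]


# Assumed topology with unit delays, inferred from the hops listed in Table 2.
LINKS = {
    ("R1", "R2"): 1, ("R1", "R3"): 1, ("R1", "R4"): 1, ("R2", "R4"): 1,
    ("R3", "R5"): 1, ("R4", "R5"): 1, ("R4", "R6"): 1, ("R5", "R6"): 1,
}

rpf = flood_first_arrival(LINKS, "R1", failed={"R3"})
print(route(rpf, "R1", "R5"))  # ['R1', 'R4', 'R5']
```

Because every router records exactly one upstream neighbor, and that neighbor received the packet strictly earlier, following the recorded neighbors always terminates at the source, which is the intuition behind the loop-free claim for point-to-point links.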

The alternative path, which is created by the selection of the modified RPF interface, is a randomly generated path. In other words, the created path may not be the shortest possible path (IPFRR techniques generally do not provide the shortest alternative paths).

The IPFRR encapsulated packets must be restored to their original format when leaving the network domain. This restoration is performed by the destination router D, which is the router to which the original recipient of the unicast packet is directly connected. Router D performs the necessary decapsulation, which means that the IPFRR multicast distribution tree ends at this router. After this process, the packet can be sent to its original destination.
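The encapsulation at router S and the restoration at router D can be sketched with two small record types. This is only an illustration of the idea of carrying the original packet inside a (Source, G) multicast packet: the class names, field names, and all addresses (documentation prefixes and an example group address) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnicastPacket:
    src: str
    dst: str
    payload: bytes


@dataclass(frozen=True)
class MulticastPacket:
    source: str            # address of router S, the "Source" of the (Source, G) pair
    group: str             # multicast group G reserved for the protected flow
    inner: UnicastPacket   # original packet, carried unchanged


def encapsulate(packet, router_s_address, group):
    """Router S: wrap the protected unicast packet into the (Source, G) flow."""
    return MulticastPacket(router_s_address, group, packet)


def decapsulate(tunneled, directly_connected_hosts):
    """Router D: restore the packet only if its recipient is directly connected."""
    if tunneled.inner.dst in directly_connected_hosts:
        return tunneled.inner
    return None


# Hypothetical addresses used only for this example.
original = UnicastPacket("198.51.100.10", "203.0.113.20", b"data")
tunneled = encapsulate(original, "192.0.2.1", "239.1.1.1")
print(decapsulate(tunneled, {"203.0.113.20"}) == original)  # True
```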


Our modification of the original RPF mechanism for a specific multicast (Source, G) flow does not cause micro-loops between routers connected by point-to-point links. The requirements of our new IPFRR mechanism on the physical network topology are:
- point-to-point links between routers,
- the original destination of the protected flow must be directly connected to router D.

We note that the original pruning process in the PIM-DM protocol is not modified. If a router receives a protected multicast flow for which it has no recipient, it prunes itself from the multicast distribution tree. The final result of the flooding process is a single route from the router S (which performs the tunnelling) to the router D (which performs the restoration).

Tunnelling of IPv4 unicast communication is one of many possible ways to preserve the original source and destination of the packet.

In IPv6, another way to preserve the original source and destination of the protected unicast flow is to store this information in next headers. Router D can then restore it from the next header of the packet.

4.3 Multiple Failures
The PIM-DM protocol uses the Graft message to re-initialize the distribution tree and cancel the Prune state. In classical PIM-DM, the Graft message is sent through the RPF interface. In our mechanism, we need to deal with the loss of the RPF interface, and therefore we have modified the conditions under which the Graft message is used.

In the following example, we show the application of the modified RPF procedure and the Graft mechanism in M-REP. Suppose we have the topology shown in Figure 7, and the first link failure happens on router R2. The router that detects this failure becomes router S and starts encapsulating the unicast communication into the specific multicast flow identified by the (Source, G) pair. Suppose the alternative route created using the rule of first arrival is S → R1 → R2 → R3 → R4 → D. The routers that receive the unwanted multicast communication send the Prune message. When there is another failure in the network, this time of router R4, the router D must restore the alternative route. The question arises as to which interface should be selected on router D as the RPF interface for sending the Graft message to its upstream router.

If the router D selects fa 0/0 as the RPF interface and uses it to send the Graft message, the router R5 cannot use it to restore the alternative route, because it has lost connectivity with its upstream router R4. This means that the router D cannot determine the proper RPF interface in this situation. Therefore the router D does not select any RPF interface for the specific (Source, G) multicast flow, sends the Graft message through all remaining interfaces, and removes the entry for (Source, G) from its own multicast routing table. After receiving the Graft message, the router R6 sets the given interface to forwarding and uses its own RPF interface of the first arrival to send the next Graft message to the upstream router. When the router D receives the specific (Source, G) multicast flow again (clearly from router R6), it sets its new RPF interface. The alternative route is restored and it will be S → R1 → R2 → R3 → R6 → D. Packets will be delivered to their destination only after the multicast distribution tree is restored. Until the specific (Source, G) multicast distribution tree is restored, the packets are discarded by router R3 or any of its upstream routers.

Figure 7: Solution to subsequent failures.

If we modify the previous topology to the topology in Figure 8, a problem may arise if we use the same scenario as in the previous case. Assume that, at the time of the failure of router R4, another link failure occurs between routers R3 and R7.

When router R7 detects a failure on its RPF interface of first arrival, it sends a Graft message to all other PIM-enabled interfaces, which in the given topology means sending a Graft message to router R6. In classical PIM-DM, a router that receives a Graft message via its RPF interface will not accept it. Therefore we must modify the behavior of a router that receives a Graft message via its RPF interface of first arrival. If a router receives a Graft message via its RPF interface of first arrival, it removes the specific (Source, G) record from its multicast routing table, cancels the current RPF interface of first arrival and sends a Graft message to all other remaining interfaces. After the arrival of a new specific (Source, G) packet, the router again sets a new RPF interface based on first arrival. The final alternative route will be S → R1 → R2 → R3 → R6 → D.

Figure 8: Solution to subsequent failures part 2.
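The modified Graft handling described in this section can be summarized as per-router state for one (Source, G) pair. The sketch below models only the decisions stated in the text (loss of the RPF interface, Graft received on a non-RPF interface, Graft received on the RPF interface of first arrival). The class `MRepRouter`, its method names, and the interface names are hypothetical; each method returns the list of interfaces on which a Graft message would be sent.

```python
class MRepRouter:
    """State of one router for a single protected (Source, G) flow (simplified)."""

    def __init__(self, name, interfaces):
        self.name = name
        self.interfaces = set(interfaces)  # PIM-enabled interfaces that are up
        self.rpf = None                    # RPF interface of first arrival
        self.forwarding = set()            # interfaces grafted back to forwarding

    def on_packet(self, interface):
        """Rule of first arrival: the first packet selects the RPF interface."""
        if self.rpf is None:
            self.rpf = interface
        return interface == self.rpf       # True means the packet is accepted

    def on_rpf_failure(self):
        """RPF interface lost: drop the (Source, G) record, Graft on all remaining interfaces."""
        self.interfaces.discard(self.rpf)
        self.rpf = None
        self.forwarding.clear()
        return sorted(self.interfaces)

    def on_graft(self, interface):
        """Return the interfaces on which this router sends its own Graft."""
        if interface == self.rpf:
            # Modified behavior: Graft arrived on the RPF interface of first arrival.
            self.rpf = None
            self.forwarding.clear()
            return sorted(self.interfaces - {interface})
        # Graft from a downstream router: forward again, Graft upstream via own RPF.
        self.forwarding.add(interface)
        return [self.rpf] if self.rpf is not None else []


# Hypothetical interface names for a router with three neighbors.
r6 = MRepRouter("R6", ["if_R3", "if_D", "if_R7"])
r6.on_packet("if_R3")              # first arrival selects if_R3 as RPF interface
print(r6.on_graft("if_D"))         # ['if_R3']
print(r6.on_graft("if_R3"))        # ['if_D', 'if_R7']
print(r6.on_packet("if_R7"))       # True, new RPF interface by first arrival
```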

5. Testing
In this section, we describe the testing of the M-REP IPFRR mechanism in the OMNeT++ simulator (version 4.5). We used the existing implementation of the PIM-DM protocol from the ANSA library (version 2.2) [1] as the basis for the simulations and then implemented the M-REP mechanism functionality in the ANSA library.

In the simulations, Source1 generates the data flow and Host2 is its recipient. The data flow from Source1 to Host2 is protected by the proposed M-REP mechanism. The primary route for this data flow is R1 → R3 → R5 (Figure 9).

We simulate the failure of a whole router that lies on the primary path to the destination. This scenario represents the disconnection of all links connected to the router R3. Table 1 shows the timeline of the R3 failure scenario.

The expected behavior of the M-REP mechanism is that after the R3 failure, it finds an alternative route bypassing the failed router and delivers the protected unicast data flow unchanged to the destination Host2. Figure 9 shows the topology after the R3 failure.

Figure 9: The rule of first arrival.

Table 1: Description of Router Failure Scenario

| No. | Time (sims) | Component | Action |
|---|---|---|---|
| 1 | 50 | Source1 | Source1 sends data to Host2, period 1 sims |
| 2 | 52 | R3 | Failure of router R3 |
| 3 | 54 | R3 | Restoration of router R3 |

Table 2: Network Communication After the Router Failure

| No. | Time (sims) | Action | Type of message |
|---|---|---|---|
| 1 | 50.00001162 | Source1 → R1 | appData |
| 2 | 50.00003497 | R1 → R3 | appData |
| 3 | 50.00005832 | R3 → R5 | appData |
| 4 | 50.00008167 | R5 → Host2 | appData |
| 5 | 51 | Source1 → R1 | appData |
| 6 | 51.00001173 | R1 → R3 | appData |
| 7 | 51.00002346 | R3 → R5 | appData |
| 8 | 51.00003519 | R5 → Host2 | appData |
| 9 | 52 | Source1 → R1 | appData |
| 10 | 52.00001173 | R1 → R2 | appData |
| 11 | 52.00001173 | R1 → R4 | appData |
| 12 | 52.00002346 | R2 → R4 | appData |
| 13 | 52.00002346 | R4 → R2 | appData |
| 14 | 52.00002346 | R4 → R5 | appData |
| 15 | 52.00002346 | R4 → R6 | appData |
| 16 | 52.00003519 | R5 → Host2 | appData |
| 17 | 52.00003519 | R5 → R6 | appData |
| 18 | 52.00003519 | R6 → R5 | appData |
| 19 | 52.000036099999 | R2 → R4 | PIMJoinPrune |
| 20 | 52.000036099999 | R4 → R2 | PIMJoinPrune |
| 21 | 52.000041909999 | R2 → R1 | PIMJoinPrune |
| 22 | 52.000047829999 | R5 → R6 | PIMJoinPrune |
| 23 | 52.000047829999 | R6 → R5 | PIMJoinPrune |
| 24 | 52.000053639999 | R6 → R4 | PIMJoinPrune |
| 25 | 53 | Source1 → R1 | appData |
| 26 | 53.00001173 | R1 → R4 | appData |
| 27 | 53.00002346 | R4 → R5 | appData |
| 28 | 53.00003519 | R5 → Host2 | appData |
| 29 | 54 | Source1 → R1 | appData |
| 30 | 54.00002335 | R1 → R3 | appData |
| 31 | 54.0000467 | R3 → R5 | appData |
| 32 | 54.00005843 | R5 → Host2 | appData |

Router R1 detects the link failure on the original path to the destination and starts modifying the packets destined for Host2. R1 sends the packets to all output interfaces (except the incoming one).

Every router in the network receives the packets (Table 2, rows 9 to 18). Router R5 determines that it is directly connected to the destination and restores the packets as previously described.

The alternative route created using the rule of the first arrival of specific multicast packets is R1 → R4 → R5 (Table 2, rows 9 to 18 and rows 25 to 28). Table 2 shows the communication in the network before and after the failure of router R3.
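The alternative route can be read off the trace mechanically. The sketch below takes rows 25 to 28 of Table 2 (the packet sent at 53 sims, after pruning has finished) and chains the hops. The function names are hypothetical, and the chaining assumes each router forwards the packet to exactly one next hop, which holds for these rows but not for the flooding phase in rows 9 to 18.

```python
# Rows 25 to 28 of Table 2: (time in sims, sender, receiver, message type).
TRACE = [
    (53.0, "Source1", "R1", "appData"),
    (53.00001173, "R1", "R4", "appData"),
    (53.00002346, "R4", "R5", "appData"),
    (53.00003519, "R5", "Host2", "appData"),
]


def delivered_path(trace, start, end):
    """Chain sender -> receiver hops of application data from start to end."""
    next_hop = {sender: receiver
                for _time, sender, receiver, kind in sorted(trace)
                if kind == "appData"}
    path = [start]
    while path[-1] != end:
        path.append(next_hop[path[-1]])
    return path


def end_to_end_delay(trace):
    """Time between the first and the last logged event of the packet."""
    times = [row[0] for row in trace]
    return max(times) - min(times)


print(delivered_path(TRACE, "Source1", "Host2"))
# ['Source1', 'R1', 'R4', 'R5', 'Host2']
print(f"{end_to_end_delay(TRACE):.8f} sims")  # 0.00003519 sims
```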

This scenario represents a situation that often occurs in real ISP operation. The test has confirmed that the M-REP mechanism is able to protect the specific unicast flow against router failure. The dissertation also contains other tests, e.g. subsequent failures of links or routers in the network.

6. Benefits of M-REP IPFRR Mechanism
The main benefit of our new IPFRR mechanism is that it is independent of pre-computation. The alternative path is not calculated by an internal algorithm as in existing IPFRR mechanisms. Due to this feature, we can say that the M-REP IPFRR mechanism is currently unique.

According to our analysis of existing IPFRR mechanisms, only a few are implemented in router operating systems. Because it uses the existing multicast protocol PIM-DM with only a minimal modification of the RPF logic and a modification of the Graft mechanism, the M-REP mechanism can be implemented in real routers.

In the following sections we discuss important benefits of our M-REP mechanism, problem areas and future research.

6.1 Independent of Pre-computation
Most existing IPFRR mechanisms are based on pre-computation of alternative paths. The M-REP mechanism does not require pre-computation of an alternative route. We use a specific multicast address and the flooding/pruning process in PIM-DM to send protected traffic around the failed link or router.

For existing IPFRR mechanisms, the network size affects the amount of pre-computation: the larger the network, the more preparatory calculations of alternative backup paths are needed.

The proposed M-REP mechanism is independent of pre-computation of an alternative path, which means that in this respect it is not affected by the size of the network.

6.2 Independent of Routing ProtocolsWith pre-computation of alternative route there is alsothe related dependence on routing protocols. Some ofthe existing IPFRR mechanisms require network topologyinformation for calculation of alternative backup path.That means, they are dependent on usage of link-staterouting protocols. M-REP IPFRR mechanism does notrequire preparatory calculations of alternative route andconstruction of multicast distribution tree is not alsobased on information from unicast routing table, whichmeans, that it is independent of routing protocols. Itsupports static routing, distance-vector routing protocolsand link-state routing protocols.

6.3 100% Repair Coverage

Another advantage of the M-REP IPFRR mechanism is that it solves the problem of multiple failures in the network. The protected unicast flow, which is encapsulated as specific multicast traffic, floods via the functional links in the network. When multiple link or node failures occur and only one possible path from source to destination remains, the M-REP IPFRR mechanism is able to find it and use it. In other words, the M-REP mechanism provides 100 percent repair coverage.
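The coverage argument can be illustrated with a minimal sketch, which is not the M-REP implementation itself: flooding is modelled as a breadth-first search over the links that are still functional. The topology `LINKS`, the router names and the function `flood_path` are hypothetical and serve only as an illustration.

```python
from collections import deque


def flood_path(links, failed, source, destination):
    """Breadth-first flood from source over links that are still up.

    Returns one surviving path as a list of routers, or None if the
    failures have partitioned the network.
    """
    # Build an adjacency map from the undirected links that did not fail.
    failed_set = {frozenset(link) for link in failed}
    neighbours = {}
    for a, b in links:
        if frozenset((a, b)) in failed_set:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # parent[r] is the router from which the flood first reached r.
    parent = {source: None}
    queue = deque([source])
    while queue:
        router = queue.popleft()
        if router == destination:
            path = []
            while router is not None:
                path.append(router)
                router = parent[router]
            return path[::-1]
        for nxt in sorted(neighbours.get(router, ())):
            if nxt not in parent:
                parent[nxt] = router
                queue.append(nxt)
    return None


# Hypothetical six-link topology used only for illustration.
LINKS = [("S", "A"), ("A", "D"), ("S", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
```

With links A-D and S-B failed, the flood still reaches D over S, A, C, D. If both links into D fail, no path exists and the function returns `None`, so coverage is complete only in the sense that a path is found whenever one survives.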

6.4 Subsequent Link or Node Failures in Network

Existing IPFRR mechanisms are tested mostly against a single link or node failure. The M-REP mechanism is based on the multicast protocol PIM-DM with a modified RPF and Graft mechanism, and it was tested against multiple failures, e.g. subsequent failures occurring within the network at different times. Simulations in the OMNeT++ simulator showed that the proposed mechanism can provide an alternative path after multiple, e.g. subsequent, link or node failures.

6.5 Simple Implementation

The M-REP mechanism uses the existing multicast protocol PIM-DM. With a simple modification of the RPF logic for selecting the RPF interface and a modification of the Graft mechanism in this protocol, we use its native behavior in the new area of IPFRR. The PIM-DM protocol is currently supported by many important router manufacturers. One advantage of the M-REP mechanism is thus its simple implementation.

6.6 Problem Areas and Future Research

The problem areas of the M-REP mechanism are related to its specific requirements on the physical topology of the network. The first requirement is the existence of point-to-point links between routers.

6.6.1 Multi-access Networks

The proposed M-REP mechanism uses modified RPF logic, so that the RPF interface on a router is determined by the rule of first arrival. This means that for our specific IPFRR multicast group we do not use information from the unicast routing table, which is based on shortest paths to destinations. If the M-REP mechanism is used in multi-access networks, micro-loops can occur.

Consider the topology in Figure 10, where router S sends an IPFRR packet onto a multi-access network. If router R first receives the IPFRR packet via interface fa 0/0 from router D, then according to our RPF rule of first arrival this interface will be chosen as the RPF interface. A micro-loop might then occur between routers D and R.

This issue therefore needs further research. We note that some existing IPFRR mechanisms also have problems in multi-access networks (for example Remote LFA).

6.6.2 Random Alternative Route

Existing IPFRR mechanisms use the SPF algorithm to calculate the alternative shortest path from the source to destinations. These calculations consume the processing power of the router, but the calculated alternative route is the shortest possible route.

Figure 10: Micro-loops in multi-access segment.


The original PIM-DM protocol creates a Shortest Path Tree (SPT). These trees are created by the RPF mechanism in PIM-DM. The RPF mechanism in M-REP does not use information from the unicast routing table to verify the correct RPF interface; instead, the RPF interface is selected by the rule of first arrival of the specific (Source, G) multicast packet. Multicast packets reach the destination, but it is not possible to guarantee that they travel along the shortest path.
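The rule of first arrival can be sketched as follows. This is a simplified illustration under our own assumptions, not the modified PIM-DM code: the class `MRepRouter`, its method `accept` and the addresses in the usage note are hypothetical.

```python
class MRepRouter:
    """Toy router that picks its RPF interface by the rule of first arrival."""

    def __init__(self, name):
        self.name = name
        # Maps a (source, group) pair to the interface chosen as RPF interface.
        self.rpf_interface = {}

    def accept(self, source, group, in_interface):
        """Return True if the packet passes the modified RPF check."""
        key = (source, group)
        # The first packet of the flow fixes the RPF interface; the unicast
        # routing table is never consulted.
        chosen = self.rpf_interface.setdefault(key, in_interface)
        return chosen == in_interface
```

For example, if a router first receives the flow ("10.0.0.1", "239.1.1.1") on interface "fa0/1", later copies of the same flow arriving on "fa0/0" fail the check and are dropped, while copies on "fa0/1" are accepted. Because the choice depends only on arrival order, the resulting tree need not follow the shortest path.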

The second requirement of the M-REP mechanism is that router D must have the destination directly connected to one of its output interfaces. The original unicast flow of data is delivered to this destination, and this requirement identifies the destination router. The specific multicast (Source, G) tree is created from the source to the destination router.

6.6.3 Packet Encapsulation

In IPv4, our mechanism encapsulates the protected unicast flow with an additional IP header (multicast tunneling). In IPv6, the same principle can be used, or a new IPv6 header can be proposed for this purpose. Modification of the original packets brings problems with MTU, increased CPU load and other problems. However, it should be noted that most existing IPFRR mechanisms also encapsulate packets or modify specific bits in the packet header. Encapsulation of packets is one of the biggest disadvantages of existing IPFRR mechanisms, but at present this technique is the most common.
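The MTU problem can be made concrete with a short calculation. The sketch assumes an outer IPv4 header of the minimum size of 20 bytes (no options) and a link MTU of 1500 bytes; both values and the function names are assumptions for illustration, not parameters taken from the M-REP design.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header size, assuming no options


def encapsulated_size(inner_packet_bytes, outer_header_bytes=IPV4_HEADER_BYTES):
    """Size of the packet after one additional IP header is prepended."""
    return inner_packet_bytes + outer_header_bytes


def exceeds_mtu(inner_packet_bytes, mtu_bytes=1500):
    """True if the encapsulated packet no longer fits into the link MTU."""
    return encapsulated_size(inner_packet_bytes) > mtu_bytes


def max_inner_packet(mtu_bytes=1500, outer_header_bytes=IPV4_HEADER_BYTES):
    """Largest original packet that can be encapsulated without fragmentation."""
    return mtu_bytes - outer_header_bytes
```

Under these assumptions, a full-sized 1500-byte packet grows to 1520 bytes and no longer fits, while packets of up to 1480 bytes can be encapsulated without fragmentation.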

6.6.4 Flooding/Pruning Process in PIM-DM

At the beginning of a multicast transmission, the PIM-DM protocol sends multicast packets to all routers in the administrative domain (flooding process). Routers that do not have recipients for the specific multicast communication prune themselves from the multicast distribution tree (pruning process). Besides these processes, PIM-DM periodically sends "Hello" messages to other routers. The flooding and pruning processes bring unnecessary load into the network. However, the M-REP IPFRR mechanism encapsulates the protected unicast flow into the specific (Source, G) multicast flow only until the process of network convergence is complete. Once the network has converged, the routing protocol routes the packets again.
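The effect of pruning on a flooded tree can be sketched as below. This is an abstract model only: it ignores PIM-DM timers, Prune and Graft messages and state refresh, and the tree `FLOOD_TREE` and function `flood_and_prune` are hypothetical.

```python
def flood_and_prune(children, receivers, root):
    """Return the set of routers left on the tree after pruning.

    children maps each router to the routers it flooded to (the tree formed
    during flooding); receivers is the set of routers with interested hosts.
    """
    on_tree = set()

    def keep(router):
        # A router stays on the tree if it has a receiver itself or if at
        # least one downstream branch stays on the tree.
        kept_below = [keep(child) for child in children.get(router, [])]
        stays = router in receivers or any(kept_below)
        if stays:
            on_tree.add(router)
        return stays

    keep(root)
    on_tree.add(root)  # the source router always remains the root
    return on_tree


# Hypothetical tree created by flooding from source router S.
FLOOD_TREE = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
```

If only router D has a recipient, routers B, C and E prune themselves and the tree shrinks to S, A and D, which corresponds to a single backup path from the source to the destination router.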

6.7 Future Research

Companies such as Cisco Systems and Juniper Networks focus their future development on IPFRR technology, because the requirements on ISPs grow every day. Future research should focus on further validation of the proposed M-REP mechanism, on solving the problem areas and on implementation in the experimental environment Quagga. The current requirements of the M-REP mechanism for point-to-point links and a directly connected destination may sometimes be limiting.

The proposed M-REP IPFRR mechanism is designed to protect a specific unicast flow. Further research should therefore verify whether it is possible to protect all flows whose primary route leads through the failed link.

7. Conclusion

This work presents a new M-REP IPFRR mechanism that solves some of the disadvantages of existing IPFRR mechanisms. The new M-REP IPFRR mechanism relies on an innovative use of the multicast PIM-DM protocol. At the beginning of a multicast transmission (flooding), the PIM-DM protocol delivers the multicast data to every PIM router in the network (regardless of failure). This is the specific property of the PIM-DM protocol that we used when designing the new M-REP mechanism. The mechanism is primarily designed to protect a specific unicast flow using a backup multicast distribution tree defined by a unique multicast address.

As previously mentioned, the majority of IPFRR mechanisms require pre-computation of alternative routes for various failures of links or routers in the network. These preparatory calculations have undesired effects, e.g. load on the router CPU, dependence on link-state routing protocols, etc.

The M-REP mechanism does not require preparatory calculations of backup routes, because it uses the flooding process of the PIM-DM protocol to flood the multicast communication. With respect to the specific conditions and purposes under which this system is supposed to operate, it was necessary to modify the RPF mechanism in PIM-DM. The RPF interface is chosen based on the arrival of the first packet with the special multicast address. All routers in the administrative domain must have exactly one RPF interface for the specific (Source, G) flow.

The PIM-DM protocol with the modified RPF mechanism explicitly creates a tree, which means that no routing loops are created among the routers. The alternative route is created using the rule of first arrival of the specific multicast packet. The protected unicast communication is encapsulated until the network convergence process is completed.

The new M-REP IPFRR mechanism removes the need for pre-computation found in existing IPFRR mechanisms, the dependence on routing protocols and the problem of multiple failures in the same network, e.g. subsequent failures. Other advantages are the 100% repair coverage and simple implementation in the existing operating systems of routers.

Acknowledgements. This paper is the outcome of the project "Quality education by supporting innovative forms, quality research and international cooperation - a successful graduate for practice", ITMS code 26110230090, supported by the Education Operational Program funded by the European Social Fund.

References

[1] Fakulta informacných technológií VUT Brno. ANSA extension above INET framework for OMNeT++, 2014. https://github.com/kvetak/ANSA.

[2] A. Adams, J. Nicholas, and W. Siadak. Protocol Independent Multicast - Dense Mode (PIM-DM): Protocol Specification (Revised). RFC 3973, Network Working Group, pages 4–10, 2010.

[3] S. Antonakopoulos, Y. Bejerano, and P. Koppol. A simple IP fast reroute scheme for full coverage. Bell Labs, Murray Hill, USA, page 1, 2012.

[4] A. Atlas. U-turn Alternates for IP/LDP Fast-Reroute. Google, Internet-Draft, Network Working Group, pages 1–8, 2006.

[5] A. Atlas and A. Zinin. Basic Specification for IP Fast Reroute: Loop-Free Alternates. Alcatel-Lucent, RFC 5286, Standards Track, Network Working Group, pages 3–5, 2008.

[6] S. Bryant, C. Filsfils, S. Previdi, and M. Shand. IP Fast Reroute using tunnels. Cisco Systems, Network Working Group, Internet-Draft, pages 1–10, 2010.

[7] S. Deering. Host Extensions for IP Multicasting. Stanford University, RFC 1112, Network Working Group, pages 1–5, 2006.

[8] B. Fenner, M. Handley, H. Holbrook, and I. Kouvelas. Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised). RFC 4601, Standards Track, Network Working Group, pages 1–146, 2006.

[9] M. Gjoka, V. Ram, and X. Yang. Evaluation of IP Fast Reroute Proposals. COMSWARE 2007. 2nd International Conference, pages 1–8, 2007.

[10] A. T. Hassan. Evaluation of Fast Reroute Mechanisms in Broadband Networks. Master of Electrical and Computer Engineering, University of Ottawa, page 1, 2007.

[11] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. NextHop Technologies, RFC 2992, Informational, Network Working Group, pages 1–5, 2000.

[12] D. Katz and D. Ward. Bidirectional Forwarding Detection (BFD). Juniper Networks, Request for Comments: 5880, Standards Track, IETF, ISSN: 2070-1721, pages 1–50, 2010.

[13] M. Shand and S. Bryant. IP Fast Reroute Framework. RFC 5714, Internet Engineering Task Force, Informational, ISSN: 2070-1721, Cisco Systems, pages 5–7, 2010.

Selected Papers by the Author

J. Papán, P. Segec, P. Palúch. Utilization of PIM-DM in IP Fast Reroute. In ICETA 2014: 12th IEEE Int. Conf. on Emerging eLearning Technologies and Applications, IEEE, 373–378, 2014.

J. Papán, P. Segec, P. Palúch. Tunnels in IP Fast Reroute. In Digital Technologies: The 10th International Conference, ISBN 978-1-4799-3301-3, IEEE, 281–285, 2014.

J. Papán, P. Segec, P. Palúch. Multicast in IP Fast Reroute. In ELEKTRO 2014: Proceedings of 10th International Conference, ISBN 978-1-4799-3720-2, IEEE, 81–85, 2014.

P. Segec, P. Palúch, J. Papán, M. Kubina. The integration of WebRTC and SIP: Way of Enhancing Real-time, Interactive Multimedia Communication. In ICETA 2014: 12th IEEE International Conference on Emerging eLearning Technologies and Applications, ISBN 978-1-4799-7739-0, IEEE, 437–442, 2014.

J. Papán, M. Jurecka, J. Milanová. WSN for Forest Monitoring to Prevent Illegal Logging. In FedCSIS: Proceedings of the Federated Conference on Computer Science and Information Systems, ISBN 978-83-60810-51-4, IEEE, 809–812, 2012.

M. Drozdová, M. Mokryš, M. Kardoš, Z. Kurillová, J. Papán. Change of Paradigm for Development of Software Support for eLearning. In ICETA 2012: 10th IEEE International Conference on Emerging eLearning Technologies and Applications, IEEE, 2012.

Promoting Sustainability and Transferability of Community Question Answering

Ivan Srba∗

Institute of Informatics, Information Systems and Software Engineering
Faculty of Informatics and Information Technologies
Slovak University of Technology in Bratislava
Ilkovicova 2, 842 16 Bratislava, Slovakia

[email protected]

Abstract

Community Question Answering (CQA) provides people with a possibility to ask various questions and, at the same time, to provide answers to questions of other users (e.g. Yahoo! Answers). Our thesis is concerned with two open emerging problems closely related to the CQA concept: (1) the long-term sustainability of CQA ecosystems, and (2) their transferability to educational and organizational environments.

At first, we conducted a case study on the recent negative development of Stack Overflow’s community, which is reflected in an increasing amount of low-quality content created by undesired groups of users. Consequently, we suggested preserving the long-term sustainability of CQA communities by means of robust reputation mechanisms and answerer-oriented adaptive support methods that, in addition, involve the whole community. We put these suggestions into practice by means of two novel methods: (1) for reputation calculation focused on the quality of users’ contributions, and (2) for recommendation of new questions to potential answerers with utilization of non-QA data.

Our main contribution to the second open problem lies in the introduction of a novel organization-wide educational CQA system, Askalot, which takes educational as well as organizational specifics into consideration.

Categories and Subject Descriptors

H.3.2 [Information Storage and Retrieval]: Information Search and Retrieval; K.3.1 [Computers and Education]: Computer Uses in Education

∗Recommended by thesis supervisor: Prof. Maria Bielikova
Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on June 29, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.

Srba, I. Promoting Sustainability and Transferability of Community Question Answering. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 10-16

Keywords

Community question answering, knowledge sharing, sustainability, educational domain, adaptive collaboration support

1. Introduction

Standard information retrieval tools, such as Google Search, represent the most popular way to search for required information on the Web. However, the success of these tools decreases when a user wants to find information that is highly context-specific, subjective (e.g. a recommendation), scattered across many sources, or that cannot be easily described by keywords. In these cases, Internet users can utilize alternative tools that are based on knowledge sharing in large online communities of people. One of the most successful examples of these community-based knowledge sharing systems is Community Question Answering (CQA).

Knowledge sharing in CQA systems takes place in four main steps (see Figure 1):

1. Question Creation. Any member of the community is able to post a new question by providing its name and detailed description; usually it is also necessary to assign it to a hierarchy of categories or tags. In contrast to standard information retrieval tools, the description of the question is not limited to keywords, and thus an asker can define his/her information need more precisely.

2. Question Answering. As soon as the question is posted, all other members of the community can provide their answer candidates, vote for the best answer, vote for the question (if they consider it interesting) or provide additional comments.

3. Question Closing. If a correct answer is obtained, the asker can mark one of the provided answers as the accepted one.

4. Question Search. After the best answer is accepted, the corresponding question is moved to the archive of solved questions, where it can be retrieved if additional users seek the same information in the future.

Figure 1: Standard question lifecycle in CQA systems.

Some of the existing CQA systems allow users to ask questions without any topic restriction, such as Yahoo! Answers or Wiki Answers. On the other hand, there are CQA systems focused on specific topic areas, for example Stack Overflow, which is concerned only with questions related to programming.

The first CQA systems (e.g. Yahoo! Answers, established in 2005) emerged as a result of the rapid development of Web 2.0. Since then, they have gained great popularity, and nowadays they contain communities with millions of users who collaborate to provide answers to thousands of new questions asked each day.

1.1 Two Perspectives on Community Question Answering¹

In order to understand the principles and concepts of CQA systems better, we can describe the question answering process from two perspectives. In the first perspective, CQA systems can be characterized as information systems fundamentally based on knowledge sharing; more specifically, they utilize a number of modern theories of how online communities work, such as communities of practice, collective intelligence, wisdom of the crowd, social interaction, crowdsourcing or human computation.

At the same time, we recognized that the question answering process is actually a specific type of informal learning. Therefore, CQA systems can also be perceived from the alternative perspective of community-based collaborative learning. In this second perspective, we can characterize CQA systems by means of theories related to technology enhanced learning, such as computer-supported collaborative learning, peer-learning or knowledge building communities.

1.2 Collaboration Support in CQA Systems¹

The overall success and popularity of CQA systems attract researchers from many areas, mainly from computer science, psychology and sociology. As a result, CQA systems have become the subject of many research publications, which also form the basis for our thesis.

¹ The results summarized in this section have been published in I. Srba, M. Bielikova, A Comprehensive Survey and Classification of Approaches for Community Question Answering.

[Plot: number of research papers (y-axis, 0 to 60) per year (x-axis).]

Figure 2: Number of research papers dealing with CQA systems covered by our survey. The last year, 2014, covers all papers that were available in digital libraries before February 2015.

However, in spite of the great number of research papers published during the 10-year-long history of research on CQA systems, this area lacks a comprehensive survey that reflects the state of the art. The absence of such a survey has had many negative consequences (e.g. missing established terminology, difficult orientation in the area, especially for novices). In order to address these drawbacks, we proposed the first complex classification and survey of research problems solved in CQA systems.

To achieve the best possible coverage of our survey, we looked up papers explicitly aimed at CQA systems in digital libraries (ACM DL, IEEE Xplore, Springer Link and Science Direct). We then supplemented the list of found papers with additional publications referenced in the related work. Finally, we obtained a list consisting of 265 papers created before the end of 2014 (see Figure 2).

Among the obtained research papers, we also identified a few surveys; however, they were published several years ago and thus did not reflect the state-of-the-art approaches (e.g. [7]), or they were focused on only one specific problem (e.g. question routing in [2]).

In order to prepare solid foundations for our survey, we first proposed a descriptive framework. Following the analyses of the 265 obtained papers, we identified a set of attributes that characterize research papers and their contributions (i.e. category of approach, subject of research, type of solved problem, input information, gold standard, algorithm, evaluation metrics and dataset). We then utilized this descriptive framework to propose a complex three-level hierarchy of tasks solved in CQA systems. On the first level, we divided the approaches according to category of approach into three groups: (1) exploratory studies, (2) content and user modelling, and (3) adaptive support. On the second and the third level, we categorized approaches according to subject of research and type of solved problem, respectively. In each of these groups of approaches, we described several representative approaches using the remaining attributes from the descriptive framework.

1.3 Open Problems and Thesis Goals

Following the state of the art in the area of CQA systems, we identified open problems that result from the constant development of these systems.

• Absence of approaches addressing emerging problems of CQA ecosystems. In spite of their overall popularity and success, some of the most popular CQA systems have recently experienced a negative development of their content and community. The most prominent problems, which significantly hinder the question answering process, are a rapidly increasing amount of low-quality content and a growing number of undesired groups of users who purposefully abuse CQA systems (e.g. in order to quickly solve their problems without returning the received help back to the community). In spite of the great effort in supporting collaboration of users, the existing state-of-the-art approaches do not sufficiently address these negative trends. Moreover, some of the approaches for providing users with collaboration support even indirectly support these undesired groups of users. The main reason for this discrepancy is that collaboration support is primarily aimed at askers and their goal of receiving answers in the shortest possible time. On the other hand, answerers and their expectations are not taken into consideration sufficiently. In addition, existing approaches involve in the question answering process only a small subset of highly active and expert users, while the rest of the community is usually left unutilized.

• Undiscovered potential of educational CQA systems. In addition, in spite of many positive results of CQA systems on the open Web, their beneficial effects have not yet been fully discovered in other environments. Nowadays, we witness initial efforts to take advantage of their concepts in a business context. Question answering in CQA systems can be perceived, however, not only as a process of knowledge sharing, but also as a specific kind of informal collaborative learning. Therefore, CQA systems also offer interesting and undiscovered learning potential that can be utilized especially in the educational domain. This potential is obvious particularly at the organizational level, as students quite often struggle with various problems related to a learning process or learning materials that cannot be answered easily in CQA systems outside their educational organization.

We aim to address these open problems by exploring the sustainability and transferability of community question answering. In particular, our thesis goals are:

• Goal 1: Proposal and evaluation of new methods for preservation of CQA sustainability. In order to suppress the negative consequences of the current development in the most popular CQA systems and to maintain the long-term sustainability of their ecosystems, our first goal is to investigate the emerging problems in more detail. Consequently, we aim to propose novel methods that can support the collaboration between users and at the same time contribute to the long-term sustainability of CQA systems. For example, the recommendation of questions to potential answerers represents a possibility to motivate and involve all kinds of answerers (not only active and expert ones) with respect to their interests in particular topics, question difficulty, etc.

• Goal 2: Adapting successful concepts of CQA systems to the specifics of organization-wide and educational environments. Our second goal is to examine how verified CQA systems can be adapted in two transitions: (1) from a non-educational to an educational context; and (2) from the open Web to an organizational environment. Consequently, we introduce the novel organization-wide educational CQA system Askalot, which is specifically designed to support the question answering process while taking organizational as well as educational specifics into consideration.

2. Proposed Solutions for Preservation of CQA Sustainability²

With the aim of describing the emerging problems in CQA communities more precisely, we conducted a case study on the CQA system Stack Overflow. At first, we evaluated the community perception in Meta Stack Overflow (a specific part of Stack Overflow that is dedicated to questions about the system itself). Starting from the year 2014, it is possible to witness an increasing trend of questions that point out a negative development of the community. The community identified three main groups of undesired users:

1. Help vampires, who create a great number of questions without any effort to find the required information by means of standard information retrieval tools; they are interested only in getting their questions answered and do not return any received help back to the community.

2. Noobs, who create trivial and low-quality questions.

3. Reputation collectors, who purposefully answer mainly low-quality and uninteresting questions (mostly created by the previous two groups of users) in order to gain as much reputation as possible.

² The results summarized in this section have been published in I. Srba, M. Bielikova, Why Is Stack Overflow Failing? Preserving Sustainability in Community Question Answering, 2016.


Consequently, we supported and statistically confirmed the community perception by easily reproducible quantitative analyses, which are also suitable for monitoring the community evolution in the future.

As the solution to this negative trend, we proposed to change the standard reputation mechanisms and to research new methods for adaptive collaboration support that are primarily answerer-oriented (since most of the existing methods are asker-oriented) and that involve the whole community (since most of the existing methods involve only a small part of highly active experts).

In order to verify our suggestions, we proposed and evaluated (1) a method for reputation calculation with consideration of content quality and question difficulty; and (2) a method for question routing with consideration of non-QA data (i.e. data that are not the result of the question answering process itself).

2.1 Reputation Based on Content Quality and Difficulty³

User reputation in CQA systems represents the global value of a user for the community, and it reflects his/her expertise and activity in the system. The existing methods for reputation estimation, however, emphasize mainly user activity, and thus they very often give high reputation to very active users (regardless of the real quality of their contributions). The same problem is also present in rule-based reputation mechanisms employed in CQA systems (e.g. in Stack Overflow).

We proposed a new method for reputation calculation, which puts emphasis on the level of user expertise. In other words, users gain more reputation for asking difficult questions and for providing high-quality answers to difficult questions. The correctness of the proposed method was evaluated on two independent datasets from the Stack Exchange platform. Experimental results showed that our method achieved better results in comparison with the original method for reputation calculation in the Stack Exchange platform as well as in comparison with other metrics proposed in previous research papers, e.g. Z-score [8]. In addition, in contrast to Stack Exchange reputation, the distribution of reputation calculated by our method approximately follows a Gaussian (normal) distribution (which corresponds to the expectation that the majority of users have an average level of expertise).
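The difference between activity-based and expertise-based reputation can be illustrated with a short sketch. The Z-score is commonly defined in the CQA literature as (a - q) / sqrt(a + q), where a and q are the numbers of answers and questions posted by a user. The function `weighted_reputation` is not the method proposed in the thesis; it is a hypothetical stand-in that multiplies an assumed quality score by an assumed difficulty score, both normalised to [0, 1].

```python
import math


def z_score(answers, questions):
    """Activity-based baseline: (a - q) / sqrt(a + q); 0.0 for inactive users."""
    total = answers + questions
    if total == 0:
        return 0.0
    return (answers - questions) / math.sqrt(total)


def weighted_reputation(contributions):
    """Hypothetical reputation: sum of quality * difficulty per contribution.

    Each contribution is a (quality, difficulty) pair; both values are
    assumed to lie in the range [0, 1].
    """
    return sum(quality * difficulty for quality, difficulty in contributions)


# Two illustrative users: a very active one with low-quality answers to easy
# questions, and an expert with a few high-quality answers to hard questions.
ACTIVE_USER = [(0.2, 0.1)] * 100
EXPERT_USER = [(0.9, 0.9)] * 5
```

In this illustration the activity-based Z-score ranks the very active user higher (100 answers versus 5), whereas the quality- and difficulty-weighted sum ranks the expert higher, which is the kind of shift the text above describes.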

2.2 Question Routing Based on Non-QA Data⁴

In order to evaluate our remaining suggestions for preservation of CQA sustainability, we proposed a novel method for question routing (i.e. recommendation of new questions to potential answerers). On the basis of the state-of-the-art analyses, we found that almost all existing question routing methods work solely with QA data (i.e. data that are the result of the question answering process, mainly logs about asked questions and provided answers).

³ The results summarized in this section have been published in A. Huna, I. Srba, M. Bielikova, Exploiting Content Quality and Question Difficulty in CQA Reputation Systems, 2016.

⁴ The results summarized in this section have been published in I. Srba, M. Grznar, M. Bielikova, Utilizing Non-QA Data to Improve Questions Routing for Users with Low QA Activity in CQA, 2015.

This, however, means that these methods are able to route questions only to a small part of the whole community, which consists of highly active and expert users. On the other hand, the big potential of the rest of the community (mainly newcomers and lurkers) is left unutilized. If it is possible to recommend questions also to these users, we can motivate them to participate in question answering more actively and thus also contribute to the long-term sustainability of CQA ecosystems.

In order to achieve this shift, we proposed to consider during question routing not only QA data but also non-QA data (i.e. user information that is publicly available inside or outside of CQA systems). Non-QA data have already been utilized in CQA systems, however, only to determine social attributes (e.g. [3]) or to determine user expertise with simple term vectors [4], which have already been outperformed in other works by models based on latent topics.

To fill this gap, we proposed and implemented a method for question routing that connects the utilization of non-QA data with verified state-of-the-art user expertise modelling by means of latent topics (LDA). The recommendation is performed in four steps: (1) construction of question profiles; (2) construction of non-QA data profiles; (3) construction of user profiles; and finally, (4) matching of question and user profiles.
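The four steps can be sketched as follows, assuming that topic distributions have already been inferred by a topic model such as LDA. The linear blending of QA and non-QA profiles, the weight `non_qa_weight`, the cosine matching and all example values are our assumptions for illustration; the text above does not specify how the profiles are combined or matched.

```python
import math


def combine_profiles(qa_topics, non_qa_topics, non_qa_weight=0.5):
    """Step 3: blend QA and non-QA topic distributions into a user profile."""
    return [
        (1.0 - non_qa_weight) * qa + non_qa_weight * non_qa
        for qa, non_qa in zip(qa_topics, non_qa_topics)
    ]


def cosine(u, v):
    """Cosine similarity of two topic vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route_question(question_topics, user_profiles, top_k=2):
    """Step 4: rank users by similarity between question and user profiles."""
    ranked = sorted(
        user_profiles,
        key=lambda user: cosine(question_topics, user_profiles[user]),
        reverse=True,
    )
    return ranked[:top_k]


# Steps 1 and 2 are assumed to have produced these three-topic distributions.
USERS = {
    "expert": combine_profiles([0.8, 0.1, 0.1], [0.6, 0.2, 0.2]),
    "lurker": combine_profiles([0.0, 0.0, 0.0], [0.1, 0.8, 0.1]),
    "other": combine_profiles([0.1, 0.1, 0.8], [0.2, 0.2, 0.6]),
}
```

In this toy example the user "lurker" has no QA activity at all, yet a question dominated by the second topic is still routed to this user because of the non-QA profile.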

We experimentally evaluated the proposed method on a dataset from the CQA system Android Enthusiasts. We compared three versions of our method, which consider QA data, non-QA data and their combination. The results showed that non-QA data improved the precision of recommendation for all kinds of users, and not only for those with a low level of QA data as we originally hypothesized.

3. Utilization of CQA Systems in Organizational and Educational Environment⁵

In the second part of our dissertation thesis, we investigated the transferability of CQA systems to additional domains. In spite of the large number of research papers, just a few of them are concerned with the utilization of CQA systems in an organizational environment, e.g. [5]. Specifically, educational organizations represent an interesting area, where CQA systems have the potential to improve knowledge sharing among students as well as communication with a teacher.

Utilization of CQA systems in education is not, however, straightforward. Standard open CQA systems are not appropriate to support learning. Some of them even prohibit asking questions related to homework or assignments (e.g. Stack Overflow). Moreover, an organizational environment has many specifics that on the one hand make question answering more difficult (e.g. a higher probability of expert overload), and on the other hand provide new possibilities (e.g. presence of a teacher, a possibility to ask questions closely related to the organization).

⁵ The results summarized in this section have been published in I. Srba, M. Bielikova, Askalot: Community Question Answering as a Means for Knowledge Sharing in an Educational Organization, 2015 and in I. Srba, M. Bielikova, Design of CQA Systems for Flexible and Scalable Deployment and Evaluation, 2016.


Figure 3: Askalot system in the context of existing CQA systems.

We built on these organizational and educational specifics and proposed a new concept of a university-wide educational CQA system. In order to verify its feasibility, we designed and implemented the CQA system Askalot⁶, which specifically supports collaborative learning in communities of learners across the whole organization. Existing educational CQA systems (see Figure 3) focus on question answering either in open communities outside an organization, e.g. OpenStudy [6], or inside organizations, however, only at class level, e.g. Green Dolphin [1].

The CQA system Askalot specifically supports two groups of users: students and teachers. It provides them with standard question answering functions (asking a question, posting answers and comments, voting, best answer selection, see Figure 4), but also with more advanced community features (e.g. community profiles, following, see Figure 5) and workspace awareness (e.g. dashboard, activity feed, complex notification system). In addition, teachers can see statistics describing how well students perform during question answering.

Besides Askalot’s primary goal to support educational question answering, it can also be characterized as an open experimental platform. It is built on an experimental infrastructure, which allows adaptive collaboration support methods to be implemented and evaluated in a simple and effective way. The experimental infrastructure can be used in offline experiments with datasets coming from Askalot itself or even with datasets from any system in the Stack Exchange platform, or in live experiments with a community of students in Askalot.

Askalot is implemented as an open-source web application⁷, which provides a responsive user interface so it can be used on personal computers as well as on mobile devices. The development of the system is test-driven, with test coverage at 90%.

The Askalot system was experimentally evaluated at our Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava during the summer term of academic year 2015/2014. During the pilot evaluation, 600 bachelor students from four selected courses and their teachers joined the community in Askalot, asked 180 questions and provided 250 answers. Nowadays, students have the possibility to ask questions related to any subject taught at our faculty.

⁶ Demo of CQA system Askalot is available at: https://askalot.fiit.stuba.sk/demo
⁷ Source code of CQA system Askalot is available at: https://github.com/AskalotCQA/askalot

4. Contributions and Conclusions

Contributions achieved in the dissertation thesis can be divided into three main groups:

1. Overview of theories and state of the art in CQA systems. In spite of significant interest in this area in academia as well as in industry, a systematic overview of the theories standing behind their success was missing. In our work, we described CQA systems from two perspectives: knowledge sharing and collaborative learning. In each of them we identified the most important theories, which provide an insight into how the community-based question answering process works. At the same time, we analysed 265 research publications, which served as the basis for the proposal of a descriptive framework and a complex three-level categorization hierarchy of approaches, as well as for describing representative approaches from each category. This survey should help novice researchers to get a better overview of the research domain and to identify optimal techniques for the proposal and evaluation of methods.

2. Supporting long-term sustainability of CQA ecosystems. We identified an increasing negative trend in the development of some CQA systems. In order to describe it in more detail, we conducted a case study on the CQA system Stack Overflow. In the study, we analysed community perception, which we supplemented with easily executable and reproducible quantitative analyses. These analyses allow other researchers to continue monitoring the development of the negative trends not only on Stack Overflow but also on all other CQA systems built on top of the Stack Exchange platform. Following the insight achieved in the case study, we proposed several remedies (e.g. a new attitude to reputation calculation or systematic involvement of the whole community). These suggestions were illustrated and verified by means of innovative methods for (1) reputation calculation based on content quality and difficulty; and (2) question routing based on non-QA data.


Figure 4: Detail of a question posted in Askalot. (1) Question evaluated by a teacher as a good one. (2) Answer posted by a teacher is highlighted with a different background so it can be easily distinguished from other answers posted by students. (3) Answer is marked by a student as an accepted one.

Figure 5: Detail of a user community profile in Askalot. (1) User gravatar with a gold reputation level. (2) User activity heat map.


3. Investigation of CQA transferability to educational domain. Last but not least, we identified the potential of CQA systems to be employed not only on the standard open Web, but also in organizational and educational environments. The proposed concept of a university-wide educational CQA system has been evaluated by the implementation of the CQA system Askalot. Askalot has been deployed as a supplementary tool to the formal educational process at our faculty. Its community consists of more than 1100 students and teachers. In addition, Askalot also provides an experimental infrastructure, which has already been used in the evaluation of several research papers and bachelor or master theses.

The results achieved in the dissertation thesis provide a good basis for additional research in the area of CQA systems. First, it is possible to continue with the proposal of additional methods for adaptive collaboration support, which will also contribute to the sustainability of CQA community ecosystems. We perceive another potential in the further development of educational CQA systems. We plan to deploy Askalot at Lugano University as a part of a collaboration project in the SCOPES programme. Moreover, we have started a collaboration with Harvard University with the aim to adjust the implementation of Askalot so it can be used for question answering in the MOOC system edX.

Acknowledgements. This work was partially supported by grants No. VG1/0646/15, VG1/0675/11, Tradice No. APVV-0208-10, KEGA 009STU-4/2014, and it is the partial result of collaboration within the SCOPES JRP/IP, No. 160480/2015.

References

[1] C. Aritajati and N. H. Narayanan. Facilitating Students’ Collaboration and Learning in a Question and Answer System. In Proc. of the 2013 Conf. on Computer Supported Cooperative Work companion - CSCW ’13, pages 101–106, New York, New York, USA, 2013. ACM Press.

[2] B. Furlan, B. Nikolic, and V. Milutinovic. A survey and evaluation of state-of-the-art intelligent question routing systems. Int. J. of Intelligent Systems, 28:686–708, 2013.

[3] Z. Liu and B. J. Jansen. Predicting Potential Responders in Social Q&A Based on non-QA Features. In Proc. of the Extended Abstracts of the 32nd Annual ACM Conf. on Human Factors in Computing Systems - CHI EA ’14, pages 2131–2136, New York, New York, USA, 2014. ACM Press.

[4] L. Luo, F. Wang, M. X. Zhou, Y. Pan, and H. Chen. Who Have Got Answers? Growing the Pool of Answerers in a Smart Enterprise Social QA System. In Proc. of the 19th Int. Conf. on Intelligent User Interfaces - IUI ’14, pages 7–16, New York, New York, USA, 2014. ACM Press.

[5] K. Ortbach, O. Gaß, S. Köffer, S. Schacht, N. Walter, A. Maedche, and B. Niehaves. Design Principles for a Social Question and Answers Site: Enabling User-to-User Support in Organizations. In Proceedings of 9th Int. Conf. on Advancing the Impact of Design Science: Moving from Theory to Practice - DESRIST ’14, volume 8463 LNCS, pages 54–68. Springer Berlin Heidelberg, 2014.

[6] A. Ram, H. Ai, P. Ram, and S. Sahay. Open Social Learning Communities. In Proc. of the Int. Conf. on Web Intelligence, Mining and Semantics - WIMS ’11, New York, New York, USA, 2011. ACM Press.

[7] C. Shah, S. Oh, and J. S. Oh. Research agenda for social Q&A. Library and Information Science Research, 31(4):205–209, 2009.

[8] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise Networks in Online Communities: Structure and Algorithms. In Proc. of the 16th Int. Conf. on World Wide Web - WWW ’07, pages 221–230, New York, New York, USA, 2007. ACM Press.

Selected Papers by the Author

I. Srba, M. Bieliková. A Comprehensive Survey and Classification of Approaches for Community Question Answering. In ACM Transactions on the Web, ACM Press, to appear.

I. Srba, M. Bieliková. Why Is Stack Overflow Failing? Preserving Sustainability in Community Question Answering. In IEEE Software, 33 (4). IEEE, 2016.

I. Srba, M. Bieliková. Dynamic group formation as an approach to collaborative learning support. In IEEE Transactions on Learning Technologies, 8 (2): pages 173-186. IEEE, 2015.

I. Srba, M. Grznár, M. Bieliková. Utilizing Non-QA Data to Improve Questions Routing for Users with Low QA Activity in CQA. In Proceedings of IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - ASONAM ’15, pages 129-136. ACM Press, 2015.

I. Srba, M. Bieliková. Design of CQA Systems for Flexible and Scalable Deployment and Evaluation. In Proceedings of International Conference on Web Engineering - ICWE ’16, to appear. Springer Berlin Heidelberg, 2016.

A. Huna, I. Srba, M. Bieliková. Exploiting Content Quality and Question Difficulty in CQA Reputation Systems. In Proceedings of International Conference on Network Science - NetSciX ’16, LNCS 9564, pages 68-81. Springer Berlin Heidelberg, 2016.

I. Srba, M. Bieliková. Askalot: Community Question Answering as a Means for Knowledge Sharing in an Educational Organization. In Proceedings of the 18th ACM Conference Companion on Computer Supported Cooperative Work & Social Computing - CSCW ’15 Companion, pages 179-182. ACM Press, 2015.

M. Bieliková, M. Šimko, M. Barla, J. Tvarožek, M. Labaj, R. Móro, I. Srba, J. Ševcech. ALEF: from Application to Platform for Adaptive Collaborative Learning. In Recommender Systems for Technology Enhanced Learning: Research Trends & Applications, pages 195-225. Springer, 2014.

I. Srba, M. Bieliková. Encouragement of Collaborative Learning Based on Dynamic Groups. In Proceedings of the 7th European Conference of Technology Enhanced Learning - EC-TEL ’12, LNCS 7563, pages 432-37. Springer Berlin Heidelberg, 2012.

R. Móro, I. Srba, M. Uncík, M. Bieliková, M. Šimko. Towards Collaborative Metadata Enrichment for Adaptive Web-Based Learning. In Proceedings of IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - WI/IAT ’11, pages 106-109. IEEE, 2011.

I. Srba, M. Bieliková. Tracing Strength of Relationships in Social Networks. In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - WI/IAT ’10, pages 13-16. IEEE, 2010.

I. Srba, M. Bieliková. Discovering Educational Potential Embedded in Community Question Answering. In Proceedings from the First International Workshop on Educational Knowledge Management - EKM ’2014, pages 1-9. LiU Electronic Press, 2014.

I. Srba, M. Bieliková. Designing Learning Environments Based on Collaborative Content Creation. In Proceedings of Workshop on Collaborative Technologies for Working and Learning, pages 49-53. CEUR, 2013.

Metadata Management for Large Information Spaces

Karol Rástočný∗



[email protected]

Abstract

Due to the size and heterogeneity of large information spaces, methods of data processing use metadata as the main source for their tasks. Usage of already created metadata decreases the necessity of raw data preprocessing and increases the efficiency of data processing methods. But large information spaces such as the Web or especially source code repositories are not stable. Information stored in them is modified continuously. These modifications affect the quality of metadata.

In our work we address the problem of metadata management for large information spaces, focusing on three main goals: (I) proposition of a metadata model suitable for information exchange and efficient metadata maintenance; (II) proposition of a scalable metadata repository which respects the characteristics of the metadata model; (III) an approach to metadata maintenance which keeps metadata valid and consistent.

To fulfil these goals we propose a novel metadata representation via information tags as a class of descriptive metadata. We also propose an information tags model based on the standardized Open Annotation model and an information tags repository which provides effective access to information tags for the main information tag use cases. We address metadata maintenance via the proposition of a robust location descriptor for anchoring information tags to source code and a maintenance approach based on querying a stream of events about tagged content.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services - Data sharing; D.2.8 [Software Engineering]: Metrics

∗ Recommended by thesis supervisor: Prof. Maria Bielikova
Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on June 29, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Rástočný, K. Metadata Management for Large Information Spaces. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 17-20

Keywords

information tags, descriptive metadata, metadata management, developer activities, empirical software metrics

1. Information Tags

Large information spaces like the Web or the information spaces of software houses are created by humans with the main idea of sharing information with other humans. The resources of these information spaces are therefore structured with a focus on readability and understandability by humans. In spite of this, these information spaces are not directly accessed by humans; they are processed by systems with the goal of making information accessible to humans. Merely presenting resources to humans is often not enough, and systems have to be able to find the correct, requested information. To fulfil goals like this, systems have to process resources to obtain new information or knowledge from information spaces. In these cases it is not efficient for systems to process resources directly; instead they use descriptive metadata (metadata that describe resources with identifying information [4], e.g. titles of web pages) holding previously obtained information about resources.

In the case of the Web, this problem is addressed by the Semantic Web initiative, which stores descriptive metadata in ontologies [13] and maps them to web pages via semantic links [1], microdata¹ or microformats², which are stored directly in web pages. Although this solution can be generalized to other domains, it has to deal with the problem of instability of information spaces. Resources of information spaces are continuously created, modified and removed. All these modifications lead to the necessity of updating ontologies and references to them, which is often problematic in the case of ontologies, and it is inefficient to process all resources if they do not contain references to updated parts of ontologies.

We address this problem by the proposition of information tags as a subset of descriptive metadata with semantic relations to the tagged content. We define an information tag as a triplet (Type, Anchoring, Body) [2, 8]:

• Type – defines a type and a meaning of the information tag;

• Anchoring – identifies the tagged information artifact;

• Body – represents structured information, the structure of which corresponds to the type of the information tag.

[Figure 1 tag labels: Downloaded 198x; Is-a Year; Is-a Topic; Is-a Definition; Visited 12x; Visited 24x; Visited 573x; Words 33]

Figure 1: Examples of simple information tags in a human-readable format. An arrow and a highlighted text represent anchoring, the first word defines a type and the second word represents a body of an information tag.

¹ http://www.w3.org/TR/microdata/
² http://microformats.org/

An information tag is a tag that contains structured machine-readable information and tags an information artifact with one of its properties, e.g. the number of words in a paragraph, the type of a phrase, or the number of downloads or visits of some resource (see Figure 1). As a result, information tags can describe a whole resource in detail through its global properties (e.g. the topic of the resource) and partial properties (e.g. the number of clicks on hyperlinks in the resource), which can be used for various purposes such as comparing documents or mining new information.
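The triplet can be pictured as a small data structure. The sketch below is only an illustration under assumptions: the field names (`resource_uri`, `start`, `end`) and the use of character offsets for anchoring are hypothetical and are not taken from the information tags model itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass(frozen=True)
class Anchoring:
    """Identifies the tagged information artifact (field names are assumptions)."""
    resource_uri: str
    start: int  # character offset where the tagged fragment begins
    end: int    # character offset just past the tagged fragment

@dataclass
class InformationTag:
    """The triplet (Type, Anchoring, Body) described in the text."""
    type: str
    anchoring: Anchoring
    body: Dict[str, Any] = field(default_factory=dict)

text = "Information tags were introduced in 2012."
year_start = text.index("2012")

# A partial property: the highlighted fragment is a year.
year_tag = InformationTag("is-a", Anchoring("doc:1", year_start, year_start + 4), {"value": "Year"})
# A global property: the number of words in the whole resource.
words_tag = InformationTag("words", Anchoring("doc:1", 0, len(text)), {"count": len(text.split())})

print(text[year_tag.anchoring.start:year_tag.anchoring.end], words_tag.body["count"])
```

Because the body is structured data rather than free text, a consumer can read `words_tag.body["count"]` without opening the tagged resource.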

The main advantage of information tags is their independence from the tagged resources. All information stored in information tags can be used without the necessity to access the resources, and information tags directly reference the tagged resources, so edits of resources do not affect references to them.

2. Information Tags Model and Repository

An information tag is a triplet of a body, an anchoring to an information artifact and a type. These features are shared with human annotations, so information tags can be modelled as annotations. The annotation “data structure” has quite a long history, and several annotation models have been standardized. A problem is that annotation models have not been proposed with the requirement of efficient annotation maintenance, because maintenance of a freeform, human-readable body is too complex a task.

To supply an acceptable information tags model, we based it on the existing Open Annotation Model [11]. The Open Annotation Model has been proposed by the wide Open Annotation community, and it provides a lot of possibilities that cover almost all requirements of different types of annotations. Since an information tag is not as complex a data structure as a general annotation and we have to respect specific requirements of an information tags repository, such as efficient access, efficient maintenance, ease of use and scalability [9], we have lightened the Open Annotation Model (we have retained only the essential elements) and redesigned it as an object model.

Thanks to this redesign of the model, we are able to store information tags in scalable document databases. These databases give us the possibility to manipulate information tags as whole units, so we do not have to request multiple RDF triples. On the other hand, this decreases the inference possibilities of the Open Annotation Model. We deal with this problem by the proposition of a SPARQL query processing algorithm based on the MapReduce programming model. By evaluating this algorithm we showed that its performance is comparable with native RDF repositories [9].
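The general idea of answering a triple-pattern query over whole documents in a map and a reduce phase can be sketched in a few lines. This is a simplified illustration only and not the query processing algorithm referred to above: the document layout, the field names and the restriction to exact-match patterns sharing one subject variable are all assumptions.

```python
from collections import defaultdict

# Information tags stored as whole documents (field names are assumptions).
tags = [
    {"_id": "t1", "type": "visited", "target": "doc:1", "body": {"count": 12}},
    {"_id": "t2", "type": "visited", "target": "doc:2", "body": {"count": 573}},
    {"_id": "t3", "type": "words", "target": "doc:1", "body": {"count": 33}},
]

def flatten(doc, prefix=""):
    """View a document as (predicate, value) pairs, i.e. triples sharing one subject."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

def map_phase(doc, patterns):
    """Emit (tag id, pattern index) for every triple pattern the document satisfies."""
    pairs = set(flatten(doc))
    for index, pattern in enumerate(patterns):
        if pattern in pairs:
            yield doc["_id"], index

def reduce_phase(emitted, pattern_count):
    """Keep the tag ids for which every pattern of the query was matched."""
    matched = defaultdict(set)
    for tag_id, index in emitted:
        matched[tag_id].add(index)
    return sorted(tag_id for tag_id, seen in matched.items() if len(seen) == pattern_count)

# Roughly: SELECT ?t WHERE { ?t type "visited" . ?t target "doc:1" }
patterns = [("type", "visited"), ("target", "doc:1")]
emitted = [pair for doc in tags for pair in map_phase(doc, patterns)]
print(reduce_phase(emitted, len(patterns)))
```

Each document is inspected once in the map phase, so a tag is handled as a whole unit instead of being reassembled from separately stored triples.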

3. Information Tags Maintenance

We split the problem of information tags maintenance into two independent sub-problems:

• Maintenance of information tag anchors – information tags refer to parts of resources, so when tagged resources change, these references should be repaired;

• Maintenance of information tag bodies – bodies of information tags contain the main information, so they are the most sensitive to changes in information spaces.

3.1 Maintenance of Information Tags Anchors

Repairing metadata anchoring after a modification of documents is one of the basic problems of metadata maintenance. The complexity of this maintenance depends on the anchoring descriptor. In the case of textual documents, a popular anchoring descriptor is based on column and line indexes that characterize the metadata position in a document. The index-based descriptor is easily interpretable, but it is very sensitive to changes in documents, which can affect metadata positions in three ways [7]:


• Without a change – the document has been modified after the metadata anchoring position;

• Simple shift – the document has been modified before the metadata anchoring position. This modification simply shifts the metadata to a new position;

• Complex modification – the document has been modified at the place where the metadata is anchored. In this case, determining the new position may be complicated, and it has to be decided whether the metadata still has its original meaning or has to be updated or deleted.
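The three cases above can be made concrete with a short sketch. It assumes, for simplicity, that an anchor is a half-open range of character offsets rather than line and column indexes, and that an edit is described by its position and the numbers of removed and inserted characters; the function name `update_anchor` is hypothetical.

```python
from typing import Optional, Tuple

def update_anchor(anchor: Tuple[int, int], edit_pos: int, removed: int, inserted: int) -> Optional[Tuple[int, int]]:
    """Return the anchor after an edit, or None when it cannot be repaired by index arithmetic."""
    start, end = anchor              # half-open range [start, end)
    edit_end = edit_pos + removed    # end of the replaced region in the old document
    if edit_pos >= end:
        return anchor                # without a change: the edit lies after the anchor
    if edit_end <= start:
        shift = inserted - removed
        return (start + shift, end + shift)  # simple shift: the edit lies before the anchor
    return None                      # complex modification: the edit overlaps the anchor

text = "int a = 1;\nint b = 2;\n"
b_pos = text.index("b")
anchor = (b_pos, b_pos + 1)          # anchors the identifier "b"

print(update_anchor(anchor, 0, 0, len("// note\n")))   # comment inserted at the top
print(update_anchor(anchor, len(text), 0, 10))         # code appended at the end
print(update_anchor(anchor, b_pos, 1, len("total")))   # "b" renamed to "total"
```

Only the last case needs a more robust descriptor, since the offsets alone cannot tell whether the tagged artifact still exists.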

In the case of textual documents there are many solutions to this problem, e.g. SGDOM based anchoring [5] or tree-based descriptors [3, 6]. But there is no specialized solution for source code which respects source code characteristics and supports real-time interpreting of anchors and direct comparison of anchors. For this reason we analyzed more than 60,000 C# source code files and identified their characteristics that can be utilized for the proposition of a robust location descriptor for source code [9].

We propose a descriptor which consists of an index-based location descriptor and a context-based descriptor. Index-based location descriptors are directly comparable and can be resolved effectively if the tagged source code has not been modified. The context-based descriptor, on the other hand, is robust to source code modifications. For interpreting the context-based descriptor we proposed an algorithm which combines tokenization and string similarity algorithms with the Smith-Waterman algorithm. This combination of algorithms allows real-time interpreting even after complex source code modifications [9].
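To illustrate how local alignment can locate a remembered context in modified code, the following sketch applies the standard Smith-Waterman recurrence to token sequences. It is not the interpreting algorithm of the thesis: tokens are compared by exact equality instead of a string similarity measure, and the tokenizer and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example.

```python
import re

def tokenize(code):
    """Split source code into identifiers, numbers and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def smith_waterman(context, target, match=2, mismatch=-1, gap=-1):
    """Return (best score, index in target just past the best local alignment)."""
    rows, cols = len(context) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_end = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if context[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + step,  # align two tokens
                score[i - 1][j] + gap,    # token missing in the target
                score[i][j - 1] + gap,    # token added in the target
            )
            if score[i][j] > best:
                best, best_end = score[i][j], j
    return best, best_end

# Context stored with the tag, and the source code after a modification.
context = tokenize("int total = a + b;")
modified = tokenize("int x = 0;\nlong total = a + b + c;")

best, end = smith_waterman(context, modified)
print(best, modified[end - 5:end])
```

Even though the line was moved and changed, the best local alignment still ends inside the statement that contained the tagged code, which is the property a context-based descriptor relies on.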

3.2 Maintenance of Information Tags Bodies

Maintenance of information tags bodies has to react to all types of changes in the characteristics of resources of information systems. In general we can define structural, semantic and empirical characteristics of resources. Which characteristics an information tag is sensitive to depends on its type. For example, when an information tag contains the LLOC source code metric, simply renaming the tagged class does not affect the information tag. But if an information tag contains the number of views of the class, it is affected even by scrolling in a source code file.

We reflect this diversity by the proposition of the tagger [10], which transforms users’ and systems’ activities over resources of an information space into a stream of events in the form of linked stream data [12]. The tagger then queries the stream and executes maintenance actions after obtaining the results of the stream queries.
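The query-then-act loop of such a tagger can be sketched as a list of rules, each pairing a stream query with a maintenance action. This is a strongly simplified illustration: events are plain dictionaries rather than linked stream data, queries are Python predicates rather than stream queries, and all names (`RULES`, `run_tagger`, the event fields) are hypothetical.

```python
from collections import Counter

# Stream queries: predicates that select the events a rule is interested in.
def is_view(event):
    return event["action"] == "view"

def is_delete(event):
    return event["action"] == "delete"

# Maintenance actions executed on the results of the queries.
def count_view(views, event):
    views[event["target"]] += 1        # update the "number of views" tag body

def drop_tag(views, event):
    views.pop(event["target"], None)   # remove the tag of a deleted artifact

RULES = [(is_view, count_view), (is_delete, drop_tag)]

def run_tagger(stream, views=None):
    """Apply every rule whose query matches to each event of the stream, in order."""
    views = Counter() if views is None else views
    for event in stream:
        for query, action in RULES:
            if query(event):
                action(views, event)
    return views

stream = [
    {"action": "view", "target": "Parser.cs#Parse"},
    {"action": "rename", "target": "Parser.cs#Parse"},   # no rule reacts to renaming
    {"action": "view", "target": "Parser.cs#Parse"},
    {"action": "view", "target": "Lexer.cs#Next"},
    {"action": "delete", "target": "Lexer.cs#Next"},
]
print(dict(run_tagger(stream)))
```

Adding support for another tag type then amounts to registering one more query and action pair, without changing how the stream is consumed.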

4. Conclusions and Contributions

In the dissertation thesis we discussed problems of metadata management for large information spaces. As the main problem we identified the invalidation of metadata caused by the instability of information spaces. We split the solution of this problem into four main contributions of the thesis:

• Information tags – we proposed a novel representation of descriptive metadata that is natural for systems. This representation is independent from the described resources, which fulfils the initial requirement of effective metadata maintenance.

• Information tags repository – to utilize the contributions of information tags, we proposed an information tags repository based on MongoDB that stores information tags in a model based on the standardized Open Annotation Model. In combination with the proposed SPARQL query processing algorithm, this guarantees the possibility of integration with existing systems.

• Robust descriptor for source code – information tags reference tagged information artifacts via robust descriptors. For this reason we proposed a robust location descriptor and its interpreting algorithm, which is able to identify a tagged source code artifact in real time.

• Stream-based metadata maintenance – to keep the information tag space valid and consistent, we proposed a method for creating, updating and removing information tags based on querying a stream of events about users’ and systems’ actions over information spaces. The method executes the necessary maintenance actions after receiving results from stream queries.

We evaluated the contributions of our work in the domain of the project PerConIK³ (Personalized Conveying of Information and Knowledge). In the project we used the proposed methods for the management of metadata about source code. We deployed the implemented repository as the main information store in the PerConIK architecture. The tagger continuously processed a stream of developers’ activities in IDEs (Microsoft Visual Studio and Eclipse) and web browsers, and source code changes from git repositories.

In addition, we proposed a set of information tag types for developers, which developers can use for manual tagging of source code [10]. These tags are proposed mainly for reviewing source code. To support this process we developed the system CodeReview⁴ (see Figure 2), which has been used in the school course Team project for two years.

Acknowledgements. This work was partially supported by grants No. APVV-0233-10, APVV-0208-10, VEGA 1/0752/14 and VEGA 1/0646/15 and is the partial result of the Research & Development Operational Programme for the project Research of methods for acquisition, analysis and personalized conveying of information and knowledge, ITMS 26240220039, co-funded by the ERDF.

References

[1] S. Araujo, G.-j. Houben, and D. Schwabe. Linkator: Enriching Web Pages by Automatically Adding Dereferenceable Semantic Annotations. In B. Benatallah, F. Casati, G. Kappel, and G. Rossi, editors, Web Engineering, 10th International Conference, ICWE 2010, volume 6189 of Lecture Notes in Computer Science, pages 355–369, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[2] M. Bieliková, I. Polášek, M. Barla, E. Kuric, K. Rástočný, J. Tvarožek, and P. Lacko. Platform Independent Software Development Monitoring: Design of an Architecture. In V. Geffert, B. Preneel, B. Rovan, J. Štuller, and A. M. Tjoa, editors, SOFSEM 2014: Theory and Practice of Computer Science, volume 8327 of LNCS, pages 126–137. Springer International Publishing, Cham, 2014.

³ http://perconik.fiit.stuba.sk/
⁴ https://perconik.fiit.stuba.sk/CodeReview


Figure 2: System CodeReview, displaying one source code file with information tag.

[3] J. Kahan and M.-R. Koivunen. Annotea: An Open RDF Infrastructure for Shared Web Annotations. In Proceedings of the 10th International Conference on World Wide Web - WWW ’01, pages 623–632, New York, 2001. ACM Press.

[4] NISO. Understanding metadata. NISO Press, Bethesda, 2004.

[5] T. A. Phelps and R. Wilensky. Robust Intra-Document Locations. Computer Networks, 33(1-6):105–118, jun 2000.

[6] B. Plimmer, S. H.-h. Chang, M. Doshi, L. Laycock, and N. Seneviratne. iAnnotate: Exploring Multi-User Ink Annotation in Web Browsers. In Proceedings of the Eleventh Australasian Conference on User Interface - Volume 106, volume 106, pages 52–60. Australian Computer Society, Inc., 2010.

[7] R. Priest and B. Plimmer. RCA: Experiences with an IDE Annotation Tool. In Proceedings of the 6th ACM SIGCHI New Zealand Chapter’s International Conference on Computer-human Interaction Design Centered HCI - CHINZ ’06, pages 53–60, New York, 2006. ACM Press.

[8] K. Rástočný and M. Bieliková. Maintenance of Human and Machine Metadata over the Web Content. In M. Grossniklaus and M. Wimmer, editors, Current Trends in Web Engineering (ICWE 2012), volume 7703 of LNCS, pages 216–220. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

[9] K. Rástočný and M. Bieliková. Metadata Anchoring for Source Code: Robust Location Descriptor Definition, Building and Interpreting. In H. Decker, L. Lhotská, S. Link, J. Basl, and A. M. Tjoa, editors, Database and Expert Systems Applications, volume 8056 of LNCS, pages 372–379. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[10] K. Rástočný and M. Bieliková. Empirical metadata maintenance in source code development process. In Engineering of Computer Based Systems (ECBS-EERC), 2015 4th Eastern European Regional Conference on the, pages 25–31, Aug 2015.

[11] R. Sanderson, P. Ciccarese, and H. Van de Sompel. Designing the W3C open annotation data model. In Proceedings of the 5th Annual ACM Web Science Conference on - WebSci ’13, pages 366–375, New York, 2013. ACM Press.

[12] J. F. Sequeda and O. Corcho. Linked Stream Data: A Position Paper. In Proceedings of the 2nd International Workshop on Semantic Sensor Networks, pages 148–157, CEUR-WS, 2009. CEUR-WS.

[13] N. Shadbolt, T. Berners-Lee, and W. Hall. The Semantic Web Revisited. IEEE Intelligent Systems, 21(3):96–101, may 2006.

Selected Papers by the Author

K. Rástočný, M. Tvarožek, M. Bieliková. Web Search Results Exploration via Cluster-Based Views and Zoom-Based Navigation. Journal of Universal Computer Science, 19(16):2320–2346, 2013.

K. Rástočný, M. Tvarožek, M. Bieliková. Supporting Search Result Browsing and Exploration via Cluster-based Views and Zoom-based navigation. In Proceedings 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, pages 297–300, Lyon, France, 2011. CS IEEE Press.

K. Rástočný, M. Bieliková. Maintenance of Knowledge Tags within Heterogeneous Web Content. In Proceedings of Current Trends in Web Engineering: ICWE 2012 International Workshops MDWE, Composable Web, WeRE, QWE, and Doctoral Consortium, LNCS 7703, pages 216–220, Berlin, Germany, 2012. Springer.

M. Bieliková, K. Rástočný. Lightweight Semantics over Web Information Systems Content Employing Knowledge Tags. In: S. Castano et al., editors, ER Workshops 2012, LNCS 7518, pages 327–336, Lyon, Italy, 2012. Springer.

K. Rástočný, M. Bieliková. Metadata Anchoring for Source Code: Robust Location Descriptor Definition, Building and Interpreting. In: H. Decker et al., editors, DEXA 2013, Part II, LNCS 8056, pages 372–379, Prague, Czech Republic, 2013. Springer.

M. Bieliková, I. Polášek, M. Barla, E. Kuric, K. Rástočný, J. Tvarožek, P. Lacko. Platform Independent Software Development Monitoring: Design of an Architecture. In: V. Geffert et al., editors, SOFSEM 2014, LNCS 8327, pages 327–336, Starý Smokovec, Slovak Republic, 2014. Springer.

K. Rástočný, M. Bieliková. Enriching Source Code by Empirical Metadata. In ESEM 2014: 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, page 1, Torino, Italy, 2014. ACM.

K. Rástočný, M. Bieliková. Empirical Metadata Maintenance in Source Code Development Process. In ECBS-EERC 2015: 2015 IEEE Fourth Eastern European Regional Conference on the Engineering of Computer Based Systems, pages 25–31, Brno, Czech Republic, 2015. CS IEEE Press.

Automatic Estimation of Software Developer’s Expertise

Eduard Kuric∗



[email protected]

Abstract

A software developer’s expertise can be defined as the degree of his or her familiarity with the source code artifacts of a software system, relative to other developers of the system. Existing approaches to estimating a developer’s expertise are usually based on evaluating the degree of the developer’s source code authorship. In addition to authorship, the developer’s development productivity should be considered.

The contributions of this work can be split into threeparts. First, we propose a developer’s model overlayingdomain model and a method for its automatic acquisition.The model provides software project-related informationat different levels of abstraction (e.g., at level of softwareconcerns). It is based on metadata and relationships be-tween them derived from corresponding resources. Sec-ond, we propose a method for estimation of developer’sexpertise in the subject software system at level of soft-ware concerns. The method considers both developer’sdevelopment productivity and his or her familiarity witha concern. Finally, we propose a method to recommendan expert developer for a newly created development taskat level of concerns. We evaluate the proposed approachby applying it to the expert recommendation for develop-ment tasks.

Categories and Subject Descriptors

D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement - Version control; D.2.8 [Software Engineering]: Metrics; D.2.9 [Software Engineering]: Management - Productivity, software process models, software quality assurance, programming teams; H.3.4 [Information Storage and Retrieval]: Systems and Software - Information networks

∗Recommended by thesis supervisor: Prof. Maria Bielikova. Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on June 29, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Kuric, E. Automatic Estimation of Software Developer’s Expertise. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 21-25

Keywords

software developer, estimation of expertise, development productivity, familiarity with software artifact, software system, source code, expert recommendation, expert identification, domain modelling, software developer model, latent topic, conceptual concern, interaction data, software repository, mining software repositories

1. Introduction

In many explanatory dictionaries, expertise is defined as “a special skill or knowledge a person has in a particular field”. Such a person is called an expert in that field. The conjunction “or” between the terms skill and knowledge in the definition raises questions. Can a person be an expert if he or she has only a special skill in a field of interest? Can a person be an expert if he or she has only a special knowledge base in that field? What does “special” mean, i.e., how can we quantify the difference between two instances of special knowledge or skill in the same field of interest?

It is not simple to take a clear position on these questions in general. Consider the following. Is a “theoretical” physicist an expert if he or she is not able to check his or her theory in practice? If someone said “no”, it would be interesting to hear Richard Feynman’s answer. A second example is even more extreme. Consider two car mechanics. Both have special knowledge of repairing cars, but only one of them has the skills that enable him or her to repair a car with the available tools. Are both mechanics experts?

Previous research on modelling software developer’s expertise can be categorized into implicit and explicit approaches:

• Implicit approaches investigate how data collected from a developer’s working activities can be used to estimate what he or she knows in a software project, the degree of his or her familiarity with software artifacts, or what he or she wants/needs to know.

• Explicit approaches study developers empirically; they propose models describing different parts of knowledge and investigate how developers comprehend source code.

In our work we deal with implicit approaches. Expert recommendation systems in software engineering help to locate (discover) and to recommend individuals (experts) who have appropriate expertise on a given source code artifact. IT managers and software development team leaders are increasingly challenged with the need to improve the efficiency and quality of software system development. The size of a software system and the high rate of changes to its source code make it difficult for developers to identify who on the team knows a particular part of the system. Identifying the software developer with the right knowledge (expertise) for a developer who needs help during implementation can improve the collaboration and awareness of the development team. The waiting time for an answer can be reduced by contacting the expert directly.

In general, existing approaches to estimate developer’s expertise in/familiarity with source code artifacts are based on investigating interaction activities [4, 14], changes [10, 8, 5, 9], bugs [1, 12, 11, 21], or usage of technologies [15, 19, 20, 7, 18].

The proposed approaches are usually not generalizable: the metrics are usually defined with regard to the structure of particular software repositories, and there are no metrics estimating a developer’s expertise at a higher level of abstraction, such as software concerns. In the field of expert recommendation there is no clear baseline against which expertise metrics can be compared. There are several definitions of developer’s expertise in the literature; however, without a clear baseline, it is difficult to determine which automatic expertise estimation method best approximates a developer’s expertise.

We identified the following open problems:

• Insufficient separation of domain conceptualization and resources. Conceptual concerns of a software system refer to the main technical concepts that reflect the business logic or domain of the system [3]. There is no clear distinction between the concerns representing the domain conceptualization and resources such as: tasks defining developers’ work, software artifacts resulting from developers’ activities for the tasks, and interactions capturing and describing developers’ activities with the artifacts. A lot of knowledge about a software system and its developers is contained in these resources; however, the resources and the domain model are often seen as one layer [17, 16, 2]. The vocabulary or language used in source code differs from the one used for describing development tasks. This results in a lack of flexibility in estimating a developer’s expertise at the level of the concerns and at the level of the particular software artifacts.

• No or minimal consideration of developer’s development skills. Existing approaches to estimate a developer’s expertise in a part of a software system usually rely on the assumption that the developer’s commits to the code reflect his or her expertise in (familiarity with) that part of the system [10, 8, 14]. However, while solving tasks, software developers rewrite their code, reverse (undo) changes, try alternatives, familiarize themselves with surrounding code, explore the information space, etc. A developer’s expertise in source code has an impact on how quickly and successfully a development task can be expected to be solved [12, 14]. The concept of expertise includes both knowledge and skills. To estimate a developer’s expertise in a software system, both his or her knowledge and development skills should be considered. It is one thing to be aware of the existence of some functionality; it is another to be able to perform a correction, enhancement, or reduction of source code effectively in terms of the effort spent.

• Limited interconnection of tasks with codebase in expert identification. Existing approaches to recommend an expert for a newly created development task are often based on using a repository to identify the person who has solved tasks similar to the target task. The similarity is measured by comparing the descriptions of the tasks [22, 11]. The time required to implement a new functionality, to change an existing functionality, or to fix a bug can be significantly reduced if the task is assigned to a developer who knows the corresponding source code (a particular part of a software system) [4]. When the developer’s expertise is estimated only at the level of the tasks, the developer’s expertise in a particular part of the codebase of a software system is not sufficiently considered. The source code that a developer has to consult in the target task does not have to be the same as in the similar tasks. Looking at the software system as a web of conceptual concerns inferred from both the tasks and the codebase, we can recommend for a task the developer who best covers the task in terms of his or her (estimated) expertise in the concerns included in the target task.

We aim to address these problems by devising methods for automatic domain model acquisition, expertise estimation, and expert recommendation at the level of conceptual concerns. In particular, the thesis goals are:

• To design a domain model of a software project and a method for its acquisition that provides a clear separation between domain conceptualization and resources.

• To propose and evaluate a method for estimation of developer’s expertise at the level of conceptual concerns that covers both developer’s familiarity with tasks and codebase, and his or her development skills measured through development productivity reflecting the effort he or she expends while working on the particular concern.

• To propose and evaluate a method to recommend an expert for a newly created development task at the level of conceptual concerns inferred from both the tasks and the codebase of a software system.

2. Software Developer’s Model Overlaying Domain Model

In our work we define developer’s expertise through his or her familiarity with a software system or its part as follows:

Developer’s expertise refers to a degree of being aware of existing functionality in a subject software system and a degree of ability to locate relevant source code artifact(s), respective to other developers of the software system.


Figure 1: Developer model we propose overlays the domain model. Domain model elements are metadata elements and topics. Metadata elements can be instances of source code entity abstraction and task abstraction. They contain relevant domain terms and a set of attributes, e.g., source code properties estimated by using various software metrics. Topics are inferred from the metadata corpus containing the relevant domain terms. The developer model is updated when the overlay model receives information on developer’s activities from the repository layer.

To construct a domain model we use primary data from a version control system, an issue tracking system and interaction data.

The developer model we propose overlays the domain model (see Figure 1). Domain model elements are metadata elements and topics. Metadata elements can be instances of a source code entity abstraction or a (development) task abstraction. They contain relevant domain terms and a set of attributes, e.g., source code properties estimated by using various software metrics. Topics are inferred¹ from the metadata corpus containing the relevant domain terms. The developer model is updated when the overlay model receives information on developer’s activities from the repository layer.

We distinguish two types of relationships, namely the developer–metadata element relationship and the developer–topic relationship. The developer–metadata element relationship between a developer and a metadata element is created iff the developer (co)authors the resource element corresponding to the metadata element. The developer–topic relationship between a developer and a topic is created iff the topic is assigned to a metadata element and there is a relationship between the developer and the metadata element.

¹ Software concerns can be automatically extracted (inferred) by using statistical topic modelling techniques adapted to software. Such techniques can be used to approximate software concerns as (latent) topics.

The developer–metadata element and developer–topic relationships can have attributes. A developer–topic relationship can have attributes (expertise characteristics) such as the degree of the developer’s familiarity with the topic and his or her development productivity on the topic (expertise metrics to estimate these characteristics will be defined later). A developer–source code entity abstraction relationship can have an attribute such as the degree of the developer’s authorship of the source code entity with respect to co-authors.

The proposed model provides a general apparatus to define different types of relationships between domain elements. Various attributes can be assigned to the elements and relationships.
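The two relationship rules above can be illustrated with a minimal Python sketch. The class name `MetadataElement`, the function `developer_topic_relationships` and the sample data are hypothetical and introduced only for illustration; the sketch assumes that authorship and topic assignments are already known and does not reproduce the acquisition method itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetadataElement:
    """Hypothetical abstraction of a resource (a source code entity or a task)."""
    name: str
    authors: set                               # developers who (co)author the resource
    topics: set = field(default_factory=set)   # topics assigned to the element


def developer_topic_relationships(elements):
    """Return {developer: set of topics} derived through metadata elements.

    A developer-topic relationship is created iff the topic is assigned to
    a metadata element that the developer is related to by (co)authorship.
    """
    relations = defaultdict(set)
    for element in elements:
        for developer in element.authors:
            # Authorship gives the developer-metadata element relationship,
            # so each topic of the element yields a developer-topic relationship.
            relations[developer] |= element.topics
    return dict(relations)


# Illustrative data only.
elements = [
    MetadataElement("PaymentService.java", {"alice", "bob"}, {"billing"}),
    MetadataElement("TASK-17", {"bob"}, {"billing", "reporting"}),
    MetadataElement("Logger.java", {"carol"}),
]
relations = developer_topic_relationships(elements)
```

In this sketch a developer whose elements carry no topics (here `carol`) ends up with an empty topic set, which mirrors the "iff" condition in the text.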


3. Software Developer’s Expertise Estimation

The software system can be seen as a web of topics. We propose a method to recommend an expert for a target (newly created) development task of a topic. It is based on estimating a developer’s expertise on particular topics of the software system relative to other developers developing the same system.

To estimate a developer’s expertise for a topic we propose two metrics. The first one estimates the developer’s development productivity on the topic. The second one estimates his or her degree of familiarity with the topic. Both metrics are based on the developer’s previous work (resolved tasks) on the subject topic.

The development productivity metric considers two characteristics influencing the development time required to complete a given task of a topic, namely the complexity of the task and the developer’s activities performed while solving the task. It reflects the following assumptions:

• The more complex a developer’s change in a source code entity is and the less time he or she needs to perform the change, the higher his or her productivity probably is.

• The more edit interactions a developer performs in a development session and the fewer exploration interactions he or she carries out in that session, the more expert he or she probably is.
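The two assumptions can be turned into a toy score, shown here as a hedged sketch only. The formula (complexity per hour multiplied by the share of edit interactions), the function name `productivity_score` and all parameter names are assumptions made for illustration; they are not the metric defined in the thesis.

```python
def productivity_score(change_complexity, hours_spent, edit_events, exploration_events):
    """Illustrative score, not the thesis metric.

    Grows with the complexity handled per hour (first assumption) and with
    the share of edit interactions in the session (second assumption).
    """
    if hours_spent <= 0:
        raise ValueError("hours_spent must be positive")
    total_events = edit_events + exploration_events
    edit_ratio = edit_events / total_events if total_events else 0.0
    return (change_complexity / hours_spent) * edit_ratio


# Same change complexity, different time and interaction profiles.
fast = productivity_score(change_complexity=12, hours_spent=2,
                          edit_events=30, exploration_events=10)
slow = productivity_score(change_complexity=12, hours_spent=6,
                          edit_events=10, exploration_events=30)
```

With these made-up inputs, the developer who needs less time and explores less receives the higher score, which is the ordering the assumptions call for.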

The metric for estimation of a developer’s familiarity with a topic considers his or her real source code contribution to the topic, i.e., the amount of created and modified source code. This metric reflects the following assumptions:

• The more lines of code a developer owns in a source code entity, the more expert he or she is in the entity.

• Code contributions committed earlier in the past have a smaller weight than recent code changes.
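A minimal sketch of these two assumptions follows. The exponential decay, the half-life of 180 days and the names `familiarity_score` and `relative_shares` are illustrative assumptions; the text only states that older contributions weigh less and that expertise is considered relative to other developers.

```python
def familiarity_score(contributions, half_life_days=180.0):
    """Sum of contributed lines, each weighted by recency.

    contributions: iterable of (lines_changed, age_in_days) pairs.
    The exponential half-life weighting is an assumption for illustration.
    """
    return sum(lines * 0.5 ** (age / half_life_days) for lines, age in contributions)


def relative_shares(scores):
    """Normalize per-developer scores so that they sum to 1."""
    total = sum(scores.values())
    return {dev: (score / total if total else 0.0) for dev, score in scores.items()}


recent = familiarity_score([(100, 0)])    # 100 lines committed today
old = familiarity_score([(100, 360)])     # 100 lines committed two half-lives ago
shares = relative_shares({"alice": recent, "bob": old})
```

Under these assumptions the same amount of code counts four times less after two half-lives, so the more recent contributor holds the larger relative share.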

While the productivity metric focuses on the process of development, the familiarity metric operates with the result of the development (process).
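Recommending an expert for a task at the level of topics can be sketched as a ranking problem. The example below assumes that a task is described by topic weights and that each developer already has a single expertise score per topic; how the two metrics are combined into that score is not specified here, so the weighted-sum coverage, the function name `recommend_expert` and the data are all hypothetical.

```python
def recommend_expert(task_topics, expertise):
    """Rank developers by how well their expertise covers the task's topics.

    task_topics: {topic: weight of the topic in the task}
    expertise:   {developer: {topic: expertise score}}
    """
    def coverage(developer):
        scores = expertise[developer]
        return sum(weight * scores.get(topic, 0.0)
                   for topic, weight in task_topics.items())

    return sorted(expertise, key=coverage, reverse=True)


# Illustrative data only.
task = {"billing": 0.7, "reporting": 0.3}
expertise = {
    "alice": {"billing": 0.9},
    "bob": {"billing": 0.4, "reporting": 0.8},
    "carol": {"reporting": 0.5},
}
ranking = recommend_expert(task, expertise)
```

Here `alice` is ranked first because the task is dominated by the billing topic, even though `bob` covers both topics.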

4. Contributions and Conclusions

Expertise is difficult to estimate or observe directly. There are many types of expertise, competing definitions, and possible taxonomies of expertise in the literature. Perhaps the most direct way to demonstrate a developer’s expertise is to have the developer take a test (such as those proposed for professional licensing). However, manual testing to identify an expert for a part of a software system is not effective in practice.

Finding experts is critical in the development of large software systems, especially if there are (geographically) distributed teams. The authors of [6] estimate that software developers spend about half of their development time communicating with each other. A substantial portion of this time is the information communication [13]. The time can be reduced by expertise-finding support that helps software developers determine with whom to communicate during development tasks. On the other hand, if we directly assign a task to a software developer who knows the corresponding part of the software system and/or whose development productivity on the topic of the task is high, the time required to resolve the task can be (significantly) reduced.

Contributions achieved in this work are as follows:

• proposal of a developer model overlaying the domain model and a method for its automatic acquisition,

• proposal of a novel method for estimation of developer’s expertise at the level of topics that considers both developer’s familiarity with a topic and his or her development productivity on the topic,

• proposal of a novel method for recommendation of experts to newly created development tasks at the level of topics that are inferred from both the tasks and the codebase of a software system.

We conducted experiments on five open-source projects and performed a case study on two commercial/closed software projects. Although the experiments performed on the five open-source projects and the two commercial projects do not provide enough justification of the generality of our approach, the overall results indicate that our approach can be useful in recommending experts.

Acknowledgements. This work was partially supported by grants No. APVV-0208-10, VG 1/0752/14 and VG 1/0675/11 and is the partial result of the Research & Development Operational Programme for the project Research of methods for acquisition, analysis and personalized conveying of information and knowledge, ITMS 26240220039, co-funded by the ERDF.

References

[1] J. Anvik and G. C. Murphy. Determining implementation expertise from bug reports. In Proceedings of the 4th International Workshop on Mining Software Repositories, MSR ’07, pages 2–, USA, 2007. IEEE Computer Society.

[2] M. F. Bosu and S. G. MacDonell. Data quality in empirical software engineering: A targeted review. In Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering, EASE ’13, pages 171–176, New York, NY, USA, 2013. ACM.

[3] T.-H. Chen, S. W. Thomas, M. Nagappan, and A. E. Hassan. Explaining software defects using topic models. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR ’12, pages 189–198, Piscataway, NJ, USA, 2012. IEEE Press.

[4] T. Fritz, G. C. Murphy, E. Murphy-Hill, J. Ou, and E. Hill. Degree-of-knowledge: Modeling a developer’s knowledge of code. ACM Trans. Software Eng. Methodol., 23(2):14:1–14:42, Apr. 2014.

[5] T. Girba, A. Kuhn, M. Seeberger, and S. Ducasse. How developers drive software evolution. In Proceedings of the 8th International Workshop on Principles of Software Evolution, IWPSE ’05, pages 113–122, USA, 2005. IEEE Computer Society.

[6] J. D. Herbsleb, H. Klein, G. M. Olson, H. Brunner, J. S. Olson, and J. Harding. Object-oriented analysis and design in software project teams. Hum.-Comput. Interact., 10(2):249–292, Sept. 1995.

[7] D. Ma, D. Schuler, T. Zimmermann, and J. Sillito. Expert recommendation with usage expertise. In ICSM, pages 535–538, 2009.

[8] D. W. McDonald and M. S. Ackerman. Expertise recommender: a flexible recommendation system and architecture. In Proceedings of the Conference on Computer Supported Cooperative Work, pages 231–240, USA, 2000. ACM.

[9] S. Minto and G. C. Murphy. Recommending emergent teams. In Proceedings of the 4th International Workshop on Mining Software Repositories, MSR ’07, pages 5–, USA, 2007. IEEE Computer Society.

[10] A. Mockus and J. D. Herbsleb. Expertise browser: a quantitative approach to identifying expertise. In Proc. of the 24th Int. Conf. on Software Eng., pages 503–512, USA, 2002. ACM.

[11] N. Nagwani and S. Verma. Predicting expert developers for newly reported bugs using frequent terms similarities of bug attributes. In Proceedings of the 9th International Conference on ICT and Knowledge Engineering (ICT Knowledge Engineering), pages 113–117, Bangkok, Jan 2012. IEEE.

[12] T. T. Nguyen, T. Nguyen, E. Duesterwald, T. Klinger, and P. Santhanam. Inferring developer expertise through defect analysis. In Proc. of the 34th Int. Conf. on Software Engineering, pages 1297–1300, Zurich, June 2012. IEEE.

[13] D. E. Perry, N. Staudenmayer, and L. G. Votta. People, organizations, and process improvement. IEEE Software, 11(4):36–45, July 1994.

[14] R. Robbes and D. Röthlisberger. Using developer interaction data to compare expertise metrics. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 297–300, Piscataway, NJ, USA, 2013. IEEE Press.

[15] M. Robillard, W. Maalej, R. Walker, and T. Zimmermann. Recommendation Systems in Software Engineering. Springer Berlin Heidelberg, 2014.

[16] D. Rodriguez, I. Herraiz, and R. Harrison. On software engineering repositories and their open problems. In Proceedings of the First International Workshop on Realizing AI Synergies in Software Engineering, RAISE ’12, pages 52–56, Piscataway, NJ, USA, 2012.

[17] M. M. Rosli, E. Tempero, and A. Luxton-Reilly. What is in our datasets?: Describing a structure of datasets. In Proceedings of the Australasian Computer Science Week Multiconference, ACSW ’16, pages 28:1–28:10, New York, NY, USA, 2016. ACM.

[18] D. Schuler and T. Zimmermann. Mining usage expertise from version archives. In Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR ’08, pages 121–124. ACM, 2008.

[19] C. Teyton, J.-R. Falleri, F. Morandat, and X. Blanc. Find your library experts. In Proceedings of the 20th Working Conference on Reverse Engineering (WCRE), pages 202–211, Koblenz, Germany, Oct 2013. IEEE.

[20] A. Vivacqua and H. Lieberman. Agents to assist in finding help. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’00, pages 65–72, USA, 2000. ACM.

[21] W. Wu, W. Zhang, Y. Yang, and Q. Wang. Drex: Developer recommendation with k-nearest-neighbor search and expertise ranking. In Software Engineering Conference (APSEC), 18th Asia Pacific, pages 389–396. IEEE, Dec 2011.

[22] X. Xie, W. Zhang, Y. Yang, and Q. Wang. Dretom: Developer recommendation based on topic models for bug resolution. In Proceedings of the 8th International Conference on Predictive Models in Software Engineering, PROMISE ’12, pages 19–28, New York, NY, USA, 2012. ACM.

Selected Papers by the Author

E. Kuric, M. Bieliková. ANNOR: efficient image annotation based on combining local and global features. Computers and Graphics, 47(2): 1–15, 2015.

E. Kuric, M. Bieliková. Estimation of student’s programming expertise. In Proceedings of the 8th ACM/IEEE international symposium on empirical software engineering and measurement, p. 4, Torino, Italy, 2014. ACM.

E. Kuric, M. Bieliková. Webification of software development: user feedback for developer’s modeling. In Proceedings of the 14th International conference on Web Engineering, pp. 550–553, Toulouse, France, 2014. Springer.

E. Kuric, M. Bieliková. Search in source code based on identifying popular fragments. In Proceedings of the 39th international conference on current trends in theory and practice of computer science, pp. 408–419, Spindleruv Mlyn, Czech Republic, 2013. Springer.


E. Kuric, M. Bieliková. Automatic Image Annotation Using Global and Local Features. In Proceedings of the 6th International Workshop on Semantic media adaptation and personalization, pp. 33–38, Vigo, Spain, 2011, IEEE CS.

M. Bieliková, I. Polášek, M. Barla, E. Kuric, K. Rástočný, J. Tvarožek, P. Lacko. Platform independent software development monitoring: design of an architecture. In Proceedings of the 40th International conference on current trends in theory and practice of computer science, pp. 126–137, Novy Smokovec, Slovakia, 2014, Springer.

Combining Named Entity Recognition Methods for Concept Extraction

Štefan Dlugolinský∗

Institute of Informatics
Slovak Academy of Sciences

Dúbravská cesta 9, 845 07 Bratislava, [email protected]

Abstract

Named entity recognition (NER) is a key task in mining semantics from text. The recent growth of social media has raised new challenges for NER. Our evaluation of a number of popular NE recognizers over a micro-post dataset showed a significant drop in the quality of results. Current state-of-the-art NER methods perform much better on formal text than on micro-posts. However, the experiment provided us with an interesting observation: although individual NER tools did not perform very well on micro-post data, we obtained recall over 91% when we merged the results of all the examined tools. This means that if we were able to combine different NE recognizers in a meaningful way, we might be able to achieve very high quality NER in micro-posts. We propose a method for NER in micro-posts that is designed to combine annotations yielded by existing NER tools in order to produce more precise results than the input tools alone. We combine NE recognizers using machine learning techniques, namely decision tree and random forest using the C4.5 algorithm. Evaluation on a standard dataset shows that the proposed approach outperforms the underlying NER methods as well as a state-of-the-art NE recognizer specially trained on micro-post data. To the best of our knowledge, the proposed approach achieves the highest F1 score reported to date on the #MSM2013 dataset.

Categories and Subject Descriptors

I.2 [ARTIFICIAL INTELLIGENCE]: Natural Language Processing - Language parsing and understanding, Text analysis

Keywords

named entity recognition, machine learning, micro-posts

∗Recommended by thesis supervisor: Assoc. Prof. Dr. Michal Laclavík. Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on August 25, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Dlugolinský, Š. Combining Named Entity Recognition Methods for Concept Extraction. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 26-36

1. Introduction

Recent years have seen a significant growth in social media interaction. People are able to interact via the Internet from almost anywhere at any time. They can share their experiences, thoughts and knowledge instantly, and they do so on a massive scale. The easiest and probably the most popular way of interacting on the Web is through micro-posts: short text messages posted on the Web. There are numerous services offering such communication. Well-known examples of micro-posts include tweets, Facebook statuses, comments, Google+ posts and Instagram photos. Micro-post analysis has great potential for uncovering hidden knowledge that can be used in a wide range of domains, such as emergency response, public opinion assessment, business or political sentiment analysis, and many more. The most important task in analysing and making sense of micro-posts is Named Entity Recognition (NER). NER in micro-posts is a challenging problem due to the limited size of a single micro-post, the prevalence of term ambiguity, noisy content, and multilingualism [3]. These are the main reasons why existing NER methods perform better on formal newswire text than on micro-posts, and there is clearly room for new NER methods designed for social media streams. In the thesis summarized in this paper, we propose an approach for combining NER methods, represented by different NE recognizers, in order to build a new NE recognizer intended for use on micro-posts. The method is designed to combine annotations produced by different NER tools by exploiting machine learning (ML) techniques. We use the term annotation to refer to a substring of an input text that has been marked by a NER tool as a reference to an entity of one of the target classes, i.e., LOC, MISC, ORG and PER. The main challenge is the transformation of text annotations produced by NER tools into a form usable for training ML classification algorithms. Once the NER annotations were transformed into an appropriate format, we performed an evaluation of a number of popular ML classification techniques. The best performing on our problem domain was the C4.5 algorithm [17], which was used to train decision tree (DT) and random forest (RF) models. The resulting classification model outperformed the best of the combined recognizers by more than 16% in F1 score and the best baseline model by 1% in F1 score.

The main contributions of the work are the following:

• We show that although existing NER tools designed for news text do not perform well on micro-posts, by merging the results of several different NER tools, we can achieve high recall and precision.

• We utilize ML classifiers to combine the outputs of multiple NE recognizers. The principal challenge is the transformation of text annotations yielded by NER tools to feature vectors that can be used for the training of classification algorithms.

• We provide an extensive evaluation of popular classification models to assess their suitability for the problem of combining results of NER tools. For the best performing ones, we study the influence of the algorithms’ parameters on the classification results.
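To make the transformation step more concrete, the following Python sketch groups overlapping annotations from several tools into candidate spans and encodes each candidate as one categorical feature per tool. This encoding, the `Annotation` class, the function names and the tool names `toolA`, `toolB`, `toolC` are assumptions for illustration; the feature set actually used in the thesis is not described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    tool: str    # name of the NER tool that produced the annotation
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # one of LOC, MISC, ORG, PER


def group_overlapping(annotations):
    """Group annotations whose character spans overlap into candidate spans."""
    groups = []
    for ann in sorted(annotations, key=lambda a: (a.start, a.end)):
        if groups and ann.start < groups[-1]["end"]:
            groups[-1]["members"].append(ann)
            groups[-1]["end"] = max(groups[-1]["end"], ann.end)
        else:
            groups.append({"start": ann.start, "end": ann.end, "members": [ann]})
    return groups


def to_feature_vector(group, tools):
    """One categorical feature per tool: the label it proposed, or NONE.

    If a tool contributes several annotations to a group, the last one wins.
    """
    votes = {ann.tool: ann.label for ann in group["members"]}
    return [votes.get(tool, "NONE") for tool in tools]


# Illustrative input with hypothetical tool names.
text = "Obama visits New York"
tools = ["toolA", "toolB", "toolC"]
annotations = [
    Annotation("toolA", 0, 5, "PER"),
    Annotation("toolB", 0, 5, "ORG"),
    Annotation("toolA", 13, 21, "LOC"),
    Annotation("toolC", 17, 21, "LOC"),
]
groups = group_overlapping(annotations)
vectors = [to_feature_vector(group, tools) for group in groups]
```

Vectors of this kind, paired with gold-standard labels, could then serve as training instances for a decision tree or random forest learner.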

The paper is structured as follows: Section 2 presents the state of the art related to NER in micro-posts through combining multiple NER methods. Section 3 briefly describes several popular NER tools, which are evaluated over a standard micro-posts dataset in Section 4. The results show a dramatic drop in quality measures compared to the numbers reported on news datasets. Section 5 defines baseline NE recognizers, explains our approach to combining NER tools, and evaluates our NE recognition models. Finally, Section 6 summarizes our results and concludes the paper.

2. Present State of the ArtThis section presents a state-of-the-art related to NERin micro-posts, which is based on a combination of mul-tiple methods. Regarding the NER for tweets, a simi-lar approach has been taken by Liu et al. [15]. Authorscombine a k-Nearest Neighbors (k-NN) classifier with alinear Conditional Random Fields (CRF) model under asemi-supervised learning framework and show increase inF1 with respect to a baseline system, which is its modi-fied version without k-NN and semi-supervised learning.Etter et al. [9] address multilingual NER for short in-formal text. They do not rely on language dependentfeatures such as dictionaries or POS tagging, but theyuse language independent features derived from the char-acter composition of a word and its context in a mes-sage; i.e., words, character n-grams for words, ±k wordsto the left, message length, word length and word positionin message. They use an algorithm that combines Sup-port Vector Machine (SVM) with a Hidden Markov Model(HMM) to train a NER model on a manually annotateddata1. The experiments show that the language inde-pendent features lead to F1 score increase and the modeloutperforms Ritter et al. [22]. Ritter et al. [22] presentre-built NLP pipeline for tweets; i.e., POS tagger, chun-ker and NE recognizer. The NE recognizer leverages theredundancy inherent in tweets using Labelled LDA [18]to exploit Freebase2 dictionaries as a source of distant su-pervision. TwiNER, a novel unsupervised NER systemfor targeted tweet streams is proposed by Li et al. [14].Similarly to Etter et al. [9], TwiNER does not rely onany linguistic features of the text. It aggregates infor-mation from the Web and Wikipedia. The advantage ofTwiNER is that it does not require manually annotatedtraining sets. Alternatively, TwiNER does not categorizethe type of discovered NEs. Authors prefer the problemof correctly locating and recognizing presence of NEs in-stead of their classification. 
Habib and Keulen [12], the winning solution of the #MSM2013 IE Challenge, split the NER problem into named entity extraction (NEE) and

¹ There were about 40,000 tweets tagged in more than two weeks by Amazon Mechanical Turk.
² http://www.freebase.com

named entity classification (NEC), too. The NEE task is performed by a union of entities recognized by two models; i.e., CRF and SVM. Both models are trained on manually labelled tweet data. The CRF involves POS tags and capitalization of the words as features. The SVM segments a tweet using the approach of Li et al. [14] and enriches the segments using an external knowledge base (KB). It uses the same features as the SVM model and information from the external KB.

3. NE Recognizers Considered for Combining

We chose various existing NE recognizers based on state-of-the-art NER methods as candidates for combining in the classification models discussed later. The list of these tools was complemented by Miscinator, our NE recognizer specially designed for the Making Sense of Microposts 2013 (#MSM2013) Concept Extraction Challenge. Below, we briefly describe the recognizers, focusing on which methods they use for NER and how they were configured.

ANNIE [6] relies on finite state algorithms, gazetteers and the JAPE (Java Annotation Patterns Engine) language [7]. We have used ANNIE from GATE Developer 7.1.

Apache OpenNLP³ is based on maximum entropy models [21] and the perceptron learning algorithm [23]. We have used Apache OpenNLP v1.5.2.

Illinois Named Entity Tagger [19] uses a regularized averaged perceptron [11] with external knowledge (unlabelled text, gazetteers built from Wikipedia and word class models). We have used Illinois NET v1.0.4 with the 4-label type set and default configuration.

Illinois Wikifier [20] is based on a Ranking SVM (Support Vector Machine) [5] and exploits the Wikipedia link structure in disambiguation. We have used Illinois Wikifier 1.0⁴.

Open Calais operates behind a shroud of mystery, since there is not much information available about how its NE recognition works. Official sources⁵ say that it uses NLP, machine learning and other methods as well as Linked Data.

Stanford Named Entity Recognizer [10] is based on CRF (Conditional Random Field) sequence models [13]. We have used Stanford NER v1.2.7⁶ with the English 4-class caseless CONLL model⁷.

Wikipedia Miner⁸ [16] is a text annotation tool, which is capable of annotating Wikipedia topics in a given text. It exploits the Wikipedia link graph and the Wikipedia category hierarchy and relies on machine learning classifiers, which are used for measuring relatedness of concepts and terms, as well as for measuring disambiguation. We have used this software to discover Wikipedia topics in micro-posts. Discovered topics were then tagged according to the DBPedia Ontology⁹.

⁴ http://cogcomp.cs.illinois.edu/page/download_view/Wikifier
⁵ http://www.opencalais.com/about
⁶ http://nlp.stanford.edu/software/stanford-ner-2012-11-11.zip
⁷ english.conll.4class.caseless.distsim.crf.ser.gz
⁹ http://dbpedia.org/Ontology


[Figure: diagram relating the NE recognizers (ANNIE, Apache OpenNLP, Illinois NET, Illinois Wikifier, LingPipe, Open Calais, Stanford NER, Wikipedia Miner) to the methods and resources they build on (finite state algorithms, gazetteers, perceptron learning, maximum entropy, SVM, CRF, HMM, machine learning, Wikipedia, Linked Data).]

Figure 1: Outline of NE recognizers.

Miscinator was our gazetteer tool specially designed for annotating MISC entities in the #MSM2013 Concept Extraction Challenge. The gazetteer was constructed by extending the MISC annotations from the training set with Google Sets¹⁰.

Most of the NE recognizers were based on statistical learning methods. Some of them also used gazetteers and other external knowledge such as Wikipedia or Linked Data. An outline of the NE recognizers is shown in Figure 1.

4. Evaluation of NE Recognizers

In this section, we provide an evaluation of the NE recognizers described in Section 3 over micro-post data. Our intent was to observe the performance of each individual NE recognizer before combining it with other NER tools. The evaluation also focused on analysing which NE recognizer is more suitable for a particular NE class and whether the NE recognizers produce diverse results. The evaluated NE recognizers were not specially configured, tweaked or trained for micro-posts prior to the evaluation. They used their default configuration for formal English text. The reason for this was that we wanted to see how they cope with a different kind of text than they were trained for. As they differed in supported NE classes, it was necessary to align them with our taxonomy (LOC, MISC, ORG, PER). This was achieved by simple mapping; e.g., Person→PER. If the mapping was not possible, we skipped the particular NE class and did not include it in the computation of evaluation metrics; e.g., we skipped the MISC class for ANNIE, Apache OpenNLP and LingPipe.
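The label alignment described above can be sketched in a few lines of Python. This is only an illustration: apart from the Person→PER pair mentioned in the text, the tool labels in `LABEL_MAP` and the `(start, end, label)` annotation format are assumptions, not the actual label sets of the evaluated tools.

```python
# Hypothetical mapping from one tool's labels to the target taxonomy.
# Only "Person" -> "PER" is taken from the text; the rest is illustrative.
LABEL_MAP = {"Person": "PER", "Location": "LOC", "Organization": "ORG"}


def align(annotations, label_map):
    """Map tool-specific labels to the target taxonomy.

    `annotations` is a list of (start, end, label) tuples. Annotations
    whose label has no counterpart in the taxonomy are skipped, so the
    corresponding class does not enter the evaluation metrics.
    """
    aligned = []
    for start, end, label in annotations:
        target = label_map.get(label)
        if target is None:
            continue  # no mapping available: skip this NE class
        aligned.append((start, end, target))
    return aligned


tool_output = [(0, 14, "Person"), (18, 24, "Date"), (30, 36, "Location")]
print(align(tool_output, LABEL_MAP))
```

A tool without any label that maps to MISC simply produces no MISC annotations after alignment, which corresponds to the "–" entries in the result tables.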

4.1 Dataset

NE recognizers were evaluated over the adapted #MSM2013 IE Challenge training dataset [2]. We took version 1.5 and removed duplicate micro-posts as well as micro-posts overlapping with the test dataset. The cleaned training dataset finally contained 2752 unique manually annotated micro-posts with classification restricted to four entity types:

PER – full or partial person names

¹⁰ http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html

[Figure: two pie charts of entity counts. Train dataset: LOC 606 (19.44%), MISC 215 (6.9%), ORG 601 (19.28%), PER 1696 (54.39%). Test dataset: LOC 96 (6.35%), MISC 94 (6.22%), ORG 232 (15.35%), PER 1089 (72.07%).]

Figure 2: Occurrence of named entities in train (left) and test (right) datasets.

LOC – full or partial (geographical or physical) location names, including: cities, provinces or states, countries, continents and (physical) facilities

ORG – full or partial organization names, including academic, state, governmental, military and business or enterprise organizations

MISC – a concept not covered by any of the categories above, yet limited to one of the entity types: film/movie, entertainment award event, political event, programming language, sporting event and TV show.

We also adapted the test dataset from the #MSM2013 IE Challenge, over which we later evaluated our classification models. The occurrence of NEs in both datasets is displayed in Figure 2. Named entity types were not equally distributed. The most frequent entity type in both datasets was PER and the least frequent was MISC. Datasets used in this work are also available for download¹¹ in GATE SerialDataStore format. The datasets include results of all the used NE recognizers as well as of our NER models discussed later in this paper.

4.2 Evaluation Results

Evaluation results are displayed in Table 1, ordered by Micro avg. F1 score. We also provide a Macro summary, which averages the P, R and F1 measures on a per document basis, while the Micro summary considers the whole dataset as one document. It turned out that the best performing NE tagger on the evaluation dataset was Open Calais, which was the best in recognizing LOC and ORG entities. The second was Illinois NET, which was the best in recognizing PER entities. The best tool in recognizing MISC entities was Miscinator, which achieved 48% in F1 score. Further details about the evaluation can be found in [8]. Some of the evaluation results reported there may differ slightly from those displayed in Table 1, because we accepted adjectival and demonymic forms of countries as the MISC type; e.g., Slovak (adjectival), Slovaks (demonym).
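As a reading aid for Table 1, the following sketch recomputes the Macro avg. F1 column from the per-class F1 columns. It rests on an observation rather than on a statement in the text: the published Macro F1 values coincide with the unweighted mean of the four per-class F1 scores, with an unsupported class ("–") counted as zero. The function name `macro_f1` is an illustrative choice.

```python
CLASSES = ("LOC", "MISC", "ORG", "PER")


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 over the four target classes.

    A class the tool does not support (None, shown as "-" in Table 1)
    contributes 0 to the sum. This is an assumption that happens to
    reproduce the Macro avg. F1 column of Table 1.
    """
    return sum(per_class_f1.get(c) or 0.0 for c in CLASSES) / len(CLASSES)


# Per-class F1 values copied from Table 1.
open_calais = {"LOC": 0.737, "MISC": 0.270, "ORG": 0.557, "PER": 0.692}
annie = {"LOC": 0.677, "MISC": None, "ORG": 0.356, "PER": 0.606}

print(round(macro_f1(open_calais), 3))  # Table 1 reports 0.564
print(round(macro_f1(annie), 3))        # Table 1 reports 0.410
```

The Micro columns cannot be recomputed this way, because they depend on the raw counts of correct, missing and spurious annotations over the whole dataset.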

4.3 Theoretical Gain in Performance

The evaluation also shows that the NE recognizers produced diverse annotations. This can be seen in the increased recall after the results of all the taggers were unified and duplicates were removed. Figure 3 and Figure 4 illustrate the situation and the recall that can theoretically be achieved when combining the recognizers. Gain in recall on a per NE class basis is computed in Table 2 together with overall recall using the macro

¹¹ http://ikt.ui.sav.sk/microposts/


Table 1: Evaluation of NE Recognizers over the Training Dataset

                    F1 (per NE class)            Macro avg.             Micro avg.
NE recognizer       LOC    MISC   ORG    PER     P      R      F1      P      R      F1
Open Calais         0.737  0.270  0.557  0.692   0.657  0.503  0.564   0.720  0.598  0.653
Illinois NET        0.721  0.105  0.359  0.789   0.488  0.505  0.493   0.607  0.645  0.626
Stanford NER        0.670  0.053  0.292  0.747   0.446  0.436  0.440   0.597  0.591  0.594
ANNIE               0.677  –      0.356  0.606   0.711  0.369  0.410   0.636  0.483  0.549
Illinois Wikifier   0.552  0.159  0.509  0.624   0.541  0.419  0.461   0.624  0.472  0.537
Apache OpenNLP      0.507  –      0.270  0.579   0.679  0.281  0.339   0.624  0.384  0.475
Wikipedia Miner     0.560  0.062  0.327  0.613   0.346  0.524  0.391   0.321  0.573  0.412
LingPipe            0.349  –      0.071  0.348   0.400  0.300  0.192   0.161  0.381  0.226
Miscinator          –      0.479  –      –       0.922  0.092  0.120   0.687  0.025  0.049

Table 2: Gain in Recall That Could Be Theoretically Achieved If Combining Taggers Together

Theoretical gain in recall [%]

dataset   LOC    MISC   ORG    PER    Mac.   Mic.
train     26.4   96.2   73.7   22.7   63.3   40.7
test      43.1   90.5   91.7   16.0   64.2   29.0

and micro averaging. Gain values were calculated with respect to the score of the best tool for the particular NE class. Macro and Micro values were calculated analogously. If the taggers did not produce diverse results, the recall of the unified and de-duplicated results would merely replicate the performance of the individual tools. Therefore, we see room for a theoretical gain in NER performance when combining the tools together. Yet the increased recall of the unified taggers comes hand in hand with lower precision. This is because the number of false positive (FP) tags can grow with every new tagger added to the union. We expect that a combining model can be trained which would decrease the number of false positive (FP) entities produced by the combined taggers and thus increase the precision, while keeping the number of true positives (TP), and therefore recall, relatively high.
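The gain calculation can be illustrated with a short sketch. It assumes that the gain is the relative improvement of the unified recall over the recall of the best single tool, which is consistent with the published numbers; the inputs below are the rounded micro recall of the unified taggers read from Figure 3 (0.91) and the best micro recall of a single tool in Table 1 (0.645, Illinois NET), so the result only approximates the 40.7% reported in Table 2.

```python
def relative_gain(unified, best_single):
    """Relative gain in percent of the unified score over the best single tool."""
    return 100.0 * (unified - best_single) / best_single


# Micro recall on the train dataset: unified taggers vs. the best single tool.
gain = relative_gain(unified=0.91, best_single=0.645)
print(f"{gain:.1f}%")  # close to the 40.7% in Table 2, up to rounding of the inputs
```

The small difference from Table 2 comes from using the two-decimal value shown in the figure instead of the unrounded recall.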

Nevertheless, the real gain in recall can be higher than the theoretical gain calculated in Table 2. This is possible if the combining model is capable of decreasing the number of false negatives (FN) occurring for the combined taggers; i.e., decreasing the number of missed entities. The number of false negatives (FN) can be decreased if the combining model takes into account true negative (TN) entities produced by the combined taggers and transforms them into true positive (TP) or false positive (FP) entities. If they are transformed into true positives (TP), both recall and precision increase. If they are transformed into false positives (FP), only recall increases and precision decreases. Of course, to gain recall, the combining model should not produce fewer true positive (TP) entities than the combined taggers together. True negative (TN) entities recognized by the combined taggers can be properly chosen NEs classified outside of the target taxonomy; e.g., NP – noun phrase. If such an NP entity is taken and transformed into a true positive (TP) entity of the target taxonomy, for instance PER, then recall increases, as the number of true positives (TP) increases too. Moreover, precision also increases. Therefore, we chose to involve true negatives (TN) in the training of the combination model described later in Section 5.

[Figure: bar chart of Precision, Recall and F1 (scale 0.0 to 1.0) per NE class and on average. Bar values (P / R / F1): LOC 0.24 / 0.92 / 0.38; MISC 0.06 / 0.72 / 0.11; ORG 0.08 / 0.82 / 0.15; PER 0.31 / 0.96 / 0.47; Macro 0.17 / 0.86 / 0.28; Micro 0.17 / 0.91 / 0.29.]

Figure 3: Precision, Recall and F1 of unified NE recognizers over the train dataset.

[Figure: bar chart of Precision, Recall and F1 (scale 0.0 to 1.0) per NE class and on average. Bar values (P / R / F1): LOC 0.09 / 0.86 / 0.16; MISC 0.03 / 0.43 / 0.06; ORG 0.06 / 0.79 / 0.11; PER 0.34 / 0.96 / 0.51; Macro 0.13 / 0.76 / 0.21; Micro 0.16 / 0.89 / 0.28.]

Figure 4: Precision, Recall and F1 of unified NE recognizers over the test dataset.

Results of the evaluation and the experiment with merged annotations showed that there was room for a theoretical gain in overall NER performance if the taggers were combined.

5. Combining NE Recognizers

The idea of how to combine NE recognizers was to use machine learning techniques to build a classification model, which would be trained on features describing the micro-post text as well as the annotations produced by the involved NE recognizers. We used the training dataset to build the model and the test dataset to evaluate it and compare it with the other NE recognizers (Section 4).

According to the evaluation results in Section 4, we chose seven out of eight NE recognizers based on different methods for combining. The discarded one was LingPipe, because its model English News: MUC-6 was not suitable for micro-post texts, even though it fitted this task best of all three available LingPipe models. The other two available LingPipe models were English Genes: GeneTag and English Genomics: GENIA. The seven chosen NE recognizers were then complemented by the Miscinator tool.

As the overall recall of the underlying NE recognizers was relatively high, we wanted to gain maximum precision without degrading the recall. We decided to involve machine learning techniques, but it was necessary to transform this problem into a standard machine learning task. In this case it was suitable to transform the task of NER into a task of classification. The intent was that the machine learning process would produce a classification model capable of classifying given annotations from the involved methods into four target classes (LOC, MISC, ORG, PER) and one special class, NULL, indicating that the annotation does not belong to any of the four target classes. Then a simple algorithm can be applied to merge the re-classified annotations into the final results.

5.1 Baseline NE Recognizers

We defined three baseline NE recognizers, which we used to compare the performance of the combining models.

The first baseline, Baseline Train, was defined as a straightforward composition of the best NE recognizers in each NE class according to the evaluation made over the training dataset (Section 4, Table 1); i.e., the LOC, MISC and ORG classes were extracted by Open Calais and the PER class was extracted by Illinois NET.

Theoretically, a better baseline could be assembled if we chose the best NE recognizers according to the evaluation on the test dataset instead of the training dataset, because there is no guarantee that the best taggers on one dataset are also the best on a different one. Therefore, we defined a theoretical baseline, Baseline Test, which is assembled from the best taggers according to the evaluation made over the test dataset. This baseline is theoretical, because in real life we do not know in advance which taggers we should assemble for which NE class.

The third, Baseline SNER, was a model based on the Stanford NER CRFClassifier trained on the training dataset. To train this model, we adapted the properties of the out-of-the-box model english.conll.4class.caseless.distsim.crf.ser.gz¹².

The performance of the baselines can be seen in Table 3 together with the performance of the NE recognizers considered for combining. The evaluation was made over the test dataset. Results are ordered by Micro avg. F1 score. As expected, the baselines outperformed the underlying NE recognizers in the precision and F1 measures. Our goal was to exceed the performance of the baselines with a combining model produced by a machine learning approach.

¹² https://github.com/stanfordnlp/CoreNLP/raw/fe2a9672bd7beb589d245d13d20e89754c06917f/scripts/ner/english.conll.4class.caseless.distsim.prop

[Figure: the training vector is a concatenation of the annotation vector, the tweet vector, the method 1 vector, the method 2 vector, …, the method N vector, the correct answer and the preproc. vector.]

Figure 5: Training vector.

[Figure: the annotation vector consists of the components annotation type, first letter capital, all letters upper cased, all letters lower cased, capitalized words and word count.]

Figure 6: Annotation vector.

5.2 Transforming NE Annotations into Vectors

We took the approach of describing how particular methods performed on different entity types compared to the response of other methods and to the manual annotation. Used as a training vector, this description was the input for training a classification model. A vector of input training features was generated for each annotation found by the underlying NER methods, restricted to the following types: LOC, MISC, ORG, PER, NP – noun phrase, VP – verb phrase, OTHER – different type. We called this annotation a reference annotation. The vector of each reference annotation consisted of several sub-vectors (Figure 5).

The first sub-vector of the training vector was an annotation vector (Figure 6). The annotation vector described the reference annotation – whether it was upper or lower case, used a capital first letter or capitalized all of its words, the word count, and the type of the detected annotation.

The second sub-vector described the micro-post as a whole (Figure 7). It contained features describing whether all words longer than four characters were capitalized, uppercase, or lowercase. We called this sub-vector the tweet vector.
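The two sub-vectors described above can be sketched as simple feature extractors. This is a minimal illustration under stated assumptions: whitespace tokenization, the dictionary keys and the function names `annotation_vector` and `tweet_vector` are illustrative choices, and "capitalized words" is read as "every word starts with a capital letter".

```python
def annotation_vector(text, annotation_type):
    """Surface features of one reference annotation (cf. Figure 6)."""
    words = text.split()  # assumption: words are separated by whitespace
    return {
        "annotation_type": annotation_type,
        "first_letter_capital": text[:1].isupper(),
        "all_upper_cased": text.isupper(),
        "all_lower_cased": text.islower(),
        "capitalized_words": bool(words) and all(w[:1].isupper() for w in words),
        "word_count": len(words),
    }


def tweet_vector(tweet):
    """Casing features of the whole micro-post (cf. Figure 7)."""
    # Only words longer than four characters are considered.
    words = [w for w in tweet.split() if len(w) > 4]
    return {
        "all_words_capitalized": bool(words) and all(w[:1].isupper() for w in words),
        "all_words_upper_cased": bool(words) and all(w.isupper() for w in words),
        "all_words_lower_cased": bool(words) and all(w.islower() for w in words),
    }


print(annotation_vector("Christian Bale", "PER"))
print(tweet_vector("LOVING THIS MOVIE so much"))
```

The tweet-level features give the classifier a hint about whether casing is informative at all in a given micro-post, for example when the whole message is written in capitals.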

The remainder of the sub-vectors were computed according to the overlap of the reference annotation with the annotations produced by a particular NER method. Such a sub-vector (which we termed a method vector) was computed for each method and contained four other vectors describing the overlap of the method annotations with the reference annotation for each target entity type (Figure 8). The annotation type attribute was filled with the class of the method annotation that exactly matched the position of the reference annotation and was one of the target entity classes; otherwise it was left blank.

Each overlap vector of a particular method and NE class (Figure 9) consisted of five components – ail: the average intersection length of a reference annotation with the method annotations of the same NE class, aiia: the average intersection ratio of the method annotations of the same NE class with the reference annotation, aiir: the average intersection ratio of a reference annotation with the method annotations of the same NE class, the average confidence (if the underlying method returns such a value), and the variance of the average confidence.

[Figure: the tweet vector consists of the components all words* capitalized, all words* upper cased and all words* lower cased (* words longer than four characters); the preproc. vector consists of the components ail, aiia and aiir.]

Figure 7: Tweet vector (left) and preprocessing vector (right).

Table 3: Evaluation of NE Recognizers over the Test Dataset

                    F1 (per NE class)            Macro avg.             Micro avg.
Model               LOC    MISC   ORG    PER     P      R      F1      P      R      F1
Baseline SNER       0.589  0.267  0.465  0.864   0.689  0.484  0.546   0.836  0.710  0.768
Baseline Test       0.614  0.295  0.464  0.844   0.669  0.491  0.554   0.801  0.706  0.750
Baseline Train      0.614  0.295  0.296  0.844   0.694  0.436  0.512   0.831  0.672  0.743

Stanford NER        0.513  0.000  0.302  0.822   0.393  0.431  0.409   0.673  0.668  0.670
Illinois NET        0.500  0.058  0.317  0.844   0.407  0.462  0.430   0.647  0.694  0.669
Open Calais         0.614  0.295  0.296  0.691   0.643  0.412  0.474   0.656  0.602  0.628
ANNIE               0.480  –      0.194  0.679   0.605  0.325  0.338   0.633  0.519  0.570
Illinois Wikifier   0.343  0.087  0.464  0.677   0.439  0.382  0.393   0.628  0.496  0.554
Apache OpenNLP      0.384  –      0.126  0.637   0.571  0.265  0.287   0.619  0.430  0.508
Wikipedia Miner     0.292  0.039  0.288  0.671   0.287  0.463  0.322   0.323  0.571  0.413
LingPipe            0.147  –      0.046  0.385   0.364  0.277  0.145   0.147  0.378  0.211
Miscinator          –      0.191  –      –       0.881  0.029  0.048   0.524  0.007  0.014

[Figure: the method vector consists of the annotation type followed by the LOC, MISC, ORG and PER overlap vectors.]

Figure 8: Method vector.

[Figure: the NE overlap vector consists of the components ail, aiia, aiir, avg. confidence and confidence variance.]

Figure 9: Overlap vector.

The ail component in the overlap vector was computed using formula (1), where R was a fixed reference annotation and M_C was a set of n method annotations of class C intersecting with the reference annotation R. The ail component was a simple arithmetic mean of intersection lengths.

$$\mathrm{ail}(R, M_C) = \frac{1}{n} \sum_{i=1}^{n} \lvert R \cap M_{C_i} \rvert \tag{1}$$

The aiia component was computed using formula (2), which was also a simple arithmetic mean, but the intersection lengths were normalized by the lengths of the particular method annotations M_{C_i} intersecting with the reference annotation R. We wanted the value of the aiia component to describe how much of the method annotations was covered by the reference annotation.

$$\mathrm{aiia}(R, M_C) = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert R \cap M_{C_i} \rvert}{\lvert M_{C_i} \rvert} \tag{2}$$

Similarly, the aiir component was computed using formula (3), but the intersection lengths were normalized by the length of the reference annotation R. The value of the aiir component was used to describe how much of the reference annotation was covered by the method annotations.

$$\mathrm{aiir}(R, M_C) = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert R \cap M_{C_i} \rvert}{\lvert R \rvert} \tag{3}$$

A simple example of overlap vector computation is shown in Figure 10. The overlap vector is computed for method 4 and the PER class according to the highlighted reference annotation. In this example, the reference annotation is M2.PER1, but it can be any method annotation or manual annotation. The remainder of the method 4 overlap vectors are zero-valued, since method 4 does not return annotations of types LOC, MISC and ORG. Similarly, overlap vectors according to the same reference annotation will be computed for methods 1, 2 and 3 to finally have all method vectors computed in a training vector. In addition, eight training vectors will be computed, one for each of the eight annotations taken as reference annotations, where the manual annotation PER is also included.
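Formulas (1) to (3) can be sketched in Python as follows. The representation of annotations as half-open character spans `(start, end)` and the function names are assumptions made for illustration. The concrete offsets are hypothetical; they were chosen only so that the lengths match the Figure 10 example (a reference annotation of length 13 intersected by two method annotations of lengths 10 and 6, each sharing 6 characters with it).

```python
from statistics import mean


def overlap(a, b):
    """Length of the intersection of two half-open spans (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_features(ref, method_annotations):
    """Return (ail, aiia, aiir) of a reference span against one method's
    annotations of a single NE class, following formulas (1)-(3)."""
    hits = [m for m in method_annotations if overlap(ref, m) > 0]
    if not hits:
        return 0.0, 0.0, 0.0  # no intersecting annotation: zero-valued vector
    ref_len = ref[1] - ref[0]
    ail = mean(overlap(ref, m) for m in hits)
    aiia = mean(overlap(ref, m) / (m[1] - m[0]) for m in hits)
    aiir = mean(overlap(ref, m) / ref_len for m in hits)
    return ail, aiia, aiir


reference = (10, 23)                 # hypothetical offsets, length 13
method4_per = [(6, 16), (17, 23)]    # lengths 10 and 6, each overlapping by 6
ail, aiia, aiir = overlap_features(reference, method4_per)
print(f"ail={ail:.2f} aiia={aiia:.2f} aiir={aiir:.2f}")
# prints: ail=6.00 aiia=0.80 aiir=0.46
```

The printed values agree with the PER overlap vector of method 4 in Figure 10. The remaining two components of the overlap vector, the average confidence and its variance, would be computed over the same intersecting annotations whenever a method reports confidences.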

The final two components in the training vector were the correct answer (i.e., the correct annotation type taken from the manual annotation) and a special pre-processing vector (Figure 7). The pre-processing vector included three components: ail, aiia and aiir, which described the intersection of the reference annotation with the correct answer when the reference annotation was correct. If the reference annotation was not correct, the values of the pre-processing vector components were set to zero.

The number of learning features depended on the number of combined methods, since for each involved method a new method vector was computed and included in the training vector. Some features were more important than others, and some were not important at all. The effect of specific learning features is discussed later.

[Figure: a sample text annotated by four methods (M1.LOC1, M1.PER1, M2.PER1, M3.LOC1, M3.PER1, M4.PER1, M4.PER2) and by a manual PER annotation, with M2.PER1 highlighted as the reference annotation. The resulting method 4 vector has annotation type NULL, zero-valued LOC, MISC and ORG avg. score vectors, and the PER avg. score vector ail = 6.00, aiia = 0.80, aiir = 0.46, avg. confidence = 0.00, confidence variance = 0.00, obtained as follows:

$$\mathrm{ail}(M2.PER1, M4_{PER}) = \frac{1}{2}(6 + 6) = 6.00$$

$$\mathrm{aiia}(M2.PER1, M4_{PER}) = \frac{1}{2}\left(\frac{6}{10} + \frac{6}{6}\right) = 0.80$$

$$\mathrm{aiir}(M2.PER1, M4_{PER}) = \frac{1}{2}\left(\frac{6}{13} + \frac{6}{13}\right) \doteq 0.46$$
]

Figure 10: Example of overlap vector computation.

5.3 Training Data Preprocessing

Training data was generated automatically as a collection of training vectors, which needed further processing prior to applying machine learning algorithms. Duplicate training vectors were removed in order to eliminate distortion in the training and validation process, thus yielding a more balanced classification model.

Based on the pre-processing vector (Figure 7), we removed training vectors in which the annotation type attribute in the annotation vector was correct but the aiir attribute in the pre-processing vector was not equal to 1.0; i.e., the bounds of the reference annotation were not equal to the bounds of the correct answer. In previous versions, we tried to accept all the training vectors whose aiir attribute was at least 0.95, i.e., where the reference annotation overlapped with the correct answer by at least 95%, but this led to models with lower precision.

We also removed several attributes which led to zero information gain and were not useful for the classification; i.e., attributes with the same value for all the training vectors. These were usually the average confidence and the variance of the average confidence scores: some NE recognizers did not provide annotation confidence information, hence both attributes were always zero and so was their information gain. For the same reason, we also removed attributes which contained information in less than 3% of records. Attributes of the pre-processing vector were removed as well.

The pre-processing phase described above significantly reduced the size of the training data, and therefore the memory requirements, and sped up the training process. It started with a set of ∼63K training vectors with ∼200 attributes and finished with ∼31K unique records with ∼100 highly relevant attributes.
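Part of the pre-processing described in this section can be sketched as follows. The sketch covers only duplicate removal and attribute pruning, not the aiir-based filtering. It is an illustration under assumptions: training vectors are plain sequences of attribute values, and "contained information" is interpreted as "is neither zero, empty nor missing"; the function name `preprocess` and the `min_fill` parameter are illustrative.

```python
def preprocess(rows, min_fill=0.03):
    """Drop duplicate vectors, constant attributes and sparsely filled attributes.

    Returns the reduced vectors and the indices of the kept attributes.
    """
    if not rows:
        return [], []
    # 1. Remove exact duplicates while keeping the first occurrence.
    unique = list(dict.fromkeys(tuple(r) for r in rows))
    keep = []
    for j in range(len(unique[0])):
        column = [r[j] for r in unique]
        # 2. An attribute with a single value carries zero information gain.
        constant = len(set(column)) == 1
        # 3. Assumed reading of "information in less than 3% of records".
        filled = sum(1 for v in column if v not in (0, "", None)) / len(column)
        if not constant and filled >= min_fill:
            keep.append(j)
    return [tuple(r[j] for j in keep) for r in unique], keep


vectors = [
    [1, 0, "PER"],
    [1, 0, "PER"],   # duplicate of the first vector
    [1, 0, "LOC"],
    [1, 5, "ORG"],
]
reduced, kept = preprocess(vectors)
print(kept)     # attribute 0 is constant and is dropped
print(reduced)
```

Removing duplicates before pruning matters, because the fill ratio and the constancy check are then computed over unique records only.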

Table 4: Performance of Classification Models Built by Different Algorithms

Model                   AUROC   ACC     F1
Decision Tree J48       0.939   0.969   0.938
Random Forest           0.927   0.972   0.925
Bagging                 0.912   0.972   0.908
Multilayer Perceptron   0.895   0.955   0.890
Dagging                 0.889   0.922   0.880
Bayes Net               0.857   0.954   0.865
RBF Network             0.850   0.923   0.835
AdaBoost.M1             0.811   0.804   0.750
Naive Bayes             0.797   0.919   0.814

5.4 Model Training and Evaluation

We tried several algorithms to train different classification model candidates, which we compared with each other according to F1 score. We also examined the AUROC (Area Under the Receiver Operating Characteristic curve) and ACC (accuracy) measures. All three measures were obtained from 10-fold cross-validation of the model candidates over the training dataset. Cross-validation served as a good method for identifying suitable model candidates, as it limited the overfitting effect without the need for another test dataset. The best performance was achieved by the decision tree (DT) classification model built with the J48¹³ algorithm (DTJ48), followed by the random forest (RF) [4] model. The third was a classification model based on REPTree (Reduced Error Pruned Tree) built with the Bagging algorithm (Table 4).

We focused on the two best performing algorithms and built several classification models while varying some of the input parameters of these algorithms in order to gain precision and recall. These were the Minimum Number of Instances per Leaf parameter (hereinafter parameter "M") for decision trees and the number of trees for random forest. The classification models were evaluated using a hold-out validation method over the test dataset. Evaluation results are displayed in Table 6. The best performing models were based on random forest, namely RF N100, RF N200, RF N300 and RF N400. These models outperformed the models based on decision trees as well as the baseline recognizers and all the combined NE recognizers. We can see that precision and recall grew with the number of trees for the random forest models and converged to 79% and 76%, respectively. This behavior is more evident in Figure 11, where F1 measures are shown for particular NE classes according to the varied number of trees. Dashed lines indicate the scores of the baselines, i.e., Baseline SNER, Baseline Test and Baseline Train. The performance of the Test and Train baselines was the same for the LOC, MISC and PER classes, since they used a different tagger only for the ORG class (see Section 5.1 for more details). Therefore, their lines overlap in the graphs for the LOC, MISC and PER classes.

Evaluation results of the models built with the J48 algorithm (C4.5 implementation) while varying the M parameter are displayed in Figure 12. Although these models did not outperform the best baseline, one of the models, DTJ48 M13, was slightly better than the rest of the baseline models.

¹³ J48 is an implementation of the C4.5 algorithm.


Table 5: Mean and Standard Deviation of the Combining Models Evaluated over the Test Dataset

              F1 (per NE class)            Macro avg.             Micro avg.
Models        LOC    MISC   ORG    PER     P      R      F1      P      R      F1
RF      µ     0.560  0.233  0.450  0.868   0.548  0.527  0.528   0.751  0.752  0.751
        σ     0.032  0.031  0.049  0.014   0.050  0.017  0.029   0.044  0.015  0.030
DTJ48   µ     0.543  0.281  0.394  0.857   0.547  0.508  0.519   0.754  0.723  0.738
        σ     0.048  0.048  0.024  0.011   0.033  0.009  0.016   0.023  0.006  0.014

Table 6: Evaluation of NER Models over the Test Dataset

                  F1 (per NE class)            Macro avg.             Micro avg.
Model             LOC    MISC   ORG    PER     P      R      F1      P      R      F1
RF N100           0.584  0.231  0.476  0.883   0.589  0.529  0.543   0.788  0.760  0.774
RF N200           0.593  0.236  0.484  0.878   0.604  0.529  0.548   0.789  0.759  0.774
RF N300           0.600  0.234  0.491  0.876   0.602  0.533  0.550   0.788  0.758  0.773
RF N400           0.597  0.234  0.490  0.876   0.601  0.533  0.549   0.788  0.758  0.772
Baseline SNER     0.589  0.267  0.465  0.864   0.689  0.484  0.546   0.836  0.710  0.768
RF N17            0.557  0.262  0.476  0.876   0.559  0.542  0.543   0.766  0.762  0.764
RF N21            0.554  0.257  0.474  0.874   0.562  0.537  0.540   0.766  0.758  0.762
RF N11            0.570  0.247  0.463  0.873   0.554  0.535  0.538   0.763  0.758  0.761
RF N14            0.550  0.259  0.456  0.877   0.548  0.536  0.535   0.761  0.760  0.760
RF N9             0.569  0.255  0.475  0.867   0.551  0.544  0.542   0.755  0.760  0.758
RF N7             0.562  0.252  0.444  0.868   0.537  0.537  0.531   0.746  0.758  0.752
DTJ48 M13         0.570  0.356  0.365  0.867   0.599  0.516  0.539   0.775  0.729  0.751
Baseline Test     0.614  0.295  0.464  0.844   0.669  0.491  0.554   0.801  0.706  0.750
DTJ48 M11         0.585  0.268  0.400  0.863   0.560  0.515  0.529   0.766  0.726  0.746
DTJ48 M9          0.549  0.288  0.388  0.864   0.554  0.510  0.522   0.770  0.723  0.745
Baseline Train    0.614  0.295  0.296  0.844   0.694  0.436  0.512   0.831  0.672  0.743
DTJ48 M7          0.567  0.231  0.411  0.858   0.535  0.510  0.517   0.754  0.726  0.740
RF N5             0.530  0.221  0.420  0.859   0.511  0.518  0.507   0.727  0.747  0.737
DTJ48 M5          0.536  0.232  0.427  0.854   0.526  0.507  0.512   0.750  0.722  0.736
#MSM2013 21 3     0.505  0.308  0.411  0.834   0.510  0.532  0.514   0.701  0.726  0.713
DTJ48 M2          0.453  0.312  0.372  0.836   0.504  0.492  0.493   0.711  0.712  0.712
RF N3             0.500  0.195  0.368  0.846   0.466  0.496  0.477   0.685  0.730  0.707
RF N2             0.508  0.151  0.333  0.836   0.445  0.489  0.457   0.643  0.711  0.675

The #MSM2013 21 3 model in Table 6 was our submission to the #MSM2013 IE Challenge [24]. This model was one of our early models and finished in the challenge as the first runner-up, 1% in F1 behind the winner, Habib et al. [12]. The #MSM2013 21 3 model was the second best in precision and the best in recall in the challenge. Results of this model in the table may be slightly worse than the official challenge results¹⁴, since we used stricter evaluation criteria. We did not accept partially correct consecutive annotations; i.e., PER/Christian PER/Bale was incorrect, while PER/Christian Bale was correct.

For a better comparison, we present the precision, recall and F1 measures of the best performing model (RF N100), the best DT model (DTJ48 M13), the baseline recognizers and the top three combined NE recognizers in Figure 13. The highest score in precision was achieved by Baseline SNER, followed by the rest of the baselines. RF N100 and DTJ48 M13 were fourth and fifth, respectively. However, they performed better than any of the combined NE recognizers. RF N100 gained 17% and DTJ48 M13 15% in precision with respect to Stanford NER, the best in precision among the combined NE recognizers. The loss of RF N100 on Baseline SNER was 6%. The highest score in recall was achieved by RF N100, followed by DTJ48 M13. The gain in recall of the RF N100 model was 7% with respect to the best baseline (Baseline SNER) and 10% with respect to the best combined NE recognizer (Illinois NET). The highest score in the F1 measure was achieved by RF N100. The gain in F1 of the RF N100 model was 1% with respect to the best baseline (Baseline SNER) and 16% with respect to the best combined NE recognizer (Stanford NER). DTJ48 M13 was third, with a 2% loss in F1 on the second-placed Baseline SNER, but with a 12% gain in F1 with respect to Stanford NER.

¹⁴ http://oak.dcs.shef.ac.uk/msm2013/ie_challenge/results/challenge_results_summary.pdf

Figure 11: Impact on F1 while varying number of trees for Random Forest algorithm. [Plot panels: LOC, MISC, ORG, PER, Macro, Micro; x-axis: Trees (2 to 200); y-axis: F1; reference lines: Baseline SNER, Baseline Test, Baseline Train.]

The gain in F1 scores was a sign that the combining models were capable of eliminating false positive (FP) entities and/or transforming true negative (TN) entities into true positive (TP) entities. To confirm this assumption, we compared the results of the best combining model, RF N100, with the merged and de-duplicated results of the combined tools. The results of the alignment are in Table 7. We did not observe any true negative (TN) entities transformed into true positive (TP) entities; instead, 194 true positive (TP) entities were transformed into false negative (FN) entities. However, 3 487 false positive (FP) entities were eliminated. This caused the RF N100 model to gain 199.5% in precision with respect to the merged and de-duplicated results of the combined tools. The decrease in true positive (TP) entities and the increase in false negative (FN) entities led to a 14.45% loss in recall, but the 33.08% gain in F1 was still relatively high and showed that combining models can be trained to perform better than the combined tools.
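The precision and recall changes quoted for RF N100 can be recomputed from the "All" rows of Table 7. The sketch below assumes strict matching, where a partially correct (PAR) result counts against both precision and recall; this formula is an assumption that happens to reproduce the quoted figures, not a definition taken from the text.

```python
# "All" rows of Table 7: correct, partial, missing, spurious.
MERGED = {"cor": 1343, "par": 58, "mis": 110, "spu": 3703}
RF_N100 = {"cor": 1149, "par": 93, "mis": 269, "spu": 216}


def precision(counts):
    """Strict precision (assumed): partial and spurious results count as errors."""
    return counts["cor"] / (counts["cor"] + counts["par"] + counts["spu"])


def recall(counts):
    """Strict recall (assumed): partial and missing results count as misses."""
    return counts["cor"] / (counts["cor"] + counts["par"] + counts["mis"])


def relative_change(new, old):
    """Relative change in percent."""
    return 100.0 * (new - old) / old


precision_gain = relative_change(precision(RF_N100), precision(MERGED))
recall_change = relative_change(recall(RF_N100), recall(MERGED))

if __name__ == "__main__":
    print(f"precision: {precision(MERGED):.3f} -> {precision(RF_N100):.3f} "
          f"({precision_gain:+.1f}%)")
    print(f"recall:    {recall(MERGED):.3f} -> {recall(RF_N100):.3f} "
          f"({recall_change:+.2f}%)")
```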

Table 7: Performance Analysis of the RF N100 Model over the #MSM2013 Test Dataset

                        TP       FN                FP
NE     Model            COR      PAR      MIS      SPU
LOC    merged+dedup.    82       8        6        425
       RF N100          59       7        30       40
       ∆                −23      −1       24       −385
MISC   merged+dedup.    40       12       42       1 224
       RF N100          15       7        72       14
       ∆                −25      −5       30       −1 210
ORG    merged+dedup.    179      18       35       921
       RF N100          104      35       93       66
       ∆                −75      17       58       −855
PER    merged+dedup.    1 042    20       27       1 133
       RF N100          971      44       74       96
       ∆                −71      24       47       −1 037
All    merged+dedup.    1 343    58       110      3 703
       RF N100          1 149    93       269      216
       ∆                −194     35       159      −3 487

Figure 12: Impact on F1 while varying parameter M for Decision Tree J48 (C4.5) algorithm. [Plot panels: LOC, MISC, ORG, PER, Macro, Micro; x-axis: M (2 to 12); y-axis: F1; reference lines: Baseline SNER, Baseline Test, Baseline Train.]

A closer analysis of the annotation results indicates that many results were classified correctly but did not exactly match the position in the text; i.e., they were partially correct. Therefore, we applied post-processing and trimmed non-alphabetical characters off the results. We also removed definite articles from LOC and PER results and titles, e.g., Dr., Mr. or Sir, from PER results. The evaluation of models with this simple post-processing (PP) is displayed in Table 8. We applied post-processing to the best performing RF and DT models as well as to the best baseline, Baseline SNER; the gain in F1 with respect to the models without post-processing was 1.9%, 2.8% and 0.3% respectively. The highest F1 was achieved by the RF N100 PP model, which gained 2.5% over the best baseline, Baseline SNER PP.
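The post-processing rules described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the title list contains only the three examples named in the text, and the function names and the order in which the rules are applied are assumptions.

```python
# Titles named in the text; a real list would likely be longer (assumption).
TITLES = {"dr", "mr", "sir"}


def trim_non_alpha(text):
    """Strip non-alphabetical characters from both ends of the entity text."""
    start, end = 0, len(text)
    while start < end and not text[start].isalpha():
        start += 1
    while end > start and not text[end - 1].isalpha():
        end -= 1
    return text[start:end]


def post_process(text, ne_class):
    """Apply the simple post-processing (PP) rules to one recognized entity."""
    result = trim_non_alpha(text)
    changed = True
    while changed:
        changed = False
        parts = result.split(None, 1)
        if len(parts) < 2:
            break  # never strip the only remaining word
        first = parts[0].rstrip(".").lower()
        if ne_class in ("LOC", "PER") and first == "the":
            result, changed = parts[1], True   # drop the definite article
        elif ne_class == "PER" and first in TITLES:
            result, changed = parts[1], True   # drop a leading title
        result = trim_non_alpha(result)
    return result


if __name__ == "__main__":
    print(post_process("#Dr. Gregory House!", "PER"))
    print(post_process("the Netherlands.", "LOC"))
    print(post_process("The Beatles", "ORG"))
```

Articles are removed only for LOC and PER, so an ORG result such as "The Beatles" is left unchanged, in line with the description.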

Figure 13: Comparison of our two best performing models RF N100 and DTJ48 M13 with the baselines and top-three combined NE recognizers. [Bar chart values, in plotted order:
Precision: Baseline SNER 0.836, Baseline Train 0.831, Baseline Test 0.801, RF N100 0.788, DTJ48 M13 0.775, Stanford NER 0.673, Open Calais 0.656, Illinois NET 0.647.
Recall: RF N100 0.760, DTJ48 M13 0.729, Baseline SNER 0.710, Baseline Test 0.706, Illinois NET 0.694, Baseline Train 0.672, Stanford NER 0.668, Open Calais 0.602.
F1: RF N100 0.774, Baseline SNER 0.768, DTJ48 M13 0.751, Baseline Test 0.750, Baseline Train 0.743, Stanford NER 0.670, Illinois NET 0.669, Open Calais 0.628.]

Table 8: Evaluation of Classification Models Using Post-Processing (PP) over the Test Dataset

                     F1 per NE class               Macro average          Micro average
Model                LOC    MISC   ORG    PER      P      R      F1       P      R      F1
RF N100 PP           0.594  0.246  0.544  0.887    0.618  0.550  0.568    0.804  0.774  0.789
RF N100              0.584  0.231  0.476  0.883    0.589  0.529  0.543    0.788  0.760  0.774
DTJ48 M13 PP         0.576  0.356  0.441  0.880    0.626  0.537  0.563    0.796  0.750  0.772
Baseline SNER PP     0.578  0.267  0.465  0.868    0.687  0.482  0.544    0.839  0.712  0.770
Baseline SNER        0.589  0.267  0.465  0.864    0.689  0.484  0.546    0.836  0.710  0.768
DTJ48 M13            0.570  0.356  0.365  0.867    0.599  0.516  0.539    0.775  0.729  0.751
Baseline Test        0.614  0.295  0.464  0.844    0.669  0.491  0.554    0.801  0.706  0.750
Baseline Train       0.614  0.295  0.296  0.844    0.694  0.436  0.512    0.831  0.672  0.743
#MSM2013 21 3        0.505  0.308  0.411  0.834    0.510  0.532  0.514    0.701  0.726  0.713

6. Conclusions
We introduced an approach to combining NE recognizers based on diverse methods for the task of NER in micro-posts and examined several machine learning techniques for combining the text and annotation features produced by the recognizers. The best performing machine learning techniques were random forest and decision trees based on the C4.5 algorithm. Combining models built on top of these techniques achieved performance superior to that of the combined NE recognizers. Moreover, the best combining model, RF N100, trained over the informal text of micro-posts, performed better than the baseline recognizers, although the combined NE recognizers were not specially trained or tweaked for informal text. The gain in F1 of the RF N100 model was 16% with respect to the best of the combined NE recognizers, Stanford NER, and 1% with respect to the best baseline recognizer, which was also Stanford NER but specially trained on the micro-posts data. The performance of the RF and DT models indicates that machine learning techniques lead to a more favorable combination of the underlying NE recognizers than the manual combination in one of the baseline NE recognizers, which was an ensemble of the best NE recognizers for each NE class. The gain in F1 of the RF N100 model with respect to the ensemble baseline was approx. 3%. The advantage of the combining models is that they can adapt to the actual text according to its features and the annotations from the combined NE recognizers, and that they benefit from the given negative examples, as we saw in their capability to eliminate false positive (FP) results produced by the combined tools. The proposed approach to combining NER methods was successfully applied in the #MSM2013 Concept Extraction Challenge organized under WWW2013 and finished as the first runner-up with F1 = 66.2 % and a 1.2 % loss. More specifically, we were the first in recall (R = 61.3 %) and the second in precision (P = 76.4 %).

Acknowledgements. This work was partially supported by projects VEGA 2/0184/10, VEGA 2/0185/13, SMART ITMS: 26240120005, SMART II ITMS: 26240120029, Recler ITMS: 26240220029, TRADICE APVV-0208-10, VENIS FP7-284984 and CLAN APVV-0809-11. The author would like to thank his supervisor Assoc. Prof. Dr. Michal Laclavík for his valuable advice and comments.

References

[1] A. E. C. Basave, M. Rowe, M. Stankovic, and A.-S. Dadzie, editors. Proceedings, Concept Extraction Challenge at the 3rd Workshop on Making Sense of Microposts (#MSM2013): Big things come in small packages, Rio de Janeiro, Brazil, 13 May 2013, May 2013.

[2] A. E. C. Basave, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making sense of microposts (#msm2013) concept extraction challenge. In Basave et al. [1], pages 1–15.

[3] K. Bontcheva and D. Rout. Making sense of social media streams through semantics: a survey. Semantic Web, 2012.

[4] L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001.

[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[6] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002.

[7] H. Cunningham, D. Maynard, and V. Tablan. Jape: a java annotation patterns engine. Technical Report CS-00-10, University of Sheffield, UK, November 2000.

[8] S. Dlugolinsky, M. Ciglan, and M. Laclavik. Evaluation of named entity recognition tools on microposts. In Intelligent Engineering Systems (INES), 2013 IEEE 17th Int. Conf. on, 2013.

[9] D. Etter, F. Ferraro, R. Cotterell, O. Buzek, and B. Van Durme. Nerit: Named entity recognition for informal text. Technical report, Technical Report 11, Human Language Technology Center of Excellence, Johns Hopkins University, July, 2013.

[10] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[11] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine learning, 37(3):277–296, 1999.

[12] M. Habib, M. V. Keulen, and Z. Zhu. Concept extraction challenge: University of Twente at #msm2013. In Basave et al. [1], pages 17–20.

[13] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[14] C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B.-S. Lee. Twiner: Named entity recognition in targeted twitter stream. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pages 721–730, New York, NY, USA, 2012. ACM.

[15] X. Liu, S. Zhang, F. Wei, and M. Zhou. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 359–367, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[16] D. Milne and I. H. Witten. An open-source toolkit for mining wikipedia. Artif. Intell., 194:222–239, Jan. 2013.

[17] J. R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[18] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP ’09, pages 248–256, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[19] L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL ’09, pages 147–155, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[20] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1375–1384, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[21] A. Ratnaparkhi. Maximum entropy models for natural language ambiguity resolution. PhD thesis, University of Pennsylvania, 1998.

[22] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1524–1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[23] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[24] Š. Dlugolinský, P. Krammer, M. Ciglan, and M. Laclavík. MSM2013 IE Challenge: Annotowatch. In Basave et al. [1], pages 21–26.

Selected Papers by the Author

S. Dlugolinsky, M. Laclavik, L. Hluchy. Towards a search system for the web exploiting spatial data of a web document. In Database and Expert Systems Applications (DEXA), 2010 Workshop on, Aug 2010, pp. 27–31.

S. Dlugolinsky, M. Seleng, M. Laclavik, L. Hluchy. Distributed web-scale infrastructure for crawling, indexing and search with semantic support. Computer Science, vol. 13, no. 4, 2012. [Online]. Available: http://journals.agh.edu.pl/csci/article/view/42

S. Dlugolinsky, M. Ciglan, M. Laclavik. Evaluation of named entity recognition tools on micro-posts. In Intelligent Engineering Systems (INES), 2013 IEEE 17th Int. Conf. on, 2013.

Š. Dlugolinský, P. Krammer, M. Ciglan, M. Laclavík. MSM2013 IE Challenge: Annotowatch. In Making Sense of Microposts (#MSM2013) Concept Extraction Challenge, A. E. C. Basave, M. Rowe, M. Stankovic, and A.-S. Dadzie, Eds., May 2013, pp. 21–26. [Online]. Available: http://ceur-ws.org/Vol-1019/paper_21.pdf

Š. Dlugolinský, M. Laclavík, M. Šeleng, M. Ciglan, M. Tomašek, L. Hluchý. Advanced email search in small enterprises. In 7th Workshop on Intelligent and Knowledge Oriented Technologies. - Bratislava : Nakladatel’stvo STU, 2012, p. 23-26. ISBN 978-80-227-3812-5.

Š. Dlugolinský, T. G. Nguyen, M. Laclavík, M. Šeleng. Character gazetteer for named entity recognition with linear matching complexity. In Proceedings of the 2013 World Congress on Information and Communication Technologies : WICT 2013. Eds. Ngo, L.T. et al. - IEEE Systems Man and Cybernetics Society, Spain Chapter, 2013, p. 364-368. ISBN 978-1-4799-3230-6.

Š. Dlugolinský, P. Krammer, M. Ciglan, M. Laclavík, L. Hluchý. Combining named entity recognition methods for concept extraction in Microposts. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 4th Workshop on Making Sense of Microposts (#Microposts2014), pages 34–41, April 2014.

Seamless Handover in Networks Based on IEEE 802.11 Standard

Ján Balažia∗

Institute of Applied Informatics
Faculty of Informatics and Information Technologies


[email protected]

Abstract
In recent years we have seen tremendous growth in the use of various multimedia services over IP-based networks, be it high-resolution video, real-time broadcasting or voice services. At the same time, small portable computers and tablets entered the market in a big way, and mobile phones became a fully-fledged replacement for computers on the road. With the rising number of mobile devices sold, the demand for these kinds of services that preserve client mobility grows enormously. This exposes a fundamental issue of IEEE 802.11 networks, which are already part of every mobile device sold: the time needed to reassociate with an access point is 50 milliseconds at best. Voice-based multimedia services, however, tolerate an interruption of at most 40 to 50 milliseconds for smooth transmission, which makes networks based on the IEEE 802.11 standard hardly usable for them.

The aim of this work was to propose an architecture and the protocol support necessary to perform this transition in negligible time, in order to eliminate the problems connected with the transmission of multimedia services, without requiring any changes to the software or hardware of existing mobile stations. The proposal was verified on existing hardware in a laboratory environment, and the test results confirmed the correctness of the architecture design.

Categories and Subject Descriptors
C.2.1 [Computer-communication networks]: Network Architecture and Design; C.2.3 [Computer-communication networks]: Network Operations - network management, network monitoring

∗Recommended by thesis supervisor: Assoc. Prof. Ivan Kotuliak. Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on August 25, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Balažia, J. Seamless Handover in Networks Based on IEEE 802.11 Standard. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 37-44

Keywords
802.11, seamless handover, handover latency, remote MAC separator, network architecture, network management

1. Introduction
This work was originally inspired by the mobile 3GPP architecture, where mobile users have access to multimedia services even while on the move [1]. Mobile networks have a great advantage in comparison with networks based on the 802.11 standard: they allow a station to hold more than one association at a time. Thanks to this, the handover is quick enough that the user does not experience unpleasant disruptions during a live video call or a VoIP session while travelling by train or car. In comparison, the 802.11 standard strictly defines a single association at a time [2]. As a result, each time the mobile station (MS) reassociates, it has to pass through three phases defined by 802.11: the Detection phase, the Selection phase and the Execution phase [3]. All these phases introduce delays that add up, in the best case, to an interruption of approximately 50 ms (when using the 802.11r standard). According to [4], [5], the maximal tolerable disruption of voice services is between 40 and 50 milliseconds, which is hardly reachable even with the latest standards. To explain the problem in more detail, we first define the before-mentioned handover phases.

The detection phase starts in the mobile station at given time intervals and uses several techniques to detect the link quality. Once, e.g., the RSSI reaches the level of -80 dBm, the station should go to the next phase, which is the Selection phase [6]. However, there are several problems connected with this decision. What if the change in RSSI was caused accidentally by a signal reflection or by a temporary obstacle? When is actually the right time to switch to the next phase? There might be several more factors indicating that the connection is starting to be problematic, e.g. a rapid decrease in throughput or in the signal-to-noise ratio. Existing solutions are discussed later.

Once the station decides that it is the right time to move forward, the Selection phase takes place. This phase is used for scanning for new Access Points (AP), and it is also one of the most time and energy consuming. When the station starts to search for new APs, it is unable to transport data. Therefore, the search has to fit into gaps in data transport, which of course decreases throughput at that time. After finishing the search, the station needs to select the best AP in range. Again, several variables may enter the decision. The first one is the Service Set Identifier (SSID), which is crucial for staying in the same network. If there are several APs with different Basic Service Set Identifiers (BSSID), the station selects the one with the best service delivery (e.g. QoS support, security settings) [7].

After the selection of a new AP, the realisation phase takes over. In general, wireless networks use the 802.11 security (802.11i) and QoS (802.11e) amendments, which cause large delays due to the necessary message exchange [8]. These delays are currently addressed by the latest standardised amendment, 802.11r, which defines procedures for proactive exchange of security information (in order to save time) but does not define the protocols to be used. Therefore, most vendors implemented their own, mutually incompatible solutions, and the amendment was implemented only by big companies in their proprietary products [9][10][11].

2. State of the Art
The main focus of our work is to bring seamless handover to MSs roaming the network (in infrastructure mode) within one Extended Service Set (ESS). This set covers a Distribution System (DS) composed of several Basic Service Sets (BSS), which represent APs [12]. When communicating with an AP over 802.11, the MS uses frames belonging to these three categories:

• Data frames - these frames are used for data communication on L2 and use LLC and SNAP headers to address frames.

• Control frames - frames used on the medium to direct MSs to use the medium only within the assigned time slot, via CTS, RTS and ACK messages.

• Management frames - frames used for management functions such as association, handover, disassociation, channel switching, QoS management and so on [13].
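The three categories above correspond to the 2-bit Type field in the Frame Control field of every 802.11 frame. As a minimal sketch, the first Frame Control byte can be decoded as follows; the function names are illustrative.

```python
# Type field values of the 802.11 Frame Control field.
FRAME_TYPES = {0: "management", 1: "control", 2: "data"}


def frame_category(first_fc_byte):
    """Return the frame category encoded in bits 2-3 of the first Frame Control byte."""
    frame_type = (first_fc_byte >> 2) & 0b11
    return FRAME_TYPES.get(frame_type, "reserved/extension")


def frame_subtype(first_fc_byte):
    """Return the 4-bit subtype encoded in bits 4-7 of the first Frame Control byte."""
    return (first_fc_byte >> 4) & 0b1111


if __name__ == "__main__":
    examples = {
        0x80: "Beacon",
        0x40: "Probe Request",
        0xB4: "RTS",
        0xC4: "CTS",
        0xD4: "ACK",
        0x08: "Data",
    }
    for byte, name in examples.items():
        print(f"0x{byte:02X} {name}: {frame_category(byte)}, "
              f"subtype {frame_subtype(byte)}")
```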

As the first step, before the MS can start to communicate, a successful association with a BSS has to occur. This association is based on the Service Set Identifier (SSID), which belongs to the whole ESS. To obtain this parameter, the MS may passively listen on the medium or use active scanning to get the information sooner. In response to a successful network probe, the MS receives network information such as the authentication method, and the next process may start. After the exchange of authentication frames, the AP and the MS exchange their capabilities and the MS may start to communicate over the channel. Once the MS decides to leave, a de-authentication management frame is sent and the MS may leave the medium. This process is shown in Figure 1 [8].

One of the most crucial parts of our work is MS mobility. In today's networks the handover management is located within the MS. The whole decision process starts periodically in the MS (e.g. every 500 ms) and measures the current RSSI, throughput, packet loss and other connection parameters, as well as the parameters of neighbouring APs (via active or passive scanning). Once the MS finds that the parameters of another AP are better, it initiates a handover, starting with disassociation from the current AP and continuing with the association process explained above (unless 802.11r or 802.11i with key caching, explained later, is used) [14].

Figure 1: Authentication process.

To make the handover better or quicker, several tweaks may be used in all handover phases, as described in the following sections.

2.1 Discovery Phase
In the discovery phase, a preemptive search may be used instead of searching only once the MS decides that the current RSSI is not good enough [25]. This type of search may slightly decrease throughput, because it is performed periodically while the connection with the present AP is still in place. Another drawback is that proactive scanning consumes much more energy. Another approach is not to use active scanning but to listen passively on channels to save as much energy as possible [16]. Its drawback is that the station might not receive the beacon messages of all surrounding APs in time. The next method is SyncScan [17], a passive scanning method that tries to predict the time of beacon reception so that the time spent waiting on each channel can be greatly reduced. Yet another approach is to use 802.11k, which advertises an ordered list of alternative APs that the MS may use for future communication [18]. To avoid unnecessary handovers, mathematical models are also used, such as smoothing the RSSI data with a Moving Average or an Exponentially Weighted Moving Average [16].
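The RSSI smoothing mentioned at the end of this section can be illustrated with a short sketch. The smoothing factor, the window size and the sample values are arbitrary assumptions chosen for illustration; the -80 dBm trigger is the example level from the Introduction.

```python
def moving_average(samples, window=3):
    """Trailing moving average over the last `window` samples."""
    smoothed = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


def ewma(samples, alpha=0.25):
    """Exponentially weighted moving average: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = []
    current = None
    for x in samples:
        current = x if current is None else alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


if __name__ == "__main__":
    # A single short dip to -85 dBm, e.g. caused by a temporary obstacle.
    rssi_dbm = [-60, -60, -85, -60, -60]
    trigger_dbm = -80  # example detection level from the Introduction
    print("raw     :", [x <= trigger_dbm for x in rssi_dbm])
    print("ewma    :", [x <= trigger_dbm for x in ewma(rssi_dbm)])
    print("mov.avg.:", [x <= trigger_dbm for x in moving_average(rssi_dbm)])
```

With the raw samples the single dip crosses the trigger level, while neither smoothed series does, so a handover decision based on smoothed data would ignore the short disturbance.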

2.2 Selection Phase
Choosing an appropriate AP selection mechanism is necessary. In general, there are client-based, server-based and mixed solutions, all with the main aim of avoiding the so-called "yo-yo" effect. This effect is hopping between two or more neighbouring APs because of momentarily "better" connection parameters. As a result, the communication is killed by non-stop reassociations. Several algorithms are used to avoid it. The most basic algorithms initiate the handover only once the throughput falls below an acceptable level, so the chance of hopping back is minimal. An extension of this algorithm uses history, so the handover is initiated if the throughput is decreasing over time. Some algorithms also use trends, observing other APs and correlating them with the current signal strength. Typical handover algorithms are based either on signal strength, where they may use the best signal, thresholds, hysteresis, or trends and prediction, or on observed throughput, where they may use load balancing or 802.11k [16].
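A combined threshold and hysteresis rule of the kind listed above could look like the following sketch. The threshold and margin values, as well as the function and AP names, are hypothetical and serve only to show how hysteresis suppresses hopping between APs with similar signal strength.

```python
def select_ap(current_ap, rssi_by_ap, threshold_dbm=-75.0, hysteresis_db=5.0):
    """Return the AP the station should use.

    The station stays on its current AP while the signal is above the
    threshold, and switches only to a candidate that is better than the
    current AP by at least the hysteresis margin.
    """
    if not rssi_by_ap:
        return current_ap
    current_rssi = rssi_by_ap.get(current_ap, float("-inf"))
    if current_rssi >= threshold_dbm:
        return current_ap  # link is still good enough, no handover
    best_ap = max(rssi_by_ap, key=rssi_by_ap.get)
    if rssi_by_ap[best_ap] >= current_rssi + hysteresis_db:
        return best_ap     # clearly better candidate, hand over
    return current_ap      # improvement too small, avoid hopping


if __name__ == "__main__":
    print(select_ap("ap1", {"ap1": -70, "ap2": -50}))  # stays: above threshold
    print(select_ap("ap1", {"ap1": -80, "ap2": -78}))  # stays: margin too small
    print(select_ap("ap1", {"ap1": -80, "ap2": -70}))  # switches to ap2
```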

2.3 Realisation Phase
Two types of handover may occur: on the second and on the third OSI layer. In our work we focus on the second layer and try to reduce the following delays: radio signal change, reauthentication, reassociation and QoS negotiation. This phase is one of the most important, because the connection is lost while the MS is in the realisation phase. Because of its importance, most companies have patented their solutions or keep them closed source and proprietary [9][10][11]. From the open solutions we picked the 802.11f protocol, which was used for communication between APs to proactively distribute the MS security context, so that the MS can save time on security negotiation. However, because its authors did not define the message exchange, implementations were vendor specific, and the 802.11f amendment was later withdrawn [19]. Another solution in use is 802.11i with proactive key caching. Its advantage is that the AP stores already exchanged keys, so once the MS visits the AP again there is no need for extra negotiation [20]. The drawback is that the MS has to perform the full 802.11i authentication when visiting new APs, which is actually the most common case. Currently the best solution on the market is 802.11r, which distributes proactively derived keys to the potential APs that the MS may visit. As a result, only a 4-way handshake has to be performed between the MS and the AP. This process takes approximately 50 ms [21].

2.4 Experimental Solutions
This work was also inspired by three experimental works. The first one is Personal AP [22], which defines an architecture of flying ghost APs that follow the MS, so the MS does not have to perform a handover. This architecture is based on the so-called Split MAC separator approach. To explain it, IEEE defines three types of 802.11 networks:

• Local MAC separator based - the whole logic of an AP (radio, control and management) is located at Wireless Terminal Points (WTP). This is the most commonly used approach in the 802.11 networks we know.

• Split MAC separator based - in this architecture the radio layer and the delay-sensitive management and control functions are located at the WTP (logical AP). The rest, such as QoS management, device configuration and load balancing, is shifted to the Access Controller (AC). Examples are Meru Networks Systems [10] and the experimental approach called Personal AP [22].

• Remote MAC separator based - solutions based on this architecture shift all the management and control logic to the AC. The WTP is used only as a radio controller and a bridge between the wired and the wireless medium. Our architecture proposal is based on this type of architecture, as is the proprietary solution of Aruba Networks Systems [9].

The second inspiring work is called D-Scan [23] and focuses on gathering data in AP-dense areas. Its authors extracted useful data from 802.11 frames in the air to better observe the current environment. The third and last work, Improving the latency of 802.11 hand-offs using neighbour graphs [24], focused on the creation of graphs in which nodes represent APs and edges represent useful paths between them. This technique might be used for better predictions.

Figure 2: MS communicating via WTP2 before transition.

3. Open Problems
As a result of the overview of the current environment, we decided to look at the 802.11 network as a homogeneous entity instead of looking at its individual parts and solving their problems separately. Based on this, we defined the following open problems:

• A new architecture proposal that allows MSs to perform a seamless transition between different WTPs and to roam across the network without data loss that would affect multimedia services. The architecture is designed not to change any existing client behaviour, so current stations can seamlessly use the services it provides.

• A new protocol proposal that will support the network management and client handover.

• An algorithm that will support the client transition between different WTPs and allow clients to roam seamlessly.

The proposed architecture will be verified on existing hardware and compared with a commonly used 802.11 solution.

4. Architecture Specification
Based on the open problems, we defined the entities that cooperate within our new architecture:

• Access Controller (AC): The distributed network core that coordinates the vital functions of the whole network. By these functions we mean especially WTP coordination, user and service management, and the handover decision with its final execution.

• Access Point (AP): Each MS has a dedicated AP created within the AC. As the architecture is Remote MAC separator based, the connection between the AP and the MS is mediated via a WTP.

• Wireless Terminal Point (WTP): A physical device that transmits the wireless signal and coordinates control messaging on the medium. All other vital 802.11 functions are shifted to the network core, the AC.

• Mobile station (MS): A user device that communicates with its AP located in the AC via a WTP. This device is unaware of the architecture type and is unable to see how the data are transferred towards its dedicated AP.

Figure 3: MS communicating via WTP3 after transition.

Figure 2 and Figure 3 show how the communication is established on L2 and what the transition of an MS between two WTPs looks like. The important point is that the whole AP-related context is exactly the same within the whole ESS.

To ensure this transition, we specified requirements for all network entities:

• AC - has to use a communication protocol to talk to the WTPs on the network, has to manage the WTPs on the network, has to create logical APs and manage their context for all connected MSs on the network, has to analyse network and MS behaviour and make appropriate decisions based on the gathered data, and has to transport user data to the right destination.

• WTP - has to use a communication protocol to talk to the AC, has to bridge management and data frames between the AC and the MSs, has to implement control messaging for the wireless medium, has to implement the handover functionality, and has to gather statistical information about connected clients and send it to the AC.

Based on these requirements we created a proposal that covers the required functionality. Each of the following chapters deals with one part of the architecture proposal.

4.1 APMP Management Protocol
Access Point Management Protocol (APMP) is our proprietary solution with one primary aim: to execute the handover in the minimal possible time. Therefore, no existing solution such as CAPWAP [25] or OpenFlow [13] is used; these would only extend the time that we need to shorten. The architecture uses UDP to transport control messages and an L2 extension to transport data frames (discussed in the next chapter). An APMP message consists of a 1 B message ID followed by TLV fields: 1 B for the type, 2 B for the length, and a value whose size is given by the length field.

Figure 4: APMP state diagram.
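The message layout described above (a 1-byte message ID followed by TLV fields with a 1-byte type and a 2-byte length) can be sketched as follows. The byte order is not stated in the text; network byte order (big-endian) is assumed here, and the message ID and field type values in the example are hypothetical.

```python
import struct


def encode_message(message_id, fields):
    """Encode a message ID and a list of (type, value-bytes) TLV fields."""
    out = bytearray(struct.pack("!B", message_id))
    for field_type, value in fields:
        # 1 B type, 2 B length (big-endian by assumption), then the value itself.
        out += struct.pack("!BH", field_type, len(value)) + value
    return bytes(out)


def decode_message(data):
    """Decode bytes into (message_id, [(type, value-bytes), ...])."""
    if not data:
        raise ValueError("empty message")
    message_id = data[0]
    fields = []
    offset = 1
    while offset < len(data):
        if offset + 3 > len(data):
            raise ValueError("truncated TLV header")
        field_type, length = struct.unpack_from("!BH", data, offset)
        offset += 3
        if offset + length > len(data):
            raise ValueError("truncated TLV value")
        fields.append((field_type, data[offset:offset + length]))
        offset += length
    return message_id, fields


if __name__ == "__main__":
    # Hypothetical IDs: message 0x05 carrying a MAC address in field type 0x01.
    raw = encode_message(0x05, [(0x01, bytes.fromhex("001122334455"))])
    print(raw.hex())
    print(decode_message(raw))
```

Because such a message travels as a single UDP datagram, the decoder can rely on the datagram boundary and simply reads TLV fields until the buffer is exhausted.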

This protocol defines seven basic processes that are vital for the functionality of the architecture and are shown in Figure 4.

The description of processes is as follows:

1. Accepting new WTP: A new WTP is accepted by the AC after an exchange of an APMPttc message (which may be verified with an AAA server) and the corresponding APMPcp reply message containing the WTP parameters. Otherwise the communication is rejected by APMPrecreq and all future messages are discarded until the next successful authorisation. The WTP has to send APMPkeepalive messages to keep the connection active.

2. MS network discovery: The communication starts by forwarding the MS Probe request towards the AC via an APMPprobe message. In exchange, the AC replies with an MS-specific AP context via an APMPstactx message. This context contains a personalised BSSID, which will be used for future communication.

3. MS Authentication: Based on the context received, the MS starts authentication with the AP via the WTP. The WTP uses an APMPauth message to forward the authentication frame to the AC. The authentication may be validated via a RADIUS server. Importantly, the key derived by the MS for communication with the AP is stored in the AC and is never propagated to the WTP.

4. MS Association: After authentication the MS exchanges association messages through the WTP and the APMPassoc message. The WTP responds immediately so that the MS may start to communicate.


Figure 5: Data encapsulation frame from AC towards destination (top) and from source towards AC (bottom).

5. Statistics: Each WTP periodically collects MS statistics and sends them to the AC via an APMPstactx message. The aggregated information is parsed by the AC, and a handover decision can be made on its basis.

6. Handover: Handover always originates in the AC and is propagated via an APMPrreq message. In the first step the new WTP is informed. If the WTP decides to accept the MS, it replies with an APMPack message; otherwise an APMPrej message is sent. After successful reception by the AC, the old WTP is informed via an APMPrel message to release the client. This message is again confirmed by APMPack.

7. Disassociation: Once the MS decides to leave the ESS, a Disassociation frame is sent. This frame is forwarded to the AC via an APMPdisas message and the AC replies with APMPrel.

4.2 APMP Management Protocol

The primary purpose of joining the wireless network is to send data towards a destination. Compared with the Local MAC separator based solution, we have to make one more hop on L2 to transport the encrypted frame from the MS to the logical AP located in the AC. The frame has to be delivered to the AC because the WTPs do not hold encryption keys, so a WTP, like any other device on the path, is unable to understand the data above L2. Given the 802.11 and 802.3 addressing schemes, we had to encapsulate the original destination address created by the MS and replace it with the MAC address of the AC, so that the frame can be processed, decrypted and sent on towards the proper destination. If the destination is a wireless client, the AC has to encrypt the frame with the key of the destination MS, replace the source address with the address of the AC, encapsulate the original source and send the frame towards the destination. For the encapsulation we created a new EtherType with the value 0x2222; the structure is shown in Figure 5.
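The encapsulation in the direction from the source towards the AC can be sketched as follows. Only the EtherType value 0x2222 and the idea of moving the original destination address into the payload come from the text; the exact field order is defined in Figure 5, which is not reproduced here, so the layout below is an assumption.

```python
import struct

APMP_ETHERTYPE = 0x2222  # EtherType value given in the text

def encapsulate_towards_ac(frame, ac_mac):
    """Wrap an Ethernet frame so that it is delivered to the AC.

    Assumed layout (hypothetical, see Figure 5 for the real one):
    AC MAC | source MAC | 0x2222 | original destination MAC |
    original EtherType | payload
    """
    orig_dst, src = frame[0:6], frame[6:12]
    orig_type, payload = frame[12:14], frame[14:]
    header = ac_mac + src + struct.pack("!H", APMP_ETHERTYPE)
    return header + orig_dst + orig_type + payload

def decapsulate_at_ac(frame):
    """Restore the original frame from the assumed encapsulation above."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != APMP_ETHERTYPE:
        raise ValueError("not an encapsulated frame")
    src, orig_dst = frame[6:12], frame[14:20]
    orig_type, payload = frame[20:22], frame[22:]
    return orig_dst + src + orig_type + payload
```

Under this assumed layout each frame grows by 8 bytes (the 6 B original destination address plus the 2 B original EtherType), which is the kind of header enlargement that forces a smaller MTU on the wireless side.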

As a side effect we had to lower the MTU from 1500 to 1492, which has a small impact on the data throughput evaluated in the Results section.

4.3 Access Controller

Having defined the management protocol and the data communication, we can move on to describing the processes within the AC. The first and most important process follows the MS behaviour: joining the network, moving across the ESS and finally leaving the network.

The lifecycle starts with a new MS beginning the association process with a Probe message or by resuming from sleep. In both cases the MS has to pass authentication, and once it is finished the MS may start to communicate.

Each station is afterwards observed by a statistics module. If this module reports that there is a better WTP, the AC may start the handover process. The process is started by an APMPrreq message sent towards the new WTP. This step may end in a successful handover (the MS is accepted by the WTP); in case of failure the AC selects the next possible WTP and tries the handover again. After the MS has been accepted by the WTP and the APMPack message has been received, the old WTP is told to flush the MS context and the cycle may start again. If the AC does not receive any statistical message with the MS identifier for a certain amount of time, the MS context is automatically discarded and the last known WTP is told to flush the MS context via an APMPrel message.
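The retry logic of the handover described above can be summarised in a short Python sketch. The `send` callback, the string message names used as return values and the ordered candidate list are hypothetical stand-ins for the AC's real transport and WTP ranking, which the text does not detail.

```python
def run_handover(ms, current_wtp, candidates, send):
    """Try candidate WTPs in order; return the WTP now serving the MS.

    `send(wtp, message, ms)` is a hypothetical transport callback that
    returns the name of the reply ("APMPack" or "APMPrej").
    """
    for wtp in candidates:
        if wtp == current_wtp:
            continue
        # Step 1: ask the new WTP to accept the MS.
        if send(wtp, "APMPrreq", ms) != "APMPack":
            continue  # rejected: try the next possible WTP
        # Step 2: the new WTP accepted, so the old one releases the MS.
        send(current_wtp, "APMPrel", ms)
        return wtp
    return current_wtp  # no candidate accepted; the MS stays where it is
```

The old WTP is released only after the new one has confirmed the MS, so in this sketch there is never a moment in which no WTP holds the MS context.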

The next important process gathers the statistics used to decide whether to roam the MS or not. Each station is monitored right after successful association. After reception of the first APMPstat message the MS is included in the calculations; otherwise the station is marked as expired and is discarded after a certain amount of time. The same happens once the MS is no longer included in any statistic received from the connected WTPs. Once the AC finds out that the MS on the current WTP is about to reach the RSSI threshold, the handover process is started and the station is marked as roaming until the roam succeeds or fails. According to the result, the list of potential WTPs per MS is recalculated.
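A minimal sketch of this monitoring logic is given below. The threshold, margin and timeout values are assumptions chosen only to make the example concrete; the paper does not state them, and the class name `StationMonitor` is hypothetical.

```python
import time

RSSI_THRESHOLD_DBM = -70.0  # assumed value, not given in the paper
RSSI_MARGIN_DB = 3.0        # assumed meaning of "about to reach"
EXPIRY_SECONDS = 10.0       # assumed timeout for missing statistics

class StationMonitor:
    def __init__(self):
        self.last_seen = {}  # MS identifier -> time of the last statistic
        self.rssi = {}       # MS identifier -> last RSSI on its current WTP

    def on_stat(self, ms, rssi_dbm, now=None):
        """Record one APMPstat report for `ms`."""
        now = time.monotonic() if now is None else now
        self.last_seen[ms] = now
        self.rssi[ms] = rssi_dbm

    def needs_handover(self, ms):
        """True when the MS is within the margin of the RSSI threshold."""
        return self.rssi[ms] <= RSSI_THRESHOLD_DBM + RSSI_MARGIN_DB

    def expire(self, now=None):
        """Discard stations absent from all statistics for too long."""
        now = time.monotonic() if now is None else now
        stale = [ms for ms, seen in self.last_seen.items()
                 if now - seen > EXPIRY_SECONDS]
        for ms in stale:
            del self.last_seen[ms]
            del self.rssi[ms]
        return stale
```

Passing `now` explicitly keeps the sketch easy to test; a real AC would call `expire()` from a periodic timer and follow it with an APMPrel message to the last known WTP.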

4.4 Wireless Terminal Point

The role of the WTP is to manage the 802.11 physical layer and to bridge the connection between the MS and the AC. The MS state diagram is simpler than the one in the AC and can be described as follows.

The association process starts when the MS sends a Probe request on the network with a known ESSID. From then on, every WTP that receives this message creates its own MS state diagram. The Probe message is redirected to the AC via an APMPprobe message, and as a reply APMPstatctx is sent back to the WTP. This message contains the MS context needed to create the WTP interface that will be used for communication with the MS. After its reception the state is set to active. Otherwise, APMPrej is sent to the WTP (telling it that the MS is unable to join the network) and the status is set to rejected. The association is finished after exchanging the APMPauth and APMPassoc messages; the status is then set to associated so the MS may start to transfer data.

The next important scenario is the handover. As mentioned before, the decision is made according to statistics exchanged via APMPstat messages. Once the AC decides that it is time to make a handover, an APMPrreq message is sent towards the new WTP. If this WTP is able to serve another client, the associated state is set and an APMPack message is sent back to the AC. In exchange, the previous WTP receives an APMPrel message to release the context of the MS.

4.5 Results

Verification of the proposed architecture was done in laboratory conditions on existing hardware, using two Ubuntu based computers serving as WTPs via our modified version of HostAPd [26]. The wireless hardware was a TP-Link TL-WN821N v3 USB WiFi stick with the Atheros AR7010+AR9287 chip, managed via the nl80211 driver. The AC was implemented in C using native Linux network calls and the pthread library for thread management. Android, iOS, Windows XP, Windows 7, Mac and Linux based devices were used as MSs. Testing and evaluation were done with IXIA IxChariot, Wireshark and the ping command. The topology is shown in Figure 8, where the STA was roamed between two WTPs. IxChariot endpoints on the AC and on the STA monitored the traffic passing through the test architecture.

The testing was executed on our Remote MAC separator based architecture and on a general Local MAC architecture provided by unmodified HostAPd of the same version. The test handover was issued manually by the AC at a specific time. Both WTPs, placed several meters apart, used the same channel in order to produce some interference.

The testing scenarios were split into two cases. The first one focused on ICMP response time and route, measuring which path was used to transport the ICMP packet from the source (MS) to the destination (AC). The second one focused on throughput, data loss and link reliability while transferring an RTP stream from the MS to the AC during the handover.

The first test consisted of an ICMP Echo message exchange between the MS and the AC and between the MS and both WTPs, while a packet filter was set not to forward ICMP between the mediating WTPs. The results in Figure 6 show that with our proposed architecture the MS did not notice any outage. For comparison, with the Local MAC separator based architecture a long gap in communication occurred, as shown in Figure 7. In the graphs the blue line represents the responses from the AC, and the red and green lines the responses from the current WTPs / APs. The X axis shows time and the Y axis shows the number of packets transferred in the time interval.

In the graph for our architecture the response time from WTP2 and the AC decreased after the handover, while the throughput rose. This effect was caused by shutting down the hardware interface on WTP1, which had been producing interference because both WTPs were communicating on the same channel.

The second test measured the optimal throughput of our proposed architecture and of the original one. We used IxChariot to generate a TCP stream, and the results showed that the throughput of our architecture was 2 Mbit/s lower. This slowdown was caused by the lower MTU required by the extra header in data frames and by the user space implementation of the data transport protocol.

Once we had the optimal throughput values for both architectures, we generated an RTP stream (via IxChariot) starting in the MS and ending in the AC. We measured the delays and the data loss for both architectures while performing the handover. The handover was issued in the 15th second, and the difference between our architecture (Figure 9) and the original one (Figure 10) is clearly visible. Detailed statistics provided by IxChariot showed that the gap in our architecture was 1.41 ms long, compared with a 3.337 s cut in the original Local MAC separator based architecture.

Figure 6: ICMP message exchange in our proposed architecture.

Figure 7: ICMP message exchange in original architecture.

Figure 8: Testbed topology of our proposed architecture (left) and the original (right).

Figure 9: One-way delay test in our architecture.

Figure 10: One-way delay test in original architecture.

5. Conclusions

We have proposed a centralised IEEE 802.11 architecture based on the Remote MAC separator with the main goal of reducing all unnecessary delays associated with the handover process. The reduction was achieved by shifting the handover logic away from the mobile station and placing it in the network core. Our tests confirmed that the architecture proposal was correct and that the existing 50 ms handover status quo can easily be lowered without breaking any existing IEEE standards. The drawback of this proposal is the enlargement of Ethernet headers, which may bring a small throughput slowdown that we measured as 2 Mbit/s.

Acknowledgements. This work was partially supported by the Scientific Grant Agency of Slovak Republic, grant No. VEGA 1/0836/16.

References

[1] K. Ahmavaara, H. Haverinen, and R. Pichna. Interworking architecture between 3GPP and WLAN systems. IEEE Communications Magazine, Vol. 41, pages 74-81, Nov. 2003.

[2] IEEE Standard for Information technology. Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. http://standards.ieee.org/getieee802/download/802.11-2012.pdf, March 2012.

[3] Kashif Nizam Khan, Jinat Rehana. Wireless Handoff Optimization: A Comparison of IEEE 802.11r and HOKEY. https://hal.inria.fr/hal-01056504/document, 2014.

[4] I.F. Akyildiz, J. Xie, and S. Mohanty. A survey of mobility management in next-generation all-IP-based wireless systems. In Wireless Communications, IEEE (See also IEEE Personal Communications), pages 16-28, August 2004.

[5] Tim Szigeti, Christina Hattingh. Quality of Service Design Overview. Cisco Press, http://www.ciscopress.com/articles/article.asp?p=357102&rl=1, December 2004.

[6] Ali Safa Sadiq, Kamalrulnizam Abu Bakar, Kayhan Zrar Ghafoor, and Alberto J. Gonzalez. Mobility and Signal Strength-Aware Handover Decision in Mobile IPv6 based Wireless LAN. http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp664-669.pdf, 2011.

[7] Microsoft. How 802.11 Wireless Works. http://technet.microsoft.com/en-us/library/cc757419%28v=ws.10%29.aspx, March 2003.

[8] Intel Corporation. Understanding IEEE* 802.11 Authentication and Association for Network and I/O. http://www.intel.com/content/www/us/en/support/network-and-i-o/wireless-networking/000006508.html

[9] Aruba Networks. http://www.arubanetworks.com/

[10] Meru Networks. http://www.fortinet.com/meru/

[11] Cisco Systems. http://www.cisco.com/

[12] Cisco Systems. Wireless LANs: Extending the Reach of a LAN. http://www.ciscopress.com/articles/article.asp?p=1156068&seqNum=4, 2008.

[13] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., Turner, J. OpenFlow: Enabling Innovation in Campus Networks. http://archive.openflow.org/documents/openflow-wp-latest.pdf, 2008.

[14] Anthony Noerpel and Yi-Bing Lin. Handover Management for a PCS Network. IEEE Personal Communications. http://ieeexplore.ieee.org/iel4/98/13833/00637379.pdf?arnumber=637379, December 1997.

[15] Pejman Roshan and Jonathan Leary. 802.11 Wireless LAN Fundamentals. 1st Edition, http://docstore.mik.ua/cisco/pdf/other/Cisco%20Press,%20802.11%20Wireless%20Lan%20Fundamentals%20(2003)%20Kb.pdf, December 2003.

[16] Vivek Mhatre and Konstantina Papagiannaki. Using Smart Triggers for Improved User Performance in 802.11 Wireless Networks. MobiSys'06, Uppsala, Sweden, http://portal.acm.org/citation.cfm?id=1134706&dl=ACM&coll=&CFID=15151515&CFTOKEN=6184618, June 2006.

[17] Ishwar Ramani and Stefan Savage. SyncScan: Practical Fast Handoff for 802.11 Infrastructure Networks. http://www.cs.ucsd.edu/~savage/papers/Infocom05.pdf, 2005.

[18] IEEE Standard for Information technology. 802.11k-2008 - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 1: Radio Resource Measurement of Wireless LANs. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4544755&filter=AND(p_Publication_Number:4544752), June 2008.

[19] IEEE Computer Society. IEEE Trial-Use Recommended Practice for Multi-Vendor Access Point Interoperability via an Inter-Access Point Protocol Across Distribution Systems Supporting IEEE 802.11TM Operation. http://standards.ieee.org/getieee802/download/802.11F-2003.pdf, 2003.

[20] Benjamin Miller. Is it the network Solving VoIP Problems on a Wireless LAN. http://www.users.miamioh.edu/roseaw/cit286/WP_Miller_VoIP_LAN.pdf, 2007.

[21] Kuang-Hui Chi, Chien-Chao Tseng and Ya-Hsuan Tsai. Fast Handoff among IEEE 802.11r Mobility Domains. http://www.iis.sinica.edu.tw/page/jise/2010/201007_12.pdf, 2010.

[22] Lei Zan, Jidong Wang and Lichun Bao. Personal AP Protocol for Mobility Management in IEEE 802.11 Systems. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1541021&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1541021, 2005.

[23] Jin Teng, Changqing Xu, Weijia Jia, Dong Xuan. D-Scan: Enabling Fast and Smooth Handoffs in AP-dense 802.11 Wireless Networks. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5062198&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F5061887%2F5061888%2F05062198.pdf%3Farnumber%3D5062198, 2009.

[24] Minho Shin, Arunesh Mishra, William A. Arbaugh. Improving the Latency of 802.11 Hand-offs using Neighbour Graphs. https://www.usenix.org/legacy/publications/library/proceedings/mobisys04/pdf/p70-shin.pdf, 2004.

[25] Internet Engineering Task Force. Architecture Taxonomy for Control and Provisioning of Wireless Access Points (CAPWAP). https://tools.ietf.org/html/rfc4118, July 2004.

[26] Hostapd and wpa_supplicant. https://w1.fi/

Selected Papers by the Author

J. Balažia, I. Kotuliak. Seamless handover in 802.11 networks. In Proc. of 2012 5th Joint IFIP Wireless and Mobile Networking Conf., Bratislava, Slovakia, September, 2012.

J. Balažia, R. Bencel, I. Kotuliak. Architecture proposal for seamless handover in 802.11 networks. In: Proc. of 2016 9th Joint IFIP Wireless and Mobile Networking Conf., Colmar, France, July, 2016.

Visualization, Navigation and Relationship Discovery in Graphs

Ján Mojžiš∗

Institute of Informatics
Slovak Academy of Sciences

Dúbravská cesta 9, 845 07 Bratislava, [email protected]

Abstract

Linked data is a concept for interlinking several data sources, often placed across the world. One of its key requirements is a link; another is a machine-readable structure. Yet many data sources on the web still offer plain unstructured data: newspaper articles, social networks or business registers. Often the data is HTML formatted, with the formatting mixed with the content, which is unsuitable for machine reading. Such data would nevertheless be very useful, since we could extract information about persons, events, places and other objects. To extract such information from unstructured data sources, advanced techniques of information extraction are used. Even when the data structure is extracted or created, the presentation of information to the end user is crucial, because information overload or clutter can be introduced.

In the scope of our work, we focus on graph data structures, data extraction, distributed computing and graph visualization. We design, implement and evaluate a single-machine system for data extraction and information retrieval, capable of using advanced graph visualization and filtering techniques. We propose a new visualization concept of pen patterns and colors. Next we define a new universal graph visualization and filtering method, usable for filtering and relationship discovery. We propose a new distributed algorithm, PCMARS, intended to be used in a Pregel computing cluster for graph relationship discovery tasks. We implement our proposal in a stand-alone client program, AGECRT (Advanced Graph and Clutter Removal Tool), and in the distributed algorithm PCMARS. The solution is conceived as a single architecture.

∗Recommended by thesis supervisor: Assoc. Prof. Michal Laclavík
Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on August 25, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Mojžiš, J. Visualization, Navigation and Relationship Discovery in Graphs. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 45-55

Categories and Subject Descriptors
E.1 [Data Structures]: Distributed data structures, Graphs and networks; H.1.2 [Models and principles]: User/Machine Systems - Human factors, Human information processing; H.3.3 [Information storage and retrieval]: Information search and retrieval - Information filtering; H.3.4 [Information storage and retrieval]: Systems and software - Distributed systems

Keywords
graph, visualization, distributed computing, parallel computing, relationship discovery, linked data

1. Introduction

Of the many kinds of data sources, the Web is the most dynamic, dense and universal. It was originally created by humans for humans, and it contains sources we use on a daily basis: newspaper articles, business statistics, social networks and traffic information are just a few examples. Although a vast set of data is available, we as humans are capable of extracting and using only a small portion of it. Machine computing power is therefore very helpful, for several reasons. First, a machine can process data thousands of times faster and more reliably than any human. Next, with appropriate software environments we can perform information extraction and visualization tasks.

According to the website internetlivestats¹, in 2015 the internet connected roughly 863 million websites, more than 13 times as many as in 2005. This metric does not include the web pages themselves, only base unique IP addresses. Also worth noting are dynamic, periodically changing websites, such as news sites. A big contribution to the Web is the content of social networks, which results from the collaborative work of millions of people across the world. Here, the term Information Society is quite suitable.

Google Search is a popular web search engine owned by Google Inc., which also maintains a web data index. According to the official Google web page², the Google index contains more than 100 million gigabytes of data (approximately 25 thousand 4-terabyte discs). Google also maintained Freebase, a multi-domain database of billions of N-triple statements. The project has since been merged into the Wikidata knowledge base. Wikidata contains structured, machine-readable data gathered from all over the world.

¹ www.internetlivestats.com/total-number-of-websites/, 16.6.2016
² https://www.google.com/insidesearch/howsearchworks/crawling-indexing.html, 16.6.2016

News articles, blogs and social networks, together with traditional web pages (homepages, company pages) written in HTML, are types of unstructured data (formatting mixed with content). They make up a large part of the Web volume, often because their content is updated daily or periodically. One of the most popular social networks at present, Facebook (FB), has more than a billion monthly active users (users who logged in at least once in a 30-day cycle). In 2004 the count was 1 million; by the end of 2015 it had grown to 1.5 billion. Users contribute to FB by writing status updates and periodic submissions on the "timeline", and each user maintains his/her own profile, more or less detailed or public. FB is an international social network which links a broad spectrum of people across the world, independent of language or culture. More recently, FB has also become a space for firms, political parties and non-governmental organizations to promote themselves. FB evolution is illustrated in Fig.3.

We see the Slovak Business Register (SBR) as one kind of social network. It is a public register in which subjects (natural or legal persons and companies) are listed on the basis of a particular law. In comparison to FB, SBR links are not based on friends; instead, we find Partners, Management Body, Supervisory Board or Liquidators. According to the official statistics of the Ministry of Justice SR³, the network grew from 52 thousand subjects in 1995 to more than 254 thousand registered subjects in 2015. Although SBR performs liquidation and deletion of registered entries, it is still possible to find and display all deleted subjects on the SBR webpage. Fig.1 displays the evolution of the number of registered subjects in the SBR database by year.

The importance and significance of the Web as a large evolving information space is marked by various challenges, such as the Semantic Web Challenge⁴, and by research papers at the WWW⁵ or ISWC⁶ events. This motivates us to try to find relations in the data, and to use modern distributed computing models such as Pregel, where the data is represented in a graph structure, giving new opportunities for graph algorithms.

A presentation, or a kind of "overview", for the user is provided by techniques from the field of Information visualization. Graph visualization, for instance, visualizes relationships by representing them graphically as vertexes and edges. Despite the rich community of researchers and many contributed solutions to many issues in graph visualization, the potential of visualization is not fully used. The research concentrates on layout algorithms, clustering or drawing. Many solutions try to address the edge crossing problem in traditional ways (layouts, clustering, bundling, focus+context techniques), but we did not find any work on edge pens and colors. We found only a rather controversial edge visualization in which edges are hidden rather than displayed [4, 5], ultimately solving edge crossing but increasing uncertainty about relations.

³ http://www.justice.gov.sk/stat/statr.htm
⁴ challenge.semanticweb.org/
⁵ libra.msra.cn/Conference/526/www-world-wide-web-conference-series
⁶ libra.msra.cn/Conference/360/iswc-international-semantic-web-conference

Figure 1: Slovak Business Register. The evolution of the registered subjects count, 1995-2015. Vertical axis in thousands.

Figure 2: en.wikipedia.org. Number of articles published per year, 2002-2016. Vertical axis in millions.

In the scope of this paper we propose a new complex solution for information extraction and processing, involving relationship discovery and resulting in visualization. We present a new and universal method, usable for relationship discovery as well as for filtering in graph visualization. We identify the relation between graph clutter, cognitive science and psychology, and based on this research we design new pen drawing and color styles for graph visualization. We combine our solution under an architecture capable of extraction, distributed processing and visualization. We also evaluate our proposal on selected data inputs.

1.1 Goals

The aim of this work is to combine relationship discovery, visualization and navigation in graph data; to design filtering techniques offering some degree of personalization; and, for the unstructured data of SBR, to propose a structure and a way to create it. A valid structure can serve in data integration from various sources and is a step toward the concept of Linked Data by T. Berners-Lee [2] for transparent and effective data usage in the public interest. A combination of visualization and filtering techniques in data integration, navigation and relation discovery tasks could lead to interesting results. The work pursues the following partial goals:

1. Design visualization techniques for better clarity of visualizations. For 2D visualizations of graphs, several existing techniques use coloring, but pen styles or patterns are potent, yet unused. Various pen patterns can, however, discriminate edges as well as colors can. Patterns with colors can represent particular information (the significance of a relationship, the type of relationship, time context, etc.). Colors are used, but their choice is rather empirical. We offer a color choice based on studies from cognitive science and psychology, especially on the human visual system.

2. Design and define a new and universally usable graph filter and visualization method. Graph filtering is not only about hiding redundancies and clearing links, clustering or layouting. Filtering can go further, into the semantics of graph data, and can partially address graph visualization problems. We propose a method, Many-to-many, intended to work with whole sets of vertexes. Within those sets, the Many-to-many filter function examines the edges between vertexes and considers the properties of edges and vertexes. Relations that comply with the conditions pass the filter; those that do not remain hidden. The method is therefore useful not only for graph filtering but also in visualization and relationship discovery tasks.

3. Design an algorithm for distributed relationship discovery. One of the basic criteria in distributed computing is scale, because of the increasing data volume. When developing an algorithm for relationship discovery, we need to select an appropriate data structure and distributed computing model. We also use the Many-to-many concept in the algorithm, which works with sets of vertexes. There is no automatic reasoning; the algorithm contains several conditions evaluated during relationship discovery, which makes it more universal for general use.

4. Personalize graph filter and visualization. For the visualization methods using colors and pen patterns, enable personalization by defining color sets and patterns, creating different user profiles. For filtering, let users define their own sets of "interesting" vertexes, to be preferentially searched and displayed in the graph visualization.

5. Combine navigation, filtering and relationship discovery. Relationship discovery is important to us. Bearing in mind the constantly growing content of the Web, which offers a rather large volume of data, we count on a distributed computing platform. The basic combination is the connection between a distributed computing algorithm and a client side application with a presentation layer. In general, a solution offering filtering, visualization and interactive relationship discovery is preferred. We utilize Many-to-many together with the distributed algorithm, exploring on the basis of links or semantics. Thus we propose a complex large scale solution for interactive relationship discovery, in which relationships are perceived with the use of various methods of graph visualization and filtering. As we show, it is possible to implement and use it.

Figure 3: Facebook. Monthly active users, 2004-2014. Vertical axis in millions.

Figure 4: News articles. Number of articles published per year, 2004-2014. Vertical axis in thousands.
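The Many-to-many filter outlined in goal 2 can be pictured with a small Python sketch. The edge representation (triples with a property dictionary), the function name `many_to_many` and the example data are assumptions for illustration; they are not taken from AGECRT.

```python
def many_to_many(edges, set_a, set_b, condition=lambda u, v, props: True):
    """Keep edges joining a vertex of set_a with a vertex of set_b
    that also satisfy `condition`; all other edges stay hidden.

    `edges` is an iterable of (u, v, props) triples of an undirected
    graph (a representation assumed for this sketch).
    """
    visible = []
    for u, v, props in edges:
        joins = (u in set_a and v in set_b) or (u in set_b and v in set_a)
        if joins and condition(u, v, props):
            visible.append((u, v, props))
    return visible

# Hypothetical business-register style data.
edges = [
    ("Alice", "Acme", {"type": "partner"}),
    ("Bob", "Acme", {"type": "liquidator"}),
    ("Alice", "Bob", {"type": "partner"}),
]
persons, companies = {"Alice", "Bob"}, {"Acme"}
partners = many_to_many(edges, persons, companies,
                        condition=lambda u, v, p: p["type"] == "partner")
print(partners)
```

Because the function takes whole vertex sets and an arbitrary condition, the same call pattern covers both plain filtering (a condition that is always true) and a targeted relationship query.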

2. Colors and Patterns of Pen

In the introduction we briefly explained our motivation for the research and noted SBR and the Billion Triple Challenge. Visualizing such volumes of data as a whole is not desirable, mainly due to the limitations of current displays. In order to display a graph and avoid common issues with graph clutter, we can use methods which are (to the best of our knowledge) currently left unused. In our research in the graph visualization field we realized that edge coloring plays no major role in the proposed solutions, despite the fact that those papers focus on graph visualization [10, 18, 9].

2.1 Related Work

Jianu [10] does discuss edge coloring, but his proposal is to color edges based on their relative positions. Thus his coloring changes with edge positions, and the colors are independent of the semantic context. His scheme cannot express additional information possibly stored in vertexes or edges. Rusu [18] uses gestalt psychology in the process of edge drawing. His primary goal is to solve the (aesthetic) problem of edge crossing, which he does by interrupting an edge at the position of the crossing. Edge coloring is not his primary concern. Herman [9] offers an extensive list of visualization methods. It is a bit surprising that no place is given to colors (with the exception of treemaps) or edge styles. Perhaps worth noting is the rather experimental proposal of partial edge drawing [5, 4], but we see it as a rather controversial method of edge visualization. With no edge visualization at all, the edge crossing problem would be definitively solved, but we suppose that increasing uncertainty is not the intent of graph visualizations. As edge crossing is quite a common problem in graph visualizations, all those crossing edges would remain hidden.

2.2 Design

Graph visualization is a rather broad field with many methods and findings on how to visualize graphs, and research in it is ongoing. Visualization is the final layer, easing interaction and navigation. It increases the clarity of relationships. However, we still see unused potential in this field, and we define the following goals for our visualization, following the 1st point of our main goals:

• define proper pen styles for edges

• recommend proper colors for the pen on a white background

• propose a persistent color representation for edges (including re-layouting)

• propose semantics for relationships in the process of edge color and pen style selection.

We define appropriate patterns and colors for the pen. A pen is an object defining the visual properties of edges when edges are painted in the resulting visualization. Our design is based on information about the human visual system. According to Visual Expert [7], the difference in color and brightness is important: two colors with a relatively high color and brightness difference are suitable for combination in a visualization. [8] states how to calculate the color difference. To calculate the brightness difference, we first need to calculate the brightness itself. The basis is the three components of the RGB model. Each component has a weight assigned by the ITU-R standard in recommendation BT.601-4. Poynton [17] notes that these weights are no longer considered accurate, as they were for older CRT displays. We thus propose modified weights: $Y = 0.30078\,R + 0.58986001\,G + 0.10935999\,B$, where $Y$ is the resulting brightness, normalized to the scale 0-255, and $R$, $G$ and $B$ are the color components of the RGB model. We discovered the new weights during our experiments with color to grayscale image conversions. We then implemented the algorithms and the proposed weights in our program BKonvert⁷, which served as an educational and experimental tool during our study at Matej Bel University in the Applied Informatics field.

The brightness difference is $Y_\Delta = Y_A - Y_B$. The color difference $K_\Delta$ expresses how distant two given colors are in the RGB model. We rely on W3 [6], which gives formulas for brightness and color difference calculations. The color difference is given by $K_\Delta = |R_1 - R_2| + |G_1 - G_2| + |B_1 - B_2|$. According to W3, the threshold for a suitably high brightness difference $Y_\Delta$ is 125 and the threshold for the color difference $K_\Delta$ is 500. When $Y_\Delta > 125$ and $K_\Delta > 500$, the two given colors form a suitable combination (according to W3). We thus propose the aforementioned formulas ($Y_\Delta$ and $K_\Delta$) for suitable color selection relative to a reference color (the background).
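A minimal sketch of the suitability test follows, assuming a white background as the reference color and taking the absolute value of the brightness difference so that the order of the two colors does not matter. The function names are ours; the thresholds 125 and 500 and the weights are the ones quoted in the text.

```python
def brightness(rgb):
    """Brightness Y of an (R, G, B) tuple, using the weights proposed in the text."""
    r, g, b = rgb
    return 0.30078 * r + 0.58986001 * g + 0.10935999 * b


def brightness_difference(a, b):
    # Assumption: the absolute value is taken, so the argument order is irrelevant.
    return abs(brightness(a) - brightness(b))


def color_difference(a, b):
    # K = |R1 - R2| + |G1 - G2| + |B1 - B2|
    return sum(abs(x - y) for x, y in zip(a, b))


def is_suitable(color, background=(255, 255, 255)):
    """True when both differences exceed the thresholds 125 and 500."""
    return (brightness_difference(color, background) > 125
            and color_difference(color, background) > 500)


if __name__ == "__main__":
    for name, rgb in [("black", (0, 0, 0)), ("blue", (0, 0, 255)),
                      ("green", (0, 255, 0)), ("yellow", (255, 255, 0))]:
        print(name, is_suitable(rgb))
```

Under these assumptions a fully saturated green fails the brightness threshold against white, and yellow fails both, while black, pure red and pure blue pass.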

We propose the following pen pattern styles: three terminated and one full, each in two variants, straight and curved (Fig. 5, Fig. 6).

Together there are eight patterns (4 straight and 4 curved), which we consider proper for drawing. The patterns are placed in a set $Z$, where $|Z| = 8$. More patterns could cause discrimination issues when two similar patterns are drawn in the same color. We do not recommend other patterns, such as "+", because it resembles a crossing when two edges get near each other and are drawn in the same color. We do not recommend using bent lines, as a sudden change in edge direction could make an edge harder to follow. These patterns (along with the presented color formulas) are used for visualization in our program AGECRT, see Fig. 8.

7 https://sourceforge.net/projects/bkonvert/

Figure 5: Pen patterns, straight lines.

Figure 6: Pen patterns, curved lines.

Figure 7: Color selection from the set, based on h1 and h2 hash values.

Fig. 7 shows the algorithm of color selection. The main function is color. h1 and h2 are the hash values of vertexes v1 and v2, as integers. If h1 ≠ h2, then h1 = h1 XOR h2. XOR is the binary operation of bitwise exclusive or on integers. MOD is the binary operation modulo (footnote 8), here modulo 8. The function colorHash first evaluates h1 to decide whether the color should be pure. A pure color, in our design, is a color with two RGB components equal to zero and one non-zero (set pureColors). If pure = true, the color is selected from pureColors. Otherwise, the color is selected from edgeColors. The selected color is used to draw the edges between v1 and v2.
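Since Fig. 7 itself is not reproduced here, the following is only a hedged sketch of the selection scheme described above. The palettes, the rule that decides whether a color is pure (lowest bit of the combined hash) and the indexing by palette length are all hypothetical placeholders; only the XOR combination and the split into pureColors and edgeColors come from the text.

```python
# Hypothetical palettes: the actual colors of Fig. 7 are not given in the text.
PURE_COLORS = [(255, 0, 0), (0, 200, 0), (0, 0, 255)]
EDGE_COLORS = [(128, 0, 128), (0, 96, 96), (96, 64, 0), (64, 0, 160),
               (0, 64, 128), (128, 32, 32), (32, 96, 32), (80, 80, 80)]


def combine(h1, h2):
    """Combine two vertex hashes; XOR is skipped when they are equal (it would give 0)."""
    return h1 ^ h2 if h1 != h2 else h1


def color(h1, h2):
    """Select an edge color from the hash values of the two end vertexes."""
    h = combine(h1, h2)
    pure = h % 2 == 0  # hypothetical rule; the text does not state how 'pure' is decided
    palette = PURE_COLORS if pure else EDGE_COLORS
    return palette[(h // 2) % len(palette)]


if __name__ == "__main__":
    print(color(12, 10), color(10, 12))
```

Because XOR is commutative, the edge gets the same color regardless of which end vertex is taken first, and because the result depends only on the hashes, the color survives a re-layout.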

8 In modular arithmetic, a binary operation returning the integer remainder of a division.


Figure 8: AGECRT main window with visualization of Boris Kollar network.

We select the edge pattern style similarly. The patterns from Fig. 5 and Fig. 6 are placed in one set. Because there are two vertexes and one pen pattern set, the selection is almost identical to the color selection process: the binary operation XOR combines the hashes into one number, the modulo function then returns an index into the pattern set, and the edge pattern at that index is selected.
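The pattern selection can be sketched as follows. The pattern names are hypothetical labels for the eight styles of Fig. 5 and Fig. 6, and the use of CRC-32 as the vertex hash is our assumption (the text does not name a hash function); it is chosen here because Python's built-in hash() of strings changes between runs, which would break a persistent representation.

```python
import zlib

# Hypothetical names for the eight pen patterns (set Z, |Z| = 8).
PATTERNS = [
    "straight-full", "straight-terminated-1", "straight-terminated-2", "straight-terminated-3",
    "curved-full", "curved-terminated-1", "curved-terminated-2", "curved-terminated-3",
]


def stable_hash(label):
    """Hash of a vertex label that stays the same across program runs (assumed: CRC-32)."""
    return zlib.crc32(label.encode("utf-8"))


def pattern_for(v1, v2):
    """Combine both vertex hashes with XOR and reduce the result modulo |Z|."""
    combined = stable_hash(v1) ^ stable_hash(v2)
    return PATTERNS[combined % len(PATTERNS)]


if __name__ == "__main__":
    print(pattern_for("Will Smith", "Jaden Smith"))
```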

3. New and Universal Filtering Method

The main motivation is the increasing volume of data on the Web, as we stated and illustrated in the introduction. Because the Web keeps scaling, we have to design a large-scale solution, which requires parallel computing power and distributed resources. An interesting choice is Pregel, one of the new and potent distributed computing models. Pregel combines fault tolerance, simplicity of development on a distributed cluster and a new concept of synchronization based on supersteps. Pregel is a brand new computing model, but one with clear potential [13].

During our research, we discovered rules for graph filtering which can be used universally. They can be used in graph visualization for filtering, and also to discover relationships in a distributed computing model.

Our goals in this section follow main goals 2 and 3.

3.1 Base Design

Let $G$ be the graph $G = (V, E)$, where $E$ is a set of edges and $V$ is the set of vertexes connected by edges from $E$. We propose to divide the set $V$ into two disjoint subsets: interesting vertexes $I$ and ordinary vertexes $O$. For the algorithm of relationship discovery and for graph visualization, set $I$ holds the vertexes which are important or interesting. In visualization, the set contains all currently visible vertexes. In the distributed computing algorithm for relationship discovery, the set contains the vertexes between which a relation is to be discovered. Set $O$ contains the vertexes not visible in the visualization process. In relationship discovery, it contains all vertexes $V - I$. With these two sets, we define the rules:

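$$H = \{\, u \in O \mid \exists v_1, v_2 \in I : \exists e = \{u, v_1\} \wedge \exists e = \{u, v_2\} \,\} \qquad (1)$$

$$h \in H : \quad M = \{\, u \neq h \mid \exists e = \{h, u\} \wedge u \in I \,\} \qquad (2)$$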

Where:

H - set of all neighbors of vertexes v1 and v2

I - set of all interesting vertexes

O - set of all ordinary vertexes

M - set of all interesting neighbors of vertex h.

Rule 1 states the existence of a vertex u from the set O which lies in the sequence P = (v1, e1, u, e2, v2). Rule 2 specifies that vertex u is excluded from the interesting set of vertexes I. In graph visualization, we see the proposed rules as usable for vertex hiding or expanding/collapsing (as Herman states about ghosting and hiding [9]). In the expanding process, only vertexes which pass the filter of the proposed rules are identified and visualized. For details we recommend our former works [14, 15]. In the algorithm of distributed relationship discovery, these rules directly qualify the relationship.
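To make the two rules concrete, here is a small sketch on an undirected graph stored as an adjacency dictionary. The function names and the toy graph are hypothetical, and the sketch assumes that v1 and v2 in rule (1) are two different interesting vertexes.

```python
def rule_1(adj, interesting):
    """Set H: ordinary vertexes adjacent to at least two distinct interesting vertexes."""
    return {u for u, neighbors in adj.items()
            if u not in interesting and len(neighbors & interesting) >= 2}


def rule_2(adj, interesting, h):
    """Set M: interesting neighbors of a vertex h taken from H."""
    return {u for u in adj[h] if u != h and u in interesting}


if __name__ == "__main__":
    # Toy graph: a-x, b-x, c-y, y-z with I = {a, b, c} and O = {x, y, z}.
    adj = {
        "a": {"x"}, "b": {"x"}, "c": {"y"},
        "x": {"a", "b"}, "y": {"c", "z"}, "z": {"y"},
    }
    interesting = {"a", "b", "c"}
    H = rule_1(adj, interesting)
    print(H, {h: rule_2(adj, interesting, h) for h in H})
```

In the toy graph only x lies on a sequence P = (v1, e1, u, e2, v2) between two interesting vertexes, so it is the only vertex a filter based on rule (1) would reveal.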

3.2 Method in Distributed Algorithm PCMARS

The proposed rules are used in our Pregel-based algorithm PCMARS (Pregel Computing Model And Relationship Search). The base component of Pregel is its vertex-centric approach (Pregel type). All vertexes perform their own compute() function, and the synchronization of computing is governed by supersteps. Messages are passed in superstep i and received in superstep i+1. Even though computations run in parallel between nodes, the compute() method runs sequentially inside each node's thread. Messages intended for vertexes located on the same node are stored in the stack of the current node. There the messages wait until they are sequentially picked and processed in compute() by the matching vertex on this node. Pregel usually terminates when there are no additional messages in the stack and no vertexes are active (on a given node). Two parts are important: the compute() function, because it is user-defined, and message passing, where the user can select which values are to be passed. In the PCMARS algorithm, all important control instructions are performed with specialized messages and matching responses. There are two base types of messages, PING and INTERESTING. Each of the two is intended for the matching set of vertexes: PING for the O set and INTERESTING for the I set. Furthermore, there are response messages, sent as reactions to receiving the two base types of messages. The response messages are STORED INTERESTING REPLY, STORED ACTIVATED REPLY and STORED REPLY PINGER. The choice depends on the kind of received message and on the kind of vertex (set O or I), as well as on particular properties or attributes of the particular vertex where compute() is currently running.

The base algorithm design is shown in Fig. 9. A few notes on the proposed algorithm follow. The main part of the compute() method is a loop of message receiving (hasMsgAt()). Because several kinds of messages are defined, each one must be handled differently. Some messages, for instance, are distributed further. If the current vertex (the one running compute()) receives INTERESTING, it is activated (if the vertex is from the O set), increments the path length and distributes the message to its neighbors. Alternatively, the value of the path length must remain the same when the reply requires the original path length. If any vertex of the O set is activated from two different I vertexes, such a vertex is considered a relationship. Each vertex has access to its list of neighbors along with the particular connecting edges (here the algorithm can be modified based on edge types) and also maintains a structure of the vertexes from which it has received a message. This structure is used in several cases during message generation. A vertex can thus distribute messages received several steps before. When a vertex from O is activated by a message from an I vertex, it is then capable of responding with STORED INTERESTING REPLY and of activating other vertexes from the O set.

The algorithm behaves differently in the first two supersteps. Vertexes of O send exploratory PING messages. Vertexes of I send INTERESTING messages. Subsequent supersteps receive these messages and react accordingly. The algorithm in Fig. 9 is a sample picked from the PCMARS source code, which is more complicated and complex. The whole source code can be found on the CD supplied with the thesis.
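The superstep mechanics can be illustrated with a heavily simplified, single-process simulation. It is not the PCMARS source code: PING and the reply messages are left out, only INTERESTING messages are modelled, and the function name, the message layout (origin, path length) and the toy graph are our assumptions.

```python
from collections import defaultdict


def find_relationships(adj, interesting, max_path_length=1):
    """Return ordinary vertexes activated from at least two different interesting vertexes."""
    origins = defaultdict(set)   # vertex -> interesting vertexes that activated it
    inbox = defaultdict(list)    # messages to be delivered in the next superstep
    # First superstep: every interesting vertex sends INTERESTING to its neighbors.
    for v in interesting:
        for neighbor in adj[v]:
            inbox[neighbor].append((v, 1))  # (origin, path length in edges)
    # Messages sent in superstep i are processed in superstep i + 1.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            if vertex in interesting:
                continue  # simplification: interesting vertexes do not forward
            for origin, length in messages:
                if origin in origins[vertex]:
                    continue  # already activated from this origin
                origins[vertex].add(origin)
                if length < max_path_length:
                    for neighbor in adj[vertex]:
                        outbox[neighbor].append((origin, length + 1))
        inbox = outbox  # the computation halts when no messages remain
    return {v for v, seen in origins.items() if len(seen) >= 2}


if __name__ == "__main__":
    adj = {
        "a": {"x"}, "b": {"x", "z"}, "c": {"y"},
        "x": {"a", "b"}, "y": {"c", "z"}, "z": {"y", "b"},
    }
    print(find_relationships(adj, {"a", "b", "c"}, max_path_length=1))
    print(find_relationships(adj, {"a", "b", "c"}, max_path_length=2))
```

With a maximum path length of 1 only x qualifies; raising the limit to 2 lets the messages of b and c meet at y and z as well, which shows how the path length limit controls what counts as a relationship.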

4. Base Architecture

We follow our base goals 4 and 5, presenting a base architecture which combines the distributed algorithm PCMARS and our client side application AGECRT. It is a platform independent solution, capable of large-scale computing with the use of a computer cluster. It can also perform visualization and filtration as a client side application, offering interactive navigation and relationship discovery with its presentation layer. It can gather data from the Web and process them. With the help of its parsers and designed schemata, it is able to extract structure from unstructured HTML data sources. The data can then be used in the further development of Linked Data. Fig. 10 depicts the combination of PCMARS and AGECRT in one system.

Figure 9: PCMARS with function compute().

The communication layer will be covered in the implementation, for example as a console. Communication can also run on the client side (for test purposes). PCMARS and AGECRT can then run on one client machine at the same time. PCMARS is equipped with the functionality needed. Thanks to the Sedge framework, it is possible to configure worker nodes and to add or remove them as needed, as stated in its configuration file. Communication between PCMARS and AGECRT will be realized through a console interface or with sockets. The architecture supports both cases.

Data entry and extraction filters must be equipped with a suitable set of data processors. Parsers are used to extract data. From unstructured HTML, a structured data form will be gathered. For structured data sources, it is simple to extract data from XML, RDF/XML, CSS or N-triples formats; only a template or a schema is required. Java already supports XML reading and parsing. We also count on more complex formats, like PDF, for which we can use one of the available third party drivers, such as pdf2HTMLEx (footnote 9), which comes with source code. Even though it supports only HTML output, we have found that it produces a rather simple form of HTML, which is thus suitable for parsing. We have also evaluated Tabula [11], but after the evaluation we can state that pdf2HTMLEx is more suitable and reliable. pdf2HTMLEx requires no extra interaction with the user and is able to process many kinds of PDF files, including scientific papers and articles; pdf2HTMLEx is thus a seamless choice. With minor changes, it is possible to implement support for SPARQL endpoints. SPARQL endpoints are maintained by their respective data publishers, who offer access through several structured data protocols, like JSON, POST, GET or SOAP. These protocols can easily be built into entry data extraction filters.

Figure 10: Base architecture. (Diagram labels: User; AGECRT with entry data extraction filters, personalization layer, visualization filters, communication layer, entry data layer and visualization layer; Pregel with PCMARS and its communication layer; remote data sources.)

The visualization layer is covered solely by the Jung (footnote 10) framework. We have used this framework to define custom methods of drawing and the presentation layer in AGECRT.

The internal entry data layer works with structured data, refined by one of the entry data extraction filters. This layer is directly connected with the visualization layer and with the personalization layer.

The personalization layer is a governing layer for the visualization layer and the entry data layer. It holds certain definitions for personalization. It enables the user to define and store parameters for vertex selection and filtering, visualization settings, layout algorithms and additional information about vertexes and edges. A user can create their own profile and re-use it in subsequent visualizations. One example is a vertex filter: only the vertexes listed in the filter are visualized, the others are hidden.

9 http://coolwanglu.github.io/pdf2htmlEX/

10 http://jrtom.github.io/jung/

Figure 11: Sample visualization A of the shortest path between "Christopher D. Manning" and "L. Hluchy" using the standard Dijkstra shortest path algorithm. There are two types of vertexes: authors (orange color) and articles (blue color). Statistics: $|V_A| = 9$, $|E_A| = 8$, $\nu_A = 1.78$.

5. Evaluation

In this section we present sample evaluations, selected from our thesis.

5.1 Clutter Removal

The visualization filter for graph clutter removal works according to its design and rules (page 49). We can evaluate the visualization filter based on two parameters: the actual reduction of visual components (vertexes) based on the rules, and the reduction of graph density (edge and vertex counts). We use the base graph $G = (V, E)$, which was also proposed on page 49. Further, we refer to the vertex count as $|V|$ and to the edge count as $|E|$, and we also define the average vertex degree

$$\nu = \frac{\sum_{v \in V} \deg(v)}{|V|}.$$

Fig. 11 is a visualization of the path between the vertexes "Christopher Manning" and "L. Hluchy". The path was obtained using the standard Dijkstra shortest path algorithm. The data source is the ACM citation graph (footnote 11), where $|V| = 622{,}335$, $|E| = 1{,}334{,}753$ and $\nu = 4.29$.
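As a quick check of the definition, the average degree can be computed from the vertex and edge counts alone, assuming an undirected graph in which every edge contributes to the degree of both of its end vertexes (so the degree sum equals $2|E|$). The function name is ours.

```python
def average_degree(vertex_count, edge_count):
    """Average vertex degree of an undirected graph.

    Assumes every edge is counted once at each of its two end vertexes,
    so the sum of all degrees is 2 * |E|.
    """
    return 2 * edge_count / vertex_count


if __name__ == "__main__":
    print(round(average_degree(622_335, 1_334_753), 2))  # ACM citation graph
    print(round(average_degree(9, 8), 2))                # visualization A
```

Under this assumption the counts quoted for the ACM citation graph and for visualization A reproduce the reported values 4.29 and 1.78.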

In Fig. 12 we see the neighbors of the vertex "Authoritative sources" without a filter applied. It connects 200 neighbors. Name tags in rectangles were hidden and replaced with circles and polygons, for simplicity. Visualization A (Fig. 11) depicts a graph where $|V_A| = 9$, $|E_A| = 8$, $\nu_A = 1.78$. For comparison, B (Fig. 12) is a graph with $|V_B| = 207$, $|E_B| = 567$, $\nu_B = 5.47$. The number of newly visualized vertexes is $\Delta|V| = |V_B| - |V_A| = 198$, the number of new edges is $\Delta|E| = |E_B| - |E_A| = 559$, and $\nu$ is raised from 1.78 to 5.47. With such a large number of vertexes and edges, and with edge crossing, the visualization is considerably cluttered and the relations are unclear.

11 http://datahub.io/dataset/rkb-explorer-acm, retrieved 12.6.2016

Figure 12: All 200 neighbors of the vertex "Authoritative sources" (vertex in white circle). Newly visualized vertexes and edges are colored blue. Statistics: $|V_B| = 207$, $|E_B| = 567$, $\nu_B = 5.47$.

Figure 13: Visualization C is a visualization of a graph where a filter is applied to the vertex "Authoritative sources" (pink color). Blue color marks newly visualized vertexes and edges. Statistics: $|V_C| = 13$, $|E_C| = 17$, $\nu_C = 2.62$.

The effect of the filter is displayed in Fig. 13. In comparison to A, there are $|V_C| - |V_A| = 4$ new vertexes and $|E_C| - |E_A| = 9$ new edges, and $\nu$ is raised from 1.78 to 2.62. Based on the rules (page 49), all vertexes which do not satisfy the rules are filtered out. The remaining $|V_B| - |V_C| = 194$ vertexes and $|E_B| - |E_C| = 550$ edges are hidden. The average vertex degree is higher than in A (Fig. 11), which is reasonable given the newly visualized vertexes, but considerably lower than in B (Fig. 12), where no filter is applied.
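The differences quoted above follow directly from the statistics of visualizations A, B and C. The short sketch below recomputes them and additionally derives the share of B that the filter hides; that share is our own derived figure, not one reported in the text, and the helper names are hypothetical.

```python
# (vertex count, edge count) of the three visualizations, as reported in the text.
STATS = {"A": (9, 8), "B": (207, 567), "C": (13, 17)}


def difference(larger, smaller):
    """Return (vertex difference, edge difference) between two visualizations."""
    (v1, e1), (v2, e2) = STATS[larger], STATS[smaller]
    return v1 - v2, e1 - e2


def hidden_share(unfiltered, filtered):
    """Fraction of vertexes and edges of the unfiltered graph that the filter hides."""
    dv, de = difference(unfiltered, filtered)
    v, e = STATS[unfiltered]
    return dv / v, de / e


if __name__ == "__main__":
    print(difference("B", "A"), difference("C", "A"), difference("B", "C"))
    print(tuple(round(x, 3) for x in hidden_share("B", "C")))
```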

In Fig. 15 we can see a network of neighbors of the vertex "Ground Term Confluence". The graph is the resulting visualization D, after we applied our filter two times on this vertex. Each time the filter was applied, new vertexes were found and visualized. Here lies a drawback of our proposed graph visualization filter based on the rules from page 49: the effectiveness of such a filter is limited. The visualization space is eventually filled, which leads to information overload or higher uncertainty of relations. This holds if we perceive graph clutter also as the problem of edge crossing and of vertexes overlapping edges, where we cannot sufficiently state where one edge ends and another starts.

Figure 14: The evolution of message sending over supersteps in the algorithm PCMARS. The vertical axis represents the message count (in thousands) on a logarithmic scale, the horizontal axis denotes supersteps. The plotted series are Node 1 to Node 5.

A solution actually exists. One just needs to use another visualization technique, in particular edge coloring. We use our design from page 47. Colored edges with styles are shown in visualization E in Fig. 16. Based on the new colors and edge styles, we can state that new information was introduced. Between the vertexes "Semantics and" (left of "Ground term confluence") and "PCLOS: a critic" there is an edge colored pink and drawn with a full pen. The edge leads under the "An extensible" and "An object orien" vertexes, but thanks to the color and edge curving, we can suppose that the edge starts in "Semantics and" and ends in the "PCLOS: a critic" vertex, free of any breaks. The same holds for the vertexes "Sun's Link Serv" and "An object-orient" (above "Ground term confluence") and for other cases. We recommend the use of transparency to offer an additional possibility of following edges through overlapping vertexes.

5.2 Relationship Discovery

Based on the design of PCMARS from page 49, we define two sets: ordinary vertexes O and interesting vertexes I. Tab. 1 lists the vertexes from the I set, among which relations are searched according to the rules (page 49) and the algorithm design. We have evaluated data gathered from the Freebase (footnote 12) dataset. The evaluation started on 21.8.2014 at 15:05:55 and finished on 31.8. (10 days). There were 207,382 executed supersteps. The progress of message generation is displayed in Fig. 14.

The algorithm returned 314,668 vertexes u which are contained in the sequence P = (v1, e1, u, e2, v2) from page 49. Those vertexes received INTERESTING or INTERESTING REPLY messages at least 2× (vertexes v ∈ I). In order to visualize such a graph, we have used our AGECRT tool. Several vertex types were excluded from the visualization (Gender, Male, Female, Place of Birth, Place of Death), because we were not interested in such common relations. After the exclusion, the graph contained 234,786 vertexes and 262,279 edges. The graph was still too big to visualize, so we have used the Dijkstra shortest path. In this case, the shortest path was searched multiple times, because we

12 https://developers.google.com/freebase/, retrieved 22.6.2016


Table 1: Set I of interesting vertexes of the Freebase dataset. This set is used in the distributed computing algorithm PCMARS and also in AGECRT to discover relations and the shortest path.

Freebase MID   Name                    Freebase MID   Name
m.02xbw2       Gabrielle Union         m.029_l        Delroy Lindo
m.0147dk       Will Smith              m.012d40       Jackie Chan
m.0271y9f      Jaden Smith             m.05v_r84      Jackie Chan
m.01qg7c       Barry Sonnenfeld        m.01q_ph       Owen Wilson
m.01vvzb1      DMX                     m.01xndd       J.J. Abrams
m.0hqly        Steven Seagal           m.042xrr       Anthony Anderson
m.0gy64rt      Samuel Steven Seagal    m.0bvb9mz      Anthony Anderson
m.02633g       Martin Lawrence         m.0451j        Jet Li
m.01th95y      Martin Lawrence         m.05qg6g       Zoe Saldana
m.01hhx1l      Willennium              m.05jpsx       Chi McBride
m.0b8xmc       Robinne Lee             m.029pnn       Tom Arnold
m.07y925       Marsha Thomason         m.05d79k       Bill Duke

Figure 15: Neighbors of the vertex "Ground Term Confluence". Newly visualized vertexes and edges are colored blue. $|V_F| = 67$, $|E_F| = 158$, $\nu_F = 4.72$.

Figure 16: The same visualization as in Fig. 15 in terms of spatial layout; the difference is the edge coloring and pen styling from page 47.


Figure 17: The resulting visualization of the Freebase dataset after processing with PCMARS. The visualization is taken from our program AGECRT. Edges for the vertexes "Jackie Chan", "Martial Artist" and "Will Smith" are drawn in bold.

have searched for the shortest path between all vertexes from the I set, following the notation $\mathrm{dijk}(u, v)$, $u \neq v$, $u, v \in I$. Paths were continuously added into the visualization until all vertexes from the I set had been searched using the dijk function (Fig. 17). The Freebase dataset is a specific one. For instance, one name (Martin Lawrence) has several different identifiers (MID). This property of Freebase would require deeper study of its internal object representation. Another property of Freebase is that "film" vertexes are not directly connected to their "actor" vertexes; instead, a mid-vertex $v_2$ is connected between them (film → $v_2$ → actor). Vertex $v_2$ contains information on the actor's role in the particular film (character name, film title). We had to increase the maximum path length to 2 (max path length in the PCMARS algorithm from page 50). Vertexes $v_2$ are always unique and are dedicated to a particular combination of film and actor.
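The pairwise search can be sketched as follows. This is an illustration only, not the AGECRT implementation: it assumes an undirected graph with unit edge weights, searches each unordered pair once, and uses hypothetical vertex names that mimic the film, role and actor structure described above.

```python
import heapq
from itertools import combinations


def dijkstra_path(adj, source, target):
    """Shortest path from source to target with unit edge weights, or None."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for n in adj[u]:
            if d + 1 < dist.get(n, float("inf")):
                dist[n], prev[n] = d + 1, u
                heapq.heappush(heap, (d + 1, n))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def paths_between_interesting(adj, interesting):
    """Run the search for every pair u != v of interesting vertexes."""
    paths = {}
    for u, v in combinations(sorted(interesting), 2):
        path = dijkstra_path(adj, u, v)
        if path is not None:
            paths[(u, v)] = path
    return paths


if __name__ == "__main__":
    adj = {
        "actor_a": {"role_1"}, "role_1": {"actor_a", "film"},
        "film": {"role_1", "role_2"}, "role_2": {"film", "actor_b"},
        "actor_b": {"role_2"}, "actor_c": set(),
    }
    print(paths_between_interesting(adj, {"actor_a", "actor_b", "actor_c"}))
```

The toy graph shows why the mid-vertexes matter: two actors of the same film are four edges apart, because each is linked to the film only through a role vertex.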

From the output of PCMARS and the AGECRT visualization, we can find several interesting results. The expected relation between "Jackie Chan" and "Owen Wilson" is not found. Although the vertexes "Will Smith" and "Karate Kid" are connected, a connection with "Jackie Chan" is missing. On the other hand, we find a connection between "Jet Li", "Chinese Martial Arts" and "Karate Kid". The reason behind this is that "Will Smith" (from the I set) activated "Karate Kid" (O set) indirectly, through several vertexes (e.g. "PG (USA)", MID = m.0kprc8). Thus, "Karate Kid" was activated from the vertex "Will Smith". The vertex "Jet Li" activated "Chinese martial arts". Both vertexes "Karate Kid" and "Jet Li" are directly connected to "Chinese martial arts". However, the vertex "Jackie Chan" is connected through "People Choice Award" (MID.0dlskb3). This vertex is unique to the whole dataset (contained in only one path). Here we could change the maximum path length (currently set to 2) or use the semantic information of neighboring vertexes and edges.

A further look at Fig. 17 can raise another question. "Anthony Anderson", "Tom Arnold" and "Marsha Thomason" are all connected with the "Actor" vertex. Why not "Will Smith" or "Jackie Chan" as well? Are they not actors? The reason is that "Marsha Thomason" is directly connected with "Actor", as is the vertex "Will Smith", but "Marsha Thomason" did not activate the vertex "Actor" directly. "Marsha Thomason" activated the vertex "Scott Taylor" (MID.025zlv2), which, in turn, sent a message further to the "Actor" vertex. This path "Marsha Thomason" → "Scott Taylor" → "Actor" was stored. Its path length is 2 (number of edges).

6. Conclusions

In our thesis, we propose a new coloring method for the graph edge crossing problem. There are many papers available devoted to graph visualization, but we have not found any which would discuss pen patterns and edge color styles based on cognitive and psychovisual aspects. This is perhaps unfortunate, as edge styles allow us to represent more distinct edge combinations. From Fig. 5 and Fig. 6 we have eight different edge styles. Together with the colors in Fig. 7, we have 8 × 11 = 88 different edge combinations. Even where papers discuss graph clutter, only a brief, conventional aspect of aesthetics criteria is addressed, without greater emphasis on cognitive science or the perception of the human visual system. Through our design of color and pen styles, the formulation of rules, the development of a distributed computing algorithm and the implementation in PCMARS and AGECRT, we fulfill the base goals stated in our thesis.

Relations discovered with PCMARS could, theoretically, be found with SPARQL querying. However, a few notes must be stated. In order to explore paths in SPARQL, SPARQL version 1.1 is needed (property paths support). One has to use either the star symbol (*) to denote a variable path length, or write down a full path of fixed length, including all predicates contained in the path. It is also possible to use negation with the symbol (!), but in order to navigate properly with negation, one must write a predicate NOT contained in the path, otherwise such a predicate would terminate the path exploration. With this information, we can use SPARQL 1.1 property paths to perform relationship discovery similar to PCMARS. We should, however, carefully select the SPARQL implementation, as not every one actually supports version 1.1, or its support is only partial. We recommend consulting the W3C (footnote 13) webpage, where a list of implementations, along with their support, is maintained. In several specific cases, implementations could even introduce performance issues, as stated in [1, 12].
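As an illustration of the two path styles mentioned above, the following sketch only builds SPARQL 1.1 ASK query strings; it does not contact any endpoint. The function name and all IRIs under http://example.org/ are hypothetical.

```python
def property_path_query(source_iri, target_iri, predicates, variable_length=True):
    """Build a SPARQL 1.1 ASK query that tests for a path between two resources.

    variable_length=True  -> (p1|p2|...)*  any number of steps over the listed predicates
    variable_length=False -> p1/p2/...     a fixed path using the predicates in the given order
    """
    iris = [f"<{p}>" for p in predicates]
    path = f"({'|'.join(iris)})*" if variable_length else "/".join(iris)
    return f"ASK WHERE {{ <{source_iri}> {path} <{target_iri}> }}"


if __name__ == "__main__":
    ex = "http://example.org/"
    print(property_path_query(ex + "a", ex + "b", [ex + "p", ex + "q"]))
    print(property_path_query(ex + "a", ex + "b", [ex + "p", ex + "q"], variable_length=False))
```

Note that the star operator also matches a path of length zero; whether a given endpoint evaluates such queries efficiently depends on the implementation.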

PCMARS, in its current design, can omit edge directions. This can be an advantage as well as a disadvantage, depending on the particular situation. The omission of edge directions can easily be altered with an additional condition on message receiving, which evaluates whether an outgoing edge actually exists.

Entry data extraction filters, specified in the architecture (Fig. 10), can be further enhanced with new filters for new data sources. In our thesis we propose one data extraction filter for the HTML output of the Slovak Business Register (SBR), along with a sample structured schema. Currently, there is still a rather large quantity of unstructured data sources on the Web, despite the proposed practices of Linked Data [3], the recommendations of Tim Berners-Lee [2] and the efforts of W3C consortium working groups (footnote 14) (particularly the CSS, RDF or SPARQL groups). Although the Slovak Republic is a member state of the EU, has the open government data portal data.gov.sk and participates in open government and e-government projects, many bills, invoices or acts are still published in plain, unstructured PDF format. For details about semantic data availability in Slovakia, we recommend consulting our recent work [16].

Acknowledgements. This work was supported by the Slovak Research and Development Agency, project CLAN with id APVV-0809-11, and by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic and the Slovak Academy of Sciences, project VEGA, id 2/0185/13.

References

[1] M. Arenas, S. Conca, and J. Pérez. Counting beyond a yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. In Proc. of the 21st int. conf. on World Wide Web, pages 629–638. ACM, 2012.

[2] T. Berners-Lee. Linked data - design issues. http://www.w3.org/DesignIssues/LinkedData.html/. Retrieved 18.6.206.

13 https://www.w3.org/wiki/SparqlImplementations

14 https://www.w3.org/Consortium/activities. Retrieved 20.6.2016

[3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data-the story sofar. Semantic Services, Interoperability and Web Applications:Emerging Concepts, pages 205–227, 2009.

[4] T. Bruckdorfer, S. Cornelsen, C. Gutwenger, M. Kaufmann,F. Montecchiani, M. Nöllenburg, and A. Wolff. Progress on partialedge drawings. In Graph Drawing, pages 67–78. Springer, 2012.

[5] M. Burch, C. Vehlow, N. Konevtsova, and D. Weiskopf.Evaluating partially drawn links for directed graph edges. InGraph Drawing, pages 226–237. Springer, 2011.

[6] W. W. W. Consortium et al. Techniques for accessibilityevaluation and repair tools.http://www.w3.org/TR/2000/WD-AERT-20000426, 2000.

[7] V. Expert. Sbfaq part 6: Color for text and graph legibility. http://www.visualexpert.com/FAQ/Part6/cfaqPart6.html/.Retrieved 18.6.2016.

[8] Had2Know. How to Calculate Color Contrast from RGB Values.http://www.had2know.com/technology/

color-contrast-calculator-web-design.html. Retrieved15.6.2016.

[9] I. Herman, G. Melançon, and M. S. Marshall. Graph visualization and navigation in information visualization: A survey. Visualization and Computer Graphics, IEEE Transactions on, 6(1):24–43, 2000.

[10] R. Jianu, A. Rusu, A. J. Fabian, and D. H. Laidlaw. A coloring solution to the edge crossing problem. In Information Visualisation, 2009 13th Int. Conf., pages 691–696. IEEE, 2009.

[11] M. Laclav et al. Accuracy of person identification based on public available data. In 2016 IEEE 14th Int. Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 253–256. IEEE, 2016.

[12] K. Losemann and W. Martens. The complexity of evaluating path expressions in SPARQL. In Proc. of the 31st symposium on Principles of Database Systems, pages 101–112. ACM, 2012.

[13] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of data, pages 135–146. ACM, 2010.

[14] J. Mojžiš and M. Laclavík. Graph clutter filtering based on connectivity distance and visibility. In Science and Information Conf. (SAI), 2014, pages 153–158. IEEE, 2014.

[15] J. Mojžiš and M. Laclavík. Relationship discovery and navigation in big graphs. In Intelligent Systems in Science and Information 2014, pages 109–123. Springer, 2015.

[16] J. Mojžiš and M. Laclavík. Browsing semantic data in Slovakia. BRAIN. Broad Research in Artificial Intelligence and Neuroscience, 6(3-4):47–59, 2016.

[17] C. Poynton. Frequently asked questions about color. http://cyrille.nathalie.free.fr/computer%20vision/color_gamma_white_balance/ColorFAQ.pdf. Retrieved 20.6.2016.

[18] A. Rusu, A. J. Fabian, and R. Jianu. Using the gestalt principle of closure to alleviate the edge crossing problem in graph drawings. In Information Visualisation (IV), 2011 15th Int. Conf. on, pages 488–493. IEEE, 2011.

Selected Papers by the Author

J. Mojžiš and M. Laclavík. SRelation: Fast RDF graph traversal. In Knowledge engineering and the semantic web : 4th Int. Conf., KESW 2013. Eds. Klinov, P., Mouromtsev, D. - Berlin : Springer, 2013, CCIS 394, p. 69-82.

J. Mojžiš and M. Laclavík. Graph clutter filtering based on connectivity distance and visibility. In Proc. of Science and Information Conf. 2014. - London : The Science and Information (SAI) Organization, 2014, p. 153-158.

J. Mojžiš and M. Laclavík. Relationship Discovery and Navigation in Big Graphs. In: Intelligent Systems in Science and Information 2014. Springer Int. Publishing, 2015. p. 109-123.

J. Mojžiš and M. Laclavík. Browsing Semantic Data in Slovakia. BRAIN. Broad Research in Artificial Intelligence and Neuroscience, 2016, 6.3-4: 47-59.

Architecture for Core Networks Utilizing Software Defined Networking

Pavol Helebrandt∗

Institute of Computer Engineering and Applied Informatics
Faculty of Informatics and Information Technologies



Abstract

Software Defined Networking (SDN), a new and popular approach to computer network architecture, aims to control the whole network programmatically and centrally, which provides many advantages. However, deployment of SDN in large scale networks of telco operators and service providers is limited by the lack of standardized communication between SDN controllers and by the use of routing algorithms of traditional networks.

In this dissertation we analyse SDN principles, existing solutions and methods to scale their performance for large scale networks. Based on the analysis we formulate the problem of SDN domain interconnection for east-west communication. To solve this problem, we propose a new architecture for interconnection of controllers in various SDN domains, called INT Architecture. INT Architecture is formally verified by modelling in Petri Nets and tested in practice with a prototype running on virtual machines. INT Architecture is a beneficial enhancement of SDN that enables greater cooperation of SDN controllers and applications in large scale multi-domain networks.

Categories and Subject Descriptors

C.2.1 [Computer-communication networks]: Network Architecture and Design; C.2.2 [Computer-communication networks]: Network Protocols; C.2.3 [Computer-communication networks]: Network Operations

∗Recommended by thesis supervisor: Assoc. Prof. Ivan Kotuliak
Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on August 25, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Helebrandt, P. Architecture for Core Networks Utilizing Software Defined Networking. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 56-61

Keywords

Software defined networking, Wide area networks, Internet, Scalability, Network management, SDN multi-domain, Interconnect, SDN peering

1. Introduction

Internet traffic keeps growing because of better access and the rising popularity of multimedia and P2P, while new and innovative applications require even more resources and better network parameters. Telco carrier networks today are mostly based on Ethernet and Multiprotocol Label Switching (MPLS), and some are being upgraded with Shortest Path Bridging (SPB) and Generalized MPLS (GMPLS). Routing between the Autonomous Systems (AS) of these carriers relies on the global Internet routing table and the Border Gateway Protocol (BGP).

Software Defined Networking (SDN) [14] is a new approach to networking that aims to control the whole network programmatically from a logically centralized node. While SDN research is the centre of attention and improving controller performance is just getting into the spotlight, there is very little research into the advantages of interconnecting various controllers in heterogeneous SDN domains.

The motivation for this project is to facilitate the deployment of SDN and of new innovative applications in large scale networks, and to increase SDN deployment among ISPs. In this paper we introduce a new architecture for interconnection of control planes and SDN applications. The interconnection system can be deployed in various SDN domains and enables control plane communication and coordination for better provisioning of services across multiple domains. To accomplish this, we propose a new interface for SDN controllers that interconnects domains using a new vendor neutral communication protocol.

The rest of this paper is organized as follows. Section 2 describes the state of the art of SDN and of scaling the SDN control plane. Section 3 gives the problem statement with the premises, thesis and goals of this paper. Section 4 introduces the INT Architecture proposal for interconnection and cooperation between SDN controllers in a multi-domain environment. Section 5 presents the methods used for formal verification of the proposed INT Architecture. Section 6 concludes the paper by summarizing the results and benefits of the INT Architecture proposal.


Figure 1: SDN Architecture [8].

2. The State of the Art

In this section we analyse SDN and its potential benefits and limitations. Furthermore, we investigate the problem of scaling the SDN control plane for deployment in large networks and their interconnection.

2.1 Software Defined Networking

Software Defined Networking is a novel paradigm in computer network architecture that aims to control all network nodes by program. This solves many problems of traditional approaches to networking and enables new features as well. The general drive behind SDN is to increase the flexibility, manageability and extensibility of computer networks, with the secondary goal of decreasing equipment costs. This is achieved by taking advantage of the fast development and deployment cycle of relatively cheap software applications, in contrast to expensive specialized networking hardware.

The functionality of networking equipment can be conceptually divided into two planes. The Data Forwarding plane switches traffic between interfaces. The Control plane consists of the rules created by a processor running an operating system, routing algorithms, address translation, and other higher functions. In traditional networking, both the control plane and the data forwarding plane are implemented in every network node, often using specialized hardware. This enables every device to be fully autonomous and to make all high level decisions, such as packet routing, independently.

The fundamental principle of the SDN architecture, depicted in Figure 1, is the separation of the control and data forwarding planes, which communicate over a standard interface. By implementing the control plane in software on a general purpose computer, separated from the forwarding plane on network equipment, it is possible to centralize decisions as well as the configuration of all network devices. A centralized view of the whole network allows high level decisions about traffic management and the related computations to be made only once, with the results then propagated to all nodes in the data forwarding plane. Furthermore, a centralized control plane implemented in software executed on general purpose processors can bring many advantages to networking, especially faster innovation and faster development and deployment of new network features. While there are many approaches to SDN, the main enabler of SDN and its de facto standard communication protocol is OpenFlow.

Figure 2: OpenFlow switch packet handling [1].

2.1.1 OpenFlow

OpenFlow [13] is an open standard originally developed at universities and currently maintained by the Open Networking Foundation (ONF), a non-profit consortium with the mission to commercialize and promote OpenFlow based SDN. The OpenFlow switch standard [1] defines the communication interface between the control plane and forwarding plane devices, and so must be implemented by both sides.

The OpenFlow Controller makes high level switching decisions in the control plane and formulates them as forwarding rules, composed of matches and actions. These rules are entries in the Flow Tables used by an OpenFlow switch in the forwarding plane to handle incoming packets.

When a packet is received by an OpenFlow enabled switch, it is handled in the OpenFlow pipeline, which is composed of one or more Flow Tables. Each table contains entries with rules and the actions to be performed on packets belonging to a flow. If no match for the packet is found in any Flow Table and a rule to send unknown packets to the Controller is set up, the packet is sent to the controller. The Controller processes the packet and either drops it or establishes a new flow by creating a new entry in the Flow Tables. The handling of a received packet inside the OpenFlow switch is charted in Figure 2.
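The packet handling described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OpenFlow pipeline itself: the names `FlowEntry`, `Switch`, `demo_controller` and the action strings are hypothetical, matching is reduced to exact equality on header fields, and tables are simply scanned in order instead of following the goto-table instructions of the real specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Packet = Dict[str, str]  # header fields, e.g. {"ip_dst": "10.0.0.2"}


@dataclass
class FlowEntry:
    match: Dict[str, str]  # header fields that must all be equal
    actions: List[str]     # hypothetical action strings, e.g. "output:1"
    priority: int = 0

    def matches(self, packet: Packet) -> bool:
        return all(packet.get(k) == v for k, v in self.match.items())


Controller = Callable[[Packet], Optional[FlowEntry]]


@dataclass
class Switch:
    tables: List[List[FlowEntry]]    # the pipeline of Flow Tables
    send_to_controller: bool = True  # is the "unknown packets" rule set up?

    def handle(self, packet: Packet, controller: Controller) -> List[str]:
        # Simplification: tables are scanned in order, highest priority wins.
        for table in self.tables:
            hits = [entry for entry in table if entry.matches(packet)]
            if hits:
                return max(hits, key=lambda entry: entry.priority).actions
        if not self.send_to_controller:
            return ["drop"]
        entry = controller(packet)   # the controller drops or creates a flow
        if entry is None:
            return ["drop"]
        self.tables[0].append(entry)
        return entry.actions


def demo_controller(packet: Packet) -> Optional[FlowEntry]:
    # Toy policy (an assumption): only destinations in 10.0.0.x get a flow.
    dst = packet.get("ip_dst", "")
    if dst.startswith("10.0.0."):
        return FlowEntry(match={"ip_dst": dst}, actions=["output:1"])
    return None


switch = Switch(tables=[[]])
first = switch.handle({"ip_dst": "10.0.0.2"}, demo_controller)   # table miss
second = switch.handle({"ip_dst": "10.0.0.2"}, demo_controller)  # table hit
dropped = switch.handle({"ip_dst": "192.168.1.1"}, demo_controller)
```

The first packet misses every table and is sent to the controller, which installs a flow entry; the second packet of the same flow is then handled by the switch alone.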

2.1.2 Alternative and Related Technologies

Although OpenFlow is currently the most popular approach, SDN is an extensive discipline in constant flux, and many novel viewpoints on SDN implementation exist and continue to be developed. Furthermore, there are technologies that can be considered partial SDN enablers, in addition to other projects that can greatly benefit from deployment together with SDN.

These range from commercial solutions, such as Cisco ONE and Nuage Virtualized Services Platform, which aim to provide some programmability for devices from these vendors, to various open solutions. Among the latter are Interface to the Routing System (I2RS) [2], ForCES [5], NETCONF [6] and PCEP [18]. While I2RS is a new and ambitious approach whose goal is to provide a standard unified interface to the routing system for control and management, ForCES, NETCONF and PCEP are older technologies repurposed for use in SDN. In particular, PCEP in combination with MPLS and its multiple extensions is seen as a gradual migration path towards SDN that avoids network disruption and maintains existing interoperability, which is a very important factor for telco operators. As such, PCE can be considered an evolutionary path towards SDN that uses already deployed traditional network equipment, while OpenFlow is revolutionary.

Network Function Virtualization (NFV) [7] is a carrier-driven initiative to virtualize network functions and migrate them from purpose-built devices to generic servers. While SDN and NFV cover similar themes and can benefit from one another, they are independent, and neither requires deployment of the other in the network.

2.2 Scaling SDN Control Plane

This section analyses the problem of scaling the SDN control plane in large scale networks, such as the Internet, and the challenges in SDN controller interconnection.

Scale has been an active and often contentious topic in the discourse around SDN for a long time. Critics of the SDN paradigm argue that changing the control plane implementation model to anything but full distribution will lead to scalability challenges. There is also a common concern about the scalability of using traditional SDN, i.e. OpenFlow, to control physical switches, because of forwarding table limits.

In theory, any SDN approach can have the same scaling properties as traditional networking. For example, there is no reason that controllers cannot run traditional routing protocols between them. However, it is much more interesting to examine the scaling properties of a system that is built using an SDN approach and actually benefits from the architecture, and how these properties differ from those of traditional networks.

There are various approaches to scaling SDNs, currently in different stages of development and/or deployment. They can be classified into two categories. The first approach scales up the performance of a single SDN controller through increased optimization and parallelization of execution, as in NOX MT [17] and Maestro [3]. There are also hybrid solutions that use physically distributed controllers in clusters with multiple instances of a single logical controller, such as ElastiCon [4], xBar [12], HyperFlow [16] and Onix [10]. However, these are designed for single domain use only.

The second approach scales out by deploying multiple interconnected controllers that communicate in order to cooperate. An additional aspect is whether the interconnected SDN controllers are all in the same domain or some of them are in different domains. To clarify the meaning of interface orientation and the terms used in this paper, we illustrate their position in Figure 3.

Controller performance is becoming more important to developers as an increasing number of OpenFlow controllers are being developed and deployed. However, a single physical controller, albeit a high performance one, is not enough to manage a sizeable network. High availability and low response times are among the critical reasons why a network needs multiple controllers.

Figure 3: SDN interface orientation compass.

SDN interconnect (SDNi) was among the first efforts to connect SDN domains using an automated system. The SDNi draft [19] proposes an open protocol, SDNi, for the interface between Software Defined Networking domains, over which the domain SDN Controllers exchange information. However, this draft expired in 2012 and was abandoned with no further work.

Another approach to interconnection of SDN controllers is East-West Bridge (EWBridge) [11], which is still in development. EWBridge proposes a design for a high performance communication system between heterogeneous Network Operating Systems and partitions a large telco operator network domain into subnetworks.

3. Problem Statement

Although there are projects to scale or distribute the SDN controller functions to better accommodate a large network with several thousands of active nodes, very little work has been done on interconnecting the controllers of such large networks and leveraging the advantages of SDN.

At the moment it is very difficult to deploy the SDN architecture in very large scale networks, such as the Internet, and utilize its benefits, because an effective method for large scale controller distribution is lacking. The SDN paradigm is gaining traction in data centres and campuses, which can be large networks with several thousands of nodes, but these are typically managed by a single controller. Moreover, controllers in these SDN islands use traditional network protocols like BGP to exchange only routing information between domains. This limits the usefulness and flexibility that completely SDN based networks could provide. Even though remote access and manual alteration of controllers in a partner's network is possible to some extent, this method is not feasible for the Internet and is the antithesis of the SDN principle of network management automation.

The thesis of this dissertation is a proposed architecture for interconnection of controllers in various SDN domains, which communicate and exchange information for better provisioning of services with greater flexibility across different network domains. The following partial objectives arise from the thesis:

• Improve SDN architecture to benefit from east-west interface between controllers

• Design new universal east-west communication protocol for interconnection of heterogeneous SDN networks

• Define a communication interface in SDN controller for the interconnection protocol

• Verify the functionality of the designed protocol and methods by comparing it with unmodified network and alternative existing methods

Figure 4: INT Architecture.

4. INT Architecture

To achieve the goals defined in the problem statement, a new interconnection architecture is necessary. In this section we describe the design of the INT Architecture for interconnection of SDN controllers. We define the communication protocol for SDN controllers, the functions and interfaces used, and a formal model of the communication protocol. The high level architecture of the proposed system interconnecting two SDN domains is depicted in Figure 4.

Let us have $n$ SDN domains $[SDN_1, SDN_2, \ldots, SDN_n]$, each composed of exactly one Controller $CNT_i$ and a set of forwarders $[FWD_{i1}, FWD_{i2}, \ldots, FWD_{ij}]$, all belonging to domain $SDN_i$. An SDN domain is the part of an SDN network that is managed by a single logical SDN controller $CNT_i$, although the controller can be implemented as a collection of multiple physical controllers. One SDN domain is administered by a single organization and can be thought of as similar to an Autonomous System in BGP.

Currently any two domains $SDN_a$ and $SDN_b$ are interconnected only with traditional routing protocols, e.g. BGP, to provide IP connectivity in the data forwarding plane. We propose to replace traditional routing protocols with the INT Architecture, which interconnects domains $SDN_a$ and $SDN_b$ in the control plane to leverage the advantages of SDN applications across these domains. For each domain $SDN_i$, the interconnection is managed by the INT Manager application $MNG_i$ of controller $CNT_i$. INT Manager $MNG_i$ controls all interconnections and the configuration of INT Interface $IF_i$, which handles the message exchange between controllers. The connection of control planes itself is created between INT Interfaces $IF_a$ and $IF_b$, which are components of $MNG_a$ and $MNG_b$ in domains $SDN_a$ and $SDN_b$ respectively. INT Manager applications $MNG_a$ and $MNG_b$ control the operation of the data forwarding plane connection between edge forwarders $FWD_{ai}$ and $FWD_{bj}$ linking domains $SDN_a$ and $SDN_b$.
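The relationships between domains, controllers, forwarders and INT Managers defined above can be written down as a small data model. The following Python sketch is only an illustration: the class names `SdnDomain` and `IntManager`, the helper `interconnect` and all identifiers such as `FWD_a2` are assumptions made for this example, and the INT Interface is reduced to a set of peer names kept by the manager.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class IntManager:
    """MNG_i: runs on controller CNT_i and owns the INT Interface IF_i."""
    domain: str
    peers: Set[str] = field(default_factory=set)  # sessions held by IF_i
    # peer domain -> (local edge forwarder, remote edge forwarder)
    edge_links: Dict[str, Tuple[str, str]] = field(default_factory=dict)


@dataclass
class SdnDomain:
    name: str              # SDN_i
    controller: str        # CNT_i, exactly one logical controller
    forwarders: List[str]  # FWD_i1 ... FWD_ij
    manager: IntManager = field(init=False)

    def __post_init__(self) -> None:
        self.manager = IntManager(domain=self.name)


def interconnect(a: SdnDomain, b: SdnDomain, fwd_a: str, fwd_b: str) -> None:
    """Peer two domains over the given pair of edge forwarders."""
    if fwd_a not in a.forwarders or fwd_b not in b.forwarders:
        raise ValueError("edge forwarders must belong to their own domains")
    a.manager.peers.add(b.name)
    b.manager.peers.add(a.name)
    a.manager.edge_links[b.name] = (fwd_a, fwd_b)
    b.manager.edge_links[a.name] = (fwd_b, fwd_a)


sdn_a = SdnDomain("SDN_a", "CNT_a", ["FWD_a1", "FWD_a2"])
sdn_b = SdnDomain("SDN_b", "CNT_b", ["FWD_b1"])
interconnect(sdn_a, sdn_b, "FWD_a2", "FWD_b1")
```

Keeping the control plane peering (`peers`) separate from the data plane link (`edge_links`) mirrors the split between the INT Interface connection and the edge forwarder connection in the text.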

4.1 INT Architecture Functions

The main functions, or components, of the INT Architecture are the INT Manager and the INT Interface.

The INT Manager function is responsible for interpreting the network topology, i.e. the data plane routers and their links managed by an SDN controller, as a virtual router. This virtual router presents all networks in the controller domain and the networks reachable through it, with various path metrics for all of them. The INT Manager is further responsible for setting up connections to other controllers and for managing existing sessions. It also provides an interface through which the domain administrator manages peer connection setup, session management and the advertised SDN and/or NFV capabilities of the domain.
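One way to picture the virtual router abstraction is to collapse the domain topology into a table of reachable networks with one path metric each. The sketch below is an assumption about how this could be done, not the method used in the dissertation: it takes additive link cost as the only metric, measures it from a single edge forwarder with Dijkstra's algorithm, and uses hypothetical forwarder names and prefixes.

```python
import heapq
from typing import Dict, List, Tuple

# forwarder -> [(neighbouring forwarder, link cost)]
Graph = Dict[str, List[Tuple[str, int]]]


def shortest_costs(graph: Graph, source: str) -> Dict[str, int]:
    """Dijkstra's algorithm: cheapest cost from source to each forwarder."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return dist


def virtual_router(graph: Graph, attached: Dict[str, List[str]],
                   edge: str) -> Dict[str, int]:
    """Advertise each network with its metric as seen from the edge forwarder."""
    dist = shortest_costs(graph, edge)
    adverts: Dict[str, int] = {}
    for forwarder, networks in attached.items():
        if forwarder not in dist:
            continue  # unreachable inside the domain, not advertised
        for network in networks:
            adverts[network] = min(dist[forwarder],
                                   adverts.get(network, dist[forwarder]))
    return adverts


topology: Graph = {
    "fwd1": [("fwd2", 1), ("fwd3", 4)],
    "fwd2": [("fwd1", 1), ("fwd3", 2)],
    "fwd3": [("fwd1", 4), ("fwd2", 2)],
}
attached = {"fwd2": ["10.1.0.0/16"], "fwd3": ["10.2.0.0/16"]}
adverts = virtual_router(topology, attached, edge="fwd1")
# adverts == {"10.1.0.0/16": 1, "10.2.0.0/16": 3}
```

A peer domain that receives `adverts` sees one router with two networks and never learns the internal links, which is the level of abstraction the virtual router is meant to provide.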

For the INT Architecture to be usable in a heterogeneous environment with various SDN controllers, the INT Manager needs to use the appropriate API for Northbound communication with a given controller. Every controller uses slightly different data structures to store information about its network topology and traffic data, but most provide a common Northbound interface to access this data.

While the INT Manager compiles network topology data and presents an administration point for interconnection session management, the INT Interface is responsible for handling the communication between the INT Managers of different SDN controllers. This communication can be classified into two categories:

• Intra-domain - between controllers in the same administrative domain, which often manage the same network in cooperation to provide higher processing power and/or controller redundancy. The OpenFlow switch specification already defines mechanisms to support connection to multiple controllers, but controller cooperation remains unstandardized.

• Inter-domain - between controllers in different administrative domains, which is more important for the adoption of SDN in large scale networks. While routing between different networks is possible using existing routing protocols (BGP for linking AS and various Interior Gateway Protocols for networks inside an AS), it negates the advantages that can be leveraged by using SDN inside the networks being linked.

For inter-domain communication, one of the controllers in the domain is selected to be the "master controller", which communicates with peer domains and represents the focal point of the SDN domain control plane. It provides a single logical point through which peer controllers outside the domain concentrate and disseminate information to controllers inside the domain and vice versa. This minimizes the need for connections between all controllers, so the master controller functions similarly to a route reflector in BGP or a designated router in OSPF. Both types of communication between SDN controllers are achieved over the INT Interface using the INT Protocol explained in the following section.

4.2 INT Protocol

The INT Protocol enables both intra-domain and inter-domain path setup and the exchange of information between controllers about their capabilities. The protocol provides not only scalability features for controllers in a single administrative domain, but also various levels of network topology abstraction and control for peer controllers in separate SDN domains. The INT Manager functions of different SDN controllers are connected and communicate using the INT Protocol over a TLS or plain TCP session. The protocol itself is partly inspired by existing protocols such as CDNi and BGP and is composed of three sub-layers:

• INT Session Management - used for peer connection setup and session management. Connection establishment inside the administrative domain should be automated to minimize administration overhead, but inter-domain connections require manual setup for security purposes,

• Capabilities Information Exchange - responsible for the exchange of information about domain capabilities and the networks available inside and through the domain, together with path metrics,

• Path Setup - used for end-to-end flow path setup between client nodes in all the domains along the traffic route, according to the path metrics requirements of the original domain controller.

Interconnection of SDN controllers in a single administrative domain is a relatively straightforward process. Since the controllers manage parts of the same network and there are no restrictions on shared data, little to no manual administration is needed to initiate an interconnect session and share network information between controllers. The creation of an interconnection session and path setup between two peering controllers using the INT Protocol is illustrated by the message flow in Figure 5.
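The ordering of the three sub-layers (session first, then capabilities, then path setup) can be illustrated with a small state machine. The Python sketch below is hypothetical throughout: the actual INT Protocol messages and fields are not given in this text, so the class `IntSession`, its states and its method names are assumptions chosen only to show how manual approval of inter-domain peers and metric-based path requests might fit together.

```python
from enum import Enum, auto
from typing import Dict, Iterable


class State(Enum):
    IDLE = auto()         # no session yet
    ESTABLISHED = auto()  # INT Session Management finished
    SYNCED = auto()       # capabilities received from the peer


class IntSession:
    def __init__(self, peer: str, inter_domain: bool,
                 approved_peers: Iterable[str] = ()) -> None:
        self.peer = peer
        self.inter_domain = inter_domain
        self.approved_peers = set(approved_peers)  # set manually by the admin
        self.peer_networks: Dict[str, int] = {}    # network -> path metric
        self.state = State.IDLE

    def open(self) -> None:
        # Intra-domain peers connect automatically; inter-domain peers
        # must have been approved manually.
        if self.inter_domain and self.peer not in self.approved_peers:
            raise PermissionError(f"peer {self.peer} is not approved")
        self.state = State.ESTABLISHED

    def receive_capabilities(self, networks: Dict[str, int]) -> None:
        if self.state is State.IDLE:
            raise RuntimeError("session is not established")
        self.peer_networks.update(networks)
        self.state = State.SYNCED

    def request_path(self, network: str, max_metric: int) -> bool:
        # A path is accepted only if the peer advertised the network
        # with a metric that meets the requirement.
        if self.state is not State.SYNCED:
            raise RuntimeError("capabilities not exchanged yet")
        metric = self.peer_networks.get(network)
        return metric is not None and metric <= max_metric


session = IntSession("CNT_b", inter_domain=True, approved_peers=["CNT_b"])
session.open()
session.receive_capabilities({"10.2.0.0/16": 3})
accepted = session.request_path("10.2.0.0/16", max_metric=5)
rejected = session.request_path("10.2.0.0/16", max_metric=2)
```

An intra-domain session would be created with `inter_domain=False` and needs no entry in `approved_peers`, which reflects the automated setup inside one administrative domain.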

Interconnecting controllers in different domains is more difficult because of the added technical intricacies stemming from a typically unsafe connection and from distinct management and security policies. Furthermore, SDN domains understood in the form of traditional AS are operated by independent organizations with different levels of trust, and their interconnection involves agreements and contracts between separate legal entities.

Figure 5: INT Protocol.

5. Verification

Verification of the proposed INT Architecture can be split into two parts: first, formal verification of the correctness of the INT Architecture design using mathematical modelling; second, practical tests of a prototype implementation of the INT Architecture in a virtual environment.

5.1 Methodology for Formal Verification

Petri Nets (PN) are a mathematical instrument well suited for modelling discrete event systems. The graphical representation of a Petri Net is a bipartite directed multigraph, as defined by Peterson [15]. It is bipartite because its vertices can be divided into two disjoint groups, conditions and tasks, that are connected by arcs. Every arc connects a place to a transition or vice versa, as can be seen in Figure 6; no arc can connect two places or two transitions. Conditions (or places, states) are graphically represented by circles, tasks (or transitions) by bars, and arcs by directed edges. Tokens placed in places define the state of the Petri Net, also called its marking.

Formally, a Petri net is a five-tuple $PN = (P, T, F, W, M_0)$, where:

• $P = \{p_1, \ldots, p_n\}$ is a finite nonempty set of places,

• $T = \{t_1, \ldots, t_m\}$ is a finite nonempty set of transitions,

• $P \cap T = \emptyset$ and $P \cup T \neq \emptyset$,

• $F \subseteq (P \times T) \cup (T \times P)$ is the set of arcs,

• $W : F \to \mathbb{Z}^+$ is the weight function,

• $M_0 : P \to \mathbb{Z}^+ \cup \{0\}$ is the initial marking.

Figure 6: Petri Net example.
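The five-tuple maps directly onto a small data structure. The Python sketch below assumes the usual firing rule, which the text does not spell out: a transition is enabled when every input place holds at least as many tokens as the weight of its arc, and firing removes those tokens and adds tokens to the output places. The class name `PetriNet` and the two-place request/response net are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Arc = Tuple[str, str]
Marking = Dict[str, int]  # place -> number of tokens


@dataclass
class PetriNet:
    places: FrozenSet[str]       # P
    transitions: FrozenSet[str]  # T
    weights: Dict[Arc, int]      # W; its keys are the arcs F

    def __post_init__(self) -> None:
        if self.places & self.transitions:
            raise ValueError("P and T must be disjoint")
        if not (self.places | self.transitions):
            raise ValueError("P and T must not both be empty")
        for (src, dst), weight in self.weights.items():
            place_to_trans = src in self.places and dst in self.transitions
            trans_to_place = src in self.transitions and dst in self.places
            if not (place_to_trans or trans_to_place) or weight < 1:
                raise ValueError(f"invalid arc {(src, dst)}")

    def enabled(self, t: str, marking: Marking) -> bool:
        # Every input place must hold at least W(p, t) tokens.
        return all(marking.get(src, 0) >= weight
                   for (src, dst), weight in self.weights.items() if dst == t)

    def fire(self, t: str, marking: Marking) -> Marking:
        if not self.enabled(t, marking):
            raise ValueError(f"transition {t} is not enabled")
        new = dict(marking)
        for (src, dst), weight in self.weights.items():
            if dst == t:    # arc place -> transition: consume tokens
                new[src] = new.get(src, 0) - weight
            elif src == t:  # arc transition -> place: produce tokens
                new[dst] = new.get(dst, 0) + weight
        return new


net = PetriNet(
    places=frozenset({"idle", "waiting"}),
    transitions=frozenset({"send", "recv"}),
    weights={("idle", "send"): 1, ("send", "waiting"): 1,
             ("waiting", "recv"): 1, ("recv", "idle"): 1},
)
m0 = {"idle": 1, "waiting": 0}  # initial marking M_0
m1 = net.fire("send", m0)       # one token moves from idle to waiting
m2 = net.fire("recv", m1)       # and back again
```

The constructor rejects arcs between two places or two transitions, so the bipartite structure required by the definition is enforced when the net is built.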

Petri Nets are distinguished by their mathematical model, which lets us investigate their various properties and, by extension, the properties of the modelled protocol. According to [9], the properties of Petri nets important for validating a communication protocol model are:

• Reachability - a marking $M_n$ is reachable from the initial marking $M_0$ if there is a firing sequence of transitions $t_1, \ldots, t_n$ that produces the sequence $M_0, t_1, M_1, t_2, M_2, \ldots, M_{n-1}, t_n, M_n$,

• Reversibility - a PN is reversible when its initial marking $M_0$ can be reached again after firing a finite number of transitions $t_1, \ldots, t_k$,

• Boundedness - a PN is k-bounded, or simply bounded, if the number of tokens in each place does not exceed a finite number $k$ for any marking reachable from the initial marking $M_0$. A PN is called safe if it is 1-bounded,

• Liveness - a PN is live if for every marking $M$ reachable from the initial marking $M_0$ there exists a fireable transition $t$ that leads to a different marking $M'$, meaning there are no deadlocks.
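For a bounded net, all four properties can be checked by enumerating the reachable markings. The Python sketch below is one possible way to do this and rests on several assumptions: markings are tuples of token counts in a fixed place order, each transition is given as a pair of consumed and produced token vectors, exploration stops at an arbitrary `limit`, and reversibility is read in the stricter standard sense that $M_0$ must be reachable from every reachable marking. Liveness follows the deadlock-freedom reading given in the list above.

```python
from collections import deque
from typing import Dict, List, Tuple

Marking = Tuple[int, ...]
# transition name -> (tokens consumed per place, tokens produced per place)
Net = Dict[str, Tuple[Marking, Marking]]


def successors(net: Net, m: Marking) -> List[Marking]:
    """Markings reachable from m by firing one enabled transition."""
    result = []
    for pre, post in net.values():
        if all(have >= need for have, need in zip(m, pre)):
            result.append(tuple(h - c + p for h, c, p in zip(m, pre, post)))
    return result


def explore(net: Net, m0: Marking,
            limit: int = 10_000) -> Dict[Marking, List[Marking]]:
    """Breadth-first construction of the reachability graph."""
    graph = {m0: successors(net, m0)}
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for nxt in graph[m]:
            if nxt not in graph:
                if len(graph) >= limit:
                    raise RuntimeError("state space too large or net unbounded")
                graph[nxt] = successors(net, nxt)
                queue.append(nxt)
    return graph


def analyse(net: Net, m0: Marking) -> dict:
    graph = explore(net, m0)
    # Markings from which M_0 can be reached (backward fixed point).
    can_return = {m0}
    changed = True
    while changed:
        changed = False
        for m, nxts in graph.items():
            if m not in can_return and any(n in can_return for n in nxts):
                can_return.add(m)
                changed = True
    return {
        "reachable": set(graph),
        "bound": max(max(m) for m in graph),  # smallest k for k-boundedness
        "reversible": can_return == set(graph),
        "deadlock_free": all(any(n != m for n in nxts)
                             for m, nxts in graph.items()),
    }


# Place order: (idle, waiting). A request/response cycle and a broken variant.
cycle: Net = {"send": ((1, 0), (0, 1)), "recv": ((0, 1), (1, 0))}
broken: Net = {"send": ((1, 0), (0, 1))}  # the response is never received
ok = analyse(cycle, (1, 0))
bad = analyse(broken, (1, 0))
```

For the cycle, `ok` reports a safe (1-bounded), reversible and deadlock-free net; removing the `recv` transition leaves the net stuck in the waiting marking, which `bad` reports as neither reversible nor deadlock-free.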

By using Petri Nets to model the communication protocol and investigating these selected properties of the model, we can determine the behaviour of the modelled protocol and the correctness of its design.

6. Conclusions

Increasing network management complexity makes the concept of a centrally controlled and programmable SDN very appealing. On the other hand, scaling the SDN control plane for large networks has been an active and often contentious topic. Critics of the SDN paradigm argue that changing the control plane implementation model to anything but the full distribution of traditional networks will lead to scalability challenges.

There are projects to scale or distribute the SDN controller functions to better accommodate a large network with several thousands of active nodes. However, the lack of standardized communication between controllers prevents the current SDN architecture from leveraging most of the benefits it offers in large scale interconnected networks.

To solve this problem, we proposed the INT Architecture, which improves SDN for interconnection and cooperation between SDN controllers in a multi-domain environment. The INT Architecture includes the INT Manager and INT Interface functions, together with the extensible INT Protocol for standardized communication between various heterogeneous SDN controllers.

Acknowledgements. This work was partially supported by the Slovak National Grant agency VEGA 1/0676/12, NTB 2012et011 and STU Grant for Young Researchers 2016.

References

[1] OpenFlow switch specification version 1.3.1. 2012.

[2] A. Atlas, T. Nadeau, and D. Ward. Interface to the Routing System Problem Statement. IETF, draft-atlas-i2rs-problem-statement-01, work in progress, 2013.

[3] Z. Cai. Maestro: Achieving scalability and coordination in centralized network control plane. PhD thesis, Rice University, 2012.

[4] A. Dixit, F. Hao, S. Mukherjee, T. Lakshman, and R. Kompella. Towards an elastic distributed SDN controller. In ACM SIGCOMM Computer Communication Review, volume 43, pages 7–12. ACM, 2013.

[5] A. Doria, J. H. Salim, R. Haas, H. Khosravi, W. Wang, L. Dong, R. Gopal, and J. Halpern. Forwarding and control element separation (ForCES) protocol specification. Technical report, 2010.

[6] R. Enns, M. Bjorklund, and J. Schoenwaelder. Network configuration protocol (NETCONF). Network, 2011.

[7] ETSI. Network Functions Virtualisation: Architectural Framework. Technical report, Technical Report ETSI GS NFV 002 v1.1.1, 2013.

[8] Open Networking Foundation. Software-Defined Networking: The New Norm for Networks. ONF White Paper, 2012.

[9] G. Holzmann. Design and validation of computer protocols. Prentice Hall, 1990.

[10] T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata, H. Inoue, T. Hama, et al. Onix: A Distributed Control Platform for Large-scale Production Networks. In OSDI, volume 10, pages 1–6, 2010.

[11] P. Lin, J. Bi, and Y. Wang. East-West Bridge for SDN Network Peering. In Frontiers in Internet Technologies, pages 170–181. Springer, 2013.

[12] J. McCauley, A. Panda, M. Casado, T. Koponen, and S. Shenker. Extending SDN to large-scale networks. Open Networking Summit, pages 1–2, 2013.

[13] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.

[14] T. D. Nadeau and K. Gray. SDN: Software Defined Networks. O’Reilly Media, Inc., 2013.

[15] J. L. Peterson. Petri net theory and the modeling of systems. Prentice Hall PTR, 1981.

[16] A. Tootoonchian and Y. Ganjali. HyperFlow: A distributed control plane for OpenFlow. In Proceedings of the 2010 internet network management conference on Research on enterprise networking, pages 3–3, 2010.

[17] A. Tootoonchian, S. Gorbunov, Y. Ganjali, M. Casado, and R. Sherwood. On Controller Performance in Software-Defined Networks. In Presented as part of the 2nd USENIX Workshop on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, 2012.

[18] J. Vasseur and J. Le Roux. Path computation element (PCE) communication protocol (PCEP). 2009.

[19] H. Yin, H. Xie, T. Tsou, D. Lopez, P. Aranda, and R. Sidi. SDNi: A message exchange protocol for software defined networks (SDNs) across multiple domains. IETF draft, work in progress, 2012.

Selected Papers by the Author

P. Helebrandt, I. Kotuliak. Novel SDN multi-domain architecture. In Proceedings of IEEE 12th International Conference on Emerging eLearning Technologies and Applications (ICETA), 2014, pages 139-143.

P. Truchly, P. Helebrandt, L. Danielovic. Implementation and Evaluation of IPv6 to IPv4 Transition Mechanisms in Network Simulator 3. In Proceedings of IEEE 23rd International Conference on Systems, Signals and Image Processing (IWSSIP), 2016, In print.

Formal Description of Embedded Operating Systems

Martin Vojtko∗

Institute of Computer Engineering and Applied Informatics
Faculty of Informatics and Information Technologies



Abstract

The fast development of new processors introduces problems with the adaptation of operating systems. When a new processor is presented on the market, the operating system needs to be adapted to the processor architecture and features. This is done by reprogramming a platform-dependent layer and implementing the missing device modules of the operating system. The adaptation process is more complicated when the new processor has a completely different architecture than the one for which the operating system was previously designed. Another problem of the adaptation lies in the processor datasheets: they are not processable by a computer, so the operating system code cannot be generated from them. In this dissertation thesis, we present an updated adaptation process for embedded operating systems. We designed a Processor Formal Description that acts as a computer processable datasheet. This description is used for automated generation of platform-dependent code. To support the adaptation process, we present a concept of an adaptation framework that helps to reduce the time needed for the adaptation of the operating system.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General—hardware/software interfaces; C.3 [Computer Systems Organization]: Special-purpose and Application-based Systems—microprocessor/microcomputer applications, real-time and embedded systems; D.2.2 [Software Engineering]: Design Tools and Techniques—modules and interfaces; D.3.4 [Programming languages]: Processors—code generation; D.4.7 [Operating Systems]: Organization and Design—real-time systems and embedded systems, standardization

∗Recommended by thesis supervisor: Assoc. Prof. Tibor Krajcovic
Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on October 17, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Vojtko, M. Formal Description of Embedded Operating Systems. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 62-68

Keywords

Processor Formal Description, Adaptation of Operating Systems, Code Generation, Adaptation Process, Modelling of Modules of Operating Systems, Modular Operating Systems, Layered Operating Systems, Embedded Operating Systems

1. Introduction

The growing number of processor architectures leads to the need for a methodology which allows fast and effective operating system adaptation to those architectures. Future embedded systems will have multi-core/many-core architectures [7] or mixed architectures consisting of multi-core processor clusters. New types of architectures will introduce new types of operating systems which will be self-adaptive [9]. New operating systems running in a heterogeneous environment will need a database of existing processor ports, device modules and processing cores. Modules and platform ports will be loaded into the program memories of the processor during system initialization or will be loaded on-line during the system run-time.

Many-core systems are changing the traditional concept of the processor as a system with several devices and a few processing cores. The number of cores will grow in the future together with the number of intelligent devices that will be connected to a shared network [7]. This future highly scalable network architecture calls for a change of the standard architecture of operating systems into a distributed architecture.

Current operating systems are seen as software that interfaces and extends the processor. In the future the operating system will be more than that. It will be seen as a framework or a database of modules and platform ports. Consequently, the developer will choose from the database the modules and platform ports that fit the architecture of the processor. The operating system framework will also provide tools that help the developer to create modules and platform ports that are missing. Operating systems such as FreeRTOS [8] and many more have started this transformation, but it is only a beginning and many aspects of the operating system will change in the future.

In this paper we analyse a generalized form of the adaptation process that is used during adaptation of any operating system nowadays. This process does not support automated code generation, which will be crucial in the future. We propose an extension to the process in order to add formalization techniques to the processor description. This extension allows the generation of platform-dependent parts of the operating system. As a result of this formalization we specify the Processor Formal Description (PFD) in this paper. The PFD describes each device and processing core of the processor in a form that is processable by a computer. From the PFD we generate glue code, which interfaces the processor and the operating system. Finally we use this glue code for the implementation of modules of the operating system.

We also propose a framework that will support the adaptation process of embedded operating systems. The framework consists of tools and services that help to describe the processor [10], generate glue code [11], describe operating system modules that encapsulate devices and processing cores of the processor [12], and implement those modules [12].

2. Adaptation Process of OS

The adaptation of an operating system (OS) is mostly started because there is a need to use a specific feature of the OS, or a need to port a specific OS to an architecture that is in some way special or solves a specific problem. The developer's experience with the OS plays a big role in this decision. The adaptation of the OSE OS to the many-core architecture Tilepro64 is a good example [2]: the OSE scheduler was mapped onto this many-core architecture. The authors provide information about the steps of the adaptation, but the process they define is strongly application specific.

Well-known operating systems, such as FreeRTOS [8], Avrx [3] or TinyOS [6], provide adaptation manuals to the developer. Those manuals explain which parts of the OS should be adapted during the adaptation to the processor. For FreeRTOS, there is a vast number of minimal working examples (MWEs) and adaptations to many existing platforms. This database of examples increases the popularity of the OS. When there is no existing example for the processor that the developer wants to use, the developer has to implement the support for it. After the adaptation, the developer should provide the solution to the FreeRTOS community.

The FreeRTOS community has no strict rules for platform ports, so they can differ from port to port. The part of the OS most affected by the adaptation is the platform-dependent part that interfaces OS modules to the hardware. When each developer implements this layer differently, the code is not manageable across platform ports.

Another aspect of the adaptation is the lack of a standard that would lead processor manufacturers to provide datasheets in a standardized form. Each manufacturer has its own templates. Another problem is that those datasheets were prepared for humans, so any computer processing is nearly impossible. Our idea is to propose a standard that describes the processor in a form that is processable by a computer.

Processor manufacturers also provide many MWEs for their platforms. Many of the manufacturers implement their own header files, source files and glue code. This code is helpful when the OS is used on processors issued by one manufacturer (sometimes from one family only). This code also differs between manufacturers.

[Diagram boxes: Processor Datasheet; Datasheets of Processing cores; Analysis of Processor Devices; Analysis of Processing Cores; Implementation of Device Modules; Implementation of OS Low Level Services; Implementation of Platform-dependent Layer of OS; Device Modules Source Code; Platform-dependent Source Code.]

Figure 1: Generalized process of adaptation of an operating system [12].

Figure 1 shows the generalised adaptation process, which can be divided into two workflows. The first workflow shows an analysis of processing cores of the processor, a design of OS modules and an implementation of code that uses features of the processor. The second workflow shows an analysis of existing processor devices, and a design and an implementation of code that manages processing cores of the processor.

The first step of the adaptation of the OS to the processor is an analysis. As mentioned previously, the analysis can be split into two parts where devices and processing cores are analysed separately. In this step the designer analyses all the materials that are provided by the manufacturer of the processor. Mostly these take the form of a datasheet of the processor or datasheets of the processing cores of the processor.

2.1 Processing Cores

During the adaptation of processing cores the designer has to find out how the main services of the OS can be implemented. The services are:

• core and OS initialization,

• interrupt handling and

• task switching.

During core and OS initialization all the operating modes of the processing core, and the stacks and memories of the OS kernel, are set. Most manufacturers provide MWEs for the core initialization, but they have to be adjusted to the needs of the OS.

Interrupt handling is the service of the OS that is partially implemented in assembly language. Most processors provide an interrupt subsystem that can be used by the OS. The developer has to implement an interface to this subsystem and can then start implementing interrupt routines that are mapped to a specific interrupt source, such as the task switch.

The task switch is crucial for any OS because it handles the correct storing of the old task and loading of the new task. During the task switch each register of the processing core has to be stored before a task can be replaced by another.
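The store-then-load order of a task switch can be illustrated with a minimal host-side sketch. Everything here is an assumption made for illustration: the register count, the names `demo_core`, `demo_context` and `demo_task_switch`, and the use of a plain array in place of a real register file. On real hardware this step is typically written in assembly, because C cannot address core registers directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_REGS 16u /* hypothetical register count of the core */

/* Simulated register file of one processing core (assumption). */
typedef struct {
    uint32_t regs[DEMO_NUM_REGS];
} demo_core;

/* Saved register image of one task (assumption). */
typedef struct {
    uint32_t regs[DEMO_NUM_REGS];
} demo_context;

/* Store every register of the outgoing task, then load the incoming task. */
static void demo_task_switch(demo_core *core,
                             demo_context *old_task,
                             const demo_context *new_task)
{
    assert(core != NULL && old_task != NULL && new_task != NULL);
    memcpy(old_task->regs, core->regs, sizeof core->regs); /* save old task */
    memcpy(core->regs, new_task->regs, sizeof core->regs); /* load new task */
}
```

The sketch only shows the ordering constraint stated above: nothing is loaded until the complete register image of the old task has been stored.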

All previously mentioned services are highly platform-dependent, so in most cases those services have to be implemented during almost every adaptation of the OS.

2.2 Devices

During the adaptation of devices the designer chooses devices that will be needed for the successful completion of the task.

The developer analyses the functionality and the communication interface of each device. The interface mostly consists of registers and signals by which the OS can send tasks.

The designer uses the registers of the device for the implementation of the glue code that acts as an interface that the OS can understand. The interface consists of simple read/write routines that access device registers. The designer can use this interface during the implementation of device modules of the OS. Many manufacturers implement their own glue code for their processors. This is very helpful because the developer can concentrate on the OS design. The problem is that this glue code differs between manufacturers.
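As a rough illustration of such read/write routines, the following sketch uses a simulated register block so that it can run on a host machine. The device, its registers and all `demo_uart_*` names are hypothetical; on a target the structure would be placed at a fixed memory-mapped address taken from the datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device with two registers; stands in for memory-mapped hardware. */
typedef struct {
    volatile uint32_t mode;      /* hypothetical mode register */
    volatile uint32_t baud_rate; /* hypothetical baud-rate register */
} demo_uart_regs;

/* Simulated instance; on a target this would be a fixed address. */
static demo_uart_regs demo_uart0;

/* Each glue routine performs exactly one register access. */
static void demo_uart_write_mode(demo_uart_regs *dev, uint32_t value)
{
    assert(dev != NULL);
    dev->mode = value;
}

static uint32_t demo_uart_read_mode(const demo_uart_regs *dev)
{
    assert(dev != NULL);
    return dev->mode;
}

static void demo_uart_write_baud_rate(demo_uart_regs *dev, uint32_t value)
{
    assert(dev != NULL);
    dev->baud_rate = value;
}

static uint32_t demo_uart_read_baud_rate(const demo_uart_regs *dev)
{
    assert(dev != NULL);
    return dev->baud_rate;
}
```

A device module would call only these routines, so the register layout stays hidden behind one small interface.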

2.3 Generation of the Glue Code

In the past, there were projects that tried to generate glue code for hardware. The glue code was mostly understood as the code needed to connect two hardware components with different interfaces [4]. Sometimes the resulting glue even contained new pieces of hardware that acted as a translator of communication [13] [5]. Those techniques were used to connect the processing core to the device through a set of separate signal lines. Nowadays the majority of devices are connected to the processing core via standardized interfaces, e.g. an internal bus. The interfaces of hardware are also standardized [1], so the complexity of the hardware interconnection is reduced.

3. Proposal of Novel Adaptation Process

The novel adaptation process was designed to help the developer of the embedded OS to generate the platform-dependent code of the OS for any chosen processor. Generation of code reduces the adaptation time and speeds up the preparation of working prototypes. The process also helps during the modelling and implementation of the modules of the OS. Those modules manage the processor devices and processing cores. The process shown in Figure 2 is suitable for embedded OSs that have a layered architecture consisting of at least one platform-dependent and one platform-independent layer [12].

The proposed process applies formalization techniques that allow the generation of the major part of the platform-dependent layer of the OS. The platform-dependent layer consists of simple routines that operate on registers of devices and processing cores. Those routines are stateless and perform just one operation at a time. Together they create an interface that consists of many simple and similar routines that can be produced by automatic code generation.

[Diagram boxes: Processor Datasheet; Datasheets of Processing cores; Processor Formal Description; Analysis of Processor Devices; Analysis of Processing Cores; Description of Device Modules; Description of Core Modules; Module Formal Description; Platform-dependent Source Code Generation; Implementation of Device Modules; Implementation of OS Low Level Services; Device Modules Source Code; Platform-dependent Source Code.]

Figure 2: New proposal for the adaptation process of the embedded operating system [12].

The advantages of generated code are:

• fast prototyping,

• reduction of error probability,

• hiding of hardware complexity,

• consistent and similar results across most architectures and

• the developer can concentrate on the application domain.

3.1 Inputs of the Adaptation Process

As the input for the novel adaptation process the developer needs the following documents [12]:

• Processor datasheet - provides information about processor devices;

• Processing cores datasheet - provides information about the processing cores of the processor;

• Processor description file - represents a computer-readable form of the processor datasheet.

The Processor Formal Description (PFD) is a new document introduced in the adaptation process. It is the result of the processor analysis. Currently, the PFD has to be prepared by the developer, but in the future it could be provided by the manufacturer of the processor. The PFD stores information about each item of the processor that can be affected by an instruction from the instruction set of the processor. More information can be found in Section 4.


3.2 Description of Devices and Cores

The new process formalizes most of the aspects of the OS adaptation, so there is no reference to any programming language until the implementation phase. This differs from the old adaptation process, where the description language is mostly the same as the implementation language. Programming languages often have limitations that also affect the design of modules. One of the limitations is that the programming language (in embedded systems it is mostly C) has a poor ability to model parallel execution of tasks.

In the description of devices or cores there is no need for such parallel design, but the independence of the description from the programming language can help to express aspects of devices or cores that cannot be expressed by a programming language (e.g. connections between devices).

The glue code is a part of the code that has to be implemented, but from the perspective of the developer it adds no value to the functionality of the OS. It just interfaces the hardware to higher levels of the OS. The glue code is simple by nature, which makes it well suited to code generation.

The glue code is generated from the description of devices and cores (that is, from the PFD). The device is accessed by writing to its registers or reading from them. Those simple operations can be fully covered by the generator of the glue code. The processing core is more complicated than the device, so only a part of its description can be used for the generation of the code.
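To make the idea of generating register accessors from a description more concrete, here is a minimal sketch of a generator step. It is not the generator described in [11]; the `reg_desc` record, the `emit_accessors` function and the shape of the emitted C text are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily reduced description of one device register. */
typedef struct {
    const char *device;   /* device name, used as a prefix */
    const char *reg;      /* register name */
    unsigned    offset;   /* byte offset from the device base address */
    int         readable; /* non-zero if a read accessor is wanted */
    int         writable; /* non-zero if a write accessor is wanted */
} reg_desc;

/* Writes C source for the accessors of one register into out.
   Returns the length of the generated text, or -1 if out is too small. */
static int emit_accessors(const reg_desc *r, char *out, size_t cap)
{
    size_t used = 0;
    int n;

    assert(r != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    if (r->readable) {
        n = snprintf(out + used, cap - used,
                     "static inline uint32_t %s_read_%s(uintptr_t base)\n"
                     "{\n    return *(volatile uint32_t *)(base + 0x%02Xu);\n}\n",
                     r->device, r->reg, r->offset);
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;
        }
        used += (size_t)n;
    }
    if (r->writable) {
        n = snprintf(out + used, cap - used,
                     "static inline void %s_write_%s(uintptr_t base, uint32_t v)\n"
                     "{\n    *(volatile uint32_t *)(base + 0x%02Xu) = v;\n}\n",
                     r->device, r->reg, r->offset);
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;
        }
        used += (size_t)n;
    }
    return (int)strlen(out);
}
```

Because every emitted routine follows the same template, a read-only register simply yields one routine instead of two, and no hand-written code is needed per register.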

3.3 Description of OS Modules

The PFD is also used during the design of the OS modules. The described processor parts are "named items" that can be used during the modelling of module behaviour (e.g. as the description in a flow chart diagram).

In the past we proposed the Module Formal Description, which is based on workflow diagrams. In that case the description uses parts of the PFD as building blocks that can model the whole behaviour of the OS module. As a consequence the model can be easily converted to a programming language by another code generation step that creates skeletons of the whole OS module [12].

4. Formal Description of the Processor

The formal description is needed to allow the automatic generation of the platform-dependent code. The PFD describes a processor from the top to the bottom, starting from processor devices and processing cores [10]. A whole mathematical model was implemented to cover any part of the processor that can be affected by an instruction from the instruction set of the processor, but there is no space to cover this model in this paper. The model is fully described in Vojtko et al. [12].

Figure 3 shows the visualization of the PFD, where the processor is the center of the model. Black arrows represent the "consists of" relationship and red dashed arrows represent the "depends on" relationship. So we can say that the processor consists of devices and processing cores. A device consists of registers, I/O signals and interrupt signals. A processing core consists of registers, I/O signals, instructions and operating modes. Any register consists of register parts, and register parts can consist of options. So the model of the processor is a 5-level hierarchy of items.

[Diagram items: Processor; Device and Processing Core; Device Register, Device Signal, Interrupt Signal, Interrupt Source, Processing Core Register, Processing Core Signal, Instruction and Operating Mode; Register Part; Part Option. Arrow types: Contains, Depends On.]

Figure 3: Visualization of the PFD items [12].

From the perspective of code generation the most important parts of the PFD are levels 3, 4 and 5. From those levels the platform-dependent layer of the OS is generated. Levels 1 and 2 are important for the organization of code into logical modules (e.g. devices and processing cores). Those two levels are helpful during the modelling of OS device modules.
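The five levels can be pictured as nested C records. This is only a sketch of the hierarchy as described in the text, not the mathematical model from [12]: all type and field names are assumptions, and processing cores, signals, instructions and operating modes are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Level 5: a named value that a register part can take. */
typedef struct {
    const char *name;
    unsigned    value;
} pfd_option;

/* Level 4: a named bit field inside a register. */
typedef struct {
    const char       *name;
    unsigned          lsb;   /* position of the least significant bit */
    unsigned          width; /* width in bits */
    const pfd_option *options;
    size_t            n_options;
} pfd_part;

/* Level 3: a register (signals, instructions and modes omitted here). */
typedef struct {
    const char     *name;
    unsigned        offset;
    const pfd_part *parts;
    size_t          n_parts;
} pfd_register;

/* Level 2: a device (processing cores would be modelled alongside). */
typedef struct {
    const char         *name;
    const pfd_register *regs;
    size_t              n_regs;
} pfd_device;

/* Level 1: the processor. */
typedef struct {
    const char       *name;
    const pfd_device *devices;
    size_t            n_devices;
} pfd_processor;

/* Walks levels 1 to 4 and counts the register parts a generator would visit. */
static size_t pfd_count_parts(const pfd_processor *p)
{
    size_t d, r, total = 0;

    assert(p != NULL);
    for (d = 0; d < p->n_devices; d++) {
        for (r = 0; r < p->devices[d].n_regs; r++) {
            total += p->devices[d].regs[r].n_parts;
        }
    }
    return total;
}
```

In this picture, levels 1 and 2 decide which source file or module a routine belongs to, while levels 3 to 5 supply the offsets, bit positions and option values that end up in the generated routines.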

The "depends on" relationship reflects a dependence that can exist between items of the third and fourth level (i.e. register, signal and register part) of the PFD. There is a dependence between two items when a change in one item triggers a change in another. A good example of dependence is the reset of an interrupt register that was accessed by a read operation. Reading this register also starts a sequence of events that resets the register parts to their default values. The coverage of dependencies in the PFD is extremely helpful during the implementation of OS modules, because the description of dependencies informs the developer to be vigilant when working with dependent registers and to implement the module with those dependencies in mind.
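The read-to-reset dependence can be demonstrated with a small host-side simulation. The device, the flag names and both functions are hypothetical; the reset inside `demo_irq_read_status` imitates what the hardware would do on its own after a read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IRQ_RX 0x1u /* hypothetical "data received" flag */
#define DEMO_IRQ_TX 0x2u /* hypothetical "transmitter ready" flag */

/* Simulated device with one interrupt status register. */
typedef struct {
    uint32_t status;
} demo_irq_dev;

/* Reading returns the pending flags and, as a side effect, resets the
   register to its default value (0), as read-to-clear hardware would. */
static uint32_t demo_irq_read_status(demo_irq_dev *dev)
{
    uint32_t pending;

    assert(dev != NULL);
    pending = dev->status;
    dev->status = 0u; /* simulated hardware side effect of the read */
    return pending;
}

/* The handler reads the register once and works on the cached copy;
   a second read would already see the default value. */
static unsigned demo_irq_dispatch(demo_irq_dev *dev)
{
    uint32_t pending = demo_irq_read_status(dev);
    unsigned handled = 0;

    if (pending & DEMO_IRQ_RX) {
        handled++; /* a receive routine would run here */
    }
    if (pending & DEMO_IRQ_TX) {
        handled++; /* a transmit routine would run here */
    }
    return handled;
}
```

A module written without knowledge of this dependence might read the status register once per flag and silently lose the second interrupt, which is the kind of mistake the recorded dependency is meant to prevent.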

In greater detail, the PFD describes the communication interface of devices and processing cores. This means that the internal structure of hardware modules is hidden. This hiding of hardware structure is an advantage compared to other description languages such as VHDL or Verilog, because manufacturers do not want to publish their hardware architecture. Figure 4 shows the input and output signals of the device and the registers of the device. The signals and registers marked in red form the communication interface that is modelled by the PFD. Other signals (such as bus signals) also exist in the device, but from the perspective of the PFD those signals are not directly accessible by the processor instructions.


[Diagram of a Device block with: Interrupt Signals; External Input and Output Signals; Registers; Internal Input and Output Bus Signals (Address, Data).]

Figure 4: Communication interface of device [12].

5. Formal Description of OS Modules

The module of the OS manages and controls the processor device. It uses platform-dependent code prepared by the glue code generator. The formal description can be used during the design of OS modules, because it simplifies the adaptation process. In Vojtko et al. [12] a formal description was proposed that uses existing register parts and registers described in the PFD as blocks of a workflow diagram (e.g. Figure 5).

The module of the OS can be divided into 3 parts, which are modelled independently [12]:

• Module initialization,

• Interrupt handling and

• Data processing.

[Workflow diagram between "Init begin" and "Init end". Envelope MR (mode reg.): Set UMD=NORMAL (UsartMode); Set SYNC (sync mode); decision "Is SYNC == ASYNC?" with [yes]/[no] branches; Set OVER (oversampling). Envelope BRGR (baud rate generator reg.): decision "Is MR_UMD == ISO7816?" with [yes]/[no] branches; Set CD (clock divider) with BR = USCLKS/CD/FIDI; Set CD (clock divider) with BR = USCLKS/8*(1+OVER)/CD; Set FP (fractional part).]

Figure 5: Init function of USART (diagram) [12].

Figure 6: Init function of USART (code) [12].

The module initialization models the process of the device setup. The interrupt handling models the selection of the interrupt source and of the appropriate interrupt response routine. The data processing models the ways and means of data preparation, transferring and receiving.

Figure 5 shows how the initialization of a universal serial interface can be modelled. In the figure there are two envelopes (MR and BRGR) that represent two registers of the universal serial interface. Those registers contain parts that are set to a value specified by the option name. The diagram makes it possible to model the dependence between parts of the register. In this example the setup of oversampling (OVER) will have an effect only if synchronization (SYNC) of the serial interface is set to asynchronous mode (ASYNC).

From this diagram code can be generated, as shown in Figure 6. The generated function takes four parameters, which are used for those blocks in the model that were not set directly in the diagram. As the figure shows, no condition is generated for SYNC, although one was used in the diagram. This is because the diagram only informs that setting OVER will have no effect on the behaviour of the USART when SYNC is not set to the ASYNC value. As can be seen in the code, there is a variable dataset that is set to the needed value and then written to the mode register. All values of the register parts are written to the dataset, so only internal registers of the processor are used until the write operation to the device register is done.
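Since the listing of Figure 6 is not reproduced in this text, the following is only a sketch of what such a function could look like, following the description above: four parameters, a local dataset variable, one write per device register and no condition on SYNC. The register structure, the field positions and widths, and the name `demo_usart_init` are assumptions for illustration and are not taken from [12].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated register block; on a target this would be memory-mapped. */
typedef struct {
    volatile uint32_t mr;   /* mode register */
    volatile uint32_t brgr; /* baud rate generator register */
} demo_usart_regs;

/* Field layout: illustrative assumptions only. */
#define MR_UMD_POS    0u
#define MR_UMD_MASK   (0xFu << MR_UMD_POS)
#define MR_UMD_NORMAL 0u
#define MR_SYNC_POS   8u
#define MR_SYNC_MASK  (0x1u << MR_SYNC_POS)
#define MR_OVER_POS   19u
#define MR_OVER_MASK  (0x1u << MR_OVER_POS)
#define BRGR_CD_POS   0u
#define BRGR_CD_MASK  (0xFFFFu << BRGR_CD_POS)
#define BRGR_FP_POS   16u
#define BRGR_FP_MASK  (0x7u << BRGR_FP_POS)

/* UMD is fixed to NORMAL by the diagram; the other four parts are parameters. */
static void demo_usart_init(demo_usart_regs *usart,
                            uint32_t sync, uint32_t over,
                            uint32_t cd, uint32_t fp)
{
    uint32_t dataset;

    assert(usart != NULL);

    /* Collect all parts of MR locally, then write the device register once. */
    dataset = 0u;
    dataset |= (MR_UMD_NORMAL << MR_UMD_POS) & MR_UMD_MASK;
    dataset |= (sync << MR_SYNC_POS) & MR_SYNC_MASK;
    dataset |= (over << MR_OVER_POS) & MR_OVER_MASK; /* no SYNC condition */
    usart->mr = dataset;

    /* Same pattern for BRGR. */
    dataset = 0u;
    dataset |= (cd << BRGR_CD_POS) & BRGR_CD_MASK;
    dataset |= (fp << BRGR_FP_POS) & BRGR_FP_MASK;
    usart->brgr = dataset;
}
```

Building the value in a local variable keeps the number of accesses to the device at one write per register, which matches the remark that only internal processor registers are used until the final write.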

6. Framework for OS Adaptation

The concept of the OS adaptation framework is based on the adaptation process. This framework will support the adaptation process by a set of services and databases. Figure 7 represents the conceptual architecture of the framework.

[Diagram: Datasheets, PFD, MFD and SRC items connected to the databases PFDdb, MFDdb and SRCdb and to the Operating System (Scheduler, Modul 1 ... Modul n, Platform-dependent layer) through the services Description, Validation, Generation, Mapping, Implementation/Generation, Selection and Reusage.]

Figure 7: The Adaptation framework services. (MFD - module formal description, db - database, SRC - source code)

The framework will positively impact these tasks:

• PFD creation and validation,

• platform-dependent code generation,

• OS module modelling and validation,

• OS module mapping to the PFD,

• code implementation and/or generation, and

• selection of platforms and modules.

The framework will use 3 databases to store the OS environment:

• database of PFDs,

• database of OS modules, and

• database of OS source codes.

6.1 Description of the Processor

The framework allows preparation of the PFD from the processor datasheets. The prepared PFD will be validated and sent to the database of PFDs. The stored description can then be used by the generator to produce the platform-dependent code of the OS. PFDs are also used for the mapping of the OS modules to communication interfaces of the devices and processing cores.

Stored PFDs will be decomposed into devices and cores. Each identified device will be inserted into a database after validation, which will guarantee that the device is not already present in the database as a duplicate.

6.2 Description of the Module

Since the PFD covers only the communication interface of the processor device or core, there is still a need for the development of the OS module that manages this device or core. Workflow diagrams will be used for mapping of the OS module to the communication interface. The resulting model of an OS module will be validated and then inserted into the database of Module Formal Descriptions (MFDs). The documentation of a model will be a compulsory part of the module description.

6.3 Module Code Generation/Implementation

The developer implements the OS module code from the MFD. Some parts of the module can be generated automatically as a skeleton of the module, which will help the developer during implementation. The implemented module is then stored in the database of OS source codes with a link to the parent module description and compulsory documentation.

6.4 Selection of the OS Parts

As a part of system design, the developer of an embedded system will have access to the database of OS source codes. From this database the developer will choose a platform-dependent layer and select compatible device and processing core modules for the chosen processor. The developer can also add, describe and implement missing modules.

6.5 Databases

Full PFDs will be stored in the database of descriptions. Those files will also be decomposed into separate devices and processing cores. A problem can arise when the same device exists in several processors, so this situation has to be solved by a unique identification of the device. In order to avoid a duplicate upload of an existing PFD it is necessary to create a protection mechanism. If an existing processor is revised by the manufacturer, it will be possible to revise the PFD too.
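The paper does not say how the unique identification would be computed. One possible approach, shown here purely as an assumption, is to derive an identifier from a canonical text form of the device interface, so that the same device found in two processors maps to the same database key. The sketch uses the 32-bit FNV-1a hash; the descriptor format and the name `demo_device_id` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash over a NUL-terminated canonical device descriptor.
   The descriptor format (e.g. "USART;MR@0x04;BRGR@0x20") is an assumption. */
static uint32_t demo_device_id(const char *canonical_descriptor)
{
    uint32_t hash = 2166136261u; /* FNV offset basis */
    const unsigned char *p;

    assert(canonical_descriptor != NULL);
    for (p = (const unsigned char *)canonical_descriptor; *p != 0u; p++) {
        hash ^= (uint32_t)*p;
        hash *= 16777619u;       /* FNV prime */
    }
    return hash;
}
```

A hash alone cannot rule out collisions, so a real protection mechanism would still compare the full descriptions when two identifiers match.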

The database can also be used during the creation of a new PFD file: the developer can search for devices and cores of the existing PFDs in the database and include them in the new PFD, which will reduce duplication and description time.

The database of MFDs will store descriptions of OS device modules and processing core modules. Like the database of PFDs, this database will also use unique identification of inserted MFDs. MFDs from the database can be used as MWEs for describing similar device modules. Existing modules can be reused in the module description, which will reduce description time.

Unique versions and ports of OS source code will be stored in the database of sources. The developer will be able to select platform-dependent code for the selected processor and to select source codes of modules based on the description of a module, because several versions of the module can be present.

7. Conclusions

The concept of the adaptation framework for embedded operating systems was presented in this paper. The framework will provide services for the developer of the embedded operating system. These services will help during the adaptation of the operating system to new processors. The adaptation time will be shorter and the adaptation complexity lower. So far, the formal description of the processor and the generator of the platform-dependent code have been developed. The generator produces the platform-dependent code in the C programming language. The next step in the work is the design of a module description tool that allows describing operating system modules.

Acknowledgements. This work was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic within the Research and Development Operational Program for the project: "University Science Park of STU Bratislava", ITMS 26240220084, co-funded by the European Regional Development Fund.

References

[1] Accellera Systems Initiative Inc. Open Core Protocol Specification, 2013.

[2] V. Avula. Adapting operating systems to embedded manycores: Scheduling and inter-process communication. Master's thesis, Uppsala universitet, 2014.

[3] L. Barello. AvrX Real Time Kernel, 2007. http://www.barello.net/avrx/.

[4] P. Chou, R. Ortega, and G. Borriello. Synthesis of the hardware/software interface in microcontroller-based systems. In Computer-Aided Design, 1992. ICCAD-92. Digest of Technical Papers., 1992 IEEE/ACM International Conference on, pages 488–495, Nov 1992.

[5] Z. Guo, A. Mitra, and W. Najjar. Automation of IP core interface generation for reconfigurable computing. In Int. Conference on Field Programmable Logic and Applications (FPL 2006), Madrid, Spain, page 6, Aug 2006.

[6] P. Levis and D. Gay. TinyOS Programming. Cambridge University Press, 2009.

[7] P. Ranganathan. From microprocessors to nanostores: Rethinking data-centric systems. Computer, 44(1):39–48, Jan 2011.

[8] Real Time Engineers Ltd. The FreeRTOS Project, 2015. http://www.freertos.org/.

[9] M. Seltzer and C. Small. Self-monitoring and self-adapting operating systems. In Operating Systems, 1997., The Sixth Workshop on Hot Topics in, pages 124–129, May 1997.

[10] M. Vojtko and T. Krajcovic. Adaptability of an Embedded Operating System: a Formal Description of a Processor. In 10th International Joint Conferences on Computer, Information, Systems Sciences, and Engineering, page 4, Dec. 2014. in print, http://fiit.stuba.sk/%7evojtko/VojtkoAoEOS.pdf.

[11] M. Vojtko and T. Krajcovic. Adaptability of an Embedded Operating System: a Generator of a Platform Dependent Code. In Cybernetics and Informatics (K&I), 28th International Conference on, page 6, Feb 2016.

[12] M. Vojtko and T. Krajcovic. Semi-automated process of adaptation of embedded operating systems. Journal of Electrical Engineering, page 10, 2016. in review process, http://fiit.stuba.sk/%7evojtko/VojtkoJEEEC.pdf.

[13] E. Walkup and G. Borriello. Automatic synthesis of device drivers for hardware/software co-design. Technical report, University of Washington, Department of Computer Science and Engineering, Seattle, Washington, Jun 1994.

Selected Papers by the Author

M. Vojtko, T. Krajcovic. Semi-Automated Process of Adaptation of Embedded Operating Systems. In Journal of Electrical Engineering, 2016. Sent for review.

M. Vojtko, T. Krajcovic. Adaptability of an Embedded Operating System: a Generator of a Platform Dependent Code. In 2016 Cybernetics & Informatics (K&I), Levoca, Slovakia, 2016, pp. 1-6.

M. Vojtko. Adaptability of embedded operating systems. In PESW 2015: proceedings of the 3rd embedded systems workshop, July 2015, pp. 1.

M. Vojtko, T. Krajcovic. Adaptability of an Embedded Operating System: a Formal Description of a Processor. In 10th International Joint Conferences on Computer, Information, Systems Sciences, and Engineering, 2014. Springer. In print.

M. Vojtko, T. Krajcovic. Prototype of Modular Operating System for embedded applications. In Applied Electronics (AE), 2013 International Conference on, Pilsen, Czech Republic, 2013, pp. 1-4.

Social Insect Inspired Algorithm to Detect and Track Topics in Dynamic Documents

Štefan Sabo∗



[email protected]

Abstract

In our work we present a novel approach to identification and tracking of news stories on the web. We utilize a set of social insect inspired agents to acquire news articles and subsequently analyse relationships between articles based on story words. Story words represent our concept for modelling terms relevant to news stories as a whole, instead of using keywords relevant only to a single document. We leverage behavioural patterns inspired by honey bees when foraging for food in order to design a self-adjusting and self-prioritizing mechanism that allows for a dynamic response to the changing news story landscape. Due to the independent nature of agents, the resulting system offers flexibility, scalability and distributivity while maintaining a high level of cooperation during identification and tracking of currently unfolding news stories.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—retrieval models, search process; I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence—multiagent systems; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

Keywords

beehive metaphor, kernel methods, keyword extraction, multi-agent systems, online learning, topic detection and tracking, web crawling

∗Recommended by thesis supervisor: Prof. Pavol Navrat. Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on September 29, 2016.

© Copyright 2016. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.

Sabo, Š. Social Insect Inspired Algorithm to Detect and Track Topics in Dynamic Documents. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 8, No. 2 (2016) 69-74

1. Introduction

When visiting a web based news portal, a human visitor browses a wide range of articles that cover current news stories, events and developments. Based on their content and context, the user can piece together a picture of what is currently going on and decide whether to pursue further reading or to do some research of their own. Our goal is to support this process by analysing the available web based articles and providing an overview of the events that are currently unfolding around us.

This synthesis of the larger picture is, however, not a simple process from an information processing point of view, due to the sheer complexity of the additional contextual information and knowledge that a human user is able to leverage when analysing the content of a given news article. In our work we take a different approach: instead of trying to analyse the implications of information contained within individual news articles, we propose a social insect inspired approach that utilizes a set of agents to determine relationships between articles using syntactic analysis. Based on the spatial distribution of these relationships within the article space, we are then able to establish, in a dynamic and flexible fashion, which articles comprise distinct news stories.

2. Article Acquisition and Analysis

The articles that we analyse in order to determine current news stories are extracted from the web. Therefore the first step is to acquire and process raw article documents. When designing our article acquisition mechanism, we consider the following aspects of the news domain:

Dynamics. One of the most important aspects of the news domain is the dynamics of the presented data. News stories are added on a daily, even hourly basis as new events unfold and it is imperative that these changes are accounted for. Therefore current articles need to be updated to reflect the most recent changes and new articles need to be retrieved as they become available.

Lack of structure. A news article generally represents unstructured textual information. Certain parts of an article, such as its title or individual paragraphs, may be identified through parsing of the document structure. However, the topic of a news article and the key figures, characters and events described within it are not readily available and need to be identified through deeper analysis of the plain text.


Succinct language. A web based source of news articles needs to convey information about the content of an article succinctly and unambiguously, so that a potential reader may quickly decide whether to visit the given article or not. Due to the limited amount of space available in the title of an article, the information about the article content needs to be compressed into a few words, at most a few short sentences. Therefore article titles will often contain named entities or unique terms that identify the news story being covered unambiguously enough for it to be easily recognized by an average reader.

2.1 Crawling Policies and Techniques

In order to acquire articles from the web we utilize a set of independent agents. Each agent acts as an advanced web crawler and is autonomously capable of traversing the web and acquiring articles, which are further processed and evaluated. In order to facilitate the interaction of agents with the web environment, each agent adopts the following web crawling policies and techniques.

Revisiting policy. Revisiting is a crucial policy identified by Castillo [4] as one of the main cornerstones of effective crawling. The revisiting policy determines how long an agent needs to wait before revisiting an article that has already been visited in the past. Our revisiting policy sets the minimum delay between successive visits to 30 minutes, as we generally do not assume that an article will be updated at a more frequent rate.

Parsing. Parsing of a retrieved document is done through automated parsing tools and aims to extract the article content along with the article title, paragraphs, hypertext links and optional time and location stamps.

Courtesy delays. Short delays between requests are implemented in order to alleviate the load that multiple successive requests place on a single web host. Our agents are set to observe a minimum courtesy delay of 5 seconds. In addition to saving resources, this also helps to control the pace of article acquisition during periods of high latency.
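The two timing policies above can be sketched as a small bookkeeping structure. This is a minimal illustration, not the implementation used in our system: only the 30 minute and 5 second values come from the text, while the class name `PolitenessTracker`, its methods and the example URLs are hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

REVISIT_DELAY_S = 30 * 60  # minimum delay between visits to one article (from the text)
COURTESY_DELAY_S = 5       # minimum delay between requests to one host (from the text)


@dataclass
class PolitenessTracker:
    """Hypothetical helper that remembers when articles and hosts were last requested."""
    last_article_visit: dict = field(default_factory=dict)  # url -> timestamp in seconds
    last_host_request: dict = field(default_factory=dict)   # host -> timestamp in seconds

    def seconds_until_allowed(self, url: str, now: float) -> float:
        """Return how long an agent still has to wait before requesting url."""
        host = urlsplit(url).netloc
        waits = [0.0]
        if url in self.last_article_visit:
            waits.append(self.last_article_visit[url] + REVISIT_DELAY_S - now)
        if host in self.last_host_request:
            waits.append(self.last_host_request[host] + COURTESY_DELAY_S - now)
        return max(waits)

    def record_visit(self, url: str, now: float) -> None:
        """Note that url (and therefore its host) was requested at time now."""
        self.last_article_visit[url] = now
        self.last_host_request[urlsplit(url).netloc] = now


tracker = PolitenessTracker()
tracker.record_visit("http://news.example/a", now=0.0)
# A different article on the same host is blocked only by the courtesy delay.
wait_same_host = tracker.seconds_until_allowed("http://news.example/b", now=2.0)
# The same article is blocked by the revisiting policy.
wait_same_article = tracker.seconds_until_allowed("http://news.example/a", now=10.0)
```

Keeping the two delays in one function means an agent needs a single check before each request, and the longer of the two waits always wins.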

2.2 Story Word Extraction

During the article acquisition process, the articles are analysed by agents in order to identify relationships between articles that share topical similarities. For this purpose we have proposed the concept of story words, which represent specific terms linked to news stories. Story words may be either existing terms that are linked to a particular story, such as Watergate, or completely new terms, such as the more recent Brexit. The only requirement is that each story word represents a given news story at a given time.

Using the concept of story words we are able to break down news stories into specific aspects which we can track using our set of agents. Each agent is assigned a specific story word candidate and proceeds to evaluate the content of articles in order to determine the popularity and relevance of its current story word candidate. The aim of this process is to determine the most prominent story words and to map out relationships between individual articles based on these story words. However, the dimensionality of the article-story word space is generally too high to evaluate all possible combinations of articles and story word candidates. Therefore we utilize a bio inspired mechanism to focus the evaluation effort onto the most promising story word candidates and articles.

3. Honey Bee Inspired Mechanism of Agent Coordination

In order to provide a coordination mechanism that allows the agents to prioritize combinations of articles and story words, we look for inspiration in nature. A similar problem of prioritizing sources in a decentralized environment is faced by honey bees (Apis mellifera) when deciding which food sources to visit while foraging. Honey bees have developed a communication mechanism that allows them to share information about the most suitable food sources through their typical waggle dance.

The waggle dance is a specific movement pattern of a bee that encodes information about the direction of and distance to a food source. If a bee finds a suitable source of food, it may choose to promote it by engaging in a waggle dance. If a bee returns from a foraging trip without having found a suitable source, it may adopt a different source by observing dancing bees. Thus good sources of food are promoted and foraged from, while unsuitable sources are abandoned.

3.1 Allocation of Agents

The idea of modelling an agent coordination mechanism after the waggle dance based coordination of honey bees was first utilized in the Artificial Bee Colony (ABC) model proposed by Karaboga [5] for numeric optimization. This model was further enhanced by Navrat [6], who adjusted it for web based document retrieval. This enhanced version of the ABC model is known as the Beehive Metaphor and it is also the model upon which our work is based. A general overview of the Beehive Metaphor is given in Figure 1.

Based on the quality of its current source, each agent performs one task at a time: foraging, dancing or observing. Unlike in the original Beehive Metaphor model, our agents do not carry a document. Instead, each agent is assigned a story word candidate to evaluate. A story word candidate is a term which potentially represents a certain news story. The task of an agent is to determine whether a story word candidate represents a viable story word.

In order to determine the viability of a story word candidate, agents visit multiple articles and evaluate the relevance of the currently carried story word to the visited articles. Depending on the relevance of its current story word, each agent adopts either the foraging, dancing or observing task in accordance with the Beehive Metaphor model. Foraging is the basic task, in which an agent evaluates its current story word candidate. While dancing, an agent propagates its source so that other agents may adopt it. The observing task is adopted if an agent deems its current story word candidate unsuitable; the agent then selects a new candidate from the pool of candidates currently being propagated by dancing agents.
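The choice between the three tasks can be illustrated with a short sketch. It rests on several assumptions that the text does not specify: relevance is taken to be a score between 0 and 1, the two thresholds are arbitrary illustrative values, and an observing agent is assumed to pick a promoted candidate with probability proportional to its advertised relevance. All names are hypothetical.

```python
import random
from enum import Enum


class Task(Enum):
    FORAGING = "foraging"    # keep evaluating the current story word candidate
    DANCING = "dancing"      # promote the candidate to other agents
    OBSERVING = "observing"  # abandon the candidate and look for a new one


def next_task(relevance: float,
              dance_threshold: float = 0.7,
              abandon_threshold: float = 0.2) -> Task:
    """Select the next task from the relevance of the carried candidate.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if relevance >= dance_threshold:
        return Task.DANCING
    if relevance < abandon_threshold:
        return Task.OBSERVING
    return Task.FORAGING


def adopt_candidate(dance_floor: dict, rng: random.Random) -> str:
    """Pick a new candidate among those promoted by dancing agents.

    dance_floor maps each promoted candidate to its advertised relevance;
    better candidates are assumed to be proportionally more likely to be adopted.
    """
    candidates = list(dance_floor)
    weights = [dance_floor[candidate] for candidate in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
task = next_task(0.1)  # a weak candidate sends the agent to observe
if task is Task.OBSERVING:
    new_candidate = adopt_candidate({"Brexit": 0.9, "Watergate": 0.75}, rng)
```

Because every decision uses only the agent's own relevance estimate and the shared pool of promoted candidates, no central scheduler is needed in this sketch.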

3.2 Identification of Relationships

The main goal of article analysis is to determine which story words are relevant to the currently ongoing news stories. In order for a story word to be considered suitable, it needs not only to be relevant to a single article, but to connect multiple articles forming a given story line.


[Diagram omitted. Labels recoverable from the diagram: User query; Hive, containing Dispatch room, Dance floor and Auditorium; Source base with source1, ..., sourcei, ..., sourcej, ..., sourceM; decision points Dance?, Leave? and Follow?.]

Figure 1: Decision making process of an agent as described in Beehive Metaphor [6].

Therefore our mechanism of story word identification aims to find relationships between articles based on the evaluated story words.

If an agent discovers that its current story word is relevant to multiple articles, it will evaluate the given story word positively and may establish a relationship between the relevant articles based on its current story word. By recording relationships between articles we are able to map out the explored parts of the article space and determine the closeness of given articles, which can in turn be used to determine the shape of currently ongoing news stories.

3.3 Advantages of Social Insect Inspired Coordination

Utilizing a set of honey bee inspired agents in order to determine the story words most relevant to the currently ongoing stories is a novel approach that has a few specific advantages over traditional topic detection and tracking methods.

Dynamic story tracking capability. The ability to dynamically track news stories represents the most important characteristic of our news story tracking system, as it allows us to track news stories in real time as they develop. The key feature of our system that enables this is the social insect inspired behaviour of the agent swarm. Instead of treating the web as a static repository of articles, we are able to maintain a persistent presence on the web and acquire articles as they emerge. Although it is not feasible to track changes in every single article, the self prioritizing capability of our agent swarm allows us to focus the exploration effort around the most relevant articles, where the actual development takes place.

Iterative article evaluation. Iterative evaluation of articles is closely related to the dynamic story tracking capability of our approach. It refers to the ability of our approach to process new articles iteratively as they emerge. There are two general ways to extract a set of story words relevant to a set of articles. Batch algorithms such as probabilistic latent semantic analysis [3] or latent Dirichlet allocation [2] treat the set of articles as closed, with classification analysing the whole set in each run. This process is costly, therefore the rate at which new articles may be introduced and evaluated is limited. Through the decomposition of individual news stories into a set of story words that may be tracked independently and individually for each article, our approach offers an iterative processing capability. Therefore we are able to process each article as it emerges and incorporate it into the whole news story landscape without having to reassess the whole article set.

No learning required. With our approach no learning or supervision is needed in order to classify articles according to the news stories they cover. The reason lies in the way news stories are decomposed into sets of related story words. If we had represented news stories as latent distributions over terms from a vocabulary, we would need to provide a set of training examples in order to determine the structure of each news story. However, our representation of news stories through a set of story words enables us to move away from complex analysis of the whole story and instead focus on tracking individual story words. This is a more straightforward task, as we only need to evaluate the relevance of a story word candidate to an article, which can be done by analysing the content of the given article without previous training.

Scalability and distributivity. The final advantage of our approach lies in its scalability and distributivity, both of which are supported by the fact that the crucial step of article acquisition and evaluation may be performed in a decentralized manner by a set of independent agents. There are no global decisions or dependencies that would serve as bottlenecks in the communication scheme of the proposed system. Each agent is capable of performing its tasks individually and independently, thus no system overhead is added even after introducing new agents into the system. The decisions of an agent are based on its local attributes, such as the story word it carries or the content of the article it visits. This allows our proposed approach to be deployed in a flexible system with loosely coupled individual modules. Resources may be added or removed dynamically through the introduction of new agents or the removal of existing ones without compromising the integrity of the article evaluation process.

Figure 2: Graph representation of the article space. Nodes are coloured according to their type. Green colour represents visited article nodes, red colour represents unvisited article nodes and blue colour represents story word nodes.

4. Article Space Model

The result of the article acquisition and evaluation process is a model of the article space that includes articles, hyperlinks, identified story words and their respective relationships to related articles. In order to capture relationships between individual acquired entities we have elected to model the article space as a single large graph structure.

4.1 Graph Representation of Article Relationships

There are two types of nodes in the article graph: articles and story words. Nodes are interconnected by two types of edges: hyperlinks and relationships. A hyperlink edge connects two articles that are also interconnected by a hyperlink on the web. A relationship edge connects either an article to a story word node, if the particular story word is related to the article, or two articles, if a connection between them has been established on the basis of a certain story word. Each relationship edge is weighted according to the confidence of the given relationship. An example of an article graph is given in Figure 2.
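The node and edge types described above might be represented as in the following sketch. It is an illustration only: the class and function names are hypothetical, identifying articles by URL is an assumption, and so is the rule that a repeated relationship keeps the highest confidence seen so far.

```python
from dataclasses import dataclass, field


def article(url: str) -> tuple:
    """Node key for an article, identified here by its URL (an assumption)."""
    return ("article", url)


def story_word(term: str) -> tuple:
    """Node key for a story word."""
    return ("story_word", term)


@dataclass
class ArticleGraph:
    nodes: set = field(default_factory=set)
    hyperlinks: set = field(default_factory=set)        # unweighted article-article edges
    relationships: dict = field(default_factory=dict)   # edge -> confidence weight

    def add_hyperlink(self, url_a: str, url_b: str) -> None:
        """Connect two articles that link to each other on the web."""
        a, b = article(url_a), article(url_b)
        self.nodes.update((a, b))
        self.hyperlinks.add(frozenset((a, b)))

    def add_relationship(self, node_a: tuple, node_b: tuple, confidence: float) -> None:
        """Record a weighted relationship edge, keeping the highest confidence seen."""
        self.nodes.update((node_a, node_b))
        edge = frozenset((node_a, node_b))
        self.relationships[edge] = max(confidence, self.relationships.get(edge, 0.0))

    def connect_via_story_word(self, url_a: str, url_b: str,
                               term: str, confidence: float) -> None:
        """Relate two articles to a shared story word and to each other."""
        word = story_word(term)
        for url in (url_a, url_b):
            self.add_relationship(article(url), word, confidence)
        self.add_relationship(article(url_a), article(url_b), confidence)


graph = ArticleGraph()
graph.add_hyperlink("http://news.example/a", "http://news.example/b")
graph.connect_via_story_word("http://news.example/a", "http://news.example/c",
                             "Brexit", confidence=0.6)
```

Storing undirected edges as frozensets keeps an edge and its reverse from being recorded twice, which matters when several agents report the same relationship.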

4.2 Louvain Algorithm Based News Story Extraction

Although the article graph structure provides us with an overview of articles and their respective relationships, it is by itself not sufficient to provide insight into the ongoing news stories. In order to extract news story information we need to transform the identified relationships into a complex news story structure.

[Graph image omitted. Story word labels visible in the figure include, among others, Ebola, Liberia, Ukraine, Poroshenko, sanction, Kobani, Syria, airstrike, Hong Kong and protester.]

Figure 3: Article graph with identified stories. Nodes are colour coded according to their news story affiliation and story words are labelled.

For this purpose we utilize the Louvain graph algorithm [1] to detect modules within the article graph. Using the Louvain algorithm we partition the graph into subgraphs called modules in such a way as to maximize the number of connections between nodes within individual modules and to minimize the number of connections between nodes in different modules.

In general we assume that if two articles cover the same news story, their probability of sharing relevance to a common story word is higher than for two random articles that cover different news stories. Therefore, if we partition a graph into modules in such a way as to minimize the number of connections between nodes in different modules, we maximize the probability that articles within the same module cover the same news story. The resulting modules represent the news stories identified by our approach. Each story therefore consists of a set of articles along with a set of related story words and the corresponding relationships amongst them. An example of a resulting graph with nodes colour coded according to their affiliated news story is given in Figure 3.
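The quantity that the Louvain algorithm [1] optimizes is the modularity of a partition. The sketch below does not implement the Louvain algorithm itself; it only computes the modularity of a given partition of a small weighted graph, to show that grouping nodes by story scores higher than mixing them. The toy graph, node names and function name are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected weighted graph.

    edges: list of (node_u, node_v, weight) tuples, each edge listed once.
    community_of: dict mapping every node to its community label.
    """
    total = sum(weight for _, _, weight in edges)  # total edge weight m
    inside = defaultdict(float)  # weight of edges with both ends in a community
    degree = defaultdict(float)  # summed weighted degree of a community
    for u, v, weight in edges:
        degree[community_of[u]] += weight
        degree[community_of[v]] += weight
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += weight
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


# Two tightly connected groups (articles plus a story word) joined by one edge.
edges = [
    ("art1", "art2", 1.0), ("art1", "Ebola", 1.0), ("art2", "Ebola", 1.0),
    ("art3", "art4", 1.0), ("art3", "Ukraine", 1.0), ("art4", "Ukraine", 1.0),
    ("art2", "art3", 1.0),
]
by_story = {"art1": 0, "art2": 0, "Ebola": 0, "art3": 1, "art4": 1, "Ukraine": 1}
mixed = {"art1": 0, "art2": 1, "Ebola": 0, "art3": 0, "art4": 1, "Ukraine": 1}

q_by_story = modularity(edges, by_story)
q_mixed = modularity(edges, mixed)
```

Here the partition that follows the two groups obtains the higher score, which is the property the module detection step relies on when it maps modules to news stories.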

5. Contributions and Research Goals

Our approach makes two main contributions in relation to the current state of the art. They map onto our two research goals: a topic representation suitable for swarm based extraction, and scalability of an insect inspired topic detection and tracking approach.

5.1 Topic Representation Suitable for Swarm Based Extraction

The first contribution is the representation of a news story suitable for evaluation by a set of independent agents. In topic detection and tracking a topic is generally viewed as a probabilistic distribution over a set of terms, which can be sampled and evaluated as a single abstract concept. We present a different view, with the topics of news stories decomposed into multiple story related terms which we call story words. This provides us with a model of a story decomposed into multiple aspects that can be tracked and evaluated independently, while maintaining an accuracy of story detection comparable to batch processing algorithms, as shown by our experiments [7][8][9][10]. This model also provides a richer representation of the news story semantics, as the underlying relationships between individual story words allow us to explore the various aspects of the story on a more tangible level than with traditional latent stories represented by probabilistic models.

5.2 Scalability of Insect Inspired TDT Approach

The second contribution of our approach is the proposal of a decentralized self prioritizing algorithm for classification tasks within non parametric or semi parametric spaces. Standard online methods for classification in non parametric or semi parametric spaces share the common feature of being computationally dependent on the number of individual data points in the given space. The general notion of our approach is similar to kernel based methods in that we survey the space by performing one to one comparisons of individual data points in order to establish their relative distance.

However, our social insect inspired approach introduces a self prioritizing capability that allows us to determine the most relevant data points and prioritize their comparisons over the others, while at the same time rejecting comparisons of data points that are deemed irrelevant. In addition to prioritizing the search of the document space we are also able to control the rate at which the search occurs by changing the number of involved agents. This provides a scalable solution which addresses the common sensitivity of kernel methods to the number of available data points.
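A simple count may help to make this dependence concrete. It assumes, as a simplification not stated above, that an exhaustive survey compares every unordered pair of the n data points exactly once. The symbols n, k and c are introduced here only for illustration.

```latex
C_{\text{all}}(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
C_{\text{all}}(1000) = \frac{1000 \cdot 999}{2} = 499\,500 .
```

Under the same simplified reading, k agents that each perform c comparisons per unit of time carry out k·c comparisons per unit of time regardless of n, so the pace of the search is set by the number of agents rather than by the size of the article set.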

6. Conclusion

In our work we have developed a novel approach to news story identification and tracking using a set of independent agents inspired by social insects. The social insect inspired model of agent behaviour allows for a dynamic self prioritizing system that identifies and tracks the most relevant parts of news stories in time. The advantages of such an approach lie in its ability to dynamically identify relationships between news articles through evaluation of their relevance to common story related terms called story words.

By utilizing a honey bee inspired mechanism of coordination, our agents are able to determine the most promising story words to evaluate and thus focus on the most relevant part of the article space. This allows for a dynamic and flexible approach that detects news stories as they evolve. Furthermore, the story word based representation of topics allows for fine grained tracking of individual aspects of each news story. Through the decomposition of topics into interconnected terms we are able to deploy a method similar to traditional kernel based approaches that circumvents the high dimensionality of the news story space. The scalability drawbacks of kernel methods when processing a high number of inputs are addressed through a social insect inspired coordination mechanism that allows us to prioritize which articles and which story words to analyse first. The underlying scheme is general and thus it is our hope that it may find further applications in the analysis of high-dimensional or non-parametric spaces beyond news story identification.

Acknowledgements. This work was partially supported by the Slovak Research and Development Agency under the contracts Nos. APVV-0208-10 and APVV-15-0508 and the Scientific Grant Agency of Slovak Republic, grant No. VG 1/0752/14.

References

[1] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. J. of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[2] L. Bolelli, Ş. Ertekin, and C. Giles. Topic and trend detection in text collections using latent dirichlet allocation. In M. Boughanem, C. Berrut, J. Mothe, and C. Soule-Dupuy, editors, Advances in Information Retrieval, volume 5478 of Lecture Notes in Computer Science, pages 776–780. Springer Berlin / Heidelberg, 2009.

[3] T. Brants, F. Chen, and I. Tsochantaridis. Topic-based document segmentation with probabilistic latent semantic analysis. In Proc. of the eleventh int. conf. on Information and knowledge management, CIKM '02, pages 211–218, New York, NY, USA, 2002. ACM.

[4] C. Castillo. Effective web crawling. SIGIR Forum, 39(1):55–56, June 2005.

[5] D. Karaboga. An idea based on Honey Bee Swarm for Numerical Optimization. Technical Report TR06, Erciyes University, Oct. 2005.

[6] P. Navrat. Bee hive metaphor for web search. Communication and Cognition-Artificial Intelligence, 23(1-4):15–20, 2006.

[7] P. Navrat and S. Sabo. What's going on out there right now? A beehive based machine to give snapshot of the ongoing stories on the web. In Nature and Biologically Inspired Computing (NaBIC), 2012 Fourth World Congress on, pages 168–174, Nov. 2012.

[8] P. Navrat and S. Sabo. Beehive based machine to give snapshot of the ongoing stories on the web. Transactions on Computational Science XXI, Special Issue on Innovations in Nature-Inspired Computing and Applications, pages 296–314, 2013.

[9] S. Sabo, A. Kovarova, and P. Navrat. Multiple developing news stories identified and tracked by social insects and visualized using the new galactic streams and concurrent streams metaphors. Int. J. of Hybrid Intelligent Systems, 12:27–39, 2015.

[10] S. Sabo and P. Navrat. Social insect inspired approach for identification and dynamic tracking of news stories on the web. In Nature and Biologically Inspired Computing (NaBIC), 2013 World Congress on, pages 226–231. IEEE, 2013.

Selected Papers by the Author

S. Sabo, P. Navrat. Bee Inspired Detecting and Tracking of Currently Developing News Stories From the Web. [submitted to] In: Int. J. of Bio-Inspired Computation, Inderscience Publishers, 2016.

S. Sabo, P. Navrat. Social insect inspired approach for identification and dynamic tracking of news stories on the Web. In: Nature and Biologically Inspired Computing (NaBIC), 2013, World Congress on, IEEE, 2013, pp. 226–231.

P. Navrat, S. Sabo. What's going on out there right now? A beehive based machine to give snapshot of the ongoing stories on the Web. In: Nature and Biologically Inspired Computing (NaBIC), 2012 Fourth World Congress on, 2012.

P. Navrat, S. Sabo. Beehive Based Machine to Give Snapshot of the Ongoing Stories on the Web. Transactions on Computational Science XXI, Special Issue on Innovations in Nature-Inspired Computing and Applications, 2013, pp. 296–314.

S. Sabo, A. Kovarova, P. Navrat. Multiple developing news stories identified and tracked by social insects and visualized using the new galactic streams and concurrent streams metaphors. Int. J. of Hybrid Intelligent Systems, 2015, vol. 12, pp. 27–39.

Š. Sabo. Dynamic Detection and Tracking of Stories in News Articles from the Web [in Slovak]. In: WIKT 2013, 8th Workshop on Intelligent and Knowledge oriented Technologies, Centre for Information Technologies, 2013, pp. 167–171.

A. Kovárová, Š. Sabo. Visualization of News Articles Identified by Bees [in Slovak]. In: WIKT 2013, 8th Workshop on Intelligent and Knowledge oriented Technologies, Centre for Information Technologies, 2013, pp. 35–40.

S. Sabo. Beehive Metaphor Inspired Web Crawler. In: 8th Student Research Conf. in Informatics and Information Technologies Bratislava, Nakladatel'stvo STU, 2012, pp. 249–254.

P. Návrat, Š. Sabo. Determining Keywords for Unfolding Stories Using Swarm of Social Agents [in Slovak]. In: WIKT 2012, 7th Workshop on Intelligent and Knowledge oriented Technologies, Nakladatel'stvo STU, 2012, pp. 37–40.

Š. Sabo. Tracking of a Story on the Web Using Multi-Agent System Inspired by Social Behaviour of Bees [in Slovak]. In: WIKT 2011, Proc. 6th Workshop on Intelligent and Knowledge oriented Technologies, Košice, Technická Univerzita, 2011, pp. 167–171.

Instructions to the authors

Publishing procedure

All contributions are web-published. A contribution is published without unnecessary delay right after it has been accepted. Contributions are published on the fly in the current issue. It is at the discretion of the Editor-in-chief to determine when the current issue is closed and a subsequent new one is open. There will be at least two issues in a year but it is left up to the Editor-in-chief to adjust periodicity of the Bulletin to actual needs.

Extended abstracts of theses are the primary type of article in the Bulletin. Each extended abstract will be annotated by identifying the thesis supervisor, who must recommend it for publication and acts for the Editorial Board in a role similar to that of a reviewer. We offer publishing extended abstracts on the Bulletin's web before the thesis is defended. This preliminary publishing is a specific service to the academic community. As soon as we learn about successful defence, the extended abstract gains the status of accepted paper and will be included in the forthcoming issue. The accepted paper will be annotated with the date of successful defence and the name of the institution where the defence took place.

It is the policy of the Bulletin to offer free access to all its articles on the web. Moreover, the publisher will seek opportunities to promote the widest possible access to and/or indexing of the articles. All the past issues remain accessible on the web as part of the web portal of the Bulletin. Closed issues will also be made available in a printable form, free for downloading and printing by anyone interested.

Policy on Originality

It is the policy of the Bulletin that Slovak University of Technology be the sole, original publisher of articles. Manuscripts that have been submitted simultaneously to other magazines, journals or to conferences, symposia, or workshops without the prior written consent of the Editor-in-Chief will be rejected outright and will not be reconsidered. Publication of expanded versions of papers that have been disseminated via proceedings or newsletters is permitted only if the Editor-in-Chief judges that there is significant additional benefit to be gained from journal publication. A conference chairperson can arrange with the Editor-in-Chief to publish selected papers from conferences, symposia, and workshops, after suitable reviewing. The papers must meet the editorial requirements for research articles. Acknowledgement of the originating conference will appear as a credit when the paper is published in the Bulletin.

Manuscript information for extended abstracts of doctoral dissertations

All contributions are submitted electronically. Send your manuscript as LaTeX sources and .pdf files by e-mail to [email protected]. The paper's length should be 6-12 pages. Please use the LaTeX style, which is available to download at the bulletin web page http://slovakia.acm.org/bulletin/.

Some remarks on the provided style:

• Headings and Abstract
The heading must contain the title, full name, and address of the author(s), thesis supervisor, abstract of about 100-200 words.

• Categories and Subject Descriptors
Define category and subject descriptors according to ACM Computing Classification System (see http://www.acm.org/about/class/1998/).

• Keywords
Please specify 5 to 10 keywords.

• Equations
You may want to display math equations in three distinct styles: inline, numbered or non-numbered display (we recommend the numbered style). Please make sure that your equations are clearly formulated and described.

• Figures and tables
Figures and tables cannot be split across pages; the best placement for them is typically the top or the bottom of the page nearest their initial cite. To ensure this proper "floating" placement of figures/tables, use the environment figure/table to enclose the figure and its caption.

• References
Please use BibTeX to automatically produce the bibliography. If possible use abbreviations like: Proceedings – Proc., International – Int., Conference – Conf., Journal – J.

• Selected papers by the author
This section is used for thesis extended abstracts. Please write down all publications which are related to your thesis.

Information Sciences and Technologies Bulletin of the ACM Slovakia

December 2015, Volume 7, Number 2

R. Krakovský: Processing of Information in Multidimensional Data Space by Projective ART Neural Network

D. Macko: Contribution to System-Level Design and Verification of Low-Power Digital Systems

Š. Krištofík: A Contribution Towards Architectures and Algorithms for Self Repair of RAMs

L. Clementis: Study of Game Strategy Emergence by Using Neural Networks
