
[IEEE 2009 5th International Symposium on Applied Computational Intelligence and Informatics (SACI) - Timisoara, Romania (2009.05.28–2009.05.29)]

978-1-4244-4478-6/09/$25.00 ©2009 IEEE

5th International Symposium on Applied Computational Intelligence and Informatics • May 28–29, 2009 – Timişoara, Romania

Unsupervised Exploration of Scientific Articles

Anton Alin-Adrian, Crețu Vladimir-Ioan

"Politehnica" University of Timișoara
Faculty of Automation and Computers
2nd Vasile Pârvan Ave.
Email: [email protected] and [email protected]

Abstract—Unsupervised data exploration techniques are used for extracting relational and structural information from masses of data. In this paper we explore a collection of ACM transactions and IEEE conferences related by the subject of high performance distributed computing, with a scientific computing flavor, in order to understand how they relate to each other. The interpreted result is shown to be reasonable and is defined by groups of conferences and transactions which can be scaled by their degree of abstractness and physical realization.

I. INTRODUCTION

The exploration of unorganized information for extracting knowledge with the help of computers introduces a new and complementary approach for taxonomical studies, trend surveys and many other scenarios. In 1973 the very same methods were applied by Sneath and Sokal in the field of biology in order to help understand the relations between species and discover the evolutionary process [1].

Most data exploration studies refine new methods with datasets from bioinformatics and marketing, with the latter mainly focusing on World Wide Web analysis. In 2006 Yi Peng et al. used similar techniques for clustering 1400 scientific articles within the interdisciplinary field of data mining [2].

We approached the broad spectrum of distributed computing and borrowed the clustering methods in order to discover reasonable relations between the scientific articles. The roughly 6000 articles (5946, precisely) belong to 16 conference proceedings and transactions, all of which are connected at various levels by the field of high performance distributed computing. The articles sum up to 4.5 gigabytes of disk data.

The input needs to be described as a collection of classes with a shared pool of attributes. The classes themselves are to be grouped into clusters according to their attribute characteristics.

A matrix data structure perfectly fits the requirements. Such a matrix assigns a row to each class and a column to each attribute. The representation is called a measurement matrix.

Another approach, sometimes transparently handled by the software, is to use a dissimilarity matrix. This representation benefits from matrix symmetry, and thus only half of the square matrix is necessary:

A = (a_{ij}),  A = A^T,  a_{ij} = a_{ji}   (1)


The attribute values need not be numerical; nominal values can easily be converted into numerical ones. Missing attributes are sometimes important in differentiating the clusters, but frequently trigger implementation shortcomings and require workarounds.

The dissimilarity matrix itself contains distances between the class objects according to various attributes. It can be computed from the measurement matrix, or it can provide the means to explore unmeasured data based on subjective assumptions. Dissimilarity minimization algorithms are more robust to errors and extremes because they do not tend to agglomerate real values on the basis of similarities.

We chose a combination of two standard methods for exploring the scientific articles. Hierarchical clustering is used for identifying parentship and containment relations, and partitioning around medoids is used to validate the result and show that the findings are reasonable and natural.

The implementation section describes the process of measuring the attributes for each class and the assembling of the input matrix; afterwards we present the third-party software packages invoked as tools in the experiment.

In the "Experimental Results" section we present the output of the two methods and discuss arguments supporting the findings, introducing a reasonable explanation. The "Conclusion" ends the discussion, proposing future approaches with improved detail and accuracy, and emphasizes the originality and importance of such strategies for better understanding interdisciplinary progress.

II. PROPOSED METHOD

A. Partitioning around medoids

Clustering by means of medoids was introduced by Kaufman and Rousseeuw [3]. A medoid is a representative object of a cluster whose average dissimilarity (distance) to all the objects in the group is minimal. The optimization model was proposed by Vinod [4] and is described by equation (3) subject to (4), (5), (6) and (7).

Let the set of objects be denoted by X:

X = {x_1, x_2, ..., x_n}   (2)

and the dissimilarity between objects i and j described by d(i, j), where i and j are the indexes of the objects x_i and x_j.

min Σ_i Σ_j d(i, j) z_{ij}   (3)

Page 2: [IEEE 2009 5th International Symposium on Applied Computational Intelligence and Informatics (SACI) - Timisoara, Romania (2009.05.28-2009.05.29)] 2009 5th International Symposium on

Anton Alin-Adrian, Crețu Vladimir-Ioan • Unsupervised Exploration of Scientific Articles

540

Σ_i z_{ij} = 1,   j = 1, 2, ..., n   (4)

z_{ij} ≤ y_i,   i, j = 1, 2, ..., n   (5)

Σ_i y_i = k,   k = number of clusters   (6)

y_i, z_{ij} ∈ {0, 1},   i, j = 1, 2, ..., n   (7)

Constraints (4) and (7) imply that for a given j exactly one of the z_{ij} is equal to 1 and all the others are 0. Constraint (6) expresses that there are k objects to be chosen as representatives, while (5) ensures that objects can only be assigned to representative objects.

According to (4), the dissimilarity between an object j and its representative object is given by

Σ_i d(i, j) z_{ij}   (8)

and because all objects must be assigned, the total dissimilarity is given by

Σ_i Σ_j d(i, j) z_{ij}   (9)

which is the function that needs to be minimized as a solution to the partitioning problem.

The results vary with different values of k, the number of clusters, and the method itself does not provide a way of finding the optimal value.
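The optimization in (3)-(7) is hard to solve exactly, so PAM proceeds heuristically: a greedy build phase followed by swap improvements. The following Python sketch illustrates the idea under those assumptions; it is not the authors' implementation (the paper relies on R's pam) and the function names are ours.

```python
def pam(dist, k):
    """Partitioning Around Medoids on a precomputed dissimilarity matrix.

    dist -- symmetric n x n matrix (list of lists), dist[i][j] = d(i, j)
    k    -- the number of clusters
    Returns the sorted list of medoid indexes.
    """
    n = len(dist)

    def total_cost(medoids):
        # Objective (3): each object contributes the dissimilarity to its
        # nearest medoid, which encodes constraints (4) and (5).
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    # BUILD phase: greedily select the k representatives (constraint (6)).
    medoids = [min(range(n), key=lambda m: total_cost([m]))]
    while len(medoids) < k:
        medoids.append(min((m for m in range(n) if m not in medoids),
                           key=lambda m: total_cost(medoids + [m])))

    # SWAP phase: exchange a medoid with a non-medoid while the cost drops.
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [x for x in medoids if x != m] + [h]
                if total_cost(candidate) < total_cost(medoids):
                    medoids = candidate
                    improved = True
                    break
            if improved:
                break
    return sorted(medoids)
```

On two well-separated groups on a line, e.g. objects at 0, 1, 2 and 10, 11, 12 with absolute-difference dissimilarities, the sketch selects the middle object of each group as medoid.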

B. Hierarchical clustering

The single linkage hierarchical clustering method is described by Fionn Murtagh. Because of symmetry, only half of the dissimilarity matrix is necessary [5].

Input: A set of n(n − 1)/2 dissimilarities

Step 1: Determine the smallest dissimilarity d_{ik}.

Step 2: Agglomerate objects i and k by replacement:

d_{(i∪k)j} = min(d_{ij}, d_{kj})   (10)

Delete d_{ij} and d_{kj} for all j, as they are no longer needed.

Step 3: Loop to Step 1 while at least two objects remain.

Fig. 1. Hierarchical algorithm

The algorithm requires O(n^2) time: O(n) for Step 2, repeated for the total of (n − 1) agglomerations.

In hierarchical clustering the objects are agglomerated one by one, merging similar classes until the final group contains the initial data set.
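The algorithm of Figure 1 can be transcribed almost literally. The sketch below is illustrative Python (the experiment itself used R's hclust): clusters are frozensets, and the replacement update is exactly equation (10).

```python
def single_linkage(dist):
    """Single-linkage agglomeration following the algorithm of Fig. 1.

    dist -- symmetric n x n matrix of dissimilarities (list of lists)
    Returns the merge history as (cluster_a, cluster_b, height) tuples.
    """
    n = len(dist)
    clusters = [frozenset([i]) for i in range(n)]
    # Pairwise dissimilarities, keyed by the (unordered) pair of clusters.
    d = {frozenset((clusters[i], clusters[j])): dist[i][j]
         for i in range(n) for j in range(i + 1, n)}

    merges = []
    while len(clusters) > 1:
        # Step 1: find the smallest dissimilarity d_ik.
        pair = min(d, key=d.get)
        a, b = tuple(pair)
        height = d.pop(pair)
        merged = a | b
        # Step 2: replacement update d_{(i∪k)j} = min(d_ij, d_kj),
        # deleting the now useless entries d_ij and d_kj.
        for c in clusters:
            if c == a or c == b:
                continue
            d[frozenset((merged, c))] = min(d.pop(frozenset((a, c))),
                                            d.pop(frozenset((b, c))))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((set(a), set(b), height))
    return merges
```

The merge heights form the dendrogram scale: plotting each merge at its height reproduces a diagram of the kind shown in Figure 7.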

III. IMPLEMENTATION

A. Preprocessing

The initial data set consists of the 16 IEEE and ACM conferences and transactions listed in Figure 3. We used home-grown scripts in order to extract the text and split it into words according to the regular expression in Figure 2.

Each conference, journal or transaction represents a class. For each word, the frequency of appearance in each of the 16 classes had to be normalized according to the total number of words present in the class:

f_relative = f / wordcounter   (11)

/(\w['\w-]*)/g

Fig. 2. Word splitting regular expression

Due to software limitations and practical concerns, the missing attributes were simply set to a frequency of 0.0.
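The preprocessing step, i.e. the regular expression of Figure 2 together with the normalization of equation (11), can be sketched in a few lines. The original used home-grown scripts, so the Python below and its names are purely illustrative.

```python
import re
from collections import Counter

# Python form of the Perl-style expression /(\w['\w-]*)/g from Fig. 2.
WORD_RE = re.compile(r"\w['\w-]*")

def relative_frequencies(text):
    """Split extracted text into words and normalize each word's count
    by the total word count of the class, as in equation (11)."""
    words = WORD_RE.findall(text)
    total = len(words)          # the class's wordcounter
    return {w: c / total for w, c in Counter(words).items()}
```

Applied to the extracted text of one conference, the returned dictionary is one row of the measurement matrix (missing words are filled with 0.0 later).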

CC [6]          Computer clusters
CCGRID [7]      Computer clusters and the grid
CSE08 [8]       Computer Science
Challenges [9]  Challenges in Distributed Computing
ChinaGRID [10]  Grids and clustering
HPDC [11]-[17]  HP Distributed Computing
HPCSAGE [18]    HPC Applications
JEA [19]        Journal of Experimental Algorithmics
LOPLAS [20]     Letters on Programming Languages
SC [21]         The Supercomputing Conference
TACO [22]       Transactions on Computer Architecture
TALG [23]       Algorithms
TOCL [24]       Computational Logic
TOCS [25]       Computer Systems
TOMACS [26]     Modelling and Computer Simulation
TOMS [27]       Mathematical Software

Fig. 3. Conferences and transactions

The final dataset represents a matrix of 16 classes (one class per row), with their attributes set as relative frequencies as in Table I.

Class Name   Word_1   ...   Word_n
CC           0.098    ...   0.111
...          ...      ...   ...
TOMS         0.102    ...   0.104

TABLE I
MEASUREMENT MATRIX

Each word entry represents a unique column in the measurement matrix, and we had to develop custom

2 http://www.ieee.org
3 http://www.acm.org


software in order to properly format it. Because the implementation of our matrix assembler in Figure 4 limits the total number of words to MAXCOLS, n, the number of unique words, is reduced to 18650. The relative frequency converges to zero faster.

#define MAXCOLS 5000
/* ... */
buildline("ACM_CC.txt.pdt", 1);
buildline("TOMS.txt.pdt", 16);
/* ... */

Fig. 4. Matrix assembler code

The algorithm used to construct a measurement matrix like that in Table I is presented in Figure 5. Here iw and jw are indexes corresponding to the current class and to the word attribute column assigned to the new word entry, respectively. The word relative frequencies (wordrelfreq) are inserted into the matrix, which is expanded with additional columns when necessary.

Input: Classes with word frequencies

Step 1: For each class do:
  if the word with wordrelfreq is not in A = (a_{ij}) then
    expand all rows with a column a_{i,jw}
    set all a_{i,jw} = 0.0 where i ≠ iw
  end if
  set a_{iw,jw} = wordrelfreq

Step 2: Loop to Step 1 until no classes remain.

Fig. 5. Matrix assembler algorithm
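The column-expansion scheme of Figure 5 amounts to growing a dense class-by-word table and padding missing attributes with 0.0. A compact Python equivalent (illustrative only; the authors' assembler is the C program of Figure 4):

```python
def build_measurement_matrix(class_freqs):
    """Assemble the measurement matrix as in Fig. 5.

    class_freqs -- one {word: relative_frequency} dict per class, in row order.
    Returns (matrix, words): rows are classes, columns are words, and a
    word missing from a class gets the frequency 0.0 (Section III-A).
    """
    matrix = []                            # rows built so far
    columns = {}                           # word -> column index jw
    for freqs in class_freqs:              # Step 1: for each class iw
        row = [0.0] * len(columns)
        for word, f in freqs.items():
            if word not in columns:        # unseen word: expand all rows
                columns[word] = len(columns)
                for earlier in matrix:
                    earlier.append(0.0)    # a_{i,jw} = 0.0 for i != iw
                row.append(0.0)
            row[columns[word]] = f         # a_{iw,jw} = relative frequency
        matrix.append(row)
    return matrix, sorted(columns, key=columns.get)
```

Feeding the 16 per-class frequency dictionaries through this function yields the 16-row matrix of Table I; fixing the column budget up front, as MAXCOLS does in Figure 4, avoids the repeated row expansion.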

A matrix of dissimilarities as defined by equation (1) can be constructed from the measurement matrix; the process is silently handled by the software packages.

B. The software packages

The measurements were processed for clustering with open-source software from the "R Project". We made use of the "cluster" and "MLInterfaces" packages available on the project's website. The "stats" package is delivered with the default installation, along with the "graphics" suite.

plot(silhouette(pam(x,3)))
plot(hclust(dist(x)))

Fig. 6. Function tools example

We used the Euclidean distance in both clustering methods in order to define the dissimilarity between two objects i and j:

d(x_i, x_j) = d(i, j) = sqrt( Σ_k (f_{ik} − f_{jk})^2 )   (12)
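Applied to two rows of the measurement matrix, equation (12) is the ordinary Euclidean distance over the word-frequency attributes. A minimal sketch (illustrative; in the experiment R's dist computes this transparently):

```python
import math

def euclidean(fi, fj):
    """Equation (12): Euclidean dissimilarity between the word-frequency
    vectors of two classes i and j."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fi, fj)))
```

Evaluating this for every pair of rows fills the symmetric dissimilarity matrix of equation (1).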

4 http://www.opensource.org/
5 http://www.r-project.org/

The "pam" function is a component of the "cluster" package and "hclust" belongs to the "stats" module. The two functions correspond to the two different methods proposed for data exploration.

The "pam" function from Figure 6 takes a matrix as its first argument and the number of clusters k as its second.

The "plot" function belongs to the "graphics" package and "silhouette" is implemented by the "cluster" component. Partitioning around medoids works on both dissimilarity and measurement matrices, but hierarchical clustering requires the dissimilarity matrix, which is computed by the "dist" function from the "stats" package [3].

Except for the "MLInterfaces" module, everything else is available within the default installation bundle and requires only manual loading. We used R version 2.8.1, which is free software defined by open-source licenses.

IV. EXPERIMENTAL RESULTS

A. Hierarchical partitioning

The dendrogram in Figure 7 presents the "Journal of Experimental Algorithmics" as the superclass of all because it is the most abstract of all. Algorithms can be described in both software and hardware logic according to various structural organizations and technologies, so it is natural that they fit over all classes. While experimental algorithms are purely

Fig. 7. Hierarchical clustering

conceptual, the problem descends step by step toward realization levels. There are three major groups of classes that fit the most suitable distinction.

From right to left, according to the dendrogram height, the Transactions on Computational Logic [24], Algorithms (non-experimental) [23], Modelling with Computer Simulation [26] and Mathematical Software [27] fit together. Non-experimental

6 Copyright (C) 2008 The R Foundation for Statistical Computing
7 http://www.fsf.org/


(applied) algorithms, mathematics and simulation share a common abstractness which is closer to the realization level than the experimental ideas.

The Transactions on Computational Logic [24] is the first step taken into the realization realm because it provides the means to express the abstract domain of the other three journals, marking the beginning of the "SPEECH" region on the graph.

The next group lists Computer Architecture with Code Optimization [22], Letters on Programming Languages [20] and the Transactions on Computer Systems [25] together. The ACM computing curricula describes multiple subfields within the "computer science" domain; the Transactions on Computer Systems brings "theoretical and conceptual explorations" into the realm of computers, with design principles, case studies and protocols, sometimes even borrowing from non-technical domains.

Programming languages and computer systems are less abstract classes than those in the previous partition, but they still represent a mixture of pragmatic concepts. The computer organization class, however, brings physical realization closer as it introduces structural hardware logic into the cluster.

The final group is an obvious refinement step, mostly dealing with the parallelization of computer architecture or the expansion of network topologies that emulate and mimic supercomputers at lower costs.

It is interesting to notice that high performance distributed computing [11] is evaluated as a subfield of supercomputing [21], because it is parented by the latter in the cost optimization process. That also makes sense for the distributed computing challenges [9].

The next method will confirm the results.

B. Partitioning around medoids

Clustering around medoids minimizes the dissimilarities within the 3 groups in Figure 8. The average silhouette of si = 0.51 is the minimum required in order to declare the results reasonable and non-artificial. Silhouettes of 0.71 and above belong to strong natural structures. Finding the best value of k with the help of a silhouette graph is detailed by Rousseeuw: the higher the average silhouette si, the better the result [28].

In order to find the best value of k we tried values from k ∈ {2, ..., 15}. For k = 3 we obtained the only satisfying result, corresponding to the reasonable structure in Figure 8.
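The silhouette criterion used for this selection can be computed directly from the dissimilarity matrix and a labeling. The Python sketch below follows Rousseeuw's definition and is illustrative only (the experiment relied on R's silhouette function):

```python
def average_silhouette(dist, labels):
    """Average silhouette width over all objects.

    dist   -- n x n dissimilarity matrix (list of lists)
    labels -- cluster id per object
    For object i: a(i) is the mean dissimilarity to its own cluster and
    b(i) the smallest mean dissimilarity to any other cluster;
    s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    """
    n = len(labels)
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)

    def s(i):
        own = groups[labels[i]]
        if len(own) == 1:
            return 0.0  # singletons get silhouette zero by convention
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in g) / len(g)
                for lab, g in groups.items() if lab != labels[i])
        return (b - a) / max(a, b)

    return sum(s(i) for i in range(n)) / n
```

Looping this over the partitions produced for k = 2, ..., 15 and keeping the k with the highest average reproduces the selection rule described above.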

The first group in the figure appears to be a bit forced, having a silhouette average too low for acceptance (si ≤ 0.5). The reason is that the dataset contains many radical extremes, like the Experimental Algorithmics [19], and the algorithm was forced to k = 3 in order to keep the global silhouette reasonable.

The last group also contains a silhouette of zero, which is normal for a journal of a completely different nature than all of the other classes. Both ACM and IEEE papers tend to attract their own kind.

8 http://www.acm.org
9 ACM description
10 as in bio-inspired computing

The silhouette of zero for "HPCSA" also testifies that this conference was indexed by a third scientific database, which mostly deals with humanistic domains. If the "outsider" had enough similarities with either of the other two partitions, the silhouette would have been lower than zero due to attraction to other clusters [18].

Fig. 8. PAM reasonable structure

The presence of extremes also had a bad influence, making it impossible to obtain a "strong" result; by eliminating them the results are improved.

Compared to hierarchical clustering, the Letters on Programming Languages [20] are swapped with the Transactions on Computational Logic [24], which reveals how similar they are. Spoken algorithms are an expression of logical circuits, so the two classes are different sides of the same coin. The cause of the swap can be identified in the short period of time during which the letters were edited, compared to the transactions.

Overall, the same pattern of scaling the path from idea to speech is obvious. The partitions are milestones on the road from conceptual abstractness to physical realization and implementation.

C. Meaning within results

The results of the analysis emphasize grouping the conferences and transactions according to how much abstractness or how much physical realization is injected into the cluster.

The road from ideas to solution is marked by the ever-present stages in Figure 9. The problem is first formally described by developing a mathematical model. In order to implement the computational model, the initial mathematical form has to be adapted from continuous to finite and discrete domains.

11 HPCSAGE in Figure 3
12 http://www.sagepub.com/


The discrete domain is the field where recipes of methodologies and steps merge together into applied algorithms. Algorithms are still a very conceptual domain, and the passage from idea to realization level is reached when the algorithm is formally described using a machine language (the "speech" barrier).

Fig. 9. From concept to solution

The primitive machine language evolves and is refined into paradigms and high level expressions. It is important to notice that the algorithm expressed as primitive machine language is a formal description of a physical logic circuit. While the construction of individual logic circuits for each problem description is expensive, generic machines can process instructions for basic mathematical operations.

Fig. 10. Refined solution

After the initial problem generates a solution, the aspects of cost are taken into consideration and the model is refined by sharing the load among multiple machines, as in Figure 10.

The three clusters found in the previous section correspondto the three stages of problem solving from Figure 11.

Stage 1: Discretised model of the problem
Stage 2: Machine speech description
Stage 3: Physical resolver

Fig. 11. Stages of problem solving

If, inside a supercomputer, parallelization handles multiprocessors with various connecting topologies, the use of computer networks maps and refines the specific structures outside a single machine. Computer clusters are refined solutions to the cost effectiveness problem and they dominate the TOP500 supercomputing list. Grids of computer clusters link them together according to various policies and requirements [29].

V. CONCLUSION

Unsupervised exploration and document clustering of scientific articles prove to be a good complement for surveying the trends and the state of the art in scientific domains. By exploring and detecting relationships between the scientific articles, we can support specific taxonomies with proven measurements and complement bibliographic studies.

In this paper we have shown that by using simple unsupervised exploration techniques, a wealth of useful information can be extracted from a database of scientific articles. We showed that the results are valid and reasonable, and we introduced a scale of abstractness and physical realization in order to explain the results within the original dataset. We detailed the interpretation by identifying the general process of problem solving.

Future work still needs to be done in order to refine the frequency counting software, for instance by altering the regular expression to ignore case and by matching the tokens against semantic dictionaries.

The similarities inside each partition, and what ties those conferences together, would be more easily observed on a larger database set.

With the help of biclustering and feature selection techniques it will be possible to analyze how the word attributes can be grouped together inside the determined partitions. Selecting the representative features from each cluster may help detail the results and support denomination of the groups inside specific taxonomies.

REFERENCES

[1] P. Sneath and R. Sokal, Numerical Taxonomy. San Francisco: W. H. Freeman, 1973.
[2] Y. Peng, G. Kou, Z. Chen and Y. Shi, "Recent trends in Data Mining (DM): Document Clustering of DM Publications," International Conference on Service Systems and Service Management, vol. 2, pp. 1653–1659, October 2006.
[3] L. Kaufman and P. Rousseeuw, "Clustering by means of medoids," International Conference on Statistical Data Analysis Based on the L1-Norm and Related Methods, August–September 1987.
[4] H. Vinod, "Integer Programming and the Theory of Grouping," Journal of the American Statistical Association, vol. 64, pp. 506–519, 1969.
[5] F. Murtagh and A. Heck, Multivariate Data Analysis with Astronomical Applications. Dordrecht, Holland: Kluwer Academic Publishers, 1987.
[6] IEEE, International Conference on Cluster Computing, 1999-2008.
[7] IEEE/ACM, International Conference on Cluster Computing and the Grid, 2001-2008.
[8] IEEE, Mexican International Conference on Computer Science, 2008.
[9] IEEE, International Workshop on Challenges of Large Applications in Distributed Environments, 2003-2006.
[10] IEEE, ChinaGrid Annual Conference, 1990-2008.
[11] ACM, International Conference Series on High Performance Distributed Computing, 1995-2008.
[12] ACM/HPDC, International Workshop on Challenges of Large Applications in Distributed Environments, 2007.

13 http://www.top500.org


[13] ACM, International Workshop on Data-Aware Distributed Computing, 2008.
[14] ACM, International Workshop on Grid Monitoring, 2007.
[15] ACM, International Workshop on Service-Oriented Computing Performance: Aspects, Issues, and Approaches, 2007.
[16] ACM, International Workshop on Use of P2P, GRID and Agents for the Development of Content Networks, 2007-2008.
[17] ACM, International Workshop on Workflows in Support of Large-Scale Science, 2007.
[18] SAGE, International Journal of High Performance Computing Applications, 1999-2008.
[19] ACM, Journal of Experimental Algorithmics, vol. 1-13, 1996-2008.
[20] ACM, Letters on Programming Languages and Systems, vol. 1-2, 1991-1992.
[21] IEEE/ACM, International Conference on Supercomputing, 1995-2006.
[22] ACM, Transactions on Architecture and Code Optimization, vol. 1-5, 2004-2008.
[23] ACM, Transactions on Algorithms, 2005-2008.
[24] ACM, Transactions on Computational Logic, vol. 1-9, 2000-2008.
[25] ACM, Transactions on Computer Systems, vol. 1-26, 1983-2008.
[26] ACM, Transactions on Modelling and Computer Simulation, vol. 1-18, 1991-2008.
[27] ACM, Transactions on Mathematical Software, vol. 1-35, 1975-2008.
[28] P. Rousseeuw, "Representing Data Partitions," Proceedings of the Statistical Computing Section of the American Statistical Association, pp. 275–280, 1985.
[29] A. S. Tanenbaum and M. van Steen, Distributed Systems: Principles and Paradigms. Prentice Hall, October 2006.