heinz dreher school of information systems curtin university, perth, western australia automatic...

36
Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 1 25th Bled eConference eDependability: Reliable and Trustworthy eStructures, eProcesses, eOperations and eServices for the Future June 17, 2012 – June 20, 2012; Bled, Slovenia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2011

Upload: sydney-barber

Post on 29-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Heinz DreherSchool of Information Systems

Curtin University, Perth, Western Australia

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 1

25th Bled eConferenceeDependability:

Reliable and Trustworthy eStructures, eProcesses, eOperations

and eServices for the FutureJune 17, 2012 – June 20, 2012; Bled, Slovenia

Automatic Semantic Trend Analysis of the Bled

eConference: 2001-2011

Page 2: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

overviewFinding concepts and concept-trends in vast repositories of text

Semantic analysis algorithms have developed over the last decade to the point where they are almost within reach of everyone as is Google for text searching

This study reports on an experimental application of automated semantic analysis of the Bled eConference 2001-2011 proceedings full text corpus

Rubrico, the specific tool used in the study is introduced

The methodology used to deploy Rubrico on the Bled corpus for the purpose of revealing the embedded concepts is explained

Interpretation and discussion is offered to indicate the possibilities ensuing from the semantic analysis

Future work is indicated to address limitations and further explore the prospects

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 2

Page 3: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

the problemgiven a large body of knowledge (textual corpus) how do we find out what is in it?

consider the proceedings of your favourite conference, e.g. Bled eConference held in Slovenia in June each year

attracting speakers and delegates from business, government, information technology providers and universities - it is a major venue for

researchers working in all aspects of “e”

#documents for the eBled conference years 2001 - 2011

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 3

year 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 total #docs 50 49 71 52 51 52 60 45 41 42 42 555

Page 4: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

some possible solutions if you are a lecturer you get your 40 students to each read and summarize 10 papers (=400), and you do 10 yourself (now 500 are done), and reward the 11 good students with another 5 each and we have the 555 papers summarized! (they get marks, you get summaries)

if you are an award winning researcher, get your RA to do the work – following a methodology you have already had her devise

if you have PhD students you set up a paper review and critical appraisal colloquium and offer the task as a learning exercise in preparation for their upcoming literature review

if you are innovative, you’ll look around for some ‘smartech’ way to tackle the taskAutomatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 4

Page 5: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

a ‘smartech’ way to tackle the task

this will use already devised algorithms suited to the purpose and involves, rather obviously, artificial intelligence, since the task of finding concepts in a document requires intelligence

in machine learning, a branch of artificial intelligence, there exist suitable algorithms that can be deployed

semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents. Initially it does not involve prior semantic understanding of the documents, although this is beneficial for subsequent follow-on and refined analyses

Latent Semantic Analysis (or Latent Semantic Indexing), is perhaps the best known – but there are many others

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 5

Page 6: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

a sample of the many methods & techniques

1 Supervised learningArtificial neural network; Bayesian statistics; Case-based reasoning; Decision trees; Inductive logic programming; Gaussian process regression; Learning Automata; Learning Vector Quantization; Minimum message length (decision trees, decision graphs, etc.); Nearest Neighbor Algorithm; Analogical modeling; Support Vector Machines; Regression analysis

2 Unsupervised learningAssociation rule learning; Hierarchical clustering; Partitional clustering e.g. Latent Semantic Analysis (LSA or LSI)

3 Reinforcement learning

4 Combinations and adaptations of the aboveRubrico (Reiterer, Dreher, & Gütl)Normalised Word Vector (NWV) as used in MarkIT™ (Williams & Dreher)webLyzard technology (Scharl & Weichselbraun)

(refer to inter alia Wikipedia for a good start on the topic of semantic analysis)

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 6

Page 7: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

the Rubrico process

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 7

Page 8: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Rubrico generates a vizualisation

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 8

Page 9: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Rubrico generates a concept hierarchy,

and Rr values (Rubrico-

relevance)

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 9

Ontological structure of concept “knowledge” computed from “Bled 2001-2011 Abstracts”

As an example of a concept hierarchy (3 levels deep), look to top left. The concept “knowledge” subsumes “model”, which itself subsumes “framework”.

Typically, concepts are identified via a thesaurus, reference ontology, and word taxonomies such as that implemented in the lexical database WordNet.

‘Rubrico-relevance’ or Rr is shown in the rightmost column

Page 10: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Rubrico ConceptKeyword counts for the Bled 2001-

2011corpus

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 10

Σ from 1 to ConceptKeywordCount of all Rr values for a conference

year = 100

Page 11: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

what we (humans) may generate from what Rubrico generates -

interpretation

use Rr to identify ‘Top concepts’ (and ‘Bottom concepts’ too)

focus on the Rr_rank (1 is the highest value Rr in a given year, or unit of analysis)

choose some ‘concepts’ for detailed study (as one cannot realistically track thousands) – good opportunity for further tool development

focus on the top-30

and/or the bottom-30

or make any other selection

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 11

Page 12: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

using Rr to select conceptsTop30 Concepts in 2001 and 2011

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 12

Page 13: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Top 30 Concepts by Relevance, and # Years Occurring

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 13

Page 14: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Bottom 30 Concepts by Relevance, and # Years Occurring

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 14

Page 15: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

a final set of 37 concepts (ConceptKeywords) to focus on

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 15

Page 16: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

37 ConceptKeywords and their usage over 11 Conference Years

“ConceptKeywords-and-their-usage-over-time” = Appendix 2

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 16

Page 17: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Interpretation and Discussion

the ConceptKeywords identified as ‘prominent’ across the 11 conference years for further study in this presentation have been selected from the ‘Top 37’, and they are:

1) user; 2) pp; 3) relationship; 4) knowledge; 5) com; 6) group

zero-entry concepts for some years: 11) supplier; 13) participant; 14) need; 18)

site; 19) respondent; 22) employee; 23) device

first appearing in 2005 concepts: access; http; people

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 17

Page 18: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

most ‘prominent’ across the 11 conference years was ConceptKeyword user

1) user;

<<< put in the ontology for user across the 11 years … on one slide???? huh ? I will need to devise some new ‘dynamic’ vizualisations for this … must wait for later …>>>

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 18

Page 19: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

2001 ontology for user

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 19

Page 20: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

2011 ontology for user

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 20

Page 21: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

1) user

in each of the 11 conference years (2001-2011), the concept of user featured strongly, being ranked (Rr_rank) at either 1, 2, or 3, except for the year 2010 in which it achieved only 3834th place.

this, at first, very surprising perturbation is easily explained. See below, and note that in column with name “rank-B.10” the ConceptKeywords people and health (rows 27 and 28) have values 2 and 3 respectively. This indicates that authors were using the term or concept people rather than user, and the reader may now check that eHealth was a big feature of the 24th Bled eConference held in that year. People is another perspective on user, and one may postulate that what we see here is the response by authors to the calls of the conference organisers and editors, adding weight to the proposition that editors have a big influence in the direction of the thinking of a body of authors. To the extent that this is true one can verify that the semantic analysis (for example as per Rubrico) is creating a ‘true’ picture of reality.

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 21

Page 22: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

2) pp

the second most prominent ConceptKeyword to emerge is pp. What an odd thing is that?

pp is of course meaningless as a concept in the usual sense, however if we understand that the corpus is a collection of scholarly articles, for which the authors have created reference lists, often including a sequence of pages in their citations (indicated by “pp. 22-55”, for example), then we may make some sense out of this.

is there a particular style of referencing being required which may explain the pp performance?

note that in the year 2001 the Rr_rank is 6, then climbing to 3, then 2, and very often at 1. This could, for example, be associated with an already-strong expectation of precise citation being tightened during the early years of the decade.

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 22

Page 23: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

3) 2004_relationship

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 23

Page 24: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

3) 2011_relationship

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 24

Page 25: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

3) relationship

relationship is the third most prominently occurring ConceptKeyword and its Rr_rank profile rises and falls but remains within the range of a low at 15 and a high of 2.

does such a consistent and strong performance indicate a very great emphasis, in this conference, on the pursuit of truth and explanation through systematic study and investigation of the essential connections between things?

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 25

Page 26: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

4) knowledge

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 26

Page 27: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

4) knowledge

knowledge (the fourth) its Rr_rank profile rises and falls but remains within the rage of a low at 17 and a high of 2.

does such a consistent and strong performance indicate a very great emphasis, in this conference, on the pursuit of truth and explanation through systematic study and investigation of the essential connections between things?

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 27

Page 28: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

5) com

item 5 featured in the ‘top’ 37 ConceptKeywords list is com, which performs strongly over all of the conference years and is clearly associated with references to “dot com” and website URLs.

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 28

Page 29: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

6)grou

p

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 29

Page 30: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

6) group

group is the first item (according to a row-by-row consideration of Appendix 2) to have a zero score (in year 2001) then gradually, if a little erratically, growing in prominence over the ensuing decade. It may reveal an emergence of the role of people in teams and concern for societal issues in general.

in the left hand panel of the previous Figure, one can see that group is rather an extensive structure, consisting of 34 elements, nine of which have sub-hierarchies. In the right hand panel we see the content sub-hierarchy, which is 3-levels deep.

Inspection of the occurrence for group reveals that this ConceptKeyword was absent from the 2001 proceedings.

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 30

Page 31: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

zero-entry concepts for some years:

11) supplier; 13) participant; 14) need; 18) site;

19) respondent; 22) employee; 23) device

the following ConceptKeywords have zero entries for one or more conference years: supplier (missing in 2003); participant (2010); need (02, 05, 08, 10, and 11); site; respondent; employee; device; …..

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 31

Page 32: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

first appearing in 2005 concepts:

access; http; people

ConceptKeywords access, http, people, are remarkable because they appear only in 2005, then disappear for a period and perhaps reappear. This may be indicative of a fad, but ….

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 32

Page 33: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

human concepts vs Rubrico concepts

the human-created ontology (in our terminology) is at a higher conceptual level than achieved by Rubrico. Combinations of terms such as “eMarkets, Directories, Auctions” are reported as being characteristic of the period 1998-2002 (Clarke 2012)

the ‘super concept‘ formed from “eMarkets, Directories, Auctions” to be detected automatically it must have a textual association and eventual representation in the corpus

For the years 2001 and 2002 in our analysis the ConceptKeywords transaction and implementation may pertainAutomatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 33

Page 34: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

eMarkets_Directories_Auctions cf.

ConceptKeywords transaction & implementation

Clarke (2012) presents manual analyses of the Bled Conference corpus

the human-created ontology (in our terminology) is at a higher conceptual level than achieved by Rubrico

combinations of terms such as “eMarkets, Directories, Auctions” are reported as being characteristic of the period 1998-2002 (Clarke 2012)

for the ‘super concept’ formed from “eMarkets, Directories, Auctions” to be detected automatically it must have a textual association and eventual representation in the corpus

years 2001 and 2002 in our analysis the ConceptKeywords transaction and implementation may pertain

an intensive knowledge-elicitation and -engineering exercise would be needed to match this against the mentioned ‘super concept’ eMarkets_Directories_Auctions

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 34

Page 35: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

Future work As in all experimental research, there are limitations and deficiencies that one would like addressed. Rubrico makes possible the semantic treatment of vast amounts of text; but it is not intelligent.

The human mind may find it difficult to comprehend certain ontological structures that Rubrico computes (e.g. group, user, pp). Therefore, an improvement that needs to be considered is for human editing of the initially-computed ontology, followed by a re-run of the semantic analysis.

Another limitation at present is the performance of Rubrico with large document sets – which we currently estimate as being greater than about 100 conference papers. Our initial attempt at analysis was to deploy Rubrico on a laptop computer, to deal with the entire 555 documents in the Bled 2001-2011 Conference corpus; it resulted in ‘stagnation’. This points to a clear need for implementation in a more computationally-powerful environment.

Next, we want further functionality to automate the construction of the “ConceptKeywords-and-their-usage-over-time” for example, and interactivity, interoperation, and dynamic visualisation of the elements of information structures depicted therein.

In order to check the usefulness of automated conceptual analysis of full text corpora such as has been attempted here, one would ideally need to engage in a comparison with the results of other types of analyses, and especially human-generated ones. As there is a special section of this 25th anniversary of the Bled conference, any alternate analyses published could form an interesting and useful agenda for further research.

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 35

Page 36: Heinz Dreher School of Information Systems Curtin University, Perth, Western Australia Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012

AcknowledgementsEmanuel Reiterer who developed the Rubrico “conceptual analysis system” into a computer program during a Master’s project in 2009 (TU Graz - with C. Gütl & H. Dreher)

Gregor Lenart who was instrumental in providing access to the full text of the proceedings for the years 2001-2012

Roger Clarke who encouraged the computer analysis of the full-text documents and assisted in obtaining access to the textual repository referred to here as the Bled 2001-2011 Conference Proceedings Corpus

Naomi Dreher – the RA who devised the “Finding Relevant Concepts Method” - Dreher, N. & Dreher, H. (2011). Empowering Doctoral Candidates in Finding Relevant Concepts in a Literature Set. International Journal of Doctoral Studies, Vol. 6, pp. 33-50. http://ijds.org/

Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 36