Enterprise Users and Web Search Behavior



University of Tennessee, Knoxville
Trace: Tennessee Research and Creative Exchange

Masters Theses, Graduate School

5-2010

Enterprise Users and Web Search Behavior
April Ann Lewis, The University of Tennessee, Knoxville, [email protected]

This Thesis is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact [email protected].

Recommended Citation: Lewis, April Ann, "Enterprise Users and Web Search Behavior." Master's Thesis, University of Tennessee, 2010. http://trace.tennessee.edu/utk_gradthes/643


To the Graduate Council:

I am submitting herewith a thesis written by April Ann Lewis entitled "Enterprise Users and Web Search Behavior." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Information Sciences.

Peiling Wang, Major Professor

We have read this thesis and recommend its acceptance:

Dania Bilal, Lorraine Normore

Accepted for the Council:

Carolyn R. Hodges
Vice Provost and Dean of the Graduate School

(Original signatures are on file with official student records.)


To the Graduate Council:

I am submitting herewith a thesis written by April Ann Lewis entitled "Enterprise Users and Web Search Behavior." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Information Science.

Peiling Wang, Major Professor

We have read this thesis and recommend its acceptance:

Dania Bilal

Lorraine Normore

Accepted for the Council:

Carolyn R. Hodges
Vice Provost and Dean of the Graduate School

(Original signatures on file with official student records)


    Enterprise Users and Web Search Behavior

A Thesis Presented for the Master of Science Degree
The University of Tennessee, Knoxville

April Ann Lewis
May 2010


Copyright 2010 by April Ann Lewis
All rights reserved.


    Acknowledgements

I would like to thank Oak Ridge National Laboratory (ORNL) for supporting my graduate education and encouraging me to pursue my newly found interests in applied information science research. ORNL's Chief Information Officer (CIO) and the web server support team graciously provided me with a very robust data set and answered all questions I had regarding its format.

I would also like to acknowledge the extensive amount of work that Dr. Peiling Wang has done in the area of Web data mining and analysis. Dr. Wang's relational database model for web queries was fundamental to my data mining efforts. I have learned much as a graduate student from her research work as well as from her classroom instruction. I am honored that Dr. Wang agreed to chair my thesis committee.

I am also very grateful to have had Dr. Lorraine Normore and Dr. Dania Bilal serve on my committee, both very accomplished in complementary areas of Information Science research. Dr. Normore first introduced me to human-computer interaction (HCI) relevant to information search. One of the motivations for this thesis was characterizing the corporate user's interaction with the ORNL intranet search environment. Dr. Bilal provided me with a basic understanding of search environments, tasks related to information searching, and the theory of cognitive motivation for successful search.


    Abstract

This thesis describes an analysis of user web query behavior associated with Oak Ridge National Laboratory's (ORNL) Enterprise Search System (hereafter, the ORNL Intranet). The ORNL Intranet provides users a means to search all kinds of data stores for relevant business and research information using a single query. The Global Intranet Trends for 2010 Report suggests the biggest current obstacles for corporate intranets are findability and siloed content. Intranets differ from internets in the way they create, control, and share content, which can make it difficult and sometimes impossible for users to find information. Stenmark (2006) first noted that studies of corporate internal search behavior are lacking and appealed for more published research on the subject.

This study employs mature scientific internet web query transaction log analysis (TLA) to examine how corporate intranet users at ORNL search for information. The focus of the study is to better understand general search behaviors and to identify unique trends associated with query composition and vocabulary. The results are compared to published intranet studies. A literature review suggests only a handful of intranet-based web search studies exist, and each focuses largely on a single aspect of intranet search. This implies that the ORNL study is the first to comprehensively analyze a corporate intranet user web query corpus and provide the results to the public.

This study analyzes 65,000 user queries submitted to the ORNL intranet from September 17, 2007 through December 31, 2007. A granular relational data model first introduced by Wang, Berry, and Yang (2003) for Web query analysis was adopted and modified for data mining and analysis of the ORNL query corpus. The ORNL query corpus is characterized using Zipf distributions, descriptive word statistics, and mutual information. User search vocabulary is analyzed using frequency distribution and probability statistics.

The results showed that ORNL users searched for unique types of information. ORNL users are uncertain of how to best formulate queries and don't use search interface tools to narrow search scope. Special domain language comprised 38% of the queries. The average number of results returned per query was too high, and 16.34% of queries returned no hits.


    Table of Contents

Acknowledgements
Abstract
Chapter 1 Introduction and General Information
    Introduction
    Research Questions
Chapter 2 Literature Review
    TLA Theory and Methodology
        TLA Theory
        TLA Methodology
            Data Collection
            Data Preparation
            Data Analysis
    Literature Review Objectives
        Session Analysis
        Longitudinal Analysis
        Visual Presentation of Information Needs
    Conclusion
Chapter 3 Methods
    Research Environment
        The Data
        Data Structure
        Preparation
        Processing
        RDMS Development
    Methods
        Mutual Information Analysis
        Zipf Analysis
        Approach to Spell Check Query Vocabulary
Chapter 4
    RQ1: What general search behaviors do ORNL searchers exhibit when searching the intranet?
    RQ2: How do users formulate their queries?
    RQ3: What are the characteristics of the user vocabulary?
    RQ4: How do ORNL results compare to the published studies?
    Discussion
        General Search Behavior
        Query Formulation
        Vocabulary Analysis
Chapter 5 Summary and Conclusion
    Summary
        General Search Behavior
        Query Formulation
        Vocabulary Analysis
        Results Comparison to Published Studies
    Conclusion
        Future Study Recommendations
References
Appendix
    Appendix A
    Appendix B
    Appendix C
Vita


    List of Tables

Table 1. Defines all the information fields that are available in the ORNL access log
Table 2. Information fields and definitions of ORNL collected query log
Table 3. Examples of unsupported query strings submitted by ORNL searchers
Table 4. Top ORNL computer platforms
Table 5. Browser breakdown for ORNL users
Table 6. Top 20 URLs requested
Table 7. Distribution and categorization of page types
Table 8. Most popular external search engines
Table 9. ORNL top 25 most frequent clean queries
Table 10. ORNL top 10 queries with N-words
Table 11. Popular ORNL query words
Table 12. Select ORNL mutual information values
Table 13. All word pair sets involving the words "pay" and "band"
Table 14. ORNL results compared to published studies


    List of Figures

Figure 1. ORNL web query ER model for relational database
Figure 2. ORNL query database, highlighted tables supporting query-level analysis
Figure 3. ORNL query database, highlighted tables supporting vocabulary analysis
Figure 4. Typical Zipf distribution plot
Figure 5. Spell-check procedure
Figure 6. ORNL aggregated page types most clicked
Figure 7. ORNL topic categories
Figure 8. ORNL business category queries
Figure 9. Query counts for each month
Figure 10. Bi-monthly comparison, week 3, September & October 2007
Figure 11. The distribution of words in unique queries
Figure 12. ORNL total query count distribution from September to December 2007
Figure 13. Temporal frequency sampling of ten unique queries
Figure 14. Distribution of word length associated with ORNL unique query words
Figure 15. Zipf distribution plot of the top 100 and top 2000 words
Figure 16. Sample of unique or irregular vocabulary


    Chapter 1

    Introduction and General Information

    Introduction

Many companies are adopting internet search practices for their intranets. While the underlying search process is the same for both the Internet and the intranet, the search needs of the respective users and their environments are very different (Fagin et al., 2003). The Internet consists of users who have individualized information needs and share no understanding with the information providers. Internet users have access to an unbounded document set that may include advertisements and spam.

Conversely, ORNL intranet users search for information individually, but they share a contextual understanding of the information space with the providers. The document set or search corpus available to ORNL users is controlled and limited. Users are not exposed to advertisements or spam within the search environment. Much more is known about internet search, as many studies have been published that include search success statistics. The number of unsuccessful Internet searches reported by college students in a recent library user internet search survey was nearly 50% of all internet search submissions (Mann, 2005). It is difficult to find any similar qualitative results measured relative to intranet search.

There are two very distinct environments when it comes to web search: 1) the internet and 2) the intranet. The way these environments are viewed by both users and researchers is very different. There are only a handful of published studies regarding intranet search, but internet search reports are published nearly every three months. The most recent internet statistics were published in February (Nielson, 2010), reporting that Google is the most preferred search engine (65.2% of all searches). That same report listed Yahoo as second, losing 18% more of its previously reported search share to Google. The percentage of typical daily users has grown to nearly 50%, with users extremely positive about search engines and their search experiences (Fallows, 2008). However, in that same report users are described as generally unsophisticated about how and why they use search.

In contrast, there are no free, regular, web-based reports available to the public on intranet statistics. When in-depth reports or studies are available, they typically must be purchased. On average, intranet workers spend about 25% of their time searching for information (Feldman & Sherman, 2004). Feldman and Sherman (2004) also report that a company with 1,000 knowledge workers may waste well over $6M a year looking for information that doesn't exist, failing to find information that does, or recreating information that could have been found. The search experience for intranet users is not pleasant. A recent enterprise intranet search survey by Ward (2005, Sept. 7) found that "web-rage" was experienced after 12 minutes of fruitless search, although nearly 7% of the 566 people surveyed said they felt irritated after only three minutes.

Not only is there a difference in the internet and intranet search environments, there are also key distinctions in search engine performance and query vocabulary requirements. For example, indexing and ranking of search results on the internet can be impacted by organic linking and spam. The intranet is not affected by spam, and cross-linking is not typically practiced in corporations. The way search results are stitched together as a product of federated search is also different: the intranet has special rules for stitching, like security access, duplication, etc. Tagging of information is not implicit within the intranet, which affects indexing. This is not to say implicit tagging of items associated with the internet always results in improved search performance. Intranets tend to have a smaller or narrower search vocabulary due to special domain language.

The functional capability of a dynamic search is also critical for intranets. It is estimated that intranets as enterprises have tens or even hundreds of times larger data collections (both structured and unstructured) than internets (Li, Cao, Hu, Xu, Li, & Meyerzon, 2005). The recent intranet study by Li et al. (2005) demonstrated that an intranet search does not just focus on retrieval of relevant documents; it includes special types of information such as definitions, persons, experts, homepages, and applications. Another unique challenge to search inside the intranet is dealing with secure content; when it is not included, the value for the searcher is greatly diminished (Valdez-Perez, 2007). David Hawking (2006) aptly describes the enterprise as a complex information environment, which makes measuring the quality of search results difficult. While this study does not offer a solution to this problem, it characterizes the ORNL intranet, which could provide a framework for evaluating corporate web search environments. Clearly this is a motivating factor for comprehensively analyzing one's corporate intranet: specifically, measuring general search behavior exhibited by users and examining trends in query submission and reformulation, as well as the results of search, both successes and failures.


Successful search equates to optimized findability. Measuring findability means characterizing the enterprise search environment. This typically involves analyzing query logs to identify what topics users are searching for, query formulation (characterizing query submissions), and the percentage of search failure (no hits or too many results). It is a presumption of this study that it is not enough to understand query-level results. It is also necessary to analyze information related to general search behavior, which describes how and when users search. Only when we understand both search behavior and search results can we improve overall efficiency within intranet search systems. General search behavior can be determined by analyzing access logs and usage reports. It complements search analysis by helping us understand the unique characteristics of our web users.

It is because of these fundamental differences that organizations must evaluate their intranet search solution; simply applying best practices found with internet search is not practical. A successful organization must make sure that users can actually find information on their unique systems in a reasonable amount of time. Efficient search engines must be configured to match the characteristics of the users and the special information they seek. The most common way to characterize the users and the information they seek is to gather statistics on intranet usage and to evaluate user search logs.

Transaction Log Analysis (TLA) typically focuses on the interaction behaviors occurring among the users, the search system, and the information (Jansen, 2009). Content analysis of server log files describes user interaction as it relates to internet usage statistics/reports and search queries. Several studies have been done in this area, with only Stenmark (2005, 2006) focusing on corporate intranets (Beitzel, Jensen, Lewis, Chodury, & Frieder, 2007; Wang, 2006; Baeza-Yates, Calderon-Benavides, & Gonzalez, 2006; Wang, Berry, & Yang, 2003; Jansen & Spink, 2006; Wolfram, Wang, & Zhang, 2008). This study will contribute to TLA by applying the Wang et al. (2003) method of mutual information analysis to intranet queries. It also implements the Wang (2006) method of topic identification, complemented by general transaction analysis of ORNL user search usage statistics. In addition to contextual analysis, this study includes indirect analyses of access logs and usage reports to better characterize general ORNL search behavior. Unlike narrowly focused published intranet studies, this study will comprehensively analyze a corporate intranet's user web query corpus for the purpose of improving the overall intranet search experience. Along with query logs, it evaluates access and usage logs in order to gain a holistic view of the ORNL search enterprise. A literature review suggests this may also be the first study to perform TLA on an intranet site using the Microsoft Office SharePoint search engine.

This thesis will add to the growing body of literature associated with web query transaction log analysis for intranets by providing a methodology to other intranet users and managers who may want to holistically analyze their search environment. It combines the log analysis used by search system administrators to measure search engine performance and interaction with traditional query log analysis, which measures users' search performance and interaction. The thesis is organized as follows. The next section discusses the research questions associated with this study. Chapter 2 summarizes the public extent of research related to intranet and web search. Chapter 3 characterizes the ORNL enterprise search environment, the transaction log files used in the study, and the research methodology. Chapter 4 presents results and discussion, while Chapter 5 summarizes the study results and discusses implications of the study.

    Research Questions

This study employs mature scientific internet web query transaction log analysis (TLA) to better understand how intranet users at ORNL search for information. The focus of the study is examining general search behaviors and identifying unique trends associated with query composition and vocabulary. The goals of the research are threefold and include answers to the following research questions (RQ):

RQ1. What general search behaviors do ORNL searchers exhibit when searching the intranet?

a. What is the size of the ORNL search audience?
b. What interfaces do ORNL users employ most when they search?
c. What types of pages do ORNL users click most often when results are available?
d. What topics do ORNL users commonly search for?
e. When do ORNL users search and what are their search results?

RQ2. How do ORNL users formulate their queries?

a. What are the most frequently submitted queries?
b. How many ORNL user queries are unique?
c. How many ORNL user queries are blank?
d. What are the lengths of ORNL user queries?
e. What is the distribution of ORNL queries relative to length and time?

RQ3. What are the characteristics of the ORNL user vocabulary?

a. What is the length and distribution of ORNL unique terms?
b. With what frequency do ORNL user queries contain acronyms, abbreviations, and misspelled words?
c. What is the frequency of common stop words?
d. Are there terms that occur together frequently (term co-occurrence)?

RQ4. How do ORNL results compare to the published studies?


    Chapter 2

    Literature Review

    TLA Theory and Methodology

This chapter provides a brief overview of mature Transaction Log Analysis (TLA). The overview contains two major sections, the first on TLA theory and the second on TLA methodology. The overview is then followed by a short discussion of the literature review objectives. Following the review objectives are discussions of each related work and the impact it had on developing the methodology for this study.

    TLA Theory

The use of data stored in the transaction logs of web search engines, intranets, and web sites can provide valuable insight into understanding the information searching process of internet searchers (Jansen, 2006). Many researchers (Jansen, Spink, & Taksa, 2009) feel transaction log data can provide feedback on what users are looking for in search architectures. Although there is a body of literature on empirical studies of TLA, few provide detailed methodological clarifications of the data models used and the underlying rationales for these models (Wang, Wolfram, Zhang, Hong, & Wu, 2007). While TLA is emerging as a viable research methodology, it is not without its critics. Critics feel that TLA doesn't go far enough: the logs don't record the user's perceptions of the search and therefore don't measure the real needs of the information searcher (Kurth, 1993).

    TLA Methodology

Many studies have examined transaction log analysis (TLA) of web-based search engines. Researchers have used transaction logs for analyzing a variety of applications, from internet search to library information retrieval (IR) systems (Croft, Cook, & Wilder, 1995; Jansen, Spink, & Saracevic, 2000; Jones, Cunningham, & McNab, 1998; Wang, et al., 2003; Wang, Wolfram, & Wu, 2008). In "Search log analysis: What it is, what's been done, and how to do it," Jansen reviews the fundamental research motivation for TLA and describes a methodology for conducting successful TLA research. A recent tutorial published by Wang, Wolfram, and Wu (2008) entitled "Web Search Log Analysis and User Behavior Modeling" focuses specifically on the technical process for conducting web transaction log analysis using the best tools developed by researchers over the last decade.

In all of these studies, TLA methodology is commonly described as a three-stage process. The first stage is data collection, which includes collecting the interaction data for a given period of time using transaction logs. The second stage is cleaning and parsing the log files to make them suitable for analysis. The third and final stage is analysis, which requires selecting a specific research methodology. Of course, the research questions define what can be answered by the default data in typical transaction logs (Jansen, 2006). Fortunately, today's search logging software easily allows for expanding unobtrusive data collection to additional variables to meet analysis needs.
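To make the three-stage process concrete, the following is a minimal sketch in Python. The file name, the comment convention, and the assumption that a record's first field is a date are illustrative stand-ins, not the tooling used in this study.

    from collections import Counter

    def collect(log_path):
        # Stage 1: read raw transaction log records for the study period.
        with open(log_path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    def prepare(records):
        # Stage 2: clean and parse -- here, simply drop blanks and comment lines.
        return [r for r in records if r and not r.startswith("#")]

    def analyze(records):
        # Stage 3: compute a standard interaction metric (records per day,
        # assuming the first whitespace-separated field is the date).
        return Counter(r.split(" ", 1)[0] for r in records)

    print(analyze(prepare(collect("search.log"))))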

    Data Collection

Transaction logs come in different formats, but more recent commercially available search tools produce standard World Wide Web Consortium (W3C) extended or Internet Information Services (IIS) format log files. Inherently, all data logs vary in content. The data format and fidelity should be addressed along with any predefined assumptions (Jansen & Pooch, 2001). In "Privacy Concerns for Web Logging Data," Kirstie Hawkey (2009) suggests researchers should anonymize or otherwise transform any sensitive or personal data before receiving, working with, or publishing it. Most private or government organizations have policies related to sensitive information management. Researchers should consult with the Chief Information Officer (CIO) of their organization to discuss proper handling and dissemination of search log related information.
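One common way to implement the anonymization Hawkey recommends is to replace sensitive identifiers with a salted one-way hash before analysis. The sketch below is illustrative only: the field names and salt are hypothetical, not the ORNL procedure.

    import hashlib

    SALT = b"study-specific-secret"  # kept separate from the published data

    def pseudonymize(value: str) -> str:
        # One-way hash: consistent across records, but not reversible.
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:12]

    record = {"c-ip": "160.91.4.25", "cs-username": "ORNL\\jdoe"}  # invented example
    safe = {k: pseudonymize(v) if k in ("c-ip", "cs-username") else v
            for k, v in record.items()}
    print(safe)

Because the same input always maps to the same pseudonym, per-user analyses (such as session detection) still work on the transformed data.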

Most log files contain data that can be used to analyze users' search behaviors with IR systems, whether internet or intranet, by discerning attributes of distinct search processes and their resulting components. Jansen and Pooch (2001) establish the framework terminology for analyzing the search process, describing three distinct components: 1) session, 2) query, and 3) term. Session analysis is focused on discrete entries entered by single users. This is the most difficult of the three, as the researcher must determine what constitutes a session. Session boundary detection is difficult because users search for multiple topics on a single computer, or a single computer may be shared by multiple searchers (Wolfram, et al., 2008). Sessions can be comprised of single or multiple queries.
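A minimal sessionization sketch: group one user's timestamped queries into sessions whenever the gap between consecutive queries exceeds a chosen threshold (Stenmark, 2005, discussed later, used 13 minutes). The events below are invented for illustration.

    from datetime import datetime, timedelta

    THRESHOLD = timedelta(minutes=13)

    def sessionize(events):
        # events: list of (timestamp, query) pairs, sorted by time, one user.
        sessions, current = [], []
        for ts, query in events:
            if current and ts - current[-1][0] > THRESHOLD:
                sessions.append(current)  # gap too long: close the session
                current = []
            current.append((ts, query))
        if current:
            sessions.append(current)
        return sessions

    events = [(datetime(2007, 9, 17, 6, 30), "intimal hyperplasia"),
              (datetime(2007, 9, 17, 6, 32), "hyperplasia treatment"),
              (datetime(2007, 9, 17, 9, 0), "travel office")]
    print(len(sessionize(events)))  # -> 2: the long gap starts a new session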

  • 7/28/2019 Enterprise Users and Web Search Behavior

    19/94

    8

A query is defined as a string of characters or word(s) entered into an information retrieval system. A query can contain multiple strings of characters or words (Korfhage, 1997). Query-level analysis usually involves examining query length, query complexity, and failure rate. Query length represents the number of words or unique character strings in a query. Query syntax looks at the specific components comprising the words or strings; this can range from the use of special symbols like hyphens to Boolean operators, even examination of capitalization and spelling. Failure rate quantifies how often a searcher receives no information matches for their character string submission. Today's search logs usually report failure rate as "number of hits." When searchers receive no results matching their query, the number of hits equals zero.
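The query-level measures above reduce to simple counting once the log is parsed. A sketch, with invented (query, hits) pairs standing in for real log records:

    queries = [("travel office", 42), ("mhp", 0), ("pay band 4", 7), ("", 0)]

    lengths = [len(q.split()) for q, _ in queries if q]  # words per non-blank query
    zero_hit = sum(1 for q, hits in queries if q and hits == 0)
    failure_rate = zero_hit / len(lengths)

    print(f"mean length: {sum(lengths) / len(lengths):.2f} words")
    print(f"failure rate: {failure_rate:.1%}")  # share of queries with no hits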

A term is defined as a string of characters separated by some delimiter such as a tab, space, comma, or colon. It is up to the researcher whether to include special syntax or delimiters in the queries or terms. There are impacts to the analysis whether you keep them or remove them, such as how unique semantic terms are defined. Term analysis involves evaluating the number of characters in a term, the frequency of the term, and its tendency to appear with other terms in queries or the corpus. High usage terms are those terms that occur most often in a search corpus and are easily identified by tokenizing queries (splitting multiple-term queries into single terms) and counting identical terms. Mutual information, or term co-occurrence, measures the occurrence of term pairs. In "Mining Longitudinal Web Queries: Trends and Patterns," Wang, Berry, and Yang (2003) examine co-occurrence with queries extracted unobtrusively from the website of the University of Tennessee, Knoxville (UTK). To promote statistical consistency in the ORNL search model, the present study employs these authors' methodology for queries and word pairs.
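As a sketch of term co-occurrence scoring, the code below uses the standard pointwise mutual information formulation, PMI(a, b) = log2(P(a, b) / (P(a) P(b))), estimated from query counts. Wang, Berry, and Yang (2003) define their own estimator, which may differ in detail; the queries here are invented examples.

    import math
    from collections import Counter
    from itertools import combinations

    queries = [["pay", "band"], ["pay", "band", "chart"], ["travel", "office"]]

    # Count each word once per query, and each unordered word pair once per query.
    word_n = Counter(w for q in queries for w in set(q))
    pair_n = Counter(p for q in queries for p in combinations(sorted(set(q)), 2))
    N = len(queries)

    def pmi(a, b):
        pair = pair_n[tuple(sorted((a, b)))]
        return math.log2((pair / N) / ((word_n[a] / N) * (word_n[b] / N)))

    print(pmi("pay", "band"))  # positive: the pair co-occurs more than chance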

    Data Preparation

Data preparation is the most important and time-consuming component of TLA. Cleaning the raw log files usually requires identifying format and data record errors through visual inspection of the file. Depending on the size of the file and the type of errors, a single editing script might be sufficient. More likely, the search file will contain hundreds if not thousands of problem records, many requiring a unique editing solution; in such instances, manual edits and multiple scripts are required. Typically, the percentage of corrupted data is small relative to the overall data set (Jansen, et al., 2009). Data preparation also includes identifying exclusion data. Exclusion data are special instances of data, like addresses or phone numbers, that are excluded from analysis because they would negatively impact the search log analysis objectives. The last step in data preparation is importing the clean TLA data into a relational database or log analysis software tool and calculating the standard interaction metrics that will serve as a basis for further analysis (Jansen, 2006).
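A sketch of the cleaning and exclusion step just described: drop malformed records and records matching exclusion patterns. The expected field count and the patterns (phone-number-like and e-mail-like strings) are illustrative, not the rules used in this study.

    import re

    EXCLUDE = [re.compile(r"\b\d{3}[-.]\d{4}\b"),   # phone-number-like strings
               re.compile(r"\b\w+@\w+\.\w+\b")]     # e-mail-like strings

    def clean(lines, expected_fields=19, sep=","):
        kept, dropped = [], 0
        for line in lines:
            fields = line.split(sep)
            if len(fields) != expected_fields or any(p.search(line) for p in EXCLUDE):
                dropped += 1  # corrupted record or exclusion data
                continue
            kept.append(fields)
        return kept, dropped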

    Data Analysis

The best way to manage search log queries for multiple types of analysis is through a robust relational database management system (RDBMS). Importing and tracking each query as a unique event affords traceability from derived characterization data. It is simpler in an RDBMS to attach additional attributes to each record and to correlate across a diverse population of records. Statistical analysis should include at least the mean, standard deviation, and median wherever possible if you intend to compare results across studies. All data should be presented at the lowest unit of measure, avoiding aggregated category values at all cost (Jansen & Pooch, 2001). Lastly, the RDBMS method for storing quantitative data is optimal for secondary analysis.
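A minimal sketch of the relational approach, using SQLite as a stand-in for whatever RDBMS a study actually uses; the schema is hypothetical, but the sample row is the query record discussed in Chapter 3.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE query (
                     id INTEGER PRIMARY KEY,
                     submitted TEXT,        -- timestamp of submission
                     text TEXT,             -- the raw query string
                     num_results INTEGER)""")
    con.executemany(
        "INSERT INTO query (submitted, text, num_results) VALUES (?, ?, ?)",
        [("2007-09-17 06:30:14", "intimal hyperplasia", 6)])
    # Derived attributes (word counts, topic labels) can later be joined on
    # query.id, which is what makes each query traceable as a unique event.
    print(con.execute("SELECT AVG(num_results) FROM query").fetchone())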

    Literature Review Objectives

In support of this study, an extensive literature search was conducted using online sources, conference proceedings, technical articles, and two significant reference books: "Web Search: Public Searching of the Web" by Jansen and Spink (2005) and the "Handbook of Research on Web Log Analysis" edited by Jansen, Spink, and Taksa (2009). The latter is a must for anyone considering TLA research.

The criterion for related work in this study was that it must be focused on context analysis using TLA methods and involve an intranet or an academic web site. This study presumes academic web sites qualify as intranet-like sites, as they have limited access (password-protected accounts) and employees use the same enterprise search site. It also presumes role-based access; that is, staff have access to more information than students. Qualifying studies were placed in one of three context analysis categories: 1) Session Analysis, 2) Longitudinal Analysis, and 3) Visual Presentation of Information Needs.

A session study usually involves analyzing query information specific to individual measures like length of session and the average number and length of sessions per user. Sometimes it will involve analysis of click-through behavior, which is done to see where the searcher has been or to predict where they are going next.


Longitudinal analysis is temporal query analysis and is usually focused on analyzing query trends for a single search site across multiple time increments, usually months and/or years. These types of studies (Stenmark & Jadaan, 2006; Wang, et al., 2003) look at query and token frequencies to identify popular queries (top 100 and top 25), words, word pairs, and triples. Most include characterizing words in the corpus using Zipf distributions. Only one evaluates term co-occurrence using mutual information statistics.
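The Zipf characterization these studies report amounts to ranking the corpus words by frequency and plotting rank against frequency on log-log axes, where a roughly straight line indicates a Zipf-like distribution. A sketch, with a tiny stand-in corpus:

    from collections import Counter
    import matplotlib.pyplot as plt

    words = ["pay", "pay", "pay", "band", "band", "travel"]  # stand-in corpus
    freqs = sorted(Counter(words).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)

    plt.loglog(ranks, freqs, marker="o")   # log-log axes: Zipf appears linear
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title("Zipf plot of query vocabulary")
    plt.show()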

Visual presentation of information needs focuses on research methods used to identify what users are looking for and ways to visually represent the results in a topic map. These studies usually involve quantitative analysis of queries resulting in the clustering or aggregation of query information into topics.

    Session Analysis

A literature review suggests Dick Stenmark's article "Searching the intranet: Corporate users and their queries" (2005) is one of the first intranet studies of web sessions. The study was done for SwedCorp, a commercial vehicle manufacturing company, using the UltraSeek search engine by Verity. Session analysis is difficult because there is no variable in the UltraSeek log file that indicates when a user begins and ends a search. The single item that varies across these studies is the time threshold defining a search session; this study chose 13-minute session boundaries. After determining the threshold, Stenmark analyzed the data to determine session length in terms of interactions per session, the elapsed time of each session, and the distribution of the sessions. The study also involved query analysis, reporting the number of queries, zero-term queries, and repeat queries. Single-term queries dominated, with no query containing more than 9 terms. Stenmark's study (2005) is relevant to the ORNL study because it too looks at intranet queries. Some of the results from the ORNL study can be compared to the SwedCorp results with the following caveats: the SwedCorp study involves the UltraSeek search engine, which limits indexing of intranet information to URLs only. This limits the search study to page results that link to text documents, not real enterprise multimedia or applications search. UltraSeek is also an anonymous search engine, and because it doesn't know who you are and what you can have access to, it restricts you from all sensitive intranet information. This is a good example of why intranet studies are needed on newer search engines like Microsoft Office SharePoint Server (MOSS). MOSS logs do give indications as to when a user starts and stops a session. MOSS does not limit what is counted; for example, access to all media in all pages is counted, not just single URL page access. MOSS knows who the user is because it employs password-protected access. Lastly, MOSS is able to index, not just filter, which means it indexes more than URLs.

"Mining Web Search Behaviors: Strategies and Techniques for Data Modeling and Analysis" by Wang, et al. (2007) used the 80-20 empirical rule to develop an interactive web tool for exploring certain query session thresholds. The Wang, et al. (2007) study analyzed many of the same query and session issues as Stenmark's (2005) study, but the implementation was quite different. This study implemented a highly granular, comprehensive relational data model which maximized transactional data inclusion and expansion. Great detail was included in the data section describing data preparation, processing, and construction of the data model. The concept of the data model was the inspiration for the ORNL data model. The data used in this analysis was from multiple sites (Excite, HealthLink, and UTK), with only one of the three, UTK, qualifying as an intranet-like site. The only variables that are available for comparison in this study are top queries and unique queries. Fortunately, Wang, Berry, and Yang (2003) also have an earlier longitudinal analysis, using four years of UTK search data stored in a relational data model, that is relevant to the ORNL study.

    Longitudinal Analysis

"Intranet Users' Information-Seeking Behavior: an Analysis of Longitudinal Search Log Data" by Stenmark and Jadaan (2006) is focused on temporal characterization of intranet users across three different years, comparing results to public web studies. In the 2006 study, Stenmark and Jadaan evaluated SwedCorp's query data submitted to its InfoSeek search site. While the paper also includes some session analysis data, the bulk of the analysis focused on the search queries. Their query analysis reported, for each year, the number of queries, empty queries, and single terms, and the average and maximum number of terms in a query. Results pages viewed were also analyzed, with reports on the number of explicit pages and the mean and maximum number of results pages viewed. Stenmark and Jadaan's (2006) study suggests intranet users engage in fewer and shorter search sessions than in the public web studies, and the length of intranet query submissions is significantly shorter than public searches. This study certainly gives some results that can be compared to ORNL results. However, Stenmark and Jadaan's (2006) study tends not to discuss cleansing and processing of the data, a lack of methodological substance.

Another article by Stenmark in 2006, "What are you searching for? A content analysis of intranet search," involves a pure intranet study done using Volvo intranet search logs. It was a longitudinal study from 2002 through 2004, although not over the same months or even days across years. This study not only involved typical query analysis but included an open card sort exercise to derive topics from query terms. Zipf distributions were used to characterize the word corpus. Some analysis was done regarding term pairs and triples, as well as advanced statistics on word pairs. He also includes linguistic analysis of Boolean operators. Many of the reported results will be useful for comparison. While this study is more comprehensive in the area of context analysis, it still does not provide much substance on methodology.

"Mining Longitudinal Web Queries: Trends and Patterns" by Wang, Berry, and Yang (2003) entails the analysis of four years' worth of UTK site search logs (May 1997 to May 2001). The research objectives were very user-oriented: understanding user web query behavior, identifying search problems, and developing techniques for optimizing query analysis. A comprehensive characterization of queries was done, along with word associations using Zipf distributions. What stands out in this query study is that the paper logically presents its data processing and analysis techniques in detail. A web query entity relationship model helps describe each step in the process and how the relational data management structure was built. It was easy to see how the same measurements could be produced with the ORNL data set. This paper provides an extensive roadmap for contextual search analysis.

Visual Presentation of Information Needs

There is only one relevant publication that falls into this category, "A Dual-approach to Web Query Mining: Towards Conceptual Representations of Information Needs" by Wang (2006). This study also examines University of Tennessee, Knoxville (UTK) queries, but with an added focus on web clustering for identifying what information users are seeking. The strategy was to analyze mutual information values and similar queries of a single user session for the purpose of identifying semantically related terms. Mutual information was certainly helpful, but threshold boundaries were needed to more tightly identify sessions and thus topic branching. The visual representation of semantic networks was interesting because it helped describe the relationship between unique high-frequency terms and word pairs. It also demonstrated how mutual information values can be used to help cluster words based on association strength.


    Conclusion

A granular relational data model first introduced by Wang, Berry, and Yang (2003) for Web query analysis was adopted and modified for data mining and analysis of the ORNL query corpus. The ORNL query corpus is characterized using the Zipf distributions, log-log graphs, and descriptive word statistics found in Stenmark and Jadaan (2006) and Wang, et al. (2007), respectively. User search vocabulary is analyzed using frequency distribution and probability statistics (mutual information), a methodology attributable to Wang, Berry, and Yang (2003). Results from both of the aforementioned studies will be used for results comparison. The ORNL study will build on visual topic identification using mutual information values, similar to the study by Wang (2006).


    Chapter 3

    Methods

    Research Environment

This research is based on analysis of web query logs from ORNL's intranet. ORNL is a multi-program science and technology laboratory managed for the Department of Energy (DOE) by UT-Battelle, LLC. ORNL is also the Department of Energy's largest science and energy laboratory. Scientists and engineers at ORNL conduct basic and applied research. Their goal is to develop scientific knowledge and technology that strengthens the nation's leadership in six key areas of science: energy science, high-performance computing, neutron science, materials science at the nanoscale, systems biology, and national security. ORNL also performs other work for DOE, including isotope production, program management, and science-related information management (http://www.ornl.gov/).

ORNL has over 4,600 staff and approximately 3,000 guest researchers at the laboratory every year. Staff and visitors are a mix of U.S. and foreign citizens. Educationally, they represent a mix of technical professionals, degreed workers, and students at both the graduate and undergraduate levels.

In 2007 ORNL replaced its Verity UltraSeek search engine with Microsoft Office SharePoint Server 2007. SharePoint content shared through this tool includes document libraries, picture libraries, lists, discussion boards, surveys, and individual and shared web sites and web workspaces. The ORNL SharePoint search engine indexes about 200 public and internal web servers, covering close to 1,000,000 documents. This search server change netted nearly a three-fold increase in the number of documents searched by users and removed strict anonymity from the ORNL intranet search process. Now that ORNL searchers are being exposed to three times as many information sources, it is more important than ever to make sure that the results provided to users via intranet search are relevant.

The search engine unobtrusively generates several log files. Which log file is used for analysis depends on the TLA research questions and research objectives. The three types of files used in this study are usage reports, access logs, and query logs. The next section provides a structural description of the different files, followed by a description of how the data was prepared and processed for analysis.


    The Data

MOSS uses many different log files to help collect user search information. Collection of transaction log files is automatic and unobtrusive, so no special data collection was required for this study. Analyses of these incidental logs provide an effective review of the overall ORNL user search experience. MOSS provides three key sources of usage information: the administrator search usage data, as well as access and query log information. Each can be beneficial in understanding how people are generally using the intranet site and what information they are looking for. In combination, they provide deeper insight into general user search behavior. All queries analyzed in this study were submitted through the MOSS search engine and occurred between September 17 and December 31, 2007.

    Data Structure

MOSS 2007 Search uses Internet Information Services (IIS) standards to capture transaction information from users and stores the output in World Wide Web Consortium (W3C) extended log file format. A W3C IIS file manager utility came with MOSS, and it was used by the ORNL web manager during installation to choose which information is important to regularly collect for the organization.

The first W3C IIS file we will discuss is called the access log. It contains the date and time a transaction was recorded, the address of the server which made the log, the Internet Protocol (IP) address of the requestor, the type of browser the request was made in, a query submission if one was made, the URL address of the clicked or downloaded item, the type of page the user selected, and the length of time the request took. The fields in the file are space-delimited and maintain a strict order: the date, the time stamp, the name where the search service was running, the log location, the path of the item downloaded, the query issued, the individual requesting search access, and the type of browser used to search (Table 1). Here is an example of that data log from ORNL:

    2007-09-17 21:45:14 W3SVC758222333 111.xx.x.xx GET /SearchCenter/_themes/Lichen/pagebackgrad_lichen.gif - 80 ORNL\111.xx.xxx.xxx Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.0;+SLCC1;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506;+MS-RTC+LM+8;+InfoPath.2;+.NET+CLR+3.5.21022) 200 0 0 203

    2007-09-17 21:45:14 W3SVC758222333 111.xx.x.xx POST /searchcenter/Pages/Results.aspx k=mhp&s=All+Sites 80 - 160.xx.xxx.xxx

Table 1. Defines all the information fields that are available in the ORNL access log

    date             The year, month, and day the entry was recorded
    time             The time the log entry was recorded, in UTC
    s-sitename       The Internet service name and instance number that was running on the client
    s-ip             The IP address of the server on which the log was created
    cs-method        Command issued by the user, like GET or POST or PASS
    cs-uri-stem      The path of the item downloaded or posted
    cs-uri-query     The query, if any, that the client submitted. A Universal Resource Identifier (URI) query is necessary only for dynamic pages.
    s-port           The server port
    cs-username      The name of the authenticated user who accessed the server. Anonymous users are indicated by a hyphen.
    c-ip             The IP address of the client
    cs(useragent)    The type of browser that the client used
    sc-status        The HTTP status code
    sc-substatus     The substatus error code
    sc-win32-status  The Windows status code
    time-taken       The length of time that the action took, in milliseconds
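A sketch of parsing one access-log record into the Table 1 fields. The sample records above are whitespace-delimited, so this splits on spaces; the field order follows Table 1 and might need adjusting against a real log.

    FIELDS = ["date", "time", "s-sitename", "s-ip", "cs-method", "cs-uri-stem",
              "cs-uri-query", "s-port", "cs-username", "c-ip", "cs(useragent)",
              "sc-status", "sc-substatus", "sc-win32-status", "time-taken"]

    def parse_access(line: str) -> dict:
        # Map each whitespace-separated token onto its Table 1 field name.
        return dict(zip(FIELDS, line.split()))

    rec = parse_access("2007-09-17 21:45:14 W3SVC758222333 111.xx.x.xx GET "
                       "/searchcenter/Pages/Results.aspx k=mhp&s=All+Sites 80 - "
                       "160.xx.xxx.xxx Mozilla/4.0+(compatible;+MSIE+7.0) 200 0 0 203")
    print(rec["cs-uri-query"])  # -> k=mhp&s=All+Sites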


The second type of W3C IIS file used in this study is the query log file. The query file contains data that, when analyzed, can provide insight into query volume trends, top queries, click-through rates, queries with zero results, search topics, and various detailed query-level statistics. For extended query analysis and reporting, query log export data is provided in Excel files.

Search query logging is enabled by default in the MOSS Shared Services Provider (SSP). The information tracked in the query log includes the query terms used, the search results returned for search queries, and the pages that were viewed from the search results. The search usage data is beneficial in understanding how ORNL users are searching and identifying the type of information they are downloading. Below is an example of a single record of that ORNL file. Each record contains 19 fields, and individual fields are separated by commas:

    NULL, intimal hyperplasia, 9F73D42F-7E3D-4508-B5C0-89885EFEB222, All Sites, NULL, 6, 0, NULL, 2007-09-17 06:30:14.497, 2007-09-17 06:30:40.870, 0, 0, https://sharepoint.ornl.gov/search/Pages/results.aspx, ORNLMOSSINDEX, 0, 0, 0, NULL, NULL

This sample record shows that a user typed the query string "intimal hyperplasia" in the ORNL search box, as indicated in field two. The search yielded six results, as listed in field six of the record. None of the results were clicked on the results page, as indicated by the term NULL in the first record field. This suggests the user was not satisfied with the results or was interrupted in the search process. Fields nine and ten contain date timestamps, the first indicating when the search was submitted (9/17/2007 at 6:30 in the morning) and the second indicating at what time the result URL was clicked. Since no URL was clicked in this instance, only the date occupies this field. These fields, along with the number of results, the clicked URL rank, and the clicked URL, were used in this study. A complete list of fields and their definitions in the query log can be found in Table 2.
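To make the field positions concrete, a minimal Python sketch of how such a record could be split into named fields follows. This is an illustration only, assuming simple comma delimiters; it is not the processing code used in the study, and the field names follow Table 2.

    FIELDS = ["clickedUrl", "queryString", "siteGuid", "scope", "bestBet",
              "numResults", "numBestBets", "clickedUrlRank", "searchTime",
              "clickTime", "advancedSearch", "continued", "resultsUrl",
              "queryServer", "numHighConf", "didYouMean", "resultView",
              "contextualScope", "contextualScopeUrl"]

    def parse_query_record(line):
        # Split the comma-delimited record and pair each value with its field name.
        values = [v.strip() for v in line.split(",")]
        record = dict(zip(FIELDS, values))
        # A clickedUrl of NULL means no result was clicked for this search.
        record["clicked"] = record.get("clickedUrl") != "NULL"
        return record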


Table 2. Information fields and definitions of the ORNL collected query log

    MSSQL QUERY LOG DEFINITIONS
    1   clickedUrl           URLs clicked on the results page
    2   queryString          The query text of the search that was executed
    3   siteGuid             The ID of the site or site collection from which the search query was executed
    4   scope                Defines the limits of the searchable space, for example All Sites (Search Center, top-level site, sub-sites, or Lists & Libraries), This Site (current site and all its sub-sites), This List (Lists & Libraries), or People (on All Sites)
    5   bestBet              Keyword terms as described by the administrator to enhance search results; can also be called a "synonym ring" (a glossary of names, processes, and concepts)
    6   numResults           The number of relevant results returned for the search query
    7   numBestBets          The number of best bets returned for the search query
    8   clickedUrlRank       The result position of the clicked URL
    9   searchTime           The date and time when the search was executed
    10  clickTime            The time when the resulting URL was clicked
    11  advancedSearch       In many cases, users type a keyword phrase in the search box and then click the Go Search button or press Enter to execute their query. If this technique does not produce the result they are looking for on the first few pages of search results, some users will give up; advanced users, however, tend to try again by using a more advanced query to target the content they are looking for.
    12  continued            Identifies the last entry corresponding to a search query
    13  resultsUrl           The URI of the page where the ranked results were posted
    14  queryServer          The name of the query server on which the search query was executed
    15  numHighConf          The number of high confidence results returned for the search query
    16  didYouMean           Identifies whether a spelling suggestion was returned (0 = yes, 1 = no)
    17  resultView           Identifies the order in which relevant results were ordered
    18  contextualScope      The contextual scope under which the query was executed
    19  contextualScopeUrl   The URI of the contextual scope


Lastly, usage report information from the search site reporting service was used to complement the access and query log information. Usage reporting is a service that enables intranet SharePoint site administrators to monitor high-level statistics about the use of their sites, and it includes usage reports for search queries. Items selected from that report for this study were the top queries in the last 30 days of the query log data set, the average number of search requests per day and per month, and the search results of the top destination page types.

    Preparation

Data preparation included developing a plan for cleansing and anonymizing the transaction log data. Cleaning the data involved removing data errors; anonymizing the data involved removing personal user information as well as descriptive ORNL network information from the logs. The query logs contain not only query requests but also identifying information about the person who initiated each request. Martin (1997), an early information scientist with a legal background, was the first to consider privacy issues raised by monitoring online information systems for studies of user behavior. This study implements Kurth's (1993) suggestions for protecting information that may reveal searcher identity. First, all personal information, like the ORNL three-letter user ID, was removed from the logs. IP addresses were anonymized by replacing all but the first three numbers of the IP address with "x". Session analysis was not performed, so there was no need to track individual user session information. Permission to use the data was secured from the CIO of the organization after submitting a reasonable data security plan. Permission to publish the results was granted after review.
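As an illustration of the masking applied, a small Python sketch follows. It keeps only the first octet of an IPv4 address and replaces every remaining digit with "x"; the function names and field name are illustrative, not the actual cleansing scripts used in this study.

    import re

    def anonymize_ip(ip):
        # Keep the first group of numbers; mask every remaining digit,
        # e.g. "111.22.3.44" -> "111.xx.x.xx".
        first, _, rest = ip.partition(".")
        return first + "." + re.sub(r"\d", "x", rest)

    def strip_user_id(record, user_field="cs-username"):
        # Remove the personal identifier (e.g. the ORNL three-letter user id).
        record[user_field] = "-"
        return record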

The access logs were mined in their native format using the Log Parser 2.2 tool and therefore did not require any special processing. The usage reports also did not require manipulation. The query log, however, did require data cleansing, parsing, and in some cases reformatting.

Initial review of the query transaction log found structural issues among the file's records. Queries involving names of authors were distributed across multiple record cells; these strings were concatenated and used to replace the partial information in the original query string field. Additionally, a small number of records had the term "efaultproperties" in the query string field, and the remainder of the row data was shifted by one cell. These query strings were deleted and the remaining data moved


left by one column. The remaining query statements were at least contained inside the query column, though some exhibited strange forms.

MOSS supports four basic types of keyword syntax for search: prefixes, phrases, and single or multiple words. Querying the system is not case sensitive, and Boolean logic is not required. It was clear just from examining the first 2,000 records of the query log that users did not clearly understand the query rules of the MOSS search system. Table 3 depicts some of the non-compliant and unusual queries.

Table 3. Examples of unsupported query strings submitted by ORNL searchers

    Record   Query String                              Type
    24       (blank)                                   blank
    283      10/2007 international festival            dates
    22       efaultproperties                          error
    23       '+vascular +injury                        Boolean operators
    10       vascular_injury                           special character
    106      fmla                                      acronyms
    333      4200000162                                numbers
    397      "recruiting coordinator"                  quotations
    437      share.ornl.gov                            partial web addresses
    587      https://share.ornl.gov/projects/doe_bap   URIs
    606      zip-code                                  hyphenated terms
    1817     ji*                                       wildcards


A cursory glance at the query string structures suggested many parsing rules were needed to establish what a qualifying query would look like. Special characters and punctuation such as quotation marks, Boolean logic operators, commas, underscores, and backslashes were removed as long as removal did not impact the context of the query. The context of a query was validated by examining queries near the target query. Blank queries were filtered, but not removed. Since the study focused on user vocabulary, most queries containing numbers were removed. The exception to the rule was when numbers gave special context to a word or phrase, e.g., W-2, 401K, etc. For practical purposes of lexical study, all building numbers, phone numbers, office numbers, conference room numbers, and form numbers were deleted. URL addresses submitted as queries were also removed.

    Processing

A commercially available software tool called Log Parser 2.2 was used to mine data within the access text file. Log Parser is a powerful, versatile tool that provides universal query access to standard IIS text-based data such as log files. Using the Log Parser 2.2 tool on the access log files gives a first quick glimpse into the behavior of searchers. The first step is to determine what data is valuable to the study and identify it by term, for example identifying popular browser types. The next step involves telling the parser to retrieve the data about browsers, called cs(User-Agent) in the access log, and then telling the parser how to process the data. The results of a query can be custom-formatted as text-based output, or they can be directed to other outputs such as SQL or a chart. Appendix A presents the bulk of the queries that were created to mine the access log file in this study. Details of the output and how it was analyzed can be found in the Analysis section. Again, Table 1 defines all 15 of the information fields that are available in the ORNL access log. These fields are easily identifiable in the Log Parser 2.2 script examples found in Appendix A.
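As a concrete illustration of this kind of aggregation, the sketch below tallies the cs(User-Agent) field in plain Python rather than Log Parser. It is a stand-in, not one of the Appendix A scripts, and it assumes a standard W3C log whose #Fields directive lists the column names from Table 1.

    from collections import Counter

    def top_user_agents(log_path, n=10):
        counts, ua_index = Counter(), None
        with open(log_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if line.startswith("#Fields:"):
                    columns = line.split()[1:]       # column names follow the directive
                    ua_index = columns.index("cs(User-Agent)")
                elif line.startswith("#") or ua_index is None:
                    continue                         # skip directives and preamble
                else:
                    parts = line.split()
                    if ua_index < len(parts):
                        counts[parts[ua_index]] += 1
        return counts.most_common(n)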

The analysis requirements for the entire study must be considered as the data model is constructed. An ORNL data model was constructed to assist in database design. The query data model represents the specific data needed to meet the analysis requirements. It also defines processing constraints within the query corpus and depicts the relationships between data entities (Figure 1).


    Figure 1. ORNL web query ER model for relational database

The original query log (an Excel file) was cleaned and processed prior to import into Microsoft Access 2007 (A). Additional processing of the query log included isolating and normalizing the cleaned query strings by converting all text to lower case. By normalizing the text we remove the distinction between "The" and "the", and between "ldrd" and "LDRD". Normalization did not include removing affixes, for example removing "ing" from "parking" to leave the word "park"; too often this can dramatically change the context of a query. The case normalization had a positive impact on query counts and on accurately determining high-frequency queries and terms. Spelling errors were counted as unique word occurrences. The length of the query string was derived, which includes


a count of all characters in the query string, including spaces. The resultant data was imported into a data table called Clean Queries (B).
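A minimal Python sketch of this normalization step (illustrative only; the actual processing was performed during the cleaning and import):

    def normalize_query(raw_query):
        # Case-fold only: "The"/"the" and "LDRD"/"ldrd" collapse together,
        # but affixes are kept ("parking" is not stemmed to "park").
        clean = raw_query.strip().lower()
        char_count = len(clean)          # query length, spaces included
        return clean, char_count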

The next processing step was to tokenize the query strings by parsing the words into word tokens. Each token word retains its Clean Query ID (C_QID) number and is parsed into a single record with an assigned string position number which identifies the position the word occupied in the clean query (C). For example, clean query number 145 is "business operations calendar". It is split on white space into three tokens: (145, business, 1), (145, operations, 2), and (145, calendar, 3). Unique tokens are found by removing all duplicate tokens. Tokens are then spell checked and, if required, categorized as misspelled words (designated as a 1 or 0 in the attribute "case"). Spell check was used in the first pass through the data. Human review was also required to check for acronym and abbreviation spelling.

The last processing step parsed single word tokens into word pairs (D). This

processing was necessary to calculate co-occurrence values or mutual information statistics. Mutual information statistics define the relationship between two words: the higher the value, the tighter the relationship. Mutual information results can help drive the construction of a next-word index or can assist in clustering web queries. If web queries can be clustered and classified into information categories, we can determine what topics searchers are looking for.
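The tokenization and word-pair steps described above can be sketched in a few lines of Python (a minimal illustration mirroring the clean query 145 example, not the scripts actually used):

    def tokenize(c_qid, clean_query):
        # Split on white space, keeping the Clean Query ID and 1-based position.
        return [(c_qid, word, pos)
                for pos, word in enumerate(clean_query.split(), start=1)]

    def word_pairs(clean_query):
        # Adjacent word pairs for the co-occurrence (mutual information) step.
        words = clean_query.split()
        return list(zip(words, words[1:]))

    # tokenize(145, "business operations calendar") yields
    # [(145, 'business', 1), (145, 'operations', 2), (145, 'calendar', 3)]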

    RDMS Development

    A granular relational data model first introduced by Wang, Berry, and Yang

    (2003) for Web query analysis was adopted and modified for data mining and analysis of

the ORNL query corpus. The relational data structure is computationally optimal for

    large data sets that have to be repeatedly processed. Such a data structure also provides

    a rich environment for multifaceted analysis.

The ER model displayed in Figure 1 was used to create the ORNL relational data model, or schema, shown in Figure 2. The ORNL query relational database consists of six tables, each representing a distinct data topic: Transaction Log, Clean Queries, Unique Queries, Unique Query Tokens, Unique Tokens, and Mutual Information. The Transaction Log merely represents all of the original log queries. The MSSQL query log was imported directly into the Transaction Log table. The table was assigned


a primary key automatically by Excel (QID). The field names remained static, except for NumResults, which was changed to Num_Hits.

The next table created was Clean Queries. The relationship between the Transaction Log and Clean Queries tables is one-to-one. The Clean Queries table contains information about the time each query was submitted (Year, Month, and Date) and the elapsed time, which is defined as the time from when the search was initiated until a resultant URL was clicked (Time_Taken). It also contains Tsec, which is simply Time_Taken converted into seconds, numHits, and the query_string_clean. Clean Queries has a primary key of C_QID and a foreign key of QID.

The Unique Queries table was derived from Clean Queries and stores only unique queries. The primary key for Unique Queries is C_QID. A Visual Basic (VB) script was written to count the occurrences of the unique queries inside the Clean Queries corpus (Appendix B). The counts are stored in a record field called Query_Clean_Freq and represent how many times each query occurred in the entire ORNL search corpus. Another VB script was written to determine the number of words contained in each query (Appendix B). Lastly, a field was added called CharCount. This field contains information regarding query length, which is defined as the number of characters contained in the query, including whitespace. This field was added to the table by inserting a data formula in table design mode. The relationship between Clean Queries and Unique Queries is one-to-one.

Repeat queries happen quite often in query sessions and across a query corpus.

Many common queries are submitted by different searchers, and less often duplicate queries are submitted by a single user within a web search session. There are several reasons why this occurs; most often the user cannot understand why there were no results and in disbelief resubmits the same query. Repeat query submission can also occur by accident. Below is an example of an individual query that was very specific, three keywords, but contains a typo. The number of hits for the first query was zero, so the user next typed only a single keyword, the first word of the previous query, "relocation". The user received 1,024 hits on this query, plenty of information, but was it the right information? Next, the user submitted the same query, again receiving 1,024 results. This was obviously not the information the searcher desired, so they altered their query a fourth time and received only 256 results. The session terminates at this point, which means the user finally found their information, their search was interrupted, or they just gave up on the search.


    Initial Query                      relocation perdiem    0
    Query Revision 1                   Relocation            1024
    Query Revision 2                   Relocation            1024
    Query Revision 3 and Final Query   per diem              256
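The occurrence counts behind such observations came from the VB scripts in Appendix B; the idea can be sketched in Python as follows (illustrative only, not the Appendix B code):

    from collections import Counter

    def unique_query_table(clean_queries):
        # Collapse repeats and record, per unique query: corpus frequency
        # (Query_Clean_Freq), word count, and CharCount (spaces included).
        freq = Counter(clean_queries)
        return [(q, n, len(q.split()), len(q)) for q, n in freq.most_common()]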

To support vocabulary analysis, the Unique Queries had to be broken down into word elements (see Figure 3). In linguistics this is called tokenization. Each unique query was broken down into single text segments, with each child segment retaining its mapping, or position, inside the Query_String_Clean. The relationship between Unique Queries and Unique Query Tokens is one-to-many, each having the same primary key C_QID.

    Figure 2. ORNL query database, highlighted tables supported query level analysis


The Unique Tokens table was designed to keep track of all the words that comprised unique tokens. It too has C_QID as a primary key. It also contains two counts: Freq_in_Corpus, which indicates how often the word appeared in the entire corpus, and Freq_in_Query, which indicates how many times the word appeared in a query. For example, the query "Maryville College Maryville Tennessee" has three unique tokens: 1) Maryville, 2) College, and 3) Tennessee. The Freq_in_Query value of the string Maryville for this C_QID is 2. The Freq_in_Corpus value is much higher. The last field in this table is CharCount, and it describes the number of characters in the unique token string.

    Figure 3. ORNL query database, highlighted tables supported vocabulary analysis


The last table is the Mutual Information (MI) table, which contains information specific to unique word pairs, along with their joint frequency (F12) and a value that describes how closely the word pairs are related, I(w1,w2). Frequencies for each token in the word pair are also in the MI table and were imported from the Unique Tokens table (Freq_in_Corpus1 and Freq_in_Corpus2). The primary key for this table is WP_ID and the foreign key is C_QID. The relationship between Unique Tokens and Mutual Information is many-to-one.

    Methods

    Mutual Information Analysis

Word analysis is a subcomponent of linguistics, the scientific study of natural language. Words are the smallest semantic units that comprise language, and it is their patterns of occurrence in text and phrases such as intranet queries, either in isolation or as pairs, that can help us understand the searcher's intent. This analysis focused on word pairs for queries with 2 ≤ n ≤ 14, where n is the number of terms or words in the query.

Mutual Information measures the dependence that the words in a word pair have on each other. It is a common measurement used in Information Theory to quantify relationships between words found in text or queries. It is theorized that mutual information values can be used to resolve query translation and query term management. Query term translation may be cross-language or within-language, for example translating query word pairs to key terms in a synonym index. Query translation may also be referred to as query expansion, which means the query is not replaced by new terms; rather, the query is revised to include new terms or to change the order of the terms to give it new semantic meaning based on its original interpreted conceptual intent. Mutual Information study is also sometimes referred to as collocation-based similarity or co-occurrence, word association study, and bigram analysis.

Mutual Information is also used to measure word association. Word association is very important in the area of information retrieval. Measuring the value of word associations empirically was largely developed by American psycholinguist James J. Jenkins. Psycholinguists study the psychology of language; in Jenkins's case, he focused on how words are combined to create meaning. A cornerstone study was conducted in


    1964 by Jenkins and Palermo establishing subjective norms for measuring word

    association ratios. The Palermo-Jenkins word association list was subsequently adopted

    as a standard for testing word association.

    In 1990, Church and Hanks challenged the Palermo-Jenkins standard on the

grounds that it was very subjective, and proposed measuring association norms with the

    concept of Mutual Information. Their motivation behind establishing the new measure

    was increased objectivity and reduced cost.

Mutual Information is an Information Theory term developed by Claude Shannon at Bell Labs in the 1940s. The theory is very dependent on entropy, which is a mathematical way to describe the uncertainty of a single random variable, H(X). Conditional entropy describes the entropy of a single random variable affected by another single random variable, H(X|Y). The reduction of uncertainty between these values is called Mutual Information (MI): I(X;Y) = H(X) - H(X|Y). Shannon largely used mutual information for digital communications, specifically signal data processing. He used it for data compression, which is a means to optimally store digital signal data.

Mutual information was later adopted for web-based purposes. Web-based MI was first introduced by Turney in 2001, where the method was renamed Pointwise Mutual Information (PMI). The Pointwise MI between two words w1 and w2 can be described as the log base 2 of the ratio between the probability of seeing the word pair and the product of the single-word probabilities (EQ1).

PMI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (EQ1)

In 2003, Wang, Berry, and Yang adapted it to the linguistic dependency of terms in web query strings. The Mutual Information (I) between two words w1 and w2 can be described as the natural log of the ratio between the word-pair probability (the relative frequency of the word pair over the number of cleaned queries, single- and multi-word) and the product of the single-word probabilities (each word's relative frequency over the number of cleaned queries). See equation two (EQ2) for the definition.


Mutual Information is defined over two points (words), w1 and w2, each having probabilities P(w1) and P(w2) (Church & Hanks, 1990). The mutual information formula I(w1,w2) used in this study is defined according to Wang, Berry, and Yang (2003) to be

I(w_1, w_2) = \ln \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (EQ2)

where P(w_1) and P(w_2) are probabilities estimated by the relative frequencies of the two words (see EQ3), and P(w_1, w_2) is the relative frequency of the word pair (order is not considered, therefore P(w_1, w_2) = P(w_2, w_1)).


The analysis observes all word pairs, not just the most frequently occurring word pairs in terms of strength. This ensures that the low-frequency pairs are not ignored.

The protocol for parsing all qualifying queries (queries with two words or more) in preparation for MI analysis was to break each query into adjacent word pairs. Word order is not differentiated. Two-word queries are natural word pairs. Three-word or longer queries received identical adjacent pairing. For example, the query "business operations calendar" is broken into two adjacent word pairs: 1) business operations and 2) operations calendar. Adjacent pairing in this fashion helped retain the query's semantic intent.
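A compact Python sketch of the EQ2 computation over the extracted pair and token frequencies follows; the variable names are illustrative, and this is not the database code used in the study.

    import math

    def mutual_information(pair_freq, token_freq, n_queries):
        # EQ2: I(w1, w2) = ln( P(w1, w2) / (P(w1) * P(w2)) ), with every
        # probability estimated as a relative frequency over the number
        # of cleaned queries.
        mi = {}
        for (w1, w2), f12 in pair_freq.items():
            p12 = f12 / n_queries
            p1 = token_freq[w1] / n_queries
            p2 = token_freq[w2] / n_queries
            mi[(w1, w2)] = math.log(p12 / (p1 * p2))
        return mi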

    Zipf Analysis

Zipf's Law is used to generally characterize a linguistic corpus, in this case a corpus of web queries. It states that, given some natural corpus of language, the frequency of any word is inversely proportional to its rank in the frequency table (Kali, 2003). Zipf's Law was used to characterize rank-frequency distributions of unique queries in "Searching the Web: The Public and Their Queries" (Spink, Wolfram, Jansen, & Saracevic, 2001). A double-log frequency plot is normally used to plot Zipf statistics, with the x-axis representing log (rank order) and the y-axis representing log (frequency). The corpus is considered to have a Zipf-like distribution if, when fitted with a straight line, it has a slope of m = -1. It is suggested that Web use follows a Zipfian pattern when plotted on a log-log scale (Nielsen, 1997). Zipf distributions are often used to characterize TLA components such as queries and vocabularies, page requests, and hypertext references. Starting from the upper left oval and working down to the lower right of the graph, three circles describe three key word frequencies that occur in a typical corpus's word distribution (Figure 4). In a typical Zipf distribution there are a small number of queries or words that are used repeatedly (upper left oval), another group which occurs less frequently (middle oval), and a sizeable group of words that are rarely used (lower right oval).


    Figure 4. Typical Zipf distribution plot
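A rank-frequency plot of this kind can be produced with a short Python sketch (illustrative only, assuming numpy and matplotlib are available):

    import numpy as np
    import matplotlib.pyplot as plt

    def zipf_plot(frequencies):
        # Rank 1 = most frequent query/word; Zipf-like if fitted slope is near -1.
        freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
        plt.loglog(ranks, freqs, marker=".", linestyle="none")
        plt.xlabel("log(rank order)")
        plt.ylabel("log(frequency)")
        plt.title("Rank-frequency distribution (slope = %.2f)" % slope)
        plt.show()
        return slope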

    Approach to Spell Check Query Vocabulary

The misspelled query words were identified using a custom spell checker application. The reference tools in Microsoft Word, specifically the dictionary, grammar guide, thesaurus, and spell checker, are very useful in application development. There are two key objects, denoted by the red bracket in the Word spell-check procedure (Figure 5). The first is the ProofreadingErrors collection, which is a range object containing the proofing errors; these can be any form of text (word, sentence, paragraph, or an entire document). After SpellCollection is declared as a set of range objects, SpellCollection is populated with the list of corpus misspelled words. The procedure loops through the word list, comparing the words with the Word reference tools. Microsoft provides examples of Visual Basic for Applications procedures for Word on the web. It does not take much programming experience to download and invoke the spell-checker procedure, but it does require programming skills to manage the output in a project-specific user interface.
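Figure 5 shows the Word-based VBA procedure. As a much simplified stand-in, the same flagging idea can be expressed in Python against a plain word list; this sketch does not use Word's proofing tools, and the dictionary file is an assumed input.

    def flag_misspellings(tokens, word_list_path="words.txt"):
        # Load a reference word list (a stand-in for Word's dictionary).
        with open(word_list_path, encoding="utf-8") as fh:
            known = {w.strip().lower() for w in fh if w.strip()}
        # 1 = flagged as misspelled, 0 = found in the reference list,
        # mirroring the 1/0 designation recorded for unique tokens.
        return {t: 0 if t.lower() in known else 1 for t in set(tokens)}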


    Figure 5. Spell-Check Procedure


    Chapter 4

    Results and Discussion

RQ1: What general search behaviors do ORNL searchers exhibit when searching the intranet?

Most intranet search software automatically collects basic usage information in files called access and query logs. Information that can be mined from the log files describes general organizational search behaviors. At a minimum, this information includes: what browsers the users prefer to search with, times when they search, which external search engines they are more likely to use when searching outside the intranet, what topics they are seeking via page views, how often they are receiving no hits on their requests, and what types of page results are clicked most often. The next few pages describe how this study used information contained in access and query log files to characterize the general search behaviors of ORNL searchers.

Four distinct SQL queries were developed (Appendix A) to extract data that identified general user search behaviors exhibited by ORNL intranet searchers.

a. What is the size of the ORNL search audience?
b. What interfaces do ORNL users employ most when they search?
c. What types of pages do ORNL users click most often when results are available?
d. What topics do ORNL users commonly search for?
e. When do ORNL users search and what are their search results?

The first and second of the queries focused on determining the approximate size of the ORNL search audience and characterizing what tools they are using for search inside the enterprise. The results showed that ORNL had 8,640 total distinct users between September and December of 2007, with the average visitor staying 12.2 minutes (a). The average number of unique daily visitors was 4,966. The number of users is defined as the number of distinct IP addresses that submitted search queries. This assumption does not take into consideration that one user may actually submit queries from multiple computers. The latter is highly probable, as most ORNL users have at least one desktop computer and one laptop computer. The data is useful as an estimate of the general size of the search audience for this study.
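The audience estimate rests on counting distinct client addresses; a Python sketch of that count follows (illustrative only, assuming the W3C #Fields directive names a c-ip column as in Table 1):

    def count_distinct_clients(log_path):
        clients, ip_index = set(), None
        with open(log_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if line.startswith("#Fields:"):
                    ip_index = line.split()[1:].index("c-ip")
                elif line.startswith("#") or ip_index is None:
                    continue
                else:
                    parts = line.split()
                    if ip_index < len(parts):
                        clients.add(parts[ip_index])
        return len(clients)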


The browser chosen by a user for search often has much to do with the platform and operating system. Examining page hits by browser type shows the top two operating systems for ORNL are Windows XP and Vista, followed by Mac OS (Table 4). Windows clearly represents the bulk of computer platforms at ORNL. Since the top platform OS is by Microsoft, one might assume the most popular browser is by Microsoft.

    Table 4. Top ORNL computer platforms


Browsers are software residing on computer platforms that allow users to access and search a web-based search environment like the Internet or an intranet. ORNL employees necessarily need to access the intranet to use applications, to do research, to share information, or to order equipment. The range of intranet-based tasks by user is great, so browser developers have been diligent in creating browsers with distinct performance characteristics. Two commonly utilized browsers are Internet Explorer, made by Microsoft, and Firefox, developed by Mozilla. Another browser emerging in the search environment is Chrome, by Google. Browsers have different levels of speed, reliability, ease of use, information organization, data presentation and formatting, search engine plug-ins, etc. Understanding what browsers a user audience prefers may impact how the intranet information should be organized and presented for search.

The number of distinct browsers reported for search in this study using the Log Parser 2.2 query was 494. The browser count seemed high for the data, but that was because each browser is reported as a brand (Firefox, Internet Explorer, etc.) plus a specific version, for a specific operating system (OS), for a specific OS version, etc. This was not surprising, as most lab workers have at least two computers (a laptop and a desktop) and each likely has a different OS, with varying OS version numbers, browsers, version numbers, etc. Independent of the high number of distinct browsers reported, the results showed that ORNL searchers overwhelmingly use Internet Explorer 7.0 (b) as their connection to the SharePoint search server (Table 5).

The third SQL query selected the top 20 most viewed web links, or URLs. The most popular URLs requested were pulled from the access log (Table 6). URL queries were removed from the query logs to focus solely on vocabulary analysis. The top ranking URL is the default MOSS page site. The second URL is a handler for managing web site administration requests. The third URL listed simply as /