Enterprise Users and Web Search Behavior



University of Tennessee, Knoxville
Trace: Tennessee Research and Creative Exchange

Masters Theses, Graduate School

5-2010

Enterprise Users and Web Search Behavior
April Ann Lewis, The University of Tennessee, Knoxville, [email protected]

This Thesis is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact [email protected].

Recommended Citation: Lewis, April Ann, "Enterprise Users and Web Search Behavior." Master's Thesis, University of Tennessee, 2010. http://trace.tennessee.edu/utk_gradthes/643


To the Graduate Council:

I am submitting herewith a thesis written by April Ann Lewis entitled "Enterprise Users and Web Search Behavior." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Information Sciences.

Peiling Wang, Major Professor

We have read this thesis and recommend its acceptance:

Dania Bilal, Lorraine Normore

Accepted for the Council:

Carolyn R. Hodges
Vice Provost and Dean of the Graduate School

(Original signatures are on file with official student records.)


To the Graduate Council:

I am submitting herewith a thesis written by April Ann Lewis entitled "Enterprise Users and Web Search Behavior." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Information Science.

Peiling Wang, Major Professor

We have read this thesis and recommend its acceptance:

Dania Bilal

Lorraine Normore

Accepted for the Council:

Carolyn R. Hodges
Vice Provost and Dean of the Graduate School

(Original signatures on file with official student records)


    Enterprise Users and Web Search Behavior

A Thesis Presented for the Master of Science Degree
The University of Tennessee, Knoxville

April Ann Lewis
May 2010


Copyright 2010 by April Ann Lewis
All rights reserved.


    Acknowledgements

I would like to thank Oak Ridge National Laboratory (ORNL) for supporting my graduate education and encouraging me to pursue my newly found interests in applied information science research. ORNL's Chief Information Officer (CIO) and the web server support team graciously provided me with a very robust data set and answered all questions I had regarding its format.

I would also like to acknowledge the extensive amount of work that Dr. Peiling Wang has done in the area of Web data mining and analysis. Dr. Wang's relational database model for web queries was fundamental to my data mining efforts. I have learned much as a graduate student from her research work as well as from her classroom instruction. I am honored that Dr. Wang agreed to chair my thesis committee.

I am also very grateful to have had Dr. Lorraine Normore and Dr. Dania Bilal serve on my committee, both very accomplished in complementary areas of Information Science research. Dr. Normore first introduced me to human-computer interaction (HCI) relevant to information search. One of the motivations for this thesis was characterizing the corporate user's interaction with the ORNL intranet search environment. Dr. Bilal provided me with a basic understanding of search environments, tasks related to information searching, and the theory of cognitive motivation for successful search.


    Abstract

This thesis describes an analysis of user web query behavior associated with Oak Ridge National Laboratory's (ORNL) Enterprise Search System (hereafter, the ORNL Intranet). The ORNL Intranet provides users a means to search all kinds of data stores for relevant business and research information using a single query. The Global Intranet Trends for 2010 Report suggests the biggest current obstacles for corporate intranets are findability and siloed content. Intranets differ from internets in the way they create, control, and share content, which can make it difficult and sometimes impossible for users to find information. Stenmark (2006) first noted that studies of corporate internal search behavior are lacking and appealed for more published research on the subject.

This study employs mature scientific internet web query transaction log analysis (TLA) to examine how corporate intranet users at ORNL search for information. The focus of the study is to better understand general search behaviors and to identify unique trends associated with query composition and vocabulary. The results are compared to published intranet studies. A literature review suggests only a handful of intranet-based web search studies exist, and each focuses largely on a single aspect of intranet search. This implies that the ORNL study is the first to comprehensively analyze a corporate intranet user web query corpus and provide the results to the public.

This study analyzes 65,000 user queries submitted to the ORNL intranet from September 17, 2007 through December 31, 2007. A granular relational data model first introduced by Wang, Berry, and Yang (2003) for Web query analysis was adopted and modified for data mining and analysis of the ORNL query corpus. The ORNL query corpus is characterized using Zipf distributions, descriptive word statistics, and mutual information. User search vocabulary is analyzed using frequency distribution and probability statistics.

The results showed that ORNL users searched for unique types of information. ORNL users are uncertain of how to best formulate queries and don't use search interface tools to narrow search scope. Special domain language comprised 38% of the queries. The average number of results returned per query was too high, and 16.34% of queries returned no hits.


    Table of Contents

Acknowledgements
Abstract
Chapter 1 Introduction and General Information
    Introduction
    Research Questions
Chapter 2 Literature Review
    TLA Theory and Methodology
        TLA Theory
        TLA Methodology
            Data Collection
            Data Preparation
            Data Analysis
    Literature Review Objectives
        Session Analysis
        Longitudinal Analysis
        Visual Presentation of Information Needs
    Conclusion
Chapter 3 Methods
    Research Environment
        The Data
        Data Structure
        Preparation
        Processing
        RDMS Development
    Methods
        Mutual Information Analysis
        Zipf Analysis
        Approach to Spell Check Query Vocabulary
Chapter 4
    RQ1: What general search behaviors do ORNL searchers exhibit when searching the intranet?
    RQ2: How do users formulate their queries?
    RQ3: What are the characteristics of the user vocabulary?
    RQ4: How do ORNL results compare to the published studies?
    Discussion
        General Search Behavior
        Query Formulation
        Vocabulary Analysis
Chapter 5 Summary and Conclusion
    Summary
        General Search Behavior
        Query Formulation
        Vocabulary Analysis
        Results Comparison to Published Studies
    Conclusion
        Future Study Recommendations
References
Appendix
    Appendix A
    Appendix B
    Appendix C
Vita


    List of Tables

Table 1. Defines all the information fields that are available in the ORNL access log
Table 2. Information fields and definitions of ORNL collected query log
Table 3. Examples of unsupported query strings submitted by ORNL searchers
Table 4. Top ORNL computer platforms
Table 5. Browser breakdown for ORNL users
Table 6. Top 20 URLs requested
Table 7. Distribution and categorization of page types
Table 8. Most popular external search engines
Table 9. ORNL top 25 most frequent clean queries
Table 10. ORNL top 10 queries with N-words
Table 11. Popular ORNL query words
Table 12. Select ORNL mutual information values
Table 13. All word pair sets involving the words "pay" and "band"
Table 14. ORNL results compared to published studies


    List of Figures

Figure 1. ORNL web query ER model for relational database
Figure 2. ORNL query database, highlighted tables supporting query-level analysis
Figure 3. ORNL query database, highlighted tables supporting vocabulary analysis
Figure 4. Typical Zipf distribution plot
Figure 5. Spell-check procedure
Figure 6. ORNL aggregated page types most clicked
Figure 7. ORNL topic categories
Figure 8. ORNL business category queries
Figure 9. Query counts for each month
Figure 10. Bi-monthly comparison, week 3, September & October 2007
Figure 11. The distribution of words in unique queries
Figure 12. ORNL total query count distribution from September to December 2007
Figure 13. Temporal frequency sampling of ten unique queries
Figure 14. Distribution of word length associated with ORNL unique query words
Figure 15. Zipf distribution plot of the top 100 and top 2000 words
Figure 16. Sample of unique or irregular vocabulary


    Chapter 1

    Introduction and General Information

    Introduction

Many companies are adopting internet search practices for their intranets. While the underlying search process is the same for both the Internet and the intranet, the search needs of the respective users and their environments are very different (Fagin et al., 2003). The Internet consists of users who have individualized information needs and share no understanding with the information providers. Internet users have access to an unbounded document set that may include advertisements and spam.

Conversely, ORNL intranet users search for information individually, but they share a contextual understanding of the information space with the providers. The document set or search corpus available to ORNL users is controlled and limited. Users are not exposed to advertisements or spam within the search environment. Much more is known about internet search, as many studies have been published that include search success statistics. The number of unsuccessful Internet searches reported by college students in a recent library user internet search survey was nearly 50% of all internet search submissions (Mann, 2005). It is difficult to find any similar qualitative results measured relative to intranet search.

There are two very distinct environments when it comes to web search: 1) the internet and 2) the intranet. The way these environments are viewed by both users and researchers is very different. There are only a handful of published studies regarding intranet search, but internet search reports are published nearly every three months. The most recent internet statistics were published in February (Nielson, 2010), reporting that Google is the most preferred search engine (65.2% of all searches). That same report listed Yahoo as second, losing 18% more of its previously reported search share to Google. The percentage of typical daily users has grown to nearly 50%, with users extremely positive about search engines and their search experiences (Fallows, 2008). However, in that same report users are described as generally unsophisticated about how and why they use search.

In contrast, there are no free, regular, web-based reports available to the public on intranet statistics. When in-depth reports or studies are available, they typically must be purchased. On average, intranet workers spend about 25% of their time searching for information (Feldman & Sherman, 2004). Feldman and Sherman (2004) also report that a company with 1,000 knowledge workers may waste well over $6M a year looking for information that doesn't exist, failing to find information that does, or recreating information that could have been found. The search experience for intranet users is not pleasant. A recent enterprise intranet search survey by Ward (2005, Sept. 7) found that "web-rage" was experienced after 12 minutes of fruitless search, although nearly 7% of the 566 people surveyed said they felt irritated after only three minutes.

Not only is there a difference in the internet and intranet search environments, there are also key distinctions in search engine performance and query vocabulary requirements. For example, indexing and ranking of search results on the internet can be impacted by organic linking and spam. The intranet is not affected by spam, and cross-linking is not typically practiced in corporations. The way search results are stitched together as a product of federated search is also different: the intranet has special rules for stitching, like security access, duplication, etc. Tagging of information is not implicit within the intranet, which affects indexing. This is not to say implicit tagging of items associated with the internet always results in improved search performance. Intranets tend to have a smaller or narrower search vocabulary due to special domain language.

The functional capability of a dynamic search is also critical for intranets. It is estimated that intranets as enterprises have tens or even hundreds of times larger data collections (both structured and unstructured) than internets (Li, Cao, Hu, Xu, Li, & Meyerzon, 2005). The recent intranet study by Li et al. (2005) demonstrated that an intranet search does not just focus on retrieval of relevant documents; it includes special types of information such as definitions, persons, experts, homepages, and applications. Another unique challenge to search inside the intranet is dealing with secure content; when it is not included, the value for the searcher is greatly diminished (Valdez-Perez, 2007). David Hawking (2006) aptly describes the enterprise as a complex information environment, which makes measuring the quality of search results difficult. While this study does not offer a solution to this problem, it characterizes the ORNL intranet, which could provide a framework for evaluating corporate web search environments. Clearly this is a motivating factor for comprehensively analyzing one's corporate intranet: specifically, measuring general search behavior exhibited by users and examining trends in query submission and reformulation, as well as the results of search, both successes and failures.


Successful search equates to optimized findability. Measuring findability means characterizing the enterprise search environment. This typically involves analyzing query logs to identify what topics users are searching for, query formulation (characterizing query submissions), and the percentage of search failure (no hits or too many results). It is a presumption of this study that it is not enough to understand query-level results. It is also necessary to analyze information related to general search behavior, which describes how and when users search. Only when we understand both search behavior and search results can we improve overall efficiency within intranet search systems. General search behavior can be determined by analyzing access logs and usage reports. It complements search analysis by helping us understand the unique characteristics of our web users.

It is because of these fundamental differences that organizations must evaluate their intranet search solution; simply applying best practices found with internet search is not practical. A successful organization must make sure that users can actually find information on their unique systems in a reasonable amount of time. Efficient search engines must be configured to match the characteristics of the users and the special information they seek. The most common way to characterize the users and the information they seek is to gather statistics on intranet usage and to evaluate user search logs.

Transaction Log Analysis (TLA) typically focuses on the interaction behaviors occurring among the users, the search system, and the information (Jansen, 2009). Content analysis of server log files describes user interaction as it relates to internet usage statistics/reports and search queries. Several studies have been done in this area, with only Stenmark (2005, 2006) focusing on corporate intranets (Beitzel, Jensen, Lewis, Chodury, & Frieder, 2007; Wang, 2006; Baeza-Yates, Calderon-Benavides, & Gonzalez, 2006; Wang, Berry, & Yang, 2003; Jansen & Spink, 2006; Wolfram, Wang, & Zhang, 2008). This study will contribute to TLA by applying the Wang et al. (2003) method of mutual information analysis to intranet queries. It also implements the Wang (2006) method of topic identification, complemented by general transaction analysis of ORNL user search usage statistics. In addition to contextual analysis, this study includes indirect analyses of access logs and usage reports to better characterize general ORNL search behavior. Unlike narrowly focused published intranet studies, this study will comprehensively analyze a corporate intranet's user web query corpus for the purpose of improving the overall intranet search experience. Along with query logs, it evaluates access and usage logs in order to gain a holistic view of the ORNL search enterprise. A literature review suggests this may also be the first study to perform TLA on an intranet site using the Microsoft Office SharePoint search engine.

This thesis will add to the growing body of literature associated with web query transaction log analysis for intranets by providing a methodology to other intranet users and managers who may want to holistically analyze their search environment. It combines the log analysis used by search system administrators to measure search engine performance and interaction with traditional query log analysis, which measures users' search performance and interaction. The thesis is organized as follows. The next section discusses the research questions associated with this study. Chapter 2 summarizes the public extent of research related to intranet and web search. Chapter 3 characterizes the ORNL enterprise search environment, the transaction log files used in the study, and the research methodology. Chapter 4 presents results and discussion, while Chapter 5 summarizes the study results and discusses implications of the study.

    Research Questions

This study employs mature scientific internet web query transaction log analysis (TLA) to better understand how intranet users at ORNL search for information. The focus of the study is examining general search behaviors and identifying unique trends associated with query composition and vocabulary. The goals of the research are threefold and include answers to the following research questions (RQ):

RQ1. What general search behaviors do ORNL searchers exhibit when searching the intranet?

a. What is the size of the ORNL search audience?
b. What interfaces do ORNL users employ most when they search?
c. What types of pages do ORNL users click most often when results are available?
d. What topics do ORNL users commonly search for?
e. When do ORNL users search and what are their search results?

RQ2. How do ORNL users formulate their queries?

a. What are the most frequently submitted queries?
b. How many ORNL user queries are unique?
c. How many ORNL user queries are blank?
d. What are the lengths of ORNL user queries?
e. What is the distribution of ORNL queries relative to length and time?

RQ3. What are the characteristics of the ORNL user vocabulary?

a. What is the length and distribution of ORNL unique terms?
b. With what frequency do ORNL user queries contain acronyms, abbreviations, and misspelled words?
c. What is the frequency of common stop words?
d. Are there terms that occur together frequently (term co-occurrence)?

RQ4. How do ORNL results compare to the published studies?


    Chapter 2

    Literature Review

    TLA Theory and Methodology

This chapter provides a brief overview of mature Transaction Log Analysis (TLA). The overview contains two major sections, the first on TLA theory and the second on TLA methodology. The overview is then followed by a short discussion of the literature review objectives. Following the review objectives are discussions of each related work and the impact it had on developing the methodology for this study.

    TLA Theory

The use of data stored in the transaction logs of web search engines, intranets, and web sites can provide valuable insight into understanding the information searching process of internet searchers (Jansen, 2006). Many researchers (Jansen, Spink, & Taksa, 2009) feel transaction log data can provide feedback on what users are looking for in search architectures. Although there is a body of literature on empirical studies of TLA, few provide detailed methodological clarifications of the data models used and the underlying rationales for these models (Wang, Wolfram, Zhang, Hong, & Wu, 2007). While TLA is emerging as a viable research methodology, it is not without its critics. Critics feel that TLA doesn't go far enough: the logs don't record the user's perceptions of the search and therefore don't measure the real needs of the information searcher (Kurth, 1993).

    TLA Methodology

Many studies have examined transaction log analysis (TLA) of web-based search engines. Researchers have used transaction logs for analyzing a variety of applications, from internet search to library information retrieval (IR) systems (Croft, Cook, & Wilder, 1995; Jansen, Spink, & Saracevic, 2000; Jones, Cunningham, & McNab, 1998; Wang, et al., 2003; Wang, Wolfram, & Wu, 2008). In "Search log analysis: What it is, what's been done, and how to do it," Jansen reviews the fundamental research motivation for TLA and describes a methodology for conducting successful TLA research. A recent tutorial published by Wang, Wolfram, and Wu (2008) entitled "Web Search Log Analysis and User Behavior Modeling" focuses specifically on the technical process for conducting web transaction log analysis using the best tools developed by researchers over the last decade.

In all of these studies, TLA methodology is commonly described as a three-stage process. The first stage is data collection, which includes collecting the interaction data for a given period of time using transaction logs. The second stage is cleaning and parsing the log files to make them suitable for analysis. The third and final stage is analysis, which requires selecting a specific research methodology. Of course, the research questions define what can be answered by the default data in typical transaction logs (Jansen, 2006). Fortunately, today's search logging software easily allows for expanding unobtrusive data collection to additional variables to meet analysis needs.
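To make the three-stage process concrete, the following is a minimal sketch in Python. The file name, the comment convention, and the assumption that a record's first field is a date are illustrative stand-ins, not the tooling used in this study.

    from collections import Counter

    def collect(log_path):
        # Stage 1: read raw transaction log records for the study period.
        with open(log_path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    def prepare(records):
        # Stage 2: clean and parse -- here, simply drop blanks and comment lines.
        return [r for r in records if r and not r.startswith("#")]

    def analyze(records):
        # Stage 3: compute a standard interaction metric (records per day,
        # assuming the first whitespace-separated field is the date).
        return Counter(r.split(" ", 1)[0] for r in records)

    print(analyze(prepare(collect("search.log"))))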

    Data Collection

Transaction logs come in different formats, but more recent commercially available search tools produce standard World Wide Web Consortium (W3C) extended or Internet Information Services (IIS) format log files. Inherently, all data logs vary in content. The data format and fidelity should be addressed along with any predefined assumptions (Jansen & Pooch, 2001). In "Privacy Concerns for Web Logging Data," Kirstie Hawkey (2009) suggests researchers should anonymize or otherwise transform any sensitive or personal data before receiving, working with, or publishing it. Most private or government organizations have policies related to sensitive information management. Researchers should consult with the Chief Information Officer (CIO) of their organization to discuss proper handling and dissemination of search log related information.
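One common way to implement the anonymization Hawkey recommends is to replace sensitive identifiers with a salted one-way hash before analysis. The sketch below is illustrative only: the field names and salt are hypothetical, not the ORNL procedure.

    import hashlib

    SALT = b"study-specific-secret"  # kept separate from the published data

    def pseudonymize(value: str) -> str:
        # One-way hash: consistent across records, but not reversible.
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:12]

    record = {"c-ip": "160.91.4.25", "cs-username": "ORNL\\jdoe"}  # invented example
    safe = {k: pseudonymize(v) if k in ("c-ip", "cs-username") else v
            for k, v in record.items()}
    print(safe)

Because the same input always maps to the same pseudonym, per-user analyses (such as session detection) still work on the transformed data.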

Most log files contain data that can be used to analyze users' search behaviors with IR systems, whether internet or intranet, by discerning attributes of distinct search processes and their resulting components. Jansen and Pooch (2001) establish the framework terminology for analyzing the search process, describing three distinct components: 1) session, 2) query, and 3) term. Session analysis is focused on discrete entries entered by single users. This is the most difficult of the three, as the researcher must determine what constitutes a session. Session boundary detection is difficult because users search for multiple topics on a single computer, or a single computer may be shared by multiple searchers (Wolfram, et al., 2008). Sessions can be comprised of single or multiple queries.
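A minimal sessionization sketch: group one user's timestamped queries into sessions whenever the gap between consecutive queries exceeds a chosen threshold (Stenmark, 2005, discussed later, used 13 minutes). The events below are invented for illustration.

    from datetime import datetime, timedelta

    THRESHOLD = timedelta(minutes=13)

    def sessionize(events):
        # events: list of (timestamp, query) pairs, sorted by time, one user.
        sessions, current = [], []
        for ts, query in events:
            if current and ts - current[-1][0] > THRESHOLD:
                sessions.append(current)  # gap too long: close the session
                current = []
            current.append((ts, query))
        if current:
            sessions.append(current)
        return sessions

    events = [(datetime(2007, 9, 17, 6, 30), "intimal hyperplasia"),
              (datetime(2007, 9, 17, 6, 32), "hyperplasia treatment"),
              (datetime(2007, 9, 17, 9, 0), "travel office")]
    print(len(sessionize(events)))  # -> 2: the long gap starts a new session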

  • 7/28/2019 Enterprise Users and Web Search Behavior

    19/94

    8

A query is defined as a string of characters or word(s) entered into an information retrieval system. A query can contain multiple strings of characters or words (Korfhage, 1997). Query-level analysis usually involves examining query length, query complexity, and failure rate. Query length represents the number of words or unique character strings in a query. Query syntax looks at the specific components comprising the words or strings; this can range from the use of special symbols like hyphens to Boolean operators, even examination of capitalization and spelling. Failure rate quantifies how often a searcher receives no information matches for their character string submission. Today's search logs usually report failure rate as "number of hits." When searchers receive no results matching their query, the number of hits equals zero.
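The query-level measures above reduce to simple counting once the log is parsed. A sketch, with invented (query, hits) pairs standing in for real log records:

    queries = [("travel office", 42), ("mhp", 0), ("pay band 4", 7), ("", 0)]

    lengths = [len(q.split()) for q, _ in queries if q]  # words per non-blank query
    zero_hit = sum(1 for q, hits in queries if q and hits == 0)
    failure_rate = zero_hit / len(lengths)

    print(f"mean length: {sum(lengths) / len(lengths):.2f} words")
    print(f"failure rate: {failure_rate:.1%}")  # share of queries with no hits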

A term is defined as a string of characters separated by some delimiter such as a tab, space, comma, or colon. It is up to the researcher whether to include special syntax or delimiters in the queries or terms. There are impacts to the analysis whether you keep them or remove them, such as how unique semantic terms are defined. Term analysis involves evaluating the number of characters in a term, the frequency of the term, and its tendency to appear with other terms in queries or the corpus. High usage terms are those terms that occur most often in a search corpus and are easily identified by tokenizing queries (splitting multiple-term queries into single terms) and counting identical terms. Mutual information, or term co-occurrence, measures the occurrence of term pairs. In "Mining Longitudinal Web Queries: Trends and Patterns," Wang, Berry, and Yang (2003) examine co-occurrence with queries extracted unobtrusively from the website of the University of Tennessee, Knoxville (UTK). To promote statistical consistency in the ORNL search model, the present study employs these authors' methodology for queries and word pairs.
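As a sketch of term co-occurrence scoring, the code below uses the standard pointwise mutual information formulation, PMI(a, b) = log2(P(a, b) / (P(a) P(b))), estimated from query counts. Wang, Berry, and Yang (2003) define their own estimator, which may differ in detail; the queries here are invented examples.

    import math
    from collections import Counter
    from itertools import combinations

    queries = [["pay", "band"], ["pay", "band", "chart"], ["travel", "office"]]

    # Count each word once per query, and each unordered word pair once per query.
    word_n = Counter(w for q in queries for w in set(q))
    pair_n = Counter(p for q in queries for p in combinations(sorted(set(q)), 2))
    N = len(queries)

    def pmi(a, b):
        pair = pair_n[tuple(sorted((a, b)))]
        return math.log2((pair / N) / ((word_n[a] / N) * (word_n[b] / N)))

    print(pmi("pay", "band"))  # positive: the pair co-occurs more than chance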

    Data Preparation

Data preparation is the most important and time-consuming component of TLA. Cleaning the raw log files usually requires identifying format and data record errors through visual inspection of the file. Depending on the size of the file and the type of errors, a single editing script might be sufficient. More likely, the search file will contain hundreds if not thousands of problem records, many requiring a unique editing solution; in such instances, manual edits and multiple scripts are required. Typically, the percentage of corrupted data is small relative to the overall data set (Jansen, et al., 2009). Data preparation also includes identifying exclusion data. Exclusion data are special instances of data, like addresses or phone numbers, that are excluded from analysis because they would negatively impact the search log analysis objectives. The last step in data preparation is importing the clean TLA data into a relational database or log analysis software tool and calculating the standard interaction metrics that will serve as a basis for further analysis (Jansen, 2006).
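A sketch of the cleaning and exclusion step just described: drop malformed records and records matching exclusion patterns. The expected field count and the patterns (phone-number-like and e-mail-like strings) are illustrative, not the rules used in this study.

    import re

    EXCLUDE = [re.compile(r"\b\d{3}[-.]\d{4}\b"),   # phone-number-like strings
               re.compile(r"\b\w+@\w+\.\w+\b")]     # e-mail-like strings

    def clean(lines, expected_fields=19, sep=","):
        kept, dropped = [], 0
        for line in lines:
            fields = line.split(sep)
            if len(fields) != expected_fields or any(p.search(line) for p in EXCLUDE):
                dropped += 1  # corrupted record or exclusion data
                continue
            kept.append(fields)
        return kept, dropped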

    Data Analysis

The best way to manage search log queries for multiple types of analysis is through a robust relational database management system (RDBMS). Importing and tracking each query as a unique event affords traceability from derived characterization data. It is simpler in an RDBMS to attach additional attributes to each record and to correlate across a diverse population of records. Statistical analysis should include at least the mean, standard deviation, and median wherever possible if you intend to compare results across studies. All data should be presented at the lowest unit of measure, avoiding aggregated category values at all cost (Jansen & Pooch, 2001). Lastly, the RDBMS method for storing quantitative data is optimal for secondary analysis.
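A minimal sketch of the relational approach, using SQLite as a stand-in for whatever RDBMS a study actually uses; the schema is hypothetical, but the sample row is the query record discussed in Chapter 3.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE query (
                     id INTEGER PRIMARY KEY,
                     submitted TEXT,        -- timestamp of submission
                     text TEXT,             -- the raw query string
                     num_results INTEGER)""")
    con.executemany(
        "INSERT INTO query (submitted, text, num_results) VALUES (?, ?, ?)",
        [("2007-09-17 06:30:14", "intimal hyperplasia", 6)])
    # Derived attributes (word counts, topic labels) can later be joined on
    # query.id, which is what makes each query traceable as a unique event.
    print(con.execute("SELECT AVG(num_results) FROM query").fetchone())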

    Literature Review Objectives

In support of this study, an extensive literature search was conducted using online sources, conference proceedings, technical articles, and two significant reference books: "Web Search: Public Searching of the Web" by Jansen and Spink (2005) and the "Handbook of Research on Web Log Analysis" edited by Jansen, Spink, and Taksa (2009). The latter is a must for anyone considering TLA research.

The criterion for related work in this study was that it must be focused on context analysis using TLA methods and involve an intranet or an academic web site. This study presumes academic web sites qualify as intranet-like sites, as they have limited access (password-protected accounts) and employees use the same enterprise search site. It also presumes role-based access; that is, staff have access to more information than students. Qualifying studies were placed in one of three context analysis categories: 1) Session Analysis, 2) Longitudinal Analysis, and 3) Visual Presentation of Information Needs.

A session study usually involves analyzing query information specific to individual measures like length of session and the average number and length of sessions per user. Sometimes it will involve analysis of click-through behavior, which is done to see where the searcher has been or to predict where they are going next.


Longitudinal analysis is temporal query analysis and is usually focused on analyzing query trends for a single search site across multiple time increments, usually months and/or years. These types of studies (Stenmark & Jadaan, 2006; Wang, et al., 2003) look at query and token frequencies to identify popular queries (top 100 and top 25), words, word pairs, and triples. Most include characterizing words in the corpus using Zipf distributions. Only one evaluates term co-occurrence using mutual information statistics.
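The Zipf characterization these studies report amounts to ranking the corpus words by frequency and plotting rank against frequency on log-log axes, where a roughly straight line indicates a Zipf-like distribution. A sketch, with a tiny stand-in corpus:

    from collections import Counter
    import matplotlib.pyplot as plt

    words = ["pay", "pay", "pay", "band", "band", "travel"]  # stand-in corpus
    freqs = sorted(Counter(words).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)

    plt.loglog(ranks, freqs, marker="o")   # log-log axes: Zipf appears linear
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title("Zipf plot of query vocabulary")
    plt.show()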

Visual presentation of information needs focuses on research methods used to identify what users are looking for and ways to visually represent the results in a topic map. These studies usually involve quantitative analysis of queries resulting in the clustering or aggregation of query information into topics.

    Session Analysis

A literature review suggests Dick Stenmark's article "Searching the intranet: Corporate users and their queries" (2005) is one of the first intranet studies of web sessions. The study was done for SwedCorp, a commercial vehicle manufacturing company, using the UltraSeek search engine by Verity. Session analysis is difficult because there is no variable in the UltraSeek log file that indicates when a user begins and ends a search. The single item that varies across these studies is the time threshold defining a search session; this study chose 13-minute session boundaries. After determining the threshold, Stenmark analyzed the data to determine session length in terms of interactions per session, the elapsed time of each session, and the distribution of the sessions. The study also involved query analysis, reporting the number of queries, zero-term queries, and repeat queries. Single-term queries dominated, with no query containing more than 9 terms. Stenmark's study (2005) is relevant to the ORNL study because it too looks at intranet queries. Some of the results from the ORNL study can be compared to the SwedCorp results with the following caveats: the SwedCorp study involves the UltraSeek search engine, which limits indexing of intranet information to URLs only. This limits the search study to page results that link to text documents, not real enterprise multimedia or applications search. UltraSeek is also an anonymous search engine, and because it doesn't know who you are and what you can have access to, it restricts you from all sensitive intranet information. This is a good example of why intranet studies are needed on newer search engines like Microsoft Office SharePoint Server (MOSS). MOSS logs do give indications as to when a user starts and stops a session. MOSS does not limit what is counted; for example, access to all media in all pages is counted, not just single URL page access. MOSS knows who the user is because it employs password-protected access. Lastly, MOSS is able to index, not just filter, which means it indexes more than URLs.

"Mining Web Search Behaviors: Strategies and Techniques for Data Modeling and Analysis" by Wang, et al. (2007) used the 80-20 empirical rule to develop an interactive web tool for exploring certain query session thresholds. The Wang, et al. (2007) study analyzed many of the same query and session issues as Stenmark's (2005) study, but the implementation was quite different. This study implemented a highly granular, comprehensive relational data model which maximized transactional data inclusion and expansion. Great detail was included in the data section describing data preparation, processing, and construction of the data model. The concept of the data model was the inspiration for the ORNL data model. The data used in this analysis was from multiple sites (Excite, HealthLink, and UTK), with only one of the three, UTK, qualifying as an intranet-like site. The only variables that are available for comparison in this study are top queries and unique queries. Fortunately, Wang, Berry, and Yang (2003) also have an earlier longitudinal analysis, using four years of UTK search data stored in a relational data model, that is relevant to the ORNL study.

    Longitudinal Analysis

"Intranet Users' Information-Seeking Behavior: an Analysis of Longitudinal Search Log Data" by Stenmark and Jadaan (2006) is focused on temporal characterization of intranet users across three different years, comparing results to public web studies. In the 2006 study, Stenmark and Jadaan evaluated SwedCorp's query data submitted to its InfoSeek search site. While the paper also includes some session analysis data, the bulk of the analysis focused on the search queries. Their query analysis reported, for each year, the number of queries, empty queries, and single terms, and the average and maximum number of terms in a query. Results pages viewed were also analyzed, with reports on the number of explicit pages and the mean and maximum number of results pages viewed. Stenmark and Jadaan's (2006) study suggests intranet users engage in fewer and shorter search sessions than in the public web studies, and the length of intranet query submissions is significantly shorter than public searches. This study certainly gives some results that can be compared to ORNL results. However, Stenmark and Jadaan's (2006) study tends not to discuss cleansing and processing of the data, a lack of methodological substance.

Another article by Stenmark in 2006, "What are you searching for? A content analysis of intranet search," involves a pure intranet study done using Volvo intranet search logs. It was a longitudinal study from 2002 through 2004, although not over the same months or even days across years. This study not only involved typical query analysis but included an open card sort exercise to derive topics from query terms. Zipf distributions were used to characterize the word corpus. Some analysis was done regarding term pairs and triples, as well as advanced statistics on word pairs. He also includes linguistic analysis of Boolean operators. Many of the reported results will be useful for comparison. While this study is more comprehensive in the area of context analysis, it still does not provide much substance on methodology.

"Mining Longitudinal Web Queries: Trends and Patterns" by Wang, Berry, and Yang (2003) entails the analysis of four years' worth of UTK site search logs (May 1997 to May 2001). The research objectives were very user-oriented: understanding user web query behavior, identifying search problems, and developing techniques for optimizing query analysis. A comprehensive characterization of queries was done, along with word associations using Zipf distributions. What stands out in this query study is that the paper logically presents its data processing and analysis techniques in detail. A web query entity relationship model helps describe each step in the process and how the relational data management structure was built. It was easy to see how the same measurements could be produced with the ORNL data set. This paper provides an extensive roadmap for contextual search analysis.

Visual Presentation of Information Needs

There is only one relevant publication that falls into this category, "A Dual-approach to Web Query Mining: Towards Conceptual Representations of Information Needs" by Wang (2006). This study also examines University of Tennessee, Knoxville (UTK) queries, but with an added focus on web clustering for identifying what information users are seeking. The strategy was to analyze mutual information values and similar queries of a single user session for the purpose of identifying semantically related terms. Mutual information was certainly helpful, but threshold boundaries were needed to more tightly identify sessions and thus topic branching. The visual representation of semantic networks was interesting because it helped describe the relationship between unique high-frequency terms and word pairs. It also demonstrated how mutual information values can be used to help cluster words based on association strength.


    Conclusion

A granular relational data model first introduced by Wang, Berry, and Yang (2003) for Web query analysis was adopted and modified for data mining and analysis of the ORNL query corpus. The ORNL query corpus is characterized using the Zipf distributions, log-log graphs, and descriptive word statistics found in Stenmark and Jadaan (2006) and Wang, et al. (2007), respectively. User search vocabulary is analyzed using frequency distribution and probability statistics (mutual information), a methodology attributable to Wang, Berry, and Yang (2003). Results from both of the aforementioned studies will be used for results comparison. The ORNL study will build on visual topic identification using mutual information values, similar to the study by Wang (2006).


    Chapter 3

    Methods

    Research Environment

This research is based on analysis of web query logs from ORNL's intranet. ORNL is a multi-program science and technology laboratory managed for the Department of Energy (DOE) by UT-Battelle, LLC. ORNL is also the Department of Energy's largest science and energy laboratory. Scientists and engineers at ORNL conduct basic and applied research. Their goal is to develop scientific knowledge and technology that strengthens the nation's leadership in six key areas of science: energy science, high-performance computing, neutron science, materials science at the nanoscale, systems biology, and national security. ORNL also performs other work for DOE, including isotope production, program management, and science-related information management (http://www.ornl.gov/).

ORNL has over 4,600 staff and approximately 3,000 guest researchers at the laboratory every year. Staff and visitors are a mix of U.S. and foreign citizens. Educationally, they represent a mix of technical professionals, degreed workers, and students at both the graduate and undergraduate levels.

In 2007 ORNL replaced its Verity UltraSeek search engine with Microsoft Office SharePoint Server 2007. SharePoint content shared through this tool includes document libraries, picture libraries, lists, discussion boards, surveys, and individual and shared web sites and web workspaces. The ORNL SharePoint search engine indexes about 200 public and internal web servers, covering close to 1,000,000 documents. This search server change netted nearly a three-fold increase in the number of documents searched by users and removed strict anonymity from the ORNL intranet search process. Now that ORNL searchers are being exposed to three times as many information sources, it is more important than ever to make sure that the results provided to users via intranet search are relevant.

The search engine unobtrusively generates several log files. Which log file is used for analysis depends on the TLA research questions and research objectives. The three types of files used in this study are usage reports, access logs, and query logs. The next section provides a structural description of the different files, followed by a description of how the data was prepared and processed for analysis.


    The Data

MOSS uses many different log files to help collect user search information. Collection of transaction log files is automatic and unobtrusive, so no special data collection was required for this study. Analyses of these incidental logs provide an effective review of the overall ORNL user search experience. MOSS provides three key sources of usage information: the administrator search usage data, as well as access and query log information. Each can be beneficial in understanding how people are generally using the intranet site and what information they are looking for. In combination, they provide deeper insight into general user search behavior. All queries analyzed in this study were submitted through the MOSS search engine and occurred between September 17 and December 31, 2007.

    Data Structure

MOSS 2007 Search uses Internet Information Services (IIS) standards to capture transaction information from users and stores the output in World Wide Web Consortium (W3C) extended log file format. A W3C IIS file manager utility came with MOSS, and it was used by the ORNL web manager during installation to choose which information is important to regularly collect for the organization.

The first W3C IIS file we will discuss is called the access log. It contains the date and time a transaction was recorded, the address of the server which made the log, the Internet Protocol (IP) address of the requestor, the type of browser the request was made in, a query submission if one was made, the URL address of the clicked or downloaded item, the type of page the user selected, and the length of time the request took. The fields in the file are space-delimited and maintain a strict order: the date, the time stamp, the name where the search service was running, the log location, the path of the item downloaded, the query issued, the individual requesting search access, and the type of browser used to search (Table 1). Here is an example of that data log from ORNL:

    2007-09-17 21:45:14 W3SVC758222333 111.xx.x.xx GET /SearchCenter/_themes/Lichen/pagebackgrad_lichen.gif - 80 ORNL\111.xx.xxx.xxx Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.0;+SLCC1;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506;+MS-RTC+LM+8;+InfoPath.2;+.NET+CLR+3.5.21022) 200 0 0 203

    2007-09-17 21:45:14 W3SVC758222333 111.xx.x.xx POST /searchcenter/Pages/Results.aspx k=mhp&s=All+Sites 80 - 160.xx.xxx.xxx

Table 1. Defines all the information fields that are available in the ORNL access log

    date             The year, month, and day the entry was recorded
    time             The time the log entry was recorded, in UTC
    s-sitename       The Internet service name and instance number that was running on the client
    s-ip             The IP address of the server on which the log was created
    cs-method        Command issued by the user, like GET or POST or PASS
    cs-uri-stem      The path of the item downloaded or posted
    cs-uri-query     The query, if any, that the client submitted. A Universal Resource Identifier (URI) query is necessary only for dynamic pages.
    s-port           The server port
    cs-username      The name of the authenticated user who accessed the server. Anonymous users are indicated by a hyphen.
    c-ip             The IP address of the client
    cs(useragent)    The type of browser that the client used
    sc-status        The HTTP status code
    sc-substatus     The substatus error code
    sc-win32-status  The Windows status code
    time-taken       The length of time that the action took, in milliseconds
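A sketch of parsing one access-log record into the Table 1 fields. The sample records above are whitespace-delimited, so this splits on spaces; the field order follows Table 1 and might need adjusting against a real log.

    FIELDS = ["date", "time", "s-sitename", "s-ip", "cs-method", "cs-uri-stem",
              "cs-uri-query", "s-port", "cs-username", "c-ip", "cs(useragent)",
              "sc-status", "sc-substatus", "sc-win32-status", "time-taken"]

    def parse_access(line: str) -> dict:
        # Map each whitespace-separated token onto its Table 1 field name.
        return dict(zip(FIELDS, line.split()))

    rec = parse_access("2007-09-17 21:45:14 W3SVC758222333 111.xx.x.xx GET "
                       "/searchcenter/Pages/Results.aspx k=mhp&s=All+Sites 80 - "
                       "160.xx.xxx.xxx Mozilla/4.0+(compatible;+MSIE+7.0) 200 0 0 203")
    print(rec["cs-uri-query"])  # -> k=mhp&s=All+Sites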


The second type of W3C IIS file used in this study is the query log file. The query file contains data that, when analyzed, can provide insight into query volume trends, top queries, click-through rates, queries with zero results, search topics, and various detailed query-level statistics. For extended query analysis and reporting, query log export data is provided in Excel files.

Search query logging is enabled by default in the MOSS Shared Services Provider (SSP). The information tracked in the query log includes the query terms used, the search results returned for search queries, and the pages that were viewed from the search results. The search usage data is beneficial in understanding how ORNL users are searching and identifying the type of information they are downloading. Below is an example of a single record of that ORNL file. Each record contains 19 fields, and individual fields are separated by commas:

    NULL, intimal hyperplasia, 9F73D42F-7E3D-4508-B5C0-89885EFEB222, All Sites, NULL, 6, 0, NULL, 2007-09-17 06:30:14.497, 2007-09-17 06:30:40.870, 0, 0, https://sharepoint.ornl.gov/search/Pages/results.aspx, ORNLMOSSINDEX, 0, 0, 0, NULL, NULL

This sample record shows that a user typed the query string "intimal hyperplasia" in the ORNL search box, as indicated in field two. The search yielded six results, as listed in field six of the record. None of the results were clicked on the results page, as indicated by the term NULL in the first record field. This suggests the user was not satisfied with the results or was interrupted in the search process. Fields nine and ten contain date timestamps, the first indicating when the search was submitted (9/17/2007 at 6:30 in the morning) and the second indicating at what time the result URL was clicked. Since no URL was clicked in this instance, only the date occupies this field. These fields, along with the number of results, the clicked URL rank, and the clicked URL, were used in this study. A complete list of fields and their definitions in the query log can be found in Table 2.
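To make the field positions concrete, a minimal Python sketch of how such a record could be split into named fields follows. This is an illustration only, assuming simple comma delimiters; it is not the processing code used in the study, and the field names follow Table 2.

    FIELDS = ["clickedUrl", "queryString", "siteGuid", "scope", "bestBet",
              "numResults", "numBestBets", "clickedUrlRank", "searchTime",
              "clickTime", "advancedSearch", "continued", "resultsUrl",
              "queryServer", "numHighConf", "didYouMean", "resultView",
              "contextualScope", "contextualScopeUrl"]

    def parse_query_record(line):
        # Split the comma-delimited record and pair each value with its field name.
        values = [v.strip() for v in line.split(",")]
        record = dict(zip(FIELDS, values))
        # A clickedUrl of NULL means no result was clicked for this search.
        record["clicked"] = record.get("clickedUrl") != "NULL"
        return record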


Table 2. Information fields and definitions of the ORNL collected query log

    MSSQL QUERY LOG DEFINITIONS
    1   clickedUrl           URLs clicked on the results page
    2   queryString          The query text of the search that was executed
    3   siteGuid             The ID of the site or site collection from which the search query was executed
    4   scope                Defines the limits of the searchable space, for example All Sites (Search Center, top-level site, sub-sites, or Lists & Libraries), This Site (current site and all its sub-sites), This List (Lists & Libraries), or People (on All Sites)
    5   bestBet              Keyword terms as described by the administrator to enhance search results; can also be called a "synonym ring" (a glossary of names, processes, and concepts)
    6   numResults           The number of relevant results returned for the search query
    7   numBestBets          The number of best bets returned for the search query
    8   clickedUrlRank       The result position of the clicked URL
    9   searchTime           The date and time when the search was executed
    10  clickTime            The time when the resulting URL was clicked
    11  advancedSearch       In many cases, users type a keyword phrase in the search box and then click the Go Search button or press Enter to execute their query. If this technique does not produce the result they are looking for on the first few pages of search results, some users will give up; advanced users, however, tend to try again by using a more advanced query to target the content they are looking for.
    12  continued            Identifies the last entry corresponding to a search query
    13  resultsUrl           The URI of the page where the ranked results were posted
    14  queryServer          The name of the query server on which the search query was executed
    15  numHighConf          The number of high confidence results returned for the search query
    16  didYouMean           Identifies whether a spelling suggestion was returned (0 = yes, 1 = no)
    17  resultView           Identifies the order in which relevant results were ordered
    18  contextualScope      The contextual scope under which the query was executed
    19  contextualScopeUrl   The URI of the contextual scope


Lastly, usage report information from the search site reporting service was used to complement the access and query log information. Usage reporting is a service that enables intranet SharePoint site administrators to monitor high-level statistics about the use of their sites, and it includes usage reports for search queries. Items selected from that report for this study were the top queries in the last 30 days of the query log data set, the average number of search requests per day and per month, and the search results of the top destination page types.

    Preparation

Data preparation included developing a plan for cleansing and anonymizing the transaction log data. Cleaning the data involved removing data errors; anonymizing the data involved removing personal user information as well as descriptive ORNL network information from the logs. The query logs contain not only query requests but also identifying information about the person who initiated each request. Martin (1997), an early information scientist with a legal background, was the first to consider privacy issues raised by monitoring online information systems for studies of user behavior. This study implements Kurth's (1993) suggestions for protecting information that may reveal searcher identity. First, all personal information, like the ORNL three-letter user ID, was removed from the logs. IP addresses were anonymized by replacing all but the first three numbers of the IP address with "x". Session analysis was not performed, so there was no need to track individual user session information. Permission to use the data was secured from the CIO of the organization after submitting a reasonable data security plan. Permission to publish the results was granted after review.
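As an illustration of the masking applied, a small Python sketch follows. It keeps only the first octet of an IPv4 address and replaces every remaining digit with "x"; the function names and field name are illustrative, not the actual cleansing scripts used in this study.

    import re

    def anonymize_ip(ip):
        # Keep the first group of numbers; mask every remaining digit,
        # e.g. "111.22.3.44" -> "111.xx.x.xx".
        first, _, rest = ip.partition(".")
        return first + "." + re.sub(r"\d", "x", rest)

    def strip_user_id(record, user_field="cs-username"):
        # Remove the personal identifier (e.g. the ORNL three-letter user id).
        record[user_field] = "-"
        return record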

The access logs were mined in their native format using the Log Parser 2.2 tool and therefore did not require any special processing. The usage reports also did not require manipulation. The query log, however, did require data cleansing, parsing, and in some cases reformatting.

Initial review of the query transaction log found structural issues among the file's records. Queries involving names of authors were distributed across multiple record cells; these strings were concatenated and used to replace the partial information in the original query string field. Additionally, a small number of records had the term "efaultproperties" in the query string field, and the remainder of the row data was shifted by one cell. These query strings were deleted and the remaining data moved


left by one column. The remaining query statements were at least contained inside the query column, though some exhibited strange forms.

MOSS supports four basic types of keyword syntax for search: prefixes, phrases, and single or multiple words. Querying the system is not case sensitive, and Boolean logic is not required. It was clear just from examining the first 2,000 records of the query log that users did not clearly understand the query rules of the MOSS search system. Table 3 depicts some of the non-compliant and unusual queries.

Table 3. Examples of unsupported query strings submitted by ORNL searchers

    Record   Query String                              Type
    24       (blank)                                   blank
    283      10/2007 international festival            dates
    22       efaultproperties                          error
    23       '+vascular +injury                        Boolean operators
    10       vascular_injury                           special character
    106      fmla                                      acronyms
    333      4200000162                                numbers
    397      "recruiting coordinator"                  quotations
    437      share.ornl.gov                            partial web addresses
    587      https://share.ornl.gov/projects/doe_bap   URIs
    606      zip-code                                  hyphenated terms
    1817     ji*                                       wildcards


A cursory glance at the query string structures suggested many parsing rules were needed to establish what a qualifying query would look like. Special characters and punctuation such as quotation marks, Boolean logic operators, commas, underscores, and backslashes were removed as long as removal did not impact the context of the query. The context of a query was validated by examining queries near the target query. Blank queries were filtered, but not removed. Since the study focused on user vocabulary, most queries containing numbers were removed. The exception to the rule was when numbers gave special context to a word or phrase, e.g., W-2, 401K, etc. For practical purposes of lexical study, all building numbers, phone numbers, office numbers, conference room numbers, and form numbers were deleted. URL addresses submitted as queries were also removed.

    Processing

A commercially available software tool called Log Parser 2.2 was used to mine data within the access text file. Log Parser is a powerful, versatile tool that provides universal query access to standard IIS text-based data such as log files. Using the Log Parser 2.2 tool on the access log files gives a first quick glimpse into the behavior of searchers. The first step is to determine what data is valuable to the study and identify it by term, for example identifying popular browser types. The next step involves telling the parser to retrieve the data about browsers, called cs(User-Agent) in the access log, and then telling the parser how to process the data. The results of a query can be custom-formatted as text-based output, or they can be directed to other outputs such as SQL or a chart. Appendix A presents the bulk of the queries that were created to mine the access log file in this study. Details of the output and how it was analyzed can be found in the Analysis section. Again, Table 1 defines all 15 of the information fields that are available in the ORNL access log. These fields are easily identifiable in the Log Parser 2.2 script examples found in Appendix A.
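As a concrete illustration of this kind of aggregation, the sketch below tallies the cs(User-Agent) field in plain Python rather than Log Parser. It is a stand-in, not one of the Appendix A scripts, and it assumes a standard W3C log whose #Fields directive lists the column names from Table 1.

    from collections import Counter

    def top_user_agents(log_path, n=10):
        counts, ua_index = Counter(), None
        with open(log_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if line.startswith("#Fields:"):
                    columns = line.split()[1:]       # column names follow the directive
                    ua_index = columns.index("cs(User-Agent)")
                elif line.startswith("#") or ua_index is None:
                    continue                         # skip directives and preamble
                else:
                    parts = line.split()
                    if ua_index < len(parts):
                        counts[parts[ua_index]] += 1
        return counts.most_common(n)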

The analysis requirements for the entire study must be considered as the data model is constructed. An ORNL data model was constructed to assist in database design. The query data model represents the specific data needed to meet the analysis requirements. It also defines processing constraints within the query corpus and depicts the relationships between data entities (Figure 1).


    Figure 1. ORNL web query ER model for relational database

The original query log (an Excel file) was cleaned and processed prior to import into Microsoft Access 2007 (A). Additional processing of the query log included isolating and normalizing the cleaned query strings by converting all text to lower case. By normalizing the text we remove the distinction between "The" and "the", and between "ldrd" and "LDRD". Normalization did not include removing affixes, for example removing "ing" from "parking" to leave the word "park"; too often this can dramatically change the context of a query. The case normalization had a positive impact on query counts and on accurately determining high-frequency queries and terms. Spelling errors were counted as unique word occurrences. The length of the query string was derived, which includes


a count of all characters in the query string, including spaces. The resultant data was imported into a data table called Clean Queries (B).
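A minimal Python sketch of this normalization step (illustrative only; the actual processing was performed during the cleaning and import):

    def normalize_query(raw_query):
        # Case-fold only: "The"/"the" and "LDRD"/"ldrd" collapse together,
        # but affixes are kept ("parking" is not stemmed to "park").
        clean = raw_query.strip().lower()
        char_count = len(clean)          # query length, spaces included
        return clean, char_count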

The next processing step was to tokenize the query strings by parsing the words into word tokens. Each token word retains its Clean Query ID (C_QID) number and is parsed into a single record with an assigned string position number which identifies the position the word occupied in the clean query (C). For example, clean query number 145 is "business operations calendar". It is split on white space into three tokens: (145, business, 1), (145, operations, 2), and (145, calendar, 3). Unique tokens are found by removing all duplicate tokens. Tokens are then spell checked and, if required, categorized as misspelled words (designated as a 1 or 0 in the attribute "case"). Spell check was used in the first pass through the data. Human review was also required to check for acronym and abbreviation spelling.

The last processing step parsed single word tokens into word pairs (D). This

processing was necessary to calculate co-occurrence values or mutual information statistics. Mutual information statistics define the relationship between two words: the higher the value, the tighter the relationship. Mutual information results can help drive the construction of a next-word index or can assist in clustering web queries. If web queries can be clustered and classified into information categories, we can determine what topics searchers are looking for.
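The tokenization and word-pair steps described above can be sketched in a few lines of Python (a minimal illustration mirroring the clean query 145 example, not the scripts actually used):

    def tokenize(c_qid, clean_query):
        # Split on white space, keeping the Clean Query ID and 1-based position.
        return [(c_qid, word, pos)
                for pos, word in enumerate(clean_query.split(), start=1)]

    def word_pairs(clean_query):
        # Adjacent word pairs for the co-occurrence (mutual information) step.
        words = clean_query.split()
        return list(zip(words, words[1:]))

    # tokenize(145, "business operations calendar") yields
    # [(145, 'business', 1), (145, 'operations', 2), (145, 'calendar', 3)]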

    RDMS Development

    A granular relational data model first introduced by Wang, Berry, and Yang

    (2003) for Web query analysis was adopted and modified for data mining and analysis of

the ORNL query corpus. The relational data structure is computationally optimal for

    large data sets that have to be repeatedly processed. Such a data structure also provides

    a rich environment for multifaceted analysis.

The ER model displayed in Figure 1 was used to create the ORNL relational data model, or schema, shown in Figure 2. The ORNL query relational database consists of six tables, each representing a distinct data topic: Transaction Log, Clean Queries, Unique Queries, Unique Query Tokens, Unique Tokens, and Mutual Information. The Transaction Log merely represents all of the original log queries. The MSSQL query log was imported directly into the Transaction Log table. The table was assigned


a primary key automatically by Excel (QID). The field names remained static, except for NumResults, which was changed to Num_Hits.

The next table created was Clean Queries. The relationship between the Transaction Log and Clean Queries tables is one-to-one. The Clean Queries table contains information about the time each query was submitted (Year, Month, and Date) and the elapsed time, which is defined as the time from when the search was initiated until a resultant URL was clicked (Time_Taken). It also contains Tsec, which is simply Time_Taken converted into seconds, numHits, and the query_string_clean. Clean Queries has a primary key of C_QID and a foreign key of QID.

The Unique Queries table was derived from Clean Queries and stores only unique queries. The primary key for Unique Queries is C_QID. A Visual Basic (VB) script was written to count the occurrences of the unique queries inside the Clean Queries corpus (Appendix B). The counts are stored in a record field called Query_Clean_Freq and represent how many times each query occurred in the entire ORNL search corpus. Another VB script was written to determine the number of words contained in each query (Appendix B). Lastly, a field was added called CharCount. This field contains information regarding query length, which is defined as the number of characters contained in the query, including whitespace. This field was added to the table by inserting a data formula in table design mode. The relationship between Clean Queries and Unique Queries is one-to-one.

Repeat queries happen quite often in query sessions and across a query corpus.

Many common queries are submitted by different searchers, and less often duplicate queries are submitted by a single user within a web search session. There are several reasons why this occurs; most often the user cannot understand why there were no results and in disbelief resubmits the same query. Repeat query submission can also occur by accident. Below is an example of an individual query that was very specific, three keywords, but contains a typo. The number of hits for the first query was zero, so the user next typed only a single keyword, the first word of the previous query, "relocation". The user received 1,024 hits on this query, plenty of information, but was it the right information? Next, the user submitted the same query, again receiving 1,024 results. This was obviously not the information the searcher desired, so they altered their query a fourth time and received only 256 results. The session terminates at this point, which means the user finally found their information, their search was interrupted, or they just gave up on the search.


    Initial Query                      relocation perdiem    0
    Query Revision 1                   Relocation            1024
    Query Revision 2                   Relocation            1024
    Query Revision 3 and Final Query   per diem              256
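The occurrence counts behind such observations came from the VB scripts in Appendix B; the idea can be sketched in Python as follows (illustrative only, not the Appendix B code):

    from collections import Counter

    def unique_query_table(clean_queries):
        # Collapse repeats and record, per unique query: corpus frequency
        # (Query_Clean_Freq), word count, and CharCount (spaces included).
        freq = Counter(clean_queries)
        return [(q, n, len(q.split()), len(q)) for q, n in freq.most_common()]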

To support vocabulary analysis, the Unique Queries had to be broken down into word elements (see Figure 3). In linguistics this is called tokenization. Each unique query was broken down into single text segments, with each child segment retaining its mapping, or position, inside the Query_String_Clean. The relationship between Unique Queries and Unique Query Tokens is one-to-many, each having the same primary key C_QID.

    Figure 2. ORNL query database, highlighted tables supported query level analysis


The Unique Tokens table was designed to keep track of all the words that comprised unique tokens. It too has C_QID as a primary key. It also contains two counts: Freq_in_Corpus, which indicates how often the word appeared in the entire corpus, and Freq_in_Query, which indicates how many times the word appeared in a query. For example, the query "Maryville College Maryville Tennessee" has three unique tokens: 1) Maryville, 2) College, and 3) Tennessee. The Freq_in_Query value of the string Maryville for this C_QID is 2. The Freq_in_Corpus value is much higher. The last field in this table is CharCount, and it describes the number of characters in the unique token string.

    Figure 3. ORNL query database, highlighted tables supported vocabulary analysis


The last table is the Mutual Information (MI) table, which contains information specific to unique word pairs, along with their joint frequency (F12) and a value that describes how closely the word pairs are related, I(w1,w2). Frequencies for each token in the word pair are also in the MI table and were imported from the Unique Tokens table (Freq_in_Corpus1 and Freq_in_Corpus2). The primary key for this table is WP_ID and the foreign key is C_QID. The relationship between Unique Tokens and Mutual Information is many-to-one.

    Methods

    Mutual Information Analysis

Word analysis is a subcomponent of linguistics, the scientific study of natural language. Words are the smallest semantic units that comprise language, and it is their patterns of occurrence in text and phrases such as intranet queries, either in isolation or as pairs, that can help us understand the searcher's intent. This analysis focused on word pairs for queries with 2 ≤ n ≤ 14, where n is the number of terms or words in the query.

Mutual Information measures the dependence that the words in a word pair have on each other. It is a common measurement used in Information Theory to quantify relationships between words found in text or queries. It is theorized that mutual information values can be used to resolve query translation and query term management. Query term translation may be cross-language or within-language, for example translating query word pairs to key terms in a synonym index. Query translation may also be referred to as query expansion, which means the query is not replaced by new terms; rather, the query is revised to include new terms or to change the order of the terms to give it new semantic meaning based on its original interpreted conceptual intent. Mutual Information study is also sometimes referred to as collocation-based similarity or co-occurrence, word association study, and bigram analysis.

Mutual Information is also used to measure word association. Word association is very important in the area of information retrieval. Measuring the value of word associations empirically was largely developed by American psycholinguist James J. Jenkins. Psycholinguists study the psychology of language; in Jenkins's case, he focused on how words are combined to create meaning. A cornerstone study was conducted in


    1964 by Jenkins and Palermo establishing subjective norms for measuring word

    association ratios. The Palermo-Jenkins word association list was subsequently adopted

    as a standard for testing word association.

    In 1990, Church and Hanks challenged the Palermo-Jenkins standard on the

grounds that it was very subjective, and proposed measuring association norms with the

    concept of Mutual Information. Their motivation behind establishing the new measure

    was increased objectivity and reduced cost.

Mutual Information is an Information Theory term developed by Claude Shannon at Bell Labs in the 1940s. The theory is very dependent on entropy, which is a mathematical way to describe the uncertainty of a single random variable, H(X). Conditional entropy describes the entropy of a single random variable affected by another single random variable, H(X|Y). The reduction of uncertainty between these values is called Mutual Information (MI): I(X;Y) = H(X) - H(X|Y). Shannon largely used mutual information for digital communications, specifically signal data processing. He used it for data compression, which is a means to optimally store digital signal data.

Mutual information was later adopted for web-based purposes. Web-based MI was first introduced by Turney in 2001, where the method was renamed Pointwise Mutual Information (PMI). The Pointwise MI between two words w1 and w2 can be described as the log base 2 of the ratio between the probability of seeing the word pair and the product of the single-word probabilities (EQ1).

PMI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (EQ1)

In 2003, Wang, Berry, and Yang adapted it to the linguistic dependency of terms in web query strings. The Mutual Information (I) between two words w1 and w2 can be described as the natural log of the ratio between the word-pair probability (the relative frequency of the word pair over the number of cleaned queries, single- and multi-word) and the product of the single-word probabilities (each word's relative frequency over the number of cleaned queries). See equation two (EQ2) for the definition.


Mutual Information is defined over two points (words), w1 and w2, each having probabilities P(w1) and P(w2) (Church & Hanks, 1990). The mutual information formula I(w1,w2) used in this study is defined according to Wang, Berry, and Yang (2003) to be

I(w_1, w_2) = \ln \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (EQ2)

where P(w_1) and P(w_2) are probabilities estimated by the relative frequencies of the two words (see EQ3), and P(w_1, w_2) is the relative frequency of the word pair (order is not considered, therefore P(w_1, w_2) = P(w_2, w_1)).


The analysis observes all word pairs, not just the most frequently occurring word pairs in terms of strength. This ensures that the low-frequency pairs are not ignored.

The protocol for parsing all qualifying queries (queries with two words or more) in preparation for MI analysis was to break each query into adjacent word pairs. Word order is not differentiated. Two-word queries are natural word pairs. Three-word or longer queries received identical adjacent pairing. For example, the query "business operations calendar" is broken into two adjacent word pairs: 1) business operations and 2) operations calendar. Adjacent pairing in this fashion helped retain the query's semantic intent.
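A compact Python sketch of the EQ2 computation over the extracted pair and token frequencies follows; the variable names are illustrative, and this is not the database code used in the study.

    import math

    def mutual_information(pair_freq, token_freq, n_queries):
        # EQ2: I(w1, w2) = ln( P(w1, w2) / (P(w1) * P(w2)) ), with every
        # probability estimated as a relative frequency over the number
        # of cleaned queries.
        mi = {}
        for (w1, w2), f12 in pair_freq.items():
            p12 = f12 / n_queries
            p1 = token_freq[w1] / n_queries
            p2 = token_freq[w2] / n_queries
            mi[(w1, w2)] = math.log(p12 / (p1 * p2))
        return mi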

    Zipf Analysis

Zipf's Law is used to generally characterize a linguistic corpus, in this case a corpus of web queries. It states that, given some natural corpus of language, the frequency of any word is inversely proportional to its rank in the frequency table (Kali, 2003). Zipf's Law was used to characterize rank-frequency distributions of unique queries in "Searching the Web: The Public and Their Queries" (Spink, Wolfram, Jansen, & Saracevic, 2001). A double-log frequency plot is normally used to plot Zipf statistics, with the x-axis representing log (rank order) and the y-axis representing log (frequency). The corpus is considered to have a Zipf-like distribution if, when fitted with a straight line, it has a slope of m = -1. It is suggested that Web use follows a Zipfian pattern when plotted on a log-log scale (Nielsen, 1997). Zipf distributions are often used to characterize TLA components such as queries and vocabularies, page requests, and hypertext references. Starting from the upper left oval and working down to the lower right of the graph, three circles describe three key word frequencies that occur in a typical corpus's word distribution (Figure 4). In a typical Zipf distribution there are a small number of queries or words that are used repeatedly (upper left oval), another group which occurs less frequently (middle oval), and a sizeable group of words that are rarely used (lower right oval).


    Figure 4. Typical Zipf distribution plot
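A rank-frequency plot of this kind can be produced with a short Python sketch (illustrative only, assuming numpy and matplotlib are available):

    import numpy as np
    import matplotlib.pyplot as plt

    def zipf_plot(frequencies):
        # Rank 1 = most frequent query/word; Zipf-like if fitted slope is near -1.
        freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
        plt.loglog(ranks, freqs, marker=".", linestyle="none")
        plt.xlabel("log(rank order)")
        plt.ylabel("log(frequency)")
        plt.title("Rank-frequency distribution (slope = %.2f)" % slope)
        plt.show()
        return slope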

    Approach to Spell Check Query Vocabulary

The misspelled query words were identified using a custom spell checker application. The reference tools in Microsoft Word, specifically the dictionary, grammar guide, thesaurus, and spell checker, are very useful in application development. There are two key objects, denoted by the red bracket in the Word spell-check procedure (Figure 5). The first is the ProofreadingErrors collection, which is a range object containing the proofing errors; these can be any form of text (word, sentence, paragraph, or an entire document). After SpellCollection is declared as a set of range objects, SpellCollection is populated with the list of corpus misspelled words. The procedure loops through the word list, comparing the words with the Word reference tools. Microsoft provides examples of Visual Basic for Applications procedures for Word on the web. It does not take much programming experience to download and invoke the spell-checker procedure, but it does require programming skills to manage the output in a project-specific user interface.
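Figure 5 shows the Word-based VBA procedure. As a much simplified stand-in, the same flagging idea can be expressed in Python against a plain word list; this sketch does not use Word's proofing tools, and the dictionary file is an assumed input.

    def flag_misspellings(tokens, word_list_path="words.txt"):
        # Load a reference word list (a stand-in for Word's dictionary).
        with open(word_list_path, encoding="utf-8") as fh:
            known = {w.strip().lower() for w in fh if w.strip()}
        # 1 = flagged as misspelled, 0 = found in the reference list,
        # mirroring the 1/0 designation recorded for unique tokens.
        return {t: 0 if t.lower() in known else 1 for t in set(tokens)}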


    Figure 5. Spell-Check Procedure


    Chapter 4

    Results and Discussion

RQ1: What general search behaviors do ORNL searchers exhibit when searching the intranet?

Most intranet search software automatically collects basic usage information in files called access and query logs. Information that can be mined from the log files describes general organizational search behaviors. At a minimum, this information includes: what browsers the users prefer to search with, times when they search, which external search engines they are more likely to use when searching outside the intranet, what topics they are seeking via page views, how often they are receiving no hits on their requests, and what types of page results are clicked most often. The next few pages describe how this study used information contained in access and query log files to characterize the general search behaviors of ORNL searchers.

Four distinct SQL queries were developed (Appendix A) to extract data that identified general user search behaviors exhibited by ORNL intranet searchers.

a. What is the size of the ORNL search audience?
b. What interfaces do ORNL users employ most when they search?
c. What types of pages do ORNL users click most often when results are available?
d. What topics do ORNL users commonly search for?
e. When do ORNL users search and what are their search results?

The first and second of the queries focused on determining the approximate size of the ORNL search audience and characterizing what tools they are using for search inside the enterprise. The results showed that ORNL had 8,640 total distinct users between September and December of 2007, with the average visitor staying 12.2 minutes (a). The average number of unique daily visitors was 4,966. The number of users is defined as the number of distinct IP addresses that submitted search queries. This assumption does not take into consideration that one user may actually submit queries from multiple computers. The latter is highly probable, as most ORNL users have at least one desktop computer and one laptop computer. The data is useful as an estimate of the general size of the search audience for this study.
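The audience estimate rests on counting distinct client addresses; a Python sketch of that count follows (illustrative only, assuming the W3C #Fields directive names a c-ip column as in Table 1):

    def count_distinct_clients(log_path):
        clients, ip_index = set(), None
        with open(log_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if line.startswith("#Fields:"):
                    ip_index = line.split()[1:].index("c-ip")
                elif line.startswith("#") or ip_index is None:
                    continue
                else:
                    parts = line.split()
                    if ip_index < len(parts):
                        clients.add(parts[ip_index])
        return len(clients)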


The browser chosen by a user for search often has much to do with the platform and operating system. Examining page hits by browser type shows the top two operating systems for ORNL are Windows XP and Vista, followed by Mac OS (Table 4). Windows clearly represents the bulk of computer platforms at ORNL. Since the top platform OS is by Microsoft, one might assume the most popular browser is by Microsoft.

    Table 4. Top ORNL computer platforms


Browsers are software residing on computer platforms that allow users to access and search a web-based search environment like the Internet or an intranet. ORNL employees necessarily need to access the intranet to use applications, to do research, to share information, or to order equipment. The range of intranet-based tasks by user is great, so browser developers have been diligent in creating browsers with distinct performance characteristics. Two commonly utilized browsers are Internet Explorer, made by Microsoft, and Firefox, developed by Mozilla. Another browser emerging in the search environment is Chrome, by Google. Browsers have different levels of speed, reliability, ease of use, information organization, data presentation and formatting, search engine plug-ins, etc. Understanding what browsers a user audience prefers may impact how the intranet information should be organized and presented for search.

The number of distinct browsers reported for search in this study using the Log Parser 2.2 query was 494. The browser count seemed high for the data, but that was because each browser is reported as a brand (Firefox, Internet Explorer, etc.) plus a specific version, for a specific operating system (OS), for a specific OS version, etc. This was not surprising, as most lab workers have at least two computers (a laptop and a desktop) and each likely has a different OS, with varying OS version numbers, browsers, version numbers, etc. Independent of the high number of distinct browsers reported, the results showed that ORNL searchers overwhelmingly use Internet Explorer 7.0 (b) as their connection to the SharePoint search server (Table 5).

The third SQL query selected the top 20 most viewed web links, or URLs. The most popular URLs requested were pulled from the access log (Table 6). URL queries were removed from the query logs to focus solely on vocabulary analysis. The top ranking URL is the default MOSS page site. The second URL is a handler for managing web site administration requests. The third URL listed simply as /