Transcript
Page 1: Mining the web for business intelligence: Homepage ... · Thanks for visiting my page! Do e-mail me with any questions or suggestions for my webpage! Figure 1: Personal webpage. search

INTRODUCTION
Internet technology has undergone remarkable growth in recent years and has become an indispensable part of people’s lives. As of March 2004, the worldwide internet population numbered 945 million. It is expected to achieve an annual growth rate of 16 per cent in 2005 and 2006.1 Most people will agree with the assertion that market modelling efforts over the next decade will reflect the internet’s growing influence on

32 Database Marketing & Customer Strategy Management Vol. 12, 1, 32–54 © Henry Stewart Publications 1741–2447 (2004)

Mining the web for business intelligence: Homepage analysis in the internet era
Received (in revised form): 19th April, 2004

Kin-nam Lau
is currently an associate professor of marketing at The Chinese University of Hong Kong, Shatin, Hong Kong. His research interests include customer relationship management (CRM), data mining, management information systems (MIS) and marketing research. He has published in Journal of Management Information Systems, Journal of Marketing Research, Journal of Classification, Decision Science, European Journal of Operations Research and Journal of Database Marketing & Customer Strategy Management. He obtained his PhD from Purdue University, USA.

Kam-hon Lee
is Professor of Marketing at The Chinese University of Hong Kong. His research areas include business negotiation, cross-cultural marketing, marketing ethics, social marketing and tourism marketing. He has published in the Journal of Marketing, Journal of Management, Journal of Business Ethics, European Journal of Marketing, International Marketing Review, Psychology and Health, The World Economy, Cornell HRA Quarterly and other refereed journals. He also serves on the editorial boards of various international and regional journals. He obtained his BCom and MCom at The Chinese University of Hong Kong and his PhD in marketing at Northwestern University in Evanston, Illinois, USA.

Ying Ho
is a PhD student in marketing at The Chinese University of Hong Kong. Her research focuses on tourism marketing. She has published in Cornell HRA Quarterly. She obtained her BBA at The Chinese University of Hong Kong and her MSc in international business at University of Manchester Institute of Science and Technology in Manchester, UK.

Pong-yuen Lam
is currently the Director of New Beverages for Coca Cola China Division. He started his marketing career with Procter & Gamble Hong Kong in 1989 and was the Director of Marketing for McDonald’s China in 1995–97. His research areas include MIS, text mining and operational research. He has published in the Journal of the Operational Research Society and Cornell HRA Quarterly. He received his BSc in computer science, MBA and PhD in marketing from The Chinese University of Hong Kong. He is also a Harvard alumnus after receiving the general manager executive education at Harvard Business School.

Abstract
Information in websites provides good opportunities for marketers to understand and to acquire potential customers through the internet. The essence of web mining is to use powerful search engines to convert unorganised text information into customer intelligence stored in a database. In this paper, the authors construct a dictionary of 80,750 keywords/phrases to identify the portraits of 6,173 students from self-revealed information in their personal homepages. The authors summarise their empirical results and report the technological limitations and marketing challenges of web mining.

Kin-nam Lau
Associate Professor of Marketing, Faculty of Business Administration, K. K. Leung Building, The Chinese University of Hong Kong, Shatin, Hong Kong.

Tel: +852 2609 7766; Fax: +852 2603 5473; e-mail: [email protected]


consumer behaviour and marketing strategy.2 The internet is bringing about major changes in how businesses are conceived and managed, ushering in the era of e-business.3

Existing internet-related research focuses on how the development of the World Wide Web affects marketing decisions and consumer behaviours. For example, Prasad et al. analyse the relationship between pricing strategy and advertising levels for internet websites.4 Degeratu et al. compare how different store environments (online and traditional stores) can differentially affect consumer choices.5 Deleersnyder et al. quantify the impact of adding an internet channel on the long-term performance growth of a firm’s established channels.6 On the consumer behaviour side, Dholakia et al. study the susceptibility of consumers to herding bias in digital auctions.7 Shankar et al. address the issue of customer satisfaction and loyalty in online and offline environments.8

Recent advances in internet technology provide market researchers with a continuous flow of timely and accurate business intelligence. For instance, knowledge about potential customers (eg gender, age, marital status, interests and hobbies) is available on personal homepages, which can be converted into consumer intelligence databases for marketing purposes. Discussions in newsgroups and online bulletin boards may serve as abundant sources of market intelligence (eg consumer preference, evaluation of existing products, customer complaints). Key firmographics (eg line of business, numbers of staff, year of establishment) can be extracted from company websites, which help managers better to understand business partners and competitors.

Traditionally, marketers have acquired customer information through market surveys. This process of information gathering, analysis and dissemination is labour intensive and time consuming. The internet offers an alternative channel for marketers to improve effectiveness and efficiency in their marketing research and customer acquisition efforts. In recent years, it has been noticeable that increasing numbers of the internet population now have their own personal webpages. This trend is expected to continue into the next decade because of the increasing ease of creating a homepage and people’s increasing internet experience. More individuals are now setting up their own webpages to provide personal interests openly, and welcome other people to interact with them. Free web space can be easily and freely obtained from companies such as Yahoo, Netscape, AOL, etc. Software tools to create personal homepages are also getting more user-friendly. Companies such as Intel, Microsoft, Yahoo, etc are creating and promoting new uses of personal websites. For individuals, it is better to post family photos and/or videos of several megabytes in one’s personal website for access by relatives and friends rather than e-mailing them. These personal websites contain enormous amounts of personal information in the form of text-based data, which can be converted into useful business intelligence for marketing purposes.

As an illustration, assume one is a tourist destination marketer who would like to understand potential customers for subsequent acquisition based on their life stage, socioeconomic and behavioural characteristics. Figure 1 is a personal webpage. This webpage identifies the owner as a well-educated, single female who likes music, drama, ball games, cycling and travelling. Given this understanding, tourist destination marketers may target the potential customer by offering service packages that match her interests. In this way, personal


Mining the web for business intelligence


websites provide marketers with a golden opportunity for capturing potential customers’ self-revealed personal information and sending them tailormade promotion packages.

CONCEPT OF WEB MINING
In principle, millions of webpage owners can be systematically categorised in a database structure. The task of converting textual information of a few webpage owners into a database could easily be carried out manually (ie by eye). However, due to the massive number of websites, a ‘human eye’ approach may demand thousands of website analysts working for a very long period of time. This would result in substantial cost, fatigue problems and difficulty in standardising different judgments by different analysts. Therefore, the process of webpage analysis has to be automated. This requires the researcher to develop a dictionary containing thousands of important keywords and/or phrases (eg terms underlined in Figure 1) and the use of a powerful engine to search each of these keywords in every webpage to identify its owner’s characteristics. This process is referred to by the authors as ‘web mining’. Web mining is the process of retrieving and converting text information on websites into an organised database containing key variables of interest for better understanding customers. It involves the use of text mining techniques which extract customer intelligence from unstructured or semi-structured text documents on personal websites. The ratio of structured to unstructured information currently stored electronically is estimated as 10 per cent structured to 90 per cent unstructured, and this trend is expected to continue.9

Through web mining, it is possible to make use of the personal information on the web to understand the demographic, behavioural and attitudinal characteristics of potential customers. Based on such understanding, marketers


Lau, Lee, Ho and Lam

About Gloria

I am working as a merchandiser in a stationery company. I am single. That’s why I am still living with my parents and my brother. I got my Bachelor of Science degree in 2000. My major was mathematics.

I like cycling, playing piano, listening to music and playing ball games. During my school life, I also participated in many activities such as drama team. I like travelling too! I travelled to Thailand at the end of September 2001 with my family. I did lots of shopping and sightseeing there. It is really a wonderful place.

Thanks for visiting my page! Do e-mail me with any questions or suggestions for my webpage!

Figure 1: Personal webpage


can offer tailormade service packages to individual customers.

The fundamental concept of web mining is actually not new to marketing researchers. Since the early 1950s, sociologists and psychologists have used content analysis to convert semi-structured qualitative information into well-structured information for statistical analysis. Content analysis has since gone through five methodological stages, namely: frequency analysis, valence analysis, intensity analysis, contingency analysis and computer analysis. In the late 1960s, researchers started to use computers to assist them in analysing text information. Today, numerous computer programs (eg CAIR, ATLAS/ti, Catpac, CDC, EZ-Text etc) are available for a more sophisticated content analysis.10

Since the 1980s, computer scientists have developed search engines for users to search the internet according to their self-selected keywords. Most of these engines are free and publicly available from ISP homepages (eg Yahoo, Netscape, etc). There are also marketing studies focusing on the use of search engines. Bradlow and Schmittlein evaluate the ability of six popular web search engines to locate webpages containing common marketing/management phrases.11 Hoque and Lohse study how the design of user interfaces in websites influences information search costs.12 This shows the promise of new marketing research methods which use the internet and websites. The present study continues this tradition and demonstrates a way of converting webpage information into a systematic marketing database.

Web searches via search engines (eg Yahoo, Netscape, MSN, etc) are extremely popular nowadays. The issue is, ‘Are we satisfied with the results?’ In most publicly available search engines, one is only allowed to use Boolean search terms (eg AND, OR, NOT) with a limited number of keywords, which is equivalent to asking each webpage owner a few questions. Thousands of webpages may satisfy the search criteria, yet the researcher still has to manually read through these pages to identify potential webpage owners. Since webpages satisfying a few keywords may not necessarily meet expectations, due to the ‘context issue’, such a process can be time consuming and may not be fruitful for marketing purposes. Unlike the current web search, web mining allows for search techniques beyond simple Boolean searches and accepts tens of thousands of keywords in the search process. The results of text search will be further analysed to identify key characteristics of each webpage owner.

The concepts of web mining and text mining are gaining increasing popularity nowadays. A number of powerful tools are currently available to analyse textual information in e-mails, customer surveys, corporate documents, medical records, patent databases etc (see Table 1). Industries such as financial services, insurance services, healthcare and consultancy have successfully applied text-mining techniques in their data management functions. Examples of businesses with successful applications are Principal Financial Group,13 Heritage Mutual Insurance,14 Hewitt Associates LLC,15 Louisville Hospitals16 and Disease Research at the University of Pennsylvania.17 Users applaud the efficiency of these tools in managing huge amounts of text-based information. Since this concept has proven to be useful in a number of industries, the authors propose that marketers may use this technique to locate target customers, identify their needs and communicate with them at a relatively low cost.

In conclusion, web mining can be considered as the extension and


improvement of current web search methods and traditional content analyses. Its ultimate goal is to integrate marketing research with database marketing in the internet era. Its objective can be stated as:

‘To retrieve and convert unorganised text information from both personal and company websites into an organised database containing key marketing variables of interest to the researcher (eg demographics, socioeconomics, behaviour, interests etc) for better understanding of our customers’.

Web mining research in computer science focuses on the development of search algorithms and methods.18 To view it from a marketing information search perspective, five key stages are typically involved.

Definition of research objectives and concepts
It is crucial at the very beginning for marketers to identify their targeted population and the types of information they are looking for. Key aspects of personal information (labelled as ‘concepts’) are general demographics, stage of life cycle, hobbies/interests, wealth/purchasing power, etc. This stage


Table 1: Text-mining software

Product name | Vendor | URL
Intelligent Miner for Text | IBM | http://www-3.ibm.com/software/data/iminer/fortext
SAS Text Mine | SAS | http://www.sas.com/technologies/analytics/datamining/textminer/
SPSS LexiQuest Mine | SPSS | http://www.spss.com/spssbi/lexiquest/mine.htm
STATISTICA Text Miner | StatSoft | http://www.statsoftinc.com/textminer.html
GoodNews | Robotics Institute | http://www-2.cs.cmu.edu/~softagents/text_miner.html
SMART Text Miner | SMART Communications | http://www.smartny.com/miner.htm
WordStat | Provalis Research | http://www.simstat.com/
TextAnalyst | Megaputer Intelligence | http://www.megaputer.com/products/tm.php3
SemioMap | Semio Corp | http://www.informationweek.com/683/83iumin.htm
ClearResearch | ClearResearch Corp. | http://www.clearforest.com/Products/Analytics/ClearResearch.asp
Copernic Summarizer | Copernic Technologies | http://www.copernic.com/en/products/summarizer/
CATPAC | Terra Research & Computing, Inc. | http://www.pbelisle.com/library/reviews/catpac6.htm
dtSearch Text Retrieval Engine | dtSearch Corp. | http://www.dtsearch.com/PLF_Features_2.html
DataSet V | Intercon Systems | http://ds-dataset.com/DIMRS_Features.htm
Statistical Text Mining | Enkata Technologies | http://www.enkata.com/products/statistical_text_mining.html
Files Search Assistant | ASK-Labs | http://www.aks-labs.com/products/files_search_assistant.htm
ExactAnswer | InsightSoft-M | http://www.insight.com.ru/products.html#3
INFACT | Insightful Corp. | http://www.insightful.com/products/infact/default.asp
ISYS search solution | Odyssey Development | http://www.isysusa.com/products/index.html
Klarity | Intology | http://www.intology.com.au/20products/
Kwalitan | Science Plus Group | http://www.scienceplus.nl/
Lextek Language Identifier | Lextek International | http://www.languageidentifier.com/
Leximancer | Leximancer | http://www.leximancer.com/overview.html
Lextek | Lextek International | http://www.lextek.com/onix/
Matchpoint | Triplehop Technologies | http://www.triplehop.com/product_demos/matchpoint.html
Monarch Pro | Datawatch Corporation | http://monarch.datawatch.com/
MindServer | Recommind Inc. | http://www.recommind.com/english/solutions/default.asp?url=Products
Online Miner | Temis Group | http://www.temis-group.com/
TextQuest | Social Science Consulting | http://www.textquest.de/eindex.html
Readware Information Processor | Management Information Technology | http://www.readware.com/prod_infoproc.asp
VantagePoint | Search Technology | http://www.thevantagepoint.com/pages/whitesheet_1.html
VisualText | Text Analysis International, Inc. | http://www.textanalysis.com/Products/products.html
INTEXT | Social Science Consulting | http://www.intext.de/eindex.html
Spy-EM | University of Illinois | http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html

Sources: http://www.google.com.hk; http://www.kdnuggets.com/software/text.html


thousands of keywords are used in the dictionary, an equal number of binary variables (1 = keyword matched and 0 = keyword not matched) must be constructed. These variables are analysed, and the results are translated into rules to identify key characteristics for each webpage owner. This stage is equivalent to data analysis in traditional market research.
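The binary coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `keyword_indicators`, the three-word dictionary and the two sample pages are hypothetical, and whole-word, case-insensitive matching is a simplification rather than the matching rule of the text mining tool used in the study.

```python
import re

def keyword_indicators(documents, keywords):
    """Return one row per document holding a 1/0 flag for every dictionary keyword."""
    rows = []
    for text in documents:
        lowered = text.lower()
        row = {}
        for keyword in keywords:
            # Assumed rule: whole-word, case-insensitive match.
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            # 1 = keyword matched, 0 = keyword not matched
            row[keyword] = 1 if re.search(pattern, lowered) else 0
        rows.append(row)
    return rows

# Hypothetical sample pages and dictionary, for illustration only.
pages = [
    "I like cycling, playing piano and travelling.",
    "I am single and I like football.",
]
dictionary = ["cycling", "football", "single"]
matrix = keyword_indicators(pages, dictionary)
```

Each row of `matrix` corresponds to one homepage and each key to one binary variable, so a dictionary of tens of thousands of keywords yields the same number of columns.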

Solving the non-response problem
There are many reasons for non-responses (missing values). For example, different webpage owners may have different ways of expressing their behaviour. Therefore, it is difficult to construct a dictionary to capture all possible ways of expression. Even if the dictionary is almost perfect, webpage owners typically do not release all of their personal information to the public and this will result in a substantial number of non-responses to certain keywords. However, some missing information can be estimated through statistical techniques. Moreover, traditional survey research on a sample of webpage owners can be used as supplementary market information in tackling the non-response problem. This stage is equivalent to the handling of non-response problem in traditional survey research.

Summary
The intention of this paper is not to show that the web mining approach is ready and viable today. Instead, the authors argue that this approach has the potential to become a major trend in future marketing research. Their objective is to evaluate its feasibility through a real life study on approximately 6,000 student homepages using an existing text mining tool. The research focus is on the dictionary

is the same as the research problem definition stage in traditional market research.

Web crawling (data collection)
With research objectives set, web crawlers (ie sophisticated computer programs) are sent to a user-defined set of uniform resource locators (URLs) or a web space to collect information (eg text files, HTML files etc). Metadata collected from target homepages are stored in a database for text and data analyses. This stage is similar to the data collection stage in traditional market research.

Dictionary construction and text search (questionnaire design and interview)
As search engines and text mining tools can recognise keywords and phrases but cannot understand the concepts behind the text, it is necessary for researchers to construct a dictionary that acts as the knowledge base to associate keywords and phrases with specific concepts. The dictionary is then used to translate unorganised text on various global websites into meaningful figures and indexes, which provide significant meaning to marketers. With a comprehensive dictionary set, text mining tools act like interviewers to collect, analyse and store the personal information that can be mined from webpages. This stage is similar to the process of designing questionnaires and conducting interviews in traditional market research.

Text analysis to identify key characteristics of customers (data analysis)
At this stage, there are search results for each keyword. Since tens of


construction and text analysis process. Specific research questions are:

1 Is it possible to construct a concise dictionary so that keywords can be interpreted in the right context to identify key characteristics of webpage owners?

2 Is existing technology effective and efficient enough to analyse huge numbers of personal homepages?

3 What are the major concerns in future web mining applications?

In the following section, the authors first outline the design of their study and then explain the dictionary construction and text search processes.

RESEARCH DESIGN OF FEASIBILITY STUDY
To demonstrate the web mining process, envisage a group of marketers who are interested in studying the behavioural and demographic profiles of college students. The objective of the authors’ web mining program is to build a comprehensive and well-structured information database for this market segment, which may facilitate better understanding of the specific needs and interests of individual student customers. Based on this understanding, marketers can locate a group of potential customers for a specific marketing programme and bring tailormade promotion packages to these individual customers.

Suppose students provide a company with their personal websites; marketers can then conduct a feasibility study on students from two major universities in the world. This study focuses on eight major attributes (called concepts) of a person’s demographic profile. They are gender, year of study, major, marital status, sibling, quiet hobby, sports interest and travel interest. It is expected that the information will enable marketers to understand customer needs, and target their services at the customers whose needs most coincide with the service packages to be offered.

A number of powerful text-mining tools are available today for analysing e-mails, patent databases, etc. In this study, the authors used one of the well-known text mining products as the primary tool for text analysis: the IBM Intelligent Miner for Text.19 It is used for web crawling and text search. The researcher collected 7,941 HTML documents (personal homepages) from the target universities. As a result of preliminary screening, a total of 6,173 personal homepages were retained for analysis. The reasons for the removal of 1,768 homepages are described below.

Non-English homepages
The English dictionary constructed for this feasibility study cannot adequately cater for the text mining needs of personal homepages written in other languages.

Non-personal homepages
A number of webpages hosted on the universities’ websites are webpages for courses, departments, assignments, etc, which do not contain personal information and thus are not the target subjects for this analysis.

Miscellaneous problems
There are other problems, such as homepages that cannot be loaded or read by the text search engine, pages containing no textual data, password-protected webpages and empty pages (mostly webpages under construction). In addition, webpages with


which result to use in interpreting the concept. Results at a lower certainty level will be used only when the results of a higher certainty level are unavailable. Within the same level of certainty, any conflicting result leads to a missing value. Dictionary construction for each concept is explained in Table 2.
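The fallback rule across certainty levels can be written as a small function. The sketch below is a reading of the rule stated above, not the authors’ implementation: `resolve_concept` is a hypothetical name, results are represented as Python sets ordered from level one downwards, and treating a conflict at one level as final (rather than falling back to a lower level) is an assumption, since the text does not say what happens next.

```python
def resolve_concept(results_by_level):
    """Pick one value for a concept from search results ordered by certainty.

    results_by_level: list of sets of candidate values, level one (most certain) first.
    Returns the chosen value, or None for a missing value.
    """
    for candidates in results_by_level:
        if not candidates:
            # Nothing found at this level: fall back to the next, less certain level.
            continue
        if len(candidates) == 1:
            return next(iter(candidates))
        # Conflicting results within the same level lead to a missing value.
        # Assumption: the conflict is final and lower levels are not consulted.
        return None
    return None  # no level produced a result

# Level one is empty, so the level-two result is used.
gender = resolve_concept([set(), {"female"}, {"male"}])
# Level one contains both values, so the concept is recorded as missing.
conflict = resolve_concept([{"male", "female"}, {"female"}, set()])
```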

Gender

This contains keywords and/or phrases indicating whether the homepage owner is a male or female. Search results of different certainty levels are:

1 Level one (most certain): 1,816 male names and 1,648 female names20 are searched in the ‘title’, ‘header’ and ‘meta’ fields of HTML files. These sections of HTML files, especially the ‘title’ field, usually contain the name of the webpage owner. Therefore, names found in these sections are more likely to be in the right context, while names found in other parts of the document may be names of friends or relatives.

2 Level two (less certain): 48,496 terms such as ‘my name is John’, ‘call me Mary’, etc are searched within all sections of an HTML document. A search term is matched only if the same search term appears in a document as a consecutive clause.

3 Level three (least certain): other indicative phrases (eg my wife, my husband, etc) are searched in all sections of a document. Altogether there are 80 search terms. A search term is matched only if the same search term appears in a document as a consecutive clause.
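The level-two count appears consistent with crossing every name with the 14 indicative phrases listed for gender in Table 2: 14 × 1,816 = 25,424 male terms and 14 × 1,648 = 23,072 female terms, which together give 48,496. A minimal Python sketch of that expansion follows; the `{name}` placeholder syntax and the function name `expand_terms` are illustrative assumptions, and the two sample names stand in for the full name lists, which are not reproduced here.

```python
# The 14 indicative phrases from Table 2, with the name slot written as {name}.
TEMPLATES = [
    "am {name}", "I'm {name}", "call me {name}", "my name is {name}",
    "my given name is {name}", "my Christian name is {name}",
    "am called {name}", "I'm called {name}", "am known as {name}",
    "I'm known as {name}", "welcome homepage of {name}",
    "{name}'s homepage", "welcome to {name}'s place", "welcome to {name}'s page",
]

def expand_terms(names):
    """Cross every name with every template to build level-two search terms."""
    return [template.format(name=name) for name in names for template in TEMPLATES]

# Counts implied by the name lists reported in the paper.
male_terms = len(TEMPLATES) * 1816
female_terms = len(TEMPLATES) * 1648
total_terms = male_terms + female_terms

# Two sample names only; the real lists hold 1,816 and 1,648 names.
sample_terms = expand_terms(["John", "Mary"])
```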

Year of study

This concept includes keywords and phrases related to one’s student status (ie

more than one owner are not qualified as individual personal homepages and are thus excluded from this study.

TEXT SEARCH AND DICTIONARY CONSTRUCTION

Text search
As the text collection in this study is large, search indexes were pre-built to speed up the text search process. Major index types of the IBM Text Search Engine used in this study include the Linguistic Index and Ngram Index. The Linguistic Index applies the same linguistic processing to the search terms before searching while the Ngram Index involves no linguistic processing at all (see Figure 2).

Moreover, different types of query interface demonstrate varying degrees of search sophistication. Single-word query and multiple-term query can be enhanced by using section support, which specifies searching in specific sections of a structured HTML document eg in title, meta, and header fields (see Figure 3) or context-based query, which can be used to limit a phrase search within a sentence (see Figure 4).

Dictionary construction
The interpretation of keywords and/or phrases identified in homepages has to be context-based. For example, the word ‘Mary’ found in a homepage does not automatically imply that the homepage owner is a female. The complete sentence could be either ‘This is my girlfriend Mary’ or ‘Mary’s homepage’. Therefore, search results for different keywords/phrases have to be categorised into different levels; eg level one, level two and so on (in descending order of certainty). With search results on each keyword, the researcher will choose


undergraduate or postgraduate). Keywords and phrases are searched within all sections of an HTML document. A search term is matched only if all words of the search term are found within the same sentence (but not necessarily as a consecutive clause). The numbers of search terms used for undergraduates and postgraduates are 83 and 82 respectively.

Major

Majors selected for this study are business, science and engineering, art, social sciences, medicine and law. For simplicity, the last four majors are grouped into a single category labelled ‘others’. A ‘double major’ category is created taking into consideration the large number of students who have academic interests in more than one area. Names of majors and their associated courses are used as indications of one’s concentration area/research interest. A level one result is obtained by searching 2,661 indicative phrases with majors such as ‘my major is engineering’. This result level is most certain because, to a certain extent, the inclusion of indicative phrases (eg ‘my concentration is’, ‘my major is’) validates the interpretation of search terms in the right context. A level two result is obtained by searching 1,585 course names associated with the respective majors. A search term is


Consider: Document contains: ‘I also create web pages for myself’.

1. Using Linguistic Index
If search ‘pages’, then ‘keyword matched’.
If search ‘page’, then ‘keyword matched’.

2. Using Ngram Index
If search ‘pages’, then ‘keyword matched’.
If search ‘page’, then ‘keyword not matched’.

Figure 2: Differences between the Linguistic Index and Ngram Index
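The behaviour shown in Figure 2 can be imitated with a toy Python sketch. It does not reproduce the IBM Text Search Engine: `base_form` is a crude, hypothetical plural stripper standing in for real linguistic processing, and `surface_match` simply compares exact word forms to mimic the ‘no linguistic processing’ outcome reported in the figure, which is an assumption about how the Ngram Index behaves in this example.

```python
import re

def tokens(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def base_form(word):
    # Toy normalisation: strip a plural "s". A real linguistic index
    # would apply proper lemmatisation instead.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def linguistic_match(document, term):
    """Match after applying the same normalisation to the document and the search term."""
    return base_form(term.lower()) in {base_form(word) for word in tokens(document)}

def surface_match(document, term):
    """Match exact word forms only, with no linguistic processing."""
    return term.lower() in set(tokens(document))

doc = "I also create web pages for myself."
results = {
    ("linguistic", "pages"): linguistic_match(doc, "pages"),
    ("linguistic", "page"): linguistic_match(doc, "page"),
    ("surface", "pages"): surface_match(doc, "pages"),
    ("surface", "page"): surface_match(doc, "page"),
}
```

With normalisation on both sides, ‘page’ and ‘pages’ reduce to the same base form, which is why only the unprocessed search misses the singular query.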

Consider:
Document 1 contains: ‘John’s homepage’ in title field
Document 2 contains: ‘John is good.’ NOT in title field

Query requirement: Use section support for ‘title’

Search result:
If search ‘John’ in Document 1, then ‘keyword matched’.
If search ‘John’ in Document 2, then ‘keyword not matched’.

Figure 3: Section support
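Section support as illustrated in Figure 3 can be approximated with Python’s standard `html.parser` module. The class `SectionExtractor`, the function `section_match` and the two sample documents are hypothetical; treating h1 to h6 tags as the ‘header’ section and the `content` attribute of `meta` tags as the ‘meta’ section is an assumption about what those fields mean in the original tool.

```python
import re
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect text from the 'title', 'header' (h1-h6) and 'meta' sections."""
    HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = {"title": [], "header": [], "meta": []}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADERS:
            self._current = "header"
        elif tag == "meta":
            content = dict(attrs).get("content")
            if content:
                self.sections["meta"].append(content)

    def handle_endtag(self, tag):
        if tag == "title" or tag in self.HEADERS:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.sections[self._current].append(data)

def section_match(html, term, section="title"):
    """Return True if the term occurs as a whole word inside the chosen section."""
    parser = SectionExtractor()
    parser.feed(html)
    text = " ".join(parser.sections[section])
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None

doc1 = "<html><head><title>John's homepage</title></head><body><p>Hello</p></body></html>"
doc2 = "<html><head><title>Welcome</title></head><body><p>John is good.</p></body></html>"
match1 = section_match(doc1, "John")
match2 = section_match(doc2, "John")
```

Restricting the search to the title keeps the name in Document 2, which appears only in the body text, from being attributed to the page owner.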

Consider: ‘I am married with one child. I like playing football’.

Query requirement: Use context-based query within a sentence.

Search result:
If search ‘like football’, then ‘phrase matched’.
If search ‘with football’, then ‘phrase not matched’.

Figure 4: Context-based query
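A context-based query of the kind shown in Figure 4 requires all words of the search term to fall within one sentence, though not necessarily next to each other. The Python sketch below illustrates that rule under simplifying assumptions: `sentence_match` is a hypothetical name, sentences are split naively on full stops, exclamation marks and question marks, and word order is ignored.

```python
import re

def sentence_match(document, phrase):
    """Return True if every word of the phrase occurs within a single sentence."""
    words = phrase.lower().split()
    # Naive sentence splitting; a real tool would use proper sentence detection.
    for sentence in re.split(r"[.!?]+", document.lower()):
        found = set(re.findall(r"[a-z']+", sentence))
        if all(word in found for word in words):
            return True
    return False

text = "I am married with one child. I like playing football."
like_football = sentence_match(text, "like football")
with_football = sentence_match(text, "with football")
```

Here ‘like’ and ‘football’ share the second sentence, whereas ‘with’ and ‘football’ sit in different sentences, so only the first query matches.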


Table 2: Dictionary construction

Concepts: 1. Gender; 2. Year of study
Search method: Section search; Indicative phrases + names; Keywords/phrases; Other related keywords/phrases

1. Gender

Text search in ‘title’, ‘header’ and ‘meta’ fields of HTML files:
Popular male names (1,816)*
Popular female names (1,648)

Indicative phrases + names:
Male: am (Male names), I’m (Male names), call me (Male names), My name is (Male names), my given name is (Male names), My Christian name is (Male names), am called (Male names), I’m called (Male names), am known as (Male names), I’m known as (Male names), welcome homepage of (Male names), (Male names)’s homepage, welcome to (Male names)’s place, welcome to (Male names)’s page (25,424)
Female: am (Female names), I’m (Female names), call me (Female names), my name is (Female names), my given name is (Female names), my Christian name is (Female names), am called (Female names), I’m called (Female names), am known as (Female names), I’m known as (Female names), welcome homepage of (Female names), (Female names)’s homepage, welcome to (Female names)’s place, welcome to (Female names)’s page (23,072)

Other gender-related phrases: eg ‘my boyfriend’, ‘my wife’, ‘I am a man’ etc. (80)

2. Year of study

Keywords/phrases indicating undergraduate studies: eg I’m a freshman, am undergraduate, I am first semester major, is my first college year etc (83)
Keywords/phrases indicating postgraduate studies: eg I’m grad student, I am candidate for MPhil degree, I’m doing my PhD, received my PhD etc (82)
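
The gender counts in Table 2 are consistent with every indicative phrase being combined with every name: 14 phrases times 1,816 male names gives 25,424 terms, and 14 times 1,648 female names gives 23,072. The sketch below shows one way such an expansion could be generated. It is an illustration only; `TEMPLATES`, `expand` and the two sample names are hypothetical and are not taken from the authors' dictionary files.

```python
# The 14 indicative phrases listed for gender in Table 2, with a {name} slot.
TEMPLATES = [
    "am {name}",
    "I'm {name}",
    "call me {name}",
    "my name is {name}",
    "my given name is {name}",
    "my Christian name is {name}",
    "am called {name}",
    "I'm called {name}",
    "am known as {name}",
    "I'm known as {name}",
    "welcome homepage of {name}",
    "{name}'s homepage",
    "welcome to {name}'s place",
    "welcome to {name}'s page",
]


def expand(templates, names):
    """Return one search term per (name, template) pair."""
    return [t.format(name=n) for n in names for t in templates]


# Two placeholder names (assumption); the study used 1,816 popular male names.
sample_terms = expand(TEMPLATES, ["John", "Peter"])
print(len(sample_terms), sample_terms[:3])

# Term counts implied by the name-list sizes reported in Table 2.
print(len(TEMPLATES) * 1816, len(TEMPLATES) * 1648)
```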


Table 2: (Continued)

Concepts: 3. Major; 4. Marital status
Search method: Section search; Indicative phrase + keywords; Keywords/phrases; Other related keywords/phrases

3. Major

Indicative phrases + names of major faculties: my major is (major name); my concentration is (major name); studying (major name); (major name) major; (major name) concentrator; Bachelor of (major name); (Major name) student*; Major* (major name); Minor* (major name); major* in (major name); minor* in (major name); (Major name) graduate*; student of (major name); switched to (major name)
Business major (560)
Science and engineering major (631)
Other faculties (including art, social science, medicine, law) (1,470)

Names of major courses:
Business major (130)
Science and engineering major (722)
Other faculties (including art, social science, medicine, law) (733)

4. Marital status

Keywords/phrases indicating marital status as ‘single’: eg I’m single, my girlfriend, am going to marry, marital status: single etc (14)
Keywords/phrases indicating marital status as ‘married’: eg I’m married, we are married, marital status: married etc (7)
Keywords/phrases indicating marital status as ‘divorced/widowed’: eg am divorced, we are divorced, am widowed, am widower etc (17)


Table 2: (Continued)

Concepts: 5. Sibling; 6. Quiet hobby
Search method: Section search; Indicative phrase + keywords; Keywords/phrases; Other related keywords/phrases

5. Sibling

am youngest of children; I am oldest of children; I am eldest of children; I am middle child; I’m youngest of children; I’m oldest of children; I’m eldest of children; I’m middle child; my baby brother*; my baby sister*; my brother*; my elder brother*; my elder sister*; my older brother*; my older sister*; my sister*; my younger brother*; my younger sister*; have baby brother*; have baby sister*; have brother*; have elder brother*; have elder sister*; have older brother*; have older sister*; have sister*; have younger brother*; have younger sister* (28)

6. Quiet hobby

Keywords/phrases related to music: eg musical instruments, famous musicians, popular bands, music types, albums etc. (81)
Keywords/phrases related to reading/writing: eg popular writers, types of books etc (1,204)
Keywords/phrases related to movie/television: eg types of television programmes, famous movie stars etc. (446)
Keywords/phrases related to collecting: eg types of collections (53)
Keywords/phrases related to art/painting: eg types of paintings, famous artists etc. (995)


Table 2: (Continued)

Concepts: 7. Sports interest; 8. Travel interest
Search method: Section search; Indicative phrase + keywords; Keywords/phrases; Other related keywords/phrases

7. Sports interest

Major sports, teams and players:
Football (232); Jog (2); Water sports (8); Skiing (1); Basketball (49); Boxing (2,509); Fishing (1); Cycling (2); Shooting (2); Swimming (2)

Major sports categories:
Team sports (299); Individual sports (2,552); Dynamic sports (2,845); Static sports (12); Expensive sports (22); Less expensive sports (2,820); Travel-related sports (32)

8. Travel interest

Search the term ‘travel’ in ‘title’, ‘header’ and ‘meta’ fields of HTML documents (1)

Indicative phrases + names of major countries: travel* (country names), tourist* (country names), trip (country names), place (country names), tour (country names), fun in (country names), (country names) travel, vacationing in (country names), vacation (country names), holiday (country names), adventure (country names), (country names) adventure, welcome (country names), picture* (country names), to go to (country names), destination* (country names), going to (country names), set off for (country names) (4,428)

Indicative phrases + names of major cities: travel* (city names), vacation (city names), tourist* (city names), place (city names), trip (city names), tour (city names), (city names) travel, fun in (city names), vacationing in (city names), holiday (city names), to go to (city names), adventure (city names), (city names) adventure, welcome (city names), picture* (city names), going to (city names), destination* (city names), set off for (city names) (4,454)

Major tourist attractions (1,136)

Other travel-related words: eg ‘backpacking’, ‘budget travel’ etc (45)

*Numbers in brackets represent the total number of keywords used


53 and 995, respectively. A search term is matched only if the same search term appears in a document as a consecutive clause.

Sports interest

This concept identifies people’s interests in sports, both by major sport types and sport categories. Major sport types selected are football, basketball, skiing, swimming etc. A total of 2,808 terms are searched for major sports types, which include names of sporting teams and players such as Houston Comets, Evander Holyfield, etc.

These individual sport types are further combined to form a number of sport categories such as team sports, individual sports, dynamic sports, static sports, expensive sports, less expensive sports and travel-related sports. Examples of each category are listed as follows:

1 Team sports: football, rugby, basketball, handball, hockey, softball, volleyball, etc.

2 Individual sports: auto racing, archery, bowling, boxing, bullfighting, diving, etc.

3 Dynamic sports: adventure racing, football, basketball, biathlon, etc.

4 Static sports: archery, boomerang, bowling, croquet, fishing, golf, etc.

5 Expensive sports: adventure racing, biathlon, boat racing, bullfighting, skydiving, etc.

6 Less expensive sports: football, basketball, table tennis, swimming, etc.

7 Travel-related sports: mountain biking, snow sports, scuba diving, snowboarding, etc.

These categories account for a total of 8,582 search terms. A search term is matched only if the same search term appears in a document as a consecutive clause.
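
Because one sport can belong to several of the seven categories, a matched sport type maps to a set of categories rather than a single label. The following sketch illustrates that lookup using only the example sports named in the list above; the dictionary `SPORT_CATEGORIES` and the function `categories_for` are hypothetical names, and the real category lists are far longer.

```python
# Example members only, copied from the category list in the text.
SPORT_CATEGORIES = {
    "team": {"football", "rugby", "basketball", "handball", "hockey",
             "softball", "volleyball"},
    "individual": {"auto racing", "archery", "bowling", "boxing",
                   "bullfighting", "diving"},
    "dynamic": {"adventure racing", "football", "basketball", "biathlon"},
    "static": {"archery", "boomerang", "bowling", "croquet", "fishing", "golf"},
    "expensive": {"adventure racing", "biathlon", "boat racing",
                  "bullfighting", "skydiving"},
    "less expensive": {"football", "basketball", "table tennis", "swimming"},
    "travel-related": {"mountain biking", "snow sports", "scuba diving",
                       "snowboarding"},
}


def categories_for(sports):
    """Return the sorted category names that contain any of the given sports."""
    found = {s.lower() for s in sports}
    return sorted(name for name, members in SPORT_CATEGORIES.items()
                  if members & found)


print(categories_for(["Football"]))
print(categories_for(["archery", "golf"]))
```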

matched only if the same search term appears in the document as a consecutive clause. Majors and course names are compiled with reference to the academic curriculum of universities in the USA.

Marital status

This concept includes keywords and phrases indicating one’s marital status, ie single, married or others (eg divorced or widowed). Only one level of result is available, which is obtained by searching 38 phrases (eg ‘I’m single’, ‘we are married’, etc). A search term is matched only if the same search term appears in a document as a consecutive clause.

Sibling

This concept indicates whether the homepage owner has siblings or not. It consists of 28 search terms such as ‘my brother’, ‘my sister’, etc. A search term is matched only if all words of the search term are found within the same sentence (but not necessarily as a consecutive clause).

Quiet hobby

This concept explores the homepage owner’s fondness for music, reading/writing, movie/television, collecting and arts/painting. Keywords for music contain musical instruments, famous musicians, popular bands, music types and music albums. The category of reading/writing contains names of writers and types of books. The category of movie/television consists of names of movie stars and types of television programme. Collecting specifies major types of collections, eg coin collection, stamp collection, etc. Keywords related to arts/painting are names of artists and types of painting. The numbers of search terms for these five categories are 81, 1,204, 446,


dictionary with unnecessary questions would increase the search time, while an over-simplified one would result in a large number of missing values. Moreover, if the keywords or phrases included in the dictionary are not indicative enough, the search terms will be out of context and result in a low accuracy rate. Therefore, the dictionary should be concise and to the point: the number of keywords and phrases identified for each concept should be kept to a minimum while still capturing as complete a profile as possible from a personal website. Once a high quality dictionary is secured, it becomes simple and cost effective for marketers to provide personalised services to potential customers on a routine basis. The dictionary constructed for this pilot study is available upon request.

RESULTS OF TEXT SEARCH

A total of 80,750 terms (keywords and phrases) were searched in each of the 6,173 homepages. The total search process took approximately one day, which is acceptable compared with inspecting the pages by eye. Text search results are presented in Table 3. The most commonly identified concept is ‘Gender’, with missing values of about 35 per cent. It is observed that most students mention their names and/or include other gender-indicative phrases such as ‘my wife’, ‘my boyfriend’, etc in their personal homepages. The percentage of missing values for other concepts such as ‘year of study’, ‘major’, ‘quiet hobby’, ‘sports interest’ and ‘travel interest’ ranges from 43 per cent to 67 per cent. The non-response problem is most serious for the ‘sibling’ and ‘marital status’ concepts, which have missing values of 87 per cent and 94 per cent respectively. This is understandable because college students are less likely to

Travel interest

This concept explores one’s potential interest in travel. Search results by different level of certainty are:

1 Level one: 1,136 names of worldwide tourist attractions such as ‘Eiffel Tower’, ‘Golden Gate Bridge’, etc are searched within all sections of an HTML document. Searching the names of tourist attractions is expected to produce the most accurate result for ‘travel interest’ because when people mention tourist attractions in their personal homepages, they are usually referring to the places they have visited, or the places they want to visit in the future.

2 Level two: the word ‘travel’ is searched in the ‘title’, ‘header’ and ‘meta’ fields of HTML files. It is observed that people who like travelling sometimes devote part of their homepages to travel-related articles, photos or hyperlinks. In these homepages, the term ‘travel’ is constantly found in the ‘title’, ‘header’ and ‘meta’ fields.

3 Level three: 8,882 indicative phrases plus names of countries/capital cities are searched within all sections of an HTML document. Examples are ‘going to Asia’, ‘set off for Paris’ etc.

4 Level four: other travel-related keywords and phrases (eg backpacking, budget travel, etc) are searched within all sections of an HTML document. As the identification of these terms does not guarantee that they are in the right context, this result level is the least certain among all four levels. The number of search terms for level four is 45.
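
The four certainty levels can be read as an ordered series of checks. The sketch below is one possible reading, not the authors' implementation: it assumes that a page is assigned the most certain level it satisfies, it uses simple substring matching, and the tiny word lists and the name `travel_level` are hypothetical stand-ins for the 1,136 attractions, 8,882 phrases and 45 keywords of the real dictionary.

```python
from typing import Dict, Optional

# Tiny illustrative dictionaries (assumptions), using examples from the text.
ATTRACTIONS = ["eiffel tower", "golden gate bridge"]
PHRASES = ["going to {place}", "set off for {place}"]
PLACES = ["asia", "paris"]
OTHER_KEYWORDS = ["backpacking", "budget travel"]


def travel_level(page: Dict[str, str]) -> Optional[int]:
    """Return 1 (most certain) to 4 (least certain), or None if nothing matches.

    `page` maps section names such as 'title', 'header', 'meta', 'body' to text.
    """
    full_text = " ".join(page.values()).lower()
    if any(a in full_text for a in ATTRACTIONS):
        return 1
    if any("travel" in page.get(s, "").lower() for s in ("title", "header", "meta")):
        return 2
    if any(p.format(place=c) in full_text for p in PHRASES for c in PLACES):
        return 3
    if any(k in full_text for k in OTHER_KEYWORDS):
        return 4
    return None


print(travel_level({"title": "Travel diary", "body": "Photos."}))
print(travel_level({"title": "Home", "body": "Next month I am going to Asia."}))
```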

In summary, the results from text mining are largely determined by the accuracy and comprehensiveness of the pre-constructed dictionary. A lengthy


owners, but all eight concepts are identified for only 16 out of a total of 6,173 subjects.

There are two key reasons for the large number of missing values. First, as website information is self-revealed, people have

mention information about siblings and marital status on their personal webpages.

Table 4 gives a general overview of the number of concepts identified. The authors can successfully identify at least three concepts for 3,165 webpage


Table 3: Text search results

Concept                    Frequency count    Percentage %

1. Gender
Male                       2,661              43.11
Female                     1,354              21.93
Missing                    2,158              34.96

2. Year of study
Undergraduate              1,122              18.18
Postgraduate               1,040              16.85
Missing                    4,011              64.98

3. Major
Business                   311                5.04
Science or engineering     845                13.69
Others                     610                9.88
Double major               567                9.19
Missing                    3,840              62.21

4. Marital status
Single                     218                3.53
Married                    133                2.15
Missing                    5,822              94.31

5. Sibling
With sibling               815                13.20
Missing                    5,358              86.80

6. Quiet hobby
Music                      1,129              18.29
Reading/writing            361                5.85
Movie/television           875                14.17
Collecting                 26                 0.42
Art/painting               562                9.10
Missing                    2,678              43.38

7. Sports interest
Football                   752                12.18
Water sports               185                3.00
Skiing                     192                3.11
Basketball                 430                6.97
Boxing                     80                 1.30
Fishing                    171                2.77
Cycling                    51                 0.83
Swimming                   323                5.23
Team sports                1,357              21.98
Individual sports          1,333              21.59
Dynamic sports             1,872              30.33
Static sports              575                9.31
Expensive sports           833                13.49
Less expensive sports      1,728              27.99
Travel-related sports      769                12.46
Missing                    3,877              62.81

8. Travel interest
Asia                       79                 1.28
Europe                     153                2.48
Middle East                24                 0.39
Africa                     31                 0.50
North America              79                 1.28
South America              43                 0.23
Oceania                    33                 0.53
USA                        244                3.95
Missing                    4,157              67.34


about 10 per cent of the total pool of personal homepages analysed) for two purposes:

1 To demonstrate the accuracy of dictionary construction and text analysis.

2 To use the sample result to formulate scoring rules for missing variables.

Accuracy results for individual concepts are presented in Table 5. The authors calculated the overall hit rate for each concept from:

$$\text{overall hit rate (\%)} = \frac{\text{correctly classified cases}}{\text{total number of cases}} \times 100$$
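
As a check on the formula, the overall hit rate can be recomputed from the counts reported in Table 5, where the correctly classified cases are those in which the computer result agrees with the human judgment. The snippet below is a minimal sketch; `hit_rate`, `GENDER` and `MAJOR` are hypothetical names, and the numbers are copied from the gender and major panels of Table 5 (rows are human judgment, columns are computer result, in the same category order).

```python
def hit_rate(matrix):
    """Percentage of cases on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Table 5, panel 1: Male, Female, Missing.
GENDER = [
    [229, 1, 48],
    [4, 147, 41],
    [7, 6, 104],
]

# Table 5, panel 3: Business, Science and engineering, Others, Double major, Missing.
MAJOR = [
    [31, 2, 2, 10, 16],
    [1, 77, 7, 9, 90],
    [0, 1, 46, 9, 20],
    [2, 7, 6, 18, 5],
    [1, 1, 4, 1, 221],
]

print(round(hit_rate(GENDER)), round(hit_rate(MAJOR)))
```

Both values round to the hit rates quoted in the text for ‘gender’ and ‘major’.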

In general, the text analysis results can be considered as reasonably accurate with six concepts attaining hit rates above 80 per cent. In particular, hit rates for ‘year of study’, ‘marital status’, ‘sibling’ and ‘sports interest’ (all above 85 per cent) are higher than other concepts. ‘Gender’, ‘travel interest’ and ‘quiet hobby’ have hit rates of 82 per cent, 81 per cent and 76 per cent respectively. The text search result for ‘major’ is less satisfactory, with a hit rate of 67 per cent, because it has four prediction categories. There are three major factors that affect the hit rate: context-based search, dictionary construction and multiple meanings for one word.

Context-based search

Correct interpretation of concepts requires keywords and phrases to be found in the right context as defined by the researcher. As discussed earlier under text search, query interfaces such as ‘section support’ and ‘context-based query’ are used to enhance the correct interpretation of search terms. However,

their own preferences concerning how much and what type of information to disclose to others. If people do not disclose such information on personal homepages, it is impossible for marketers to identify this information from the web.

Secondly, the constructed dictionary for this feasibility study could have been more exhaustive if there were more resources. For example, the current study includes names of popular music bands, pop singers, music albums and musical instruments in the ‘music’ category. In future, researchers might include more music-related terms, such as names of classical musicians, sound tracks, operas, etc, to further enhance the capability of the dictionary.

To sum up, the proportion of missing values is quite large for some concepts. Even so, the number of potential customers identified from web mining can still be large enough for profitable marketing if the target population is substantial. In addition, the objective of the web mining project is to establish a comprehensive database of customer information for one-to-one marketing. Every aspect of a person’s demographic, attitudinal and behavioural data obtained may be useful to different marketers.

ACCURACY OF TEXT SEARCH RESULT

At this stage, the authors randomly selected 587 personal homepages (ie


Table 4: Frequency for identified concepts

Number of concepts identified    Frequency count

One concept only                 1,626
Two concepts only                1,382
Three concepts only              1,215
Four concepts only               897
Five concepts only               598
Six concepts only                331
Seven concepts only              108
Eight concepts only              16
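
The figure of 3,165 owners with at least three identified concepts is the sum of the rows of Table 4 from ‘three concepts only’ downwards. A small sketch of that cumulative count follows; `FREQUENCY` and `at_least` are hypothetical names, and the values are the frequency counts of Table 4.

```python
# Table 4: number of concepts identified -> frequency count.
FREQUENCY = {1: 1626, 2: 1382, 3: 1215, 4: 897, 5: 598, 6: 331, 7: 108, 8: 16}


def at_least(k):
    """Number of homepage owners with k or more concepts identified."""
    return sum(count for concepts, count in FREQUENCY.items() if concepts >= k)


print(at_least(3), at_least(1))
```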


Table 5: Accuracy of results

1. Gender

                      Computer result
Human judgment        Male    Female    Missing    Total
Male                  229     1         48         278
Female                4       147       41         192
Missing               7       6         104        117
Accuracy rate*        0.82    0.77      0.89

2. Year of study

                      Computer result
Human judgment        Undergrad.    Postgrad.    Missing    Total
Undergrad.            144           5            44         193
Postgrad.             5             83           21         109
Missing               4             5            276        285
Accuracy rate         0.75          0.76         0.97

3. Major

                          Computer result
Human judgment            Business    Science and engineering    Others    Double major    Missing    Total
Business                  31          2                          2         10              16         61
Science and engineering   1           77                         7         9               90         184
Others                    0           1                          46        9               20         76
Double major              2           7                          6         18              5          38
Missing                   1           1                          4         1               221        228
Accuracy rate             0.51        0.42                       0.61      0.47            0.97

4. Marital status

                      Computer result
Human judgment        Single    Married    Missing    Total
Single                25        0          19         44
Married               0         7          15         22
Missing               2         1          518        521
Accuracy rate         0.57      0.32       0.99

5. Sibling

                      Computer result
Human judgment        With sibling    Missing    Total
With sibling          97              13         110
Missing               1               476        477
Accuracy rate         0.88            0.99

6. Quiet hobby

                      Computer result
Human judgment        Like quiet hobby    Missing    Total
Like quiet hobby      217                 25         242
Missing               114                 231        345
Accuracy rate         0.90                0.67


Multiple meanings for one word

The possibility of multiple interpretations of a search term causes another problem for context-based search. The surface meaning of a term may be different from its latent meaning. For instance, the phrase ‘my girl’ may have different meanings, such as (1) my girlfriend, (2) my daughter and (3) a close female friend. It will be problematic, therefore, if a researcher considers only the first meaning (ie my girlfriend), and uses the term ‘my girl’ as an indication of male gender. This linguistic issue causes great difficulty in ascertaining the correct interpretation of search terms within the right context.

SOLVING THE NON-RESPONSE PROBLEM

When the value of a variable for a particular individual is missing after the text search, it can probably be inferred from the association relationship among the true variables observed in the random sample. The estimation procedure can be illustrated using gender as an example. From Table 5.1, 278 males and 192 females were directly

current web mining tools have two technical limitations.

First, researchers cannot specify the word sequence of the same search term. For instance, the sentence ‘Jane’s boyfriend, my girlfriend and I love hiking’ matches the search term ‘my boyfriend’. In this case, even if, in fact, the webpage owner is a male, he will be wrongly identified as a female. The second technical issue is that the text analysis tool can only recognise <p> as an indication of sentence termination in HTML files. In reality, HTML files contain many format types such as table, list, etc. Thus, search results tend to be inaccurate, as the tool may not correctly identify the terminator of sentences.
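
The first limitation can be reproduced with the example sentence from the text. The sketch below contrasts a within-sentence match that ignores word order with a match that requires the words to appear consecutively and in order; both functions are hypothetical illustrations and not part of the tool used in the study.

```python
import re


def words(text):
    """Lower-case word tokens; punctuation and apostrophes act as separators."""
    return re.findall(r"[a-z]+", text.lower())


def unordered_sentence_match(sentence, term):
    """True if every word of the term occurs somewhere in the sentence."""
    present = set(words(sentence))
    return all(w in present for w in words(term))


def ordered_phrase_match(sentence, term):
    """True only if the term's words occur consecutively and in order."""
    s, t = words(sentence), words(term)
    return any(s[i:i + len(t)] == t for i in range(len(s) - len(t) + 1))


sentence = "Jane's boyfriend, my girlfriend and I love hiking"
print(unordered_sentence_match(sentence, "my boyfriend"))  # false positive
print(ordered_phrase_match(sentence, "my boyfriend"))
print(ordered_phrase_match(sentence, "my girlfriend"))
```

A tool that let researchers fix the word sequence would behave like the second function and avoid misclassifying this page owner.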

Dictionary construction

The dictionary built for this study is not exhaustive in view of the huge number of terms and expressions that can be used to convey the same meaning. For example, for one’s fondness of reading as a hobby, possible expressions might be ‘reading is my life’, ‘reading is great fun’, ‘I like reading’ and many other expressions.


Table 5: (Continued)

7. Sports interest

                      Computer result
Human judgment        Like sports    Missing    Total
Like sports           206            11         217
Missing               32             338        370
Accuracy rate         0.95           0.91

8. Travel interest

                      Computer result
Human judgment        Like travel    Missing    Total
Like travel           81             14         95
Missing               97             395        492
Accuracy rate         0.85           0.80

*Accuracy rate (%) = (true related cases ÷ computer identified related cases) × 100 per cent
Example: accuracy rate for male = (229 ÷ 278) × 100 per cent = 82 per cent
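
The per-category accuracy rates in Table 5 can be recomputed from the reported counts. In the worked example for ‘male’, the denominator 278 is the total of the human-judged male row (229 + 1 + 48), so the sketch below divides each diagonal count by its row total; this reading is inferred from the example, and `row_accuracy`, `GENDER` and `TRAVEL` are hypothetical names.

```python
def row_accuracy(matrix):
    """Diagonal count divided by the row total, for each human-judged category."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Rows are human judgment, columns are computer result (Table 5, panels 1 and 8).
GENDER = [
    [229, 1, 48],   # Male
    [4, 147, 41],   # Female
    [7, 6, 104],    # Missing
]
TRAVEL = [
    [81, 14],       # Like travel
    [97, 395],      # Missing
]

print([round(a, 2) for a in row_accuracy(GENDER)])
print([round(a, 2) for a in row_accuracy(TRAVEL)])
```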


results are displayed in Table 6. As expected, most students with missing information on these variables are predicted to be male undergraduates majoring in science/engineering. The authors formulate the original association relationship among variables as a regression model for prediction purposes, but the scores should not be interpreted as causal results.

While the missing proportions for ‘gender’, ‘year of study’ and ‘major’ are relatively small (20–49 per cent) in the random sample, the missing proportions for other variables (ie marital status, sibling, quiet hobby, sports interest and travel interest) are significantly higher (ie 59–89 per cent). Logistic regression is no longer applicable for scoring customers with missing information on these variables, because there are insufficient cases to discriminate on the dependent variable with respect to the independent variables. Consequently, scores for all these variables cannot be internally generated from the web mining process. A traditional survey of a sample of web owners about the interests/activities they display as well as the interests/activities they do not display in their webpages would help to deduce the scores on such variables. For example, the score on travel interest $\pi_t$ can be approximated by the conditional probability of real interest in travel given the demographics of the webpage owner.

CONCLUSION

This paper presents an idea, the mining of websites by advanced internet technology and text mining techniques, for extracting meaningful information from personal websites for better understanding of customers. As stated in the research questions, the authors’ most important concerns for web mining are: (1) construction of a high quality

observed and confirmed by eye (not by search engine) out of 587 students in the random sample. The authors constructed a training sample of 470, which consists of all students with identified gender. Students whose gender cannot be identified by eye are excluded from this training sample. Let $Y_i = 1$ if student $i$ is a male, and $Y_i = 0$ if student $i$ is a female. Similarly, other variables (denoted by $X_1, \ldots, X_n$) can also be observed by eye for these 470 students in the training sample. These $X$s are binary variables representing different categories (including missing) of all the concepts other than gender. Now treat $Y$ as the dependent variable and $(X_1, \ldots, X_n)$ as the independent variables in a stepwise logistic regression framework and apply it to the training sample to get:

$$v = 0.66 + 0.897\,\text{sport6} - 0.752\,\text{hobby3} - 1.033\,\text{sibling} - 0.851\,\text{major4} \qquad (1)$$

$$\pi = \frac{\exp(v)}{1 + \exp(v)} \qquad (2)$$

where sport6 = 1 if the student loves ‘less expensive sports’ and 0 otherwise; hobby3 = 1 if the student enjoys ‘movie or television’ and 0 otherwise; sibling = 1 if the student has a sibling and 0 otherwise; major4 = 1 if the student belongs to ‘other major’ and 0 otherwise; and $\pi$ is the probability that the student is a male.
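
The scoring step can be sketched in a few lines. The operators in the printed equations did not survive extraction, so the signs used here (positive for sport6, negative for hobby3, sibling and major4) are an assumption inferred from the authors' verbal interpretation of equations (1) and (2); the function name `male_score` is hypothetical.

```python
import math


def male_score(sport6=0, hobby3=0, sibling=0, major4=0):
    """Probability that a student is male, following equations (1) and (2).

    Each argument is a 0/1 indicator. Coefficient signs are inferred from the
    surrounding interpretation, not read directly from the source.
    """
    v = 0.66 + 0.897 * sport6 - 0.752 * hobby3 - 1.033 * sibling - 0.851 * major4
    return math.exp(v) / (1.0 + math.exp(v))


print(round(male_score(), 3))                     # no indicator set
print(round(male_score(sport6=1), 3))             # likes less expensive sports
print(round(male_score(hobby3=1, sibling=1), 3))  # movie/television fan with a sibling
```

Under these assumed signs, a student with no indicators set scores between 0.5 and 0.75, which is the band holding most of the uncertain gender cases in Table 6.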

Equations (1) and (2) suggest that if a student loves ‘less expensive sports’, does not enjoy movie/television, has no sibling and does not belong to ‘other major’, this student has a higher chance of being a male. The equations are then used to generate scores (ie $\pi$) for those 2,158 students with missing gender after the text search process. Similarly, this technique is applied in order to generate scores for those who have missing values for ‘year of study’ and ‘major’. The


Any inadequacies in the dictionary would result in missing values, which could only be partially resolved with statistical techniques and supplementary information obtained from traditional market survey research.

Regarding technical capability, the results of the feasibility study show that current web mining tools are capable of processing a large number of homepages (ie 6,173 HTML files) and attaining reasonably satisfactory text mining results. However, limitations still exist in the text mining tools, which need further improvement. In addition, current crawler technology is not capable of accessing web information stored in databases. Unless researchers know how to access a particular database, the data stored therein will become invisible because search engines cannot access dynamic databases. There is a need to develop a more advanced crawler technology.

There are also other important issues

dictionary; (2) capabilities of current web mining technology; and (3) other issues that may arise in web mining applications. The answers to these questions are summarised below.

The first concern is about the construction of a high quality dictionary capable of converting disorganised text in the web into a structured personal information database. Results of this feasibility study provide a promising start. The dictionary constructed can be considered to be fairly accurate in capturing and interpreting web information. It is also highly possible for researchers to revise the dictionary in the future so that more accurate text search results can be obtained in less computer time. However, given the self-revealed nature of web information, the dictionary can never be perfect, in the sense that marketers can never capture all customer data that are interesting to the marketers if these data are unavailable on the web.


Table 6: Scoring on missing variables

Variables                                        Count    Percentage

Gender
Certain cases
  Male                                           2,661    43.11%
  Female                                         1,354    21.93%
Uncertain cases
  Predicted score* > 0.75                        370      5.99%
  0.5 < predicted score < 0.75                   1,585    25.68%
  0.25 < predicted score < 0.50                  199      3.22%
  Predicted score < 0.25                         4        0.065%

Year of study
Certain cases
  Undergraduate                                  1,122    18.18%
  Postgraduate                                   1,040    16.85%
Uncertain cases
  Predicted score* > 0.75                        638      10.33%
  0.5 < predicted score < 0.75                   3,207    51.95%
  0.25 < predicted score < 0.50                  132      2.14%
  Predicted score < 0.25                         34       0.55%

Major
Certain cases
  Business                                       311      5.04%
  Science or engineering                         845      13.69%
  Others                                         610      9.88%
  Double major                                   567      9.19%
Uncertain cases (highest predicted score on:)
  Business                                       13       0.21%
  Science or engineering                         2,989    48.42%
  Others                                         838      13.58%

Total: 6,173
*Predicted score is in a range of 0 to 1 where: 1 represents male, 0 represents female for gender; 1 represents undergraduate, 0 represents postgraduate for year of study.


survey approach. Nonetheless, dataintegrity problems may also arise when aperson has a number of homepages, ofwhich some contain outdated orinaccurate information. Inaccurate datamay cause noise in marketing prediction.

Privacy issuesThe possible invasion of individualprivacy and misuse of customerinformation could seriously impedemarketing progress on the web.22 Analysisof personal information on individualwebsites and the subsequent use of thisinformation for marketing purposes maycause ethical concerns. However, ifcustomers provide their website addressesto marketers voluntarily, there would beno risk of invading individual privacy.

To sum up, web mining has significant marketing implications. It presents an alternative to traditional marketing research which gains an understanding of the target population through statistical inference from a representative sample. Even though the target population is very large, the size of the survey (qualitative or quantitative) is still relatively small, due to cost and time constraints. By comparison, the web-mining approach studies all accessible web information (ie the population) instead of just the sample. The web-mining approach will study each individual homepage, of which there could be thousands of millions in the world. The marketing value of the web-mining technique increases with the increasing number of customers who have their own webpage. It is profitable for marketers to make use of these readily available information sources better to understand their customers.

The current study of personal homepages of college students is only one of the many business applications of web-mining technology. Marketers may use web-mining techniques to acquire

that deserve researchers’ and practitioners’ attention in future web mining applications and these are discussed below.

Data availability and accuracy

Data availability is an important issue that deserves practitioners’ attention in future web-mining applications. To the best of the authors’ knowledge, there are no commonly agreed data about the current number and growth rate of personal homepages. As the construction of personal homepages is only in its early stage of development, the current number of personal homepage owners is rather low. However, current users are innovators who represent only a small proportion of the total population. When more webpage owners broadcast their website addresses to friends/relatives, the new idea and the usefulness of publishing one’s personal homepage will permeate society and increase the number of adopters. It is possible to predict the number of personal webpages by calculating it as the product of (1) the total internet population; and (2) the proportion of internet users creating their own personal webpages. Even if the proportion of internet users creating their own personal webpages is assumed to remain the same as the current proportion, there will be more personal webpages as the internet population grows substantially.21 As more consumers display personal information on their personal webpages, it will be profitable for marketers to utilise the readily-available information on the web to understand their customers.
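The prediction described above can be sketched in a few lines of Python. The internet population of 945 million (March 2004) and the 16 per cent annual growth rate are the figures quoted in the introduction of this paper; the proportion of users who own a homepage (1 per cent here) is purely hypothetical, as is the function name, and the sketch assumes that this proportion stays constant while the population grows.

```python
def project_homepages(internet_population, homepage_share, annual_growth, years):
    """Return projected homepage counts for the base year and each later year.

    Each estimate is the product of the internet population in that year
    and a constant share of users assumed to own a personal homepage.
    """
    projections = []
    population = internet_population
    for _ in range(years + 1):
        projections.append(population * homepage_share)
        population *= 1 + annual_growth  # compound the population growth
    return projections


# 945 million users and 16 per cent growth are taken from the introduction;
# the 1 per cent homepage share is a hypothetical value for illustration.
estimates = project_homepages(945e6, 0.01, 0.16, 2)
for offset, value in enumerate(estimates):
    print(f"{2004 + offset}: {value / 1e6:.2f} million personal webpages")
```

Under these assumptions the estimate rises from 9.45 million to about 12.72 million homepages over two years, driven by population growth alone.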

Marketers may also be concerned about data accuracy. As information obtained from both personal websites and traditional market surveys is self-revealed in nature, data accuracy in the web mining approach should be at least comparable to that of the traditional


5 Degeratu, A. M., Rangaswamy, A. and Wu, J. (2000) ‘Consumer choice behavior in online and traditional supermarkets: The effects of brand name, price, and other search attributes’, International Journal of Research in Marketing, Vol. 17, No. 1, pp. 55–78.

6 Deleersnyder, B., Geyskens, I., Gielens, K. and Dekimpe, M. G. (2002) ‘How cannibalistic is the internet channel? A study of the newspaper industry in the United Kingdom and the Netherlands’, International Journal of Research in Marketing, Vol. 19, No. 4, pp. 337–348.

7 Dholakia, U. M., Basuroy, S. and Soltysinski, K. (2002) ‘Auction or agent (or both)? A study of moderators of the herding bias in digital auctions’, International Journal of Research in Marketing, Vol. 19, No. 2, pp. 115–130.

8 Shankar, V., Smith, A. K. and Rangaswamy, A. (2003) ‘Customer satisfaction and loyalty in online and offline environments’, International Journal of Research in Marketing, Vol. 20, No. 2, pp. 153–175.

9 Zorn, P., Emanoil, M., Marshall, L. and Panek, M. (1999) ‘Mining meets the web’, Online, Vol. 23, No. 5, pp. 16–28.

10 Popping, R. (2000) ‘Computer-assisted Text Analysis’, Sage Publications, London.

11 Bradlow, E. C. and Schmittlein, D. C. (2000) ‘The little engines that could: Modeling the performance of world wide web search engines’, Marketing Science, Vol. 19, No. 1, pp. 43–62.

12 Hoque, A. Y. and Lohse, G. L. (1999) ‘An information search cost perspective for designing interfaces for electronic commerce’, Journal of Marketing Research, Vol. 36, No. 3, pp. 387–394.

13 http:www-3.ibm.com/software/success/cssdb.nsf/csp/navo-4vpvek.

14 http://www.kmworld.com/resources/featurearticles/index.cfm?action=readfeature&feature_id=130.

15 http://www-3.ibm.com/software/success/cssdb.nsf/CS/NAVO-4D8PWL?OpenDocument&Site=software.

16 http://www.sas.com/success/louisville.html.

17 http://www.spss.com/press/template_view.cfm?PR_ID=571.

18 Chang, G., Marcus J. H., McHugh, J. A. M. and Wang, J. T. L. (2001) ‘Mining the World Wide Web: An Information Search Approach’, Kluwer Academic Publishers, Boston, MA.

19 http://www-3.ibm.com/software/data/iminer/fortext/.

20 Withycombe, E. G. (1977) ‘The Oxford Dictionary of English Christian Names’, Clarendon, New York, NY.

21 Lau, K. N., Lee, K. H., Lam, P. Y. and Ho, Y. (2001) ‘Web-site marketing for the tourism industry: A rejoinder’, Cornell Hotel and Restaurant Administration Quarterly, Vol. 42, No. 6, pp. 66–67.

22 Krauss, M. (2000) ‘Get a handle on the privacy wild card’, Marketing News, Vol. 34, No. 5, p. 12.

customer intelligence from personal homepages and newsgroups and to obtain competitor intelligence from company websites. In the meantime, it is appropriate to acknowledge the technical difficulty, as well as the complexity and uncertainty, of future technological developments. In addition, web mining technology alone cannot facilitate the development of a comprehensive customer database. It has to be supplemented by traditional survey research and statistical techniques to handle the non-response problem. Furthermore, the success of this marketing approach depends heavily on the public acceptance of personal websites. These concerns pose a certain degree of uncertainty on its future advancement.

The aim of this paper is to study the feasibility of web mining applications rather than to report a success story. The results are by no means indications of its future success. There is room to enhance this new approach for better accuracy and efficiency. It is the authors’ hope that this paper will generate interest and discussion among academics and practitioners in this new area of marketing research.

Acknowledgment

The authors thank IBM for their technical support in this project.

References

1 http://www.clickz.com/stats/big_picture/geographics/article.php/151151.

2 Mahajan, V. and Venkatesh, R. (2000) ‘Marketing modeling for e-business’, International Journal of Research in Marketing, Vol. 17, No. 2–3, pp. 215–225.

3 Lilien, G. L. and Rangaswamy, A. (2000) ‘Modeled to bits: Decision models for the digital, networked economy’, International Journal of Research in Marketing, Vol. 17, No. 2–3, pp. 227–235.

4 Prasad, A., Mahajan, V. and Bronnenberg, B. (2003) ‘Advertising versus pay-per-view in electronic media’, International Journal of Research in Marketing, Vol. 20, No. 1, pp. 13–30.
