How Semantic Tagging Increases Findability

Heather Hedden

This article is reprinted with permission from EContent magazine, October 2008. © Online, a division of Information Today, Inc.


Findability is about making information easier to find. After all, if it cannot be found, it may as well not exist. Leading information specialists have been saying this for years, and now with the increasing volume of content and increasing pressures of time, money, and competition, more of us are finding this statement to be true. In addition to traditional controlled vocabulary-based indexing, information architecture has evolved to make browsing and navigation methods more effective, search engine capabilities have been improving to help us find the proverbial needle in the haystack, and bookmarking and social tagging have emerged to help us find our own content and content that we share with members of a social networking group.

The various methods of enhancing findability each have their limitations. Traditional document indexing/material cataloging and web information architecture do not go deep enough. Indexing is usually at the document level, and cataloging only works on the level of the material as a whole (books, sound recordings, video recordings, etc.). Information architecture aids in the navigation of a website, intranet, or portal, but in itself it is often not sufficient for finding specific information. Search engines match user-entered keywords and phrases to those found within the texts or metatag fields of documents, but these are still just word matches and do not necessarily go after the meaning of a document. For example, many words are quite ambiguous, and search results would not be accurate on words such as “state,” “log,” or “screen,” even in combination with other words. Social tagging only involves files or webpages that the user and colleagues have already viewed or created. More significantly, though, social tagging tends to suffer from inconsistent application of tags, such as mixing synonyms (movie, motion picture, film), singular and plural forms, and abbreviations (Corporation/Corp., information/info).

New techniques and tools are being developed to address the shortcomings of these various approaches to finding information and to deliver better results in an increasingly competitive information industry. “Semantic tagging,” in the various ways that it is understood, is a term that describes many of these new (and some not-so-new) findability approaches. Semantic tagging is by no means an accepted concept with an agreed-upon definition. Other than the obvious “tagging for meaning,” semantic tagging means different things to people coming from different parts of the information management field. It may be used interchangeably with “semantic indexing” in contexts where “indexing” is used for “tagging.” Nevertheless, in the quest for better methods of findability, the term semantic tagging is starting to appear in descriptions of information services and products, blogs, online articles, and presentations.

SEMANTIC TAGGING IN PUBLISHED INDEXES

“Semantic information … enables publishers to distinguish their content from their competitors’,” explains Bill Kasdorf of Apex CoVantage, organizer/moderator of a preconference seminar on semantic tagging at the Society for Scholarly Publishing’s (SSP) annual conference this May in Boston. “In addition, great progress has been made recently in moving semantics beyond the theoretical: Actual publishers are actually doing it, and they’re actually getting real benefits from it.”

Some people would argue that semantic tagging is nothing new. It can be defined as the assigning of selected controlled vocabulary (aka taxonomy) terms, especially by trained indexers, to content items, such as articles, images, or other documents, to reflect the meaning of the content. Human subject indexing is inherently semantic, because human indexers can discern the meaning of content. This has been done by periodical and other database index publishers for decades. It was once the domain of large database publishing companies (H.W. Wilson, ProQuest, Gale, EBSCO, etc.), but more affordable client/server and desktop software for taxonomy management, indexing, and web database publishing has enabled publishers of all sizes to engage in this form of semantic indexing. Meanwhile, the growing popularity of social tagging has made users more aware of the value of subject terms that reflect the meaning of a piece of content in comparison to free-text word/phrase search.

Nevertheless, there are publishers that consider semantic tagging to be something more than mere controlled vocabulary-based human indexing; they are pursuing new techniques. This was evident in the participation in the SSP Boston conference’s semantic tagging seminar, Say What You Mean: How Semantic Tagging Makes Content More Discoverable, More Useful, and More Valuable.

One way that semantic indexing is distinguished from traditional subject indexing of documents is that it focuses on concepts rather than the documents as a whole. Panel presenter Stephen Rhind-Tutt, president of Alexander Street Press, LLC, explained that semantic indexing can answer complex questions of who, what, and when, such as “What battles during the Civil War resulted in more than 1,000 deaths?” Regular indexing merely answers the question “What documents discuss this battle?”

Figure: Alexander Street Press, LLC has developed highly structured facets of tags for plays and scenes.

Figure: Alexander Street Press’ highly specific tag categories for Early Encounters in North America.

Specialized and multilevel facets (or metadata, depending on your perspective) of controlled vocabularies can be implemented to support semantically complex user queries, as done by humanities publisher Alexander Street Press. Its database of theatrical plays is indexed by top-level facets, including playwright data, theater data, specific production data, theater company information, character characteristics, scene data, and play text data. Its Early Encounters in North America history database has nine controlled vocabularies, including author, source, year, place environment, flora, fauna, encounter, people, personal event, and cultural event. Setting up the controlled vocabulary and facets requires one to “go into the data and ask ‘what are the latent semantic issues that will be asked’ … This needs to be discipline specific,” according to Rhind-Tutt. Finally, the content searched with faceted taxonomies and supporting interfaces needs to be sufficiently structured with metadata, tagging, or indexing that precisely captures each subject in its appropriate facet.
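The difference between indexing concepts and indexing documents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` record, its facet names, and every record in `EVENTS` are invented placeholders, not Alexander Street Press data or its actual schema.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One indexed concept (not a document), described by facets."""
    name: str
    event_type: str    # facet: kind of event
    conflict: str      # facet: the broader conflict it belongs to
    year: int          # facet: date
    deaths: int        # facet: casualties
    source_docs: list  # documents that mention the event


# Placeholder records; names and numbers are invented for illustration only.
EVENTS = [
    Event("Example Battle A", "battle", "Civil War", 1862, 3500, ["doc-1", "doc-7"]),
    Event("Example Battle B", "battle", "Civil War", 1863, 400, ["doc-2"]),
    Event("Example Treaty C", "treaty", "Civil War", 1865, 0, ["doc-3"]),
    Event("Example Battle D", "battle", "Other War", 1812, 2000, ["doc-4"]),
]


def query(events, **facets):
    """Return events whose facets match every given value or predicate."""
    results = []
    for event in events:
        matched = True
        for facet, wanted in facets.items():
            value = getattr(event, facet)
            # A facet constraint is either an exact value or a predicate.
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                matched = False
                break
        if matched:
            results.append(event)
    return results


# "What battles during the Civil War resulted in more than 1,000 deaths?"
hits = query(EVENTS, event_type="battle", conflict="Civil War",
             deaths=lambda d: d > 1000)
print([e.name for e in hits])
```

A document-level index could only return the contents of `source_docs` for a battle the user already knows by name; the faceted records are what make the who/what/when question answerable.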

Another way that semantic indexing is distinguished from traditional subject indexing of documents is that it focuses on pieces of content at a finer, more granular level rather than the documents as a whole. This is an approach taken by medical research database developer Silverchair, as explained by its CTO Jake Zarnegar: “We apply semantic tags at any change of topic or concept in the data at any level—including articles, sections, paragraphs, tables, figures, equations, sidebars, videos, etc. Many taxonomic tagging systems deal with the entire data entity as one unit.” Using its internally developed TOTEM taxonomy management platform, Silverchair inserts taxonomy tags into the XML content.

According to Zarnegar, “Tagging should be done at the smallest ‘atomic’ level that can stand on its own if taken out.” Whether the original source is a book, article, or pamphlet, subject indexing is often done to the paragraph level.
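As a rough sketch of what tagging below the document level might look like, the following Python attaches taxonomy terms to individual XML elements and then retrieves elements by term. The element names, the `taxonomy` attribute, the `|` separator, and the sample content are all assumptions made for illustration; they do not describe Silverchair’s TOTEM platform or its markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical source content with an id on each independently usable unit.
SOURCE = """<article id="a1">
  <section id="s1">
    <para id="p1">Aspirin reduces fever.</para>
    <para id="p2">Dosage depends on body weight.</para>
  </section>
  <table id="t1"><caption>Dosage by weight</caption></table>
</article>"""

# Hypothetical indexer output: element id -> taxonomy terms for that unit.
TAGS = {
    "p1": ["Aspirin", "Fever"],
    "p2": ["Drug dosage"],
    "t1": ["Drug dosage"],
}


def apply_tags(xml_text, tags):
    """Write taxonomy terms onto the elements they describe."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        terms = tags.get(element.get("id"))
        if terms:
            element.set("taxonomy", "|".join(terms))
    return root


def find_by_term(root, term):
    """Return ids of the elements tagged with the term, in document order."""
    return [el.get("id") for el in root.iter()
            if term in el.get("taxonomy", "").split("|")]


tagged = apply_tags(SOURCE, TAGS)
print(find_by_term(tagged, "Drug dosage"))
```

Because the terms sit on the paragraph and the table rather than on the article, a search for a term can return just those units instead of the whole article.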

SEMANTIC TAGGING IN SEARCH

In the area of automated search and retrieval (enterprise search engines, content management systems, and related discovery and data mining products that do not utilize human indexing), semantic tagging obviously plays a smaller role. Nevertheless, some of these vendors claim to offer semantic capabilities. In the competitive enterprise search space, new technologies are often based on either autocategorization (automatic indexing/tagging) or various text analytics techniques, such as pattern recognition or entity extraction. Most text analytics is not semantic because it does not discern the meaning of words, but rather may classify words by part of speech (grammar). Various forms of autocategorization, on the other hand, may or may not have a degree of semantic technology involved.

Figure: Silverchair search results, indexed to the chapter subsection level and utilizing a structured taxonomy.

Figure: Collexis Holdings’ Research Profiles database with weighted subjects indicated in bar graphs.

In cases where autocategorization search solutions or content management software come prepackaged with taxonomies or have a feature to build or automatically generate taxonomies (which only some vendors offer), there is a potential for what may be called semantic tagging. A simple taxonomy as used in information architecture, with a hierarchy of category terms, is not sufficient for effective autocategorization. What is needed is really more of a “thesaurus” style of taxonomy, whereby there is a cluster of synonyms or other equivalent terms (abbreviations, acronyms, spelling variations, grammatical variations, etc.) for each concept in the taxonomy. Thus, the taxonomy is composed not merely of words, but of concepts that derive meaning (“semantics”) from their cluster of synonyms. Autocategorization products that provide integrated taxonomies include Interwoven, Inc.’s MetaTagger; Teragram Corp.’s Categorizer and Taxonomy Manager; and Northern Light Group, LLC’s Enterprise Search Engine, MI Analyst, and Analyst Direct. Northern Light supports what it calls “meaning extraction.”
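The idea of a concept defined by its cluster of equivalent terms can be sketched as follows. This is a deliberately minimal illustration, not how any of the named products work: the `THESAURUS` contents, the function names, and the plain string-matching strategy are assumptions, and the company name in the usage example is made up.

```python
import re

# Hypothetical thesaurus-style taxonomy: preferred concept -> equivalent terms.
THESAURUS = {
    "Motion pictures": ["movie", "movies", "motion picture", "film", "films"],
    "Corporations": ["corporation", "corp.", "corp"],
}


def build_index(thesaurus):
    """Map every variant (and the concept label itself) to its concept."""
    index = {}
    for concept, variants in thesaurus.items():
        for variant in variants + [concept]:
            index[variant.lower()] = concept
    return index


def categorize(text, thesaurus):
    """Return the sorted concepts whose synonym cluster matches the text."""
    index = build_index(thesaurus)
    # Try longer variants first so "motion picture" or "corp." is matched
    # in full before a shorter variant can claim the same position.
    variants = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
        re.IGNORECASE,
    )
    return sorted({index[m.group(1).lower()] for m in pattern.finditer(text)})


print(categorize("Acme Corp. released a new film.", THESAURUS))
```

Whichever variant appears in the text, the document is tagged with the same concept, which is what removes the synonym, plural, and abbreviation inconsistencies described earlier. Resolving genuinely ambiguous words such as “state” or “log” would need more context than this sketch uses.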

Knowledge discovery vendor Collexis Holdings, Inc. makes use of taxonomies in what it calls semantic tagging by using weighted taxonomy terms. In the Collexis Knowledge Dashboard product, based on statistical approaches including frequency, uniqueness, and data field location (such as title or text body), terms’ relative weights are displayed with bar graphs. According to Collexis COO Steve Leicht, who also presented on the SSP panel, semantic tagging “can include taxonomic tagging, ontology-based tags, topic maps, other controlled vocabularies, mixed statistical approaches, etc.”
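One way to combine frequency, uniqueness, and field location into a single term weight is sketched below. The formula, the `FIELD_WEIGHTS` values, and the sample documents are assumptions chosen for illustration (uniqueness is approximated with a smoothed inverse document frequency); the article does not describe the scoring that Collexis actually uses.

```python
import math

# Assumed multipliers: a term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def term_weights(doc, corpus, terms):
    """Score each term for one document.

    doc and each corpus entry are dicts of field name -> text. Matching is
    crude substring counting, which is enough for a sketch.
    """
    n_docs = len(corpus)
    weights = {}
    for term in terms:
        t = term.lower()
        # Frequency, scaled by where in the document the term occurs.
        frequency = sum(w * doc.get(f, "").lower().count(t)
                        for f, w in FIELD_WEIGHTS.items())
        # Uniqueness: terms found in fewer documents score higher.
        doc_freq = sum(1 for d in corpus
                       if any(t in d.get(f, "").lower() for f in FIELD_WEIGHTS))
        uniqueness = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        weights[term] = frequency * uniqueness
    return weights


def bar_chart(weights, width=20):
    """Render relative weights as text bars, heaviest term first."""
    top = max(weights.values(), default=0.0) or 1.0
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(f"{term:<12} {'#' * round(width * w / top)}"
                     for term, w in ranked)


# Invented sample corpus.
corpus = [
    {"title": "Taxonomy design",
     "body": "A taxonomy groups terms. Search uses the taxonomy."},
    {"title": "Search basics", "body": "Search engines match keywords."},
    {"title": "Search tuning", "body": "Tuning search relevance."},
]
weights = term_weights(corpus[0], corpus, ["taxonomy", "search"])
print(bar_chart(weights))
```

In this sample, “taxonomy” outweighs “search” for the first document because it appears in the title, occurs more often, and is found in fewer documents overall.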

While much of text analytics does not involve semantic analysis, the specialty of natural language processing (NLP) is often involved in such attempts. NLP has many other applications beyond semantic analysis and tagging, but it is being applied in that area as well. At the fourth annual Semantic Technology conference in San Jose, Calif., in May, the topic of semantic tagging was presented by TextWise, a developer of text extraction, search, categorization, and classification technologies using both NLP and statistics. In the presentation “Applying Trainable Semantic Vectors to Tagging, Search/Discovery, Bookmarking and Matching,” a panel of TextWise speakers explained how its Semantic Signatures function as tags for bookmarking or in generating tags to map/link an existing tag set.

Semantic tagging’s integration with search technologies is also being applied in niche service areas. For example, Relevad, whose tagline is “semantic keyword analytics,” provides a hosted web service for online advertisement placement. Relevad claims a growing database of more than 8 million keywords and more than 500 million neighbor keyword meanings. Trovix, meanwhile, provides a web service for matching jobs to resumes, utilizing complex scoring algorithms in combination with a “hierarchical knowledgebase” of U.S. cities, skills, positions, industries, and companies.

SEMANTIC SOCIAL TAGGING

The term “tagging” is most strongly associated these days with social tagging or social bookmarking, whereby people assign tags (terms or keywords) of their own choice to documents, blog posts, or webpages that they have created or have viewed to assist in locating the documents later, whether by themselves or by others. Better-known tagging websites and services include Delicious, Flickr, and Technorati. There is generally no taxonomy or controlled vocabulary involved, as any words can be used as tags, although this is changing in some applications.

Fundamentally, this type of tagging is “semantic” as well, because humans manually tag content for what it means. The problem is that this tagging is done based on what the document means to the tagger at the time of tagging, not necessarily what it means to other users or even to the initial tagger at a later time. Furthermore, any lists of the occurrences of a tag can be long, undifferentiated, and ambiguous. The term “semantic tagging” within the sphere of social tagging, therefore, is being used to refer to a method of imposing consistent and more refined meaning, in other words, utilizing some kind of taxonomy. Such semantic social tags are also being called “rich tags.” Not only are the tags’ meanings clarified by synonyms, but there also may be links to related-term tags and the presence of glossary definitions for tags. In other words, semantic tags or rich tags are essentially terms in what is known to librarians as a thesaurus.

Figure: Fuzzzy tagging/tag creation UI supporting parent tags (broader terms), friend tags (related terms), and child tags (narrower terms).

Companies Featured in This Article
Alexander Street Press, LLC www.alexanderstreet.com
Apex CoVantage www.apexcovantage.com
Collexis Holdings, Inc. www.collexis.com
Interwoven, Inc. www.interwoven.com
Northern Light Group, LLC www.northernlight.com
Relevad www.relevad.com
Silverchair www.silverchair.com
Teragram Corp. www.teragram.com
TextWise www.textwise.com
Thomson Reuters’ Calais service www.opencalais.com
Zigtag, Inc. www.zigtag.com

Social tagging sites/services that offer what they call semantic tagging include Zigtag, a Canadian startup, and individual-led projects Faviki and Fuzzzy (yes, with three z’s). Zigtag (in private beta as of this writing) is a sidebar plug-in, which differentiates itself from other tagging services by providing a “semantic dictionary” of more than 2 million tags. Tags are defined and synonyms are linked together. Faviki is a social bookmarking tool that provides terms from Wikipedia, extracted by the open DBpedia tool. This not only provides consistency, but also extensive definitions for each of more than 2.18 million Wikipedia resources. Fuzzzy, on the other hand, did not start with a prebuilt taxonomy, but user-created terms are entered into a shared tag set (thesaurus) and various relationships (broader, narrower, related) are supported. Thus, Fuzzzy “enables global distributed tagging.” The organic tag set of Fuzzzy is built upon the Topic Map ISO standard and an underlying infrastructure with Web Services.
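A “rich tag” of this kind can be modeled as a small thesaurus entry. The sketch below is an assumption-laden illustration: the `RichTag` and `TagSet` classes, their method names, and the sample tags are invented here and do not reflect the data models of Zigtag, Faviki, or Fuzzzy.

```python
from dataclasses import dataclass, field


@dataclass
class RichTag:
    """A tag with the thesaurus-style attributes described above."""
    label: str
    definition: str = ""
    synonyms: set = field(default_factory=set)
    broader: set = field(default_factory=set)   # "parent" tags
    narrower: set = field(default_factory=set)  # "child" tags
    related: set = field(default_factory=set)   # "friend" tags


class TagSet:
    """A shared tag set that resolves user input to preferred labels."""

    def __init__(self):
        self.tags = {}
        self.lookup = {}  # lowercased label or synonym -> preferred label

    def add(self, label, definition="", synonyms=()):
        tag = self.tags.setdefault(label, RichTag(label))
        if definition:
            tag.definition = definition
        tag.synonyms.update(synonyms)
        for name in (label, *synonyms):
            self.lookup[name.lower()] = label
        return tag

    def link_broader(self, child, parent):
        # Broader and narrower are two sides of the same relationship.
        self.tags[child].broader.add(parent)
        self.tags[parent].narrower.add(child)

    def link_related(self, first, second):
        self.tags[first].related.add(second)
        self.tags[second].related.add(first)

    def resolve(self, user_tag):
        """Return the preferred label for a user-typed tag, or None."""
        return self.lookup.get(user_tag.strip().lower())


tags = TagSet()
tags.add("Film", definition="A motion picture.",
         synonyms=["movie", "motion picture"])
tags.add("Documentary film")
tags.add("Cinema")
tags.link_broader("Documentary film", "Film")
tags.link_related("Film", "Cinema")
print(tags.resolve(" Movie "))
```

Two users who type “movie” and “motion picture” end up applying the same tag, and the broader, narrower, and related links give a browsing path between tags that a flat list of free-text tags cannot offer.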

It isn’t just new kids with extra consonants pursuing social tagging, however. Big, established content players are also getting involved. Thomson Reuters offers its open Calais Web Service, which ingests unstructured text and, using NLP, returns RDF-formatted results identifying entities, facts, and events within the text. In May, Calais was made available as plug-in software for the Drupal publishing platform, Yahoo!’s new Searchmonkey service, and the WordPress blogging platform. The Calais plug-in for WordPress, called Tagaroo, returns tag suggestions based on text typed into a blog but gives users the option of choosing which they want to apply. Calais also offers licensed code to make one’s site part of the “Semantic Web.”

TAGGING AND THE SEMANTIC WEB

Finally, semantic tagging can be defined as tagging for the semantic web. This involves tags that make use of the RDF (Resource Description Framework) specifications or OWL (Web Ontology Language) of the World Wide Web Consortium (W3C). It also implies that the tags are used on public webpages that can be accessed with semantic web browsers, rather than merely in internal enterprise or library products or services. As such, a tag is more than a term; it is an object with its own attributes. According to Rhind-Tutt, “The difference between semantic indexing and standard indexing is that the former does more than simply apply subjects to terms. It includes the addition of metadata about tags that allows semantically indexed terms to interoperate with other similarly indexed terms.”

(This is discussed at more length in the blog post “Tagging and the Semantic Web”; see www.designmills.com/2008/05/20/tagging-in-the-semantic-web.)

While social tagging can be made more semantic, we have to remember that social tagging is not always about pure findability. The social aspect is about identifying what other people have labeled as interesting or noteworthy, especially if there is a rating aspect involved. For the semantic web, on the other hand, information findability is a major objective, as stated in W3C’s Semantic Web Activity Statement: “to create a universal medium for the exchange of data. It is envisaged to smoothly interconnect personal information management, enterprise application integration, and the global sharing of commercial, scientific and cultural data.”

Silverchair’s Zarnegar put it well: “Semantic tagging is best applied in areas when there is a qualitative ‘best answer’ to a user query (as opposed to a ‘most popular’ answer) … If you look at industries where semantic tagging (and structured data) have found a foothold (aviation, medicine, genetics, chemistry, and others) you’ll see they are not areas where you want to go too far with iffy information!”

HEATHER HEDDEN ([email protected]) IS AN INSTRUCTOR OF CONTINUING EDUCATION WORKSHOPS AT SIMMONS COLLEGE GRADUATE SCHOOL OF LIBRARY AND INFORMATION SCIENCE, AND FOUNDER AND MANAGER OF THE TAXONOMIES & CONTROLLED VOCABULARIES SIG OF THE AMERICAN SOCIETY FOR INDEXING. COMMENTS? EMAIL LETTERS TO THE EDITOR [email protected].