Source: download.microsoft.com/.../FASTSearchSharePointAddin…


Adding structure to unstructured content for enhanced findability
Best practice approaches to improve search experience by creating metadata on-the-fly with FAST Search Server 2010 for SharePoint

Dima Diall and Jens Laurits Nielsen

Microsoft Corporation


Table of Contents

Introduction
Metadata is a real challenge that affects findability
    Where does metadata live?
    Manual metadata is unreliable and costly
    Poor metadata leads to poor findability
Solutions in FAST Search Server 2010 for SharePoint
    Content processing pipeline
    Property extraction
    Where is metadata used by enterprise search?
    Deep refiners – why do they matter?
Property extraction in FAST Search Server 2010 for SharePoint
    Out-of-the-box property extractors
    Extending property extraction
    Calling external property extractors
Best practices
    Deepen your understanding of your audiences and your content
    Use existing language resources inside and outside your enterprise
    Keep the index synchronized with content sources and dictionaries
    Distinguish search management from systems management
Case studies
    General Mills
    Mississippi Department of Transportation (MDOT)
Conclusion



Table of Figures

Figure 1 – Properties panel in Microsoft Word (left) and metadata columns in a SharePoint document library (right)
Figure 2 – Concentric spheres of people's concerns and interests at the workplace
Figure 3 – Poor metadata inevitably degrades the search experience and findability
Figure 4 – Inconsistent and missing metadata limit usability of refiners
Figure 5 – Rely less on manual tagging and more on automated content enrichment tools
Figure 6 – High-level overview of the content processing pipeline in FAST Search Server 2010 for SharePoint
Figure 7 – Property extraction in action tagging companies, locations and people in a Microsoft Word document
Figure 8 – High-quality metadata enables intuitive refiners to help users find their way around search results
Figure 9 – Example deep refiners supported out-of-the-box
Figure 10 – Extending property extraction (dictionaries from e.g. SharePoint lists, XML feeds) to create custom refiners
Figure 11 – Extending property extraction with text mining and classification (on-premises software or web services in the cloud)
Figure 12 – Each user, or group of users, has different information needs to do their work
Figure 13 – Changes in enterprise content and dictionary data sources must be reflected in the search index
Figure 14 – Risk of solution decay without adequate search management

Table of Tables

Table 1 – A diverse set of reasons lead to generally poor quality of metadata in most organizations today
Table 2 – Description of the key content processing stages shown in Figure 6
Table 3 – Potential internal data sources available in the enterprise mapped to hypothetical dictionaries
Table 4 – Sample external data sources for dictionaries and text mining/classification tools
Table 5 – Job profile and responsibilities for search management
Table 6 – Overview of the FAST Search Server 2010 for SharePoint deployment at General Mills
Table 7 – Overview of the FAST Search Server 2010 for SharePoint deployment at MDOT



Introduction

The volume of digital content and the associated information overload have grown explosively over the past decade; at the same time, enterprise search has secured increasing prominence on CIOs' agendas worldwide, in small and large organizations alike. Employees – the end users – understandably draw a parallel between their experience online and at work, asking: "Why can't we search for information inside our company as easily as on the web?"

It is a fair demand, as search technology was available in the enterprise even before the Internet boom, yet in practice there are significant differences between the two worlds. First, and most important, resources: web search engines typically enjoy far greater scale and a tangible return on investment, whereas enterprise search is an internal application rather than a revenue-generating one. Second, the mix of information sources used in the enterprise to inform decisions and drive business processes is far more heterogeneous (structured and unstructured) and far less interconnected than the pages and sites of the web.

Linked to this, one of the thorniest challenges that enterprise search projects inside the firewall stumble upon is metadata quality, or rather the lack of it. This paper briefly explains what metadata is, why it is usually of poor quality, and why it matters so much for findability. It then shows how out-of-the-box capabilities in FAST Search Server 2010 for SharePoint reduce these issues, specifically by automating metadata creation on-the-fly at index time – in effect adding structure to unstructured content. Given the correct rules and dictionaries, such machine-driven content enrichment offers three intrinsic benefits that set it apart from methods relying on manual input from end users or subject matter experts: scalable and consistent metadata at a lower cost.

A discussion follows on how FAST Search Server 2010 for SharePoint leverages this high-quality metadata to enable compelling search experiences that increase findability overall. Deep refiners are spotlighted for the way they engage users in a dialog, making the experience conversational and allowing users to slice and dice search results intuitively, zooming in on the required information in just a few clicks.

Finally, the paper closes with four key best practices for ensuring success in enterprise search and two case studies showcasing early adopters of FAST Search Server 2010 for SharePoint.

Metadata is a real challenge that affects findability

Metadata is loosely defined as "data about data" – basically, a set of properties that describe a document or some other piece of content. The elementary purpose of metadata is twofold: organizing content (cataloguing, labeling, tagging, etc.) and locating content (browsing, filtering, sorting, searching, etc.).

Looking beyond this basic functionality, metadata is a central element across a wide range of real-world applications: different areas within information management[1], regulatory compliance, business workflow automation, classification and recommendation engines, and enterprise search, among others. It becomes obvious that without a sufficient level of metadata quality, a solution implemented to address any of these scenarios will not fully deliver the value promised in the original business case that justified the project in the first place.

[1] Archiving, records management (RM), enterprise content management (ECM), knowledge management (KM), web content management (WCM), digital asset management (DAM), etc.

The challenge is that, in most organizations today, the quality of metadata is generally low – often missing and, where present, frequently inconsistent or incorrect. This paper focuses on how metadata quality affects enterprise search solutions and, more specifically, findability – the ability for people to obtain the right information at the right time so they can make better and faster decisions, perform daily tasks, complete business processes and, ultimately, achieve their goals.

The reasons for this dire state of affairs are diverse, and a more detailed discussion follows later in this section (see Table 1). For now, it suffices to point out issues related to enterprise information governance and the reliance on manual approaches to creating metadata. Without automated information classification and tagging tools, it is impossible to keep up with the exponentially growing content volumes in today's world.

Where does metadata live?

In most documents in electronic format, such as Microsoft Office and PDF files, certain properties or pieces of metadata are stored in the file itself, e.g. last modified date, title, author, keywords, comments, template, and so on. These properties are either populated automatically by the software application or entered manually by the person creating or updating the document.

Figure 1 – Properties panel in Microsoft Word (left) and metadata columns in a SharePoint document library (right)

Some properties will be managed in the container or repository used to actually store the document, ranging from a simple file folder on disk to a sophisticated content management system, such as SharePoint, Documentum, and others. Some examples of metadata managed in this manner are as follows: content format/size, filename, title, author, owner, user/group permissions, publication/expiration date, retention policy, document status, taxonomy classification, folksonomy tags, and so on.

As you can see, overlaps with properties stored in the document file itself are possible, and automated synchronization for such overlaps may or may not exist. Again, this metadata can be manually entered by



the user uploading the document to the repository, or can be automatically detected, suggested and/or assigned by the system.

Manual metadata is unreliable and costly

Very few organizations of any size can boast a high standard of metadata quality across their many internal information stores, particularly when it comes to unstructured content such as documents in email attachments, file shares or other corporate repositories. In fact, metadata is often poor – it is typically:

- Nonexistent (missing entries or values for the properties)
- Incorrect (inaccurate, wrong or otherwise useless entries)
- Inconsistent (spelling variations, abbreviations, synonyms or different interpretations of the same entry)

As stated earlier, several reasons contribute to this situation and Table 1 contains an overview of some key explanations.

Table 1 – A diverse set of reasons lead to generally poor quality of metadata in most organizations today

Information governance
- Ineffective information management strategy across the enterprise
- Fragmented approach leads to content silos
- Different metadata standards and search interfaces

Human nature
- Reliance on manual input from users or subject matter experts
- The former is frequently unreliable
- The latter becomes prohibitively expensive very quickly

Explosive content growth
- Manual tagging does not scale and is low quality
- Lack of automated content tagging and classification tools
- Impossible to keep up with ever-growing volumes of content

In many cases, the lack of an integrated and strategic approach to information governance across the enterprise leads to separate but overlapping content repositories – silos – with different metadata standards and point solutions to access and search them. In large organizations (multinational corporations and government), the problem's complexity is magnified by divergent approaches in different business units or departments, not even counting the many unmanaged repositories such as network file shares or people's desktop files and email attachments.

Furthermore, the metadata creation and management process typically relies heavily on manual input from end users and/or subject matter experts. People generally have an egocentric view of the world, concerning themselves primarily with issues that affect them directly on a personal level and only then, gradually, with their wider social network – such is human nature.

Unsurprisingly, this translates into people's behavior in their professional lives as well, where employees find themselves working in a "bubble", so to speak, as illustrated in Figure 2, whose concentric spheres read, from the inside out: my own job & interests; my team's goals & projects; my department's priorities; my enterprise's objectives.


Figure 2 – Concentric spheres of people’s concerns and interests at the workplace

Specifically regarding the creation and management of metadata: if you do not perceive a clear benefit, you are very unlikely to spend time and energy being meticulous when filling in property forms after saving or uploading a document. You may not even remember to enter any metadata (other than, possibly, the file name), as deployed applications and content repositories in the enterprise rarely prompt you, the user, for it explicitly. Even when you are prompted to add metadata, it is still rare for the system or application to proactively help you provide input through automated suggestions (e.g. type-ahead or auto-completion).

The fact is that, until recently, there was no obvious value in investing in high-quality metadata across enterprise content, and therefore little incentive for widespread and consistent user participation. This is in stark contrast to what these same users experience on the internet, especially on websites with social computing features (e.g. Amazon, Digg, Facebook, YouTube), where highly relevant recommendations and rich collaborative filtering offer tangible benefits. Naturally, applications inside the firewall cannot match the scale or critical mass of their Web 2.0 counterparts, but as the momentum of enterprise search technology grows, the metadata deficiencies become ever more apparent.

Last, but not least, the production and sharing of digital content has continued to grow exponentially over the past decade, with no abating in sight. Considering the two other factors discussed earlier in this section – ineffective information governance that multiplies content silos, and the weaknesses of manually created metadata – it is impossible to keep up with the flow of new content, let alone deal with the metadata backlog in existing repositories and archives.

If one cannot rely exclusively on end users, it is equally clear that using subject matter experts to fill the metadata gap manually is not only impractical but very costly. The answer will necessarily involve automation, as discussed after looking at some examples of how low-quality metadata degrades findability in the next section.

Poor metadata leads to poor findability

Having explored the factors behind the status quo of poor metadata in most organizations today, it is perhaps no surprise why so many enterprise search projects fall short of delivering the business value they promise. End users continue to live a paradoxical reality in which it is easier for them to find and act on information on the internet than in the comparatively much smaller world inside their organization's firewall, or even on their own desktops.

The following list identifies some of the ways in which poor metadata impairs the users' search experience and, in turn, findability. Ultimately, in information retrieval and enterprise search scenarios, this erodes trust in the system overall, leading to low user adoption.

- Search results are difficult to scan or navigate.
- Documents returned may be duplicates, incomplete or not current.
- Low confidence in the authority and correctness of information.
- Difficult to locate relevant experts for specific topics.



The screenshot in Figure 3 shows these issues. At first glance, the first couple of results listed may seem to be duplicates, because the title and author name were never updated from the original template used to create each document. If several (non-duplicate) results on a certain topic appear to have been authored by a single person, the unwitting user may be led to believe that the template's author is the expert on the subject.

Further down the screenshot, meaningless titles, author names and other pieces of metadata merely confuse users as they scan the list trying to assess the results' relevance to their current information need or task. Finally, at the bottom there is missing metadata – also a common occurrence – which raises questions about the completeness and authority of the result set.

Figure 3 – Poor metadata inevitably degrades the search experience and findability

It is worth pointing out that employees in today's enterprises have grown accustomed to searching on the internet, where entering only a keyword or two into the search box returns fairly relevant results. In effect, these users have been trained by popular web search engines and have come to expect the same behavior from enterprise search deployments inside the firewall.



However, unlike web search, the mix of content and use case scenarios in enterprise search is usually far more complex in many ways. Therefore, the users’ expectations are often misaligned with reality.

The range of information sources employees must interact with to complete daily tasks and business processes is both more fragmented (or siloed, as discussed in the previous section) and much wider in scope – personal files and email, collaboration spaces, intranet sites, document repositories, databases, line-of-business applications, enterprise systems, and so on. Moreover, these resources are not as seamlessly interconnected through hyperlinks as web pages and sites on the internet, which makes it much harder to establish the user's context or intent from one or two search keywords.

Given the current search paradigm of "one or two keywords", in inside-the-firewall scenarios it is important to give users additional tools that let them refine the result set returned by a full-text query and help them navigate a large information space without getting lost. Such refinement tools are typically presented as several facets across the top or down either side of the result list, where users can interactively select entries to constrain or expand the scope of their search.

These refiners (or facets) have spread like wildfire across search result pages on e-commerce and media websites (e.g. Amazon.com, eBay.com, FT.com, Factiva.com), and are discussed in more detail in the context of enterprise search later in this document (see "Deep refiners – why do they matter?"). For now, the focus is on the fact that such refinement tools are still a rare sight in the enterprise, particularly in legacy search systems and applications, as also shown in Figure 3 (top). After running a search, users are frequently left with three options: 1) re-sort the result list by, for example, date, author or title; 2) jump around to other pages in the result set; or 3) try to reformulate the original query string.

The final point here is that even when search result refinement tools are available, their usefulness depends on metadata quality. Poor metadata negatively affects this sophisticated search feature: users quickly lose confidence and stop relying on the refiners when they see multiple variations of the same entry (because of inconsistent metadata) or hit counts that do not add up to the total number of results (because of missing metadata), as displayed in Figure 4.

Figure 4 – Inconsistent and missing metadata limit usability of refiners
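The refiner degradation described above can be seen in a tiny sketch (plain Python with hypothetical result data, not actual search engine output): inconsistent author values split one person across several refiner entries, and a missing value makes the counts fall short of the result total.

```python
# Illustrative sketch of the refiner problem: inconsistent author
# metadata splits one person across several refiner entries, and a
# missing value makes the counts fall short of the result total.
from collections import Counter

results = [
    {"title": "Q1 report", "author": "J. Smith"},
    {"title": "Q2 report", "author": "John Smith"},
    {"title": "Q3 report", "author": "SMITH, JOHN"},
    {"title": "Q4 report", "author": None},  # missing metadata
]

# Build the "author" refiner by counting each distinct non-empty value.
refiner = Counter(r["author"] for r in results if r["author"])

print(len(refiner))          # three refiner entries for one real person
print(sum(refiner.values()), "of", len(results), "results covered")
```

With normalized metadata, the same four results would yield one author entry whose count matches the result total.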



Solutions in FAST Search Server 2010 for SharePoint

The bottom line, as demonstrated previously, is that metadata quality is generally low. Solutions to this thorny problem will necessarily involve automating the metadata creation and maintenance process, as illustrated in Figure 5.

Figure 5 – Rely less on manual tagging and more on automated content enrichment tools

So, how can FAST Search Server 2010 for SharePoint help solve this problem and deliver immediate business value out-of-the-box?

First, sophisticated content processing enriches content from multiple data repositories (structured and unstructured) to increase findability and improve the users' search experience.

Second, property extraction helps overcome poor metadata by generating it and normalizing it on-the-fly.

The next two sections deepen the discussion of these two very significant capabilities that set FAST Search Server 2010 for SharePoint apart from many other enterprise search platforms, including the standard search offering included in SharePoint 2007 and 2010.

Content processing pipeline

Content processing in FAST Search Server 2010 for SharePoint is designed as a pipeline – as illustrated in Figure 6 – a sequential set of discrete processing stages that analyze and enrich content before it is indexed by the search engine. Right out of the box, FAST Search Server 2010 for SharePoint creates all of the necessary elements, or "ingredients", to optimize findability and deliver a great search experience to end users.

Figure 6 – High-level overview of the content processing pipeline in FAST Search Server 2010 for SharePoint
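As a purely illustrative sketch (Python, not the actual FAST Search extensibility API), the pipeline concept can be modeled as an ordered list of stage functions, each taking and returning a document represented as a property bag; the stage logic shown here is deliberately naive.

```python
# Illustrative sketch of a content processing pipeline: an ordered
# list of stage functions, each enriching a document (a dict of
# properties) before passing it on. Stage logic is deliberately naive.

def detect_language(doc):
    # Hypothetical stage: crude language guess based on a marker word.
    words = doc["body"].lower().split()
    doc["language"] = "en" if "the" in words else "unknown"
    return doc

def tokenize(doc):
    # Hypothetical stage: whitespace tokenization (real tokenizers use
    # language-specific rules for punctuation, compounds, numbers, etc.).
    doc["tokens"] = doc["body"].lower().split()
    return doc

PIPELINE = [detect_language, tokenize]

def process(doc):
    """Run the document through every stage, in order."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

doc = process({"body": "The quarterly report was filed in Oslo"})
print(doc["language"])    # "en"
print(doc["tokens"][:2])  # ['the', 'quarterly']
```

The real pipeline works analogously but on crawled properties, with each stage able to add, normalize or map metadata for the stages downstream.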

However, there is more: the pipeline can be extended, through a built-in mechanism, with custom processing stages of your own to address specific needs of the business.



One example could be looking up additional metadata stored elsewhere (e.g. in a stand-alone database) to further increase the findability of the document being indexed (e.g. from an archive of scanned reports originally in paper format) – this particular scenario is central to the second case study (MDOT) presented later in this document.

At a conceptual level, a typical pipeline performs tasks such as those outlined in Table 2. Most of these stages are part of the standard pipeline configuration in FAST Search Server 2010 for SharePoint.

Table 2 – Description of the key content processing stages shown on Figure 6

Format conversion – Extracts plain text and metadata from multiple content formats, such as Microsoft Office and PDF (400+ file formats supported).

Language and encoding detection – Identifies the encoding and languages used in the text content so that the appropriate linguistic normalization rules and dictionaries can be applied downstream.

Tokenization – Breaks text into tokens using language-specific rules regarding punctuation, diacritics, accents, compound words, phrases and numbers (currency, telephone numbers, part numbers, etc.).

Lemmatization – Applies linguistic normalization to content so that users' queries match documents containing words and phrases in either canonical or inflected forms (singular/plural, masculine/feminine, etc.); for example, a search for "mice" also finds "mouse".

Property extraction – Recognizes predefined entities mentioned in the content; out-of-the-box support covers companies, locations and people, and can be extended to other categories (additional details in the next section).

Vectorization – Creates document vectors (phrase/weight pairs reflecting important terms and their frequency of occurrence) to enable functionality such as "find documents similar to this one".

Date and time normalization – Converts dates and times to a standard representation to handle locale-specific formats; for example, the date 14-Mar-10 is the same as March 14, 2010.

Custom processing stage – Enables you to extend the content processing pipeline with custom stages (home-grown solutions or third-party software) to address specific business needs of your own.

Property mapping – Maps the relevant pieces of content and metadata (crawled properties) discovered in the pipeline to the index schema (managed properties) for searching.
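The date and time normalization stage from Table 2 can be approximated in a few lines of Python; the list of input formats here is hypothetical, and a real deployment would cover many more locale-specific patterns.

```python
# Illustrative sketch of date normalization: convert locale-specific
# date strings to one canonical ISO form so that "14-Mar-10" and
# "March 14, 2010" index identically.
from datetime import datetime

# Hypothetical list of input formats the stage tries in order.
FORMATS = ["%d-%b-%y", "%B %d, %Y", "%Y-%m-%d"]

def normalize_date(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unrecognized values untagged

print(normalize_date("14-Mar-10"))       # "2010-03-14"
print(normalize_date("March 14, 2010"))  # "2010-03-14"
```

Both surface forms collapse to the same canonical value, which is what lets a date refiner or range query treat them as one.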

As a final note, all of this content processing happens at lightning speed. With an appropriate software and hardware configuration, the pipeline easily scales out of the box to process dozens, hundreds, or even thousands of content items (documents or records) per second.

Property extraction

In the context of enterprise search, property extraction can be defined as the ability to recognize entities (e.g. people, companies, locations) within unstructured content (e.g. a document's body text) for indexing as supplementary metadata. Figure 7 shows how this feature would automatically



parse a hypothetical Microsoft Word document and identify different kinds of entities mentioned in the text – in this case companies, locations and people.

Figure 7 – Property extraction in action tagging companies, locations and people in a Microsoft Word document

As seen previously, property extraction is an integral feature of the content processing pipeline in FAST Search Server 2010 for SharePoint. The extracted entities are exposed as crawled properties for additional processing downstream in the pipeline and, finally, mapped to managed properties in the index schema by the property mapping stage, which makes this additional metadata available to the search engine.

Property extraction, also known as entity extraction (a sub-domain of text mining or text analytics), relies on pattern recognition and/or dictionaries. These rules and dictionaries should usually also define normalizations for extracted entities, so that multiple variations of the same entity can be merged into a single master entry in the generated metadata.
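A minimal sketch of dictionary-driven extraction with normalization follows (plain Python; the dictionary entries and substring-matching strategy are hypothetical illustrations, not the FAST dictionary format):

```python
# Illustrative sketch of dictionary-based property extraction with
# normalization: each surface variant maps to one canonical company
# entry, so different spellings merge in the generated metadata.
COMPANY_DICT = {  # hypothetical variant -> canonical mapping
    "microsoft": "Microsoft Corporation",
    "microsoft corp.": "Microsoft Corporation",
    "msft": "Microsoft Corporation",
    "contoso": "Contoso Ltd",
}

def extract_companies(text):
    lowered = text.lower()
    found = set()  # a set merges variants that share a canonical form
    for variant, canonical in COMPANY_DICT.items():
        if variant in lowered:
            found.add(canonical)
    return sorted(found)

text = "MSFT announced a partnership; Contoso and Microsoft Corp. agreed."
print(extract_companies(text))  # ['Contoso Ltd', 'Microsoft Corporation']
```

Because "MSFT" and "Microsoft Corp." normalize to the same master entry, a company refiner built on this metadata shows one consolidated count instead of several fragments.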

Once the correct rules and dictionaries have been set up, the machine-generated metadata produced on-the-fly through property extraction has three intrinsic qualities that differentiate it from methods relying on human input: it is scalable, consistent, and produced at a reduced cost. Metadata with such characteristics addresses the findability problems highlighted earlier (see "Poor metadata leads to poor findability") by enabling a richer search experience for end users, as demonstrated next.

Where is metadata used by enterprise search?

It was explained earlier how poor metadata (missing, incorrect or inconsistent) degrades the search experience and findability, leading to low user confidence and adoption of search solutions. This section and the next show how a richer set of consistent metadata has the opposite effect on business value by making life easier for the users.


FAST Search Server 2010 for SharePoint uses indexed metadata (i.e. managed properties) in several ways to increase findability and simplify the overall search experience.

1. Deep refiners, also known as facets or navigators. Refiners organize the search results page into different dimensions according to the available managed properties, offering users an at-a-glance overview of the complete result set and intuitively guiding them toward options for quickly zooming in on the information they need to progress with their task. Deep refiners are discussed in greater detail in the following section.

2. Relevancy tuning. Managed properties are exploited to tune the relevancy models and ranking algorithms used in the search engine, assigning different weights to different fields so that, for example, a match in the title is considered more significant than a match in the body text. Another common practice is to give importance to freshness, so that newer content is positioned higher in the result list. An example relating specifically to property extraction would be to differentiate, with regard to rank score effect, between matches in automatically extracted metadata fields (such as companies, locations, people, products, and so on) and matches in less reliable manual metadata. Multiple relevancy models and rank profiles can be configured in FAST Search Server 2010 for SharePoint to cater for the various needs of a diverse range of search applications and user groups in the enterprise.

3. Multi-level and formula-based sorting. Managed properties can also be used for sorting result sets according to the value of one or more indexed fields. Search results are usually sorted in descending order by the rank score computed for each hit and, in most cases, result sorting controls are also exposed to the user (as shown in Figure 8); depending on the goal of a particular search application this default behavior can be modified. An example of sorting by multiple dimensions could be to sort first by date (descending), then for a given date sort by title (ascending). Multi-level sorting can also be combined with relevancy ranking: sort by rank score, and then by geographic distance (to a given point on a map), and then by rating, and then finally by date. In the previous example, geo-sorting is enabled by using trigonometric formulae in the sort expression (one of the parameters in the query that was sent to the search engine).

4. Fielded search. Yet another use for managed properties is to perform searches restricted to one or more fields in the index. For instance, search for documents of a specific format, published in a given date range, by a certain author, in a particular branch in the taxonomy tree, mentioning several concrete companies and locations. This would usually be explicitly exposed to the user as an input form on the search interface (generally known as “advanced search”), or seamlessly integrated into a search-driven application for business process or workflow automation.
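Multi-level and formula-based sorting (point 3 above) can be sketched with a chain of stable sorts in Python, applied from the least to the most significant key. The geo formula here is a simple Euclidean approximation rather than the product's actual sort-expression syntax, and the data is invented:

```python
import math

def geo_distance(doc, lat, lon):
    # Euclidean approximation -- sufficient to illustrate formula-based ordering.
    return math.sqrt((doc["lat"] - lat) ** 2 + (doc["lon"] - lon) ** 2)

results = [
    {"title": "A", "rank": 80, "lat": 35.7, "lon": 139.7, "rating": 4, "date": "2010-05-01"},
    {"title": "B", "rank": 80, "lat": 48.9, "lon": 2.35, "rating": 5, "date": "2010-06-01"},
    {"title": "C", "rank": 95, "lat": 40.7, "lon": -74.0, "rating": 3, "date": "2010-04-01"},
]

# Python's sort is stable, so applying the keys from least to most
# significant yields: rank desc, then distance to Tokyo asc, then
# rating desc, then date desc.
results.sort(key=lambda d: d["date"], reverse=True)
results.sort(key=lambda d: d["rating"], reverse=True)
results.sort(key=lambda d: geo_distance(d, 35.68, 139.69))
results.sort(key=lambda d: d["rank"], reverse=True)
print([d["title"] for d in results])  # ['C', 'A', 'B']
```

Items A and B tie on rank score, so the formula-based second key (distance to Tokyo) decides their relative order.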

The important takeaway is that managed properties underpin the key features discussed in this section. Using the automated tooling included in FAST Search Server 2010 for SharePoint (content processing and property extraction), you can generate the necessary high-quality metadata on-the-fly during indexing, consistently and cost-effectively. This is essential for findability and a first-class experience for your end-users, translating directly into increased confidence, satisfaction and adoption of the solution, which ultimately drives business value and justifies your investment in high-end enterprise search technology.
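Fielded search (point 4 in the list above) can be sketched as a conjunction of constraints evaluated against the managed properties of each indexed item. The field names and the tiny in-memory “index” below are hypothetical:

```python
# Minimal sketch of fielded search over indexed metadata; the managed
# property names and sample items are invented for illustration.
index = [
    {"format": "docx", "author": "J. Smith", "year": 2010, "companies": {"Contoso Ltd."}},
    {"format": "pptx", "author": "J. Smith", "year": 2009, "companies": {"Fabrikam Inc."}},
    {"format": "docx", "author": "A. Jones", "year": 2010, "companies": {"Contoso Ltd."}},
]

def fielded_search(index, **constraints):
    """Return items whose fields satisfy every constraint (AND semantics).

    Set-valued constraints mean "must mention all of these entities"."""
    def matches(doc):
        return all(
            value <= doc[field] if isinstance(value, set) else doc[field] == value
            for field, value in constraints.items()
        )
    return [doc for doc in index if matches(doc)]

hits = fielded_search(index, format="docx", year=2010, companies={"Contoso Ltd."})
print(len(hits))  # 2
```

An “advanced search” form on the search interface would simply translate its input fields into constraints of this kind before submitting the query.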


Another aspect deserves attention: the first three features listed earlier are fundamental differentiators when you compare FAST Search Server 2010 for SharePoint with the standard search capability included with SharePoint 2010.

1. SharePoint 2010 search only provides shallow refiners, whereas FAST Search Server 2010 for SharePoint supports both shallow and deep refiners (explained in the next section).

2. On a given instance of SharePoint 2010, search relevancy can only be tuned so that it applies to all users and applications, whereas FAST Search Server 2010 for SharePoint supports multiple relevancy models and rank profiles.

3. Sorting in SharePoint search can only be performed on a single dimension (either by rank score or a managed property), whereas FAST Search Server 2010 for SharePoint supports multi-level and formula-based sorting.

Deep refiners: why do they matter?

It was established earlier (page 6) that users usually draw a parallel between web search and enterprise search, transferring the “one or two keywords” habit to information access scenarios inside the firewall. For example, consider a large software company that develops, sells and implements Enterprise Resource Planning (ERP) solutions. When somebody searches for “erp”, what do they actually mean? What information will satisfy their needs in this particular instance?

Most times users will be looking for information in some specific context: to support a decision, to complete a task, to action a business process, and so on. Nevertheless, this query is very vague, yet users still have high expectations that search will deliver the “right” information. Considering the heterogeneous nature of enterprise content and an organization’s diverse audiences (with different responsibilities, priorities and tasks), how do you deliver the needle in the haystack?

Attempting to give a simple answer here would be naïve at best. In an enterprise search setting, several components (query suggestions, synonyms, spell-checking, relevancy ranking, best bets, recommendations, federation, etc.) contribute to excellent findability from a user’s perspective; nevertheless, there is one overriding factor: customize the search experience to the user’s context. In other words, you should strive to make the various aspects of accessing and acting on information pertinent to the user’s role, location, business interests, professional network, and so on. The benefits of user context in FAST Search Server 2010 for SharePoint are explored in a related white paper2, while this document focuses on two other very important elements: high-quality metadata and deep refiners (the former underpins the latter, as previously discussed).

So if the user searching for “erp” in the example scenario earlier in this article is a sales professional, a great experience would put emphasis on relevant sales materials, presentations and go-to-market pitches. Conversely, the same query submitted by an implementation consultant should prioritize project toolkits, best practice guidelines, relevant IP from previous projects, colleagues with experience in similar cases, and so on. Having a better understanding of who is asking the question (and why) gives the search application a better handle on how to answer it while providing a much more contextual experience to the end-users.

2 See “Delivering great search experiences with User Context” by Tony Hart and Mark Stone.


Still, despite such optimizations, it may be challenging to disambiguate the user’s context and intent precisely, or the number of relevant results may be large, so additional refinement tools are of great value in helping users make sense of and navigate a vast information space without getting lost. The screenshot in Figure 8 presents deep refiners on the left-hand side. A search application that uses deep refiners, supported by high-quality metadata, offers two clear advantages over a system without these characteristics:

1. At-a-glance overview of the complete result set returned by the free-text search (typically very vague); and

2. Intuitive guidance toward predictable results refinement options (narrow or widen the scope of the search).

In effect, with deep refiners search becomes conversational: a dialog is established with the user, ultimately helping them formulate a “better question” and quickly surface the required information. The process is interactive, so users remain in their “one or two keywords” comfort zone and are not forced into adopting other tactics developed for web search, such as repeatedly editing the original query in the search box.


Figure 8 – High-quality metadata enables intuitive refiners to help users find their way around search results

Returning to the example that was mentioned earlier, after a salesperson performs a search for “erp”, if he or she is looking for a customer presentation from last year, the result set could be refined by content type (e.g. PowerPoint), by date (e.g. older than 6 months), by industry (e.g. Healthcare) and by region (e.g. Japan). On the other hand, the consultant investigating an issue during implementation might refine by content source (e.g. knowledge base), by product module (e.g. Payroll) and by software version (e.g. v5.3 SP2). Again, having a better understanding of who is asking the question (and why) gives the search application a better handle on how to answer it while providing a much more contextual experience to the end-users.

Deep refiners differ from shallow refiners because they are generated through analysis of the complete result set, whereas the latter are created by parsing just the top N hits in the result list returned by the search engine (N rarely exceeds 100 hits). As mentioned earlier, deep refiners are a feature exclusive to FAST Search Server 2010 for SharePoint, as opposed to shallow navigators, which are available in the standard search offering included with SharePoint 2010 and in many other search engines on the market today.

Both deep and shallow refiners organize the result page into facets that provide an overview of the search results and guide the user toward possible options. However, the difference matters in several ways, especially in scenarios with a large content corpus where a user’s initial search may easily return far more than a few hundred results, possibly matching millions of items.

Deep refiners consider every single item matching the query, including the long tail, whereas shallow refiners offer only a sample from the top. This is important because most users rarely look beyond the first results page (another habit developed with web search), so deep refiners offer a much higher fidelity overview of the whole result set (even for millions of hits).

Deep refiners display precise hit counts for each entry, because they are computed across the full result set, whereas shallow refiners generally do not include hit counts at all (counts computed from the top N hits would be inaccurate whenever more than N results are returned). This lets users discover at-a-glance what is popular or unique in the result set.

Hit counts in deep refiners inform users beforehand how many results will be returned upon selecting an entry, whereas with shallow refiners it is a gamble. This is significant as it makes the act of refining the result set predictable, also avoiding the possibility of drilling down into a “0 results” dead end.
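The difference between deep and shallow counting can be demonstrated in a few lines of Python; the result set below is invented, with a small “Japan” segment buried in the long tail:

```python
from collections import Counter

# Deep refiners count across the FULL result set; shallow refiners only
# sample the top N hits, so long-tail entries can disappear entirely.
results = [{"region": "Europe"}] * 995 + [{"region": "Japan"}] * 5

deep = Counter(doc["region"] for doc in results)           # all 1,000 hits
shallow = Counter(doc["region"] for doc in results[:100])  # top 100 only

print(deep["Japan"], shallow["Japan"])  # 5 0 -- Japan vanishes from the shallow refiner
```

A user relying on the shallow refiner would never learn that any Japanese results exist, whereas the deep refiner surfaces them with an exact count.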

In a nutshell, deep refiners make the search experience conversational, interactive and predictable. Users can pivot, slice & dice their search results intuitively and quickly zoom into the information that they require to complete their task.

Property extraction in FAST Search Server 2010 for SharePoint

This section delves into more detail about the property extractors (and related deep refiners) included out-of-the-box with FAST Search Server 2010 for SharePoint, and describes the mechanisms available for extending this capability.

Out-of-the-box property extractors

FAST Search Server 2010 for SharePoint includes property extraction dictionaries for 11 languages3 and 3 kinds of entities:

• Location (enabled in the pipeline by default)
• Company (enabled in the pipeline by default)
• Person (disabled by default; must be manually enabled)

3 Arabic, Dutch, English, French, German, Italian, Japanese, Norwegian, Portuguese, Russian, Spanish.

Figure 9 presents some sample deep refiners available out-of-the-box in FAST Search Server 2010 for SharePoint. The top row displays the refiners populated by the property extractors listed earlier: locations, companies, persons. Although, strictly speaking, the refiners on the bottom row are not generated via property extraction, they are still pertinent to exemplify how metadata enhances the search experience and, ultimately, findability:

• Modified date. FAST Search Server 2010 for SharePoint normalizes date/time formats from different content sources; a deep refiner based on this standardized date/time managed property enables users to consistently constrain the date range of their searches.

• Result types. FAST Search Server 2010 for SharePoint detects and converts over 400 file formats for indexing; a deep refiner based on the resulting managed property enables users to intuitively limit their search to specific kinds of content.

• Language. FAST Search Server 2010 for SharePoint automatically detects over 80 languages in the ingested content; a deep refiner can be defined on the resulting managed property to enable users to filter by language.

Figure 9 – Example deep refiners supported out-of-the-box

As we have seen previously, consistent metadata generated through automated tools drives multiple search features that improve findability. For organizations struggling with the following kinds of issues, these out-of-the-box capabilities alone provide substantial business value in terms of better decision-making and efficiency, as discussed earlier in the paper:

• Large content volumes
• Poor document metadata
• Inadequate search result refinement options
• Low user adoption of enterprise search


Extending property extraction

In addition to the 3 property extractors available out-of-the-box, FAST Search Server 2010 for SharePoint features a built-in extensibility mechanism based on custom dictionaries: the whole words extractor.

A dictionary is basically a list of keywords and phrases to be matched in the text as it is processed in the pipeline. The dictionary may define variations (e.g. alternative spellings, synonyms, aliases or acronyms) to be normalized or mapped to a single entry (e.g. for display purposes). Several such dictionaries may co-exist to address different needs of the business, for example:

• Business and industry-specific concepts
• Customer names and references
• Competitor names, stock tickers and products
• Employee names and aliases
• Project names and codes
• Product names, acronyms and part numbers

In fact, the data required to build and maintain high-quality property extraction dictionaries may be readily available within the organization (e.g. line-of-business applications, databases, XML files) or from external sources. This topic will be revisited in the best practices later (page 20).
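As a sketch of that idea, a custom dictionary can be generated mechanically from a line-of-business export. The CSV columns and module names below are hypothetical, and in practice the rows would come from a database, SharePoint list or XML feed rather than an inline string:

```python
import csv
import io

# Build a whole-words dictionary (variant -> canonical entry) from an
# existing line-of-business export; all data here is invented.
export = io.StringIO(
    "variant,canonical\n"
    "payroll,Payroll Module\n"
    "pay-roll,Payroll Module\n"
    "gl,General Ledger Module\n"
)

dictionary = {row["variant"].lower(): row["canonical"] for row in csv.DictReader(export)}
print(dictionary["pay-roll"])  # Payroll Module
```

Regenerating the dictionary from the authoritative source on a schedule keeps the extraction aligned with the business data without any manual curation.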

Figure 10 – Extending property extraction (dictionaries from e.g. SharePoint lists, XML feeds) to create custom refiners

Property extractors and deep refiners provided with FAST Search Server 2010 for SharePoint deliver immediate value right out of the box; however, the extensibility feature lets you go far beyond that. Figure 10 illustrates the process of creating custom dictionaries from existing enterprise systems or data to make the enterprise search solution speak the language of your own business.

One reason why you might want to do this, looking back at the earlier example scenario of the ERP software company, would be to detect product module names and version numbers mentioned in the actual content if, for example, the knowledge base does not already contain reliable metadata on these two dimensions. A similar scenario is presented in the first case study later (page 24).

Calling external property extractors

Another approach is to use the content processing extensibility mechanism in FAST Search Server 2010 for SharePoint to invoke external content enrichment tools, as illustrated in Figure 11.

Figure 11 – Extending property extraction with text mining and classification (on-premises software or web services in the cloud)

Such tools may already have been deployed and be actively used in other parts of the organization to meet a range of different business needs; the goals they serve can be very diverse and not necessarily related directly to search.

Two distinctive capabilities provided by this kind of tool are:

• Text mining for detecting/tagging entities, facts, relationships and/or sentiment;

• Rule-based or statistical classification against controlled vocabularies, taxonomies or ontologies.

Depending on the enterprise’s size and industry sector, these tools could exist as home-grown solutions, be deployed as third-party software, or be accessed via web services from providers specializing in specific verticals and/or domains. A few options available on the marketplace today are listed ahead, under best practices (page 20). In other circumstances, such a tool can be designed and developed during the implementation of the enterprise search solution itself, as presented in the second case study later in the document (page 25).

Best practices

This section presents four major best practices for a successful deployment of an enterprise search solution, and for its on-going management and development, in order to continually satisfy the business needs of the target end-users.

1. Deepen your understanding of your audiences and your content. This is very important to successfully implement enterprise search projects in general, but specifically to get property extraction right in the first place.

2. Use existing language resources inside and outside your enterprise. High-quality property extraction becomes easier by using existing data sources to build dictionaries, or third-party content enrichment software or web services.


3. Keep the index synchronized with content sources and dictionaries. Systematic processes must be set up so that all property extraction dictionaries and the search index remain in sync with their respective data sources.

4. Distinguish search management from systems management. The focus and skillset for managing the end users’ search experience differ from the technical aspects of running the software. Therefore, a division of responsibilities between IT and the business is very important.

Deepen your understanding of your audiences and your content

As illustrated in Figure 12, almost any organization today has many audiences, or user groups, with different information and content access needs: to participate in business processes, make timely decisions, perform tasks that drive projects to completion – in short, to achieve their goals. Even within the same business unit or department, one or two levels down, user profiles and their respective information needs will vary.

Figure 12 – Each user, or group of users, has different information needs to do their jobs

Therefore, it is very important to start answering the following questions very early on in your project to ensure a successful deployment of enterprise search in general, and property extraction in particular:

End users
• Who are your users and what are their goals/priorities?
• How many users, or user groups/profiles, are targeted?
• Where are they located, geographically and within the organization?
• What content do they require to do their jobs?
• How do they search/use this content (information access patterns)?
• Do they have access to the correct content sources?

Enterprise content
• What content is available across the enterprise?
• How much content is there, structured and unstructured?
• Where is it stored, geographically and by kind of repository?
• How is it accessed by the users?
• What is the quality of the content, including the all-important metadata?

Business problems
• What are the findability pains, current and expected in the future?
• What are the gaps between what users need and what they currently have?
• How can the content be enriched consistently and cost-effectively?
• Which business strategies and processes will be affected, and how?

Use existing language resources inside and outside your enterprise

Adding your own property extractors is a powerful approach to ensure that pertinent business language is reflected in the search solution, making the experience truly conversational and actionable for your end-users. There are many internal assets and external resources that can help you achieve (and maintain) better quality property extraction more efficiently.

Most organizations will already have at their disposal a considerable range of potentially interesting internal sources for building dictionaries, such as thesauri, taxonomies, ontologies, controlled vocabularies, master databases, enterprise systems, and other line-of-business applications and databases. Table 3 lists potential property extraction dictionaries paired with possible internal data sources to consider. This is a cornerstone of making search speak the language of your business.

Table 3 – Potential internal data sources available in the enterprise mapped to hypothetical dictionaries

Dictionary   Sample data sources
Employees    Active Directory (AD), LDAP directories, HR management systems, and so on.
Customers    Customer Relationship Management (CRM), and so on.
Suppliers    Enterprise Resource Planning (ERP), and so on.
Products     Product Lifecycle Management (PLM), ERP, and so on.
Processes    Business Process Modeling (BPM), workflow systems, and so on.
Projects     Enterprise Project Management (EPM), portfolio management, and so on.
Concepts     SharePoint Term Store, taxonomy management, and so on.
Others...    SharePoint Lists, line-of-business databases, XML feeds, and so on.

Before deciding on the sources to use, it is important to assess the quality of the input data for dictionaries, given its direct effect on the value of the property extraction output downstream. Quality parameters to consider include completeness of coverage (local vs. global), terminology normalization (spelling variations and synonyms), supported languages, on-going maintenance costs and data ownership, among others.

Last but not least, subject matter experts across the organization are a convenient and useful internal resource. They are frequently able to provide actual data for high-quality dictionaries (e.g. held in local spreadsheets or other databases), or to contribute to improving existing data sources. If nothing else, they are likely to provide additional pointers in the right direction.

Resources available outside the enterprise can be divided into three main categories, listed below; Table 4 provides specific examples of each.

Internet resources. Valuable lists and databases are available on the World Wide Web, often free but sometimes for a fee. Given the nature of material on the web, quality ranges from very high to poor and unmaintained. Typical sources include government agencies, industry bodies, research institutions, academia and community-driven efforts (such as Wikipedia or DBpedia).


Content providers. Proprietary data licensed from third-parties specializing in specific industry verticals and/or knowledge domains. Examples include company directors, stock tickers, financial instruments, biomedical terminology, legal taxonomies, airport codes, and so on.

Specialized vendors. Content enrichment software deployed locally or delivered remotely via web services, typically marketed by dedicated third-parties (but possibly developed in-house in certain cases). Examples include entity, fact, relationship extraction, as well as taxonomy and ontology classification.

Table 4 – Sample external data sources for dictionaries and text mining/classification tools

Internet resources: Wikipedia.org; DBpedia.org; WordNet, from Princeton University; Medical Subject Headings (MeSH); Library of Congress Subject Headings (LCSH); etc.
Content providers: Reed Business; Wolters Kluwer; Thomson Reuters; Lexis Nexis; Factiva; BoardEx; Dun & Bradstreet; Bureaux van Dijk; etc.
Specialized vendors: Temis (Luxid®); Nstein (TME); Basis Tech (Rosette); Lexalytics (Salience); ClearForest (OpenCalais); Smartlogic (Semaphore); conceptSearching (conceptClassifier); etc.

Keep the index synchronized with content sources and dictionaries

It is unavoidable that the language of the business will change. The external environment, enterprise content, users’ needs and expectations all evolve over time, and the enterprise search solution should adapt in response. Adequate processes have to be in place to ensure that all property extraction dictionaries and the search index are systematically kept in sync with their respective data sources, as exemplified on the timeline in Figure 13.

Figure 13 – Changes in enterprise content and dictionary data sources must be reflected in the search index

Where possible, as a default approach, dictionary upkeep should be integrated into regular business processes and workflows, or automated through scheduled tasks. For instance, custom dictionaries derived from another system should be updated through automatic feeds when changes occur upstream. For example:

• Business concepts dictionary integrated with the SharePoint Term Store;
• Projects dictionary integrated with the EPM system;
• Products dictionary integrated with the PLM system;
• And so on.

It is important that the relevant stakeholders in these processes schedule regular analysis checkpoints throughout the year (e.g. monthly or quarterly) to review and handle exceptional cases.
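One way to automate such upkeep is a scheduled job that fingerprints the source data and only rebuilds the dictionary when something actually changed. This Python sketch is illustrative only; the compile/deploy step is a hypothetical placeholder:

```python
import hashlib
import json

def fingerprint(entries):
    """Stable hash of the dictionary source data."""
    return hashlib.sha256(json.dumps(entries, sort_keys=True).encode()).hexdigest()

def refresh(source_entries, last_fingerprint):
    """Rebuild the dictionary only if the upstream data changed."""
    fp = fingerprint(source_entries)
    changed = fp != last_fingerprint
    if changed:
        # Placeholder: compile and deploy the updated dictionary here,
        # then re-crawl the affected content.
        pass
    return fp, changed

fp1, _ = refresh({"proj-x": "Project X"}, "")
fp2, changed = refresh({"proj-x": "Project X"}, fp1)
print(changed)  # False -- identical source data, nothing to redeploy
```

Running a check like this on a schedule keeps the index aligned with its data sources without redeploying dictionaries needlessly.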

Distinguish search management from systems management

However great the initial implementation of enterprise search is shortly after the go-live date, a proactive and on-going management approach is very important to ensure that the solution remains highly pertinent to the user population as their needs, content and business language change over time. Otherwise, findability and the overall search experience will gradually deteriorate, dragging users’ confidence down with it, as illustrated by the metaphor in Figure 14.

Figure 14 – Risk of solution decay without sufficient search management

Who, then, should be responsible for managing the search experience? It is very important to make a clear distinction between systems management and search management.

Systems management is usually more concerned with the technical aspects of keeping servers running, monitoring system and application logs, analyzing index and query performance, updating software, and so on.

Search management focuses on the actual end-user experience from a business perspective, as suggested by the job profile and sample tasks listed on Table 5.

Table 5 – Job profile and responsibilities for search management

Job profile: skillset of a SharePoint administrator (not a programmer or systems engineer); business perspective and focus; good ability with languages; attention to detail.

Sample tasks: monitor search reports (daily or weekly); run user polls and focus groups (quarterly); process user feedback/questions (ongoing); update dictionaries and manage keywords (as required); support search-related projects.

The key point to keep in mind here is that the skillset required is that of a SharePoint administrator, not the typical profile of a programmer or systems engineer. Consequently, search management is usually not a responsibility that lies with IT alone; in fact, it should be owned by the business.


Staffing levels will obviously vary according to the size and complexity of the enterprise search solution, ranging from a single person working part-time up to a dedicated, distributed team to cater for different geographies and languages.

Case studies

Over the years, several hundred organizations have reaped the benefits of deploying enterprise search technology of the kind found in FAST Search Server 2010 for SharePoint – content processing, property extraction and deep refiners – to build innovative solutions.

With the recent release of FAST Search Server 2010 for SharePoint, Microsoft raises the stakes by democratizing high-end enterprise search capabilities in the marketplace. Two early adopters of this newly released product are showcased later in this white paper. The full case studies are available online:

1. General Mills frees more time for innovation with research-focused search application 2. Mississippi Department of Transportation saves lives with better business insight

General Mills

To stay out front as one of the world’s leading food companies, General Mills strives to move new product ideas out of development and onto supermarket shelves as quickly as possible. Like many large companies, General Mills struggles with multiple content silos, each with a separate interface for managing, searching and accessing information.

To help its scientists and researchers in Innovation, Technology and Quality (ITQ) take full advantage of information and expertise throughout the company, General Mills created a research-focused search application by using Microsoft SharePoint Server 2010 and Microsoft FAST Search Server 2010 for SharePoint. Researchers enjoy faster and more accurate searches, which yields more time for innovation.

Because researchers can perform more comprehensive and accurate searches, they reduce time spent duplicating efforts and benefit from the work of other people. “By using FAST Search Server 2010 for SharePoint, our researchers can refine their searches and find exactly what they are looking for. They spend more time innovating than looking for information.” Michelle Check, R&D Systems Leader, Global Knowledge Services, General Mills.

For example, General Mills created a custom product refiner that enables researchers to narrow search results by particular products. If a researcher performs a search for “high fiber cereals,” he or she can drill down into the result set using specific product name refiners, e.g. Cheerios or Chex cereals. By using refiners in this manner, “FAST Search Server 2010 for SharePoint enables us to search in the language of our business,” Check says.
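The drill-down interaction described above can be modeled in a few lines. This is an illustrative sketch only, with hypothetical data and field names, not the product’s API: FAST Search Server 2010 for SharePoint computes deep refiners inside the search engine over the full result set, while the Python below merely mimics the underlying idea of counting every value of a managed property and then narrowing the results when the user picks a refiner value.

```python
from collections import Counter

def refiner_counts(results, prop):
    """Count how often each value of a managed property occurs in the results."""
    return Counter(r[prop] for r in results if prop in r)

def refine(results, prop, value):
    """Narrow the result set to items whose property matches the chosen value."""
    return [r for r in results if r.get(prop) == value]

# Hypothetical result set for the query "high fiber cereals"
results = [
    {"title": "Fiber study A", "product": "Cheerios"},
    {"title": "Fiber study B", "product": "Chex"},
    {"title": "Fiber study C", "product": "Cheerios"},
]

counts = refiner_counts(results, "product")    # counts shown next to each refiner value
narrowed = refine(results, "product", "Chex")  # user clicks the "Chex" refiner
```

In the real product, refiner configuration and query-time refinement filters are handled declaratively; the sketch only captures the count-then-filter pattern that makes deep refiners feel conversational.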

General Mills was able to meet 85% of its search needs with the out-of-the-box functionality and, given the excellent success achieved quickly, the expansion of this solution companywide is already being planned.



Table 6 summarizes the business problem, approach to the solution and benefits realized by General Mills, looking specifically at the aspects related to property extraction. Please read the full case study4 for additional details.

Table 6 – Overview of the FAST Search Server 2010 for SharePoint deployment at General Mills

Mississippi Department of Transportation (MDOT)

As part of its mission to enhance the safety and efficiency of Mississippi’s highways, the Mississippi Department of Transportation (MDOT) collects large amounts of data about accident trends and roadwork projects. Because this data is spread across multiple sources, MDOT needed a solution that would empower its employees to find and share information more easily so they can discover trends, develop strategies, connect with colleagues, and reduce decision cycles.

MDOT selected Microsoft FAST Search Server 2010 for SharePoint to help people find information across different sources. Now, MDOT can make more effective use of its employees’ knowledge and experience to develop policy and funding priorities that will improve transportation safety. “We are literally reducing decision cycles from days to minutes for hundreds of overlapping decisions a day. By using SharePoint Server 2010, we can make better spending decisions and increase program performance without a very large investment.” John Simpson, Chief Technology Officer, MDOT.

With FAST Search Server 2010 for SharePoint, MDOT was able to customize its solution to make it easy for employees to find the information that they need. By using the advanced content processing capabilities in FAST Search Server 2010 for SharePoint, MDOT was able to incorporate its own unique vocabulary into the search experience. Employees now can refine their search results by project, county, district, dates, folder, document type, and author. Additional custom refiners can easily be added as needed.

4 URL: http://www.microsoft.com/caseStudies/Case_Study_Detail.aspx?casestudyid=4000007255 (General Mills case study)


Business problem

- Various content sources used in ITQ: multiple SharePoint sites, internal applications/databases, a global products database and the web (Bing)
- Researchers forced to search content sources separately
- Low relevancy in existing search applications
- High effort in information discovery tasks
- Growing difficulty in networking between experts as the company grew worldwide

Solution approach

- Search-driven application built with FAST Search Server 2010 for SharePoint to index all internal sources and federate to external sources (global products database and web)
- Default property extractors identify people, companies and locations mentioned in documents
- Property extraction was extended to also recognize product names
- Deep refiners are used on extracted properties to quickly locate information
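The product-name extraction in the solution approach above is, conceptually, dictionary-based term matching. The sketch below is a hedged stand-in with a hypothetical product list: in the actual deployment this is a custom property extraction dictionary configured inside FAST Search Server 2010 for SharePoint, not Python code. It only illustrates the idea of scanning document text for known terms and emitting them as a multi-valued managed property that refiners can then be built on.

```python
import re

# Hypothetical dictionary of product names (configured declaratively in the
# real product, not coded by hand)
PRODUCT_NAMES = ["Cheerios", "Chex", "Wheaties"]

def extract_products(text):
    """Return the distinct product names mentioned in the text, in dictionary order."""
    return [name for name in PRODUCT_NAMES
            if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)]

doc = "Comparing the fiber content of Cheerios and Chex for a new cereal line."
products = extract_products(doc)  # ['Cheerios', 'Chex']
```

Once such values are emitted as a managed property, deep refiners on that property let users narrow results "in the language of the business."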

Benefits and value

- Unified interface for searching all content sources with more relevant results
- Improved user satisfaction and productivity
- Greater sharing and reuse of existing information and knowledge
- Improved networking between experts and communication across product areas
- Proof point to justify the business case for widening the deployment


Instead of relying on a few experts to manage important data, MDOT has democratized access to information across the agency. This makes it easier for people to find specific information or documents and work together to replicate success. “An engineer at one end of the state working on a bridge project can search that bridge design and find another engineer who has worked on the same design,” says Simpson. “All of a sudden, those two are talking, and we are using our human resources far more effectively”.

Table 7 summarizes the business problem, approach to the solution and benefits realized by MDOT, looking specifically at the enterprise search capabilities related to property extraction. Please read the full case study5 for additional details.

Table 7 – Overview of the FAST Search Server 2010 for SharePoint deployment at MDOT

Conclusion

High-quality metadata plays a fundamental role in findability and success with enterprise search, as argued throughout this paper. Yet the reality in most organizations today is that, for various reasons, metadata is often poor: missing, inconsistent or incorrect. This situation is not going to change on its own, as manual metadata creation methods are unreliable, expensive and cannot keep up with the exponential growth in the flow of digital information.

FAST Search Server 2010 for SharePoint includes powerful out-of-the-box capabilities as well as built-in extensibility mechanisms (the content processing pipeline and property extraction) that enable you to enrich content on-the-fly in a scalable and cost-effective manner. Moreover, deep refiners engage users in a dialog that makes the search experience conversational, interactive and predictable, letting them slice and dice search results intuitively and quickly zoom in on the required information.

So join other forward-thinking organizations such as General Mills and the Mississippi Department of Transportation in rolling out compelling search experiences and increased findability with FAST Search Server 2010 for SharePoint, for the benefit of your business and end users.

5 URL: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000007073 (MDOT case study)


Business problem

- Poor access to a large, active collection of paper-based contracts and project documents
- Content stored in image format with no metadata
- High-quality metadata managed in a separate database-based system
- Information silos constrain sharing of data
- Requirements to provide internal and public access

Solution approach

- FAST Search Server 2010 for SharePoint indexes document images using iFilter OCR technology
- Extended content processing pipeline with custom .NET code that merges metadata from the database with documents being indexed
- Deep refiners based on the custom metadata are used to quickly locate information
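The metadata-merge step in the solution approach above can be sketched as follows. This is a hedged illustration only, with hypothetical field names and an in-memory store standing in for MDOT’s database: the real solution is custom .NET code running inside the FAST Search content processing pipeline. The sketch shows just the core idea: look up high-quality metadata by document key and attach it to the crawled item before it is indexed, so deep refiners can be built on the merged properties.

```python
# Hypothetical external metadata store keyed by document id; MDOT's actual
# solution queries its separate database-based records system.
METADATA_DB = {
    "doc-001": {"county": "Hinds", "project": "I-55 Bridge", "route": "I-55"},
}

def merge_metadata(crawled_item, store=METADATA_DB):
    """Return the crawled item enriched with any external metadata for its id."""
    enriched = dict(crawled_item)
    enriched.update(store.get(crawled_item.get("id"), {}))
    return enriched

# A scanned contract arrives from the crawler with OCR text but no metadata;
# after the merge it carries county, project and route properties.
item = {"id": "doc-001", "body": "(OCR text of a scanned contract)"}
indexed = merge_metadata(item)
```

The design point is that enrichment happens on-the-fly at indexing time, so the authoritative metadata system stays the single source of truth while search gains its structure.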

Benefits and value

- Unified self-service interface where users quickly locate information according to their needs (dates, folder, route, county, project, etc.)
- Dramatically shortened document search times from several hours or days to mere seconds or minutes
- Users have more time to focus on higher-value tasks and projects