Unstructured Data in BI


Upload: monaheng-diaho

Post on 16-Apr-2017


Unstructured Data in BI
6th May 2011
by Monaheng Diaho
Study Leader: Dr. Kotze

UNIVERSITEIT VAN DIE VRYSTAAT
UNIVERSITY OF THE FREE STATE
YUNIVESITHI YA FREISTATA

Unstructured data
- Does not reside in relational database tables.
- Has no predefined structure or format.
- Not arranged in any order.
- Difficult to categorise for use in BI.
- Resides in several documents over multiple sources:
  - Internal (data within an organisation).
  - External (data outside the organisation).
- Environmental scanning: scanning for information about events, trends and relationships in a company's outside environment. (Sabherwal & Becerra-Fernandez 2011:85)

Environmental scanning (Sabherwal & Becerra-Fernandez 2011:85)
- Shows how changes in the external environment may impact a company's decision making.
- Predictor of improved organisational performance through monitoring external events.
- Includes seeking/searching and using information.

A two-dimensional model proposed by Daft & Weick (1984) (Sabherwal & Becerra-Fernandez 2011:86):
- Environmental analysability (EA).
- Organisational intrusiveness (OI).

Environmental scanning cont'd
- Undirected viewing mode.
  - Satisfied with limited information.
  - Does not seek comprehensive data.
  - Relies on irregular contacts and information.
- Conditioned viewing mode.
  - Makes use of standard procedures.
  - Relies on significant data from external reports that are widely used in industry.

Environmental scanning cont'd
- Searching mode.
  - Systematically analyses data to produce market forecasts, trend analysis and intelligence reports.
  - Willing to revise and update existing knowledge.
- Enacting mode.
  - Constructs its own environment.
  - Gathers information by trying new behaviour and observing what happens.
  - Experiments, tests and stimulates.
  - Ignores precedent, rules and traditional expectations.
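
The slides list the two dimensions and the four modes but do not spell out how they combine. The sketch below follows the usual reading of the Daft & Weick model, in which each mode corresponds to one combination of environmental analysability and organisational intrusiveness. Treating each dimension as a simple yes/no flag is an assumption made for illustration, and the names `ScanningMode` and `scanning_mode` are hypothetical.

```python
from enum import Enum


class ScanningMode(Enum):
    UNDIRECTED_VIEWING = "undirected viewing"
    CONDITIONED_VIEWING = "conditioned viewing"
    SEARCHING = "searching"
    ENACTING = "enacting"


def scanning_mode(environment_analysable: bool, organisation_active: bool) -> ScanningMode:
    """Map the two dimensions (EA, OI) to a scanning mode.

    Assumption: each dimension is reduced to a boolean. In practice both
    are matters of degree.
    """
    if environment_analysable:
        # Analysable environment: active firms search, passive firms rely on routine reports.
        return ScanningMode.SEARCHING if organisation_active else ScanningMode.CONDITIONED_VIEWING
    # Unanalysable environment: active firms experiment, passive firms view informally.
    return ScanningMode.ENACTING if organisation_active else ScanningMode.UNDIRECTED_VIEWING


print(scanning_mode(environment_analysable=True, organisation_active=False).value)
```

An organisation that regards its environment as analysable but does not actively intrude into it falls into the conditioned viewing mode, which matches the reliance on standard procedures and widely used external reports described above.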

Types of unstructured content (Ferguson 2011:6; McCallum 2005:49; SPSS 2003:3):
- HTML content (e.g. web chat, blogs and web pages).
- Documents (e.g. memos, research papers and articles).
- Forms (e.g. patent applications).
- Emails.
- SMS content.
- Multimedia content (audio, video, images).

Examples of data sources (Ferguson 2011:6):
- Email archives.
- Call center transcripts.
- Customer feedback databases.
- Enterprise intranets.
- Enterprise content management systems.
- File systems.
- Document management systems.
- Social networking sites.
- RSS newsfeeds.

Wittles (n.d.) asserts that:
- 20% of an organisation's data is structured and ready for use in BI data analysis.
- The remaining 80% is unstructured data.
- The significance of unstructured data is underestimated.

The social media effect
- The current main driver in the upsurge of online content is social networks.
- Facebook statistics are used as an example.

[Two figures not reproduced in this transcript. Source: Ferguson (2011:4)]

Social intelligence
- Bringing unstructured data into the decision making process.
- Augmenting structured data to optimise intelligence.

Examples of intelligence
- Brand intelligence
  - Identifying customer complaints or reviews for a product.
- Competitor intelligence
  - Benchmarking marketing campaigns.
- Influencer intelligence
  - Identifying trendsetters.
- Organisational intelligence
  - Managing employee relations.

Examples of intelligence cont'd
- Crime intelligence
  - Fraud detection.
  - Copy detection.
  - Organised crime detection.
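
Copy detection is the easiest of these to illustrate in a few lines. The slides do not name a method; one common approach, assumed here, is to break each text into overlapping word sequences (shingles) and compare the two sets with Jaccard similarity. The function names and the shingle length `k=3` are illustrative choices.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a text (lowercased, whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        # Both texts are shorter than k words, so there is nothing to compare.
        return 0.0
    return len(sa & sb) / len(sa | sb)


original = "the quick brown fox jumps"
suspect = "the quick brown fox sleeps"
print(jaccard(original, suspect))
```

A score close to 1.0 suggests that one text was copied from the other with few changes. What threshold counts as a copy depends on the content and would need to be tuned per use case.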

Untangling unstructured data
- Content analytics (text mining & web mining): the process of analysing semi-structured or unstructured content from one or more sources to derive insight that will be of business benefit. (Ferguson 2011:4)
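
As a minimal sketch of what deriving insight from unstructured text can mean in practice, the example below counts the most frequent terms across a handful of customer comments. The comments, the stop-word list and the function names are invented for illustration; real content analytics tools do far more (entity extraction, sentiment, classification).

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "it", "was", "my", "in", "for"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def top_terms(documents, n=3):
    """Return the n most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(n)


# Hypothetical customer feedback from different sources.
feedback = [
    "The battery died after two days.",
    "Battery life is poor and the screen cracked.",
    "Great screen, but the battery drains fast.",
]

print(top_terms(feedback, 2))  # [('battery', 3), ('screen', 2)]
```

Even this crude count surfaces the battery as the recurring theme, which is the kind of signal that can then be joined with structured sales or returns data.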

Data acquisition
- Uses crawlers, search and indexing technologies to identify, tag and index relevant content.
- Multiple crawlers can be set to crawl in parallel.
- Crawled content can be:
  - Indexed, with the index made available for analysis.
  - Stored in a file system or data store (e.g. Hadoop DFS, MongoDB).
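
The acquisition steps above can be sketched as follows: several crawlers run in parallel, and the content they return is added to an inverted index that maps each term to the sources containing it. This is a toy illustration under stated assumptions. `SOURCES` is a hypothetical in-memory stand-in for real email archives, intranets and feeds, and `crawl` only looks a document up instead of performing network or disk I/O.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical content sources; a real deployment would read from
# mail servers, intranets, RSS feeds and so on.
SOURCES = {
    "email/001": "Customer reports a billing error on the invoice",
    "intranet/faq": "How to dispute a billing charge",
    "feed/news": "Competitor launches new pricing campaign",
}


def crawl(source_id):
    """Fetch one source. A real crawler would do network or disk I/O here."""
    return source_id, SOURCES[source_id]


def build_index(source_ids, workers=3):
    """Crawl sources in parallel and build a term -> {source ids} index."""
    index = defaultdict(set)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Crawling runs in worker threads; indexing happens in this thread,
        # so the index needs no locking.
        for source_id, text in pool.map(crawl, source_ids):
            for term in re.findall(r"[a-z]+", text.lower()):
                index[term].add(source_id)
    return index


index = build_index(SOURCES)
print(sorted(index["billing"]))  # ['email/001', 'intranet/faq']
```

Keeping the crawl step separate from the indexing step mirrors the slide: crawled content could equally be written to a file system or data store first and indexed later.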

[Figure: Text mining system architecture (Feldman & Sanger 2007:17)]

[Figure: High-level view of a text mining application (Ferguson 2011:12)]

Pros & Cons
- Pros
  - Provides deep insight for BI.
  - Quick detection of trends.
- Cons
  - Analytics are industry dependent, because each industry has unique content to utilise.
  - Indexing large content volumes may bog down search engine performance.
  - Content tagging may not be accurate.
  - Crawlers may not detect some content.

Future considerations:
- Ensure that user content is accurately tagged.
- Ensure that content is up-to-date and relevant.
- Validate content sources.
- Identify business drivers to get the best solution.
- For scalability, allocate adequate processing power to analytics.

Possible research opportunities
- Patent violation detection system.
- Questionnaire/interview analysis system.
- CRM content analytics.
- Contextual comparison and assessment.
- Multimedia content detection.

References
- Feldman, R. & Sanger, J. 2007. The text mining handbook: Advanced approaches in analyzing unstructured data. New York: Cambridge University Press.
- Ferguson, M. 2011. Integrating and analysing unstructured data. Info360 BI Conference. Washington DC.
- McCallum, A. 2005. Information extraction. (http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf) Retrieved 17 February 2011.
- Sabherwal, R. & Becerra-Fernandez, I. 2011. Business intelligence: Practices, technologies, and management. New Jersey: John Wiley & Sons, Inc.
- SPSS. 2003. Meeting the challenge for text: Making text ready for predictive analysis. Chicago.
- Wittles, G. n.d. Unstructured data offers a vast store of untapped BI value. (http://www.themanager.org/strategy/Unstructured_data.htm) Retrieved 19 February 2011.
