Business Analytics Chapter 2

8/13/2019 Business Analytics Chapter 2

Types of Digital Data


Digital Data

Unstructured (80%):

Semi-structured:

Structured: organized form, in tables.

According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured. Because data is usually in one of these formats, extracting information from it is difficult.


Structured data is organized in rows and columns in a rigidly defined format so that applications can retrieve and process it efficiently. It is typically stored using a database management system (DBMS).

Data is unstructured if its elements cannot be stored in rows and columns, and is therefore difficult to query and retrieve by business applications.

For example, customer contacts may be stored in various forms such as sticky notes, e-mail messages, business cards, or even digital format files such as .doc, .txt, and .pdf. Due to its unstructured nature, this data is difficult to retrieve using a customer relationship management application.
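The contrast can be illustrated with a minimal Python sketch. The table layout, the contact "Asha Rao", and the free-form note are all hypothetical; the point is only that a structured row can be fetched with one query, while the same fact in a note has to be recovered by fragile pattern matching.

```python
import re
import sqlite3

# Structured: contacts live in a table with a fixed set of columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT, phone TEXT)")
conn.execute(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    ("Asha Rao", "asha@example.com", "555-0101"),
)
# A single declarative query retrieves the field we want.
structured_email = conn.execute(
    "SELECT email FROM contacts WHERE name = ?", ("Asha Rao",)
).fetchone()[0]

# Unstructured: the same fact buried in a free-form note (hypothetical text).
note = "Met Asha Rao at the conference, she said to mail asha@example.com next week."
# There is no column to query, so we fall back to a simple pattern match.
match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", note)
unstructured_email = match.group(0) if match else None

print(structured_email, unstructured_email)
```

The regular expression is a rough approximation of an e-mail address and would miss or mangle many real-world notes, which is the difficulty the slide describes.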


Types of data


In the hospital (GoodLife), data is maintained in a structured way, so anyone can locate the desired information easily.

Comes from sources such as Access, OLTP systems, SQL databases, and spreadsheets.

Fully described datasets.

Clearly defined categories and subcategories.

Data is neatly placed in rows and columns.

Indexing can be done easily.


Characteristics of Structured data


Unstructured Data

The email had not been successfully updated in the medical system database because it was in an unstructured format.

It is difficult to determine the meaning of the data.

Does not follow any rules or semantics.

Can be of any type, so it is unpredictable.

Free-form text without any structure.


Characteristics of Unstructured data


Anything in a non-database form.

Bitmap objects: image, video, and audio files.

Textual objects: Word documents, email.

The body of an email is raw data without any structure.

The email had not been updated into the medical database record.

Noisy text such as chats, emails, and SMS. The language used also differs from normal language.


Sources of unstructured data


How to manage unstructured data

An index in SQL is created on an existing table to retrieve rows quickly. When a table holds thousands of records, retrieving information takes a long time. Indexes are therefore created on frequently accessed columns so that the information can be retrieved quickly. Indexes can be created on a single column or on a group of columns. When an index is created, the values of the indexed columns are sorted and stored together with the ROWID of each row.
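As a rough illustration of the paragraph above, the following sketch uses Python's built-in `sqlite3` module. The `patients` table, its columns, and the index names are hypothetical, and SQLite stands in for whichever DBMS is actually in use. It creates one single-column index and one multi-column index, then compares the query plan before and after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO patients (name, city) VALUES (?, ?)",
    [(f"patient{i}", f"city{i % 50}") for i in range(5000)],
)

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT name FROM patients WHERE city = 'city7'"
plan_before = plan(query)  # no index yet, so the whole table is scanned

# Index on a single, frequently accessed column.
conn.execute("CREATE INDEX idx_patients_city ON patients (city)")
# Index on a group of columns.
conn.execute("CREATE INDEX idx_patients_city_name ON patients (city, name)")
plan_after = plan(query)  # the planner can now use one of the indexes

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, so the sketch only prints it. In SQLite, each index entry pairs the indexed values with the row's rowid, which matches the description above.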

An index is simply an identifier that represents data in a data set.

Indexing is possible for unstructured data.

It can be based on the text or on other attributes such as the filename.

However, indexing unstructured data is difficult because it does not follow any naming conventions.
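One common way to index text is an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal illustration under that assumption, not a method named in the slides; the filenames and document texts are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical unstructured documents, keyed by filename.
documents = {
    "note_01.txt": "Patient reports mild fever and headache.",
    "mail_17.eml": "Please reschedule the fever clinic appointment.",
    "chat_03.log": "headache gone, thanks doc",
}

def build_inverted_index(docs):
    """Map each lowercase word to the set of filenames containing it."""
    index = defaultdict(set)
    for filename, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(filename)
    return index

index = build_inverted_index(documents)

def search(word):
    # Look up by content; .get avoids inserting missing words into the index.
    return sorted(index.get(word.lower(), set()))

def search_by_filename(prefix):
    # Fall back to another attribute: the filename itself.
    return sorted(name for name in documents if name.startswith(prefix))

print(search("fever"))
print(search_by_filename("mail_"))
```

The filename lookup only works here because the sample names happen to follow a pattern. When files follow no naming convention, that second route is unavailable, which is the difficulty noted above.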


Tags /metadata

Using metadata, the data in a document can be tagged, but for unstructured data this is difficult because little or no metadata is available.

The structure of the document cannot be determined because it comes from more than one source and does not have a particular format.


Classification/taxonomy

Taxonomy is the classification of data on the basis of the relationships that exist between data.

Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. Classifying unstructured data is difficult because identifying relationships between data is not an easy task.

CAS (content addressable storage): It stores data based on their metadata.

It assigns a unique identifier to every object stored in it. It is used extensively to store emails.
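A toy sketch of the unique-identifier idea follows. It assumes the identifier is a SHA-256 hash of the object's bytes, which is a common approach but is not stated in the slides. The class name and the sample emails are hypothetical.

```python
import hashlib

class ContentAddressableStore:
    """Toy in-memory store: each object is keyed by the SHA-256 of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The identifier is derived from the content, so identical
        # content always maps to the same identifier (deduplication).
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self):
        return len(self._objects)

store = ContentAddressableStore()
a1 = store.put(b"Subject: lab results\n\nPlease see attached.")
a2 = store.put(b"Subject: lab results\n\nPlease see attached.")  # same email again
a3 = store.put(b"Subject: follow-up\n\nSee you Monday.")

print(a1 == a2, len(store))
```

Because the same email stored twice yields the same identifier, only one copy is kept. That property is one plausible reason such stores suit email archives, where the same message is often held many times.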


Challenges to store

Solution to storage challenges


UIMA: Unstructured Information Management Architecture

Solution for unstructured data.

It is an open source platform from IBM which integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data. UIMA stores information in a structured format.

Various analysis engines analyze unstructured data in different ways, such as:

Breaking up documents.

Grouping and classifying according to a taxonomy.

Detecting parts of speech, grammar, and synonyms.

Detecting events and times.

Detecting relationships between various elements.
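The idea of chaining analysis engines can be sketched in plain Python. This is not the UIMA API. Every function name, the toy taxonomy, and the sample sentence are hypothetical. Each "engine" adds one structured annotation to a shared document record.

```python
import re

def split_sentences(doc):
    # "Breaking up documents": naive split after sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [s.strip() for s in parts if s.strip()]
    return doc

def detect_times(doc):
    # "Detecting events and times": only matches HH:MM clock times.
    doc["times"] = re.findall(r"\b\d{1,2}:\d{2}\b", doc["text"])
    return doc

def classify(doc):
    # "Grouping and classifying": keyword lookup against a toy taxonomy.
    taxonomy = {"appointment": "scheduling", "invoice": "billing"}
    text = doc["text"].lower()
    doc["categories"] = sorted({label for word, label in taxonomy.items() if word in text})
    return doc

def run_pipeline(text, engines):
    # Each engine receives the document record and returns it enriched.
    doc = {"text": text}
    for engine in engines:
        doc = engine(doc)
    return doc

result = run_pipeline(
    "Your appointment is at 10:30. Please bring the invoice!",
    [split_sentences, detect_times, classify],
)
print(result)
```

The output is a dictionary with named fields, which mirrors the point that the results of analysing unstructured text end up stored in a structured form.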


Semi structured data

Only about 10 percent of data in an organization is semi-structured.

Semi-structured data does not conform to any data model.

Data cannot be stored in rows and columns.

Semi-structured data has tags and markers that help group the data and describe how the data is stored, giving some metadata.

However, these are not sufficient for the management and automation of data.

Similar entities are grouped and organized in a hierarchy. The properties or attributes within a group may or may not be the same.
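A small sketch of these characteristics, using JSON as one example of a tagged format. The records and field names are hypothetical. Two entities in the same group carry different attributes, and forcing them into a flat table leaves gaps.

```python
import json

# Hypothetical records: same group ("patients"), different attributes.
raw = """
{
  "patients": [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "phone": "555-0102", "allergies": ["penicillin"]}
  ]
}
"""
data = json.loads(raw)

# The keys act as tags that describe each value.
attribute_sets = [sorted(p.keys()) for p in data["patients"]]

# A flat table would need one column per attribute seen anywhere,
# with gaps (None) wherever a record lacks that attribute.
all_columns = sorted({key for p in data["patients"] for key in p})
rows = [[p.get(col) for col in all_columns] for p in data["patients"]]

print(attribute_sets)
print(all_columns)
print(rows)
```

The `allergies` value is itself a list, which has no natural place in a single table cell. This is one way to see why such data does not fit rows and columns cleanly.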


Characteristics of semi-structured data


How semi structured data is stored

Schemas: used to define the structure of data. The problem with schemas is that requirements are ever-changing, and changes required in the data also lead to changes in the schema.

Graph-based data models: these can be used to describe data. They are self-describing and use a tree-like structure to describe relationships and hierarchies. This is a schema-less approach.

XML: used to store and exchange semi-structured data. It allows the user to define tags to store data in hierarchical form.
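The XML point can be illustrated with Python's standard `xml.etree.ElementTree` module. The tag names, departments, and doctors below are hypothetical and user-defined rather than part of any standard. The sketch parses a small document and walks its tree-like hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tag names are user-defined, not a standard.
xml_text = """
<hospital name="GoodLife">
  <department name="Cardiology">
    <doctor><name>Dr. Mehta</name><email>mehta@example.com</email></doctor>
    <doctor><name>Dr. Iyer</name></doctor>
  </department>
  <department name="Radiology">
    <doctor><name>Dr. Khan</name><pager>4471</pager></doctor>
  </department>
</hospital>
"""
root = ET.fromstring(xml_text)

def walk(element, depth=0):
    """Yield (depth, tag) pairs, exposing the tree-like hierarchy."""
    yield depth, element.tag
    for child in element:
        yield from walk(child, depth + 1)

# Group doctors under their department by following the nesting.
doctors_by_department = {
    dept.get("name"): [doc.findtext("name") for doc in dept.findall("doctor")]
    for dept in root.findall("department")
}
max_depth = max(depth for depth, _ in walk(root))

print(doctors_by_department)
print(max_depth)
```

No schema was declared anywhere. The document describes itself through its tags, and each `doctor` element is free to carry different child elements.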

Challenges in Storage of semi-structured data


Solution for storing


Challenges to extract information.


Difference between structured and semi-structured data
