bussiness analytics chep-2

Upload: rohit-kumar

Post on 04-Jun-2018

TRANSCRIPT

  • 8/13/2019 Bussiness Analytics Chep-2

    Types of Digital Data

    Digital Data

    Unstructured (80%):

    Semi-structured:

    Structured: organized form, in tables

    According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured. Because most data is in these formats, it is difficult to extract information from it.

    Structured data is organized in rows and columns in a rigidly defined format so that applications can retrieve and process it efficiently. It is typically stored using a database management system (DBMS).

    Data is unstructured if its elements cannot be stored in rows and columns; it is therefore difficult for business applications to query and retrieve.

    For example, customer contacts may be stored in various forms such as sticky notes, e-mail messages, business cards, or even digital format files such as .doc, .txt, and .pdf. Due to its unstructured nature, this data is difficult to retrieve using a customer relationship management application.
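
    The contrast above can be sketched in a few lines of Python. This is only an illustration: the contacts table, its column names, the sample people, and the helper functions lookup_email and guess_emails are all hypothetical and do not come from the slides. The structured contact is retrieved with an exact query, while the same kind of fact inside a free-text note has to be guessed with a pattern.

```python
import re
import sqlite3

# Structured: contacts live in a table with a fixed set of columns.
# Table name, columns, and sample rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Asha Rao", "asha@example.com", "555-0101"),
        ("Ben Lee", "ben@example.com", "555-0102"),
    ],
)


def lookup_email(name):
    """Return the e-mail stored for a contact, or None if no row matches."""
    row = conn.execute(
        "SELECT email FROM contacts WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


# Unstructured: a similar fact buried in a free-text note.
note = "Met Ben Lee at the expo, said to write to ben@example.com next week."


def guess_emails(text):
    """Best-effort extraction with a simple pattern; unusual addresses are missed."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
```

    The query names the column it wants, whereas the pattern only guesses where an address might be in the text, which is why the unstructured form is harder for an application to work with.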

    Types of data

    In the hospital (GoodLife), data is maintained in a structured way, so anyone can locate the desired information easily.

    Comes from sources such as Access, OLTP systems, SQL databases, and spreadsheets.

    Fully described datasets.

    Clearly described categories and subcategories.

    Data is neatly placed in rows and columns.

    Indexing can be done easily.

    Characteristics of Structured data

    Unstructured Data

    The email had not been successfully updated in the medical system database because it was in an unstructured format.

    It is difficult to determine the meaning of the data.

    Does not follow any rules or semantics.

    Can be of any type, so it is unpredictable.

    Free-form text without any structure.

    Characteristics of Unstructured data

    Anything in non-database form.

    Bitmap objects: image, video, audio files.

    Textual objects: Word documents, email.

    The body of an email is raw data without any structure.

    The email had not been updated into the medical database record.

    Noisy text such as chats, emails, and SMS. The language used also differs from normal language.

    Sources of unstructured data

    How to manage unstructured data

    An index in SQL is created on an existing table to retrieve rows quickly.

    When there are thousands of records in a table, retrieving information can take a long time. Therefore, indexes are created on columns that are accessed frequently, so that the information can be retrieved quickly.

    Indexes can be created on a single column or a group of columns. When an index is created, the database sorts the indexed column values and stores each one together with the ROWID of the row it came from.

    An index is an identifier that represents data in a data set.

    Indexing is possible in the case of unstructured data.

    It is based on text or on some other attribute, such as the filename.

    Indexing unstructured data is difficult because it does not follow any naming conventions.
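
    As a rough illustration of an index on a frequently accessed column, the sketch below uses Python's built-in sqlite3 module. The patients table, its columns, the generated rows, and the index name idx_patients_last_name are assumptions made for the example. SQLite's EXPLAIN QUERY PLAN is used only to show that the same lookup changes from scanning the table to searching the index.

```python
import sqlite3

# Hypothetical table and data, used only to demonstrate the effect of an index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, last_name TEXT, ward TEXT)"
)
conn.executemany(
    "INSERT INTO patients (last_name, ward) VALUES (?, ?)",
    [(f"name{i:04d}", f"ward{i % 10}") for i in range(5000)],
)


def query_plan(sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT id, ward FROM patients WHERE last_name = ?"

# Without an index, every row has to be examined.
plan_before = query_plan(lookup, ("name0042",))

# Index the column that the lookup filters on.
conn.execute("CREATE INDEX idx_patients_last_name ON patients (last_name)")

# With the index, SQLite can jump to the matching entries.
plan_after = query_plan(lookup, ("name0042",))

result = conn.execute(lookup, ("name0042",)).fetchall()
```

    The rows returned are the same with or without the index; only the access path changes, which is what makes retrieval faster on large tables.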

    Tags/metadata

    Using metadata, the data in a document can be tagged, but in unstructured data this is difficult as little or no metadata is available.

    The structure of the document cannot be determined, as it comes from more than one source and does not have a particular format.

    Classification/taxonomy

    Taxonomy is the classification of data on the basis of the relationships that exist between data.

    Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. Classifying unstructured data is difficult, as identifying relationships between data is not an easy task.

    CAS (content addressable storage): It stores data based on their metadata.

    It assigns a unique identifier to every object stored in it. It is used extensively to store emails.
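
    The idea of giving every stored object a unique identifier can be sketched as follows. The slides do not say how the identifier is derived; this sketch assumes one common approach, hashing the object's content with SHA-256, so that the identifier doubles as the storage address. The ContentStore class and its put and get methods are hypothetical names for illustration, not a real CAS product API.

```python
import hashlib


class ContentStore:
    """Toy in-memory store that addresses objects by a hash of their content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the address is the SHA-256 digest of the content.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Raises KeyError if nothing is stored at this address.
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"Subject: lab results\n\nPlease see attached.")
second = store.put(b"Subject: lab results\n\nPlease see attached.")
```

    Under this assumption, storing the same email twice yields the same address and only one stored copy, which is one reason this style of storage suits archives of fixed content.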

    Challenges to store

    Solution to storage challenges

    UIMA: Unstructured Information Management Architecture

    Solution for unstructured data.

    It is an open source platform from IBM which integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data. UIMA stores information in a structured format.

    Various analysis engines analyze unstructured data in different ways, such as:

    Breaking up of documents.

    Grouping and classifying according to taxonomy.

    Detecting parts of speech, grammar, and synonyms.

    Detecting events and times.

    Detecting relationships between various elements.

    Semi-structured data

    Only about 10 percent of data in an organization is semi-structured.

    Semi-structured data does not conform to any data model.

    The data cannot be stored in rows and columns.

    Semi-structured data has tags and markers which help group the data and describe how the data is stored, giving some metadata.

    But these are not sufficient for the management and automation of data.

    Similar entities are grouped and organized in a hierarchy. The properties or attributes within a group may or may not be the same.
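
    The last point, that entities in the same group need not share the same attributes, can be shown with a small sketch. JSON is used here only as a convenient tagged format and is an assumption of this example, as are the sample records and field names.

```python
import json

# Two records of the same kind ("patient") whose attributes differ.
raw = """
[
  {"type": "patient", "name": "Asha Rao", "ward": "A2", "allergies": ["penicillin"]},
  {"type": "patient", "name": "Ben Lee", "phone": "555-0102"}
]
"""

records = json.loads(raw)

# The keys act as tags: they describe each value, giving some metadata.
key_sets = [set(record) for record in records]

# Attributes present on every record in the group.
shared = set.intersection(*key_sets)

# Attributes that appear on some records but not on all of them.
optional = set.union(*key_sets) - shared
```

    A fixed table would need a column for every optional attribute, most of them empty, which is why this kind of data does not fit neatly into rows and columns.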

    Characteristics of semi-structured data

    How semi-structured data is stored

    Schemas: used to define the structure of data. The problem with schemas is that requirements are ever changing, and changes required in the data also lead to changes in the schema.

    Graph-based data models: these can be used to describe data. They are self-describing and use a tree-like structure to describe relationships and hierarchies. This is a schema-less approach.

    XML: used to store and exchange semi-structured data. It allows the user to define tags to store data in hierarchical form.
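
    A short sketch of user-defined tags storing data hierarchically, parsed with Python's standard xml.etree.ElementTree module. The hospital and patient tags, the sample values, and the describe helper are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# User-defined tags; the second patient has a nested element the first lacks.
document = """
<hospital>
  <patient id="p1">
    <name>Asha Rao</name>
    <ward>A2</ward>
  </patient>
  <patient id="p2">
    <name>Ben Lee</name>
    <contact>
      <phone>555-0102</phone>
    </contact>
  </patient>
</hospital>
""".strip()

root = ET.fromstring(document)


def describe(element, depth=0):
    """Return the tag hierarchy as indented lines, one line per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


# Every patient has a name, but only some have a phone number.
names = [p.findtext("name") for p in root.findall("patient")]
phones = [p.findtext("contact/phone") for p in root.findall("patient")]
```

    Calling describe(root) walks the tree from the root tag down, which reflects the self-describing, tree-like structure mentioned above; a missing element simply comes back as None instead of breaking the layout.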

    Challenges in storage of semi-structured data

    Solution for storing

    Challenges to extract information.

    Difference between structured and semi-structured data