1 Unstructured Data (UD) What is unstructured data? How is it statistically valuable? Challenges of turning UD into information

Upload: benedict-dickerson

Post on 21-Jan-2016


TRANSCRIPT

Page 1

Unstructured Data (UD)

What is unstructured data?

How is it statistically valuable?

Challenges of turning UD into information

Page 2


Unstructured data is:

A way to describe data that is not contained in a traditional database or some other predefined data structure. Unstructured data can be textual or non-textual. Stores built to hold such data are sometimes called “NoSQL” databases.

Page 3

Features of “unstructured” data
- Does not reside in traditional databases
- Does not fit a relational data model
- Generated by both humans and machines
  - Facebook, LinkedIn, etc.
  - Machine-to-machine communication (IP address routing)

Examples include
- Personal messaging – email, instant messages, tweets, chat
- Business documents – business reports, presentations, survey responses
- Web content – web pages, blogs, wikis, audio files, photos, videos
- Sensor output – satellite imagery, geolocation data, scanner transactions (transportation arrivals and departures)

Page 4

The value of unstructured data sources
- Provide a rich source of information about people, households and economies
- May enable the more accurate and timely measurement of a range of demographic, social, economic and environmental phenomena
  - Combined with traditional data sources
  - As a replacement for traditional data sources
- So presents unprecedented opportunities for official statistics to
  - Improve delivery of current statistical outputs
  - Create new information products not possible with traditional data sources
- ABS believes that the benefit should be demonstrated on a case-by-case basis – the improvement of end-to-end statistical outcomes in terms of objective criteria such as accuracy, relevance, consistency, interpretability, timeliness, and cost

Page 5

Content analysis

Unstructured data must be analysed to extract and expose the information it contains

Different types of analysis are possible, such as:
- Entity analysis – people, organisations, objects and events, and the relationships between them
- Topic analysis – topics or themes, and their relative importance
- Sentiment analysis – a person's subjective view of a particular topic
- Feature analysis – inherent characteristics that are significant for a particular analytical perspective (e.g. land coverage in satellite imagery)
- Many others
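As a rough illustration of two of the analysis types listed above, the sketch below counts candidate topic words and scores sentiment with a tiny word list. The sample posts, the stopword list and the positive/negative vocabularies are all invented for this example; real content analysis would rely on much larger lexicons or trained models.

```python
import re
from collections import Counter

# Hypothetical sample texts, invented for illustration only.
POSTS = [
    "The new train timetable is great, arrivals are on time",
    "Train delays again, terrible service this morning",
    "Great weather for the market today",
]

# Toy vocabularies: assumptions for this sketch, not a real lexicon.
STOPWORDS = {"the", "is", "are", "on", "for", "this", "new", "again"}
POSITIVE = {"great", "good"}
NEGATIVE = {"terrible", "bad", "delays"}


def tokens(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def topic_counts(posts):
    """Topic analysis (toy version): count non-stopword terms across posts."""
    counts = Counter()
    for post in posts:
        counts.update(t for t in tokens(post) if t not in STOPWORDS)
    return counts


def sentiment_score(text):
    """Sentiment analysis (toy version): positive hits minus negative hits."""
    words = tokens(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


if __name__ == "__main__":
    print(topic_counts(POSTS).most_common(3))
    for post in POSTS:
        print(sentiment_score(post), post)
```

Word counting and lexicon lookup are only the simplest stand-ins for topic and sentiment analysis; they are used here because they need nothing beyond the standard library.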

Page 6

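Scale: 40 zettabytes [ZB] = 43 980 465 111 040 gigabytes [GB] (binary convention, 1 ZB = 2^40 GB)

1 ZB = 10^21 bytes = 1000 exabytes in decimal (SI) units; in the binary convention used for the figure above, 1 ZB = 2^70 bytes = 1024 exabytes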

About 85% is unstructured data
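The conversion on this slide can be checked with a few lines of arithmetic. The sketch below assumes the binary convention (1 ZB = 2^70 bytes, 1 GB = 2^30 bytes), which is what reproduces the slide's gigabyte figure; the function and constant names are made up for the example.

```python
# Binary-convention unit sizes in bytes (an assumption that matches the slide's figure).
GB_BINARY = 2**30
ZB_BINARY = 2**70

# Decimal (SI) unit sizes in bytes, for comparison.
GB_DECIMAL = 10**9
ZB_DECIMAL = 10**21


def zb_to_gb(zettabytes, zb_size=ZB_BINARY, gb_size=GB_BINARY):
    """Convert a whole number of zettabytes to gigabytes using integer arithmetic."""
    return zettabytes * zb_size // gb_size


binary_gb = zb_to_gb(40)
decimal_gb = zb_to_gb(40, ZB_DECIMAL, GB_DECIMAL)

# Share of the 40 ZB that is unstructured, taking the slide's "about 85%" at face value.
unstructured_zb = 40 * 85 // 100

if __name__ == "__main__":
    print(f"binary:  {binary_gb:,} GB")
    print(f"decimal: {decimal_gb:,} GB")
    print(f"unstructured: about {unstructured_zb} ZB")
```

Under the decimal convention the same 40 ZB comes to 40 000 000 000 000 GB, so the two conventions differ by roughly 10% at this scale.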

Page 7

Big Data

Data sets of such size, complexity and volatility that their business value cannot be fully realised with existing data capture, storage, processing, analysis and management capabilities

The systematic use of unstructured data is the ‘Big Data’ challenge!

Page 8

Some other significant challenges
- Validity of statistical inference
  - Sample biases
  - Model biases
- Privacy and public trust
  - Disclosure threat due to mosaic effect
- Data integrity
  - Missing, inconsistent and inaccurate data
  - Volatile sources
- Data ownership and access
  - Public good versus commercial advantage
  - Value of private sector data
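To make the data integrity point concrete, here is a minimal sketch of the kind of check that flags missing or inconsistent fields before records are used for statistics. The field names (`id`, `timestamp`, `value`) and the sample records are hypothetical, chosen only for the example.

```python
# Hypothetical required fields for an incoming record.
REQUIRED = ("id", "timestamp", "value")


def check_record(record):
    """Return a list of integrity problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED:
        # Treat absent keys, None and empty strings as missing.
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    value = record.get("value")
    # A value that is present but not a number is flagged as inconsistent.
    if value not in (None, "") and not isinstance(value, (int, float)):
        problems.append("value is not numeric")
    return problems


# Invented sample records: one clean, two with problems.
RECORDS = [
    {"id": 1, "timestamp": "2016-01-21T09:00", "value": 3.5},
    {"id": 2, "timestamp": "", "value": "n/a"},
    {"id": 3, "value": 7},
]

if __name__ == "__main__":
    for rec in RECORDS:
        print(rec.get("id"), check_record(rec))
```

Checks like this only catch structural problems; sample bias and model bias need statistical treatment that a per-record check cannot provide.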

Page 9

What are some ways to manage unstructured data?

Page 10


“NoSQL” databases

NoSQL databases are data stores that do not rely on the relational table model and typically do not use SQL as their main query language. However, they have their own ways of structuring data. Some of them are:

Page 11


MongoDB

Hadoop (strictly a distributed storage and processing framework rather than a database)

Cassandra

CouchDB

Hypertable
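To show what "their own ways of structuring data" can look like, the sketch below imitates a document-style store using plain Python dictionaries and JSON. It is not the API of MongoDB, CouchDB or any other product listed above; the `find` helper and the sample documents are invented for illustration.

```python
import json

# Each document carries its own fields; there is no shared table schema.
DOCUMENTS = [
    {"_id": 1, "type": "tweet", "text": "Train on time today", "tags": ["transport"]},
    {"_id": 2, "type": "email", "subject": "Survey response", "attachments": 2},
    {"_id": 3, "type": "sensor", "location": {"lat": -35.28, "lon": 149.13}},
]


def find(docs, **criteria):
    """Return the documents whose top-level fields match every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


if __name__ == "__main__":
    # Documents with different shapes can still be queried by a common field.
    print(find(DOCUMENTS, type="tweet"))
    # Documents serialise naturally to JSON, a common storage format for such stores.
    print(json.dumps(DOCUMENTS[2]))
```

The point of the example is the shape of the data: a tweet, an email and a sensor reading sit side by side even though they share almost no fields, which a fixed relational schema would not allow without many empty columns.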