Introduction to Data Science: Intro, Ch. 1, 2, 3


Upload: hebaahmad

Post on 15-Aug-2015


Data Science

An emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.

Web page

Much of the data in the world is non-numeric and unstructured.

Unstructured means that the data are not arranged in neat rows and columns. Think of a web page.
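
To make the contrast concrete, here is a minimal R sketch that puts the same facts in a structured table and in web-page markup. The names, ages, and HTML string are all invented for illustration, and the tag-stripping pattern is a rough shortcut rather than a real HTML parser.

```r
# Structured: values sit in named columns, one row per record.
# (The names and values here are made up for illustration.)
people <- data.frame(
  name = c("Ann", "Bob"),
  age  = c(34, 27)
)
mean_age <- mean(people$age)   # a column can be used directly

# Unstructured: the same facts buried in web-page markup.
page <- "<html><body><h1>Staff</h1><p>Ann is 34 and Bob is 27.</p></body></html>"

# Strip the tags to get at the plain text, then pull out the numbers.
plain_text <- gsub("<[^>]+>", " ", page)
numbers <- as.numeric(unlist(regmatches(plain_text, gregexpr("[0-9]+", plain_text))))
```

With the table, the ages are available as a column right away; with the web page, some text processing is needed before any calculation can happen.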


Data architecture

Data acquisition

Data analysis

Data archiving

Data architect

Provides input on how the data would need to be routed and organized to support the analysis, visualization, and presentation of the data to the appropriate people.

Data acquisition

Focuses on how the data are collected and, importantly, how the data are represented prior to analysis and presentation.

Tool example: barcode

Different barcodes are used for the same product (for example, for different-sized boxes of cereal).
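
The barcode point can be sketched in R as a lookup problem: several codes have to be mapped back to one product before sales can be totaled. The barcodes, product names, sizes, and scan counts below are hypothetical.

```r
# Hypothetical lookup table: several barcodes map to one product.
barcodes <- data.frame(
  barcode = c("0001", "0002", "0003"),
  product = c("Corn Cereal", "Corn Cereal", "Oat Cereal"),
  size_g  = c(350, 500, 400)
)

# Hypothetical scanner records: one row per scan.
scans <- data.frame(
  barcode = c("0001", "0002", "0002", "0003"),
  units   = c(2, 1, 3, 4)
)

# Attach the product name to each scan, then total units per product.
sales <- merge(scans, barcodes, by = "barcode")
units_per_product <- aggregate(units ~ product, data = sales, FUN = sum)
```

Without the lookup table, the two box sizes of the same cereal would be counted as unrelated items, which is why representation matters before analysis begins.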


Data analysis

Using portions of data (samples) to make inferences about the larger context, and visualizing the data by presenting it in tables, graphs, and even animations.
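
A small R sketch of the sampling idea follows. The "population" is simulated, so every number in it is an assumption made for illustration, not real data.

```r
set.seed(42)  # fix the random draw so the sketch is repeatable

# A made-up "larger context": 10,000 simulated ages.
population <- round(runif(10000, min = 18, max = 90))

# Draw a small random sample and use its mean as an estimate.
ages_sample <- sample(population, size = 200)
estimate <- mean(ages_sample)
truth <- mean(population)

# A table and a graph are two simple ways to present the sample.
age_table <- table(cut(ages_sample, breaks = c(17, 40, 65, 90)))
# hist(ages_sample)  # uncomment in an interactive session to draw a histogram
```

Comparing `estimate` with `truth` shows how close a sample of 200 gets to the full collection in this simulated case.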


Data archiving

Preservation of collected data in a form that makes it highly reusable, or "data curation," is a difficult challenge because it is so hard to anticipate all of the future uses of the data.

Example (Twitter):

Geocodes: data that show the geographical location from which a tweet was sent could be a useful element to store with the data.
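
One way to picture storing geocodes with the data is a table with location columns written to a plain-text file. This is only a sketch: the tweet texts, coordinates, and column names are invented, and CSV is just one possible archive format.

```r
# Made-up tweet records; the column names are illustrative only.
tweets <- data.frame(
  text      = c("Rainy morning", "Great coffee here"),
  latitude  = c(47.61, 40.71),
  longitude = c(-122.33, -74.01)
)

# Write to a plain-text CSV file, a format most tools can read later.
archive_file <- tempfile(fileext = ".csv")
write.csv(tweets, archive_file, row.names = FALSE)

# Reading it back shows the geocodes survive the round trip.
restored <- read.csv(archive_file)
```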


Learning the application domain

Communicating with data users

Seeing the big picture of a complex system

Knowing how data can be represented: metadata

Data transformation and analysis

Visualization and presentation

Attention to quality

Ethical reasoning: privacy

About Data

Data comes from the Latin word "datum," meaning a "thing given."

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.”

Claude Shannon

Yes = 1

No = 0

Maybe = 01

ASCII
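
As a rough illustration of how a message reduces to bits, base R can show the numeric code of a character and its binary pattern. The helper name `to_bits` is made up for this sketch and handles a single basic character only.

```r
# Hypothetical helper: turn a single character into its 8-bit pattern.
to_bits <- function(ch) {
  code <- utf8ToInt(ch)                     # numeric code (ASCII for basic characters)
  bits <- as.integer(intToBits(code))[1:8]  # lowest 8 bits, least significant first
  paste(rev(bits), collapse = "")           # most significant bit first
}

code_R <- utf8ToInt("R")   # 82
bits_R <- to_bits("R")     # "01010010"

# Going the other way: from the numeric code back to the character.
back <- intToUtf8(82)      # "R"
```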


Identifying Data Problems

Data science is an applied activity, and data scientists serve the needs and solve the problems of data users.

Hint:

The data scientist may never actually become a farmer, but if you are going to identify a data problem that a farmer has, you have to learn to think like a farmer, to some degree.

3 questions:

Subject matter experts

Ask about anomalies

Ask about risks and uncertainty

Introduction To R

R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. R is an open source software program. Among other things, it has:

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large, coherent, integrated collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either directly at the computer or on hardcopy.
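
The array and matrix operators listed above can be tried with a tiny example. The matrix values are arbitrary, and the plotting line is left commented out so the sketch runs without a graphics device.

```r
# A small 2 x 2 matrix; R fills matrices column by column.
m <- matrix(1:4, nrow = 2)

elementwise <- m * m      # multiplies matching entries
product     <- m %*% m    # true matrix multiplication
transposed  <- t(m)       # rows become columns

# plot(1:4, c(1, 4, 9, 16))  # uncomment interactively for a basic graph
```

Note that `*` and `%*%` give different results: the first works entry by entry, the second follows the rules of matrix multiplication.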


Additional Pros:

R was among the first analysis programs to integrate capabilities for drawing data directly from the Twitter® social media platform.

The extensibility of R means that new modules are being added all the time by volunteers.

The lessons one learns in working with R are almost universally applicable to other programs and environments.

Cons:

R is "command line" oriented.

R is not especially good at giving feedback or error messages.

How to write a text:

myText <- "this is a piece of text"

Create Data Set:

myFamilyAges <- c(43, 42, 12, 8, 5)

c(): Concatenates data elements together

Assignment arrow: <-

Some mathematical functions:

sum(): Adds data elements

range(): Min value and max value

mean(): The average
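
Put together, the commands from this slide can be run as one short session. The variable names `total`, `extent`, `average`, and `moreAges`, and the extra age of 70, are additions for illustration and do not come from the slides.

```r
myText <- "this is a piece of text"
myFamilyAges <- c(43, 42, 12, 8, 5)

total   <- sum(myFamilyAges)    # 110
extent  <- range(myFamilyAges)  # 5 43
average <- mean(myFamilyAges)   # 22

# c() can also extend an existing vector with a new element.
# The extra value 70 is invented for illustration.
moreAges <- c(myFamilyAges, 70)
```

Typing a variable name on its own at the R prompt, such as `myFamilyAges`, prints its contents.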
