Peter Fox, Data Science – ITEC/CSCI/ERTH-6961-01, Week 10, November 9, 2010: Data Analysis and Visualization


Peter Fox

Data Science – ITEC/CSCI/ERTH-6961-01

Week 10, November 9, 2010

Data Analysis and Visualization

Page 2: 1 Peter Fox Data Science – ITEC/CSCI/ERTH-6961-01 Week 10, November 9, 2010 Data Analysis and Visualization

Reading assignment

• Peirce – why was this on the list?

• Analysis – who read this?

• Visualization – we will discuss these later

• Project definition – how is it going?


Contents

• Preparing for data analysis, completing and presenting results

• Visualization as an information tool

• Visualization as an analysis tool

• New visualization methods (new types of data)

• Managing the output of viz/ analysis

• Enabling discovery

• Use, citation, attribution and reproducibility

• Next week




Types of data


Data types

• Time-based, space-based, image-based, …

• Encoded in different formats

• May need to manipulate the data, e.g.
– In our Data Mining tutorial and conversion to ARFF
– Coordinates
– Units
– Higher order, e.g. derivative, average
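The kinds of manipulation listed above (unit conversion and higher-order quantities such as a derivative or an average) can be sketched in a few lines. This is only an illustration: the function names, the Fahrenheit-to-Celsius example, and the choice of a forward difference are assumptions, not part of the lecture.

```python
def fahrenheit_to_celsius(values):
    """Unit conversion: temperatures in degrees F to degrees C."""
    return [(v - 32.0) * 5.0 / 9.0 for v in values]


def finite_difference(times, values):
    """Higher-order quantity: forward-difference estimate of d(value)/d(time)."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


def moving_average(values, window):
    """Higher-order quantity: simple moving average over `window` samples."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    print(fahrenheit_to_celsius([32.0, 212.0]))
    print(finite_difference([0, 1, 2], [0, 2, 4]))
    print(moving_average([1, 2, 3, 4], 2))
```

Each derived series should carry its own units and metadata, since a derivative or an average is no longer the quantity that was measured.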


Induction or deduction?

• Induction: The development of theories from observation
– Qualitative – usually information-based

• Deduction: The testing/application of theories
– Quantitative – usually numeric, data-based


‘Signal to noise’

• Understanding accuracy and precision
– Accuracy
– Precision

• Affects choices of analysis

• Affects interpretations (GIGO: garbage in, garbage out)

• Leads to data quality and assurance specification

• Signal and noise are context dependent
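One way to make the accuracy/precision distinction concrete is to compare repeated readings against a known reference value. The sketch below assumes the common convention that accuracy is summarized by the bias of the mean and precision by the spread of the readings; the function name and sample numbers are illustrative only.

```python
import statistics


def accuracy_and_precision(readings, reference):
    """Return (bias, spread) for repeated readings of a known reference.

    bias   = mean(readings) - reference  (small bias   -> accurate)
    spread = sample standard deviation   (small spread -> precise)
    """
    bias = statistics.mean(readings) - reference
    spread = statistics.stdev(readings)
    return bias, spread


if __name__ == "__main__":
    # Tightly clustered readings that all sit above the reference:
    # precise but not accurate.
    bias, spread = accuracy_and_precision([10.1, 10.2, 10.3], reference=10.0)
    print(f"bias={bias:.3f}, spread={spread:.3f}")
```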


Other considerations

• Continuous or discrete

• Underlying reference system

• Oh yeah: metadata standards and conventions

• The underlying data structures are important at this stage but there is a tendency to read in partial data
– Why is this a problem?
– How to ameliorate any problems?


Outlier

• An extreme, or atypical, data value (or values) in a sample.

• They should be considered carefully, before exclusion from analysis.

• For example, data values may have been recorded erroneously, in which case they may be corrected.

• However, in other cases they may just be surprisingly different, but not necessarily 'wrong'.
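A minimal sketch of flagging (rather than silently excluding) candidate outliers follows. It assumes the widely used 1.5 × IQR fence rule, which the lecture does not prescribe; the function name and data are illustrative.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The values are only flagged; deciding whether to correct, keep,
    or exclude them is left to the analyst.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    data = [9.8, 10.1, 10.0, 9.9, 10.2, 15.0]
    print(flag_outliers(data))
```

Flagged values still need a human decision: a typo can be corrected, while a surprising but genuine measurement may be the most interesting point in the sample.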


Special values in data

• Fill value

• Error value

• Missing value

• Not-a-number

• Infinity

• Default

• Null

• Rational numbers
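Special values have to be screened out before any statistics are computed, otherwise a fill value such as -9999 will dominate an average. The sketch below is an assumption-laden illustration: the fill value of -9999.0 is hypothetical (the real one should come from the data set's metadata), and missing values are represented here by Python's `None`.

```python
import math

FILL_VALUE = -9999.0  # hypothetical; read the real one from the metadata


def screen_special_values(values, fill_value=FILL_VALUE):
    """Split a sequence into usable numbers and a count of special values.

    Treated as special: None (missing/null), the fill value,
    not-a-number, and +/- infinity.
    """
    kept, dropped = [], 0
    for v in values:
        if v is None or v == fill_value or math.isnan(v) or math.isinf(v):
            dropped += 1
        else:
            kept.append(v)
    return kept, dropped


if __name__ == "__main__":
    raw = [1.0, -9999.0, float("nan"), None, float("inf"), 2.5]
    print(screen_special_values(raw))
```

Reporting the number of dropped values alongside the result keeps the screening step visible to whoever uses the analysis next.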


Errors

• Three main types: personal error, systematic error, and random error

• Personal errors are mistakes on the part of the experimenter. It is your responsibility to make sure that there are no errors in recording data or performing calculations

• Systematic errors tend to shift all measurements of a quantity in the same direction (for instance, all of the measurements are too large), e.g. a calibration error


Errors

• Random errors are also known as statistical uncertainties, and result from a series of small, unknown, and uncontrollable events

• Statistical uncertainties are much easier to assign, because there are rules for estimating the size

• E.g. If you are reading a ruler, the statistical uncertainty is half of the smallest division on the ruler. Even if you are recording a digital readout, the uncertainty is half of the smallest place given. This type of error should always be recorded for any measurement


Standard measures of error

• Absolute deviation
– is simply the difference between an experimentally determined value and the accepted value

• Relative deviation
– is a more meaningful value than the absolute deviation because it accounts for the relative size of the error. The relative percentage deviation is given by the absolute deviation divided by the accepted value and multiplied by 100%

• Standard deviation
– standard definition
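The first two measures above translate directly into code. This is a minimal sketch; the function names and the sample values (a measured 9.6 against an accepted 9.8) are illustrative assumptions.

```python
def absolute_deviation(measured, accepted):
    """Difference between the measured value and the accepted value."""
    return measured - accepted


def relative_deviation_percent(measured, accepted):
    """Absolute deviation divided by the accepted value, times 100%."""
    return absolute_deviation(measured, accepted) / accepted * 100.0


if __name__ == "__main__":
    print(absolute_deviation(9.6, 9.8))
    print(relative_deviation_percent(9.6, 9.8))
```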


Standard deviation

• First, the average value is found by summing the determinations and dividing by the number of determinations. Second, the residuals are found by taking the absolute value of the difference between each determination and the average value. Third, square the residuals and sum them. Last, divide the result by the number of determinations - 1 and take the square root.
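The four steps above can be followed literally in code. The sketch below mirrors them one by one; the function name is an illustrative choice, and the result can be cross-checked against `statistics.stdev` from the Python standard library, which also uses the n - 1 denominator.

```python
import math
import statistics


def sample_standard_deviation(determinations):
    """Standard deviation computed step by step as described above."""
    n = len(determinations)
    # Step 1: average = sum / number of determinations
    average = sum(determinations) / n
    # Step 2: residuals = |determination - average|
    residuals = [abs(x - average) for x in determinations]
    # Step 3: square the residuals and sum them
    sum_of_squares = sum(r * r for r in residuals)
    # Step 4: divide by (n - 1) and take the square root
    return math.sqrt(sum_of_squares / (n - 1))


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(sample_standard_deviation(data), statistics.stdev(data))
```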


Propagating errors

• This is an unfortunate term – it means making sure that the result of the analysis carries with it a calculation (rather than an estimate) of the error

• E.g. if C=A+B (your analysis), then ∂C=∂A+∂B

• E.g. if C=A-B (your analysis), then ∂C=∂A+∂B!

• Exercise – it’s not as simple for other calculations.

• When the function is not merely addition, subtraction, multiplication, or division, the error propagation must be defined by the total derivative of the function.
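For a general function, the total-derivative rule sums each input's uncertainty weighted by the size of the corresponding partial derivative. The sketch below follows the linear (worst-case) sum used in the C=A+B example above; adding terms in quadrature is a common alternative for independent errors and is not shown. The function name, the numerical central-difference derivative, and the step size are assumptions made for illustration.

```python
def propagate_uncertainty(func, values, uncertainties, step=1e-6):
    """Worst-case uncertainty of func(*values).

    Uses the total derivative: sum over inputs of |d(func)/d(x_i)| * u_i,
    with each partial derivative estimated by a central difference.
    """
    total = 0.0
    for i, (v, u) in enumerate(zip(values, uncertainties)):
        upper = list(values)
        lower = list(values)
        upper[i] = v + step
        lower[i] = v - step
        partial = (func(*upper) - func(*lower)) / (2.0 * step)
        total += abs(partial) * abs(u)
    return total


if __name__ == "__main__":
    a, b, da, db = 3.0, 4.0, 0.1, 0.2
    print(propagate_uncertainty(lambda x, y: x + y, (a, b), (da, db)))
    print(propagate_uncertainty(lambda x, y: x - y, (a, b), (da, db)))
    print(propagate_uncertainty(lambda x, y: x * y, (a, b), (da, db)))
```

For the sum and the difference this reproduces the slide's rule (the uncertainties add in both cases); for the product it gives |B|·∂A + |A|·∂B.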


Types of analysis

• Preliminary

• Detailed

• Summary

• Reporting the results and propagating uncertainty

• Qualitative v. quantitative, e.g. see http://hsc.uwe.ac.uk/dataanalysis/index.asp


What is preliminary analysis?

• Self-explanatory…?

• Down sampling…?

• The more measurements that can be made of a quantity, the better the result – Reproducibility is an axiom of science

• When time is involved, e.g. a signal – the ‘sampling theorem’ – having an idea of the hypothesis is useful, e.g. periodic versus aperiodic or other…

• http://en.wikipedia.org/wiki/Nyquist–Shannon_sampling_theorem
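The sampling theorem has a practical consequence for down sampling: a periodic signal sampled at less than twice its frequency shows up at a lower, false (aliased) frequency. The sketch below demonstrates this with a 9 Hz sine sampled at 10 Hz, which is indistinguishable from a 1 Hz sine of opposite sign. The function names and the chosen frequencies are illustrative assumptions.

```python
import math


def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = k / rate for k = 0 .. n_samples - 1."""
    return [
        math.sin(2.0 * math.pi * freq_hz * k / rate_hz)
        for k in range(n_samples)
    ]


def apparent_frequency(freq_hz, rate_hz):
    """Frequency seen after sampling, folded into the band [0, rate/2]."""
    folded = freq_hz % rate_hz
    return min(folded, rate_hz - folded)


if __name__ == "__main__":
    fast = sample_sine(9, 10, 20)   # under-sampled: 10 Hz < 2 * 9 Hz
    slow = sample_sine(1, 10, 20)   # adequately sampled
    worst = max(abs(f + s) for f, s in zip(fast, slow))
    print(apparent_frequency(9, 10), worst)
```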


Detailed analysis

• The most important distinction between the initial and the main analysis is that during initial data analysis one refrains from any analysis aimed at answering the original research question.

• Basic statistics of important variables
– Scatter plots
– Correlations
– Cross-tabulations

• Assessing and dealing with limitations in quality, bias, uncertainty, accuracy, precision

• Dealing with under- or over-sampling

• Filtering, cleaning
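Two of the basic statistics named above, correlations and cross-tabulations, need very little code. The sketch below assumes the Pearson correlation coefficient (the lecture does not say which correlation measure is meant) and represents a cross-tabulation as counts of value pairs; the function names and data are illustrative.

```python
import math
from collections import Counter


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cross_tabulate(rows, columns):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(rows, columns))


if __name__ == "__main__":
    print(pearson_correlation([1, 2, 3], [2, 4, 6]))
    print(cross_tabulate(["a", "a", "b"], ["x", "y", "x"]))
```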


Summary analysis

• Collecting the results and accompanying documentation

• Repeating the analysis (yes, it’s obvious)

• Repeating with a subset

• Assessing significance, e.g. the confusion matrix we used in the supervised classification example for data mining, p-values (the probability, assuming the null hypothesis is true, of a result at least as extreme as the one observed)
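As a reminder of what a confusion matrix holds, the sketch below counts (actual, predicted) label pairs and derives the overall accuracy from the diagonal. The labels and function names are illustrative assumptions and are not taken from the data mining tutorial.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count occurrences of each (actual label, predicted label) pair."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of cases on the diagonal (actual == predicted)."""
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / sum(matrix.values())


if __name__ == "__main__":
    actual = ["cat", "cat", "dog", "dog"]
    predicted = ["cat", "dog", "dog", "dog"]
    matrix = confusion_matrix(actual, predicted)
    print(dict(matrix), accuracy(matrix))
```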


Reporting results/uncertainty

• Consider the number of significant digits in the result, which is indicative of the certainty of the result

• Number of significant digits depends on the measuring equipment you use and the precision of the measuring process - do not report digits beyond what was recorded

• The number of significant digits in a value implies the precision of that value


Reporting results…

• In calculations, it is important to keep enough digits to avoid round-off error.

• In general, keep at least one more digit than is significant in calculations to avoid round-off error

• It is not necessary to round every intermediate result in a series of calculations, but it is very important to round your final result to the correct number of significant digits.


Uncertainty

• Results are usually reported as result ± uncertainty (or error)

• The uncertainty is given to one significant digit, and the result is rounded to that place

• For example, a result might be reported as 12.7 ± 0.4 m/s². A more precise result would be reported as 12.745 ± 0.004 m/s². A result should not be reported as 12.70361 ± 0.2 m/s²

• Units are very important to any result
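The reporting rule above (uncertainty to one significant digit, result rounded to the same place, units attached) can be automated. The following is a sketch under that rule only; the function name and the plain-text unit string are illustrative assumptions.

```python
import math


def format_result(value, uncertainty, unit=""):
    """Format `value ± uncertainty` with the uncertainty at one significant digit."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal place of the leading digit of the uncertainty.
    exponent = math.floor(math.log10(uncertainty))
    rounded_unc = round(uncertainty, -exponent)
    # Rounding can carry into the next place (e.g. 0.96 -> 1.0).
    if rounded_unc >= 10 ** (exponent + 1):
        exponent += 1
        rounded_unc = round(uncertainty, -exponent)
    rounded_val = round(value, -exponent)
    decimals = max(-exponent, 0)
    text = f"{rounded_val:.{decimals}f} ± {rounded_unc:.{decimals}f}"
    return f"{text} {unit}".strip()


if __name__ == "__main__":
    print(format_result(12.70361, 0.2, "m/s^2"))
    print(format_result(12.7449, 0.004, "m/s^2"))
```

Rounding is applied only here, at the final reporting step; intermediate calculations should keep their extra digits.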


Secondary analysis

• Depending on where you are in the data analysis pipeline (i.e. do you know?)

• Having a clear enough awareness of what has been done to the data (either by you or others) prior to the next analysis step is very important – it is very similar to sampling bias

• Read the metadata (or create it) and documentation


Tools

• 4GL
– Matlab
– IDL
– Ferret
– NCL
– Many others

• Statistics
– SPSS
– GNU R

• Excel

• What have you used?


Considerations for viz.

• What is the improvement in the understanding of the data as compared to the situation without visualization?

• Which visualization techniques are suitable for one's data?
– E.g. Are direct volume rendering techniques to be preferred over surface rendering techniques?


Why visualization?

• Reducing amount of data, quantization

• Patterns

• Features

• Events

• Trends

• Irregularities

• Leading to presentation of data, i.e. information products

• Exit points for analysis


Types of visualization

• Color coding (including false color)

• Classification of techniques is based on
– Dimensionality
– Information being sought, i.e. purpose

• Line plots

• Contours

• Surface rendering techniques

• Volume rendering techniques

• Animation techniques

• Non-realistic, including ‘cartoon/artist’ style


Image (aka Raster) file formats

• CGM, the Computer Graphics Metafile, has been an ISO standard since 1987. It has the capability to encompass both graphical and image data.

• PostScript, or more specifically Encapsulated PostScript Format (EPSF), is a page description language with sophisticated text facilities. For graphics, as compared to CGM, it tends to be expensive in terms of storage.


Image file formats

• TIFF, the Tagged Image File Format, encompasses a range of different formats, originally designed for interchange between electronic publishing packages.

• GIF, the Graphics Interchange Format, is quite widespread and can encode a number of separate images of different sizes and colors.


Image file formats

• RGB, the Red Green Blue format of Silicon Graphics, is used by most visualization software packages as the internal image format. The format consists of a header containing the dimensions of the image, followed by the actual image data.

• The image data is stored as a 2D array of tuples. Each tuple is a vector with 3 components: R, G, and B. The RGB components determine the color of every pixel (picture element) in the image.


Image file formats

• PPM, the Portable Pixmap Format (24 bits per pixel), PGM, the Portable Greyscale Format (8 bits per pixel), and PBM, the Portable Bitmap Format (1 bit per pixel), are pixel based and are distributed with the X-Window system (version 11.4).
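The idea of an image as a header (dimensions) followed by a 2D array of (R, G, B) tuples can be made concrete with the plain-text variant of PPM, used here as a simple stand-in; the Silicon Graphics RGB file layout itself is not reproduced. The function names are illustrative, and the size estimate deliberately ignores headers and any row padding.

```python
import math


def to_plain_ppm(pixels, maxval=255):
    """Serialize a 2D list of (R, G, B) tuples as plain-text PPM ("P3")."""
    height = len(pixels)
    width = len(pixels[0])
    header = f"P3\n{width} {height}\n{maxval}\n"
    rows = [
        " ".join(f"{r} {g} {b}" for (r, g, b) in row)
        for row in pixels
    ]
    return header + "\n".join(rows) + "\n"


def raw_size_bytes(width, height, bits_per_pixel):
    """Approximate pixel-data size, ignoring the header and row padding."""
    return math.ceil(width * height * bits_per_pixel / 8)


if __name__ == "__main__":
    image = [[(255, 0, 0), (0, 255, 0)],
             [(0, 0, 255), (255, 255, 255)]]
    print(to_plain_ppm(image))
    for name, bits in (("PPM", 24), ("PGM", 8), ("PBM", 1)):
        print(name, raw_size_bytes(640, 480, bits), "bytes")
```

The size comparison shows why bits per pixel matter: the same 640 × 480 image needs 24 times more raw storage as a pixmap than as a bitmap.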


Image file formats

• XBM is the X-Window one Bit image file format, which has been standardized by the MIT X-consortium.

• A major constraint on the use of images is the large data volume which has to be dealt with.

• Large sets of image data can have severe implications for storage, memory, and transmission costs.

• Therefore, compression techniques are very important.

• There are two categories based on whether or not it is possible to reconstruct the initial picture after compression.


Compression (any format)

• Lossless compression methods are methods for which the original, uncompressed data can be recovered exactly. Examples of this category are Run Length Encoding and the Lempel-Ziv-Welch algorithm.

• Lossy methods - in contrast to lossless compression, the original data cannot be recovered exactly after a lossy compression of the data. An example of this category is the Color Cell Compression method.

• Lossy compression techniques can reach reduction rates of 0.9, whereas lossless compression techniques normally have a maximum reduction rate of 0.5.
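Run Length Encoding, the simplest of the lossless methods named above, replaces each run of identical values with a (value, count) pair. The sketch below works on strings for readability; real image codecs operate on bytes or pixel values, and the function names are illustrative.

```python
def rle_encode(text):
    """Encode a string as a list of (character, run length) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs):
    """Rebuild the original string from (character, run length) pairs."""
    return "".join(ch * count for ch, count in runs)


if __name__ == "__main__":
    original = "aaabccdddd"
    encoded = rle_encode(original)
    print(encoded)
    # Lossless: decoding recovers the original exactly.
    print(rle_decode(encoded) == original)
```

The round trip is what makes the method lossless; how much it saves depends entirely on how long the runs in the data are.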


Vector formats

• PostScript

• PDF

• SVG

• ‘Shape files’

• CGM (also)

• …


Animation formats

• MPEG

• AVI

• QuickTime (QT)

• WMV

• Animated GIF


Remember - metadata

• Many of these formats already contain metadata or fields for metadata; use them!


Tools

• Conversion
– Imtools
– GraphicConverter
– Gnu convert
– Many more

• Combination/Visualization
– IDV
– Matlab
– Gnuplot
– http://disc.sci.gsfc.nasa.gov/giovanni


New modes

• http://www.actoncopenhagen.decc.gov.uk/content/en/embeds/flash/4-degrees-large-map-final

• http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/

• Many modes:
– http://www.siggraph.org/education/materials/HyperVis/domik/folien.html


Periodic table


Publications, web sites

• www.jove.com - Journal of Visualized Experiments

• www.visualizing.org

• data-gov.tw.rpi.edu


Managing visualization products

• The importance of a ‘self-describing’ product

• Visualization products are not just consumed by people

• How many images, graphics files do you have on your computer for which the origin, purpose, use is still known?

• How are these logically organized?


(Class 2) Management

• Creation of logical collections

• Physical data handling

• Interoperability support

• Security support

• Data ownership

• Metadata collection, management and access.

• Persistence

• Knowledge and information discovery

• Data dissemination and publication


Discovery of visualizations

• When represented as images:
– Image-based type free text search?
– Referred to in publications (articles, books, web pages)

• Vector graphics:
– Postscript or PDF
– SVG
– Others?

• What makes this easy or hard or impossible?


Use, citation, attribution

• Think about and implement a way for others (including you) to easily use, cite, attribute any analysis or visualization you develop

• This must include suitable connections to the underlying (aka backbone) data – and note this may not just be the full data set!

• Naming, logical organization, etc. are key

• Make them a resource, e.g. URI/ URL


Producibility/reproducibility

• The documentation around procedures used in the analysis and visualization is very often neglected – DO NOT make this mistake

• Treat this just like a data collection (or generation) exercise

• Follow your management plan

• Despite the lack of, or minimal, metadata/metainformation standards, capture and record it

• Get someone else to verify that it works
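One lightweight way to capture and record how a visualization product was made is a small metadata record stored next to the file. The sketch below is only an illustration: the field names, the example URL, the script name, and the parameters are all hypothetical, and no particular metadata standard is implied.

```python
import hashlib
import json


def describe_product(image_bytes, source_data, script, parameters):
    """Build a minimal provenance record for a visualization product.

    All field names are hypothetical; adapt them to your management plan.
    """
    return {
        "sha256": hashlib.sha256(image_bytes).hexdigest(),  # ties record to file
        "source_data": source_data,   # link to the underlying data (URI/URL)
        "script": script,             # procedure that produced the product
        "parameters": parameters,     # settings needed to reproduce it
    }


if __name__ == "__main__":
    record = describe_product(
        image_bytes=b"fake image bytes",                 # stand-in for a real file
        source_data="http://example.org/data/run42.nc",  # hypothetical URL
        script="plot_sst.py",                            # hypothetical script name
        parameters={"colormap": "viridis"},
    )
    print(json.dumps(record, indent=2))
```

Writing the record as JSON keeps it readable by both people and programs, which matters because visualization products are not consumed by people alone.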


Summary

• Purpose of analysis should drive the type that is conducted

• Many constraints due to prior management of the data

• Become proficient in a variety of methods, tools

• Many considerations around visualization, similar to analysis, many new modes of viz.

• Management of the products is a significant task


What is next

• Next week - Academic basis for Data and Information Science, Data Models, Schema, Markup Languages and Data as Service Paradigms

• Reading
– NITRD report
– OCLC Sustainable Digital Preservation and Access
– National Science Foundation Cyberinfrastructure Plan chapter on Data
– European High-Level group on data