google refine from a business perspective


Page 1: Google refine   from a business perspective

Google Refine Analysis: A Business Perspective

April 8, 2012

Sathishwaran.R - 10BM60079

Vijaya Prabhu - 10BM60097

Vinod Gupta School of Management, IIT Kharagpur

This Tutorial was created using Google Refine Version 2.5 on a Windows 7 platform

Data Cleansing

• Data cleansing is the process of identifying wrong or inaccurate records in a data set and correcting them.

• It involves identifying incomplete, inaccurate, or incorrect parts of the data and then either replacing them with correct data or deleting them.

• Data cleansing produces data that is consistent with other standard data and is useful for performing various kinds of analysis.

• Errors in the data can be due to data entry mistakes by the user, failures during transmission of the data, or improper data definitions.

Need for Data Cleansing

• Incorrect or inaccurate data may lead to false conclusions; in finance, for example, it can cause investments to be misdirected.

• Governments need accurate population and census data to direct funds to the areas that deserve them.

• Many organizations rely on customer information. If that data is inaccurate, for example if an address is wrong, the business runs the risk of sending out wrong information and thereby losing customers.

Challenges in Data Cleansing

• Loss of Information: In many cases a record is incomplete and the whole record may have to be deleted, which leads to loss of information. This becomes costly if a large number of records are deleted.

• Maintenance of Data: Once the data has been cleansed, any later change in the data specification should need to affect only the new values. Data management solutions should therefore be designed so that the data entry and retrieval processes themselves are altered to provide correct data.

• Iterative Process: Data cleansing is an iterative process that needs significant work in exploring and correcting entries.

About Google Refine

• Google Refine is a powerful tool that can be used effectively for data cleansing.

• It helps in working with raw data: cleaning it up, transforming it from one format to another, extending it with web services, and linking it to databases.

• It is easy to use and has a web interface.

• It is freely available and works well with any browser.

• Google Refine is a desktop application that runs a small web server on your system; we point our browser at that server to use Refine.

Getting Started - Installation

1. Download the zip file for your platform (Windows, Mac, or Linux) from http://code.google.com/p/google-refine/wiki/Downloads?tm=2

2. Uncompress the files from the zip file.

3. Run the “google-refine.exe” file.

4. A command window opens and Google Refine starts, taking the user to the home page in the default browser.

Google Refine Homepage

Importing Data

• Google Refine supports TSV, CSV, Excel (.xls and .xlsx), JSON, XML, and Google data document formats.

• Once imported, the data is held in Google Refine’s own data format.

• For this tutorial we have used TSV data on disasters worldwide from 1900 to 2008, available from http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008
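
As a rough illustration of what a TSV import produces, the Python sketch below parses tab-separated text into rows keyed by column name. The column names and the two sample rows are invented for illustration and are not taken from the disasters data set; Google Refine's own importer is not shown here.

```python
import csv
import io

# Hypothetical sample in TSV form; the real data set has its own columns.
SAMPLE_TSV = "Country\tType\tCost\nIndia\tFlood\t1200\nJapan\tEarthquake\t\n"

def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by header name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = read_tsv(SAMPLE_TSV)
for row in rows:
    # Blank cells come through as empty strings.
    print(row["Country"], row["Type"], repr(row["Cost"]))
```

Every cell arrives as text, including numbers and blanks, which is why the later steps in the tutorial deal with types and empty values.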

Creating Project: Data Uploaded

Creating Project: Project Created

Faceting

• Faceting gives a big-picture view of the data and lets you filter down to the rows you want to change in bulk.

• We can create a facet on a column to see a summary of that column's values, and then filter to the subset of rows that satisfy a constraint.

• Text, numeric, timeline, and scatterplot facets are available, and customized facets can also be designed.
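
A text facet can be thought of as a count of rows per distinct value, followed by a filter on the chosen values. The Python sketch below illustrates the idea only; the rows and the function names `text_facet` and `filter_rows` are hypothetical and are not part of Google Refine.

```python
from collections import Counter

def text_facet(rows, column):
    """Count how many rows carry each distinct value in the given column."""
    return Counter(row[column] for row in rows)

def filter_rows(rows, column, selected):
    """Keep only the rows whose value in `column` is one of the selected choices."""
    return [row for row in rows if row[column] in selected]

# Hypothetical rows, not taken from the tutorial's data set.
rows = [
    {"Country": "India", "Type": "Flood"},
    {"Country": "Japan", "Type": "Earthquake"},
    {"Country": "China", "Type": "flood"},
    {"Country": "Chile", "Type": "Earthquake"},
]

facet = text_facet(rows, "Type")
subset = filter_rows(rows, "Type", {"Earthquake"})
print(facet)
print(subset)
```

Note that "Flood" and "flood" are counted as separate choices here, which is the kind of inconsistency a facet makes visible.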

The column Type has 18 unique options

Removing Redundancy

Even though these values are of the same type, they show up as different options because of differences in letter case
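
The effect of case differences on the number of distinct options can be sketched in Python as follows. The slides do not say which transform was applied, so title-casing with whitespace trimming is an assumption here, and the sample values are invented.

```python
from collections import Counter

def normalize_case(value):
    """Trim surrounding whitespace and convert the value to title case."""
    return value.strip().title()

# Hypothetical values for a "Type" column.
types = ["Flood", "flood", "FLOOD", "Earthquake", "earthquake ", "Storm"]

before = Counter(types)
after = Counter(normalize_case(v) for v in types)

print(len(before), "distinct options before")
print(len(after), "distinct options after")
```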

Reduced to 15 unique options

Numeric Faceting

Values are highly clustered towards the low end

Cost column is blank and has no value

Calamities with low cost

Calamities with high cost
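
The numeric facet steps above (a distribution of values, a separate count of blanks, and a selected range) can be approximated with the Python sketch below. The cost values, the bin width, and the range limits are all invented for illustration; this is not how Google Refine computes its histogram.

```python
def numeric_facet(values, bin_width):
    """Group numeric values into fixed-width bins and count blanks separately."""
    bins = {}
    blanks = 0
    for raw in values:
        text = raw.strip()
        if not text:
            blanks += 1
            continue
        number = float(text)
        start = int(number // bin_width) * bin_width  # lower edge of the bin
        bins[start] = bins.get(start, 0) + 1
    return dict(sorted(bins.items())), blanks

def in_range(values, low, high):
    """Return the numeric values v with low <= v < high, skipping blanks."""
    numbers = [float(v) for v in values if v.strip()]
    return [n for n in numbers if low <= n < high]

# Hypothetical cost cells as text, including two blanks.
costs = ["10", "25", "", "40", "90", "5000", "", "15"]

bins, blanks = numeric_facet(costs, 100)
low_cost = in_range(costs, 0, 100)
high_cost = in_range(costs, 1000, 10000)
print(bins, blanks, low_cost, high_cost)
```

Most of the sample values fall into the lowest bin, mirroring the "clustered towards low values" observation, while the blanks are reported separately instead of being treated as zero.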

Clustering

• Clustering is used to merge choices that look similar.
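
The slides do not state which clustering method was used. As one possible illustration, the Python sketch below assumes a key-collision approach: each value is reduced to a "fingerprint" that ignores case, ASCII punctuation, and word order, and values sharing a fingerprint are proposed as one cluster. The sample values and function names are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a key that ignores case, ASCII punctuation, and word order."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct value."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(members) for members in groups.values() if len(members) > 1]

# Hypothetical choices from a text facet.
names = ["Mass movement wet", "mass movement, wet", "Wet Mass Movement", "Flood", "Storm"]

clusters = cluster(names)
print(clusters)
```

Each proposed cluster would then be reviewed and merged into a single chosen spelling, which corresponds to the "Data Merged" result.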

Data Merged

Using Expressions

• Expressions are used to transform existing data to create new data.
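
Google Refine has its own expression language, which is not reproduced here. The Python sketch below only illustrates the general idea of deriving a new column from existing cells; the column names, the sample rows, and the helper `add_column` are assumptions made for this example.

```python
def add_column(rows, new_name, expression):
    """Return new rows with an extra column computed from each existing row."""
    return [{**row, new_name: expression(row)} for row in rows]

# Hypothetical rows; the originals are left unchanged.
rows = [
    {"Start": "1906-04-18", "Country": "united states"},
    {"Start": "2004-12-26", "Country": "indonesia"},
]

# Two example expressions: take the year from a date string, and title-case a name.
with_year = add_column(rows, "Year", lambda row: int(row["Start"][:4]))
cleaned = add_column(with_year, "CountryName", lambda row: row["Country"].title())
print(cleaned)
```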

Data Augmentation

• The Reconciliation option in Google Refine allows data to be linked to web pages. Suppose we want details on the country where a calamity has struck; we can perform the following steps.
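
The sketch below is a stand-in for the idea behind reconciliation, not the reconciliation service itself: it matches a possibly misspelled country name against a small local reference table and returns the matching identifier. The reference table, the identifiers, and the 0.8 similarity cutoff are all hypothetical.

```python
import difflib

# Hypothetical reference table: canonical name -> identifier in some external source.
REFERENCE = {"India": "ref-001", "Indonesia": "ref-002", "Japan": "ref-003"}

def reconcile(name, reference, cutoff=0.8):
    """Return (canonical_name, identifier) for the closest reference entry, or None."""
    candidate = name.strip().title()
    matches = difflib.get_close_matches(candidate, list(reference), n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    return best, reference[best]

results = {name: reconcile(name, REFERENCE) for name in ["india", "Indonesa", "Atlantis"]}
print(results)
```

Names with no sufficiently close reference entry are left unmatched so that they can be reviewed by hand, much as unmatched cells are in the tool.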

Reconciliation

Data Enrichment

Export

How to Use Twitter Data

Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

Step 7

Step 8

Output

Friends Events using Facebook data

• After splitting the cell using separator },{
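
The splitting step can be sketched in Python as follows. The cell content and its field names are invented for illustration; only the separator `},{` comes from the slide.

```python
def split_records(cell, separator="},{"):
    """Split one cell holding several brace-delimited records into separate pieces."""
    # Drop the outermost brackets and braces so every piece looks the same.
    inner = cell.strip().lstrip("[{").rstrip("}]")
    return [piece.strip() for piece in inner.split(separator)]

# Hypothetical cell content; the field names are illustrative only.
cell = '[{"name":"Concert","rsvp":"attending"},{"name":"Quiz","rsvp":"declined"}]'

pieces = split_records(cell)
for piece in pieces:
    print(piece)
```

Plain separator splitting mirrors the step on the slide. When the cell is known to be valid JSON, parsing it with a JSON library would be more robust, since a `},{` sequence inside a text value would otherwise cause a wrong split.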

• After updating the other columns in the same way and rearranging them, we get the list of events.

DISLIKED

LIKED

Thank You