A Data Model, Workflow, and Architecture for Integrating Data David Massart, PhD San Francisco – Feb. 12, 2015

Page 1: A Data Model, Workflow, and Architecture for Integrating Data

A Data Model, Workflow, and Architecture for Integrating Data

David Massart, PhD

San Francisco – Feb. 12, 2015


Who Am I?


Outline

• Data model: Resources, facts, and actions

• Data acquisition workflow

• Data integration and curation

• Views

• Architecture


Resources, Facts, and Actions

• “Resources” are described with “facts” collected during data acquisition “actions”


Resource

• Anything of interest (e.g., product, customer, geographical area)

• Characterized by:

– A type

– An identity


Fact

• Basic property of resources

• Characterized by

– Property name (e.g., weight)

– Value (e.g., 155 pounds)

– Timestamp (e.g., 2015-02-12)


Data Acquisition Action

• Occurs when a tool is used to acquire facts about resources from a data source at a given time

• Characterized by

– Action identifier (e.g., #1)

– Timestamp (e.g., 2015-02-12 09:35:12)

– Tool (e.g., web crawler)

– Data source (e.g., http://census.gov)

• A data source can be a database, a human curator, a website, etc.


Action-Fact Fragment Data Model

{
  "action id": 1,
  "action timestamp": "2015-02-12 19:30:01",
  "tool": "zettadownloader",
  "data source": "http://api.census.gov/data/2012/acs5?get=B25082_001E,B25111_001E,NAME&for=zip+code+tabulation+area:*",
  "resource type": "area",
  "resource id": "94114",
  "fact property": "value",
  "fact value": "8508810400",
  "fact timestamp": "2012-07-01"
}
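As a rough illustration, the fragment above could be loaded into a typed record as follows. This is only a sketch: the `Fragment` class and its attribute names are assumptions derived from the slide's keys (spaces replaced by underscores), not part of the original material.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fragments are never updated once created
class Fragment:
    # Action part (how and when the fact was acquired)
    action_id: int
    action_timestamp: str
    tool: str
    data_source: str
    # Resource part (what the fact is about)
    resource_type: str
    resource_id: str
    # Fact part (the property, its value, and when it held)
    fact_property: str
    fact_value: str
    fact_timestamp: str

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        # The slide's keys contain spaces ("action id"); map them to attribute names.
        return cls(**{key.replace(" ", "_"): value for key, value in raw.items()})


example = Fragment.from_json("""
{
  "action id": 1,
  "action timestamp": "2015-02-12 19:30:01",
  "tool": "zettadownloader",
  "data source": "http://api.census.gov/data/2012/acs5?get=B25082_001E,B25111_001E,NAME&for=zip+code+tabulation+area:*",
  "resource type": "area",
  "resource id": "94114",
  "fact property": "value",
  "fact value": "8508810400",
  "fact timestamp": "2012-07-01"
}
""")
print(example.resource_type, example.resource_id, example.fact_value)
```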


Data Acquisition Workflow


Acquisition & Caching

• Acquire

– Obtain raw data from a data source*

– Turn it into JSON records, if needed

• Cache

– Store raw data (timestamped)

– Makes actions re-playable at any time

* data acquisition action
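The acquire-then-cache step might look roughly like the sketch below. It assumes the raw payload is already JSON and that the cache is a plain directory; `cache_raw`, `replay`, and the file naming scheme are hypothetical names chosen for illustration, and the payload is a made-up sample.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def cache_raw(cache_dir, action_id, raw):
    """Store the raw bytes untouched, tagged with the action id and a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = cache_dir / f"action-{action_id}-{stamp}.raw"
    path.write_bytes(raw)
    return path


def replay(path):
    """Re-read a cached payload so later workflow steps can be re-run on it."""
    return json.loads(path.read_bytes())


# Usage with a made-up payload; a real acquisition would fetch this from the data source.
with tempfile.TemporaryDirectory() as tmp:
    cached = cache_raw(Path(tmp), 1, b'[{"zip": "94114", "population": 32100}]')
    records = replay(cached)
    print(cached.name, records)
```

Keeping the raw bytes rather than the parsed records means identification, normalization, and fragmentation can be redone later without contacting the source again.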


Identification

• Identify the resources described in records

• Can be easy (e.g., zip codes for geographical areas)

• Or very complex (e.g., bibliographical references)


Normalization

• Replace properties and values found in records with identifiers from data dictionaries and controlled vocabularies

• Examples:

– Country name -> ISO 3166-1 (e.g., Belgium -> BE, numeric 056)

– Data source -> Data source ids
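A minimal sketch of normalization, assuming small in-memory lookup tables. The ISO 3166-1 alpha-2 codes are real; the record keys, the `normalize` function, and the data source id are hypothetical and only serve the example.

```python
# Controlled vocabulary: lowercase country name -> ISO 3166-1 alpha-2 code.
COUNTRY_CODES = {"belgium": "BE", "france": "FR", "united states": "US"}

# Hypothetical data dictionary: data source URL -> internal data source id.
SOURCE_IDS = {"http://census.gov": 13}


def normalize(record):
    """Return a copy of the record with known fields replaced by identifiers."""
    normalized = dict(record)
    country = record.get("country")
    if country is not None:
        # Unknown names raise KeyError so they are not silently passed through.
        normalized["country"] = COUNTRY_CODES[country.strip().lower()]
    source = record.get("data source")
    if source is not None:
        normalized["data source"] = SOURCE_IDS[source]
    return normalized


print(normalize({"country": " Belgium", "data source": "http://census.gov"}))
```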


Fragmentation

• Break records into actions and facts

• Store action-fact fragments
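Fragmentation could be sketched as below: one fragment per property of the record, each carrying the full description of the action that produced it. The function name, the record layout, and the "name" value are illustrative assumptions; the fragment keys follow the Action-Fact Fragment example.

```python
def fragment_record(action, record, resource_type, id_field, fact_timestamp):
    """Break one identified, normalized record into action-fact fragments."""
    resource_id = record[id_field]
    fragments = []
    for prop, value in record.items():
        if prop == id_field:
            continue  # the identifier names the resource; it is not a fact about it
        fragments.append({
            **action,  # every fragment repeats the action description
            "resource type": resource_type,
            "resource id": resource_id,
            "fact property": prop,
            "fact value": value,
            "fact timestamp": fact_timestamp,
        })
    return fragments


action = {
    "action id": 1,
    "action timestamp": "2015-02-12 19:30:01",
    "tool": "zettadownloader",
    "data source": "http://census.gov",
}
record = {"zip": "94114", "population": 32100, "name": "sample area"}  # made-up record

for frag in fragment_record(action, record, "area", "zip", "2012-07-01"):
    print(frag["resource id"], frag["fact property"], frag["fact value"])
```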


Data Integration & Curation: Principles

• Data integration and data curation are data acquisition actions

• Allowed:

– Data creations only (i.e., data acquisition actions)

• Not allowed:

– Data deletions*

– Data updates

* Except for legal reasons


Data Integration

• Occurs at the Action-Fact-Fragment level

• Is required when inconsistencies are detected

• E.g., two or more fragments have different values for the same property of the same resource with the same timestamp

aid  atime-stamp  tool  source  rtype  rid    fproperty   fvalue  ftime-stamp
#1   20150124     X     #13     area   94114  population  32100   20120701
#2   20150125     Y     #6      area   94114  population  30100   20120701
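The inconsistency rule stated above (same resource, same property, same fact timestamp, different values) can be checked with a simple grouping. The sketch below uses the two fragments from the table as tuples; the tuple layout and `find_conflicts` are assumptions made for the example.

```python
from collections import defaultdict

# Columns: aid, atimestamp, tool, source, rtype, rid, fproperty, fvalue, ftimestamp
FRAGMENTS = [
    (1, "20150124", "X", 13, "area", "94114", "population", 32100, "20120701"),
    (2, "20150125", "Y", 6, "area", "94114", "population", 30100, "20120701"),
]


def find_conflicts(fragments):
    """Map each (rtype, rid, fproperty, ftimestamp) to its values when they disagree."""
    values_by_fact = defaultdict(set)
    for _aid, _ats, _tool, _src, rtype, rid, prop, value, fts in fragments:
        values_by_fact[(rtype, rid, prop, fts)].add(value)
    return {
        key: sorted(values)
        for key, values in values_by_fact.items()
        if len(values) > 1
    }


conflicts = find_conflicts(FRAGMENTS)
print(conflicts)
```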


Data Integration (cont.)

• Possible resolutions:

– Do nothing

– Select one of the existing values

– Derive a new value from existing ones

– Add the correct value

• Results in the addition of a new fragment

aid  atime-stamp  tool  source  rtype  rid    fproperty   fvalue  ftime-stamp
#1   20150124     X     #13     area   94114  population  32100   20120701
#2   20150125     Y     #6      area   94114  population  30100   20120701
#3   20150212     Z     #101    area   94114  population  31100   20120701
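One of the listed resolutions, deriving a new value from the existing ones, might be sketched as follows. Averaging is an assumed strategy: it happens to reproduce the 31100 of fragment #3, but the slides do not say how that value was obtained. The tuple layout and `resolve_by_mean` are illustrative.

```python
from statistics import mean

# Columns: aid, atimestamp, tool, source, rtype, rid, fproperty, fvalue, ftimestamp
FRAGMENTS = [
    (1, "20150124", "X", 13, "area", "94114", "population", 32100, "20120701"),
    (2, "20150125", "Y", 6, "area", "94114", "population", 30100, "20120701"),
]


def resolve_by_mean(conflicting, action_id, action_timestamp, tool, source):
    """Build a new fragment whose value is the rounded mean of the conflicting values."""
    _aid, _ats, _tool, _src, rtype, rid, prop, _value, fts = conflicting[0]
    derived = round(mean(fragment[7] for fragment in conflicting))
    return (action_id, action_timestamp, tool, source, rtype, rid, prop, derived, fts)


resolution = resolve_by_mean(FRAGMENTS, 3, "20150212", "Z", 101)
store = FRAGMENTS + [resolution]  # append only: fragments #1 and #2 stay untouched
print(resolution)
```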


Data Curation

• Occurs at the Action-Fact-Fragment level

• Consists of adding, updating, or removing a fact about a resource by adding a new fragment

aid  atime-stamp  tool  source  rtype  rid    fproperty   fvalue  ftime-stamp
#1   20150124     W     #24     area   94113  population  15000   20120701
#2   20150125     W     #24     area   94113  population  17000   20120701
#3   20150126     W     #24     area   94113  population  -       20120701
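Reading the table above as add (#1), update (#2), and remove (#3), the current value of a fact could be computed as in this sketch. It assumes that the fragment with the latest action timestamp wins and that "-" marks a removal; both are interpretations of the slide, not stated rules.

```python
REMOVED = "-"  # assumed marker for a removed fact, as in fragment #3

# Columns: aid, atimestamp, tool, source, rtype, rid, fproperty, fvalue, ftimestamp
FRAGMENTS = [
    (1, "20150124", "W", 24, "area", "94113", "population", "15000", "20120701"),
    (2, "20150125", "W", 24, "area", "94113", "population", "17000", "20120701"),
    (3, "20150126", "W", 24, "area", "94113", "population", "-", "20120701"),
]


def current_value(fragments, rtype, rid, prop, fts, as_of=None):
    """Return the value set by the latest action (optionally up to as_of), or None."""
    matching = [
        f for f in fragments
        if (f[4], f[5], f[6], f[8]) == (rtype, rid, prop, fts)
        and (as_of is None or f[1] <= as_of)
    ]
    if not matching:
        return None
    latest = max(matching, key=lambda f: (f[1], f[0]))  # latest action timestamp, then id
    return None if latest[7] == REMOVED else latest[7]


print(current_value(FRAGMENTS, "area", "94113", "population", "20120701", as_of="20150124"))
print(current_value(FRAGMENTS, "area", "94113", "population", "20120701", as_of="20150125"))
print(current_value(FRAGMENTS, "area", "94113", "population", "20120701"))
```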


Views

• Built from fragments

• Application-specific (allows for optimization)

• Read-only and expendable

• Special cases

– State: All facts at a given timestamp

– Fact trending: Evolution of a given fact over time

– Action visualization: All facts generated by a given action
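The three special-case views could be derived from a fragment list along these lines. This is a sketch under assumptions: "state at a given timestamp" is taken to mean the value set by the latest action up to that timestamp, and the function names and the namedtuple layout are hypothetical. The sample fragments are the three rows from the Data Integration example.

```python
from collections import namedtuple

Fragment = namedtuple(
    "Fragment", "aid atimestamp tool source rtype rid fproperty fvalue ftimestamp"
)

FRAGMENTS = [
    Fragment(1, "20150124", "X", 13, "area", "94114", "population", 32100, "20120701"),
    Fragment(2, "20150125", "Y", 6, "area", "94114", "population", 30100, "20120701"),
    Fragment(3, "20150212", "Z", 101, "area", "94114", "population", 31100, "20120701"),
]


def state_view(fragments, as_of):
    """State: for each resource property, the value set by the latest action up to as_of."""
    state = {}
    for f in sorted(fragments, key=lambda f: (f.atimestamp, f.aid)):
        if f.atimestamp <= as_of:
            state[(f.rtype, f.rid, f.fproperty)] = f.fvalue  # later actions overwrite
    return state


def trend_view(fragments, rtype, rid, prop):
    """Fact trending: (fact timestamp, action timestamp, value) in chronological order."""
    rows = [
        (f.ftimestamp, f.atimestamp, f.fvalue)
        for f in fragments
        if (f.rtype, f.rid, f.fproperty) == (rtype, rid, prop)
    ]
    return sorted(rows)


def action_view(fragments, aid):
    """Action visualization: all fragments generated by one action."""
    return [f for f in fragments if f.aid == aid]


print(state_view(FRAGMENTS, "20150131"))
print(trend_view(FRAGMENTS, "area", "94114", "population"))
print(action_view(FRAGMENTS, 3))
```

Because each view is recomputed from the fragments, it can be discarded and rebuilt at any time, which is what makes views expendable.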


Data Collection Architecture


Application Architecture


Conclusion

• Simple

– Fragment data model

• Flexible

– Allows for easily building expendable views

• Scalable

– E.g., using resource ids as sharding key

• Robust

– Any action can easily be cancelled

– Any state can easily be restored
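Two of these claims can be illustrated briefly. The sketch assumes that sharding is done by hashing the resource type and id, and that "cancelling" an action means leaving its fragments out when views are rebuilt; the slides do not specify either mechanism, and both function names are hypothetical.

```python
import hashlib


def shard_for(resource_type, resource_id, shard_count):
    """Pick a shard from a stable hash, so all fragments of a resource land together."""
    digest = hashlib.sha256(f"{resource_type}:{resource_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def without_actions(fragments, cancelled_ids):
    """Cancel actions by excluding their fragments; nothing is deleted from the store."""
    return [f for f in fragments if f["action id"] not in cancelled_ids]


fragments = [
    {"action id": 1, "resource id": "94114", "fact value": 32100},
    {"action id": 2, "resource id": "94114", "fact value": 30100},
]
print(shard_for("area", "94114", 4))
print(without_actions(fragments, {2}))
```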


More details available at http://zettadatanet.wordpress.com

These slides are available at http://www.slideshare.net/dmassart/a-data-model-workflow-and-architecture-for-integrating-data