Effective Audit Trail of Data with PROV-O
TRANSCRIPT
13 June 2017 © COPYRIGHT MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Effective Audit Trail of Data With PROV-O
Scott Henninger, Senior Consultant, MarkLogic
SLIDE: 2
Operationalizing the Metadata
EFFECTIVE AUDIT TRAIL WITH PROV-O
Data Governance: Quality management
Provenance Dimensions: Technical perspective
Provenance Models: Metadata description
SLIDE: 3
Strategy and Execution
§ Shared: Exchange of data between different departments is possible
§ Reliable: Source has competence in the field of interest
§ Accurate: All accounting events are correct in value and description
§ Current: Data is up-to-date for the world it models
DATA GOVERNANCE
DATA LINEAGE
RISK MANAGEMENT
REGULATORY COMPLIANCE
ORGANIZATION
PROCESSES
DATA QUALITY
POLICIES
SLIDE: 4
Information Chain
§ Create: Generate new data entities or update their state
§ Derive: Create value from several contributing data entities
§ Analyze: Inspect data to discover new useful information
§ Report: Submission of summary data as evidence of events
DATA GOVERNANCE
[Diagram: data lifecycle in five numbered stages: INGEST, PREPARE, TRANSFORM, PUBLISH, DELETE]
SLIDE: 5
Provenance Metadata
§ Origin: Proof of the data ownership during the history of the data
§ Timeline: Recorded timestamps of all events the data experienced
§ Process: Transformations that change the data during its lifecycle
DATA GOVERNANCE
UPDATES RESPONSIBILITY
INFORMATION ORIGIN
REPRODUCIBILITY
EVOLUTION
ACCESS
Data provenance documents the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins.
Data lineage includes the data's origins, what happens to it, and where it moves over time.
SLIDE: 6
“If the benefits of provenance are so well understood, why don’t more firms recognize it as a priority?”
What makes it difficult
§ Location: Data is spread across different systems in different organizational silos
§ Ownership: Lack of mature data governance makes the challenge of data lineage even more daunting
§ Spreadsheets: Business processes run outside of data management processes
Unforeseen costs
§ Compliance risk: The business gets exposed to difficult contract negotiations which can incur additional data costs
§ Redundant data activities: Duplicate controls are performed in different departments several times
§ Accuracy of analytics: Impossible to verify why models result in sub-optimal outcomes
SLIDE: 7
Operationalizing the Metadata
EFFECTIVE AUDIT TRAIL WITH PROV-O
Provenance Models: Metadata description
Data Governance: Quality management
Provenance Dimensions: Technical perspective
SLIDE: 8
Metadata Repository
PROVENANCE MODEL
[Diagram labels: ETL*, ETL, ETL]
TRACING PROVENANCE
§ LAZY: Complex technique for reasoning
§ EAGER: Derived directly from output database
SLIDE: 9
Provenance Storage
PROVENANCE MODEL
Envelope Pattern: Provenance stored with data
<envelope>
  <provenance>
    <sem:triple>
      <sem:subject>/doc/id_a12a3.xml</sem:subject>
      <sem:predicate>http://www.w3.org/ns/prov#wasGeneratedBy</sem:predicate>
      <sem:object>/xform2016-07-20</sem:object>
    </sem:triple>
    <sem:triple>
      <sem:subject>/CanonicalTransform2016-07-20</sem:subject>
      <sem:predicate>http://www.w3.org/ns/prov#endedAtTime</sem:predicate>
      <sem:object datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-07-20T12:01:42.987</sem:object>
    </sem:triple>
    ...
  </provenance>
  <content>
    <doc-id>a12a3</doc-id>
    <workflowStatus>Draft</workflowStatus>
    <version>2.3</version>
    ...
  </content>
</envelope>
Separate Database: Large provenance payloads stored with reference to data
Provenance Database
<provenance>
  <sem:triple>
    <sem:subject>/doc/id_a12a3.xml</sem:subject>
    <sem:predicate>wasGeneratedBy</sem:predicate>
    <sem:object>/xform2016-07-2</sem:object>
  </sem:triple>
</provenance>
Content Database (uri: /doc/id_a12a3.xml)
<content>
  <doc-id>a12a3</doc-id>
  <workflowStatus>Draft</workflowStatus>
  <version>2.3</version>
  ...
</content>
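The envelope pattern on this slide can be sketched with Python's standard library. This is only an illustration, not MarkLogic's API: the helper name `wrap_in_envelope` is hypothetical, and the namespace bound to the `sem` prefix is assumed to be MarkLogic's semantics namespace, which the slide does not spell out.

```python
import xml.etree.ElementTree as ET

# Assumption: the slide's `sem` prefix refers to MarkLogic's semantics namespace.
SEM_NS = "http://marklogic.com/semantics"
PROV_NS = "http://www.w3.org/ns/prov#"
ET.register_namespace("sem", SEM_NS)


def wrap_in_envelope(content, subject, predicate, obj):
    """Hypothetical helper: put one provenance triple and the content in one envelope."""
    envelope = ET.Element("envelope")
    provenance = ET.SubElement(envelope, "provenance")
    triple = ET.SubElement(provenance, f"{{{SEM_NS}}}triple")
    ET.SubElement(triple, f"{{{SEM_NS}}}subject").text = subject
    ET.SubElement(triple, f"{{{SEM_NS}}}predicate").text = predicate
    ET.SubElement(triple, f"{{{SEM_NS}}}object").text = obj
    envelope.append(content)  # the data travels together with its provenance
    return envelope


# Rebuild the content document shown on the slide.
content = ET.Element("content")
ET.SubElement(content, "doc-id").text = "a12a3"
ET.SubElement(content, "workflowStatus").text = "Draft"
ET.SubElement(content, "version").text = "2.3"

envelope = wrap_in_envelope(
    content,
    subject="/doc/id_a12a3.xml",
    predicate=PROV_NS + "wasGeneratedBy",
    obj="/xform2016-07-20",
)
envelope_xml = ET.tostring(envelope, encoding="unicode")
print(envelope_xml)
```

With the separate-database variant, the same `provenance` element would be written to its own document and the triple's subject would act as the reference back to the content URI.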
SLIDE: 10
PROV Data Model
PROVENANCE MODEL
§ Entity: a trade, order, document, or other kind of entity, physical, digital or conceptual with some fixed aspects
§ Activity: something that occurs over a period of time and acts upon or with entities, such as creating, consuming, transforming, modifying, etc.
§ Agent: the business line responsible for an activity taking place or for the existence of an entity
[Diagram: PROV core classes and properties]
ENTITY wasDerivedFrom ENTITY
ENTITY wasAttributedTo AGENT
ENTITY wasGeneratedBy ACTIVITY
ACTIVITY used ENTITY
ACTIVITY wasAssociatedWith AGENT
ACTIVITY startedAtTime xs:dateTime
ACTIVITY endedAtTime xs:dateTime
W3C standard, circa 2013: https://www.w3.org/TR/prov-o/
SLIDE: 11
Encoding Specification
PROVENANCE MODEL
XML
<prov:document
    xmlns:prov="http://www.w3.org/ns/prov#"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ns/ex#">
  <prov:entity prov:id="ex:e1">
    <prov:type xsi:type="xsd:string">approval</prov:type>
  </prov:entity>
  <prov:activity prov:id="ex:a1">
    <prov:type xsi:type="xsd:QName">Editing</prov:type>
  </prov:activity>
</prov:document>
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.com/> .
:geneSequencing
    a prov:Activity ;
    prov:startedAtTime "2012-04-25T01:30:00Z"^^xsd:dateTime ;
    prov:used :drosophilaSample-84 ;
    prov:wasAssociatedWith :lab-technician-GH-32 .
:drosophilaSample-84
    a prov:Entity ;
    prov:wasAttributedTo :lab-technician-FE-56 .
:lab-technician-GH-32 a prov:Agent .
PROV-XML
• Types and elements are reusable
PROV-O
• Reason on provenance data
• Specialized properties
• Model-based extensions of the standard
SLIDE: 12
Operationalizing the Metadata
EFFECTIVE AUDIT TRAIL WITH PROV-O
Provenance Dimensions: Technical perspective
Provenance Models: Metadata description
Data Governance: Quality management
SLIDE: 13
Content, Use, and Management
PROVENANCE DIMENSIONS
CONTENT
MANAGEMENT
USE
▪ Not mutually exclusive
▪ Who endorses the information
▪ How a decision is made
▪ User consumption of provenance
▪ What was considered for that decision
SLIDE: 14
Content ▪ Use ▪ Management
PROVENANCE DIMENSIONS
CONTENT
USE
MANAGEMENT
Scenario: An investment bank is implementing new regulatory reporting defined by the CFTC. The reports will provide more information on its trading activities (an extended range of financial products and data fields) in a shorter time frame (near real-time publication), with more complex rules determining who has the obligation to deliver the information.
SLIDE: 15
Attribution
PROVENANCE CONTENT
§ What the provenance is about
§ Sources used to create a new result
§ Process that yielded the artifact
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.com#> .
:TransactionReport a prov:Entity ;
    prov:generatedAtTime "2017-04-12T12:12:12"^^xsd:dateTime ;
    prov:wasDerivedFrom :TransactionA ;
    prov:wasGeneratedBy :ReportGen .
:ReportGen a prov:Activity ;
    prov:used :TransactionA ;
    prov:used :Venue1 ;
    prov:wasAssociatedWith :Msma .
:TransactionA a prov:Entity ;
    prov:wasAttributedTo :Murex .
:Venue1 a prov:Entity .
:Murex a prov:Agent, prov:SoftwareAgent .
:Msma a prov:Agent, prov:Organization .
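To make the three attribution questions concrete, the slide's statements can be held as plain tuples and walked in Python. This is a sketch under simplifying assumptions: prefixed names are kept as strings rather than expanded IRIs, no RDF library is used, and the helpers `objects` and `explain` are hypothetical names.

```python
# Statements from the slide, with prefixed names kept as plain strings.
TRIPLES = {
    (":TransactionReport", "prov:wasDerivedFrom", ":TransactionA"),
    (":TransactionReport", "prov:wasGeneratedBy", ":ReportGen"),
    (":ReportGen", "prov:used", ":TransactionA"),
    (":ReportGen", "prov:used", ":Venue1"),
    (":ReportGen", "prov:wasAssociatedWith", ":Msma"),
    (":TransactionA", "prov:wasAttributedTo", ":Murex"),
}


def objects(subject, predicate):
    """All objects of (subject, predicate, ?) in a stable order."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def explain(entity):
    """Hypothetical helper: the process, sources, and agents behind an entity."""
    activities = objects(entity, "prov:wasGeneratedBy")
    sources = sorted({src for act in activities for src in objects(act, "prov:used")})
    responsible = sorted(
        {agent for act in activities for agent in objects(act, "prov:wasAssociatedWith")}
    )
    source_owners = sorted(
        {agent for src in sources for agent in objects(src, "prov:wasAttributedTo")}
    )
    return {
        "process": activities,
        "sources": sources,
        "responsible": responsible,
        "source_owners": source_owners,
    }


summary = explain(":TransactionReport")
print(summary)
```

In a triple store the same walk would normally be a SPARQL query; the in-memory version only shows which properties answer which question.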
SLIDE: 16
Evolution
PROVENANCE CONTENT
§ Amendments are incorporated in the trade
§ Different aspects of the same trade linked together
[Diagram labels: ORIGINAL VERSION, AMENDED VERSION]
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.com#> .
:Transaction1 a prov:Entity .
:Transaction2 a prov:Entity ;
    prov:wasRevisionOf :Transaction1 .
:TransactionReport1 a prov:Entity ;
    prov:wasDerivedFrom :Transaction1 .
:TransactionReport2 a prov:Entity ;
    prov:wasDerivedFrom :Transaction2 .
:PostTradeReport a prov:Entity ;
    prov:generatedAtTime "2017-04-12T12:12:14"^^xsd:dateTime ;
    prov:wasDerivedFrom :Transaction2 ;
    prov:alternateOf :TransactionReport2 .
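The revision links on this slide can be followed programmatically. The sketch below assumes each entity has at most one `prov:wasRevisionOf` and one `prov:wasDerivedFrom` link, as in the slide's example; the function names are hypothetical.

```python
# prov:wasRevisionOf links (newer -> older) and prov:wasDerivedFrom links from the slide.
REVISION_OF = {":Transaction2": ":Transaction1"}
DERIVED_FROM = {
    ":TransactionReport1": ":Transaction1",
    ":TransactionReport2": ":Transaction2",
    ":PostTradeReport": ":Transaction2",
}


def revision_history(entity):
    """Walk wasRevisionOf links from the given version back to the original."""
    history = [entity]
    seen = {entity}
    while history[-1] in REVISION_OF:
        previous = REVISION_OF[history[-1]]
        if previous in seen:  # guard against accidental cycles in the data
            break
        history.append(previous)
        seen.add(previous)
    return history


def reports_on_superseded_versions():
    """Reports derived from a trade version that has since been revised."""
    superseded = set(REVISION_OF.values())
    return sorted(report for report, source in DERIVED_FROM.items() if source in superseded)


print(revision_history(":Transaction2"))
print(reports_on_superseded_versions())
```

The second function is the kind of check an amendment would trigger: any report still based on the original version is a candidate for correction.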
SLIDE: 17
Bitemporal Timelines
PROVENANCE CONTENT
§ Two temporal axes maintain the business (valid) time and the system time
[Diagram: timeline marked NOV 19, NOV 20, and NOV 21, with a LAG between WHEN THE EVENT OCCURRED (Valid Time) and WHEN IT WAS RECORDED (System Time)]
{ "transaction":
{
"system-start": "2014-11-19T11:00:00",
"valid-start": "2014-11-21T12:00:00",
"trader": "12XL9A",
"price": 12
}
}
{ "trader":
{
"system-start": "2014-11-19T11:00:00",
"valid-start": "2014-11-20T12:00:00",
"id": "12XL9A",
"name": "John"
}
}
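The two axes in these records can be compared directly. The following sketch uses only the standard library and is not MarkLogic's temporal API; `axis_gap` and `known_and_valid` are hypothetical helpers, and because the slide shows no end timestamps, both periods are treated as open-ended.

```python
from datetime import datetime

# The two records from the slide, flattened to their inner objects.
transaction = {
    "system-start": "2014-11-19T11:00:00",
    "valid-start": "2014-11-21T12:00:00",
    "trader": "12XL9A",
    "price": 12,
}
trader = {
    "system-start": "2014-11-19T11:00:00",
    "valid-start": "2014-11-20T12:00:00",
    "id": "12XL9A",
    "name": "John",
}


def axis_gap(record):
    """Valid time minus system time: how far apart the two axes are for a record."""
    valid = datetime.fromisoformat(record["valid-start"])
    system = datetime.fromisoformat(record["system-start"])
    return valid - system


def known_and_valid(record, system_at, valid_at):
    """True if the record was already recorded at system_at and in effect at valid_at.

    Assumption: periods are open-ended, since the slide gives no end timestamps.
    """
    recorded = datetime.fromisoformat(record["system-start"]) <= system_at
    in_effect = datetime.fromisoformat(record["valid-start"]) <= valid_at
    return recorded and in_effect


print(axis_gap(transaction))
print(axis_gap(trader))
```

Asking `known_and_valid` with two different timestamps is what distinguishes a bitemporal query from a plain "as of" lookup: one argument moves along the system axis, the other along the valid axis.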
SLIDE: 18
Content
PROVENANCE DIMENSIONS
[Diagram: provenance graph. Nodes: Trade v1, Trade v2, Transaction Report, Post Trade Report, Ingest, Reporting, Feedback, Murex, Transaction System, Software Agent. Edge labels: wasDerivedFrom, alternateOf, wasRevisionOf, wasAttributedTo, used, generated, wasInfluencedBy, wasInvalidatedBy, generatedAtTime, receivedAtTime, value. Timestamps shown: 2017-04-24T12:12:12 and 2017-04-20T10:22:12]
SLIDE: 19
Understanding
PROVENANCE USE
§ … what trading and reference data were used to generate this transaction report …
§ … why is there a difference between the transaction report and the post trade report …
§ … were there any changes in reference data at the time the correction was sent …
DATA STEWARD
DATA QUALITY
SLIDE: 20
Compliance
PROVENANCE USE
§ … which department provided the trade data and when was the booking done …
§ … what transactions were not reported in the time required, and for what reasons …
§ … are there any transactions that should have been reported under new versions of the rules …
§ … have traders complied with the rules …
COMPLIANCE OFFICER
REGULATORY COMPLIANCE
BUSINESS DEPARTMENT
PENALTY
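A question such as "what transactions were not reported in the time required" reduces to comparing provenance timestamps. The sketch below is illustrative only: the 15-minute window, the trade identifiers, and the sample timestamps are all hypothetical, and the real limit depends on the applicable rule.

```python
from datetime import datetime, timedelta

# Hypothetical reporting window, chosen only for this illustration.
REPORTING_WINDOW = timedelta(minutes=15)

# Illustrative records: (trade id, time the booking ended, time the report was generated).
# A missing report is represented by None.
RECORDS = [
    ("T1", datetime(2017, 4, 12, 12, 0, 0), datetime(2017, 4, 12, 12, 12, 12)),
    ("T2", datetime(2017, 4, 12, 12, 0, 0), datetime(2017, 4, 12, 12, 40, 0)),
    ("T3", datetime(2017, 4, 12, 13, 0, 0), None),
]


def late_or_missing(records, window=REPORTING_WINDOW):
    """Return (trade id, reason) for every trade that missed the reporting window."""
    findings = []
    for trade_id, booked_at, reported_at in records:
        if reported_at is None:
            findings.append((trade_id, "not reported"))
        elif reported_at - booked_at > window:
            findings.append((trade_id, "reported late"))
    return findings


print(late_or_missing(RECORDS))
```

In practice the two timestamps would come from provenance properties such as `prov:endedAtTime` on the booking activity and `prov:generatedAtTime` on the report entity.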
SLIDE: 21
Debugging
PROVENANCE USE
§ … where did an error occur in a specific data field …
§ … was the notification for the post trade publisher sent for that specific trade …
§ … what version of the data extraction rules was used when the transaction report was created …
§ … what percentage of the daily volume are reportable transactions …
IT OPERATIONS
OPERATIONAL DATA STORE
APPROVED PUBLICATION ARRANGEMENT
SLIDE: 22
Trusting Data Sources
PROVENANCE USE
§ Forward-looking provenance
  - Anticipating problems given provenance information from other systems
§ Analysis may find some sources/transformations/ETLs are troublesome
  - … sometimes in specific contexts, such as high load rates, etc.
§ Look for alternatives when designing future efforts
§ Target troublesome processes for future refactoring efforts
SLIDE: 23
Publication
PROVENANCE MANAGEMENT
HOW TO CONSUME PROVENANCE?
§ Access
§ Locate
§ Query
[Diagram labels: RESOURCES, CONTENT, PROVENANCE, LINK, ACCESS, XML, SEARCH, BROWSE, TARGET URI, PROVENANCE URI]
rdf - SPARQL
html - HTTP
xml - HTTP
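One way to locate the provenance URI for a target URI is an HTTP `Link` header. The sketch below assumes the mechanism described in the W3C PROV-AQ note (a link relation of `http://www.w3.org/ns/prov#has_provenance` with an optional `anchor`); the slide itself does not say which mechanism was used. The parser is deliberately simplified, handles only quoted parameters, and `provenance_links` is a hypothetical name.

```python
import re

# Link relation defined by the W3C PROV-AQ note (assumed here, not stated on the slide).
HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"

LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def provenance_links(link_header):
    """Return (provenance URI, target URI or None) pairs from a Link header value.

    Simplified: only quoted parameter values and single-valued rel are recognised.
    """
    results = []
    for match in LINK_RE.finditer(link_header):
        uri, raw_params = match.groups()
        params = dict(PARAM_RE.findall(raw_params))
        if params.get("rel") == HAS_PROVENANCE:
            results.append((uri, params.get("anchor")))
    return results


# Illustrative header value with example.com URIs.
header = (
    '<http://example.com/prov/doc1>; '
    'rel="http://www.w3.org/ns/prov#has_provenance"; '
    'anchor="http://example.com/doc1", '
    '<http://example.com/style.css>; rel="stylesheet"'
)
print(provenance_links(header))
```

Once the provenance URI is known, it can be fetched over HTTP as RDF, XML, or HTML, or the provenance can be queried through a SPARQL endpoint.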
SLIDE: 24
Dissemination
PROVENANCE MANAGEMENT
§ Security: secure HTTP should be used across unsecured networks; authentication should be enforced
§ Access control: provenance information should follow the same access control rules as the resources
§ Bundle: care is needed to ensure that the integrity of provenance is maintained
PRIVACY WALL
HTTPS
Provenance discovery
Provenance of provenance
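The integrity concern for bundles can be illustrated with a content digest. This is a sketch under stated assumptions, not a mechanism named on the slide: the bundle is modelled as a list of triples, canonicalised by sorting and JSON-encoding, and hashed with SHA-256. The names `bundle_digest` and `is_intact` are hypothetical.

```python
import hashlib
import hmac
import json


def bundle_digest(triples):
    """SHA-256 over a canonical (sorted, compact JSON) serialisation of the bundle."""
    canonical = json.dumps(sorted(triples), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_intact(triples, expected_digest):
    """Compare against a digest recorded when the bundle was published."""
    return hmac.compare_digest(bundle_digest(triples), expected_digest)


# Illustrative bundle using statement names from the attribution example.
BUNDLE = [
    (":TransactionReport", "prov:wasGeneratedBy", ":ReportGen"),
    (":ReportGen", "prov:used", ":TransactionA"),
]
recorded = bundle_digest(BUNDLE)

# Reordering the statements does not change the digest; altering one does.
print(is_intact(list(reversed(BUNDLE)), recorded))
tampered = [BUNDLE[0], (":ReportGen", "prov:used", ":TransactionB")]
print(is_intact(tampered, recorded))
```

A digest only detects changes; proving who published the bundle would additionally need a signature or a trusted place to store the recorded digest.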
SLIDE: 25
PROVENANCE DIMENSIONS
USE
CONTENT
MANAGEMENT
PUBLICATION
ACCESS
QUERY
SEMANTIC
XML
DOCUMENT
UNDERSTANDING
DEBUGGING
COMPLIANCE
FORWARD-LOOKING
SLIDE: 26
What makes it easy
DATA LINEAGE
Solid compliance architecture
ALL of your data and metadata
Complete track of data changes
Full query composability
Security, publishing, monitoring, etc.
Fewer tools and processes to manage
Questions?