Post on 18-May-2018
Moving Beyond the Data (www.d-Wise.com)
Crossing the Chasm: Simplifying Data Management with Perl and Metadata
d-Wise Technologies
Stephen Baker
Manager, Strategy and Business Development
Overview
• Defining the chasm: The flow of data through the clinical organization
• Examining a framework
• Technology Considerations
• Summary
Defining the chasm
Receive Data from external vendors
Perform analysis on data
Let’s examine how this evolves over time…
Data Moving Through the Business
[Diagram: Incoming Data passes through SAS programs to become Analysis Data. Programs shown: stageForETL.sas, editChecks.sas, changeFromBaseline.sas, LOCF.sas, stageForAnalysis.sas, transform.sas.]
The Chasm Emerges
[Diagram: two flows that no longer match.
Flow 1: Incoming Data 1 to Analysis Data 1. Programs shown: stageForETL1.sas, editChecks.sas, changeFromBaseline.sas, stageForAnalysis.sas, transform1.sas.
Flow 2: Incoming Data 2 to Analysis Data 2. Programs shown: stageForETL2.sas, editChecks.sas, LOCF.sas, stageForAnalysis.sas, transform2.sas.]
Suddenly, the workflows are different!
The Chasm Introduces Chaos
[Diagram, repeated four times: a shell script, FlowX.Sh, wraps Prog1.sas through Prog5.sas to turn Incoming Data X into Analysis Data X.]
Chaos Becomes Unmanageable
[Diagram: one script per data source. Script1 turns Incoming Data 1 into Analysis Data 1, and likewise for Script2 through Script6, continuing to ScriptX… for Incoming Data X and Analysis Data X.]
Adding Organizational Complexity
[Diagram: the work spans IT and Informatics, an external server, and a file server. Three vendors deliver in different formats:
Vendor 1: Password-protected Zip file
Vendor 2: SAS datasets
Vendor 3: CSVs
Each delivery has its own Shell Script, Edit Checks, and SAS ETL, and each produces an analysis.sas7bdat.]
Crossing the chasm
Receive Data from external vendors
Perform analysis on data
• Every clinical organization receives data from vendors in multiple formats and must move it through the business process before analysis can happen
• What would a system for managing this process look like?
– Programming logic and configuration information are separated
– Repeatable tasks ('actions') are defined and used to build all workflows
– Notification needs, such as emailing data consumers and updating a dashboard, are supported
Components of a Framework
The system must…
• support varying input data formats from CROs
• externalize configuration details such as user credentials and file server paths
• adapt to changes in the infrastructure and be portable across operating systems
• be easily extensible as the business places new demands on the framework
• support varying data preparation activities
• provide a reusable library of actions from which to build workflows
• support extending the library easily and surfacing new features to users
• provide a simple interface for users to define new flows or modify existing flows
• give all staff visibility into the health of the data flows
• provide a “single source of the truth” for understanding how individual data flows are defined
A Framework Emerges
[Diagram: a Perl Automation Framework, driven by workflow config files, sits between the FTP server, the file server, and the SAS programs on one side and the Study Analysis Data for each study on the other. A database and a Data Workflow Health Dashboard complete the picture.]
What Role Does Metadata Play?
• Vastly overloaded term in this industry – let’s talk about metadata as configuration information to drive framework ‘action’ building blocks
• Metadata at the Action level
– Location of a file to operate on
– Password required to unzip an archive
• Metadata at the Workflow Level
– Trigger mechanism that starts this workflow
– Conditional events that should cause a workflow to abort
• Metadata at the System Level
– Notification details – who to notify, when, and how?
– IT configuration – decouple application from infrastructure
• Goals
– Proper tradeoff between automation and flexibility
– Separate programming logic from configuration logic
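One way to read the three levels above is as layered configuration, where a more specific level overrides a more general one. The sketch below is in Python purely for illustration (the framework in this deck is Perl). Every key name (`notify`, `file_server`, `trigger`, `abort_on`, `path`, `password`) and the override order are assumptions, not details taken from the reference system.

```python
from collections import ChainMap

# Hypothetical metadata at each level; every key name here is illustrative.
SYSTEM = {"notify": ["data-team@example.com"], "file_server": "/mnt/studies"}
WORKFLOW = {"trigger": "file_arrival", "abort_on": ["empty_archive"]}
ACTION = {"path": "incoming/vendor1.zip", "password": "read-from-secure-store"}


def resolve_settings(system, workflow, action):
    """Merge the three levels; the most specific level wins on a key clash."""
    # ChainMap searches its maps left to right, so action settings override
    # workflow settings, which override system settings.
    return dict(ChainMap(action, workflow, system))


settings = resolve_settings(SYSTEM, WORKFLOW, ACTION)
```

Because the action code only ever sees the merged `settings`, a change of file server or notification list is an edit to a configuration file rather than to a program.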
Workflow Actions
• Obvious activities, such as copying files or unzipping files, are quickly identified and easily supported.
• More ambiguous tasks, such as specialized ETL or statistical methods, might be tougher to compartmentalize.
• In the reference system, the actions defined included:
– Unzip
– Copy
– run a SAS program
– run a Perl program
– Move
– Delete
– search through a text file
– search through a log file
• Each action implements an interface, making the automation component easily extensible:
– setProperties(pathToConfigFile, notifyEmailAddresses)
– execute(ActionError)
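The two-method interface named on this slide can be sketched as follows. This is a Python illustration, not the reference Perl code. The slide does not say how `ActionError` is used, so the sketch assumes `execute` raises it on failure; the JSON config format, the `source` and `target` keys, and the `CopyAction` class are likewise hypothetical.

```python
import json
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class ActionError(Exception):
    """Raised when an action cannot complete (name taken from the slide)."""


class Action(ABC):
    """Common interface that every workflow action implements."""

    def set_properties(self, path_to_config_file, notify_email_addresses):
        # Configuration lives outside the code; JSON is an assumed file format.
        self.config = json.loads(Path(path_to_config_file).read_text())
        self.notify = list(notify_email_addresses)

    @abstractmethod
    def execute(self):
        """Do the work, raising ActionError on failure."""


class CopyAction(Action):
    """Hypothetical 'Copy' action driven by 'source' and 'target' config keys."""

    def execute(self):
        try:
            shutil.copy(self.config["source"], self.config["target"])
        except (OSError, KeyError) as exc:
            raise ActionError(f"copy failed: {exc}") from exc
```

Adding a new kind of action then means writing one more subclass; the code that runs workflows only needs to know about `set_properties` and `execute`.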
Workflow Lifecycle
• The system reads a top-level workflow registry file to learn about all defined workflows
• Each workflow is defined as
– a trigger event, such as a file appearing in a folder or a time to execute a certain process
– a sequence of actions
– notification requirements
• If the trigger event is satisfied, the remaining actions are executed in order
• If any action fails during processing, the workflow as a whole fails
• System events (starting a workflow, trigger satisfied, action running, error occurred, workflow completed) are logged to the database
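The lifecycle above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the Perl framework itself: the `Workflow` class, `run_workflow`, and `run_registry` are hypothetical names, triggers and actions are modelled as plain callables, an in-memory list stands in for the database, and sending notifications is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List

EVENT_LOG: List[tuple] = []  # stand-in for the events table in the database


def log_event(workflow, event):
    EVENT_LOG.append((workflow, event))


@dataclass
class Workflow:
    name: str
    trigger: Callable[[], bool]          # e.g. "has the file arrived?"
    actions: List[Callable[[], None]]    # executed in order
    notify: List[str] = field(default_factory=list)


def run_workflow(wf):
    """Return True if the workflow completed, False if it did not run or failed."""
    log_event(wf.name, "starting workflow")
    if not wf.trigger():
        return False
    log_event(wf.name, "trigger satisfied")
    for action in wf.actions:
        log_event(wf.name, "action running")
        try:
            action()
        except Exception:
            # Any failing action fails the workflow as a whole.
            log_event(wf.name, "error occurred")
            return False
    log_event(wf.name, "workflow completed")
    return True


def run_registry(registry):
    """Run every workflow listed in the (hypothetical) registry."""
    return {wf.name: run_workflow(wf) for wf in registry}
```

Stopping at the first failed action keeps later steps from running on incomplete data, and the event log records how far each workflow got.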
Single Source of the Truth?
• Goals
– To be able to definitively understand what activities are required to produce a given dataset
– To be able to see the overall health of the data workflows
– To perform rudimentary impact analysis for ETL-like actions
• In the ‘chaos’ example… multiple, ad-hoc programs - difficult to manage, impossible to see comprehensively
• Good… metadata driven actions defined in user maintained text files
• Better… A UI for viewing/editing workflow configurations in a database
• Best… A Workflow Health Dashboard surfacing system details from the database
• Ideal… The information surfaced by the system enables the business to focus on the sciences rather than the supporting tools
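Once workflows are defined as data rather than as ad-hoc programs, the "rudimentary impact analysis" goal becomes a simple query. The Python sketch below assumes each step declares the files it reads and writes; that structure, the `impacted` function, and the example file names are all hypothetical and not taken from the reference system.

```python
# Hypothetical workflow definitions: each step names the files it reads and writes.
WORKFLOWS = {
    "study1": [
        {"action": "unzip", "reads": ["vendor1.zip"], "writes": ["raw.csv"]},
        {"action": "run_sas", "reads": ["raw.csv"], "writes": ["analysis.sas7bdat"]},
    ],
    "study2": [
        {"action": "copy", "reads": ["vendor2.sas7bdat"], "writes": ["staged.sas7bdat"]},
    ],
}


def impacted(workflows, changed_file):
    """Return {workflow name: sorted files affected} for a change to changed_file."""
    result = {}
    for name, steps in workflows.items():
        dirty = {changed_file}
        affected = set()
        for step in steps:  # steps are listed in execution order
            if dirty & set(step["reads"]):
                # Anything written from a changed input is itself changed.
                dirty |= set(step["writes"])
                affected |= set(step["writes"])
        if affected:
            result[name] = sorted(affected)
    return result
```

For example, `impacted(WORKFLOWS, "vendor1.zip")` reports that only `study1` is touched, and that both its intermediate and final files would need to be regenerated.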
What is Perl Good For?
• Open source (free), easy to install, portable across operating systems
• Terse syntax keeps programs short and human readable
• Powerful scripting language
• Rapid Development Features:
– Exceptionally powerful for processing text files
– Robust support for OO design, exception handling, regular expressions
– Easy extension via Perl modules available from CPAN
– Log4perl
• Turn-key integration for database, email, rolling file appender
• Log levels (debug, info, warn, error, fatal) make it easy to tweak the verbosity of the system without changing code
– Automated IQ/OQ
– Unit Testing Frameworks
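The point about log levels can be illustrated without Log4perl itself. The sketch below uses Python's standard `logging` module as a stand-in, so it shows the idea rather than Log4perl's API; the `build_logger` helper and the notion that the level name comes from a config file are assumptions.

```python
import io
import logging


def build_logger(level_name, stream):
    """Create a logger whose verbosity comes from configuration, not code."""
    logger = logging.getLogger("workflow")
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    # level_name would normally be read from a config file (assumption).
    logger.setLevel(getattr(logging, level_name.upper()))
    return logger


buffer = io.StringIO()
log = build_logger("warn", buffer)
log.debug("unzipping archive")     # suppressed at the warn level
log.info("trigger satisfied")      # suppressed at the warn level
log.error("SAS program failed")    # emitted
```

Switching the configured level from `warn` to `debug` makes the first two messages appear as well, with no change to the calling code.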
Picking the Right Technology for the Job
• SAS
– SAS is a superior technology for manipulation, viewing, and analysis of tabular data
– disappointing lack of programmable exception handling, testing frameworks
– SAS datasets are a part of the clinical industry
• Perl
– superior technology for scripting and processing text files
– open source community provides a vast array of usable tools enabling rapid development
• Database
– Relational databases, either COTS or open source, enable web applications to be quickly built to interface with the system and help control the user experience
• What role does open source play in this industry?
– Cost of adoption is more than just cost of licenses
– Vendor lock-in can be a challenge to break free from over time
Summary
• Ad-hoc approaches evolve from desirable, to complicated, to unmanageable, to painful – and costs increase at each step along the way
• Framework approaches encourage reusability and extend capability/accountability to broader audiences
• Separate business logic from programming logic
• Abstraction and metadata-driven approaches enable the business
• Custom solutions and open source are viable options when considering how to apply technology to business problems