Bringing OpenClinica Data into SAS
OpenClinica Global Forum 2010. A Java tool to create ‘SAS friendly’ XML from OpenClinica.
CRIC and OpenClinica
CRIC supports a wide variety of studies
  ◦ ‘Regulatory’ clinical trials
  ◦ Many different types of academic study
  ◦ Variable size and complexity
Investigators design their own CRFs
  ◦ CRIC has limited control over design strategies and CRF consistency.
Analysis requirements and data formats vary
  ◦ SPSS, Stata, SAS, Excel.
CRIC’s preferred data handling tool is SAS
OpenClinica Export
OpenClinica exports seem difficult for our users to work with.
Data structures vary depending on the data content.
  ◦ CRF versions (repeat as extra columns)
  ◦ Group contents (number of repeats)
Multi-select objects are difficult to handle.
  ◦ Must be ‘broken’ into separate variables for analysis.
Null values are represented as text in otherwise numeric variables.
The Challenge
We wanted to:
  ◦ Produce consistently usable data for minimal up-front effort.
  ◦ Get data that could easily be transferred into different formats.
  ◦ Produce tall, thin, de-normalized data sets suitable for data management purposes.
  ◦ Leverage CRF metadata to add value:
      ◦ Dataset labels
      ◦ Variable labels
      ◦ SAS formats and informats
      ◦ SAS special missing values
The Solution
Create ‘SAS friendly’ XML to be read by the XML Libname engine.
Create a SAS XML Map file to assign labels, data types, informats and formats.
Generate a CNTLIN data set in the XML suitable for use by PROC FORMAT.
Note: The XML file can also be imported directly into MS Access.
Development Approach
SAS macros or external utility?
  ◦ High complexity
      ◦ Ensure OpenClinica metadata is translated into legal SAS names.
      ◦ Map the OC hierarchy to SAS data sets: CRFs, sections, groups and data items to tables, rows and columns.
      ◦ De-duplicate object names.
  ◦ No resource to develop complex macros
The Choice
Command Line Java Utility
  ◦ Programmer available (I would have to write SAS code myself!)
  ◦ Capable development environment
  ◦ Portable (Windows / Linux)
  ◦ Callable from within SAS
Data Processing
Enter connection parameters and study identifier (interactively or on the command line)
Connect to Postgres via JDBC
Read study metadata
Manipulate the metadata
Write map file
Read study data
Write data file
Metadata Manipulations
Legalize names
  ◦ SAS names <= 32 characters
  ◦ Must start with a letter or underscore
  ◦ Format names cannot end in a number
De-duplicate names
  ◦ Multiple CRFs may contain the same section and response option names.
  ◦ Duplicate names have numbers and underscores appended.
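The naming rules above could be sketched in Java roughly as follows. This is an illustrative sketch, not the utility's actual source: the class `SasNames`, its method names, and the exact suffix scheme (`_1`, `_2`, ... for ordinary names, `_1_`, `_2_`, ... for format names so that they never end in a digit) are assumptions. It also assumes that characters other than letters, digits and underscores are replaced by underscores.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not the utility's actual source. */
final class SasNames {
    static final int MAX_LEN = 32; // limit quoted on the slide

    // SAS names are case-insensitive, so clashes are tracked in upper case.
    private final Set<String> used = new HashSet<>();

    /** Replaces illegal characters, fixes the first character, truncates. */
    static String legalize(String raw) {
        String name = raw.trim().replaceAll("[^A-Za-z0-9_]", "_");
        if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
        }
        return name.length() > MAX_LEN ? name.substring(0, MAX_LEN) : name;
    }

    /** Format names must also not end in a digit, so an underscore is added. */
    static String legalizeFormat(String raw) {
        String name = legalize(raw);
        if (Character.isDigit(name.charAt(name.length() - 1))) {
            if (name.length() == MAX_LEN) {
                name = name.substring(0, MAX_LEN - 1);
            }
            name = name + "_";
        }
        return name;
    }

    /** Returns an unused name, appending a numbered suffix on a clash. */
    String unique(String legal, boolean isFormat) {
        String candidate = legal;
        int n = 0;
        while (!used.add(candidate.toUpperCase(Locale.ROOT))) {
            n++;
            String suffix = "_" + n + (isFormat ? "_" : "");
            int keep = Math.min(legal.length(), MAX_LEN - suffix.length());
            candidate = legal.substring(0, keep) + suffix;
        }
        return candidate;
    }

    public static void main(String[] args) {
        SasNames names = new SasNames();
        System.out.println(names.unique(legalize("1st visit"), false));
        System.out.println(names.unique(legalizeFormat("Severity 2"), true));
        System.out.println(names.unique(legalizeFormat("severity 2"), true));
    }
}
```

Keeping one `SasNames` instance per name space (data sets, variables, formats) would let the same section name appear once as a data set and once as a variable without a spurious clash.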
Metadata Manipulations
CRFs
  ◦ No ‘top level’ mapping between CRFs and data sets.
CRF Section -> SAS data set
  ◦ CRF sections contain logically grouped data; CRFs may not!
  ◦ CRFs containing multiple sections result in multiple output data sets.
  ◦ Every data item contained within a section is output to the same data set.
  ◦ Section label -> dataset name
  ◦ Section title -> dataset label
Metadata Manipulations
Groups -> Rows
  ◦ Ungrouped section data is repeated in each row.
  ◦ Each repeat becomes a separate row in the data set.
  ◦ Rows are numbered to provide a unique key based on their order within the group.
  ◦ Multiple groups contained within the same section are merged based on order within the groups.
  ◦ Where groups contain unequal numbers of rows, missing values result.
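The group-to-row rules above can be illustrated with a small Java sketch. The class `SectionRows`, the key column name `ROW_NUMBER`, and the choice to emit a single row for a section with no repeating groups are all assumptions made for the example; `null` stands in for a SAS missing value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of merging repeating groups by position. */
final class SectionRows {

    /**
     * @param ungrouped section items that sit outside any group
     * @param groups    each group is a list of repeats; each repeat maps column -> value
     */
    static List<Map<String, String>> merge(Map<String, String> ungrouped,
                                           List<List<Map<String, String>>> groups) {
        // The longest group decides how many rows the data set gets.
        int rowCount = 1;
        for (List<Map<String, String>> group : groups) {
            rowCount = Math.max(rowCount, group.size());
        }
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("ROW_NUMBER", Integer.toString(i + 1)); // key from order in group
            row.putAll(ungrouped);                          // repeated in every row
            for (List<Map<String, String>> group : groups) {
                // Every column seen anywhere in the group appears in every row.
                for (Map<String, String> repeat : group) {
                    for (String column : repeat.keySet()) {
                        row.putIfAbsent(column, null);
                    }
                }
                // Shorter groups leave their columns missing in later rows.
                if (i < group.size()) {
                    row.putAll(group.get(i));
                }
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, String>> meds =
                List.of(Map.of("MED", "aspirin"), Map.of("MED", "ibuprofen"));
        List<Map<String, String>> events = List.of(Map.of("AE", "nausea"));
        for (Map<String, String> row
                : merge(Map.of("VISIT_DATE", "2010-01-05"), List.of(meds, events))) {
            System.out.println(row);
        }
    }
}
```

Merging by position keeps one data set per section, at the cost that values on the same row from different groups are related only by their order, not by meaning.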
Metadata Manipulations
CRF items -> dataset variables
  ◦ Item_name -> variable name
  ◦ Description_label -> variable label
Calculate length of character variables
  ◦ SAS has no support for VARCHARs. Explicitly specifying variable length saves considerable space on disk.
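One plausible way to calculate a character variable's length is to take the longest value actually stored for the item. The sketch below assumes that approach; the class name `CharLength` is hypothetical, and it measures length in UTF-8 bytes on the assumption that the SAS session reads the XML as UTF-8 (SAS character lengths are byte counts, with a minimum of 1).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: derive a fixed SAS character length from the data. */
final class CharLength {

    /** Returns the longest value's size in bytes, never less than 1. */
    static int lengthFor(List<String> values) {
        int max = 1;
        for (String value : values) {
            if (value != null) {
                max = Math.max(max, value.getBytes(StandardCharsets.UTF_8).length);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(lengthFor(Arrays.asList("No", "Unknown", null)));
    }
}
```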
Multi-select and Checkbox items
A new column is created for each response value.
  ◦ Column names are based on item_name.
  ◦ Columns are labeled based on item_label and the response option value.
  ◦ Columns contain 1 or 0 to indicate selected or unselected.
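As a rough illustration of the split described above, the following Java sketch turns one stored multi-select value into a set of 1/0 indicator columns. It assumes the stored value is a comma-separated list of selected option values, and the column naming scheme (`item_name` plus `_` plus the option value) is a placeholder; real names would still have to pass through name legalization. The class `MultiSelect` is hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of expanding a multi-select item into indicator columns. */
final class MultiSelect {

    /**
     * @param itemName     the CRF item name
     * @param optionValues every response option value defined for the item
     * @param stored       the stored value, assumed comma-separated (may be null)
     */
    static Map<String, Integer> expand(String itemName, List<String> optionValues,
                                       String stored) {
        Set<String> selected = new HashSet<>();
        if (stored != null) {
            for (String part : stored.split(",")) {
                String value = part.trim();
                if (!value.isEmpty()) {
                    selected.add(value);
                }
            }
        }
        // One column per defined option, in definition order: 1 = selected.
        Map<String, Integer> columns = new LinkedHashMap<>();
        for (String option : optionValues) {
            columns.put(itemName + "_" + option, selected.contains(option) ? 1 : 0);
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(expand("SYMPTOM", List.of("1", "2", "3"), "1,3"));
    }
}
```

Driving the columns from the option list rather than from the data means every subject gets the same set of variables, even for options nobody selected.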
Response Options
Response option lists become SAS formats and informats.
  ◦ Format names are created from the CRF item’s response_label.
  ◦ Format names are legalized and de-duplicated.
  ◦ If separate CRFs contain identical response option lists, only one format results.
Formats and informats are written to the XML as a new data table.
  ◦ This is used as a CNTLIN data set for PROC FORMAT.
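The "identical lists yield one format" rule, and the shape of the resulting CNTLIN table, might look like this in Java. The class `FormatRegistry` is hypothetical, the names passed in are assumed to be already legalized, and only four CNTLIN variables are shown (FMTNAME, START, LABEL, TYPE), with TYPE fixed at `N` on the assumption that the coded values are numeric.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one format per distinct response option list. */
final class FormatRegistry {

    // Key: option value -> label. Value: the format name first given to that list.
    private final Map<Map<String, String>, String> nameByOptions = new LinkedHashMap<>();

    /** Returns the format name to use; an identical list reuses the earlier name. */
    String register(String preferredName, Map<String, String> options) {
        // Copy the map so later changes by the caller cannot corrupt the key.
        return nameByOptions.computeIfAbsent(
                new LinkedHashMap<>(options), key -> preferredName);
    }

    /** One row per option: FMTNAME, START, LABEL, TYPE. */
    List<String[]> cntlinRows() {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<Map<String, String>, String> format : nameByOptions.entrySet()) {
            for (Map.Entry<String, String> option : format.getKey().entrySet()) {
                rows.add(new String[] {
                    format.getValue(), option.getKey(), option.getValue(), "N"});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> yesNo = new LinkedHashMap<>();
        yesNo.put("0", "No");
        yesNo.put("1", "Yes");
        FormatRegistry registry = new FormatRegistry();
        System.out.println(registry.register("YESNO", yesNo));
        System.out.println(registry.register("YN", yesNo)); // same list, same format
        for (String[] row : registry.cntlinRows()) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Because map equality ignores entry order, two lists with the same value-to-label pairs in a different order would also share a format in this sketch.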
Missing Values
Informats are created to read numeric data and handle OpenClinica null values.
CRF Dates
proc format;
  invalue crfdate
    'ASKU' = .k
    'NA'   = .a
    'NASK' = .d
    'NI'   = .i
    'NP'   = .p
    'OTH'  = .o
    'UNK'  = .u
    other  = [mmddyy10.];
run;
Missing Values
Numeric Response Options
proc format;
  invalue bestnull
    'ASKU' = .k
    'NA'   = .a
    'NASK' = .d
    'NI'   = .i
    'NP'   = .p
    'OTH'  = .o
    'UNK'  = .u
    other  = [best10.];
run;
Missing Values
Formats are created for CRF data.
Response options
proc format;
  value yesno
    0  = 'No'
    1  = 'Yes'
    .k = 'ASKU'
    .a = 'NA'
    .d = 'NASK'
    .i = 'NI'
    .p = 'NP'
    .o = 'OTH'
    .u = 'UNK';
run;
Missing Values
Dates
proc format;
  value crfdate
    .k = 'ASKU'
    .a = 'NA'
    .d = 'NASK'
    .i = 'NI'
    .p = 'NP'
    .o = 'OTH'
    .u = 'UNK'
    other = [date9.];
run;
Missing Values
Numeric Data
proc format;
  value bestnull
    .k = 'ASKU'
    .a = 'NA'
    .d = 'NASK'
    .i = 'NI'
    .p = 'NP'
    .o = 'OTH'
    .u = 'UNK'
    other = [best10.];
run;
Data Set Output
CRF Data
  ◦ One data set per CRF section
  ◦ Each row contains:
      ◦ Study ID
      ◦ Site ID
      ◦ Subject ID
      ◦ Study event name
      ◦ Event start and end date
      ◦ CRF Name
      ◦ CRF Version
Output Data Sets
Subject Data
  ◦ List of subjects including site, secondary ID, group, etc.
Event Data
  ◦ List of subjects’ study events including start date, end date and status.
CRF Status
  ◦ List of subject CRFs including event details, CRF version, creation date, completion date and status.
Discrepancies
Output Data Sets
Data for removed subjects is not exported.
PHI data remains encrypted.
Interactive Execution
C:> java -jar export.jar
----------------------------------------
 Export Output:
----------------------------------------
 MAP FILE: export.map.xml
 EXPORT FILE: export.xml
----------------------------------------
Postgresql driver loaded
Enter Database url (default: localhost):
Database port (default: 5432):
Database name (default: openclinica):
username (default: clinica):
password:
Enter Export file name (default: derived from study):
Enter Map file name (default: derived from study):
Interactive Execution
Successful connection to database openclinica on jdbc:postgresql://localhost:5432/
Please choose a study:
----------------------
 1) Study1
 2) Study2
 3) Study3
 4) Study4
==> 1
Retrieving study metadata
Creating subject table
Writing formats to .xml file
Writing subjects to .xml file
Retrieving study item data
Writing study item data to file
Complete
Files generated: study1.map.xml Study1.xml
Command Line Options
Command line options may be used rather than prompts. Options include:
  ◦ Host, database, ID and password
  ◦ Study OID
  ◦ File names
  ◦ Suppression of map file
  ◦ Creation of ‘SPSS friendly’ SAS data sets
      ◦ Minimal formatting allows data sets to be exported to SPSS using PROC EXPORT.
Command line options allow the utility to be executed from within SAS.
SAS Code
Define libraries
libname ocdata92 xml92 "data_file.xml" xmlmap="map_file.map" access=readonly;
libname library "c:\project\fmt";
libname studylib "c:\project\data";
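SAS Code
Execute the Import
/* Build the command line for the export utility from its parts. */
%let scommand =java -Xmx256m -jar c:\export\export.jar;
%let shost =-h 10.11.12.13;
%let sport =-p 5432;
%let sstudy =-soid S_STDY1234;
%let sdatabase =-D openclinica;
%let suser =-U dbuserid;
%let spswd =-P password;
%let spss = ;
/* Set these to the utility's file-name options; left empty, the default names are used. */
%let smapFile = ;
%let sdataFile = ;
X "&scommand &shost &sport &sstudy &sdatabase &suser &spswd &smapFile &sdataFile &spss";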
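On the Java side, the flags passed by the X command (-h, -p, -soid, -D, -U, -P) have to be turned into connection settings. The sketch below shows one way that could work; it is not the utility's actual parser. The class `ExportOptions` is hypothetical, it handles only flags that take a value, and the defaults are the ones shown by the interactive prompts (localhost, 5432, openclinica, clinica).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of reading the export utility's value-taking flags. */
final class ExportOptions {

    /** Parses "flag value" pairs, starting from the interactive defaults. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("-h", "localhost");
        options.put("-p", "5432");
        options.put("-D", "openclinica");
        options.put("-U", "clinica");
        // Assumption: every flag is followed by exactly one value.
        for (int i = 0; i + 1 < args.length; i += 2) {
            options.put(args[i], args[i + 1]);
        }
        return options;
    }

    /** Builds a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/database. */
    static String jdbcUrl(Map<String, String> options) {
        return "jdbc:postgresql://" + options.get("-h") + ":" + options.get("-p")
                + "/" + options.get("-D");
    }

    public static void main(String[] args) {
        Map<String, String> options =
                parse(new String[] {"-h", "10.11.12.13", "-soid", "S_STDY1234"});
        System.out.println(jdbcUrl(options));
        System.out.println("Study OID: " + options.get("-soid"));
    }
}
```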
SAS Code
Create the Format Catalog from the XML
proc sort data=ocdata92.fmtlib out=work.fmtlib;
  by fmtname type start;
run;
proc format cntlin=work.fmtlib library=library fmtlib;
run;
SAS Code
Copy the Data Sets
proc datasets library=ocdata92;
  copy out=studylib;
  exclude fmtlib;
quit;
Do It!
Import into SAS
If we have time:
  ◦ XML Structures
  ◦ Import into Access
  ◦ Import into Excel