challenges with clinical trial data analysis · 2012-06-04 · challenges with clinical trial data...


Page 1: Challenges with Clinical Trial Data Analysis · 2012-06-04 · Challenges with Clinical Trial Data Analysis Sreekanth Nunna, Bhaskar Govind, Dr. A. K. Mathai SAS (Statistical Analysis

Newsletter June 2012

Challenges with Clinical Trial Data Analysis

Sreekanth Nunna, Bhaskar Govind, Dr. A. K. Mathai

SAS (Statistical Analysis System) programming is an inseparable part of clinical trial data analysis, and SAS is the software most widely used for analyses submitted to regulatory authorities such as the FDA. However, the most important and difficult task for the data analyst is to make sure that the data are complete and free from error, and this task is often cumbersome. The objective of this paper is to highlight points that may be useful to SAS programmers and statisticians while cleaning data for final analysis. It may be used as a checklist during data cleaning.

SAS Programming challenges

Data issues commonly fall into three categories: outliers, missing data and unformatted values. These issues need to be addressed before SAS programming begins. In addition, duplicate records need to be resolved by the data analyst. To do so, the total number of records in each data set should be ascertained before final analysis. If duplicate records are found, they should be deleted after getting confirmation from the appropriate person or authority. There are also specific points that need to be checked for accuracy in data sets such as demography, drug administration, lab data, adverse events, physical examination, vital signs and study completion.
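The record-count and duplicate check described above could be sketched as follows. This is a minimal illustration and not the authors' program: the data set names (DEMOG, DEMOG_CLEAN, DEMOG_DUPS) and the sample rows are hypothetical, and PATNO is assumed to be the key that identifies a subject.

```sas
/* Hypothetical demography records; subject 002 is entered twice. */
DATA DEMOG;
   INPUT PATNO $ GENDER $ AGE;
   DATALINES;
001 M 34
002 F 45
002 F 45
003 M 29
;
RUN;

/* Total records versus distinct subjects: a difference signals duplicates. */
PROC SQL;
   SELECT COUNT(*)              AS N_RECORDS,
          COUNT(DISTINCT PATNO) AS N_SUBJECTS
   FROM DEMOG;
QUIT;

/* Keep the first record per PATNO in DEMOG_CLEAN and route the others
   to DEMOG_DUPS for review. DEMOG itself is left unchanged. */
PROC SORT DATA=DEMOG OUT=DEMOG_CLEAN DUPOUT=DEMOG_DUPS NODUPKEY;
   BY PATNO;
RUN;

PROC PRINT DATA=DEMOG_DUPS;
   TITLE "Duplicate records awaiting confirmation";
RUN;
```

Note that NODUPKEY compares only the BY variables, so two records with the same PATNO but different values are also routed to DEMOG_DUPS. Writing the duplicates to a separate data set rather than deleting them outright keeps them available until the appropriate person has confirmed the deletion.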

In the demography data set, duplicate records may exist. These need to be identified and deleted. There may also be missing information such as date of birth, gender, race, height and weight, which needs to be obtained from the investigator. Another issue is a difference in units for variables such as height and weight. For example, height may be reported in cm, meters, or feet and inches. It should be converted into one single unit, such as cm or meters, for all cases. For some cases, the unit may not be reported in the data set; it should be obtained from the investigator prior to final data analysis. Another common issue is unformatted values. For example, the gender variable may be entered as Male, M, female and F. This needs to be standardized so that the data are entered consistently as Male / Female or M / F. Yet another issue with continuous variables is out-of-range values. For example, suppose the age limit for the study subjects is 18 to 50 years and there is an entry of 60 years for one subject. This needs to be clarified with the investigator or CDM personnel, and then appropriate action should be taken to modify the data set.
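The demography checks above (mixed units, mixed gender codes, age outside the protocol limits) might be combined in one DATA step along these lines. This is a sketch under stated assumptions: the variable names HTUNIT, SEX, HEIGHT_CM and ISSUE and all sample rows are hypothetical, and only cm and m are handled; heights in feet and inches would need their own conversion.

```sas
/* Hypothetical records: height in mixed units, gender in mixed codes. */
DATA DEMOG_RAW;
   INPUT PATNO $ GENDER :$6. HEIGHT HTUNIT $ AGE;
   DATALINES;
001 Male   172  cm 34
002 F      1.65 m  45
003 female 160  cm 60
004 M      175  .  28
;
RUN;

DATA DEMOG_STD;
   SET DEMOG_RAW;
   LENGTH SEX $1 ISSUE $60;

   /* Map the spellings seen in the data to a single code. */
   SELECT (UPCASE(GENDER));
      WHEN ('M', 'MALE')   SEX = 'M';
      WHEN ('F', 'FEMALE') SEX = 'F';
      OTHERWISE            SEX = ' ';
   END;

   /* Convert height to cm. A missing unit is flagged, not guessed. */
   IF      UPCASE(HTUNIT) = 'CM' THEN HEIGHT_CM = HEIGHT;
   ELSE IF UPCASE(HTUNIT) = 'M'  THEN HEIGHT_CM = HEIGHT * 100;
   ELSE ISSUE = 'Height unit missing or unknown';

   /* Age limits taken from the example in the text: 18 to 50 years. */
   IF AGE NE . AND (AGE LT 18 OR AGE GT 50) THEN
      ISSUE = CATX('; ', ISSUE, 'Age outside 18-50');
RUN;

/* List only the records that need a query to the investigator. */
PROC PRINT DATA=DEMOG_STD;
   WHERE ISSUE NE ' ' OR SEX = ' ';
RUN;
```

The step only flags problems. Any change to the underlying data would still follow clarification with the investigator or CDM personnel, as described above.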

Example 1: There are several character variables which should have a limited number of valid values. For this example, we expect values of GENDER to be 'F' or 'M', values of DX (age) to be the numerals 1 through 99, and values of FH (family history) to be 0 or 1. A very simple approach to identifying invalid character values in this file is to use PROC FORMAT to define formats that label each value as valid, missing or miscoded, and then to list the formatted values of these variables.

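Listing of Data (variables PATNO, GENDER, DX, FH; the row layout of the original table is not recoverable, so the recorded values are listed per variable)

PATNO:  001 002 003 004 XX5 010 011 013 002 023 987
GENDER: M F X F M F M 2 F f M
DX:     99 1.3 X X 3 9 1 1 1 8 (one value missing)
FH:     0 0 1 1 1 A 0 0 0 0 03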


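PROC FORMAT;
   /* GENDER: only 'F' and 'M' are valid */
   VALUE $GENDER 'F' = 'VALID'
                 'M' = 'VALID'
                 ' ' = 'MISSING'
                 OTHER = 'MISCODED';
   /* DX: character ranges compare text, not numbers, so a value */
   /* such as '1.3' also falls inside this range                 */
   VALUE $DX '001' - '99' = 'VALID'
             ' ' = 'MISSING'
             OTHER = 'MISCODED';
   /* $AE is the format applied to FH (family history): 0 or 1 */
   VALUE $AE '0' = 'VALID'
             '1' = 'VALID'
             ' ' = 'MISSING'
             OTHER = 'MISCODED';
RUN;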

Listing of Final Data with Formats:

Obs  PATNO  GENDER  Formatted GENDER  DX   Formatted DX  FH  Formatted FH
  1  001    M       VALID             99   VALID         0   VALID
  2  002    F       VALID             X    MISCODED      0   VALID
  3  003    X       MISCODED          3    VALID         1   VALID
  4  004    F       VALID             9    VALID         1   VALID
  5  005    M       VALID             1    VALID         A   MISCODED
  6  010    f       MISCODED          1    VALID         0   VALID
  7  011    M       VALID             8    VALID         1   VALID
  8  013    2       MISCODED          1    VALID         0   VALID
  9  002    F       VALID             X    MISCODED      0   VALID
 10  023    f       MISCODED               MISSING       0   VALID
 11  987    M       VALID             1.3  VALID         03  MISCODED
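A complementary way to see every distinct value of these variables is a frequency table. The sketch below uses PROC FREQ, which is a different procedure from the PROC FORMAT approach shown in this example. The data set name PATIENTS and the five sample rows are illustrative only and are not the full listing.

```sas
/* Illustrative rows only; a period is read as a missing value. */
DATA PATIENTS;
   INPUT PATNO $ GENDER $ DX $ FH $;
   DATALINES;
001 M 99 0
002 F X  0
003 X 3  1
010 f 1  0
023 f .  0
;
RUN;

/* One table per variable, listing each distinct value and its count.
   MISSING includes missing values as a level of their own.          */
PROC FREQ DATA=PATIENTS;
   TABLES GENDER DX FH / NOCUM NOPERCENT MISSING;
RUN;
```

Unexpected levels such as 'X' or a lower-case 'f' stand out in the tables even when no list of valid values has been written down yet.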

Example 2:

Listing of Data (variables PATNO, HR, SBP, DBP; the individual values are not recoverable from the source layout)


Variable Name   Description                Variable Type   Valid Values
PATNO           Patient Number             Character       Alphanumeric
HR              Heart Rate                 Numeric         40 to 100
SBP             Systolic Blood Pressure    Numeric         80 to 200
DBP             Diastolic Blood Pressure   Numeric         60 to 120

USING A DATA STEP TO CHECK FOR INVALID VALUES

A simple DATA _NULL_ step can also be used to produce a report on out-of-range values. Here is the program:

USING A DATA _NULL_ DATA STEP TO LIST OUT-OF-RANGE DATA VALUES

DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***OUTPUT TO THE OUTPUT WINDOW;
   TITLE "LISTING OF PATIENT NUMBERS AND INVALID DATA VALUES";
   ***ASSUMED LAYOUT, NOT GIVEN IN THE SOURCE: SPACE-DELIMITED PATNO HR SBP DBP;
   INPUT PATNO $ HR SBP DBP;
   ***CHECK HR;
   IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN
      PUT PATNO= HR=;
   ***CHECK SBP;
   IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 THEN
      PUT PATNO= SBP=;
   ***CHECK DBP;
   IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 THEN
      PUT PATNO= DBP=;
RUN;
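When the values are already in a SAS data set, the same range checks can be written as a WHERE statement in PROC PRINT. The following is an alternative sketch, not part of the original program: the data set name VITALS and its rows are hypothetical, while the limits are the valid ranges listed for HR, SBP and DBP in Example 2.

```sas
/* Hypothetical vital signs; a period is a missing value. */
DATA VITALS;
   INPUT PATNO $ HR SBP DBP;
   DATALINES;
001 88 140 80
002 210 180 90
003 68 . 70
004 22 130 140
;
RUN;

/* Print only the subjects with at least one non-missing value
   outside its valid range.                                    */
PROC PRINT DATA=VITALS;
   WHERE (HR  NOT BETWEEN 40 AND 100 AND HR  IS NOT MISSING) OR
         (SBP NOT BETWEEN 80 AND 200 AND SBP IS NOT MISSING) OR
         (DBP NOT BETWEEN 60 AND 120 AND DBP IS NOT MISSING);
   ID PATNO;
   VAR HR SBP DBP;
   TITLE "Out-of-range vital signs";
RUN;
```

Unlike the DATA _NULL_ report, this prints all three measurements for a flagged subject, so the reviewer has to pick out which value is out of range.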

[Figure: diastolic blood pressure axis with the reference range (60 to 120) and outliers marked below it (8, 20) and above it (180, 200)]

The above examples illustrate procedures for cleaning data from a clinical trial. They should help SAS programmers and data analysts, especially beginners, to examine the quality of clinical data and resolve discrepancies such as those described above.

Another important data set is drug administration. Some records may be duplicates: for example, a subject may have multiple records with the same start date, stop date and drug code. Such duplicates need to be resolved. Data may also be missing for some subjects; the documents for these subjects need to be checked to confirm that no study drug information is missing. The recorded drug may also have been interchanged, that is, drug B entered in the database instead of drug A. The data analyst should check the random allocation list and confirm both the drug allocation and the entry in the database. Similarly, the data analyst needs to check that the start date is on or before the stop date. The stop date may be missing for subjects who withdrew from or completed the study; these cases should also have a stop date. Conversely, for some subjects a stop date is entered although the subject is still continuing the study drug.

Apart from the above points, the following minor issues may be considered while checking the quality of the data. All SAEs (Serious Adverse Events) are AEs (Adverse Events), but not all AEs are SAEs. Also, a subject who has discontinued the study drug can still continue in the study for further follow-up. The data issues described above may also be relevant for other data sets such as medical history, physical examination, vital signs, concomitant medication, laboratory data, adverse events, serious adverse events and study completion.

Conclusion

As mentioned in the introduction, SAS programming is an inseparable part of any clinical trial data analysis. The quality of the data must be ensured before undertaking the final analysis. The points mentioned above should help data analysts, statisticians and SAS programmers in the clinical trial industry to make sure that the final data set is free from errors and omissions. This paper gives an insight into the importance of good quality data generation and analysis in clinical trials. Beginners in SAS programming and data analysis should be encouraged to look into the various aspects of data quality before undertaking the final analysis.

MEDICAL CODING IN CLINICAL TRIALS

Medical coding is the process of grouping or classifying medical terms such as medications, adverse events, medical history, diagnoses and disease conditions with reference to standard terms in medical dictionaries. In clinical trials, subject information in addition to the study medication is collected and recorded on the relevant CRFs. Data generated in these trials are ultimately subjected to further analysis, so it is essential that the data be interpreted uniformly in a standardized format. Hence medical coding using standardized medical dictionaries is required. The coded data are reviewed by Medical Monitors, Clinical Scientists and Statisticians to analyze adverse events and medications for safety and efficacy.

Medical coding in clinical trials can be explained in 3 phases:
1. Coding initiation phase
2. Coding phase
3. Coding closure phase

1. Coding initiation phase

Efficient and coordinated teamwork between the Database programming team and the Operational team is needed to initiate the medical coding process. The Database programming team is responsible for importing and loading the corresponding versions of the medical dictionaries into the appropriate CDMS tool (e.g. OCTMS) according to the client specifications and study requirements. The Operational team performs user acceptance testing on the corresponding dictionary version, makes sure that the uploaded dictionary version is according to the given operational specifications, and documents this according to the organizational documentation policies. The final evaluated dictionary version is issued to the functional team handling the specific study. This process is repeated for the same version of the dictionary when it is used for future studies.

2. Coding phase

Medical coding is done on validated and clean data by the data management team. Initially auto-coding is performed, in which the reported terms are automatically coded against the medical dictionaries. If a perfect match is not found, the reported term is not auto-coded. In these cases manual coding should be performed by the Medical Coding team. A blend of auto-coding and manual coding is recommended for good quality coded data.

Auto-coding: In this process the reported term is automatically coded when an exact match is found in the medical dictionary; this is called a “direct hit”. It is not advisable to depend completely on auto-coding, because no autoencoders currently available are capable of critically analyzing and interpreting a reported term. Almost all autoencoders use fairly simplistic methods with no ability to distinguish the medically significant differences among similar entries in the dictionaries. Using an autoencoder that lacks embedded medical knowledge will deliver compromised results and will often require the data to be recoded, costing more in the long run.
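The exact-match behaviour described for auto-coding can be illustrated with a simple join. This is a minimal sketch and not how any particular CDMS implements auto-coding: the data sets DICT and REPORTED, the codes and the reported terms are all hypothetical placeholders, not real dictionary content.

```sas
/* Hypothetical dictionary extract: terms and codes are placeholders. */
DATA DICT;
   INFILE DATALINES DSD;
   INPUT TERM :$40. CODE :$8.;
   DATALINES;
HEADACHE,C0001
NAUSEA,C0002
PYREXIA,C0003
;
RUN;

/* Hypothetical reported (verbatim) terms. */
DATA REPORTED;
   INFILE DATALINES DSD;
   INPUT PATNO :$3. VERBATIM :$40.;
   DATALINES;
001,HEADACHE
002,Nausea
003,Fever with rash and itching
;
RUN;

/* A term is a direct hit only when it equals a dictionary term exactly.
   Everything else is left uncoded and flagged for manual review.       */
PROC SQL;
   CREATE TABLE CODED AS
   SELECT R.PATNO, R.VERBATIM, D.CODE,
          CASE WHEN D.CODE IS MISSING THEN 'MANUAL REVIEW'
               ELSE 'DIRECT HIT' END AS STATUS LENGTH=13
   FROM REPORTED AS R
        LEFT JOIN DICT AS D
        ON R.VERBATIM = D.TERM;
QUIT;
```

Because the comparison is case-sensitive, 'Nausea' does not match 'NAUSEA' in this sketch and goes to manual review along with the combination term, which shows how strict an exact-match rule is.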


Manual Coding: This activity should be performed when the auto-coding process fails to code a term, which may happen for various reasons. The data management team should analyze the reasons for the failure (e.g. spelling errors, non-contributing words, ambiguous information, extra spaces). The CDMS raises discrepancies if terms do not get coded. These discrepancies should be managed by the data management team, which can take feedback from the TMS group about the suitable action to be taken. Based on the feedback, the data management team edits and updates the database accordingly and makes the terms available for auto-coding. It is the responsibility of the medical coder to assign terms manually where auto-coding cannot be performed even after the necessary actions have been taken, with documentary support for the same.
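Some of the failure reasons listed above, such as extra spaces or differences in letter case, can be detected by deriving a normalized key next to the reported term. The sketch below is an assumption about how such a check might look: the names REPORTED_KEYED, VERBATIM and MATCH_KEY and the sample terms are hypothetical. The reported term itself is left unchanged, since edits to the database follow the discrepancy process described above.

```sas
DATA REPORTED_KEYED;
   INFILE DATALINES TRUNCOVER;
   INPUT VERBATIM $CHAR40.;      /* $CHAR keeps leading blanks as entered */
   LENGTH MATCH_KEY $40;
   /* Remove leading and trailing blanks, collapse runs of blanks
      to a single blank, and convert to upper case.                */
   MATCH_KEY = UPCASE(COMPBL(STRIP(VERBATIM)));
   DATALINES;
  Severe   headache
Nausea
;
RUN;

PROC PRINT DATA=REPORTED_KEYED;
RUN;
```

Comparing MATCH_KEY rather than VERBATIM against the dictionary shows which uncoded terms failed only because of spacing or case, and which ones need a query or manual coding.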

3. Coding closure phase

This phase starts once LPLV (last patient, last visit) has been reached. The medical coder should check all coded terms for accuracy and consistency and confirm that all queries have been resolved and updated in the database. Listings of all coded terms should be sent to the sponsor for final review, and the sponsor's approval should be obtained for closure of the medical coding process for the specific study; this has to be documented as per organizational documentation policies. Even after both auto-coding and manual coding, some terms may remain uncoded. This reflects the complexity of the term or, sometimes, the non-availability of the term in the medical dictionaries. These terms should be tagged as uncodable and the underlying reason should be documented. In most cases, reported terms that are queried need to be updated and resubmitted. The coder should do his/her best to code the term using the best possible entry available in the corresponding dictionary.

Challenges in Medical Coding:

Good and consistent coding requires clear initial data. This cannot always be expected on the CRFs, because what is clear to the investigator at the time of data collection may be unclear to the medical coder at the time of coding. The medical coder is supposed to code only the terms reported by the investigator at the clinical trial site and is not permitted to add or interpret extraneous information from other sources. Performing medical coding under these stringent conditions is a challenging task for any medical coder in the industry. Some of the challenges faced by the medical coder are listed below.

1. Ambiguous information: The reported term is open to more than one interpretation. Such terms are common in clinical trial data that require coding. Sometimes ambiguous abbreviations are used instead of standard abbreviations (e.g. ECG, COPD and HIV).

Examples:

Pain (Site of pain: leg, hand, abdomen?)
Congestion (Site of congestion: nasal, pulmonary, liver?)

2. Vague information: The reported term is not clearly stated. Sometimes, instead of reporting a single event, the investigator describes the event in several sentences, which causes problems during medical coding.

Example: Subject is feeling like “Stoned” and complaining of “Cracking sounds in the ear”.

3. Ambiguous lab data: Sometimes the investigator reports clinically significant lab findings in Adverse Event forms.

Example: “Glucose of 32”. Here the source of the specimen is not mentioned, so assigning the appropriate term becomes a challenge for the coder.

4. Lack of specificity: Sometimes the reported term lacks specificity about the cause.

Example: Right limb tenderness, left limb swelling etc.

5. Conflicting laboratory data: Sometimes lab findings are recorded in a conflicting manner, with the lab parameter, its numeric value and a diagnostic term in the same field.

Example: Hypernatremia with serum sodium level 180 mEq/L

6. Combination terms: Sometimes a combination of events is recorded in the same field, which causes problems during coding. The terms should be split, with the sponsor's approval, to overcome this problem.

Example: Fever with rash and itching

7. Spelling errors: Some mistakes are to be expected during clinical data entry, and they interfere with the auto-coding process.

Example: Parecitamol instead of paracetamol
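For spelling errors such as the one above, an edit-distance function can suggest dictionary candidates for the coder to review. This is a sketch under stated assumptions and not an established coding rule: the candidate list in DICT and the cut-off of 2 are hypothetical, and COMPLEV is the SAS function that returns the Levenshtein edit distance between two strings.

```sas
/* Hypothetical list of dictionary terms to compare against. */
DATA DICT;
   INPUT TERM :$40.;
   DATALINES;
PARACETAMOL
PAROXETINE
PREDNISOLONE
;
RUN;

DATA CANDIDATES;
   SET DICT;
   LENGTH VERBATIM $40;
   VERBATIM = 'PARECITAMOL';                          /* misspelling from the example */
   DISTANCE = COMPLEV(STRIP(VERBATIM), STRIP(TERM));  /* Levenshtein edit distance    */
   IF DISTANCE LE 2;                                  /* hypothetical cut-off         */
RUN;

PROC PRINT DATA=CANDIDATES;
RUN;
```

The output is only a list of suggestions. The choice of term, and any correction to the reported data, remain with the medical coder and the discrepancy process.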

Conclusion:

Clinical research is a growing industry that plays a major role in accelerating the drug discovery process, as it is the link between the laboratory and the drug shelves in the market. For the approval of new drugs, Regulatory Authorities require larger subject enrollment in clinical trials for safety assessment, since safety is the major concern in any clinical trial. As a result, many pharma companies are now outsourcing their clinical research activities to Contract Research Organizations (CROs) on an end-to-end service basis. Because investigators and trial subjects come from various geographical areas, clinical trial data are prone to inconsistency, which significantly affects the data analysis, especially of safety data. To overcome inconsistency in safety data (AE, CM, MH), a standard and globally accepted terminology is needed to achieve consistency and uniformity across the clinical trial. Medical coding using medical dictionaries such as MedDRA and WHODD serves this purpose, making clinical trial data more consistent and suitable for effective analysis and thereby meeting the objectives of the clinical trial protocol.

About MakroCare

MakroCare is a global drug development services firm that operates through 4 main divisions: CRO, SMO, Informatics and Consulting. It offers integrated and innovative services in the areas of regulatory affairs, risk management, site management, patient recruitment, trial management (P II/III and late phase), biometrics, QA audits, PV/Safety, and informatics.

www.makrocare.com