FT-NIR spectroscopy and Laser Diffraction particle sizing of
APIs in Pharmaceutical formulations
Joana Lúcia Carrilho Figueiredo
Dissertation submitted to obtain the Master's degree in
Chemical Engineering (Integrated Master's)
Jury
Chair: Prof. João Carlos Moura Bordado
Supervisor: Prof. José Monteiro Cardoso de Menezes
Members: Prof. Helena Maria Rodrigues Vasconcelos Pinheiro
Dr. Paulo Alexandre de Araújo Loureiro Amaral
September 2008
ACKNOWLEDGEMENTS
This work would not have been possible without the support and encouragement of
Professor José Cardoso de Menezes under whose supervision I did this thesis.
I would also like to thank Dr. Paulo Amaral, the Quality Director of Lusomedicamenta.
He was a strong promoter of the application of this project.
Cristiana Rocha, the Quality Assurance supervisor of Lusomedicamenta, is acknowledged;
I thank her for her kindness, availability and readiness in the collection of raw materials and
solvents.
Thanks to Professor Maria Joana Neiva Correia for the availability of her laboratory,
where the samples were prepared.
Licínia Rodrigues and Pedro Ceitil are acknowledged for their generous share of
knowledge on the different topics in this thesis.
I would like to express my special gratitude to João Henriques, Ricardo Duarte, Ornella
Preisner and Pedro Felizardo for taking the time to discuss things and see them from a new
perspective whenever I needed it. Vera Lourenço and Gledson Emidio, my partners, are
recognized for their encouragement and great coffee time.
I wish to thank my best friend, Raquel Lopes, and my sister, Rita Figueiredo, for reading
the draft of the thesis and for their valuable comments.
I would also like to thank Filipe Calado, my boyfriend, for his loving support.
I cannot end without thanking my friends and my family for the constant encouragement
and love.
To them I dedicate this thesis.
ABSTRACT
Near-Infrared (NIR) spectroscopy associated with chemometrics and Laser Diffraction
have proven to be suitable tools for simple and rapid analysis in the pharmaceutical industry.
This work aims at the simultaneous determination of three Active Pharmaceutical
Ingredients (APIs), Paracetamol (PA), Pseudoephedrine Hydrochloride (PS) and
Dextromethorphan Hydrobromide (DX), in a pharmaceutical formulation using NIR
spectroscopy. In addition, the Particle Size Distribution (PSD) of each API was determined by
Powder Laser Diffraction.
NIR spectra contain chemical and physical information about each of the above components.
In order to explore the potential of NIR spectroscopy and to understand the
similarities/differences between the APIs, the spectra were analysed using different
pre-processing techniques. PA and PS are chemically more similar to each other than to DX
because they have the same functional groups. Physically, PA and DX have a Gaussian PSD,
while PS has a bimodal distribution. The interpretation of the physical results obtained by
NIR spectroscopy agrees with that obtained by Laser Diffraction.
Quantitative analysis of the pharmaceutical formulation was based on Partial Least
Squares (PLS) regression. The accuracy of the NIR calibration models was evaluated according
to the root mean square error of prediction (RMSEP), and the best results were 4 mg of PA,
3 mg of PS and 2 mg of DX per tablet.
The physical properties measured by both techniques were well correlated by Orthogonal
Projections to Latent Structure (OPLS) analysis, with a cross-validated predictive ability of
45.9%.
NIR spectroscopy, Powder Laser Diffraction or both techniques can be used for in-process
monitoring and control in pharmaceutical solid dosage production.
Keywords: API, Near Infrared Spectroscopy, PLS, Powder Laser Diffraction, Particle Size
Distribution
RESUMO
Near-Infrared (NIR) spectroscopy associated with chemometrics and Laser Diffraction
have proven to be suitable tools for simple and rapid analysis in the pharmaceutical
industry.
This work aims at the simultaneous determination of three Active Pharmaceutical
Ingredients (APIs), Paracetamol (PA), Pseudoephedrine Hydrochloride (PS) and
Dextromethorphan Hydrobromide (DX), in a pharmaceutical formulation using NIR
spectroscopy. In addition, the particle size distribution of each API was determined by
Powder Laser Diffraction.
NIR spectra contain chemical and physical information about each of the components
mentioned above. Thus, to explore the potential of NIR spectroscopy and to understand the
similarities/differences between the APIs, the spectra were analysed using different
pre-processing techniques. PA and PS are chemically more similar to each other than to DX
because they have the same functional groups. Physically, PA and DX have a Gaussian
particle size distribution, while PS has a bimodal distribution. The interpretation of the
physical results obtained by NIR spectroscopy agrees with that obtained by Laser
Diffraction.
The quantitative analysis of the pharmaceutical formulation was based on Partial Least
Squares (PLS) regression. The accuracy of the NIR calibration model was evaluated according
to the root mean square error of prediction (RMSEP), and the best results were 4 mg of
PA, 3 mg of PS and 2 mg of DX per tablet.
The physical properties measured by both techniques were well correlated by
Orthogonal Projections to Latent Structure (OPLS), with a cross-validated predictive
ability of 45.9%.
NIR spectroscopy, Powder Laser Diffraction or both techniques can be used for
process monitoring and control in the pharmaceutical production of solid dosage
forms.
Keywords: Active Pharmaceutical Ingredient, Near-Infrared Spectroscopy, PLS, Powder Laser
Diffraction, Particle Size Distribution
INDEX
Acknowledgements
Abstract
Resumo
Index
Index of Figures
Index of Tables
Abbreviations
1. Introduction
1.1. NIR Spectroscopy
1.1.1. Advantages vs. disadvantages
1.1.2. Applications
1.1.3. Instrumentation
1.2. Chemometrics
1.2.1. Qualitative analysis in NIR spectroscopy
1.2.1.1. Unsupervised classification methods
1.2.1.2. Supervised classification methods
1.2.2. Quantitative analysis in NIR spectroscopy
1.2.3. Spectra pre-processing
1.2.3.1. Mean centering
1.2.3.2. Autoscaling
1.2.3.3. Derivatives
1.2.3.4. Multiplicative Scatter Correction (MSC)
1.2.3.5. Standard Normal Variate (SNV)
1.2.4. Variables’ selection
1.2.5. Number of principal components (PC’s) needed
1.2.6. Outliers
1.2.7. Statistics
1.3. Powder Laser Diffraction
1.3.1. Advantages vs. disadvantages
1.3.2. Instrumentation
2. Experimental
2.1. NIR Spectroscopy
2.1.1. Sample preparation
2.1.2. Measurement
2.1.3. Software
2.2. Powder Laser Diffraction
2.2.1. Measurement
3. Results and Discussion
3.1. NIR spectroscopy and chemometric analysis of each API’s
3.2. Particle size distribution of each APIs
3.3. Quantitative analysis of API’s
3.3.1. First strategy
3.3.1.1. Calibration vs. Test sets
3.3.1.2. Data pre-processing
3.3.1.3. Variable selection
3.3.1.4. Number of PCs
3.3.1.5. Outliers
3.3.1.6. Statistics
3.3.1.7. First strategy without variable selection
3.3.1.8. First strategy with variable selection
3.3.2. Second strategy
3.3.2.1. Second strategy without variable selection
3.3.2.2. Second strategy with variable selection
3.4. Obtained results vs. other studies
3.5. Orthogonal analysis
4. Conclusions
5. Suggestions for future work
6. References
7. Appendix
7.1. Determination of Percent Relative Standard Deviation (%RSD)
7.2. Matrix design for laboratory samples
7.3. First Strategy
7.4. Second Strategy
7.5. Orthogonal analysis
7.6. Mastersizer Average Result Analysis Report
INDEX OF FIGURES
Figure 1 – The NIR region in electromagnetic spectrum [2].
Figure 2 – The NIR spectrometer with solid and tablet sampling accessory.
Figure 3 – Representation of a PCA model structure.
Figure 4 – Representation of a PLS model structure.
Figure 5 – The powder laser diffraction equipment.
Figure 6 – Detection of instrumental noise in an FT-NIR absorption spectrum of DX.
Figure 7 – FT-NIR absorption spectra of the three active principles obtained by diffuse reflectance.
Figure 8 – Chemical structure of DX, PA and PS, respectively.
Figure 9 – Scores plot of 2nd derivative (15 point Savitzky-Golay) spectra of the three APIs.
Figure 10 – Scores plot spectra of the three APIs without any pre-treatment.
Figure 11 – Scores plot spectra of six batches of DX from two different manufactures (without any pre-treatment).
Figure 12 – The measure background.
Figure 13 – Particle size distribution of the three APIs measured based on the Malvern optical model.
Figure 14 – FT-NIR MSC spectra of each calibration set (with DX (DSM)). PA concentration increases in the arrow direction between 77.1% and 92.3% (a); while PS concentration between 10.2% and 0% (b); and DX concentration among 5.4% and 0% (c).
Figure 15 – Scores plot of PA (with DX (DSM)) samples with the selected calibration and test sets based on NIR MSC and Mean Centering pre-processed spectra.
Figure 16 – Coefficient of Determination (R2) versus wavenumber of PA calibration set (with DX (DSM)).
Figure 17 – iPLS results for DX (DSM) calibration set.
Figure 18 – Optimal spectral region selected by iPLS for pre-processed previous spectrum.
Figure 19 – Diagnostic plots of GA analysis: Fitness vs. Number of variables (a); Evolution of average and best fitness (b); Evolution of number of variables (c); and Models with variable number (d).
Figure 20 – PRESS for PLS on PA calibration set (with DX (DSM)) data based on MSC, 1st derivative and Mean Centering pre-processing spectra.
Figure 21 – Analysis showing PLS Model of PA calibration set (with DX (DSM)).
Figure 22 – Studentized Residuals versus Leverage for PA calibration set (with DX (DSM)).
Figure 23 – Q residuals versus sample for PA calibration set (with DX (DSM)).
Figure 24 – Correlation between measured and predicted PA calibration set (with DX (DSM)) [●: calibration set; ▼: validation set].
Figure 25 – Correlation between measured and cross-validation predicted set (with DX (DSM)).
INDEX OF TABLES
Table 1 – The concentration range of each API for each calibration set.
Table 2 – The optical properties of APIs and dispersants.
Table 3 – Particle Size distributions obtained for different lots and APIs suppliers (percent relative standard deviation (%RSD) and weighted residual and obscuration).
Table 4 – Size distributions obtained for different batches for DX (DSM) and relative error.
Table 5 – The correlation between samples’ concentration for each calibration set with DX (DSM).
Table 6 – The best GA parameters chosen to use for DX (DSM) calibration set.
Table 7 – The best results for each calibration set without variable selection (using DX (DSM)).
Table 8 – The best results obtained for each calibration set (using DX (DSM)) with variable selection.
Table 9 – The best results obtained in the first strategy.
Table 10 – The correlation between samples’ concentration for each calibration set with DX (DSM).
Table 11 – The best results for each calibration set without variable selection (using DX (DSM)).
Table 12 – The best results for each calibration set (using DX (DSM)) using variable selection techniques.
Table 13 – The best results obtained in the first and second strategy.
Table 14 – The best results obtained in current and Alcalá’s study.
Table 15 – OPLS summary results for different pre-processing techniques of DX samples.
Table 16 – The residual in X and Y results.
Table 17 – The best results for each calibration model, using PLS regression, with variable selection (using DX (DSM)).
Table 18 – The weight percentage of each API and respectively RMSEP obtained for the best calibration set and the weight in of each active ingredient in a tablet.
Table 19 – The accuracy obtained in our study Alcalá’s study [35].
Table 20 – Matrix design for laboratory samples.
Table 21 – The correlation between samples’ concentration for each calibration set with DX (Divis).
Table 22 – The best results for each calibration set without variable selection (using DX (Divis)).
Table 23 – The best results for each calibration set with variable selection (using DX (Divis)).
Table 24 – The best results for each calibration set with variable selection (using DX (DSM)).
Table 25 – The correlation between samples’ concentration for each calibration set with DX (Divis).
Table 26 – The best results for each calibration set without variable selection (using DX (Divis)).
Table 27 – The best results for each calibration set with variable selection (using DX (Divis)).
ABBREVIATIONS
API – Active Pharmaceutical Ingredient
d(0.1) – Equivalent volume diameter at 10% cumulative volume
d(0.5) – Median of particle size distribution or equivalent volume diameter at 50% cumulative
volume
d(0.9) – Equivalent volume diameter at 90% cumulative volume
DX – Dextromethorphan Hydrobromide
ED – Euclidean Distance
EMEA – European Medicines Agency
FT – Fourier Transform
ICH – International Conference on Harmonisation
IF – Infrared
GA – Genetic Algorithm
LED – Light Emitting Diodes
LDA – Linear Discriminant Analysis
LV – Latent Variables
MC – Mean Centering
MD – Mahalanobis Distance
MLR – Multivariate Linear Regression
MSC – Multiplicative Scatter Correction
NIR – Near-Infrared
OPLS – Orthogonal Projections to Latent Structure
PA – Paracetamol
PASG – Pharmaceutical Analytical Sciences Group
PAT – Process Analytical Technology
PC’s – Principal Components
PCA – Principal Component Analysis
PCR – Principal Component Regression
PLS – Partial Least Squares
PLS-DA – Partial Least Squares Discriminant Analysis
PRESS – Prediction Residual Error Sum of Squares
PS – Pseudoephedrine Hydrochloride
PSD – Particle Size Distribution
RI – Refractive Index
RMSECV – Root Mean Square Error of Cross-Validation
RMSEP – Root Mean Square Error of Prediction
SG – Savitzky-Golay
SIMCA – Soft Independent Modelling of Class Analogy
SNV – Standard Normal Variate
US FDA – United States Food and Drug Administration
UV – Ultraviolet
% (w/w) – Weight Percentage
1st D – First Derivative
2nd D – Second Derivative
R2 (X)p – The percentage of X data explained by the model of predictive set
R2 (Y)p – The percentage of Y data explained by the model of predictive set
Q2p – The percentage of variation predicted by the model according to cross-validation
predicted set
LVp – Latent Variables of predictive set
R2 (X)o – The percentage of X data explained by the model of orthogonal set
R2 (Y)o – The percentage of Y data explained by the model of orthogonal set
Q2o – The percentage of variation predicted by the model according to cross-validation
orthogonal set
LVo – Latent Variables of orthogonal set
1. INTRODUCTION
In recent years, the pharmaceutical industry has developed and implemented innovative
approaches to ensure final product quality and to reduce production costs, in accordance
with the Process Analytical Technology (PAT) initiative of the US Food and Drug
Administration (FDA) [1, 12]. The goal of PAT is to monitor and control manufacturing
processes in real time, to increase process understanding and to ensure that the quality of
the final product is obtained consistently [1].
Several PAT monitoring tools are available. This thesis focuses only on NIR
Spectroscopy and Powder Laser Diffraction.
1.1. NIR Spectroscopy
In 1800, William Herschel discovered infrared radiation by passing sunlight
through a prism. However, only in the 1960s did Near-Infrared (NIR) spectroscopy emerge in
the analytical world, with the work of Karl Norris of the US Department of Agriculture [3].
Nowadays, this important analytical technology is used in many industrial fields, including
the petrochemical, medical, environmental, pharmaceutical and textile industries.
In the electromagnetic spectrum, the NIR region lies between the mid-infrared and the
visible regions. In the wavenumber range 4000-14000 cm-1 (wavelengths of approximately
2500-700 nm), the absorption of radiation by overtone and combination bands of covalent
bonds such as N-H, O-H and C-H in organic molecules can be measured using a NIR
instrument (Figure 1).
Figure 1 – The NIR region in electromagnetic spectrum [2].
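The correspondence between the wavenumber and wavelength limits quoted above follows from the reciprocal relation between the two quantities. The short sketch below illustrates the conversion; the function names are hypothetical and are not part of the thesis.

```python
def wavenumber_to_wavelength_nm(wavenumber_cm: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    if wavenumber_cm <= 0:
        raise ValueError("wavenumber must be positive")
    # wavelength [cm] = 1 / wavenumber [cm^-1], and 1 cm = 1e7 nm
    return 1e7 / wavenumber_cm


def wavelength_nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return 1e7 / wavelength_nm


if __name__ == "__main__":
    # Limits of the NIR region as quoted in the text.
    for nu in (4000.0, 14000.0):
        print(f"{nu:.0f} cm^-1 -> {wavenumber_to_wavelength_nm(nu):.0f} nm")
```

Note that the relation is inverse: the high-wavenumber limit (14000 cm-1) corresponds to the short-wavelength limit (about 714 nm), and 4000 cm-1 corresponds to 2500 nm.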
1.1.1. Advantages vs. disadvantages
NIR spectroscopy measures the light absorbed, reflected or transmitted by a sample at a
given wavelength. In the NIR region, absorption is weaker than in the adjacent regions of the
spectrum, because the bands correspond to high-order overtones. Consequently, the method
does not require prior sample treatment (e.g. dilution), which allows rapid and easy analysis.
In addition, the sample can be reused because the method is non-destructive. The pathlengths
and the ability to sample through glass in the NIR allow samples to be measured in common
solid and liquid forms.
Like other techniques, NIR spectroscopy also has some drawbacks. Its low sensitivity
restricts the determination of active principles present at less than 0.01% (w/w)
[10]. NIR spectroscopy is also an indirect method: it requires a reference method and
chemometric techniques (statistical and mathematical procedures) to extract and interpret
the spectral information acquired from the sample NIR spectra.
1.1.2. Applications
NIR spectra capture the chemical and physical variability of the samples, which can be
exploited in several applications. In the pharmaceutical industry, NIR spectroscopy is applied
to qualify and/or quantify active pharmaceutical ingredients (APIs) and excipients, and to
characterize polymorphic form, granulation, powder blending, drying and coating of
in-process product, among others.
1.1.3. Instrumentation
Different NIR spectrometers and sample accessories are available to cover the wide range
of NIR spectroscopy applications.
A spectrometer is made up of four components: a light source; a wavelength selector;
sample accessories; and a radiation detector. Light from the source passes through the
wavelength selector, which selects a limited region of the spectrum. The light leaving the
wavelength selector strikes the sample and the emerging beam is captured by the detector [7].
The most frequently employed sources in NIR spectrometers are tungsten and
tungsten-halogen lamps, because they generate continuous radiation. Light Emitting Diodes
(LEDs) are also used, since they offer a longer lifetime, are much more efficient than other
sources and can also act as the wavelength selector. However, they only emit a limited range
of wavenumbers and are very expensive [7-8].
Wavelength selectors are used to provide a narrow band of radiation. Four general types
are commercially available: filter instruments; LED sources; dispersive
optics-based instruments; and interferometric (Fourier Transform) instruments.
Filter instruments use a bandpass filter that allows just a particular slice of the spectrum to
pass through [8-9].
Older dispersive instruments used prisms, which separate the different
wavelengths by refraction. Current instruments use
monochromators, which consist of entrance and exit slits, mirrors, and a grating to disperse
the light. Nearly all commercial dispersive monochromators use diffraction gratings, because
they are more efficient than the alternatives [7-8].
Fourier-Transform NIR spectrometers have several advantages over the other wavelength
selectors, because they show the best resolution and signal-to-noise ratios. These instruments
benefit from the Jacquinot (or throughput) advantage and the Fellgett (or multiplex)
advantage. FT instruments do not require slits to achieve resolution and consequently achieve
higher throughput than dispersive instruments (the Jacquinot advantage). Furthermore, they
collect all wavelengths simultaneously, which increases the detection efficiency of signals
(the Fellgett or multiplex advantage) [3-7].
Depending on the type of analysis required, the analyst has to choose the proper sampling
and spectra acquisition mode. For on-line¹ analysis fibre optic probes are used, whereas if the
sample is removed from the process stream and analysed at-line or in a laboratory, several
spectra acquisition accessories (e.g. fibre optic probes, vial holders, powder or tablet sampling
accessories) can be used, depending on the type of sample (solid, tablet, liquid, etc.) [3, 34].
Sample information in the NIR region is usually collected as an absorption spectrum
through transmission or diffuse reflectance measurements with a NIR spectrometer. In
transmission, the light passes through the sample; this is common for transparent liquid
samples held in quartz cuvettes. In diffuse reflectance, incident radiation is projected
onto the surface of the sample and is reflected at different angles, which commonly occurs
with powder or solid samples [3].
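As an illustration of how both measurement modes lead to an absorption spectrum, the sketch below applies the usual logarithmic conversions: absorbance from transmittance, A = -log10(T), and apparent absorbance from diffuse reflectance, A = log10(1/R). These are standard relations rather than formulas quoted from the thesis, and the function names are hypothetical.

```python
import math


def absorbance_from_transmittance(transmittance: float) -> float:
    """Absorbance A = -log10(T) for a transmittance T in (0, 1]."""
    if not 0.0 < transmittance <= 1.0:
        raise ValueError("transmittance must be in (0, 1]")
    return -math.log10(transmittance)


def apparent_absorbance_from_reflectance(reflectance: float) -> float:
    """Apparent absorbance A = log10(1/R) for a diffuse reflectance R in (0, 1]."""
    if not 0.0 < reflectance <= 1.0:
        raise ValueError("reflectance must be in (0, 1]")
    return math.log10(1.0 / reflectance)


if __name__ == "__main__":
    # A liquid sample transmitting 10% of the light, and a powder
    # reflecting 50% of it (illustrative values only).
    print(absorbance_from_transmittance(0.10))
    print(apparent_absorbance_from_reflectance(0.50))
```

In practice the instrument software performs this conversion against a background (reference) measurement, so the values above only illustrate the arithmetic.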
1 The sample is measured without being removed from the process stream.
Detectors convert radiant energy into a measurable signal, and the choice of detector depends
on the wavelength range to be measured. Silicon detectors are used for a limited wavelength
range (between 700 and 1100 nm), while InGaAs and PbS detectors are more suitable for the
wider range of 1100-2500 nm.
Figure 2 – The NIR spectrometer with solid and tablet sampling accessory.
Two important parameters for good spectra collection with an FT-NIR spectrometer
are the number of scans and the resolution. The resolution determines the smallest
wavenumber interval that can be distinguished over the spectral range, and typically
lies between 4 and 64 cm-1. At a higher resolution (a smaller interval) the spectrum
becomes more detailed, but it also captures more noise and the analysis takes longer.
Averaging a larger number of scans improves the signal-to-noise ratio, since random noise
decreases as more scans are co-added; typical values are between 16 and 128 scans [17].
Setting these parameters is a compromise between the operation time and the analysis
quality desired.
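The effect of co-adding scans can be illustrated with a small simulation. The sketch below assumes purely random, uncorrelated detector noise (an idealization that the text does not state), in which case averaging N scans lowers the noise by roughly the square root of N. The wavenumber axis, the synthetic band and the noise level are arbitrary, hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
wavenumber = np.linspace(4000, 12000, 2000)   # cm-1, hypothetical axis
true_spectrum = 0.5 * np.exp(-((wavenumber - 6000) / 300) ** 2)  # synthetic band


def averaged_noise(n_scans, noise_sd=0.01):
    """Residual noise (standard deviation) after averaging n_scans noisy scans."""
    noise = rng.normal(0.0, noise_sd, size=(n_scans, wavenumber.size))
    scans = true_spectrum + noise
    return float(np.std(scans.mean(axis=0) - true_spectrum))


noise_16 = averaged_noise(16)
noise_64 = averaged_noise(64)
# Four times more scans should roughly halve the noise (sqrt(64 / 16) = 2)
ratio = noise_16 / noise_64
print(f"noise with 16 scans: {noise_16:.5f}, with 64 scans: {noise_64:.5f}")
```

Under this assumption, going from 16 to 64 scans costs four times the acquisition time for about half the noise, which is the compromise described above.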
1.2. Chemometrics
After collecting the NIR spectra, the processing and interpretation of the multivariate data for
qualitative and quantitative analysis is done in chemometric software. The first step is to split
the data set into two groups: a calibration set and a test set. As a rule of thumb, two thirds of
the data set are used for calibration and the rest is used for testing [14]. Then,
the model is developed and optimized with respect to the spectra pre-processing, the number of
variables, and the identification and elimination of outliers. In the following step, the model is
applied to the test set to check its robustness and efficiency. Finally, it should be
validated with unknown samples.
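The two-thirds/one-third split can be sketched as follows. The random assignment and the function name `split_calibration_test` are assumptions made for illustration; the text does not state how samples were assigned to each set.

```python
import numpy as np


def split_calibration_test(n_samples, calibration_fraction=2 / 3, seed=0):
    """Randomly assign sample indices to a calibration set and a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_cal = int(round(calibration_fraction * n_samples))
    return np.sort(order[:n_cal]), np.sort(order[n_cal:])


# Hypothetical data set of 30 spectra
cal_idx, test_idx = split_calibration_test(30)
print(len(cal_idx), "calibration samples;", len(test_idx), "test samples")
```

In practice the split is often designed so that both sets span the full concentration range, rather than being purely random.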
One of the advantages of NIR spectroscopy is its ability to classify/identify and
to quantify samples.
1.2.1. Qualitative analysis in NIR spectroscopy
Qualitative analysis uses NIR spectral information to identify and to classify samples, for
example in raw-material libraries. These techniques can be unsupervised or supervised. In
unsupervised classification no a priori assumption is made about the samples that are
going to be classified, while supervised classification requires knowledge about the category
membership of samples [11].
1.2.1.1. Unsupervised classification methods
Several unsupervised methods are available, but the most common are Principal
Component Analysis (PCA) and hierarchical methods.
PCA is usually the first step of the data analysis, used to detect patterns in
multivariate data. By reducing the dimensionality of the original data, PCA allows the
relevant features of the spectra to be visualized. The original data (X)
is decomposed into scores, T (the values that represent the samples in the space defined by the
principal components), and loadings, Lᵀ (the correlation coefficients between the original
variables and the principal components) [6]. PCA selects the directions that retain the maximum
structure of the data in a lower dimension (Figure 3).
Figure 3 – Representation of a PCA model structure.
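The decomposition of X into scores and loadings can be sketched with a singular value decomposition. This is a minimal illustration and not the routine used in this work: the loadings are taken here as the orthonormal directions returned by the SVD (a common convention that the text does not specify), the function name `pca` is hypothetical, and the data are random numbers standing in for spectra.

```python
import numpy as np


def pca(X, n_components):
    """Decompose mean-centred X into scores T, loadings L and residuals E."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores (samples x PCs)
    L = Vt[:n_components].T                      # loadings (variables x PCs)
    E = Xc - T @ L.T                             # part of X not explained
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return T, L, E, explained


rng = np.random.default_rng(1)
X = rng.normal(size=(10, 50))   # 10 hypothetical spectra, 50 variables
T, L, E, explained = pca(X, n_components=2)
print("variance explained by PC1 and PC2:", np.round(explained, 3))
```

A scores plot is then simply the first column of `T` plotted against the second.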
Hierarchical methods evaluate the similarity of samples in terms of their NIR
spectra and produce a sequence of clusters, which can be represented graphically as a
dendrogram [12].
1.2.1.2. Supervised classification methods
The most used classical methods of supervised classification are distance based methods;
linear discriminant analysis (LDA); soft independent modelling of class analogy (SIMCA);
and partial least squares discriminant analysis (PLS-DA) [3, 12-14].
Distance based methods measure the similarity or dissimilarity between the test samples and
the calibration samples. The Euclidean Distance (ED) and the Mahalanobis Distance
(MD) are the most popular distance methods. In ED, all directions in spectral space have the
same weight, so points at equal distance from the centre lie on a circle. The MD weights each
direction by the variability of the data set along it, so points at equal distance lie on an
ellipsoid that follows the distribution of the data.
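The difference between the two distances can be sketched on a hypothetical two-dimensional data set with much more spread along one axis than along the other. The data, the function names and the two test points are illustrative assumptions. The covariance matrix of full spectra is usually singular, so in practice the MD is normally computed on a few scores rather than on the raw variables.

```python
import numpy as np


def euclidean_distance(x, calibration):
    """Distance from x to the calibration centre, all directions weighted equally."""
    return float(np.linalg.norm(x - calibration.mean(axis=0)))


def mahalanobis_distance(x, calibration):
    """Distance from x to the calibration centre, weighted by the data covariance."""
    diff = x - calibration.mean(axis=0)
    cov = np.cov(calibration, rowvar=False)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


rng = np.random.default_rng(2)
# Hypothetical scores: large spread along axis 1, small spread along axis 2
calibration = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])
a = np.array([3.0, 0.0])   # displaced along the high-variance direction
b = np.array([0.0, 3.0])   # displaced along the low-variance direction

for name, point in (("a", a), ("b", b)):
    print(name, euclidean_distance(point, calibration),
          mahalanobis_distance(point, calibration))
```

Both points are about equally far from the centre in the Euclidean sense, but `b` is much farther in the Mahalanobis sense because it lies in a direction where the calibration set hardly varies.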
LDA can be considered as a method similar to PCA with the difference that LDA focuses
on finding the direction that achieves a maximum separation among classes of a data set.
In the SIMCA method a separate principal component model is calculated for each class. This is
the most used class-modelling technique.
The aim of PLS-DA is to find the variables and directions in the multivariate space which
discriminate the established classes in the calibration set.
1.2.2. Quantitative analysis in NIR spectroscopy
Quantitative analysis is based on Beer’s Law, which states that the absorbance
measured for each sample is proportional to the concentration. The quantitative models are
developed using NIR spectral information (X variable), which depends directly on an
analyte concentration or a property that has been determined (Y variable). The most
employed techniques of quantitative analysis using NIR spectroscopy are Multiple
Linear Regression (MLR), Principal Component Regression (PCR), Partial Least Squares
(PLS) and Orthogonal PLS (OPLS) [3, 12-14].
MLR establishes a linear relationship between a reduced number of
regression variables and a property of the samples (e.g. the concentration values). This
technique is very limited and is consequently less used in current applications than the
other methods. It should only be applied when there are more samples than
variables.
A PCR model is built in two steps. First the spectral data is compressed with a PCA, and then
the concentration data is regressed against the scores² matrix using a method similar to MLR.
This method can be used for very complex mixtures, even when the number of spectral variables
is larger than the number of calibration samples.
The PLS method is similar to PCR, but produces better models using a lower number of
principal components (vide 1.2.5). Nevertheless, both methods (PCR and
PLS) require a large number of samples for accurate calibration, and collinear
constituent concentrations must be avoided. In PLS, the original data (X) is decomposed into
scores, T; loadings, Lᵀ; and residuals, E (the relative distance between the model and the
observed points) [6] (Figure 4).
Figure 4 – Representation of a PLS model structure.
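The decomposition of X into scores, loadings and residuals can be made concrete with a short PLS1 sketch (a single Y variable) based on the NIPALS algorithm. This is an illustrative implementation and not the PLS toolbox routine; the function names and the synthetic data are assumptions.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """NIPALS PLS1: returns the regression vector and the centring means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)       # weight vector
        t = Xk @ w                      # scores of this component
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        q = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)        # deflation; what is left over is E
        yk = yk - q * t
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    W, P = np.array(weights).T, np.array(loadings).T
    b = W @ np.linalg.solve(P.T @ W, np.array(y_loadings))
    return b, x_mean, y_mean


def pls1_predict(X_new, b, x_mean, y_mean):
    return (X_new - x_mean) @ b + y_mean


rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))                          # hypothetical "spectra"
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 4.0    # noise-free property
fit_1 = pls1_predict(X, *pls1_fit(X, y, 1))
fit_5 = pls1_predict(X, *pls1_fit(X, y, 5))
print("RMSE with 1 component:", np.sqrt(np.mean((y - fit_1) ** 2)))
print("RMSE with 5 components:", np.sqrt(np.mean((y - fit_5) ** 2)))
```

With as many components as variables the model reproduces ordinary least squares, so the noise-free data are fitted exactly; with real spectra far fewer components are retained.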
The OPLS method is a modification of PLS in which the independent set (X variables) is
separated into two parts: a predictive part that is linearly related to the dependent set
(Y variables), and a part that is orthogonal to it [36-38].
1.2.3. Spectra pre-processing
Spectra often contain noise arising from instrument errors, or are affected by
physical effects such as light scattering. Pre-processing the data reduces or even removes
the contribution from noise and enhances the chemical signal of interest.
Several pre-treatment methods can be used to remove the non-constituent
information, such as mean centering, autoscaling, first and second derivatives,
multiplicative scatter correction and standard normal variate, among others. Sometimes, it can be
useful to apply a combination of pre-processing algorithms to improve the quality of the
model.
2 The individual transformed observations are called scores, while the participation of the original variables in
the principal components is given by the loadings [15].
1.2.3.1. Mean centering
Mean centering involves the subtraction of the average spectrum from each
spectrum, which enhances the differences between the samples. This technique can
improve the prediction accuracy of the model.
1.2.3.2. Autoscaling
Autoscaling, like mean centering, removes absolute intensity information; moreover, it
also removes the total variance information of each variable, since each variable is scaled to
unit variance. This technique is often used when the X-variables do not have the same units of
measurement.
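Mean centering and autoscaling are both column-wise operations on the data matrix (samples in rows, variables in columns) and can be sketched in a few lines. The function names and the example data are hypothetical.

```python
import numpy as np


def mean_center(X):
    """Subtract the average spectrum (column means) from every spectrum."""
    return X - X.mean(axis=0)


def autoscale(X):
    """Mean-centre each variable and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


rng = np.random.default_rng(0)
# Three hypothetical variables measured on very different scales
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(20, 3))
X_mc = mean_center(X)
X_as = autoscale(X)
print("column std after mean centering:", X_mc.std(axis=0, ddof=1).round(2))
print("column std after autoscaling:  ", X_as.std(axis=0, ddof=1).round(2))
```

After mean centering the variables keep their original spread, whereas after autoscaling every variable has the same weight in the model.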
1.2.3.3. Derivatives
Derivatives of spectral data are used to remove offset and background slope variations
between samples. The first derivative of a spectrum removes the baseline offset, while the
second derivative also eliminates slope differences between spectra and effectively
minimizes the effect of the physical properties of a sample [24]. The most common algorithm
employed for derivatives is the Savitzky-Golay (SG) method, which requires the number of data
points in the smoothing window to be specified.
The main disadvantage of this technique is that the resulting spectra are difficult to
interpret.
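The effect of the first and second derivatives on a baseline can be checked numerically. The sketch below uses simple finite differences (`numpy.gradient`) in place of the Savitzky-Golay filter, so there is no smoothing; the synthetic band, offset and slope are hypothetical values.

```python
import numpy as np

x = np.linspace(4000, 9000, 501)             # hypothetical wavenumber axis, cm-1
band = np.exp(-((x - 6000) / 150) ** 2)      # synthetic absorption band
spectrum_a = band
spectrum_b = band + 0.2 + 1e-4 * (x - 4000)  # same band + offset + sloped baseline


def derivative(y, axis_values, order):
    """Finite-difference derivative of the requested order (no smoothing)."""
    for _ in range(order):
        y = np.gradient(y, axis_values)
    return y


d1_a, d1_b = derivative(spectrum_a, x, 1), derivative(spectrum_b, x, 1)
d2_a, d2_b = derivative(spectrum_a, x, 2), derivative(spectrum_b, x, 2)

# First derivative: the offset is gone, the slope survives as a constant shift
print("max |d1_b - d1_a|:", np.max(np.abs(d1_b - d1_a)))
# Second derivative: offset and slope are both removed
print("max |d2_b - d2_a|:", np.max(np.abs(d2_b - d2_a)))
```

On real spectra the derivative amplifies noise, which is why a smoothing filter such as Savitzky-Golay is normally used instead of plain differences.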
1.2.3.4. Multiplicative Scatter Correction (MSC)
The MSC pre-process reduces the spectral variability caused by pathlength effects such as
different particle size and light scattering among samples, generally in diffuse reflectance
spectroscopy. Mathematically, this method calculates the average spectrum from all the data
in the calibration set and uses it as the reference spectrum; each spectrum is then regressed
against this reference and corrected with the fitted offset and slope [12, 17].
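A minimal sketch of MSC is given below: each spectrum is fitted against the mean spectrum by a straight line, and the fitted intercept and slope are then removed. The function name `msc` and the synthetic spectra are assumptions for illustration.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, spectrum in enumerate(spectra):
        # Least-squares fit: spectrum ~ intercept + slope * ref
        slope, intercept = np.polyfit(ref, spectrum, 1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected, ref


# Hypothetical spectra: one chemical profile seen with different scatter effects
base = np.exp(-np.linspace(-3, 3, 100) ** 2)
spectra = np.vstack([1.0 * base + 0.00,
                     1.3 * base + 0.10,
                     0.8 * base - 0.05])
corrected, ref = msc(spectra)
print("spread before:", np.ptp(spectra, axis=0).max())
print("spread after: ", np.ptp(corrected, axis=0).max())
```

For new samples the reference spectrum of the calibration set should be reused, which is why `msc` accepts an optional `reference` argument.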
1.2.3.5. Standard Normal Variate (SNV)
Like MSC, the SNV method is used to remove scattering effects from the spectral
data, but the correction factors are determined differently. Each spectrum is corrected
individually by first centering its values, and then scaling the centered spectrum by
the standard deviation calculated from its own values [12, 17].
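SNV works row-wise, on one spectrum at a time, and needs no reference spectrum. A minimal sketch, with a hypothetical function name and synthetic spectra:

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / sd


# Hypothetical spectra: the same profile with different offsets and scalings
base = np.sin(np.linspace(0, 3, 50))
spectra = np.vstack([base, 2.0 * base + 0.3, 0.5 * base - 0.1])
corrected = snv(spectra)
print("row means:", corrected.mean(axis=1).round(12))
print("row stds: ", corrected.std(axis=1, ddof=1).round(12))
```

Because each spectrum is corrected with its own mean and standard deviation, a new sample can be treated without any information from the calibration set.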
1.2.4. Variables’ selection
A variable selection algorithm reduces the number of variables, which usually contain
redundant information and noise. There are several strategies to select the relevant variables
that produce the ‘best’ model, such as the Coefficient of Determination, iPLS (MATLAB
toolbox) and the Genetic Algorithm (GA) (MATLAB toolbox).
The Coefficient of Determination (R²) correlates the NIR spectral information (X variable)
with the analyte concentration (Y variable). This coefficient varies between 0 and 1, and the
highest coefficient indicates the best correlated region of the spectrum.
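One way to read this criterion is sketched below, assuming that R² is computed as the squared Pearson correlation between each spectral variable and the concentration (the text does not give the exact formula). The function name and the data are hypothetical.

```python
import numpy as np


def r2_per_variable(X, y):
    """Squared Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return r ** 2


rng = np.random.default_rng(6)
y = rng.uniform(0, 10, size=15)    # hypothetical concentrations
X = rng.normal(size=(15, 8))       # hypothetical absorbances at 8 wavenumbers
X[:, 3] = 0.02 * y + 0.1           # one variable proportional to concentration
r2 = r2_per_variable(X, y)
best_variable = int(np.argmax(r2))
print("R2 per variable:", r2.round(3))
print("best correlated variable:", best_variable)
```

A variable with zero variance would give a division by zero here, so constant columns should be removed beforehand.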
Like the coefficient of determination, iPLS investigates the collinear variables of data
sets. This method splits the data set into equidistant intervals and calculates a PLS model for
each interval. iPLS is able to focus on the important spectral regions, which have less
interference [18].
GA is an optimization method based on the genetic processes of biological organisms.
Initially, it randomly generates a population of individuals, which are represented in an
encoded form called chromosomes. Next, the fitness of each chromosome is evaluated, and
the genetic operators are applied: selection, crossover and mutation. Lastly, it is checked
whether the new population satisfies the termination conditions; otherwise, everything is
repeated from the fitness step until a certain percentage of chromosomes are identical. The major
advantage of GA is its flexibility and robustness, but there is an inherent risk of overfitting
[19-21].
1.2.5. Number of principal components (PC’s) needed
The model is based on reducing the number of variables; consequently, it is essential to
select the number of PC’s which best define the model. Each PC contains different relevant
information, but the first components represent the most important data variation. If too many
components are used, too much information is included in the model, which becomes
overfitted: the model will be dependent on the calibration data and will predict new samples
poorly. On the other hand, with too few components the model will not capture
enough variability in the data and will be underfitted. So, the optimal number of PC’s lies
between the two extremes.
The main problem in choosing the number of PC’s is subjectivity. The number of PC’s
can be selected according to the minimum Prediction Residual Error Sum of Squares (PRESS).
PRESS is the sum of the squared differences between the predicted and the reference values of
the samples left out of the model.
Before the number of PC’s is calculated, however, the model should be tested. There are two
different methods to test it: Self-Prediction and Cross-Validation. Self-Prediction predicts the
same samples used for model building, which does not guarantee the model performance. The
Cross-Validation method is based on predicting subsets of samples previously removed from the
calibration set. The subsets can be selected by a leave-one-out or a
contiguous block procedure. Cross-Validation has two main advantages over Self-Prediction. The
first is a more reliable estimate of the model performance, since the predicted samples are not
the same as the samples used to build the model. The second is that outliers are simple to
detect.
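Leave-one-out cross-validation and the PRESS criterion can be sketched as follows. For brevity the sketch uses PCR as a stand-in for the regression model; the function names and the synthetic data (two underlying components plus a little noise) are assumptions.

```python
import numpy as np


def pcr_fit_predict(X_cal, y_cal, X_new, n_pc):
    """Principal component regression: fit on the calibration set, predict X_new."""
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    Xc = X_cal - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_pc].T                                   # loadings
    b = np.linalg.lstsq(Xc @ V, y_cal - y_mean, rcond=None)[0]
    return (X_new - x_mean) @ V @ b + y_mean


def press_loo(X, y, max_pc):
    """PRESS for 1..max_pc components by leave-one-out cross-validation."""
    press = np.zeros(max_pc)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                 # leave sample i out
        for k in range(1, max_pc + 1):
            pred = pcr_fit_predict(X[keep], y[keep], X[i:i + 1], k)[0]
            press[k - 1] += (y[i] - pred) ** 2
    return press


rng = np.random.default_rng(4)
scores = rng.normal(size=(20, 2)) * np.array([3.0, 1.0])
loadings = rng.normal(size=(2, 30))
X = scores @ loadings + 0.01 * rng.normal(size=(20, 30))   # hypothetical spectra
y = scores[:, 0] + 3.0 * scores[:, 1]                      # hypothetical property
press = press_loo(X, y, max_pc=4)
best_n_pc = int(np.argmin(press)) + 1
print("PRESS per number of PCs:", press.round(4))
print("number of PCs with minimum PRESS:", best_n_pc)
```

In this example PRESS drops sharply from one to two components, since the property depends on both underlying components; adding further components only fits noise.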
1.2.6. Outliers
An outlier is a sample which has different characteristics from the calibration set. An outlier
can arise for several reasons, such as an instrumental or experimental error, a change of
operating conditions, etc.
If the differences between a supposed outlier and the calibration set are significant, the
sample does not fit the model well and should be identified and eliminated to build an
efficient model. In fact, not all outliers are erroneous, i.e., some observations can be just
slightly different from the rest, which can contribute to the model robustness. To distinguish
between erroneous and non-erroneous outliers, diagnostic tools are required, such
as leverage, residuals (e.g. Y-studentized residual and spectral residual) and Hotelling’s T².
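Two of these diagnostics, Hotelling’s T² and leverage, can be computed directly from the PCA scores. The sketch below assumes mean-centred, mutually orthogonal scores (as obtained from PCA) and uses hypothetical data in which one sample is made deliberately extreme; decision limits are not shown.

```python
import numpy as np


def pca_scores(X, n_components):
    """Scores of the mean-centred data on the first principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def hotelling_t2(T):
    """Hotelling's T2 of each sample from its scores."""
    return np.sum(T ** 2 / T.var(axis=0, ddof=1), axis=1)


def leverage(T):
    """Leverage of each sample; valid for centred, orthogonal score columns."""
    n = T.shape[0]
    return 1.0 / n + hotelling_t2(T) / (n - 1)


rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))   # hypothetical data set
X[7] *= 8.0                    # sample 7 is made deliberately extreme
T = pca_scores(X, 2)
t2 = hotelling_t2(T)
h = leverage(T)
suspect = int(np.argmax(t2))
print("sample with the highest T2:", suspect)
```

A sample flagged in this way is only a candidate outlier; as noted above, it should be inspected before it is removed.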
1.2.7. Statistics
After the calibration equation is determined, the accuracy of the model is evaluated with
parameters such as the coefficient of determination of the model (R²), the root mean
square error of cross-validation (RMSECV) and the root mean square error of prediction
(RMSEP).
The coefficient of determination of the model is calculated between the NIR predicted values
and the reference measurement values, from the calibration and the test sets [13].
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_{\backslash i} \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \qquad (1)$$

where $\hat{y}_{\backslash i}$ is the estimated result for sample i when the model is constructed with
sample i removed, $y_i$ is the reference measurement result for sample i, and $\bar{y}$ is the mean of
the reference measurement results for all samples in the calibration and test sets.
The root mean square error of cross-validation is calculated as follows (2), where n is the
number of samples in the calibration set [13].
$$\mathrm{RMSECV} = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_{\backslash i} \right)^2}{n}} \qquad (2)$$
The RMSECV calculated by cross-validation may give over-optimistic results, because the
same samples used for calibration development are also applied to validate the model.
For the test set, the root mean square error of prediction is calculated according to equation
(3), where $\hat{y}_i$ is the result estimated by the model for test sample i and m is the number of
samples in the test set [13].

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2}{m}} \qquad (3)$$
The optimum model is the one with the lowest RMSECV and RMSEP and the highest R².
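Equations (1) to (3) translate directly into code. In the sketch below the same `rmse` function gives the RMSECV when it is fed cross-validated predictions of the calibration samples, and the RMSEP when it is fed predictions of the test samples. The function names and the numerical values are hypothetical.

```python
import numpy as np


def r_squared(y_ref, y_pred):
    """Coefficient of determination, equation (1)."""
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_ref - y_pred) ** 2)
    ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_ref, y_pred):
    """Root mean square error, equations (2) and (3)."""
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))


# Hypothetical reference values and predictions (% w/w)
y_ref = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_ref, y_pred)
error = rmse(y_ref, y_pred)
print(f"R2 = {r2:.3f}, RMSE = {error:.4f}")
```

For these values the sum of squared residuals is 0.10 and the total sum of squares is 5, so R² is 0.98 and the RMSE is the square root of 0.025.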
1.3. Powder Laser Diffraction
The particle size of APIs and excipients has a strong influence on their handling and
processing, which can be crucial in the manufacturing process. Thus, particle size
distribution (PSD) analysis is of great importance for process optimization and control.
Several precise and accurate analytical methodologies exist for the characterization of
particle size. The most common techniques are optical microscopy, analytical sieving
and powder laser diffraction, which may be used depending on the measuring purpose.
The analytical sieving method is an old but cheap technique. Usually, this method is
applied to powdered materials having a particle size of more than about 75 µm [23].
Optical microscopy is used to observe the morphological appearance and shape of the
particle. This method can generally be applied to particles in the size range between 0.5 and
100 µm; however, it is not suitable as a quality or production control technique [23].
The most regularly applied technique is powder laser diffraction, which was used in
this study.
1.3.1. Advantages vs. disadvantages
The powder laser diffraction system allows a rapid measurement with a small volume of
sample, without the need of any external calibration. Moreover, the powder laser diffraction
equipment has a high reproducibility, is very flexible, and has the ability to analyse dry or wet
particles.
Figure 5 – The powder laser diffraction equipment.
In wet analysis it is fundamental to choose a good dispersant, to guarantee that the
sample does not dissolve.
Mie theory assumes spherical particles, so the particle size is reported as an equivalent
sphere diameter, although the majority of particles are irregular [24]. Nevertheless, comparing
some feature of the actual particle to an imaginary spherical particle is the easiest way to
obtain a single number that describes an irregularly shaped particle [22].
1.3.2. Instrumentation
Powder laser diffraction is one of the most used techniques for particle size analysis. In this
method the sample passes through a focused He-Ne laser beam (λmax = 633
nm) [23]. The particles scatter light at an angle that is inversely proportional to their size,
and the scattered light is measured by photosensitive detectors. According to Mie theory, the
particle size can be calculated from the scattering intensity and angle information. For this
calculation it is necessary to specify the refractive index (RI) and the absorption of the
material under study [22-23].
2. EXPERIMENTAL
A quantitative NIR spectroscopy method was developed, and the PSD of the APIs in a
pharmaceutical formulation was determined by powder laser diffraction.
2.1. NIR Spectroscopy
2.1.1. Sample preparation
The pharmaceutical formulation studied is a mixture of three APIs: Paracetamol (PA),
Pseudoephedrine Hydrochloride (PS), and Dextromethorphan Hydrobromide (DX) and
placebo.
For the development of the quantitative analysis, three independent experimental designs
(vide appendix 7.2), one for each API, were created. In each calibration set, the
concentrations of the selected API and of the placebo were varied by overdosing and underdosing.
The concentration range (%) of each API was chosen to cover extreme situations
that can lead to production problems. PS and DX are minor components of the
pharmaceutical formulation, so each was underdosed down to the minimum limit (0% (w/w)),
which allows homogeneity problems during production to be detected. The major
component, PA, was overdosed up to 92.3% (w/w), which corresponds to the absence of placebo.
Despite the low PS and DX concentrations, NIR spectroscopy can identify both
components in a sample because their concentrations are above 0.01% (w/w) [10].
The range and nominal concentration (% w/w) of each API are shown in Table 1.
Table 1 – The concentration range of each API for each calibration set.
API    % (w/w)
PA     84.7 ± 7.6
PS     5.1 ± 5.1
DX     2.7 ± 2.7
In the laboratory, 19 PA powder samples and 12 PS and DX samples were prepared, in an
amount of 7.5 g each. During sample preparation, the active principles and placebo were
accurately weighed (on an analytical balance with 0.1 mg precision) and properly
homogenized in a small laboratory vortex mixer for 1 minute after each addition.
Currently, the studied pharmaceutical formulation is produced
with DX from two different suppliers (Divis and DSM). Thus, two sets
of samples were prepared, one with DX from each supplier.
2.1.2. Measurement
The diffuse reflectance spectra were collected on an ABB FTLA2000 FT-NIR
spectrometer, equipped with a tungsten-halogen source, an InAs detector and a powder
sampling accessory. Before spectra acquisition, the best gain for background and samples
was selected and aligned. Each spectrum was the average of 64 scans at a resolution
of 16 cm-1. The spectral data covered the range from 3996.2 to 12004 cm-1.
Every day, before the acquisition of NIR spectra, a reference spectrum (the background)
was recorded using Teflon. The background measures the instrument and environment
contributions, so that these can be corrected for in the sample measurement [17].
All sample measurements were recorded in triplicate.
A spectrum captures many different sources of variation, such as constituent parameters (e.g.
concentration, drying, coating, etc.); instrument variations (e.g. detector noise); environmental
conditions (e.g. laboratory room temperature) and differences in sample handling, which
affect the baseline and the absorbance. A well-performing calibration set should only represent
the different concentrations of the constituents of the mixture. Therefore, before starting the
construction of the calibration models, the noise level was checked over the full wavenumber
range and for both high and low absorbance of the spectra.
[Plot: absorbance (0 to 0.9) of DX versus wavenumber (cm-1), from 4000 to 12000 cm-1]
Figure 6 – Detection of instrumental noise in an FT-NIR absorption spectrum of DX.
Figure 6 shows instrumental noise in the range of 9003.1-12004 cm-1. This range
was removed prior to any pre-processing (which eliminates the other
irrelevant variations) or analysis.
2.1.3. Software
The data collection was controlled using GRAMS/AI (version 7.0 from Thermo Galactic,
USA) software. Multivariate calibration was performed in Matlab (version 6.5 from
Mathworks Inc., USA) with the PLS toolbox (version 3.0 from Eigenvector Inc., USA). The
variable selection was carried out with the iPLS toolbox (2.1 routine by Nørgaard) and the GA
toolbox (version 6.5 from Mathworks Inc., USA). The orthogonal analysis was done in
SIMCA-P+11.5 (MKS-Umetrics, Umeå).
2.2. Powder Laser Diffraction
2.2.1. Measurement
Wet laser diffraction measurements of each API were performed with a Malvern
Mastersizer MS2000 (from Malvern Instruments Ltd., UK)³ using a small amount of sample.
This equipment has an accuracy of ± 1% on d(0.5) [26].
Before starting a measurement, the RI of the sample and of the dispersant and the absorption
value of the sample had to be specified. The RI of DX and the absorption
values of PS and DX were estimated by trial and error, since there was no information
available. The suitability of these parameters was checked according to the rule of thumb for
the residual. The optical properties of the APIs and dispersants are summarized in Table 2.
Table 2 – The optical properties of APIs and dispersants.
API   RI of API   Absorption   Dispersant                RI of Dispersant [31]
PA    1.62 [27]   0.32 [29]    Deionised water (20 ºC)   1.33
PS    1.53 [28]   0.50         Ether                     1.35
DX    1.50        0.50         Ether                     1.35
3 This equipment can measure particle sizes from 0.02 to 2000µm.
For PA measurements deionised water was used instead of tap water because it
allows stable measurements, according to a Malvern 2000 report [25]. The high electrolyte
concentration in tap water causes the emulsion to flocculate, which leads to a much larger
reported particle size than expected.
For all measurements, the pump speed was set to 1750 rpm, which guarantees the
best conditions for suspending all the material without air bubble formation.
3. RESULTS AND DISCUSSION
The main aim of this work was the development and optimization of a PLS calibration
model to quantify simultaneously the three active principles of the commercial
pharmaceutical formulation studied. Several calibrations were built to determine the most
accurate and robust one. First, the chemical and physical information contained in the NIR
spectra of each pure component was analysed in order to explore the potential of NIR
spectroscopy and to understand their similarities/differences. In parallel, their particle size
distribution was determined by powder laser diffraction.
Finally, in order to correlate the results obtained by NIR spectroscopy and powder laser
diffraction, an OPLS analysis was developed.
3.1. NIR spectroscopy and chemometric analysis of each API
The development of a quantitative analysis should be preceded by an exercise aimed at
gathering chemical knowledge about the APIs of the pharmaceutical formulation. The results of
this exercise allow some important NIR absorption bands of each active
principle to be identified.
Figure 7 – FT-NIR absorption spectra of the three active principles obtained by diffuse reflectance.
Figure 7 shows a strong overlap of the PA and PS absorption bands (i.e. a lack of selectivity
of the NIR absorptions) between 4040-4080 cm-1 (C-H and C-C combinations), in the range of
5880-6060 cm-1 (C-H 1st overtone) and in the 8740-8860 cm-1 region (C-H 2nd overtone).
This overlap can be explained by the similarity of some functional groups of the APIs.
As can be seen in Figure 8, PA and PS both have an aromatic ring, a hydroxyl group (–OH)
and a secondary N-H group (an amide in PA, an amine in PS).
Figure 8 – Chemical structure of DX, PA and PS, respectively.
Despite the lack of selectivity between PA and PS, the development of multivariate models is
possible, because there are chemometric techniques capable of solving this problem and there
are some selective spectral ranges for each active principle. For DX, an N-H 1st overtone band
from 6520-6720 cm-1 and a C-H 2nd overtone band from 8200-8450 cm-1 can be seen.
For PA, an N-H combination band at 4560-4750 cm-1 and the C-H 1st overtone from
5900-6150 cm-1 can be identified. For PS, the C-H 1st overtone combination is detected in the
range of 7250-7500 cm-1.
NIR spectra capture the chemical and physical⁴ characteristics of samples, which can
be interpreted with chemometric techniques. When spectra pre-processing
(e.g. second derivative) is applied, physical effects are reduced, which makes it easier to
identify the chemical information in a single PC. On the other hand, if no spectra
pre-processing is applied, the first PC’s can capture the physical effects of the sample.
Figure 9 shows that the PCA model used two PCs, explaining 98.85% of the total variation of
X, and yielded three distinct clusters.
4 NIR spectroscopy can capture differences in the particle size of the pharmaceutical compounds and between suppliers.
Figure 9 – Scores plot of 2nd derivative (15 point Savitzky-Golay) spectra of the three APIs.
The first component (87.16% variance) should describe the chemical properties, because
PA and PS, which are chemically more similar, are closer to each other than to DX along PC1.
This pre-processing removes almost all the irrelevant information (such as noise), but only
reduces the physical characteristics of the samples. Consequently, PC2 (11.69% variance)
supposedly describes that physical information. According to this assumption, PA and DX
should be physically more similar to each other than to PS.
This assumption was checked in the scores plot of the spectra of the APIs without any
pre-processing (Figure 10).
Figure 10 – Scores plot spectra of the three APIs without any pre-treatment.
The PCA projection indicates that the first two principal components account for 98.92%
of the total variance, which is quite significant. The first component represents 81.31% of
the variance and is probably related to the particle size distribution. According to Figure 10,
DX and PA should have a more similar particle size in comparison to PS, because they have
similar scores on PC1. This corroborates the previous supposition that the physical properties
of the samples are described by PC2 of Figure 9.
Without any mathematical treatment, the NIR spectra contain sample information as well as
background variation and noise. However, the chemical signal of interest is reduced and
the physical parameters are enhanced. Thereby, in Figure 9, PC2 should also explain physical
effects, such as light scattering.
In Figure 10, PS forms a tighter cluster, while the PA and DX clusters are more
scattered. This can be explained by the different number of batches in each cluster and by
batch-to-batch variability: the spectral data of PA come from two lots and those of DX from
six, whereas PS is represented by a sample from a single lot.
As mentioned above, the studied pharmaceutical formulation is
produced with DX from two different suppliers. A PCA of six batches (three from each supplier)
was developed to analyse the physical differences between samples. Figure 11
shows that the first two principal components account for most of the spectral variation
(99.80%), enough to describe the variation between samples.
Figure 11 – Scores plot spectra of six batches of DX from two different manufactures (without any pre-treatment).
The first PC captures most of the variance in the data without any pre-treatment (94.65%),
and may describe the differences in particle size between the samples. The distance
between the two suppliers along PC1 is small (when compared to Figure 10). Consequently,
it cannot be concluded that the samples have different particle sizes.
PC2 (5.15% variance) probably represents differences in powder compaction force between
replicates of each batch, which would act through the light scattering effect.
To confirm the preceding speculations about the physical properties of the APIs, the particle
size distribution was measured by powder laser diffraction.
3.2. Particle size distribution of each API
In the PSD analysis, two important parameters should be taken into account: the residual
and the obscuration (%).
To guarantee a good fit between the calculated data and the measurement data, the residual
has to be less than 1% [22]. A higher value may indicate the use of incorrect RI and/or
absorption values for the sample/dispersant, or a poor background.
If several measurements have a high residual value, the RI/absorption values should be
changed and the results recalculated.
A poor background shows an atypical background light scattering pattern.
An erroneous background measurement gives a wrong particle size distribution, because the
particle size is determined from the difference between the sample measurement and the
background. Thereby, to assure good results, all analyses that did not meet the measurement
threshold parameters should be repeated.
Before starting the sample measurement, a clean and stable background is required. For
that, a typical light scattering pattern has to be detected, i.e., a near exponential decay
across the detector array and less than 20 units of scatter by ring 20 should be observed in the
measured background (Figure 12) [22].
Figure 12 – The measured background.
The equipment measures the amount of light scattered by the sample, which
correlates with the concentration of material present within the measurement zone: the
obscuration. The Malvern 2000 software (version 5.22) has an “obscuration bar” that gives a
visual indication of how much sample has been added; the obscuration has to be about 10-20%.
Finally, the sample analysis proceeds [22].
The particle size distribution of each API of the studied pharmaceutical formulation was
measured by a wet dispersion powder laser diffraction method (vide 2.2.1), as can
be seen in Figure 13.
Figure 13 – Particle size distribution of the three APIs measured based on the Malvern optical model.
The PSD curves of PA and DX are typical Gaussian (normal) distributions, which
implies that the mean, median and mode coincide and that the particle size is uniform along the
volume distribution. The PS distribution is more complex than those of PA and DX: it is a
bimodal distribution, i.e., this lot has a non-homogeneous PSD. The left peak contained almost all of the
volume percentage and ended at about 170 µm, while the peak to the right covered the
particle size range 720 µm (vide appendix 7.6).
Thus, the PC1 interpretation of the scores plot of the three APIs without any pre-treatment
(Figure 10) was correct, because the PSDs of PA and DX are more similar to each other than to
that of PS.
The PSD is described by the equivalent volume diameter at 10%, 50% and 90%
cumulative volume, respectively, d(0.1); d(0.5) and d(0.9).
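The way d(0.1), d(0.5) and d(0.9) follow from a volume distribution can be sketched as below. The sketch assumes that the size values are the upper edges of the size classes and interpolates linearly between them; instrument software may interpolate differently (for example on a logarithmic size axis). The function name and the distribution are hypothetical.

```python
import numpy as np


def percentile_diameters(sizes_um, volume_percent, fractions=(0.1, 0.5, 0.9)):
    """Diameters below which the given fractions of the total volume lie."""
    cumulative = np.cumsum(volume_percent) / np.sum(volume_percent)
    return {f: float(np.interp(f, cumulative, sizes_um)) for f in fractions}


# Hypothetical distribution: ten size classes with 10% of the volume each
sizes_um = np.arange(10.0, 101.0, 10.0)   # upper class edges, 10 to 100 um
volume_percent = np.full(10, 10.0)
d = percentile_diameters(sizes_um, volume_percent)
print({f"d({f})": round(v, 1) for f, v in d.items()})
```

For this flat distribution 10%, 50% and 90% of the volume lie below 10, 50 and 90 µm respectively.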
One batch of PS, six of DX and two of PA were analysed. The results
obtained are summarized in Table 3, as well as some measurement parameters that indicate
how well the calculated data fitted the measurement data.
Table 3 – Particle Size distributions obtained for different lots and APIs suppliers (percent relative standard deviation (%RSD) and weighted residual and obscuration).
API (supplier)      Batch number   d(0.1) µm   % RSD   d(0.5) µm   % RSD   d(0.9) µm   % RSD   Weighted residual   Obscuration (%)
DX (DSM)            7004727        69.57       7.4     134.39      2.0     251.33      4.7     0.8                 15.9
DX (DSM)            8000874        63.79       2.5     128.23      2.4     235.95      3.7     0.7                 12.9
DX (DSM)            8002235        66.90       0.2     139.87      2.0     252.73      1.4     0.8                 10.3
DX (Divis)          7000732        48.46       2.5     105.87      0.4     206.04      0.4     0.2                 14.0
DX (Divis)          7000733        42.88       5.3     94.76       4.0     187.44      5.0     0.9                 13.7
DX (Divis)          8002055        65.19       1.4     135.55      1.0     252.51      0.1     0.7                 11.0
PA (Mallinckrodt)   7004729        81.44       7.5     248.77      3.0     541.29      5.1     0.8                 11.2
PA (Mallinckrodt)   8001124        94.44       3.0     255.95      4.1     490.21      7.1     0.7                 10.7
PS (BASF)           8000221        9.93        7.3     37.75       7.4     385.28      4.9     0.7                 17.3
The DX PSD was almost the same across the lots of both suppliers, with the exception of the
first two lots from Divis (which were similar to each other). Problems probably occurred during
the production of these two lots. The DX results confirm the previous
interpretation of Figure 11, the score plot of the six lots from both suppliers: PC1
actually describes the particle size, which is quite similar among lots and increases from left to right
(except for the 8002055 batch from Divis).
There are no significant PA PSD differences between lots.
PS has an irregular PSD, since the difference in particle size between d(0.5) and d(0.9) is
far larger than that between d(0.1) and d(0.5), which does not occur to this extent with the other APIs. This is explained
by its bimodal distribution.
The precision of this method was assessed by the percent relative standard deviation (%RSD),
which has to be less than 3% at d(0.5) and less than 5% at d(0.1) and d(0.9) according to the
ISO standard for powder laser diffraction measurements, ISO 13320-1 [30]. For the DX PSD, the
%RSD for d(0.1) of the 7004727 batch from DSM and of the 7000733 batch from Divis slightly
exceeded the imposed limits. The same happens with the d(0.1) of the
7004729 batch and the d(0.9) of the 8001124 batch for PA, as well as with the d(0.1) and d(0.5) of PS.
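The limit check described above can be sketched in a few lines. The %RSD values are copied from Table 3; the function name and the decision to treat a rounded value equal to the limit as passing are assumptions of this example. Applied to the values as printed, such a check may also flag a few entries that are not discussed in the text (for example the d(0.5) of batch 7000733).

```python
# Limits quoted in the text for ISO 13320-1 (values in %RSD).
LIMITS = {"d(0.1)": 5.0, "d(0.5)": 3.0, "d(0.9)": 5.0}

# %RSD values copied from Table 3, ordered as d(0.1), d(0.5), d(0.9).
TABLE_3_RSD = {
    ("DX (DSM)", "7004727"): (7.4, 2.0, 4.7),
    ("DX (DSM)", "8000874"): (2.5, 2.4, 3.7),
    ("DX (DSM)", "8002235"): (0.2, 2.0, 1.4),
    ("DX (Divis)", "7000732"): (2.5, 0.4, 0.4),
    ("DX (Divis)", "7000733"): (5.3, 4.0, 5.0),
    ("DX (Divis)", "8002055"): (1.4, 1.0, 0.1),
    ("PA (Mallinckrodt)", "7004729"): (7.5, 3.0, 5.1),
    ("PA (Mallinckrodt)", "8001124"): (3.0, 4.1, 7.1),
    ("PS (BASF)", "8000221"): (7.3, 7.4, 4.9),
}

def flag_rsd(table, limits):
    """List every (API, batch, descriptor, %RSD) whose %RSD exceeds its limit.

    Assumption: a rounded value equal to the limit is treated as passing.
    """
    flagged = []
    for (api, batch), values in table.items():
        for (name, limit), value in zip(limits.items(), values):
            if value > limit:
                flagged.append((api, batch, name, value))
    return flagged

flagged = flag_rsd(TABLE_3_RSD, LIMITS)
for row in flagged:
    print(row)
```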
As mentioned above, there are two important parameters in the PSD analysis: obscuration
(%) and residual value. All measurements were done with a sufficient amount of sample,
since the obscuration (between 10.3% and 17.3%) was within the set limits. Moreover, the rule of
thumb for the residual was respected (less than 1%), which indicates that correct refractive index and absorption values were
used and that a clean background was measured. The results can therefore be considered reliable.
The Quality Assurance and Quality Control Departments of Lusomedicamenta made
available the product data sheets for each API batch. However, only the DX (DSM) data sheets
contained a PSD analysis. These results were compared with those obtained by powder laser
diffraction on the Malvern equipment at IST.
Table 4 – Size distributions obtained for different batches of DX (DSM) and relative error.
API       Batch number  Source    d(0.1) (µm)  Relative error  d(0.5) (µm)  Relative error  d(0.9) (µm)  Relative error
DX (DSM)  7004727       IST       69.57                        134.39                       251.33
                        Supplier  62.00        12.2%           139.00       3.3%            235.00       6.9%
DX (DSM)  8000874       IST       63.79                        128.23                       235.95
                        Supplier  61.00        4.6%            136.00       5.7%            231.00       2.1%
DX (DSM)  8002235       IST       66.90                        139.87                       252.73
                        Supplier  58.00        15.4%           134.00       4.4%            225.00       12.3%
The results obtained by powder laser diffraction on the Malvern and those reported by the
Lusomedicamenta supplier are quite similar. The calculated relative error is considered
acceptable, since it never exceeds 15.4%. The small differences between the two sets of results may be
due to experimental errors or to a suboptimal RI in the analysis protocol at
IST, since the IST results for d(0.1) and d(0.9) are always higher than DSM’s.
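The relative errors in Table 4 can be reproduced with a one-line formula. The sketch below assumes that the supplier value is taken as the reference, which is consistent with the percentages printed for batch 7004727; the function name is illustrative.

```python
def relative_error(measured, reference):
    """Absolute relative error of a measurement against a reference, in percent."""
    return abs(measured - reference) / reference * 100.0

# d(0.1), d(0.5), d(0.9) in µm for DX (DSM) batch 7004727, copied from Table 4.
ist = (69.57, 134.39, 251.33)
supplier = (62.00, 139.00, 235.00)
errors = [round(relative_error(m, s), 1) for m, s in zip(ist, supplier)]
print(errors)  # [12.2, 3.3, 6.9]
```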
3.3. Quantitative analysis of APIs
As mentioned before, the pharmaceutical formulation contains two compounds (PA and
PS) with overlapping spectra, which requires the use of multivariate chemometric techniques.
PLS is a reasonable choice for resolving overlapping signals and for
quantitative analysis.
In this work, the quantitative analysis of pharmaceutical product was developed, based on
FT-NIR spectroscopy. Thereby, several strategies for robust multivariate modelling using
PLS regression were proposed, with different spectra pre-processing, with or without variable
selection.
Currently, Lusomedicamenta is using DX from both DSM and Divis. However, for the
development of the quantitative analysis of this pharmaceutical formulation, the DX supplier is
irrelevant because the calibration models are built on the chemical properties of DX (the
influence of physical properties is minimised by applying spectral pre-treatments and, beyond that, the lots
have similar PSD and are within the specifications). Nevertheless, two independent
quantitative models, one for each supplier, were developed and the results were quite
similar, as would be expected. For simplicity, only the DX (DSM) calibration results are
presented in Results and Discussion, since these were better than those for DX (Divis) (all information
about the other calibration sets is available in the appendix).
3.3.1. First strategy
Three independent experimental designs were created, in which the concentrations of each API
and of the placebo were varied by overdosing or underdosing. Within each design the varied API
and the placebo are highly correlated, but the placebo does not interfere directly with the
calibration model, therefore it is not quantified. This was considered the best procedure because
correlations between the API concentrations were minimized.
Three individual calibrations (one for each API) were developed, which allow taking into
account small deviations of only one API from linearity in the studied concentration range.
The correlation between samples’ concentration for each calibration set with DX (DSM)
can be seen in Table 5.
Table 5 – The correlation between samples’ concentration for each calibration set with DX (DSM).
PA calibration set
R2       PA    PS        DX        Placebo
PA       1.00  -         -         -
PS       0.01  1.00      -         -
DX       0.01  0.14      1.00      -
Placebo  1.00  0.01      0.01      1.00
PS calibration set
R2       PA    PS        DX        Placebo
PA       1.00  -         -         -
PS       0.16  1.00      -         -
DX       0.15  3.00E-04  1.00      -
Placebo  0.16  1.00      3.00E-04  1.00
DX calibration set
R2       PA    PS        DX        Placebo
PA       1.00  -         -         -
PS       0.37  1.00      -         -
DX       0.29  0.18      1.00      -
Placebo  0.29  0.18      1.00      1.00
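A correlation table of this kind can be computed directly from the concentration columns of a design. The sketch below uses a hypothetical PA design (the actual sample concentrations are not listed in this section) in which PA is varied, PS and DX scatter slightly around fixed levels, and the placebo makes up the balance.

```python
import numpy as np

def r2_matrix(concentrations):
    """Squared Pearson correlation between every pair of concentration columns."""
    r = np.corrcoef(np.asarray(concentrations, dtype=float), rowvar=False)
    return r ** 2

# Hypothetical design, for illustration only (% w/w): columns PA, PS, DX, placebo.
rng = np.random.default_rng(1)
pa = np.linspace(77.1, 92.3, 9)
ps = 3.0 + rng.normal(0.0, 0.05, pa.size)
dx = 1.0 + rng.normal(0.0, 0.05, pa.size)
placebo = 100.0 - pa - ps - dx
design = np.column_stack([pa, ps, dx, placebo])

r2 = r2_matrix(design)
print(np.round(r2, 2))
```

In such a design the varied API and the placebo are almost perfectly correlated, which mirrors the value of 1.00 between PA and placebo in the PA block of Table 5.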
The API concentration variation is easily detected in FT-NIR spectra (Figure 14).
Figure 14 – FT-NIR MSC spectra of each calibration set (with DX (DSM)). PA concentration increases in the arrow direction between 77.1% and 92.3% (a); PS concentration between 10.2% and 0% (b); and DX concentration between 5.4% and 0% (c).
In each calibration set, the measured absorption is proportional to the API concentration,
following Beer’s Law, and increases in the arrow direction.
3.3.1.1. Calibration vs. Test sets
The spectral data are split into two subsets: a calibration set (two thirds of all samples) and
a test set (one third of the samples). The first set is employed to build the model, while the second is
used to test its predictions. Note that when a sample is chosen for the test set, its three replicas
are all assigned to that set. The choice of the calibration set is a crucial point: to ensure a robust
model, the calibration set has to cover the maximum spectral variability observed in the scores
plot (Figure 15), as well as the variability expected from future samples.
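The rule that all replicas of a sample go to the same subset can be sketched as follows. Note that this example draws the test samples at random, which is only a stand-in: in this work the subsets were chosen from the scores plot so that the calibration set spans the spectral variability. The function name, the sample labels and the seed are hypothetical.

```python
import random

def split_by_sample(sample_ids, test_fraction=1 / 3, seed=0):
    """Split spectrum indices so that all replicas of a sample share one subset."""
    unique = sorted(set(sample_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)                      # random choice: a stand-in, see lead-in
    n_test = round(len(unique) * test_fraction)
    test_samples = set(unique[:n_test])
    cal = [i for i, s in enumerate(sample_ids) if s not in test_samples]
    test = [i for i, s in enumerate(sample_ids) if s in test_samples]
    return cal, test

# Hypothetical data set: 12 samples, each measured as three replicas.
ids = [sample for sample in range(12) for _ in range(3)]
cal_idx, test_idx = split_by_sample(ids)
print(len(cal_idx), len(test_idx))  # 24 12
```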
Figure 15 – Scores plot of PA (with DX (DSM)) samples with the selected calibration and test sets based on NIR MSC and Mean Centering pre-processed spectra.
Regardless of the pre-processing used in this step, the distribution of the selected subsets is
almost constant.
3.3.1.2. Data pre-processing
Three approaches to pre-process the NIR spectra were applied in this work: MSC, first
derivative (1st D) and second derivative (2nd D). The derivatives were computed using the SG
algorithm with a 21-point moving window and a second-order polynomial.
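The pre-processing steps named above can be sketched with NumPy alone. This is an illustrative implementation, not the software used in this work: `msc` regresses each spectrum on the mean spectrum, and `savgol_derivative` builds Savitzky-Golay coefficients from a local polynomial fit and trims the edges rather than padding them. The synthetic spectra are hypothetical.

```python
import numpy as np
from math import factorial

def msc(spectra, reference=None):
    """Multiplicative scatter correction against the mean (or a given) spectrum."""
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)   # row ~ intercept + slope * ref
        corrected[i] = (row - intercept) / slope
    return corrected

def savgol_derivative(spectra, window=21, polyorder=2, deriv=1, delta=1.0):
    """Savitzky-Golay derivative along each spectrum (window edges are trimmed)."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    design = np.vander(offsets, polyorder + 1, increasing=True)  # 1, k, k^2, ...
    coeffs = np.linalg.pinv(design)[deriv] * factorial(deriv) / delta ** deriv
    X = np.asarray(spectra, dtype=float)
    # np.convolve flips its kernel, so the coefficients are reversed first.
    return np.array([np.convolve(row, coeffs[::-1], mode="valid") for row in X])

# Hypothetical spectra: one band with different offsets and scaling per sample.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 200)
band = np.exp(-((axis - 0.5) / 0.05) ** 2)
raw = np.array([a + b * band for a, b in [(0.1, 1.0), (0.3, 1.4), (-0.2, 0.7)]])
raw += rng.normal(0.0, 1e-3, raw.shape)
first_d = savgol_derivative(msc(raw), window=21, polyorder=2, deriv=1)
print(first_d.shape)  # (3, 180)
```

Mean centering would then be applied column-wise before regression.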
The calibration model was constructed by PLS, using the contiguous block method for
cross-validation.
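To make the combination of PLS and contiguous-block cross-validation concrete, the following sketch implements a single-response PLS (PLS1, NIPALS) and computes the RMSECV by leaving out consecutive blocks of samples. It is a minimal illustration on synthetic data; the function names, the number of blocks and the data are assumptions, and it does not reproduce the toolbox used in this work.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model (NIPALS) on mean-centred data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)            # weight vector
        t = E @ w                         # scores
        p = E.T @ t / (t @ t)             # X loadings
        c = (f @ t) / (t @ t)             # y loading
        E = E - np.outer(t, p)            # deflate X
        f = f - c * t                     # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # regression vector
    return x_mean, y_mean, b

def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return (X - x_mean) @ b + y_mean

def rmsecv_contiguous(X, y, n_lv, n_blocks=5):
    """RMSECV with contiguous blocks: each block of consecutive samples is left out once."""
    idx = np.arange(len(y))
    press = 0.0
    for block in np.array_split(idx, n_blocks):
        train = np.setdiff1d(idx, block)
        model = pls1_fit(X[train], y[train], n_lv)
        press += np.sum((y[block] - pls1_predict(model, X[block])) ** 2)
    return np.sqrt(press / len(y))

# Synthetic two-component mixture, for illustration only.
rng = np.random.default_rng(0)
c1, c2 = rng.uniform(0, 1, 30), rng.uniform(0, 1, 30)
s1, s2 = rng.normal(size=40), rng.normal(size=40)
X = np.outer(c1, s1) + np.outer(c2, s2) + 1e-3 * rng.normal(size=(30, 40))
y = c1
for n_lv in (1, 2, 3):
    print(n_lv, rmsecv_contiguous(X, y, n_lv))
```

With replicate spectra stored consecutively, contiguous blocks keep the replicas of a sample together in the left-out block, which is one common reason for preferring this scheme.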
3.3.1.3. Variable selection
The calibration model can be constructed over the whole wavenumber range or with
selected spectral ranges. In the first case, the model is more robust because it takes the possible
interferences into account. On the other hand, the second procedure simplifies the calibration model
and yields a more precise model, which focuses only on the variables where the variation of
API concentration is significant.
In this strategy three different techniques of variable selection were applied: Coefficient
of Determination, iPLS and Genetic Algorithm.
The Coefficient of Determination (R2) measures the correlation between the NIR spectral
information (X variable) and the API concentration (Y variable). The coefficient at each wavenumber
was calculated by a function developed in Matlab, as shown in Figure 16.
[Figure 16 plot: R2 (0 to 1) versus wavenumber (4000 to 9000 cm-1).]
Figure 16 – Coefficient of Determination (R2) versus wavenumber of PA calibration set (with DX (DSM)).
This coefficient varies between 0 and 1, and the best correlated regions of the spectrum have a
high coefficient. As the variable selection criterion, all wavenumber ranges with a
coefficient over 0.7 were accepted.
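The selection rule above can be sketched as follows: compute the squared correlation between each spectral variable and the concentration, then keep the contiguous ranges above the threshold. This is a Python illustration of the idea, not the Matlab function used in this work; the synthetic band at 5000 cm-1 and the noise level are hypothetical.

```python
import numpy as np

def r2_per_variable(X, y):
    """Squared Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    return r ** 2

def select_ranges(wavenumbers, r2, threshold=0.7):
    """Return contiguous (start, end) wavenumber ranges where R2 exceeds the threshold."""
    ranges, start, previous = [], None, None
    for wn, keep in zip(wavenumbers, r2 > threshold):
        if keep and start is None:
            start = wn
        elif not keep and start is not None:
            ranges.append((start, previous))
            start = None
        previous = wn
    if start is not None:
        ranges.append((start, previous))
    return ranges

# Hypothetical spectra: one band around 5000 cm-1 scales with concentration.
rng = np.random.default_rng(0)
wn = np.linspace(4000, 9000, 101)
conc = np.linspace(0.0, 5.4, 20)
band = np.exp(-((wn - 5000) / 100) ** 2)
spectra = np.outer(conc, band) + rng.normal(0.0, 0.05, (20, 101))

r2 = r2_per_variable(spectra, conc)
ranges = select_ranges(wn, r2, threshold=0.7)
print(ranges)
```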
When using iPLS, the spectra were split into smaller equal-width regions, a PLS
regression model was developed for each sub-interval and its RMSECV was compared with the global RMSECV (over the whole
wavenumber range). The region with the lowest model error was chosen.
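The interval search can be sketched as below: the variables are split into equal-width intervals, a small PLS1 model is cross-validated on each interval, and the interval with the lowest RMSECV is reported. This is a simplified illustration on synthetic data (the comparison with the global model is omitted); the function names, the contiguous-block scheme with five blocks and the data are assumptions of the example.

```python
import numpy as np

def pls1_predict(X_cal, y_cal, X_new, n_lv):
    """Fit PLS1 (NIPALS) on the calibration data and predict X_new."""
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    E, f = X_cal - x_mean, y_cal - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        p = E.T @ t / (t @ t)
        c = (f @ t) / (t @ t)
        E, f = E - np.outer(t, p), f - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return (X_new - x_mean) @ b + y_mean

def rmsecv(X, y, n_lv, n_blocks=5):
    """Contiguous-block cross-validation error."""
    idx = np.arange(len(y))
    press = 0.0
    for block in np.array_split(idx, n_blocks):
        train = np.setdiff1d(idx, block)
        pred = pls1_predict(X[train], y[train], X[block], n_lv)
        press += np.sum((y[block] - pred) ** 2)
    return np.sqrt(press / len(y))

def ipls(X, y, n_intervals=20, max_lv=3):
    """Return (interval number, best RMSECV, LVs used) for every interval."""
    results = []
    columns = np.array_split(np.arange(X.shape[1]), n_intervals)
    for k, cols in enumerate(columns, start=1):
        errors = [rmsecv(X[:, cols], y, lv) for lv in range(1, max_lv + 1)]
        best = int(np.argmin(errors))
        results.append((k, errors[best], best + 1))
    return results

# Hypothetical data: the informative band sits inside interval 5 (variables 40-49).
rng = np.random.default_rng(0)
conc = rng.uniform(0.0, 5.4, 30)
variables = np.arange(200)
band = np.exp(-0.5 * ((variables - 45) / 2.0) ** 2)
spectra = np.outer(conc, band) + rng.normal(0.0, 0.01, (30, 200))

results = ipls(spectra, conc)
best_interval = min(results, key=lambda r: r[1])
print(best_interval)
```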
For each calibration set several combinations of pre-treatments and spectral intervals were
studied, and the best combination was chosen. Figure 17 shows the example of the DX calibration
set based on MSC, second-derivative and mean-centering pre-processed spectra.
Figure 17 – iPLS results for DX (DSM) calibration set.
As can be seen in Figure 17, the pre-processed spectra were split into 20 intervals, and the
optimal number of PLS components in each interval is indicated at the bottom of each
vertical bar. Moreover, the RMSECV of the global model with 4 LV is represented by a dotted
line.
Next, the fifth interval was investigated in more detail, since it had a lower RMSECV than the
global model. In the 5022.3-5269.1 cm-1 range a better calibration can be developed
than over the whole wavenumber range.
Figure 18 – Optimal spectral region selected by iPLS for pre-processed previous spectrum.
[Figure 18 plot: response, raw data (the mean is used in the calculations) versus wavelength (4500 to 9000); interval number 5, wavelengths 5022.27-5269.14.]
[Figure 17 plot: RMSECV (0 to 1) versus interval number (1 to 20). The dotted line is the RMSECV (4 LVs) for the global model; the italic numbers are the optimal LVs in each interval model, from interval 1 to 20: 5 1 5 2 3 3 4 2 4 2 4 3 9 2 1 3 4 3 2 2.]
Interval 5 had the lowest RMSECV, using 3 PLS components (more details on this
calibration can be seen in Table 8). Furthermore, this interval corresponds to the C=O 2nd
overtone absorption band of DX, shown in Figure 5.
This method gives an overview of the spectral data and shows the interesting spectral
regions for calibrating the PLS model, but it only provides information about each interval separately.
In this strategy another method for variable selection in PLS regression, the Genetic
Algorithm (GA), was applied to the same calibration sets.
GA is a stochastic search technique inspired by the mechanism of natural selection, which looks for the
variable subset that builds the PLS model with the lowest RMSECV.
For each calibration set several pre-treatments and different GA parameters were studied,
and the best combination was chosen. The GA for the DX (DSM) calibration set was performed with
the following parameters:
Table 6 – The best GA parameters chosen to use for DX (DSM) calibration set.
Parameter                     Value
Population size               64
Maximum generations           100
Mutation rate                 0.005
Window width (see footnote 5) 10
Convergence                   50
Initial terms                 30
Crossover                     Double
Regression                    PLS
Maximum LV                    8
Cross validation              Contiguous
In this case, GA was performed using a PLS regression method with a maximum of 8 LV
to avoid overfitting of the model, and the contiguous block method was used for cross-validation.
Moreover, MSC, second-derivative and mean-centering pre-processing was applied.
The algorithm started with the random generation of an initial population of 64
individuals. Each individual was represented by a chromosome. At each generation, the fitness of
each chromosome was calculated and evaluated according to the RMSECV, and the half of the
individuals with the worst results was discarded. Next, the offspring were created by genetic operators such
as double cross-over and mutation.
5 This parameter indicates how many adjacent variables should be grouped together at a time.
In double cross-over, the genes of two random individuals are split at random
points and the resulting segments are recombined, creating two new individuals.
Mutation consists of flipping an arbitrary bit in a genetic sequence from its original state,
with a mutation rate of 0.005 (Table 6).
The GA run finishes when the percentage of identical chromosomes defined by the convergence
criterion (50%) is reached or when 100 generations are reached; otherwise the generation of offspring is
repeated until the termination conditions are satisfied.
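The genetic operators described above can be sketched on boolean chromosomes, where each bit says whether a window of variables is included. This is an illustration under stated assumptions, not the implementation used in this work: the fitness values are random placeholders standing in for the RMSECV of each individual, the number of windows is hypothetical, and keeping the surviving half unchanged next to their offspring is one possible design choice.

```python
import numpy as np

def double_crossover(parent_a, parent_b, rng):
    """Swap the segment between two random cut points of two bit strings."""
    n = len(parent_a)
    i, j = sorted(rng.choice(np.arange(1, n), size=2, replace=False))
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[i:j], child_b[i:j] = parent_b[i:j], parent_a[i:j]
    return child_a, child_b

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flips = rng.random(len(chromosome)) < rate
    return np.where(flips, ~chromosome, chromosome)

def next_generation(population, rmsecv_values, rate, rng):
    """Discard the worse half (higher RMSECV) and refill with mutated offspring."""
    order = np.argsort(rmsecv_values)
    survivors = population[order[: len(population) // 2]]
    parents = survivors[rng.permutation(len(survivors))]
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        for child in double_crossover(a, b, rng):
            children.append(mutate(child, rate, rng))
    return np.vstack([survivors, np.array(children)])

rng = np.random.default_rng(0)
population = rng.random((64, 60)) < 0.5     # 64 individuals, 60 hypothetical windows
placeholder_rmsecv = rng.random(64)         # stand-in for the real fitness evaluation
new_population = next_generation(population, placeholder_rmsecv, 0.005, rng)
print(new_population.shape)  # (64, 60)
```

In a full run this step would be repeated, with the RMSECV of each chromosome recomputed by PLS cross-validation, until the convergence or generation limit is reached.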
During each analysis, a command window will display the progress of the GA run. In
Figure 19, the last GA generation is shown.
[Figure 19 panels: Fitness vs. # of Windows at Generation 20 (Fitness versus Number of Windows); Evolution of Average and Best Fitness (Average and Best Fitness versus Generation); Evolution of Number of Windows (Average Windows Used versus Generation); Models with Window at Generation 20 (Models Including Window versus Window Number).]
Figure 19 – Diagnostic plots of GA analysis: Fitness vs. Number of variables (a); Evolution of average and best fitness (b); Evolution of number of variables (c); and Models with variable number (d)
In the fitness vs. number of variables plot (a), the fitness of each individual at generation 20 is shown with
green circles.
In the evolution of average and best fitness plot (b), a dashed line represents
the RMSECV obtained using all variables. Furthermore, the best and average fitness lines
tend to converge to a lower RMSECV value over the generations.
The evolution of number of variables plot (c) shows the average number of variables used in each
generation.
The last plot (d) shows the variables selected at generation 20.
With this technique, a different set of variables was selected in each simulation. Therefore, ten
simulations were run for each calibration and the variables that were selected at least
five times were retained.
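The "at least five out of ten runs" rule is a simple vote count. The sketch below uses a hypothetical selection matrix (rows are GA runs, columns are variable windows); the function name is illustrative.

```python
import numpy as np

def consensus_variables(selections, min_count=5):
    """Keep the variables selected in at least `min_count` GA runs."""
    counts = np.asarray(selections, dtype=int).sum(axis=0)
    return np.flatnonzero(counts >= min_count), counts

# Hypothetical outcome of ten GA runs over eight variable windows
# (row = run, column = window, True = window selected in that run).
runs = np.array([[run < window for window in range(8)] for run in range(10)])
kept, counts = consensus_variables(runs)
print(counts)  # [0 1 2 3 4 5 6 7]
print(kept)    # [5 6 7]
```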
3.3.1.4. Number of PCs
The number of PCs (latent variables, LVs) was selected according to the minimum PRESS. Based on Figure 20, the
RMSECV of the PA calibration set (with DX (DSM), based on MSC, first-derivative and mean-centering
pre-processed spectra) goes through a minimum at 3 LVs. A good PLS model was therefore built with 3
LVs, which gave the lowest RMSECV. With fewer than 3 LVs, too little information is included in the
model, while with more than 3 LVs too much information is added.
[Figure 20 plot: RMSECV (1.25 to 1.65) versus Latent Variable Number (1 to 10).]
Figure 20 – PRESS for PLS on PA calibration set (with DX (DSM)) data based on MSC, 1st derivative and Mean Centering pre-processing spectra.
This choice was also supported by the variance captured in Y as shown in Figure 21.
Figure 21 – Analysis showing PLS Model of PA calibration set (with DX (DSM)).
This model with 3 LV can capture 98.77% of Y cumulative variance, which is quite
significant.
3.3.1.5. Outliers
When a sample is significantly different from the rest of the calibration set, it may be an
outlier, which should be eliminated. However, not all outliers are erroneous. Outlier detection
has to be done carefully in order to avoid eliminating samples that are representative for the model.
There are several techniques to detect outliers. In this study, only leverage and Q
residuals were consistently used.
Leverage measures the influence that a sample has on a model, while the Y-studentized
residual is an indication of the lack of fit of the y-value of a sample [19].
[Figure 22 plot: Y Studentized Residual 1 (-2 to 2.5) versus Leverage (0.05 to 0.25); sample 34 is labelled.]
Figure 22 – Studentized Residuals versus Leverage for PA calibration set (with DX (DSM)).
Sample number 34 had a studentized residual of about 1.5 and a very high leverage,
about 0.255. This suggested that sample 34 was an outlier, which could be checked by plotting
the Q residuals for each sample.
[Figure 23 plot: Q Residuals (2.66%), scale 0 to 1 x 10-5, versus Sample (5 to 40); sample 34 is labelled.]
Figure 23 – Q residuals versus sample for PA calibration set (with DX (DSM)).
As can be seen in Figure 23, sample 34 had significantly higher Q residuals than the rest
of the calibration set, so it was indeed an outlier.
Once detected, the outliers are eliminated in order to build an efficient model.
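The two diagnostics can be sketched from a latent-variable decomposition. For brevity this example uses PCA scores (via SVD) as a stand-in for the PLS scores used in this work, and it includes the 1/n term in the leverage, which is one common convention. The data are synthetic, with one sample deliberately perturbed at index 33 (the 34th sample) to mimic an outlier.

```python
import numpy as np

def leverage_and_q(X, n_components):
    """Leverage and Q residuals from a mean-centred PCA decomposition of X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    residual = Xc - scores @ loadings.T
    q = np.sum(residual ** 2, axis=1)                    # unexplained variation per sample
    # h_i = 1/n + t_i (T'T)^-1 t_i', which reduces to the squared rows of U.
    leverage = 1.0 / len(X) + np.sum(U[:, :n_components] ** 2, axis=1)
    return leverage, q

# Synthetic two-component data with one perturbed sample, for illustration only.
rng = np.random.default_rng(0)
c1, c2 = rng.uniform(0, 1, 40), rng.uniform(0, 1, 40)
s1, s2, s3 = rng.normal(size=(3, 100))
X = np.outer(c1, s1) + np.outer(c2, s2) + 1e-3 * rng.normal(size=(40, 100))
X[33] += 0.05 * s3            # spectral feature absent from all other samples

leverage, q = leverage_and_q(X, n_components=2)
print(int(np.argmax(q)) + 1)  # sample number with the largest Q residual
```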
3.3.1.6. Statistics
After the calibration model is built, the RMSECV and the R2 are calculated as a first
evaluation of its performance on the cross-validated calibration data. Finally, the predictive ability of
the model is measured by the RMSEP on the test set, i.e., the set of samples not used for model
development.
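Both error figures are root mean square errors; they differ only in which predictions are used (cross-validated predictions of the calibration samples for RMSECV, test-set predictions for RMSEP). The sketch below assumes R2 is computed as one minus the ratio of residual to total sum of squares, which may differ slightly from the definition used by the modelling software; the example values are hypothetical.

```python
import numpy as np

def rmse(y_measured, y_predicted):
    """Root mean square error between measured and predicted values."""
    ym = np.asarray(y_measured, dtype=float)
    yp = np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean((ym - yp) ** 2)))

def r_squared(y_measured, y_predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    ym = np.asarray(y_measured, dtype=float)
    yp = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((ym - yp) ** 2)
    ss_tot = np.sum((ym - ym.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical test-set concentrations (% w/w) and model predictions.
measured = [78.0, 82.0, 86.0, 90.0]
predicted = [78.5, 81.4, 86.3, 89.6]
print(rmse(measured, predicted), r_squared(measured, predicted))
```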
[Figure 24 plot: Y Predicted 1 versus Y Measured 1. R2 = 0.986; 3 Latent Variables; RMSEC = 0.53486; RMSECV = 0.73284; RMSEP = 0.61823.]
Figure 24 – Correlation between measured and predicted PA calibration set (with DX (DSM)) [●: calibration set; ▼: validation set].
Figure 24 shows the performance of the PA calibration model developed on MSC, first-derivative
and mean-centering pre-processed spectra over the whole wavenumber range. This
model was built using 3 LV and was the best PA model over the whole wavenumber range
(more details can be seen in Table 7).
3.3.1.7. First strategy without variable selection
Several models without variable selection were developed, according to the steps previously
described. The characteristics of the best models for each API are summarized in
Table 7.
Table 7 – The best results for each calibration set without variable selection (using DX (DSM)).
Compound  Pre-processing      Range (cm-1)   LV  R2    RMSECV (%)  RMSEP (%)
PA        MSC and MC          3996.2-9003.1  4   0.99  0.77        0.60
PA        MSC, 1st D and MC   3996.2-9003.1  3   0.99  0.73        0.62
PS        MSC and MC          3996.2-9003.1  3   0.98  0.57        0.64
PS        MSC, 1st D and MC   3996.2-9003.1  3   0.97  0.61        0.64
DX        MSC, 1st D and MC   3996.2-9003.1  4   0.98  0.43        0.37
DX        MSC, 2nd D and MC   3996.2-9003.1  4   0.98  0.42        0.37
The accuracy of a model can be checked through its coefficient of determination
(R2), RMSEP and RMSECV.
The R2 is a statistical measure of how well the regression line approximates the real data
points. An R2 of 1.0 indicates that the fitted model explains all the variability in the predicted y.
The RMSECV and RMSEP estimate the cross-validation and prediction errors of the model. A
good calibration should have low error values. Small differences between RMSEP and
RMSECV indicate that the model is robust not only for the observations in the calibration
dataset but also for the prediction set. Moreover, these differences also reflect the influence
of proper pre-processing of the raw data.
The R2 calculated for each model was very high, between 0.97 and 0.99.
Although the first PA model has a lower RMSEP, the difference between its
RMSECV and RMSEP is larger than for the second model. So, the model with MSC, first derivative and mean
centering is the best one, with only 3 latent variables. Its RMSECV is 0.73 and its
RMSEP is 0.62.
For the PS and DX models, in both cases the second calibration has the same prediction error
as the first, but its RMSECV and RMSEP are more similar. Note that for the PS calibration these values are
almost equal. In this application, MSC, first derivative and MC, and MSC, second derivative
and MC seem to perform better for the PS and DX models, respectively, than the other pre-processing combinations.
RMSECV and RMSEP are 0.61 and 0.64 for the PS model, and 0.42 and 0.37 for the DX model.
In general, the obtained results are good. Over the whole wavenumber range, besides
capturing concentration variability, the model also contains irrelevant information and noise,
which can be avoided with variable selection. In this procedure, only the regions with relevant
information are considered in the model, which increases the prediction performance of the
model.
Thereby, new models with variable selection techniques were developed with the aim of
obtaining better results.
3.3.1.8. First strategy with variable selection
For each calibration set, three different methods of variable selection (Coefficient of
Determination, iPLS and Genetic Algorithm) were applied, with the aim of identifying
one or more regions where the concentration variation of each API was most significant. Several
pre-treatments and variable selections were tested, but only the models with the best predictive
capabilities are shown in Table 8.
Table 8 – The best results obtained for each calibration set (using DX (DSM)) with variable selection (see footnote 6).
Method  Compound  Pre-processing      LV  R2    RMSECV (%)  RMSEP (%)  Range (cm-1)
R2      PA        MSC, 1st D and MC   3   0.99  0.76        0.67       4196.8-4428.2; 4644.2-5400.3; 5786-5971.2; 6056-7259.5; 8023.3-8663.6
R2      PS        MSC, 1st D and MC   2   0.98  0.57        0.62       4158.2-4204.5; 4289.4-4412.8; 4482.2-4544; 4621.1-4698.3; 5045.4-5361.7; 5770.6-5832.3; 5978.9-6079.2; 6287.5-7282.7; 7467.8-7529.5; 8061.9-8231.6; 8447.6-8686.7; 8794.8-8833.3
R2      DX        MSC, 2nd D and MC   3   0.98  0.34        0.33       4844.8-4906.5; 5030-5346.3; 5678-5863.2; 6626.9-6657.8; 6765.8-7028.1
iPLS    PA        MSC, 2nd D and MC   2   0.98  0.74        0.74       6688.6-7012.7
iPLS    PS        MSC and MC          3   0.98  0.61        0.66       4003.9-4497.7
iPLS    DX        MSC, 2nd D and MC   3   0.98  0.38        0.38       5022.3-5269.1
GA      PA        MSC, 2nd D and MC   3   0.98  0.70        0.74       Vide appendix 7.3
GA      PS        MSC, 2nd D and MC   3   0.98  0.54        0.56       Vide appendix 7.3
GA      DX        MSC, 2nd D and MC   3   0.98  0.29        0.35       Vide appendix 7.3
6 For simplicity reasons, the range selected of each calibration set for GA variable selection is not included on this Table, but it is available in appendix 7.3.
A good R2 was obtained for all the models. Although the PA model selected with the R2
technique had the lowest RMSEP, the model based on iPLS variable selection was considered better
because there was no difference between its RMSECV and RMSEP. This model,
with MSC, second-derivative and mean-centering pre-processing and 2 PLS factors, performs
better than the others. RMSECV and RMSEP were both 0.74.
The PS model that used the GA method to select the most relevant variables and the DX model
that used the R2 technique were very robust, since the prediction set gives practically the same
root mean square error as cross-validation. Both models performed best with MSC,
second-derivative and mean-centering pre-processing and with 3 latent variables. For the PS model the
RMSECV is 0.54 and the RMSEP 0.56, and for the DX model they are 0.34 and 0.33,
respectively.
The models developed with variable selection gave better results than those built over
the full wavenumber range (Table 9).
Table 9 – The best results obtained in the first strategy.
          Without variable selection   With variable selection
Compound  RMSECV (%)  RMSEP (%)        RMSECV (%)  RMSEP (%)
PA        0.73        0.62             0.74        0.74
PS        0.61        0.64             0.54        0.56
DX        0.42        0.37             0.34        0.33
Thus, the calibration models based on a small subset of wavenumbers have a lower
prediction error (with the exception of PA) and are more accurate.
3.3.2. Second strategy
The first strategy minimizes the correlation between the main components (the APIs) in the three
independent calibration sets, one for each API. However, this procedure is somewhat idealised, since it
does not take into account interactions between APIs.
First, the hypothesis of joining all the available spectra of the three calibration sets was
considered. However, during the development of an API model it became clear that this was not the
best strategy, as can be seen in Figure 25.
[Figure 25 plot: Y CV Predicted 1 (-60 to 100) versus Y Measured 1 (0 to 90).]
Figure 25 – Correlation between measured and cross-validation predicted set of PA (with DX (DSM)).
There was no linear relation between the measured and predicted values, so this data
set could not be used to build an accurate calibration model.
Then, a new strategy was adopted. To each calibration set, the first and last three spectra
(the extremes) of the other sets were added, including the three replicas. In this way, the
concentrations of all APIs were increased or decreased in each calibration set, taking into
account interactions between APIs. The correlation coefficient between the sample
concentrations was calculated with DX (DSM), as can be seen in Table 10 (the correlation
values for the other calibration sets are available in appendix 7.4).
Table 10 – The correlation between samples’ concentration for each calibration set with DX (DSM).
PA calibration set
R2       PA        PS        DX    Placebo
PA       1.00      -         -     -
PS       9.00E-07  1.00      -     -
DX       3.00E-10  6.00E-07  1.00  -
Placebo  0.75      0.19      0.05  1.00
PS calibration set
R2       PA        PS        DX    Placebo
PA       1.00      -         -     -
PS       8.00E-07  1.00      -     -
DX       1.00E-07  3.00E-05  1.00  -
Placebo  0.60      0.32      0.08  1.00
DX calibration set
R2       PA        PS        DX    Placebo
PA       1.00      -         -     -
PS       2.00E-07  1.00      -     -
DX       6.00E-09  3.00E-05  1.00  -
Placebo  0.62      0.29      0.09  1.00
In this strategy, for each calibration set, the correlation coefficient between the concentration
of the API that was varied and that of the placebo was lower than in the first strategy.
However, the other APIs (whose concentrations were not varied) and the placebo became more
correlated.
3.3.2.1. Second strategy without variable selection
Several calibration models were built on differently pre-processed data, and the
best-performing models for each calibration set are summarized in Table 11.
Table 11 – The best results for each calibration set without variable selection (using DX (DSM)).
Compound  Pre-processing      Range (cm-1)   LV  R2    RMSECV (%)  RMSEP (%)
PA        MSC and MC          3996.2-9003.1  2   0.93  1.21        2.04
PA        MSC, 2nd D and MC   3996.2-9003.1  3   0.94  1.16        2.10
PS        MSC, 1st D and MC   3996.2-9003.1  5   0.93  0.77        1.08
PS        MSC, 2nd D and MC   3996.2-9003.1  5   0.94  0.72        1.24
DX        MSC and MC          3996.2-9003.1  6   0.95  0.36        0.43
DX        MSC, 1st D and MC   3996.2-9003.1  6   0.94  0.37        0.42
Comparing the models, the lowest RMSECV for the PA and PS calibrations was obtained
after MSC, second-derivative and mean-centering pre-processing, while for the DX calibration
it was obtained with MSC and mean centering.
The PS and DX models need 5 and 6 PLS factors, respectively, while the PA models require just 2 or
3 latent variables. Compared with the previous results, these models need more PLS factors
and do not give better RMSECV and RMSEP. The PA models appear to be underfitted, since the differences
between RMSECV and RMSEP are significant; consequently, they do not capture enough of the
variability in the data.
3.3.2.2. Second strategy with variable selection
The results obtained over the whole wavenumber range were not good. So, the same
variable selection techniques previously mentioned were applied, with the aim of finding the
specific regions of each API that give better results. The models with the lowest RMSECV and
RMSEP for each calibration set are summarized in Table 12.
Table 12 – The best results for each calibration set (using DX (DSM)) using variable selection techniques.
Method  Compound  Pre-processing      LV  R2    RMSECV (%)  RMSEP (%)  Range (cm-1)
R2      PA        MSC, 2nd D and MC   2   0.95  1.04        1.72       4775.4-4991.4; 5099.4-5423.4; 6534.3-7182.4; 8216.2-8285.6
R2      PS        MSC, 1st D and MC   2   0.96  0.64        1.47       4335.7-4520.8
R2      DX        MSC, 2nd D and MC   4   0.94  0.37        0.46       4868-5091.7; 5608.6-5932.6
iPLS    PA        MSC, 2nd D and MC   2   0.88  1.52        1.38       4767.7-5114.8
iPLS    PS        MSC, 2nd D and MC   2   0.84  1.29        1.95       4505.4-4752.3
iPLS    DX        MSC, 2nd D and MC   2   0.93  0.39        0.44       4844.8-5114.8
GA      PA        MSC, 2nd D and MC   3   0.91  1.34        1.37       6873.8-7197.8; 7645.3-8177.6; 8393.6-8686.7
GA      PS        MSC, 2nd D and MC   4   0.89  0.93        1.08       4150.5-4374.2; 4844.8-4991.4; 5384.9-5454.3; 5616.3-5685.7; 5847.7-6071.5; 6387.8-6457.2; 7159.2-7460.1; 7930.7-8000.1; 8162.1-8540.2
GA      DX        MSC, 1st D and MC   5   0.97  0.28        0.46       5400.3-6194.9; 6310.6-7058.9; 7174.7-7753.3; 7923-8609.6
The best results, in terms of the lowest RMSEP and the similarity between prediction and
cross-validation errors, were obtained after MSC, second derivative and MC of the spectra for
the PA and PS models with variables selected by GA. For these models, an RMSECV of
1.34 and an RMSEP of 1.37 were obtained for the PA calibration with 3 PLS factors, and an RMSECV of 0.93 and
an RMSEP of 1.08 for the PS calibration using 4 latent variables. The R2
values of these models were 0.91 and 0.89, respectively.
For the same pre-treatment, however, the best DX model
was built with variable selection by the iPLS method. This model needs 2 latent variables and has
an R2 of 0.93, an RMSECV of 0.39 and an RMSEP of 0.44.
In conclusion, the models built after variable selection generally had better results than those without, as
observed in the first strategy. In addition, the models developed in the first strategy had
better results than those of the second, as can be seen in Table 13.
Table 13 – The best results obtained in the first and second strategy.
          First strategy           Second strategy
Compound  RMSECV (%)  RMSEP (%)    RMSECV (%)  RMSEP (%)
PA        0.74        0.74         1.34        1.37
PS        0.54        0.56         0.93        1.08
DX        0.34        0.33         0.39        0.44
3.4. Obtained results vs. other studies
A similar study of an analogous pharmaceutical formulation was already carried out by M.
Alcalá for two of the APIs studied [35]. Since independent calibration models were developed
for each API, the two studies can be compared.
Table 14 – The best results obtained in current and Alcalá’s study.
                Current study                           Alcalá’s study
                PA                 DX                   PA         DX
% (w/w)         84.7               2.7                  6.5        0.2
Pre-processing  MSC, 2nd D and MC  MSC, 2nd D and MC    1st D      2nd D
Range (cm-1)    6688.6-7012.7      (listed below)       4255-9090  4255-9090
LV              2                  3                    4          5
RMSECV (%)      0.74               0.34                 2.7        1.7
RMSEP (%)       0.74               0.33                 2.2        3.3
DX range in the current study (cm-1): 4844.8-4906.5; 5030-5346.3; 5678-5863.2; 6626.9-6657.8; 6765.8-7028.1
In these two studies, the PA and DX calibration models were built using different
pre-processing and ranges. In the current work, both models used fewer LVs and had lower and more similar
RMSECV and RMSEP than in Alcalá’s study. Consequently, these models should be more
robust. However, to confirm this supposition, external validation and validation according to the ICH
(International Conference on Harmonisation), EMEA (European Medicines Agency) and
PASG (Pharmaceutical Analytical Sciences Group) guidance must be done prior to use by the
pharmaceutical industry.
3.5. Orthogonal analysis
OPLS modelling was performed to establish the correlation between the NIR spectroscopy
data of DX (X block) and its laser diffraction PSD (Y block).
This method aims to separate the systematic variation in the X block into two parts: one
that is linearly related to Y, and a second that is unrelated (orthogonal) to Y. This
separation facilitates the interpretation of the model: the components that are related to Y are
called predictive (p), while those that are unrelated to Y are called orthogonal (o).
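The separation into predictive and orthogonal variation can be illustrated with one step of the single-response OPLS algorithm: a Y-orthogonal component is estimated from the part of the predictive loading that is not aligned with the predictive weight, and then removed from X. This is a minimal sketch on synthetic data, not the software used in this work; the function name and the data are hypothetical.

```python
import numpy as np

def opls_one_orthogonal(X, y):
    """Remove one Y-orthogonal component from mean-centred X (single-y OPLS step)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)                 # predictive weight
    t = Xc @ w                             # predictive scores
    p = Xc.T @ t / (t @ t)                 # predictive loading
    w_o = p - (w @ p) * w                  # part of the loading orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = Xc @ w_o                         # orthogonal scores (uncorrelated with y)
    p_o = Xc.T @ t_o / (t_o @ t_o)         # orthogonal loading
    X_filtered = Xc - np.outer(t_o, p_o)
    return X_filtered, t_o, p_o

# Synthetic data: y drives one spectral pattern, an unrelated factor drives another.
rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 30)
other = rng.uniform(0, 1, 30)
s_y, s_other = rng.normal(size=40), rng.normal(size=40)
X = np.outer(y, s_y) + np.outer(other, s_other) + 1e-3 * rng.normal(size=(30, 40))

X_filtered, t_o, p_o = opls_one_orthogonal(X, y)
print(abs(t_o @ (y - y.mean())))   # numerically zero: t_o carries no information on y
```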
First, to simplify the application of this technique, the DX PSD [d(0.1), d(0.5) and d(0.9)]
was compressed into only 1 LV by PCA. One LV described it properly, because a high
percentage of the X data explained by the model, R2 (X), and a high percentage of variation predicted
by the model according to cross-validation, Q2, were obtained: 0.98 and 0.95, respectively.
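The compression of the three PSD descriptors into a single component can be sketched with a PCA computed by SVD. The values below are the DX batch results from Table 3; using batch averages and autoscaling the columns are assumptions of this example, so the explained variance it prints need not match the R2 (X) reported above.

```python
import numpy as np

# d(0.1), d(0.5), d(0.9) in µm for the six DX batches, copied from Table 3.
psd = np.array([
    [69.57, 134.39, 251.33],
    [63.79, 128.23, 235.95],
    [66.90, 139.87, 252.73],
    [48.46, 105.87, 206.04],
    [42.88,  94.76, 187.44],
    [65.19, 135.55, 252.51],
])

# Assumption: autoscale each descriptor (zero mean, unit variance) before PCA.
scaled = (psd - psd.mean(axis=0)) / psd.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(scaled, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)   # fraction of variance per component
scores_pc1 = U[:, 0] * s[0]           # one number per batch summarising its PSD
print(np.round(explained, 3))
print(np.round(scores_pc1, 2))
```

The first-component scores can then serve as the single Y variable of the OPLS model.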
Next, OPLS models were developed for two different pre-processing options.
Table 15 – OPLS summary results for different pre-processing techniques of DX samples.
Data set                            R2 (X)p  R2 (Y)p  Q2p    LV p  R2 (X)o  R2 (Y)o  Q2o    LV o
Without pre-processing              0.712    0.526    0.459  1     0.283    0.144    0.096  1
Second derivative (see footnote 7)  0.299    0.302    0.285  1     0.701    0.679    0.468  4
In the first model (without any pre-processing), as mentioned above, the physical
properties, such as the DX particle size, are more evident. Thus, in this model the particle size
information (Y) was predicted from the physical information contained in the spectral data (X). As expected,
the predictive component is much more influential than the orthogonal one, because the X and Y blocks are
strongly related. The predicted variance of the predictive component is much higher than that of the
orthogonal one for the same number of latent variables: Q2p = 0.459 while Q2o = 0.096.
In the second model, the spectral data (X) were pre-processed with the second derivative,
which minimises the physical properties and gives more emphasis to the chemical properties.
Consequently, the opposite of the first model occurs: the orthogonal components had more
influence than the predictive component, as can be confirmed by the Q2 values, 0.468 and 0.285 respectively.
Thus, the particle size information (Y) is almost unrelated to the spectral data (X).
7 The 2nd derivative was applied to the spectral data using the SG algorithm with a 15-point moving window.
The deviation between the percentage of X and Y data explained by the model is described
by the residuals. In this case, there are almost no residuals in either data set; consequently the
spectra have low noise level.
Table 16 – Residuals in X and Y.
Data set                Residual X  Residual Y
Without pre-processing  0.01        0.33
Second derivative       0.00        0.02
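The residuals in Table 16 appear to be the fraction of variance left once both the predictive and the orthogonal components are accounted for. The short Python sketch below checks this reading against the values of Table 15. The relation residual = 1 − R2p − R2o is an assumption inferred from the numbers, not a formula stated in the text, and the function and variable names are illustrative only.

```python
# Values copied from Table 15: (R2 predictive, R2 orthogonal) per block.
TABLE_15 = {
    "Without pre-processing": {"X": (0.712, 0.283), "Y": (0.526, 0.144)},
    "Second derivative": {"X": (0.299, 0.701), "Y": (0.302, 0.679)},
}


def residual_fraction(r2_predictive, r2_orthogonal):
    """Assumed relation: unexplained fraction = 1 - R2p - R2o."""
    return 1.0 - r2_predictive - r2_orthogonal


def residual_table(summary):
    """Return {data set: {block: residual}} for every entry in `summary`."""
    return {
        name: {block: residual_fraction(*r2) for block, r2 in blocks.items()}
        for name, blocks in summary.items()
    }


if __name__ == "__main__":
    for name, blocks in residual_table(TABLE_15).items():
        print(name, {block: f"{value:.3f}" for block, value in blocks.items()})
```

Under this assumption the computed residuals (about 0.005 and 0.330 without pre-processing, about 0.000 and 0.019 with the second derivative) agree with Table 16 to within rounding.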
4. CONCLUSIONS
Near Infrared (NIR) spectroscopy, in combination with chemometrics, enables quantitative
analysis of a pharmaceutical preparation. This work was developed with the purpose of
accurately quantifying the concentration of each of the three Active Pharmaceutical
Ingredients (APIs) in a pharmaceutical solid dosage form.
NIR spectroscopy provides major advantages over conventional methods, because it does
not require sample preparation and it is a non-destructive technique. This technique also has
the potential to distinguish the chemical and physical properties of samples.
To characterise the APIs in more detail, the chemical properties of each API were
studied by NIR spectroscopy, using second-derivative pre-processing (which minimises the
physical effects). Comparing the FT-NIR absorption spectra of the three APIs (Figure 7) and
their chemical structures (Figure 8), the overlapping of several absorption bands was evident,
especially between Paracetamol (PA) and Pseudoephedrine Hydrochloride (PS). This
overlap can be explained by the similarity of some functional groups of the APIs. The
resulting lack of selectivity can be overcome with multivariate chemometric techniques (such
as Partial Least Squares (PLS)) and variable selection. Accordingly, the quantitative
determination of API content in the pharmaceutical product was developed using PLS
regression, avoiding the overlapping absorption bands between PA and PS.
The differences between the physical properties of the samples were identified by NIR
spectroscopy (without any pre-processing) and powder laser diffraction. According to both
techniques, the Particle Size Distributions (PSD) of PA and Dextromethorphan Hydrobromide
(DX) were more similar to each other than to that of PS. As can be seen in Figure 13, PA and
DX had a Gaussian distribution, while PS was bimodal, representing a heterogeneous PSD.
Powder laser diffraction is a fast and useful analytical tool for the particle characterization
of the API batches, with adequate precision over a wide particle size range.
For the NIR calibrations, an appropriate experimental design was created, with a low
correlation between concentrations, to guarantee the good calibration set needed to build a
robust model. Three different calibration sets of laboratory samples were prepared, in which
only the concentrations of one API and of the placebo were varied, by overdosing and
underdosing. Each calibration set was used to predict only one API, taking into account small
deviations of that API from linearity in the studied concentration range. This approach (first
strategy) solved the selectivity problems because each calibration set captures the
concentration variation of only one API.
The first strategy is rather idealised, since it does not take into account interactions between
APIs. Thus, a second strategy was considered, in which some spectra from the other calibration
sets were added, so that the concentrations of all APIs were increased or decreased in
each calibration set. However, although the correlation between each API whose concentration
was varied and the placebo decreased, the other APIs (whose concentrations were
not varied) and the placebo became more correlated. Consequently, the results obtained with
the second strategy were worse than those of the first strategy.
Variable selection was also applied to search for the spectral region with the minimum
non-linearity of response, i.e., keeping the spectral information of the components of interest
and removing irrelevant or noisy signals. Better results were obtained with variable selection.
The prediction ability of the proposed methods is summarized in Table 17.
Table 17 – The best results for each calibration model, using PLS regression, with variable selection (using DX (DSM)).
Compound  Method  Pre-processing     LV  R2    RMSECV (%)  RMSEP (%)
PA        iPLS    MSC, 2nd D and MC  2   0.98  0.74        0.74
          Range: 6688.6-7012.7
PS        GA      MSC, 2nd D and MC  3   0.98  0.54        0.56
          Range: see appendix 7.4
DX        R2      MSC, 2nd D and MC  3   0.98  0.34        0.33
          Range: 4844.8-4906.5; 5030-5346.3; 5678-5863.2; 6626.9-6657.8; 6765.8-7028.1
These three calibration models provided the lowest RMSECV and RMSEP values, together with
high R2.
The calibration models developed can be used for quality control of the studied
pharmaceutical formulation. The absolute error for each API in a tablet was calculated to
demonstrate the potential of this method.
Table 18 – Weight percentage of each API, the respective RMSEP obtained for the best calibration set, and the weight of each active ingredient in a tablet.
Compound  % (w/w)  RMSEP (%)  Weight (mg in a tablet)
PA        84.75    0.74       500 ± 4
PS        5.08     0.56       30 ± 3
DX        2.67     0.33       16 ± 2
As can be seen in Table 18, this technique allows an accurate detection with small errors.
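The ± values in Table 18 can be reproduced with a short calculation. The Python sketch below rests on two assumptions that are not stated explicitly in the text: that RMSEP is expressed in % (w/w) of the tablet mass, and that the tablet mass can be inferred from the PA entry (500 mg at 84.75 % (w/w)). The function names are illustrative.

```python
# Values copied from Table 18: compound -> (% w/w, RMSEP in %, nominal mg).
TABLE_18 = {
    "PA": (84.75, 0.74, 500.0),
    "PS": (5.08, 0.56, 30.0),
    "DX": (2.67, 0.33, 16.0),
}


def tablet_mass_mg(weight_percent, nominal_mg):
    """Tablet mass implied by one ingredient's nominal content (assumption)."""
    return nominal_mg / (weight_percent / 100.0)


def absolute_error_mg(rmsep_percent, tablet_mg):
    """Convert an RMSEP given in % (w/w) into milligrams per tablet."""
    return rmsep_percent / 100.0 * tablet_mg


if __name__ == "__main__":
    pa_percent, _, pa_mg = TABLE_18["PA"]
    mass = tablet_mass_mg(pa_percent, pa_mg)  # about 590 mg
    for compound, (_, rmsep, nominal) in TABLE_18.items():
        error = absolute_error_mg(rmsep, mass)
        print(f"{compound}: {nominal:.0f} mg, error about {error:.1f} mg")
```

Rounded to whole milligrams, this gives 4, 3 and 2 mg for PA, PS and DX, which matches the ± values listed in Table 18.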
In addition, this study was compared with one available in the literature, which uses a
similar pharmaceutical formulation [35].
Table 19 – The accuracy obtained in the current study and in Alcalá’s study [35].
Compound  Current study  Alcalá’s study
PA        500 ± 4        650 ± 22
DX        16 ± 2         20 ± 33
The PA and DX models built in the current study quantify the APIs with smaller absolute
errors than those reported in Alcalá’s study. Consequently, the present models are more
accurate than those developed by Alcalá.
The OPLS model makes it possible to verify how well the Y and X variables are correlated, in
this case the powder laser diffraction and NIR spectroscopy data, respectively. As mentioned
above, different pre-treatments can be used to emphasise either the physical or the chemical
properties of the samples in the NIR spectra. Without any pre-processing, the powder laser
diffraction data can be predicted from the NIR spectra (based on the physical properties of
the samples), since a comparatively high Q2p was obtained (0.459). However, when
second-derivative pre-processing was used, as expected, X and Y were much less related,
because the NIR spectra then reflect mainly the chemical properties of the samples.
Consequently, Q2o is higher than Q2p (0.468 and 0.285, respectively).
In conclusion, these two powerful techniques can be used in parallel for the quantification
and quality control of solid dosage formulations in the Pharmaceutical Industry.
5. SUGGESTIONS FOR FUTURE WORK
In this study, a quantitative analysis of the three APIs in a solid dosage formulation was
developed to assure the quality control of the pharmaceutical product, using NIR spectroscopy
and chemometric techniques. Good results were obtained for the multivariate calibration
models; however, their accuracy has to be validated with external samples and according to the
ICH, EMEA and PASG guidelines.
For the development of the quantitative analysis, three independent sets of laboratory
samples were produced, one for each API. In each calibration set, the concentrations of the
selected API and of the placebo were varied by overdosing and underdosing. Instead of three
sets of laboratory samples, a single set could be prepared in which the placebo and all API
concentrations are varied randomly (avoiding correlation between APIs). Moreover, that set
would have more concentration variability and a larger number of samples. Thus, the three
calibration models could be created from that set alone.
In this work, the potential of NIR spectroscopy, associated with chemometrics, and of Laser
Diffraction was studied. The combination of both techniques proved to be a suitable tool
for quality control of the end product. These tools can also be used to monitor and
control the manufacturing process in real time, in line with the PAT initiative. Thus, during
granulation, Powder Laser Diffraction could be used to reduce or eliminate the potential
risk of particle segregation within the product, and the quality of the end product can be
assured with NIR spectroscopy.
6. REFERENCES
[1] Guidance for industry. PAT – a framework for innovative pharmaceutical
manufacturing and quality assurance (U.S. Food and Drug Administration, Rockville, MD,
USA, 2003)
[2] http://www.asdi.com/nir-chart_grid_rev-3.pdf (July 2008)
[3] H. W. Siesler, Y. Ozaki, S. Kawata, H. M. Heise; Near-Infrared Spectroscopy
Principles, Instruments, Applications; WILEY-VCH; New York; 2002
[4] Barbara Stuart; Infrared Spectroscopy: Fundamentals and applications; John Wiley &
Sons, Ltd; New York; 2004
[5] J. Luypaert, D. L. Massart, Y. Vander Heyden; Near-infrared spectroscopy
applications in pharmaceutical analysis; Talanta; Volume 72, Issue 3, 15 May 2007, Pages
865-883
[6] Matthias Otto; Chemometric Statistics and Computer Application in Analytical
Chemistry; WILEY-VCH; Weinheim, Germany; 1999
[7] Daniel C. Harris; Quantitative Chemical Analysis – Third Volume; Fifth Edition;
Freeman; New York; 1995
[8] Skoog, West, Holler; Analytical Chemistry an Introduction – Sixth Edition; New York,
1997
[9] Emil W. Ciurczak, James K. Drennen III; Pharmaceutical and Medical Applications of
Near-Infrared Spectroscopy; Marcel Dekker Inc.; New York; 2002
[10] Celio Pasquini; Near Infrared Spectroscopy: Fundamentals, Practical Aspects and
Analytical Applications; J. Braz. Chem. Soc.; Volume 14 Nº2, São Paulo, March/April 2003
[11] Bernhard Lendl, Bo Karlberg; Advancing from unsupervised, single variable-based
methods: A challenge for qualitative analysis; Trends in Analytical Chemistry; Volume 24 Nº
6, 2005
[12] Katherine A. Bakeev; Process Analytical Technology: Spectroscopic Tools and
Implementation Strategies for the Chemical and Pharmaceutical Industries; Blackwell
Publishing; New York; 2005
[13] Tormod Næs, Tomas Isaksson, Tom Fearn, Tony Davies; A user-friendly guide to
Multivariate Calibration and Classification; NIR Publications; Chichester, U.K; 2002
[14] Yves Roggo, Pascal Chalus, Lene Maurer, Carmen Lema-Martinez, Aurélie Edmond,
Nadine Jent; A review of Near Infrared spectroscopy and Chemometrics in pharmaceutical
technologies; Journal of Pharmaceutical and Biomedical Analysis; Volume 44, Issue 3; 27
July 2007; Pages 683-700
[15] Mei-Lin Wu, You-Shao Wang; Using chemometrics to evaluate anthropogenic effects
in Daya Bay, China; Estuarine, Coastal and Shelf Science; Volume 72, Issue 4; May 2007;
Pages 732-742
[16] F. González, R. Pous; Quality control in manufacturing process by near infrared
spectroscopy; Journal of Pharmaceutical and Biomedical Analysis; Volume 13 Nº4, April
1995; Pages 419-423(5)
[17] PLSplus IQ user’s guide; Thermo Electron Corporation, Salem, NH, USA
[18] Lars Nørgaard; iToolbox Manual; July 2004; Denmark
[19] Matlab manual version 6.5; Mathworks Inc., 2005
[20] B. Üstün, W. J. Melssen, M. Oudenhuijzen, L. M. C. Buydens; Determination of
optimal support vector regression parameters by genetic algorithms and simplex optimization;
Analytica Chimica Acta; Volume 544 Nº 1-2; May 2005; Pages 292-305
[21] Yibin Ying, Yande Liu; Non-destructive measurement of internal quality in pear
using genetic algorithms and FT-NIR spectroscopy; Journal of Food Engineering; Volume 84
Nº2; 2008; Pages 206-213
[22] Mastersizer 2000 user manual; Malvern Instruments; 2007
[23] Alan Rawle; Technical Paper: Basic Principles of particles size analysis; Malvern
Instruments; New York; 1995
[24] S. Sonja Sekulic, John Wakeman, Phil Doherty, Perry A. Hailey; Automated system
for the on-line monitoring of powder blending processes using near-infrared spectroscopy:
Part II. Qualitative approaches to blend evaluation; Journal of Pharmaceutical and Biomedical
Analysis; Volume 17, Issue 8, 30 September 1998; Pages 1285-1309
[25] Application Note: Wet method development for laser diffraction measurements;
Malvern Instruments
[26] Application Note: Method validation for laser diffraction measurements; Malvern
Instruments
[27] http://www.chemspider.com/RecordView.aspx?id=1906 (July 2008)
[28] http://www.chemspider.com/RecordView.aspx?rid=13bf38e9-a8ac-4992-a826-191ff5964465 (July 2008)
[29] Weng Li Yoon, Roger D. Jee, Andrew Charvill, Gerard Lee, Anthony C. Moffat;
Application of near-infrared spectroscopy to the determination of the sites of manufacture of
proprietary products; Journal of Pharmaceutical and Biomedical Analysis; Volume 34, Issue
5, 20 March 2004, Pages 933-944
[30] A.P. Tinke, K. Vanhoutte, F. Vanhoutt, M. De Smet, H. De Winter; Laser diffraction
and image analysis as a supportive analytical tool in the pharmaceutical development of
immediate release direct compression formulations; International Journal of Pharmaceutics;
Volume 297 Nº1-2; 2005; Pages 80-88
[31] Sample dispersion & Refractive index guide; man. 0079 version 3.1; Malvern
Instruments; 1997
[32] http://www.dsm.com (August 2008)
[33] M. Blanco, A. Eustaquio, J. M. González, D. Serrano; Identification and quantitation
assays for intact tablets of two related pharmaceutical preparations by reflectance near-
infrared spectroscopy: validation of the procedure; Journal of Pharmaceutical and Biomedical
Analysis; Volume 22 Nº1; 2000; Pages 139-148
[34] http://www.abb.com (July 2008)
[35] M. Blanco, M. Alcalá; Simultaneous quantitation of five principles in a
pharmaceutical preparation: Development and validation of a near infrared spectroscopy
method; European Journal of Pharmaceutical Sciences; Volume 27 Nº 2-3; 2006; Pages 280-
286
[36] Mattias Hedenström, Susanne Wiklund, Björn Sundberg, Ulf Edlund; Visualization
and interpretation of OPLS models based on 2D NMR data; Chemometrics and Intelligent
Laboratory Systems; Volume 92 Nº2; 2008; Pages 110-117
[37] Svante Wold, Johan Trygg, Anders Berglund, Henrik Antti; Some recent
developments in PLS modeling; Chemometrics and Intelligent Laboratory Systems; Volume
58, Issue 2; 28 October 2001; Pages 131-150
[38] Jon Gabrielsson, Hans Jonsson, Christian Airiau, Bernd Schmidt, Richard Escott,
Johan Trygg; OPLS methodology for analysis of pre-processing effect on spectroscopy data;
Chemometrics and Intelligent Laboratory Systems; Volume 84, Issue 1-2; 1 December 2006;
Pages 153-158
7. APPENDIX
7.1. Determination of Percent Relative Standard Deviation
(%RSD)
Precision is often measured by the standard deviation of the set. The standard deviation s
of a set of n repeat measurements is defined as
$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \tag{4} $$
where $x$ is a single measurement and $\bar{x}$ is the mean measurement.
A lower standard deviation means better precision of the set of repeat measurements.
The relative precision of two or more methods of measurement is compared by calculating
their percent relative standard deviation (%RSD), which is calculated from the standard
deviation $s$ and the mean measurement $\bar{x}$, according to the equation:
$$ \%RSD = \frac{s}{\bar{x}} \times 100 \tag{5} $$
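As a minimal illustration of equations (4) and (5), the Python sketch below computes the standard deviation and the %RSD of a set of repeat measurements. The readings are hypothetical values chosen for easy checking, not data from this work, and the function names are illustrative.

```python
import math


def sample_std(values):
    """Standard deviation with the n - 1 denominator, as in equation (4)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two repeat measurements are required")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def percent_rsd(values):
    """Percent relative standard deviation, as in equation (5)."""
    mean = sum(values) / len(values)
    return 100.0 * sample_std(values) / mean


if __name__ == "__main__":
    # Hypothetical repeat readings (not thesis data).
    repeats = [10.0, 12.0, 14.0]
    print(sample_std(repeats), percent_rsd(repeats))
```

For these three readings the mean is 12, the standard deviation is 2 and the %RSD is about 16.7 %.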
7.2. Matrix design for laboratory samples
For the production of laboratory samples, a matrix design was created based on three
independent calibration sets. In each calibration set, the concentration of the API was reduced
by adding small amounts of placebo, as can be seen in Table 20.
Table 20 – Matrix design for laboratory samples (footnote 8).
Sample  PA    PS    DX
1       92.3  10.2  5.4
2       91.4  10.2  5.3
3       90.6  9.2   4.8
4       89.7  8.1   4.3
5       88.8  7.1   3.7
6       88.0  6.1   3.2
7       87.1  5.1   2.7
8       86.3  4.1   2.1
9       85.5  3.1   1.6
10      84.7  2.0   1.1
11      83.8  1.0   0.5
12      82.9  0.0   0.0
13      82.1
14      81.2
15      80.4
16      79.5
17      78.7
18      77.8
19      77.1
8 For each calibration set, only the API represented and placebo concentrations were varied.
7.3. First Strategy
For simplicity, only the results of the calibration sets developed with DX (DSM) were
presented in Results and Discussion.
The correlation between pairs of APIs for each calibration set with DX (Divis) can be seen
in Table 21.
Table 21 – The correlation between samples’ concentration for each calibration set with DX (Divis).
Calibration set: PA
R2       PA        PS    DX    Placebo
PA       1         -     -     -
PS       0.08      1     -     -
DX       0.09      0.01  1     -
Placebo  1         0.08  0.09  1
Calibration set: PS
R2       PA        PS    DX    Placebo
PA       1         -     -     -
PS       0.17      1     -     -
DX       5.00E-05  0.03  1     -
Placebo  0.17      1     0.03  1
Calibration set: DX
R2       PA        PS    DX    Placebo
PA       1         -     -     -
PS       0.35      1     -     -
DX       0.01      0.09  1     -
Placebo  0.01      0.09  1     1
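Each entry of Table 21 is a squared correlation coefficient between two concentration columns of a calibration set. The Python sketch below shows how such a value could be computed. All concentrations in it are hypothetical and chosen only for illustration (they are not the laboratory data of this work), and computing the placebo by difference to 100 % is likewise an assumption.

```python
def r_squared(a, b):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov ** 2 / (var_a * var_b)


# Hypothetical concentrations in % (w/w); illustration only, not thesis data.
PA = [80.0, 82.0, 84.0, 86.0, 88.0]   # API that is deliberately varied
PS = [5.1, 5.0, 5.2, 5.0, 5.1]        # nominally constant, small weighing scatter
DX = 2.7                              # held constant in this hypothetical set
PLACEBO = [100.0 - pa - ps - DX for pa, ps in zip(PA, PS)]

if __name__ == "__main__":
    print("R2(PA, Placebo) =", round(r_squared(PA, PLACEBO), 4))
    print("R2(PA, PS)      =", round(r_squared(PA, PS), 4))
```

In this hypothetical set the varied API and the placebo are almost perfectly correlated, while the varied API and a nominally constant API are practically uncorrelated, which is the pattern seen in the PA block of Table 21.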
Several models without variable selection were developed, following the steps previously
described. The characteristics of the best models for each API are summarized in
Table 22.
Table 22 – The best results for each calibration set without variable selection (using DX (Divis)).
Compound  Pre-processing     Range          LV  R2    RMSECV (%)  RMSEP (%)
PA        MSC and MC         3996.2-9003.1  3   0.98  1.26        0.67
PA        MSC, 1st D and MC  3996.2-9003.1  3   0.95  1.21        1.29
PS        MSC and MC         3996.2-9003.1  2   0.97  0.58        0.79
PS        MSC, 1st D and MC  3996.2-9003.1  2   0.97  0.65        0.59
DX        MSC, 1st D and MC  3996.2-9003.1  3   0.97  0.33        0.29
DX        MSC, 2nd D and MC  3996.2-9003.1  3   0.98  0.34        0.23
Three different variable selection techniques were applied to each calibration set, with
the aim of finding one or more characteristic regions where the concentration variation of
each API is most significant. Thus, calibration models for the three APIs were developed with
three different variable selection methods (Coefficient of Determination, iPLS and Genetic
Algorithm), but only the models with the best predictive capabilities are shown in the
table below.
Table 23 – The best results for each calibration set with variable selection (using DX (Divis)).
Method  Compound  Pre-processing     LV  R2    RMSECV (%)  RMSEP (%)
R2      PA        MSC, 1st D and MC  3   0.98  1.18        0.60
        Range: 4327.9-4443.7; 4821.7-4922; 4968.3-5408; 5801.5-5895; 6110-7251.8; 7992.4-8517
R2      PS        MSC and MC         2   0.91  0.65        1.99
        Range: 5006.8-54003; 6256.6-7305.8
R2      DX        MSC, 1st D and MC  3   0.99  0.26        0.25
        Range: 4790.8-4891.1; 5045.4-5400.3; 5786-5809.2; 6295.2-6410.9; 6889.2-7182.4;
               8239.3-8324.2
iPLS    PA        MSC, 2nd D and MC  5   0.97  1.16        1.35
        Range: 4505.38-4999.12
iPLS    PS        MSC, 1st D and MC  4   0.97  0.59        0.83
        Range: 7020.4-7344.4
iPLS    DX        MSC, 1st D and MC  4   0.98  0.30        0.39
        Range: 5022.3-5269.1
GA      PA        MSC, 2nd D and MC  4   0.97  1.18        1.14
        Range: 4158.2-4219.8; 4327.9-4428.2; 4783.1-4814; 5269.1-5485.1; 6295.2-6364.6;
               6534.3-6673.2; 6765.8-6827.5; 7004.9-7097.5; 7321.2-7406.1; 7498.7-7668.4;
               7791.8-7845.8; 7969.3-8216.2
GA      PS        MSC, 2nd D and MC  2   0.96  0.87        0.71
        Range: 4042.5-4158.2; 4351.1-4405.1; 4520.8-4698.3; 4883.4-4945.1; 5261.4-5377.1;
               5508.3-5693.4; 5816.9-5932.6; 6156.3-6356.9; 6480.3-6619.52; 6704.1;
               6773.5-6935.5; 7043.5-7113; 7236.4-7275; 7375.3-7552.7; 7637.6;
               7768.7-7899.9; 8154.4-8262.4; 8370.4-8586.5; 8694.5-8733; 8887.3-9003.1
GA      DX        MSC, 1st D and MC  4   0.99  0.27        0.29
        Range: 4621.1-4767.7; 4852.5-4922; 5006.8-5076.3; 5469.7-5616.3; 6549.8-6619.2;
               7004.9-7097.5; 7321.2-7406.1; 7498.7-7668.4; 7791.8-7845.8; 7969.3-8216.2
For simplicity, the ranges selected for each calibration set by GA variable selection
are not included in Table 9. Therefore, the three API calibration models that used the
Genetic Algorithm are shown in the table below.
Table 24 – The best results for each calibration set with variable selection (using DX (DSM)).
Compound  Pre-processing     LV  R2    RMSECV (%)  RMSEP (%)
PA        MSC, 2nd D and MC  3   0.98  0.70        0.74
          Range: 4158.2-4219.9; 4327.9-4428.2; 4783.1-4814; 5269.1-5485.1; 6295.2-6364.6;
                 6534.3-6673.2; 6765.8-6827.5; 7004.9-7097.5; 7321.2-7406.1; 7498.7-7668.4;
                 7791.8-7845.8; 7969.3-8216.2
PS        MSC, 1st D and MC  3   0.98  0.54        0.56
          Range: 4235.4-4690.5; 5932.6-6002; 6318.3-6387.8; 6704.1-6773.5; 6935.5-7004.9;
                 7475.5-7545; 8941.3-9003.1
DX        MSC, 2nd D and MC  3   0.98  0.29        0.35
          Range: 4621.1-4767.7; 4852.5-4922; 5006.8-5076.3; 5469.7-5616.3; 6549.8-6619.2;
                 7012.7-7082.1; 7167-7313.5; 7861.3-7930.7; 8324.2-8393.6; 8555.6-8856.5
7.4. Second Strategy
In this strategy, the first three and last three spectra (including the three replicas) were
added to each previously developed calibration set. The correlation coefficients between
pairs of components, calculated with DX (Divis), can be seen in Table 25.
Table 25 – The correlation between samples’ concentration for each calibration set with DX (Divis).
Calibration set: PA
R2       PA        PS        DX    Placebo
PA       1.00      -         -     -
PS       4.00E-07  1.00      -     -
DX       3.00E-07  2.00E-06  1.00  -
Placebo  0.75      0.19      0.05  1.00
Calibration set: PS
R2       PA        PS        DX    Placebo
PA       1.00      -         -     -
PS       2.00E-07  1.00      -     -
DX       2.00E-07  3.00E-05  1.00  -
Placebo  0.62      0.29      0.09  1.00
Calibration set: DX
R2       PA        PS        DX    Placebo
PA       1.00      -         -     -
PS       8.00E-07  1.00      -     -
DX       1.00E-07  3.00E-05  1.00  -
Placebo  0.60      0.32      0.08  1.00
Several calibration models were built on differently pre-processed data, and the
best-performing models for each calibration set were chosen and are summarized in Table 26.
Table 26 – The best results for each calibration set without variable selection (using DX (Divis)).
Compound  Pre-processing                              Range          LV  R2    RMSECV (%)  RMSEP (%)
PA        MSC and mean centering                      3996.2-9003.1  6   0.90  1.40        2.01
PA        MSC, first derivative and mean centering    3996.2-9003.1  6   0.89  1.46        1.75
PS        MSC and mean centering                      3996.2-9003.1  3   0.94  0.77        0.80
PS        MSC, second derivative and mean centering   3996.2-9003.1  4   0.94  0.81        0.82
DX        MSC and mean centering                      3996.2-9003.1  6   0.96  0.31        0.47
DX        MSC, second derivative and mean centering   3996.2-9003.1  6   0.95  0.34        0.35
Three different variable selection techniques were applied to each calibration set, but
only the models with the best predictive capabilities are shown in the table below.
Table 27 – The best results for each calibration set with variable selection (using DX (Divis)).
Method  Compound  Pre-processing                              LV  R2    RMSECV (%)  RMSEP (%)
R2      PA        MSC, first derivative and mean centering    6   0.91  1.34        2.11
        Range: 4852.5-4914.3; 4983.7-5392.6; 6110-6310.6
R2      PS        MSC, second derivative and mean centering   3   0.95  0.7         0.76
        Range: 4142.8-4189.1; 4297.1-4389.7; 4490-4667.4; 5986.6-6056; 8794.8-8825.6
R2      DX        MSC, first derivative and mean centering    4   0.95  0.35        0.41
        Range: 4790.8-4860.3; 5145.7-5161.1; 8239.3-8277.9
iPLS    PA        MSC, second derivative and mean centering   5   0.95  1.03        1.86
        Range: 4767.7-5292.3
iPLS    PS        MSC, second derivative and mean centering   3   0.84  0.97        1.30
        Range: 4505.4-4752.3
iPLS    DX        MSC, first derivative and mean centering    3   0.95  0.33        0.38
        Range: 8007.9-8501.6
GA      PA        MSC, first derivative and mean centering    3   0.95  1.41        1.79
        Range: 6210-6310.6; 6572.9-6619.2; 7128.4-7197.8; 7444.7-7521.8; 8787-8841
GA      PS        MSC, second derivative and mean centering   4   0.94  0.9         0.64
        Range: 3996.2-4760; 4999.1-5454.3; 5847.7-5917.2; 7930.7-8771.6
GA      DX        MSC, first derivative and mean centering    5   0.97  0.23        0.32
        Range: 4898.8-5546.9; 6302.9-6434.1; 7344.4-7691.6; 8216.2-8409
7.5. Orthogonal analysis
To express the performance of the various models on the example data, the standard measure
of fit, R2, and the fraction of the total variation of the Y’s that can be predicted, Q2, are
usually used.
The explained variation of X and Y is described by equations 6 and 7, respectively:
$$ R^2(X) = 1 - \frac{SS(E)}{SS(X)} \tag{6} $$
$$ R^2(Y) = 1 - \frac{SS(F)}{SS(Y)} \tag{7} $$
where SS means the sum of squares, and E and F are the residual matrices of X and Y,
respectively.
The fraction of the total variation of the Y’s that can be predicted by a component is
described by the following equation:
$$ Q^2 = 1 - \frac{PRESS}{SS(Y)} \tag{8} $$
The prediction error sum of squares (PRESS) is the sum of the squared differences between
the observed Y values and the values predicted when the corresponding observations were kept
out of the model.
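The definitions in equations (6) to (8) can be illustrated with a small Python sketch. It assumes that the sums of squares are taken over mean-centred data, which is a common convention but is not stated in the text; the numbers are hypothetical and the function names are illustrative.

```python
def sum_of_squares(values):
    """Sum of squared entries of a flat list."""
    return sum(v * v for v in values)


def r2(residuals, centred_data):
    """R2 = 1 - SS(residuals) / SS(data), as in equations (6) and (7)."""
    return 1.0 - sum_of_squares(residuals) / sum_of_squares(centred_data)


def q2(y_centred, y_predicted_left_out):
    """Q2 = 1 - PRESS / SS(Y), as in equation (8)."""
    press = sum((y - p) ** 2 for y, p in zip(y_centred, y_predicted_left_out))
    return 1.0 - press / sum_of_squares(y_centred)


if __name__ == "__main__":
    # Hypothetical mean-centred Y values (not thesis data).
    y = [-2.0, -1.0, 0.0, 1.0, 2.0]
    # Hypothetical residuals F of the fitted model.
    f = [-0.25, 0.0, -0.25, 0.0, 0.25]
    # Hypothetical predictions obtained with each observation kept out.
    y_cv = [-1.5, -1.0, 0.5, 1.0, 1.5]
    print("R2(Y) =", r2(f, y))
    print("Q2    =", q2(y, y_cv))
```

Because the cross-validated predictions are made for observations that the model has not seen, PRESS is normally larger than SS(F), so Q2 is usually lower than R2(Y), as in this example.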
7.6. Mastersizer Average Result Analysis Report