732A44 Programming in R
Lecture: Data Mining in R
Logistic regression: two classes
• Consider a logistic model with one predictor: X = price of the car, Y = equipment
• Logistic model:

\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + \beta_1 x

P(Y=1 \mid X=x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}, \qquad P(Y=0 \mid X=x) = 1 - P(Y=1 \mid X=x)

• Use the function glm(formula, family, data)
– formula: Response ~ Model
• The model can consist of a+b (addition), a:b (interaction terms), a*b (addition and interactions), and . (all predictors)
– family: specify binomial
Logistic regression: two classes
reg <- glm(X3...Equipment ~ Price.in.SEK., family=binomial, data=mydata)
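To inspect the fit and turn it into class predictions (a minimal sketch reusing reg and mydata from above; the 0.5 threshold is an assumption):

summary(reg)                           # estimated beta0, beta1 and significance
prob <- predict(reg, type="response")  # fitted P(Y=1 | X=x)
pred <- as.numeric(prob > 0.5)         # classify with a 0.5 cutoff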
Logistic regression: several predictors
• Data about contraceptive use
– Response: a matrix of successes/failures
– Several diagnostic plots can be obtained with plot(lrfit)
\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + \beta_1^T x

P(Y=1 \mid X=x) = \frac{\exp(\beta_0 + \beta_1^T x)}{1 + \exp(\beta_0 + \beta_1^T x)}, \qquad P(Y=0 \mid X=x) = 1 - P(Y=1 \mid X=x)
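A minimal sketch of such a fit with a success/failure matrix response (the data frame cuse and its columns using, notUsing, age, education, wantsMore are hypothetical names used for illustration):

lrfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
             family=binomial, data=cuse)   # cbind() builds the success/failure matrix
summary(lrfit)
plot(lrfit)                                # several diagnostic plots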
Logistic regression
Further comments
• Nominal logistic regression: library mlogit, function mlogit()
• Stepwise model selection: the step() function
• Prediction: the predict() function
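A minimal sketch of the last two, reusing reg and mydata from the two-class example:

reg2 <- step(reg)                               # stepwise model selection by AIC
predict(reg2, newdata=mydata, type="response")  # predicted probabilities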
Smoothing splines
Minimize a penalized sum of squared residuals

RSS(f, \lambda) = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \lambda \int \left( f''(t) \right)^2 \, dt

where λ is the smoothing parameter:
λ = 0: any function interpolating the data
λ = +∞: the least-squares line fit
Smoothing splines
• smooth.spline(x, y, df, spar, cv, …)
– df: degrees of freedom
– spar: penalty parameter
– cv: TRUE = leave-one-out CV, FALSE = GCV, NA = no cross-validation

plot(m2$Kilometer, m2$Price, main="df=40")
res <- smooth.spline(m2$Kilometer, m2$Price, df=40)
lines(res, col="blue")
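If df is not fixed in advance, the penalty can instead be chosen automatically; a sketch using GCV (red curve added to the plot above):

res_gcv <- smooth.spline(m2$Kilometer, m2$Price, cv=FALSE)  # cv=FALSE: choose spar by GCV
res_gcv$df                                                  # effective degrees of freedom selected
lines(res_gcv, col="red")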
Generalized additive models
A function of the expected response is additive in the set of inputs, i.e.,

g(E(Y \mid X_1, \ldots, X_p)) = s_1(X_1) + \ldots + s_p(X_p)

Example: nonlinear logistic regression of a binary response

\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \log \frac{E(Y \mid X=x)}{1 - E(Y \mid X=x)} = s_0 + s_1(x)
GAM
• gam(formula, family=gaussian, data, method="GCV.Cp", select=FALSE, sp)
– formula: usual terms and spline terms s(…)
– method: method for selection of the smoothing parameters
– select: TRUE, if variable selection should be performed
– sp: smoothing parameters (maximal df)
Library: mgcv
• Example: car properties
• predict.gam() can be used for predictions
bp <- gam(MPG ~ s(WT, sp=2) + s(SP, sp=1), data=m3)
vis.gam(bp, theta=10, phi=30)
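A minimal prediction sketch with the fitted model bp (new_m3 is an assumed data frame containing columns WT and SP):

preds <- predict(bp, newdata=new_m3)  # dispatches to predict.gam
head(preds)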
GAM
Smoothing components: plot(bp, pages=1)
Principal components analysis
Idea: introduce a new coordinate system (PC1, PC2, …) where
• The first principal component (PC1) is the direction that maximizes the variance of the projected data
• The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
• …
In the new coordinate system, the coefficients corresponding to the last principal components are very small, so those columns can be dropped.
[Figure: data plotted in the (X1, X2) coordinate system with the PC1 and PC2 directions overlaid]
Principal components analysis
• princomp(x, ...)
m4 <- m3
m4$MODEL <- c()   # drop the MODEL column before PCA
res <- princomp(m4)
loadings(res)
plot(res)
biplot(res)
summary(res)
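To judge how many components to keep, the variance proportions and the projected data can be extracted from the princomp result; a minimal sketch:

res$sdev^2 / sum(res$sdev^2)  # proportion of variance per component
head(res$scores[, 1:2])       # data expressed in the first two PCs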
Decision trees
[Figure: partition of the (X1, X2) plane and the corresponding decision tree, with splits such as X1 < 9 vs. >= 9, X2 < 16 vs. >= 16, X2 < 7 vs. >= 7, X1 < 15 vs. >= 15, and leaves labeled 0 or 1]
Regression tree example
Training-validation-test
• Training-validation (60/40)
• If a training-validation-test split is required, use a similar strategy (see the sketch after the code below)
sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
training <- m2[sub, ]
validation <- m2[-sub, ]
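A sketch of one such strategy for a three-way split; the 50/25/25 proportions are an assumption, not from the slides:

idx <- sample(nrow(m2))   # random permutation of row indices
n1 <- floor(nrow(m2) * 0.5)
n2 <- floor(nrow(m2) * 0.75)
training   <- m2[idx[1:n1], ]
validation <- m2[idx[(n1 + 1):n2], ]
test       <- m2[idx[(n2 + 1):nrow(m2)], ]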
Decision trees by CART
Growing a full tree
Library: "tree"
• Create tree: tree(formula, data, subset, split = c("deviance", "gini"), …)
– subset: if a subset of cases should be used for training
– split: splitting criterion
– More parameters via the control argument
• Prune tree with the help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …)
• Prune tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...)
– K is the number of folds in cross-validation
Classification trees: CART
Example: Olive oils in Italy

sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
training <- m5[sub, ]
validation <- m5[-sub, ]
mytree <- tree(Area ~ . - Region - X, data=training)
summary(mytree)
plot(mytree, type="uniform")
text(mytree, cex=0.5)
Classification trees: CART
• Dependence of the misclassification rate on the size of the tree:
treeseq1 <- prune.tree(mytree, newdata=validation, method="misclass")
plot(treeseq1)
title("Validation")
treeseq2 <- cv.tree(mytree, method="misclass")
plot(treeseq2)
title("CV")
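The returned sequences contain the tree sizes and the corresponding deviances (here misclassification counts), so the best size can be read off and used for the final pruning; a minimal sketch:

best_size <- treeseq1$size[which.min(treeseq1$dev)]   # size with lowest validation misclassification
final_tree <- prune.tree(mytree, best=best_size, method="misclass")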
Regression trees: CART
mytree2 <- tree(eicosenoic ~ linoleic + linolenic + palmitic + palmitoleic, data=training)
mytree3 <- prune.tree(mytree2, best=4)   # 4 leaves in total
print(mytree3)
summary(mytree3)
plot(mytree3)
text(mytree3)
Decision trees: other techniques
• Conditional inference trees. Library: party
• CART: another library, "rpart"

training$X <- c()
training$Area <- c()
mytree4 <- ctree(Region ~ ., data=training)
print(mytree4)
plot(mytree4, type="simple")   # gives nice plots
Neural network
• Input nodes, input layer
• [Hidden nodes, hidden layer(s)]
• Output nodes, output layer
• Weights
• Activation functions
• Combination functions

[Figure: feed-forward network with inputs x1, x2, …, xp, hidden units z1, z2, …, zM, and outputs f1, …, fK]
Neural networks
• Feed-forward NNs. Library: neuralnet
• neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
– hidden: vector giving the number of hidden neurons in each layer
– rep: number of training runs of the network
– startweights: starting weights
– algorithm: "backprop", "rprop+", "sag", "slr"
– err.fct: any function, or "sse" or "ce" (cross-entropy)
– act.fct: any function, or "logistic" or "tanh"
– linear.output: TRUE, if no activation at the output
• confidence.interval(x, alpha = 0.05): confidence intervals for the weights
• compute(x, covariate): prediction
• plot(x, …): plot a given neural network
Neural networks
• Example

mynet <- neuralnet(Region ~ eicosenoic + linoleic + linolenic + palmitic,
                   data=training, rep=5, hidden=c(2,2), act.fct="tanh")
plot(mynet)
mynet$result.matrix
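A minimal sketch of predicting for the validation set with compute(); the covariate columns must match the formula, and the column selection below is an assumption:

covars <- validation[, c("eicosenoic", "linoleic", "linolenic", "palmitic")]
pred <- compute(mynet, covars)
head(pred$net.result)   # network outputs for the validation cases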
Neural networks
• Prediction with compute()
• Finding the misclassification rate: table(true_values, predicted_values); this works not only for neural networks
• Another package, ready for a qualitative response (classical nnet):

mynet1 <- nnet(Region ~ eicosenoic + linoleic, data=training, size=3)
coef(mynet1)
predict(mynet1, newdata=validation)
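A minimal sketch of the misclassification rate for this fit (assuming Region is a factor, so type="class" returns predicted labels):

pred <- predict(mynet1, newdata=validation, type="class")
tab <- table(validation$Region, pred)   # confusion matrix
1 - sum(diag(tab)) / sum(tab)           # misclassification rate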
Clustering
• Purpose: identify groups (clusters) of observations that are well separated in the input space
– K-means
– Hierarchical
– Density-based
K-means
• The number of clusters K must be given
• Starting seed positions are needed
• kmeans(x, centers, iter.max = 10, nstart = 1)
– x: data frame
– centers: either the value of K or a set of initial cluster centers
– iter.max: maximum number of iterations

res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2)
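Because the result depends on the random starting seeds, several restarts are often used; a minimal sketch with nstart:

set.seed(1)   # reproducible starting seeds
res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), centers=2, nstart=20)
res$centers   # final cluster centers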
K-means
• One way to visualize:

plot(m5$linoleic, m5$eicosenoic, col=res$cluster)
points(res$centers[,1], res$centers[,2], col = 1:2, pch = 8, cex = 2)
Hierarchical clustering
• Agglomerative
– Start with each point in its own cluster
– Merge the nearest clusters until only 1 cluster remains
• What does "two objects are close" mean?
– A measure of proximity is needed (e.g., Euclidean distance for quantitative variables)
• Similarity measure s_rs (= 1 if same object, < 1 otherwise)
– Example: correlation
• Dissimilarity measure δ_rs (= 0 if same object, > 0 otherwise)
– Example: Euclidean distance
Hierarchical clustering
• hclust(d, method = "complete", members = NULL)
– d: dissimilarity measure (e.g., from dist())
– method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"
Returns: a tree showing the merging sequence
• cutree(tree, k = NULL, h = NULL)
– k: number of clusters to make
– h: at which level (height) to cut
Returns: cluster indices
Hierarchical clustering
• Example
x <- data.frame(m5$linolenic, m5$eicosenoic)
m5_dist <- dist(x)
m5_dend <- hclust(m5_dist, method="complete")
plot(m5_dend)
Hierarchical clustering
• Example
DO NOT forget to standardize!
clust <- cutree(m5_dend, k=2)
plot(m5$linoleic, m5$eicosenoic, col=clust)
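A minimal sketch of the standardized version of the example above:

x_std <- scale(data.frame(m5$linolenic, m5$eicosenoic))  # zero mean, unit variance per column
m5_dend2 <- hclust(dist(x_std), method="complete")
clust2 <- cutree(m5_dend2, k=2)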
Density-based clustering
• Kernel-based density estimation. Library: pdfCluster
• pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
– x: data to be partitioned
– h: a vector of smoothing parameters
– hmult: shrinkage factor

x <- data.frame(m5$linolenic, m5$eicosenoic)
res <- pdfCluster(x)
plot(res)
Reference
http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf