Measure Up! Data Analysis Tools to Optimize Library Management
Dr. Lesley Farmer
California State University Long Beach
Research data analytics to assess California school libraries, and identify variables to improve their impact
Data analysis statistics
Choosing data analysis tools
Agenda
What significant trends between 2007 and 2012 exist in California school library programs?
What is the profile of consistently high-effectiveness (and low-effectiveness) school library programs?
What are the predictors for high (and low) school library impact over time?
Research Questions Based on 2007 and 2012 California School Libraries Data
Trend analysis of California school libraries
Predictive models of impactful California school libraries, which might be generalizable
Increased use of data analytics to improve libraries
Needs
Use California State Department of Education annual school library survey reports datasets (2007-8 and 2011-2012)
Code survey variables: e.g., meet standard or not
Compare school libraries that meet state model school library standards baseline criteria with those that did not meet standards
Use several statistical techniques: clustering analysis, decision trees, logistic regression
Method
Sample California School Library Reports Distribution
64 Independent Variables
Meet Standard or Not (binary)
API (Academic Performance Index)
Socio-economic API decile
Dependent Variables
Kth-nearest-neighbor (k-NN) is used here as a clustering method: distances between observations, computed from their variables, determine which observations are grouped together.
Those with smaller distances between them are assumed to be similar, so looking closer at the individual clusters can potentially reveal important characteristics.
Clustering
Measures the distance between two clusters
Observations with least differences are clustered
Joins “close” clusters so that resulting within-cluster variance is minimized
Ward Method of Clustering
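The Ward steps above can be sketched in a few lines of plain Python. This is only an illustration, not the analysis code behind these slides: the data points are hypothetical (weekend hours open, book budget in $1000s), and the function names are made up for this example. The merge cost used is the standard Ward increase in within-cluster sum of squares.

```python
from itertools import combinations

def centroid(cluster):
    """Mean of each coordinate across the points in a cluster."""
    n = len(cluster)
    return [sum(p[i] for p in cluster) / n for i in range(len(cluster[0]))]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_cluster(points, k):
    """Repeatedly merge the cheapest pair of clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical libraries: (weekend hours open, book budget in $1000s)
libraries = [(0, 2), (1, 3), (0, 3), (8, 20), (9, 22), (7, 19)]
groups = ward_cluster(libraries, 2)
```

With this toy data the low-access, low-budget libraries end up in one group and the high-access, high-budget libraries in the other.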
Enhanced access: on weekends, summer
Book budget
Important Ward-based Variables
Measure distance between the centroids (means) of each cluster
Join 2 nearest clusters
Centroid Method of Clustering
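A minimal sketch of one centroid-method step, assuming Euclidean distance between cluster means. The clusters and the helper names are hypothetical and only illustrate "measure centroid distances, then join the 2 nearest clusters".

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean of each coordinate across the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def nearest_pair(clusters):
    """Indices of the two clusters whose centroids are closest together."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
    )

# Hypothetical clusters of (days open during breaks, number of online tools)
clusters = [
    [(0, 0), (2, 0)],      # centroid (1, 0)
    [(2, 2), (4, 2)],      # centroid (3, 2)
    [(10, 10), (12, 10)],  # centroid (11, 10)
]
pair = nearest_pair(clusters)                      # clusters to join next
joined = clusters[pair[0]] + clusters[pair[1]]     # the merged cluster
```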
Centroid Cluster-based Variables
Positive:
Access during breaks
Internet access
Online productivity tools
Reference help
Negative:
No access before OR after school
No Internet access
No online library catalog
No “extra” funding
Flowchart of decisions and possible consequences
Node=test, branch=outcome, leaf=decision
Path from root to leaf is classification rule
Split data into training set and test set
Select “information gain” attribute to separate data
Do tree pruning for optimal selection (aim for homogeneous class)
Useful for predictions
Decision Trees
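To make the “information gain” step concrete, here is a small sketch that scores two candidate split attributes by entropy reduction. The six school records and the field names are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical survey records (not the California data)
schools = [
    {"online_catalog": True, "internet": True, "met_standard": True},
    {"online_catalog": True, "internet": True, "met_standard": True},
    {"online_catalog": True, "internet": False, "met_standard": True},
    {"online_catalog": False, "internet": True, "met_standard": False},
    {"online_catalog": False, "internet": True, "met_standard": False},
    {"online_catalog": False, "internet": False, "met_standard": False},
]
gains = {a: information_gain(schools, a, "met_standard")
         for a in ("online_catalog", "internet")}
best = max(gains, key=gains.get)  # attribute the tree would split on first
```

In this toy set the online catalog separates the two classes completely, so it has the higher gain and would become the root test.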
Online library catalog
Internet access
Online DBs
Video DBs
Budget (and funding sources)
Collection currency
Reference help
Dependent variable: met standards or not
CART (Classification & Regression Trees) Important Independent Variables
Budget (and funding sources)
Collection currency
Online library catalog
Reference help
# of books
Dependent variable: met standards or not
C4.5 decision tree (more than binary splits) Important Independent Variables
Probabilistic statistical classification model
Measure relationship between categorical dependent variable and independent (continuous or categorical) variables
Regression line is nonlinear
Run with combination of main effects
Aim for best fit
Predicts outcome of categorical dependent variable
Logistic Regression
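A minimal sketch of a one-variable logistic regression, fitted by plain gradient descent on the log-loss. The budget figures and outcomes are hypothetical, and the learning rate and number of passes are arbitrary choices for this toy example; a statistics package would normally do the fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit intercept b0 and slope b1 by batch gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= lr * sum(errors) / n
        b1 -= lr * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

# Hypothetical data: book budget per student ($) and met standard (1) or not (0)
budget = [1, 2, 3, 4, 5, 6, 7, 8]
met = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(budget, met)
p_low = sigmoid(b0 + b1 * 2)   # predicted probability at $2 per student
p_high = sigmoid(b0 + b1 * 7)  # predicted probability at $7 per student
```

The fitted curve is S-shaped rather than a straight line, which is the sense in which the regression line is nonlinear.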
Backward Selection: start with all variables and remove insignificant ones
Forward Selection: start with 1 significant variable and add variables until model is complete
Stepwise Selection: add or remove a variable depending on whether it makes the model better
Main Effects: Different ways to determine the best logistic regression model
Use to compare models
Distinguishes classifiers that are optimal under some class and sub-optimal classifiers
Plotting 2 classes: true-positive versus false-positive rates
ROC (Receiver Operating Characteristics)
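As a sketch of how an ROC curve is built, the code below sweeps a threshold over model scores, records the false-positive and true-positive rates at each step, and sums the area under the curve with the trapezoid rule. The scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true outcomes (1 = met standard)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
```

When comparing models, the one whose curve sits closer to the top-left corner (larger area) separates the two classes better on that data.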
DEPENDENT Variable: API
Staffing
Online library catalog
Collection currency
Internet access
Online DBs
Budget (and fund sources)
Reference help
CART Best Model: Ultimate Important Predictor Variables
What data do you collect?
Circulation figures
Patron usage
Facilities usage
Computer usage
Internet usage
Reference consultations and fill
Library guides/bibliographies use
Instructional sessions
Website hits (including tutorials)
Database usage vs cost
ILL processing and turnaround time
Ordering, processing, cataloging, preservation, weeding workflow and time
Ebook usage vs cost
Library software usage vs cost
Staff scheduling
Equipment maintenance and repairs
What tools do you use to collect data?
Surveys
Web statistics
Circulation statistics
Interviews
Observation
LibQual / LibPAS
Flowfinity
Document collecting
What do you DO with that data?
Descriptive statistics
Analyze workflow for efficiency
Reveal trends
Benchmark efforts
Control quality
Do cost-benefit analysis
Analyze student learning
Optimize scheduling
Optimize queuing
Data: demographics, staff, resources, services
Use: trends over time, correlations between staff and resources/services,
Demographic correlations with staffing, resources and services
AASL membership correlations with staffing, resources and services
AASL Longitudinal Data
Copyright Median by State
$/Student by Region 2009-2012
# of Books/Student by School Level 2009-12
Techniques
Correlation analysis (for relationship between continuous variables)
Multiple Regression (continuous response variable), Logistic Regression (categorical response variable)
Decision Trees
Principal Components, Factor Analysis
Hypothesis testing (paired tests, two sample tests, ANOVA)
Chi-Square tests of independence (for relationship between categorical variables)
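As one worked example from this list, the chi-square test of independence compares observed counts with the counts expected if two categorical variables were unrelated. The sketch below computes the statistic for a hypothetical 2x2 table; the counts are invented for illustration.

```python
def chi_square(table):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts.
# Rows: has online catalog (yes, no). Columns: met standard (yes, no).
table = [[30, 10], [10, 30]]
stat, dof = chi_square(table)

# 3.841 is the 0.05 critical value for 1 degree of freedom
related = stat > 3.841
```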
Graphs
Box Plots
Stem and Leaf Plots
Histograms/Bar Graphs
Pareto Charts
Pie Charts
Time Series Plot
Outlier assessment
Stem-and-Leaf Plot
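A stem-and-leaf plot can be produced without any charting software. This sketch assumes non-negative whole numbers, uses the tens digit as the stem and the units digit as the leaf, and omits stems with no values; the books-per-student figures are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return text lines such as '2 | 1 1 4' (tens as stems, units as leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}"
        for stem in sorted(stems)
    ]

# Hypothetical books-per-student counts
books_per_student = [12, 15, 21, 21, 24, 30, 38]
lines = stem_and_leaf(books_per_student)
print("\n".join(lines))
```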
KM ANALYSIS APPROACH | DATA ANALYTIC TOOLS
Cause identification | Fishbone diagram, correlation analysis, regression analysis, ANOVA, clustering, principal components
Cost-benefit analysis / ROI | Pugh matrix, Pearson correlation
Customer satisfaction | Regression analysis, Likert techniques, chi square
Decision | Decision tree, Pugh matrix
Error and tolerance analysis | Pareto analysis, control chart
Failure analysis | Pareto analysis, control chart, clustering
Job analysis | Demerit systems, flow chart
Process capacity | Process capacity
Quality analysis | Pugh matrix, control chart
Quality control | Control chart, run chart
Quantity analysis | Histogram, run chart
Queuing | Poisson distribution
Scalability | Process capability
Time analysis | Run chart, Poisson distribution, activity network diagram
Work flow and process analysis | Fishbone diagram, activity network diagram, flow chart, run chart
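For the queuing row, a short sketch of how the Poisson distribution can answer a staffing question. It assumes patron arrivals at a service desk are independent and average a fixed rate per interval; the rate of 4 arrivals per 10 minutes and the capacity of 6 are hypothetical numbers.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k arrivals when the average is `rate` per interval."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def prob_more_than(n, rate):
    """Probability that more than n arrivals occur in one interval."""
    return 1.0 - sum(poisson_pmf(k, rate) for k in range(n + 1))

# Hypothetical: 4 patrons arrive per 10 minutes on average,
# and the desk can serve at most 6 in that time.
rate = 4.0
p_overflow = prob_more_than(6, rate)  # chance that a queue builds up
```

If the overflow probability is judged too high, the same calculation can be rerun with a larger capacity to see how much extra staffing would reduce it.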
Let’s talk!
http://www.librarydataanalytics.com/
Next Steps