Computing in Archaeology
Basic Statistics
Week 8 (25/04/07)
© Richard Haddlesey www.medievalarchitecture.net
Aims
To familiarise ourselves with KEY statistical terms and their meanings
To understand the use of stats in archaeology
To assign variables appropriate levels of measurement at the recording level
Key texts
Basic Stats
Batch: post holes
Variables: length, area, diameter
Case: post hole ID
Variables
Variables are measured according to one of FOUR levels
1. Nominal = arbitrary name
2. Ordinal = sequence with no distance
3. Interval = sequence with fixed distance
4. Ratio = sequence with a fixed datum
Vince NOIR
Nominal Ordinal Interval Ratio
Nominal examples
Condition, Age, Diameter, Length, Context, Period
Ordinal examples
Condition
1. Excellent
2. Good
3. Fair
4. Poor
Here “2” lies between “1” and “3”, but the distances between them are unlikely to be equal
Interval examples
Period
1. Late Bronze (1200-650)
2. Early Iron (649-100)
3. Late Iron (100+)
Here, if we have 3 artefacts dated (a) 150BC, (b) 300BC and (c) 450BC, although b may be an equal distance between a and c, c is not twice as old as a.
This is because there is no datum.
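A minimal Python sketch of the point above, using the three dates from the slide. The reference year (AD 2007, the year of this lecture) and all variable names are assumptions made for illustration; ages are counted without a year zero.

```python
# Dates from the slide, in years BC.
dates_bc = {"a": 150, "b": 300, "c": 450}

# Assumption: measure age back from AD 2007. There is no year 0,
# so the span from N BC to AD Y is N + Y - 1 years.
REFERENCE_YEAR_AD = 2007
ages = {name: bc + REFERENCE_YEAR_AD - 1 for name, bc in dates_bc.items()}

# Intervals are unaffected by where we start counting...
gap_ab = ages["b"] - ages["a"]
gap_bc = ages["c"] - ages["b"]

# ...but ratios are: 450/150 is 3, yet c is nowhere near 3 times as old as a.
label_ratio = dates_bc["c"] / dates_bc["a"]
age_ratio = ages["c"] / ages["a"]

print(gap_ab, gap_bc, label_ratio, round(age_ratio, 2))
```

The two gaps stay equal whichever starting point is used, while the ratio changes with it, which is why a ratio only means something once there is a fixed datum.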
Ratio examples
Age instead of period
• 1000 ya is twice 500 ya
• 20kg is twice 10kg
Ratio is the highest level of measurement because it has a datum
Mortlake-style bowl
Fengate-style bowl
Grooved ware jar
Nominal, Ordinal and Interval
Note!
Avoid using 0 or 1 to indicate such variables as yes or no, as we may need to know if it is “no” or “no data”
Also when using presence or absence you may wish to add “missing” to avoid confusion
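One way to keep “no” and “no data” apart is to record three explicit states rather than 0 and 1. The sketch below is only an illustration: the `Presence` class, the `parse_presence` helper and the accepted spellings are all hypothetical.

```python
from enum import Enum


class Presence(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    MISSING = "missing"  # nothing was recorded, which is not the same as "absent"


def parse_presence(raw):
    """Map a raw field entry to one of three explicit states."""
    text = (raw or "").strip().lower()
    if text in ("yes", "y", "present"):
        return Presence.PRESENT
    if text in ("no", "n", "absent"):
        return Presence.ABSENT
    # Blank or unrecognised entries are kept as "missing", never guessed.
    return Presence.MISSING


records = ["yes", "no", "", None, "Y"]
states = [parse_presence(r) for r in records]
print([s.value for s in states])
```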
Further distinction
Nominal and Ordinal
• = categorical
• = qualitative
Interval and Ratio
• = continuous
• = quantitative
Coding
Nominal and Ordinal often need coding, to minimise errors, via a keyword index
con = context
• str = stray find
• set = settlement
• bur = burial
Avoid 1, 2, 3, etc., as you will have to keep looking up their meanings, which is time-consuming
Coding
NOTE!
EVERY DATA VALUE MUST HAVE A CODE AND ONLY ONE CODE!
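The rule can be checked mechanically. The following is a small sketch, assuming the three context codes listed earlier; the names `CODES` and `encode` are hypothetical.

```python
# Keyword index for the context variable, taken from the codes on the slide.
CODES = {"str": "stray find", "set": "settlement", "bur": "burial"}

# Reverse lookup: description -> code. If two codes shared a description,
# the reverse index would be shorter, so this check enforces "only one code".
DESCRIPTION_TO_CODE = {desc: code for code, desc in CODES.items()}
assert len(DESCRIPTION_TO_CODE) == len(CODES)


def encode(description):
    """Return the single code for a description, or fail loudly if it has none."""
    key = description.strip().lower()
    if key not in DESCRIPTION_TO_CODE:
        raise ValueError(f"no code defined for {description!r}")
    return DESCRIPTION_TO_CODE[key]


print(encode("Burial"))
```

Raising an error for an uncoded value, rather than inventing a code on the fly, keeps every value tied to exactly one entry in the index.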
Grouping
Good for periods, as in
• Late Bronze (1200-650)
• Early Iron (649-100)
• Late Iron (100+)
NOTE: it is better to record as a continuous variable (e.g. 780BC), then group as an output (e.g. Late Bronze)
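A sketch of grouping at output time: the year is stored as recorded and the period is derived only when needed. The function name `period_for` is hypothetical, and because the slide lists 100 in both the Early Iron and Late Iron ranges, placing exactly 100 BC in Early Iron is an assumption.

```python
def period_for(year_bc):
    """Derive a period label from a year BC using the ranges on the slide."""
    if year_bc > 1200:
        raise ValueError("earlier than the periods listed")
    if year_bc >= 650:
        return "Late Bronze"   # 1200-650
    if year_bc >= 100:
        return "Early Iron"    # 649-100 (assumption: 100 itself falls here)
    return "Late Iron"         # later than 100 BC


# The continuous values stay in the record; the grouping is just a view of them.
recorded_years_bc = [780, 300, 50]
grouped = [(year, period_for(year)) for year in recorded_years_bc]
print(grouped)
```

If the period boundaries are later revised, only the function changes and the recorded years remain valid.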
Good Practice
Always keep a “CLEAN” version of the original data set
Exploring the data
Context FNO Taxon Bone z1 z2 z3 z4 z5 z6 F/U L/R art. sex NISP chop cut m1 m2 m3 m4
269 58 bs mn 0 0 0 0 0 0 - r - - 1 35.9 14.6
722 191 eq sc 1 1 1 1 1 1 f r 2 - 1 78.2 40.7 55.6
722 191 eq sc 1 1 1 1 1 1 f l 2 - 1 78.7 41.4 48.5
371 102 eq sc 1 1 1 1 1 1 f r - - 1 45.0 58.0 52.9
722 191 eq cal 1 1 1 1 1 0 f r 2 - 1 90.6 45.0
722 191 eq mp 1 1 1 0 0 0 f l 2 - 1 41 45.6 40.3 28.7
722 191 eq mp 1 1 1 0 0 0 f r 2 - 1 42 46.0 39.5 29.4
722 191 eq mp 1 1 1 0 0 0 f r 2 - 1 46.0 39.7 28.5
285 72 bs cal 1 1 1 1 1 0 f r - - 1 1 1 137.5 46.3
722 191 eq mp 1 1 1 0 0 0 f l 2 - 1 42 46.3 40.0 29.2
722 191 eq pp 1 1 1 0 0 0 f l 2 - 1 71 48.7 45.0 32.5
722 191 eq pp 1 1 1 0 0 0 f r 2 - 1 71 48.8 45.2 32.5
722 191 eq pp 1 1 1 0 0 0 f r 2 - 1 68 49.0 45.0 34.1
722 191 eq pel 1 1 1 1 1 1 f l 2 - 1 60.1 52.2
722 191 eq ast 1 1 1 1 0 0 - r 2 - 1 51 53 44.9
722 191 eq ast 1 1 1 1 0 0 - l 2 - 1 51 54 44.4 52.7
722 191 eq mciii 1 1 1 1 1 1 f r 2 - 1 187 179 43.7 28.6
722 191 eq mciii 1 1 1 1 1 1 f l 2 - 1 187 180 42.8
722 191 eq mtiii 1 1 1 1 1 1 f l 2 - 1 229 223 41.4 39.1
722 191 eq mtiii 1 1 1 1 1 1 f r 2 - 1 229 223 42.8 39.5
722 191 eq hum 1 1 1 1 1 1 f/f r 2 - 1 232 30.8
722 191 eq rad 1 1 1 1 1 1 f/f l 2 - 1 274 71.7 64.2
example data set
univariate frequency table
species   frequency
cattle    187
sheep     109
pig       78
horse     21
Total     395
bivariate frequency table
species   pits   ditches   Total
cattle    67     120       187
sheep     63     46        109
pig       41     37        78
horse     3      18        21
Total     174    221       395
bivariate frequency table
species   pits         ditches      Total
cattle    67 (39%)     120 (54%)    187
sheep     63 (36%)     46 (21%)     109
pig       41 (24%)     37 (17%)     78
horse     3 (2%)       18 (8%)      21
Total     174 (100%)   221 (100%)   395
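The percentages in the table are column percentages: each count divided by the total for its feature type. A short sketch that recomputes them from the counts; the variable names are illustrative only.

```python
# Counts from the bivariate frequency table.
counts = {
    "cattle": {"pits": 67, "ditches": 120},
    "sheep": {"pits": 63, "ditches": 46},
    "pig": {"pits": 41, "ditches": 37},
    "horse": {"pits": 3, "ditches": 18},
}
features = ["pits", "ditches"]

# Column totals: all bones from pits, all bones from ditches.
column_totals = {f: sum(row[f] for row in counts.values()) for f in features}

# Each species as a whole-number percentage of its column total.
percentages = {
    species: {f: round(100 * row[f] / column_totals[f]) for f in features}
    for species, row in counts.items()
}

for species, row in percentages.items():
    print(species, row)
```

Because each figure is rounded separately, a column of rounded percentages need not add up to exactly 100 (the pits column gives 101).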
Multivariate
Multivariate techniques tend to operate on a table, or matrix, of items described in terms of a set of variables
Pictorial displays for categorical data
[Figure: bar chart; y-axis %, x-axis cattle, sheep, pig, horse]
[Figure: multiple bar chart; y-axis %, x-axis cattle, sheep, pig, horse; series pits and ditches]
[Figure: pie chart]
Pictorial displays for continuous data
[Figure: histogram; y-axis Count, x-axis Bd (mm) from 49.0 to 72.0; panels Hunt's House and Monkton]
Basic descriptive statistics:
• mode
• median
• mean
• range
• variance
• standard deviation
pottery fragments (weights in grams): 2, 2, 3, 5, 8
Mode = 2
Mode
Mode is the only way to measure average/typical in the Nominal class
If there are two modes then the data are bimodal (1, 2, 3, 3, 6, 6, 7, 8, 9)
Three = trimodal, etc.
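A brief sketch using Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The list of context codes is made up for illustration; the two number lists are the ones on the slides.

```python
import statistics

# Pottery weights from the slides: one most frequent value.
single = statistics.multimode([2, 2, 3, 5, 8])

# The bimodal example: 3 and 6 each occur twice.
bimodal = statistics.multimode([1, 2, 3, 3, 6, 6, 7, 8, 9])

# The mode also works on nominal data, where a median or mean cannot be computed.
# (Hypothetical list of context codes.)
contexts = ["bur", "set", "bur", "str", "bur"]
typical_context = statistics.multimode(contexts)

print(single, bimodal, typical_context)
```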
pottery fragments (weights in grams): 2, 2, 3, 5, 8
Mode = 2
Median = 3
Median
Best for ordinal and above
If the number of values is even, take the number halfway between the two middle values
(1, 2, 3, 4, 5, 6, 7, 8: (4+5)/2 = 4.5)
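The two cases can be sketched with the standard `statistics` module, which averages the two middle values when the count is even. The variable names are illustrative.

```python
import statistics

# Odd number of values: the median is the middle one.
odd_case = statistics.median([2, 2, 3, 5, 8])

# Even number of values: halfway between the two middle values, (4 + 5) / 2.
even_case = statistics.median([1, 2, 3, 4, 5, 6, 7, 8])

print(odd_case, even_case)
```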
pottery fragments (weights in grams): 2, 2, 3, 5, 8
Mode = 2
Median = 3
Mean = (2+2+3+5+8)/5 = 4
Mean
The most commonly used average; it will only work for interval and ratio
It is the most important measure of position because a lot of further statistical analyses are based on it
Conclusion
It is important to understand that the mode, median and mean are three quite different measures of position which can give three different values when applied to the same data-set
Data     2, 2, 3, 5, 8   2, 2, 3, 5, 6, 8
Mode     2               2
Median   3               4
Mean     4               4.333
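The comparison can be reproduced with the standard `statistics` module. The helper name `measures_of_position` is hypothetical, and rounding the mean to three places simply matches the slide.

```python
import statistics


def measures_of_position(values):
    """Return the three averages for one batch of values."""
    return {
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "mean": round(statistics.mean(values), 3),
    }


first = measures_of_position([2, 2, 3, 5, 8])
second = measures_of_position([2, 2, 3, 5, 6, 8])
print(first)
print(second)
```

Adding a single fragment of 6 g leaves the mode unchanged but shifts both the median and the mean.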
The skew
[Figure: three distribution curves labelled symmetrical, positive skew and negative skew]
Measures of variability – the spread
pottery fragments (weights in grams): 2, 2, 3, 5, 8
Range = max - min = 8 - 2 = 6
• Very simple and of limited use
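variance
pottery fragments (weights in grams): 2, 2, 3, 5, 8
key: s^2 = variance, x_i = each value, \bar{x} = mean, n = number of values
\bar{x} = \frac{2 + 2 + 3 + 5 + 8}{5} = 4
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}
s^2 = \frac{(2-4)^2 + (2-4)^2 + (3-4)^2 + (5-4)^2 + (8-4)^2}{5} = \frac{26}{5} = 5.2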
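The variance calculation can be checked step by step in Python. The slide divides by the number of values (5), which corresponds to `statistics.pvariance`; `statistics.variance` divides by n - 1 instead and is shown only for contrast. The variable names are illustrative.

```python
import statistics

weights = [2, 2, 3, 5, 8]
n = len(weights)

mean = sum(weights) / n                                   # (2+2+3+5+8)/5
squared_deviations = [(w - mean) ** 2 for w in weights]   # 4, 4, 1, 1, 16
variance_by_hand = sum(squared_deviations) / n            # divide by n, as on the slide

# Library equivalents: pvariance divides by n, variance divides by n - 1.
population_variance = statistics.pvariance(weights)
sample_variance = statistics.variance(weights)

print(variance_by_hand, population_variance, sample_variance)
```

Many statistics packages report the n - 1 version by default, so it is worth checking which one a given tool uses before comparing results.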
standard deviation
pottery fragments (weights in grams): 2, 2, 3, 5, 8
variance (s^2) = 5.2
standard deviation = \sqrt{s^2} = \sqrt{5.2} = 2.28
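As a quick check, the standard deviation is the square root of the variance. This sketch uses the divide-by-n form to match the slide; the variable names are illustrative.

```python
import math
import statistics

weights = [2, 2, 3, 5, 8]

# Square root of the divide-by-n variance, as on the slide.
sd_from_variance = math.sqrt(statistics.pvariance(weights))

# The same value computed directly.
sd_direct = statistics.pstdev(weights)

print(round(sd_from_variance, 2), round(sd_direct, 2))
```

Unlike the variance, the standard deviation is in the same units as the data (grams here), which makes it easier to interpret.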
Summary
Variables are measured according to one of FOUR levels
1. Nominal = arbitrary name
2. Ordinal = sequence with no distance
3. Interval = sequence with fixed distance
4. Ratio = sequence with a fixed datum
Summary
Measures of position (average/typical)
• Mode
• Median
• Mean
Measures of variability (the spread)
• Range
• Variance
• Standard Deviation