how many replicate tests are needed to test cookstove...

34
1 How many replicate tests are needed to test cookstove performance and 1 emissions? Three is not always adequate 2 Yungang Wang 1,* , Michael D. Sohn 1 , Yilun Wang 2 , Kathleen M. Lask 3 , Thomas W. 3 Kirchstetter 1,4 , Ashok J. Gadgil 1,4 4 1 Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720 5 2 ISO Innovative Analytics, San Francisco, CA 94111 6 3 University of California - Berkeley, College of Engineering, Applied Science and Technology 7 Program, Berkeley CA 94720 8 4 University of California - Berkeley, Department of Civil and Environmental Engineering, 9 Berkeley CA 94720 10 11 Corresponding author: Yungang Wang (Email: [email protected]; Phone: 1-510-495-8065) 12 13 Abstract 14 Almost half of the world’s population still cooks on biomass cookstoves of poor efficiency and 15 primitive design, such as three stone fires (TSF). Emissions from biomass cookstoves contribute 16 to adverse health effects and climate change. A number of improved cookstoves with higher 17 energy efficiency and lower emissions have been designed and promoted across the world. 18 During the design development, and for the selection of a stove for dissemination, the stove 19 performance and emissions are commonly evaluated, communicated and compared using the 20 arithmetic average of replicate tests made using a standardized laboratory-based test, commonly 21 the water boiling test (WBT). However, the statistics section of the test protocol contains some 22 debatable concepts and in certain cases, easily misinterpreted recommendations. Also, there is 23

Upload: others

Post on 19-Feb-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

  • 1

    How many replicate tests are needed to test cookstove performance and 1

    emissions? – Three is not always adequate 2

    Yungang Wang1,*

    , Michael D. Sohn1, Yilun Wang

    2, Kathleen M. Lask

    3, Thomas W. 3

    Kirchstetter1,4

    , Ashok J. Gadgil1,4

    4

    1Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720 5

    2ISO Innovative Analytics, San Francisco, CA 94111 6

    3University of California - Berkeley, College of Engineering, Applied Science and Technology 7

    Program, Berkeley CA 94720 8

    4University of California - Berkeley, Department of Civil and Environmental Engineering, 9

    Berkeley CA 94720 10

    11

    Corresponding author: Yungang Wang (Email: [email protected]; Phone: 1-510-495-8065) 12

    13

    Abstract 14

    Almost half of the world’s population still cooks on biomass cookstoves of poor efficiency and 15

    primitive design, such as three stone fires (TSF). Emissions from biomass cookstoves contribute 16

    to adverse health effects and climate change. A number of improved cookstoves with higher 17

    energy efficiency and lower emissions have been designed and promoted across the world. 18

    During the design development, and for the selection of a stove for dissemination, the stove 19

    performance and emissions are commonly evaluated, communicated and compared using the 20

    arithmetic average of replicate tests made using a standardized laboratory-based test, commonly 21

    the water boiling test (WBT). However, the statistics section of the test protocol contains some 22

    debatable concepts and in certain cases, easily misinterpreted recommendations. Also, there is 23

    mailto:[email protected]

  • 2

    no agreement in the literature on how many replicate tests should be performed to ensure 24

    “confidence” in the reported average performance (with three being the most common number of 25

    replicates). This matter has not received sufficient attention in the rapidly growing literature on 26

    stoves, and yet is crucial for estimating and communicating the performance of a stove, and for 27

    comparing the performance between stoves. We illustrate an application using data from a 28

    number of replicate tests of performance and emission of the Berkeley-Darfur Stove (BDS) and 29

    the TSF under well-controlled laboratory conditions. Here we focus on two as illustrative: time-30

    to-boil and emissions of PM2.5 (particulate matter less than or equal to 2.5 micrometers in 31

    diameter). We demonstrate that interpretation of the results comparing these stoves could be 32

    misleading if only a small number of replicates had been conducted. We then describe a 33

    practical approach, useful to both stove testers and designers, to assess the number of replicates 34

    needed to obtain useful data from previously untested stoves with unknown variability. 35

    36

    Keywords: Cookstove; Berkeley-Darfur Stove; Variability; Confidence Interval; Kolmogorov–37

    Smirnov Test; Bootstrap 38

    39

    1. Introduction 40

    About half of the world’s population uses biomass as fuel for cooking (IEA, 2004). The smoke 41

    from biomass cooking fires was recently found to be the largest environmental threat to health in 42

    the world, and is associated with 4 million deaths each year (Lim et al., 2012). This exposure 43

    has also been linked to adverse respiratory, cardiovascular, neonatal, and cancer outcomes 44

    (Smith et al., 2004; Weinhold, 2011). A 2011 World Bank report notes significant contributions 45

  • 3

    of biomass cooking to global climate change (World Bank, 2011). The contribution to climate 46

    change from black carbon (BC) emission from biomass cooking is a topic of growing interest, 47

    especially in terms of climate forcing and melting of glaciers (Hadley et al., 2010; Ramanathan 48

    and Carmichael, 2008). Current biomass stoves lead to a large burden of disease, and contribute 49

    to adverse impacts on local and the global environment. Hence there is substantial interest in 50

    developing and disseminating fuel-efficient biomass stoves with reduced emissions (e.g. DOE 51

    2011). Launched in September 2010, the Global Alliance for Clean Cookstoves (GACC) “100 52

    by 20” goal calls for 100 million homes to adopt clean and efficient stoves and fuels by 2020. 53

    The “three-stone fire” (TSF) is a commonly prevailing cooking method for a large 54

    fraction of the population at the base of the economic pyramid. In quantifying the performance 55

    of an improved stove, the TSF is commonly used as the baseline. This least expensive class of 56

    stove is simply an arrangement of three large stones supporting a pot over an open and unvented 57

    biomass fire. A TSF is one of the two stoves we analyzed in this study. We also tested the 58

    performance and emissions of the Berkeley-Darfur Stove (BDS) as an exemplar of an improved 59

    fuel-efficient biomass cookstove. The BDS was developed at Lawrence Berkeley National 60

    Laboratory (LBNL) for internally displaced persons in Darfur, Sudan 61

    (http://cookstoves.lbl.gov/darfur.php). It is an all-metal precision-designed natural-convection 62

    stove, with design features co-developed by iterative feedback from Darfuri women cooks. The 63

    BDS by design accommodates Darfuri traditional round-bottom cooking pots and cooking 64

    techniques (Figure 1). 65

    A literature survey of recent laboratory cookstove testing in peer-reviewed journal 66

    articles shows widely different numbers of replicate tests (Bailis et al., 2007; Jetter and Kariher, 67

    2009; Jetter et al., 2012; MacCarty et al., 2008, 2010; Roden et al., 2009; Smith et al., 2007). 68

  • 4

    The number of replicates reported in these seven studies range from 1 to 23. However, six out of 69

    seven studies have reported results with only 3 or fewer replicates. One then can rightly ask: 70

    how many replicate tests do I need to test the performance and emissions of the stove? 71

    Answering this question is application specific, and requires greater specificity. For example, 72

    the question might be better phrased. For a water boiling test (WBT), how many replicates are 73

    needed to estimate the average “time to boil” to within 2 minutes and with 95% confidence? Or 74

    how many replicates are needed to confirm, with 95% confidence, that Stove “A” emits less 75

    PM2.5 than Stove “B”? These questions exemplify perhaps the most frequently asked questions in 76

    planning stove experiments and interpreting their results. 77

    There is no single or simple answer to the number of replicates needed to answer the 78

    above questions. The answer depends on the experimental design, how many parameters need to 79

    be estimated, and the resulting variability in the stove replicates. In this study, we investigate 80

    how to answer the above questions using data from the BDS and TSF water boiling experiments. 81

    We show how the number of replicates is linked to uncertainty and variability in the experiments 82

    and stove performance. We also show how many replicates are likely needed as various 83

    practical performance comparisons, such as “Does Stove A perform better than Stove B?” and 84

    “What is the uncertainty in the expected performance of Stove A or Stove B?” Finally, we 85

    describe a practical approach to design an experiment to test the performance of a previously 86

    untested stove. 87

    88

    2. Problem statement and causes of variability 89

  • 5

    The Appendix 6 of the WBT (version 3.0, http://www.pciaonline.org/node/1048) 90

    provides a detailed approach for comparing the performance of stoves. It describes a suite of test 91

    statistics and important considerations for interpreting test results. While comprehensive, the 92

    description contains some debatable concepts and in certain cases, easily-misinterpreted 93

    recommendations. For example, it affirms “At least three tests should be performed on each 94

    stove” and provides a cogent explanation for it. It also discusses the importance of paying 95

    attention to the statistical significance of a series of comparison tests between the performances 96

    of two stoves. While both statements are correct, it is not surprising that stove testers 97

    misinterpret these comments as (i) “only three tests are needed” or (ii) a hypothesis test with 98

    strong p-value (assuming a Gaussian distribution) provides unarguable confirmation of stove 99

    performance or comparison results. In fact, neither interpretation is correct or claimed in the 100

    text. We reason further elucidation of Appendix 6 is necessary, and a more transparent 101

    methodology would greatly benefit stove testers. We believe a transparent methodology would 102

    be best accomplished by developing an approach that maps the trade space between sample size, 103

    variability, and confidence. We also believe it is important to show that alternative methods for 104

    comparing the performances of stoves are available and should be considered. This work thus 105

    builds and improves upon Appendix 6 by providing new methods of interpreting test results for 106

    stove testers. 107

    The literature generally shows that even under carefully controlled conditions, stove test 108

    results show high test-to-test variability (coefficient of variation > 1.0, e.g. Jetter et al., 2012). 109

    There are many possible causes of this variability even within a precisely defined test such as the 110

    latest WBT (version 4.2.2), and we list a few here. Stove efficiency and emissions are generally 111

    a function of thermal power, and owing to the discrete nature of fuel-feeding events, a stove’s 112

  • 6

    thermal power invariably varies, also contributing to temporal variability within a test, which can 113

    translate into test-to-test variability. Despite due care, the ratio of bark to sapwood to hardwood 114

    for various pieces of fuelwood can be different, and thus will have different burn characteristics. 115

    Furthermore, different pieces of fuelwood may have different surface to volume ratios, 116

    contributing to different rates of burning. Lastly, even reasonably experienced and careful stove 117

    testers demonstrate some variability in the way they tend the fire in the stove from test to test, 118

    and within a test (Granderson et al., 2009). All these (and other uncontrolled factors) together 119

    give rise to what we lump together as variability in the test-to-test replicate results for a stove 120

    under controlled laboratory conditions. 121

    122

    3. Approach 123

    The question of “How many replicate tests do I need?” is not novel. It is a well-researched 124

    question in classical statistical theory, but has not received much attention from the stove 125

    research community. We briefly summarize here the statistical background relevant to answer 126

    the question. 127

    3.1 Probability density function and cumulative distribution function 128

    Technically, for a continuous random variable, the probability density function (PDF) describes 129

    the probability that a value will be within a certain range of the sample. However, as this range 130

    is evaluated by integrating, it can be chosen to be quite small, so for most practical purposes, the 131

    PDF may be considered the probability of obtaining a particular value (Ellison, 2009). 132

    Graphically, if the PDF is a curve, the cumulative distribution function (CDF) is the area under 133

    that curve. It is used to compute probability; the larger the included range, the greater the 134

  • 7

    probability. Because of this, the CDF over the entire range is equal to 1. For a normal (or 135

    Gaussian) distribution, the CDF curve is a normal ogee curve, which is a smooth even S-shaped 136

    curve (Ellison, 2009). Skewing in the distribution away from the Gaussian will lead to one half 137

    of the S to be elongated or distorted. 138

    3.2 Standard error and confidence interval for an average 139

    The standard deviation refers to the variation of observations within individual experimental 140

    units, whereas the standard error refers to the random variation of an estimate (made with n 141

    replicates) from the mean value that will be obtained as the number of replicates increases. The 142

    standard deviation is calculated by: 143

    √ ∑( ̅)

    (1) 144

    where are the individual measurements used to calculate the average. A 145

    convenient way to calculate the sample standard deviation is using the “STDEV” function in 146

    Excel. The standard error is the measure of the experimental error of an estimated statistic (e.g. 147

    the mean). For the sample average ̅ from n replicate tests, the standard error ̅ is √ , where 148

    is the standard deviation of the n replicates. The standard error on the mean can be reduced by 149

    increasing the number of replicates. Replication will not reduce the standard deviation but it will 150

    reduce the standard error. In practical term, this means that our goal is to achieve a standard 151

    error small enough to make convincing and useful conclusions. Additionally, in our experience, 152

    computing the variance can be problematic from very few replicates. It is mathematically 153

    correct that a variance can be computed from just three replicates. However, we have commonly 154

    found that three replicates resulted in a somewhat small variance, only to be often greater or 155

  • 8

    much greater once we include the fourth and fifth sample. As a rule-of-thumb we are dubious of 156

    variances computed from fewer than five replicates. 157

    The confidence interval indicates the reliability of an estimate made from a given number 158

    of replicates. The ( ) confidence interval for the average ̅ has the form ̅ , 159

    where is called the half-length, since a segment of the length of 2E centered on ̅, provides the 160

    full confidence interval. E is related to α, σ, and n (the number of replicates) by the following 161

    equation. 162

    ⁄ √ (2) 163

    Where ⁄ is a dimensionless number that can be looked up in standard handbooks for various 164

    standard distributions (e.g. Berthouex and Brown, 2002). Transposing equation (2), the number 165

    of replicates that will produce this interval half-length is 166

    (

    ) (3) 167

    This assumes random sampling. It also assumes that n is large enough that the normal 168

    distribution can be used to define the confidence interval. To apply equation (3), we must 169

    specify ( ) . Values of ( ) that might be used are shown in the top row 170

    with corresponding values of Z in the bottom row of Table 1. 171

    When the measurements are assumed to be normally distributed but the number of 172

    replicates is small (by small, textbooks suggest less than 30) and the population standard 173

    deviation is unknown, a Student’s t-distribution is used (Berthouex and Brown, 2002). To 174

    calculate the number of replicates n, the coefficient is used in place of ⁄ shown in equation 175

  • 9

    (3). A selection of t-values is listed in Table 2. The t value decreases as n increases, but notice 176

    that there is little change once n exceeds 5. An exact solution of the number of replicates for 177

    small n (less than 30) requires an iterative solution, but a good approximate is obtained by using 178

    a rounded value of t = 2.1 or 2.2, which covers a good working range of n = 10 to n = 25 (p = 179

    0.05). When analyzing data we carry three decimal places in the value of t, but that kind of 180

    accuracy is misplaced. The greatest uncertainty lies in the value of the specified (refer to 181

    Equation (2)), so we can conveniently round off t to one decimal place. Additional information 182

    about confidence interval estimation and experiment sizing can be found in Berthouex and 183

    Brown (2002), Spiegel et al. (2008), and Taylor (1997). 184

    3.3 Bootstrapping 185

    All the preceding discussion was predicated on the assumption of a Gaussian distribution of 186

    underlying population. What if the distribution is not Gaussian? Bootstrapping is a powerful 187

    statistical approach that allows estimation of the variability of many properties of the data 188

    without making any assumptions about the shape of the original distribution F. Efron (1979) 189

    provides an accessible explanation, with examples, of the bootstrap method. The key principle 190

    of Bootstrapping is to simulate repeated observations from the unknown distribution F, using 191

    repeated sampling of the obtained single set of data. Bootstrapping can be implemented by 192

    constructing a number of resamples of the observed dataset. Each resample is obtained by 193

    random sampling with replacement from the original dataset (Varian, 2005). Increasing the 194

    number of resamples can reduce the impact of random sampling errors, but it cannot increase the 195

    amount of information existing in the original dataset (Efron and Tibshirani, 1993). 196

    3.4 Kolmogorov-Smirnov test 197

  • 10

    The Kolmogorov-Smirnov (K-S) test quantifies whether two cumulative distribution functions 198

    (CDFs) are from the same population. It does so by exploring the maximum distance between 199

    the two CDFs. Corder et al. (2009) provide a good summary of the K-S test. The null 200

    hypothesis of a K-S test poses that the two samples are from the same population, and the 201

    research hypothesis poses either that they generally differ, leading to a two-tailed probability 202

    estimate, or that they differ in a specific direction, leading to a one-tailed estimate (Wall, 2003). 203

    The K-S test can be used to compare a sample distribution and a reference distribution or to 204

    compare two sample distributions. We will apply this test to help us explore how many 205

    replicates are needed to confirm whether the performance of two stoves is indistinguishable. 206

    The K-S test is a nonparametric statistical test and is only limited by the condition that it 207

    must be applied to continuous distributions. Unlike the t-test and other parametric tests, which 208

    require assuming Gaussian distribution, continuity is the primary requirement for application of 209

    K-S test making it a very useful tool with unknown distributions. Also for small and medium 210

    samples, it is more effective to use the K-S test over other nonparametric “goodness-of-fit” tests, 211

    such as the chi-square test or the Wilcoxon test. The different research hypotheses of the K-S 212

    test also provide directional flexibility which the chi-square test cannot provide (Wall, 2003). 213

    214

    4. Methods 215

    4.1 Laboratory testing 216

    Laboratory tests of BDS and TSF were performed at the LBNL cookstove testing facility. 217

    Concentrations of PM2.5 (particulate matter less than or equal to 2.5 micrometers in diameter), 218

    carbon monoxide/carbon dioxide (CO/CO2), BC, and several other co-pollutants emitted from 219

  • 11

    the BDS and TSF were simultaneously measured. The DustTrak measures the amount of light 220

    scattered by particles and relates that to their mass. It is calibrated for a National Institute of 221

    Standards and Technology (NIST) certified PM standard composed of soil from Arizona. Since 222

    the amount of light scattered by particles is specific to their morphology and chemical 223

    composition, in this study a calibration specific to wood smoke was developed, per the 224

    manufacturer’s recommendation, by comparing PM2.5 concentrations measured with the 225

    DustTrak after adjusting for secondary dilution to those measured gravimetrically. However, the 226

    DustTrak data are not as reliable and consistent as gravimetric results. 227

    The CO/CO2 concentrations were measured in a single instrument by nondispersive 228

    infrared absorption spectroscopy (NDIR analyzer, CAI 600 series). A cookstove smoke-specific 229

    calibration was developed for the BC aethalometer measurements. The results were compared 230

    with particle light-absorption coefficients measured with a photoacoustic absorption 231

    spectrometer (PAS) and elemental carbon concentrations measured using a thermal-optical 232

    analysis method. The moisture content of each piece of fuel wood was measured using a 233

    moisture meter (Delmhorst, J-2000). Soft (pine and fir) and hard (oak) woods were used in an 234

    equal number of tests with both stove types. Soft wood pieces were saw-cut to approximately 15 235

    cm long with a square cross-section of approximately 4 cm2 and hard wood pieces were hatchet-236

    cut to a similar size but irregular shape. The variability in the laboratory test results could 237

    probably be further reduced by using consistent quality wood with more consistent dimensions. 238

    The BDS and TSF were compared using a modification of the WBT V3.0 protocol. The 239

    WBT is intended to provide a method to compare the performance and emissions of different 240

    stoves in completing a defined standardized task (Bailis et al., 2007). In our modified protocol, a 241

    fire is ignited and maintained by periodic feeding of fuelwood to bring 2.5 L of water in a 2.3 kg 242

  • 12

    metal Darfur pot (without pot lids) to boil and subsequently maintain it on simmer for 15 243

    minutes, whereupon the fire is extinguished and the mass of remaining fuelwood is measured. 244

    The WBT suggests a default test volume of water of 5 L. We chose to test with 2.5 L of water, 245

    because it reflects the actual volume of food stove users prepare at a time. Our previous testing 246

    results show no significant difference of time to boil between cold start and hot start for both the 247

    BDS and TSF. Therefore, only one high-power phase (cold start) was included in each test. 248

    Note that the International Organization for Standardization (ISO) International Workshop 249

    Agreement (IWA) metrics average high-power (cold start and hot start) values 250

    (http://www.pciaonline.org/files/ISO-IWA-Cookstoves.pdf). When three WBT replicate tests 251

    are performed, n is equal to 6. 252

    One of the main metrics in our modified WBT test is the time to boil. In an important 253

    report by the United States Agency for International Developing (USAID, 2008), authors state, 254

    “Fuel-efficient stoves can deliver numerous benefits to end-user households, including fuel and 255

    time savings.” This underlines what we found in our work in Darfur, time savings are indeed 256

    important to the users. Moreover, we learned from our field partners that the most attractive 257

    feature of the BDS is that the stove could take their drinking water to boiling in less than 5 258

    minutes. The refugee women in Darfur IDP camps have named the BDS in Arabic “Kanun 259

    Khamsa Dagaig” (i.e., “the 5-min stove”), indicating this as the single most important feature of 260

    the BDS from their perspective. Therefore, we believe “time to boil” is an important testing 261

    matrix from the user perspective and consequently, it is important for us to examine for both 262

    BDS and TSF. 263

    Stove testers control the fuel feeding rate that determines the time to boil. Two trained 264

    stove testers were employed for all the tests in this study. The average fuel burning rates for 265

  • 13

    BDS and TSF are 12.2 ± 0.9 g/min and 13.8 ± 1.3 g/min (mean ± 1SD), respectively. These 266

    values indicate that the fire tending skill of the two testers is very consistent. Please note in other 267

    areas of the world where fuel is more abundant and inexpensive compared to Darfur, users often 268

    sacrifice fuel consumption for time savings. As shown in Figure 1, the BDS has a small fire box 269

    opening to prevent using more fuel wood than necessary. The TSF has no such restriction, so it 270

    can achieve a higher fuel burning rate than the BDS, therefore, the TSF could have a shorter time 271

    to boil if fuel consumption is not an issue. The detailed testing methodology and results are 272

    given by Kirchstetter et al. (2010). 273

    4.2 Data analysis 274

    Stove performance is strongly influenced by the skill of the person tending the stove. Dozens of 275

    tests were practiced by trained stove testers on both TSF and BDS, and these data were discarded 276

    before performing the tests to produce the data reported in this paper. This ensured that the 277

    variability observed in the test results was not being primarily influenced by increasing skill of 278

    the tester in tending the stove. There were 20 and 21 tests completed for TSF and BDS for data 279

    analysis, respectively. All instrumentation discussed above operated properly during these 41 280

    tests. The statistical analysis was performed using Statistical Analysis System (SAS Institute 281

    Inc., version 9) and R (http://www.r-project.org/). 282

    283

    5. Results and discussion 284

    5.1 Data overview 285

    http://www.r-project.org/

  • 14

    The stove performance and emission results of 21 BDS tests and 20 TSF tests are 286

    comprehensively presented in Kirchstetter et al. (2010). The moisture content and dry mass of 287

    the soft and hard woods were similar to each other and were the same for TSF and BDS tests. 288

    The completion of tests with softwood (10 tests) required about 90% of the time duration and 289

    90% of the wood mass compared to those with hard wood (10 tests). 290

    The data of time to boil and PM2.5 emission factor (g/g of fuel consumed) for TSF and 291

    BDS are selected for the statistical analysis in this study. We understand that PM2.5 emissions 292

    per energy delivered to the cooking pot (g/MJ delivered) is an important metric of cookstove 293

    performance, because it is based on the fundamental desired output - cooking energy- that 294

    enables valid comparisons between all stoves and fuels (Smith et al., 2000). Also cooking 295

    energy tends to have less variation than time to boil, so it might require a smaller number of 296

    replicates. However, the data for the mass of water evaporated and the mass of fuel consumed 297

    during cold start were not collected when these tested were conducted. Thus, a shortcoming of 298

    this study is that it is not possible to calculate the emission factors based on energy delivered to 299

    the pot. 300

    The histogram plots of these data are shown in Figure 2 and Figure 3. The CDF plots for 301

    the same data are shown in Figure 4 and Figure 5. On average, cooking tests with the BDS were 302

    completed in 74% of the time for TSF (30.3 minutes vs. 41.0 minutes). There was less variation 303

    in time to boil with the BDS, as indicated by a narrower spread in the CDF curves for BDS 304

    compared to TSF (Figure 4). The average PM2.5 emission factor for the BDS tests was 80% of 305

    that for the TSF (3.1 g/kg-wood burned vs. 3.9 g/kg-wood burned). PM2.5 shows large test-to-306

    test variability. The distributions of BDS and TSF PM2.5 data overlap substantially, but the 307

  • 15

    question to answer is whether data from BDS and TST tests show performance data that are 308

    different and discernable. 309

    5.2 Number of replicate tests to estimate the mean 310

    We now discuss the number of replicate tests needed to estimate the experiment mean within a 311

    user-defined level of confidence. For example, suppose the analyst desires to compute the 312

    expected boil time of the BDS within a range of plus or minus 2 minutes. Suppose also that the 313

    analyst desires the certainty of that estimate to be 95%. In other words, the analyst is saying, “I 314

    would like to know the number of replicate tests needed to compute the average time to boil of 315

    the BDS within a range of 4 minutes, and I want to know that range with a confidence of 95%.” 316

    Figure 6 shows the number replicates needed for three probability levels (0.1, 0.05, and 0.01), 317

    which correspond to confidences of 90%, 95%, and 99%, respectively. We compute the number 318

    of replicates using equation (3). The x-axis represents the number of replicates ranging from 1 to 319

    25. The y-axis represents the width of the confidence interval about the mean, which is twice the 320

    E value in equation (2). As can be seen in the figure, the smaller the confidence interval about 321

    the mean desired, the larger the number of replicates required. 322

    As the 0.05 probability in Figure 6 shows, if the width of the confidence interval for the 323

    mean time to boil is 4 minutes at the probability of 0.05, 7 replicates are required. Note that σ 324

    for the underlying distribution in equation (2) is calculated based on the original 21 replicate 325

    tests. If only two replicates are conducted, the width of the confidence interval about the mean is 326

    38 minutes at the probability of 0.05 (191 minutes for the probability of 0.01, 19 minutes for the 327

    probability of 0.10). When the number of replicates increases to 5, the width shrinks to 5.3 328

    minutes at the probability of 0.05 (8.8 minutes for the probability of 0.01, 4.1 minutes for the 329

  • 16

    probability of 0.10). The width of the confidence interval about the mean is relatively stable 330

    when the number of replicates is greater than 15. A similar trend is observed for the BDS PM2.5 331

    emission factor data. The width of the confidence interval about the mean BDS PM2.5 emission 332

    factor is enormous for n < 5, and becomes steady when n > 10. 333

    5.3 Number of replicate tests to compare two stoves 334

    We now discuss how many replicate tests are needed to confirm whether the performance of two 335

    stoves is indistinguishable, within a level of confidence. In essence, we test whether the 336

    underlying statistical distribution of the two stoves for the mean boil time or emission factor are 337

    the same. Figure 7 shows the probability as a function of the number of replicates calculated 338

    using the K-S test. 339

    On the x-axis is the number of replicates. For every replicate number, we generated 340

    50,000 bootstrap samples using the original 21 replicate tests for the BDS and 50,000 bootstrap 341

    samples using the original 20 TSF replicate tests. For each pair of samples, we compute the 342

    probability (p value) that they come from the same distribution. We then compute the ratio, or 343

    probability, of the number of pairs that come from the same distribution divided by 50,000 with a 344

    confidence of 95%. The y-axis shows the resulting probability. When the number of replicates 345

    is greater than 6, the probability that the BDS and the TSF time to boil data are from two 346

    different distributions is greater than 95%. For the PM2.5 emission factor data, 30 replicates are 347

    required to ensure that at least 95% chance the BDS and the TSF samples are drawn from two 348

    different distributions. 349

    5.4 A practical approach to assess the number of replicate tests 350

  • 17

    The difficulty with estimating the number of replicate tests needed for one particular stove is the 351

    lack of prior knowledge about the expected σ of the planned experiments. We knew the σ for the 352

    above demonstration because we had already conducted 21 replicates. 353

    In the absence of the σ, the experiment designer must speculate on the variance. We 354

    recommend reviewing the literature of similar stoves to pose a notional variance. In the absence 355

    of such data, then the designer must use any other information as a starting point, such as the 356

    variance computed from the TSF and BDS replicates reported here. Note the σ values for BDS 357

    and TSF for time-to-boil are 2.1 minutes and 5.6 minutes, respectively, and for emission factor 358

    for PM2.5 they are 1.2 g/kg-wood and 1.0 g/kg-wood, respectively. The σ values calculated for 359

    all measured variables are summarized in Table 3. 360

    Note the wide difference in the three-stone-fire and the Berkeley-Darfur Stove. The 361

    former is a set up with three stones with irregular shape, and the dimensions and shape and 362

    spacing of the stones can vary from test to test. Results reported in the literature have generally 363

    been with consistent dimensions, shape, and spacing of the stones (or bricks). This factor may 364

    have caused more variation in our TSF results compared to literature values. In contrast, the 365

    BDS is precisely engineered metal stove of fixed dimensions. The remarkable point is that while 366

    there is a difference in the σ values for the time to boil, there is not a large difference in the σ 367

    values for emission factors of the TSF and BDS despite the significant design difference. So, we 368

    recommend starting conservatively, with the notional σ similar to the value for the TSF. If the 369

    designer’s stove or testing conditions are likely to show less variation, then perhaps start with a 370

    notional variance that is 10% less. Conversely, our BDS experiment was conducted in a 371

    controlled laboratory setting. If the designer expects greater variation in the experiment (say, 372

    owing to variable field conditions), then begin with a notional variance of 10, 50, or even 100 373

  • 18

    percent greater. For example, when testing fan-assisted stoves, which burn engineered wood 374

    pellets, we might start with a notional variance that is 10% smaller than was found here owing to 375

    the uniform nature of the engineered fuel pellets. On the other hand, when testing an open fire 376

    (the fire is open completely to the ambient environment), we might begin with a notional 377

    variance that is twice that of the TSF (three stones or bricks are places between the fire and the 378

    ambient to provide the support to the cooking pot and some insulation of the fire) laboratory tests 379

    reported here. 380

    With a notional variance, the designer would proceed with equation (3) to compute the 381

    number of replicates needed based on the desired size of confidence interval (E) and the level of 382

    confidence desired (α). Remember also that test conditions change, instruments malfunction, 383

    and interpretation of tests differ (such as the precise time of onset of hard boil, or the precise 384

    duration that water simmer, can be questionable). These factors should also be considered 385

    beyond what is computed from the above statistics to arrive at the number of replicate tests. 386

    More replicate tests should be planned than required by the statistical estimation to compensate 387

    for these unusual occurrences. This also increases the margin of safety in case the variability in 388

    the underlying distribution, represented by the standard deviation (σ) in equation (2), is larger 389

    than anticipated. A conservative margin of 100% is recommended based on our abundant stove 390

    laboratory testing experience. 391

    With the number of replicate tests determined, the experimenters conduct the tests. With 392

    these data now in hand, the experimenters can calculate the actual, observed variance computed 393

    from the experiment. This value should be used to estimate the analysis results. One might need 394

    to conduct additional replicate tests to achieve the desired confidence interval and desired level 395

    of confidence in the mean estimation from the test results. 396

  • 19

    397

    6. Conclusions 398

    Our results show moderate inherent variability (coefficient of variation up to 0.4) among the TSF 399

    and BDS time to boil and PM2.5 emission measurements based on the modified WBT protocol. 400

    We demonstrate using these data as examples that some stove laboratory testing results could be 401

    misleading if only a small number of replicate tests were conducted. However, there are costs 402

    associated with increasing the number of replicates. The average value of any measured 403

    parameter should be always reported together with the number of replicates conducted and the 404

    uncertainty (e.g. standard deviation or confidence interval). Cautions must be exercised in the 405

    interpretation of results based on only a few replicates. We then describe a practical approach to 406

    calculate the number of replicate tests needed to obtain useful data from previously untested 407

    stoves. 408

    The implications of these results include the following: (1) In the stove design and 409

    laboratory testing phase, researchers need to conduct a relatively large number of replicate tests 410

    to ensure with some confidence that the improvements of stove performance and emission levels 411

    are truly achieved. (2) In the stove field testing phase, even more tests are required because of 412

    the less controlled testing environment and the associated larger inherent variability within the 413

    replicates. (3) In the stove dissemination and adoption phase, decision makers and policy 414

    analysts should take into consideration the variability and confidence intervals of the laboratory 415

    and field testing results prior to any decisions. 416

    417

    Acknowledgements 418

  • 20

    This work was performed at the Lawrence Berkeley National Laboratory, operated by the 419

    University of California, under DOE Contract DE‐AC02‐ 05CH11231. We gratefully 420

    acknowledge partial support for this work from LBNL’s LDRD funds, and DOE’s Biomass 421

    Energy Technologies Office. The data used in this work were collected during research 422

    supported with grant number 500-99-013 from the California Energy Commission (CEC). Its 423

    contents are solely the responsibility of the authors and do not necessarily represent the official 424

    views of the CEC. Kathleen M. Lask was supported with National Defense Science and 425

    Engineering Graduate (NDSEG) Fellowship and the National Science Foundation Graduate 426

    Research Fellowship. The authors gratefully acknowledge Douglas Sullivan, Jessica 427

    Granderson, Chelsea Preble, Odelle Hadley and Philip Price of Lawrence Berkeley National 428

    Laboratory for their support of this project, as well as the many students, interns, and researchers 429

    who, before us, contributed to the development of the Berkeley-Darfur Stove. The authors are 430

    very grateful to have the paper manuscript reviewed by the journal reviewers. The paper quality 431

    is substantially improved owing to the careful review. 432

    433

    Disclaimer 434

    This document was prepared as an account of work sponsored by the United States Government. 435

    While this document is believed to contain correct information, neither the United States 436

    Government nor any agency thereof, nor The Regents of the University of California, nor any of 437

    their employees, makes any warranty, express or implied, or assumes any legal responsibility for 438

    the accuracy, completeness, or usefulness of any information, apparatus, product, or process 439

    disclosed, or represents that its use would not infringe privately owned rights. Reference herein 440

  • 21

    to any specific commercial product, process, or service by its trade name, trademark, 441

    manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, 442

    recommendation, or favoring by the United States Government or any agency thereof, or The 443

    Regents of the University of California. The views and opinions of authors expressed herein do 444

    not necessarily state or reflect those of the United States Government or any agency thereof, or 445

    The Regents of the University of California. 446

    447

    References 448

    Bailis, R., Berrueta, V., Chengappa, C., Dutta, K., Edwards, R., Masera, O., Still, D., Smith, K. 449

    R., 2007. Performance testing for monitoring improved biomass stove interventions: 450

    experiences of the household energy and health project. Energy for Sustainable Development 451

    11 (2), 57-70. 452

    Berthouex, P. M., Brown, L. C., 2002. Statistics for Environmental Engineers. Second Edition. 453

    Lewis Publishers. 454

    Corder, G., Foreman, D., 2009. Nonparametric Statistics for Non-Statisticians: A Step-by-Step 455

    Approach. Wiley. 456

    DOE (Department of Energy), 2011. Biomass cookstoves technical meeting: Summary report, 457

    Alexandria, VA. 458

    http://www1.eere.energy.gov/biomass/pdfs/cookstove_meeting_summary.pdf. Accessed 459

    February 4, 2013. 460

    Efron, B., 1979. Bootstrapping methods: Another look at the jackknife. The Annals of Statistics 461

    7 (1): 1-26. 462

    Efron, B., Tibshirani, R., 1993. An introduction to the Bootstrap. Boca Raton, FL: Chapman & 463

    Hall/CRC. ISBN 0-412-04231-2. 464

    Ellison, S., Barwick, V., Duguid Farrant, T., Practical Statistics for the Analytical Scientist: A 465

    Bench Guide, 2nd ed., (Royal Society of Chemistry, 2009)\ 466

    http://www1.eere.energy.gov/biomass/pdfs/cookstove_meeting_summary.pdf

  • 22

    Granderson, J., Sandhu, J. S., Vasquez, D., Ramirez, E., Smith, K. R., 2009. Fuel use and design 467

    analysis of improved woodburning cookstoves in the Guatemalan Highlands. Biomass and 468

    Bioenergy 33, 306-315. 469

    Hadley, O. L., Corrigan, C. E., Kirchstetter, T. W., Cliff, S. S., Ramanathan, V., 2010. 470

    Measured black carbon deposition on the Sierra Nevada snow pack and implication for snow 471

    pack retreat. Atmos. Chem. Phys., 10, 7505-7513. 472

    IEA (International Energy Agency), 2004. Energy and Development. World Energy Outlook 473

    2004. IEA Publications, Paris. 474

    Jetter, J., Kariher, P., 2009. Solid-fuel household cook stoves: Characterization of performance 475

    and emissions. Biomass and Bioenergy 33, 294-305. 476

    Jetter, J., Zhao, Y., Smith, K. R., Khan, B., Yelverton, T., DeCarlo, P., Hays, M. D., 2012. 477

    Pollutant emissions and energy efficiency under controlled conditions for household biomass 478

    cookstoves and implications for metrics useful in setting international test standards. 479

    Environmental Science & Technology 46, 10827-10834. 480

    Kirchstetter, T., Preble, C., Hadley, O., Gadgil, A., 2010. Quantification of black carbon and 481

    other pollutant emissions from a traditional and an improved cookstove. Lawrence Berkeley 482

    National Laboratory (LBNL) Report, number: LBNL-6062E. Available: 483

    http://gadgillab.berkeley.edu/wp-content/uploads/2010/11/TWK_improved-cookstove.f_13-484

    6-10.pdf. Accessed December 5, 2013. 485

    Lim S.S., Vos, T., Flaxman, A. D., Danaei, G., Shibuya, K., Adair-Rohani, H. et al., 2012. A 486

    comparative risk assessment of burden of disease and injury attributable to 67 risk factors 487

    and risk factor clusters in 21 regions, 1990-2010: a systematic analysis for the Global Burden 488

    of Disease Study 2010. Lancet 380, 2224-60. 489

    MacCarty, N., Ogle, D., Still, D., Bond, T., Roden, C., 2008. A laboratory comparison of the 490

    global warming impact of five major types of biomass cooking stoves. Energy for 491

    Sustainable Development 12 (2), 56-65. 492

    MacCarty, N., Still, D., Ogle, D., 2010. Fuel use and emissions performance of fifty cooking 493

    stoves in the laboratory and related benchmarks of performance. Energy for Sustainable 494

    Development 14, 161-171. 495

    Milton, J. S., Arnold, J. C. 1995. Introduction to probability and statistics: Principles and 496

    applications for engineering and the computing sciences, 3rd ed., McGraw-Hill. 497

  • 23

    Ramanathan, V., Carmichael, G., 2008. Global and regional climate changes due to black 498

    carbon. Nature Geoscience 1, 221 – 227. 499

    Roden, C. A., Bond, T. C., Conway, S., Benjamin, A., Pinel, O., MacCarty, N., Still, D., 2009. 500

    Laboratory and field investigations of particulate and carbon monoxide emissions from 501

    traditional and improved cookstoves. Atmospheric Environment 43, 1170-1181. 502

    Smith, K. R., Uma, R., Kishore, V. V. N., Lata, K., Joshi, V., Zhang, J., Rasmussen, R. A., 503

    Khalil, M. A. K., 2000. Greenhouse gases from small-scale combustion devices in 504

    developing countries; EPA/600/R-00/052; U.S. Environmental Protection Agency: 505

    Washington, DC. 506

    Smith, K. R., Mehta, S., Maeusezahl‐Feuz, M., 2004. Indoor smoke from household solid fuels. 507

    In Comparative quantification of health risks: global and regional burden of disease due to 508

    selected major risk factors, M. Ezzati, A.D. Rodgers, A.D. Lopez, and C.L.J. Murray eds., 509

    World Health Organization, Geneva, Switzerland. 510

    Smith, K. R., Dutta, K., Chengappa, C., Gusain, P. P. S., Berrueta, V., Masera, O., Edwards, R., 511

    Bailis, R., Shields, K. N., 2007. Monitoring and evaluation of improved biomass cookstove 512

    programs for indoor air quality and stove performance: Conclusions from the household 513

    energy and health project. Energy for Sustainable Development 11 (2), 5-18. 514

    Spiegel, M. R., Lipschutz, S., Liu, J., 2008. Mathematical Handbook of Formulas and Tables, 515

    3rd ed. McGraw-Hill. 516

    Taylor, J. R., 1997. An Introduction to Error Analysis, 2nd ed. University Science Books. 517

    USAID (United States Agency for International Development), Fuel-efficiency stove programs 518

    in IDP settings – Summary evaluation report, Darfur, Susan. December 2008; Available: 519

    http://pdf.usaid.gov/pdf_docs/PDACM099.pdf. Accessed January 6, 2014. 520

    Varian, H., 2005. Bootstrap Tutorial. Mathematics Journal, 9, 768-775. 521

    Wall, J. V., Jenkins, C. R., 2003. Practical Statistics for Astronomers, Cambridge University 522

    Press. 523

    Weinhold, B., 2011. Indoor PM pollution and elevated blood pressure: Cardiovascular impact of 524

    indoor biomass burning. Environmental Health Perspectives, 119 (10), A442. 525

    World Bank, 2011. Household Cookstoves, Environment, Health and Climate Change: A New 526

    Look at an Old Problem, The World Bank, Washington, DC; Available: 527

  • 24

    http://climatechange.worldbank.org/sites/default/files/documents/Household%20Cookstoves-528

    web.pdf. Accessed December 5, 2013. 529

    http://climatechange.worldbank.org/sites/default/files/documents/Household%20Cookstoves-web.pdfhttp://climatechange.worldbank.org/sites/default/files/documents/Household%20Cookstoves-web.pdf

  • 25

    Table 1. Summary of Z values. 530

    1 – α = 0.99 1 – α = 0.95 1 – α = 0.90

    z = 2.56 z = 1.96 z = 1.64

    531

  • 26

    Table 2. Student’s t-distribution critical values. 532

    n n – 1 t.995 (One sided) or t.975 (One sided) or t.95 (One sided) or

    (Number of

    replicates)

    (Degrees of

    Freedom) t.99 (Two sided) t.95 (Two sided) t.90 (Two sided)

    1 - - - -

    2 1 63.657 12.706 6.314

    3 2 9.925 4.303 2.920

    4 3 5.841 3.182 2.353

    5 4 4.604 2.776 2.132

    6 5 4.032 2.571 2.015

    7 6 3.707 2.447 1.943

    8 7 3.500 2.365 1.895

    9 8 3.355 2.306 1.860

    10 9 3.250 2.262 1.833

    11 10 3.169 2.228 1.812

    12 11 3.106 2.201 1.796

    13 12 3.054 2.179 1.782

    14 13 3.012 2.160 1.771

    15 14 2.977 2.145 1.761

    16 15 2.947 2.132 1.753

    17 16 2.921 2.120 1.746

    18 17 2.898 2.110 1.740

    19 18 2.878 2.101 1.734

    20 19 2.861 2.093 1.729

    21 20 2.845 2.086 1.725

    22 21 2.831 2.080 1.721

    23 22 2.819 2.074 1.717

    24 23 2.807 2.069 1.714

    25 24 2.797 2.064 1.711

    26 25 2.787 2.060 1.708

    27 26 2.779 2.056 1.706

    28 27 2.771 2.052 1.703

    29 28 2.763 2.048 1.701

    30 29 2.756 2.045 1.699

    533

    534

  • 27

    Table 3. Summary of the standard deviation values (σ) of all measured variables for the BDS 535

    (n=21) and the TSF (n=20). 536

    TSF BDS

    Time to boil (minute) 5.6 2.1

    Dry wood burned (g) 75.4 33.6

    CO emission factor (g/kg-wood) 6.8 5.8

    PM2.5 emission factor (g/kg-wood) 1.0 1.2

    BC emission factor (g/kg-wood) 0.3 0.5

    537

  • 28

    538

    539

    Figure 1. Schematic of the Berkeley-Darfur Stove. (1) A tapered wind collar that increases fuel-540

    efficiency in the windy Darfur environment and allows for multiple pot sizes; (2) Wooden 541

    handles for easy handling; (3) Metal tabs for accommodating flat plates for bread baking; (4) 542

    Internal ridges for optimal spacing between the stove and a pot for maximum fuel efficiency; (5) 543

    Feet for stability with optional stakes for additional stability; (6) Nonaligned air openings 544

    between the outer stove and inner fire box to accommodate windy conditions; and (7) Small fire 545

    box opening to prevent using more fuel wood than necessary. 546

  • 29

    547

    548

    Figure 2. Histogram of time to boil data for the BDS and the TSF. 549

  • 30

    550

    551

    Figure 3. Histogram of PM2.5 emission factor data for the BDS and the TSF. 552

  • 31

    553

    554

    Figure 4. Cumulative distribution function (CDF) of time to boil data for the BDS and the TSF. 555

  • 32

    556

    557

    Figure 5. Cumulative distribution function (CDF) of PM2.5 emission factor data for the BDS and 558

    the TSF. 559

  • 33

    560

    561

    Figure 6. The width of the confidence interval about the mean as a function of the number of 562

    replicate tests at three probability levels (0.1, 0.05, and 0.01) for the BDS time to boil and PM2.5 563

    emission factor data. For example, if the width of the confidence interval for the mean time to 564

    boil is 4 minutes at probability levels of 0.1, 0.05, and 0.01, 5, 7 and 12 replicates are required, 565

    respectively, as indicated by the black horizontal dash line and the black vertical arrows. 566

  • 34

    567

    568

    Figure 7. Kolmogorov-Smirnov test result showing the probability of the BDS and the TSF 569

    bootstrap samples are drawn from two different distributions as a function of the number of 570

    replicate tests for the time to boil and PM2.5 emission factor data. 571