
Post on 22-Jun-2020


Page 1: Machine Learning Based Simulation and Optimization of ...apps.olin.wustl.edu/workingpapers/pdf/2018-04-001.pdf · Machine Learning Based Simulation and Optimization of Soybean Variety


Machine Learning Based Simulation and Optimization of Soybean Variety Selection

Improving crop yield is a critical and necessary component of achieving food security and protecting natural resources and environmental quality for future generations. Although significant progress has been made in agricultural science in developing seed varieties with genetic traits desirable in different planting environments and in advancing farming technologies and practices, tremendous opportunities exist in exploring how individual farmers, with limited resources, can make the best use of what agricultural science offers. As a small part of the research effort in that direction, this paper proposes an analytics framework for seed variety selection, one of the most important decisions a farmer has to make and one with significant implications for the yield of the farm. Agribusinesses offer many seed varieties to farmers based on the yield performance of the seed varieties observed over several years at various farm locations. Farmers face the decision of selecting a small set of the seed varieties offered and allocating farmland to the selected varieties. An informed decision requires accurate predictions of the yield performance of seed varieties on the targeted farmland and a balance between the expected yield and the risk associated with the varieties selected. This paper uses soybean seed data from Syngenta, an agribusiness, to describe and evaluate five machine-learning models as predictive models for soybean yield. Using a dataset collected between 2008 and 2014, we evaluate these models and choose Regression Tree (RT) as the most suitable machine-learning model to inform simulation and optimization of soybean yield under different weather conditions. We fit Kernel Density Estimates (KDEs) at the terminal nodes of the RT to approximate the continuous probability distributions of the soybean yield, from which soybean yields are simulated under different weather scenarios. We formulate a simulation-based optimization problem to determine the optimal soybean mix that minimizes the risk associated with the yield. As a result, by considering the farm characteristics, the farmer is equipped to make the optimal soybean-mix decision. The methodology developed in this research can be applied to seed selection decisions for other crops and can positively influence farming practice.

Keywords: Food Shortage; Soybean; Machine Learning; Simulation and Optimization.

1. Introduction

Humanity is facing the great challenge of feeding itself. The world population has grown sixfold, from one billion to about six billion, in the last two hundred years. According to the Population Institute, roughly 230 thousand more babies are born every day. The World Food Programme estimates that about 795 million people do not have adequate food to lead a healthy life. About 3.1 million children die every year because of poor nutrition. At the same time, the decreasing amount of land used for farming has made the burden of food shortage acute. Simply attempting to increase the land available for farming is unlikely to be a sustainable solution to the food shortage problem. Improving crop yield is a critical and necessary component of achieving food security and protecting natural resources and environmental quality for future generations. Although significant progress has been made in agricultural science in developing seed varieties with genetic traits desirable in different planting environments and in advancing farming technologies and practices, tremendous opportunities exist in exploring how individual farmers, with limited resources, can make the best use of what agricultural science can offer. Agribusinesses offer many seed varieties to farmers based on the yield performance of the seed varieties observed over several years at various farm locations. From the varieties offered by agribusinesses, farmers need to select a small set that is most suitable for the soil and weather characteristics of their farms and allocate farmland to the selected varieties. Integrating the rich data set from agribusinesses’ experimental farms with the data from local farms gives rise to opportunities for greatly improving seed selection decisions. This paper proposes an analytics framework to help farmers optimally determine the proportion of varieties to grow on a targeted farmland based on seed performance and farm characteristics data.

Take soybeans, for example. Every year, soybean farmers decide on the mix of varieties to grow. They allot most of their land to trusted commercial varieties. At the same time, they experiment with new varieties by growing them on a small portion of their land. If an experimental variety consistently produces a high yield year after year, the farmer’s confidence in growing that variety increases. While deciding on the mix of varieties to grow, farmers also need to factor in uncertainty due to weather and soil conditions, as well as historical yield studies of different varieties. To add to the complexity, varieties that are not the best under ideal weather conditions may perform better under bad weather than the varieties that give the highest yield under ideal weather conditions. Thus, the variation in the yield of a seed variety under different weather conditions is an important risk measure that needs to be factored in when choosing the mix of varieties to grow on a farm. Our survey of soybean farmers indicates that farmers do not have a systematic approach for considering the variation in the yield of a seed variety under different weather conditions. They intuitively process all these uncertainties and the complex relationships among the yield of varieties, weather conditions, and soil characteristics of the farm, and choose to grow one variety or a mix of a few varieties to hedge against uncertainties. Intuition- or experience-based decisions are limited in that they cannot fully utilize a large amount of data from various sources or assess the impact of a multitude of uncertain factors on crop yields. Our paper addresses those challenges with data-driven models that aid farmers in making optimal decisions about the mix of varieties and their proportions to be grown on their farms. We utilize the dataset from Syngenta, an agribusiness, to propose a framework that informs simulation and optimization models through machine learning and optimally allocates up to a chosen number of soybean varieties to the targeted farm. This is the methodological goal of this research. This research is a novel effort in integrating machine learning, simulation, and optimization in any field, and it is especially ground-breaking for the purpose of choosing a mix of soybean varieties to grow on a targeted farm. At present, neither farmers nor agribusinesses use a systematic approach, such as the one introduced in this work, to model the uncertainties while deciding the proportion of different varieties to grow. We believe that equipping farmers with analytics and decision support tools to fully process and utilize the data from agricultural science will empower farmers in their seed selection decisions and improve their crop yield. As a result, the systematic approach introduced in this paper helps humanity take a step towards alleviating the food shortage crisis.

The data-driven analytics framework created in this paper is shown in Figure 1. It considers historical data on weather conditions, soil characteristics, and the performance of soybean varieties in many experimental plantations in the Midwest of the USA. The major tasks performed by this framework can be grouped into three steps: descriptive analytics, predictive analytics, and prescriptive analytics. In our descriptive analytics, we explore the distribution of the response variable, create clusters of farms to group similar farms together, impute missing data, and create new variables to identify clusters. As part of our predictive analytics, we fit five machine-learning models to predict soybean yield: Regression Trees (RT), Random Forest (RF), Boosted Trees (BT), Multivariate Adaptive Regression Splines (MARS), and Artificial Neural Network (ANN). RT was found to be the winning machine-learning model in our out-of-sample testing. We also utilize kernel density estimates (KDEs) to approximate the soybean yield distribution at the terminal nodes of the RT. Unlike a traditional prediction task, in which the mean at the terminal node of the RT is used as the predicted yield, we use the KDE at each terminal node of the RT to simulate many scenarios of soybean yield under different weather scenarios. Performance measures from such simulations of soybean yield from KDEs quantify the expected yield and yield-related uncertainties and serve as input to our optimization model. In the prescriptive analytics part of the framework, we evaluate the performance of all of the seed varieties and optimally allocate up to a given number of seed varieties to be grown on a targeted farm. Each of these components of the framework is described in later sections.
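The idea of simulating yields from a kernel density estimate at a terminal node, rather than using the node mean, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the regression tree is assumed to have been fit already, so that historical yields are grouped by terminal node; the leaf names, the yield values, and the rule-of-thumb bandwidth are all hypothetical choices made for the example.

```python
import random
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth for a Gaussian kernel (an assumed choice)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)


def sample_from_kde(samples, n_draws, rng):
    """Draw from a Gaussian KDE: pick an observed yield, then add kernel noise."""
    h = rule_of_thumb_bandwidth(samples)
    return [rng.choice(samples) + rng.gauss(0.0, h) for _ in range(n_draws)]


# Hypothetical yields (bushels per acre) falling into two terminal nodes of a
# fitted regression tree; the node names stand in for weather-driven splits.
leaf_yields = {
    "leaf_dry_season": [48.2, 50.1, 51.7, 49.5, 52.3, 47.9],
    "leaf_wet_season": [57.4, 59.0, 60.2, 58.1, 61.3, 56.8],
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
simulated = {
    leaf: sample_from_kde(yields, 1000, rng) for leaf, yields in leaf_yields.items()
}

# Summaries of the simulated yields: (mean, standard deviation) per node.
summary = {
    leaf: (statistics.mean(draws), statistics.stdev(draws))
    for leaf, draws in simulated.items()
}
for leaf, (mean_yield, sd_yield) in summary.items():
    print(f"{leaf}: mean = {mean_yield:.1f}, std = {sd_yield:.1f}")
```

Sampling an observed yield and perturbing it with kernel noise is equivalent to drawing from the Gaussian KDE itself, which is why no explicit density evaluation is needed here. The per-node mean and standard deviation are the kind of performance measures that could then feed an optimization model.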


Figure 1. Data-Driven Analytics Framework.

The contribution of the research is three-fold:

• This research introduces a novel data-driven analytics framework to the agriculture community to tackle seed variety selection decisions. The novelty stems from its fully data-driven nature and its integration of three categories of analytics tools: machine learning, simulation, and optimization.

• The simulation model created from this approach represents the yield dynamics accurately because it is entirely based on the patterns learned from a real data set collected over a long period of time. This approach also makes the simulation run efficiently by reducing the number of simulation states. An overview of this approach is given in Section 4.4. The simulation enables the farmer to assess the risk of growing a seed variety. A pilot version of the tool (SimSOY) is available as a web application. A tool like this, which integrates machine learning, simulation, and optimization, can positively impact practice.

• To the best of our knowledge, this is one of the first works in the Operations Management (OM) literature that proposes a data-driven methodology for seed selection decisions. To understand the complexity of the variety selection decision, we conduct a survey among farmers who grow soybeans. The survey results not only confirm the need for and the value of the proposed approach, but also reveal that seed selection, because of its relevance to agricultural science and economic factors, can give rise to a rich set of challenging research problems that are of interest to the OM community.

The rest of this paper is organized as follows. Section 2 presents a survey we conducted among farmers and discusses our observations. It also compares the seed selection decision in farming with the investment portfolio decision in finance. Section 3 reviews the relevant literature. Section 4 explains the analytics framework used in this research. This section includes discussions of the variables, the notation, the tree structures used to build the simulation model, the kernel density estimation in the terminal nodes, the simulation algorithm and its results, and the formulations of the soybean-mix decision based on portfolio optimization theory. In Section 4, we also present the optimal solution along with the solution’s sensitivity to the parameters in the model. Section 5 discusses the implementation of the pilot web application, SimSOY. In Section 6, we discuss our observations and provide concluding remarks.

2. Understanding Seed Selection Decisions

Seed retailers usually sell well-tested varieties that are suitable for the region. Soybean farmers allot most of their land to trusted varieties. Most of the research and experimentation done by seed developers and farmers determines whether a variety is suitable for the region. However, farmers, as well as seed retailers, do not have an analytics-driven decision support tool that can help them utilize the vast amount of historical yield performance data from experimental farms and combine it with their knowledge of local farm conditions to choose suitable varieties for targeted farms.

To understand the current practice and the relevant factors influencing farmers’ seed selection decisions, we conducted a survey among soybean farmers.

2.1 Survey of Farmers

Initially, we created a survey that was data intensive: we picked varieties with high historical yield from the Syngenta data set used in this research, estimated their expected yield and variance-covariance matrix, and provided those as input for the farmers to use in answering the survey questions.1 After testing that version of the survey with a few farmers, it became clear to us that farmers were not used to processing such a data-heavy survey, especially the covariance part of the data. As a result, we simplified our survey to ask open-ended questions without providing any such data. We worked with the Illinois Soybean Association to iteratively improve the survey until it was fit for administering. In this effort, we received thirty-seven complete responses.2 Table 1 lists the questions asked and key observations made from the responses.

1 The thought behind providing such a data set was to see how farmers would use the expected yield and risk associated with the varieties in choosing varieties to grow on their farms.

Table 1: Survey Questions Posed to Farmers and a Seed Salesperson, and Their Responses.

1. Are you a farmer or a seed salesperson?
36 out of 37 participants are farmers and one participant is a seed salesperson.3

Survey questions posed to farmers (36)

2. Do you usually grow soybean on your farm?
All of them responded with a “Yes.”

3. In how many acres do you usually grow soybean?
The answers ranged from thirteen acres to twelve thousand acres. The average size of the farm is about 1475 acres.
[Figure: Acreage for Soybean? Histogram; x-axis: acres, y-axis: count.]

4. How many varieties of soybean you usually grow at the same time?
The responses ranged between 1 and 10 varieties. The average is almost 4 varieties. When farmers responded with a range, the average of the range was used as the response to generate the figure.
[Figure: Number of Varieties Grown? Bar chart; x-axis: varieties, y-axis: count. Counts for the bins 0 to 1, 1.1 to 2, 2.1 to 3, 3.1 to 4, 4.1 to 5, 5.1 to 6, 6.1 to 7, 7.1 to 8, 8.1 to 9, and 9.1 to 10: 3, 3, 8, 5, 11, 4, 0, 0, 0, 1.]

5. How do you normally make soybean variety selection decisions?
Responses show that farmers do make use of historical data for decision-making. In selecting varieties, farmers also highly value the recommendations offered by dealers, salespersons, and agronomy professionals.
[Figure: Variety Selection Criteria? Bar chart; y-axis: count.]

6. How do you divide your land among the different soybean varieties that you choose to grow?
Responses indicate that the majority of farmers use simple rules: they either split their land equally among the varieties, or allocate more land to the variety with the highest expected yield and split the remaining land equally between the other varieties.

Survey questions posed to the seed salesperson (1)

2. Do you sell soybean?
Response: “Yes.”

3. How many soybean varieties are typically offered to the farmers?
Response: “20 Plus”
This response clearly justifies the need for an analytics tool like the one proposed in this research, because it would be difficult to intuitively process so many varieties and select four or five of them to grow.

4. In how many acres, a typical farmer, would grow soybean?
Response: “30% to 60% of their land.”
This response indicates that farmers typically grow multiple crops at the same time to reduce risk. Finding the optimal allocation of land among different crops would be another interesting area of research for the OM community.

5. How many varieties of soybean, a typical farmer, usually grow at the same time?
Response: “4 to 6.”
The response is consistent with the responses obtained from farmers.

6. How do farmers normally make soybean variety selection decisions?
Response: “Yield data, past experiences and recommendations from seed professionals.”
Again, the response is consistent with the responses of farmers.

7. How do they divide their land among the different soybean varieties they choose to grow?
Response: “Highest percentage would be to highest yielding variety, other percentage would be to variety of a different maturity range, over any other reason.”
This is also consistent with the responses of farmers.

Closing questions

Would you like to receive a copy of the summary of this survey results?
Twenty-four respondents answered “Yes” to this question, which shows that the majority of them cared to know more about this research.

What is a convenient e-mail address at which we can reach you?
What is the address at which we can reach you?
Thirty respondents provided either an e-mail address or a physical address at which to contact them.

2 The Illinois Soybean Association helped us administer the survey in exchange for a $750 gift towards the Illinois Soybean Growers fund for thirty or more responses.

3 As indicated in the table, thirty-six of the thirty-seven respondents were farmers. The Illinois Soybean Association was able to get only one seed salesperson to complete the survey. In the original table, the questions for farmers appeared on the left side and the questions for the seed salesperson on the right side. We used the Qualtrics survey tool to administer this survey, which made it possible to switch the questions between farmers and the salesperson. It also enabled us to administer the survey online.

[Figure: Land Allocation Criteria? Bar chart; y-axis: count.]

Question 5, one of the two most important questions of the survey, asks how farmers select their varieties. From the responses to this question, we attempt to understand farmers’ thought process and priorities in selecting soybean varieties to grow. Twenty-one responses to this question directly or indirectly mentioned that they consider historical data from the dealer, the past yield performance of varieties on their own farms, and sometimes even the yield performance on different plots within the farm when selecting varieties. They also follow the recommendations of the seed salesperson. This indicates a high level of trust between the farmers and the seed salesperson. A few farmers consider the relative maturity of the variety and other factors important. Like many other plants, the soybean crop is photosensitive. The relative maturity of a soybean variety indicates how long it will take for the crop to mature and be ready for harvest. In general, varieties that need a longer period to mature have a higher yield, and vice versa. However, the high expected yield of varieties that take longer to mature comes with a high risk due to weather and other factors, simply because the crop has to be in the field for a longer duration. Also, a relative maturity that is suitable for the southern parts of the United States of America (USA) will not be suitable for the northern parts. The varieties considered in this research are the ones that would be suitable for the Midwestern states of the USA. Therefore, the relative-maturity-related variable considered in this research is RM_25. This variable indicates the probability that a farm site would be suitable for the varieties considered in this research.

Other interesting criteria were mentioned that are not explicitly considered in this research. For example, one farmer mentions that he/she would prefer varieties with resistance to soybean cyst nematode and herbicide. Soybean cyst nematode (SCN) is a disease of the soybean crop caused by a microscopic worm that invades the plant and feeds on it. An SCN-affected crop usually looks yellowish, and SCN can cause a significant loss of yield. Herbicide resistance is also important so that the crop is not affected by the herbicides used to kill unwanted vegetation. Similarly, another farmer mentions that the choice of varieties on his/her farm depends on the herbicide program that he/she has decided to use for that year. He/she does not want too much of the same herbicide chemical going into the soil every year and therefore chooses varieties that are appropriate for that year’s herbicide treatment program. Interestingly, only one response mentioned cost as a factor.

Another important question on the survey, Question 6, asked how farmers allot their land among the varieties. We wanted to understand how farmers’ current allocation of land among the varieties would compare to the results from the optimization performed in this paper. The responses indicate that farmers use simple rules to allocate the land among the varieties. Responses such as “Even splits to mitigate risk” and “I would divide it equally. It spreads out the risk to have several varieties” also indicate that farmers clearly understand the need to grow several varieties to reduce the risk.
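The two simple rules reported by farmers can be written down as baseline allocation heuristics, which is useful when comparing them with an optimized mix. The sketch below is only an illustration: the variety names, expected yields, and acreage are hypothetical, and the `top_share` parameter is an assumption, since the survey responses do not state how much extra land the highest-yielding variety receives.

```python
def equal_split(varieties, total_acres):
    """Rule 1: divide the land equally among the chosen varieties."""
    share = total_acres / len(varieties)
    return {variety: share for variety in varieties}


def top_variety_first(expected_yield, total_acres, top_share=0.5):
    """Rule 2: give a larger share to the variety with the highest expected
    yield and split the rest equally. top_share is a hypothetical parameter."""
    ranked = sorted(expected_yield, key=expected_yield.get, reverse=True)
    best, rest = ranked[0], ranked[1:]
    if not rest:
        return {best: total_acres}
    allocation = {best: total_acres * top_share}
    remaining = total_acres * (1.0 - top_share)
    allocation.update({variety: remaining / len(rest) for variety in rest})
    return allocation


# Hypothetical expected yields in bushels per acre.
expected_yield = {"V1": 58.0, "V2": 55.0, "V3": 54.0}

print(equal_split(list(expected_yield), 1500))
print(top_variety_first(expected_yield, 1500, top_share=0.5))
```

Neither rule uses any information about how the yields of the varieties vary together across weather conditions, which is the gap a systematic mean-risk approach is meant to fill.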

These responses indicate that farmers are intuitively trying to balance the expected yield and the risk associated with varieties by growing different varieties. We believe that an analytics tool that systematically balances the expected yield and the risk associated with varieties will be valuable to farmers.

Twelve of the responses indicate that the allocation of land to varieties depends heavily on the soil characteristics. These responses indicate that the soil type can change from field to field even within the same farm. To that end, the consideration of eleven important soil-related variables, i.e., CONUS_PH, CONUS_AWC, CONUS_SILT, CONUS_SAND, CONUS_CLAY, ISRIC_PH, ISRIC_CEC, ISRIC_CLAY, ISRIC_SILT, ISRIC_SAND, and EXTRACT_CE, as predictors of the yield is a clear strength of this research. These predictors, along with other predictors of yield, are introduced in Section 4.

As indicated earlier in the responses to Question 5, for some farmers the allocation of land to varieties also depends on the relative maturity of the varieties. There is an indication that planting varieties with different maturities is preferable so that farmers do not have to harvest the entire farm at the same time. For example, consider the following responses.

“Divide on maturity groups so I can keep harvesting beans and not all ready at one time.”

“Usually put 500 acres in early maturing variety and 1000 acres in 2 later season varieties.”

Farmers may not have enough equipment and personnel to harvest the entire farm at the same time. Harvesting the entire farm on the same day or even within a few days may not be feasible.4 Interestingly, five responses mention that their land allocation depends on bin storage capacities and tillage practice. Those responses reveal that operational challenges in harvesting also play a role in planting decisions, which represents an opportunity for the OM community.

Overall, based on the survey responses, we believe that farmers’ goals and the goals of this research are in alignment. Both seek to increase the expected yield and lower the risk when deciding which varieties to grow. However, farmers use intuition and recommendations to allocate land to different varieties in pursuit of that goal, which ends up being suboptimal. The analytics framework introduced in this research can significantly increase the expected yield and lower the risk. We find that farmers can improve their expected yield by nine percent by using the framework introduced in this research.

4 The optimization formulation proposed in Section 4 could enforce the selection of varieties that have slightly different maturities so that farmers can spread the harvesting period over a manageable window.

2.2 Portfolio of Soybean Varieties

Determining the proportion of each soybean variety to grow on a farmland and making financial portfolio decisions share a great deal of similarity. Table 2 provides a conceptual comparison between the two problems. Both decisions involve choosing a portfolio of items under significant uncertainties that are outside the control of the decision maker. As in financial portfolio optimization, a reasonable solution to the soybean variety allocation problem must account for both the expected yield and the risks associated with the yield of the portfolio. Syngenta believes that, in general, farmers are concerned about the risk associated with the yield under adverse weather conditions. From the farmer’s point of view, when conditions are unfavorable, they want to minimize the risk associated with their farm’s yield. However, our survey results do not indicate that farmers have a tool or a system to systematically model the risk based on past data. Although portfolio optimization is a well-established area in finance, methodologies developed by academics, such as Markowitz (1959), are not common in agriculture. This is further discussed in the literature review section.
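To make the mean-risk view concrete, the following sketch computes the expected yield and the standard deviation of yield for different mixes of two varieties. It is a minimal illustration, not the optimization model of this paper: the two varieties, their yields, and the assumption of two equally likely weather scenarios are all hypothetical.

```python
import statistics

# Hypothetical yields (bushels per acre) under two equally likely weather
# scenarios, listed as [ideal weather, adverse weather].
scenario_yields = {
    "A": [60.0, 50.0],
    "B": [55.0, 53.0],
}


def mix_mean_and_risk(share_a):
    """Return (expected yield, standard deviation of yield) per acre when a
    fraction share_a of the land grows variety A and the rest grows B."""
    mixed = [
        share_a * yield_a + (1.0 - share_a) * yield_b
        for yield_a, yield_b in zip(scenario_yields["A"], scenario_yields["B"])
    ]
    return statistics.mean(mixed), statistics.pstdev(mixed)


# Trace the mean-risk trade-off as the share of variety A goes from 0 to 1.
frontier = [(w / 10, *mix_mean_and_risk(w / 10)) for w in range(11)]
for share_a, mean_yield, risk in frontier:
    print(f"share of A = {share_a:.1f}  mean = {mean_yield:.2f}  std = {risk:.2f}")
```

In this toy setting each additional unit of expected yield is bought with additional risk, so the preferred mix depends on the decision maker's tolerance for yield variability. With more varieties and yields that do not move together, the same calculation uses the full covariance structure of the yields.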

Table 2. A Conceptual Comparison of Seed Variety Selection and Investment Portfolio Decisions

Similarity: Objectives
Seed variety selection decision: Maximize (minimize) the expected yield (risk of yield) of the farm.
Investment portfolio decision: Maximize (minimize) the expected return (risk) of the portfolio.

Similarity: Uncertainties
Seed variety selection decision: Weather-related factors, disease, pest, soil conditions, farm management, etc.
Investment portfolio decision: Macro-economic conditions (social, political, etc.), industry competition, operational efficiency of the firm, etc.

Difference: Return of the investment
Seed variety selection decision: Within a given period, the same seed planted in different farms can have different yields.
Investment portfolio decision: Within a given investment period, the return of an asset is independent of which investors purchased it.

In a nutshell, the portfolio theory considers both the expected yield of a portfolio and the risk

associated with the yield, and formulates the portfolio decision as an optimization problem with the

objective of maximizing (minimizing) the portfolio’s expected yield (risk) contingent upon a level of risk

(yield) measured by the standard deviation (mean) of the yield. By varying the expected yield of the port-

folio and solving the corresponding optimization problem, one can obtain an efficient frontier of the deci-

sion represented by a curve in the mean-risk space. The decision maker can decide the optimal portfolio

of varieties based on his/her risk-yield trade-off preference. The analogy between soybean variety alloca-

tion and portfolio investment suggests that portfolio theory can be used for optimizing the proportion of

different soybean varieties grown on the farm. However, there are a number of challenges that are unique

to the soybean-mix optimization. As highlighted in Table 2, the yield of a soybean variety is affected by

environmental factors (such as weather, disease, pest, and other conditions) associated with the farm

where the seed is planted. Thus, the actual performance of a soybean variety can vary across farms. Even

when the varieties have been tested over several years in a range of locations, their performance can vary

depending on weather conditions. Imagine that there are two varieties under consideration: Variety A and

Variety B. Variety A might yield 60 bushels per acre under ideal weather conditions while Variety B

yields only 55 bushels per acre under the same conditions. However, when weather is not ideal, Variety

B, with a superior drought tolerance characteristic, might yield 53 bushels per acre while Variety A yields

only 50 bushels per acre. When considering only the mean yield, Variety A appears to be superior with a

mean yield of 55 bushels per acre when compared to Variety B’s mean yield of 54 bushels per acre. How-

ever, Variety A’s yield varies between 50 bushels per acre and 60 bushels per acre while Variety B’s

yield varies between 53 bushels per acre and 55 bushels per acre. This hypothetical example illustrates

that the higher mean yield of Variety A comes with a higher risk. On the other hand, varieties like Variety B are less risky because of their inherent drought resistance. To model the risk level associated with varieties, it is important that the proposed methodology quantify the mean, variance, and covariance of yield. The

crucial challenge in applying portfolio theory to a specific farm’s seed variety allocation decision is in the

estimation of the mean, variance, and covariance of yields of a given set of seed varieties for the farm.

For example, the Syngenta dataset includes data for 182 varieties of soybeans tested in more than 350

site-year scenarios from 2008 to 2014. Consistent with the survey responses, it is assumed that farmers would be interested in utilizing the accumulated knowledge (weather, soil condition, etc.) of the farm and yield information from these site-year scenarios. Survey results show that the majority of farmers look

into the historical yield data. What farmers lack is a tool or framework that can systematically use the his-

torical data to make decisions. Therefore, this paper proposes a data-driven framework, which seamlessly

integrates machine learning with simulation and optimization, to optimally allocate the land between the

soybean varieties by learning the mean, variance, and covariance of the yield of soybean varieties from

the historical data.
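
The hypothetical Variety A and Variety B example above can be worked through numerically. The following is a minimal sketch, assuming the ideal and non-ideal weather states are equally likely (the assumption implied by the stated means of 55 and 54 bushels per acre). The function and variable names are illustrative and are not part of the framework.

```python
from statistics import mean, pvariance

# Yields (bushels per acre) from the hypothetical example in the text.
# Assumption: ideal and non-ideal weather are equally likely.
yields = {
    "A": [60.0, 50.0],  # [ideal weather, non-ideal weather]
    "B": [55.0, 53.0],
}


def covariance(xs, ys):
    """Population covariance of two equally weighted scenario lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def mix_stats(share_a):
    """Mean and variance of yield when a fraction share_a of the land grows A."""
    share_b = 1.0 - share_a
    mix = [share_a * a + share_b * b for a, b in zip(yields["A"], yields["B"])]
    return mean(mix), pvariance(mix)


mean_a, var_a = mean(yields["A"]), pvariance(yields["A"])
mean_b, var_b = mean(yields["B"]), pvariance(yields["B"])
cov_ab = covariance(yields["A"], yields["B"])
mix_mean, mix_var = mix_stats(0.5)  # half of the land to each variety
```

Under these assumptions, Variety A has a mean of 55 and a variance of 25, Variety B has a mean of 54 and a variance of 1, and an equal split gives a mean of 54.5 with a variance of 9. The mix gives up little expected yield relative to Variety A while carrying far less variance, which is the trade-off the portfolio formulation is meant to capture.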

3. Literature Review

Applying modern empirical models to study factors affecting crop yield has been an active research area

in the past century. Ronald Fisher, a towering figure in statistics, developed experimental designs to analyze the confounding effects of soil type and weather on crop yield (Fisher 1925). Over the past few decades, the application of predictive and prescriptive analytics to choosing a mixture of cultivars has been attempted. Nalley et al. (2008) used linear regression to model the wheat yield with variety, seed breeder,

and year as predictors. Similarly, Barkley et al. (2013) used a regression model to predict wheat yield

with prevalent diseases, location, and weather as predictors. These researchers find that weather and the location of the experimental site play an important role in the prediction of the yield. Dixon et al. (1994) included solar radiation as one of the predictors and showed that including solar radiation in the model results in a superior prediction of corn yield. Prior to this work, Fischer (1985) showed that the number of kernels per square meter of wheat crops is linearly related to the incident solar radiation in the thirty or so days preceding anthesis. Unlike many other models in which linear relationships between yield and

predictors were assumed, Schlenker and Roberts (2006; 2009) considered nonlinear effects of weather on

yield. In our research, flexible machine-learning models are considered as candidates to predict soybean

yield and, in turn, help in the simulation of soybean yield for different weather scenarios. Unlike in most

of the reviewed literature, our research does not impose a model structure on the data. Instead, the machine-learning models considered tune their flexibility according to the structure observed in the actual data. As a result, the machine-learning models considered in our research are expected to produce more accurate predictions and simulations.

In the last two decades, a wide variety of simulation approaches were used to forecast crop yield under different weather conditions (Semenov et al. 1996; Tubiello et al. 2002; Hansen & Indeje, 2004; Challinor et al. 2005; Gijsman et al. 2007; Lobell & Burke, 2010). Unlike most of the simulation modeling approaches found in the agriculture literature, the simulation model created in this research, similar to Sundaramoorthi et al. (2009), Sundaramoorthi et al. (2012), and Sundaramoorthi (2014), is completely informed by the data. This way of creating simulation models does not require explicit botanical knowledge about the crop growing process. Understanding how soybean grows would require an understanding of weather impacts and soil impacts in addition to the botanical aspects of soybean. The interaction of all these aspects adds to the complexity of the crop growing process. It is unlikely that even an expert thoroughly understands the intricacies of crop growth and yield well enough to simulate that process

accurately. By contrast, the simulation technique used in this paper unearths patterns in the data using ma-

chine learning to inform the simulation of the yield for each soybean variety under different weather con-

ditions. The resulting simulated yield data and performance measures from the simulation are used to op-

timally allocate up to a given number of soybean varieties to be grown at a farm site. The optimization

problem in this research is formulated similarly to the portfolio optimization problem, which is reviewed

below.

In the finance community, a portfolio is a set of assets that an institution or an individual holds, hoping that

the return obtained from the portfolio is favorable as time progresses. Quite often, it is desired to mini-

mize the risk of the portfolio. This is analogous to the notion “don’t put all your eggs in one basket.” The

idea is to diversify the portfolio to mitigate the risk. Based on this principle, Markowitz (1959), Tobin

(1958), Sharpe (1963, 1964, & 1970), and Lintner (1965a &1965b) created the portfolio optimization the-

ory. A notable innovation in Markowitz (1959) was the quantification of the portfolio risk similar to ones

in Section 4.6 of this paper. In probability theory, this quantification already existed as the variance of

weighted sum of random variables. The quantification of portfolio risk in Markowitz (1959) was made

possible by Uspensky (1937). The application of portfolio optimization theory in agriculture is relatively new. Applications of portfolio optimization theory to farming can be found in Robison and Brake (1979), Nyikal and Kosura (2005), Figge (2004), Barkley and Peterson (2008), and Nalley et al. (2009). The portfolio optimization formulation in this research has some similarities with the ones in Barkley and Peterson (2008) and Nalley et al. (2009). However, the formulation in this research differs in the way we estimate yield uncertainties from the simulation and incorporate them into the portfolio optimization formulation. Also, additional requirements of farming practice make the optimization problem computationally difficult. One of the restrictions is that only up to a certain number of varieties out of a large number of seed varieties should be chosen for growing. This requirement introduces binary decision variables into our optimization formulation, which adds to the complexity of the prescriptive analytics component of the data-driven analytics framework.
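
To make the role of the binary variables concrete, a generic cardinality-constrained mean-variance program can be sketched as follows. This is an illustrative textbook-style form and is not necessarily the formulation used later in this paper. All symbols are assumptions introduced here: $w_i$ is the share of land given to variety $i$, $z_i$ indicates whether variety $i$ is selected, $\mu_i$ and $\sigma_{ij}$ are the estimated mean yields and yield covariances, $K$ is the maximum number of varieties allowed, and $\mu_0$ is a target mean yield.

```latex
\begin{aligned}
\min_{w,\,z} \quad & \sum_{i=1}^{n}\sum_{j=1}^{n} w_i\, w_j\, \sigma_{ij} \\
\text{s.t.} \quad & \sum_{i=1}^{n} w_i\, \mu_i \ge \mu_0, \qquad \sum_{i=1}^{n} w_i = 1, \\
& 0 \le w_i \le z_i, \qquad z_i \in \{0, 1\}, \qquad i = 1, \ldots, n, \\
& \sum_{i=1}^{n} z_i \le K .
\end{aligned}
```

The linking constraint $w_i \le z_i$ forces the land share of any unselected variety to zero, so the limit on $\sum_i z_i$ caps the number of varieties grown. It is this mixed-integer structure that makes the problem harder than a classical mean-variance program.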

This paper belongs to an emerging research area in the OM community that studies operations and supply chain problems in food systems. We have identified some such opportunities related to seed selection in Section 2.1. We briefly review some OM research that has contributed to farming decisions in agriculture (for examples, see Allen & Schuster, 2004; Huh & Lall, 2013; Tang et al. 2015; Chen et al. 2015; Bansal et al. 2017; Bansal & Nagarajan, 2017). Allen and Schuster (2004) use the case of grapes to introduce an approach for optimizing the rate of harvest by considering the balance between the rate of harvest, the size of the crop, and the risk associated with delay in harvesting. It would be interesting to integrate soybean-specific factors, such as the implications of relative maturity for risk and harvest rate, into their model. Huh and Lall

(2013) formulated two stochastic programming models to optimize the crop diversification decisions. The

uncertainty in their problems was due to the amount of irrigation water and the market price for the crop. They show how reducing the uncertainty in the market price could change the optimal decision. Chen et al. (2015) studied peer-to-peer interactions between farmers on a knowledge learning and sharing platform in the presence of an expert. Their study shows what works and what does not when the system is in equilibrium. Creating and studying such a platform to enable interactions between farmers and seed salespeople would be valuable. Bansal et al. (2017) developed an optimization-based approach

to derive the expected value and the standard deviation of a probability distribution from the quantile

judgement data of experts. They show that their approach is equivalent to the estimations from a sample

data set. This work has been utilized by an agribusiness for decision-making and has achieved significant cost savings. Bansal & Nagarajan (2017) studied an agribusiness’s production of hybrid seeds under a constrained supply of parent seeds subject to randomness in the production process. The agribusiness has a

separate production site in South America to mitigate the risk of failure in production at the main site. The

paper formulates and solves an optimization problem for both sites to allocate the parent seeds optimally.

Our work complements the literature by proposing a data-driven approach to making tailored seed selection recommendations to farmers. The framework introduced in this paper to address this issue is readily applicable to agribusinesses and farmers with historical data on yield, weather, and soil characteristics. The optimal allocation of farmland between different seed varieties determined from the framework benefits agribusinesses and farmers through better yields. Our hope is that this topic spurs more interest from the OM community in agriculture operations.

4. Methodology and Theory

4.1 Descriptive analytics

The Syngenta dataset contains 34,212 yield data points from 182 soybean varieties. The data were gathered between 2008 and 2014 from more than 350 site-year scenarios. The key variables included in the dataset describe weather conditions, soil characteristics, and soybean variety. The following list describes the variables used:

Variety Yield (Y) - This is the response variable measured in bushels per acre. There are 182 vari-

eties of soybeans and their yields are considered in this research as the response variables;

Latitude (X1) - Latitude of the farm site from which data was collected;

Longitude (X2) - Longitude of the farm site from which data was collected;

Area (X3) - Probability of growing soybeans in the nearby area of the farm site;

RM_25 (X4) - Probability of growing soybean varieties with relative maturity similar to the ones

considered in this research in the target farm;

TOT_IRR_DE (X5) - Probability of field irrigation in the nearby area of the farm site;

Soil_Cube (X6) - Soil type based on texture, available water holding capacity, and soil drainage;

Temp (X7) - Sum of daily temperatures in degrees Celsius at the farm site between April 1st and October 31st for the year in which yield is measured;

TEMP_MED (X8) - Median of the seasonal temperature sums between 2001 and 2014;

PREC (X9) - Sum of precipitation at the farm site between April 1st and October 31st for the year in which yield is measured;

PREC_MED (X10) - Median of the seasonal precipitation sums between 2001 and 2014;

RAD (X11) - Sum of daily solar radiation, in watts per square meter, between April 1st and October 31st for the year in which yield is measured;

RAD_MED (X12) - Median of the seasonal solar radiation sums between 2001 and 2014;

CONUS_PH (X13) - Topsoil (10 to 20 cm depth) pH. The source of this data, along with other

variables starting with ‘CONUS’, is the Earth System Science Center in the College of Earth and

Mineral Sciences at the Pennsylvania State University;

CONUS_AWC (X14) - Topsoil (10 to 20 cm depth) Available water capacity in 150 cm soil pro-

file;

CONUS_CLAY (X15) - Topsoil (10 to 20 cm depth) clay content in percentage;

CONUS_SILT (X16) - Topsoil (10 to 20 cm depth) silt content in percentage;

CONUS_SAND (X17) - Topsoil (10 to 20 cm depth) sand content in percentage;

ISRIC_PH (X18) - Soil (5-30 cm depth) pH. The source of this data, along with other variables starting with ‘ISRIC’, is ISRIC - World Soil Information, an independent science-based foundation;

ISRIC_CEC (X19) - Soil (5-30 cm depth) cation exchange capacity measured in centimoles per kilogram;

ISRIC_CLAY (X20) - Soil (5-30 cm depth) clay content in percentage;

ISRIC_SILT (X21) - Soil (5-30 cm depth) silt content in percentage;

ISRIC_SAND (X22) - Soil (5-30 cm depth) sand content in percentage;

EXTRACT_CE (X23) - Cation exchange capacity of soil. Indicates fertility of the soil;

SY_DENS (X24) - Indicates how often farmers grow soybeans at the target farm site;

SY_ACRES (X25) - Estimated acres in an area of 100 km²;

Cluster_Number (X26) - A categorical variable that groups nearby farms together.

Let p be the number of input variables (X1 … X26). Therefore, p = 26 in this research. Soybean ex-

perts at Syngenta consider all of these input variables as important predictors of soybean yield.

Our survey results also indicate that farmers consider how well a variety will grow in their soil.

Some of them also consider relative maturity of the crop as an important factor. Obviously,

weather-related variables are very important. However, none of the farmers mentioned systematically considering uncertainties in yield due to weather when deciding which soybean varieties to grow.

As a first step in this research, the latitude and longitude information was plotted on a map to see

from where the data were collected. Figure 2 shows that the data were primarily obtained from the

farms located in the mid-western states of the United States. The farm for which the optimization

is performed – referred to as the target farm – like several other farms is located in the state of

Iowa. Unlike other variables, latitude and longitude reveal the specific location only when they

are combined together. To assist the predictive analytics component of the research in identifying

the location information of farm sites, clusters of farm sites were desired. To create clusters, the K-means clustering algorithm (Hastie et al. 2001) was applied to the latitude and longitude data to produce two to one hundred clusters. Figure 3 shows the within-cluster sum of squares (variability) plot

for different numbers of clusters. Clusters with a small within-cluster sum of squares, i.e., with similar latitudes and longitudes, are desirable. It can be seen in Figure 3 that the within-cluster sum of squares levels off after an initial drop. Based on the within-cluster sum of squares, it was decided to group all of the farm sites into twenty clusters. The target farm, also referred to as the evaluation farm (shown in red in Figure 2), and the farm sites in green belong to the same cluster. All the other nineteen clusters are shown using the same color to make the target farm’s cluster readily visible. As a result of clustering, a new categorical variable, Cluster_Number (X26), was added to the list of variables utilized for modeling. It should be noted that, if appropriate, other similar variables can be included in this research without any considerable change in the methodology presented here.

Figure 2. Data gathering sites (Green and Red are in the same cluster)

Figure 3. Within Cluster Sum of Squares for Different Number of Clusters.
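
The clustering step described above can be illustrated with a small sketch. It assumes a plain Lloyd's-algorithm implementation of K-means, made-up farm coordinates, and latitude and longitude treated as planar coordinates. The helper names (`kmeans`, `nearest`, `wss_by_k`) are hypothetical, and the software actually used in this research may differ.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two (latitude, longitude) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels, within-cluster SS)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen sites
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    wss = sum(sq_dist(p, centers[j]) for p, j in zip(points, labels))
    return centers, labels, wss


# Made-up farm sites forming three geographic clumps (not Syngenta data).
sites = [
    (lat0 + 0.1 * i, lon0 + 0.1 * j)
    for lat0, lon0 in [(41.0, -93.0), (38.0, -90.0), (44.0, -97.0)]
    for i in range(3)
    for j in range(3)
]

# Within-cluster sum of squares for each candidate number of clusters.
wss_by_k = {k: kmeans(sites, k)[2] for k in range(1, 7)}

# Cluster label usable as a categorical predictor for a hypothetical target farm.
centers, labels, _ = kmeans(sites, 3)
target_cluster = nearest((41.05, -93.05), centers)
```

Plotting `wss_by_k` against the number of clusters gives the kind of curve shown in Figure 3. The number of clusters is then chosen where the curve levels off, and each site's label plays the role of Cluster_Number (X26).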

A preliminary descriptive analysis (Figure 4) shows that ‘V102’, ‘V115’, ‘V121’, ‘V103’, and ‘V99’

are the most frequently grown varieties in this dataset. Within the cluster of the targeted farm, ‘V121’,

‘V115’, ‘V103’, ‘V133’, and ‘V128’ are the most frequently grown varieties. Roughly, one-fourth of the

data are from just four varieties. This shows that the dataset is highly unbalanced and that many varieties have very few observations. Fifteen varieties have only one observation. Seventy-one varieties have fewer than twenty-five observations. In other words, these seventy-one varieties have less than one observation per predictor. In the predictive analytics part of the research, the data from these seventy-one varieties were combined to create one model for predicting the yield of these varieties. Therefore, a total of one hundred and twelve predictive models were created, i.e., one predictive model for each of the one hundred and eleven varieties with adequate data and a combined model for all of the remaining seventy-one varieties. Figure 5 shows the distribution of yield data. The yield appears to be approximately normally distributed with a mean yield slightly less than 60 bushels per acre. Based on this observation, the yield variable was retained as is, without any transformation.
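
The pooling rule described above (a dedicated model for each variety with enough data, and one combined model for the rest) can be sketched as follows. The threshold of twenty-five observations comes from the text. The function name, the `POOLED` key, and the toy labels are assumptions made only for illustration.

```python
from collections import Counter

MIN_OBS = 25       # from the text: fewer than 25 observations counts as too few
POOLED = "POOLED"  # hypothetical key for the single combined model


def model_key_by_variety(variety_labels, min_obs=MIN_OBS):
    """Map each variety to its own model key, or to the pooled key if data are scarce."""
    counts = Counter(variety_labels)
    return {v: (v if n >= min_obs else POOLED) for v, n in counts.items()}


# Toy labels, one per observation; the variety names and counts are made up.
labels = ["V1"] * 30 + ["V2"] * 25 + ["V3"] * 24 + ["V4"] * 1
keys = model_key_by_variety(labels)
n_models = len(set(keys.values()))

# The same bookkeeping with the counts reported in the text:
# 182 varieties, 71 of them pooled into one combined model.
n_models_paper = (182 - 71) + 1
```

With the counts reported in the text, this bookkeeping gives 111 dedicated models plus one pooled model, i.e., 112 models in total.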

The most detrimental weather events could wipe out the soybean crop. Some examples include early-season frost, early-season flooding (submergence), and very cold (0 to 5 °C) rain or sleet within the first 12 hours

of planting. Even though these weather conditions can potentially wipe out an entire crop, these are all

generally associated with early season events, and there is the potential for replanting. It is much worse to

have adverse weather conditions when the soybean crop is well established with five or six nodes. At that

stage, the crop could still be killed by frost or freezing temperatures, but the risk of such weather has likely already decreased to near zero by that time. Besides early-season weather events, droughts can be detrimental to the crop and its yield. Heat also causes problems and often occurs together with

drought. Even with the availability of rich historical weather data, the weather for a soybean season is

generally unpredictable several months in advance. Clearly, weather is an important factor that influences

the soybean yield. Temperature, precipitation, radiation, and their median values are the weather-related

variables included in the model building. The Syngenta dataset was processed to retain only the current season’s temperature (X7), precipitation (X9), radiation (X11), and their medians (X8, X10, X12) for each observation. While selecting the soybean varieties to grow, it is important to hedge against different possible weather conditions (X7, X9, and X11). For that purpose, the historical realizations of X7, X9, and X11 for the target farm’s cluster were extracted from the dataset, and the yield for each variety was simulated under 1000 of those X7, X9, and X11 weather scenarios.
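
One simple way to obtain 1000 weather scenarios from the historical realizations is to resample them with replacement. The text does not state how the 1000 scenarios were selected, so the resampling scheme below is an assumption, and the historical values and the `draw_scenarios` helper are made up for illustration.

```python
import random

# Made-up historical realizations for the target farm's cluster, one tuple per
# site-year: (X7 temperature sum, X9 precipitation sum, X11 radiation sum).
history = [
    (3350.0, 520.0, 41000.0),
    (3480.0, 610.0, 42500.0),
    (3290.0, 700.0, 39800.0),
    (3600.0, 430.0, 44100.0),
    (3420.0, 575.0, 42000.0),
]


def draw_scenarios(history, n=1000, seed=0):
    """Resample whole (X7, X9, X11) tuples with replacement."""
    rng = random.Random(seed)
    return rng.choices(history, k=n)


scenarios = draw_scenarios(history)
```

Drawing whole tuples, instead of sampling each variable separately, keeps the observed joint behavior of temperature, precipitation, and radiation intact in every scenario.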

Figure 4. Most Frequently Appearing Soybean Varieties in the Dataset.

Figure 5. Frequency Distribution of Soybean Yield for All the Sites in the Dataset.

4.2 Predictive analytics

With the advent of sophisticated data capture and storage technology, data-science-based decision-making

is gaining momentum among practitioners and academic researchers. Big data sets are becoming available

in many fields. Concurrently, statisticians, engineers, and computer scientists have been working on ma-

chine learning models that would learn from the data to make accurate predictions. In classical statistics,

the model form would be assumed before fitting it to the data – quite often, simple linear relationships are

assumed. In contrast, machine-learning models learn the structure or the model form algorithmically from

the data (Breiman 2001). In this approach, the model’s ability to predict accurately is valued more than

having an interpretable model structure. Generally speaking, machine-learning based modeling does not

produce generalizable insights and strategies that can be used for the systems beyond the one modeled.

Although such models are hard to interpret because of their highly non-linear structure, experts believe that the structure produced by machine learning is a better representation of the data. In turn, machine-learning

models represent the system being modeled more accurately than the models obtained from classical approaches with simple linear relationships. Nevertheless, solving managerial problems with an analytics framework and making a positive impact on operations management practice is valuable (Simchi-Levi 2017). To that end, this paper creates a novel analytics framework by integrating machine learning,

simulation, and optimization to solve an important problem that can contribute significantly to the variety

selection decisions.

We consider five well-known flexible machine-learning models – Regression Trees (RT), Random Forest

(RF), Boosted Trees (BT), Multivariate Adaptive Regression Splines (MARS), and Artificial Neural Net-

work (ANN) – as candidate machine learning methods to become part of our framework and predict soy-

bean yield (Y) based on X1 … X26. We chose these five machine learning models because they are flexible

and capable of fitting almost any practically observable structure in data. We do not consider parametric

models, such as Ordinary Linear Regression, because we do not want to impose a model structure, a priori, on the data set. The machine learning models considered in this paper are capable of fitting a linear structure on their own if the data drive them towards a linear structure (Hastie et al. 2001). Moreover, machine learning models like RF, BT, and ANN are widely reported to win the majority of modern-day prediction contests, such as those conducted by Kaggle and the Knowledge Discovery and Data Mining conferences.

Since the soybean-mix selection problem pertains to the optimization of allocation of varieties based on

their yield, it is incumbent on the machine learning to find a good prediction of soybean yield at a given

farm site for different possible weather scenarios. We choose the best predicting machine-learning model

by evaluating all of them on an out-of-sample data set. For a comprehensive explanation of the chosen

machine learning models, see Breiman et al. (1984), Breiman (1996, 2001), Friedman (1991, 1999a,

1999b), Hastie et al. (2001), Friedman and Stuetzle (1981), and Ripley (1996). Based on the results ob-

tained from these five models, we have selected RT models as our final model. We present a brief over-

view of RT model below. We have also included a brief review of other models in the appendix.

4.2.1 Regression Tree (RT)

The RT algorithm was developed by Breiman et al. (1984). This algorithm utilizes recursive binary splitting to unearth patterns in a high-dimensional space. RT divides the data space into many mutually exclusive subsets (S1, S2, …, SM). The responses within a subset are similar to each other. The resulting RT model T predicts the response Y with a parameter, cm (m = 1,…, M):

$$T(X) = \sum_{m=1}^{M} c_m \, I\{(X_1, X_2, \ldots, X_{26}) \in S_m\}, \qquad (1)$$

where I (.) is an indicator function. Let β = {Sm, cm; m = 1,…, M} be the vector of parameters of the RT

model. β is estimated by minimizing the following function

$$\min \; \sum_{m=1}^{M} \sum_{x_i \in S_m} L(y_i, c_m) \qquad (2)$$

Traditionally, L(.) is the sum of squared errors (SSE) loss. For SSE loss, the optimal cm is simply the average of Y within the subset Sm, as indicated below.

$$\hat{c}_m = \operatorname{Average}(y_i \mid x_i \in S_m). \qquad (3)$$

Refer to Breiman et al. (1984) for a detailed discussion on how to choose the splitting variables, split

points, and how to prune the fully grown tree for better predictions on a new data set. Following convention, the least squares splitting rule was used for branching decisions, with the minimum number of observations in each terminal node of the tree set to 5. The main advantage of the RT model is its easily interpretable, decision-tree-like structure, as shown in the hypothetical regression tree structure in Figure 6, even while allowing flexible non-linear relationships among the variables.

Figure 6. Hypothetical Regression Tree Structure.
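
The least squares splitting rule and the minimum terminal-node size can be illustrated on a single predictor. The sketch below fits a one-split tree (a stump) and is only a simplified illustration of Equations (1) to (3). The full RT algorithm of Breiman et al. (1984) applies such splits recursively over all predictors and then prunes the tree. The function names and the toy data are assumptions.

```python
def sse(values):
    """Sum of squared errors around the mean, i.e., the SSE loss of one subset."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys, min_leaf=5):
    """Best least-squares split on one predictor as (threshold, total SSE), or None."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for pos in range(min_leaf, len(xs) - min_leaf + 1):
        lo, hi = order[pos - 1], order[pos]
        if xs[lo] == xs[hi]:
            continue  # cannot place a threshold between equal values
        left = [ys[i] for i in order[:pos]]
        right = [ys[i] for i in order[pos:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = ((xs[lo] + xs[hi]) / 2, total)
    return best


def fit_stump(xs, ys, min_leaf=5):
    """One-split tree: returns (threshold, c_left, c_right) with subset means as c_m."""
    split = best_split(xs, ys, min_leaf)
    if split is None:
        raise ValueError("not enough data to split with the given min_leaf")
    threshold = split[0]
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return threshold, sum(left) / len(left), sum(right) / len(right)


# Toy data: seasonal precipitation (predictor) and yield in bushels per acre.
prec = [300, 320, 340, 360, 380, 400, 500, 520, 540, 560, 580, 600]
yld = [50, 51, 49, 50, 52, 48, 60, 61, 59, 60, 62, 58]
threshold, c_left, c_right = fit_stump(prec, yld)
```

On this toy data the split falls between the two precipitation groups, and the two terminal-node predictions are the subset averages, as in Equation (3).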

4.2.2 Machine Learning Results

As mentioned earlier, in classical statistics, one would judge model performance by a model-fit criterion such as R-squared, whereas in machine learning, models are evaluated by their prediction accuracy.

To evaluate the prediction accuracy of different machine learning models, there are three popular cross-

validation strategies available: validation set approach, leave one out cross-validation, and k-fold cross

validation. Each of these strategies has merits and demerits. For computational efficiency in model fitting

and evaluation, validation set approach is preferred over the other two strategies (James et al. 2001). We

utilize the validation set approach for model evaluation purposes by splitting the data set into training and

test data. Machine learning models were built using the training data and evaluated based on their predic-

tion accuracy in the test data. We have randomly chosen 80% of the Syngenta data for training and the

rest for testing. As indicated earlier, the yield of a soybean variety (Y) is measured in bushels per acre. The average yield for the farms near the target farm site is about 56 bushels per acre. Root Mean Squared Error (RMSE) was chosen as the measure to evaluate the accuracy of the machine learning models. Unlike many other model accuracy measures, RMSE is expressed in the same units as the yield (Y). From a practitioner’s standpoint, RMSE readily quantifies how precise the machine learning models are because it is easy to interpret. Moreover, we prefer RMSE over other popular accuracy measures like Mallows’ Cp, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and adjusted R2 because we are not concerned about including unnecessary predictors in our model (James et al. 2001). The soybean experts at Syngenta agree that the input variables included in this research are important in predicting yield.
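
The validation set approach and the RMSE measure can be sketched as follows. The 80/20 proportion comes from the text. The helper names, the random seed, and the example numbers are illustrative assumptions.

```python
import math
import random


def train_test_split(rows, train_frac=0.8, seed=0):
    """Randomly split rows into training and test sets (validation set approach)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the response."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


rows = list(range(100))  # stand-in for observation indices
train, test = train_test_split(rows)

# Two made-up test yields and predictions, in bushels per acre.
example_rmse = rmse([56.0, 60.0], [53.0, 64.0])
```

A model is fit on `train` only, and its RMSE on `test` is what is compared across candidate models. Because RMSE is in bushels per acre, it can be read directly against the average yield of about 56 bushels per acre.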

Table 3: Performance of Machine Learning Models

| Model | RT | RF | BT | MARS | ANN |
|---|---|---|---|---|---|
| RMSE (bushels per acre) | 6.6 | 6.9 | 6.8 | 7.8 | 8.0 |

Table 3 shows RMSE values achieved by each of the machine learning models considered in this

work. It should be noted that, at the stage of evaluating each of the machine learning algorithms listed in the table, we built only one model with all of the varieties in it, i.e., we used an extra variable X27 to input the soybean variety. Once we identified the best machine-learning algorithm for this research, we built dedicated models for each soybean variety except for the seventy-one varieties that did not have

enough data. According to the results in Table 3, RT, RF, and BT – all tree-based methods – achieved

similar accuracy. RT seems to be slightly better than the other two. MARS and ANN are not as good as

the tree-based algorithms for the Syngenta data. Apart from being accurate, RTs have practical and easily interpretable structures, which make them a viable method for extracting important knowledge about the interplay between yield, weather conditions, and soil conditions. We chose RT as the winning

machine learning model and applied it to the Syngenta dataset to develop tree structures for each variety

separately based on weather scenarios and soil conditions of the farm. Then KDE was used to estimate

continuous probability distributions of soybean yield at the terminal nodes of the RT models. In classical

prediction problems, the mean of the responses at the terminal node is used as the predicted response. In this research, instead of using the mean at a terminal node as the predicted response, we simulate responses from KDEs fitted at the terminal nodes of the RT models. Depending on the weather scenario and the other predictor variables, an RT model is traversed to reach a terminal node, and the soybean yield is then simulated from the KDE at that node. Simulations informed by the trees are a better representation of the actual system and are more efficient to execute than traditional simulation models. In this research, 1000 scenarios of weather conditions at the target farm site, together with the other information about the target farm, are used to traverse the RT models to simulate yields for the farm site, from which the mean, variance, and covariance of yields can be computed and fed as the input to the portfolio optimization formulation. A comprehensive description of all the components of the framework, including KDE, is provided in the following sections. It should be noted that, at a higher level, Sundaramoorthi et al. (2009), Sundaramoorthi et al. (2012), and Sundaramoorthi (2014) have done a similar integration of KDE with RT. Readers familiar with those implementations of KDE in RT models can skip subsection 4.3.

4.3 Kernel Density Estimate (KDE)

An important consideration in creating this framework was to keep it user-friendly for practitioners. The methodology described in this paper should be quickly reproducible even when the underlying data change. Syngenta, like any other agribusiness, continuously tests and collects data on soybean varieties. The work done in this research and the conclusions arrived at are valid only as long as the data remain the same. When the data set is augmented with new observations, practitioners would be burdened if they were expected to repeat any of the steps in the framework manually. When the data change, new RT models can easily be fit using R. However, it is not practical to manually inspect the yield distributions at the terminal nodes of the RT models and estimate their parameters from a set of candidate theoretical probability distributions. Tree models built from the Syngenta dataset had yield distributions at the terminal nodes that resembled uniform distributions except at the tails. Uniform distributions would therefore give reasonable approximations of the yield distribution in all of the terminal nodes. From a practitioner’s standpoint, however, when the data are appended or replaced, it would be a pain point to check for changes in the shape of the distribution and re-estimate its parameters. To ease this issue, kernel density estimates, which can approximate a distribution of any shape, are used at the terminal nodes. KDEs are similar to empirical distributions except that KDEs are continuous and do not ‘step’ like empirical distributions. As a result, the simulation can produce yield values that are observed in the historical data as well as yield values that are similar to those in the terminal node but not observed in the historical data. By using KDEs at the terminal nodes, practitioners avoid having to visually check for changes in the shape of the yield distribution. They also avoid manually re-estimating the parameters of the distribution when the shape remains the same but the parameters change. The simulation code written in this research automatically refits KDEs at the terminal nodes of any RT model without any input from the user, whereas fitting parametric distributions would require practitioners to visually inspect changes in the shapes of the distributions. It should be noted that Sundaramoorthi et al. (2009), Sundaramoorthi et al. (2012), and Sundaramoorthi (2014) report other distributions, such as right-skewed and normal distributions, at the terminal nodes of RT models. We cannot simply hope that uniform distributions would reappear when the data change. For a practitioner, using KDE is a substantial time-saver when the dataset is updated. For these reasons, fitting KDEs at the terminal nodes and simulating the yield from those KDEs is the preferred approach.

Based on predictors, one would traverse the tree branches and reach the terminal node where we use KDE to estimate the probability density function of the soybean yield (Y). Assume there are n(j) soybean yield data $y_1, \ldots, y_{n(j)}$ available in the terminal node j. Let K(.) be a kernel function. Then the kernel density estimator $\hat{f}_{j,h}(y)$ at any point y is defined as shown in the following function (Silverman 1978):

$$\hat{f}_{j,h}(y) = \frac{1}{h \times n(j)} \sum_{i=1}^{n(j)} K\left(\frac{y_i - y}{h}\right) \qquad (12)$$

where h is the bandwidth, which limits the size of the neighborhood and hence the number of soybean yield observations around y that influence the estimate of the density at y. If the neighborhood is wide, many dissimilar soybean yield observations influence the density estimate at y, resulting in a smoother fit for the density function. A narrower neighborhood, on the other hand, results in a rougher fit that follows the observed data around y more closely. The size of the neighborhood thus determines a tradeoff between smoothness and accuracy of the kernel density estimate. For some applications, smoothness would be preferred over accuracy. However, smoothness of the density estimates is not of great concern in this work. We used the Sheather and Jones plug-in (SJPI) estimate as the initial value of h. In the KDE community, SJPI is considered a good estimate of the bandwidth (Sheather and Jones 1991; Jones et al. 1996; Sheather 2004). Regardless, our implementation of KDE automatically tunes the neighborhood size based on an accuracy criterion discussed in the following subsections. We estimated KDEs with SJPI neighborhoods in each terminal node of our RT models. A typical KDE with Gaussian and triangular kernels at one of the terminal nodes is shown in Figure 7.

Figure 7. Kernel Density Estimates (Solid-Gaussian, and Dotted-Triangular).
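To make Equation (12) concrete, the following is a minimal Python sketch of the estimator with a Gaussian and a triangular kernel. The yield values, the bandwidth of 2.0, and all function names are hypothetical and chosen only for illustration; the paper's own implementation used SAS for the SJPI bandwidths and a C++ simulation program.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, one possible choice for K(.) in Equation (12)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def triangular_kernel(u):
    """Triangular kernel supported on [-1, 1]."""
    return max(0.0, 1.0 - abs(u))

def kde(y, yields, h, kernel=gaussian_kernel):
    """Evaluate the kernel density estimate of Equation (12) at the point y."""
    n = len(yields)
    return sum(kernel((y_i - y) / h) for y_i in yields) / (h * n)

# Hypothetical yields (bushels per acre) observed at one terminal node.
node_yields = [48.2, 50.1, 51.7, 52.4, 53.0, 55.6, 57.3]
density_at_52 = kde(52.0, node_yields, h=2.0)
```

A larger h spreads each observation's contribution over a wider neighborhood and gives a smoother curve; a smaller h keeps the estimate closer to the observed values.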

4.3.1 Selection of Kernel

Gaussian and triangular kernels were chosen in this research because of their popularity and the ease with which samples can be drawn from them. SAS® was used to estimate the SJPI neighborhood size at each terminal node of the RT models (Sheather and Jones 1991). At each terminal node, we generated 10,000 samples of soybean yield from the Gaussian and triangular kernel density estimates and compared them with the actual data at that terminal node. To make the comparison, we created four intervals, i.e., (0, M/2], (M/2, M], (M, 1.5M], and (1.5M, ∞), where M is the median of the actual data in the terminal node. In classical statistics, Q-Q plots would be preferred for such comparisons because they compare the sampled realizations with the actual data by plotting the quantiles of the two distributions against each other. However, we deliberately avoided Q-Q plots because a practitioner would have to visually check the plots at each terminal node to see whether the sampled realizations and the actual data are similar. Moreover, such a visual approach cannot be used to determine the neighborhood sizes as explained in the next subsection. Therefore, we used the four intervals to compare the sampled data with the actual data. These four intervals also allowed us to automate the selection of the neighborhood size (bandwidth).

In total, we had 1,519 terminal nodes in 112 RT models. We sampled ten thousand soybean yield realizations from the Gaussian and triangular kernels at each of the 1,519 terminal nodes. At every terminal node, we compared the fraction of the sampled soybean yield in each of the four ranges with the fraction of the actual data in that range. The kernel whose sampled fraction was closer to the fraction of the actual data in an interval was declared the winner of that interval and awarded one point. If both kernels produced the same fraction of sampled soybean yield in an interval, they were declared tied and each received half a point. A kernel is considered the winner of a terminal node if it earns more than two of the four available points at that terminal node. Both kernels are declared tied at a terminal node if they have two points each. Results from these contests are shown in Table 4.

Table 4. Performance of Gaussian and triangular kernels (all 112 trees, JR = 1519)

                     Gaussian   Triangular   Tie
Range I wins         20         18           1481
Range II wins        803        699          17
Range III wins       803        697          19
Range IV wins        3          4            1512
% wins               26.8%      23.4%        49.8%
Ter. node wins       790        685          44
% Ter. node wins     52%        45%          3%

As mentioned earlier, there were a total of 1,519 terminal nodes in the 112 trees built. Results in Table 4 show that the Gaussian kernel performs better than the triangular kernel. Across all interval-level contests (four at each of the 1,519 terminal nodes), the Gaussian kernel performed at least as well as the triangular kernel in 76.6% of the contests (26.8% wins plus 49.8% ties). Overall, the Gaussian kernel seems to be a better choice than the triangular kernel for simulating soybean yield in our RT models.
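The contest between the two kernels at a single terminal node can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the yields and the bandwidth are hypothetical, draws at or below zero are counted in the first interval, the triangular kernel is taken to have support [-h, h], and two kernels that are equally close to the actual fraction are treated as tied.

```python
import random
import statistics

def interval_fractions(values, median):
    """Fractions of values in (0, M/2], (M/2, M], (M, 1.5M], (1.5M, inf).

    Assumption: draws at or below zero are counted in the first interval.
    """
    counts = [0, 0, 0, 0]
    for v in values:
        if v <= median / 2:
            counts[0] += 1
        elif v <= median:
            counts[1] += 1
        elif v <= 1.5 * median:
            counts[2] += 1
        else:
            counts[3] += 1
    return [c / len(values) for c in counts]

def sample_kde(yields, h, kernel, size, rng):
    """Draw from a KDE: pick an observed yield, then add kernel noise scaled by h."""
    draws = []
    for _ in range(size):
        center = rng.choice(yields)
        if kernel == "gaussian":
            noise = rng.gauss(0.0, 1.0)
        else:  # triangular kernel on [-1, 1] with its peak at 0
            noise = rng.triangular(-1.0, 1.0, 0.0)
        draws.append(center + h * noise)
    return draws

def score_node(yields, h, size=10_000, seed=0):
    """Return the points each kernel earns over the four intervals of one node."""
    rng = random.Random(seed)
    median = statistics.median(yields)
    actual = interval_fractions(yields, median)
    errors = {}
    for kernel in ("gaussian", "triangular"):
        sampled = interval_fractions(sample_kde(yields, h, kernel, size, rng), median)
        errors[kernel] = [abs(s - a) for s, a in zip(sampled, actual)]
    points = {"gaussian": 0.0, "triangular": 0.0}
    for g_err, t_err in zip(errors["gaussian"], errors["triangular"]):
        if g_err < t_err:
            points["gaussian"] += 1.0
        elif t_err < g_err:
            points["triangular"] += 1.0
        else:  # equally close to the actual fraction: half a point each
            points["gaussian"] += 0.5
            points["triangular"] += 0.5
    return points

# Hypothetical yields (bushels per acre) at one terminal node.
node_yields = [48.2, 50.1, 51.7, 52.4, 53.0, 55.6, 57.3]
points = score_node(node_yields, h=2.0)
```

The four points available at a node are always fully distributed, so a kernel with more than two points wins the node and two points each is a tie.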

4.3.2 Determination of Bandwidth

According to Epanechnikov (1969) and Silverman (1978), having the right neighborhood size is more important for the accuracy of the estimation than choosing an appropriate kernel. As mentioned before, a smaller bandwidth leads to a rough but precise estimate of the yield distribution, whereas a larger bandwidth results in lower accuracy but a smoother fit. Any choice of bandwidth, including the SJPI bandwidth estimation method (Sheather and Jones 1991), is a compromise between smoothness and accuracy of the estimated density. As mentioned earlier, we used SJPI bandwidths as the starting bandwidth values at each terminal node. One can make the neighborhood wider or narrower by increasing or decreasing the SJPI bandwidth. The data used in this research were collected over six years and comprise more than thirty-four thousand observations. We consider this data set a good representation of the reality of soybean yield, i.e., the data set captures the intricate yield dynamics of soybean varieties. Therefore, we prefer a less smooth density estimate that reflects the data more accurately. To achieve that, if the proportion of sampled yield in any of the four intervals, (0, M/2], (M/2, M], (M, 1.5M], and (1.5M, ∞), was not within 1% of the true fraction of the data, the bandwidth was iteratively decreased by one until the criterion was met or the bandwidth was almost zero and could not be reduced any further. For example, the tenth terminal node of the RT model fitted for the ‘V102’ variety, as shown in Table 5, had soybean yield realizations that violated the 1% limit. After three iterations of bandwidth tuning, all four ranges had fractions within the limit. The bandwidth at this terminal node therefore changed from 4.08 to 1.08, which produces a more representative distribution of soybean yield even though it is no longer a smooth distribution. Another interesting observation in this research, unlike in Sundaramoorthi et al. (2009), Sundaramoorthi et al. (2012), and Sundaramoorthi (2014), is that several terminal nodes had no data in Range I and Range IV, which shows the homogeneous nature of soybean yield data at terminal nodes.

Table 5. Bandwidth Tuning for Terminal Node 10 of ‘V102’ Variety Tree.

Bandwidth tuning     Range       Simulated Fraction   Actual Fraction   Difference
Before (h = 4.08)    Range I     0                    0                 0
                     Range II    0.5314               0.520833          -0.010567
                     Range III   0.4686               0.479167          0.010567
                     Range IV    0                    0                 0
After (h = 1.08)     Range I     0                    0                 0
                     Range II    0.519200             0.520833          0.001633
                     Range III   0.480800             0.479167          -0.001633
                     Range IV    0                    0                 0
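The tuning rule illustrated in Table 5 can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: it assumes a Gaussian kernel, interprets "within 1%" as an absolute difference of at most 0.01 between the simulated and actual fractions (consistent with the differences reported in Table 5), and uses hypothetical yields and function names.

```python
import random
import statistics

def interval_fractions(values, median):
    """Fractions of values in (0, M/2], (M/2, M], (M, 1.5M], (1.5M, inf)."""
    counts = [0, 0, 0, 0]
    for v in values:
        if v <= median / 2:
            counts[0] += 1
        elif v <= median:
            counts[1] += 1
        elif v <= 1.5 * median:
            counts[2] += 1
        else:
            counts[3] += 1
    return [c / len(values) for c in counts]

def tune_bandwidth(yields, h_start, step=1.0, tol=0.01, size=10_000, seed=0):
    """Decrease h by `step` until all four fractions are within `tol`,
    or until h cannot be reduced further without reaching zero."""
    rng = random.Random(seed)
    median = statistics.median(yields)
    actual = interval_fractions(yields, median)
    h = h_start
    while True:
        # Sample from the Gaussian KDE: observed yield plus N(0, h^2) noise.
        draws = [rng.choice(yields) + h * rng.gauss(0.0, 1.0) for _ in range(size)]
        sampled = interval_fractions(draws, median)
        within_limit = all(abs(s - a) <= tol for s, a in zip(sampled, actual))
        if within_limit or h - step <= 0:
            return h
        h -= step

# Hypothetical yields (bushels per acre) at one terminal node.
node_yields = [48.2, 50.1, 51.7, 52.4, 53.0, 55.6, 57.3]
h_tuned = tune_bandwidth(node_yields, h_start=4.08)
```

Because the check relies on random draws, the bandwidth at which the loop stops can vary with the seed and the sample size.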

4.4 Data-driven Simulation Model

An important contribution of this research is the introduction of a general framework for forecasting soybean yield using a data-driven simulation method. If farmers decide to grow their own choices of varieties, the simulation can be used to simulate the yield under different weather scenarios. Broadly speaking, two essential questions drive the simulation: (1) What is the weather condition? (2) What is the soil type? The answer to the second question is readily available for the target farm. However, the answer to the first question is not the same from year to year. In general, farmers are aware of the weather patterns at their farm sites based on historical observations. However, the exact weather for an upcoming season is highly unlikely to be predictable well in advance. It is therefore important to hedge yield predictions against a variety of weather conditions (X7, X9, and X11). To that end, the historical realizations of X7, X9, and X11 for the targeted farm’s cluster were extracted from the dataset, and one thousand of those realizations were randomly chosen for hedging. A sample of these 1000 realizations of X7, X9, and X11 is shown in Table 6. These realizations were randomly chosen as a vector, i.e., the values in each row of Table 6 come from the same year. As a result, 1000 yield values for each variety of soybeans are simulated from KDEs by traversing the RT models and reaching the terminal nodes based on the one thousand weather scenarios and the other predictor values of the target farm.


Table 6. Sample Scenarios of X7, X9, and X11 from the target farm’s Cluster.

Scenarios   TEMP (X7)   PREC (X9)   RAD (X11)
1           3320.3      636.6       1053145
2           3206.3      786.2       1017325
3           3715.2      327.8       1144159
4           3372.4      636.4       1067932
5           3192.6      805.5       1006773
…           …           …           …
1000        3349.7      621.6       1059809
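Drawing weather scenarios as whole vectors, so that TEMP, PREC, and RAD in a scenario always come from the same historical record, can be sketched as follows. The six rows are the ones shown in Table 6 and merely stand in for the full cluster history; sampling with replacement and the function name are assumptions made for this illustration.

```python
import random

# (TEMP X7, PREC X9, RAD X11) rows shown in Table 6, used here as a
# stand-in for the full set of historical realizations in the cluster.
HISTORY = [
    (3320.3, 636.6, 1053145),
    (3206.3, 786.2, 1017325),
    (3715.2, 327.8, 1144159),
    (3372.4, 636.4, 1067932),
    (3192.6, 805.5, 1006773),
    (3349.7, 621.6, 1059809),
]

def draw_scenarios(history, n, seed=0):
    """Draw n scenarios row by row (with replacement, an assumption), so the
    three weather variables of a scenario are never mixed across records."""
    rng = random.Random(seed)
    return [rng.choice(history) for _ in range(n)]

scenarios = draw_scenarios(HISTORY, 1000)
```

Sampling rows rather than individual columns preserves the joint behavior of temperature, precipitation, and radiation within a season.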

To execute the simulation, the RT models of all of the varieties are traversed to reach terminal

nodes based on the values of X1, X2, … X26. At the terminal node, a random sample of the soybean

yield (y) is simulated from the KDE of that terminal node. The procedure of simulating the soy-

bean yield (y), shown in Algorithm 1, is repeated for all of the weather scenarios in Table 6.

Algorithm 1 Simulation procedure

For variety (j) 1 to 182 do the following
    For Simulation Scenario (i) 1 to 1000 do the following
        Step 1: Set the values for X7, X9, and X11 from Table 6 corresponding to the simulation scenario. Set the other variables X1, … X6, X8, X10, X12, … X26 to the values corresponding to the target farm. X1, … X6, X8, X10, X12, … X26 are the same for all i.
        Step 2: Traverse the RT model corresponding to the variety j and simulate soybean yield $\hat{y}_{ij}$ from the KDE at the terminal node reached.
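The two steps of Algorithm 1 can be sketched in Python with a toy tree. Everything in this sketch is hypothetical: the tree structure, split thresholds, leaf yields, bandwidths, and variety name are invented for illustration, and a Gaussian kernel is assumed. The authors' actual simulation was written in C++ on trees fitted in R.

```python
import random

# A hypothetical, tiny regression tree for one variety. Internal nodes split
# on one predictor; leaves hold the observed yields and a tuned bandwidth h.
TREE_V1 = {
    "feature": "X9", "threshold": 600.0,
    "left": {"yields": [41.0, 43.5, 44.2], "h": 1.0},
    "right": {
        "feature": "X7", "threshold": 3300.0,
        "left": {"yields": [52.1, 53.8, 55.0], "h": 1.2},
        "right": {"yields": [47.3, 49.9, 50.4], "h": 0.9},
    },
}

def find_leaf(node, x):
    """Traverse the tree with predictor values x until a terminal node is reached."""
    while "yields" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

def simulate_yield(tree, x, rng):
    """Step 2: sample one yield from the Gaussian KDE at the terminal node."""
    leaf = find_leaf(tree, x)
    return rng.choice(leaf["yields"]) + leaf["h"] * rng.gauss(0.0, 1.0)

def run_simulation(trees, scenarios, farm, seed=0):
    """Loop over varieties and weather scenarios as in Algorithm 1."""
    rng = random.Random(seed)
    results = {}
    for variety, tree in trees.items():
        # Step 1: farm values stay fixed; weather values come from the scenario.
        results[variety] = [simulate_yield(tree, {**farm, **s}, rng) for s in scenarios]
    return results

scenarios = [
    {"X7": 3320.3, "X9": 636.6, "X11": 1053145},
    {"X7": 3715.2, "X9": 327.8, "X11": 1144159},
]
farm = {"X1": 0.0}  # placeholder for the fixed, farm-specific predictors
simulated = run_simulation({"V1": TREE_V1}, scenarios, farm)
```

Each simulated yield requires only one tree traversal and one random draw, which is why the procedure stays fast even with many varieties and scenarios.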

For a practitioner to use this framework, it is important that the simulation is fast and efficient. For example, one may want to forecast the soybean yield only a few days before sowing the seed. The simulation model can assist them provided the run time is sufficiently short. The biggest tree built for the simulation has 110 terminal nodes. Simulating the soybean yield from the KDEs at these terminal nodes is efficient and data-driven because the RT models reflect the patterns in the actual data. Also, one should note that the simulation algorithm, listed in Algorithm 1, involves no subjective input. Therefore, this simulation avoids misrepresentation of the soybean yield because it is entirely driven by the machine learning models, which in turn were created from actual data gathered over a long period of time.

4.5 Simulation Results

A C++ program was written to simulate from the KDEs at the terminal nodes of the RT models. Even though we had already chosen the kernel function and bandwidths to obtain accurate simulations of soybean yield, we re-checked the accuracy of the simulation by comparing the yield simulated from the chosen kernel and bandwidths with the actual system. As part of the validation, the true soybean yields of variety ‘V115’ were compared with the simulated yields for the farm sites in the same cluster as the target farm. Seven farms in this cluster had grown ‘V115’. Of these seven, only four farms (farm sites 2240, 2241, 2245, and 2250) had more than two observations for ‘V115’. The comparison of actual and simulated yield for these four farms is shown in Figure 8. The figure shows the actual mean yield of ‘V115’ at these four farms plotted along with the mean, min, max, 5th percentile, and 95th percentile of the one thousand simulated ‘V115’ yields for these farms. On these curves, every marker represents a soybean yield value at one of the four farms. In the figure, we joined the soybean yield values, as if the farm sites were continuous, only to make the differences easier to see. It can be observed from the figure that the actual mean yields at the four farms lie between the 5th and 95th percentiles of the simulated soybean yield. In general, the simulated mean yield curve is close to the actual mean yield curve in magnitude; for two farms they coincide. The actual mean yield at farm site 2250 deviates a little more from the simulated mean than at the other three farms. While exploring the variability in the yield of ‘V115’ among these four farm sites, we observed that the variability of ‘V115’ yield within farm site 2250 was much higher, which explains the larger deviation between the simulated and actual means. Making these comparisons for all the varieties is enticing, but it is not practical to present all of them in this paper. In general, it is safe to conclude that the simulation encompasses the actual yield for all the varieties.

4.6 Prescriptive Analytics

The ultimate objective of this research is to prescribe up to N soybean varieties, where N is a number chosen by Syngenta, to be grown at the targeted farm site. The optimization model, which takes its input from the data-driven simulation of soybean yield, is formulated as shown below.


Figure 8. Comparison of Actual and Simulated Yield for Four Sites in the Same Cluster
[Plot: yield (vertical axis, 0 to 80) by farm site (2240, 2241, 2245, 2250); series: Sim Mean, Sim Min, Sim Max, Sim 95th, Sim 5th, Actual Mean]

Notations:

$E_i$: mean yield of variety $i$ estimated from the simulation

$\sigma_i^2$: yield variance of variety $i$ estimated from the simulation

$\sigma_{i,j}^2$: yield covariance of varieties $i$ and $j$ estimated from the simulation

$c_i$: the proportion of the land dedicated to grow variety $i$ (decision variable)

$I_i$: binary variables; $I_i = 1$ indicates variety $i$ is selected

In general, farmers are concerned about the risk associated with the yield at their farm. Our survey results show that they try to mitigate risk by growing more than one variety at the same time. However, they do not have a tool that systematically balances the risk against the expected yield. We assume that farmers want to mitigate the risk in yield that is attributable to the weather scenarios prevailing in their region. Therefore, the objective of the model is to minimize the risk associated with the yield at the farm while keeping the expected yield at or above the threshold yield m.

$$
\begin{aligned}
\min \quad & \sum_{i} c_i^2 \sigma_i^2 + 2 \sum_{i} \sum_{j>i} c_i c_j \sigma_{i,j}^2 \\
\text{s.t.} \quad & \sum_{i} c_i E_i \ge m \\
& \sum_{i} I_i \le N \\
& \sum_{i} c_i = 1 \\
& c_i \le I_i \quad \forall i \\
& c_i \ge 0 \quad \forall i \\
& I_i \in \{0, 1\} \quad \forall i
\end{aligned}
$$

The first constraint ensures that the expected yield of the optimal variety mix is at least the least acceptable expected yield m. The second constraint caps the number of varieties chosen at N (N = 5, a number suggested by Syngenta for the considered target farm). The third constraint requires that soybean is grown on the entire farmland allotted for soybean farming. The fourth constraint ensures that the proportion of a variety is zero when the variety is not chosen. Even though this formulation is similar to financial portfolio optimization, the binary variables make the problem hard to solve. We used Lingo 16 to solve this model. Lingo reported the solution as a local optimum. To check that the local solution does not differ from the global solution, we solved the model from multiple starting solutions. We used the Sobol´ quasirandom sequence, a low-discrepancy sequence, to generate 1000 starting solutions for ci. Sobol´ sequences emerged from number-theoretic methods that generate quasi-random numbers with space-filling properties. Unlike random numbers, these sequences are designed to fill the space as uniformly as possible. For a recent application of the Sobol´ sequence, see Yang et al. (2009). For the methodology, refer to Sobol´ (1967). Starting from these 1000 initial solutions for ci, we generated as many locally optimal solutions. All of the local solutions turned out to be the same, which suggests that the local solution is indeed the global solution. Lingo conservatively reports the solution to our optimization problem as a local solution because of the approximations involved. Figure 9 shows the efficient frontier, which is obtained by solving the optimization problem for different values of the minimum mean yield requirement m.
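For intuition, the formulation can be checked on a toy instance with a brute-force search in Python. This sketch swaps in exhaustive enumeration over a coarse allocation grid in place of the Lingo solver and Sobol´ starting points used in the paper; the three simulated yield series, the threshold m = 55, the cap of two varieties, and the grid resolution are all hypothetical. The binary variables are represented implicitly by counting the varieties with a positive share.

```python
import itertools
import statistics

def covariance_matrix(simulated):
    """Sample covariance matrix of the simulated yields, one list per variety."""
    means = [statistics.mean(s) for s in simulated]
    n = len(simulated[0])
    return [[sum((a - ma) * (b - mb) for a, b in zip(sa, sb)) / (n - 1)
             for sb, mb in zip(simulated, means)]
            for sa, ma in zip(simulated, means)]

def portfolio_risk(c, cov):
    """Objective: sum_i c_i^2 var_i + 2 sum_i sum_{j>i} c_i c_j cov_ij."""
    k = len(c)
    own = sum(c[i] ** 2 * cov[i][i] for i in range(k))
    cross = sum(c[i] * c[j] * cov[i][j] for i in range(k) for j in range(i + 1, k))
    return own + 2.0 * cross

def best_mix(means, cov, m, n_max, steps=20):
    """Exhaustive search over allocations in multiples of 1/steps."""
    best = None
    for units in itertools.product(range(steps + 1), repeat=len(means)):
        if sum(units) != steps:                      # all land is allocated
            continue
        if sum(1 for u in units if u > 0) > n_max:   # at most n_max varieties
            continue
        c = [u / steps for u in units]
        if sum(ci * e for ci, e in zip(c, means)) < m - 1e-9:  # yield threshold
            continue
        risk = portfolio_risk(c, cov)
        if best is None or risk < best[0]:
            best = (risk, c)
    return best

# Hypothetical simulated yields (bushels per acre) for three varieties.
simulated = [
    [50.0, 52.0, 48.0, 51.0, 49.0],
    [60.0, 70.0, 55.0, 65.0, 50.0],
    [56.0, 54.0, 58.0, 55.0, 57.0],
]
means = [statistics.mean(s) for s in simulated]
cov = covariance_matrix(simulated)
risk, mix = best_mix(means, cov, m=55.0, n_max=2)
```

Enumeration is only workable for a handful of varieties and a coarse grid; with 182 candidate varieties a dedicated solver, as used in the paper, is needed.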


Figure 9. Yield (m) in bushels per acre vs. Risk (Objective Value) in bushels per acre

As expected, when we increased m (represented as Mean Yield in the figure), the objective value, i.e., the risk at optimality measured in bushels per acre squared, also increased. When the expected value was required to be at least 45 bushels per acre, the optimal solution consisted of a mix of varieties with a negligible amount of risk. When the value of m was increased, riskier but higher-yielding varieties were chosen. As m increased, the share of high-yielding varieties (with higher risk) such as V129 and V128 in the optimal solution increased rapidly. From the survey responses, we know that on average farmers grow four varieties. Based on the historical expected yield, the top four high-yielding varieties (ignoring the risk) in the same cluster as the target farm are V68, V93, V10, and V112. If we grow these four varieties at the target farm by splitting the land equally, i.e., 25% of the land for each of the four varieties, the expected yield for the farm is only 55 bushels per acre with a risk of 5.28 bushels per acre. From Figure 9, we can see that one of the optimal portfolios of varieties has an expected yield of about 68 bushels per acre with a risk of less than 5 bushels per acre, i.e., about 13 bushels per acre more than the equal split. From the response to Question 3, we know that an average farmer grows soybeans on about 1,475 acres. Therefore, the loss is about 19,175 bushels for an average farmer. At a price of $9.25 per bushel, the monetary loss is $177,369 for an average farmer who responded to our survey. If they instead allocate 50% of the land to the highest-yielding variety and split the remaining land equally among the other three varieties, the expected yield increases to 57 bushels per acre with a small increase in the risk (5.29 bushels per acre). The monetary loss would then be $150,081. These are significant losses.
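The loss figures above follow from simple arithmetic, reproduced here as a short Python check. All numbers are taken from the text; the variable and function names are ours.

```python
ACRES = 1475              # average soybean acreage from the survey
PRICE_PER_BUSHEL = 9.25   # dollars per bushel
OPTIMAL_YIELD = 68        # bushels per acre, optimal portfolio read from Figure 9

def loss(rule_yield):
    """Bushels and dollars forgone relative to the optimal portfolio."""
    bushels = (OPTIMAL_YIELD - rule_yield) * ACRES
    return bushels, bushels * PRICE_PER_BUSHEL

equal_split = loss(55)      # 25% of the land for each of four varieties
weighted_split = loss(57)   # 50% for the top variety, rest split equally
```

Here `equal_split` evaluates to 19,175 bushels and $177,368.75, and `weighted_split` to 16,225 bushels and $150,081.25, matching the rounded figures in the text.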

6. Implementation

Syngenta won the Franz Edelman Award, an award presented by INFORMS (the Institute for Operations Research and the Management Sciences), for utilizing an analytics-driven strategy in plant breeding. They utilized simulation and stochastic optimization to choose the best breeding plans. Syngenta estimated savings of almost $300 million from using this tool. Since winning the Franz Edelman Award, Syngenta leadership has continued to bring analytics to other areas of their business. As a result, Syngenta and INFORMS hosted a crop challenge in analytics, which aimed at identifying proportions of up to five soybean varieties to be planted at a particular site, referred to as the Target farm or Evaluation farm, as described earlier. This shows that Syngenta is committed to utilizing analytics for determining the seed mix to be grown in specific regions. This step goes beyond plant breeding and makes the power of analytics tangible to an ordinary farmer who is confronted with the seed-mix selection decision. As a first step of our implementation, we have created a pilot web-based application, SimSOY, which enables a farmer to simulate the yield of any of the 182 soybean varieties for a specific farm. The farmer inputs all the farm-related data (location, soil type, irrigation, etc.) along with his/her choice of soybean variety in a simple user-friendly form on the web-app, as shown in Figure 10. Upon entering the data, the farmer clicks the “Simulate” button at the bottom of the form, which triggers the execution of SimSOY and displays the results obtained from the simulation. Behind the scenes, one thousand scenarios of the selected variety are simulated as per the steps in Algorithm 1. From the simulated yield, we calculate and display the mean yield, the risk (standard deviation) involved in growing that variety, and a 95% confidence interval for the yield. Figures 10 and 11 show the implementation of SimSOY as a web-application. The web-application can be accessed at http://ec2-54-191-101-197.us-west-2.compute.amazonaws.com:8080/. It should be noted that the optimization part of this research has not been included in the web-app yet. We wanted to keep the first version of the pilot web-app simple. We hope to continuously refine the app and include more features, such as optimization, in the next iteration of the web-application.


Figure 10. SimSOY - Input Form

Figure 11. SimSOY - Output


6. Conclusion

With the advent of Big Data, the field of data-driven analytics is broadening and gaining momentum. Traditional linear models are being challenged. As a result, new machine-learning algorithms and novel machine learning-driven decision-making frameworks, like the one in this research, are being created. For example, Ang et al. (2016) combined LASSO, a learning machinery, with queuing theory to improve wait time prediction in emergency rooms, Ozen et al. (2016) used machineries with hierarchical optimization to maximize operating room utilization and net profit at Mayo Clinic, and Ferreira et al. (2016) combined tree-based learning machinery with optimization to improve pricing decisions of an online fashion retailer. Among unpublished work, machine learning models that solve real-world Big Data challenges posed by contests, such as the annual Data Mining and Knowledge Discovery contest (KDD Cup) and several similar ones on the Kaggle website, are well known in the machine-learning community. In this paper, we have presented a framework that integrates machine learning, kernel density estimation, simulation, pseudo-random number generation, and optimization to help farmers make one of the most important decisions in farming: seed selection. We tested five different machine-learning models and found that the regression tree works best for the Syngenta data set. To simulate the yield, we fitted kernel density estimates at the terminal nodes of the tree models. The yield of each variety was simulated under different weather scenarios. From the simulated yield, the parameters needed for optimization were estimated. Finally, we formulated and solved an optimization problem to choose the mix of soybean varieties to be grown. We used the Sobol´ sequence to generate many locally optimal solutions to check whether the solution was the global solution. From a practical perspective, this research makes the following contributions:

We have introduced a novel data-driven decision-making framework that integrates several aspects of analytics. This method avoids misrepresentation of yield dynamics because it is entirely based on patterns learned from a real data set collected over a long period. Moreover, this approach reduces the decision-making state space by making intelligent use of the tree structures.

We have introduced SimSOY to simulate the yield of a soybean variety that a farmer wants to grow under different weather conditions. The simulation enables the farmer to experiment and assess the risk associated with growing a variety. This tool is available as a web-application.

The decision framework explicitly addresses the tradeoff between crop yield and its risk and enables farmers to optimally decide the mix of varieties they want to grow based on their risk-tolerance level.

The survey we conducted among farmers sheds light on their thought process behind allocating land to different varieties. Our research offers practical decision support tools that account for the most important tradeoffs that influence their decision-making. Based on the survey, we also identified various issues and factors that farmers would consider in the seed selection decision, which can give rise to a rich set of interesting questions that merit future research.

In sum, we have solved a difficult and important problem, which has great practical implications for

farmers. An average farmer can improve the yield by about nine percent by using the optimization results

from this paper instead of using simple rules to allocate land between the varieties grown. As a result,

they can increase their revenue by more than $150,000. The methodology introduced in this paper can be

applied to seed selection decisions for crops other than soybean. We believe that the analytics framework

introduced in this research, which integrates methodologies from machine learning, optimization, and

simulation, has great potential to improve the quality of seed selection decisions and ultimately improve

farmers’ crop yield. Our hope is that the operations management community actively embraces machine-

learning and data-driven decision-making methodologies to solve important problems in the agriculture

domain.

References

Allen, S. J., Schuster, E. W. 2004. Controlling the Risk for an Agricultural Harvest. Manufacturing Ser-

vice Oper. Management 6(3) 225–236.

Ang, E., Kwasnick, S., Bayati, M., Plambeck, E.L., Aratow, M. 2016. Accurate emergency department

wait time prediction. Manufacturing Service Oper. Management 18(1) 141–156

Bansal, S., Gutierrez, G., Keiser, J. 2017. Using Experts’ Noisy Quantile Judgments to Quantify Risks:

Theory and Application to Agribusiness, Operations Research, 65(5) 1115-1130.

Bansal, S., Nagarajan, M. 2017. Product Portfolio Management with Production Flexibility in Agribusi-

ness, Operations Research, 65(4) 914-930.

Barkley, A., Peterson. H. 2008. Wheat Variety Selection: An Application of Portfolio Theory to Improve

Returns. Proceedings of the NCCC-134 Conference on Applied Commodity Price Analysis, Forecast-

ing, and Market Risk Management. St. Louis, MO.

Barkley, A., Tack, J., Nalley, L.L., Bergtold, J., Bowden, R., Fritz, A. 2014. Weather, Disease, and Wheat

Breeding Effects on Kansas Wheat Varietal Yields, 1985 to 2011. Agronomy Journal 106 (1) 227 –

235.

Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. 1984. Classification and regression trees.

Wadsworth, Belmont, Ca.

Breiman, L. 1996. Bagging Predictors, Machine Learning 24 (2) 123-140.

Breiman, L. 2001a. Statistical Modeling: The Two Cultures, Statistical Science 16 (3) 199-231.

Breiman, L. 2001b. Random Forests, Machine Learning 45 (1) 5-32.

Challinor, A.J., Wheeler, T.R., Slingo, J.M., Craufurd, P.Q., Grimes, D.I.F. 2005. Simulation of crop

yields using ERA-40: limits to skill and nonstationarity in weather–yield relationships. Journal of Ap-

plied Meteorology 44 (4) 516–531.

Chen, Y. J., Shanthikumar, G. J., Shen, Z. M., 2015. Incentive for Peer-to-Peer Knowledge Sharing

among Farmers in Developing Economies. Production and Operations Management 24 (9) 1430 –

1440.

Cowles, A. 1938. Common-stock indexes - Monograph 3. Yale University Press, New Haven, CT.

Dixon, B.L., Hollinger, S.E., Garcia, P., Tirupattur, V. 1994. Estimating corn yield response models to

predict impacts of climate change. J. Agric. Resource Econ. 19 58–68.

Epanechnikov, V. A. 1969. Nonparametric estimation of a multivariate probability density.

Theory of Probability and its Applications 14 153-158.

Ferreira, K.J., Lee, B.H.A., Simchi-Levi, D. 2016. Analytics for an online retailer: Demand forecasting and

price optimization. Manufacturing Service Oper. Management 18(1) 69–88.

Figge, F. 2004. Bio-folio: Applying Portfolio Theory to Biodiversity. Biodiversity and Conservation 13

822-849.

Fischer, R. A. 1985. Number of kernels in wheat crops and the influence of solar radiation and tempera-

ture. J. Agric. Sci. 105 447–461.

Fisher, R. A. 1925. Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, UK.

Friedman, J. H. 1991. Multivariate Adaptive Regression Splines. The Annals of Statistics 19 (1) 1-67.

Friedman, J. H. 1999a. Greedy Function Approximation: A Gradient Boosting Machine. Technical Re-

port, Salford Systems, San Diego, CA.

Friedman, J. H. 1999b. Stochastic Gradient Boosting. Technical Report, Salford Systems, San Diego, CA.

Friedman, J. H., Stuetzle, W. 1981. Projection Pursuit Regression. Journal of the American Statistical As-

sociation 76 (376) 817 – 823.

Gijsman, A.J., Thornton, P.K., Hoogenboom, G. 2007. Using the WISE database to parameterize soil in-

puts for crop simulation models. Computers and Electronics in Agriculture 56 (2) 85–100.

Hansen, J.W., Indeje, M. 2004. Linking dynamic seasonal climate forecasts with crop simulation for

maize yield prediction in semi-arid Kenya. Agricultural and Forest Meteorology 125 (1–2) 143–157.

Hastie, T., Tibshirani, R., Friedman, J. H. 2001. The Elements of Statistical Learning: Data Mining, Infer-

ence, and Prediction. Springer-Verlag, NY.

Huh, W. T., Lall, U. 2013. Optimal Crop Choice, Irrigation Allocation, and the Impact of Contract Farm-

ing. Production and Operations Management 22 (5) 1126 - 1143

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Appli-

cations in R. Springer-Verlag, NY.

Jones, M. C., Marron, J. S., Sheather, S. J. 1996. A brief survey of bandwidth selection for density esti-

mation. Journal of American Statistical Association 91(433) 401-407.

Lintner, J. 1965a. The valuation of risk assets and the selection of risky investments in stock portfolios and

capital budgets. Rev. Econ. Statist. 47 13-37.

Lintner, J. 1965b. Security Prices, Risk, and Maximum Gains from Diversification. Journal of Finance

20 587–615.

Lobell, D.B., Burke, M.B. 2010. On the use of statistical models to predict crop yield responses to climate

change. Agric. Forest Meteorol. 150 (11) 1443–1452.

Markowitz, H. 1959. Portfolio Selection: Efficient Diversification of Investments. John Wiley and Sons,

NY.

Nalley, L.L., Barkley, A., Chumley, F. 2008. The impact of the Kansas wheat breeding program on wheat

yields, 1911–2005. J. Agric. Appl. Econ. 40 913–925.

Nyikal, R.A, Kosura, W.O. 2005. Risk Preference and Optimal Enterprise Combinations in Kahuro Divi-

sion of Murang’a District, Kenya. Agricultural Economics 32 131- 140.

Ozen, A., Marmor, Y., Rohleder, T., Balasubramanian, H., Huddleston, J., Huddleston, P. 2016. Optimization

and simulation of orthopedic spine surgery cases at Mayo Clinic. Manufacturing Service Oper.

Management 18(1) 157–175.

Ripley, B. D. 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge,

UK.

Robinson, LJ, Brake, J. 1979. Application of portfolio theory to farmer and lender behavior. Am. J. Agric.

Econ. 61(1) 158-164.

Schlenker, W., Roberts, M.J. 2006. Nonlinear effects of weather on corn yields. Rev. Agric. Econ 28

391–398.

Schlenker, W., Roberts, M.J. 2009. Nonlinear temperature effects indicate severe damages to U.S. crop

yields under climate change. Proc. Natl. Acad. Sci. 106 15594–15598.

Semenov, M.A., Wolf, J., Evans, L.G., Eckersten, H., Iglesias, A. 1996. Comparison of wheat simulation

models under climate change, II. Application of climate change scenarios. Climate Research 7 271–

281.

Sharpe, W. F. 1963. A simplified model for portfolio analysis. Management Science 9 277-93.

Sharpe, W. F. 1964. Capital asset prices: a theory of market equilibrium under conditions of risk. Journal

of Finance 19 425-42.

Sharpe, W. 1970. Portfolio Theory and Capital Markets. McGraw-Hill, New York.

Sheather, S. J. 2004. Density estimation. Statistical Science 19(4) 588-597.

Sheather, S. J., Jones, M. C. 1991. A reliable data-based bandwidth selection method for kernel density

estimation. Journal of Royal Statistical Society 53(3) 683-690.

Silverman, B. W. 1978. Choosing window width when estimating a density. Biometrika 65(1) 1-11.

Simchi-Levi, D. 2017. From the Editor. Management Science https://doi.org/10.1287/mnsc.2017.3019

Sobol´, I. M. 1967. The distribution of points in a cube and the approximate evaluation of integrals. USSR

Computational Mathematics and Mathematical Physics 7 86–112.

Sundaramoorthi, D., Chen, V.C., Rosenberger, J. M., Kim, S. B., Buckley-Behan, D. 2009. A data-inte-

grated simulation model to evaluate nurse-patient assignments. Health Care Management Science

12(3) 252–268.

Sundaramoorthi, D., Coult, A., Nguyen, D. 2012. A Data-Integrated Simulation of Financial Market Dy-

namics. Int. J. of Operations Research and Information Systems 3(3) 74–86.

Sundaramoorthi, D. 2014. A data-integrated simulation model to forecast ground-level ozone concentra-

tion. Annals of Operations Research 216(1) 53–69.

Tang, C. S., Wang, Y., Zhao, M. 2015. The Implications of Utilizing Market Information and Adopting

Agricultural Advice for Farmers in Developing Economies. Production and Operations Management

24 (8) 1197 - 1215

Tobin, J. 1958. Liquidity Preference as Behavior towards Risk. The Review of Economic Studies 25 65–

86.

Tubiello, F.N., Rosenzweig, C., Goldberg, R.A., Jagtap, S., Jones, J.W. 2002. Effects of climate change

on US crop production: Simulation results using two different GCM scenarios. Part I: Wheat, potato,

maize, and citrus. Climate Research 20 259–270.

Uspensky, J. V. 1937. Introduction to mathematical probability. McGraw-Hill, NY.

Williams, J. B. 1938. The theory of investment value. Harvard University Press, Cambridge, MA.

Yang, Z., Chen, V. C. P., Chang, M. E., Sattler, M. L., Wen, A. 2009. A Decision-Making Framework for

Ozone Pollution Control. Operations Research 57(2) 484–498.

Appendix

Random Forest (RF)

Ensemble methods often achieve better prediction accuracy than a single model. RF is an ensemble of many

RT models. In RF, many RT models are fitted by repeatedly sampling the training data with replacement.

Sampling repeatedly from the same data with replacement is called bootstrapping; fitting a model to each

bootstrap sample and averaging the predictions is called bootstrap aggregating, or bagging. The following

procedure represents the algorithm to grow an RF.

For b = 1 to B:

1. Randomly choose N observations of X and Y (with replacement) from the training dataset.

Call this dataset Θb.

2. Using Θb, grow an RT Tb.

Average the predictions from the B RTs to get the prediction of the RF model. The RF model

($\hat{f}$) is shown below.

$$\hat{f} = \text{Average}\left(\hat{T}_1, \hat{T}_2, \ldots, \hat{T}_B\right). \qquad (4)$$

In addition to bootstrapping the training data, each RT considers only a random subset of the p predictors

at each split in order to de-correlate the predictions of the trees. Following common practice, we use only

$\sqrt{26} \approx 5$ input variables for splitting at each node. For more details, refer to Breiman (1996)

and Breiman (2001b).
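
The bagging procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in this paper: to stay short it grows one-split trees (stumps) instead of full RTs, so the random predictor subset is drawn once per tree, which coincides with drawing it at the single split. The names fit_stump and fit_forest, the default of 50 trees, and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, features):
    """Fit a one-split regression tree using only the listed predictor indices."""
    best = None  # (sse, predictor index, threshold, left mean, right mean)
    for j in features:
        # Candidate thresholds: observed values except the largest,
        # so that both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            lm, rm = mean(left), mean(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # every chosen predictor is constant: predict the sample mean
        m = mean(y)
        return lambda row: m
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm


def fit_forest(X, y, n_trees=50, n_features=None, seed=0):
    """Bagging with a random predictor subset; returns the averaged predictor."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = n_features or max(1, round(p ** 0.5))  # about sqrt(p) predictors
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        feats = rng.sample(range(p), m)             # random subset of predictors
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return lambda row: mean(tree(row) for tree in trees)  # average of the B trees


# Toy data (hypothetical): the response depends only on the first predictor.
X = [[i / 20, (7 * i % 20) / 20, (3 * i % 20) / 20] for i in range(20)]
y = [10.0 if row[0] > 0.5 else 0.0 for row in X]
forest = fit_forest(X, y)
low, high = forest([0.1, 0.5, 0.5]), forest([0.9, 0.5, 0.5])
print(low, high)
```

Trees that do not draw the first predictor cannot separate the two query points, so the averaged prediction is pulled toward the overall mean; this is the price paid for de-correlating the trees.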

Boosted Trees (BT)

BT is another ensemble approach based on RT. In BT, a sequence of RTs is fitted to correct the prediction

errors made by the prior trees in the sequence. In a way, successive trees in BT boost the prediction

accuracy of the entire model. BT is represented below.

$$f_B(X) = \sum_{b=1}^{B} T(X; \Theta_b). \qquad (5)$$

Growing the bth tree to minimize the squared-error loss is represented below.

$$L\left(y_i, f_{b-1}(X) + T(X; \Theta_b)\right) = \left(y_i - f_{b-1}(X) - T(X; \Theta_b)\right)^2 = \left(e_{ib} - T(X; \Theta_b)\right)^2, \qquad (6)$$

where $e_{ib} = y_i - f_{b-1}(X)$ is the prediction error left by the prior trees in the sequence.

As indicated earlier, the primary goal of the algorithm is to grow a new tree in the sequence which best

fits the prediction error of the model at that step. For a comprehensive review of the algorithm, refer to

Friedman (1999a, 1999b) and Hastie et al. (2001).
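
The idea of fitting each new tree to the current prediction error can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in this paper: it uses a single predictor, one-split trees, no shrinkage or subsampling, and starts from a zero model. The names fit_stump and boost and the toy data are hypothetical.

```python
from statistics import mean


def fit_stump(x, r):
    """Best one-split regression tree on a single predictor x for targets r.

    Assumes x contains at least two distinct values.
    """
    best = None  # (sse, threshold, left mean, right mean)
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = mean(left), mean(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, n_trees=20):
    """Fit trees sequentially; tree b is fitted to the residuals e_ib."""
    trees = []
    pred = [0.0] * len(y)  # starting model: zero everywhere (an assumption)
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # e_ib = y_i - f_{b-1}(x_i)
        tree = fit_stump(x, resid)                    # minimizes (e_ib - T)^2
        trees.append(tree)
        pred = [pi + tree(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(tree(xi) for tree in trees)  # sum of trees, as in (5)


# Toy data (hypothetical): a smooth curve that a single stump fits poorly.
x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]


def train_sse(model):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y))


one_tree, many_trees = boost(x, y, n_trees=1), boost(x, y, n_trees=20)
print(train_sse(one_tree), train_sse(many_trees))
```

On the training data, each added stump cannot increase the SSE, because predicting the mean residual on each side of any split is at least as good as adding nothing.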

Multivariate Adaptive Regression Splines (MARS)

In a sense, MARS is similar to stepwise linear regression (Friedman 1991). Stepwise linear regression

adds predictors sequentially based on a criterion like SSE. In MARS, piecewise linear functions or inter-

action terms of those functions are sequentially added to improve the fitting criterion like SSE. These are

called basis functions. The basis functions take the form (Xj – k)+ and (k – Xj)+, where the subscript “+”

denotes the positive part of the function.

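Therefore,

$$(X_j - k)_+ = \begin{cases} X_j - k, & \text{if } X_j > k \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

and, similarly,

$$(k - X_j)_+ = \begin{cases} k - X_j, & \text{if } X_j < k \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$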

The direction of the basis function changes at k. This point is called a knot. In theory, there are infinitely

many candidate knot points in any problem. However, the MARS algorithm considers only the observed values of

Xj in the training data as candidates for knots. If there are N observations in the training data with N unique

realizations of each Xj, then there are 2Np possible basis functions in total (two reflected functions for each

of the N knots of each of the p predictors). Let C be the set of all of these basis functions. The MARS model is

shown below.

$$f(X) = \beta_0 + \sum_{b=1}^{B} \beta_b M_b(X), \qquad (9)$$

where Mb(X) is a basis function in C or a product of basis functions in C. As in linear regression,

given a choice of the Mb(X), the coefficients βb can be estimated by minimizing the Sum of Squared Errors (SSE).

The interesting and challenging part of MARS model construction is the choice of Mb(X) at each step. The

model-building process starts with the constant function M0(X) = 1. In the next step, for example, a function

of the form β11 (Xj – k)+ + β12 (k – Xj)+ would be considered. Since this function is multiplied by the constant

function from the previous step, one ends up with the same function.
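
The basis functions in (7) and (8), the candidate set C, and the evaluation of a model of the form (9) can be illustrated with a short sketch. It does not implement the forward selection of terms; the coefficients, knots, and the names hinge, candidate_basis, and mars_predict are hypothetical and chosen only for illustration.

```python
def hinge(x, k, sign):
    """Reflected-pair basis function: (x - k)_+ if sign is +1, (k - x)_+ if sign is -1."""
    return max(0.0, sign * (x - k))


def candidate_basis(X):
    """Enumerate the set C: one reflected pair per predictor j and observed knot k."""
    p = len(X[0])
    return [(j, k, sign)
            for j in range(p)
            for k in sorted({row[j] for row in X})
            for sign in (+1, -1)]


def mars_predict(row, beta0, terms):
    """Evaluate f(X) = beta0 + sum_b beta_b * M_b(X).

    Each term is (beta_b, factors), where factors is a list of (j, k, sign)
    triples and M_b is the product of the corresponding hinge functions.
    """
    total = beta0
    for beta_b, factors in terms:
        m_b = 1.0
        for j, k, sign in factors:
            m_b *= hinge(row[j], k, sign)
        total += beta_b * m_b
    return total


# Toy data (hypothetical): N = 3 observations, p = 2 predictors, all values unique.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
C = candidate_basis(X)
print(len(C))  # 2 * N * p candidate basis functions

# A hypothetical fitted model with one main-effect term and one interaction term.
terms = [
    (2.0, [(0, 2.0, +1)]),                   # 2.0 * (X0 - 2)_+
    (0.5, [(0, 2.0, -1), (1, 15.0, +1)]),    # 0.5 * (2 - X0)_+ * (X1 - 15)_+
]
f_a = mars_predict([3.0, 30.0], 1.0, terms)
f_b = mars_predict([1.0, 20.0], 1.0, terms)
print(f_a, f_b)
```

The two functions of a reflected pair are never both positive, and their difference recovers the linear term Xj – k, which is why a pair is added together at each step.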

Artificial Neural Network (ANN)

ANN has become one of the most popular methods for regression as well as classification. There are

infinitely many possible architectures for ANN. If one manages to find a suitable ANN architecture for

the underlying data, ANN can outperform other models in terms of prediction accuracy. Like BT and

RF, ANN is highly non-linear and difficult to interpret. It can be represented as a network diagram with an

input layer, an output layer, and at least one hidden layer. We discuss and use a model with one hidden

layer in this work. A typical ANN model for regression with one hidden layer is shown in Figure 11.

Nodes, or neurons, in the hidden layer, represented by H1, …, HB, are formed by applying a function to linear

combinations of the inputs. In function form, this is represented below.

$$H_b = \sigma\left(\alpha_{b0} + \alpha_b^T X\right), \quad b = 1, \ldots, B \qquad (11)$$

Figure 11. Network diagram of an ANN with one hidden layer (inputs X1, X2, …, XP; hidden nodes H1, H2, …, HB; output Yj).

The response Yj is formed as a linear combination of the hidden nodes Hb, as shown below.

$$Y_j = \beta_0 + \beta^T H. \qquad (12)$$

Sigmoid and Gaussian radial basis functions are common choices for the function σ. If one chooses an

identity function for σ, the model reduces to a linear model in the inputs. More generally, an ANN with one

hidden layer is closely related to what is called the projection pursuit regression model in statistics

(Friedman and Stuetzle 1981). Equations (11) and (12) involve the parameters

$\beta_0, \beta_1, \ldots, \beta_B, \alpha_{10}, \alpha_{11}, \ldots, \alpha_{1P}, \alpha_{20}, \alpha_{21}, \ldots, \alpha_{2P}, \ldots, \alpha_{B0}, \alpha_{B1}, \ldots, \alpha_{BP}$. These parameters are referred to as

weights. Minimization of the Sum of Squared Errors (SSE) is used to estimate these weights. The minimization

is done using gradient descent, which is referred to as back-propagation in ANN. The gradient is

computed by a forward and a backward sweep over the network. The general idea of the process is as follows:

in the forward sweep, the parameters are fixed and the predictions are made using equations (11) and (12). In

the backward sweep, the errors in predictions are back-propagated to update the weights. For details, see

Ripley (1996).
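
The forward and backward sweeps can be made concrete with a small sketch. It assumes a sigmoid activation, a single response, and weights updated one observation at a time (a stochastic variant of gradient descent); the learning rate, number of hidden nodes, initialization range, toy data, and the names forward, sse, and train are hypothetical choices for illustration, not the settings used in this paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, alpha, beta):
    """Forward sweep: hidden nodes from the inputs, then the response from the hidden nodes.

    alpha[b] = [alpha_b0, alpha_b1, ..., alpha_bP]; beta = [beta_0, beta_1, ..., beta_B].
    """
    h = [sigmoid(a[0] + sum(w * xj for w, xj in zip(a[1:], x))) for a in alpha]
    y_hat = beta[0] + sum(w * hb for w, hb in zip(beta[1:], h))
    return h, y_hat


def sse(X, y, alpha, beta):
    return sum((yi - forward(x, alpha, beta)[1]) ** 2 for x, yi in zip(X, y))


def train(X, y, n_hidden=3, lr=0.05, epochs=2000, seed=0):
    """Estimate the weights by gradient descent on the squared error."""
    rng = random.Random(seed)
    p = len(X[0])
    alpha = [[rng.uniform(-0.5, 0.5) for _ in range(p + 1)] for _ in range(n_hidden)]
    beta = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, yi in zip(X, y):
            h, y_hat = forward(x, alpha, beta)  # forward sweep with weights fixed
            err = y_hat - yi                    # gradient of 0.5 * err^2 w.r.t. y_hat
            # Backward sweep: propagate the error to the hidden-layer weights first,
            # using the output weights as they were during the forward sweep.
            for b in range(n_hidden):
                delta = err * beta[b + 1] * h[b] * (1.0 - h[b])
                alpha[b][0] -= lr * delta
                for j in range(p):
                    alpha[b][j + 1] -= lr * delta * x[j]
            # Then update the output-layer weights.
            beta[0] -= lr * err
            for b in range(n_hidden):
                beta[b + 1] -= lr * err * h[b]
    return alpha, beta


# Toy data (hypothetical): one input and a smooth non-linear response.
X = [[i / 10] for i in range(11)]
y = [x[0] ** 2 for x in X]
alpha_init, beta_init = train(X, y, epochs=0)  # the random starting weights
alpha_fit, beta_fit = train(X, y, epochs=2000)
print(sse(X, y, alpha_init, beta_init), sse(X, y, alpha_fit, beta_fit))
```

Because the same seed is used in both calls, the first call returns exactly the starting weights of the second, so the two printed values compare the SSE before and after training.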