
OPTIMIZATION OF A FED-BATCH BIOREACTOR USING SIMULATION-BASED APPROACH

    Catalina Valencia Peroni ∗ Jay H. Lee ∗∗,1

    Niket S. Kaisare ∗∗

∗ Universitat Rovira i Virgili, Tarragona, Catalunya, Spain
∗∗ School of Chemical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, U.S.A.

Abstract: We use a simulation-based approach to find the optimal feeding strategy for cloned invertase expression in Saccharomyces cerevisiae in a fed-batch bioreactor. The optimal strategy maximizes the productivity and minimizes the fermentation time. The procedure is motivated by the Neuro-Dynamic Programming (NDP) literature, wherein the optimal solution is parameterized in the form of a cost-to-go or profit-to-go function. The proposed approach uses simulations from a heuristic feeding policy as a starting point to generate profit-to-go vs. state data. An artificial neural network is used to obtain the profit-to-go as a function of the system state. Iterations of the Bellman equation are used to improve the profit function. The profit-to-go function thus obtained is then implemented in an online controller, which essentially converts the infinite-horizon problem into an equivalent one-step-ahead problem.

    Keywords: Fed-batch Reactor, Optimal Control, Neuro-Dynamic Programming

    1. INTRODUCTION

A vast majority of industrially important fermentors are operated in fed-batch mode, which involves addition of substrates continuously to an otherwise batch reactor. Fed-batch reactors are especially useful when growth or metabolite production follows substrate or product inhibition kinetics. In such cases, substrates need to be added in a controlled manner in order to maximize the productivity with respect to the desired product. For example, if product formation is inhibited under excess substrate conditions, low substrate concentrations are desired for high product yields. At the same time, higher substrate levels are required to prevent starvation of cells and to maintain high growth rates. Thus, there exists an optimum feed profile that maximizes the productivity of the process.

1 The author to whom correspondence should be addressed. Email: [email protected]

The determination of the optimal feed profile is a challenging problem, as bioreactors follow complex nonlinear dynamics. The resulting optimal control problem is a nonlinear programming (NLP) problem accompanied by various input and state constraints. The optimization is often non-convex and the global optimum is difficult to achieve (Banga et al., 1997).

Many authors have used Pontryagin’s maximum principle to solve the optimal control problem. However, this approach may be very difficult for such problems; so several alternative solution techniques have been proposed. Cuthrell and Biegler (1989) solved the optimal control problem of a fed-batch penicillin reactor using successive quadratic programming (SQP) based on orthogonal collocation on finite elements. Luus (1993) used iterative dynamic programming (IDP) to find the optimal feed profile for the same reactor; while Banga et al. (1997) presented a fast stochastic dynamic optimization method — called integrated controlled random search for dynamic systems, ICRS/DS — for this reactor. Recently, Bonvin et al. (2002) provided a good review of dynamic optimization methods for batch processes, and also presented algorithms that achieve nearly optimal performance in the presence of uncertainties.

In this paper, optimal control of a fed-batch fermentor for cloned invertase expression in Saccharomyces cerevisiae is considered. The expression of the enzyme is repressed at high glucose concentration. Hence, a fed-batch reactor is ideal for this process (Patkar and Seo, 1992). Patkar et al. (1993) proposed a model for this fermentation process, involving a set of four coupled ODEs. They used a first-order conjugate gradient method to find the optimal feed rate profile for this system. Later, Chaudhuri and Modak (1998) incorporated a neural network model into the generalized reduced gradient method for the same optimization problem.

The main disadvantage of the above methods is that the fermentation ending time must be fixed a priori in both cases. In order to find the optimal fermentation ending time, several different ending times must be guessed and, for each one of them, a separate productivity optimization problem must be solved. This is extremely computationally demanding. Methods such as control parameterization may be used for free-end-time problems to obtain open-loop optimal policies. Another drawback of these methods stems from the fact that they are open-loop optimal, which means that each time an initial condition changes, a new and different optimization problem must be solved. Besides, the resulting fixed policies do not take into account possible process disturbances.

Dynamic Programming (DP) is an optimization method that can be used to overcome these limitations. It was introduced by Bellman and coworkers (Bellman, 1957) as a feasible approach to solving the dynamic optimization that results from an optimal control problem. Here, the aim is to find the optimal ‘cost-to-go’ function, which can be used to parameterize the optimal solution as a function of the system state. This method is promising as it presents a feasible approach to solving any optimal control problem. However, it suffers from the ‘curse of dimensionality’, which refers to the exponential increase in computational cost with the state dimension.

Recently, Neuro-Dynamic Programming (NDP) was proposed as a way to alleviate the curse of dimensionality (Bertsekas and Tsitsiklis, 1996). NDP uses simulated process data obtained under suboptimal policies to fit an approximate cost-to-go function – usually by fitting artificial neural networks, hence the name. The initial approximate cost-to-go function is further improved by an iteration procedure based on the so-called Bellman equation. Closely related to NDP are the methods in the AI literature collectively classified as Reinforcement Learning (RL; Sutton and Barto, 1998). We applied this approach to online control of a continuous bioreactor (Kaisare et al., 2002) as well as of a benchmark Van der Vusse reactor (Lee and Lee, 2001). In this work, the simulation-based NDP method is applied to optimal control of the fed-batch Saccharomyces cerevisiae fermentor. We consider a deterministic optimization problem, i.e., the model is known and full state feedback is assumed; the method can be extended to the stochastic case. Moreover, simulation-based NDP methods do not require an explicit model to be known—methods like Q-learning (Watkins and Dayan, 1992), developed in the RL community, are model-free stochastic learning techniques.

2. STATEMENT OF THE OPTIMAL CONTROL PROBLEM

Our objective is to find an optimal feed profile µ that adapts itself when initial conditions change or when disturbances occur. The optimal policy is a function of the system state, represented as

\mu(x) = \arg\max_{u,\, t_f} \{\, \text{productivity} - \lambda \cdot t_f \,\} \qquad (1)

subject to relevant input and state constraints and the system dynamics. λ is a positive constant that penalizes the invertase fermentation time.

In general, the performance index is mathematically represented as the sum of stage-wise costs incurred (or rewards obtained) until the end of the horizon.

J(x_k) = \sum_{i=k}^{t_f} \phi(x_i, u_i) + \bar{\phi}_t(x_{t_f}) \qquad (2)

    subject to

Path constraint: g_i(x_i, u_i) \geq 0
Model constraint: \dot{x} = f(x, u)

Here φ is the stage-wise cost/reward and φ̄_t is the terminal cost/reward. The path constraints include all input and state constraints. The system dynamics appear as model constraints. For a discrete-time system, the model f(x, u) can be integrated over one time step; equivalently, the model constraint becomes x_{k+1} = f_h(x_k, u_k).
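As an illustration of this discretization step, the sketch below wraps a generic continuous-time model into the one-step map f_h by holding the input constant over one sampling interval. This is only a minimal sketch under our own assumptions (a right-hand side f(x, u) and a sampling interval dt); the paper does not prescribe a particular integration scheme.

```python
import numpy as np
from scipy.integrate import solve_ivp

def make_discrete_model(f, dt):
    """Wrap a continuous model x_dot = f(x, u) into a one-step map
    x_{k+1} = f_h(x_k, u_k), holding u constant over one sampling interval."""
    def f_h(x_k, u_k):
        sol = solve_ivp(lambda t, x: f(x, u_k), (0.0, dt),
                        np.asarray(x_k, dtype=float), rtol=1e-8)
        return sol.y[:, -1]
    return f_h
```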

The above problem is an infinite-horizon control problem, as we wish to solve free terminal time optimal control (i.e., t_f is not fixed).

3. NEURO-DYNAMIC PROGRAMMING

We begin this section with the description of Dynamic Programming (DP), which is an elegant way to solve the previously introduced optimization problem. In this method, the process is modeled as a chain of consecutive transitions from one state to another. The way each transition is made depends on the decision variable, which has an associated cost or reward. The objective of dynamic programming is to maximize the total profit, or minimize the total cost, obtained from the transitions needed to reach the final desired process state from the initial process state.

DP involves stage-wise calculation of the profit-to-go function 2 to arrive at the solution, not just for a specific x_0 but for a general x_0. Thus, the objective function is split into two parts — the one-stage reward obtained in going from x_k to x_{k+1}, and the total future rewards as a function of x_{k+1}. The latter is parameterized as the profit-to-go function, which represents the “desirability” of state x_{k+1}. Using the profit-to-go function, the multi-stage optimization problem is cast into an equivalent single-stage optimization that is solved online.

\max_{u_k} \{\, \phi(x_k, u_k) + J^*(x_{k+1}) \,\} \qquad (3)

where the optimal profit-to-go function at each stage is defined as

J^*(x_{k+1}) = \max_{u} \left\{ \sum_{i=k+1}^{t_f} \phi(x_i, u_i) + \bar{\phi}_t(x_{t_f}) \right\} \qquad (4)

The crucial step in DP is the calculation of the profit-to-go function J^*(x). This involves stage-wise evaluation of the rewards obtained in all possible transitions from each state in the state space. Without going into further details, we mention here that this approach is seldom practically feasible due to the exponential growth of the computation with respect to the state dimension. This is referred to as the ‘curse of dimensionality’, which must be removed in order for this approach to find widespread use.

An alternative method, which uses simulations to overcome the curse of dimensionality, involves computation of an approximation of the profit-to-go function. Exhaustive sampling of the state space can be avoided by identifying relevant regions of the space through simulation under judiciously chosen suboptimal policies. The policy improvement theorem states that the new policy defined by µ(x) = arg max_u {φ(x, u) + J^i(f_h(x, u))} is an improvement over the original policy (Howard, 1960). Indeed, when the new policy is as good as the original policy, the above equation becomes the same as the Bellman optimality equation (5).

2 It is customary in DP to use cost-to-go for minimization problems. We use profit-to-go as we solve a maximization problem; cost-to-go and profit-to-go are exactly equivalent.

J^\infty(x) = \max_u \{\, \phi(x, u) + J^\infty(f_h(x, u)) \,\} \qquad (5)

Equivalently, for a free-end-time batch optimization problem, the above equation can be formulated as

J^\infty(x) = \max\left[\, \bar{\phi}_t(x),\; \max_u \{\, \phi(x, u) + J^\infty(f_h(x, u)) \,\} \,\right] \qquad (6)

where the modification accounts for termination of the batch at a state x. In other words, if φ̄_t(x) is greater than the second term, the batch should be terminated.

Use of the Bellman equation to obtain iterative improvement of the cost-to-go approximator forms the crux of methods like Neuro-Dynamic Programming (NDP) (Bertsekas and Tsitsiklis, 1996), Reinforcement Learning (RL) (Sutton and Barto, 1998), and Temporal Difference learning (Tsitsiklis and Van Roy, 1997).

In this paper, this basic idea from the NDP and RL literature is used to obtain optimal performance of a fed-batch bioreactor. Relevant regions of the state space are identified through simulations of several heuristic policies, and an initial suboptimal profit-to-go function is calculated from the simulation data. A function approximator is used to interpolate between these data points; a neural network is the chosen function approximator (hence the name ‘Neuro’ in NDP). Improvement is obtained through iterations of the Bellman equation (7). When the iterations converge, the offline-computed profit-to-go approximation is used for online optimal control calculations for the reactor.

    3.1 Algorithm

A detailed algorithm was presented in our previous work (Kaisare et al., 2002) and is reproduced in this section. The following notation is used in this section and the rest of the paper. J represents profit-to-go values. A function approximation relating J to the corresponding state x is denoted as J̃(x). Superscript (·)^i represents the iteration index for the cost iteration loop, and k is the discrete time index. Finally, J̃(k) ≡ J̃(x(k)) and φ(k) ≡ φ(x(k), u(k)).

The simulation-based approach involves computation of the converged profit-to-go approximation offline. The following steps describe the general procedure for the infinite-horizon profit-to-go approximation.

(1) Perform simulations of the process with chosen suboptimal policies under all representative operating conditions.

(2) Using the simulation data, calculate the ∞-horizon profit-to-go for each state visited during the simulation. For example, each closed-loop simulation yields data x(0), x(1), . . . , x(t_f), where t_f is the termination time for the specific suboptimal policy. For each of these points, compute the profit-to-go value.

(3) Fit a neural network or other function approximator to the data to approximate the profit-to-go function — denoted as J̃^0(x) — as a smooth function of the states.

(4) To improve the approximation, perform the following iteration (referred to as the cost or value iteration) until convergence:
    • With the current profit-to-go approximation J̃^i(x), calculate J^{i+1}(k) for the given sample points of x by solving
      J^{i+1} = \max\left[\, \bar{\phi}_t(x_k),\; \max_{u_k} \{\, \phi(x_k, u_k) + \tilde{J}^i(f_h(x_k, u_k)) \,\} \,\right] \qquad (7)
      which is based on the Bellman equation.
    • Fit an improved profit-to-go approximator J̃^{i+1} to the x vs. J^{i+1}(x) data.
(5) A policy update may sometimes be necessary to increase the coverage of the state space. In this case, additional suboptimal simulations with the updated policy (arg max_u {φ + J̃^i}) are used to increase the coverage or the number of data points in certain regions of the state space.

Once the value iteration described above converges, the resulting profit-to-go function can be used for online control. At each time step, state estimates are obtained (for the deterministic problem, we have the state itself), and the single-stage optimization problem of Eq. (3) is solved using the converged profit-to-go approximation in place of the optimal J^*. A minimal sketch of the offline value iteration and the resulting one-step-ahead controller is given below.
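The sketch below is our own illustration, not the authors’ code. It assumes a discrete one-step model f_h, a stage reward phi, a terminal reward phi_t, a finite grid of candidate feed rates u_grid, and a generic regression routine fit_approximator that returns a callable J̃(x); all of these names are placeholders.

```python
import numpy as np

def value_iteration(states, J0, f_h, phi, phi_t, u_grid, fit_approximator,
                    max_iter=20, tol=0.2):
    """Improve a profit-to-go approximation by sweeping the free-end-time
    Bellman update (Eq. 7) over the sampled states until the change is small."""
    J_tilde = J0  # initial approximation fitted to heuristic-policy data
    for _ in range(max_iter):
        targets = []
        for x in states:
            # Best one-step continuation value under the current approximation
            cont = max(phi(x, u) + J_tilde(f_h(x, u)) for u in u_grid)
            # Free-end-time form: compare continuing with stopping now
            targets.append(max(phi_t(x), cont))
        J_new = fit_approximator(states, np.asarray(targets))
        change = sum(abs(J_new(x) - J_tilde(x)) for x in states)  # 1-norm change
        J_tilde = J_new
        if change < tol:
            break
    return J_tilde

def one_step_controller(x, J_tilde, f_h, phi, u_grid):
    """Online control: solve the single-stage problem of Eq. (3) at state x."""
    return max(u_grid, key=lambda u: phi(x, u) + J_tilde(f_h(x, u)))
```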

4. INVERTASE PRODUCTION OPTIMIZATION

The fermentation process considered here consists of production of invertase in Saccharomyces cerevisiae utilizing glucose for growth. Patkar and Seo (1992) reported the fermentation kinetics of invertase production in fed-batch cultures, as well as experimental data for cell density (c, expressed as optical density OD), glucose concentration (s, g/L) and specific invertase activity (i) obtained with various glucose feeding strategies in a 1.2 liter bioreactor. The productivity of the reactor at any time is given by

    productivity = icV (8)

Thus, the optimization problem defined in Eq. (1) becomes

\max_{u,\, t_f} \{\, i c V|_{t_f} - \lambda t_f \,\} \qquad (9)

with constraints u ∈ [0, 0.2722] and V ≤ 1.2. The model equations are summarized in Appendix A.

    4.1 Obtaining profit-to-go function

The first step in simulation-based NDP optimization is to obtain a suboptimal profit-to-go function. This was done through simulations of a set of suboptimal heuristic policies. The heuristics were selected so that the 4-dimensional state space was sufficiently covered. The policies involved maintaining the reactor at the initial volume V(0) for a certain amount of time t_i and then increasing the feed rate until the end of the fermentation or until constraints were met (if V_max is reached before t_f, the feed rate is reduced to u = 0 and the reactor is operated in batch mode until t = t_f). Mathematically, the feed rate profiles followed were

u(t;\, t_i, b) = \begin{cases} 0 & \text{if } t < t_i \\ 0.02\,\left(1 + b\,(t - t_i)^{2.2}\right) & \text{if } t \geq t_i \end{cases}

We used four values of b = [0.05, 0.07, 0.1, 0.13] and nine values of t_i = [1, 2, . . . , 9] to generate 36 different heuristic policies. Each of these policies was implemented for three different initial volumes, V(0) = [0.4, 0.6, 0.8]. A sketch of such a heuristic feed-rate policy is shown below.
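A minimal sketch of the heuristic feed policy described above; the interpretation of the exponent as applying to (t − t_i) and the explicit clipping to the input bound u ≤ 0.2722 are our assumptions.

```python
def heuristic_feed_rate(t, t_i, b, u_max=0.2722):
    """Heuristic feed profile: batch phase until t_i, then a growing feed rate.
    The feed is additionally set to zero once V reaches V_max (handled by the
    simulation loop, not shown here)."""
    if t < t_i:
        return 0.0
    return min(0.02 * (1.0 + b * (t - t_i) ** 2.2), u_max)  # clip to input bound

# The 36 heuristic policies from the grid of b and t_i values in the text
b_values = [0.05, 0.07, 0.1, 0.13]
t_i_values = range(1, 10)
policies = [(b, t_i) for b in b_values for t_i in t_i_values]
```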

For each of the heuristic policies, the optimal ending time was determined as the time that yielded the maximum profit value calculated according to Eq. (A.5). The productivity thus calculated gives the terminal reward function φ̄_t = icV|_{t_f}. The stage-wise reward is given by φ(x_k) = −λ∆t_k, a penalty on the elapsed time. Thus, for each of the states corresponding to a specific heuristic policy, the profit-to-go function is calculated as

J(x_k) = i c V|_{t_f} - \lambda \cdot (t_f - t_k)

where t_k = k·∆t is the “current” time for x_k.
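As an illustration, the profit-to-go targets for one heuristic-policy run can be assembled as below; traj_states, traj_times and terminal_productivity are placeholder names for the simulated trajectory and its terminal value icV|_{t_f}.

```python
def profit_to_go_targets(traj_states, traj_times, t_f, terminal_productivity, lam=0.3):
    """For every state x_k visited in one run, pair it with
    J(x_k) = icV|_{t_f} - lam*(t_f - t_k)."""
    return [(x_k, terminal_productivity - lam * (t_f - t_k))
            for x_k, t_k in zip(traj_states, traj_times)]
```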

We obtained a total of 9328 states and corresponding profit-to-go values through simulation of the heuristic policies. Next, a function approximator was used to correlate the profit-to-go as a function of the system state. A backpropagation neural network was used that had 4 input nodes (corresponding to the 4 state variables), 1 output node (profit-to-go) and two hidden layers with 17 and 5 nodes, respectively. We denote this approximation as J̃^0.
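A minimal sketch of fitting such a 4-17-5-1 network, using scikit-learn’s MLPRegressor as a stand-in for the backpropagation network; the paper does not specify the training software, so the solver, input scaling and iteration limit here are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def fit_profit_to_go(X, y):
    """Fit J~(x): a feedforward net with two hidden layers (17 and 5 nodes)
    mapping the 4-dimensional state to the scalar profit-to-go.
    X: (n_samples, 4) array of states, y: (n_samples,) profit-to-go values."""
    scaler = StandardScaler().fit(X)                 # scaling choice is ours
    net = MLPRegressor(hidden_layer_sizes=(17, 5), max_iter=5000, random_state=0)
    net.fit(scaler.transform(X), y)
    return lambda x: float(net.predict(scaler.transform(np.atleast_2d(x)))[0])
```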

Policy              Profit   icV|tf   tf
Patkar (1993)        3.70     7.30    12
Chaudhuri (1998)     3.50     7.10    12
Best heuristic       3.72     7.23    11.7
NDP                  3.80     7.25    11.5

Table 1. Optimal control results for V(0) = 0.6.

Given this initial approximation of the profit function, the Bellman equation was used iteratively to improve the optimality of the profit-to-go approximation. For each of the 9328 data points, the following equation was solved:

\tilde{J}^{i+1} = \max\left[\, i c V|_{x_k},\; \max_{u_k} \{\, -\lambda \Delta t_k + \tilde{J}^i(x_{k+1}) \,\} \,\right]

Here, superscript i represents the iteration index, and ∆t represents the time step, which was taken to be constant and equal to 0.1 hours. Value iterations were performed as elaborated in Section 3.1. Three iterations were required for the profit function to converge with a 1-norm change of less than 0.2. The best neural network structures for the iterations were 4-17-5-1, 4-17-5-1 and 4-13-5-1, respectively.

Simulations using the converged profit-to-go approximator resulted in visits to regions of the state space previously unvisited by the heuristic controllers. Hence, a policy update was performed by including these unvisited states, and value iteration of the Bellman equation was performed again. It took just one iteration for the profit function to converge. The neural network structure was found to be 4-15-5-1, and this J̃^4 was used for control.

    5. RESULTS

The fourth trained neural network was implemented in the online optimal controller. The reactor was started with V(0) = 0.6. This was the case solved by Patkar et al. (1993) and Chaudhuri and Modak (1998). The results are shown in Table 1. State-space plots of the online controller performance are shown in Fig. 1. It can be seen from the figure that the optimal policy results from interpolation in the state space and not in the policy space.

The controller was then tested with a different initial volume, V(0) = 0.5. The results are shown in the second column of Table 2. This condition was not “seen” before by the function approximator. Next, we tested the controller in the presence of an unknown disturbance: abrupt cell death occurs at 9 h, resulting in a 50% decrease in cell concentration. The NDP method still gives the best performance among the compared methods. The best heuristic reported above is the heuristic that gave the maximum profit value. It should be noted that we found different heuristics to be the best for different V(0) values. Thus, the optimal controller does not select any particular heuristic policy; instead it “patches” the solutions of various heuristics to arrive at a different, optimal policy.

Policy                  Profit (V0 = 0.5)   Profit (Cell death)
Patkar et al. (1993)         3.74                 0.62
Best heuristic               3.58                 1.88
NDP                          4.06                 1.97

Table 2. Control results for a different V0 and the unknown disturbance case.

    6. CONCLUSIONS

A simulation-based Neuro-Dynamic Programming (NDP) strategy was applied to obtain the optimal feeding profile, for different initial conditions, for invertase production in a fed-batch bioreactor. Simulations of suboptimal heuristic laws are used to identify relevant regions of the state space and to initialize the profit-to-go function approximation. The profit-to-go approximator is then improved by performing iterations of the Bellman equation over only the relevant regions of the state space. The profit-to-go function thus obtained was then used for online optimal control. This method gives nearly optimal performance for different initial conditions without requiring the profit-to-go function to be recomputed. The method therefore holds promise for controlling fed-batch reactors in the presence of disturbances.

    7. REFERENCES

Banga, J. R., A. A. Alonso and R. P. Singh (1997). Stochastic dynamic optimization of batch and semicontinuous bioprocesses. Biotechnology Progress 13, 326–335.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, New Jersey.

Bertsekas, D. P. and J. N. Tsitsiklis (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

Bonvin, D., B. Srinivasan and D. Ruppen (2002). Dynamic optimization in the batch chemical industry. In: Chemical Process Control–VI. pp. 255–273.

Chaudhuri, B. and J. M. Modak (1998). Optimization of fed-batch bioreactor using neural network model. Bioprocess Engineering 19, 71–79.

Cuthrell, J. E. and L. T. Biegler (1989). Simultaneous optimization and solution methods for batch reactor control profiles. Computers and Chemical Engineering 13, 49–62.

Howard, R. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, Massachusetts.

Fig. 1. State-space plot showing online performance for the three cases V0 = 0.4 (□), V0 = 0.6 (◦), and V0 = 0.8 (×). Dots represent the original data from heuristic feeding policies.

Kaisare, N. S., J. M. Lee and J. H. Lee (2002). Simulation based strategy for nonlinear optimal control: Application to a microbial cell reactor. Int. J. of Robust and Nonlinear Control, accepted for publication.

Lee, J. M. and J. H. Lee (2001). Neuro-dynamic programming method for MPC. In: DYCOPS VI. pp. 157–162.

Luus, R. (1993). Optimization of fed-batch fermentors by iterative dynamic programming. Biotechnology and Bioengineering 41, 599–602.

Patkar, A. and J. H. Seo (1992). Fermentation kinetics of recombinant yeast in batch and fed-batch cultures. Biotechnology and Bioengineering 40, 103–109.

Patkar, A., J. H. Seo and H. C. Lim (1993). Modeling and optimization of cloned invertase expression in Saccharomyces cerevisiae. Biotechnology and Bioengineering 41, 1066–1074.

Sutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts.

Tsitsiklis, J. N. and B. Van Roy (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42, 674–690.

Watkins, C. J. C. H. and P. Dayan (1992). Q-learning. Machine Learning 8, 279–292.

    Appendix A. FED-BATCH REACTOR MODEL

    Mass balance equations

\frac{d(cV)}{dt} = (R_r Y_{cr} + R_f Y_{cf})\, cV \qquad (A.1)
\frac{d(sV)}{dt} = u s_f - R_t\, cV \qquad (A.2)
\frac{d(icV)}{dt} = (\pi - k_d i)\, cV \qquad (A.3)
\dot{V} = u \qquad (A.4)

where u is the feed rate, and c (OD), s (g/L) and i (units/OD/L) represent the concentrations of cell, substrate and invertase, respectively. The rate expressions are

R_r = \frac{0.55\, s}{0.05 + s}, \qquad R_t = \max\left\{ \frac{1.25\, s}{0.95 + s},\; R_r \right\}

\pi = \frac{6.25\, s}{0.1 + s + 2 s^2}, \qquad R_f = R_t - R_r

Y_{cr} = 0.6, \quad Y_{cf} = 0.15, \quad k_d = 1.85.

Operating conditions: c(0) = 0.15, s(0) = 5, i(0) = 0.1, s_f = 10, V(0) ∈ {0.4, 0.5, 0.6, 0.7, 0.8}.
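For reference, a minimal sketch of this model as an ODE right-hand side; integrating in the extensive variables (cV, sV, icV, V) keeps the code close to Eqs. (A.1)–(A.4). The function and variable names are ours, and the example feed rate of 0.02 is only illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Constants from Appendix A
Y_cr, Y_cf, k_d, s_f = 0.6, 0.15, 1.85, 10.0

def rates(s):
    """Rate expressions R_r, R_t, R_f and pi as functions of glucose concentration s."""
    R_r = 0.55 * s / (0.05 + s)
    R_t = max(1.25 * s / (0.95 + s), R_r)
    pi = 6.25 * s / (0.1 + s + 2.0 * s ** 2)
    return R_r, R_t, R_t - R_r, pi

def model_rhs(t, z, u):
    """Mass balances (A.1)-(A.4) with state z = [cV, sV, icV, V]."""
    cV, sV, icV, V = z
    s, i = sV / V, icV / cV
    R_r, R_t, R_f, pi = rates(s)
    return [
        (R_r * Y_cr + R_f * Y_cf) * cV,  # d(cV)/dt
        u * s_f - R_t * cV,              # d(sV)/dt
        (pi - k_d * i) * cV,             # d(icV)/dt
        u,                               # dV/dt
    ]

# Example: integrate one sampling interval (0.1 h) from the nominal initial condition
c0, s0, i0, V0 = 0.15, 5.0, 0.1, 0.6
z0 = [c0 * V0, s0 * V0, i0 * c0 * V0, V0]
sol = solve_ivp(model_rhs, (0.0, 0.1), z0, args=(0.02,), rtol=1e-8)
```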

    Objective function:

\max_{u} \{\, i c V|_{t_f} - \lambda t_f \,\} \qquad (A.5)

subject to the constraints 0 ≤ u ≤ 0.2722 and V ≤ 1.2. Here, t_f is the fermentation time, λ = 0.3, and the sampling interval is ∆t = 0.1 h.