Preprints of the 21st IFAC World Congress (Virtual), Berlin, Germany, July 12-17, 2020
Online Adaptive Critic Robust Control of Discrete-Time Nonlinear Systems With Unknown Dynamics ⋆
Hao Fu ∗,∗∗ Xin Chen ∗,∗∗ Min Wu ∗,∗∗
∗ School of Automation, China University of Geosciences, Wuhan, 430074, China (email: [email protected]).
∗∗ Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, 430074, China
Abstract: This paper concerns the optimal model reference adaptive control problem for unknown discrete-time nonlinear systems. For such a problem, it is challenging to improve online learning efficiency while guaranteeing robustness to uncertainty. To this end, we develop an online adaptive critic robust control method, in which a critic network and a new supervised action network are constructed to improve real-time learning efficiency and to obtain optimal control performance. By incorporating a designed compensation control term that compensates for the uncertainty, robustness is further guaranteed. A comparative simulation study is conducted to show the superiority of the developed method.
Keywords: Approximate dynamic programming (ADP), unknown nonlinear systems, neural network (NN), supervised learning, model reference adaptive control (MRAC), robust control.
1. INTRODUCTION
During the past several decades, reinforcement learning (RL) has gained a great deal of research attention in the artificial intelligence community. In the control systems community, approximate/adaptive dynamic programming (ADP) Werbos (1992) (also called adaptive critic design), which combines RL and adaptive control, was first employed to address the optimal regulation problem via the action-critic network framework. Fruitful results on ADP Chen et al. (2019); Lian et al. (2016); Ha et al. (2018); Wang et al. (2020); Pang & Jiang (2019); Si & Wang (2001) have been reported in recent years.
Among the aforementioned results, an online ADP method Si & Wang (2001) was developed without requiring knowledge of the system dynamics. Convergence of this algorithm was analyzed via the Lyapunov extension theorem Liu et al. (2012). On this basis, He et al. (2012) further proposed a new ADP framework with an additional reference/goal network integrated into the action-critic network. Moreover, this algorithm has also seen extensive study in optimal tracking control Yang et al. (2009); Ni et al. (2013); Mu et al. (2017).
Model reference adaptive control (MRAC) aims at enforcing the controlled system to track a desired reference model rather than a given tracking trajectory, so that the closed-loop control system acquires the characteristics of the reference model. Optimal MRAC is therefore well worth investigating beyond ordinary tracking control. Among the state-of-the-art developments in this direction, only Radac et al. (2017, 2018); Fu et al. (2017);
⋆ This work was supported in part by the National Natural Science Foundation of China under Grant 61873248, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2017CFA030 and Grant 2015CFA010, and in part by the 111 project of China under Grant B17040.
Wang et al. (2018) have developed ADP-based optimal MRAC approaches.
Owing to the existence of the reference input in MRAC, there invariably exists a feedforward control term that depends on the input dynamics of the system, which must be obtained via identification. To obviate this requirement, changes of the reference input are ignored during the learning process in Radac et al. (2017, 2018), while Fu et al. (2017); Wang et al. (2018) do not consider the uncertainty resulting from the identification error. As such, it remains a challenge to design an ADP-based MRAC method that is robust to such uncertainty.
On the other hand, at the beginning of the training phase, online ADP methods with unknown dynamics are prone to inefficiency or a high failure rate Zhao et al. (2013); Fathinezhad et al. (2016). Such inefficiency or failure is an unacceptable and fatal risk in real-time control.
Motivated by the above discussions, we develop an online adaptive critic robust control method for discrete-time nonlinear systems with unknown dynamics. This method ensures that the closed-loop control system is robust to uncertainty and learns with high efficiency.
The main contributions of this study include the following two aspects.
(1) In contrast to existing online ADP methods Yang et al. (2009); Ni et al. (2013); Mu et al. (2017), the developed control method greatly reduces the failure rate and improves learning efficiency via a critic network and a new supervised action network.
(2) Unlike Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018), the developed control method guarantees robustness to the uncertainties resulting from identification and from exterior disturbances by introducing compensation control into the learning process.
Copyright lies with the authors
The outline of this paper is as follows. The problem description is stated in Section 2. In Section 3, the online adaptive critic robust control method is given. Section 4 provides the comparative simulation. In Section 5, the conclusions are stated.
2. PROBLEM FORMULATIONS
Consider the following discrete-time nonlinear system:
x_i(t+1) = x_{i+1}(t),  i = 1, 2, ..., n−1,
x_n(t+1) = f(x(t)) + g(x(t))u(t) + d(t),    (1)
in which x(t) = [x_1^T(t), x_2^T(t), ..., x_n^T(t)]^T ∈ R^{nm} denotes the state with x_i(t) ∈ R^m, f : R^{nm} → R^m and g : R^{nm} → R^{m×m} are unknown smooth nonlinear functions, u(t) ∈ R^m represents the control input, and d(t) ∈ R^m denotes an unknown persistent disturbance. Note that, under full-state feedback linearization, general nonlinear systems can be converted to the form (1) via a coordinate transformation.
Assumption 1: The nonlinear function g(t) is bounded and nonsingular for all x(t).
Define a reference model as
x_{ri}(t+1) = x_{r(i+1)}(t),  i = 1, 2, ..., n−1,
x_{rn}(t+1) = A_r x_r(t) + B_r u_r(t),    (2)
where x_r(t) = [x_{r1}^T(t), x_{r2}^T(t), ..., x_{rn}^T(t)]^T ∈ R^{nm} denotes the reference state with x_{ri}(t) ∈ R^m, A_r ∈ R^{m×nm} and B_r ∈ R^{m×m} represent the constant matrices of the reference model, and u_r(t) is the reference control input. Here, x_r(t) and u_r(t) are assumed to be bounded.
The objective of this paper is to design an optimal control law u(t) that enables the system (1) to optimally track the behavior of the reference model (2). Subtracting (2) from (1) yields the model reference tracking error dynamics
e_i(t+1) = e_{i+1}(t),  i = 1, 2, ..., n−1,
e_n(t+1) = f(t) + g(t)u(t) + d(t) − A_r x_r(t) − B_r u_r(t),    (3)
where e(t) = x(t) − xr(t) denotes the model reference tracking error with ei(t) = xi(t) − xri(t).
To realize optimality, the following performance index function, or cost function, must be minimized:
J(t) = Σ_{k=t}^{∞} γ^{k−t} r(k),    (4)
in which γ is a discount factor and r(t) = e^T(t)Q e(t) + u^T(t)R u(t) is defined as the utility function or reward, with symmetric positive definite matrices Q and R.
In accordance with Bellman's optimality principle, the optimal cost function J*(t) satisfies the following Bellman equation:
J*(t) = min_{u(t)} {r(t) + γ J*(t+1)}.    (5)
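As a quick numerical sanity check on (4) and (5), the following Python sketch (with an assumed toy reward sequence, not from the paper) verifies that the discounted sum and the backward Bellman recursion agree:

```python
import numpy as np

# Toy illustration of the discounted cost (4): for a fixed reward
# sequence r(k) that vanishes after a few steps,
# J(t) = sum_{k>=t} gamma^(k-t) r(k) satisfies the Bellman
# recursion J(t) = r(t) + gamma * J(t+1).
gamma = 0.95
r = np.array([1.0, 0.5, 0.25, 0.125, 0.0, 0.0])  # assumed reward tail

# Direct evaluation of the discounted sum for each t
J_direct = np.array([sum(gamma**(k - t) * r[k] for k in range(t, len(r)))
                     for t in range(len(r))])

# Backward Bellman recursion reproduces the same values
J_rec = np.zeros(len(r) + 1)
for t in reversed(range(len(r))):
    J_rec[t] = r[t] + gamma * J_rec[t + 1]

assert np.allclose(J_direct, J_rec[:-1])
```

With a control input present, the minimization over u(t) in (5) is what the critic/action networks below approximate.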
Due to the unknown dynamics of (1), it is difficult to solve the Bellman equation (5). To overcome this difficulty, the ADP-based MRAC methods Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018) have been proposed. However, Radac et al. (2017, 2018) lack real-time control performance, and Fu et al. (2017); Wang et al. (2018) do not consider the system uncertainty resulting from identification. On the other hand, inefficiency or a high failure rate is always present in the online ADP methods Yang et al. (2009); Ni et al. (2013); Mu et al. (2017), which is an unacceptable and fatal risk in real-time control.
Fig. 1. Online adaptive critic robust control structure diagram.
3. ADAPTIVE CRITIC ROBUST CONTROL
In this section, an online adaptive critic robust control method is developed to achieve both robustness to uncertainty and learning efficiency. Its control structure diagram is depicted in Fig. 1.
Define a filtered model reference tracking error as
ē(t) = e_n(t) + λ_1 e_{n−1}(t) + · · · + λ_{n−1} e_1(t),    (6)
where λ_1, ..., λ_{n−1} are constants such that z^{n−1} + λ_1 z^{n−2} + · · · + λ_{n−1} is stable. Then, the filtered model reference tracking error dynamics can be formulated as
ē(t+1) = f(t) + g(t)u(t) + d(t) − A_r x_r(t) − B_r u_r(t) + λ_1 e_n(t) + · · · + λ_{n−1} e_2(t).    (7)
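The filtered error (6) is a simple weighted sum of the error blocks. A minimal Python sketch, with illustrative dimensions (the n = 2, m = 1 case and λ_1 = 0.3 match the later simulation study; the error values are assumptions):

```python
import numpy as np

# Sketch of the filtered model reference tracking error (6):
# e_bar(t) = e_n(t) + lam_1 e_{n-1}(t) + ... + lam_{n-1} e_1(t).
def filtered_error(e_blocks, lams):
    """e_blocks: list [e_1(t), ..., e_n(t)] of m-vectors;
    lams: [lam_1, ..., lam_{n-1}], chosen so that
    z^{n-1} + lam_1 z^{n-2} + ... + lam_{n-1} is stable."""
    n = len(e_blocks)
    e_bar = np.array(e_blocks[-1], dtype=float)    # e_n(t)
    for i, lam in enumerate(lams, start=1):        # lam_i multiplies e_{n-i}(t)
        e_bar += lam * np.asarray(e_blocks[n - 1 - i], dtype=float)
    return e_bar

# Example with n = 2, m = 1 and lam_1 = 0.3 (assumed error values)
e1, e2 = np.array([0.2]), np.array([-0.1])
print(filtered_error([e1, e2], [0.3]))   # e_2 + 0.3 * e_1
```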
An adaptive critic robust control law is designed as
u(t) = ĝ^{−1}(t)(u_s(t) + k_v ē(t) + A_r x_r(t) + B_r u_r(t) − λ_1 e_n(t) − · · · − λ_{n−1} e_2(t)) + u_a(t),    (8)
where k_v ∈ R^{m×m} is the gain matrix, u_a(t) denotes a neural network (NN) control term, u_s(t) represents a compensation control term, and ĝ(t) is the estimation of g(t). Note that ĝ(t) is usually obtained by a model identification method Zhao et al. (2016); Jiang et al. (2018). According to Assumption 1 and the results of Wang et al. (2002), it is deduced that ĝ(t) is also bounded away from singularity.
A desirable value of u(t) is given by
u_d(t) = g^{−1}(t)(u_s(t) + k_v ē(t) − f(t) − d(t) + A_r x_r(t) + B_r u_r(t) − λ_1 e_n(t) − · · · − λ_{n−1} e_2(t)).    (9)
Using (9) and substituting (8) into (7) yields
ē(t+1) = k_v ē(t) + g(t)(u(t) − u_d(t))
        = k_v ē(t) + f_1(t) + g(t)u_a(t) + u_s(t) + d_1(t),    (10)
where f_1(t) = f(t) + (k_v − λ_1 I_m)x_n(t) + (k_v λ_1 − λ_2 I_m)x_{n−1}(t) + · · · + (k_v λ_{n−2} − λ_{n−1} I_m)x_2(t) + k_v λ_{n−1} x_1(t), d_1(t) = g(t)(ĝ^{−1}(t) − g^{−1}(t))(u_s(t) + A_r x_r(t) + B_r u_r(t) + (λ_1 I_m − k_v)x_{rn}(t) + (λ_2 I_m − k_v λ_1)x_{r(n−1)}(t) + · · · + (λ_{n−1} I_m − k_v λ_{n−2})x_{r2}(t) − k_v λ_{n−1} x_{r1}(t)) + d(t), and I_m ∈ R^{m×m} is an identity matrix. According to the results of Fu et al. (2020), it is inferred from Assumption 1 that d_1(t) is bounded.
To meet the requirements of optimal control, the critic network and the supervised action network are constructed as follows.
Since it is intractable to acquire the analytical solution of J*(t) by solving (5), an NN is employed to approximate the cost function J(t) as follows:
J(t) = w_c^{*T}(t) φ_c(v_c^{*T}(t) z_c(t)) + ε_c,    (11)
where z_c(t) = [e^T(t), u_a^T(t)]^T denotes the input of the critic network with h_c hidden-layer neurons, φ_c(·) represents the activation function of the critic network, w_c^*(t) ∈ R^{h_c×1} and v_c^*(t) ∈ R^{(nm+m)×h_c} denote the ideal weights, and ε_c is the critic network approximation error.
Similarly, since w_c^{*T}(t) and v_c^{*T}(t) cannot be obtained directly, the estimation of J(t) is constructed as
Ĵ(t) = w_c^T(t) φ_c(v_c^T(t) z_c(t)),    (12)
where w_c(t) and v_c(t) are the estimations of w_c^*(t) and v_c^*(t).
Due to the unknown dynamics of (10), the supervised action network u_a(t) has the NN representation
u_a(t) = φ_{a2}(w_a^{*T}(t) φ_{a1}(v_a^{*T}(t) z_a(t))) + ε_a,    (13)
where z_a(t) = x(t) denotes the input of the supervised action network with h_a hidden-layer neurons, φ_{a1}(·) and φ_{a2}(·) represent the activation functions, w_a^*(t) ∈ R^{h_a×m} and v_a^*(t) ∈ R^{nm×h_a} denote the ideal weights, and ε_a is the supervised action network approximation error.
Since the ideal weights w_a^*(t) and v_a^*(t) cannot be obtained directly, the actual control term u_a(t) is constructed as
u_a(t) = φ_{a2}(w_a^T(t) φ_{a1}(v_a^T(t) z_a(t))),    (14)
where w_a(t) and v_a(t) are the estimations of w_a^*(t) and v_a^*(t). For simplicity, φ_c(t), φ_{a1}(t), and φ_{a2}(t) are used to represent φ_c(v_c^T(t) z_c(t)), φ_{a1}(v_a^T(t) z_a(t)), and φ_{a2}(w_a^T(t) φ_{a1}(v_a^T(t) z_a(t))), respectively.
The prediction error of the supervised action network is represented as
e_a(t) = (Ĵ(t) − U_c) I_{m×1} + f_1(t) + g(t)u_a(t),    (15)
where U_c = 0 is the ultimate cost objective and I_{m×1} ∈ R^{m×1} is a vector whose elements are all 1.
Remark 1: There are two targets in the supervised action network design. One is to minimize the error between the cost function estimation Ĵ(t) and the ultimate cost objective U_c; its motivation is that the cost function estimation Ĵ(t) approximates the optimal cost function J*(t). The other target is to minimize the error between the output of the action network and g^{−1}(t)f(t), which is similar to supervised learning or adaptive NN control.
Since there is no prior knowledge of f(t) and g(t), (15) is reformulated by using (10) as
e_a(t) = (Ĵ(t) − U_c) I_{m×1} + ē(t+1) − k_v ē(t) − u_s(t).    (16)
Define its objective function as
E_a(t) = (1/2) e_a^T(t) e_a(t).    (17)
The weights of the supervised action network are updated by
Δw_a(t) = −η_a φ_{a1}(t) e_a^T(t) w_{ac}(t) diag(φ'_{a2}(t)),    (18a)
Δv_a(t) = −η_a z_a(t) e_a^T(t) w_{ac}(t) diag(φ'_{a2}(t)) w_a^T(t) diag(φ'_{a1}(t)),    (18b)
where η_a is the learning rate, diag(·) is the diagonalization operator, φ'_{a1}(t) and φ'_{a2}(t) respectively represent the derivatives of φ_{a1}(t) and φ_{a2}(t), and w_{ac}(t) is defined as
w_{ac}(t) = α_c I_{m×1} w_c^T(t) diag(φ'_c(t)) v_{c2}^T(t) + (1 − α_c) g(t),    (19)
in which α_c will be designed in detail later, φ'_c(t) represents the derivative of φ_c(t), and the matrix v_{c2}(t) ∈ R^{m×h_c} satisfies v_c(t) = [v_{c1}^T(t), v_{c2}^T(t)]^T with v_{c1}(t) ∈ R^{nm×h_c}.
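The updates (18a)-(18b) are ordinary backpropagation steps on E_a(t), with the gradient routed through the blended weight w_ac(t) from (19). The following Python sketch shows the shape bookkeeping for a two-layer tanh network; the layer sizes match the 2-3-1 action network of the simulation section, but the input values, the random seed, and the choice w_ac = I are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Simplified sketch of the supervised action network update (18a)-(18b);
# tanh plays the role of phi_a1 and phi_a2.
rng = np.random.default_rng(0)
nm, ha, m = 2, 3, 1                 # state dim, hidden neurons, input dim
wa = rng.standard_normal((ha, m))
va = rng.standard_normal((nm, ha))
eta_a = 0.001                       # learning rate eta_a

def action_update(za, ea, wac):
    """One gradient step on E_a = 0.5 * ea^T ea through
    ua = tanh(wa^T tanh(va^T za))."""
    h = np.tanh(va.T @ za)                          # hidden output phi_a1(t)
    ua = np.tanh(wa.T @ h)                          # network output phi_a2(t)
    d2 = 1.0 - ua**2                                # phi_a2'(t) for tanh
    dwa = -eta_a * np.outer(h, (wac @ ea) * d2)     # cf. (18a)
    d1 = 1.0 - h**2                                 # phi_a1'(t) for tanh
    dva = -eta_a * np.outer(za, (wa @ ((wac @ ea) * d2)) * d1)  # cf. (18b)
    return dwa, dva

za = np.array([0.1, -0.2])    # z_a(t) = x(t), assumed values
ea = np.array([0.05])         # prediction error e_a(t), assumed value
dwa, dva = action_update(za, ea, wac=np.eye(m))
assert dwa.shape == wa.shape and dva.shape == va.shape
```

Setting w_ac = I here simply drops the critic-versus-g(t) blending of (19) to keep the dimension check readable.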
In the critic network, via the Bellman equation (5), the prediction error can be represented as
e_c(t) = γ Ĵ(t) − Ĵ(t−1) + r(t).    (20)
Define its objective function as
E_c(t) = (α_c/2) e_c^2(t).    (21)
By using the gradient descent algorithm, the weights of the critic network are updated by
Δw_c(t) = −η_c α_c γ e_c(t) φ_c(t),    (22a)
Δv_c(t) = −η_c α_c γ e_c(t) z_c(t) w_c^T(t) diag(φ'_c(t)).    (22b)
Design a learning-schedule factor α_c as
α_c = 1, if (1/N_α) Σ_{k=t−N_α+1}^{t} ‖ē(k)‖ < ε_α,
α_c = 0, if (1/N_α) Σ_{k=t−N_α+1}^{t} ‖ē(k)‖ ≥ ε_α,    (23)
where ε_α > 0 is a design constant and N_α is a positive integer.
It is worth pointing out that the traditional online action-critic framework easily leads to inefficiency in real-time control. Specifically, at the beginning of the online training phase, the state of the system (1) may be far away from the reference state, which creates a risk of training failure. This case can be viewed as (1/N_α) Σ_{k=t−N_α+1}^{t} ‖ē(k)‖ ≥ ε_α, i.e., α_c = 0. The supervised action network then guides the system state back to the neighborhood of the reference state. Once (1/N_α) Σ_{k=t−N_α+1}^{t} ‖ē(k)‖ < ε_α holds, the online adaptive critic learning works to further derive the optimal control policy. It is obvious that the high failure risk is avoided at the beginning of the online training phase.
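The switching rule (23) is a moving-average test on the filtered error norm. A minimal Python sketch, with an assumed window length, threshold, and error sequence (the paper's simulation uses N_α = 10 and ε_α = 0.7):

```python
from collections import deque
import numpy as np

# Sketch of the learning-schedule factor alpha_c in (23): critic learning
# is enabled only once the moving average of ||e_bar|| over the last
# N_alpha steps falls below eps_alpha.
class LearningSchedule:
    def __init__(self, n_alpha=10, eps_alpha=0.7):
        self.window = deque(maxlen=n_alpha)
        self.eps_alpha = eps_alpha

    def step(self, e_bar):
        self.window.append(np.linalg.norm(e_bar))
        if len(self.window) < self.window.maxlen:
            return 0                      # not enough history: supervised phase
        mean_err = sum(self.window) / len(self.window)
        return 1 if mean_err < self.eps_alpha else 0

sched = LearningSchedule(n_alpha=3, eps_alpha=0.5)
flags = [sched.step(np.array([e])) for e in (2.0, 1.0, 0.6, 0.4, 0.3, 0.2)]
print(flags)   # → [0, 0, 0, 0, 1, 1]: alpha_c switches on as the error decays
```

Returning 0 during the first N_α − 1 steps (before the window fills) is an implementation choice added here; (23) itself only defines the full-window case.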
It can be concluded that the learning process is convergent and e(t) is uniformly ultimately bounded (UUB), with a bound depending on d_1(t); the proof is omitted here to save space. Then, we can deduce from the UUB property that ē(t) and ē(t) − k_v ē(t−1) are also bounded. Without loss of generality, let
‖ē(t+1) − k_v ē(t)‖ ≤ δ_e,    (24)
where δ_e > 0.
From (10), we have ‖f_1(t) + g(t)u_a(t) + d_1(t)‖ ≤ δ_e. Let
d_2(t) = f_1(t) + g(t)u_a(t) + d_1(t).    (25)
Then, we get
ē(t+1) = k_v ē(t) + u_s(t) + d_2(t).    (26)
Remark 2: When the weights of the critic network or the supervised action network are close to a convergent region, it is necessary to reduce the learning rates Ni et al. (2013); Mu et al. (2017). Without loss of generality, let (1/N_s) Σ_{k=t−N_s+1}^{t} ‖w_c(k) − w_c(k−1)‖ < ε_s represent that the weights are close to the convergent region. In this case, owing to the reduced learning rates, when a weak compensation control signal u_s(t) is added, the system state change resulting from u_s(t) has little impact on the learning process. Then, (25) still holds. Thus, the linear system (26) with a persistent disturbance d_2(t) is always available.
Since the specific information of d_2(t) is unavailable, the disturbance observer Kim & Rew (2013) is designed as
d̂_2(t) = k_d ē(t) − z_d(t),
z_d(t+1) = z_d(t) + k_d((k_v − I_m) ē(t) + u_s(t) + d̂_2(t)),    (27)
in which d̂_2(t) is an estimation of d_2(t), k_d ∈ R^{m×m} is a diagonal observer matrix, and z_d(t) is a new state variable. From the conclusion of Kim & Rew (2013), it is known that d̂_2(t) converges to d_2(t).
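The observer (27) can be exercised on the scalar error dynamics (26). In the sketch below (a Python illustration under simplifying assumptions: u_s(t) = 0, a constant disturbance d_2, and the gains k_v = 0.1, k_d = 0.3 from the simulation section), the estimate d̂_2(t) converges to d_2 geometrically:

```python
import numpy as np

# Scalar sketch of the disturbance observer (27) driven by the
# filtered-error dynamics (26) with us(t) = 0 and constant d2.
kv, kd, d2 = 0.1, 0.3, 0.5
e_bar, zd = 0.0, 0.0
for t in range(100):
    d2_hat = kd * e_bar - zd                             # d2_hat(t) via (27)
    zd = zd + kd * ((kv - 1.0) * e_bar + 0.0 + d2_hat)   # zd(t+1), us(t) = 0
    e_bar = kv * e_bar + 0.0 + d2                        # e_bar(t+1) via (26)

d2_hat = kd * e_bar - zd
print(d2_hat)   # converges to d2 = 0.5
```

Substituting (26) and (27) into each other gives d̂_2(t+1) = d̂_2(t) + k_d(d_2 − d̂_2(t)) in this scalar case, so the estimation error contracts by the factor 1 − k_d per step.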
Inspired by Du et al. (2016), we design a chattering-free compensation control as
u_s(t) = (q_{s1} − k_v) ē(t) − q_{s2} sig^{α_s}(ē(t)) − d̂_2(t), if (1/N_s) Σ_{k=t−N_s+1}^{t} ‖w_c(k) − w_c(k−1)‖ < ε_s,
u_s(t) = 0, otherwise,    (28)
where 0 < q_{s1} < 1, 0 < q_{s2} < 1, 0 < α_s < 1, sig^{α_s}(·) = sgn(·) · |·|^{α_s}, ε_s > 0 is a design constant, and N_s is a positive integer.
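The active branch of (28) is a fractional-power sliding-mode-style feedback. A short Python sketch with the gains of the simulation section (q_s1 = 0.6, q_s2 = 0.4, k_v = 0.1); α_s = 0.5 and the input values are illustrative assumptions, and the weight-convergence switch is omitted for brevity:

```python
import numpy as np

# Sketch of the active branch of the compensation term (28):
# us(t) = (qs1 - kv) e_bar(t) - qs2 * sig^alpha_s(e_bar(t)) - d2_hat(t),
# with sig^alpha(x) = sign(x) * |x|^alpha applied elementwise.
def sig(x, alpha):
    return np.sign(x) * np.abs(x) ** alpha

def compensation(e_bar, d2_hat, kv=0.1, qs1=0.6, qs2=0.4, alpha_s=0.5):
    return (qs1 - kv) * e_bar - qs2 * sig(e_bar, alpha_s) - d2_hat

e_bar = np.array([0.04])     # assumed filtered error
d2_hat = np.array([0.5])     # assumed disturbance estimate
print(compensation(e_bar, d2_hat))   # ≈ [-0.56]
```

Because sig^{α_s}(·) is continuous (unlike a pure sign function), the resulting control signal is chattering-free while still attenuating the disturbance.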
According to the results of Du et al. (2016), the compensation control u_s(t) is a chattering-free signal with a disturbance-attenuation capability. Thus, u_s(t) ensures the system's robustness to uncertainty by compensating for d_2(t). As such, the developed online adaptive critic robust control method not only possesses the high-efficiency optimal control property in real time, but also remains robust to uncertainty. The procedure to realize this method is summarized in Algorithm 1.
4. SIMULATION
To verify the superiority of the theoretical results, a simulation example of the developed method is conducted by comparison with a traditional online ADP method. The dynamics of a one-link robot manipulator is considered:
G θ̈ + D θ̇ + MgL sin(θ) = τ,    (29)
where g = 9.8 m/s² is the gravitational acceleration, D = 1 represents the viscous friction coefficient, L = 1 m is the length of the link, M = 1 kg represents the payload mass, G = 1 kg·m² is the moment of inertia, θ is the angular position, τ is the torque, and τ_d is a disturbance. Note that these dynamics are unavailable for the controller design.
Discretizing (29) using the Euler method with sampling interval T_s = 0.05 s yields
x_1(t+1) = x_2(t),
x_2(t+1) = ((2G − D T_s)/G) x_2(t) − ((G − D T_s)/G) x_1(t) − (MgL T_s²/G) sin(x_1(t)) + (T_s²/G) u(t) + (T_s²/G) d(t),    (30)
where d(t) = 0.08 cos(1.8 T_s t − π/4) sin(T_s t + π/3).
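The discretized plant (30) can be written directly as a one-step map. The Python sketch below uses the parameter values stated above (G = D = M = L = 1, g = 9.8, T_s = 0.05); treating it as a black-box step is exactly how the controller sees it, and the initial condition here is an assumption:

```python
import numpy as np

# Euler-discretized one-link manipulator dynamics (30).
G, D, M, L, grav, Ts = 1.0, 1.0, 1.0, 1.0, 9.8, 0.05

def plant_step(x1, x2, u, t):
    """One step of (30): returns (x1(t+1), x2(t+1))."""
    d = 0.08 * np.cos(1.8 * Ts * t - np.pi / 4) * np.sin(Ts * t + np.pi / 3)
    x1_next = x2
    x2_next = ((2 * G - D * Ts) / G) * x2 - ((G - D * Ts) / G) * x1 \
              - (M * grav * L * Ts**2 / G) * np.sin(x1) \
              + (Ts**2 / G) * u + (Ts**2 / G) * d
    return x1_next, x2_next

# Free response from a small assumed initial angle, zero control input
x1, x2 = 0.1, 0.1
for t in range(50):
    x1, x2 = plant_step(x1, x2, u=0.0, t=t)
```

The T_s²/G factor on u(t) and d(t) follows from the second-difference approximation of θ̈ underlying (30).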
The reference model is given by
x_{r1}(t+1) = x_{r2}(t),
x_{r2}(t+1) = (2 − 1.5 T_s) x_{r2}(t) + (1.5 T_s − 2.5 T_s² − 1) x_{r1}(t) + T_s² u_r(t),    (31)
where u_r(t) = sin(0.2 T_s t) cos(0.4 T_s t + π/2).
Algorithm 1:
\* i_tri: the maximal trial number; t_c: the cumulative time for breaking; t_t: the simulation terminal time; i_c, i_a: the maximal iteration numbers in the critic network and the supervised action network, respectively; E_ct, E_at: the objective function thresholds in the critic network and the supervised action network, respectively.
1): set the coefficients λ_1, λ_2, ..., λ_{n−1}, γ, Q, R, k_v, k_d, η_a, η_c, N_α, N_s, ε_α, ε_s, ε_e, q_{s1}, and q_{s2};
2): for 1 to i_tri do \* trial
3): initialize x(0), x_r(0);
4): initialize w_c(0), v_c(0), w_a(0), and v_a(0) randomly;
5): u_s(0) = 0 and u_a(0) ← (14);
6): while t ≤ t_t do
7): w_a(t) = w_a(t−1), v_a(t) = v_a(t−1), w_c(t) = w_c(t−1), and v_c(t) = v_c(t−1);
8): calculate u(t−1) ← (8);
9): x(t) ← (1), x_r(t) ← (2), e(t) ← (3), and ē(t) ← (6);
10): if t > t_c and (1/t_c) Σ_{k=t−t_c+1}^{t} ‖ē(k)‖ > ε_e
11): break this trial;
12): endif
13): calculate Ĵ(t) ← (12), u_a(t) ← (14), d̂_2(t) ← (27), u_s(t) ← (28), and α_c ← (23);
14): r(t) = e^T(t)Q e(t) + u^T(t)R u(t);
15): calculate E_c(t) and set i = 0;
16): while ((i < i_c) & (E_c(t) > E_ct)) do
17): update w_c(t) = w_c(t−1) + Δw_c(t) and v_c(t) = v_c(t−1) + Δv_c(t);
18): Ĵ(t) ← (12);
19): if (1/N_s) Σ_{k=t−N_s+1}^{t} ‖w_c(k) − w_c(k−1)‖ < ε_s
20): reduce η_a and η_c;
21): else
22): reset η_a and η_c;
23): endif
24): calculate E_c(t) and set i = i + 1;
25): endwhile \* critic network
26): calculate E_a(t) and set j = 0;
27): while ((j < i_a) & (E_a(t) > E_at)) do
28): update w_a(t) = w_a(t−1) + Δw_a(t) and v_a(t) = v_a(t−1) + Δv_a(t);
29): u_a(t) ← (14) and u(t) ← (8);
30): Ĵ(t) ← (12);
31): calculate E_a(t) and set j = j + 1;
32): endwhile \* action network
33): t = t + 1;
34): endwhile
35): endfor
The matrices Q and R are chosen as diag{0.5, 0.5} and 0.3, respectively. The critic network and the supervised action network are constructed as two three-layer back-propagation NNs with structures of 3-4-1 and 2-3-1, respectively. The activation functions φ_c(·) and φ_a(·) are selected as the hyperbolic tangent function. The initial weights of both networks are randomly generated from [−1, 1]. In view of the result of the adaptive critic robust control method, some parameters used in the simulation are presented in Table 1. In addition, by combining the results of Kim & Rew (2013); Du et al. (2016), the observer and compensation control parameters are chosen as k_d = 0.3, q_{s1} = 0.6, and q_{s2} = 0.4.
The trajectories of the system state x_1(t) and the reference state x_{r1}(t) are presented in Fig. 2. The curve of the model reference tracking error e_1(t) is depicted in Fig. 3. It is observed that the system (30) can exactly track the behavior of the reference model (31) by using the developed method. The control input curve is shown in Fig. 4.
Fig. 2. System state x_1(t) and reference state x_{r1}(t).
Fig. 3. Model reference tracking error e_1(t).
Fig. 4. Control input curve.
Table 1. Parameters in the example
λ_1 = 0.3, γ = 0.95, k_v = 0.1, η_a = 0.001, η_c = 0.004, i_a = 300, i_c = 200, E_at = 1e-4, E_ct = 6e-4, N_α = 10, N_s = 10, ε_a = 0.7, ε_e = 1, ε_s = 0.1.

To highlight the better learning efficiency and robustness of the developed online adaptive critic robust control method, it is necessary to conduct a comparative simulation experiment against the previous method of Ni et al. (2013). As Ni et al. (2013) is a tracking control method, the trajectory of the reference model needs to be obtained beforehand, and then the tracking control is implemented. The reference network, the critic network, and the action network are constructed as three similar NNs with structures of 5-4-1, 6-5-1, and 4-3-1, respectively. Let u(t) = ū(t)/ĝ(t) in (30), in which ū(t) is the output of the action network. Some parameters of this method selected in the simulation are presented in Table 2, in which η_r, i_r, and E_rt represent the learning rate, the maximal iteration number, and the objective function threshold for the reference network, respectively. Other parameter settings and initial conditions are the same as those of the developed online adaptive critic robust control method, such as λ_1, γ, k_v, the initial states, and the initial weights.

Fig. 5. System state x_1(t) and reference state x_{r1}(t) in Ni et al. (2013).
Table 2. Parameters in the example for Ni et al. (2013)
η_a = 0.001, η_c = 0.004, η_r = 0.001, i_a = 300, i_c = 200, i_r = 150, E_at = 1e-4, E_ct = 6e-4, E_rt = 2e-4.
By using the method proposed in Ni et al. (2013), we obtain the trajectories of x_1(t) and x_{r1}(t) and the model reference tracking error curve, shown in Figs. 5 and 6, respectively. Comparing Figs. 2 and 3 with Figs. 5 and 6, it can be seen that the developed method produces a smaller model reference tracking error than the method of Ni et al. (2013). This indicates that the developed method has superior robustness.
Table 3. Simulation results of both methods
Methods             | Number of experiments | Number of trials per run | Success rate (%)
Traditional method  | 100                   | 20                       | 53
Our method          | 100                   | 20                       | 100
In this comparative simulation study, a run consists of a maximum of 20 consecutive trials. A run is considered successful if (1/N_α) Σ_{k=t−N_α+1}^{t} ‖ē(k)‖ ≤ 0.07 holds for all t > 900. Otherwise, if the controller is unable to learn to make the system (30) track the behavior of the reference model (31) within 20 trials, the run is considered unsuccessful. We ran 100 experiments for both the traditional method Ni et al. (2013) and the developed method; the simulation results are listed in Table 3. It is observed
Fig. 6. Model reference tracking error e_1(t) in Ni et al. (2013).
that, in contrast to the traditional online ADP method Ni et al. (2013), our developed method greatly reduces the learning failure rate.
5. CONCLUSION
An online adaptive critic robust control method has been developed to handle the optimal MRAC problem for nonlinear systems. The online adaptive critic robust controller consists of the critic network, the supervised action network, and the compensation control term. Via the newly defined learning-schedule factor, the controller not only achieves high-efficiency learning and optimality in real time, but is also robust to uncertainty. A comparative simulation has been provided to show the superiority of the developed method. Further investigation and experimentation are recommended on stability analysis, optimization of the algorithm, and applications to real systems.
REFERENCES
Chen, X., Wang, W., Cao, W., & Wu, M. (2019). Gaussian-kernel-based adaptive critic design using two-phase value iteration. Information Sciences, 482, 139-155.
Du, H., Yu, X., Chen, M. Z. Q., & Li, S. (2016). Chattering-free discrete-time sliding mode control. Automatica, 68, 87-91.
Fathinezhad, F., Derhami, V., & Rezaeian, M. (2016). Supervised fuzzy reinforcement learning for robot navigation. Applied Soft Computing, 40, 33-41.
Fu, H., Chen, X., & Wang, W. (2017). A model reference adaptive control with ADP-to-SMC strategy for unknown nonlinear systems. Proceedings of the 11th Asian Control Conference, 1537-1542.
Fu, H., Chen, X., Wang, W., & Wu, M. (2020). MRAC for unknown discrete-time nonlinear systems based on supervised neural dynamic programming. Neurocomputing, 384, 130-141.
Ha, M., Wang, D., & Liu, D. (2018). Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, doi: 10.1109/TSMC.2018.2868510.
He, H., Ni, Z., & Fu, J. (2012). A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing, 78(1), 3-13.
Jiang, H., Zhang, H., Xiao, G., & Cui, X. (2018). Data-based approximate optimal control for nonzero-sum games of multi-player systems using adaptive dynamic programming. Neurocomputing, 275, 192-199.
Kim, K. & Rew, H. (2013). Reduced-order disturbance observer for discrete-time linear systems. Automatica, 49(4), 968-975.
Lian, C., Xu, X., Chen, H., & He, H. (2016). Near-optimal tracking control of mobile robots via receding-horizon dual heuristic programming. IEEE Transactions on Cybernetics, 46(11), 2484-2496.
Liu, F., Sun, J., Si, J., Guo, W., & Mei, S. (2012). A boundedness result for the direct heuristic dynamic programming. Neural Networks, 32, 229-235.
Mu, C., Ni, Z., Sun, C., & He, H. (2017). Air-breathing hypersonic vehicle tracking control based on adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 28(3), 584-598.
Mu, C., Ni, Z., Sun, C., & He, H. (2017). Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems. IEEE Transactions on Cybernetics, 47(6), 1460-1470.
Ni, Z., He, H., & Wen, J. (2013). Adaptive learning in tracking control based on the dual critic network design. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 913-928.
Pang, B. & Jiang, Z. P. (2019). Adaptive optimal control of linear periodic systems: An off-policy value iteration approach. arXiv:1901.08650.
Radac, M. B., Precup, R. E., & Roman, R. C. (2017). Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning. International Journal of Systems Science, 48(5), 1071-1083.
Radac, M. B., Precup, R. E., & Roman, R. C. (2018). Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-learning. ISA Transactions, 73, 227-238.
Si, J. & Wang, Y. T. (2001). Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264-276.
Wang, W., Chen, X., Wang, F., & Fu, H. (2018). ADP-based model reference adaptive control design for unknown discrete-time nonlinear systems. Proceedings of the 37th Chinese Control Conference, 8049-8054.
Wang, W. Y., Chan, M. L., Hsu, C. C. J., & Lee, T. T. (2002). H∞ tracking-based sliding mode control for uncertain nonlinear systems via an adaptive fuzzy-neural approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 32(4), 483-492.
Wang, Z., Wei, Q., & Liu, D. (2020). Event-triggered adaptive dynamic programming for discrete-time multi-player games. Information Sciences, 506, 457-470.
Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White, & D. A. Sofge (Eds.), Handbook of intelligent control. New York: Van Nostrand Reinhold, (Chapter 13).
Yang, L., Si, J., Tsakalis, K. S., & Rodriguez, A. A. (2009). Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(6), 1617-1622.
Zhao, D., Wang, B., & Liu, D. (2013). A supervised actor-critic approach for adaptive cruise control. Soft Computing, 17(11), 2089-2099.
Zhao, D., Zhang, Q., Wang, D., & Zhu, Y. (2016). Experience replay for optimal control of nonzero-sum game systems with unknown dynamics. IEEE Transactions on Cybernetics, 46(3), 854-865.
1727
Online Adaptive Critic Robust Control of DiscreteTime Nonlinear Systems With Unknown
Dynamics ⋆
Hao Fu ∗,∗∗ Xin Chen ∗,∗∗ Min Wu ∗,∗∗
∗ School of Automation, China University of Geosciences, Wuhan, 430074, China (email: [email protected]).
∗∗ Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, 430074, China
Abstract: This paper concerns the optimal model reference adaptive control problem for unknown discretetime nonlinear systems. For such problem, it is challenging to improve online learning efﬁciency and guaranteeing robustness to the uncertainty. To this end, we develop an online adaptive critic robust control method. In this method, a critic network and a new supervised action network are constructed to not only improve the realtime learning efﬁciency, but also obtain the optimal control performance. By combining the designed compensation control term, robustness is further guaranteed by compensating the uncertainty. The comparative simulation study is conducted to show the superiority of our developed method.
Keywords: Approximate dynamic programming (ADP), unknown nonlinear systems, neural network (NN), supervised learning, model reference adaptive control (MRAC), robust control.
1. INTRODUCTION
During past several decades, reinforcement learning (RL) has gained a great deal of research attention in the artiﬁcial intelligence community. In the control system society, approximate/adaptive dynamic programming (ADP) Werbos (1992) (also called adaptive critic design), which combines RL and the adaptive control, has been employed to address the optimal regulation issue ﬁrstly via the actioncritic network framework. Fruitful results Chen et al. (2019); Lian et al. (2016); Ha et al. (2018); Wang et al. (2020); Pang & Jiang (2019); Si et al. (2001) have been reported on ADP in recent years.
Among the aforementioned results, an online ADP method Si et al. (2001) has been developed with no requirement of system dynamics. Convergence of this algorithm has been analyzed via the Lyapunov extension theorem Liu et al. (2012). On this basis, He et al. (2012) have further proposed a new ADP framework with an additional reference/goal network integrated into the action-critic network. Moreover, this algorithm has witnessed extensive studies in terms of optimal tracking control Yang et al. (2009); Ni et al. (2013); Mu et al. (2017) as well.
Model reference adaptive control (MRAC) aims at enforcing the controlled system to track a desired reference model rather than a given tracking trajectory, so that the closed-loop control system inherits the characteristics of the reference model. Optimal MRAC is therefore well worth investigating beyond ordinary tracking control. Among the state-of-the-art developments on this topic, only Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018) have developed ADP-based optimal MRAC approaches.
⋆ This work was supported in part by the National Natural Science Foundation of China under Grant 61873248, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2017CFA030 and Grant 2015CFA010, and in part by the 111 Project of China under Grant B17040.
Owing to the presence of the reference input in MRAC, there invariably exists a feedforward control term that depends on the input dynamics of the system, and the input dynamics must be derived via identification. To obviate this requirement, the change of the reference input in Radac et al. (2017, 2018) is ignored during the learning process, while Fu et al. (2017); Wang et al. (2018) do not consider the uncertainty resulting from the identification error. As such, it remains a challenge to investigate an ADP-based MRAC method that is robust to such uncertainty.
On the other hand, at the beginning of the training phase, online ADP methods easily suffer from inefficiency or a high failure rate under unknown dynamics Zhao et al. (2013); Fathinezhad et al. (2016). Such inefficiency or a high failure rate is an unacceptable and fatal risk in real-time control.
Motivated by the above discussions, we develop an online adaptive critic robust control method for discrete-time nonlinear systems with unknown dynamics. This method ensures that the closed-loop control system is robust to uncertainty and learns with high efficiency.
The main contributions of this study include the following two aspects.
(1) In contrast to the existing online ADP methods Yang et al. (2009); Ni et al. (2013); Mu et al. (2017), our developed control method greatly reduces the failure rate and improves the learning efﬁciency via a critic network and a new supervised action network.
(2) Unlike Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018), our developed control method guarantees robustness to the uncertainties resulting from identification and the exterior disturbance by introducing the compensation control into the learning process.
The outline of this paper is arranged as follows. The problem is described in Section 2. Section 3 presents the online adaptive critic robust control method. Section 4 provides the comparative simulation, and Section 5 states the conclusions.
2. PROBLEM FORMULATIONS
Consider the following discrete-time nonlinear system:
x_i(t+1) = x_{i+1}(t), i = 1, 2, ..., n-1,
x_n(t+1) = f(x(t)) + g(x(t)) u(t) + d(t),    (1)
in which x(t) = [x_1^T(t), x_2^T(t), ..., x_n^T(t)]^T ∈ R^{nm} denotes the state with x_i(t) ∈ R^m, f: R^{nm} → R^m and g: R^{nm} → R^{m×m} are unknown smooth nonlinear functions, u(t) ∈ R^m represents the control input, and d(t) ∈ R^m denotes an unknown persistent disturbance. Note that, under full-state feedback linearization, general nonlinear systems can be converted to the form (1) via a coordinate transformation.
Assumption 1: The nonlinear function g(x(t)) is bounded and nonsingular for all x(t).
Define a reference model as
x_{ri}(t+1) = x_{r,i+1}(t), i = 1, 2, ..., n-1,
x_{rn}(t+1) = A_r x_r(t) + B_r u_r(t),    (2)
where x_r(t) = [x_{r1}^T(t), x_{r2}^T(t), ..., x_{rn}^T(t)]^T ∈ R^{nm} denotes the reference state with x_{ri}(t) ∈ R^m, A_r ∈ R^{m×nm} and B_r ∈ R^{m×m} are the constant matrices of the reference model, and u_r(t) is the reference control input. Here, x_r(t) and u_r(t) are assumed to be bounded.
The objective of this paper is to design an optimal control law u(t) that enables the system (1) to optimally track the reference model (2) in behavior. Subtracting (2) from (1) yields the model reference tracking error dynamics
e_i(t+1) = e_{i+1}(t), i = 1, 2, ..., n-1,
e_n(t+1) = f(t) + g(t) u(t) + d(t) - A_r x_r(t) - B_r u_r(t),    (3)
where e(t) = x(t) - x_r(t) denotes the model reference tracking error with e_i(t) = x_i(t) - x_{ri}(t).
To realize the optimum, one needs to minimize the performance index (cost) function
J(t) = \sum_{k=t}^{\infty} γ^{k-t} r(k),    (4)
in which γ is a discount factor and
r(t) = e^T(t) Q e(t) + u^T(t) R u(t)
is defined as the utility function (reward) with positive definite symmetric matrices Q and R.
In accordance with Bellman's optimality principle, the optimal cost function J*(t) satisfies the following Bellman equation:
J*(t) = min_{u(t)} { r(t) + γ J*(t+1) }.    (5)
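As a minimal numeric illustration of (4) and (5) (not the authors' code), the sketch below evaluates a truncated discounted cost and checks the Bellman recursion J(t) = r(t) + γJ(t+1) for a fixed reward sequence; the helper names `utility` and `discounted_cost` are ours.

```python
import numpy as np

def utility(e, u, Q, R):
    # r(t) = e^T(t) Q e(t) + u^T(t) R u(t), the quadratic utility below (4)
    return float(e @ Q @ e + u @ R @ u)

def discounted_cost(rewards, gamma):
    # J(t) = sum_{k=t}^{inf} gamma^(k-t) r(k), truncated to the given horizon
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Toy check of the Bellman recursion (5) for a fixed (already chosen) policy
gamma = 0.95
rewards = [1.0, 0.5, 0.25]
J0 = discounted_cost(rewards, gamma)
J1 = discounted_cost(rewards[1:], gamma)
assert abs(J0 - (rewards[0] + gamma * J1)) < 1e-12
```

With the minimization over u(t) added back, (5) is exactly what the critic network approximates below.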
Due to the unknown dynamics of (1), it is difficult to solve the Bellman equation (5). To overcome this difficulty, the ADP-based MRAC methods Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018) have been proposed. However, Radac et al. (2017, 2018) do not provide real-time control performance, and Fu et al. (2017); Wang et al. (2018) do not consider the system uncertainty resulting from identification. On the other hand, inefficiency or a high failure rate is always present in the online ADP methods Yang et al. (2009); Ni et al. (2013); Mu et al. (2017), which is an unacceptable and fatal risk in real-time control.

Fig. 1. Online adaptive critic robust control structure diagram.
3. ADAPTIVE CRITIC ROBUST CONTROL
In this section, an online adaptive critic robust control method is developed to achieve both robustness to uncertainty and learning efficiency. Its control structure diagram is depicted in Fig. 1.
Define a filtered model reference tracking error as
ē(t) = e_n(t) + λ_1 e_{n-1}(t) + ... + λ_{n-1} e_1(t),    (6)
where λ_1, ..., λ_{n-1} are constants such that z^{n-1} + λ_1 z^{n-2} + ... + λ_{n-1} is stable. Then, the filtered model reference tracking error dynamics can be formulated as
ē(t+1) = f(t) + g(t) u(t) + d(t) - A_r x_r(t) - B_r u_r(t) + λ_1 e_n(t) + ... + λ_{n-1} e_2(t).    (7)
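The weighted sum (6) is straightforward to compute; a minimal numpy sketch (the function name is ours) for arbitrary n and m:

```python
import numpy as np

def filtered_error(e_blocks, lambdas):
    """Filtered model reference tracking error (6):
    e_bar = e_n + lambda_1 * e_{n-1} + ... + lambda_{n-1} * e_1.
    e_blocks = [e_1, ..., e_n]; lambdas = [lambda_1, ..., lambda_{n-1}]."""
    e_bar = np.asarray(e_blocks[-1], dtype=float).copy()
    for i, lam in enumerate(lambdas):          # lambda_i multiplies e_{n-i}
        e_bar += lam * np.asarray(e_blocks[-2 - i], dtype=float)
    return e_bar

# n = 3, m = 1 example with made-up error values
e1, e2, e3 = np.array([0.2]), np.array([0.1]), np.array([0.05])
e_bar = filtered_error([e1, e2, e3], [0.5, 0.25])   # e3 + 0.5*e2 + 0.25*e1
```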
An adaptive critic robust control law is designed as
u(t) = ĝ^{-1}(t) (u_s(t) + k_v ē(t) + A_r x_r(t) + B_r u_r(t) - λ_1 e_n(t) - ... - λ_{n-1} e_2(t)) + u_a(t),    (8)
where k_v ∈ R^{m×m} is the gain matrix, u_a(t) denotes a neural network (NN) control term, u_s(t) represents a compensation control term, and ĝ(t) is the estimate of g(t). Note that ĝ(t) is usually obtained by a model identification method Zhao et al. (2016); Jiang et al. (2018). According to Assumption 1 and the results of Wang et al. (2002), it is deduced that ĝ(t) is also bounded away from singularity.
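To make the composition of (8) concrete, here is a rough sketch (with made-up numbers and helper names of our own) assembling the control input from its feedforward, feedback, compensation, and NN terms:

```python
import numpy as np

def control_law(g_hat_inv, u_s, e_bar, k_v, Ar, Br, xr, ur,
                e_blocks, lambdas, u_a):
    """Adaptive critic robust control law (8).
    e_blocks = [e_1, ..., e_n]; lambdas = [lambda_1, ..., lambda_{n-1}]."""
    feed = u_s + k_v @ e_bar + Ar @ xr + Br @ ur
    for i, lam in enumerate(lambdas):   # subtract lambda_1 e_n, ..., lambda_{n-1} e_2
        feed = feed - lam * e_blocks[-1 - i]
    return g_hat_inv @ feed + u_a

# n = 2, m = 1 numeric example with illustrative values
u = control_law(g_hat_inv=np.array([[2.0]]), u_s=np.array([0.0]),
                e_bar=np.array([0.1]), k_v=np.array([[0.1]]),
                Ar=np.array([[0.5, 0.2]]), Br=np.array([[1.0]]),
                xr=np.array([1.0, 2.0]), ur=np.array([0.3]),
                e_blocks=[np.array([0.05]), np.array([0.1])],
                lambdas=[0.3], u_a=np.array([0.0]))
```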
A desirable value of u(t) is given by
u_d(t) = g^{-1}(t) (u_s(t) + k_v ē(t) - f(t) - d(t) + A_r x_r(t) + B_r u_r(t) - λ_1 e_n(t) - ... - λ_{n-1} e_2(t)).    (9)
Using (9) and substituting (8) into (7) yields
ē(t+1) = k_v ē(t) + g(t)(u(t) - u_d(t))
       = k_v ē(t) + f_1(t) + g(t) u_a(t) + u_s(t) + d_1(t),    (10)
where
f_1(t) = f(t) + (k_v - λ_1 I_m) x_n(t) + (k_v λ_1 - λ_2 I_m) x_{n-1}(t) + ... + (k_v λ_{n-2} - λ_{n-1} I_m) x_2(t) + k_v λ_{n-1} x_1(t),
d_1(t) = g(t)(ĝ^{-1}(t) - g^{-1}(t)) (u_s(t) + A_r x_r(t) + B_r u_r(t) + (λ_1 I_m - k_v) x_{rn}(t) + (λ_2 I_m - k_v λ_1) x_{r,n-1}(t) + ... + (λ_{n-1} I_m - k_v λ_{n-2}) x_{r2}(t) - k_v λ_{n-1} x_{r1}(t)) + d(t),
and I_m ∈ R^{m×m} is the identity matrix. According to the results of Fu et al. (2018), it is inferred from Assumption 1 that d_1(t) is bounded.
To meet the optimal control requirement, the critic network and the supervised action network are constructed as follows.
Since it is intractable to acquire the analytical solution of J*(t) by solving (5), an NN is employed to approximate the cost function J(t) as follows:
J(t) = w_c^{*T}(t) φ_c(v_c^{*T}(t) z_c(t)) + ε_c,    (11)
where z_c(t) = [e^T(t), u_a^T(t)]^T denotes the input of the critic network with h_c hidden-layer neurons, φ_c(·) represents the activation function of the critic network, w_c^*(t) ∈ R^{h_c×1} and v_c^*(t) ∈ R^{(nm+m)×h_c} denote the ideal weights, and ε_c is the critic network approximation error.
Similarly, since w_c^{*T}(t) and v_c^{*T}(t) cannot be obtained directly, the estimate of J(t) is constructed as
Ĵ(t) = w_c^T(t) φ_c(v_c^T(t) z_c(t)),    (12)
where w_c(t) and v_c(t) are the estimates of w_c^*(t) and v_c^*(t).
Due to the unknown dynamics of (10), the supervised action network u_a(t) has the NN representation
u_a(t) = φ_{a2}(w_a^{*T}(t) φ_{a1}(v_a^{*T}(t) z_a(t))) + ε_a,    (13)
where z_a(t) = x(t) denotes the input of the supervised action network with h_a hidden-layer neurons, φ_{a1}(·) and φ_{a2}(·) represent the activation functions, w_a^*(t) ∈ R^{h_a×m} and v_a^*(t) ∈ R^{nm×h_a} denote the ideal weights, and ε_a is the supervised action network approximation error.
Since the ideal weights w_a^{*T}(t) and v_a^{*T}(t) cannot be obtained directly, the actual control term u_a(t) is constructed as
u_a(t) = φ_{a2}(w_a^T(t) φ_{a1}(v_a^T(t) z_a(t))),    (14)
where w_a(t) and v_a(t) are the estimates of w_a^*(t) and v_a^*(t). For simplicity, φ_c(t), φ_{a1}(t), and φ_{a2}(t) are used to represent φ_c(v_c^T(t) z_c(t)), φ_{a1}(v_a^T(t) z_a(t)), and φ_{a2}(w_a^T(t) φ_{a1}(v_a^T(t) z_a(t))), respectively.
The prediction error of the supervised action network is represented as
e_a(t) = (Ĵ(t) - U_c) I_{m×1} + f_1(t) + g(t) u_a(t),    (15)
where U_c = 0 is the ultimate cost objective and I_{m×1} ∈ R^{m×1} is a vector whose elements are all 1.
Remark 1: There are two targets in the supervised action network design. One is to minimize the error between the cost function estimate Ĵ(t) and the ultimate cost objective U_c; its motivation is that Ĵ(t) approximates the optimal cost function J*(t). The other is to minimize the error between the output of the action network and g^{-1}(t) f(t), which is similar to supervised learning or adaptive NN control.
Since no prior knowledge of f(t) and g(t) is available, by using (10), (15) is reformulated as
e_a(t) = (Ĵ(t) - U_c) I_{m×1} + ē(t+1) - k_v ē(t) - u_s(t).    (16)
Define its objective function as
E_a(t) = (1/2) e_a^T(t) e_a(t).    (17)
The weights of the supervised action network are updated by
Δw_a(t) = -η_a φ_{a1}(t) e_a^T(t) w_{ac}(t) diag(φ'_{a2}(t)),    (18a)
Δv_a(t) = -η_a z_a(t) e_a^T(t) w_{ac}(t) diag(φ'_{a2}(t)) w_a^T(t) diag(φ'_{a1}(t)),    (18b)
where η_a is the learning rate, diag(·) is the diagonalization operator, φ'_{a1}(t) and φ'_{a2}(t) respectively represent the derivatives of φ_{a1}(t) and φ_{a2}(t), and w_{ac}(t) is defined as
w_{ac}(t) = α_c I_{m×1} w_c^T(t) diag(φ'_c(t)) v_{c2}^T(t) + (1 - α_c) g(t),    (19)
where α_c will be designed in detail later, φ'_c(t) represents the derivative of φ_c(t), and the matrix v_{c2}(t) ∈ R^{m×h_c} satisfies v_c(t) = [v_{c1}^T(t), v_{c2}^T(t)]^T with v_{c1}(t) ∈ R^{nm×h_c}.
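A rough numpy sketch of one supervised action update in the spirit of (18a)-(18b), assuming tanh activations (as used in the simulation section) and treating the combined path gain w_ac of (19) as a given matrix; the function names are ours and this is a simplified illustration, not the authors' implementation:

```python
import numpy as np

def action_forward(wa, va, za):
    # u_a = phi_a2(wa^T phi_a1(va^T za)), eq. (14), with tanh activations
    h = np.tanh(va.T @ za)              # hidden layer, shape (ha,)
    return np.tanh(wa.T @ h), h         # output, shape (m,)

def action_update(wa, va, za, e_a, w_ac, eta_a):
    # Gradient steps through the combined path gain w_ac from (19)
    ua, h = action_forward(wa, va, za)
    dphi2 = 1.0 - ua ** 2                       # tanh'(x) = 1 - tanh(x)^2
    delta = (e_a @ w_ac) * dphi2                # back-propagated error, (m,)
    wa_new = wa - eta_a * np.outer(h, delta)    # cf. (18a)
    dphi1 = 1.0 - h ** 2
    va_new = va - eta_a * np.outer(za, (wa @ delta) * dphi1)   # cf. (18b)
    return wa_new, va_new

rng = np.random.default_rng(1)
wa = rng.uniform(-1, 1, (3, 1))     # ha = 3, m = 1
va = rng.uniform(-1, 1, (2, 3))     # nm = 2
za = np.array([0.1, -0.2])
w_ac = np.array([[0.8]])            # illustrative combined gain for m = 1
e_a = np.array([0.4])
wa2, va2 = action_update(wa, va, za, e_a, w_ac, eta_a=0.001)
```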
In the critic network, via the Bellman equation (5), the prediction error can be represented as
e_c(t) = γ Ĵ(t) - Ĵ(t-1) + r(t).    (20)
Define its objective function as
E_c(t) = (α_c / 2) e_c^2(t).    (21)
By using the gradient descent algorithm, the weights of the critic network are updated by
Δw_c(t) = -η_c α_c γ e_c(t) φ_c(t),    (22a)
Δv_c(t) = -η_c α_c γ e_c(t) z_c(t) w_c^T(t) diag(φ'_c(t)).    (22b)
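The updates (20)-(22b) can be sketched in a few lines of numpy, again assuming a tanh hidden layer; this is an illustrative simplification (sequential rather than simultaneous weight updates, names ours), not the authors' code:

```python
import numpy as np

def critic_forward(wc, vc, z):
    # J_hat = wc^T tanh(vc^T z): eq. (12) with a tanh hidden layer
    phi = np.tanh(vc.T @ z)
    return float(wc @ phi), phi

def critic_update(wc, vc, z, J_prev, r, gamma, eta_c, alpha_c):
    # Prediction error (20) and gradient-descent updates (22a), (22b)
    J_hat, phi = critic_forward(wc, vc, z)
    ec = gamma * J_hat - J_prev + r
    wc = wc - eta_c * alpha_c * gamma * ec * phi
    dphi = 1.0 - phi ** 2                        # tanh'(x) = 1 - tanh(x)^2
    vc = vc - eta_c * alpha_c * gamma * ec * np.outer(z, wc * dphi)
    return wc, vc, ec

rng = np.random.default_rng(0)
wc, vc = rng.uniform(-1, 1, 4), rng.uniform(-1, 1, (3, 4))
z = np.array([0.1, -0.2, 0.3])
errors = []
for _ in range(300):    # repeated updates shrink the prediction error
    wc, vc, ec = critic_update(wc, vc, z, J_prev=0.0, r=0.05,
                               gamma=0.95, eta_c=0.01, alpha_c=1.0)
    errors.append(abs(ec))
```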
Design a learning-schedule factor α_c as
α_c = 1, if (1/N_α) \sum_{k=t-N_α+1}^{t} ||ē(k)|| < ε_α,
α_c = 0, if (1/N_α) \sum_{k=t-N_α+1}^{t} ||ē(k)|| ≥ ε_α,    (23)
where ε_α > 0 is a design constant and N_α is a positive integer.
It is worth pointing out that the traditional online action-critic framework easily leads to inefficiency in real-time control problems. Specifically, at the beginning of the online training phase, the state of the system (1) may be far away from the reference state, which creates a training failure risk. This case can be viewed as (1/N_α) \sum_{k=t-N_α+1}^{t} ||ē(k)|| ≥ ε_α, i.e., α_c = 0, in which the supervised action network guides the system state back to a neighborhood of the reference state. Once (1/N_α) \sum_{k=t-N_α+1}^{t} ||ē(k)|| < ε_α holds, the online adaptive critic learning takes over to further derive the optimal control policy. The high failure risk at the beginning of the online training phase is thus avoided.
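The switch (23) is just a threshold on a moving average of ||ē||; a minimal sketch (class name ours):

```python
import numpy as np
from collections import deque

class LearningSchedule:
    """Learning-schedule factor alpha_c of (23): 1 when the moving average
    of ||e_bar|| over the last N_alpha steps falls below eps_alpha, else 0."""
    def __init__(self, N_alpha, eps_alpha):
        self.window = deque(maxlen=N_alpha)
        self.eps_alpha = eps_alpha

    def __call__(self, e_bar):
        self.window.append(np.linalg.norm(e_bar))
        return 1.0 if np.mean(list(self.window)) < self.eps_alpha else 0.0

sched = LearningSchedule(N_alpha=10, eps_alpha=0.7)
a0 = sched(np.array([2.0]))        # state far from the reference: 0.0
for _ in range(20):
    a1 = sched(np.array([0.01]))   # error has settled: 1.0
```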
It can be concluded that the learning process is convergent and e(t) is uniformly ultimately bounded (UUB) with a bound depending on d_1(t); the proof is omitted here to save space. We can then deduce from the UUB property that ē(t) and ē(t) - k_v ē(t-1) are also bounded. Without loss of generality, let
||ē(t+1) - k_v ē(t)|| ≤ δ_e,    (24)
where δ_e > 0. From (10), we have ||f_1(t) + g(t)u_a(t) + d_1(t)|| ≤ δ_e. Let
d_2(t) = f_1(t) + g(t)u_a(t) + d_1(t).    (25)
Then, we get
ē(t+1) = k_v ē(t) + u_s(t) + d_2(t).    (26)
Remark 2: When the weights of the critic network or the supervised action network are close to a convergent region, it is necessary to reduce the learning rates Ni et al. (2013); Mu et al. (2017). Without loss of generality, let (1/N_s) \sum_{k=t-N_s+1}^{t} ||w_c(k) - w_c(k-1)|| < ε_s represent that the weights are close to the convergent region. In this case, because the learning rates are reduced, adding a weak compensation control signal u_s(t) causes a system state change that has little impact on the learning process, and (25) still holds. Thus, the linear system (26) with the persistent disturbance d_2(t) is always obtained.
Since specific information on d_2(t) is unavailable, the disturbance observer Kim et al. (2016) is designed as
d̂_2(t) = k_d ē(t) - z_d(t),
z_d(t+1) = z_d(t) + k_d((k_v - I_m) ē(t) + u_s(t) + d̂_2(t)),    (27)
in which d̂_2(t) is an estimate of d_2(t), k_d ∈ R^{m×m} is a diagonal observer matrix, and z_d(t) is a new state variable. From the conclusion of Kim et al. (2016), it is known that d̂_2(t) converges to d_2(t).
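A small closed-loop sketch of (27) driven by the error dynamics (26): for a constant disturbance, substituting (26) into (27) gives d̂_2(t+1) = (I - k_d) d̂_2(t) + k_d d_2, so the estimate converges for a suitable diagonal k_d. The function name and numbers below are illustrative, not the authors' code:

```python
import numpy as np

def observer_step(e_bar, z_d, u_s, k_d, k_v, m):
    """One step of the disturbance observer (27).
    Returns the estimate d2_hat(t) and the next internal state z_d(t+1)."""
    d2_hat = k_d @ e_bar - z_d
    z_d_next = z_d + k_d @ ((k_v - np.eye(m)) @ e_bar + u_s + d2_hat)
    return d2_hat, z_d_next

# Toy closed loop: e_bar(t+1) = k_v e_bar(t) + u_s(t) + d2 with constant d2;
# d2_hat should converge to d2.
m = 1
k_v, k_d = 0.1 * np.eye(m), 0.3 * np.eye(m)   # k_d = 0.3 as in Section 4
d2 = np.array([0.5])
e_bar, z_d, u_s = np.array([1.0]), np.zeros(m), np.zeros(m)
for _ in range(200):
    d2_hat, z_d = observer_step(e_bar, z_d, u_s, k_d, k_v, m)
    e_bar = k_v @ e_bar + u_s + d2            # error dynamics (26)
```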
Inspired by Du et al. (2016), we design a chattering-free compensation control as
u_s(t) = (q_{s1} - k_v) ē(t) - q_{s2} sig^{α_s}(ē(t)) - d̂_2(t), if α_c = 1 and (1/N_s) \sum_{k=t-N_s+1}^{t} ||w_c(k) - w_c(k-1)|| < ε_s,
u_s(t) = 0, otherwise,    (28)
where 0 < q_{s1} < 1, 0 < q_{s2} < 1, 0 < α_s < 1, sig^{α_s}(·) = sgn(·)|·|^{α_s}, ε_s > 0 is a design constant, and N_s is a positive integer.
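For concreteness, a sketch of the fractional-power term sig^{α_s}(·) and the active branch of (28), with the activation condition collapsed into a boolean flag; helper names and numbers are ours:

```python
import numpy as np

def sig_alpha(x, alpha_s):
    # sig^alpha(x) = sgn(x) * |x|^alpha, applied elementwise
    return np.sign(x) * np.abs(x) ** alpha_s

def compensation(e_bar, d2_hat, k_v, q_s1, q_s2, alpha_s, active):
    """Chattering-free compensation term (28); `active` stands for the
    alpha_c = 1 and weight-convergence condition."""
    if not active:
        return np.zeros_like(e_bar)
    m = len(e_bar)
    return (q_s1 * np.eye(m) - k_v) @ e_bar \
        - q_s2 * sig_alpha(e_bar, alpha_s) - d2_hat

# m = 1 example with q_s1 = 0.6, q_s2 = 0.4 (the values used in Section 4)
us = compensation(np.array([0.2]), np.array([0.05]),
                  k_v=0.1 * np.eye(1), q_s1=0.6, q_s2=0.4,
                  alpha_s=0.5, active=True)
```

Because |ē|^{α_s} with 0 < α_s < 1 replaces a pure sign term, u_s(t) stays continuous near ē = 0, which is what makes the compensation chattering-free.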
According to the results of Du et al. (2016), the compensation control u_s(t) is a chattering-free signal with disturbance attenuation capability. Hence, u_s(t) ensures the system's robustness to uncertainty by compensating for d_2(t). As such, our developed online adaptive critic robust control method not only achieves high-efficiency optimal control in real time, but also remains robust to uncertainty. The procedure to realize this method is summarized in Algorithm 1.
4. SIMULATION

To verify the superiority of the theoretical results, a simulation example of our developed method is conducted by comparing it with a traditional online ADP method. The dynamics of a one-link robot manipulator is considered:
G θ̈ + D θ̇ + MgL sin(θ) = τ,    (29)
where g = 9.8 m/s² is the gravitational acceleration, D = 1 represents the viscous friction coefficient, L = 1 m stands for the length of the link, M = 1 kg represents the payload mass, G = 1 kg·m² stands for the moment of inertia, θ is the angular position, τ is the torque, and d(t) in (30) below is a disturbance. Note that these dynamics are unavailable for the controller design.
Discretizing (29) via the Euler method with sampling interval T_s = 0.05 s yields
x_1(t+1) = x_2(t),
x_2(t+1) = ((2G - D T_s)/G) x_2(t) - ((G - D T_s)/G) x_1(t) - (MgL T_s²/G) sin(x_1(t)) + (T_s²/G) u(t) + (T_s²/G) d(t),    (30)
where d(t) = 0.08 cos(1.8 T_s t - π/4) sin(T_s t + π/3).
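The discretized plant (30) is easy to simulate directly; a minimal sketch with the paper's parameter values (the open-loop constant torque is our own illustrative input, not part of the paper's experiment):

```python
import numpy as np

# Euler-discretized one-link manipulator (30); parameters from the paper
G_, D_, M_, L_, g_, Ts = 1.0, 1.0, 1.0, 1.0, 9.8, 0.05

def plant_step(x1, x2, u, t):
    # disturbance d(t) = 0.08 cos(1.8*Ts*t - pi/4) sin(Ts*t + pi/3)
    d = 0.08 * np.cos(1.8 * Ts * t - np.pi / 4) * np.sin(Ts * t + np.pi / 3)
    x1_next = x2
    x2_next = ((2 * G_ - D_ * Ts) / G_) * x2 - ((G_ - D_ * Ts) / G_) * x1 \
        - (M_ * g_ * L_ * Ts ** 2 / G_) * np.sin(x1) \
        + (Ts ** 2 / G_) * u + (Ts ** 2 / G_) * d
    return x1_next, x2_next

x1, x2 = 0.0, 0.0
for t in range(100):        # open-loop response to a constant torque
    x1, x2 = plant_step(x1, x2, u=1.0, t=t)
```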
The reference model is given by
x_{r1}(t+1) = x_{r2}(t),
x_{r2}(t+1) = (2 - 1.5 T_s) x_{r2}(t) + (1.5 T_s - 2.5 T_s² - 1) x_{r1}(t) + T_s² u_r(t),    (31)
where u_r(t) = sin(0.2 T_s t) cos(0.4 T_s t + π/2).
Algorithm 1:
/* i_tri: maximal number of trials; t_c: cumulative time for breaking; t_t: simulation terminal time; i_c, i_a: maximal iteration numbers of the critic network and the supervised action network, respectively; E_ct, E_at: objective function thresholds of the critic network and the supervised action network, respectively */
1): set the coefficients λ_1, λ_2, ..., λ_{n-1}, γ, Q, R, k_v, k_d, η_a, η_c, N_α, N_s, ε_α, ε_s, ε_e, q_s1, and q_s2;
2): for 1 to i_tri do /* trial */
3): initialize x(0) and x_r(0);
4): initialize w_c(0), v_c(0), w_a(0), and v_a(0) randomly;
5): u_s(0) = 0 and u_a(0) ← (14);
6): while t ≤ t_t do
7): w_a(t) = w_a(t-1), v_a(t) = v_a(t-1), w_c(t) = w_c(t-1), and v_c(t) = v_c(t-1);
8): calculate u(t-1) ← (8);
9): x(t) ← (1), x_r(t) ← (2), e(t) ← (3), and ē(t) ← (6);
10): if t > t_c and (1/t_c) \sum_{k=t-t_c+1}^{t} ||ē(k)|| > ε_e then
11): break this trial;
12): endif
13): calculate Ĵ(t) ← (12), u_a(t) ← (14), d̂_2(t) ← (27), u_s(t) ← (28), and α_c ← (23);
14): r(t) = e^T(t) Q e(t) + u^T(t) R u(t);
15): calculate E_c(t) and set i = 0;
16): while (i < i_c) and (E_c(t) > E_ct) do
17): update w_c(t) = w_c(t-1) + Δw_c(t) and v_c(t) = v_c(t-1) + Δv_c(t);
18): Ĵ(t) ← (12);
19): if (1/N_s) \sum_{k=t-N_s+1}^{t} ||w_c(k) - w_c(k-1)|| < ε_s then
20): reduce η_a and η_c;
21): else
22): reset η_a and η_c;
23): endif
24): calculate E_c(t) and set i = i + 1;
25): endwhile /* critic network */
26): calculate E_a(t) and set j = 0;
27): while (j < i_a) and (E_a(t) > E_at) do
28): update w_a(t) = w_a(t-1) + Δw_a(t) and v_a(t) = v_a(t-1) + Δv_a(t);
29): u_a(t) ← (14) and u(t) ← (8);
30): Ĵ(t) ← (12);
31): calculate E_a(t) and set j = j + 1;
32): endwhile /* action network */
33): t = t + 1;
34): endwhile
35): endfor
The matrices Q and R are chosen as diag{0.5, 0.5} and 0.3, respectively. The critic network and the supervised action network are constructed as two three-layer back-propagation NNs with structures 3-4-1 and 2-3-1, respectively. The activation functions φ_c(·) and φ_a(·) are selected as the hyperbolic tangent function. The initial weights of both networks are randomly generated from [-1, 1]. Some parameters used in the simulation of the adaptive critic robust control method are presented in Table 1. In addition, by combining the results of Kim et al. (2016); Du et al. (2016), the observer and compensation control parameters are chosen as k_d = 0.3, q_s1 = 0.6, and q_s2 = 0.4.
The trajectories of the system state x_1(t) and the reference state x_{r1}(t) are presented in Fig. 2, and the curve of the model reference tracking error e_1(t) is depicted in Fig. 3. It is observed that the system (30) exactly tracks the reference model (31) in behavior by using our developed method. The control input curve is shown in Fig. 4.

Fig. 2. System state x_1(t) and reference state x_{r1}(t).
Fig. 3. Model reference tracking error e_1(t).
Fig. 4. Control input curve.
To highlight the better learning efficiency and robustness of our developed online adaptive critic robust control method, a comparative simulation experiment is conducted against the previous method of Ni et al. (2013). Since Ni et al. (2013) is a tracking control method, the trajectory of the reference model is obtained beforehand, and then the tracking control is implemented. The reference network, the critic network, and the action network are constructed as three similar NNs with structures 5-4-1, 6-5-1, and 4-3-1, respectively. Let u(t) = ū(t)/ĝ(t) in (30), in which ū(t) is the output of the action network. Some parameters of this method selected in the simulation are presented in Table 2, in which η_r, i_r, and E_rt represent the learning rate, the maximal iteration number, and the objective function threshold of the reference network, respectively. The other parameter settings and initial conditions, such as λ_1, γ, k_v, the initial states, and the initial weights, are the same as those of our developed method.

Table 1. Parameters in the example
λ_1 = 0.3; γ = 0.95; k_v = 0.1; η_a = 0.001; η_c = 0.004; i_a = 300; i_c = 200; E_at = 1e-4; E_ct = 6e-4; N_α = 10; N_s = 10; ε_α = 0.7; ε_e = 1; ε_s = 0.1.

Fig. 5. System state x_1(t) and reference state x_{r1}(t) in Ni et al. (2013).
Table 2. Parameters in the example for Ni et al. (2013)
η_a = 0.001; η_c = 0.004; η_r = 0.001; i_a = 300; i_c = 200; i_r = 150; E_at = 1e-4; E_ct = 6e-4; E_rt = 2e-4.
By using the method proposed in Ni et al. (2013), we obtain the trajectories of x_1(t) and x_{r1}(t) and the model reference tracking error curve, shown in Figs. 5 and 6, respectively. Comparing Figs. 2 and 3 with Figs. 5 and 6, it can be seen that our developed method produces a smaller model reference tracking error than the method of Ni et al. (2013). This means that our developed method has superior robustness.
Table 3. Simulation results of both methods
Methods | Number of experiments | Number of trials | Success rate (%)
Traditional method | 100 | 20 | 53
Our method | 100 | 20 | 100
In this comparative simulation study, a run consists of a maximum of 20 consecutive trials. A run is considered successful if (1/N_α) \sum_{k=t-N_α+1}^{t} ||ē(k)|| ≤ 0.07 holds for all t > 900. Otherwise, if the controller is unable to learn to make the system (30) track the reference model (31) in behavior within 20 trials, the run is considered unsuccessful. We run 100 experiments for the traditional method Ni et al. (2013) and for our developed method; the simulation results are listed in Table 3. It is observed that, in contrast to the traditional online ADP method Ni et al. (2013), our developed method greatly reduces the learning failure rate.

Fig. 6. Model reference tracking error e_1(t) in Ni et al. (2013).
5. CONCLUSION
An online adaptive critic robust control method has been developed to handle the optimal MRAC problem for nonlinear systems. The online adaptive critic robust controller consists of the critic network, the supervised action network, and the compensation control term. Via the newly defined learning-schedule factor, the controller not only achieves high-efficiency learning and real-time optimality, but also remains robust to uncertainty. A comparative simulation has been provided to show the superiority of our developed method. Further investigation is recommended into the stability analysis, optimization of the algorithm, and applications to real systems.
REFERENCES
Chen, X., Wang, W., Cao, W., & Wu, M. (2019). Gaussian-kernel-based adaptive critic design using two-phase value iteration. Information Sciences, 482, 139-155.
Du, H., Yu, X., Chen, M. Z. Q., & Li, S. (2016). Chattering-free discrete-time sliding mode control. Automatica, 68, 87-91.
Fathinezhad, F., Derhami, V., & Rezaeian, M. (2016). Supervised fuzzy reinforcement learning for robot navigation. Applied Soft Computing, 40, 33-41.
Fu, H., Chen, X., & Wang, W. (2017). A model reference adaptive control with ADP-to-SMC strategy for unknown nonlinear systems. Proceedings of the 11th Asian Control Conference, 1537-1542.
Fu, H., Chen, X., Wang, W., & Wu, M. (2020). MRAC for unknown discrete-time nonlinear systems based on supervised neural dynamic programming. Neurocomputing, 384, 130-141.
Ha, M., Wang, D., & Liu, D. (2018). Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, doi: 10.1109/TSMC.2018.2868510.
He, H., Ni, Z., & Fu, J. (2012). A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing, 78(1), 3-13.
Jiang, H., Zhang, H., Xiao, G., & Cui, X. (2018). Data-based approximate optimal control for nonzero-sum games of multi-player systems using adaptive dynamic programming. Neurocomputing, 275, 192-199.
Kim, K. & Rew, H. (2013). Reduced order disturbance observer for discrete-time linear systems. Automatica, 49(4), 968-975.
Lian, C., Xu, X., Chen, H., & He, H. (2016). Near-optimal tracking control of mobile robots via receding-horizon dual heuristic programming. IEEE Transactions on Cybernetics, 46(11), 2484-2496.
Liu, F., Sun, J., Si, J., Guo, W., & Mei, S. (2012). A boundedness result for the direct heuristic dynamic programming. Neural Networks, 32, 229-235.
Mu, C., Ni, Z., Sun, C., & He, H. (2017). Air-breathing hypersonic vehicle tracking control based on adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 28(3), 584-598.
Mu, C., Ni, Z., Sun, C., & He, H. (2017). Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems. IEEE Transactions on Cybernetics, 47(6), 1460-1470.
Ni, Z., He, H., & Wen, J. (2013). Adaptive learning in tracking control based on the dual critic network design. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 913-928.
Pang, B. & Jiang, Z. P. (2019). Adaptive optimal control of linear periodic systems: An off-policy value iteration approach. arXiv: 1901.08650.
Radac, M. B., Precup, R. E., & Roman, R. C. (2017). Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning. International Journal of Systems Science, 48(5), 1071-1083.
Radac, M. B., Precup, R. E., & Roman, R. C. (2018). Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-learning. ISA Transactions, 73, 227-238.
Si, J. & Wang, Y. T. (2001). Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264-276.
Wang, W., Chen, X., Wang, F., & Fu, H. (2018). ADP-based model reference adaptive control design for unknown discrete-time nonlinear systems. Proceedings of the 37th Chinese Control Conference, 8049-8054.
Wang, W. Y., Chan, M. L., Hsu, C. C. J., & Lee, T. T. (2002). H∞ tracking-based sliding mode control for uncertain nonlinear systems via an adaptive fuzzy-neural approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 32(4), 483-492.
Wang, Z., Wei, Q., & Liu, D. (2020). Event-triggered adaptive dynamic programming for discrete-time multi-player games. Information Sciences, 506, 457-470.
Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White, & D. A. Sofge (Eds.), Handbook of intelligent control. New York: Van Nostrand Reinhold, (Chapter 13).
Yang, L., Si, J., Tsakalis, K. S., & Rodriguez, A. A. (2009). Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(6), 1617-1622.
Zhao, D., Wang, B., & Liu, D. (2013). A supervised actor-critic approach for adaptive cruise control. Soft Computing, 17(11), 2089-2099.
Zhao, D., Zhang, Q., Wang, D., & Zhu, Y. (2016). Experience replay for optimal control of nonzero-sum game systems with unknown dynamics. IEEE Transactions on Cybernetics, 46(3), 854-865.