Preprints of the 21st IFAC World Congress (Virtual) Berlin, Germany, July 12-17, 2020
Online Adaptive Critic Robust Control of Discrete-Time Nonlinear Systems With Unknown
Dynamics ⋆
Hao Fu ∗,∗∗ Xin Chen ∗,∗∗ Min Wu ∗,∗∗
∗ School of Automation, China University of Geosciences, Wuhan, 430074, China (e-mail: [email protected]).
∗∗ Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, 430074, China
Abstract: This paper concerns the optimal model reference adaptive control problem for unknown discrete-time nonlinear systems. For such a problem, it is challenging to improve the online learning efficiency while guaranteeing robustness to uncertainty. To this end, we develop an online adaptive critic robust control method. In this method, a critic network and a new supervised action network are constructed not only to improve the real-time learning efficiency but also to obtain the optimal control performance. By incorporating a designed compensation control term, robustness is further guaranteed through compensation of the uncertainty. A comparative simulation study is conducted to show the superiority of our developed method.
Keywords: Approximate dynamic programming (ADP), unknown nonlinear systems, neural network (NN), supervised learning, model reference adaptive control (MRAC), robust control.
1. INTRODUCTION
During the past several decades, reinforcement learning (RL) has gained a great deal of research attention in the artificial intelligence community. In the control systems community, approximate/adaptive dynamic programming (ADP) Werbos (1992) (also called adaptive critic design), which combines RL and adaptive control, was first employed to address the optimal regulation problem via the action-critic network framework. Fruitful results Chen et al. (2019); Lian et al. (2016); Ha et al. (2018); Wang et al. (2020); Pang & Jiang (2019); Si et al. (2001) have been reported on ADP in recent years.
Among the aforementioned results, an online ADP method Si et al. (2001) has been developed without requiring the system dynamics. Convergence of this algorithm has been analyzed via the Lyapunov extension theorem Liu et al. (2012). On this basis, He et al. He et al. (2012) have further proposed a new ADP framework with an additional reference/goal network integrated into the action-critic network. Moreover, this line of algorithms has also been extensively studied for optimal tracking control Yang et al. (2009); Ni et al. (2013); Mu et al. (2017).
Model reference adaptive control (MRAC) aims at forcing the controlled system to track a desired reference model rather than a prescribed trajectory, so that the closed-loop control system inherits the characteristics of the reference model. Optimal MRAC is therefore even more deserving of investigation than optimal tracking control. Among the state-of-the-art developments in this direction, only Radac et al. (2017, 2018); Fu et al. (2017);
⋆ This work was supported in part by the National Natural Science Foundation of China under Grant 61873248, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2017CFA030 and Grant 2015CFA010, and in part by the 111 project of China under Grant B17040.
Wang et al. (2018) have developed the ADP-based optimal MRAC approach.
Owing to the reference input in MRAC, there invariably exists a feedforward control term that depends on the input dynamics of the system, and these input dynamics must be obtained via identification. To obviate this requirement, changes of the reference input are ignored during the learning process in Radac et al. (2017, 2018). Fu et al. (2017); Wang et al. (2018) do not consider the uncertainty resulting from the identification error. As such, it remains a challenge to develop an ADP-based MRAC method that is robust to such uncertainty.
On the other hand, at the beginning of the training phase, this kind of online ADP method is prone to inefficiency or a high failure rate when the dynamics are unknown Zhao et al. (2013); Fathinezhad et al. (2009). Such inefficiency or a high failure rate is an unacceptable and fatal risk in real-time control.
Motivated by the above discussions, we develop an online adaptive critic robust control method for discrete-time nonlinear systems with unknown dynamics. This method ensures that the closed-loop control system has robustness to uncertainty and high-efficiency learning performance.
The main contributions of this study include the following two aspects.
(1) In contrast to the existing online ADP methods Yang et al. (2009); Ni et al. (2013); Mu et al. (2017), our developed control method greatly reduces the failure rate and improves the learning efficiency via a critic network and a new supervised action network.
(2) Unlike Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018), our developed control method well guarantees robustness to the uncertainties resulting from identification and the exterior disturbance by introducing the compensation control into the learning process.
The outline of this paper is as follows. The problem description is stated in Section 2. In Section 3, the online adaptive critic robust control method is given. Section 4 provides the comparative simulation study. Conclusions are stated in Section 5.
2. PROBLEM FORMULATIONS
Consider the following discrete-time nonlinear system:
xi(t + 1) = xi+1(t), i = 1, 2, . . . , n − 1,
xn(t + 1) = f(x(t)) + g(x(t))u(t) + d(t),    (1)
in which x(t) = [x1T (t), x2T (t), . . . , xnT (t)]T ∈ Rnm denotes the state with xi(t) ∈ Rm, f : Rnm → Rm and g : Rnm → Rm×m are unknown smooth nonlinear functions, u(t) ∈ Rm represents the control input, and d(t) ∈ Rm denotes an unknown persistent disturbance. Note that, under the full-state feedback linearization, the general nonlinear systems can be converted to the formation (1) via the coordinate transformation.
Assumption 1: The nonlinear function g(x(t)) is bounded and nonsingular for all x(t).
Define a reference model as
xri(t + 1) = xri+1(t), i = 1, 2, . . . , n − 1,
xrn(t + 1) = Arxr(t) + Brur(t),    (2)
where xr(t) = [xr1T(t), xr2T(t), . . . , xrnT(t)]T ∈ Rnm denotes the reference state with xri(t) ∈ Rm, Ar ∈ Rm×nm and Br ∈ Rm×m represent the constant matrices of the reference model, and ur(t) is the reference control input. Here, xr(t) and ur(t) are both assumed to be bounded.
The objective of this paper is to design an optimal control law u(t) such that the system (1) optimally tracks the behavior of the reference model (2). Subtracting (2) from (1) yields the model reference tracking error dynamics
ei(t + 1) = ei+1(t), i = 1, 2, . . . , n − 1,
en(t + 1) = f(t) + g(t)u(t) + d(t) − Arxr(t) − Brur(t),    (3)
where e(t) = x(t) − xr(t) denotes the model reference tracking error with ei(t) = xi(t) − xri(t).
To realize optimality, the following performance index function (cost function) needs to be minimized:
J(t) = ∑_{k=t}^{∞} γ^{k−t} r(k),    (4)
in which γ is a discount factor and r(t) = eT(t)Qe(t) + uT(t)Ru(t) is defined as the utility function (reward) with positive definite symmetric matrices Q and R.
In accordance with Bellman's optimality principle, the optimal cost function J*(t) satisfies the following Bellman equation:
J*(t) = min_{u(t)} {r(t) + γJ*(t + 1)}.    (5)
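As a concrete illustration of the cost (4) and the utility r(t), the following Python sketch evaluates a finite-horizon truncation of the discounted cost for a synthetic error/control trajectory; Q, R, γ, and the trajectory itself are illustrative placeholders rather than values prescribed by the paper at this point.

```python
import numpy as np

# Minimal sketch of the cost function (4) and utility r(t); Q, R, gamma and
# the error/control trajectories below are illustrative placeholders.
def utility(e, u, Q, R):
    """r(t) = e^T Q e + u^T R u."""
    return float(e.T @ Q @ e + u.T @ R @ u)

def discounted_cost(errors, controls, Q, R, gamma=0.95):
    """Finite-horizon truncation of J(t) = sum_{k>=t} gamma^(k-t) r(k)."""
    return sum(gamma**k * utility(e, u, Q, R)
               for k, (e, u) in enumerate(zip(errors, controls)))

# Example: J(t) computed over a short synthetic trajectory.
Q = np.diag([0.5, 0.5])
R = np.array([[0.3]])
errors = [np.array([[0.2], [0.1]]) * 0.9**k for k in range(50)]
controls = [np.array([[0.05]]) for _ in range(50)]
print(discounted_cost(errors, controls, Q, R))
```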
Due to the unknown dynamics of (1), it is difficult to solve the Bellman equation (5). To overcome this difficulty, the ADP-based MRAC methods Radac et al. (2017, 2018); Fu et al. (2017); Wang et al. (2018) have been proposed. However, Radac et al. (2017, 2018) do not provide real-time control performance, and in Fu et al. (2017); Wang et al. (2018) the system uncertainty resulting from identification is not considered.
Fig. 1. Online adaptive critic robust control structure diagram.
On the other hand, inefficiency or a high failure rate always exists in the online ADP methods Yang et al. (2009); Ni et al. (2013); Mu et al. (2017), which is an unacceptable and fatal risk in real-time control.
3. ADAPTIVE CRITIC ROBUST CONTROL
In this section, an online adaptive critic robust control method is developed to achieve both robustness to uncertainty and learning efficiency. Its control structure diagram is depicted in Fig. 1.
Define a filtered model reference tracking error as
ē(t) = en(t) + λ1en−1(t) + · · · + λn−1e1(t),    (6)
where λ1, . . . , λn−1 are constants chosen such that the polynomial z^{n−1} + λ1z^{n−2} + · · · + λn−1 is stable. Then, the filtered model reference tracking error dynamics can be formulated as
ē(t + 1) = f(t) + g(t)u(t) + d(t) − Arxr(t) − Brur(t) + λ1en(t) + · · · + λn−1e2(t).    (7)
An adaptive critic robust control law is designed as
u(t) = ĝ−1(t)(us(t) + kvē(t) + Arxr(t) + Brur(t) − λ1en(t) − · · · − λn−1e2(t)) + ua(t),    (8)
where kv ∈ Rm×m is the gain matrix, ua(t) denotes a neural network (NN) control term, us(t) represents a compensation control term, and ĝ(t) is the estimation of g(t). Note that ĝ(t) is usually obtained by the model identification method Zhao et al. (2016); Jiang et al. (2018). According to Assumption 1 and the results of Wang et al. (2002), it is deduced that ĝ(t) is also bounded away from singularity.
A desirable value of u(t) is given by
ud(t) = g−1(t)(us(t) + kvē(t) − f(t) − d(t) + Arxr(t) + Brur(t) − λ1en(t) − · · · − λn−1e2(t)).    (9)
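A minimal Python sketch of how the control law (8) is assembled is given below; the signal names, shapes, and the stacking of e(t) into blocks e1, . . . , en are assumptions made for illustration, and ĝ(t), us(t), ua(t) are taken as already available.

```python
import numpy as np

def control_law(g_hat, u_s, u_a, e_bar, e, xr, ur, Ar, Br, kv, lam):
    """Sketch of the adaptive critic robust control law (8).

    g_hat : current estimate of g(x(t))        (m x m)
    u_s   : compensation control term us(t)    (m x 1)
    u_a   : NN (action network) term ua(t)     (m x 1)
    e_bar : filtered tracking error (6)        (m x 1)
    e     : list of tracking-error blocks [e1, ..., en], each m x 1
    lam   : coefficients [lambda_1, ..., lambda_{n-1}]
    """
    n = len(e)
    # feedforward / feedback part inside the parentheses of (8)
    inner = u_s + kv @ e_bar + Ar @ xr + Br @ ur
    for i, lam_i in enumerate(lam):          # subtract lambda_1*en, ..., lambda_{n-1}*e2
        inner -= lam_i * e[n - 1 - i]
    return np.linalg.solve(g_hat, inner) + u_a   # g_hat^{-1}(.) + ua(t)
```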
Using (9) and substituting (8) into (7) yields
ē(t + 1) = kvē(t) + g(t)(u(t) − ud(t))
         = kvē(t) + f1(t) + g(t)ua(t) + us(t) + d1(t),    (10)
where f1(t) = f(t) + (kv − λ1Im)xn(t) + (kvλ1 − λ2Im)xn−1(t) + · · · + (kvλn−2 − λn−1Im)x2(t) + kvλn−1x1(t), d1(t) = g(t)(ĝ−1(t) − g−1(t))(us(t) + Arxr(t) + Brur(t) + (λ1Im − kv)xrn(t) + (λ2Im − kvλ1)xrn−1(t) + · · · + (λn−1Im − kvλn−2)xr2(t) − kvλn−1xr1(t)) + d(t), and Im ∈ Rm×m is an identity matrix. According to the results of Fu et al. (2018), it is inferred from Assumption 1 that d1(t) is bounded.
To meet the requirements of optimal control, the critic network and the supervised action network are constructed as follows.
Since it is intractable to acquire the analytical solution of J*(t) by solving (5), an NN is employed to approximate the cost function J(t) as follows:
J(t) = wc*T(t)ϕc(vc*T(t)zc(t)) + εc,    (11)
where zc(t) = [eT(t), uaT(t)]T denotes the input of the critic network with hc hidden-layer neurons, ϕc(·) represents the activation function of the critic network, wc*(t) ∈ Rhc×1 and vc*(t) ∈ R(nm+m)×hc denote the ideal weights, and εc is the critic network approximation error.
Since wc*T(t) and vc*T(t) cannot be obtained directly, the estimate of J(t) is constructed as
Ĵ(t) = wcT(t)ϕc(vcT(t)zc(t)),    (12)
where wcT(t) and vcT(t) are the estimates of wc*T(t) and vc*T(t).
Due to the unknown dynamics in (10), the supervised action network term ua(t) has the NN representation
ua(t) = ϕa2(wa*T(t)ϕa1(va*T(t)za(t))) + εa,    (13)
where za(t) = x(t) denotes the input of the supervised action network with ha hidden-layer neurons, ϕa1(·) and ϕa2(·) represent the activation functions, wa*(t) ∈ Rha×m and va*(t) ∈ Rnm×ha denote the ideal weights, and εa is the supervised action network approximation error.
Since the ideal weights wa*T(t) and va*T(t) cannot be obtained directly, the actual control term ua(t) is constructed as
ua(t) = ϕa2(waT(t)ϕa1(vaT(t)za(t))),    (14)
where waT(t) and vaT(t) are the estimates of wa*T(t) and va*T(t). For simplicity, ϕc(t), ϕa1(t), and ϕa2(t) are used to denote ϕc(vcT(t)zc(t)), ϕa1(vaT(t)za(t)), and ϕa2(waT(t)ϕa1(vaT(t)za(t))), respectively.
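The following sketch shows the forward passes (12) and (14) under the assumption of tanh activations (as used later in the simulation section); the concrete dimensions are illustrative.

```python
import numpy as np

# Minimal sketch of the critic (12) and action (14) network forward passes.
# Weight shapes follow the paper's conventions; tanh activations and the
# concrete dimensions are assumptions made for illustration.
def critic_forward(wc, vc, e, ua):
    """J_hat(t) = wc^T phi_c(vc^T zc(t)) with zc(t) = [e(t); ua(t)]."""
    zc = np.vstack([e, ua])                      # (nm + m) x 1
    return float(wc.T @ np.tanh(vc.T @ zc))      # scalar cost estimate

def action_forward(wa, va, x):
    """ua(t) = phi_a2(wa^T phi_a1(va^T za(t))) with za(t) = x(t)."""
    hidden = np.tanh(va.T @ x)                   # ha x 1
    return np.tanh(wa.T @ hidden)                # m x 1 control term
```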
The prediction error of the supervised action network is represented as
ea(t) = (Ĵ(t) − Uc)Im×1 + f1(t) + g(t)ua(t),    (15)
where Uc = 0 is the ultimate cost objective and Im×1 ∈ Rm×1 is a column vector whose elements are all 1.
Remark 1: There are two targets in the supervised action network design. One is to minimize the error between the cost function estimate Ĵ(t) and the ultimate cost objective Uc; the motivation is that Ĵ(t) thereby approximates the optimal cost function J*(t). The other target is to minimize the error between the output of the action network and g−1(t)f(t), which is similar to supervised learning or adaptive NN control.
Since there is no prior knowledge of f(t) and g(t), by using (10), (15) is reformulated as
ea(t) = (Ĵ(t) − Uc)Im×1 + ē(t + 1) − kvē(t) − us(t).    (16)
Define its objective function as
Ea(t) = (1/2)eaT(t)ea(t).    (17)
The weights of the supervised action network are updated by
Δwa(t) = −ηaϕa1(t)eaT(t)wac(t)diag(ϕ′a2(t)),    (18a)
Δva(t) = −ηaza(t)eaT(t)wac(t)diag(ϕ′a2(t))waT(t)diag(ϕ′a1(t)),    (18b)
where ηa is the learning rate, diag(·) is the diagonalization operator, ϕ′a1(t) and ϕ′a2(t) respectively represent the derivatives of ϕa1(t) and ϕa2(t), and wac(t) is defined as
wac(t) = αcIm×1wcT(t)diag(ϕ′c(t))vc2T(t) + (1 − αc)g(t),    (19)
where αc will be designed in detail later, ϕ′c(t) represents the derivative of ϕc(t), and the matrix vc2(t) ∈ Rm×hc satisfies vc(t) = [vc1T(t), vc2T(t)]T with vc1(t) ∈ Rnm×hc.
In the critic network, via the Bellman equation (5), the prediction error can be represented as
ec(t) = γĴ(t) − Ĵ(t − 1) + r(t).    (20)
Define its objective function as
Ec(t) = (αc/2)ec(t)ec(t).    (21)
By using the gradient descent algorithm, the weights of the critic network are updated by
Δwc(t) = −ηcαcγec(t)ϕc(t),    (22a)
Δvc(t) = −ηcαcγec(t)zc(t)wcT(t)diag(ϕ′c(t)).    (22b)
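A hedged Python sketch of the weight updates (18) and (22) is given below, again assuming tanh activations so that the derivatives ϕ′(·) are 1 − tanh(·)²; the variable names mirror the paper, but the wiring is only an illustration of the stated update rules.

```python
import numpy as np

# Hedged sketch of the gradient updates (18) and (22). ec is a scalar, ea is
# m x 1, and tanh activations are assumed; all shapes follow the paper's
# weight dimensions, but the setup is illustrative only.
def critic_update(wc, vc, zc, ec, eta_c, alpha_c, gamma):
    phi = np.tanh(vc.T @ zc)                         # phi_c(t), hc x 1
    dphi = 1.0 - phi**2                              # phi_c'(t)
    dwc = -eta_c * alpha_c * gamma * ec * phi                            # (22a)
    dvc = -eta_c * alpha_c * gamma * ec * zc @ wc.T @ np.diagflat(dphi)  # (22b)
    return wc + dwc, vc + dvc

def action_update(wa, va, za, ea, wac, eta_a):
    h1 = np.tanh(va.T @ za)                          # phi_a1(t), ha x 1
    h2 = np.tanh(wa.T @ h1)                          # phi_a2(t), m x 1
    d1, d2 = 1.0 - h1**2, 1.0 - h2**2
    dwa = -eta_a * h1 @ ea.T @ wac @ np.diagflat(d2)                             # (18a)
    dva = -eta_a * za @ ea.T @ wac @ np.diagflat(d2) @ wa.T @ np.diagflat(d1)    # (18b)
    return wa + dwa, va + dva
```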
Design a learning schedule factor αc as
αc = 1, if (1/Nα)∑_{k=t−Nα+1}^{t} ‖ē(k)‖ < εα,
αc = 0, if (1/Nα)∑_{k=t−Nα+1}^{t} ‖ē(k)‖ ≥ εα,    (23)
where εα > 0 is a design constant and Nα is a positive integer.
It is worth pointing out that the traditional online action-critic framework easily leads to inefficiency in the real-time control problem. Specifically, at the beginning of the online training phase, the state of the system (1) may be far away from the reference state, which results in a risk of training failure. This case corresponds to (1/Nα)∑_{k=t−Nα+1}^{t} ‖ē(k)‖ ≥ εα, i.e., αc = 0. The supervised action network then guides the system state back to a neighborhood of the reference state. Once (1/Nα)∑_{k=t−Nα+1}^{t} ‖ē(k)‖ < εα holds, the online adaptive critic learning works to further derive the optimal control policy. It is obvious that the high failure risk is avoided at the beginning of the online training phase.
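The switching logic of (23) can be sketched as a simple moving-average test; the default values below mirror Na = 10 and εa = 0.7 from Table 1 on the assumption that they play the roles of Nα and εα.

```python
from collections import deque
import numpy as np

# Sketch of the learning schedule factor (23): alpha_c enables the critic
# learning only when the moving average of the filtered error is small.
# n_alpha and eps_alpha are design parameters (defaults follow Table 1).
class LearningSchedule:
    def __init__(self, n_alpha=10, eps_alpha=0.7):
        self.window = deque(maxlen=n_alpha)
        self.eps_alpha = eps_alpha

    def alpha_c(self, e_bar):
        self.window.append(np.linalg.norm(e_bar))
        return 1.0 if np.mean(self.window) < self.eps_alpha else 0.0
```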
It can be concluded that the learning process is convergent and that e(t) is uniformly ultimately bounded (UUB), with a bound that depends on d1(t); the proof is omitted here to save space. From the UUB property, we can deduce that ē(t) and ē(t) − kvē(t − 1) are also bounded. Without loss of generality, let
‖ē(t + 1) − kvē(t)‖ ≤ δe,    (24)
where δe > 0. From (10), we then have ‖f1(t) + g(t)ua(t) + d1(t)‖ ≤ δe. Let
d2(t) = f1(t) + g(t)ua(t) + d1(t).    (25)
Then, we get
ē(t + 1) = kvē(t) + us(t) + d2(t).    (26)
Remark 2: When the weights of the critic network or the supervised action network are close to a convergent region, it is necessary to reduce the learning rates Ni et al. (2013); Mu et al. (2017). Without loss of generality, let (1/Ns)∑_{k=t−Ns+1}^{t} ‖wc(k) − wc(k − 1)‖ < εs represent that the weights are close to the convergent region. In this case, owing to the reduced learning rates, adding a weak compensation control signal us(t) causes a system state change that has little impact on the learning process. Then, (25) still holds. Thus, the linear system (26) with the persistent disturbance d2(t) always exists.
Since the specific information of d2(t) is unavailable, the disturbance observer Kim et al. (2016) is designed as
d̂2(t) = kdē(t) − zd(t),
zd(t + 1) = zd(t) + kd((kv − Im)ē(t) + us(t) + d̂2(t)),    (27)
in which d̂2(t) is an estimate of d2(t), kd ∈ Rm×m is a diagonal observer matrix, and zd(t) is a new state variable. From the conclusion of Kim et al. (2016), it is known that d̂2(t) converges to d2(t).
Inspired by Du et al. (2016), we design a chattering-free compensation control as
us(t) = (qs1 − kv)ē(t) − qs2 sig^{αs}(ē(t)) − d̂2(t), if αc = 1 and (1/Ns)∑_{k=t−Ns+1}^{t} ‖wc(k) − wc(k − 1)‖ < εs,
us(t) = 0, otherwise,    (28)
where 0 < qs1 < 1, 0 < qs2 < 1, 0 < αs < 1, sig^{αs}(·) = sgn(·)|·|^{αs}, εs > 0 is a design constant, and Ns is a positive integer.
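A sketch of the observer (27) and the compensation law (28) is given below; the activation condition of (28) is read here as αc = 1 together with the moving-average weight test, which is represented by a boolean flag, and kd, kv, qs1, qs2, αs are user-chosen design parameters.

```python
import numpy as np

# Hedged sketch of the disturbance observer (27) and the chattering-free
# compensation control (28); the "weights_converged" flag stands in for the
# moving-average test on ||wc(t) - wc(t-1)||.
def sig(x, alpha_s):
    return np.sign(x) * np.abs(x)**alpha_s

def observer_step(e_bar, zd, us, kv, kd, m):
    d2_hat = kd @ e_bar - zd                                        # (27)
    zd_next = zd + kd @ ((kv - np.eye(m)) @ e_bar + us + d2_hat)
    return d2_hat, zd_next

def compensation(e_bar, d2_hat, kv, qs1, qs2, alpha_s,
                 alpha_c, weights_converged):
    if alpha_c == 1.0 and weights_converged:                        # (28)
        return (qs1 * np.eye(len(e_bar)) - kv) @ e_bar \
               - qs2 * sig(e_bar, alpha_s) - d2_hat
    return np.zeros_like(e_bar)
```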
According to the results of Du et al. (2016), the compensation control us(t) is a chattering-free signal and has a disturbance attenuation capability. Hence, us(t) ensures the system's robustness to the uncertainty by compensating for d2(t). As such, our developed online adaptive critic robust control method not only has the high-efficiency optimal control property in real time, but also remains robust to the uncertainty. The procedure to realize this method is summarized in Algorithm 1.
4. SIMULATION
To verify the superiority of the theoretical results, a simulation example of our developed method is conducted by comparing it with a traditional online ADP method. The dynamics of a one-link robot manipulator is considered:
Gθ̈ + Dθ̇ + MgL sin(θ) = τ,    (29)
where g = 9.8 m/s2 is the gravitational acceleration, D = 1 represents the viscous friction coefficient, L = 1 m stands for the length of the link, M = 1 kg represents the payload mass, G = 1 kg·m2 stands for the inertia moment, θ is the angle position, τ is the torque, and τd is a disturbance. Note that these dynamics are unavailable for the controller design.
Discretizing (29) using the Euler method with sampling interval Ts = 0.05 s yields
x1(t + 1) = x2(t),
x2(t + 1) = ((2G − DTs)/G)x2(t) − ((G − DTs)/G)x1(t) − (MgLTs^2/G)sin(x1(t)) + (Ts^2/G)u(t) + (Ts^2/G)d(t),    (30)
where d(t) = 0.08 cos(1.8Tst − π/4) sin(Tst + π/3).
The reference model is given by
xr1(t + 1) = xr2(t),
xr2(t + 1) = (2 − 1.5Ts)xr2(t) + (1.5Ts − 2.5Ts^2 − 1)xr1(t) + Ts^2 ur(t),    (31)
where ur(t) = sin(0.2Tst) cos(0.4Tst + π/2).
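For reference, a minimal Python sketch of one step of the discretized plant (30) and the reference model (31) is given below, using the example's parameter values (Ts = 0.05 s, G = D = M = L = 1, g = 9.8); the controller producing u(t) is left outside the sketch.

```python
import numpy as np

# Minimal simulation sketch of the discretized manipulator (30) and the
# reference model (31); parameter values follow the example in the text.
Ts, G, D, M, L, grav = 0.05, 1.0, 1.0, 1.0, 1.0, 9.8

def plant_step(x1, x2, u, t):
    d = 0.08 * np.cos(1.8 * Ts * t - np.pi / 4) * np.sin(Ts * t + np.pi / 3)
    x1_next = x2
    x2_next = ((2 * G - D * Ts) / G) * x2 - ((G - D * Ts) / G) * x1 \
              - (M * grav * L * Ts**2 / G) * np.sin(x1) \
              + (Ts**2 / G) * u + (Ts**2 / G) * d
    return x1_next, x2_next                                          # (30)

def reference_step(xr1, xr2, t):
    ur = np.sin(0.2 * Ts * t) * np.cos(0.4 * Ts * t + np.pi / 2)
    xr1_next = xr2
    xr2_next = (2 - 1.5 * Ts) * xr2 \
               + (1.5 * Ts - 2.5 * Ts**2 - 1) * xr1 + Ts**2 * ur      # (31)
    return xr1_next, xr2_next
```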
Algorithm 1:
\* itri: the maximal number of trials; tc: the cumulative time window for breaking a trial; tt: the simulation terminal time; ic, ia: the maximal iteration numbers in the critic network and the supervised action network, respectively; Ect, Eat: the objective function thresholds in the critic network and the supervised action network, respectively.
1): set the coefficients λ1, λ2, . . . , λn−1, γ, Q, R, kv, kd, ηa, ηc, Nα, Ns, εα, εs, εe, qs1, and qs2;
2): for 1 to itri do    \* trial
3):   initialize x(0), xr(0);
4):   initialize wc(0), vc(0), wa(0), and va(0) randomly;
5):   us(0) = 0 and ua(0) ← (14);
6):   while t ≤ tt do
7):     wa(t) = wa(t − 1), va(t) = va(t − 1), wc(t) = wc(t − 1), and vc(t) = vc(t − 1);
8):     calculate u(t − 1) ← (8);
9):     x(t) ← (1), xr(t) ← (2), e(t) ← (3), and ē(t) ← (6);
10):    if t > tc and (1/tc)∑_{k=t−tc+1}^{t} ‖ē(k)‖ > εe
11):      break this trial;
12):    endif
13):    calculate Ĵ(t) ← (12), ua(t) ← (14), d̂2(t) ← (27), us(t) ← (28), and αc ← (23);
14):    r(t) = eT(t)Qe(t) + uT(t)Ru(t);
15):    calculate Ec(t) and set i = 0;
16):    while ((i < ic) & (Ec(t) > Ect)) do
17):      update wc(t) = wc(t − 1) + Δwc(t) and vc(t) = vc(t − 1) + Δvc(t);
18):      Ĵ(t) ← (12);
19):      if (1/Ns)∑_{k=t−Ns+1}^{t} ‖wc(k) − wc(k − 1)‖ < εs
20):        reduce ηa and ηc;
21):      else
22):        reset ηa and ηc;
23):      endif
24):      calculate Ec(t) and set i = i + 1;
25):    endwhile    \* critic network
26):    calculate Ea(t) and set j = 0;
27):    while ((j < ia) & (Ea(t) > Eat)) do
28):      update wa(t) = wa(t − 1) + Δwa(t) and va(t) = va(t − 1) + Δva(t);
29):      ua(t) ← (14) and u(t) ← (8);
30):      Ĵ(t) ← (12);
31):      calculate Ea(t) and set j = j + 1;
32):    endwhile    \* action network
33):    t = t + 1;
34):  endwhile
35): endfor
The matrices Q and R are chosen as diag{0.5, 0.5} and 0.3, respectively. The critic network and the supervised action network are constructed as two three-layer back-propagation NNs with structures of 3-4-1 and 2-3-1, respectively. The activation functions ϕc(·) and ϕa(·) are selected as the hyperbolic tangent function. The initial weights of both networks are randomly generated from [−1, 1]. The parameters of the adaptive critic robust control method used in the simulation are presented in Table 1. In addition, based on the results of Kim et al. (2016); Du et al. (2016), the observer and compensation control parameters are chosen as kd = 0.3, qs1 = 0.6, and qs2 = 0.4.
The trajectories of the system state x1(t) and the reference state xr1(t) are presented in Fig. 2. The curve of the model reference tracking error e1(t) is depicted in Fig. 3. It is observed that the system (30) can exactly track the behavior of the reference model (31) by using our developed method. The control input curve is shown in Fig. 4.
Fig. 2. System state x1(t) and reference state xr1(t).

Fig. 3. Model reference tracking error e1(t).

Fig. 4. Control input curve.
Table 1. Parameters in the example
λ1 = 0.3, γ = 0.95, kv = 0.1, ηa = 0.001, ηc = 0.004, ia = 300, ic = 200, Eat = 1e-4, Ect = 6e-4, Nα = 10, Ns = 10, εα = 0.7, εe = 1, εs = 0.1.

Fig. 5. System state x1(t) and reference state xr1(t) in Ni et al. (2013).

To highlight the better learning efficiency and robustness of our developed online adaptive critic robust control method, a comparative simulation experiment is conducted against the previous method of Ni et al. (2013). Since Ni et al. (2013) is a tracking control method, the trajectory of the reference model needs to be obtained beforehand, and then the tracking control is implemented. The reference network, the critic network, and the action network are constructed as three similar NNs with structures of 5-4-1, 6-5-1, and 4-3-1, respectively. Let u(t) = ū(t)/ĝ(t) in (30), in which ū(t) is the output of the action network. Some parameters of this method selected for the simulation are presented in Table 2, in which ηr, ir, and Ert represent the learning rate, the maximal iteration number, and the objective function threshold of the reference network, respectively. The other parameter settings and initial conditions, such as λ1, γ, kv, the initial states, and the initial weights, are set the same as those of our developed online adaptive critic robust control method.
Table 2. Parameters in the example for Ni et al. (2013)
ηa = 0.001, ηc = 0.004, ηr = 0.001, ia = 300, ic = 200, ir = 150, Eat = 1e-4, Ect = 6e-4, Ert = 2e-4.
By using the method proposed in Ni et al. (2013), we obtain the trajectories of x1(t) and xr1(t) and the model reference tracking error curve, which are shown in Figs. 5 and 6, respectively. Comparing Figs. 2 and 3 with Figs. 5 and 6, it can be seen that our developed method produces a smaller model reference tracking error than the method proposed in Ni et al. (2013). This means that our developed method has superior robustness.
Table 3. Simulation results of both methods
Methods               Number of experiments   Number of trials   Success rate (%)
Traditional method    100                     20                 53
Our method            100                     20                 100
In this comparative simulation study, a run consists of a maximum of 20 consecutive trials. A run is considered successful if (1/Nα)∑_{k=t−Nα+1}^{t} ‖ē(k)‖ ≤ 0.07 holds for all t > 900. Otherwise, if the controller is unable to learn to make the system (30) track the behavior of the reference model (31) within 20 trials, the run is considered unsuccessful. We run 100 experiments for the traditional method Ni et al. (2013) and for our developed method; the simulation results are listed in Table 3.
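The success test can be sketched as follows; e_bar_history is assumed to hold the filtered-error vectors of one run, and the default Nα = 10 mirrors Table 1.

```python
import numpy as np

# Sketch of the success criterion used in the comparative study: a run
# succeeds if the N_alpha-step moving average of ||e_bar|| stays at or
# below 0.07 for all t > 900.
def run_successful(e_bar_history, n_alpha=10, threshold=0.07, t_start=900):
    norms = np.array([np.linalg.norm(e) for e in e_bar_history])
    for t in range(t_start + 1, len(norms)):
        window = norms[max(0, t - n_alpha + 1):t + 1]
        if window.mean() > threshold:
            return False
    return True
```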
Fig. 6. Model reference tracking error e1(t) in Ni et al. (2013).
It is observed that, in contrast to the traditional online ADP method Ni et al. (2013), our developed method greatly reduces the learning failure rate.
5. CONCLUSION
An online adaptive critic robust control method has been developed to handle the optimal MRAC problem for nonlinear systems with unknown dynamics. The online adaptive critic robust controller consists of the critic network, the supervised action network, and the compensation control term. Via the newly defined learning schedule factor, this controller not only achieves high-efficiency learning and optimality in real time, but is also robust to the uncertainty. A comparative simulation has been provided to show the superiority of our developed method. Further investigation is recommended into the stability analysis, optimization of the algorithm, and applications to real systems.
REFERENCES
Chen, X., Wang, W., Cao, W., & Wu, M. (2019). Gaussian-kernel-based adaptive critic design using two-phase value iteration. Information Sciences, 482, 139-155.
Du, H., Yu, X., Chen, M. Z. Q., & Li, S. (2016). Chattering-free discrete-time sliding mode control. Automatica, 68, 87-91.
Fathinezhad, F., Derhami, V., & Rezaeian, M. (2016). Supervised fuzzy reinforcement learning for robot navigation. Applied Soft Computing, 40, 33-41.
Fu, H., Chen, X., & Wang, W. (2017). A model reference adaptive control with ADP-to-SMC strategy for unknown nonlinear systems. Proceedings of the 11th Asian Control Conference, 1537-1542.
Fu, H., Chen, X., Wang, W., & Wu, M. (2020). MRAC for unknown discrete-time nonlinear systems based on supervised neural dynamic programming. Neurocomputing, 384, 130-141.
Ha, M., Wang, D., & Liu, D. (2018). Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, doi: 10.1109/TSMC.2018.2868510.
He, H., Ni, Z., & Fu, J. (2012). A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing, 78(1), 3-13.
Jiang, H., Zhang, H., Xiao, G., & Cui, X. (2018). Databased approximate optimal control for nonzero-sum games of multi-player systems using adaptive dynamic programming. Neurocomputing, 275, 192-199.
Kim, K. & Rew, H. (2013). Reduced order disturbance observer for discrete-time linear systems. Automatica, 49(4), 968-975.
Lian, C., Xu, X., Chen, H., & He, H. (2016). Near-optimal tracking control of mobile robots via receding-horizon dual heuristic programming. IEEE Transactions on Cybernetics, 46(11), 2484-2496.
Liu, F., Sun, J., Si, J., Guo, W., & Mei, S. (2012). A boundedness result for the direct heuristic dynamic programming. Neural Networks, 32, 229-235.
Mu, C., Ni, Z., Sun, C., & He, H. (2017). Air-breathing hypersonic vehicle tracking control based on adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 28(3), 584–598.
Mu, C., Ni, Z., Sun, C., & He, H. (2017). Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems. IEEE Transactions on Cybernetics, 47(6), 1460-1470.
Ni, Z., He, H., & Wen, J. (2013). Adaptive learning in tracking control based on the dual critic network design. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 913-928.
Pang, B. & Jiang, Z. P. (2019). Adaptive optimal control of linear periodic systems: An off-policy value iteration approach. arXiv: 1901.08650.
Radac, M. B., Precup, R. E., & Roman, R. C. (2017). Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning. International Journal of Systems Science, 48(5), 1071-1083.
Radac, M. B., Precup, R. E., & Roman, R. C. (2018). Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-learning. ISA Transactions, 73, 227-238.
Si, J. & Wang, Y. T. (2001). On-line learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264-276.
Wang, W., Chen, X., Wang, F., & Fu, H. (2018). ADP-based model reference adaptive control design for unknown discrete-time nonlinear systems. Proceedings of the 37th Chinese Control Conference, 8049-8054.
Wang, W. Y., Chan, M. L., Hsu, C. C. J., & Lee, T. T. (2002). H∞ tracking-based sliding mode control for uncertain nonlinear systems via an adaptive fuzzy-neural approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 32(4), 483-492.
Wang, Z., Wei, Q., & Liu, D. (2020). Event-triggered adaptive dynamic programming for discrete-time multi-player games. Information Sciences, 506, 457-470.
Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White, & D. A. Sofge (Eds.), Handbook of intelligent control. New York: Van Nostrand Reinhold, (Chapter 13).
Yang, L., Si, J., Tsakalis, K. S., & Rodriguez, A. A. (2009). Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(6), 1617-1622.
Zhao, D., Wang, B., & Liu, D. (2013). A supervised actor-critic approach for adaptive cruise control. Soft Computing, 17(11), 2089-2099.
Zhao, D., Zhang, Q., Wang, D., & Zhu, Y. (2016). Experience replay for optimal control of nonzero-sum game systems with unknown dynamics. IEEE Transactions on Cybernetics, 46(3), 854-865.