dc.description.abstract |
<p>Zeroth-Order (ZO) methods have been applied to solve black-box or simulation-based
optimization problems. Such problems arise in many important applications today, e.g.,
generating adversarial attacks on machine learning systems or learning to control systems
with complicated physics or humans in the loop. In these settings, the objective function
to be optimized does not have an explicit mathematical form, so its gradient cannot be
obtained, which rules out all gradient-based optimization approaches. ZO methods instead
approximate the gradient using values of the objective function. Many existing ZO methods
adopt a two-point feedback scheme to approximate the unknown gradient because of its low
estimation variance and fast convergence. Specifically, a two-point ZO method estimates
the gradient at the current iterate of the algorithm by querying the objective function
value twice, at two distinct points in a neighborhood of the current iterate.
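For reference, a commonly used form of such a two-point estimator (the notation here is
illustrative and not taken from this abstract: $\delta$ denotes a smoothing radius, $u_t$
a random unit direction, and $d$ the problem dimension) is
\[
\hat{g}_t \;=\; \frac{d}{2\delta}\,\bigl(f(x_t + \delta u_t) - f(x_t - \delta u_t)\bigr)\,u_t,
\]
which requires two function evaluations at every iterate $x_t$.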
Such a scheme becomes infeasible or difficult to implement when the objective function
is time-varying, or when multiple agents collaboratively optimize a global objective
function that depends on all agents’ decisions, because in these cases the objective
function can be queried only once, at a single decision point. Conventional ZO methods
based on one-point feedback, however, suffer from large variance in their gradient
estimates, which slows down convergence.</p><p>In this dissertation, we propose a novel
one-point ZO method based on residual feedback. Specifically, the residual-feedback
scheme estimates the gradient using the residual between the values of the objective
function at two consecutive iterates of the algorithm.
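As a sketch of this idea (again with illustrative notation only), the residual-feedback
estimator takes the form
\[
\hat{g}_t \;\propto\; \frac{1}{\delta}\,\bigl(f(x_t + \delta u_t) - f(x_{t-1} + \delta u_{t-1})\bigr)\,u_t,
\]
up to a dimension-dependent scaling constant that depends on the distribution of the
random perturbation $u_t$; only one new function query is needed per iterate, because
the value $f(x_{t-1} + \delta u_{t-1})$ is reused from the previous iterate.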
When optimizing a deterministic Lipschitz function, we show that the query complexity
of ZO with the proposed one-point residual feedback matches that of ZO with existing
two-point schemes. Moreover, the query complexity of the proposed algorithm can be
further improved when the objective function has a Lipschitz continuous gradient. Then,
for stochastic bandit optimization problems, we show that ZO with one-point residual
feedback achieves the same convergence rate as the two-point scheme with uncontrollable
data samples.</p><p>Next, we
apply the proposed one-point residual-feedback gradient estimator to solve online
optimization problems, where the objective function varies over time. In the online
setting, since each objective function can only be evaluated once at a single decision
point, existing two-point ZO methods are not feasible and only one-point ZO methods
can be used. We develop regret bounds for ZO with the proposed one-point residual
feedback scheme for both convex and nonconvex online optimization problems.
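For context (the precise regret notions are defined in the dissertation body rather than
in this abstract), the static regret of an online algorithm producing decisions $x_t$
against loss functions $f_t$ is commonly defined as
\[
R_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x} \sum_{t=1}^{T} f_t(x),
\]
while dynamic regret instead compares against the per-step minimizers of the time-varying
objectives.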
Specifically, for both deterministic and stochastic problems and for both Lipschitz and
smooth objective functions, we show that residual feedback produces gradient estimates
with much smaller variance than conventional one-point feedback methods. As a result,
our regret bounds are much tighter than existing regret bounds for ZO with conventional
one-point feedback, which suggests that ZO with residual feedback can better track the
optimizer of online optimization problems. Additionally, our regret bounds rely on weaker
assumptions than those used in conventional one-point feedback methods.</p><p>The proposed
residual-feedback scheme is next decentralized to conduct
distributed policy optimization in multi-agent reinforcement learning (MARL) problems.
Existing MARL algorithms often assume that every agent can observe the states and actions
of all other agents in the network. This can be impractical in large-scale problems,
where sharing state and action information with multi-hop neighbors may incur significant
communication overhead. The advantage of the proposed zeroth-order policy optimization
method is that it allows the agents to compute the local policy gradients needed to update
their local policy functions using local estimates of the global accumulated rewards;
these estimates depend only on partial state and action information and can be obtained
using consensus.
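As an illustrative sketch of such a consensus step (the notation is hypothetical and not
part of this abstract), each agent $i$ repeatedly averages its local reward estimate with
those of its neighbors $\mathcal{N}_i$,
\[
\hat{J}_i^{(k+1)} \;=\; \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, \hat{J}_j^{(k)},
\]
where $W$ is a doubly stochastic weight matrix supported on the communication graph, so
that a finite number of such averaging steps drives each local estimate toward the global
accumulated reward.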
Specifically, local ZO policy gradients based on one-point residual feedback have
significantly smaller variance than conventional one-point policy gradient estimates,
which improves the learning performance. We show that the proposed distributed zeroth-order
policy optimization method with a constant stepsize converges to a neighborhood of the
globally optimal policy whose size depends on the number of consensus steps used to
compute the local estimates of the global accumulated rewards.</p><p>Another challenge
in distributed ZO optimization is that the
agents may conduct local updates in an asynchronous fashion when they do not have
access to a global clock. To deal with this challenge, we propose an asynchronous
zeroth-order distributed optimization method that relies on the proposed one-point
residual-feedback gradient estimator. We show that this estimator is unbiased under
asynchronous updating, and we theoretically analyze the convergence of the resulting method.
We demonstrate the effectiveness of all proposed algorithms via extensive numerical
experiments.</p>
|
|