A NEW ZEROTH-ORDER ORACLE FOR DISTRIBUTED AND NON-STATIONARY LEARNING
Zeroth-Order (ZO) methods have been applied to solve black-box or simulation-based optimization prroblems. These problems arise in many important applications nowa- days, e.g., generating adversarial attacks on machine learning systems, learning to control the system with complicated physics structure or human in the loop. In these problem settings, the objective function to optimize does not have an explicit math- ematical form and therefore its gradient cannot be obtained. This invalidates all gradient-based optimization approaches. On the other hand, ZO methods approxi- mate the gradient by using the objective function values. Many existing ZO methods adopt the two-point feedback scheme to approximate the unknown gradient due to its low estimation variance and fast convergence speed. Specifically, two-point ZO method estimates the gradient at the current iterate of the algorithm by querying the objective function value twice at two distinct neighbor points around the current iterate. Such scheme becomes infeasible or difficult to implement when the objective function is time-varying, or when multiple agents collaboratively optimize a global objective function that depends on all agents’ decisions, because the value of the objective function can be queried only once at a single decision point. However, con- ventional ZO method based on one-point feedback is subject to large variance of the gradient estimation and therefore slows down the convergence.
In this dissertation, we propose a novel one-point ZO method based on the residual feedback. Specifically, the residual feedback scheme estimates the gradient using the residual between the values of the objective function at two consecutive iterates of the algorithm. When optimizing a deterministic Lipschitz function, we show that the query complexity of ZO with the proposed one-point residual feedback matches that of ZO with the existing two-point schemes. Moreover, the query complexity of the proposed algorithm can be improved when the objective function has Lipschitz gradient. Then, for stochastic bandit optimization problems, we show that ZO with one-point residual feedback achieves the same convergence rate as that of two-point scheme with uncontrollable data samples.
Next, we apply the proposed one-point residual-feedback gradient estimator to solve online optimizaiton problems, where the objective function varies over time. In the online setting, since each objective function can only be evaluated once at a single decision point, existing two-point ZO methods are not feasible and only one-point ZO methods can be used. We develop regret bounds for ZO with the proposed one- point residual feedback scheme for both convex and nonconvex online optimization problems. Specifically, for both deterministic and stochastic problems and for both Lipschitz and smooth objective functions, we show that using residual feedback can produce gradient estimates with much smaller variance compared to conventional one- point feedback methods. As a result, our regret bounds are much tighter compared to existing regret bounds for ZO with conventional one-point feedback, which suggests that ZO with residual feedback can better track the optimizer of online optimization problems. Additionally, our regret bounds rely on weaker assumptions than those used in conventional one-point feedback methods.
The proposed residual-feedback scheme is next decentralized to conduct dis- tributed policy optimization in the multi-agent reinforcement learning (MARL) prob- lems. Existing MARL algorithms often assume that every agent can observe the states and actions of all the other agents in the network. This can be impractical in large-scale problems, where sharing the state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the pro- posed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards that depend on partial state and action information only and can be obtained using consensus. Specifically, the local ZO policy gradients relying on one-point residual-feedback significantly reduces the vari- ance of the local policy gradient estimates compared to the conventional one-point policy gradient estimates, improving, in this way, the learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to a neighborhood of the global optimal policy that depends on the number of consensus steps used to calculate the local estimates of the global accumulated rewards.Another challenge in the distributed ZO optimization problems is that the agents may conduct local updates in an asynchronous fashion when they do not have ac- cess to a global clock. To deal with this challenge, we propose an asynchronous zeroth-order distributed optimization method that relies on the proposed one-point residual feedback gradient estimator. We show that this estimator is unbiased under asynchronous updating, and theoretically analyze its convergence. We demonstrate the effectiveness of all proposed algorithms via extensive numerical experiments.
Multi-Agent Reinforcement Learning
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations
Works are deposited here by their authors, and represent their research and opinions, not that of Duke University. Some materials and descriptions may include offensive content. More info