# Browsing by Author "Zavlanos, Michael"

Item Open Access A New Zeroth-Order Oracle for Distributed and Non-Stationary Learning (2021) Zhang, Yan

Zeroth-Order (ZO) methods have been applied to solve black-box or simulation-based optimization problems. These problems arise in many important applications today, e.g., generating adversarial attacks on machine learning systems or learning to control systems with complicated physics or a human in the loop. In these settings, the objective function to be optimized does not have an explicit mathematical form, so its gradient cannot be obtained; this invalidates all gradient-based optimization approaches. ZO methods instead approximate the gradient using objective function values. Many existing ZO methods adopt a two-point feedback scheme to approximate the unknown gradient because of its low estimation variance and fast convergence. Specifically, a two-point ZO method estimates the gradient at the current iterate of the algorithm by querying the objective function value at two distinct points near the current iterate. Such a scheme becomes infeasible or difficult to implement when the objective function is time-varying, or when multiple agents collaboratively optimize a global objective function that depends on all agents' decisions, because the objective value can then be queried only once at a single decision point. However, the conventional ZO method based on one-point feedback suffers from large gradient-estimation variance, which slows down convergence.

In this dissertation, we propose a novel one-point ZO method based on residual feedback. Specifically, the residual-feedback scheme estimates the gradient using the residual between the objective values at two consecutive iterates of the algorithm. When optimizing a deterministic Lipschitz function, we show that the query complexity of ZO with the proposed one-point residual feedback matches that of ZO with existing two-point schemes. Moreover, the query complexity of the proposed algorithm improves further when the objective function has a Lipschitz gradient. Then, for stochastic bandit optimization problems, we show that ZO with one-point residual feedback achieves the same convergence rate as the two-point scheme with uncontrollable data samples.
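As a concrete illustration, the residual-feedback estimator described above can be sketched in a few lines. This is a minimal sketch on a toy quadratic objective with illustrative constants, not the dissertation's implementation; each iteration queries the black box once and reuses the previous query.

```python
import numpy as np

# One-point residual-feedback ZO descent (sketch).  The estimate is
#   g_t = (d / delta) * (f(x_t + delta*u_t) - f(x_{t-1} + delta*u_{t-1})) * u_t,
# so only ONE new function value is needed per iteration.

rng = np.random.default_rng(0)
f = lambda x: float(x @ x)          # black box: only values are observable

d, delta, step = 2, 0.5, 0.01       # illustrative problem size and constants
x = np.array([3.0, -2.0])
u = rng.standard_normal(d)
prev_value = f(x + delta * u)       # query at the previous perturbed iterate
for _ in range(3000):
    u = rng.standard_normal(d)
    value = f(x + delta * u)        # the only new query this iteration
    g = (d / delta) * (value - prev_value) * u   # residual-feedback estimate
    prev_value = value
    x = x - step * g

print(f(x))   # far smaller than f at the start, despite one query per step
```

Near the optimum the residual between consecutive values shrinks, which is the intuition for the lower variance compared to conventional one-point feedback.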

Next, we apply the proposed one-point residual-feedback gradient estimator to online optimization problems, where the objective function varies over time. In the online setting, each objective function can be evaluated only once, at a single decision point, so existing two-point ZO methods are infeasible and only one-point ZO methods can be used. We develop regret bounds for ZO with the proposed one-point residual-feedback scheme for both convex and nonconvex online optimization problems. Specifically, for both deterministic and stochastic problems, and for both Lipschitz and smooth objective functions, we show that residual feedback produces gradient estimates with much smaller variance than conventional one-point feedback. As a result, our regret bounds are much tighter than existing regret bounds for ZO with conventional one-point feedback, which suggests that ZO with residual feedback can better track the optimizer of online optimization problems. Additionally, our regret bounds rely on weaker assumptions than those used for conventional one-point feedback methods.
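The online setting can be sketched as follows, assuming a hypothetical drifting-target loss: each round permits exactly one query of the current objective, so the estimator forms its residual across two different functions, using last round's single value.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta, step = 2, 0.5, 0.02         # illustrative constants

def f_t(x, t):
    # Hypothetical time-varying loss: track a slowly drifting target.
    target = np.array([np.cos(0.001 * t), np.sin(0.001 * t)])
    return float((x - target) @ (x - target))

x = np.zeros(d)
u = rng.standard_normal(d)
prev_value = f_t(x + delta * u, 0)    # round 0: the single allowed query
losses = []
for t in range(1, 5000):
    u = rng.standard_normal(d)
    value = f_t(x + delta * u, t)     # round t: the single allowed query
    g = (d / delta) * (value - prev_value) * u   # residual across f_{t-1}, f_t
    prev_value = value
    x = x - step * g
    losses.append(f_t(x, t))          # evaluated only to record tracking loss
```

After the initial transient, the iterate tracks the moving optimizer up to a noise ball whose size reflects the estimator's variance.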

The proposed residual-feedback scheme is next decentralized to conduct distributed policy optimization in multi-agent reinforcement learning (MARL) problems. Existing MARL algorithms often assume that every agent can observe the states and actions of all other agents in the network. This can be impractical in large-scale problems, where sharing state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards; these estimates depend on partial state and action information only and can be obtained using consensus. Specifically, local ZO policy gradients relying on one-point residual feedback have significantly smaller variance than conventional one-point policy gradient estimates, improving, in this way, the learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to a neighborhood of the globally optimal policy whose size depends on the number of consensus steps used to calculate the local estimates of the global accumulated rewards.

Another challenge in distributed ZO optimization is that the agents may conduct local updates asynchronously when they do not have access to a global clock. To deal with this challenge, we propose an asynchronous zeroth-order distributed optimization method that relies on the proposed one-point residual-feedback gradient estimator. We show that this estimator is unbiased under asynchronous updating and theoretically analyze its convergence. We demonstrate the effectiveness of all proposed algorithms via extensive numerical experiments.
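The consensus step that produces the local estimates of the global accumulated reward can be sketched as repeated averaging with neighbors. The ring network and mixing weights below are illustrative, not the dissertation's setup; the point is that each agent only ever touches its neighbors' values, yet all estimates converge to the global average.

```python
import numpy as np

n_agents = 5
local_rewards = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # r_i observed locally
global_avg = local_rewards.mean()                      # unknown to any agent

# Doubly stochastic mixing matrix for a ring graph: each agent averages
# its own value with those of its two neighbors.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

estimates = local_rewards.copy()
for _ in range(20):                 # more consensus steps => tighter estimates
    estimates = W @ estimates       # each agent mixes only neighbor values

print(np.abs(estimates - global_avg).max())   # error shrinks geometrically
```

This illustrates the trade-off stated above: the number of consensus steps controls how close the local estimates, and hence the converged policy, get to the global quantity.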

Item Embargo Adaptive Planning in Changing Policies and Environments (2023) Sivakumar, Kavinayan Pillaiar

Being able to adapt to different tasks is a staple of learning, as agents aim to generalize across different situations. In particular, it is important for agents to adapt to the policies of other agents around them. In swarm settings, multi-agent sports settings, or other team-based environments, agents that learn from one another can save time and reduce performance errors. Traditional transfer reinforcement learning therefore proposes ways to decrease the time it takes for an agent to learn from an expert agent. However, transferring knowledge across agents that operate in different action spaces, and are therefore heterogeneous, poses new challenges: it is difficult to translate between heterogeneous agents whose action spaces are not guaranteed to intersect.

We propose a transfer reinforcement learning algorithm between heterogeneous agents based on a subgoal trajectory mapping algorithm. We learn a mapping between expert and learner trajectories that are expressed through subgoals. We do so by training a recurrent neural network on trajectories in a training set. Then, given a new task, we input the expert's trajectory of subgoals into the trained model to predict the optimal trajectory of subgoals for the learner agent. We show that the learner agent is able to learn an optimal policy faster with this predicted trajectory of subgoals.
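Structurally, the subgoal mapping described above can be pictured as a small recurrent network that reads the expert's subgoal sequence and emits one predicted learner subgoal per step. The sketch below uses illustrative dimensions and untrained random weights; it shows only the sequence-to-sequence shape of the model, not a trained mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
d_expert, d_learner, d_hidden, T = 4, 3, 16, 6   # hypothetical dimensions

# Untrained Elman-style recurrent weights (illustrative only).
Wxh = rng.standard_normal((d_hidden, d_expert)) * 0.1
Whh = rng.standard_normal((d_hidden, d_hidden)) * 0.1
Why = rng.standard_normal((d_learner, d_hidden)) * 0.1

def map_subgoals(expert_traj):
    """expert_traj: (T, d_expert) -> predicted learner subgoals (T, d_learner)."""
    h = np.zeros(d_hidden)
    out = []
    for s in expert_traj:
        h = np.tanh(Wxh @ s + Whh @ h)   # recurrent state carries trajectory context
        out.append(Why @ h)
    return np.stack(out)

expert_traj = rng.standard_normal((T, d_expert))
pred = map_subgoals(expert_traj)
print(pred.shape)   # (6, 3): one predicted learner subgoal per expert subgoal
```

In practice the weights would be trained on the paired expert/learner trajectories in the training set; only the input/output structure is shown here.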

It is equally important for agents to adapt to the intentions of agents around them. To this end, we propose an inverse reinforcement learning algorithm to estimate the reward function of an agent as it updates its policy over time. Previous work in this field assumes the reward function is approximated by a set of linear feature functions. Choosing a sufficiently expressive set of feature functions is challenging, and failure to do so can skew the learned reward function. Instead, we propose an algorithm that estimates the policy parameters of the agent as it learns, bundling adjacent trajectories together in a new form of behavior cloning we call bundle behavior cloning. Our complexity analysis shows that bundle behavior cloning attains a tighter bound on the difference between the distribution of the cloned policy and that of the true policy than the corresponding bound for standard behavior cloning. In experiments, our method achieves the same overall reward using the estimated reward function as that learned from the initial trajectories; we also test the feasibility of bundle behavior cloning with different neural network structures and empirically examine the effect of the bundle choice on performance.
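A minimal sketch of the bundling idea follows, assuming (for illustration only) linear policies fit by least squares and a bundle size of two; the drifting parameters, trajectory sizes, and noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_traj(theta, n=50):
    """States and actions from a linear policy a = theta @ s + noise."""
    S = rng.standard_normal((n, 2))
    A = S @ theta + 0.01 * rng.standard_normal(n)
    return S, A

# The observed agent's true policy parameters drift as it learns.
true_thetas = [np.array([1.0, -0.5]) + 0.05 * k for k in range(8)]
trajs = [make_traj(th) for th in true_thetas]

# Bundle behavior cloning (sketch): group ADJACENT trajectories, where the
# policy has changed little, and fit one cloned policy per bundle.
bundle_size = 2
cloned = []
for i in range(0, len(trajs), bundle_size):
    bundle = trajs[i:i + bundle_size]
    S = np.vstack([s for s, _ in bundle])
    A = np.concatenate([a for _, a in bundle])
    theta_hat, *_ = np.linalg.lstsq(S, A, rcond=None)
    cloned.append(theta_hat)          # one policy-parameter estimate per bundle
```

Each bundle pools more data than a single trajectory while the policy inside it stays nearly constant, which is the intuition behind the tighter distribution bound stated above.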

Finally, motivated by the need for agents to adapt to environments that are prone to change through damage or detection, we propose the design of a robotic sensing agent to detect damage. In such dangerous environments, it may be unsafe for human operators to take measurements manually. The structural health monitoring literature proposes sequential sensing algorithms that minimize the number of locations at which measurements must be taken before sources of damage are located. Accordingly, the robotic sensing agent we designed is mobile, semi-autonomous, and precise in measuring a location on the model structure we built. We detail the components of our robotic sensing agent and show measurement data taken by the agent at two locations on the structure, exhibiting little to no measurement noise.

Item Open Access Distributed Intermittent Connectivity Control of Mobile Robot Networks (2018) Kantaros, Yiannis

Wireless communication plays a pivotal role in enabling teams of robots to accomplish global coordinated tasks. In fact, network connectivity is an underlying assumption in every distributed control and optimization algorithm. For this reason, there has been growing research in recent years on designing controllers that ensure point-to-point or end-to-end network connectivity for all time. Nevertheless, these methods severely restrict the robots from accomplishing their tasks, as motion planning is always constrained by network connectivity requirements. A much preferred solution is to let the robots communicate intermittently and operate in disconnected mode the rest of the time, giving rise to an intermittently connected communication network. While disconnected, the robots can accomplish their tasks free of communication constraints. The goal of this dissertation is to design a distributed intermittent connectivity framework that (i) ensures that the communication network is connected over time, infinitely often, (ii) is flexible enough to account for arbitrary dynamic tasks, and (iii) can be applied to large-scale networks.

The great challenge in developing intermittent connectivity protocols for networks of mobile robots is to decide (i) which robots talk to which, (ii) where, and (iii) when, so that the communication network is connected over time infinitely often. To address these challenges, we decompose the network into small groups of robots, called teams, so that every robot belongs to at least one team and there is a path, i.e., a sequence of teams in which consecutive teams have non-empty intersections, connecting every two teams; this way, information can propagate through the network. First, given such fixed teams, we design infinite sequences of communication events for all robots, called communication schedules, independent of the tasks assigned to the robots, that determine when every team should communicate. The designed schedules ensure that all teams communicate infinitely often, i.e., that the communication network is connected over time infinitely often. Between communication events the robots can move in the workspace free of communication constraints to accomplish their assigned tasks. Theoretical guarantees and numerical experiments corroborate the proposed framework. This is the first distributed intermittent connectivity framework that can be applied to large-scale networks and is flexible enough to account for arbitrary dynamic robot tasks.
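One simple way to realize such schedules is to assign conflicting teams (teams that share a robot, which can attend only one meeting at a time) to different time slots and repeat the slots forever, so every team communicates infinitely often. The sketch below uses hypothetical teams and a greedy slot assignment; it illustrates the idea, not the dissertation's algorithm.

```python
from itertools import count

# Hypothetical teams over robots 1..4; consecutive teams share a robot,
# so information can propagate around the whole network.
teams = {"T1": {1, 2}, "T2": {2, 3}, "T3": {3, 4}, "T4": {4, 1}}

def conflicts(a, b):
    # Teams conflict when they share a robot.
    return bool(teams[a] & teams[b])

# Greedy coloring of the conflict graph: give each team the smallest slot
# not used by any conflicting, already-scheduled team.
slot_of = {}
for t in teams:
    used = {slot_of[u] for u in slot_of if conflicts(t, u)}
    slot_of[t] = next(s for s in count() if s not in used)

n_slots = max(slot_of.values()) + 1
# Cycling through slots 0..n_slots-1 forever yields a periodic schedule in
# which every team meets infinitely often, keeping the network connected
# over time; between meetings, robots move freely to perform their tasks.
schedule = [[t for t in teams if slot_of[t] == s] for s in range(n_slots)]
print(schedule)
```

Here non-conflicting teams (e.g., teams with disjoint robot sets) share a slot, so the period of the schedule stays short even as the network grows.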

Next, given user-specified fixed teams, we integrate the respective communication schedules with task planning. Specifically, we consider high-level complex tasks captured by temporal logic formulas, state-estimation tasks, and time-critical dynamic tasks. The proposed distributed integrated path planning and intermittent connectivity frameworks determine both where and when every team should communicate so that the assigned task is accomplished, the communication network is connected over time infinitely often, and a user-specified metric, such as total traveled distance or consumed energy, is minimized. We show that employing the proposed intermittent connectivity framework for such tasks results in significant performance gains compared to existing solutions in the literature that maintain connectivity for all time. Theoretical guarantees and numerical and experimental studies support the proposed distributed control algorithms.

Finally, we propose a fully autonomous intermittent connectivity framework that can handle arbitrary dynamic tasks and also allows the robots to locally and online update the structure of the teams and the communication schedules, effectively allowing them to decide whom they should talk to so that they can better accomplish newly assigned tasks. The structure of the teams, the associated communication locations, and the time instants when communication within teams will occur are integrated online with task planning, giving rise to paths, i.e., sequences of waypoints, that ensure that the assigned task is accomplished, the communication network is connected over time infinitely often, and a user-specified metric is minimized. This is the first fully autonomous, distributed, and online intermittent connectivity framework that can handle arbitrary dynamic tasks and also controls the topology of the intermittently connected robot network to better accomplish these tasks. At the same time, the proposed framework scales well with the size of the robot network. Theoretical guarantees and numerical experiments corroborate the proposed distributed control scheme.

Item Open Access Transfer Learning in Continuous RL Under Unobservable Contextual Information (2020) Liu, Chenyu

In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces under unobserved contextual information. The context here can represent a specific, unique mental view of the world that an expert agent has formed through past interactions with that world. We assume that this context is not accessible to a learner agent, who can only observe the expert data and does not know how they were generated. Our goal is then to use the context-aware continuous expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. To date, such problems are typically solved using imitation learning, which assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone results in a biased learner policy and requires many new data samples to improve. To address this challenge, we formulate the learning problem that the learner agent solves as a causal-bound-constrained Multi-Armed Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and that represent the different expert behaviors affected by the unobserved context. The MAB constraints, in turn, correspond to causal bounds on the accumulated rewards of these basis policy functions, which we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online, and the use of causal bounds reduces the exploration variance and therefore improves the learning rate.
We provide numerical experiments on an autonomous driving example showing that the proposed transfer RL method improves the learner's policy faster than imitation learning methods and enjoys much lower variance during training.
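The causal-bound-constrained MAB selection described above can be sketched as a UCB rule whose index is truncated by per-arm causal upper bounds. All rewards, bounds, and constants below are illustrative; the arms stand in for the basis policies.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.8]      # unknown expected rewards of the basis policies
causal_upper = [0.4, 1.0, 1.0]    # hypothetical causal bounds from expert data

counts = np.zeros(3)
sums = np.zeros(3)
for t in range(1, 2001):
    n = np.maximum(counts, 1)
    # Standard UCB index (infinite for unplayed arms)...
    ucb = np.where(counts > 0,
                   sums / n + np.sqrt(2 * np.log(t) / n),
                   np.inf)
    # ...truncated by the causal bounds: provably bad arms are barely explored.
    index = np.minimum(ucb, causal_upper)
    arm = int(np.argmax(index))
    reward = float(rng.random() < true_means[arm])   # Bernoulli feedback
    counts[arm] += 1
    sums[arm] += reward

print(counts)   # the best arm dominates; arm 0 is cut off early by its bound
```

Because arm 0's causal upper bound (0.4) lies below the best arm's achievable reward, its truncated index never wins, which is exactly how the bounds cut exploration variance.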