Adaptive Planning in Changing Policies and Environments

Limited Access
This item is unavailable until:






Being able to adapt to different tasks is a staple of learning, as agents aim to generalize across different situations. Specifically, it is important for agents to adapt to the policies of other agents around them. In swarm settings, multi-agent sports settings, or other team-based environments, agents that learn from one another can save time and reduce errors in performance. To this end, traditional transfer reinforcement learning proposes ways to decrease the time it takes for an agent to learn from an expert agent. However, transferring knowledge across heterogeneous agents, that is, agents that operate in different action spaces, poses new challenges. In particular, it is difficult to translate between heterogeneous agents whose action spaces are not guaranteed to intersect.

We propose a transfer reinforcement learning algorithm between heterogeneous agents based on a subgoal trajectory mapping algorithm. We learn a mapping between expert and learner trajectories that are expressed through subgoals. We do so by training a recurrent neural network on trajectories in a training set. Then, given a new task, we input the expert's trajectory of subgoals into the trained model to predict the optimal trajectory of subgoals for the learner agent. We show that the learner agent is able to learn an optimal policy faster with this predicted trajectory of subgoals.

It is equally important for agents to adapt to the intentions of agents around them. To this end, we propose an inverse reinforcement learning algorithm to estimate the reward function of an agent as it updates its policy over time. Previous work in this field assumes the reward function is approximated by a set of linear feature functions. Choosing an expressive enough set of feature functions can be challenging, and failure to do so can skew the learned reward function. Instead, we propose an algorithm to estimate the policy parameters of the agent as it learns, bundling adjacent trajectories together in a new form of behavior cloning we call bundle behavior cloning. Our complexity analysis shows that, using bundle behavior cloning, we can attain a tighter bound on the difference between the distribution of the cloned policy and that of the true policy than the corresponding bound for standard behavior cloning. We show experiments where our method, using the estimated reward function, achieves the same overall reward as that learned from the initial trajectories. We also test the feasibility of bundle behavior cloning with different neural network structures and empirically test the effect of the bundle choice on performance.
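The bundling idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract refers to neural network policies, whereas this sketch swaps in a tabular, count-based estimate of the policy for each bundle. The function names, the non-overlapping bundling scheme, the `bundle_size` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def make_bundles(trajectories, bundle_size):
    """Group consecutive trajectories into non-overlapping bundles (assumed scheme)."""
    return [trajectories[i:i + bundle_size]
            for i in range(0, len(trajectories), bundle_size)]


def clone_policy(bundle):
    """Estimate P(action | state) from state-action counts within one bundle.

    This is a tabular maximum-likelihood stand-in for a learned policy model.
    """
    counts = defaultdict(Counter)
    for trajectory in bundle:
        for state, action in trajectory:
            counts[state][action] += 1
    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


def bundle_behavior_cloning(trajectories, bundle_size):
    """Return one cloned policy per bundle, in the order the agent produced them."""
    return [clone_policy(b) for b in make_bundles(trajectories, bundle_size)]


# Hypothetical toy data: each trajectory is a list of (state, action) pairs,
# ordered by the time at which the learning agent generated it.
trajectories = [
    [("s0", "left"), ("s1", "left")],
    [("s0", "left"), ("s1", "right")],
    [("s0", "right"), ("s1", "right")],
    [("s0", "right"), ("s1", "right")],
]

policies = bundle_behavior_cloning(trajectories, bundle_size=2)
for index, policy in enumerate(policies):
    print(index, policy)
```

In this toy setting, pooling adjacent trajectories gives each per-bundle estimate more samples than a single trajectory would, at the cost of treating the agent's policy as fixed within a bundle. That trade-off is what the choice of bundle size controls here.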

Finally, because agents must adapt to environments that are prone to change due to damage or detection, we propose the design of a robotic sensing agent to detect damage. In such dangerous environments, it may be unsafe for human operators to take measurements manually. Current literature in structural health monitoring proposes sequential sensing algorithms that minimize the number of locations at which measurements must be taken before sources of damage are located. Accordingly, the robotic sensing agent we designed is mobile, semi-autonomous, and precise in measuring a location on the model structure we built. We detail the components of our robotic sensing agent and present measurement data taken by our agent at two locations on the structure, which display little to no noise.





Sivakumar, Kavinayan Pillaiar (2023). Adaptive Planning in Changing Policies and Environments. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.