
The Transformer has demonstrated superior performance across natural language processing (NLP) tasks, including machine translation, language understanding, and text generation. Its multihead attention mechanism provides strong flexibility for fusing contextual information and therefore facilitates long-range relation modeling. Further, Transformers have proved effective for learning universal knowledge at scale; representative models include BERT, GPT, and their subsequent variants. The Transformer is also observed to be more tolerant of convergence plateaus and capable of scaling to more than one hundred billion parameters.

Despite these advances, we believe that the Transformer can be pushed further toward the two extremes of knowledge learning: expert knowledge and universal knowledge. On the one hand, expert knowledge, such as the medical knowledge that humans accumulate through extensive education and practice, plays a vital role in professional disciplines. However, because expert knowledge takes various forms (e.g., knowledge graphs, textual templates, and tables of statistics), and different Transformer models must currently be developed to handle each form, there is an urgent need for a unified framework that efficiently encodes and decodes different types of knowledge. On the other hand, learning universal knowledge requires substantial training data and a large model size to absorb information from unlabeled data in a self-supervised manner. However, existing self-supervised language models lack a structured encoding of the input and therefore fail to generate plausible text in a controllable way. Moreover, learning from high-dimensional input, such as image pixels, is challenging for the Transformer because of its heavy computational cost and the sparse semantic information carried by individual pixels.
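The multihead attention mentioned above can be illustrated with a minimal sketch (not taken from the dissertation): each head computes scaled dot-product attention over the full sequence, which is what enables the long-range relation modeling described here. Random projection matrices stand in for learned parameters; all names and shapes below are illustrative assumptions.

```python
import numpy as np

def multihead_attention(x, num_heads, rng):
    """Scaled dot-product attention with num_heads parallel heads.

    x: (seq_len, d_model) input. Every head attends over the entire
    sequence, so any two positions interact in a single layer --
    the basis of the Transformer's long-range relation modeling.
    Weights are random stand-ins for learned parameters.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Query/key/value/output projections (randomly initialized).
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Split the model dimension into heads: (num_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)

    # Scaled dot-product scores, then a row-wise softmax per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, s, s)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values, heads concatenated back to d_model.
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))           # toy sequence: 5 tokens, d_model=8
out, attn = multihead_attention(x, num_heads=2, rng=rng)
print(out.shape, attn.shape)              # (5, 8) (2, 5, 5)
```

Because each of the (num_heads, seq_len, seq_len) attention maps is a proper probability distribution over positions, the heads can specialize in fusing different kinds of contextual information.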
In this proposal, we address these challenges by first defining a unified formulation for acquiring both expert and universal knowledge, and then developing several novel Transformer models and variants, including Graph Transformers, Variational Autoencoders (VAEs) implemented with the Transformer architecture, and Visual-Linguistic Masked Autoencoders (VL-MAEs) for learning visual representations with additional language supervision. The techniques developed in this proposal will ease the burden and lower the barrier to entry of learning with both universal knowledge and expertise for ML researchers and practitioners, and will also reduce the cost of research.





Li, Yuan (2022). LEARNING BOTH EXPERT AND UNIVERSAL KNOWLEDGE USING TRANSFORMERS. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/25221.


Duke's student scholarship is made available to the public using a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) license.