Title: On the Knowledge Transfer via Pretraining, Distillation and Federated Learning
Contributors: Carin, Lawrence; Hao, Weituo
Dates: 2022-06-15; 2022-06-15; 2022
URI: https://hdl.handle.net/10161/25279
Subjects: Computer engineering; Electrical engineering; Deep learning; Federated learning; Machine learning; Representation learning
Type: Dissertation

Abstract

Modern machine learning, driven by the revival of deep neural networks, has been applied successfully in many practical domains such as computer vision (CV) and natural language processing (NLP). The now-standard paradigm is pre-training: a large model with billions of parameters is trained on a surrogate task and then adapted to the downstream task of interest via fine-tuning. Knowledge transfer is what makes pre-training possible, but scale is what makes it powerful, and scale in turn demands far more training data and computing resources.

Alongside this success, fueled by larger datasets and greater computational capacity, come a series of open research questions. First, most pre-trained models learn from single-modality (vision or text) datasets and are designed for single-step downstream tasks such as classification. Does pre-training still work for more complex tasks such as reinforcement learning? Second, pre-trained models achieve impressive empirical performance at the price of deployment challenges on platforms with limited memory and computation. How can large models be compressed into smaller ones efficiently? Third, collecting sufficient training data is often expensive, time-consuming, or even unrealistic due to privacy constraints. Is there a training paradigm that requires no data exchange?

To address these under-explored questions, I conducted several projects, including: (i) large-scale pre-training on multi-modal input for vision-and-language navigation, demonstrating the effectiveness of knowledge transfer across complex tasks via pre-training; (ii) data augmentation for compressing large-scale language models, improving the efficiency of knowledge transfer in the teacher-student distillation framework; and (iii) weight factorization for model-weight sharing in federated learning, achieving a trade-off between model performance and data privacy.
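
As a rough illustration of the pre-train / fine-tune paradigm described in the abstract, the sketch below pre-trains a small encoder on a stand-in surrogate task and then reuses it, with a fresh head, for a downstream task. The model sizes, tasks, and random data are placeholders, not the architectures or benchmarks studied in the dissertation.

```python
# Minimal sketch of the pre-train / fine-tune paradigm (placeholder model and data).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))  # shared backbone

# 1) Pre-training on a surrogate task (here: an arbitrary 10-way classification).
surrogate_head = nn.Linear(64, 10)
pretrain_opt = torch.optim.Adam(list(encoder.parameters()) + list(surrogate_head.parameters()), lr=1e-3)
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))   # stand-in surrogate data
for _ in range(10):
    loss = nn.functional.cross_entropy(surrogate_head(encoder(x)), y)
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()

# 2) Fine-tuning: reuse the pre-trained encoder and attach a fresh head for the downstream task.
downstream_head = nn.Linear(64, 2)
finetune_opt = torch.optim.Adam(list(encoder.parameters()) + list(downstream_head.parameters()), lr=1e-4)
x_d, y_d = torch.randn(64, 32), torch.randint(0, 2, (64,))  # stand-in downstream data
for _ in range(10):
    loss = nn.functional.cross_entropy(downstream_head(encoder(x_d)), y_d)
    finetune_opt.zero_grad(); loss.backward(); finetune_opt.step()
```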
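
Project (ii) builds on teacher-student distillation. Below is a minimal sketch of one standard (Hinton-style) distillation step, in which the student matches the teacher's temperature-softened output distribution; the toy linear models are placeholders, and the dissertation's data-augmentation strategy is not reproduced here.

```python
# Minimal sketch of a teacher-student distillation step (hypothetical toy models).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(32, 10)    # stand-in for a large pre-trained language model
student = nn.Linear(32, 10)    # stand-in for the smaller compressed model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                        # softmax temperature

x = torch.randn(128, 32)       # a batch of (possibly augmented) inputs
with torch.no_grad():
    teacher_logits = teacher(x)            # teacher predictions act as soft targets

student_logits = student(x)
# KL divergence between temperature-softened distributions; the T**2 factor keeps
# gradient magnitudes comparable across temperatures.
kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   F.softmax(teacher_logits / T, dim=-1),
                   reduction="batchmean") * T * T
optimizer.zero_grad(); kd_loss.backward(); optimizer.step()
```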
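
Project (iii) concerns weight factorization for parameter sharing in federated learning. The sketch below factorizes a layer's weight as U·V and averages only one factor across clients in a FedAvg-style round; which factor is shared versus kept local, and the rank used, are illustrative assumptions here, not necessarily the scheme proposed in the dissertation.

```python
# Minimal sketch of weight factorization for parameter sharing in federated learning.
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Linear layer with its weight factorized as W = U @ V (rank r)."""
    def __init__(self, in_dim, out_dim, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_dim, rank) * 0.02)  # factor shared with the server (assumed)
        self.V = nn.Parameter(torch.randn(rank, in_dim) * 0.02)   # factor kept local to the client (assumed)
    def forward(self, x):
        return x @ (self.U @ self.V).t()

clients = [FactorizedLinear(32, 10, rank=4) for _ in range(3)]

# One communication round: average only the shared factor U across clients,
# leaving each client's V private and reducing what each client must transmit.
with torch.no_grad():
    avg_U = torch.stack([c.U for c in clients]).mean(dim=0)
    for c in clients:
        c.U.copy_(avg_U)
```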