On the Knowledge Transfer via Pretraining, Distillation and Federated Learning
dc.contributor.advisor | Carin, Lawrence | |
dc.contributor.author | Hao, Weituo | |
dc.date.accessioned | 2022-06-15T18:44:25Z | |
dc.date.available | 2022-06-15T18:44:25Z | |
dc.date.issued | 2022 | |
dc.department | Electrical and Computer Engineering | |
dc.description.abstract | Modern machine learning technology, driven by a revival of deep neural networks, has been successfully applied in many practical domains such as computer vision (CV) and natural language processing (NLP). The now-standard paradigm is pre-training: a large model with billions of parameters is trained on a surrogate task and then adapted to the downstream task of interest via fine-tuning. Knowledge transfer is what makes pre-training possible, but scale is what makes it powerful, and scale requires far more training data and computing resources. Alongside the great success of deep learning, fueled by larger datasets and greater computational capability, comes a series of interesting research questions. First, most pre-trained models are trained on single-modality (vision or text) datasets and are designed for single-step downstream tasks such as classification. Does pre-training still work for more complex tasks such as reinforcement learning? Second, pre-trained models obtain impressive empirical performance at the price of deployment challenges on low-resource (both memory and computation) platforms. How can large models be compressed into smaller ones efficiently? Third, collecting sufficient training data is often expensive, time-consuming, or even unrealistic in many scenarios due to privacy constraints. Is there a training paradigm that requires no data exchange? To address these less-explored questions, I conducted several projects, including: I) large-scale pre-training on multi-modal input for vision-and-language navigation, demonstrating the effectiveness of knowledge transfer across complex tasks via pre-training; II) data augmentation for compressing large-scale language models, improving the efficiency of knowledge transfer in the teacher-student distillation framework; III) weight factorization for model weight sharing in federated learning, achieving a trade-off between model performance and data privacy. | |
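The abstract's second project builds on the generic teacher-student distillation framework. As a rough illustrative sketch only (not the dissertation's implementation, and all function and variable names here are hypothetical), the standard temperature-scaled distillation objective combines a soft-target KL term against the teacher's logits with the usual hard-label cross-entropy:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soften both distributions with the temperature and match them via KL divergence
    # (knowledge transferred from teacher to student).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients are comparable to the hard-label term
    # Ordinary cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage with random logits for a batch of 8 examples and 4 classes.
student_logits = torch.randn(8, 4, requires_grad=True)
teacher_logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()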
dc.identifier.uri | ||
dc.subject | Computer engineering | |
dc.subject | Electrical engineering | |
dc.subject | Deep learning | |
dc.subject | Federated learning | |
dc.subject | Machine learning | |
dc.subject | Representation learning | |
dc.title | On the Knowledge Transfer via Pretraining, Distillation and Federated Learning | |
dc.type | Dissertation |