Why Multimodal?
Unimodal Representation Learning
Self-supervised Representation Learning
Machine Learning
Multimodal Learning
- Look, Listen and Learn, R. Arandjelović, A. Zisserman
- Look, Listen and Learn More: Design choices for deep audio embeddings, J. Cramer, H. Wu, J. Salamon, J. P. Bello
- Learning Transferable Visual Models From Natural Language Supervision (CLIP), Alec Radford et al.
- Learning Audio-Language Representations (CLAP), Andonian et al.
- Speech2Face: Learning the Face Behind a Voice, T.-H. Oh et al.
- WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning, Krishna Srinivasan et al.
Speech and Audio
Multimodal Learning: Newer Domains
- Multimodal Biomedical AI, Julián N. Acosta, Guido J. Falcone, Pranav Rajpurkar, Eric J. Topol
- BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity, Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang
Datasets
- [MNIST]
- [FMNIST]
- [CIFAR]
- [CANDOR]
- [Coswara]
- [ImageNet]
- [WIT]
- [AudioSet (Google)]
- [VoxCeleb]