Why Multimodal?
Unimodal Representation Learning
Self-supervised Representation Learning
Machine Learning
Multimodal Learning
- Look, Listen and Learn, R. Arandjelović, A. Zisserman
- Look, Listen and Learn More: Design choices for deep audio embeddings, J. Cramer, H. Wu, J. Salamon, J. P. Bello
- Learning Transferable Visual Models From Natural Language Supervision (CLIP), Alec Radford et al.
- Learning Audio-Language Representations (CLAP), Andonian et al.
- Speech2Face: Learning the Face Behind a Voice, T.-H. Oh et al.
- WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning, Krishna Srinivasan et al.
Speech and Audio
Multimodal Learning: Newer Domains
- Multimodal Biomedical AI, Julián N. Acosta, Guido J. Falcone, Pranav Rajpurkar, Eric J. Topol
- BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity, Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang
Datasets
- [MNIST]
- [FMNIST]
- [CIFAR]
- [CANDOR]
- [Coswara]
- [ImageNet]
- [WIT]
- [AudioSet (Google)]
- [VoxCeleb]