Gokul - Publications

Google Scholar

10 Publications: 1 Patent | 7 Conferences |2 Reports

[R2] Towards Learning Efficient Multilingual and Multimodal Representation

2023 | MSc Thesis @ MBZUAI #language #vision #speech

This thesis focuses on developing efficient representation methods for multilingual and multimodal data in machine learning. The research is divided into three stages, each focusing on specific tasks. The first stage investigates and improves multilingual representation approaches for question-answering and text-to-speech tasks. The second stage aims to improve the fusion strategies of multimodal representations for hateful meme classification. In the third stage, the previous stages are unified by exploring the image retrieval task and improving the performance of multilingual and multimodal representations.

The thesis proposes various approaches using pre-trained models and multimodal fusion techniques to improve the performance and the cultural relevance of various machine learning applications. For example, the proposed Hate-CLIPper architecture achieves state-of-the-art performance on meme detection, while training using a natively multilingual and multimodal Wikipedia Image-Text dataset with English text augmentation enables retrieval of culturally relevant images in ten Indian languages. The research not only contributes to the development of efficient representation methods for multilingual and multimodal data, but also inspires further investigations into the use of pre-trained models and multimodal fusion techniques for machine learning in multilingual and multimodal settings.

[C7] WavLink: Compact Audio-Text Embeddings with a Global Whisper Token

Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid

2025 | Under Review #language #speech

Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio–language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio–text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio–text embedding model that augments Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices—including pretrained text encoders, loss functions, training modes, and data mixtures—we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8× smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.

[C6] Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

2025 | ASRU #language #speech

Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored—despite audio’s centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data—less than 30K hours (5K unique)—Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities—such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors—are not required for strong performance, even compared to models trained on over 500K hours of data.

[C5] VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models

Gokul Karthik Kumar, Iheb Chaabane, Kebin Wu

2025 | PAKDD #language #vision

Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved image-text web documents. Our approach transforms 45K web documents from the OBELICS dataset into 100K image conversation samples. We utilize GPT-4V to generate image-contextual captions and OpenChat 3.5 model to convert these captions into diverse free-form and multiple-choice question-answer pairs. Integrating this dataset for fine-tuning considerably enhances VLM performance across multiple benchmarks. Unlike methods that focus solely on fine-grained visual content, our approach leverages accompanying web context, yielding superior results. We also discover that a 'leaky modality mix', where conversation samples contain questions answerable from both the image and its contextual caption, outperforms non-leaky combinations of captions and Q&A pairs. VisCon-100k dataset shows strong performance with two popular VLM approaches: text-only large language model (LLM) aligned with a vision encoder using image captions data (ShareGPT4V-7b) and multimodally pretrained LLM (IDEFICS2-8b) using interleaved image-text data. In addition to releasing the VisCon-100K dataset, we provide a contextual captioner trained on this dataset, facilitating scalable fine-tuning data generation for future research and open-source applications. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.

[C4] Towards Building Text-To-Speech Systems for the Next Billion Users

Gokul Karthik Kumar=, Praveen S V=, Pratyush Kumar, Mitesh M. Khapra, Karthik Nandakumar

2023 | ICASSP #language #speech

Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.

[C3] Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP features

Gokul Karthik Kumar, Karthik Nandakumar

2022 | EMNLP Workshop #language #vision

Hateful memes are a growing menace on social media. While the image and its corresponding text in a meme are related, they do not necessarily convey the same meaning when viewed individually. Hence, detecting hateful memes requires careful consideration of both visual and textual information. Multimodal pre-training can be beneficial for this task because it effectively captures the relationship between the image and the text by representing them in a similar feature space. Furthermore, it is essential to model the interactions between the image and text features through intermediate fusion. Most existing methods either employ multimodal pre-training or intermediate fusion, but not both. In this work, we propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations obtained using Contrastive Language-Image Pre-training (CLIP) encoders via a feature interaction matrix (FIM). A simple classifier based on the FIM representation is able to achieve state-of-the-art performance on the Hateful Memes Challenge (HMC) dataset with an AUROC of 85.8, which even surpasses the human performance of 82.65. Experiments on other meme datasets such as Propaganda Memes and TamilMemes also demonstrate the generalizability of the proposed approach. Finally, we analyze the interpretability of the FIM representation and show that cross-modal interactions can indeed facilitate the learning of meaningful concepts. The code for this work is available at https://github.com/gokulkarthik/hateclipper

[C2] MuCoT: Multilingual Contrastive Training For Question-Answering In Low-resource Languages

Gokul Karthik Kumar, Abhishek Singh Gehlot, Sahal Shaji Mullappilly, Karthik Nandakumar

2022 | ACL Workshop #language

Accuracy of English-language Question Answering (QA) systems has improved significantly in recent years with the advent of Transformer-based models (e.g., BERT). These models are pre-trained in a self-supervised fashion with a large English text corpus and further fine-tuned with a massive English QA dataset (e.g., SQuAD). However, QA datasets on such a scale are not available for most of the other languages. Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages. Since these models are pre-trained with huge text corpora containing multiple languages, they typically learn language-agnostic embeddings for tokens from different languages. However, directly training an mBERT-based QA system for low-resource languages is challenging due to the paucity of training data. In this work, we augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model, which is already pre-trained in English. Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance, whereas the performance degrades in the case of cross-language families. We further show that introducing a contrastive loss between the translated question-context feature pairs during the fine-tuning process, prevents such degradation with cross-lingual family translations and leads to marginal improvement. The code for this work is available at https://github.com/gokulkarthik/mucot.

[R1] An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers

Gokul Karthik Kumar, Sahal Shaji Mullappilly, Abhishek Singh Gehlot

2022 | ArXiv #vision

Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks. The follow-up research adapted similar methods like masked image modeling in vision transformer and demonstrated improvements in the image classification task. Such simple self-supervised methods are not exhaustively studied for object detection transformers (DETR, Deformable DETR) as their transformer encoder modules take input in the convolutional neural network (CNN) extracted feature space rather than the image space as in general vision transformers. However, the CNN feature maps still maintain the spatial relationship and we utilize this property to design self-supervised learning approaches to train the encoder of object detection transformers in pretraining and multi-task learning settings. We explore common self-supervised methods based on image reconstruction, masked image modeling and jigsaw. Preliminary experiments in the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings; nonetheless, similar improvement is not observed in the case of multi-task learning with Deformable DETR.

[P1] Method And System For Forecasting Sales Based On N-Gram Model

Gokul Karthik, Avinash Achar, Balaraman Ravindran

2021 | US Patent #time-series

This disclosure relates generally to method and system for forecasting sales based on N-Gram model. The present disclosure provides accurate prediction of sales for optimal operations to reduce the cost. The method receives a plurality of inputs of each product comprising a sales history, and a current price bin. The categorical sale(s) for each product is discretized based on the sales history by clustering each product sales history into a one or more groups based on a maximum sales velocity range. Further, a probability table is generated for the discretized categorical sales of each product based on computing a round off weighted mean and a median using a N-Gram model. Then, a smooth probability table is computed for the generated probability table. To forecast sales multistep prediction for the smooth probability table is computed based on at least one of a joint approach, a bootstrapped approach, and a step greedy approach.

[C1] Dynamic Bus Arrival Time Prediction: A Temporal Difference Learning Approach

LKP Vignesh, Avinash Achar, Gokul Karthik

2020 | IJCNN #time-series

Public transport buses suffer travel time uncertainties owing to diverse factors such as dwell times at bus stops, signals, seasonal variations and fluctuating travel demands etc. Traffic in the developing world in particular is afflicted by additional factors like lack of lane discipline, diverse modes of transport and excess vehicles. The bus travel time prediction problem on account of these factors continues to remain a demanding problem especially in developing countries. The current work proposes a method to address bus travel time prediction in real-time. The central idea of our method is to recast the dynamic prediction problem as a value-function prediction problem under a suitably constructed Markov reward process (MRP). Once recast as an MRP, we explore a family of value-function predictors using temporal-difference (TD) learning for bus prediction. Existing approaches build supervised models either by (a)training based on travel time targets only between successive bus-stops while keeping the no. of models linear in the number of bus-stops OR (b)training a single model which predicts between any two bus-stops while ignoring the huge variation in the travel-time targets during training. Our TD-based approach attempts to strike an optimal balance between the above two class of approaches by training with travel-time targets between any two bus-stops while keeping the number of models (approximately) linear in the number of bus-stops. It also keeps a check on the variation in the travel-time targets. Our extensive experimental results vindicate the efficacy of the proposed method. The method exhibits comparable or superior prediction performance on mid-length and long-length routes compared to the state-of-the art.

Google Sites

Report abuse