About Me

Hi! I am Neeraj Sharma: a researcher + engineer, amused by the "signals" of nature.
+ From July 2022 - now: I am an Assistant Professor at the School of Data Science and Artificial Intelligence, Indian Institute of Technology Guwahati (India). To know more about my research group, SPIN Lab, click here.


Previous Affiliations


+ From Dec. 2021 - June 2022: I was a Scientist at the Fraunhofer Institute for Integrated Circuits (IIS) in Erlangen (Germany). To see an (animated) history of this institute: click here. Working in the Audio Labs there, I explored ways to quantify the spatial cognition triggered by speech and audio signals. In this adventure, I made use of behavioral and EEG listening experiments to capture and analyze spatial audio cognition. I was also a member of the Spatial Audio Group, led by Prof. Emanuël Habets, at the International Audio Laboratories Erlangen.
+ From Jan. 2021 - Nov. 2021: I was a CV Raman Postdoctoral Researcher at the Indian Institute of Science, Bengaluru (India). I worked on topics related to the computational analysis of cough, breathing, and speech sound signals for COVID-19 detection. Working together in a wonderful team, we developed the Coswara (COVID + Swara) tool during this phase. I was a member of the LEAP Lab, Dept. of Electrical Engineering, and was mentored by Dr. Sriram Ganapathy (IISc).
+ From Aug. - Dec. 2020: I was a Postdoctoral Researcher at the Society for Innovation and Development (SID), Indian Institute of Science, Bengaluru. I worked on analyzing the impact of language familiarity on talker change perception. I was mentored by Dr. Sriram Ganapathy (IISc).
+ From Feb. 2019 - March 2020: I was a Postdoctoral Researcher at the Neuroscience Institute and Dept. of Psychology, Carnegie Mellon University (CMU), Pittsburgh. I worked on understanding the human brain response to talker changes while listening to multi-talker speech, and designed behavioral and EEG experiments for this. I worked under the mentorship of Prof. Lori Holt and Prof. Barbara Shinn-Cunningham.
+ From Mar. 2017 - Nov. 2018: I was a BrainHub Postdoctoral Fellow at CMU. I pursued a human and machine perception study on talker change detection. This research was funded by BrainHub at CMU, Pittsburgh. I carried out this work under the guidance of Prof. Lori Holt (CMU) and Prof. Sriram Ganapathy (IISc). To know more about the findings: click here.
+ In June 2018: I received my PhD (and Master's) from the Indian Institute of Science, Bengaluru, India. My PhD thesis is titled "Information-rich sampling of time-varying signals". I had my research habitat at the wonderful Speech and Audio Group (SAG), Dept. of Electrical Communication Engineering, led by Prof. T. V. Sreenivas. To get a quick description of the thesis: click here.
+ In June 2009: I received a Bachelor of Technology (Instrumentation and Electronics Engg.) from the College of Engineering and Technology, Bhubaneswar.

A bit more information

SPIN Lab: click here.
Blog: click here.
CV: Download PDF.
GitHub-ID: neerajww.
Google Scholar: click here.
My PhD Thesis: Information-rich sampling of time-varying signals.
Project Coswara: Does COVID-19 make a unique respiratory sound?.
Email-id: X@Y.com, where X is neerajww and Y is gmail.


Talks

The AI Landscape: Exploring the Ultimate Interdisciplinary Frontier
Invited talk, International Day of Light, SPIE Student Chapter, IIT Guwahati, 19 May 2023.
Intelligence or: How we started to stop worrying and love living
Invited EEE Department talk at the Research and Industrial Conclave, IIT Guwahati, 15 May 2023.
AI for All
A 180-minute bootcamp at the Research and Industrial Conclave, IIT Guwahati, 14 May 2023. Slides.
The Curious Case of Sound
Talk at TEDx IIT Guwahati, Feb 2023. Video.
The COVID-19 Connection: Exploring cough, breathing, and speech sounds for respiratory healthcare
Talk at the Assam AI Initiative, Jan 2023. Video.
The curious case of respiratory sound samples: COVID-19 detection from breathing, cough, and speech sound recordings
Talk in the Data Science and AI Seminar Series, organized by the Reading Group, MF School of Data Science and AI, IIT Guwahati, Aug 2022.
Can COVID-19 be detected using sound recordings?
Invited talk at Fraunhofer IIS, Erlangen, Dec 2021. Slides.
Language familiarity impacts brain processing of speech
Invited talk at the Annual EECS Symposium 2021, Indian Institute of Science, Bangalore, May 2021. Video | Slides.
Desire for speed: Quest to bridge the gap between human and machine
Invited talk at the Short Term Teachers' Training Programme (AICTE) 2020, TIST, Kerala, Nov. 2020. Slides.
Coswara: A database of breathing, cough, and voice sounds for COVID-19 diagnosis
Presentation at the Interspeech 2020 Conference, Shanghai, October 2020. Video | Slides.
On the impact of language familiarity in talker change detection
Presentation at the ICASSP 2020 Conference, Barcelona, May 2020. Video | Slides.
Talker change detection: Human and machine comparison
Talk at the Communications Neuroscience Research Laboratory, Boston University, Boston, May 2019. Slides.
Are machines ready to listen?
Part 3 of the Symposium on Dimension-based Attention in Learning and Understanding Spoken Language, CogSci, Wisconsin, July 2018. Slides.

Media Coverage/Outreach

The Modern Challenge of Voice Identity and Anonymity, by Aria Bracci in HotPod News, 11 Apr 2021
Covid test is not a cough nut to track, finds IISc, by Hemanth CS, 20 Feb 2021
Alexa, are you ready to join our dinner table conversation!, by IISc, 11 Jun 2020
Computer Voice Recognition Still Learning to Detect Who’s Talking, by Yuen Yiu in Inside Science, 25 Jan 2019

Teaching

July-Nov 2023:
I will be involved in teaching (and learning from) the following courses.
  • DA321: Multimodal Data Analysis and Learning - I
  • DA514: Python Programming Lab
Jan-May 2023:
Completed.

Research

  • Please check my group webpage: SPIN Lab

Peer-reviewed Published Findings
(for an updated list, see Google Scholar)

Sparse signal reconstruction based on signal dependent non-uniform samples (In ICASSP'12, Kyoto)

The classical approach to A/D conversion has been uniform sampling, and we get perfect reconstruction for bandlimited signals by satisfying the Nyquist sampling theorem. We propose a non-uniform sampling scheme based on level crossing (LC) time information. We show stable reconstruction of bandpass signals, with the correct scale factor and hence a unique reconstruction, from only the non-uniform time information. For reconstruction from the level crossings, we use sparse-reconstruction-based optimization, constraining the bandpass signal to be sparse in its frequency content. While the literature resorts to an overdetermined system of equations, we use an underdetermined formulation along with sparse reconstruction. We obtain a reconstruction SNR > 20 dB and perfect support recovery with probability close to 1 in the noiseless case, and with lower probability in the noisy case. Randomly picking LCs from different levels, over the same limited signal duration and for the same amount of information, is seen to be advantageous for reconstruction.
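
If you are curious how the core idea looks in code, here is a minimal Python sketch (an illustrative toy, not the paper's exact algorithm: the signal length, crossing levels, and Lasso penalty are all my assumptions):

```python
# Toy: recover a frequency-sparse signal from nonuniform level-crossing
# (LC) samples via l1-regularized least squares on a DFT dictionary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, K = 256, 5                        # grid length, number of active bins
t = np.arange(N)
bins = rng.choice(np.arange(10, 100), size=K, replace=False)
x = sum(np.cos(2 * np.pi * b * t / N + rng.uniform(0, 2 * np.pi))
        for b in bins)               # signal sparse in the DFT domain

# Emulate LC sampling: keep grid points where x crosses a fixed level.
cross = np.zeros(N, dtype=bool)
for lev in (-1.0, 0.0, 1.0):
    s = np.sign(x - lev)
    cross[1:] |= s[1:] != s[:-1]
idx = np.flatnonzero(cross)          # nonuniform sample locations
y = x[idx]

# Real DFT dictionary (cosines and sines) evaluated at the sample times.
freqs = np.arange(1, N // 2)
A = np.hstack([np.cos(2 * np.pi * np.outer(idx, freqs) / N),
               np.sin(2 * np.pi * np.outer(idx, freqs) / N)])
coef = Lasso(alpha=0.01, max_iter=100000).fit(A, y).coef_
amp = np.hypot(coef[:len(freqs)], coef[len(freqs):])
print("true bins:", sorted(bins))
print("estimated bins:", freqs[amp > 0.1].tolist())
```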


Event-triggered sampling and reconstruction of sparse trigonometric polynomials (In SPCOM'14, Bangalore)


We propose data acquisition from continuous-time signals belonging to the class of real-valued trigonometric polynomials using an event-triggered sampling paradigm. The sampling schemes proposed are: level crossing (LC), close-to-extrema LC, and extrema sampling. We analyze the robustness of these schemes to jitter and bandpass additive Gaussian noise. In general, these sampling schemes result in non-uniformly spaced sample instants. We address signal reconstruction from the acquired data-set by imposing a sparsity structure on the signal model, circumventing the gap and density constraints. The recovery performance is contrasted among the various schemes, and with a random sampling scheme. In the proposed approach, both sampling and reconstruction are non-linear operations; in contrast to the random sampling methodologies proposed in compressive sensing, these techniques may be implemented in practice with low-power circuitry.
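
As a flavor of the event-triggered schemes themselves, below is a small Python sketch (illustrative assumptions throughout: a known polynomial order, a dense grid standing in for continuous time, and a plain least-squares fit instead of the paper's sparse recovery):

```python
# Toy: extrema and zero-crossing events of a trigonometric polynomial,
# then a least-squares fit of its coefficients from those events alone.
import numpy as np

rng = np.random.default_rng(1)
M = 4                                   # polynomial order (assumed known)
t = np.linspace(0, 1, 4000)             # dense grid emulating continuous time
a, b = rng.normal(size=M), rng.normal(size=M)
k = np.arange(1, M + 1)
x = np.cos(2 * np.pi * np.outer(t, k)) @ a + np.sin(2 * np.pi * np.outer(t, k)) @ b

d = np.diff(x)
ext = np.flatnonzero(np.sign(d[1:]) != np.sign(d[:-1])) + 1   # extrema events
zc = np.flatnonzero(np.sign(x[1:]) != np.sign(x[:-1]))        # zero-crossing events
ev = np.union1d(ext, zc)                # enough events to determine 2M unknowns
te, xe = t[ev], x[ev]

A = np.hstack([np.cos(2 * np.pi * np.outer(te, k)),
               np.sin(2 * np.pi * np.outer(te, k))])
coef, *_ = np.linalg.lstsq(A, xe, rcond=None)
print("max coefficient error:", np.max(np.abs(coef - np.r_[a, b])))
```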



Moving Sound Source Parameter Estimation Using A Single Microphone And Signal Extrema Samples (In ICASSP'15, Brisbane)


Estimating the parameters of moving sound sources using only the source signal is of interest in low-power, contact-less source monitoring applications such as industrial robotics and bio-acoustics. The received signal embeds the motion attributes of the source via the Doppler effect. In this paper, we analyze the Doppler effect on a mixture of time-varying sinusoids. Focusing on the instantaneous frequency (IF) of the received signal, we show that the IF profile, composed of the IF and its first two derivatives, can be used to obtain the source motion parameters. This requires a smooth estimate of the IF profile. However, numerical implementations of traditional approaches, such as the analytic signal and the energy separation approach, exhibit oscillatory behavior and hence give a non-smooth IF estimate. We devise an algorithm that uses non-uniformly spaced signal extrema samples of the received signal for smooth IF profile estimation. Using the smooth IF profiles for a source moving on a linear trajectory with constant velocity, an accurate estimate of the moving source parameters is obtained. We see promise in this approach for motion parameter estimation on arbitrary trajectories.
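
A rough Python illustration of the starting point (my toy, not the paper's estimator: a single Doppler-shifted tone, with IF read off from extrema spacing and then smoothed by a plain moving average):

```python
# Toy: instantaneous frequency (IF) of a tone from a source passing a
# static microphone, estimated from the spacing of signal extrema.
import numpy as np

c, f0, v, d, fs = 343.0, 2000.0, 20.0, 5.0, 48000
t = np.arange(-1.0, 1.0, 1 / fs)
r = np.sqrt(d**2 + (v * t)**2)              # source-microphone distance
x = np.cos(2 * np.pi * f0 * (t - r / c))    # quasi-static Doppler approximation
f_true = f0 * (1 - (v**2 * t) / (c * r))    # analytic IF of the model above

dx = np.diff(x)
ext = np.flatnonzero(np.sign(dx[1:]) != np.sign(dx[:-1])) + 1
te = t[ext]
f_raw = 1.0 / (2.0 * np.diff(te))           # half-period between extrema
tm = 0.5 * (te[:-1] + te[1:])

f_smooth = np.convolve(f_raw, np.ones(25) / 25, mode="valid")
tm_s = tm[12:-12]                           # align times with the smoothed track
err = np.abs(f_smooth - np.interp(tm_s, t, f_true)) / f0
print("median IF error after smoothing: {:.3%}".format(np.median(err)))
```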



Time-instant Sampling Based Encoding of Time-varying Acoustic Spectrum (In MoH'14, Athens)


The inner ear has been shown to characterize an acoustic stimulus by transducing fluid motion in the inner ear to mechanical bending of stereocilia on the inner hair cells (IHCs). The excitation motion/energy transferred to an IHC depends on the frequency spectrum of the acoustic stimulus and the spatial location of the IHC along the length of the basilar membrane (BM). Subsequently, the afferent auditory nerve fiber (ANF) bundle samples the encoded waveform in the IHCs by synapsing with them. In this work we focus on the sampling of information by afferent ANFs from the IHCs, and show computationally that sampling at specific time instants is sufficient for decoding the time-varying acoustic spectrum embedded in the acoustic stimulus. The approach is based on sampling the signal at its zero-crossings and higher-order derivative zero-crossings. We show results of the approach on time-varying acoustic spectrum estimation from a cricket call recording. The framework gives a time-domain, non-spatial processing perspective on auditory signal processing. The approach works on the full-band signal, without modeling any bandpass filtering that mimics the BM action; instead, we motivate it from the perspective of event-triggered sampling by afferent ANFs of the stimulus encoded in the IHCs. Although the approach yields acoustic spectrum estimates, its plausibility as a bio-mechanical model remains to be fully understood in light of current insights into mammalian auditory mechanics.
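
The crossing-time intuition is easy to demo; here is a tiny Python toy (not the paper's method, which also uses higher-order derivative zero-crossings; a sliding zero-crossing rate on a chirp is the simplest caricature):

```python
# Toy: a sliding-window zero-crossing rate tracks the dominant
# time-varying frequency of a chirp from crossing events alone.
import numpy as np

fs = 16000
t = np.arange(0, 1, 1 / fs)
f = 500 + 300 * t                           # instantaneous frequency, 500 -> 800 Hz
x = np.cos(2 * np.pi * (500 * t + 150 * t**2))

win = int(0.02 * fs)                        # 20 ms analysis window
zc = (np.sign(x[1:]) != np.sign(x[:-1])).astype(float)
rate = np.convolve(zc, np.ones(win) / win, mode="same")
f_hat = rate * fs / 2                       # ZC rate of a tone is 2f/fs
err = np.abs(f_hat[win:-win] - f[1:][win:-win])
print("max tracking error: {:.1f} Hz".format(err.max()))
```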



Event-triggered Sampling Using Signal Extrema for Instantaneous Amplitude and Instantaneous Frequency Estimation (In Signal Processing'15, Elsevier (Journal))


Event-triggered sampling (ETS) is a new approach towards efficient signal analysis. The goal of ETS need not be only signal reconstruction, but also direct estimation of desired information in the signal by skillful design of the event. We show the promise of the ETS approach for better analysis of oscillatory non-stationary signals modeled by a time-varying sinusoid, when compared to existing uniform Nyquist-rate sampling based signal processing. We examine samples drawn using ETS, with the events being zero-crossings (ZC), level-crossings (LC), and extrema, under additive in-band noise and jitter in the detection instant. We find that extrema samples are robust, and also facilitate instantaneous amplitude (IA) and instantaneous frequency (IF) estimation in a time-varying sinusoid. The estimation is proposed solely using extrema samples and a local polynomial regression based least-squares fitting approach. The proposed approach shows improvement, for noisy signals, over the widely used analytic signal, energy separation, and ZC based approaches (which rely on uniform Nyquist-rate data acquisition and processing). Further, extrema-based ETS in general gives a sub-sampled representation (relative to the Nyquist rate) of a time-varying sinusoid. For the same data-set size, extrema-based ETS gives much better IA and IF estimation than uniform sampling.
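
A condensed Python sketch of one piece of this, instantaneous-amplitude estimation from extrema with a local polynomial fit (a toy under simplifying assumptions: a clean single AM-FM tone, Gaussian weights, and a hand-picked kernel width):

```python
# Toy: extrema magnitudes trace the instantaneous amplitude (IA);
# a local quadratic least-squares fit through them smooths the estimate.
import numpy as np

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
ia = 1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)      # slow amplitude envelope
x = ia * np.cos(2 * np.pi * 440 * t + 2 * np.sin(2 * np.pi * 5 * t))

dx = np.diff(x)
ext = np.flatnonzero(np.sign(dx[1:]) != np.sign(dx[:-1])) + 1
te, xe = t[ext], np.abs(x[ext])                 # extrema times and magnitudes

def ia_hat(tq, h=0.02):                         # h: 20 ms kernel width (assumed)
    w = np.exp(-0.5 * ((te - tq) / h) ** 2)     # weight nearby extrema
    A = np.vander(te - tq, 3)                   # columns: dt^2, dt, 1
    coef, *_ = np.linalg.lstsq(A * w[:, None], xe * w, rcond=None)
    return coef[-1]                             # fitted value at dt = 0

for tq in (0.1, 0.2, 0.3, 0.4):
    print(f"t={tq:.1f}s  IA est {ia_hat(tq):.3f}  true {np.interp(tq, t, ia):.3f}")
```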



Mel-scale sub-band modelling for perceptually improved time-scale modification of speech and audio signals (In NCC, 2017, Chennai)


Good quality time-scale modification (TSM) of speech and audio is a long-standing challenge. The crux of the challenge is to maintain the perceptual subtleties of temporal variations in pitch and timbre even after time-scaling the signal. Widely used approaches, such as the phase vocoder and waveform overlap-add (OLA), are based on a quasi-stationarity assumption, and the time-scaled signals have perceivable artifacts. In contrast to these approaches, we propose the application of time-varying sinusoidal modeling for TSM, without any quasi-stationarity assumption. The proposed model comprises a mel-scale non-uniform-bandwidth filter bank, and the factorization of each sub-band time-varying sinusoid into instantaneous amplitude (IA) and instantaneous phase (IP). TSM of the signal is done by time-scaling the IA and IP in each sub-band. The lowpass nature of the IA and IP allows for time-scaling via interpolation. Formal listening tests on speech and music (solo and polyphonic) show a reduction in TSM artifacts such as phasiness and transient smearing. Further, the proposed approach gives improved quality in comparison to waveform-synchronous OLA (WSOLA), the phase vocoder with identity phase locking, and the recently proposed harmonic-percussive separation (HPS) based TSM methods. The obtained improvement in TSM quality highlights that speech analysis can benefit from an appropriate choice of time-varying signal models.
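
A bare-bones, single-band Python version of the IA/IP time-scaling step (the paper uses a mel-scale filter bank and formal evaluation; here one analytic band and a test tone stand in, and the stretch factor is an arbitrary choice):

```python
# Toy: factor a band into instantaneous amplitude (IA) and phase (IP),
# then stretch both; scaling the phase keeps the pitch unchanged.
import numpy as np
from scipy.signal import hilbert

fs, alpha = 16000, 1.5                       # stretch to 1.5x duration
t = np.arange(0, 0.5, 1 / fs)
x = (1 + 0.3 * np.sin(2 * np.pi * 2 * t)) * np.cos(2 * np.pi * 220 * t)

z = hilbert(x)                               # analytic signal of the band
ia, ip = np.abs(z), np.unwrap(np.angle(z))   # lowpass IA, smooth IP

t_new = np.arange(0, alpha * t[-1], 1 / fs)  # stretched time axis
ia_s = np.interp(t_new / alpha, t, ia)       # IA: plain resampling
ip_s = alpha * np.interp(t_new / alpha, t, ip)  # IP scaled so IF is preserved
y = ia_s * np.cos(ip_s)                      # time-scaled band
print(f"{len(x)} samples -> {len(y)} samples")
```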



Leveraging LSTM models for overlap detection in multi-party meetings (In ICASSP, 2018, Calgary)


The detection of overlapping speech segments is of key importance in speech applications involving the analysis of multi-party conversations. The detection problem is challenging because overlapping speech segments are typically captured as short utterances in far-field microphone recordings. In this paper, we propose detecting overlap segments using a neural network architecture consisting of long short-term memory (LSTM) models. The architecture learns the presence of overlap in speech by identifying the spectro-temporal structure of overlapping speech segments. To evaluate the model performance, we perform experiments on simulated overlapped speech generated from the TIMIT database, and on natural multi-talker conversational speech in the Augmented Multi-party Interaction (AMI) meeting corpus. The proposed approach yields improvements over a Gaussian mixture model based overlap detection system. Furthermore, as an application of overlap detection, integrating it into the speaker diarization task is shown to improve the diarization error rate.
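
For concreteness, a skeleton of this model family in Python/PyTorch (an assumption-level sketch, not the paper's exact architecture or features: layer sizes, feature dimension, and the random tensors are placeholders):

```python
# Sketch: a bidirectional LSTM mapping a sequence of spectral frames to
# per-frame single-talker vs. overlap posteriors.
import torch
import torch.nn as nn

class OverlapDetector(nn.Module):
    def __init__(self, n_feats=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)    # classes: single, overlap

    def forward(self, x):                       # x: (batch, frames, n_feats)
        h, _ = self.lstm(x)
        return self.head(h)                     # per-frame logits

model = OverlapDetector()
feats = torch.randn(8, 300, 40)                 # e.g. 3 s of 40-dim fbank frames
labels = torch.randint(0, 2, (8, 300))          # placeholder frame labels
logits = model(feats)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
print(logits.shape)                             # torch.Size([8, 300, 2])
```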



Multicomponent 2-D AM-FM Modeling of Speech Spectrograms (In Interspeech, 2018, Hyderabad)


In contrast to 1-D short-time analysis of speech, 2-D modeling of spectrograms provides a characterization of speech attributes directly in the joint time-frequency plane. Building on existing 2-D models to analyze a spectrogram patch, we propose a multicomponent 2-D AM-FM representation for spectrogram decomposition. The components of the proposed representation comprise a DC, a fundamental frequency carrier and its harmonics, and a spectrotemporal envelope, all in 2-D. The number of harmonics required is patch-dependent. The estimation of the AM and FM is done using the Riesz transform, and the component weights are estimated using a least-squares approach. The proposed representation provides an improvement over existing state-of-the-art approaches, for both male and female speakers. This is quantified using reconstruction SNR and perceptual evaluation of speech quality (PESQ) metric. Further, we perform an overlap-add on the DC component, pooling all the patches and obtain a time-frequency (t-f) aperiodicity map for the speech signal. We verify its effectiveness in improving speech synthesis quality by using it in an existing state-of-the-art vocoder.
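
Of the building blocks above, the Riesz transform is the most self-contained to demo; a small Python sketch follows (only this step, under the usual FFT-multiplier definition, on a synthetic grating rather than a real spectrogram patch):

```python
# Toy: 2-D Riesz transform via frequency-domain multipliers -i*w/|w|;
# on a pure grating the monogenic (AM) envelope is flat, as expected.
import numpy as np

def riesz(patch):
    fy = np.fft.fftfreq(patch.shape[0])[:, None]
    fx = np.fft.fftfreq(patch.shape[1])[None, :]
    mag = np.hypot(fx, fy)
    mag[0, 0] = 1.0                              # avoid division by zero at DC
    F = np.fft.fft2(patch)
    r1 = np.real(np.fft.ifft2(-1j * fx / mag * F))   # horizontal component
    r2 = np.real(np.fft.ifft2(-1j * fy / mag * F))   # vertical component
    return r1, r2

y, x = np.mgrid[0:64, 0:64]
patch = np.cos(2 * np.pi * (x / 16 + y / 32))    # oriented 2-D cosine grating
r1, r2 = riesz(patch)
amp = np.sqrt(patch**2 + r1**2 + r2**2)          # local AM envelope
print("envelope flatness:", amp.std() / amp.mean())   # ~0 for a pure grating
```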



Time-varying Sinusoidal Demodulation for Non-stationary Modeling of Speech (In Speech Communication'18, Elsevier (Journal))


Speech signals contain fairly rich time-evolving spectral content, and the accurate analysis of this time-evolving spectrum is an open challenge in signal processing. Towards this, we revisit time-varying sinusoidal modeling of speech and propose an alternate model estimation approach. The estimation operates on the whole signal without any short-time analysis. The approach proceeds by extracting the fundamental frequency sinusoid (FFS) from the speech signal. The instantaneous amplitude (IA) of the FFS is used for voiced/unvoiced stream segregation. The voiced stream is then demodulated using a variant of in-phase and quadrature-phase demodulation carried out at the harmonics of the FFS. The result is a non-parametric time-varying sinusoidal representation: an additive mixture of quasi-harmonic sinusoids for the voiced stream and a wideband mono-component sinusoid for the unvoiced stream. The representation is evaluated for analysis-synthesis, and the bandwidths of the IA and IF signals are found to be crucial in preserving quality. The obtained IA and IF signals are also found to be carriers of perceived speech attributes, such as speaker characteristics and intelligibility. Compared with existing approaches, which operate on short-time segments, the proposed framework improves on simplicity of implementation, objective scores, and computation time. The listening test scores suggest that the quality preserves naturalness but does not yet beat state-of-the-art short-time analysis methods. In summary, the proposed representation lends itself to high-resolution temporal analysis of non-stationary speech signals, and also allows quality-preserving modification and synthesis.
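
The in-phase/quadrature-phase demodulation at a harmonic is compact enough to sketch in Python (a toy with a known, constant fundamental; the paper estimates the fundamental-frequency sinusoid from the speech itself):

```python
# Toy: coherent I/Q demodulation of one harmonic to recover its
# instantaneous amplitude (IA).
import numpy as np
from scipy.signal import butter, filtfilt

fs, f0 = 16000, 120.0                        # sample rate, assumed fundamental
t = np.arange(0, 0.5, 1 / fs)
ia = 1 + 0.4 * np.sin(2 * np.pi * 3 * t)     # slow amplitude modulation
x = ia * np.cos(2 * np.pi * f0 * t)          # one quasi-harmonic component

b, a = butter(4, 40 / (fs / 2))              # lowpass keeps only the baseband
i = 2 * filtfilt(b, a, x * np.cos(2 * np.pi * f0 * t))   # in-phase branch
q = 2 * filtfilt(b, a, x * np.sin(2 * np.pi * f0 * t))   # quadrature branch
ia_hat = np.hypot(i, q)                      # demodulated IA of this harmonic
err = np.abs(ia_hat - ia)[fs // 10: -fs // 10]           # ignore filter edges
print("max IA error:", err.max())
```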



Talker Change Detection: A comparison of human and machine performance (In the Journal of the Acoustical Society of America, 2019)


The automatic analysis of conversational audio remains difficult, in part due to the presence of multiple talkers speaking in turns, often with significant intonation variations and overlapping speech. The majority of prior work on psychoacoustic speech analysis and system design has focused on single-talker speech, or multi-talker speech with overlapping talkers (for example, the cocktail party effect). There has been much less focus on how listeners detect a change in talker or in probing the acoustic features significant in characterizing a talker's voice in conversational speech. This study examines human talker change detection (TCD) in multi-party speech utterances using a novel behavioral paradigm in which listeners indicate the moment of perceived talker change. Human reaction times in this task can be well-estimated by a model of the acoustic feature distance among speech segments before and after a change in talker, with estimation improving for models incorporating longer durations of speech prior to a talker change. Further, human performance is superior to several on-line and off-line state-of-the-art machine TCD systems.
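
In the spirit of the paper's feature-distance model, here is an illustrative Python snippet (assuming librosa is available; the file name, change instant, and context window are made-up placeholders):

```python
# Toy score: distance between mean MFCC vectors before and after a
# candidate talker-change instant; larger distances predicted faster
# human detection in the study.
import numpy as np
import librosa

y, sr = librosa.load("two_talker_utterance.wav", sr=16000)  # hypothetical file
t_change, win = 2.0, 1.0            # candidate change instant, context (seconds)

def mean_mfcc(seg):
    return librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)

before = y[int((t_change - win) * sr): int(t_change * sr)]
after = y[int(t_change * sr): int((t_change + win) * sr)]
dist = np.linalg.norm(mean_mfcc(before) - mean_mfcc(after))
print(f"feature distance across the candidate change: {dist:.2f}")
```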



Analyzing human reaction time for talker change detection (In ICASSP, 2019, Brighton)


The ability to detect a change in the input is an essential aspect of perception. In speech communication, we use this ability to identify "talker changes" when listening to conversational speech (such as audio podcasts). In this paper, we design a novel experimental paradigm to improve our understanding of how fast listeners detect a change in talker, and of the acoustic features tracked to identify a voice. A listening experiment is conducted in which listeners indicate the moment of perceived talker change in multi-talker speech utterances. We examine talker change detection (TCD) performance by probing the human reaction time (RT). A random forest regression is used to model the relationship between RTs and acoustic features. The findings suggest that: (i) RT is less than a second; (ii) RT can be predicted from the difference in acoustic features of the segments before and after the change; and (iii) there exists a significant dependence of RT on the MFCC-D1 (delta MFCC) features of the segments before and after the change instant. Further, a machine system designed for the same TCD task using speaker diarization principles performed poorly relative to humans.
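
A schematic of the regression step in Python/scikit-learn (synthetic stand-in data; the study used acoustic feature differences computed from the actual stimuli):

```python
# Toy: predict reaction time (RT) from per-trial feature-difference
# vectors with a random forest, then inspect feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))        # e.g. delta-MFCC differences per trial
rt = 0.6 - 0.05 * X[:, 0] + 0.03 * rng.normal(size=500)   # synthetic RTs (s)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
print("mean R^2 (5-fold):", cross_val_score(rf, X, rt, cv=5).mean())
rf.fit(X, rt)
print("most informative feature index:", int(np.argmax(rf.feature_importances_)))
```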



On the impact of language familiarity in talker change detection (In ICASSP, 2020, Barcelona)


The ability to detect talker changes when listening to conversational speech is fundamental to the perception and understanding of multi-talker speech. In this paper, we propose an experimental paradigm to provide insights into the impact of language familiarity on talker change detection. Two multi-talker speech stimulus sets, one in a language familiar to the listeners (English) and the other unfamiliar (Chinese), are created. A listening test is performed in which listeners indicate the number of talkers in the presented stimuli. Analysis of human performance shows statistically significant results for: (a) a lower miss rate (and a higher false alarm rate) in the familiar versus the unfamiliar language, and (b) a longer response time in the familiar versus the unfamiliar language. These results signify a link between the perception of talker attributes and language proficiency. Subsequently, a machine system is designed to perform the same task. The system uses the current state-of-the-art diarization approach with x-vector embeddings. A performance comparison on the same stimulus set indicates that the machine system falls short of human performance by a huge margin, for both languages.



Coswara - a database of breathing, cough, and voice sounds for COVID-19 diagnosis (In INTERSPEECH, 2020, Shanghai)


The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold-standard method for COVID-19 detection is reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive and time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for an alternate diagnosis tool that overcomes these limitations and is deployable at a large scale. The prominent symptoms of COVID-19 include cough and breathing difficulties. We foresee that respiratory sounds, when analyzed using machine learning techniques, can provide useful insights, enabling the design of a diagnostic tool. Towards this, the paper presents an early effort in creating (and analyzing) a database, called Coswara, of respiratory sounds, namely cough, breath, and voice. The sound samples are collected via worldwide crowdsourcing using a web application. The curated dataset is released as open access. As the pandemic is evolving, the data collection and analysis are a work in progress. We believe that insights from the analysis of Coswara can enable sound-based technology solutions for point-of-care diagnosis of respiratory infection, and in the near future this can help diagnose COVID-19.



Other than Research

a) Execom Member of IEEE-IISc Student Branch (2012-13)
Received the Best Volunteer Award for the year 2012-13. Together with a very sporty team of volunteers in our Execom, led by Prof. T. Srinivas, we had a wonderful set of activities in and around campus.
b) IISc ECE Dept. WebTeam Member (2014-15)
Together with three other members, and spearheaded by Prof. Chandra R. Murthy, I helped maintain the ECE website.
c) Sunday Cricket League (SCL, 2014-15)
Every Sunday we had huge fun making our adrenaline flow while bowling, batting, and fielding.
d) Camera clicks
Very often I get amazed by nature, and on getting an opportunity, I click-capture-upload some pictures here: click to see. I find it very difficult to prune the selection!

Quotes

Use the sunrise as an alarm. It has no snooze.

Thoughts

Good and bad is a function of surroundings.

Haricharan

Write-ups, Talks, Good books, and ...

"Throwing Light into the Tunnel: auditory models and perception"
[invited talk at WiSSAP-2015, 04-01-2015] Click here to get the PDF.

"Sound Analysis: some knowns and unknowns"
[in SIAM-IISc Chapter Student Talk Series, @IISc, 08-05-2015]
Click here to get the PDF.

"Detect and Sample: an event-triggered approach for data acquisition and processing"
[Work Discussion at ICTS-IISc Workshop, 08-01-2015]

"Turns are Good: Processing Extrema of a Nonstationary Narrowband Signal"
[Delivered in Spectrum Lab, IISc, 22-10-2013]

"Function Approximations"
[Links to some good PDFs, 11-01-2016] Taylor, Fourier, Chebyshev, Pade, ... Click here to get the PDF.

"Detect and Sample: Questioning uniform Nyquist-rate sampling"
[Delivered at IEEE Day celebrations on campus, 01-10-2013]

Technical books I have liked: I sometimes update the rarely updated list here: click (archive).