About the thesis

Defended on 17 Jan 2018, IISc. (Download: 11 MB)

+ Why this thesis?
A huge class of non-stationary signals, which includes speech, music, birdsong, machine vibrations, and geophysical signals, features a temporally evolving spectrum. Encountering non-stationarity is the rule rather than the exception when analyzing natural signals. Unlike stationary signal analysis, non-stationary signal analysis remains a challenging problem in signal processing. Conventional approaches to analyzing non-stationary signals rely on short-time quasi-stationary assumptions. For example, short-time signal segments are analyzed using one of several transforms, such as Fourier, chirplet, and wavelet transforms, with a pre-defined basis. However, the quasi-stationary assumption is a serious limitation in resolving fine temporal and spectral variations in signals. An accurate analysis of these embedded variations can provide a more insightful understanding of natural signals. The thesis presents some approaches in this direction.
+ What is the motivation?
We believe that the human auditory system is often engaged in interpreting information from ambient sounds, such as speech and audio, that feature immense signal non-stationarity. How is the auditory system so good at analyzing and extracting information from these signals? Findings in the neuroscience literature suggest an information encoding scheme based on spiking activity (termed neural firings) in the nerve fibers. It is observed that the spiking activity is synchronized to attributes of the sound stimulus, such as intensity, onset, and frequency content. Interestingly, at finer temporal resolution, the spiking is found to be synchronized with the instants of zero-crossings (ZCs), and also extrema, of a tone stimulus. Drawn by these findings, we conceptualize event-synchronized sampling (ESS) and analysis of non-stationary signals. With the events chosen as ZCs (and extrema), the sampling density (defined over short-time intervals) is time-varying, and this variation adapts to the temporally evolving spectral content of the signal. In contrast, the sampling density is fixed (defined by an external clock) in the traditional uniform Nyquist-rate samplers used in signal processing. We hypothesize that the dataset captured via ESS is compact and informative for the analysis of non-stationary signals.
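As a rough illustration of the time-varying sampling density described above, the sketch below samples a synthetic chirp at its zero-crossings and counts the events in the first and second halves of the signal. The chirp parameters, the dense simulation grid, and the helper name `zero_crossing_times` are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def zero_crossing_times(x, t):
    """Hypothetical helper: instants where x changes sign, by linear interpolation."""
    s = np.signbit(x)
    i = np.where(s[:-1] != s[1:])[0]      # sign change between samples i and i+1
    frac = x[i] / (x[i] - x[i + 1])       # fractional position of the crossing
    return t[i] + frac * (t[i + 1] - t[i])

fs = 50_000                               # dense grid standing in for continuous time
t = np.arange(fs) / fs                    # one second
# Chirp whose instantaneous frequency rises linearly from 50 Hz to 400 Hz.
phase = 2 * np.pi * (50.0 * t + 175.0 * t**2) + 0.5
x = np.sin(phase)

zc = zero_crossing_times(x, t)
n_early = int(np.sum(zc < 0.5))           # events in the low-frequency half
n_late = int(np.sum(zc >= 0.5))           # events in the high-frequency half
print("ZC samples in [0, 0.5) s:", n_early)
print("ZC samples in [0.5, 1) s:", n_late)
```

The second half, where the spectrum sits at higher frequencies, yields more than twice as many samples as the first half, with no external clock involved.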
+ Acknowledgement

I would like to dedicate the thesis to everyone I have met on my journey so far. I have been fortunate to learn from, and be in the company of, wonderful people who have kept me inspired.

Chapter-wise

2 This chapter presents a review of auditory processing and how it motivates an alternative non-stationary signal processing approach, in contrast with traditional stationary methods. It serves as a tutorial illustrating some of the lacunae of traditional time-frequency analysis approaches.

3 Speech signals have a time-varying spectral content. This implies the presence of time-varying redundancy in the signal, and opens up the possibility of adapting the sampling rate in continuous-time to discrete-time conversion. In this chapter, event-synchronized sampling using higher-order zero-crossings (HoZCs) is explored to facilitate such adaptation. HoZCs refer to the ZCs of higher-order derivatives of the signal. Signal reconstruction from the captured samples is pursued within a convex optimization framework.

4 A variety of non-stationary signals can be modeled as time-varying sinusoids, that is, $$x(t)=a(t)\sin\left(2\pi\int_{0}^{t}f(\tau)\,d\tau\right).$$ Of interest in this signal model is estimating the instantaneous amplitude (IA, $a(t)$) and the instantaneous frequency (IF, $f(t)$). In this chapter, we evaluate the effectiveness of samples drawn from ESS in estimating the IA and IF. The proposed approach shows accuracy similar to that of the widely used analytic signal and energy separation approaches (which are based on uniform Nyquist-rate sampling and processing) and of ZC-based approaches, with some improvement in the presence of measurement noise. For the same dataset size, extrema samples give much better IA and IF estimation than uniform samples. An analysis of the robustness of extrema instants to additive Gaussian in-band noise and to jitter in the sampling time instants is also pursued.
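To make the role of extrema samples concrete, here is a minimal sketch, assuming a synthetic signal that follows the model above with $a(t)=1+0.3\cos(4\pi t)$ and $f(t)=100+50t$ Hz. It reads raw IA values off the extrema magnitudes and raw IF values off the spacing between consecutive extrema (half a period apart). This is only the crude observation step and not the estimator developed in the thesis. All parameter values and variable names are assumptions for illustration.

```python
import numpy as np

fs = 200_000                                   # dense grid standing in for continuous time
t = np.arange(fs) / fs                         # one second
ia_true = 1.0 + 0.3 * np.cos(2 * np.pi * 2.0 * t)          # a(t)
phase = 2 * np.pi * (100.0 * t + 25.0 * t**2)              # 2*pi * integral of f(t) = 100 + 50 t
x = ia_true * np.sin(phase)

# An extremum sits where the first difference of x changes sign.
d = np.diff(x)
s = np.signbit(d)
idx = np.where(s[:-1] != s[1:])[0] + 1
t_ext = t[idx]

# Raw IA observations: magnitude of the signal at its extrema.
ia_est = np.abs(x[idx])

# Raw IF observations: consecutive extrema are about half a period apart.
t_mid = 0.5 * (t_ext[:-1] + t_ext[1:])
if_est = 1.0 / (2.0 * np.diff(t_ext))

ia_err = np.max(np.abs(ia_est - ia_true[idx]))
if_err = np.max(np.abs(if_est - (100.0 + 50.0 * t_mid)))
print("number of extrema samples:", len(t_ext))
print("max IA error:", ia_err)
print("max IF error (Hz):", if_err)
```

Roughly 250 extrema samples describe a one-second signal here, which is far fewer than the points on the simulation grid.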

5 Here, we generalize the application of ESS from mono-component time-varying sinusoids to multi-component signals. We propose higher-order ZCs (HoZCs), that is, the ZCs of signal derivatives, as informative samples for precise IA and IF estimation of each sinusoid in a multi-component signal. It is shown that, over a wide range of modulation parameters, namely bandwidth and modulation indices, the IA and IF signals associated with each sinusoid are preserved over successive signal derivatives. Further, successive differentiation makes the sinusoid with the highest IF increasingly dominant in amplitude.
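The amplitude dominance induced by differentiation can be checked on a toy case. The sketch below assumes two stationary tones (a simplification of the time-varying sinusoids treated in the chapter), differentiates them analytically, and counts the zero-crossings of each derivative. Each differentiation multiplies a tone's amplitude by $2\pi f$, so the higher-frequency tone eventually dominates and the ZC count moves towards twice its frequency. The tone parameters and function names are assumptions for illustration.

```python
import numpy as np

def kth_derivative(t, comps, k):
    """Analytic k-th derivative of sum_i a_i * sin(2*pi*f_i*t + p_i)."""
    y = np.zeros_like(t)
    for a, f, p in comps:
        w = 2 * np.pi * f
        # Each derivative scales the amplitude by w and advances the phase by pi/2.
        y += a * w**k * np.sin(w * t + p + k * np.pi / 2)
    return y

def count_zero_crossings(y):
    s = np.signbit(y)
    return int(np.sum(s[:-1] != s[1:]))

fs = 50_000
t = np.arange(fs) / fs                         # one second
# (amplitude, frequency in Hz, phase): a strong 50 Hz tone and a weak 200 Hz tone.
comps = [(1.0, 50.0, 0.3), (0.1, 200.0, 1.1)]

zc_counts = {k: count_zero_crossings(kth_derivative(t, comps, k)) for k in range(7)}
for k, n in zc_counts.items():
    print(f"derivative order {k}: {n} zero-crossings per second")
```

At order 0 the count follows the strong 50 Hz tone (about 100 per second), while at order 6 it follows the 200 Hz tone (about 400 per second), even though that tone started ten times weaker.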

6 Accurate analysis of the time-evolving spectrum of speech is an open challenge in signal processing. Towards this, we designed an approach which overcomes some of the challenges using foundational concepts in signal processing. It is a surprisingly simple approach, based on fundamental signal processing tricks that exploit the modulations in the signal. Here, we demonstrate the application to speech representation-modification-synthesis, with improved perceptual quality.
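One kind of modification that an IA and phase representation makes easy is time-scale modification, as described in the NCC 2017 paper listed under Papers. Whether this chapter covers that specific application is not stated here, so the following is only a sketch of the idea on a single sinusoid. It obtains IA and instantaneous phase from an FFT-based analytic signal (a stand-in for the sub-band decomposition in that paper), interpolates both onto a stretched time axis, and scales the phase so that the frequency is unchanged. The function names and the test tone are assumptions.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: keep DC, double positive frequencies, zero the rest."""
    n = len(x)
    spec = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * gain)

def time_scale(x, alpha):
    """Stretch x in time by factor alpha while keeping its frequency content in place."""
    z = analytic_signal(x)
    ia = np.abs(z)                          # instantaneous amplitude
    ip = np.unwrap(np.angle(z))             # instantaneous phase
    n_out = int(round(alpha * len(x)))
    u = np.arange(n_out) / alpha            # fractional positions on the input grid
    src = np.arange(len(x))
    ia_s = np.interp(u, src, ia)
    ip_s = alpha * np.interp(u, src, ip)    # scaling the phase preserves the frequency
    return ia_s * np.cos(ip_s)

fs = 8000
t = np.arange(4000) / fs                    # 0.5 s
x = np.cos(2 * np.pi * 440.0 * t)
y = time_scale(x, 1.5)

peak_hz = np.argmax(np.abs(np.fft.rfft(y))) * fs / len(y)
print("input samples:", len(x), "output samples:", len(y))
print("dominant frequency of output (Hz):", peak_hz)
```

The output is 1.5 times longer while its dominant frequency stays at 440 Hz. Linear interpolation is adequate here because IA and phase vary slowly compared with the waveform itself.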

7 Here, I present some future directions which build on the topics explored in this thesis.

Papers ...

Sparse signal reconstruction based on signal dependent non-uniform samples (In ICASSP'12, Kyoto): The classical approach to A/D conversion has been uniform sampling, which gives perfect reconstruction for bandlimited signals when the Nyquist sampling theorem is satisfied. We propose a non-uniform sampling scheme based on level crossing (LC) time information. We show stable reconstruction of bandpass signals with the correct scale factor, and hence a unique reconstruction, from only the non-uniform time information. For reconstruction from the level crossings, we use sparse-reconstruction-based optimization, constraining the bandpass signal to be sparse in its frequency content. While the literature resorts to an overdetermined system of equations, we use an underdetermined approach along with a sparse reconstruction formulation. We obtain a reconstruction SNR > 20 dB and perfect support recovery with probability close to 1 in the noiseless case, and with lower probability in the noisy case. Random picking of LCs from different levels, over the same limited signal duration and for the same length of information, is seen to be advantageous for reconstruction.
Event-triggered sampling and reconstruction of sparse trigonometric polynomials (In SPCOM'14, Bangalore): We propose data acquisition from continuous-time signals belonging to the class of real-valued trigonometric polynomials using an event-triggered sampling paradigm. The sampling schemes proposed are: level crossing (LC), close to extrema LC, and extrema sampling. An analysis of the robustness of these schemes to jitter and to bandpass additive Gaussian noise is presented. In general, these sampling schemes result in non-uniformly spaced sample instants. We address the issue of signal reconstruction from the acquired data-set by imposing a sparsity structure on the signal model, which circumvents the problem of gap and density constraints. The recovery performance is contrasted amongst the various schemes and with a random sampling scheme. In the proposed approach, both sampling and reconstruction are non-linear operations and, in contrast to the random sampling methodologies proposed in compressive sensing, these techniques may be implemented in practice with low-power circuitry.
Moving Sound Source Parameter Estimation Using A Single Microphone And Signal Extrema Samples (In ICASSP'15, Brisbane): Estimating the parameters of moving sound sources using only the source signal is of interest in low-power and contact-less source monitoring applications, such as industrial robotics and bio-acoustics. The received signal embeds the motion attributes of the source via the Doppler effect. In this paper, we analyze the Doppler effect on a mixture of time-varying sinusoids. Focusing on the instantaneous frequency (IF) of the received signal, we show that the IF profile, composed of the IF and its first two derivatives, can be used to obtain source motion parameters. This requires a smooth estimate of the IF profile. However, the numerical implementation of traditional approaches, such as the analytic signal and energy separation approaches, exhibits oscillatory behavior and hence yields a non-smooth IF estimate. We devise an algorithm using non-uniformly spaced signal extrema samples of the received signal for smooth IF profile estimation. Using the smooth IF profiles for a source moving on a linear trajectory with constant velocity, an accurate estimate of the moving source parameters is obtained. We see promise in this approach for motion parameter estimation over arbitrary trajectories.
Time-instant Sampling Based Encoding of Time-varying Acoustic Spectrum (In MoH'14, Athens): The inner ear has been shown to characterize an acoustic stimulus by transducing fluid motion in the inner ear to mechanical bending of stereocilia on the inner hair cells (IHCs). The excitation motion/energy transferred to an IHC depends on the frequency spectrum of the acoustic stimulus and the spatial location of the IHC along the length of the basilar membrane (BM). Subsequently, the afferent auditory nerve fiber (ANF) bundle samples the encoded waveform in the IHCs by synapsing with them. In this work we focus on the sampling of information by afferent ANFs from the IHCs, and show computationally that sampling at specific time instants is sufficient for decoding the time-varying acoustic spectrum embedded in the acoustic stimulus. The approach is based on sampling the signal at its zero-crossings and higher-order derivative zero-crossings. We show results of the approach on time-varying acoustic spectrum estimation from a cricket call recording. The framework gives a time-domain and non-spatial processing perspective to auditory signal processing. The approach works on the full-band signal, and is devoid of any bandpass filtering that mimics the BM action. Instead, we motivate the approach from the perspective of event-triggered sampling by afferent ANFs on the stimulus encoded in the IHCs. Although the approach gives acoustic spectrum estimation, it does not yet offer a complete account of a plausible bio-mechanical realization under current insights into mammalian auditory mechanics.
Event-triggered Sampling Using Signal Extrema for Instantaneous Amplitude and Instantaneous Frequency Estimation (In Signal Processing'15, Elsevier (Journal)): Event-triggered sampling (ETS) is a new approach towards efficient signal analysis. The goal of ETS need not be only signal reconstruction, but also direct estimation of the desired information in the signal through skillful design of the event. We show the promise of the ETS approach for better analysis of oscillatory non-stationary signals modeled by a time-varying sinusoid, when compared to existing uniform Nyquist-rate sampling based signal processing. We examine samples drawn using ETS, with events as zero-crossing (ZC), level-crossing (LC), and extrema, under additive in-band noise and jitter in the detection instant. We find that extrema samples are robust, and also facilitate instantaneous amplitude (IA) and instantaneous frequency (IF) estimation in a time-varying sinusoid. The estimation is proposed solely using extrema samples and a local polynomial regression based least-squares fitting approach. The proposed approach shows improvement, for noisy signals, over the widely used analytic signal, energy separation, and ZC based approaches (which are based on uniform Nyquist-rate sampling based data-acquisition and processing). Further, extrema based ETS in general gives a sub-sampled representation (relative to the Nyquist rate) of a time-varying sinusoid. For the same data-set size captured with extrema based ETS and with uniform sampling, the former gives much better IA and IF estimation.
Mel-scale sub-band modelling for perceptually improved time-scale modification of speech and audio signals (In NCC, 2017, Chennai): Good quality time-scale modification (TSM) of speech and audio is a long-standing challenge. The crux of the challenge is to maintain the perceptual subtleties of temporal variations in pitch and timbre even after time-scaling the signal. Widely used approaches, such as the phase vocoder and waveform overlap-add (OLA), are based on a quasi-stationary assumption, and the time-scaled signals have perceivable artifacts. In contrast to these approaches, we propose the application of time-varying sinusoidal modeling for TSM, without any quasi-stationary assumption. The proposed model comprises a mel-scale non-uniform bandwidth filter bank, and the instantaneous amplitude (IA) and instantaneous phase (IP) factorization of sub-band time-varying sinusoids. TSM of the signal is done by time-scaling the IA and IP in each sub-band. The lowpass nature of the IA and IP allows for time-scaling via interpolation. Formal listening tests on speech and music (solo and polyphonic) show a reduction in TSM artifacts such as phasiness and transient smearing. Further, the proposed approach gives improved quality in comparison to waveform synchronous OLA (WSOLA), the phase vocoder with identity phase locking, and the recently proposed harmonic-percussive separation (HPS) based TSM methods. The obtained improvement in TSM quality highlights that speech analysis can benefit from an appropriate choice of time-varying signal models.
Time-varying Sinusoidal Demodulation for Non-stationary Modeling of Speech (In Speech Communication'18, Elsevier (Journal)): Speech signals contain a fairly rich time-evolving spectral content. Accurate analysis of this time-evolving spectrum is an open challenge in signal processing. Towards this, we revisit time-varying sinusoidal modeling of speech and propose an alternative model estimation approach. The estimation operates on the whole signal without any short-time analysis. The approach proceeds by extracting the fundamental frequency sinusoid (FFS) from the speech signal. The instantaneous amplitude (IA) of the FFS is used for voiced/unvoiced stream segregation. The voiced stream is then demodulated using a variant of in-phase and quadrature-phase demodulation carried out at harmonics of the FFS. The result is a non-parametric time-varying sinusoidal representation, specifically, an additive mixture of quasi-harmonic sinusoids for the voiced stream and a wideband mono-component sinusoid for the unvoiced stream. The representation is evaluated for analysis-synthesis, and the bandwidths of the IA and IF signals are found to be crucial in preserving the quality. Also, the obtained IA and IF signals are found to be carriers of perceived speech attributes, such as speaker characteristics and intelligibility. On comparing the proposed modeling framework with existing approaches, which operate on short-time segments, improvement is found in simplicity of implementation, objective scores, and computation time. The listening test scores suggest that the quality preserves naturalness but does not yet beat the state-of-the-art short-time analysis methods. In summary, the proposed representation lends itself to high-resolution temporal analysis of non-stationary speech signals, and also allows quality-preserving modification and synthesis.
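The in-phase and quadrature-phase demodulation mentioned in the last entry above can be sketched in its textbook form. The paper uses a variant carried out at harmonics of the extracted FFS. The example below instead assumes a single, known, fixed carrier and a moving-average lowpass filter, so it should be read as an illustration of the basic principle only. All signal parameters and variable names are assumptions.

```python
import numpy as np

fs = 8000
fc = 200.0                                   # carrier frequency, assumed known here
t = np.arange(fs) / fs                       # one second
ia_true = 1.0 + 0.5 * np.sin(2 * np.pi * 3.0 * t)
theta = 0.7                                  # unknown phase offset to be recovered
x = ia_true * np.cos(2 * np.pi * fc * t + theta)

# Mix with in-phase and quadrature carriers: baseband terms plus terms near 2*fc.
i_mix = x * np.cos(2 * np.pi * fc * t)
q_mix = -x * np.sin(2 * np.pi * fc * t)

# Lowpass with a moving average over one carrier period, which nulls the 2*fc term.
n_taps = int(round(fs / fc))
h = np.ones(n_taps) / n_taps
i_lp = np.convolve(i_mix, h, mode="same")
q_lp = np.convolve(q_mix, h, mode="same")

ia_est = 2.0 * np.hypot(i_lp, q_lp)          # instantaneous amplitude
phase_est = np.arctan2(q_lp, i_lp)           # phase relative to the carrier

interior = slice(n_taps, -n_taps)            # skip filter edge effects
ia_err = np.max(np.abs(ia_est[interior] - ia_true[interior]))
theta_est = float(np.median(phase_est[interior]))
print("max IA error away from the edges:", ia_err)
print("estimated phase offset (rad):", theta_est)
```

Because the demodulated IA and phase are lowpass signals, they can be processed over the whole signal without cutting it into short-time frames.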

Information-rich sampling of time-varying signals

Neeraj Kumar Sharma
Dept. ECE, Indian Institute of Science
Bengaluru, India
