Signal representation is a classic problem in signal processing. A beautiful approach has been to use sinusoidal functions (e.g., the Fourier series)
as an orthogonal and complete basis for representing periodic signals. The approach is beautiful because this representation very often makes
physical sense. Broadening the domain of 'kinds of signals', natural signals such as speech are non-stationary.
This is because the attributes of the underlying signal generator are time-varying (for example, the vocal tract length varies as we speak).
The Fourier series (and other periodic series) are unable to model this time variation, and hence fail to serve as a physically
meaningful representation of the signal.
A widely adopted alternative is short-time analysis of non-stationary signals. It is assumed that within each short-time frame
the signal is stationary, so the Fourier series representation in each frame begins to make sense.
Celebrated techniques built on this approach include the spectrogram (or STFT) and linear predictive coding (LPC).
We propose an alternative short-time representation of speech based on a Chebyshev polynomial basis.
Why a Chebyshev polynomial basis? The Chebyshev polynomials are orthogonal (on [-1, 1]), oscillatory, and have several nice properties for function approximation.
To know more, click here.
Further, as representations they are particularly suited to non-periodic boundary conditions, show no Runge phenomenon
in interpolation, and give a least-squares approximation close to the minimax one within the class of polynomial bases (e.g., Vandermonde, Legendre,
Hermite, cardinal series, ...).
The proposed representation can be analyzed to understand how speech is encoded in the basis coefficients.
Example: We consider 240 secs of speech from two speakers (male and female) sampled at Fs = 48000 samples/sec. The signals are represented with the proposed
approach using an N = 37-dimensional Chebyshev representation (polynomial order 36).
Using a higher order improves the quality of reconstruction, as can be heard in the sound snippets.
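As a concrete illustration, here is a minimal Python sketch of what such a per-frame fit and reconstruction could look like. The 2.5 msec frame length (120 samples at 48 kHz), the absence of windowing/overlap, and the function names are assumptions for illustration, not necessarily the exact setup used here.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

FS = 48000      # sampling rate (samples/sec)
FRAME = 120     # 2.5 msec at 48 kHz (assumed frame length)
ORDER = 36      # order-36 polynomial -> 37 Chebyshev coefficients per frame

def stct_analysis(x):
    """Fit an order-36 Chebyshev polynomial to each 2.5 msec frame.

    Returns a (37, n_frames) coefficient matrix (the short-time Chebyshev matrix)."""
    n_frames = len(x) // FRAME
    t = np.linspace(-1.0, 1.0, FRAME)               # frame samples mapped to [-1, 1]
    coeffs = np.empty((ORDER + 1, n_frames))
    for i in range(n_frames):
        frame = x[i * FRAME:(i + 1) * FRAME]
        coeffs[:, i] = C.chebfit(t, frame, ORDER)   # least-squares Chebyshev fit
    return coeffs

def stct_synthesis(coeffs):
    """Evaluate each frame's fitted polynomial to reconstruct the signal."""
    t = np.linspace(-1.0, 1.0, FRAME)
    frames = [C.chebval(t, coeffs[:, i]) for i in range(coeffs.shape[1])]
    return np.concatenate(frames)

# usage: x is a mono float array at 48 kHz
# x_hat = stct_synthesis(stct_analysis(x))   # reconstruction from 37 coeffs/frame
```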
Here, we try quantizing the representation. The amplitude distribution of the Chebyshev coefficients shows that most coefficients have a small dynamic range and peak at zero; none of the coefficients has a uniform amplitude distribution. Nevertheless, quantizing with a uniform quantizer at 4 bits per coefficient gives the reconstructed signal below.
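A rough sketch of such a per-coefficient uniform quantizer follows; the per-row dynamic ranges and the absence of any clipping or companding are assumptions.

```python
import numpy as np

def uniform_quantize(coeffs, bits=4):
    """Uniformly quantize each Chebyshev coefficient row to 2**bits levels.

    coeffs: (37, n_frames) coefficient matrix; each row gets its own range."""
    levels = 2 ** bits
    lo = coeffs.min(axis=1, keepdims=True)
    hi = coeffs.max(axis=1, keepdims=True)
    step = (hi - lo) / (levels - 1)
    step[step == 0] = 1.0                       # guard against constant rows
    idx = np.round((coeffs - lo) / step)        # integer codes, 0 .. levels-1
    return lo + idx * step                      # dequantized coefficients

# Cq = uniform_quantize(Cmat, bits=4)   # then resynthesize, e.g. stct_synthesis(Cq)
```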
Here, we try to understand the perceptual importance of each basis element. Comparing the short-time Chebyshev representation with the spectrogram (on the same time scale; for example, go back to the first figure), it is evident that the lower basis elements encode vowel features and the higher basis elements encode fricative features in speech. However, speech is not just vowels and fricatives; it also contains perceivable attributes such as pitch, emotion, speaker information, etc. Below we present some sound files corresponding to the individual elements of the 37-dimensional representation of a speech file.
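One way such per-element sound files could be generated is to zero out all but one row of the coefficient matrix before resynthesis; a minimal sketch, reusing the hypothetical stct_synthesis above:

```python
import numpy as np

def isolate_basis_element(coeffs, k):
    """Keep only row k (the k-th Chebyshev basis coefficient) of the coefficient matrix."""
    masked = np.zeros_like(coeffs)
    masked[k, :] = coeffs[k, :]
    return masked

# x_k = stct_synthesis(isolate_basis_element(Cmat, k))   # audio for basis element k
```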
Here, we scale the time domain of the obtained representation.
That is, y(t) = x(pt), where 'p' is the scale factor.
This operation is analogous to playing the signal samples x(t = nTs) with the sample period changed from Ts to pTs.
While an arbitrary 'p' can be chosen in a DSP simulation, this is not a practical option since sound cards do not support arbitrary clock rates.
Instead, scaling the time axis of x(t) while keeping Ts fixed seems more practical.
Scaling the time axis can be done via sinc interpolation for a bandlimited signal. But speech (and audio in general) is only locally bandlimited.
Hence, the slowly decaying sinc kernel is suboptimal for time-scaling non-stationary signals such as speech.
We make use of the proposed representation for this.
That is, to scale time by 'p' we have, for each 2.5 msec of the signal,
$$y(t)=x(pt)=\displaystyle\sum_{k=0}^{N}a_kT_k(pt).$$
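A minimal sketch of one way to realize this per-frame evaluation in Python follows; mapping the scale factor to an output frame length of FRAME/p samples, and reusing the hypothetical stct_analysis above, are assumptions.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

FRAME = 120   # 2.5 msec at 48 kHz (assumed frame length)

def time_scale(coeffs, p):
    """Resynthesize each 2.5 msec frame with FRAME/p samples, i.e. evaluate
    the frame polynomial at the scaled instants y(n*Ts) = x(p*n*Ts)."""
    m = int(round(FRAME / p))                  # output samples per frame
    t = np.linspace(-1.0, 1.0, m)              # scaled evaluation grid in [-1, 1]
    frames = [C.chebval(t, coeffs[:, i]) for i in range(coeffs.shape[1])]
    return np.concatenate(frames)

# y = time_scale(stct_analysis(x), p=0.75)     # played back at the same Fs
```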
Below are the audio signals obtained using the above formulation for different scale factors.
[Fs = 48000 samples/sec, p = 0.50]
[Fs = 48000 samples/sec, p = 0.75]
[Fs = 48000 samples/sec, p = 0.85]
[Fs = 48000 samples/sec, No scaling (original audio)]
[Fs = 48000 samples/sec, p = 1.15]
[Fs = 48000 samples/sec, p = 1.25]
[Fs = 48000 samples/sec, p = 1.50]
[Fs = 48000 samples/sec, p = 2.00]
It can be perceived, as should also be expected, that the pitch is not preserved.
Here, we learn a dictionary using the proposed representation. Let the STCT (short-time Chebyshev representation) matrix be denoted by C (obtained with a 2.5 msec segment size). Dictionary learning involves decomposing C into a product of two matrices D and A such that $$\|C-DA\|_p$$ is minimized. Now, there can be many possibilities for D and A. To make the decomposition useful, one option is to enforce sparsity in the columns of A. This allows D to serve as a pruned subspace for the columns of C. A common set-up is $$\arg\min_{D,A}\|C-DA\|_2 \ \mbox{s.t. }\|A\|_{0}\leq K\ .$$ Here, D serves as a dictionary to encode C via A. The important parameters are the sparsity factor (K) and the dimensions of the matrices, that is, $$C_{n\times m},\ D_{n\times q},\ \mbox{and}\ A_{q\times m}.$$ We use a 36th-order Chebyshev basis representation, hence n is fixed to 37. Given an audio file, m depends on the number of 2.5 msec segments. To learn the dictionary we can play with the parameters q and K.
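The text does not specify the learning algorithm, so the sketch below only illustrates one standard alternation: OMP sparse coding followed by a least-squares (MOD-style) dictionary update. The values of q, K, and the iteration count are assumed parameters.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def learn_dictionary(Cmat, q=64, K=4, n_iters=20, seed=0):
    """Alternate sparse coding (OMP, K nonzeros per column) and a least-squares
    dictionary update so that Cmat (n x m) ~ D (n x q) @ A (q x m)."""
    rng = np.random.default_rng(seed)
    n, m = Cmat.shape
    D = rng.standard_normal((n, q))
    D /= np.linalg.norm(D, axis=0)                        # unit-norm atoms
    for _ in range(n_iters):
        A = orthogonal_mp(D, Cmat, n_nonzero_coefs=K)     # sparse codes, q x m
        D = Cmat @ np.linalg.pinv(A)                      # MOD-style dictionary update
        D /= np.linalg.norm(D, axis=0) + 1e-12            # renormalize atoms
    return D, A

# D, A = learn_dictionary(Cmat, q=64, K=4)   # Cmat is the 37 x m STCT matrix
```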
Here, we intend to use the dictionaries learnt from the STCT representation as discriminative features across sound sources. Consider D1 and D2 as dictionaries created from the STCT representations of two different sound sources, respectively; for example, D1 pertains to speaker 1 (spk-1) and D2 pertains to speaker 2 (spk-2). Now, given a test sound signal composed of a mix of the speech of spk-1 and spk-2, we intend to reconstruct it using [D1 D2]. That is, let the STCT representation of the test signal be C. Then we solve: $$ \arg\min_{A_1,A_2} \left\|\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}\right\|_{0}\ \mbox{s.t. } \left\| C -[D_1\ D_2] \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}\right\|_2\leq err$$ $$ \hat{C}_{1} = D_1 A_1,\ \hat{C}_{2} = D_2 A_2$$ $$ \hat{C}_{1}\rightarrow \hat{x}_1(t),\ \hat{C}_{2}\rightarrow \hat{x}_2(t)$$ As of now, the separation is not good. This calls into question the discriminability offered by these representations.
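A rough sketch of how the coding over the concatenated dictionary and the split reconstruction might look, using an error-constrained OMP as a stand-in for the $\ell_0$ problem above; the tolerance err and the reuse of the hypothetical stct_synthesis are assumptions.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def separate(Cmat, D1, D2, err=1e-3):
    """Code the mixed STCT matrix over [D1 D2] and split the reconstruction."""
    D = np.hstack([D1, D2])                   # n x 2q concatenated dictionary
    A = orthogonal_mp(D, Cmat, tol=err)       # sparse codes, 2q x m (err: assumed tol)
    q = D1.shape[1]
    A1, A2 = A[:q, :], A[q:, :]
    return D1 @ A1, D2 @ A2                   # C1_hat, C2_hat

# C1_hat, C2_hat = separate(Cmat_mix, D1, D2)
# x1_hat = stct_synthesis(C1_hat); x2_hat = stct_synthesis(C2_hat)
```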
Here, we intend to use the Chebyshev basis as a domain for sparse representation. While this holds for polynomials, we apply it to oscillatory signals (that is, speech and audio). The formulation for a signal segment (y) of duration 20 msec follows. $$ x_{n\times 1} = A_{\mathcal{T}}\,c_{n\times 1}$$ $$ y_{m\times 1} = \Phi_{m\times n}\, x$$ $A_{\mathcal{T}}$ is the Chebyshev matrix sampled at the instants contained in $\mathcal{T}$. These instants can be any n time instants; we choose them to be the n points of the Chebyshev grid. The goal is to recover x from the m (< n) random projections in y. This is done for each 20 msec segment of the signal. The recovery can be posed as a sparse recovery problem: $$ \arg\min_{c}\|y-\Phi A_{\mathcal{T}} c\|_{1}+\lambda \|c\|_{1}$$ $$ \hat{x} = A_{\mathcal{T}}\hat{c}$$ The above is an $\ell_1$-regularized (LASSO-like) formulation, and we use CVX to solve it. The regularization weight $\lambda$ is set to 1.
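The experiments use CVX (MATLAB); purely as an illustration, a rough Python analogue of the same recovery using cvxpy is sketched below. The values of n (20 msec at 48 kHz) and m, and the choice of a Gaussian projection matrix, are assumptions.

```python
import numpy as np
import cvxpy as cp
from numpy.polynomial import chebyshev as C

n = 960        # assumed: 20 msec segment at 48 kHz, n Chebyshev-grid instants
m = 480        # assumed: number of random projections (m < n)
lam = 1.0      # regularization weight, set to 1 as above

# Chebyshev grid (first kind) on [-1, 1] and the sampled Chebyshev matrix A_T
t_grid = np.cos(np.pi * (np.arange(n) + 0.5) / n)
A_T = C.chebvander(t_grid, n - 1)          # n x n, columns T_0(t), ..., T_{n-1}(t)

Phi = np.random.randn(m, n)                # assumed Gaussian projection matrix
M = Phi @ A_T                              # measurement operator acting on c

def recover_segment(y):
    """Recover one 20 msec segment x = A_T c from its m projections y = Phi x."""
    c = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.norm(y - M @ c, 1) + lam * cp.norm(c, 1)))
    prob.solve()
    return A_T @ c.value                   # x_hat = A_T c_hat

# y = Phi @ x_segment                      # x_segment: signal at the grid instants
# x_hat = recover_segment(y)
```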