Frequency-domain analysis of the speech signal
A waveform represents the speech signal in the time domain. It says almost nothing about the frequency content of the signal or about how its energy is distributed across frequency. To get a frequency-domain representation, you need to take the Fourier transform of the speech signal. Since the properties of a speech signal vary with time, the transformation from time domain to frequency domain also needs to be done in a time-dependent manner. In other words, you take a short frame at one point on the time axis, compute its Fourier transform, and then move along the time axis, frame by frame, until you reach the end of the utterance. This process is called the short-time Fourier transform (STFT). The steps involved in STFT computation are:
1. Select a short-duration frame of the speech signal by windowing
2. Compute Fourier transform of the selected duration
3. Shift the window along the time axis to select the neighbouring frame
4. Repeat steps 2 and 3 until you reach the end of the speech signal
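The steps above can be sketched in MATLAB as follows. This is only a minimal illustration, not the article's program: the synthetic test tone, the variable names, and the choice of a Hanning window with a 400-sample frame, 100-sample shift and 1024-point DFT are all assumptions made for the example.

```matlab
% Minimal STFT sketch (illustrative; all names and values are assumptions)
fs = 10000;                         % sampling frequency in Hz
t  = (0:fs-1)'/fs;                  % one second of time samples
x  = sin(2*pi*500*t);               % synthetic 500 Hz tone standing in for speech

Nw   = 400;                         % frame length in samples
Ns   = 100;                         % frame shift in samples
nfft = 1024;                        % DFT length

n = (0:Nw-1)';
w = 0.5*(1 - cos(2*pi*n/Nw));       % Hanning window, computed from its formula

numFrames = floor((length(x) - Nw)/Ns) + 1;   % frames that fit entirely in x
S = zeros(nfft, numFrames);                   % one magnitude spectrum per column

for k = 1:numFrames
    start  = (k-1)*Ns + 1;                    % step 3: shift the window
    frame  = x(start:start+Nw-1) .* w;        % step 1: select a frame by windowing
    S(:,k) = abs(fft(frame, nfft));           % step 2: DFT magnitude of the frame
end                                           % step 4: repeat to the end of x
```

Each column of S then holds the magnitude spectrum of one frame, so moving across columns corresponds to moving along the time axis.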
To select a short-duration frame of speech, a window function that rises and falls gradually is normally used. The window functions most commonly used in speech processing are the Hamming and Hanning windows. A Hamming window with ‘N’ points is mathematically represented by:
hm(n)=0.54–0.46cos (2πn/N), 0≤n<N
and a Hanning window with ‘N’ points is mathematically represented by:
hn(n)=0.5[1–cos (2πn/N)], 0≤n<N
A window function has non-zero values over a selected set of points and zero values outside this interval. When you multiply a signal with a window function, you get a set of ‘N’ selected samples from the location where you place the window and zero-valued samples at all other points. Fig. 5 shows Hamming and Hanning windows of 400 points each.
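As a rough illustration, the two windows can be evaluated directly from their formulas without any toolbox function. The Hamming line uses the standard 0.54 and 0.46 coefficients, written with the same 2πn/N argument as the Hanning expression above. The random placeholder signal and the window position are assumptions chosen only to show the multiplication.

```matlab
% Hamming and Hanning windows evaluated from their defining formulas
N  = 400;                               % number of points, as in Fig. 5
n  = (0:N-1)';                          % sample index, 0 <= n < N
hm = 0.54 - 0.46*cos(2*pi*n/N);         % Hamming window
hn = 0.5*(1 - cos(2*pi*n/N));           % Hanning window

% Windowing a signal: keep N weighted samples from 'start', zero elsewhere
x     = randn(2000, 1);                 % placeholder signal (assumption)
start = 801;                            % window position (assumption)
xw    = zeros(size(x));
xw(start:start+N-1) = x(start:start+N-1) .* hm;
```

Note that the Hanning window starts at zero, while the Hamming window starts at 0.08, so the Hamming window never fully suppresses the frame edges.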
You need to finalise the following parameters before computing short-time Fourier transform of the speech signal:
1. Type of window function to be used for framing the speech signal
2. Frame length Nwt in milliseconds
3. Frame shift Nst in milliseconds
4. DFT length
For a sampling frequency of ƒs, you have to use:
Nw=Nwt ƒs /1000
Ns=Nst ƒs /1000
to convert the frame length and frame shift in milliseconds (Nwt, Nst) into the corresponding numbers of samples (Nw, Ns). Here fs is the sampling frequency of the speech signal expressed in Hertz. Once these parameters are finalised, the framing operation is performed using the MATLAB user-defined function (which needs to be copied to the same folder where the main program is stored):
frames = speech2frames( speech, Nw, Ns, 'cols', hanning, false );
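To make the conversion and the framing step concrete, here is a small sketch using the values from this article (40ms frames, 10ms shift, 10kHz sampling). The framing code is a simple stand-in built on index arithmetic; it is not the speech2frames function itself, whose internals are not shown here, and the random placeholder signal is an assumption.

```matlab
% Convert frame length and shift from milliseconds to samples
fs  = 10000;                  % sampling frequency in Hz
Nwt = 40;                     % frame length in ms
Nst = 10;                     % frame shift in ms
Nw  = round(Nwt*fs/1000);     % 400 samples
Ns  = round(Nst*fs/1000);     % 100 samples

% Simple framing by index arithmetic (a stand-in, not speech2frames itself)
speech = randn(fs, 1);                          % placeholder for a real utterance
M      = floor((length(speech) - Nw)/Ns) + 1;   % number of complete frames
idx    = repmat((1:Nw)', 1, M) + repmat((0:M-1)*Ns, Nw, 1);  % Nw-by-M sample indices
w      = 0.5*(1 - cos(2*pi*(0:Nw-1)'/Nw));      % Hanning window from its formula
frames = speech(idx) .* repmat(w, 1, M);        % one windowed frame per column
```

With these settings, consecutive columns of frames share 300 of their 400 samples.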
Generally, the frame-duration parameter Nwt and the frame-shift parameter Nst are selected such that consecutive frames have sufficient overlap. The condition Nst<Nwt ensures an overlapping window placement. In speech processing applications, the overlap is generally kept high through proper selection of Nst and Nwt. The framing operation returns a number of short-duration frames selected using the window function with the specified frame-length and frame-shift parameters. Each frame is stored as a column vector in the returned array. Once the framing is performed, the DFT operation is used to transform each frame to the frequency domain using the command:
MAG = abs( fft(frames,nfft,1) );
The parameter ‘nfft’ specifies the number of points in the DFT operation. It is kept as a power of 2 and must not be smaller than the frame length in samples. Assuming the wav file has a sampling frequency fs of 10kHz, we have used 1024 points as ‘nfft’ for a frame length of 400 samples (40ms). If the sampling frequency of the wav file is not 10kHz, the file needs to be resampled to 10kHz for the program to work properly. The frame shift is set to 100 samples (10ms). The MAG variable holds the absolute value of the Fourier transform of the frames, stored column-wise. The magnitude of the Fourier transform is also called the magnitude spectrum of that frame of the signal. As the speech signal has time-varying properties, the spectrum also varies with time as we move along the samples in the wav file.
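The sketch below illustrates the choice of ‘nfft’ and the shape of the result. The placeholder frames and the frequency-axis vector f are assumptions added for illustration; the relation between bin number and frequency (bin k lies at k·fs/nfft Hz) is the standard DFT mapping.

```matlab
fs   = 10000;                         % sampling frequency in Hz
Nw   = 400;                           % frame length in samples
nfft = 2^nextpow2(Nw);                % smallest power of 2 >= Nw, gives 512
nfft = max(nfft, 1024);               % the article uses 1024 points

frames = randn(Nw, 5);                % placeholder frames, one per column
MAG    = abs(fft(frames, nfft, 1));   % each column is zero-padded to nfft points

f = (0:nfft-1)' * fs/nfft;            % frequency in Hz of each DFT bin
```

A DFT length larger than the frame length does not add information; the zero-padding only samples the same spectrum on a finer frequency grid (about 9.77Hz per bin here).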
The magnitude spectrum computed for individual frames can be represented in many forms. We have been following three parameters: frame number (indicator of the time axis), DFT bin number (indicator of the frequency axis) and magnitude of the computed DFT (indicator of the spectral energy).
These three parameters can be represented conveniently in a 2D format using a spectrogram. A spectrogram can be considered as an image with time and frequency along the X and Y axes and the magnitude values as the intensity of the pixels in the X-Y plane. Stronger magnitudes are represented by dark spots, and silences (low- or zero-amplitude signals) are represented by white spots in the image.
For a real-valued signal, you need to take only the first half of the magnitude spectrum, since the spectrum is symmetric about nfft/2. You will see that the range of values in the computed magnitude spectrum is very large as you move from frames with valid speech signal to frames involving silences or pauses. It is better to limit the dynamic range to a fixed value before plotting. The magnitude spectrum is converted into the log scale and its dynamic range is limited to 50dB in the main program (spectrogram_efy.m) that computes and plots the spectrogram of a speech signal.
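One possible way to carry out these steps is sketched below. It is not taken from spectrogram_efy.m; the placeholder frames, the variable names, the normalisation to a 0dB peak and the decision to keep the bin at nfft/2 along with the first half are assumptions.

```matlab
nfft   = 1024;                            % DFT length
frames = randn(400, 20);                  % placeholder frames (assumption)
MAG    = abs(fft(frames, nfft, 1));       % magnitude spectrum, one frame per column

half     = MAG(1:nfft/2+1, :);            % bins 0 .. nfft/2 (real-valued input)
dB       = 20*log10(half + eps);          % log magnitude; eps avoids log10(0)
dB       = dB - max(dB(:));               % normalise so the peak sits at 0 dB
dynRange = 50;                            % dynamic range in dB
dB       = max(dB, -dynRange);            % clip everything below -50 dB
```

The clipped matrix could then be displayed with imagesc(dB), axis xy and colormap(flipud(gray)), so that stronger magnitudes appear darker, in line with the description of the spectrogram above.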