EEE203 Signals and Systems I
Lab: Alexa, What did I Say?
Bob recently bought an Amazon Echo Dot. He is very curious about the automatic speech recognition technology used by Alexa and wants to understand how it works. To start simple, he wants to know how to recognize phonemes. He has a short speech recording and needs your help to design an algorithm that identifies which phonemes were said in the recording. Before tackling this challenge, let's first cover some background knowledge.
Speech Analysis
Let's first briefly discuss some important properties of speech. Speech signals are non-stationary, i.e., they change over time. However, speech signals can typically be considered quasi-stationary over short segments, typically 5-20 ms. Thus, we often study the statistical and spectral properties of speech over short segments such as 20 ms.
Speech can generally be classified as voiced (e.g., /a/, /i/, etc.), unvoiced (e.g., /sh/), or mixed. Time and frequency domain plots for sample voiced and unvoiced segments are shown in Fig. 1. Voiced speech is quasi-periodic in the time domain and harmonically structured in the frequency domain, while unvoiced speech is random-like and broadband. In addition, the energy of voiced segments is generally higher than the energy of unvoiced segments.
Figure 1. Voiced and unvoiced segments and their short-time spectra.
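To make the short-segment analysis concrete, here is a minimal sketch (not part of the lab) that splits a signal into 20 ms frames and computes the short-time energy of each frame. The sampling rate, frame length, and placeholder signal are assumptions chosen to match the settings used later in the lab.

```python
import numpy as np

fs = 8000                     # assumed sampling rate (Hz), as in the lab
frame_len = int(0.020 * fs)   # 20 ms frames -> 160 samples

# x stands in for the recorded speech samples; here a placeholder signal
x = np.random.randn(fs)       # 1 second of dummy "speech"

n_frames = len(x) // frame_len
frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)

# Short-time energy per frame; voiced frames typically show higher energy
energy = np.sum(frames ** 2, axis=1)
print("first few frame energies:", energy[:5])
```

A real voiced/unvoiced decision would compare this per-frame energy against a threshold, typically together with other features, but the frame-by-frame view above is the basic idea behind short-time speech analysis.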
We will focus on voiced speech here. The short-time spectrum of voiced speech is characterized by its fine (harmonic) structure and its formant structure. The fine harmonic structure is a consequence of the quasi-periodicity of speech and may be attributed to the vibrating vocal cords.
Figure 2. Voiced speech fine harmonic structure obtained using the Fast Fourier Transform (FFT). Note the narrow harmonic peaks. Recall periodic signals have harmonic spectra.
Note that the first peak of the spectrum in Fig. 2 (indicated by the arrow) is the fundamental or pitch frequency. (The small peak in front of it is due to the DC component.) The fundamental frequency is usually higher for female speakers than for male speakers.
The formant structure (spectral envelope) of voiced speech is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. The shape of the spectral envelope that "fits" the short-time spectrum of voiced speech in Fig. 2 is shown in Fig. 3. The spectral envelope is characterized by a set of peaks which are called formants. The formants are the resonant modes of the vocal tract. For the average vocal tract there are three to five formants below 5 kHz. The amplitudes and locations of the first three formants, usually occurring above 90 Hz and below 3 kHz, are quite important both in speech synthesis and perception. Higher formants are also important for wideband and unvoiced speech representations. The spectral envelope of voiced speech can be approximated by the frequency response of an all-pole filter, for which the filter coefficients can be estimated through linear predictive analysis.
Figure 3. Voiced speech formant (vocal tract) envelope obtained using the Linear Predictive Coding (LPC) frequency response. Note the envelope peaks indicated by arrows are known as formants.
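As noted above, the spectral envelope of voiced speech can be approximated by the frequency response of an all-pole filter whose coefficients come from linear predictive analysis. The sketch below estimates such an envelope for one windowed frame using the autocorrelation method; the frame contents, the order (8), and the variable names are assumptions chosen to mirror the lab settings, not the JDSP implementation itself.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lpc_coefficients(frame, order=8):
    """Return all-pole coefficients a = [1, a1, ..., aP] for one frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation
    # Solve the Toeplitz normal equations R a = -r for the predictor coefficients
    a_tail = solve_toeplitz((r[:order], r[:order]), -r[1:order + 1])
    return np.concatenate(([1.0], a_tail))

fs = 8000
frame = np.hamming(160) * np.random.randn(160)    # placeholder windowed frame
a = lpc_coefficients(frame, order=8)

# Spectral envelope = frequency response of the all-pole filter 1/A(z)
w, H = freqz([1.0], a, worN=512, fs=fs)
envelope_db = 20 * np.log10(np.abs(H) + 1e-12)
print(envelope_db[:5])
```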
The formant frequencies characterize the intelligibility of speech and are often parameterized and used for recognizing and synthesizing phonemes. Examples of phonemes along with their first two formant frequencies for a male voice are listed in Table 1.
Table 1. Average (ideal) formant frequencies for a male voice
Phoneme | Formant F1 (Hz) | Formant F2 (Hz)
/a/     | 560             | 1480
/i/     | 280             | 2620
/ɑ/     | 560             | 920
/u/     | 320             | 920
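To connect the formants in Table 1 back to the LPC model, here is a sketch of how formant frequencies can be read off the angles of the LPC poles. For readability, a demo all-pole polynomial A(z) is built from two known pole pairs rather than from real LPC output, so the printed values are easy to check; the 500 Hz / 1500 Hz pole locations and the 90 Hz cutoff are illustrative assumptions.

```python
import numpy as np

fs = 8000
p1 = 0.97 * np.exp(1j * 2 * np.pi * 500 / fs)     # pole pair near 500 Hz
p2 = 0.95 * np.exp(1j * 2 * np.pi * 1500 / fs)    # pole pair near 1500 Hz
a = np.real(np.poly([p1, np.conj(p1), p2, np.conj(p2)]))   # demo A(z)

poles = np.roots(a)                                # poles of 1/A(z)
poles = poles[np.imag(poles) > 0]                  # one pole from each conjugate pair
freqs_hz = np.angle(poles) * fs / (2 * np.pi)      # pole angle (rad) -> frequency (Hz)

formants = np.sort(freqs_hz[freqs_hz > 90])        # discard near-DC poles
print("F1, F2 ≈", formants[:2])                    # ≈ [500, 1500]
```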
Z-Transform, Poles and Zeroes
To find the formant frequencies needed for phoneme recognition, one has to understand the relationship between the poles and the frequency peaks of a filter's frequency response.
A discrete-time LTI system (i.e., a digital filter) can be characterized by a linear constant-coefficient difference equation:

a0 y[n] + a1 y[n-1] + ... + aM y[n-M] = b0 x[n] + b1 x[n-1] + ... + bL x[n-L]
The z-transform of the impulse response of the above LTI system, i.e., the system or transfer function, can be written in the following form:

H(z) = (b0 + b1 z^-1 + ... + bL z^-L) / (a0 + a1 z^-1 + ... + aM z^-M)

Assuming a0 = 1, we have

H(z) = (b0 + b1 z^-1 + ... + bL z^-L) / (1 + a1 z^-1 + ... + aM z^-M)        (1)
The am's and bl's are called the filter coefficients of the system, with a0 always being equal to one.
The transfer function in Equation 1 can also be written in terms of its poles and zeros as follows:

H(z) = G (1 - ζ1 z^-1)(1 - ζ2 z^-1) ··· (1 - ζL z^-1) / [(1 - p1 z^-1)(1 - p2 z^-1) ··· (1 - pM z^-1)]

where ζl and pm are the zeros and poles of H(z), respectively, and G is a positive constant.
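As a small preview of how this factored form behaves in practice, the sketch below builds an assumed all-pole filter with a single pole pair (not the lab's actual LPC filter), recovers its poles, zeros, and gain with scipy.signal.tf2zpk, checks that the poles lie inside the unit circle, and shows that the frequency response peaks near the pole angle.

```python
import numpy as np
from scipy.signal import tf2zpk, freqz

fs = 8000
p = 0.95 * np.exp(1j * 2 * np.pi * 800 / fs)       # assumed pole pair at 800 Hz
a = np.real(np.poly([p, np.conj(p)]))              # denominator A(z) coefficients
b = np.array([1.0])                                # all-pole filter: numerator = 1

zeros, poles, gain = tf2zpk(b, a)                  # factored form of H(z)
print("stable (all |p| < 1)?", np.all(np.abs(poles) < 1))

w, H = freqz(b, a, worN=1024, fs=fs)
peak_hz = w[np.argmax(np.abs(H))]
print("response peak near", round(peak_hz), "Hz; pole angle corresponds to 800 Hz")
```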
The locations of the poles and zeros determine the system characteristics. For a causal and BIBO stable filter, all the poles must lie inside the unit circle. The locations of the poles and zeros also shape the frequency response. Before exploring their impact in the upcoming exercises, let's review polar and rectangular coordinates.
Recall that in the z-transform, the complex variable z can be represented in polar form as

z = r e^(jω)        (2)

where r is the magnitude of z and ω is the angle/phase of z. To convert (2) to rectangular coordinates, apply the Euler identity e^(jω) = cos(ω) + j sin(ω), and we get

z = x + jy

where x = r cos(ω) and y = r sin(ω) are the x and y coordinates of z, respectively.
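A minimal sketch of this conversion, using an arbitrary sample value rather than one of the lab's actual poles:

```python
import numpy as np

z = 0.6 + 0.6j                       # rectangular form x + jy
r = np.abs(z)                        # magnitude
omega = np.angle(z)                  # phase in radians
print(r, omega, omega / np.pi)       # ≈ 0.849, 0.785 rad, 0.25π

# Back to rectangular form via Euler's identity
z_back = r * (np.cos(omega) + 1j * np.sin(omega))
print(np.allclose(z, z_back))        # True
```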
Machine Learning Clustering Algorithm
After the formant frequencies are extracted from the speech signal, a simple machine learning algorithm is used to identify the phonemes. Specifically, a simple and intuitive algorithm called K-Means is used to group the phonemes into clusters.
A machine learning clustering algorithm finds structure or patterns in a collection of unlabeled data. For the K-Means clustering algorithm, one has to specify the number of clusters K that the algorithm will split the data into. Note the algorithm has no prior knowledge of where the centroids of the K clusters should be and has to "learn" them through an iterative process. Specifically, the algorithm tries to minimize the sum of squared distances from the data points to the centroids of their respective clusters. After the algorithm converges, the data are grouped into K clusters such that the data points within each cluster are close to each other and data points from different clusters are farther apart.
An example is shown in Fig. 4. The left figure plots an unlabeled data set (all data points are green). After running the K-Means algorithm with K set to two, the right figure shows the data points split into two clusters, with the centroids of the clusters identified by X's. Also, each data point is now labeled by the cluster it belongs to (blue points vs. red points).
Figure 4. A data set before and after the K-Means algorithm
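The bare-bones NumPy sketch below illustrates the iterative assign-and-update idea described above on two synthetic 2-D blobs similar to Fig. 4; it is not the implementation used by the JDSP K-Means block.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two loose blobs of 2-D points, roughly mimicking Fig. 4
pts = np.vstack([np.random.randn(50, 2) + [0, 0],
                 np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(pts, k=2)
print(centroids)
```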
Lab Task
1. Go to http://jdsp.engineering.asu.edu/JDSP-HTML5/JDSP.html. Build the simulation diagram as shown in Fig. 5. Note the output of the LPC+ block is taken from the TOP.
Figure 5. JDSP-HTML5 block diagram for formant estimation from speech signal
The blocks, their functions and parameter settings are described below:
a. Signal Generator (Long) (on the left): stores pre-recorded speech phonemes
i. Choose “Phoneme a” as the signal (Note this signal is the same as the first part of the audio recording.)
ii. Set Frame Size to 160 samples. This corresponds to a 20 ms window with a sampling frequency of 8000 Hz, which is standard for mobile speech processing.
b. Window (under "Basic Blocks"): multiplying the signal by a window function reduces spectral leakage and edge effects in the short-time spectrum.
i. Choose Hamming window.
ii. Set the Length to 160 (same as the frame size of the Long Signal Generator). This will multiply the window function point-by-point with the signal frame.
iii. Click "Update".
c. LPC+ (under "Speech Blocks"): performs linear predictive analysis on the windowed signal; the output is the filter coefficients of an all-pole filter representing the vocal tract.
i. Set the LPC order to 8
d. Freq-Resp (on the left): plots the spectral envelope of the windowed signal using the LPC coefficients
i. Set the scale to dB
e. Formant (under “Speech Blocks”): obtains the first and second formants from the speech signal frame using the LPC filter coefficients.
i. There are no parameters to set or change in the formant block.
ii. This block stores the formants until the “Reset” button is clicked. Upon clicking the “Reset” button the stored formant values will be erased.
f. PZ-Plot (under “Filter Blocks”): shows a pole-zero plot corresponding to the LPC filter coefficients.
i. There are no parameters to set or change in the PZ-Plot block.
ii. From the PZ-Plot block you can verify that the linear predictive coefficients define an all-pole filter (all poles, with zeros only at z = 0).
g. FFT (on the left): computes the Fast Fourier Transform of the speech signal.
i. Set FFT size to be 256.
h. Plot (on the left): plots the speech signal spectrum based on the output of the FFT block.
i. Choose Continuous Magnitude plot in dB scale.
2. Let's first analyze one frame of the speech phoneme signal "a". In the SigGen(L) block, select "Frame Range" to start at 53 and stop at 53, then click "Update". You should see the plot on the right update to "Frame 53/383".
a. Open the Long Signal Generator, Plot (connected to FFT), PZ-Plot, Freq-Resp and Formant blocks. Take a screenshot.
b. In the Plot window, why does the speech signal spectrum have a harmonic structure?
c. How many poles are there in the PZ-Plot window? How many peaks in the Freq-Resp window? What is the relationship between the number of poles of the filter transfer function and the number of peaks of its frequency response? (Recall the discrete-time Fourier transform is periodic with period 2π, so the frequency 2π is connected to frequency zero.)
d. Convert the poles in the PZ-Plot window from rectangular coordinates to polar coordinates by filling out the following table. (There is a nice online conversion tool at https://www.intmath.com/complex-numbers/convert-polar-rectangular-interactive.php)
Rectangular form | Magnitude | Phase (degrees) | Phase (radians, in π's = degrees/180)
                 |           |                 |
                 |           |                 |
                 |           |                 |
                 |           |                 |
                 |           |                 |
                 |           |                 |
                 |           |                 |
                 |           |                 |
What is the relationship between the phase of the poles and the locations/frequencies of the peaks of the frequency response? What about the relationship between the magnitude of the poles and the height and width (narrow or wide) of the peaks?
e. In the Formant block, the first and second formants of the speech phoneme signal at frame 53 are given in Hertz. The translation from the analog frequency f in Hertz to the digital frequency ω in radians shown in the frequency response plot is ω = 2πf/fs, where the sampling frequency fs = 8000 Hz. Calculate the digital frequencies of the two formants. Which poles (in rectangular form) do they correspond to?
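(As a worked example using the Table 1 average rather than the actual frame-53 value: a first formant near 560 Hz would map to ω = 2π(560)/8000 ≈ 0.44 rad ≈ 0.14π, which corresponds to a pole pair at a phase of about ±0.14π, roughly ±25°, on the pole-zero plot.)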
3. Now let's analyze ten frames of the speech signal. First open the Formant block and click "Reset". This will clear the stored formants. In the SigGen(L) block, select "Frame Range" to start at 51 and stop at 60, then click "Rerun". You will see ten pairs of formants stored in the Formant block, which correspond to the first two formants of the ten frames of the speech signal. Take a screenshot of the Formant window. Are the formants for the ten frames the same? Is the speech signal stationary over a long period?
4. Next let's try the phoneme "i". In the Formant block, click "Reset". In SigGen(L), choose "Phoneme i" as the signal (note this signal is the same as the second part of the audio recording), and select "Frame Range" to start at 53 and stop at 53, then click "Update". Take a screenshot of the Long Signal Generator, Plot (connected to FFT), PZ-Plot, Freq-Resp and Formant blocks.
5. Lastly, let's play the phonemes "a" and "i" back-to-back (just like in the audio recording) and let the K-Means algorithm form two clusters. In the Formant block, click "Reset". First, we need to acquire formant frequencies from the audio data frames. Go to the SigGen(L) block, choose the phoneme "a", select "Frame Range" to start at 101 and stop at 150, then click "Rerun". You should see 50 formants stored in the Formant block. Next choose the phoneme "i", select "Frame Range" to start at 101 and stop at 150, then click "Rerun". You should now see 100 or 101 formants stored in the Formant block after execution.
Next, add a K-Means block (under "Machine Learning Blocks") to the output of the Formant block on the right. Open the K-Means block and set the "Clusters" to 2. Click "Calculate". You should see the two formant frequencies of the 100+ audio data frames on the graph, with the 1st formant on the vertical axis and the 2nd formant on the horizontal axis. The centroids of the two clusters are labelled as green flowers. Click "Show MSE & Centroid Values". Take a screenshot of the K-Means and the K-Means Values windows.
Lastly, based on the centroids output by the K-Means block and Table 1, what phonemes are said in the recording?
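As an optional check, here is a sketch of how the reported centroids could be compared against Table 1 programmatically by picking the nearest (F1, F2) entry; the centroid numbers below are placeholders, not actual lab results.

```python
import numpy as np

# (F1, F2) averages from Table 1
table1 = {"/a/": (560, 1480), "/i/": (280, 2620), "/ɑ/": (560, 920), "/u/": (320, 920)}

centroids = [(550, 1500), (300, 2600)]        # placeholder centroid values

for c in centroids:
    # Nearest Table 1 phoneme by Euclidean distance in the (F1, F2) plane
    nearest = min(table1, key=lambda ph: np.hypot(table1[ph][0] - c[0],
                                                  table1[ph][1] - c[1]))
    print(c, "->", nearest)
```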