
EEE203 Signals and Systems I

Lab: Alexa, What did I Say?

Bob recently bought an Amazon Echo Dot. He is very curious about the automatic speech recognition technology used by Alexa and wants to understand how it works. To start simple, he wants to know how to recognize phonemes. He has a short speech recording and needs your help to design an algorithm that identifies the phonemes said in the recording. Before tackling this challenge, let's first cover some background knowledge.

Speech Analysis

Let's first briefly discuss some important properties of speech. Speech signals are non-stationary, i.e., they change over time. However, speech signals can typically be considered quasi-stationary over short segments, typically 5-20 ms. Thus, we often study the statistical and spectral properties of speech over short segments such as 20 ms.
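The short-time framing described above can be sketched in a few lines of Python. The frame length, hop size, and the sinusoidal test signal here are illustrative choices (the lab itself uses 160-sample frames at 8000 Hz, as described later):

```python
import math

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping short-time frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return [x[i * hop : i * hop + frame_len] for i in range(n_frames)]

fs = 8000                    # sampling rate in Hz
frame_len = int(0.020 * fs)  # a 20 ms frame -> 160 samples
hop = frame_len // 2         # 50% overlap, a common choice

# 100 ms of a 200 Hz test tone, standing in for a speech segment
x = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
frames = frame_signal(x, frame_len, hop)
print(len(frames), len(frames[0]))  # -> 9 160
```

Each of these short frames is then analyzed on its own, which is what "quasi-stationary" buys us.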

Speech can generally be classified as voiced (e.g., /a/, /i/), unvoiced (e.g., /sh/), or mixed. Time and frequency domain plots for sample voiced and unvoiced segments are shown in Fig. 1. Voiced speech is quasi-periodic in the time-domain and harmonically structured in the frequency-domain, while unvoiced speech is random-like and broadband. In addition, the energy of voiced segments is generally higher than the energy of unvoiced segments.

Figure 1. Voiced and unvoiced segments and their short-time spectra.

We will focus on voiced speech here. The short-time spectrum of voiced speech is characterized by its fine and formant structure. The fine harmonic structure is a consequence of the quasi-periodicity of speech and may be attributed to the vibrating vocal cords.

Figure 2. Voiced speech fine harmonic structure obtained using the Fast Fourier Transform (FFT). Note the narrow harmonic peaks. Recall periodic signals have harmonic spectra.

Note the first peak of the spectrum in Fig. 2 (indicated by the arrow) is the fundamental or pitch frequency. (The small peak in front of it is due to the DC component.) The fundamental frequency is usually higher for female speakers than for male speakers.

The formant structure (spectral envelope) of voiced speech is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. The shape of the spectral envelope that "fits" the short-time spectrum of voiced speech in Fig. 2 is shown in Fig. 3. The spectral envelope is characterized by a set of peaks which are called formants. The formants are the resonant modes of the vocal tract. For the average vocal tract there are three to five formants below 5 kHz. The amplitudes and locations of the first three formants, usually occurring above 90 Hz and below 3 kHz, are quite important both in speech synthesis and perception. Higher formants are also important for wideband and unvoiced speech representations.  The spectral envelope of voiced speech can be approximated by the frequency response of an all-pole filter, for which the filter coefficients can be estimated through linear predictive analysis.
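The linear predictive analysis mentioned above can be sketched in Python. This is a minimal autocorrelation-method Levinson-Durbin recursion; the function names and the synthetic one-pole test signal are illustrative and not part of the lab software:

```python
def autocorr(x, order):
    """Short-time autocorrelation r[0..order] of one frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for all-pole filter coefficients a[0..order] (a[0] = 1)
    from the autocorrelation sequence r, by the Levinson-Durbin recursion."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1 - k * k               # prediction error shrinks each step
    return a, err

# Sanity check on a synthetic one-pole signal x[n] = 0.9^n,
# whose best order-1 predictor has coefficient a[1] = -0.9:
x = [0.9 ** n for n in range(200)]
a, err = levinson_durbin(autocorr(x, 1), 1)
print(round(a[1], 3))  # -> -0.9
```

The frequency response of the resulting all-pole filter 1 / (1 + a1 z^-1 + ... + aM z^-M) is what traces out the spectral envelope in Fig. 3.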

Figure 3. Voiced speech formant (vocal tract) envelope obtained using the Linear Predictive Coding (LPC) frequency response. Note the envelope peaks indicated by arrows are known as formants.

The formant frequencies characterize the intelligibility of speech and are often parameterized and used for recognizing and synthesizing phonemes. Examples of phonemes along with their first two formant frequencies for a male voice are listed in Table 1.

Table 1. Average (ideal) formant frequencies for a male voice

Phoneme    Formant F1 (Hz)    Formant F2 (Hz)
/a/        560                1480
/i/        280                2620
/ɑ/        560                920
/u/        320                920
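As a preview of how Table 1 can drive recognition, a hypothetical nearest-neighbor classifier might compare a measured (F1, F2) pair against the table entries. The helper name and the test inputs here are made-up examples:

```python
# Table 1 entries: phoneme -> (F1, F2) in Hz, average male voice
TABLE1 = {
    "/a/": (560, 1480),
    "/i/": (280, 2620),
    "/ɑ/": (560, 920),
    "/u/": (320, 920),
}

def nearest_phoneme(f1, f2):
    """Return the Table 1 phoneme whose (F1, F2) pair is closest
    in squared Euclidean distance to the measured pair."""
    return min(TABLE1,
               key=lambda p: (TABLE1[p][0] - f1) ** 2 + (TABLE1[p][1] - f2) ** 2)

print(nearest_phoneme(300, 2500))  # -> /i/
print(nearest_phoneme(550, 1500))  # -> /a/
```

The K-Means clustering used later in the lab does essentially this comparison, except that it learns the cluster centers from the data instead of taking them from a table.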

Z-Transform, Poles and Zeroes

To find the formant frequencies needed for phoneme recognition, one has to understand the relationship between the poles of a filter and the peaks of its frequency response.

A discrete-time LTI system (i.e., a digital filter) can be characterized by a linear constant-coefficient difference equation:

a0 y[n] + a1 y[n-1] + ... + aM y[n-M] = b0 x[n] + b1 x[n-1] + ... + bL x[n-L]

The z-Transform of the impulse response of the above LTI system, i.e., the system or transfer function, can be written in the following form:

H(z) = (b0 + b1 z^-1 + ... + bL z^-L) / (a0 + a1 z^-1 + ... + aM z^-M)

Assuming a0 = 1, we have

H(z) = (b0 + b1 z^-1 + ... + bL z^-L) / (1 + a1 z^-1 + ... + aM z^-M)        (1)

The am's and bl's are called the filter coefficients of the system, with a0 always being equal to one.

The transfer function in Equation 1 can also be written in terms of poles and zeros as follows:

H(z) = G (1 - ζ1 z^-1)(1 - ζ2 z^-1)...(1 - ζL z^-1) / [(1 - p1 z^-1)(1 - p2 z^-1)...(1 - pM z^-1)]

where ζl and pm are the zeros and poles of H(z) respectively, and G is a positive constant.

The locations of poles and zeros determine system characteristics. For a causal and BIBO stable filter, we know that all the poles must be inside the unit circle. The locations of poles and zeros also affect the shape of the frequency response. Before exploring their impact in the exercises coming up, let's review the polar and rectangular coordinates.
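The unit-circle stability condition can be checked numerically. A small sketch, where the pole values are made-up examples (one stable conjugate pair, one unstable pole):

```python
import cmath

def is_causal_stable(poles):
    """A causal LTI filter is BIBO stable iff every pole lies
    strictly inside the unit circle, i.e., |p| < 1 for all poles."""
    return all(abs(p) < 1.0 for p in poles)

# Poles written in polar form r*e^(jω):
stable = [0.95 * cmath.exp(1j * 0.4), 0.95 * cmath.exp(-1j * 0.4)]
unstable = [1.05 * cmath.exp(1j * 0.4)]

print(is_causal_stable(stable), is_causal_stable(unstable))  # -> True False
```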

Recall that in the z-Transform, the complex variable z can be represented in polar form as

z = r e^(jω)        (2)

where r is the magnitude of z and ω is the angle/phase of z. To convert (2) to rectangular coordinates, apply the Euler identity e^(jω) = cos(ω) + j sin(ω), and we get

z = r cos(ω) + j r sin(ω) = x + jy

where x = r cos(ω) and y = r sin(ω) are the x and y coordinates of z, respectively.
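The conversion between the two forms can be checked with Python's standard cmath module. The value of z here is an arbitrary example:

```python
import cmath, math

z = 0.6 + 0.8j                       # rectangular form x + jy
r, omega = cmath.polar(z)            # polar form: magnitude and angle in radians
print(round(r, 3), round(omega, 3))  # -> 1.0 0.927

# Back to rectangular form via the Euler identity z = r(cos ω + j sin ω):
z_back = r * (math.cos(omega) + 1j * math.sin(omega))
```

This is the same conversion you will do by hand (or with the online tool) for the LPC poles in the lab task below.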

Machine Learning Clustering Algorithm

After the formant frequencies are extracted from the speech signal, a simple machine learning algorithm is used to identify the phonemes. Specifically, a simple and intuitive algorithm called K-Means is used to group the phonemes into clusters.

A machine learning clustering algorithm finds structure or patterns in a collection of unlabeled data. For the K-Means clustering algorithm, one has to specify the number of clusters K into which the algorithm will split the data. Note the algorithm has no prior knowledge of where the centroids of the K clusters should be and has to "learn" them through an iterative process. Specifically, the algorithm tries to minimize the sum of squared distances from the data points within each cluster to their respective centroids. After the algorithm converges, the data will be grouped into K clusters such that data points within a cluster are close to each other and data points from different clusters are farther apart.
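The iterative process described above can be sketched in plain Python. This is a minimal, illustrative K-Means (not the JDSP implementation), and the sample points are made up to mimic two well-separated (F1, F2) groups:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal K-Means sketch: alternate (1) assigning each point to its
    nearest centroid and (2) moving each centroid to its cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize centroids at random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

For these points the algorithm converges to centroids near (1/3, 1/3) and (31/3, 31/3), the means of the two groups, regardless of which points it picks as the initial centroids.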

An example is shown in Fig. 4. The left figure plots an unlabeled data set (all data points are green). After running the K-Means algorithm with K set to two, the right figure shows the data points are split into two clusters with the centroids of the clusters identified by X’s.  Also each data point is now labeled by the cluster it belongs to (blue points vs. red points).

Figure 4. A data set before and after the K-Means algorithm

Lab Task

1.    Go to http://jdsp.engineering.asu.edu/JDSP-HTML5/JDSP.html. Build the simulation diagram as shown in Fig. 5. Note the output of the LPC+ block is taken from the TOP.

Figure 5. JDSP-HTML5 block diagram for formant estimation from speech signal

The blocks, their functions and parameter settings are described below:

a.    Signal Generator (Long) (on the left): stores pre-recorded speech phonemes

i.   Choose “Phoneme a” as the signal (Note this signal is the same as the first part of the audio recording.)

ii.   Set Frame Size to 160 samples. This corresponds to a 20 ms window at a sampling frequency of 8000 Hz, which is standard for mobile speech processing.

b.    Window (under "Basic Blocks"): multiplying the signal by a window function tapers the frame edges and reduces spectral leakage, giving a cleaner short-time spectrum.

i.   Choose Hamming window.

ii.   Set the Length to 160 (same as the frame size of the Long Signal Generator). This will multiply the window function point-by-point with the signal frame.

iii.   Click "Update".

c.    LPC+ (under "Speech Blocks"): performs linear predictive analysis on the windowed signal; the output is the set of coefficients of an all-pole filter representing the vocal tract.

i.   Set the LPC order to 8

d.    Freq-Resp (on the left): plots the spectral envelope of the windowed signal using the LPC coefficients

i.   Set the scale to dB

e.    Formant (under “Speech Blocks”): obtains the first and second formants from the speech signal frame using the LPC filter coefficients.

i.   There are no parameters to set or change in the formant block.

ii.   This block stores the formants until the “Reset” button is clicked. Upon clicking the “Reset” button the stored formant values will be erased.

f.     PZ-Plot (under “Filter Blocks”): shows a pole-zero plot corresponding to the LPC filter coefficients.

i.   There are no parameters to set or change in the PZ-Plot block.

ii.    From the PZ-Plot block you can verify that an all-pole filter is used to estimate the linear predictive coefficients (all poles, with only zeroes at 0).

g.    FFT (on the left): computes the Fast Fourier Transform of the speech signal.

i.   Set FFT size to be 256.

h.    Plot (on the left): plots the speech signal spectrum based on the output of the FFT block.

i.   Choose Continuous Magnitude plot in dB scale.

2.    Let's first analyze one frame of the speech phoneme signal "a". In the SigGen(L) block, select "Frame Range" to start at 53 and stop at 53, then click "Update". You should see the plot on the right updated to "Frame 53/383".

a.    Open the Long Signal Generator, Plot (connected to FFT), PZ-Plot, Freq-Resp and Formant blocks. Take a screenshot.

b.    In the Plot window, why does the speech signal spectrum have a harmonic structure?

c.    How many poles are there in the PZ-Plot window? How many peaks in the Freq-Resp window? What is the relationship between the number of poles of the filter transfer function and the number of peaks of its frequency response? (Recall the discrete-time Fourier transform is periodic with period 2π, so the frequency 2π is connected to frequency zero.)

d.    Convert the poles in the PZ-Plot window from rectangular coordinates to polar coordinates by filling out the following table. (There is a nice online conversion tool at https://www.intmath.com/complex-numbers/convert-polar-rectangular-interactive.php)

Rectangular form    Magnitude    Phase (degrees)    Phase (radians, in π's = degrees/180)

What is the relationship between the phase (angle) of the poles and the locations/frequencies of the peaks of the frequency response? What about the magnitude of the poles and the width of the peaks (narrow or wide)?

e.    In the Formant block, the first and second formants of the speech phoneme signal at frame 53 are given in Hertz. The translation from the analog frequency f in Hertz to the digital frequency in radians shown in the frequency response plot is ω = 2πf/fs, where the sampling frequency fs = 8000 Hz. Calculate the digital frequencies of the two formants. Which poles in rectangular form do they correspond to?

3.    Now let's analyze ten frames of the speech signal. First open the Formant block and click "Reset". This will clear the stored formants. In the SigGen(L) block, select "Frame Range" to start at 51 and stop at 60, then click "Rerun". You will see ten pairs of formants stored in the Formant block, corresponding to the first two formants of the ten frames of the speech signal. Take a screenshot of the Formant window. Are the formants for the ten frames the same? Is the speech signal stationary over a long period?

4.    Next let's try the phoneme "i". In the Formant block, click "Reset". In SigGen(L), choose "Phoneme i" as the signal (note this signal is the same as the second part of the audio recording), select "Frame Range" to start at 53 and stop at 53, then click "Update". Take a screenshot of the Long Signal Generator, Plot (connected to FFT), PZ-Plot, Freq-Resp and Formant blocks.

5.    Lastly, let's play the phonemes "a" and "i" back-to-back (just like in the audio recording) and let the K-Means algorithm form two clusters. In the Formant block, click "Reset". First, we need to acquire formant frequencies from the audio data frames. Go to the SigGen(L) block, choose the phoneme "a", select "Frame Range" to start at 101 and stop at 150, and click "Rerun". You should see 50 formants stored in the Formant block. Next choose the phoneme "i", select "Frame Range" to start at 101 and stop at 150, and click "Rerun". You should now see 100 or 101 formants stored in the Formant block after execution.

Next, add a K-Means block (under "Machine Learning Blocks") to the output of the Formant block on the right. Open the K-Means block and set "Clusters" to 2. Click "Calculate". You should see the two formant frequencies of the 100+ audio data frames on the graph, with the 1st formant on the vertical axis and the 2nd formant on the horizontal axis. The centroids of the two clusters are labeled as green flowers. Click "Show MSE & Centroid Values". Take a screenshot of the K-Means and the K-Means Values windows.

Lastly, based on the centroids output by the K-Means block and Table 1, what phonemes are said in the recording?
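The frequency conversions used in the tasks above can be checked numerically. A short sketch, using the /a/ formants from Table 1 and a made-up pole magnitude of 0.97:

```python
import cmath, math

FS = 8000  # sampling frequency used throughout the lab (Hz)

def hz_to_rad(f):
    """Analog frequency f (Hz) -> digital frequency (radians), ω = 2πf/fs."""
    return 2 * math.pi * f / FS

def pole_to_hz(p):
    """Pole angle -> approximate peak frequency in Hz, f = ω·fs/(2π)."""
    return cmath.phase(p) * FS / (2 * math.pi)

# Formants of /a/ from Table 1, converted to digital frequencies:
f1, f2 = 560, 1480
w1, w2 = hz_to_rad(f1), hz_to_rad(f2)
print(round(w1, 3), round(w2, 3))  # -> 0.44 1.162

# A pole at angle w1 with magnitude just inside the unit circle
# produces a resonance (frequency-response peak) near F1:
p = 0.97 * cmath.exp(1j * w1)
print(round(pole_to_hz(p)))  # -> 560
```

This is the same arithmetic as in task 2e, and it shows why the pole angles from the PZ-Plot line up with the formant peaks in the Freq-Resp window.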



