
Calculating Speech Parameters

     The first step in modeling the behavior of the TMS 5220 speech synthesis chip in Matlab is to obtain the coded speech parameters, which must be fed serially into the chip from a memory buffer before it can create synthetic speech.  The 12 coded speech parameters are: Pitch, Energy, and K1-K10.  Pitch and Energy are used to produce the filter excitation sequence (glottal for voiced speech, noise for unvoiced).  K1-K10 are the reflection coefficients that form the LPC lattice network.  The chip has a 40 Hz frame rate, which means that these 12 coded speech parameters must be calculated for each 25 ms frame of the speech signal to be generated.

Creating a .wav file

     Before any speech parameters could be calculated, a speech file (utterance) was needed.  I used Microsoft Sound Recorder to record a .wav speech file (PCM, 8000 Hz, 16 bit, mono).


Breaking up the .wav file

     The speech file needed to be broken up into 25 ms sections.  I used the following function to accomplish this (refer to the Matlab appendix for more details).

      function:  y = blockX(x, windowsize, frameRate, fs)

     x is the .wav utterance, converted in Matlab into a 1xN vector of values between -1 and 1 using   x = wavread('utterance')

     windowsize = 200      %samples per window (25 ms * fs)

     frameRate = 40        %the frame rate used by the TMS 5220 chip

     fs = 8000             %the sampling frequency used by the TMS 5220 chip

     y is an MxL matrix, where each column is one window of x (L frames in x)
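     blockX is a custom helper (its source is in the Matlab appendix); a minimal sketch consistent with the signature above might look like the following.  This is my assumed implementation, not the original.  Note that with windowsize = 200 and frameRate = 40 at fs = 8000, the windows are non-overlapping.

     function y = blockX(x, windowsize, frameRate, fs)
     %Sketch of a frame-blocking helper (assumed implementation).
     x = x(:);                                        %force column vector
     hop = fs / frameRate;                            %samples between frame starts (200 here)
     L = floor((length(x) - windowsize) / hop) + 1;   %number of complete frames
     y = zeros(windowsize, L);
     for i = 1:L
         start = (i - 1) * hop + 1;
         y(:, i) = x(start : start + windowsize - 1); %copy one window per column
     end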

 

Calculating LPC Coefficients

     The values for the reflection coefficients K1-K10 define the digital lattice filter which acts as the vocal tract in this speech synthesis system.  New values for K1-K10 are needed for every 25 ms block of the utterance.  The values for K1-K10 are derived from the LPC-10 coefficient values, so the LPC values must be calculated first.

     LPC (linear predictive coding) uses linear equations to formulate a mathematical (all-pole) model of the human vocal tract, predicting each speech sample from the samples that precede it.  The vocal tract is modeled by an all-pole transfer function of the form:

     H(z) = G / (1 - a1*z^-1 - a2*z^-2 - ... - ap*z^-p)

In order to calculate the LPC values for each 25 ms block of the utterance, I used Matlab's built-in lpc function.

     function  lpc_coefficients = lpc(y,p)

     y is one 25 ms block of the utterance

     p is the number of lpc coefficients (poles); here p = 10
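Applied frame by frame to the output of blockX, this might look like the following sketch (the variable names A and L are my own, not from the original):

     p = 10;                          %TMS 5220 uses a 10th order model
     [M, L] = size(y);                %y from blockX: one 25 ms frame per column
     A = zeros(L, p + 1);             %one row of lpc coefficients per frame
     for i = 1:L
         A(i, :) = lpc(y(:, i), p);   %returns [1, a1, ..., a10] for frame i
     end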

Calculating Reflection Coefficients K1-K10

      The values for K1-K10 are based on the LPC-10 coefficients, since the LPC values determine the transfer function that models the vocal tract, and K1-K10 are simply the values used in a lattice filter to implement that transfer function.  Since the transfer function is an all-pole model of the vocal tract, an IIR (infinite impulse response) lattice filter is used.  In general, an Mth order linear time invariant IIR filter can be represented by the transfer function:

     H(z) = (b0 + b1*z^-1 + ... + bM*z^-M) / (1 + a1*z^-1 + ... + aM*z^-M)

In our case the numerator is 1, since we are using an all-pole model, and the a values correspond to the p = M = 10 lpc coefficients.  So, our 10th order lattice filter is implemented by an architecture similar to Figure 2, which consists of M = 10 first order sections cascaded together and whose outputs are summed by a ladder section.


Figure 2:  Lattice-Ladder Architecture

The values for K1-K10 are algebraically related to the lpc coefficients.  Various techniques exist for deriving them, including the Gray-Markel method and analysis via Mason's Gain Formula.  It's not practical to show the calculations here, so instead I used Matlab.

K = tf2latc(1,den)        %finds the lattice parameters K for an IIR all-pole (AR) lattice filter.
                          %den is the vector of lpc coefficients returned by lpc ([1, a1, ..., a10])

[F,G] = latcfilt(K,X)     %filters utterance X with the FIR lattice coefficients in vector K.
                          %F is the forward lattice filter result (the prediction error, or
                          %residual), and G is the backward filter result.
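For each 25 ms frame, these two calls can be chained together.  A minimal sketch, assuming A holds one row of lpc coefficients per frame as computed above:

     a = A(i, :);                     %[1, a1, ..., a10] for frame i
     K = tf2latc(1, a);               %reflection coefficients K1-K10
     [F, G] = latcfilt(K, y(:, i));   %F is the residual (excitation) for frame i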

     Why use a lattice filter to model the vocal tract?

          A lattice filter was used to model the vocal tract because lattice filters have a low sensitivity to coefficient roundoff errors.  This is extremely important when using a D/A converter that represents its data with only 8 bits.  If, for example, a cascade architecture were used to implement this filter, we could potentially lose more than 1 bit of precision to coefficient roundoff error, reducing the dynamic range of the output speech waveform.  This would result in speech that sounds very machine-like.


Calculating Pitch

     Pitch information was calculated and then used to produce the filter excitation sequence (glottal for voiced speech, noise for unvoiced).  To calculate the pitch of each 25 ms section of the utterance, I used the lpc values to find the residual (excitation) and then performed an autocorrelation on it.  From the autocorrelation, pitch was determined by taking the time difference between the zero-lag peak and the first largest peak after it.
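A sketch of this pitch estimate for one frame, assuming F is the residual from latcfilt above (restricting the search to a 50-400 Hz pitch range is my assumption, not from the original):

     r = xcorr(F);                    %autocorrelation of the residual
     mid = (length(r) + 1) / 2;       %index of the zero-lag peak
     r = r(mid:end);                  %keep non-negative lags only
     minlag = round(fs / 400);        %assumed pitch search range: 50-400 Hz
     maxlag = round(fs / 50);
     [pk, k] = max(r(minlag:maxlag)); %largest peak after the zero-lag peak
     lag = k + minlag - 1;
     pitchPeriod = lag / fs;          %time difference in seconds
     pitchHz = fs / lag;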

Calculating Energy

     Energy was calculated and then used to control the volume of the filter excitation sequence (glottal for voiced speech, noise for unvoiced).  In general, the higher the energy, the higher the volume of the output speech signal.  To represent Energy, I used the peak amplitude of the power spectrum of each windowed 25 ms section of the speech waveform.
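A sketch of this energy measure for one frame (using a Hamming window is my assumed reading of "windowed"):

     frame = y(:, i);                 %one 25 ms section of the utterance
     w = hamming(length(frame));      %assumed window choice
     P = abs(fft(frame .* w)).^2;     %power spectrum of the windowed frame
     energy(i) = max(P);              %peak amplitude represents Energy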