Speech Transcript - Speech Bandwidth Expansion For Speaker Recognition On Telephony Audio

hello everyone

i am ganesh sivaraman

i am a research scientist at pindrop security

based in atlanta usa

i'm here to present our paper

titled speech bandwidth expansion for speaker recognition on telephony audio

this is the overview of my talk

i will start by giving a motivation as to why we need bandwidth expansion

followed by explaining the problem statement

and then i will describe some prior research in this area

we will then explain

the bandwidth expansion system that we propose in this paper

and show some results of bandwidth expansion performance

finally

i will show you some speaker verification experiments that we performed

and the results that we obtained with the bandwidth expansion system

in this paper we refer to

audio that is sampled at sixteen khz

as wideband audio

typically the audio that is sampled at sixteen khz and has frequency content

between zero to eight khz

is called wideband audio in this paper

and traditional telephony audio

which is band limited to three hundred to three thousand four hundred hertz

and sampled at eight khz is referred to as

narrowband audio in this paper

speaker verification systems

typically work well on wideband audio

this is because

the higher frequency content between four and eight khz

in wideband audio

is helpful in speaker discrimination

the wideband trained systems

the wideband audio trained systems perform poorly

on narrowband audio due to the mismatch in the training and testing conditions

so the lack of higher frequency information in the narrowband speech leads to

the degraded performance

so the question that we raise in this paper is can we estimate the

higher frequency content that is missing in narrowband audio

in such a way that it improves the performance

on wideband trained

speaker verification systems

so this is the problem statement

the narrowband audio

which is band limited to four khz

is shown on the left

in this figure we have shown the spectrogram

showing the frequencies between zero and

eight kilohertz

you see that there is no information or frequency content in between four to eight kilohertz

and the objective of the bandwidth expansion system is to use the lower frequency content

of the narrowband audio to estimate the missing higher frequency content that is typically present

in the wideband audio

the estimation of the higher frequency content is done in such a way

that it improves the performance of speaker verification systems

there has been a lot of research that has been conducted in bandwidth expansion

the earliest approaches to bandwidth expansion divided the problem into two parts

estimating the envelope of the spectrum and the excitation signal of the

speech

the envelope estimation is typically

done using spline fitting cubic spline fitting or gaussian mixture model based approaches

and

spectral folding is used for

estimating or extending the excitation signal

so these are the earliest approaches in bandwidth expansion

more recent approaches use neural network based bandwidth expansion and

these kind of deep neural network based systems have shown improvement in the performance of

asr systems

that are trained on wideband speech

more recent work in speaker verification related to bandwidth expansion

has used

deep residual networks

and bidirectional lstm network architectures for performing bandwidth expansion

this work has also shown significant improvement in the performance of speaker verification systems

in this

paper we propose a novel bandwidth expansion system

that is lightweight compared to all the systems proposed in the literature

in this system the bandwidth expansion is performed using a cnn

dnn network architecture

a feed-forward cnn dnn architecture wherein there is a

single convolutional layer

which performs one d convolution along the time axis

followed by three feed-forward layers

containing

one thousand twenty four nodes in each layer

there are sixty four filters in the convolutional layer

and after the convolution operation the feature maps are flattened and fed to the feed-forward

layers

this is the architecture of the dnn

that performs the bandwidth expansion
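
A rough shape and parameter walk-through of the architecture described above. The frame count, spectrum sizes, filter count and layer widths come from the talk. The kernel size of 3, stride 1, no padding, and treating the frequency bins as input channels of the convolution are assumptions made only for illustration.

```python
# Stated in the talk: 11 input frames, 128-dim narrowband log spectrum,
# one 1-D convolution along time with 64 filters, three feed-forward
# layers of 1024 nodes, 257-dim wideband output.
# Assumed (not stated): kernel size 3, stride 1, no padding, and the
# 128 frequency bins acting as input channels of the convolution.

def conv1d_out_len(n_frames, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution along the time axis."""
    return (n_frames + 2 * padding - kernel) // stride + 1

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def describe_network(n_frames=11, n_bins=128, n_filters=64, kernel=3,
                     hidden=1024, n_hidden_layers=3, n_out=257):
    t_out = conv1d_out_len(n_frames, kernel)
    flat = n_filters * t_out                      # flattened feature maps
    params = n_bins * kernel * n_filters + n_filters  # conv weights + biases
    n_in = flat
    for _ in range(n_hidden_layers):
        params += dense_params(n_in, hidden)
        n_in = hidden
    params += dense_params(n_in, n_out)           # linear output layer
    return {"conv_time_steps": t_out, "flattened": flat, "parameters": params}

summary = describe_network()
print(summary)
```

Under these assumptions the model has roughly three million parameters, which is in line with the talk's description of the system as lightweight.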

the input

to the deep neural network is

the narrowband log spectrum

narrowband log spectrum so we extract the spectrogram

from the eight khz telephony audio

and we perform

the mean and variance normalization of the spectrum

and compute the logarithm

of the spectrum and feed it as input to the network

the output of the network tries to estimate the complete

spectrum of the

corresponding wideband signal

the input to the network

is a fixed

eleven frames

of one twenty eight dimensional narrowband log spectrum

the features are computed at twenty millisecond frame size and ten millisecond frame rate

the network output is two fifty seven dimensional wideband log spectrum

the network is trained with the mean squared error loss and adam optimiser
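
The frame bookkeeping implied by these numbers can be sketched as follows. The 8 kHz sample rate, 20 ms frames, 10 ms shift, 11-frame context and mean squared error loss are from the talk. How the context window is positioned, and dropping frames without a full context, are assumptions.

```python
# Stated: 8 kHz telephony audio, 20 ms frames, 10 ms shift, 11-frame context.
# Assumed: windows are taken over consecutive frames and edge frames
# without a full context are simply dropped.

SAMPLE_RATE = 8000
FRAME_LEN = SAMPLE_RATE * 20 // 1000    # 160 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000  # 80 samples
CONTEXT = 11

def num_frames(n_samples):
    """Number of full frames that fit in a signal of n_samples."""
    if n_samples < FRAME_LEN:
        return 0
    return 1 + (n_samples - FRAME_LEN) // FRAME_SHIFT

def context_windows(frames, context=CONTEXT):
    """Group consecutive frames into overlapping windows of `context` frames."""
    return [frames[i:i + context] for i in range(len(frames) - context + 1)]

def mse(estimate, target):
    """Mean squared error between two equal-length vectors."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

n = num_frames(SAMPLE_RATE)               # one second of audio
windows = context_windows(list(range(n)))  # frame indices stand in for features
print(n, len(windows), windows[0])
```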

after the network output is obtained

the mean and variance computed from the input narrowband spectrum is added back

to the wideband spectrum

after adding back the mean and variance

an inverse

low pass filter

an inverse filtering is applied

to bring up the energy content in the higher frequencies

this is mainly done

in order to compensate for the mean values of the energies

in the higher frequency regions

the output of this system

is the wideband log spectrum which is further processed

for

speaker verification

this bandwidth expansion

dnn system is trained on the librispeech dataset

and the vctk dataset

this is the inverse filtering that is used to reverse the low-pass filtering effect

the mean and variance of the narrowband log spectrum is added back to the

estimated wideband log spectrum which is the output of the dnn

the higher frequency energies of the narrowband audio are attenuated by the low pass filtering

so to reverse the low pass filtering we add back

this filter the

inverse filter in the log domain to

the estimated wideband spectrum

thereby getting back the properly normalized wideband spectrum estimate
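
A minimal sketch of this post-processing step, under stated assumptions. The talk does not say whether the statistics are per bin or per utterance, so a single per-utterance mean and standard deviation are assumed here. The `inverse_filter_log` gains and the toy "network output" are hypothetical placeholders, not values from the paper.

```python
import statistics

def normalize(log_spec):
    """Mean and variance normalize a log spectrum (list of frames).
    Returns the normalized frames plus the statistics for later reuse."""
    flat = [v for frame in log_spec for v in frame]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat) or 1.0   # guard against a constant input
    normed = [[(v - mean) / std for v in frame] for frame in log_spec]
    return normed, mean, std

def restore(wide_normed, mean, std, inverse_filter_log):
    """Add back the narrowband statistics, then add a per-bin inverse
    filter gain. Addition in the log domain is multiplication in the
    linear domain, so this boosts the attenuated higher bins."""
    return [[v * std + mean + g for v, g in zip(frame, inverse_filter_log)]
            for frame in wide_normed]

narrowband = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]       # toy log spectrum
normed, nb_mean, nb_std = normalize(narrowband)

# Hypothetical network output: 4 wideband bins per frame, still normalized.
wide_normed = [frame + [frame[-1]] for frame in normed]
gains = [0.0, 0.0, 0.0, 2.0]                          # hypothetical log gains
wideband = restore(wide_normed, nb_mean, nb_std, gains)
print(wideband)
```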

the data for training the bandwidth expansion system is simulated using our telephony

codec simulation software

the librispeech and vctk datasets

are both

wideband audio datasets librispeech has a sampling rate of sixteen khz and vctk

is originally forty eight khz audio which we bring down by

downsampling to sixteen khz

both these datasets are clean speech wideband data at sixteen khz

in order to simulate telephony artifacts in the wideband speech we perform

coding and decoding using three different

audio codecs the three audio codecs that we used for simulating the telephony data are

the adaptive multi-rate narrowband amr-nb

the opus narrowband codec and the silk narrowband codec

so these three codecs cover a wide range of telephony applications that are commonly

used as you can see from this table amr-nb is typically used

in mobile telephony

opus is used in voip like whatsapp playstation four etcetera and silk is also

used in voip applications

so we take the sixteen khz wideband audio from the libri

speech dataset or the vctk dataset we

pass the audio through the

audio coding application

which converts it into a coded signal

and then we pass it through the audio codec decoder to get back the telephony

distorted narrowband signal so this is how the sixteen khz audio

is converted to eight khz telephony distorted audio

this is how we simulate the dataset for bandwidth expansion training

the bandwidth expansion system is trained on a hundred hours of librispeech and

vctk dataset

the performance of the bandwidth expansion is computed by the log spectral distortion measure which

is basically the mean squared error between the estimated wideband spectrum and the

actual wideband spectrum

in the log domain
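
One way to compute such a measure is sketched below. The talk describes it loosely as a mean squared error in the log domain. This sketch assumes the common form of log spectral distortion, a root mean square over frequency bins averaged over frames, which may differ in detail from the paper. The `lo_bin` and `hi_bin` arguments are illustrative names for restricting the measure to a lower or higher band.

```python
import math

def log_spectral_distortion(est, ref, lo_bin=0, hi_bin=None):
    """Mean over frames of the RMS log-spectrum difference,
    restricted to frequency bins in [lo_bin, hi_bin)."""
    per_frame = []
    for e_frame, r_frame in zip(est, ref):
        e = e_frame[lo_bin:hi_bin]
        r = r_frame[lo_bin:hi_bin]
        mean_sq = sum((a - b) ** 2 for a, b in zip(e, r)) / len(r)
        per_frame.append(math.sqrt(mean_sq))
    return sum(per_frame) / len(per_frame)

# Toy log spectra: 2 frames, 4 bins. Bins 0-1 stand in for the lower band
# and bins 2-3 for the higher band.
reference = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
estimate = [[0.0, 0.0, 3.0, 3.0], [0.0, 0.0, 4.0, 4.0]]

lsd_low = log_spectral_distortion(estimate, reference, 0, 2)
lsd_high = log_spectral_distortion(estimate, reference, 2, 4)
print(lsd_low, lsd_high)
```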

so the

log spectral distortion

results are shown here

the simple upsampling of

narrowband audio gives a log spectral distortion of one point seven nine

three in the higher frequency band by doing simple upsampling we are not adding

any new information to the audio all that simple upsampling does is

perform

interpolation between samples

followed by

low pass filtering so interpolation followed by smoothing that is

simple upsampling

and simple upsampling gives a log spectral distortion of one point seven nine three
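
The point that simple upsampling adds no new information can be illustrated with a textbook factor-of-two upsampler: insert zeros between samples, then smooth with a low-pass filter. This is a generic sketch, not the resampler used in the paper. The Hamming-windowed sinc filter and its length are assumptions.

```python
import math

def halfband_lowpass(half_len=16):
    """Hamming-windowed sinc with cutoff at a quarter of the new sample
    rate. The passband gain of 2 compensates for the inserted zeros."""
    taps = []
    for n in range(-half_len, half_len + 1):
        x = n / 2.0
        sinc = 1.0 if n == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 + 0.46 * math.cos(math.pi * n / half_len)
        taps.append(sinc * window)
    return taps

def upsample_by_two(signal, taps=None):
    """Zero-stuff the signal, then low-pass filter it (interpolation)."""
    taps = taps or halfband_lowpass()
    half = len(taps) // 2
    stuffed = [0.0] * (2 * len(signal))
    stuffed[::2] = signal                 # a zero after every input sample
    out = []
    for m in range(len(stuffed)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = m - (k - half)          # centered filter, no added delay
            if 0 <= idx < len(stuffed):
                acc += stuffed[idx] * h
        out.append(acc)
    return out

x = [1.0] * 64                            # constant (DC) test signal
y = upsample_by_two(x)
print(len(y), round(y[64], 3), round(y[65], 3))
```

Because the filter removes everything above the original band, the upper half of the new spectrum stays empty. That is why the higher band has to be estimated by a model rather than interpolated.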

the bandwidth expanded system output

has

an lsd value of one point two nine one which is a significant reduction

compared to

the simple upsampled signal

also when we do the bandwidth expansion we estimate the complete

spectrum of the audio

that is the spectrum from zero

to eight khz of the

wideband audio

we also compute the log spectral distortion of

the lower band

as a result

so in the lower frequency band zero to four khz which is already

present

in the narrowband spectrum

the simple upsampling gives a

log spectral distortion of point nine three four

whereas the bandwidth expanded system output has

a distortion of one point zero two nine

so this means that the bandwidth expansion system

introduces a minor amount of distortion

in the lower frequencies

compared to simple upsampling

and also remember

that

the audio codecs that we applied

to simulate the telephony audio would have introduced more distortions in the lower frequencies

that is why

there is

a significant amount of log spectral distortion

even in the lower frequencies

for the simple upsampled signal

finally this table shows that the bandwidth expansion system clearly performs

spectral estimation of the higher frequency content

this is an example plot of the output of the bandwidth expansion system on top

is the eight khz narrowband telephony audio

we have

performed simple upsampling of the telephony audio

to show the spectrogram

you see no frequency content in

between four to eight kilohertz

the bandwidth expanded output

is shown in the

middle pane

you see that the higher frequencies are estimated

by the

pretty well by the bandwidth expansion system

and the bottom pane

shows the sixteen khz reference

next we will move to the speaker verification experiments

the speaker verification experiments in this paper are performed

on a speaker verification system

as shown in this figure

our speaker verification system is a convolutional neural network based speaker embedding

it consists of five convolutional layers

followed by

a statistics pooling layer

followed by two fully connected layers

and the output is a softmax

layer predicting speaker labels

the input to the speaker embedding system is thirty dimensional

mfcc features
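
The statistics pooling step mentioned above turns a variable number of frame-level vectors into one fixed-size vector. A minimal sketch, assuming the usual mean plus standard deviation formulation and the population standard deviation. The talk does not give these details.

```python
import statistics

def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation
    computed over the time axis."""
    dims = list(zip(*frames))             # transpose to one tuple per dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means + stds

# Two frames of 2-dimensional frame-level features (toy values).
pooled = statistics_pooling([[1.0, 10.0], [3.0, 10.0]])
print(pooled)
```

The output length is twice the frame-level dimension regardless of how many frames the utterance has, which is what lets the fully connected layers operate on utterances of any duration.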

the training of the speaker recognition system

the speaker embedding system is performed in two stages

in the first stage we use

a softmax output

of speaker labels

and train the network with

a cross entropy loss

training

in stage two

we remove the second fully connected layer and the output layer

and attach a fully connected layer a different layer

at the output

actually is all the layers

before the back

and train the network with large margin cosine loss

this is how the speaker embedding system is trained in two stages
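
For reference, a large margin cosine loss for a single example can be sketched as a cross entropy over scaled cosine similarities, with a margin subtracted from the target class. The `scale` and `margin` values below are illustrative assumptions. The talk does not state the hyperparameters used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def large_margin_cosine_loss(embedding, class_weights, target,
                             scale=30.0, margin=0.35):
    """Cross entropy over scaled cosines, with `margin` subtracted
    from the cosine of the target class only."""
    logits = []
    for j, w in enumerate(class_weights):
        c = cosine(embedding, w)
        logits.append(scale * (c - margin if j == target else c))
    peak = max(logits)                    # subtract the max for stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

weights = [[1.0, 0.0], [0.0, 1.0]]        # one weight vector per speaker
emb = [1.0, 1.0]                          # equally close to both speakers
loss_plain = large_margin_cosine_loss(emb, weights, target=0, margin=0.0)
loss_margin = large_margin_cosine_loss(emb, weights, target=0)
print(loss_plain, loss_margin)
```

With the margin applied, an embedding that sits on the decision boundary is penalized much more heavily, which pushes embeddings of the same speaker closer together in angle.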

this system is trained

completely on the voxceleb two dataset which consists of

sixteen khz clean audio

so this is a wideband trained system so we train

two different speaker embedding systems

using the same architecture

one system is trained on only the voxceleb two

sixteen khz audio

the second system we train with mixed training

where we use both wideband audio

and the downsampled and bandwidth expanded

audio

so we pass the voxceleb two dataset

through codec

distortion

followed by

the bandwidth expansion

using the

bandwidth expansion system that we proposed in this paper

and then we combine the two datasets the original wideband

voxceleb

and the bandwidth expanded

downsampled voxceleb

to train

the speaker recognition system

the speaker embedding

we train the speaker embedding system

so note that

both of these systems are trained on wideband audio

one is trained on the original wideband data

the second wb plus bwe is trained on wideband

plus

bandwidth expanded data

here are the results

the speaker verification results are shown in this bar graph

this bar graph shows the speaker verification equal error rates that we obtain

using the wideband only trained system

so the system is trained only on voxceleb sixteen khz wideband audio
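
The equal error rate is the operating point where the false acceptance rate equals the false rejection rate. A small sketch of how it can be approximated from trial scores. The threshold sweep over observed scores and the toy scores are illustrative, not the scoring tool used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds over all observed scores and return the error
    rate at the point where false rejection and false acceptance
    rates are closest to each other."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

same_speaker = [0.9, 0.8, 0.7, 0.3]       # toy target trial scores
different_speaker = [0.6, 0.4, 0.2, 0.1]  # toy non-target trial scores
eer = equal_error_rate(same_speaker, different_speaker)
print(eer)
```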

you see that we perform speaker verification tests

on four different test sets

voxceleb one

e subset

the sitw

dev dataset

the speakers in the wild eval dataset

and the nist sre two thousand and ten ten second test set

so these are the four test sets that we computed the results on

the green

you can see

that performing bandwidth expansion

reduces the equal error rate

compared to the simply upsampled signal

so the plot in orange shows the equal error rate obtained using

simple upsampling

and the other plot shows

the results

after bandwidth expansion

note that

the voxceleb one

e set

the sitw

dev set and the sitw eval set

are all

sixteen khz

audio

we pass these test sets

through the codec distortion

simulation that we developed in this paper to simulate telephony audio

and then we apply speaker verification on top of this telephony distorted

audio those are

the results shown in the orange plot

the other plot is the output of the bandwidth expansion system

when we pass the telephony distorted test sets through the bandwidth expansion

system

note that the nist sre two thousand and ten test set

is

originally it consists of only telephony audio

so it

is inherently a narrowband speech signal

recorded using real telephony audio

so we don't have

we don't have the results for

the nist sre dataset

for

the wideband audio because there is no wideband audio in this dataset

so we have only the

results

for simple upsampling and bandwidth expansion

so you see that even in the unseen case of nist sre dataset which consists of

real telephony audio

there is a significant improvement in the equal error rate

finally we show the results on the mixed trained system even on the mixed trained

system there is a significant improvement in the equal error rates

across all the test sets

a particular point to note here is that the equal error rates

obtained

on the original wideband test audio sets

are lower

than what we obtained with the wideband trained system

so initially for example for the voxceleb one e set

we obtained four point one two eer

but after the mixed training the eer reduces to three point two percent

that means that the bandwidth expansion has helped improve the performance even on the original

sixteen khz audio

so these are the conclusions of our paper the bandwidth expansion system that we proposed

in this paper performs significantly better

than simple upsampling

we obtain a relative equal error rate reduction of four point four percent

on the nist sre two thousand and ten ten second

and a nine point ninety percent improvement on the sitw eval

set and eleven point one percent improvement on the sitw dev test set
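
A relative equal error rate reduction is the drop in error rate expressed as a fraction of the baseline. A short sketch of the arithmetic, with hypothetical numbers that are not taken from the paper:

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Hypothetical example: a baseline EER of 10.0 percent that drops to
# 9.0 percent is a 1.0 point absolute and a 10 percent relative reduction.
example = relative_reduction(10.0, 9.0)
print(example)
```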

the bandwidth expansion also improved the accuracy on the original sixteen khz data

across all

the protocols across all the test sets

which means that the bandwidth expansion system is helping as an augmentation mechanism for training

speaker verification or for training the speaker recognition system

so the proposed bandwidth expansion system is also a significantly lightweight system

compared to other systems that were recently proposed

and the system can be

deployed and used even on real time audio

these are some references that i have referred to in this presentation

please refer to the paper for further details and results

and

i will be glad to answer your questions

thank you for listening to my talk

i look forward to your

questions and discussions regarding this paper

thank you

Speech Bandwidth Expansion For Speaker Recognition On Telephony Audio

Speech Application

Ganesh Sivaraman, Amruta Vidwans, Elie Khoury