hello everyone
i am ganesh sivaraman
i am a research scientist at pindrop security
based in atlanta usa
i'm here to present our paper
titled speech bandwidth expansion for speaker recognition on telephony audio
this is the overview of my talk
i will start by giving a motivation as to why we need bandwidth expansion
followed by explaining the problem statement
and then i will describe some prior research in this area
we will then explain
the bandwidth expansion system that we propose in this paper
and show some results of bandwidth expansion performance
finally
i will show you some speaker verification experiments that we performed
and the results that we obtained with bandwidth expansion
in this paper we refer to
audio that is sampled at sixteen khz
as wideband audio
typically the audio that is sampled at sixteen khz and has frequency content
between zero to eight khz
is called wideband audio in this paper
and the traditional telephony audio
which is band limited to three hundred to three thousand four hundred
hz
and sampled at eight khz is referred to as
narrowband audio in this paper
speaker verification systems
typically work well on wideband audio
this is because
the higher frequency content mainly between four and eight khz
in wideband audio
is helpful in speaker discrimination
the wideband trained systems perform poorly
on narrowband audio due to the mismatch in the training and testing conditions
so the lack of higher frequency information in the narrowband speech leads to
the degraded performance
so the question that we raise in this paper is can we estimate the
higher frequency content that is missing in narrowband audio
in such a way that it improves the performance
on wideband trained
speaker verification systems
so this is the problem statement
the narrowband audio
which is band limited to four khz
is shown on the left
in this figure we have shown the spectrogram
for the frequencies between zero and
eight khz
you see that there is no information or frequency content between four and eight khz
and the objective of the bandwidth expansion system is to use the lower frequency content
of the narrowband audio to estimate the missing higher frequency content that is typically present
in the wideband audio
the estimation of the higher frequency content should be
such that it improves the performance of speaker verification systems
there has been a lot of research conducted in bandwidth expansion
the earliest approaches to bandwidth expansion divided the problem into two parts
estimating the envelope of the spectrum and the excitation signal of the
speech
the envelope estimation is typically
done using cubic spline fitting or gaussian mixture model based approaches
and
spectral folding is used for
extending the excitation signal
so these are the earliest approaches in bandwidth expansion
more recent approaches use neural network based bandwidth expansion and
these kinds of deep neural network based systems have shown improvement in the performance of
asr systems
that are trained on wideband speech
more recent work in speaker verification related to bandwidth expansion
has used
deep residual networks
and bidirectional lstm network architectures for performing bandwidth expansion
this work has also shown significant improvement in the performance of speaker verification systems
in this
paper we propose a novel bandwidth expansion system
that is lightweight compared to all the systems proposed in the literature
in this system the bandwidth expansion is performed using a cnn
dnn network architecture
a feed-forward cnn dnn architecture wherein there is a
single convolutional layer
which performs one d convolution along the time axis
followed by three feed-forward layers
containing
one thousand twenty four nodes in each layer
there are sixty four filters in the convolutional layer
and after the convolution operation the feature maps are flattened and fed to the feed-forward
layers
this is the architecture of the dnn
that performs the bandwidth expansion
the input
to the deep neural network is
the narrowband log spectrum
so we extract the spectrogram
from the eight khz telephony audio
and we perform
the mean and variance normalization of the spectrum
and compute the logarithm
of the spectrum and feed it as input to the network
the output of the network tries to estimate the complete
spectrum of the
corresponding wideband signal
the input to the network
is a fixed context of
eleven frames
of one twenty eight dimensional narrowband log spectrum
the features are computed at twenty millisecond frame size and ten millisecond frame rate
the network output is the two fifty seven dimensional wideband log spectrum
the network is trained with the mean squared error loss and adam optimiser
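To make the input sizes concrete, here is a minimal sketch of the framing arithmetic and the eleven-frame context stacking described above. The function names `frame_params` and `stack_context` are illustrative, and repeating the edge frames as padding is an assumption; the talk does not say how the edges of an utterance are handled.

```python
def frame_params(sample_rate_hz, frame_ms=20, hop_ms=10):
    """Return (frame_length, hop_length) in samples."""
    return sample_rate_hz * frame_ms // 1000, sample_rate_hz * hop_ms // 1000


def stack_context(frames, context=11):
    """Group each frame with its neighbours into a window of `context` frames.

    Edge frames are handled by repeating the first and last frame. This
    padding choice is an assumption, not something stated in the talk.
    """
    half = context // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [padded[i:i + context] for i in range(len(frames))]


# 8 kHz telephony audio: 20 ms frames with a 10 ms shift.
frame_len, hop_len = frame_params(8000)

# Five dummy frames of 128-dimensional log spectrum, only to check shapes.
dummy = [[float(t)] * 128 for t in range(5)]
windows = stack_context(dummy)
```

At 8 kHz this gives 160-sample frames with an 80-sample shift, and each network input holds 11 x 128 = 1408 values.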
after the network output is obtained
the mean and variance computed from the input narrowband spectrum are added back
to the wideband spectrum
after adding back the mean and variance
an inverse filtering is applied
to bring up the energy content in the higher frequencies
this is mainly done
to compensate for the mean values of the energies
in the higher frequency region
the output of this system
is the wideband log spectrum which is further processed
for
speaker verification
this bandwidth expansion
dnn system is trained on the librispeech dataset
and the vctk dataset
this is the inverse filtering that is used to reverse the low-pass filtering effect
the mean and variance of the narrowband log spectrum are added back to the
estimated wideband log spectrum which is the output of the dnn
the higher frequency energies of the narrowband audio are attenuated by the low-pass filtering
to reverse this low-pass filtering we add
this inverse filter in the log domain to
the estimated wideband spectrum
thereby getting back the properly normalized wideband spectrum estimate
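The post-processing steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the statistics are taken as a single scalar mean and standard deviation over the utterance, the inverse filter is a per-bin gain added in the log domain, and the filter values and the stand-in network output are made up for the example. None of the function names come from the paper.

```python
import math


def mean_std(log_spec):
    """Scalar mean and standard deviation over all frames and bins (assumed)."""
    values = [v for frame in log_spec for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def normalize(log_spec, mean, std):
    return [[(v - mean) / std for v in frame] for frame in log_spec]


def denormalize(log_spec, mean, std):
    """Add the narrowband statistics back to a normalized estimate."""
    return [[v * std + mean for v in frame] for frame in log_spec]


def apply_inverse_filter(log_spec, inverse_filter_log):
    """Inverse filtering is a per-bin addition in the log domain."""
    return [[v + g for v, g in zip(frame, inverse_filter_log)]
            for frame in log_spec]


# Toy narrowband log spectrum: 2 frames x 4 bins.
narrowband = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
mean, std = mean_std(narrowband)
network_input = normalize(narrowband, mean, std)

# Stand-in for the network output: a normalized estimate with more bins.
network_output = [[0.0] * 8, [0.5] * 8]

# Hypothetical inverse filter: no gain in the lower bins, rising gain above.
inverse_filter_log = [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0]

wideband = apply_inverse_filter(
    denormalize(network_output, mean, std), inverse_filter_log)
```

Because a low-pass filter multiplies the magnitude spectrum, undoing it in the log domain reduces to adding a fixed gain per frequency bin.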
the data for training the bandwidth expansion system is simulated using our telephony
audio codec simulation software
the librispeech and vctk datasets
are both
wideband audio datasets librispeech has a sampling rate of sixteen khz and
vctk is originally forty eight khz audio which we bring down by
downsampling to sixteen khz
both these datasets are clean speech wideband data at sixteen khz
in order to simulate telephony artifacts in the wideband speech we perform
coding and decoding using three different
audio codecs the three audio codecs that we used for simulating the telephony data are
the adaptive multi rate narrowband amr nb
the opus narrowband codec and the silk narrowband codec
so these three codecs cover a wide range of telephony applications that are commonly
used as you can see from this table amr nb is typically used
in mobile telephony
opus is used in voip applications like whatsapp playstation four etc and silk is also
used in voip applications
so given a sixteen khz wideband audio from the libri
speech dataset or the vctk dataset
we pass the audio through the
audio coding application
which converts it into a coded signal
and then we pass it through the audio codec decoder to get back the telephony
distorted narrowband signal so this is how the sixteen khz audio
is converted to eight khz telephony distorted audio
this is how we simulate the dataset for bandwidth expansion training
the bandwidth expansion system is thus trained on a hundred hours of librispeech and
vctk datasets
the performance of the bandwidth expansion is computed by the log spectral distortion measure which
is basically the mean squared error between the estimated wideband spectrum and the
actual wideband spectrum
in the log domain
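The measure can be sketched as follows. This uses one common definition of log spectral distortion (the root mean square difference between log spectra per frame, averaged over frames); the exact definition used in the paper may differ. The `bin_start` and `bin_end` arguments are added here only to show how a score restricted to the lower or higher band could be computed.

```python
import math


def log_spectral_distortion(estimated, reference, bin_start=0, bin_end=None):
    """Average per-frame RMS difference between two log spectra.

    `estimated` and `reference` are lists of frames, each frame a list of
    log-spectral values. Only bins in [bin_start, bin_end) are compared.
    """
    per_frame = []
    for est_frame, ref_frame in zip(estimated, reference):
        est_band = est_frame[bin_start:bin_end]
        ref_band = ref_frame[bin_start:bin_end]
        mse = sum((e - r) ** 2 for e, r in zip(est_band, ref_band)) / len(est_band)
        per_frame.append(math.sqrt(mse))
    return sum(per_frame) / len(per_frame)


# Toy example: the estimate matches the reference in the lower half of the
# bins and is off by 3 in the upper half.
reference = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
estimated = [[0.0, 0.0, 3.0, 3.0], [0.0, 0.0, 3.0, 3.0]]

lsd_low = log_spectral_distortion(estimated, reference, 0, 2)
lsd_high = log_spectral_distortion(estimated, reference, 2, 4)
lsd_full = log_spectral_distortion(estimated, reference)
```

Scoring the two bands separately is what makes it possible to see whether the error sits in the estimated higher band or in the lower band that was already present.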
so the
log spectral distortion
results are shown here
the simple upsampling of
narrowband audio gives a log spectral distortion of one point seven nine
three in the higher frequency band by doing simple upsampling we are not adding
any new information to the audio all that simple upsampling does is
perform
interpolation between samples
followed by
a low-pass filter so interpolation followed by smoothing that is
simple upsampling
and simple upsampling gives a log spectral distortion of one point seven nine three
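A minimal sketch of what simple upsampling by a factor of two does: insert zeros between samples, then smooth with a short low-pass kernel. The three-tap kernel used here is an assumption chosen because it reduces to linear interpolation; practical resamplers use much longer filters.

```python
def upsample_by_two(samples):
    """Upsample by 2: zero insertion followed by a 3-tap smoothing filter."""
    # Step 1: insert a zero after every sample (doubles the sampling rate).
    stuffed = []
    for s in samples:
        stuffed.extend([s, 0.0])

    # Step 2: smooth with a symmetric kernel centred on each output sample.
    kernel = [0.5, 1.0, 0.5]
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for k, h in enumerate(kernel):
            idx = n + 1 - k  # neighbours n+1, n, n-1
            if 0 <= idx < len(stuffed):
                acc += h * stuffed[idx]
        out.append(acc)
    return out


upsampled = upsample_by_two([0.0, 2.0, 4.0])
```

Every output sample is a weighted sum of the existing input samples, which is why this step cannot add any new content above four khz.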
the bandwidth expanded system output
has
an lsd value of one point two nine one which is a significant reduction
compared to
the simple upsampled signal
also when we do bandwidth expansion we estimate the complete
spectrum of the audio
that is the spectrum from zero
to eight khz of the
wideband audio
we also compute the log spectral distortion in
the lower band
so in the lower frequency band zero to four khz which is already
present
in the narrowband spectrum
the simple upsampling gives a
log spectral distortion of point nine three four
whereas the bandwidth expanded system output
has
a distortion of one point zero two nine
so this means that the bandwidth expansion system
introduces a minor amount of distortion
in the lower frequencies
compared to simple upsampling
and also remember
that
the audio codecs that we applied
to simulate the telephony audio would have introduced distortions in the lower frequencies
that is why
there is
a significant amount of log spectral distortion
even in the lower frequencies
for the simple upsampled signal
finally this table shows that the bandwidth expansion system clearly performs
spectral estimation of the higher frequency content
this is an example plot of the output of the bandwidth expansion system on top
is the eight khz narrowband telephony audio
we have
performed simple upsampling of the telephony audio
to show the spectrogram
you see no frequency content
between four and eight khz
the bandwidth expanded output
is shown in the
middle pane
you see that the higher frequencies are estimated
pretty well by the bandwidth expansion system
and the bottom pane
shows the sixteen khz reference
next we will move to the speaker verification experiments
the speaker verification experiments in this paper are performed
on a speaker verification system
as shown in this figure
our speaker verification system is a convolutional neural network based speaker embedding
it consists of five convolutional layers
followed by
a statistics pooling layer
followed by two fully connected layers
and the output is a softmax
layer predicting speaker labels
the input to the speaker embedding system is thirty dimensional
mfcc features
the training of the
speaker embedding system is performed in two stages
in the first stage we use
a softmax output
of speaker labels
and train the network with
a cross entropy loss
in stage two
we remove the second fully connected layer and the output layer
and attach a different fully connected layer
at the output
we keep all the layers
before that
and train the network with large margin cosine loss
this is how the speaker embedding system is trained in two stages
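For reference, the large margin cosine loss for a single training example can be sketched as below. The formula follows the commonly published form of the loss, where a margin is subtracted from the target class cosine before scaling; the `scale` and `margin` values are illustrative assumptions, not values from the talk.

```python
import math


def large_margin_cosine_loss(cosines, target, scale=30.0, margin=0.35):
    """Loss for one example given its cosine similarity to every class weight.

    The target class cosine is reduced by `margin`, all cosines are multiplied
    by `scale`, and a softmax cross entropy is taken over the result.
    """
    logits = [scale * (c - margin) if j == target else scale * c
              for j, c in enumerate(cosines)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


cosines = [0.9, 0.1, -0.2]  # embedding vs. three speaker class weights
loss_with_margin = large_margin_cosine_loss(cosines, target=0)
loss_without_margin = large_margin_cosine_loss(cosines, target=0, margin=0.0)
```

With the margin in place the same example incurs a larger loss, which pushes embeddings of the same speaker closer together than plain softmax training would.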
this system is trained
completely on the voxceleb two dataset which consists of
sixteen khz clean audio
so this is a wideband trained system so we train
two different speaker embedding systems
using the same architecture
one system is trained only on the voxceleb two
sixteen khz audio
the second system we train with mixed training
where we use both the wideband audio
and the downsampled and bandwidth expanded
audio
so we pass the voxceleb two dataset
through the codec
distortion
followed by
the bandwidth expansion
using the
bandwidth expansion system that we proposed in this paper
and then we combine the two datasets the original wideband
voxceleb
and the bandwidth expanded
downsampled voxceleb
to train
the speaker embedding system
so note that
both of these systems are trained on wideband audio
one is trained on the original wideband data
the second wb plus bwe is trained on wideband
plus
bandwidth expanded data
here are the results
the speaker verification results are shown in this bar graph
this bar graph shows the speaker verification equal error rates that we obtain
using the wideband only trained system
so the system is trained only on voxceleb sixteen khz wideband audio
you see that we perform speaker verification tests
on
four different test sets
voxceleb one
e subset
the sitw
dev dataset
the speakers in the wild eval dataset
and the nist sre two thousand ten ten second ten second test set
so these are the four test sets that we computed the results on
as you can see
performing bandwidth expansion
reduces the equal error rate
compared to the simply upsampled signal
so the plot in orange shows the equal error rate obtained using
simple upsampling
and the other plot shows
the results
after bandwidth expansion
note that
the voxceleb one
e set
the sitw
dev set and the sitw eval set
are all
sixteen khz
audio
we pass these test sets
through the codec distortion
simulation that we developed in this paper to simulate telephony audio
and then we apply speaker verification on top of this telephony distorted
audio those are
the results shown in the orange plot
the other plot is the output of the bandwidth expansion system
when you pass the telephony distorted test sets through the bandwidth expansion
system
note that the nist sre two thousand ten test set
consists of only telephony audio
so it
is inherently a narrowband speech signal
recorded as real telephony audio
so
we don't have the results for
the nist sre dataset
for
the original wideband audio because there is no wideband audio in this dataset
so we have only the
results
for simple upsampling and bandwidth expansion
so you see that even in the unseen case of the nist sre dataset which consists of
real telephony audio
there is a significant improvement in the equal error rate
finally we show the results on the mixed trained system even on the mixed trained
system there is a significant improvement in the equal error rates
across all the test sets
a particular point to note here is that the equal error rates
obtained
on the original wideband test sets
are much lower
than what we obtained with the wideband trained system
so initially for example for the voxceleb one e set
we obtained a four point one two percent eer
but after the mixed training the eer reduces to three point two percent
that means that the bandwidth expansion has helped improve the performance even on the original
sixteen khz audio
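As a quick worked check of how large that gain is in relative terms, assuming the two figures quoted are 4.12 percent and 3.2 percent equal error rate (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent, going from `baseline` to `improved`."""
    return 100.0 * (baseline - improved) / baseline


# Wideband-only training vs. mixed training on the same wideband test set.
gain = relative_reduction(4.12, 3.2)
```

Under that assumption the mixed training corresponds to roughly a 22 percent relative reduction in equal error rate on that test set.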
so these are the conclusions of our paper the bandwidth expansion system that we proposed
in this paper performs significantly better
than simple upsampling
we obtain a relative equal error rate reduction of four point four percent
on the nist sre two thousand ten test set
and a nine point ninety percent improvement on the sitw eval
set and eleven point one percent improvement on the sitw dev test set
the bandwidth expansion also improved the accuracy on the original sixteen khz data
across all
the protocols across all the test sets
which means that the bandwidth expansion system is helping as an augmentation mechanism for training
speaker verification of for training the speaker recognition system
so the proposed bandwidth expansion system is also a significantly lightweight system
compared to other systems that were recently proposed
and the system can be
deployed and used even in a real time scenario
these are some references that i have referred to in this presentation
please refer to the paper for further details and results
and
i will be glad to answer your questions
thank you for listening to my talk
i look forward to your
questions and discussions regarding this paper
thank you