Good morning, everyone. I'm not sure if you noticed, but this is the only speaker recognition talk in this session, which makes me feel somehow like the distant relative that the family invites but, you know, doesn't really want to.
Today I'm going to present some of the recent advances in our speaker recognition system, and I will share some results that we obtained with the system on the NIST SRE 2010 extended core tasks, with an emphasis on the telephony condition, which is condition 5. This is joint work with Sriram Ganapathy, who is now an assistant professor at IISc Bangalore, India, and with Jason Pelecanos.
I will start with a brief overview of some of the recent state-of-the-art work in speaker recognition, and then I will share the objectives of my talk. I will present our speaker recognition system and the key components that contributed the most towards the end results. I'll describe our experimental setup: the data we used, the DNN acoustic models and their configurations, as well as the speaker recognition system configuration. And I'll share with you, as I said, the results we obtained with the system on the NIST SRE 2010 extended core tasks, mostly on condition 5, and then compare them with prior work.
When we look at the recent state-of-the-art work on speaker recognition, first of all, most state-of-the-art systems are i-vector based, and they use a universal background model (UBM) in some form to generate the statistics needed to compute the i-vectors. Looking at this over time, we started with traditional unsupervised Gaussian mixture models to represent the UBMs, and then more recently moved to phonetically-aware UBMs, which are derived from an ASR system.
I would like to emphasize here that, even though this work done at IBM does not get much credit, it was in fact the first work that used senones to compute the hyper-parameters of the UBM for speaker recognition, and it achieved state-of-the-art results as a single system on the NIST SRE 2010. After this came the work from SRI, which used DNN-based senone posteriors to compute the UBM parameters.
More recently, there was the work from Johns Hopkins University, which used TDNN-based posteriors to compute the UBM parameters. In fact, they found that, contrary to what SRI had found with diagonal covariance matrices, if you estimate the UBM parameters with full covariance matrices, you can use a supervised UBM directly to compute the statistics and from there compute the i-vectors, without necessarily going through the hassle of the DNN-based system. This saves a lot of computation, and they had nice gains as well.
Some of the state-of-the-art systems do not use the DNN posteriors to compute the UBM hyper-parameters; instead, they use DNN bottleneck features, and the rest of the i-vector-based speaker recognition pipeline remains the same.
I have mentioned some of the recent work here, but I would also like to give some credit to the Heck et al. work from 1998, which was the first to explore bottleneck-based features for speaker recognition.
So, the objectives of my talk today. I will be sharing our state-of-the-art results on the NIST SRE 2010 extended core tasks; again, our emphasis is on the telephony condition, which is condition 5. I will be presenting the key system components that contributed the most towards achieving these results. Namely, I will talk about the fMLLR-based features that we used, and compare them with more traditional raw acoustic features such as MFCCs. We also used a DNN-based acoustic model, in place of an unsupervised GMM acoustic model, for the UBM.
This is technically not novel: DNN-based i-vectors have been around for a while now. What we did here is nearly double the size of the senone set, and we wanted to see how that impacts speaker recognition performance.
Finally, we explored nearest neighbor discriminant analysis (NDA) to achieve inter-session variability compensation in the i-vector space, and we compared its performance with the more commonly used LDA. We also quantify the contribution of each of these three system components towards the overall performance; in fact, we will also see how varying, for example, the size of the senone set impacts the performance.
Now let's take a look at our speaker recognition system. You can see its flowchart here; this assumes that all the model parameters are already trained, so we have the DNN acoustic model, the i-vector extractor, and the NDA and PLDA models.

The three components I just mentioned, let me repeat them. First, fMLLR-based features, which are used both to train and evaluate the DNN and to compute the sufficient statistics for i-vector extraction; with fMLLRs you can achieve speaker and channel normalization. Second, a DNN acoustic model, instead of an unsupervised GMM acoustic model, to compute the i-vectors; again, compared to the previous work, we nearly doubled the size of the senone set. Third, we replaced the more commonly used LDA with NDA for inter-session variability compensation, and used PLDA scoring, which I'm sure you're familiar with.
If we look at the previous work with DNN senone i-vectors, what we observe is that many systems use two different sets of features: one to compute the posteriors, and one to compute the sufficient statistics. Typically, ASR features are different from speaker recognition features, which makes sense. In this work, we wanted to see what happens if we unify them, that is, use the same set of features both to train and evaluate the DNN and to compute the sufficient statistics for i-vector extraction. Towards that end, we considered feature-space maximum likelihood linear regression (fMLLR) transforms, whose outputs are used as the features for our DNN system.
The fMLLR transform is a linear transform like this, which can be decomposed into a linear part A and a translation (bias) b. These parameters are obtained using the alignments from a first pass through a GMM-HMM system, and maximum likelihood estimation then gives us A and b; I will not repeat the details here. Applying this transform to raw acoustic features such as MFCCs, or even to already transformed features such as LDA features, gives us speaker- and channel-normalized features.
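Concretely, once the transform has been estimated, applying it is just an affine mapping per frame. Here is a toy sketch in NumPy; the transform values and dimensions below are hypothetical, and in practice W would come from the maximum likelihood estimation against the GMM-HMM alignments mentioned above:

```python
import numpy as np

# Apply a speaker-specific fMLLR transform W = [A | b] to a matrix of
# acoustic features, one frame per row: y_t = A x_t + b.
def apply_fmllr(features, W):
    """features: (T, d) frames; W: (d, d+1) fMLLR transform [A | b]."""
    A, b = W[:, :-1], W[:, -1]
    return features @ A.T + b

d = 4                                        # toy feature dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((10, d))             # 10 frames of raw features
W = np.hstack([np.eye(d) * 0.9,              # toy linear part A
               np.ones((d, 1)) * 0.1])       # toy bias b
Y = apply_fmllr(X, W)
print(Y.shape)  # (10, 4)
```

The same transformed frames then feed both the DNN and the statistics accumulation, which is what unifies the ASR and speaker recognition front ends.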
This may sound contradictory, by the way, because fMLLR is used to reduce speaker variability; but, as we know, there are two kinds of speaker variability: within-speaker and between-speaker. Here, we believe that the within-speaker normalization fMLLR provides dominates the between-speaker normalization. We also get the benefit of channel normalization: if we have different handsets, for example, we think that fMLLR can take care of that as well.
Now, as I mentioned, DNN senone i-vectors have been around for a while, so there is nothing technically new in this slide. The only difference is that we nearly doubled the size of our senone set, compared to the previous work, to compute the posteriors and, from there, the sufficient statistics.
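As a reminder of how the senone posteriors enter the pipeline, here is a minimal sketch, with toy shapes and random data, of the zeroth- and first-order Baum-Welch statistics, where the DNN's per-frame senone posteriors play the role the GMM component occupancies play in a classical UBM system (the function name is mine, not from any particular toolkit):

```python
import numpy as np

def sufficient_stats(features, gammas):
    """features: (T, d) frames; gammas: (T, C) per-frame senone posteriors.
    Returns zeroth-order stats N (C,) and first-order stats F (C, d)."""
    N = gammas.sum(axis=0)       # N_c = sum_t gamma_c(t)
    F = gammas.T @ features      # F_c = sum_t gamma_c(t) * x_t
    return N, F

T, d, C = 50, 3, 8               # frames, feature dim, senones (toy sizes)
rng = np.random.default_rng(1)
X = rng.standard_normal((T, d))
logits = rng.standard_normal((T, C))
G = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
N, F = sufficient_stats(X, G)
print(N.shape, F.shape)  # (8,) (8, 3)
```

The i-vector extractor consumes these statistics exactly as in the GMM case; only the source of the posteriors changes.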
So I'm not going to spend much time on this; at this point, we know how to compute i-vectors even with 10k senones.
Just to connect this work to one of the presentations yesterday: the speaker talked about how i-vector distributions are not necessarily Gaussian, and he actually showed us some distributions, and that was even on clean data, not on noisy data. Now, LDA is formulated based on Gaussian distribution assumptions for the individual classes; even if they are not Gaussian, they need to be at least unimodal. Therefore, LDA cannot effectively handle multimodal data, which is typical in the NIST SRE type of scenario, because the data come from various sources: we have Switchboard data and Mixer data, and that causes multimodality in the i-vectors. Also, for applications such as language recognition, because we only have a few classes, the LDA transform can be rank-deficient, so we might take a hit from that as well.
So, instead of trying to transform the i-vector space so that it is more Gaussian-like, as was presented yesterday, here we tried to use a transform that does not assume Gaussianity and does not use the global structure of the classes to compute the between-class scatter matrix. LDA uses the class centroids: the differences between class centroids, the arrows you see here, are what it uses to compute the between-class scatter. In NDA, we do not assume any global structure for the individual classes; rather, we assume that classes are only locally structured. So we use local means, computed from the k nearest neighbors of each individual sample, and then use those differences to compute the between-class scatter matrix.
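For contrast, the centroid-based between-class scatter that LDA relies on can be sketched in a few lines; this is an illustrative toy implementation with made-up data, not production code:

```python
import numpy as np

def lda_between_scatter(X, y):
    """Classical LDA between-class scatter: class-size-weighted outer
    products of (class centroid - global mean)."""
    mu = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sb

# two tiny 2-D classes
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 1.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
Sb = lda_between_scatter(X, y)
print(Sb.shape)  # (2, 2)
```

Everything here is driven by a single centroid per class, which is exactly the global-structure assumption that NDA replaces with local nearest-neighbor means.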
Another point is that we introduce a weighting function, which emphasizes the samples near the classification boundary, since those are the most important ones for discriminating between the classes; a sample far from the boundary gets a very small weight, because it contributes little towards the class discrimination.
Also, unlike LDA, given enough examples for the different classes, NDA can always be full rank. It is therefore very useful for applications such as language identification; we published that work, I believe in 2015, and we actually obtained some gains over LDA there.
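To make the NDA between-class scatter concrete, here is a minimal sketch of the local-mean construction just described; this is my own illustrative implementation, not our production code, and the choices of k, the distance exponent alpha, and the toy data are arbitrary:

```python
import numpy as np

def knn_mean_and_dist(x, data, k):
    """Local mean of the k nearest neighbours of x in `data`, and the
    distance from x to its k-th nearest neighbour."""
    d = np.linalg.norm(data - x, axis=1)
    idx = np.argsort(d)[:k]
    return data[idx].mean(axis=0), d[idx[-1]]

def nda_between_scatter(X, y, k=3, alpha=2):
    """Weighted between-class scatter built from local k-NN means
    instead of class centroids, as in NDA."""
    dim = X.shape[1]
    Sb = np.zeros((dim, dim))
    for i in np.unique(y):
        Xi = X[y == i]
        for j in np.unique(y):
            if j == i:
                continue
            Xj = X[y == j]
            for t in range(len(Xi)):
                x = Xi[t]
                # within-class neighbours must exclude the sample itself
                _, d_i = knn_mean_and_dist(x, np.delete(Xi, t, axis=0), k)
                m_j, d_j = knn_mean_and_dist(x, Xj, k)
                # boundary-emphasising weight: ~0.5 near the boundary,
                # ~0 deep inside the class
                w = min(d_i**alpha, d_j**alpha) / (d_i**alpha + d_j**alpha)
                diff = (x - m_j)[:, None]
                Sb += w * (diff @ diff.T)
    return Sb

rng = np.random.default_rng(0)
# two toy clusters standing in for i-vector classes
X = np.vstack([rng.normal(0, 1, (12, 5)), rng.normal(3, 1, (12, 5))])
y = np.array([0] * 12 + [1] * 12)
Sb = nda_between_scatter(X, y)
print(Sb.shape)  # (5, 5)
```

Together with the usual within-class scatter, this matrix defines the NDA projection through the same generalized eigenvalue problem used for LDA.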
Our experimental setup. For training data, we extracted English telephony and microphone data from the NIST 2004 through 2008 SRE data; we also used Switchboard data, both cellular and landline. This resulted in a total of 60k recordings to train our system hyper-parameters. For evaluation, we considered the NIST 2010 SRE extended evaluation set; the reason we chose the NIST SRE 2010 over 2012 is that we had some anchors with which to compare the performance of our system against other sites. The conditions we considered were condition 1 to condition 5; you can see the details here, but I want to emphasize again that our focus is on condition 5, which is telephone speech with a mismatch between enrollment and test: the types of phones used in enrollment and test are not necessarily the same.
Our DNN acoustic model had seven hidden layers, six of them with 2048 hidden units each, plus a bottleneck layer with 512 units; we used Fisher data to train it. In addition to the original 10k senones, we also considered 2.4k posteriors, basically to see how varying the granularity of the output layer affects speaker recognition performance.
As for the setup of our speaker recognition system, we used a 500-dimensional total variability subspace, which was reduced to 250 dimensions using NDA or simply LDA, trained on the entire training set; we report equal error rate (EER) as well as minDCF08 and minDCF10. Note that we also considered both the 2.4k and the 10k senone sets for i-vector extraction.
In terms of results, let us first compare NDA and LDA. These results were obtained with MFCC features, a 2048-component Gaussian mixture model, and the 10k DNN, and are reported on condition 5. As we can see, no matter which type of acoustic model we use, NDA always provided a nice benefit over LDA, across all three metrics. The reason, as I mentioned, is that NDA can handle non-Gaussian and multimodal data more effectively than LDA.

For the comparison of MFCCs versus fMLLR features, again on condition 5 with the 10k DNN, we can see that, whether with LDA or NDA, we always get an improvement with fMLLR features over MFCCs; the reason is that fMLLRs provide speaker and channel normalization. Also note that we unified the speaker recognition and speech recognition features this way, so the system is even simpler; but we should also take into account the fact that, in order to compute the fMLLR transforms, we need a two-pass system rather than a single-pass one.

To measure the impact of the senone set size, we considered 2.4k versus 10k posteriors; as we increase the senone set size, the results improve. We also ran a 32k senone experiment, just to see how it impacts performance; by the time that experiment finished, we did not see much gain with 32k senones.
I just want to emphasize here that, in contrast to what we see with the DNN, if you increase the number of components in a GMM, at least with diagonal covariance matrices, you do not see these gains: increasing the number of GMM components beyond 2k yields marginal gains, if not degradations.
Now, as they say, a picture is worth a thousand words, so I have distilled this work into a table. First, you can see how NDA compares to LDA with both the GMM-based and the DNN-based systems; the performance gap is larger when we use GMMs to compute the posteriors for the i-vectors, and with the DNN, as we increase the size of the senone set, this gap in performance narrows. Secondly, we can compare the 2.4k versus the 10k DNN senone performance.
Here is the progression of our system over time. We started with a very basic system: GMMs, MFCCs, and LDA. We replaced the LDA with NDA and got a gain; we then replaced the GMM with the DNN, and the MFCCs with fMLLRs, and got a further boost in performance. As a result, we have the best published performance on the NIST SRE 2010, at least on condition 5; for the other conditions, we believe those are also the best published performances, but I refer you to the paper for more details.
Because we claim the best published results, I wanted to give credit to two other works. One of them reports a 1.09% equal error rate, but with a gender-dependent system, whereas our system is gender-independent. The other work also used gender-dependent systems and only reported results on female trials, so I am not sure how we can compare against its numbers.
In conclusion, I presented our speaker recognition system and the components it is built from; I shared our results with you and quantified the contribution of the different components. If you are interested in further progress on our system, please come visit us at the IBM speech group; or, you know, if you buy me a cookie after this, I might be able to share more details with you. Thank you.

We have time for some questions.
Q: Thank you for your presentation. My question is about the weights in the NDA computation: are those the weights from the original NDA formulation that you mentioned?
A: Yes, those weights are as in the original NDA. You said that the data points close to the boundaries get more weight; let's take a look at how things are computed. The numerator is the minimum of two distances: the distance between a sample and its k-th nearest neighbor within its own class, and the distance to its k-th nearest neighbor from the competing class j. This is divided by the sum of the two distances. So, if the sample is not close to the boundary, it is going to be much closer to its own class's k nearest neighbors than to the other class's; the numerator is then small compared to the denominator, and you get a weight close to zero. If, on the other hand, the sample is close to the boundary, the two distances become comparable, so the numerator approaches half the denominator. That means samples near a classification boundary get a weight of about 0.5, and samples far from the boundary get a weight near zero.
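A tiny numeric illustration of that ratio, with hypothetical distances and a quadratic distance exponent as one possible choice:

```python
def nda_weight(d_own, d_other, alpha=2):
    """Boundary-emphasising NDA weight from the two k-th nearest
    neighbour distances: own class vs. competing class."""
    return min(d_own**alpha, d_other**alpha) / (d_own**alpha + d_other**alpha)

# Deep inside its own class: own-class neighbours are much closer.
print(round(nda_weight(0.1, 5.0), 4))  # 0.0004 -> essentially ignored
# Right on the boundary: the two distances are comparable.
print(round(nda_weight(1.0, 1.1), 4))  # 0.4525 -> close to the 0.5 cap
```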
Q: Can you explain conceptually what that means? Why do the samples near the boundaries get more weight? The samples which are far from the boundary are far anyway, but what do they contribute, or fail to contribute, to the estimation of the between-class scatter, if one assumes the classes are Gaussian?
A: Well, even if a class is Gaussian, the data points that are far away from its mean are like outliers, right? The samples near the boundary are the ones that can be confused between classes, so they are the ones that carry the discriminative information. And keep in mind that the training set already has the labels: we know those distant samples belong to their class and are far from the classification boundary, so they add little to the class discrimination.
Okay? Thank you.
Q: Thank you for your talk. I actually have a question regarding the implementation of the NDA. In the papers I have seen several variants, for example for the within-class covariance: did you use the classical one, or something specific to this work?

A: For this work we used the classical one: we computed the within-class scatter matrix in exactly the same way it is computed for LDA.
A: For the k nearest neighbors, we used one-versus-rest: that means that for each class, you consider that class versus all the other classes and compute the nearest neighbors accordingly.

Q: That was actually my follow-up question: apart from the computational time, which would obviously differ, do the final results change at all with the pairwise alternative?
A: It was so slow that I never explored it.

Q: Thank you.