Hello everyone. I will be presenting our talk on modeling overlapping speech using vector Taylor series. This work was done together with my colleagues.
So, first I will present the motivation for this problem. Then I will briefly discuss the previous approaches for detecting overlapping speech, and then move to the vector Taylor series approach, which has two parts: the first part uses the standard VTS approach, and the second part is the multiclass vector Taylor series algorithm, which we have proposed in this work. Then we will discuss the experiments and results.
So the motivation comes from the problem of speaker diarization, which is the task of determining who spoke when in a meeting audio. Given the audio recording, you want to find out which portions belong to which speakers. One challenge is that the number of speakers is not known a priori, so it has to be determined in an unsupervised manner.
Now, in this task overlapping speech becomes a very large source of error. So first let me define overlapping speech: it is the moments where two speakers speak simultaneously. It might be when people are debating or arguing, when they are agreeing or disagreeing, things like that, or when they are laughing together.
So what happens is that when you have overlapping speech in your audio, you cannot train the speaker models very precisely. Or, when you are doing speaker recognition and you assign one speaker identity to a portion in which there are actually two people speaking, that also results in errors in speaker recognition. Previous studies have shown that in meetings sometimes almost twenty percent of the spoken time can be overlapping if the participants are fairly active.
Now the previous approaches. One of the first works was done by Boakye, in which an HMM-based segmenter was built with three classes: speech, non-speech and overlapping speech. This was the baseline. Then people have used additional knowledge such as the silence distribution, and things like speaker changes, because it has been found that people tend to overlap when the speaker changes. The state of the art is based on convolutive non-negative sparse coding, in which they learn a set of bases for each speaker and then try to find out the activity of each speaker in each stream. The same features have also been used with recurrent neural networks, namely long short-term memory neural networks.
Now we come to our problem. Before I move to overlapping speech, there is an analogous problem of how to model speech which is corrupted with noise. If you have a noisy speech signal y, you can express it in the signal domain as x convolved with h plus n, where x is the clean speech, h is the channel noise, and n is the additive noise. In the MFCC domain — these are the mel-scale filterbank power spectra, you take the log and the DCT, and then you get the MFCC features — this simple expression here becomes a quite complex expression where you have a linear part and a nonlinear part. Here C is the DCT matrix and C-pinv is its pseudoinverse. We call this nonlinear part g, so you have y equal to x plus h plus this nonlinear part.
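As a reconstruction of the equations on this slide, in the usual VTS notation (the exact symbols are my assumption, not copied from the slides), the model is roughly:

```latex
\begin{aligned}
  y[t] &= x[t] \ast h[t] + n[t] && \text{(signal domain)}\\
  y &= x + h + g(x,h,n), \qquad
    g(x,h,n) = C \log\!\left(1 + \exp\!\left(C^{\dagger}(n - x - h)\right)\right) && \text{(MFCC domain)}
\end{aligned}
```

where C is the DCT matrix and C-dagger its pseudoinverse.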
Now, we want to model this equation, and we use the vector Taylor series, which basically expands this expression here. It is simply a Taylor expansion of the function about a point, where this is the zeroth-order term and this is the first-order term, which takes the first derivative. So we expand this expression for the noisy speech around the point mu_x, mu_n and mu_h, which are the mean of the clean speech, the mean of the noise and the mean of the channel noise. You get this expression here, in which the first line is the evaluation of y at this point, and the second line is the first-order term. This capital G and this capital F are the derivatives of y with respect to x and n.
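A hedged reconstruction of that first-order expansion, in the standard VTS form (the grouping on the actual slide may differ slightly):

```latex
y \;\approx\; \mu_x + \mu_h + g(\mu_x,\mu_h,\mu_n)
  \;+\; G\big[(x-\mu_x) + (h-\mu_h)\big] \;+\; F\,(n-\mu_n),
\qquad
G = \frac{\partial y}{\partial x}\bigg|_{(\mu_x,\mu_h,\mu_n)}, \quad
F = \frac{\partial y}{\partial n}\bigg|_{(\mu_x,\mu_h,\mu_n)} = I - G
```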
So in the standard vector Taylor series, when you are trying to model this y here, what people do is model a GMM for x and a single Gaussian each for the noise n and for h. This is because the noise is assumed to be stationary, and h is the channel noise. So the clean-speech GMM is being corrupted by the additive noise, and then, using the VTS approximation, you can obtain the noisy-speech model: these are Gaussians, and they are the components of the adapted GMM.
Now we come to the overlapping speech. What we propose is that overlapping speech is actually just a superposition of two or more individual speakers. So if we look at the model for the noisy speech, we can make an analogous model for overlapping speech, where we say that this x is x1, which we call the main speaker, and this x2 here is the corrupting speaker, which acts like the additive noise. For simplicity we ignore h, the channel noise, because the recordings for all the speakers are made in the same room, so we are not going to deal with h.
So, following the analogy, we have this expression where the overlapping speech y is now a combination of this linear part and this nonlinear term, and this nonlinear term is the same as in the case of the noisy speech. Again, analogous to the noisy-speech case, we have the main-speaker GMM here and the corrupting speaker, which is represented by a single Gaussian, like the additive noise. The equations are completely analogous to the noisy-speech case, and you can see here the subscript m: each component of y here is computed using this component from the main speaker plus some contribution from the corrupting speaker. And this G and this F, which are the derivatives of y, are also different for each component.
Now, if you take the expectation of this y here, you get the mean for the overlapping speech, and likewise the variance for the overlapping speech. So this is the final overlapping-speech model which we want to estimate.
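In other words, per component m of the main-speaker GMM the adapted model is roughly the following (my reconstruction in standard VTS notation, with x1 the main speaker and x2 the corrupting speaker):

```latex
\begin{aligned}
  \mu_{y,m} &\approx \mu_{x_1,m}
    + C \log\!\left(1 + \exp\!\left(C^{\dagger}(\mu_{x_2} - \mu_{x_1,m})\right)\right),\\
  \Sigma_{y,m} &\approx G_m \Sigma_{x_1,m} G_m^{\top} + F_m \Sigma_{x_2} F_m^{\top},
    \qquad F_m = I - G_m
\end{aligned}
```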
Now, for estimating this model we use the EM algorithm, for which this is the Q function. So given the overlapping-speech data, the frames y1 to yT, we take the probability of observing this data under the overlapping-speech model, mu_y,m and sigma_y,m, and we optimize this function Q with respect to the mean of the corrupting speaker, mu_x2. So the update equation for the mean is this expression: mu_x2,0 is the previous value for the mean of the corrupting speaker, and mu_x2 is the new value. One thing that you can notice here is that this mean of x2 represents the corrupting speaker as a whole, because it is being updated using all the mixture components from the overlapping-speech model.
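The update shown on the slide should correspond to the standard VTS noise-mean re-estimation; as a hedged reconstruction, with gamma_{t,m} denoting the component posteriors:

```latex
\begin{aligned}
  Q(\mu_{x_2}) &= \sum_{t=1}^{T}\sum_{m} \gamma_{t,m}\,
     \log \mathcal{N}\!\left(y_t;\, \mu_{y,m}, \Sigma_{y,m}\right),\\
  \mu_{x_2} &= \mu_{x_2,0}
     + \Big[\sum_{t,m}\gamma_{t,m}\, F_m^{\top}\Sigma_{y,m}^{-1}F_m\Big]^{-1}
       \sum_{t,m}\gamma_{t,m}\, F_m^{\top}\Sigma_{y,m}^{-1}\,(y_t - \mu_{y,m})
\end{aligned}
```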
The whole VTS algorithm then works like this: initially we initialize the mean of the corrupting speaker and its covariance. Then we compute the overlapping-speech model using these expressions. After that we run the EM loop, where we optimize the Q function and replace the old value mu_x2,0 by the new value mu_x2. In this work we are not going to update sigma_x2, because it is very heavy computationally. Then, when this loop converges, we finally get the overlapping-speech model y, which we use for overlapping speech detection.
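A minimal sketch of what this adaptation loop could look like in code; this is my reconstruction under assumptions (diagonal speaker covariances, fixed corrupting-speaker covariance, placeholder names and shapes), not the actual implementation from the work:

```python
import numpy as np
from scipy.stats import multivariate_normal

def vts_overlap_model(main_means, main_vars, weights, mu_x2, var_x2,
                      C, C_pinv, frames, n_iter=5):
    """Adapt a main-speaker GMM (diagonal covariances) with a single
    corrupting-speaker Gaussian (mu_x2, var_x2); re-estimate mu_x2 by EM."""
    D = C.shape[0]
    for _ in range(n_iter):
        mus_y, covs_y, Fs = [], [], []
        for mu_m, var_m in zip(main_means, main_vars):
            z = C_pinv @ (mu_x2 - mu_m)
            G = C @ np.diag(1.0 / (1.0 + np.exp(z))) @ C_pinv    # dy/dx1
            F = np.eye(D) - G                                    # dy/dx2
            mus_y.append(mu_m + C @ np.log1p(np.exp(z)))         # mean + nonlinear term
            covs_y.append(G @ np.diag(var_m) @ G.T + F @ np.diag(var_x2) @ F.T)
            Fs.append(F)
        # E-step: posteriors gamma[t, m] of each overlap component for each frame.
        ll = np.stack([np.log(w) + multivariate_normal.logpdf(frames, mu, cov)
                       for w, mu, cov in zip(weights, mus_y, covs_y)], axis=1)
        gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form update of the corrupting-speaker mean only
        # (its covariance is kept fixed, as mentioned in the talk).
        A, b = np.zeros((D, D)), np.zeros(D)
        for m, (mu_y, cov_y, F) in enumerate(zip(mus_y, covs_y, Fs)):
            P = np.linalg.inv(cov_y)
            A += gamma[:, m].sum() * F.T @ P @ F
            b += F.T @ P @ (gamma[:, m] @ (frames - mu_y))
        mu_x2 = mu_x2 + np.linalg.solve(A, b)
    return mus_y, covs_y, mu_x2
```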
So, the overlapping speech detection system. As input it takes the meeting audio recordings, and the recordings are first cut into speech segments, which we get using a speech activity detection system. Then one major task is to obtain the initial speaker models for the main speaker and the corrupting speakers; the question is how to get them. There are two options: either we use the oracle speaker segmentation, or we take them from the diarization output. The latter is a much more challenging task, because when you take the speaker alignments from the diarization output, you don't know how many speakers there actually were in your audio, so you might get more than the actual number of speakers in the diarization output. The output which we finally want is the detection of overlaps.
Now, given the audio recording, this blue box shows a speech segment given by the speech activity detection. We move a sliding analysis window over it. For each analysis window we can have on the order of N-squared hypotheses: in an overlap there would be two speakers overlapping, so if you have N speakers, then the total number of overlapping-speech models is N squared minus N, and these N here are the single-speaker models, when only one speaker is speaking. This is a huge number, so what we do is that for each speech segment we first determine the main speaker, and then we compute the overlapping-speech models where that main speaker is being corrupted by some other speaker. Finally we have the overlap models where speaker i is being corrupted by speaker j, and the single-speaker models where speaker i is speaking alone. We compare all these likelihood ratios to determine whether we have overlapping speech or single-speaker speech.
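A rough sketch of that decision logic, just to illustrate the hypothesis comparison; the scoring functions are assumed to be given (e.g. built with the VTS adaptation sketched earlier), and the threshold is a placeholder:

```python
import numpy as np

def detect_overlap(windows, single_score, overlap_score, threshold=0.0):
    """windows: list of feature arrays, one per analysis window.
    single_score[i](x): average log-likelihood of window x under speaker i.
    overlap_score[i][j](x): the same under the VTS overlap model
    'speaker i corrupted by speaker j'. Returns one label per window."""
    n_spk = len(single_score)
    labels = []
    for x in windows:
        single = [s(x) for s in single_score]
        i = int(np.argmax(single))                           # most likely main speaker
        best_overlap = max(overlap_score[i][j](x)
                           for j in range(n_spk) if j != i)  # best corrupting speaker
        llr = best_overlap - single[i]                       # log-likelihood ratio
        labels.append("overlap" if llr > threshold else "single")
    return labels
```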
so
up to hear that was the standard but it is likely that bloat now be
moved to the multiclass but it is really the algorithm so you would have seen
that in the standard vts we used only one simple gaussian distribution for the noise
but there sometimes and might be good in the case when we are dealing with
noise but in case of overlapping speech
the other cup are the expert without collecting speaker he himself if the human being
in and said so
it's not like a noisy might be he might i don't multiple phonemes in that
window
so we want to prevent him using more data
or more a better modeling
So what we propose is that instead of having one single Gaussian here, we assume that all the Gaussians in the GMM of x2 may also be present. So now we are going to apply the vector Taylor series to this combination of two GMMs, with this GMM for the corrupting speaker. What we do here is that we start with the assumption that each of these Gaussians might have been hit in that analysis window. Then for each of the Gaussians we compute a gamma value, which is the average number of frames assigned to that Gaussian component in that analysis window. If this gamma value happens to be lower than a threshold, then we cluster it with its nearest Gaussian component. So we get this kind of clustering, and then we say that the Gaussian which has the highest gamma in the cluster becomes the cluster centroid: all these components here will be adapted through that one single Gaussian, so all these Gaussians are now updated via the cluster centroid of this cluster.
We can make that assumption because all these Gaussian mixture models have been derived from the same UBM. So in the overlapping speech y, the Gaussian here would be computed using the Gaussian here plus a contribution from the corrupting speaker, from this component. If you set the threshold to zero, that is, you don't want to set any threshold, then there will be no clustering, and each Gaussian will go into a one-to-one combination to give you the overlapping speech.
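As one possible reading of that clustering step in code (hypothetical names; the merge order and the Euclidean distance between means are my assumptions, not specified in the talk):

```python
import numpy as np

def cluster_components(gamma_counts, means, gamma_th):
    """gamma_counts[k]: average number of frames assigned to component k of the
    corrupting speaker's GMM in the analysis window. means: (K, D) array.
    Components below gamma_th are merged with their nearest component; the
    member with the highest count in a cluster becomes its centroid."""
    K = len(gamma_counts)
    cluster_of = list(range(K))               # start with one cluster per Gaussian
    for k in np.argsort(gamma_counts):        # merge the weakest components first
        if gamma_counts[k] >= gamma_th:
            continue
        d = np.linalg.norm(means - means[k], axis=1)   # distance to other components
        d[k] = np.inf
        cluster_of[k] = cluster_of[int(np.argmin(d))]
    # centroid of each cluster = member with the highest gamma count
    centroids = {}
    for c in set(cluster_of):
        members = [k for k in range(K) if cluster_of[k] == c]
        centroids[c] = members[int(np.argmax([gamma_counts[k] for k in members]))]
    return cluster_of, centroids
```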
The equations for the mean update in the case of multiclass VTS, which we show here, are the same as in the previous case; the only difference is that now you have a subscript c, which denotes the cluster. For each cluster you have a different centroid, and that centroid is updated using this equation. As I showed on the previous slide, there the mean was computed using all the Gaussian components, but now this equation only takes into account the Gaussians which are in the cluster c. Similarly, all the other equations are identical; the only difference is that instead of having the single-Gaussian representation for x2, we now do it cluster-wise, so you have a subscript c everywhere. So that's the multiclass vector Taylor series algorithm framework.
Now coming to the experiments. The dataset we have used is the AMI meeting corpus. The meetings are of the kind where a group of three or four people are trying to design a remote control or something, so they are discussing, arguing, debating. The duration varies from seventeen to fifty-seven minutes. The audio which we take is from a single distant microphone, which is the most difficult condition. We use MFCC features, and for the single-speaker models we use MAP-adapted GMMs.
Now the error metric: it is called the overlap detection error, which is the false alarm time plus the missed time, divided by the labelled speaker overlap time. One thing to note is that the false alarms come from the regions where only a single speaker is speaking, and those regions are much larger than the overlapping speech, so this whole expression can take values over a hundred percent.
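Written out, the metric is simply:

```latex
\text{overlap detection error} \;=\;
  \frac{T_{\text{false alarm}} + T_{\text{miss}}}{T_{\text{labelled overlap}}} \times 100\%
```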
The first experiment which we did was with the standard VTS, where we have only one Gaussian representation for the corrupting speaker. We wanted to determine which analysis window size works best. We found that when going to a window size of three point two seconds, the error rate was lower compared to smaller values like these. Above this, the error rate does not decrease that much; instead the computation time increases a lot, because then you are applying the same computational burden to a larger window. So in the next experiments we are going to use this window size. These are the curves for the previous table, the recall-precision curves we obtained, and the top one is with the window size of three point two seconds.
Now the results for the multiclass VTS. In the standard VTS, the overlap detection error rate was ninety-six point two percent. When we use the multiclass VTS, it improves by an absolute value of sixteen percent. These four experiments were to determine what the optimal value for the clustering threshold should be. So in a window of three point two seconds we have three hundred and twenty frames, and if we set a threshold of five frames for each Gaussian, then these values here denote how much clustering happened: we start from sixty-four clusters in the beginning, and with a threshold of five we end up with many fewer clusters.
We found that the best results were when we used a threshold of one frame. In that case the overlap detection error reduces to eighty percent, which is quite good. And the final number of clusters that we get is twenty-four point seven on average: beginning with sixty-four, we end up having twenty-four point seven clusters.
so
as i said we have like to different kind of options for modeling the speaker
one likely model the speaker from the oracle or one
we are modeled the speakers on the data is not bored
so in case of articles the speaker models are ready purely to begin with so
that's why
the results that are quite good
but when we start with the database an output
we don't we might get a seven speaker target speakers
when there are actually only for speaker so
it's a set of problems given that is
the added it is ninety three point three percent
which is it better than the standard vts approach
so these are the kernel sorta previous table
but i that if using but that a vision system so that efficient system works
in a totally unsupervised manner and the final goal we have it is to make
this by data back end we want to so
improve it
up to this point which is by the articles
so we are trying to reduce this gap
comparing to the other words
so the mfcc a gmm system which is which was proposed by bouquet
it works with a ninety two point four percent
or whatever it takes another the state-of-the-art which using l s d m o'clock set
seventy six point nine percent
the best of that we have in this work at eighty percent
but then there's of using the or tickets
a completely unsupervised the system works at and add an error tradeoff ninety three point
two people think
So, coming to the conclusions: we have proposed a new approach for modeling overlapping speech, and we extended the vector Taylor series framework to the multiclass VTS system. We analyzed the analysis window size and found that a window of three point two seconds works better, and we reported the precision and recall that we were able to reach. One thing to note here is that in the LSTM approach they had very good precision, but in our case the recall is much better. The future work which we want to do includes the covariance adaptation and delta features, and in the case of the diarization output, we want to use better speaker models.
After that, we also extended this work for an Interspeech submission, where we used the diarization output, and we were able to improve these numbers: this one down to seventy-eight, and this ninety-three point three down to eighty-nine. But still, going from eighty-nine towards seventy-six, we have more work to do. So although we cannot say that it is working on par with the state-of-the-art system, we think that this is a very promising approach, and it can maybe be used for some other kinds of problems, for example if you want to model speech corrupted with noise, but noise which is much more complex. With that, thank you.
So, I'm having problems understanding: when you go from ninety-six to ninety-three percent error, that's a big improvement, and I'm not questioning that. What might help, I guess, is if at some point you could do a test like: what is the performance that you think is necessary for a usable system? You said seventy-six is kind of the state of the art. Has anyone done any test where maybe you take clean data that doesn't have any overlap at all, and insert certain controlled amounts of overlap, where you can run that performance metric and decide whether humans, or the subsequent diarization system, find it acceptable when it hits, you know, an error rate of fifty percent? I'm not sure what number you actually have to hit before you can say it's a viable solution, because going from ninety-six to ninety-three, the numbers just seem too high to make it practically usable.
Okay, so for the first point: I'm not aware of any work where they have artificially created or inserted overlaps in the audio. But the main purpose of doing all this is to improve the speaker diarization system, so we want to know at what value of this error diarization finally improves. The state of the art using the LSTM had an error of seventy-six point nine, but I think in that paper they did not give the diarization error rate which they achieved using that system. For our system, we have a paper at Interspeech where we also present the effect of this overlap detection on diarization, in the case when we have eighty-nine percent error: this value of ninety-three point three, we have meanwhile been able to reduce it to eighty-nine percent, and when we use that system for diarization we get a marginal improvement over the baseline. So I hope that when someone gets the overlap detection error rate below eighty, it would give quite a significant improvement in diarization as well.
Sure, I have two questions. Have you considered modeling more than two speakers at once? And the second question: how do you define who is the main speaker and who is the corrupting speaker?
So, about the first question, overlaps of more than two speakers: we can make that assumption to keep the number of models low, and we have also looked at the statistics. I don't remember the exact values, but unless people are laughing together or having a very uncontrolled meeting or discussion, they don't tend to speak three or four all together. Otherwise the overlaps tend to be like this: one speaker is speaking, then some other speaker starts speaking, and at that moment you have an overlap of two speakers. Also, with this formulation of VTS, at this moment we cannot extend it to three speakers, because in the formulation we are assuming one additive noise source. And can you repeat the second question? Sorry. Okay.
So for the main speaker: we have speaker models for all the speakers, so we directly use them to find out which one gives the most likely score for that analysis window, and we use that to determine the main speaker.
I'm just wondering about the inter-annotator agreement on this task; it seems to be a very difficult task, even for humans. So are all those numbers in the range of the inter-annotator agreement? I mean, do you have any idea on this point?
The annotation which we have comes from ICSI, and I have gone through it; it's quite accurate, even the overlaps of more than two speakers have been annotated. But I'm not sure about the inter-annotator agreement.