Hello everyone. I will be presenting our talk on modeling overlapping speech using vector Taylor series. This work was done together with my colleagues.
So, first I will present the motivation for this problem. Then I will briefly discuss the previous approaches for detecting overlapping speech, and then move to the vector Taylor series approach, which has two parts: the first part uses the standard VTS approach, and the second part is the multiclass vector Taylor series algorithm, which we have proposed in this work. Then we will discuss the experiments and results.
So the motivation comes from the problem of speaker diarization, which is the task of determining who spoke when in a meeting audio. Given the audio recording, you want to find out which portions belong to which speakers. One challenge is that the number of speakers is not known a priori, so it has to be determined in an unsupervised manner.
Now, in this task overlapping speech becomes a very large source of error. So first let me define overlapping speech: it is the moments where two speakers speak simultaneously. It might be when people are debating or arguing, when they are agreeing or disagreeing, things like that, or when they are laughing together.
So what happens is that when you have overlapping speech in your audio, you cannot train the speaker models very precisely. Or, when you are doing speaker recognition and you assign one speaker identity to a portion in which there are actually two people speaking, that also results in errors in speaker recognition. Previous studies have shown that in meetings sometimes almost twenty percent of the spoken time can be overlapping if the participants are fairly active.
Now the previous approaches. One of the first works was done by Boakye, in which an HMM-based segmenter was built with three classes: speech, non-speech and overlapping speech. This was the baseline. Then people have used additional knowledge such as the silence distribution, and things like speaker changes, because it has been found that people tend to overlap when the speaker changes. The state of the art is based on convolutive non-negative sparse coding, in which they learn a set of bases for each speaker and then try to find out the activity of each speaker in each stream. The same features have also been used with recurrent neural networks, namely long short-term memory neural networks.
Now we come to our problem. Before I move to overlapping speech, there is an analogous problem of how to model speech which is corrupted with noise. If you have a noisy speech signal y, you can express it in the signal domain as x convolved with h plus n, where x is the clean speech, h is the channel noise, and n is the additive noise. In the MFCC domain — these are the mel-scale filterbank power spectra, you take the log and the DCT, and then you get the MFCC features — this simple expression here becomes a quite complex expression where you have a linear part and a nonlinear part. Here C is the DCT matrix and C-pinv is its pseudoinverse. We call this nonlinear part g, so you have y equal to x plus h plus this nonlinear part.
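As a reconstruction of the equations on this slide, in the usual VTS notation (the exact symbols are my assumption, not copied from the slides), the model is roughly:

```latex
\begin{aligned}
  y[t] &= x[t] \ast h[t] + n[t] && \text{(signal domain)}\\
  y &= x + h + g(x,h,n), \qquad
    g(x,h,n) = C \log\!\left(1 + \exp\!\left(C^{\dagger}(n - x - h)\right)\right) && \text{(MFCC domain)}
\end{aligned}
```

where C is the DCT matrix and C-dagger its pseudoinverse.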
Now, we want to model this equation, and we use the vector Taylor series, which basically expands this expression here. It is simply a Taylor expansion of the function about a point, where this is the zeroth-order term and this is the first-order term, which takes the first derivative. So we expand this expression for the noisy speech around the point mu_x, mu_n and mu_h, which are the mean of the clean speech, the mean of the noise and the mean of the channel noise. You get this expression here, in which the first line is the evaluation of y at this point, and the second line is the first-order term. This capital G and this capital F are the derivatives of y with respect to x and n.
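A hedged reconstruction of that first-order expansion, in the standard VTS form (the grouping on the actual slide may differ slightly):

```latex
y \;\approx\; \mu_x + \mu_h + g(\mu_x,\mu_h,\mu_n)
  \;+\; G\big[(x-\mu_x) + (h-\mu_h)\big] \;+\; F\,(n-\mu_n),
\qquad
G = \frac{\partial y}{\partial x}\bigg|_{(\mu_x,\mu_h,\mu_n)}, \quad
F = \frac{\partial y}{\partial n}\bigg|_{(\mu_x,\mu_h,\mu_n)} = I - G
```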
So in the standard vector Taylor series, when you are trying to model this y here, what people do is model a GMM for x and a single Gaussian each for the noise n and for h. This is because the noise is assumed to be stationary, and h is the channel noise. So the clean-speech GMM is being corrupted by the additive noise, and then, using the VTS approximation, you can obtain the noisy-speech model: these are Gaussians, and they are the components of the adapted GMM.
Now we come to the overlapping speech. What we propose is that overlapping speech is actually just a superposition of two or more individual speakers. So if we look at the model for the noisy speech, we can make an analogous model for overlapping speech, where we say that this x is x1, which we call the main speaker, and this x2 here is the corrupting speaker, which acts like the additive noise. For simplicity we ignore h, the channel noise, because the recordings for all the speakers are made in the same room, so we are not going to deal with h.
So, following the analogy, we have this expression where the overlapping speech y is now a combination of this linear part and this nonlinear term, and this nonlinear term is the same as in the case of the noisy speech. Again, analogous to the noisy-speech case, we have the main-speaker GMM here and the corrupting speaker, which is represented by a single Gaussian, like the additive noise. The equations are completely analogous to the noisy-speech case, and you can see here the subscript m: each component of y here is computed using this component from the main speaker plus some contribution from the corrupting speaker. And this G and this F, which are the derivatives of y, are also different for each component.
Now, if you take the expectation of this y here, you get the mean for the overlapping speech, and likewise the variance for the overlapping speech. So this is the final overlapping-speech model which we want to estimate.
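In other words, per component m of the main-speaker GMM the adapted model is roughly the following (my reconstruction in standard VTS notation, with x1 the main speaker and x2 the corrupting speaker):

```latex
\begin{aligned}
  \mu_{y,m} &\approx \mu_{x_1,m}
    + C \log\!\left(1 + \exp\!\left(C^{\dagger}(\mu_{x_2} - \mu_{x_1,m})\right)\right),\\
  \Sigma_{y,m} &\approx G_m \Sigma_{x_1,m} G_m^{\top} + F_m \Sigma_{x_2} F_m^{\top},
    \qquad F_m = I - G_m
\end{aligned}
```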
Now, for estimating this model we use the EM algorithm, for which this is the Q function. So given the overlapping-speech data, the frames y1 to yT, we take the probability of observing this data under the overlapping-speech model, mu_y,m and sigma_y,m, and we optimize this function Q with respect to the mean of the corrupting speaker, mu_x2. So the update equation for the mean is this expression: mu_x2,0 is the previous value for the mean of the corrupting speaker, and mu_x2 is the new value. One thing that you can notice here is that this mean of x2 represents the corrupting speaker as a whole, because it is being updated using all the mixture components from the overlapping-speech model.
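The update shown on the slide should correspond to the standard VTS noise-mean re-estimation; as a hedged reconstruction, with gamma_{t,m} denoting the component posteriors:

```latex
\begin{aligned}
  Q(\mu_{x_2}) &= \sum_{t=1}^{T}\sum_{m} \gamma_{t,m}\,
     \log \mathcal{N}\!\left(y_t;\, \mu_{y,m}, \Sigma_{y,m}\right),\\
  \mu_{x_2} &= \mu_{x_2,0}
     + \Big[\sum_{t,m}\gamma_{t,m}\, F_m^{\top}\Sigma_{y,m}^{-1}F_m\Big]^{-1}
       \sum_{t,m}\gamma_{t,m}\, F_m^{\top}\Sigma_{y,m}^{-1}\,(y_t - \mu_{y,m})
\end{aligned}
```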
The whole VTS algorithm then works like this: initially we initialize the mean of the corrupting speaker and its covariance. Then we compute the overlapping-speech model using these expressions. After that we run the EM loop, where we optimize the Q function and replace the old value mu_x2,0 by the new value mu_x2. In this work we are not going to update sigma_x2, because it is very heavy computationally. Then, when this loop converges, we finally get the overlapping-speech model y, which we use for overlapping speech detection.
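A minimal sketch of what this adaptation loop could look like in code; this is my reconstruction under assumptions (diagonal speaker covariances, fixed corrupting-speaker covariance, placeholder names and shapes), not the actual implementation from the work:

```python
import numpy as np
from scipy.stats import multivariate_normal

def vts_overlap_model(main_means, main_vars, weights, mu_x2, var_x2,
                      C, C_pinv, frames, n_iter=5):
    """Adapt a main-speaker GMM (diagonal covariances) with a single
    corrupting-speaker Gaussian (mu_x2, var_x2); re-estimate mu_x2 by EM."""
    D = C.shape[0]
    for _ in range(n_iter):
        mus_y, covs_y, Fs = [], [], []
        for mu_m, var_m in zip(main_means, main_vars):
            z = C_pinv @ (mu_x2 - mu_m)
            G = C @ np.diag(1.0 / (1.0 + np.exp(z))) @ C_pinv    # dy/dx1
            F = np.eye(D) - G                                    # dy/dx2
            mus_y.append(mu_m + C @ np.log1p(np.exp(z)))         # mean + nonlinear term
            covs_y.append(G @ np.diag(var_m) @ G.T + F @ np.diag(var_x2) @ F.T)
            Fs.append(F)
        # E-step: posteriors gamma[t, m] of each overlap component for each frame.
        ll = np.stack([np.log(w) + multivariate_normal.logpdf(frames, mu, cov)
                       for w, mu, cov in zip(weights, mus_y, covs_y)], axis=1)
        gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form update of the corrupting-speaker mean only
        # (its covariance is kept fixed, as mentioned in the talk).
        A, b = np.zeros((D, D)), np.zeros(D)
        for m, (mu_y, cov_y, F) in enumerate(zip(mus_y, covs_y, Fs)):
            P = np.linalg.inv(cov_y)
            A += gamma[:, m].sum() * F.T @ P @ F
            b += F.T @ P @ (gamma[:, m] @ (frames - mu_y))
        mu_x2 = mu_x2 + np.linalg.solve(A, b)
    return mus_y, covs_y, mu_x2
```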
So, the overlapping speech detection system. As input it takes the meeting audio recordings, and the recordings are first cut into speech segments, which we get using a speech activity detection system. Then one major task is to obtain the initial speaker models for the main speaker and the corrupting speakers; the question is how to get them. There are two options: either we use the oracle speaker segmentation, or we take them from the diarization output. The latter is a much more challenging task, because when you take the speaker alignments from the diarization output, you don't know how many speakers there actually were in your audio, so you might get more than the actual number of speakers in the diarization output. The output which we finally want is the detection of overlaps.
Now, given the audio recording, this blue box shows a speech segment given by the speech activity detection. We move a sliding analysis window over it. For each analysis window we can have on the order of N-squared hypotheses: in an overlap there would be two speakers overlapping, so if you have N speakers, then the total number of overlapping-speech models is N squared minus N, and these N here are the single-speaker models, when only one speaker is speaking. This is a huge number, so what we do is that for each speech segment we first determine the main speaker, and then we compute the overlapping-speech models where that main speaker is being corrupted by some other speaker. Finally we have the overlap models where speaker i is being corrupted by speaker j, and the single-speaker models where speaker i is speaking alone. We compare all these likelihood ratios to determine whether we have overlapping speech or single-speaker speech.
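A rough sketch of that decision logic, just to illustrate the hypothesis comparison; the scoring functions are assumed to be given (e.g. built with the VTS adaptation sketched earlier), and the threshold is a placeholder:

```python
import numpy as np

def detect_overlap(windows, single_score, overlap_score, threshold=0.0):
    """windows: list of feature arrays, one per analysis window.
    single_score[i](x): average log-likelihood of window x under speaker i.
    overlap_score[i][j](x): the same under the VTS overlap model
    'speaker i corrupted by speaker j'. Returns one label per window."""
    n_spk = len(single_score)
    labels = []
    for x in windows:
        single = [s(x) for s in single_score]
        i = int(np.argmax(single))                           # most likely main speaker
        best_overlap = max(overlap_score[i][j](x)
                           for j in range(n_spk) if j != i)  # best corrupting speaker
        llr = best_overlap - single[i]                       # log-likelihood ratio
        labels.append("overlap" if llr > threshold else "single")
    return labels
```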
so
up to hear that was the standard but it is likely that bloat now be
moved to the multiclass but it is really the algorithm so you would have seen
that in the standard vts we used only one simple gaussian distribution for the noise
but there sometimes and might be good in the case when we are dealing with
noise but in case of overlapping speech
the other cup are the expert without collecting speaker he himself if the human being
in and said so
it's not like a noisy might be he might i don't multiple phonemes in that
window
so we want to prevent him using more data
or more a better modeling
So what we propose is that instead of having one single Gaussian here, we assume that all the Gaussians in the GMM of x2 may also be present. So now we are going to apply the vector Taylor series to this combination of two GMMs, with this GMM for the corrupting speaker. What we do here is that we start with the assumption that each of these Gaussians might have been hit in that analysis window. Then for each of the Gaussians we compute a gamma value, which is the average number of frames assigned to that Gaussian component in that analysis window. If this gamma value happens to be lower than a threshold, then we cluster it with its nearest Gaussian component. So we get this kind of clustering, and then we say that the Gaussian which has the highest gamma in the cluster becomes the cluster centroid: all these components here will be adapted through that one single Gaussian, so all these Gaussians are now updated via the cluster centroid of this cluster.
We can make that assumption because all these Gaussian mixture models have been derived from the same UBM. So in the overlapping speech y, the Gaussian here would be computed using the Gaussian here plus a contribution from the corrupting speaker, from this component. If you set the threshold to zero, that is, you don't want to set any threshold, then there will be no clustering, and each Gaussian will go into a one-to-one combination to give you the overlapping speech.
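As one possible reading of that clustering step in code (hypothetical names; the merge order and the Euclidean distance between means are my assumptions, not specified in the talk):

```python
import numpy as np

def cluster_components(gamma_counts, means, gamma_th):
    """gamma_counts[k]: average number of frames assigned to component k of the
    corrupting speaker's GMM in the analysis window. means: (K, D) array.
    Components below gamma_th are merged with their nearest component; the
    member with the highest count in a cluster becomes its centroid."""
    K = len(gamma_counts)
    cluster_of = list(range(K))               # start with one cluster per Gaussian
    for k in np.argsort(gamma_counts):        # merge the weakest components first
        if gamma_counts[k] >= gamma_th:
            continue
        d = np.linalg.norm(means - means[k], axis=1)   # distance to other components
        d[k] = np.inf
        cluster_of[k] = cluster_of[int(np.argmin(d))]
    # centroid of each cluster = member with the highest gamma count
    centroids = {}
    for c in set(cluster_of):
        members = [k for k in range(K) if cluster_of[k] == c]
        centroids[c] = members[int(np.argmax([gamma_counts[k] for k in members]))]
    return cluster_of, centroids
```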
The equations for the mean update in the case of multiclass VTS, which we show here, are the same as in the previous case; the only difference is that now you have a subscript c, which denotes the cluster. For each cluster you have a different centroid, and that centroid is updated using this equation. As I showed on the previous slide, there the mean was computed using all the Gaussian components, but now this equation only takes into account the Gaussians which are in the cluster c. Similarly, all the other equations are identical; the only difference is that instead of having the single-Gaussian representation for x2, we now do it cluster-wise, so you have a subscript c everywhere. So that's the multiclass vector Taylor series algorithm framework.
Now coming to the experiments. The dataset we have used is the AMI meeting corpus. The meetings are of the kind where a group of three or four people are trying to design a remote control or something, so they are discussing, arguing, debating. The duration varies from seventeen to fifty-seven minutes. The audio which we take is from a single distant microphone, which is the most difficult condition. We use MFCC features, and for the single-speaker models we use MAP-adapted GMMs.
Now the error metric: it is called the overlap detection error, which is the false alarm time plus the missed time, divided by the labelled speaker overlap time. One thing to note is that the false alarms come from the regions where only a single speaker is speaking, and those regions are much larger than the overlapping speech, so this whole expression can take values over a hundred percent.
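Written out, the metric is simply:

```latex
\text{overlap detection error} \;=\;
  \frac{T_{\text{false alarm}} + T_{\text{miss}}}{T_{\text{labelled overlap}}} \times 100\%
```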
The first experiment which we did was with the standard VTS, where we have only one Gaussian representation for the corrupting speaker. We wanted to determine which analysis window size works best. We found that when going to a window size of three point two seconds, the error rate was lower compared to smaller values like these. Above this, the error rate does not decrease that much; instead the computation time increases a lot, because then you are applying the same computational burden to a larger window. So in the next experiments we are going to use this window size. These are the curves for the previous table, the recall-precision curves we obtained, and the top one is with the window size of three point two seconds.
Now the results for the multiclass VTS. In the standard VTS, the overlap detection error rate was ninety-six point two percent. When we use the multiclass VTS, it improves by an absolute value of sixteen percent. These four experiments were to determine what the optimal value for the clustering threshold should be. So in a window of three point two seconds we have three hundred and twenty frames, and if we set a threshold of five frames for each Gaussian, then these values here denote how much clustering happened: we start from sixty-four clusters in the beginning, and with a threshold of five we end up with many fewer clusters.
We found that the best results were when we used a threshold of one frame. In that case the overlap detection error reduces to eighty percent, which is quite good. And the final number of clusters that we get is twenty-four point seven on average: beginning with sixty-four, we end up having twenty-four point seven clusters.
so
as i said we have like to different kind of options for modeling the speaker
one likely model the speaker from the oracle or one
we are modeled the speakers on the data is not bored
so in case of articles the speaker models are ready purely to begin with so
that's why
the results that are quite good
but when we start with the database an output
we don't we might get a seven speaker target speakers
when there are actually only for speaker so
it's a set of problems given that is
the added it is ninety three point three percent
which is it better than the standard vts approach
so these are the kernel sorta previous table
but i that if using but that a vision system so that efficient system works
in a totally unsupervised manner and the final goal we have it is to make
this by data back end we want to so
improve it
up to this point which is by the articles
so we are trying to reduce this gap
comparing to the other words
so the mfcc a gmm system which is which was proposed by bouquet
it works with a ninety two point four percent
or whatever it takes another the state-of-the-art which using l s d m o'clock set
seventy six point nine percent
the best of that we have in this work at eighty percent
but then there's of using the or tickets
a completely unsupervised the system works at and add an error tradeoff ninety three point
two people think
So, coming to the conclusions: we have proposed a new approach for modeling overlapping speech, and we extended the vector Taylor series framework to the multiclass VTS system. We analyzed the analysis window size and found that a window of three point two seconds works better, and we reported the precision and recall that we were able to reach. One thing to note here is that in the LSTM approach they had very good precision, but in our case the recall is much better. The future work which we want to do includes the covariance adaptation and delta features, and in the case of the diarization output, we want to use better speaker models.
After that, we also extended this work for an Interspeech submission, where we used the diarization output, and we were able to improve these numbers: this one down to seventy-eight, and this ninety-three point three down to eighty-nine. But still, going from eighty-nine towards seventy-six, we have more work to do. So although we cannot say that it is working on par with the state-of-the-art system, we think that this is a very promising approach, and it can maybe be used for some other kinds of problems, for example if you want to model speech corrupted with noise, but noise which is much more complex. With that, thank you.
So, I'm having problems understanding: when you go from ninety-six to ninety-three percent error, that's a big improvement, and I'm not questioning that. What might help, I guess, is if at some point you could do a test like: what is the performance that you think is necessary for a usable system? You said seventy-six is kind of the state of the art. Has anyone done any test where maybe you take clean data that doesn't have any overlap at all, and insert certain controlled amounts of overlap, where you can run that performance metric and decide whether humans, or the subsequent diarization system, find it acceptable when it hits, you know, an error rate of fifty percent? I'm not sure what number you actually have to hit before you can say it's a viable solution, because going from ninety-six to ninety-three, the numbers just seem too high to make it practically usable.
Okay, so for the first point: I'm not aware of any work where they have artificially created or inserted overlaps in the audio. But the main purpose of doing all this is to improve the speaker diarization system, so we want to know at what value of this error diarization finally improves. The state of the art using the LSTM had an error of seventy-six point nine, but I think in that paper they did not give the diarization error rate which they achieved using that system. For our system, we have a paper at Interspeech where we also present the effect of this overlap detection on diarization, in the case when we have eighty-nine percent error: this value of ninety-three point three, we have meanwhile been able to reduce it to eighty-nine percent, and when we use that system for diarization we get a marginal improvement over the baseline. So I hope that when someone gets the overlap detection error rate below eighty, it would give quite a significant improvement in diarization as well.
Sure, I have two questions. Have you considered modeling more than two speakers at once? And the second question: how do you define who is the main speaker and who is the corrupting speaker?
So, about the first question, overlaps of more than two speakers: we can make that assumption to keep the number of models low, and we have also looked at the statistics. I don't remember the exact values, but unless people are laughing together or having a very uncontrolled meeting or discussion, they don't tend to speak three or four all together. Otherwise the overlaps tend to be like this: one speaker is speaking, then some other speaker starts speaking, and at that moment you have an overlap of two speakers. Also, with this formulation of VTS, at this moment we cannot extend it to three speakers, because in the formulation we are assuming one additive noise source. And can you repeat the second question? Sorry. Okay.
So for the main speaker: we have speaker models for all the speakers, so we directly use them to find out which one gives the most likely score for that analysis window, and we use that to determine the main speaker.
I'm just wondering about the inter-annotator agreement on this task; it seems to be a very difficult task, even for humans. So are all those numbers in the range of the inter-annotator agreement? I mean, do you have any idea on this point?
The annotation which we have comes from ICSI, and I have gone through it; it's quite accurate, even the overlaps of more than two speakers have been annotated. But I'm not sure about the inter-annotator agreement.