Hi everyone, I'm from LIMSI in France. This is joint work with all these people, and you might know Claude Barras, the last author; he says hi. So I'm going to talk about this notion of person instance graphs for named speaker identification in TV broadcast.
Here is the outline of my talk. First I'm going to give you a bit of context. Then I'm going to discuss this notion of person instance graphs and how we can build them, then how we can mine those graphs to do speaker identification in TV shows, present some experimental results, and conclude.
About the context: we were working in the framework of the French challenge called REPERE. We were given TV shows like this one, for instance; they were talk shows and TV news, and we were asked to answer automatically these two questions: who speaks when, and who appears when, in this form. So we really need to do speaker diarization, then try to identify each speech turn separately and provide normalized names. It was very important to give the exact form of the name, like "Nicolas Sarkozy" for instance. But here I'm only going to focus on the "who speaks when" task.
There are multiple sources of information to answer those questions. Obviously we can use the audio stream to do speaker diarization and identification. We can also process the speech to get a transcription from it. We can use the visual stream to do face clustering and recognition, and we can also try to get some names from the OCR. So there are these two text streams coming from ASR and OCR, on which we can do named entity detection and try to propagate the detected names to the speaker clusters, for instance. Here I'm not going to use the visual information, since this is a speaker session.
So there are two ways of recognizing people in this kind of video: the unsupervised way and the supervised way. On the left part, in green, I show how we can do that in the unsupervised fashion, meaning that we are not allowed to use prior biometric models to recognize the speaker. This is usually done like this: we first transcribe the speech and try to extract names from the speech transcript; in parallel we do speaker diarization, and then we try to propagate the names detected in the speech transcript to the speaker clusters, to try to name them. That's what we call named speaker identification, and it is fully unsupervised in terms of biometric models. On the other side, when we have training data for a speaker, we can for instance build an i-vector speaker ID system and use it to do acoustic-based speaker identification. And we could also try to fuse those two into one unified framework; that's what this talk is about, trying to do all of that in one unified framework.
This framework is what I call the person instance graph, and I'm going to describe it as well as I can so that you get an idea of how it is built. Starting from the speech signal, we apply the automatic speech-to-text system from the company Vocapia Research. It provides both the speech transcription (these are the black dots here, and here you have a zoom on one particular speech turn) and the segmentation into speech turns. In the rest of my talk, speech turns will be denoted by t, like turn. For instance, in this short audio example there are five speech turns, denoted t1 to t5. Those are the first nodes of this person instance graph.
On top of this speech transcript, we can try to do spoken name detection. To do that, we use conditional random fields, based on the Wapiti implementation of CRFs. We trained two different classes of models: some of them were trained to detect only parts of names, like first name, last name, or titles, and others were trained to detect complete names at once. So there is a bunch of models that we trained, and their outputs were combined using yet another CRF model, using the outputs of these models as features.
What we get from this model is the names detected in the text stream. Here, for instance, five spoken names were detected, and they are connected in the graph to a canonical representation of the person: here the name "Nicolas Sarkozy" was detected, and it is connected to yet another node in this graph which represents the identity Nicolas Sarkozy. In the rest of the talk, spoken names will be denoted s, and the identity vertices in this graph are denoted i. So here, for instance, there are four identity nodes and five spoken names in this graph.
So what can we do with those detected names? We want to propagate the spoken names to the neighboring speech turns; we want to use them to identify the speakers in the conversation. There are many ways of estimating the probability that a spoken name s is actually the identity of a speech turn t. In the literature, people at first used hand-made rules based on the context of the pronounced name in the speech transcript; other people used contextual n-grams, and even more recently semantic classification trees. We chose to use contextual n-grams here. Let me show you an example: if in the speech transcript someone says "thank you, Nicolas Sarkozy", then it is very likely that the previous speech turn is actually Nicolas Sarkozy. That's basically what this says here: there is an 88% chance that the spoken name s is actually the identity of the previous speech turn t1. That's how we are able to connect spoken names to speech turns in the graph, and these edges are weighted by those probabilities.
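To make this concrete, here is a minimal Python sketch of how contextual n-grams around a detected spoken name could be mapped to propagation probabilities. The patterns, probabilities, and function names are illustrative assumptions; in the actual system these values are learned from the REPERE training data.

```python
# Illustrative sketch of contextual n-gram based name propagation.
NGRAM_PATTERNS = {
    # (left context, right context) around the spoken name -> P(target speech turn)
    ("thank you", ""): {"previous_turn": 0.88},        # "thank you, <name>" -> previous turn
    ("", "the floor is yours"): {"next_turn": 0.75},   # hypothetical pattern
}

def propagation_probabilities(left_context, right_context):
    """Return P(spoken name s is the identity of a nearby speech turn t)."""
    return NGRAM_PATTERNS.get((left_context, right_context), {})

# Example from the talk: "... thank you, Nicolas Sarkozy ..."
print(propagation_probabilities("thank you", ""))  # {'previous_turn': 0.88}
```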
This is good, but this way we can only propagate the names to the neighboring speech turns. So what can we do next? We can also compute some kind of similarity between all the speech turns. Here we simply use the Bayesian information criterion (BIC), based on MFCC features for each speech turn. Here, for instance, you have the inter-speaker distribution of the BIC similarity measure, and in green the intra-speaker distribution, on our REPERE dataset. Based on those two distributions, we can estimate some kind of probability that two speech turns t and t' are the same speaker. That's how we connect all the speech turns in the graph.
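Here is a minimal sketch of one way to turn a BIC score into a same-speaker probability using the two distributions mentioned above. Fitting Gaussians to the scores and using a flat prior are my assumptions for illustration; the talk only says that the intra- and inter-speaker distributions are used.

```python
import numpy as np
from scipy.stats import norm

def fit_score_distributions(intra_scores, inter_scores):
    """Fit simple Gaussians to intra- and inter-speaker BIC scores (assumption)."""
    return (norm(np.mean(intra_scores), np.std(intra_scores)),
            norm(np.mean(inter_scores), np.std(inter_scores)))

def p_same_speaker(score, intra_dist, inter_dist, prior_same=0.5):
    """P(two speech turns share a speaker | BIC score), via Bayes' rule."""
    same = prior_same * intra_dist.pdf(score)
    diff = (1.0 - prior_same) * inter_dist.pdf(score)
    return same / (same + diff)

# Usage with made-up BIC scores:
intra, inter = fit_score_distributions([2.1, 1.8, 2.5], [-1.0, -0.4, 0.2])
print(p_same_speaker(1.5, intra, inter))
```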
At this point we have this big graph. Let me just focus on the notation here: V is the set of vertices of this graph, and there are three types of vertices: speech turns t, spoken names s, and identity vertices i. This graph is not necessarily complete; for instance, this identity vertex may not be connected to this speech turn. So it is an incomplete graph, and we denote by p the weights given to the edges: p(v, v') is the probability that the two vertices v and v' are actually the same person, the same identity.
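As a small illustration, here is a minimal sketch of such a person instance graph as a plain data structure. The node names and edge weights are hypothetical, just to show the three vertex types and the probability-weighted edges.

```python
# Minimal sketch of a person instance graph: speech turns t, spoken names s,
# identity vertices i, and probability-weighted edges p(v, v').
person_instance_graph = {
    "vertices": {
        "t1": "speech_turn", "t2": "speech_turn",
        "s1": "spoken_name",   # e.g. "Nicolas Sarkozy" pronounced during t2
        "i1": "identity",      # canonical identity node for Nicolas Sarkozy
    },
    "edges": {
        ("s1", "t1"): 0.88,    # contextual n-grams: name identifies the previous turn
        ("t1", "t2"): 0.35,    # BIC-based speech turn similarity
        ("s1", "i1"): 1.00,    # spoken name attached to its canonical identity
    },
}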
Now that we have this graph, what do we want to achieve? We want to mine those graphs to finally get our answer, that is, to give an identity to each of these speech turns. You see in this example (this is the reference here) that it is nearly impossible to name speaker A, because the name of this guy is never even pronounced in the TV show; but by chance we may have a biometric model for him. So, and this is a very messy slide, depending on how many edges we put in this graph we can address different tasks. For instance, if we only connect spoken names to speech turns, we are just able to identify the addressee of each speech turn, so only neighboring speech turns can be identified. But if we add the speech-turn-to-speech-turn edges, we are able to propagate the names to all the speech turns. And if by chance we have biometric models for some of these people, then, using an i-vector system for instance, we are able to connect each speech turn to the biometric models and estimate some kind of probability that they are the same person. That part is completely supervised speaker identification, this part is completely unsupervised, and we can try to put all these edges into one big graph to do unsupervised and supervised speaker identification jointly.
So how can we mine these graphs? The objective is always the same: to give each vertex in this graph its correct identity. This can actually be modeled as a clustering problem, where we want to group all instances, all vertices in the graph, corresponding to the same person with their actual identity. Here is what we would expect from a perfect system on this graph: we would like to put in the same cluster the speech turns by speaker C and all the times his name is pronounced. And for speaker A in my first example, even though we don't have an identity node for A in the graph, we would like to be able to cluster his speech turns anyway. Also, some spoken names are useless for identification, because they refer to someone who is just being talked about and is not actually present in the TV show.
To do that, we define a set of functions that we call clustering functions: a function delta associates to each pair of vertices (v, v') in this graph the value 1 if they are in the same cluster, and 0 otherwise. The thing is, not every function defined like that actually encodes a valid clustering; we need to add some constraints on these functions. For instance, v must be in the same cluster as itself, there are symmetry constraints, and also transitivity constraints: if v and v' are in the same cluster and v' and v'' are in the same cluster, then v and v'' must be in the same cluster. This defines a search space Δ(V) on the set of vertices, and we need to look for the best clustering function delta, the one that best clusters all our data.
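Written out, the constraints just described are the following (my transcription of the spoken description, with the transitivity constraint in the linearized form typically used in ILP):

```latex
\delta(v, v) = 1                                          % reflexivity
\delta(v, v') = \delta(v', v)                             % symmetry
\delta(v, v') + \delta(v', v'') - 1 \le \delta(v, v'')    % transitivity (linearized)
```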
To do that, we use integer linear programming, and we want to maximize this objective function. Basically, a good clustering would group similar data, data with a high probability of being the same person, into the same cluster, and separate vertices with low similarity into different clusters; that's what this objective function does. It is simply normalized by the number of edges in the graph, and we have this parameter alpha that can be tuned to balance between intra-cluster similarity and inter-cluster dissimilarity.
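Based on this description, the objective being maximized has roughly the following form (a hedged reconstruction from the talk, with E the set of edges and p(v, v') the edge probabilities):

```latex
\max_{\delta \in \Delta(V)} \;
\frac{1}{|E|} \sum_{(v, v') \in E}
\Big[ \alpha \, \delta(v, v') \, p(v, v')
      + (1 - \alpha) \, \big(1 - \delta(v, v')\big) \, \big(1 - p(v, v')\big) \Big]
```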
We also add additional constraints: for instance, every speech turn in the graph can have at most one identity (well, it depends, but usually you have only one identity), and we also force spoken names to be in the same cluster as their identity.
The thing with this formulation is that, as you can see, we sum over all the edges of the graph, and the problem is that there are many more speech-turn-to-speech-turn edges than there are speech-turn-to-spoken-name edges. So I divided this objective function into sub-objective functions; it is basically exactly the same, except that we give a weight to every type of edge. This way we can give more weight, for instance, to the speech-turn-to-spoken-name edges in the graph. This gives a set of hyperparameters that we need to optimize, the betas and alpha, and they are optimized using a random search in the alpha-beta space.
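Here is a minimal sketch of that random search over alpha and the per-edge-type weights beta. The edge type names and the `cluster_and_score` helper (which would build the weighted ILP objective, solve it on the development set, and return the identification error rate) are hypothetical placeholders.

```python
import random

EDGE_TYPES = ["turn_turn", "turn_name", "turn_identity"]

def random_search(cluster_and_score, n_trials=100, seed=0):
    """Randomly sample (alpha, beta) and keep the configuration with the lowest error."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, 1.0)
        beta = {edge_type: rng.uniform(0.0, 1.0) for edge_type in EDGE_TYPES}
        ier = cluster_and_score(alpha, beta)  # hypothetical helper
        if best is None or ier < best[0]:
            best = (ier, alpha, beta)
    return best  # (best error rate, best alpha, best betas)
```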
How much more time do I have?
So I'm coming to the experimental results. This is the corpus that we were given by the organizers of the REPERE challenge. The corpus is divided into seven types of shows, such as TV news and talk shows. The training set is made of 28 hours fully annotated in terms of speakers, speech transcript, and spoken names. We are also given visual annotations, which are not relevant here, but for instance one frame every ten seconds is annotated, so we know exactly who appears in that frame. This training set is used to estimate the probabilities between speech turns, to train the i-vector system, and to train the speech-turn-to-spoken-name propagation probabilities. We used the development set, nine hours, to estimate the hyperparameters alpha and beta, and we used the test set for evaluation. The evaluation metric is basically an identification error rate: the total duration of wrongly identified speech, plus missed detections and false alarms, divided by the total duration of speech in the reference. This can go higher than one if we produce lots of false alarms, for instance.
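In equation form, the metric just described is (my transcription of the spoken definition):

```latex
\text{IER} =
\frac{\mathrm{dur}(\text{confusion}) + \mathrm{dur}(\text{miss}) + \mathrm{dur}(\text{false alarm})}
     {\mathrm{dur}(\text{reference speech})}
```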
Here is the big table of results; I'm going to focus on a few selected points. In configuration B, where we are completely unsupervised, we can see that an oracle able to name someone as soon as their name is pronounced anywhere in the audio stream would still only get 56% recall. We get 29% here using these graphs, so there is a long way to go to reach perfect results. When we combine the whole thing, the same kind of oracle would get a 14% identification error rate; this oracle is able to recognize someone as soon as either there is a biometric model for them or their name is pronounced in the speech transcript. So again, there is a long way to go to get perfect results. I'm just going to focus on the interesting results now, I mean the ones that actually worked. This one is a better result, but I'm going to skip it as well. By adding the red edges in the graph, so going from A to B, we were able to increase the recall; that was expected, because we are now able to propagate the names to all the speech turns. But what's interesting is that we also increased the precision, which wasn't what I expected at first when I did this work.
What's also interesting is that we can combine those two approaches: named speaker identification, which is completely unsupervised, with standard i-vector acoustic speaker identification. We are able to get a 10% absolute improvement compared to the i-vector system alone. And it works both for precision, so we are able to increase the precision of an i-vector system using those spoken names, and obviously for recall, because there are some people for whom we don't have a biometric model, so we can use the spoken names to improve the identification.
I also wanted to stress this point: we also have results based on fully manual spoken name detection, and it turns out that even though our name detection system has a slot error rate of around 35%, performance does not actually degrade when we go from manual name detection to fully automatic name detection. This is an interesting result: we are robust to this kind of error, maybe because spoken names are often repeated multiple times in the video, so we manage to catch at least one of them.
This is just a representation of the weights beta that we automatically obtained through hyperparameter tuning. When we only use configuration B, so completely unsupervised, it actually gives more weight to the speech-turn-to-spoken-name edges than to the edges between two speech turns. And when we use the full graph, it gives the same weight to the i-vector edges and the speech-turn-to-spoken-name edges.
So, to conclude: we got this 10% absolute improvement over the i-vector system using spoken names. This is kind of cheating, because we are using more information, but it can be improved even more if we add, for instance, written names; experiments that we did gave another 15% increase in performance. And there are still a lot of errors that we need to address. Thank you very much.
Just a quick advertisement for this corpus, which may be of interest for those of you doing speaker diarization as well.
And I have the first question: you are not using any a priori knowledge on the distribution of speakers in a conversation or in the media file. Could you comment on that, and do you think there is some information to gain there?
That's the next step, actually. We plan to modify this objective function to take the structure of the show into account. For instance, we could add here a term that takes into account the prior probability that, when a speaker speaks at time t, there is a high chance that we can hear him again thirty seconds later. This is not at all taken into account for now, but we really need to add this prior information about the structure.
I totally agree, but did you mean just prior knowledge on the presence of the speaker, or something else?
This is planned; we are going to add some extra terms here to enforce some kind of structure.
Okay, thanks. Could you also give us a picture of the results of the evaluation campaign? You said this was done in the context of an evaluation, so it would be nice to have an idea of the differences between the different participants. Were you close to the best? Did you see some differences?
The main difference was on the "who appears when" task; on speaker ID we were more or less the same, with the same results. But actually, what gives the most information in terms of identities is the names that are written on screen: usually it is really easy to propagate them to the current speaker, and there is around a 15% improvement in terms of performance when we use the visual stream.
No, basically the segmentation used for this is based on Gaussian divergence, followed by some kind of linear clustering. And no, it is not oracle. So within the 35%, there is, I think, 5 to 10% coming from speech activity detection and segmentation errors.