0:00:15 | Hi everyone, I'm Hervé Bredin from LIMSI in France.
0:00:20 | This is joint work with all these people, and you might know Claude Barras, the last author.
0:00:25 | He says hi, if you know him.
0:00:30 | So I'm going to talk about this notion of person instance graphs for named speaker identification in TV broadcast.
0:00:37 | This is the outline of my talk.
0:00:39 | First, I'm going to give you a bit of context.
0:00:43 | Then I'm going to discuss this notion of person instance graphs and how we can build them,
0:00:49 | and then how we can mine those graphs to do speaker identification in TV shows, present some experimental results, and then conclude my talk.
0:01:01 | So, about the context: we were working in the framework of this French challenge called REPERE.
0:01:10 | We were given TV shows like this one, for instance. They were talk shows and TV news, and we were asked to answer two questions automatically: who speaks when,
0:01:23 | and who appears when?
0:01:29 | In this form. So we really need to do speaker diarization, and then try to identify each speech turn separately,
0:01:36 | and provide normalized names.
0:01:42 | It was very important to give the exact form of the name, like "Nicolas Sarkozy" here, for instance.
0:01:50 | I'm only going to focus on the "who speaks when" task here.
0:01:54 | So there are multiple sources of information we can use to answer those questions.
0:02:00 | Obviously, we can use the audio stream to do speaker diarization and identification, and we can also process the speech to get a transcription from it.
0:02:10 | We can obviously use the visual stream to do face clustering and recognition, and we can try to get some names also from the OCR here.
0:02:18 | And so there are those two text streams, coming from ASR and OCR; we can do named entity detection on them and try to propagate the names to the speaker clusters, for instance.
0:02:33 | Here I'm not going to use the visual information, because this is a speaker workshop.
0:02:40 | Okay.
0:02:41 | So there are two ways of recognizing people in this kind of video: the unsupervised way and the supervised way.
0:02:47 | On the left part, in green, I show how we can do it in the unsupervised fashion. That means we are not allowed to use prior biometric models to recognize the speaker.
0:03:00 | This is usually done like this: we first transcribe the speech and try to extract names from the speech transcript.
0:03:10 | In parallel, we do speaker diarization, and then we try to propagate the names that were detected in the speech transcript to the speaker clusters, to try to name them.
0:03:21 | That's what we call named speaker identification, so this is fully unsupervised in terms of biometric models.
0:03:29 | On the other side, obviously, when we have training data available, we can for instance build an i-vector speaker ID system and use it to do acoustic-based speaker identification.
0:03:44 | And we could also try to fuse those two into one unified framework. That's what this talk is about: trying to do all of that in one unified framework.
0:03:59 | Okay.
0:04:00 | So this framework is actually what I call the person instance graph. I'm going to describe it as well as I can, so that you get an idea of how it's built.
0:04:15 | Starting from the speech signal, we apply the speech-to-text system from the company Vocapia Research.
0:04:26 | It provides both the speech transcription (these are the black dots here, and here you have a zoom on one particular speech turn) and the segmentation into speech turns.
0:04:40 | In the rest of my talk, speech turns will be denoted by t, like "turn".
0:04:47 | For instance, in this video (in this audio, rather, since we don't use the video here) there are five speech turns, denoted t1 to t5.
0:04:56 | Those are the first nodes of my graph, of this person instance graph.
0:05:03 | On top of this speech transcript, we can try to do spoken name detection.
0:05:09 | To do that, we use conditional random fields, based on the Wapiti implementation of CRFs.
0:05:17 | We trained two different classes of models: some of them were trained to only detect parts of names, like first names, last names, and titles,
0:05:26 | and the others were trained to detect complete names at once.
0:05:32 | So there is a bunch of models that we trained here, and their outputs were combined using yet another CRF model, which uses the outputs of these models as features.
0:05:48 | So what we get from this model is the names detected in the text stream.
0:05:55 | Here, for instance, five spoken names were detected, and they are connected in this graph to a canonical representation of the person: here, Nicolas Sarkozy's name was detected,
0:06:11 | and it's connected to yet another node in this graph, which represents Nicolas Sarkozy.
0:06:21 | In the rest of the talk, spoken names are denoted s,
0:06:28 | and the identity vertices in this graph are denoted i.
0:06:36 | So here, for instance, there are four identity nodes and five spoken names in this graph.
0:06:44 | So what can we do with those detected names? We want to propagate those spoken names to the neighboring speech turns; we want to use them to identify the speakers in the conversation.
0:07:01 | There are many ways of estimating the probability that the spoken name s is actually the identity of the speech turn t.
0:07:08 | In the literature, people at first were using hand-made rules based on the context of the spoken name in the speech transcript.
0:07:18 | Other people used contextual n-grams, and, even more recently, semantic classification trees. We chose to use contextual n-grams here.
0:07:27 | Let me show you an example: if, in the speech transcript, someone says "thank you, Nicolas Sarkozy" for instance, then it's very likely that the previous speech turn is actually Nicolas Sarkozy. That's basically what this says here:
0:07:41 | there is an 88% chance that the spoken name s is actually the identity of the previous speech turn.
0:07:50 | That's how we are able to connect spoken names to speech turns in the graph, with edges weighted by these probabilities.
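As a rough sketch of how such propagation probabilities could be estimated from contextual n-grams on an annotated training set (the function name, the data layout, and the 0.88 figure are illustrative assumptions, not the exact recipe from the talk):

```python
from collections import defaultdict

def train_ngram_propagation(training_examples):
    """Estimate P(spoken name identifies the turn at `offset`) per n-gram.

    `training_examples` is assumed to be a list of tuples
    (context_ngram, offset, is_correct), where `offset` locates a speech
    turn relative to the one containing the spoken name (-1 = previous).
    """
    counts = defaultdict(lambda: [0, 0])  # (ngram, offset) -> [hits, total]
    for ngram, offset, is_correct in training_examples:
        stats = counts[(ngram, offset)]
        stats[0] += int(is_correct)
        stats[1] += 1
    # Relative frequencies, e.g. P(offset=-1 | "thank you <name>") ~ 0.88
    return {key: hits / total for key, (hits, total) in counts.items()}
```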
0:08:01 | That's good, but this way we can only propagate the names to the neighboring speech turns. So what can we do next? We can also compute some kind of similarity between all the speech turns.
0:08:13 | Here we simply use the Bayesian information criterion (BIC), based on MFCC features for each speech turn. Here, for instance, you have the inter-speaker distribution of the BIC similarity measure and, in green, the intra-speaker one, on our REPERE dataset.
0:08:35 | Based on those two distributions, we can estimate some kind of probability that two speech turns t and t' are the same speaker.
0:08:43 | That's how we connect all the speech turns in the graph.
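For the curious, here is a minimal sketch of a ΔBIC comparison between two speech turns, assuming full-covariance Gaussian models over MFCC frames; the penalty weight `lam` is a tunable assumption, and the mapping from ΔBIC to a same-speaker probability (via the two distributions above) is left out:

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """ΔBIC between two speech turns given as (n_frames, n_mfcc) arrays.

    Larger values mean more evidence that the two turns come from
    different speakers.
    """
    n1, n2, d = len(x1), len(x2), x1.shape[1]

    def logdet_cov(x):
        # log-determinant of the sample covariance of the frames
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]

    pooled = np.vstack([x1, x2])
    delta = (0.5 * (n1 + n2) * logdet_cov(pooled)
             - 0.5 * n1 * logdet_cov(x1)
             - 0.5 * n2 * logdet_cov(x2))
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return delta - lam * penalty
```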
0:08:48 | So at this point, we have this big graph here.
0:08:54 | I'm just going to focus on the notation here. V is the set of vertices of this graph, and there are three types of vertices: speech turns t,
0:09:03 | spoken names s,
0:09:04 | and identity vertices i.
0:09:07 | And this graph is not necessarily complete:
0:09:13 | this identity vertex may not be connected to this speech turn, for instance. So this is an incomplete graph,
0:09:23 | and we denote by p the weights given to the edges: p(v, v') is the probability that the two vertices v and v' are the same person, the same identity.
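To make the structure concrete, a toy person instance graph might look like this (networkx is my choice here; the node names and probabilities are made up):

```python
import networkx as nx

G = nx.Graph()

# three vertex types: speech turns t, spoken names s, identities i
G.add_nodes_from(["t1", "t2", "t3"], kind="speech_turn")
G.add_node("s1", kind="spoken_name")
G.add_node("Nicolas_Sarkozy", kind="identity")

# p(v, v') = probability that both vertices are the same person
G.add_edge("s1", "t1", p=0.88)              # name propagated to previous turn
G.add_edge("t1", "t3", p=0.70)              # BIC-based turn/turn similarity
G.add_edge("s1", "Nicolas_Sarkozy", p=1.0)  # name to canonical identity
```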
0:09:39 | Now that we have this graph, what do we want to achieve? We want to mine those graphs to finally get our answer, that is, to give an identity to each of these speech turns.
0:09:51 | You see, in this example (this is the reference here), it's nearly impossible to get A, because the name of this guy A is never even pronounced in the TV show.
0:10:04 | But, by chance, we may have a biometric model for this guy.
0:10:11 | So, this is a very messy slide, but depending on how many edges we put in this graph, we can address different tasks.
0:10:21 | For instance, if we just connect spoken names to speech turns, we are only able to identify the addressee of each speech turn, so only the neighbors of a speech turn can be identified.
0:10:36 | But then, if we add those speech-turn-to-speech-turn edges, we are able to propagate the names to all the speech turns.
0:10:46 | And if, by chance, we have biometric models for these guys, then, using an i-vector system for instance, we are able to connect each speech turn to all the biometric models
0:11:00 | and estimate some kind of probability that those are the same person.
0:11:07 | So this is completely supervised speaker identification, and this is completely unsupervised, and we can try to use all these edges in this big graph to do unsupervised and supervised speaker identification jointly.
0:11:24 | So, how can we mine these graphs, then?
0:11:28 | Our objective is always the same: to try to give the correct identity to each vertex in this graph.
0:11:34 | This can actually be modeled as a clustering problem, where we want to group all instances, all vertices in the graph corresponding to the same person, together with the actual identity. So here is what we expect from a perfect system on this graph.
0:11:52 | We would like to put in the same cluster the speech turns by speaker C and all the times his name is pronounced, here, in the same graph.
0:12:03 | And, as with speaker A in my first example, even though we don't have an identity A in the graph, we want to be able to cluster his speech turns on their own, like that.
0:12:16 | And some spoken names are useless for identification, and we want that, because this is just someone we're talking about and not someone who is present in the TV show.
0:12:27 | To do that, we define a set of functions that we call clustering functions.
0:12:35 | A clustering function δ associates, with each pair of nodes (v, v') in this graph, the value 1 if they are in the same cluster and 0 otherwise.
0:12:45 | The thing is, not all functions defined like that actually encode a valid clustering. We need to add some other constraints to these functions, for instance
0:12:58 | reflexivity: a vertex must be in the same cluster as itself;
0:13:01 | symmetry constraints; and also transitivity constraints: if v and v' are in the same cluster, and v' and v'' are in the same cluster, then v and v'' must be in the same cluster.
0:13:11 | This defines a search space of valid clustering functions over the set of vertices.
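Written out, the constraints as I just described them would look like this (my own reconstruction of the slide's notation):

```latex
\delta : V \times V \to \{0, 1\}
% reflexivity
\forall v \in V:\quad \delta(v, v) = 1
% symmetry
\forall v, v' \in V:\quad \delta(v, v') = \delta(v', v)
% transitivity, written as a linear (ILP-friendly) constraint
\forall v, v', v'' \in V:\quad \delta(v, v') + \delta(v', v'') - \delta(v, v'') \le 1
```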
0:13:18 | And we need to look for the best clustering function δ, the one that best clusters all our data.
0:13:29 | To do that, we use integer linear programming, and we want to maximize this objective function.
0:13:36 | Basically, a good clustering would group similar data, that is, data with a high probability of being the same person, into the same cluster, and separate vertices with a low similarity into different clusters. That's what this objective function does,
0:13:58 | and it is just normalized by the number of edges in the graph.
0:14:02 | And we have this parameter α that can be tuned to balance between intra-cluster similarity and inter-cluster dissimilarity.
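In formula form, my reading of that objective (reconstructed from the description above rather than copied from the slide) is:

```latex
\max_{\delta}\; \frac{1}{|E|} \Bigg[
    \alpha \sum_{(v, v') \in E} \delta(v, v')\, p(v, v')
  + (1 - \alpha) \sum_{(v, v') \in E} \big(1 - \delta(v, v')\big)\big(1 - p(v, v')\big)
\Bigg]
```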
0:14:12 | And we also add additional constraints. For instance, every speech turn in the graph can have at most one identity.
0:14:23 | Well, it depends on who you are, but usually you have only one identity.
0:14:29 | And we also force spoken names to be in the same cluster as their identity.
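Putting the objective and the constraints together, a toy version of the ILP could be set up as below; PuLP is my choice of front end, and the graph, α, and all the names are illustrative, not from the paper.

```python
import itertools
import pulp

# toy person instance graph: same-person probabilities and vertex types
p = {("t1", "t2"): 0.7, ("t1", "s1"): 0.88, ("s1", "i1"): 1.0}
kind = {"t1": "turn", "t2": "turn", "s1": "name", "i1": "identity"}
V, E, alpha = sorted(kind), list(p), 0.5

prob = pulp.LpProblem("person_instance_graph", pulp.LpMaximize)
d = {(u, v): pulp.LpVariable(f"d_{u}_{v}", cat="Binary")
     for u, v in itertools.combinations(V, 2)}

def delta(u, v):
    # symmetry is enforced by storing one variable per unordered pair
    return d[(u, v)] if (u, v) in d else d[(v, u)]

# objective: group similar pairs, separate dissimilar ones
prob += ((alpha / len(E)) * pulp.lpSum(delta(u, v) * p[(u, v)] for u, v in E)
         + ((1 - alpha) / len(E)) * pulp.lpSum(
               (1 - delta(u, v)) * (1 - p[(u, v)]) for u, v in E))

# transitivity makes delta a valid clustering
for u, v, w in itertools.permutations(V, 3):
    prob += delta(u, v) + delta(v, w) - delta(u, w) <= 1

# each speech turn is clustered with at most one identity
identities = [v for v in V if kind[v] == "identity"]
for t in (v for v in V if kind[v] == "turn"):
    prob += pulp.lpSum(delta(t, i) for i in identities) <= 1

# spoken names must join the cluster of their identity
prob += delta("s1", "i1") == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
```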
0:14:39 | The thing is, with this formulation, you see that we sum over all the edges of this graph,
0:14:48 | and the problem is that there are many more speech-turn-to-speech-turn edges than there are speech-turn-to-spoken-name edges.
0:15:02 | So I divided this objective function into sub-objective functions. This is basically exactly the same, except that we give a weight to every type of edge.
0:15:17 | This way we can give more weight, for instance, to spoken-name-to-speech-turn edges in this graph.
0:15:24 | And this gives a set of hyper-parameters that we need to optimize: β and α.
0:15:40 | These are optimized using a random search in the (α, β) space.
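A bare-bones version of that random search might look like this (the evaluation callback, the sampling ranges, and the trial budget are assumptions):

```python
import random

def random_search(evaluate_dev_error, edge_types, n_trials=100, seed=0):
    """Sample (alpha, beta) at random and keep the best-scoring setting.

    `evaluate_dev_error` is assumed to solve the ILP with the given
    hyper-parameters and return the identification error rate on the
    development set (lower is better).
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, 1.0)
        beta = {etype: rng.uniform(0.0, 1.0) for etype in edge_types}
        error = evaluate_dev_error(alpha, beta)
        if best is None or error < best[0]:
            best = (error, alpha, beta)
    return best
```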
0:15:46 | How much more time do I have?
0:15:50 | So, I'm coming to the experimental results.
0:15:57 | This is the corpus that we were given by the organizers of the REPERE challenge.
0:16:04 | The corpus is divided into seven types of shows: there are TV news, talk shows, and so on.
0:16:12 | The training set is made of twenty-eight hours, fully annotated in terms of speakers, speech transcript, and spoken names.
0:16:22 | We were also given visual annotation, which is not relevant here, but for instance, for one frame every ten seconds, we know exactly who appears in that frame.
0:16:39 | This training set is used to estimate the probabilities between speech turns, to train the i-vector system, and to train the speech-turn-to-spoken-name propagation probabilities.
0:16:54 | We used the development set, nine hours, to estimate the hyper-parameters α and β.
0:17:02 | And we used the test set for evaluation. The metric is basically the identification error rate: the total duration that is wrongly identified, plus missed detections and false alarms, divided by the total duration of speech in the reference.
0:17:23 | So this can go higher than one if you produce lots of false alarms, for instance.
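As a formula, the metric as just described (in terms of durations, not counts) is:

```latex
\mathrm{IER} \;=\; \frac{d_{\text{confusion}} + d_{\text{miss}} + d_{\text{false alarm}}}
                        {d_{\text{total speech in reference}}}
```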
0:17:31 | So here is the big table of results; I'm going to focus on a few selected points.
0:17:38 | In this configuration B, where we are completely unsupervised,
0:17:46 | we can see that an oracle, one able to name someone as soon as their name is pronounced anywhere in the audio stream, can only reach 56% recall anyway.
0:18:01 | We get 29 here using this graph, so there is a long way to go to reach that perfect result.
0:18:11 | When we combine the whole thing, the same kind of oracle would get a 14% identification error rate.
0:18:22 | This oracle is able to recognize someone as soon as either there is a biometric model for them or their name is pronounced in the speech transcript.
0:18:31 | So there, also, there is a long way to go to get perfect results.
0:18:35 | But I'm just going to focus on the interesting results now, I mean the ones that actually worked.
0:18:46 | This one I'm going to skip as well.
0:18:51 | By adding the red edges in the graph, so going from A to B, we were able to increase the recall. That was expected, because we are now able to propagate the names to all the speech turns.
0:19:00 | But what's interesting is that we also increased the precision, which wasn't what I expected at first, when I did this work.
0:19:12 | What's also interesting is that we can combine those two approaches: named speaker identification, which is completely unsupervised, with standard i-vector acoustic speaker identification.
0:19:24 | And we are able to get a ten percent absolute improvement compared to the i-vector system.
0:19:32 | It works for precision, so we are able to increase the precision of an i-vector system using those spoken names,
0:19:39 | and obviously for recall, because there are some persons for whom we don't have biometric models, so we can use the spoken names to improve the identification.
0:19:54 | I also wanted to stress this point: we also have results based on fully manual spoken name detection.
0:20:03 | And it happens that, even though our name detection system has a slot error rate of around thirty-five percent, performance actually doesn't degrade when we go from manual name detection to fully automatic name detection.
0:20:19 | This is an interesting result: we are robust to this kind of error, maybe because spoken names are often repeated multiple times in the video, so we manage to catch at least one of them.
0:20:32 | This is just a representation of the weights β that we obtained automatically through hyper-parameter tuning.
0:20:43 | When we only use configuration B, so completely unsupervised, it actually gives more weight to speech-turn-to-spoken-name edges than to the edges between two speech turns.
0:20:57 | And when we use the full graph, it actually gives the same weight to the i-vector edges and to the speech-turn-to-spoken-name edges.
0:21:08 | So, to conclude.
0:21:11 | We got this ten percent absolute improvement over the i-vector system using spoken names. This is kind of cheating, because we are using more information,
0:21:21 | but it can be improved even more if we add, for instance, written names: experiments that we did with them gave another fifteen percent increase in performance.
0:21:32 | And there are still a lot of errors that we need to address. Thank you very much.
0:21:37 | Thank you.
0:21:42 | Just a quick advertisement for this corpus, which may be of interest for those of you doing speaker diarization as well.
0:22:03 | And I have the first question: you are not using any a priori knowledge on the distribution of speakers in a conversation or in the media file. Could you comment on that? Do you think there is some information to gain there?
0:22:20 | That's the next step, actually. We plan to modify this objective function to take the structure of the show into account. For instance, we could add here a term
0:22:36 | that takes into account the prior probability that, when a speaker speaks at time t, there is a high chance that we can hear them again thirty seconds later.
0:22:46 | So this is not at all taken into account for now, but we really need to add this prior information on the structure.
0:22:56 | I totally agree, but did you mean just prior knowledge on the presence of the speakers, or on... I don't know.
0:23:06 | This is planned: we're going to add some extra terms here to enforce some kind of structure.
0:23:13 | Okay, thanks. And could you also give us a picture of the results of the evaluation campaign?
0:23:21 | You said that this was the focus of the evaluation. It would be nice to have an idea of what the differences were between the different participants.
0:23:33 | Were you close to each other? Did you see some differences? I don't know.
0:23:40 | The main difference was on the "who appears when" task; in speaker ID we were all at more or less the same results.
0:23:48 | But what actually gives the most information in terms of identities is the names that are written on screen.
0:24:01 | Usually it's really easy to propagate them to the current speaker,
0:24:08 | and there is a fifteen percent improvement in terms of performance when we use the visual stream.
0:24:27 | No, basically the segmentation used for this is based on the Gaussian divergence, followed by some kind of linear clustering.
0:24:44 | And no, it's not an oracle. Out of the thirty-five percent error, there are, I think, five to ten percent coming from speech activity detection and segmentation errors.