0:00:26 | so let's talk or more complex |
---|
0:00:36 | i |
---|
0:00:53 | right so that we present you do that |
---|
0:00:59 | so the goal of the challenge is really |
---|
0:01:01 | to find people in a |
---|
0:01:03 | multimodal context |
---|
0:01:07 | so |
---|
0:01:08 | what we mean by a multimodal context is that the participants can use the speech |
---|
0:01:14 | and the image |
---|
0:01:15 | to recognize people |
---|
0:01:18 | it is a french collaboration: the corpus is provided by ELDA and |
---|
0:01:24 | the evaluation is organized by LNE |
---|
0:01:28 | three research consortia participate in the challenge. this presentation |
---|
0:01:34 | is a presentation of the evaluation, it is not a presentation of the systems that participate |
---|
0:01:39 | in the challenge |
---|
0:01:41 | and if you want to have more details about the systems |
---|
0:01:47 | please go to interspeech, where some of them may be |
---|
0:01:52 | presented |
---|
0:01:55 | so about my presentation: i will first present the tasks, then the corpus |
---|
0:02:01 | the metrics we used |
---|
0:02:03 | and some results from the dry run |
---|
0:02:08 | and then some perspectives and some conclusions |
---|
0:02:13 | so the main task is to answer the question: who is present in the video |
---|
0:02:19 | that means who is visible or |
---|
0:02:22 | who is speaking in the video |
---|
0:02:25 | two conditions are proposed. the first is the supervised condition, which means that the participants can |
---|
0:02:33 | build |
---|
0:02:34 | a priori models of the face or of the voice of the different |
---|
0:02:41 | persons that might be in the videos |
---|
0:02:45 | on the other side you have an unsupervised condition, where the participants are allowed to |
---|
0:02:52 | use only the test videos to find the people |
---|
0:02:59 | this is the main task |
---|
0:03:02 | in addition we have subtasks that |
---|
0:03:06 | are more precise in their question: they answer the questions who |
---|
0:03:11 | is speaking, who is visible in the video, what names are cited |
---|
0:03:15 | in the speech |
---|
0:03:18 | and what names are displayed on the screen |
---|
0:03:21 | to answer these questions there are two conditions: a multimodal condition, where people can |
---|
0:03:29 | use all the modalities to answer the question, and a monomodal condition, where for who |
---|
0:03:36 | is speaking they can only use the speech |
---|
0:03:40 | and for who is visible they can only use |
---|
0:03:43 | the video, that is the image |
---|
0:03:47 | and |
---|
0:03:48 | we |
---|
0:03:50 | we assume that to answer these questions some technologies are |
---|
0:03:56 | necessary, and so we propose that |
---|
0:04:00 | we also assess speaker diarization, speech transcription, head detection and segmentation, |
---|
0:04:09 | overlaid text detection and segmentation |
---|
0:04:11 | and optical character recognition for the text on screen |
---|
0:04:18 | so about the schedule: as i said, the dry run |
---|
0:04:22 | was conducted this year, and the first and second official campaigns will be in |
---|
0:04:29 | two thousand thirteen and two thousand fourteen |
---|
0:04:36 | so about the shows: you have seven different shows that are |
---|
0:04:43 | annotated in the corpus |
---|
0:04:45 | and there are different episodes of the same show, so that some |
---|
0:04:51 | people, for example the presenters, are present in multiple |
---|
0:04:57 | episodes of a show |
---|
0:05:02 | we work with different kinds of shows, like news shows or political debates |
---|
0:05:09 | you have the parliament's questions to the government sessions too |
---|
0:05:13 | and celebrity news shows |
---|
0:05:17 | we chose these kinds of shows because they are very different and variable |
---|
0:05:24 | some of them are more difficult than others because of the kind of speech: for |
---|
0:05:30 | example for the celebrity news show you have |
---|
0:05:35 | more spontaneous speech, and for the parliament's questions to the government it is always |
---|
0:05:43 | read speech, so the goal is to mix the conditions of speech |
---|
0:05:50 | all these |
---|
0:05:52 | shows come from two different channels |
---|
0:05:55 | and at the end of the project there will be sixty hours of |
---|
0:06:00 | videos |
---|
0:06:02 | in the database. i can imagine that you don't know what these shows look like |
---|
0:06:06 | so i propose to show you a little sample, to give you an idea of |
---|
0:06:14 | the data |
---|
0:06:25 | [video sample from the corpus is played] |
---|
0:08:03 | yeah so |
---|
0:08:06 | the corpus was annotated with visual annotations |
---|
0:08:11 | i mean from the image point of view |
---|
0:08:15 | we annotate one image every ten seconds |
---|
0:08:21 | we mark the heads that are visible in the |
---|
0:08:25 | image |
---|
0:08:27 | the heads are described, to say whether there is |
---|
0:08:32 | an occlusion of the head or not, for example if an object or |
---|
0:08:40 | something else hides part of it |
---|
0:08:43 | and the person is named |
---|
0:08:45 | the text objects are delimited by a rectangle and transcribed |
---|
0:08:52 | and in all the transcribed text the person names |
---|
0:08:59 | are annotated |
---|
0:09:02 | and so as to have something which |
---|
0:09:05 | is closer to a diarization, the appearance and disappearance times of all the heads |
---|
0:09:12 | and all the text |
---|
0:09:14 | are given, so as |
---|
0:09:17 | to know what the segmentation of the video is |
---|
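
As an illustration of the visual annotation scheme described above, here is a minimal Python sketch. It assumes a regular grid with one key frame every ten seconds starting at time zero, and appearance intervals given as (name, start, end) tuples. The function names, the grid alignment and the sample data are hypothetical and are not taken from the corpus tools.

```python
def key_frame_times(duration, step=10.0):
    """Return key-frame timestamps (in seconds) on a regular grid.

    Assumption: the grid starts at 0 and has one frame every `step` seconds.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return [i * step for i in range(int(duration // step) + 1)]


def visible_persons(t, appearances):
    """Return the names whose appearance interval [start, end] contains t."""
    return {name for name, start, end in appearances if start <= t <= end}


# Hypothetical appearance/disappearance annotations: (name, start, end).
appearances = [
    ("alice", 0.0, 12.5),
    ("bob", 8.0, 31.0),
    ("alice", 25.0, 40.0),
]

# List who is visible on each key frame of a 35-second video.
for t in key_frame_times(35.0):
    print(t, sorted(visible_persons(t, appearances)))
```

The intervals cover the whole video while the key frames only sample it, which matches the idea that the appearance times give something close to a diarization even though the detailed annotation exists only on key frames.
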
0:09:22 | for |
---|
0:09:24 | the speech annotation we have a standard transcription of all the videos |
---|
0:09:29 | with the speaker turn segmentation and the music segmentation too |
---|
0:09:33 | it is a rich speech transcription that includes all the disfluencies |
---|
0:09:39 | and |
---|
0:09:42 | all the filler words |
---|
0:09:45 | i'm french so i don't know the exact equivalents, but something like "alright so" or |
---|
0:09:52 | "i think so", all this kind of expression that might be useful to recognise |
---|
0:09:59 | the people |
---|
0:10:00 | and we name all the persons that are speaking, and all |
---|
0:10:07 | the person names that are cited in the speech transcription |
---|
0:10:13 | are annotated too. you have an example here |
---|
0:10:20 | where the name is at the beginning |
---|
0:10:25 | so the main metric we use is the estimated global error rate. it is based |
---|
0:10:32 | on the misses and the false alarms, but we also want to take into account the fact that |
---|
0:10:37 | the system has found the correct number of people who are present in |
---|
0:10:42 | the video. that's why we include a confusion, which means that if you have |
---|
0:10:48 | the right number of people but you make an error on the name |
---|
0:10:54 | of the people, it is |
---|
0:10:58 | a less important error than to miss someone. that's why |
---|
0:11:05 | we use this metric for the main task and for the |
---|
0:11:10 | questions who is speaking, who is visible |
---|
0:11:13 | and what names are displayed |
---|
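
The metric described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the talk says that confusions count less than misses but does not give the weight, so `confusion_weight=0.5` is a placeholder value, and the rule that pairs each unmatched reference name with an unmatched hypothesis name to form a confusion is also an assumption. The function names and the sample data are hypothetical.

```python
def count_errors(reference, hypothesis):
    """Count misses, false alarms and confusions for one annotated frame.

    Assumption: a wrong name given in place of a missing reference name is
    counted as one confusion instead of one miss plus one false alarm.
    """
    ref, hyp = set(reference), set(hypothesis)
    unmatched_ref = len(ref - hyp)   # reference persons not output
    unmatched_hyp = len(hyp - ref)   # output persons not in the reference
    confusions = min(unmatched_ref, unmatched_hyp)
    return {
        "miss": unmatched_ref - confusions,
        "false_alarm": unmatched_hyp - confusions,
        "confusion": confusions,
        "reference": len(ref),
    }


def estimated_global_error_rate(frames, confusion_weight=0.5):
    """Weighted errors divided by the number of persons to find.

    `frames` is an iterable of (reference_names, hypothesis_names) pairs.
    `confusion_weight` is an assumed value, not one stated in the talk.
    """
    counts = [count_errors(ref, hyp) for ref, hyp in frames]
    total = sum(c["reference"] for c in counts)
    if total == 0:
        raise ValueError("the reference contains no person to find")
    errors = sum(
        c["miss"] + c["false_alarm"] + confusion_weight * c["confusion"]
        for c in counts
    )
    return errors / total


# Hypothetical frames: (names in the reference, names output by a system).
frames = [
    ({"alice", "bob"}, {"alice", "carol"}),  # one confusion
    ({"alice"}, set()),                      # one miss
    ({"bob"}, {"bob", "dave"}),              # one false alarm
]
print(estimated_global_error_rate(frames))
```

With a weight below one, a system that finds the right number of people but gives a wrong name is penalised less than a system that misses the person, which is the behaviour described in the talk.
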
0:11:16 | and for what names are cited we use the slot error rate, which is a |
---|
0:11:20 | comparison of the hypothesis and the reference intervals for the names |
---|
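
A slot error rate of this kind can be sketched as below. The talk only says that hypothesis and reference intervals are compared, so the details here are assumptions: two slots are matched when their time intervals overlap, matching is greedy in reference order, and substitutions, deletions and insertions all have the same weight. The `Slot` type and the sample data are hypothetical.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    name: str     # cited person name


def overlaps(a, b):
    """True if the two time intervals share a non-empty span."""
    return a.start < b.end and b.start < a.end


def slot_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference slots."""
    if not reference:
        raise ValueError("the reference contains no slot")
    unused = list(hypothesis)
    substitutions = deletions = 0
    for ref in reference:
        candidates = [h for h in unused if overlaps(ref, h)]
        if not candidates:
            deletions += 1          # the cited name was not detected
            continue
        # Prefer an overlapping slot with the right name, if there is one.
        match = next((h for h in candidates if h.name == ref.name), candidates[0])
        unused.remove(match)
        if match.name != ref.name:
            substitutions += 1      # detected, but with a wrong name
    insertions = len(unused)        # hypothesis slots matching nothing
    return (substitutions + deletions + insertions) / len(reference)


# Hypothetical reference and system output.
reference = [
    Slot(1.0, 1.8, "alice"),
    Slot(5.0, 5.6, "bob"),
    Slot(9.0, 9.5, "carol"),
    Slot(15.0, 15.5, "erin"),
]
hypothesis = [
    Slot(1.1, 1.7, "alice"),    # correct
    Slot(5.1, 5.5, "rob"),      # substitution
    Slot(12.0, 12.4, "dave"),   # insertion ("carol" is a deletion)
    Slot(15.0, 15.4, "erin"),   # correct
]
print(slot_error_rate(reference, hypothesis))
```
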
0:11:27 | so for the dry run: the dry run corpus is a very short |
---|
0:11:33 | corpus, because the goal was to see what we can do with these |
---|
0:11:39 | metrics and this kind of corpus. it is clear that it is not enough for |
---|
0:11:43 | the systems to develop something which is |
---|
0:11:46 | high-performing, but that is not the goal of the dry run |
---|
0:11:50 | and what we saw here is that |
---|
0:11:53 | the speech duration per speaker is very short |
---|
0:11:58 | the majority of the speakers speak less than twenty seconds, but |
---|
0:12:09 | you also have people who speak more than |
---|
0:12:17 | two hundred and sixty seconds, so that is the diversity of the corpus. and for |
---|
0:12:24 | the distribution of the people according to the number of key frames |
---|
0:12:29 | in which they appear you have the same thing: some of them do not appear |
---|
0:12:34 | much. but usually when someone appears |
---|
0:12:40 | a lot he also speaks a lot, and so by combining the audio and visual information you |
---|
0:12:45 | might find who is speaking and who is present in the video |
---|
0:12:50 | and so if you analyze |
---|
0:12:53 | the moments where the name is displayed, or the face is visible, or the |
---|
0:13:00 | speaker is speaking in all the corpus, you can see that for eight percent |
---|
0:13:05 | the person who is speaking appears and his name is displayed on the |
---|
0:13:11 | video at the same time |
---|
0:13:13 | and |
---|
0:13:15 | yeah but for example you have |
---|
0:13:18 | a set seventy |
---|
0:13:21 | percent of the people who just have their name displayed on the screen, and so for |
---|
0:13:26 | the main task, for example, you must not say that these |
---|
0:13:31 | people are present in the video, because they are not speaking and they are |
---|
0:13:36 | not |
---|
0:13:37 | visible. and this distribution |
---|
0:13:40 | is very different according to the kind of show. for example for BFM Story |
---|
0:13:49 | you have |
---|
0:13:52 | as much as thirty two percent of the displayed names |
---|
0:13:57 | that are not useful to find the people, and for LCP it is |
---|
0:14:01 | the contrary: you are sure that if you find the name of a person |
---|
0:14:08 | this person is present in the video. so the participants have to analyze |
---|
0:14:16 | this kind of thing |
---|
0:14:17 | it might be useful to have this kind of information |
---|
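
The kind of breakdown discussed above (how often a person is speaking, visible, has a name displayed, or some combination) can be computed along these lines. This is a hedged sketch: the modality labels, the function name and the observations are invented for the example and do not reproduce the corpus figures.

```python
from collections import Counter


def modality_shares(observations):
    """Share of observations for each combination of modalities.

    Each observation is the set of modalities in which one person is
    present at a given moment (labels are assumptions for this sketch).
    """
    counts = Counter(frozenset(obs) for obs in observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observation given")
    return {combo: n / total for combo, n in counts.items()}


# Hypothetical observations, one per (person, moment).
observations = [
    {"speaking", "visible", "name_displayed"},
    {"speaking", "visible"},
    {"name_displayed"},          # name on screen, person not present
    {"speaking", "visible"},
]

shares = modality_shares(observations)
for combo, share in shares.items():
    print(sorted(combo), share)
```

Running the same computation per show would expose the differences mentioned in the talk, for instance a show where a displayed name often belongs to someone who is neither speaking nor visible.
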
0:14:22 | so |
---|
0:14:24 | here you have the annotations and the clues you can use |
---|
0:14:30 | to answer the questions |
---|
0:14:38 | there are more than |
---|
0:14:41 | two hundred and sixty seven people |
---|
0:14:45 | in the datasets |
---|
0:14:47 | and one hundred seventy one people in the test set |
---|
0:14:52 | and as you can see |
---|
0:14:55 | there are some anonymous people: that means that the annotators |
---|
0:15:01 | were not able to know who they are |
---|
0:15:08 | just by watching the video |
---|
0:15:12 | that's why they are marked as anonymous, and the systems have to find that there is |
---|
0:15:16 | someone, but they do not have to |
---|
0:15:18 | name him |
---|
0:15:21 | for the first results, it is clear that it is a dry run, so the results |
---|
0:15:27 | are not so good |
---|
0:15:29 | what we want to compare is the |
---|
0:15:32 | results of the systems |
---|
0:15:37 | for the main task |
---|
0:15:38 | and for the tasks who is speaking and who is visible. as you can |
---|
0:15:44 | see |
---|
0:15:45 | they have better results to say who is speaking than to say who is |
---|
0:15:51 | visible in the videos, and for the main task the main problem is to |
---|
0:15:57 | say who is visible |
---|
0:16:04 | for who is speaking |
---|
0:16:07 | in particular we analyzed the results by comparing |
---|
0:16:12 | the results for the supervised multimodal condition and the supervised monomodal condition |
---|
0:16:19 | and as you can see there is no significant difference in the results |
---|
0:16:24 | between the two conditions. that means that |
---|
0:16:28 | the information that comes from |
---|
0:16:31 | the other modalities was not used by the systems to improve their |
---|
0:16:36 | results |
---|
0:16:38 | so |
---|
0:16:39 | on this side you have |
---|
0:16:43 | the results by show |
---|
0:16:47 | the center of the circle represents the mean performance |
---|
0:16:52 | and |
---|
0:16:53 | the radius represents the standard deviation of the performance |
---|
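
The per-show summary behind such a plot (a mean for the center of each circle and a standard deviation for its radius) could be computed as in this sketch. The talk does not say over what the statistics are taken or which estimator is used, so pooling individual scores per show and using the population standard deviation are assumptions, and the show names and scores are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_show_summary(scores):
    """Map each show to (mean, population standard deviation) of its scores.

    `scores` is an iterable of (show_name, error_rate) pairs.
    """
    by_show = defaultdict(list)
    for show, value in scores:
        by_show[show].append(value)
    return {show: (mean(values), pstdev(values)) for show, values in by_show.items()}


# Hypothetical error rates, for example one per system and show.
scores = [
    ("show_a", 0.25),
    ("show_a", 0.75),
    ("show_b", 0.5),
    ("show_b", 0.5),
]

summary = per_show_summary(scores)
for show, (center, radius) in summary.items():
    print(show, center, radius)
```

In this toy data both shows have the same mean but very different spreads, which is the kind of contrast between shows that the circles are meant to make visible.
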
0:16:57 | and as you can see, according to the show, the systems are |
---|
0:17:03 | more or less accurate. so if we compare them |
---|
0:17:08 | for one show the results are very stable, which means that |
---|
0:17:14 | this show is processed consistently, but regarding the green, the |
---|
0:17:22 | dark green one, even if the mean is lower the variation of the performance is |
---|
0:17:29 | more important, so that might be something that the systems have to improve too |
---|
0:17:38 | regarding who is visible |
---|
0:17:40 | in the videos |
---|
0:17:42 | doing the same kind of analysis you can see that there is a significant |
---|
0:17:48 | difference between the supervised multimodal condition and the supervised monomodal condition, so here the speech |
---|
0:17:56 | is useful and the systems have used this complementary information. and here again |
---|
0:18:05 | you have |
---|
0:18:06 | the representation of the results according to the show, and here again you |
---|
0:18:13 | have different performances, and the variation of the performance across the shows |
---|
0:18:17 | is important |
---|
0:18:19 | for what names are cited |
---|
0:18:22 | we focus here on the kinds of errors |
---|
0:18:27 | made by the systems, and again as you can |
---|
0:18:31 | see deletion is the most important one |
---|
0:18:34 | here, for all the systems that participated |
---|
0:18:37 | and the |
---|
0:18:39 | result is that |
---|
0:18:43 | the ability of the systems |
---|
0:18:44 | to detect the cited names has to be improved, because they |
---|
0:18:51 | miss a lot of names |
---|
0:18:55 | for what names are displayed the performance again can be improved, and we focus on |
---|
0:19:00 | the OCR and text segmentation results |
---|
0:19:04 | the OCR results are not perfect, but |
---|
0:19:09 | the systems can extract some information from the text, and the segmentation is quite good |
---|
0:19:16 | so it's again the problem of extracting the names from the text |
---|
0:19:21 | that is the main problem |
---|
0:19:25 | so in conclusion, the goal of the challenge is to find people in a multimodal |
---|
0:19:33 | condition in the french language. the main question is who is present in the |
---|
0:19:38 | video, but you have subtasks and |
---|
0:19:41 | several questions that can be helpful to address the main task |
---|
0:19:47 | and this challenge is now open to anyone who wishes to participate, so |
---|
0:19:54 | you can join |
---|
0:19:56 | and for the dry run it is clear that the fusion of information can be improved, and |
---|
0:20:01 | we also saw an important variability of the performance according to the shows |
---|
0:20:07 | for the perspectives |
---|
0:20:09 | for the metrics |
---|
0:20:11 | we want to take into account the role of the person in the videos |
---|
0:20:16 | because for some applications, in particular for clustering of videos |
---|
0:20:24 | the importance of the person depends on his role in the video, so |
---|
0:20:30 | it is important to work on that. and |
---|
0:20:33 | we want to weight |
---|
0:20:36 | the importance of the people according to the available modalities: if |
---|
0:20:43 | you miss someone who is both speaking and |
---|
0:20:46 | visible, you will make more errors than if he is just |
---|
0:20:52 | speaking or just visible on the screen |
---|
0:20:55 | for that we want |
---|
0:20:58 | to improve the characterisation of the differences between the shows |
---|
0:21:02 | with more speech analysis |
---|
0:21:07 | and more |
---|
0:21:09 | analysis of the videos |
---|
0:21:12 | and to study more what the same speaker |
---|
0:21:16 | does in different shows, because it is not exactly the same thing to speak in parliament |
---|
0:21:22 | and to be in a debate with other people, so it is this kind of |
---|
0:21:27 | analysis that we hope to do |
---|
0:21:31 | so thank you, and if you have questions |
---|
0:21:35 | i |
---|
0:21:59 | all the description that i have done here is for |
---|
0:22:05 | all the data, because after this all of it will be in |
---|
0:22:12 | the training part for the official campaign, so that's |
---|
0:22:18 | why we have done |
---|
0:22:20 | the analysis on all the data and that's where table it's a |
---|
0:22:27 | it's a choice because we don't have but that's a now we don't have announced |
---|
0:22:32 | does not shoot to speech it's in doing so |
---|
0:22:36 | morning |
---|
0:22:37 | since the yes |
---|
0:22:48 | yeah |
---|
0:23:04 | yeah, the whole video is presented to the systems |
---|
0:23:07 | they can use all the videos |
---|
0:23:10 | it is the annotation for the evaluation that is based on key |
---|
0:23:16 | frames |
---|
0:23:17 | that is only for the evaluation; the participants must process all |
---|
0:23:22 | the videos |
---|
0:23:23 | and it's just because it's very expensive to do this kind of |
---|
0:23:30 | annotation, that's why we restrict the evaluation to key frames |
---|
0:23:34 | and that's why we indicate the beginning and the end of the appearance of the |
---|
0:23:39 | people, so that the systems also have something |
---|
0:23:45 | like a diarization, even if it's not exactly a diarization for the videos |
---|
0:23:48 | but it's always the problem |
---|
0:23:51 | of the expense of doing this kind of |
---|
0:23:55 | annotation |
---|
0:24:02 | yeah |
---|
0:24:03 | for the speech, for the question who is speaking, they have |
---|
0:24:07 | to answer for all the video, but for the visible part they have |
---|
0:24:13 | to focus |
---|
0:24:16 | on the key frames |
---|
0:24:19 | but it is clear that at the beginning they don't know where the key |
---|
0:24:23 | frames are |
---|
0:24:25 | they don't just so when it's the test and |
---|
0:24:28 | where wednesday |
---|
0:24:30 | where are the people in the video |
---|
0:24:43 | oh |
---|
0:24:51 | oh |
---|
0:24:56 | for the transcripts, they have to transcribe all the videos |
---|
0:25:03 | the systems at the beginning just have access to the |
---|
0:25:09 | videos |
---|
0:25:11 | they have to use their own systems to transcribe the videos |
---|
0:25:15 | that's the for the set task we bss for example |
---|
0:25:22 | a use that you have to some reassurance after this for the main task they |
---|
0:25:27 | just have to use it was that the beginning and |
---|
0:25:30 | so used a technologies they want |
---|
0:25:32 | so that summer yeah |
---|
0:25:35 | a used to transcribe it does this one |
---|
0:25:37 | the up on transcription |
---|
0:25:39 | and also prefer just doing yeah generalization and have a |
---|
0:25:46 | they don't for a lot on does unsupervised condition and so he was a lot |
---|
0:25:51 | of face models or |
---|
0:25:56 | the voice |
---|
0:26:08 | i |
---|
0:26:14 | no i think of the name of the detailed the shows so for example a |
---|
0:26:19 | single a lot of the present data because for the information show is always the |
---|
0:26:25 | same presentation right now |
---|
0:26:27 | in all that they all signed the or the shows the old interest shows but |
---|
0:26:32 | they don't know a always the |
---|
0:26:36 | in fact that people for example so yeah |
---|
0:26:40 | yeah |
---|
0:26:43 | oh |
---|