0:00:21 | i'm waiting for the screen |
---|
0:00:38 | yeah |
---|
0:00:41 | hi i'm just the more from the university of five |
---|
0:00:44 | and |
---|
0:00:45 | i will talk about a preliminary study we did on speaker diarization of |
---|
0:00:50 | heterogeneous web video files |
---|
0:00:54 | i will start with an introduction then i will describe the LIA speaker diarization system |
---|
0:01:00 | uh describe the database we then used for this study |
---|
0:01:04 | show you some results and the uh |
---|
0:01:07 | conclusion |
---|
0:01:08 | and perspectives |
---|
0:01:10 | as you know |
---|
0:01:12 | speaker diarization is the process to find in an audio stream who spoke when with no a priori |
---|
0:01:17 | information on the |
---|
0:01:19 | identity of the speakers or their number |
---|
0:01:22 | and it's important to note that |
---|
0:01:24 | that the speaker diarization process |
---|
0:01:27 | in the speaker diarization process we don't do speaker identification |
---|
0:01:32 | as you also know |
---|
0:01:34 | uh there are two approaches |
---|
0:01:37 | for speaker diarization systems |
---|
0:01:39 | bottom-up and top-down |
---|
0:01:41 | uh the top-down approach |
---|
0:01:43 | is used by systems such as the LIA system and the bottom-up approach is used |
---|
0:01:49 | by systems such as the LIUM system |
---|
0:01:52 | so uh in the top-down system we start with no speakers and we add them |
---|
0:01:57 | one by one until the stopping criterion is reached |
---|
0:02:01 | and in the bottom-up approach we start with a lot of speakers and we um |
---|
0:02:06 | and we merge them until the stopping criterion is reached |
---|
0:02:11 | the main idea of this study was |
---|
0:02:13 | to test |
---|
0:02:14 | uh our speaker diarization system and its behavior on different uh on a new |
---|
0:02:21 | on a new content |
---|
0:02:22 | in a new context |
---|
0:02:24 | which is the web video |
---|
0:02:26 | this system has been tested on uh |
---|
0:02:29 | broadcast |
---|
0:02:30 | news data in the french evaluation campaign |
---|
0:02:33 | ESTER |
---|
0:02:34 | and on meeting data |
---|
0:02:37 | in the um |
---|
0:02:38 | nist evaluation campaign RT |
---|
0:02:46 | uh the |
---|
0:02:48 | yeah this is the description of our system |
---|
0:02:53 | there are three main steps |
---|
0:02:55 | in the um in our process we start with the speech nonspeech segmentation also called |
---|
0:03:00 | speech activity detection |
---|
0:03:03 | then we have a segmentation step |
---|
0:03:05 | and the re-segmentation |
---|
0:03:07 | the re-segmentation |
---|
0:03:08 | the re-segmentation step which aims to refine |
---|
0:03:12 | the um the results we have produced |
---|
0:03:15 | so in the uh speech nonspeech detection we initialize an hmm from the given gmms |
---|
0:03:22 | we apply a viterbi decoding and we have our segmentation file |
---|
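The speech/nonspeech step just described (an HMM decoded with Viterbi) can be sketched with two states. This is only an illustration and not the system presented in the talk: the function names `viterbi_two_state` and `to_segments`, the transition probabilities, and the per-frame log-likelihood input (which in the talk would come from the given GMMs) are all assumptions.

```python
import math


def viterbi_two_state(frame_loglik, stay_prob=0.9):
    """Decode the most likely nonspeech(0)/speech(1) label per frame.

    frame_loglik: list of (loglik_nonspeech, loglik_speech) pairs, one per
    frame. How these scores are produced is outside this sketch.
    stay_prob: assumed probability of staying in the same state.
    """
    if not frame_loglik:
        return []
    stay = math.log(stay_prob)
    switch = math.log(1.0 - stay_prob)
    # Equal initial priors for both states (an assumption).
    score = [frame_loglik[0][s] + math.log(0.5) for s in range(2)]
    backpointers = []
    for obs in frame_loglik[1:]:
        new_score, pointers = [], []
        for s in range(2):
            candidates = [score[p] + (stay if p == s else switch) for p in range(2)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            new_score.append(candidates[best_prev] + obs[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def to_segments(path):
    """Group frame labels into (start_frame, end_frame, label) segments."""
    segments = []
    for i, label in enumerate(path):
        if segments and segments[-1][2] == label:
            segments[-1] = (segments[-1][0], i + 1, label)
        else:
            segments.append((i, i + 1, label))
    return segments
```

The transition penalty is what distinguishes this from a frame-by-frame decision: a single weakly "speech" frame surrounded by nonspeech is smoothed away instead of creating a very short segment.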
0:03:26 | then uh this file is the base for the next step which is the segmentation step |
---|
0:03:31 | in the segmentation step we initialize |
---|
0:03:33 | an hmm with one speaker |
---|
0:03:35 | which will be the default speaker |
---|
0:03:38 | we try to add a speaker we'll and it's not that |
---|
0:03:41 | and then uh in a loop of training and decoding |
---|
0:03:46 | uh we check if we can add a new speaker if |
---|
0:03:50 | we can't |
---|
0:03:50 | we have our segmentation file |
---|
0:03:52 | and if we can add the speaker we |
---|
0:03:55 | we go back |
---|
0:03:56 | to the beginning of the loop |
---|
0:04:00 | then finally there is the re-segmentation step where uh we initialize |
---|
0:04:05 | we generate an hmm |
---|
0:04:06 | from the previous |
---|
0:04:07 | segmentation file |
---|
0:04:09 | and so in the loop |
---|
0:04:11 | of viterbi decoding and model adaptation |
---|
0:04:14 | and we have our final segmentation |
---|
0:04:16 | i |
---|
0:04:19 | uh |
---|
0:04:20 | as i said in the introduction the main idea of this study was to test our system on |
---|
0:04:24 | a new context which is the web video files |
---|
0:04:28 | the content of the web video files is uncontrolled with uh videos such as movie trailers |
---|
0:04:34 | or broadcast news |
---|
0:04:36 | and uh for example you can have a video recorded in a studio or with a |
---|
0:04:42 | cell phone |
---|
0:04:43 | we decided to |
---|
0:04:45 | a to be the database |
---|
0:04:47 | in in as a a which is divided |
---|
0:04:49 | into seven categories |
---|
0:04:51 | described just after or with mean a |
---|
0:04:54 | so a D as well |
---|
0:04:56 | contains a small than eight hundred videos in seven categories |
---|
0:04:59 | documentary movie trailer cartoon commercial news |
---|
0:05:03 | well and music video |
---|
0:05:05 | in this study we left out |
---|
0:05:07 | uh |
---|
0:05:08 | two categories |
---|
0:05:09 | spot because we don't have |
---|
0:05:11 | the video stream |
---|
0:05:13 | and music video because it's a very difficult and a very particular data |
---|
0:05:20 | we manually annotated |
---|
0:05:22 | a part of this corpus |
---|
0:05:24 | we annotated the audio the audio |
---|
0:05:27 | the audio file |
---|
0:05:29 | of uh one hundred |
---|
0:05:31 | and twenty nine video files |
---|
0:05:33 | oh |
---|
0:05:34 | which represents around ten hours and a half |
---|
0:05:38 | these numbers are about the annotated part |
---|
0:05:42 | of the corpus |
---|
0:05:44 | the two main things that we can see that we can deduce from this table is that we |
---|
0:05:49 | um |
---|
0:05:50 | we have the category which should be the best the news at the bottom of the table |
---|
0:05:56 | and the one which should be the worst |
---|
0:05:58 | the movie trailer |
---|
0:05:59 | and these categories should be the best and the worst |
---|
0:06:02 | because the um the length of the speaker turns |
---|
0:06:06 | for the news is very high and for the movie trailer is very low |
---|
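The speaker-turn length mentioned here can be computed from an annotated segmentation roughly as follows. This is a minimal sketch under assumptions: the function name `mean_turn_length`, the `(start, end, speaker)` tuple format, and the example values are invented for illustration and are not taken from the corpus.

```python
def mean_turn_length(segments):
    """Average speaker-turn duration.

    segments: list of (start, end, speaker) tuples sorted by start time.
    A turn is taken here to be a maximal run of consecutive segments
    from the same speaker (an assumption of this sketch).
    """
    turns = []
    previous = None
    for start, end, speaker in segments:
        if turns and speaker == previous:
            turns[-1] += end - start  # same speaker keeps the floor
        else:
            turns.append(end - start)
            previous = speaker
    if not turns:
        raise ValueError("no segments given")
    return sum(turns) / len(turns)


# Invented examples: long turns (news-like) versus short turns (trailer-like).
news_like = [(0, 40, "anchor"), (40, 60, "anchor"), (60, 90, "reporter")]
trailer_like = [(0, 2, "a"), (2, 3, "b"), (3, 5, "a"), (5, 6, "c")]
print(mean_turn_length(news_like), mean_turn_length(trailer_like))
```

Short turns give each speaker model very little contiguous data, which is the reason given in the talk for expecting the movie trailer category to be the hardest.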
0:06:10 | this |
---|
0:06:11 | information is very uh |
---|
0:06:13 | important because if you remember what i said just before |
---|
0:06:17 | we train our models with data and if we don't have enough data to train our models with |
---|
0:06:21 | that |
---|
0:06:21 | we shouldn't have |
---|
0:06:23 | a good result |
---|
0:06:28 | so the results |
---|
0:06:30 | then uh them set |
---|
0:06:33 | in the |
---|
0:06:34 | for this study we compared the LIA system to the LIUM bottom-up system the LIUM bottom-up |
---|
0:06:40 | system |
---|
0:06:41 | works |
---|
0:06:42 | uh |
---|
0:06:44 | like our system |
---|
0:06:47 | with uh speech nonspeech segmentation the segmentation |
---|
0:06:52 | and then uh segmentation based on the bic criterion and the re-segmentation |
---|
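The BIC criterion mentioned here is commonly used to decide whether two adjacent segments are better modeled by one Gaussian or by two. The sketch below shows the standard delta-BIC test for one-dimensional features only. It is not the bottom-up system's code: the function name `delta_bic`, the scalar features, the variance floor, and the `penalty_weight` parameter are assumptions made for illustration.

```python
import math


def _variance(values):
    mean = sum(values) / len(values)
    # Small floor avoids log(0) on constant segments (an assumption).
    return max(sum((v - mean) ** 2 for v in values) / len(values), 1e-12)


def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between 'two Gaussians' and 'one Gaussian' for scalar features.

    Positive value: the segments look different (keep a boundary between them).
    Negative value: one model is enough (the segments can be merged).
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    gain = 0.5 * (
        n * math.log(_variance(seg_a + seg_b))
        - n_a * math.log(_variance(seg_a))
        - n_b * math.log(_variance(seg_b))
    )
    # A 1-D Gaussian has 2 free parameters (mean and variance).
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty
```

With real acoustic features the same test uses full covariance matrices and log-determinants; the scalar case is kept here so the arithmetic stays easy to follow.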
0:06:59 | we tested |
---|
0:07:00 | these systems on |
---|
0:07:03 | different datasets |
---|
0:07:04 | the RT oh nine |
---|
0:07:06 | uh |
---|
0:07:07 | dataset from the nist |
---|
0:07:09 | evaluation campaign |
---|
0:07:11 | it's meeting data |
---|
0:07:13 | and |
---|
0:07:14 | on uh |
---|
0:07:15 | ESTER two thousand eight |
---|
0:07:17 | uh from the french evaluation campaign ESTER two it's broadcast news data |
---|
0:07:22 | and on our uh annotated subset |
---|
0:07:26 | of it years are are with manual and automatic speech non |
---|
0:07:29 | speech segmentation |
---|
0:07:31 | we will see after why |
---|
0:07:35 | so these are our preliminary results |
---|
0:07:37 | the first |
---|
0:07:39 | thing that we can outline |
---|
0:07:40 | is |
---|
0:07:41 | is that uh we have |
---|
0:07:43 | quite good results |
---|
0:07:45 | i if you remember what show you said just before |
---|
0:07:48 | but uh we are not so far from the state of the art |
---|
0:07:51 | results |
---|
0:07:54 | uh the second thing is that uh |
---|
0:07:57 | we know that the LIUM system outperforms ours |
---|
0:08:00 | is |
---|
0:08:02 | and you can see that on ESTER two thousand eight |
---|
0:08:05 | uh they do two times better than us |
---|
0:08:10 | than our system |
---|
0:08:11 | but |
---|
0:08:12 | oh on the uh in on the years are are |
---|
0:08:16 | got to |
---|
0:08:17 | uh this |
---|
0:08:18 | um |
---|
0:08:21 | this |
---|
0:08:21 | the same remark cannot be applied because they are not two times better |
---|
0:08:27 | than our system |
---|
0:08:32 | uh |
---|
0:08:33 | then you can see that |
---|
0:08:35 | the um the large part of the um |
---|
0:08:39 | of the diarization error rate |
---|
0:08:41 | is due to speech nonspeech segmentation error |
---|
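To make the split between speech/nonspeech errors and speaker errors concrete, here is a simplified frame-level diarization error rate. It is a sketch under assumptions, not the official scoring tool: the function name `diarization_error`, the per-frame label lists (with `None` for nonspeech), the absence of a collar and of overlapped speech, and the brute-force speaker mapping are all simplifications chosen for illustration.

```python
from itertools import permutations


def diarization_error(reference, hypothesis):
    """Return (DER, missed, false_alarm, confusion) counted in frames.

    reference, hypothesis: equal-length lists of per-frame speaker labels,
    with None meaning nonspeech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech")
    missed = false_alarm = 0
    overlap = {}
    for r, h in zip(reference, hypothesis):
        if r is not None and h is None:
            missed += 1          # speech labeled as nonspeech
        elif r is None and h is not None:
            false_alarm += 1     # nonspeech labeled as speech
        elif r is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1
    ref_spk = sorted({r for r in reference if r is not None})
    hyp_spk = sorted({h for h in hypothesis if h is not None})
    # Best one-to-one speaker mapping by brute force (fine for few speakers).
    if len(hyp_spk) >= len(ref_spk):
        mappings = (zip(ref_spk, p) for p in permutations(hyp_spk, len(ref_spk)))
    else:
        mappings = (zip(p, hyp_spk) for p in permutations(ref_spk, len(hyp_spk)))
    best_correct = max(sum(overlap.get(pair, 0) for pair in m) for m in mappings)
    confusion = sum(overlap.values()) - best_correct
    der = (missed + false_alarm + confusion) / total_speech
    return der, missed, false_alarm, confusion
```

The missed and false-alarm counts come only from the speech/nonspeech step, so scoring the same system on a manual speech/nonspeech segmentation sets both to nearly zero and leaves the speaker confusion term.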
0:08:44 | so we tried to remove this to measure the influence of the segmentation the first |
---|
0:08:50 | speech nonspeech |
---|
0:08:52 | detection step |
---|
0:08:53 | this is the reason why we applied our system |
---|
0:08:56 | our system on the automatic speech nonspeech segmentation |
---|
0:09:00 | and manual segmentation |
---|
0:09:03 | so the results uh there is nearly no error |
---|
0:09:06 | for the |
---|
0:09:07 | with the with the perfect |
---|
0:09:09 | um |
---|
0:09:11 | with the perfect speech |
---|
0:09:12 | speech nonspeech segmentation |
---|
0:09:16 | we also tried to remove this error to measure the influence of the system |
---|
0:09:20 | and the database as well |
---|
0:09:22 | yeah as expected you can see that the best category is the news category |
---|
0:09:28 | and the worst category for our system is |
---|
0:09:31 | the movie trailer category as |
---|
0:09:33 | expected |
---|
0:09:37 | uh |
---|
0:09:38 | you can see that um that the LIUM system outperforms our system in nearly all |
---|
0:09:45 | the categories |
---|
0:09:46 | but the range of the um |
---|
0:09:49 | of the scores are quite close |
---|
0:09:52 | uh for example for news the minimum diarization error rate is around zero percent for each system |
---|
0:09:58 | and the maximum diarization error rate for cartoon is around |
---|
0:10:02 | seventy two percent |
---|
0:10:04 | for both |
---|
0:10:08 | another thing that we can uh deduce from this table is that |
---|
0:10:12 | we |
---|
0:10:14 | this |
---|
0:10:14 | it's also something that we knew |
---|
0:10:16 | that the LIUM system finds more speakers than our system |
---|
0:10:22 | but you can see |
---|
0:10:24 | uh when you look at the scores that |
---|
0:10:26 | the um |
---|
0:10:28 | the speakers found by the LIUM system |
---|
0:10:31 | are not more reliable than our |
---|
0:10:34 | speakers found even if |
---|
0:10:35 | the number of speakers found |
---|
0:10:37 | is higher for them |
---|
0:10:43 | um um |
---|
0:10:44 | in conclusion this study outlines the difficulties encountered by both systems |
---|
0:10:50 | by both systems |
---|
0:10:53 | and uh and that there was a new was what done |
---|
0:10:56 | it also outlines |
---|
0:10:58 | that it's a very difficult database |
---|
0:11:00 | with a lot of variability between categories a high interactivity uh the number and the duration of |
---|
0:11:07 | of the speaker turns |
---|
0:11:10 | and there is a lot of a one i these |
---|
0:11:12 | should explain why we have bad results |
---|
0:11:15 | and the |
---|
0:11:18 | our perspective |
---|
0:11:19 | is |
---|
0:11:20 | uh first to deal only with the categories where we are the best |
---|
0:11:25 | and uh in the second time |
---|
0:11:28 | the main um |
---|
0:11:29 | research axis will be |
---|
0:11:31 | to use information from the video stream to help the decision |
---|
0:11:36 | on the speaker |
---|
0:11:38 | thank you for your attention |
---|
0:11:40 | and if you have been |
---|
0:11:41 | in |
---|
0:11:42 | i |
---|
0:11:48 | we |
---|
0:11:50 | i |
---|
0:11:52 | oh |
---|
0:11:53 | hmmm |
---|
0:11:57 | so two questions on the first and uh |
---|
0:12:00 | did you score overlapped speech |
---|
0:12:02 | no |
---|
0:12:02 | no because our system cannot handle for now uh overlapped |
---|
0:12:06 | speech okay and like that |
---|
0:12:08 | you showed the annotated datasets marked manually and |
---|
0:12:12 | number of speakers and average speaker turn |
---|
0:12:14 | do you know the distribution uh another important factor in the diarization is that even if you have |
---|
0:12:19 | five speakers if it's dominated by two |
---|
0:12:23 | then you can actually do |
---|
0:12:24 | alright if |
---|
0:12:25 | two speakers take ninety percent of the |
---|
0:12:27 | talk time do you have any information on the different categories of how it might have been distributed |
---|
0:12:31 | we didn't really measure that |
---|
0:12:33 | but uh |
---|
0:12:36 | i can say the repartition is quite equivalent for all the speakers |
---|
0:12:40 | is |
---|
0:12:41 | for some categories |
---|
0:12:44 | there is no no um |
---|
0:12:48 | dominant speaker |
---|
0:12:50 | yeah it's |
---|
0:12:51 | uh |
---|
0:12:52 | no it depends on the categories |
---|
0:12:54 | like news and documentaries there is a dominant speaker |
---|
0:12:58 | but for movie trailers cartoons and commercials it's not the same |
---|
0:13:06 | i |
---|
0:13:07 | i |
---|
0:13:09 | oh |
---|
0:13:10 | do you do anything special with music because i can imagine there is a lot of music for |
---|
0:13:14 | example in movie trailers |
---|
0:13:16 | or it can be like only music or music in the background |
---|
0:13:20 | yeah we don't use music uh information |
---|
0:13:23 | for now |
---|
0:13:25 | might be uh |
---|
0:13:26 | something interesting to do |
---|
0:13:28 | that's uh |
---|
0:13:29 | i and just to do where your question |
---|
0:13:31 | a a a a a we don't the the the music information |
---|
0:13:34 | with the music first |
---|
0:13:35 | mission fun |
---|
0:13:36 | which means that you don't you do not score |
---|
0:13:39 | i Q you are is are the parts are the music |
---|
0:13:43 | it depends on |
---|
0:13:44 | how it's handled by the speech nonspeech uh step if the music is recognized |
---|
0:13:51 | as a speech um |
---|
0:13:54 | as the nonspeech label |
---|
0:13:55 | it won't be scored but if it's |
---|
0:13:57 | uh marked as speech uh |
---|
0:14:00 | label it will be scored |
---|
0:14:05 | i |
---|
0:14:08 | i |
---|
0:14:11 | but |
---|
0:14:17 | but |
---|
0:14:18 | oh |
---|
0:14:20 | but |
---|
0:14:22 | yeah |
---|
0:14:23 | i |
---|
0:14:25 | i |
---|
0:14:28 | i |
---|
0:14:30 | uh it's |
---|
0:14:32 | here again depends on the categories |
---|
0:14:34 | movie trailers cartoons |
---|
0:14:36 | are very noisy |
---|
0:14:39 | that's uh a |
---|
0:14:40 | mm use |
---|
0:14:41 | quite X |
---|
0:14:46 | i |
---|
0:14:48 | i |
---|
0:14:55 | um |
---|