0:00:15 | and, well, whether it can actually identify speakers; and then we also wanted to try if it's possible to actually fuse the results with the more traditional i-vector and GMM systems. So basically, some things were done already |
---|
0:00:34 | and these are basically the closest works we could find at the time of writing, but, you know, with the arXiv publishing model and so on they could be out of date. The first one especially is close to ours, because they actually use spectrograms as well |
---|
0:00:49 | but |
---|
0:00:51 | but what they use it for is to identify disguised voices: for example, when you have voice actors, like in The Simpsons, one actor can play several characters, so they want to identify the actual actors and not the characters that they play. But they didn't do the fusion or the exploration, and they basically used an off-the-shelf network |
---|
0:01:18 | and also, quite a lot of their conclusions are, in our view, not sound |
---|
0:01:25 | So basically, what we want to show here is sort of an overview of the system. The lower part is basically the standard approach, where you have the MFCCs or other features extracted, and then the i-vectors or the UBM, whatever, and then |
---|
0:01:46 | you usually do the identification. What we wanted to do is basically extract spectrograms, put them through the network, and then get the identity; I will explain later why there are several identities |
---|
0:02:03 | coming out of the CNN. So basically we wanted to test a small convolutional network and then the TV system, and on this dataset the results were actually quite surprising, so we chose this system for the fusion |
---|
0:02:27 | I don't need to go into much detail here: it is a convolutional network inspired by the networks that are currently used for image recognition, this one in particular |
---|
0:02:41 | So basically what we did is we tried an existing model and then started downsizing it, because that didn't change the results and it sped up learning, and we came up with this |
---|
0:02:53 | And it's actually fairly modest: you have five convolutional layers, but anything much bigger would be overkill, especially as the images we begin with are monochromatic, so we don't have three channels at the very beginning |
---|
0:03:11 | So the system basically (inaudible); and we use ReLU as the nonlinear function, and dropout at zero point five |
---|
0:03:27 | And this is opposed to what is usually done: we did no random cropping and no rotations. This was because the spectrograms basically have a pretty big overlap anyway, so cropping wouldn't do much; and we didn't want the rotations because patterns in the time domain may be interesting |
---|
0:03:49 | And we use average rather than max pooling, but this is just based on experiments |
---|
0:03:55 | Exactly. So basically, because we wanted to compare with the TV system and the GMM-UBM and so on, we wanted to have the same sort of output. What we get from the signal after segmentation is the speech segments |
---|
0:04:14 | and because the spectrograms have to be of a fixed size, we have to divide the speech segments into separate spectrograms, and then average the outputs to get an equivalent of what we get from the TV system, for example |
---|
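The averaging step just described can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the score vectors below are made up, standing in for per-spectrogram CNN outputs over two speakers.

```python
def segment_scores(per_spectrogram_scores):
    """Average per-spectrogram speaker scores into one segment-level vector."""
    n = len(per_spectrogram_scores)
    dim = len(per_spectrogram_scores[0])
    return [sum(s[k] for s in per_spectrogram_scores) / n for k in range(dim)]

# A segment cut into three overlapping spectrograms, each scored for two speakers:
scores = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
avg = segment_scores(scores)  # one score vector for the whole segment
```

The argmax of the averaged vector then gives a single identity per segment, matching the per-segment output of the baseline systems.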
0:04:32 | So for training the CNN we used the following settings; there are more details in the paper, and I'm not going to go into this now, but we tested several settings and these gave the best results |
---|
0:04:48 | The segmentation for getting the speech segments is based on the BIC criterion, and the i-vector setup, the UBM and so on are as detailed in the paper |
---|
0:05:00 | So for the fusion we chose the TV system, because it had the best results, and then we explored three different approaches. The first is late fusion: basically just take the scores from the TV system and from the CNN, and fuse them |
---|
0:05:18 | And then we saw from our experiments that the CNN actually works better for longer segments of speech, which was quite surprising; so we basically wanted to weight its score depending on the duration, and that is the duration-based fusion, where the weight depends on the duration of the track |
---|
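The duration-dependent weighting can be sketched like this. The saturating weight `w = d / (d + tau)` and the constant `tau` are assumptions made for illustration, not the paper's actual weighting function; the score vectors are made up.

```python
def fuse_scores(cnn, tvs, duration_s, tau=5.0):
    """Duration-weighted late fusion of two normalised score vectors.

    The CNN gets more weight for longer tracks (where it performed better);
    w goes to 0 for very short tracks and towards 1 for long ones.
    """
    w = duration_s / (duration_s + tau)
    return [w * c + (1.0 - w) * t for c, t in zip(cnn, tvs)]

short = fuse_scores([0.2, 0.8], [0.6, 0.4], duration_s=1.0)   # TVS dominates
long_ = fuse_scores([0.2, 0.8], [0.6, 0.4], duration_s=30.0)  # CNN dominates
```

A convex combination like this keeps the fused vector normalised whenever both inputs are.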
0:05:44 | And then we wanted to see if an early fusion works: basically take the output of the last hidden layer, reduce it with PCA to have the same dimensionality as an i-vector, and then just concatenate them and train the classifier on that |
---|
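The early-fusion idea, PCA on the last hidden layer followed by concatenation with the i-vector, could look roughly like this in pure Python. Power iteration on the covariance matrix is a toy stand-in for a real PCA library, and all the numbers (activations, i-vector) are invented for the example.

```python
import math

def top_component(data, iters=200):
    """Leading principal direction via power iteration on the covariance
    matrix -- a toy stand-in for a full PCA implementation."""
    d = len(data[0])
    mean = [sum(x[k] for x in data) / len(data) for k in range(d)]
    centred = [[x[k] - mean[k] for k in range(d)] for x in data]
    cov = [[sum(r[i] * r[j] for r in centred) / len(data) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return mean, v

def early_fuse(hidden, ivector, mean, component):
    """Project a hidden-layer activation onto the component, then
    concatenate with the i-vector for the downstream classifier."""
    centred = [h - m for h, m in zip(hidden, mean)]
    projected = sum(c * w for c, w in zip(centred, component))
    return [projected] + list(ivector)

# Toy activations that vary mostly along the (1, 1) direction:
acts = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
mean, comp = top_component(acts)
fused = early_fuse([2.5, 2.5], [0.3, -0.1], mean, comp)  # 1 PCA dim + 2 i-vector dims
```

In the actual system the projection would keep as many components as the i-vector has dimensions, not one.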
0:06:01 | So the dataset we used is REPERE; this is a French-language corpus of TV and radio broadcasts, with seven types of videos, including news, debates, interviews, celebrity gossip, stuff like that. Because of this it's pretty noisy: very often you have background music, you have different voices overlapping, you have street noises, et cetera |
---|
0:06:33 | And it's very unbalanced as well, because you sometimes have, I don't know, politicians, say the president of France, who is present almost constantly in the broadcasts, and then you have this long tail of speakers. So basically the whole training set has over eight hundred speakers, but the test set contains only one hundred thirteen, and luckily these one hundred thirteen actually overlap with the speakers in the train set. And we had (inaudible) hours of speech for training and six for the test |
---|
0:07:15 | This just shows the imbalance in the distribution; this is a logarithmic scale. On the x-axis you have all those one hundred thirteen speakers, and on the y-axis you have the duration per speaker, sorted by the duration in the train set. So basically what you get is that it's very imbalanced: you have some people speaking for forty minutes, and then someone who speaks for just a few seconds |
---|
0:07:46 | And then, as we can see from that spike at the very right, there's actually someone who is almost nonexistent in the train set but is very present in the test data |
---|
0:08:00 | So, pretty difficult. Another feature of this data is that almost a quarter of the speech segments are shorter than two seconds, and seventy percent are shorter than (inaudible) |
---|
0:08:15 | which makes it quite difficult. So basically we used MFCC features, nineteen dimensions of them; all the details are in the paper, but we end up with a fifty-nine-dimensional vector, after some feature warping |
---|
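One standard way to get from 19 static coefficients to a 59-dimensional vector is to append deltas, delta-deltas, energy and delta-energy (19 + 19 + 19 + 1 + 1 = 59); whether the paper composes its vector exactly this way is an assumption. The usual delta regression formula looks like this:

```python
def delta(coeffs, k=2):
    """First-order delta features over per-frame coefficient vectors,
    using the standard regression formula with window half-width k."""
    norm = 2 * sum(n * n for n in range(1, k + 1))
    T, d = len(coeffs), len(coeffs[0])
    out = []
    for t in range(T):
        row = []
        for j in range(d):
            acc = 0.0
            for n in range(1, k + 1):
                hi = coeffs[min(t + n, T - 1)][j]  # repeat edge frames
                lo = coeffs[max(t - n, 0)][j]
                acc += n * (hi - lo)
            row.append(acc / norm)
        out.append(row)
    return out

# Three frames of two coefficients rising linearly:
c = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
d1 = delta(c)  # deltas of the middle frame reflect the slope
```

Applying the same function to `d1` would give the delta-delta (acceleration) coefficients.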
0:08:38 | So for the spectrograms, you have an example of one up here. Each is two hundred forty milliseconds in duration, and there's a big overlap between neighbouring spectrograms: with two hundred milliseconds of overlap, that's over eighty percent |
---|
0:09:02 | And basically, this is the pipeline we use: the audio segments are divided into these 240-millisecond pieces, and then, for each window of twenty milliseconds, we apply Hamming windowing and log-spectral amplitude extraction, and we basically get an individual matrix of forty-eight by one hundred twenty-one pixels |
---|
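The extraction pipeline can be sketched as below. The talk fixes only the 240 ms segment length and the 20 ms window; the sample rate, the hop, and the naive DFT are illustrative assumptions (a real implementation would use an FFT and a hop chosen to hit the stated 48 x 121 matrix size).

```python
import math

def log_spectrogram(signal, rate, win_ms=20, hop_ms=20):
    """Log-amplitude spectrogram: Hamming windows plus a naive one-sided DFT."""
    win = int(rate * win_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
               for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        chunk = [signal[start + n] * hamming[n] for n in range(win)]
        mags = []
        for k in range(win // 2 + 1):  # one-sided spectrum
            re = sum(c * math.cos(2 * math.pi * k * n / win)
                     for n, c in enumerate(chunk))
            im = sum(c * math.sin(2 * math.pi * k * n / win)
                     for n, c in enumerate(chunk))
            mags.append(math.log(math.hypot(re, im) + 1e-10))
        frames.append(mags)
    return frames  # time x frequency matrix

# A 240 ms segment of a 1 kHz tone at an assumed 8 kHz sample rate:
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(int(0.240 * rate))]
spec = log_spectrogram(tone, rate)
```

For the pure tone, the energy concentrates in the DFT bin corresponding to 1 kHz, which is easy to verify.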
0:09:34 | So basically, here are the results. In the upper table we see the results for each individual system, and basically the CNN doesn't work very well, which isn't that surprising considering the way the dataset is structured; but what is pretty surprising is that the i-vector system is also not very good, and actually the GMM-UBM does all right. So basically the best system is the TV one |
---|
0:10:02 | and that is what we used for the fusion afterwards. So basically, in the lower table you have the more detailed results, including the accuracy on the tracks that are less than two seconds long |
---|
0:10:21 | And actually the best approach that we have is just the simple late fusion: basically take the predictions from the CNN and the TV system, sort of normalise them, and combine them |
---|
0:10:35 | And the biggest boost in performance is actually also for the short tracks, the tracks that are shorter than two seconds: from almost forty-one percent and forty-nine percent for the TV system and the CNN respectively, it goes up to fifty-eight percent, which is quite a gain, of course |
---|
0:11:02 | And then the early fusion actually, well, actually decreased the results overall, but for the short durations it's pretty similar. So basically, even though the CNN didn't outperform the baselines, it seems to find different things in the spectrograms, and by fusion we can sort of exploit that and go beyond what was otherwise possible |
---|
0:11:34 | So in the lower plot, the red line is the CNN performance across different duration bins, on a logarithmic scale. You can see that the difference between the CNN and the i-vectors slowly increases along with the duration, and the fusion is actually most helpful for the very short tracks, and then doesn't affect the performance on the longest |
---|
0:12:14 | So that's basically it. We wanted to see how it works, and we conclude that fusing the CNN and the TV system can improve over the baseline systems; maybe more data is required, or better-quality data, especially for the CNN, for it to actually work better on its own. And for perspectives |
---|
0:12:40 | so basically we chose this corpus because it also contains faces and text and stuff like that; what we want to explore is a system that takes both the spectrogram and the face, to identify, let's say, speaking persons, rather than just concentrating on speaker identification in isolation, and we want to have it all compact, like one trainable system |
---|
0:13:05 | And an additional source of insight might be to force a difference in the architecture: so basically, if you have, for example, horizontal or vertical filters, rather than the squares that we use now, you can sort of force it to look more in the time domain or the frequency domain, to sort of look at some patterns there |
---|
0:13:36 | and so that's it, thank you |
---|
0:13:43 | (applause) |
---|
0:13:45 | So we have plenty of time for some questions |
---|
0:13:51 | okay |
---|
0:13:56 | yes |
---|
0:14:15 | ...any kind of segmentation, pre-segmentation, or do you assume that the segmentation is given? |
---|
0:14:21 | So this segmentation is basically an automatic speech segmentation done by the BIC criterion, so it is a pretty old technique, and then we just basically use the segments as they are |
---|
0:14:35 | And it's pretty noisy sometimes; it is very hard sometimes to distinguish, or to filter out, music versus voice and stuff like that, and sometimes segments basically have stretches containing two speakers as well. So, you know, we could probably benefit from using a more sophisticated way to generate the segmentation |
---|
0:15:10 | Okay, maybe also: once you know from the experiments that these features are complementary to the baseline, did you attempt to look at the upper layers learned by the CNN? Like, can you tell what they correspond to, what they mean in terms of the spectrogram? |
---|
0:15:28 | Yes, basically one thing we actually did was to look at the saliency maps |
---|
0:15:36 | So basically, with this, once again, you can actually see the outputs of particular layers, what the CNN looks at in this task in order to make a decision. And what was, I guess, pretty interesting is that most of the features were horizontal, so in the frequency domain. That's one reason why, finally, we want to see what happens if you, like, force not just the vertical ones, and see what happens then |
---|
0:16:23 | ...the segmentation error... |
---|
0:16:27 | The segmentation error? Ah, no, sorry |
---|
0:16:32 | The question was: what fraction of your total data is affected by segmentation errors? |
---|
0:16:40 | Okay, I don't have the number with me, sorry, but it could be checked |
---|
0:17:09 | A comment on the last question: with twenty-five percent of the segments having a duration of less than two seconds, and given that to compute the segmentation score we have this collar of zero point five seconds around the boundaries of each segment, it means that in the case of those twenty-five percent of the data, fifty percent of the speech is not used to compute the segmentation score. So we have to be careful with that if we want to know whether the segmentation error has an impact on speaker identification |
---|
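The arithmetic in this comment can be made explicit (assuming the collar excludes 0.5 s inward of each of a segment's two boundaries, as the commenter describes):

```python
def fraction_scored(segment_s, collar_s=0.5):
    """Fraction of a segment's speech left for scoring when a collar
    around each of its two boundaries is excluded."""
    excluded = min(segment_s, 2 * collar_s)
    return (segment_s - excluded) / segment_s

half = fraction_scored(2.0)    # 2 s segment, 0.5 s collar at each end
most = fraction_scored(10.0)   # longer segments lose proportionally less
```

So a two-second segment indeed has half its speech excluded from the score, while a ten-second segment loses only a tenth.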
0:17:47 | okay |
---|
0:17:48 | thank you |
---|
0:17:56 | We have time for perhaps one final question |
---|
0:18:01 | Okay, thank you everyone; let's thank the speaker again |
---|