0:00:14 | Hi everyone, this is [name unintelligible]. Today I'm going to present early-stop |
---|
0:00:20 | clustering for speaker diarization. [unintelligible] |
---|
0:00:27 | In the beginning, I am going to give a brief introduction to the task of speaker |
---|
0:00:30 | diarization [unintelligible]. |
---|
0:00:35 | As we all know, speaker diarization is a task in speaker recognition, |
---|
0:00:40 | together with identification and verification. |
---|
0:00:43 | The bottom of this figure shows the scenario of speaker diarization: two speakers |
---|
0:00:49 | are talking with each other. Based on the recording, the task of speaker diarization |
---|
0:00:55 | is to |
---|
0:00:56 | identify when each speaker is speaking. |
---|
0:00:59 | Technically, diarization can be decomposed into two steps: segmentation and clustering. |
---|
0:01:07 | In this slide I will go through the most commonly used framework in speaker diarization, |
---|
0:01:12 | that is, agglomerative hierarchical clustering; we use AHC for short. |
---|
0:01:18 | In the beginning, given a conversation, we first do the segmentation. |
---|
0:01:24 | There are usually two methods of segmentation: [unintelligible] segmentation and the segmentation |
---|
0:01:30 | based on speaker change point detection. |
---|
0:01:34 | After obtaining the speech segments, |
---|
0:01:38 | AHC will group the speech segments from the same speaker into the same cluster. |
---|
0:01:43 | In AHC, depending on whether the number of speakers is given or not, |
---|
0:01:50 | we have different operations. |
---|
0:01:52 | When the number of speakers is given to be N, |
---|
0:01:56 | the clustering |
---|
0:01:58 | stops when the number of clusters reaches N, and |
---|
0:02:03 | then each of the N clusters will be used as a representation of a speaker in |
---|
0:02:07 | the conversation. |
---|
0:02:10 | When the number of speakers is not given, we will set a threshold on the |
---|
0:02:15 | speaker similarity of the merging clusters. [unintelligible] |
---|
0:02:20 | [unintelligible] |
---|
0:02:24 | When the |
---|
0:02:26 | speaker similarity of the merging clusters |
---|
0:02:30 | reaches the threshold, |
---|
0:02:32 | AHC will stop. |
---|
0:02:35 | This |
---|
0:02:36 | results in a number of clusters, which is the estimated number of speakers, |
---|
0:02:41 | N-hat, |
---|
0:02:42 | and each of the clusters will be used to represent a specific speaker in the |
---|
0:02:47 | conversation. |
---|
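The two stop criteria just described can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names, the centroid-based cosine similarity, and the toy vectors are all assumptions made for the example.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def centroid(cluster, vectors):
    """Mean vector of the segments in a cluster (assumed cluster model)."""
    dim = len(vectors[0])
    return [sum(vectors[i][d] for i in cluster) / len(cluster) for d in range(dim)]


def ahc(vectors, num_speakers=None, threshold=None):
    """Merge the most similar pair of clusters until a stop criterion is met.

    num_speakers: stop when this many clusters remain (number of speakers given).
    threshold:    stop when the best similarity falls below it (number not given).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        if num_speakers is not None and len(clusters) <= num_speakers:
            break
        best_pair, best_sim = None, -2.0
        for a, b in combinations(range(len(clusters)), 2):
            sim = cosine(centroid(clusters[a], vectors), centroid(clusters[b], vectors))
            if sim > best_sim:
                best_pair, best_sim = (a, b), sim
        if threshold is not None and best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy segment embeddings: two segments per speaker (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
given_n = ahc(toy_vectors, num_speakers=2)
with_threshold = ahc(toy_vectors, threshold=0.5)
strict_threshold = ahc(toy_vectors, threshold=0.999)
```

With a very strict threshold the loop stops before any merge, which is the behaviour the talk later exploits: more clusters than speakers are left over.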
0:02:49 | After AHC, |
---|
0:02:51 | Viterbi re-segmentation is always used. |
---|
0:02:54 | In Viterbi re-segmentation, we first represent each speaker with a GMM. |
---|
0:03:00 | After that we will build an HMM based on the GMMs by adding |
---|
0:03:05 | transition probabilities. |
---|
0:03:07 | Finally, we will align speech frames to the speaker GMMs by Viterbi decoding. |
---|
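The re-segmentation step can be sketched as a plain Viterbi pass over per-frame speaker log-likelihoods. The sketch assumes the GMM scoring has already been done and passed in as `frame_loglik`; the function name, the single `stay_prob` transition parameter, and the toy scores are hypothetical and not taken from the talk.

```python
import math


def viterbi_resegment(frame_loglik, stay_prob=0.9):
    """Assign each frame to a speaker state with Viterbi decoding.

    frame_loglik[t][s] is the log-likelihood of frame t under speaker s's model.
    stay_prob is an assumed self-transition probability; the remaining mass is
    shared equally among the other speakers.
    """
    num_frames = len(frame_loglik)
    num_spk = len(frame_loglik[0])
    if num_spk == 1:
        return [0] * num_frames
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (num_spk - 1))

    # Uniform initial prior: a constant offset, so it is dropped.
    score = list(frame_loglik[0])
    backptr = []
    for t in range(1, num_frames):
        new_score, ptr = [], []
        for s in range(num_spk):
            cands = [score[p] + (log_stay if p == s else log_switch)
                     for p in range(num_spk)]
            best_prev = max(range(num_spk), key=lambda p: cands[p])
            new_score.append(cands[best_prev] + frame_loglik[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)

    # Trace the best path back from the final frame.
    state = max(range(num_spk), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Toy scores: frame 2 slightly favours speaker 1, but the transition penalty
# keeps it with speaker 0 (hypothetical numbers).
toy_loglik = [[-1, -5], [-1, -5], [-3, -2.5], [-1, -5], [-5, -1], [-5, -1], [-5, -1]]
toy_path = viterbi_resegment(toy_loglik)
```

The transition probabilities are what smooth the frame-level decisions; without them each frame would simply go to its best-scoring speaker.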
0:03:18 | Although AHC has been widely used |
---|
0:03:21 | and the performance of AHC has been acknowledged, |
---|
0:03:24 | AHC has some shortcomings. In our work we |
---|
0:03:28 | cope with [unintelligible]: |
---|
0:03:30 | [unintelligible] the clusters [unintelligible] speakers [unintelligible]. |
---|
0:03:35 | In this slide |
---|
0:03:36 | we take the diarization of two speakers as an example: |
---|
0:03:40 | speaker A in blue and speaker B in red. |
---|
0:03:43 | During clustering we will have a cluster of speaker A and a segment consisting of speech from |
---|
0:03:49 | both speakers A and B. |
---|
0:03:52 | [unintelligible] the speaker similarity of the cluster and the segment |
---|
0:03:56 | of mixed speech, |
---|
0:03:58 | they are then merged into a cluster of speaker A. |
---|
0:04:02 | In another scenario, |
---|
0:04:03 | the segments from speaker B may also be merged into speaker A, as shown in the second picture. |
---|
0:04:09 | In both cases the cluster of speaker A will be biased toward speaker B. |
---|
0:04:15 | With the clustering going on, |
---|
0:04:18 | the speech of speakers |
---|
0:04:20 | A and B may [unintelligible]. |
---|
0:04:22 | That means [unintelligible] |
---|
0:04:24 | [unintelligible] |
---|
0:04:25 | [unintelligible]. |
---|
0:04:28 | [unintelligible] statistically. |
---|
0:04:35 | In the beginning, the cluster is composed of segments from A only. |
---|
0:04:41 | With the clustering going on, the cluster is composed of segments from |
---|
0:04:47 | [unintelligible]. |
---|
0:04:51 | The cluster of A is composed of speech segments from both A and B. |
---|
0:04:57 | Our strategy for this problem is to stop early, |
---|
0:05:01 | [unintelligible]. In this |
---|
0:05:06 | way we hope to obtain clean [unintelligible]. |
---|
0:05:13 | For the clusters, [unintelligible]: one is that |
---|
0:05:18 | it should be large enough to provide [unintelligible], and the other is that it |
---|
0:05:25 | should be clean, [unintelligible]. |
---|
0:05:29 | [unintelligible] |
---|
0:05:32 | So there |
---|
0:05:36 | will be a tradeoff between the two factors. |
---|
0:05:38 | We propose early-stop clustering by setting a strict threshold for AHC. The idea is |
---|
0:05:44 | that we will stop AHC early and get more clusters than N, where N is the |
---|
0:05:49 | number of speakers. |
---|
0:05:51 | In the early-stop clustering, the clustering stops at a |
---|
0:05:57 | strict threshold, resulting in K clusters, where K is larger than the anticipated number of |
---|
0:06:03 | speakers, N. |
---|
0:06:05 | Depending on whether N is given or not, we have different implementations. |
---|
0:06:11 | When the number of speakers is not given, we will first estimate it to be |
---|
0:06:17 | N-hat. |
---|
0:06:19 | Then, based on the given or estimated number of speakers, N or N-hat, we run |
---|
0:06:25 | the cluster selection to select N or N-hat clusters from the K clusters. |
---|
0:06:30 | Each of the selected clusters will represent a specific speaker in the speech conversation. |
---|
0:06:36 | In the last step, we will apply Viterbi re-segmentation to align the frames of the |
---|
0:06:42 | whole conversation to the selected clusters. |
---|
0:06:47 | In this slide and the following, |
---|
0:06:49 | we will describe how the estimation of the number of speakers works. |
---|
0:06:54 | First, we compute the speaker similarity score matrix S. |
---|
0:06:58 | Each element of S is the speaker similarity score between two clusters. For |
---|
0:07:04 | example, s_jk is the speaker similarity score between the j-th and k-th |
---|
0:07:10 | clusters. |
---|
0:07:11 | Finally, S will be a symmetric matrix of size K by K. |
---|
0:07:18 | With the score matrix S, we will do an eigenvalue decomposition on it and sort |
---|
0:07:24 | the eigenvalues in order, λ_1 to λ_K. |
---|
0:07:29 | After that we compute the eigen ratio between adjacent eigenvalues. |
---|
0:07:33 | [unintelligible] |
---|
0:07:35 | Finally, the number of speakers, N-hat, will be estimated at the point with the |
---|
0:07:40 | maximum eigen ratio. |
---|
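A minimal sketch of this estimation step, under stated assumptions: the eigenvalues are sorted in descending order, the ratio is taken as λ_i / λ_(i+1), and all eigenvalues are positive. None of these details can be confirmed from the recording, and the function name and toy matrix are hypothetical.

```python
import numpy as np


def estimate_num_speakers(score_matrix):
    """Estimate the number of speakers from a symmetric K x K similarity matrix.

    Assumes positive eigenvalues, sorted in descending order; the estimate is
    the index i (1-based) that maximises eigvals[i-1] / eigvals[i].
    """
    S = np.asarray(score_matrix, dtype=float)
    eigvals = np.linalg.eigvalsh(S)[::-1]   # eigvalsh returns ascending order
    ratios = eigvals[:-1] / eigvals[1:]     # ratio between adjacent eigenvalues
    return int(np.argmax(ratios)) + 1


# Toy K = 4 score matrix with two obvious groups of clusters (hypothetical data).
toy_scores = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
toy_n_hat = estimate_num_speakers(toy_scores)
```

For the toy matrix the eigenvalues are about 2.1, 1.7, 0.1 and 0.1, so the largest gap sits after the second eigenvalue and the estimate is two speakers.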
0:07:45 | With the given or estimated number of speakers, in this slide and the following |
---|
0:07:50 | we will show how the cluster selection works. In brief, the |
---|
0:07:55 | cluster selection is to select N-hat clusters out of the K clusters [unintelligible]. |
---|
0:08:00 | We achieve this by first finding all of the combinations of N-hat |
---|
0:08:05 | indices from the index set 1 to K. |
---|
0:08:10 | After that we form the sub score matrix for each combination by extracting |
---|
0:08:15 | the corresponding rows and columns from S. |
---|
0:08:18 | The sub score matrix will be of dimension |
---|
0:08:22 | N-hat by N-hat. |
---|
0:08:27 | With the sub score matrices, we then do the eigenvalue decomposition on each |
---|
0:08:32 | of them and find the eigenvalues [unintelligible]. |
---|
0:08:37 | Finally, the combination with the maximum eigenvalue summation will be |
---|
0:08:42 | used as |
---|
0:08:43 | the selected clusters. |
---|
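The exhaustive selection described above can be sketched as below. The function name `select_clusters` and the toy matrix are hypothetical. The scoring follows the spoken wording ("maximum eigenvalue summation") literally; for a symmetric matrix that sum equals the trace of the sub matrix, so the criterion used in the paper may well differ from this reading.

```python
from itertools import combinations

import numpy as np


def select_clusters(score_matrix, num_speakers):
    """Pick the index combination whose sub score matrix has the largest eigenvalue sum.

    Literal reading of the talk; treat the scoring rule as an assumption.
    """
    S = np.asarray(score_matrix, dtype=float)
    best_combo, best_sum = None, -np.inf
    for combo in combinations(range(S.shape[0]), num_speakers):
        # Extract the rows and columns that belong to this combination.
        sub = S[np.ix_(combo, combo)]
        eig_sum = float(np.sum(np.linalg.eigvalsh(sub)))
        if eig_sum > best_sum:
            best_combo, best_sum = combo, eig_sum
    return best_combo


# Toy K = 4 score matrix with differing self-scores (hypothetical data).
toy_S = [
    [2.0, 0.2, 0.2, 0.2],
    [0.2, 1.0, 0.2, 0.2],
    [0.2, 0.2, 3.0, 0.2],
    [0.2, 0.2, 0.2, 0.5],
]
toy_selection = select_clusters(toy_S, 2)
```

The search enumerates every combination, so its cost grows combinatorially with K; that is workable only because early stopping is expected to leave a modest number of clusters.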
0:08:47 | So that was the description of the algorithm. Next we will move to the experiments. |
---|
0:08:53 | Our experiments were carried out on the [unintelligible] data set, |
---|
0:08:59 | consisting of two sets: a development set and an evaluation set [unintelligible] |
---|
0:09:04 | [unintelligible]. |
---|
0:09:05 | The duration of the conversations varies [unintelligible] three hundred [unintelligible] two hundred seconds. |
---|
0:09:11 | The number of speakers per conversation ranges from one to nine. |
---|
0:09:15 | In our evaluation we used the diarization error rate, DER, as |
---|
0:09:20 | the metric. |
---|
0:09:22 | We used the ground truth segmentation |
---|
0:09:25 | as the temporal segmentation. |
---|
0:09:28 | It has to be noted that if, in the reference, [unintelligible] speaker B |
---|
0:09:34 | [unintelligible] |
---|
0:09:35 | overlaps, |
---|
0:09:36 | the overlapped segments will be used as individual segments. |
---|
0:09:43 | In our experiments we have two models, the first one being a bottleneck feature extractor |
---|
0:09:49 | with a [unintelligible] model, and the other an x-vector extractor with a ResNet model. |
---|
0:09:54 | For both of the models |
---|
0:09:56 | we used [unintelligible] as input features [unintelligible] |
---|
0:10:02 | [unintelligible]. |
---|
0:10:05 | In the [unintelligible] model, the input layer [unintelligible] |
---|
0:10:12 | with a context [unintelligible] on both the left and the right sides. |
---|
0:10:17 | It has [unintelligible] hidden layers [unintelligible] one thousand and |
---|
0:10:24 | [unintelligible]; the dimension of the last hidden layer was [unintelligible], and its |
---|
0:10:28 | output was used |
---|
0:10:31 | [unintelligible]. |
---|
0:10:33 | In our ResNet model there were nine convolutional layers. |
---|
0:10:38 | [unintelligible] five thousand [unintelligible] |
---|
0:10:42 | [unintelligible] After that, two fully connected layers were used, |
---|
0:10:48 | [unintelligible] one hundred and twenty- |
---|
0:10:53 | eight. |
---|
0:10:53 | [unintelligible] |
---|
0:10:55 | In both models [unintelligible] classification, and the number of training speakers |
---|
0:11:01 | is eleven thousand three hundred and [unintelligible]. |
---|
0:11:08 | We use the conventional AHC as the baseline. In both the conventional |
---|
0:11:14 | clustering and the early-stop clustering we use the x-vector combined with |
---|
0:11:20 | cosine distance as the speaker modeling and the speaker similarity measurement. |
---|
0:11:26 | In the number-of-speakers estimation and cluster selection in our early-stop clustering framework we use |
---|
0:11:34 | the BIC score [unintelligible]. |
---|
0:11:39 | In the re-segmentation phase, |
---|
0:11:41 | we used [unintelligible]. |
---|
0:11:49 | [unintelligible] |
---|
0:11:49 | We first ran experiments in the scenario where the number of speakers was given. |
---|
0:11:55 | This table shows the performance comparison between the conventional AHC and the proposed |
---|
0:12:00 | early-stop clustering on the development and evaluation sets, respectively. |
---|
0:12:06 | From the comparison we can see that the early-stop clustering |
---|
0:12:12 | can provide better performance than the conventional AHC. |
---|
0:12:17 | To understand the reason for [unintelligible], |
---|
0:12:21 | we computed the cluster purity after the whole clustering process of the two systems. [unintelligible] |
---|
0:12:28 | [unintelligible] |
---|
0:12:30 | [unintelligible] |
---|
0:12:33 | [unintelligible] |
---|
0:12:37 | [unintelligible] speaker [unintelligible] the reference. From the comparison |
---|
0:12:42 | we can see the superiority of the early-stop clustering [unintelligible] |
---|
0:12:47 | [unintelligible], |
---|
0:12:50 | so that it can provide a better initialization for the re-segmentation phase. |
---|
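The purity definition is not recoverable from the recording, so the sketch below uses one common definition as an assumption: for each cluster, count the speech that belongs to its dominant reference speaker, then divide the total by all clustered speech. The function name and the optional `durations` argument are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_labels, reference_labels, durations=None):
    """Fraction of speech that belongs to the dominant reference speaker of its cluster.

    cluster_labels[i] and reference_labels[i] describe segment i; durations[i]
    is its length (every segment counts as 1.0 when durations is omitted).
    """
    if durations is None:
        durations = [1.0] * len(cluster_labels)
    per_cluster = {}
    for c, r, d in zip(cluster_labels, reference_labels, durations):
        per_cluster.setdefault(c, Counter())[r] += d
    dominant = sum(max(counts.values()) for counts in per_cluster.values())
    return dominant / sum(durations)


# Toy example: cluster 0 mixes speakers A and B, cluster 1 is clean (hypothetical data).
toy_purity = cluster_purity([0, 0, 0, 1, 1], ["A", "A", "B", "B", "B"])
```

Under this definition a cluster that absorbs a mixed segment, as in the two-speaker example earlier in the talk, lowers the score, which is why purity is a natural check on whether early stopping keeps clusters clean.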
0:12:57 | Then we continued our experiments in the scenario where the number of speakers was not |
---|
0:13:02 | given. |
---|
0:13:03 | This table shows the performance comparison between the conventional AHC and the proposed |
---|
0:13:09 | early-stop clustering on the |
---|
0:13:10 | development and evaluation sets, respectively. |
---|
0:13:14 | From the comparison we can see that the early-stop clustering can achieve better performance than |
---|
0:13:19 | AHC. |
---|
0:13:22 | [unintelligible] |
---|
0:13:25 | [unintelligible] the results reported by different schemes. |
---|
0:13:32 | To have a further understanding of the advantage of early-stop clustering when the number of speakers is not |
---|
0:13:38 | given, [unintelligible] the percentage of speech in the development set whose |
---|
0:13:43 | estimated number of speakers was more than or equal to the ground truth [unintelligible] |
---|
0:13:49 | [unintelligible]. |
---|
0:13:51 | [unintelligible] shows that the early-stop clustering estimates the number of speakers more accurately, [unintelligible] |
---|
0:13:57 | the estimated number of speakers [unintelligible] ground truth. |
---|
0:14:03 | Combined with [unintelligible], this can |
---|
0:14:09 | help us to understand |
---|
0:14:11 | the advantage of the early-stop clustering. |
---|
0:14:18 | In the last experiments we [unintelligible] the threshold in both systems. |
---|
0:14:23 | [unintelligible] |
---|
0:14:24 | [unintelligible] we evaluated the threshold [unintelligible] zero point one [unintelligible] |
---|
0:14:30 | table one of the paper. We can see that the early-stop clustering |
---|
0:14:34 | provided [unintelligible] AHC. |
---|
0:14:39 | [unintelligible] |
---|
0:14:40 | [unintelligible] That means that the early-stop |
---|
0:14:45 | clustering is less sensitive to the threshold |
---|
0:14:48 | and more robust [unintelligible]. |
---|
0:14:53 | Finally we come to the conclusion. |
---|
0:14:55 | In this paper we proposed an early-stop strategy for AHC-based speaker diarization, |
---|
0:15:00 | consisting of two steps. |
---|
0:15:03 | [unintelligible] the number of initial clusters [unintelligible] speakers; then |
---|
0:15:09 | we [unintelligible] |
---|
0:15:11 | [unintelligible] number of speakers. |
---|
0:15:14 | The advantage of the proposed method was demonstrated from two aspects. |
---|
0:15:19 | The first is that it performs better than AHC-based speaker diarization, whether the |
---|
0:15:25 | number of speakers was given or not. |
---|
0:15:28 | The second one is that the proposed similarity-matrix-based estimation of the |
---|
0:15:34 | number of speakers and [unintelligible] make the threshold |
---|
0:15:38 | setting process relatively simple and robust. |
---|
0:15:44 | That's all of my presentation. Thank you. |
---|