0:00:13 | hi |
---|
0:00:14 | thanks for watching this video |
---|
0:00:16 | i am from duke kunshan university |
---|
0:00:20 | and here i have a brief introduction to a paper |
---|
0:00:24 | dihard two is still hard experimental results and discussions from the dku-lenovo team |
---|
0:00:33 | in this paper we present the submitted system for the second dihard speech |
---|
0:00:38 | diarization challenge |
---|
0:00:40 | our diarization system includes multiple modules |
---|
0:00:43 | namely voice activity detection speaker embedding extraction similarity measurement clustering resegmentation and |
---|
0:00:51 | overlap detection |
---|
0:00:54 | for each module we explore different techniques to enhance the performance |
---|
0:00:59 | our final submission employs the resnet-lstm based vad the deep resnet |
---|
0:01:06 | based speaker embedding |
---|
0:01:08 | the lstm based similarity scoring and |
---|
0:01:11 | spectral clustering |
---|
0:01:13 | vb diarization is also applied in the resegmentation stage |
---|
0:01:18 | and overlap detection also brings slight improvement |
---|
0:01:23 | our proposed system achieves eighteen point eight four percent der for track one and twenty seven |
---|
0:01:29 | point nine zero percent der for track two |
---|
0:01:34 | although our systems have reduced the ders by twenty seven point five percent |
---|
0:01:39 | and thirty one point seven percent relative against the official baselines |
---|
0:01:45 | we believe that the diarization task is still very difficult |
---|
0:01:51 | data analysis |
---|
0:01:53 | we carry out a data analysis on the development set to show how hard the competition |
---|
0:02:00 | is |
---|
0:02:01 | several indicators are considered |
---|
0:02:04 | the duration of the audios |
---|
0:02:07 | the number of speakers |
---|
0:02:09 | speech percentage and the overlap error |
---|
0:02:12 | overlap error determines the minimum diarization error rate a system is able to |
---|
0:02:18 | achieve without handling overlapped speech |
---|
0:02:22 | it is defined as follows |
---|
0:02:27 | where s i denotes |
---|
0:02:28 | the speech regions of speaker i |
---|
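The equation itself is not carried by the transcript. A minimal sketch of one reading that matches the spoken description is given below: it assumes the overlap error is the share of total speaker time that is lost when only one speaker can be output at any instant, that is, (sum of |S_i| minus the length of the union of all S_i) divided by the sum of |S_i|. The function name `overlap_error` and the input layout are hypothetical, and each speaker's own regions are assumed not to overlap each other.

```python
def overlap_error(speaker_regions):
    """Share of speaker time missed when at most one speaker is output per instant.

    speaker_regions: dict mapping speaker id -> list of (start, end) in seconds.
    Assumes the regions of a single speaker do not overlap each other.
    """
    total = 0.0   # sum over speakers of |S_i|
    events = []   # sweep-line events: +1 at a region start, -1 at a region end
    for regions in speaker_regions.values():
        for start, end in regions:
            total += end - start
            events.append((start, 1))
            events.append((end, -1))
    if total == 0:
        return 0.0

    events.sort()
    union = 0.0   # length of the union of all S_i
    active = 0    # number of speakers talking since the previous event
    prev = None
    for time, delta in events:
        if active > 0:
            union += time - prev
        active += delta
        prev = time
    return (total - union) / total


# Two speakers overlapping for 2 s out of 14 s of total speaker time.
print(overlap_error({"A": [(0.0, 10.0)], "B": [(8.0, 12.0)]}))
```

Under this reading, a recording without any overlapped speech has an overlap error of zero, and the value grows with the amount of simultaneous speech.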
0:02:32 | in summary the competition is hard mainly because first the audios are drawn from a |
---|
0:02:39 | diverse set of challenging domains |
---|
0:02:42 | second the number of speakers varies in a very large range |
---|
0:02:47 | third |
---|
0:02:48 | high overlap error causes poor der |
---|
0:02:53 | multiple datasets are employed in our experiments for training |
---|
0:02:58 | voxceleb one and voxceleb two contain short utterances from thousands of speakers |
---|
0:03:04 | suitable for speaker embedding training |
---|
0:03:07 | multi-speaker audios are drawn from the databases in the meeting and telephone domains |
---|
0:03:14 | the meeting data consist of |
---|
0:03:16 | icsi isl nist and spine one and spine two |
---|
0:03:22 | the telephone data comes from |
---|
0:03:24 | the multilingual callhome dataset |
---|
0:03:27 | including arabic |
---|
0:03:29 | english |
---|
0:03:30 | german |
---|
0:03:31 | japanese |
---|
0:03:32 | mandarin and spanish |
---|
0:03:35 | they are used for training voice activity detection |
---|
0:03:39 | similarity measurement and overlap detection |
---|
0:03:43 | musan and rirs corpora |
---|
0:03:46 | are employed for data augmentation |
---|
0:03:50 | voice activity detection |
---|
0:03:53 | webrtc vad is the official baseline for track two |
---|
0:03:57 | it splits the audio into frames with twenty milliseconds |
---|
0:04:01 | duration |
---|
0:04:03 | for each input frame |
---|
0:04:05 | webrtc generates a speech or non-speech decision |
---|
0:04:08 | an optional setting of webrtc is the aggressiveness mode |
---|
0:04:12 | zero is the least aggressive |
---|
0:04:15 | while |
---|
0:04:16 | three is the most aggressive about filtering out non-speech |
---|
0:04:21 | we also propose an lstm based approach for the vad task |
---|
0:04:26 | the neural network as shown in figure two consists of a resnet module |
---|
0:04:32 | multiple bidirectional lstm layers and linear layers |
---|
0:04:37 | our motivation is that the resnet module |
---|
0:04:40 | generates representative feature mappings for speech and non-speech |
---|
0:04:45 | and then the bidirectional lstms capture sequential information |
---|
0:04:50 | the input is a long sequence of frame-level features |
---|
0:04:55 | each frame in the sequence is fed into the resnet |
---|
0:04:59 | generating multiple channel |
---|
0:05:03 | feature maps |
---|
0:05:05 | we apply global average pooling |
---|
0:05:07 | on each channel and get a fixed dimensional vector |
---|
0:05:11 | next are bidirectional lstm layers to capture the forward and backward sequence |
---|
0:05:16 | information |
---|
0:05:18 | finally |
---|
0:05:19 | outputs from the bidirectional lstms |
---|
0:05:21 | pass through the linear layers |
---|
0:05:23 | ending with the sigmoid function |
---|
0:05:26 | and generate the speech posteriors |
---|
0:05:31 | after voice activity detection |
---|
0:05:33 | we employ a sliding window of one point five seconds length and zero point |
---|
0:05:38 | seven five |
---|
0:05:39 | second shift to split speech into short segments |
---|
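The uniform segmentation step can be sketched as follows. This is an illustration only: the function name `sliding_segments` is hypothetical, the window length follows the one point five seconds mentioned in the talk, and the shift is left as an explicit argument because it is a free parameter of the sketch. A final shorter segment is kept so that no detected speech is dropped, which is also an assumption.

```python
def sliding_segments(start, end, window, shift):
    """Cut one speech region [start, end) into overlapping short segments.

    window and shift are in seconds. The last segment is truncated at the
    end of the region rather than discarded (a choice made for this sketch).
    """
    segments = []
    k = 0
    while True:
        seg_start = start + k * shift   # index-based to avoid float drift
        if seg_start >= end:
            break
        seg_end = min(seg_start + window, end)
        segments.append((seg_start, seg_end))
        if seg_end >= end:
            break
        k += 1
    return segments


# A 3 second speech region cut with a 1.5 s window and a 0.75 s shift.
for seg in sliding_segments(0.0, 3.0, window=1.5, shift=0.75):
    print(seg)
```

Each resulting segment is then passed to the speaker embedding extractor, so a shorter shift gives finer speaker boundaries at the cost of more embeddings to score.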
0:05:44 | then speaker embeddings are extracted from the segments |
---|
0:05:48 | here we consider three models |
---|
0:05:50 | i-vector x-vector and the resnet vector |
---|
0:05:55 | for the i-vector extractor we follow the kaldi dihard eighteen v one recipe |
---|
0:06:00 | where is that in colour t and height of also audios for system training |
---|
0:06:06 | for the x-vector we also follow the kaldi dihard recipe |
---|
0:06:11 | to train the model |
---|
0:06:14 | as for the resnet vector |
---|
0:06:16 | it consists of three main components |
---|
0:06:20 | a resnet front-end a two-dimensional statistics pooling layer |
---|
0:06:25 | and a feed-forward network |
---|
0:06:28 | the feed-forward network includes two linear layers |
---|
0:06:32 | with dropout of zero point five in between |
---|
0:06:36 | given a sequence of input features |
---|
0:06:39 | the resnet front-end first converts them into multiple channel feature maps |
---|
0:06:45 | then the statistics pooling layer calculates the mean and standard deviation statistics for |
---|
0:06:52 | each channel |
---|
0:06:53 | generating the utterance level representation of |
---|
0:06:56 | two c dimension |
---|
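A minimal sketch of the pooling step is shown below, assuming that the statistics are the mean and the standard deviation of each channel taken over all frequency and time positions, so that C channels give a vector of size two C. The function name `statistics_pooling` and the flat per-channel input layout are choices made for this illustration, not details from the talk.

```python
import math


def statistics_pooling(feature_maps):
    """Collapse C feature maps into one utterance-level vector of size 2 * C.

    feature_maps: list of C channels, each a flat list of activations
    gathered over all frequency and time positions of that channel.
    Returns the C means followed by the C standard deviations.
    """
    means, stds = [], []
    for channel in feature_maps:
        n = len(channel)
        mean = sum(channel) / n
        variance = sum((v - mean) ** 2 for v in channel) / n
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means + stds


# Two channels in, four values out: two means, then two standard deviations.
print(statistics_pooling([[1.0, 3.0], [2.0, 2.0]]))
```

Because the pooling removes the time axis, segments of different lengths all map to a vector of the same size, which is what lets the feed-forward network that follows operate on a fixed input dimension.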
0:06:58 | lastly the feed-forward network transforms the utterance level feature representation to speaker posteriors |
---|
0:07:07 | the embedding dimension is one hundred and twenty eight |
---|
0:07:11 | training is also performed with data augmentation |
---|
0:07:15 | and detailed parameters can be viewed in table three |
---|
0:07:20 | given a speaker embedding sequence x one x two to x n |
---|
0:07:25 | we compute the similarity score s i j between any two speaker embeddings x i x |
---|
0:07:31 | j |
---|
0:07:33 | and construct the similarity matrix s of size n times n |
---|
0:07:38 | the first method for similarity measurement is plda |
---|
0:07:43 | it can be expressed as follows |
---|
0:07:47 | h zero assumes that the embeddings i and j are from different speakers |
---|
0:07:53 | while h one assumes that they are from the same speaker |
---|
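The slide equation is not carried by the transcript. The usual form of the plda score, given here as an assumption about what was shown, is the log-likelihood ratio between the same-speaker and different-speaker hypotheses for the two embeddings:

```latex
s_{ij} = \log \frac{p(x_i, x_j \mid H_1)}{p(x_i, x_j \mid H_0)}
       = \log \frac{p(x_i, x_j \mid H_1)}{p(x_i \mid H_0)\, p(x_j \mid H_0)}
```

The second equality holds because under the different-speaker hypothesis the two embeddings are modelled as independent. A positive score favours the same speaker and a negative score favours different speakers.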
0:07:58 | the plda model is trained with voxceleb and whitened by the dihard two development set |
---|
0:08:05 | we must notice that |
---|
0:08:07 | plda handles speaker embeddings in a pairwise and independent manner |
---|
0:08:13 | which ignores the sequential information |
---|
0:08:16 | therefore we propose the lstm based scoring model to capture the forward and backward |
---|
0:08:22 | messages |
---|
0:08:24 | in comparison with p lda |
---|
0:08:27 | scores are calculated between vector and sequence rather than vector and vector |
---|
0:08:33 | given speaker embeddings x one x two to x n |
---|
0:08:37 | the i-th embedding x i is compared with the whole sequence |
---|
0:08:43 | we feed this sequence into an lstm network |
---|
0:08:47 | and generate scores of the input concatenated vectors |
---|
0:08:51 | as shown in equation seven |
---|
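One way to build the input for scoring a vector against a sequence is sketched below. It assumes that the chosen embedding is concatenated with every embedding of the sequence in order, so that a recurrent network can output one score per position. The network itself is omitted, and the function name `build_pair_sequence` is hypothetical.

```python
def build_pair_sequence(embeddings, i):
    """Pair the i-th embedding with every embedding of the sequence.

    embeddings: list of n embeddings, each a list of floats of size d.
    Returns n vectors of size 2 * d, where position j holds
    [x_i ; x_j]. Feeding these n vectors to a sequence model would give
    the i-th row of the similarity matrix, one score per position j.
    """
    anchor = embeddings[i]
    return [anchor + other for other in embeddings]


# Three toy 2-dimensional embeddings, scored against the second one.
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for pair in build_pair_sequence(sequence, 1):
    print(pair)
```

Repeating this for every i gives n input sequences of length n, and stacking the n output rows gives the full n by n score matrix.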
0:08:55 | the neural network for scoring |
---|
0:08:58 | includes two bidirectional lstm layers and two linear layers |
---|
0:09:03 | the output layer is one dimensional connected with the sigmoid function |
---|
0:09:10 | in the clustering stage |
---|
0:09:12 | two methods are explored |
---|
0:09:15 | the first method is agglomerative hierarchical clustering |
---|
0:09:20 | which are from as the random mutually between precise |
---|
0:09:24 | segments are initialized as individual clusters |
---|
0:09:29 | and each time the two clusters with the highest score are merged until the chosen |
---|
0:09:35 | threshold is met |
---|
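The merging procedure can be sketched as below. Two details are assumptions made for this illustration and are not stated in the talk: the score between two clusters is the average of the pairwise segment scores (average linkage), and merging stops once the best available score falls below a threshold. The function name `ahc` is hypothetical.

```python
def ahc(similarity, threshold):
    """Agglomerative hierarchical clustering on a similarity matrix.

    similarity: n x n list of lists, higher means more similar.
    threshold: merging stops when the best cluster pair scores below it.
    Returns one cluster label per segment.
    """
    n = len(similarity)
    clusters = [[i] for i in range(n)]  # every segment starts as a cluster

    def score(a, b):
        # Average linkage: mean pairwise similarity between two clusters.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = score(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        if best[0] < threshold:
            break
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the best pair
        del clusters[q]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


toy = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(ahc(toy, threshold=0.5))
```

In this sketch the threshold controls the number of speakers: a higher value stops merging earlier and leaves more clusters.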
0:09:37 | and the other method is spectral clustering |
---|
0:09:41 | it is a graph based clustering algorithm |
---|
0:09:44 | given the similarity matrix s |
---|
0:09:47 | we consider s i j as the weight of the edge between node i |
---|
0:09:52 | and node j in an undirected graph |
---|
0:09:56 | by removing weak edges with small weights |
---|
0:09:59 | spectral clustering divides the original graph into multiple subgraphs |
---|
0:10:05 | each subgraph is a cluster |
---|
0:10:09 | of course there |
---|
0:10:11 | the resegmentation stage is applied to refine the clustering results |
---|
0:10:17 | gmm resegmentation is done by constructing the speaker-specific gmms |
---|
0:10:23 | for each speaker according to clustering results |
---|
0:10:27 | then for each frame in the audio |
---|
0:10:30 | we assign it to the gmm with the highest posterior |
---|
0:10:35 | the process iterates until convergence |
---|
0:10:39 | and the other method |
---|
0:10:41 | is vb diarization |
---|
0:10:43 | it constructs a gmm model |
---|
0:10:45 | with eigenvoice priors |
---|
0:10:49 | in this implementation all speaker-specific gmms share the same component weights and covariance |
---|
0:10:57 | matrices |
---|
0:10:58 | besides |
---|
0:10:59 | the mean vectors are projected from total variability subspace |
---|
0:11:06 | with some progress |
---|
0:11:08 | vb diarization gives better resegmentation performance |
---|
0:11:13 | the last module we consider is overlap detection |
---|
0:11:17 | the model structure and training configurations are all the same as those |
---|
0:11:23 | in the resnet-lstm voice activity detection system |
---|
0:11:27 | except that we change the labels from speech non-speech to overlap non-overlap |
---|
0:11:34 | for testing |
---|
0:11:36 | if a segment is inferred as overlapped speech |
---|
0:11:40 | we extend its boundary by twenty frames and take all speakers appearing in |
---|
0:11:46 | the extended segment as the labels of the original segment |
---|
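The relabeling of a detected overlap segment can be sketched as follows. The sketch assumes that the boundary is extended by the same number of frames on both sides, that the single-speaker diarization output is available as one label per frame, and that non-speech frames are marked with `None`. The function name `relabel_overlap` is hypothetical.

```python
def relabel_overlap(frame_speakers, seg_start, seg_end, extension=20):
    """Assign speakers to a segment detected as overlapped speech.

    frame_speakers: per-frame speaker label from the single-speaker
    diarization output, with None for non-speech frames.
    seg_start, seg_end: frame indices of the detected segment [start, end).
    extension: number of frames added on each side of the segment.
    Returns the sorted list of speakers found in the extended segment.
    """
    lo = max(0, seg_start - extension)
    hi = min(len(frame_speakers), seg_end + extension)
    speakers = {s for s in frame_speakers[lo:hi] if s is not None}
    return sorted(speakers)


# Speaker A talks for 30 frames, then speaker B; overlap detected at the turn.
frames = ["A"] * 30 + ["B"] * 30
print(relabel_overlap(frames, seg_start=25, seg_end=35))
```

The extension matters because overlaps tend to occur around speaker changes: widening the window makes it more likely that both the outgoing and the incoming speaker fall inside it.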
0:11:53 | experimental results |
---|
0:11:55 | first we evaluate voice activity detection performance |
---|
0:11:59 | we perform an independent evaluation on webrtc and our resnet-lstm based vad |
---|
0:12:07 | the metric used is the accuracy rate |
---|
0:12:09 | and results are shown in table four |
---|
0:12:12 | initially without model adaptation |
---|
0:12:16 | our proposed model is just slightly better than the official baseline |
---|
0:12:20 | however if we fine-tune the model on the dihard development set |
---|
0:12:26 | accuracy can be increased to ninety one point four percent on the eval set |
---|
0:12:32 | this is because our training data is drawn from the meeting and telephone |
---|
0:12:37 | domains |
---|
0:12:39 | while the dihard database covers eleven domains |
---|
0:12:43 | domain mismatch leads to worse performance |
---|
0:12:46 | while model adaptation brings significant improvement |
---|
0:12:52 | in table five |
---|
0:12:54 | we compare different combinations of the speaker embedding |
---|
0:12:58 | similarity scoring and clustering methods in track one |
---|
0:13:03 | it is observed that the resnet embedding out |
---|
0:13:07 | performs i-vector and x-vector in all combinations |
---|
0:13:11 | besides |
---|
0:13:13 | lstm based scoring followed by spectral clustering achieves |
---|
0:13:19 | better der in comparison to plda with ahc |
---|
0:13:24 | the best single system is system six |
---|
0:13:27 | which achieves a der of twenty point eight seven percent |
---|
0:13:32 | when we fuse systems two four and six by averaging their score matrices |
---|
0:13:38 | the der further reduces to and you one to four percent |
---|
0:13:46 | resegmentation is carried out on the best single system and the fusion system |
---|
0:13:52 | results are shown in figure six |
---|
0:13:55 | in our expectation |
---|
0:13:57 | the vb algorithm should outperform the gmm based one |
---|
0:14:02 | and resegmentation methods |
---|
0:14:04 | should bring similar improvement for both systems |
---|
0:14:08 | to our surprise |
---|
0:14:10 | for the fusion system diarization predictions after resegmentation do not become more accurate |
---|
0:14:18 | the most significant improvement is given by system six with vb diarization |
---|
0:14:25 | which reduces the der by one point six five percent absolute |
---|
0:14:32 | the last step in our diarization system is overlap detection |
---|
0:14:37 | since the overlap error is as high as time for instance is present on the development |
---|
0:14:42 | set |
---|
0:14:44 | it is natural for us to assume that there is around ten percent |
---|
0:14:48 | of the same kind of error on the eval set |
---|
0:14:52 | experiments are carried out on system six with vb diarization |
---|
0:14:57 | results are shown in table seven |
---|
0:15:00 | detecting overlapped speech only slightly improves the der by |
---|
0:15:04 | zero point three eight percent on track one and zero point six nine |
---|
0:15:10 | percent on track two |
---|
0:15:12 | it is very challenging because we detect less than |
---|
0:15:17 | ten percent of the overlapped speech |
---|
0:15:22 | last to understand how our system performs in each specific domain |
---|
0:15:28 | we show the ders of the development set on system six |
---|
0:15:32 | per domain |
---|
0:15:34 | results are shown in figure three |
---|
0:15:38 | the system performs worse on these domains |
---|
0:15:42 | restaurant |
---|
0:15:43 | web video meeting and child |
---|
0:15:46 | three of which are difficult to handle due to high overlap errors |
---|
0:15:53 | the child domain |
---|
0:15:55 | despite a low overlap error rate has a high der of |
---|
0:16:00 | so these data points that you eight percent |
---|
0:16:03 | it is probably because the audios are drawn from the seedlings corpus |
---|
0:16:08 | with children from six to eighteen months old |
---|
0:16:13 | this is a mismatch |
---|
0:16:14 | in comparison with the speakers in our training database |
---|
0:16:18 | as a result |
---|
0:16:21 | system six performs poorly in these four challenging domains |
---|
0:16:28 | thanks for watching |
---|