0:00:12 | Hello everyone. |
0:00:14 | Thanks for watching this video. |
0:00:16 | I'm [name unclear] from [university unclear]. |
0:00:19 | Here I will give a brief introduction to our paper, |
0:00:24 | "Optimal Mapping Loss: |
0:00:26 | A Faster Loss for End-to-End Speaker Diarization". |
0:00:33 | Recently, neural network based approaches have become more and more popular for |
0:00:40 | the modules of speaker diarization, |
0:00:42 | such as voice activity detection, |
0:00:45 | speaker embedding extraction, and clustering. |
0:00:49 | However, |
0:00:50 | end-to-end speaker diarization still remains challenging, |
0:00:55 | partly due to the difficulty of loss design caused by the speaker label ambiguity problem. |
0:01:02 | The permutation invariant training loss, known as PIT loss, can be a possible |
0:01:08 | solution, |
0:01:09 | and has been applied to the SA-EEND network, |
0:01:14 | but its time complexity increases factorially as the number of speakers increases. |
0:01:23 | In this paper, |
0:01:24 | we investigate improvements on the PIT calculation, and further propose a novel optimal mapping loss, |
0:01:32 | which directly computes the best match between the output speaker sequence and the ground |
0:01:39 | truth speaker sequence with the Hungarian algorithm. |
0:01:43 | Our proposed loss significantly reduces the cost to polynomial time, |
0:01:50 | while keeping the same performance as the PIT loss. |
0:01:57 | So what is the speaker label ambiguity problem? |
0:02:01 | Given an audio with ground truth labels |
0:02:04 | A, B, B, C, where A, B, C are the speakers, |
0:02:10 | we naturally convert all the speakers into integers, |
0:02:13 | and get the encoded labels |
0:02:16 | 1, 2, 2, 3. |
0:02:19 | For outputs like 1, 2, 2, 3 and 2, 1, 1, 3, |
0:02:25 | both should be correct from the view of speaker diarization. |
0:02:29 | But |
0:02:30 | traditional loss functions like binary cross entropy loss assign the first output a low |
0:02:37 | loss, |
0:02:38 | while assigning the second output a high loss. |
0:02:42 | This obviously does not meet our expectations. |
0:02:47 | The reason behind this is that speaker diarization only focuses on the relative difference of |
0:02:54 | speaker identities. |
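A minimal sketch (not from the paper) of the ambiguity just described: a confident output whose speaker columns are permuted relative to the ground truth is an equally valid diarization, yet plain binary cross entropy scores the two orderings very differently. All tensors and values here are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Ground truth for 3 speakers over 4 frames: rows are frames, columns are speakers.
y = torch.tensor([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])

# An essentially correct, confident output, but with speaker columns 0 and 1 swapped.
y_hat = y[:, [1, 0, 2]] * 0.98 + 0.01   # probabilities 0.99 / 0.01

print(F.binary_cross_entropy(y_hat, y[:, [1, 0, 2]]))  # low: matched column order
print(F.binary_cross_entropy(y_hat, y))                # high: same diarization, unmatched order
```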
0:02:59 | Let's have a review of SA-EEND, the latest end-to-end speaker diarization system, |
0:03:05 | and see how it solves the problem. |
0:03:08 | SA-EEND uses the encoder of the Transformer model, |
0:03:14 | with the positional encoding removed. |
0:03:17 | To be specific, |
0:03:19 | given an input sequence of log-mel filterbank features, |
0:03:24 | it directly generates speaker posteriors with a single model. |
0:03:30 | In addition, |
0:03:31 | the PIT loss is used to cope with the speaker label ambiguity |
0:03:36 | problem. |
0:03:38 | Figure 2 shows an overview of SA-EEND for the two-speaker case. |
0:03:44 | Given a sequence of features |
0:03:47 | x_1, x_2, ..., x_T, |
0:03:50 | SA-EEND encodes the ground truth labels as follows. |
0:03:57 | T is the duration and N is the number of speakers in the audio. |
0:04:03 | The speaker label y_t indicates the joint activities of the N speakers at |
0:04:09 | time t. |
0:04:11 | It should be noted that y_t is a binary vector, but not a one-hot |
0:04:16 | vector. |
0:04:17 | For non-speech regions, |
0:04:19 | y_t is filled with zeros. |
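The slide's formula is not captured in the transcript; the following is a sketch of the standard multi-hot encoding implied by the description above (notation assumed, not copied from the paper):

```latex
\[
Y = (y_{t,n}) \in \{0,1\}^{T \times N}, \qquad
y_{t,n} =
\begin{cases}
1, & \text{speaker } n \text{ is active at time } t,\\
0, & \text{otherwise.}
\end{cases}
\]
% y_t may contain several ones (overlapped speech) or all zeros (non-speech),
% hence a binary vector rather than a one-hot vector.
```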
0:04:23 | The speech feature sequence that is employed as the network |
0:04:29 | input is handled as follows. |
0:04:32 | First, |
0:04:33 | the raw features are transformed by a linear layer, |
0:04:37 | and then fed into multiple stacked Transformer encoder layers. |
0:04:44 | The output is then passed through the second linear layer and the sigmoid function, |
0:04:49 | generating the N speakers' posteriors at each time moment. |
0:04:56 | We believe that you are familiar with the Transformer, |
0:04:59 | so we skip the introduction here. |
0:05:02 | For more details, |
0:05:04 | please read the paper. |
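Since the talk skips the architectural details, here is a minimal PyTorch sketch of the pipeline just described (linear layer, stacked Transformer encoder layers without positional encoding, a second linear layer, and a sigmoid); all sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SAEENDSketch(nn.Module):
    """Linear -> stacked Transformer encoder layers -> linear -> sigmoid."""
    def __init__(self, feat_dim=345, d_model=256, n_heads=4, n_layers=4, n_speakers=2):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # no positional encoding
        self.proj_out = nn.Linear(d_model, n_speakers)

    def forward(self, x):                         # x: (batch, T, feat_dim)
        h = self.encoder(self.proj_in(x))         # frame-level embeddings
        return torch.sigmoid(self.proj_out(h))    # per-frame speaker posteriors

posteriors = SAEENDSketch()(torch.randn(1, 500, 345))  # shape (1, 500, 2), values in (0, 1)
```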
0:05:08 | The system output Ŷ and the ground truth labels Y, column-wise, |
0:05:15 | are in the following forms, |
0:05:18 | where ŷ_n and y_n have the same shape of T, |
0:05:24 | which indicate the posteriors and labels of speaker n over time, respectively. |
0:05:31 | The speaker label ambiguity problem here is that, |
0:05:35 | no matter how you shuffle Y by column, |
0:05:39 | it is still a valid label representation. |
0:05:42 | To cope with this problem, |
0:05:45 | the PIT loss considers all column-wise permutations of Y, and computes the binary cross |
0:05:52 | entropy loss between Ŷ and each kind of permutation. |
0:05:58 | Each permutation is represented as indexes of 1 to N, |
0:06:04 | and the minimal loss is returned for backpropagation. |
0:06:11 | In brief, |
0:06:12 | the PIT loss function can be written as follows. |
0:06:17 | Here perm(N) denotes all possible permutations of 1 to N, |
0:06:23 | and the PIT loss computation time complexity is |
0:06:27 | O(T × N × N!). |
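A direct implementation of the PIT loss as just described, given as a sketch (not the paper's code): every one of the N! column permutations of Y is scored with BCE and the minimum is kept, which is where the O(T × N × N!) cost comes from.

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def pit_loss_naive(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """y_hat, y: (T, N). Returns the minimum BCE over all column permutations of y."""
    n = y.shape[1]
    losses = [F.binary_cross_entropy(y_hat, y[:, list(phi)])
              for phi in permutations(range(n))]   # N! full BCE evaluations over T frames
    return torch.stack(losses).min()
```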
0:06:34 | The time cost of the PIT loss becomes expensive as the number of speakers increases. |
0:06:41 | To deal with this issue, |
0:06:43 | we come up with the first improvement for the PIT loss. |
0:06:48 | There are redundant computations in the process of PIT computation. |
0:06:53 | To prove it, |
0:06:55 | we first rewrite the equation as follows. |
0:07:00 | Since both the column index n and φ(n) range from 1 to N, |
0:07:06 | only N² pairs actually occur in the computation of the PIT loss function. |
0:07:13 | However, |
0:07:14 | in the permutation process, |
0:07:15 | the function is called for N × N! times, |
0:07:21 | so each pair is repeatedly computed. |
0:07:25 | Our proposed idea is simple. |
0:07:28 | We first compute |
0:07:30 | the BCE losses of all N² (ŷ, y) pairs and store them in |
0:07:35 | a loss matrix. |
0:07:37 | Then, in the permutation process, |
0:07:40 | given a (ŷ_n, y_φ(n)) pair, |
0:07:44 | we just index and return the corresponding element. |
0:07:49 | The details are shown in Algorithm 1, |
0:07:53 | and the time complexity is |
0:07:56 | O(T × N²) + O(N × N!). |
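A sketch of this first improvement under the same assumptions as the naive version above: the N × N pairwise losses are computed once, and each permutation then only performs lookups, giving O(T × N²) + O(N × N!).

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def pit_loss_precomputed(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """y_hat, y: (T, N). Same value as pit_loss_naive, far fewer BCE evaluations."""
    n = y.shape[1]
    # L[i, j] = BCE between output column i and ground truth column j: O(T * N^2).
    L = torch.stack([torch.stack([F.binary_cross_entropy(y_hat[:, i], y[:, j])
                                  for j in range(n)])
                     for i in range(n)])
    # Each permutation is now just N index lookups: O(N * N!).
    return min(sum(L[i, phi[i]] for i in range(n)) / n
               for phi in permutations(range(n)))
```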
0:08:04 | Due to the construction of the loss matrix L, |
0:08:08 | we achieve the first improvement on the PIT calculation. |
0:08:12 | However, |
0:08:13 | the computational cost still increases factorially |
0:08:17 | when N is large. |
0:08:19 | To deal with the problem, |
0:08:21 | we must remove the permutation process. |
0:08:25 | Because the relationship between ŷ_n and y_n is described by |
0:08:30 | the BCE loss, |
0:08:32 | and each ŷ_n must be assigned one and only one optimal |
0:08:38 | y_n in the final result, |
0:08:40 | as shown in Figure 4, |
0:08:44 | this is a typical assignment problem. |
0:08:47 | Thus, we propose the optimal mapping loss, |
0:08:51 | which employs the Hungarian algorithm to find the best matching between Ŷ and |
0:08:58 | Y. |
0:09:01 | The Hungarian algorithm is an algorithm that solves the assignment problem in polynomial |
0:09:08 | time. |
0:09:09 | In our case, |
0:09:11 | each element (i, j) of the cost matrix is defined as the cost of assigning ground truth speaker j |
0:09:19 | to output speaker i. |
0:09:22 | The goal is to find the optimal assignment indexes, |
0:09:25 | so that the overall cost is the lowest. |
0:09:31 | The Hungarian algorithm can be described as follows. |
0:09:36 | Its time complexity is |
0:09:39 | O(N³), |
0:09:41 | and our optimal mapping loss is shown in Algorithm 2. |
0:09:47 | In total, |
0:09:49 | the complexity is O(T × N²) |
0:09:57 | plus O(N³). |
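A sketch of the optimal mapping loss as described, using SciPy's Hungarian solver (linear_sum_assignment); the function name and the detach-then-recompute pattern are my assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def optimal_mapping_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """y_hat, y: (T, N). Total cost O(T * N^2) + O(N^3), with no permutation loop."""
    n = y.shape[1]
    # Cost matrix: L[i, j] = BCE of assigning ground truth speaker j to output speaker i.
    L = torch.stack([torch.stack([F.binary_cross_entropy(y_hat[:, i], y[:, j])
                                  for j in range(n)])
                     for i in range(n)])
    # The Hungarian algorithm finds the minimum-cost one-to-one mapping in polynomial time.
    _, cols = linear_sum_assignment(L.detach().cpu().numpy())
    # Recompute on the matched columns so the result stays differentiable w.r.t. y_hat.
    return F.binary_cross_entropy(y_hat, y[:, torch.as_tensor(cols)])
```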
0:10:01 | Experimental results. |
0:10:04 | In the first experiment, |
0:10:06 | we randomly generate fake outputs |
0:10:08 | ranging from 0 to 1, and binary ground truth labels, |
0:10:15 | as the inputs, |
0:10:16 | of shape B × T × N. |
0:10:19 | The batch size B is set to 128, and T is set to five |
0:10:25 | hundred. |
0:10:26 | The number of speakers N ranges from two to ten. |
0:10:31 | For each N, |
0:10:32 | we repeat the process of data generation and loss computation |
0:10:37 | using different loss functions for one hundred times. |
0:10:42 | The average time costs are reported in Table 2. |
0:10:46 | The experiments are carried out on both CPU and GPU platforms. |
0:10:52 | For the GPU case, |
0:10:54 | we also [unclear] for all the speakers. |
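A minimal timing harness in the spirit of this experiment; the shapes and repeat count follow the transcript (T = 500, 100 runs per N), while the function name is an assumption and the batch dimension is omitted for simplicity.

```python
import time

import torch

def average_time(loss_fn, n_speakers: int, runs: int = 100, frames: int = 500) -> float:
    """Average wall-clock seconds of loss_fn on freshly generated random inputs."""
    total = 0.0
    for _ in range(runs):
        y_hat = torch.rand(frames, n_speakers)                   # fake posteriors in [0, 1)
        y = torch.randint(0, 2, (frames, n_speakers)).float()    # random binary labels
        start = time.perf_counter()
        loss_fn(y_hat, y)
        total += time.perf_counter() - start
    return total / runs

# e.g. compare average_time(optimal_mapping_loss, 10) with average_time(pit_loss_naive, 10)
```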
0:10:57 | We can see that the optimal mapping loss is the fastest. |
0:11:00 | When N is larger than four, its time costs are relatively stable. |
0:11:06 | In contrast, |
0:11:08 | the other two loss functions are too slow to use |
0:11:12 | when the number of speakers reaches ten. |
0:11:18 | In addition, we also repeated the SA-EEND experiments using different loss functions, |
0:11:25 | to see whether our proposed function is compatible with network training. |
0:11:31 | Results are shown in Table 3. |
0:11:33 | As expected, |
0:11:35 | the model trained with different loss functions results in the same diarization error rate. |
0:11:42 | Therefore, |
0:11:43 | all loss functions are effective. |
0:11:48 | That is all. |
0:11:49 | Thank you. |