0:00:16 | Good afternoon, and thank you for coming. |
---|
0:00:20 | I am going to present our work. |
---|
0:00:25 | This is on variability compensation for speaker segmentation of two-speaker telephone conversations. |
---|
0:00:32 | We are also presenting a technique to generate several segmentation hypotheses for a given recording and to select the best one. |
---|
0:00:46 | This work is focused on the segmentation of two-speaker conversations, so it is a speaker diarization problem. |
---|
0:00:55 | We are answering the question "who spoke when?" |
---|
0:01:01 | It is an easier task than general diarization, since the number of speakers is known |
---|
0:01:07 | and limited to two. |
---|
0:01:09 | In this case, once we know the boundaries, assigning the speaker labels is a two-class decision problem, |
---|
0:01:18 | so we can treat it as a segmentation problem. |
---|
0:01:22 | Recent advances in the field of speaker verification have motivated new approaches for the segmentation of two-speaker conversations. |
---|
0:01:31 | Many are based on factor analysis using eigenvoices. |
---|
0:01:35 | In these approaches, the speaker is modeled with a GMM supervector, |
---|
0:01:42 | which can be represented by a small, low-dimensional vector whose components are called the speaker factors. |
---|
0:01:50 | Its dimension is much lower than that of the GMM supervector. |
---|
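The eigenvoice idea just described expresses a speaker's supervector as the UBM mean plus a low-rank term. A toy numpy sketch (not the paper's code; the dimensions, the random eigenvoice matrix, and the least-squares recovery are illustrative assumptions — real systems estimate the factors from GMM statistics):

```python
import numpy as np

rng = np.random.default_rng(2)
sv_dim, n_factors = 1000, 20              # toy supervector / factor dimensions

m = rng.normal(size=sv_dim)               # UBM mean supervector
V = rng.normal(size=(sv_dim, n_factors))  # eigenvoice matrix (low rank)
y_true = rng.normal(size=n_factors)       # speaker factors

s = m + V @ y_true                        # eigenvoice model: s = m + V y

# Recover the compact speaker representation by least squares.
y_hat, *_ = np.linalg.lstsq(V, s - m, rcond=None)
```

Because the factor vector is tiny compared to the supervector, it can be estimated from very little data, which is what makes the per-window extraction below feasible.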
0:01:55 | So the main idea is that, |
---|
0:01:58 | using such a compact speaker representation, we can estimate its parameters |
---|
0:02:04 | on very short segments, |
---|
0:02:06 | and that is what we do for speaker segmentation. |
---|
0:02:11 | So we extract a stream of speaker factors over the input signal: |
---|
0:02:17 | we slide a one-second window and, |
---|
0:02:20 | frame by frame, extract |
---|
0:02:24 | a sequence of speaker factor vectors. |
---|
0:02:26 | Then we cluster these speaker factors into two clusters |
---|
0:02:30 | using PCA plus k-means clustering. |
---|
0:02:33 | Once we have the two classes, we fit a single full-covariance Gaussian |
---|
0:02:40 | for each speaker, and with that we obtain a first |
---|
0:02:43 | segmentation output. Then we refine this segmentation with |
---|
0:02:47 | a resegmentation step, |
---|
0:02:52 | using MFCC features and GMM speaker models. |
---|
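The clustering stage just described could be sketched roughly as follows (a minimal numpy illustration, not the authors' code; the synthetic speaker-factor stream, the 20-dim factor size, and the median-split initialization are assumptions):

```python
import numpy as np

def pca_first_component(X):
    """Project zero-meaned data onto the first principal direction."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def kmeans_two_clusters(X, n_iter=50):
    """Plain two-class k-means on all dimensions, initialized by
    splitting the first PCA projection at its median."""
    proj = pca_first_component(X)
    labels = (proj > np.median(proj)).astype(int)
    for _ in range(n_iter):
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def full_cov_gaussians(X, labels):
    """Fit one full-covariance Gaussian per speaker cluster."""
    return [(X[labels == k].mean(axis=0), np.cov(X[labels == k].T))
            for k in (0, 1)]

# Synthetic stream of 20-dim "speaker factors": two separated speakers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 20)),
               rng.normal(+2.0, 1.0, size=(200, 20))])
labels = kmeans_two_clusters(X)
models = full_cov_gaussians(X, labels)
```

The real system extracts the factors over a sliding one-second window and then refines this first segmentation with a GMM/MFCC resegmentation pass, which is omitted here.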
0:02:58 | So the main contribution of this work is |
---|
0:03:05 | the compensation of unwanted variability. |
---|
0:03:08 | First, let me talk about the types of variability we can find. |
---|
0:03:14 | If we have a set of recordings containing different speakers |
---|
0:03:21 | and we analyze the variability present in these recordings, |
---|
0:03:25 | we can see that there is variability, and it is mainly due |
---|
0:03:32 | to the differences between the recorded speakers; this is usually referred to as inter-speaker variability. |
---|
0:03:38 | If instead we analyze a set of recordings |
---|
0:03:40 | belonging to the same speaker, |
---|
0:03:43 | we can see that there is also variability among these recordings, |
---|
0:03:49 | usually due to aspects like the channel or |
---|
0:03:52 | the mood of the speaker. |
---|
0:03:55 | This variability is usually known as |
---|
0:03:59 | intersession variability. |
---|
0:04:01 | In addition, if we analyze |
---|
0:04:04 | a recording containing a single speaker, divide it into small slices, |
---|
0:04:11 | and analyze the variability across these slices, |
---|
0:04:14 | we can see that there is also variability within the recording. |
---|
0:04:19 | This variability is usually due to the phonetic content or |
---|
0:04:23 | to channel variations along |
---|
0:04:26 | the recording, |
---|
0:04:28 | and we will refer to it |
---|
0:04:30 | as intrasession variability. |
---|
0:04:33 | In our approach for speaker segmentation, |
---|
0:04:37 | we are only modeling inter-speaker variability. |
---|
0:04:41 | So the question is: are the other types of variability, intersession and intrasession, |
---|
0:04:45 | affecting the segmentation performance? |
---|
0:04:52 | The first question is: do we need to compensate for intersession variability in this segmentation task? |
---|
0:04:57 | We know that intersession variability compensation is very important for speaker recognition, |
---|
0:05:03 | but I would not say that it is |
---|
0:05:05 | as important for speaker segmentation. |
---|
0:05:13 | I have some preliminary experiments showing that |
---|
0:05:18 | compensating the channel factors helps a bit, |
---|
0:05:20 | but we believe that it should not help so much, because in this diarization task you do not |
---|
0:05:25 | see the same speaker over different sessions. |
---|
0:05:27 | Since all the speakers are in |
---|
0:05:30 | a single session, |
---|
0:05:31 | you do not have prior information about the speakers. |
---|
0:05:34 | Actually, we believe that this variability can even help to separate the speakers in the diarization |
---|
0:05:38 | task, |
---|
0:05:39 | because the channel carries information that can |
---|
0:05:41 | help to separate the speakers. |
---|
0:05:44 | And what about intrasession variability? |
---|
0:05:46 | In the field of speaker recognition, state-of-the-art systems usually do not take |
---|
0:05:55 | intrasession variability into account, only intersession variability, |
---|
0:05:57 | since they use the whole conversation to train a model. |
---|
0:06:03 | But we think it is |
---|
0:06:04 | very important for speaker segmentation and diarization, because many |
---|
0:06:09 | state-of-the-art systems are based on |
---|
0:06:11 | the clustering of very short segments, |
---|
0:06:15 | and if we can compensate the variability among the segments of a given speaker, |
---|
0:06:19 | the clustering process should be easier. |
---|
0:06:24 | So this is what we try to do: given |
---|
0:06:27 | a dataset that contains several speakers and several recordings per speaker, |
---|
0:06:31 | we extract a stream of speaker factors from each recording, |
---|
0:06:36 | and then we consider every session as a different class. |
---|
0:06:41 | We model the speaker and intersession variability as between-class variance, since we believe that |
---|
0:06:45 | inter-speaker and intersession variability help |
---|
0:06:49 | to separate the speakers |
---|
0:06:50 | within a recording, |
---|
0:06:52 | and we model the intrasession variability as within-class variance. |
---|
0:06:56 | With this framework, it is |
---|
0:06:59 | easy to apply well-known techniques such as |
---|
0:07:02 | linear discriminant analysis, which maximizes the between-class variance while minimizing the within-class variance, |
---|
0:07:07 | or within-class covariance normalization, |
---|
0:07:09 | which normalizes the covariance |
---|
0:07:12 | of every class |
---|
0:07:14 | to the identity matrix. |
---|
0:07:16 | These two techniques have been successfully applied |
---|
0:07:20 | for intersession compensation in speaker recognition systems. |
---|
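The WCCN step can be sketched like this (a rough numpy illustration under assumed data, not the paper's implementation): it learns a linear map A such that the average within-class covariance of the mapped speaker factors becomes the identity.

```python
import numpy as np

def wccn_transform(X, y):
    """Learn A such that the average within-class covariance of X @ A.T
    is the identity matrix (within-class covariance normalization)."""
    classes = np.unique(y)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        W += np.cov(X[y == c].T, bias=True)
    W /= len(classes)
    # If inv(W) = L @ L.T (Cholesky), then L.T @ W @ L = I,
    # so A = L.T is the normalizing map.
    return np.linalg.cholesky(np.linalg.inv(W)).T

# Synthetic speaker factors: 10 classes (sessions) sharing one
# within-class covariance but with different means.
rng = np.random.default_rng(1)
d = 5
M = rng.normal(size=(d, d))
within = M @ M.T + d * np.eye(d)
means = rng.normal(scale=5.0, size=(10, d))
X = np.vstack([rng.multivariate_normal(m, within, size=200) for m in means])
y = np.repeat(np.arange(10), 200)

A = wccn_transform(X, y)
Xn = X @ A.T  # normalized speaker factors
```

LDA would additionally rotate and reduce the dimensions to maximize the between-class variance; only the WCCN half is shown here.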
0:07:27 | To evaluate these two approaches, we use a NIST SRE |
---|
0:07:32 | summed-channel condition, containing more than two thousand five-minute telephone conversations. |
---|
0:07:38 | The speech/non-speech marks are given, |
---|
0:07:41 | and we measure performance in terms of the speaker segmentation error, or speaker error, |
---|
0:07:46 | which is a part of the diarization error rate. Since we |
---|
0:07:49 | are given the speech/non-speech segmentation and we do not take overlapped speech into account, |
---|
0:07:54 | the diarization error rate is the same as the segmentation error or speaker error, |
---|
0:07:59 | and we use a 0.25-second collar for scoring. |
---|
0:08:06 | And here we have the results |
---|
0:08:08 | for the system using a small UBM |
---|
0:08:10 | of 256 Gaussians |
---|
0:08:13 | and MFCC features. |
---|
0:08:16 | In this case we do not use the resegmentation step. |
---|
0:08:20 | We can see the segmentation error of our baseline, |
---|
0:08:25 | and that, using intersession variability compensation with WCCN and |
---|
0:08:28 | twenty speaker factors, |
---|
0:08:30 | the segmentation error is reduced to 2.5. |
---|
0:08:34 | You can also see another baseline |
---|
0:08:36 | with fifty speaker factors, which is |
---|
0:08:39 | slightly better, |
---|
0:08:40 | not much, but slightly better. |
---|
0:08:42 | We also tried LDA for dimensionality reduction, |
---|
0:08:47 | and we can see that LDA alone is not helping much; |
---|
0:08:49 | LDA followed by WCCN |
---|
0:08:52 | is better, |
---|
0:08:57 | but even the combination of both is not better than |
---|
0:09:02 | applying WCCN directly. |
---|
0:09:05 | But when we tried these |
---|
0:09:07 | systems after the resegmentation step, it was surprising that |
---|
0:09:11 | after the resegmentation step |
---|
0:09:15 | the results were more or less equal |
---|
0:09:17 | whether using twenty or fifty speaker factors. |
---|
0:09:20 | The intersession variability compensation with WCCN was still working |
---|
0:09:24 | and |
---|
0:09:24 | giving an improvement, |
---|
0:09:27 | but it seemed that it was not useful to increase the number of speaker factors. |
---|
0:09:32 | We were a little disappointed with |
---|
0:09:33 | this, |
---|
0:09:34 | because we expected it to help, |
---|
0:09:36 | so we ran a new set of experiments, which are not in the paper |
---|
0:09:41 | but I am presenting here, |
---|
0:09:43 | with a larger UBM and more features. |
---|
0:09:46 | In this case, increasing the number of speaker factors does help. |
---|
0:09:49 | We can see that in this case our baseline with fifty speaker factors |
---|
0:09:54 | obtains a 1.8 percent |
---|
0:09:55 | segmentation error, which is lower than before, where it was |
---|
0:09:58 | 2.1. |
---|
0:10:01 | And when we use channel compensation with WCCN, the error is reduced to 1.4. |
---|
0:10:05 | In this case we also increased the number of speaker factors |
---|
0:10:11 | to test the LDA, |
---|
0:10:12 | and we see that the LDA |
---|
0:10:14 | is helping, even |
---|
0:10:16 | more than before, |
---|
0:10:19 | and also that |
---|
0:10:20 | our best configuration now is the combination of LDA plus WCCN. |
---|
0:10:25 | So it seems quite consistent: |
---|
0:10:29 | the baseline with more speaker factors is not better than the baseline with fifty speaker factors, but |
---|
0:10:34 | with the LDA |
---|
0:10:35 | we can take advantage of many more speaker factors. |
---|
0:10:40 | Our best result is |
---|
0:10:41 | a 1.3 percent speaker error. |
---|
0:10:45 | On the other hand, |
---|
0:10:47 | we propose a new technique to generate several segmentation |
---|
0:10:52 | hypotheses |
---|
0:10:54 | and to select the best one |
---|
0:10:56 | based on a set of confidence measures. |
---|
0:10:59 | What we do is iteratively split the recording into halves, |
---|
0:11:06 | until we obtain four levels of splitting, as we can see in the figure, |
---|
0:11:10 | and we segment every |
---|
0:11:12 | slice with the proposed segmentation system. |
---|
0:11:16 | Then, for every level, we select the best-segmented slices |
---|
0:11:19 | and we combine them to build the two speaker models. |
---|
0:11:23 | With these two speaker models we resegment |
---|
0:11:26 | the whole recording, |
---|
0:11:27 | using |
---|
0:11:28 | an iterative resegmentation with MFCC features |
---|
0:11:31 | and GMM speaker models. |
---|
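The iterative halving that produces the hypothesis levels can be sketched like this (a simple illustration; the frame count and the per-slice segmentation itself are assumptions):

```python
def splitting_levels(n_frames, n_levels=4):
    """Level k splits the recording into 2**k equal slices, each given
    as a (start, end) frame range; level 0 is the whole recording."""
    levels = []
    for k in range(n_levels):
        n_slices = 2 ** k
        bounds = [round(i * n_frames / n_slices) for i in range(n_slices + 1)]
        levels.append(list(zip(bounds[:-1], bounds[1:])))
    return levels

# A 5-minute conversation at 100 frames per second -> 30000 frames.
levels = splitting_levels(30000)
```

Each slice would then be segmented independently, and the best slices of each level merged to train that level's pair of speaker models.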
0:11:34 | To select the best-segmented slices and the best level among these |
---|
0:11:39 | four, |
---|
0:11:40 | we use a set of confidence measures and also majority voting. |
---|
0:11:46 | The confidence measures used in this work were the Bayesian information criterion, |
---|
0:11:51 | where we use MFCC features and GMM speaker models to compute the |
---|
0:11:56 | BIC, |
---|
0:11:57 | and the KL divergence, this one in the speaker factor space: |
---|
0:12:02 | we fit Gaussian |
---|
0:12:04 | speaker models in that space and compute the KL |
---|
0:12:07 | distance between both models. |
---|
0:12:09 | And |
---|
0:12:10 | to fuse both confidence measures we used the FoCal toolkit, |
---|
0:12:15 | well known for speaker verification, |
---|
0:12:18 | and the |
---|
0:12:21 | fusion weights were optimized to separate the hypotheses |
---|
0:12:26 | with less than one percent segmentation error in the summed channel. |
---|
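The KL-based confidence measure has a closed form when the two speaker models are single full-covariance Gaussians in the speaker-factor space. A sketch (the three-dimensional example values are made up):

```python
import numpy as np

def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(p || q) between two full-covariance Gaussians."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - d + logdet_q - logdet_p)

def symmetric_kl(mu_a, cov_a, mu_b, cov_b):
    """Symmetrized KL: large when the two speaker models are well separated."""
    return (kl_gaussian(mu_a, cov_a, mu_b, cov_b)
            + kl_gaussian(mu_b, cov_b, mu_a, cov_a))

# Two hypothetical speaker models in a 3-dim speaker-factor space.
mu_a, mu_b = np.zeros(3), np.full(3, 2.0)
cov = np.eye(3)
score = symmetric_kl(mu_a, cov, mu_b, cov)  # -> 12.0
```

A hypothesis whose two speaker models are far apart under this distance is more likely to have separated the speakers correctly.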
0:12:31 | OK, so here we have the results |
---|
0:12:33 | for this hypothesis generation and selection |
---|
0:12:36 | strategy. |
---|
0:12:37 | We can see that, when we are not using |
---|
0:12:41 | intersession variability compensation, |
---|
0:12:43 | this solution improves the results with respect to our baseline, which was |
---|
0:12:48 | 2.1, |
---|
0:12:49 | and we are getting 1.9 with this strategy. |
---|
0:12:53 | And if |
---|
0:12:54 | we had an ideal |
---|
0:12:56 | confidence measure, such that we could select the best |
---|
0:13:00 | level every time, |
---|
0:13:02 | we could go down to a 1.1 segmentation error. |
---|
0:13:05 | But |
---|
0:13:05 | the performance of our confidence measures was far from that ideal. |
---|
0:13:09 | Moreover, |
---|
0:13:12 | using intersession variability compensation we did not get |
---|
0:13:15 | a significant improvement; the improvement |
---|
0:13:17 | was not statistically |
---|
0:13:19 | significant. |
---|
0:13:22 | We also tried majority voting, but it did not help either. |
---|
0:13:26 | Still, we wanted to make it work with some set of confidence measures, |
---|
0:13:30 | because the possibilities of such a confidence-measure framework |
---|
0:13:33 | are high: |
---|
0:13:36 | it is a simple strategy to fuse |
---|
0:13:38 | segmentation hypotheses. |
---|
0:13:40 | So |
---|
0:13:43 | we were not really happy with this, |
---|
0:13:45 | and we tried again with the large UBM and more features, |
---|
0:13:50 | with our best configuration for intersession variability compensation |
---|
0:13:54 | and also a new set of confidence measures. |
---|
0:13:57 | These are new results, not in the paper, |
---|
0:14:00 | but I will show them to you. |
---|
0:14:02 | We could reduce the segmentation error from 1.3 |
---|
0:14:06 | to 1.2, |
---|
0:14:07 | with an oracle segmentation error of 1.0, |
---|
0:14:10 | and if we could always select |
---|
0:14:11 | the best level, we could reduce it to a 0.7 |
---|
0:14:15 | speaker error, |
---|
0:14:16 | which is |
---|
0:14:18 | quite good compared to the |
---|
0:14:21 | baseline |
---|
0:14:22 | of 1.3. |
---|
0:14:25 | So, as conclusions of this work, we have presented two techniques for intersession variability |
---|
0:14:31 | compensation, |
---|
0:14:32 | and we have shown that they help for speaker segmentation. |
---|
0:14:36 | It is interesting that WCCN obtains better performance than the LDA |
---|
0:14:40 | alone, |
---|
0:14:41 | and is |
---|
0:14:45 | somewhat similar to the combination of both, LDA plus WCCN. |
---|
0:14:49 | Since the number of speaker factors increases the computational cost, |
---|
0:14:53 | it seems that WCCN is |
---|
0:14:56 | the best choice |
---|
0:14:59 | for low-computational-cost applications. |
---|
0:15:02 | But of course, for our best performance, where computational cost is not a problem, the best configuration |
---|
0:15:08 | uses |
---|
0:15:08 | a high number of speaker factors and the LDA discussed here. |
---|
0:15:14 | In this summary of the results you can see that our best system reaches |
---|
0:15:19 | 1.3: |
---|
0:15:20 | with the LDA, the system went from 1.9 to 1.3. |
---|
0:15:26 | Also |
---|
0:15:27 | note that |
---|
0:15:28 | probably the reason WCCN is helping so much, |
---|
0:15:31 | more than the LDA in this study, |
---|
0:15:36 | is that our system uses PCA plus k-means |
---|
0:15:39 | as initialization. |
---|
0:15:41 | Normalizing the |
---|
0:15:44 | within-class covariance for every speaker probably helps the k-means, which assumes that |
---|
0:15:48 | all the |
---|
0:15:49 | classes have the same covariance; |
---|
0:15:52 | that may be why the LDA is not helping as much. |
---|
0:15:55 | So it is probably because of this |
---|
0:15:58 | that WCCN helps so much. |
---|
0:16:01 | We have also presented a hypothesis generation and selection technique, |
---|
0:16:05 | which can improve the segmentation results: |
---|
0:16:08 | for our best configuration, we can reduce the segmentation error from 1.3 to 1.2 |
---|
0:16:13 | with the large UBM. |
---|
0:16:16 | I think that's all. |
---|
0:16:17 | Thank you very much. |
---|
0:16:24 | [Chair:] We have time for questions. |
---|
0:16:38 | [Inaudible question from the audience.] |
---|
0:16:43 | [Answering:] We just use one dimension. |
---|
0:16:46 | I didn't mention it because it is |
---|
0:16:47 | in another paper, |
---|
0:16:49 | but this was much more robust |
---|
0:16:52 | to variability. |
---|
0:16:54 | From the PCA |
---|
0:16:56 | we keep just one dimension |
---|
0:16:59 | to initialize; then in the k-means we use all the dimensions, but we initialize |
---|
0:17:04 | the means of the k-means with |
---|
0:17:07 | the PCA output. |
---|
0:17:17 | No. No. |
---|
0:17:22 | [Inaudible question from the audience.] |
---|
0:17:50 | Yes, sure, but I mean, |
---|
0:17:52 | in our experiments I am keeping one dimension, |
---|
0:17:56 | and maybe |
---|
0:17:58 | that is not the best you can do; |
---|
0:18:00 | keeping one dimension, |
---|
0:18:01 | which is usually the first dimension |
---|
0:18:05 | of the PCA, may not be the best choice for this clustering. But still, we are |
---|
0:18:10 | getting about an eighteen percent diarization error rate |
---|
0:18:13 | just using one dimension, |
---|
0:18:14 | so we are not sure that it is |
---|
0:18:17 | the best representation. |
---|
0:18:23 | [Inaudible follow-up.] |
---|
0:18:31 | So yes, we could try to |
---|
0:18:33 | just plug in the PCA output |
---|
0:18:36 | with more dimensions and pass all of those dimensions to the k-means. |
---|
0:18:41 | [Chair:] Are there other questions? |
---|
0:18:47 | [Chair:] Then let's thank the speaker again. |
---|