0:00:15 | Good afternoon everybody. Today I am going to present our work, which addresses the problem of online speaker diarization. In contrast to other works, the majority of work in this area is mainly offline, and we work in a semi-supervised scenario. |
0:00:36 | I will first provide a brief introduction and the motivation, I will then describe the system implementation, and finally I will present some experimental results and a short conclusion. |
0:00:51 | So, I guess most of you are familiar with the problem of speaker diarization: basically, given an audio stream, we want to determine who spoke when. We want to determine an optimal segmentation, where the segment boundaries represent the speaker changes, and an optimal speaker sequence, where we assign each segment to a specific speaker. |
0:01:16 | Most of the state-of-the-art diarization systems developed so far revolve around offline diarization. However, with the diffusion of smart objects and intelligent meeting rooms, in recent years online diarization has attracted an increasing interest. |
0:01:40 | In the literature, only a few online diarization systems have been presented, mainly focusing on plenary speeches and broadcast news, where the speaker turns are long. Our previous work addressed the problem of unsupervised online diarization for meeting data with a single distant microphone, the SDM condition. |
0:02:05 | Unfortunately, although the results were aligned with previous work, that system did not perform well enough for practical applications that require online diarization. |
0:02:20 | So basically, in online diarization we have to deal with the problem of speaker model initialization. As soon as we encounter speech, we want to be able to initialize a speaker model, and the question is with which analysis window, that is, with which amount of speech, we initialize that speaker model. |
0:02:48 | We can choose a short amount of speech, which decreases the latency of the system; however, as everybody probably knows, the error rate is then much higher, because the speaker models are not well initialized with little data. Otherwise, we can take longer windows, that is, a longer amount of speech. |
0:03:14 | However, in that case the speaker variation increases: if the speech used for initialization is too long, it might contain multiple speakers, because the number of speakers present might increase with longer windows. |
0:03:31 | So, a way to improve online diarization is to allow the speaker models to be initialized with some initial labeled training data. Our contribution is the present work, a semi-supervised online diarization system; this kind of approach was already presented by other authors, but in the context of offline diarization. |
0:04:02 | The problem that we try to address is: what amount of seed data is required to reach a performance similar to that of an offline diarization system? |
0:04:13 | Okay, I will continue by explaining the incremental MAP adaptation that we use to update the models in our system. We suppose we have a sequence of speech segments from a particular speaker s, and each segment is parameterized by a set of acoustic features, up to a maximum number of features. We also assume we have an initial GMM-UBM model with a given number of Gaussian components. |
0:04:44 | We found that, in most of the systems described in the literature, the authors initialize the speaker model by MAP-adapting the general UBM model with the first speech segment, obtaining the first speaker model; then, using the next speaker segment, they adapt the previous model, s1 in this case, obtaining a new model s2. |
0:05:15 | However, we found out that by doing this the final model is not the same as the model obtained by using all the segments to adapt the UBM model at once. |
0:05:27 | So, although it is a modest contribution, we found out that by instead calculating the sufficient statistics of each available speech segment against the general UBM model each time, and by accumulating these statistics, the final model is more consistent, that is, more similar to the model obtained with offline MAP adaptation. |
0:05:54 | As you know from the formulation, the sufficient statistics for each Gaussian component are the zeroth-order, first-order and second-order statistics. Basically, they represent the occupation and the contribution of each feature contained in the segment against the UBM model, and all the available segments are used to adapt the general UBM model. |
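For reference, a standard way to write these sufficient statistics for Gaussian component c of the UBM, given the frames x_1, ..., x_T of a segment, is the following (a textbook formulation, written out here rather than copied from the slides):

```latex
n_c = \sum_{t=1}^{T} \gamma_c(x_t), \qquad
\mathbf{F}_c = \sum_{t=1}^{T} \gamma_c(x_t)\, x_t, \qquad
\mathbf{S}_c = \sum_{t=1}^{T} \gamma_c(x_t)\, x_t x_t^{\top}
```

where gamma_c(x_t) is the posterior probability of component c given frame x_t under the UBM; these are exactly the quantities that get accumulated across segments in the incremental scheme.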
0:06:28 | So, to obtain the new parameter estimates, we basically compute a trade-off, a weighted combination of the sufficient statistics and the original parameters. This ratio depends on the relevance factor, which tells us how much importance is given to the initial parameters rather than to the new data, that is, how conservative we want to be in estimating the new parameters. In addition, there is a normalization term so that the estimated weights sum to one. |
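The update being described matches the classical relevance-MAP adaptation equations; in a standard form (one adaptation coefficient per component, stated here for completeness rather than transcribed from the slides):

```latex
\alpha_c = \frac{n_c}{n_c + r}, \qquad
\hat{w}_c = \gamma \Big[ \alpha_c \tfrac{n_c}{T} + (1-\alpha_c)\, w_c \Big], \qquad
\hat{\mu}_c = \alpha_c \frac{\mathbf{F}_c}{n_c} + (1-\alpha_c)\, \mu_c, \qquad
\hat{\sigma}_c^2 = \alpha_c \frac{\mathbf{S}_c}{n_c} + (1-\alpha_c)\big(\sigma_c^2 + \mu_c^2\big) - \hat{\mu}_c^2
```

where r is the relevance factor (a large r keeps the estimates close to the prior parameters, a small r lets the new data dominate) and gamma is the scale factor that makes the updated weights sum to one.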
0:07:15 | So, in sequential adaptation, as we said, we use the first segment to adapt the general UBM model, obtaining the first speaker model s1; then we recalculate the sufficient statistics against the new speaker model s1 to train a new model s2. |
0:07:32 | More in general, we train the s1 speaker model by MAP-adapting the UBM model with the first segment, and then, given a new speaker segment i+1, the sufficient statistics are calculated against the previous model s_i. So all the sufficient statistics are calculated against the latest model. |
0:07:56 | In incremental MAP adaptation, instead, we calculate, for the first segment, the sufficient statistics against the UBM model to obtain the first speaker model; with the second segment, we again compute the sufficient statistics against the general UBM model and we accumulate them with the previous ones. In this way, the final adapted model is the same as with offline MAP adaptation. |
0:08:24 | So, to summarize: we train the initial speaker model with the first segment; then, given a new segment i+1, its sufficient statistics are computed against the UBM using the features of the last segment and accumulated with the previously accumulated statistics. As I said, it is a modest contribution, but we found that it brings good improvements in the final diarization results, as we will see later. A minimal code sketch of the two schemes follows. |
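To make the difference concrete, here is a minimal Python sketch, not the authors' code, that contrasts the two update schemes for a diagonal-covariance GMM, adapting the means only; all helper names (posteriors, suff_stats, map_means and the two adaptation functions) are hypothetical.

```python
# Minimal sketch (assumed implementation, not the authors' code): sequential
# vs. incremental MAP adaptation of GMM means with diagonal covariances.
import numpy as np
from scipy.stats import multivariate_normal


def posteriors(gmm, X):
    """Component posteriors gamma_c(x_t) for every frame in X (shape T x D)."""
    log_lik = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=np.diag(v))
        for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])
    ], axis=1)                                    # shape (T, C)
    log_lik -= log_lik.max(axis=1, keepdims=True)
    gamma = np.exp(log_lik)
    return gamma / gamma.sum(axis=1, keepdims=True)


def suff_stats(gmm, X):
    """Zeroth- and first-order statistics of segment X scored against gmm."""
    gamma = posteriors(gmm, X)
    n = gamma.sum(axis=0)                         # occupation counts, (C,)
    F = gamma.T @ X                               # weighted feature sums, (C, D)
    return n, F


def map_means(prior_means, n, F, r=16.0):
    """Relevance-MAP update of the means from (possibly accumulated) stats."""
    alpha = (n / (n + r))[:, None]
    emp_mean = F / np.maximum(n, 1e-10)[:, None]
    return alpha * emp_mean + (1.0 - alpha) * prior_means


def sequential_adaptation(ubm, segments, r=16.0):
    """Statistics of each new segment are computed against the *latest* model."""
    model = dict(ubm)
    for X in segments:
        n, F = suff_stats(model, X)
        model = dict(model, means=map_means(model["means"], n, F, r))
    return model


def incremental_adaptation(ubm, segments, r=16.0):
    """Statistics are always computed against the UBM and accumulated, so the
    final model equals offline MAP adaptation on all segments at once."""
    n_acc = np.zeros(len(ubm["weights"]))
    F_acc = np.zeros_like(ubm["means"])
    model = dict(ubm)
    for X in segments:
        n, F = suff_stats(ubm, X)
        n_acc, F_acc = n_acc + n, F_acc + F
        model = dict(ubm, means=map_means(ubm["means"], n_acc, F_acc, r))
    return model
```

The key difference is in the last two functions: the sequential variant re-aligns each segment against the most recent speaker model, while the incremental variant keeps scoring against the fixed UBM and sums the statistics, which is what makes its result match the offline MAP-adapted model.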
0:09:06 | Okay, the system implementation. We have a supervised phase and an unsupervised phase. In the supervised case, we are allowed a certain amount of labeled speech segments per speaker, and with those, after feature extraction, we initialize the models for all the people speaking in the meeting, for example. |
0:09:38 | In the online phase, instead, we operate unsupervised: we classify each incoming speech signal after dividing it into segments of a maximum duration Ts, which represents our latency. These segments are classified against the available speaker models; we determine which speaker model is the most likely, we label the segment according to the speaker model that maximizes the likelihood, and we update that model by incremental or sequential MAP adaptation; we will show results for both. |
0:10:26 | From the labeled data, we derive the sufficient statistics that are then used to update the speaker models. |
0:10:36 | So, in the online processing, I assign each segment to one of the speaker models according to a maximum-likelihood criterion: the model that maximizes the likelihood of the features contained in the segment is used as the label of the segment, and that speaker model is then adapted by either sequential or incremental MAP adaptation, as sketched below. |
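A minimal sketch of this online loop (again a hypothetical implementation, reusing the suff_stats and map_means helpers from the adaptation sketch above and showing the incremental-update variant only):

```python
# Minimal sketch of the online phase (assumed code, incremental variant):
# each segment of at most Ts seconds is labelled with the most likely speaker
# model, and that speaker's accumulated statistics and model are updated.
import numpy as np
from scipy.stats import multivariate_normal


def avg_log_likelihood(gmm, X):
    """Average per-frame log-likelihood of segment X under a diagonal GMM."""
    log_lik = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=np.diag(v))
        for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])
    ], axis=1)
    return float(np.mean(np.logaddexp.reduce(log_lik, axis=1)))


def online_diarization(ubm, speaker_models, speaker_stats, segments, r=16.0):
    """speaker_models: name -> GMM from the supervised initialization phase.
    speaker_stats:  name -> (n_acc, F_acc), statistics accumulated vs. the UBM."""
    labels = []
    for X in segments:
        # 1) Classification: maximum-likelihood speaker for this segment.
        best = max(speaker_models,
                   key=lambda s: avg_log_likelihood(speaker_models[s], X))
        labels.append(best)
        # 2) Model update: accumulate this segment's statistics (against the
        #    UBM) and re-derive the winning speaker's model by MAP adaptation.
        n, F = suff_stats(ubm, X)
        n_acc, F_acc = speaker_stats[best]
        speaker_stats[best] = (n_acc + n, F_acc + F)
        speaker_models[best] = dict(
            ubm, means=map_means(ubm["means"], *speaker_stats[best], r=r))
    return labels
```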
0:11:05 | And this is the implementation of the system that we use. |
0:11:09 | I will now present the experimental setup and the experimental results. We used four different datasets coming from the NIST Rich Transcription evaluations. The first dataset, used to train the UBM, is a set of sixty meeting shows from the NIST RT'04 evaluation. |
0:11:31 | Then we have the development dataset, a set of fifteen meeting shows from the RT'05 and RT'06 evaluations, which is used to develop the system. The evaluation set used to evaluate the system is a set of eight meeting shows from RT'07 and a set of seventeen shows from the RT'09 evaluation. We show the results independently for these two datasets, to allow a better comparison with previous work. |
0:12:00 | For the experimental setup, we use nineteen mel-frequency cepstral coefficients augmented by energy, with twenty-millisecond windows and a ten-millisecond shift, that is, ten milliseconds of overlap. The UBM is trained on the UBM dataset with ten EM iterations and sixty-four Gaussian components. |
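For orientation, a minimal sketch of training a UBM with this configuration (64 diagonal-covariance Gaussians, 10 EM iterations), using scikit-learn as an illustrative stand-in rather than the authors' toolchain; the feature extraction producing the 20-dimensional MFCC+energy frames at a 10 ms frame rate is assumed and not reproduced here:

```python
# Assumed UBM training sketch matching the stated configuration.
import numpy as np
from sklearn.mixture import GaussianMixture


def train_ubm(features, n_components=64, em_iterations=10, seed=0):
    """features: (num_frames, 20) array of MFCC+energy vectors pooled over
    the UBM training shows."""
    gm = GaussianMixture(n_components=n_components,
                         covariance_type="diag",
                         max_iter=em_iterations,
                         random_state=seed)
    gm.fit(features)
    # Export in the plain-dict form used by the adaptation sketches above.
    return {"weights": gm.weights_, "means": gm.means_, "vars": gm.covariances_}


if __name__ == "__main__":
    dummy = np.random.randn(5000, 20)   # stand-in for real MFCC+energy frames
    ubm = train_ubm(dummy)
    print(ubm["means"].shape)           # (64, 20)
```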
0:12:21 | The analysis window, that is, the segment duration that corresponds to the latency of the system, is varied over different latencies, from 0.25 seconds and 0.5 seconds up to four seconds, and the amount of training data used to initialize the models varies from one to thirty-nine seconds. |
0:12:45 | We show the results for both sequential and incremental MAP adaptation, with a fixed relevance factor for the MAP adaptation. |
0:12:56 | The overlapped speech is removed according to the reference transcription, to avoid problems with the assessment. The offline baseline system is a top-down diarization system, which we use as a reference for the results. |
0:13:16 | In the first plot, on the left, the results using the sequential MAP adaptation approach are presented. We can see that, by allowing an amount of labeled training data to initialize the models, we manage to perform better than the offline diarization system. |
0:13:37 | The results on the right are those with incremental MAP adaptation, which gives better profiles of the curves, because the models are updated by accumulating the statistics. |
0:13:52 | We can see that we can reach the offline diarization performance with only five seconds of training data with sequential adaptation, while incremental adaptation works better, reaching it with three seconds; and by allowing more training data we reach a diarization error rate of around ten percent. |
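Throughout these results, the diarization error rate follows the usual NIST definition (stated here for readers less familiar with the metric):

```latex
\mathrm{DER} = \frac{T_{\mathrm{missed\;speech}} + T_{\mathrm{false\;alarm}} + T_{\mathrm{speaker\;error}}}{T_{\mathrm{total\;scored\;speech}}}
```

that is, the fraction of scored speech time that is missed, falsely detected as speech, or attributed to the wrong speaker.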
0:14:13 | For the lowest latencies, instead, the system does not perform well, because the latency is really too low and the segments contain too little data. |
0:14:32 | In this table we present the results for different amounts of labeled training data, three seconds, five seconds and seven seconds, and for the different datasets; all these results correspond to a latency of three seconds. We can see that in all cases incremental MAP adaptation works better than sequential MAP adaptation, so accumulating the statistics over time gives better results. |
0:15:01 | Finally, this graph represents the amount of training data as a function of the latency at the point where we reach the offline diarization performance; all the points correspond to a DER of seventeen percent, and we see again that incremental MAP adaptation works better than sequential MAP adaptation. |
0:15:29 | For future work, the goal is to further reduce both the latency and the amount of training data required to reach better performance. |
0:15:42 | So, to conclude: we proposed a semi-supervised online diarization system, and we showed that, in the case of the RT'07 dataset, the system can outperform an offline diarization system with only three seconds of speaker seed data and with a latency of three seconds, when using an incremental MAP adaptation approach. By allowing a higher latency or more training data, we reach a DER lower than ten percent. |
0:16:10 | Also, to address the inconvenience of having to initialize the speaker models with some labeled training data, an open direction is the development of supervised or semi-supervised speaker-discriminative feature transformations, both to reduce the latency and the amount of data required. |
0:16:42 | Thank you. We have time for a few questions. |
0:16:54 | Thank you for the talk. For the system to work, does it need to know how many speakers are in the conversation? |
0:17:04 | Yes, usually we assume that we know in advance the number of speakers, to better initialize the models. We are searching for ways to introduce new speakers that are unknown at the beginning. |
0:17:20 | You divide the data between the speakers, so you assume that all the speakers speak from the beginning? |
0:17:28 | Yes. |
0:17:30 | Each speaker presents himself, or something like this? |
0:17:33 | Yes, exactly; we assume that at the start every speaker has already been observed. |
0:17:38 | Your original system does not assume that the number of speakers is known, so isn't it somewhat unfair to compare the two versions? |
0:17:52 | I agree with the comment. When we did these experiments we had only that offline diarization system available, and it was difficult to initialize the speaker models otherwise, so since we already had that baseline we decided to stick with it as a reference, and we compare against it. |
0:18:07 | And a last question: which practical use do you see for this online diarization, for a first segmentation, for filtering? |
0:18:17 | Okay, one application where this work can be useful, for example, is when you interact with a smart object: the people that usually interact with that smart object can provide some initial enrollment data, so you can provide some initial data that the diarization system can then use for this task. Thank you very much. |
0:18:45 | Any other questions? |
0:18:52 | So I have a question. From your presentation I understand that you are adapting all the parameters of the system, so the weights, the covariances and the means. Have you tried to check whether adapting fewer parameters gives you something better? |
0:19:17 | Yes, we tried adapting only the means. With sequential MAP adaptation, as we increment the models with little data, adapting all the parameters tends to give worse results than adapting only the means. |
0:19:29 | In the incremental case, instead, since we are accumulating the statistics, those statistics are also useful to update the variances and the weights as well, so with incremental MAP adaptation updating all the parameters brought better results than updating only the means. |
0:19:59 | And, for example, do you think it would make sense to compare in terms of the number of parameters, maybe increasing the number of Gaussians in the UBM? |
0:20:14 | Okay, so maybe, for a lighter computation, reducing the number of Gaussians, you mean? |
0:20:21 | No, I was thinking of increasing them as you go; for example, as the model becomes more reliable, you could increase the number of components, maybe double them. |
0:20:36 | We did not try that in this case, because we used sixty-four Gaussian components, since that was in line with previous work that we did, but that might be a good idea. |
0:20:50 | There is still time for questions. |
0:20:59 | Questions? |
0:21:06 | Do you have an explanation of why your system does better than the offline system? |
0:21:14 | Sorry, why the system does better than the offline system? |
0:21:19 | Yes, do you know why? |
0:21:22 | Okay, in that case: in previous work we tried a totally unsupervised system, in which we did not know the number of speakers, and the performance was much worse than that of an offline diarization system. |
0:21:35 | And, as I said before, the comparison here is somewhat unfair, because the offline diarization system does not know the number of speakers, while in this case you allow the system to know the number of speakers in advance, to get better performance and to target practical applications. Knowing the number of speakers already adds a lot of information to the problem; it is information that the offline diarization system does not have. |
0:22:08 | My understanding is that you can decrease the latency, but an offline system can also be run online, so what is the difference in the end? I mean, you can imitate online processing by using an offline system. |
0:22:22 | Okay, but an offline system basically needs all the audio from the beginning, while here you do not have to keep and reprocess all the previous audio. |
0:22:34 | Also, the offline system that we used as a baseline is really computationally heavy: it uses a lot of segmentation and clustering iterations, so from the point of view of latency it is much worse than the online diarization system. |
0:23:04 | Any other questions? |
0:23:07 | So, let's thank the speaker again. |