0:00:15 Hello everybody. Today I will present our work, which addresses the problem of online speaker diarization; in contrast, the majority of work in this area is mainly offline. We work in a semi-supervised scenario.
0:00:36 I will first provide a brief introduction and the motivation, then describe the system implementation, and finally present some experimental results and a short conclusion.
0:00:51 I guess most of you are familiar with the problem of speaker diarization: basically, given an audio stream, we want to determine who spoke when.
0:01:01 We want to determine both a segmentation, where the segment boundaries represent the speaker changes, and an optimal speaker sequence, where each segment is assigned to a specific speaker.
0:01:19 Most state-of-the-art systems have evolved around the development of offline diarization; however, with the diffusion of smart objects and smart environments in recent years, online diarization has attracted increasing interest.
0:01:40 In the literature, only a few online diarization systems have been presented, mainly focusing on plenary speeches and broadcast news, where the speaker turns are long.
0:01:53 Our previous work addressed the problem of unsupervised online diarization of meeting data, with a single distant microphone as the condition.
0:02:05 Unfortunately, although the results were in line with previous work, the system failed to perform well enough for practical applications that require short-latency online diarization.
0:02:20 We think that in online diarization we basically have to deal with the problem of speaker model initialization. After a period of non-speech, when we encounter speech we want to be able to initialize a speaker model, and the question is which kind of analysis window, with which amount of speech, we should take to initialize that model.
0:02:48 We can choose a short amount of speech in order to decrease the latency of the system.
0:02:55 However, as everybody probably knows, the error rate is then much higher, because the speaker models are not well initialized with little data. Otherwise, we can take longer windows, a longer amount of speech.
0:03:14 In that case, though, we face the problem of using segments that are too long, in which there might be multiple speakers, because the number of speakers in a segment tends to increase with longer windows.
0:03:34 One way to improve online diarization is to allow the speaker models to be initialized with some initial labeled training data.
0:03:44 Our contribution is therefore what we call a semi-supervised online diarization system.
0:03:51 This kind of approach was already presented in earlier work, but in the context of offline diarization.
0:04:02 The problem we try to address is how much labeled seed data is required to reach a performance similar to that of an offline diarization system.
0:04:13 I will continue by explaining the MAP adaptation scheme that we use to update the models in our system.
0:04:25 Suppose we have a sequence of speech segments from a particular speaker s, each segment parameterized by a set of acoustic features, and an initial GMM-UBM model with a given number of Gaussian components.
0:04:44 We found that in most of the works in the literature, the authors initialize the speaker model by MAP-adapting the UBM with the first speech segment, obtaining the first speaker model; then the next segment is used to adapt the previous model, s1 in this case, producing a new model s2, and so on.
0:05:15 However, we found out that with this procedure the final model is not the same as the model obtained by using all the segments to adapt the UBM at once.
0:05:27 So, although it is a modest contribution, we found that by calculating the sufficient statistics of each available speech segment against the UBM each time, and by accumulating these statistics, the final model is more consistent: it coincides with the offline MAP-adapted model.
0:05:54 Going into more detail, the sufficient statistics for a Gaussian component are given by three main quantities: the zeroth-order, first-order and second-order statistics.
0:06:16 They are computed from the posterior probability of each feature contained in the segment against the UBM, and all the available segments are used to adapt the UBM.
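A minimal sketch of these statistics, assuming a scikit-learn diagonal-covariance GaussianMixture as the UBM (an assumption; this is not the authors' code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def sufficient_statistics(ubm: GaussianMixture, features: np.ndarray):
    """Zeroth-, first- and second-order statistics of one segment against the UBM.

    features: (T, D) array of acoustic frames for the segment.
    """
    post = ubm.predict_proba(features)      # component posteriors, shape (T, C)
    n_c = post.sum(axis=0)                  # zeroth order, shape (C,)
    f_c = post.T @ features                 # first order, shape (C, D)
    s_c = post.T @ (features ** 2)          # second order (diagonal), shape (C, D)
    return n_c, f_c, s_c
```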
0:06:30 To obtain the new parameter estimates, we basically take a weighted combination of the sufficient statistics and the original parameters.
0:06:48 The combination ratio depends on a relevance factor that tells us how much weight to give to the initial parameters rather than to the new data, that is, how conservative we want to be in estimating the new parameters. There is also an additional normalization so that the estimated weights sum to one.
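A minimal sketch of this update in the standard Reynolds-style form, reusing the imports and the sufficient_statistics() helper from the sketch above; the relevance factor of 16 is only a common default, not necessarily the value used in the talk:

```python
def map_adapt(ubm, n_c, f_c, s_c, relevance=16.0):
    """MAP-update weights, means and diagonal variances from accumulated statistics."""
    T = n_c.sum()
    alpha = n_c / (n_c + relevance)                  # per-component adaptation coefficient
    safe_n = np.maximum(n_c, 1e-10)[:, None]         # guard against empty components
    ex, ex2 = f_c / safe_n, s_c / safe_n             # data means and second moments
    a = alpha[:, None]

    new_w = alpha * n_c / T + (1.0 - alpha) * ubm.weights_
    new_w /= new_w.sum()                             # extra normalization: weights sum to one
    new_mu = a * ex + (1.0 - a) * ubm.means_
    new_var = a * ex2 + (1.0 - a) * (ubm.covariances_ + ubm.means_ ** 2) - new_mu ** 2
    return new_w, new_mu, new_var
```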
0:07:15 In sequential MAP adaptation, as just described, the first segment is used to adapt the UBM and obtain the first speaker model s1; then the sufficient statistics of the next segment are calculated against s1 to train a new model s2.
0:07:32 More generally, the speaker model s1 is trained by MAP-adapting the UBM with the first segment, and then, given speaker segment i+1, its sufficient statistics are calculated against the previous model s_i, so all the sufficient statistics are computed against the latest model.
0:07:56 In incremental MAP adaptation, instead, we calculate the sufficient statistics of the first segment against the UBM to obtain the first speaker model; with the second segment we again compute the sufficient statistics against the UBM, accumulate them with the previous ones and adapt the UBM, and so on. In this way the resulting model is the same as with offline MAP adaptation.
0:08:24 In other words, we train the initial speaker model with the first segment, and then, given segment i+1, its sufficient statistics are computed against the UBM using the features of the new segment and accumulated with the previously accumulated statistics.
0:08:55 As I said, it is a modest contribution, but it brings good improvements in the final diarization results, as we will see later.
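A minimal sketch contrasting the two update strategies, under the same assumptions as above and using the helpers sketched earlier (the gmm_from() wrapper is a hypothetical convenience, not part of any quoted API):

```python
def gmm_from(weights, means, variances):
    """Hypothetical helper: wrap MAP-adapted parameters back into a GaussianMixture."""
    g = GaussianMixture(n_components=len(weights), covariance_type="diag")
    g.weights_, g.means_, g.covariances_ = weights, means, variances
    g.precisions_cholesky_ = 1.0 / np.sqrt(variances)
    return g

def sequential_adaptation(ubm, segments, relevance=16.0):
    model = ubm
    for seg in segments:
        # Statistics are computed against the *latest* speaker model, which is
        # then used as the prior for the next adaptation step.
        stats = sufficient_statistics(model, seg)
        model = gmm_from(*map_adapt(model, *stats, relevance=relevance))
    return model

def incremental_adaptation(ubm, segments, relevance=16.0):
    acc = None
    for seg in segments:
        # Statistics are always computed against the UBM and accumulated, so the
        # final model equals the offline MAP adaptation on all segments at once.
        stats = sufficient_statistics(ubm, seg)
        acc = stats if acc is None else tuple(a + s for a, s in zip(acc, stats))
    return gmm_from(*map_adapt(ubm, *acc, relevance=relevance))
```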
0:09:06 Now, the system implementation. There is a supervised phase and an unsupervised, online phase.
0:09:12 In the supervised phase we have a certain amount of labeled segments per speaker, and with those, after feature extraction, we initialize the models of all the people speaking in the meeting, for example.
0:09:38 In the online phase, instead, processing is unsupervised: each incoming speech segment is classified, after being divided into segments of a maximum duration Ts that represents our latency.
0:09:58 These segments are classified against the available speaker models; we determine which speaker model is the most likely, label the segment according to the speaker model that maximizes the likelihood, and update that model by incremental or sequential MAP adaptation; we will show results for both.
0:10:26 In the online phase we also accumulate the sufficient statistics that are used to update the speaker models. So, in the online processing, each segment is assigned to one of the speaker models according to a maximum-likelihood criterion; the model that maximizes the likelihood of the features contained in the segment is used to label the segment, and that speaker model is then adapted by either sequential or incremental MAP adaptation.
0:11:05 This is the implementation of the system that we use.
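A minimal sketch of one step of this online loop, again under the assumptions above (speaker_models and speaker_stats are illustrative dictionaries keyed by speaker; the incremental update is shown):

```python
def online_step(ubm, speaker_models, speaker_stats, segment_feats, relevance=16.0):
    """Label one segment of at most Ts seconds and update the winning speaker's model."""
    # Average log-likelihood of the segment under each speaker model.
    scores = {spk: gmm.score(segment_feats) for spk, gmm in speaker_models.items()}
    best = max(scores, key=scores.get)

    # Incremental variant: accumulate UBM statistics for the winner and re-adapt from the UBM.
    stats = sufficient_statistics(ubm, segment_feats)
    speaker_stats[best] = tuple(a + s for a, s in zip(speaker_stats[best], stats))
    speaker_models[best] = gmm_from(*map_adapt(ubm, *speaker_stats[best], relevance=relevance))
    return best
```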
0:11:09 I will now present the experimental setup and the experimental results. We used four different datasets taken from the NIST Rich Transcription evaluations.
0:11:21 The first set, used to train the UBM, is a set of sixty meeting shows from the NIST RT'04 evaluation. The development dataset, a set of fifteen meeting shows from the RT'05 and RT'06 evaluations, is used to develop the system.
0:11:41 The evaluation set used to evaluate the system is a set of eight meeting shows from RT'07 and a set of seventeen shows from the RT'09 evaluation, and we show the results independently for these two datasets to allow a better comparison with previous work.
0:12:00 For the experimental setup, we use nineteen mel-frequency cepstral coefficients augmented by energy, with twenty-millisecond windows and a ten-millisecond shift, that is, ten milliseconds of overlap.
0:12:14 The UBM is trained on the UBM dataset with ten EM iterations and sixty-four Gaussian components.
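A minimal sketch of this front end and UBM training, assuming librosa and scikit-learn as the tooling (an assumption; the talk does not specify the implementation):

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path, sr=16000):
    """19 MFCCs plus an energy-like coefficient, 20 ms windows, 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.020 * sr), int(0.010 * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    return mfcc.T                                    # (T, 20): c0 plus 19 cepstra

ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=10)
# ubm.fit(np.vstack([extract_features(p) for p in ubm_training_files]))
```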
0:12:21 The analysis window, that is, the segment duration that corresponds to the latency of the system, is evaluated for different latencies, from 0.25 and 0.5 seconds up to four seconds.
0:12:37 The amount of training data used to initialize the models ranges from one up to thirty-nine seconds. We show results for both sequential and incremental MAP adaptation, and the relevance factor for the MAP adaptation is kept fixed.
0:12:56 Overlapped speech is removed according to the transcription, which is one limitation of this setup.
0:13:03 The offline baseline is a top-down diarization system that we use as a reference for the results.
0:13:16 In the plot on the left, the results obtained with the sequential MAP adaptation approach are presented. We can see that, by allowing a sufficient amount of labeled training data to initialize the models, we manage to perform better than an offline diarization system.
0:13:37 The results on the right, with incremental MAP adaptation, show that accumulating the statistics yields better performance curves, because the models are updated more reliably.
0:13:52 We can see that offline diarization performance is reached with only five seconds of speaker data in the sequential case, while incremental adaptation works better and needs only three seconds.
0:14:05 By allowing more training data, the diarization error rate is reduced to around ten percent.
0:14:15 Note that for the lowest latencies the system does not perform well, simply because the latency is really too low.
0:14:32 In this table we present the results for different amounts of training data, three, five and seven seconds, for the different datasets; all these results correspond to a latency of three seconds.
0:14:50 We can see that in all cases incremental MAP adaptation works better than sequential MAP adaptation, so accumulating the statistics provides better results.
0:15:01 Finally, this graph represents the amount of training data as a function of the latency at the point where we match the offline diarization performance: all points correspond to a DER of seventeen percent. Here again, incremental MAP adaptation works better than sequential MAP adaptation.
0:15:29 For future work, the goal is to reduce both the latency and the amount of training data needed to reach this performance.
0:15:42 To conclude, we proposed a semi-supervised online diarization system, and we showed that, on the RT'07 dataset, the system can outperform an offline diarization system with only three seconds of speaker seed data and a latency of three seconds, when using the incremental MAP adaptation approach.
0:16:03 By allowing a higher latency or more training data, we obtain an even lower DER, around ten percent.
0:16:10 Also, if we accept the inconvenience of initializing the speaker models with some labeled training data, this opens the way to the development of supervised or semi-supervised speaker-discriminative feature transformations, both to reduce the latency and the amount of data required.
0:16:42 Thank you. We have time for a few questions.
0:16:54 Thank you for the talk. For this online system, do you need to know how many speakers are in the conversation?
0:17:04 Yes; in this work we assume that the number of speakers is known in advance in order to initialize the models, but we are investigating other ways to introduce new speakers that are unknown at the beginning.
0:17:20 And you divided the data between the speakers, so you assume that all the speakers speak from the beginning?
0:17:28 Yes.
0:17:30 They present themselves, or something like this?
0:17:33 Yes, exactly; we assume a scenario in which everybody speaks at the beginning.
0:17:38 Your offline system does not assume that the number of speakers is known, so it is not a totally fair comparison with the online version.
0:17:52 I agree with the comment. When we ran these experiments we had only that offline diarization system available, and it was difficult to initialize the speaker models there, so since we already had that baseline we decided to keep it as a reference and compare with it.
0:18:07 And a last question: what practical use do you see for this online diarization, for segmentation, for filtering?
0:18:17 There are other applications for which this work can be useful. For example, when you interact with a smart object, you can provide some initial data for the people who usually use that smart object, so that the system can then exploit those initial data.
0:18:41 Thank you very much.
0:18:45 Any other questions?
0:18:52 I have a question. From your presentation I understand that you are adapting all the parameters of the GMMs, that is, the weights, the covariances and the means. Have you tried to check whether adapting fewer parameters gives you some gain?
0:19:17 Yes, we tried adapting only the means. With sequential MAP adaptation it is better to stick with adapting only the means; adapting all the parameters gives worse results, because only a few frames of data are used to update the model at each step.
0:19:39 With incremental MAP adaptation, however, we keep accumulating the statistics, and those statistics are also useful to update the variances and the weights as well; so in the incremental case, updating all the parameters provides better results than updating only the means.
0:19:59 And do you think it would make sense to compare in terms of the number of parameters, maybe increasing the number of Gaussians in the UBM?
0:20:11 You mean reducing the number of Gaussians for a fairer comparison?
0:20:21 No, I was thinking of increasing it; as more data become available, for example, the model might become able to support a larger number of components, maybe double.
0:20:36 We did not try that in this case; we used sixty-four components because that is what we used in previous experiments, but it might be a good idea.
0:20:50 There is still time for questions.
0:20:59 Any questions?
0:21:06 Is there a good explanation of why your system does better than the offline system?
0:21:14 Sorry, why my system does better than the offline system?
0:21:19 Yes, how would you explain it?
0:21:22 In previous work we tried a totally unsupervised system in which the number of speakers was unknown, and the performance was much worse than that of an offline diarization system.
0:21:35 As I said before, the comparison is not entirely fair, because the offline diarization system does not see the number of speakers, while here we allow the number of speakers to be known in advance in order to get better performance and to target practical applications. Knowing the number of speakers adds a lot of information to the problem, information that the offline diarization system does not have.
0:22:08 I understand that you can decrease the latency, but an offline system can also be run in an online fashion; so what is the difference, in the end? I mean, you can imitate online processing by using an offline system.
0:22:22 Okay, but the offline system basically needs all the audio from the beginning, whereas here you do not have to keep all the past audio.
0:22:34 Also, the offline system that we used is computationally heavy, since it runs a lot of segmentation and clustering iterations, so from the point of view of latency it is much worse than the online diarization system.
0:23:04 Any other questions?
0:23:07 So, let's thank the speaker again.