0:00:06well
0:00:07after a great discussion about the
0:00:10last
0:00:11talk i will continue with another topic related to speaker diarisation
0:00:17my name is bob automatic and
0:00:19i was working the previous semester
0:00:23as an erasmus student in that
0:00:26uh you at the university of i mean you wanna
0:00:29at all about
0:00:30about the last
0:00:31in formatting that venue
0:00:32uh were my supervisors where
0:00:35coding the video and there is not true
0:00:39it was a preliminary study
0:00:42of factor analysis based approaches applied to the speaker diarization task
0:00:48of meetings
0:00:50well
0:00:51what it will be about
0:00:53i will briefly describe speaker diarisation
0:00:58and factor analysis
0:01:00i will tell you something about the objectives of this study some experiments
0:01:05and the
0:01:06perspectives
0:01:10briefly about diarisation i suppose almost all of you know what speaker diarization means
0:01:20and what
0:01:20its purpose is
0:01:22speaker diarization tries to answer the question
0:01:27who spoke when
0:01:29we don't have
0:01:31any a priori knowledge
0:01:32about the speakers their number
0:01:35and their identity
0:01:38as you can see here is a small
0:01:44example of the output
0:01:46of such a system
0:01:48where we can see the
0:01:50speech segments labelled by speaker
0:01:55the diarisation system tries to find the segments of the same
0:01:59speakers
0:02:00and label them
0:02:01for my experiments i used a diarisation system developed in
0:02:08in the yeah
0:02:09the system has participated in the nist rich transcription
0:02:15campaigns since two thousand three
0:02:18the system uses a top-down strategy
0:02:21what the top-down strategy is i will
0:02:24show
0:02:25now
0:02:26the top-down strategy consists of
0:02:29four main steps
0:02:31the first
0:02:32is speech activity detection
0:02:37where two pretrained gmm models
0:02:42are used
0:02:44as models of speech and nonspeech
0:02:49then
0:02:50viterbi decoding and map adaptation are used
0:02:54another step is segmentation
0:02:56where
0:02:57an evolutive
0:03:00hidden markov model is used
0:03:03also with viterbi decoding
0:03:07the third and the fourth
0:03:09steps
0:03:10are almost the same it is resegmentation but
0:03:13using a different
0:03:15parameterisation
0:03:19factor analysis
0:03:22is well known in fields like speaker verification
0:03:27language identification and video genre classification
0:03:35the big difference
0:03:41can be
0:03:43mathematically
0:03:44described
0:03:45by these two equations
0:03:47where the first equation
0:03:50is standard gmm ubm modelling
0:03:52and
0:03:53the second equation
0:03:57contains
0:04:00U x
0:04:02which
0:04:04is modelling the session variability
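The two equations referred to here are on the slides and are not captured in the transcript. As a hedged reconstruction, the usual way this contrast is written in the speaker verification literature is shown below. The symbols are assumptions, not taken from the talk: m is the UBM mean supervector, D a diagonal matrix, y_s the speaker vector, U a low-rank matrix and x_(h,s) the session factors for session h of speaker s.

```latex
\begin{align}
% standard GMM-UBM modelling with MAP adaptation: UBM mean plus a speaker offset
m_{s} &= m + D\,y_{s} \\
% factor analysis: the same model plus a low-rank session variability term
m_{(h,s)} &= m + D\,y_{s} + U\,x_{(h,s)}
\end{align}
```

The only difference between the two lines is the added term U x, which is the component the speaker says models the session variability.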
0:04:10so what about
0:04:11trying factor analysis
0:04:13modelling on
0:04:15single audio files
0:04:19we have the situation for example
0:04:22that the speaker is
0:04:23speaking and
0:04:25the recording environment is changing like
0:04:28the speaker is walking
0:04:30around the microphone and the distance between
0:04:33the speaker
0:04:35and the microphone is changing
0:04:37factor analysis can be
0:04:39helpful in this case
0:04:44we tried
0:04:47two approaches in this work
0:04:50the first is by localising a subspace U containing the inter segment variability
0:04:56and the second is
0:04:59by localising the interspeaker variability
0:05:05about the experimental protocol the details are the following as a development set i used twenty three audio files
0:05:13from the nist rich transcriptions
0:05:16from two thousand four
0:05:18to two thousand six
0:05:22they were recorded in seven different meeting rooms and
0:05:25some
0:05:26statistical data
0:05:28the recordings
0:05:29last from ten to eighteen minutes
0:05:32and contain from four to nine participants
0:05:36and
0:05:36as the evaluation set i used
0:05:40seven audio files from nist from the previous year
0:05:45they last from seventeen to twenty seven minutes
0:05:48and contain from four to seven speakers
0:05:54multiple distant microphones were used here and as a performance
0:05:59measure
0:06:00i used the diarisation error rate
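As a reminder of what the diarisation error rate measures, here is a minimal sketch, not the NIST scoring tool. It assumes frame-level labels with at most one speaker per frame, `None` for non-speech, and hypothesis labels that have already been mapped to the reference speaker names. The function name and the toy data are hypothetical. The official scoring additionally searches for the best speaker mapping and has options for collars and overlapping speech, none of which is modelled here.

```python
from collections import Counter


def diarization_error_rate(reference, hypothesis):
    """Simplified frame-level DER: (missed + false alarm + confusion) / speech frames."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    counts = Counter()
    for ref, hyp in zip(reference, hypothesis):
        if ref is None and hyp is not None:
            counts["false_alarm"] += 1   # system says speech, reference says none
        elif ref is not None and hyp is None:
            counts["missed"] += 1        # reference speech not detected
        elif ref is not None and ref != hyp:
            counts["confusion"] += 1     # speech attributed to the wrong speaker
    speech_frames = sum(1 for ref in reference if ref is not None)
    if speech_frames == 0:
        raise ValueError("reference contains no speech")
    errors = counts["missed"] + counts["false_alarm"] + counts["confusion"]
    return errors / speech_frames


# Toy example: 10 frames, 8 of them speech in the reference.
reference = ["A", "A", "A", None, "B", "B", "B", "B", None, "A"]
hypothesis = ["A", "A", "B", "B", "B", "B", None, "B", None, "A"]
der = diarization_error_rate(reference, hypothesis)
print(f"DER = {der:.3f}")
```

In the toy example there is one confusion, one false alarm and one missed frame over eight reference speech frames.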
0:06:05the factor analysis modelling was applied
0:06:08only
0:06:09in the third step of the speaker diarisation system
0:06:14now the first approach
0:06:16the modelling of
0:06:18interspeaker variability
0:06:23the U matrix here
0:06:27in this equation
0:06:29is common to all speakers
0:06:31and the assumptions are that the
0:06:34main relevant speaker information is located in the low
0:06:37dimensional subspace and the rest
0:06:41of the speaker information in the full space
0:06:45and the results are on the next
0:06:47page
0:06:48there is
0:06:51nothing interesting
0:06:52except
0:06:53one thing
0:06:54it's the difference
0:06:56between
0:06:57these two columns
0:07:00what does it mean the first column
0:07:03contains the baseline diarization error rate
0:07:07of
0:07:07this file
0:07:08without application of factor analysis
0:07:11the next
0:07:12column contains the
0:07:14results
0:07:15after
0:07:16application of factor analysis for segmentation
0:07:20containing
0:07:21the U x
0:07:23and the last without
0:07:24the U x
0:07:26and the difference is
0:07:28big
0:07:28on average about ten percent
0:07:30what does it mean it means that the U x
0:07:34can
0:07:35contain some information
0:07:37useful
0:07:38for
0:07:39modelling the
0:07:40speaker
0:07:42in this case the only thing
0:07:45which is important
0:07:46is that all the results
0:07:48are
0:07:49on average worse
0:07:52the second approach is the inter segment variability
0:08:00it's almost the same except
0:08:04the basic
0:08:04thing that the variability
0:08:08modelled is
0:08:08inter segment
0:08:10so the results are
0:08:13on this page
0:08:17here is the baseline
0:08:19diarisation error rate
0:08:22here
0:08:24after
0:08:24application of factor analysis
0:08:28with modelling with U x
0:08:30and here without
0:08:31U x
0:08:34what is
0:08:35interesting here
0:08:38is only the fact that
0:08:41some speaker information
0:08:45is present in the inter segment component but
0:08:47not significant
0:08:50i tried another experiment
0:08:53and it was based on filtering
0:08:57of the speech segments
0:09:00in
0:09:00the
0:09:01development set
0:09:03in the first column you can see
0:09:06the
0:09:07results of the system
0:09:09which uses
0:09:10a U matrix
0:09:12estimated on all speech segments from the development set
0:09:18in the next column you can see the
0:09:20results of the
0:09:21system
0:09:22using a
0:09:24U matrix estimated on segments
0:09:27longer than or equal to
0:09:29one second
0:09:30and so on
0:09:32so seconds five second consequence
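The filtering described here can be sketched as a simple selection step before the U matrix is estimated. This is a minimal illustration under assumptions: segments are represented as hypothetical `(speaker, start, end)` tuples in seconds, the function name is invented, and only the one second threshold comes from the talk. The other thresholds are illustrative, and the estimation of U itself is not shown.

```python
def filter_segments(segments, min_duration):
    """Keep only segments lasting at least min_duration seconds."""
    return [
        (speaker, start, end)
        for speaker, start, end in segments
        if end - start >= min_duration
    ]


# Hypothetical reference segments from a development file: (speaker, start, end).
segments = [
    ("spk1", 0.0, 0.5),
    ("spk2", 0.5, 1.5),
    ("spk1", 1.5, 4.0),
    ("spk2", 4.0, 9.5),
]

# One training subset per threshold; 0.0 keeps every segment.
kept = {threshold: filter_segments(segments, threshold) for threshold in (0.0, 1.0, 5.0)}
for threshold, subset in kept.items():
    print(f">= {threshold:.1f} s: {len(subset)} segments")
```

Each subset would then be used to estimate its own U matrix, giving one results column per threshold.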
0:09:33the most interesting thing i think
0:09:37in this paper is
0:09:41the big difference
0:09:43in these values
0:09:44for this file
0:09:49the original
0:09:51diarization error rate
0:09:53for this file was about twenty percent
0:09:57after application of
0:09:58this modelling and this filtering of
0:10:01segments shorter than one second
0:10:03we improved the segmentation
0:10:05by about fifteen point five
0:10:09percent
0:10:13well
0:10:13it's interesting
0:10:15that
0:10:16we improved
0:10:17this segmentation
0:10:20so much
0:10:22we got from
0:10:23twenty percent error rate to five percent error rate
0:10:27what about the next
0:10:28resegmentation step using the
0:10:32standard resegmentation step
0:10:34the hypothesis is that we can gain
0:10:38another improvement
0:10:40with viterbi and map adaptation
0:10:43and
0:10:44we can see here that
0:10:46this hypothesis
0:10:47is confirmed because
0:10:49from the changed
0:10:51segmentation
0:10:53we improved it
0:10:55by another one point four percent
0:10:58but this is
0:11:00important
0:11:02and significant only for
0:11:04this file
0:11:07where the segmentation
0:11:08changed a lot
0:11:15in general
0:11:16it's not significant
0:11:19these changes
0:11:22and the resegmentation
0:11:26was just
0:11:27classical viterbi and
0:11:29map adaptation
0:11:36i would like to summarise
0:11:37this work
0:11:39i tested two strategies
0:11:43the
0:11:44interspeaker variability modelling and the inter segment
0:11:48variability modelling
0:11:50and
0:11:51only the second
0:11:53gives an improvement
0:11:56of the segmentation
0:11:58but a
0:11:59very
0:12:00small one
0:12:02it can be useful
0:12:05to
0:12:06filter out
0:12:08some short
0:12:10speech segments
0:12:11in the
0:12:12estimation of the U matrix
0:12:15and it's
0:12:17also useful to use another
0:12:20resegmentation step
0:12:27future work can be done with
0:12:30more training data
0:12:32and
0:12:36a larger number of speakers when dealing with the
0:12:39interspeaker variability
0:12:43regarding the inter segment variability
0:12:46it can be interesting
0:12:49when dealing with multiple distant microphones
0:12:53and
0:12:55also another
0:12:57test
0:12:57can be done
0:13:01with the application of factor analysis based speaker modelling in the first step
0:13:06of
0:13:07the speaker diarization system
0:13:14well thank you very much for your attention
0:13:16and
0:13:16if you have any questions
0:13:26question
0:13:32you only reported an improvement when you selected only the
0:13:36speech segments longer than one second
0:13:39right
0:13:40it means that actually in your segmentation of most of most lots of research
0:13:44and this is your variable files
0:13:46so good that was how we were i was configure are there any
0:13:50limits for the minimum duration of a segment
0:13:53sorry i cannot tell you anything about the vad because i just worked
0:13:58with the diarization system as it was
0:14:01uh maybe uh korean if uh not serious
0:14:22but maybe i didn't understand well this filtering is done on the
0:14:28development
0:14:29set
0:14:37uh_huh
0:14:51yeah in fact the U matrix
0:14:53is trained on the
0:14:55development set so we have to wait for instance the development set
0:14:59so we can choose
0:15:00and the length of the segment
0:15:02and you try to train
0:15:06yeah but the united estimation yeah
0:15:11yeah
0:15:11i have a question
0:15:14so
0:15:14i see there is inter speaker variability and inter segment
0:15:18variability
0:15:19and
0:15:23i guess inter segment variability reflects the changes of
0:15:27speaker
0:15:28is it useful information for
0:15:31detecting the speaker
0:15:32change
0:15:35so
0:15:36and we expect
0:15:38a two
0:15:39speaker
0:15:39i think they should okay
0:15:41can you do some information but
0:15:43we should keep
0:15:44okay segment and applications compensated and
0:15:48nation
0:15:49well you can line
0:15:50why not
0:15:51in the estimation of the U matrix
0:15:54on the whole development set
0:15:56we had the reference
0:15:58and the U matrix was estimated
0:16:02in this case
0:16:06for each speaker
0:16:09between the segments
0:16:11of
0:16:12one speaker
0:16:14so it was not
0:16:16inter segment variability
0:16:19in the sense of
0:16:21variability between
0:16:21all segments
0:16:24it was the inter segment variability of
0:16:26a certain speaker
0:16:31all speakers soprano testing
0:16:36and then you do the presegmentation
0:16:38using a generative model
0:16:41you can see you mentioned B B segmentation
0:16:44i always
0:16:45process so you have one
0:16:47one night lately
0:16:50and
0:16:52how many rounds
0:16:53right
0:16:55how many resegmentations
0:16:59well
0:17:00normally there is
0:17:02one
0:17:03segmentation and then a resegmentation takes place
0:17:07in this case it was a resegmentation using factor analysis
0:17:12modelling
0:17:16and the resegmentation was iterated until
0:17:20the number of
0:17:22changes
0:17:23in the segmentation
0:17:26was
0:17:28less than a certain
0:17:30value
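The stopping rule described in this answer, iterating the resegmentation until the number of changes in the segmentation falls below some value, can be sketched as follows. This is only an illustration under assumptions: the segmentation is a list of per-frame speaker labels, `toy_resegment` is a hypothetical stand-in for one pass of Viterbi decoding plus model adaptation, and the threshold and iteration cap are invented values rather than the system's settings.

```python
def count_changes(old_labels, new_labels):
    """Number of frames whose speaker label differs between two segmentations."""
    return sum(1 for old, new in zip(old_labels, new_labels) if old != new)


def resegment_until_stable(labels, resegment_once, max_changes, max_iterations=20):
    """Repeat resegmentation until fewer than max_changes frames change."""
    iterations = 0
    while iterations < max_iterations:
        new_labels = resegment_once(labels)
        changes = count_changes(labels, new_labels)
        labels = new_labels
        iterations += 1
        if changes < max_changes:
            break
    return labels, iterations


def toy_resegment(labels):
    """Stand-in for one resegmentation pass: relabel a frame when both neighbours agree."""
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1]:
            smoothed[i] = labels[i - 1]
    return smoothed


initial = list("AABAABBABB")
final, iterations = resegment_until_stable(initial, toy_resegment, max_changes=1)
print("".join(final), iterations)
```

With this rule the number of iterations is not fixed in advance. It depends on how quickly the segmentation stops changing.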
0:17:33one one
0:17:38one resegmentation process
0:17:40with many iterations
0:17:41right
0:17:42which
0:17:46oh
0:17:47slide
0:17:49a class
0:17:52uh
0:17:54i don't know which slide you mean
0:17:56in this uh
0:17:57there are parts of the uh
0:18:01right
0:18:01segmentation
0:18:02yes
0:18:03yeah
0:18:04this is the original baseline system
0:18:07and there are two resegmentation steps and the factor analysis
0:18:12took place after this
0:18:14resegmentation step
0:18:16as the last
0:18:17part of the diarization system
0:18:20okay
0:18:24you can can anything
0:18:25in fact the number of iterations is not fixed
0:18:28it depends on the segmentation changes
0:18:31and giving them a sense
0:18:32so when we estimate that there are
0:18:35no more changes
0:18:37in the segmentation at a given step we stop
0:18:42thank you
0:18:44no
0:18:46i actually
0:18:50yes
0:18:56but you tested so
0:18:58you only scored the sections of the meetings that did not have overlapping speakers correct
0:19:06we tested it only on the evaluation set from nist
0:19:10but there were different ways to score that there was a parameter which determines how much overlapping speech
0:19:16was included
0:19:19and your error rates
0:19:22are quite low so i assume you
0:19:24did not score the overlapping
0:19:26speakers
0:19:27but that's just an assumption i want to
0:19:29from you
0:19:32well there are rights uh
0:19:36maybe maybe you don't know because you just drama
0:19:38for example the yeah right
0:19:40here are
0:19:41are the global arrays that although the total
0:19:44all rights including curve force are um with
0:19:46speech and the speaker
0:19:48you could change
0:19:50okay i i don't know uh
0:19:52if i
0:19:53and just
0:19:54oh okay
0:19:55and then about this one meeting
0:19:57where you
0:19:58had a significant improvement
0:20:02i
0:20:03remember that in one of the nist meetings
0:20:06there was a
0:20:07much larger number of speakers
0:20:09than
0:20:10in the other meetings
0:20:12and i wonder if that was the one meeting where you saw a gain
0:20:16so there were many more
0:20:19speaker changes because the number of speakers was actually
0:20:22like double that of the other meetings
0:20:25so i wondered if you had actually looked at some statistics of your meetings
0:20:29to see if
0:20:31there is some variable like the number of
0:20:33speakers that
0:20:34could predict when your method works
0:20:37well and when it might make a difference
0:20:40oh well uh i i don't have
0:20:42anyhow
0:20:44oh
0:20:52no information
0:20:55and we we did not
0:20:57it
0:20:57and
0:20:58and is about to the
0:21:00this was it
0:21:01we
0:21:02we know that
0:21:03you say that again
0:21:05yeah sometimes
0:21:07and is not
0:21:08necessary you to to the fact and he's in good
0:21:11if we change
0:21:13and finally implies sense
0:21:14and
0:21:16insinuation of that and uh
0:21:18and this is and we know that we can
0:21:20and this improvement
0:21:22an infected E es work
0:21:24the good ones too
0:21:26and
0:21:27exp tool
0:21:28and don't we all
0:21:29applying thank john and easy C speaker deviation
0:21:33on meetings we had these aladdin
0:21:35speakers most because then
0:21:36implementation
0:21:38and a different connotation
0:21:41it is
0:21:45and that that's the overlap
0:21:47and we didn't
0:21:49scroll and we thought about that
0:21:51and
0:21:52because we we
0:21:54do something
0:21:55oh
0:21:56and to delete
0:21:57overlap
0:21:58and the ones to law school