0:00:06 So, hi everybody. I am presenting work on the use of GSV-SVM for speaker diarization and tracking. It was carried out by my colleague, who unfortunately could not come to the conference, in collaboration with several colleagues at our lab.
0:00:31 After the presentation of the task and the data, I will describe the two tasks that are explored, namely speaker diarization and speaker tracking, along with the systems used and the results obtained, before the conclusion and perspectives.
0:00:50 First, about the two tasks that we consider. Acoustic speaker diarization is the "who spoke when" task, combining segmentation and clustering, which was already described in the previous talks. We consider it as a pre-processing step for automatic speech recognition and for rich transcription.
0:01:13 In this situation we have no a priori information on the number of speakers or on the speakers' voices, and we consider only acoustic-driven approaches, since other approaches make linguistic use of the transcription. But we are also interested in speaker tracking.
0:01:31 Here we want to detect the regions of a spoken document that are uttered by a given speaker. In this situation we have a list of the speakers to detect, and we are provided with training data for these speakers. We consider the speaker tracking task as the combination of acoustic speaker diarization plus a speaker verification module in our configuration.
0:01:56 Our main motivation in this work was to include in our system the SVM techniques that, as you all know, have become very successful in speaker recognition.
0:02:12 We started with the GSV, the Gaussian supervectors, since that was an easy framework to develop, and there are further features that can be used in an SVM system for speaker recognition, namely MLLR, CMLLR or lattice MLLR, which are also efficient to combine with the Gaussian supervectors.
0:02:40 I also want to say a word about the context, the program that motivated our work. It is a French research and innovation program that aims to improve automatic audiovisual document structuring and indexing. For this work we wanted to focus specifically on speaker diarization and tracking for broadcast data.
0:03:09 That is why we stick to the scheme of offline diarization: we are working on broadcast data that are recorded and then put on the web, on the radio or on TV, so we can afford offline processing and an easy integration of SVM-based techniques.
0:03:34 We worked on the data of the French ESTER 2 evaluation. I hope that these data will soon be available to the whole community; for now they have been distributed to the participants of this evaluation in 2008.
0:03:50 There are about one hundred target speakers, for which we have training data consisting of French-speaking radio shows from different sources, French stations but also African ones.
0:04:15 For the impostor data, we used the ESTER 1 data, which is the previous evaluation data, with about four hundred impostors. Then the development data consist of twenty radio shows, for a total of about six hours, and the evaluation data are roughly the same amount: twenty-six radio shows, for about seven hours.
0:04:41 I also provide here some statistics on the number of speakers, the speaking time and the segment length for the ESTER 2 development and evaluation data sets. On the development set we have between nine and twenty-five speakers per show.
0:05:05 On the evaluation set the figures are roughly alike, with a similar number of speakers per show. The speaking length per speaker also varies a lot, with a mean of sixty-five seconds on the development set, ranging between half a second and more than ten minutes.
0:05:27 On the evaluation set it is a bit higher, but the standard deviation is very high, so this is just to give a rough idea. The segments themselves are on average sixteen to seventeen seconds long, and they also vary a lot, ranging from a fraction of a second to much longer segments.
0:05:56 I will now describe the acoustic speaker diarization system, which is basically the system that was just recapped in the previous talk. It was developed by colleagues and myself for an earlier evaluation campaign.
0:06:24 The initial segmentation uses a front end with the standard MFCC features found in ASR systems. The speech activity detection relies on Viterbi decoding with GMMs of speech, music and noise. The speech segments are then split into smaller segments using two adjacent sliding windows over the signal and a local Gaussian divergence measure to place the boundaries. GMMs are then trained on the signal given this first segmentation of the data.
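A minimal sketch of this kind of divergence-based change detection, assuming diagonal-covariance Gaussians over MFCC frames (the window length and threshold below are illustrative, not the values used in the actual system):

import numpy as np

def gaussian_divergence(left, right):
    # Each window is modelled by a single diagonal-covariance Gaussian;
    # the divergence grows when the means differ relative to the spreads.
    mu_l, mu_r = left.mean(axis=0), right.mean(axis=0)
    sd_l, sd_r = left.std(axis=0) + 1e-8, right.std(axis=0) + 1e-8
    return float(np.sum((mu_l - mu_r) ** 2 / (sd_l * sd_r)))

def detect_changes(frames, win=250, threshold=2.0):
    # Slide two adjacent windows over the MFCC frames and keep local maxima
    # of the divergence that exceed the threshold as change-point candidates.
    scores = [gaussian_divergence(frames[t - win:t], frames[t:t + win])
              for t in range(win, len(frames) - win)]
    return [t + win for t in range(1, len(scores) - 1)
            if scores[t] > threshold
            and scores[t] >= scores[t - 1] and scores[t] >= scores[t + 1]]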
0:07:11 Given this initial segmentation, we have a first step of agglomerative clustering using the classical BIC criterion, with full covariance matrices and a single Gaussian per cluster. The only specific point is the penalty, which is a local BIC penalty, taking into account only the number of frames of the two clusters that are compared, and not of the whole data.
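As a rough illustration, the merge criterion just described could be sketched like this (a single full-covariance Gaussian per cluster; the penalty is computed on the two compared clusters only, and the penalty weight is illustrative):

import numpy as np

def log_det_cov(frames):
    # log-determinant of the full covariance matrix of a set of frames
    return np.linalg.slogdet(np.cov(frames, rowvar=False))[1]

def delta_bic(x, y, lam=1.0):
    # Local BIC merge criterion between two clusters of feature frames.
    # Values below zero (or below a tuned threshold) favour merging.
    n_x, n_y = len(x), len(y)
    n, d = n_x + n_y, x.shape[1]
    both = np.vstack([x, y])
    # likelihood gain obtained by modelling the two clusters separately
    gain = 0.5 * (n * log_det_cov(both) - n_x * log_det_cov(x) - n_y * log_det_cov(y))
    # local penalty: only the frames of the two compared clusters are counted,
    # not those of the whole file
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty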
0:07:48 The output of the BIC clustering is then fed into a second step, a speaker-ID-like clustering, using slightly different features with feature warping, and MAP-adapting a UBM for each cluster. This clustering relies on the cross log-likelihood ratio between the two clusters.
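A small sketch of the cross log-likelihood ratio idea, assuming each cluster already has a GMM adapted from the UBM (the models here come from scikit-learn purely for the sake of the example):

import numpy as np
from sklearn.mixture import GaussianMixture

def cross_llr(model_a: GaussianMixture, frames_a: np.ndarray,
              model_b: GaussianMixture, frames_b: np.ndarray,
              ubm: GaussianMixture) -> float:
    # How much better each cluster's frames are explained by the other
    # cluster's model than by the UBM; high values suggest the same speaker.
    # GaussianMixture.score() returns the average per-frame log-likelihood.
    llr_a = model_b.score(frames_a) - ubm.score(frames_a)
    llr_b = model_a.score(frames_b) - ubm.score(frames_b)
    return 0.5 * (llr_a + llr_b)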
0:08:21 What we did, which was rather simple, was to take the GSV-SVM system and integrate it into the diarization system in place of the last SID clustering stage. As a quick reminder, the GSV supervector consists of the stacked means of the MAP-adapted GMM of a cluster.
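The supervector construction itself is straightforward; a minimal sketch using MAP adaptation of the UBM means (the relevance factor is illustrative, and the KL-style normalisation usually applied before the SVM is only noted in a comment):

import numpy as np
from sklearn.mixture import GaussianMixture

def gsv_supervector(ubm: GaussianMixture, frames: np.ndarray, relevance: float = 16.0):
    # Frame/component posteriors under the UBM, then zeroth- and first-order stats.
    post = ubm.predict_proba(frames)                  # shape (T, C)
    n_c = post.sum(axis=0)                            # occupation count per component
    f_c = post.T @ frames                             # first-order statistics, (C, D)
    # Classical MAP adaptation of the means only.
    alpha = (n_c / (n_c + relevance))[:, None]
    ml_means = f_c / np.maximum(n_c[:, None], 1e-8)
    adapted = alpha * ml_means + (1.0 - alpha) * ubm.means_
    # A normalisation by the UBM weights/covariances is usually applied before
    # feeding the vector to the SVM; omitted here for brevity.
    return adapted.ravel()                            # concatenated means = supervector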
0:08:57 It is well known that combining diarization systems can improve on the individual systems. There are several ways of doing this combination: piping one system into the other one, which is the kind of thing that we already do in our system, merging different systems, or doing cluster voting.
0:09:19 We did a version with score-level fusion, which means that during the clustering process we compute an average score between the GSV-SVM and the GMM-UBM scores, with a weight that is optimised on the development data.
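A sketch of how this fused score can drive the agglomerative clustering (the weight below is a placeholder for the value tuned on the development set, and the score dictionaries are assumed to be keyed by sorted cluster pairs):

from itertools import combinations

def fused_score(pair, svm_scores, gmm_scores, w=0.5):
    # Weighted average of the GSV-SVM and GMM-UBM scores for a cluster pair;
    # the weight w is the one optimised on the development data.
    return w * svm_scores[pair] + (1.0 - w) * gmm_scores[pair]

def best_merge(clusters, svm_scores, gmm_scores, w=0.5):
    # Candidate pair to merge next: the one with the highest fused score.
    pairs = [tuple(sorted(p)) for p in combinations(clusters, 2)]
    return max(pairs, key=lambda p: fused_score(p, svm_scores, gmm_scores, w))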
0:09:45 The performance measure is the diarization error rate. It was already described in the previous talks, so I won't get too much into it. Just to say that we also computed cluster purity and coverage figures, which are the ratio of the majority reference speaker data within a cluster and, conversely, the proportion of a reference speaker found in his or her best cluster; they can provide a better insight into the speaker errors. We used the NIST tool for scoring, following the ESTER 2 evaluation plan.
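As a rough illustration of these two measures (counting plain durations; the actual ESTER 2 scoring handles overlap regions and collars, which this sketch ignores):

from collections import defaultdict

def purity_and_coverage(overlap):
    # overlap[cluster][speaker] = duration of that speaker's speech inside the cluster.
    total = sum(t for spk in overlap.values() for t in spk.values())
    # cluster purity: time of the majority reference speaker within each cluster
    purity = sum(max(spk.values()) for spk in overlap.values()) / total
    # speaker coverage: time of each reference speaker found in his/her best cluster
    by_speaker = defaultdict(dict)
    for cluster, spk in overlap.items():
        for speaker, t in spk.items():
            by_speaker[speaker][cluster] = t
    coverage = sum(max(c.values()) for c in by_speaker.values()) / total
    return purity, coverage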
0:10:24 Here is a figure of the diarization error rate. To the left we have the performance of the GMM-UBM system, to the right that of the GSV-SVM system, and on the x-axis the different combination weights in between. The green curve is for the evaluation set and the red curve for the development set. We see that the GSV-SVM system performs better than the GMM-UBM, so there is a bit of a gap between the two systems, but the combination is very successful here.
0:11:12 More in detail, what we get is a ten percent relative improvement, going from 11.1 down to 10.1, from the best performing single system to the combination of the GMM-UBM plus SVM systems, on the development set. On the evaluation set we also see a gain, going down from 9.6 to 9.3.
0:11:42 This was for the acoustic speaker diarization system; now some words about speaker tracking. As I said, we build the system as a combination of the acoustic speaker diarization system, on the left, with a speaker verification system, and we have three possible ways of doing this combination: the speaker verification can be applied to the initial segments of the system, to the clusters output by the BIC stage, or to those output by the SID clustering step. In each case the segments are then compared to the speaker models and labelled accordingly.
0:12:30 A few words on the systems used for tracking. We use a GMM-UBM and a GSV-SVM system that have the same properties as the systems we have already presented for diarization. For the verification, we choose the target model with the highest likelihood ratio during the verification phase. The GSV-SVM system also follows the same architecture.
0:13:13 The constraint is that we score impostors and targets using the gender and channel matching the current condition, and we also perform a weighted linear score-level system fusion across all the scores.
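In code, the open-set decision and the score fusion sketched here might look as follows (the threshold and fusion weight are placeholders; the gender- and channel-matched impostor cohorts are assumed to be handled upstream when the raw scores are produced):

def track_segment(svm_scores, gmm_scores, w=0.5, threshold=0.0):
    # svm_scores / gmm_scores: dicts mapping each target speaker to the
    # likelihood-ratio-style score of the current segment against that speaker.
    fused = {spk: w * svm_scores[spk] + (1.0 - w) * gmm_scores[spk]
             for spk in svm_scores}
    best = max(fused, key=fused.get)
    # open-set decision: keep the best target only if it passes the threshold
    return best if fused[best] >= threshold else "unknown"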
0:13:33 The performance measures for the tracking task were those defined in the ESTER 2 evaluation campaign: recall and precision, an F-measure combining recall and precision in a time-weighted manner, and also a speaker-weighted F-measure that was proposed during the evaluation, where the F-measure is averaged per speaker.
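For reference, the two flavours of the measure could be computed roughly like this (a simplified view of the ESTER 2 scoring):

def f_measure(recall, precision):
    # harmonic mean of recall and precision
    return 2 * recall * precision / (recall + precision) if recall + precision else 0.0

def speaker_weighted_f(per_speaker):
    # per_speaker: dict speaker -> (recall, precision).
    # The time-weighted measure pools durations over all speakers before
    # computing recall/precision; here every target speaker counts equally.
    values = [f_measure(r, p) for r, p in per_speaker.values()]
    return sum(values) / len(values)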
0:14:03 Here are the DET curves, which were simulated by using short segments of the evaluation data, for all the different possible systems, on the evaluation data.
0:14:25 We compare the GMM-UBM and the GSV-SVM systems. The red, green and blue curves show the different versions of the GMM-UBM system, with the verification applied either at the output of the initial segmentation (in blue), at the output of the BIC stage (in red), or at the output of the SID stage (in green). It appears that there are not that many differences, and we get slightly better performance by using the output of the final stage, which is the SID stage. The GSV-SVM system, only shown here at the output of the SID clustering step, is shown to perform much better than the GMM-UBM.
0:15:17 Here are some figures: the tables provide the recall, the precision, the F-measure, and the F-measure averaged by speaker. I will mainly focus on the overall results, on the dev and on the eval data. On the development set we observe that the SID clustering step provides the best performance among the GMM-UBM configurations, and comparable performance to that of the GSV-SVM system, and the combination improves upon both systems. On the evaluation dataset, the GSV-SVM performs much better than the GMM-UBM system, and in this case the combination slightly outperforms the GSV-SVM and is much better than the GMM-UBM system.
0:16:29 Well, that was, I would say, a rather simple experimental framework to carry out the integration of the GSV-SVM system into speaker diarization and tracking systems. The GSV-SVM provides comparable or better performance than the existing standard GMM-UBM system that we had, and the score-level fusion was satisfactory.
0:17:00 There are some caveats, though, for example in the impostor set, which is not very balanced with respect to gender and channel; there are some very small subsets, for example on narrow-band data, where we have very few impostors for the experiments. And of course we want to go further with other SVM features like MLLR, CMLLR or lattice MLLR, and also with the very interesting directions that were presented in the previous talk.
0:17:34 Thank you for your attention.
0:17:44 You said that you are using delta and double-delta features; some other posters found that omitting the deltas can give a gain. Did you try omitting the deltas?
0:18:08 For example, in the first stage, for the initial segmentation, we use the deltas and delta-deltas. For the BIC stage we use full covariance matrices with no derivatives at all, only the static features. And for the second stage we use only the deltas, not the delta-deltas. One of the rationales is to have different feature representations in the different stages, to combine different aspects, a different flavour. I am not sure that this is optimal, because we did not test all the configurations; it was one way of doing it. But I agree that it is not clearly established that the deltas always bring something for diarization.
0:19:11 To add to what people have said: we observed that when the data are very clean, for instance some re-recordings we made in an acoustic room, the deltas did not give any gain; with this kind of data you will probably never see any gain from the deltas.
0:19:34 Thank you.
0:19:51 Could you go to slide fourteen... forty? Slide forty. Okay. Can you explain why, when you compare the results on the development and the evaluation data, there is such a difference between the two databases?
0:20:06 The databases were not recorded at the same dates, and they have a slightly different balance between the sources of the data. Some sources come from French stations and some others from African radios, which do not have the same characteristics, and the balance is slightly different between the dev and the eval sets; I think it explains part of the results.
0:20:37 And also, for the diarization system, the gains themselves are fairly small, so it is not that much in absolute terms.
0:20:54 When you do speaker tracking, you sometimes have a speaker who speaks a lot, so his model is fine, and sometimes a speaker who has only a little data. How do you handle that in your system?
0:21:17 In the speaker tracking, the verification score is compared to a threshold; there is only a normalisation by the length of the segment, but for the threshold there is no normalisation according to the amount of data available for the speaker. I agree that it is something that needs to be addressed.
0:21:51 Okay, I think that is all. Thank you.