0:00:13 ...segmentation.
0:00:16 Good afternoon everyone, and thank you for coming. I am going to present the work we have done at our university on variability compensation for the segmentation of two-speaker telephone conversations. We also present a technique to generate several segmentation hypotheses for a given recording and to select the best hypothesis among them.
0:00:44 This work is focused on the segmentation of two-speaker conversations, so it is a speaker diarization problem: we are aiming at answering the question "who spoke when". It is an easier task than general diarization, since the number of speakers is known and limited to two. In this case, finding the boundaries between the speakers is the core of the diarization problem, so we can treat it as a segmentation problem.
0:01:22 Factor analysis has been very successful in the field of speaker verification, and this has motivated new approaches for the segmentation of two-speaker conversations, many based on factor analysis using eigenvoices. In these approaches, the speaker is modeled with a GMM supervector that can be represented by a small-dimensional vector that we will call the speaker factors, whose dimension is much lower than that of the GMM supervector. The main idea is that, with such a compact speaker representation, we can estimate the parameters of the representation on very short segments, and that is what we do for speaker segmentation.
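For reference, this is the standard eigenvoice decomposition from the factor-analysis literature; the notation below is the conventional one and is an assumption, since the symbols on the slides are not audible:

```latex
% Eigenvoice model: the speaker-adapted GMM supervector M is constrained
% to a low-dimensional subspace around the UBM supervector m.
%   m : UBM supervector,  V : low-rank eigenvoice matrix,
%   y : speaker factors, with dim(y) << dim(M).
M = m + V y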
0:02:11 So we extract a stream of speaker factors over the input signal: we slide a one-second window and, frame by frame, we extract a sequence of speaker factor vectors. We then cluster these speaker factors into two clusters using PCA plus k-means clustering. Once we have the two clusters, we fit a single full-covariance Gaussian for each speaker, and with a Viterbi segmentation we obtain a first segmentation output. Finally, we refine this segmentation with a resegmentation step that uses MFCC features and GMM speaker models.
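As a rough illustration of this clustering stage, here is a minimal sketch assuming numpy/scikit-learn; the array shapes and function names are mine, and the Viterbi resegmentation step is not shown:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_speaker_factors(Y, n_components=1):
    """Cluster a stream of speaker factors into two speakers.

    Y: (n_windows, n_factors) array, one speaker-factor vector per
    one-second window, extracted frame by frame over the recording.
    """
    # Project the factors onto the leading principal component(s).
    Y_pca = PCA(n_components=n_components).fit_transform(Y)
    # Two-class k-means gives the initial speaker labelling.
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(Y_pca)
    # One single full-covariance Gaussian per speaker in factor space.
    models = []
    for k in (0, 1):
        Yk = Y[labels == k]
        models.append((Yk.mean(axis=0), np.cov(Yk, rowvar=False)))
    return labels, models
```

The exact role of the PCA here (projection versus k-means initialization) is clarified in the question session at the end.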
0:02:57 The main contribution of this work is the analysis of the different sources of variability. First, consider the kinds of variability we can find. If we have a set of recordings containing different speakers and we analyze the variability present in these recordings, we will find that there is variability, and that it is mainly due to the speakers present in the recordings, so we will refer to it as the speaker variability. But if we analyze a set of recordings belonging to the same speaker, we can see that there is also variability among these recordings, usually due to aspects like the channel or the mood of the speaker. This variability is usually known as intersession variability.
0:04:01 In addition, if we analyze a recording containing a single speaker, split it into smaller slices, and analyze the variability among these slices, we will see that there is also variability within the recording. This variability is usually due to the phonetic content or to the variation of the channel within the recording, and we will refer to it as intrasession variability.
0:04:33 In our approach for speaker segmentation we are only modeling the speaker variability, so the question is: are the other types of variability, the intersession and the intrasession variability, affecting the segmentation performance?
0:04:52 The first question is whether we need to compensate for intersession variability in this segmentation task. We know that intersession variability compensation is very important for speaker recognition, and one would be tempted to say that it is not so important for speaker segmentation, but it has been reported that the channel factors can help here as well. We have some preliminary experiments showing that it helps, maybe not much, but we believe it should not help much, because in a diarization task you do not see the same speaker over different sessions: the speakers share a single session, and you do not have any prior information about them. Actually, we believe it is a variability that may help to separate the speakers in a diarization task, because the channel carries information that can help to tell the speakers apart.
0:05:44 And what about the intrasession variability? In the field of speaker recognition, state-of-the-art systems do not take intrasession variability into account, since they use the whole conversation to train the model. But we think it is very important for speaker segmentation and diarization, because many state-of-the-art systems are based on the clustering of very short segments, and if we can compensate the variability among the segments of a given speaker, the clustering process should be easier.
0:06:23 So this is what we try to do. Given a dataset that contains several speakers and several recordings per speaker, we extract a stream of speaker factors from each recording. We then consider every recording as a different class, and we model the speaker and intersession variability as between-class variability, since we believe that both help to separate the speakers within a recording, while we model the intrasession variability as within-class variability. With this framework it is easy to apply well-known techniques such as linear discriminant analysis (LDA), which maximizes the between-class variance while minimizing the within-class variance, or within-class covariance normalization (WCCN), which normalizes the covariance of every class towards the identity matrix. These two techniques have been successfully applied for intersession compensation in state-of-the-art speaker recognition systems.
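A minimal sketch of WCCN as described in the speaker-recognition literature; the variable names are assumptions:

```python
import numpy as np

def wccn_projection(X, labels):
    """Estimate the WCCN projection from vectors X (n, d) with class
    labels; here every recording of the dataset is one class."""
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    # Average the within-class covariance over all classes.
    for c in classes:
        W += np.cov(X[labels == c], rowvar=False)
    W /= len(classes)
    # Cholesky factor B of W^-1: projecting with B.T turns the average
    # within-class covariance into the identity matrix.
    return np.linalg.cholesky(np.linalg.inv(W))

# A new speaker-factor vector y is then compensated as  B.T @ y.
```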
0:07:27 To evaluate these two approaches we use the NIST SRE summed-channel condition, containing more than two thousand five-minute telephone conversations. The speech/non-speech marks are given, and we measure the performance in terms of the speaker segmentation error, or speaker error, which is one part of the diarization error rate. Since we start from the given speech/non-speech segmentation and we do not take overlapped speech into account, the diarization error rate is the same as the segmentation error or speaker error. To score, we use a 0.25-second collar around the reference boundaries.
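The scoring described above amounts to something like the following sketch (illustrative only; the official NIST md-eval tool also handles the speaker mapping and overlap rules):

```python
import numpy as np

def speaker_error_rate(ref, hyp, boundaries, frame_step=0.01, collar=0.25):
    """Frame-level speaker error with a scoring collar.

    ref, hyp: per-frame speaker labels (0/1) over speech frames only,
    after the best one-to-one speaker mapping has been applied.
    boundaries: reference speaker-change times in seconds.
    """
    times = np.arange(len(ref)) * frame_step
    # Exclude frames within +/- collar of a reference speaker change.
    scored = np.ones(len(ref), dtype=bool)
    for b in boundaries:
        scored &= np.abs(times - b) > collar
    return (ref[scored] != hyp[scored]).mean()
```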
0:08:06 Here we have the results for the system using a small UBM, with two hundred and fifty-six Gaussians, and MFCC features. In this case we do not apply the resegmentation step. We can see the segmentation error of our baseline; using intersession variability compensation with WCCN and twenty speaker factors, the segmentation error is reduced to 2.5. You can also see another baseline, with fifty speaker factors, that is slightly better; not much, but slightly better. We also tried LDA for dimensionality reduction, and we can see that LDA is helping, but WCCN is better: it obtained a two percent segmentation error. Even the combination of both is not better than applying WCCN directly.
0:09:05 Then we tried these systems after the resegmentation step, and it was surprising that after the resegmentation step all the systems were more or less equal, whether using twenty or fifty speaker factors. The intersession variability compensation with WCCN was still working and giving an improvement, but it seemed that it was not useful to increase the number of speaker factors. We were a little disappointed with this, because we thought it ought to help, so we ran a new set of experiments that are not in the paper, but I am presenting them here.
0:09:43 We used a larger UBM and more features, and in this case increasing the number of speaker factors does help. We can see that our baseline with fifty speaker factors now gets a 1.8 segmentation error, which is lower than before, when it was 2.1. When we add channel compensation with WCCN, the error drops to 1.4. In this case we also increased the number of speaker factors to test LDA, and we see that LDA is helping even more than before, and that our best configuration now is to combine LDA plus WCCN. So it seems that, although the baseline with a hundred speaker factors is not better than the baseline with fifty speaker factors, with LDA we can take advantage of the additional speaker factors. Our best result is a 1.3 segmentation error.
0:10:45 On the other hand, we propose a new technique to generate several segmentation hypotheses for a recording and to select the best one based on a set of confidence measures. What we do is iteratively split the recording in a binary decomposition; with this we obtain four levels of splitting, as we can see in the figure, and we segment every slice with the proposed system. Then, for every level, we select the best segmented slices and we combine them to build the two speaker models. With these two speaker models we resegment the whole recording, using a Viterbi resegmentation with MFCC features and the GMM speaker models.
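My reading of the splitting scheme (the audio is unclear here) is an iterative binary decomposition of the recording; a minimal sketch under that assumption:

```python
def split_levels(start, end, n_levels=4):
    """Split a recording [start, end] into slices, one list of slices
    per level: level 0 is the whole recording, level k has 2**k slices."""
    levels = [[(start, end)]]
    for _ in range(n_levels - 1):
        levels.append([piece
                       for (s, e) in levels[-1]
                       for piece in ((s, (s + e) / 2), ((s + e) / 2, e))])
    return levels

# Each slice is then segmented with the proposed system, the best slices
# of each level are selected, and they are merged to train the two
# speaker models used for the final Viterbi resegmentation.
```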
0:11:34 To select the best segmented slices and the best level among these four, we use a set of confidence measures and also a majority voting strategy. The confidence measures used in this work were, first, the Bayesian information criterion (BIC), computed using MFCC features and the GMM speaker models; and second, the KL divergence, this time in the speaker factor space: we use Gaussian speaker models in that space and compute the KL distance between both models.
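Since the speaker models in factor space are Gaussians, the KL distance has a closed form; a minimal sketch (a symmetrized variant is shown, as it is unclear from the audio which variant was used):

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL(N(mu0, S0) || N(mu1, S1)) for full covariances."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + logdet1 - logdet0)

def symmetric_kl(mu0, S0, mu1, S1):
    # Symmetric distance between the two Gaussian speaker models.
    return gaussian_kl(mu0, S0, mu1, S1) + gaussian_kl(mu1, S1, mu0, S0)
```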
0:12:09 To fuse both confidence measures we use the FoCal toolkit, well known in speaker verification, and the fusion weights were optimized to separate the good hypotheses, those with a segmentation error below one percent, from the rest.
0:12:30 OK, so here we have the results for this hypothesis generation and selection strategy. We can see that, when we are not using intersession variability compensation, this solution improves the results: our baseline was at 2.1 and we get 1.9 with our selection strategy. And if we had an oracle confidence measure, so that we could select the best level every time, we could go down to a 1.1 segmentation error; but our confidence measures were far from the oracle at that moment. Using intersession variability compensation, we did not get a significant improvement with this strategy; the improvement was not statistically significant. So we tried to analyze why it did not help, and we concluded that we needed to improve the set of confidence measures, since the space of possible confidence measures is large, and also to explore simple strategies to fuse segmentation hypotheses.
0:13:40 We were not really happy with this, so we tried again with the larger UBM and more features, our best configuration for intersession variability compensation, and also a new set of confidence measures. These are new results that are not in the paper, but I am showing them to you. We could reduce the segmentation error from 1.3 to 1.2, and in one configuration to a 1.0 segmentation error. And if we could always select the best level, we could reduce the segmentation error to 0.7, which is quite good compared to the best result we obtained.
0:14:25 As conclusions of this work: we have presented two techniques for intersession variability compensation, and we have shown that they help for speaker segmentation. We have seen that WCCN achieves better performance than LDA alone, and that it is somehow similar to the combination of LDA plus WCCN. Since increasing the number of speaker factors increases the computational cost, it seems that WCCN is the best choice for low computational cost applications. But of course, if computational cost is not a problem, our best configuration uses a high number of speaker factors and LDA. In the summary of the results we can see that our best result is a 1.3 segmentation error; we improved the system from 1.9 to 1.3.
0:15:26 Also note that WCCN is probably helping so much because of the clustering, or rather because of the initialization used in this study: since the initialization uses PCA plus k-means, normalizing the within-class covariance for every speaker probably helps the k-means, which assumes that all the classes have the same covariance and are spherical, which is usually not the case. So WCCN probably helps because of that.
0:16:01 We have also presented a hypothesis generation and selection technique which can improve the segmentation results; for our best configuration it reduces the segmentation error from 1.3 to 1.2 with the large UBM. I think that's all, thank you very much.
0:16:24 We have time for questions.
0:16:38 [inaudible audience question]
0:16:43 Just the one dimension. I did not mention it because it is in another paper, but it is much more robust to the variability than the plain PCA. OK, so we keep just one dimension to initialize: the k-means then uses all the dimensions, but we initialize the means of the k-means with the PCA.
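One plausible reading of this answer, sketched below: split the data on the first principal component and use the two half-means to seed a k-means that runs on all dimensions (the exact seeding rule is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def kmeans_with_pca_init(Y):
    """Two-class k-means on the full speaker-factor vectors, with the
    initial means derived from the first PCA dimension."""
    proj = PCA(n_components=1).fit_transform(Y).ravel()
    half = proj > np.median(proj)
    seeds = np.vstack([Y[half].mean(axis=0), Y[~half].mean(axis=0)])
    # n_init=1 because the seeds are given explicitly.
    return KMeans(n_clusters=2, init=seeds, n_init=1).fit_predict(Y)
```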
0:17:17 No, no.
0:17:22 [inaudible audience question]
0:17:50 Yeah, but, I mean, in our experiments I am keeping one dimension with the WCCN, and maybe, you know, that is not the best you can do. Taking just one dimension, which is usually the first dimension of the PCA, is the best way to do this, but still, we are getting about an eighteen percent diarization error rate just using that one dimension, so we are not sure that it is the best representation.
0:18:23 [inaudible audience question]
0:18:31 So yes, we have tried to just plug in the PCA output with more dimensions and feed all of the dimensions to the k-means.
0:18:41 Are there any other questions?
0:18:47 Then let's thank the speaker once again.