and, well, we wanted to see whether we can actually identify speakers this way, and then we also wanted to try whether it's possible to actually fuse the results with more traditional i-vector systems. so basically there were some things done already, and these are basically the closest works we could find at the time of writing, but you know, with the speed of publishing nowadays they could be out of date. the first one is especially close, because they actually use spectrograms as well, but what they use them for is to identify disguised voices: so for example when you have voice actors, like in the simpsons or something, one actor can play several characters, so they want to sort of identify the actual actors and not the characters that they play. but they didn't do any fusion or exploration, and they basically used an off-the-shelf network. and there is also quite a lot of other recent work applying neural networks to sound.
so basically what we want to see here is sort of an overview. the lower part is basically the standard approach, where you have the mfccs or other features extracted, and then sort of the i-vectors or the gmm-ubm or whatever, and then you usually do the identification. what we wanted to do is basically extract spectrograms, put them through the network, and then sort of get the identity; i will explain later why there are several identities. so basically what we wanted to test was the convolutional network, then the tvs and the other baseline systems, and then the fusion, and actually some of the results on this dataset were quite surprising.
this is the network; i don't expect i need to go into detail. it's a convolutional network inspired by a lot of networks that are currently used for image recognition. so basically what we did is we tried an existing model and then we sort of started downsizing it, because a bigger one didn't change the results, it just slowed down learning, and we came up with this. and it's actually not very big: we have five convolutional layers, but the main point is that anything bigger would be an overkill, especially as the images we begin with are monochromatic, so we don't have the three channels at the very beginning. we use relu as the nonlinear function, and dropout at zero point five. and this is the augmentation we conducted: we did no random cropping and no rotations. this was because the spectrograms basically have a pretty big overlap anyway, so cropping wouldn't actually do much, and we don't want rotations because the time domain and the frequency domain mean different things, so rotating the patterns there may lose something interesting. and we use average pooling rather than max pooling, but this is just based on experiments.
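for illustration, here is a minimal pytorch sketch of such a downsized network; the filter counts and kernel sizes are assumptions made for the sketch, not the exact configuration from the paper:

```python
# illustrative sketch only: a small five-conv-layer network for single-channel
# 48x121 spectrogram patches; the channel counts and kernel sizes here are
# assumptions, not the exact configuration from the paper.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_speakers):
        super().__init__()
        def block(c_in, c_out):
            # square filters, relu nonlinearity, average pooling (as in the talk)
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AvgPool2d(2, ceil_mode=True),
            )
        self.features = nn.Sequential(
            block(1, 16),   # one input channel: the spectrograms are monochromatic
            block(16, 32),
            block(32, 64),
            block(64, 64),
            block(64, 128),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),          # dropout at 0.5, as stated in the talk
            nn.LazyLinear(n_speakers),  # logits over the known speakers
        )

    def forward(self, x):  # x: (batch, 1, 48, 121)
        return self.classifier(self.features(x))
```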
exactly, so basically, because we wanted to compare with the tvs and the gmm-ubm and stuff like that, we want to have the same sort of output. what we get from the signal is the speech segments, and because the spectrograms have to have a fixed size, we have to sort of divide the speech segments into separate spectrograms, put each through the network, and then average the outputs to get an equivalent score per segment, to be comparable with the tvs for example.
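a minimal sketch of this per-segment averaging, assuming a model like the one above; the window width and hop are made-up values:

```python
# illustrative sketch: score a variable-length segment by splitting it into
# overlapping fixed-size spectrogram patches and averaging the network outputs.
import torch

def segment_scores(model, spectrogram, width=121, hop=12):
    """spectrogram: (freq_bins, frames) tensor for one speech segment."""
    patches = [
        spectrogram[:, start:start + width]
        for start in range(0, spectrogram.shape[1] - width + 1, hop)
    ]
    batch = torch.stack(patches).unsqueeze(1)       # (n_patches, 1, 48, width)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)  # per-patch speaker posteriors
    return probs.mean(dim=0)                        # one score vector per segment
```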
so for training the cnn we used the following settings; more details are in the paper, i'm not going to go into this now, but we tested several settings and i think these were the best. the segmentation procedure for getting the speech segments is based on the bic criterion, and the i-vector setup is standard; the exact dimensionalities and so on are in the paper.
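for reference, a sketch of the delta-bic change-point test that this kind of segmentation relies on; the penalty weight lambda and the covariance regularisation are assumptions:

```python
# illustrative sketch of the delta-BIC change-point test: positive values
# favour splitting the window at the candidate frame (a speaker change),
# negative values favour keeping a single model.
import numpy as np

def delta_bic(X, split, lam=1.0):
    """X: (N, d) feature matrix; split: candidate change frame."""
    def logdet_cov(Z):
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        return np.linalg.slogdet(cov)[1]
    N, d = X.shape
    n1, n2 = split, N - split
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * n1 * logdet_cov(X[:split])
            - 0.5 * n2 * logdet_cov(X[split:])
            - lam * penalty)
```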
so for the fusion we chose the tvs because it had the best results. and then we explored three different approaches. the first is late fusion: so basically just take the scores from the tvs and from the cnn and then basically fuse them. and then we saw from our experiments that the cnn actually works worse for longer segments of speech, which was quite surprising, so we basically wanted to weight down its score depending on the duration; this is the duration-based fusion, where the weight depends on the duration of the track.
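a minimal sketch of these two score-level fusions; the normalisation and the duration-weighting function are assumptions, not the formulas from the paper:

```python
# illustrative sketch of the two score-level fusions described: a plain
# weighted sum of normalised scores, and a variant whose cnn weight decays
# with track duration (the cnn was relatively stronger on short tracks).
import numpy as np

def normalise(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-9)

def late_fusion(cnn_scores, tvs_scores, alpha=0.5):
    return alpha * normalise(cnn_scores) + (1 - alpha) * normalise(tvs_scores)

def duration_weighted_fusion(cnn_scores, tvs_scores, duration_s):
    # weight the cnn down as duration grows; the 2-second pivot is a placeholder
    alpha = 1.0 / (1.0 + duration_s / 2.0)
    return late_fusion(cnn_scores, tvs_scores, alpha)
```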
and then we wanted to try an early fusion: so basically take the output of the last hidden layer of the cnn, reduce it with pca to have the same dimensionality as an i-vector, and then we just concatenate them and train the back-end on that.
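a sketch of that early fusion; the four-hundred-dimensional i-vector size and the logistic-regression stand-in for the back-end are assumptions:

```python
# illustrative sketch of the early fusion: pca-reduce the cnn's last hidden
# layer to the i-vector dimensionality, concatenate, and train a back-end.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def early_fusion(cnn_embeddings, ivectors, labels, ivec_dim=400):
    pca = PCA(n_components=ivec_dim).fit(cnn_embeddings)
    fused = np.hstack([pca.transform(cnn_embeddings), ivectors])
    backend = LogisticRegression(max_iter=1000).fit(fused, labels)
    return pca, backend
```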
so the dataset that we used is repere. this is a french-language corpus of tv broadcasts, with seven types of shows, including news, debates, sort of interviews, celebrity gossip, stuff like that. and because of this it's pretty noisy, because very often you have, like, background music, you have different voices overlapping, you have street noises, et cetera. and it's very unbalanced as well, because you sometimes have, i don't know, politicians who present france almost constantly, or anchors throughout the shows, and then you have sort of this long tail of speakers. so basically the whole training set has over eight hundred speakers, but the test set contains only one hundred and thirteen, and luckily those one hundred and thirteen actually overlap with the speakers in the train set. and while the train set has about thirty hours of speech, there are six for the test. this next plot just shows, sort of, the imbalance in the distribution; this is a logarithmic scale.
so on the x-axis you have all those one hundred and thirteen speakers, and then on the y-axis you have the duration per speaker, sorted by the duration in the train set. so basically what you've got is that it's very imbalanced: you know, some people speak for forty minutes and then someone speaks for just a few seconds. and as we can see from that spike at the very right, there's actually someone who is almost nonexistent in the train set but then is very present in the test data. so, pretty difficult. also, another feature of this data is that almost a quarter of the speech segments are shorter than two seconds, and seventy percent are shorter than ten,
which makes it quite difficult. so basically we used mfcc features with nineteen dimensions; all the details are in the paper, but we end up with a fifty-nine-dimensional vector after some feature warping.
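a sketch of a comparable front-end using librosa; how the paper arrives at exactly fifty-nine dimensions, and the feature-warping step itself, are not reproduced here:

```python
# illustrative sketch of the front-end: 19 mfccs plus deltas and double
# deltas per frame; the exact composition yielding 59 dimensions, and the
# feature warping, are in the paper and only hinted at here.
import librosa
import numpy as np

def mfcc_features(wav_path, n_mfcc=19):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T  # (frames, 3 * n_mfcc) = (frames, 57) before energy terms
```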
so for the spectrograms, you have an example of one up here. each is two hundred and forty milliseconds in duration, and there's a big overlap between neighbouring spectrograms, around two hundred milliseconds, so over eighty percent overall. and basically this is how we got them: the audio segments themselves are of variable duration, and then for each window of twenty milliseconds we used hamming windowing and log-spectral amplitude value extraction, and we basically get an individual matrix of forty-eight by one hundred and twenty-one pixels.
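a sketch of this extraction with scipy; the two-millisecond hop and the choice of the lowest forty-eight frequency bins are assumptions picked so the output matches the stated patch size and overlap:

```python
# illustrative sketch of the spectrogram extraction: 20 ms hamming windows,
# log-amplitude spectra, patches of ~240 ms (121 frames at a 2 ms hop) with
# ~200 ms of overlap between neighbouring patches.
import numpy as np
from scipy.signal import stft

def log_spectrogram_patches(y, sr=16000, patch_frames=121, patch_hop=20):
    win = int(0.020 * sr)                       # 20 ms hamming window
    f, t, Z = stft(y, fs=sr, window='hamming', nperseg=win, noverlap=win - 32)
    logamp = np.log(np.abs(Z[:48, :]) + 1e-10)  # keep 48 frequency bins
    return [logamp[:, s:s + patch_frames]
            for s in range(0, logamp.shape[1] - patch_frames + 1, patch_hop)]
```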
so basically, here are the results. in the upper table we see the results for each individual system, and basically the cnn on its own doesn't work very well, which isn't that surprising considering the way the dataset is structured. what is pretty surprising is that the plda back-end is also not very good; actually the gmm-ubm does better. okay, so basically the best system is the tvs one, and that is the one we used for the fusion afterwards. so in the lower table you have three more detailed results, including the accuracy for the tracks that are shorter than two seconds. and actually the best approach that we have is just the simple late fusion: basically take the predictions from the cnn and the tvs, sort of normalise them, and average them. and the biggest gain in performance is actually also for the tracks that are shorter than two seconds: so basically from almost forty-one and forty-nine for the tvs and the cnn respectively, it goes up to fifty-eight with the fusion, which is quite a gain of course.
and then the early fusion actually, well, actually decreased the results, though for long durations it's pretty similar. so basically, even though the cnn didn't outperform the other systems, it seems to find different things in the spectrograms, and by fusion we can sort of exploit that and go beyond what was, let's say, possible before. so in the lower plot, the red curve is the cnn performance across different duration bins, on a logarithmic scale. you can see that the gap between the cnn and the i-vectors slowly increases along with the duration, and the fusion is actually most helpful for very short tracks, and then doesn't affect the performance on the longest ones.
so that's basically it. we wanted to see how it works, and we conclude that the fusion of the cnn and the tvs can improve over the baseline systems; maybe more data is required, or higher-quality data, for the cnn itself to actually work better. and for perspectives: so basically we chose this corpus because it also contains the video, faces and stuff like that, and we wanted to explore a system that takes both the spectrogram and the face, and so identifies, let's say, the speaking person, rather than just concentrating on speaker identification on its own, and we want to have it all compact, like one trainable system. an additional source of insight might be to force a difference in the architecture: so if you have, for example, horizontal or vertical filters rather than the squares that we use now, you can sort of force it to look more in the time domain or in the frequency domain, to sort of look at the patterns there; a sketch of this is below.
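for illustration, what such orientation-constrained filters could look like in pytorch; the channel counts and kernel lengths are arbitrary choices for the sketch:

```python
# illustrative sketch of the rectangular-filter idea from the perspectives:
# wide kernels scan along time, tall kernels along frequency.
import torch.nn as nn

time_conv = nn.Conv2d(1, 16, kernel_size=(1, 9), padding=(0, 4))  # horizontal: time patterns
freq_conv = nn.Conv2d(1, 16, kernel_size=(9, 1), padding=(4, 0))  # vertical: frequency patterns
```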
and with that, thank you.
so we have plenty of time for some questions.
okay
yes
did you do any kind of segmentation or pre-segmentation, or do you assume that the segmentation is given?
so the segmentation is basically an automatic speech segmentation done by the bic criterion, so it is a pretty old technique, and then we just basically take the segments as they are. and they are pretty noisy sometimes; in this data it is very hard sometimes to distinguish, or to filter out, like, music and voice and stuff like that, and sometimes segments basically straddle, like, two speakers as well, which, you know, means we could probably benefit from using a more sophisticated way to generate them.
okay, maybe also one question: in your experiments the cnn features are complementary to the baseline, so did you attempt to analyse the upper layers learned by the cnn, like, can you tell what the filters respond to, some meaning in terms of the audio?
yes, basically one thing we could do, and actually did, was to look at the saliency maps. so basically, once again, you can actually see which parts of the input particular layers of the cnn look at when it makes a decision. and what i guess is pretty interesting is that most of the features were horizontal, so in the frequency domain. so that's one reason why, as my final slide said, we want to see what happens if you, like, force not just square filters but vertical and horizontal ones, and see what happens then.
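a minimal sketch of such a saliency map, assuming a pytorch model like the earlier sketch:

```python
# illustrative sketch of the saliency-map inspection mentioned here: the
# gradient of the winning class score with respect to the input spectrogram
# shows which time-frequency regions the network relied on.
import torch

def saliency_map(model, spectrogram):
    """spectrogram: (1, 48, 121) tensor for a single patch."""
    x = spectrogram.unsqueeze(0).requires_grad_(True)
    score = model(x).max()        # score of the predicted speaker
    score.backward()
    return x.grad.abs().squeeze() # high values = influential regions
```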
segmentation error...
the simulation... the what? ah, no, sorry.
the question was: what part of your total data is affected by the segmentation errors?
okay, i don't have the number, sorry.
but
it could be linked to the fact, and it doesn't come out, that, as in the last question, twenty-five percent of the segments have a duration of less than two seconds, and usually, you know, to compute a segmentation score we have this collar of zero point two five seconds around the boundaries of each segment. it means that in your case, for twenty-five percent of the data, fifty percent of the speech is not used to compute the segmentation score. so we would have to change that if we want to know whether the segmentation error has an impact on speaker identification.
okay
thank you
time for maybe one final question? no?
okay, thank you everyone, and let's thank the speaker once again.