OK. This work may or may not be well known to you, and you may not know whether you want to pay attention to it, so let me just summarise it first, so that those of you who are interested know to pay attention.
Okay, so let me start. It is well known in speech processing that knowledge about the characteristics present in the acoustic signal, such as duration or channel, can be helpful for improving the performance of a recognition system; in our case we are interested in speaker recognition systems. We have already seen before that using such information as side information in fusion and calibration helps.
Most of the recent approaches, like what people usually do for the NIST evaluations, build independent detectors or estimators for the various kinds of information: estimators of signal-to-noise ratio, detectors of reverberation, detectors of language, and so on.
What we propose in this work is to detect the acoustic condition directly from i-vectors. As everybody uses i-vectors nowadays, we would like to show that you can detect all kinds of acoustic condition information, or nuisance conditions, or whatever you want to call them, directly from the i-vector, and then use this information in a quite simple way for calibration and fusion of speaker recognition systems.
Let me mention just some of the most important previous work on making use of acoustic condition detection. From the past you may still remember feature mapping, where the features were compensated based on detecting the acoustic condition, or more specifically the channel of the signal, using channel-specific Gaussian mixture models. Then we thought that we did not need to detect the channel or acoustic condition anymore, because we got this wonderful joint factor analysis and i-vectors, and we thought we did not have to explicitly detect the condition because the channel compensation scheme would account for this inter-session variability directly. But again we saw that using some side information in calibration or fusion was actually helping, even though we were already compensating with the subspace methods.
Again, the side information that we have been using recently was extracted by language identification systems or by signal-to-noise ratio estimators; we collect all kinds of information about the signal and try to make use of it to improve the speaker identification system. There have been several different ways of using such information. Probably the first was to build condition-specific fusion or calibration, where you train a different calibration for a specific condition, like a specific duration, or English-only trials versus trials spoken in different languages. More principled was something that Niko actually started with: I think the FoCal toolkit implements a linear score and side-information combination, where a bilinear form models the interaction between the scores and the side information itself.
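To make that concrete, here is a minimal sketch of such a score and side-information combination in generic notation, not necessarily the exact FoCal parametrisation: for a raw score $s$ and a trial-level side-information vector $\mathbf{q}$,

$$ s' \;=\; \alpha\, s \;+\; \beta \;+\; s\,(\mathbf{w}^{\top}\mathbf{q}) \;+\; \mathbf{v}^{\top}\mathbf{q}, $$

where $\alpha$, $\beta$, $\mathbf{w}$ and $\mathbf{v}$ are trained on development trials, and the $s\,(\mathbf{w}^{\top}\mathbf{q})$ term is the bilinear interaction between the score and the side information.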
The approaches I just mentioned were using side information collected about the trial itself, rather than about the individual recordings in the trial. The side information would describe the trial: whether the two recordings in the trial come from different languages, whether both are of short duration or both of long duration, and so on, which is something we have tried before. A more recent approach is to get side information from the individual segments and combine the side information from the individual segments in a certain way to again improve calibration and fusion. This is something people are already using, and it is the approach we will be following in this talk.
Let me now explain our approach more closely. So what is our approach to detecting these acoustic conditions? As I said, we are going to use i-vectors as the input to a classifier that detects a predefined set of audio characteristics. In our case we use just a simple linear Gaussian classifier, which is similar to what people are using for i-vector based language identification. The way we are going to represent the audio characteristics of a signal is simply the vector of posterior probabilities of these individual classes. I would like to show that we can use this vector of posteriors as side information for fusion and calibration of a speaker recognition system and get quite substantial improvements in performance. In this work we are actually using exactly the same i-vectors for both characterization of the audio segment and speaker recognition. The justification, or the reasoning that we have, is that the nuisance characteristics included in the i-vector itself are the ones that affect speaker ID performance, so if we can detect those characteristics from the i-vector itself, they should be the ones most important for improving speaker recognition performance and for compensating for the effects that are still there in the i-vectors.
Before I get into more details of what exactly we do and show the actual results, let me introduce the development and evaluation sets used in this work, which should give you some idea of what kind of variability and what kind of conditions we actually address. The data we have used is the PRISM evaluation set, which is something we presented during the last NIST workshop. The set was designed and collected together with the partners of the BEST project, and it is compiled from many datasets: the data come from the previous SRE evaluations, Fisher, and Switchboard, so basically the data that everybody uses for training systems for the NIST evaluations. But we tried to build an evaluation set that accounts for different kinds of variability.
There was a huge investment in normalizing all the metadata, all the information about the files, and we tried to include as many trials as we could. We created trials for specific types of variability, so we defined a lot of evaluation conditions, each for a specific type of variability: different speaking styles, different vocal effort, language variability, and so on. Usually a condition addresses one specific type of variability; we did not try to mix different types, so that we can see what degradation each type of variability causes. We also always tried to create more trials compared to what has been defined for the NIST evaluations.
We also tried to introduce new types of variability into the set, specifically noise and reverberation, so we artificially added noise and reverberation to the data; I will say a few more words about this on the next slide. There is also a duration condition.
The PRISM set consists of two parts. In the evaluation part we have around one thousand speakers and around thirty thousand audio files, giving more than seventy million trials. Then there is the training part, with roughly sixteen thousand speakers coming from about a hundred thousand sessions.
oh
deciding this task just to give you some idea wasn't really just as easy as
taking saying of the we used features switchboard and sre four for training the rest
for four testing either really and attention to reduce to the data from different
different
sets to use the models for training and testing so for example to get some
language portability we have used for evolution data from five
the same are trained for different
yeah
i
oh yeah
for all i
we try to use them for
eventually for testing but at the same time we wanted to cover some of the
channels that are some of the microphones and i two thousand they don't they also
training also so
they be related to
pay attention
splitting the database is very like this
last number of trials
i
you see
straight patients
Now let me quickly summarise the procedure for designing the noisy and reverberant data: what exactly we did and how it was designed. We defined the way the noise is added, we tried to use open-source tools, and the whole recipe is documented, so if other people are interested in adding new noises to the rest of the set, it should be straightforward to design new types of noises and also new reverberations.

The blue box on the slide pretty much summarizes the additive noise part. We used an open-source tool for adding the noise to the data at a specific signal-to-noise ratio, and we used different kinds of noises for the training data, for the enrollment segments, and for the test segments. We tried to make sure that we never train and test on the same noise, not even noise taken from the same file, not even from exactly the same time in a file. So if I say it was cocktail-party noise, it could be noise from a restaurant or noise from a bar, different kinds of babble noises, and we made sure that we never mix very similar noises between training and test. We added the noise at different SNRs, specifically 20, 15, and 8 dB, and the noise was added only to data that we believe is clean to begin with.
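As an illustration of the noise-adding step, here is a minimal sketch under a few assumptions of mine (NumPy, single-channel signals as arrays, a simple energy-based SNR definition); the actual recipe uses an open-source tool and a proper speech-level estimate, so take this only as the idea:

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db, rng=None):
        # Cut a random stretch of the noise file, in the spirit of never
        # reusing exactly the same noise samples for different segments.
        rng = rng or np.random.default_rng()
        start = rng.integers(0, len(noise) - len(speech) + 1)
        cut = noise[start:start + len(speech)].astype(float)
        sig = speech.astype(float)
        # Scale the noise so that the mixture has the requested SNR.
        p_speech = np.mean(sig ** 2)
        p_noise = np.mean(cut ** 2) + 1e-12
        gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return sig + gain * cut

    # e.g. noisy = add_noise_at_snr(clean, babble, snr_db=8)   # 8, 15 or 20 dB as above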
Similarly, we defined a reverberant subset of the data. For that we again used an open-source tool, which simulates impulse responses of a rectangular room, and then we added reverberation at different reverberation times to the data. Again we paid attention not to apply the same reverberation, the same impulse responses, to the training and the test data.
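Again just as an illustration, a minimal sketch of applying a simulated room impulse response to a clean recording (assuming NumPy/SciPy; the impulse response itself would come from a room simulator like the one mentioned above):

    import numpy as np
    from scipy.signal import fftconvolve

    def add_reverb(speech, rir):
        # Convolve the clean speech with the room impulse response.
        rev = fftconvolve(speech.astype(float), rir.astype(float))[:len(speech)]
        # Bring the reverberated signal back to roughly the original energy.
        rev *= np.sqrt(np.mean(speech.astype(float) ** 2) / (np.mean(rev ** 2) + 1e-12))
        return rev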
Okay, so now that we know what the data sound like, let me give more details on the audio characterization system itself. As I said, the system is based on i-vectors, and the i-vector extractor is pretty much the standard one that everybody is using nowadays: a UBM of Gaussians trained on cepstral features with mean and variance normalization. The extraction is actually exactly the same as for speaker identification, so only the speech frames are used and the silence is dropped, although it is quite possible that for detecting some of the conditions different features, or features without the normalization, would be more appropriate. The 600-dimensional i-vectors are extracted from the standard total variability subspace, which is assumed, or expected, to contain information about both the speaker and the acoustic conditions. In this case we did not apply any of the processing, like LDA or length normalization, that we normally use to emphasise the speaker information; the i-vectors are used as they are for the condition detection.
As the classifier we use a linear Gaussian classifier trained on these i-vectors. It is trained to classify the conditions that I am going to show on the next slide, and the final characterization of a segment, as presented in this paper, is the vector of posterior probabilities of the predefined classes. In fact, obtaining this vector is as simple as applying a nonlinear function to the i-vector: just an affine transformation of the i-vector followed by a softmax function gives the vector of posteriors.
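To show how simple this is, here is a minimal sketch of such a linear Gaussian classifier under assumptions of mine (NumPy, equal class priors, one covariance matrix shared across classes, which is exactly what makes the posteriors a softmax of an affine function of the i-vector):

    import numpy as np

    def train_linear_gaussian_classifier(ivecs, labels, n_classes):
        # Per-class means and a shared (pooled) within-class covariance.
        dim = ivecs.shape[1]
        means = np.zeros((n_classes, dim))
        scatter = np.zeros((dim, dim))
        for c in range(n_classes):
            x = ivecs[labels == c]
            means[c] = x.mean(axis=0)
            scatter += (x - means[c]).T @ (x - means[c])
        cov = scatter / (len(ivecs) - n_classes)
        prec = np.linalg.inv(cov)
        W = means @ prec                                   # class weight vectors
        b = -0.5 * np.sum((means @ prec) * means, axis=1)  # class biases (equal priors)
        return W, b

    def condition_posteriors(ivec, W, b):
        a = W @ ivec + b        # affine transformation of the i-vector
        a -= a.max()            # numerical stability
        p = np.exp(a)
        return p / p.sum()      # softmax: the vector of condition posteriors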
This diagram summarizes how the whole system works: the i-vector is extracted and passed to the classifier, which produces the vector of condition posteriors. As you can see, the same training data are used to train the UBM, the total variability subspace, and also the linear Gaussian classifier.
okay
based on this actually
our system for
so we try to distinguish between
three dollars
a microphone they are
a noisy and this case where
those kind of noisy data that you are actually noise added to the clean originally
clean microphone the a and B distinguish three different conditions which is noise a db
fifteen db and twenty db snr
and the conditions for are currently covered we define the condition according to reverberation time
three five
zero point
you can see how much data used for training data
and
hence
i
As you can see, there is always the same number of training and test files for the noisy conditions, because those are actually the same files with noise added at the different levels.
The way we defined those classes, since we represent a segment just by the vector of posteriors over those classes, assumes that the classes are mutually exclusive. That is exactly how it is in our training and evaluation data, because this is exactly how our evaluation set was designed: we never have reverberation and noise in the same recording. But of course this is unrealistic; in real data you can easily have reverberation combined with noise in the background. Still, we believe that this representation would be useful even for such conditions, because the vector of posteriors can account for a mix of conditions in the data.
If we have a recording that comes from, say, a noisy and reverberant room and we do this estimation, we will probably get a vector of posteriors that somehow reflects that both noise and reverberation are present, and roughly how much.
But of course we could go for a more principled way: we could even train independent classifiers for the independent types of variability, one classifier for the level of noise, another for the kind or level of reverberation, another for telephone versus microphone, but that would require training data which contains a mix of such conditions.
This table summarizes the performance we obtained in detecting these conditions. The table shows the true classes against the detected classes. (And I now know I am supposed to be pressing space, not enter.) If we had perfect classification we would see hundreds on the diagonal and zeros elsewhere, since this is a confusion matrix normalized in such a way that each row sums to one hundred percent.
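Just to spell out that normalization, a small sketch (assuming the raw confusion counts are in a NumPy array with the true classes in rows):

    import numpy as np

    def row_normalised_confusion(counts):
        # Scale each row (true class) so that it sums to 100 percent.
        counts = np.asarray(counts, dtype=float)
        return 100.0 * counts / counts.sum(axis=1, keepdims=True)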
You can see that this is not quite what we got, but what we were pleased with is that at least the recognition of microphone versus telephone data is almost perfect. For the microphone data we get some confusion with the 20 dB noisy data, but as I told you, we created the noisy data by taking these microphone data and adding noise at exactly these SNRs, and if you listen to the clean data, some of it actually already contains some background voices, so it is quite natural that some of the clean microphone recordings are detected as 20 dB SNR, which is the neighbouring condition.
Also, if you look at the different noise levels, we see quite reasonable performance: reasonably large numbers on the diagonal, with 20 dB recognised nicely and some confusion between the others, but that is again something to be expected, especially between SNRs which are close to each other.
One thing we actually saw is that most of the confusion comes from certain types of noise that do not really affect the i-vector much; those types of noise result in almost exactly the same i-vector as the clean recording, so the confusion is quite natural.
Where we do not do very well is the conditions where we try to detect the reverberation time: those detections come out all over the place, and the reverberant data are also confused with the noisy data. The main reason, we believe, is that defining the conditions by reverberation time is not actually a good thing to do. If you play the reverberated recordings, you can hear that one type of reverberation at one reverberation time can be perceptually much more similar to a recording with a completely different reverberation time, so the reverberation time is probably not the right criterion for defining the classes. Still, as we will see, using these posteriors actually improves the speaker recognition performance, so it looks like the classification itself does a reasonable job of putting recordings into classes that are useful, even if they are not exactly the classes we defined.
So finally, how do we use this information about the acoustic condition in calibration to improve the speaker recognition system? We use the approach that Niko actually proposed when we were building the ABC system for the NIST SRE, and I believe it is also implemented in the BOSARIS toolkit, which is freely available. The idea is to train just a single calibration. Normally people train a standard linear calibration, where the score is multiplied by a scaling factor and a bias is added. Here you can see that we add one more bias term, which is a bilinear combination of the vector of posteriors from the first segment and the vector of posteriors from the second segment of the trial, through a matrix; this bilinear form gives an additional, trial-dependent bias, and added to the scaled score it gives the final calibrated score. The vectors entering this bilinear term are exactly the vectors of condition posteriors that I described before.
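Written out, a sketch of this calibration in generic notation (the exact parametrisation in BOSARIS or in the paper may differ): for a raw score $s$ and condition-posterior vectors $\mathbf{q}_1$ and $\mathbf{q}_2$ of the enrollment and test segments,

$$ \hat{s} \;=\; \alpha\, s \;+\; \beta \;+\; \mathbf{q}_1^{\top} \mathbf{W}\, \mathbf{q}_2 , $$

where the scaling $\alpha$, the bias $\beta$ and the matrix $\mathbf{W}$ are trained jointly on development trials (for example with the usual logistic-regression calibration objective), so the bilinear term acts as a condition-dependent offset to the score.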
I am running out of time, so let me just say briefly that we are presenting results on a list of conditions that is a subset of all the conditions in the set: telephone and microphone trials, different vocal effort, different languages within a trial, different noises in the recordings, and room reverberation. The system that we use for speaker ID uses exactly the same i-vectors as I described, with length normalization and LDA, and a PLDA model trained on the training part of the set.
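As a small illustration of that i-vector pre-processing, a sketch under assumptions of mine (NumPy plus scikit-learn's LDA; the LDA dimensionality here is made up, and the PLDA model that follows in the real backend is not shown):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def train_backend_projection(train_ivecs, train_speakers, lda_dim=200):
        # Length-normalise the i-vectors, then train LDA on the speaker labels.
        normed = train_ivecs / np.linalg.norm(train_ivecs, axis=1, keepdims=True)
        return LinearDiscriminantAnalysis(n_components=lda_dim).fit(normed, train_speakers)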
This slide shows the results, and you can see that we actually get nice improvements. These are the detection cost (DCF) and equal error rate (EER) for the individual evaluation conditions. Let me point out just the ones that are most relevant.
For this condition, which is actually telephone conversations recorded over a microphone somewhere in a room, our approach does a very good job. Somewhat surprisingly, we also get some improvements in other conditions, which probably again comes from the detector doing a good job on the noise conditions at the different noise levels; it also does a reasonable job on the room reverberation conditions, which are quite hard for speaker identification. We do not get any improvement at all for conditions whose variability the detector simply does not cover, so it cannot help there: we see no improvement for the language and vocal effort conditions, because again we did not have such conditions in the detector.
so
and the next slide actually just showing the same thing that we also still pretty
much the same gains you can be fused with
system cepstral prosody just
so this is just to say
suspecting summarizes
and
The conclusions are summarized on this slide, and they pretty much repeat what I have just said.
Well, the issue is that the classifier does not classify the reverberation into the classes we defined for training and test. As I said, the reverberation classes were defined purely by reverberation time, while the actual impulse responses come from many different simulated rooms, and we defined the classes just by that one number. What we have seen is that if you listen to the recordings, you can find two test recordings that sound perceptually very similar but come from different classes, so the way we defined the classes probably was not right. Partitioning by reverberation time is probably not correct; a more natural clustering would account for the type of reverberation, for example whether there are only early reflections, which do not harm the speech that much, or late reverberation spread over a long time, which affects the speech much more; in our case such recordings may be considered to come from the same class. So the classifier probably ends up finding classes that are better related to speaker recognition performance, and that is what helps at the end to improve speaker recognition performance, even though the classification does not match the classes the way we defined them.