Hello, my name is Raghuveer.
I am a PhD student in the Signal Analysis and Interpretation Laboratory at the University of Southern California, Los Angeles. Today I will be presenting our work titled "An empirical analysis of information encoded in disentangled neural speaker representations." Here are the people who have collaborated with me on this work.
So, first, I'll introduce what I refer to as speaker embeddings in the rest of the talk. Speaker embeddings are low-dimensional representations that are discriminative of speaker identity. These have several applications, such as voice biometrics, where the task is to verify a person's identity from their speech, and they have applications in speaker-adapted ASR models. They can also be used in speaker diarization, where the task is to determine who spoke when in multiparty conversations. This can be of particular use in meeting analysis and many other applications.
Good speaker embeddings should satisfy two properties: first, they should be discriminative of speaker factors; second, they should be invariant to other factors.
So, what are the factors of information that could be encoded in a speaker embedding? For ease of analysis, we broadly categorize them as follows. First are the speaker factors; these are related to the speaker's identity, for example gender, age, et cetera. Next are the content factors; these are acquired during speech production by the speaker, for example the emotional state expressed in the speech signal, the sentiment (whether it is a positive or a negative one), the language being spoken, and, most importantly, the lexical content in the signal. Last are the channel factors; these are acquired when the signal is captured by the microphone. They could be the room acoustics, the microphone nonlinearities, ambient acoustic noise, and also the artifacts related to compression of the signal.
As I mentioned previously, good speaker embeddings are supposed to be invariant to nuisance factors; these are the factors that are unrelated to the speaker's identity. Such embeddings are useful for robust speaker recognition in the presence of background acoustic noise. They are also useful for detecting a speaker's identity irrespective of the emotional state of the speaker, and independent of what the speaker says. This is particularly useful in text-independent speaker verification applications.
With that as the motivation, the goals of our work are twofold. First, we quantify the amount of nuisance information in speaker embeddings. Second, we investigate to what extent unsupervised learning can help to remove that nuisance information.
Most existing studies perform analyses based on only one or two datasets, so a comprehensive analysis is lacking. Also, most of these works do not consider the dependencies between the individual variables in the datasets. For example, in one dataset we analyzed, the lexical content and the speaker identity are entangled, because some sentences are spoken by only a subset of the speakers; therefore, it could be possible to predict the speakers based on the lexical content alone. We aim to mitigate these limitations of previous work by making the following contributions.
Firstly, we use multiple datasets to comprehensively analyze the information encoded in neural speaker representations. Secondly, we analyze the effect of disentangling speaker factors from nuisance factors on the encoded information. Let me briefly explain what we mean by disentanglement in the rest of this talk: we define disentanglement broadly as the task of separating out information streams from an input signal.
As a toy example, consider an input speech signal from a person who is happy that they have just bought something new. It contains information related to various factors: information about the person's identity, including their gender and age; information pertaining to their emotional state; and, more importantly, the language identity and the lexical content are also present in the signal. The goal of a disentangled embedding extractor is to separate all of these information streams. In the context of speaker embeddings, which are supposed to capture speaker identity information, all other factors, such as the emotional state and the lexical content, are considered nuisance factors. It is these factors that we propose to remove from the speaker embeddings, to make them more robust.
Now I'll explain the methodology behind the disentangled speaker embedding extraction. This is the model we use. As input, we can use any speech representation, such as a spectrogram, or even speaker embeddings from pre-trained models, such as x-vectors. Using unsupervised disentanglement, adapted from a method that was previously proposed in the computer vision domain, we try to separate the speaker-related information from the nuisance information. Please note that this method was proposed in our earlier work, and you can find more details in that paper; however, for completeness, I'll explain it here briefly.
The architecture comprises two modules: the main module, which is shown in the green blocks here, and the adversarial module, shown in blue. The input is first processed by an encoder, which splits it into two embeddings, h1 and h2, as shown in the figure. The embedding h1 is fed into the predictor, which predicts the speaker labels. The embedding h2 is concatenated with a noisy version of h1, which is denoted by h1' here. h1' is obtained by feeding h1 to a dropout module, which randomly removes certain elements of h1. Then h2, along with the noisy h1 (that is, h1'), is concatenated and fed into a decoder, which tries to reconstruct the original input x. The motivation behind using the dropout is to make sure that h1 is an unreliable source of information for the reconstruction task. Training in this manner makes sure that the nuisance information required for reconstruction is not stored in h1, and that only the information required to predict speakers is stored there. In addition, we also use two disentangler models. These models are trained adversarially, so that they perform poorly in predicting h1 from h2, and h2 from h1. The goal of these models is to ensure that h1 and h2 are not predictable from each other, which makes sure that they do not contain similar information. This way, we can enforce disentanglement between the two embeddings.
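To make this architecture concrete, here is a minimal PyTorch sketch of such a two-branch model. The layer types and sizes (simple feed-forward blocks, 128-dimensional h1 and h2) are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of a UAI-style disentanglement model (illustrative;
# not the exact architecture or sizes used in this work).
import torch
import torch.nn as nn

class DisentangledEmbedder(nn.Module):
    def __init__(self, input_dim=512, spk_dim=128, nui_dim=128,
                 num_speakers=7200, dropout_p=0.5):
        super().__init__()
        # Encoder splits the input into two embeddings: h1 (speaker), h2 (nuisance).
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                                     nn.Linear(512, spk_dim + nui_dim))
        self.spk_dim = spk_dim
        # Predictor maps h1 to speaker posteriors.
        self.predictor = nn.Linear(spk_dim, num_speakers)
        # Dropout makes h1 an unreliable source for reconstruction (h1').
        self.dropout = nn.Dropout(dropout_p)
        # Decoder reconstructs the input from the concatenation [h1', h2].
        self.decoder = nn.Sequential(nn.Linear(spk_dim + nui_dim, 512), nn.ReLU(),
                                     nn.Linear(512, input_dim))
        # Disentanglers try to predict one embedding from the other.
        self.dis_1to2 = nn.Linear(spk_dim, nui_dim)   # h1 -> h2
        self.dis_2to1 = nn.Linear(nui_dim, spk_dim)   # h2 -> h1

    def forward(self, x):
        h = self.encoder(x)
        h1, h2 = h[:, :self.spk_dim], h[:, self.spk_dim:]
        logits = self.predictor(h1)                   # speaker prediction
        h1_prime = self.dropout(h1)                   # noisy copy of h1
        x_hat = self.decoder(torch.cat([h1_prime, h2], dim=1))
        return h1, h2, logits, x_hat
```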
The loss functions that we use are presented here. The main module produces two losses: one is the standard cross-entropy loss from the predictor, which predicts the speakers, and the second is the mean squared error reconstruction loss from the decoder. The adversarial module uses a mean squared error loss. The overall loss function is shown here: we try to minimize the loss with respect to the main module, while adversarially maximizing the disentanglement losses.
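As a rough illustration of this min-max objective, one common recipe alternates updates between the main module and the disentanglers; the loss weights and the alternating schedule below are assumptions for illustration, building on the model sketch above.

```python
# Sketch of one alternating training step for the min-max objective
# (loss weights alpha/beta/gamma and the schedule are illustrative).
# opt_main covers the encoder/predictor/decoder parameters;
# opt_adv covers the two disentanglers.
import torch.nn.functional as F

def training_step(model, x, speaker_labels, opt_main, opt_adv,
                  alpha=1.0, beta=1.0, gamma=1.0):
    h1, h2, logits, x_hat = model(x)

    # Main-module losses: speaker cross-entropy + MSE reconstruction.
    ce = F.cross_entropy(logits, speaker_labels)
    recon = F.mse_loss(x_hat, x)
    # Disentanglement losses: MSE of predicting one embedding from the other.
    dis = F.mse_loss(model.dis_1to2(h1), h2) + F.mse_loss(model.dis_2to1(h2), h1)

    # Step 1: update the main module to minimize CE + reconstruction while
    # *maximizing* the disentanglers' loss (note the minus sign).
    opt_main.zero_grad()
    (alpha * ce + beta * recon - gamma * dis).backward()
    opt_main.step()

    # Step 2: update the disentanglers to minimize their prediction loss.
    h1, h2, _, _ = model(x)
    dis = F.mse_loss(model.dis_1to2(h1), h2) + F.mse_loss(model.dis_2to1(h2), h1)
    opt_adv.zero_grad()
    dis.backward()
    opt_adv.step()
```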
This training procedure sets our work apart from previous work, which, as I mentioned before, applied this technique to a digit recognition task. Upon successful training, the embedding h1 is expected to capture the speaker-discriminative information, and the embedding h2 is expected to capture the nuisance information. Notice that we have not used any labels of the nuisance factors, such as noise type, channel conditions, et cetera.
For training the models, we use the standard VoxCeleb training corpus, which consists of interviews with celebrities. We add additive noise and reverberation, which is standard practice in data augmentation; this results in 2.4 million utterances from around 7,200 speakers.
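As an aside, here is a minimal NumPy/SciPy sketch of this style of augmentation. Real recipes typically draw noises and room impulse responses from dedicated corpora; the SNR handling below is an illustrative assumption, not the exact recipe used here.

```python
# Sketch of additive-noise and reverberation augmentation (illustrative;
# real recipes typically use corpora such as MUSAN noises and measured RIRs).
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the given signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the length."""
    return fftconvolve(speech, rir, mode="full")[: len(speech)]
```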
As mentioned before, we could use either spectrograms as inputs, or speaker embeddings from pre-trained models; the latter is what we do in this work. So we use x-vectors, extracted from a publicly available pre-trained model, as input.
X-vectors, as most of you already know, are speaker embeddings extracted from a hidden layer of a deep neural network that is trained to classify speakers on a large dataset artificially augmented with noise and reverberation. This model has been shown to provide state-of-the-art performance on multiple tasks that require speaker-discriminative embeddings.
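As a side note for those who want to experiment, pretrained x-vector extractors are publicly available; for example, SpeechBrain distributes one trained on VoxCeleb. This is just one option, not necessarily the model used in this work, and the audio file name is a placeholder.

```python
# One way to obtain x-vector speaker embeddings today, using SpeechBrain's
# pretrained VoxCeleb model (not necessarily the model used in this work).
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb")
signal, fs = torchaudio.load("utterance.wav")   # placeholder; 16 kHz mono expected
embedding = classifier.encode_batch(signal)     # x-vector-style embedding
```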
We use multiple datasets in our evaluations, as mentioned here. By evaluating some factors, for example emotion, on more than one dataset, we can also control for the issue of dataset bias creeping into the analysis. Following others in the literature, we make the assumption that better classification performance for a factor, using the speaker embeddings, implies that more information is present in the embedding with respect to that factor.
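In other words, information is measured with probing classifiers. A minimal sketch of such a probe, assuming the embeddings and factor labels are already available as arrays, could look like this:

```python
# Probing classifier: higher held-out accuracy is read as more information
# about the factor being present in the embeddings (illustrative sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe(embeddings, factor_labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, factor_labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```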
As a baseline, we use the x-vector speaker embeddings, since our model accepts them as input; we can consider our speaker embeddings as a refinement of x-vectors, where speaker-discriminative information is retained and nuisance factors are removed. We also reduce the dimension of the x-vectors using PCA, to match the dimension of the embeddings from our models.
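The dimensionality matching itself is straightforward; for example, with scikit-learn (the sizes below are illustrative assumptions, and the random array stands in for real x-vectors):

```python
# Reduce x-vectors to the dimensionality of the disentangled embeddings,
# so that any difference is not simply due to a tighter bottleneck (sketch).
import numpy as np
from sklearn.decomposition import PCA

xvectors = np.random.randn(1000, 512)   # stand-in for real x-vectors [N, 512]
pca = PCA(n_components=128)             # match the disentangled embedding size
xvec_reduced = pca.fit_transform(xvectors)
```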
Now, on to the results. The first set of results shows the accuracy of predicting the speaker factors, using x-vectors, shown in blue, and using our embeddings; in this case, higher is better. The first two graphs here show speaker classification accuracy, and the other two show gender prediction accuracy. We find that, in general, both x-vectors and our embeddings perform comparably well in predicting speakers and genders. We see a slight degradation when using our embeddings; however, the differences are minimal. One other observation is that on IEMOCAP, the performance of both x-vectors and our model degrades. We conjecture that this degradation could be due to speaker overlap, and also that this dataset is not ideally suited for the speaker recognition task, since its purpose was emotion recognition.
Now for the more interesting results. Here, I show the results of predicting the content factors using x-vectors and our speaker embeddings. In this case, since these are nuisance factors, lower is better. We find that in all the cases, our model reduces the nuisance information; in particular, emotion and lexical information are reduced to a greater extent. Here, the lexical accuracy is the accuracy of predicting the sentence spoken, given the speaker embedding of that sentence. Apart from the emotion and lexical content, we also see a reduction in information pertaining to sentiment, which is closely related to emotion, and also to language.
On this slide, I report the results of predicting the channel factors using x-vectors and our speaker embeddings. Again, in this case, lower is better. In particular, we focus on three factors: the room, the microphone distance (or the microphone location), and the noise type. We find that, in predicting the location of the microphone used and the type of noise present, x-vectors have a much higher accuracy than our embeddings. This means that we are able to successfully remove a lot of this nuisance information from the x-vectors. However, we notice that in predicting the room in which the recording was made, both embeddings show similar performance, suggesting that the disentanglement is not very effective for this factor; this needs further investigation.
Next, we show the results of a task-level evaluation, where we evaluate the models on the speaker verification task. We compare the detection error tradeoff (DET) curves, where the false positive rate and the false negative rate are plotted on a normal deviate scale; the closer a model's curve gets to the origin, the better the model.
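For reference, a DET curve of this kind can be computed from trial labels and scores along these lines (a sketch with scikit-learn and SciPy; the trial data here are random stand-ins):

```python
# Sketch: plot a DET curve (FPR vs. FNR on a normal-deviate scale)
# from speaker verification trial labels and scores.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.metrics import det_curve

# Stand-ins for real verification trials: 1 = same speaker, 0 = different.
trial_labels = np.random.randint(0, 2, size=1000)
trial_scores = np.random.randn(1000) + trial_labels

fpr, fnr, _ = det_curve(trial_labels, trial_scores)
fpr = np.clip(fpr, 1e-4, 1 - 1e-4)       # avoid infinities at the endpoints
fnr = np.clip(fnr, 1e-4, 1 - 1e-4)
plt.plot(norm.ppf(fpr), norm.ppf(fnr))   # probit (normal-deviate) axes
plt.xlabel("False positive rate (normal deviate)")
plt.ylabel("False negative rate (normal deviate)")
plt.show()
```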
The black dotted lines show the x-vector model, and all the other lines show our models, with and without LDA-based dimensionality reduction. We found statistically significant differences only in the graphs marked here. Most notably, in the challenging scenarios with babble and television noise in the background, all our models perform better than x-vectors. Also, in the distant-microphone condition, our models perform significantly better than x-vectors. We also found that the model trained with augmented data performs slightly better than the model trained without augmentation; this actually conforms with what we expected.
Finally, I'd like to quickly present a discussion based on our experiments, which will hopefully provide useful pointers for future research in this domain. First, we find that speaker embeddings capture a variety of information pertaining to nuisance factors, and this can sometimes be detrimental to robustness. We also found that just introducing a bottleneck on the dimension of the speaker embeddings, by using PCA, does not remove this information; this points to the need for explicitly modeling the nuisance factors. Using the unsupervised adversarial invariance technique, which is the technique used in our model, we can reduce the nuisance information in the speaker embeddings; an added advantage is that labels of the nuisance factors are not required for this method. We also found that the disentanglement retains gender information. This suggests that speaker gender, as captured by neural embeddings, is a crucial part of identity, which is quite intuitive from a human perception point of view: essentially, it shows that the notion of speaker identity captured by these embeddings is consistent with human perception. Finally, the disentangled speaker representations showed better verification performance in the presence of noisy conditions, particularly babble and television noise, which are considered very challenging for this task.
Going forward, we would like to explore methods to further improve the disentanglement. So far, as I mentioned, we have not used any nuisance labels; we would like to see if, by using whatever labeled data is available, we can achieve better disentanglement.
That brings me to the end of my presentation. Finally, I would like to acknowledge the sources of support for this work. Please feel free to reach out to me with any questions or suggestions you might have. Thank you.