Hello everyone. I am presenting our paper on perform, a phonetically-aware acoustic representation for utterance-level speaker and language recognition. This is joint work with my colleagues.
So let me first talk about the motivation.
Transformer-based contextual representations like BERT and GPT have shown great success in downstream natural language understanding tasks.
Similarly, we can think of the same idea for speech. Downstream speech tasks like speaker recognition or language recognition have very limited training data, but there are large ASR corpora and unlabeled speech corpora available. So learning acoustic representations that make use of those large corpora can help with training the downstream speech tasks.
So the most important question here is: what information do we need for these downstream speech tasks?
The first thing we can think of is definitely ASR. In fact, phonetic information for speaker and language recognition has already been explored for a very long time.
Past works have explored a variety of methods for using an ASR model as a frame-level feature extractor for speaker and language recognition. This is often done with bottleneck features, or more generally intermediate frame-wise features, from an ASR model trained on large speech corpora.
However, since speaker recognition can require higher-level, coarser information like speaker traits, which is not relevant for ASR, these features may be insufficient for speaker recognition in some cases.
One way to overcome this is to train a multi-task end-to-end ASR system with an additional speaker recognition objective.
There is also a newer trend of self-supervised acoustic representations, which are able to improve ASR by pre-training on large amounts of monolingual speech. Those self-supervised models can capture some global acoustic structure, and can help ASR and also potentially downstream speech tasks.
Some examples: these models use a variety of self-supervised objectives, either contrastive or based on autoregressive or masked reconstruction. In fact, some of them have already shown promising results by applying self-supervised acoustic representations to speaker recognition.
So we propose perform, which combines both phonetic information and the self-supervised acoustic representation learning I just talked about on the previous slides.
Here is an overview of our model. This is the input: we do feature extraction and then mask spans of frames. We feed the masked frame sequence into a transformer encoder to get the perform representations. Then we do multi-task training on top of those representations. On the left side is the ASR task, where we use a CTC loss, and on the right side is the self-supervised reconstruction task, where we use an L1 loss to reconstruct the masked frames back to the original frames.
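As a rough illustration of this structure, here is a minimal PyTorch sketch of an encoder with the two heads: a CTC projection for the ASR branch and a frame-reconstruction projection for the self-supervised branch. The module names, layer counts, and dimensions here are illustrative assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn

class PretrainingEncoder(nn.Module):
    """Sketch: masked input frames -> transformer encoder -> two heads
    (CTC/ASR projection and frame reconstruction projection)."""

    def __init__(self, feat_dim=40, d_model=768, n_layers=12, n_heads=12, vocab_size=100):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ctc_head = nn.Linear(d_model, vocab_size)   # ASR (CTC) branch
        self.recon_head = nn.Linear(d_model, feat_dim)   # reconstruction branch

    def forward(self, masked_frames):
        # masked_frames: (batch, time, feat_dim), with masked spans zeroed out
        h = self.encoder(self.input_proj(masked_frames))  # contextual representations
        return self.ctc_head(h), self.recon_head(h), h
```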
For the training criteria: for the reconstruction task, we simply use an L1 loss to reconstruct the masked frames back to the original frames, so it is basically the same as a denoising autoencoder.
Specifically, we randomly mask spans of ten frames, choosing span start positions at random and replacing the masked frames with zero vectors. In this way we mask about fifteen percent of the frames, which is similar to the BERT pre-training scheme.
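A minimal sketch of this span-masking step, assuming the span length and overall masking ratio described above; the exact sampling procedure in our setup may differ:

```python
import torch

def mask_spans(frames, span_len=10, mask_ratio=0.15):
    """Zero out random spans of frames so that roughly `mask_ratio`
    of all frames are masked; returns the masked frames and the mask."""
    batch, time, _ = frames.shape
    masked = frames.clone()
    mask = torch.zeros(batch, time, dtype=torch.bool)
    n_spans = max(1, int(time * mask_ratio / span_len))
    for b in range(batch):
        starts = torch.randint(0, max(1, time - span_len), (n_spans,))
        for s in starts.tolist():
            mask[b, s:s + span_len] = True
    masked[mask] = 0.0          # replace masked frames with zero vectors
    return masked, mask         # mask marks which frames to reconstruct
```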
For the ASR task, we simply use the standard CTC training criterion, and then we combine both losses together.
Here λ is a hyperparameter that interpolates between the two losses. One thing to note is that we rescale the reconstruction loss so that it is proportional in magnitude to the CTC loss.
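The combined objective can be sketched as follows. How the reconstruction loss is rescaled to be proportional to the CTC loss is an assumption here (scaling by the ratio of the detached loss values), and the λ value shown is arbitrary:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(ctc_logits, targets, input_lens, target_lens,
                     recon, original, mask, lam=0.1):
    """(1 - lam) * CTC + lam * rescaled L1 over masked frames."""
    # ctc_logits: (batch, time, vocab); CTC expects (time, batch, vocab) log-probs
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    # L1 reconstruction loss computed only on the masked frames
    l1 = F.l1_loss(recon[mask], original[mask])
    # rescale the reconstruction term to the magnitude of the CTC term (assumption)
    scale = ctc.detach() / (l1.detach() + 1e-8)
    return (1.0 - lam) * ctc + lam * scale * l1
```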
After finishing pre-training, the perform model can be frozen, and we then use it to extract features. As shown here, we use the perform model to extract the perform representations, and we pass those representations into a downstream task-specific model. For the downstream model, we simply use an x-vector-style architecture with attention pooling, and for both the language and speaker recognition downstream tasks we train it with a standard classification loss.
At test time, for closed-set language recognition we use the softmax layer directly, since the language recognition task has a fixed set of languages. For speaker recognition, we compare pairs of the speaker embeddings that we extract here.
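As a sketch of the downstream head, here is a small attention-pooling classifier in the spirit of an x-vector model; the hidden sizes and number of classes are placeholder assumptions. For closed-set language recognition the softmax output is used directly, while for speaker recognition the utterance embedding is compared across trial pairs:

```python
import torch
import torch.nn as nn

class AttentivePoolingClassifier(nn.Module):
    """Frame-level features -> attention-weighted pooling -> embedding -> softmax."""

    def __init__(self, feat_dim=768, emb_dim=256, n_classes=14):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                       nn.Linear(512, 512), nn.ReLU())
        self.attn = nn.Linear(512, 1)          # scalar attention score per frame
        self.embed = nn.Linear(512, emb_dim)   # utterance-level embedding
        self.classify = nn.Linear(emb_dim, n_classes)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        h = self.frame_net(feats)
        w = torch.softmax(self.attn(h), dim=1) # attention weights over time
        pooled = (w * h).sum(dim=1)            # attention pooling
        emb = self.embed(pooled)
        return self.classify(emb), emb

# Speaker verification scoring for a trial pair (cosine similarity of embeddings):
# score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=-1)
```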
OK, next let me talk about the experimental setup.
We use mean-normalized MFCCs as the input features.
Our perform model size, learning rate schedule, and training details follow the BERT-base model, which has twelve self-attention layers and a 768-dimensional hidden size.
To generate batches, we group the speech utterances into batches and spread them over multiple GPUs. The learning rate is warmed up over the first batches to its maximum value, and we average the model over the final stage of training.
For perform pre-training, we train the model on two different datasets. The first one is Fisher English, which is about 1.8k hours of conversational telephone speech. We additionally train another perform model on TED-LIUM, which consists of English TED talks.
For speaker recognition, we use the Fisher perform model for the Fisher speaker recognition task, and the TED-LIUM perform model for the VoxCeleb speaker recognition tasks. One thing to note is that even though TED-LIUM and VoxCeleb are both broadcast-style speech, they do not have any data overlap, so VoxCeleb can still be considered an out-of-domain downstream task.
For language recognition, we use the TED-LIUM perform model and evaluate on the closed-set NIST LRE evaluation.
Here are the results for the language recognition experiments. As you can see, we get a huge improvement using the perform features compared with MFCCs as input. We actually reach the state of the art on the three-second and ten-second conditions, and we are only slightly behind the previous best system on the thirty-second condition.
Next, the speaker recognition experiments. On the VoxCeleb dataset, we first show that using the perform features is much better than using MFCCs as input. In the Fisher speaker recognition case, the perform features also improve over the multi-task approach, where a phonetic extractor is jointly trained with ASR and speaker recognition, which is this line here. On the large-scale out-of-domain VoxCeleb speaker recognition task, perform gives around an eighteen percent relative reduction in equal error rate compared with the model trained directly on MFCCs. Our model also improves over recent work that uses the same pre-training set with multi-task and adversarial training, which is this line.
We also did some ablation studies. The first one varies the interpolation weight λ, which balances the reconstruction and CTC losses. As shown in this table, we interpolate between λ = 0, which is the CTC-only model, and λ = 1, which is the reconstruction-only model, and report the language and speaker recognition performance for each setting.
For language recognition, the CTC-only model does the best, and adding reconstruction results in degradation, possibly because it degrades the quality of the phonetic information that is encoded. For speaker recognition, the model does the best when some amount of the reconstruction task is introduced, in line with previous work showing that relying solely on phonetic information is not ideal for speaker recognition.
As expected, using the reconstruction-only model actually degrades speaker recognition performance. We only need a small weight on the reconstruction loss, so we take λ to be a small value.
We also ran an analysis study to investigate how the information is distributed across the different perform layers. Basically, we retrained the downstream model to use a global softmax over per-layer weights to pool the representations across the layers. In this way, we can see which layers the model focuses on.
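A minimal sketch of this analysis: a single set of learnable per-layer weights, normalized with a softmax, pools the stacked layer outputs, and the learned weights indicate which layers the downstream task relies on. The shapes and number of layers are assumptions:

```python
import torch
import torch.nn as nn

class SoftmaxLayerPooling(nn.Module):
    """Learn one scalar weight per encoder layer and take a softmax-weighted
    sum of the layer outputs; inspect the weights after training."""

    def __init__(self, n_layers=12):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_outputs):
        # layer_outputs: (n_layers, batch, time, dim) stacked hidden states
        w = torch.softmax(self.layer_logits, dim=0)            # global per-layer weights
        return torch.einsum('l,lbtd->btd', w, layer_outputs)   # weighted sum over layers

    def layer_importance(self):
        # which layers the downstream task focuses on
        return torch.softmax(self.layer_logits, dim=0).detach()
```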
From this analysis we can see that language recognition uses information mostly from the later layers. This is consistent with language recognition primarily relying on phonetic information. In contrast, speaker recognition uses more of the intermediate layers, which suggests it relies less on the final, most phonetic layers.
This matches the intuition that language recognition uses higher-level features, for example phonetic and sequence information, while speaker recognition uses primarily lower-level features, such as qualities of the speech and the speaker's vocal range, plus possibly some phonetic information.
So, in summary: we introduced perform, a phonetically-aware self-supervised acoustic representation. Using perform with small task-specific models, we can improve the performance of multiple speech tasks, namely language and speaker recognition.
We achieve state-of-the-art results of 6.16 on the closed-set language recognition task, and an eighteen percent relative reduction in speaker equal error rate on the VoxCeleb dataset.
Future work includes exploring additional gains from using perform, and exploring more advanced self-supervised acoustic representation methods. Thank you, and I am happy to take questions.