Hello everyone. I'm presenting our paper on Perform, a phonetically-aware acoustic representation for utterance-level speaker and language recognition. This is joint work with my colleagues.

So let's look at the motivation first. Transformer-based contextual representations like BERT and GPT have shown great success in downstream natural language understanding tasks.

Similarly, in speech, pre-trained acoustic representations can help downstream speech tasks. The motivation is that downstream speech tasks like speaker recognition or language recognition have very limited training data, while there are large amounts of transcribed ASR corpora and unlabeled speech corpora. So pre-training an acoustic representation can make use of those large corpora to help speech tasks with limited training data.

So the most important question here is: what information do we need for a specific downstream speech task? The first thing we think of is definitely phonetic information. In fact, phonetic information for speaker recognition and language recognition has already been explored for a very long time.

Past works have explored a variety of methods that use a hybrid ASR model as a frame-level feature extractor for speaker and language recognition. This is often done with bottleneck features, which are generally intermediate frame-wise features from an ASR model trained on large speech corpora.

However, since speaker recognition can require higher-level, more global information like speaker traits, which is less relevant for ASR, the bottleneck feature may be insufficient for speaker recognition in some cases.

One popular way to overcome this is to train the ASR system in a multi-task fashion with a speaker recognition objective.

On the other hand, there is a new trend of self-supervised acoustic representations, which are able to improve ASR by pre-training on large amounts of unlabeled speech. These self-supervised models can capture some global acoustic structure, and they can help ASR and potentially also downstream speech tasks.

Some examples of these models use various self-supervised objectives, either contrastive or reconstruction-based. In fact, some of them have already shown promising results by applying the self-supervised acoustic representation to speaker recognition.

So we propose Perform, which combines both the phonetic information and the self-supervised acoustic representation that I just talked about in the previous slides.

Here is an overview of our model. This is the input: we do feature extraction, and we mask spans of frames. Then we feed the masked frame sequence into a transformer encoder to get the Perform representation.

Then we do multi-task training with two heads on top of Perform. On the left side is the ASR task, where we use a CTC loss. On the right side is the self-supervised reconstruction task, where we use an L1 loss to reconstruct the masked frames back to the original frames.
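As a rough sketch of the pre-training architecture just described (assuming PyTorch; the module names, layer sizes, feature dimension, and phone-vocabulary size are placeholders, not the authors' actual code):

import torch
import torch.nn as nn

class PerformPretrainModel(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=768, num_layers=12, num_phones=100):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.ctc_head = nn.Linear(hidden_dim, num_phones)   # ASR (CTC) head
        self.recon_head = nn.Linear(hidden_dim, feat_dim)   # masked-frame reconstruction head

    def forward(self, masked_feats):
        # masked_feats: (batch, time, feat_dim) acoustic features with masked spans zeroed out
        hidden = self.encoder(self.input_proj(masked_feats))       # the Perform representation
        ctc_log_probs = self.ctc_head(hidden).log_softmax(dim=-1)  # transpose to (time, batch, vocab) before ctc_loss
        return ctc_log_probs, self.recon_head(hidden)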

For the training criteria: for the reconstruction task we simply use an L1 loss to reconstruct the masked frames to the original frames, so it is basically the same as a denoising autoencoder.

Specifically, we randomly mask spans of ten frames, with span starts at a random five percent of the positions, and replace them with zero vectors. In this way we mask about fifteen percent of the frames, which is similar to the BERT pre-training scheme.
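A minimal sketch of this span masking (the span length of ten frames and the five percent start fraction come from the talk; the helper itself, the NumPy usage, and everything else are assumptions):

import numpy as np

def mask_spans(feats, span_len=10, start_prob=0.05, seed=None):
    # feats: (time, feat_dim) array of acoustic features
    rng = np.random.default_rng(seed)
    num_frames = feats.shape[0]
    mask = np.zeros(num_frames, dtype=bool)
    # choose random span starts; each start masks span_len consecutive frames
    starts = np.nonzero(rng.random(num_frames) < start_prob)[0]
    for s in starts:
        mask[s:s + span_len] = True
    masked = feats.copy()
    masked[mask] = 0.0   # replace masked frames with zero vectors
    return masked, mask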

For the ASR task, we just use the standard CTC training criterion. Then we combine both losses together as L_total = (1 - lambda) * L_CTC + lambda * L_recon, where lambda is a hyperparameter. One thing to note is that we rescale the reconstruction loss so that it is proportional in magnitude to the CTC loss.
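A minimal sketch of this combined objective, assuming PyTorch; the tensor names, shapes, and the rescaling constant recon_scale are assumptions rather than the paper's exact formulation:

import torch
import torch.nn.functional as F

def joint_pretraining_loss(recon, target, mask, ctc_log_probs, phone_targets,
                           input_lengths, target_lengths, lam=0.5, recon_scale=1.0):
    # recon, target: (batch, time, feat_dim); mask: (batch, time) bool marking masked frames
    # ctc_log_probs: (time, batch, vocab) log-softmax outputs of the CTC head
    l1 = F.l1_loss(recon[mask], target[mask])   # denoising-autoencoder-style L1 loss on masked frames
    ctc = F.ctc_loss(ctc_log_probs, phone_targets, input_lengths, target_lengths)
    # lam = 0 gives the CTC-only model, lam = 1 gives the reconstruction-only model
    return (1.0 - lam) * ctc + lam * recon_scale * l1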

After we finish pre-training, the Perform model is fixed, and we use it to extract features from the downstream data. As shown here, we use the Perform model to extract the Perform representation, which is the blue one here, and then we pass this representation into the downstream task model.

For the downstream task model, we just use an x-vector-style architecture with attention pooling, and for both language recognition and speaker recognition we train it with a softmax classification loss. At test time, for closed-set language recognition we use the softmax layer directly, since the task has a fixed set of languages. For speaker recognition, we score trials by comparing pairs of speaker embeddings, which we extract here.
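A minimal sketch of such a downstream model with attention pooling, assuming PyTorch; the layer sizes and class count are placeholders, not the paper's configuration:

import torch
import torch.nn as nn

class AttentionPoolingClassifier(nn.Module):
    def __init__(self, feat_dim=768, emb_dim=256, num_classes=14):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.embedding = nn.Linear(feat_dim, emb_dim)      # utterance-level embedding layer
        self.classifier = nn.Linear(emb_dim, num_classes)  # softmax head (languages or speakers)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) frozen Perform representations
        weights = torch.softmax(self.attn(feats), dim=1)   # per-frame attention weights
        pooled = (weights * feats).sum(dim=1)              # attention-weighted pooling over time
        emb = self.embedding(pooled)                       # embedding used for speaker trial scoring
        return self.classifier(emb), emb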

Okay, next let's go over the experimental setup. We use mean-normalized MFCCs as input. For the Perform parameters and training details, our system uses a BERT-base-style model, which has twelve self-attention layers and a 768-dimensional hidden state. To generate batches, we group variable-length speech utterances into batches and spread them over multiple GPUs. The learning rate is warmed up over the initial batches to a maximum of 0.01, and we average the model over the last thirty epochs of training.

For Perform pre-training, we trained the Perform model on two different datasets. The first one is Fisher English, which is about 1.8k hours of telephone conversation data. For the second one, we trained a Perform model on a corpus of English TED talks.

For speaker recognition, we use the Fisher Perform model for the Fisher speaker recognition task, and the TED Perform model for the VoxCeleb speaker recognition task. One thing to note is that even though the TED data and VoxCeleb are both broadcast-style speech, they do not share any data, so VoxCeleb is still effectively out of domain for the TED Perform model.

For language recognition, we use a closed-set NIST LRE evaluation.

Here are the results for the language recognition experiments. As you can see, we get huge improvements using Perform compared to MFCC input, and we actually achieve state-of-the-art on the three-second and ten-second conditions. We are behind the previous state-of-the-art system on the thirty-second condition.

Next are the speaker recognition experiments.

On the VoxCeleb dataset, we first show that using Perform is much better than using MFCC input. In the Fisher speaker recognition case, Perform also improves over the multi-task approach, where a phone embedding extractor is jointly trained on ASR and speaker recognition, which is this line here.

On the large-scale, out-of-domain VoxCeleb speaker recognition task, Perform gives around an eighty percent relative reduction in equal error rate compared with the model trained directly on MFCCs.

Our model also improves over recent work that uses a similar pre-training setup with multi-task training, which is this line here.

We did some ablation studies. The first one: we vary the loss scale lambda, which is the weight between the reconstruction and CTC losses. As shown in this table, we interpolate lambda between zero, which is the CTC-only model, and one, which is the reconstruction-only model.

Language recognition and speaker recognition behave differently as we move toward training only to reconstruct. For language recognition, the CTC-only model is the best, and adding reconstruction results in degradation, possibly because it degrades the quality of the phonetic information that is encoded. For speaker recognition, the model does best when some amount of the reconstruction task is introduced. This is in line with previous work on the relevance of phonetic information to speaker recognition: as expected, using the reconstruction-only model actively degrades speaker recognition performance.

In practice, we might just balance these losses and take an intermediate value of lambda.

We also did another ablation study to investigate how information is stored in the different Perform layers. Basically, we retrain the downstream model to use a softmax over learnable layer weights to pool the representations across layers. In this way, we can see which layers the model focuses on.
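A minimal sketch of this learned layer-weight pooling, assuming PyTorch (the module and its default size are assumptions, not the authors' code):

import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    def __init__(self, num_layers=12):
        super().__init__()
        # one learnable scalar per transformer layer, normalized with a softmax
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: (num_layers, batch, time, feat_dim) hidden states from every layer
        weights = torch.softmax(self.layer_logits, dim=0)
        pooled = (weights.view(-1, 1, 1, 1) * layer_outputs).sum(dim=0)
        return pooled, weights   # inspect weights to see which layers the task focuses on

Plotting the learned weights per task is what gives the per-layer analysis discussed next.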

Looking at this figure, we can see that language recognition uses the representations mostly from the later layers. This is consistent with language recognition primarily relying on phonetic information. In contrast, speaker recognition uses more of the intermediate layers, which suggests that it focuses less on the final layers, where the phonetic information is concentrated.

This matches the intuition that language recognition uses higher-level features, for example phonetic and sequence information, while speaker recognition primarily uses lower-level features, such as qualities of the speech and the vocal range, plus possibly some phonetic information.

To summarize, we introduced Perform, which is jointly trained to be a phonetically-aware acoustic representation. Using Perform with a small task-specific model, we can improve performance on multiple speech tasks, namely language and speaker recognition. We achieve state-of-the-art results on the short-duration closed-set language recognition conditions, with an error rate of 6.16, and an eighteen percent relative reduction in speaker recognition equal error rate on the VoxCeleb dataset.

Future work includes exploring additional gains from fine-tuning Perform, and exploring more advanced self-supervised acoustic representation methods. Thank you all for your attention.