0:00:16 Hello everyone. My name is [inaudible], and I am from [inaudible]. I am presenting our paper on Perform, a representation for utterance-level speaker and language recognition. This is joint work with [inaudible].
0:00:39 So let's look at the motivation first. Transformer-based contextual representations like BERT and GPT have shown great success in downstream natural language understanding tasks.
0:00:58 Similarly, in speech, we are thinking of whether this kind of pre-trained representation can help downstream speech tasks in the same way. Downstream speech tasks like speaker recognition or language recognition have very limited training data, but there are a lot of large ASR corpora and unlabeled speech corpora. So by pre-training an acoustic representation, we can utilize those large corpora to help downstream speech tasks with limited training data.
0:01:41 So the most important question here is: what information do we need to transfer from the ASR task to these speech tasks? The first thing we can think of is definitely phonetic information. In fact, phonetic information for speaker recognition and language recognition has already been explored for a very long time. Past works have evaluated the value of transferring from ASR models by using them as frame-level feature extractors for speaker and language recognition. This is often done with bottleneck features, or more generally intermediate frame-wise features, from an ASR acoustic model trained on large speech corpora.
0:02:26 However, since speaker recognition can require higher-level, coarse information like speaker traits, which is not relevant for ASR, these frame-level features may be insufficient for speaker recognition in some cases. One way to overcome this weakness is to train the ASR system with multi-task learning, adding a speaker recognition objective.
0:02:56 Separately, there is a new trend of self-supervised acoustic representations, which learn by pre-training on large amounts of unlabeled speech. Those self-supervised models can capture some global acoustic structure, and they can help ASR and also, potentially, downstream speech tasks.
0:03:25 Here are some examples. These models involve a self-supervised objective that is either contrastive or regression-based, i.e., reconstruction. In fact, some of them have already shown success in this space by applying self-supervised acoustic representations to speaker recognition.
0:03:52 So we propose Perform, which includes both the phonetic information and the self-supervised acoustic representation that I was just talking about on the previous slides.
0:04:05 Here is an overview of our model. This is the input: we do feature extraction, and we mask spans of frames. Then we feed the masked frame sequence into a transformer encoder to get the Perform representation. We then train Perform in a multi-task fashion with two losses. On the left side is the ASR task, so we use a CTC loss here, and on the right side is the self-supervised reconstruction task, so we use an L1 loss here to reconstruct the masked frames back to the original frames.
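A minimal PyTorch sketch of this two-branch architecture; the class name, layer sizes, and vocabulary size below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PerformModel(nn.Module):
    """Transformer encoder over masked frame features with two heads:
    a CTC head for the ASR task and a frame-reconstruction head."""

    def __init__(self, feat_dim=40, d_model=768, n_layers=12, n_heads=12, vocab=50):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ctc_head = nn.Linear(d_model, vocab)       # left branch: ASR logits
        self.recon_head = nn.Linear(d_model, feat_dim)  # right branch: frame reconstruction

    def forward(self, masked_frames):
        # masked_frames: (batch, time, feat_dim) with masked spans zeroed out
        h = self.encoder(self.input_proj(masked_frames))  # the Perform representation
        return self.ctc_head(h), self.recon_head(h), h
```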
0:04:51 For the training criteria: for the reconstruction task, we just use an L1 loss to reconstruct the masked frames back to the original frames, so it is basically the same as a denoising autoencoder. Specifically, we randomly mask spans of ten frames, choosing the starting positions at random and replacing the spans with zero vectors, so that about fifteen percent of the frames are masked, which is similar to BERT's pre-training scheme.
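A sketch of the span masking just described; the rate of span starts is garbled in the recording, so `target_ratio` below is set directly to the stated fifteen percent:

```python
import torch

def mask_spans(frames, span_len=10, target_ratio=0.15):
    """Zero out random spans of `span_len` frames so that roughly
    `target_ratio` of all frames are masked, BERT-style."""
    batch, time, _ = frames.shape
    n_spans = max(1, int(time * target_ratio / span_len))  # spans per utterance
    mask = torch.zeros(batch, time, dtype=torch.bool)
    for b in range(batch):
        starts = torch.randint(0, max(1, time - span_len), (n_spans,))
        for s in starts:
            mask[b, s:s + span_len] = True
    masked = frames.masked_fill(mask.unsqueeze(-1), 0.0)   # replace with zero vectors
    return masked, mask
```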
0:05:32 For the ASR task, we just use the standard CTC training criterion. We then combine both losses, weighted by a hyperparameter λ: L = (1 − λ) · L_CTC + λ · L_recon. One thing to note is that we rescale the reconstruction loss so that it stays proportional to the CTC loss.
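A sketch of the combined objective; the rescaling is paraphrased from the talk and implemented here by matching the reconstruction term's magnitude to the CTC term, which is one plausible reading:

```python
import torch.nn.functional as F

def perform_loss(log_probs, targets, in_lens, tgt_lens,
                 recon, originals, mask, lam=0.5):
    """L = (1 - lam) * L_CTC + lam * L_recon, with the reconstruction
    loss rescaled to stay proportional to the CTC loss."""
    # log_probs: (time, batch, vocab) log-softmax outputs of the CTC head
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens)
    l1 = F.l1_loss(recon[mask], originals[mask])   # L1 over masked frames only
    scale = (ctc / l1).detach()                    # magnitude match, no gradient
    return (1 - lam) * ctc + lam * scale * l1
```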
0:06:04 After we finish pre-training, the Perform model weights are fixed, and we use the model to extract features from the downstream data. As shown here, we use the Perform model to extract the Perform representation, which is this part here, and then we pass the Perform representation into the downstream task model. For the downstream task model, we just use an x-vector architecture with self-attention pooling, and for both the language and the speaker recognition tasks we train it with a classification loss.
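A sketch of self-attention pooling over the frame-level features, as used in the downstream model (a single-head version with illustrative names):

```python
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    """Collapse (batch, time, dim) frame features into one utterance-level
    vector using learned attention weights over time."""

    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)               # one score per frame

    def forward(self, h):
        w = torch.softmax(self.scorer(h), dim=1)      # (batch, time, 1), sums to 1 over time
        return (w * h).sum(dim=1)                     # weighted mean -> (batch, dim)
```

The pooled vector then feeds the utterance-level layers of the x-vector network.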
0:06:53 At test time, for closed-set language recognition, we use the softmax layer directly, since the identification task has a fixed inventory of languages. For speaker recognition, we instead compare pairs of speaker embeddings, which we extract here.
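The exact trial-scoring backend is hard to make out in the recording; a minimal sketch assuming plain cosine scoring of a pair of extracted embeddings:

```python
import torch.nn.functional as F

def score_trial(emb_enroll, emb_test):
    """Cosine similarity between two speaker embeddings; thresholding
    this score gives the accept/reject decisions used to compute EER."""
    return F.cosine_similarity(emb_enroll, emb_test, dim=-1)
```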
0:07:15 Okay, next let's talk about the experimental setup. We use 40-dimensional mean-normalized MFCCs as the input. Our Perform hyperparameters and training schedule follow BERT: our system uses a BERT-base-style model, which is twelve self-attention layers with a 768-dimensional hidden representation. We batch variable-length speech utterances together and spread them over multiple GPUs. The model is warmed up over the first batches of training to a maximum learning rate of 0.01, and we average the model over the last thirty epochs of training.
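A sketch of such a warm-up schedule; the warm-up length and the shape of the decay after the peak are not audible, so `warmup_steps` and the inverse-square-root decay below are assumptions:

```python
def lr_at_step(step, warmup_steps=25000, peak_lr=0.01):
    """Linear warm-up to peak_lr over the first batches, then
    inverse-square-root decay (a common transformer schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```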
0:08:06 For Perform pre-training, we trained Perform models on two different datasets. The first one is Fisher English, which is about 1.8k hours of telephone conversation data. In addition, we trained another Perform model on TED-LIUM, which is [inaudible] hours of English TED talks. For speaker recognition, we use the Fisher Perform model for the Fisher speaker recognition task, and we use the TED-LIUM Perform model for the VoxCeleb speaker recognition task.
0:08:47 One thing to note is that even though TED-LIUM and VoxCeleb both contain broadcast-style speech, they have no data overlap between them, so VoxCeleb can be considered an out-of-domain downstream task for the TED-LIUM Perform model. For language recognition, we use the TED-LIUM Perform model and evaluate on the closed-set NIST 2007 Language Recognition Evaluation (LRE07).
0:09:19 Here are the results for the language recognition experiments. As you can see, we get huge improvements using Perform features here compared to using MFCCs as input. We actually beat the state of the art on the three-second and ten-second conditions. We are still behind the previous state-of-the-art system on the thirty-second condition, but we expect this gap can be narrowed.
0:09:49 Next, the speaker recognition experiments. On the VoxCeleb dataset, we first show that using Perform is much better than using MFCCs as input. In the Fisher in-domain speaker recognition case, Perform features outperform the multi-task approach, where a phonetic network is jointly trained on ASR and speaker recognition, which is this line here. On the out-of-domain VoxCeleb speaker recognition task, Perform gives around an eighteen percent relative reduction in equal error rate compared with the model trained directly on MFCCs. Our model also outperforms recent work that uses ASR pre-training with multi-task fine-tuning, which is this line.
0:11:00 We did some ablation studies. The first one tries to interpret the loss scale λ, which is the weight balancing the reconstruction and CTC losses. As shown in this table, we interpolate between λ = 0, which is the CTC-only model, and λ = 1, which is the reconstruction-only model, and look at how language recognition and speaker recognition performance change; the ASR performance itself is only slightly degraded when the model is also trained to reconstruct.

0:11:37 For language recognition, the CTC-only model does best, and adding reconstruction results in degradation, possibly because it degrades the quality of the phonetic information that is encoded. For speaker recognition, the model does best when some amount of the reconstruction task is introduced, in line with previous work suggesting that purely phonetic information is not optimal for speaker recognition. As expected, using the reconstruction-only model actively degrades speaker recognition performance. In practice, we suggest choosing λ between these two extremes.
0:12:26 We also ran a study of how information is incorporated across the different Perform layers. Basically, what we did is train the downstream model to use a global softmax over learnable scalar weights to pool the representations over the layers. In this way, we know which layers the model focuses on.
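A sketch of this probing mechanism, an ELMo-style softmax weighting over layers (names are illustrative):

```python
import torch
import torch.nn as nn

class LayerSoftmaxPool(nn.Module):
    """Combine per-layer features (n_layers, batch, time, dim) with
    softmax-normalized scalar weights; the learned weights reveal
    which layers the downstream task relies on."""

    def __init__(self, n_layers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_feats):
        weights = torch.softmax(self.w, dim=0)              # sums to 1
        return torch.einsum('l,lbtd->btd', weights, layer_feats)
```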
0:12:58 In this figure, we can see that language recognition uses information mostly from the later layers. This is consistent with language recognition primarily using phonetic information. In contrast, speaker recognition uses more of the intermediate layers, which is consistent with the ASR objective focusing the end of the network on phonetic information. This matches the intuition that language recognition uses higher-level features, for example phonetic and sequence information, while speaker recognition primarily uses lower-level features, like qualities of the speech and the vocal range, plus some possible phonetic information.
0:13:49 In summary, we introduced Perform, which is designed to learn a phonetically-aware acoustic representation. Using Perform with small task-specific models, we can improve performance on multiple speech tasks, namely language and speaker recognition. We beat the previous state of the art on the three-second and ten-second closed-set language recognition tasks, with results of 6.16 and [inaudible], and achieve an eighteen percent relative reduction in speaker equal error rate on the VoxCeleb dataset.

0:14:27 Future work includes exploring additional gains from fine-tuning Perform and exploring more advanced self-supervised acoustic representation methods.

0:14:41 Thank you all for your attention.