0:00:16 | Hello everyone. My name is Shaoshi Ling, from Amazon AWS AI, and I am presenting our paper, BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition.
0:00:33 | This is joint work with Julian Salazar, Yuzong Liu, and Katrin Kirchhoff.
0:00:39 | Let's start with the motivation first.
0:00:46 | Transformer-based contextual representations like BERT and GPT have shown great success in downstream natural language understanding tasks.
0:00:58 | Similarly, in speech, we are thinking of how pretrained acoustic representations can help downstream speech tasks.
0:01:14 | The motivation is that downstream speech tasks like speaker recognition or language recognition have very limited training data.
0:01:23 | But there are plenty of large transcribed ASR corpora and unlabeled speech corpora,
0:01:31 | so a learned acoustic representation can utilize those large corpora to help train downstream speech tasks.
0:01:41 | So the most important question here is: what information do we need for those downstream speech tasks?
0:01:50 | The first thing we think of is definitely phonetic information.
0:01:54 | In fact, phonetic information for speaker and language recognition has already been explored for a very long time.
0:02:02 | Past works have used a variety of methods that use ASR models as frame-level feature extractors for speaker and language recognition.
0:02:15 | This is often done with bottleneck features, generally intermediate frame-wise features from an ASR model trained on large speech corpora.
0:02:26 | However, since speaker recognition can require higher-level, coarse information like speaker traits, which is less relevant for ASR,
0:02:36 | bottleneck features may be insufficient for speaker recognition in some cases.
0:02:43 | One well-known method to overcome this is to train a multi-task system that combines ASR with speaker recognition.
0:02:56 | There is also a new trend of self-supervised acoustic representations, which are able to improve ASR by pre-training on large amounts of unlabeled speech.
0:03:11 | Those self-supervised models can capture some global acoustic structure, and can help ASR and also, potentially, downstream speech tasks.
0:03:25 | Some examples: these models have a variety of self-supervised objectives, either contrastive, autoregressive, or masked reconstruction.
0:03:38 | In fact, some of them have already shown promising results by applying self-supervised acoustic representations to speaker recognition.
0:03:52 | So we propose BERTphone, which combines both the phonetic information and the self-supervised acoustic representation approaches that I just talked about in the previous slides.
0:04:05 | Here is an overview of our model. This is the input: we do feature extraction, and we mask spans of frames.
0:04:18 | Then we feed the masked frame sequence into a transformer encoder to get the BERTphone representation.
0:04:27 | Then we do multi-task training on top of the BERTphone representation.
0:04:31 | On the left side is the ASR task, where we use a CTC loss here.
0:04:37 | On the right side is the self-supervised reconstruction task, where we use an L1 loss here to reconstruct the masked frames back to the original frames.
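To make the two-headed setup concrete, here is a minimal PyTorch-style sketch. The layer count and hidden size follow the configuration mentioned later in the talk (BERT-base: 12 layers, 768 dimensions); the feature dimension, attention-head count, and phone-inventory size are assumptions, not the paper's values:

```python
import torch.nn as nn

class BERTphoneSketch(nn.Module):
    """Illustrative two-headed pre-training model: a transformer encoder
    over masked frames, with a CTC (ASR) head and a reconstruction head.
    feat_dim, nhead, and n_phones are assumed, not taken from the paper."""

    def __init__(self, feat_dim=80, d_model=768, n_layers=12, n_phones=48):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.ctc_head = nn.Linear(d_model, n_phones + 1)   # +1 for CTC blank
        self.recon_head = nn.Linear(d_model, feat_dim)     # rebuild frames

    def forward(self, masked_frames):        # (T, B, feat_dim), spans zeroed
        h = self.encoder(self.proj_in(masked_frames))      # BERTphone reps
        return self.ctc_head(h).log_softmax(-1), self.recon_head(h), h
```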
0:04:51 | For the training criteria: for the reconstruction task, we just use an L1 loss to reconstruct the masked frames back to the original frames,
0:05:03 | so it is basically the same as a denoising autoencoder.
0:05:08 | Specifically, we mask spans of ten frames starting at a random 1.5 percent of the positions, replacing them with zero vectors.
0:05:22 | In this way we mask about fifteen percent of the frames, which is similar to BERT's pre-training scheme.
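A minimal sketch of this span-masking scheme (hypothetical code; the 1.5 percent start rate and span length of ten follow the numbers above, and the exact implementation in the paper may differ):

```python
import torch

def mask_spans(frames, span_len=10, start_prob=0.015):
    """Mask contiguous spans of feature frames with zero vectors.
    Sampling span starts at ~1.5% of positions with span length 10
    masks roughly 15% of frames, mirroring BERT's masking rate."""
    T, _ = frames.shape                       # frames: (T, D)
    starts = torch.rand(T) < start_prob       # candidate span starts
    mask = torch.zeros(T, dtype=torch.bool)
    for t in torch.nonzero(starts).flatten().tolist():
        mask[t:t + span_len] = True           # mask the whole span
    masked = frames.clone()
    masked[mask] = 0.0                        # replace with zero vectors
    return masked, mask
```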
0:05:32 | For the ASR task, we just use the standard CTC training criterion.
0:05:38 | Then we combine both losses together.
0:05:44 | Lambda here is the hyperparameter, and this is the CTC loss.
0:05:50 | Note that we use lambda to rescale the reconstruction loss so that it is proportional to the CTC loss.
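Reading lambda = 0 as the CTC-only model and lambda = 1 as the reconstruction-only model, as in the ablation later in the talk, the combination is roughly lambda * w * L_rec + (1 - lambda) * L_CTC. Here is a sketch, with the rescaling factor `w` as an assumption about where the proportional scaling enters:

```python
import torch.nn.functional as F

def bertphone_loss(recon, target, mask, log_probs, labels,
                   input_lens, label_lens, lam=0.5, rescale=1.0):
    """Combined pre-training loss: lam = 0 gives the CTC-only model,
    lam = 1 the reconstruction-only model. `rescale` stands in for the
    factor bringing the L1 loss to the scale of the CTC loss.

    recon, target: (T, B, D) reconstructed / original frames
    mask:          (T, B) bool, True at masked positions
    log_probs:     (T, B, C) log-softmax outputs for CTC
    labels:        (B, S) padded phone targets
    """
    l1 = F.l1_loss(recon[mask], target[mask])   # masked frames only
    ctc = F.ctc_loss(log_probs, labels, input_lens, label_lens)
    return lam * rescale * l1 + (1.0 - lam) * ctc
```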
0:06:04 | After we finish pre-training the BERTphone model, it can be kept fixed, and we then use it to extract features from the data.
0:06:16 | As shown here, we use the BERTphone model to extract the BERTphone representation, which is this part here.
0:06:25 | We then pass this BERTphone representation into the downstream task model.
0:06:37 | For the downstream task model, we just use an x-vector-style architecture with self-attention pooling.
0:06:44 | For both the language recognition and speaker recognition downstream tasks, we train this model with a classification loss.
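A minimal single-head sketch of the self-attention pooling just mentioned (the actual x-vector-style model is certainly more elaborate):

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    """Single-head self-attentive pooling over frame-level features,
    in the spirit of the x-vector-plus-attention downstream model."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                         # x: (T, D) frame features
        a = torch.softmax(self.score(x), dim=0)   # (T, 1) frame weights
        return (a * x).sum(dim=0)                 # (D,) utterance embedding
```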
0:06:53 | First, for closed-set language recognition, we can use the softmax layer directly, since the task has a fixed set of language classes.
0:07:06 | For speaker recognition, we compare pairs of speaker embeddings, which we extract here.
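As an illustration of scoring pairs of speaker embeddings, here is cosine scoring; this is a common backend choice, used purely as an example rather than as the paper's confirmed scoring method:

```python
import torch.nn.functional as F

def speaker_score(emb_a, emb_b):
    """Cosine similarity between two utterance-level speaker embeddings;
    higher scores mean the same speaker is more likely."""
    return F.cosine_similarity(emb_a, emb_b, dim=0).item()
```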
0:07:15 | Okay, so next let's talk about the experimental setup.
0:07:19 | We use mean-normalized MFCCs as the input features.
0:07:26 | For the BERTphone hyperparameters, warmup schedule, and training details: our system is a BERT-base-style model, which is twelve self-attention layers with a 768 hidden dimension.
0:07:40 | To handle variable-length speech utterances, we bucket them into batches and spread them over multiple GPUs.
0:07:52 | Our model is warmed up over the initial batches to a maximum learning rate of 0.01, and we average the model over the last thirty epochs of training.
0:08:06 | For BERTphone pre-training, we trained BERTphone models on two different datasets.
0:08:11 | The first one is Fisher English, which is about 1.8k hours of telephone conversation data.
0:08:21 | In addition, we trained another BERTphone model on TED-LIUM, a corpus of English TED talks.
0:08:34 | For speaker recognition, we use the Fisher BERTphone model for the Fisher speaker recognition task,
0:08:41 | and the TED-LIUM BERTphone model for the VoxCeleb speaker recognition task.
0:08:47 | To be noted: even though TED-LIUM and VoxCeleb both contain broadcast-style speech, they have no data overlap,
0:09:00 | so VoxCeleb can be treated as an out-of-domain downstream task for the TED-LIUM BERTphone model.
0:09:09 | For language recognition, we use the closed-set NIST 2007 Language Recognition Evaluation (LRE07).
0:09:19 | Here are the results of the language recognition experiments.
0:09:24 | As you can see, we get huge improvements from using BERTphone representations instead of MFCCs as input.
0:09:34 | We actually achieve the state of the art on the three-second and ten-second conditions, although we are still behind the previous best system at thirty seconds.
0:09:43 | We suspect this is because our model was pre-trained only on Fisher.
0:09:49 | Next, the speaker recognition experiments.
0:09:54 | On the VoxCeleb dataset, we first show that using BERTphone is much better than using MFCCs as input.
0:10:06 | In the Fisher, that is, in-domain, speaker recognition case,
0:10:13 | BERTphone even improves over the multi-task approach, where a phonetic extractor is jointly trained with ASR and speaker recognition, which is this line here.
0:10:26 | In the large-scale, out-of-domain VoxCeleb speaker recognition task,
0:10:34 | BERTphone gives around an 18 percent relative reduction in equal error rate
0:10:41 | compared with the model trained directly on MFCCs.
0:10:45 | Our model also improves over recent work that uses the same pre-training set with multi-task ASR and speaker training, which is this line here.
0:11:00 | We also did some ablation studies.
0:11:03 | In the first one, we vary the loss-scale hyperparameter lambda, which balances the reconstruction loss against the CTC loss.
0:11:16 | As shown in this table, we interpolate lambda between 0, which is the CTC-only model, and lambda = 1, which is the reconstruction-only model.
0:11:28 | Phone recognition performance, as you can see, is only slightly degraded when the model is also trained to reconstruct.
0:11:37 | For language recognition, the CTC-only model does the best;
0:11:42 | adding reconstruction results in degradation,
0:11:45 | possibly because it degrades the quality of the phonetic information encoded.
0:11:51 | For speaker recognition, the model does the best when some reconstruction loss is introduced,
0:11:57 | in line with previous work showing that purely phonetic information is suboptimal for speaker recognition.
0:12:04 | As expected, using the CTC-only model actually degrades speaker recognition performance.
0:12:12 | In practice, we balance the two tasks and take an intermediate value of lambda, for example 0.2.
0:12:26 | We then did an analysis study to investigate what information is contained in the different BERTphone layers.
0:12:38 | Basically, we trained the downstream model with learnable softmax layer weights
0:12:45 | to pool the representations over the layers.
0:12:49 | In this way, we know which layers the model is focusing on.
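A minimal sketch of this learnable layer weighting (ELMo-style softmax weights; hypothetical code, with `num_layers` covering the encoder layers whose outputs are stacked):

```python
import torch
import torch.nn as nn

class LayerSoftmaxPool(nn.Module):
    """Learnable softmax weights over encoder layers; the learned
    weights reveal which layers a downstream task focuses on."""

    def __init__(self, num_layers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_reps):          # (L, T, D) stacked layer outputs
        w = torch.softmax(self.logits, dim=0)             # (L,) layer weights
        return torch.einsum('l,ltd->td', w, layer_reps)   # (T, D) pooled
```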
0:12:58 | From this figure, we can see that language recognition uses information primarily from the later layers.
0:13:08 | This is consistent with language recognition primarily using phonetic information.
0:13:13 | In contrast, speaker recognition uses more of the intermediate layers,
0:13:18 | which suggests that the speaker information sits below the phonetic information in the network.
0:13:26 | This matches the intuition that language recognition uses higher-level features, for example phonetic and sequence information,
0:13:35 | while speaker recognition primarily uses lower-level features, such as voice qualities and vocal range, plus possibly some phonetic information.
0:13:49 | To sum up, we introduced BERTphone, which is explicitly trained to give phonetically-aware acoustic representations.
0:13:57 | Using BERTphone with small task-specific models, we can improve performance on multiple speech tasks, namely language and speaker recognition.
0:14:08 | We beat the previous state of the art with a Cavg of 6.16 on the LRE07 three-second closed-set language recognition task,
0:14:20 | and achieve an 18 percent relative reduction in speaker EER on the VoxCeleb dataset.
0:14:27 | Future work includes exploring additional gains from fine-tuning BERTphone, and exploring more advanced self-supervised acoustic representation methods.
0:14:41 | Thank you all for your attention.