0:00:01 | [Audio check] Two... yes... OK, the audio seems to be working well. |
0:00:11 | Thank you. First, I want to thank the organizers again for arranging this event. [inaudible] You did it very well, and I'm sure we will take advantage of your organization again. |
0:00:31 | Secondly, I'm really pleased to have the occasion to introduce Mirco Ravanelli, who will be our first speaker. |
0:00:41 | I will be short, because I'm quite sure that almost all of the attendees already know you, so I will not give a long introduction: they know you even if you are still young. |
0:00:58 | Your bio, according to me at least, can be summarized in a few points. |
0:01:04 | So, you got your master's degree and then went to the University of Trento, about ten years ago. |
0:01:16 | Then you continued as a PhD student, and you defended your PhD on deep learning for distant speech recognition in 2017. |
0:01:39 | And then you joined Mila. Maybe it is not useful to introduce Mila; I'm sure that all of us know it very well. |
0:01:53 | And you started working there as a postdoc, working closely with Yoshua Bengio. |
0:02:01 | You worked on several topics, mainly on representation learning, for speech but not only. |
0:02:08 | And recently you are also one of the co-founders of the SpeechBrain initiative for building a new open toolkit for speech and speaker recognition. |
0:02:27 | You already have a long list of contributions on these topics, and I know you prepared a very nice overview for us. |
0:02:44 | That is all, I think, as a word of introduction. Before giving you the floor, I will explain how the session will work. |
0:02:54 | We will first watch a pre-recorded video by Mirco. |
0:03:00 | During the video, if you want to ask some questions, please type them in the question box. |
0:03:10 | Take the time to think about your questions, and if more explanation is needed we will make sure you get complete answers. |
0:03:19 | And then we will have fifteen minutes of live questions and answers with Mirco. During this session, |
0:03:29 | you can use both the question-and-answer box |
0:03:34 | or raise your hand: if you raise your hand, we will know that you want to ask your question live |
0:03:42 | during the session. |
0:03:44 | So, Mirco, do you want to say some words before we start the video? "Just thank you very much for the introduction. Hello everyone." |
0:03:52 | "I hope the bandwidth for the video will be fine, but in the worst case you guys will probably have to increase it a little bit." |
0:04:01 | "Let's see how it goes." |
0:04:04 | OK, I think we can roll the video now. |
0:04:49 | Sorry, we have a small technical problem: we don't have the audio of the video. |
0:04:57 | Before, it was working, so it is better to switch back to the previous |
0:05:03 | presentation. |
0:05:05 | We can't hear anything, right? |
0:05:10 | Can we have a quick test? |
0:05:17 | OK, trying again. |
0:05:41 | Hi everyone, I'm Mirco Ravanelli, and I'm very happy to give this talk here today. |
0:05:49 | So let me first thank the organizers for the invitation and for the chance to share this work with the speech community. |
0:06:01 | The talk is entitled "Towards unsupervised training of speech representations". |
0:06:07 | Self-supervised learning is a hot topic in the machine learning field, and of course it is gaining traction within the speech community as well. |
0:06:18 | So today I would like to share the experience that I gained after working for about two or three years on this topic. |
0:06:32 | OK, but before diving into self-supervised learning, let me recall some of the limitations of supervised learning, which is the dominant paradigm these days. |
0:06:48 | You can see deep learning as a way to learn hierarchical representations, where we start from low-level concepts, we combine them, and we create higher-level ones. |
0:07:03 | Deep learning is a very general paradigm. It is implemented through deep neural networks that are often trained in a supervised way using large annotated corpora. |
0:07:22 | This is the dominant approach. |
0:07:26 | Despite its great success in many practical applications, it is clear today that this paradigm has some limitations. |
0:07:39 | What are these issues? For example, we need data, and not generic data but annotated data, and the annotation process can be expensive and time-consuming, as it requires numerous human annotators. |
0:08:01 | Moreover, supervised learning is not only data-hungry but also computationally demanding. |
0:08:09 | Of course, these days, to reach state-of-the-art performance in machine learning we need a lot of data, and a lot of data requires a lot of computation. |
0:08:21 | This limits de facto the access to supervised learning technology to a restricted group of users. |
0:08:36 | Moreover, if we train a system in a supervised way, the representations that we learn might be biased towards a specific application. |
0:08:49 | For instance, if we train a system for speaker identification, the representations learned there would not work for speech recognition. |
0:09:00 | So we might want to learn some kind of general representation that makes transfer learning much easier and better. |
0:09:14 | The third limitation is actually more philosophical, and it is that the human brain does not use only supervised learning: it cleverly combines different learning modalities. |
0:09:28 | I'm pretty sure that combining different learning modalities is crucial to reach higher levels of artificial intelligence: we can combine supervised learning with contrastive learning, with imitation learning, with reinforcement learning, and of course with self-supervised learning. |
0:09:55 | So what is self-supervised learning? Self-supervised learning is a type of unsupervised learning where we have supervision, but the supervision is extracted from the signal itself. |
0:10:13 | In self-supervised learning we thus don't have humans that have to create labels; the labels are created basically for free, and we can create tons of them without effort. |
0:10:31 | Normally, in self-supervised learning, we apply some kind of known transformation to the input signal and use the resulting outcomes as labels, as targets. |
0:10:46 | Let me clarify this with some examples derived from the computer vision community, which was the first one pushing this approach. |
0:10:58 | The computer vision community actually noticed, quite early on, that by solving some kind of simple task we are able to train a neural network that learns some kind of meaningful representation. |
0:11:17 | For instance, you can ask your neural network to solve some kind of relative-positioning task, where you have small patches of an image and you have to decide the relative position between them. |
0:11:29 | You can ask your neural network to predict the right colors of an image, or to find the correct rotation of an image. |
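The rotation pretext task just described can be sketched in a few lines. This is a toy illustration under stated assumptions (a 2x2 grid of numbers stands in for an image; the helper names are invented here), not the implementation from the computer-vision papers:

```python
# Sketch of how a pretext task generates labels for free: rotate an
# "image" by a random multiple of 90 degrees and use the rotation index
# as the classification target. No human annotation is involved.
import random

def rotate90(image):
    """Rotate a square grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng=random):
    """Return (rotated_image, label) where label in {0, 1, 2, 3} is the
    number of 90-degree rotations applied."""
    label = rng.randrange(4)
    rotated = image
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label

image = [[1, 2], [3, 4]]
x, y = make_rotation_example(image)
```

A network trained to predict `y` from `x` must learn something about the content of the image, which is exactly the point of the pretext task.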
0:11:39 | All of these tasks are relatively easy, but if we design a system, a neural network, that learns how to solve them, we inherently require our system to have some kind of semantic knowledge of the world, or at least semantic knowledge of the image, and that can be very helpful to derive hopefully high-level, robust representations. |
0:12:10 | And yes, self-supervised learning is extremely interesting and is gaining a lot of attention. |
0:12:19 | Let me show the famous cake analogy by Yann LeCun, which says that if machine learning is a cake, then reinforcement learning is the cherry on the cake, supervised learning is the icing on the cake, and unsupervised (or self-supervised) learning is the cake itself. This means that we believe this modality is definitely a key ingredient to develop intelligent systems. |
0:12:54 | OK, but what about the audio and speech field? As I mentioned before, there is a growing number of research works moving in the direction of self-supervised learning in speech, and we have seen many of them even at this Interspeech. |
0:13:17 | Here let me just highlight a few of them. In my opinion, the first work that clearly showed the potential of self-supervised learning in speech is the contrastive predictive coding work by van den Oord et al. in 2018. |
0:13:38 | This work is mostly about predicting the future given the past. |
0:13:45 | More recently we have seen another very good work by Facebook, called wav2vec 2.0, where they were able to show impressive results with an approach that employs some kind of masking technique. |
0:14:05 | I also contributed to this field with the problem-agnostic speech encoder (PASE), in which, as we will see later, we explore multi-task self-supervised learning. |
0:14:18 | However, self-supervised learning of speech is really challenging. Why? |
0:14:27 | First of all, because speech is characterized by high-dimensional data: we typically have long sequences of samples that can be of variable length. |
0:14:40 | Last but not least, speech inherently entails a complex hierarchical structure that might be very difficult to infer without being guided by a strong supervision. |
0:14:58 | Speech, in fact, is characterized by samples; we can combine samples to obtain phonemes; from phonemes we can create the levels of syllables and words; and finally we have the meaning of the sentences. |
0:15:16 | And inferring all this kind of structure might be extremely difficult. |
0:15:25 | On my side, I started working on self-supervised learning when I started my postdoc, almost three years ago. |
0:15:35 | At that time, people at Mila were doing research on self-supervised learning approaches based on mutual information, and I got so excited that I decided to study self-supervised approaches based on mutual information for learning speech representations. |
0:15:56 | That led to the development of a technique called local info max (LIM) that I will describe in the next slides. |
0:16:05 | After that, we further extended these techniques using a multi-task self-supervised learning approach, and that led to the development of the problem-agnostic speech encoder (PASE), presented at Interspeech 2019. |
0:16:22 | We also extended this with another, improved version called PASE+, and we recently presented this work at ICASSP. |
0:16:37 | OK, let's start from the mutual-information-based approach. What is mutual information? |
0:16:44 | Mutual information is defined as the KL divergence between the joint distribution of two random variables and the product of their marginals. |
0:16:58 | Why is this important? Because with mutual information we can capture complex nonlinear relationships between random variables. |
0:17:10 | If the two random variables are independent, the mutual information is zero, while if there is some kind of dependency between the variables, the mutual information is greater than zero. |
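The definition above can be checked on a toy example. This is a minimal sketch for discrete random variables (the function name and the two example distributions are illustrative):

```python
# I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ):
# the KL divergence between the joint and the product of marginals.
import math

def mutual_information(joint):
    """joint: dict mapping (x, y) -> p(x, y). Returns I(X;Y) in nats."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent fair coins: I(X;Y) = 0.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Fully dependent coins (Y = X): I(X;Y) = log(2) nats.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
```

The two cases reproduce exactly the behavior described in the talk: zero for independence, strictly positive under dependency.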
0:17:26 | This is very attractive. The issue is that mutual information is difficult to compute in high-dimensional spaces, and this has limited a lot its practical application in machine learning. |
0:17:47 | However, one recent work called MINE (mutual information neural estimation) found that it is possible to maximize or minimize the mutual information within a framework that closely resembles that of GANs. |
0:18:05 | How does it work? Let's assume we can somehow draw samples from the joint distribution; we call them positive samples, and we will explain later how we can do that in practice. |
0:18:20 | Let's also assume we can draw samples from the product of the marginal distributions, and we call these negative samples. |
0:18:32 | Then we can feed these positive and negative samples to a special neural network whose cost function is the Donsker-Varadhan bound, which is a lower bound of the mutual information. |
0:18:49 | And if we train this neural network to maximize this lower bound, we finally converge to an estimate of the mutual information. |
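The Donsker-Varadhan bound can be illustrated numerically. A minimal sketch, assuming a fixed toy critic instead of the trained neural network MINE actually uses, and hand-built positive/negative pairs for two binary variables with Y = X:

```python
# Donsker-Varadhan lower bound used by MINE:
#   I(X;Y) >= E_joint[T(x,y)] - log E_marginals[exp(T(x,y))]
# In MINE the critic T is a neural network trained to tighten the bound;
# here T is a fixed toy function so everything is deterministic.
import math

def dv_bound(critic, positive_pairs, negative_pairs):
    """Estimate the DV bound from positive (joint) and negative
    (product-of-marginals) samples."""
    e_joint = sum(critic(x, y) for x, y in positive_pairs) / len(positive_pairs)
    e_marg = sum(math.exp(critic(x, y)) for x, y in negative_pairs) / len(negative_pairs)
    return e_joint - math.log(e_marg)

# Y = X, so positives always agree; under the product of marginals all
# four combinations are equally likely.
positives = [(0, 0), (1, 1)] * 500
negatives = [(0, 0), (0, 1), (1, 0), (1, 1)] * 250

critic = lambda x, y: 1.0 if x == y else -1.0
estimate = dv_bound(critic, positives, negatives)
# estimate = 1 - log(cosh(1)) ~= 0.566, below the true MI of log(2) ~= 0.693.
```

With this fixed critic the bound is not tight; optimizing the critic, as MINE does, pushes the estimate towards the true mutual information.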
0:19:01 | Inspired by this approach, I started thinking about mutual-information-based approaches specifically designed for speech. |
0:19:12 | This led to a technique called local info max, which works in this way. |
0:19:20 | We employ a sampling strategy that provides positive and negative samples: we start by choosing a random chunk from a random sentence, called c1; |
0:19:37 | then we choose another random chunk from the same sentence, and we call it c2; |
0:19:45 | and finally we choose another random chunk from a different sentence, called c_rand. |
0:19:53 | With these chunks we can play some interesting games. For instance, we can process c1, c2, and c_rand with an encoder, which provides hopefully higher-level representations z1, z2, and z_rand. |
0:20:14 | Then we can build positive and negative samples as follows. If we concatenate z1 and z2, we create a sample from the joint distribution, a positive sample, because we expect some kind of relationship between these random variables, since they are extracted from the same signal. |
0:20:43 | Then we can also create a negative sample by concatenating z1 and z_rand, and this can be seen as a sample from the product of the marginal distributions. |
0:20:56 | After that, we employ a discriminator, which is fed with positive or negative samples, and the discriminator should basically figure out whether it is given positive or negative examples: in this case, whether the representations come from the same sentence or from different ones. |
0:21:22 | In this system, the discriminator loss is set up to maximize the mutual information. |
0:21:30 | Moreover, the encoder and the discriminator are jointly trained from scratch, and this results in a cooperative game, not an adversarial game like in GANs: in this case, the encoder and the discriminator should cooperate to learn good and hopefully high-level representations. |
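The LIM sampling game can be sketched on toy data. Assumptions: sentences are plain lists of numbers, the encoder is a stub that averages a chunk (the real one is a neural network), no training happens, and names like `lim_samples` are invented for illustration:

```python
# Sketch of local info max (LIM) sampling: two chunks from the same
# sentence form a positive pair, a chunk from another sentence forms
# the negative pair. A discriminator trained with a binary
# cross-entropy style loss on such pairs maximizes the MI lower bound.
import random

rng = random.Random(7)

def random_chunk(sentence, size=5):
    start = rng.randrange(len(sentence) - size + 1)
    return sentence[start:start + size]

def encoder(chunk):
    """Stub encoder: mean of the chunk stands in for a learned embedding."""
    return sum(chunk) / len(chunk)

def lim_samples(sentence_a, sentence_b):
    """positive = (z1, z2) from the same sentence,
    negative = (z1, z_rand) from a different sentence."""
    z1 = encoder(random_chunk(sentence_a))
    z2 = encoder(random_chunk(sentence_a))
    z_rand = encoder(random_chunk(sentence_b))
    return (z1, z2), (z1, z_rand)

# Two toy "speakers": one sentence with low values, one with high values.
sent_a = [0.1 * rng.random() for _ in range(50)]
sent_b = [1.0 + 0.1 * rng.random() for _ in range(50)]
pos, neg = lim_samples(sent_a, sent_b)
```

Even with this stub encoder, the positive pair is much closer than the negative pair, which is the regularity the discriminator exploits and which, on real data, ends up encoding speaker identity.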
0:21:57 | A good question here is: what do we learn if we play this game? With this game we basically learn speaker identities, or, in other words, speaker embeddings. |
0:22:15 | Why? Because this approach is based on randomly sampling within the same sentence, and if we randomly sample within the same sentence, the most reliable and stable factor that the system can disentangle is definitely the speaker identity. |
0:22:34 | Moreover, we assume that we have a dataset that is large enough, with a large variability of speakers, so that if we randomly sample two sentences, the probability of finding the same speaker is very low. |
0:22:49 | So, overall, this can be seen as a system for learning speaker embeddings without providing the system any explicit label on the speaker identity. |
0:23:06 | The encoder is fed with the raw speech samples directly. |
0:23:12 | In the first layer of the encoder architecture we just use SincNet, which makes the problem of learning from raw samples much easier. |
0:23:20 | In fact, instead of using standard convolutional filters, we use band-pass parameterized filters that only learn the cutoff frequencies of the filters. |
0:23:32 | This makes learning from the raw samples easier. |
0:23:38 | SincNet is not only useful in supervised learning; it is also useful in this self-supervised context, and I encourage you to read the reference paper if you would like to learn more about SincNet. |
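The core SincNet idea, a filter kernel determined by just two cutoff frequencies, can be sketched as follows. This builds an ideal band-pass impulse response as the difference of two sinc low-pass filters; the real SincNet additionally windows the kernel and learns the cutoffs by backpropagation:

```python
# A band-pass filter parameterized only by its low and high cutoff
# frequencies, built as the difference of two sinc low-pass filters.
# Two scalars define the whole kernel, instead of one free weight per tap.
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def bandpass_kernel(f_low, f_high, length=101, fs=16000):
    """Impulse response of an ideal band-pass filter (cutoffs in Hz)."""
    half = length // 2
    kernel = []
    for n in range(-half, half + 1):
        lp_high = 2 * f_high / fs * sinc(2 * f_high / fs * n)
        lp_low = 2 * f_low / fs * sinc(2 * f_low / fs * n)
        kernel.append(lp_high - lp_low)
    return kernel

# A telephone-band filter: 300 Hz to 3400 Hz at 16 kHz sampling rate.
k = bandpass_kernel(300.0, 3400.0)
```

The center tap equals 2(f_high - f_low)/fs and the kernel is symmetric, as expected for a linear-phase filter; during training only `f_low` and `f_high` would receive gradients.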
0:23:53 | What are the strengths and issues of local info max? |
0:23:58 | The strength is that with local info max we are able to learn high-quality speaker representations, which are competitive with the ones learned in a standard supervised way. |
0:24:12 | Moreover, local info max is very simple and also computationally efficient: because we only use local information, we can parallelize a lot of the computations. |
0:24:26 | The limitation of this method is that the representations are very task-specific. |
0:24:32 | As we have seen before, with LIM we can learn speaker embeddings, but what about all the other kinds of information that are embedded in the speech signal, like phonemes, emotions, and many other things? |
0:24:51 | So, when I saw these results, I asked myself: am I really sure that a single task is enough? |
0:25:00 | Actually, most of the efforts try to use self-supervised learning by solving a single task, but my experience suggests that one single task is not enough, because with a single task we always capture only a little of the information in the signal that we might want. |
0:25:25 | Based on this observation, we decided to start a new project called the problem-agnostic speech encoder (PASE), where we wanted to learn more general representations by jointly optimizing multiple self-supervised tasks. |
0:25:46 | In PASE we have an ensemble of neural networks that must operate together to discover good speech representations. |
0:25:58 | So what is the intuition behind that? If we jointly solve multiple self-supervised tasks, we can expect that each task brings a different view of the speech signal, and if you put together different views of the same signal, you might have higher chances to obtain a more general and complete description of the signal itself. |
0:26:28 | Moreover, a consensus across all these views is needed, and this imposes some kind of soft constraint on the representation that, in the end, can improve its robustness. |
0:26:44 | With this approach we were actually able to learn general, robust, and transferable features, thanks to jointly solving multiple tasks. Let me explain in the next slides more details on how the system works. |
0:27:05 | PASE is based on an encoder that transforms the raw samples into a higher-level representation. |
0:27:14 | The encoder is based on SincNet, followed by seven convolutional blocks and other layers. |
0:27:22 | Here again we start from the raw signal, because we want to start from the lowest possible speech representation. |
0:27:32 | After the encoder we have a bunch of workers, where each worker solves a different self-supervised task. |
0:27:41 | One thing to remark is that the workers are very small neural networks, because if the workers are very simple and small, we force the encoder to provide a much more robust and higher-level representation. |
0:28:01 | There are actually two types of workers: regression workers, which solve a regression task, and binary workers, which solve a binary classification task. |
0:28:14 | The binary workers are similar to the ones that we have seen before for the mutual-information approach. |
0:28:23 | As for the regression tasks, we have some workers that estimate some kind of known speech representations. |
0:28:33 | For instance, we have one worker estimating the waveform back, in an autoencoder fashion; we estimate the log power spectrum; we estimate the mel-frequency cepstral coefficients (MFCCs); and we also have prosodic features such as voicing probability, zero-crossing rate, and others. |
0:28:54 | Why do we do something like that? Because in this way we inject into the encoder some kind of prior knowledge that can be very helpful in self-supervised learning. |
0:29:07 | In particular, in the speech community we are well aware that there are some features that are very helpful, like MFCCs or prosody, so why not try to take advantage of that? Why not try to inject this information inside our neural network? |
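Two of the handcrafted regression targets mentioned, zero-crossing rate and log energy, are easy to sketch per frame; the frame sizes and function names here are illustrative, and the spectral/MFCC targets are omitted since they require a DSP library:

```python
# Frame-level regression targets computed from the raw waveform itself:
# supervision extracted from the signal, with no annotator involved.
import math

def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def log_energy(frame, eps=1e-10):
    return math.log(sum(x * x for x in frame) + eps)

def frame_targets(signal, frame_len=160, hop=80):
    """Slide a window over the signal and emit (zcr, log_energy) labels."""
    targets = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        targets.append((zero_crossing_rate(frame), log_energy(frame)))
    return targets

# A toy alternating signal: it crosses zero at every sample pair.
signal = [(-1.0) ** n for n in range(400)]
targets = frame_targets(signal)
```

Each worker in PASE would regress targets of this kind from the encoder output, pulling the shared representation towards features the speech community already knows are informative.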
0:29:29 | In parallel to the regressors, we also have binary classification tasks. |
0:29:35 | The binary classification tasks work similarly to what we have described for the mutual-information approaches: basically, we sample three speech chunks, an anchor, a positive, and a negative, according to some kind of predefined strategy. |
0:29:53 | We then process all these chunks with the PASE encoder, and then we feed a discriminator, which is trained with binary cross-entropy and should figure out whether we have a positive or a negative sample. |
0:30:08 | So it is very similar to the LIM approach we described before. |
0:30:14 | The only difference is the particular sampling strategy, because with different sampling strategies we can highlight different features. |
0:30:24 | One sampling strategy that we adopt is the one proposed in local info max that, as we have seen before, is able to learn speaker embeddings and, in general, speaker identity. |
0:30:38 | Together with that, we have another similar strategy called global info max. Here we basically play the same game, but we use larger chunks, and with larger chunks we hope to capture some kind of complementary information, which hopefully is more global. |
0:31:01 | Finally, we propose another interesting task called sequence predictive coding. With this task we are hopefully able to capture some kind of information on the order of the sequence. |
0:31:16 | It works in this way: we choose a random chunk from a random sentence, called the anchor chunk; we choose another random chunk from the future of the same sentence, which is the positive one; and then we choose another random chunk from the past of the same sentence, which is the negative one. |
0:31:37 | If we play this game, we are hopefully able to capture a little bit better how the sequence evolves, and to capture some kind of longer-context information that we were not able to capture with the previous tasks. |
0:31:56 | This sequence predictive coding is similar to the contrastive predictive coding proposed by van den Oord. |
0:32:03 | The main difference is that in our work the negative samples, actually all the samples, are derived from the same sentence, not from other ones, because in this case we would like to focus only on how the sequence evolves: we don't want to capture other kinds of information, such as the speaker, that we will capture with other tasks. |
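The sequence-predictive-coding sampling can be sketched as simple index bookkeeping. A minimal illustration (the chunk size and function name are assumptions made here):

```python
# Sequence predictive coding sampling: anchor chunk at a random position,
# positive from the future of the SAME sentence, negative from its past.
# Keeping everything in one sentence removes speaker cues, so the
# discriminator can only rely on temporal order.
import random

def spc_triplet(sentence, chunk=5, rng=random):
    """Return (anchor, positive, negative) chunks from one sentence."""
    # Leave room for one chunk in the past and one in the future.
    a = rng.randrange(chunk, len(sentence) - 2 * chunk)
    p = rng.randrange(a + chunk, len(sentence) - chunk + 1)   # future
    n = rng.randrange(0, a - chunk + 1)                       # past
    take = lambda i: sentence[i:i + chunk]
    return take(a), take(p), take(n)

rng = random.Random(3)
sentence = list(range(100))  # toy "frames" whose values equal their index
anchor, pos, neg = spc_triplet(sentence, rng=rng)
```

Because the toy frame values equal their indices, one can verify directly that the positive always starts after the anchor and the negative before it.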
0:32:30 | OK, but how can we use PASE inside a speech processing pipeline? |
0:32:39 | Step one is self-supervised training: we can take the architecture that we have seen before and train it. In particular, we can jointly train the encoder and the workers using standard SGD, by optimizing a loss which is computed as the average of each worker's cost. |
0:33:05 | In our experiments we tried different alternatives, but we found that averaging the costs is the best approach, at least among the ones we tried. |
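The multi-task objective described, a plain average of the worker costs, is trivial to sketch; the worker names and loss values below are purely illustrative:

```python
# The total self-supervised loss is the unweighted mean of the
# per-worker costs; its gradient flows back into the shared encoder.
def total_loss(worker_losses):
    """Average of the per-worker costs (the weighting PASE settled on)."""
    return sum(worker_losses) / len(worker_losses)

# Hypothetical per-worker costs for one batch (illustrative numbers only).
losses = {
    "waveform": 0.8,
    "log_power_spectrum": 0.5,
    "mfcc": 0.4,
    "prosody": 0.6,
    "local_info_max": 0.7,
}
loss = total_loss(list(losses.values()))
```

More elaborate weightings are possible, but as stated in the talk the plain average worked best among the alternatives tried.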
0:33:18 | once we have trained
---|
0:33:19 | our architecture, without using
---|
0:33:22 | any labels,
---|
0:33:23 | we can go to step two, which is supervised fine-tuning.
---|
0:33:28 | in this case
---|
0:33:29 | we get rid of all the workers and
---|
0:33:31 | plug our encoder into
---|
0:33:34 | a supervised classifier, which is trained with a little
---|
0:33:37 | amount of supervised data.
---|
0:33:41 | actually, here there are a couple of possible approaches. approach number one
---|
0:33:47 | is to use
---|
0:33:48 | PASE as a standard
---|
0:33:50 | feature extractor; in this case we
---|
0:33:53 | freeze
---|
0:33:53 | PASE during the supervised fine-tuning phase.
---|
0:33:57 | another approach is
---|
0:33:59 | to just pre-train the encoder with the unsupervised
---|
0:34:02 | parameters
---|
0:34:03 | and fine-tune it
---|
0:34:05 | during
---|
0:34:06 | the
---|
0:34:08 | supervised fine-tuning phase. of these two approaches, the latter usually is
---|
0:34:14 | the best-performing one.
---|
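The two supervised strategies just mentioned, frozen feature extractor versus joint fine-tuning, can be sketched like this. The classes and parameter names are illustrative stand-ins, not the real PASE code.

```python
# Toy sketch of the two supervised strategies:
# (1) freeze the pre-trained encoder and train only the classifier,
# (2) keep the pre-trained weights as initialization and fine-tune all.
# The classes below are illustrative stand-ins, not the real PASE code.

class Module:
    def __init__(self, params):
        self.params = dict(params)   # name -> weight value
        self.trainable = True

def trainable_params(modules):
    """Collect the parameters the optimizer is allowed to update."""
    out = {}
    for m in modules:
        if m.trainable:
            out.update(m.params)
    return out

encoder = Module({"enc.w": 0.1})     # pretend these were pre-trained
classifier = Module({"clf.w": 0.0})  # randomly initialized on top

# Strategy 1: frozen feature extractor.
encoder.trainable = False
assert list(trainable_params([encoder, classifier])) == ["clf.w"]

# Strategy 2: joint fine-tuning (usually the better-performing option).
encoder.trainable = True
assert sorted(trainable_params([encoder, classifier])) == ["clf.w", "enc.w"]
```

With a real framework the same switch is typically a per-parameter trainability flag; the point here is only that the optimizer sees a different parameter set in the two strategies.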
0:34:17 | it is very important
---|
0:34:19 | to remark
---|
0:34:20 | that
---|
0:34:21 | step number one, the unsupervised training,
---|
0:34:24 | actually
---|
0:34:25 | should be done only once.
---|
0:34:27 | in fact, we have seen that
---|
0:34:29 | there is a single pre-trained PASE
---|
0:34:32 | that generalizes and can be used for a large variety
---|
0:34:37 | of speech tasks, like
---|
0:34:39 | speech recognition, speaker recognition, speech enhancement,
---|
0:34:43 | and many others.
---|
0:34:45 | and if you don't even want to
---|
0:34:47 | train it by yourself,
---|
0:34:49 | it's a self-supervised extractor, so you can use
---|
0:34:52 | the pre-trained
---|
0:34:54 | parameters that we share
---|
0:34:55 | along with the paper we proposed.
---|
0:35:00 | well, this is not all about PASE.
---|
0:35:04 | in fact,
---|
0:35:05 | inspired by the good results achieved with the original version,
---|
0:35:10 | we decided
---|
0:35:11 | to
---|
0:35:13 | spend some time to further
---|
0:35:15 | revise the architecture and improve it,
---|
0:35:18 | and we took the opportunity of the JSALT 2019 workshop
---|
0:35:23 | organized by the Johns Hopkins University to set up a team
---|
0:35:27 | working on improving
---|
0:35:29 | PASE,
---|
0:35:30 | and as a result we came up with a new architecture, called
---|
0:35:34 | PASE+, where we introduced
---|
0:35:37 | different types of improvements.
---|
0:35:41 | first of all, we coupled
---|
0:35:43 | PASE with on-the-fly data augmentation.
---|
0:35:47 | here we use speech contamination techniques, like adding noise and reverberation,
---|
0:35:53 | but we also add
---|
0:35:55 | some random zeros in the time-domain waveform, and we also filter the speech
---|
0:36:00 | data with some random band-stop filters, in order
---|
0:36:05 | to add
---|
0:36:07 | zeros
---|
0:36:07 | in the frequency domain.
---|
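A toy sketch of this on-the-fly contamination, using a plain list of samples as the waveform. Real PASE+ also uses recorded noises, reverberation, and frequency-domain band-stop filtering; here only additive noise and random time-domain zeroing are sketched, and the parameter values are hypothetical.

```python
import random

# Toy sketch of on-the-fly contamination: each call corrupts the clean
# waveform differently (random noise plus a random zeroed-out span).
# Parameter values are illustrative, not taken from the PASE+ recipe.

def augment(clean, zero_span=4, noise_amp=0.05, rng=random):
    noisy = [s + rng.uniform(-noise_amp, noise_amp) for s in clean]
    start = rng.randrange(0, len(noisy) - zero_span)  # random zero span
    for i in range(start, start + zero_span):
        noisy[i] = 0.0
    return noisy

clean = [0.1 * i for i in range(50)]   # toy "clean" waveform
corrupted = augment(clean)

# The workers' targets stay tied to the clean signal, so the encoder is
# implicitly asked to denoise: input = corrupted, targets from clean.
assert len(corrupted) == len(clean)
assert corrupted.count(0.0) >= 4       # the zeroed-out span
```

Because the augmentation is re-drawn on every call, the network never sees the same corrupted sentence twice, which is the point made in the talk.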
0:36:10 | this turns out to be very important, because
---|
0:36:13 | it gives the system some kind of robustness against noise,
---|
0:36:19 | reverberation, and other environmental artifacts.
---|
0:36:23 | a nice thing is that,
---|
0:36:24 | since everything is done on the fly,
---|
0:36:26 | every time we contaminate the sentences with a different distortion.
---|
0:36:32 | and also,
---|
0:36:33 | the workers are based on the clean signal:
---|
0:36:37 | their labels are extracted from the clean version of the signal, so we
---|
0:36:42 | implicitly ask,
---|
0:36:43 | this way,
---|
0:36:44 | our system to
---|
0:36:46 | perform some kind of
---|
0:36:47 | denoising.
---|
0:36:50 | and then we also have a more robust encoder.
---|
0:36:53 | we still have the SincNet layers, but we have also added a
---|
0:36:58 | recurrent neural network, which is
---|
0:37:01 | an efficient way to introduce some kind of longer-term context,
---|
0:37:05 | and we also added
---|
0:37:08 | some skip connections that help the gradient backpropagation.
---|
0:37:14 | then, we have improved a lot the workers.
---|
0:37:17 | we have noticed that
---|
0:37:19 | the more workers,
---|
0:37:22 | the better it is,
---|
0:37:25 | and yes,
---|
0:37:26 | we definitely have introduced
---|
0:37:30 | a lot of workers that estimate, for instance, new types of features over
---|
0:37:34 | different
---|
0:37:36 | context windows, et cetera. overall,
---|
0:37:39 | we were able to improve a lot the performance
---|
0:37:42 | of the system on different speech tasks.
---|
0:37:46 | what does PASE learn?
---|
0:37:48 | here we show some kind of t-SNE plot,
---|
0:37:51 | and, as you can see
---|
0:37:53 | here,
---|
0:37:55 | PASE embeddings capture pretty well the speaker identities, and you can
---|
0:38:00 | clearly recognise
---|
0:38:01 | that
---|
0:38:03 | there are pretty well-defined clusters
---|
0:38:06 | for
---|
0:38:08 | the speakers.
---|
0:38:11 | here, instead, we show some other
---|
0:38:14 | of
---|
0:38:17 | these plots,
---|
0:38:18 | for phonemes,
---|
0:38:19 | and you can see here that
---|
0:38:21 | not everything is clustered well; to be fair,
---|
0:38:24 | you have some phonemes
---|
0:38:26 | that are
---|
0:38:27 | not well separated,
---|
0:38:27 | right,
---|
0:38:29 | but you can also detect some phonemes which form
---|
0:38:32 | pretty clear clusters, meaning that
---|
0:38:34 | we are actually learning
---|
0:38:36 | some kind of phonetic
---|
0:38:38 | representation,
---|
0:38:39 | even
---|
0:38:40 | without
---|
0:38:41 | any
---|
0:38:41 | supervised labels.
---|
0:38:45 | okay, we produced these plots for different
---|
0:38:49 | speech tasks, and you can refer to the paper to see all the results,
---|
0:38:55 | but here, briefly, let's just discuss some of the numbers that we have achieved
---|
0:38:59 | on noisy ASR tasks, to highlight
---|
0:39:03 | a little bit the robustness
---|
0:39:06 | of the proposed approach.
---|
0:39:10 | furthermore, let me say that we have pre-trained
---|
0:39:12 | our
---|
0:39:13 | PASE on LibriSpeech
---|
0:39:16 | without using the labels, and,
---|
0:39:18 | very interestingly,
---|
0:39:20 | we have noticed that we don't need
---|
0:39:22 | a lot of data to train PASE: we just need
---|
0:39:26 | something like fifty or one hundred hours of LibriSpeech,
---|
0:39:29 | and these are enough to
---|
0:39:31 | generate good representations.
---|
0:39:36 | this is quite interesting, because
---|
0:39:38 | usually standard self-supervised approaches rely on a lot, a lot of data.
---|
0:39:43 | in our case, we think that
---|
0:39:45 | somehow we are more data-efficient, because we employ a lot of workers
---|
0:39:51 | trying to extract a lot of information
---|
0:39:54 | from our speech signals.
---|
0:39:58 | on the left, you can see the results when we train on DIRHA. DIRHA
---|
0:40:03 | is a challenging task, characterised by speech recorded in a domestic environment
---|
0:40:09 | and corrupted by noise and reverberation.
---|
0:40:12 | you can see here
---|
0:40:14 | that PASE is able to outperform
---|
0:40:18 | traditional features, and also combinations of traditional speech features.
---|
0:40:23 | on the right, you can see the results on CHiME-5.
---|
0:40:27 | CHiME-5
---|
0:40:28 | probably is the most challenging
---|
0:40:31 | ASR task available,
---|
0:40:32 | where the desired speech is corrupted by a lot of noise and reverberation, and you also have
---|
0:40:37 | a lot of other disturbances, such as overlapped speech.
---|
0:40:41 | and even in
---|
0:40:42 | this pretty challenging scenario, we are able
---|
0:40:45 | to slightly outperform
---|
0:40:48 | the standard baselines, based on other
---|
0:40:51 | features,
---|
0:40:53 | of the current literature.
---|
0:40:58 | actually, the representations learned with
---|
0:41:02 | PASE
---|
0:41:03 | are quite general, robust, and transferable,
---|
0:41:06 | and we have successfully applied
---|
0:41:09 | them to different tasks.
---|
0:41:10 | until now we have seen speech recognition, but you can use it
---|
0:41:14 | for speaker recognition,
---|
0:41:16 | for speech enhancement,
---|
0:41:17 | for emotion recognition, and I'm also aware of some works trying to
---|
0:41:23 | use
---|
0:41:24 | PASE for transfer learning across languages: you train PASE, for instance, on
---|
0:41:32 | English, and you do the task in another language, and it seems to
---|
0:41:36 | show some kind of surprising robustness in this
---|
0:41:39 | transfer setting.
---|
0:41:42 | you can find the code and the pre-trained models
---|
0:41:45 | on GitHub, and I encourage you to
---|
0:41:47 | go there and play with PASE as well.
---|
0:41:53 | but let me conclude this part with some thoughts on self-supervised learning and the role
---|
0:41:59 | that it can play
---|
0:42:01 | in the future.
---|
0:42:04 | as I mentioned in the first part of the presentation, I think that the key
---|
0:42:10 | to building intelligent machines is the combination of different learning modalities:
---|
0:42:15 | we can combine supervised learning
---|
0:42:17 | with unsupervised learning, imitation learning, reinforcement learning, contrastive learning, and so on.
---|
0:42:25 | so I think there is a huge space here for a research direction where we
---|
0:42:30 | basically
---|
0:42:33 | combine,
---|
0:42:34 | in a simple and elegant way,
---|
0:42:36 | different
---|
0:42:38 | learning modalities, and
---|
0:42:40 | one of them
---|
0:42:43 | could be
---|
0:42:45 | self-supervised learning, but not only.
---|
0:42:49 | this is
---|
0:42:49 | very important these days, because
---|
0:42:53 | standard supervised learning is the dominant approach, but we are starting to see
---|
0:42:58 | some kind of limitations, and these limitations might become even clearer
---|
0:43:03 | in the next
---|
0:43:04 | years: supervised learning is demanding too much data and too much computation for
---|
0:43:09 | learning,
---|
0:43:09 | and if we keep going in this direction,
---|
0:43:11 | only a few companies in the world will be able
---|
0:43:15 | to train state-of-the-art systems.
---|
0:43:18 | and I think that combining different learning modalities is a solution,
---|
0:43:23 | and especially self-supervised learning is valuable because, as we have seen
---|
0:43:29 | in this presentation,
---|
0:43:30 | self-supervised learning can
---|
0:43:32 | be extremely useful in the transfer learning scenario.
---|
0:43:36 | with self-supervised learning we have a chance to learn a representation which is
---|
0:43:42 | general enough
---|
0:43:44 | that it can be used
---|
0:43:46 | for several downstream tasks,
---|
0:43:50 | and this is
---|
0:43:52 | a really big advantage
---|
0:43:55 | in terms of computational complexity and costs.
---|
0:43:59 | so I think
---|
0:44:01 | the future paradigm
---|
0:44:03 | will be, in my opinion, similar to the first popular approach of deep
---|
0:44:10 | learning, where we were
---|
0:44:12 | able to initialize
---|
0:44:15 | neural networks
---|
0:44:16 | using
---|
0:44:18 | unsupervised learning approaches,
---|
0:44:22 | and then fine-tune the network on the task we needed.
---|
0:44:25 | I think this
---|
0:44:26 | could be
---|
0:44:29 | pretty much
---|
0:44:31 | the future paradigm for speech, where
---|
0:44:33 | PASE, or similar models based on transfer learning, will play
---|
0:44:36 | a major
---|
0:44:38 | role in the pipeline.
---|
0:44:40 | and yes,
---|
0:44:42 | this is somewhat similar to what we have done in the past; the difference is that
---|
0:44:46 | in the first systems, what we were using for unsupervised or semi-supervised learning was
---|
0:44:52 | based on restricted Boltzmann machines,
---|
0:44:54 | while right now we are using
---|
0:44:56 | much more sophisticated techniques.
---|
0:44:58 | but the idea is the same, and it
---|
0:45:01 | could be
---|
0:45:03 | quite important in speech processing and, more in general,
---|
0:45:07 | in machine learning in the near future.
---|
0:45:11 | if you're interested in this topic, and you would like to read
---|
0:45:15 | more on self-supervised
---|
0:45:17 | learning applied to speech, you can take a look
---|
0:45:19 | at the ICML workshop
---|
0:45:22 | on self-supervised learning in audio and speech that we have
---|
0:45:26 | recently
---|
0:45:27 | organized.
---|
0:45:28 | you can go to the website, see all the presentations, and read all the papers,
---|
0:45:34 | which I think is
---|
0:45:35 | a kind of interesting initiative.
---|
0:45:37 | and let me also highlight
---|
0:45:39 | that there will be
---|
0:45:41 | a similar initiative
---|
0:45:42 | in the future, and
---|
0:45:48 | I encourage you also to participate
---|
0:45:49 | in that.
---|
0:45:53 | alright, since I have a few more minutes,
---|
0:45:56 | I'm very happy to update you
---|
0:45:59 | on another very exciting project I'm leading these days, which is called
---|
0:46:04 | SpeechBrain.
---|
0:46:06 | SpeechBrain will be an open-source toolkit
---|
0:46:10 | entirely based on PyTorch,
---|
0:46:12 | and our goal
---|
0:46:13 | is to build a toolkit that can significantly speed up
---|
0:46:16 | research and development of speech and audio processing techniques.
---|
0:46:22 | so we are building a
---|
0:46:23 | toolkit which will be efficient, flexible,
---|
0:46:27 | and, moreover, very important, easy to use.
---|
0:46:34 | the main difference with the other existing toolkits is that SpeechBrain is specifically designed to
---|
0:46:41 | address
---|
0:46:42 | multiple speech tasks
---|
0:46:44 | at the same time.
---|
0:46:46 | we designed SpeechBrain to solve, for instance, speech recognition, speech enhancement, speech separation, speaker recognition, emotion recognition, multi-microphone
---|
0:46:55 | signal processing, speaker diarization,
---|
0:46:58 | and many other things.
---|
0:47:00 | so,
---|
0:47:00 | typically, all these tasks share the same underlying technology, which is deep learning,
---|
0:47:08 | and therefore there is
---|
0:47:10 | no real reason why we need different repositories for
---|
0:47:16 | different kinds of speech applications. so what we want
---|
0:47:20 | is something like our brain:
---|
0:47:22 | we want a single toolkit that is able
---|
0:47:24 | to process several speech applications at the same time.
---|
0:47:32 | the main issue with the other toolkits,
---|
0:47:34 | with most of them, is that they
---|
0:47:37 | are really designed for a single task.
---|
0:47:40 | for instance, you can use Kaldi for speech recognition, and
---|
0:47:44 | Kaldi is
---|
0:47:47 | built
---|
0:47:49 | with the idea of creating something that can be extremely good at
---|
0:47:54 | doing speech recognition,
---|
0:47:56 | and indeed, yes,
---|
0:47:59 | it is very good. or toolkits for
---|
0:48:00 | speaker recognition.
---|
0:48:02 | but I think
---|
0:48:04 | a toolkit that is explicitly designed to
---|
0:48:07 | address
---|
0:48:08 | different tasks still does not exist.
---|
0:48:12 | and people, when they have to implement complex pipelines involving
---|
0:48:18 | different technologies, like speech enhancement plus
---|
0:48:22 | speech recognition,
---|
0:48:23 | or
---|
0:48:24 | speech recognition plus speaker recognition,
---|
0:48:26 | are forced to jump from one toolkit to another.
---|
0:48:32 | and of course, jumping from one toolkit to another is very demanding: there can be
---|
0:48:38 | different programming languages, different coding styles, et cetera.
---|
0:48:44 | and
---|
0:48:45 | another issue is that,
---|
0:48:48 | if we have different toolkits, it is very hard to combine the systems together and train them
---|
0:48:54 | in a single, fully differentiable system, which is
---|
0:48:57 | a very important use case, we believe.
---|
0:49:00 | so we are actually working on that, and we are trying to design SpeechBrain
---|
0:49:07 | in a way that will allow users to
---|
0:49:12 | actually couple the different
---|
0:49:14 | speech components
---|
0:49:16 | in an easy way.
---|
0:49:19 | what about the timeline? actually, we have worked a lot this year on it;
---|
0:49:23 | we have a team,
---|
0:49:25 | a lot of people working on it, and a lot of interest,
---|
0:49:29 | and we are very close to a first release, which
---|
0:49:33 | will happen, we estimate, within a couple of months. so I strongly encourage you to
---|
0:49:40 | stay tuned, and
---|
0:49:43 | to try
---|
0:49:45 | SpeechBrain
---|
0:49:46 | in the future, and give us your feedback.
---|
0:49:51 | speaking of
---|
0:49:52 | how the project is backed and who the people behind it are,
---|
0:49:57 | we have more than
---|
0:50:00 | twenty developers, as well as several sponsors, including industrial
---|
0:50:08 | ones, so the project is getting bigger, and we hope to have also the
---|
0:50:13 | support
---|
0:50:15 | of the whole speech community.
---|
0:50:18 | finally, let me conclude
---|
0:50:21 | by saying a big thanks to my
---|
0:50:23 | collaborators,
---|
0:50:24 | the people here
---|
0:50:28 | in this slide that are working on it,
---|
0:50:32 | and all the others whose work made this possible.
---|
0:50:36 | and here you can see
---|
0:50:39 | the team that is currently working on SpeechBrain, and I'd really like to thank them, because
---|
0:50:45 | I think together we are working very well, and
---|
0:50:52 | well, soon you'll see the result of our hard work.
---|
0:50:57 | thank you very much |
---|
0:50:58 | for everything |
---|
0:51:00 | and I'm very happy now to reply to your questions.
---|
0:51:15 | many thanks, Mirco, for an excellent presentation.
---|
0:51:20 | I already have a
---|
0:51:21 | set of questions for you.
---|
0:51:25 | so, questions arrived using both the Q&A window and the chat. the first
---|
0:51:31 | question is from the Q&A window,
---|
0:51:38 | and the question I will relay to you is
---|
0:51:43 | whether this approach is less computationally demanding than supervised training
---|
0:51:50 | alone.
---|
0:51:52 | actually, that's an excellent question, and
---|
0:51:55 | I think I can take this opportunity to clarify a little bit.
---|
0:52:01 | there are a couple of things to consider.
---|
0:52:05 | first of all, with PASE
---|
0:52:08 | we're trying to learn not a task-specific representation, but a general representation.
---|
0:52:15 | this means that you can train your self-supervised network just
---|
0:52:21 | once, right, and then you can use just a little amount of supervised data to
---|
0:52:27 | train the system.
---|
0:52:29 | so this naturally leads to computational advantages, because you have to train
---|
0:52:36 | the big thing only once.
---|
0:52:38 | and then,
---|
0:52:42 | when you have
---|
0:52:44 | to compare
---|
0:52:45 | this with standard supervised learning, usually,
---|
0:52:51 | if you have a good representation, the supervised learning part is going to be much
---|
0:52:57 | easier.
---|
0:52:59 | and the other, I think, good thing about PASE,
---|
0:53:03 | which I didn't remark on too much in the presentation, but it is better to remark on
---|
0:53:07 | here a little bit,
---|
0:53:10 | is that PASE is pretty data-efficient, right?
---|
0:53:13 | we found very good results even just using something like fifty hours of speech, so
---|
0:53:19 | very little compared to
---|
0:53:21 | what we see these days,
---|
0:53:24 | even in self-supervised learning, where people are using thousands and thousands of hours
---|
0:53:27 | of speech.
---|
0:53:29 | and we are data-efficient because, with the multiple workers,
---|
0:53:35 | somehow we try to extract as much as possible of the information from the signal;
---|
0:53:41 | we are trying to do our best to be also data-efficient and extract everything we
---|
0:53:47 | can from the signal.
---|
0:53:50 | so there are two things here:
---|
0:53:53 | the data efficiency,
---|
0:53:55 | and the fact that we are learning a general representation, right? so you
---|
0:53:59 | can train PASE only one time and use it for multiple tasks. and then, also,
---|
0:54:04 | the data-efficiency part, that allows you to
---|
0:54:08 | learn reasonable representations
---|
0:54:10 | even with
---|
0:54:12 | a relatively small amount of unlabeled data.
---|
0:54:17 | okay, Mirco, do you have other
---|
0:54:22 | comments on that part?
---|
0:54:33 | so,
---|
0:54:34 | okay,
---|
0:54:36 | my wifi is very bad, so if I miss part of a question, it is on my side; anyway, I will
---|
0:54:41 | try my best.
---|
0:54:43 | I have another question, from the Q&A: could you
---|
0:54:48 | comment on how robust this self-supervised learning is in non-ideal conditions?
---|
0:54:56 | actually, we increased a lot the robustness of PASE when we revised
---|
0:55:03 | it with PASE+.
---|
0:55:06 | and as I mentioned before, in PASE+ we basically combine self-supervised learning with
---|
0:55:16 | on-the-fly data augmentation.
---|
0:55:19 | on-the-fly means that, every time we have a new sentence, we
---|
0:55:23 | contaminate it with a different sequence of noise and a different reverberation, such that the system
---|
0:55:30 | every time looks at a different
---|
0:55:33 | sentence
---|
0:55:34 | with a different contamination. and on the output side,
---|
0:55:39 | our workers are not extracting their labels from the noisy signal, but from the
---|
0:55:45 | original clean one.
---|
0:55:47 | so, somehow, our system is forced
---|
0:55:52 | to
---|
0:55:53 | denoise the features,
---|
0:55:55 | and this leads to the robustness we have seen before. we actually tried
---|
0:56:01 | it on challenging tasks, like the DIRHA and CHiME data, and we really appreciated
---|
0:56:10 | this increased robustness over standard approaches.
---|
0:56:18 | good, thank you, that was really clear.
---|
0:56:22 | okay,
---|
0:56:24 | let's move to some other questions.
---|
0:56:26 | there is also a question about the competition between the workers in PASE,
---|
0:56:34 | and in particular about whether conflicts are possible between them, for instance when
---|
0:56:40 | LIM and GIM could consider some segments
---|
0:56:43 | from the same utterance,
---|
0:56:45 | one as a positive example and the other as a negative
---|
0:56:50 | sample.
---|
0:56:51 | the attendee asks what you expect PASE to be able to learn in this case.
---|
0:56:57 | actually, the set of workers that we tried is not random, right? we took the
---|
0:57:03 | opportunity of the JSALT workshop, for instance, to do a lot, a lot of experiments,
---|
0:57:08 | and we just came out with a set of workers, the subset of workers,
---|
0:57:14 | the subset of ideas,
---|
0:57:16 | that actually works for us.
---|
0:57:19 | so, actually, one of our concerns was: okay, how is it possible to put together
---|
0:57:27 | regression tasks, which are based on the mean squared error loss, with
---|
0:57:33 | binary tasks, which are based on other kinds of losses, like binary cross-entropy?
---|
0:57:38 | how can we learn these things together? we thought that there would be
---|
0:57:45 | a big issue, but we realised that actually there is not, just by doing experiments, doing
---|
0:57:50 | some kind of ablation of the workers. and we noticed that the more workers we put
---|
0:57:54 | together, the better it is.
---|
0:57:58 | and the same holds for LIM and GIM,
---|
0:58:02 | which are actually different, because LIM is based on small,
---|
0:58:09 | small chunks of speech,
---|
0:58:11 | and with that the worker will hopefully learn local, phonetic-level information,
---|
0:58:16 | while GIM plays the same game but with larger chunks
---|
0:58:21 | of one second, one second and a half,
---|
0:58:26 | and with that it will hopefully learn
---|
0:58:29 | higher-level representations. so we found that
---|
0:58:35 | the two, at the same time, worked very well, even though
---|
0:58:39 | they are clearly correlated tasks, right?
---|
0:58:46 | a new question is coming from the chat,
---|
0:58:50 | and the question is about the way you are providing inputs to
---|
0:58:56 | PASE.
---|
0:58:58 | the attendee is thinking about LIM, but PASE is not explicitly handling within-speaker
---|
0:59:04 | variability,
---|
0:59:06 | so none of the tasks is forcing embeddings from different utterances of the same
---|
0:59:10 | speaker to be close. do you think this would work,
---|
0:59:13 | and do you foresee any problem in adding some supervised tasks for those scenarios
---|
0:59:19 | where you have labels?
---|
0:59:24 | well, first of all, including supervised tasks totally makes sense; honestly, one can play
---|
0:59:32 | with that, and
---|
0:59:34 | mix in supervised, or of course semi-supervised in this case, tasks, and I
---|
0:59:39 | think people are already doing it; I saw some recent papers
---|
0:59:45 | that were actually trying to do that.
---|
0:59:48 | in this paper, for PASE, we preferred to stay
---|
0:59:52 | on the self-supervised side, only to make sure, to actually check,
---|
0:59:58 | what we are able to learn with something that is a pure
---|
1:00:02 | self-supervised learning approach.
---|
1:00:06 | as for PASE for speaker recognition, and the within-speaker variability: yes, PASE is
---|
1:00:14 | not specifically designed for that,
---|
1:00:18 | so it is not optimal, but we anyway learn some kind of
---|
1:00:25 | speaker identity.
---|
1:00:29 | actually,
---|
1:00:34 | we didn't test it too much, but we are confident that what we can learn
---|
1:00:39 | can be quite competitive with standard systems. actually, maybe we would have
---|
1:00:45 | to revise a little bit the architecture for the speaker recognition application, because these days,
---|
1:00:52 | also here,
---|
1:00:54 | the numbers are impressive in terms of equal error rate on VoxCeleb,
---|
1:00:58 | but
---|
1:00:59 | the same idea, I mean, could be, I think, extended and
---|
1:01:05 | redesigned to specifically learn better speaker embeddings. actually, our main target
---|
1:01:15 | was more general, so we wanted
---|
1:01:18 | to learn a pretty general representation and see if it somehow works
---|
1:01:25 | reasonably well for multiple tasks.
---|
1:01:29 | thank you. this fits nicely with the next question, coming from the
---|
1:01:34 | chat,
---|
1:01:36 | which asks
---|
1:01:38 | if you can comment a bit more on the fact that your system has no means
---|
1:01:43 | to disentangle speaker and session information,
---|
1:01:50 | as you are using positive examples coming only from within a single
---|
1:01:56 | utterance.
---|
1:01:58 | actually, what we do is the on-the-fly augmentation I showed a moment
---|
1:02:05 | ago, right?
---|
1:02:06 | so, if we have sentence one,
---|
1:02:10 | one time sentence one is
---|
1:02:13 | contaminated with some kind of channel effect, some kind of reverberation effect; the next time
---|
1:02:18 | it is contaminated with another one. so maybe with this approach we limit a
---|
1:02:25 | little bit this effect, but
---|
1:02:29 | there might be this issue, it's true.
---|
1:02:34 | do you mean, do you think that the augmentation you use would tackle the
---|
1:02:39 | session variability problem by itself?
---|
1:02:43 | well, maybe not tackling the full problem, but at least
---|
1:02:50 | mitigating it, or
---|
1:02:51 | reducing it, right?
---|
1:02:54 | I think, on the other hand, we don't have many alternatives, because
---|
1:02:57 | we would like to stay in the
---|
1:03:00 | self-supervised domain, right? so we don't have speaker labels, so we cannot say, okay,
---|
1:03:05 | let's jump to another signal from the same speaker, because in that case
---|
1:03:10 | we would have
---|
1:03:11 | used
---|
1:03:13 | the labels. so
---|
1:03:14 | the best we can do is to contaminate the sentence twice,
---|
1:03:17 | I mean,
---|
1:03:18 | change a little bit the conditions, the reverberation and noise effects, and
---|
1:03:22 | hope
---|
1:03:24 | to learn more the speaker and less the channel.
---|
1:03:29 | fine. we move to a question from the Q&A.
---|
1:03:34 | it says that the model can be used from two perspectives: embedding extraction, and model
---|
1:03:41 | pre-training.
---|
1:03:44 | both of these should be effective,
---|
1:03:47 | but which one may be better for speaker verification? so,
---|
1:03:53 | hang on,
---|
1:03:55 | let me take a look at it again.
---|
1:03:57 | mm.
---|
1:04:04 | okay.
---|
1:04:06 | I think PASE could be used in both ways, right? it can be used for
---|
1:04:13 | feature extraction, or embedding extraction, or for, basically, pre-training.
---|
1:04:21 | my experience is that
---|
1:04:25 | it works very well in a pre-training scenario, so it is designed basically to
---|
1:04:30 | let you
---|
1:04:33 | pre-train your network with the self-supervised tasks, and then
---|
1:04:38 | fine-tune it with the small supervised dataset.
---|
1:04:44 | this is
---|
1:04:46 | basically the main application we have in mind for PASE, but
---|
1:04:51 | we also tried it as a standard feature extractor
---|
1:04:55 | or embedding extractor,
---|
1:04:57 | not for speaker recognition, but for speech recognition,
---|
1:05:02 | and it works quite well. so, if you freeze the encoder, right, and you plug
---|
1:05:06 | just the features that you have there into a supervised classifier, it works well, but
---|
1:05:11 | it works better if you jointly fine-tune the encoder and the classifier during the
---|
1:05:18 | supervised phase.
---|
1:05:21 | thank you. we come back to the chat again, with a question about
---|
1:05:25 | the
---|
1:05:27 | temporal
---|
1:05:29 | sequence worker. the attendee asks if you can elaborate more, not on the other workers, but focusing
---|
1:05:36 | on the sequence predictive coding worker.
---|
1:05:38 | this is motivated by the fact that,
---|
1:05:40 | maybe, in some cases, the segment from the future would reasonably contain the same
---|
1:05:47 | thing. then,
---|
1:05:49 | so,
---|
1:05:51 | do you see some problems with this worker? any comments?
---|
1:05:55 | definitely, that's a very nice question. actually, the SPC worker
---|
1:06:02 | is the one that has the biggest impact on the performance.
---|
1:06:07 | so, as I mentioned, with a lot of ablation studies we tried to figure
---|
1:06:12 | out the effect of each task, and which tasks were working well; some improved things,
---|
1:06:19 | but less than others. some workers were more important, like the regressors
---|
1:06:24 | and LIM and GIM.
---|
1:06:27 | and,
---|
1:06:29 | actually, there is an important risk here: when you build a sample
---|
1:06:33 | from the past and a sample from the future, you have to make sure
---|
1:06:38 | you are not sampling within the receptive field of your convolutional neural networks,
---|
1:06:43 | otherwise the task becomes
---|
1:06:45 | too easy.
---|
1:06:47 | so what we have done is to make sure that the future sample
---|
1:06:52 | is not too close, right, to the anchor one, and not
---|
1:06:58 | too far. because if it is too close, the risk is to learn nothing, basically;
---|
1:07:03 | if it is too far,
---|
1:07:06 | the risk is that there isn't any reasonable correlation anymore between the two.
---|
1:07:12 | so it's not easy to design this task,
---|
1:07:17 | we did it |
---|
1:07:18 | in a way that |
---|
1:07:20 | we were able to sample the past and the future representations within some reasonable range |
---|
1:07:27 | it could be interesting to explore other ranges as well i believe |
---|
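The not-too-close, not-too-far sampling constraint described in the answer above can be sketched in a few lines. This is only an illustrative reconstruction, assuming frame-indexed representations; the function name and parameters are hypothetical, not taken from any actual worker implementation:

```python
import random

def sample_future_index(anchor, n_frames, receptive_field, max_gap, rng=random):
    # Frames closer than the encoder's receptive field overlap with the
    # anchor's view, making the prediction task trivially easy.
    lo = anchor + receptive_field
    # Frames farther than max_gap are barely correlated with the anchor,
    # so there is nothing meaningful left to predict.
    hi = min(anchor + max_gap, n_frames - 1)
    if lo > hi:
        raise ValueError("no valid future frame in the requested range")
    return rng.randint(lo, hi)
```

For example, `sample_future_index(100, 1000, receptive_field=15, max_gap=80)` draws an index in the band [115, 180]: outside the receptive field, but still within a correlated horizon.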
1:07:33 | ok we move to another question from the same person who also asked |
---|
1:07:40 | what do the filters that you learn look like |
---|
1:07:43 | for speakers for extracting speaker-specific information |
---|
1:07:50 | well in this paper the PASE paper actually it is not only about |
---|
1:07:57 | speaker recognition so the filters that we learn are actually not that |
---|
1:08:04 | far away from the standard |
---|
1:08:07 | mel filters we basically tend to allocate |
---|
1:08:10 | more filters in the lower part of the spectrum and fewer filters in |
---|
1:08:16 | the higher part of the spectrum |
---|
1:08:18 | with LIM |
---|
1:08:20 | local info max the technique that was designed basically to work only for speaker recognition the |
---|
1:08:27 | filters that we learn instead show some areas with more filters in the regions |
---|
1:08:34 | that are more important for the speaker like the pitch and the formants |
---|
1:08:37 | so similar to what we have seen in |
---|
1:08:41 | in |
---|
1:08:42 | using SincNet |
---|
1:08:44 | with the supervised approach |
---|
1:08:48 | but with PASE we do not allocate |
---|
1:08:52 | more filters in those specific regions we are more or less the same as the |
---|
1:08:57 | standard mel filter scale |
---|
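To illustrate the claim that a mel-like allocation concentrates filters at low frequencies, here is a small sketch using the common 2595·log10(1 + f/700) mel approximation; the helper names are my own, not from the paper:

```python
import math

def hz_to_mel(f):
    # common mel-scale approximation
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(n_filters, f_min=0.0, f_max=8000.0):
    # centers are equally spaced on the mel scale, hence crowded at low Hz
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_centers(40)
below_1k = sum(1 for c in centers if c < 1000.0)
above_4k = sum(1 for c in centers if c > 4000.0)
# noticeably more filter centers land below 1 kHz than above 4 kHz
```

With 40 filters over a 0–8 kHz band, more centers fall in the bottom 1 kHz than in the entire top 4 kHz, which is the "more filters in the lower part of the spectrum" behaviour mentioned above.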
1:09:04 | ok we have |
---|
1:09:05 | almost reached the conclusion i don't have more open questions from the audience i just |
---|
1:09:11 | have one if possible |
---|
1:09:12 | i would like you |
---|
1:09:15 | to explain a bit more your view about unsupervised training |
---|
1:09:21 | as opposed to supervised training |
---|
1:09:25 | it is my feeling that an issue a bias |
---|
1:09:30 | is more easy to find if you have |
---|
1:09:34 | supervised training because you have some information on the data some meta information |
---|
1:09:41 | and with unsupervised training it seems to me that |
---|
1:09:46 | you have less information but no particular reason to have less bias in |
---|
1:09:52 | the |
---|
1:09:53 | features |
---|
1:09:55 | okay |
---|
1:09:58 | the reason is that |
---|
1:10:00 | if you train your representation with supervised data your representation could be biased towards |
---|
1:10:07 | the task right for instance if you train |
---|
1:10:11 | a representation with speaker recognition your representation is not good for speech |
---|
1:10:19 | recognition because it is biased towards speaker recognition |
---|
1:10:24 | with self-supervised learning at least in the way we are trying to do |
---|
1:10:28 | it with multitask et cetera this risk is reduced because you have |
---|
1:10:32 | the same representation that is good for both |
---|
1:10:36 | speech recognition and speaker recognition |
---|
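The bias argument above, one shared representation receiving gradients from several workers at once, boils down to summing per-worker losses over the same encoder output. A toy sketch with made-up stand-in losses (these are not the actual workers from the paper):

```python
def total_loss(encode, workers, batch):
    # one shared representation feeds every worker, so no single task
    # can pull (bias) the encoder entirely towards itself
    reps = [encode(x) for x in batch]
    return sum(worker(reps, batch) for worker in workers.values())

# purely illustrative stand-ins for an encoder and two workers
encode = lambda x: 2.0 * x
workers = {
    "regressor": lambda reps, xs: sum((r - x) ** 2 for r, x in zip(reps, xs)) / len(xs),
    "magnitude": lambda reps, xs: sum(abs(r) for r in reps) / len(reps),
}
loss = total_loss(encode, workers, [1.0, -1.0])  # 1.0 + 2.0 = 3.0
```

In a real multitask setup each worker would be a small head on top of the encoder, but the design point is the same: the encoder is optimised against the sum, so it must stay useful for every task simultaneously.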
1:10:41 | so |
---|
1:10:43 | i really want to thank you again and before |
---|
1:10:50 | closing the session i will give the microphone to |
---|
1:10:57 | the organizers |
---|
1:11:00 | who want to thank you as well |
---|
1:11:06 | thank you |
---|
1:11:09 | yes first of all thanks for the very nice overview of the topic and of the |
---|
1:11:15 | results presented in this session |
---|
1:11:17 | so do you think that |
---|
1:11:23 | [inaudible] |
---|
1:11:45 | and the second |
---|
1:11:51 | yes |
---|
1:11:52 | my best guess okay [inaudible] |
---|
1:12:03 | [inaudible] thanks for inviting me that was |
---|
1:12:09 | really great thank you |
---|
1:12:11 | okay let's thank the speaker together again |
---|
1:12:13 | [inaudible] |
---|
1:12:21 | and we will see you tomorrow at the same time |
---|
1:12:31 | definitely |
---|
1:12:33 | so see you |
---|
1:12:36 | bye for now |
---|