0:00:01 | Hello, and thank you for joining this tutorial. We are from the Nara Institute of Science and Technology.
0:00:13 | We will introduce some of our recent works on developing a machine speech chain that can listen while speaking.
0:00:21 | Rather than a tutorial on technical details, we will share our experiences in developing a machine-based interpreter: the problems we faced and the solutions we took.
0:00:32 | This is ongoing work, and we do not claim that the problem of machine interpretation is already solved.
0:00:41 | There are many topics to discuss, and we will not be able to cover every detail within an hour of tutorial.
0:00:47 | So if there are parts you would like to hear more about, please post your questions for the Q&A session.
0:00:57 | Okay, first let us discuss what an interpreter does.
0:01:02 | This is an example of a formal meeting between people speaking different languages: there are two main speakers and an interpreter.
0:01:11 | When one person speaks, say in Japanese, the interpreter translates the speech for the Mandarin speaker, and vice versa.
0:01:22 | So the aim is to construct a machine that can act as a proficient interpreter.
0:01:30 | Spoken language translation technology aims to mimic a human interpreter in converting speech from one language to another.
0:01:38 | This technology consists of automatic speech recognition (ASR), which transcribes the speech into text in the source language;
0:01:46 | machine translation (MT), which maps the text in the source language into the corresponding text in the target language;
0:01:53 | and text-to-speech synthesis (TTS), which generates a speech waveform based on the text in the target language.
0:02:00 | However, translating spoken language is an extremely complex task,
0:02:05 | and even with this pipeline, translation performance is still far from that of a professional interpreter.
0:02:13 | So first, let us briefly review each of the components,
0:02:17 | and then we will discuss how these systems still differ from a human interpreter,
0:02:23 | starting with the baselines and the technologies they are built on.
0:02:32 | The development of automatic speech recognition has enabled machines to recognize and transcribe human speech.
0:02:39 | Early approaches were based on template matching,
0:02:43 | a technique that can be seen in dynamic time warping (DTW),
0:02:47 | and the field then moved to a different approach, statistical modeling with hidden Markov models and Gaussian mixture models (HMM-GMM).
0:02:55 | This figure shows the generic structure of an HMM-GMM ASR framework.
0:03:00 | It commonly consists of three main components. The first is the acoustic model, which estimates the acoustic likelihood and is typically built from subword or phoneme-based HMMs.
0:03:11 | The second is the pronunciation lexicon, which describes the pronunciation of each word as a sequence of phones.
0:03:17 | The last one is the language model, which estimates the prior probability of a sequence of words.
0:03:24 | Finally, speech recognition decoding finds the best sequence of words according to the acoustic model, lexicon, and language model.
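As a reference, the decoding step just described is the standard noisy-channel search; this is the textbook formulation rather than a slide from the talk:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \; P(X \mid W)\, P(W)
```

where P(X|W) is supplied by the acoustic model and pronunciation lexicon, and P(W) by the language model.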
0:03:37 | The resurgence of deep learning has also taken place in ASR, in the course of which neural networks came to replace the classical components,
0:03:46 | with performance matching or surpassing the HMM-GMM baselines.
0:03:53 | For example, a hybrid HMM-DNN estimates the HMM posterior probabilities with a deep neural network.
0:03:59 | There is also CTC, or connectionist temporal classification,
0:04:05 | and sequence-to-sequence models with attention, such as Listen, Attend and Spell.
0:04:11 | An important benefit of end-to-end deep learning is that it folds many complicated components into a single model.
0:04:18 | The widely used measure of ASR performance is the word error rate (WER).
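WER counts the word-level substitutions, insertions, and deletions between hypothesis and reference, normalized by the reference length. A minimal sketch of the standard dynamic-programming computation (illustrative; any real toolkit has its own implementation):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate = (substitutions + insertions + deletions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words = 0.33
```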
0:04:26 | In the last two decades there has been significant improvement in ASR performance.
0:04:31 | This can be seen in the gradual decrease in word error rate:
0:04:36 | around 1993, the word error rate on difficult tasks was close to 100 percent,
0:04:41 | while recently IBM and Microsoft have shown that speech recognition can rival professional transcription, with word error rates around 5.5 percent on conversational benchmarks.
0:04:55 | Speech synthesis technology has gone through a similar gradual shift.
0:05:01 | The early foundations were rule-based systems such as formant synthesis.
0:05:07 | The field then moved to waveform concatenation, and to more flexible statistical parametric synthesis using hidden (semi-)Markov models.
0:05:19 | Recently, state-of-the-art TTS systems have been successfully constructed with neural networks,
0:05:27 | for example sequence-to-sequence models such as Tacotron, which work directly from character input,
0:05:34 | combined with neural vocoders such as WaveNet.
0:05:39 | The quality of TTS has also improved, approaching human-like speech.
0:05:44 | Let us listen to the latest examples.
0:05:56 | (audio sample plays)
0:05:59 | Notice that synthesized voices have become more human-like, as in the following speech samples.
0:06:05 | (audio sample plays)
0:06:13 | In addition, Google has demonstrated a dialog system whose synthesized speech is remarkably close to human speech.
0:06:20 | That system used a combination of synthesis techniques, controlling the intonation depending on the circumstance.
0:06:32 | In the demo, the system makes a phone call to schedule an appointment, across a whole conversation.
0:06:39 | Let us listen to the sample.
0:06:48 | (audio sample plays)
0:07:07 | Unfortunately it seems the audio cannot be played properly here; you can find the same samples on their website.
0:07:17 | Okay, so we have seen that both ASR and TTS have improved in quality, close to human parity.
0:07:24 | Have we solved all the problems?
0:07:26 | Consider how these models are trained,
0:07:30 | and how we can utilize them in a real-time system.
0:07:36 | Here is the same example as before: a meeting between two people speaking different languages.
0:07:42 | A machine interpreter performs an iterative process of interpretation:
0:07:47 | one participant speaks in one language,
0:07:50 | and the machine waits until the end of the sentence, translates it, and speaks to the other participant in the other language.
0:07:58 | This means the translation process cannot even start before the end of the sentence,
0:08:04 | and the machine does not have the ability to carry out these processes simultaneously: listening, translating, and speaking.
0:08:11 | So the challenges, from handling unseen speakers to operating in real time, are not yet overcome.
0:08:16 | The idea is to construct a machine that has the ability to listen while it is speaking, as well as the ability to perform recognition and synthesis incrementally.
0:08:29 | Let us first discuss problem one: how to develop a machine that can listen while speaking.
0:08:37 | Denes and Pinson described the basic mechanism of spoken communication, which is called the speech chain.
0:08:45 | The mechanism describes how a spoken message travels from the speaker's mind to the listener's mind.
0:08:52 | It consists of speech production, in which the speaker formulates a message in their mind and produces a sound wave,
0:08:59 | which conveys the speech waveform to the listener.
0:09:05 | Then the speech perception process happens in the listener's auditory system, and the listener perceives what was said.
0:09:14 | Critically, the speech chain is closed by a feedback loop: a side path from the speaker's mouth to the speaker's own ear.
0:09:24 | Speakers hear their own voice while talking and use it to monitor and adjust their articulation.
0:09:30 | That is how humans learn to talk: by coupling their articulation with listening.
0:09:36 | This auditory feedback is essential for speech acquisition.
0:09:48 | Children who are born deaf have difficulty producing clear speech,
0:09:54 | and even adults who are already proficient in a language can gradually lose clear speech articulation as a result of hearing impairment.
0:10:06 | The human brain thus performs a tight integration in speech processing:
0:10:11 | the auditory system is critically involved in the production of speech, and the motor system is critically involved in the perception of speech.
0:10:21 | Imaging studies support this coupling, showing, for example, motor responses when perceiving speech or a talking face, and auditory involvement during articulation.
0:10:36 | This means that speech perception and production are not separate, independent abilities.
0:10:45 | On the other hand, computers are also able to learn how to listen and how to speak.
0:10:52 | Via ASR, given speech, a machine learns how to listen and transcribe what people say,
0:11:02 | and via TTS, given text, it learns how to speak.
0:11:09 | But computers cannot hear their own voice:
0:11:12 | ASR and TTS are learned separately and independently,
0:11:17 | and therefore require large amounts of paired speech and text data for supervised training.
0:11:26 | So the question is: can we build a machine that can listen while speaking?
0:11:31 | Now let us discuss how to develop the machine speech chain framework.
0:11:38 | Our proposed approach is called the machine speech chain, based on deep learning.
0:11:42 | It is a closed-loop architecture that mimics human speech perception and production.
0:11:49 | The goal is a system that can not only listen or speak, but also listen while speaking.
0:11:56 | This is the standard ASR and TTS framework, in which the two models are trained independently.
0:12:03 | As mentioned before, via ASR the machine learns how to listen, and via TTS it learns how to speak.
0:12:12 | Now, here is the machine speech chain framework:
0:12:15 | we add a connection from ASR to TTS and from TTS to ASR.
0:12:21 | This means TTS receives the ASR output, and ASR receives the TTS output;
0:12:28 | in other words, the machine can listen to what it says.
0:12:32 | The key idea here is to train the ASR and TTS models jointly.
0:12:38 | Training combines supervised learning on paired speech-text data with unsupervised learning on unpaired data, where the closed loop lets ASR and TTS generate useful feedback for each other.
0:12:51 | When only paired data is involved, ASR and TTS are simply trained independently in the standard supervised way.
0:13:01 | In more detail:
0:13:04 | x is the original speech (feature sequence),
0:13:07 | y is the original text,
0:13:09 | x̂ is the predicted speech,
0:13:12 | and ŷ is the predicted text.
0:13:15 | The ASR transforms x into ŷ, using a sequence-to-sequence model that transcribes speech into text,
0:13:23 | and the TTS transforms y into x̂, also using a sequence-to-sequence model, for text-to-speech.
0:13:35 | Now consider the machine speech chain cases. The first case is when both speech and text data are available.
0:13:42 | Given a pair of speech and text, both models can be trained independently in a supervised manner.
0:13:49 | This is done by minimizing the loss between the predicted sequence and the ground-truth sequence:
0:13:56 | for ASR, by minimizing the loss between y and ŷ,
0:14:01 | and for TTS, by minimizing the loss between x and x̂.
0:14:09 | The second case is when only speech data is available; here the chain performs unsupervised learning.
0:14:16 | Given only the speech features x,
0:14:19 | the ASR predicts the most probable transcription ŷ,
0:14:23 | and based on ŷ, the TTS tries to reconstruct the speech features.
0:14:28 | The TTS loss is then calculated between the original speech features x
0:14:33 | and the predicted features x̂.
0:14:36 | Therefore, it is possible to improve TTS with speech-only data, with the support of ASR.
0:14:44 | Now consider the case where only text data is available.
0:14:49 | Given only the text features y,
0:14:52 | the TTS generates the speech features x̂,
0:14:55 | and based on x̂, the ASR tries to reconstruct the text sequence y.
0:15:00 | The ASR loss is calculated between the original text y
0:15:04 | and the predicted text ŷ.
0:15:07 | So it is possible to improve ASR with text-only data, with the support of TTS.
0:15:14 | The overall learning objective is to minimize the combined ASR and TTS losses: supervised losses when paired data is available, and unsupervised losses when only unpaired data is available.
0:15:26 | The two parts are balanced with coefficients α and β, so the models can learn from the new unpaired data without forgetting the paired data.
0:15:31 | If we set α and β greater than zero, a portion of the loss is still provided by the paired training set;
0:15:39 | if we set α to zero, learning proceeds completely unsupervised, with only speech or only text.
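A minimal sketch of this combined objective is below. The `asr` and `tts` modules are dummy stand-ins for the attention-based sequence-to-sequence models, so this only illustrates how the paired and unpaired losses are wired together; it is not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummySeq2Seq(nn.Module):
    """Stand-in for an attention-based seq2seq model (ASR or TTS)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, src):
        return self.proj(src)

feat_dim, vocab = 80, 32                 # mel bins, character classes (assumed)
asr = DummySeq2Seq(feat_dim, vocab)      # speech features -> text logits
tts = DummySeq2Seq(vocab, feat_dim)      # one-hot text -> speech features
alpha, beta = 0.5, 1.0                   # paired / unpaired loss weights

def paired_loss(x, y):
    """Supervised case: both speech x and text y are available."""
    asr_loss = F.cross_entropy(asr(x).reshape(-1, vocab), y.reshape(-1))
    tts_loss = F.l1_loss(tts(F.one_hot(y, vocab).float()), x)
    return asr_loss + tts_loss

def speech_only_loss(x):
    """x -> ASR -> y_hat -> TTS -> x_hat; only the TTS receives gradients."""
    with torch.no_grad():
        y_hat = asr(x).argmax(-1)        # pseudo transcription
    return F.l1_loss(tts(F.one_hot(y_hat, vocab).float()), x)

def text_only_loss(y):
    """y -> TTS -> x_hat -> ASR -> y_hat; only the ASR receives gradients."""
    with torch.no_grad():
        x_hat = tts(F.one_hot(y, vocab).float())   # pseudo speech
    return F.cross_entropy(asr(x_hat).reshape(-1, vocab), y.reshape(-1))

x = torch.randn(2, 50, feat_dim)          # a batch of speech feature sequences
y = torch.randint(0, vocab, (2, 50))      # a batch of character sequences
loss = alpha * paired_loss(x, y) + beta * (speech_only_loss(x) + text_only_loss(y))
loss.backward()
```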
0:15:48 | Here is the overall structure of the ASR: we use a sequence-to-sequence model similar to Listen, Attend and Spell, proposed by Chan et al.
0:15:59 | It has an encoder, a decoder, and an attention module.
0:16:03 | The input is x, the speech feature sequence,
0:16:08 | and the output is y, the text sequence.
0:16:12 | h is the encoder hidden state, s_t is the decoder state, and the attention module produces context information
0:16:21 | at time t by aligning the encoder and decoder hidden states.
0:16:27 | The loss function is the cross-entropy between the true y and the predicted ŷ, where C is the number of output classes.
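Written out in the usual Listen, Attend and Spell notation (reconstructed; the slide's exact symbols may differ):

```latex
\begin{aligned}
h &= \mathrm{Encoder}(x), \qquad s_t = \mathrm{Decoder}(y_{t-1}, s_{t-1}, c_t),\\
a_t(i) &= \operatorname{softmax}_i\big(\mathrm{score}(h_i, s_t)\big), \qquad c_t = \textstyle\sum_i a_t(i)\, h_i,\\
\mathcal{L}_{\mathrm{ASR}}(y, \hat{y}) &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} \mathbb{1}[y_t = c]\, \log p(\hat{y}_t = c \mid x, y_{<t}).
\end{aligned}
```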
0:16:37 | Similarly, the TTS is a sequence-to-sequence model: we use an architecture similar to Tacotron, also consisting of an encoder, decoder, and attention module.
0:16:47 | x^R denotes the linear spectrogram features, x^M the mel spectrogram features, and y the input text.
0:16:55 | h is the encoder state, s is the decoder state, and the attention module produces context information based on the encoder and decoder hidden states.
0:17:05 | Note that there are two kinds of losses: the first is the speech feature reconstruction loss,
0:17:10 | and the second is the end-of-speech prediction loss with binary cross-entropy.
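For a Tacotron-style decoder, the two losses just mentioned can be written as follows (a reconstruction under the assumption of squared-error feature losses and a binary end-of-speech flag e; the exact weighting in the original system may differ):

```latex
\mathcal{L}_{\mathrm{TTS}} \;=\; \underbrace{\|x^{M} - \hat{x}^{M}\|_2^2 \;+\; \|x^{R} - \hat{x}^{R}\|_2^2}_{\text{feature reconstruction}} \;+\; \underbrace{\mathrm{BCE}(e, \hat{e})}_{\text{end-of-speech prediction}}
```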
0:17:16 | Okay, let us discuss some experiments with the machine speech chain.
0:17:21 | For the speech features, we use a mel spectrogram
0:17:25 | together with a higher-dimensional linear spectrogram.
0:17:29 | The speech waveform is reconstructed by predicting the phase with the Griffin-Lim algorithm, followed by the inverse STFT.
0:17:37 | For the text, we use a character-level representation: the letters of the alphabet plus punctuation and special symbols.
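A sketch of this feature pipeline using librosa; the FFT and hop sizes here are placeholders, not the system's actual configuration:

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))   # any mono waveform

n_fft, hop = 2048, 256                        # assumed analysis parameters
# Linear-frequency magnitude spectrogram from the STFT
lin = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
# Mel spectrogram derived from the linear power spectrogram
mel = librosa.feature.melspectrogram(S=lin**2, sr=sr, n_mels=80)

# Waveform reconstruction: Griffin-Lim iteratively estimates the phase,
# then applies the inverse STFT to get back a time-domain signal.
y_hat = librosa.griffinlim(lin, n_fft=n_fft, hop_length=hop)
```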
0:17:44 | To evaluate the proposed method, we first experimented on a corpus with a single speaker,
0:17:50 | because almost all TTS systems at the time were trained on single-speaker datasets.
0:17:56 | We used a single-speaker dataset and split the training data into several portions.
0:18:09 | We simulated several situations:
0:18:12 | first, when the full training data is available as paired speech and text;
0:18:18 | second, when only a small portion has paired speech and text, and the rest is available as unpaired text-only or speech-only data;
0:18:28 | and last, when the additional training data is entirely unpaired speech and text.
0:18:35 | Here are the results; we use the character error rate (CER) to evaluate the ASR.
0:18:42 | This is the result when the system was trained with the full paired training data:
0:18:46 | we can see a CER of about 3.1 percent.
0:18:51 | When only a small portion has paired transcriptions and the remaining data is speech-only or text-only,
0:18:56 | training on the paired portion alone gives a CER around 21.7 percent, which is quite high.
0:19:04 | But by listening while speaking on the unpaired data, with the speech chain mechanism,
0:19:09 | the ASR and TTS can train each other and generate useful feedback.
0:19:15 | Results show that this improves the performance from about 21.7 percent down to 12.3 percent CER,
0:19:23 | and by utilizing still more of the remaining speech-only and text-only data, the CER eventually approached 3.5 percent,
0:19:33 | which is very close to the system trained with 100 percent paired data.
0:19:42 | Next, the TTS results.
0:19:45 | For the TTS experiment, we report the squared L2 norm between the predicted mel and linear spectrograms
0:19:51 | and the ground truth.
0:19:54 | Results show that an ASR-TTS model trained with only the small paired dataset gives a high reconstruction error,
0:20:02 | and that using the speech chain reduces it.
0:20:06 | The model with full paired training has a squared L2 norm of about 0.6,
0:20:12 | and with only ten percent paired data the loss becomes about 1.05.
0:20:19 | Then, by listening while speaking on the unpaired data, we also improved the TTS performance.
0:20:29 | In summary, inspired by the human speech chain, we proposed a machine speech chain that is able to listen while speaking and achieves semi-supervised learning.
0:20:40 | The mechanism enables ASR and TTS to teach each other when given unpaired data, by inferring the missing pair and optimizing a reconstruction loss.
0:20:52 | However, one limitation was that the system was not able to handle unseen speakers:
0:20:59 | the TTS could only mimic the voice of speakers seen in training, with the speaker identity given by a one-hot vector.
0:21:07 | Furthermore, the ASR could only benefit from speech of those same speakers, because the TTS was unable to produce the voice of an unseen speaker.
0:21:18 | Therefore, we set out to improve the capability of the machine speech chain.
0:21:26 | The aim is to handle the voice characteristics of unknown speakers.
0:21:31 | Here, we integrate a speaker recognition system into the speech chain loop,
0:21:35 | and we extend the TTS to mimic an unseen speaker's voice using one-shot speaker adaptation.
0:21:42 | Coupled with the ASR, this gives a speech chain framework that can handle speech from unknown speakers.
0:21:52 | The training mechanism: when only speech is available,
0:21:56 | the ASR predicts the most probable transcription ŷ,
0:22:00 | the speaker recognition module extracts a speaker embedding z, and based on ŷ and z the TTS tries to reconstruct the speech x̂.
0:22:09 | The TTS loss is calculated between the original speech features x and x̂.
0:22:17 | On the other hand, when only text is available,
0:22:20 | we sample a speaker factor z,
0:22:24 | and the TTS generates the speech features x̂ based on the text y and the speaker vector z.
0:22:30 | Then, given x̂, the ASR tries to recover the text ŷ,
0:22:34 | and the ASR loss is calculated between the original text y and the prediction.
0:22:42 | As a consequence, the ASR here is the same as in the basic machine speech chain,
0:22:46 | while the TTS differs in that it takes an additional input: the speaker factor.
0:22:52 | So now there are three kinds of loss functions:
0:22:55 | one is the speech reconstruction loss,
0:22:58 | the second is the end-of-speech prediction loss with cross-entropy,
0:23:03 | and the new one is the speaker embedding loss, which is the cosine distance between the original and the predicted speaker embedding.
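Putting the three terms together, the TTS objective with the speaker embedding takes roughly this form (reconstructed; the per-term weights in the paper may differ):

```latex
\mathcal{L}_{\mathrm{TTS}} \;=\; \|x - \hat{x}\|_2^2 \;+\; \mathrm{BCE}(e, \hat{e}) \;+\; \Big(1 - \frac{z^{\top}\hat{z}}{\|z\|\,\|\hat{z}\|}\Big)
```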
0:23:14 | We ran our experiment on a multi-speaker task using the Wall Street Journal (WSJ) dataset.
0:23:19 | We followed the standard setup, with the SI-84 and SI-284 splits as training sets.
0:23:29 | SI-84 consists of around 7,000 utterances, about 16 hours of speech, spoken by native speakers,
0:23:38 | while SI-284 consists of about 66 hours of speech spoken by 284 speakers.
0:23:47 | For development we used the dev93 set,
0:23:52 | and for evaluation the eval92 dataset.
0:23:57 | Here are the results.
0:24:00 | We first trained a baseline model using the paired SI-84 data only,
0:24:06 | and it achieved 17.75 percent CER.
0:24:10 | In the second row, we trained a model with the full paired SI-284 data, and it achieved around 7 percent CER.
0:24:20 | This is our upper-bound performance.
0:24:23 | In the last row, we trained the model with the speech chain in a semi-supervised way, using SI-84 as paired data and SI-284 as unpaired data.
0:24:34 | For comparison,
0:24:35 | we also performed semi-supervised training
0:24:38 | with a label-propagation method:
0:24:42 | we first trained initial models with the paired SI-84 data,
0:24:47 | then used those pretrained models to label the unpaired data.
0:24:51 | For the text-only SI-284 portion, the TTS generated the corresponding speech,
0:24:58 | and for the speech-only SI-284 portion, the ASR generated the corresponding text.
0:25:05 | After that, we trained new models with the resulting full training set.
0:25:09 | Our results show that label propagation reduced the CER to about 14.5 percent.
0:25:17 | Nevertheless,
0:25:18 | the speech chain model achieved a significantly larger improvement,
0:25:22 | reaching 9.86 percent CER,
0:25:26 | which is closer to the upper-bound result.
0:25:31 | Similarly, the TTS loss could also be reduced by training with the machine speech chain.
0:25:37 | Now we want to play some speech samples; they all synthesize the same sentence, ending "... the problem, they actually provide a solution."
0:25:42 | The first one is the baseline model, trained only with the small paired dataset:
0:25:48 | (audio sample plays)
0:25:53 | Then the model utilizing the unpaired SI-284 data via the speech chain:
0:25:57 | (audio sample plays)
0:26:03 | And the model trained with the full paired training set:
0:26:07 | (audio sample plays)
0:26:12 | Now we will also listen to the TTS with speaker embeddings.
0:26:17 | This is the baseline:
0:26:19 | (audio sample plays)
0:26:22 | this is with the speech chain:
0:26:24 | (audio sample plays)
0:26:28 | and this is the full model:
0:26:31 | (audio sample plays)
0:26:35 | You can hear that, with the speech chain,
0:26:38 | the quality improved significantly.
0:26:45 | To summarize, we improved the machine speech chain to handle the voice characteristics of speech from unknown speakers.
0:26:52 | The TTS can generate speech with a voice similar to the target speaker given only a single speaker example,
0:26:58 | and the ASR can also learn from synthesized speech that combines text with arbitrary voice characteristics.
0:27:06 | However, there is another limitation in the current framework.
0:27:11 | If we only have text, we run the unrolled TTS-then-ASR process, and only the ASR is updated with the reconstruction loss;
0:27:19 | on the other hand, if we only have speech data, we run the ASR-then-TTS process, and only the TTS gets updated.
0:27:27 | This is because backpropagating the reconstruction error to the ASR is challenging:
0:27:33 | note that the output of the ASR is discrete,
0:27:38 | so the chain is not differentiable end to end.
0:27:43 | We will now discuss our solution to handle backpropagation through the discrete output.
0:27:48 | The figure shows the speech chain with speaker embedding modules.
0:27:53 | In the original framework, the gradient of the TTS loss could not be propagated back to the ASR through ŷ, because ŷ is discrete.
0:28:01 | Our proposal to address this problem is to estimate the gradient through ŷ with a straight-through estimator.
0:28:10 | To understand why the gradient of this operation is problematic, consider the argmax function.
0:28:17 | Almost everywhere, a small change in the input does not change the output,
0:28:24 | so the gradient is zero;
0:28:27 | and at the points where the output does change, the gradient is infinite.
0:28:31 | So argmax is not usable for gradient-based learning.
0:28:35 | One way around this problem would be to use a continuous approximation, the softmax,
0:28:42 | but then the model fails to produce discrete outputs.
0:28:45 | The solution, introduced in two ICLR papers, is the Gumbel-softmax distribution:
0:28:52 | it provides a simple method to draw samples from a categorical distribution with given class probabilities.
0:29:00 | Let me explain in more detail.
0:29:03 | The main problem is that the discretization operation is not differentiable.
0:29:09 | The way around this issue is to use the softmax function as a differentiable approximation to the argmax,
0:29:17 | and there is an efficient way of sampling from the categorical distribution by adding Gumbel noise variables g to the log probabilities.
0:29:27 | A temperature parameter controls how closely the relaxed samples approximate discrete one-hot vectors:
0:29:34 | as the temperature approaches zero, the softmax smoothly approaches the argmax and the samples become effectively one-hot,
0:29:42 | while as the temperature grows large, the samples become close to uniform.
0:29:49 | In the straight-through version, the forward pass uses the discrete sample, while in the backward pass we replace its gradient with the gradient of the continuous Gumbel-softmax relaxation.
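A minimal sketch of straight-through Gumbel-softmax sampling, illustrating the trick described above (not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_st(logits, tau=1.0, eps=1e-20):
    """Forward pass: a discrete one-hot sample.
    Backward pass: the gradient of the continuous softmax relaxation."""
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + eps) + eps)        # Gumbel(0, 1) noise
    y_soft = F.softmax((logits + g) / tau, dim=-1)    # relaxed sample
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(logits).scatter_(-1, index, 1.0)
    # Value of y_hard in the forward pass; gradient of y_soft in the backward pass
    return y_hard + (y_soft - y_soft.detach())

logits = torch.randn(4, 10, requires_grad=True)       # e.g. ASR output scores
sample = gumbel_softmax_st(logits, tau=0.5)           # one-hot, yet differentiable
sample.sum().backward()                               # gradients reach `logits`
# PyTorch ships an equivalent: F.gumbel_softmax(logits, tau=0.5, hard=True)
```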
0:30:02 | In this experiment, we again used the multi-speaker WSJ data.
0:30:08 | Here are the results:
0:30:10 | with end-to-end backpropagation enabled, we obtained an 11 percent relative improvement compared to our previous framework.
0:30:21 | To summarize, we improved the machine speech chain mechanism by allowing backpropagation through the discrete output, using the straight-through Gumbel-softmax estimator.
0:30:30 | In the future, it will be necessary to further validate the effectiveness of the proposed approach.
0:30:38 | Next, an additional mechanism: we extend the speech chain into a multimodal chain.
0:30:47 | We know that in human communication, the most common way for humans to communicate is through speech.
0:30:53 | But a machine cannot fully ground what is being communicated without a connection to the world through other senses.
0:31:02 | Human communication is in fact multisensory, involving multiple communication channels:
0:31:08 | not only auditory but also visual channels.
0:31:12 | Humans perceive these multiple sources of information together to build a general concept.
0:31:20 | Of course, the idea of incorporating visual information into speech processing is not new;
0:31:25 | we already have, for example, audio-visual speech recognition (AVSR).
0:31:29 | But such models are usually built by simply concatenating the audio and visual information,
0:31:34 | fusing the modalities at the input,
0:31:37 | and this method usually requires the information from the different modalities to be present all together.
0:31:44 | In practice, however, not all modalities are always available.
0:31:50 | We have seen that the machine speech chain relaxes the requirement of fully paired speech and text data:
0:31:58 | it provides the ability to improve ASR and TTS performance with semi-supervised learning,
0:32:03 | by allowing ASR and TTS to assist each other given text-only or speech-only data.
0:32:09 | However, although it removes the requirement of fully paired data, it still requires data from the speech or text modality,
0:32:19 | so that study was limited to speech and text processing only.
0:32:24 | As mentioned before, human communication is actually multimodal, involving not only auditory but also visual information.
0:32:36 | We therefore proposed a multimodal chain to mimic overall human communication and enable learning across modalities.
0:32:44 | Specifically, we design a closed chain that includes automatic speech recognition (ASR), speech synthesis (TTS), image captioning (IC), and image generation (IG).
0:32:58 | These four components can be trained by assisting each other given incomplete data, propagating pseudo-pairs within the chain.
0:33:08 | So the question now is: can we still improve ASR even when no speech or text data is available?
0:33:18 | Similar to the speech chain, when we have fully paired speech, image, and text data,
0:33:25 | we can separately train the ASR, TTS, IC, and IG using supervised learning.
0:33:35 | Next is semi-supervised training,
0:33:37 | when we have images paired with speech, or images paired with text.
0:33:42 | The left side shows the case when the input is image-and-speech data:
0:33:47 | with ASR and IC, we generate text hypotheses from the speech and the image,
0:33:53 | then the speech and the image are reconstructed from the text, and the reconstruction losses can be backpropagated to improve TTS and IG.
0:34:02 | The right side is when the input is text-only data:
0:34:06 | TTS and IG generate speech and images, respectively,
0:34:12 | and ASR and IC reconstruct the text.
0:34:16 | In this way, ASR and IC can be updated from the reconstruction loss
0:34:22 | and improve their performance.
0:34:27 | Now consider the case where only a single modality is available.
0:34:32 | For example,
0:34:33 | when we have speech-only data,
0:34:35 | the speech is first transcribed by ASR,
0:34:37 | and the text hypothesis is then used to generate an image by IG.
0:34:42 | From the generated image, IC produces another text hypothesis,
0:34:47 | and the loss between the text hypotheses is used to improve the models in the loop.
0:34:51 | On the other hand, when we have image-only data, IC generates a text caption,
0:34:58 | the caption is synthesized into speech by the TTS model,
0:35:03 | and the synthesized speech is then transcribed by the ASR model.
0:35:08 | The loss is then computed between the ASR transcription and the intermediate caption.
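The two single-modality cycles can be sketched as follows; the modules are dummy stand-ins (the real components are seq2seq, captioning, and GAN models), so this only shows how the reconstruction losses are routed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stub(nn.Module):
    """Placeholder for ASR / TTS / IC / IG."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    def forward(self, x):
        return self.net(x)

asr, tts, ic, ig = Stub(), Stub(), Stub(), Stub()

def speech_only_cycle(speech):
    """speech -> ASR -> text -> IG -> image -> IC -> text."""
    text1 = asr(speech).detach()          # detach: the text is effectively discrete
    image = ig(text1)
    text2 = ic(image)
    return F.mse_loss(text2, text1)       # trains IG and IC

def image_only_cycle(image):
    """image -> IC -> caption -> TTS -> speech -> ASR -> text."""
    caption = ic(image).detach()
    speech = tts(caption)
    text = asr(speech)
    return F.mse_loss(text, caption)      # trains TTS and ASR

(speech_only_cycle(torch.randn(2, 64)) + image_only_cycle(torch.randn(2, 64))).backward()
```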
0:35:15 | Our main interest is to see whether image-only data can help to improve the ASR.
0:35:26 | We also created another architecture with a single multimodal chain,
0:35:31 | which we call MC2,
0:35:33 | because we wanted to investigate the possibility of applying the chain mechanism
0:35:39 | within a single model that handles multiple modalities together.
0:35:45 | In MC2, image and speech are processed together by one multimodal speech-to-text model when both are available.
0:35:54 | When the input is image-and-speech data,
0:35:58 | the multimodal speech-to-text model transcribes the images and speech into text,
0:36:03 | the TTS and IG then reconstruct the speech and the image from that text,
0:36:07 | and the reconstruction losses can be used to improve the TTS and IG.
0:36:12 | When the input is text only, we can compute the reconstruction loss between the text hypothesis and the original text,
0:36:20 | and by backpropagating this loss, the multimodal speech-to-text model can be improved too.
0:36:29 | The architectures of the ASR and TTS are similar to the ones we used in the machine speech chain.
0:36:35 | Now let me describe the architectures of the image captioning and image generation models.
0:36:41 | For IC, we use a neural image captioning model in the style of Show and Tell,
0:36:48 | and for IG, we use an AttnGAN-style text-to-image generation model trained with its adversarial losses.
0:36:57 | For the single multimodal chain, the speech-to-text model shares one decoder between the modalities,
0:37:03 | with a speech encoder and an image encoder feeding into it.
0:37:10 | We combine the output-layer probabilities for ASR and IC in order to introduce information sharing between the modalities,
0:37:17 | producing a single prediction.
0:37:20 | When only one modality is available, we simply use the corresponding encoder alone.
0:37:28 | For the experiments, we used the Flickr8k dataset:
0:37:32 | around 8,000 images, each paired with text captions.
0:37:36 | For speech, we used the corresponding spoken captions, about 65 hours of multi-speaker data collected by Harwath and Glass.
0:37:44 | We simulated conditions where not all modalities are available,
0:37:49 | to see how robustly the method performs when only single-modality data is present.
0:37:56 | We partitioned the data so that each subset has different modalities:
0:38:01 | one portion has paired speech, text, and images;
0:38:06 | another portion has all modalities, but unpaired;
0:38:12 | and the last portion has only speech or only images.
0:38:20 | Here are the results of our experiment.
0:38:23 | With the dual-loop chain (MC1), the baseline model trained only on the small paired subset gave 76.75 percent CER.
0:38:32 | With the speech chain using the unpaired speech and text data,
0:38:35 | this CER was reduced to around 15 percent,
0:38:41 | and then, by using the speech-only and image-only data, the CER was further reduced to around 12 percent.
0:38:49 | So the ASR could be improved even when no paired speech or text was available.
0:38:55 | We also see improvement in the other components; for example, the image generation model could be improved given only speech data.
0:39:03 | A similar tendency
0:39:05 | also held for the MC2 single chain:
0:39:09 | it also successfully reduced the CER, from 26.67 percent
0:39:15 | down to 20.72 percent.
0:39:23 | To summarize: the machine speech chain allows training in a semi-supervised fashion without fully paired data.
0:39:29 | Here we incorporated the chain mechanism into a multimodal chain, by jointly training the IC and IG models inside the loop,
0:39:38 | and the results showed that it is feasible to still improve ASR when only image data is available.
0:39:47 | Okay, back to our challenges in constructing a machine interpreter.
0:39:52 | We have discussed the first one: a machine that can listen while speaking.
0:39:57 | Now let us discuss the second challenge:
0:40:02 | problem two is how to develop incremental ASR and incremental TTS,
0:40:06 | so that recognition and synthesis can run in real time, seamlessly.
0:40:12 | Recall the standard pipeline for speech-to-speech translation, consisting of ASR, MT, and TTS.
0:40:19 | In this manner, the process of translation proceeds sentence by sentence:
0:40:24 | the ASR first recognizes the whole spoken sentence in the source language,
0:40:29 | then MT translates the recognized text into the other language,
0:40:33 | and finally TTS synthesizes the spoken sentence in the target language.
0:40:40 | This introduces a significant delay, especially because the complete sentence can be long and complicated.
0:40:46 | A simultaneous interpreter, in contrast, must translate the incoming speech stream from the source language into the target language in real time.
0:40:56 | The process then looks like this: (demo plays)
0:41:09 | So one key challenge is the development of incremental ASR,
0:41:13 | and here we discuss our solution for developing a neural incremental ASR (ISR).
0:41:21 | The difficulty in constructing an incremental ASR is that the model needs to decide the incremental step and output a transcription
0:41:29 | aligned with the corresponding speech segment.
0:41:36 | As we know, attention-based sequence-to-sequence models normalize the attention probabilities over the whole input when computing the weighted sum of encoder states,
0:41:46 | and in practice output is generated only after the full utterance has been encoded.
0:41:49 | This means the system can only generate text output after receiving the entire input sequence;
0:41:56 | consequently, utilizing it in situations that require immediate recognition is difficult.
0:42:05 | To address this limitation, several approaches to streaming neural ASR have been proposed.
0:42:09 | One approach is to use local or monotonic attention;
0:42:12 | another combines a unidirectional encoder
0:42:15 | with a CTC acoustic model;
0:42:22 | and Jaitly et al. proposed the neural transducer model, which incrementally recognizes the input speech.
0:42:34 | However, most existing neural ISR models use specialized frameworks and learning algorithms
0:42:40 | that are not fully compatible with standard attention-based neural ASR.
0:42:44 | Our solution is to keep the original architecture of attention-based ASR, a sequence-to-sequence model,
0:42:53 | and to perform attention transfer, where the standard ASR acts as a teacher that trains the ISR as a student model,
0:43:01 | so the ISR can mimic the attention alignments produced by the full-utterance ASR.
0:43:08 | This is the overall structure.
0:43:10 | The left one is the teacher model, which is the non-incremental ASR,
0:43:15 | while the right one is the student model, which is the incremental ASR,
0:43:21 | and this is the attention transfer
0:43:24 | from the teacher model
0:43:26 | to the student model.
0:43:28 | In training, the ISR keeps exactly the same architecture and input features, and it learns from
0:43:34 | the attention alignment of the non-incremental model:
0:43:39 | the teacher's alignment tells the student
0:43:42 | which block of the input to attend to when producing each output token.
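The attention-transfer objective can be sketched as the transcription loss plus a penalty pulling the student's attention toward the teacher's alignment (an assumed form; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def isr_loss(student_logits, y, student_attn, teacher_attn, gamma=1.0):
    """student_logits: (T, vocab); y: (T,); *_attn: (T, S) attention weights."""
    ce = F.cross_entropy(student_logits, y)          # transcription loss
    attn = F.mse_loss(student_attn, teacher_attn)    # alignment-transfer loss
    return ce + gamma * attn

T, S, vocab = 20, 100, 32
loss = isr_loss(torch.randn(T, vocab),
                torch.randint(0, vocab, (T,)),
                torch.softmax(torch.randn(T, S), -1),   # student attention
                torch.softmax(torch.randn(T, S), -1))   # teacher attention
```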
0:43:50 | Now let me show the performance on the evaluation data.
0:43:54 | This is the performance of a standard ASR reported in a prior publication,
0:43:59 | and this is our standard ASR.
0:44:02 | And these are the results of our incremental ASR.
0:44:08 | As you can see, the ISR successfully reduces the recognition delay
0:44:12 | while maintaining performance comparable to the full-utterance ASR.
0:44:23 | In summary, we constructed an incremental ASR that keeps the original architecture of attention-based neural ASR.
0:44:30 | We performed attention-transfer knowledge distillation,
0:44:32 | in which the standard ASR acts as the teacher model and the ISR is trained as the student model.
0:44:38 | Experimental results showed that the ISR reduces the delay
0:44:42 | and still achieves performance comparable to the standard, non-incremental ASR.
0:44:50 | Now let us discuss how to develop a neural incremental TTS.
0:44:57 | Similar to the ISR problem,
0:44:59 | the challenge for an incremental TTS is that the model must start producing speech
0:45:05 | upon receiving only partial text input, without waiting for the complete sentence.
0:45:12 | To handle such short units,
0:45:16 | we augment the training data by randomly splitting each sentence into shorter sequences,
0:45:21 | and we attach begin and end symbols to the input text.
0:45:28 | Here we use different symbols to indicate the unit's location within the full sentence:
0:45:34 | for example, one marks the start of the utterance,
0:45:38 | one marks the middle, and one marks the end.
0:45:45 | Note that the model is still based on Tacotron;
0:45:47 | the main change is in how it is trained:
0:45:50 | we moved from training sentence by sentence to training on shorter units,
0:45:54 | without much modification to the original architecture.
0:46:03 | In this experiment, we used a Japanese single-speaker dataset that includes about 7,000 utterances,
0:46:14 | spoken by a single female speaker.
0:46:17 | The input text representation consists of about 45 symbols, covering phonemes and accent types.
0:46:27 | This figure shows the naturalness, as a mean opinion score (MOS), of the incrementally synthesized Japanese speech.
0:46:35 | Note that we did not use a neural vocoder, so there is a noticeable quality gap between generated speech and natural speech.
0:46:41 | Nevertheless, here we can see how the synthesis quality changes with different incremental unit sizes:
0:46:48 | synthesizing only one accent phrase at a time
0:46:53 | gives an MOS of almost 2.0,
0:46:57 | and the synthesized speech quality improves when going from one unit to two or three connected units.
0:47:04 | Let me play some examples.
0:47:07 | The first one is incremental synthesis with one accent phrase at a time:
0:47:12 | (audio sample plays)
0:47:16 | This is for two accent phrases:
0:47:19 | (audio sample plays)
0:47:23 | And this is for three accent phrases:
0:47:25 | (audio sample plays)
0:47:29 | This is for the whole sentence: (audio sample plays)
0:47:34 | And this is the natural speech: (audio sample plays)
0:47:39 | You can hear the difference between the shortest units and the whole sentence.
0:47:45 | The results suggest that, for Japanese incremental neural TTS, the recommended synthesis unit, trading off delay and quality,
0:47:53 | lies between a few accent phrases and the whole sentence.
0:47:59 | To summarize, we developed an incremental neural TTS that synthesizes from units shorter than one sentence.
0:48:04 | Experiments revealed that linguistic information beyond a single accent phrase is critical, and that features of the upcoming context are beneficial,
0:48:12 | and that the minimum incremental unit for acceptable quality lies between a few accent phrases and the whole sentence.
0:48:21 | Now we discuss how we combine everything, the incremental ASR and TTS together, toward a real-time machine speech chain for an interpreter.
0:48:32 | We reported earlier on the machine speech chain, where a closed-loop connection gives the machine the ability to listen while speaking.
0:48:43 | There are two processes: from ASR to TTS,
0:48:47 | and from TTS to ASR.
0:48:50 | But it worked only at the sentence level;
0:48:53 | because of that, it requires a long delay, especially when encountering long input sequences.
0:49:01 | In contrast,
0:49:02 | humans listen to what they speak in real time,
0:49:06 | and if there is a delay in their auditory feedback,
0:49:08 | they are unable to continue speaking properly.
0:49:11 | This shows the importance of a short-delay,
0:49:16 | real-time feedback mechanism.
0:49:22 | Here we propose an incremental machine speech chain, in which we connect an incremental ASR (ISR)
0:49:27 | and an incremental TTS (ITTS) in a short-term feedback loop.
0:49:31 | The aim is to reduce the delay and improve the ISR and ITTS quality
0:49:36 | by letting them assist each other within short sequences.
0:49:44 | The learning mechanism of the incremental speech chain is similar to the one in the basic machine speech chain;
0:49:50 | the difference is that
0:49:52 | the components exchange short segments rather than whole sentences.
0:49:56 | The feedback loop again consists of two processes:
0:50:00 | from ISR to ITTS, and from ITTS to ISR.
0:50:05 | In the ISR-to-ITTS process,
0:50:08 | at each incremental step,
0:50:10 | the ISR transcribes a short segment of speech into the corresponding text,
0:50:13 | the ITTS then synthesizes speech based on the ISR text output,
0:50:17 | and the loss is calculated by comparing the original speech segment
0:50:23 | with the speech generated by the ITTS.
0:50:32 | We repeat this process until the end of the speech.
0:50:43 | The second process is from ITTS to ISR.
0:50:49 | Similarly, given a text, for example with some context at the front,
0:50:54 | we begin by taking a short segment of the text, and the ITTS synthesizes
0:50:59 | the corresponding speech based on this segment.
0:51:05 | The ISR then predicts the text based on the synthesized speech,
0:51:08 | and the loss for the ISR is calculated by comparing the ISR text output
0:51:12 | with the original text segment.
0:51:16 | We repeat the same process until the end of the text.
0:51:26 | Again, in this experiment we investigated the performance on the same evaluation data.
0:51:32 | These are the results of the standard, non-incremental ASR and TTS,
0:51:36 | and these are the results for the ISR and ITTS.
0:51:41 | Along the horizontal axis we vary the training condition,
0:51:44 | from the ISR and ITTS
0:51:48 | trained independently,
0:51:52 | to training with the incremental speech chain, with and without the feedback between the components.
0:51:59 | For the ISR, we compute the CER given natural speech input versus synthesized speech fed back from the ITTS;
0:52:06 | similarly, for the ITTS, we compute the loss given ground-truth text versus text predicted
0:52:11 | by the ISR.
0:52:14 | This was done to investigate whether the quality of the feedback
0:52:20 | affects the overall performance.
0:52:23 | As you can see, the CER decreased from the baseline by a large margin,
0:52:30 | down to about 14 percent with the semi-supervised chain,
0:52:34 | and with both feedback directions of the incremental speech chain,
0:52:38 | the improvement held even with recognized rather than ground-truth input,
0:52:42 | reducing the CER further, to about 12 percent.
0:52:48 | The ITTS performance also improved when it was trained with the incremental speech chain.
0:52:56 | In summary, the incremental machine speech chain reduced the delay
0:53:00 | while improving both components, operating on short segments.
0:53:08 | Okay, now let me review the overall solutions and future directions.
0:53:15 | We have demonstrated the basic machine speech chain, which is able to listen while speaking
0:53:21 | and to handle speaker identities,
0:53:24 | and we utilized unpaired data to achieve semi-supervised learning.
0:53:29 | We have also developed an incremental ASR and an incremental TTS,
0:53:34 | and then combined the ISR and ITTS into an incremental machine speech chain with a short-term feedback loop.
0:53:42 | In the future, we will move toward a real-time machine interpreter
0:53:46 | that listens, translates, and speaks simultaneously and incrementally.
0:53:53 | These are the references used in this talk,
0:53:57 | including our publications if you would like to read further:
0:54:01 | the basic machine speech chain framework,
0:54:04 | the machine speech chain with one-shot speaker adaptation,
0:54:06 | the multimodal machine chain, and the incremental ASR, TTS, and incremental machine speech chain.
0:54:16 | This is the end of the presentation. If you have any questions,
0:54:21 | please post them in the Q&A session. Thank you.
---|