0:00:01 Thank you for joining this tutorial. I am Sakriani Sakti, from the Nara Institute of Science and Technology and RIKEN, Japan.
0:00:13 Today we introduce our recent works toward developing a machine speech chain: a machine that can listen and speak, and listen while speaking.
0:00:21 Rather than a tutorial on technical details, we will share our experiences in developing the machine speech chain: the problems we faced and the solutions that we took.
0:00:32 This is joint work with our colleagues, including Andros Tjandra and Satoshi Nakamura.
0:00:41 There are many topics to discuss, and I won't be able to cover every detail within an hour of tutorial. So if there is some part that you would like to know more about, please post your questions in the Q&A session.
0:00:57 Okay, first I would like to discuss what an interpreter does. This is an example of a formal meeting between people speaking different languages: there are two main speakers and an interpreter. Here, one person speaking is, let's say, a Japanese speaker, the other is a Mandarin speaker, and the interpreter, a bilingual individual, will translate between the two languages. So the aim is to construct a machine that can act as a proficient interpreter.
0:01:30 Speech-to-speech translation is one of the technologies that aims to mimic a human interpreter, converting speech from one language to another. This technology comprises automatic speech recognition, or ASR, which transcribes the speech into text in the source language; machine translation, or MT, which maps the text in the source language into the corresponding text in the target language; and text-to-speech synthesis, or TTS, which generates the speech waveform based on the text in the target language.
0:02:00 However, mimicking a human interpreter is an extremely complex task. Even with this processing, the translation performance is still far behind the capability of a professional interpreter. So first I will describe each of the components in use, and then we will discuss what is still different between these machines and a human interpreter.
0:02:23 Many of you may already be familiar with these baseline technologies and how they are used, but let me briefly review them.
0:02:32 The development of automatic speech recognition has a long history; the goal is to enable machines to recognize human speech. The earliest approaches were template-based: speech patterns were stored as templates, and incoming speech was matched against them with dynamic time warping. The field then moved to a different approach with statistical modeling, using the hidden Markov model and Gaussian mixture model, or HMM-GMM.
0:02:55 This figure shows the generic structure of the HMM-GMM framework. It commonly consists of three main components. The first is the acoustic model, in which the acoustic probability distribution is typically modeled by whole-word or phoneme-based HMMs. The second is the pronunciation lexicon, which describes the pronunciation of each word as a sequence of phones. And the third is the language model, which estimates the prior probability of a sequence of words. Finally, speech recognition decoding finds the best sequence of words according to the acoustic model, the lexicon, and the language model.
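As a reference point, the decoding step just described is usually written as a maximum a posteriori search; this is the textbook formulation rather than a formula taken from the slide:

\[
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X) = \operatorname*{arg\,max}_{W} \; p(X \mid W)\, P(W)
\]

where \(p(X \mid W)\) comes from the acoustic model together with the pronunciation lexicon, and \(P(W)\) from the language model.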
0:03:37 The resurgence of deep learning in this decade has also taken place in the field of ASR, utilizing deep neural network architectures, and the performance of these systems has surpassed the previous framework. For example, the hybrid HMM-DNN estimates the posterior probabilities with a deep neural network. There is also CTC, or connectionist temporal classification, and also sequence-to-sequence models with attention, such as Listen, Attend and Spell. An important benefit of deep learning is that it simplifies the many complicated components into an end-to-end model. The widely used measure of performance is the word error rate.
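For reference, the standard definition of word error rate (not shown explicitly in the talk) is:

\[
\text{WER} = \frac{S + D + I}{N}
\]

where \(S\), \(D\), and \(I\) are the numbers of substituted, deleted, and inserted words in the hypothesis, and \(N\) is the number of words in the reference.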
0:04:26 In the last twenty years, there has been significant improvement in ASR performance. This can be seen in the continual decrease in word error rate: in 1993, the word error rate was close to one hundred percent. Recently, IBM and Microsoft have shown that speech recognition can achieve a lower word error rate than professional transcribers, reaching about 5.5 percent.
0:04:55 In a similar way, speech synthesis technology has gradually shifted: from foundational rule-based systems using formant synthesis, to waveform concatenation with unit selection, to the more flexible statistical parametric speech synthesis using hidden Markov models or hidden semi-Markov models (HMM/HSMM). And recently, state-of-the-art TTS systems have been successfully constructed based on deep neural networks, such as sequence-to-sequence models with attention that predict the spectrogram directly, combined with neural vocoders for waveform generation.
0:05:39 The performance of TTS has also improved, coming closer and closer to human-like quality. Let us listen to some examples; the first is from an earlier rule-based or unit-selection concatenative system.
0:05:56 (audio sample played)
0:05:59 Now, TTS based on deep learning has become much more human-like, as in the following speech samples: (audio samples played).
0:06:13 In addition, Google's TTS produces speech that is remarkably close to human speech. The system combines a sequence-to-sequence model that generates spectrograms with a neural vocoder, and it learns to control intonation depending on the semantics and punctuation, taking the whole utterance into account to estimate the prosody. Let me play the samples.
0:06:48 (audio samples played)
0:07:07 So the quality is impressive: we can hardly tell anymore which one is human and which one is synthesized. You can find more samples on their website.
0:07:17 Okay, so we have seen that both ASR and TTS have improved, with quality close to that of humans. Have we solved all the problems? Not yet: issues remain in how these models are trained, and in how we can utilize them in a real-time system.
0:07:36 Here is the example from before, a formal meeting between two people speaking different languages. Let us look more closely at the process of interpretation. When one of the speakers talks in one language, the interpreter listens and waits until the end of the sentence, translates it, and then speaks to the other speaker in the other language. This means that the translation process can only start after the end of the sentence. A human interpreter, however, has the ability to do simultaneous interpretation: listening, translating, and speaking at almost the same time. So the challenges in constructing a machine interpreter are these: the machine should have the ability to listen while it is speaking, as well as the ability to perform recognition and synthesis incrementally.
0:08:29 Let us first address the first problem: constructing a machine that can listen while speaking.
0:08:37 Let me introduce the basic mechanism of human communication, which is called the speech chain. The mechanism transfers a message from the speaker's mind to the listener's mind. It consists of speech production, in which the speaker forms words and generates sound waves, transmitting the speech waveform to the listener. Then the speech perception process happens in the listener's auditory system, so the listener can perceive what was said.
0:09:14 Crucially, the speech chain also has a critical feedback loop, in which we listen to ourselves, from the speaker's mouth to the speaker's ear. To be able to communicate, humans need to know both how to listen and how to speak, and they learn how to talk by closely coupling their articulation with their listening. In other words, the acquisition of spoken language relies on this closed feedback loop.
0:09:48 So children who are deaf from birth have difficulty producing clear speech. Even adults who become deaf after being proficient with a language may nonetheless see their speech articulation degrade as a result of the deafness.
0:10:06 The human brain also shows a close integration of perception and production in speech processing: the auditory system is critically involved not only in listening but also in speaking, and the motor system is critically involved in both production and perception of speech. This can be seen, for example, in how a talking face affects perception, and in how listening engages the motor encoding of articulation. So this means that the processes of speech perception and production are not independent abilities.
0:10:47 On the other hand, computers are also able to learn how to listen and how to speak. As we know, with ASR, machines can learn how to listen, given paired speech and text data; and with TTS, given paired text and speech data, machines can learn how to speak. But computers cannot hear their own voice: ASR and TTS are trained and deployed separately, and each requires a large amount of paired data.
0:11:26 But the question is: can we develop a machine that can listen while speaking? Now let us discuss how we developed the machine speech chain framework.
0:11:38 Our proposed approach is a closed-loop machine speech chain based on deep learning. This is the first deep learning model that imitates the human coupling of speech perception and production. The idea is to have a system that not only can listen or speak, but can also listen while speaking.
0:11:56 This is the standard ASR and TTS framework, in which the two are trained independently: as we mentioned before, with ASR the machine learns how to listen, and with TTS it learns how to speak.
0:12:12 Now, here is the machine speech chain framework. We create a connection from the ASR to the TTS and from the TTS to the ASR. This means the TTS can receive the ASR output, and the ASR can receive the TTS output; in other words, the machine can listen to what it says. The key idea is to train both the ASR and TTS models jointly. In training, we combine supervised learning with labeled (paired) data and unsupervised learning with unlabeled data, in a semi-supervised fashion, where the closed loop enables the models to teach each other using the unlabeled data by generating useful feedback. During inference, the ASR and TTS are used independently, in the standard way.
0:13:01 In more detail: x is the original speech, y is the original text, x̂ is the predicted speech feature sequence, and ŷ is the predicted text. The ASR infers ŷ from x; it is a sequence-to-sequence model in which the input is speech and the output is text. The TTS handles the inverse problem, from y to x̂; we again use a sequence-to-sequence model, this time for text-to-speech.
0:13:35 The most basic case in the machine speech chain is the supervised case, where both speech and text are available. Given a pair of speech and text, the ASR and TTS models can be trained independently with supervised learning. This is done by minimizing the loss between the predicted sequence and the ground-truth sequence: for the ASR, by minimizing the loss between y and ŷ, and for the TTS, by minimizing the loss between x and x̂.
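One common way to write these two supervised losses, consistent with the cross-entropy and reconstruction losses described later in the talk:

\[
\mathcal{L}_{\text{ASR}} = \mathrm{CrossEntropy}(y, \hat{y}), \qquad
\mathcal{L}_{\text{TTS}} = \lVert x - \hat{x} \rVert_2^2
\]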
0:14:09 The next case is when only speech data is available; here, training can be done with the unsupervised mechanism. Given only the speech features x, the ASR predicts the most probable transcription ŷ, and based on ŷ, the TTS tries to reconstruct the speech features. We calculate the loss between the original speech features x and the predicted features x̂. Therefore, it is possible to improve the TTS with speech-only data, with the support of the ASR.
0:14:44 Now consider the case where only text data is available. Given only the text features y, the TTS generates the speech features x̂, and based on x̂, the ASR tries to recover the text sequence ŷ. We calculate the loss between the original text y and the predicted text ŷ. So here it is possible to improve the ASR with text-only data, with the support of the TTS.
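A minimal sketch of these two unlabeled-data loops, assuming placeholder seq2seq modules (asr, tts, greedy_decode, and generate are illustrative names, not the talk's actual code):

```python
import torch
import torch.nn.functional as F

def speech_only_step(asr, tts, x):
    # Speech-only data: ASR transcribes, TTS reconstructs, only TTS is updated.
    with torch.no_grad():                  # the discrete hypothesis blocks gradients anyway
        y_hat = asr.greedy_decode(x)       # hypothetical decoding helper
    x_hat = tts(y_hat)                     # reconstruct the speech features
    return F.l1_loss(x_hat, x)             # reconstruction loss trains the TTS only

def text_only_step(asr, tts, y):
    # Text-only data: TTS synthesizes, ASR transcribes back, only ASR is updated.
    with torch.no_grad():
        x_hat = tts.generate(y)            # hypothetical synthesis helper
    logits = asr(x_hat)                    # (batch, time, classes)
    return F.cross_entropy(logits.transpose(1, 2), y)  # CE loss trains the ASR only
```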
0:15:14 So the overall learning objective is to minimize the ASR and TTS losses, combining supervised learning when paired data is available and unsupervised learning when only unpaired data is available. The basic idea is to be able to train on new data without forgetting the old: if we set the coefficients α and β greater than zero, we use a portion of the loss and the gradient provided by the paired training set; but if we set α to zero, the models learn completely from the unpaired data, with only speech or only text.
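In equation form, with the coefficients just mentioned (a plausible notation, not copied from the slide):

\[
\mathcal{L} = \alpha \, \mathcal{L}^{\text{paired}} + \beta \, \mathcal{L}^{\text{unpaired}}
\]

so \(\alpha, \beta > 0\) mixes supervised and unsupervised gradients, while \(\alpha = 0\) trains purely from unpaired speech or text.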
0:15:48 This is the overall structure of our ASR. We use a sequence-to-sequence model with attention, similar to Listen, Attend and Spell. It has an encoder, a decoder, and an attention module. The input is x, the sequence of speech features, and the output is y, the text sequence. h is the encoder state, s_t is the decoder state, and the attention module produces the context information c_t at time t, which aligns the encoder and decoder hidden states. The loss function is the cross-entropy between y and the prediction ŷ, where C is the number of output classes.
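Written out, with \(a_{t,s}\) the attention weights and \(h_s\) the encoder states (standard seq2seq notation, hedged against the exact slide):

\[
c_t = \sum_{s} a_{t,s}\, h_s, \qquad
\mathcal{L}_{\text{ASR}} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} y_{t,c} \log \hat{y}_{t,c}
\]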
0:16:37 Similar to the ASR, the TTS is also a sequence-to-sequence model with attention, in this case based on a Tacotron-style architecture, consisting of an encoder, a decoder, and an attention module. x^R is the linear spectrogram feature, x^M is the mel spectrogram feature, and y is the input text. h is the encoder state, s is the decoder state, and the attention module produces the context information based on the encoder and decoder hidden states. Note that there are two kinds of losses: the first is the speech-feature reconstruction loss, and the second is the end-of-speech prediction loss with binary cross-entropy.
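One plausible way to write the combined TTS objective just described, with \(e\) the binary end-of-speech label:

\[
\mathcal{L}_{\text{TTS}} = \lVert x^{M} - \hat{x}^{M} \rVert_2^2
+ \lVert x^{R} - \hat{x}^{R} \rVert_2^2
+ \mathrm{BCE}(e, \hat{e})
\]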
0:17:16 Okay, so let us discuss some experiments with the basic machine speech chain. For the speech features, we use mel spectrograms together with linear spectrograms. The speech waveform is then reconstructed by using the Griffin-Lim algorithm to predict the phase, followed by the inverse STFT. For the text, we use characters: the 26 letters of the alphabet plus punctuation and special symbols.
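A minimal sketch of this waveform-recovery step with librosa; the parameter values are illustrative assumptions, not the talk's exact configuration:

```python
import librosa

def spectrogram_to_wave(S_mag, n_iter=60, hop_length=256, win_length=1024):
    # Griffin-Lim iteratively estimates the phase that the model does not
    # predict, then applies the inverse STFT to return a waveform.
    return librosa.griffinlim(S_mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)
```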
0:17:44 For our proposed method, we first experimented on a corpus with a single speaker, because at that time most TTS systems were trained on single-speaker datasets. As no large natural single-speaker corpus was available to us, we prepared a large single-speaker dataset by synthesizing a text corpus with a commercial TTS system.
0:18:09 We simulated several situations: first, when the full training data consists of paired speech and text; second, when only a small portion has paired speech and text, and the model can additionally utilize unpaired text-only or speech-only data; and last, semi-supervised training with varying percentages of paired data.
0:18:35 Here are the ASR results; we use the character error rate (CER) to evaluate the performance. When the system was trained with the full paired training data, we achieved 3.1% CER. When only 10% of the data has transcriptions and the remaining data is speech-only or text-only, training on the paired portion alone gives 21.7% CER, which is quite high.
0:19:04 But if the machine can listen while speaking, with the speech chain mechanism, the ASR and TTS can teach each other using the unlabeled data by generating useful feedback. Results show that this improves the performance from 21.7% down to 12.3% CER, and with a larger paired portion the CER reached 3.5%, which is very close to the system that used 100% paired data.
0:19:42 Next, let us evaluate the TTS as well. In this experiment, we report the squared L2 norm between the predicted mel spectrogram and the ground truth. The results show that the ASR and TTS models can be trained with a small paired dataset plus unpaired data by using the feedback within the chain. The model trained with the full paired training data has an L2-norm loss of 0.60, and with only 10% paired data the loss becomes 1.05. Then, by listening while speaking on the unpaired data, the TTS performance was also improved.
0:20:29 To summarize: inspired by the human speech chain, we proposed a machine speech chain that is able to listen while it speaks, and achieved semi-supervised learning. The mechanism also provides a novel way for the ASR and TTS to assist each other given unpaired data: the ASR helps the TTS through its transcriptions, and the TTS helps the ASR, optimized through the reconstruction loss.
0:20:52 However, one limitation remained: the system was not able to handle unseen speakers. This is because the TTS could only mimic the voice of a given speaker whose identity was seen in training. Furthermore, the ASR could only be helped with speech the TTS could produce, and the speaker-specific TTS was unable to produce the voice of an unseen speaker. Therefore, we tried to improve the framework and handle this limitation within the speech chain.
0:21:26 So, for the TTS to handle the voice characteristics of unknown speakers, we integrated a speaker recognition system into the speech chain loop, and we extended the capability of the TTS to mimic the voice of a new speaker via one-shot speaker adaptation. Coupled with the ASR, we developed a speech chain framework that is able to handle speech from unknown speakers.
0:21:52 In this mechanism, when only speech data is available, the ASR predicts the most probable transcription ŷ, the speaker recognition model extracts the speaker embedding z, and based on ŷ and z, the TTS tries to reconstruct the speech features x̂. The TTS loss is calculated between the original speech features x and x̂.
0:22:17 On the other hand, when only text data is available, we sample a speaker vector z. The TTS generates the speech features x̂ based on the text y and the speaker vector z; then, given x̂, the ASR tries to recover the text ŷ. The ASR loss is calculated between the original text y and the prediction ŷ.
0:22:42 The ASR component here is the same as in the basic machine speech chain; for the TTS, the only difference is the additional input of the speaker factor.
0:22:52 So now there are three kinds of loss functions: the first is the speech reconstruction loss; the second is the end-of-speech prediction loss with binary cross-entropy; and the third is the speaker embedding loss, which is the cosine distance between the original and the predicted speaker embeddings.
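The speaker embedding loss can be written as a cosine distance between the original embedding \(z\) and the embedding \(\hat{z}\) recomputed from the reconstructed speech (standard formulation, hedged):

\[
\mathcal{L}_{\text{spk}} = 1 - \frac{z^{\top} \hat{z}}{\lVert z \rVert \, \lVert \hat{z} \rVert}
\]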
0:23:14 We ran our experiments on a task with multi-speaker data, the Wall Street Journal dataset, using the standard splits: SI-84 as the paired training set and SI-200 as the additional set. SI-84 consists of around 7,000 utterances, about 16 hours of speech spoken by native speakers. SI-200 consists of about 66 hours of speech spoken by 200 speakers. For evaluation, we used the dev93 and eval92 sets.
0:23:57 Here are the ASR results. We first trained the baseline model by using the paired SI-84 data only, and it achieved 17.75% CER. In the second row, we trained the model with the full paired SI-284 data (SI-84 plus SI-200), and it achieved around 7% CER; this is our upper-bound performance. And in the last row, we trained the model with semi-supervised learning, using SI-84 as paired data and SI-200 as unpaired data.
0:24:34 For comparison with our semi-supervised learning, we also evaluated a label-propagation method: we first trained the models with the paired SI-84 data, then used these pre-trained models to label the unpaired data. For the text-only SI-200 data, the pre-trained TTS generated the corresponding speech, and for the speech-only SI-200 data, the pre-trained ASR generated the corresponding text. After that, we trained the models with the resulting full training set. Our results show that label propagation reduced the CER to 14.5%. Nevertheless, the speech chain model achieved a significantly larger improvement, reaching 9.86% CER, which closes the gap with the upper-bound result.
0:25:31 Similarly, the TTS loss was also improved by training within the speech chain.
0:25:37 Now I want to play some speech samples; they all synthesize the same test sentence, ending "... they actually provide a solution". The first one is from the baseline model trained with only the paired data: (sample played). The next is from the model trained with label propagation: (sample played). And this is from the model trained with the speech chain: (sample played). Now we will also evaluate the speaker identity of the TTS output. This is the baseline: (sample played). This is the speech chain: (sample played). And this is the ground truth: (sample played). You can hear that the model trained with the chain improved significantly.
0:26:45 So, to summarize: we improved the machine speech chain so that it can handle the voice characteristics of speech from unknown speakers, in which the TTS can generate speech with similar voice characteristics given only a one-shot speaker example, and the ASR can also be improved from the combination of text and arbitrary voice characteristics.
0:27:06 However, there is another limitation in this framework. If we only have text data, we perform the loop from TTS to ASR, and only the ASR is updated, with the cross-entropy loss. On the other hand, if we only have speech data, we perform the loop from ASR to TTS, and only the TTS is updated, with the reconstruction loss. This is because backpropagating the error from the reconstruction loss to the ASR is challenging: note that the output of the ASR is discrete, so the loss cannot flow back through it. Now we will discuss our solution to handle backpropagation through the discrete output.
0:27:48 The figure shows the speech chain with the speaker embedding modules. In the original framework, the loss of the TTS could not be backpropagated to the ASR through ŷ, because ŷ is discrete. Our proposal addresses this problem by estimating the gradient through the discrete variable ŷ.
0:28:10 To understand why the gradient of this operation is a problem, consider the argmax function. As you can see, almost everywhere a small change in the input results in no change in the output, and so the gradient is zero; and where the change in the output is not zero, the gradient is infinite. So it is not usable for gradient-based optimization.
0:28:35 One way to get around this problem would be to use a continuous approximation, the softmax function, but it fails to produce discrete outputs.
0:28:45 So the solution was presented in two ICLR papers, which proposed the Gumbel-softmax distribution. It provides a simple method to draw samples from a categorical distribution given the class probabilities.
0:29:00 Let me explain in more detail. The main problem is that sampling a discrete value is not differentiable. The solution is to use a soft, continuous function as an approximation to the hard one. The Gumbel-softmax offers an efficient way of sampling from the categorical distribution: we add a random Gumbel variable g to the log probabilities, and a temperature parameter τ controls how closely the samples approximate one-hot vectors. As τ approaches zero, the softmax computation smoothly approaches the argmax and the samples become close to one-hot; on the other hand, as τ grows large, the samples become close to uniform. The loss is computed on the discrete output, and in the backward pass we replace the gradient of the discrete sample ŷ with the gradient of its continuous relaxation, the straight-through estimator.
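A minimal sketch of this straight-through Gumbel-softmax step using PyTorch's built-in helper; the tensor shapes are illustrative, not the talk's configuration:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 30)   # (batch, length, vocab): dummy ASR output scores

# Gumbel noise g = -log(-log u), with u ~ Uniform(0, 1), is added to the logits
# inside gumbel_softmax; tau is the temperature (tau -> 0 gives near one-hot
# samples, large tau gives near-uniform ones). hard=True discretizes the
# forward pass while keeping the soft gradient in the backward pass, i.e. the
# straight-through estimator described above.
y_sample = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
```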
0:30:02 In this experiment, we again used the multi-speaker data. Here are the results: with the use of the straight-through Gumbel-softmax estimator, we obtained an 11% relative improvement compared to our previous framework. To summarize, we improved the machine speech chain mechanism by allowing backpropagation through the discrete output with the straight-through estimator. In the future, it is necessary to further verify the effectiveness of this approach in other settings.
0:30:38 Now, as an additional mechanism, we extend the speech chain into a multimodal chain.
0:30:47 We know that in human communication, the most common way for humans to communicate is via speech. But a spoken message alone cannot be understood completely without a connection to the world through our other senses. Human communication is actually multisensory, involving multiple communication channels: not only auditory but also visual channels. Humans perceive these multiple sources of information together to build a general concept.
0:31:20 Basically, the idea of incorporating visual information into speech processing is not new; we know there is audio-visual ASR. But most approaches simply concatenate the acoustic and visual information, and such methods usually require all the information from the different modalities to be present together. In practice, on the other hand, often only partial data is available.
0:31:50 We have also learned a weakness of the speech chain: it still relies on pre-training from a small amount of parallel speech and text data. It provides the ability to improve ASR and TTS performance through semi-supervised learning, by allowing the ASR and TTS to teach each other given only text or only speech data. Unfortunately, although it removes the requirement of a fully paired corpus, it still requires a large amount of unpaired speech and text, and the study was limited to only the speech and text modalities.
0:32:24 So the question is whether the chain mechanism can also work with modalities beyond speech and text. We propose a multimodal chain, to mimic overall human communication with multimodal learning. Specifically, we designed a chain architecture that includes speech recognition (ASR), speech synthesis (TTS), image captioning (IC), and image generation (IG). It can be trained in a semi-supervised fashion, with the components assisting each other given incomplete data and leveraging the data generated within the chain.
0:33:08 So there is a question now: can we still improve the ASR even when no speech or text data is available?
0:33:18 Similar to the speech chain, the base case is when we have fully paired data across speech, image, and text: we can separately train the ASR, TTS, IC, and IG using supervised learning.
0:33:35 The next case is semi-supervised training, in which we have unpaired image, speech, and text data. The left side is when the input is image-only or speech-only data: the ASR and the IC generate text hypotheses from the speech or image input, and then the speech and image reconstruction losses can be used to backpropagate to the TTS and IG. On the right side, when the input is text-only data, the TTS and the IG generate speech and images respectively, and then the ASR and IC compute losses on the text; in this way, the ASR and IC can be improved via the reconstruction of the text.
0:34:27 Then there is the case where only a single modality is available. For example, when we have speech-only data, it is transcribed by the ASR, the text hypothesis is then used to generate an image by the IG, and by captioning that image with the IC we get another text hypothesis; we compute the loss between the two and improve the IC model. On the other hand, when we have image-only data, the IC generates a text caption, the caption is synthesized into speech by the TTS model, and the synthesized speech is then transcribed by the ASR model; we then compute the loss of the ASR transcription against the intermediate caption. This is our main interest: to see if image-only data can help to improve the ASR.
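A minimal sketch of this image-only loop with placeholder modules (ic, tts, asr and their generate helpers are assumed names, not the talk's code):

```python
import torch
import torch.nn.functional as F

def image_only_step(ic, tts, asr, image):
    # IC captions the image, TTS speaks the caption, ASR transcribes it back;
    # the loss against the intermediate caption updates the ASR.
    with torch.no_grad():
        caption = ic.generate(image)       # hypothetical captioning helper
        speech = tts.generate(caption)     # synthesize the caption
    logits = asr(speech)                   # (batch, time, classes)
    return F.cross_entropy(logits.transpose(1, 2), caption)
```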
0:35:26 We also created another architecture with a single multimodal chain, which we call MC2, because we wanted to investigate the possibility of applying the chain mechanism with a single multimodal model instead of several unimodal ones.
0:35:45 In MC2, the modalities are processed together when they are available, by a multimodal-to-text model. When the input is image-only or speech-only data, the multimodal-to-text model converts the image or speech into text, the TTS and IG reconstruct the speech and image, and the reconstruction losses can be used to improve the TTS or IG. And when the input is text-only, we can compute the loss functions for the text hypotheses produced from the synthesized image and speech, and by backpropagating these losses, the multimodal-to-text model can be improved too. The architectures of the ASR and TTS are similar to the ones in the basic machine speech chain.
0:36:35 Now let us discuss the architectures of image captioning and image generation. For the IC, we use an attention-based image captioning model in the style of Show, Attend and Tell. And for the IG, we use an attention-based text-to-image generation model trained with an adversarial loss.
0:36:57 For the multimodal-to-text model, we combine the speech encoder and the image encoder with a shared decoder. The model shares the output layer probabilities for the ASR and IC tasks, in order to introduce information sharing in a single model. When only one modality is available, we use only the corresponding encoder.
0:37:28 In our experiments, we used the Flickr8k dataset: 8,000 images, each with five caption annotations. As for the audio, we used the corresponding spoken captions, about 65 hours of multi-speaker data, collected by Harwath and Glass. We simulated conditions where parts of the data are missing some modalities; our aim is to see how robust the method is when given single-modality data.
0:37:56 So we partition the data into subsets with different modalities: the first portion has paired speech, text, and images; another portion has all modalities but unpaired; and the last portion has only speech or only images.
0:38:20 Here are the results of our experiments. For MC1, trained with the paired subset only, our baseline achieved 26.75% CER. With the speech chain using the unpaired dataset, the CER was reduced to around 15%. Then, by also using the speech-only and image-only data, the CER was further reduced. So the ASR can be improved even when no paired speech and text are available.
0:38:55 We can also see improvements in the other models; for example, the IG model could be improved given images or speech data. A similar tendency also happened for the MC2 single chain: its ASR was also successfully improved, with the CER reduced from the 26.67% baseline.
0:39:23 To summarize: the machine speech chain allows training in a semi-supervised fashion with unpaired data. Here, we expanded the chain mechanism into a whole multimodal chain, by jointly training the ASR, TTS, IC, and IG models in a closed-loop connection, and the results reveal that it is possible to improve the ASR even when only image data is available.
0:39:47 Okay, now recall our challenges in constructing a machine that can listen and speak like an interpreter. We have discussed the first one, a machine that can listen while speaking. Now let us discuss the second challenge.
0:40:02 The second problem to address is how to develop incremental ASR and incremental TTS for a real-time, simultaneous machine interpreter. We have the standard pipeline of speech-to-speech translation, which consists of ASR, MT, and TTS.
0:40:19 In this manner, the process of translation is done sentence by sentence: first, the ASR recognizes the whole spoken sentence in the source language; then the MT translates the recognized text into the other language; and finally, the TTS synthesizes the spoken sentence in the target language. This is time-consuming, because the system has to wait for the complete sentence, which can be long and complicated.
0:40:46 In contrast, most human interpreters incrementally translate, segmenting the incoming speech stream from the source language into the target language in real time, so the process can go like this: (demonstration played).
0:41:09 So one important challenge for simultaneous interpretation is the development of incremental ASR (ISR), and here we discuss our solution in developing a neural incremental ASR.
0:41:21 A common way to construct incremental ASR is that the model needs to decide the incremental step and output the transcription aligned with the corresponding speech segment.
0:41:36 As we know, attention-based sequence-to-sequence models normalize the attention probabilities over the whole input, and the computation of the context is a weighted sum over the entire input sequence; in practice, this gives a global alignment. This means the system can only generate the text output after it receives the entire input sequence. Consequently, utilizing it in situations that require immediate recognition is difficult.
0:42:05 Several approaches for performing neural incremental ASR have been proposed. One approach is to use monotonic or local attention, with a unidirectional encoder and a CTC acoustic model. A neural transducer, by Jaitly and colleagues, has also been proposed that incrementally recognizes the input speech waveform by processing it in blocks.
0:42:34 However, most existing neural ISR models utilize different frameworks and learning algorithms, which are not compatible with standard neural ASR.
0:42:44 Here, our solution is to employ the original architecture of attention-based sequence-to-sequence ASR, and to perform attention transfer, where the standard ASR serves as a teacher that guides the ISR as the student model. So the ISR can mimic the attention alignment of the standard non-incremental ASR.
0:43:08 This is the overall framework. The left one is the teacher model, which is the non-incremental ASR, while the right one is the student model, which is the incremental ASR. And this arrow is the attention transfer from the teacher model to the student model. The ISR keeps exactly the same architecture and is trained with a loss that combines the standard transcription loss with the attention alignment from the non-incremental model, so that the alignment is learned locally, block by block, over short segments of the input.
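A rough sketch of the combined student objective, assuming the teacher's attention matrix is simply imitated with an MSE term (illustrative, not the exact loss from the paper):

```python
import torch.nn.functional as F

def isr_loss(student_logits, targets, att_student, att_teacher, lam=1.0):
    # Standard cross-entropy on the student's transcription ...
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    # ... plus a term pulling the student's attention toward the teacher's
    # alignment (teacher is frozen, hence detach()).
    align = F.mse_loss(att_student, att_teacher.detach())
    return ce + lam * align
```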
0:43:50 This figure shows the performance on the dataset: the x-axis is the delay. This is the performance of the standard non-incremental ASR, and these are the results of our incremental ASR. As you can see, the proposed model significantly reduces the delay while maintaining performance comparable to the standard non-incremental ASR.
0:44:23 To summarize: we developed an incremental neural ASR based on attention transfer. We performed knowledge distillation, in which the standard ASR acts as the teacher model that assists the training of the ISR student model. Experimental results reveal that it significantly reduces the delay while still achieving performance comparable to the standard non-incremental ASR, within an acceptable delay.
0:44:50 Now let us discuss how to develop a neural incremental TTS.
0:44:57 Similar to the ISR problem, the challenge in incremental TTS is that the model has to produce speech upon receiving only partial target text from the preceding system, rather than the full sentence.
0:45:12 To handle synthesis of short units within a sentence, we first modified the dataset by randomly splitting each sentence into phrase sequences, with short pauses, and we added begin and end symbols to the input text. Here we use different symbols to differentiate the unit's location within the full sentence, for example marking whether the unit starts the utterance, sits in the middle, or ends it.
0:45:45 Recall that our TTS is based on Tacotron. To handle this problem, we simply changed the training from sentence-by-sentence to phrase-by-phrase, without much modification to the original model.
0:46:03 In this experiment, we used a Japanese single-speaker dataset that includes about 7,000 utterances spoken by a female speaker. The input text consists of 45 phoneme symbols plus the accent types.
0:46:27 This figure shows the naturalness, as a mean opinion score (MOS), of the synthesized speech for Japanese incremental TTS. Note that there is still a wide quality gap between generated speech and natural speech. Nevertheless, here we can see that the synthesized quality differs with the incremental unit length: synthesizing only one accent phrase at a time gives the worst quality, with an MOS of almost 2.0, and the synthesized speech quality improves when one unit is connected with two or three more units.
0:47:04 Let me play the examples. The first one is where the incremental unit is a single accent phrase: (sample played). This is for two accent phrases: (sample played). This is for three accent phrases: (sample played). This is for the whole sentence: (sample played). And this is the natural speech: (sample played).
0:47:45 So the results suggest that Japanese incremental TTS performs well when the incremental unit size is between two accent phrases and the whole sentence.
0:47:59 To summarize: we developed an incremental neural TTS based on segments with a variable unit size. The experiments reveal that, linguistically, the accent phrase is a critical unit and contextual linguistic features are necessary, and that a minimum incremental unit of two accent phrases provides quality approaching that of whole-sentence synthesis.
0:48:21 Now we discuss how we combine all the components, the incremental ASR and the incremental TTS, within the framework of a real-time machine speech chain, toward a simultaneous interpreter.
0:48:32 We have introduced the basic machine speech chain, in which we create a closed-loop connection between the ASR and TTS so the machine can listen while speaking. There are two processes in the loop, from ASR to TTS and from TTS to ASR, but both work at the utterance (sentence) level.
0:48:53 Because of that, it requires a long delay, especially when encountering long input sequences. In contrast, humans listen to themselves while they speak in real time, and if there is a problem with their hearing, they will not be able to maintain proper speech. So this means that the feedback is useful if it can be performed in real time, as a short-delay feedback mechanism.
0:49:22 Here we propose the incremental machine speech chain, in which we connect the incremental ASR (ISR) and the incremental TTS (ITTS) in a feedback loop. The aim is to reduce the delay and improve the ISR and ITTS quality by letting them teach each other within short sequences.
0:49:44 The learning mechanism of the incremental machine speech chain is similar to the one in the basic machine speech chain; the difference is that we use short segments to form the feedback between the components. The feedback loop can again be divided into two processes: from the ISR to the ITTS, and from the ITTS to the ISR.
0:50:05 In the ISR-to-ITTS process, at each incremental step, given a short speech segment, the ISR generates the corresponding text. The ITTS then synthesizes speech based on the ISR text output, and the loss here is calculated by comparing the original speech with the speech predicted by the ITTS. We repeat this process until the end of the speech.
0:50:43 The second process is from the ITTS to the ISR. Similarly, for example, we have text as input: we begin by taking a segment of the text, and the ITTS synthesizes speech based on this segment. The ISR then predicts the text based on the synthesized speech, and the loss here is calculated for the ISR by comparing the ISR text output with the original text segment. We repeat the same process until the end of the text.
0:51:26 Again, in this experiment, we investigated the performance on the same data as before. These are the results of the standard ASR and TTS, and these are the cases of the ISR and ITTS. The first rows show the baselines, where the ISR and ITTS work independently.
0:51:52 And these are the results when the models are trained within the incremental machine speech chain, in the semi-supervised fashion. For the ISR, we compute the CER given both natural speech input and the synthesized speech feedback from the ITTS; similarly, for the ITTS, we compute the loss given both the ground-truth text and the text hypotheses produced by the ISR. This was done to investigate whether the quality of the feedback would affect the overall chain performance.
0:52:23 As you can see, the CER decreased from the baseline of about 17% to 14% with the semi-supervised incremental speech chain. The improvement also held when the recognition input included synthesized feedback, with the CER further reduced to about 12%. The ITTS performance also improved when it was trained using the incremental speech chain.
0:52:56 So, to summarize: the incremental machine speech chain enabled semi-supervised training of the ISR and ITTS while keeping the delay short, by exchanging feedback over short segments.
0:53:08 Okay, so now let me give the overall summary and future directions.
0:53:15 Here we have demonstrated the machine speech chain, which creates a closed feedback loop between ASR and TTS so that a machine can listen while speaking and handle speaker identities. This mechanism allows us to achieve semi-supervised learning. We have also developed an incremental ASR and an incremental TTS, and then combined the ISR and ITTS into an incremental machine speech chain framework.
0:53:42 In the future, we will move toward a real-time whole speech chain, in which the machine listens, translates, and speaks at the same time, like a simultaneous interpreter.
0:53:53 These are the citations used in this tutorial, and these are our publications related to it, covering the basic machine speech chain framework, the multimodal machine chain, and the incremental ASR, incremental TTS, and incremental machine speech chain.
0:54:16 This is the end of the tutorial. If there is anything more you want to know, please post your questions in the Q&A session. Thank you.