0:00:01 | Hello, and thank you for joining this tutorial. We are from the Nara Institute of Science and Technology.
0:00:13 | We will introduce some of our recent works on developing a machine speech chain that can listen while speaking.
0:00:21 | Rather than a tutorial on technical details, we will share our experiences in developing a machine-based interpreter: the problems we faced and the solutions we took.
0:00:32 | This is ongoing work, and we do not claim that the problem of machine interpretation is already solved.
0:00:41 | There are many topics to discuss, and we will not be able to cover every detail within an hour of tutorial.
0:00:47 | So if there are parts you would like to hear more about, please post your questions for the Q&A session.
0:00:57 | Okay, first let us discuss what an interpreter does.
0:01:02 | This is an example of a formal meeting between people speaking different languages: there are two main speakers and an interpreter.
0:01:11 | When one person speaks, say in Japanese, the interpreter translates the speech for the Mandarin speaker, and vice versa.
0:01:22 | So the aim is to construct a machine that can act as a proficient interpreter.
0:01:30 | Spoken language translation technology aims to mimic a human interpreter in converting speech from one language to another.
0:01:38 | This technology consists of automatic speech recognition (ASR), which transcribes the speech into text in the source language;
0:01:46 | machine translation (MT), which maps the text in the source language into the corresponding text in the target language;
0:01:53 | and text-to-speech synthesis (TTS), which generates a speech waveform based on the text in the target language.
0:02:00 | However, translating spoken language is an extremely complex task,
0:02:05 | and even with this pipeline, translation performance is still far from that of a professional interpreter.
0:02:13 | So first, let us briefly review each of the components,
0:02:17 | and then we will discuss how these systems still differ from a human interpreter,
0:02:23 | starting with the baselines and the technologies they are built on.
0:02:32 | The development of automatic speech recognition has enabled machines to recognize and transcribe human speech.
0:02:39 | Early approaches were based on template matching,
0:02:43 | a technique that can be seen in dynamic time warping (DTW),
0:02:47 | and the field then moved to a different approach, statistical modeling with hidden Markov models and Gaussian mixture models (HMM-GMM).
0:02:55 | This figure shows the generic structure of an HMM-GMM ASR framework.
0:03:00 | It commonly consists of three main components. The first is the acoustic model, which estimates the acoustic likelihood and is typically built from subword or phoneme-based HMMs.
0:03:11 | The second is the pronunciation lexicon, which describes the pronunciation of each word as a sequence of phones.
0:03:17 | The last one is the language model, which estimates the prior probability of a sequence of words.
0:03:24 | Finally, speech recognition decoding finds the best sequence of words according to the acoustic model, lexicon, and language model.
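As a reference, the decoding step just described is the standard noisy-channel search; this is the textbook formulation rather than a slide from the talk:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \; P(X \mid W)\, P(W)
```

where P(X|W) is supplied by the acoustic model and pronunciation lexicon, and P(W) by the language model.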
0:03:37 | The resurgence of deep learning has also taken place in ASR, in the course of which neural networks came to replace the classical components,
0:03:46 | with performance matching or surpassing the HMM-GMM baselines.
0:03:53 | For example, a hybrid HMM-DNN estimates the HMM posterior probabilities with a deep neural network.
0:03:59 | There is also CTC, or connectionist temporal classification,
0:04:05 | and sequence-to-sequence models with attention, such as Listen, Attend and Spell.
0:04:11 | An important benefit of end-to-end deep learning is that it folds many complicated components into a single model.
0:04:18 | The widely used measure of ASR performance is the word error rate (WER).
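WER counts the word-level substitutions, insertions, and deletions between hypothesis and reference, normalized by the reference length. A minimal sketch of the standard dynamic-programming computation (illustrative; any real toolkit has its own implementation):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate = (substitutions + insertions + deletions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words = 0.33
```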
0:04:26 | In the last two decades there has been significant improvement in ASR performance.
0:04:31 | This can be seen in the gradual decrease in word error rate:
0:04:36 | around 1993, the word error rate on difficult tasks was close to 100 percent,
0:04:41 | while recently IBM and Microsoft have shown that speech recognition can rival professional transcription, with word error rates around 5.5 percent on conversational benchmarks.
0:04:55 | Speech synthesis technology has gone through a similar gradual shift.
0:05:01 | The early foundations were rule-based systems such as formant synthesis.
0:05:07 | The field then moved to waveform concatenation, and to more flexible statistical parametric synthesis using hidden (semi-)Markov models.
0:05:19 | Recently, state-of-the-art TTS systems have been successfully constructed with neural networks,
0:05:27 | for example sequence-to-sequence models such as Tacotron, which work directly from character input,
0:05:34 | combined with neural vocoders such as WaveNet.
0:05:39 | The quality of TTS has also improved, approaching human-like speech.
0:05:44 | Let us listen to the latest examples.
0:05:56 | (audio sample plays)
0:05:59 | Notice that synthesized voices have become more human-like, as in the following speech samples.
0:06:05 | (audio sample plays)
0:06:13 | In addition, Google has demonstrated a dialog system whose synthesized speech is remarkably close to human speech.
0:06:20 | That system used a combination of synthesis techniques, controlling the intonation depending on the circumstance.
0:06:32 | In the demo, the system makes a phone call to schedule an appointment, across a whole conversation.
0:06:39 | Let us listen to the sample.
0:06:48 | (audio sample plays)
0:07:07 | Unfortunately it seems the audio cannot be played properly here; you can find the same samples on their website.
0:07:17 | Okay, so we have seen that both ASR and TTS have improved in quality, close to human parity.
0:07:24 | Have we solved all the problems?
0:07:26 | Consider how these models are trained,
0:07:30 | and how we can utilize them in a real-time system.
0:07:36 | Here is the same example as before: a meeting between two people speaking different languages.
0:07:42 | A machine interpreter performs an iterative process of interpretation:
0:07:47 | one participant speaks in one language,
0:07:50 | and the machine waits until the end of the sentence, translates it, and speaks to the other participant in the other language.
0:07:58 | This means the translation process cannot even start before the end of the sentence,
0:08:04 | and the machine does not have the ability to carry out these processes simultaneously: listening, translating, and speaking.
0:08:11 | So the challenges, from handling unseen speakers to operating in real time, are not yet overcome.
0:08:16 | The idea is to construct a machine that has the ability to listen while it is speaking, as well as the ability to perform recognition and synthesis incrementally.
0:08:29 | Let us first discuss problem one: how to develop a machine that can listen while speaking.
0:08:37 | Denes and Pinson described the basic mechanism of spoken communication, which is called the speech chain.
0:08:45 | The mechanism describes how a spoken message travels from the speaker's mind to the listener's mind.
0:08:52 | It consists of speech production, in which the speaker formulates a message in their mind and produces a sound wave,
0:08:59 | which conveys the speech waveform to the listener.
0:09:05 | Then the speech perception process happens in the listener's auditory system, and the listener perceives what was said.
0:09:14 | Critically, the speech chain is closed by a feedback loop: a side path from the speaker's mouth to the speaker's own ear.
0:09:24 | Speakers hear their own voice while talking and use it to monitor and adjust their articulation.
0:09:30 | That is how humans learn to talk: by coupling their articulation with listening.
0:09:36 | This auditory feedback is essential for speech acquisition.
0:09:48 | Children who are born deaf have difficulty producing clear speech,
0:09:54 | and even adults who are already proficient in a language can gradually lose clear speech articulation as a result of hearing impairment.
0:10:06 | The human brain thus performs a tight integration in speech processing:
0:10:11 | the auditory system is critically involved in the production of speech, and the motor system is critically involved in the perception of speech.
0:10:21 | Imaging studies support this coupling, showing, for example, motor responses when perceiving speech or a talking face, and auditory involvement during articulation.
0:10:36 | This means that speech perception and production are not separate, independent abilities.
0:10:45 | On the other hand, computers are also able to learn how to listen and how to speak.
0:10:52 | Via ASR, given speech, a machine learns how to listen and transcribe what people say,
0:11:02 | and via TTS, given text, it learns how to speak.
0:11:09 | But computers cannot hear their own voice:
0:11:12 | ASR and TTS are learned separately and independently,
0:11:17 | and therefore require large amounts of paired speech and text data for supervised training.
0:11:26 | So the question is: can we build a machine that can listen while speaking?
0:11:31 | Now let us discuss how to develop the machine speech chain framework.
0:11:38 | Our proposed approach is called the machine speech chain, based on deep learning.
0:11:42 | It is a closed-loop architecture that mimics human speech perception and production.
0:11:49 | The goal is a system that can not only listen or speak, but also listen while speaking.
0:11:56 | This is the standard ASR and TTS framework, in which the two models are trained independently.
0:12:03 | As mentioned before, via ASR the machine learns how to listen, and via TTS it learns how to speak.
0:12:12 | Now, here is the machine speech chain framework:
0:12:15 | we add a connection from ASR to TTS and from TTS to ASR.
0:12:21 | This means TTS receives the ASR output, and ASR receives the TTS output;
0:12:28 | in other words, the machine can listen to what it says.
0:12:32 | The key idea here is to train the ASR and TTS models jointly.
0:12:38 | Training combines supervised learning on paired speech-text data with unsupervised learning on unpaired data, where the closed loop lets ASR and TTS generate useful feedback for each other.
0:12:51 | When only paired data is involved, ASR and TTS are simply trained independently in the standard supervised way.
0:13:01 | In more detail:
0:13:04 | x is the original speech (feature sequence),
0:13:07 | y is the original text,
0:13:09 | x̂ is the predicted speech,
0:13:12 | and ŷ is the predicted text.
0:13:15 | The ASR transforms x into ŷ, using a sequence-to-sequence model that transcribes speech into text,
0:13:23 | and the TTS transforms y into x̂, also using a sequence-to-sequence model, for text-to-speech.
0:13:35 | Now consider the machine speech chain cases. The first case is when both speech and text data are available.
0:13:42 | Given a pair of speech and text, both models can be trained independently in a supervised manner.
0:13:49 | This is done by minimizing the loss between the predicted sequence and the ground-truth sequence:
0:13:56 | for ASR, by minimizing the loss between y and ŷ,
0:14:01 | and for TTS, by minimizing the loss between x and x̂.
0:14:09 | The second case is when only speech data is available; here the chain performs unsupervised learning.
0:14:16 | Given only the speech features x,
0:14:19 | the ASR predicts the most probable transcription ŷ,
0:14:23 | and based on ŷ, the TTS tries to reconstruct the speech features.
0:14:28 | The TTS loss is then calculated between the original speech features x
0:14:33 | and the predicted features x̂.
0:14:36 | Therefore, it is possible to improve TTS with speech-only data, with the support of ASR.
0:14:44 | Now consider the case where only text data is available.
0:14:49 | Given only the text features y,
0:14:52 | the TTS generates the speech features x̂,
0:14:55 | and based on x̂, the ASR tries to reconstruct the text sequence y.
0:15:00 | The ASR loss is calculated between the original text y
0:15:04 | and the predicted text ŷ.
0:15:07 | So it is possible to improve ASR with text-only data, with the support of TTS.
0:15:14 | The overall learning objective is to minimize the combined ASR and TTS losses: supervised losses when paired data is available, and unsupervised losses when only unpaired data is available.
0:15:26 | The two parts are balanced with coefficients α and β, so the models can learn from the new unpaired data without forgetting the paired data.
0:15:31 | If we set α and β greater than zero, a portion of the loss is still provided by the paired training set;
0:15:39 | if we set α to zero, learning proceeds completely unsupervised, with only speech or only text.
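A minimal sketch of this combined objective is below. The `asr` and `tts` modules are dummy stand-ins for the attention-based sequence-to-sequence models, so this only illustrates how the paired and unpaired losses are wired together; it is not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummySeq2Seq(nn.Module):
    """Stand-in for an attention-based seq2seq model (ASR or TTS)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, src):
        return self.proj(src)

feat_dim, vocab = 80, 32                 # mel bins, character classes (assumed)
asr = DummySeq2Seq(feat_dim, vocab)      # speech features -> text logits
tts = DummySeq2Seq(vocab, feat_dim)      # one-hot text -> speech features
alpha, beta = 0.5, 1.0                   # paired / unpaired loss weights

def paired_loss(x, y):
    """Supervised case: both speech x and text y are available."""
    asr_loss = F.cross_entropy(asr(x).reshape(-1, vocab), y.reshape(-1))
    tts_loss = F.l1_loss(tts(F.one_hot(y, vocab).float()), x)
    return asr_loss + tts_loss

def speech_only_loss(x):
    """x -> ASR -> y_hat -> TTS -> x_hat; only the TTS receives gradients."""
    with torch.no_grad():
        y_hat = asr(x).argmax(-1)        # pseudo transcription
    return F.l1_loss(tts(F.one_hot(y_hat, vocab).float()), x)

def text_only_loss(y):
    """y -> TTS -> x_hat -> ASR -> y_hat; only the ASR receives gradients."""
    with torch.no_grad():
        x_hat = tts(F.one_hot(y, vocab).float())   # pseudo speech
    return F.cross_entropy(asr(x_hat).reshape(-1, vocab), y.reshape(-1))

x = torch.randn(2, 50, feat_dim)          # a batch of speech feature sequences
y = torch.randint(0, vocab, (2, 50))      # a batch of character sequences
loss = alpha * paired_loss(x, y) + beta * (speech_only_loss(x) + text_only_loss(y))
loss.backward()
```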
0:15:48 | Here is the overall structure of the ASR: we use a sequence-to-sequence model similar to Listen, Attend and Spell, proposed by Chan et al.
0:15:59 | It has an encoder, a decoder, and an attention module.
0:16:03 | The input is x, the speech feature sequence,
0:16:08 | and the output is y, the text sequence.
0:16:12 | h is the encoder hidden state, s_t is the decoder state, and the attention module produces context information
0:16:21 | at time t by aligning the encoder and decoder hidden states.
0:16:27 | The loss function is the cross-entropy between the true y and the predicted ŷ, where C is the number of output classes.
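Written out in the usual Listen, Attend and Spell notation (reconstructed; the slide's exact symbols may differ):

```latex
\begin{aligned}
h &= \mathrm{Encoder}(x), \qquad s_t = \mathrm{Decoder}(y_{t-1}, s_{t-1}, c_t),\\
a_t(i) &= \operatorname{softmax}_i\big(\mathrm{score}(h_i, s_t)\big), \qquad c_t = \textstyle\sum_i a_t(i)\, h_i,\\
\mathcal{L}_{\mathrm{ASR}}(y, \hat{y}) &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} \mathbb{1}[y_t = c]\, \log p(\hat{y}_t = c \mid x, y_{<t}).
\end{aligned}
```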
0:16:37 | Similarly, the TTS is a sequence-to-sequence model: we use an architecture similar to Tacotron, also consisting of an encoder, decoder, and attention module.
0:16:47 | x^R denotes the linear spectrogram features, x^M the mel spectrogram features, and y the input text.
0:16:55 | h is the encoder state, s is the decoder state, and the attention module produces context information based on the encoder and decoder hidden states.
0:17:05 | Note that there are two kinds of losses: the first is the speech feature reconstruction loss,
0:17:10 | and the second is the end-of-speech prediction loss with binary cross-entropy.
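For a Tacotron-style decoder, the two losses just mentioned can be written as follows (a reconstruction under the assumption of squared-error feature losses and a binary end-of-speech flag e; the exact weighting in the original system may differ):

```latex
\mathcal{L}_{\mathrm{TTS}} \;=\; \underbrace{\|x^{M} - \hat{x}^{M}\|_2^2 \;+\; \|x^{R} - \hat{x}^{R}\|_2^2}_{\text{feature reconstruction}} \;+\; \underbrace{\mathrm{BCE}(e, \hat{e})}_{\text{end-of-speech prediction}}
```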
0:17:16 | Okay, let us discuss some experiments with the machine speech chain.
0:17:21 | For the speech features, we use a mel spectrogram
0:17:25 | together with a higher-dimensional linear spectrogram.
0:17:29 | The speech waveform is reconstructed by predicting the phase with the Griffin-Lim algorithm, followed by the inverse STFT.
0:17:37 | For the text, we use a character-level representation: the letters of the alphabet plus punctuation and special symbols.
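A sketch of this feature pipeline using librosa; the FFT and hop sizes here are placeholders, not the system's actual configuration:

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))   # any mono waveform

n_fft, hop = 2048, 256                        # assumed analysis parameters
# Linear-frequency magnitude spectrogram from the STFT
lin = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
# Mel spectrogram derived from the linear power spectrogram
mel = librosa.feature.melspectrogram(S=lin**2, sr=sr, n_mels=80)

# Waveform reconstruction: Griffin-Lim iteratively estimates the phase,
# then applies the inverse STFT to get back a time-domain signal.
y_hat = librosa.griffinlim(lin, n_fft=n_fft, hop_length=hop)
```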
0:17:44 | To evaluate the proposed method, we first experimented on a corpus with a single speaker,
0:17:50 | because almost all TTS systems at the time were trained on single-speaker datasets.
0:17:56 | We used a single-speaker dataset and split the training data into several portions.
0:18:09 | We simulated several situations:
0:18:12 | first, when the full training data is available as paired speech and text;
0:18:18 | second, when only a small portion has paired speech and text, and the rest is available as unpaired text-only or speech-only data;
0:18:28 | and last, when the additional training data is entirely unpaired speech and text.
0:18:35 | Here are the results; we use the character error rate (CER) to evaluate the ASR.
0:18:42 | This is the result when the system was trained with the full paired training data:
0:18:46 | we can see a CER of about 3.1 percent.
0:18:51 | When only a small portion has paired transcriptions and the remaining data is speech-only or text-only,
0:18:56 | training on the paired portion alone gives a CER around 21.7 percent, which is quite high.
0:19:04 | But by listening while speaking on the unpaired data, with the speech chain mechanism,
0:19:09 | the ASR and TTS can train each other and generate useful feedback.
0:19:15 | Results show that this improves the performance from about 21.7 percent down to 12.3 percent CER,
0:19:23 | and by utilizing still more of the remaining speech-only and text-only data, the CER eventually approached 3.5 percent,
0:19:33 | which is very close to the system trained with 100 percent paired data.
0:19:42 | Next, the TTS results.
0:19:45 | For the TTS experiment, we report the squared L2 norm between the predicted mel and linear spectrograms
0:19:51 | and the ground truth.
0:19:54 | Results show that an ASR-TTS model trained with only the small paired dataset gives a high reconstruction error,
0:20:02 | and that using the speech chain reduces it.
0:20:06 | The model with full paired training has a squared L2 norm of about 0.6,
0:20:12 | and with only ten percent paired data the loss becomes about 1.05.
0:20:19 | Then, by listening while speaking on the unpaired data, we also improved the TTS performance.
0:20:29 | In summary, inspired by the human speech chain, we proposed a machine speech chain that is able to listen while speaking and achieves semi-supervised learning.
0:20:40 | The mechanism enables ASR and TTS to teach each other when given unpaired data, by inferring the missing pair and optimizing a reconstruction loss.
0:20:52 | However, one limitation was that the system was not able to handle unseen speakers:
0:20:59 | the TTS could only mimic the voice of speakers seen in training, with the speaker identity given by a one-hot vector.
0:21:07 | Furthermore, the ASR could only benefit from speech of those same speakers, because the TTS was unable to produce the voice of an unseen speaker.
0:21:18 | Therefore, we set out to improve the capability of the machine speech chain.
0:21:26 | The aim is to handle the voice characteristics of unknown speakers.
0:21:31 | Here, we integrate a speaker recognition system into the speech chain loop,
0:21:35 | and we extend the TTS to mimic an unseen speaker's voice using one-shot speaker adaptation.
0:21:42 | Coupled with the ASR, this gives a speech chain framework that can handle speech from unknown speakers.
0:21:52 | The training mechanism: when only speech is available,
0:21:56 | the ASR predicts the most probable transcription ŷ,
0:22:00 | the speaker recognition module extracts a speaker embedding z, and based on ŷ and z the TTS tries to reconstruct the speech x̂.
0:22:09 | The TTS loss is calculated between the original speech features x and x̂.
0:22:17 | On the other hand, when only text is available,
0:22:20 | we sample a speaker factor z,
0:22:24 | and the TTS generates the speech features x̂ based on the text y and the speaker vector z.
0:22:30 | Then, given x̂, the ASR tries to recover the text ŷ,
0:22:34 | and the ASR loss is calculated between the original text y and the prediction.
0:22:42 | As a consequence, the ASR here is the same as in the basic machine speech chain,
0:22:46 | while the TTS differs in that it takes an additional input: the speaker factor.
0:22:52 | So now there are three kinds of loss functions:
0:22:55 | one is the speech reconstruction loss,
0:22:58 | the second is the end-of-speech prediction loss with cross-entropy,
0:23:03 | and the new one is the speaker embedding loss, which is the cosine distance between the original and the predicted speaker embedding.
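Putting the three terms together, the TTS objective with the speaker embedding takes roughly this form (reconstructed; the per-term weights in the paper may differ):

```latex
\mathcal{L}_{\mathrm{TTS}} \;=\; \|x - \hat{x}\|_2^2 \;+\; \mathrm{BCE}(e, \hat{e}) \;+\; \Big(1 - \frac{z^{\top}\hat{z}}{\|z\|\,\|\hat{z}\|}\Big)
```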
0:23:14 | We ran our experiment on a multi-speaker task using the Wall Street Journal (WSJ) dataset.
0:23:19 | We followed the standard setup, with the SI-84 and SI-284 splits as training sets.
0:23:29 | SI-84 consists of around 7,000 utterances, about 16 hours of speech, spoken by native speakers,
0:23:38 | while SI-284 consists of about 66 hours of speech spoken by 284 speakers.
0:23:47 | For development we used the dev93 set,
0:23:52 | and for evaluation the eval92 dataset.
0:23:57 | Here are the results.
0:24:00 | We first trained a baseline model using the paired SI-84 data only,
0:24:06 | and it achieved 17.75 percent CER.
0:24:10 | In the second row, we trained a model with the full paired SI-284 data, and it achieved around 7 percent CER.
0:24:20 | This is our upper-bound performance.
0:24:23 | In the last row, we trained the model with the speech chain in a semi-supervised way, using SI-84 as paired data and SI-284 as unpaired data.
0:24:34 | For comparison,
0:24:35 | we also performed semi-supervised training
0:24:38 | with a label-propagation method:
0:24:42 | we first trained initial models with the paired SI-84 data,
0:24:47 | then used those pretrained models to label the unpaired data.
0:24:51 | For the text-only SI-284 portion, the TTS generated the corresponding speech,
0:24:58 | and for the speech-only SI-284 portion, the ASR generated the corresponding text.
0:25:05 | After that, we trained new models with the resulting full training set.
0:25:09 | Our results show that label propagation reduced the CER to about 14.5 percent.
0:25:17 | Nevertheless,
0:25:18 | the speech chain model achieved a significantly larger improvement,
0:25:22 | reaching 9.86 percent CER,
0:25:26 | which is closer to the upper-bound result.
0:25:31 | Similarly, the TTS loss could also be reduced by training with the machine speech chain.
0:25:37 | Now we want to play some speech samples; they all synthesize the same sentence, ending "... the problem, they actually provide a solution."
0:25:42 | The first one is the baseline model, trained only with the small paired dataset:
0:25:48 | (audio sample plays)
0:25:53 | Then the model utilizing the unpaired SI-284 data via the speech chain:
0:25:57 | (audio sample plays)
0:26:03 | And the model trained with the full paired training set:
0:26:07 | (audio sample plays)
0:26:12 | Now we will also listen to the TTS with speaker embeddings.
0:26:17 | This is the baseline:
0:26:19 | (audio sample plays)
0:26:22 | this is with the speech chain:
0:26:24 | (audio sample plays)
0:26:28 | and this is the full model:
0:26:31 | (audio sample plays)
0:26:35 | You can hear that, with the speech chain,
0:26:38 | the quality improved significantly.
0:26:45 | To summarize, we improved the machine speech chain to handle the voice characteristics of speech from unknown speakers.
0:26:52 | The TTS can generate speech with a voice similar to the target speaker given only a single speaker example,
0:26:58 | and the ASR can also learn from synthesized speech that combines text with arbitrary voice characteristics.
0:27:06 | However, there is another limitation in the current framework.
0:27:11 | If we only have text, we run the unrolled TTS-then-ASR process, and only the ASR is updated with the reconstruction loss;
0:27:19 | on the other hand, if we only have speech data, we run the ASR-then-TTS process, and only the TTS gets updated.
0:27:27 | This is because backpropagating the reconstruction error to the ASR is challenging:
0:27:33 | note that the output of the ASR is discrete,
0:27:38 | so the chain is not differentiable end to end.
0:27:43 | We will now discuss our solution to handle backpropagation through the discrete output.
0:27:48 | The figure shows the speech chain with speaker embedding modules.
0:27:53 | In the original framework, the gradient of the TTS loss could not be propagated back to the ASR through ŷ, because ŷ is discrete.
0:28:01 | Our proposal to address this problem is to estimate the gradient through ŷ with a straight-through estimator.
0:28:10 | To understand why the gradient of this operation is problematic, consider the argmax function.
0:28:17 | Almost everywhere, a small change in the input does not change the output,
0:28:24 | so the gradient is zero;
0:28:27 | and at the points where the output does change, the gradient is infinite.
0:28:31 | So argmax is not usable for gradient-based learning.
0:28:35 | One way around this problem would be to use a continuous approximation, the softmax,
0:28:42 | but then the model fails to produce discrete outputs.
0:28:45 | The solution, introduced in two ICLR papers, is the Gumbel-softmax distribution:
0:28:52 | it provides a simple method to draw samples from a categorical distribution with given class probabilities.
0:29:00 | Let me explain in more detail.
0:29:03 | The main problem is that the discretization operation is not differentiable.
0:29:09 | The way around this issue is to use the softmax function as a differentiable approximation to the argmax,
0:29:17 | and there is an efficient way of sampling from the categorical distribution by adding Gumbel noise variables g to the log probabilities.
0:29:27 | A temperature parameter controls how closely the relaxed samples approximate discrete one-hot vectors:
0:29:34 | as the temperature approaches zero, the softmax smoothly approaches the argmax and the samples become effectively one-hot,
0:29:42 | while as the temperature grows large, the samples become close to uniform.
0:29:49 | In the straight-through version, the forward pass uses the discrete sample, while in the backward pass we replace its gradient with the gradient of the continuous Gumbel-softmax relaxation.
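A minimal sketch of straight-through Gumbel-softmax sampling, illustrating the trick described above (not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_st(logits, tau=1.0, eps=1e-20):
    """Forward pass: a discrete one-hot sample.
    Backward pass: the gradient of the continuous softmax relaxation."""
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + eps) + eps)        # Gumbel(0, 1) noise
    y_soft = F.softmax((logits + g) / tau, dim=-1)    # relaxed sample
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(logits).scatter_(-1, index, 1.0)
    # Value of y_hard in the forward pass; gradient of y_soft in the backward pass
    return y_hard + (y_soft - y_soft.detach())

logits = torch.randn(4, 10, requires_grad=True)       # e.g. ASR output scores
sample = gumbel_softmax_st(logits, tau=0.5)           # one-hot, yet differentiable
sample.sum().backward()                               # gradients reach `logits`
# PyTorch ships an equivalent: F.gumbel_softmax(logits, tau=0.5, hard=True)
```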
0:30:02 | In this experiment, we again used the multi-speaker WSJ data.
0:30:08 | Here are the results:
0:30:10 | with end-to-end backpropagation enabled, we obtained an 11 percent relative improvement compared to our previous framework.
0:30:21 | To summarize, we improved the machine speech chain mechanism by allowing backpropagation through the discrete output, using the straight-through Gumbel-softmax estimator.
0:30:30 | In the future, it will be necessary to further validate the effectiveness of the proposed approach.
0:30:38 | Next, an additional mechanism: we extend the speech chain into a multimodal chain.
0:30:47 | We know that in human communication, the most common way for humans to communicate is through speech.
0:30:53 | But a machine cannot fully ground what is being communicated without a connection to the world through other senses.
0:31:02 | Human communication is in fact multisensory, involving multiple communication channels:
0:31:08 | not only auditory but also visual channels.
0:31:12 | Humans perceive these multiple sources of information together to build a general concept.
0:31:20 | Of course, the idea of incorporating visual information into speech processing is not new;
0:31:25 | we already have, for example, audio-visual speech recognition (AVSR).
0:31:29 | But such models are usually built by simply concatenating the audio and visual information,
0:31:34 | fusing the modalities at the input,
0:31:37 | and this method usually requires the information from the different modalities to be present all together.
0:31:44 | In practice, however, not all modalities are always available.
0:31:50 | We have seen that the machine speech chain relaxes the requirement of fully paired speech and text data:
0:31:58 | it provides the ability to improve ASR and TTS performance with semi-supervised learning,
0:32:03 | by allowing ASR and TTS to assist each other given text-only or speech-only data.
0:32:09 | However, although it removes the requirement of fully paired data, it still requires data from the speech or text modality,
0:32:19 | so that study was limited to speech and text processing only.
0:32:24 | As mentioned before, human communication is actually multimodal, involving not only auditory but also visual information.
0:32:36 | We therefore proposed a multimodal chain to mimic overall human communication and enable learning across modalities.
0:32:44 | Specifically, we design a closed chain that includes automatic speech recognition (ASR), speech synthesis (TTS), image captioning (IC), and image generation (IG).
0:32:58 | These four components can be trained by assisting each other given incomplete data, propagating pseudo-pairs within the chain.
0:33:08 | So the question now is: can we still improve ASR even when no speech or text data is available?
0:33:18 | Similar to the speech chain, when we have fully paired speech, image, and text data,
0:33:25 | we can separately train the ASR, TTS, IC, and IG using supervised learning.
0:33:35 | Next is semi-supervised training,
0:33:37 | when we have images paired with speech, or images paired with text.
0:33:42 | The left side shows the case when the input is image-and-speech data:
0:33:47 | with ASR and IC, we generate text hypotheses from the speech and the image,
0:33:53 | then the speech and the image are reconstructed from the text, and the reconstruction losses can be backpropagated to improve TTS and IG.
0:34:02 | The right side is when the input is text-only data:
0:34:06 | TTS and IG generate speech and images, respectively,
0:34:12 | and ASR and IC reconstruct the text.
0:34:16 | In this way, ASR and IC can be updated from the reconstruction loss
0:34:22 | and improve their performance.
0:34:27 | Now consider the case where only a single modality is available.
0:34:32 | For example,
0:34:33 | when we have speech-only data,
0:34:35 | the speech is first transcribed by ASR,
0:34:37 | and the text hypothesis is then used to generate an image by IG.
0:34:42 | From the generated image, IC produces another text hypothesis,
0:34:47 | and the loss between the text hypotheses is used to improve the models in the loop.
0:34:51 | On the other hand, when we have image-only data, IC generates a text caption,
0:34:58 | the caption is synthesized into speech by the TTS model,
0:35:03 | and the synthesized speech is then transcribed by the ASR model.
0:35:08 | The loss is then computed between the ASR transcription and the intermediate caption.
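The two single-modality cycles can be sketched as follows; the modules are dummy stand-ins (the real components are seq2seq, captioning, and GAN models), so this only shows how the reconstruction losses are routed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stub(nn.Module):
    """Placeholder for ASR / TTS / IC / IG."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    def forward(self, x):
        return self.net(x)

asr, tts, ic, ig = Stub(), Stub(), Stub(), Stub()

def speech_only_cycle(speech):
    """speech -> ASR -> text -> IG -> image -> IC -> text."""
    text1 = asr(speech).detach()          # detach: the text is effectively discrete
    image = ig(text1)
    text2 = ic(image)
    return F.mse_loss(text2, text1)       # trains IG and IC

def image_only_cycle(image):
    """image -> IC -> caption -> TTS -> speech -> ASR -> text."""
    caption = ic(image).detach()
    speech = tts(caption)
    text = asr(speech)
    return F.mse_loss(text, caption)      # trains TTS and ASR

(speech_only_cycle(torch.randn(2, 64)) + image_only_cycle(torch.randn(2, 64))).backward()
```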
0:35:15 | Our main interest is to see whether image-only data can help to improve the ASR.
0:35:26 | We also created another architecture with a single multimodal chain,
0:35:31 | which we call MC2,
0:35:33 | because we wanted to investigate the possibility of applying the chain mechanism
0:35:39 | within a single model that handles multiple modalities together.
0:35:45 | In MC2, image and speech are processed together by one multimodal speech-to-text model when both are available.
0:35:54 | When the input is image-and-speech data,
0:35:58 | the multimodal speech-to-text model transcribes the images and speech into text,
0:36:03 | the TTS and IG then reconstruct the speech and the image from that text,
0:36:07 | and the reconstruction losses can be used to improve the TTS and IG.
0:36:12 | When the input is text only, we can compute the reconstruction loss between the text hypothesis and the original text,
0:36:20 | and by backpropagating this loss, the multimodal speech-to-text model can be improved too.
0:36:29 | The architectures of the ASR and TTS are similar to the ones we used in the machine speech chain.
0:36:35 | Now let me describe the architectures of the image captioning and image generation models.
0:36:41 | For IC, we use a neural image captioning model in the style of Show and Tell,
0:36:48 | and for IG, we use an AttnGAN-style text-to-image generation model trained with its adversarial losses.
0:36:57 | For the single multimodal chain, the speech-to-text model shares one decoder between the modalities,
0:37:03 | with a speech encoder and an image encoder feeding into it.
0:37:10 | We combine the output-layer probabilities for ASR and IC in order to introduce information sharing between the modalities,
0:37:17 | producing a single prediction.
0:37:20 | When only one modality is available, we simply use the corresponding encoder alone.
0:37:28 | For the experiments, we used the Flickr8k dataset:
0:37:32 | around 8,000 images, each paired with text captions.
0:37:36 | For speech, we used the corresponding spoken captions, about 65 hours of multi-speaker data collected by Harwath and Glass.
0:37:44 | We simulated conditions where not all modalities are available,
0:37:49 | to see how robustly the method performs when only single-modality data is present.
0:37:56 | We partitioned the data so that each subset has different modalities:
0:38:01 | one portion has paired speech, text, and images;
0:38:06 | another portion has all modalities, but unpaired;
0:38:12 | and the last portion has only speech or only images.
0:38:20 | Here are the results of our experiment.
0:38:23 | With the dual-loop chain (MC1), the baseline model trained only on the small paired subset gave 76.75 percent CER.
0:38:32 | With the speech chain using the unpaired speech and text data,
0:38:35 | this CER was reduced to around 15 percent,
0:38:41 | and then, by using the speech-only and image-only data, the CER was further reduced to around 12 percent.
0:38:49 | So the ASR could be improved even when no paired speech or text was available.
0:38:55 | We also see improvement in the other components; for example, the image generation model could be improved given only speech data.
0:39:03 | A similar tendency
0:39:05 | also held for the MC2 single chain:
0:39:09 | it also successfully reduced the CER, from 26.67 percent
0:39:15 | down to 20.72 percent.
0:39:23 | To summarize: the machine speech chain allows training in a semi-supervised fashion without fully paired data.
0:39:29 | Here we incorporated the chain mechanism into a multimodal chain, by jointly training the IC and IG models inside the loop,
0:39:38 | and the results showed that it is feasible to still improve ASR when only image data is available.
0:39:47 | Okay, back to our challenges in constructing a machine interpreter.
0:39:52 | We have discussed the first one: a machine that can listen while speaking.
0:39:57 | Now let us discuss the second challenge:
0:40:02 | problem two is how to develop incremental ASR and incremental TTS,
0:40:06 | so that recognition and synthesis can run in real time, seamlessly.
0:40:12 | Recall the standard pipeline for speech-to-speech translation, consisting of ASR, MT, and TTS.
0:40:19 | In this manner, the process of translation proceeds sentence by sentence:
0:40:24 | the ASR first recognizes the whole spoken sentence in the source language,
0:40:29 | then MT translates the recognized text into the other language,
0:40:33 | and finally TTS synthesizes the spoken sentence in the target language.
0:40:40 | This introduces a significant delay, especially because the complete sentence can be long and complicated.
0:40:46 | A simultaneous interpreter, in contrast, must translate the incoming speech stream from the source language into the target language in real time.
0:40:56 | The process then looks like this: (demo plays)
0:41:09 | So one key challenge is the development of incremental ASR,
0:41:13 | and here we discuss our solution for developing a neural incremental ASR (ISR).
0:41:21 | The difficulty in constructing an incremental ASR is that the model needs to decide the incremental step and output a transcription
0:41:29 | aligned with the corresponding speech segment.
0:41:36 | As we know, attention-based sequence-to-sequence models normalize the attention probabilities over the whole input when computing the weighted sum of encoder states,
0:41:46 | and in practice output is generated only after the full utterance has been encoded.
0:41:49 | This means the system can only generate text output after receiving the entire input sequence;
0:41:56 | consequently, utilizing it in situations that require immediate recognition is difficult.
0:42:05 | To address this limitation, several approaches to streaming neural ASR have been proposed.
0:42:09 | One approach is to use local or monotonic attention;
0:42:12 | another combines a unidirectional encoder
0:42:15 | with a CTC acoustic model;
0:42:22 | and Jaitly et al. proposed the neural transducer model, which incrementally recognizes the input speech.
0:42:34 | However, most existing neural ISR models use specialized frameworks and learning algorithms
0:42:40 | that are not fully compatible with standard attention-based neural ASR.
0:42:44 | Our solution is to keep the original architecture of attention-based ASR, a sequence-to-sequence model,
0:42:53 | and to perform attention transfer, where the standard ASR acts as a teacher that trains the ISR as a student model,
0:43:01 | so the ISR can mimic the attention alignments produced by the full-utterance ASR.
0:43:08 | This is the overall structure.
0:43:10 | The left one is the teacher model, which is the non-incremental ASR,
0:43:15 | while the right one is the student model, which is the incremental ASR,
0:43:21 | and this is the attention transfer
0:43:24 | from the teacher model
0:43:26 | to the student model.
0:43:28 | In training, the ISR keeps exactly the same architecture and input features, and it learns from
0:43:34 | the attention alignment of the non-incremental model:
0:43:39 | the teacher's alignment tells the student
0:43:42 | which block of the input to attend to when producing each output token.
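The attention-transfer objective can be sketched as the transcription loss plus a penalty pulling the student's attention toward the teacher's alignment (an assumed form; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def isr_loss(student_logits, y, student_attn, teacher_attn, gamma=1.0):
    """student_logits: (T, vocab); y: (T,); *_attn: (T, S) attention weights."""
    ce = F.cross_entropy(student_logits, y)          # transcription loss
    attn = F.mse_loss(student_attn, teacher_attn)    # alignment-transfer loss
    return ce + gamma * attn

T, S, vocab = 20, 100, 32
loss = isr_loss(torch.randn(T, vocab),
                torch.randint(0, vocab, (T,)),
                torch.softmax(torch.randn(T, S), -1),   # student attention
                torch.softmax(torch.randn(T, S), -1))   # teacher attention
```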
0:43:50 | Now let me show the performance on the evaluation data.
0:43:54 | This is the performance of a standard ASR reported in a prior publication,
0:43:59 | and this is our standard ASR.
0:44:02 | And these are the results of our incremental ASR.
0:44:08 | As you can see, the ISR successfully reduces the recognition delay
0:44:12 | while maintaining performance comparable to the full-utterance ASR.
0:44:23 | In summary, we constructed an incremental ASR that keeps the original architecture of attention-based neural ASR.
0:44:30 | We performed attention-transfer knowledge distillation,
0:44:32 | in which the standard ASR acts as the teacher model and the ISR is trained as the student model.
0:44:38 | Experimental results showed that the ISR reduces the delay
0:44:42 | and still achieves performance comparable to the standard, non-incremental ASR.
0:44:50 | Now let us discuss how to develop a neural incremental TTS.
0:44:57 | Similar to the ISR problem,
0:44:59 | the challenge for an incremental TTS is that the model must start producing speech
0:45:05 | upon receiving only partial text input, without waiting for the complete sentence.
0:45:12 | To handle such short units,
0:45:16 | we augment the training data by randomly splitting each sentence into shorter sequences,
0:45:21 | and we attach begin and end symbols to the input text.
0:45:28 | Here we use different symbols to indicate the unit's location within the full sentence:
0:45:34 | for example, one marks the start of the utterance,
0:45:38 | one marks the middle, and one marks the end.
0:45:45 | Note that the model is still based on Tacotron;
0:45:47 | the main change is in how it is trained:
0:45:50 | we moved from training sentence by sentence to training on shorter units,
0:45:54 | without much modification to the original architecture.
0:46:03 | In this experiment, we used a Japanese single-speaker dataset that includes about 7,000 utterances,
0:46:14 | spoken by a single female speaker.
0:46:17 | The input text representation consists of about 45 symbols, covering phonemes and accent types.
0:46:27 | This figure shows the naturalness, as a mean opinion score (MOS), of the incrementally synthesized Japanese speech.
0:46:35 | Note that we did not use a neural vocoder, so there is a noticeable quality gap between generated speech and natural speech.
0:46:41 | Nevertheless, here we can see how the synthesis quality changes with different incremental unit sizes:
0:46:48 | synthesizing only one accent phrase at a time
0:46:53 | gives an MOS of almost 2.0,
0:46:57 | and the synthesized speech quality improves when going from one unit to two or three connected units.
0:47:04 | Let me play some examples.
0:47:07 | The first one is incremental synthesis with one accent phrase at a time:
0:47:12 | (audio sample plays)
0:47:16 | This is for two accent phrases:
0:47:19 | (audio sample plays)
0:47:23 | And this is for three accent phrases:
0:47:25 | (audio sample plays)
0:47:29 | This is for the whole sentence: (audio sample plays)
0:47:34 | And this is the natural speech: (audio sample plays)
0:47:39 | You can hear the difference between the shortest units and the whole sentence.
0:47:45 | The results suggest that, for Japanese incremental neural TTS, the recommended synthesis unit, trading off delay and quality,
0:47:53 | lies between a few accent phrases and the whole sentence.
0:47:59 | To summarize, we developed an incremental neural TTS that synthesizes from units shorter than one sentence.
0:48:04 | Experiments revealed that linguistic information beyond a single accent phrase is critical, and that features of the upcoming context are beneficial,
0:48:12 | and that the minimum incremental unit for acceptable quality lies between a few accent phrases and the whole sentence.
0:48:21 | Now we discuss how we combine everything, the incremental ASR and TTS together, toward a real-time machine speech chain for an interpreter.
0:48:32 | We reported earlier on the machine speech chain, where a closed-loop connection gives the machine the ability to listen while speaking.
0:48:43 | There are two processes: from ASR to TTS,
0:48:47 | and from TTS to ASR.
0:48:50 | But it worked only at the sentence level;
0:48:53 | because of that, it requires a long delay, especially when encountering long input sequences.
0:49:01 | In contrast,
0:49:02 | humans listen to what they speak in real time,
0:49:06 | and if there is a delay in their auditory feedback,
0:49:08 | they are unable to continue speaking properly.
0:49:11 | This shows the importance of a short-delay,
0:49:16 | real-time feedback mechanism.
0:49:22 | Here we propose an incremental machine speech chain, in which we connect an incremental ASR (ISR)
0:49:27 | and an incremental TTS (ITTS) in a short-term feedback loop.
0:49:31 | The aim is to reduce the delay and improve the ISR and ITTS quality
0:49:36 | by letting them assist each other within short sequences.
0:49:44 | The learning mechanism of the incremental speech chain is similar to the one in the basic machine speech chain;
0:49:50 | the difference is that
0:49:52 | the components exchange short segments rather than whole sentences.
0:49:56 | The feedback loop again consists of two processes:
0:50:00 | from ISR to ITTS, and from ITTS to ISR.
0:50:05 | In the ISR-to-ITTS process,
0:50:08 | at each incremental step,
0:50:10 | the ISR transcribes a short segment of speech into the corresponding text,
0:50:13 | the ITTS then synthesizes speech based on the ISR text output,
0:50:17 | and the loss is calculated by comparing the original speech segment
0:50:23 | with the speech generated by the ITTS.
0:50:32 | We repeat this process until the end of the speech.
0:50:43 | The second process is from ITTS to ISR.
0:50:49 | Similarly, given a text, for example with some context at the front,
0:50:54 | we begin by taking a short segment of the text, and the ITTS synthesizes
0:50:59 | the corresponding speech based on this segment.
0:51:05 | The ISR then predicts the text based on the synthesized speech,
0:51:08 | and the loss for the ISR is calculated by comparing the ISR text output
0:51:12 | with the original text segment.
0:51:16 | We repeat the same process until the end of the text.
0:51:26 | Again, in this experiment we investigated the performance on the same evaluation data.
0:51:32 | These are the results of the standard, non-incremental ASR and TTS,
0:51:36 | and these are the results for the ISR and ITTS.
0:51:41 | Along the horizontal axis we vary the training condition,
0:51:44 | from the ISR and ITTS
0:51:48 | trained independently,
0:51:52 | to training with the incremental speech chain, with and without the feedback between the components.
0:51:59 | For the ISR, we compute the CER given natural speech input versus synthesized speech fed back from the ITTS;
0:52:06 | similarly, for the ITTS, we compute the loss given ground-truth text versus text predicted
0:52:11 | by the ISR.
0:52:14 | This was done to investigate whether the quality of the feedback
0:52:20 | affects the overall performance.
0:52:23 | As you can see, the CER decreased from the baseline by a large margin,
0:52:30 | down to about 14 percent with the semi-supervised chain,
0:52:34 | and with both feedback directions of the incremental speech chain,
0:52:38 | the improvement held even with recognized rather than ground-truth input,
0:52:42 | reducing the CER further, to about 12 percent.
0:52:48 | The ITTS performance also improved when it was trained with the incremental speech chain.
0:52:56 | In summary, the incremental machine speech chain reduced the delay
0:53:00 | while improving both components, operating on short segments.
0:53:08 | Okay, now let me review the overall solutions and future directions.
0:53:15 | We have demonstrated the basic machine speech chain, which is able to listen while speaking
0:53:21 | and to handle speaker identities,
0:53:24 | and we utilized unpaired data to achieve semi-supervised learning.
0:53:29 | We have also developed an incremental ASR and an incremental TTS,
0:53:34 | and then combined the ISR and ITTS into an incremental machine speech chain with a short-term feedback loop.
0:53:42 | In the future, we will move toward a real-time machine interpreter
0:53:46 | that listens, translates, and speaks simultaneously and incrementally.
0:53:53 | These are the references used in this talk,
0:53:57 | including our publications if you would like to read further:
0:54:01 | the basic machine speech chain framework,
0:54:04 | the machine speech chain with one-shot speaker adaptation,
0:54:06 | the multimodal machine chain, and the incremental ASR, TTS, and incremental machine speech chain.
0:54:16 | This is the end of the presentation. If you have any questions,
0:54:21 | please post them in the Q&A session. Thank you.
---|