0:00:13 | bush water on |
---|
0:00:15 | don't ask me to carry in one |
---|
0:00:19 | but okay no shot and so uh but and for uh |
---|
0:00:24 | three sessions |
---|
0:00:26 | um |
---|
0:00:27 | for |
---|
0:00:28 | after lunchtime |
---|
0:00:30 | uh i think we were |
---|
0:00:32 | there |
---|
0:00:32 | until about five to six o'clock |
---|
0:00:36 | and family |
---|
0:00:37 | and took pictures of uh of the poster sessions |
---|
0:00:40 | and the sessions that had |
---|
0:00:43 | a few people at the end |
---|
0:00:45 | in speech and language processing so people in speech and language |
---|
0:00:50 | decided to kind of stay to the end yeah so |
---|
0:00:52 | thank you very much for coming to it |
---|
0:00:55 | so uh uh i'll just go |
---|
0:00:56 | into it |
---|
0:00:57 | so uh i wanted first to |
---|
0:00:59 | say just a couple of words |
---|
0:01:01 | about the speech and language technical committee |
---|
0:01:04 | most of you know that a conference like icassp has more than ten |
---|
0:01:09 | technical committees |
---|
0:01:11 | and the speech and language technical committee |
---|
0:01:12 | is the largest with fifty three members |
---|
0:01:15 | uh a a for the notion of a should |
---|
0:01:18 | uh because we have a large number of papers submitted at icassp so |
---|
0:01:23 | still |
---|
0:01:24 | uh the number of submitted papers |
---|
0:01:27 | uh dropped over the last |
---|
0:01:31 | couple of years partly because |
---|
0:01:33 | we have a separate conference interspeech |
---|
0:01:36 | that focuses on speech and language processing and has been increasing in a rather significant way |
---|
0:01:42 | a |
---|
0:01:43 | uh |
---|
0:01:43 | the paper try since you can can |
---|
0:01:45 | and |
---|
0:01:46 | initial |
---|
0:01:47 | we have about a third of the papers that are accepted at icassp |
---|
0:01:52 | in the speech and language processing field and so |
---|
0:01:54 | uh that is a lot to cover |
---|
0:01:58 | uh about seven hundred submissions and around four hundred accepted papers |
---|
0:02:03 | and if we spent even five seconds on each one |
---|
0:02:06 | a session of those would take thirty minutes by itself |
---|
0:02:11 | um |
---|
0:02:12 | but we decided not to do that |
---|
0:02:14 | okay |
---|
0:02:14 | uh that's true |
---|
0:02:16 | a uh i was spent |
---|
0:02:19 | a under a giant |
---|
0:02:22 | a a but uh |
---|
0:02:24 | a a a a from a a a a has a big impact of interest |
---|
0:02:28 | was not able to attend a of the input here so |
---|
0:02:32 | a lot of uh chin folks kinda again |
---|
0:02:35 | um we encourage questions though we're going to have to kind of try to cover a lot so |
---|
0:02:40 | uh uh we have tried to make sure that we have a microphone as well so |
---|
0:02:45 | uh i think |
---|
0:02:46 | german one going for |
---|
0:02:48 | okay |
---|
0:02:50 | um i'll go very quickly so if anything |
---|
0:02:54 | i say |
---|
0:02:55 | um if anything i said is wrong or i missed anything please point it out |
---|
0:02:58 | and i you need three |
---|
0:03:00 | to Q is set |
---|
0:03:01 | just |
---|
0:03:01 | i'm i |
---|
0:03:03 | why is that |
---|
0:03:04 | uh because there were over three hundred papers it is really hard |
---|
0:03:07 | uh to go through all of them and summarize each one |
---|
0:03:10 | section if die on it up now of a see that yeah |
---|
0:03:13 | um B |
---|
0:03:16 | um this is |
---|
0:03:17 | uh thanks to a couple of people that helped with this talk to try to do a good job of it |
---|
0:03:21 | um so there is three hundred twenty five papers uh two hundred and fifty of them are on speech |
---|
0:03:26 | and seventy five of them are on language processing |
---|
0:03:29 | oh according to how the conference |
---|
0:03:31 | assigned them to different sessions in some cases it is arguable since some papers |
---|
0:03:34 | could be in both |
---|
0:03:36 | um |
---|
0:03:37 | so they are split into two parts um i will cover around one hundred of them |
---|
0:03:42 | uh the um |
---|
0:03:43 | uh the papers in language processing and uh in speech processing but not tts and |
---|
0:03:48 | asr |
---|
0:03:49 | and that will also cover speaker id uh including |
---|
0:03:54 | speaker verification and recognition and speaker |
---|
0:03:58 | diarization |
---|
0:03:59 | and uh i will also touch on speech enhancement |
---|
0:04:03 | a little bit |
---|
0:04:05 | and so first is language modeling |
---|
0:04:07 | um on the top right the table shows the number of papers in each field of |
---|
0:04:11 | uh language processing |
---|
0:04:13 | so uh a couple of things worth mentioning here for example the |
---|
0:04:18 | um model m based exponential model um the class based and the neural network |
---|
0:04:23 | language models long span models and dynamic language model adaptation |
---|
0:04:28 | um discriminative models |
---|
0:04:31 | and uh i think there are many others these are just a couple |
---|
0:04:33 | and as i was going through the papers what is common in them |
---|
0:04:37 | um the um |
---|
0:04:39 | uh there are a couple of papers on computation |
---|
0:04:42 | um uh optimization how to |
---|
0:04:45 | train a language model on large scale data uh so distributed uh training fast |
---|
0:04:51 | recurrent neural network model training and how to manage long span models |
---|
0:04:54 | and there is a common data set people work on |
---|
0:04:58 | and uh spoken document processing here the tasks are document summarization classification and speaker |
---|
0:05:05 | role identification um and the approaches are typical of machine learning |
---|
0:05:09 | um uh and there is also emotion classification and |
---|
0:05:13 | uh translation and semantic classification and so on |
---|
0:05:17 | um there are two sessions worth of papers on this topic |
---|
0:05:20 | including um understanding and voice search though um different papers probably use different terms but they are |
---|
0:05:27 | uh focused on how to use uh um intent and queries for search |
---|
0:05:32 | and then and that there is a um |
---|
0:05:34 | there are um papers using dbns and i think you will see lots of papers on |
---|
0:05:38 | dbns used for language |
---|
0:05:40 | um uh understanding and for call routing |
---|
0:05:43 | um speech translation uh covering how you can tie |
---|
0:05:48 | speech recognition and |
---|
0:05:50 | um translation together and whether uh asr word accuracy probably is not a good metric |
---|
0:05:56 | for uh um for speech translation |
---|
0:05:59 | um bilingual audio uh subtitle extraction and there are |
---|
0:06:02 | many others i think |
---|
0:06:04 | that i probably did not list here |
---|
0:06:07 | um |
---|
0:06:08 | uh paralinguistic and non linguistic features |
---|
0:06:11 | uh these are very interesting use cases |
---|
0:06:14 | you could think of what you can do for example from speech you can do emotion detection |
---|
0:06:18 | um recognizing non lexical events and so on |
---|
0:06:21 | um cognitive load classification |
---|
0:06:25 | um trying to |
---|
0:06:26 | um um guess |
---|
0:06:28 | when there is someone talking to a computer trying to guess you know how much you |
---|
0:06:31 | are thinking |
---|
0:06:32 | um |
---|
0:06:34 | and perceptual differences for instance for language learning or speech assessment things like that |
---|
0:06:39 | um you generating a a traffic trucks pressure |
---|
0:06:41 | there are papers on this kind of topic |
---|
0:06:43 | um some of them are pretty new |
---|
0:06:46 | um spoken term detection |
---|
0:06:48 | it's trying to um now um you know given a huge uh |
---|
0:06:52 | audio file or a video trying to retrieve a list of |
---|
0:06:55 | spoken utterances given a voice query a spoken term which you just speak |
---|
0:07:01 | um the approaches are uh dynamic time warping subword recognition and lattice or graph based approaches |
---|
0:07:07 | um there's a common data set for this |
---|
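The dynamic time warping mentioned here can be sketched in a few lines. A toy version (not from any of the cited papers) that aligns a spoken-query feature sequence against an utterance's feature sequence using Euclidean frame distance:

```python
# Toy dynamic time warping (DTW) for query-by-example spoken term
# detection. Assumes feature frames (e.g. MFCC vectors) are already
# extracted; frames are plain tuples of floats here.

def dtw_distance(query, utterance):
    """DTW alignment cost between two frame sequences (lower = closer)."""
    INF = float("inf")
    n, m = len(query), len(utterance)
    # cost[i][j] = best cost aligning query[:i] with utterance[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((a - b) ** 2
                    for a, b in zip(query[i - 1], utterance[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the query
                                 cost[i][j - 1],      # stretch the utterance
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Ranking utterances by this cost against the spoken query is the essence of query-by-example search; real systems add band constraints and score normalization.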
0:07:12 | and then dialogue um |
---|
0:07:13 | there are um for this there are not many |
---|
0:07:16 | for this conference there are only five papers |
---|
0:07:19 | well but i mean for dialogue um uh uh |
---|
0:07:22 | you know the trend is you know if you look back |
---|
0:07:25 | a couple of years ago probably there were not many approaches of the statistical kind |
---|
0:07:28 | now most papers focus on this statistical approach there are two |
---|
0:07:32 | sides |
---|
0:07:33 | one is you track a distribution over all the possible states |
---|
0:07:37 | the second part is when you put it all together it becomes a pomdp i think |
---|
0:07:42 | um uh there are several papers in the conference on pomdps |
---|
0:07:46 | um and so |
---|
0:07:47 | there's a list of them here specifically so we're not going to go through those |
---|
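The "distribution over all the possible states" idea is belief tracking; a minimal sketch of the Bayes update behind it, with illustrative goal names and probabilities (not taken from any paper):

```python
# Minimal belief-state update for dialog state tracking: keep a
# distribution over possible user goals and update it with each
# observation (e.g. an ASR hypothesis with a confidence score).
# The states and likelihoods below are illustrative only.

def update_belief(belief, likelihood):
    """Bayes update: posterior(s) is proportional to likelihood(obs|s) * prior(s)."""
    posterior = {s: likelihood.get(s, 0.0) * p for s, p in belief.items()}
    z = sum(posterior.values())
    if z == 0.0:
        return belief  # uninformative observation: keep the prior
    return {s: p / z for s, p in posterior.items()}

belief = {"flight": 0.5, "hotel": 0.5}   # uniform prior over user goals
obs = {"flight": 0.8, "hotel": 0.2}      # P(ASR result | goal), made up
belief = update_belief(belief, obs)      # belief["flight"] is now 0.8
```

In the POMDP view, the dialog policy then chooses actions as a function of this belief rather than of a single best state hypothesis.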
0:07:51 | um language identification um |
---|
0:07:54 | there are six uh papers in one session |
---|
0:07:58 | they try to use phonetic and prosodic features or a combination of them |
---|
0:08:02 | uh to identify the language and if you look at the approaches um how to do |
---|
0:08:07 | it it's uh you know um classification |
---|
0:08:10 | uh i can see a couple of papers on logistic regression on n grams |
---|
0:08:14 | and several of them work on the same data set |
---|
0:08:17 | all of this uh |
---|
0:08:19 | trying to guess the language being used |
---|
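A toy version of the logistic-regression-on-n-grams idea mentioned here, using character bigrams of text as a stand-in for the phone n-grams a real system would get from a recognizer (the training sentences and labels are made up):

```python
# Toy language ID: binary logistic regression over character bigram
# counts, trained with plain stochastic gradient descent.
import math
from collections import Counter

def bigrams(text):
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def train(samples, labels, epochs=200, lr=0.5):
    """samples: list of strings; labels: 1 or 0. Returns (weights, bias)."""
    feats = [bigrams(s) for s in samples]
    w, b = {}, 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = b + sum(w.get(g, 0.0) * c for g, c in f.items())
            p = 1.0 / (1.0 + math.exp(-z))
            err = y - p
            b += lr * err
            for g, c in f.items():
                w[g] = w.get(g, 0.0) + lr * err * c
    return w, b

def predict(model, text):
    w, b = model
    z = b + sum(w.get(g, 0.0) * c for g, c in bigrams(text).items())
    return 1 if z > 0 else 0

model = train(["the cat sat", "she said the", "der hund sass", "sie sagte der"],
              [1, 1, 0, 0])   # 1 = english, 0 = german (toy data)
```

A real system would use n-gram counts over phone sequences from a phone recognizer, one classifier per language pair or a multiclass model.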
0:08:21 | um lexicon modeling is trying to use machine learning to automatically generate the pronunciation from |
---|
0:08:28 | the given word |
---|
0:08:29 | um and there are a couple of new learning uh |
---|
0:08:33 | uh approaches introduced in that |
---|
0:08:35 | um multilingual and multichannel processing |
---|
0:08:38 | the task here is that you know you have mixed language input and how do you |
---|
0:08:42 | uh do the asr um and uh index and search and um what matching you can set up |
---|
0:08:48 | the approaches are very diverse so i did not list them here |
---|
0:08:51 | um |
---|
0:08:52 | speech analysis this is a field i really um don't know much about so i uh |
---|
0:08:55 | um |
---|
0:08:57 | just tried to cover the topics that were there like emotion detection |
---|
0:09:00 | um |
---|
0:09:01 | i you know on sing this level you kind to |
---|
0:09:03 | um |
---|
0:09:05 | it |
---|
0:09:05 | for example when emotion changes you can see the relationship between emotion and f zero range |
---|
0:09:10 | um |
---|
0:09:10 | uh emotion you know including detecting anger and so on and so forth |
---|
0:09:14 | and duration modeling and uh log f zero contours |
---|
0:09:18 | um um and pitch frequency estimation and so on |
---|
0:09:22 | um |
---|
0:09:23 | the approaches i see you know a couple of things probably um not new but there are the common classes of |
---|
0:09:28 | papers |
---|
0:09:29 | um for example uh the phase locked loops |
---|
0:09:33 | and there is a common data set on that |
---|
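The pitch (F0) estimation mentioned here is classically done by autocorrelation; a pure-Python toy (not the phase-locked-loop method from the papers) that recovers the F0 of a synthetic voiced frame:

```python
# Autocorrelation F0 estimation: pick the lag that maximizes the
# autocorrelation of one voiced frame, then convert lag to frequency.
import math

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=500.0):
    """Return the F0 in Hz whose period best matches the frame."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag

# a 200 Hz sine sampled at 8 kHz; the estimator should recover ~200 Hz
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
```

This simple picker is exactly what breaks down when music is mixed in, which is why robust pitch tracking keeps reappearing as a topic.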
0:09:35 | um |
---|
0:09:36 | as i said this is a field i don't know much about so if the session chairs are around or would |
---|
0:09:40 | know this better please comment on |
---|
0:09:42 | what's what's missing here that i didn't cover |
---|
0:09:45 | uh speech enhancement |
---|
0:09:46 | um |
---|
0:09:48 | um you know the task here is trying to |
---|
0:09:50 | separate speech versus non speech and noise |
---|
0:09:53 | um |
---|
0:09:55 | a there is |
---|
0:09:56 | okay |
---|
0:09:57 | um you can see it in the slides but uh there is i i think |
---|
0:10:00 | compared to previous conferences a bit more emphasis on music noise |
---|
0:10:05 | um |
---|
0:10:06 | there are many approaches here um |
---|
0:10:09 | you know |
---|
0:10:09 | some have hidden markov models in them uh |
---|
0:10:12 | and the well known approaches like uh wiener filtering and so on |
---|
0:10:17 | um i have a long list here i think i don't have time to go through them all |
---|
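The Wiener filtering mentioned above amounts to a per-frequency gain computed from estimated speech and noise power; a sketch with illustrative spectra (the numbers are made up):

```python
# Textbook Wiener-filter gain per frequency bin: gain = S / (S + N),
# where S and N are estimated speech and noise power spectra. Applying
# the gain to the noisy spectrum attenuates noise-dominated bins.

def wiener_gain(speech_power, noise_power):
    """Per-bin Wiener gain from estimated speech/noise power spectra."""
    return [s / (s + n) if (s + n) > 0 else 0.0
            for s, n in zip(speech_power, noise_power)]

def apply_gain(noisy_spectrum, gain):
    return [x * g for x, g in zip(noisy_spectrum, gain)]

speech_pow = [9.0, 4.0, 1.0]   # illustrative speech power per bin
noise_pow  = [1.0, 4.0, 9.0]   # illustrative noise power per bin
gain = wiener_gain(speech_pow, noise_pow)   # about [0.9, 0.5, 0.1]
```

The hard part in practice is estimating the speech and noise power spectra, which is where the HMM-based and model-based methods come in.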
0:10:22 | and uh |
---|
0:10:23 | speaker verification |
---|
0:10:24 | um overall there are forty eight papers on this topic including |
---|
0:10:29 | speaker diarization um and i think that is more than the previous two conferences probably one reason |
---|
0:10:36 | um i don't know may be the relevance to the nist speaker recognition evaluation |
---|
0:10:40 | um |
---|
0:10:42 | as i browsed through |
---|
0:10:44 | the papers there are |
---|
0:10:46 | a couple of things just |
---|
0:10:47 | just highlights i think |
---|
0:10:49 | it is very very hard to summarize |
---|
0:10:51 | um the i-vector space and uh probabilistic lda |
---|
0:10:54 | and uh the evaluation papers from nist are there |
---|
0:10:57 | and there is work on fusion that is if you have several uh speaker recognition systems |
---|
0:11:04 | how do you fuse the results |
---|
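The fusion question can be illustrated with the simplest scheme, a weighted sum of per-system scores; in practice the weights are trained on a development set (e.g. by logistic regression calibration), but here they are fixed for illustration:

```python
# Linear score fusion for combining several speaker recognition
# systems: the fused score for one trial is a weighted sum of the
# per-system scores, compared to a decision threshold.

def fuse(scores, weights, bias=0.0):
    """Weighted-sum fusion of per-system scores for one trial."""
    return bias + sum(w * s for w, s in zip(weights, scores))

# Three systems score the same trial; system 1 is trusted most.
trial_scores = [2.0, 0.5, -1.0]
weights = [0.6, 0.3, 0.1]
fused = fuse(trial_scores, weights)   # 1.25
decision = fused > 0.0                # accept if above the threshold
```

Because the component systems make partly independent errors, even this simple combination usually beats the best single system.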
0:11:05 | um |
---|
0:11:08 | okay the second one here speaker diarization is figuring out who |
---|
0:11:11 | spoke when in an audio stream or a meeting |
---|
0:11:14 | um |
---|
0:11:15 | yeah |
---|
0:11:16 | uh just a few things which i can summarize this into |
---|
0:11:17 | uh top down and bottom up clustering |
---|
0:11:20 | um |
---|
0:11:20 | how to uh use prosodic features or acoustic features and there is a bayesian approach |
---|
0:11:25 | um there is the information bottleneck based approach a couple of papers on that |
---|
0:11:29 | and new advances |
---|
0:11:30 | being seen in this field |
---|
0:11:32 | um robust asr i think in this uh |
---|
0:11:35 | um |
---|
0:11:36 | there are lots of papers here i'm sure i will miss something |
---|
0:11:40 | um so i put them into several categories the first one uh signal processing |
---|
0:11:44 | and in signal processing i think i see compressed sensing |
---|
0:11:46 | uh you can use compressed sensing on parts of the asr front end too |
---|
0:11:50 | um non negative matrix factorization |
---|
0:11:53 | um how to use uh transforms of the spectrum |
---|
0:11:57 | and there are other approaches i have a long list here |
---|
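The non-negative matrix factorization mentioned here is usually fit with the standard multiplicative updates; a small pure-Python sketch (illustrative, not from any specific paper), factoring a matrix V into nonnegative W and H so that V is approximately W times H:

```python
# NMF with the classic multiplicative updates, as used for decomposing
# a magnitude spectrogram V (bins x frames) into spectral bases W and
# activations H.
import random

def nmf(V, rank, iters=200, eps=1e-9):
    n, m = len(V), len(V[0])
    rnd = random.Random(0)          # fixed seed for reproducibility
    W = [[rnd.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rnd.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        WH = [[sum(W[i][k] * H[k][j] for k in range(rank))
               for j in range(m)] for i in range(n)]
        # H <- H * (W^T V) / (W^T W H)
        for k in range(rank):
            for j in range(m):
                num = sum(W[i][k] * V[i][j] for i in range(n))
                den = sum(W[i][k] * WH[i][j] for i in range(n)) + eps
                H[k][j] *= num / den
        WH = [[sum(W[i][k] * H[k][j] for k in range(rank))
               for j in range(m)] for i in range(n)]
        # W <- W * (V H^T) / (W H H^T)
        for i in range(n):
            for k in range(rank):
                num = sum(V[i][j] * H[k][j] for j in range(m))
                den = sum(WH[i][j] * H[k][j] for j in range(m)) + eps
                W[i][k] *= num / den
    return W, H
```

For enhancement or separation, some bases are associated with speech and others with noise, and the speech part is reconstructed from its bases and activations alone.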
0:12:00 | and |
---|
0:12:01 | and features so how to you know there are lots of feature extraction approaches uh tandem based and so on |
---|
0:12:06 | um |
---|
0:12:07 | there are say uh a couple of papers on how to use neural networks to generate the tandem |
---|
0:12:11 | features |
---|
0:12:12 | um |
---|
0:12:13 | logistically smart mapping |
---|
0:12:15 | and noise robust feature normalization |
---|
0:12:17 | uh and there are different models |
---|
0:12:19 | um |
---|
0:12:20 | um it is |
---|
0:12:22 | quite a diverse |
---|
0:12:23 | collection |
---|
0:12:24 | so i don't think we can cover it all |
---|
0:12:26 | um |
---|
0:12:27 | maybe after this i can put the slides somewhere if anyone wants to take a look |
---|
0:12:31 | um |
---|
0:12:32 | given at a wall and you know worked |
---|
0:12:41 | a i'll try to cover |
---|
0:12:42 | can everyone hear me |
---|
0:12:44 | so i'd like to cover all of the uh papers that were generally included in large vocabulary speech recognition and |
---|
0:12:51 | acoustic modeling and adaptation techniques |
---|
0:12:53 | um and there were a lot of them as you can see |
---|
0:12:56 | on the asr side |
---|
0:12:58 | and so we tried to split it in a manner that matched well with the sessions so let's first start |
---|
0:13:03 | with adaptation |
---|
0:13:04 | the problem here is basically to say how well can you adapt your existing models |
---|
0:13:09 | to a specific speaker or environment |
---|
0:13:12 | and the most recent trend we've been seeing is how can you enforce sparsity or structure on the transforms |
---|
0:13:18 | we learn |
---|
0:13:19 | and how can you do a better optimization |
---|
0:13:21 | now in general the ideas that have been floating it on in this field include discriminative transforms |
---|
0:13:27 | how can you find something that will learn rapidly or rapidly adapt with minimal amounts of data |
---|
0:13:33 | and so now you see these things being adapted to more real world tasks such as uh voice search |
---|
0:13:40 | and you're starting to see some impact from these techniques on this new kind of data |
---|
0:13:45 | and we did see some work now on rapid adaptation for uh like i said real world |
---|
0:13:50 | tasks |
---|
0:13:51 | and how you can include um |
---|
0:13:53 | convex optimization methods in situations where your objective function is not convex anymore |
---|
0:13:58 | if you want to read more about uh these the relevant sessions are listed here |
---|
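The simplest concrete instance of such a transform is a per-dimension scale and bias that maps a new speaker's feature statistics onto the training statistics, a degenerate diagonal special case of the linear transforms discussed (the statistics below are illustrative):

```python
# Toy feature-space speaker adaptation: a per-dimension affine transform
# (scale and bias) estimated from a few adaptation frames so the
# speaker's feature mean/variance match the training data's.
import math

def fit_diag_transform(speaker_frames, train_mean, train_std):
    """Estimate scale/bias per dimension from adaptation frames."""
    dim, n = len(train_mean), len(speaker_frames)
    mean = [sum(f[d] for f in speaker_frames) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in speaker_frames) / n)
           or 1.0 for d in range(dim)]        # guard against zero variance
    scale = [train_std[d] / std[d] for d in range(dim)]
    bias = [train_mean[d] - scale[d] * mean[d] for d in range(dim)]
    return scale, bias

def adapt(frame, scale, bias):
    """Apply the transform to one feature frame."""
    return [scale[d] * x + bias[d] for d, x in enumerate(frame)]
```

The full-matrix versions (e.g. fMLLR-style transforms) estimate a dense linear map instead, which is exactly where the sparsity and structure constraints mentioned above come in.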
0:14:06 | that was not a |
---|
0:14:06 | good idea job |
---|
0:14:18 | we have a small problem |
---|
0:14:20 | and so uh acoustic modeling now acoustic modeling was split across many many sessions |
---|
0:14:26 | uh basically all talking about statistical modeling of speech signals |
---|
0:14:30 | uh the more recent trends have been along the lines of how can i use machine learning techniques |
---|
0:14:36 | in large vocabulary speech recognition we all know they work on certain classes of problems like vision and handwriting recognition |
---|
0:14:43 | uh uh which are |
---|
0:14:45 | really difficult but have relatively small data sets so we're now looking to see how we |
---|
0:14:49 | can apply uh these techniques to speech problems |
---|
0:14:53 | and along with that comes the task of speeding up these learning algorithms to deal with large quantities of |
---|
0:14:58 | data |
---|
0:14:58 | and this year we saw more applications to real world uh tasks |
---|
0:15:02 | and including uh large evaluations |
---|
0:15:05 | yeah on this slide here are some of the ideas we saw most of you are familiar with |
---|
0:15:09 | these things |
---|
0:15:10 | um |
---|
0:15:12 | i'll highlight some of the key components here uh we saw some papers on capturing long |
---|
0:15:16 | span dependencies |
---|
0:15:18 | uh critically more use of these posteriors either in an hmm framework or in other forms |
---|
0:15:23 | of decoding |
---|
0:15:24 | uh how can you use the posteriors from these classifiers intelligently maybe using them in |
---|
0:15:30 | deep belief nets or maybe using them directly in the hmm framework |
---|
0:15:34 | uh how can you intelligently pick acoustic units |
---|
0:15:37 | whether that is for english or any other language |
---|
0:15:41 | and do you have enough data now to pick these acoustic units which we didn't quite have before |
---|
0:15:45 | um also we have seen some papers that use language id accent and dialect identification incorporating them to improve |
---|
0:15:53 | speech recognition accuracy |
---|
0:15:54 | so you see a bunch of areas people are working on |
---|
0:15:57 | um we also saw some recent interesting work on uh loss functions and boosting methodologies that improve the |
---|
0:16:04 | quality of the classifiers of the learners in acoustic models |
---|
0:16:08 | uh this particular area was covered in the session titled modeling for asr |
---|
0:16:22 | uh moving on uh there were two sessions which covered acoustic modeling and similar topics and statistical methods |
---|
0:16:28 | as well |
---|
0:16:28 | and these do fall under the category of general asr type problems |
---|
0:16:32 | um |
---|
0:16:33 | there were some more ideas here which include complex models |
---|
0:16:37 | which include long span uh both language modeling and acoustic modeling techniques |
---|
0:16:41 | uh we see some applications of uh multiple stream combinations |
---|
0:16:46 | um there are ideas using these dnns as some sort of a front end there |
---|
0:16:50 | and thinking about how to model these posteriors |
---|
0:16:53 | uh a few of the papers were uh derived from the johns hopkins workshop which |
---|
0:16:58 | is held every summer focused on |
---|
0:17:00 | how you can use some of these posteriors in some sort of a segmental framework |
---|
0:17:05 | uh more recently if you see what the trend is |
---|
0:17:08 | uh we see a lot of |
---|
0:17:10 | novel uh sparse representations and exemplar based methods |
---|
0:17:14 | how you can capture higher-order statistics using deep belief networks |
---|
0:17:18 | um you have point process models uh capturing spectro-temporal patterns |
---|
0:17:23 | uh so we are seeing a wide range of novelty here in this field |
---|
0:17:31 | uh continuing on acoustic modeling which also included discriminative techniques for asr |
---|
0:17:37 | uh the focus was mostly on how can i use discriminative training for both acoustic models |
---|
0:17:42 | as well as for adaptation |
---|
0:17:44 | uh we saw some papers on training full covariance models |
---|
0:17:47 | uh we also saw if you break it down into specifics some feature selection choices |
---|
0:17:53 | better uh regularization of your model parameters that was interesting |
---|
0:17:56 | and people also presented different kinds of training criteria do you use an objective function that models say |
---|
0:18:02 | word error rate or do you use an objective function that models something else related uh |
---|
0:18:07 | to the likelihood or the error in some other fashion |
---|
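For reference, the two flavors of objective contrasted here can be written in their standard textbook forms (not specific to any ICASSP paper): MMI maximizes the posterior of the reference transcription, while MPE maximizes the expected phone accuracy of competing hypotheses.

```latex
% MMI: maximize the posterior of the reference transcription W_u
% given acoustics O_u, against all competing hypotheses W
\mathcal{F}_{\mathrm{MMI}}(\theta)
  = \sum_{u} \log
    \frac{p_\theta(O_u \mid W_u)\, P(W_u)}
         {\sum_{W} p_\theta(O_u \mid W)\, P(W)}

% MPE: expected phone accuracy, where A(W, W_u) counts correct phones
\mathcal{F}_{\mathrm{MPE}}(\theta)
  = \sum_{u}
    \frac{\sum_{W} p_\theta(O_u \mid W)\, P(W)\, A(W, W_u)}
         {\sum_{W} p_\theta(O_u \mid W)\, P(W)}
```

The first is the likelihood-related objective the speaker mentions; the second is the error-related one, since A(W, W_u) ties the criterion directly to phone (and hence word) errors.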
0:18:16 | um the last session that i'll cover on asr was uh titled large vocabulary speech recognition |
---|
0:18:22 | uh the focus there was mostly on building large systems uh large systems for the gale evaluation in different |
---|
0:18:28 | languages |
---|
0:18:29 | and there were also a few systems that were built on real world tasks |
---|
0:18:33 | and |
---|
0:18:34 | some of the key ideas here are how can you exploit large quantities of unlabeled data and that falls into the class |
---|
0:18:40 | of unsupervised training |
---|
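The unsupervised training loop described here (decode unlabeled data, keep confident hypotheses as pseudo-labels, retrain) can be sketched with a toy nearest-centroid stand-in for the acoustic model:

```python
# Self-training sketch: label unlabeled points with the current model,
# keep only confident ones, and refit. A 1-D nearest-centroid classifier
# stands in for the acoustic model; "confidence" is the margin between
# the distances to the two class centroids.

def centroid(points):
    return sum(points) / len(points)

def confidence(x, c0, c1):
    """Margin between the distances to the two class centroids."""
    return abs(abs(x - c0) - abs(x - c1))

def self_train(labeled, unlabeled, rounds=3, threshold=1.0):
    data0 = [x for x, y in labeled if y == 0]
    data1 = [x for x, y in labeled if y == 1]
    for _ in range(rounds):
        c0, c1 = centroid(data0), centroid(data1)
        confident = [x for x in unlabeled
                     if confidence(x, c0, c1) >= threshold]
        for x in confident:   # add pseudo-labeled points to training data
            (data0 if abs(x - c0) < abs(x - c1) else data1).append(x)
        unlabeled = [x for x in unlabeled if x not in confident]
    return centroid(data0), centroid(data1)
```

Real systems do the same thing at scale: decode untranscribed audio, filter by confidence scores, and retrain the acoustic (and language) models on the enlarged set.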
0:18:41 | uh do you use better methods for lattice based training |
---|
0:18:45 | uh we also saw what are the best performing techniques and algorithms for building acoustic and language models |
---|
0:18:51 | and typically in tasks like mandarin and arabic which were part of the gale evaluation |
---|
0:18:57 | oh also system combination strategies played an important role |
---|
0:19:01 | uh we also saw methods to do unit selection |
---|
0:19:04 | particularly in languages like uh mandarin and polish we saw some methods to improve the quality of transcripts when |
---|
0:19:10 | you don't have uh manually transcribed data |
---|
0:19:13 | how you can improve the performance of your acoustic models during training by |
---|
0:19:18 | getting better transcripts |
---|
0:19:20 | uh there was also work on uh decoding schemes to better optimize memory consumption and to make things |
---|
0:19:27 | go faster |
---|
0:19:28 | and we saw a large presence of deep belief networks all over the place |
---|
0:19:38 | which still amazes me somewhat |
---|
0:19:39 | and we saw lots of papers on acoustic modeling also |
---|
0:19:43 | um |
---|
0:19:44 | this you can break down into a couple of areas |
---|
0:19:47 | one which includes uh extended features for hmms in addition to traditional mfccs and plps |
---|
0:19:53 | and the other is the modeling paradigms themselves we saw a lot of work there starting from phone recognition to |
---|
0:19:59 | lvcsr |
---|
0:20:00 | uh a few things to point out we saw energy based features |
---|
0:20:05 | a lot on articulatory trajectories and how can you uh include nonstationary features for different |
---|
0:20:11 | languages |
---|
0:20:13 | uh we saw some efficient parameter estimation that captures phonetic variability |
---|
0:20:18 | i am not capturing everything in every session but these are sort of uh to get you motivated to |
---|
0:20:23 | look at general trends and ideas and bring in |
---|
0:20:26 | ideas from other fields that could perhaps help acoustic modeling |
---|
0:20:30 | uh we did see a lot of uh linear models for covariance modeling |
---|
0:20:34 | and particularly this time we saw some work on uh overlapped speech detection |
---|
0:20:39 | and non audible murmur detection which is useful in uh |
---|
0:20:43 | situations like monitoring in the public domain |
---|
0:20:46 | uh the relevant sessions are acoustic modeling one and two |
---|
0:20:54 | um |
---|
0:20:55 | the first session uh speech synthesis |
---|
0:20:57 | so this is just a very brief summary of speech synthesis |
---|
0:21:00 | uh we saw a focus on well two categories in synthesis hmm based and concatenative uh unit selection |
---|
0:21:07 | based tts |
---|
0:21:08 | a bunch of the work on hmm based synthesis focused mainly on the underlying parameterization and reconstruction |
---|
0:21:15 | and that included uh work on duration modeling |
---|
0:21:19 | how you can incorporate this technology in embedded systems |
---|
0:21:22 | uh also the impact of machine translation meaning the number of errors the translation system makes and the fluency |
---|
0:21:29 | of the output the impact that has on speech synthesis |
---|
0:21:33 | um parameter tying and uh parameter estimation for hmms this was also there |
---|
0:21:38 | uh in the other section we saw work on uh prosody prediction how you can do better prosody |
---|
0:21:44 | prediction how you can do better uh annotation of pitch accents |
---|
0:21:49 | uh uh we also saw new constraints being introduced for unit selection in concatenative tts systems |
---|
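Unit selection itself is a dynamic-programming search minimizing target cost plus concatenation (join) cost over candidate units; a toy sketch with made-up cost functions:

```python
# Viterbi-style unit selection for concatenative TTS: pick one candidate
# unit per target position minimizing target cost + join cost.

def select_units(candidates, target_cost, join_cost):
    """candidates: one list of candidate units per target position."""
    # best[i][k]: cheapest total cost ending at candidate k of position i
    best = [[target_cost(0, u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][j] + join_cost(prev, u)
                     for j, prev in enumerate(candidates[i - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j] + target_cost(i, u))
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    # backtrack the cheapest path
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for i in range(len(candidates) - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]
```

The new constraints mentioned above typically enter as extra terms or hard restrictions inside these two cost functions.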
0:21:56 | and all the relevant sessions are listed here but there were also a few posters |
---|
0:22:01 | in the machine learning section and speech and audio applications that cover synthesis |
---|
0:22:05 | so that's basically a broad overview i have for asr and synthesis |
---|
0:22:14 | i know uh |
---|
0:22:15 | you've now seen over three hundred papers in thirty minutes |
---|
0:22:18 | uh no doubt |
---|
0:22:19 | it maybe feels like a fire hose aimed at you |
---|
0:22:22 | um |
---|
0:22:23 | let's see if we could try to generate a few questions uh from folks |
---|
0:22:27 | um i will also note that uh we the speech and language tc |
---|
0:22:31 | uh we do put out a newsletter if any of you want to follow up on the papers |
---|
0:22:35 | in speech or audio |
---|
0:22:36 | in the stc as we do we try to |
---|
0:22:40 | uh you know address um |
---|
0:22:42 | uh all of the papers in the speech and language area of our group |
---|
0:22:46 | uh and in fact you can |
---|
0:22:48 | uh get the newsletter through our mailing list so |
---|
0:22:51 | you get a regular copy of |
---|
0:22:54 | the newsletter from our tc |
---|
0:22:56 | and we will include uh |
---|
0:22:57 | uh links to kind of download the slides if you'd like to get a copy of them |
---|
0:23:02 | right |
---|
0:23:03 | so can i ask you for uh any questions here |
---|
0:23:07 | the river and may have to make a on its work |
---|
0:23:17 | and a can have the speakers |
---|
0:23:21 | no |
---|
0:23:21 | to to to get a |
---|
0:23:22 | i don't get all |
---|
0:23:24 | of of a three or four years ago |
---|
0:23:27 | the speech technical committee kind of reorganized itself so |
---|
0:23:31 | uh more |
---|
0:23:32 | toward you know spoken language so it is spoken language not written text |
---|
0:23:37 | alright spoken language |
---|
0:23:39 | not text |
---|
0:23:40 | processing it moved to spoken language processing |
---|
0:23:43 | to try to sort of try those papers from your is you but they were generally going to |
---|
0:23:49 | also circle |
---|
0:23:51 | oh it's what's what's a room um |
---|
0:23:54 | uh |
---|
0:23:55 | solution is to actually things like a spring |
---|
0:23:59 | a are a put up your part of like are there more |
---|
0:24:04 | are more |
---|
0:24:05 | so if we more |
---|
0:24:06 | a set of the paper is that what is going to |
---|
0:24:09 | is your car |
---|
0:24:10 | he's coming here |
---|
0:24:11 | so i think we have uh about a hundred and ten papers in spoken language submitted |
---|
0:24:17 | uh it's been steady in the last two or three years roughly anywhere from about eighty to |
---|
0:24:21 | a little over a hundred uh with roughly you know on average around forty |
---|
0:24:25 | six to forty two percent of the papers accepted |
---|
0:24:29 | um also some of the work uh that is presented in spoken language could also go to other venues |
---|
0:24:34 | so i |
---|
0:24:35 | think it brings in more folks |
---|
0:24:37 | for um |
---|
0:24:37 | from that |
---|
0:24:38 | community so to speak |
---|
0:24:40 | um |
---|
0:24:40 | i'll just note also that uh |
---|
0:24:43 | at the speech technical committee meeting we had on wednesday |
---|
0:24:47 | um um |
---|
0:24:48 | uh uh we heard a short report from the transactions the number of papers submitted uh in spoken |
---|
0:24:53 | language has increased significantly uh there's been a huge increase in the number of submissions |
---|
0:24:59 | and uh the page count is actually growing as some of you have |
---|
0:25:03 | uh papers sitting in volume ninety nine you'll know what i mean |
---|
0:25:07 | that's kind of a or or or going uh volume and are we can kind to do more |
---|
0:25:11 | a you to a kind of a or to the people but the us work there was a lot of |
---|
0:25:15 | people kind of coming in |
---|
0:25:17 | oh i a series |
---|
0:25:21 | of request |
---|
0:25:27 | no one see me was on in bands of sorry |
---|
0:25:30 | question |
---|
0:25:32 | uh so on from |
---|
0:25:34 | we use |
---|
0:25:36 | row |
---|
0:25:37 | i |
---|
0:25:38 | for |
---|
0:25:41 | sure |
---|
0:25:42 | no |
---|
0:25:44 | i |
---|
0:25:46 | i you |
---|
0:25:48 | so are was so i could so |
---|
0:25:51 | or do some channels as i'm not sure if you want to use one |
---|
0:25:53 | so |
---|
0:25:54 | the first question was from the speech side uh |
---|
0:25:57 | i think when people are looking at uh there's a lot more work now you know on real data |
---|
0:26:03 | uh and so beyond broadcast news uh working on real voice search |
---|
0:26:08 | on youtube videos and audio bits spread over the web |
---|
0:26:12 | um there's a lot more uh you know interplay between music and speech |
---|
0:26:16 | um |
---|
0:26:17 | there was we've seen uh people looking at speaker id and language id |
---|
0:26:21 | uh in multiple languages |
---|
0:26:23 | uh |
---|
0:26:24 | with people singing |
---|
0:26:25 | uh and so forth beyond what we knew in the past |
---|
0:26:28 | but we're also seeing uh |
---|
0:26:31 | uh voice morphing or transformation here |
---|
0:26:33 | um |
---|
0:26:34 | uh so not in this conference but spoken of in pretty loose circles uh |
---|
0:26:38 | someone took a music video of a pop artist |
---|
0:26:41 | uh in english and morphed it to spanish |
---|
0:26:44 | um |
---|
0:26:45 | and it would sound uh flawless in the uh grammar |
---|
0:26:48 | really |
---|
0:26:49 | where they were performing in english |
---|
0:26:51 | uh and didn't know spanish but you couldn't tell |
---|
0:26:54 | it was really good |
---|
0:26:55 | so i i think you saying a lot of movement now |
---|
0:26:58 | a some of the tools of there exist for speech recognition |
---|
0:27:01 | a a speaker or do you'd are resolution and so forth |
---|
0:27:04 | being drawn toward challenges in music, because there's a lot of more realistic data
---|
0:27:08 | uh, that folks are getting access to
---|
0:27:11 | uh, and having music in
---|
0:27:12 | the signal has become a big challenge
---|
0:27:15 | uh, actually pitch tracking, on the speech analysis side
---|
0:27:18 | pitch tracking where there's music is a really tough thing to work out
---|
0:27:22 | and, uh, there are some folks that are, or have been, working on that
---|
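The difficulty described here is easy to reproduce: a classic autocorrelation pitch tracker assumes one dominant periodicity per frame, and background music violates that assumption. The following is a minimal sketch of that classic method, not anything the panel presented; the frame size, voicing threshold, and test tones are my own illustrative choices.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Single F0 estimate (Hz) for one frame via autocorrelation; 0.0 if unvoiced."""
    frame = frame - np.mean(frame)
    # Full autocorrelation, keeping non-negative lags 0..N-1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)  # lag search range for fmin..fmax
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] / ac[0] < 0.3:  # crude voicing check: periodic energy must dominate
        return 0.0
    return sr / lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr           # one 40 ms frame
voice = np.sin(2 * np.pi * 120 * t)          # stand-in for voiced speech at 120 Hz
music = 0.8 * np.sin(2 * np.pi * 330 * t)    # a competing musical tone

print(estimate_f0(voice, sr))          # close to 120 Hz on the clean frame
# With music mixed in, secondary autocorrelation peaks grow; depending on the
# interfering tone and level, the estimate can shift or the frame can look unvoiced.
print(estimate_f0(voice + music, sr))
```

With heavier or polyphonic accompaniment the single-peak assumption breaks down entirely, which is why music-robust pitch tracking remains an open problem.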
0:27:27 | one quick comment: i think we do see a lot of people starting to move to music, um
---|
0:27:32 | at the interface
---|
0:27:35 | a couple of years ago one of the big challenges was the competing speaker problem
---|
0:27:40 | you have one person talking over another person
---|
0:27:43 | now it's
---|
0:27:44 | music overlapping with someone else, and being able to suppress that to try to do the recognition
---|
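The suppression idea mentioned here can be sketched with textbook spectral subtraction: estimate the interferer's average magnitude spectrum from an interference-only stretch, subtract it frame by frame, and resynthesize with the mixture's phase. Everything below (the FFT sizes, the steady sinusoidal "music", and the assumption that a music-only segment is available) is my illustrative setup, not the panelists' method.

```python
import numpy as np

def spectral_subtract(mixture, interference_sample, n_fft=512, hop=256):
    """Toy spectral subtraction: remove an average interference magnitude per frame."""
    win = np.hanning(n_fft)
    # Average magnitude spectrum of the interference-only stretch.
    frames = [interference_sample[i:i + n_fft] * win
              for i in range(0, len(interference_sample) - n_fft, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)

    out = np.zeros(len(mixture))
    for i in range(0, len(mixture) - n_fft, hop):
        spec = np.fft.rfft(mixture[i:i + n_fft] * win)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor at zero
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n_fft)
        out[i:i + n_fft] += clean                          # overlap-add
    return out

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 150 * t)        # stand-in for speech
music = 0.7 * np.sin(2 * np.pi * 440 * t)   # stand-in for steady background music
cleaned = spectral_subtract(speech + music, music)

# After subtraction, the 440 Hz music peak should sit far below the 150 Hz peak.
spec = np.abs(np.fft.rfft(cleaned))
```

Real music is nonstationary, so this fixed noise estimate fails in practice; that gap is exactly what makes the overlap problem a research topic rather than a solved preprocessing step.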
0:27:52 | another question?
---|
0:27:55 | i just caught a glance; i think that's, uh, one over there
---|
0:27:58 | uh, i just want to, uh, add some comments to the previous
---|
0:28:02 | comments, about this view of the signal processing of speech, and also
---|
0:28:06 | recognition and, uh, synthesis
---|
0:28:08 | to me
---|
0:28:09 | personally, actually, i view myself as a, uh, hard-core signal processing person
---|
0:28:14 | so to me it's all signal processing
---|
0:28:16 | so regardless of whether it is speech or
---|
0:28:19 | music
---|
0:28:20 | actually, at microsoft research asia, i do have quite a few colleagues, such as
---|
0:28:25 | a professor
---|
0:28:27 | from, uh, tokyo university
---|
0:28:29 | is |
---|
0:28:30 | who has been, uh, working in both domains
---|
0:28:34 | and we treat music, either instrumental music or
---|
0:28:38 | vocal
---|
0:28:40 | yeah, as
---|
0:28:41 | uh, a whole area of interesting applications and materials
---|
0:28:46 | uh
---|
0:28:47 | what we do, uh, is
---|
0:28:48 | speech synthesis
---|
0:28:50 | we use, uh, TTS as
---|
0:28:52 | the core technology
---|
0:28:54 | particularly these days; there's, uh, a push there
---|
0:28:57 | and so, not only just bridging the gap between the traditional
---|
0:29:01 | uh, you know
---|
0:29:03 | concatenation- or unit-selection-based synthesis
---|
0:29:06 | but right now the HMM-based synthesis
---|
0:29:09 | uh, or what some people call hybrid synthesis; actually, my opinion is that it's really just
---|
0:29:16 | the whole statistical- and sample-based, uh, rendering,
---|
0:29:24 | a holistic approach to the whole synthesis, or rendering,
---|
0:29:28 | process
---|
0:29:29 | so, uh
---|
0:29:31 | we use, uh, TTS
---|
0:29:33 | to do singing
---|
0:29:35 | with only that knowledge, and we just try to see
---|
0:29:40 | given the
---|
0:29:41 | recorded speech material alone
---|
0:29:44 | can we sing as well
---|
0:29:45 | yes, uh, we have done that, and not only us; in asia
---|
0:29:49 | i saw quite a few researchers
---|
0:29:53 | really working, uh, diligently in that direction
---|
0:29:56 | but polyphonic singing pitch tracking
---|
0:29:59 | which is, uh, really
---|
0:30:02 | a problem, uh, harder than anything we've seen
---|
0:30:05 | and it's definitely posing a big technical challenge
---|
0:30:10 | and of interest to
---|
0:30:13 | the signal processing,
---|
0:30:14 | speech, or music researchers
---|
0:30:17 | and on the analysis
---|
0:30:20 | side, that is recognition; uh, again, um, probably it's just that
---|
0:30:28 | it used to be
---|
0:30:30 | just, say, recognition or transcription
---|
0:30:32 | and the next step is, uh, understanding
---|
0:30:35 | and there's a, uh, counterpart of that for speech synthesis
---|
0:30:39 | uh |
---|
0:30:39 | and maybe an advertisement as well
---|
0:30:42 | so, uh
---|
0:30:44 | but, uh, as a speech synthesis researcher, because beyond
---|
0:30:50 | the whole understanding
---|
0:30:52 | to close the speech chain
---|
0:30:54 | we do need a good model of
---|
0:30:57 | expressive
---|
0:30:58 | speech synthesis, too
---|
0:31:00 | so, uh
---|
0:31:01 | to put it in a quick summary: i personally do see
---|
0:31:05 | that the boundary between music
---|
0:31:08 | and speech is dissolving
---|
0:31:09 | and that the common approaches, statistical modeling
---|
0:31:14 | and the sample-based
---|
0:31:17 | uh, rendering
---|
0:31:20 | are really just merging with each other in kind of a seamless manner
---|
0:31:27 | thanks, frank
---|
0:31:28 | but, uh, when you think about recognition
---|
0:31:31 | maybe a couple of years ago the general perception was that, since you can buy commercially available
---|
0:31:35 | speech recognition products in the field
---|
0:31:38 | some people perceived that it's solved
---|
0:31:40 | uh, but in fact there are huge challenges, i think
---|
0:31:44 | due to,
---|
0:31:47 | uh, much more realistic data in the field, making, uh, recognition much more challenging to do
---|
0:31:52 | and frank's comments on the synthesis part are
---|
0:31:55 | right on target; when you look at the general user population
---|
0:31:58 | and the use of dialogue systems
---|
0:32:00 | uh, studies have shown that
---|
0:32:02 | uh, the perception of how good the dialogue system is
---|
0:32:06 | is, to a large extent, related to the quality of the synthesized voice that you're interacting with
---|
0:32:11 | uh |
---|
0:32:12 | even behind errors, where, uh, the recognition error rate
---|
0:32:15 | uh, makes it hard to kind of recover; a lot of work recently
---|
0:32:20 | uh, has looked at robust processing approaches
---|
0:32:24 | other questions or comments?
---|
0:32:28 | we've got, uh, thirty minutes per
---|
0:32:30 | paper, and we had three papers here, so we should have time for some more, of course
---|
0:32:35 | so, to make a plug, uh, for the, uh, ASRU workshop
---|
0:32:41 | this year it's, uh, somewhere that is maybe, uh, a bit more sunny
---|
0:32:46 | than it is, uh, here
---|
0:32:49 | uh, the weather is great there, and it's, uh
---|
0:32:52 | a great opportunity to come, uh
---|
0:32:54 | follow up on some of the topics that you've seen at this conference
---|
0:32:58 | and everyone gets a little
---|
0:33:00 | uh, sort of, uh, garland of flowers
---|
0:33:03 | other comments or questions?
---|
0:33:07 | well, let me, uh, take a moment to make the last pitch: if you're interested in being involved
---|
0:33:12 | in the speech and language technical committee, please
---|
0:33:14 | uh, contact one of the committee members; there are, uh, over fifty members
---|
0:33:18 | uh, if you go to the web, uh, for our newsletter
---|
0:33:21 | um, you'll find there are a number of topics covered; also, if you're advertising for jobs or
---|
0:33:26 | uh, trying to recruit folks, there's an online
---|
0:33:30 | uh, job posting area there as well
---|
0:33:33 | i don't know if you've seen
---|
0:33:34 | the latest
---|
0:33:34 | newsletter; i put a little, uh, piece in it on
---|
0:33:37 | uh, what represents a, uh, grand challenge for the speech and language field, and i think
---|
0:33:42 | uh, there's been a lot of talk
---|
0:33:44 | in terms of energy and health care
---|
0:33:46 | uh, as grand challenges, but
---|
0:33:48 | speech and language belong there too
---|
0:33:50 | uh, one of the most
---|
0:33:52 | important aspects when you look at society and interacting with folks
---|
0:33:56 | is speech-to-speech translation; some of the big advancements in this area
---|
0:34:00 | uh, will allow people to communicate more efficiently and reduce barriers between people, so
---|
0:34:05 | speech and language are very important and should represent, uh, one of the grand challenges as well
---|
0:34:11 | if there are no more comments, uh, we'll close the session; thank you all
---|