0:00:14my name is that if they really keep that and
0:00:18i will talk about the work we did at utd to assess
0:00:22child communication engagement via speech recognition
0:00:26in naturalistic active learning spaces
0:00:32this study explores child conversational engagement through automatic speech recognition in a naturalistic classroom setting
0:00:41the ability to assess children's conversational interaction is critical for typically developing and at risk
0:00:49children for example those with language delay
0:00:52if an at risk child is identified early support can be provided to
0:00:57reduce
0:00:57the social impact of the speech disorder
0:01:01while research has considered child speech recognition in the past
0:01:06most of the studies focused on the six to eighteen
0:01:11age group
0:01:12only a few studies have explored
0:01:15preschoolers' speech recognition
0:01:18they used words phrases and conversational speech but were based on
0:01:23structured human computer interaction scenarios
0:01:28in our study the scenario is based on naturalistic conversational interaction
0:01:33between child and adult and child and child in daycare spaces
0:01:39where children and adults wear attached
0:01:43lena recorders
0:01:45we investigate child speech recognition where ages range from two point five years
0:01:53in this study we explore data augmentation techniques for
0:01:57child naturalistic speech recognition data augmentation has
0:02:02been shown
0:02:03to improve the performance of
0:02:05adult speech recognition systems however it has not been extensively studied for child
0:02:11speech
0:02:13finally we investigate word count rates to assess child speech development for typically developing as
0:02:21well as those children that might be at risk
0:02:25the word counts are estimated based on
0:02:27the hypotheses of
0:02:29our speech recognition system
0:02:32word count estimation provides
0:02:34insight into the assessment of child language engagement in daycare spaces and can identify
0:02:42which child might need more teacher attention
0:02:51all experiments reported in this study use american english child spontaneous conversations
0:02:57captured in a high quality child learning centre in the united states
0:03:03data was collected from thirty three children of age two point five years
0:03:09and
0:03:10from four adult teachers
0:03:12three females and one male
0:03:15based on actual diagnosis eight of the children are at risk for example of speech or
0:03:21language delay
0:03:22the speech data was gathered in three inclusive early childhood classrooms
0:03:29during naturally occurring morning and afternoon activities
0:03:33children were recorded throughout typical morning
0:03:37activities and routines
0:03:40the data was gathered using lena recording units which are
0:03:45lightweight compact audio recorders
0:03:48that allow minimal self awareness for the speaker while the audio is captured throughout
0:03:54the conversation
0:03:56the child training corpus consists of about fifteen hours of manually transcribed audio with transcripts
0:04:04of one hundred twenty thousand word tokens
0:04:07while the adult data consists of twenty three hours of manually transcribed audio with three
0:04:12hundred thousand words in the transcripts
0:04:15in addition an out of domain conversational like web text corpus was also used
0:04:21consisting of two point six million word tokens
0:04:25all results are reported on a three hour test
0:04:29set of
0:04:31child speech for development a one point five hour dataset was used
0:04:39in the baseline recognition system the acoustic models are tied state left to right three state hmms
0:04:47with gaussian mixture observation densities
0:04:50the triphone based models are word position dependent and the models are trained
0:04:56on thirty nine dimensional mfccs
0:05:01the features are spliced across nine frames and projected to forty dimensions using lda and
0:05:07mllt
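
The frame splicing step mentioned above can be sketched as follows. This is a minimal illustration, not the speaker's pipeline: it assumes 39-dimensional input frames and a context of four frames on each side (nine in total), and the function name `splice_frames` is hypothetical. The subsequent LDA and MLLT projection to 40 dimensions is not shown, since it requires transforms estimated from aligned training data.

```python
def splice_frames(frames, context=4):
    """Stack each frame with `context` neighbours on each side.

    Frames beyond the utterance edges are replaced by the first or
    last frame, so every output row has (2 * context + 1) * dim values.
    """
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        stacked = []
        for offset in range(-context, context + 1):
            # Clamp the index so edge frames are repeated.
            idx = min(max(t + offset, 0), num_frames - 1)
            stacked.extend(frames[idx])
        spliced.append(stacked)
    return spliced


# Toy input: 10 frames of 39 values each (assumed dimension).
frames = [[float(t)] * 39 for t in range(10)]
spliced = splice_frames(frames)
print(len(spliced), len(spliced[0]))
```

With nine 39-dimensional frames stacked, each spliced vector has 351 values before any dimensionality reduction.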
0:05:09next speaker adaptive training is performed using mllr
0:05:14the three gram language model is built using manual transcriptions from the child speech corpus
0:05:21the lexicon this problem
0:05:25which consist of the most one hundred fifty four five in all those
0:05:32fifteen hours of transcribed conversational
0:05:35speech is used
0:05:42a deep neural network system is trained to estimate hmm state likelihoods
0:05:47alignments are produced by the sat gmm hmm system
0:05:51in the experiments with the original child training data set we used a dnn topology with hidden layers
0:05:58that have two thousand forty eight neurons per layer and the output layer is based
0:06:03on softmax
0:06:06sequence discriminative training is applied with the smbr objective
0:06:10the dnn uses the same features as our sat gmm hmms these features
0:06:17are spliced using a context of nine frames
0:06:21followed by lda and mllt
0:06:24and mllr
0:06:30the constraint for this speech recognition task is that the amount of text and of available transcribed
0:06:37audio data for spontaneous child speech is limited
0:06:42alternate data augmentation methods
0:06:45are explored
0:06:47for language and acoustic models
0:06:54to improve the language model three alternate data augmentation techniques are investigated adding adult
0:07:01data web data and producing additional text with rnns
0:07:07the language model is estimated using supplemental text resources and interpolated with the original baseline
0:07:14language model
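
Language model interpolation of the kind described here can be illustrated with a small sketch. It makes several simplifying assumptions that are not from the talk: unigram models with add-one smoothing stand in for the three gram models, the toy sentences are invented, and the interpolation weight is picked by a coarse grid search on a tiny development text.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(lm_a, lm_b, lam):
    """Linear interpolation: lam * P_a(w) + (1 - lam) * P_b(w)."""
    return {w: lam * lm_a[w] + (1.0 - lam) * lm_b[w] for w in lm_a}


def perplexity(lm, tokens):
    """Perplexity of a token sequence under a unigram model."""
    log_prob = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Invented toy data standing in for in-domain and supplemental text.
child = "i want the ball give me the ball".split()
web = "the market rose as the ball game ended".split()
dev = "give me the ball".split()
vocab = set(child) | set(web) | set(dev)

child_lm = unigram_lm(child, vocab)
web_lm = unigram_lm(web, vocab)

# Choose the weight on the in-domain model that minimises dev perplexity.
best_lam = min(
    (step / 10 for step in range(11)),
    key=lambda lam: perplexity(interpolate(child_lm, web_lm, lam), dev),
)
mixed = interpolate(child_lm, web_lm, best_lam)
print(best_lam, perplexity(mixed, dev))
```

Because the grid includes the weights 0 and 1, the interpolated model can never have a higher development perplexity than either component on its own.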
0:07:16for the adult data
0:07:17the use of twenty three hours of manually annotated adult transcriptions with three hundred
0:07:23thousand word tokens
0:07:25is investigated for data augmentation
0:07:29all conversational adult data was recorded in the childcare centre
0:07:34for web data conversational like web text data with two point six million
0:07:39words is explored for the language model
0:07:45for rnn text generation
0:07:47text of fifty million words is generated using an rnn
0:07:51the rnn has two hidden layers and five hundred twelve neurons per layer
0:07:56the rnn finds long context regularities
0:08:01it produces quite meaningful sentences and maintains the same vocabulary
0:08:11to assess the improvement derived from the use of supplemental text resources
0:08:18contrastive experiments are performed with alternate language models
0:08:22gmm hmm and dnn acoustic models are based on the transcribed child audio
0:08:28from the table it is observed that perplexity is improved using the supplemental
0:08:33text resources
0:08:34the word error rate improvement is obtained only with the language model incorporating
0:08:40the child training transcripts
0:08:42web text and rnn generated text
0:08:45resulting in text with
0:08:47it to three million four counts
0:08:54in this case the perplexity is reduced by eleven point
0:08:59with a tiny gain of zero point zero nine absolute wer
0:09:03over the baseline
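
For readers unfamiliar with the metric quoted here, word error rate is the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The sketch below is a generic implementation with invented example sentences. It is not the scoring tool used in the study, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("give me the ball", "give me ball"))
```

An absolute improvement, as quoted in the talk, is the plain difference between two such percentages rather than a relative change.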
0:09:14for acoustic data augmentation we assess three alternate approaches
0:09:21speed and tempo perturbation and adult data use
0:09:25we investigate the impact of different perturbation coefficients and an alternate number of copies of
0:09:32the original child data set of fifteen hours
0:09:37speed perturbation emulates both pitch and tempo variations in the speech signal
0:09:43speed modification is achieved by resampling the signal
0:09:47we explore augmentation of the training dataset by changing the speed of the
0:09:52audio signal resulting in four versions of the original child
0:09:58training data with speed factors of zero point eight zero point nine one point one and one
0:10:04point two
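
Speed modification by resampling can be sketched in a few lines. This is an illustrative version only: it uses plain linear interpolation instead of a proper band-limited resampler, and the function name `speed_perturb` is hypothetical. Because the samples are simply read faster or slower, both duration and pitch change together, which is the behaviour attributed to speed perturbation in the talk. Tempo perturbation, which keeps pitch fixed, needs a time-stretching method and is not shown.

```python
def speed_perturb(samples, factor):
    """Resample so that playback at the original rate is `factor` times faster.

    factor > 1 shortens the signal and raises pitch; factor < 1 does the opposite.
    Linear interpolation is used here purely for illustration.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    last = len(samples) - 1
    out_len = int(len(samples) / factor)
    out = []
    for i in range(out_len):
        pos = i * factor                 # fractional read position in the input
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out


# Toy ramp signal; real input would be audio samples.
samples = [float(i) for i in range(1000)]
slow = speed_perturb(samples, 0.9)
fast = speed_perturb(samples, 1.1)
print(len(slow), len(fast))
```

A factor of 0.9 yields a longer copy and a factor of 1.1 a shorter one, so the perturbed copies do not have exactly the same duration as the original recording.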
0:10:05for tempo perturbation
0:10:08the tempo of the signal is modified while the pitch and spectral envelope of
0:10:13the signal is not changed
0:10:15the training dataset was enlarged by creating four additional copies of the original
0:10:20child training data with modified tempo factors of zero point eight zero point nine
0:10:27one point one and one point two
0:10:31for adult data use
0:10:33we add it on top of the child training dataset
0:10:36the adult dataset is
0:10:39comprised of twenty three hours of transcribed audio from the teachers
0:10:45all data was recorded in a childcare centre
0:10:54acoustic model augmentation results are provided in the table
0:10:59in the experiments we use the language model with child training transcriptions in addition to
0:11:05adult web and rnn generated text
0:11:09the table shows that for the gmm hmm system the highest wer improvement is obtained by
0:11:14incorporating two copies of child transcribed audio with speed factors
0:11:20of zero point nine and one point one
0:11:27in this case forty five hours of training data is used with a wer improvement
0:11:33of zero one that one absolute compared to the original child training dataset
0:11:40the performance of the dnn systems is also summarized in this table
0:11:47the top line indicates
0:11:52that with the original child transcribed audio set an improvement of five point forty two percent
0:11:58absolute is obtained with dnn hmm training over the gmm hmm system
0:12:05comparing the dnn performance with different acoustic models it can be observed that an
0:12:11absolute wer reduction of two point forty eight is achieved using the forty five hour dataset
0:12:18which incorporates
0:12:21the speed perturbed audio signals with zero point nine and one point one factors
0:12:31finally we investigate the one hundred fifty eight hour dataset that additionally includes transcribed adult
0:12:39data in this case the highest improvement of eight point zero three is achieved over the
0:12:45baseline
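
The dataset sizes quoted in the talk can be checked with simple arithmetic. The breakdown below is an assumption, not something the speaker states: it counts every perturbed copy at the nominal 15 hours of the original child data, and it reads the largest set as the original audio plus four speed-perturbed copies, four tempo-perturbed copies, and the 23 hours of adult audio.

```python
CHILD_HOURS = 15   # original child training audio
ADULT_HOURS = 23   # transcribed adult audio

speed_factors = [0.8, 0.9, 1.1, 1.2]
tempo_factors = [0.8, 0.9, 1.1, 1.2]

# Original plus the two speed-perturbed copies (factors 0.9 and 1.1).
speed_subset_hours = CHILD_HOURS * (1 + 2)

# Assumed composition of the largest set: original, all perturbed copies, adult data.
full_set_hours = CHILD_HOURS * (1 + len(speed_factors) + len(tempo_factors)) + ADULT_HOURS

print(speed_subset_hours, full_set_hours)
```

Under this reading the two totals come out at 45 and 158 hours, which matches the figures given in the talk.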
0:12:58the environment of the child in classroom settings is important for child speech
0:13:04learning there is a need to identify which children are at risk for low language engagement and
0:13:11these children should receive more teacher support during the learning activities
0:13:16we assess children's speech development using word count rates for each child
0:13:22word counts are estimated based on the hypotheses of our best dnn
0:13:27speech recognition system
0:13:29word counts are expected
0:13:31to not be completely accurate
0:13:34however they are consistent and we are therefore still able to establish which
0:13:41children have little conversational interaction
0:13:47the word counts are estimated based on the hypotheses of our best speech recognition system
0:13:53comparing the word counts with the reference
0:13:57counts it can be seen that even if there are speech recognition
0:14:02system errors it is still possible to establish which children have little conversational interaction and
0:14:10are at risk
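
The idea that imperfect word counts can still give a usable ordering of children can be illustrated with a toy example. Everything here is hypothetical: the child identifiers, the utterances, and the simulated recognizer output with dropped words are invented for illustration and are not data from the study.

```python
from collections import Counter


def word_counts(utterances):
    """Sum the number of words per child over (child_id, text) pairs."""
    counts = Counter()
    for child_id, text in utterances:
        counts[child_id] += len(text.split())
    return counts


def rank_low_to_high(counts):
    """Order children from fewest to most words (ties broken by id)."""
    return sorted(counts, key=lambda child_id: (counts[child_id], child_id))


# Invented reference transcripts and a simulated recognizer output with errors.
reference = [
    ("c1", "i want the red ball"), ("c2", "no"),
    ("c3", "look at my big tower"), ("c1", "can you give it to me"),
    ("c2", "mine"),
]
asr_output = [
    ("c1", "i want red ball"), ("c2", "no"),
    ("c3", "look at big tower"), ("c1", "can you give to me"),
    ("c2", ""),
]

ref_rank = rank_low_to_high(word_counts(reference))
asr_rank = rank_low_to_high(word_counts(asr_output))
print(ref_rank, asr_rank)
```

In this toy case the recognizer undercounts every child, yet the child with the least speech is the same in both orderings, which is the kind of consistency the talk relies on.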
0:14:12and it is child for a second and third
0:14:17this used number or the cell for and sell five
0:14:21based or on
0:14:23work on synthetic i borders is
0:14:26due to challenges in this naturalistic child to child and adult to child learning space the
0:14:33word counts are not completely accurate however they are consistent and we are still able
0:14:39to establish which children have low conversational interaction
0:14:44these children should get more teacher support
0:14:48for social and academic learning during activities in the daycare centre
0:14:56this research
0:14:57has investigated the benefits of applying data augmentation techniques
0:15:03for child ages from two point five years
0:15:08in assessing child naturalistic engagement through speech recognition
0:15:14we explored several data augmentation techniques to advance language and acoustic models and showed which
0:15:20provided gains in
0:15:21speech recognition performance
0:15:24we also explored assessment of child language development through word count rates
0:15:30the results show that even a poorer performing speech recognition system can contribute to effective
0:15:37conversational engagement assessment
0:15:41alternate text augmentation approaches were
0:15:44investigated to increase the limited amount of original transcribed conversational child speech using
0:15:52adult data
0:15:54web data
0:15:55and text generated by rnns
0:15:59interpolating these texts collectively leads to a perplexity improvement over the four
0:16:06but there is
0:16:07very little gain in wer over the original baseline
0:16:13next acoustic augmentation techniques for child speech were explored based on
0:16:18speed perturbation tempo perturbation and adult data
0:16:23the experiments were performed with training data ranging from fifteen
0:16:29to one hundred fifty eight hours
0:16:32both speed and tempo perturbation were shown to improve word error rate with speed
0:16:37perturbation factors of zero point nine and one point one being the most beneficial
0:16:43the greatest word error rate reduction of eight point zero three absolute was achieved over
0:16:49the baseline after incorporating augmented audio with adult data the improved language model
0:16:55and using the dnn system
0:16:58conversational interaction via word count was explored to assess children's speech engagement
0:17:05this allowed us to establish a relative rank ordering
0:17:09of conversational interaction
0:17:12and therefore word counts
0:17:15provide a separation between the at risk and typically developing children in
0:17:22the
0:17:22social
0:17:23settings of the active learning space