0:00:14 | my name is that if they really keep that and |
---|
0:00:18 | i will talk about the work we did at utd to assess |
---|
0:00:22 | child communication engagement via speech recognition |
---|
0:00:26 | in naturalistic active learning spaces |
---|
0:00:32 | this study explores child conversational engagement through automatic speech recognition in naturalistic classroom settings |
---|
0:00:41 | the ability to assess children's conversational interaction is critical for typically developing and at risk |
---|
0:00:49 | children for example those with language delay |
---|
0:00:52 | the earlier a child is identified the earlier support can be provided to |
---|
0:00:57 | reduce |
---|
0:00:57 | the social impact of the speech disorder |
---|
0:01:01 | while research has considered child speech recognition in the past |
---|
0:01:06 | most of the studies focused on the six to eighteen |
---|
0:01:11 | age group |
---|
0:01:12 | only a few studies have explored |
---|
0:01:15 | preschoolers' speech recognition |
---|
0:01:18 | they use words phrases and conversational speech but are based on |
---|
0:01:23 | structured human computer interaction scenarios |
---|
0:01:28 | in our study the analysis is based on naturalistic conversational interaction |
---|
0:01:33 | between child and adult and between child and child in daycare spaces |
---|
0:01:39 | with children and adults wearing lightweight attached |
---|
0:01:43 | lena recorders |
---|
0:01:45 | we investigate child speech recognition where age ranges from two to five years |
---|
0:01:53 | in this study we explore data augmentation techniques for |
---|
0:01:57 | child naturalistic speech recognition data augmentation has |
---|
0:02:02 | been shown |
---|
0:02:03 | to improve the performance of |
---|
0:02:05 | adult speech recognition systems but it has not been extensively studied for child |
---|
0:02:11 | speech |
---|
0:02:13 | finally we investigate word count trends to assess child speech development for typically developing as |
---|
0:02:21 | well as those children who might be at risk |
---|
0:02:25 | the word counts are estimated based on |
---|
0:02:27 | the output hypotheses of |
---|
0:02:29 | our speech recognition system |
---|
0:02:32 | word count estimation provides |
---|
0:02:34 | insight into the assessment of child language engagement in their learning spaces and can identify |
---|
0:02:42 | which child might need more teacher attention |
---|
0:02:51 | all experiments reported in this study use american english child spontaneous conversations |
---|
0:02:57 | captured in a high quality children's learning centre in the united states |
---|
0:03:03 | data was collected from thirty three children of age two to five years |
---|
0:03:09 | and from |
---|
0:03:10 | four adult teachers |
---|
0:03:12 | three females and one male |
---|
0:03:15 | based on actual diagnosis eight of the children are at risk for example for speech or |
---|
0:03:21 | language delay |
---|
0:03:22 | the speech data was gathered in three inclusive early childhood classrooms |
---|
0:03:29 | during naturally occurring morning and afternoon activities |
---|
0:03:33 | children were involved in typical morning |
---|
0:03:37 | activities and routines |
---|
0:03:40 | the data was gathered with lena recording units which are |
---|
0:03:45 | lightweight compact audio recorders |
---|
0:03:48 | that cause minimal self awareness for the speaker while capturing |
---|
0:03:54 | the conversation |
---|
0:03:56 | the child training corpus consists of about fifteen hours of manually transcribed audio with transcripts |
---|
0:04:04 | of one hundred twenty thousand tokens |
---|
0:04:07 | while the adult data consists of twenty three hours of manually transcribed audio with three |
---|
0:04:12 | hundred thousand words in the transcripts |
---|
0:04:15 | in addition an out of domain conversational like web text corpus was also used |
---|
0:04:21 | consisting of two point six million word tokens |
---|
0:04:25 | all results are reported on a three hour test |
---|
0:04:29 | data set |
---|
0:04:31 | of child speech for development a one point five hour dataset was used |
---|
0:04:39 | baseline recognition system acoustic models are tied state left to right three state hmms with |
---|
0:04:47 | gaussian mixture observation densities |
---|
0:04:50 | all triphone based models are word position dependent the models are trained |
---|
0:04:56 | on thirty nine dimensional mfccs |
---|
0:05:01 | the features are spliced over nine frames and projected to forty dimensions using lda and |
---|
0:05:07 | mllt |
---|
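Editorial sketch: the nine-frame splicing step described above can be illustrated in a few lines. This is a minimal sketch, not the speaker's pipeline. The function name `splice_frames`, the edge padding by frame repetition, and the toy feature dimension of 3 are all assumptions made for illustration; the later projection to forty dimensions is not shown.

```python
def splice_frames(frames, context=4):
    """Stack each frame with `context` neighbours on each side.

    With context=4 every output vector covers nine frames. Edge frames are
    padded by repeating the first or last frame (an assumption).
    """
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        window = []
        for offset in range(-context, context + 1):
            # Clamp the index so the window never leaves the utterance.
            idx = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy utterance: 5 frames with 3 coefficients each (toy dimension, hypothetical).
utterance = [[float(t)] * 3 for t in range(5)]
spliced = splice_frames(utterance, context=4)
print(len(spliced), len(spliced[0]))  # 5 frames, each 9 * 3 = 27 values
```

Each spliced vector would then be reduced by a linear projection, which is where the forty-dimensional features mentioned in the talk come from.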
0:05:09 | next speaker adaptive training is performed using mllr |
---|
0:05:14 | the three gram language model is built using manual transcriptions from the child speech corpus |
---|
0:05:21 | the lexicon this problem |
---|
0:05:25 | which consist of the most one hundred fifty four five in all those |
---|
0:05:32 | fifteen hours of transcribed conversation |
---|
0:05:35 | speech is used |
---|
0:05:42 | a deep neural network system is trained to estimate hmm state likelihoods |
---|
0:05:47 | alignments are produced by the sat gmm hmm system |
---|
0:05:51 | in the experiments with the original child training data set we used a dnn topology with two hidden |
---|
0:05:58 | layers with two thousand forty eight neurons per hidden layer and the output layer is based |
---|
0:06:03 | on softmax |
---|
0:06:06 | sequence discriminative training is applied with an smbr objective |
---|
0:06:10 | the dnn uses the same features as our sat gmm hmms that is |
---|
0:06:17 | features spliced using a context of nine frames |
---|
0:06:21 | followed by lda and mllt |
---|
0:06:24 | and mllr |
---|
0:06:30 | the constraint given for this speech recognition task is that the quantity of text and available transcribed |
---|
0:06:37 | audio data for spontaneous child speech is limited |
---|
0:06:42 | alternate data augmentation methods |
---|
0:06:45 | are therefore explored |
---|
0:06:47 | for language and acoustic model enhancement |
---|
0:06:54 | to improve the language model three alternate data augmentation techniques are investigated adding adult |
---|
0:07:01 | data web data and producing additional text with rnns |
---|
0:07:07 | a language model is estimated using the supplemental text resources and interpolated with the original baseline |
---|
0:07:14 | language model |
---|
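Editorial sketch: interpolating a supplemental language model with the baseline, and measuring the effect through perplexity, can be illustrated as below. This is a minimal sketch assuming simple linear interpolation with a single weight `lam`; the talk does not state the interpolation weights, and the per-word probabilities here are made-up toy values, not results from the study.

```python
import math


def interpolate(p_base, p_extra, lam):
    """Linear interpolation of two word probabilities with weight lam."""
    return lam * p_base + (1.0 - lam) * p_extra


def perplexity(word_probs):
    """Perplexity is the exponential of the average negative log probability."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical probabilities assigned to four held-out words by each model.
base = [0.10, 0.02, 0.05, 0.01]
extra = [0.08, 0.06, 0.04, 0.05]
lam = 0.5  # assumed weight; in practice it is tuned on development data

mixed = [interpolate(b, e, lam) for b, e in zip(base, extra)]
print(perplexity(base), perplexity(mixed))
```

In this toy case the mixed model has lower perplexity than the baseline because the supplemental model covers words the baseline rates as unlikely. A lower perplexity does not guarantee a lower word error rate.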
0:07:16 | adult data |
---|
0:07:17 | the use of twenty three hours of manually annotated adult transcriptions with three hundred |
---|
0:07:23 | thousand word tokens |
---|
0:07:25 | is investigated for data augmentation |
---|
0:07:29 | all adult conversational data was recorded in the childcare centre |
---|
0:07:34 | web data conversational like web text data with two point six million |
---|
0:07:39 | words is explored to improve the language model |
---|
0:07:45 | rnn based text generation |
---|
0:07:47 | text is generated using an rnn fifty million words |
---|
0:07:51 | the rnn has two hidden layers and five hundred twelve units per layer |
---|
0:07:56 | the rnn finds long context word dependencies |
---|
0:08:01 | it produces meaningful sentences and maintains the same vocabulary |
---|
0:08:11 | to assess the improvement derived from the use of supplemental text resources |
---|
0:08:18 | contrastive experiments are performed with alternate language models |
---|
0:08:22 | gmm hmm and dnn acoustic models are based on the original transcribed child audio |
---|
0:08:28 | from the table it is observed that perplexity is improved using all the |
---|
0:08:33 | augmentation techniques |
---|
0:08:34 | the word error rate improvement is achieved only with the language model incorporating |
---|
0:08:40 | adult and child training transcripts |
---|
0:08:42 | web text and rnn generated text |
---|
0:08:45 | resulting in text with |
---|
0:08:47 | fifty three million word tokens |
---|
0:08:54 | in this case the perplexity is reduced by about eleven points |
---|
0:08:59 | with a tiny gain of zero point zero nine absolute wer |
---|
0:09:03 | over the baseline |
---|
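Editorial sketch: the word error rate figures quoted in this talk are absolute differences in a percentage. As a reminder of how that percentage is computed, the sketch below counts substitutions, insertions and deletions with a standard word-level edit distance. The function name `word_error_rate` and the example sentences are hypothetical; the talk does not say which scoring tool was used.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)


# One substitution and one insertion against a five word reference.
wer = word_error_rate("i want the red ball", "i want a red ball please")
print(wer)  # 40.0
```

An absolute gain of 0.09 therefore means the percentage itself dropped by 0.09, not that the error count fell by 0.09 percent of its previous value.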
0:09:14 | for acoustic model augmentation we investigate three alternate approaches |
---|
0:09:21 | speed and tempo perturbation and use of the adult dataset |
---|
0:09:25 | we investigate the impact of different perturbation coefficients and alternate numbers of copies of |
---|
0:09:32 | the original child data set of fifteen hours |
---|
0:09:37 | speed perturbation emulates both pitch and tempo variations in the speech signal |
---|
0:09:43 | speed modification is achieved by resampling the signal |
---|
0:09:47 | we explore augmentation of the training dataset by changing the speed of the |
---|
0:09:52 | audio signal resulting in four versions of the original child |
---|
0:09:58 | training data with speed factors of zero point eight zero point nine one point one and one |
---|
0:10:04 | point two |
---|
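Editorial sketch: speed modification by resampling, as described above, can be illustrated with plain linear interpolation. This is a minimal sketch under stated assumptions: the function name `speed_perturb` is hypothetical, and linear interpolation stands in for whatever resampler the authors actually used, which the talk does not specify. Reading the signal at a step of `factor` samples shortens or lengthens it, which shifts pitch and tempo together.

```python
def speed_perturb(samples, factor):
    """Resample a signal so it plays `factor` times faster.

    factor > 1 gives a shorter signal, factor < 1 a longer one.
    Linear interpolation is an assumption made for brevity.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    out_len = int(len(samples) / factor)
    out = []
    for n in range(out_len):
        pos = n * factor  # fractional read position in the input
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out


# Toy ramp signal standing in for one utterance.
signal = [float(i) for i in range(1000)]
for factor in (0.8, 0.9, 1.1, 1.2):  # the four factors named in the talk
    print(factor, len(speed_perturb(signal, factor)))
```

Adding two perturbed copies to the original fifteen hours triples the amount of audio, which matches the forty five hour set mentioned later in the talk.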
0:10:05 | with tempo perturbation |
---|
0:10:08 | the tempo of the signal is modified while the pitch and spectral envelope of |
---|
0:10:13 | the signal are not changed |
---|
0:10:15 | the training dataset was enlarged by creating four additional copies of the original |
---|
0:10:20 | child training data with modified tempo factors of zero point eight zero point nine |
---|
0:10:27 | one point one and one point two |
---|
0:10:31 | adult dataset use |
---|
0:10:33 | we add adult data on top of the child training dataset |
---|
0:10:36 | the adult dataset is |
---|
0:10:39 | comprised of twenty three hours of transcribed audio from the teachers |
---|
0:10:45 | all data was recorded in a childcare centre |
---|
0:10:54 | acoustic model augmentation results are provided in the table |
---|
0:10:59 | in the experiments we use the language model with child training transcriptions and in addition |
---|
0:11:05 | adult web and rnn generated text |
---|
0:11:09 | the table shows that for the gmm hmm system the highest wer improvement is obtained by |
---|
0:11:14 | incorporating two copies of child transcribed audio with speed factors |
---|
0:11:20 | of zero point nine and one point one |
---|
0:11:27 | in this case forty five hours of training data is used with a wer improvement |
---|
0:11:33 | of zero one that one absolute compared to the original child training dataset |
---|
0:11:40 | the performance of the dnn hmm systems is also summarized in this table |
---|
0:11:47 | the top line indicates |
---|
0:11:52 | that with the original child transcribed audio set an improvement of five point forty two percent |
---|
0:11:58 | absolute is obtained by dnn hmm training over the gmm hmm system |
---|
0:12:05 | comparing the dnn performance with different acoustic models it can be observed that an |
---|
0:12:11 | absolute wer reduction of two point forty eight is achieved using the forty five hour dataset |
---|
0:12:18 | which incorporates |
---|
0:12:21 | the perturbed audio signals with speed factors of zero point nine and one point one |
---|
0:12:31 | finally we investigate a one hundred fifty eight hour dataset that additionally includes transcribed adult |
---|
0:12:39 | data in this case the highest improvement of eight point zero three is achieved over the |
---|
0:12:45 | baseline |
---|
0:12:58 | the environment of the child in classroom settings is important for child speech |
---|
0:13:04 | learning there is a need to identify which children are at risk for low language engagement and |
---|
0:13:11 | these children should receive more teacher support during the learning activities |
---|
0:13:16 | we assess children's speech development using word count trends for each child |
---|
0:13:22 | word counts are estimated based on the output hypotheses of our best dnn |
---|
0:13:27 | speech recognition system |
---|
0:13:29 | the estimated word counts |
---|
0:13:31 | may not be completely accurate |
---|
0:13:34 | however they are consistent and we are therefore still able to establish which |
---|
0:13:41 | children have little conversational interaction |
---|
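Editorial sketch: the argument above is that recognizer word counts can be wrong in absolute terms yet still order the children correctly. The sketch below shows that idea on made-up data. The child identifiers, utterances, and the helper names `word_counts` and `rank_children` are all hypothetical, and nothing here reproduces the study's measurements.

```python
from collections import Counter


def word_counts(utterances):
    """Sum the number of words per child from (child_id, text) pairs."""
    counts = Counter()
    for child_id, text in utterances:
        counts[child_id] += len(text.split())
    return counts


def rank_children(counts):
    """Order children from fewest to most words, ties broken by id."""
    return sorted(counts, key=lambda child: (counts[child], child))


# Hypothetical reference transcripts and error-prone recognizer output.
reference = [
    ("child1", "i want the red ball"),
    ("child2", "no"),
    ("child3", "look at my big tower"),
    ("child2", "mine"),
    ("child1", "can we go outside now"),
]
hypothesis = [
    ("child1", "i want red ball"),
    ("child2", "no"),
    ("child3", "look my big tower"),
    ("child2", "mine mine"),
    ("child1", "can we go outside"),
]

ref_rank = rank_children(word_counts(reference))
hyp_rank = rank_children(word_counts(hypothesis))
print(ref_rank, hyp_rank, ref_rank == hyp_rank)
```

In this toy case every hypothesis count differs from its reference count, but the child with the least speech is the same in both orderings, which is the property the talk relies on.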
0:13:47 | the word counts are estimated based on the output hypotheses of our best speech recognition system |
---|
0:13:53 | comparing the word counts from the reference transcripts with |
---|
0:13:57 | those from the hypotheses it can be seen that even if there are speech recognition |
---|
0:14:02 | system errors it is still possible to establish which children have little conversational interaction and |
---|
0:14:10 | are at risk |
---|
0:14:12 | and it is child for a second and third |
---|
0:14:17 | this used number or the cell for and sell five |
---|
0:14:21 | based or on |
---|
0:14:23 | work on synthetic i borders is |
---|
0:14:26 | due to challenges in this naturalistic child to child and adult to child learning space the |
---|
0:14:33 | word counts are not completely accurate but they are consistent and we are still able |
---|
0:14:39 | to establish which children have low conversational interaction |
---|
0:14:44 | these children should get more teacher support |
---|
0:14:48 | for social and academic learning during activities in the daycare centre |
---|
0:14:56 | this study |
---|
0:14:57 | has investigated the benefits of applying data augmentation techniques |
---|
0:15:03 | for children aged from two to five years |
---|
0:15:08 | in assessing child naturalistic engagement through speech recognition |
---|
0:15:14 | we explored several data augmentation techniques to advance language and acoustic models and showed which |
---|
0:15:20 | provided gains in |
---|
0:15:21 | speech recognition performance |
---|
0:15:24 | we also explored assessment of child language development through word count trends |
---|
0:15:30 | the results show that even a poorly performing speech recognition system can contribute to extracting |
---|
0:15:37 | a conversational engagement assessment |
---|
0:15:41 | alternate text augmentation approaches |
---|
0:15:44 | were investigated to increase the limited amount of original transcribed conversation sides using |
---|
0:15:52 | adult data |
---|
0:15:54 | web data |
---|
0:15:55 | and text generated by rnns |
---|
0:15:59 | interpolating these texts collectively leads to a perplexity improvement |
---|
0:16:06 | but there is |
---|
0:16:07 | very little wer gain over the original baseline |
---|
0:16:13 | next acoustic augmentation techniques for child speech were explored based on |
---|
0:16:18 | speed perturbation tempo perturbation and adult data |
---|
0:16:23 | the experiments were performed with training data ranging from fifteen |
---|
0:16:29 | to one hundred fifty eight hours |
---|
0:16:32 | both speed and tempo perturbation were shown to improve word error rate with speed |
---|
0:16:37 | perturbation factors of zero point nine and one point one being the most beneficial |
---|
0:16:43 | the greatest word error rate reduction of eight point zero three absolute was achieved over |
---|
0:16:49 | the baseline after incorporating augmented audio with adult data the improved language model |
---|
0:16:55 | and using the dnn system |
---|
0:16:58 | conversational interaction via word count was explored to assess children's speech engagement |
---|
0:17:05 | this helped to establish a relative rank ordering |
---|
0:17:09 | of conversational interaction |
---|
0:17:12 | and therefore word count trends |
---|
0:17:15 | provide a separation between the at risk and typically developing children in |
---|
0:17:22 | the |
---|
0:17:22 | so it's |
---|
0:17:23 | so out of the active learning space |
---|