My name is [inaudible], and I will talk about the work we did at [inaudible] to assess child communication engagement via speech recognition in naturalistic active learning spaces.
This study explores child conversational engagement through automatic speech recognition in a naturalistic classroom setting.
The ability to assess children's conversational interaction is critical for typically developing and at-risk children, for example those with language delay. The earlier a child is identified, the earlier support can be provided to reduce the social impact of the speech disorder.
While research has considered child speech recognition in the past, most of the studies focused on the six to eighteen age group. Only a few studies have explored preschoolers' speech recognition. They use words, phrases, and conversational speech, but are based on structured human-computer interaction scenarios.
In our study, the scenario is based on naturalistic conversational interaction between child and adult, and child and child, in active learning spaces, where children and adults wear body-attached recorders. We investigated child speech recognition where age ranges from 2.5 to 5 years.
In this study we explore data augmentation techniques for child naturalistic speech recognition. Data augmentation has been shown to improve the performance of adult speech recognition systems, but it has not been extensively studied for child speech.
Finally, we investigate word count rates to assess child speech development for typically developing children as well as those children that might be at risk. The word counts are estimated based on the hypotheses of our speech recognition system. Word count estimation provides insight into the assessment of child language engagement in active learning spaces and helps identify which child might need more teacher attention.
All experiments reported in this study use American English child spontaneous conversations captured in a high-quality childcare learning centre in the United States.
Data was collected from thirty-three children aged 2.5 to 5 years and from four adult teachers, three females and one male. Based on diagnosis, eight of the children are at risk, for example of speech or language delay.
The speech data was gathered in three inclusive early childhood classrooms during naturally occurring morning and afternoon activities. Children were recorded during typical morning activities and routines. The data was gathered using wearable recording units, which are lightweight, compact audio recorders that cause minimal awareness for the speaker and capture the entirety of the conversation.
The child training corpus consists of about fifteen hours of manually transcribed audio, with transcripts of one hundred twenty thousand word tokens, while the adult data consists of twenty-three hours of manually transcribed audio with three hundred thousand words in the transcripts. In addition, an out-of-domain conversational-like web text corpus was also used, consisting of 2.6 million word tokens.
All results are reported on a three-hour test set of child speech. For development, a 1.5-hour dataset was used.
Baseline recognition system: the acoustic models are tied-state, left-to-right, three-state HMMs with Gaussian mixture observation densities. Also, the triphone-based models are word-position dependent. The models are trained on thirty-nine dimensional MFCCs. The features are spliced over nine frames and projected to forty dimensions using LDA and MLLT.
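The splicing step described here can be sketched as follows. This is only an illustration, assuming a context of four frames on each side (nine in total) and edge padding by repeating the first and last frame; the function name `splice_frames` is hypothetical. The LDA and MLLT projection down to forty dimensions is not shown, because it depends on transforms estimated from the training data.

```python
from typing import List


def splice_frames(frames: List[List[float]], context: int = 4) -> List[List[float]]:
    """Concatenate each frame with `context` neighbours on each side.

    Edges are padded by repeating the first or last frame (an assumption;
    the talk does not say how edges are handled).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to a valid frame index
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy input: 5 frames of 39-dimensional features (placeholder values).
toy = [[float(t)] * 39 for t in range(5)]
out = splice_frames(toy)
print(len(out), len(out[0]))  # number of frames, spliced dimension (9 * 39)
```

Each spliced vector has 9 × 39 = 351 values, which is the input that a projection to forty dimensions would then reduce.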
Next, speaker adaptive training is performed using fMLLR. The trigram language model is built using manual transcriptions from the child speech corpus. The lexicon [inaudible] consists of about one hundred fifty-four thousand words. Fifteen hours of transcribed conversational speech is used.
A deep neural network system is trained to estimate the HMM state likelihoods. Alignments are produced by the SAT GMM-HMM system. In the experiments with the original child training dataset, we used a DNN topology with two hidden layers and 2048 neurons per layer, and the output layer is based on softmax. Sequence discriminative training is applied with the sMBR objective. The DNN uses the same features as our SAT GMM-HMMs; these features are spliced using a context of nine frames, followed by LDA, MLLT, and fMLLR.
The constraint for this speech recognition task is that the amount of text and available transcribed audio data for spontaneous child speech is limited. Alternate data augmentation methods are therefore explored for language and acoustic modelling.
To improve the language model, three alternate data augmentation techniques are investigated: adding adult data, adding web data, and producing additional text with RNNs. The language model is estimated using the supplemental text resources and interpolated with the original baseline language model.
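A small sketch of what interpolating two language models means is given below. It assumes plain linear interpolation with a single weight, which is a common choice but is not confirmed by the talk; the toy unigram probabilities, the weight of 0.5, and the function names are all hypothetical.

```python
import math
from typing import Dict, List


def interpolate(p_base: Dict[str, float], p_extra: Dict[str, float], lam: float) -> Dict[str, float]:
    """Linear interpolation: lam * p_base(w) + (1 - lam) * p_extra(w)."""
    vocab = set(p_base) | set(p_extra)
    return {w: lam * p_base.get(w, 0.0) + (1.0 - lam) * p_extra.get(w, 0.0) for w in vocab}


def perplexity(model: Dict[str, float], words: List[str]) -> float:
    """Perplexity of a unigram model on a word sequence."""
    log_sum = sum(math.log(model[w]) for w in words)
    return math.exp(-log_sum / len(words))


# Toy unigram models over the same vocabulary (made-up numbers).
baseline = {"play": 0.5, "ball": 0.4, "more": 0.1}
supplement = {"play": 0.2, "ball": 0.2, "more": 0.6}

mixed = interpolate(baseline, supplement, lam=0.5)
held_out = ["more", "play", "more", "ball"]
print(perplexity(baseline, held_out), perplexity(mixed, held_out))
```

In this toy case the held-out text uses a word the baseline underestimates, so the interpolated model has lower perplexity. In practice the weight would be tuned on a development set.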
Adult data: the use of twenty-three hours of manually annotated adult transcriptions with three hundred thousand word tokens is investigated for data augmentation. All conversational adult data was recorded in the childcare centre. Web data: conversational-like web text data with 2.6 million words is explored for the language model.
RNN-based text generation: text is generated using an RNN, giving fifty million words. The RNN has two hidden layers and 512 neurons per layer. The RNN finds long-context regularities, which produce quite meaningful sentences and maintain the same vocabulary.
To assess the improvement derived from the use of supplemental text resources, contrastive experiments are performed with alternate language models. The GMM-HMM and DNN acoustic models are based on the child transcribed audio.
From the table, it is observed that perplexity is improved using language model data augmentation. The word error rate improvement is achieved only with the language model incorporating the adult and child training transcripts, web text, and RNN-generated text, resulting in text with fifty-three million word counts. In this case the perplexity is reduced by eleven points, and we had a tiny gain of 0.09 absolute WER over the baseline.
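For readers less familiar with the metric, the sketch below shows how word error rate and an absolute WER difference are usually computed. It is a generic edit-distance implementation, not the scoring tool used in this work, and the example sentences are made up.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all reference words
    for j in range(cols):
        dist[0][j] = j  # insert all hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return 100.0 * dist[-1][-1] / len(reference)


ref = "i want the red ball".split()
hyp_a = "i want red bowl".split()      # one deletion and one substitution
hyp_b = "i want the red bowl".split()  # one substitution

wer_a = word_error_rate(ref, hyp_a)
wer_b = word_error_rate(ref, hyp_b)
print(wer_a, wer_b, wer_a - wer_b)  # the last value is the absolute difference
```

An absolute gain is the plain difference between two WER percentages, as opposed to a relative gain, which divides that difference by the baseline WER.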
For acoustic model data augmentation, we investigate three alternate approaches: speed perturbation, tempo perturbation, and adult data use. We investigate the impact of different perturbation coefficients and alternate numbers of copies of the original child dataset of fifteen hours.
Speed perturbation emulates both pitch and tempo variations in the speech signal. Speed modification is achieved by resampling the signal. We explored augmentation of the training dataset by changing the speed of the audio signal, resulting in four versions of the original child training data with speed factors of 0.8, 0.9, 1.1, and 1.2.
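Speed modification by resampling can be illustrated with the sketch below. It assumes simple linear interpolation and a 16 kHz sampling rate, neither of which is stated in the talk; production tools typically use band-limited resampling instead. The function name `speed_perturb` is hypothetical.

```python
import math
from typing import List


def speed_perturb(samples: List[float], factor: float) -> List[float]:
    """Resample so that playback at the original rate is `factor` times faster.

    Duration scales by 1 / factor and every frequency scales by `factor`,
    which is why both tempo and pitch change.
    """
    out_len = int(len(samples) / factor)
    out = []
    for n in range(out_len):
        pos = n * factor  # fractional position in the original signal
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


rate = 16000  # assumed sampling rate in Hz
tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]  # 1 s at 200 Hz

for factor in (0.8, 0.9, 1.1, 1.2):
    perturbed = speed_perturb(tone, factor)
    print(factor, len(perturbed) / rate)  # new duration in seconds
```

Factors below one stretch the signal and lower its pitch; factors above one shorten it and raise its pitch.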
Tempo perturbation: the tempo of the signal is modified while the pitch and spectral envelope of the signal are not changed. The training dataset was enlarged by creating four additional copies of the original child training data with modified tempo factors of 0.8, 0.9, 1.1, and 1.2.
Adult data use: we add it on top of the child training dataset. The adult dataset is comprised of twenty-three hours of transcribed audio [inaudible]. All data was recorded in a childcare centre.
Acoustic model augmentation results are provided in the table. In the experiments we use the language model with child training transcriptions in addition to adult, web, and RNN-generated text.
The table shows that for GMM-HMM systems, the highest WER improvement is obtained by incorporating two copies of the child transcribed audio with speed factors of 0.9 and 1.1. In this case forty-five hours of training data is used, with a WER improvement of zero point [inaudible] absolute compared to the original child training dataset.
The performance of the DNN-HMM systems is also summarized in this table. The top line indicates that with the original child transcribed audio set, an improvement of 5.42 percent absolute is obtained with DNN-HMM training over the GMM-HMM system. Comparing the DNN performance with different acoustic models, it can be observed that an absolute WER reduction of 2.48 is achieved using the forty-five hour dataset, which incorporates the perturbed audio signals with 0.9 and 1.1 factors. Finally, we investigate a one hundred fifty-eight hour dataset that additionally includes transcribed adult data. In this case the highest improvement of 8.03 is achieved over the baseline.
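The two dataset sizes quoted here are consistent with a simple count of copies. The breakdown below is an inference from the figures in the talk, not something the speaker states: it assumes each perturbed copy is counted as fifteen hours (in reality a perturbed copy is slightly longer or shorter than the original) and that the largest set contains all four speed copies, all four tempo copies, and the adult data. The function name `total_hours` is hypothetical.

```python
CHILD_HOURS = 15  # original child training set
ADULT_HOURS = 23  # transcribed adult data


def total_hours(speed_copies: int, tempo_copies: int, include_adult: bool) -> int:
    """Original child set plus perturbed copies, each counted at the original length."""
    hours = CHILD_HOURS * (1 + speed_copies + tempo_copies)
    if include_adult:
        hours += ADULT_HOURS
    return hours


# Original set plus two speed-perturbed copies.
print(total_hours(speed_copies=2, tempo_copies=0, include_adult=False))  # 45

# Original set, four speed copies, four tempo copies, and the adult data.
print(total_hours(speed_copies=4, tempo_copies=4, include_adult=True))   # 158
```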
The environment of the child in classroom settings is important for child speech learning. There is a need to identify which children have low language engagement, and these children should receive more teacher support during the learning activities.
We assess children's speech development using word count rates for each child. Word counts are estimated based on the hypotheses of our best DNN speech recognition system. The word count estimates turned out to not be completely accurate. However, they are consistent, and we are therefore still able to establish which children have little conversational interaction. The word counts are estimated based on the hypotheses of our best speech recognition system.
Comparing the word counts from the references with the word counts from the hypotheses, it can be seen that even if there are speech recognition system errors, it is still possible to establish which children have little conversational interaction and are at risk.
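One way to check that estimated word counts are consistent even when they are not accurate is to compare rankings rather than raw values. The sketch below uses Spearman rank correlation, which is a swapped-in illustration and not necessarily the measure used in this work; the per-child counts and child labels are hypothetical, and the simple formula assumes no tied counts.

```python
from typing import List


def ranks(values: List[float]) -> List[int]:
    """Rank positions starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation for untied data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical word counts per child: manual transcripts vs ASR output.
reference = {"child_a": 820, "child_b": 640, "child_c": 510, "child_d": 130, "child_e": 95}
hypothesis = {"child_a": 610, "child_b": 500, "child_c": 350, "child_d": 90, "child_e": 60}

children = sorted(reference)
rho = spearman([reference[c] for c in children], [hypothesis[c] for c in children])
lowest = sorted(children, key=lambda c: hypothesis[c])[:2]
print(rho, lowest)
```

In this made-up example the recogniser undercounts every child, yet the ordering is preserved, so the children with the least conversational interaction are still the ones flagged.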
In this example these are children number four and five, based on the word counts [inaudible].
Due to challenges in this naturalistic child-child and child-adult active learning space, the word counts are not completely accurate, but they are consistent and we are still able to establish which children have low conversational interaction. These children should get more teacher support for social and academic learning during activities in the daycare centre.
This study has investigated the benefits of applying data augmentation techniques for children aged from 2.5 to 5 years in assessing child naturalistic engagement through speech recognition. We explored several data augmentation techniques to advance language and acoustic models and showed which provided gains in speech recognition performance.
We also explored assessment of child language development through word count rates. The results show that even a poorly performing speech recognition system can contribute to extracting a conversational engagement assessment.
Alternate text augmentation approaches were investigated to increase the limited amount of original transcribed conversational child speech, using adult data, web data, and text generated by RNNs. Interpolating these text collections leads to a perplexity improvement, but there is very little WER gain over the original baseline.
Next, acoustic augmentation techniques for child speech were explored based on speed perturbation, tempo perturbation, and adult data. The experiments were performed with training data ranging from fifteen to one hundred fifty-eight hours. Speed and tempo perturbation were shown to improve word error rate, with speed perturbation factors of 0.9 and 1.1 being the most beneficial. The greatest word error rate reduction of 8.03 absolute was achieved over the baseline after incorporating the augmented audio with the improved language model and using the DNN system.
Conversational interaction via word count was explored to assess children's speech engagement. This is able to establish a relative rank ordering of conversational interaction, and therefore word counts can provide a separation between the at-risk and typically developing children within the active learning space.