Alright, welcome to the second session on acoustics. We will follow this immediately with the sponsors' session and then dinner. Over to our speaker.
Thank you. Okay, sorry.
Hello everyone, and welcome to my talk. Is the microphone better now? Sound check. Okay, that's good.
So, welcome to my talk.
Today I'd like to present the study that I conducted together with my colleagues. First of all, I'd like to thank them; without them it would have been impossible to conduct this research. As you can probably guess, this topic is related to the big problem introduced at the beginning of our conference today: it's also about situated interaction and multi-party interaction.
So, the title is "Cross-Corpus Data Augmentation for Acoustic Addressee Detection".
First of all, I'd like to clarify what addressee detection actually is. It's a common trend that modern spoken dialogue systems are getting more adaptive and human-like; nowadays they are able to interact with multiple users under realistic conditions, in the real physical world.
It may happen that not a single user but a group of users interacts with the system, and this is exactly where the addressee detection problem arises: it appears in conversations between a technical system and a group of users. We are going to call this kind of interaction human-machine conversation, and here we have a realistic example from our data.
In such a mixed kind of interaction, the spoken dialogue system is supposed to distinguish between human-directed and computer-directed utterances, which means solving a binary classification problem. In order to maintain efficient conversations in a realistic manner, it's important that the system does not give a direct answer to human-directed utterances, because otherwise it would interrupt the dialogue flow between the two human participants.
A similar problem arises in conversations between several adults and a child; by analogy with machine-directed addressee detection, we call this problem adult-child addressee detection. Here we have again a realistic example of how not to educate your children with smartphones. In this case the system is supposed to distinguish between adult-directed and child-directed utterances produced by adults, and this also means a binary classification problem. This functionality may be useful for a system performing child development monitoring: namely, let's assume that the less distinguishable the child-directed and adult-directed acoustic patterns are, the bigger the progress the child has made in maintaining social interactions, and in particular in maintaining spoken conversations.
Now, let's find out whether these two addressee detection problems have anything in common. First of all, we need to answer the question of how we address other people in real life. The simplest way to do this is just by name, or with a wake word like "Okay Google" or "Alexa", or something like this. Then we can do the same thing implicitly, for example by using gaze: I'm looking at someone while talking to them. Then there are some contextual markers, like specific topics or particular vocabulary. And the last alternative is a modified acoustic speaking style, our prosody. The present study is focused exactly on this latter way of addressing subjects in a conversation.
The idea behind acoustic addressee detection is that people tend to change the manner of their speech depending on whom they are talking to. For example, we may face some special addressees, such as hard-of-hearing people, elderly people, children, or spoken dialogue systems, that in our opinion might have some communication difficulties. Talking to such addressees, we intentionally modify the manner of our speech, making it more distinct, loud, and in general more understandable, since we do not perceive them as adequate conversational agents. The main assumption that we make here is that human-directed speech is supposed to be similar to adult-directed speech, and in the same way, machine-directed speech must be quite similar to child-directed speech.
In our experiments we use a relatively simple and yet efficient data augmentation approach called mixup. Mixup encourages a model to behave linearly in the space between seen data points, and it already has quite a few applications in ASR, in image recognition, and in many other popular fields.
Basically, mixup generates artificial examples as linear combinations of two random real feature and label vectors, taken with the coefficients lambda and one minus lambda. This lambda is a real number randomly drawn from a beta distribution specified by its only parameter alpha. Technically, alpha lies within the interval from zero to infinity, but according to our experiments, alpha values higher than one already lead to underfitting, so in our opinion the most reasonable interval in which to vary this parameter is from zero to one.
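As a rough illustration of that procedure, here is a minimal NumPy sketch of how a single mixup example could be generated; the function name, the feature dimensionality, and the one-hot labels are illustrative assumptions rather than the exact implementation used in the study.

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.4):
    """Build one artificial example as a convex combination of two real
    (feature vector, one-hot label) pairs, with lambda ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)        # mixing coefficient in [0, 1]
    x_mix = lam * x1 + (1.0 - lam) * x2       # mixed feature vector
    y_mix = lam * y1 + (1.0 - lam) * y2       # mixed (soft) label vector
    return x_mix, y_mix

# Toy usage: two utterance-level feature vectors with one-hot addressee labels
x_a, y_a = np.random.randn(128), np.array([1.0, 0.0])   # e.g. human-directed
x_b, y_b = np.random.randn(128), np.array([0.0, 1.0])   # e.g. machine-directed
x_new, y_new = mixup_pair(x_a, y_a, x_b, y_b, alpha=0.4)
```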
The next question is how many examples to generate. Imagine that we just merge C different datasets without applying any data augmentation, simply putting them together, so that we generate one batch from each dataset. This already means that we increase the initial amount of training data of the target corpus C times. But if we additionally apply mixup, then along with these C batches we also generate K artificial examples from each real example, increasing the amount of training data C multiplied by K plus one times. It's important to note that the artificial examples are generated on the fly, without any significant delays in the training process; we just do it on the go.
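To make that counting concrete, here is a hedged sketch of such an on-the-fly generator: per step it draws one batch from each of the C corpora and adds K mixup examples per real example, so each yielded step carries roughly C times (K + 1) as much data as a single-corpus batch. The names and batch sizes are assumptions for illustration, not the actual training code.

```python
import numpy as np

def cross_corpus_batches(corpora, batch_size=32, k=1, alpha=0.4):
    """Yield merged training batches from C corpora, augmented with mixup.

    corpora: list of (X, Y) pairs, one per corpus, with one-hot labels Y.
    Each yielded step contains about C * (k + 1) * batch_size examples.
    """
    while True:
        xs, ys = [], []
        for X, Y in corpora:                              # one real batch per corpus
            idx = np.random.randint(0, len(X), batch_size)
            xs.append(X[idx]); ys.append(Y[idx])
            for _ in range(k):                            # k artificial examples per real one
                jdx = np.random.randint(0, len(X), batch_size)
                lam = np.random.beta(alpha, alpha, size=(batch_size, 1))
                xs.append(lam * X[idx] + (1 - lam) * X[jdx])
                ys.append(lam * Y[idx] + (1 - lam) * Y[jdx])
        yield np.concatenate(xs), np.concatenate(ys)      # built on the fly, no caching
```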
Here you can see the models that we used to solve our problem; they are arranged according to their complexity, from left to right. The first model is a simple linear SVM using the ComParE functionals as input. This is a pretty popular feature set in the area of emotion recognition; it was introduced at the INTERSPEECH ComParE challenge in 2013, I guess. These features are extracted from the whole utterance.
Next we apply the LLD model, which includes a recurrent neural network with long short-term memory. It receives the low-level descriptors that were also used to compute the ComParE functionals for the first model. In contrast to the functionals, the LLDs have a time-continuous nature; it's a time-continuous signal.
The last model is an end-to-end model performing raw signal processing. It receives just the raw audio utterance, which passes through a stack of convolutional input layers, and after the convolutional component there is the same recurrent network with long short-term memory that was introduced within the previous model. As the reference point for the convolutional component we took the five-layer SoundNet architecture and slightly modified it for our needs: namely, we reduced its dimensionality by reducing the number of filters in each layer according to the amount of data that we have at our disposal, and we also reduced the kernel sizes according to the dimensionality of the signal that we have.
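For concreteness, here is a minimal PyTorch sketch of an end-to-end model of this general shape: a reduced five-block 1-D convolutional front end over the raw waveform, followed by an LSTM and a binary addressee output. The filter counts, kernel sizes, pooling, and hidden size are illustrative guesses, not the exact configuration derived from SoundNet in the paper.

```python
import torch
import torch.nn as nn

class EndToEndAddresseeModel(nn.Module):
    """Raw waveform -> reduced conv stack -> LSTM -> addressee probability."""
    def __init__(self, n_filters=(16, 32, 64, 128, 128), hidden=64):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in n_filters:                  # five conv blocks with reduced widths
            blocks += [nn.Conv1d(in_ch, out_ch, kernel_size=8, stride=2, padding=3),
                       nn.BatchNorm1d(out_ch), nn.ReLU(), nn.MaxPool1d(2)]
            in_ch = out_ch
        self.conv = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(in_ch, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, wave):                      # wave: (batch, samples)
        x = self.conv(wave.unsqueeze(1))          # -> (batch, channels, frames)
        x = x.transpose(1, 2)                     # -> (batch, frames, channels)
        _, (h, _) = self.lstm(x)                  # last hidden state of the LSTM
        return torch.sigmoid(self.out(h[-1]))     # probability of one addressee class

model = EndToEndAddresseeModel()
scores = model(torch.randn(4, 32000))             # e.g. four 2-second utterances at 16 kHz
```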
Here you can see the data that we have at our disposal. We have two datasets for modelling machine-directed addressee detection. The first is the Smart Video Corpus, which contains interactions between a user, a confederate, and a mobile SDS; by the way, this is the only corpus that was modelled in a Wizard-of-Oz setting. The next corpus is the Voice Assistant Conversation Corpus, which, similarly to the SVC, contains interactions between a user, a confederate, and an Amazon Alexa, so this data is real, without any Wizard-of-Oz simulation. The third corpus is HomeBank, which includes conversations between an adult, another adult, and a child.
We tried to reuse the same splits into training, development, and test sets that were introduced in the original studies published with the corpora, and the proportions turned out to be approximately the same: training, development, and test are split roughly in the proportion of five to one to four.
First, we conduct some preliminary analysis with the linear model, the functionals model. We perform feature selection by means of recursive feature elimination: we iteratively exclude a small portion of the ComParE features with the lowest SVM weights, and then we measure the performance of the reduced feature set in terms of unweighted average recall. A feature set is considered optimal if further dimensionality reduction leads to a significant information loss.
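A minimal scikit-learn sketch of that kind of selection loop, assuming a linear SVM whose weights drive the elimination and balanced accuracy (which coincides with unweighted average recall here) as the score; note it uses cross-validation instead of the fixed development split from the study, and the step size and C value are arbitrary illustrative choices.

```python
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

# X_train: utterance-level ComParE functionals, y_train: binary addressee labels
selector = RFECV(
    estimator=LinearSVC(C=0.01, max_iter=10000),  # features ranked by |SVM weight|
    step=0.05,                       # drop the 5% lowest-weighted features per round
    scoring="balanced_accuracy",     # equals UAR for this binary task
    cv=3,
)
# selector.fit(X_train, y_train)
# X_reduced = selector.transform(X_train)   # the selected (near-optimal) feature subset
```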
In this figure we see that the optimal feature sets vary significantly. It's also very interesting that the size of the optimal feature set on the SVC is much greater than on the other two corpora. This may be explained by the Wizard-of-Oz modelling: probably some of the participants didn't really believe that they were interacting with a real technical system, and this resulted in slightly different acoustic addressing patterns.
Another sequence of experiments that we conducted is LOCO and inverse LOCO experiments. LOCO means leave-one-corpus-out; everyone knows what that means. Inverse LOCO simply means that we train our model on one corpus and test it on each of the other corpora separately. In this figure there is a pretty clear relation between the VACC and the SVC, and it's pretty natural that these corpora are perceived as similar by our system, because their domains are pretty close and they were both uttered in German.
In contrast, HomeBank was uttered in English, and as we can see from this figure, our linear model fails to find any direct relation between this corpus and the other two.
But let's take a look at the next figure. Here we notice a very interesting trend: even though HomeBank significantly differs from the other two corpora, the linear model trained on any two corpora performs on each of them equally well, as if it were trained on each corpus separately and tested on it separately. It means that the datasets that we have are, at least, not contradictory.
Now let's take a look at our experiments with the LLD model and various context lengths. Here, in each of the three cases, red, green, and blue, we see that the dashed line is located above the solid one, meaning that mixup results in an additional performance improvement even when applied to a single corpus. It's also interesting to note that a context length of two seconds turns out to be optimal for each of the corpora, even though they have very different utterance length distributions. So two seconds is sufficient to predict addressees using the acoustic modality.
Unfortunately, mixup gives no performance improvement for the end-to-end model; probably we just don't have enough data for it.
We repeated the same LOCO and inverse LOCO experiments with the neural-network-based models, and they both show the same trends: the SVC and the VACC seem quite similar to them, and actually the end-to-end model managed to capture this similarity even better compared to the LLD one. But there is an issue with multitask learning. The issue is that our neural network, regardless of which one we start with, drifts towards the easiest task, the one with the highest correlation between features and labels. Here you can see that the model trained on two datasets completely ignores HomeBank, even though it was trained on this corpus, and it also stops discriminating the addressees in this dataset. The situation changes if we apply mixup over the corpora: the model actually starts perceiving both corpora really efficiently, as if it were trained on each corpus separately and tested on each corpus separately.
Again, we conducted a similar experiment, this time merging all three datasets, with and without mixup, using all three models. Here we can see that mixup regularises both neural models, the LLD and the end-to-end one, and also prevents overfitting to a specific corpus, namely the one with the highest correlation between the features and labels, which is the easiest task for our system. Unfortunately, mixup doesn't provide an improvement for the functionals model, but actually this model doesn't suffer from overfitting to a specific task and doesn't need to be regularised, due to its very simple architecture.
The last series of experiments is experiments with ASR confidence features. The idea behind them is that system-directed utterances tend to match the ASR acoustic and language models much better compared to human-addressed utterances. This definitely works in the human-machine setting, but it seems not to work in the adult-child setting. We analysed the data itself, deep inside, and noted that sometimes, when addressing children, people don't even use words; instead they just use some separate intonations or sounds, without any words at all. This causes real problems for our ASR, meaning that the ASR confidence will be equally low for both of the target classes. This is the reason why it performs so poorly on the HomeBank problem.
Here we come to the conclusions. We can conclude that mixup improves classification performance for models with predefined features and also enables multitask learning abilities for both end-to-end models and models with hand-crafted feature sets. A two-second speech fragment allows us to capture addressees with sufficient quality; actually, the same conclusion was drawn by another group of researchers regarding the English language.
And as I told you a couple of slides before, ASR confidence is not representative for adult-child addressee detection, although it is still useful for human-machine addressee detection. In our experiments we also beat a couple of baselines: we introduced the first official baseline for the VACC corpus, and we beat the HomeBank end-to-end baseline. For future directions, I would propose extending our experiments by applying mixup to two-dimensional spectrograms and to features extracted with and without the convolutional component. Thank you.
We have time for some questions.
Hi. I was wondering why you chose to treat adult-child interaction alongside human-machine interaction. Is there any literature that led to this decision, or was it just sort of intuition?
It was our assumption, without any specific background. I mean, it was an interesting assumption, something interesting to prove or disprove.
Conceptually it should be like this: sometimes we perceive a system as an infant, or as a person having a lack of communication skills, and that's what we took as the basic assumption.
So conceptually the two settings are not distinct?
Yes, actually they probably overlap, but only partially. What's important is that in our experiments a single system is capable of solving both tasks simultaneously.
It performs far worse on the adult-child corpus, though.
Yes, but that's because the baseline performance is far worse. I mean, the highest baseline on HomeBank is like 0.64, or 0.66 or something like this. So it's just a matter of the data quality.
Hi, thanks for the interesting talk. I was wondering, maybe I missed something: did you use any language features? If not, could you speculate whether they would have an impact on the performance?
What do you mean by language features, like separate words?
For instance, if I'm talking to a child, I might address the child in a different way than I address adults.
Okay, well, it's a difficult question. Remember that I told you that sometimes, talking to a child, we don't use real words.
That is a problem for language modelling, right? I mean, my hypothesis is that you would simplify the language you use when addressing a child compared to when you address an adult.
Yes, we do. My speculation on this would be: yes, we could try to leverage both textual and acoustic modalities to solve the same problem.
Okay, next.
Time for one more question or comment.
I just wondered: have you checked how well you do with respect to the results of the challenge? The same dataset, or a similar one, was used as part of the INTERSPEECH ComParE challenge, and I think the baseline was seventy point something. I'm also curious whether you looked at the majority baseline, that is, just predicting the majority class, because it's essentially a binary class prediction you're doing, and one worry is that your model just learns how to predict the majority class.
No, I use unweighted average recall, and if the model predicted just the majority class, meaning that it assigned all the examples to a single class, the performance metric would be not above 0.5, because it's a class-balanced metric.
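To make that argument concrete, here is a tiny sketch using scikit-learn's balanced_accuracy_score (which equals UAR for this binary setup) on a made-up imbalanced label set: an always-majority-class predictor lands at exactly 0.5.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Hypothetical imbalanced test set: 90 majority-class and 10 minority-class utterances
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)                  # always predict the majority class

print(balanced_accuracy_score(y_true, y_pred))  # 0.5 = mean of per-class recalls (1.0, 0.0)
```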
Sure, but for instance, if you look at the baseline for that challenge, it's about seventy point something.
You mean the baseline for the HomeBank corpus, using the end-to-end model or similar? No, actually the end-to-end baseline was the worst baseline, around sixty-four.
I remember the article released right before the submission deadline for the challenge, and the result there, the baseline for the end-to-end model, was something like 0.59. If we talk about the entire multimodal system, the baseline was like 0.7 or so, but they used much larger feature sets for that, and several models, a whole collection of models, including bag-of-audio-words, end-to-end models, LLDs, and all that stuff.
Okay, let's thank our speaker again.