Marcin will be presenting the next talk.
Is this on?
So how do I do that? I need some help here, I think. Oops, I'm sorry. The presentation is on this computer, but I can't find the pointer. There is no pointer there now, right? Right. Is it the other one?
Well, I can start while this is being sorted out. I can start by saying that the work I am going to be presenting is really Kornel Laskowski's work. He very generously invited Mattias and me to collaborate with him on this, and then it turned out that he cannot make it today, which means that you are stuck with me. I will try not to make too much of a mess of his talk.
So the question that we are addressing here is a very old question in speech science: whether, or to what extent, pitch plays a role in the management of speaker change. This question has, as you can imagine, generated a huge, steady stream of papers, but if you look across those papers you can extract some broad consensus: first of all, that pitch does play some role; and secondly, that there is a binary opposition between flat pitch, which signals or is linked to turn-holding, and any kind of pitch movement, dynamic pitch, which is linked to turn-yielding. And that's it, that's the whole story.
Except of course it is not, because there are still a number of questions that you might want to ask about the contribution of pitch to turn-taking. For instance: does it matter whether you are looking at spontaneous or task-oriented material? Does it matter whether the speakers can see each other, or whether they know each other? What is the actual contribution of pitch over lexical or syntactic cues? And finally, I am a linguist, a phonetician, by training, and we know that different languages use pitch linguistically to different extents, so the question is whether this is also reflected in how they use pitch for pragmatic purposes such as turn-taking.
And then there is a whole other list of questions about how you represent pitch in your model. Do you do some kind of perceptual stylisation based on perceptual thresholds? Do you do some sort of curve fitting, polynomials, functional data analysis, what have you? Do you use a log scale, do you transform to semitones? And how far back do you look for those cues: ten milliseconds, a hundred milliseconds, one second, ten seconds?
These are all interesting and important questions, but it is very difficult to answer them in a systematic way, because any two studies you point to will vary across so many dimensions that it is very difficult to estimate, to quantify, the contribution of any one of these factors to the actual contribution of pitch to turn-taking.
So what we are trying to do here is propose a way of evaluating the role of pitch in turn-taking, a method which has three properties that we think are important. First, it is scalable, in that it is applicable to material of any size. Second, it is not reliant on manual annotation. And third, it gives you a quantitative index of the contribution of pitch, or of any other feature for that matter, because in the long run this method can be applied to any candidate turn-taking cue.
The way we chose to showcase, and also to evaluate, this method was to ask three questions which we thought were interesting for us, and which we hope are interesting to some of you. The first question is whether there is any benefit in having pitch information for the prediction of speech activity in dialogue. The second one is, if it does make a difference, how best to represent the pitch information. And the third one is how far back you have to look for these cues. These are the questions that we will be asking and trying to answer,
using Switchboard, which we divided into three speaker-disjoint sets, so that no speaker appears in more than one of them. Instead of running our own voice activity detection, we just used the forced alignments of the manual transcriptions that come with Switchboard.
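The talk does not spell out how the speaker-disjoint split was made; as a rough illustration, here is one minimal way to build such a split for dyadic conversations. The data structure (a dict from conversation ID to its speaker pair), the ratios and the function name are assumptions, not the actual code used.

```python
import random

def speaker_disjoint_split(conversations, ratios=(0.8, 0.1, 0.1), seed=0):
    """conversations: dict mapping conversation ID -> (speaker_a, speaker_b)."""
    rng = random.Random(seed)
    speakers = sorted({s for pair in conversations.values() for s in pair})
    rng.shuffle(speakers)
    # assign each speaker to train/dev/test according to the target ratios
    bounds = [int(len(speakers) * sum(ratios[:i + 1])) for i in range(3)]
    assignment = {}
    for idx, spk in enumerate(speakers):
        assignment[spk] = 0 if idx < bounds[0] else 1 if idx < bounds[1] else 2
    splits = ([], [], [])
    for conv_id, (a, b) in conversations.items():
        # keep only conversations whose two speakers landed in the same split,
        # so that no speaker ends up in more than one set
        if assignment[a] == assignment[b]:
            splits[assignment[a]].append(conv_id)
    return splits  # (train, dev, test) lists of conversation IDs
```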
What we did then, and this is the idea that lies at the heart of this method, and I am sure you have seen it before, is use an interaction chronogram, which is a sort of discretised, quantised speech/silence annotation. You have a frame of predefined duration, here a hundred milliseconds, and for each of those frames and each of the speakers you indicate whether that speaker was speaking or silent during that interval. So here we have speaker A speaking for four hundred milliseconds, then a hundred milliseconds of overlap, speaker B takes four frames of speech, there is a hundred milliseconds of silence, and then speaker A comes in again.
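As a rough sketch of what such a chronogram looks like in code, assuming (start, end) speech intervals in seconds taken from the forced alignment; the function name and input format are illustrative, not taken from the paper.

```python
import numpy as np

FRAME = 0.1  # frame duration in seconds (100 ms)

def chronogram(speech_intervals, duration):
    """speech_intervals: for each of the two speakers, a list of (start, end)
    times in seconds from the forced alignment.
    Returns a (2, n_frames) binary matrix: 1 = speaking, 0 = silent."""
    n_frames = int(round(duration / FRAME))
    chrono = np.zeros((2, n_frames), dtype=np.int8)
    for row, intervals in enumerate(speech_intervals):
        for start, end in intervals:
            # mark every frame the interval covers (rounded to frame boundaries)
            chrono[row, int(round(start / FRAME)):int(round(end / FRAME))] = 1
    return chrono

# The toy example from the slide: A speaks for 4 frames, 1 frame of overlap,
# B speaks for 4 frames, 1 frame of joint silence, then A comes in again.
toy = chronogram([[(0.0, 0.4), (0.8, 1.0)], [(0.3, 0.7)]], duration=1.0)
```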
Once you have this sort of representation, you can predict speech activity very simply. You take one speaker, whom we call the target speaker, and you take this person's speech activity history. You can also, if you are interested in that, take the other person's speech activity history. Then you try to predict whether the target speaker is going to be silent or speaking in the next hundred milliseconds. This kind of model can serve as a very neat baseline onto which you can keep adding other features, in our case pitch. You can then compare the speech-activity-only baseline with the composite speech-activity-plus-pitch model, and of course you can also compare different types of pitch parameterisation with one another.
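A sketch of one way to slice such a chronogram into fixed-window prediction examples, matching the setup just described (a window of both speakers' history as input, the target speaker's next frame as output); the names and shapes here are assumptions.

```python
import numpy as np

def make_examples(chrono, history=10, pitch=None):
    """Turn a (2, T) chronogram (row 0 = target speaker) into training pairs:
    X = the last `history` frames of both speakers' activity (flattened),
        optionally concatenated with per-frame pitch features;
    y = the target speaker's activity in the frame that follows."""
    X, y = [], []
    for t in range(history, chrono.shape[1]):
        feats = [chrono[:, t - history:t].ravel()]
        if pitch is not None:                     # pitch: (n_feats, T) matrix
            feats.append(pitch[:, t - history:t].ravel())
        X.append(np.concatenate(feats))
        y.append(chrono[0, t])
    return np.asarray(X, dtype=float), np.asarray(y)
```

To generate examples for both participants of a conversation, you would presumably run this twice, once with each speaker's row placed first as the target.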
The only thing you have to do before this kind of exercise is take the continuously varying pitch values and somehow cast them into the same chronogram-like matrix representation. What we did here was the simplest possible thing: for each hundred-millisecond frame we calculated the average pitch in that interval, or left it as a missing value if there was no voicing in that interval.
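A minimal sketch of that per-frame pitch averaging, assuming a generic pitch-tracker output of sample times and F0 values with zeros for unvoiced samples; this is illustrative, not the extraction pipeline actually used.

```python
import numpy as np

def frame_pitch(times, f0, n_frames, frame=0.1):
    """Average the pitch-tracker output over each 100 ms frame.
    times, f0: tracker sample times (s) and F0 values (Hz, 0 = unvoiced).
    Frames with no voiced samples are left as NaN (missing)."""
    pitch = np.full(n_frames, np.nan)
    frame_idx = (times // frame).astype(int)
    for i in range(n_frames):
        voiced = f0[(frame_idx == i) & (f0 > 0)]
        if voiced.size:
            pitch[i] = voiced.mean()
    return pitch
```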
We then ran those prediction experiments using quite simple feed-forward networks with a single hidden layer, and for all the experiments that I am talking about here we had two units in that hidden layer. There are some more in the paper which I will not be talking about. You will note that this is a non-recurrent network, and there is a reason for that: since we are interested in the length of the usable pitch history, we want to have control over how much history the network has access to.
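As an illustration of the kind of non-recurrent model described, here is a sketch using scikit-learn's MLPClassifier as a stand-in: a single hidden layer with two units. The training data below is a random placeholder; in practice the inputs would be the windowed speech activity (and pitch) features, and the actual training setup used in the paper is not specified here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data; in practice X and y come from the windowed chronogram
# (e.g. 10 frames x 2 speakers of speech activity, plus pitch features).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(2000, 20)).astype(float)
y_train = rng.integers(0, 2, size=2000)

# Single hidden layer with two units, as in the experiments reported here;
# solver, regularisation and iteration count are illustrative defaults.
model = MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)
p_speech = model.predict_proba(X_train)[:, 1]  # P(target speaks in next frame)
```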
Before we go on: the differences were compared using cross entropy, expressed in bits per hundred-millisecond frame.
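The metric itself is the standard per-frame cross entropy, just expressed in bits; a small sketch of computing it from a model's predicted speech probabilities.

```python
import numpy as np

def cross_entropy_bits_per_frame(y_true, p_speech, eps=1e-12):
    """Mean cross entropy, in bits per 100 ms frame, of predicted speech
    probabilities against the observed speech/silence labels."""
    p = np.clip(p_speech, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log2(p) + (1 - y_true) * np.log2(1 - p))))
```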
There will be a lot of comparisons here, so there will be lots of pictures, and there are even more in the paper. I have taken the liberty of picking out the ones I find less boring, which I think is fine as long as you don't tell Kornel. So if you know him, don't tell.
So, the first two questions were: first of all, is there any benefit in having access to pitch history when doing speech activity prediction? And second, what is the optimal representation of pitch values in such a system?
What we do is start from the speech-activity-only baseline, and we will be seeing this kind of picture a lot. What we have here are the training set, the dev set and the test set, with the cross-entropy rates for all those systems, and on the x-axis the conditioning context. So this is a system trained on one hundred milliseconds of speech activity history, and this is a system trained on one second of speech activity history. You can see that across all three sets the cross entropies drop, as you would expect, so there is an improvement in prediction.
What we will be doing from now on is taking this system, the one trained on one second of speech activity history of both speakers, and adding more and more pitch history on top. So it is always ten frames of speech activity history for both speakers, plus pitch.
What we did first was simply add absolute pitch, on a linear scale, in Hz. And surprisingly, even this simple pitch representation helps quite a bit: you can see that even one frame of pitch history is already better than the baseline here, and then it improves further and starts to settle around three hundred milliseconds. So that is good news: it seems to suggest that pitch information is somehow relevant for speech activity prediction.
But clearly, representing pitch in absolute terms is a somewhat laughable idea, because it is completely speaker dependent. So you want to make it speaker independent somehow; you want to do speaker normalisation. What we did here was, again, the simplest thing: we just z-scored the pitch values. And surprisingly, this did not really make much of a difference, which is odd: you would expect some improvement. But if you think about it, this actually introduces more confusion, because what z-scoring does, of course, is bring the mean to zero, and the voiceless frames are also represented as zeros in the model, so these models end up confusing those two phenomena.
This can be quite easily fixed by adding another feature vector, a binary voicing feature: it is one when there is voicing and zero when there is not. This allows the model to disambiguate zeros which are due to the pitch being close to the speaker's mean from zeros which are due to voicelessness.
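A sketch of the z-scored-pitch-plus-voicing representation just described; the speaker's mean and standard deviation are passed in explicitly because, as noted later in the talk, they were assumed to be known a priori.

```python
import numpy as np

def normalised_pitch_features(pitch, speaker_mean, speaker_std):
    """pitch: per-frame mean F0 with NaN in unvoiced frames.
    Returns a (2, n_frames) feature matrix: z-scored pitch (0 where unvoiced)
    plus a binary voicing feature, so the model can tell 'unvoiced' apart
    from 'close to the speaker's mean'."""
    voiced = ~np.isnan(pitch)
    z = np.zeros_like(pitch, dtype=float)
    z[voiced] = (pitch[voiced] - speaker_mean) / speaker_std
    return np.stack([z, voiced.astype(float)])
```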
When you do this, you actually get quite a substantial drop in the cross-entropy rates, which suggests that this is a good representation. This drop was actually greater than when you add voicing on top of absolute pitch; again, that is not something I am showing here, but it is in the paper.
And then, of course, you can go on and say: we know that pitch is really perceived on a semitone scale, that is, on a log scale, so does it matter if we convert the Hz values to semitones before z-scoring? It actually does, a little bit: there is a slight improvement which generalises to the test set.
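The Hz-to-semitone conversion is the standard 12 * log2(f / f_ref) mapping; a one-function sketch, where the choice of reference frequency is an assumption (it only shifts the values and cancels out after per-speaker z-scoring).

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 from Hz to semitones relative to a reference frequency.
    The reference (100 Hz here) is arbitrary: it only shifts the values
    and cancels out after per-speaker z-scoring."""
    return 12.0 * np.log2(f0_hz / ref_hz)
```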
And the last thing we asked: all along we have only been using the pitch history of the target speaker, but you can also ask whether it helps to know the pitch history of the interlocutor. And again, there is a slight but consistent improvement if you use both speakers' histories. So this is our answer, or a preliminary answer anyway, to questions number one and two.
Then we have question number three, which is how far back you have to look, and for this we have this sort of diagram. The top line is as before: this is the speech-activity-only model, except that previously we ended here, at this blue dot, and here we have extended it for another ten frames, so this model is trained on two seconds of speech activity history. You can see that it keeps dropping, but a little less abruptly. This curve here is exactly the curve we had before, so pitch plus one second of speech activity history, and this one is more and more pitch history plus two seconds of speech activity history.
This is quite interesting, and actually a little bit puzzling, in that these curves are quite similar: they all start settling around four hundred milliseconds, but this one is simply shifted down. What this basically means is that the same amount of pitch history is more helpful if you have more speech activity history. That is just kind of interesting; we have some ideas, but frankly we do not know why that is. One possibility is that it has something to do with the backchannel versus non-backchannel distinction: those four hundred milliseconds of pitch cues might only be useful when the person has been talking for sufficiently long.
So, as I said, there is more in the paper, but this is all I wanted to show you here.
So what have we learned? The three questions were, first: does pitch help in the prediction of speech activity in dialogue? The answer is yes. What is the optimal representation? From what we have seen, it seems to be the combination of a binary voicing feature, for the disambiguation of voicelessness, and z-score-normalised pitch on a semitone scale. And how far back should one look? It seems that four hundred milliseconds of context is sufficient.
We have also seen that, in terms of the absolute reduction in cross entropy, the best-performing pitch representation resulted in a reduction corresponding to roughly seventy-five percent of the reduction you get in the speech-activity-only model when you go from one frame to ten frames of history, so it is quite substantial in that sense.
We have also seen that four hundred milliseconds seems to be enough, which is not much if you think about the study that Kornel did in two thousand twelve, where they found that with speech activity history alone you can go back as much as eight seconds and still keep improving. On the other hand, if you think about the prosodic domain within which any kind of pitch cue could be embedded, then something on the order of magnitude of a prosodic foot, so something like four hundred milliseconds long, makes perfect sense to me.
One thing we did, of course, was cheat a little bit, in that when we did the z-scoring of the pitch we used speaker means and standard deviations that we assumed to be known a priori. This, of course, would not be the case if you were to run this analysis in a real-time scenario; there, they would have to be estimated incrementally.
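A sketch of one standard way to do that incremental estimation, Welford's online algorithm; this is not what was done in the paper, where the statistics were taken as known in advance.

```python
class RunningPitchStats:
    """Welford's online algorithm: incrementally estimate a speaker's mean
    and standard deviation from voiced pitch values as they arrive."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, f0):
        self.n += 1
        delta = f0 - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (f0 - self.mean)

    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0
```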
I want to finish by going back to the rationale for doing all this analysis, all this playing around, which was really to come up with a better way of doing automated analysis of large amounts of speech material, and especially to be able to produce results across different corpora and make them comparable. One thing you could do with this, for instance: we ran this on Switchboard; you can take the same setup and run it on CallHome, which is also dyadic and also telephone speech, but where the people know each other. You can then compare those results and see to what extent familiarity between speakers, for instance, plays a role in how pitch is employed for turn management.
And of course, and this is what got Kornel and me excited about it, there is nothing that limits this to pitch: there is nothing stopping you from doing intensity, any kind of voice quality features, or multimodal features. So this really opens the way, in a sense, for doing a lot of interesting things. And in the long term, whatever you find out could potentially also be used in some sort of mixed-initiative dialogue system, but that is really something that you know more about than I do. So I will stop here. Thank you.
And we have plenty of time for questions.
I have a hidden slide with Kornel's phone number, in case.
So perhaps I will aim at this: how are you handling cases where you are not able to find the pitch, where there is no pitch because it is voiceless? Do you do anything particular?
I mean, originally it is left as a missing value, but then, because of all the shenanigans that happen inside, it just gets transformed into zeros. That is why there is this confusion, after z-scoring, between voicelessness and the mean pitch.
Other questions?
Thanks. I was wondering: absolute pitch is very different for male voices and for female voices, so I am wondering whether your model is in effect telling male and female voices apart.
Well, maybe, but how would that information be useful for predicting whether the speaker will be speaking in the next hundred milliseconds?
But your result is very surprising, that absolute pitch is useful, right?
I think so too, I think so too, because you would not assume that speaking at a hundred and sixty-five hertz signals anything in particular all the time, right? I agree that it is surprising.
But of course, if you compare the absolute pitch and the speaker-normalised pitch, there is clearly a lot that the absolute pitch misses, so there is a lot to improve on; there must be some information that is still there.
How do you mean? That inside the network there is some kind of clustering, so that it sort of has one classifier for men and one for women?
Yes, actually, I think you just touched on my question. I am wondering how much the modelling itself is doing: you are proposing a certain representation, you binarise pitch and so on, but obviously the model is probably also doing something on top of that, and I am not sure whether you have looked into that. Can you disentangle it? Because if someone takes a different approach, say they construct features that are temporal in nature, looking at slopes and all that, how much of that is the model already accounting for? I am not sure, it is hard to say, I guess.
I cannot answer that; of course, you do not know what the model is actually doing, yes, absolutely. But the thing is, this is, I think, one way of approaching the problem while producing results which are comparable across studies.
You mentioned at the beginning that pitch tends to be flat before turn-holding. Since you do not use a recurrent model, did you also consider taking the derivative of the absolute pitch, not only the absolute values?
No, we did not, but isn't that something that the network could potentially figure out by itself? So that is the question; I mean, I think so.
The question is, and I do not think you have done this, but are you planning to take this beyond the corpus and see whether the kinds of differentiation your models are finding might be used productively to change the behaviour of the other speaker, for instance if you altered the pitch of generated speech?
Right, if you were generating speech, absolutely, that could be done.
And the other question: I was wondering what you would need to change if it were a multi-speaker situation, not just two but three or four.
Possibly. This is something that we have discussed a lot. The problem with doing it is this: we had a paper at Interspeech in two thousand seventeen where we did this kind of modelling for respiratory data and turn-taking, and there we had three speakers. You can absolutely do it, but then you would have another row here, and what you have to do is keep shifting those speakers around, because you do not want your model to rely on the fact that speaker B was on row two and speaker C was on row three. So with three speakers it is still doable; once you go into really multi-party settings, this just explodes. Then you would have to do it somehow differently, and perhaps only take into account the speakers who were speaking within the last, I don't know, five minutes or so, and then incrementally, dynamically, produce those subsets of speakers that you predict for.
Any more questions?
I was just wondering whether you have looked into the granularity here: you picked a hundred milliseconds; did you look at other time windows at all?
We did not, but this is, I think, a key problem that should somehow be addressed, absolutely. The method itself, though, is in a sense agnostic to this, in the same way that whatever your pitch extraction is, it will produce different pitch tracks, and whatever your voice activity detection is, it will also produce different results; this is, in some sense, part of the preprocessing. But still, I think, absolutely, absolutely.
All right, let's thank our speaker again.