I wondered where the invitation came from, and once we started discussing travel arrangements it became clear. So the question I asked myself, when I was invited to come to this meeting, was: what can I say that's possibly of interest to people interested in speaker identification, language identification, this kind of topic?
You won't find, you know, i-vectors here — for the purposes of this talk, an i-vector describes the process of virus transmission on an Apple device. But while putting this talk together I looked at some of your topics, and there are a number of points of connection that I noted down.
I'm going to be talking about the way speakers change as a result of context, and how we develop algorithms that can modify speech to make it more intelligible. Of course this has some relationship with spoofing, for instance: the elements I'll be talking about could potentially be used to disguise someone's identity. Also, the effect of noise on speaking style is obviously very relevant to people interested in speaker identification. There was a talk this morning about diarization and overlapping speech; I'm going to be showing you some data on overlapping speech in realistic conditions where noise is also present. And I believe also that durational variation is a problem, and that we can glean behavioural information from it. So there are some points of contact between what I want to say and the kind of work that people are doing in this field.
But to keep things simple, this is what I'm going to talk about: replacing the easy approach to intelligibility, which is increasing the volume, with a hypothetical but potentially very valuable device for increasing intelligibility.
So I'm going to start by talking about why we should produce robust speech output — why it's an interesting problem, and what kinds of applications it has. Then I'll give a few general observations about how we talk in adverse conditions, and move on to research in the spectral and temporal domains, with some little tidbits about behavioural observations, focusing on some of the algorithms that people in the Listening Talker project have developed over the last couple of years, culminating with the Hurricane Challenge, a global evaluation of speech modification techniques which took place last year at Interspeech. And if there's time, I'll also say a few words about what these modifications do to speech: whether they actually make it intrinsically more intelligible, or whether they just overcome the problems of noise.
So why modify speech at all? Well, I'm sure you're aware that speech output is extremely common, both natural recorded and synthetic. If you think about your journey to this place, you presumably went through various transport interchanges, and the aeroplanes themselves: lots of difficult environments — reverberation, poor loudspeakers, et cetera — hearing all sorts of almost unintelligible messages coming out of the loudspeakers. There are millions of these things in continuous operation, and it's surely an interesting problem to attempt to make those messages as intelligible as possible. The same goes for mobile devices, or say in-car navigation systems, where noise is simply a fact of life in the contexts in which they are used.
And of course in speech technology, particularly speech synthesis, the messages are realistically sent out regardless of the environment: regardless of context, regardless of whether someone else is talking — say in a voice-driven GPS-type system — and regardless of whether there's noise present.
Here are a few examples, just recorded with a simple handheld device, of — in this first case — recorded speech in a noisy environment. And because there's a hum on this laptop, I'm going to plug it in just for the duration of this. Note that you could play that through as good a delivery system as you like and you still wouldn't understand much of it at all; it's barely intelligible to start with. Here's another one: so that was recorded speech, this is live speech. And this is live accented speech — we know that a foreign accent can be equivalent to, say, five dB of added noise in some cases. And this is another of my favourite examples, because it's really a user interface design problem for the people who designed it. This is the train I needed to get to Edinburgh: the noise saying the train is about to depart collided with the announcement itself. So there are simple fixes for those cases in particular.
Anyway, is it worth doing this? Well, I think it's worth bearing in mind that for natural sentences we have lots of different data which basically show the same point: every dB you can gain — in effective terms; I'll say more about that later on — is worth something like five to eight percentage points of intelligibility, depending on the speech material. That's for sentences which are pretty close to normal speech; for some materials it's perhaps a little less. So every dB of gain is worth having, essentially.
And every dB attenuated potentially saves lives. That might sound like a bold statement, so let me qualify it. There's a report from the World Health Organization which covered environmental noise in Europe. By environmental noise I mean noise excluding workplace noise — so this is not people working in factories with, you know, pistons and hammers going all the time; this is just the noise pollution that exists in everyday environments. If you live near a railway station, you get announcements all day long; if you live near an airport, you have the aeroplanes; and the consequence is stress-related disease, cardiovascular problems in particular. Now, these don't necessarily lead to fatalities, so the quantification used is the DALY methodology, which measures healthy life years lost. So if you suffer, for instance, severe tinnitus as a result of environmental noise, that might carry a coefficient of 0.1, which means that for every ten years you effectively lose one year of healthy life. That's a very large figure. So anything we can do to attenuate environmental noise has to be beneficial.
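The DALY arithmetic above can be sketched in a couple of lines; the 0.1 coefficient for severe tinnitus is the value quoted in the talk, and the function name is my own illustration.

```python
# Healthy life years lost under the DALY methodology: a disability weight
# ("coefficient") of 0.1 means one healthy life year lost per ten years
# lived with the condition.

def healthy_life_years_lost(disability_weight: float, years_exposed: float) -> float:
    """Years of healthy life lost = disability weight x years with the condition."""
    return disability_weight * years_exposed

# Severe tinnitus from environmental noise, ten years of exposure:
print(healthy_life_years_lost(0.1, 10))  # 1.0 healthy life year lost
```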
Just to contrast what I'm talking about with other existing areas within the field: the difference between speech modification and, let's say, speech enhancement is that here we're dealing with speech where the intended message — the signal itself — is known. So in a sense it's a simpler problem: we don't have the problem of taking a noisy speech signal and feeding it to a recogniser, or enhancing it for broadcast, for instance. It's sometimes called near-end speech enhancement. And it's not like additive noise suppression either: we're not attempting to control the noise. The sort of situations I'll be talking about here are ones where it's not really practical to control the noise, because you've got, say, a public address system and hundreds of people listening to it — you can't control the noise or have everyone wearing headphones or whatever.
So what we're left with, within the system, is the ability to modify the speech itself, subject to a few constraints — these are just practical constraints. In the short term we're mostly interested in changing the distribution of energy in time and frequency, and even duration; I'll show you in a moment what we do with that. But in the long term we don't want to extend the duration — and fall behind when announcements are queued up — and we don't want, or are not able, to increase the intensity of the signal. So normally in this work, and this is going to be the case throughout pretty much everything I talk about, there's a constant input-output energy constraint enforced.
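A minimal sketch of that constant input-output energy constraint: whatever modification is applied, the output is rescaled so its RMS energy matches the input's. The function name and test signal are illustrative, not from the talk.

```python
import numpy as np

def equalise_rms(modified: np.ndarray, original: np.ndarray) -> np.ndarray:
    """Scale `modified` so it has the same RMS energy as `original`."""
    rms_orig = np.sqrt(np.mean(original ** 2))
    rms_mod = np.sqrt(np.mean(modified ** 2))
    return modified * (rms_orig / rms_mod)

# Example: a modification that boosted the signal by 6 dB is scaled back.
x = np.sin(np.linspace(0, 100, 16000))   # stand-in for the input speech
y = 2.0 * x                              # some modification (here, just a gain)
z = equalise_rms(y, x)
print(np.isclose(np.sqrt(np.mean(z ** 2)), np.sqrt(np.mean(x ** 2))))  # True
```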
Here are just a few examples of what we can achieve — I'll come back to how it's done a little later. OK, so what you listened to was some speech in noise. Trust me, because you may not have been able to hear the speech; you can just about tell there's a speaker in there. Well, if you modify that speech — without changing the overall RMS energy, and the noise is constant, so the SNR is the same in these two examples — the modified version sounds something like this. If you listened to that in an experimental setup over headphones, you'd be pretty much guaranteed to get around seventy percent of the words, even though the sentence was about as hard as it could be — a semantically unpredictable sentence. So that's the kind of motivation.
Now let me say a little bit about what talkers do. This has been a longstanding research area; it goes back at least, let's say, a hundred years. A lot of the work concerns clear speech: if you give someone instructions to speak clearly, then they will — and you don't even need to give instructions; in this situation now I'm attempting to speak more clearly than I was over lunch, for instance. Speech changes in these situations in ways which it's possible to characterize, and possibly copy, and there's been work on mapping clear speech properties onto natural speech, or maybe synthetic speech, in adverse conditions. Speech also changes as a function of the interlocutor: it changes when you talk to children — infant-directed speech, for instance; there's foreigner-directed speech, as it's also called, for non-natives; and likewise for pets, and for computers. There's computer-directed talk too, as we all know — that's been well studied in speech recognition. Speech changes.
But I'm focusing on adverse conditions. There's been good work on Lombard speech over a long time, and others have worked in this area too. Now, I should say we're interested in Lombard speech not necessarily because we expect speakers — or a device, or listeners — to be in environments at the noise levels usually used to induce Lombard speech, which are quite high, but simply because Lombard speech is more intelligible. We want to know why, first of all, as scientists, and then we want to put that knowledge into an algorithm, so that it at least reproduces the intelligibility benefit of Lombard speech — or goes beyond it. Actually, I'll show some results suggesting that we can indeed go beyond it.
Now, in case some of you haven't heard Lombard speech, this is what it sounds like. That was normal speech; this is the same talker and the same sentence, but produced with, in this case, I think ninety-six dB SPL of noise in their ears — you don't hear the noise coming from the signal, though. Some of the properties of Lombard speech are fairly evident here, I think. You can see the duration change: this is the normal speech, this is the Lombard speech, and it's quite normal for the duration to be slightly longer. The stretching is nonlinear: voiced elements tend to be extended more than voiceless elements. Also, if you look here at the F0 — you can see the harmonics — F0 is typically higher too. There are other characteristics which are not visible in this particular plot, but which you'll see in a second, and which might be important.
Now, the real reason we're interested in Lombard speech is that there's lots of data like this — this is just some of ours, from one of our studies. It shows the percentage increase in intelligibility over a baseline — a normal-speech baseline — for four different Lombard conditions, where the Lombard speech is presented to listeners in the same amount of noise. You can see we get some pretty serious intelligibility improvements: up to twenty-five — not decibels, sorry — twenty-five percentage points.
The question is why. Why is Lombard speech more intelligible? There seem to be a number of possibilities, possibly acting in conjunction. So, in this panel you see three spectrograms — cochleagrams, as they're sometimes called, with a roughly logarithmic frequency scale. This is speech which was not produced in noise — normal speech — and these are different degrees of Lombard speech; you can see the duration differences again. The hatched regions on this side are the regions of the speech that, if you mix each of these signals into the same amount of noise, actually come through the mixture.
These are what we call glimpses — it's a model I'll define a bit more carefully later. What you see is that there are rather few glimpses in the normal speech, and more in the Lombard speech, particularly in the high-frequency regions. And indeed one of the key properties of Lombard speech is that the spectral tilt changes — it's reduced. So if this is low frequency and this is high frequency, Lombard speech looks more like that, which means essentially putting more energy into the mid and somewhat higher frequencies. In auditory terms that means further along the cochlea: from about one kHz upwards we see more energy. So there are potentially spectral cues. There are also potentially temporal cues — simply a slowing down of speech rate, if you like, or rather a nonlinear expansion; maybe that's beneficial, and that's a contentious issue which I'll address. And maybe there are raw acoustic-phonetic changes too: maybe when talkers are presented with a high level of noise, they attempt to hit their vowel targets more carefully and expand the vowel space, as is the case if you compare this or other forms of modified speech against plain speech. But the question of whether Lombard speech is intrinsically more intelligible is one I'll come back to.
At the start of the Listening Talker project we all got together — we started the project quite optimistically — and had a bit of a brainstorming session, just to list the things we might do to speech to make it more intelligible, to make it more robust. Of course, the first idea is to increase the intensity, which is ruled out by assumption. Some of us were aware of Lombard speech at this point, so changing spectral tilt was an obvious possibility, along with the acoustic-phonetic changes I just mentioned, such as expanding the vowel space. And we kept thinking: maybe narrowing the formant bandwidths; putting less energy into — we're currently wasting energy on — less useful parts of the signal, like the valleys between the peaks in the spectrum; more generally, reallocating energy, sparsifying energy, is another generalisation. Some of these I mention because you're going to see examples of them. Dynamic range compression has been used for a long time in audio broadcasting, and it works here too. And then there are higher-level things: trying to match the interlocutor's intensity, or to contrast with it; maybe helping with the problem of overlaps, which was talked about this morning; and so on.
OK — changing F0; and we thought of more besides: stretching vowels relative to consonants, simplifying the syntax, and so on; and maybe producing speech which places a low cognitive load on the listener. As you can see, there's an awful lot that can be looked at, and not all of it has been, so it's a great area for people to get interested in and start looking at.
What I tried to do was to group these into a slightly more sensible structure by looking at the goal of a speech modification — all the possible goals a modification could have. All of it is context dependent, but if we just focus on speech in noise, one of the clear goals is to reduce energetic masking, as it's called. In case you don't know the difference between energetic masking and informational masking: energetic masking describes, essentially, what happens when a masker and a target — let's say speech — interact at the level of the auditory periphery, where some information is lost due to compression in the auditory system. Informational masking can come in later, if some information is getting through from another talker: say there are two talkers talking at once, so you hear two messages, or fragments of two messages, and if the speakers are very similar — they have the same gender, say — then it can be very confusing to work out which bits belong to which talker. That's an example of informational masking. So to reduce energetic masking we can do things like sparsification of the spectrum, or changing spectral tilt. To reduce informational masking, if we've got control over the entire message generation process, we might do something like change the gender of the talker.
Not necessarily with TTS — we have voice conversion systems that can do this. We can also add visual cues, which is a nice way of reducing the effect of an interfering talker. And then we can do other things — this comes from my longstanding interest in auditory scene analysis — by taking the grouping problem and inverting it: we can try to prevent grouping. We can send a message into an environment where all the other sources are, but do things to it to prevent it grouping with them. There's something of this idea in music: there's been an awful lot of work in scene analysis about what goes on in, say, a quartet, and I believe players can use small timing differences at the onsets of notes to keep the instruments perceptually separate when they come in together. That's an example of what I mean by using scene analysis to prevent the message clashing with the background.
Then there are things we can do to reduce the cognitive load of the message, by using simpler syntax or decreasing the speech rate; or we can equip the speech with more redundancy, for instance by, at a higher level, repeating the message words. So there are lots of things one might do. What I want to do now is to move in the direction of some of the experiments we've been doing over the last few years.
This is the kind of typical approach we take. We can, if you like, deconstruct Lombard speech in one form or another. That is, we can take normal speech — displayed here again — and the Lombard version of the same sentence, and ask how much of the intelligibility advantage of the Lombard speech comes from, say, timing differences. So we can time-align the two sentences and then, for instance, ask the question with only the F0 shift applied, or with the spectral tilt removed — that sounds like this: not quite normal speech, but not quite Lombard either; the residual, the difference between the two, is things like spectral tilt. From an experimental point of view we can then identify the contribution of factors such as F0, spectral tilt, and duration to the intelligibility advantage.
So now I want to look at the spectral domain, and start with one of the early experiments we did, looking at exactly those parameters: spectral tilt and fundamental frequency. Because Lombard speech, clear speech, and other forms of modified speech all modify F0, you might be led to believe that F0 is an important change. But it turns out that it isn't. What you're looking at here is the increase in intelligibility over a baseline obtained by manipulating F0 to bring it in line with Lombard speech, and none of these changes is significant; the three different bars just represent different Lombard conditions. On the other hand, if we change the spectral tilt — just a constant change, not time dependent — we get about two thirds of the benefit coming through; the real Lombard speech is up here. So a good deal of the Lombard benefit is due to spectral tilt. It turns out this could be predicted very well just by considering energetic masking with a glimpsing model: the spectral tilt change is lifting some of the speech out of the masker. And there have been approaches which simply do that to the speech; we get some benefit from modifying speech by just changing spectral tilt globally.
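A global spectral tilt change of this kind can be sketched as a first-order pre-emphasis filter followed by the energy renormalisation discussed earlier. This is a minimal illustration, not the talk's actual algorithm; the coefficient 0.95 is an assumed value.

```python
import numpy as np

def flatten_tilt(speech: np.ndarray, coeff: float = 0.95) -> np.ndarray:
    """Pre-emphasise (reduce spectral tilt), then restore the original RMS energy."""
    # y[n] = x[n] - coeff * x[n-1]: shifts energy towards higher frequencies.
    emphasised = np.append(speech[0], speech[1:] - coeff * speech[:-1])
    gain = np.sqrt(np.mean(speech ** 2) / np.mean(emphasised ** 2))
    return emphasised * gain

# A low-frequency-heavy test signal gains relative high-frequency energy
# while its overall energy stays the same.
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 4000 * t)
y = flatten_tilt(x)
print(np.isclose(np.mean(y ** 2), np.mean(x ** 2)))  # True: energy preserved
```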
But we can look at this quite a bit more generally and ask the question: if all you're allowed to do is process the speech with a stationary spectral weighting — essentially designing a simple filter to apply to the speech — what's the best you can do in the spectral domain? This is the general approach. Offline — and this can be masker dependent, so it's context dependent: for each masker we can come up with a different spectral weighting — we optimise the weighting. Then online, all that's necessary is to recognise what kind of background we have and apply the weighting appropriate for that particular type of masker.
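The offline/online split just described can be sketched as a lookup: weightings optimised offline per masker type, with a run-time classifier choosing which to apply. Everything here — the masker labels, the toy classifier, the weight values — is an illustrative placeholder, not the project's actual system.

```python
import numpy as np

# Offline: one spectral weighting per masker class (four bands, made-up values).
OFFLINE_WEIGHTINGS = {
    "white":  np.array([0.5, 0.8, 1.3, 1.6]),   # push energy to the highs
    "babble": np.array([0.7, 1.0, 1.4, 1.2]),   # a different profile for babble
}

def classify_masker(noise_band_energies: np.ndarray) -> str:
    """Toy online classifier: a flat band spectrum -> 'white', otherwise 'babble'."""
    return "white" if np.std(noise_band_energies) < 1.0 else "babble"

def apply_weighting(speech_bands: np.ndarray, noise_bands: np.ndarray) -> np.ndarray:
    """Online: classify the background, then apply the matching offline weighting."""
    masker = classify_masker(noise_bands)
    return speech_bands * OFFLINE_WEIGHTINGS[masker]

noise = np.array([10.0, 10.1, 9.9, 10.0])   # flat -> classified as white noise
speech = np.array([1.0, 1.0, 1.0, 1.0])
print(apply_weighting(speech, noise))  # [0.5 0.8 1.3 1.6]
```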
What we realised early on in this project was the really important role that objective intelligibility measures have in this whole process, simply because we want to use them as part of a closed-loop design and optimisation process. We can't bring back a panel of listeners every ten milliseconds to answer the question "how intelligible is this modification that our algorithm has just come up with?" — though of course we still test with listeners at the design phase. So it's critically important to have a good intelligibility predictor. The first intelligibility predictor we used is the glimpse proportion measure, and let me just describe what it does; it's a very simple thing.
We take separate auditory representations of the speech and of the noise — just imagine some kind of cochleagram representation: a gammatone filterbank, from which we take the envelope in each channel, the Hilbert envelope, and downsample. Essentially that's it. Then we ask the question: how often is the speech above the noise plus some threshold, which we derive from listener data? We just measure the number of spectro-temporal points where that's the case. It's a very simple, very rapidly computed intelligibility model.
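The measure just described can be sketched in a few lines: given cochleagram-like time-frequency energies for speech and noise, count the proportion of points where speech exceeds noise plus a threshold. The 3 dB threshold and the toy arrays are assumptions for illustration.

```python
import numpy as np

def glimpse_proportion(speech_db: np.ndarray, noise_db: np.ndarray,
                       threshold_db: float = 3.0) -> float:
    """Fraction of spectro-temporal points where speech > noise + threshold."""
    glimpsed = speech_db > noise_db + threshold_db
    return float(np.mean(glimpsed))

# Toy example: a 2-channel, 4-frame "cochleagram" in dB against flat noise.
speech = np.array([[60, 40, 55, 30],
                   [50, 45, 20, 65]], dtype=float)
noise = np.full_like(speech, 45.0)
print(glimpse_proportion(speech, noise))  # 0.5: 4 of the 8 points exceed 48 dB
```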
If we do that, we come up with these kinds of weightings — again, it depends on the kind of optimisation you want to use, but let's just ignore that. These are very high-dimensional auditory spectra, say sixty-dimensional. One thing to note here, if you read these icons: the maskers are speech (a competing talker), speech-modulated noise, and white noise, and for each masker we've also got different SNRs — ten, five, zero, minus five, minus ten. These are the optimal spectral weightings that come out, and there are some interesting things going on. Note that previous work has used much lower-dimensional representations, such as octave weightings — so six to eight octave-band weightings, or even third-octave weightings, maybe twenty-odd third-octave bands — whereas here we've got a much higher-dimensional representation. The somewhat unexpected result, at least to us, is that as the SNR decreases, the optimal weighting gets more extreme, more binary. We call it sparse boosting, because what it's essentially doing is shifting the energy into a limited set of frequency regions — boosting those and attenuating the neighbouring regions. That was only partly unexpected.
The question is what this all amounts to for listeners. Let me play you an example of what these things sound like. This is the unmodified speech: "a large size in stockings is hard to sell", from the Harvard corpus. This is the modified version. They're of course equally intelligible in quiet, I hope, but in noise — well, you know the sentence now, but I think it should be reasonably evident that the modified speech is more intelligible. As part of the Hurricane Challenge we entered this particular algorithm and got improvements of up to, say, fifteen percentage points. That's for two particular conditions at given SNRs, but roughly that's the amount.
It's perhaps more useful to think of these gains in terms of dB improvements, and for that we use the idea of an equivalent intensity increase. The idea is: if you modify speech, how much would you need to turn the unmodified speech up by — in effect, how much would you have to increase the SNR — to get the same level of performance? This can be computed by estimating psychometric functions for each of the maskers you're using, and using them to map scores from the unmodified speech to the modified speech.
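The equivalent intensity increase idea can be sketched as follows: assume a psychometric function mapping SNR to intelligibility for the unmodified speech, then invert it to find the SNR shift that would match the modified speech's score. The logistic form and its parameters are illustrative assumptions, not fitted to any real data from the talk.

```python
import numpy as np

def psychometric(snr_db, midpoint=-6.0, slope=0.5):
    """Assumed logistic psychometric function: SNR (dB) -> proportion correct."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

def inverse_psychometric(p, midpoint=-6.0, slope=0.5):
    """SNR (dB) at which unmodified speech reaches proportion correct p."""
    return midpoint + np.log(p / (1.0 - p)) / slope

def equivalent_db_gain(snr_db, modified_score):
    """dB increase the unmodified speech would need to match the modified score."""
    return inverse_psychometric(modified_score) - snr_db

# Modified speech scores 0.70 at -6 dB SNR; unmodified scores 0.50 there.
gain = equivalent_db_gain(-6.0, 0.70)
print(round(gain, 2))  # 1.69 dB equivalent intensity increase
```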
What it tells us — if you look at the subjective results, the filled bars here — is that we're getting about two dB of improvement using that static spectral weighting, which is pretty useful: two dB is maybe somewhere between ten and fifteen percentage points or so. Now, this figure shows something else. These white bars here are the predictions, on the same basis, of the objective intelligibility model that was used to design the weighting in the first place, and as you can see, the predictions are not really that good. I mean, you could look at this and say, well, they're quite correlated, but they're really not very good at all — there's quite a big discrepancy in these cases here. In one sense, of course, it doesn't matter, because we're still getting improvements for listeners. On the other hand, if we had a better objective intelligibility model than the glimpse proportion, for instance, then we might expect bigger gains.
So with that in mind, one of the things colleagues have been focusing on a lot is improving intelligibility models for modified and synthetic speech. You might recognise some of these abbreviations: this is the Speech Intelligibility Index, the extended Speech Intelligibility Index, et cetera — seven fairly recent intelligibility metrics — and these are five glimpse-based metrics that have been developed to try to improve matters. The point is that the one we were using to derive those static spectral weightings — this one — just didn't perform that well; in fact most of the metrics don't really perform well on modified speech. Normally you'd want correlations with listener scores of at least 0.9, as you get for natural speech; they perform even worse for synthetic speech.
So the question now is what happens if we do the same static spectral weight estimation, but this time using this high-energy glimpse proportion metric instead. It's really a series of adaptations to the normal glimpse proportion. Over here is the normal glimpse proportion. What we're adding here is something which represents the hearing threshold: sometimes we present speech at such a low SNR that some of the speech within the mixture, when it's presented to listeners at some given level, is actually below the threshold of hearing, and that has a marked effect on the intelligibility prediction, so it's catered for over here. You've also got a sort of logarithmic compression, to deal with the fact that glimpses are very redundant — you probably only need thirty percent of the spectro-temporal plane glimpsed to reach ceiling performance — so that's handled there. And this is a durational modification factor, which attempts to cater for the fact that rapid speech is less intelligible. So there are a few changes in there; I'm not going to go too much into them here.
Here are the patterns that come out of this optimisation process. What we're seeing is actually quite similar patterns to the preceding model, with some differences. These are six different noise types — low-pass noise, high-pass noise, white noise, and again modulated, competing-talker and speech-shaped noise — and we essentially see pretty much a boost of the high frequencies throughout.
We changed corpus here a little: working in Spain, it became more convenient for me to run experiments with Spanish listeners rather than my English or Scottish colleagues. So this is with the Sharvard corpus, which is a Spanish version of the Harvard sentences. What you're seeing here are gains in percentage points — these are not relative gains: percentage-point gains of up to fifty-five from static spectral weighting in the best cases, and in some weaker cases down around twenty or thirty. It doesn't work at all in white noise, which we put down to continuing problems — further problems — with the objective intelligibility metric. But nevertheless you can see that with a very simple approach, which could be implemented as a simple linear filter, we can get some pretty big gains in noise.
An additional question we wanted to answer is to what extent we need to make the weightings masker dependent. Because if you look at the weightings we derived for the different maskers, we tend to see a similar pattern: a preference for moving the energy up into the high frequencies, with maybe a tendency to preserve some very low-frequency information, which might be related to encoding voicing, for instance. So we tried out a number of static spectral weightings in a masker-independent sense. The simplest one essentially transfers — reallocates — lots of energy from the low frequencies, below one kHz, to the region above, with no attempt to produce a cleverer profile; that's this one here. Then we tested the idea of sparse boosting — just boosting a few channels; sparse boosting with some low-frequency information retained as well; and, as a baseline, random selection of regions in the high frequencies. And it turns out, slightly to our surprise, that the masker-independent weighting — these black bars here — does as well as the masker-dependent weighting (the white bars, copied from a couple of slides back) in nearly all conditions. None of the other weightings did quite so well, although in general they too produced improvements.
all of the other weighting stick white so well although they in general produced improvements
so what this is saying really is that
for what a wide variety of common noises
say babble noise in particular which is basically car noise in transport interchanges the same
be speech noises
we can we can get pretty significant improvements from a simple approach of spectral weighting
as lots multi set about spectral
types of things lots more to be don't but
i want to get a kind of a
a better look at
all the various examine to move on as look at temporal modifications
The first thing to look at is this question of duration, or speech rate changes. You might think that the way Lombard speech slows down — at least for certain segments — is done for good reasons: the natural interpretation is that speakers are trying to make things easier for the interlocutor. So what we looked at was whether or not the slower speech rate of Lombard speech actually helps at all. What you see here is the method we used: this is plain speech, this is Lombard speech. We simply time-align the Lombard speech nonlinearly with the plain speech, and once you've got the time alignment you can do things like transplanting spectral information from, say, the Lombard speech into the plain speech, on the plain timeline.
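The align-and-transplant idea can be sketched with a plain dynamic time warping alignment over frame features, then copying the aligned Lombard frames onto the plain utterance's timeline. This is a toy illustration with scalar "features"; the actual study's features and alignment details are not specified here.

```python
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Return the DTW alignment path between frame sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def transplant(plain: np.ndarray, lombard: np.ndarray) -> np.ndarray:
    """For each plain frame, substitute the time-aligned Lombard frame."""
    out = plain.copy()
    for plain_idx, lombard_idx in dtw_path(plain, lombard):
        out[plain_idx] = lombard[lombard_idx]
    return out

plain = np.array([[0.0], [1.0], [2.0]])
lombard = np.array([[0.1], [0.9], [1.1], [2.1]])  # slower: one extra frame
print(transplant(plain, lombard).ravel())  # [0.1 1.1 2.1]
```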
Well, the answer to the question of whether duration helps is no — and this is not the only study to have found this. Neither linear stretching nor, as in this case, nonlinear time alignment gives any benefit; these are the two bars showing no benefit. The overall modified speech — the full Lombard speech — does show a benefit, and so do the spectral modifications: local modifications, meaning spectral transplantation having done the nonlinear time warping. Nothing helps except the spectral changes; the decreases here are not significant, but they're clearly not in the right direction. I'll show a result a little later on, though, which seems to contradict this.
So what I want to offer in the next five or ten minutes is a slightly richer interpretation of durational changes, namely, what happens to speech when you are talking in the presence of a temporally modulated masker. Just think about it: any time you go into a cafe or somewhere similar, you are dealing with a modulated background. Is there anything that we as speakers do in a modulated background to make life easier for the listener? These situations have barely been studied, and yet they have the potential, we thought and continue to think, to reveal more complex behaviour on the part of speakers to help listeners.
The task we used is a dialogue task with a visual barrier between the two talkers, here. They wear headphones and listen to modulated maskers of different types, with varying gap density, so there are some opportunities, let's say, for the talkers to get into the gaps; here there is a bit of a link with the overlapping speech material from this morning. The talkers have different Sudoku-style puzzles, so they need to communicate to solve the task. This is an example of what that sounds like.
As you listen, you can imagine the masker being present: you do not hear the masker in this example, but it was present for these talkers. You hear things like "in the middle right-hand box, the middle row, there has to be a three and a five". The timing wasn't quite natural, I think you can hear that; it is not really an everyday conversation. In effect it is a three-party conversation, where the third party is a modulated masker in this case.
Now, there are lots of interesting things going on in overlap, as I'm sure I don't need to tell you, but this is not quite meetings-style overlap, because it is obviously not a competing talker in the background; we will see some examples of that in a moment. What I simply want to focus on is the degree of overlap with the masker: do the talkers treat the masker like an interlocutor, that is, do they tend to avoid overlap, or not?
What we found is that to some extent, yes: this is showing the reduction in overlap for the different maskers, the dense and the sparse maskers. In the cases where there is more potential for reducing overlap, where it is easier to do so, we do see a reduction in overlap. They tend to achieve this by increasing speech rate, speaking more only when there is no overlap, that is, when there is no background speech, and that is what is responsible for the decrease in overlap. This is normalised, of course, by speech activity.
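To make that normalisation concrete, here is a minimal sketch (a hypothetical function and frame representation, not the project's code) of overlap measured as a fraction of the talker's own active frames, so that a talker who simply speaks less is not credited with "avoiding" the masker:

```python
# Toy overlap measure: proportion of the talker's active frames that
# coincide with masker activity, normalised by the talker's own activity.

def overlap_fraction(talker, masker):
    """talker, masker: equal-length binary frame sequences (1 = active)."""
    active = sum(talker)
    if active == 0:
        return 0.0  # talker never speaks; overlap is undefined, report zero
    both = sum(1 for t, m in zip(talker, masker) if t and m)
    return both / active
```

A talker who times speech into the masker's gaps, e.g. `overlap_fraction([1, 1, 0, 0], [0, 0, 1, 1])`, scores zero, while speech placed at random relative to the masker scores close to the masker's duty cycle.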
So what are the speakers doing? To try to work out what speakers are doing when noise is present, or indeed when it isn't, we borrowed a technique developed for system identification called reverse correlation. It has been used, for instance, to try to identify nonlinear systems, although strictly speaking it really only applies to linear systems. Here we are dealing with the entire speech perception process and then the speech production process responding to the speech being listened to, so we have two highly nonlinear systems in series, and it shouldn't really work. Nevertheless, what we do is look at all events of a particular type in the corpus, let's say all occasions when the person you are talking to stops speaking, the offsets, and we ask what was going on in your speech at that point.
We encode all those events as spikes, then take a window, look at speech activity, and average over all of those exemplars. That gives us what we call event-related activity, which is what you are seeing here; the window is plus or minus one second. Take the simple case first, with no noise present: this is just looking at the activity in response to an interlocutor. It simply says: take all the points at which your interlocutor stops talking; what do you do?
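The spike-and-average idea can be sketched as follows (a hypothetical illustration with assumed names, not the analysis code used in the study):

```python
# Reverse-correlation style averaging: mark each event (e.g. an interlocutor
# speech offset) as a spike, then average the talker's binary activity in a
# window around every spike to obtain an event-related activity profile.

def event_related_activity(activity, events, pre, post):
    """Average activity in frames [e - pre, e + post) around each event index."""
    window = [0.0] * (pre + post)
    used = 0
    for e in events:
        if e - pre < 0 or e + post > len(activity):
            continue  # skip events whose window falls outside the recording
        for k in range(-pre, post):
            window[k + pre] += activity[e + k]
        used += 1
    return [w / used for w in window] if used else window
```

Applied to an interlocutor's offsets this would show the rise in the talker's probability of starting to speak; applied to masker onsets and offsets it gives the weaker but still visible pattern described next.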
Well, not surprisingly, you are more likely to start talking; that is what turn-taking is all about. And we see the reverse pattern on the other side. The interesting question is what happens with the masker. Take the masker offsets: what happens when the masker goes off, what do you do as a talker? Not very much at first, but shortly afterwards you increase your likelihood of speaking, and likewise in the onset case. You can see it a bit more clearly if we just look at the difference between onset and offset; they are symmetric for all intents and purposes. So we see what we call a contrast curve. This is the same thing shown for the interlocutor case, where you see a very nice curve over quite a wide range. In the masker case, because you cannot guess when the masker events are going to take place, there is really no difference right at the event, but in the milliseconds after the masker has come on or off we see a change in speaker activity. What this is showing is that talkers are sensitive to the maskers and do respond in some way.
We then looked at several possible strategies the talkers might be using. I won't go into the details, but simply to say that it isn't the case that when a masker comes on, talkers tend to stop; that would be the stop strategy here. It is more the case that they tend not to start when a masker comes on. The two things, if you think about it, might look the same when you average across events, which is why we need to distinguish between them. So we see lots of evidence for a talker strategy based on starting: when the masker goes off, you are more likely to start talking, which makes sense, and when the masker comes on, you are less likely to start talking. There is a little bit of evidence that the masker causes you to stop talking, but it is quite weak.
Now, how does this work in a more natural situation, where there is another conversation present in the background rather than this slightly artificial modulated background noise? These were some experiments we carried out in English and in Spanish. The basic scenario is that we have a pair of talkers having a conversation on their own for the first five minutes; they are then joined for the next ten minutes by another pair of talkers, and for symmetry purposes the first pair leaves at the end. So we have a period in which there are two parallel conversations; the second pair is not allowed to talk to the first pair and vice versa. We are interested in a very natural situation, seeing how one conversation affects another conversation. I'll just play you an example, and you will be helped a little bit by the transcription on the right-hand side if you try to follow it. [audio example] This is the natural overlap situation, if you like; across the entire corpus the percentage of overlap within a pair is in the region of twenty percent.
I want to point out a couple of things here. One of the things that talkers do in a situation like that is drastically reduce the amount of natural overlap that they allow with their conversational partner. The figure mentioned this morning was about twenty-five percent, and we find the same thing here when there is no background present: the natural state of the two-person dialogue is that roughly twenty-five percent of material is overlapped. Once the background comes in, you see that is reduced; that is one big change. Another change we see is when we remove the visual modality; you might have noticed in the picture that the talkers wore visors in one of the conditions, and that also causes a bit of a change in overlap and in response. But the interesting question is to what extent the talkers in this situation are aware of what is going on in the background and are adapting accordingly.
These are event-related activity plots like the ones we saw before. This is with no background present, so we see the turn-taking behaviour, and this is where there is visual information, where we can see the interlocutor's lips. The interesting cases are those where the noise is present. This is showing the activity in response to the noise, and although it is a much weaker pattern, we still see the same sensitivity to the noise in this highly dynamic situation. So, to summarise all of this: the foreground conversation is affected by the background conversation. What has all this got to do with speech technology?
Out of this grew an algorithm called GC-retime, which was also submitted to the Hurricane Challenge. The approach is a general dynamic-time-warping-based one: we take a speech signal, and here is the masker, and we say, if we are allowed on a frame-by-frame basis to modify the speech signal to achieve some objective, whatever that is, then we can do so by finding the least-cost path through a cost matrix. We end up with modified speech with temporal changes.
So the important question now is what we put in as the cost function. We tried various things. One of them is based on glimpsing; that is where the G comes in, in GC-retime. The other component is cochlea-scaled entropy, which is a measure of the information content in speech. In simple terms, what we try to do is find the path which maximises the number of glimpses of speech you are going to get, by shifting speech away from epochs where the masker is intense, while at the same time being sensitive to the speech information content, where information content is defined by cochlea-scaled entropy.
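The least-cost-path idea can be illustrated with a small dynamic program (an assumed toy formulation, not the GC-retime implementation): speech frames are placed in order into a longer sequence of output slots, pauses may be inserted, and the cost of a placement grows with both the masker intensity at that slot and the information carried by the frame, a stand-in for cochlea-scaled entropy.

```python
# Illustrative retiming DP: place n speech frames, in order, into m >= n
# output slots so that high-information frames avoid intense-masker epochs.
# Unused slots become pauses, which is how elongation arises.

def retime_cost(info, masker):
    """Minimum total cost of placing frames with the given info weights
    into slots with the given masker intensities."""
    n, m = len(info), len(masker)
    INF = float("inf")
    # best[i][t]: min cost of placing the first i frames within the first t slots
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    for t in range(m + 1):
        best[0][t] = 0.0  # placing no frames costs nothing
    for i in range(1, n + 1):
        for t in range(i, m + 1):
            place = best[i - 1][t - 1] + info[i - 1] * masker[t - 1]
            skip = best[i][t - 1]  # leave slot t-1 as a pause
            best[i][t] = min(place, skip)
    return best[n][m]
```

For example, with masker intensities `[5, 0, 0, 5]` and two frames to place, the optimum tucks both frames into the two quiet middle slots at zero cost, which is exactly the shift-into-the-gaps behaviour described above.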
It turns out that this is a pretty successful strategy, giving about four decibels of improvement in the Hurricane Challenge.
Now, the way the challenge was set up, a small amount of elongation was allowed. For various reasons we were interested in preserving temporal structure, so we allowed only a little elongation, half a second in this instance. Not surprisingly, most of the time the retiming algorithm exploits that fact: it shifts bits of speech around into the gaps, and it also exploits the silences.
So how much of the benefit of retiming comes simply from the elongation? Our previous results would suggest that elongation doesn't help; that is how I began this section. But strangely, in the case of a modulated masker, competing speech in this case, we found that simply elongating did help: not as much as retiming, but about half of the effect could be due to pure elongation. With a stationary masker, speech-shaped noise in this case, we find elongation doesn't help, which is consistent with the earlier picture. So what is really going on here?
The reason that people don't find improvements with duration-based approaches, we think, is that most of that work has been done with stationary maskers. Against a stationary masker, if you simply elongate, you are not introducing any new information, because the masker itself is stationary. With a modulated masker, if you stretch a word out that was masked, some fragments of it, perhaps enough for identification, are going to escape the masking, and that is what we think is responsible here. The other important thing that came out of this is that the retiming itself appears to be intrinsically harmful. Strangely, the very modification that is really beneficial for one masker, where we get the biggest of the gains, is actually harmful for the stationary masker: we are distorting the acoustic-phonetic integrity of the speech. The retimed speech carries the same distortions in both cases, but in the case of the fluctuating masker it is highly intelligible.
In the time that remains, let me give a picture of where we are with speech modifications: what can we achieve? This brings us to a couple of Hurricane Challenges: one we ran internally within the Listening Talker project, and then one that was a larger public evaluation, presented at Interspeech. The goal was this: people providing modified speech had access to the maskers at given SNRs, and simply returned modified speech to us; we then evaluated it with a very large number of listeners. These are some of the entries.
So: plain speech. ["A large size in stockings is hard to sell."] Natural Lombard speech. ["A large size in stockings is hard to sell."] Some unmodified TTS. ["A large size in stockings is hard to sell."] This one has Lombard properties applied to TTS, a TTS adapted to Lombard speech. ["A large size in stockings is hard to sell."] That was the synthetic voice trying to compete with the noise as well. There are a number of other techniques; I'll play this one because it was the winning entry. ["A large size in stockings is hard to sell."] On this website you will find many more examples.
These are the results of the internal challenge. Of the systems, SSDRC, which came from Yannis Stylianou's lab at the University of Crete, was the winning entry, producing gains of about thirty-six or thirty-seven percentage points in this condition. What does that amount to in dB terms? About five dB.
So there are, I think, useful gains to be had from speech modification approaches. What you can also see here, which I think is interesting, is that natural Lombard speech in this condition, just this case here, actually produced a gain of only about one dB. So we are getting super-Lombard performance from some of these modification algorithms, even from the ones based to some extent on Lombard speech. Unmodified TTS is a long way behind, but by applying, for instance, Lombard-like properties to TTS systems, we can improve things by over two dB.
The slightly larger Hurricane Challenge last year presented its results in a different way, as I'll illustrate with this. What we are looking at here is the equivalent intensity change in dB, in the face of a stationary masker, speech-shaped noise, and here a competing-talker masker. The green points correspond to natural speech, and the baseline is where the lines intersect, about there. The TTS entries are over here in blue, below the baseline. This is a fairly low-noise condition; if we go to a high-noise condition, which gives a better idea of what these things are really capable of, then we again see gains of about five dB in stationary noise, with GC-retime getting close to that as well in fluctuating noise.
What I really want to point out, which to me was probably the most interesting outcome of this evaluation, is the fact that some of the TTS systems, adapted on the basis of some intelligibility criterion, are actually doing really well. Compare them against the natural-speech baseline over here: we are getting a couple of TTS systems, which I'll play examples of in a second, that are actually more intelligible than natural speech in noise, which I think is a fairly interesting achievement. These came from two different labs, one of them [inaudible] and the other from Daniel Erro and colleagues at the University of the Basque Country; these are different groups, and I had nothing to do with this work. This is an example of what these sound like, against the unmodified TTS, and it is pretty evident that the adapted synthetic speech is much more intelligible in those cases.
The final thing to say about the Hurricane Challenge concerns something we did recently. A natural thing to do, of course, is to take the spectral changes and the temporal changes and see whether they complement each other, and the short answer is yes. This is unmodified speech; this is the effect of just applying temporal changes with the GC-retime algorithm; this is just the effect of SSDRC, in this case a spectral shaping and dynamic range compression algorithm. Put the two together and you get something which isn't quite additive but is certainly complementary: at around forty percentage points, we have a nine to ten decibel impact.
In the last few minutes I want to address this question: is modified speech intrinsically more intelligible, or is it just evading the masker? It is a little tricky to answer, simply because of how we measure intelligibility: we normally measure it in noise, because otherwise performance is at ceiling. But if you have a system which modifies speech to be more intelligible in noise, of course it is going to be more intelligible in noise; you are not measuring intrinsic intelligibility, you are measuring the ability to overcome energetic masking. That is with native listeners; if you use non-native listeners, then intelligibility in quiet is usually some way below the ceiling performance of natives.
So that is what we did: we played non-native listeners Lombard speech. What we found, forgetting about most of this, is the key result here: Lombard speech is actually less intelligible than plain speech in quiet. The same speech which is more intelligible in noise is less intelligible in quiet for non-native listeners. If Lombard speech were making improvements to acoustic-phonetic clarity, shall we say, rather than just a generalised set of changes, then you might expect to see benefits, but we don't. I'll skip over that.
Something we did recently asks the same question with non-native listeners for SSDRC, which as I said was the winning entry in the Hurricane Challenge. Again we see similar results in quiet: for non-native listeners, who as you see are well below ceiling, the modified speech actually makes things worse.
So, to conclude: what I have tried to show is that by taking some inspiration from talkers, and sometimes going beyond what talkers are capable of doing, we are able to motivate algorithms which can turn speech that is nearly unintelligible into speech that is almost entirely intelligible. There has been some development of objective intelligibility models to make this possible, and I think there is definitely scope for much more work here: the better the intelligibility models we can produce, the bigger the gains we can expect them to yield. I should say that this work is more or less immediately applicable to all forms of speech output, including domestic audio coming from non-speech-technology devices: radios, TVs, et cetera.
There is some material I didn't say much about, including work with dyslexic listeners, basically showing that they benefit from similar modifications too. One thing we do need to look at, as I noted on the last couple of slides, is this loss of intrinsic intelligibility. I think this is an opportunity: we have an algorithm here which does well in noise but in quiet actually harms things. What if the two were not in conflict? If we could combine them, making clarity-based changes at the same time as dealing with energetic masking, we could expect further gains. Okay, thank you very much.
Thank you, Martin, for this very interesting talk. Do you have any comments on the use of this work for ASR, to improve speech recognition? That is an interesting question. Are you thinking maybe we can train talkers to interact more clearly with our ASR devices? That is not going to happen, is it?
I think, yes, of course. One of my original aims in the Listening Talker project was to get as far as looking at dialogue systems, where ASR is a key component, and to look at ways of improving the interaction by essentially making the output part of it much more context-aware. In this sense, if you could make the interaction smoother, which might also mean allowing overlap as a natural form of communication, then I guess the input side might also end up being smoother; but we didn't end up doing that. Some results in ASR suggest that it is preferable to adapt to the environment rather than trying to remove the obstruction from the speech, and that there is benefit in having matched data to adapt to.
Another angle I didn't mention, but the way I think about it, coming from my background in computational auditory scene analysis, is this: we often set out to solve the problem of taking two independent sources and trying to separate them, without acknowledging the fact that the two are not independent, except in speech separation competitions. We are always aware of what is going on in the background, and since we modify our speech accordingly, that really ought to be factored into these algorithms; it could actually make separation simpler.
Thank you, Martin, that was very interesting. I probably have about twenty questions, but if I narrow it down to a couple: in your work, were there any constraints regarding the quality or naturalness of the enhancement? Good question; let me see if I can answer it. One of our original goals, back when we thought we would just knock off the intelligibility work in the first year or so, was to look at speech quality, and we did a little work looking at objective measures of speech quality.
Some of the modifications I didn't talk about can produce highly distorted speech. I remember one modification we produced where we were essentially taking the general approach of: suppose we equalise the SNR in every time frame, a process a little like dynamic range compression perhaps, but more extreme; or we go further still and equalise the SNR in each time-frequency cell.
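A minimal sketch of that per-cell idea, under an assumed formulation (0 dB per-cell SNR up to a global scale, with the fixed overall energy constraint mentioned later in this discussion), might look like this; it is an illustration, not the modification actually built in the project:

```python
# Toy per-cell SNR equalisation: scale speech power in each time-frequency
# cell so every cell has the same speech-to-noise ratio, then renormalise
# so the total speech energy is unchanged (a constant-RMS constraint).

def equalise_snr(speech, noise):
    """speech, noise: per-cell power values (flat lists of equal length)."""
    total = sum(speech)
    # give every cell the same speech-to-noise power ratio...
    eq = list(noise)
    # ...then rescale so overall speech energy is preserved
    scale = total / sum(eq)
    return [e * scale for e in eq]
```

For example, `equalise_snr([4.0, 4.0], [1.0, 3.0])` redistributes the same total power as `[2.0, 6.0]`, so both cells end up with a speech-to-noise ratio of 2; one can imagine how aggressively this reshapes the signal, which is exactly the distortion being described.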
You can imagine that the effect of doing that is highly distorting, and sometimes it is highly beneficial, but sometimes very harmful; it is a very binary kind of thing. So we didn't pursue this so much, and some of the other partners, I think, did work on speech quality too.
In that sense, we have just been analysing some results with non-native listeners, looking at their responses as a function of speech quality differences. Their intelligibility patterns are pretty much identical to those of native listeners in the respects we have examined, even though they react to the distortions quite differently; you might expect that rich native L1 knowledge would somehow enable listeners to handle these distortions more easily, but that hasn't been the case. A related consideration is that we are using this constant-RMS energy constraint, whereas we should really be looking at loudness, which is more difficult to optimise; really you have to agree on a loudness model first.
Then my second question: you discussed the effect of the listening ear being matched, native or not, and you mentioned working with English and Spanish, but I'm wondering, have you studied a variety of source languages and found any of them more amenable to this process? Maybe we should switch to a different language. Thank you.
That is also interesting. I have not done any work on that with speech output, but we had a project a few years ago looking at eight European languages from the point of view of noise resistance, and there are clearly differences; a lot of it has to do with a language's resistance to energetic masking. That is just never taken into account: in multi-language studies we often just normalise using performance in quiet, although really we shouldn't, because some language baselines seem to be able to tolerate maybe up to four dB more noise than others, and nothing has been specifically designed with that in mind.
As you know you are in the speaker recognition community, I'm quite sure you were expecting my question: do you have any idea of the effect of the Lombard effect, and the kinds of modifications you have just presented, on speaker recognition performance? Unless John Hansen has done some work on this, which I guess he probably has; I'm not sure of the extent. Obviously there is the field of speaking under stress, and then very high noise conditions, if you want speaker identification based on that. Is that your question?
My question also relates to the forensic problem: if someone is recorded in the presence of noise, so using a Lombard voice, would the speaker model information be the same as when we record them in quiet? That is a very interesting question. There is an interesting project using similar techniques, not by us but at the Basque Country lab, which essentially tries to map from Lombard to normal speech: if you know that somebody is talking in a given degree of noise, you can attempt to transform the Lombard speech into normal speech. That has proved possible, but only partially so in some cases. I think you always need to be careful experimentally, because whether there is noise, and whether you are communicating or have been told you are communicating, makes a huge difference.
Going back to the example of the two couples, one speaking English and the other Spanish: I guess the one couple did not understand the other? We didn't have that as one of the situations; actually, we ran the same experiments with four English speakers or four Spanish speakers. But the question is what would happen if there were two couples speaking the same language but on different topics, where the disturbance is not only noise but also understanding. In a study we did a couple of years ago, we discovered this effect of the informational masking caused by an interfering conversation sharing the same language.
It is the case, and it matches the common experience of many people, particularly in a bilingual or trilingual country like this one, that if somebody is talking in a language you know well, even if it is not your native language, it is a much bigger interfering factor. That is one of the things that definitely happens, and it is worth between about one and four dB depending on the language pairs that have been looked at. Was that the point of your question? Okay. So it is all a matter of informational masking, and again that is another big area that has been studied from the perceptual point of view, but not from the applications point of view, in terms of how to deal with it or do without it.
Thank you for the talk. Regarding what was said earlier: I have personally tried some speech enhancement, and so have some colleagues, and even if we train on enhanced data, if we apply speech enhancement to the test data our systems seem to do worse compared to when we give them all the noise.
So do you have any thoughts on that? Okay, so you are saying that speech enhancement in this case doesn't work, and you have no explanation for that, but it seems that if you try to remove the noise, the systems do not get better. That is a general finding, a very surprising one in a way: speech enhancement does not help much in robust recognition applications. In a related vein, very few speech enhancement techniques work for intelligibility purposes either; there is really only one that works, the others don't, and some are terrible, so it is related to your question; these things are not even linearly related. The one that does work, in the region where it works, is the dynamic range compression type of approach, extreme dynamic range compression.
This question is based on the example you showed about the announcement on the train: is there any way of increasing the intelligibility of things like the name of the train station for people who are not native speakers? There are a couple of things you can do: a low-level thing, a high-level thing, and things in between; we haven't done them, but others have. One, of course, is that you can transfer all your excess energy to those important items; that is the low-level approach. At the high level, if you are dealing with synthetic speech, you can attempt to produce hyper-speech, and there has been work which has been very successful in doing this, algorithmically and fully automatically producing speech which is more likely to meet its target; that is really going to help a lot in those cases. And then there are more prosaic things, like simply using repetition or sentence simplification, particularly when it comes to proper names like that. There are some very specific things you can do, and I think we need to look at it that way: it amounts to introducing redundancy, and it still needs to be done.