I wondered where the invitation came from, and as we started discussing travel arrangements it became clear. So the question I asked myself when I was invited to come to this meeting was: what can I say that could possibly be of interest to people interested in speaker identification, language identification, this kind of topic?

You won't find, you know, i-vectors; if an i-vector comes up in this talk, it will be describing the process of virus transmission on an Apple device. But if you look at some of the topics, there are a number of points of connection that I noted down.

I'm going to be talking about the way speakers change as a result of context, and how we develop algorithms that can modify speech to make it more intelligible. Of course, this has some relationship with spoofing, for instance: the methods I'll be talking about could potentially be used to disguise somebody's identity.

Also, the effect of noise on speaking style is obviously very relevant to people interested in speaker identification. There was a talk this morning about diarization and overlapping speech; I'm going to be showing you some data on overlapping speech, in realistic conditions where noise is present as well. Durational variation is also, I believe, a relevant issue, and we can glean some behavioural information from it. So there are some points of contact between what I want to say and the kind of work that people are doing in this field.

But to keep things simple, this is what I'm going to talk about: replacing the easy approach to intelligibility, which is increasing the volume, with a hypothetical but potentially very valuable device for increasing intelligibility.

So I'm going to start by talking about why we should robustify speech output: why it's an interesting problem, and what kinds of applications it has. Then a few general observations about what talkers do in adverse conditions. Then I'll turn to our research interests in the spectral and temporal domains, with a few little tidbits about behavioural observations, but focusing mainly on some of the algorithms that people in the LISTA project have developed over the last couple of years, culminating with the Hurricane Challenge, a global evaluation of speech modification techniques which took place last year at Interspeech.

And if there's time, also a few words about what modifications do to speech: whether they actually make it intrinsically more intelligible, or whether they just overcome the problems of noise.

So why robustify speech at all? Well, I'm sure you're aware that speech output is extremely common, both natural recorded and synthetic. If you think about your journey to this place, you would presumably have gone through various transport interchanges, and perhaps the aeroplanes themselves: lots of difficult environments, reverberation, poor loudspeakers et cetera, hearing all sorts of nearly unintelligible messages coming out of the PA loudspeakers. There are millions of these things in continuous operation, and it's surely an interesting problem to attempt to make the messages as intelligible as possible.

The same goes for mobile devices or, say, in-car navigation systems, where noise is simply a fact of life in the contexts in which they are used. And of course in speech technology, particularly speech synthesis, messages are realistically sent out into these environments regardless of context (we've all heard someone talking, let's say, over a voice-driven GPS type of system) and regardless of whether there's noise present.

Here are a few examples, just recorded with a simple handheld device, in this case of recorded speech in a noisy environment. And because there's a hum on this, I'm going to plug it in just for the duration of this. [plays example] Now, convolve that audio with the PA delivery system as well, and you couldn't, even in theory, make out much of it at all; but believe me, it's not intelligible to start with.

Here's another one, not like that one: so that was recorded speech, this is live speech. But this is accented speech, and we know that a foreign accent can be equivalent to, say, five dB of noise in some cases. [plays example]

And this is another of my favourite examples, because this is really a user-interface design problem for the people who designed it: the train to Edinburgh. [plays example] The noise saying the train is about to depart collided with the announcement. So there are simple fixes for those cases in particular.

Anyway, why is it worth doing this? Well, I think it's worth bearing in mind that for natural sentences we have lots of different data that basically show the same point: that every dB you can gain, in effective terms (I'll say more about that later on), is worth between five and eight, say five, percentage points, depending on the speech material; this is for sentences pretty close to normal speech. And so every dB we gain is worth having, essentially; for some materials perhaps a little less.

And every dB attenuated potentially saves lives. That might sound like a bit of a bold statement, but let me qualify it.

This is a report by the World Health Organization which covered environmental noise in Europe. By environmental noise I mean excluding workplace noise: this is not people working in factories with, you know, pistons and hammers going all the time; this is to do with just the noise pollution that exists in everyday environments. If you live near a railway station, you get announcements all day long; if you live near an airport, also on the list, the noise of aeroplanes is associated with stress-related diseases, cardiovascular problems in particular.

Now, these don't necessarily lead to fatalities, so the qualification is this: there are methodologies for measuring how much healthy life is lost. If you suffer, for instance, severe tinnitus as a result of environmental noise, that might carry a coefficient of 0.1, which means that for every ten years you effectively lose one year of healthy life. That's a very large figure. Anything that we can do to attenuate environmental noise has to be beneficial.

Just to contrast what I'm talking about with existing areas within the field: the difference between speech modification, let's say, and speech enhancement is that we're dealing with speech where the intended message, the signal itself, is known. So it's in a sense a simpler problem: we don't have the problem of taking a noisy speech signal and feeding recognisers, or enhancing it for broadcast, for instance. It's sometimes called near-end speech enhancement.

And it's not like additive noise suppression: we're not attempting to control the noise. The sorts of situations I'll be talking about here are ones where it's not really practical to control the noise, because you've got, say, a public address system with hundreds of people listening to it; they can't all be wearing headphones or whatever. So what can we do? What we're left with, within the system, is the ability to modify the speech itself, and that's subject to a few constraints. These are just practical constraints.

In the short term, we're probably interested in changing the distribution of energy in time, and even the durations; I'll show you some of what we do with that later on. But overall, in the long term, we don't want to extend the duration and fall behind in an announcement, for instance. And we don't want to, or are not able to, increase the intensity of the signal. So normally in this work, and this is going to be the case throughout pretty much everything I talk about, there's a constant input-output energy constraint enforced.
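In code, this equal-energy constraint amounts to a final RMS renormalisation of whatever the algorithm produces. A minimal sketch, assuming a numpy waveform; the function and signal names here are my own, not from the talk:

```python
import numpy as np

def match_rms(modified, original):
    """Rescale a modified signal so its overall RMS energy equals
    that of the original: the modification must not add level."""
    rms_in = np.sqrt(np.mean(original ** 2))
    rms_out = np.sqrt(np.mean(modified ** 2))
    return modified if rms_out == 0 else modified * (rms_in / rms_out)

# toy illustration: some modification that happens to change the level
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)      # stand-in for a speech waveform
y = match_rms(2.5 * x, x)           # renormalised modified signal
```

With this in place, any spectral or temporal change can be compared against the original at exactly the same SNR.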

Here are just a few examples of what we can achieve, and I'll come back to saying how it's done a little later.

Okay, so what you'll listen to is some speech in noise. Trust me, because you may not be able to hear the speech; you can just about tell there's a speaker in there. [plays example] Well, we can modify that speech without changing the overall RMS energy; the noise is constant, so the SNR is the same in these two examples. Modified, it sounds something like this. [plays example] If you listened to that in an experimental setup over headphones, you could be pretty much guaranteed to get about seventy percent of the words clearly, even though the sentences are about as unpredictable as sentences can be: these are semantically unpredictable sentences. So that's the motivation, by way of introduction.

Now, a little bit about what talkers do. This has been a longstanding research area; I'd say it goes back at least a hundred years. A lot of the work concerns clear speech: if you give somebody instructions to speak clearly, then they will. And you don't even need to give instructions: in this situation, now, I'm attempting to speak more clearly than I was over lunch, for instance. Speech changes in these situations in ways which it's possible to characterise, and possibly to copy.

And there has been work on mapping clear-speech properties onto natural speech, or on learning a mapping function instead. But I'm focusing on adverse conditions. Speech also changes as a function of the interlocutor: it changes when you talk to children, infant-directed speech for instance; there's also foreigner-directed speech, as it's called, for non-natives; and speech directed at pets, and at computers. As we all know, computer-directed talk occurs, and there was interest in that in speech recognition: speech changes. But I'm mainly focusing on adverse environmental conditions.

There has been good work on Lombard speech for a long time, and a number of people have worked in this area.

Now, I should say, we're interested in Lombard speech not necessarily because we expect speakers, or a device, or listeners to be in environments with the noise levels that would normally induce Lombard speech, which are usually quite high, but simply because Lombard speech is more intelligible. We want to know why, first of all, as scientists; and then we want to embed that knowledge into an algorithm so that it at least reproduces the intelligibility benefit of Lombard speech, or goes beyond it. Actually, I'll show some results later suggesting that we can indeed go beyond it.

Now, I guess most of you will have heard Lombard speech, but in case you haven't, this is what it sounds like. [plays example] Okay, that's normal speech. This is the same talker, the same sentence, with, in this case, I think 96 dB SPL of noise over headphones; you don't hear the noise because it's not in the recorded signal. [plays example]

Some of the properties of Lombard speech are fairly evident here, I think. You can see the duration change: this is the normal speech, this is the Lombard speech, and it's quite normal for the duration to be slightly longer. The stretching is nonlinear: voiced elements tend to be extended more than unvoiced elements. Also, if you look here at the F0, at the harmonics, it's clearly visible that in Lombard speech F0 is typically higher too. There are other characteristics which are not visible in this particular plot, but which you'll see in a second, and which might be important.

Now, the real reason we're interested in Lombard speech is that there's lots of data like this. This is just some we have, from a similar study we ran ourselves, showing the percentage increase in intelligibility over a baseline, a normal-speech baseline, for four different Lombard conditions. You can see we can get some pretty serious intelligibility improvements when the Lombard speech is presented to listeners in the same amount of noise: improvements of up to twenty-five... decibels, sorry, twenty-five percent.

The question is why. Why is Lombard speech more intelligible? There seem to be a number of possibilities, possibly acting in conjunction.

So, one option. What you're seeing here in these panels are three spectrograms, cochleagrams as they're sometimes called, with a nearly logarithmic frequency scale: speech which was not produced in noise (this is normal speech), and different degrees of Lombard speech; you can see the duration differences again. What's drawn on top are the regions of the speech that, if you were to mix each of these in a certain amount of noise, would actually come through the noise: these patches are what we call glimpses. (This comes from a model I'll be defining a bit more carefully later.) What you see is that there are relatively few glimpses in the normal speech, and far more in the Lombard speech, particularly in the high-frequency regions.

And that relates to one of the other key properties of Lombard speech: the spectral tilt is changed, is reduced. This end is low frequency, this is high frequency, and Lombard speech is more like this, flatter, which means essentially it's putting more energy in the mid and somewhat higher frequencies. In auditory terms, that means the mid part of the cochlea: from about one kHz upwards we see more energy. So there are potentially spectral cues.
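As a purely illustrative sketch of a tilt change (not the algorithm discussed in the talk): a first-order pre-emphasis filter shifts energy towards higher frequencies, after which the equal-energy constraint is restored. The `alpha` value is an arbitrary placeholder:

```python
import numpy as np

def flatten_tilt(speech, alpha=0.9):
    """Crude spectral tilt reduction: first-order pre-emphasis
    y[n] = x[n] - alpha * x[n-1], then rescale so the overall
    RMS energy is unchanged (illustrative sketch only)."""
    emphasised = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    rms_in = np.sqrt(np.mean(speech ** 2))
    rms_out = np.sqrt(np.mean(emphasised ** 2))
    return emphasised * rms_in / max(rms_out, 1e-12)

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)        # stand-in for a speech waveform
y = flatten_tilt(x)                  # same energy, flatter tilt
```

After this operation the spectral centroid moves upwards while the total energy stays fixed, which is the qualitative effect attributed to Lombard tilt here.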

There are also potentially temporal cues: simply the slowing down of speech rate, if you like, or rather, since it's a nonlinear expansion, maybe that's what is beneficial; that's a contentious issue which I'll address. And maybe there are real acoustic-phonetic changes too: maybe, when listeners are present in a high level of noise, talkers attempt to hit vowel targets more precisely and expand the vowel space, as seems to be the case in clear speech.

Now, the question of whether Lombard speech is intrinsically more intelligible I'll address in part two.

At the start of the LISTA (Listening Talker) project, we all got together, kind of optimistically, and had a bit of a brainstorming session, just to list the things we might do to speech to make it more intelligible, to make it more robust. Apart, of course, from increasing intensity, which we ruled out. Some of these came from being aware of Lombard speech at this point:

obviously, changing spectral tilt is a possibility, and then the thing I just mentioned, acoustic-phonetic changes, expanding the vowel space; you'll see what's going on with this vowel space later in the slides.

And we continued thinking along these lines: maybe narrowing the formant bandwidths; not wasting energy on less useful regions of the spectrum; in general, reallocating energy, sparsifying energy, is another generalisation. Some of these I'm mentioning because you're going to see some examples of them.

Dynamic range compression has been used for a long time in audio broadcasting, and it also works here. And then there are further, higher-level things: trying to match the interlocutor's intensity, or to contrast with it; to maybe avoid overlaps, which were talked about this morning; and so on.

Okay, changing F0, which we thought about some more, along with some other things: enhancing particular vowels and consonants, simplifying syntax, and so on; and further, maybe producing speech which has a low cognitive load on the listener.

As you can see, there's an awful lot that can be looked at, and not all of these things have been worked on, so it's a great area for people who are interested to start looking at.

What I've tried to do is group these into a somewhat more sensible structure, by looking at the goal of a speech modification: what could the possible goals of modifying speech be? All of it is context dependent, but if we just focus on speech in noise, one of the clear goals is to reduce energetic masking, as it's called.

In case you don't know the difference between energetic masking and informational masking: energetic masking describes, essentially, what happens when a masker and a target, let's say speech, interact at the level of the auditory periphery, and some information is lost due to compression in the auditory system. Informational masking can come in later, if some information is getting through from another talker, say: with two talkers talking at once, you've got two messages, or at least fragments of two messages, and if the speakers are very similar, if they have the same gender, then it can be very confusing to work out which bits belong to which talker. That's an example of informational masking.

So, to reduce energetic masking, we can do things like sparsification of the spectrum or changing the spectral tilt. To reduce informational masking, if we've got control over the entire message generation process, we might do something like change the gender of the talker.

Okay, not necessarily with TTS: we have voice conversion systems that can do this. And we can add visual cues, which assist with reducing the effect of an interfering talker.

And then we can do other things. This comes from my longstanding interest in auditory scene analysis, but taking the problem and inverting it: we can try to prevent grouping. We can send a message into an environment where there are other sources, but do things to it to prevent it grouping with them. This echoes an idea studied within scene analysis: there's an awful lot of work on the role asynchrony plays in, say, a quartet. I believe you can find that the individual instruments, when they come in, are not perfectly aligned, and those timing differences at the onsets can be enough to keep them separate. That's an example of what I'm talking about: using scene analysis to prevent the message clashing with the background.

Then there are things we can do to reduce the cognitive load of the message: using simpler syntax, decreasing speech rate; or we can equip the speech with more redundancy, for instance by, at a higher level, repeating the message words. So there are lots of things that one might figure out to do.

What I want to do now is move in the direction of some of the experiments we've been doing over the last few years, and this is the kind of typical approach we take: we construct what could be described as synthetic Lombard speech, in one form or another. What I mean is, we can take normal speech (let me play this again) [plays example], read speech, okay, and take the Lombard sentence [plays example], and ask: how much of the intelligibility advantage of that Lombard speech comes from, say, the timing differences? So we can time-align the two sentences and then, for instance, ask that question.

Or we can keep only the F0 shift but remove the spectral tilt change; that sounds like this. [plays example] To me that doesn't sound like a Lombard talker, okay, because the residual, the difference between the two, lies in things like the spectral tilt. From an experimental point of view, we can then identify the contributions of factors such as F0, spectral tilt and duration to the intelligibility advantage.

So now I want to look at the spectral domain, and I'll start off with one of the early experiments we did, looking at exactly those parameters: spectral tilt and fundamental frequency. Because Lombard speech, and clear speech, and indeed all forms of modified speech, do modify F0, you might be led to believe that the F0 change is an important one. But it turns out that it isn't.

What you're looking at here is the increase in intelligibility over a baseline obtained by manipulating F0 to bring it in line with Lombard speech, and none of these changes is significant; the three different bars just represent different Lombard conditions. On the other hand, if we change the spectral tilt, and this is just a constant change, not time dependent, we get about two thirds of the benefit coming through; this is the real Lombard speech up here. So a lot of the Lombard advantage is due to spectral tilt.

It turns out this could be predicted very well just by considering energetic masking with a glimpsing model: the spectral tilt change is lifting some of the speech out of where the masker is strongest. And there have been approaches which simply do that to the speech; we get some benefit from modifying speech based on just changing the spectral tilt globally.

But we can look at this quite a bit more generally and ask the question: if all you're allowed to do is apply a stationary spectral weighting, so essentially design a simple filter to apply to the speech, what's the best you can do in the spectral domain? This is the general approach. Offline, the weighting can be masker dependent, so it's context dependent: for each masker we come up with a different spectral weighting. Then, online, it's necessary to recognise what kind of background we have and apply the weighting appropriate to that particular type of masker.

What we realised early on in this project was the really important role that objective intelligibility metrics have in this whole process, simply because we want to use them as part of a closed-loop design and optimisation process. We can't bring back a panel of listeners every ten milliseconds to answer the question "how intelligible is the modification that our algorithm has just come up with?", though of course we still test with listeners at the end of the design phase. So it's critically important to have a good intelligibility predictor.

The first intelligibility predictor we used is the glimpse proportion measure, and I'll just describe what it does; it's a very simple thing. We take separate auditory representations of the speech and of the noise; just imagine some kind of cochleagram representation, a gammatone filterbank, from which we take the envelope in each channel, the Hilbert envelope, and downsample. Essentially that's it. Then we ask: how often is the speech above the noise plus some threshold (which we can vary as needed), and we just measure the number of spectro-temporal points where that's the case. A very simple, very rapidly computed intelligibility model.
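The counting step just described can be sketched in a few lines. This is a minimal illustration, assuming the envelopes are already computed and in dB; the threshold value and toy data are placeholders, not the published parameterisation:

```python
import numpy as np

def glimpse_proportion(speech_env_db, noise_env_db, threshold_db=3.0):
    """Minimal sketch of the glimpse proportion measure: given
    spectro-temporal envelopes (channels x frames, in dB), e.g.
    downsampled Hilbert envelopes from a gammatone filterbank,
    count the fraction of points where speech exceeds noise by
    more than a local threshold."""
    glimpsed = speech_env_db > noise_env_db + threshold_db
    return float(glimpsed.mean())

# toy envelopes standing in for filterbank outputs
rng = np.random.default_rng(0)
s = rng.normal(60.0, 10.0, size=(34, 100))   # 34 channels x 100 frames
n = rng.normal(60.0, 10.0, size=(34, 100))
gp = glimpse_proportion(s, n)
```

Because it is a single threshold comparison over a spectro-temporal grid, it can be evaluated millions of times inside an optimisation loop.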

If we do that, we come up with these kinds of weightings. Again, it depends on what kind of optimisation method you want to use; these are very high-dimensional auditory spectra, say sixty-dimensional.

also

one thing here if you read these icons this is speech or noise competing talk

a cs

speech modulated noise white noise

and circle different mask is we also got given snrs or even ten five zero

minus five minus ten

There are some interesting things going on here; these are the optimal spectral weightings that come out. Most earlier work used much lower-dimensional representations, such as octave weightings, so six to eight octave-band weightings, or even third-octave weightings, maybe twenty or so third-octave bands; here we've got a much higher-dimensional representation.

And we see one at least somewhat unexpected result: as the SNR decreases, this optimal weighting gets more extreme, more binary. We call it sparse boosting, because what it's essentially doing is shifting the energy into a limited set of frequency regions, boosting those and attenuating the neighbouring regions. This was at least partly unexpected.

The question is, what does all this amount to for listeners? Let me play you an example of what these things sound like. This is the unmodified speech, "a large size in stockings is hard to sell", from the Harvard corpus. [plays example] This is the modified version of the same sentence. [plays example] They are, of course, equally intelligible in quiet, I hope; but in noise, even knowing the sentence, I think it should be reasonably evident that the modified speech is more intelligible. And so, as part of the Hurricane Challenge, we entered this particular algorithm and got improvements of up to, say, fifteen percentage points; this is just for two conditions at given SNRs, but roughly that's the amount.

It's more useful to think of these in terms of dB improvements, and so we use this idea of an equivalent intensity increase. The idea is: if you modify speech, how much would you need to boost the unmodified speech by, in the sense of how much you would need to increase the SNR, to get the same level of performance? This can be computed by measuring psychometric functions for each of the maskers you're using, and then mapping from the unmodified speech to the modified speech.
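The mapping step can be sketched by inverse interpolation of the measured psychometric function. All of the numbers below are hypothetical placeholders, not data from the talk:

```python
import numpy as np

def equivalent_gain_db(snr_db, score_unmodified, test_snr_db, score_modified):
    """Sketch of the equivalent intensity increase: given a measured
    psychometric function for unmodified speech (intelligibility
    score vs SNR, assumed monotonically increasing), find the SNR at
    which unmodified speech matches the modified speech's score, and
    return the difference from the SNR actually used."""
    snr_needed = np.interp(score_modified, score_unmodified, snr_db)
    return snr_needed - test_snr_db

# hypothetical psychometric function for unmodified speech
snrs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
scores = np.array([0.10, 0.30, 0.55, 0.80, 0.95])
# modified speech scoring 0.80 at 0 dB SNR is 'worth' 5 dB here
gain = equivalent_gain_db(snrs, scores, 0.0, 0.80)
```

Expressing a modification's benefit in dB this way makes results comparable across maskers and materials.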

What this slide tells us is that, if you look at the subjective bars, these filled bars here, we're getting about two dB of improvement using that static spectral weighting, which is quite useful: two dB is maybe somewhere between ten and fifteen percentage points here.

Now, this figure shows something else. These white bars here are the predictions, on the same basis, of the objective intelligibility model that was used to design the weightings in the first place. And as you can see, the predictions are not really that good. You could look at them kindly and say, well, they're quite correlated, but they're really not very good at all; there's quite a big discrepancy in these cases here. Of course, in one sense that doesn't matter, because we're still getting improvements for listeners; on the other hand, if we had a better objective intelligibility model than the glimpse proportion, for instance, then we might expect bigger gains.

So one of the things that colleagues have been focusing on in the last while is improving intelligibility models for modified and synthetic speech. You might recognise some of the abbreviations here: this is the speech intelligibility index, the extended speech intelligibility index, et cetera; these are quite recent intelligibility metrics, seven of them. And these are five glimpse-based metrics that have been developed to try to improve matters.

The thing to notice is that the one we were using to design those static spectral weightings is this one, and it just didn't perform that well, actually. In fact, most of the metrics don't really perform so well on modified speech. Normally we'd look for correlations with listener data of at least 0.9 for natural speech; their performance on modified and synthetic speech is rather worse.

So the natural next step was to see what happens if we do the same static spectral weight estimation, but this time using this high-energy glimpse proportion metric instead.

This is really just a series of adaptations to the normal glimpse proportion. This term over here is the normal glimpse proportion. What we're adding here is something which represents the hearing threshold: sometimes we present speech at such a low SNR that some of the speech within the mixture, when it's presented to listeners at some given level, is actually below the threshold of hearing, and that has a small effect on the intelligibility prediction, so that's catered for here. We've also got a sort of logarithmic compression, to deal with the fact that glimpses are very redundant: you probably need only thirty percent of the spectro-temporal plane as glimpses to reach a ceiling in performance; that's handled there. And this is a durational modification factor, which attempts to cater for the fact that rapid speech is less intelligible. So there are a few changes, and I'm not really going to go too much into them here.
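Two of these adaptations, the hearing floor and the compressive ceiling, can be illustrated on top of a basic glimpse count. This is only a sketch of the ideas; the parameter values and the exact compressive function are my placeholders, not the published metric:

```python
import numpy as np

def hegp_like(speech_env_db, noise_env_db, threshold_db=3.0,
              hearing_floor_db=20.0):
    """Illustrative sketch: (1) spectro-temporal points below an
    absolute hearing floor cannot count as glimpses; (2) a
    compressive function of the glimpse count models the redundancy
    of glimpses (scores saturate well before 100% coverage)."""
    audible = speech_env_db > hearing_floor_db
    glimpsed = audible & (speech_env_db > noise_env_db + threshold_db)
    gp = glimpsed.mean()
    # placeholder compression: steep early growth, gentle near the ceiling
    return float(np.log1p(10.0 * gp) / np.log1p(10.0))
```

The compression means that going from 10% to 30% coverage matters far more than going from 70% to 90%, mirroring the redundancy argument above.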

Here is just a trace of the patterns that come out of this optimisation process. What we're seeing is actually quite similar patterns to the preceding model, with some differences. There are six different noise types: low-pass noise, high-pass noise, white noise, and again modulated, competing-talker noise and speech-shaped noise; but we essentially see pretty much a boost of the high frequencies.

We changed corpus here a little bit: working in Spain, it became more convenient for me to have Spanish listeners, rather than my former English, Scottish and other colleagues, to run some experiments with. So this is with the Sharvard corpus, which is a Spanish version of the Harvard sentences.

What you're seeing here are gains in percentage points; these are not relative gains. There are percentage-point gains of up to fifty-five from static spectral weighting in the best cases, and in some other cases down at twenty or thirty. It doesn't work at all well in white noise, which we put down to continuing problems with the objective intelligibility metric. But nevertheless we can see that with a very simple approach, which could be implemented as a simple linear filter, we can get some pretty big gains in noise.
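A static weighting of this kind can be applied as a zero-phase linear filter via the FFT, with the equal-energy constraint restored afterwards. The band edges and gains below are illustrative stand-ins, not the optimised weightings from the experiments:

```python
import numpy as np

def apply_static_weighting(speech, edges_hz, gains_db, sr=16000):
    """Sketch of a static spectral weighting applied as a zero-phase
    linear filter in the frequency domain, followed by the
    equal-RMS-energy constraint."""
    spec = np.fft.rfft(speech)
    freqs = np.fft.rfftfreq(len(speech), 1.0 / sr)
    gains = 10.0 ** (np.interp(freqs, edges_hz, gains_db) / 20.0)
    out = np.fft.irfft(spec * gains, n=len(speech))
    return out * np.sqrt(np.mean(speech ** 2) / max(np.mean(out ** 2), 1e-20))

# crude masker-independent profile: attenuate below 1 kHz, boost above
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)       # stand-in for a speech waveform
y = apply_static_weighting(x, [0.0, 1000.0, 8000.0], [-10.0, 6.0, 6.0])
```

In a real-time system the same curve would be baked into a short FIR filter, which is what makes this approach so cheap to deploy.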

A natural question we wanted to ask is: to what extent do we need to make the weightings masker dependent? Because if you look at the weightings we derived here for the different maskers, we tend to see a similar pattern: a preference for putting the energy up in the high frequencies, with maybe a tendency to preserve some very low-frequency information, which might be related to encoding voicing, for instance.

So we tried out a number of static spectral weightings in a masker-independent sense. The simplest one essentially transfers, reallocates, lots of energy from the frequencies below one kHz to the region above, with no attempt to produce a cleverer profile; that's this one here. And then we tried these others, testing out the idea of sparse boosting: just boosting a few channels; sparse boosting with some of the low-frequency information retained; and, as a baseline, a more or less random selection of information in the mid and high frequencies.

And it turns out, slightly to our surprise, that the masker-independent weighting, which is these black bars, in nearly all conditions, or in all conditions, does as well as the masker-dependent weighting, which is the white bars, copied here from a couple of slides back.

Not all of the other weightings did quite so well, although in general they produced improvements. So what this is saying, really, is that for a wide variety of common noises, say babble noise in particular, which is basically the kind of noise in transport interchanges, and the same for speech noises, we can get pretty significant improvements from a simple approach of spectral weighting.

There's lots more to say about spectral types of modification, and lots more to be done, but since I want to give a broader look at the various options, let me move on to temporal modifications.

The first thing to look at is this question of duration, or speech-rate changes. You might think that the way Lombard speech slows down, at least for certain segments, is done for a reason: the natural interpretation is that speakers are trying to make things easier for the interlocutor. So what we looked at was whether or not the slower speech rate of Lombard speech actually helps at all.

What you see here is the method we used: this is plain speech, and this is Lombard speech. We simply time-aligned the Lombard speech nonlinearly with the plain speech, and once you have the time alignment you can do things like transplanting spectral information from, say, the Lombard speech into the plain speech, in the time-aligned sense.
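The align-then-transplant procedure can be sketched as follows. This is an illustrative toy, not the actual analysis pipeline: a real system would align spectral feature vectors frame by frame, whereas here each "frame" is a single number:

```python
# Toy sketch of nonlinear time alignment plus transplantation (my own
# illustration): align Lombard speech to plain speech with classic DTW,
# then carry frame values from one signal into the other along the path.

def dtw_path(a, b):
    """Dynamic-time-warping alignment between two frame sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    path, i, j = [], n, m          # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def transplant(plain, lombard):
    """Replace each plain frame with its time-aligned Lombard frame."""
    out = list(plain)
    for i, j in dtw_path(plain, lombard):
        out[i] = lombard[j]
    return out

plain   = [1.0, 2.0, 3.0, 2.0, 1.0]
lombard = [1.1, 1.9, 3.2, 3.1, 2.1, 0.9]   # slower: one extra frame
aligned = transplant(plain, lombard)
print(aligned)   # plain-timed sequence carrying Lombard frame values
```

The output keeps the plain-speech timing but carries Lombard content, which is exactly the manipulation needed to test spectral versus durational contributions separately.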

Well, the answer to the question of whether or not duration helps is no — and this is not the only study to have found this. Neither linear stretching nor, as in this case, nonlinear time alignment benefits listeners. These two bars show the benefits: overall modified speech — that is, natural Lombard speech — and the spectral modifications; "local" modifications here means spectral transplantation having done the nonlinear time warping. Nothing helps except the spectral changes; the durational changes produce decreases — not significant ones, but clearly not in the right direction.

But I'll come back a little later on to a result which seems to contradict this. So what I want to look at, for the next five or ten minutes, is a slightly richer interpretation of durational changes.

And that is: what happens to speech when you're talking in the presence of a temporally modulated masker? Just think about it — any time you go into a cafe or somewhere like that, you're dealing with a modulated background. Is there anything that we as speakers do in a modulated background to make life easier for the listener?

This situation has barely been studied, and yet it has the potential — we thought, and continue to think — to reveal some more complex behaviour on the part of speakers to help listeners.

So the task we used is this two-talker task. There's a visual barrier between the two talkers; they're wearing headphones, listening to modulated maskers of different types, with varying gap density — so there are opportunities, let's say, for the talkers to get into the gaps. Here's a bit of a link with the overlapping-speech material from this morning. And they have different task materials, so they need to communicate to solve a joint task. This is an example of what that sounds like.

[audio example plays] You can't hear the masker in this example, but you can imagine it being present — the masker was present for these talkers. I should say the timing wasn't quite natural: what you hear there is not really an everyday conversation, because there's a third party involved, and that third party is a modulated masker in this case.

Now, there are lots of interesting things going on in the overlaps, as I'm sure I don't need to tell you. But this is not quite the meetings-style overlap, because it's not a competing talker in the background — we'll see some examples of that in a moment. What I simply want to focus on is the degree of overlap with the masker: do the talkers treat the masker like an interlocutor, in that they tend to avoid overlapping with it, or not?

What we found is that, to some extent, yes: this is showing the reduction in overlap for the different maskers, the dense and the sparse ones. In the case where there's more potential for reducing overlap — the sparse masker, where it's easier to do so — we do see a reduction in overlap. However, they achieve this by increasing speech rate: they speak faster, but only when there's no overlap with the background speech, and that's what's responsible for the decrease in overlap. This is normalised, of course, by speech activity.
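The overlap measure, normalised by speech activity as just described, might be computed along these lines (a sketch under the assumption of frame-level boolean voice-activity tracks; the names are my own):

```python
# Rough sketch of the normalised overlap measure (my own framing): given
# boolean voice-activity tracks for the talker and the masker, compute
# overlap as a fraction of the talker's own speech activity.

def normalised_overlap(talker_vad, masker_vad):
    """Fraction of the talker's active frames that overlap the masker."""
    active = sum(talker_vad)
    if active == 0:
        return 0.0
    overlap = sum(1 for t, m in zip(talker_vad, masker_vad) if t and m)
    return overlap / active

talker = [1, 1, 0, 0, 1, 1, 1, 0]
masker = [0, 1, 1, 0, 0, 1, 0, 0]
print(normalised_overlap(talker, masker))   # 2 overlapping of 5 active = 0.4
```

Normalising by the talker's own activity is the point: a talker who simply speaks more would otherwise look like one who overlaps more.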

So what are the speakers doing? To try to work out what speakers are doing when noise is present — or indeed when it isn't — we use a technique developed for signal and system identification, called reverse correlation. It has been used, for instance, to try to identify nonlinear systems, although strictly speaking it only applies to linear systems — and what we're dealing with here involves the speech perception process and also the speech production process responding to it, so we've got two highly nonlinear systems, and it shouldn't really work.

But regardless, what we do is look at all events of a particular type in the corpus — let's say all the occasions when the person you're talking to stops speaking: offsets. And we ask what was going on in your speech at that point. We encode all those moments as spikes, take a window, look at speech activity, and average over all of those exemplars. That gives us what we call the event-related activity, which is what you're seeing here; the window is plus or minus one second.
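The event-related activity computation just described — spikes at events, a window, an average over exemplars — can be sketched like this (illustrative only; the real analysis uses a plus-or-minus one second window over corpus data, and all names here are my own):

```python
# Toy sketch of event-related activity averaging (my own illustration):
# mark every event of one type (e.g. interlocutor speech offsets), take a
# window around each, and average the talker's speech activity over all
# of those windows.

def event_related_activity(activity, event_times, half_window):
    """Average `activity` in a +/- half_window frame window around events."""
    width = 2 * half_window + 1
    sums, counts = [0.0] * width, [0] * width
    for t in event_times:
        for k in range(-half_window, half_window + 1):
            idx = t + k
            if 0 <= idx < len(activity):
                sums[k + half_window] += activity[idx]
                counts[k + half_window] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# toy track: the talker tends to start speaking just after each event
activity = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
events = [1, 5]            # frame indices of interlocutor/masker offsets
era = event_related_activity(activity, events, half_window=1)
print(era)                  # mean activity at lags -1, 0, +1 around events
```

Averaging across many exemplars is what lets a weak, noisy behavioural tendency show up as a clear curve, just as in the reverse-correlation tradition.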

The simple case first: there's no noise present here; this is just looking at the activity in response to an interlocutor. So we take all the points at which your interlocutor stops talking and ask what you do. Well, not surprisingly, you're more likely to start talking once the person you've been talking to falls quiet, and we see the reverse pattern on the other side.

But the interesting question is what happens with the masker. When the masker goes off, what do you do as a talker? Well, not very much at first — but shortly afterwards you increase your likelihood of speaking. And likewise in the onset case; you can see it a bit more clearly if we just look at the difference between the onset and the offset — it's symmetric, for all intents and purposes.

And so we see what we call this contrast curve. This is really just what I showed before for the interlocutor case — a very nice curve with quite a wide range. In the masker case, because you can't guess when the masker's bursts are going to take place, there's really no difference right at the event; but milliseconds after the masker comes on or goes off, we see a change in the speaker's activity. What this is showing is that talkers are sensitive to the maskers, and do respond in some way.

Now, there are at least seven possible strategies the talkers might be using, and it turns out some of them are not easily separable. But, simply put, it isn't mainly the case that when a masker comes on, talkers tend to stop — that would be this "stop" strategy here. It's more the case that they tend not to start when a masker arrives. The two things, if you think about it, can look the same when you average, which is why we need to distinguish between them. So we see lots of evidence for a talker strategy based on the masker: when the masker goes off, you're more likely to start talking — that makes sense — and if the masker comes on, you're less likely to start talking. There's a little bit of evidence that the masker causes you to stop talking, but it's quite weak.

Now, how does this work in a more natural situation, where there's another conversation present in the background rather than this slightly artificial modulated background noise? These were some experiments we carried out in English and in Spanish. The basic scenario is that we have a pair of talkers having a conversation: they come in for the first five minutes, then they're joined for the next ten minutes by another pair of talkers, and then, for symmetry purposes, the first pair leaves. So we've got two parallel conversations — the second pair isn't allowed to talk to the first pair, and vice versa. We were really interested in seeing, in a very natural situation, how one conversation affects the other conversation. I'll just play you this example.

You'll be helped a little bit by the transcription on the right-hand side if you try to follow it. [audio example plays] This is the natural overlap situation; measured across the entire corpus, the percentage of overlap within turns is in the region of twenty to twenty-five percent.

So I want to look at a couple of things here. One of the things that talkers do in a situation like that is drastically reduce the amount of natural overlap that they allow with their conversational partner. The figure mentioned this morning was about twenty-five percent, and we find the same thing: this here is the condition with no background present, where the audio is unobstructed, and we get the natural state of the two-person dialogue — roughly twenty-five percent of the material is overlapped.

Switch on the background and you see that's reduced — that's one big change. Another change appears when we remove the visual modality: you might have noticed in that picture that they were wearing visors in one of the conditions, and that also causes a bit of a reduction in overlap, so there's some response there. But the interesting question is to what extent the talkers, in a natural situation, are aware of what's going on in the background and adapt accordingly.

So these are event-related activity plots like the ones we saw before. This is with no background present, so we see the turn-taking behaviour; and this is where the visual information is removed, so the talkers can't see the interlocutor's lips. The interesting cases are those where the noise is present: this is showing the activity in response to the noise. It's a much weaker pattern, but we still see the same sensitivity to the noise in this highly naturalistic situation.

So, to summarise all this: the foreground conversation is affected by the background conversation. What's the connection with speech technology? Well, out of this grew a retiming algorithm, GC-retime, which was also submitted to the Hurricane Challenge.

The idea — the approach here — is a general dynamic-time-warping-based approach: we take a speech signal, and here is the masker, and we say, if we're allowed on a frame-by-frame basis to modify the timing of the speech signal to achieve some objective, whatever that is, then we can do so by finding the least-cost path through a cost matrix. We end up with modified speech with temporal changes.

So the important question now is what we put in as the cost function. We tried various things. One of them is based on glimpsing again — that's where the G comes in, in GC-retime — and the other component is the cochlea-scaled entropy, which is a measure of the information content in speech. In simple terms, what we try to do is find the path which maximises the number of glimpses of speech you're going to get, by shifting speech away from epochs where the masker is intense, while at the same time being sensitive to the speech information content, where speech information content is defined by cochlea-scaled entropy.
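As a hedged sketch of the retiming idea — my own drastic simplification, not the published algorithm — the trade-off can be posed as a least-cost monotone path: placing an informative speech frame at an epoch where the masker is intense costs a lot, and dynamic programming finds the placement that minimises total cost. Here `frame_info` stands in for a cochlea-scaled-entropy-like score and `masker_energy` for the masker level; both names are my own:

```python
# Toy sketch of glimpse-aware retiming (my own simplification): place each
# speech frame at some output epoch so that informative frames avoid
# epochs where the masker is intense, via a least-cost monotone path.

def retime(frame_info, masker_energy):
    """frame_info[i]: information content of speech frame i;
    masker_energy[j]: masker level at output epoch j.
    Returns, for each frame, its chosen output epoch."""
    n, m = len(frame_info), len(masker_energy)
    INF = float("inf")
    # best[i][j] = least cost to place frames 0..i with frame i at epoch j
    best = [[INF] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for j in range(m):
        best[0][j] = frame_info[0] * masker_energy[j]
    for i in range(1, n):
        for j in range(i, m):                  # monotone: epochs advance
            prev = min(range(i - 1, j), key=lambda k: best[i - 1][k])
            best[i][j] = best[i - 1][prev] + frame_info[i] * masker_energy[j]
            back[i][j] = prev
    j = min(range(n - 1, m), key=lambda k: best[n - 1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

info   = [1.0, 5.0, 1.0]                  # middle frame carries most info
masker = [0.1, 9.0, 9.0, 0.1, 0.1]        # loud masker burst at epochs 1-2
print(retime(info, masker))               # informative frame dodges the burst
```

In this toy, the high-information middle frame gets pushed past the masker burst while the less informative frames absorb the bad epochs, which is the qualitative behaviour the cost function above is after.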

And it turns out that this is a pretty successful strategy, worth about four decibels of improvement in the Hurricane Challenge.

Now, what is it allowed to change? We allowed a small amount of elongation — for various reasons we were interested in preserving the temporal structure, so we permitted only a little elongation, half a second or so. And of course, not surprisingly, most of the time the retiming algorithm exploits that fact: it stretches speech, or shifts bits of speech around into the gaps, and it also exploits the silences.

So is what the retiming buys us simply due to the elongation? Our previous results would suggest that elongation doesn't help — that's how I began this section: elongation doesn't help. But strangely, in the case of the modulated masker — competing speech, in this case — we found that pure elongation did help: not as much as retiming, but maybe half of the effect could be due to pure elongation.

But if instead we select speech-shaped noise, we find that elongation doesn't help, which is consistent with the existing picture. So what's really going on here? Well, the reason people don't find improvements with durational approaches — with stretching — is that most of the work has been done with stationary maskers. Against a stationary masker, if you simply elongate, you're not injecting any new information, because the masker itself is stationary. With a modulated masker, if you stretch, say, a vowel out, parts of it — fragments that may be needed for identification — are going to escape masking. And that's what we think is responsible here.

The other important thing that came out of this is that the retiming itself appears to be intrinsically harmful. So something which is really beneficial for one masker — where we get these big gains — is actually harmful for the stationary masker. We're distorting the acoustic-phonetic integrity of the speech; it's the same retimed speech in both cases, with the same distortions, but against the modulated masker it's highly intelligible.

So let me move on and give you a picture of where we are: with these speech modifications, what can we achieve? There have been a couple of Hurricane Challenges — ones we ran internally within the Listening Talker project, and then one that was evaluated openly and presented at Interspeech.

The goal was this: participants were provided with speech, had access to the maskers at given SNRs, and simply returned modified speech to us; we then evaluated it with a very large number of listeners. These are some of the entries.

Plain speech: "a large size in stockings is hard to sell". Natural Lombard speech: "a large size in stockings is hard to sell". Some unmodified TTS: "a large size in stockings is hard to sell". This one applies Lombard-like properties to TTS — a TTS adapted to Lombard speech: "a large size in stockings is hard to sell" — so there's the synthetic voice trying to compete with noise as well. There are a number of other techniques; I'll play this one because it was the winning entry: "a large size in stockings is hard to sell". On the challenge website you'll find these and lots more examples.

These are the results of the internal challenge. Of the systems, SSDRC — which came from Yannis Stylianou's lab at the University of Crete — was the winning entry, producing gains of about thirty-six, thirty-seven percentage points in this condition. What does that amount to in dB terms? About five dB.

Those are, I think, genuinely useful gains for speech modification approaches. You can also see here — and I think this is interesting — that natural Lombard speech in this condition, just this case here, actually produced a gain of only about one dB. So we're getting super-Lombard performance out of some of these modification algorithms, even ones that were based to some extent on Lombard speech. TTS is a long way behind, but by applying, for instance, Lombard-like properties to TTS systems, we can improve things by over two dB.

The slightly larger challenge — the Hurricane Challenge last year — shows results along the same lines, and I'll just summarise it with this. What we're looking at here is the equivalent intensity change in dB, in the face of a stationary masker (speech-shaped noise) and a competing-talker masker. The green points correspond to natural speech, and the baseline is where the lines intersect, about there. The TTS entries have a lower baseline; they're in blue, and you can see them over here. This is a fairly low-noise condition; if we go to a high-noise condition — which gives a better idea of what these things are really capable of — then again we see gains of about five dB in stationary noise, and GC-retime getting close to five dB in fluctuating noise.

What I really want to point out — probably, to me, the most interesting outcome of this evaluation — is the fact that some of these TTS systems, adapted on the basis of some intelligibility criterion, are actually doing really well. Compare them against the natural-speech baseline over here: we're getting a couple of TTS systems — I'll play examples in a second — which are actually more intelligible than natural speech in noise, and I think that's a fairly interesting achievement.

These came from two different labs: one from Garcia and colleagues' group, and the other from Daniel Erro and colleagues at the University of the Basque Country — different groups, and I had nothing to do with either.

Okay, so this is an example of what these sound like, for the two TTS systems, and it's pretty evident that the synthetic speech is much more intelligible in those cases.

Just one final thing to say about the Hurricane Challenge — something we did recently. A natural thing to do, of course, is to take the spectral changes and the temporal changes and see whether they complement each other, and the short answer is yes. This is unmodified speech; this is the effect of just applying temporal changes with GC-retime; this is just the effect of SSDRC, the spectral shaping and dynamic range compression algorithm; and if you put the two together you get something which isn't quite additive, but they are certainly complementary — that's another four percentage points, and we're getting on for nine to ten decibels of impact.

So, in the last few minutes, a couple of points. I want to address this question: is modified speech intrinsically more intelligible, or is it just evading the masker — doing the obvious thing? It's a little tricky to answer, simply because when we measure intelligibility we normally measure it in noise, since otherwise performance is at ceiling. But if you've got a system which modifies speech to be more intelligible in noise, then of course it's going to be more intelligible in noise — you're not measuring intrinsic intelligibility, you're measuring the ability to evade the masker. Normally, that is, with native listeners. If you use non-native listeners, intelligibility in quiet is usually some way below the ceiling performance of natives.

So this is what we did: we played non-native listeners Lombard speech. What we found — forget about most of this, this is the key result here — is that Lombard speech is actually less intelligible than plain speech in quiet. The same speech which is more intelligible in noise is less intelligible in quiet for non-native listeners. If Lombard speech were improving acoustic-phonetic clarity, shall we say — generalised clarity changes — then you might expect to see benefits, but we don't. I'll skip over the rest of that.

And something we did recently asks the same question with non-native listeners for SSDRC, which, as I said, was the winning entry in the Hurricane Challenge. Again we see results in quiet with non-native listeners — you can see they're well below ceiling with unmodified speech — and the modifications actually make things worse.

So, just to conclude. What I've tried to show is that by taking some inspiration from listeners and talkers — not slavish inspiration, because I think we're sometimes going beyond what listeners and talkers are capable of — we've been able to motivate algorithms which can turn speech that is nearly unintelligible into speech that is almost entirely intelligible. We've had to develop objective intelligibility models to make this possible, and there's definitely scope for much more work there: the better the intelligibility models we can produce, the bigger the gains we'd expect them to deliver.

I should say that this work is more or less immediately applicable to all forms of speech output, including domestic audio coming from non-speech-technology devices — radios, TVs, et cetera. There's also some work I didn't say much about: recent work with dyslexic kids, basically in literacy training, showing that they benefit from some of these modifications too.

One thing we do need to look at — I alluded to it in the last couple of slides — is this loss of intrinsic intelligibility. I think there's an opportunity here: we've got algorithms which do well in noise but in quiet actually harm things. What if the two things are not incompatible? If we could marry them — make clarity-enhancing changes at the same time as evading the masker — then we could see some real gains. Okay, thank you very much.

Thank you, Martin, for this very interesting talk. Do you have any comments on the use of this work for ASR — to improve speech recognition?

That's an interesting question. Were you thinking maybe we could train talkers to interact more clearly with our ASR devices? That's not going to happen, is it?

I think — yes, of course. One of my original aims in the Listening Talker project was to get as far as looking at dialogue systems, where ASR is a key component, and to look at ways of improving the interaction by essentially making the output part of it much more context-aware. In that sense, if you could make the interaction smoother — which might also mean allowing overlaps and natural Lombard-style compensation — then I guess the input side might end up being smoother too. But we didn't end up doing that work.

There are some results in ASR showing that it's preferable to adapt to the environment rather than trying to make the speech clean — to remove the obstruction from the speech — and they demonstrate that it's better to adapt to the data.

Well, the other application — and this is the way I think about it: given my background in computational auditory scene analysis, we often set out to solve the problem of taking two independent sources and trying to separate them. But we should acknowledge the fact that the two are not independent — except in speech separation competitions. We're always aware of what's going on in the background, and since we modify our speech accordingly, that really ought to be factored into these algorithms; it ought to make separation simpler, actually, for those elements.

Thank you, Martin, that was very interesting. I've probably got about twenty questions, but I'll narrow it down to two. In your work, were there any constraints regarding the quality or naturalness of the enhancement?

Good question — I'm not sure how completely I can answer it.

Okay, so: one of our original goals — back when we thought we would just knock off the intelligibility work in the first year or something — was to look at speech quality, and we did a little work looking at objective measures of speech quality. Some of the modifications I didn't talk about can be highly distorting. I remember one modification we produced where we were essentially taking the general approach of: suppose we equalise the SNR in every time frame — a process a little like dynamic range compression, perhaps, but more extreme — or we go further and equalise the SNR in each time-frequency pixel. You can imagine the effect of doing that: it's highly distorting, and sometimes highly beneficial, but sometimes very harmful — it's a very binary kind of thing. So yes, we did look at such measures, and some of the other partners, I think, did more work on speech quality.
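The per-cell SNR equalisation just mentioned can be sketched as follows — my own reconstruction of the idea, purely to show why it is so distorting: after the global renormalisation, every cell ends up with the same local SNR, so the speech spectrum is forced to mirror the noise spectrum:

```python
# Toy sketch of per-time-frequency-cell SNR equalisation (my own
# reconstruction of the idea, not the project's modification): scale each
# speech cell to a single target SNR relative to the local noise energy,
# then rescale globally so total speech energy is unchanged.

def equalise_snr(speech_tf, noise_tf, target_snr=1.0):
    """speech_tf, noise_tf: 2-D grids of energies per time-frequency cell."""
    raw = [[target_snr * n for n in row] for row in noise_tf]
    total_in = sum(sum(row) for row in speech_tf)
    total_raw = sum(sum(row) for row in raw)
    scale = total_in / total_raw
    return [[cell * scale for cell in row] for row in raw]

speech = [[4.0, 1.0], [1.0, 4.0]]
noise  = [[1.0, 1.0], [8.0, 8.0]]
out = equalise_snr(speech, noise)
print(out)                                   # energy piles into noisy cells
print(sum(sum(r) for r in out))              # total speech energy preserved
```

Note the output no longer depends on the speech energies at all, only on the noise — a vivid way to see why this manipulation can be "very binary": highly audible in good ears, badly distorted otherwise.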

In that vein, we looked for correlations — I've just been outlining some results with non-native listeners: we looked at their responses as a function of the speech-quality differences where we might have expected an effect. On the intelligibility side they were pretty much identical to native listeners in the respects we've examined, even though the modifications differ quite a lot in their distortion. You might expect the rich native L1 knowledge to somehow enable native listeners to handle these distortions more easily, but that hasn't been the case.

We don't have a complete answer there. One further consideration related to this: we're using these constant-RMS energy constraints, when we should really be looking at loudness, which is more difficult to optimise — you can get greater loudness without more energy, for a start.

Then my second question: you discussed the effect of the listening ear being matched, native or not. You mentioned working with English and Spanish, but I'm wondering: have you studied a variety of source languages, and found any of them more amenable to this process — maybe we should all switch to a different language! Thank you.

That's also an interesting question. I haven't done any work on that with speech output, but we had a project a few years ago looking at eight European languages from the point of view of noise resistance, and there are clearly differences; a lot of it has to do with resistance to energetic masking. It's just never taken into account: in multi-language studies we often normalise by using the same SNR across languages, and we really shouldn't be doing that, because one language might be able to tolerate maybe up to four dB more noise than another, and SNRs aren't equivalent in that respect.

i'm adding

has you know you are in the speaker recognition community i'm quite sure you where

expecting my question

you have any edge you're of the

but i should effect of from bar number effect so

allpass

kind of things you are

you just present that the

one you meant abilities in speaker recognition

that i sparse well unless johnston some work in this which i guess probably have

i'm not sure that the extent okay so i mean obviously feel of like to

speaking on the stress and then very high very high noise conditions

if you want in speaker identification based on

this your question right

My question is also linked to the forensic problem: if someone is recorded in the presence of noise, and is therefore using a Lombard voice, will the same speaker information be there as when we record them in quiet? It's the within-speaker variance question.

There's a very interesting project using similar techniques — again at the University of the Basque Country — which is essentially trying to map from Lombard speech back to normal speech: if you know that somebody was talking in a certain degree of noise, you can attempt to transform the Lombard speech into normal speech. You won't recover it completely, but it's somewhat predictable. And precisely for that reason, I think you always need to be careful experimentally, because whether a talker knew they were communicating with somebody makes a huge difference.

I want to come back to the example of the two couples, one speaking English and the other Spanish — I guess the one couple did not understand the other.

We did include that in one of the situations; we also had the same experiments done with four English speakers, or four Spanish speakers.

Then the question is: what would happen if there were two couples speaking the same language but on different topics, so that the disturbance is not only noise but also an understandable conversation?

In work a couple of years ago we discovered this effect of the informational masking caused by an interfering conversation sharing the same language.

And it is the case — in fact it's a common experience for many people, particularly in a bilingual or trilingual country — that if somebody is talking in a language you're well aware of, even if it's not your native language, it's a much bigger interfering factor. So that's one thing that definitely happens: it's worth about one to four dB, depending on the language pairs you look at. Was that the question? Okay — so it's partly a matter of informational masking, and again, that's another big area we've looked at from the perceptual point of view, but not from the algorithmic point of view — how to deal with it, or do without it.

Thank you for the talk. Regarding what was said earlier: I've personally tried some speech enhancement, with some colleagues, for ASR, and even with enhancement applied to the training data, if you do speech enhancement on the test data our systems seem to do worse compared to giving them all the noise. Do you have...

Okay — so you think that speech enhancement in this case doesn't work, and do I have an explanation for that?

It seems that if we leave the noise in, rather than trying to remove it, the systems do better.

And this is a general finding — very surprising in a way: speech enhancement does not deliver robust performance in applications. In a related way, very few speech enhancement techniques work for intelligibility purposes for human listeners either — this one works, that one doesn't — and it's related to your question: quality and intelligibility are not even close to linearly related. About the only tool that really does work is the dynamic range compression type of approach — extreme dynamic range compression.

This question is based on the example you showed about the announcement in the train station. Is there any way of increasing the intelligibility of things like the name of the train station, for people who are not native speakers?

There are a couple of things you can do — a low-level thing, a high-level thing, and things in between. We haven't done them, but others have. One thing, of course, is that you can transfer all your excess energy to those important items — that's the low-level approach, and one we haven't pursued. At the high level, you can attempt to modify the wording: with synthetic speech you can attempt to produce hyper-speech, and people have been very successful in doing this — algorithmically, fully automatically producing speech which is more likely to meet its target — and that can really help in those cases. And then there are more prosaic things, like simple repetition, or simplification of the syntax. So when it comes to proper names like that, sure, there are some very specific things you can do. We didn't look at it that way, but introducing redundancy, I think — that remains to be done.