0:00:14 ... most of it. I wondered where the name came from, and so on.
0:00:17 I mean, I started discussing travel arrangements and then it became clear.
0:00:23 So a question I asked myself when I was invited to
0:00:28 come to this meeting was:
0:00:30 what can I say that is possibly of interest to people interested in speaker
0:00:35 identification, language identification, this kind of topic?
0:00:38 You won't find,
0:00:39 you know, i-vectors, for instance, in this talk, unless an i-vector describes the
0:00:44 process of virus transmission on an Apple device.
0:00:49 But as this talk
0:00:51 goes on, if you are looking out for some of your own topics,
0:00:54 there are a number of points of connection that I noted down, so I
0:00:57 think:
0:00:58 I'm going to be talking about the way speakers change as a result of context,
0:01:02 and how we develop algorithms
0:01:05 that can modify speech to become more intelligible. Of course this has some relationship
0:01:08 with spoofing, for instance, so the elements I'll be talking about could potentially be
0:01:13 used to disguise somebody's identity.
0:01:17 Also, the effect of noise on speaking style is obviously very relevant to people interested in speaker
0:01:21 identification.
0:01:22 There was a talk this morning about diarization and overlapping speech; I'm going to be giving you
0:01:26 some data on overlapping speech in realistic conditions, when there's noise as well,
0:01:31 when noise is present.
0:01:33 And I believe also that durational variation is a problem,
0:01:37 and we can glean some behavioural information on that. So there are some points
0:01:41 of contact between what I want to say
0:01:43 and the kind of work that
0:01:45 people are doing in this field.
0:01:47 But to keep things simple, this is what I'm going to talk about:
0:01:50 I'm going to be talking about replacing the easy approach to intelligibility, which is increasing the
0:01:54 volume,
0:01:55 with this hypothetical but potentially very valuable
0:02:00 device for increasing intelligibility.
0:02:04 So I'm going to start by talking about why we should robustify speech output; it's
0:02:08 an interesting problem, and what kind of applications does it have?
0:02:13 Then I'll give a few general observations about what talkers
0:02:15 do in adverse conditions,
0:02:17 and then, reflecting my
0:02:18 research interests, I'll talk about the spectral and temporal domains, with some little tidbits about behavioural observations,
0:02:24 but mainly focusing on some of the algorithms that
0:02:28 people in the Listening Talker project have developed over the last couple of years,
0:02:33 culminating with the Hurricane Challenge,
0:02:36 which is a global evaluation of speech modification techniques that took place last year
0:02:40 at Interspeech.
0:02:42 And at the same time I'll also say a few words about what these modifications do to
0:02:47 speech: whether they actually make it intrinsically more intelligible, or whether they just
0:02:52 overcome the problem of noise.
0:02:54 So why robustify speech at all?
0:02:57 Well, I'm sure you're aware that speech output is extremely common, both natural (recorded) and
0:03:04 synthetic.
0:03:05 If you think about your journey
0:03:08 to get to this place,
0:03:11 you would presumably have gone through various transport interchanges, and perhaps buses,
0:03:17 aeroplanes themselves,
0:03:19 lots of difficult environments, reverberation, noise, et cetera,
0:03:23 hearing all sorts of nearly unintelligible messages
0:03:27 coming out of the public address loudspeakers. There are millions of these things in continuous
0:03:31 operation,
0:03:32 and it's surely an interesting problem to attempt to make those messages as intelligible
0:03:36 as possible.
0:03:38 The same goes for mobile devices, or say in-car navigation systems, where noise is
0:03:43 simply a fact of life in the contexts in which they are used.
0:03:48 And of course in speech technology, particularly synthetic speech, realistically the messages are sent out
0:03:53 regardless of the environment,
0:03:57 regardless of context: who cares whether somebody else is talking, let's say, in a
0:04:01 voice-driven GPS type of system,
0:04:04 or whether there is noise present.
0:04:08 Here are a few examples, just real quick,
0:04:10 that I recorded
0:04:12 with a simple handheld device,
0:04:14 in this case of recorded speech in noisy environments;
0:04:17 and because there's a hum on this, I'm going to plug it in
0:04:20 just for the duration of this.
0:04:32 Note, you could convolve that with the delivery system as well and you still couldn't
0:04:36 understand much of it at all,
0:04:38 but believe me, it's not intelligible to start with.
0:04:41 Here is another one. So that was recorded speech; this is live speech,
0:04:44 and in fact live accented speech. We know that a foreign accent can be
0:04:49 equivalent to, say, five dB of extra noise in some cases.
0:04:52 [plays example]
0:04:58 [accented announcement example] ... or words to that effect; I'm not sure what it was.
0:05:02 And this is another of my favourite examples, because this is really a user
0:05:05 interface,
0:05:07 or rather an interface design problem, for the people who designed it: this is the train
0:05:10 I needed to Edinburgh.
0:05:13 [plays example]
0:05:17 The noise saying the train is about to depart collided with the announcement, so there are
0:05:21 simple fixes for those cases
0:05:24 in particular.
0:05:26 Anyway, why is it worth doing this?
0:05:28 Well, I think it's always worth bearing in mind that for natural sentences (we
0:05:32 have lots of different datasets that
0:05:33 basically show the same point)
0:05:36 for every dB you can
0:05:39 gain,
0:05:40 in effective terms (I'll say more about that later on),
0:05:42 it's worth between five and eight, say five, percentage points, depending on the speech material;
0:05:47 that is for sentences which are pretty close to normal,
0:05:50 normal speech.
0:05:51 And so
0:05:52 every dB we gain is worth having, essentially.
0:05:58 For some materials it's perhaps a little less.
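As a rough worked example of that rule of thumb: the five-to-eight-points-per-dB figure is from the talk, but the little helper below is purely illustrative and not part of any released tool.

```python
def expected_gain_pp(equivalent_db_gain, points_per_db=(5.0, 8.0)):
    """Rule of thumb from the talk: each equivalent dB of gain is worth roughly
    5-8 percentage points of intelligibility for near-normal sentence material
    (less for some other materials).  Returns a (low, high) range."""
    lo, hi = points_per_db
    return equivalent_db_gain * lo, equivalent_db_gain * hi

# e.g. a modification worth 2 dB would be expected to buy ~10-16 percentage points
print(expected_gain_pp(2.0))
```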
0:06:00 Now, does every dB
0:06:01 attenuated potentially save lives? That might sound like
0:06:05 a bit of a bold statement, but let me qualify it.
0:06:08 This is a report from the World Health Organization which just covered
0:06:12 environmental noise in Europe;
0:06:15 and by environmental noise I mean excluding workplace noise, so this is not people
0:06:20 working in factories with, you know,
0:06:23 pistons and hammers going all the time. No, this is to do with
0:06:28 just the noise pollution that exists in everyday environments: if you live near a railway station you
0:06:32 get announcements all day long, and if you live near an airport, also on
0:06:36 the list,
0:06:37 aeroplanes. It estimates years lost to stress-related diseases, cardiovascular problems in particular.
0:06:43 Now these don't necessarily lead to fatalities, so
0:06:45 the qualification is that this uses
0:06:48 a methodology for measuring how much healthy life is lost.
0:06:51 So if you suffer, for instance, severe tinnitus as a result of environmental noise,
0:06:54 that might
0:06:55 produce a coefficient of 0.1, for instance, which means that for every ten years
0:07:00 you are affected you lose one year of healthy life. That's a very large figure.
0:07:03 Anything we can do to attenuate environmental noise has to be beneficial.
0:07:11 Just to contrast what I'm talking about with existing areas within the field:
0:07:17 the difference between speech modification, let's say, and speech enhancement
0:07:21 is that we are dealing with speech
0:07:24 where the intended message, the signal itself, is known. So it's in a sense
0:07:27 a simpler problem: we do not have the problem of taking a noisy speech
0:07:31 signal and feeding recognizers, or enhancing it for broadcast, for instance.
0:07:37 It's sometimes called near-end speech enhancement.
0:07:41 And it's also not like additive noise suppression: we're not attempting to control the
0:07:44 noise. The sort of situations I'll be talking about here will be
0:07:49 ones where it's not really practical to control the noise, because you've got, say,
0:07:52 a public address system and hundreds of people listening to it; they can't all be wearing
0:07:56 headphones or whatever.
0:07:59 So what can we do? What we're left with, within the system,
0:08:02 is the ability to modify the speech itself, subject to a few constraints; these
0:08:07 are just practical constraints.
0:08:09 In the short term we're probably interested in changing the distribution of energy in
0:08:12 time,
0:08:14 and duration even;
0:08:15 I'll show you some ways we do that later on.
0:08:19 But overall, in the long term, we don't want to extend the
0:08:22 duration
0:08:23 and fall further behind as announcements queue up, for instance;
0:08:27 and we don't want,
0:08:29 or are not able, to increase the intensity of the signal. So
0:08:33 normally in this work, and this is going to be the case throughout
0:08:35 pretty much everything I talk about,
0:08:38 there's a constant input-output energy constraint, just for fairness.
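A minimal sketch of that constraint, assuming any waveform-level modification you like: whatever is done to the signal, the output is rescaled so its RMS energy matches the input's. The helper name is my own, not from the project.

```python
import numpy as np

def modify_with_equal_energy(x, modify):
    """Apply an arbitrary modification `modify` (a function on the waveform),
    then rescale so the output has the same RMS energy as the input.
    This is the constant input/output energy constraint described in the talk."""
    y = modify(x)
    rms_in = np.sqrt(np.mean(x ** 2))
    rms_out = np.sqrt(np.mean(y ** 2)) + 1e-12   # avoid divide-by-zero
    return y * (rms_in / rms_out)

# usage idea: boost parts of the spectrum, but give the extra gain back overall
# y = modify_with_equal_energy(x, lambda s: some_spectral_emphasis(s))  # hypothetical helper
```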
0:08:44 Here are just a few examples of what we can achieve, and I'll come back to
0:08:47 saying how these are done
0:08:48 a little later. So first, you probably couldn't hear the original example;
0:08:53 let me get my mouse on
0:08:58 it.
0:08:59 Okay, so what you listened to is some speech in noise;
0:09:03 trust me, because you may not be able to hear the speech,
0:09:09 you can just about tell there's a speaker in there.
0:09:12 Well, we can modify that speech
0:09:15 without changing the overall RMS energy; the noise is constant, so the level of the noise is
0:09:20 the same in these two examples, and the modified version sounds something like this.
0:09:28 If you listened to that in an experimental setup with headphones, you would be
0:09:32 pretty much guaranteed to get probably seventy percent of the words, even though the sentence was, that's right,
0:09:37 about the hardest possible type of sentence:
0:09:41 a semantically unpredictable sentence.
0:09:44 So that's the broad motivation.
0:09:47 Now let me say a little bit about what talkers do.
0:09:51 This has been a longstanding research area;
0:09:55 I think it goes back at least, what, about
0:09:58 a hundred years.
0:09:59 A lot of the work concerns clear speech: if you
0:10:02 give somebody instructions
0:10:04 to speak clearly,
0:10:05 then they will; and
0:10:07 you don't even need to give instructions: in this situation now
0:10:11 I'm attempting to speak more clearly than I was over lunch, for instance,
0:10:15 and speech changes in these situations in ways which
0:10:20 it's possible to characterize, and possibly copy;
0:10:23 and this talk is partly about
0:10:25 mapping clear-speech properties onto, you
0:10:28 know, natural speech, or maybe onto synthetic speech,
0:10:31 in adverse conditions.
0:10:33 Speech also changes as a function of the interlocutor:
0:10:36 it changes when you talk to children, infant-directed speech for instance;
0:10:41 there is foreigner-directed speech, as it's also called,
0:10:43 for non-natives; and also for pets,
0:10:45 and also computers.
0:10:47 Talkers change for computers too, as we all know,
0:10:50 and that's been observed in work on speech recognition. Speech changes.
0:10:55 I'm mainly focusing on adverse conditions,
0:10:58 and let's just say that
0:10:59 work has been done on
0:11:00 Lombard speech for
0:11:02 a long time;
0:11:04 others have also worked in this area
0:11:06 as well.
0:11:07 Now,
0:11:08 I should say, we're interested in Lombard speech
0:11:12 not necessarily because we expect speakers,
0:11:16 or a device, to be in an environment, or listeners
0:11:19 to be in environments,
0:11:21 with noise levels which would normally induce Lombard speech, which are usually quite high,
0:11:25 but simply because Lombard speech is more intelligible, and we want to know why,
0:11:28 first of all,
0:11:30 the science; and then we want to embed that knowledge into an algorithm so that it at
0:11:36 least reproduces the intelligibility benefit of Lombard speech, or goes beyond it.
0:11:40 Actually I'll show some results later that show that we
0:11:42 can indeed go beyond it.
0:11:44 Now, I guess many of you are familiar with Lombard speech, but
0:11:47 if you're not, this is what it sounds like.
0:11:50 In the first example the talker is speaking normally.
0:11:55 This is the same talker, the same sentence, but in this case with, I think, ninety-
0:11:59 six dB SPL of noise on headphones;
0:12:03 you don't hear it because the noise doesn't come out in the signal.
0:12:06 Some of the properties of Lombard speech are fairly evident here, I think. So
0:12:10 you can see the duration change: this is the normal speech, this is the Lombard
0:12:12 speech,
0:12:14 so you can see that it's quite normal for the duration to be
0:12:17 slightly longer, and this stretching is actually non-linear;
0:12:20 it's non-linear stretching: voiced elements tend to be extended more than voiceless elements.
0:12:26 Also, if you look here at the F0, you can see the harmonics more easily in the Lombard
0:12:31 speech;
0:12:31 F0 is typically higher too.
0:12:35 There are other characteristics which are not visible in this particular plot, but you'll
0:12:39 see them in a second,
0:12:40 and they might be important.
0:12:43 Now, the real reason we're interested in Lombard speech: there's lots of
0:12:46 data
0:12:47 like this; this is just some we had
0:12:49 from a similar study we ran,
0:12:52 and this is showing the
0:12:54 percentage increase in intelligibility over a baseline,
0:12:58 a normal-speech baseline; these are just four different
0:13:00 Lombard conditions,
0:13:01 and you can see we can get some pretty
0:13:04 serious intelligibility improvements.
0:13:06 The Lombard speech is then presented to listeners in the same amount of
0:13:10 noise,
0:13:11 and we see improvements of up to twenty-five
0:13:13 decibels...
0:13:14 sorry, twenty-five
0:13:15 percent.
0:13:17 The question is why:
0:13:19 why is Lombard speech more intelligible?
0:13:21 And there seem to be
0:13:23 a number of possibilities, possibly acting in conjunction.
0:13:28 So, one option. What you see here, in each panel,
0:13:31 are auditory spectrograms, cochleagrams as they're sometimes called, with a logarithmic frequency scale,
0:13:36 nearly.
0:13:38 This is speech which has not
0:13:40 been produced in noise, this is normal speech,
0:13:42 and these are just
0:13:43 different degrees of Lombard speech; you can see the duration differences again there.
0:13:48 What is
0:13:49 drawn on this side, the
0:13:51 darker areas, are
0:13:52 the regions of the speech that, if you were to mix each of these in the same amount
0:13:57 of noise,
0:13:58 are the parts of the speech that actually come through the noise mixture:
0:14:01 these things are the so-called glimpses.
0:14:03 (It's a model I'll be defining a bit more carefully later on.)
0:14:07 What you see is that there are few glimpses in the normal speech,
0:14:12 few glimpses of the normal speech at all,
0:14:15 and more in the
0:14:16 Lombard speech, in particular in the high-frequency regions.
0:14:19 And so,
0:14:20 one of the other key properties of Lombard speech is that the spectral tilt
0:14:23 is changed, actually reduced. So if this is low frequency, what you're looking at, and this
0:14:28 is high frequency, Lombard speech is more like that,
0:14:32 which means essentially spending more energy in the mid
0:14:35 to high frequencies,
0:14:38 which in auditory terms means mid-way along the cochlea: from about one kHz upwards we see
0:14:43 more energy.
0:14:45 So there are potentially spectral cues.
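To make the spectral-tilt idea concrete, here is a small illustrative sketch of my own (not code from the project): a simple first-order pre-emphasis flattens the tilt by boosting higher frequencies relative to the lows, and the RMS is then restored so overall energy is unchanged. The emphasis coefficient is an arbitrary placeholder.

```python
import numpy as np
from scipy.signal import lfilter

def flatten_tilt(x, emphasis=0.95):
    """Crude Lombard-style tilt reduction: first-order pre-emphasis boosts the
    upper part of the spectrum relative to the lows, then the RMS is restored
    so the overall energy is unchanged (the constraint used throughout)."""
    y = lfilter([1.0, -emphasis], [1.0], x)      # high-frequency emphasis
    y *= np.sqrt(np.mean(x ** 2) / (np.mean(y ** 2) + 1e-12))
    return y
```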
0:14:48 There are also potentially temporal cues:
0:14:50 simply a slowing down of the speech rate, if you like; let's say it's not,
0:14:55 it's a non-linear
0:14:56 expansion. Maybe that's beneficial; that's a contentious issue which I'll address.
0:15:01 And maybe there are raw acoustic-phonetic changes too:
0:15:03 maybe,
0:15:05 when talkers are presented with a high level of noise, they attempt to hit
0:15:09 vowel targets more precisely
0:15:11 and expand the vowel space, as is the case for that other form of modified
0:15:16 speech, clear speech.
0:15:18 Now, whether Lombard speech is intrinsically more intelligible is a question I'll address later
0:15:23 too.
0:15:25 Now, at the start of the Listening Talker project we all got together,
0:15:28 kind of optimistically, at the start of the project, and had a bit of a brainstorming session,
0:15:32 just to list the things we might do to speech
0:15:36 to make it more intelligible, to make it more robust.
0:15:40 You see, first,
0:15:41 the suggestion to increase intensity, which was ruled out;
0:15:44 and then,
0:15:45 since some of us were aware of Lombard
0:15:48 speech at this point,
0:15:49 changing spectral tilt as a possibility;
0:15:52 the other thing I just mentioned, the acoustic-phonetic changes,
0:15:56 expanding the vowel space
0:15:57 (let's see, yes, "expand vowel space" is on the slide);
0:16:01 and so we continued to
0:16:02 think about this: maybe narrowing the formant bandwidths;
0:16:05 putting more energy on, or rather not wasting energy on,
0:16:08 less useful parts, like, you know, the valleys between the peaks
0:16:12 in the spectrum; in general re-allocating energy, with sparsifying energy
0:16:16 as another generalisation.
0:16:18 Some of these I mention because you're going to see some examples
0:16:21 of them.
0:16:22 Dynamic range compression has been around for a long time in the audio and
0:16:26 broadcasting world,
0:16:27 and it also works here. And then there are a few higher-level things: trying to
0:16:30 match the interlocutor's intensity, or to contrast with it,
0:16:33 and maybe to help avoid overlaps, which were talked about this morning,
0:16:37 and so on.
0:16:38 Okay, changing F0:
0:16:40 we thought about that more,
0:16:42 along with some other things; then some more practical things, vowels and consonants, simplifying
0:16:47 syntax and so on,
0:16:49 and further, maybe producing speech which places a low cognitive load on the
0:16:55 listener.
0:16:56 As you can see, there's an awful lot of things that could be looked at,
0:16:59 and in all of these there has been work, so it's a great area for
0:17:01 people who are interested
0:17:03 to start to look at.
0:17:04 What I've tried to do here is to group them into a bit more sensible
0:17:09 structure,
0:17:10 by looking at the goal of
0:17:12 a speech modification: what the possible goals of modifying speech could be.
0:17:16 All of it is context dependent, but if we just focus on speech in
0:17:19 noise,
0:17:21 one of the clear goals is to reduce energetic masking,
0:17:24 as it's called.
0:17:26 Now, I don't know if you know the difference,
0:17:29 the difference between energetic masking and informational masking.
0:17:32 Energetic masking describes the process which essentially
0:17:37 looks at what happens when a masker and a target, let's say speech, interact at the level
0:17:42 of the auditory periphery:
0:17:44 some information is lost
0:17:46 due to compression in the auditory system.
0:17:50 But then masking
0:17:51 can come back again later. If there is some information getting through from another talker, say,
0:17:55 two talkers talking at once, you've got two messages, or fragments
0:17:59 of two messages,
0:18:01 and if the speakers are very similar, if they have the same gender,
0:18:04 then it can be very confusing to work out which bits belong to each
0:18:08 talker. That's an example of informational masking.
0:18:11 So, the things we can do to reduce energetic masking
0:18:14 are things like sparsification of the spectrum,
0:18:17 or changing spectral tilt. To reduce informational masking we might, if we've got control
0:18:22 over
0:18:22 the entire message generation process, do something like change the gender of the
0:18:26 talker.
0:18:28 Okay, so, well,
0:18:29 not necessarily with TTS, but we have voice conversion systems that can do this;
0:18:34 and we can add visual cues, which assist with reducing the effect of interfering
0:18:38 talkers.
0:18:39 And then we can do other things. This comes from my longstanding interest in
0:18:43 auditory scene analysis:
0:18:45 by taking the problem and inverting it, we can try to prevent grouping. We can
0:18:49 send a message into an environment where there are other sources,
0:18:52 but do things to it to prevent it grouping with them.
0:18:55 Let's see, it's something like an idea which comes out of the
0:18:59 awful lot of work in scene analysis
0:19:01 about the cues at play: in a quartet,
0:19:05 I believe,
0:19:07 the individual instruments, when they come in,
0:19:11 use slight timing differences
0:19:14 at their onsets, which can be enough to keep them perceptually separate.
0:19:17 That's an example of what I'm talking about: using scene analysis
0:19:20 to prevent
0:19:21 the message clashing with the background.
0:19:24 And then there are things we can do to reduce the cognitive load of the message, by using possibly
0:19:28 simpler syntax
0:19:29 or decreasing speech rate; or we can equip the speech with
0:19:34 more redundancy, for instance by, at a higher level, repeating the message words.
0:19:40 So there are lots of things that you might
0:19:42 figure out.
0:19:44 What I want to do now is to move in the direction of some of
0:19:47 the experiments we've been doing over the last few years,
0:19:51 and this is the kind of typical approach we take.
0:19:54 We want to, let's say, dissect Lombard speech in one
0:19:58 form or another.
0:19:59 What I mean is, we can take normal speech (let me play this again),
0:20:03 [plays normal sentence]
0:20:05 and take the equivalent Lombard sentence, [plays]
0:20:10 and say, well, how much of the intelligibility advantage of that Lombard version comes from,
0:20:14 say, timing differences? So we can time-align
0:20:18 the two sentences, and then, for instance, we can ask a question:
0:20:23 [plays]
0:20:23 only the F0 has been shifted in that one;
0:20:26 or
0:20:27 remove the spectral tilt, and that sounds like this, [plays] not like a target Lombard
0:20:32 utterance, okay. So the residual, the difference between the two,
0:20:37 is things like spectral tilt. From an experimental point of view we can then
0:20:41 identify
0:20:42 the contribution of factors such as F0, spectral tilt and duration
0:20:47 to the intelligibility advantage.
0:20:51 So now I want to look at the spectral domain, and I'll start off with
0:20:56 one of the early experiments we did, looking at
0:20:59 exactly those parameters: spectral tilt and fundamental frequency.
0:21:04 Because Lombard speech, and clear speech, and other forms of modified speech
0:21:07 do modify F0, you might be led to believe that F0
0:21:11 is important, or at least that it's an important change,
0:21:14 but it turns out that it isn't.
0:21:15 So what you're looking at here is the
0:21:18 increase in
0:21:19 intelligibility over a baseline from manipulating
0:21:22 F0 to bring it in line with Lombard speech,
0:21:25 and none of these changes is significant; these three different bars just represent different
0:21:29 Lombard conditions.
0:21:31 On the other hand, if we change the spectral tilt,
0:21:33 and this is just a constant change, it's not time dependent,
0:21:36 if we simply change the spectral tilt,
0:21:38 we get about two thirds of the benefit coming through; this is the real Lombard speech
0:21:41 up here.
0:21:42 So there's still a gap, but a lot of it is due to spectral tilt.
0:21:45 It turns out this can be predicted very well just by considering energetic masking, a glimpsing model,
0:21:51 because the spectral tilt change is lifting some of the speech out above the masker.
0:22:06but we can say that look quite a bit more generally and ask the question
0:22:10if all you're allowed to do is to cut with a stationary spectral weighting
0:22:15so essentially designing send a simple filter
0:22:18to apply to speech that was the best you can do
0:22:22in the spectral domain weeks
0:22:23this the general approach
0:22:25offline
0:22:27this can still be this can be must get dependent so it's context dependent it
0:22:31is masked is
0:22:32we can come up with a different special weighting
0:22:37and we do that offline
0:22:38and then online its nest every that's recognise what kind of background we have and
0:22:42then apply the weighting necessarily
0:22:44necessary for that particular
0:22:46a type mask
0:22:48 What we realised early on in this project
0:22:51 was the really important role that objective intelligibility metrics have
0:22:55 in this whole process, simply because
0:22:57 we want to use them as part of a closed-loop design process, an optimisation process:
0:23:02 we can't bring back a panel of listeners every
0:23:05 ten milliseconds
0:23:07 to answer the question, how intelligible is this modification that our algorithm has just
0:23:11 come up with?
0:23:12 Of course we still test with listeners at the end,
0:23:14 after the design phase.
0:23:15 So it is critically important to have a good intelligibility predictor.
0:23:23the glimpse proportion measure
0:23:26and that just described what the says very simple thing
0:23:29so we take
0:23:31see what representations separated or two representations of the speech of the noise of these
0:23:35just
0:23:36just imagine some kinda cochlea gram representation
0:23:40gammatone filter-bank we take the envelope you've is willing to it
0:23:44the hilbert envelope
0:23:45downsample
0:23:47essentially that's it
0:23:48and we have the question on how often is the speech above the noise plus
0:23:53some threshold which relies of a real but we need
0:23:56and just measure the number of points well that's the case
0:23:59by simple very rapidly computed
0:24:02intelligibility model
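A minimal sketch of that measure as I understand it from the description: the gammatone front end is not shown, the inputs are assumed to already be time-frequency envelopes, and the 3 dB local-SNR threshold is an assumption.

```python
import numpy as np

def glimpse_proportion(speech_env, noise_env, local_snr_threshold_db=3.0):
    """Fraction of spectro-temporal points where the speech envelope exceeds
    the noise envelope by `local_snr_threshold_db`.  Inputs are meant to be
    time-frequency envelopes (e.g. downsampled Hilbert envelopes of a
    gammatone filterbank), both of shape (bands, frames)."""
    speech_db = 20.0 * np.log10(speech_env + 1e-12)
    noise_db = 20.0 * np.log10(noise_env + 1e-12)
    glimpsed = speech_db > noise_db + local_snr_threshold_db
    return glimpsed.mean()
```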
0:24:04 If we do that,
0:24:06 we come up with these kinds of weightings. So again,
0:24:09 it depends what kind of optimisation procedure you want to use; this is just to give
0:24:12 you an idea.
0:24:13 These are very high-dimensional auditory spectra,
0:24:16 say sixty-dimensional.
0:24:18 Also,
0:24:19 one thing here: if you can read these icons, this is speech noise, competing talker,
0:24:25 et cetera,
0:24:27 speech-modulated noise, white noise,
0:24:29 a set of different maskers; and we've also got different SNRs: plus ten, five, zero,
0:24:34 minus five, minus ten.
0:24:36 And there are some interesting things going on here. These are the optimal spectral weightings that
0:24:40 come out.
0:24:42 You're probably used to thinking of these with a much lower-dimensional representation, so octave weightings:
0:24:48 you've got, you know, six to eight octave-band weightings, or even third-octave
0:24:51 weightings, maybe twenty-odd third-octave bands; here we've got a much higher-dimensional representation.
0:24:56 The somewhat unexpected, at least to us somewhat unexpected, result,
0:24:59 or one of them, is that as the SNR decreases we see
0:25:03 that this optimal weighting is getting more extreme,
0:25:06 more binary.
0:25:08 We call it sparse boosting, because what it's essentially doing is
0:25:11 shifting the energy
0:25:14 into specific frequency regions,
0:25:16 boosting them to the limit and then attenuating the neighbouring regions.
0:25:21 This was relatively unexpected.
0:25:24 The question is, what does all this amount to for listeners?
0:25:29 Let me play you an example of
0:25:31 what these things sound like. This is just the unmodified speech:
0:25:35 "a large size in stockings is hard to sell",
0:25:38 from the Harvard corpus.
0:25:39 This is the modified version:
0:25:41 "a large size in stockings is hard to sell",
0:25:44 the modified one.
0:25:46 And they are, you know, of course equally intelligible, I hope, in quiet. But in
0:25:51 noise:
0:26:00 you knew the sentence, but I think it should be reasonably evident that
0:26:04 the modified speech is more intelligible. And so, as part of the Hurricane
0:26:08 Challenge, we
0:26:09 entered this particular algorithm and got improvements of up to, say, fifteen
0:26:16 percentage points.
0:26:17 This is just two different conditions at given SNRs, but
0:26:20 roughly that amount.
0:26:23 It is more useful to think of these in terms of dB improvements,
0:26:27 and so we use this
0:26:29 idea of an equivalent intensity increase. The idea is, if you modify speech,
0:26:36 how much would you need to change the unmodified speech by,
0:26:41 in the sense of how much you would need to increase the SNR,
0:26:44 to get the same level of performance?
0:26:47 And this can be computed
0:26:50 by measuring psychometric functions for each of the maskers you need to use, and
0:26:53 using them to map from the unmodified speech to the modified speech.
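A sketch of how that equivalent-intensity-increase computation might look, assuming a logistic psychometric function has been fitted to the unmodified speech's scores as a function of SNR; the parameter names and example values are placeholders.

```python
import numpy as np

def equivalent_gain_db(score_modified, snr_test, k, snr50):
    """Equivalent intensity increase: find the SNR at which the *unmodified*
    speech would reach the score obtained by the modified speech, and return
    the difference from the SNR actually used.  The unmodified psychometric
    function is assumed logistic: p(snr) = 1 / (1 + exp(-k * (snr - snr50)))."""
    p = np.clip(score_modified, 1e-6, 1 - 1e-6)
    snr_equivalent = snr50 + np.log(p / (1.0 - p)) / k
    return snr_equivalent - snr_test

# e.g. modified speech scores 0.7 at -9 dB SNR; unmodified fit: k=0.25, snr50=-4
# print(equivalent_gain_db(0.7, -9.0, 0.25, -4.0))
```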
0:26:58 What this slide tells us is,
0:27:00 if you look at the subjective bars,
0:27:05 these filled bars here,
0:27:07 we're getting about two dB or so of improvement using that static spectral weighting,
0:27:11 which is quite useful: two dB is maybe somewhere between ten and fifteen
0:27:16 percentage points or so.
0:27:19 Now, something else this figure shows:
0:27:21 these white bars here
0:27:23 are the predictions, on the same basis, of the
0:27:26 objective intelligibility model that was used directly to design the weightings in the first place,
0:27:30 and you can see the predictions are not really that good.
0:27:34 I mean, you
0:27:35 could look at this and say, well, they're quite correlated, but
0:27:37 they're not really very good at all;
0:27:40 there's quite a big disparity in these cases here.
0:27:43 Of course,
0:27:46 in one sense it doesn't matter, because we're still getting improvements for listeners.
0:27:51 On the other hand, if we had a better objective intelligibility model
0:27:54 than the glimpse proportion, for instance, then we might expect bigger gains.
0:27:59 So that's the idea. One of the things
0:28:02 that we have been focusing on
0:28:05 a lot
0:28:05 is improving intelligibility models for modified and indeed synthetic speech.
0:28:11 So, what you're seeing here: you might recognise some of these
0:28:15 abbreviations. This is the speech intelligibility index, the extended speech intelligibility index,
0:28:20 this is one from another group's lab,
0:28:22 et cetera; these are quite recent intelligibility metrics,
0:28:25 seven or so of them,
0:28:27 and these are five glimpse-based metrics that
0:28:31 have been developed within the project
0:28:32 to try to improve matters.
0:28:34 The point is that the one we were using, which gave us those static
0:28:36 spectral weightings, is this one, and it just doesn't perform that well actually; and
0:28:40 most of the metrics there don't really perform so well
0:28:43 on modified speech.
0:28:45 Normally we're used to
0:28:46 correlations with listener data of at least point nine
0:28:50 for natural
0:28:51 speech; they're far worse for modified and synthetic speech.
0:28:55 So what we did next is to see what happens if we
0:28:57 do the same
0:29:00 static spectral
0:29:01 weight estimation,
0:29:03 but this time using this high-energy glimpse proportion
0:29:07 metric instead.
0:29:09 This is just really a series of adaptations to the normal glimpse proportion.
0:29:13 Well,
0:29:14 what we're doing over here, this is the normal glimpse proportion;
0:29:17 what we're adding in here is something which represents the hearing threshold
0:29:22 level:
0:29:23 sometimes we present speech at such a low SNR that some of the speech,
0:29:28 some of the speech itself within the mixture, when it's presented to listeners at, say,
0:29:32 some overall dB level or whatever,
0:29:34 is actually below the threshold of hearing,
0:29:36 and this ought to have an effect on the intelligibility prediction, so that's catered for over here.
0:29:42 You've also got
0:29:43 a sort of, well, a logarithmic compression,
0:29:47 to deal with the fact that
0:29:49 glimpses are very redundant: you probably only need thirty percent of the spectro-temporal plane glimpsed
0:29:53 to get
0:29:53 to ceiling performance.
0:29:56 That's handled there.
0:29:57 And this is a durational modification factor,
0:30:00 which attempts to cater for the fact that
0:30:04 rapid speech is less intelligible. So there are a few changes in there; I'm not really
0:30:08 going to go too much into them here.
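A rough sketch of two of those adaptations layered on top of the basic glimpse count. The threshold handling and the saturating compression below are stand-ins for what was actually used (the talk mentions a logarithmic compression; an exponential saturation is substituted here purely for illustration), and the constants are placeholders; the durational factor is omitted.

```python
import numpy as np

def extended_glimpse_proportion(speech_db, noise_db, hearing_threshold_db,
                                local_snr_db=3.0, saturation=0.3):
    """Glimpse proportion with two adaptations in the spirit of the talk:
    (1) points below the absolute hearing threshold cannot be glimpsed, and
    (2) a compressive mapping so that coverage beyond roughly 30% of the
    spectro-temporal plane saturates (glimpses are redundant)."""
    audible = speech_db > hearing_threshold_db
    glimpsed = audible & (speech_db > noise_db + local_snr_db)
    gp = glimpsed.mean()
    # compressive saturation: approaches 1 once gp exceeds `saturation`
    return 1.0 - np.exp(-gp / saturation)
```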
0:30:11 These are just a trace of the patterns that
0:30:12 come out of this optimisation
0:30:15 process,
0:30:16 and what we're seeing is actually quite similar patterns to the preceding model, though with
0:30:21 some differences. These are six different noise types:
0:30:23 low-pass,
0:30:24 low-pass and high-pass noise, white noise,
0:30:27 and again modulated noise,
0:30:28 competing-talker noise and speech-shaped noise; and we essentially see pretty much a boost of
0:30:33 the high frequencies.
0:30:38 What do we find here? Well, we changed corpus here a little bit:
0:30:40 it became more convenient for me, working
0:30:44 in Spain, to have Spanish listeners rather
0:30:46 than
0:30:47 fly my ex-colleagues, English, Scottish, whatever, over to run the experiments with.
0:30:54 So this is with the Sharvard corpus, which is a Spanish version of the Harvard
0:30:57 sentences.
0:30:59 And what you're seeing here are gains in percentage points;
0:31:04 these are not relative gains,
0:31:05 they are percentage-point gains of up to fifty-five from a static spectral
0:31:10 weighting
0:31:12 in the best cases, and in some other cases down at twenty or thirty.
0:31:16 It doesn't work at all in white noise, which we put down to
0:31:20 continuing problems,
0:31:23 further problems, with the objective intelligibility metric.
0:31:26 But nevertheless, you can see that for a very simple approach, which could be implemented
0:31:29 as a
0:31:29 simple linear filter,
0:31:31 we can get
0:31:32 some pretty big gains
0:31:34 in noise using these approaches.
0:31:38 One of the final questions we wanted to address is:
0:31:41 to what extent do we need to make the weightings masker dependent?
0:31:45 Because if you look at the weightings we have
0:31:49 here
0:31:50 for the different maskers, we tend to see a similar pattern;
0:31:53 we tend to see
0:31:54 a preference for getting the energy up into the
0:31:57 high frequencies,
0:31:58 with maybe a sort of
0:32:00 tendency to preserve some very low-frequency information, which might be related to encoding voicing,
0:32:05 for instance.
0:32:08 So we tried out a number of
0:32:09 static spectral weightings in a masker-independent sense. This is the simplest one,
0:32:14 which essentially transmits,
0:32:16 transfers,
0:32:17 reallocates lots of energy from the low frequencies, below one kHz,
0:32:21 to the regions above,
0:32:23 with no attempt to produce a cleverer profile;
0:32:27 that's over here.
0:32:28 And then we also tested the idea of sparse boosting, just boosting
0:32:32 a few channels;
0:32:33 sparse boosting with some low-frequency information transferred as well,
0:32:36 that's this one;
0:32:37 and just a variant that performs channel
0:32:39 selection
0:32:41 in the mid and high frequencies.
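The simplest of those masker-independent weightings, as described, just moves energy from below about 1 kHz to the bands above while keeping the total constant. A schematic version follows; the band edges and the fraction moved are my own placeholders, not the project's values.

```python
import numpy as np

def reallocate_low_to_high(band_energies, centre_freqs_hz, cutoff_hz=1000.0,
                           fraction_moved=0.8):
    """Take `fraction_moved` of the energy in bands below `cutoff_hz` and
    spread it uniformly over the bands above, leaving total energy unchanged."""
    e = np.asarray(band_energies, dtype=float).copy()
    low = centre_freqs_hz < cutoff_hz
    high = ~low
    moved = fraction_moved * e[low].sum()
    e[low] *= (1.0 - fraction_moved)
    e[high] += moved / high.sum()
    assert np.isclose(e.sum(), np.sum(band_energies))   # energy is preserved
    return e
```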
0:32:44 And it turns out,
0:32:45 slightly to our surprise,
0:32:47 that the masker-independent weighting, which is these black bars,
0:32:52 in all but one of these conditions
0:32:55 does as well as the masker-dependent weighting,
0:32:58 which is the white bars from before,
0:33:00 copied from a
0:33:01 couple of slides back.
0:33:03 All of the other weightings don't do quite so well, although in general they produce improvements.
0:33:08 So what this is saying, really, is that
0:33:11 for a wide variety of common noises,
0:33:14 say babble noise in particular, which is basically the noise in transport interchanges, and the same
0:33:18 for speech-shaped noises,
0:33:20 we can get pretty significant improvements from a simple approach of spectral weighting.
0:33:26 There's lots more to be said about spectral
0:33:28 types of things, lots more to be done, but
0:33:31 I want to give a kind of
0:33:33 broader look at
0:33:34 all the various domains, so let me move on and look at temporal modifications.
0:33:39 The first thing to look at
0:33:41 is this question of duration, or speech rate, changes.
0:33:45 You might think that slowing speech down, in the way that Lombard speech
0:33:49 does, at least for certain segments,
0:33:51 is done for good
0:33:53 reasons,
0:33:54 because the speaker, in electing to do it, is trying to make things easier for the
0:33:57 interlocutor.
0:34:02whether or not the slower speech rate along but speech actually helps at all
0:34:05we see here is the method also we use a this is plain speech this
0:34:08is lombard speech
0:34:10then we just simply time-aligned nonlinearly the low about speech with the plane speech
0:34:15and once you've got the time alignment you can then do things like transplanting spectral
0:34:19information
0:34:20from saying the lombard speech into the line speech
0:34:23in the timeline sense
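Schematically, the transplantation might work as below; this is only a sketch of the idea, with a home-made dynamic time warping and whole-frame copying standing in for whatever alignment and feature transfer was actually used.

```python
import numpy as np

def dtw_path(cost):
    """Minimal dynamic-time-warping path through an (n, m) local-cost matrix."""
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = cost[i, j] + prev
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while i or j:   # backtrack to (0, 0)
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: acc[s])
        path.append((i, j))
    return path[::-1]

def transplant_spectra(plain_frames, lombard_frames):
    """Non-linearly align Lombard frames to plain-speech timing, then copy the
    Lombard spectral content onto the plain-speech time axis."""
    cost = np.linalg.norm(plain_frames[:, None, :] - lombard_frames[None, :, :],
                          axis=-1)
    out = plain_frames.copy()
    for i, j in dtw_path(cost):
        out[i] = lombard_frames[j]          # spectral transplantation
    return out
```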
0:34:25 Well, the answer to the question
0:34:27 of whether or not duration helps
0:34:30 is no,
0:34:31 and this is not the only study that has found this:
0:34:34 neither linear stretching nor non-linear,
0:34:37 as in this case non-linear, time alignment gives a benefit. These are the
0:34:41 two points to see here: these are the benefits
0:34:43 of the overall modified speech (sorry, this is the natural Lombard speech),
0:34:47 and these are the spectral modifications; "local modifications" meaning
0:34:51 spectral transplantation having done the non-linear time warping.
0:34:55 Nothing helps
0:34:56 except the spectral changes;
0:34:57 these decreases are not significant, but they're clearly not in the right direction.
0:35:03 But I'll
0:35:03 come back a little bit later on to a result which seems,
0:35:06 which seems,
0:35:07 to contradict this.
0:35:10 So what I want to do, for, say, the next
0:35:12 five or ten minutes,
0:35:13 is give a slightly
0:35:15 richer interpretation of durational changes,
0:35:18 and this is about
0:35:19 what happens to speech
0:35:21 when you're talking in the presence of a temporally modulated masker.
0:35:25 Just think about that: you know, any time you go into a café or
0:35:29 something,
0:35:31 you're dealing with a modulated background.
0:35:35 Is there anything that we as speakers do in a modulated background to make
0:35:39 life easier for the listener?
0:35:41 These situations have barely been studied,
0:35:43 and yet they have the potential,
0:35:45 we think, we thought,
0:35:47 and continue to think,
0:35:49 to show
0:35:50 some more complex behaviour on the part of speakers to help listeners.
0:35:56 So here is the kind of task
0:35:58 that we used: it's a two-person talking task; there's a visual barrier between these two
0:36:02 talkers, the visual barrier here;
0:36:05 they're wearing headphones, listening to modulated maskers of different types,
0:36:09 varying in
0:36:10 rate and density, so there are some
0:36:12 opportunities, let's say, for the talkers to maybe get into the gaps.
0:36:16 Here's a bit of a link with the overlapping
0:36:18 speech material from this morning.
0:36:20 And they each have different versions of the task materials,
0:36:22 so they need to communicate; it's one of these joint problem-solving tasks. So here is
0:36:26 an
0:36:26 example of what that sounds like.
0:36:26an example of all that sounds like
0:36:29i mean see you can what you listening
0:36:32see it you can
0:36:33imagine
0:36:34the mask are being present you the must present in this example
0:36:38you hear the mask about the must was present for the for these talk as
0:36:43the you can okay gonna one
0:36:46and it in the middle right hand box
0:36:49the middle row there has to be three in five
0:36:52no colour role
0:36:54i mean the timing wasn't quite natural i think you need here is not really
0:36:57every now briefly what conversation
0:36:59this the third party and that a lot that parties
0:37:02is it is a modulated mask in this case
0:37:05no is less interesting things ago on an overlap as i'm sure
0:37:08i don't need to tell you
0:37:10and
0:37:11but these not
0:37:12you know this a little bit of in the meetings
0:37:15style overlap
0:37:16because it obsoletes not the competing talking the background will see some examples of that
0:37:20in a moment
0:37:21why white simply wanna focus on is the overlap
0:37:23it's simply the degree of overlap
0:37:25with the with the mask
0:37:28do the talkers treat the mask a like an interlocutor
0:37:32but there is a tend to avoid overlap
0:37:34or not
0:37:35what we found is that to some extent yes it's is showing the reduction in-overlap
0:37:41these just the for different masters or the dense and sparse mask so in the
0:37:45case where there's
0:37:46more potential for reducing a lot voice pops easy if it's order to do so
0:37:49that we do see reduction overlap
0:37:52well however they to the itchy this by increasing speech right so they speak a
0:37:56more
0:37:58only when there's no overlap
0:37:59when the weather's up
0:38:01when this background speech and that's what's response of the increase in
0:38:04the decrease in you know a lot this is normalize of course
0:38:07by
0:38:08a speech activity
0:38:11 So what is a speaker doing?
0:38:13 Well,
0:38:14 to try to work out what speakers are doing when noise is present, or indeed when noise
0:38:18 isn't present,
0:38:20 we use a technique which
0:38:22 was developed as a tool for system identification,
0:38:25 called reverse correlation.
0:38:27 It has been used, for instance, to try to identify
0:38:31 non-linear systems as well, although strictly speaking it only applies when
0:38:35 you're dealing with a linear system. Here we're dealing with an entire speech
0:38:39 perception process, and then also a speech production process in response to
0:38:43 the speech being listened to; so we've got two highly non-linear systems in series, and so
0:38:47 it shouldn't really work.
0:38:48 But nevertheless, what we do is:
0:38:50 we look at all events of a particular type
0:38:53 in the corpus, let's say all occasions when the person you're talking to
0:38:57 stops speaking, offsets,
0:38:59 and we ask what was going on,
0:39:01 what was going on in your speech, the interlocutor's speech,
0:39:04 at that point.
0:39:06 So we just encode all those events as unit-impulse-like spikes,
0:39:09 and then we take a window,
0:39:11 look at speech activity, and average over all of those exemplars.
0:39:15 That gives us what we call this event-related activity, which is what you're seeing here;
0:39:19 the window spans plus or minus one second.
0:39:35well but more likely to be start and talk this what's been taking is really
0:39:39about
0:39:40and we see the reverse pattern
0:39:42on the other side
0:39:43but interesting questions what happens when the mask or we take the mask rebounds so
0:39:47what happens when the masking goes off what're you doing is a talk
0:39:51well
0:39:52not very much but then
0:39:54afterwards shortly afterwards
0:39:56you increase your
0:39:57likelihood of speaking
0:40:00and like
0:40:00likewise in the case response qbc bit more clearly if we
0:40:04just look at the difference between the onset and offset abouts the symmetric for all
0:40:09intents and purposes
0:40:10and so we see this what we call it contrast cuts
0:40:13this is really just shown in that was having an interlocutor case
0:40:16see very nice cup
0:40:19a quite a wide range in the mask in case well because it's become guess
0:40:23whether must be bands gonna take place there's really no difference here that is right
0:40:27after the milliseconds after the massacre
0:40:30as i that come on come off
0:40:31we see a change in the speaker activity what is the showing is that's talk
0:40:35as are sensitive to the mask is
0:40:36and do respond in some way
0:40:40 Now, there are several possible strategies that a talker might be using,
0:40:43 and it turns out to be somewhat non-trivial to tell them apart, but simply to say that
0:40:47 it isn't the case,
0:40:49 mainly, that when a masker comes on,
0:40:52 talkers tend to stop; that would be this "stop"
0:40:55 strategy here.
0:40:57 It's more the case that
0:40:59 they tend not to start
0:41:01 when a masker is present. The two things, if you think about it, might look
0:41:03 the same when you average across events, which is why we need to distinguish between the two.
0:41:07 So we see lots of evidence for a talker strategy based on: when the masker goes off,
0:41:11 you're more likely to start talking, which makes sense;
0:41:13 and if the masker comes on, you're less likely to start talking; and a little bit
0:41:18 of evidence
0:41:18 that the masker causes you to stop talking,
0:41:21 but that's quite weak evidence.
0:41:25 Now, how does this work in a more natural situation, where there's
0:41:30 another conversation present in the background, rather than this
0:41:33 slightly artificial
0:41:35 modulated background noise?
0:41:38 These were some experiments we carried out both in English and
0:41:41 in Spanish, and
0:41:43 the basic scenario is that we have a pair of talkers here having a conversation;
0:41:47 they come in for the first five minutes,
0:41:48 and then they are joined for the next ten minutes by another pair of talkers,
0:41:52 and then, for symmetry purposes, the first pair leave. So we've got a period where
0:41:56 we've got two parallel conversations;
0:41:58 the second group is not allowed to talk to the first group, and vice versa.
0:42:01 And so we're really interested, in a very natural situation, in seeing how one conversation affects
0:42:06 the other conversation. I'll just play you an example,
0:42:09 and you'll be helped a little bit by the transcription on the right-hand side if you
0:42:13 try to follow it.
0:42:14 [plays example]
0:42:26 [plays example]
0:42:34 This is the natural overlap situation. If you measure the percentage of overlap, it is not
0:42:37 twenty-five percent;
0:42:39 across the entire corpus it's more like
0:42:42 eighteen to twenty percent within turns.
0:42:45 So I want to look at a couple of things here.
0:42:48 One of the things that talkers do
0:42:51 in a situation like that
0:42:54 is drastically reduce
0:42:56 the amount of natural overlap that they allow with their conversational partner.
0:43:01 The figure that was mentioned this morning was about twenty-five percent; we find the
0:43:03 same
0:43:04 thing. So this thing here, when there's no
0:43:06 background present
0:43:07 and the interlocutors can
0:43:08 see each other, this is the natural state of the two-person dialogue: roughly
0:43:12 twenty-five percent of the material is overlapped.
0:43:15 Switch on the background conversation and
0:43:16 you see that's reduced; that's one big change. Another change we see is when we
0:43:21 remove the visual modality. You might have noticed in that picture they were wearing these
0:43:24 visors; that's one of the conditions,
0:43:28 and that also causes a bit of a reduction in overlap,
0:43:30 some response.
0:43:33the interesting question is
0:43:34to what extent all listed it may not make and situation aware of what's going
0:43:38on the background
0:43:39and adapting accordingly
0:43:41so these are the either likes activity plots of the four we saw before
0:43:45is this is with no background presents so we see this turn taking behaviour
0:43:49and this is where there's a visual information
0:43:53a lot i would i the
0:43:54so we can see the interlocutors lips
0:43:57of the interesting case of these cases where the noise is present
0:44:00and so this is the
0:44:01showing the
0:44:03activity response to the noise
0:44:05a low is much weaker pattern
0:44:06we still see the same
0:44:08sensitivity the noise in these highly dense that situation
0:44:12so the foreground since they can summarise all this
0:44:15is affected by background
0:44:17a background conversations
0:44:20 What has all this got to do with
0:44:22 speech technology? Well,
0:44:24 out of this grew an algorithm for retiming speech,
0:44:27 which
0:44:28 was also submitted to the Hurricane Challenge.
0:44:31 And the idea here, the approach here,
0:44:36 is a general dynamic-time-warping-based approach,
0:44:38 where we take a speech signal, and here is the masker, and we say:
0:44:43 if we are allowed, on a frame-by-frame basis, to modify the timing of the speech signal
0:44:47 to achieve some objective,
0:44:49 whatever that is,
0:44:51 then we can do so by
0:44:53 finding, or defining,
0:44:56 the least-cost path through some
0:44:58 cost matrix; with the least-cost path through this cost matrix
0:45:03 we end up with modified speech,
0:45:05 with temporal changes.
0:45:07 So the important question now is
0:45:09 what we put in as the cost function.
0:45:11 We tried various things. One of them is
0:45:13 based on energetic masking, glimpsing again; that's where the "G" comes in, in
0:45:17 the name of the retiming algorithm,
0:45:18 and the other component is cochlear-scaled entropy, which is a measure of
0:45:23 information content in speech. So,
0:45:25 to put it in simple terms, what we try to do is find the
0:45:28 path
0:45:29 which maximizes the number of glimpses of the speech you're going to get, by shifting speech away
0:45:33 from epochs where the masker is intense,
0:45:37 while at the same time being sensitive
0:45:39 to speech information content,
0:45:41 where speech information content is defined by cochlear-scaled entropy.
0:46:13half a second i the side sense
0:46:15and of course not surprisingly most the time that he you the re timing out
0:46:21than exploits that fact
0:46:22no strategy speech or shifts bits of speech around
0:46:26into those got into the and also exploiting the silence
0:46:29so what the be time because it simply the elongation well our previous results
0:46:33would suggest that
0:46:34elongation doesn't help right this is i began the section
0:46:38you location doesn't help
0:46:39but strangely
0:46:40we found in the case of the modulated mastering competing speech in this case
0:46:44we found that would simply you located did help
0:46:49not as much as we timing
0:46:50about what about a half the effect could be due to pure elongation
0:46:54so
0:46:55but selected
0:46:57speech shaped noise in this case
0:46:59we find elongation doesn't help which is
0:47:02consistent with the accent you picture so what's really going on here
0:47:06well the reason that people don't find improvements with a durational based approaches distracting is
0:47:12"'cause" most of the work has been done looking at this stage we mask is
0:47:16and interest issue mask you simply you log eight
0:47:19you we not in just using any new information
0:47:21"'cause" the master itself a stationary with you gotta modulated mask
0:47:25you stretch they say of all out
0:47:27you know if for it was massed some parts fragments of it you know if
0:47:31needed for identification are gonna skate masking
0:47:34and that's what we think is responsible here
0:47:36the other important thing here on this
0:47:38the came out of this is the tree timing itself appears to be intrinsically harmful
0:47:43so what something which is strangely something which is really beneficial for one mask
0:47:48we get is a big these of the games
0:47:50exactly harmful for the first stage we mask so we're
0:47:55distorting the acoustic phonetic integrity of the speech
0:47:59but nevertheless it is still the same with time speech is still got the same
0:48:04distortions in
0:48:05but in the case of liturgy mask
0:48:07is highly intelligible
0:48:10in the target some of the circle more was about that
0:48:12it was
0:48:14well we know what's it is more likely to
0:48:17picture of where we all
0:48:19with that speech modifications what can we achieve
0:48:22so the racks to a couple of forty can challenge is what we do internally
0:48:25within the listing talk projects and then one that's a the clearly evaluate your unless
0:48:29just interspeech
0:48:31and
0:48:32and the goal was to
0:48:34people providing within
0:48:36well modified speech
0:48:38had access to mask is a given snrs
0:48:41and simply returns
0:48:43modified speech to us we then evaluated with a very large number of listeners
0:48:47and these are some of the entries
0:48:49 So:
0:48:51 plain speech:
0:48:52 "a large size in stockings is hard to sell".
0:48:55 Natural Lombard speech:
0:48:57 "a large size in stockings is hard to sell".
0:49:01 Some unmodified TTS:
0:49:03 "a large size in stockings is hard to sell".
0:49:06 This is an entry in which
0:49:09 Lombard-like properties were applied to TTS, so a Lombard-
0:49:12 adapted TTS:
0:49:15 "a large size in stockings is hard to sell"; that's the synthetic voice
0:49:20 trying to compete with noise as well.
0:49:22 And a number of other techniques;
0:49:23 I'll play this one because this was the winning entry:
0:49:25 "a large size in stockings is hard to sell".
0:49:28 On this website you'll find these and lots more examples.
0:49:34 Well, these are the results of the internal challenge. Of the systems,
0:49:38 the one called SSDRC, which came from
0:49:43 Yannis Stylianou's lab,
0:49:45 that's the University of Crete,
0:49:48 was the winning entry,
0:49:50 producing gains of about thirty-six, thirty-seven percentage points in this condition.
0:49:56 What does that amount to in dB terms? Well, it amounts to about
0:49:59 five dB,
0:50:00 and those are, you know, useful gains, I think, for speech modification approaches.
0:50:05 You can also see here, and I think this is interesting, that Lombard speech,
0:50:09 natural Lombard speech, in this condition,
0:50:11 just this case here, actually produced a gain of about one
0:50:15 dB, so we're getting super-Lombard performance
0:50:18 out of some of these modification algorithms,
0:50:20 some of the ones that are
0:50:21 based,
0:50:22 to some extent, on Lombard speech.
0:50:24 And TTS is a long way behind, but by applying, for
0:50:28 instance, Lombard-like properties to TTS systems,
0:50:31 we can improve things by over two dB.
0:50:37 The slightly larger challenge, the Hurricane Challenge we ran last year:
0:50:41 I'm presenting the results in a slightly unusual way, so just bear with this.
0:50:45 What we're looking at here is the equivalent intensity change, in dB,
0:50:49 in the face of a
0:50:50 stationary masker, the speech-shaped noise, and, in this case, a competing-talker masker.
0:50:54 The green points correspond to natural speech, and the baseline is where the lines
0:50:58 intersect, about there,
0:50:59 and the TTS entries have a lower baseline;
0:51:03 they're in blue, and you can see them over here.
0:51:06 Now, this is in a fairly low-noise condition. If we move to
0:51:09 a high-noise condition,
0:51:11 to get a better idea of what these things are really capable of,
0:51:13 then again we see gains of about five dB
0:51:17 in stationary noise, and also
0:51:19 the retiming approach
0:51:20 getting close, not far behind, also in
0:51:22 fluctuating noise.
0:51:24 What I really want to point out, and for me probably the most
0:51:27 interesting outcome of this evaluation,
0:51:29 is the fact that
0:51:30 some of these TTS systems, adapted
0:51:33 based on some intelligibility criterion, are actually doing really well
0:51:38 compared with the natural-speech baseline over here; and we're getting a couple of the
0:51:41 TTS systems here, which I'll play examples of
0:51:43 in a second, that are actually more intelligible than
0:51:47 natural speech
0:51:48 in noise, which I think is a fairly interesting achievement.
0:51:53 These came from two different labs:
0:51:55 one is from
0:51:57 Junichi Yamagishi's
0:51:59 group, and the other from Daniel Erro at the
0:52:03 University of the Basque Country;
0:52:05 well, the same university, but a different group;
0:52:06 I had nothing to do with this.
0:52:08 Okay, so this is an example of what these sound like; these are
0:52:12 just two of the TTS systems.
0:52:26 And it's pretty evident, I think, that the modified synthetic speech is much more intelligible in those
0:52:30 cases.
0:52:32 Just a final thing to say about the Hurricane Challenge, something we did recently:
0:52:35 a natural thing to do, of course, is to take the spectral changes and the temporal changes
0:52:39 and see whether they complement each other,
0:52:42 and the short answer is yes. So this is the unmodified speech;
0:52:46 this is the effect of just applying temporal changes
0:52:48 with the retiming algorithm;
0:52:50 this is just the effect of
0:52:52 SSDRC, in this case the spectral shaping and dynamic range compression algorithm;
0:52:55 and if you put the two things together,
0:52:57 you get something which isn't quite additive, but it's certainly, call it,
0:53:01 complementary: some forty-odd percentage points,
0:53:03 which is around nine to ten decibels of impact.
0:53:07so, just in the last few minutes, i want to address this question: is modified speech intrinsically more intelligible, or is it just undoing the masking? is that all it's essentially doing?
0:53:22it's actually quite difficult to answer this question, simply because when we measure intelligibility we normally measure it in noise, because otherwise performance is at ceiling.
0:53:29but if you've got a system which modifies speech to be more intelligible in noise, then of course it's going to be more intelligible in noise, so you're not measuring intrinsic intelligibility, you're measuring the ability to overcome that particular masker.
0:53:44that's if you use native listeners. if you use non-native listeners, then intelligibility in quiet is usually some way below the ceiling performance of natives.
0:53:54so this is what we did: we played Lombard speech to non-native listeners.
0:53:59what we found was, forgetting about most of this, the key result here: Lombard speech is actually less intelligible than plain speech in quiet.
0:54:07the same speech which is more intelligible in noise is less intelligible in quiet for non-native listeners.
0:54:13if Lombard speech were somehow improving acoustic-phonetic clarity, shall we say, rather than just being a generalised set of vocal changes, then you might expect to see benefits, but we don't.
0:54:24i'll skip over that.
0:54:25and something we did recently as well is to ask the same question with non-native listeners for SSDRC,
0:54:32which, as i said, was the winning entry in the Hurricane Challenge.
0:54:35and again, looking at results in quiet with non-native listeners, who sit well below ceiling, the modified speech if anything makes things worse.
0:54:45so, just to conclude:
0:54:47what i've tried to show is that by taking some inspiration from talkers, not slavish inspiration but sometimes going beyond what talkers themselves are capable of doing,
0:54:57we are able to motivate some algorithms which can turn speech that is nearly unintelligible into speech that is almost entirely intelligible.
0:55:06there's been some development of objective intelligibility models to make this possible,
0:55:11and i think there's definitely scope for much more work here: the better the intelligibility models we can produce, the bigger the gains we expect them to deliver.
0:55:21and i should say that this work is more or less immediately applicable to all forms of speech output, including domestic audio coming from non-speech-technology devices: radios, TVs, et cetera.
0:55:32there's some material i didn't really say much about: we've done work with dyslexic listeners, showing that they benefit from the same sorts of modifications too.
0:55:44one thing we do need to look at, and i touched on it in the last couple of slides, is this loss of intrinsic intelligibility.
0:55:50i think this is an opportunity: we've got algorithms here which do well in noise but in quiet actually harm things. what if the two things were not in opposition?
0:56:01if we could somehow do the two things together, making clarity-based changes at the same time as dealing with the masking, then we could see some real gains.
0:56:10okay, thank you very much.
0:56:26thank you, Martin, for this very interesting talk.
0:56:30do you have any comments on the use of ASR, i mean the use of this work for ASR, to improve speech recognition?
0:56:42that's an interesting question.
0:56:45were you thinking maybe we can train talkers to interact more clearly with our ASR devices? that's not going to happen, is it?
0:56:55i think, yes, of course. one of my original aims in the Listening Talker project was to get as far as looking at dialogue systems,
0:57:03where ASR is of course a key component,
0:57:08and to look at ways of improving the interaction by essentially making the output part of it much more context-aware.
0:57:18and of course, in this sense, if you could make the interaction smoother, which might also mean allowing overlaps as in natural conversation, then i guess the input side might also end up being smoother.
0:57:31we didn't end up doing any of that, though.
0:57:33some results in ASR show that it's preferable to adapt to the environment rather than trying to remove the distortion from the speech,
0:57:46provided you have data to adapt to.
0:57:51well, another application i didn't mention at all, and the way i think about it,
0:57:56given my background in computational auditory scene analysis, is that we often set out to solve the problem of taking two independent sources and trying to separate them,
0:58:08without acknowledging the fact that the two are not independent, except in speech separation competitions.
0:58:13talkers are always aware of what's going on in the background, and since we modify our speech accordingly, that really ought to be factored into these algorithms, potentially making separation simpler for them.
0:58:31thank you, Martin, that was very interesting. i'm wondering, i've probably got about twenty questions, but if i just narrow it down to a couple we'll be good.
0:58:43in your work, were there any constraints regarding quality or naturalness of the enhancement?
0:58:50good question; some of my colleagues could probably answer that better than me.
0:58:54okay, so, again, one of our original goals, thinking that we would just knock off the intelligibility stuff in the first year or something,
0:59:01was to look at speech quality, and we did a little bit of work looking at objective measures of speech quality, some of it with the Basque group.
0:59:11some of the modifications i didn't talk about do produce highly distorted speech. i remember one modification we produced:
0:59:18we were essentially taking the quite general approach of, suppose we equalise the SNR in every time frame,
0:59:26a process a little bit like the algorithms i showed, but more extreme, or we even went further and equalised the SNR in each time-frequency bin.
0:59:36you can imagine that the effect of doing that is highly distorting, and sometimes it's highly beneficial as well, but sometimes very harmful; it's a very binary type of thing.
0:59:45so, i mean, we did look at this to some extent, and some of the other partners in the project did work on speech quality too.
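A minimal sketch of the per-time-frequency SNR equalisation idea just described follows, assuming the noise signal is known and at least as long as the speech. It is illustrative only, not the project's algorithm, and, as noted above, the output tends to be heavily distorted.

```python
# Illustrative sketch: push every time-frequency bin towards a common target SNR,
# then renormalise to the original speech energy. Known to be highly distorting.
import numpy as np
from scipy.signal import stft, istft

def equalise_tf_snr(speech, noise, fs=16000, target_snr_db=0.0, eps=1e-10):
    _, _, S = stft(speech, fs=fs, nperseg=512)
    _, _, N = stft(noise[:len(speech)], fs=fs, nperseg=512)
    snr = (np.abs(S) ** 2 + eps) / (np.abs(N) ** 2 + eps)
    gain = np.sqrt(10 ** (target_snr_db / 10) / snr)  # boost weak bins, cut strong ones
    _, y = istft(S * gain, fs=fs, nperseg=512)
    y = y[:len(speech)]
    return y * np.sqrt(np.sum(speech ** 2) / np.sum(y ** 2))  # fixed-energy constraint
```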
0:59:57in that sense, we've been looking for correlations.
1:00:00for the non-native listener results i just showed, we looked at their responses as a function of speech quality differences, where we might expect an effect.
1:00:09in the intelligibility part of the task they were pretty much identical to native listeners in the respects we've examined,
1:00:18even though the modifications differ quite a lot in the distortion they introduce. you might expect that rich native L1 knowledge would somehow enable listeners to handle these distortions more easily, but that hasn't been the case.
1:00:32so we don't have that side fully worked out. a further consideration related to this is that we're using a constant RMS energy constraint,
1:00:39whereas we should really be looking at loudness, which is more difficult to optimise; really, you've got to agree on a loudness model first.
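To make the constraint issue concrete, the sketch below contrasts the fixed-RMS check with a crude loudness-flavoured one based on a standard A-weighting filter. The A-weighting curve is only a stand-in for a proper loudness model; treat this as an assumption-laden illustration rather than anything used in the project.

```python
# Illustration of why "equal RMS" is not "equal loudness": two signals can pass
# the RMS constraint while differing in (crudely approximated) perceived level.
import numpy as np
from scipy.signal import bilinear, lfilter

def a_weighting(fs):
    """Digital approximation of the analogue A-weighting curve (IEC 61672)."""
    f1, f2, f3, f4, a1000 = 20.598997, 107.65265, 737.86223, 12194.217, 1.9997
    num = [(2 * np.pi * f4) ** 2 * 10 ** (a1000 / 20), 0, 0, 0, 0]
    den = np.polymul([1, 4 * np.pi * f4, (2 * np.pi * f4) ** 2],
                     [1, 4 * np.pi * f1, (2 * np.pi * f1) ** 2])
    den = np.polymul(np.polymul(den, [1, 2 * np.pi * f3]), [1, 2 * np.pi * f2])
    return bilinear(num, den, fs)

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def a_weighted_rms(x, fs=16000):
    b, a = a_weighting(fs)
    return rms(lfilter(b, a, x))

# A modification that moves energy from low to high frequencies can keep rms()
# constant while changing a_weighted_rms(), i.e. it can sound louder or softer.
```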
1:00:50then my second question would be: you discussed the effect of the listening ear being matched, native or not, and you mentioned working with english and spanish, but i'm wondering,
1:01:06have you studied a variety of source languages and found any of them more amenable to this process? maybe we should all switch to a different language.
1:01:17thank you.
1:01:18that's also interesting.
1:01:20i have not done any work on that with modified speech output, but we had a project a few years ago looking at eight european languages from the point of view of noise resistance, and there are clearly differences:
1:01:32a lot of it has got to do with a language's resistance to energetic masking.
1:01:35it's just never taken into account; in multi-language studies we often just normalise using performance in quiet, and perhaps we shouldn't be doing that,
1:01:43because some languages seem to be able to tolerate maybe up to four dB more noise than others in that respect.
1:01:59as you know, you are in the speaker recognition community here, so i'm quite sure you were expecting my question.
1:02:06do you have any idea of the effect of the Lombard effect, and all the kinds of modifications you've just presented, on speaker recognition?
1:02:24that's a hard one; there has been some work on this, which i guess you probably know, though i'm not sure to what extent.
1:02:29okay, so obviously voices change when people speak under stress and in very, very high noise conditions;
1:02:36you mean speaker identification based on that kind of speech, is that your question?
1:02:41my question is also linked to the forensic problem, you see:
1:02:46if someone is recorded in the presence of noise, so using a Lombard voice,
1:02:54would the speaker model information be the same as when we record them in quiet? that's the variance question.
1:03:03that's a very interesting question. there's a project using similar techniques, not ours, but at the Basque Country, looking at exactly that:
1:03:09it's essentially trying to map between normal and Lombard speech. if you know that somebody is talking in a certain degree of noise, you could attempt to transform the Lombard speech back into normal speech.
1:03:39but predictably so in some cases.
1:04:20precisely; i think you always really need to be careful experimentally to use the latter case,
1:04:26because knowing, or even being told, who you are communicating with makes a huge difference.
1:04:33i think
1:04:34i want to go back to the example of the two couples, one speaking english and the other spanish;
1:04:41i guess the one couple does not understand the other.
1:04:47we did have that as one of the situations, actually; we had the same experiments done with four english or four spanish talkers.
1:04:55but the question is: what would happen if there were two couples speaking the same language but on different topics,
1:05:02so the disturbance is not only noise but also being able to understand the other conversation?
1:05:09i think we did look at something like that a couple of years ago:
1:05:13we discovered this effect of informational masking caused by the interfering conversation sharing the same language.
1:05:22and it is the case, and in principle it's a common experience for many people, particularly in a bilingual or trilingual country,
1:05:35that if somebody is talking in a language you are well aware of, even if it's not your native language, it is simply a much bigger interfering factor for a start. that's one of the things that definitely happens, and it's worth about between one and four dB depending on the language pairs you look at.
1:05:52was that the part of the question you meant? okay.
1:05:56so it's all a matter of informational masking, and again that's another big area that we've studied from the perceptual point of view, but not from the application point of view, in terms of how to deal with it.
1:06:19thank you for the talk.
1:06:20regarding what was said earlier about ASR:
1:06:25i think i have personally tried some speech enhancement, and some colleagues too, for about three years,
1:06:32and even if it's matched in the training data, if you do speech enhancement on the test data our systems seem to do worse compared to if we give them all the noise.
1:06:44so have you seen the opposite?
1:07:10okay, so even so, you're saying that speech enhancement in this case just doesn't work, and you have no explanation for that?
1:07:27but it seems that if we try to remove the noise, the systems don't get better.
1:07:37and this is a general finding, a very surprising finding in a way, that speech enhancement does not help in robust ASR applications; in a way it mirrors speech enhancement for listeners:
1:07:46very few speech enhancement techniques work for intelligibility purposes, only one or so, no more.
1:07:52okay.
1:08:12and that one works; that one doesn't work;
1:08:15and the quality of this one is terrible, which is related to your question. so, i mean, these things are not even linearly related,
1:08:28and the one that really does something is the dynamic range compression type of processing, extreme dynamic range compression.
1:08:39so this question is based on the example that you showed about announcements on the train.
1:08:46is there any way of increasing the intelligibility of things like the name of the train station for people who, you know, are not native speakers?
1:09:01so, i mean, there are a couple of things you can do: one at a low level, one at a high level, and things in between.
1:09:07we haven't done this ourselves, but others have. one thing, of course, is that you can transfer any excess energy to those important items;
1:09:16that's the low-level thing, which we haven't done.
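A minimal sketch of that low-level idea, under the fixed-energy constraint used throughout the talk: boost a marked keyword region, such as a station name, and pay for it by attenuating everything else. The span indices and boost value are placeholders; in practice they would come from the TTS engine or a forced aligner.

```python
# Illustrative only: reallocate energy towards an important item (e.g. a station
# name) while keeping the total energy of the announcement unchanged.
import numpy as np

def boost_keyword(x, keyword_span, boost_db=6.0):
    start, end = keyword_span              # sample indices of the important item
    y = x.copy().astype(float)
    y[start:end] *= 10 ** (boost_db / 20)  # raise the keyword
    # renormalise: the rest of the utterance gets correspondingly quieter
    return y * np.sqrt(np.sum(x ** 2) / np.sum(y ** 2))
```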
1:09:19at the higher level, you can attempt to modify further: with synthetic speech you can attempt to produce hyper-speech, and people have been very successful in doing this, algorithmically and fully automatically producing speech which is more likely to meet its target. so where the message involves place names, that could really help in those cases.
1:09:43and then there are more prosaic things like simple repetition, or simplification of the syntax. when it comes to proper names like that, sure, there are some very specific things i think you can do, so we need to look at it that way, i think.
1:10:14this is like introducing redundancy, i think.
1:10:19yes, this needs to be done.