0:00:14 | Most of it... I wonder where that came from. And so... |
---|
0:00:17 | I mean, I started discussing travel arrangements, and then it became clear... |
---|
0:00:23 | So, the question I asked myself when I was invited to come to this meeting was: |
---|
0:00:30 | what can I say that is possibly of interest to people interested in speaker identification, language identification, these kinds of topics? |
---|
0:00:38 | You won't find i-vectors in this talk; the closest I got to an i-vector was a description of the process of virus transmission on an Apple device. |
---|
0:00:49 | But as I was putting this talk together, I was looking at some of your topics, and there are a number of points of connection that I noted down. So I think: |
---|
0:00:58 | I'm going to be talking about the way speakers change as a result of context, |
---|
0:01:02 | and how we develop algorithms that can modify speech to become more intelligible. But of course this has some relationship with spoofing, for instance: the methods I'll be talking about could potentially be used to disguise somebody's identity. |
---|
0:01:17 | Also, the effect of noise on speaking style is obviously very relevant to people interested in speaker identification. |
---|
0:01:22 | There was a talk this morning about diarization and overlapping speech; I'm going to be showing you some data on overlapping speech in realistic conditions, where other talkers are present. |
---|
0:01:32 | And I believe also that durational variation is a problem, and that we can glean behavioural information from it. So there are some points of contact between what I want to say and the kind of work that people are doing in this field. |
---|
0:01:47 | But to keep things simple, this is what I'm going to talk about: |
---|
0:01:50 | replacing the easy approach to intelligibility, which is increasing the volume, with this hypothetical but potentially very valuable device for increasing intelligibility. |
---|
0:02:04 | So I'm going to start by talking about why we should robustify speech output: why it's an interesting problem and what kinds of applications it has. |
---|
0:02:13 | I'll give a few general observations on what talkers do in adverse conditions, |
---|
0:02:17 | and then I'll separate the talk into the spectral and temporal domains, with some little tidbits about behavioural observations, |
---|
0:02:24 | focusing on some of the algorithms that people in the Listening Talker project have developed over the last couple of years, |
---|
0:02:33 | culminating with the Hurricane Challenge, a global evaluation of speech modification techniques which took place last year at Interspeech. |
---|
0:02:42 | And if there's time, also a few words about what these modifications do to speech: whether they actually make it intrinsically more intelligible, or whether they just overcome the problems of noise. |
---|
0:02:54 | So, why robustify speech at all? |
---|
0:02:57 | Well, I'm sure you're aware that speech output is extremely common, both natural (recorded) and synthetic. |
---|
0:03:05 | If you think about your journey to this place, you would presumably have gone through various transport interchanges and airports, and the aeroplanes themselves: |
---|
0:03:19 | lots of difficult acoustic environments, with reverberation, poor loudspeakers, et cetera, |
---|
0:03:23 | hearing all sorts of nearly unintelligible messages coming out of the public address loudspeakers. There are millions of these things in continuous operation, |
---|
0:03:32 | and it is surely an interesting problem to attempt to make the messages as intelligible as possible. |
---|
0:03:38 | The same goes for mobile devices, or say in-car navigation systems, where noise is simply a fact of life in the contexts in which they are used. |
---|
0:03:48 | And of course in speech technology, output speech, synthetic speech, is realistically sent into these environments regardless of context, regardless of whether someone else is talking (say, in a voice-driven GPS type of system), and regardless of whether there is noise present. |
---|
0:04:08 | Here are a few examples, just real quick, that I recorded with a simple handheld device, in this case recording speech in noisy environments. |
---|
0:04:17 | And because there's a hum on this, I'm going to plug it in just for the duration of this example. |
---|
0:04:32 | Now, that was convolved with this room's delivery system as well, so you couldn't hear much of it at all; but believe me, it's not intelligible to start with. |
---|
0:04:38 | Here's another one, not like this one: that was recorded speech; this is live speech, or rather live accented speech. We know that a foreign accent can be equivalent to, say, five dB of noise in some cases. |
---|
0:04:52 | [accented announcement plays, barely intelligible] |
---|
0:05:02 | And this is another one of my favourite examples, because this is really a user interface design problem for the people who designed it. This is the train I needed to Edinburgh. |
---|
0:05:13 | [announcement example plays] |
---|
0:05:17 | The noise saying the train is about to depart collided with the announcement itself, so there are simple fixes for those cases in particular. |
---|
0:05:26 | Anyway, why is it worth doing this? |
---|
0:05:28 | Well, I think it's worth bearing in mind that for natural sentences we have lots of different data that basically show the same point: |
---|
0:05:36 | for every dB you can gain (in effective terms; I'll say more about that later on), it's worth between five and eight, say five, percentage points, depending on the speech material. This is for sentences, pretty close to normal speech. |
---|
0:05:51 | And so every dB of gain is worth having, essentially. |
---|
0:05:58 | For some materials it's perhaps a little less. Now, every dB attenuated potentially saves lives. That might sound like a bit of a bold statement, but let me qualify it. |
---|
0:06:08 | This is a report from the World Health Organization which just covered environmental noise in Europe. |
---|
0:06:15 | And by environmental noise I mean excluding workplace noise: this is not people working in factories with, you know, pistons and hammers going all the time. This is to do with just the noise pollution that exists in everyday environments: if you live near a railway station you get announcements all day long, and if you live near an airport, the same; |
---|
0:06:37 | and the noise of, say, aeroplanes is a source of stress-related diseases, cardiovascular problems in particular. |
---|
0:06:43 | Now, these don't necessarily lead to fatalities, so the quantification used in this report is |
---|
0:06:48 | the DALY methodology, measuring how much healthy life is lost. |
---|
0:06:51 | So if you suffered, for instance, severe tinnitus as a result of environmental noise, that might carry a coefficient of 0.1, which means that for every ten years you effectively lose one year of healthy life. That's a very large figure. |
---|
0:07:03 | So anything we can do to attenuate environmental noise has to be beneficial. |
---|
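The DALY arithmetic quoted above is simple enough to write down. The 0.1 coefficient for severe tinnitus is the figure given in the talk; the function name and the simplified formula (disability weight times years lived with the condition, i.e. only the years-lived-with-disability component of a DALY) are illustrative, not the WHO's full methodology.

```python
def healthy_years_lost(disability_weight: float, years_with_condition: float) -> float:
    """Years of healthy life lost: disability weight times the number of
    years lived with the condition (the YLD component of a DALY)."""
    return disability_weight * years_with_condition

# Severe tinnitus from environmental noise, weight ~0.1:
# ten years with the condition costs about one year of healthy life.
print(healthy_years_lost(0.1, 10))  # -> 1.0
```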
0:07:11 | Just to contrast what I'm talking about with existing areas within the field: the difference between speech modification and, let's say, speech enhancement is that |
---|
0:07:21 | we're dealing with speech where the intended message, the signal itself, is known. So it's in a sense a simpler problem: we do not have the problem of taking a noisy speech signal and enhancing it for a recognizer or for broadcast, for instance. |
---|
0:07:37 | It's sometimes called near-end speech enhancement. |
---|
0:07:41 | And it's not like additive noise suppression: we're not attempting to control the noise. The sorts of situations I'll be talking about here will be ones where it's not really practical to control the noise, because you've got, say, a public address system and hundreds of people listening to it; they can't all be wearing headphones or whatever. |
---|
0:07:59 | So what are we left with? Within the system, its ability to modify the speech itself. And there are a few constraints on that, just practical constraints. |
---|
0:08:09 | In the short term, we're probably interested in changing the distribution of energy in time, |
---|
0:08:14 | in duration, even; I'll show you some of the ways we do that later on. |
---|
0:08:18 | But overall, in the long term, we don't want to extend the duration and fall behind an announcement, for instance. |
---|
0:08:27 | And we don't want, or are not able, to increase the intensity of the signal. So normally in this work, and this is going to be the case throughout pretty much all of what I talk about, there's a constant input-output energy constraint that's enforced. |
---|
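The constant input-output energy constraint described above is easy to enforce in code: whatever modification has been applied, the result is rescaled to the RMS of the original, so any intelligibility gain comes from redistributing energy rather than from playing the speech louder. A minimal numpy sketch; the function name and toy signal are mine, not from the talk.

```python
import numpy as np

def match_rms(modified: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rescale `modified` so its RMS equals that of `reference`,
    enforcing a constant input-output energy constraint."""
    rms_ref = np.sqrt(np.mean(reference ** 2))
    rms_mod = np.sqrt(np.mean(modified ** 2))
    if rms_mod == 0:
        return modified
    return modified * (rms_ref / rms_mod)

# A modification that happens to boost the signal is pulled back to the
# original energy before being compared with the unmodified speech.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
y = match_rms(3.0 * x, x)
```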
0:08:44 | Here are just a few examples of what we can achieve; I'll come back to saying how it's done a little later. So first, the original, unmodified example. Where has my mouse gone... |
---|
0:08:59 | Okay, so what you'll listen to is some speech in noise; trust me, because you may not be able to hear the speech. |
---|
0:09:09 | You can just about tell there's a speaker in there. |
---|
0:09:12 | Well, if you modify that speech, without changing the overall RMS energy (the noise is constant, so the SNR is the same in these two examples), the modified version sounds something like this. |
---|
0:09:28 | If you listened to that in an experimental setup, under headphones, you'd be guaranteed to get probably seventy percent of the words correct. |
---|
0:09:37 | And that's right: the sentence was about the hardest material we could possibly use, a semantically unpredictable sentence. |
---|
0:09:44 | So that's the motivation, by way of introduction. |
---|
0:09:47 | Now, a little bit about what talkers do. This has been a longstanding research area; I'd say it goes back at least a hundred years. |
---|
0:09:59 | A lot of work concerns clear speech: if you give somebody instructions to speak clearly, then they will. |
---|
0:10:07 | And you don't even need to give instructions: in this situation now, I'm attempting to speak more clearly than I was over lunch, for instance. |
---|
0:10:15 | And speech changes in these situations in ways which it's possible to characterize, and possibly copy. |
---|
0:10:23 | And this talk is in part about mapping clear-speech properties onto natural speech, or modifying speech as a function of the adverse conditions instead. |
---|
0:10:33 | Speech also changes as a function of the interlocutor: it changes when you talk to children (infant-directed speech, for instance); |
---|
0:10:41 | there's foreigner-directed speech, as it's also called, for non-natives; and also speech for pets, |
---|
0:10:45 | and for computers. |
---|
0:10:47 | As we all know, when talkers are involved in a dialogue with a speech recognizer, their speech changes. |
---|
0:10:55 | Here I'm mainly focusing on adverse environmental conditions. |
---|
0:10:58 | As I say, there has been good work on, for instance, Lombard speech for a long time, and plenty of others have worked in this area as well. |
---|
0:11:07 | Now, I should say we're interested in Lombard speech |
---|
0:11:12 | not necessarily because we expect speakers, or a device, or listeners, to be in environments |
---|
0:11:21 | with noise levels of the kind which normally induce Lombard speech, which are usually quite high, |
---|
0:11:25 | but simply because Lombard speech is more intelligible. We want to know why, first of all: the science. |
---|
0:11:30 | And then we want to import that knowledge into an algorithm, so that it at least reproduces the intelligibility benefit of Lombard speech, or goes beyond it. |
---|
0:11:40 | Actually, I'll show some results indicating that we can indeed go beyond it. |
---|
0:11:44 | Now, I guess most of you will have heard of Lombard speech, but if you haven't, this is what it sounds like. That was normal speech. |
---|
0:11:55 | This is the same talker, the same sentence, with in this case I think ninety-six dB SPL of noise over headphones. You don't hear the noise, because it never enters the signal path. |
---|
0:12:06 | And some of the properties of Lombard speech are fairly evident here, I think. You can see the duration change: this is the normal speech, this is the Lombard speech. |
---|
0:12:14 | It's quite normal for the duration to be slightly longer, and the stretching is nonlinear: voiced elements tend to be extended more than voiceless elements. |
---|
0:12:26 | Also, if you look here at the F0 (you can see the harmonics more easily in the Lombard speech), F0 is typically higher too. |
---|
0:12:35 | There are other characteristics which are not visible in this particular plot, but which you'll see in a second, and which might be important. |
---|
0:12:43 | Now, the real reason we're interested in Lombard speech: there's lots of data like this; this is just some we had from a similar study of our own. |
---|
0:12:52 | This is showing the percentage increase in intelligibility over a baseline, a normal-speech baseline; these are just four different Lombard conditions. |
---|
0:13:01 | You can see we can get some pretty serious intelligibility improvements (this is when the Lombard speech is then presented to listeners in the same amount of noise): |
---|
0:13:11 | improvements of up to twenty-five decibels... sorry, twenty-five percent. |
---|
0:13:17 | The question is why. Why is Lombard speech more intelligible? |
---|
0:13:21 | And there seem to be a number of possibilities, possibly acting in conjunction. |
---|
0:13:28 | So, one option. What you're looking at here in one panel are auditory spectrograms, cochleagrams as they're sometimes called, with a nearly logarithmic frequency scale. |
---|
0:13:38 | This is speech which was not induced in noise, normal speech, and these are just different degrees of Lombard speech; you can see the duration differences again there. |
---|
0:13:49 | What I'm showing on this side are the regions of the speech which, if you were to mix each of these in the same amount of noise, would actually come through, which are not masked. |
---|
0:14:01 | These things are what we call glimpses; it's a model I'll be defining a bit more carefully later on. |
---|
0:14:07 | What you see is that there are far fewer glimpses in the normal speech than in the Lombard speech, in particular in the high-frequency regions. |
---|
0:14:19 | And so, one of the other key properties of Lombard speech is that the spectral tilt is changed; it's reduced. So this is low frequency you're looking at, this is high frequency; Lombard speech is more like that, |
---|
0:14:32 | which means essentially it's putting more energy in the mid to high frequencies, which in auditory terms means that along the cochlea, from about a kHz upwards, we see more energy. |
---|
0:14:45 | So there are, potentially, spectral cues. |
---|
0:14:48 | There are also, potentially, temporal cues: simply a slowing down of the speech rate, if you like, except that it's not linear; it's a nonlinear expansion. Maybe that's beneficial; that's a contentious issue which I'll address. |
---|
0:15:01 | And maybe there are raw acoustic-phonetic changes too: maybe, when listeners are presented with a high level of noise, talkers attempt to hit their vowel targets more, with an expanded vowel space, as is the case in other forms of modified speech such as clear speech. |
---|
0:15:18 | Now, whether Lombard speech is intrinsically more intelligible is a question I'll address in part two. |
---|
0:15:25 | Now, at the start of the Listening Talker project we all got together, kind of optimistically, and had a bit of a brainstorming session, just to make a list of the things we might do to speech to make it more intelligible, to make it more robust. |
---|
0:15:40 | Of course, first off: "Why not just increase the intensity?", which was ruled out. And then, since some of us were aware of Lombard speech at this point: |
---|
0:15:49 | well, changing the spectral tilt is a possibility; and the further thing I just mentioned, acoustic-phonetic changes, expanding the vowel space. |
---|
0:15:57 | You can see the vowel space idea there on the slide. |
---|
0:16:01 | And so we continued thinking about this: maybe narrowing the formant bandwidths; why put energy, waste energy, on useless parts like, you know, the valleys between the peaks in the spectrum? More generally, reallocating energy; sparsifying energy is another generalisation. |
---|
0:16:18 | Some of these I mention because you're going to see some examples of them. |
---|
0:16:22 | Dynamic range compression has been around a long time in work on audio broadcasting, and it also works here. Then there are a few higher-level things: trying to match the interlocutor's intensity, or to contrast with it, |
---|
0:16:33 | to maybe help us avoid overlaps, which was talked about this morning; and so on. |
---|
0:16:38 | Okay, changing F0: we thought about that, together with some other things, such as searching for more robust vowels and consonants, or simplifying the syntax, |
---|
0:16:49 | and further, maybe producing speech which has a low cognitive load on the listener. |
---|
0:16:56 | As you can see, there's an awful lot of things that could be looked at, and on all of these there has been work done, so it's a great area for people interested in speech to start to look at. |
---|
0:17:04 | What I tried to do next was to group these into a bit more sensible structure, |
---|
0:17:10 | by looking at the goal of a speech modification. So what possible goals could there be in modifying speech? |
---|
0:17:16 | All of this is context dependent, but if we just focus on speech in noise, |
---|
0:17:21 | one of the clear goals is to reduce energetic masking, as it's called. |
---|
0:17:26 | Now, I don't know if you know the difference between energetic masking and informational masking. |
---|
0:17:32 | Energetic masking describes, essentially, what happens when a masker and a target (let's say speech) interact at the level of the auditory periphery: |
---|
0:17:44 | some information is lost, due to compression in the auditory system. |
---|
0:17:50 | But masking can come back again later, if some information is getting through from another talker: say there are two talkers talking at once, so you have two messages, or you hear fragments of two messages. |
---|
0:18:01 | And if the speakers are very similar, if they have the same gender, then it can be very confusing to work out which bits belong to which talker. That's an example of informational masking. |
---|
0:18:11 | So to reduce energetic masking we can do things like sparsification of the spectrum, or changing the spectral tilt. To reduce informational masking, if we've got control over the entire message generation process, we might do something like change the gender of the talker. |
---|
0:18:28 | Okay, so not necessarily with TTS, but we have voice conversion systems that can do this. |
---|
0:18:34 | And we can add visual cues, which can assist with reducing the effect of an interfering talker. |
---|
0:18:39 | And then we can do other things. This comes from my longstanding interest in auditory scene analysis: we can take the problem and invert it. We can try to prevent grouping: we can send a message into an environment where there are other sources, but do things to prevent it grouping with them. |
---|
0:18:55 | It's an idea which will be familiar from the work in scene analysis on the cues players in a quartet, I believe, use when instruments come in: they can use timing differences at the onsets to keep the instruments perceptually separate. |
---|
0:19:17 | That's an example of what I'm talking about: using scene analysis to prevent the message clashing with the background. |
---|
0:19:24 | Then there are things we can do to reduce the cognitive load of the message: using possibly simpler syntax, decreasing the speech rate, or equipping the speech with more redundancy, say by repeating the message words at a higher level. |
---|
0:19:40 | So there are lots of things one might figure out. |
---|
0:19:44 | What I want to do now is to move in the direction of some of the experiments we've been doing over the last few years. |
---|
0:19:51 | And this is the kind of typical approach we take: we use what you could describe as synthetic Lombard speech, in one form or another. |
---|
0:19:59 | What I mean is, we can take normal speech (let me play this again), |
---|
0:20:05 | and take the equivalent Lombard sentence, |
---|
0:20:10 | and ask: how much of the intelligibility advantage of that Lombard speech comes from, say, the timing differences? So we can time-align the two sentences and then, for instance, ask the question: |
---|
0:20:23 | what if we put only the F0 shift in, but remove the spectral tilt? That sounds like this: not quite a true Lombard, okay, because the residual, the difference between the two, is things like the spectral tilt. |
---|
0:20:37 | From an experimental point of view we can then identify the contribution of factors such as F0, spectral tilt, and duration to the intelligibility advantage. |
---|
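The talk does not say how the normal and Lombard sentences are time-aligned; dynamic time warping over frame-level features is one standard way to do it, and the sketch below (a naive O(nm) DTW, written for clarity rather than speed) is offered purely as an illustration of that step.

```python
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Dynamic time warping between two feature sequences (frames x dims).
    Returns the accumulated cost and the frame-to-frame alignment path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```

With the path in hand, frames of the Lombard utterance can be mapped onto the normal time axis (or vice versa) before swapping individual cues such as F0 or tilt.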
0:20:51 | So now I want to look at the spectral domain, and start off with one of the early experiments we did, looking at exactly those parameters: spectral tilt and fundamental frequency. |
---|
0:21:04 | Because Lombard speech, and clear speech, and other forms of modified speech do modify F0, you might be led to believe that F0 is an important change. |
---|
0:21:14 | But it turns out that it isn't. |
---|
0:21:15 | So what you're looking at here is the increase in intelligibility over a baseline from manipulating F0 to bring it in line with Lombard speech. |
---|
0:21:25 | And none of these changes is significant; these three different bars just represent different Lombard conditions. |
---|
0:21:31 | On the other hand, if we change the spectral tilt (this is just a constant change, not time-dependent), we get about two thirds of the benefit coming through. This is the real Lombard speech up here. |
---|
0:21:42 | So a lot of what we've got is due to the spectral tilt. |
---|
0:21:45 | It turns out this could be predicted very well just by considering energetic masking, with a glimpsing model: |
---|
0:21:51 | the spectral tilt change is putting some of the speech out of the masker. |
---|
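A reduction in spectral tilt of the kind described can be approximated with a first-order pre-emphasis filter followed by renormalisation to the input energy. This is a sketch of the idea only, not the filter used in the experiments; the 0.95 coefficient is a common textbook choice, and the test signal is invented.

```python
import numpy as np

def flatten_tilt(x: np.ndarray, alpha: float = 0.95) -> np.ndarray:
    """First-order pre-emphasis: boosts high frequencies relative to low,
    reducing spectral tilt, then rescales to the input RMS so the overall
    energy is unchanged."""
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    rms_in = np.sqrt(np.mean(x ** 2))
    rms_out = np.sqrt(np.mean(y ** 2))
    return y * (rms_in / rms_out) if rms_out > 0 else y

# A low-frequency-dominated test signal: after tilt flattening, a larger
# share of the energy sits in the upper part of the spectrum.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 4000 * t)
y = flatten_tilt(x)
```

The point is only that energy moves from low to higher frequencies while the overall level stays fixed, which is the same trade the stationary-weighting approaches below make more systematically.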
0:22:06 | but we can say that look quite a bit more generally and ask the question |
---|
0:22:10 | if all you're allowed to do is to cut with a stationary spectral weighting |
---|
0:22:15 | so essentially designing send a simple filter |
---|
0:22:18 | to apply to speech that was the best you can do |
---|
0:22:22 | in the spectral domain weeks |
---|
0:22:23 | this the general approach |
---|
0:22:25 | offline |
---|
0:22:27 | this can still be this can be must get dependent so it's context dependent it |
---|
0:22:31 | is masked is |
---|
0:22:32 | we can come up with a different special weighting |
---|
0:22:37 | and we do that offline |
---|
0:22:38 | and then online its nest every that's recognise what kind of background we have and |
---|
0:22:42 | then apply the weighting necessarily |
---|
0:22:44 | necessary for that particular |
---|
0:22:46 | a type mask |
---|
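The offline/online split just described can be sketched as a lookup table of precomputed weightings plus a crude masker classifier. Everything below (the band values, the template-matching classifier, the four-band resolution) is invented for illustration; the real weightings come from the optimisation described in the talk.

```python
import numpy as np

# Offline: one stationary spectral weighting per masker type (toy values).
WEIGHTINGS = {
    "speech_shaped": np.array([0.5, 0.8, 1.3, 1.6]),
    "white":         np.array([1.2, 1.1, 0.9, 0.8]),
}
# Offline: a template spectrum per masker, used online to recognise it.
TEMPLATES = {
    "speech_shaped": np.array([1.0, 0.6, 0.3, 0.1]),
    "white":         np.array([1.0, 1.0, 1.0, 1.0]),
}

def classify_masker(noise_spectrum: np.ndarray) -> str:
    """Pick the masker type whose (normalised) template is closest."""
    def norm(v):
        return v / np.linalg.norm(v)
    return min(TEMPLATES,
               key=lambda k: np.linalg.norm(norm(TEMPLATES[k]) - norm(noise_spectrum)))

def apply_weighting(speech_bands: np.ndarray, noise_spectrum: np.ndarray) -> np.ndarray:
    """Online step: recognise the background, apply its precomputed weighting."""
    return speech_bands * WEIGHTINGS[classify_masker(noise_spectrum)]
```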
0:22:48 | What we realised early on in this project was the really important role that objective intelligibility measures have in this whole process, simply because |
---|
0:22:57 | we want to use them as part of the closed-loop design process, the optimisation process. |
---|
0:23:02 | We can't bring back a panel of listeners every ten milliseconds to try to answer the question "how intelligible is the modification that our algorithm has just come up with?", |
---|
0:23:12 | though we do still test with listeners at the design phase. |
---|
0:23:15 | So a good intelligibility predictor is critically important. |
---|
0:23:20 | So the first intelligibility predictor we used is the glimpse proportion measure, and let me just describe what it is; it's a very simple thing. |
---|
0:23:29 | We take separate auditory representations of the speech and of the noise; just imagine some kind of cochleagram representation: a gammatone filterbank, where we take the envelope in each channel (the Hilbert envelope), downsampled. Essentially that's it. |
---|
0:23:48 | And then we ask the question: how often is the speech above the noise plus some threshold? We just measure the number of points where that's the case. |
---|
0:23:59 | A very simple, very rapidly computed intelligibility model. |
---|
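A simplified version of the glimpse proportion measure can be written with an STFT standing in for the gammatone/Hilbert-envelope front end described above. The 3 dB local-SNR threshold is a common choice in the glimpsing literature, not necessarily the value used here, and the STFT front end is a stand-in, so treat this as a sketch of the idea rather than the measure itself.

```python
import numpy as np

def stft_mag(x: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram via a Hann-windowed STFT (a stand-in for the
    gammatone envelopes used in the real measure)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

def glimpse_proportion(speech: np.ndarray, noise: np.ndarray,
                       threshold_db: float = 3.0) -> float:
    """Fraction of time-frequency points where the speech exceeds the
    noise by at least `threshold_db`: the 'glimpses'."""
    S, N = stft_mag(speech), stft_mag(noise)
    eps = 1e-12
    local_snr_db = 20 * np.log10((S + eps) / (N + eps))
    return float(np.mean(local_snr_db > threshold_db))
```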
0:24:04 | If we do that, we come up with these kinds of weightings. Again, it depends what kind of optimisation you want to use; just ignore that. |
---|
0:24:13 | These are very high-dimensional auditory spectra, say sixty-dimensional. |
---|
0:24:19 | One thing here: if you read these icons, this is speech-shaped noise, a competing speaker (CS), |
---|
0:24:27 | speech-modulated noise, and white noise, |
---|
0:24:29 | so four different maskers. We've also got different SNRs: ten, five, zero, minus five, minus ten. |
---|
0:24:36 | And there are some interesting things going on in the optimal spectral weightings it comes up with. |
---|
0:24:42 | Note that earlier work used much lower-dimensional representations, octave weightings: you had, say, six to eight octave-band weightings, or even third-octave weightings, maybe twenty-odd third-octave bands. Here we have a much higher-dimensional representation. |
---|
0:24:56 | And there's an unexpected, or at least somewhat unexpected, result: as the SNR decreases, we see that this optimal weighting gets more extreme, more binary. |
---|
0:25:08 | We call it sparsity, because what it's essentially doing is shifting the energy into certain frequency regions: a limited set of regions gets boosted, with attenuation in the neighbouring regions. |
---|
0:25:21 | This was only partially unexpected. |
---|
0:25:24 | the question is what the what was this all amount to foolishness |
---|
0:25:29 | i display your an example of |
---|
0:25:31 | of what these things sound like to this is just the on model modified speech |
---|
0:25:35 | a large size in stockings is how to sell |
---|
0:25:38 | from the whole corpus |
---|
0:25:39 | this is the modified |
---|
0:25:41 | i stockings as a cluster |
---|
0:25:44 | one of the modified |
---|
0:25:46 | and they are of course equally intelligible, I hope, in quiet, but in
---|
0:25:51 | noise...
---|
0:26:00 | you know the sentence, but I think it should be reasonably evident that
---|
0:26:04 | the modified speech is more intelligible. And so as part of the Hurricane
---|
0:26:08 | Challenge we
---|
0:26:09 | entered this particular algorithm, and got improvements of up to, say, fifteen
---|
0:26:16 | percentage points
---|
0:26:17 | across different conditions and SNRs, but
---|
0:26:20 | roughly that amount
---|
0:26:23 | it is perhaps more useful to think of these in terms of dB improvements
---|
0:26:27 | and so we use this
---|
0:26:29 | idea of an equivalent intensity increase. The idea is, if you modify speech
---|
0:26:36 | how much would you need to boost the unmodified speech by
---|
0:26:41 | that is, how much would you have to increase the SNR
---|
0:26:44 | to get the same level of performance?
---|
0:26:47 | and this can be computed
---|
0:26:50 | by estimating psychometric functions for each of the maskers you use, and
---|
0:26:53 | using the mapping from the unmodified speech to the modified speech
---|
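The equivalent intensity increase just described can be sketched as a lookup on the unmodified-speech psychometric function. This is a minimal illustration, assuming linear interpolation between measured points; the actual analysis fits smooth psychometric curves per masker, and the function name and arguments here are invented for the example.

```python
import numpy as np

def equivalent_gain_db(snrs, plain_scores, modified_score, at_snr):
    """Equivalent intensity increase (in dB): how much the unmodified
    speech's SNR would have to rise to match the modified speech's
    intelligibility score measured at `at_snr`.

    `snrs` / `plain_scores` sample the unmodified-speech psychometric
    function (scores must increase monotonically with SNR).
    """
    snr_needed = float(np.interp(modified_score, plain_scores, snrs))
    return snr_needed - at_snr
```

For instance, if modified speech at -5 dB SNR scores as well as plain speech does at 0 dB SNR, the modification is worth 5 dB.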
0:26:58 | what this tells us
---|
0:27:00 | if you look at the subjective part
---|
0:27:05 | these filled lines here
---|
0:27:07 | is that we are getting about two dB of improvement using that static spectral weighting
---|
0:27:11 | which is quite useful: two dB is maybe somewhere between ten and fifteen
---|
0:27:16 | percentage points or so
---|
0:27:19 | now, something else this figure shows
---|
0:27:21 | these white bars here
---|
0:27:23 | are the predictions made on the basis of the
---|
0:27:26 | objective intelligibility model that was used to design the weighting in the first place
---|
0:27:30 | and as you can see, the predictions are not really that good
---|
0:27:34 | I mean
---|
0:27:35 | you could look at this and say, well, they are quite correlated, but
---|
0:27:37 | they are not really very good at all
---|
0:27:40 | there is quite a big discrepancy in these cases here
---|
0:27:43 | of course
---|
0:27:45 | in
---|
0:27:46 | one sense it doesn't matter, because we are still getting improvements for listeners
---|
0:27:51 | but on the other hand, if we had a better objective intelligibility model
---|
0:27:54 | than glimpse proportion, for instance, then we might expect bigger gains
---|
0:27:59 | so that motivates one of the things
---|
0:28:02 | we have been focusing on
---|
0:28:05 | a lot
---|
0:28:05 | which is improving intelligibility models for modified and indeed synthetic speech
---|
0:28:11 | so what you are seeing here, you might recognise some of these
---|
0:28:15 | abbreviations: this is the Speech Intelligibility Index, the Extended Speech Intelligibility Index
---|
0:28:20 | this is one from Dau's lab
---|
0:28:22 | et cetera; these are quite recent intelligibility metrics
---|
0:28:25 | seven or so of them
---|
0:28:27 | and these are five glimpse-based metrics
---|
0:28:31 | that we have designed
---|
0:28:32 | to try to improve matters
---|
0:28:34 | the thing is, the one that we were using to design these static
---|
0:28:36 | spectral weightings is this one, and it just didn't perform that well, actually
---|
0:28:40 | but most of the metrics don't really perform so well
---|
0:28:43 | on modified speech
---|
0:28:45 | normally they perform fine
---|
0:28:46 | with model correlations of at least point nine
---|
0:28:50 | for natural
---|
0:28:51 | speech; they are poorer for modified and synthetic speech
---|
0:28:55 | so one question now is what happens if we
---|
0:28:57 | do the same
---|
0:28:59 | static
---|
0:29:00 | spectral
---|
0:29:01 | weight estimation
---|
0:29:03 | but using this high-energy glimpse proportion
---|
0:29:07 | metric instead
---|
0:29:09 | this is really just a series of adaptations to the normal glimpse proportion
---|
0:29:13 | well
---|
0:29:14 | what we are doing over here: this is the normal glimpse proportion
---|
0:29:17 | and what we do in here is add on something which represents the hearing
---|
0:29:22 | threshold level
---|
0:29:23 | sometimes we present speech at such a low SNR that some
---|
0:29:28 | of the speech itself within the mixture, when it is presented to listeners at
---|
0:29:32 | a given level in dB or whatever
---|
0:29:34 | is actually below the threshold of hearing
---|
0:29:36 | and this has a big effect on the intelligibility prediction, so that is catered for over here
---|
0:29:42 | you have also got
---|
0:29:43 | a sort of logarithmic compression
---|
0:29:46 | to
---|
0:29:47 | deal with the fact that
---|
0:29:49 | glimpses are very redundant: you probably only need thirty percent of the spectro-temporal plane's glimpses
---|
0:29:53 | to get
---|
0:29:53 | ceiling performance
---|
0:29:56 | that handles that
---|
0:29:57 | and this is a durational modification factor
---|
0:30:00 | which attempts to cater for the fact that
---|
0:30:04 | rapid speech is less intelligible. So there are a few changes in there that I am not really
---|
0:30:08 | going to go too much into here
---|
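The glimpse idea, with the hearing-threshold adaptation just mentioned, can be sketched compactly. This is a simplified illustration, not the published high-energy glimpse proportion model: the 3 dB local SNR criterion is a commonly used value, but the threshold settings, the compression function, and the function name are assumptions of this sketch (the durational factor is omitted).

```python
import numpy as np

def glimpse_metric(speech_db, masker_db, local_snr_db=3.0,
                   hearing_threshold_db=0.0, compress=True):
    """Glimpse-style intelligibility sketch on spectro-temporal
    excitation patterns (bands x frames, both in dB).

    A point counts as a 'glimpse' when speech exceeds the masker by
    `local_snr_db` AND is itself above an absolute hearing threshold,
    the adaptation mentioned for very low presentation levels.  The
    optional log compression stands in for the redundancy correction
    (only ~30% of glimpses are needed for ceiling performance).
    """
    s = np.asarray(speech_db, dtype=float)
    m = np.asarray(masker_db, dtype=float)
    glimpsed = (s - m > local_snr_db) & (s > hearing_threshold_db)
    p = glimpsed.mean()                      # plain glimpse proportion
    return np.log1p(9.0 * p) / np.log(10.0) if compress else p
```

The compressive mapping keeps the score in [0, 1] but rises steeply at small glimpse proportions, reflecting the redundancy argument in the talk.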
0:30:11 | but just to show you the patterns that
---|
0:30:12 | come out of this optimisation
---|
0:30:15 | process
---|
0:30:16 | what we are seeing is actually quite similar patterns to the preceding model, with
---|
0:30:21 | some differences. These are six different noise types
---|
0:30:23 | low-pass, high-pass
---|
0:30:24 | this is low-pass, this is high-pass noise, white noise
---|
0:30:27 | and again a modulated noise
---|
0:30:28 | competing-talker noise and speech-shaped noise, but essentially we see pretty much a boost of
---|
0:30:33 | the high frequencies
---|
0:30:38 | what do we find here? Well, we changed corpus here a little bit
---|
0:30:40 | it became more convenient for me, working
---|
0:30:44 | in Spain, to have Spanish listeners rather than
---|
0:30:46 | rely on
---|
0:30:47 | my ex-colleagues, English, Scottish, whatever, to run some experiments with this
---|
0:30:54 | so this is with Sharvard, which is a Spanish version of the Harvard
---|
0:30:57 | sentences
---|
0:30:59 | and what you are seeing here are gains in percentage points
---|
0:31:04 | these are not relative gains
---|
0:31:05 | we see gains of up to fifty-five percentage points from static spectral
---|
0:31:10 | weighting
---|
0:31:12 | in the best cases, and in some other cases around twenty or thirty
---|
0:31:16 | it doesn't work at all in white noise, which we put down to
---|
0:31:20 | continuing problems
---|
0:31:23 | further problems of the objective intelligibility metric
---|
0:31:26 | but nevertheless we can see that for a very simple approach, which could be implemented
---|
0:31:29 | as a
---|
0:31:29 | simple linear filter
---|
0:31:31 | we can get
---|
0:31:32 | some pretty big gains
---|
0:31:34 | in noise using these approaches
---|
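A static spectral weighting really can be implemented as a simple linear filter, as the talk notes. The sketch below applies a crude two-band tilt in the FFT domain and then rescales so total energy is unchanged; the 10 dB tilt and 1 kHz cutoff are arbitrary assumptions for illustration, whereas the talk's weightings were optimised per frequency band.

```python
import numpy as np

def apply_spectral_weighting(x, fs, boost_db=10.0, cutoff_hz=1000.0):
    """Masker-independent static spectral weighting as a zero-phase
    FFT-domain linear filter: attenuate below `cutoff_hz`, boost above
    it, then rescale so the signal's total energy is unchanged
    (the energy-reallocation constraint used in the talk).
    """
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    gain = np.where(f < cutoff_hz,
                    10 ** (-boost_db / 20.0),    # cut the low frequencies
                    10 ** (boost_db / 20.0))     # boost the high frequencies
    y = np.fft.irfft(X * gain, n=len(x))
    y *= np.sqrt(np.sum(x ** 2) / np.sum(y ** 2))  # keep energy fixed
    return y
```

Because the operation is a fixed linear filter, it could equally be realised as a time-domain FIR filter for real-time use.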
0:31:38 | and a natural question we wanted to ask is
---|
0:31:41 | to what extent do we need to make these weightings masker-dependent
---|
0:31:45 | because if you look at the weightings we have
---|
0:31:49 | up here
---|
0:31:50 | for the different maskers, we tend to see a somewhat similar pattern
---|
0:31:53 | we tend to see
---|
0:31:54 | a preference for getting the energy up into the
---|
0:31:57 | high frequencies
---|
0:31:58 | with maybe a sort of
---|
0:32:00 | tendency to preserve some very low-frequency information, which might be related to encoding voicing
---|
0:32:05 | for instance
---|
0:32:08 | so we tried out a number of
---|
0:32:09 | static spectral weightings in a masker-independent sense. This is the simplest one
---|
0:32:14 | which essentially transfers
---|
0:32:16 | or
---|
0:32:17 | reallocates lots of energy from the low frequencies, below one kHz
---|
0:32:21 | to the regions above
---|
0:32:23 | with no attempt to produce a cleverer profile
---|
0:32:27 | that is this one here
---|
0:32:28 | and then we tested out the idea of sparse boosting, just boosting
---|
0:32:32 | a few channels
---|
0:32:33 | sparse boosting with some low-frequency information preserved
---|
0:32:36 | as well
---|
0:32:37 | and, as a control, a random
---|
0:32:39 | selection of information
---|
0:32:41 | in the high frequencies
---|
0:32:44 | and it turns out
---|
0:32:45 | slightly to our surprise
---|
0:32:47 | that the masker-independent weighting, which is these black bars here
---|
0:32:52 | in nearly all conditions
---|
0:32:55 | does as well as the masker-dependent weighting
---|
0:32:58 | which is the white bars
---|
0:33:00 | copied from a
---|
0:33:01 | couple of slides back
---|
0:33:03 | none of the other weightings did quite so well, although they in general produced improvements
---|
0:33:08 | so what this is saying, really, is that
---|
0:33:11 | for a wide variety of common noises
---|
0:33:14 | say babble noise in particular, which is basically the noise in cafes and transport interchanges, and
---|
0:33:18 | speech-shaped noises
---|
0:33:20 | we can get pretty significant improvements from a simple approach of spectral weighting
---|
0:33:26 | there is lots more to say about spectral
---|
0:33:28 | modifications, lots more to be done, but
---|
0:33:31 | I want to give a kind of
---|
0:33:33 | broader look at
---|
0:33:34 | all the various options, so let me move on and look at temporal modifications
---|
0:33:39 | the first thing to look at
---|
0:33:41 | is this question of duration, or speech-rate, changes
---|
0:33:45 | you might think that slowing speech down, in the way that Lombard speech
---|
0:33:49 | does, at least for certain segments
---|
0:33:51 | is done for
---|
0:33:53 | good reasons
---|
0:33:54 | because, presumably, speakers are trying to make things easier for the
---|
0:33:57 | interlocutor
---|
0:34:00 | so what we looked at was
---|
0:34:02 | whether or not the slower speech rate of Lombard speech actually helps at all
---|
0:34:05 | what you see here is the method we used: this is plain speech, this
---|
0:34:08 | is Lombard speech
---|
0:34:10 | then we simply time-aligned, nonlinearly, the Lombard speech with the plain speech
---|
0:34:15 | and once you have got the time alignment you can then do things like transplanting spectral
---|
0:34:19 | information
---|
0:34:20 | from, say, the Lombard speech into the plain speech
---|
0:34:23 | in a time-aligned sense
---|
0:34:25 | well, the answer to the question
---|
0:34:27 | of whether or not duration helps
---|
0:34:30 | is no
---|
0:34:31 | and this is not the only study to have found this
---|
0:34:34 | neither linear stretching nor nonlinear
---|
0:34:37 | as in this case, nonlinear time alignment, gives benefits. You can see the
---|
0:34:41 | benefits at these two points here
---|
0:34:43 | over unmodified speech: this is natural Lombard speech
---|
0:34:47 | these are the spectral modifications, local modifications meaning
---|
0:34:51 | spectral transplantation having done the nonlinear time warping
---|
0:34:55 | nothing helps
---|
0:34:56 | except the spectral changes
---|
0:34:57 | these decreases are not significant, but they are clearly not in the right direction
---|
0:35:03 | but I will
---|
0:35:03 | come back a little bit later on to a result which seems
---|
0:35:06 | to contradict
---|
0:35:07 | this
---|
0:35:10 | so what I want to do for, say, the next
---|
0:35:12 | five or ten minutes
---|
0:35:13 | is give a slightly
---|
0:35:15 | richer interpretation of durational changes
---|
0:35:18 | and this is
---|
0:35:19 | what happens to speech
---|
0:35:21 | when you are talking in the presence of a temporally modulated masker
---|
0:35:25 | just think about that: any time you go into a cafe or
---|
0:35:29 | something
---|
0:35:31 | you are dealing with a modulated background
---|
0:35:35 | is there anything that we as speakers do in a modulated background to make
---|
0:35:39 | life easier for the listener?
---|
0:35:41 | these situations have barely been studied
---|
0:35:43 | and yet they have the potential
---|
0:35:45 | we thought
---|
0:35:47 | and continue to think
---|
0:35:49 | to reveal
---|
0:35:50 | some more complex behaviour on the part of speakers to help listeners
---|
0:35:56 | so here is the task
---|
0:35:58 | that we used: it is the Diapix task. There is a visual barrier between these two
---|
0:36:02 | talkers, the visual barrier here
---|
0:36:05 | they are wearing headphones, listening to modulated maskers of different types
---|
0:36:09 | varying in gap
---|
0:36:10 | density, so there is some
---|
0:36:12 | opportunity, let's say, for the talkers to maybe get into the gaps
---|
0:36:16 | here is a bit of a link with the overlapping
---|
0:36:18 | speech material this morning
---|
0:36:20 | and they have two different Diapix pictures
---|
0:36:22 | so they need to communicate: it is a spot-the-difference task. So this
---|
0:36:26 | is
---|
0:36:26 | an example of what that sounds like
---|
0:36:29 | as you listen
---|
0:36:32 | see if you can
---|
0:36:33 | imagine
---|
0:36:34 | the masker being present; the masker is not present in this example
---|
0:36:38 | you won't hear the masker, but it was present for these talkers
---|
0:36:43 | "you can... okay, go on then"
---|
0:36:46 | "and in the middle right-hand box"
---|
0:36:49 | "the middle row, there has to be three and five"
---|
0:36:52 | "no, the colour row..."
---|
0:36:54 | I mean, the timing wasn't quite natural; what you hear is not really
---|
0:36:57 | an entirely normal conversation
---|
0:36:59 | there is a third party, and that third party
---|
0:37:02 | is a modulated masker in this case
---|
0:37:05 | now, there are lots of interesting things going on in overlap, as I am sure
---|
0:37:08 | I don't need to tell you
---|
0:37:10 | although
---|
0:37:11 | this is not
---|
0:37:12 | you know, quite the meetings-
---|
0:37:15 | style overlap
---|
0:37:16 | because obviously it is not a competing talker in the background; we will see some examples of that
---|
0:37:20 | in a moment
---|
0:37:21 | what I simply want to focus on is the overlap
---|
0:37:23 | simply the degree of overlap
---|
0:37:25 | with the masker
---|
0:37:28 | do the talkers treat the masker like an interlocutor?
---|
0:37:32 | that is, do they tend to avoid overlap
---|
0:37:34 | or not?
---|
0:37:35 | what we found is that to some extent, yes. This is showing the reduction in overlap
---|
0:37:41 | these are just the four different maskers, the dense and sparse maskers. In the
---|
0:37:45 | cases where there is
---|
0:37:46 | more potential for reducing overlap, where it is easier to do so
---|
0:37:49 | we do see a reduction in overlap
---|
0:37:52 | however, they do not achieve this by increasing speech rate overall; they speak
---|
0:37:56 | more
---|
0:37:58 | only when there is no overlap
---|
0:37:59 | that is, when there is no
---|
0:38:01 | background speech, and that is what is responsible for
---|
0:38:04 | the decrease in overlap; this is normalised, of course
---|
0:38:07 | by
---|
0:38:08 | speech activity
---|
0:38:11 | so what else are speakers doing?
---|
0:38:13 | well
---|
0:38:14 | to try to work out what speakers are doing when noise is present, or indeed when noise
---|
0:38:18 | isn't present
---|
0:38:20 | we use a technique which
---|
0:38:22 | was developed for system identification
---|
0:38:25 | called reverse correlation
---|
0:38:27 | this has been used, for instance, to try to identify
---|
0:38:31 | nonlinear systems, although strictly speaking it really only applies to linear systems; here
---|
0:38:35 | we are dealing with the entire speech
---|
0:38:39 | perception process, and then also the speech production process in response to
---|
0:38:43 | the speech we are listening to. So we have got two highly nonlinear systems, and so
---|
0:38:47 | it shouldn't really work
---|
0:38:48 | but nevertheless, what we do is
---|
0:38:50 | look at all events of a particular type
---|
0:38:53 | in the corpus, let's say all occasions when the person you are talking to
---|
0:38:57 | stops speaking: offsets
---|
0:38:59 | and we ask what was going on
---|
0:39:01 | in your own speech, and in the interlocutor's speech
---|
0:39:04 | at that point
---|
0:39:06 | so we just encode all those events, almost like spikes
---|
0:39:09 | and then we take a window
---|
0:39:11 | look at speech activity, and average over all of those exemplars
---|
0:39:15 | that gives us what we call event-related activity, which is what you are seeing here
---|
0:39:19 | the window is plus or minus one second
---|
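The averaging step just described, encoding events as spikes and averaging speech activity in a window around them, can be sketched as follows. This is an illustrative reconstruction of the procedure, not the study's analysis code; events too close to the recording edges are simply skipped here, which is one possible handling among several.

```python
import numpy as np

def event_related_activity(activity, event_frames, half_window):
    """Reverse-correlation style event-related activity: average a
    frame-level speech-activity track in a window around each event
    (e.g. every masker offset), treating events as unit 'spikes'.

    Returns one curve covering -half_window .. +half_window frames.
    """
    a = np.asarray(activity, dtype=float)
    segments = [a[e - half_window: e + half_window + 1]
                for e in event_frames
                if e - half_window >= 0 and e + half_window < len(a)]
    return np.mean(segments, axis=0)
```

With a one-second window on each side, as in the talk, `half_window` would be the number of frames per second.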
0:39:21 | take the simple case first: there is no noise present here, this is just
---|
0:39:24 | looking at the
---|
0:39:26 | activity in response to an interlocutor
---|
0:39:28 | so this is simply saying, we take all the points at which
---|
0:39:31 | your interlocutor stops talking: what do you do?
---|
0:39:34 | well, not surprisingly
---|
0:39:35 | you are more likely to start talking; this is what turn-taking is really
---|
0:39:39 | about
---|
0:39:40 | and we see the reverse pattern
---|
0:39:42 | on the other side
---|
0:39:43 | but the interesting question is what happens with the masker. We take the masker's bounds, so
---|
0:39:47 | what happens when the masker goes off? What do you do as a talker?
---|
0:39:51 | well
---|
0:39:52 | not very much at first, but then
---|
0:39:54 | shortly afterwards
---|
0:39:56 | you increase your
---|
0:39:57 | likelihood of speaking
---|
0:40:00 | and likewise
---|
0:40:00 | in the onset case. You can see it a bit more clearly if we
---|
0:40:04 | just look at the difference between the onset and offset; it is symmetric for all
---|
0:40:09 | intents and purposes
---|
0:40:10 | and so we see what we call a contrast curve
---|
0:40:13 | this is really just showing that. In the interlocutor case
---|
0:40:16 | we see a very nice curve
---|
0:40:19 | and a rather weaker one in the masker case, because you can't guess
---|
0:40:23 | when the masker's bands are going to take place. There is really no difference here, that is, right
---|
0:40:27 | up until a few hundred milliseconds after the masker
---|
0:40:30 | has come on or gone off
---|
0:40:31 | then we see a change in speaker activity. What this is showing is that talkers
---|
0:40:35 | are sensitive to the maskers
---|
0:40:36 | and do respond in some way
---|
0:40:40 | now, there are at least seven possible strategies that talkers might be using
---|
0:40:43 | and without going into too much detail, let me simply say that
---|
0:40:47 | it isn't the case
---|
0:40:49 | that when a masker comes on
---|
0:40:52 | talkers tend to stop; that would be this "stop"
---|
0:40:55 | strategy here
---|
0:40:57 | it is more the case that
---|
0:40:59 | they tend not to start
---|
0:41:01 | when a masker is on. The two things, if you think about it, might look
---|
0:41:03 | the same when you average across events, which is why we need to distinguish between the two
---|
0:41:07 | so we see lots of evidence for a "talk" strategy: when the masker goes off
---|
0:41:11 | you are more likely to start talking, which makes sense
---|
0:41:13 | and if the masker comes on, you are less likely to start talking. There is a little bit
---|
0:41:18 | of evidence
---|
0:41:18 | that the masker causes you to stop talking
---|
0:41:21 | but it is quite weak evidence
---|
0:41:25 | now, how does this work in a more natural situation, where there is
---|
0:41:30 | another conversation present in the background, rather than this
---|
0:41:33 | slightly artificial
---|
0:41:35 | modulated background noise?
---|
0:41:38 | so these were some experiments we carried out, both in English and
---|
0:41:41 | in Spanish
---|
0:41:43 | the basic scenario is that we have a pair of talkers here having a conversation
---|
0:41:47 | on their own for the first five minutes
---|
0:41:48 | and then they are joined for the next ten minutes by another pair of talkers
---|
0:41:52 | and then, for symmetry purposes, the first pair leave. So we have got a period where
---|
0:41:56 | we have got two parallel conversations
---|
0:41:58 | the second group is not allowed to talk to the first group, and vice versa
---|
0:42:01 | and so we are really interested, in a very natural situation, in how one conversation affects
---|
0:42:06 | another conversation. I will just play you an example
---|
0:42:09 | and you will be helped a little bit by the transcription on the right-hand side
---|
0:42:13 | if you try to follow it
---|
0:42:14 | (audio example plays)
---|
0:42:26 | (audio example continues)
---|
0:42:34 | this is the natural overlap situation. If you count up the percentage of overlap here, it is not
---|
0:42:37 | twenty-five percent
---|
0:42:39 | across the entire corpus it is more like eighty percent
---|
0:42:42 | between turns, twenty percent within turns
---|
0:42:45 | so I want to note a couple of things here
---|
0:42:48 | one of the things that talkers do
---|
0:42:50 | in
---|
0:42:51 | a situation like that
---|
0:42:54 | is drastically reduce
---|
0:42:56 | the amount of natural overlap that they allow with their conversational partner
---|
0:43:01 | the figure that was mentioned this morning was about twenty-five percent; we find the
---|
0:43:03 | same
---|
0:43:04 | so the thing to see here: when there is no
---|
0:43:06 | background present
---|
0:43:07 | in the audio-visual condition
---|
0:43:08 | we have got the natural state of the two-person dialogue: roughly
---|
0:43:12 | twenty-five percent of the material is overlapped
---|
0:43:15 | switch on the background and
---|
0:43:16 | you see that is reduced. That is one big change. Another change we see is when we
---|
0:43:21 | remove the visual modality: you might have noticed in that picture they were wearing
---|
0:43:24 | visors in one of the conditions
---|
0:43:28 | and that also causes a bit of a reduction in overlap
---|
0:43:30 | and in responses
---|
0:43:32 | but
---|
0:43:33 | the interesting question is
---|
0:43:34 | to what extent are talkers in this more natural situation aware of what is going
---|
0:43:38 | on in the background
---|
0:43:39 | and adapting accordingly
---|
0:43:41 | so these are the event-related activity plots, like the ones we saw before
---|
0:43:45 | this is with no background present, so we see this turn-taking behaviour
---|
0:43:49 | and this is where the visual information
---|
0:43:53 | is removed, with the visors on
---|
0:43:54 | so we cannot see the interlocutor's lips
---|
0:43:57 | but the interesting cases are these ones, where the noise is present
---|
0:44:00 | and so this is
---|
0:44:01 | showing the
---|
0:44:03 | activity in response to the noise
---|
0:44:05 | although it is a much weaker pattern
---|
0:44:06 | we still see the same
---|
0:44:08 | sensitivity to the noise in this highly naturalistic situation
---|
0:44:12 | so the foreground conversation, to summarise all this
---|
0:44:15 | is affected by the background
---|
0:44:17 | by background conversations
---|
0:44:20 | what has this all got to do with
---|
0:44:22 | speech technology? well
---|
0:44:24 | out of this grew an algorithm for glimpse-based retiming
---|
0:44:27 | and
---|
0:44:28 | this was also submitted to the Hurricane Challenge
---|
0:44:31 | and the idea here, the approach here
---|
0:44:35 | is
---|
0:44:36 | a general dynamic-time-warping-based approach
---|
0:44:38 | where we take a speech signal, and here is the masker, and we say
---|
0:44:43 | if we are allowed, on a frame-by-frame basis, to modify the speech signal
---|
0:44:47 | to achieve some objective
---|
0:44:49 | whatever that is
---|
0:44:51 | then we can do so by
---|
0:44:53 | finding, or defining, the
---|
0:44:56 | least-cost path through
---|
0:44:58 | some cost matrix
---|
0:45:03 | and we end up with modified speech
---|
0:45:05 | with temporal changes
---|
0:45:07 | so the important question now is
---|
0:45:09 | what we put in as the cost function
---|
0:45:11 | we tried various things. One of them is
---|
0:45:13 | based on glimpsing again; that is where the glimpse component of
---|
0:45:17 | the retiming comes in
---|
0:45:18 | and the other component is cochlear-scaled entropy, which is a measure of
---|
0:45:23 | information content in speech
---|
0:45:25 | so, to put it in simple terms, what we try to do is find the
---|
0:45:28 | path
---|
0:45:29 | which maximises the number of glimpses of speech you are going to get, by shifting speech away
---|
0:45:33 | from epochs where the masker is intense
---|
0:45:37 | while at the same time being sensitive
---|
0:45:39 | to speech information content
---|
0:45:41 | where speech information content is defined by cochlear-scaled entropy
---|
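The least-cost-path idea behind the retiming can be sketched as a small dynamic program. This is an illustration only, not the published algorithm: the product cost `speech_info[i] * masker_energy[j]` is an assumed stand-in for the glimpse and cochlear-scaled-entropy terms, and the move set is reduced to "emit the next frame" or "insert a pause".

```python
import numpy as np

def least_cost_retiming(speech_info, masker_energy):
    """Place n speech frames, in order, into t >= n output slots
    (unused slots become pauses), minimising the total cost of the
    emitted frames.  Information-rich frames are thereby steered away
    from intense masker epochs, the core of the retiming idea.
    """
    info = np.asarray(speech_info, dtype=float)
    mask = np.asarray(masker_energy, dtype=float)
    n, t = len(info), len(mask)
    assert t >= n, "need at least one output slot per speech frame"
    cost = np.outer(info, mask)            # cost of frame i landing in slot j
    d = np.full((n + 1, t + 1), np.inf)
    d[0, :] = 0.0                          # no frames placed yet: zero cost
    for i in range(1, n + 1):
        for j in range(i, t + 1):
            d[i, j] = min(d[i, j - 1],                           # slot j is a pause
                          d[i - 1, j - 1] + cost[i - 1, j - 1])  # emit frame i here
    slots, j = [], t                       # backtrack the chosen slot per frame
    for i in range(n, 0, -1):
        while j > i and d[i, j] == d[i, j - 1]:
            j -= 1                         # this slot was left as a pause
        slots.append(j - 1)
        j -= 1
    return slots[::-1], float(d[n, t])
```

With a masker that is intense at the edges and quiet in the middle, the path pushes both frames into the quiet slots, which is exactly the "shift speech into the gaps" behaviour described.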
0:45:46 | and it turns out that this
---|
0:45:47 | is a pretty successful strategy, worth about four decibels
---|
0:45:52 | of improvement
---|
0:45:53 | in the Hurricane Challenge
---|
0:45:59 | now
---|
0:46:01 | as to
---|
0:46:03 | the way the algorithm was set up: it was allowed a small
---|
0:46:06 | amount of elongation
---|
0:46:07 | for various reasons we were interested in preserving some temporal structure, so we allowed a
---|
0:46:11 | little bit of elongation
---|
0:46:13 | half a second either side of the sentence
---|
0:46:15 | and of course, not surprisingly, most of the time the retiming algorithm
---|
0:46:21 | exploits that fact
---|
0:46:22 | its strategy is to stretch, or shift, bits of speech around
---|
0:46:26 | into the gaps, and also to exploit the silence
---|
0:46:29 | so is what the retiming buys us simply the elongation? Well, our previous results
---|
0:46:33 | would suggest that
---|
0:46:34 | elongation doesn't help; that is how I began this section
---|
0:46:38 | elongation doesn't help
---|
0:46:39 | but strangely
---|
0:46:40 | in the case of the modulated masker, competing speech in this case
---|
0:46:44 | we found that simply elongating did help
---|
0:46:49 | not as much as retiming
---|
0:46:50 | but maybe about half the effect could be due to pure elongation
---|
0:46:54 | whereas
---|
0:46:55 | with a stationary masker
---|
0:46:57 | speech-shaped noise in this case
---|
0:46:59 | we find elongation doesn't help, which is
---|
0:47:02 | consistent with the existing picture. So what is really going on here?
---|
0:47:06 | well, the reason that people don't find improvements with durational approaches like stretching is
---|
0:47:12 | that most of the work has been done looking at stationary maskers
---|
0:47:16 | and with a stationary masker, if you simply elongate
---|
0:47:19 | you are not introducing any new information
---|
0:47:21 | because the masker itself is stationary. But with a modulated masker
---|
0:47:25 | if you stretch a sound out
---|
0:47:27 | and parts of it were masked, some fragments of it, fragments
---|
0:47:31 | needed for identification, are going to escape the masking
---|
0:47:34 | and that is what we think is responsible here
---|
0:47:36 | the other important thing here
---|
0:47:38 | that came out of this is that the retiming itself appears to be intrinsically harmful
---|
0:47:43 | so, strangely, something which is really beneficial for one masker
---|
0:47:48 | where we get the biggest of the gains
---|
0:47:50 | is actually harmful for the stationary masker. So we are
---|
0:47:55 | distorting the acoustic-phonetic integrity of the speech
---|
0:47:59 | and yet it is still the same retimed speech; it has still got the same
---|
0:48:04 | distortions in it
---|
0:48:05 | but in the case of the modulated masker
---|
0:48:07 | it is highly intelligible
---|
0:48:10 | there is a lot more one could say about that
---|
0:48:12 | but
---|
0:48:14 | what I want to do now is give you a
---|
0:48:17 | picture of where we are
---|
0:48:19 | with these speech modifications: what can we achieve?
---|
0:48:22 | so there were a couple of Hurricane Challenges: one we ran internally
---|
0:48:25 | within the Listening Talker project, and then one that was a larger evaluation, whose results
---|
0:48:29 | were presented at Interspeech
---|
0:48:31 | and
---|
0:48:32 | the goal was that
---|
0:48:34 | the people providing the
---|
0:48:36 | modified speech
---|
0:48:38 | had access to the maskers at given SNRs
---|
0:48:41 | and simply returned
---|
0:48:43 | modified speech to us; we then evaluated it with a very large number of listeners
---|
0:48:47 | and these are some of the entries
---|
0:48:49 | so
---|
0:48:51 | plain speech
---|
0:48:52 | "a large size in stockings is hard to sell"
---|
0:48:55 | natural Lombard speech
---|
0:48:57 | "a large size in stockings is hard to sell"
---|
0:49:01 | some unmodified TTS
---|
0:49:03 | "a large size in stockings is hard to sell"
---|
0:49:06 | this one has
---|
0:49:09 | Lombard properties applied to TTS, so a Lombard-
---|
0:49:12 | adapted TTS
---|
0:49:15 | "a large size in stockings is hard to sell"; that is the synthetic voice
---|
0:49:20 | trying to compete with noise as well
---|
0:49:22 | there were a number of other techniques
---|
0:49:23 | I will play this one because this was the winning entry
---|
0:49:25 | "a large size in stockings is hard to sell"
---|
0:49:28 | on this website you will find these and lots more examples
---|
0:49:34 | well, these are the results of the internal challenge, of the systems
---|
0:49:38 | and an entry, SSDRC, which came from
---|
0:49:43 | Yannis Stylianou's lab
---|
0:49:45 | at the University of Crete
---|
0:49:48 | was the winning entry
---|
0:49:50 | producing gains of about thirty-six, thirty-seven percentage points in this condition
---|
0:49:56 | what does that amount to in dB terms? Well, it amounts to about
---|
0:49:59 | five dB
---|
0:50:00 | which demonstrates, I think, the useful gains available from speech modification approaches
---|
0:50:05 | what you can also see here, which I think is interesting, is that Lombard speech
---|
0:50:09 | natural Lombard speech, in this condition
---|
0:50:11 | just this case here, actually produced a gain of only about one
---|
0:50:15 | dB. So we are getting super-Lombard performance
---|
0:50:18 | out of some of these modification algorithms
---|
0:50:20 | even the ones that are
---|
0:50:21 | based
---|
0:50:22 | to some extent on Lombard speech
---|
0:50:24 | and TTS is a long way behind if it is unmodified, but by applying, for
---|
0:50:28 | instance, Lombard-like properties to TTS systems
---|
0:50:31 | we can improve things by over two dB
---|
0:50:37 | a slightly larger challenge, the hurricane challenge, was organised last year
---|
0:50:41 | i'm presenting its results in a slightly different way, as i'll just sketch with this
---|
0:50:45 | so what we're looking at here is the equivalent intensity change in db
---|
0:50:49 | in the face of a
---|
0:50:50 | stationary masker, the speech-shaped noise, and in this case the competing-talker masker
---|
0:50:54 | all the green points correspond to natural speech, and the baseline is where the lines
---|
0:50:58 | intersect, about there
---|
0:50:59 | and the tts entries have a lower baseline
---|
0:51:03 | they're in blue, and you can see them over here
---|
0:51:06 | this is in a fairly low-noise condition; if we go to
---|
0:51:08 | a
---|
0:51:09 | high-noise condition
---|
0:51:11 | which gives a better idea of what these things are really capable of
---|
0:51:13 | then again we see gains of about five db
---|
0:51:17 | in stationary noise, and also
---|
0:51:19 | the durational entry
---|
0:51:20 | getting close, not far behind, also in
---|
0:51:22 | fluctuating noise
---|
0:51:24 | what i really want to point out, and to me this was probably the most
---|
0:51:27 | interesting outcome of this evaluation
---|
0:51:29 | is the fact that
---|
0:51:30 | some of these tts systems, adapted
---|
0:51:33 | based on some intelligibility criterion, are actually doing really well
---|
0:51:38 | compare that with the natural baseline over here: we're getting a couple of the
---|
0:51:41 | tts systems here, which i'll play examples of
---|
0:51:43 | in a second, that are actually more intelligible than
---|
0:51:47 | natural speech
---|
0:51:48 | in noise, which i'd say is a fairly interesting achievement
---|
0:51:51 | these came from two different labs
---|
0:51:53 | one is from
---|
0:51:55 | garcia
---|
0:51:56 | and colleagues'
---|
0:51:57 | group
---|
0:51:59 | and the other from daniel erro at the
---|
0:52:02 | the
---|
0:52:03 | well, the university of the basque country
---|
0:52:05 | though that's a different group
---|
0:52:06 | i had nothing to do with this one
---|
0:52:08 | okay, so this is an example of what these sound like: first natural speech and
---|
0:52:12 | then the tts systems
---|
0:52:26 | and it's pretty evident from this that the synthetic speech is much more intelligible in those
---|
0:52:30 | cases
---|
0:52:32 | just a final thing to say about the hurricane challenge, something we did recently
---|
0:52:35 | a natural thing to do, of course, is to take spectral changes and temporal changes
---|
0:52:39 | and see whether they complement each other
---|
0:52:42 | and the short answer is yes. so this is unmodified speech
---|
0:52:46 | this is the effect of just applying temporal changes
---|
0:52:48 | with the durational algorithm
---|
0:52:50 | this is just the effect of the
---|
0:52:52 | SSDRC, in this case spectral shaping and dynamic range compression, algorithm
---|
0:52:55 | and if you put the two things together
---|
0:52:57 | you get something which isn't quite additive, but it's certainly
---|
0:53:01 | complementary: that's forty-two percentage points
---|
0:53:03 | which is a nine-to-ten-decibel impact
---|
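Operationally, "putting the two things together" amounts to chaining a temporal and a spectral modification and then rescaling to the original RMS energy, which is the constraint the challenge entries worked under. A hedged sketch; the two modification functions below are crude stand-ins, not the actual challenge algorithms:

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def match_energy(modified, reference):
    """Rescale so the modified signal keeps the reference RMS energy."""
    return modified * (rms(reference) / (rms(modified) + 1e-12))

def combine(signal, *modifications):
    """Chain modifications, then restore the original energy (equal-RMS constraint)."""
    out = signal
    for mod in modifications:
        out = mod(out)
    return match_energy(out, signal)

# Crude stand-ins for a temporal and a spectral modification:
slow_down = lambda x: np.repeat(x, 2)                    # duration change
tilt = lambda x: np.append(x[:1], x[1:] - 0.7 * x[:-1])  # pre-emphasis-like spectral tilt

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
y = combine(x, tilt, slow_down)
```

The final energy renormalisation is what makes the combination a redistribution of energy rather than simple amplification.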
0:53:07 | so, just in the last part, a couple of points
---|
0:53:11 | i want to address this question: is modified
---|
0:53:14 | speech intrinsically more intelligible, or is it just
---|
0:53:18 | evading the masker? is that how it essentially works?
---|
0:53:22 | it's a little tricky to answer this question, simply because
---|
0:53:25 | when we measure intelligibility, we normally measure intelligibility using noise
---|
0:53:29 | because otherwise performance is at ceiling. but if you've got a system which modifies speech
---|
0:53:34 | to be more intelligible in noise
---|
0:53:36 | then of course it's going to be more intelligible in noise, so you're not measuring intrinsic intelligibility
---|
0:53:40 | you're measuring the ability to overcome the masker. normally
---|
0:53:44 | that's with native listeners; if you
---|
0:53:46 | use non-native listeners, then intelligibility in
---|
0:53:49 | quiet is usually some way below the
---|
0:53:52 | ceiling performance of natives
---|
0:53:54 | so this is what we did
---|
0:53:55 | we played non-native listeners lombard speech
---|
0:53:58 | and what we found was
---|
0:53:59 | forget about most of this
---|
0:54:01 | this is the key result here: lombard speech is actually less intelligible
---|
0:54:05 | than plain speech in quiet
---|
0:54:07 | the same speech which is more intelligible in noise is less intelligible in quiet
---|
0:54:11 | for non-native listeners. and if
---|
0:54:13 | lombard speech were making improvements to acoustic-phonetic clarity, shall we say
---|
0:54:17 | just a generalised set of beneficial changes
---|
0:54:19 | then you might expect
---|
0:54:21 | to see benefits here, but we don't
---|
0:54:24 | i'll skip over that
---|
0:54:25 | something we did recently also was to ask the same question with non-native listeners
---|
0:54:29 | for SSDRC
---|
0:54:32 | which, as i say, was the winning entry in the hurricane challenge
---|
0:54:35 | and again we see the same result in quiet: with non-native listeners, who sit well
---|
0:54:38 | below ceiling, the modified speech
---|
0:54:41 | actually makes things worse
---|
0:54:45 | so, just to conclude
---|
0:54:47 | what i've tried to show is that
---|
0:54:49 | by taking some inspiration, not slavish inspiration, because i think we're sometimes going
---|
0:54:54 | beyond what the listener
---|
0:54:55 | or talker is capable of doing
---|
0:54:57 | we are able to motivate
---|
0:54:59 | some algorithms which can convert
---|
0:55:01 | speech which is nearly unintelligible into speech which is almost entirely intelligible
---|
0:55:06 | there's been a need to
---|
0:55:08 | develop objective intelligibility models to make this possible
---|
0:55:11 | and i think there is definitely scope for much more work in that area
---|
0:55:15 | the better the intelligibility models we can produce
---|
0:55:18 | the bigger the gains we'd expect them to produce
---|
0:55:21 | and i should say that this work is more or less immediately applicable to all
---|
0:55:24 | forms of speech output
---|
0:55:25 | including domestic audio coming from non-speech-technology devices, you know, radios, tvs, et cetera
---|
0:55:32 | some stuff i didn't say too much about includes work with dyslexics, basically
---|
0:55:36 | with their reading
---|
0:55:38 | to show that they benefit from
---|
0:55:41 | similar modifications too
---|
0:55:44 | one thing we do need to look at, and i touched on it in the last couple
---|
0:55:46 | of slides, is this loss of intrinsic intelligibility
---|
0:55:50 | i think this is an opportunity: we've got an algorithm here which does well in
---|
0:55:53 | noise but in quiet actually harms things somewhat. what if we could
---|
0:55:58 | what if the two things were not, you know, at odds
---|
0:56:01 | if we could somehow put the two things together, if we could make phonetic-space changes
---|
0:56:05 | at the same time as dealing with energetic masking, then we could see some really big
---|
0:56:08 | gains
---|
0:56:10 | okay, thank you very much
---|
0:56:26 | thank you, martin, for this very interesting talk
---|
0:56:30 | do you have any comments on the use of asr
---|
0:56:33 | i mean, the use of this work for asr, to improve speech recognition?
---|
0:56:40 | hmm
---|
0:56:42 | that's an interesting question
---|
0:56:45 | well, were you thinking maybe we can train talkers to
---|
0:56:48 | interact
---|
0:56:50 | more clearly with our asr devices? that's not going to happen, is it
---|
0:56:55 | i think
---|
0:56:55 | yes, of course
---|
0:56:57 | one of my original aims in the listening talker project was
---|
0:57:01 | to get as far as looking at dialogue systems
---|
0:57:03 | where asr is
---|
0:57:05 | a key component, and
---|
0:57:08 | to look at ways of improving the
---|
0:57:12 | interaction by essentially making the output part of it
---|
0:57:16 | much more context-aware
---|
0:57:18 | and of course
---|
0:57:19 | in this sense
---|
0:57:20 | if you could make the interaction smoother
---|
0:57:23 | this might also mean allowing overlap, as in natural conversation, and i guess the
---|
0:57:27 | input side might also
---|
0:57:29 | end up being smoother
---|
0:57:31 | but we didn't end up doing that, so
---|
0:57:33 | some results in asr show that it's preferable to adapt to the environment
---|
0:57:39 | rather than trying to make the speech
---|
0:57:42 | well, to remove the distortion from the speech
---|
0:57:46 | provided they have the data to adapt to
---|
0:57:51 | well, i mean, the other application, the way i think about it
---|
0:57:56 | you know, coming from my background in computational auditory scene analysis, is that we
---|
0:58:00 | often
---|
0:58:01 | set out to solve this problem of
---|
0:58:03 | taking two independent
---|
0:58:06 | sources and trying to separate them
---|
0:58:08 | without acknowledging the fact that the two are not independent at all
---|
0:58:11 | except in speech separation competitions
---|
0:58:13 | you know, we're always aware of what's going on in the background, and since we modify our
---|
0:58:17 | speech, that really ought to be factored into these algorithms; it ought to make life simpler, actually
---|
0:58:21 | for the algorithms
---|
0:58:31 | thank you, martin, that was very interesting. i'm wondering
---|
0:58:35 | well, i've probably got about twenty questions, but if i just narrow it down
---|
0:58:39 | to two here, we're good
---|
0:58:43 | in your work, were there any constraints regarding quality or naturalness of the enhancement? good
---|
0:58:50 | question; let me see how concisely i can answer that
---|
0:58:54 | okay, so
---|
0:58:55 | again, one of our original goals, thinking that we would just knock off the intelligibility
---|
0:58:59 | stuff in the first year or something
---|
0:59:01 | was to look at speech quality, and we did do a little bit of work looking
---|
0:59:04 | at objective measures of speech quality, so PESQ and this kind of stuff
---|
0:59:08 | and indeed
---|
0:59:10 | what we found was
---|
0:59:11 | so, some of the modifications i didn't talk about could produce highly distorted output. i
---|
0:59:16 | remember one modification that we produced
---|
0:59:18 | we were essentially taking the, you know, general approach of
---|
0:59:22 | suppose we equalise the snr in every time frame
---|
0:59:26 | a process a little bit like the ones i showed, perhaps, but, you know
---|
0:59:30 | more extreme; or we equalise the snr in
---|
0:59:34 | each time-frequency bin
---|
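As a sketch of this "extreme" modification, assuming the noise signal is known in advance: scale every time-frequency cell of the speech so that its local SNR hits a fixed target. The STFT parameters and the whole implementation are illustrative only; as noted, the result is highly distorting.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    w = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(X, n_fft=256, hop=128):
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for k, f in enumerate(frames):           # overlap-add resynthesis
        out[k * hop:k * hop + n_fft] += f
    return out

def equalise_snr(speech, noise, target_snr_db=0.0, n_fft=256, hop=128):
    """Force each time-frequency cell of the speech to sit target_snr_db above
    the corresponding noise cell -- the 'binary', highly distorting modification."""
    S = stft(speech, n_fft, hop)
    N = stft(noise, n_fft, hop)
    target = 10.0 ** (target_snr_db / 20.0)
    gain = target * np.abs(N) / (np.abs(S) + 1e-12)
    return istft(S * gain, n_fft, hop)
```

In practice a global RMS renormalisation would follow, so the modification redistributes energy across time and frequency rather than amplifying overall.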
0:59:36 | you can imagine the effect
---|
0:59:38 | you know, of doing that: it's highly distorting
---|
0:59:40 | and sometimes it's highly beneficial, but sometimes very harmful; it's a kind of very binary
---|
0:59:45 | type of thing. so, i mean, we did look at this to some extent, and some of
---|
0:59:48 | the other partners, i think, did
---|
0:59:50 | work on speech quality too
---|
0:59:53 | but it's
---|
0:59:54 | so
---|
0:59:57 | in this sense
---|
0:59:58 | we were looking for correlations between quality and intelligibility
---|
1:00:00 | for instance, i've just shown you some results for non-native listeners
---|
1:00:04 | we looked at their responses as a function of speech-quality differences, where we
---|
1:00:08 | might expect differences
---|
1:00:09 | you know, in the tests, their intelligibility patterns were
---|
1:00:12 | pretty much identical to native listeners' in the respects we've examined
---|
1:00:18 | even though they might treat the distortions quite differently; you might expect that the rich
---|
1:00:23 | native l1 knowledge would somehow enable listeners to handle
---|
1:00:27 | these distortions
---|
1:00:29 | more easily, but that hasn't been the case
---|
1:00:32 | we don't have that fully worked out, but a related consideration, and this is related
---|
1:00:37 | to this, is that we're using
---|
1:00:39 | this constant-rms energy constraint, whereas we should really be looking at loudness
---|
1:00:45 | which is more difficult to optimise
---|
1:00:46 | really, because you've got to agree on a loudness model first, and so on
---|
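The RMS-versus-loudness point can be illustrated numerically: two signals with identical RMS energy can differ greatly in frequency-weighted level, so an algorithm can "buy" loudness for free under an RMS constraint by moving energy to where the ear is most sensitive. A sketch using the standard A-weighting curve as a crude loudness proxy:

```python
import numpy as np

def a_weight_db(f):
    """IEC 61672 A-weighting in dB at frequency f (Hz): a crude loudness proxy."""
    f2 = float(f) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2))
    return 20.0 * np.log10(ra) + 2.0

fs = 16000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 100 * t)     # all energy at 100 Hz
high = np.sin(2 * np.pi * 3000 * t)   # same RMS, all energy at 3 kHz

rms = lambda x: np.sqrt(np.mean(x ** 2))
equal_rms = abs(rms(low) - rms(high)) < 1e-3            # identical under the constraint
weighted_gap_db = a_weight_db(3000) - a_weight_db(100)  # ~20 dB apart by this proxy
```

A full loudness model (and the perceptual validity of any such proxy) is exactly the harder optimisation target being alluded to here.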
1:00:50 | then my second one would be: you discussed the effect of the listener being matched, native
---|
1:00:59 | or not; the non-native does worse. and you mentioned working with english and spanish, but
---|
1:01:05 | i'm wondering
---|
1:01:06 | have you studied a variety of source languages and found any of them more amenable
---|
1:01:12 | to this process? maybe we should switch to a different language
---|
1:01:17 | thank you
---|
1:01:18 | that's also interesting
---|
1:01:20 | i have not done any work on
---|
1:01:22 | non-english speech output, but we did have a
---|
1:01:25 | project a few years ago looking at eight european languages from the point of
---|
1:01:28 | view of noise resistance, and there are clearly differences
---|
1:01:32 | a lot of it has got to do with a language's resistance to energetic masking
---|
1:01:35 | it's just never taken into account: we often, you know, in multi-language studies just
---|
1:01:39 | normalise using speech in quiet, and perhaps we shouldn't be doing that, actually
---|
1:01:43 | because there are languages which might seem to be able to tolerate maybe up to
---|
1:01:46 | four db more
---|
1:01:48 | noise, and aren't specially designed in that respect
---|
1:01:57 | i'm adding one
---|
1:01:59 | as you know, you are in the speaker recognition community here, so i'm quite sure you were
---|
1:02:04 | expecting my question
---|
1:02:06 | do you have any idea of the
---|
1:02:08 | possible effect of the lombard effect, or
---|
1:02:12 | all these
---|
1:02:13 | kinds of things you
---|
1:02:15 | just presented, on the
---|
1:02:17 | vulnerabilities in speaker recognition?
---|
1:02:24 | it's sparse, but, well, john hansen's done some work on this, which i guess you probably know
---|
1:02:29 | i'm not sure of its extent. okay, so, i mean, obviously people have looked at
---|
1:02:32 | speaking under stress and in very high, very high noise conditions
---|
1:02:36 | and at speaker identification based on that
---|
1:02:39 | is that your question, right?
---|
1:02:41 | my question is also linked to the forensic problem, you see, but
---|
1:02:46 | if someone's recorded in the presence of noise, so
---|
1:02:50 | using a lombard
---|
1:02:52 | voice
---|
1:02:54 | would the speaker-model information be the same as when we record this person in
---|
1:02:59 | quiet? ah, the invariance question. there's a very
---|
1:03:03 | interesting project using similar techniques, not by us, but
---|
1:03:06 | at the basque country, looking at a kind of de-lombardisation
---|
1:03:09 | which is essentially trying to map between lombard and normal speech: if you
---|
1:03:13 | know that somebody's talking in a given degree of noise, you could attempt to transform
---|
1:03:18 | the lombard speech into normal speech, so that's one thing you might want to look at
---|
1:03:39 | but predictably so, in some cases
---|
1:04:20 | precisely; i think you always really need to
---|
1:04:22 | be careful experimentally to use the latter case, because, you know
---|
1:04:26 | knowing, even being told, that you're communicating makes a huge difference
---|
1:04:33 | i think
---|
1:04:34 | i want to come back to the example of the two couples, one speaking english, the other
---|
1:04:40 | spanish
---|
1:04:41 | i guess the one couple did not understand the other?
---|
1:04:47 | we didn't test that; in the situations we had, actually, we just had the same
---|
1:04:51 | experiments done with four english or four spanish speakers. ah, but this is the question:
---|
1:04:55 | what would happen if there were two couples speaking the same language but on
---|
1:05:01 | different topics
---|
1:05:02 | and the disturbance is then not only noise but also understanding what they say? i
---|
1:05:09 | think, on the statistics, we did look at that a couple of years ago
---|
1:05:13 | and we discovered this effect of
---|
1:05:17 | the informational masking caused by sharing the same language as an interfering conversation
---|
1:05:22 | and it is the case that
---|
1:05:25 | we
---|
1:05:25 | well, it is in principle a common experience for many people
---|
1:05:29 | typically in a bilingual, trilingual country like this one
---|
1:05:33 | where
---|
1:05:35 | if somebody's talking in a language you're well aware of, even if it's not your
---|
1:05:37 | native language
---|
1:05:39 | it's just a much bigger interfering factor. so for a start, that's one of the things that
---|
1:05:42 | definitely happens: it's worth about between one and four db, depending on the different language pairs
---|
1:05:47 | that have been looked at
---|
1:05:52 | was that the other part of your question?
---|
1:05:54 | okay
---|
1:05:56 | so it's all a matter of informational masking, and again, that's another big area that we've
---|
1:06:00 | addressed, often from the perceptual point of view but not from the algorithmic point of
---|
1:06:03 | view: how to deal with it
---|
1:06:05 | or do without it
---|
1:06:19 | thank you for the talk
---|
1:06:20 | so, regarding what was actually said earlier
---|
1:06:25 | i think, well, i personally tried some speech enhancement, and some colleagues too
---|
1:06:30 | for like three years
---|
1:06:32 | and even adding it to the training data: if you do speech enhancement
---|
1:06:36 | on the test data, our systems seem to do
---|
1:06:40 | badly compared to if we give them all the noise
---|
1:06:43 | so
---|
1:06:44 | do you have the opposite experience?
---|
1:07:10 | okay, so even so, you think that, well, speech enhancement in this case doesn't, it doesn't
---|
1:07:15 | work, and have i an explanation for that
---|
1:07:27 | but it seems that if we try to remove the noise
---|
1:07:30 | the systems don't get better
---|
1:07:37 | and this is a general finding, a very surprising finding in a way, that
---|
1:07:40 | speech enhancement does not help in robust speech applications. and in a way, for listeners
---|
1:07:46 | very few speech enhancement techniques work for intelligibility purposes either
---|
1:07:49 | only one or so, you know
---|
1:07:52 | okay
---|
1:08:12 | and that works?
---|
1:08:13 | that doesn't work
---|
1:08:15 | the quality is terrible; it's related to your question. so, i mean, this space
---|
1:08:20 | is not even linearly related to these things
---|
1:08:24 | that one is very distorting, in case you were wondering about that region
---|
1:08:28 | i mean, that's not really a noise-reduction tool; it's the dynamic range compression type of
---|
1:08:32 | thing, extreme dynamic range compression
---|
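For concreteness, here is a minimal envelope-follower dynamic range compressor of the kind alluded to. All parameter values are illustrative, not those of any published system:

```python
import numpy as np

def compress(x, fs=16000, attack=0.002, release=0.02, exponent=0.3):
    """Flatten the temporal envelope so quiet stretches (often consonants)
    are boosted relative to loud ones, then restore the original RMS."""
    a_att = np.exp(-1.0 / (attack * fs))
    a_rel = np.exp(-1.0 / (release * fs))
    env = np.empty_like(x)
    level = 0.0
    for i, s in enumerate(np.abs(x)):       # one-pole envelope follower
        a = a_att if s > level else a_rel
        level = a * level + (1.0 - a) * s
        env[i] = level
    y = x * (env + 1e-6) ** (exponent - 1.0)  # exponent < 1 => compression
    return y * np.sqrt(np.mean(x ** 2) / np.mean(y ** 2))
```

Taken to the extreme (exponent near zero, i.e. a fully flattened envelope), this produces the "very binary" behaviour described: large intelligibility gains in noise for some material, severe distortion for other material.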
1:08:39 | so this question is based on the example that you showed about announcements in the
---|
1:08:45 | train station
---|
1:08:46 | so is there any way of increasing the intelligibility of things like the name of the
---|
1:08:53 | train station
---|
1:08:55 | for people who are, you know, not native
---|
1:08:59 | speakers?
---|
1:09:01 | so, i mean, there's a couple of things you can do: there's a low-level
---|
1:09:04 | thing, a high-level thing, and things in between
---|
1:09:07 | we haven't done it, we haven't done it ourselves, but others have. so one thing
---|
1:09:11 | of course, is you can transfer all your excess energy to those important items, which is
---|
1:09:16 | the low-level thing that one could do
---|
1:09:19 | at the high level, you can attempt to modify the wording; with synthetic speech
---|
1:09:24 | you can attempt to produce hyper-speech; people
---|
1:09:29 | have been very successful in doing this
---|
1:09:31 | algorithmically, fully automatically: producing
---|
1:09:36 | speech which is more likely to meet its target. so if the next point involves a
---|
1:09:39 | place name, this is really going to help a lot in those cases
---|
1:09:43 | and then there are more prosaic things, like simply, you know, repetition, or
---|
1:09:47 | simply simplification of the syntax. so when it comes to sort of proper names like that, i'm
---|
1:09:52 | sure there are some very specific things you can do to solve it, if we
---|
1:09:56 | need to look at it that way
---|
1:09:57 | i think
---|
1:10:14 | this is like introducing redundancy, i think
---|
1:10:19 | yes, this needs to be done
---|