0:00:15 | So, thank you very much for the introduction. First, I'd like to say |
0:00:20 | that this work was focused primarily on my two students, who |
0:00:26 | worked on this as |
0:00:29 | part of the LRE efforts that we've been looking at. |
0:00:33 | They were both supposed to be here; unfortunately, the process for getting a visa to |
0:00:38 | Finland is a little more elaborate from the States, so they weren't able |
0:00:41 | to get here, but this represents their work. It was noted that the baton was |
0:00:47 | going to be passed over to me to say something about, |
0:00:50 | I don't know, |
0:00:51 | the highway or something like this, and I was afraid that I was going to get |
0:00:54 | into a bad spot here, so I'll start off the talk by thanking the |
0:00:59 | organizers for last night. |
0:01:01 | I pulled up a bunch of pictures, and I see |
0:01:04 | we have one on the wall sitting out here, kind of waving to everyone. |
0:01:11 | Even though the city is named after the river, |
0:01:16 | and you might be expected to go diving into the lake and paddle around, I kind of |
0:01:21 | took the gentle approach of riding in a boat instead. |
0:01:27 | Right, so, |
0:01:29 | now that we've adjusted for the events of the morning, I guess, here is the |
0:01:34 | outline of the talk. |
0:01:35 | First we'll talk about robust language recognition and some ideas that we're looking at in |
0:01:40 | this area; the focus of this talk will be a little more on feature |
0:01:44 | characterization, and we have a number of different features that we're exploring. |
0:01:48 | From that, we'll talk about a proposed fusion system that we're looking at. |
0:01:54 | Then our evaluations: the evaluations are on two different corpora, the DARPA RATS corpus, |
0:02:00 | which is a very noisy corpus, and the NIST LRE, which we just heard |
0:02:05 | about; this is from the '09 test set that we're working with. |
0:02:09 | And then some performance analysis and conclusions. So, to begin with the focus: |
0:02:15 | one of the things, when you look at language ID, is that you could |
0:02:20 | simply say the purpose is to distinguish one language from a set of languages, |
0:02:25 | or multiple languages. |
0:02:26 | But the type of task that you're looking at might be different depending on the |
0:02:30 | context. |
0:02:31 | You can kind of note |
0:02:33 | that in the NIST LRE there are a number of different scenarios; you're looking, for example, |
0:02:37 | at Urdu and Hindi, or let's say Russian and Ukrainian. |
0:02:42 | These are languages that are close to each other, and while they are unique, separate |
0:02:47 | languages, they are maybe only a little bit different, like dialects of a particular language. |
0:02:51 | On the other hand, you could have very distinct languages that are really far apart. Somehow, the |
0:02:56 | classifiers and features that you might use |
0:02:59 | for languages that are spaced really far apart may not necessarily be the best when |
0:03:04 | you're looking at closely spaced languages |
0:03:06 | or dialects of the same language. Now, the challenge that I think is becoming |
0:03:11 | more and more relevant in the language ID space |
0:03:15 | is not just |
0:03:16 | the space between the languages but the space between the different characteristics that you might |
0:03:20 | see in the audio streams you're going to be using. |
0:03:24 | It's much more likely that you'll use found data to help build acoustic models, |
0:03:29 | particularly for the out-of-set languages, and even for the in-set languages. |
0:03:33 | Not knowing the context in which the audio was captured for those out-of-set languages |
0:03:38 | introduces a lot of challenges. |
0:03:41 | We had a paper at Interspeech two years back that was entitled "Dialect |
0:03:49 | ID: the secret is in the silence," and this was by no means an indictment |
0:03:54 | of LDC's strong efforts to collect a wide variety of language data, both for |
0:03:59 | dialect and language ID. |
0:04:01 | We had done some studies on an Arabic corpus, a five-dialect |
0:04:06 | set for Arabic, and compared that against the corpora available from LDC, |
0:04:12 | and found that, |
0:04:14 | in fact, if you throw away all the speech from the corpora |
0:04:19 | in the LDC set for Arabic, you actually did better for language ID or dialect |
0:04:25 | ID by just focusing on the silence sections. So what that actually tells us |
0:04:30 | is that |
0:04:31 | if you're not sure about how the data was collected, you're probably doing |
0:04:35 | channel, handset, or microphone ID, and not necessarily dialect ID. So the work we're |
0:04:41 | looking at here is actually to see if we can improve performance and robustness. As a side |
0:04:46 | note, |
0:04:48 | in some previous work we put a lot of effort in on the IBM-, |
0:04:52 | SRI-, and BBN- |
0:04:54 | led teams working on the DARPA RATS language ID task, which is very noisy. |
0:05:02 | More recently, our work has focused a little bit more on improving open-set, |
0:05:06 | out-of-set language rejection, |
0:05:09 | primarily because we were interested in seeing how we can come up with |
0:05:13 | more efficient ways to develop background models |
0:05:16 | for when we don't have complete information on the languages we're trying to reject. |
0:05:22 | In this study we're going to focus a little more on alternate features, as well |
0:05:25 | as various backend classifiers and fusion. |
0:05:30 | So, three different sets of features are being considered here. The classical features are the ones |
0:05:35 | you might expect to see in a typical speech application; there are four different sets |
0:05:41 | of features there. The innovative features are the power normalized cepstral coefficients, |
0:05:46 | PNCC, from the CMU group, and |
0:05:49 | perceptual minimum variance distortionless response, or PMVDR. These are a set of features that we |
0:05:54 | had presented |
0:05:56 | maybe ten years back |
0:05:58 | at one of the Interspeech meetings, |
0:06:00 | and that we used for speech recognition. And then there are a number of extension features; |
0:06:05 | we refer to these as such primarily because there's additional processing associated with them, |
0:06:09 | as opposed to simply extracting the base feature set. |
0:06:13 | These include |
0:06:15 | various versions of MFCC features, depending on the window, |
0:06:19 | as well as LFCC and RASTA-PLP type features. |
0:06:23 | These are the three classes of features that we've been working with. |
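As an illustration of the extension class, here is a minimal sketch of the RASTA filtering step that distinguishes RASTA-PLP from plain PLP, assuming log critical-band energies as input and using the classic filter coefficients from Hermansky and Morgan; the rest of the PLP pipeline (equal-loudness weighting, cube-root compression, LPC-to-cepstrum conversion) is omitted, and the function name is ours, not from the system in the talk.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_energies):
    """Band-pass filter each critical-band log-energy trajectory over time.

    log_energies: (n_bands, n_frames) log filterbank energies.
    Returns RASTA-filtered trajectories of the same shape.
    """
    # Classic RASTA filter: 5-point ramp numerator, single pole at 0.94.
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
    a = np.array([1.0, -0.94])
    return lfilter(b, a, log_energies, axis=1)
```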
0:06:28 | To give you a flow diagram of how the data is |
0:06:31 | being extracted: in the paper we summarize all the |
0:06:35 | different aspects, but |
0:06:37 | these are the various sets of features that are coming out of our system, |
0:06:41 | and in the next part we'll look at how we actually extract them. In |
0:06:45 | the front end, for processing, we have a speech activity detector that uses a common setup |
0:06:49 | we developed for the RATS program. |
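The RATS speech activity detector itself isn't detailed in the talk; purely to illustrate where SAD sits in the pipeline, a toy energy-threshold detector might look like the following (the framing and the margin_db value are arbitrary choices for this sketch, not the actual RATS SAD).

```python
import numpy as np

def energy_sad(frames, margin_db=30.0):
    """Toy energy-based speech activity detector (illustration only).

    frames: (n_frames, frame_len) array of windowed audio samples.
    Returns a boolean mask keeping frames whose log energy is within
    margin_db of the loudest frame.
    """
    log_e = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return log_e > log_e.max() - margin_db
```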
0:06:52 | We use standard shifted delta cepstra features, |
0:06:56 | with a 7-1-3-7 configuration. |
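A 7-1-3-7 (N-d-P-k) configuration means N=7 cepstral coefficients, a delta spread of d=1 frame, a shift of P=3 frames between blocks, and k=7 stacked blocks. A minimal sketch of that stacking is below; the edge handling via index clipping is our choice here, implementations vary, and the static cepstra are often appended as well.

```python
import numpy as np

def sdc(cep, d=1, p=3, k=7):
    """Shifted delta cepstra for an N-d-P-k configuration.

    cep: (n_frames, N) cepstral features (N = 7 for a 7-1-3-7 setup).
    Returns (n_frames, N * k): k delta blocks, each computed around an
    anchor shifted by i * p frames; out-of-range indices are clipped.
    """
    n = cep.shape[0]
    t = np.arange(n)
    blocks = []
    for i in range(k):
        plus = np.clip(t + i * p + d, 0, n - 1)
        minus = np.clip(t + i * p - d, 0, n - 1)
        blocks.append(cep[plus] - cep[minus])
    return np.hstack(blocks)
```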
0:07:00 | We use a UBM in a state-of-the-art i-vector based system, and we |
0:07:05 | use an LDA-based setup for dimensionality reduction. On the backend processing we |
0:07:11 | do duration and length normalization, and we have two different setups: one, a generative |
0:07:17 | Gaussian backend, |
0:07:19 | and also a Gaussianized cosine distance scoring strategy, for the two different classifiers. |
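As a sketch of the cosine distance scoring side of that backend, assuming i-vectors that have already been LDA-projected: length-normalize, build per-language mean models, and score by cosine similarity. The Gaussianization of the scores and the generative Gaussian backend are omitted here, and all names are illustrative.

```python
import numpy as np

def length_norm(x):
    """Project vectors onto the unit sphere (length normalization)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cds_scores(train_iv, train_lang, test_iv):
    """Cosine distance scoring against per-language mean i-vectors.

    train_iv: (n_train, dim) LDA-projected i-vectors; train_lang: labels.
    Returns (n_test, n_lang) cosine scores, languages in sorted order.
    """
    train_iv, test_iv = length_norm(train_iv), length_norm(test_iv)
    labels = np.asarray(train_lang)
    langs = sorted(set(train_lang))
    models = np.stack([train_iv[labels == l].mean(axis=0) for l in langs])
    models = length_norm(models)
    # All rows are unit norm, so the dot product is cosine similarity.
    return test_iv @ models.T
```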
0:07:24 | So, the system flow diagram for this works like this: |
0:07:28 | we have our input audio data here; the two audio datasets that you see |
0:07:32 | here basically represent raw data for the UBM construction, as well as for |
0:07:37 | the total variability matrix that's needed for the i-vector setup, |
0:07:41 | and these two datasets are actually the same as what we use in our training |
0:07:45 | set. The generative Gaussian backend is on this side here, and then the |
0:07:50 | cosine distance scoring setup is here, |
0:07:52 | and then we do score fusion: |
0:07:54 | score processing first, and then fuse the setups. |
0:07:58 | So for system fusion, our setup looks like this. We can do feature |
0:08:04 | concatenation; that's one of the approaches we look at, where you just concatenate the feature sets. |
0:08:08 | For backend fusion, we use FoCal to fuse the backend |
0:08:14 | systems that we see here, to come up with the final |
0:08:17 | decision score. |
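FoCal is the actual toolkit used here; as an illustrative stand-in for the same idea, backend score fusion can be sketched as a multiclass logistic regression over the stacked subsystem scores, trained on held-out development data (the names and the sklearn choice are ours).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(dev_scores, dev_labels, eval_scores):
    """Score-level fusion across feature subsystems.

    dev_scores / eval_scores: lists of (n_trials, n_lang) score matrices,
    one matrix per subsystem. A multiclass logistic regression stands in
    for the FoCal toolkit mentioned in the talk.
    """
    X_dev = np.hstack(dev_scores)
    X_eval = np.hstack(eval_scores)
    fuser = LogisticRegression(max_iter=1000).fit(X_dev, dev_labels)
    return fuser.predict_proba(X_eval)  # fused per-language posteriors
```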
0:08:20 | For the evaluation corpora, note that we have two different corpora that we're working |
0:08:24 | with. The NIST LRE, we classify as a large-scale setup: twenty-three different |
0:08:29 | languages, and we're only using these for the in-set there; for the duration mismatch |
0:08:35 | we looked at the three sets that you would typically see. |
0:08:39 | For the DARPA program, I know some of you may not be familiar |
0:08:44 | with the DARPA setup, but there are five languages in the DARPA language ID |
0:08:50 | task: Arabic, Farsi, Urdu, Pashto, and Dari, |
0:08:54 | and there are ten out-of-set languages included. The data is extremely noisy; |
0:08:59 | I'll play just an audio clip here so you get some sense of how bad |
0:09:02 | the data is. |
0:09:19 | You can clearly see that that's not your typical telephone call that you might be |
0:09:24 | picking up, |
0:09:25 | and so in that context the language ID task is quite challenging. So one of |
0:09:32 | the things we wanted to see in our setup here, at least for the |
0:09:36 | DARPA RATS corpus, was to understand |
0:09:39 | whether the channels were somehow dependent on each other: if everything was kind of uniform, or there's |
0:09:44 | some variability across the channels. So we consider seven of the channels (channel D was |
0:09:50 | left out of the channels in the system), so we set up a channel ID task here across the seven |
0:09:54 | channels. |
0:09:55 | We look at the six language classes, that is, the five |
0:09:59 | target languages plus the ten out-of-set languages that are |
0:10:03 | grouped together here, and we scored |
0:10:06 | the test files across the forty-one classes. |
0:10:11 | The idea is that you look at the channel confusion setup here: |
0:10:16 | if there is no |
0:10:18 | dependency here, we would expect there to be clear diagonal |
0:10:23 | lines. The fact that we see these off-diagonal aspects here tells us that there are |
0:10:27 | clearly some channel dependencies in here. So what this is telling us is that there are a |
0:10:31 | lot of transmission channel factors |
0:10:33 | that are influencing |
0:10:35 | all the data, and what we would expect from the classifier setup. That was the reason |
0:10:41 | I pointed to that previous study we did looking at the Arabic test: to show why |
0:10:46 | we should try to do some type of normalization of channel characteristics. |
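The check described here amounts to a confusion matrix over joint (language, channel) classes: if scores depended only on language, the mass would concentrate on language-matched entries regardless of channel. A minimal sketch, with the class indexing left abstract:

```python
import numpy as np

def confusion_matrix(true_ids, pred_ids, n_classes):
    """Row-normalized confusion matrix over joint (language, channel) IDs.

    true_ids / pred_ids: integer class indices per trial, where each index
    encodes a (language, channel) pair. Strong off-diagonal blocks that
    line up by channel indicate channel-dependent behaviour.
    """
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(true_ids, pred_ids):
        cm[t, p] += 1
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```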
0:10:50 | So, looking at the two corpora, we did our evaluation here for |
0:10:56 | the various feature sets. |
0:10:59 | This has the RATS results here and the LRE'09 results here; |
0:11:04 | the three broad classes of features, the classical features, innovative features, and extension |
0:11:09 | features, are here, and we list the |
0:11:15 | performance |
0:11:17 | for each of the different feature sets. |
0:11:19 | You can see the Gaussianized cosine distance scoring individual scores |
0:11:24 | here; if you look at the backend fusion strategy, we get a performance improvement here, |
0:11:29 | and we can see, obviously, that fusion ends up helping in all these conditions. |
0:11:33 | It's also very striking that the performance on the clean datasets is quite a |
0:11:39 | bit better than the performance on the noisy sets. |
0:11:43 | Next, we wanted to look at rank-ordering |
0:11:47 | which features |
0:11:50 | actually show better improvement. So here we just plot |
0:11:55 | the two classifiers and the backend fusion setup; this just gives you a |
0:12:00 | relative comparison across the RATS and the LRE'09 datasets, and basically, backend fusion |
0:12:06 | here beats the various feature concatenation strategies in almost all combinations. |
0:12:13 | We get a thirty-three percent relative improvement in LID performance on the RATS data, |
0:12:18 | and a thirty-four percent relative improvement on the LRE set. |
0:12:23 | So next we wanted to look a little bit more at |
0:12:30 | test duration aspects. The baseline system shows how performance varies depending on |
0:12:38 | the duration of the test sets here for the LRE data, |
0:12:42 | and you can see that as the test duration increases, obviously we get better |
0:12:45 | performance. If you look at the hybrid fusion, it also has a nice improvement here; |
0:12:51 | we see that the relative improvement is quite substantial. Hybrid fusion obviously does improve |
0:12:57 | LID performance, |
0:12:58 | and the relative improvement is actually much stronger the longer the duration set is, but you |
0:13:03 | can see that we're almost cutting the error rates in half, which |
0:13:07 | is forty percent at least, |
0:13:09 | which is quite nice in terms of the shorter three-second duration sets. |
0:13:15 | Finally, the thing we wanted to look at, across the various features, was |
0:13:19 | to ask a couple of basic questions in terms of how each of these features might be |
0:13:23 | contributing to improved system performance. |
0:13:25 | One question might be: how do we calibrate the contribution of each feature |
0:13:30 | in the fusion set, |
0:13:31 | and is that contribution similar across the different tasks, |
0:13:35 | for RATS and for the LRE? The idea is that if you look at |
0:13:39 | the rank ordering here on clean data versus the noisy data, do you actually get a |
0:13:43 | different set of features that might be better for that particular task? |
0:13:47 | So we use this relative significance factor here, where we take the leave-one-out |
0:13:53 | system ranking |
0:13:55 | for each particular feature, and we normalize that by the individual system's ranking for that |
0:14:02 | particular feature. That allows us to look at the relative rank for each particular |
0:14:07 | feature. |
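The precise definition is in the paper; one plausible reading of the relative significance factor as described, ranking each feature by the fused-system error when that feature is left out and normalizing by the rank of its stand-alone system, is sketched below. The normalization here is our guess, not the paper's exact formula.

```python
import numpy as np

def relative_significance(loo_errors, solo_errors):
    """A guessed form of the relative significance factor from the talk.

    loo_errors[f]:  fused-system error with feature f left out (a larger
                    error when removed suggests a more important feature).
    solo_errors[f]: error of feature f's stand-alone system.
    Returns the leave-one-out rank divided by the stand-alone rank.
    """
    loo = np.asarray(loo_errors)
    solo = np.asarray(solo_errors)
    # Double argsort turns values into 1-based ranks.
    loo_rank = np.argsort(np.argsort(-loo)) + 1    # rank 1 = most important
    solo_rank = np.argsort(np.argsort(solo)) + 1   # rank 1 = best alone
    return loo_rank / solo_rank
```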
0:14:11 | This shows the rank-order setups for the different features for RATS and for LRE, |
0:14:16 | and what we see here is that, if you're looking closely, you see that it says |
0:14:19 | "PASTA-PLP"; I guess my students got hungry. That should be RASTA-PLP. |
0:14:24 | At least on the RATS and LRE data, the RASTA-PLP feature actually |
0:14:28 | gave us the strongest contribution to improved LID performance, |
0:14:33 | and you can see the various other features here rank lower. |
0:14:38 | What's interesting to note is that if you look at the relative significance factor here |
0:14:42 | for the clean data, RASTA-PLP actually far surpasses all the other features. |
0:14:49 | In the noisy task, that relative impact actually reduces quite significantly: it still |
0:14:55 | ranks first, |
0:14:56 | but the impact of that single feature when the data becomes extremely noisy is |
0:15:01 | a whole lot less. |
0:15:02 | What that's telling us is that in noisy tasks you actually need to leverage performance |
0:15:07 | across multiple features |
0:15:09 | in order to hope to get similar levels of performance on the LID task |
0:15:13 | in noisy conditions. |
0:15:15 | So, in conclusion: |
0:15:18 | by fusing various types of acoustic features and backend classifiers, we can contribute |
0:15:23 | to stronger LID performance |
0:15:27 | on various corpora. |
0:15:29 | The newly proposed Gaussianized cosine distance scoring backend was shown to outperform the |
0:15:34 | generative Gaussian backend. |
0:15:36 | For the DARPA RATS scenario, we saw that we had a thirty-eight percent improvement |
0:15:43 | for that particular task, and for the NIST LRE we had some additional experiments in |
0:15:48 | the paper that show a forty-six percent relative improvement. |
0:15:52 | And for the rank-ordered features, the RASTA-PLP feature turned out to be the most |
0:15:57 | significant feature set |
0:15:59 | for the two corpora that we considered, but we found that you need to |
0:16:03 | fuse multiple features, particularly for the noisy conditions, in order to hope to get |
0:16:08 | similar levels of performance gain. |
0:16:21 | Any questions? |
0:16:27 | I just wondered: you presented both RATS and |
0:16:32 | LRE results. |
0:16:35 | What explains the difference between them, |
0:16:38 | given RATS is so noisy? |
0:16:42 | So, there's always a challenge in explaining why something works. I would say, |
0:16:48 | kind of looking at the LRE data, |
0:16:52 | I think you have different levels of noise than on the RATS data. I think, |
0:16:59 | for the rejection, you see for the RATS data you've got the ten |
0:17:04 | out-of-set languages; those in some sense might be a little bit easier. We |
0:17:09 | have done a test on the LRE sets where what we did was generate |
0:17:13 | a five-language in-set task that was as close as possible to the five |
0:17:18 | languages that we had from RATS, |
0:17:20 | and we showed the performance there was actually fairly different from what we were |
0:17:24 | seeing on the LRE'09 set. |
0:17:28 | I wish I could give you more insight as to why the performance was what it was, but |
0:17:33 | I can say that using more features actually helps. |
0:17:44 | Did you look at the unseen channel in RATS? As I understood it, for |
0:17:48 | RATS you trained on data in-set through all the channels and you're testing on those; or |
0:17:53 | did you pull one out as an unseen channel? I think the unseen one is |
0:17:57 | the one that was recently released. |
0:18:00 | Well, you could do that, hold out just one channel. In this setup |
0:18:04 | we actually had all the channels, but |
0:18:09 | we have done tests in that context, just not against all |
0:18:14 | these features. |
0:18:16 | I can say, similarly, we did a fair amount of testing when we were |
0:18:20 | looking at a couple of frontend enhancement |
0:18:25 | techniques at the last ICASSP for the LID task, and there we |
0:18:31 | did hold out one of the channels, just to see how we would do on an |
0:18:33 | unseen channel. |
0:18:39 | So, you looked at shifted delta |
0:18:42 | cepstra; did you use that for PLP too, so that you have more long-term information? |
0:18:46 | Did you actually use shifted delta cepstra on the PLP system? |
0:18:52 | So, we used the shifted delta cepstra setup on all the feature sets; |
0:18:56 | 7-1-3-7 is the configuration. |
0:19:11 | Just an excellent talk. Earlier you talked about |
0:19:17 | the study on |
0:19:19 | a setup recognizing the channel rather than the language. Can you comment |
0:19:26 | on what your findings were from that, and on which features were simple and |
0:19:32 | effective enough? |
0:19:35 | I can answer that question, but let me make one comment first. |
0:19:39 | When Joe was giving his keynote talk yesterday, there was one comment I didn't |
0:19:44 | get a chance to make, so I'll make it now. |
0:19:46 | When you're doing language ID, or speaker ID for that matter, but particularly language ID, |
0:19:51 | you're much more likely to use found data for this, and you may |
0:19:55 | not know what the channel conditions are. There's one test that is actually a really good |
0:19:59 | thing to do, and it may not be something you want to report, but it's something |
0:20:02 | that I think everyone should do. |
0:20:04 | Typically, when you're looking at LID, you would run a speech activity detector, so you're |
0:20:09 | going to have your silence, or low energy and noise, and your speech. What |
0:20:14 | is a really good test is to run your language ID task on the speech |
0:20:18 | and then run it on the silence, all the data that you pulled out. |
0:20:23 | If you run it on the silence and you find you're getting basically chance |
0:20:26 | across all your setups, then you kind of know that the channels are not really |
0:20:31 | dependent on each other. |
0:20:33 | But if you're getting really good performance, |
0:20:36 | actually better performance than if you're using the speech, then you know |
0:20:40 | your classifier is not really targeting the speech; it's actually targeting the channel characteristics. |
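A minimal sketch of that diagnostic, treating the LID system as a black-box callable; the interface and the chance baseline shown here are our framing of the test, not code from the talk.

```python
import numpy as np

def silence_sanity_check(lid, speech_segs, silence_segs, labels, n_lang):
    """Run an LID system on speech vs. discarded silence from the same files.

    lid: callable mapping one audio segment to a predicted language label.
    Silence accuracy near chance (1 / n_lang) suggests the system keys on
    language; silence accuracy well above chance suggests it keys on the
    channel instead.
    """
    def acc(segs):
        return float(np.mean([lid(s) == y for s, y in zip(segs, labels)]))
    speech_acc, silence_acc = acc(speech_segs), acc(silence_segs)
    print(f"speech acc: {speech_acc:.3f}  silence acc: {silence_acc:.3f}  "
          f"chance: {1.0 / n_lang:.3f}")
    return speech_acc, silence_acc
```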
0:20:47 | And that's what we found. In a previous paper we actually tried a number |
0:20:51 | of ways to do |
0:20:53 | long-term channel normalization techniques, things like this; we were able to get the |
0:20:58 | long-term channel characteristics exactly the same for those different corpora, |
0:21:01 | and even then we still could not get the performance on silence to drop |
0:21:05 | to chance. |
0:21:06 | On a personal note, for our friends from NIST: |
0:21:10 | I really would like to see a performance benchmark, especially for LID, not necessarily for |
0:21:15 | SID, but for LID: if you could come up with |
0:21:18 | a performance benchmark that looked at your performance on all the speech and kind of |
0:21:23 | balanced that against the performance on the silence. |
0:21:26 | Because the idea is that if you get great performance here and you're getting just a |
0:21:30 | little improvement here, then your gain, that's all you're really getting from actually looking |
0:21:35 | at the speech; but if the performance on silence is really big, then that is actually what |
0:21:40 | made up the difference, and in effect you're cheating. |
0:21:42 | So that kind of says more about the speaker... |