0:00:22 | (session chair introduces the speaker) |
0:00:33 | so i am from the university of eastern finland |
0:00:36 | and this work is a collaboration with a colleague from brown university |
0:00:41 | and with colleagues from nokia research center |
0:00:45 | so |
0:00:46 | i'm not sure whether i'm presenting this work at the correct conference, because |
0:00:50 | to us the speaker and language variability is something that we would not |
0:00:54 | suppress: what we are interested in is exactly the |
0:00:57 | nuisance variation that we usually remove in |
0:01:01 | speaker recognition |
0:01:04 | and we are interested basically in inferring the mobile user's context |
0:01:09 | based on audio signals. by context in this work we mean |
0:01:14 | the physical location |
0:01:16 | the user's activity |
0:01:18 | or the particular environment |
0:01:21 | so if we have such information we can use it for many |
0:01:26 | purposes: social networking purposes, adapting acoustic models |
0:01:35 | or improving |
0:01:37 | services |
0:01:39 | so in this study i'm going to focus only on environment recognition based on |
0:01:42 | acoustic cues |
0:01:44 | specifically, we consider nine different contexts which |
0:01:49 | we encounter in everyday life, including car, office, restaurant and so on |
0:01:53 | and we also have an out-of-set option for fragments that do not match any |
0:01:59 | of the |
0:02:00 | contexts |
0:02:03 | so as we all know, modern smartphones offer a variety of |
0:02:07 | sensors, like gps |
0:02:09 | accelerometers, light sensors and so on |
0:02:11 | and in some of the earlier studies accelerometer and |
0:02:18 | other sensor readings have been used for context identification |
0:02:21 | but we consider the use of the audio signal |
0:02:25 | so one obvious reason for that is that the microphone is |
0:02:29 | readily available on the device whenever we want to recognize the context |
0:02:33 | and the other one is that we do not depend on a network infrastructure, if |
0:02:36 | we compare this |
0:02:38 | to gps |
0:02:40 | or wifi signals |
0:02:43 | and actually in some cases, for instance if we study different vehicles, like a |
0:02:47 | normal car and a bus |
0:02:49 | then based on the position data alone |
0:02:54 | it would be quite difficult to tell whether it is a normal car or a bus, but |
0:02:58 | if we had audio cues we could more easily make the discrimination |
0:03:02 | and in fact there is some recent evidence that audio cues can be |
0:03:06 | more helpful in |
0:03:08 | some cases |
0:03:10 | when we are trying to recognize the user's context |
0:03:15 | so here are a couple of examples of what the data looks like in different contexts |
0:03:20 | the first one is probably familiar to all of us, the office environment |
0:03:24 | and these are three-second segments, so |
0:03:26 | they are fairly short |
0:03:33 | then the car environment |
0:03:41 | (audio samples play) |
0:03:44 | all right |
0:03:46 | so as you can hear, these are really short samples |
0:03:49 | and here are examples of how we can have |
0:03:52 | quite different acoustics depending on which user |
0:03:56 | or what type of device has been used to collect the data |
0:04:02 | so this is representative of the intra-class variability that we are |
0:04:07 | facing in this problem |
0:04:14 | (audio sample plays) |
0:04:19 | then there is this, for example |
0:04:25 | another example from the same user |
0:04:31 | and the funny rustling sounds that you can hear are because the user probably has |
0:04:35 | the phone in his pocket, and the fabric is rubbing against the |
0:04:39 | microphone |
0:04:43 | (audio sample plays) |
0:04:50 | so you get an idea of what this problem is about |
0:04:54 | so we consider this as a supervised closed-set identification task, where |
0:04:59 | we train a context model for each of our ten classes |
0:05:03 | and, okay, it's probably not the most correct way of doing it, but we also |
0:05:06 | trained an explicit model for the out-of-set class, rather than trying to detect it in the |
0:05:12 | classifier |
0:05:13 | so quickly about the feature extraction: we use a typical mfcc front-end as in |
0:05:20 | speaker recognition, thirty-millisecond frames |
0:05:23 | and sixteen-kilohertz audio |
0:05:26 | the two differences from speaker and language recognition are that we |
0:05:30 | don't include any feature normalization here, because we believe that the channel bias of |
0:05:36 | these mobile devices also contains useful information about the context |
0:05:41 | and also the frame rate is much reduced, so we are skipping frames; these |
0:05:46 | are actually non-overlapping frames |
0:05:48 | and the reason we do this is that there are |
0:05:50 | real-time requirements here |
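To make the described front-end concrete, here is a minimal sketch, assuming Python with librosa; the 30 ms non-overlapping frames at 16 kHz and the absence of feature normalization follow the talk, while the number of coefficients (13) and the FFT size are illustrative assumptions:

```python
import librosa

def extract_mfcc(wav_path, sr=16000, win_ms=30, n_mfcc=13):
    """MFCC front-end as described in the talk: 16 kHz audio, 30 ms
    frames, non-overlapping (hop == window), and deliberately NO
    feature normalization, so device/channel bias is retained."""
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(sr * win_ms / 1000)              # 480 samples at 16 kHz
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=512, win_length=win,
        hop_length=win,                        # non-overlapping frames
    ).T                                        # shape: (n_frames, n_mfcc)
```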
0:05:54 | so let's look at the classifier back-end, which is the focus of this work |
0:05:59 | i tried to summarize on this slide the approaches that i could find |
0:06:03 | in the literature, and to make it more accessible i added some analogies for how they |
0:06:08 | might be related to speaker and language recognition |
0:06:11 | so quite a number of authors have used very simple distance-based classification: |
0:06:15 | k-nearest neighbor and vq |
0:06:17 | others have used gaussian mixture models or support vector machines; actually svms |
0:06:22 | have been studied in this field, but usually not using supervector kernels |
0:06:26 | or anything like that, but rather training on the individual mfcc frames |
0:06:31 | and then of course there are hmms, trying to model the temporal trajectories of |
0:06:36 | the acoustic contexts |
0:06:38 | and also a couple of authors |
0:06:41 | have done audio event detection, so basically you have a |
0:06:45 | discrete set of event detectors, for things like laughing and cheering, and then you |
0:06:49 | construct a histogram of these outputs |
0:06:52 | similar to high-level speaker recognition |
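As a side note, the event-histogram idea mentioned above can be sketched in a few lines: run a bank of discrete event detectors over the frames and use the normalized histogram of their outputs as a fixed-length feature, analogous to high-level features in speaker recognition. The detectors themselves are hypothetical here:

```python
import numpy as np

def event_histogram(frame_events, n_events):
    """frame_events: one detected event label per frame (integers in
    0..n_events-1), e.g. laughing, cheering, traffic noise. Returns a
    normalized histogram usable as a fixed-length context feature."""
    counts = np.bincount(np.asarray(frame_events), minlength=n_events)
    return counts / max(len(frame_events), 1)
```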
0:06:55 | so |
0:06:57 | as we know, the simple methods are cheap in computation and we don't |
0:07:01 | need so many development datasets and so on, but they are limited in the |
0:07:06 | sense that we don't have |
0:07:08 | any frame-dependence modeling |
0:07:10 | whereas the more complicated models |
0:07:13 | that capture the temporal aspects involve more computation |
0:07:18 | the application scenario that we consider is recognition from very short utterances |
0:07:26 | while keeping the computation and memory use as low as possible |
0:07:31 | and the other factor is that we don't really have access to datasets similar to what |
0:07:36 | we have in the nist evaluations |
0:07:39 | at least not |
0:07:41 | good large-scale datasets for this purpose |
0:07:44 | so then |
0:07:45 | for these reasons we focus on |
0:07:47 | relatively simple methods here |
0:07:50 | so our contribution here is basically to see how the familiar |
0:07:55 | tools that we use in speaker and language recognition work for this task |
0:08:00 | and the other thing is that in the previous studies |
0:08:04 | usually the data has been collected using the same microphone in fixed locations |
0:08:10 | but in this study we have a large collection of test samples collected |
0:08:14 | by different mobile users |
0:08:18 | and with a couple of different mobile phone models as well, so there is a lot of |
0:08:21 | variability with respect to the device and the user that collected the data |
0:08:26 | and we also wanted to see the |
0:08:28 | behaviour of the different classifiers |
0:08:30 | okay, you see here a couple of familiar and unfamiliar abbreviations; i'm not going to |
0:08:35 | explain the classifiers in detail because they should be |
0:08:38 | familiar to this audience |
0:08:40 | so |
0:08:42 | basically six different methods: the distance-based vq and k-nn, gaussian mixtures |
0:08:47 | trained with maximum likelihood |
0:08:50 | and also with discriminative mmi training utilizing the but tools |
0:08:54 | and then also two supervector systems, using gmm supervectors and the generalized linear discriminant sequence (glds) kernel |
0:09:01 | and there are some control parameters that we considered |
0:09:04 | one remark about the two simplest classifiers: |
0:09:08 | because k-nn would require that we store the whole training set, which is not feasible |
0:09:12 | we use vq codebooks to approximate the training set |
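A minimal sketch of this VQ back-end, with scikit-learn's KMeans as an assumed codebook trainer: one codebook per context class approximates that class's training frames, and a test segment is assigned to the class whose codebook yields the lowest average quantization distortion:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_vq(frames_by_class, codebook_size=64):
    """One k-means codebook per context class, approximating the
    training frames of that class."""
    return {c: KMeans(n_clusters=codebook_size, n_init=3).fit(X)
            for c, X in frames_by_class.items()}

def classify_vq(codebooks, test_frames):
    """Pick the class whose codebook quantizes the segment with the
    lowest mean squared distortion."""
    def distortion(km):
        return np.mean(km.transform(test_frames).min(axis=1) ** 2)
    return min(codebooks, key=lambda c: distortion(codebooks[c]))
```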
0:09:17 | okay |
0:09:20 | so here is an overview of the data. i'm not going to go |
0:09:23 | through all of the numbers |
0:09:26 | but look at the last row and the last column of this table, which |
0:09:30 | show the number of samples for the different classes |
0:09:32 | and users |
0:09:34 | we can see that there is a |
0:09:36 | massive imbalance in the data, which actually causes some problems for the classifier construction |
0:09:42 | some of the users didn't collect any data for certain classes, and some of |
0:09:46 | them were more active in collecting the data |
0:09:50 | and the most popular class seems to be the office |
0:09:52 | so many people have collected data at their office desks and didn't feel too |
0:09:57 | enthusiastic about doing this everywhere else |
0:10:01 | and regarding the users |
0:10:05 | most of the samples come from the city of tampere, but there is one user |
0:10:11 | for whom we actually have some samples from tampere and a majority from helsinki |
0:10:16 | so two different cities |
0:10:17 | and then here you can see the different phone models that were |
0:10:21 | included in the comparisons |
0:10:23 | and when we validated the classifiers we used leave-one-user-out |
0:10:27 | cross-validation, which means that when we are testing on user number one |
0:10:31 | we train the classifiers using the remaining five users |
0:10:35 | and then we repeat this |
0:10:37 | over all the users |
0:10:40 | and also we prefer to report the average class-specific accuracy rather than the plain |
0:10:46 | identification accuracy, because |
0:10:48 | that would be very much biased towards the office class; we want to see, on |
0:10:52 | average, per class, how we are doing |
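The evaluation protocol just described could look roughly like this, with scikit-learn as an assumed implementation choice: leave-one-user-out cross-validation via LeaveOneGroupOut, scored with balanced accuracy (the average of per-class accuracies), so the dominant office class does not inflate the figure:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import balanced_accuracy_score

def evaluate(X, y, user_ids, make_classifier):
    """Train on all users but one, test on the held-out user, and
    average the per-class (balanced) accuracy over all folds."""
    scores = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=user_ids):
        clf = make_classifier().fit(X[tr], y[tr])
        scores.append(balanced_accuracy_score(y[te], clf.predict(X[te])))
    return float(np.mean(scores))
```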
0:10:56 | so here are the |
0:10:58 | results for the two simplest classifiers. on the x-axis you can see the |
0:11:02 | codebook size that we used for the vq, and k for the k-nn |
0:11:04 | and on the y-axis the identification rate |
0:11:07 | so as you can see, the best that we can achieve here is around forty |
0:11:11 | percent for the k-nn |
0:11:13 | and perhaps surprisingly we get the best result when just using the single nearest |
0:11:17 | neighbor |
0:11:18 | i have some possible explanations for that |
0:11:21 | and then, not so surprisingly, we find that |
0:11:26 | the vq |
0:11:27 | scoring outperforms the best k-nn configuration |
0:11:32 | and generally when we use more than two hundred fifty-six code vectors the |
0:11:36 | results saturate |
0:11:38 | and here are the results for the gaussian mixture models |
0:11:42 | the maximum likelihood and the discriminatively trained gaussian mixtures |
0:11:47 | the accuracy in general saturates when we use more than five |
0:11:51 | hundred and twelve gaussians |
0:11:53 | and in these numbers we don't yet see the maximum benefit from the discriminative training |
0:11:58 | but a couple of more details about that will come later |
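For reference, a sketch of the maximum-likelihood GMM back-end under discussion, with scikit-learn assumed (the MMI-trained variant is not shown): one diagonal-covariance GMM per class, and a test segment goes to the class with the highest frame log-likelihood:

```python
from sklearn.mixture import GaussianMixture

def train_gmms(frames_by_class, n_components=512):
    """One ML-trained, diagonal-covariance GMM per context class."""
    return {c: GaussianMixture(n_components, covariance_type='diag').fit(X)
            for c, X in frames_by_class.items()}

def classify_gmm(gmms, test_frames):
    """score() is the mean per-frame log-likelihood; the argmax over
    the class models is the identified context."""
    return max(gmms, key=lambda c: gmms[c].score(test_frames))
```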
0:12:04 | and the gmm-svm system was the most confusing for us, actually, because we |
0:12:10 | couldn't find any of the typical trends when we increase the number of gaussians |
0:12:14 | or vary the relevance factor; it was difficult to find any |
0:12:17 | meaningful patterns here. we actually tried two different svm optimizers, which took some |
0:12:22 | time, so we believe the implementation is correct, but it is still confusing |
0:12:28 | one reason could be that we are dealing with very short data segments rather |
0:12:33 | than the two and a half minutes we typically have in nist speaker recognition |
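For completeness, here is the standard textbook recipe behind the kind of GMM-SVM system being discussed, not necessarily the exact setup used here: mean-only MAP adaptation of a UBM to each segment with relevance factor r, then stacking the adapted means into a supervector:

```python
import numpy as np

def map_adapt_supervector(ubm, frames, r=16.0):
    """ubm: a fitted diagonal-covariance GaussianMixture (the UBM).
    Mean-only MAP adaptation with relevance factor r, then stacking
    the adapted means into one long supervector."""
    post = ubm.predict_proba(frames)          # (T, M) responsibilities
    n = post.sum(axis=0)                      # soft counts per Gaussian
    ex = post.T @ frames                      # (M, D) first-order stats
    alpha = (n / (n + r))[:, None]            # adaptation coefficients
    means = alpha * (ex / np.maximum(n[:, None], 1e-10)) \
        + (1.0 - alpha) * ubm.means_
    return means.ravel()                      # the GMM supervector
```

A linear SVM would then be trained on one such supervector per training segment.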
0:12:37 | when we trained the universal background model here, we didn't pay much attention to |
0:12:42 | the data balancing, so |
0:12:44 | we suspect that our ubm is mostly doing office detection |
0:12:49 | the reason why we didn't do the balancing is that |
0:12:52 | it would mean that we |
0:12:54 | would need to prune the number of vectors down to the smallest class, which as |
0:12:58 | you can see has less than three thousand samples |
0:13:03 | so we didn't want to |
0:13:04 | reduce the data too much |
0:13:08 | there were also some cases with the svm where one of |
0:13:13 | my colleagues pointed out that maybe we should also try to |
0:13:17 | balance the number of training examples per class, so this could be one of the reasons |
0:13:21 | why we see |
0:13:24 | this behaviour |
0:13:26 | for the glds classifier, here are the results for three different |
0:13:31 | monomial expansion orders: one, two and three |
0:13:35 | and here you can see the number of elements that this produces |
0:13:39 | in the supervector |
0:13:40 | so it seems that we get the best accuracy with the compromise, the second- |
0:13:44 | order polynomial expansion |
0:13:46 | around thirty-five percent correct rate |
0:13:52 | and the case of order one just corresponds to taking the raw mfcc vectors and training svms on them |
0:13:57 | this has actually been used in many of the |
0:14:01 | previous studies in context classification |
0:14:03 | so we do better than that |
0:14:06 | right |
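A sketch of the GLDS idea just covered, with scikit-learn's PolynomialFeatures standing in for the monomial expansion (the true GLDS kernel also normalizes by a background correlation matrix, omitted here): expand each MFCC frame with all monomials up to the chosen order, average over the segment, and train a linear SVM on the averaged vectors:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

def glds_vector(frames, order=2):
    """All monomials of the frame components up to the given order
    (order 2 worked best here), averaged over the segment, giving one
    fixed-length vector per segment."""
    exp = PolynomialFeatures(degree=order, include_bias=False)
    return exp.fit_transform(frames).mean(axis=0)

# Usage: one expanded vector per segment, then one-vs-rest linear SVMs.
# X = np.vstack([glds_vector(f) for f in train_segments])
# clf = LinearSVC().fit(X, train_labels)
```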
0:14:08 | so here's an overall comparison of the classifiers |
0:14:11 | so if you look at the results |
0:14:13 | where we set the parameters to the tuned values |
0:14:18 | there is not much difference if we consider the four simplest methods here |
0:14:22 | and for the svms, as we already saw from the results |
0:14:26 | they kind of lose to the simple distance-based methods |
0:14:32 | so here's some more detail if we look at the results per class |
0:14:37 | on the left you can see the name of the class and the number of |
0:14:41 | test samples |
0:14:42 | for that class |
0:14:44 | obviously the office environment seems to be the easiest to recognize |
0:14:49 | most likely because here we have the largest training set |
0:14:53 | and also it appears that |
0:14:57 | it is very much the same office facilities, because most of the users here are |
0:15:01 | employees of nokia research center, and there might even have been some of the |
0:15:07 | same meetings that the same people |
0:15:09 | were attending |
0:15:12 | so |
0:15:13 | there is some bias here |
0:15:16 | and also, surprisingly, the other class, the out-of-set class, gets very good accuracy, because |
0:15:21 | it is supposed to model the acoustics of everything else except the in-set contexts |
0:15:26 | i would say this is much more difficult than training the ubm in speaker |
0:15:30 | recognition, because we cannot predict |
0:15:33 | all the possible audio environments |
0:15:35 | we might see |
0:15:38 | and |
0:15:39 | the gmm-svm is curious: it's almost like a dirac delta function centered on the |
0:15:44 | office class |
0:15:44 | so as you can see, we get about one hundred percent |
0:15:48 | recognition of the office environment, but then |
0:15:51 | for some classes all the test samples are misclassified, so there is something |
0:15:57 | funny with this one |
0:15:59 | and if you think about |
0:16:02 | the gmm systems |
0:16:04 | we see that the discriminative training helps in about half of the cases |
0:16:08 | but again, it is speculation what the reason might be |
0:16:13 | for certain classes here |
0:16:17 | and |
0:16:18 | we were also interested to see whether there is any user effect visible, because we |
0:16:22 | can think that different users have different preferences: maybe they go to different |
0:16:27 | restaurants, visit different places, and so on |
0:16:30 | so we looked at this |
0:16:36 | so most of the users are quite similar, around forty to fifty |
0:16:40 | percent accuracy, except for this one user whose data samples mostly come |
0:16:46 | from helsinki; remember that this was trained using leave-one-user-out |
0:16:51 | so it means that |
0:16:52 | the city of helsinki has basically not been seen in training, so it seems that there is |
0:16:57 | at least some indication of a bias due to the city |
0:17:04 | and we also made an attempt to |
0:17:07 | look at the effect of the particular mobile device, but it's a little bit |
0:17:11 | problematic analysis because too many things are changing at the same time |
0:17:17 | so these two cases |
0:17:19 | are the ones with the most training data |
0:17:21 | but when we pool the results for these devices together, on the |
0:17:27 | test data we don't see the highest accuracy for them |
0:17:36 | so that doesn't |
0:17:39 | match the expectation |
0:17:42 | and then these two devices basically have not been seen at all in training |
0:17:46 | so this |
0:17:48 | was the only user with data from these devices, but we are still doing |
0:17:52 | okay, for instance for the vq, on one of them, and on the other one again not |
0:17:56 | so well |
0:17:57 | so this analysis is a little bit limited because |
0:18:01 | we probably should be careful about concluding anything from |
0:18:05 | this |
0:18:06 | analysis |
0:18:07 | ideally we should have parallel recordings to see the pure device effect |
0:18:12 | but from this simple analysis it seems that the user, the city, and |
0:18:16 | that kind of higher-level factors have a |
0:18:19 | stronger impact than the device |
0:18:24 | so |
0:18:26 | let me conclude. i think that this task is very interesting, but it also appears to |
0:18:30 | be full of surprises |
0:18:32 | so to summarize, the highest average class-specific |
0:18:36 | identification rate that we got was around forty-four percent |
0:18:40 | after all the parameter tuning |
0:18:43 | and none of the six classifiers really outperformed the others, at least among the four |
0:18:48 | simple classifiers |
0:18:52 | the mmi training helped for some of the classes but not |
0:18:55 | systematically for all of them, so i think it's worth looking into further |
0:19:01 | the out-of-set class did not, on average, hurt the accuracy |
0:19:06 | and we couldn't get really |
0:19:08 | good results using the sequence kernel svms, but perhaps |
0:19:12 | there could be some different way of constructing the sequence kernel, not using the gmm |
0:19:17 | supervectors |
0:19:18 | so i think |
0:19:20 | that's what i wanted to say. thank you |
0:19:33 | (inaudible audience question) |
0:19:57 | okay, to repeat the question: it was about how the gmm systems were constructed |
0:20:03 | (inaudible exchange) |
0:20:20 | yeah, but i think it has very much to do with this |
0:20:31 | yeah |
0:20:35 | yeah, we used the but tools to do it, and i believe that is |
0:20:39 | the case, that we did |
0:20:51 | (inaudible audience questions and brief replies) |
0:21:47 | no, that's a separate topic: the data is collected on mobile |
0:21:51 | phones, but we did all the simulations on a pc |
0:21:53 | there have been a couple of studies on the power consumption also; i mean |
0:21:56 | it depends on the computational cost of the classifier, and in that sense one would prefer |
0:22:01 | simple frame-level svms because of that |
0:22:08 | yeah, right, exactly, especially if we need to continuously monitor the audio |
0:22:13 | and i don't have a clue what |
0:22:21 | (inaudible) |