0:00:33 So, this work comes from a university in Finland, and it is a collaboration with colleagues from Brown University and from Nokia Research Center.
0:00:45 I'm not sure whether I'm presenting this work at the right conference, because unlike the previous speakers, speaker and language variability is not really what we are interested in as such; rather, we borrow the tools that are used in speaker recognition.
0:01:03 So, we are basically interested in inferring the mobile user's context based on audio signals. By context, in this work we mean the physical location, the user's activity, or the particular environment.
0:01:21 If we have such information, we can use it for many purposes: social networking applications, adapting or improving acoustic models, and other context-aware services.
0:01:39 In this study I am going to focus only on environment recognition based on acoustic cues. 0:01:44 Specifically, we consider nine different contexts that we encounter in everyday life, including car, office and so on, and we also have an out-of-set option for recordings that do not belong to any of these contexts.
0:02:03 As we all know, modern smartphones are equipped with a variety of sensors such as GPS, accelerometers, light sensors and so on. 0:02:11 In some of the early studies, accelerometers and other such sensors have been used for context identification, but we consider the use of the audio signal.
0:02:25 One obvious reason for that is that we do not need any additional hardware to recognize the context, and another is that we do not depend on any network infrastructure, if we compare for instance with GPS or WiFi signals.
0:02:43 And actually, in some cases, for instance if we want to discriminate between a normal car and a bus, it would be quite difficult to tell the two apart based on the GPS data alone, but if we had the audio we could make this discrimination. 0:03:02 In fact, there is some recent evidence that audio cues can be more helpful in some cases when we are trying to recognize the user's context.
0:03:15 So, here are a couple of examples of what the data from the different contexts sounds like. The first one is probably familiar to all of us, the office environment. These are three-second segments, so they are fairly short. [audio sample]
0:03:33 Here is the car environment. [audio sample] 0:03:46 Again these are short three-second samples.
0:03:49 And here are examples of how the acoustics can be quite different depending on which user or which type of device has been used to collect the data. 0:04:02 This is representative of the intra-class variability that we are facing in this problem.
0:04:19 Then there is this example. [audio sample] 0:04:25 And here is another example from the same user. [audio sample] 0:04:31 The funny rattling sound that you can hear is probably because the user has the phone in his pocket, so the microphone is very close to the fabric. [audio sample]
0:04:50 So that gives an idea of what this problem is about. 0:04:54 We treat it as a supervised closed-set identification task, where we train a context model for each of our ten classes. 0:05:03 That is probably not the most correct way of doing it, but we also trained an explicit model for the out-of-set class, rather than handling the rejection inside the classifier itself.
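To make the setup concrete, a minimal sketch of this kind of closed-set identification with an explicit out-of-set model could look as follows. This assumes Python with scikit-learn; the class list, mixture size and function names are illustrative rather than the configuration actually used in the study.

```python
# Minimal sketch (not the study's actual code): one GMM per context class,
# including an explicit "other" model, scored side by side on a test segment.
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["car", "office", "other"]  # illustrative subset; the study used nine contexts plus "other"

def train_models(features_by_class, n_components=8):
    """features_by_class: dict mapping class name -> (n_frames, n_mfcc) array of training frames."""
    models = {}
    for name, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[name] = gmm.fit(feats)
    return models

def classify(models, test_feats):
    """Sum per-frame log-likelihoods under each class model and pick the best class."""
    scores = {name: gmm.score_samples(test_feats).sum() for name, gmm in models.items()}
    return max(scores, key=scores.get)
```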
0:05:13 Quickly about the feature extraction: we use a typical MFCC front-end, as in speaker recognition, with thirty-millisecond frames and 16 kHz audio. 0:05:26 The two differences from speaker and language recognition are, first, that we do not include any feature normalization here, because we believe that the channel bias and similar effects introduced by the devices also carry useful information about the context; 0:05:41 and second, that the frame rate is much reduced: we actually use non-overlapping frames, because there is a real-time requirement here.
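As an illustration, the front-end just described could be approximated roughly as below. This is a sketch assuming Python with librosa; values such as n_mfcc=13 and n_fft=512 are my assumptions, since the talk only specifies 16 kHz audio, 30 ms non-overlapping frames and no feature normalization.

```python
# Rough sketch of the described front-end: plain MFCCs on 16 kHz audio,
# 30 ms non-overlapping frames, and deliberately no cepstral mean normalization.
import librosa

def extract_mfcc(wav_path, sr=16000, frame_ms=30, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    frame_len = int(sr * frame_ms / 1000)      # 480 samples at 16 kHz
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=512, win_length=frame_len,
        hop_length=frame_len,                  # hop equals window length -> non-overlapping frames
    )
    return mfcc.T                              # (n_frames, n_mfcc), left unnormalized on purpose
```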
0:05:54 So let's look at the classifier back-end, which is the focus of this work. I have tried to summarize on this slide the approaches that I could find in the literature, and to make it more interesting, some analogies to how they might relate to speaker and language recognition. 0:06:11 Quite a number of authors have used very simple distance-based classifiers such as k-nearest neighbour and VQ; others have used Gaussian mixture models or support vector machines. SVMs have actually been studied in this field as well, though usually not with supervector kernels; instead they have been trained directly on the individual MFCC frames. Then of course there are HMMs that try to model the temporal trajectories of the acoustic contexts. 0:06:41 And a couple of authors have used acoustic event detection: basically you have a discrete set of event detectors, let's say for laughing and cheering, and then you construct a histogram of their outputs, similar to high-level speaker recognition.
0:06:55 As we know, the simpler methods are cheap in computation and we do not need so many development data sets and so on, but they are limited in the sense that there is no modeling of the dependence between frames, while the more complicated models that capture the temporal aspects are more involved to train.
0:07:18 The application scenario that we consider is recognition from very short segments, and we would like to keep the computational and memory cost as low as possible. 0:07:31 Another factor is that we do not really have access to data sets like the ones we have in the NIST evaluations, at least not well-labelled data sets of that scale for this purpose. 0:07:44 So for these reasons we focus on relatively simple techniques here.
0:07:50 Our contribution here is basically that we wanted to see how the familiar tools used in speaker and language recognition work for this task. 0:08:00 The other point is that in previous studies the data has usually been collected with the same microphone in a few fixed locations, whereas in this study we have a large collection of samples collected by different mobile users with a couple of different mobile phone models, so there is a lot of variability with respect to the device and the user who collected the data. 0:08:26 And we also wanted to compare a number of different classifiers. 0:08:30 You can see here a couple of familiar and unfamiliar abbreviations; I am not going to explain the classifiers in detail, since they should be familiar to this audience.
0:08:42 So, basically six different methods: the distance-based kNN and VQ, Gaussian mixtures trained with maximum likelihood and also with discriminative (MMI) training, 0:08:54 and then two supervector systems, one using GMM supervectors and one using the generalized linear discriminant sequence (GLDS) kernel. These are some of the control parameters that we considered for each. 0:09:04 A note on the two simplest classifiers: since kNN would require storing the whole training set, which is not feasible, we use VQ codebooks to approximate the training set.
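For illustration, the VQ codebook idea just mentioned could be sketched as below, assuming scikit-learn's k-means; the codebook size and the use of average Euclidean distortion are my assumptions, not details given in the talk.

```python
# Sketch of a VQ back-end: summarize each class's training frames with a k-means
# codebook, then assign a test segment to the class with the smallest average
# quantization distortion.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(features_by_class, codebook_size=64):
    return {name: KMeans(n_clusters=codebook_size, n_init=5).fit(feats).cluster_centers_
            for name, feats in features_by_class.items()}

def vq_distortion(codebook, test_feats):
    # distance from each frame to its nearest codevector, averaged over the segment
    d = np.linalg.norm(test_feats[:, None, :] - codebook[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def classify_vq(codebooks, test_feats):
    return min(codebooks, key=lambda name: vq_distortion(codebooks[name], test_feats))
```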
0:09:19 Here is an overview of the data. I am not going to go through the individual numbers, but look at the last row and the last column of this table, which show the numbers of samples for the different classes and users.
0:09:34 We can see that there is a massive imbalance in the data, which actually causes some problems for the classifier construction. 0:09:42 Some of the users did not collect any data for certain classes, and some of them were more active in collecting data. 0:09:50 The most popular class seems to be the office: many people have collected data at the office, but were apparently not so enthusiastic to do it everywhere else.
0:10:01 Regarding the users, most of the samples come from the city of Tampere, but there is one user for whom we have some samples from Tampere and a majority from Helsinki, so two different cities. 0:10:17 And here you can see the different phone models that were included in the comparisons.
0:10:23 When we evaluated the classifiers, we used leave-one-user-out cross-validation, which means that when we test on, say, user number one, we train the classifiers using the remaining five users, and then rotate this over all the users. 0:10:40 We also report the average class-specific accuracy rather than the plain identification accuracy, because the latter would be very much biased towards the office class; we want to see how we are doing on average per class.
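To spell out the evaluation protocol, here is a small sketch of leave-one-user-out cross-validation with the class-averaged accuracy. It assumes scikit-learn's balanced_accuracy_score, which equals the unweighted mean of per-class recalls, and the train_fn / predict_fn callables are hypothetical placeholders for any of the classifiers above.

```python
# Sketch of the evaluation protocol: hold out one user at a time, train on the rest,
# and report the unweighted average of per-class accuracies so that the large
# "office" class cannot dominate the number.
from sklearn.metrics import balanced_accuracy_score

def leave_one_user_out(samples, train_fn, predict_fn):
    """samples: list of (user_id, class_label, features) tuples."""
    users = sorted({u for u, _, _ in samples})
    y_true, y_pred = [], []
    for held_out in users:
        train = [(c, f) for u, c, f in samples if u != held_out]
        test = [(c, f) for u, c, f in samples if u == held_out]
        model = train_fn(train)
        for c, f in test:
            y_true.append(c)
            y_pred.append(predict_fn(model, f))
    return balanced_accuracy_score(y_true, y_pred)
```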
0:10:56 So here are the results for the two simplest classifiers. On the x-axis you can see the codebook size used for the VQ and the K for kNN, and on the y-axis the identification rate. 0:11:07 As you can see, the best we achieve here is around forty percent for the kNN, and perhaps surprisingly we get the best result with just a single nearest neighbour; we have some possible explanations for that. 0:11:21 Then, perhaps not so surprisingly, we find that VQ scoring outperforms the best kNN configuration, and in general the results do not improve when we use more than 256 codevectors.
0:11:38 Here are the results for the frame-based Gaussian mixture models. 0:11:47 The accuracy in general saturates when we use more than 512 Gaussians, 0:11:53 and in these numbers we do not yet see the full benefit of the discriminative training; a couple more details about that will follow.
0:12:04 The GMM-SVM system was actually the most confusing one for us, because we could not find any of the typical trends: when we increased the number of Gaussians or the relevance factor, it was difficult to find any meaningful patterns. We actually tried two different SVM optimizers, which took some time, to verify that the implementation is correct, but the results are still confusing. 0:12:28 One reason could be that we are dealing with short three-second segments rather than the much longer utterances typical of NIST speaker recognition.
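For reference, the GMM supervector representation behind that system is, roughly speaking, a MAP adaptation of the UBM means towards the test segment, with the adapted means stacked into one long vector for the SVM. The sketch below is my simplification (mean-only adaptation, no variance or weight scaling), assuming a scikit-learn GaussianMixture as the UBM.

```python
# Sketch of GMM mean supervector extraction via relevance-MAP adaptation.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm: GaussianMixture, frames: np.ndarray, relevance_factor: float = 16.0):
    """frames: (n_frames, dim) MFCCs of one segment. Returns a (n_components * dim,) vector."""
    post = ubm.predict_proba(frames)                     # (n_frames, n_components) responsibilities
    n_c = post.sum(axis=0)                               # soft frame counts per component
    f_c = post.T @ frames                                # first-order statistics per component
    alpha = (n_c / (n_c + relevance_factor))[:, None]    # adaptation weights
    adapted = alpha * (f_c / np.maximum(n_c[:, None], 1e-10)) + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                               # stacked adapted means = supervector
```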
0:12:37 When we trained the universal background model here, we did not pay much attention to data balancing, so we suspect that our UBM is biased towards the office data. 0:12:49 The reason we did not do the balancing is that it would have meant capping the number of vectors per class at the smallest class, which as you can see has fewer than three thousand samples, and we did not want to reduce the data that much.
0:13:08 There were also some cases with the SVMs where, as has been pointed out to us, we should perhaps have tried to balance the number of training examples per class as well; this could be one of the reasons for this behaviour.
0:13:26 For the GLDS classifier, here are the results for three different monomial expansion orders, one, two and three, and here you can see the number of elements in the corresponding supervector. 0:13:40 It seems that we get the best accuracy with the compromise, the second-order polynomial expansion, at around a thirty-five percent correct rate. 0:13:52 The case of order one basically corresponds to taking the raw MFCC vectors and training SVMs on them, which is the setup that has actually been used in many of the previous studies on context classification, so we do a bit better than that.
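As an aside, the monomial expansion behind the GLDS kernel can be sketched as below. This is a simplification that omits the background-data normalization of the full GLDS kernel, and it assumes scikit-learn's PolynomialFeatures; the resulting segment-level vectors would then go to a linear SVM, one class versus the rest.

```python
# Sketch of a GLDS-style segment representation: expand every MFCC frame into
# monomials up to a given order and average the expanded frames over the segment.
from sklearn.preprocessing import PolynomialFeatures

def glds_supervector(mfcc_frames, order=2):
    """mfcc_frames: (n_frames, n_mfcc). Returns one fixed-length vector per segment."""
    expand = PolynomialFeatures(degree=order, include_bias=True)
    expanded = expand.fit_transform(mfcc_frames)   # all monomials up to 'order'
    return expanded.mean(axis=0)                   # segment-level average expansion
```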
0:14:08 So here is an overall comparison of the classifiers, with the control parameters set to their best values. If you look at the results, there is not much difference between the best configurations, 0:14:22 and for the supervector SVMs, as we already saw, the results fall somewhat behind the simple frame-based methods.
0:14:32 Here is some more detail, looking at the results per class. On the left you can see the name of each class and the number of test samples for that class. 0:14:44 The office environment is obviously the easiest to recognize, most likely because we have the largest amount of training data for it, and also because it appears to be very much the same office facilities: most of the users are employees of Nokia Research Center, and there might even be some of the same meetings, with the same people attending. 0:15:13 So there is some bias here.
0:15:16 Also, perhaps surprisingly, the out-of-set class gets very good accuracy, even though it has to model the acoustics of everything else except these contexts. 0:15:26 I would say this is much more difficult than training the UBM in speaker recognition, because we cannot predict all the possible audio environments we might encounter.
0:15:39 The GMM-SVM is curious: it is almost like a Dirac delta function centered on the office class. 0:15:44 As you can see, we get nearly one hundred percent recognition of the office environment, but for some other classes all the test samples are misclassified, so there is something funny going on with that one.
0:15:59 And regarding the GMM systems, we see that the discriminative training helps in about half of the cases, 0:16:04 but again it is only speculation why it helps for certain classes and not for others.
0:16:18 We were also interested in whether there is any visible user effect, because we can imagine that different users have different preferences: maybe they go to different restaurants, move around in different places, and so on. 0:16:30 So we looked at this.
0:16:36 Most of the users are quite similar, around forty to fifty percent accuracy, except for the one user whose data partly comes from Helsinki. 0:16:46 Remember that the models were trained using leave-one-user-out, which means that the city of Helsinki has basically not been seen in training. 0:16:57 So it seems, or at least it suggests, that there is a bias due to the city.
0:17:04 We also made an attempt to look at the effect of the particular mobile device, but this analysis is a little bit problematic because too many things change at the same time. 0:17:17 These two devices are the ones with the most training data, so we would expect the highest accuracy for them, but when we pool their results together as test data, we do not see the highest accuracy there. 0:17:36 So that goes against the expectation.
0:17:42 And these two devices at the bottom are ones that have basically not been seen at all in training; this is the only user with data from these devices. Still, they perform okay, for instance with the VQ, although with some other classifiers again not so well. 0:17:57 So this analysis is a little bit limited; we probably should not conclude too much from it.
0:18:07 Ideally we should have parallel recordings to isolate the pure device effect. 0:18:12 From this limited analysis, it seems that the user and the city, that kind of higher-level factor, has a stronger impact than the device itself.
0:18:24 So, let me conclude. I think this task is very interesting, but it also turned out to be quite surprising. 0:18:32 On this particular data set, the highest class-averaged identification rate we got was around forty-four percent, after all the parameter tuning. 0:18:43 None of the six classifiers really outperformed the others, at least among the four simple classifiers. 0:18:52 The MMI training helped for some of the classes, but not systematically for all of them, so I think it is worth looking into further.
0:19:01 And on this data set, the out-of-set class did not, on average, turn out to be the main problem.
0:19:06 Then, we could not get really good results using the sequence kernel SVMs, but perhaps there could be some different way of constructing the sequence kernel, not using the GMM supervectors. 0:19:20 So I think that is what I wanted to say. Thank you.
0:19:57 Okay, to repeat the question: right, the UBM was actually constructed from all of the training data for the GMM systems.
0:20:20 Yeah, but I think it has very much to do with this.
0:20:31yeah
0:20:35 Yeah, we used it to do that, and I believe that is indeed the case, that we did.
0:21:47 No, that's a separate topic. The data was collected on mobile phones, but we did all the simulations on a PC. 0:21:53 There have been a couple of studies on power consumption as well; it depends on the complexity of the model, and in that respect the preferable option would be the simple frame-level SVMs, because... 0:22:08 Yeah, right, exactly, especially if we need to continuously monitor the audio.
0:22:13 And I don't have a clue what...