so i am from a university in finland
and this work is a collaboration with colleagues from the same city
and from nokia research center
so
i'm not sure whether i'm presenting this work at the correct conference because
to us the speaker and language variability is something that we wouldn't want
what we are interested in is the nuisance part
the stuff that we normalize away in
speaker recognition
so
and we are interested basically in inferring the mobile user's context
based on audio signals so by context in this work we mean
the physical location
the user's activity
or the particular environment
so if we have such information we can use it for many
purposes such as social networking services
or for improving
context-aware services
so in this study i'm going to focus only on environment recognition based on
acoustic cues
we consider nine different contexts which we can see in everyday life
including car and office and so on
and we also have an option for an out-of-set case where the audio
doesn't come from any of these
contexts
so as we all know modern smartphones have various
sensors like gps accelerometers
light sensors and so on
and in some of the earlier studies accelerometer data for instance
has been used for context identification
but we consider using the audio signal
so the obvious reason for that one is that the user doesn't have to do
anything extra when we want to recognize the context
and the other one is that we are not dependent on a network infrastructure
if we compare for instance
with gps
or wifi signals
and actually in some cases for instance suppose we want to discriminate between a
normal car and a bus
based on gps data alone
it would be quite difficult to tell whether it's a normal car or a bus but
if we had audio we could make this discrimination
and in fact there is some recent evidence that the audio cues can be
more helpful in
some cases
when we are trying to recognize the user's context
so here are a couple of examples of what the data looks like
the first one is probably familiar to all of us the office environment
and these are three-second segments
[audio example]
then the car environment
[audio example]
and here are examples of how we can get
quite different acoustics depending on which user
or what type of device has been used to collect the data
so this is representative of the intra-class variability we are
facing in this problem
[audio example]
and then there is for example
another example from the same user
and the funny rumbling sound that you can hear is probably because the user has
the phone in his pocket so it is close-miked against the
material
[audio example]
so we get an idea of what this problem is about
so we consider this as a supervised closed-set identification task where
we train a context model for each of our ten classes
and okay that's probably not the most correct way of doing it but we also
train an explicit model for the out-of-set class rather than trying to threshold in the
classifier
so quickly about the feature extraction we use a typical mfcc front-end as we see in
speaker recognition thirty millisecond frames
and sixteen kilohertz audio
the two differences from the speaker and language recognition setup are that we
don't include any feature normalizations here because we believe that the channel bias and
such in the mobile devices also contains useful information about the context
and also the frame rate is much reduced so we are actually
using non-overlapping frames
because there are
real-time requirements here
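The front-end settings just mentioned (16 kHz audio, 30 ms non-overlapping frames) can be sketched as follows. This is a minimal illustration of the framing step only; everything beyond those two stated parameters is an assumption, and the MFCC computation itself is omitted.

```python
# Sketch of the low-frame-rate front-end described in the talk:
# 16 kHz audio split into 30 ms NON-overlapping frames (no feature
# normalization is applied, since device/channel bias carries context
# information). The MFCC math itself is left out for brevity.

def frame_signal(samples, sample_rate=16000, frame_ms=30):
    """Split a waveform into non-overlapping frames; the reduced frame
    rate is motivated by the real-time requirement mentioned in the talk."""
    frame_len = sample_rate * frame_ms // 1000   # 480 samples per frame
    n_frames = len(samples) // frame_len         # drop the ragged tail
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

# A 3-second clip, as in the audio examples played in the talk.
clip = [0.0] * (16000 * 3)
frames = frame_signal(clip)
```

With 30 ms frames and no overlap, a 3-second clip yields only 100 frames, far fewer than the overlapping-frame rate used in typical speaker recognition front-ends.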
so let's look at the classifier backend which is the focus of this work
i tried to summarize on this slide the approaches that i could find
in the literature and to make this more accessible there are some analogies to how these
might be related to speaker and language recognition
so quite a number of authors have used very simple distance-based classification
k-nearest neighbor or vq
others have used gaussian mixture models or support vector machines and actually svms
have been studied in this field but usually not using supervector kernels
rather training on the individual mfcc frames
and then of course there are hmms to try to model the temporal trajectories of
the acoustic contexts
and also a couple of authors have used
acoustic event detection so basically you have a
discrete set of event detectors let's say laughing and cheering and then you
construct a histogram of these outputs
similar to high-level speaker recognition
so
as we know the simple methods are light in computation and we don't need
so many development datasets and so on but they are limited in the
sense that we don't have
any temporal dependence modeling
whereas the more complicated models can
model the temporal aspects but involve more computation
the application scenario that we consider is recognition from very short segments
okay and then
keeping the computational and memory costs as low as possible
and the other factor is that we don't really have access to datasets similar to what
we have in the nist evaluations
at least not on
a comparable level for this purpose
so
for these reasons we focus on
relatively simple methods here
so our contribution here is basically with some others to see how the familiar
tools that we use in speaker and language recognition work for this task
and the other thing is that in the previous studies
usually the data has been collected using the same microphone in fixed locations
in a single city
okay but in this study we have a large collection of test samples collected
by different mobile users
on
a couple of different mobile phone models as well so there is a lot of
variability with respect to the device and the user that collected the data
and we also actually compare
a number of different classifiers
okay we see here a couple of familiar and unfamiliar abbreviations i'm not going to
explain the classifiers in this study in detail because they should be
familiar to the audience
so
basically six different methods the distance-based k-nearest neighbor and vq gaussian mixtures
trained with maximum likelihood training
and also discriminative training utilizing existing tools
and then also two supervector systems using gmm supervectors and the generalized linear discriminant sequence kernel
and there are some control parameters that we considered
a note on the two simplest classifiers that we have
because knn would require that we store the whole training set which is not feasible
we use vq codebooks to approximate the training set
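The codebook idea just described, approximating each class's training set with VQ centroids and scoring by the distance to the nearest codeword, might look roughly like this. A toy sketch only: scalar "features" stand in for MFCC vectors, and the initialization and iteration count are assumptions.

```python
# Minimal VQ sketch: train a small codebook per class with Lloyd/k-means
# iterations, then classify a test point by its nearest-codeword distance.
# This is the talk's workaround for k-NN needing the full training set.

def train_codebook(data, k, iters=10):
    centers = data[:k]                      # naive init from the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[j].append(x)
        # keep the old center if a cluster went empty
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def vq_distance(x, codebook):
    """Quantization error: distance to the nearest codeword."""
    return min(abs(x - c) for c in codebook)

# Two toy classes; the test point is assigned to the closer codebook.
office = train_codebook([0.9, 1.0, 1.1, 1.2], k=2)
car    = train_codebook([5.0, 5.1, 4.9, 5.2], k=2)
label = "office" if vq_distance(1.05, office) < vq_distance(1.05, car) else "car"
```

In the real system the distances would be accumulated over all MFCC frames of a test segment before choosing the class.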
okay
so
here is an overview of the data so i'm not going to go through
the numbers in detail
but look at the last row and the last column of this table which
show the number of samples for the different classes
and users
so we can see that there is a
massive imbalance in the data which actually causes some problems for the classifier construction
and some of the users didn't collect any data for certain classes and some of
them have been more active collecting the data
and the most popular class seems to be the office
so many people have collected the data during their working hours and don't feel too
enthusiastic to do this all the time
and regarding the users
most of the samples come from the city of tampere but there is one user
for whom we actually have some data samples from tampere and a majority from helsinki
so two different cities
and then here you can see the different phone models that were
included in the comparisons
and when we evaluate the classifiers we use leave-one-user-out
cross-validation which means that when we have the test data of user number one
we train the classifiers using the remaining five users
and then we repeat this
over all the users
and also we report the average class-specific accuracy rather than the overall
identification accuracy because
that would be very much biased towards the office class we want to see on
average per class how we are doing
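The evaluation protocol just described, leave-one-user-out cross-validation scored by the average per-class accuracy, can be sketched like this. Toy data and hypothetical helper names; any of the six classifiers would plug into the splits.

```python
# Sketch of the protocol: hold out one user's data at a time, and report
# class-balanced accuracy so the over-represented office class does not
# dominate the figure.

def leave_one_user_out(samples):
    """samples: list of (user, features, label). Yields one split per user."""
    users = sorted({u for u, _, _ in samples})
    for held_out in users:
        train = [s for s in samples if s[0] != held_out]
        test  = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

def average_class_accuracy(pairs):
    """pairs: list of (true_label, predicted_label)."""
    per_class = {}
    for true, pred in pairs:
        hit, total = per_class.get(true, (0, 0))
        per_class[true] = (hit + (true == pred), total + 1)
    return sum(h / t for h, t in per_class.values()) / len(per_class)

samples = [("u1", [0.1], "office"), ("u1", [0.2], "office"), ("u2", [5.0], "car")]
splits = list(leave_one_user_out(samples))

# 9 correct office trials and 1-of-2 correct car trials: raw accuracy
# would be 10/11, but the balanced figure weights both classes equally.
pairs = [("office", "office")] * 9 + [("car", "car"), ("car", "office")]
acc = average_class_accuracy(pairs)   # (1.0 + 0.5) / 2 = 0.75
```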
so here are the
results for the two simplest classifiers so on the x-axis we can see the
codebook size that we used for the vq and k for the knn
and on the y-axis the identification rate
so as you can see the best that we can achieve here is around forty
percent for the knn
and perhaps surprisingly we get the best result when just using a single nearest
neighbor
i have some possible explanations for that
but i'm not really sure and then also not so surprisingly we find that
vq scoring outperforms the best knn configuration
and generally the results no longer improve when we use more than two hundred
fifty-six codewords
well here are the results for the gaussian mixture models
the frame-based gaussian mixtures
and
yeah
the accuracy in general saturates when we use more than five
hundred twelve gaussians
and in these numbers we don't see the maximum benefit from the discriminative training
but a couple of more details about that later
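The maximum-likelihood backend amounts to scoring each test segment against every class model and picking the argmax. The sketch below substitutes a single diagonal Gaussian per class for the full mixtures (the talk goes up to 512 components), since the decision rule is the same; all the numbers are made up.

```python
# Simplified ML classification backend: sum frame log-likelihoods under
# each class model and pick the best-scoring class. A lone 1-D Gaussian
# stands in for a per-class GMM here.
import math

def train_gaussian(frames):
    n = len(frames)
    mean = sum(frames) / n
    var = sum((x - mean) ** 2 for x in frames) / n + 1e-6  # variance floor
    return mean, var

def log_likelihood(frames, model):
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in frames)

models = {"office": train_gaussian([0.9, 1.0, 1.1]),
          "car":    train_gaussian([4.9, 5.0, 5.1])}

test_frames = [1.0, 1.1, 0.95]
label = max(models, key=lambda c: log_likelihood(test_frames, models[c]))
```

Discriminative (MMI) training would keep this same scoring rule but adjust the model parameters to maximize the posterior of the correct class rather than the data likelihood.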
and the gmm-svm system was the most confusing for us actually because we
couldn't find any of the typical trends when we increase the number of gaussians
or the relevance factor it is difficult to find any
meaningful patterns here so we actually tried two different svm optimizers and spent a couple
of weeks verifying that the implementation is correct but it is still confusing so
one reason
could be that we are dealing with these short data segments rather
than the two and a half minutes we typically have in nist speaker recognition
also
when we did the universal background model training here we didn't pay much attention
to data balancing so
we suspect that the poor ubm could be hurting these experiments
the reason why we didn't do the balancing is that
it would mean that we
need to reduce the number of training vectors to that of the smallest class which as
you can see is less than three thousand
so we didn't want to
reduce the data too much
there could also be an effect with the svm which one of the earlier speakers
was pointing out maybe we should also try to
balance the number of training examples per class so these could be the reasons
for this
behaviour
for the glds classifier here are the results for three different monomial
expansion orders one two and three
and here you can see the number of elements in the corresponding
supervector
so it seems that we get the best accuracy as a compromise with the second
order polynomial expansion
around thirty-five percent correct rate
and
the case of order one just corresponds to averaging the mfcc vectors and training svms
on those which has actually been used in many of the
previous studies on context classification
so we do better than that one
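The monomial expansion behind the GLDS kernel can be sketched as follows: each frame is mapped to all monomials of its entries up to the chosen order, and the expansions are averaged into one fixed-length supervector per segment; order 1 reduces to the mean MFCC vector just mentioned. The helper names are hypothetical.

```python
# GLDS-style supervector sketch: average of per-frame monomial expansions,
# fed to a linear SVM in the real system. Dimensionality grows quickly
# with the order, hence the compromise at order 2 reported in the talk.
from itertools import combinations_with_replacement

def monomial_expand(vec, order):
    """All monomials of the entries of vec up to the given order,
    including the constant term."""
    out = [1.0]
    for d in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(vec)), d):
            term = 1.0
            for i in idx:
                term *= vec[i]
            out.append(term)
    return out

def glds_supervector(frames, order):
    """Average the expanded frames into one fixed-length vector."""
    expanded = [monomial_expand(f, order) for f in frames]
    n = len(expanded)
    return [sum(col) / n for col in zip(*expanded)]

# For 13-dim features at order 2: 1 + 13 + C(14, 2) = 105 terms per frame.
dim = len(monomial_expand([0.0] * 13, 2))
```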
right
so here's an overall comparison of the classifiers
where we set the control parameters to the best values
so if you look at the results
there is not much difference between the best configurations
and the svms as we already saw from the results
are roughly on par with the distance-based methods
so here's some more detail if we look at the results per class
on the left you can see the name of the class and the number of
test samples
for the class
obviously the office environment seems to be the easiest to recognize
most likely because here we have the largest training set
and also it appears that it is
very much the same office facilities because the users here are
employees of nokia research center so there might even be some of the
same meetings
where the same persons were attending
so
there is some bias here
and also not surprisingly the other class the out-of-set class does not get very good accuracy because
it is not easy to model the acoustics of everything else except these contexts
i would say this is much more difficult than training the ubm in speaker
recognition because we cannot predict
all the possible audio environments
we could see
the gmm-svm is curious it's almost like a dirac delta function centered on the
office class
so as you can see we get about one hundred percent
recognition of the office environment but then
for some classes all the test samples are misclassified so there is
something funny with this one
and regarding the
gmm systems
we see that the discriminative training helps in about half of the cases
but again it's speculation what the reason might be
for
certain classes here
we were also interested to look at whether there is any user effect visible because we
can think that different users have different preferences maybe they go to different
restaurants use different vehicles and so on
so we looked at this
and most of the users are quite similar around forty to fifty
percent accuracy except for this one user for whom we have data samples also
from helsinki so remember that the models were trained using leave-one-user-out
which means that the
city of helsinki has basically not been seen in training so it seems that there is
a bias due to the city
and we also made an attempt to
look at the effect of a certain mobile device but it's a little bit
problematic analysis because too many things are changing at the same time
so
these two cases
are the devices with the most training data
but in the test data we don't see the highest accuracy for them
so
that goes against
the expectation
and then these two devices have basically not been seen at all in training
because
this is the only user with data from these devices but still for instance
the vq obtains okay results for one of them and again not
so well for the other
so this analysis is a little bit limited
it would probably be too bold to conclude anything from
this
analysis
and ideally we should have parallel recordings to see the pure device effect
but from this analysis it seems that the city or
that kind of higher level factor has a
stronger impact than the device
so
let me conclude i think that this task is very interesting but also appears to
be quite challenging
the highest average class-specific
identification rate that we got was around forty-four percent
after all the parameter tuning
and
none of the six classifiers really outperformed the others at least among the
simple classifiers
and
the mmi training helped for some of the classes but not
systematically for all of them so i think it's worth looking into further
yeah
and the out-of-set class didn't reach the average accuracy
and then we couldn't get really
good results using the sequence kernel svms but perhaps
there could be some different way of constructing the sequence kernel not using the gmm
supervectors
so i think
that's what i wanted to say thank you
okay regarding that question right actually it was constructed using the gmm systems
right
yeah but i think it's very much to do with this
yeah we used it to do that and i believe this is
the case that we did
no that's a separate topic the data was collected on mobile
phones but we did all the simulations on a pc
so there have been a couple of studies on the power consumption also i mean
it depends on the method and one reason to prefer the
simple frame-level methods is the computational cost
yeah
yeah right exactly especially if we need to continuously monitor the audio
and beyond that i don't have a clue