0:00:16why don't my presentation
0:00:19and representing a or b i don and i this is often get energy profiles
0:00:26also for speech detection
0:00:28and a distinctly score for my all those
0:00:37i'm not only
0:00:38i can
0:00:39well and myself
0:00:41two additional okay
0:00:45and a little or no one by matrix and the importance that issue
0:00:51use of interest
0:00:52and the key subject matter of this paper is development of cartilage or to alleviate
0:00:58these phenomena
0:00:59this can be that energy based features which is really for with different
0:01:04a speaker and session position an attribute assisi
0:01:08and we present experimental results on strong standard
0:01:12but he wouldn't it as a namely is used for fifteen
0:01:18was a big hassle
0:01:20and in some ways
0:01:23so introduction a speech signal okay
0:01:26well with his that was of information and then applications well
0:01:30but is unable
0:01:31the linguistic information and banning was too much
0:01:36the weighting function can be useful for speech recognition
0:01:40language recognition but linguistic information can be useful for speaker recognition in all simulations this
0:01:45estimation
0:01:50there really is the us officials what into
0:01:54automatic speaker verification which is the art speaker detection
0:01:59so in this paper we focus on speaker verification and
0:02:03are also
0:02:05so the decision in is equal be
0:02:08enrolment in like background noise
0:02:11and a noise channel mismatch
0:02:13but it was because there is the data lines
0:02:17the different be speaker in particular speaker vanity
0:02:21because it should be noted that you when you three eighty speakers voice basically we
0:02:26will speaker verification or speaker recognition
0:02:29however
0:02:31and dimension invariant also be clean speech communication channel
0:02:35it is a speaker's voice well which is a
0:02:39in the challenge
0:02:40because of a really
0:02:42in the speaker
0:02:43if you have
0:02:44with the whole middle residuals like
0:02:47inter session variability
0:02:49speaking style and pronunciation duration
0:02:53features physical conditions gender
0:02:56no
0:02:57exactly
0:02:58and their ability or a business activities in speaker recognition we call spoofing and things
0:03:05with the means to create a lot of one of them
0:03:08will be able to venice system effectively
0:03:11and everything system
0:03:13and the second system is to develop an icon colleges to be able to
0:03:17and even something like that moment system
0:03:22the and we'll have at all
0:03:25something using model will be able to test the robustness of a system where is
0:03:29an additional smoothing is important
0:03:31could be able to design unfriendly a good you know all speaker system
0:03:39well i secure by mixture
0:03:43but the final types of the next well while the remaining speaker verification system
0:03:50one is based on
0:03:52we didn't or acoustics that his impersonation on the initial estimate astronomy
0:03:58no recognition results acoustics
0:04:02there is identical twins
0:04:03and they're double the same animals do technologies such as speech synthesis and voice conversion
0:04:11note or
0:04:12news from charities based on this s and was in the
0:04:19and found where there
0:04:21there are in a very is and just activities as it would be in an
0:04:25analysis of speech
0:04:26that is
0:04:28which is
0:04:29very difficult to a very easy to model a doesn't wire technical knowledge
0:04:35most loving where plate
0:04:37in this story being the
0:04:39on the order of speech or y
0:04:42and so on but it is very difficult because the speech is from b
0:04:46within the only
0:04:48speech
0:04:49well
0:04:50there are in various in order for slightly a science and such things
0:04:56one
0:04:57a possibly case
0:04:59the only four so it will be introduced on a relational some special sessions or
0:05:04the sensitivities on
0:05:07by image analysis of a false with a single remote control which will be next
0:05:13however the major for star only
0:05:17you know internationally will challenge timebin organized in please okay that's score a seasonal quickly
0:05:25challenge to was
0:05:28and the database was really was previously alright generalized for a
0:05:39different in a little the content analytical
0:05:43organise for speech synthesis
0:05:44and what is a difficult question
0:05:47well as those of you know there those on b
0:05:51changeable elements which
0:05:55exactly
0:05:56last year lamenting janice was no
0:06:02and i just the listeners and
0:06:05and there was based on was used converting a play detection then the real speech
0:06:10detection
0:06:11and also
0:06:12we well configuration of the physical or a system
0:06:16and in a similar systems and so on addition
0:06:20this is useful right
0:06:25only the really comparison of the
0:06:28i really wasn't the risk analyses also been or someone looks for
0:06:32various kinds of been applied in also mention also be a d o meetings and
0:06:38with
0:06:38so since the i know the
0:06:42the statistically meaningful car was for things and is a the
0:06:46impersonation is not available
0:06:48and therefore the risk is unknown in a problem at least
0:06:53but as in industry in
0:06:58lattice based on an additional reason is
0:07:02and you know no is that is a very high
0:07:05well models when we don't
0:07:09otherwise that i this data in industry so on t
0:07:13the latter detection training without a nice
0:07:18it is what we mistral content based recording what and then us smartphone and in
0:07:24control of the conditions
0:07:26so the available gender errors could be you know
0:07:30right behind mobile where the risk is very high that's for sure
0:07:35an individual nineteen combine the enforce it is unity okay
0:07:39in the different my from some additional which
0:07:45so
0:07:46forcible the problem here we can consider a l stand alone today the
0:07:51which can be considered as a
0:07:53you in without in berkeley systems contain natural speech
0:07:57and there are four for speech detection
0:08:01and the something that's things you go to be here
0:08:05can be done i know that the we can be gone it was given that
0:08:09we're getting microphone point on it is really and transmission channel
0:08:16a sufficient wind
0:08:18i'm really and in the literature so it is also based on just one there
0:08:25so in this thing but we consider three times a small signal x based on
0:08:30this reason why someone
0:08:32is visible unit database
0:08:34and the latter at s
0:08:39so we finished isn't are from a speech
0:08:43and convolutional useful analysis system designer and initial speech synthesis systems sort of speech a
0:08:50natural language text
0:08:51in was very little mostly extending steganalysis
0:08:55then we need to find healing was and possibly a speech in addition we accent
0:08:58speech
0:08:59and this is at application that actually
0:09:02how we can use different that is system to
0:09:06communication
0:09:09one thousand
0:09:10like it is linguistically conversational wise samples for speaker who is one there is something
0:09:15maybe once or speaker so this is in the intervals between the most
0:09:23and basically kind of or
0:09:26by considering the impostor basically and speaker
0:09:29and that actually that would be that you from this is the same why second
0:09:33one is
0:09:34later in this context or
0:09:37and things with something s is used
0:09:42and so is a really useful
0:09:44that's for eliciting model as convolutional on the actual speech we the
0:09:49in both as follows actually plus the acoustic my
0:09:53and i and i so i'm the impostors realistic will be on relational or i
0:09:59mean the convolutional was response of the microphone
0:10:02you recording idealise speaker
0:10:06in the multimedia speaker and acoustic
0:10:09so here the problem is to be able to understand the if we wanted to
0:10:13the acoustics
0:10:14which means you can detect some of the characteristics of equality
0:10:19was legally or maybe condition
0:10:21because you noting you wanna do speech coding speech
0:10:25the parameters and
0:10:27to build you
0:10:29really independent acoustic something both channels according to my acoustic and one
0:10:35we will understand whether the speech community and a genuine are indeed
0:10:40and does not
0:10:42so in this paper the one to exploit be okay spatial the initial so anything
0:10:50that in your you get as you it is really the in energy off basically
0:10:59i think is similar well energy or something that so for example in the traditional
0:11:04signal also literature we learned online in that you where
0:11:08however in the actual speech production the and belongs not sufficient which the energy requires
0:11:14a statistical because
0:11:16in that using whatever statistically hundred dollars acoustic signal
0:11:20is there are more or less then
0:11:22then as it were removed and doors a sticks in the context of acoustic signal
0:11:28x
0:11:29in the physical environment like simple and emotion and along a low dimensional the probability
0:11:34like single emotion
0:11:36automatically i dunno systems to describe why
0:11:40and efficient which i was solution it was gonna silently and the ministry that's implement
0:11:46motionless agenda
0:11:47estimation of cornish energy bussgang energy which each year
0:11:52two
0:11:53but and frequency which is there a five cent signal energy is not only functional
0:12:01roles signal that all the time and frequency and not
0:12:06which is completely ignoring the actual and long
0:12:10well at a in the energy so much
0:12:13and of course a bic is nothing that you colour synergy
0:12:18because in a sense e is m c squared
0:12:22where c is the speed of light in vacuum
0:12:25and even a smaller cost
0:12:27given that and then they rely on which
0:12:31so
0:12:33the binary that it may not here
0:12:35is that the energy does not depend on only one
0:12:42and that is ignored in the conditions these approaches
0:12:46so what we do we consider distributional is the channel and by considering these are
0:12:50just a
0:12:52and the ones a single emotion we consider speech portion of it should be speech
0:12:58recognition the
0:13:03so the solution is this is in business in the final say
0:13:06which is presented in this really so well before they're a little briefly mention that
0:13:13these features are just an initial estimate a model
0:13:16the thing to their is you know which include pitch an electrician sufficient condition
0:13:22metaphysics fusion and features
0:13:24i'm late
0:13:26these features based on systems and what i
0:13:28and that the energy based features that capture if a parent and the resulting in
0:13:35and the features more variability
0:13:38didn't seem as mentioned in section which was then you know yes
0:13:43you know constant
0:13:46so be in that an active speech production facilities where s is i was really
0:13:51you know in the union model questioned by
0:13:55no not limited of each motion is more speech
0:14:00and this is a little investigation which means as shown that it will lie
0:14:06this is a mission the line and didn't you
0:14:09well with a mean basically a
0:14:11the revolution unit that is one that
0:14:15and in one year
0:14:16and you show that different phase maybe do this
0:14:21and
0:14:23is the house italy having to improvements in the only one sound an instructional
0:14:32so this is basically a sound signal just and yet and you also mean and
0:14:37you see that was and with the total number between one managed to make a
0:14:42basically
0:14:44something clustering based on and minus twenty which is coming up
0:14:47and false was less value omega
0:14:50we begin basically
0:14:52so i'm innocent civilians form it is clear luminosities nothing but a square and minus
0:14:57itself in boston based on this is nothing but this is gonna ministry rather than
0:15:02the previously well as you get a functional these we will call is based
0:15:07given that is good anything the
0:15:10is generally the energy profile which is given by x of n squared minus x of n
0:15:15plus one into x of n minus one
0:15:22okay so
0:15:24in this there will be used to you is really a limitation for
0:15:29these things are a lot and silence a minute do you we use the energy
0:15:35because
0:15:36there is a ducking under in using their so that and in his and you
0:15:40can imagine also not using
0:15:43this is a signal so you will be superior i'm writing
0:15:47you know
0:15:50so well
0:15:52no she can see that are you really
0:15:57the view point explicitly of speech
0:16:01this is that the of speech
0:16:02and this is the view point
0:16:05so as to what these are the a new values ten split speech
0:16:10as we can see that the audio file is maximum you
0:16:14indicating that
0:16:15and it is a high snr and using both linear prediction inverse like iteration well
0:16:22very high energy and but being able to the energy so use high energy as
0:16:28well
0:16:29secondly we use
0:16:31a lot in this way speech just convolutional
0:16:35you was responsible a automatic systems only do not be an impulse response of there
0:16:41are some interest
0:16:42kind of all places for a moment for isn't it was also sponsors
0:16:47will be you know in both senses so therefore
0:16:51i system is already a this represent one
0:16:55and the display the impulse response of one
0:16:58therefore if we consider with the v by giving a basically
0:17:04so for you in this explicitly mechanism where only here
0:17:08then you will remember only the data streams are not institutions these fluctuations are basically
0:17:14all
0:17:15most so lately
0:17:17we wanna speech or something and so otherwise huge estimation of impulses also otherwise
0:17:23we really
0:17:24we in both systems
0:17:26so fourteen recorded in boston signals are relocation only there is an excellent in the
0:17:32u i
0:17:33and blue or within the one on the gas
0:17:37this is for models a star
0:17:39the impulse response is considered that and also for all right and so do you
0:17:45whatever sometimes was the gate functions
0:17:48however control the explanation involved only more using function you can see that all this
0:17:54stuff to that real
0:17:56and negating their
0:17:58these speech distortion cannot consider only because of being
0:18:03so
0:18:04v high on what we
0:18:07actually in this work we
0:18:10consider this observation on the ball in the meeting to do not constitute this innovation
0:18:17and costly for our miss and false illusion meetings
0:18:23we consider this is added to this for signal which is easy
0:18:27the final season is in the next we think that and we are going back
0:18:32to that used and the
0:18:33of speech for example
0:18:36the these and the yearly was that i here
0:18:40other than in an actual speech corresponding to you guys also very
0:18:45and that's
0:18:46but it wasn't there is needed is one as constant consistently so the overall while
0:18:51also give you get a constant high on the energy
0:18:56and here llr fluctuations in the next speech and in the next speech for the
0:19:00one extracted via
0:19:02the model such
0:19:03no fluctuations almost or the rest of the homes of this work
0:19:08socialisation investigation well sure well why that
0:19:13basically
0:19:15this
0:19:16models will be also a for
0:19:20a small degradation and we also this one on the spectral features
0:19:25we also this one that we that addition a
0:19:28we found that in an actual speech
0:19:30comparison of lr
0:19:32the initial with features new was really matter
0:19:35got additional getting the
0:19:38but haven't in capturing the performance chosen separately distribution in an action was itching speech
0:19:44we also have the same thing basically on the bus one sixteen or database for
0:19:48the natural
0:19:50and e
0:19:50but it is in each condition is therefore anyone important easily speech just from the
0:19:55missing the native speech
0:19:57we also that
0:19:59the buttons on go
0:20:01this one you again and you which is a screen
0:20:05and decision that was basically features which is a on
0:20:09yes there's recognition and signal
0:20:13we pass it through the bandpass filters
0:20:15and are
0:20:17on the this thing and mel filterbank
0:20:19we use an explicit about
0:20:21and then
0:20:21this filters out a little sub band signal that again as you face
0:20:26and then this can now investigate nearly one of those that's why we model
0:20:30then we move a mean and averaging all those in that you and the non
0:20:34dct comparing the
0:20:36energy research which uses the assumption contribution in the time
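The extraction steps mentioned here (filterbank, subband signals, an energy estimate, averaging, DCT) can be sketched roughly as below. This is only a sketch under stated assumptions: the subband signals are taken as already filtered, the energy estimate is assumed to be the Teager-style operator, and the absolute value, the logarithm, the frame length of 400 samples and the hop of 160 samples are all assumptions that the transcript does not confirm. All function names are hypothetical.

```python
import math

def teager_energy(x):
    """psi[n] = x[n]^2 - x[n-1] * x[n+1] (assumed energy estimate)."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def frame_average(values, frame_len, hop):
    """Mean of each frame of frame_len samples, advancing by hop samples."""
    return [sum(values[s:s + frame_len]) / frame_len
            for s in range(0, len(values) - frame_len + 1, hop)]

def dct2(v):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(v)
    return [sum(v[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def subband_energy_cepstra(subbands, frame_len=400, hop=160, floor=1e-12):
    """subbands: equal-length subband signals, assumed already band-filtered.

    Returns one cepstral-like vector per frame (one coefficient per subband).
    """
    per_band = []
    for band in subbands:
        # The operator can dip below zero, so take magnitudes (assumption).
        energy = [abs(e) for e in teager_energy(band)]
        means = frame_average(energy, frame_len, hop)
        # Log compression before the DCT is an assumption; floor avoids log(0).
        per_band.append([math.log(max(m, floor)) for m in means])
    n_frames = len(per_band[0])
    return [dct2([band[t] for band in per_band]) for t in range(n_frames)]

# Toy input: three identical "subbands" of a single tone (illustration only).
band = [math.cos(0.3 * n) for n in range(1000)]
feats = subband_energy_cepstra([band, band, band])
```

With identical subbands only the zeroth DCT coefficient is non-zero, which is a quick sanity check on the transform. A real front end would replace the toy input with the outputs of the actual filterbank.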
0:20:42we standard it is generally is this problem can database and because of the database
0:20:47and we use this is that it is feature dimension low dimensional feature does not
0:20:52model so that it does not want using gmm and the frequencies
0:20:58i think is used to use lost in the mfccs and then linear sequences elements
0:21:03yes mel frequency
0:21:05and we employ union
0:21:07so the nine dimensional feature vector is more or less one twenty one twenty four
0:21:13of this increases the finances is thirty nine and it is commonly used in addition
0:21:19to capture
0:21:20and six
0:21:23so far results online in the master
0:21:26we also there the results for the proposed features as i
0:21:30and it really of their combat mfcc and you design a reversed is easily
0:21:36well
0:21:37little better results than people just
0:21:40but when we use this the results where he can be used
0:21:48and the six
0:21:50these are the leading goals
0:21:52we can see that on the development set the results for
0:21:55the features are statistically significantly better than
0:21:58mfcc
0:22:00four
0:22:01a development set on it is useful not continuous
0:22:06and then we also statistically
0:22:08results is only mentioned it does it is s one s ten
0:22:12and asked an is used in this is that is the highest
0:22:16so it is very important role nor their
0:22:19the equal error rate is relatively low
0:22:23for the and was features are and this is because as you compare
0:22:27the nazis is you an existing which was well when on the basis for almost
0:22:32a phone a just a
0:22:34in this work
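Since the results are reported as equal error rates, a minimal sketch of how an EER can be approximated from two lists of detection scores may help. It assumes that a higher score means "more likely genuine" and uses a plain threshold sweep rather than any official challenge scoring tool. The function name and the toy scores are made up for illustration.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping a threshold over every observed score.

    A trial is accepted as genuine when its score is >= the threshold.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores, invented purely for illustration.
genuine = [2.0, 1.5, 0.9, 0.2]
spoof = [-1.0, -0.5, 0.3, 1.0]
eer = equal_error_rate(genuine, spoof)  # 0.25 for these toy scores
```

The sweep returns the midpoint of the two error rates at the threshold where they are closest, so with few trials it is a coarse estimate.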
0:22:36well contingency
0:22:39however one hundred and it would be here was you just you very large
0:22:44well you think whether you listen for testing
0:22:46and it is expected because
0:22:49s ten is
0:22:51based on tts four wheel which is the little based on a decision based on
0:22:55that each
0:22:56an active speech i is organized as we chose unity right
0:23:00well in the model suggesting there are basically
0:23:03the one thousand and then what is and using
0:23:06created in the gmm based system
0:23:08and the standard english more
0:23:11you know
0:23:13and then use the best performance on the
0:23:15on the features it only has a very also
0:23:19i was features on better than the existing to generate mfcc and sixty
0:23:26what is it again mapping
0:23:27and a few though features in windows uses the
0:23:30and a serious a or b and disability related in these score distributions o d
0:23:37development and the
0:23:39you versions therefore the
0:23:40mfcc
0:23:41c is easy and easy gives you know it is the
0:23:45system and english versions of one hundred and four not
0:23:49be these features
0:23:52and we also found this on the initial results on the eval set
0:23:56we also see that the proposed features perform better than
0:23:59miss consistent use based on spectral energy features and it is a little on the
0:24:04mean an additional classifier
0:24:05now and stands for the r m is just a
0:24:09however unknown the that actually you know
0:24:12the last one do what s
0:24:15using that was just a matter
0:24:18was recently small for the baseline systems and ms
0:24:22indicating that the emotions are data needed
0:24:28i think
0:24:28this is the lead to go sure would be just as soon as it will
0:24:32it is also their the on the development set the most features to a that
0:24:38is shown as a and b
0:24:40depending on the line
0:24:41no not tonight
0:24:43performs significantly better and then speakers in a nursing
0:24:46and the fusion is a lot on a similar to the but with features and
0:24:51there is this is just
0:24:53well i
0:24:55similar results are only versions are there
0:24:58on the that was just system fourteen better than this and
0:25:03and
0:25:05feasible is not from phone
0:25:07really doesn't bother to at most
0:25:10combat or indian
0:25:13finally or something but
0:25:15in this thing but only exploited bouncing you try to form that was just an
0:25:19addiction
0:25:20he of is known about features are evaluated on the standard as well a system
0:25:25been viewed as well as she was and only better than existing we just
0:25:29this is i don't do not will for testing there is units isn't this just
0:25:33and just which is based on okay show that is just understands exploits
0:25:38it doesn't deal almost a single
0:25:41and s is the problem is a really going to do so especially in
0:25:45in addition to nist
0:25:49the senator differences and only time
0:25:51as a result be a sixty nine under
0:25:54well i wouldn't look at compression and distance well
0:25:58in speaker recognition for what in the score should the
0:26:01it just one or more or fess
0:26:05the organisers or
0:26:06on is to go with marshall argument each a and of course urination nine just
0:26:12basically
0:26:13is it is possible to not contain challenge
0:26:15i don't think industry
0:26:17i mean and five shows