0:00:16 | why don't my presentation |
---|
0:00:19 | and representing a or b i don and i this is often get energy profiles |
---|
0:00:26 | also for speech detection |
---|
0:00:28 | and a distinctly score for my all those |
---|
0:00:37 | i'm not only |
---|
0:00:38 | i can |
---|
0:00:39 | well and myself |
---|
0:00:41 | two additional okay |
---|
0:00:45 | and a little or no one by matrix and the importance that issue |
---|
0:00:51 | use of interest |
---|
0:00:52 | and the key subject matter of this paper is development of cartilage or to alleviate |
---|
0:00:58 | these phenomena |
---|
0:00:59 | this can be that energy based features which is really for with different |
---|
0:01:04 | a speaker and session position an attribute assisi |
---|
0:01:08 | and we present experimental results on strong standard |
---|
0:01:12 | spoofing database, namely ASVspoof 2015 |
---|
0:01:18 | was a big hassle |
---|
0:01:20 | and in some ways |
---|
0:01:23 | so introduction a speech signal okay |
---|
0:01:26 | well with his that was of information and then applications well |
---|
0:01:30 | but is unable |
---|
0:01:31 | the linguistic information and banning was too much |
---|
0:01:36 | the weighting function can be useful for speech recognition |
---|
0:01:40 | language recognition but linguistic information can be useful for speaker recognition in all simulations this |
---|
0:01:45 | estimation |
---|
0:01:50 | there really is the us officials what into |
---|
0:01:54 | automatic speaker verification which is the art speaker detection |
---|
0:01:59 | so in this paper we focus on speaker verification and |
---|
0:02:03 | are also |
---|
0:02:05 | so the decision in is equal be |
---|
0:02:08 | enrolment in like background noise |
---|
0:02:11 | and a noise channel mismatch |
---|
0:02:13 | but it was because there is the data lines |
---|
0:02:17 | the different be speaker in particular speaker vanity |
---|
0:02:21 | because it should be noted that you when you three eighty speakers voice basically we |
---|
0:02:26 | will speaker verification or speaker recognition |
---|
0:02:29 | however |
---|
0:02:31 | and dimension invariant also be clean speech communication channel |
---|
0:02:35 | it is a speaker's voice well which is a |
---|
0:02:39 | in the challenge |
---|
0:02:40 | because of a really |
---|
0:02:42 | in the speaker |
---|
0:02:43 | if you have |
---|
0:02:44 | with the whole middle residuals like |
---|
0:02:47 | inter session variability |
---|
0:02:49 | speaking style and pronunciation duration |
---|
0:02:53 | features physical conditions gender |
---|
0:02:56 | no |
---|
0:02:57 | exactly |
---|
0:02:58 | and there are vulnerabilities in speaker recognition which we call spoofing attacks |
---|
0:03:05 | with the means to create a lot of one of them |
---|
0:03:08 | will be able to venice system effectively |
---|
0:03:11 | and everything system |
---|
0:03:13 | and the second system is to develop an icon colleges to be able to |
---|
0:03:17 | and even something like that moment system |
---|
0:03:22 | the and we'll have at all |
---|
0:03:25 | something using model will be able to test the robustness of a system where is |
---|
0:03:29 | an additional smoothing is important |
---|
0:03:31 | could be able to design unfriendly a good you know all speaker system |
---|
0:03:39 | well i secure by mixture |
---|
0:03:43 | but the final types of the next well while the remaining speaker verification system |
---|
0:03:50 | one is based on |
---|
0:03:52 | we didn't or acoustics that his impersonation on the initial estimate astronomy |
---|
0:03:58 | no recognition results acoustics |
---|
0:04:02 | there are identical twins |
---|
0:04:03 | and there are also technologies such as speech synthesis and voice conversion |
---|
0:04:11 | note or |
---|
0:04:12 | news from charities based on this s and was in the |
---|
0:04:19 | and found where there |
---|
0:04:21 | there are in a very is and just activities as it would be in an |
---|
0:04:25 | analysis of speech |
---|
0:04:26 | that is |
---|
0:04:28 | which is |
---|
0:04:29 | very easy to mount, and it doesn't require technical knowledge |
---|
0:04:35 | most loving where plate |
---|
0:04:37 | in this story being the |
---|
0:04:39 | on the order of speech or y |
---|
0:04:42 | and so on but it is very difficult because the speech is from b |
---|
0:04:46 | within the only |
---|
0:04:48 | speech |
---|
0:04:49 | well |
---|
0:04:50 | there are in various in order for slightly a science and such things |
---|
0:04:56 | one |
---|
0:04:57 | a possibly case |
---|
0:04:59 | the only four so it will be introduced on a relational some special sessions or |
---|
0:05:04 | the sensitivities on |
---|
0:05:07 | by image analysis of a false with a single remote control which will be next |
---|
0:05:13 | however the major for star only |
---|
0:05:17 | you know internationally will challenge timebin organized in please okay that's score a seasonal quickly |
---|
0:05:25 | challenge to was |
---|
0:05:28 | and the database was really was previously alright generalized for a |
---|
0:05:39 | different in a little the content analytical |
---|
0:05:43 | organise for speech synthesis |
---|
0:05:44 | and what is a difficult question |
---|
0:05:47 | well as those of you know there those on b |
---|
0:05:51 | changeable elements which |
---|
0:05:55 | exactly |
---|
0:05:56 | last year lamenting janice was no |
---|
0:06:02 | and i just the listeners and |
---|
0:06:05 | and there was based on was used converting a play detection then the real speech |
---|
0:06:10 | detection |
---|
0:06:11 | and also |
---|
0:06:12 | we well configuration of the physical or a system |
---|
0:06:16 | and in a similar systems and so on addition |
---|
0:06:20 | this is useful right |
---|
0:06:25 | only the really comparison of the |
---|
0:06:28 | i really wasn't the risk analyses also been or someone looks for |
---|
0:06:32 | various kinds of been applied in also mention also be a d o meetings and |
---|
0:06:38 | with |
---|
0:06:38 | so since the i know the |
---|
0:06:42 | the statistically meaningful car was for things and is a the |
---|
0:06:46 | impersonation is not available |
---|
0:06:48 | and therefore the risk is unknown in a problem at least |
---|
0:06:53 | but as in industry in |
---|
0:06:58 | lattice based on an additional reason is |
---|
0:07:02 | and you know no is that is a very high |
---|
0:07:05 | well models when we don't |
---|
0:07:09 | otherwise that i this data in industry so on t |
---|
0:07:13 | the latter detection training without a nice |
---|
0:07:18 | it is what we mistral content based recording what and then us smartphone and in |
---|
0:07:24 | control of the conditions |
---|
0:07:26 | so the available gender errors could be you know |
---|
0:07:30 | right behind mobile where the risk is very high that's for sure |
---|
0:07:35 | an individual nineteen combine the enforce it is unity okay |
---|
0:07:39 | in the different my from some additional which |
---|
0:07:45 | so |
---|
0:07:46 | forcible the problem here we can consider a l stand alone today the |
---|
0:07:51 | which can be considered as a |
---|
0:07:53 | you in without in berkeley systems contain natural speech |
---|
0:07:57 | and there are four for speech detection |
---|
0:08:01 | and the something that's things you go to be here |
---|
0:08:05 | can be done i know that the we can be gone it was given that |
---|
0:08:09 | we're getting microphone point on it is really and transmission channel |
---|
0:08:16 | a sufficient wind |
---|
0:08:18 | i'm really and in the literature so it is also based on just one there |
---|
0:08:25 | so in this thing but we consider three times a small signal x based on |
---|
0:08:30 | this reason why someone |
---|
0:08:32 | is visible unit database |
---|
0:08:34 | and the latter at s |
---|
0:08:39 | so we finished isn't are from a speech |
---|
0:08:43 | and convolutional useful analysis system designer and initial speech synthesis systems sort of speech a |
---|
0:08:50 | natural language text |
---|
0:08:51 | in was very little mostly extending steganalysis |
---|
0:08:55 | then we need to find healing was and possibly a speech in addition we accent |
---|
0:08:58 | speech |
---|
0:08:59 | and this is at application that actually |
---|
0:09:02 | how we can use different that is system to |
---|
0:09:06 | communication |
---|
0:09:09 | one thousand |
---|
0:09:10 | like it is linguistically conversational wise samples for speaker who is one there is something |
---|
0:09:15 | maybe once or speaker so this is in the intervals between the most |
---|
0:09:23 | and basically kind of or |
---|
0:09:26 | by considering the impostor basically and speaker |
---|
0:09:29 | and that actually that would be that you from this is the same why second |
---|
0:09:33 | one is |
---|
0:09:34 | later in this context or |
---|
0:09:37 | and things with something s is used |
---|
0:09:42 | and so is a really useful |
---|
0:09:44 | that's for eliciting model as convolutional on the actual speech we the |
---|
0:09:49 | in both as follows actually plus the acoustic my |
---|
0:09:53 | and i and i so i'm the impostors realistic will be on relational or i |
---|
0:09:59 | mean the convolution with the impulse response of the microphone |
---|
0:10:02 | you recording idealise speaker |
---|
0:10:06 | in the multimedia speaker and acoustic |
---|
0:10:09 | so here the problem is to be able to understand the if we wanted to |
---|
0:10:13 | the acoustics |
---|
0:10:14 | which means you can detect some of the characteristics of equality |
---|
0:10:19 | was legally or maybe condition |
---|
0:10:21 | because you noting you wanna do speech coding speech |
---|
0:10:25 | the parameters and |
---|
0:10:27 | to build you |
---|
0:10:29 | really independent acoustic something both channels according to my acoustic and one |
---|
0:10:35 | we will understand whether the speech community and a genuine are indeed |
---|
0:10:40 | and does not |
---|
0:10:42 | so in this paper the one to exploit be okay spatial the initial so anything |
---|
0:10:50 | that in your you get as you it is really the in energy off basically |
---|
0:10:59 | i think is similar well energy or something that so for example in the traditional |
---|
0:11:04 | signal also literature we learned online in that you where |
---|
0:11:08 | however in the actual speech production the and belongs not sufficient which the energy requires |
---|
0:11:14 | a statistical because |
---|
0:11:16 | in that using whatever statistically hundred dollars acoustic signal |
---|
0:11:20 | is there are more or less then |
---|
0:11:22 | then as it were removed and doors a sticks in the context of acoustic signal |
---|
0:11:28 | x |
---|
0:11:29 | in the physical environment like simple and emotion and along a low dimensional the probability |
---|
0:11:34 | like single emotion |
---|
0:11:36 | automatically i dunno systems to describe why |
---|
0:11:40 | and efficient which i was solution it was gonna silently and the ministry that's implement |
---|
0:11:46 | motionless agenda |
---|
0:11:47 | estimation of cornish energy bussgang energy which each year |
---|
0:11:52 | two |
---|
0:11:53 | but and frequency which is there a five cent signal energy is not only functional |
---|
0:12:01 | roles signal that all the time and frequency and not |
---|
0:12:06 | which is completely ignoring the actual and long |
---|
0:12:10 | well at a in the energy so much |
---|
0:12:13 | and of course a bic is nothing that you colour synergy |
---|
0:12:18 | because, in a sense, E is m c squared |
---|
0:12:22 | where c is the speed of light in vacuum |
---|
0:12:25 | and even a smaller cost |
---|
0:12:27 | given that and then they rely on which |
---|
0:12:31 | so |
---|
0:12:33 | the binary that it may not here |
---|
0:12:35 | is that the energy does not depend on only one |
---|
0:12:42 | and that is ignored in the conventional approaches |
---|
0:12:46 | so what we do we consider distributional is the channel and by considering these are |
---|
0:12:50 | just a |
---|
0:12:52 | and the ones a single emotion we consider speech portion of it should be speech |
---|
0:12:58 | recognition the |
---|
0:13:03 | so the solution is this is in business in the final say |
---|
0:13:06 | which is presented in this really so well before they're a little briefly mention that |
---|
0:13:13 | these features are just an initial estimate a model |
---|
0:13:16 | the thing to their is you know which include pitch an electrician sufficient condition |
---|
0:13:22 | metaphysics fusion and features |
---|
0:13:24 | i'm late |
---|
0:13:26 | these features based on systems and what i |
---|
0:13:28 | and that the energy based features that capture if a parent and the resulting in |
---|
0:13:35 | and the features more variability |
---|
0:13:38 | didn't seem as mentioned in section which was then you know yes |
---|
0:13:43 | you know constant |
---|
0:13:46 | so be in that an active speech production facilities where s is i was really |
---|
0:13:51 | you know in the union model questioned by |
---|
0:13:55 | no not limited of each motion is more speech |
---|
0:14:00 | and this is a little investigation which means as shown that it will lie |
---|
0:14:06 | this is a mission the line and didn't you |
---|
0:14:09 | well with a mean basically a |
---|
0:14:11 | the revolution unit that is one that |
---|
0:14:15 | and in one year |
---|
0:14:16 | and you show that different phase maybe do this |
---|
0:14:21 | and |
---|
0:14:23 | is the house italy having to improvements in the only one sound an instructional |
---|
0:14:32 | so this is basically a sound signal just and yet and you also mean and |
---|
0:14:37 | you see that was and with the total number between one managed to make a |
---|
0:14:42 | basically |
---|
0:14:44 | something clustering based on and minus twenty which is coming up |
---|
0:14:47 | and false was less value omega |
---|
0:14:50 | we begin basically |
---|
0:14:52 | so i'm innocent civilians form it is clear luminosities nothing but a square and minus |
---|
0:14:57 | itself in boston based on this is nothing but this is gonna ministry rather than |
---|
0:15:02 | the previously well as you get a functional these we will call is based |
---|
0:15:07 | given that is good anything the |
---|
0:15:10 | is generally the energy profile, which is given by x of n squared minus x of n |
---|
0:15:15 | plus one into x of n minus one |
---|
0:15:17 | in order to the difference between the simple elements one |
---|
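The expression spoken just above reads like the standard discrete Teager energy operator, x(n)^2 - x(n+1) x(n-1). That reading is an assumption made from a partly unintelligible recording, and the function name `teager_energy` is hypothetical. Under that assumption, a minimal sketch of the energy profile could look like this:

```python
import math

def teager_energy(x):
    """Discrete Teager energy (assumed form): psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample, so the output is two
    samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For a pure tone A*cos(w*n + phi) this operator yields A^2 * sin(w)^2 at every
# sample, so the profile reflects both the amplitude and the frequency.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n + 0.5) for n in range(50)]
profile = teager_energy(tone)
expected = amplitude ** 2 * math.sin(omega) ** 2
```

The pure-tone case is only a sanity check. It illustrates the point made earlier in the talk that this kind of energy does not depend on amplitude alone.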
0:15:22 | okay so |
---|
0:15:24 | in this there will be used to you is really a limitation for |
---|
0:15:29 | these things are a lot and silence a minute do you we use the energy |
---|
0:15:35 | because |
---|
0:15:36 | there is a ducking under in using their so that and in his and you |
---|
0:15:40 | can imagine also not using |
---|
0:15:43 | this is a signal so you will be superior i'm writing |
---|
0:15:47 | you know |
---|
0:15:50 | so well |
---|
0:15:52 | no she can see that are you really |
---|
0:15:57 | the view point explicitly of speech |
---|
0:16:01 | this is that the of speech |
---|
0:16:02 | and this is the view point |
---|
0:16:05 | so as to what these are the a new values ten split speech |
---|
0:16:10 | as we can see that the audio file is maximum you |
---|
0:16:14 | indicating that |
---|
0:16:15 | and it is a high snr and using both linear prediction inverse like iteration well |
---|
0:16:22 | very high energy and but being able to the energy so use high energy as |
---|
0:16:28 | well |
---|
0:16:29 | secondly we use |
---|
0:16:31 | a lot in this way speech just convolutional |
---|
0:16:35 | you was responsible a automatic systems only do not be an impulse response of there |
---|
0:16:41 | are some interest |
---|
0:16:42 | kind of all places for a moment for isn't it was also sponsors |
---|
0:16:47 | will be you know in both senses so therefore |
---|
0:16:51 | i system is already a this represent one |
---|
0:16:55 | and the display the impulse response of one |
---|
0:16:58 | therefore if we consider with the v by giving a basically |
---|
0:17:04 | so for you in this explicitly mechanism where only here |
---|
0:17:08 | then you will remember only the data streams are not institutions these fluctuations are basically |
---|
0:17:14 | all |
---|
0:17:15 | most so lately |
---|
0:17:17 | we wanna speech or something and so otherwise huge estimation of impulses also otherwise |
---|
0:17:23 | we really |
---|
0:17:24 | we in both systems |
---|
0:17:26 | so fourteen recorded in boston signals are relocation only there is an excellent in the |
---|
0:17:32 | u i |
---|
0:17:33 | and blue or within the one on the gas |
---|
0:17:37 | this is for models a star |
---|
0:17:39 | the impulse response is considered that and also for all right and so do you |
---|
0:17:45 | whatever sometimes was the gate functions |
---|
0:17:48 | however control the explanation involved only more using function you can see that all this |
---|
0:17:54 | stuff to that real |
---|
0:17:56 | and negating their |
---|
0:17:58 | these speech distortion cannot consider only because of being |
---|
0:18:03 | so |
---|
0:18:04 | v high on what we |
---|
0:18:07 | actually in this work we |
---|
0:18:10 | consider this observation on the ball in the meeting to do not constitute this innovation |
---|
0:18:17 | and costly for our miss and false illusion meetings |
---|
0:18:23 | we consider this is added to this for signal which is easy |
---|
0:18:27 | the final season is in the next we think that and we are going back |
---|
0:18:32 | to that used and the |
---|
0:18:33 | of speech for example |
---|
0:18:36 | the these and the yearly was that i here |
---|
0:18:40 | other than in an actual speech corresponding to you guys also very |
---|
0:18:45 | and that's |
---|
0:18:46 | but it wasn't there is needed is one as constant consistently so the overall while |
---|
0:18:51 | also give you get a constant high on the energy |
---|
0:18:56 | and here llr fluctuations in the next speech and in the next speech for the |
---|
0:19:00 | one extracted via |
---|
0:19:02 | the model such |
---|
0:19:03 | no fluctuations almost or the rest of the homes of this work |
---|
0:19:08 | socialisation investigation well sure well why that |
---|
0:19:13 | basically |
---|
0:19:15 | this |
---|
0:19:16 | models will be also a for |
---|
0:19:20 | a small degradation and we also this one on the spectral features |
---|
0:19:25 | we also this one that we that addition a |
---|
0:19:28 | we found that in an actual speech |
---|
0:19:30 | comparison of lr |
---|
0:19:32 | the initial with features new was really matter |
---|
0:19:35 | got additional getting the |
---|
0:19:38 | but haven't in capturing the performance chosen separately distribution in an action was itching speech |
---|
0:19:44 | we also have the same thing basically on the bus one sixteen or database for |
---|
0:19:48 | the natural |
---|
0:19:50 | and e |
---|
0:19:50 | but it is in each condition is therefore anyone important easily speech just from the |
---|
0:19:55 | missing the native speech |
---|
0:19:57 | we also that |
---|
0:19:59 | the buttons on go |
---|
0:20:01 | this one you again and you which is a screen |
---|
0:20:05 | and decision that was basically features which is a on |
---|
0:20:09 | yes there's recognition and signal |
---|
0:20:13 | we pass it through the band filters |
---|
0:20:15 | and are |
---|
0:20:17 | on the this thing and mel filterbank |
---|
0:20:19 | we use an explicit about |
---|
0:20:21 | and then |
---|
0:20:21 | this filters out a little sub band signal that again as you face |
---|
0:20:26 | and then this can now investigate nearly one of those that's why we model |
---|
0:20:30 | then we move a mean and averaging all those in that you and the non |
---|
0:20:34 | dct comparing the |
---|
0:20:36 | energy research which uses the assumption contribution in the time |
---|
0:20:42 | we standard it is generally is this problem can database and because of the database |
---|
0:20:47 | and we use this is that it is feature dimension low dimensional feature does not |
---|
0:20:52 | model so that it does not want using gmm and the frequencies |
---|
0:20:58 | i think is used to use lost in the mfccs and then linear sequences elements |
---|
0:21:03 | yes mel frequency |
---|
0:21:05 | and we employ union |
---|
0:21:07 | so the nine dimensional feature vector is more or less one twenty one twenty four |
---|
0:21:13 | of this increases the finances is thirty nine and it is commonly used in addition |
---|
0:21:19 | to capture |
---|
0:21:20 | and six |
---|
0:21:23 | so far results online in the master |
---|
0:21:26 | we also there the results for the proposed features as i |
---|
0:21:30 | and it really of their combat mfcc and you design a reversed is easily |
---|
0:21:36 | well |
---|
0:21:37 | little better results than people just |
---|
0:21:40 | but when we use this the results where he can be used |
---|
0:21:48 | and the six |
---|
0:21:50 | these are the leading goals |
---|
0:21:52 | we can see that the results for the on the development and the results for |
---|
0:21:55 | the future statistically significantly better than |
---|
0:21:58 | mfcc |
---|
0:22:00 | four |
---|
0:22:01 | a development set on it is useful not continuous |
---|
0:22:06 | and then we also statistically |
---|
0:22:08 | results is only mentioned it does it is s one s ten |
---|
0:22:12 | and asked an is used in this is that is the highest |
---|
0:22:16 | so it is very important role nor their |
---|
0:22:19 | the equal error rate is relatively low |
---|
0:22:23 | for the and was features are and this is because as you compare |
---|
0:22:27 | the nazis is you an existing which was well when on the basis for almost |
---|
0:22:32 | a phone a just a |
---|
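The equal error rate mentioned in these results is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below is a minimal illustration, assuming that higher scores mean "more likely natural speech" (the recording does not state the score convention). The function name and the toy scores are hypothetical and are not results from the talk.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted as natural speech when its score >= threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: genuine trials scored below the threshold.
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Toy scores for illustration only.
genuine = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(genuine, spoof)
```

Sweeping only the observed scores keeps the sketch short. Evaluation toolkits typically interpolate between thresholds, so their figures can differ slightly.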
0:22:34 | in this work |
---|
0:22:36 | well contingency |
---|
0:22:39 | however one hundred and it would be here was you just you very large |
---|
0:22:44 | well you think whether you listen for testing |
---|
0:22:46 | and it is expected because |
---|
0:22:49 | s ten is |
---|
0:22:51 | based on tts four wheel which is the little based on a decision based on |
---|
0:22:55 | that each |
---|
0:22:56 | an active speech i is organized as we chose unity right |
---|
0:23:00 | well in the model suggesting there are basically |
---|
0:23:03 | the one thousand and then what is and using |
---|
0:23:06 | created in the gmm based system |
---|
0:23:08 | and the standard english more |
---|
0:23:11 | you know |
---|
0:23:13 | and then use the best performance on the |
---|
0:23:15 | on the features it only has a very also |
---|
0:23:19 | i was features on better than the existing to generate mfcc and sixty |
---|
0:23:26 | what is it again mapping |
---|
0:23:27 | and a few though features in windows uses the |
---|
0:23:30 | and a serious a or b and disability related in these score distributions o d |
---|
0:23:37 | development and the |
---|
0:23:39 | you versions therefore the |
---|
0:23:40 | mfcc |
---|
0:23:41 | c is easy and easy gives you know it is the |
---|
0:23:45 | system and english versions of one hundred and four not |
---|
0:23:49 | be these features |
---|
0:23:52 | and we also found this on the initial results on the eval set |
---|
0:23:56 | we also found that the proposed features perform better than |
---|
0:23:59 | miss consistent use based on spectral energy features and it is a little on the |
---|
0:24:04 | mean an additional classifier |
---|
0:24:05 | now and stands for the r m is just a |
---|
0:24:09 | however unknown the that actually you know |
---|
0:24:12 | the last one do what s |
---|
0:24:15 | using that was just a matter |
---|
0:24:18 | was recently small for the baseline systems and ms |
---|
0:24:22 | indicating that the emotions are data needed |
---|
0:24:28 | i think |
---|
0:24:28 | this is the lead to go sure would be just as soon as it will |
---|
0:24:32 | it is also their the on the development set the most features to a that |
---|
0:24:38 | is shown as a and b |
---|
0:24:40 | depending on the line |
---|
0:24:41 | no not tonight |
---|
0:24:43 | performs significantly better and then speakers in a nursing |
---|
0:24:46 | and the fusion is a lot on a similar to the but with features and |
---|
0:24:51 | there is this is just |
---|
0:24:53 | well i |
---|
0:24:55 | similar results are only versions are there |
---|
0:24:58 | on the that was just system fourteen better than this and |
---|
0:25:03 | and |
---|
0:25:05 | feasible is not from phone |
---|
0:25:07 | really doesn't bother to at most |
---|
0:25:10 | combat or indian |
---|
0:25:13 | finally or something but |
---|
0:25:15 | in this thing but only exploited bouncing you try to form that was just an |
---|
0:25:19 | addiction |
---|
0:25:20 | he of is known about features are evaluated on the standard as well a system |
---|
0:25:25 | been viewed as well as she was and only better than existing we just |
---|
0:25:29 | this is i don't do not will for testing there is units isn't this just |
---|
0:25:33 | and just which is based on okay show that is just understands exploits |
---|
0:25:38 | it doesn't deal almost a single |
---|
0:25:41 | and s is the problem is a really going to do so especially in |
---|
0:25:45 | in addition to nist |
---|
0:25:49 | the senator differences and only time |
---|
0:25:51 | as a result be a sixty nine under |
---|
0:25:54 | well i wouldn't look at compression and distance well |
---|
0:25:58 | in speaker recognition for what in the score should the |
---|
0:26:01 | it just one or more or fess |
---|
0:26:05 | the organisers or |
---|
0:26:06 | on is to go with marshall argument each a and of course urination nine just |
---|
0:26:12 | basically |
---|
0:26:13 | is it is possible to not contain challenge |
---|
0:26:15 | i don't think industry |
---|
0:26:17 | i mean and five shows |
---|