0:00:16 | hello everyone |
---|
0:00:42 | i am a graduate student of computer science in college |
---|
0:00:49 | i am glad to show you |
---|
0:00:51 | the study of the effects of intrinsic variation using i-vectors in text-independent speaker |
---|
0:00:57 | verification |
---|
0:01:00 | first i will introduce the main challenge in speaker verification |
---|
0:01:04 | and then i will |
---|
0:01:06 | talk |
---|
0:01:07 | about the problem of this research |
---|
0:01:10 | and our proposal |
---|
0:01:13 | then i will introduce the i-vector framework and the variation modeling |
---|
0:01:19 | including how we define the variation forms of the speech signal |
---|
0:01:25 | and then i will give the |
---|
0:01:28 | details |
---|
0:01:29 | of the |
---|
0:01:30 | database |
---|
0:01:32 | the description of the speaker verification systems |
---|
0:01:35 | and the experiment results |
---|
0:01:38 | and finally the conclusions |
---|
0:01:43 | the variability in speaker verification comes from two parts the first one is |
---|
0:01:48 | the extrinsic variability |
---|
0:01:51 | and the other one is the intrinsic variability |
---|
0:01:55 | the extrinsic variability is associated with factors that |
---|
0:01:58 | come from outside of the speakers such as mismatched channels |
---|
0:02:02 | or environmental noise |
---|
0:02:05 | the intrinsic variability is associated with factors that come |
---|
0:02:10 | from the speakers themselves |
---|
0:02:12 | such as the speaking style |
---|
0:02:14 | emotion |
---|
0:02:15 | speech rate and state of health |
---|
0:02:19 | currently there is a lot of research |
---|
0:02:22 | focusing on the extrinsic variability |
---|
0:02:25 | and a number of methods for |
---|
0:02:29 | the extrinsic variability have been proposed |
---|
0:02:32 | in this paper we focus on the intrinsic variability |
---|
0:02:37 | that is the variation forms of the speech we use |
---|
0:02:41 | in speaker verification |
---|
0:02:46 | the problem we focus on is |
---|
0:02:48 | the performance of speaker verification |
---|
0:02:51 | in the presence of the intrinsic variations of speech |
---|
0:02:55 | so there are two questions |
---|
0:02:58 | the first one is |
---|
0:02:59 | how the speaker verification system performs |
---|
0:03:01 | when enrollment and testing are under mismatched conditions between intrinsic variations |
---|
0:03:07 | and the second part is |
---|
0:03:09 | how the technologies that focus on modeling extrinsic variability |
---|
0:03:13 | perform in addressing the effects of intrinsic variation in speaker verification |
---|
0:03:19 | so our |
---|
0:03:21 | proposal |
---|
0:03:23 | would be to model the intrinsic variability with the i-vector framework and |
---|
0:03:28 | compensation methods |
---|
0:03:33 | and |
---|
0:03:34 | first we have to define the variation forms |
---|
0:03:38 | because the intrinsic variability comes from |
---|
0:03:41 | all the factors associated with the speakers |
---|
0:03:44 | but these are still abstract |
---|
0:03:46 | so we first |
---|
0:03:48 | define the base form that is neutral spontaneous |
---|
0:03:53 | speech at a normal rate |
---|
0:03:58 | which covers most everyday cases |
---|
0:03:59 | based on that |
---|
0:04:00 | we |
---|
0:04:02 | then |
---|
0:04:02 | define the |
---|
0:04:04 | variation forms |
---|
0:04:06 | from six aspects including speech rate |
---|
0:04:09 | physical state |
---|
0:04:11 | speaking volume |
---|
0:04:12 | emotional state speaking style and the speaking language |
---|
0:04:16 | for example in the speaking rate |
---|
0:04:19 | we have |
---|
0:04:19 | fast speech or slow speech |
---|
0:04:22 | in the physical states |
---|
0:04:25 | we have |
---|
0:04:27 | tired speech and mouthful speech |
---|
0:04:29 | for example the mouthful speech means |
---|
0:04:32 | the speakers have a candy in their mouths |
---|
0:04:36 | and talk |
---|
0:04:37 | in that way |
---|
0:04:39 | and we recorded |
---|
0:04:41 | the speech data they produced |
---|
0:04:45 | in that condition |
---|
0:04:49 | and |
---|
0:04:50 | the speaking volume includes loud soft and whisper |
---|
0:04:55 | in the emotional state we have the happy |
---|
0:04:58 | emotion and the angry emotion |
---|
0:05:01 | and in |
---|
0:05:03 | the speaking style |
---|
0:05:04 | we have a reading style |
---|
0:05:07 | and for the speaking language since our speakers speak chinese we have a second language as the variation |
---|
0:05:13 | so from these six aspects |
---|
0:05:15 | we have the |
---|
0:05:18 | variation forms and we |
---|
0:05:21 | recorded the data |
---|
0:05:23 | for the experiments |
---|
0:05:27 | then |
---|
0:05:29 | we use the i-vector framework to model the variations |
---|
0:05:33 | since i-vector modeling has been successful in applications |
---|
0:05:38 | for channel compensation |
---|
0:05:42 | the i-vector framework is composed of two parts first |
---|
0:05:46 | we project the supervector |
---|
0:05:51 | into the i-vector through the total variability space |
---|
0:05:57 | which is a low dimensional space |
---|
0:06:01 | the second part is the scoring |
---|
0:06:04 | we can use the cosine similarity score |
---|
0:06:08 | to measure the |
---|
0:06:10 | similarity between a test |
---|
0:06:13 | utterance and an |
---|
0:06:15 | enrollment utterance |
---|
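The cosine scoring step just described can be sketched in a few lines; this is an illustrative reconstruction, not the system from the talk (the 200-dimensional random vectors stand in for real i-vector extractions):

```python
import numpy as np

def cosine_score(w_test, w_enroll):
    """Cosine similarity between a test i-vector and an enrollment i-vector.

    The verification decision is made by comparing this score to a threshold.
    """
    return float(np.dot(w_test, w_enroll) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_enroll)))

# Toy 200-dimensional i-vectors (random stand-ins for real extractions).
rng = np.random.default_rng(0)
w_enroll = rng.standard_normal(200)
w_same = w_enroll + 0.1 * rng.standard_normal(200)  # same-speaker-like test
w_imp = rng.standard_normal(200)                    # impostor-like test

print(cosine_score(w_same, w_enroll))  # high score, near 1
print(cosine_score(w_imp, w_enroll))   # low score, near 0
```

The compensation methods discussed next (LDA, WCCN, NAP) are all applied to the i-vectors before this scoring step.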
0:06:18 | so |
---|
0:06:19 | why do we model the intrinsic variations with the |
---|
0:06:22 | i-vector framework |
---|
0:06:24 | because |
---|
0:06:25 | there have already been quite a few |
---|
0:06:27 | studies |
---|
0:06:28 | about the i-vector framework for modeling |
---|
0:06:32 | the channel compensation |
---|
0:06:35 | so |
---|
0:06:37 | we want to see if it is also suitable for |
---|
0:06:40 | the intrinsic variability we are interested in |
---|
0:06:46 | second how to handle the effects of the mismatch |
---|
0:06:49 | here we use a set of technologies |
---|
0:06:53 | which were originally used to handle the mismatch from |
---|
0:06:57 | channels |
---|
0:06:59 | namely we use lda wccn and nap |
---|
0:07:04 | the idea behind the lda is to |
---|
0:07:07 | minimize the within speaker variability while maximizing the between speaker variability |
---|
0:07:14 | after we have |
---|
0:07:15 | defined the scatter matrices |
---|
0:07:17 | the lda projection matrix is obtained |
---|
0:07:20 | by |
---|
0:07:23 | solving a generalized eigenvalue equation it is composed of |
---|
0:07:25 | the eigenvectors |
---|
0:07:26 | sorted by decreasing eigenvalue of the equation |
---|
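The LDA training just described can be sketched as follows, assuming the standard within/between scatter definitions; the function name and the toy two-speaker data are invented for illustration, this is not the paper's code:

```python
import numpy as np

def lda_projection(ivectors, labels, n_dims):
    """LDA projection matrix for i-vectors.

    Builds the within-speaker scatter Sw and between-speaker scatter Sb,
    then keeps the eigenvectors of Sw^{-1} Sb with the largest eigenvalues,
    so the projection maximizes between-speaker variability relative to
    within-speaker variability.
    """
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for spk in np.unique(y):
        Xs = X[y == spk]
        mu_s = Xs.mean(axis=0)
        Sw += (Xs - mu_s).T @ (Xs - mu_s)
        diff = (mu_s - mu)[:, None]
        Sb += len(Xs) * (diff @ diff.T)
    # Eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_dims]]  # columns = projection directions

# Toy example: two "speakers" separated along the first axis.
rng = np.random.default_rng(1)
spk_a = 0.1 * rng.standard_normal((50, 3)) + np.array([1.0, 0.0, 0.0])
spk_b = 0.1 * rng.standard_normal((50, 3)) + np.array([-1.0, 0.0, 0.0])
A = lda_projection(np.vstack([spk_a, spk_b]), [0] * 50 + [1] * 50, 1)
```

An i-vector w would then be projected as `A.T @ w` before cosine scoring.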
0:07:31 | and |
---|
0:07:33 | within class |
---|
0:07:34 | covariance normalization |
---|
0:07:37 | denoted wccn |
---|
0:07:38 | the idea is to |
---|
0:07:42 | attenuate the directions of high intra-speaker variability |
---|
0:07:51 | the wccn projection matrix is obtained by |
---|
0:07:56 | computing the equation with the |
---|
0:08:00 | cholesky |
---|
0:08:02 | decomposition |
---|
0:08:04 | that is b times b transpose |
---|
0:08:06 | equals the inverse of the within class covariance |
---|
0:08:08 | where b is the projection matrix |
---|
0:08:11 | and the third method we use is |
---|
0:08:13 | nuisance attribute projection nap which removes the nuisance directions |
---|
0:08:19 | so |
---|
0:08:20 | the nap projection matrix |
---|
0:08:26 | is finally composed of the leading eigenvectors of the within class |
---|
0:08:30 | covariance |
---|
0:08:32 | matrix |
---|
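A minimal sketch of these two projections under their standard definitions (WCCN as the Cholesky factor of the inverse within-class covariance, NAP as I - R R^T built from the leading within-class eigenvectors); the helper names and toy data are invented for illustration, not taken from the paper:

```python
import numpy as np

def within_class_cov(ivectors, labels):
    """Average within-speaker covariance of the i-vectors."""
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    d = X.shape[1]
    W = np.zeros((d, d))
    speakers = np.unique(y)
    for spk in speakers:
        Xs = X[y == spk]
        W += np.cov(Xs.T, bias=True)
    return W / len(speakers)

def wccn_matrix(ivectors, labels):
    """WCCN projection B from the Cholesky decomposition B B^T = W^{-1}."""
    W = within_class_cov(ivectors, labels)
    d = W.shape[0]
    return np.linalg.cholesky(np.linalg.inv(W + 1e-6 * np.eye(d)))

def nap_matrix(ivectors, labels, n_nuisance):
    """NAP projection P = I - R R^T, where the columns of R are the leading
    eigenvectors (nuisance directions) of the within-speaker covariance."""
    W = within_class_cov(ivectors, labels)
    evals, evecs = np.linalg.eigh(W)
    R = evecs[:, np.argsort(-evals)[:n_nuisance]]
    return np.eye(W.shape[0]) - R @ R.T

# Toy data: three "speakers", four-dimensional i-vectors.
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4)) + np.repeat(rng.standard_normal((3, 4)), 10, axis=0)
y = np.repeat([0, 1, 2], 10)
B = wccn_matrix(X, y)   # applied to an i-vector w as B.T @ w
P = nap_matrix(X, y, 1)  # applied as P @ w
```

Both projections are applied to the i-vectors before the cosine scoring step.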
0:08:36 | so next i will introduce the experiments about how these methods perform |
---|
0:08:42 | with the intrinsic |
---|
0:08:45 | variations |
---|
0:08:47 | first i will introduce the intrinsic variation corpus which we have recorded |
---|
0:08:52 | and which we use |
---|
0:08:53 | both for the training and the test |
---|
0:08:57 | then i will give a description of |
---|
0:09:00 | the speaker recognition systems |
---|
0:09:04 | we |
---|
0:09:04 | use the gmm-ubm system |
---|
0:09:08 | as the baseline speaker recognition system and then we |
---|
0:09:14 | will use |
---|
0:09:15 | i-vector based speaker recognition systems with different |
---|
0:09:19 | intrinsic variability compensation |
---|
0:09:21 | methods |
---|
0:09:22 | lda wccn and nap and then we show the experiment |
---|
0:09:26 | results |
---|
0:09:30 | the intrinsic variation corpus that we use |
---|
0:09:34 | consists of |
---|
0:09:36 | one hundred |
---|
0:09:37 | university students |
---|
0:09:39 | who |
---|
0:09:42 | are native speakers of chinese |
---|
0:09:45 | and aged from about eighteen to twenty |
---|
0:09:51 | each speaker records |
---|
0:09:54 | the twelve |
---|
0:09:56 | variation forms |
---|
0:09:58 | and speaks for three minutes |
---|
0:10:02 | for each variation form |
---|
0:10:07 | then each utterance is divided into about ten |
---|
0:10:10 | parts |
---|
0:10:11 | so each part lasts for |
---|
0:10:14 | eighteen seconds and is used for training and testing |
---|
0:10:18 | and as for the format |
---|
0:10:20 | the resolution is sixteen bits |
---|
0:10:23 | and they are all mono soundtracks |
---|
0:10:27 | we use the data as follows in the intrinsic variation corpus |
---|
0:10:33 | first for the ubm training |
---|
0:10:38 | we use thirty speakers |
---|
0:10:41 | fifteen male and fifteen female which is used to train the gender dependent and gender independent ubm |
---|
0:10:48 | the data lasts for eighteen hours and covers all the |
---|
0:10:52 | variation forms |
---|
0:10:54 | then we use thirty speakers |
---|
0:10:58 | of data to train |
---|
0:11:00 | the total variability space which is the matrix t |
---|
0:11:05 | this data also lasts for eighteen hours |
---|
0:11:08 | and of course we have to |
---|
0:11:11 | train the |
---|
0:11:12 | different |
---|
0:11:15 | intrinsic variability compensation methods |
---|
0:11:18 | for lda wccn and nap we have to train the projection matrices |
---|
0:11:23 | and we |
---|
0:11:24 | use another set of speakers |
---|
0:11:29 | for training the projection matrices |
---|
0:11:32 | lastly for the test we used speakers which contribute two thousand four hundred utterances |
---|
0:11:38 | for the task |
---|
0:11:41 | over |
---|
0:11:41 | all the variation forms |
---|
0:11:46 | and then we build five |
---|
0:11:48 | speaker recognition systems |
---|
0:11:50 | we use the gmm-ubm |
---|
0:11:54 | speaker recognition system as the baseline system |
---|
0:12:01 | the feature vectors are thirteen dimensional mfcc and the ubm is composed of |
---|
0:12:11 | five hundred and twelve gaussian mixtures |
---|
0:12:15 | and the i-vector based speaker verification systems |
---|
0:12:20 | use the lda wccn and nap and also the combination of lda and wccn |
---|
0:12:27 | and the i-vector dimension is two hundred |
---|
0:12:35 | this table |
---|
0:12:38 | shows the eer for each enrollment condition where the testing utterances |
---|
0:12:44 | cover all the variation forms |
---|
0:12:47 | first for the enrollment conditions |
---|
0:12:50 | we choose the spontaneous speech as the base case |
---|
0:12:55 | then we have |
---|
0:12:56 | the six aspects including speech rate speaking volume emotional state physical |
---|
0:13:04 | state |
---|
0:13:05 | speaking style and language |
---|
0:13:07 | there are |
---|
0:13:08 | twelve variation forms |
---|
0:13:11 | and for each variation form |
---|
0:13:13 | we |
---|
0:13:15 | use it |
---|
0:13:16 | for the enrollment condition |
---|
0:13:19 | and test |
---|
0:13:21 | with all the variation forms and we can see that the |
---|
0:13:27 | i-vector based systems |
---|
0:13:29 | perform much better than the gmm ubm |
---|
0:13:33 | baseline system |
---|
0:13:34 | the best results are obtained with |
---|
0:13:38 | the combination |
---|
0:13:40 | of lda and wccn |
---|
0:13:44 | and also we can |
---|
0:13:48 | see |
---|
0:13:50 | that among the different variation forms |
---|
0:13:52 | if we use whisper |
---|
0:13:55 | as the enrollment condition |
---|
0:13:57 | then |
---|
0:13:59 | the eer is |
---|
0:14:01 | the largest |
---|
0:14:03 | so the performance is the worst |
---|
0:14:10 | then we |
---|
0:14:11 | calculated the average eer for the speaker verification systems on |
---|
0:14:16 | the intrinsic variation corpus |
---|
0:14:17 | and from this table we can see that |
---|
0:14:21 | the i-vector based speaker verification system is better |
---|
0:14:26 | than the gmm ubm |
---|
0:14:28 | speaker verification system |
---|
0:14:29 | on the intrinsic variation corpus |
---|
0:14:32 | and |
---|
0:14:34 | the best results are obtained by the i-vector |
---|
0:14:38 | based |
---|
0:14:39 | speaker verification system with the lda and wccn |
---|
0:14:44 | and finally |
---|
0:14:47 | in this figure |
---|
0:14:52 | we show the det curves of the speaker verification systems |
---|
0:14:57 | that is the gmm ubm baseline system |
---|
0:15:04 | and the i-vector |
---|
0:15:07 | based system with the lda and wccn |
---|
0:15:11 | we can see |
---|
0:15:12 | there are clear improvements in the performance |
---|
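The equal error rates compared throughout these tables and DET curves come from sweeping a decision threshold over verification scores; here is a small self-contained sketch with made-up scores (not the paper's data):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: sweep the threshold over all observed scores and return the
    operating point where false rejection and false acceptance are closest."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    best_gap, best_rate = np.inf, 1.0
    for thr in np.sort(np.concatenate([tgt, imp])):
        frr = np.mean(tgt < thr)   # true-speaker trials rejected
        far = np.mean(imp >= thr)  # impostor trials accepted
        if abs(frr - far) < best_gap:
            best_gap, best_rate = abs(frr - far), (frr + far) / 2.0
    return best_rate

# Perfectly separable scores give EER = 0.
print(equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1]))  # 0.0
# Overlapping target/impostor scores give a nonzero EER.
print(equal_error_rate([0.8, 0.6, 0.4], [0.7, 0.3, 0.2]))
```

A DET curve is the same sweep with both error rates plotted against each other instead of reduced to the crossing point.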
0:15:19 | this table shows |
---|
0:15:22 | the comparison between the gmm-ubm system and the i-vector system |
---|
0:15:29 | in matched and mismatched conditions |
---|
0:15:31 | first we can see that the first two columns are for the matched conditions |
---|
0:15:38 | and the last two are for the mismatched conditions |
---|
0:15:42 | and |
---|
0:15:43 | we computed the eer for each |
---|
0:15:47 | variation form |
---|
0:15:48 | and we can see that for each variation form |
---|
0:15:52 | in the |
---|
0:15:54 | mismatched conditions |
---|
0:15:56 | the eer is much bigger |
---|
0:15:59 | than in the matched conditions |
---|
0:16:02 | and second we can |
---|
0:16:04 | compare the gmm-ubm system and the i-vector based systems |
---|
0:16:10 | and we can see |
---|
0:16:11 | for example for the spontaneous speech |
---|
0:16:16 | one set of bars is for the gmm-ubm and the yellow ones are for the i-vector systems and |
---|
0:16:30 | when we look at the whisper |
---|
0:16:32 | variation form the i-vector based system |
---|
0:16:38 | shows a significant |
---|
0:16:40 | reduction |
---|
0:16:44 | of the error |
---|
0:16:45 | this table shows for each testing condition when the spontaneous |
---|
0:16:51 | utterances are used for enrollment |
---|
0:16:56 | because |
---|
0:16:57 | most of the time we speak |
---|
0:17:01 | spontaneously so we can see when testing with each variation form |
---|
0:17:09 | while the enrollment form is spontaneous if we also test with the spontaneous |
---|
0:17:14 | speech the eer is the smallest and the best results are obtained |
---|
0:17:20 | with the lda plus wccn system |
---|
0:17:24 | also in this case for the enrollment we use the |
---|
0:17:29 | spontaneous speech but test with the whisper |
---|
0:17:33 | utterances and we found that |
---|
0:17:35 | the |
---|
0:17:36 | eer is the biggest |
---|
0:17:39 | and the whole performance |
---|
0:17:44 | of the |
---|
0:17:45 | speaker verification system degrades the most |
---|
0:17:48 | so since the whisper variation seems to |
---|
0:17:53 | be very different from the other variation forms |
---|
0:17:55 | we |
---|
0:17:58 | present this table which shows if the |
---|
0:18:01 | enrollment uses the whisper utterances |
---|
0:18:03 | what the eer is for each testing condition |
---|
0:18:08 | and we can see that |
---|
0:18:11 | the results |
---|
0:18:13 | become much worse |
---|
0:18:15 | for example for the gmm-ubm system the eer is around |
---|
0:18:20 | forty |
---|
0:18:21 | percent |
---|
0:18:22 | and the best result is obtained in the matched condition which is |
---|
0:18:28 | seventeen percent |
---|
0:18:30 | and |
---|
0:18:31 | also |
---|
0:18:32 | from the whole picture we can see that |
---|
0:18:34 | the i-vector based verification systems |
---|
0:18:37 | still |
---|
0:18:38 | perform well |
---|
0:18:39 | and perform better than the gmm-ubm system |
---|
0:18:45 | and the combination of lda and wccn |
---|
0:18:48 | again performs best |
---|
0:18:55 | so we have three conclusions first |
---|
0:18:58 | the mismatch between intrinsic variation forms degrades the speaker recognition performance |
---|
0:19:04 | and the second is that the i-vector framework works better than the gmm ubm in modeling |
---|
0:19:10 | the intrinsic variations |
---|
0:19:12 | and especially with the combination of |
---|
0:19:16 | lda and wccn we can get the best results |
---|
0:19:20 | third the whisper utterances are much different from the other variation forms |
---|
0:19:26 | which even degrades the matched condition speaker recognition performance |
---|
0:19:31 | so for future work in the model domain we will try more useful methods |
---|
0:19:38 | for intrinsic variation compensation |
---|
0:19:40 | and also in the feature domain we |
---|
0:19:45 | will propose some |
---|
0:19:46 | methods |
---|
0:19:50 | for example when we |
---|
0:19:52 | used |
---|
0:19:53 | the whisper variation the worst results were obtained |
---|
0:19:59 | maybe because |
---|
0:20:02 | after the vad |
---|
0:20:04 | the length of the |
---|
0:20:07 | whisper utterance is much shorter than the normal |
---|
0:20:11 | rate speech |
---|
0:20:13 | and the second is that whispered speech is |
---|
0:20:16 | different |
---|
0:20:18 | that is much different from other speech sounds in the spectrum so we can do some |
---|
0:20:24 | work in the feature domain |
---|
0:20:26 | to improve the performance of the speaker verification system |
---|
0:20:31 | that is all thank you |
---|
0:20:50 | yes |
---|
0:20:51 | we recorded this database ourselves and the speakers |
---|
0:20:58 | are all students |
---|
0:21:05 | as described in the paper |
---|
0:21:09 | for the emotional forms |
---|
0:21:11 | they have to act the emotions |
---|
0:21:14 | and we gave them some |
---|
0:21:16 | targets for |
---|
0:21:16 | how to act |
---|
0:21:44 | yes for example |
---|
0:21:55 | if we change the speech rate we may be |
---|
0:21:59 | also altering other aspects such as the emotional state |
---|
0:22:07 | so when we recorded the database we tried to change only |
---|
0:22:12 | one variation at a time |
---|
0:22:17 | so we just tried to |
---|
0:22:20 | separate the different |
---|
0:22:24 | variations |
---|
0:22:41 | we will have more |
---|
0:22:42 | investigation |
---|
0:22:44 | in future work |
---|
0:22:47 | thank you |
---|