0:00:16i
0:00:17well
0:00:19i
0:00:23i
0:00:24i
0:00:25i
0:00:26oh
0:00:29i
0:00:37two
0:00:40roughly
0:00:42since
0:00:42and as a student of which risky in computer science and college
0:00:49i'm glad to show you
0:00:51the study of the effects of intrinsic variation using i-vectors in text-independent speaker
0:00:57verification
0:01:00first i will introduce the main challenge in speaker verification
0:01:04and then i will
0:01:06present
0:01:07the problem our research focuses on
0:01:10and our proposal
0:01:13then i will introduce the i-vector framework for intrinsic variation modeling
0:01:19including the framework and the compensation of intrinsic variation in speech
0:01:24and then i will introduce the
0:01:28experiments
0:01:30including the database
0:01:32description of speaker verification systems
0:01:35and the experiment results
0:01:38and finally the conclusions
0:01:43the main challenges in speaker verification come from two sources the first one is
0:01:48extrinsic variability
0:01:51and the other one is intrinsic variability
0:01:55the extrinsic variability is associated with factors that
0:01:58come from outside of the speakers such as mismatched channels
0:02:02or environmental noise
0:02:05the intrinsic variability is associated with factors that come
0:02:10from the speakers
0:02:12such as speaking style
0:02:14emotion
0:02:15speech rate and state of health
0:02:19there is a lot of research
0:02:22focused on the extrinsic variability
0:02:25but only a small amount of research about
0:02:29intrinsic variability has been proposed so
0:02:32in this paper we focus on the intrinsic variability
0:02:37we want to study the effects of intrinsic variability
0:02:41in speaker verification
0:02:46the problem we focus on
0:02:48is the performance of speaker verification
0:02:51under intrinsic variation in speech
0:02:55so there are two questions
0:02:58the first one is
0:02:59how does the speaker verification system perform
0:03:01when enrollment and testing are under mismatched conditions of intrinsic variation
0:03:07and the second one is
0:03:09how do the technologies that model intrinsic variation
0:03:13perform in addressing the effects of intrinsic variation in speaker verification
0:03:19so we
0:03:23propose to model the intrinsic variability with the i-vector framework and want
0:03:28to say that that's
0:03:33and
0:03:34first we have to define the variation forms
0:03:38because intrinsic variability comes from
0:03:41all the factors associated with the speakers
0:03:44but they are still practise
0:03:46so we first
0:03:48define the base form that is neutral spontaneous
0:03:53speech at normal rate and a four inch at least
0:03:58for many cases
0:03:59basic well
0:04:00weight
0:04:02you
0:04:02either that you know
0:04:04variation forms
0:04:06from six aspects including speech rate
0:04:09physical state
0:04:11speaking volume
0:04:12emotional state speaking style and the speaking language
0:04:16for example in the speaking rate
0:04:19we have
0:04:19fast speech or slow speech
0:04:22you think you basic skaters
0:04:25oh well
0:04:27clean i zero
0:04:29for example the model of speech means
0:04:32the speakers have a candy in the mouth
0:04:36and talk
0:04:37in that way
0:04:39the recognizer with
0:04:41the other night they are to use the speech data
0:04:45has a cat qualities noise
0:04:49and
0:04:50the speaking volume including loud soft and whisper
0:04:55in the emotional state we have happy
0:04:58emotion and angry emotion
0:05:01and in
0:05:03the speaking style
0:05:04a reading style
0:05:07yeah
0:05:07about the speaker which we have most chinese language recognition
0:05:13so from the six aspects
0:05:15we have twelve
0:05:18variation forms and we
0:05:21recorded the data
0:05:23for the experiments
0:05:27then
0:05:29we use the i-vector framework to model the intrinsic variation
0:05:33the i-vector modeling has been successful in the application
0:05:38for channel compensation
0:05:42the i-vector framework is composed of two parts the first is
0:05:46we can project the supervector
0:05:51into the i-vector in the total variability space
0:05:57which is a low dimensional space
0:06:01the second part is that
0:06:04we can use the cosine similarity score
0:06:08to measure the
0:06:10similarity between a test
0:06:13utterance and a
0:06:15training utterance
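The cosine scoring step described above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names `cosine_score` and `verify`, the plain-list representation of i-vectors, and the decision threshold are all assumptions introduced here.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors given as equal-length sequences."""
    if len(w_target) != len(w_test):
        raise ValueError("i-vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    if norm_target == 0.0 or norm_test == 0.0:
        raise ValueError("i-vectors must be non-zero")
    return dot / (norm_target * norm_test)


def verify(w_target, w_test, threshold):
    """Accept the trial when the score reaches a (hypothetical) threshold."""
    return cosine_score(w_target, w_test) >= threshold
```

Because the score depends only on the angle between the two vectors, no separate speaker model has to be trained at enrollment time; the enrollment i-vector itself plays that role.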
0:06:18please
0:06:19how baltimore in nineteen score i'd be please
0:06:22i-vector framework
0:06:24because
0:06:25before there were also many
0:06:27studies
0:06:28about the i-vector framework for
0:06:32channel
0:06:34compensation
0:06:35so
0:06:37we want to see if it works for the
0:06:40intrinsic variability
0:06:46second how to reduce the effects of intrinsic
0:06:49variation we use a set of technologies
0:06:53which have been used successfully for
0:06:57channel compensation
0:06:59here we use lda and wccn
0:07:04the idea behind the lda is
0:07:07minimizing the within speaker variability while maximizing the between speaker variability
0:07:14we
0:07:15define the expression and the
0:07:17lda projection matrix obtained
0:07:20from it
0:07:23is composed of
0:07:25the eigenvectors
0:07:26which have the largest eigenvalues of the equation
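For reference, the LDA criterion described above is usually written as a Rayleigh quotient together with a generalized eigenvalue equation. The notation below is an assumption taken from the common i-vector formulation, since the slides are not part of this transcript: $w_i^s$ is the $i$-th i-vector of speaker $s$, $n_s$ the number of utterances of that speaker, $\bar{w}_s$ the speaker mean, $\bar{w}$ the global mean, and $S$ the number of speakers.

```latex
J(v) = \frac{v^{\top} S_b\, v}{v^{\top} S_w\, v}, \qquad
S_b = \sum_{s=1}^{S} (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^{\top}, \qquad
S_w = \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^{\top}

S_b\, v = \lambda\, S_w\, v
```

Under these assumptions, the projection matrix $A$ collects as columns the eigenvectors $v$ with the largest eigenvalues $\lambda$, and each i-vector is mapped to $A^{\top} w$ before scoring.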
0:07:31and
0:07:33within class
0:07:34covariance normalization
0:07:38the idea is that
0:07:41it
0:07:42attenuates the directions of high intra-speaker variability
0:07:47which she's
0:07:49though the partition
0:07:51the wccn projection matrix is obtained by
0:07:56solving the equation with
0:08:00cholesky
0:08:02decomposition
0:08:04i G E is
0:08:06partition magic's
0:08:08and the buttons as we use process
0:08:11partition methods
0:08:13that using it was that was since direction
0:08:18so
0:08:19G
0:08:20partition magic's
0:08:22and they use
0:08:24you don't ten
0:08:26finally composed of the eigenvectors of the within class
0:08:30covariance normalization
0:08:32matrix
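The WCCN step and its combination with LDA can be written out as follows. This is a sketch in the commonly used notation, which is an assumption here rather than the speaker's own: $W$ is the within-class covariance matrix, $B$ the WCCN projection matrix, $A$ the LDA projection matrix, and $w_i^s$, $\bar{w}_s$, $n_s$, $S$ denote the $i$-th i-vector of speaker $s$, the speaker mean, the number of utterances of that speaker, and the number of speakers.

```latex
W = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s}
    (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^{\top}, \qquad
W^{-1} = B B^{\top} \quad \text{(Cholesky decomposition)}

\mathrm{score}(w_1, w_2) =
  \frac{(B^{\top} w_1)^{\top} (B^{\top} w_2)}
       {\lVert B^{\top} w_1 \rVert \, \lVert B^{\top} w_2 \rVert}
```

When LDA and WCCN are combined, $W$ is typically estimated on the LDA-projected vectors $A^{\top} w_i^s$ instead of the raw i-vectors, and the score uses $B^{\top} A^{\top} w$ in place of $B^{\top} w$.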
0:08:36so i will introduce the experiments on how these methods perform
0:08:42on the intrinsic
0:08:45variation corpus
0:08:47first we use the corpus which we recorded
0:08:53for the training and the test
0:08:57then a description of
0:09:00the speaker recognition systems
0:09:04we use the gmm-ubm baseline system
0:09:08as the basic speaker recognition system and then we use the
0:09:15i-vector based speaker recognition system with different
0:09:19intrinsic variation compensation
0:09:22algorithms and then we give the experiment
0:09:26results
0:09:30the intrinsic variation corpus that we use
0:09:34consists of
0:09:36one hundred
0:09:37university students
0:09:39which she has
0:09:42they to try to solve the speech chinese
0:09:45yeah it used for eighteen years ago to tell you i guess
0:09:50yeah
0:09:51twelve variation forms
0:09:54still people
0:09:56yeah
0:09:58each student speaks for three minutes
0:10:02for each variation form
0:10:07then each recording is divided into ten
0:10:10parts
0:10:11so each part lasts for
0:10:14eighteen seconds that is used for training and testing
0:10:18and some of that
0:10:20okay resolution is a parts
0:10:23and these or model soundtrack
0:10:27we use the data in the intrinsic variation corpus
0:10:33for training the ubm
0:10:38we use thirty speakers
0:10:41fifteen male and fifteen female to train the gender dependent and gender independent ubm
0:10:48the data lasts for eighteen hours and covers all twelve
0:10:52variation forms
0:10:54then we use thirty speakers'
0:10:58data to train
0:11:00the total variability space which is the matrix t
0:11:05it also lasts for eighteen hours
0:11:08and of course
0:11:11we use three
0:11:12different
0:11:15intrinsic variation compensation algorithms
0:11:18lda wccn and their combination so we have to train the projection matrices with twenty
0:11:24speakers
0:11:26which lasts for twelve hours
0:11:29for training the projection matrices
0:11:32at last we use twenty speakers which include two thousand four hundred utterances
0:11:38for the test
0:11:40and
0:11:41all twelve variation forms
0:11:46and then we use five
0:11:48speaker recognition systems
0:11:50we use the gmm-ubm
0:11:54speaker recognition system as a baseline system
0:11:56which is
0:11:58the gmm-ubm is composed of
0:12:01several
0:12:02also
0:12:03mixture
0:12:05the feature is thirteen dimensional mfcc and the ubm is composed of
0:12:11five hundred twelve gaussian mixtures
0:12:15and the i-vector speaker verification system
0:12:20uses lda and wccn and also the combination of lda and
0:12:26wccn
0:12:27and the i-vector dimension is two hundred
0:12:35this table
0:12:38shows the eer for each enrollment condition when the testing utterances
0:12:44cover all the variation forms
0:12:46and
0:12:47first
0:12:50we choose the spontaneous speech as the base form
0:12:55then we have
0:12:56six aspects including speaking style speaking volume speaking rate emotional state physical
0:13:04state
0:13:05and speaking language
0:13:07there are
0:13:08twelve variation forms
0:13:11and each variation form
0:13:13is
0:13:15used
0:13:16for the enrollment condition
0:13:19and tested
0:13:21with all the variation forms and we can see that the
0:13:27i-vector based systems
0:13:29perform much better than the gmm-ubm
0:13:33baseline system
0:13:34the best results are obtained with
0:13:38the combination
0:13:40of lda and wccn
0:13:44and also we have
0:13:48seen
0:13:50in all the different variation forms
0:13:52we found that if we use whisper
0:13:55as enrollment
0:13:57then
0:13:59the eer is
0:14:01the highest
0:14:03so it performs the worst
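Since the results in the talk are reported as eer (equal error rate), a small sketch of how that number can be computed from trial scores may help. It is a minimal illustration under stated assumptions, not the evaluation code behind the reported figures: the function name `equal_error_rate` and the simple threshold sweep over observed scores are choices made here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The function
    returns the mean of the false acceptance and false rejection rates at
    the threshold where the two rates are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    # Extra threshold above every score, so "reject everything" is covered.
    thresholds.append(thresholds[-1] + 1.0)
    best_gap, best_eer = None, None
    for thr in thresholds:
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

A lower value means the genuine and impostor score distributions overlap less, which is why a mismatch between enrollment and testing conditions shows up as a higher eer.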
0:14:10then we
0:14:11calculated the average eer for speaker recognition on
0:14:16the intrinsic variation corpus
0:14:17and from this table we can see that
0:14:21the i-vector based speaker recognition system is better
0:14:26than the gmm-ubm
0:14:28speaker recognition system
0:14:29on the intrinsic variation corpus
0:14:32and
0:14:34the best results are obtained with the i-vector
0:14:38based
0:14:39speaker recognition system with lda and wccn
0:14:43yeah
0:14:43and
0:14:44we lately
0:14:47section six
0:14:49as an
0:14:52this is the det curve of the speaker recognition systems
0:14:57this is the gmm-ubm baseline system
0:15:01pitch and the
0:15:04so that these two
0:15:07i-vector system with lda and wccn
0:15:11we can see
0:15:12there are great improvements in the performance
0:15:19this figure shows
0:15:22the comparison between the gmm-ubm system and the i-vector system
0:15:28under
0:15:29matched and mismatched conditions
0:15:31so first we can see the first two columns are for matched conditions
0:15:38the last two are for mismatched conditions
0:15:42and the eer
0:15:43is computed for each
0:15:47variation form
0:15:48and we can see for each variation form
0:15:54in the mismatched conditions
0:15:56the eer is much bigger
0:15:59than in the matched conditions
0:16:02and second we can compare
0:16:04the gmm-ubm system and the i-vector system
0:16:10and we can see
0:16:11for example for spontaneous
0:16:14margin for
0:16:16the one the ones for the gmm-ubm the yellow ones for the i-vector systems and
0:16:26the
0:16:27there are the
0:16:30when the whole whisper
0:16:32version of all the i-vectors this system is that
0:16:36have a
0:16:38significant
0:16:40we actually
0:16:42oh
0:16:44and
0:16:45this table shows the eer for each testing condition when spontaneous
0:16:51utterances are used for enrollment
0:16:54when the
0:16:56cost
0:16:57the most are you know the whole way we speak
0:17:01we spontaneously so it can see when testing with each iteration vol
0:17:08"'cause"
0:17:09the enrollment form is spontaneous so if you test with also the spontaneous
0:17:14form the eer should be small and the best results are obtained
0:17:20with obvious isn't it
0:17:22and
0:17:24also when in the enrollment we use
0:17:29spontaneous but test with the whisper
0:17:33variation we found that
0:17:35the
0:17:36eer is much bigger
0:17:39and it has the worst performance
0:17:42shot duration
0:17:44this
0:17:45speaker say that
0:17:48so since the whisper variation is
0:17:53much different from the other variation forms
0:17:55so we
0:17:58present this table which shows if in
0:18:01enrollment we use whisper utterances
0:18:03what the eer is for each testing condition
0:18:08and we can see that
0:18:11the results
0:18:13become much worse
0:18:15for example for the gmm-ubm system is wrong
0:18:20what you
0:18:21percent
0:18:22and the best results are obtained in the matched condition which is
0:18:28seventeen percent
0:18:30yeah
0:18:31also
0:18:32for the whole picture we can see that
0:18:34the i-vector based speaker recognition system
0:18:37still
0:18:38performs well
0:18:39and better than the gmm-ubm system
0:18:45the combination of lda and wccn
0:18:48also performs best
0:18:55so we have the conclusions first
0:18:58mismatch in intrinsic variation causes degradation in speaker recognition performance
0:19:04and the second is the i-vector framework works better than gmm-ubm in modelling intrinsic
0:19:10variations
0:19:12especially with a combination of
0:19:16lda and wccn we can get the best results
0:19:20the third is that whisper utterances are much different from the other variation forms
0:19:26and bring down the speaker recognition performance even in the matched condition
0:19:31so in future work in the model domain we will try more useful
0:19:38intrinsic variation compensation methods
0:19:40and also in the feature domain we
0:19:45will propose some
0:19:46i don't mess between four
0:19:50for example we do you
0:19:52the
0:19:53whisper where the whisper variation in the best results
0:19:59maybe first
0:20:02after the vad
0:20:04the length of the
0:20:07whisper speech is much shorter than the normal
0:20:11speech
0:20:13the second is that whisper speech is
0:20:16different
0:20:18it is much different from other speech so we can do some
0:20:24work in the feature domain
0:20:26to improve the performance of the speaker recognition system
0:20:31that's all thank you
0:20:35i
0:20:50yes
0:20:51we will record the this database in the fatter profitable and they all use the
0:20:58one
0:20:58they all students and the i and why they
0:21:04some
0:21:05what in a paper
0:21:08tell
0:21:09which you
0:21:11they have to act the emotion
0:21:14yeah i something
0:21:15target
0:21:16how to act
0:21:18some
0:21:39i
0:21:41i
0:21:44yes for example
0:21:48i
0:21:49i
0:21:55yes for example if you if you speech parameter we may be
0:21:59we have to you can alter so those listed you model and motion stays at
0:22:07so when we recorded the database we tried to just change
0:22:12one variation also some
0:22:15some of the other forms of variation
0:22:17so we just tried to
0:22:20ask them to separate the intrinsic
0:22:24variation
0:22:41assume we have
0:22:42investigation
0:22:44in future work
0:22:45some of it
0:22:47thank you