Actually, I am here to present our paper on behalf of the first author, from the University of Science and Technology of China. This work started when the first author visited us as a research intern. We implemented the harmonic plus noise model; in the very beginning we wanted it to be used for speech analysis, and especially for speech synthesis. After she went back to school, we used that harmonic plus noise model to implement a new feature, but applied this time to speaker verification. We got quite promising speaker verification results using this new set of features. So this is basically the whole story behind this work.
Today I will first introduce our motivation and the so-called SSER feature, which stands for Spectral Subband Energy Ratio. Then I will briefly introduce the harmonic plus noise analysis of speech, how we calculate the spectral subband energy ratio feature, and how we model the SSER feature. Finally I will present our evaluation results and conclusions.
It is probably a well-known problem that for today's speaker identification and verification tasks, we usually still use features borrowed from automatic speech recognition. The problem is that those features are actually supposed to normalize away the speaker information. So our motivation is quite straightforward: we want to find some new features, complementary to features like MFCC, that can carry the speaker characteristics and therefore improve speaker verification performance. This is the motivation of this work.
There are several steps to extract our proposed SSER features. The first step is to apply the harmonic plus noise analysis to the speech. We then calculate the subband energy ratios (I will introduce the details later); essentially, in each subband you calculate the energy of the harmonic part versus the energy of the noise part. That new feature is then plugged into the current speaker verification system, which is a conventional GMM-UBM system, and used as a complementary feature to the MFCC feature.
I will briefly introduce the harmonic plus noise analysis of speech here. This model was proposed by Professor Yannis Stylianou; you can find the reference paper here. Basically, for each input utterance we first do F0 extraction (pitch extraction) to get an F0 estimate, and of course you also get the voiced/unvoiced labeling. We discard the unvoiced frames and only use the voiced frames for further analysis.
After this we do pitch-synchronous windowing on the input utterance, so we get several frames to represent the input utterance. For each given frame, a short speech segment, we do HNM analysis (HNM stands for harmonic plus noise model). The basic idea of HNM analysis is to decompose the input speech signal into a harmonic part, which is purely periodic, plus a noise part. There are several methods to represent the noise part; in this work we use the residual, basically the input signal minus the harmonic part, as the noise.
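The decomposition just described can be sketched in code: fit sinusoids at integer multiples of F0 inside the analysis window by least squares, and take the residual as the noise. This is only an illustrative reconstruction under assumed parameter values (names like `decompose_frame` are mine, not from the paper), not the actual HNM implementation:

```python
import numpy as np

def decompose_frame(frame, f0, fs, fmax_voiced=6000.0):
    """Split one voiced frame into a harmonic part and a noise (residual) part.

    Sketch of an HNM-style decomposition: fit sinusoids at integer multiples
    of f0 (up to the maximum voiced frequency) by least squares, and define
    the noise as whatever the harmonic model cannot explain.
    """
    n = len(frame)
    t = (np.arange(n) - n // 2) / fs      # time axis centered on the frame
    n_harm = int(fmax_voiced // f0)       # number of harmonics in the voiced band
    cols = []                             # one cosine and one sine per harmonic
    for k in range(1, n_harm + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(np.sin(2 * np.pi * k * f0 * t))
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, frame, rcond=None)
    harmonic = A @ coef                   # purely periodic part
    noise = frame - harmonic              # residual = input minus harmonic part
    return harmonic, noise
```

By construction the two parts sum back to the input frame, which is exactly the property the talk relies on when it treats the residual as the noise.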
Here are some basic setups of the HNM analysis. We use a two-pitch-period Hamming window for each frame to chop up the input speech. Another important thing we need to define for each HNM analysis is the maximum voiced frequency; we fix that frequency to six kilohertz. And, as I mentioned before, the noise part is defined as the residual signal.
Here is an example: the same vowel pronounced by two different speakers. The red curve is the harmonic spectrum of a particular input frame, and the green curve is the spectrum of the noise part. For this frequency subband, as you can see, for the first speaker the energy ratio of the harmonic part to the noise part is almost one; basically it means the energy of the harmonic part is similar to the energy of the noise part. For the other speaker, you can see more energy is assigned to the harmonic part than to the noise part. So we hope this kind of characteristic can differentiate between different speakers.
To compute this, there are two things we need to define. The first is the bandwidth of each subband. In this case we define the bandwidth as the average of the minimum possible F0 and the maximum possible F0 for a given speaker; those two numbers are gender dependent, so we defined one set of values for female speakers and another set of values for male speakers. The second is the center frequency of each subband. This is quite straightforward: following the HNM analysis, we use the integer multiples of F0, so that the subbands together cover the whole frequency range.
After this, with each subband's start and end frequencies defined, we can calculate the subband energy of the harmonic part and the subband energy of the noise part. Then we calculate the energy ratio between the two and convert the value into dB.
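The subband energy ratio computation can be sketched as follows. The subband layout follows the talk's description (bandwidth equal to the average of the speaker's minimum and maximum possible F0, bands centered at integer multiples of F0), but the concrete F0 limits, FFT size, and function names are my assumptions for illustration:

```python
import numpy as np

def sser_frame(harmonic, noise, f0, fs, f0_min, f0_max, n_fft=1024):
    """Spectral Subband Energy Ratio (SSER) for one voiced frame, in dB."""
    bw = 0.5 * (f0_min + f0_max)             # gender-dependent subband bandwidth
    n_bands = int((fs / 2) // bw)            # fixed dimensionality per gender
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    H = np.abs(np.fft.rfft(harmonic, n_fft)) ** 2   # harmonic power spectrum
    N = np.abs(np.fft.rfft(noise, n_fft)) ** 2      # noise power spectrum
    ratios = []
    for k in range(1, n_bands + 1):
        center = k * f0                      # k-th harmonic as the band center
        band = (freqs >= center - bw / 2) & (freqs < center + bw / 2)
        eh = H[band].sum() + 1e-12           # small floor to avoid log(0)
        en = N[band].sum() + 1e-12           # (also covers bands past Nyquist)
        ratios.append(10.0 * np.log10(eh / en))
    return np.array(ratios)
```

A band dominated by the harmonic part gives a large positive value, and a band dominated by the residual gives a negative one, which is the per-speaker pattern shown in the spectrum example.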
So after this, for each frame we get a fixed-dimensional feature vector, and the dimensionality is gender dependent. In our experiments we have a thirty-three-dimensional feature for female speakers and a forty-five-dimensional feature for male speakers, because male speakers usually have a lower F0.
After the features have been calculated, we need to model them. The first thing we want to check is whether the distribution of the SSER features is Gaussian, so that we can use GMMs to model it. We plotted the feature distributions, and using Gaussian distributions to model this feature is quite reasonable; it does look Gaussian. So we use a conventional GMM-UBM system to do speaker verification. We use the conventional MFCC feature as a baseline, and implemented the SSER-feature-based system.
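For readers unfamiliar with the back end, a generic GMM-UBM sketch (diagonal covariances, mean-only MAP adaptation, frame-averaged log-likelihood-ratio scoring) looks roughly like this. It is a textbook version built on scikit-learn, not the authors' system, and the relevance factor is an assumed value:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats, n_components=8, seed=0):
    """Fit a diagonal-covariance universal background model on pooled data."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed, max_iter=200)
    ubm.fit(background_feats)
    return ubm

def map_adapt_means(ubm, enroll_feats, relevance=16.0):
    """Mean-only MAP adaptation of the UBM toward one speaker's data."""
    post = ubm.predict_proba(enroll_feats)           # (T, C) responsibilities
    n_c = post.sum(axis=0)                           # soft frame counts
    f_c = post.T @ enroll_feats                      # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]       # adaptation coefficients
    new_means = alpha * (f_c / np.maximum(n_c, 1e-9)[:, None]) \
        + (1.0 - alpha) * ubm.means_
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    # Reuse UBM weights and covariances; only the means move.
    spk.weights_ = ubm.weights_
    spk.covariances_ = ubm.covariances_
    spk.means_ = new_means
    spk.precisions_cholesky_ = 1.0 / np.sqrt(spk.covariances_)
    return spk

def llr_score(spk, ubm, test_feats):
    """Frame-averaged log-likelihood ratio: speaker model vs. UBM."""
    return spk.score(test_feats) - ubm.score(test_feats)
```

Scoring a trial means computing `llr_score` for the claimed speaker's adapted model; higher scores mean the test speech matches the enrolled speaker better than the background population.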
For the data we use Mandarin speech from a commonly used database; it is widely used in China for speech recognition, for speech analysis and synthesis, and even for speaker-related tasks. We measure the EER (equal error rate) as the performance metric, and no score normalization was used. Here are some statistics of the training and testing corpus: we have over a hundred speakers altogether, and we use a fixed amount of training speech and a short test segment per speaker for the speaker verification task.
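The EER used here as the performance measure can be computed from the target and impostor trial scores with a simple threshold sweep. This is a generic sketch of the metric, not the paper's evaluation code:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the false-accept rate
    equals the false-reject rate (plain threshold sweep, no interpolation)."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    best_gap, best_eer = np.inf, None
    for thr in np.unique(np.concatenate([target_scores, impostor_scores])):
        far = np.mean(impostor_scores >= thr)   # impostors wrongly accepted
        frr = np.mean(target_scores < thr)      # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), 0.5 * (far + frr)
    return best_eer
```

Lower EER is better; a system whose target and impostor score distributions are perfectly separated has an EER of zero.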
This is read-style speech. Basically, among the speakers we have seventy-six male speakers plus the female speakers, and we have this number of unique testing sentences; we arranged them to obtain this number of trials: six thousand male trials plus seven thousand female trials.
Okay, let us see the results. First, look at the EER for the MFCC baseline: as you can observe, the female speakers appear to be a little more difficult to handle. Using the SSER features alone, the performance is actually worse than with the MFCC features; you can see those numbers. But if we combine the two systems together, we get a reasonable performance improvement, especially for the female speakers. In fact, when the two systems are combined, the performance for the female speakers actually becomes better than that for the male speakers. So this is quite an interesting and surprisingly good performance improvement.
To conclude, this paper is quite straightforward. We proposed a new feature, named SSER, for speaker verification. It can characterize the interaction between vocal tract movements and the glottal airflow, and it seems to be quite able to capture speaker characteristics. The feature is complementary to MFCC, reducing the EER when used along with the MFCC baseline system. As future work, we want to do more experiments to see whether it performs well, for example in noisy environments and after post-processing techniques. Okay, thank you very much.
Are there any questions?

Hopefully I can answer them; I am not very familiar with the speaker verification task. This work was basically done after she went back to school; the core part of this work, including the GMM system, was actually implemented by her. So hopefully I can answer your questions, with more focus on the HNM part, and hopefully I will not misstate the details.
So your feature only calculates the ratio of the harmonic and the noise parts. Where does that noise come from?

Ah, the noise here is the so-called HNM noise; it is different from additive noise or the environmental noise issue in speech recognition. In the HNM analysis, for each given input speech frame, you decompose the input speech signal into two different parts: the first part is called the harmonic part, which is purely periodic, and the remaining thing, the residual, you can define as the noise. So this noise is different from the noise in, for example, speech recognition.
So in your system you are using clean signals, and what you call noise is defined as everything that is not periodic. If there were additive noise in the input signal, would this noise part be robust to it?

That is a problem we want to investigate. If the input signal is noisy, several things could be affected by the additive noise. For example, the HNM analysis still relies quite heavily on an accurate estimate of F0, and if you have very strong noise, that part could be affected. Also, depending on the type of noise, it could affect the harmonic estimation and it could also affect the noise estimation. That is actually the first thing we would want to investigate as future work.
Okay. I would also like to ask, what is the method used to fuse the MFCC and SSER systems?

This is score fusion, because for the MFCC system you use all the input frames, while for the SSER system the unvoiced frames are discarded. So each system produces a frame-averaged log-likelihood score, and we then combine the scores with fusion coefficients for the individual features.

And how were those coefficients chosen?

I do not have an answer to that question; we may need to check with the first author. I do not know whether there is any weighting, or whether tuning those weights is critical.
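The score-level fusion described in this answer can be written in one line; the linear weight is an assumed free parameter, since the talk did not specify how (or whether) the combination coefficients were tuned:

```python
import numpy as np

def fuse_scores(mfcc_scores, sser_scores, w=0.5):
    """Linear fusion of the two systems' frame-averaged log-likelihood scores.

    mfcc_scores and sser_scores are per-trial scores from the two systems;
    w is the (assumed) weight on the MFCC system.
    """
    mfcc_scores = np.asarray(mfcc_scores, dtype=float)
    sser_scores = np.asarray(sser_scores, dtype=float)
    return w * mfcc_scores + (1.0 - w) * sser_scores
```

In practice one would either keep w = 0.5 as the simplest choice or tune it on a held-out development set.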
One other question: do you do anything with the residual? Do you use it somehow as a feature?

No, we do not use it; we only calculate the energy ratio between the harmonic part and the residual.
I also have a question. The addition of these features to MFCC improves the results, which suggests the two sets of features are uncorrelated. Did you try to measure to what extent that is true, or does it just work better for a subset of the speakers? Is that possible?

Maybe; I am not so sure about this part, and we did not try that. The first thing is that we have discarded the unvoiced frames, so basically you cannot apply something like PCA to, for example, combine the MFCC feature together with the SSER feature, map them to a lower dimension, and plug that into a new system. We have not tried that because of the difficulty of handling the unvoiced frames: in the conventional harmonic plus noise analysis, for unvoiced frames you do not have an estimate of the harmonic part, so basically we cannot calculate the subband energy ratio for the unvoiced segments.
Okay, thanks. Let us thank the speaker again.