0:00:13 | Welcome to the presentation of the paper "Delving into VoxCeleb: Environment Invariant Speaker Recognition". |
0:00:20 | The paper is by Joon Son Chung, Jaesung Huh and Seongkyu Mun. |
0:00:26 | The goal of this work is speaker recognition. |
0:00:29 | Speaker recognition is identifying a person from the characteristics of their voice, in effect determining who is speaking. |
0:00:36 | It can be categorized into a closed-set problem or an open-set problem. |
0:00:41 | In the closed-set setting, all testing identities are included in the training set, so the task can be addressed as a classification problem. |
0:00:49 | A well-known name for this problem is speaker identification. |
0:00:55 | On the other hand, in the open-set setting, testing identities are not seen during training, which is closer to practice. |
0:01:03 | We call this problem speaker verification. |
0:01:07 | In speaker verification, we extract the speaker representations of two speech signals and compare them to decide whether the two speech segments are from the same person or not. |
0:01:18 | Like many other research areas, progress in speaker verification has been facilitated by the availability of a large-scale dataset called VoxCeleb. |
0:01:28 | VoxCeleb is a dataset consisting of short clips of human speech extracted from interview videos. |
0:01:34 | The speakers span a wide range of different ethnicities, accents, professions and ages. |
0:01:42 | Videos included in the dataset are shot in a large number of challenging visual and auditory environments. |
0:01:49 | The dataset has been widely used for training and testing speaker recognition models. |
0:01:55 | VoxCeleb1 contains over one hundred thousand utterances from 1,251 celebrities, |
0:02:02 | while VoxCeleb2 contains over one million utterances from over 6,000 celebrities, extracted from videos uploaded to YouTube. |
0:02:11 | Most previous research focuses only on boosting the performance of speaker verification on a given test dataset. |
0:02:20 | There have been a number of works proposing novel architectures or loss functions suitable for the task, |
0:02:27 | but these works do not consider what information is learned by the models, |
0:02:32 | whether it is the useful information, or undesirable biases present in the dataset. |
0:02:39 | In speaker recognition, the challenge comes down to the ability to separate the voice characteristics from the environments in which the voices are recorded. |
0:02:49 | In real-world scenarios, a person usually enrols and verifies their voice in the same environment. |
0:02:56 | Therefore, when the same person speaks in different environments, the voice features can be far apart in the embedding space if the voice characteristics and the environment information are entangled. |
0:03:11 | The VoxCeleb dataset consists of recordings from diverse but finite environments for each speaker, making it possible for the model to overfit to the environment as well as the voice characteristics. |
0:03:24 | Therefore, we do not know whether a network has in fact learned the voice characteristics, or environmental and session biases as well. |
0:03:34 | In order to prevent this, we must look beyond classification accuracy as the only learning objective. |
0:03:43 | So, the objective of this work is to learn a speaker-discriminative and environment-invariant speaker verification network. |
0:03:52 | In this work, we introduce an environment adversarial training framework in which the network learns speaker-discriminative and environment-invariant embeddings, without the need for explicit environment labels. |
0:04:08 | We achieve this by using the video identities that are freely available in the VoxCeleb dataset as a proxy for the recording environment. |
0:04:14 | We show that our environment adversarial training helps the network to generalize better to unseen conditions. |
0:04:22 | The key motivation of our training framework is that a model should not be able to discriminate between two clips of the same speaker from the same video and two clips of the same speaker from different videos. |
0:04:38 | Now let's talk about our training framework. |
0:04:41 | This is the overview of the training phases; we will explain them in detail in the later slides. |
0:04:47 | First, we will talk about the batch formation. |
0:04:50 | Each mini-batch consists of three two-second audio segments from each of N different speakers. |
0:04:57 | Of the three audio segments from each speaker, two are from the same video and the other is from a different video. |
0:05:04 | The two same-video segments can be either from different parts of the same audio clip, or from another clip from the same YouTube video. |
0:05:12 | This batch formation can be performed on both the VoxCeleb1 and VoxCeleb2 datasets, as shown on the left. |
0:05:19 | Here the underlying assumption is that the audio clips from the same video would have similar recording environments and channels, whereas the clips from different videos would have more different channel characteristics. |
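As a concrete illustration of this sampling scheme, here is a minimal Python sketch. The `utterances[speaker][video]` index and all names are hypothetical, and each speaker is assumed to have at least two videos, one of which contains at least two two-second segments.

```python
# A minimal sketch of the mini-batch formation described above, assuming a
# hypothetical index `utterances[speaker][video]` mapping each speaker's videos
# to lists of pre-cropped 2-second waveform segments.
import random

def sample_triplet(utterances, speaker):
    """Sample three 2-second segments for one speaker:
    anchor and positive from the same video, negative from a different video."""
    videos = list(utterances[speaker].keys())
    same_vid, diff_vid = random.sample(videos, 2)
    anchor, positive = random.sample(utterances[speaker][same_vid], 2)
    negative = random.choice(utterances[speaker][diff_vid])
    return anchor, positive, negative

def sample_batch(utterances, n_speakers):
    """Each mini-batch holds three segments from each of N different speakers."""
    speakers = random.sample(list(utterances.keys()), n_speakers)
    return [sample_triplet(utterances, s) for s in speakers]
```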
0:05:34 | Now, let's talk about the environment phase. |
0:05:38 | The environment network is trained to predict whether or not two audio segments come from the same environment, that is, the same video. |
0:05:46 | Within each triplet, the anchor and the segment from the same video as the anchor form a positive pair, |
0:05:54 | and the anchor and the segment from a different video form a negative pair. |
0:05:59 | The gradient is backpropagated only to the environment network, |
0:06:03 | so the CNN feature extractor is not optimized during this phase. |
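A minimal sketch of this phase follows, assuming the environment network `env_net` is a small classifier that scores a concatenated pair of segment-level features as same-video or not; the pair encoding and binary cross-entropy objective are our illustrative assumptions, not necessarily the paper's exact discriminator.

```python
# Environment phase (sketch): train only env_net to tell same-video pairs
# from different-video pairs, with the CNN features detached.
import torch
import torch.nn.functional as F

def environment_phase_loss(env_net, feat_anchor, feat_pos, feat_neg):
    # Detach so the gradient is backpropagated only to the environment network,
    # leaving the CNN feature extractor untouched during this phase.
    fa, fp, fn = feat_anchor.detach(), feat_pos.detach(), feat_neg.detach()
    pos_logit = env_net(torch.cat([fa, fp], dim=-1))  # same-video pair -> label 1
    neg_logit = env_net(torch.cat([fa, fn], dim=-1))  # different-video pair -> label 0
    logits = torch.cat([pos_logit, neg_logit], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones_like(pos_logit),
                        torch.zeros_like(neg_logit)], dim=0).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, labels)
```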
0:06:08 | In the speaker phase, the CNN feature extractor and the speaker recognition network are trained simultaneously using the standard cross-entropy loss. |
0:06:18 | In addition, the confusion loss penalizes the environment network's ability to discriminate between the pairs of segments from the same environment and those from different environments. |
0:06:29 | This is done by minimizing the KL divergence between the softmax of the pairwise distances and the uniform distribution. |
0:06:38 | The environment or channel information can be seen as an undesirable source of variation that should be absent from an ideal speaker embedding. |
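The confusion loss can be sketched in the same setting. The talk refers to the softmax of the pairwise distances; this sketch uses the environment network's pair scores in that role, which is an assumption on our part. The softmax over the two pair scores is pushed towards the uniform distribution, so the extracted features carry no usable environment cue, and this penalty does flow back into the feature extractor.

```python
# Confusion loss (sketch): make the environment network's two pair scores
# indistinguishable by matching their softmax to the uniform distribution.
import torch
import torch.nn.functional as F

def confusion_loss(env_net, feat_anchor, feat_pos, feat_neg):
    pos_score = env_net(torch.cat([feat_anchor, feat_pos], dim=-1))
    neg_score = env_net(torch.cat([feat_anchor, feat_neg], dim=-1))
    scores = torch.cat([pos_score, neg_score], dim=-1)   # (batch, 2)
    log_probs = F.log_softmax(scores, dim=-1)
    uniform = torch.full_like(log_probs, 0.5)            # target: can't tell the pairs apart
    # KL(uniform || softmax(scores)); zero exactly when the two scores are equal
    return F.kl_div(log_probs, uniform, reduction='batchmean')
```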
0:06:48 | The extent to which the confusion loss contributes to the overall loss function is controlled by the variable alpha. |
0:06:56 | When alpha is zero, this is the normal training setup for speaker verification. |
0:07:01 | As alpha increases, the contribution of the confusion loss increases too. |
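As a summary of this slide, the speaker-phase objective can be written as follows; the symbols are our paraphrase of the talk, not necessarily the paper's exact notation:

```latex
L_{\mathrm{speaker\text{-}phase}} = L_{\mathrm{cross\text{-}entropy}} + \alpha \, L_{\mathrm{confusion}}
```

Setting alpha to zero recovers the standard speaker classification setup, while the environment network itself is updated with its own loss in the alternating environment phase.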
0:07:09 | Now we are going to explain our experiments. |
0:07:13 | Our experiments are performed on two different architectures, namely VGG-M-40 and Thin ResNet-34. |
0:07:19 | Although the original VGG-M is not a state-of-the-art network, it is known for high efficiency and good classification performance. |
0:07:29 | VGG-M-40 is a modification of this network that takes 40-dimensional mel filterbanks as the input instead of the whole spectrogram, thereby reducing the number of computations. |
0:07:41 | Thin ResNet-34 is the same as the original ResNet with 34 layers, except with only a quarter of the channels in each residual block, in order to reduce the computational cost. |
0:07:54 | For aggregation, we use two types of pooling: temporal average pooling and self-attentive pooling. |
0:08:00 | Temporal average pooling simply takes the mean of the features along the time domain, |
0:08:06 | while self-attentive pooling is introduced to pay more attention to the frames that are more informative for utterance-level speaker recognition. |
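Below is a minimal sketch of self-attentive pooling, following the common SAP formulation (a learned per-frame attention score followed by a weighted mean); the exact parameterization may differ in detail from the paper's.

```python
# Self-attentive pooling (sketch): weight frames by a learned attention
# score before averaging over time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x):                          # x: (batch, time, dim)
        w = F.softmax(self.attention(x), dim=1)    # (batch, time, 1) frame weights
        return (w * x).sum(dim=1)                  # weighted mean over time

# Temporal average pooling, by contrast, is simply x.mean(dim=1).
```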
0:08:15 | As for the speaker and environment networks: |
0:08:19 | for the speaker network, a single fully connected layer is used, |
0:08:23 | and for the environment network, two fully connected layers with ReLU activation are used. |
0:08:29 | We train our models and evaluate them on the VoxCeleb1 dataset. |
0:08:34 | For identification, we train on the speakers that overlap between the development sets for identification and verification, so that models trained for identification can be reused for verification. |
0:08:48 | This makes the identification a 1,211-way classification task, and the test set consists of unseen utterances of the speakers seen during training. |
0:09:00 | For verification, all speech segments from the 1,211 development-set speakers are used for training, and the trained model is then evaluated on the 40 unseen test speakers. |
0:09:15 | During training, we use a fixed two-second temporal segment extracted randomly from each utterance. |
0:09:23 | Spectrograms are extracted with a Hamming window of width 25 milliseconds and step 10 milliseconds. |
0:09:33 | The 257-dimensional spectral frames are used as the input to the Thin ResNet-34 network, |
0:09:39 | whereas VGG-M-40 uses 40-dimensional mel filterbanks as the input. |
0:09:45 | Mean and variance normalization is performed on every frequency bin of the spectrogram; this is applied at the utterance level. |
0:09:53 | No voice activity detection or data augmentation is used in training. |
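A sketch of this input pipeline is shown below. The 16 kHz sample rate and the 512-point FFT are assumptions consistent with the 257-dimensional frames mentioned above (257 = 512/2 + 1); the exact implementation in the paper may differ.

```python
# Feature extraction (sketch): 25 ms Hamming window, 10 ms step
# (400/160 samples at 16 kHz), and per-frequency-bin mean/variance
# normalization over the whole utterance.
import numpy as np
import librosa

def spectrogram_features(wav, sr=16000):
    spec = np.abs(librosa.stft(wav, n_fft=512,
                               win_length=int(0.025 * sr),   # 25 ms window
                               hop_length=int(0.010 * sr),   # 10 ms step
                               window='hamming'))            # (257, frames)
    # Mean and variance normalization for every frequency bin, utterance level
    spec = (spec - spec.mean(axis=1, keepdims=True)) \
           / (spec.std(axis=1, keepdims=True) + 1e-8)
    return spec
```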
0:10:02 | The networks are trained on a classification task, but the verification task requires a measure of similarity. |
0:10:10 | In our work, the final layer used for classification is replaced with a low-dimensional embedding layer, and this layer is fine-tuned with a contrastive loss with hard negative mining. |
0:10:24 | The CNN feature extractor is not fine-tuned with the contrastive loss. |
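The contrastive objective just described can be sketched as follows; the margin value is an illustrative assumption, and the within-batch hard negative mining step is only noted in a comment.

```python
# Contrastive loss (sketch) for fine-tuning the embedding layer; embeddings
# come from the frozen CNN extractor plus the new low-dimensional layer.
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_speaker, margin=1.0):
    """same_speaker: 1.0 for positive (same-speaker) pairs, 0.0 for negatives."""
    d = F.pairwise_distance(emb1, emb2)
    pos = same_speaker * d.pow(2)                          # pull positives together
    neg = (1 - same_speaker) * F.relu(margin - d).pow(2)   # push negatives past margin
    # Hard negative mining (selecting the most confusable negative pairs in
    # the batch before computing the loss) is omitted here for brevity.
    return (pos + neg).mean()
```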
0:10:30 | The networks are trained using stochastic gradient descent, with the initial learning rate decreasing by a factor of 0.95 every epoch. |
0:10:41 | Training runs for up to around one hundred epochs, and is stopped whenever the validation error rate no longer improves over a number of epochs. |
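A minimal sketch of this optimization setup is below; the initial learning rate in the audio is unclear, so the value here is a placeholder, and the linear model merely stands in for the real network.

```python
# Optimization (sketch): SGD with an exponentially decaying learning rate.
import torch
import torch.nn as nn

model = nn.Linear(40, 1211)   # stand-in for the speaker network (1,211 classes)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # placeholder initial rate
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... one epoch of alternating environment-phase / speaker-phase updates ...
    scheduler.step()          # decay the learning rate after every epoch
    # stop early here if the validation error rate has not improved recently
```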
0:10:51 | The replay experiment measures the performance on the same VoxCeleb1 test set used in the speaker verification task, |
0:10:58 | but the utterances are replayed over a loudspeaker and re-recorded using a microphone. |
0:11:05 | This results in a significant change in channel characteristics and a degradation of sound quality. |
0:11:12 | The models are identical to those used in the previous experiments, and are not fine-tuned on the replayed segments. |
0:11:21 | This table reports results for multiple models used for evaluation, across both the speaker identification and verification tasks. |
0:11:30 | The models trained with the proposed adversarial strategy, with alpha greater than zero, consistently outperform those trained without it, where alpha equals zero. |
0:11:42 | The replay equal error rate is the result of the replay experiment mentioned in the previous slide. |
0:11:48 | The improvement in performance as alpha increases is more pronounced in this setting, which suggests that models trained with the proposed adversarial training generalize much better to unseen environments or channels. |
0:12:03 | Our final experiment examines whether the adversarial training actually removes environment information from the embedding. |
0:12:12 | The test list for evaluating the environment recognition performance consists of 9,486 same-speaker pairs, half of which come from the same video and the other half from different videos. |
0:12:26 | A lower equal error rate indicates that the network is better at predicting whether or not a pair of audio segments comes from the same video. |
0:12:35 | The results demonstrate that the environment recognition performance decreases with increasing alpha, which shows that unwanted environment information has been removed from the speaker embedding, to an extent. |
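For reference, the equal error rate used throughout these experiments can be computed from pair scores and same/different labels as sketched below, using scikit-learn's ROC utilities; interpolation details vary between implementations.

```python
# Equal error rate (sketch): the operating point where the false acceptance
# rate equals the false rejection rate.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for target (same) pairs, 0 for non-target; scores: similarity."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where FAR and FRR cross
    return (fpr[idx] + fnr[idx]) / 2
```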
0:12:49 | To summarize our work: |
0:12:51 | First, we propose an environment adversarial training framework to learn speaker-discriminative and environment-invariant embeddings. |
0:13:00 | Secondly, our proposed method achieves the best results when evaluated on the VoxCeleb1 dataset. |
0:13:09 | We also probe the network to verify that the environment information is indeed removed from the embedding. |
0:13:16 | Thank you. |