0:00:06 | Okay, I'm going to talk about the work we did in the scope of the last NIST speaker recognition evaluation.
0:00:21 | This is the outline of my presentation. I will try to motivate and introduce the problem we wanted to face and to solve. Since this work is built on connectionist speech recognition, I will first have a look at the very basics. Then I will introduce the novel features obtained in this work, which I call the transformation network features, for speaker recognition, and then we will go to the experiments, the conclusions, and some future work ideas.
0:00:56 | The main motivation was that we wanted to participate in the NIST evaluation, and we saw that the best systems were combining a number of different subsystems. I am not going to mention all of them, there are many possible subsystems, but among all of them I was particularly attracted by what are usually called high-level features, which is in close relation with the previous session.
0:01:27 | Basically, these systems use the speaker adaptation transforms employed in ASR systems as features for speaker detection, and they are proposed as alternatives to the short-time cepstral features that are the most commonly used ones. We have the well-known MLLR-transform work; in fact the work I present is very closely related to it, but it was developed in another framework, with some differences of course.
0:01:56 | What is done in that work is to use the weights derived from the MLLR transforms to produce per-segment feature vectors, concatenate them, and use these coefficients to model the speaker with support vector machines, as sketched below.
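For readers less familiar with that reference approach, here is a minimal sketch of the MLLR-transform-as-features idea just described. It is not the presenter's code: the helper names (`target_transforms`, `background_transforms`, `test_transforms`) and the array shapes are illustrative assumptions, and scikit-learn's `LinearSVC` stands in for a generic linear-kernel SVM.

```python
# Hedged sketch: stack MLLR transform coefficients into per-segment vectors
# and train one linear SVM per target speaker against a background set.
import numpy as np
from sklearn.svm import LinearSVC

def mllr_to_vector(transforms):
    """Flatten the [A|b] coefficients of one or more MLLR transforms
    (e.g. one per regression class) into a single fixed-length vector."""
    return np.concatenate([np.asarray(t).ravel() for t in transforms])

target_vec = mllr_to_vector(target_transforms)             # enrollment segment (assumed given)
bg_vecs = np.stack([mllr_to_vector(t) for t in background_transforms])

X = np.vstack([target_vec[None, :], bg_vecs])
y = np.concatenate([[1], np.zeros(len(bg_vecs))])          # 1 = target, 0 = impostors

svm = LinearSVC(C=1.0).fit(X, y)                           # linear kernel, one SVM per target
score = float(svm.decision_function(mllr_to_vector(test_transforms)[None, :]))
```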
0:02:13 | So what is the problem, at least for us? We have always been working with hybrid speech recognition systems based on artificial neural networks. As I will show later, they have some remarkable characteristics, but the main problem, and the motivation for this work, is that we cannot use the typical adaptation methods, like MLLR, that are usually used in Gaussian approaches.
0:02:45 | So what I tried to do in this work, at the very beginning, was to see whether I could do something similar to the MLLR transformation for hybrid systems, and whether we could use it to obtain speaker information and put it into a speaker recognition system, and then compare it with some baseline systems on the NIST SRE telephone-telephone condition.
0:03:11 | Let me first remind you of the basics of hybrid recognition, for those of you who are not very familiar with it. We have been working on this for several applications, mainly broadcast news transcription, but also for telephone applications and for some other languages, although our main focus is our own language.
0:03:39 | The way it usually works is that we replace the Gaussians by a neural network, in our case a multilayer perceptron, and we use its probability estimates as the posterior probabilities, or rather the scaled likelihoods, of single-state HMMs. Usually we have relatively few outputs, just phonemes or some other sub-word units, but not many more; a sketch of this posterior-to-likelihood step follows.
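As a minimal illustration of the hybrid decoding step just described, the posteriors produced by the MLP are divided by the state priors to obtain scaled likelihoods. The function below is a generic sketch, not the presenter's implementation, and the array shapes are assumptions.

```python
# Hedged sketch: turn MLP phone posteriors into scaled likelihoods for the HMM decoder.
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """posteriors: (n_frames, n_phones) softmax outputs of the MLP.
    priors: (n_phones,) relative frequencies of the phone labels in training.
    Returns log( p(phone|x) / p(phone) ), proportional to log p(x|phone)."""
    post = np.clip(posteriors, floor, 1.0)
    return np.log(post) - np.log(np.clip(priors, floor, 1.0))
```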
0:04:12 | The main characteristics are that the neural networks are usually considered good classifiers, that their outputs make it easy to combine several feature streams, and that they behave quite well in practice, as we will see. On the other hand, we have some problems with context modelling, and also with adaptation: there are no well-established adaptation methods like the ones available for Gaussian systems.
0:04:37 | This is a block diagram of our broadcast news transcription system; you can probably see it here. There are several streams with different features: PLP features, PLP with RASTA, and this one, modulation spectrogram features. For each of them there is a different multilayer perceptron, well trained with transcriptions and everything, and we merge the streams with a simple combination rule over the posteriors.
0:05:18 | These posterior probabilities are then used by the decoder, together with a language model, the lexicon, and the definitions of the HMMs, that is, the relation between each output probability and, for instance, a phoneme, to provide the most likely word sequence.
0:05:40 | Some characteristics of the system: it runs in less than one times real time, at roughly seventy percent word accuracy. We use the phonemes plus some other phonetic units as outputs, and we train it with about forty hours of data. We have a trigram language model that is an interpolation of the transcripts and text from newspapers, and a relatively small vocabulary.
0:06:14 | For the evaluation data I needed to train a new speech recognizer, and I would say it is a very, very weak system, because I did not have access to CTS speech. Basically, what I did was to retrain the neural networks with downsampled data, the downsampled broadcast news data.
0:06:38 | There are some other differences in the system I use in this work: I have only one MLP, with cepstral plus delta features, and I only use monophone units. I did some very informal evaluations, just to see for myself how it was working on conversational telephone data, and I got very high word error rates.
0:07:02 | But anyway, this recognizer is used for two purposes: first, to generate a phonetic alignment from the transcriptions provided by NIST, and second, for training the speaker adaptation transformations.
0:07:20 | So, how can we adapt a neural network to the speaker, or to whatever else? There are several approaches, but basically two families. The first one is to start from the speaker-independent MLP network and run the error back-propagation algorithm again: we start from the already trained network instead of random weights, and we update the weights with the adaptation data. A variant is to update only some of the weights, for instance the ones that go from the last hidden layer to the output layer.
0:08:03 | Perhaps a more interesting thing to do is to modify the structure of the MLP network while trying not to modify the speaker-independent component. For instance, at the phonetic level we can add some kind of transformation at the output of the network and try to adapt it to the speaker characteristics; on the other hand, we can try the same at the acoustic level, that is, try to adapt the input features to the characteristics of the speaker-independent system.
0:08:41 | I ran some tests of this last solution, just to verify that it could work, and it works; I found that it was the best one for our own application, so I decided to try it also for this task.
0:08:58 | Here we have a typical MLP network with just one hidden layer: the input layer, the hidden layer, and the output layer. How can we train these speaker adaptation transformations? Basically, we incorporate a new linear layer, a linear input network, right at the beginning, and we apply the error back-propagation algorithm as usual: we have data with labels, we make the forward propagation, compute the output of the network, compute the error, and back-propagate it through the network.
0:09:35 | Then, when it comes to updating the weights, we only update those of the linear input network, and we keep the speaker-independent component frozen; the sketch below illustrates the idea.
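Here is a minimal sketch of that linear-input-network (LIN) adaptation, written in PyTorch. It is not the presenter's code: the layer sizes, the `frames` and `phone_labels` tensors, and the training hyperparameters are illustrative assumptions; the point is only that gradients flow through the frozen speaker-independent MLP while only the new linear layer is updated.

```python
# Hedged sketch of LIN adaptation: a trainable linear input layer in front of
# a frozen speaker-independent MLP, trained by back-propagation on phone labels.
import torch
import torch.nn as nn

n_in, n_hidden, n_phones = 39, 500, 40                 # illustrative sizes

si_mlp = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid(),
                       nn.Linear(n_hidden, n_phones))  # stands in for the trained SI network
for p in si_mlp.parameters():
    p.requires_grad_(False)                            # speaker-independent part stays frozen

lin = nn.Linear(n_in, n_in)                            # new linear input network
nn.init.eye_(lin.weight)                               # start from the identity mapping
nn.init.zeros_(lin.bias)

opt = torch.optim.SGD(lin.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def adapt(frames, phone_labels, epochs=5):
    """frames: (n_frames, n_in) float tensor; phone_labels: (n_frames,) long tensor."""
    for _ in range(epochs):                            # fixed number of epochs, no cross-validation
        opt.zero_grad()
        logits = si_mlp(lin(frames))                   # forward through LIN + frozen SI MLP
        loss_fn(logits, phone_labels).backward()       # gradients reach only the LIN weights
        opt.step()
    return lin.weight.detach(), lin.bias.detach()
```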
0:09:50 | Let me say something about this transformation, this normalization. As I said, it is intended to compensate the mismatch between the speaker-independent models and the new data, so that the MLP performance improves. It can be considered a kind of feature normalization, but with some special characteristics, because we are not imposing any restriction on the normalization process; I mean, we do not have a target speaker towards which we try to normalize the data.
0:10:22 | And according to previous works, it seems that it is also architecture dependent: if we train the transformation network against one speaker-independent network and then we change to a different speaker-independent network, say one that has two hidden layers instead of one, it does not work any more. So it has some kind of dependence on the network it has been trained with.
0:10:50 | When we use different segments of the same speaker, what we hope is that, despite the session differences, we capture the speaker, and also some session and channel information, and I thought that could be useful for speaker recognition.
0:11:06 | So, to extract the features, I obtain the phonetic alignment with the NIST transcriptions and train a speaker adaptation transformation for every segment. One special thing I do is to remove long segments of silence, to avoid background and channel effects in the resulting features. And I do not do any kind of cross-validation, which is what is usually done in MLP training; I just apply a fixed number of epochs and a learning rate schedule based on heuristics from previous work.
0:11:41 | Another thing is that, instead of training a full matrix over the whole input (the input of the MLP is usually composed of the current frame and its context, so the full square matrix would grow with the number of features and the shape of the context), I tie the network: the same transform is applied to each frame independently of its position in the context. This reduces the size of the transformation, so the networks are also much smaller, and what I finally do is concatenate the resulting weights.
0:12:23 | In addition to the transformation weights, in the final feature vector I also stack the feature mean and variance, because it is very usual to do mean and variance normalization at the input of the MLP. And I do this for the different streams that we have: the PLP, the PLP with RASTA, the modulation spectrogram, and the MFCC features. The sketch below shows how one such vector is assembled.
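A minimal sketch of how one per-segment feature vector could be assembled from the tied LIN weights plus the input feature statistics; it reuses the illustrative `adapt` helper from the previous sketch, and all shapes and names are assumptions rather than the presenter's actual pipeline.

```python
# Hedged sketch: per-segment transformation network feature vector =
# tied LIN weights + bias + per-dimension feature mean and variance.
import numpy as np
import torch

def segment_feature_vector(frames, phone_labels):
    """frames: (n_frames, n_in) numpy array of one acoustic stream for one segment."""
    W, b = adapt(torch.as_tensor(frames, dtype=torch.float32),
                 torch.as_tensor(phone_labels))        # tied n_in x n_in transform + bias
    return np.concatenate([W.numpy().ravel(),          # LIN weights
                           b.numpy().ravel(),          # LIN bias
                           frames.mean(axis=0),        # input feature mean
                           frames.var(axis=0)])        # input feature variance

# One such vector is computed per stream (PLP, RASTA-PLP, modulation spectrogram,
# MFCC), and the per-stream vectors can later be concatenated into one SVM input.
```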
0:12:49 | For modelling, I use support vector machines. I take the target speaker feature vector as the positive example, and a set of impostor feature vectors is used as the negative examples. I use LIBSVM with a linear kernel, and I apply rank normalization to the input feature vectors, as in the sketch below.
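A minimal sketch of rank normalization against a background set followed by a one-versus-background linear SVM. It is only illustrative: `impostor_vectors`, `target_vector` and `test_vector` are assumed to be given, and scikit-learn's `LinearSVC` stands in for LIBSVM with a linear kernel.

```python
# Hedged sketch: rank-normalize feature vectors against the impostor set,
# then train one linear SVM per target speaker.
import numpy as np
from sklearn.svm import LinearSVC

def rank_normalize(X, background):
    """Map every feature value to its rank within the background data, scaled to [0, 1]."""
    bg_sorted = np.sort(background, axis=0)
    ranks = np.stack([np.searchsorted(bg_sorted[:, d], X[:, d])
                      for d in range(X.shape[1])], axis=1)
    return ranks / float(len(background))

bg = np.asarray(impostor_vectors)                      # impostor segments (assumed given)
target = np.asarray(target_vector)[None, :]            # one enrollment vector

X = np.vstack([rank_normalize(target, bg), rank_normalize(bg, bg)])
y = np.concatenate([[1], np.zeros(len(bg))])
model = LinearSVC(C=1.0).fit(X, y)                     # one SVM per target speaker

test = rank_normalize(np.asarray(test_vector)[None, :], bg)
score = float(model.decision_function(test))           # detection score for the trial
```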
0:13:09 | So let us go to the experiments. I used the NIST SRE 2008 data, the short2-short3 task, and only the telephone-telephone condition. I used two baseline systems to verify the usefulness, or not, of this approach.
0:13:26 | The first one is a quite simple GMM-UBM based on cepstral features. I remove non-speech frames by means of an alignment of the log energy, I do the usual feature normalization, and so on: typical things in a GMM-UBM system. A rough sketch of this kind of baseline follows.
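For completeness, here is a rough sketch of a GMM-UBM baseline of the kind described: a UBM trained on background data, relevance-MAP adaptation of the means for each target speaker, and average log-likelihood-ratio scoring. It is generic, not the presenter's configuration; scikit-learn's `GaussianMixture` stands in for the UBM trainer, the mixture size and relevance factor are assumptions, and reusing a mixture object with hand-set parameters is just a convenient trick here.

```python
# Hedged sketch of a GMM-UBM speaker detection baseline.
import numpy as np
from sklearn.mixture import GaussianMixture

ubm = GaussianMixture(n_components=512, covariance_type='diag')
ubm.fit(background_frames)                              # frames pooled from previous SREs (assumed given)

def map_adapt_means(frames, ubm, r=16.0):
    """One pass of relevance-MAP adaptation of the UBM means."""
    post = ubm.predict_proba(frames)                    # (n_frames, n_mix) responsibilities
    n_k = post.sum(axis=0) + 1e-10
    ex_k = post.T @ frames / n_k[:, None]               # first-order statistics per mixture
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * ex_k + (1.0 - alpha) * ubm.means_

def llr_score(test_frames, speaker_means, ubm):
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    spk.weights_, spk.means_ = ubm.weights_, speaker_means
    spk.covariances_ = ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_
    return float(np.mean(spk.score_samples(test_frames) - ubm.score_samples(test_frames)))
```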
0:13:52 | The UBM is trained on data from previous SRE evaluations, and I also apply score normalization.
0:14:00 | In addition to that, a considerably more competitive system: a GMM supervector system, where the supervectors are classified with an SVM, as sketched below. For the negative set I derived the supervectors from the speaker models built on background data from the previous SRE evaluations. I did not apply score normalization to this one because I did not see much improvement, so probably there is some kind of problem in my configuration.
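A minimal sketch of the GMM supervector idea: the MAP-adapted means of each segment are stacked into one long vector and modelled with an SVM. It reuses the illustrative `map_adapt_means` helper from the previous sketch; the square-root-weight-over-sigma scaling is a common choice that is assumed here, not something stated in the talk.

```python
# Hedged sketch: GMM mean supervectors classified with a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def supervector(frames, ubm):
    means = map_adapt_means(frames, ubm)                       # (n_mix, dim) adapted means
    scale = np.sqrt(ubm.weights_)[:, None] / np.sqrt(ubm.covariances_)
    return (scale * means).ravel()                             # (n_mix * dim,) supervector

target_sv = supervector(enroll_frames, ubm)                    # enrollment data (assumed given)
bg_svs = np.stack([supervector(f, ubm) for f in background_segments])

X = np.vstack([target_sv[None, :], bg_svs])
y = np.concatenate([[1], np.zeros(len(bg_svs))])
sv_svm = LinearSVC(C=1.0).fit(X, y)
score = float(sv_svm.decision_function(supervector(test_frames, ubm)[None, :]))
```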
0:14:36 | I did calibration and fusion, gender dependent, with a standard toolkit, in two steps: first a calibration of every single system, and later a linear logistic regression fusion in the cases where more than one system is combined. I did it with k-fold cross-validation on the same evaluation set; I know this is not completely clean, because using the evaluation set for calibration is not quite right. A schematic sketch of the two-step procedure follows.
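A minimal sketch of the per-system calibration followed by linear logistic regression fusion. Plain `LogisticRegression` from scikit-learn is used here as a stand-in for a proper calibration and fusion toolkit, and the score arrays and labels are placeholders; in practice the regressions would be trained on held-out folds rather than on the trials being scored.

```python
# Hedged sketch: calibrate each subsystem's scores, then fuse with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate(scores, labels):
    """Per-system calibration: map raw scores to well-behaved log-odds."""
    return LogisticRegression().fit(scores.reshape(-1, 1), labels)

def fuse(score_matrix, labels):
    """score_matrix: (n_trials, n_systems) calibrated scores of the subsystems."""
    return LogisticRegression().fit(score_matrix, labels)

# Example with two subsystems (labels: 1 = target trial, 0 = non-target):
cal_a = calibrate(scores_sys_a, trial_labels)
cal_b = calibrate(scores_sys_b, trial_labels)
stacked = np.column_stack([cal_a.decision_function(scores_sys_a.reshape(-1, 1)),
                           cal_b.decision_function(scores_sys_b.reshape(-1, 1))])
fusion = fuse(stacked, trial_labels)
fused_scores = fusion.decision_function(stacked)       # fused detection scores
```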
0:15:15 | Here we have some first results. In blue you can see the curves of the individual transformation network systems, based on the different features: PLP, PLP with RASTA, modulation spectrogram, and MFCC. You also have the minimum detection cost function for each; I have to say that it is the cost of the SRE 2008 evaluation plan, not the new one. A sketch of this cost computation follows.
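As a reminder, the minimum detection cost reported here follows the SRE 2008 parameters (C_miss = 10, C_fa = 1, P_target = 0.01); the sketch below shows a straightforward way to compute it from target and non-target trial scores, with the score arrays assumed to be given.

```python
# Hedged sketch: minimum detection cost function with the SRE 2008 parameters.
import numpy as np

def min_dcf(target_scores, nontarget_scores, c_miss=10.0, c_fa=1.0, p_tgt=0.01):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    costs = []
    for t in thresholds:
        p_miss = np.mean(target_scores < t)             # missed targets at this threshold
        p_fa = np.mean(nontarget_scores >= t)           # false alarms at this threshold
        costs.append(c_miss * p_miss * p_tgt + c_fa * p_fa * (1.0 - p_tgt))
    return float(np.min(costs))
```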
0:15:43 | And this is the DET plot. Well, the first remark I want to make about this is that it is not great, but it worked, and anyway I was not sure it would when I started. Comparing the individual systems, we can see probably better performance for the MFCC features, but I do not have a good explanation for that: maybe because the feature size is bigger, although I am not sure, or simply because that network is better, and so the classifier is better.
0:16:12 | Then I did two other experiments: first, to fuse with logistic regression the four individual systems; and, even better, to concatenate the four transformation network feature vectors and train a single SVM on the whole concatenated feature vector. We can see a nice improvement using the concatenated feature vector.
0:16:43 | Let me move to the next one. This is the plot comparing the different baseline systems together with the newly proposed transformation network system, the TN-SVM. With respect to the GMM-UBM, it performs better close to the minimum-cost operating point, but it seems that it gets worse, or at least closer, as we move towards the equal error rate point.
0:17:14 | With respect to the supervector system, we have a slightly worse performance close to the minimum-cost point, and it gets a bit worse at the other operating points, towards the equal error rate.
0:17:30 | What I think is important from these results is that I can achieve a performance more or less similar to the baseline systems I am comparing to: in some cases a bit worse, in some cases a bit better, but not radically different.
0:17:46 | So the final purpose of the work was, in fact, to try to use it to improve the baseline systems, and these are the results of the combinations. You can see several different combinations; these are the two baselines, and this is the minimum cost obtained by each.
0:18:04 | We can see that when we incorporate the transformation network features system we get some improvement, and that holds for all the combinations here.
0:18:24 | Okay, with that, let me move to the conclusions.
0:18:32 | What I wanted to do in this work is to show that features derived from the adaptation techniques of hybrid HMM/ANN systems can be used for speaker recognition, in a very similar way to how MLLR transforms are used with Gaussian systems. I have used one such adaptation technique, the linear input network, and built feature vectors based on the coefficients of the resulting transforms and on the mean and variance of the input features, and they seem to perform reasonably well.
0:19:07 | With respect to the baselines, we could see a relatively good performance: at some operating points of the DET curve it was better, at others it was worse, but they are more or less similar performances. And we could verify that it provides some complementary speaker information to what we already have in our baseline systems.
0:19:33 | With respect to current and future work, to go on with these features we first need a better recognizer, because the one I had gave a very high word error rate. That matters for the alignment, and also for the adaptation itself, because with a better speech recognition system we would probably obtain more meaningful features.
0:20:03 | We did almost no tuning of the network architecture and its characteristics, and we should probably do something there, and try to understand, for instance, what the relation is between the architecture of the speaker-independent network and the resulting features, or even try other adaptation methods; I did not try adaptation at the outputs of the network. Some of these things are already under study, we can say. We would also like to apply inter-session variability compensation, like NAP.
0:20:42 | And something that I think can really work, and would be interesting, is to do something similar to what is done for language identification: to use several MLP networks trained on different languages, train the transformation networks for each of these languages without any phonetic alignment, and then concatenate everything into a single feature vector. This way the approach would not need the ASR transcriptions any more and, finally, it would also be language independent.
0:21:11 | That's all.
0:21:13 | Okay.
0:21:24 | Okay, questions?
0:21:30 | (Question from the audience, largely inaudible, about whether any normalization was applied to the systems or the features.)
0:22:13 | No, I just do rank normalization of the input of the SVM modelling; I apply it also when I am doing testing, but I did not do any other normalization of the feature vectors. The rank normalization, I think, is conventional in these support vector machine approaches.
0:22:33 | (Follow-up question, largely inaudible, about whether some features were selected or treated differently.)
0:22:40 | It is true. Well, regarding the number of features, it was up to the SVM; I mean, it will select the features that are more important. I did not treat the features coming from PLP or the other streams in any different way; I just let the SVM learn what it thought was better. I did not do anything in that direction.
0:23:10 | (Question, partly inaudible, about the amount of data: the UBM can be trained on data from many speakers, while training a neural network needs much more data, so why not...)
0:23:41 | I am not sure I got it very well; you are asking whether I tried a random initialisation of the MLP network for training?
0:23:53 | (Questioner, partly inaudible, asking about the softmax and the nonlinearity of the added layer.)
0:24:18 | I have a softmax output here, at the output layer, and I do not have any other softmax output. This is the linear input network, and I am not applying any kind of nonlinearity at this point; no, there is no nonlinearity there. I do not know if I am answering your question.
0:24:43 | (Inaudible follow-up.)
0:24:47 | No, there is nothing else in there; it is just the speech features, PLP or the others, the current frame and its context. I do not use anything beyond that context; it is just the features.
0:25:12 | (Questioner) Could you go back to the slide with the table? For the baseline, how many MAP iterations did you use?
0:25:29 | Ah, I did five, and that probably hurt the supervector system. I think it was... yes, I did five MAP iterations, and I saw an improvement with them, but...
0:25:40 | (Questioner) You did five MAP iterations before computing the supervectors?
0:25:47 | Yeah.
0:25:48 | We verified, well, I am not completely sure, but in the basic GMM-UBM, with five iterations we got better results; even going farther away from the usual setting we got a slight improvement. But we had not verified it when we moved to the supervector system, and later I realised that this was not a good idea: probably, with the configuration I had, it hurt the performance of the supervector system. Sure, I realise that.
0:26:51 | (Question, largely inaudible, about how much the results improve with that technique.)
0:27:09 | It improves, but not too much; probably there is some configuration problem, because I see that other people get very nice improvements with it. I do not know if it is because of the telephone-only condition. Let us say I tried with different dimensionalities, and it improves, but it was not moving from, I do not remember how much I had here, something like the 6.59 down to 3; it was less than that.
0:27:48 | (Question, largely inaudible, apparently about how the SVM scores are obtained.)
0:28:25 | If I did something not quite right there, it would be because I did not verify it in both systems. Okay, fine. I am not sure; I am currently using the SVM, and I think it was using the probability estimates it gives, which is probably not a good idea. I am using that in both systems based on SVM, I mean the supervector system and also my proposal, so I think I can improve both in that way. That is more or less what you were mentioning: I am doing this kind of comparison against the background using the probability estimate the SVM gives, and I think that estimate is not as good as the raw score for the speaker detection task.
0:29:09 | But, well... okay.