0:00:13 | thank you very much for viewing this video presentation |
0:00:16 | [speaker self-introduction] |
0:00:20 | today i will talk about compensation on x-vector for short utterance spoken language identification |
0:00:31 | i organized this presentation as follows |
0:00:35 | first we introduce the short utterance language identification task |
0:00:42 | then i show the neural network based embedding technique, the x-vector extractor |
0:00:47 | and show how x-vectors are used for the lid task |
0:00:53 | after that the feature compensation learning will be introduced |
0:00:58 | then i show our experimental setup and results |
0:01:04 | and the summary and conclusions |
0:01:10 | language identification techniques are typically used as a pre-processing stage in multilingual speech recognition and translation systems |
0:01:22 | for real-time speech processing systems improving the performance of short utterance lid is important |
0:01:31 | because it can help to reduce the real-time factor and the latency of the whole system |
0:01:40 | one of the state-of-the-art methods is the i-vector based method |
0:01:46 | although this method has been very effective |
0:01:52 | recently most researchers have adopted neural network based approaches |
0:01:58 | because lid is a classification task a neural network model can be directly used for classification |
0:02:10 | embedding based methods have shown good performance on the short duration lid task |
0:02:18 | the x-vector was initially proposed for the speaker verification task |
0:02:23 | and in a recent study it was also successfully applied to the lid task |
0:02:28 | in this work we focus on the x-vector based method |
0:02:36 | the x-vector is a neural network based utterance representation |
0:02:41 | note that it has been applied to many tasks |
0:02:45 | such as speaker recognition and recently language identification |
0:02:51 | the network for extracting x-vectors consists of three modules |
0:02:59 | a frame-level feature extractor |
0:03:02 | a statistics pooling layer |
0:03:05 | and the utterance-level representation layers |
0:03:11 | the frame-level feature extractor module outputs frame-level representations computed over a sequence of acoustic features |
0:03:24 | for this module a time delay neural network or a convolutional neural network is used |
0:03:34 | then the statistics pooling layer converts the frame-level features into a fixed dimensional vector by using the mean and the standard deviation |
0:03:53 | finally fully connected hidden layers are used to process the utterance-level representation |
0:04:03 | and the final softmax layer outputs the posteriors of the target classes |
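The three modules described above can be sketched as follows. This is a minimal numpy illustration, not the presenter's actual network: the frame-level extractor is a stand-in linear layer with ReLU, and all dimensions and weights are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_level_extractor(feats, W):
    # stand-in for the TDNN/CNN frame-level module: one linear layer + ReLU
    return np.maximum(feats @ W, 0.0)            # (T, H) frame-level features

def statistics_pooling(frames):
    # pool variable-length frame features into one fixed-size vector
    mu = frames.mean(axis=0)                     # mean over time
    sigma = frames.std(axis=0)                   # standard deviation over time
    return np.concatenate([mu, sigma])           # (2H,) utterance-level vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

T, F, H, L = 200, 60, 16, 5                      # frames, feature dim, hidden dim, languages
feats = rng.standard_normal((T, F))              # a sequence of acoustic features
W = rng.standard_normal((F, H))
W_emb = rng.standard_normal((2 * H, 8))          # utterance-level (embedding) layer
W_out = rng.standard_normal((8, L))              # softmax output layer

frames = frame_level_extractor(feats, W)
pooled = statistics_pooling(frames)              # fixed 2H dims regardless of T
x_vector = pooled @ W_emb                        # the utterance-level embedding
posterior = softmax(x_vector @ W_out)            # posteriors over the target classes
```

Whatever the number of input frames, the statistics pooling step always yields a vector of the same size, which is what lets the fully connected layers operate at the utterance level.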
0:04:16 | the x-vector network is mostly used for the speaker verification task |
0:04:21 | in the verification task the x-vector extractor works as the front end |
0:04:28 | that is it is used to extract utterance-level representations |
0:04:33 | in the back end plda scoring or cosine similarity can be used |
0:04:40 | for the lid task the front-end and back-end approach can also be used |
0:04:46 | compared with that a jointly trained logistic regression became more widely used for the classification task |
0:04:54 | and since lid is usually a closed-set task |
0:04:57 | we can also directly use the network outputs for classification |
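Both scoring styles on top of extracted x-vectors can be sketched like this; the vectors and language labels below are toy values invented for the illustration, not results from the talk.

```python
import numpy as np

def cosine_similarity(a, b):
    # back-end scoring: compare a test x-vector against an enrolled model vector
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy x-vectors standing in for three language models and one test utterance
lang_models = {"eng": np.array([1.0, 0.0, 0.2]),
               "cmn": np.array([0.1, 1.0, 0.0]),
               "spa": np.array([0.0, 0.2, 1.0])}
test_xvec = np.array([0.9, 0.1, 0.3])

# front-end/back-end style: score the test vector against every language model
scores = {lang: cosine_similarity(test_xvec, m) for lang, m in lang_models.items()}
best = max(scores, key=scores.get)               # decision: highest-scoring language
```

For closed-set lid, the same decision can instead be read directly off the softmax outputs of the network, skipping the separate back end.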
0:05:05 | this work focuses on the short utterance lid task |
0:05:11 | when the testing utterances become shorter |
0:05:14 | the performance also decreases |
0:05:18 | the degradation is mainly because the limited content causes a large variation |
0:05:26 | in the short utterance representations |
0:05:29 | to reduce the variation of short utterances |
0:05:34 | normalization methods using the corresponding long utterances |
0:05:39 | were investigated for i-vectors |
0:05:42 | since neural network based embeddings are similar to i-vectors |
0:05:46 | it is natural that we can also apply the same strategy as for the i-vector extractor |
0:05:55 | therefore we think that a similar idea |
0:06:01 | can improve the lid performance of the x-vector network |
0:06:07 | the feature compensation |
0:06:11 | is done by reducing the distance between the representations of the long and the short duration inputs |
0:06:21 | here x_s is the representation of the short utterance |
0:06:27 | and x_l is the representation of the corresponding long utterance |
0:06:35 | in the x-vector space this equation can be rewritten as this one |
0:06:44 | for training we first train the x-vector network by using the long duration inputs |
0:06:55 | then the short inputs are used to model the compensation mapping function |
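As a simplified sketch of this training recipe: the talk learns a neural mapping from short-utterance to long-utterance embeddings with an MSE objective; here the mapping is reduced to a linear one fitted in closed form on synthetic paired embeddings, purely to illustrate the objective.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 8, 500                                    # embedding dim, paired utterances

# synthetic paired embeddings: x_l from long inputs, x_s from short crops
X_long = rng.standard_normal((N, D))
X_short = X_long @ rng.standard_normal((D, D)) * 0.5 + 0.3 * rng.standard_normal((N, D))

# learn a compensation map A minimizing ||X_short A - X_long||^2
# (a closed-form stand-in for the neural mapping trained with the MSE loss)
A, *_ = np.linalg.lstsq(X_short, X_long, rcond=None)

X_comp = X_short @ A                             # compensated short-utterance embeddings
mse_before = float(np.mean((X_short - X_long) ** 2))
mse_after = float(np.mean((X_comp - X_long) ** 2))
```

After fitting, the compensated short-utterance embeddings sit closer to their long-utterance counterparts than the raw ones did, which is exactly what the MSE objective enforces.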
0:07:02 | considering the difference between the long and the short utterances |
0:07:08 | a short utterance contains very limited information |
0:07:12 | therefore to improve the performance on short utterances |
0:07:17 | how to extract both high-level language information and local phonetic information is an important issue |
0:07:25 | we suppose that the variance components of the x-vector |
0:07:28 | can describe the information related to local phonetic information |
0:07:37 | based on this consideration |
0:07:40 | we propose to normalize only the mean components of the x-vector |
0:07:47 | that is the representation of the short utterance is compensated toward that of the long utterance |
0:07:54 | using the mean only |
0:07:57 | while the variance components are kept to preserve the frame-level phonetic information |
0:08:02 | which provides discriminative features for language identification |
0:08:09 | the cost of the proposed method is defined by this equation |
0:08:14 | the representations of the utterances are obtained by the neural network |
0:08:23 | and used to compute the loss |
0:08:28 | in the proposed method one |
0:08:32 | we use the statistics pooling to obtain the representation |
0:08:39 | and in the proposed method two we use the resnet with |
0:08:45 | a global average pooling |
0:08:48 | to obtain the representation |
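Assuming a statistics-pooling embedding laid out as [mean; std], the mean-only compensation can be sketched as mapping the mean half while passing the variance half through untouched. Again the mapping is a closed-form linear stand-in on synthetic data; the real method trains a network for this.

```python
import numpy as np

rng = np.random.default_rng(2)
H, N = 8, 400                                    # mean/std half-dim, paired utterances

# synthetic paired [mean; std] statistics for short and long versions
mu_long = rng.standard_normal((N, H))
sd_long = rng.random((N, H)) + 0.5
mu_short = mu_long + 0.4 * rng.standard_normal((N, H))   # shifted, noisier means
sd_short = sd_long + 0.1 * rng.standard_normal((N, H))

# fit the compensation on the mean components only
A, *_ = np.linalg.lstsq(mu_short, mu_long, rcond=None)

def compensate_mean_only(x):
    # x = [mean; std]: compensate the mean half, keep the variance half as-is
    mu, sd = x[:H], x[H:]
    return np.concatenate([mu @ A, sd])

x = np.concatenate([mu_short[0], sd_short[0]])
y = compensate_mean_only(x)                      # std half is preserved exactly
```

Keeping the variance half untouched is the design point: it is the part assumed to carry the frame-level phonetic information of the short utterance.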
0:08:52 | we evaluated the proposed method on the nist language recognition evaluation |
0:08:59 | 2017 set |
0:09:03 | as the training data we used the data provided in this set |
0:09:07 | and the development data |
0:09:12 | for lre 2017 |
0:09:17 | together with additional telephone data |
0:09:22 | for the test set we used the standard nist lre 2017 evaluation set |
0:09:31 | the evaluation set contains test segments of different duration conditions |
0:09:38 | we also prepared 1.5 second and two second test sets |
0:09:47 | for the input features |
0:09:49 | we used 60-dimensional acoustic features |
0:09:55 | and the average cost cavg was used as the |
0:10:00 | evaluation metric |
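For reference, the average cost follows the usual NIST LRE form, averaging miss and false-alarm rates over target/non-target language pairs. The sketch below assumes hard identification decisions and the common setting C_miss = C_fa = 1, P_target = 0.5; the official metric is defined in the LRE evaluation plan.

```python
def c_avg(trials, languages, p_target=0.5):
    # trials: list of (true_language, decided_language) hard decisions
    n = len(languages)
    cost = 0.0
    for lt in languages:                         # lt: target language
        tgt = [d for t, d in trials if t == lt]
        p_miss = sum(d != lt for d in tgt) / len(tgt)
        fa_sum = 0.0
        for ln in languages:                     # ln: non-target languages
            if ln == lt:
                continue
            non = [d for t, d in trials if t == ln]
            fa_sum += sum(d == lt for d in non) / len(non)
        cost += p_target * p_miss + (1 - p_target) / (n - 1) * fa_sum
    return cost / n

langs = ["eng", "cmn"]
trials = [("eng", "eng"), ("eng", "cmn"), ("cmn", "cmn"), ("cmn", "cmn")]
score = c_avg(trials, langs)                     # one miss, one false alarm
```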
0:10:03 | for the baseline systems we trained a resnet system and |
0:10:08 | x-vector systems |
0:10:10 | the resnet system uses the whole resnet as the classification network |
0:10:17 | with average pooling |
0:10:20 | and fully connected layers on top |
0:10:27 | for the x-vector system |
0:10:31 | we used the same resnet as the frame-level feature extractor |
0:10:37 | for the training examples |
0:10:41 | the long examples were cropped to between five and ten seconds and the short utterances |
0:10:49 | were cut down to two seconds |
0:10:53 | in this table we show the results of the baseline systems |
0:10:59 | we also list the results grouped by the |
0:11:04 | duration of the test utterances |
0:11:07 | as can be seen |
0:11:09 | the x-vector systems are more effective on long duration utterances |
0:11:17 | while on short utterances the resnet |
0:11:23 | system showed the better performance |
0:11:27 | because of the duration mismatch the model trained with long duration samples |
0:11:34 | performed worse on the short test data than the model trained with |
0:11:38 | the short ones |
0:11:42 | next we show the results with the feature compensation methods |
0:11:48 | in this table |
0:11:49 | the baseline is the x-vector network trained with the short examples |
0:11:56 | the results of the mean and variance |
0:12:00 | compensation learning |
0:12:02 | and the two proposed |
0:12:07 | methods are listed in this table |
0:12:10 | for the evaluation |
0:12:13 | we compare the baseline |
0:12:16 | the mean and variance compensation |
0:12:19 | and the proposed methods |
0:12:22 | from the results |
0:12:24 | we can see that |
0:12:26 | the feature compensation |
0:12:28 | using both the mean and the variance |
0:12:31 | could improve the performance |
0:12:35 | on long utterances |
0:12:38 | but did not |
0:12:40 | yield the best results |
0:12:44 | as for the short utterances |
0:12:49 | the compensation using |
0:12:51 | the mean only |
0:12:55 | significantly improved the performance |
0:13:00 | to conclude |
0:13:02 | in this work |
0:13:03 | we investigated improving the neural network based embedding technique |
0:13:10 | the x-vector for the short utterance lid task |
0:13:13 | we compared the baseline with the feature compensation using the mean and the variance |
0:13:22 | and proposed the compensation of the mean only |
0:13:26 | the mean is expected to capture high-level |
0:13:30 | abstract language information |
0:13:32 | while the variance components are kept |
0:13:34 | because they preserve the phonetic information of short utterances |
0:13:40 | the results show that the proposed method is more effective on the short utterance lid task |
0:13:51 | thanks for your attention |