0:00:13thank you very much for watching this video presentation
0:00:16mandarin min come from you don't time
0:00:20today i will talk about feature compensation learning for short utterance spoken language identification
0:00:31i will give this presentation as follows
0:00:35first i will introduce the short utterance language identification task
0:00:42then i will introduce a neural network based embedding technique
0:00:47x-vector
0:00:47and show how the x-vector is used for the lid task
0:00:53after that the feature compensation learning will be introduced
0:00:58then
0:00:58i will show you
0:01:00our experiments and results
0:01:03and finally
0:01:04i will summarize and give the conclusions
0:01:10okay language identification techniques are typically used as a pre-processing stage of multilingual
0:01:18speech recognition and translation systems
0:01:22for real time speech processing systems
0:01:26improving the performance of the short utterance lid task
0:01:30is important
0:01:31because it can
0:01:32help to reduce the real-time factor and the
0:01:36latency of the whole system
0:01:39one of the
0:01:40state of the art
0:01:41lid
0:01:43methods is the i-vector based method
0:01:46the i-vector method is very effective on relatively long utterances
0:01:52recently
0:01:53most researchers focus on neural network based approaches
0:01:58because lid is a classification task
0:02:02therefore the neural network model can be directly used for classification
0:02:10the end-to-end models improved the performance
0:02:13of the short utterance lid task
0:02:18the x-vector was originally proposed for the speaker verification task
0:02:23and in recent studies it was also successfully used for the lid task
0:02:28in this work
0:02:29we focus on the x-vector based
0:02:32method
0:02:36the x-vector is a neural network based embedding representation
0:02:41note that it has been applied to many tasks
0:02:45such as speaker recognition and also language identification
0:02:51the network for extracting the x-vector
0:02:55consists of three modules
0:02:59a frame level feature extractor
0:03:02statistics pooling
0:03:05and utterance
0:03:08level representation layers
0:03:11the frame level feature extractor module
0:03:15outputs frame level
0:03:17features
0:03:18given an input sequence of acoustic features
0:03:24for these layers
0:03:26a time delay neural network
0:03:29or a convolutional neural network is used
0:03:34then
0:03:35a pooling layer
0:03:39converts the frame level outputs
0:03:42that is the frame level features into a fixed dimensional vector by using the mean and
0:03:50the
0:03:50standard deviation
0:03:53finally
0:03:55fully connected layers are used to process the utterance level representations
0:04:03and a final softmax layer is used whose outputs correspond to
0:04:09use you have
0:04:11and the map i
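The pooling step described above can be sketched in a few lines. This is a minimal illustration, assuming the frame level features are plain lists of floats; the function name `statistics_pooling` is hypothetical and is not taken from the talk.

```python
import math

def statistics_pooling(frames):
    """Map a variable-length list of frame-level feature vectors to one
    fixed-length vector: per-dimension means followed by per-dimension
    standard deviations (illustrative sketch, not the speaker's code)."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation over the time axis.
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return mean + std

# Two inputs of different length give vectors of the same size.
short = [[1.0, 2.0], [3.0, 6.0]]
longer = short * 50
print(len(statistics_pooling(short)), len(statistics_pooling(longer)))  # 4 4
```

Whatever the number of frames, the output has twice the feature dimension, which is what lets the later fully connected layers work on utterances of any duration.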
0:04:16the x-vector is mostly used for the speaker verification task
0:04:21in the verification task
0:04:23the x-vector network works as the
0:04:27front end
0:04:28that is it is used to extract the x-vector representation
0:04:33in the back end
0:04:34a scoring method such as cosine similarity can be used
0:04:40for the lid task
0:04:41the front end and back end approach can also be used
0:04:46in the back end logistic regression is more widely used for the
0:04:52classification task
0:04:54for closed set
0:04:56lid tasks
0:04:57we can also directly use the network outputs for classification
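The two ways of using the network mentioned above can be contrasted with a small sketch: cosine similarity between two extracted vectors (back end scoring), and picking the language with the largest network output (direct classification). The function names and the language labels are hypothetical and only serve the illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def classify_from_outputs(outputs, labels):
    """Return the label whose network output is the largest."""
    if len(outputs) != len(labels):
        raise ValueError("one output per label is required")
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[best]

# Hypothetical softmax outputs for three hypothetical language labels.
print(classify_from_outputs([0.1, 0.7, 0.2], ["en", "ja", "zh"]))
```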
0:05:05this work
0:05:06focuses on the short utterance lid task
0:05:10normally
0:05:11when the testing utterances become shorter
0:05:14the performance also decreases
0:05:18the degradation is mainly because
0:05:21short duration inputs cause a large variation
0:05:26of the short utterance representations
0:05:29to reduce
0:05:30the variation of short utterances
0:05:34normalization methods using the
0:05:36corresponding long utterances
0:05:39were investigated for i-vectors
0:05:42and neural network based methods
0:05:46it is natural that we can also apply a similar idea to the x-vector extractor
0:05:55therefore
0:05:56in this work we apply a
0:05:58similar idea
0:06:00to
0:06:01improve the lid performance by using the x-vector network
0:06:07the feature
0:06:08compensation
0:06:11is done by reducing the distance between the
0:06:14representations of the long and the short duration
0:06:19inputs
0:06:21there
0:06:22the s
0:06:24is the representation of the short utterance
0:06:27and the other term is the representation of the corresponding long utterance in
0:06:35the x-vector space
0:06:38this equation
0:06:39can be rewritten as this one
0:06:44for training
0:06:46first
0:06:47we train the x-vector network by using long
0:06:53duration inputs
0:06:55then the short inputs are used to train the model with the compensation loss function
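The idea of pulling the short utterance representation toward the long utterance representation can be sketched as a distance term. This is only an illustration under an assumption: the mean squared difference is used as the distance here, and the exact equation on the slide is not recoverable from the transcript. The name `compensation_loss` is hypothetical.

```python
def compensation_loss(short_vec, long_vec):
    """Mean squared difference between the representation of a short
    utterance and that of its corresponding long utterance.
    Assumption: squared Euclidean distance averaged over dimensions."""
    if len(short_vec) != len(long_vec):
        raise ValueError("representations must have the same dimension")
    return sum((s - l) ** 2 for s, l in zip(short_vec, long_vec)) / len(short_vec)

# The loss is zero when both representations agree and grows as they drift apart.
print(compensation_loss([1.0, 2.0], [1.0, 2.0]))
print(compensation_loss([1.0, 2.0], [1.0, 4.0]))
```

In the two step training described above, the long utterance representation would come from the network trained first on long inputs and would act as a fixed target while the short input model is trained.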
0:07:02considering the difference between the long and the short utterances
0:07:08the short utterances
0:07:10contain very limited information
0:07:12therefore to improve the performance on short utterances
0:07:17both high level abstract information and local phonetic information are important
0:07:25we suppose that
0:07:26the variance
0:07:28component of the x-vector can describe the information related to local phonetic
0:07:36information
0:07:37based on this consideration
0:07:40we propose to normalize only the mean
0:07:44component of the x-vector
0:07:47that is
0:07:49toward the representation of the long utterance
0:07:52while
0:07:54we leave the
0:07:55variance
0:07:57components free
0:07:58to keep the
0:07:59frame level phonetic information
0:08:02which provides discriminative features for language identification
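A mean-only variant of the compensation can be sketched as follows. It assumes the pooled vector is laid out as the mean half followed by the deviation half; that layout, and the name `mean_only_compensation_loss`, are assumptions made for the illustration and not details given in the talk.

```python
def mean_only_compensation_loss(short_stats, long_stats):
    """Compare only the mean half of two pooled statistics vectors.
    Assumed layout: [mean_1..mean_D, std_1..std_D]. The deviation half
    is deliberately ignored so it stays free to differ."""
    if len(short_stats) != len(long_stats) or len(short_stats) % 2 != 0:
        raise ValueError("expected two vectors of the same even length")
    half = len(short_stats) // 2
    short_mean = short_stats[:half]
    long_mean = long_stats[:half]
    return sum((s - l) ** 2 for s, l in zip(short_mean, long_mean)) / half

# The large gap in the deviation half does not contribute to the loss.
short_stats = [1.0, 2.0, 5.0, 5.0]
long_stats = [1.0, 4.0, 0.0, 0.0]
print(mean_only_compensation_loss(short_stats, long_stats))
```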
0:08:09the cost of the proposed method is only this term
0:08:14for the representation of the utterance
0:08:18could be obtained by neural network we assume that all those
0:08:23so the intended to pass the last
0:08:28in proposed method one
0:08:32we use an x-vector network
0:08:35to supply the
0:08:37representation
0:08:39and in proposed method two we use a resnet with
0:08:45global average pooling
0:08:48to obtain the representation
0:08:52we evaluated the proposed methods on the nist language recognition evaluation
0:08:59two thousand and seven set
0:09:03for the training data we used
0:09:06clover in this ad
0:09:07and i dunno three five development data
0:09:12for a rainy the to seven
0:09:16and the
0:09:17the telephone data so that i that line
0:09:22for the test set it to be used as a close the standard nice to
0:09:28those
0:09:31the except that has recently that in section that the study is that okay and
0:09:36the
0:09:37this ad
0:09:38we also prepared
0:09:40one one point five and two second test sets
0:09:47one of a trust
0:09:49we used to sixty dimensional all they're pretty bad major
0:09:55and then you covariance and that the existing as the average of was used for
0:10:00evaluation metric
0:10:03for the baseline systems we compared the resnet system and the
0:10:08x-vector system
0:10:10the rest analysis to us
0:10:13so the holy rollers that's
0:10:16network
0:10:17they are probably
0:10:20and that while for the connectivity
0:10:22the a lot of nist or both
0:10:27well the i-vectors is to the thing last night to
0:10:31we use the frame level feature extractor
0:10:37for the training samples
0:10:41the long utterance samples were between five and ten seconds and the short utterance
0:10:49samples were up to two seconds
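One possible way to prepare such paired training samples is sketched below. Several details are assumptions that the transcript does not confirm: a rate of 100 frames per second, a fixed two second short segment, and cutting the short segment from inside the long one so that the two correspond. The names `crop` and `make_training_pair` are hypothetical.

```python
import random

def crop(frames, num_frames, rng):
    """Return a random contiguous segment of num_frames frames."""
    if num_frames > len(frames):
        raise ValueError("utterance is shorter than the requested segment")
    start = rng.randrange(len(frames) - num_frames + 1)
    return frames[start:start + num_frames]

def make_training_pair(frames, frames_per_second=100, rng=None):
    """Cut a long segment of 5 to 10 seconds, then a 2 second segment from
    inside it (assumed setup, so the short sample has a matching long one)."""
    rng = rng or random.Random()
    long_len = rng.randint(5 * frames_per_second, 10 * frames_per_second)
    long_seg = crop(frames, long_len, rng)
    short_seg = crop(long_seg, 2 * frames_per_second, rng)
    return short_seg, long_seg

# Frame indices stand in for real acoustic feature vectors.
utterance = list(range(1500))
short_seg, long_seg = make_training_pair(utterance, rng=random.Random(0))
print(len(short_seg), 500 <= len(long_seg) <= 1000)
```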
0:10:53here we show the results of the baseline systems
0:10:58for comparison
0:10:59we also list the results reported by
0:11:04other researchers
0:11:07as we can see
0:11:09the x-vector system is more effective on long utterances
0:11:17and on short utterances the resnet
0:11:23system obtained the better performance
0:11:27and because of the duration mismatch the model trained with long
0:11:32samples
0:11:34we form the where on the basis of the data but i'm not problem that
0:11:38there shall i
0:11:42next is the investigation of the x-vector with and without the feature compensation method
0:11:48in this table
0:11:49the baseline is the x-vector network trained with the short samples
0:11:56the results of mean and variance
0:12:00compensation learning
0:12:02and the two proposed
0:12:07methods are listed in this table
0:12:10for visualization
0:12:13we give a figure to compare the baseline
0:12:16mean and variance compensation
0:12:19and the proposed method
0:12:22from the results
0:12:24we can see
0:12:26the feature compensation
0:12:28by using both
0:12:30mean and variance
0:12:31only could improve the performance
0:12:35well not all utterances
0:12:38yielding very
0:12:40according to the best results
0:12:44for short utterances
0:12:49compensation by using
0:12:51the mean only
0:12:55significantly improved the performance
0:13:00to conclude
0:13:02in this work
0:13:03we investigated an improvement of the neural network based embedding technique
0:13:10x-vector for the short utterance lid task
0:13:13we compared feature compensation using the mean and variance with compensation using the mean only
0:13:20think this the last
0:13:22the proposed method applies the compensation to the mean only
0:13:26it is expected to capture high-level
0:13:30abstract language information
0:13:32while leaving the
0:13:34variance components free because they are important for short utterances
0:13:40the results show that the proposed method is more effective on the short utterance lid
0:13:47task
0:13:51thanks for your attention