0:00:15 | so first, thank you very much to the odyssey conference for giving us the chance |
---|
0:00:20 | to present our language recognition system. my name's raymond; we're from the university of sheffield |
---|
0:00:26 | and the chinese university of hong kong |
---|
0:00:30 | so the design of a language recognition system is pretty |
---|
0:00:35 | fundamental and standard |
---|
0:00:37 | so the motivation of the paper and the talk today will be basically to |
---|
0:00:43 | go through the key points of the core systems, and also the system |
---|
0:00:47 | fusion as well as the calibration |
---|
0:00:53 | a bit of background: language recognition is about recognising the language from a speech segment |
---|
0:00:57 | if we go through the classical methods of language recognition, we can see researchers |
---|
0:01:03 | using acoustic or phonotactic features working on that |
---|
0:01:07 | and then there are shifted delta cepstral features, which take a longer temporal span of |
---|
0:01:11 | the signal, which helps language recognition. more recently, i-vectors, dnns and |
---|
0:01:18 | the combination of these methods proved to be useful in language recognition |
---|
0:01:23 | for ourselves, we submitted a combination of three systems to the |
---|
0:01:30 | nist language recognition evaluation last year |
---|
0:01:32 | the first one is a standard i-vector system, then we have a phonotactic system, and the |
---|
0:01:36 | third one is a frame-based dnn system. after the evaluation we |
---|
0:01:41 | got a little bit of enhancement combining the bottleneck features and i-vectors; we'll go through the details |
---|
0:01:45 | of that later |
---|
0:01:47 | so this is just a brief recap on the training data and also the |
---|
0:01:51 | target languages. we have the switchboard data used as telephone speech training data, and also some |
---|
0:01:56 | multilingual |
---|
0:01:59 | lre training data from past evaluations |
---|
0:02:03 | and the training set of this year's evaluation |
---|
0:02:04 | so there are twenty languages in the language recognition evaluation, and they fall into six language clusters, and |
---|
0:02:09 | the task of language recognition is to identify languages within the clusters, where the languages |
---|
0:02:15 | are closely related |
---|
0:02:18 | the training data of the language recognition evaluation comes as a raw set of files of |
---|
0:02:23 | about seven hundred to eight hundred hours. to start with the training we run some |
---|
0:02:28 | voice activity detection. to train our voice activity detector we use the conversational |
---|
0:02:34 | telephone speech data: |
---|
0:02:37 | taking our switchboard phone tokenizer model, we |
---|
0:02:40 | run |
---|
0:02:41 | a forced alignment on the data, and then we just treat the silence labels as non- |
---|
0:02:46 | speech and the non-silence labels as speech. we also take some of the past lre training |
---|
0:02:52 | data from voice of america broadcast speech to train the voice activity detector for |
---|
0:02:57 | that channel |
---|
0:02:58 | for this data we just take the raw speech/non-speech labels |
---|
0:03:02 | the amounts of voiced and unvoiced speech in the different corpora are shown in the table |
---|
0:03:11 | we train a two-layer dnn for the vad. so this |
---|
0:03:15 | is a standard dnn |
---|
0:03:18 | which we train on filter bank features, with feature splicing of fifteen |
---|
0:03:23 | frames on the left and the right |
---|
0:03:25 | the output of the dnn is |
---|
0:03:28 | two neurons, which give the voiced and unvoiced posterior probabilities |
---|
0:03:33 | we have sequence alignment using a two-state hmm, enforcing a minimum duration of twenty |
---|
0:03:38 | frames for voiced and unvoiced. on top of that we have a heuristic to bridge |
---|
0:03:43 | the non-speech gaps which are shorter than two seconds |
---|
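A minimal sketch of the vad post-processing just described, assuming frame-level speech/non-speech decisions at 100 frames per second; the two-state hmm minimum-duration decoding is approximated here by a simple run-merging pass, so this is a reconstruction, not the authors' code:

```python
# Hypothetical post-processing for a frame-level VAD: merge runs shorter than
# a minimum duration, then bridge short non-speech gaps between speech runs.
from itertools import groupby

MIN_DUR = 20    # minimum run length in frames (0.2 s at 100 frames/s)
MAX_GAP = 200   # bridge non-speech gaps shorter than 2 s

def smooth_vad(labels):
    """labels: per-frame 1 (speech) / 0 (non-speech) decisions."""
    runs = [[lab, len(list(g))] for lab, g in groupby(labels)]
    # enforce the minimum duration by absorbing short runs into the left neighbour
    merged = []
    for lab, n in runs:
        if n < MIN_DUR and merged:
            merged[-1][1] += n
        else:
            merged.append([lab, n])
    # bridge short non-speech gaps that sit between two speech runs
    for i in range(1, len(merged) - 1):
        lab, n = merged[i]
        if lab == 0 and n < MAX_GAP and merged[i-1][0] == merged[i+1][0] == 1:
            merged[i][0] = 1
    return [lab for lab, n in merged for _ in range(n)]
```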
0:03:47 | for the results: |
---|
0:03:49 | on the switchboard test data we have miss and false alarm rates of around |
---|
0:03:53 | two percent |
---|
0:03:55 | but for the voa data, the broadcast |
---|
0:04:00 | data, the error rates are much higher, so we did an aural inspection of the data |
---|
0:04:05 | and |
---|
0:04:07 | we believe it's down to the inaccuracy of the reference labels. so we kept this first |
---|
0:04:10 | vad system and continued to build our language recognition system |
---|
0:04:14 | now |
---|
0:04:15 | we established |
---|
0:04:17 | and refined the training set in the course of the system development. these are the |
---|
0:04:23 | two corpora that we use, v1 and v3 |
---|
0:04:26 | the v1 data is an early version of the training data. we directly |
---|
0:04:30 | take the vad results |
---|
0:04:33 | and then extract |
---|
0:04:34 | the whole segments whose durations lie between twenty and forty-five seconds, and then we |
---|
0:04:39 | train specifically for the thirty-second condition. so in the development |
---|
0:04:46 | we |
---|
0:04:47 | from the very beginning divided the test and training data into three-second, ten-second and thirty-second |
---|
0:04:53 | durations; we are not sure whether this was the right decision or not |
---|
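As an illustration of the v1 set construction just described, a sketch (our assumptions about the data structures, not the authors' tooling) of selecting 20-45 s segments and cutting duration-conditioned chunks:

```python
# Hypothetical slicing of VAD segments (start, end, in seconds) into
# duration-conditioned training examples for the 30 s / 10 s / 3 s conditions.

def select_v1(segments, lo=20.0, hi=45.0):
    """v1-style selection: keep whole segments of 20-45 s for the 30 s condition."""
    return [(s, e) for s, e in segments if lo <= e - s <= hi]

def cut_condition(segments, target):
    """Cut each segment into consecutive chunks of `target` seconds."""
    chunks = []
    for s, e in segments:
        t = s
        while t + target <= e:
            chunks.append((t, t + target))
            t += target
    return chunks

# usage: thirty = select_v1(vad_segments); three = cut_condition(vad_segments, 3.0)
```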
0:04:57 | for the |
---|
0:04:58 | v3 data, we |
---|
0:05:00 | actually ran a different tokenizer all over again on the whole training set of the data |
---|
0:05:05 | and with that we redid the v1 segmentation, so that we have shorter |
---|
0:05:09 | segments for decoding, to speed up the decoding process in |
---|
0:05:14 | the first round |
---|
0:05:15 | then we ran re-segmentation with different silence thresholds |
---|
0:05:19 | and we derived three |
---|
0:05:21 | training sets matching the nominal evaluation durations of thirty seconds, ten seconds and three seconds |
---|
0:05:26 | so these are three distinct sets with a little bit of overlap |
---|
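The re-segmentation step could look like the following sketch, where speech chunks are merged across silences up to a threshold until a target duration is reached; the exact mechanism is not given in the talk, so this is purely illustrative:

```python
def resegment(chunks, target, sil_thresh):
    """chunks: sorted (start, end) speech intervals in seconds.
    Merge chunks across silences <= sil_thresh until `target` s is reached."""
    segments, cur = [], None
    for s, e in chunks:
        if cur is None:
            cur = [s, e]
        elif s - cur[1] <= sil_thresh:
            cur[1] = e                      # bridge the short silence
        else:
            segments.append(tuple(cur))     # long silence: close the segment
            cur = [s, e]
        if cur is not None and cur[1] - cur[0] >= target:
            segments.append(tuple(cur))     # reached the target duration
            cur = None
    if cur is not None:
        segments.append(tuple(cur))
    return segments

# e.g. thirty = resegment(chunks, 30.0, 1.0); three = resegment(chunks, 3.0, 0.3)
```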
0:05:30 | for the data partitions, for each of the sets we have |
---|
0:05:33 | eighty percent of the data for training, ten percent for development, and we're going |
---|
0:05:37 | to report the internal test results in the early parts of the experiments on the remaining ten |
---|
0:05:41 | percent internal test set |
---|
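One plausible way to realise the 80/10/10 partition deterministically (the talk does not specify the mechanism, so this split function is an assumption):

```python
import hashlib

def partition(recording_id):
    """Deterministic 80/10/10 split at the recording level, so that segments
    from one recording never straddle the train/dev/test boundary."""
    h = int(hashlib.md5(recording_id.encode()).hexdigest(), 16) % 10
    return "train" if h < 8 else "dev" if h == 8 else "test"
```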
0:05:45 | so this is the system diagram for our language recognition system. on the left |
---|
0:05:50 | you can see the i-vector system, and there is the phonotactic system. the phonotactic system |
---|
0:05:55 | generates bottleneck features to feed into |
---|
0:05:57 | the dnn system, which is the frame-based language recognition system |
---|
0:06:03 | the i-vector system follows the standard kaldi recipe: for |
---|
0:06:11 | the features, shifted delta cepstra with mean normalization, and also frame-based vad |
---|
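Shifted delta cepstra stack deltas taken at several shifted offsets into one long feature vector; a minimal sketch with the common 7-1-3-7 configuration (the talk does not state the parameters, so N=7, d=1, P=3, k=7 is an assumption):

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame t, stack the deltas
    c[t+i*P+d] - c[t+i*P-d] for i = 0..k-1, giving N*k dims per frame.
    cep: (T, >=N) array of cepstral coefficients."""
    T = cep.shape[0]
    # edge-pad so the shifted deltas are defined at the utterance boundaries
    pad = np.pad(cep[:, :N], ((d, (k - 1) * P + d), (0, 0)), mode="edge")
    feats = []
    for i in range(k):
        off = i * P
        feats.append(pad[off + 2 * d : off + 2 * d + T] - pad[off : off + T])
    return np.hstack(feats)   # shape (T, N*k)
```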
0:06:17 | to start with, we trained a two-thousand-and-forty-eight-component ubm and the total variability matrix to |
---|
0:06:22 | extract six-hundred-dimensional i-vectors. we tried two language classifiers, a support vector machine |
---|
0:06:28 | and logistic regression, and the focus of the study here is to |
---|
0:06:33 | compare the use of |
---|
0:06:35 | different datasets in the training of the ubm, the total variability matrix |
---|
0:06:39 | and the language classifier, and also the comparison of global and cluster-dependent classifiers |
---|
0:06:47 | by a global classifier i mean a classifier which |
---|
0:06:51 | classifies all the twenty languages in one go |
---|
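A sketch contrasting the global and cluster-dependent classifiers on i-vectors, with scikit-learn as a stand-in for the actual toolkit (which the talk does not name):

```python
# Hypothetical global vs cluster-dependent language classifiers.
# X: (n, 600) array of i-vectors; y: list of language labels;
# cluster_of: dict mapping each language to its cluster.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_global(X, y):
    """One classifier over all twenty target languages in one go."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def train_cluster_dependent(X, y, cluster_of):
    """One classifier per cluster, trained only on that cluster's languages."""
    models = {}
    for c in {cluster_of[lang] for lang in y}:
        idx = np.array([i for i, lang in enumerate(y) if cluster_of[lang] == c])
        models[c] = LogisticRegression(max_iter=1000).fit(X[idx], np.array(y)[idx])
    return models
```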
0:06:54 | so we have four configurations here, starting with condition a. from a to condition |
---|
0:06:59 | b, we increase the amount of data for the ubm and total variability matrix training; |
---|
0:07:03 | from b to c |
---|
0:07:05 | we replace the svm with a logistic regression classifier; and from c to d we further increase |
---|
0:07:11 | the amount of training data for the logistic regression classifier |
---|
0:07:15 | and the bar chart on the right shows the |
---|
0:07:19 | minimum average cost, the min cavg score, for the different configurations of the |
---|
0:07:24 | i-vector system, and the results are reported on the internal test v1 data |
---|
0:07:28 | which has |
---|
0:07:29 | thirty-second durations |
---|
0:07:32 | when we look at the two red bars |
---|
0:07:36 | here in the middle, then we can see |
---|
0:07:39 | the comparison between using a smaller and a larger amount of training data for the ubm; |
---|
0:07:46 | it gives some improvement there |
---|
0:07:48 | and we also see some difference |
---|
0:07:52 | between having a global classifier and within-cluster classifiers. we did not manage |
---|
0:07:57 | to try all the combinations listed here, just because of the time constraint |
---|
0:08:02 | but for this set of experiments |
---|
0:08:05 | what we conclude is that we tend to use |
---|
0:08:09 | the full set of raw training data, then segmented, for the training of the ubm |
---|
0:08:14 | and the total variability matrix, and also that within-cluster classifiers outperform the global |
---|
0:08:19 | classifiers |
---|
0:08:20 | and then as our training progressed, we moved to the v3 data |
---|
0:08:26 | we have similar conclusions as i just mentioned, and then we tried |
---|
0:08:31 | to use different amounts of training data for the logistic regression classifier, as shown by |
---|
0:08:36 | the three red bars here |
---|
0:08:39 | basically, the left bar here uses a small amount of training data, only one |
---|
0:08:43 | hundred hours |
---|
0:08:44 | then we use three hundred hours of data |
---|
0:08:48 | and for the third one we use the raw set of data, which comprises about |
---|
0:08:52 | eight hundred hours. so this shows |
---|
0:08:55 | a trade-off between using more data and whether the data are well structured |
---|
0:09:00 | and segmented or not, and we ended up using three hundred hours |
---|
0:09:04 | of segmented data to train the logistic regression classifier |
---|
0:09:09 | the two red bars on the far left and right are about the |
---|
0:09:15 | use of the svm versus the use of |
---|
0:09:19 | logistic regression in language recognition. again this shows the |
---|
0:09:25 | improvement |
---|
0:09:26 | from using the logistic regression classifier |
---|
0:09:31 | then that comes to our second system, the phonotactic language recognition system |
---|
0:09:37 | there are two components in the phonotactic system: first a phone tokenizer, and second |
---|
0:09:43 | the language classifier. the phone tokenizer is based on the standard kaldi setup: we have |
---|
0:09:49 | lda, mllt and sat speaker adaptation |
---|
0:09:52 | then on top of that a dnn with six layers, where each layer contains around |
---|
0:09:57 | two thousand neurons |
---|
0:09:59 | we used a phone bigram language model with a very low grammar scale factor of |
---|
0:10:04 | zero point five. we also tried a higher scale factor of two, and |
---|
0:10:07 | it |
---|
0:10:08 | turns out the low factor |
---|
0:10:09 | gives better results on our internal test sets |
---|
0:10:12 | optionally, we tried to run sequence training on the switchboard training data, but bear |
---|
0:10:19 | in mind this is english training data, so we were not sure whether |
---|
0:10:22 | discriminative training would give an over-trained neural network; we'll see in the results |
---|
0:10:28 | for the language classifier, we designed svm classifiers |
---|
0:10:31 | which are trained on the tf-idf statistics of the phone n-grams, which we tried from |
---|
0:10:38 | bigrams and from trigrams. the reason we back off to bigrams is that we trained |
---|
0:10:43 | on |
---|
0:10:45 | position-dependent phones, and we ended up with |
---|
0:10:48 | roughly five million dimensions of trigram statistics; we |
---|
0:10:51 | worry that there may be sparsity issues |
---|
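A minimal sketch of the phonotactic classifier just described: tf-idf statistics over phone n-grams fed to a linear svm (scikit-learn is used here as a stand-in; the authors' actual toolkit and settings are not given):

```python
# Hypothetical phonotactic classifier: each utterance is represented by the
# tf-idf statistics of its phone bigrams, classified with a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# phone_strings: decoded phone sequences, e.g. "sil ay m s eh d ..."
# langs: the target language label of each utterance
def train_phonotactic(phone_strings, langs):
    model = make_pipeline(
        TfidfVectorizer(analyzer="word", ngram_range=(2, 2)),  # phone bigrams
        LinearSVC(),
    )
    return model.fit(phone_strings, langs)
```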
0:10:55 | so this is the performance on the internal test sets |
---|
0:10:59 | with the different setups |
---|
0:11:00 | as we see, the trigram setup gives better performance in terms of a lower min |
---|
0:11:06 | cavg score. this is valid for the thirty-second data, but you may see |
---|
0:11:10 | in a while that it may break when it comes to very short duration segments |
---|
0:11:17 | the purple bars are the results with the discriminatively trained dnn phone tokenizers. again |
---|
0:11:23 | this shows that we have an over-trained dnn here, and it |
---|
0:11:27 | gives a higher word error rate |
---|
0:11:29 | sorry, a higher min cavg score, i mean |
---|
0:11:36 | the third system is the frame-based dnn system for language recognition |
---|
0:11:42 | we take the sixty-four-dimensional bottleneck features from the switchboard tokenizer |
---|
0:11:47 | and there is feature splicing with four frames on the left and four frames |
---|
0:11:51 | on the right |
---|
0:11:52 | the dnn is a four-layer dnn with seven hundred neurons |
---|
0:11:58 | we have a prior normalization, where |
---|
0:12:02 | we multiply the posterior probability with the inverse of the language prior, and the decision |
---|
0:12:07 | of the language recognition system comes by averaging the frame-based language posterior probabilities |
---|
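A sketch of the prior normalization and frame averaging just described; whether the averaging is done in the linear or log domain is not stated, so linear-domain averaging here is an assumption:

```python
import numpy as np

def utterance_scores(frame_post, prior):
    """frame_post: (T, L) per-frame language posteriors from the DNN.
    prior: (L,) language prior implied by the training data.
    Divide out the prior, renormalize per frame, then average over frames."""
    p = frame_post / prior                 # multiply by the inverse prior
    p /= p.sum(axis=1, keepdims=True)      # renormalize each frame
    return p.mean(axis=0)                  # utterance-level score per language

def decide(frame_post, prior):
    return int(np.argmax(utterance_scores(frame_post, prior)))
```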
0:12:17 | so this is |
---|
0:12:18 | a summary of the frame-based language recognition system on the different test sets |
---|
0:12:26 | there are two trends we observe. the first, very obviously, is that when the duration is shorter, the |
---|
0:12:33 | min cavg score is higher. the second is that generally |
---|
0:12:38 | the error here is higher than for the phonotactic system and the i-vector system, but |
---|
0:12:45 | it becomes more robust when it comes to the very short duration |
---|
0:12:49 | condition |
---|
0:12:51 | so after the evaluation we have an enhanced system, which we call the bottleneck |
---|
0:12:55 | i-vector system; it is also a basic system |
---|
0:12:59 | we take the |
---|
0:13:00 | bottleneck features from the switchboard tokenizer, replace the mfcc in the i-vector system with the |
---|
0:13:06 | bottleneck features, and build another system for language recognition |
---|
0:13:11 | a bit of the details: |
---|
0:13:13 | we take the sixty-four-dimensional bottleneck features |
---|
0:13:16 | there is no vtln and no normalization or shifted delta cepstra, but there is frame- |
---|
0:13:22 | based vad here |
---|
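For illustration, bottleneck features are simply the activations of a narrow hidden layer of the phone tokenizer network; a sketch with pytorch as a stand-in (the authors' setup is kaldi-based, and the layer widths here are assumptions apart from the 64-dim bottleneck):

```python
import torch.nn as nn

# Hypothetical tokenizer DNN with a narrow (64-unit) bottleneck layer.
class BottleneckDNN(nn.Module):
    def __init__(self, in_dim, n_phones, width=2000, bn_dim=64):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(in_dim, width), nn.Sigmoid(),
            nn.Linear(width, width), nn.Sigmoid(),
            nn.Linear(width, bn_dim),            # the bottleneck layer
        )
        self.back = nn.Sequential(nn.Sigmoid(), nn.Linear(bn_dim, n_phones))

    def forward(self, x):
        return self.back(self.front(x))

    def bottleneck(self, x):
        # the 64-dim activations reused as features for the i-vector system
        return self.front(x)
```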
0:13:25 | so this is a side-by-side comparison between the i-vector system and the bottleneck |
---|
0:13:30 | system, where the mfcc features are replaced by the bottleneck features |
---|
0:13:33 | we can see roughly a relative improvement of fifteen to twenty-five percent from replacing |
---|
0:13:40 | the mfcc with the bottleneck features |
---|
0:13:45 | for system calibration and fusion, we train target-language-dependent gaussian backends |
---|
0:13:53 | and the gaussian |
---|
0:13:54 | has, for each language, sixteen components; these are trained on the training data |
---|
0:13:59 | of the thirty-second condition |
---|
0:14:02 | then for the system fusion we run logistic regression |
---|
0:14:06 | that comprises the log-likelihood ratio conversion and the system combination, |
---|
0:14:12 | that is, the calibration |
---|
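A sketch of the calibration and fusion backend: per-language gaussian mixtures over the score vectors, followed by multiclass logistic regression over the concatenated calibrated scores of all systems (our reconstruction with scikit-learn; dedicated calibration toolkits are typically used for this):

```python
# Hypothetical calibration/fusion: per-language GMM backends turn raw system
# score vectors into log-likelihoods; multiclass logistic regression then
# fuses the calibrated scores of several systems.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_backend(scores, labels, langs, n_comp=16):
    """scores: (n, d) array; labels: (n,) array. One GMM per target language."""
    return {l: GaussianMixture(n_comp).fit(scores[labels == l]) for l in langs}

def backend_llks(backend, scores):
    """(n, L) matrix of per-language log-likelihoods."""
    return np.column_stack([backend[l].score_samples(scores)
                            for l in sorted(backend)])

def train_fusion(per_system_llks, labels):
    """Concatenate the calibrated scores of all systems and fuse."""
    X = np.hstack(per_system_llks)
    return LogisticRegression(max_iter=1000).fit(X, labels)
```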
0:14:15 | so we applied that separately to the three systems: the i-vector system, the dnn system |
---|
0:14:20 | and the phonotactic system. we found that |
---|
0:14:22 | the |
---|
0:14:24 | gaussian backend did not work for the i-vector system, so we do not use |
---|
0:14:27 | that in the |
---|
0:14:30 | final evaluation |
---|
0:14:31 | and then for the dnn and the phonotactic systems, the technique gives a |
---|
0:14:34 | significant improvement |
---|
0:14:38 | and this is the fusion result on our internal test set. so |
---|
0:14:42 | for the thirty-second data |
---|
0:14:44 | the i-vector system gives |
---|
0:14:46 | the best results among the three |
---|
0:14:49 | submission systems |
---|
0:14:50 | and |
---|
0:14:51 | the dnn and the phonotactic systems have roughly the same performance |
---|
0:14:56 | system fusion gives some performance improvement, actually a noticeable improvement on the internal test sets |
---|
0:15:03 | we have |
---|
0:15:04 | and the bottleneck system alone did not give better results, but when we incorporate the |
---|
0:15:08 | four systems, then those are the best results we have |
---|
0:15:12 | when it comes down to three seconds, as i've said, the phonotactic system |
---|
0:15:20 | behaves much worse here |
---|
0:15:22 | that may be because of the sparsity issues of the particular setup of our |
---|
0:15:27 | n-gram statistics |
---|
0:15:28 | and |
---|
0:15:30 | when |
---|
0:15:31 | we compare the i-vector system and the bottleneck system, then we see a significant improvement for |
---|
0:15:36 | the bottleneck system, and a further improvement with the fusion |
---|
0:15:41 | then here we show the results on the formal evaluation |
---|
0:15:46 | dataset |
---|
0:15:48 | the i-vector system |
---|
0:15:50 | the phonotactic system and the dnn system perform |
---|
0:15:54 | roughly as expected |
---|
0:15:55 | and the bottleneck system again has |
---|
0:15:59 | more than ten percent relative improvement on top of the i-vector system |
---|
0:16:03 | and the system fusion |
---|
0:16:05 | gives a marginal improvement |
---|
0:16:07 | on top of the best system here |
---|
0:16:10 | then finally i'm going to show a table about the pairwise system contributions |
---|
0:16:16 | to see each system's contribution to the combined system in our language recognition setup |
---|
0:16:22 | so now you see clusters of bars here. for each cluster, on the very left |
---|
0:16:27 | bar we have a single system |
---|
0:16:29 | and then with this single system, for example here the bottleneck i-vector system |
---|
0:16:34 | we make a fusion of this system with one other system, and the |
---|
0:16:39 | order is that we take the worst system to fuse with |
---|
0:16:43 | and then we take the second worst, and so on |
---|
0:16:46 | so the interesting thing here is that, generally, apart from fusion with |
---|
0:16:51 | the dnn system, which is the worst system |
---|
0:16:53 | pairwise fusion works in every case |
---|
0:16:58 | maybe you can argue we may be in a different operating region of the |
---|
0:17:02 | error curve, and that |
---|
0:17:04 | may be why it ceases to work there |
---|
0:17:07 | and then another interesting thing is that the |
---|
0:17:11 | performance of the fused system is basically in proportion to the performance of the single systems |
---|
0:17:16 | which means that when we fuse with a better system, then we get better |
---|
0:17:19 | results here |
---|
0:17:22 | so as a summary, we introduced the three language recognition component systems submitted to |
---|
0:17:28 | the nist lre two thousand fifteen, and described the segmentation, data selection |
---|
0:17:35 | and classifier training. we then have an enhanced bottleneck i-vector system |
---|
0:17:40 | which demonstrates a performance improvement. for the future work, we want to work a bit |
---|
0:17:46 | on the data selection and augmentation, as other teams did |
---|
0:17:50 | and also we are interested in multilingual neural networks and the adaptation of them, |
---|
0:17:55 | maybe some unsupervised training on them as well, to improve the bottleneck features; also |
---|
0:18:01 | some variability compensation to deal with the huge mismatch between the training and development datasets and |
---|
0:18:07 | the evaluation dataset |
---|
0:18:08 | any suggestions or maybe collaborations are all welcome. thank you very much for your attention |
---|
0:18:20 | do we have any questions? |
---|
0:18:34 | thanks for the talk. when you're talking about the language clusters |
---|
0:18:41 | are the clusters defined according to some linguists? yes? |
---|
0:18:48 | in our own small experiments |
---|
0:18:52 | the linguistic clusters were based on the |
---|
0:18:57 | definitions of linguists rather than on the structure of the data |
---|
0:19:03 | but what i am asking is |
---|
0:19:06 | if you derive the clusters from the features |
---|
0:19:10 | would the gain be bigger |
---|
0:19:13 | when compared to the results with clusters that are made by linguists? |
---|
0:19:19 | did you try clustering the languages from the data? |
---|
0:19:24 | yes, i think that's a scientific question, an interesting question. we follow the language clusters basically |
---|
0:19:30 | by a narrow definition, exactly following what the nist language recognition evaluation told |
---|
0:19:36 | us to use. and you're absolutely right, there are some cases where the training |
---|
0:19:41 | would |
---|
0:19:43 | just become a distinction between |
---|
0:19:47 | even dialects, or other unwanted factors which are not directly related to language clusters |
---|
0:19:54 | at all. so yes, definitely this is something we want to look at, particularly |
---|
0:19:57 | for some dialects we're interested in, for example the chinese dialects; we're interested and we want |
---|
0:20:02 | to do more there |
---|
0:20:06 | any other questions? |
---|
0:20:11 | i have one quick question. so in an lre, most teams, when doing a score, |
---|
0:20:17 | most would typically go with sixty percent for training; maybe going to seventy percent uses a |
---|
0:20:22 | little bit more, or you want to go to eighty percent. so my question is, once you did |
---|
0:20:28 | your development, when you actually submitted |
---|
0:20:31 | the final results, did you do a full retrain with all the data, or did |
---|
0:20:35 | you just stick with the original eighty-percent-trained system that you had? |
---|
0:20:38 | we trained with the original system with eighty percent, and we now doubt whether this |
---|
0:20:43 | should be the case. and then we also handicapped ourselves a little bit because |
---|
0:20:49 | even in the very early stage we |
---|
0:20:51 | divided the data into three-second, ten-second and thirty-second conditions, and that again |
---|
0:20:56 | reduced the amount of training data; that's a decision we now question, as we |
---|
0:21:01 | tried to use eighty percent and seventy percent |
---|
0:21:05 | if there are more suggestions on |
---|
0:21:07 | how to use the data, i think we can work a bit on the data segmentation and |
---|
0:21:13 | selection part |
---|
0:21:16 | are there any other questions? |
---|
0:21:20 | if not, let's thank the speaker again |
---|