0:00:14 | you know all that from you the |
---|
0:00:18 | i will be presenting what we did for lre fifteen, and probably |
---|
0:00:23 | a great part of you have already seen most of this presentation |
---|
0:00:27 | at the workshop |
---|
0:00:29 | we have changed a few things, corrected some errors |
---|
0:00:32 | and i will give you the presentation again |
---|
0:00:36 | so |
---|
0:00:38 | well, let's see, as john already said it was a collaboration between |
---|
0:00:43 | brno and agnitio, and only technically the third one, you know |
---|
0:00:46 | i included almost the full list of people who participated in |
---|
0:00:51 | our team. it was a lot of concentrated fun during the autumn and we really |
---|
0:00:58 | enjoyed that |
---|
0:01:00 | so |
---|
0:01:02 | let's go straight to the system that we used. we decided |
---|
0:01:07 | to participate in both nist conditions, the fixed data condition and open data condition |
---|
0:01:12 | in the fixed data condition we joined some efforts with mit and they provided |
---|
0:01:20 | some definitions of the |
---|
0:01:21 | of the development set and the short cuts, so we split all of the data we |
---|
0:01:26 | had available for |
---|
0:01:27 | training and dev: we kept sixty percent for training and forty percent for dev |
---|
0:01:33 | and we also generated some short cuts out of the long segments, with durations |
---|
0:01:38 | uniformly distributed from three to thirty seconds, because that's what we were |
---|
0:01:44 | expecting in the eval data according to the evaluation plan |
---|
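[editorial note] the 60/40 split and the uniform 3-30 second cuts just described can be sketched roughly as below; the helper, file names and seed are made-up illustrations, not the actual tooling used by the team.

```python
# Hypothetical sketch of the data preparation described above: keep 60 % of
# the files for training, 40 % for dev, and draw one target cut duration per
# dev segment uniformly from 3-30 seconds.
import random

def split_and_cut(files, train_frac=0.6, min_s=3.0, max_s=30.0, seed=0):
    rng = random.Random(seed)
    shuffled = files[:]
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    train, dev = shuffled[:n_train], shuffled[n_train:]
    # one uniformly distributed target duration per dev segment
    cuts = {f: rng.uniform(min_s, max_s) for f in dev}
    return train, dev, cuts

train, dev, cuts = split_and_cut([f"utt{i:03d}" for i in range(100)])
```

the real split was of course done over many corpora at once; the point is only the 60/40 ratio and the uniform duration sampling.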
0:01:47 | for the open training data condition |
---|
0:01:51 | we tried to harvest all of the data from the hard drives that we could find |
---|
0:01:56 | we also asked our friends |
---|
0:01:58 | from here, from bilbao, to provide some other databases, and also najim from mit, so |
---|
0:02:04 | these are databases that you might not be using in your systems regularly, for example |
---|
0:02:11 | we took european spanish and british english |
---|
0:02:14 | and from the al jazeera free speech corpus we took some arabic dialects; otherwise it was |
---|
0:02:22 | just all the data that we harvested for nist lre o nine from the radios, |
---|
0:02:28 | from the voice of america and so on. just to let you know, we didn't |
---|
0:02:32 | use any babel data |
---|
0:02:34 | for the classifier training; we just used the babel data to train some |
---|
0:02:41 | bottleneck feature extractors, i will speak about it later |
---|
0:02:47 | bottleneck features, that's really the core of our system, so |
---|
0:02:52 | i think that most of you are already familiar with this architecture: we train a |
---|
0:02:56 | neural network to classify phoneme states, it's just a little bit special in its architecture because |
---|
0:03:04 | it is a stacked bottleneck so |
---|
0:03:06 | the structure is here on the picture |
---|
0:03:08 | the stacked means that |
---|
0:03:10 | we first train the classical network to classify the phoneme states, then we cut it at |
---|
0:03:15 | the bottleneck |
---|
0:03:16 | and then stack these bottlenecks in time and train again |
---|
0:03:20 | so we train another stage and we take the bottlenecks |
---|
0:03:24 | from the second stage, from the second network, so that's why the stacked bottlenecks |
---|
0:03:30 | the effect is that |
---|
0:03:31 | in the end they see a longer context and |
---|
0:03:35 | from our experience they work pretty well, but if you do |
---|
0:03:39 | some tuning you can |
---|
0:03:42 | just use the first bottlenecks, it's enough, especially for speaker id i'd say |
---|
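[editorial note] the "stacking" step between the two networks can be sketched as follows; the concrete time offsets here are illustrative assumptions, not necessarily the configuration used in this system.

```python
# Minimal sketch of stacked bottlenecks: take the first-stage bottleneck
# outputs and concatenate them at several time offsets to form the input
# of the second-stage network (offsets clamped at utterance edges).
import numpy as np

def stack_bottlenecks(bn, offsets=(-10, -5, 0, 5, 10)):
    """bn: (T, D) first-stage bottleneck frames -> (T, D * len(offsets))."""
    T = bn.shape[0]
    idx = np.arange(T)
    cols = [bn[np.clip(idx + o, 0, T - 1)] for o in offsets]  # clamp at edges
    return np.hstack(cols)

bn = np.random.RandomState(0).randn(200, 80)   # 200 frames, 80-dim bottleneck
stacked = stack_bottlenecks(bn)                # input for the second network
```

the second network is then trained on `stacked` against the same phoneme-state targets, and its own bottleneck layer gives the final features.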
0:03:49 | so for the fixed training condition apparently we had to use switchboard, and the network |
---|
0:03:54 | had approximately seven thousand triphone states in total |
---|
0:03:58 | and we were trying a new technique with automatic acoustic unit discovery |
---|
0:04:06 | and we trained the bottleneck on these units, and for that we used lre fifteen data |
---|
0:04:11 | for the open training |
---|
0:04:13 | condition |
---|
0:04:15 | we used the babel data, and later, after the evaluation, we trained another |
---|
0:04:19 | network that has seventeen languages of babel, and it is indeed the one |
---|
0:04:26 | that we would like to use if you can use |
---|
0:04:30 | all kinds of data |
---|
0:04:33 | so, general system overview: as i already said, the basis of our |
---|
0:04:38 | system are the bottlenecks, either based on switchboard or babel data, and then as references we |
---|
0:04:45 | had the mfcc shifted delta cepstra system, we had a pllr system, we also tried |
---|
0:04:52 | some |
---|
0:04:54 | phonotactic systems and modeled the |
---|
0:04:56 | expected n-gram counts with the multinomial subspace model, and techniques like that, but |
---|
0:05:03 | they didn't make it to the fusion |
---|
0:05:07 | and our favourite classifier is just the simple gaussian linear classifier |
---|
0:05:12 | and if you can, along with it, it's good to include the i-vector uncertainty in the |
---|
0:05:18 | computation of scores; that helps quite a bit with the calibration and also |
---|
0:05:24 | provides a slight |
---|
0:05:27 | performance boost |
---|
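[editorial note] the gaussian linear classifier mentioned above can be sketched on toy data as below: one gaussian per language with a shared covariance, which makes the log-likelihood linear in the i-vector. the data are synthetic, and propagating the full i-vector uncertainty (as the talk recommends) is not shown.

```python
# Toy Gaussian linear classifier: per-class means, shared within-class
# covariance, so scores are an affine function of the input vector.
import numpy as np

def train_glc(X, y, n_classes):
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    W = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum()
            for c in range(n_classes)) / len(X)   # pooled within-class cov
    P = np.linalg.inv(W)
    A = means @ P                                  # per-class weight vectors
    b = -0.5 * np.einsum('cd,dk,ck->c', means, P, means)  # per-class biases
    return A, b

rng = np.random.RandomState(1)
n, d = 200, 5
means_true = np.array([[2.0] * d, [-2.0] * d])
y = rng.randint(0, 2, n)
X = means_true[y] + rng.randn(n, d)   # synthetic "i-vectors"
A, b = train_glc(X, y, 2)
scores = X @ A.T + b                  # (n, 2) linear log-likelihood scores
pred = scores.argmax(axis=1)
```

in the real system the inputs would be i-vectors and the covariance would ideally be augmented with the per-utterance i-vector uncertainty.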
0:05:30 | and |
---|
0:05:31 | we had a new thing, |
---|
0:05:33 | a sequence summarizing neural network, |
---|
0:05:36 | i will speak about it |
---|
0:05:40 | just later, because it was a little bit of a disaster, as you will see |
---|
0:05:45 | the fusion |
---|
0:05:47 | fusion was a little bit different, we tried to reflect the nist criteria, because |
---|
0:05:52 | the c average was computed over the clusters and then averaged, so |
---|
0:05:59 | we reflected this, and otherwise |
---|
0:06:04 | we had one weight |
---|
0:06:06 | per system and one bias per language |
---|
0:06:08 | and the cluster priors: we assigned the cluster specific priors for the data |
---|
0:06:14 | for each cluster, and all of the other data, |
---|
0:06:17 | the other sets, had the prior set to zero, and we trained over |
---|
0:06:22 | all clusters in the end, so that |
---|
0:06:26 | i think that it improved the results on the nist metric quite substantially |
---|
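[editorial note] the fusion just described, one weight per input system plus one bias per language with scoring restricted to a cluster, can be sketched as below; the weights, biases and the two-language cluster are made-up numbers, and the training of the fusion parameters is not shown.

```python
# Sketch of cluster-aware score fusion: fused score = sum over systems of
# (system weight * system scores) + per-language bias; posteriors are then
# computed inside one language cluster, i.e. zero prior outside the cluster.
import numpy as np

def fuse(system_scores, weights, bias):
    """system_scores: (K, N, L) scores from K systems -> (N, L) fused."""
    return np.tensordot(weights, system_scores, axes=1) + bias

def cluster_posteriors(fused, cluster):
    """Softmax restricted to the languages of one cluster."""
    s = fused[:, cluster]
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

K, N, L = 2, 4, 6
rng = np.random.RandomState(2)
scores = rng.randn(K, N, L)          # raw scores from two systems
w = np.array([0.7, 0.3])             # one weight per system
b = np.zeros(L)                      # one bias per language
fused = fuse(scores, w, b)
post = cluster_posteriors(fused, cluster=[0, 1])  # a two-language cluster
```

in the real system the weights and biases would be trained jointly over all clusters, with cluster-specific priors on the target languages.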
0:06:33 | and also we gave nist a system that was |
---|
0:06:36 | a classical multiclass system so that they could do some between-cluster analysis on |
---|
0:06:41 | this, because if we gave them just the one that we calibrated and fused |
---|
0:06:47 | this way |
---|
0:06:48 | they would be out of luck with doing anything with that, because of course |
---|
0:06:53 | they asked for |
---|
0:06:54 | log likelihood ratios, not log likelihoods. i hope that next time they |
---|
0:06:58 | will rectify this |
---|
0:07:02 | this is all we had in the end in our submissions |
---|
0:07:07 | most of the systems are stacked bottleneck based, and cd means the cluster |
---|
0:07:11 | dependent system, i will speak about it just two slides later |
---|
0:07:15 | and then there was this a sequence summarizing network |
---|
0:07:19 | and as you can see |
---|
0:07:21 | it is clearly the worst system, it would never make it to |
---|
0:07:26 | the fusion, but at the nist workshop i was presenting this as a |
---|
0:07:29 | system that could almost perfectly classify the dev data. that's not the case, there was |
---|
0:07:34 | a bug, of course, |
---|
0:07:37 | some devel data in the training data |
---|
0:07:40 | so now it's the worst system |
---|
0:07:43 | so anyway, we were so scared that it worked so well on our test data |
---|
0:07:47 | that we didn't include it in the primary system anyway, so the red arrow shows |
---|
0:07:52 | what we had as a primary system, and the |
---|
0:07:56 | alternate system would be with the |
---|
0:07:59 | sequence summarizing network included. what i report here is the c average; the star |
---|
0:08:04 | means that the calibration was performed on the dev set |
---|
0:08:09 | i don't show the c average for the dev set, because during |
---|
0:08:13 | the development we were doing cheating calibration, i think, |
---|
0:08:16 | which is |
---|
0:08:18 | not in this slide anymore |
---|
0:08:21 | and so these are the results on the dev set |
---|
0:08:25 | it's pretty good, let's skip to the |
---|
0:08:28 | results on the evals |
---|
0:08:31 | there is nothing much to say, just that we see quite some calibration |
---|
0:08:36 | loss on the eval data |
---|
0:08:39 | and |
---|
0:08:42 | that was not the case on our test data, especially on the fixed |
---|
0:08:46 | set, because it proved to be |
---|
0:08:49 | quite an easier set than the one we designed for the open data condition |
---|
0:08:56 | so |
---|
0:08:57 | so that's it, that's our system for the |
---|
0:09:01 | fixed training condition |
---|
0:09:03 | so now let's talk about the specialities we had there, the first one being the cluster |
---|
0:09:08 | dependent i-vector system |
---|
0:09:10 | cluster dependent means that we train |
---|
0:09:12 | per cluster: we train the ubm separately per cluster and then the i-vector extractor, and the rest |
---|
0:09:18 | of the system is trained on the whole data |
---|
0:09:23 | they provide, |
---|
0:09:25 | you can see, there are six independent systems which provide the scores, and then we |
---|
0:09:29 | fuse them here |
---|
0:09:32 | with a simple average just to provide some robustness; we calibrate them later anyway |
---|
0:09:38 | so this proved to be quite effective during the development, you just need |
---|
0:09:45 | to take care about the amount of data in the clusters, so |
---|
0:09:50 | the result shown here indicates that there is not enough data, and |
---|
0:09:55 | if you use a diagonal ubm you have, |
---|
0:09:58 | you have a better result in the end, which i believe is caused by not having |
---|
0:10:03 | enough data per cluster to fit all of the parameters of the full |
---|
0:10:06 | covariance ubm |
---|
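[editorial note] the diagonal-versus-full-covariance observation above can be made concrete with a quick parameter count; the component count and feature dimension below are typical values chosen for illustration, not necessarily the ones used in this system.

```python
# Back-of-the-envelope UBM parameter counts: a full-covariance component has
# D*(D+1)/2 covariance parameters versus D for a diagonal one, so per-cluster
# data has to fit far more parameters in the full-covariance case.
def ubm_params(n_components, dim, full_covariance):
    cov = dim * (dim + 1) // 2 if full_covariance else dim
    return n_components * (dim + cov + 1)  # means + covariances + weights

full = ubm_params(1024, 60, full_covariance=True)
diag = ubm_params(1024, 60, full_covariance=False)
```

with these assumed sizes the full-covariance model has over fifteen times more parameters, which is why limited per-cluster data favours the diagonal ubm.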
0:10:10 | and the sequence summarizing neural network, which doesn't work |
---|
0:10:14 | it's, |
---|
0:10:14 | i don't know if you have ever used it; for language id it's basically |
---|
0:10:20 | you take a sequence, a short utterance |
---|
0:10:23 | and |
---|
0:10:24 | and pass it through the network to summarise it; there is a summarisation layer |
---|
0:10:29 | inside |
---|
0:10:31 | where you take the mean of the frames, then you propagate the rest |
---|
0:10:33 | till the end where you have the |
---|
0:10:36 | probabilities of the classes, and you do it all over again over all the data |
---|
0:10:41 | and |
---|
0:10:43 | that's it |
---|
0:10:45 | the |
---|
0:10:47 | and the trick is that you can use the sequence summarizing layer |
---|
0:10:50 | as some sort of feature extractor and model its output later differently |
---|
0:10:56 | and apparently it works a little bit better than just using the network to do |
---|
0:11:01 | the final classification |
---|
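[editorial note] the sequence summarizing idea can be sketched as a tiny forward pass: frame-level layers, a mean-pooling "summarization" layer over the whole utterance, then a classification layer. layer sizes and weights below are random placeholders, not the actual architecture.

```python
# Minimal sequence summarizing network sketch: the mean over time turns a
# variable-length utterance into one fixed vector, which can also be reused
# as an utterance embedding for a separate classifier.
import numpy as np

def ssnn_forward(frames, W1, W2):
    h = np.maximum(frames @ W1, 0.0)   # frame-level hidden layer (ReLU)
    summary = h.mean(axis=0)           # summarization layer: mean over time
    logits = summary @ W2              # utterance-level class scores
    e = np.exp(logits - logits.max())
    return summary, e / e.sum()        # embedding and class probabilities

rng = np.random.RandomState(3)
frames = rng.randn(150, 40)            # one utterance: 150 frames, 40-dim
W1 = rng.randn(40, 64) * 0.1
W2 = rng.randn(64, 10) * 0.1
summary, probs = ssnn_forward(frames, W1, W2)
```

the `summary` vector is what the talk proposes to extract and model with a different back-end instead of taking the network's own class probabilities.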
0:11:03 | we had some partial results with the sequence summarizing network when we |
---|
0:11:09 | tried it on lre o nine, but here the task is so much tougher |
---|
0:11:14 | and |
---|
0:11:15 | the system was a complete disaster |
---|
0:11:18 | open training data condition |
---|
0:11:21 | it's almost the same scenario, just we had a little bit more variability in |
---|
0:11:26 | the features; here specifically i would like to point out the multilingual features, multilingual bottleneck features, |
---|
0:11:33 | that is the ml seven system |
---|
0:11:37 | and |
---|
0:11:39 | you can see that if you include this whole machinery and all of the data |
---|
0:11:43 | and a nice network that can really cluster the space |
---|
0:11:47 | of the languages, you get clearly the best system that you can get |
---|
0:11:54 | and it is also the case on the eval data |
---|
0:11:59 | here i can even show you what the difference is when you |
---|
0:12:04 | use the covariance in the gaussian linear classifier to obtain the scores |
---|
0:12:11 | it's the last line versus the second line of the table; there is not so |
---|
0:12:16 | much gain on the dev data because they're already |
---|
0:12:22 | close to whatever we are training on, but there is a nice gain, |
---|
0:12:28 | a nice gain, on the eval data |
---|
0:12:34 | if we had submitted just the single system, that would probably have been the best |
---|
0:12:39 | but of course |
---|
0:12:41 | we hadn't seen the |
---|
0:12:43 | results on the eval data before submitting and |
---|
0:12:46 | tried the whole fusion, which is |
---|
0:12:50 | slightly worse than the single best system |
---|
0:12:57 | some analysis of the training data |
---|
0:13:00 | we |
---|
0:13:01 | we had a little time constraint and we thought that, |
---|
0:13:06 | from our experience, |
---|
0:13:07 | it's always good, or even |
---|
0:13:11 | necessary, to retrain the final classifier; i mean, when you have the i-vectors, to retrain |
---|
0:13:16 | the logistic regression or the gaussian classifier to get your class posteriors |
---|
0:13:22 | but unfortunately that was not the case for the open data condition; we decided, |
---|
0:13:27 | okay, we have this ubm and i-vector extractor, let's just use those and retrain |
---|
0:13:34 | the system we will use for our submission for the open data condition |
---|
0:13:39 | as |
---|
0:13:41 | and we didn't train a new ubm and i-vector extractor; of course we did it |
---|
0:13:45 | afterwards |
---|
0:13:46 | and you can see that |
---|
0:13:48 | the column just below the submission is the one that we would get if we |
---|
0:13:54 | took the time and retrained both the ubm, the i-vector extractor and the classifier on top, on our |
---|
0:14:02 | dataset |
---|
0:14:05 | so we hurt ourselves quite a bit here as well |
---|
0:14:12 | so features |
---|
0:14:14 | as i already said, the bottleneck features are the best ones that we were able |
---|
0:14:19 | to |
---|
0:14:20 | train |
---|
0:14:22 | if you compare them with the mfcc with shifted delta cepstra there |
---|
0:14:29 | is a huge gain, and i think that |
---|
0:14:33 | the bottleneck system should be the basis of |
---|
0:14:37 | any serious |
---|
0:14:38 | language id system nowadays |
---|
0:14:42 | the bottlenecks out of the network that was trained on the automatically derived units |
---|
0:14:47 | didn't perform very well, but of course |
---|
0:14:51 | that was a very new thing and we didn't want to only |
---|
0:14:56 | run the bottlenecks and |
---|
0:14:59 | be done with the evaluation, so we tried it. you can see that it still |
---|
0:15:04 | really depends on whether you can derive some |
---|
0:15:08 | meaningful units and, |
---|
0:15:09 | more specifically, whether |
---|
0:15:11 | the eval data would match the data they were trained on, because |
---|
0:15:16 | then the units |
---|
0:15:18 | would correspond and probably the bottlenecks would be better |
---|
0:15:22 | so far it doesn't work that well |
---|
0:15:29 | the french cluster: yesterday i saw many people present results here already without |
---|
0:15:34 | the french cluster, inspired by what was done at the nist workshop, where it was |
---|
0:15:40 | excluded from the results. i think that we should not do that; i spoke |
---|
0:15:46 | to ldc |
---|
0:15:47 | and the data are completely okay, people can recognise it, there is just a problem |
---|
0:15:51 | with the channel, as they gave us |
---|
0:15:53 | one channel in training and another one in the test, they basically swapped it |
---|
0:15:59 | and because this is a cluster of just two languages, we all built a very |
---|
0:16:02 | nice channel detector |
---|
0:16:04 | so |
---|
0:16:06 | that is something we should deal with, and not exclude the french cluster |
---|
0:16:09 | from the evaluation |
---|
0:16:12 | just please fix it |
---|
0:16:14 | well, we will try, but we haven't had time to really do that, so all of |
---|
0:16:18 | the results i will show here of course include the french cluster |
---|
0:16:22 | and |
---|
0:16:23 | there |
---|
0:16:25 | they're pretty good if you take the multilingual bottleneck features, but we |
---|
0:16:29 | have to be careful even when you're doing analysis with the french cluster |
---|
0:16:36 | the creole from the french cluster is actually in babel, so if you happen to have |
---|
0:16:39 | some babel data, be careful about it, rather not use it or use it |
---|
0:16:43 | carefully |
---|
0:16:45 | or you might be surprised how well you do on this problem |
---|
0:16:47 | well, for us it didn't help at all |
---|
0:16:52 | so |
---|
0:16:53 | we of course tried a bunch of classifiers on top of the i-vectors and |
---|
0:16:58 | i can say that |
---|
0:17:00 | it's all about the same |
---|
0:17:04 | and the classifier of choice is the simplest one, just the gaussian linear classifier that |
---|
0:17:10 | you can build |
---|
0:17:11 | right away out of i-vectors |
---|
0:17:13 | niko was experimenting with some different language dependent i-vectors, where you extract the i-vectors |
---|
0:17:20 | with the language priors involved |
---|
0:17:24 | it was performing nicely |
---|
0:17:28 | but |
---|
0:17:30 | not really beating |
---|
0:17:32 | the simple gaussian linear classifier. we tried a |
---|
0:17:37 | fully bayesian classifier, we tried a neural network and logistic regression; you can see |
---|
0:17:42 | that all the columns here are pretty much the same |
---|
0:17:48 | and |
---|
0:17:49 | we still have a few minutes, so i can again briefly say something |
---|
0:17:53 | about the automatically derived units: it's a variational bayes method, we |
---|
0:17:58 | train a dirichlet process mixture of hmms and we try to fit an |
---|
0:18:04 | open phoneme loop on the data to estimate the |
---|
0:18:09 | units |
---|
0:18:11 | and then we use this to somehow transcribe the data |
---|
0:18:16 | and use these units |
---|
0:18:19 | as the source for training the neural network which would include the |
---|
0:18:24 | bottleneck, and then |
---|
0:18:25 | we would have some |
---|
0:18:27 | unsupervised bottleneck |
---|
0:18:31 | well, maybe there is, there is |
---|
0:18:34 | still some hope for this, and i hope that people at the jhu workshop |
---|
0:18:37 | will move this thing forward and we will see. the good thing is that |
---|
0:18:43 | we were able to surpass the mfcc baseline on the dev set with this system |
---|
0:18:50 | and i think that's already impressive |
---|
0:18:55 | so the conclusions |
---|
0:18:59 | again |
---|
0:19:00 | use the bottleneck system in your lid system; the gaussian linear classifier is enough |
---|
0:19:07 | and if you can, just include the uncertainty in the score computation |
---|
0:19:13 | and we tried a bunch of phonotactic systems and they performed |
---|
0:19:20 | okay, but they didn't make it to the fusion |
---|
0:19:24 | and |
---|
0:19:26 | i would say that it's always good to have some exercise with the data engineering |
---|
0:19:31 | and try to see the |
---|
0:19:33 | data that we have and try to collect something and |
---|
0:19:36 | work with the data, not only with the systems |
---|
0:19:40 | we tried a bunch of other things like denoising and dereverberation; we didn't see |
---|
0:19:45 | any gains on the dev set, and there were very slight gains on the evaluation |
---|
0:19:49 | set |
---|
0:19:52 | for the phonotactic systems we were using switchboard to train them |
---|
0:19:56 | and |
---|
0:19:57 | we also tried a phone nn, which |
---|
0:20:00 | was pretty bad |
---|
0:20:02 | so that's all, thank you |
---|
0:20:11 | okay time for some questions |
---|
0:20:20 | so my question is more related to the stacked bottlenecks that you presented recently; |
---|
0:20:25 | you mentioned that it's good for language id, but you didn't get such good |
---|
0:20:31 | results for speaker id |
---|
0:20:32 | well, we got good results for speaker id, it's just that we got as good |
---|
0:20:38 | results with the bottlenecks that were not stacked, so you can train the |
---|
0:20:44 | first network |
---|
0:20:45 | only and take the classical bottlenecks, you don't need to do this exercise with |
---|
0:20:50 | stacking the bottlenecks and training another network |
---|
0:20:53 | well, but do they perform well for speaker id as well, or is that not right? |
---|
0:20:59 | i wouldn't say, i wouldn't say that it's worth it |
---|
0:21:03 | but maybe we'll be using it for sre sixteen, just don't use it as an |
---|
0:21:08 | excuse |
---|
0:21:11 | and the other question is, |
---|
0:21:13 | i guess that using these stacked bottleneck features and then six ubms, one per |
---|
0:21:21 | language cluster, your solution was heavy in terms of time, right? |
---|
0:21:26 | well, that is indeed an |
---|
0:21:29 | oracle system |
---|
0:21:30 | from the point of view of the design, but it worked slightly better |
---|
0:21:37 | i wouldn't be in favour of building such a system for five percent relative |
---|
0:21:42 | gain, or even ten percent relative, but in an evaluation |
---|
0:21:46 | the numbers matter; the usability is |
---|
0:21:49 | the second thing |
---|
0:21:54 | more questions |
---|
0:22:06 | thank you for the presentation. i'm sorry, because my question is also related to |
---|
0:22:10 | the stacked bottlenecks. i was wondering if you have made any analysis on the |
---|
0:22:15 | alignment provided by both |
---|
0:22:17 | the first bottlenecks and the stacked ones, to see if there is really an evolution |
---|
0:22:21 | in the process of |
---|
0:22:23 | alignment |
---|
0:22:25 | you mean, you mean the performance of the system, or some |
---|
0:22:29 | no, i'm talking rather about the alignment on your ubm, to see how |
---|
0:22:33 | the distribution of the features evolves |
---|
0:22:37 | i don't think we made this comparison |
---|
0:22:40 | sorry |
---|
0:22:46 | can i ask a question also? my question regards the context you're looking at, plus minus ten |
---|
0:22:53 | frames; did you |
---|
0:22:55 | somehow, you know, kind of explore it, or was that fixed to this set? of |
---|
0:22:59 | course this is not the ideal number, we explored |
---|
0:23:03 | a bunch of numbers. if you're having just the first network, i think that you |
---|
0:23:07 | can play more with the context |
---|
0:23:12 | you should aim for something like three hundred milliseconds of context. if you're |
---|
0:23:16 | using the stacked bottlenecks, the context is longer because you stack |
---|
0:23:21 | several bottlenecks and |
---|
0:23:22 | use that in the second stage, so |
---|
0:23:24 | that's why there we use something like plus minus ten |
---|
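[editorial note] the numbers in this answer can be turned into quick receptive-field arithmetic; the 10 ms frame shift, the first-stage context of plus minus fifteen frames and the stacking offsets below are assumptions for illustration, not confirmed configuration values.

```python
# Rough context arithmetic: with a 10 ms frame shift, a first-stage input
# context of +/-15 frames is about the 300 ms mentioned above; stacking
# first-stage outputs at +/-10 frames extends the total span further.
FRAME_SHIFT_MS = 10
first_stage_ctx = 15                 # frames to each side, first network
stack_offsets = (-10, -5, 0, 5, 10)  # second-stage stacking offsets

first_span_ms = (2 * first_stage_ctx + 1) * FRAME_SHIFT_MS
total_ctx = first_stage_ctx + max(abs(o) for o in stack_offsets)
stacked_span_ms = (2 * total_ctx + 1) * FRAME_SHIFT_MS
```

under these assumptions the first network sees roughly a 310 ms span and the stacked bottleneck roughly 510 ms, which matches the speaker's point that stacking lengthens the effective context.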
0:23:27 | i was thinking it may be more sensitive |
---|
0:23:29 | to the background noise, because in your other systems you said you did |
---|
0:23:33 | some denoising, so i was wondering what's more sensitive to noise. the bottleneck is pretty good at |
---|
0:23:38 | dealing with the noise actually; i had a paper at interspeech where we trained a denoising |
---|
0:23:44 | autoencoder |
---|
0:23:45 | and it works pretty well on the mfccs |
---|
0:23:48 | then we used the denoised spectra to generate the bottlenecks |
---|
0:23:53 | and |
---|
0:23:54 | well, basically repeated all the experiments with the bottlenecks, and the gains were much, |
---|
0:23:59 | much smaller |
---|
0:24:04 | discussion |
---|
0:24:12 | so this is more of |
---|
0:24:14 | a comment on the french cluster you were speaking about, and i agree, you know, it |
---|
0:24:19 | showed up as problematic, and you said ignoring it is not the answer to it |
---|
0:24:24 | i would point out that we do have a contradiction going on, in the sense that |
---|
0:24:29 | you labelled it as a simple channel thing, right |
---|
0:24:33 | but we know from lre o nine and other ones we've done with |
---|
0:24:37 | narrowband audio brought over from broadcast, and we haven't seen this massive shift before |
---|
0:24:43 | so we have a contradiction: in the past we used this successfully with telephony speech |
---|
0:24:48 | pulled from broadcast and so forth. there is an interesting point here, which is that, |
---|
0:24:54 | again, ldc went out and did say that it's not, it was not |
---|
0:24:58 | mislabelled, there were no errors in there, but |
---|
0:25:01 | there's a chance that the formality of the language changes based on whether you're broadcast: you |
---|
0:25:06 | might be at a higher, you know, high versus low register, whereas telephony... so i |
---|
0:25:11 | just bring this up in general, because there's been talk coming on this; it may |
---|
0:25:15 | be that there is something about the actual |
---|
0:25:18 | dialect shift that happens based on how it's produced, not so much the channel |
---|
0:25:23 | we don't know yet |
---|
0:25:25 | i agree |
---|
0:25:27 | okay, let's thank the speaker again |
---|