0:00:15 | I'm going to be representing USTC, the University of Science and Technology of China, |
---|
0:00:19 | the National Engineering Laboratory of Speech and Language |
---|
0:00:24 | Information Processing. |
---|
0:00:26 | This is a paper by my master's student and some other |
---|
0:00:32 | collaborators. We asked him to |
---|
0:00:34 | build his own CNN, which he did, and then we asked him to try using |
---|
0:00:38 | it for something, which he did. |
---|
0:00:40 | So what I'm going to do is present |
---|
0:00:42 | what came out when he tried that. |
---|
0:00:46 | We've got four stages: an introduction on how this works for language ID and the structure |
---|
0:00:51 | in general, |
---|
0:00:52 | the proposed method, some experiments and analysis, and then |
---|
0:00:56 | maybe ending with a bit of thought on some future work. |
---|
0:01:00 | Well, the first thing to ask is: what is language identification? |
---|
0:01:06 | It's just the task of taking a piece of speech and extracting language identity information |
---|
0:01:12 | from it. That comes at different levels, as we know, |
---|
0:01:14 | and we can say that that's acoustic information or phonetic information. It's quite hard |
---|
0:01:21 | to disassociate that from the characteristics of the speaker, as we'll see |
---|
0:01:26 | in a little while, |
---|
0:01:27 | and we were finding a tendency to do |
---|
0:01:30 | speaker recognition. |
---|
0:01:32 | State of the art? Well, probably, |
---|
0:01:36 | maybe this will change shortly, I don't know, but the state of the art is really GMM i-vectors, |
---|
0:01:42 | and we've seen great gains, but everybody is, you know, trying to find what's next. |
---|
0:01:48 | Deep learning in particular allows us to take some of the advantages of supervised |
---|
0:01:55 | training to be able to extract |
---|
0:01:58 | discriminative information out of the |
---|
0:02:01 | data that we have. Especially when we have small amounts of training |
---|
0:02:04 | data, we can use transfer learning methods |
---|
0:02:08 | to train something which may well be discriminative |
---|
0:02:11 | on a related task of inferring its language ID. |
---|
0:02:16 | Some of these we've seen |
---|
0:02:18 | recently: there's the bottleneck-network-based i-vector representation of |
---|
0:02:24 | Yan Song, a collaborator; |
---|
0:02:26 | this was, |
---|
0:02:27 | I think it was last year at Interspeech. There's also a poster yesterday which you |
---|
0:02:31 | missed; the paper should be in the proceedings. We've seen DNN-based neural network |
---|
0:02:37 | approaches here |
---|
0:02:40 | doing great things; that's in Transactions on ASLP. |
---|
0:02:44 | Then there are some approaches which are end-to-end methods, and we can look |
---|
0:02:50 | at some of the |
---|
0:02:53 | ways the |
---|
0:02:55 | state of the art has flowed through that: |
---|
0:02:58 | deep neural networks |
---|
0:03:00 | here, |
---|
0:03:01 | and that was, I guess, |
---|
0:03:05 | long short-term memory |
---|
0:03:07 | RNNs here, |
---|
0:03:11 | also at Interspeech. |
---|
0:03:13 | So this is really extracting at a frame level |
---|
0:03:17 | and gathering sufficient statistics over an utterance in order to |
---|
0:03:22 | pull out |
---|
0:03:22 | language-specific |
---|
0:03:25 | identifiers. There's a recent approach |
---|
0:03:28 | using a convolutional neural network to, let's say, deal with short |
---|
0:03:34 | utterances, and it's using the power of a CNN |
---|
0:03:37 | to pull out |
---|
0:03:38 | the information from these short utterances, |
---|
0:03:41 | and it seems to get over some of the problems in terms |
---|
0:03:45 | of utterance length. |
---|
0:03:47 | We have a different method. |
---|
0:03:48 | We also think that using, say, MFCCs with a large context may be |
---|
0:03:56 | introducing |
---|
0:03:57 | too much information that a CNN or a DNN then has to remove, |
---|
0:04:00 | so what we would have to do is |
---|
0:04:03 | use some of our precious training data to remove information that probably |
---|
0:04:08 | shouldn't have been included in the first place, if we had a magic wand, |
---|
0:04:11 | in terms of input features. |
---|
0:04:15 | So what we're doing |
---|
0:04:16 | is slightly different. |
---|
0:04:20 | We take a convolutional neural network, and |
---|
0:04:24 | we're not using the CNN to extract frame-level information per se. What we're actually |
---|
0:04:31 | doing, |
---|
0:04:32 | in this very |
---|
0:04:34 | wide, long, |
---|
0:04:36 | end-to-end type system, is starting off with PLP input features, |
---|
0:04:43 | and we're doing a bottleneck |
---|
0:04:46 | DNN, just a standard bottleneck |
---|
0:04:48 | network, |
---|
0:04:49 | taking the bottleneck features here, |
---|
0:04:52 | adding what could be quite a lot of context to the bottleneck features, and |
---|
0:04:57 | then feeding that into a CNN, |
---|
0:05:00 | here with three layers, |
---|
0:05:03 | and finally a fully connected output, and what we're getting is a language label |
---|
0:05:07 | directly at the output of this. |
---|
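To make the description concrete, here is a minimal PyTorch sketch of this kind of pipeline: PLP frames go through a pre-trained bottleneck DNN, the bottleneck outputs are stacked with context, and a small CNN with pooling and a fully connected layer emits one language label per utterance. The layer sizes are assumptions for illustration (the sizes actually used are given later in the talk); this is not the authors' code.

```python
# Illustrative sketch of the described DNN -> CNN pipeline (assumed sizes).
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Front half of a senone-trained DNN, kept up to its 50-dim bottleneck."""
    def __init__(self, in_dim=48 * 21, bn_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim),              # bottleneck layer output
        )

    def forward(self, spliced_plp):               # (n_frames, in_dim)
        return self.net(spliced_plp)              # (n_frames, bn_dim)

class LidCNN(nn.Module):
    """CNN over context-stacked bottleneck features, pooled to a fixed vector."""
    def __init__(self, bn_dim=50, context=21, n_langs=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(bn_dim * context, 1024, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(1024, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool1d(4)       # stand-in for pyramid pooling
        self.fc = nn.Linear(128 * 4, n_langs)

    def forward(self, bn_with_context):           # (batch, bn_dim*context, n_frames)
        h = self.conv(bn_with_context)
        h = self.pool(h).flatten(1)               # fixed size for any utterance length
        return self.fc(h)                         # utterance-level language logits
```

The single adaptive max-pool used here is only a placeholder; the spatial pyramid pooling that the talk actually uses is sketched a little further down.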
0:05:09 | So you can see why this is sort of attractive in terms of a system- |
---|
0:05:14 | level implementation, but to me it's |
---|
0:05:17 | kind of counterintuitive, |
---|
0:05:19 | because we tend to use CNNs |
---|
0:05:21 | to extract |
---|
0:05:22 | front-end information; I mean, in the related tasks that we've been trying, |
---|
0:05:27 | they tend to work well for that. |
---|
0:05:30 | I mean, we did try things like stacks of MFCCs as input features to |
---|
0:05:37 | a CNN directly, and it doesn't seem to work that well; somebody else can maybe do better than |
---|
0:05:41 | us there. |
---|
0:05:43 | So what we did was we have a DNN |
---|
0:05:46 | followed by a CNN, and we see how that works. |
---|
0:05:50 | And that pretty much |
---|
0:05:52 | sums up what it is: the DNN transforms acoustic features to a compact representation; |
---|
0:05:56 | we do that frame by frame, add a context of multiple frames, and feed the bottleneck features with |
---|
0:06:02 | context into the CNN, |
---|
0:06:03 | and we come out with something which should be discriminative in terms of language. |
---|
0:06:10 | Okay, so this is what we call the LID features. |
---|
0:06:15 | I mean, we think that the general acoustic features at the input, like I |
---|
0:06:19 | said, do contain too much information, |
---|
0:06:21 | so we're trying to reduce the amount of |
---|
0:06:24 | information burden on the |
---|
0:06:26 | trained system that |
---|
0:06:28 | follows. |
---|
0:06:31 | And given the limited amount of training data, we don't really want to waste that. |
---|
0:06:36 | We know that we can have a deep neural network which is trained on senones, |
---|
0:06:43 | and the end of it will be phonetic information, |
---|
0:06:47 | the beginning of it is acoustic information, |
---|
0:06:49 | and somewhere in the middle of that network is a transformation, |
---|
0:06:53 | effectively from the acoustic to the phonetic. We take the |
---|
0:06:57 | bottleneck features, which |
---|
0:06:59 | we hope are a compact |
---|
0:07:03 | representation of the relevant information. |
---|
0:07:06 | I'm not sure that's entirely true, because there are plenty of approaches that take |
---|
0:07:10 | information from |
---|
0:07:12 | both the centre and the end of the DNN and |
---|
0:07:15 | seem to work well, especially with fusion. |
---|
0:07:18 | Anyway, |
---|
0:07:19 | what we're doing that is just |
---|
0:07:20 | kind of different is we're using spatial pyramid pooling |
---|
0:07:24 | at the output of the CNN, |
---|
0:07:29 | and what this allows us to do is to take the front- |
---|
0:07:33 | end information and to span the utterance level with it, |
---|
0:07:38 | which |
---|
0:07:38 | provides us with an |
---|
0:07:41 | utterance-length-invariant, |
---|
0:07:44 | fixed-dimension vector at this point. |
---|
0:07:50 | So it can deal with arbitrary inputs. We take the |
---|
0:07:53 | spatial pyramid pooling method from the paper by Kaiming He; |
---|
0:07:56 | that's ECCV, computer vision, two thousand and fourteen, and it's designed to solve |
---|
0:08:01 | the problem of making the feature dimension invariant to the input size. This is a problem |
---|
0:08:07 | we face often, and it's a problem certain |
---|
0:08:10 | areas of image processing also face. |
---|
0:08:14 | I think what's happened is we've got a |
---|
0:08:16 | sort of feedback loop, where the speech technology goes into the image processing field and |
---|
0:08:20 | then |
---|
0:08:20 | comes back to the speech field, and then it cycles around. |
---|
0:08:24 | So this is really inspired by a bag-of-words approach, |
---|
0:08:26 | and it comes through |
---|
0:08:31 | into the spatial pyramid pooling, which uses a power-of-two |
---|
0:08:34 | stack of max-pooled features. |
---|
0:08:37 | Okay, so it changes resolution by powers of two, |
---|
0:08:41 | and we can control quite finely how many |
---|
0:08:44 | features |
---|
0:08:45 | we |
---|
0:08:45 | want at the output of that. |
---|
0:08:47 | So it's attractive in that it works well; |
---|
0:08:50 | the information on that is actually in the paper. |
---|
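As a concrete illustration of the pooling just described, here is a small sketch of spatial pyramid pooling applied along the time axis. The pyramid levels (1, 2 and 4 bins) are an assumed choice; the point is that the output dimension is fixed no matter how many frames the utterance has.

```python
# Minimal temporal spatial pyramid pooling sketch (after He et al., ECCV 2014).
import torch
import torch.nn.functional as F

def spatial_pyramid_pool_1d(feat, levels=(1, 2, 4)):
    """feat: (batch, channels, frames) -> (batch, channels * sum(levels)).

    Each level splits the time axis into a power-of-two number of bins and
    max-pools inside each bin, so the output size does not depend on the
    number of input frames."""
    pooled = [F.adaptive_max_pool1d(feat, n_bins) for n_bins in levels]
    return torch.cat([p.flatten(1) for p in pooled], dim=1)

# Example: a 128-channel CNN output for a 300-frame and a 47-frame utterance
# both come out as 128 * (1 + 2 + 4) = 896-dimensional vectors.
for n_frames in (300, 47):
    v = spatial_pyramid_pool_1d(torch.randn(1, 128, n_frames))
    print(v.shape)   # torch.Size([1, 896])
```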
0:08:54 | So how do we do that? How do we put all this stuff together? |
---|
0:08:57 | Well, as shown in the diagram on the right here, what we're doing is we're |
---|
0:09:01 | taking a |
---|
0:09:02 | six-layer DNN which is trained with large-scale |
---|
0:09:07 | Switchboard |
---|
0:09:09 | data, |
---|
0:09:10 | and we're taking the half of the network up to the bottleneck layer and feeding |
---|
0:09:14 | that into a system that is now trained for language ID using LID training data. |
---|
0:09:21 | And |
---|
0:09:23 | now, we propose that if we take that information and we feed it directly into |
---|
0:09:27 | a CNN, |
---|
0:09:28 | given the training data that we were using, well, it will not converge to |
---|
0:09:32 | anything sensible, |
---|
0:09:33 | if at all. |
---|
0:09:35 | It just doesn't work. So what we had to do there was to build |
---|
0:09:37 | the network, |
---|
0:09:38 | that is, the CNN, layer by layer. |
---|
0:09:41 | So the DNN is already trained, that's fixed, that's great, |
---|
0:09:43 | and then you start to build the CNN by having the first convolutional layer and |
---|
0:09:47 | then the second and then the third; each one takes a spatial pyramid pooling and |
---|
0:09:52 | a fully connected layer at the output |
---|
0:09:55 | to give us the direct language labels. |
---|
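A rough sketch of how such a layer-by-layer build-up could look, assuming each stage freezes everything trained so far, appends one new convolutional layer, and trains a fresh pooling-plus-fully-connected head on the LID data. This is an assumption about the recipe for illustration, not the authors' exact procedure.

```python
# Illustrative layer-wise construction of the CNN on top of a frozen bottleneck DNN.
import torch.nn as nn

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False           # keep already-trained layers fixed

def new_stage(trained_convs, in_ch, out_ch, n_langs=6, pooled_bins=7):
    """Stack frozen conv layers, add one trainable conv and a fresh output head."""
    for m in trained_convs:
        freeze(m)
    convs = nn.Sequential(*trained_convs,
                          nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                          nn.ReLU())
    head = nn.Linear(out_ch * pooled_bins, n_langs)   # applied after pyramid pooling
    return convs, head

# Stage 1: one conv layer on the stacked bottleneck features (50 dims x 21 frames).
convs1, head1 = new_stage([], in_ch=50 * 21, out_ch=1024)
# ... train (convs1, head1) on the LID data, then grow the network:
convs2, head2 = new_stage([convs1], in_ch=1024, out_ch=256)
convs3, head3 = new_stage([convs2], in_ch=256, out_ch=128)
```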
0:09:59 | And that actually works. We can see that later when we look at the results |
---|
0:10:03 | layer by layer, |
---|
0:10:05 | as to |
---|
0:10:05 | how |
---|
0:10:08 | the accuracy improves with |
---|
0:10:10 | the number of layers and with the size of the layers. |
---|
0:10:14 | It's quite interesting to see that. |
---|
0:10:17 | The DNN is pretty standard: it's |
---|
0:10:20 | forty-eight features, fifteen PLPs plus delta and delta-delta, |
---|
0:10:24 | sorry, plus pitch, |
---|
0:10:25 | and with a context size of |
---|
0:10:28 | twenty-one frames; |
---|
0:10:30 | 1024, |
---|
0:10:32 | 1024, 50, 1024, 1024, and |
---|
0:10:35 | 3020 senones at the output. |
---|
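For concreteness, a sketch of the frame splicing implied by "a context size of twenty-one frames": each 48-dimensional frame is concatenated with its ten left and ten right neighbours, padding the edges by repeating the first and last frames. The padding convention is an assumption for illustration.

```python
# Frame splicing sketch: 21-frame context windows of 48-dim PLP+pitch features.
import numpy as np

def splice(frames, left=10, right=10):
    """frames: (n_frames, feat_dim) -> (n_frames, feat_dim * (left + 1 + right))."""
    padded = np.concatenate([
        np.repeat(frames[:1], left, axis=0),    # repeat first frame at the start
        frames,
        np.repeat(frames[-1:], right, axis=0),  # repeat last frame at the end
    ])
    return np.concatenate(
        [padded[i:i + len(frames)] for i in range(left + 1 + right)], axis=1)

plp = np.random.randn(500, 48)     # 48-dim features for a 500-frame utterance
print(splice(plp).shape)           # (500, 1008) = 48 x 21, the DNN input size
```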
0:10:38 | And we'll look at the structure of the |
---|
0:10:40 | CNN in a little while. |
---|
0:10:42 | It is worth mentioning at this point, because it's a problem, |
---|
0:10:46 | that we create, sorry, separate networks for the tasks of thirty-second, ten-second and |
---|
0:10:52 | three-second data. |
---|
0:10:55 | I mean, we would like to combine these; we're trying to at the moment, |
---|
0:10:58 | but they're separately trained |
---|
0:10:59 | for now. |
---|
0:11:02 | The baselines are a bottleneck GMM i-vector and a bottleneck DNN i-vector with LDA and |
---|
0:11:09 | WCCN, |
---|
0:11:10 | pretty much as we published previously. |
---|
0:11:14 | So let's look at how this works, |
---|
0:11:16 | and just try to visualise some of these layers. |
---|
0:11:19 | What we have here is the |
---|
0:11:23 | post- |
---|
0:11:24 | pooling, |
---|
0:11:26 | fully connected layer |
---|
0:11:28 | information. |
---|
0:11:31 | Note this diagram comes from the paper. What we've done is we've taken |
---|
0:11:37 | these |
---|
0:11:38 | test features |
---|
0:11:41 | over some utterances, and we've compared them for different languages, just visually. |
---|
0:11:47 | So what we've done is just take thirty-five randomly selected features from that stack, |
---|
0:11:53 | plotted here for two languages. |
---|
0:11:57 | Right: |
---|
0:11:58 | on the left it's Dari, |
---|
0:12:00 | on the right it's Farsi, |
---|
0:12:02 | which I'm told are very similar languages. |
---|
0:12:06 | The top and the bottom are different |
---|
0:12:09 | segments |
---|
0:12:10 | from utterances. |
---|
0:12:12 | So what we're looking at top to bottom is intra-language difference; what we're looking at |
---|
0:12:16 | between left and right is inter-language difference. So top and bottom |
---|
0:12:21 | is intra, |
---|
0:12:22 | left and right is inter. |
---|
0:12:23 | So we should see that there is large variability between languages and small variability within |
---|
0:12:29 | languages, and that's what we get. |
---|
0:12:31 | It gives us |
---|
0:12:32 | visual evidence |
---|
0:12:33 | to think that |
---|
0:12:35 | these statistics might well be discriminative for languages. |
---|
0:12:40 | Just moving along a bit further |
---|
0:12:43 | down here, what we're getting here is frame-level information, |
---|
0:12:48 | and we like to call these LID-senones; maybe this is not the best terminology, |
---|
0:12:54 | but |
---|
0:12:55 | just to |
---|
0:12:56 | explain how we get to that sort of conclusion: |
---|
0:13:01 | if we look at this information, what we're seeing here, and on the right, |
---|
0:13:05 | and notice the scales on some of them, |
---|
0:13:11 | is the LID-senones coming out of the system at the |
---|
0:13:18 | frame level, with context: for |
---|
0:13:22 | speech here, |
---|
0:13:25 | another piece of speech there, |
---|
0:13:27 | a transition region between |
---|
0:13:29 | two parts of speech here, |
---|
0:13:32 | and a non-speech region just here. |
---|
0:13:35 | So what we tend to see when we visualise this is different |
---|
0:13:40 | LID-senones activating and deactivating |
---|
0:13:44 | as we go through an utterance or go between utterances, |
---|
0:13:48 | and we believe that there is language discrimination information in this. |
---|
0:13:54 | If you look at the scale, |
---|
0:13:56 | the y-axis scale of these plots, |
---|
0:13:58 | we can see that when there is a non-speech region, around here, we get |
---|
0:14:01 | all sorts of things activating, but the level, |
---|
0:14:05 | the amplitude of activation, is quite low. |
---|
0:14:08 | So it gives |
---|
0:14:09 | evidence to the fact that probably we have something which is language-specific, at |
---|
0:14:12 | least. |
---|
0:14:18 | So also, there's something we call a |
---|
0:14:22 | hybrid temporal evaluation. So we send thirty-second, ten-second and three-second data into |
---|
0:14:26 | separate networks; |
---|
0:14:28 | we train them independently, and we do, well, we don't do quite the same degree |
---|
0:14:32 | of augmentation as others have, but we do try to augment by cutting the thirty- |
---|
0:14:37 | second speech into ten-second |
---|
0:14:38 | and three-second regions. |
---|
0:14:40 | So what we're doing is |
---|
0:14:41 | we're trying to make up for the fact that the three-second information is woefully |
---|
0:14:45 | inadequate in terms of statistics, probably, |
---|
0:14:48 | by having a lot more of it. |
---|
0:14:50 | Let's see how that works |
---|
0:14:52 | in terms of the |
---|
0:14:54 | performance of each. |
---|
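A tiny sketch of the kind of augmentation just described: long training utterances are chopped into shorter pieces so that the ten-second and three-second networks see many more examples. The 10 ms frame rate and the non-overlapping chunking policy are assumptions for illustration.

```python
# Cut 30 s training utterances into 10 s and 3 s chunks for the shorter-duration nets.
import numpy as np

FRAMES_PER_SEC = 100   # assuming a 10 ms frame shift

def chop(features, seconds):
    """Split a (n_frames, dim) utterance into non-overlapping chunks of `seconds`."""
    size = seconds * FRAMES_PER_SEC
    return [features[i:i + size]
            for i in range(0, len(features) - size + 1, size)]

utt_30s = np.random.randn(30 * FRAMES_PER_SEC, 50)    # e.g. bottleneck features
print(len(chop(utt_30s, 10)), len(chop(utt_30s, 3)))  # 3 ten-second, 10 three-second
```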
0:14:58 | Unfortunately, we only have data here from |
---|
0:15:01 | NIST LRE 2009, and from that we only take the six |
---|
0:15:05 | most confusable languages. |
---|
0:15:07 | It's a subset; it's a much quicker subset to do analysis on and to run experiments on, |
---|
0:15:13 | and if you look at papers over the last few years, |
---|
0:15:16 | we tend to publish with these six languages first |
---|
0:15:20 | and then extend later. |
---|
0:15:22 | It seems worthwhile. |
---|
0:15:24 | It's about a hundred and fifty hours of |
---|
0:15:26 | training data: Voice of America radio broadcasts, and CTS, conversational telephone speech, |
---|
0:15:31 | and we split it up into the three different durations. |
---|
0:15:35 | We're looking at two baseline systems and our proposed network, |
---|
0:15:39 | and more |
---|
0:15:40 | on the fusion of that later; |
---|
0:15:42 | everybody wants to do fusion |
---|
0:15:44 | at the end. |
---|
0:15:47 | So let's look at three ways that this, |
---|
0:15:52 | this structure can be adapted, because there are so many different parameters that we could change |
---|
0:15:57 | in here. |
---|
0:15:58 | The first one we wanted to look at was |
---|
0:16:00 | the size of the context |
---|
0:16:02 | at the output of the |
---|
0:16:04 | DNN layers, |
---|
0:16:06 | and we're changing it, |
---|
0:16:08 | if you can make it out, just here: |
---|
0:16:12 | lower-case n. |
---|
0:16:13 | So what we're doing is we're |
---|
0:16:15 | keeping the same |
---|
0:16:16 | bottleneck |
---|
0:16:17 | network, |
---|
0:16:18 | but we're stacking more of its outputs. |
---|
0:16:22 | And we can see from the results for thirty seconds, ten seconds and three seconds, |
---|
0:16:25 | in EER, |
---|
0:16:27 | that the bigger the context, in general, the better the results. |
---|
0:16:31 | Now bear in mind that we already have some context at the input here; |
---|
0:16:37 | that's also got context, twenty-one frames to be precise, so we're adding more |
---|
0:16:43 | context at this end, and we're seeing a benefit. |
---|
0:16:48 | And it turns out that for the ten-second and three-second tasks, a context of |
---|
0:16:52 | twenty-one, |
---|
0:16:53 | just here, |
---|
0:16:54 | tends to work better; |
---|
0:16:55 | for the |
---|
0:16:56 | thirty-second task an even longer context works better, probably because the data is longer. |
---|
0:17:02 | I think that |
---|
0:17:04 | the problem is the three-second and ten-second data tends to saturate; I mean, |
---|
0:17:09 | we just cannot physically get enough information out of that data, |
---|
0:17:12 | no matter how much context |
---|
0:17:14 | we introduce. |
---|
0:17:17 | And moving on a little bit further, |
---|
0:17:19 | we can also experiment with |
---|
0:17:21 | how |
---|
0:17:22 | deep and how wide the CNN is, |
---|
0:17:26 | and we do that down here with |
---|
0:17:31 | basically three different experiments, one of which is the LID net with |
---|
0:17:36 | a 1024-size |
---|
0:17:40 | convolutional input layer, |
---|
0:17:44 | a single layer, |
---|
0:17:45 | then feeding into the spatial pyramid pooling and the fully connected system. |
---|
0:17:50 | We train the system up and we get about nine, |
---|
0:17:54 | nine percent to sixteen percent |
---|
0:17:57 | performance on the three different scales. If we add another layer, so we have |
---|
0:18:01 | two layers, then we bring that down by a reasonable amount for the three seconds, |
---|
0:18:06 | not quite so much for the thirty seconds, |
---|
0:18:09 | and we're looking at 128, 256 or 512 |
---|
0:18:14 | as the size of the second layer |
---|
0:18:17 | in the CNN. |
---|
0:18:18 | For the third layer |
---|
0:18:21 | we check out 64 and 128, and we can see that basically, with |
---|
0:18:24 | increasing complexity, the results tend to improve, less so for the thirty seconds, more for the |
---|
0:18:29 | others. |
---|
0:18:31 | For the hybrid temporal evaluation, what we're actually doing here is we're using the, |
---|
0:18:35 | the thirty-second network |
---|
0:18:38 | to evaluate thirty-second data, |
---|
0:18:40 | the ten-second network to evaluate thirty-second and ten-second data, |
---|
0:18:43 | and the three-second network to evaluate everything. |
---|
0:18:46 | And the performance, |
---|
0:18:49 | unsurprisingly, of the three-second network is better for the three-second data (you can |
---|
0:18:54 | only use that one), the ten-second network is better for the ten-second data, but the thirty-second |
---|
0:18:59 | network on thirty-second data is, |
---|
0:19:03 | however, |
---|
0:19:04 | well, it's better using the ten-second one for thirty-second data. So this means that |
---|
0:19:08 | perhaps these networks themselves are operating at different scales of information, so we fuse |
---|
0:19:13 | them together to get the results at the bottom, |
---|
0:19:16 | and we have a slight improvement there. |
---|
0:19:20 | But you will notice that we can only improve on the baseline system for the |
---|
0:19:25 | thirty-second result. |
---|
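A minimal sketch of score-level fusion of the three duration-specific networks. Equal weights are used here purely for illustration; in practice fusion weights would normally be tuned on development data, for example with logistic regression.

```python
# Fuse per-utterance language scores from the 30 s, 10 s and 3 s networks.
import numpy as np

def fuse(score_list, weights=None):
    """score_list: list of (n_utts, n_langs) score arrays (logits or log-likelihoods)."""
    weights = weights or [1.0 / len(score_list)] * len(score_list)
    return sum(w * s for w, s in zip(weights, score_list))

scores_30s = np.random.randn(200, 6)   # placeholder per-network scores
scores_10s = np.random.randn(200, 6)
scores_3s = np.random.randn(200, 6)
predicted_lang = fuse([scores_30s, scores_10s, scores_3s]).argmax(axis=1)
```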
0:19:29 | One more thing before we conclude: the i-vector system uses zeroth- and first-order statistics, |
---|
0:19:36 | but this effectively only uses zeroth-order statistics. |
---|
0:19:39 | So |
---|
0:19:41 | pretty much, our future work will be looking at how we can incorporate more |
---|
0:19:45 | statistics, |
---|
0:19:47 | and whether we can build a comprehensive network that uses |
---|
0:19:50 | all scales and handles all scales simultaneously. So that's it; that's a weird and wonderful |
---|
0:19:56 | DNN-CNN hybrid. Thank you. |
---|
0:20:06 | We have time for questions. |
---|
0:20:13 | So, |
---|
0:20:15 | thanks very much, that was very interesting. As far as I could see, about the |
---|
0:20:20 | core of the network: |
---|
0:20:24 | as far as I understood, you |
---|
0:20:28 | did some incremental training, so once you, once you've trained a part of |
---|
0:20:34 | the network and then you extend the network, the parameters of the first part, |
---|
0:20:38 | they stay fixed, you don't adapt them? |
---|
0:20:41 | Yes, we have this fixed, so we fix that and we build on it. And |
---|
0:20:44 | that |
---|
0:20:46 | again is what you get when you ask a student to try different things, and I |
---|
0:20:50 | probably wouldn't have done this myself, but it tends to work |
---|
0:20:53 | quite well. |
---|
0:21:00 | But mostly, |
---|
0:21:01 | is it fixed? You mean the network is trained and then you just change the |
---|
0:21:07 | last layer, |
---|
0:21:08 | and then do you retrain the whole system? No, we don't retrain the whole system; we |
---|
0:21:12 | focus on the back end, and we just train the last layer. |
---|
0:21:17 | Thank you. |
---|
0:21:21 | I think we have another question, for we've got lots of time. So, you |
---|
0:21:25 | spoke a lot about the information flow through the neural network. So if you've |
---|
0:21:32 | read some of Geoff Hinton's stuff on neural networks, he will, he |
---|
0:21:39 | will tell you again and again that |
---|
0:21:41 | there is more information, |
---|
0:21:44 | in our case, in the speech than in the labels, so |
---|
0:21:48 | he's advocating for the use of generative models rather than discriminative ones. As far as |
---|
0:21:54 | I can see, yours is purely discriminative, so I'd just like to hear any point |
---|
0:22:00 | you have on that matter. |
---|
0:22:02 | So actually it's interesting that you bring that up, because I was looking at some |
---|
0:22:05 | of the comments Hinton was making recently, and |
---|
0:22:10 | he was talking about using, |
---|
0:22:12 | he was talking about the benefits of having a two-stage process, where we have one |
---|
0:22:18 | front end which is very good at picking out |
---|
0:22:20 | the most useful data from a large-scale dataset, and then a back end which is |
---|
0:22:26 | very good at using it, and that these two tasks are complementary; it's seldom we |
---|
0:22:32 | can use one system that excels at both tasks. He believes that both ends can |
---|
0:22:36 | be trained, and we seem to have done that, but we've done it, |
---|
0:22:40 | okay, the opposite way around to the way I would have imagined. |
---|
0:22:48 | Okay, thank you very much. |
---|