0:00:26 | a a a i i i |
---|
0:00:27 | no |
---|
0:00:30 | i |
---|
0:00:31 | a |
---|
0:00:32 | i you know what |
---|
0:00:33 | a i today |
---|
0:00:35 | and people |
---|
0:00:36 | and some people would be you know a to up to have an another's with |
---|
0:00:40 | they on if an sup of for the rest of the live |
---|
0:00:43 | i'm thinking that those prophecies have been just slightly misinterpreted |
---|
0:00:47 | and the event that they were referring to is this |
---|
0:00:49 | a wonderful speech toolkit |
---|
0:00:51 | i |
---|
0:00:53 | a that have in of actually that |
---|
0:00:55 | you know it almost buys review thing it |
---|
0:00:57 | so uh |
---|
0:00:59 | i do i think that have anything and their estimates the significance of this that |
---|
0:01:04 | a that's okay |
---|
0:01:05 | okay |
---|
0:01:06 | so first, just about the name |
---|
0:01:08 | it's some kind of coffee reference, hence the little coffee bean with uh |
---|
0:01:13 | but had so |
---|
0:01:15 | a |
---|
0:01:16 | but |
---|
0:01:16 | it's just |
---|
0:01:17 | whatever name we thought of |
---|
0:01:20 | so uh |
---|
0:01:21 | the structure of this uh this whole presentation is: first i'm gonna talk |
---|
0:01:25 | for about |
---|
0:01:26 | fifteen or twenty minutes |
---|
0:01:28 | just giving you a view, kind of from all sides, of this toolkit |
---|
0:01:32 | and then we're gonna allow people to escape in case they don't want to know more details, and then we'll have |
---|
0:01:36 | a short break |
---|
0:01:37 | and then |
---|
0:01:39 | Arnab and uh Ondrej are going to talk about uh |
---|
0:01:43 | some more code-level stuff, like |
---|
0:01:45 | Arnab is gonna talk about some of the acoustic modeling code |
---|
0:01:48 | and Ondrej will talk about the uh matrix library |
---|
0:01:51 | which is kind of independently useful |
---|
0:01:53 | uh |
---|
0:01:54 | speech |
---|
0:01:55 | and then after that |
---|
0:01:56 | uh |
---|
0:01:57 | i'm gonna go through some example scripts that we have, to try to get people |
---|
0:02:01 | more of a you know |
---|
0:02:02 | give people a sense of how to use it |
---|
0:02:06 | now |
---|
0:02:07 | or the next slide |
---|
0:02:09 | so |
---|
0:02:10 | some important aspects of the project: it's |
---|
0:02:13 | it's licensed under Apache v2.0, which is the |
---|
0:02:17 | kind of license that basically allows you to do anything you want with it |
---|
0:02:21 | there is only a uh |
---|
0:02:23 | an acknowledgement |
---|
0:02:25 | clause, which says you have to acknowledge that |
---|
0:02:27 | the code came from there, but |
---|
0:02:29 | that's |
---|
0:02:31 | it's one of the most open of the standard licenses |
---|
0:02:35 | uh |
---|
0:02:36 | the project is currently hosted on SourceForge, which is the |
---|
0:02:39 | standard place for these kinds of open-source projects |
---|
0:02:43 | uh |
---|
0:02:44 | we we it |
---|
0:02:45 | some toolkits are very closely associated with a particular institution |
---|
0:02:49 | our intention is for it to be more of a kind of |
---|
0:02:52 | thing that lives |
---|
0:02:54 | in the cloud, or on SourceForge |
---|
0:02:56 | i shouldn't have used that word, that's |
---|
0:02:58 | that's just gratuitous |
---|
0:03:00 | but yeah, the intention is for it not to just be uh |
---|
0:03:04 | the pet project of some particular little group but uh |
---|
0:03:07 | to represent |
---|
0:03:08 | the best of what's out there, and anyone can be a participant; as long as you can contribute |
---|
0:03:12 | code under |
---|
0:03:13 | this license then that's great |
---|
0:03:16 | uh |
---|
0:03:17 | it's basically a C++ toolkit |
---|
0:03:19 | the code compiles on native Windows |
---|
0:03:22 | and the common Unix platforms; we're not claiming that it compiles on, you know, |
---|
0:03:28 | other weird platforms, but it compiles fine on the normal ones |
---|
0:03:32 | a |
---|
0:03:34 | we have some documentation, not as much as HTK |
---|
0:03:37 | and we have example scripts |
---|
0:03:40 | these example scripts are uh |
---|
0:03:42 | they're just for Resource Management and uh |
---|
0:03:45 | Wall Street Journal |
---|
0:03:46 | but we're gonna have more, too |
---|
0:03:48 | they |
---|
0:03:50 | they basically run from the LDC disks |
---|
0:03:52 | so once you have the disks you can kind of point them to the disks |
---|
0:03:55 | and just |
---|
0:03:56 | get an idea of how it works |
---|
0:04:00 | so |
---|
0:04:04 | oh no i now i realise that we didn't look a large enough row |
---|
0:04:08 | i think i think we just have a tie this thing to uh aggressively |
---|
0:04:12 | if these were not guy |
---|
0:04:14 | uh |
---|
0:04:15 | yeah |
---|
0:04:16 | so |
---|
0:04:20 | okay, so now i'm gonna go through the kind of things that it supports; this is just the current features |
---|
0:04:24 | obviously |
---|
0:04:25 | we're intending to add a lot more |
---|
0:04:27 | so you can build a standard context-dependent uh |
---|
0:04:30 | LVCSR system |
---|
0:04:32 | you know, with tree clustering |
---|
0:04:34 | and it's been written in such a way that it supports arbitrary context sizes, so you can go |
---|
0:04:39 | to |
---|
0:04:39 | quinphone or what have you and it will uh |
---|
0:04:42 | work |
---|
0:04:43 | without pain |
---|
0:04:45 | both the training and the decoding are FST-based, and |
---|
0:04:49 | our code compiles against OpenFst |
---|
0:04:52 | for those of you who don't know, OpenFst is |
---|
0:04:55 | it's kind of like the AT&T toolset, but it's open source |
---|
0:04:58 | it's uh |
---|
0:04:59 | a project uh |
---|
0:05:00 | by Google and some others |
---|
0:05:04 | um |
---|
0:05:06 | we currently only have maximum likelihood training |
---|
0:05:09 | we haven't yet done lattice generation, but the |
---|
0:05:12 | timeline for adding discriminative training and lattice generation |
---|
0:05:15 | is |
---|
0:05:16 | this summer slash |
---|
0:05:18 | like |
---|
0:05:20 | uh |
---|
0:05:21 | we support all kinds of linear and affine transforms you can imagine |
---|
0:05:25 | not all of these |
---|
0:05:27 | necessarily involve uh |
---|
0:05:29 | you know, the regression tree version |
---|
0:05:30 | where you have |
---|
0:05:32 | multiple regression classes |
---|
0:05:34 | that's just because we |
---|
0:05:36 | are trying to avoid very complicated frameworks that would make it difficult to use |
---|
0:05:41 | so a lot of these just support a single transform |
---|
0:05:45 | we |
---|
0:05:45 | all of these things also have example scripts |
---|
0:05:48 | so it's not just something that's in the code that |
---|
0:05:51 | that we know works |
---|
0:05:51 | it's something that you can also |
---|
0:05:53 | get to |
---|
0:05:55 | so |
---|
0:05:57 | and, comparing it with other toolkits, just as a little disclaimer here |
---|
0:06:01 | that |
---|
0:06:02 | we're not claiming that other toolkits don't have any of these advantages too |
---|
0:06:07 | but uh |
---|
0:06:08 | we're aiming for clean code and modular design |
---|
0:06:12 | uh |
---|
0:06:13 | and by modular we probably mean something a little bit stronger than you would normally uh |
---|
0:06:19 | normally imagine; it's written in such a way that |
---|
0:06:22 | it's not only easy to combine the various things that are in there |
---|
0:06:25 | but it's easy to uh |
---|
0:06:27 | kind of extend arbitrarily |
---|
0:06:29 | and we have avoided the kind of code where |
---|
0:06:32 | when you add something |
---|
0:06:34 | a bunch of other bits of code have to know about what you added, and then you have to modify all |
---|
0:06:38 | kinds of |
---|
0:06:39 | you know |
---|
0:06:40 | all kinds of other |
---|
0:06:42 | and |
---|
0:06:44 | the license is a big uh |
---|
0:06:46 | advantage; i know that not a lot of uh |
---|
0:06:50 | toolkits have such a completely free license |
---|
0:06:53 | and, not that we really anticipate this being used for commercial purposes |
---|
0:06:57 | uh |
---|
0:06:58 | our understanding is that |
---|
0:06:59 | a lot of research groups |
---|
0:07:02 | as a matter of principle they won't |
---|
0:07:04 | use stuff that has a non-commercial license, because they say |
---|
0:07:07 | is this research, or is it commercial |
---|
0:07:10 | and now |
---|
0:07:11 | or of the license will |
---|
0:07:14 | uh |
---|
0:07:15 | have example scripts which were which were uh |
---|
0:07:19 | standing documentation |
---|
0:07:21 | and |
---|
0:07:22 | then this whole community-building thing: the people involved in Kaldi currently, uh |
---|
0:07:27 | it |
---|
0:07:28 | it's a group of people, mostly those |
---|
0:07:30 | who were at the previous workshops, so |
---|
0:07:33 | myself, Arnab, a bunch of guys from Brno |
---|
0:07:36 | and a few others |
---|
0:07:38 | case |
---|
0:07:39 | uh |
---|
0:07:40 | but we're open to new participants |
---|
0:07:43 | and uh |
---|
0:07:45 | what we're hoping for mainly is not just people who contribute a line or two |
---|
0:07:49 | of code but |
---|
0:07:50 | people who really want to understand the whole thing |
---|
0:07:53 | and can contribute a significant amount |
---|
0:07:56 | um |
---|
0:07:58 | the |
---|
0:07:59 | the toolkit is especially good for stuff that involves a lot of linear algebra |
---|
0:08:03 | it has a |
---|
0:08:04 | very good matrix library that Ondrej is going to talk about |
---|
0:08:07 | so if you want to do stuff that involves a lot of matrices and vectors |
---|
0:08:12 | H |
---|
0:08:13 | are |
---|
0:08:14 | do |
---|
0:08:15 | also uh |
---|
0:08:16 | of course we compile against the OpenFst library, so |
---|
0:08:20 | you can do FST stuff, you know, in the code |
---|
0:08:23 | uh |
---|
0:08:24 | it's built in |
---|
0:08:26 | a scalable way |
---|
0:08:27 | well |
---|
0:08:30 | it doesn't explicitly interact with any parallelization libraries |
---|
0:08:34 | it doesn't |
---|
0:08:36 | it doesn't interact with MapReduce or uh |
---|
0:08:39 | MPI, i think |
---|
0:08:40 | because we felt that that would just lock it into particular kinds of systems |
---|
0:08:44 | so uh |
---|
0:08:45 | but |
---|
0:08:46 | all the a |
---|
0:08:47 | it's been written in such a way that it should still work efficiently when everything is very large scale |
---|
0:08:53 | and you have a lot of data |
---|
0:08:55 | our intention is to implement all of the state-of-the-art methods |
---|
0:09:00 | for LVCSR, things like |
---|
0:09:01 | discriminative training |
---|
0:09:03 | a standard |
---|
0:09:04 | all of the standard adaptation |
---|
0:09:07 | uh |
---|
0:09:09 | but uh i think i say |
---|
0:09:11 | on the next slide |
---|
0:09:12 | uh |
---|
0:09:13 | something that we're not planning on doing in the immediate future |
---|
0:09:17 | is things like online decoding, which |
---|
0:09:19 | what i mean by that is uh |
---|
0:09:21 | where the data is coming in, say, from a microphone or telephone |
---|
0:09:25 | and it's some kind of interactive application |
---|
0:09:27 | because you could use it to do that, and building a decoder isn't that hard in this framework |
---|
0:09:32 | but uh |
---|
0:09:34 | our basic target audience is uh |
---|
0:09:37 | speech recognition researchers who want to work |
---|
0:09:40 | on the speech recognition itself |
---|
0:09:42 | other than |
---|
0:09:43 | rather than those who uh |
---|
0:09:47 | oh have a mock i was learning what everyone was looking at a multiscale to enter the room um and |
---|
0:09:52 | disrupted that's all right if very present |
---|
0:09:56 | i |
---|
0:09:56 | oh i |
---|
0:09:59 | i |
---|
0:10:00 | i |
---|
0:10:00 | okay |
---|
0:10:02 | i |
---|
0:10:04 | so i |
---|
0:10:06 | we set some people lately have uh |
---|
0:10:10 | it's become popular recently to do a kind of Python wrapper for C++ code |
---|
0:10:15 | the idea being that you can uh |
---|
0:10:17 | more easily write your scripts |
---|
0:10:19 | however we've avoided that approach |
---|
0:10:22 | partly because it's a hassle to do the wrapping |
---|
0:10:25 | and nobody ever understands how SWIG works |
---|
0:10:28 | partly because uh |
---|
0:10:30 | it just forces people to learn a new language and |
---|
0:10:32 | probably those who just want the wrappers think that everyone knows |
---|
0:10:36 | Python |
---|
0:10:37 | uh |
---|
0:10:39 | so |
---|
0:10:40 | we support the kind of |
---|
0:10:42 | flexibility and configurability of that in different ways |
---|
0:10:46 | but partly uh |
---|
0:10:48 | i think it'll become clear later, so perhaps we'll |
---|
0:10:53 | leave it till later |
---|
0:10:56 | so we don't have forward-backward training in there, and there are no immediate plans to do it |
---|
0:11:01 | i think some people like forward-backward for kind of religious reasons |
---|
0:11:05 | but uh |
---|
0:11:06 | i don't believe anyone has demonstrated that Viterbi is worse |
---|
0:11:10 | and it's just so convenient to use Viterbi |
---|
0:11:12 | for uh |
---|
0:11:14 | because you can write the alignments to disk compactly |
---|
0:11:17 | and then |
---|
0:11:18 | on |
---|
0:11:23 | really |
---|
0:11:24 | okay |
---|
0:11:27 | really interesting |
---|
0:11:29 | but i i i even even not let this as |
---|
0:11:32 | like |
---|
0:11:32 | just a single hypothesis |
---|
0:11:34 | makes it if |
---|
0:11:36 | okay |
---|
0:11:37 | so we'll have to think about that i mean it's not like it's really hard to do |
---|
0:11:40 | but it just wasn't something that we had planned |
---|
0:11:42 | uh |
---|
0:11:45 | one |
---|
0:11:47 | uh_huh |
---|
0:11:49 | oh okay |
---|
0:11:50 | well, it's at the state level |
---|
0:11:53 | but it's not really the |
---|
0:11:55 | by state i mean pdf |
---|
0:11:56 | index, but |
---|
0:11:57 | it's a little bit more precise than that, because uh |
---|
0:12:00 | if you just write out the state sequence it's fine for model training, but then |
---|
0:12:04 | if you want to work out the phone sequence, depending how the tree works |
---|
0:12:08 | it might not be implied by the state sequence, so then we have these identifiers that also contain the phone |
---|
0:12:13 | and the transition |
---|
0:12:15 | so it's an integer, a list of integers, but those |
---|
0:12:19 | integers |
---|
0:12:21 | are not quite the states, they're |
---|
0:12:22 | something that can be mapped to the state and also to the phone |
---|
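To make the answer above concrete, here is a minimal sketch of an alignment stored as a list of integers, where each integer can be mapped both to a pdf index and to a phone. The table layout, the field names and the phone labels are illustrative assumptions, not the toolkit's actual data structures.

```python
from itertools import groupby
from typing import NamedTuple


class TransitionInfo(NamedTuple):
    """What one hypothetical integer identifier stands for."""
    phone: str
    hmm_state: int
    pdf_index: int
    next_state: int


# Hypothetical table; the position in the list is the identifier.
# The second states of "aa" and "ae" share pdf 7, as could happen after
# tree clustering, so the pdf alone does not tell you the phone.
TABLE = [
    TransitionInfo("aa", 0, 3, 0),  # id 0: self-loop
    TransitionInfo("aa", 0, 3, 1),  # id 1: move to state 1
    TransitionInfo("aa", 1, 7, 1),  # id 2: self-loop
    TransitionInfo("aa", 1, 7, 2),  # id 3: leave the phone
    TransitionInfo("ae", 0, 4, 0),  # id 4: self-loop
    TransitionInfo("ae", 0, 4, 1),  # id 5: move to state 1
    TransitionInfo("ae", 1, 7, 1),  # id 6: self-loop
    TransitionInfo("ae", 1, 7, 2),  # id 7: leave the phone
]


def pdf_sequence(alignment):
    """Per-frame pdf indices: all that model training needs."""
    return [TABLE[i].pdf_index for i in alignment]


def phone_sequence(alignment):
    """Collapse runs of frames that carry the same phone label.

    This simple version merges two adjacent instances of the same phone;
    separating them would need the transition information as well.
    """
    return [phone for phone, _ in groupby(TABLE[i].phone for i in alignment)]


alignment = [0, 1, 2, 3, 5, 6, 7]
print(pdf_sequence(alignment))
print(phone_sequence(alignment))
```

The point of the sketch is only that the same compact list of integers serves both purposes; how the real identifiers are numbered is not stated in the talk.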
0:12:28 | uh |
---|
0:12:30 | so i'm just gonna describe a how this |
---|
0:12:33 | came to be: we had this workshop in two thousand nine |
---|
0:12:36 | a lot of uh focus was on |
---|
0:12:38 | SGMMs |
---|
0:12:41 | um |
---|
0:12:42 | the software we were using: some guys from Brno University of Technology |
---|
0:12:47 | including Ondrej Glembek and others |
---|
0:12:49 | uh |
---|
0:12:50 | they built this |
---|
0:12:51 | uh infrastructure for uh |
---|
0:12:54 | for training SGMMs; it was written in C++ but it relied on the HTK |
---|
0:12:58 | system |
---|
0:12:59 | and i also built an FST-based decoder |
---|
0:13:02 | so that we could decode with our own C++ code, with access to the matrix library |
---|
0:13:07 | um |
---|
0:13:08 | so we're kind of calling that proto-Kaldi |
---|
0:13:12 | and we wanted to release that |
---|
0:13:14 | recipe |
---|
0:13:15 | you know, in some kind of open-source way, but we realised that |
---|
0:13:19 | the recipe was just too hard to encapsulate, because it had HTK, it had our stuff |
---|
0:13:24 | it had a lot of scripts |
---|
0:13:26 | so we wanted to create something that |
---|
0:13:29 | could support this stuff and was easy to encapsulate, so we uh |
---|
0:13:35 | uh |
---|
0:13:37 | the next summer we wrote an entirely new toolkit, that is |
---|
0:13:41 | you know that we that |
---|
0:13:43 | we wanted everything to be clean and unified |
---|
0:13:45 | and to have a nice new shiny C++ |
---|
0:13:48 | speech recognition toolkit |
---|
0:13:51 | i think that's the uh |
---|
0:13:53 | i think that's this a |
---|
0:13:55 | slide; uh, last summer |
---|
0:13:57 | in two thousand ten, we had another workshop in Brno |
---|
0:14:00 | where we uh |
---|
0:14:02 | did a lot of coding |
---|
0:14:03 | and the vision at that time, which i now realise was very unrealistic |
---|
0:14:07 | was that we |
---|
0:14:08 | we'd have a complete working system with example scripts |
---|
0:14:12 | you know, by the end of the summer |
---|
0:14:14 | but that kind of didn't really materialise; we had a lot of pieces |
---|
0:14:18 | but we didn't really have a complete working system, so |
---|
0:14:22 | after uh |
---|
0:14:24 | i felt kind of obligated to |
---|
0:14:26 | you know |
---|
0:14:27 | finish the system, and i had help from others, i think especially Arnab |
---|
0:14:31 | in doing a lot of coding after that |
---|
0:14:34 | so |
---|
0:14:37 | uh |
---|
0:14:40 | when we go to the next slide |
---|
0:14:42 | it's only been officially released something like last week |
---|
0:14:46 | that's when we actually uh got all the legal approvals and |
---|
0:14:49 | put it up on SourceForge |
---|
0:14:51 | this is just a list of the people; i don't think i'm gonna go through all the names |
---|
0:14:55 | this is the list of all the people who have written uh code specifically for Kaldi |
---|
0:14:59 | that's the list of people who've done various other things or helped in various ways |
---|
0:15:04 | and |
---|
0:15:05 | uh |
---|
0:15:07 | i would describe exactly what each one did, but i'm kind of scared i've left someone off one of these |
---|
0:15:12 | lists |
---|
0:15:13 | and |
---|
0:15:14 | i i i i |
---|
0:15:15 | i'll just let you read it |
---|
0:15:18 | um |
---|
0:15:20 | a lot of these people |
---|
0:15:21 | have some connection to Brno University of Technology |
---|
0:15:25 | oh people but the in uh or |
---|
0:15:28 | like that |
---|
0:15:30 | so that this is a |
---|
0:15:32 | this is is a rather messy diagram |
---|
0:15:34 | i i just wanted |
---|
0:15:36 | i want to give you some idea of what the dependency structure of Kaldi is, but i decided to put |
---|
0:15:40 | side information in here too, so |
---|
0:15:42 | the area of these uh |
---|
0:15:45 | of these rectangles is roughly proportional to how many lines of code |
---|
0:15:49 | there are |
---|
0:15:50 | so |
---|
0:15:51 | these are the things that we compile against |
---|
0:15:54 | so OpenFst is a C++ library |
---|
0:15:57 | uh |
---|
0:15:59 | ATLAS, CLAPACK: that refers to the math libraries that we compile against |
---|
0:16:03 | uh |
---|
0:16:06 | and |
---|
0:16:06 | and the rough dependency structure is that things on top depend on the things below them, but |
---|
0:16:10 | it's very approximate |
---|
0:16:11 | so |
---|
0:16:13 | for instance, these are various |
---|
0:16:14 | FST algorithms that we've extended OpenFst with |
---|
0:16:18 | uh |
---|
0:16:19 | stuff relating to tree clustering, for decision trees |
---|
0:16:23 | uh |
---|
0:16:24 | stuff relating to HMM topology |
---|
0:16:27 | decoders |
---|
0:16:29 | language modeling things; this is a small box because really all it does is uh |
---|
0:16:34 | compile an ARPA language model into an |
---|
0:16:37 | FST |
---|
0:16:38 | uh |
---|
0:16:41 | utils: this is mostly I/O stuff, and various frameworks for I/O |
---|
0:16:45 | that |
---|
0:16:46 | will be explained later on, kind of after the break |
---|
0:16:49 | so we can allow people to |
---|
0:16:50 | escape |
---|
0:16:51 | this is the matrix library, so this |
---|
0:16:54 | a lot of this is just wrappers for stuff that's here |
---|
0:16:57 | i don't know if any of you are familiar with |
---|
0:16:59 | with CLAPACK and BLAS and those things |
---|
0:17:02 | but they're C libraries that |
---|
0:17:04 | for a C++ programmer are slightly painful to work with, because they have all of these arguments like the |
---|
0:17:09 | rows, the columns |
---|
0:17:10 | the stride |
---|
0:17:11 | and the thing you want to do is this very long line of code |
---|
0:17:14 | and uh |
---|
0:17:16 | so there's no notion of, like, a matrix as an object |
---|
0:17:18 | so this kind of adds that abstraction, and it is significantly easier to use |
---|
0:17:23 | then of this make the |
---|
0:17:26 | uh |
---|
0:17:27 | this is feature |
---|
0:17:29 | preprocessing, you know |
---|
0:17:30 | going from a WAV file to MFCCs, that's there |
---|
0:17:34 | uh Gaussian mixture models, diagonal and full |
---|
0:17:39 | subspace Gaussian mixture models, this is |
---|
0:17:41 | the reason might talk |
---|
0:17:42 | the |
---|
0:17:44 | linear transforms |
---|
0:17:45 | things like fMLLR, MLLR, STC |
---|
0:17:49 | HLDA |
---|
0:17:50 | things of that nature |
---|
0:17:52 | VTLN is in here too, kind of the |
---|
0:17:54 | linear form of VTLN |
---|
0:17:57 | uh |
---|
0:17:57 | all of these things up here, these are kind of, you know, directories that contain |
---|
0:18:01 | command line programs; that tells you a bit about the structure of the toolkit, which is that we have |
---|
0:18:06 | literally more than a hundred command line programs |
---|
0:18:09 | and each one does a fairly specific thing |
---|
0:18:12 | we wanted to avoid this phenomenon where you have a program that kind of allegedly does one thing |
---|
0:18:17 | but really is controlled by a million options |
---|
0:18:20 | and has rather complicated behavior depending which options you give it |
---|
0:18:24 | so this is part of the mechanism that we use to ensure |
---|
0:18:27 | uh |
---|
0:18:28 | that everything's configurable and easy to understand |
---|
0:18:31 | there's no Python layer, but there's a lot of uh |
---|
0:18:34 | programs with simple functions |
---|
0:18:37 | and on top of this |
---|
0:18:38 | is the |
---|
0:18:39 | shell scripts |
---|
0:18:40 | so to do an actual system-building recipe |
---|
0:18:45 | what our example scripts currently do is, it's a bash script |
---|
0:18:48 | that, you know, has a bunch of variables in bash to keep track of the iteration and stuff |
---|
0:18:53 | and it runs the jobs |
---|
0:18:55 | by invoking them |
---|
0:18:56 | from the command line |
---|
0:18:57 | there are different ways you could do this, if you love Perl or Python or whatever you |
---|
0:19:01 | as to i |
---|
0:19:02 | but that's how our scripts work |
---|
0:19:04 | and and something that |
---|
0:19:06 | i really haven't included on this diagram, but it's kind of part of the |
---|
0:19:10 | dependency structure, is some |
---|
0:19:12 | tools that we rely on, so |
---|
0:19:15 | uh |
---|
0:19:17 | for language modeling |
---|
0:19:18 | by default we use IRSTLM, just because of license issues, but probably you'd want to use |
---|
0:19:23 | SRILM if you |
---|
0:19:25 | want to do a lot of language modeling stuff |
---|
0:19:27 | uh things like sph2pipe |
---|
0:19:29 | to |
---|
0:19:30 | to uh |
---|
0:19:31 | interpret data from the LDC |
---|
0:19:34 | and so on, so |
---|
0:19:35 | we actually have a |
---|
0:19:37 | and of can |
---|
0:19:38 | an installation script that will automatically obtain those things so the scripts can run |
---|
0:19:43 | without you having to manually install stuff |
---|
0:19:46 | on your system |
---|
0:19:48 | so i'm just gonna |
---|
0:19:49 | briefly summarise the matrix library; Ondrej will be talking more about it later, but the plan was |
---|
0:19:55 | to allow people to escape after this initial segment |
---|
0:19:58 | in case they're not that devoted that they want to hear about this stuff |
---|
0:20:01 | but uh |
---|
0:20:03 | as i said, it's a C++ wrapper for BLAS and CLAPACK |
---|
0:20:07 | and we've |
---|
0:20:07 | well, i should say Ondrej has gone to a lot of trouble to ensure that it can compile |
---|
0:20:12 | in the various |
---|
0:20:13 | different configurations of what |
---|
0:20:16 | libraries you have on your system |
---|
0:20:18 | so it can either work from BLAS plus CLAPACK |
---|
0:20:21 | or from ATLAS, or using |
---|
0:20:23 | Intel's MKL |
---|
0:20:25 | the reason is that on some systems you might have one but not the other |
---|
0:20:29 | ATLAS is an implementation of BLAS that's |
---|
0:20:32 | kind of optimized to the specific hardware |
---|
0:20:35 | automatically |
---|
0:20:37 | is is generally a more |
---|
0:20:39 | so |
---|
0:20:40 | the code that we've wrapped |
---|
0:20:41 | includes |
---|
0:20:42 | generic matrices, like square matrices |
---|
0:20:46 | also packed symmetric matrices, where you uh |
---|
0:20:50 | have a symmetric matrix and you only store the lower triangle |
---|
0:20:53 | and it's in this, this, this |
---|
0:20:56 | order |
---|
0:20:57 | and uh packed triangular matrices |
---|
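The packed symmetric format mentioned above can be sketched in a few lines. This is an illustration of the general idea of storing only the lower triangle, row by row, in one flat array; the class name and methods are made up for the example, and the exact element order used on the speaker's slide is an assumption here.

```python
class PackedSymmetric:
    """Symmetric dim x dim matrix that stores only its lower triangle.

    Elements are laid out row by row: (0,0), (1,0), (1,1), (2,0), ...
    so dim * (dim + 1) / 2 numbers are stored instead of dim * dim.
    """

    def __init__(self, dim):
        self.dim = dim
        self.data = [0.0] * (dim * (dim + 1) // 2)

    @staticmethod
    def _index(row, col):
        # Reflect upper-triangle requests into the lower triangle.
        if col > row:
            row, col = col, row
        # Rows 0..row-1 hold 1 + 2 + ... + row elements before this row.
        return row * (row + 1) // 2 + col

    def get(self, row, col):
        return self.data[self._index(row, col)]

    def set(self, row, col, value):
        self.data[self._index(row, col)] = value


m = PackedSymmetric(3)
m.set(2, 0, 5.0)
print(m.get(0, 2))   # same stored element as (2, 0)
print(len(m.data))   # 6 stored values for a 3 x 3 matrix
```

The saving is roughly a factor of two in memory, which matters for objects such as full covariance matrices that are symmetric by construction.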
0:20:59 | there are other formats that BLAS and CLAPACK support, but these are the ones that we thought were |
---|
0:21:04 | most |
---|
0:21:04 | applicable to |
---|
0:21:06 | speech processing; like, we don't have a lot of sparse matrices |
---|
0:21:10 | in traditional |
---|
0:21:13 | so |
---|
0:21:14 | this uh matrix library also includes things like SVD and FFT |
---|
0:21:18 | FFT isn't supplied by any of those libraries, but we uh |
---|
0:21:22 | we uh got permission from Rico Malvar at Microsoft |
---|
0:21:25 | to uh |
---|
0:21:26 | use his code |
---|
0:21:27 | so he has a good one |
---|
0:21:30 | um |
---|
0:21:32 | something about the matrix library: even if you don't buy into the whole toolkit |
---|
0:21:36 | if you need a C++ matrix library, it's probably uh |
---|
0:21:40 | it's probably quite good; in fact it's surprising that there doesn't seem to be a lot out there |
---|
0:21:45 | that fills this niche; there's Boost |
---|
0:21:48 | but it's a rather weird library and i don't think a lot of people like it |
---|
0:21:55 | um |
---|
0:21:57 | okay, a few words about OpenFst |
---|
0:22:00 | so i assume everyone here knows what FSTs are |
---|
0:22:03 | AT&T had this command line toolkit |
---|
0:22:06 | but i don't believe they ever released |
---|
0:22:07 | the source |
---|
0:22:08 | so when some of those guys went to Google they decided to have one that was uh |
---|
0:22:12 | open source, and it's Apache-licensed |
---|
0:22:15 | that's part of the reason we went for the Apache license |
---|
0:22:18 | because we figured that |
---|
0:22:20 | if we use OpenFst there's no real point in having a |
---|
0:22:23 | different license, because it just gives the lawyers a headache |
---|
0:22:26 | so we went for the same one |
---|
0:22:28 | ah |
---|
0:22:30 | so yeah |
---|
0:22:31 | we compile against it, so for instance the decoder |
---|
0:22:35 | it doesn't use, like, a special decoding graph format |
---|
0:22:38 | it uses the uh same in-memory structures as OpenFst |
---|
0:22:43 | and by the way OpenFst has a lot of templates and stuff, so |
---|
0:22:47 | there's not just one FST format, there's a lot of them |
---|
0:22:49 | so if you wanted to you could uh |
---|
0:22:52 | kind of template your decoder on some fancy format that would be, let's say, compact or dynamically expanded or something |
---|
0:22:58 | like that |
---|
0:22:59 | we're not gonna go into that in detail today |
---|
0:23:02 | so we actually implemented various extensions to OpenFst |
---|
0:23:07 | some of our recipes are perhaps not totally in the spirit of OpenFst, because |
---|
0:23:12 | those guys have a particular recipe that they do |
---|
0:23:15 | and ours is just a little bit different |
---|
0:23:20 | later on i can explain why |
---|
0:23:21 | i feel that there are good reasons for it; i don't know if those guys would agree |
---|
0:23:26 | uh |
---|
0:23:29 | so |
---|
0:23:31 | a few words about I/O |
---|
0:23:33 | it was a controversial decision among the group to use C++ streams |
---|
0:23:38 | in the end we decided to do it, partly because OpenFst also does it |
---|
0:23:42 | uh |
---|
0:23:43 | you know, a lot of people prefer C-based I/O |
---|
0:23:46 | but we do this |
---|
0:23:48 | uh |
---|
0:23:48 | we support binary and text-mode formats, a little bit like HTK, so that each |
---|
0:23:53 | object in the toolkit |
---|
0:23:55 | has a function that will |
---|
0:23:57 | write, and it takes a little argument, binary or text |
---|
0:24:00 | so it'll just |
---|
0:24:01 | put its data out on the stream in binary or text mode |
---|
0:24:05 | and each object also has a read function that does the same thing |
---|
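The read and write convention described above, where each object serialises itself to a stream and a flag selects binary or text mode, can be sketched as follows. The class, its method names and both on-disk layouts are assumptions made for illustration; the toolkit itself is C++ and its real formats are not given in the talk.

```python
import io
import struct


class Vector:
    """Toy object with paired write and read functions and a binary flag."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def write(self, stream, binary):
        if binary:
            # Made-up layout: 4-byte length, then little-endian doubles.
            stream.write(struct.pack("<i", len(self.values)))
            stream.write(struct.pack("<%dd" % len(self.values), *self.values))
        else:
            # Made-up layout: one bracketed, space-separated line.
            text = "[ " + " ".join(repr(v) for v in self.values) + " ]\n"
            stream.write(text.encode("ascii"))

    @classmethod
    def read(cls, stream, binary):
        if binary:
            (count,) = struct.unpack("<i", stream.read(4))
            return cls(struct.unpack("<%dd" % count, stream.read(8 * count)))
        tokens = stream.readline().decode("ascii").split()
        return cls(tokens[1:-1])  # drop the surrounding brackets


for binary in (True, False):
    buffer = io.BytesIO()
    Vector([1.5, -2.0, 0.25]).write(buffer, binary)
    buffer.seek(0)
    print(binary, Vector.read(buffer, binary).values)
```

Keeping both modes behind the same two functions means calling code never needs to know which format is on disk beyond passing the flag along.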
0:24:08 | so |
---|
0:24:09 | ah |
---|
0:24:11 | it's a standard thing in many toolkits that you extend filenames in various ways |
---|
0:24:15 | like, this can mean the standard input or standard output |
---|
0:24:18 | this is just a command |
---|
0:24:20 | and this is how it knows that it's |
---|
0:24:22 | a command |
---|
0:24:23 | uh |
---|
0:24:24 | this is the |
---|
0:24:25 | an offset into a file, meaning it will |
---|
0:24:28 | open the file and fseek to that position |
---|
0:24:31 | this is uh useful for reasons that will be described later |
---|
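The slide being pointed at is not in the transcript, so the following is only a sketch of how such extended input filenames might be told apart. The three spellings used here (a lone `-` for standard input, a trailing `|` for a command whose output is read, and `name:offset` for a byte position inside a file) are assumptions, and the function name is hypothetical.

```python
def classify_input_name(name):
    """Return (kind, detail) for an extended input filename.

    Assumed spellings, for illustration only:
      "-"               -> standard input
      "some command |"  -> run the command and read its output
      "file:1234"       -> open the file and seek to byte 1234
      anything else     -> an ordinary file
    """
    name = name.strip()
    if name == "-":
        return ("stdin", None)
    if name.endswith("|"):
        return ("pipe", name[:-1].strip())
    head, sep, tail = name.rpartition(":")
    if sep and head and tail.isdigit():
        return ("offset", (head, int(tail)))
    return ("file", name)


for example in ["-", "gunzip -c feats.gz |", "feats.ark:1234", "feats.ark"]:
    print(example, "->", classify_input_name(example))
```

The benefit described in the talk is that a program written against this one convention can read from a file, a pipe or the middle of a larger file without any extra logic of its own.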
0:24:36 | uh |
---|
0:24:39 | so this archive format is a |
---|
0:24:41 | quite fundamental part of the way uh |
---|
0:24:44 | Kaldi works |
---|
0:24:45 | and i think |
---|
0:24:46 | i'm gonna describe this more later in another talk, but the basic concept is |
---|
0:24:51 | you have a collection of objects, let's imagine that they're matrices |
---|
0:24:55 | and they are indexed by a string |
---|
0:24:58 | where the string might be, let's say, an utterance id |
---|
0:25:01 | so you want to have some way to |
---|
0:25:04 | to access this collection of uh |
---|
0:25:06 | strings and matrices |
---|
0:25:09 | and there might be a couple of different ways you could do that: you might want to go sequentially |
---|
0:25:12 | through them |
---|
0:25:13 | as in accumulation or something |
---|
0:25:15 | or you might want to do random access |
---|
0:25:17 | so there's a whole framework for doing this |
---|
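A minimal sketch of the two access patterns just described: a collection of objects keyed by utterance id, read either sequentially or by random access. The one-line-per-utterance text layout and the function names are assumptions for illustration, and short vectors stand in for the matrices mentioned in the talk.

```python
import io


def write_archive(stream, items):
    """Write (key, values) pairs, one utterance per line."""
    for key, values in items:
        line = key + " " + " ".join(repr(float(v)) for v in values) + "\n"
        stream.write(line.encode("ascii"))


def read_sequential(stream):
    """Yield (key, values) in file order, e.g. for accumulating statistics."""
    for line in stream:
        key, *fields = line.decode("ascii").split()
        yield key, [float(x) for x in fields]


def build_index(stream):
    """Map each key to the byte offset where its entry starts."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return index
        index[line.split()[0].decode("ascii")] = offset


def read_at(stream, offset):
    """Random access: seek to a stored offset and read one entry."""
    stream.seek(offset)
    key, *fields = stream.readline().decode("ascii").split()
    return key, [float(x) for x in fields]


archive = io.BytesIO()
write_archive(archive, [("utt1", [1, 2]), ("utt2", [3.5])])

archive.seek(0)
print(list(read_sequential(archive)))

archive.seek(0)
index = build_index(archive)
print(read_at(archive, index["utt2"]))
```

The calling program only sees keys and objects; whether the data arrives by a sequential scan or by a seek to a recorded offset is hidden inside the helper functions.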
0:25:20 | uh |
---|
0:25:21 | basically the reason is so that your |
---|
0:25:23 | most of the Kaldi code doesn't have to worry about |
---|
0:25:27 | things like opening files and error conditions, and |
---|
0:25:30 | you know, there doesn't have to be a lot of logic about that in the command line programs, because it's |
---|
0:25:34 | all handled by some |
---|
0:25:36 | generic framework |
---|
0:25:37 | but apart from this we tried to avoid |
---|
0:25:39 | generic frameworks |
---|
0:25:42 | ah |
---|
0:25:44 | the tree-building and clustering code |
---|
0:25:46 | it's based on |
---|
0:25:47 | very generic |
---|
0:25:49 | clustering mechanisms, something like |
---|
0:25:51 | i guess hard to model whatever they call it |
---|
0:25:53 | uh |
---|
0:25:54 | so that internal code doesn't assume a lot about what your tree is |
---|
0:25:59 | it is suitable to build decision trees in different ways, including |
---|
0:26:02 | like sharing the tree roots |
---|
0:26:04 | and asking questions about the central phone |
---|
0:26:07 | things like that |
---|
0:26:08 | um |
---|
0:26:10 | it's very scalable to wide context, for example quinphone |
---|
0:26:13 | i know a lot of the |
---|
0:26:16 | it's hard to write code that will scale to quinphone, because if you have to enumerate all of |
---|
0:26:20 | the contexts |
---|
0:26:22 | it's kind of hard to go to |
---|
0:26:24 | a but uh |
---|
0:26:25 | we basically avoid ever enumerating those contexts |
---|
0:26:29 | uh as an example of a |
---|
0:26:30 | how we make use of this generality |
---|
0:26:33 | in the wall street journal recipe we uh |
---|
0:26:35 | we increased the phone set so that we're asking about the phone position and the stress |
---|
0:26:41 | i |
---|
0:26:41 | a "'cause" the know he's to K supports this "'cause" i thing you had a |
---|
0:26:44 | have a paper marked with |
---|
0:26:45 | he about doing that |
---|
0:26:47 | so uh |
---|
0:26:48 | but uh if the phone set were much larger than that probably |
---|
0:26:52 | an approach based on enumeration of context would start |
---|
0:26:55 | i |
---|
0:26:57 | you don't think so no i mean like it was a thousand thousand keep this day |
---|
0:27:03 | right |
---|
0:27:04 | okay well i |
---|
0:27:06 | okay |
---|
0:27:09 | uh |
---|
0:27:10 | okay hmm and transition modeling code |
---|
0:27:13 | so |
---|
0:27:14 | we've |
---|
0:27:15 | we try to have an approach where |
---|
0:27:17 | a piece of code only needs to know |
---|
0:27:20 | the minimum it needs to know |
---|
0:27:21 | so the hmm and transition modeling code doesn't really have any notion of a pdf it's purely |
---|
0:27:27 | it purely does what it needs to do |
---|
0:27:30 | and the rest is separate |
---|
0:27:32 | so |
---|
0:27:32 | this is probably pretty standard approach you you develop a uh |
---|
0:27:36 | you specify a prototype topology |
---|
0:27:38 | a topology for each phone that is how many states and what the transitions are |
---|
0:27:43 | uh |
---|
0:27:44 | and we make the transitions the |
---|
0:27:46 | separate depending on the uh |
---|
0:27:49 | depending on the pdf |
---|
0:27:50 | so that if the pdfs in two states are different then the transitions out of those |
---|
0:27:54 | states are separately estimated |
---|
0:27:56 | this is just the most |
---|
0:27:58 | specific way that you can estimate the transitions without having your |
---|
0:28:02 | decoding graph blow up |
---|
0:28:04 | it's not really clear that this matters but |
---|
0:28:07 | uh |
---|
0:28:08 | we just felt that it was that we should do the best we could on |
---|
0:28:12 | uh |
---|
0:28:13 | there are mechanisms for turning these hmms into fsts because |
---|
0:28:17 | all of the training and decoding is fst based so you kind of have to have an fst representation of these |
---|
0:28:24 | uh |
---|
0:28:25 | it's is something that we touched on a a |
---|
0:28:28 | in our fsts, so what you would normally imagine is that the fst has input symbols |
---|
0:28:32 | that are the |
---|
0:28:33 | the pdfs so some symbol that represents the pdf and the output symbols are the words |
---|
0:28:39 | but the problem with that is let's suppose you uh |
---|
0:28:43 | you want to find out what the phone sequence |
---|
0:28:45 | it's all well and good if your |
---|
0:28:48 | if each of your phones had a separate tree |
---|
0:28:51 | so that it was clear for each state which phone it belonged to |
---|
0:28:55 | but what if you had a larger phone set and you wanted to have a shared tree root |
---|
0:28:59 | and there wasn't you know a one to one mapping |
---|
0:29:01 | or there wasn't the mapping you need |
---|
0:29:04 | so we have input labels on the fsts that encode a bit more information |
---|
0:29:08 | uh |
---|
0:29:10 | and this is also useful in training the transitions because |
---|
0:29:12 | sometimes just the pdf labels wouldn't give you quite enough information |
---|
0:29:17 | to train the transitions |
---|
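A minimal sketch of why an input label that carries more than the pdf is useful. This is not the toolkit's actual label scheme: the tuple fields, the `TransitionInfo` name and the helper functions are hypothetical, chosen only to show that two phones sharing a pdf can still be told apart when each label maps back to a richer record.

```python
from typing import Dict, Iterable, List, NamedTuple


class TransitionInfo(NamedTuple):
    """Hypothetical record behind one FST input label."""
    phone: int      # phone this HMM state belongs to
    hmm_state: int  # position of the state in the phone's topology
    pdf_id: int     # index of the (possibly shared) pdf
    arc_index: int  # which transition out of the state


def build_label_table(entries: Iterable[TransitionInfo]) -> Dict[TransitionInfo, int]:
    """Give every distinct record an integer label, starting at 1
    (0 is conventionally reserved for epsilon in FSTs)."""
    table: Dict[TransitionInfo, int] = {}
    for info in entries:
        if info not in table:
            table[info] = len(table) + 1
    return table


def phones_for_labels(labels: List[int],
                      inverse: Dict[int, TransitionInfo]) -> List[int]:
    """Look up the phone for each label; a bare pdf id could not do this
    when the pdf is shared between phones."""
    return [inverse[label].phone for label in labels]


# Two phones whose first state shares pdf 7 (as with a shared tree root).
entries = [
    TransitionInfo(phone=1, hmm_state=0, pdf_id=7, arc_index=0),
    TransitionInfo(phone=2, hmm_state=0, pdf_id=7, arc_index=0),
]
table = build_label_table(entries)
inverse = {label: info for info, label in table.items()}
```

With labels like these, the phone sequence and the transition counts can both be read off a label sequence, while the acoustic score still only needs `pdf_id`.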
0:29:20 | uh |
---|
0:29:21 | there's a couple of different ways to create decoding graphs |
---|
0:29:24 | for |
---|
0:29:24 | for uh training purposes you have to create a lot of these things at the same time |
---|
0:29:29 | and combining the fst algorithms using script |
---|
0:29:32 | would be quite inefficient because you have the overhead of process creation |
---|
0:29:37 | so |
---|
0:29:37 | we uh |
---|
0:29:39 | we call the openfst algorithms at the C plus plus level and combine them together |
---|
0:29:43 | so that uh |
---|
0:29:45 | you can create your decoding graphs for |
---|
0:29:47 | training |
---|
0:29:49 | uh |
---|
0:29:50 | and and we typically put them in one of these archive |
---|
0:29:54 | like basically a big file concatenated together with little keys in it |
---|
0:29:58 | on disk |
---|
0:29:59 | so that you don't have the I O of |
---|
0:30:01 | accessing hundreds of little files |
---|
0:30:03 | training uses the viterbi path through |
---|
0:30:05 | these graphs |
---|
0:30:07 | uh |
---|
0:30:08 | for test time |
---|
0:30:09 | we we we didn't we didn't use this approach of C plus plus because it there's just no point |
---|
0:30:14 | we uh |
---|
0:30:16 | it's basically scripts and i'm gonna go through one of the scripts later |
---|
0:30:21 | um |
---|
0:30:23 | at least the scripts that create the decoding graph call some openfst tools but also some of our own |
---|
0:30:28 | and that relates partly to a difference in recipes |
---|
0:30:31 | but uh |
---|
0:30:32 | i'll talk more about later |
---|
0:30:34 | after the break |
---|
0:30:36 | so |
---|
0:30:37 | and i was gonna talk later about some of the acoustic modeling code |
---|
0:30:41 | i'm just gonna give a brief summary |
---|
0:30:43 | our gmm code is |
---|
0:30:45 | it's very simple it's not part of some big framework |
---|
0:30:47 | it's kind of like an |
---|
0:30:49 | an object that has you know the means the variances |
---|
0:30:52 | it can evaluate likelihoods if you give it the features |
---|
0:30:55 | but it doesn't |
---|
0:30:56 | inherit from some |
---|
0:30:58 | generic acoustic model class and it doesn't attempt |
---|
0:31:01 | to kind of know about things like linear transforms it just sits there |
---|
0:31:04 | and things like linear transforms |
---|
0:31:07 | they kind of have to access the model and do what they want |
---|
0:31:10 | the the reason for that is that if |
---|
0:31:12 | the gmm knows too much |
---|
0:31:14 | then whenever you do something that's fancy |
---|
0:31:16 | you have to then change the gmm code |
---|
0:31:19 | and it just |
---|
0:31:21 | it's not a nice situation |
---|
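A rough sketch of the kind of standalone object described here: a diagonal-covariance GMM that holds weights, means and variances and evaluates a log-likelihood for one feature vector, with no base class and no knowledge of transforms. The class name `DiagGmm` and its method are illustrative assumptions, not the toolkit's C++ API.

```python
import math
from typing import List


class DiagGmm:
    """A plain container for a diagonal-covariance Gaussian mixture."""

    def __init__(self, weights: List[float],
                 means: List[List[float]],
                 variances: List[List[float]]) -> None:
        self.weights = weights
        self.means = means
        self.variances = variances

    def log_likelihood(self, x: List[float]) -> float:
        """Return log sum_m w_m N(x; mu_m, diag(var_m))."""
        per_component = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            per_component.append(ll)
        # Log-sum-exp over components, shifted by the max for stability.
        top = max(per_component)
        return top + math.log(sum(math.exp(c - top) for c in per_component))


# A collection of GMMs is then just a list indexed by an integer pdf index.
acoustic_model = [
    DiagGmm([1.0], [[0.0]], [[1.0]]),
    DiagGmm([0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]]),
]
```

Because the object exposes its parameters directly, code that estimates or applies a transform can read the means and variances itself instead of requiring changes inside the GMM class.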
0:31:24 | so uh |
---|
0:31:26 | yeah we have a separate class for gmm stats accumulation |
---|
0:31:29 | and doing the update |
---|
0:31:31 | so |
---|
0:31:32 | for a collection of gmms like in an hmm gmm system |
---|
0:31:36 | we have a class that pretty much behaves like a vector of gmms |
---|
0:31:41 | so it's |
---|
0:31:42 | it's a fairly simple thing |
---|
0:31:43 | there's no notion of the name of a state it's just an integer |
---|
0:31:47 | and really we've avoided having |
---|
0:31:50 | like names for things in the code |
---|
0:31:52 | it's all |
---|
0:31:53 | integers |
---|
0:31:54 | uh_huh |
---|
0:31:57 | oh this lower case vector just refers to the S T L vector |
---|
0:32:01 | but there is an upper case vector too |
---|
0:32:03 | but that's something in the matrix library |
---|
0:32:06 | i |
---|
0:32:08 | well the code is never been case in that as the code we |
---|
0:32:12 | i i even on windows |
---|
0:32:14 | uh |
---|
0:32:15 | i |
---|
0:32:19 | yeah |
---|
0:32:20 | okay |
---|
0:32:21 | we've got quite a lot of linear transform code |
---|
0:32:24 | uh |
---|
0:32:26 | lda hlda |
---|
0:32:28 | again i'm sitting on the fence with regard to the naming of this technique |
---|
0:32:32 | i don't wanna and anyway |
---|
0:32:33 | i |
---|
0:32:35 | uh |
---|
0:32:36 | another these multi name okay |
---|
0:32:38 | uh a linear version of vtln i mean we tried regular vtln |
---|
0:32:42 | yeah everyone knows that it's kind of tricky to get it to work |
---|
0:32:45 | it was the linear one that worked better in the end |
---|
0:32:47 | uh it is something new that |
---|
0:32:50 | it's a kind of a replacement for vtln that works a little bit better |
---|
0:32:53 | i gonna |
---|
0:32:54 | explain what it is uh at a later date |
---|
0:32:57 | mllr |
---|
0:32:58 | uh |
---|
0:32:59 | a lot of this |
---|
0:32:59 | so |
---|
0:33:02 | when these transforms are global the way we handle them is |
---|
0:33:06 | it just becomes part of the feature space |
---|
0:33:09 | so it's just |
---|
0:33:09 | stored as a matrix on disk and this |
---|
0:33:12 | we use a lot of pipes so the way it actually works is that this matrix |
---|
0:33:15 | is multiplied by the features as part of a pipe |
---|
0:33:18 | it might seem like, you're right, obviously this is a silly way to do it from a computational point of view but |
---|
0:33:23 | it just makes the scripts really convenient |
---|
0:33:25 | to uh |
---|
0:33:26 | do |
---|
0:33:27 | uh |
---|
0:33:29 | yeah so when i say they're applied in a unified way what i mean is that the code that |
---|
0:33:33 | estimates any of these transforms |
---|
0:33:35 | really outputs just a matrix |
---|
0:33:37 | so uh |
---|
0:33:38 | you know there's no like |
---|
0:33:40 | and some a lot transform J |
---|
0:33:43 | that's just |
---|
0:33:44 | well okay yeah there is so for the uh regression tree one |
---|
0:33:47 | i |
---|
0:33:48 | but for but for the global one it's just it's just a matrix |
---|
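As a sketch of what "it's just a matrix" buys you: any global transform, whichever technique estimated it, can be applied to the features by one generic routine, and two transforms can be composed by a matrix product. The function names and the convention that an extra last column holds an offset are assumptions for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def apply_transform(matrix: Matrix, frames: Matrix) -> Matrix:
    """Multiply every feature frame by the matrix.

    If the matrix has one more column than the feature dimension, the
    last column is treated as an offset (an affine transform).
    """
    out = []
    for frame in frames:
        dim = len(frame)
        new_frame = []
        for row in matrix:
            if len(row) not in (dim, dim + 1):
                raise ValueError("transform and feature dimensions do not match")
            value = sum(r * x for r, x in zip(row, frame))
            if len(row) == dim + 1:
                value += row[dim]
            new_frame.append(value)
        out.append(new_frame)
    return out


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Return the product a*b of two linear transforms, i.e. the single
    matrix equivalent to applying b first and then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

For example, `apply_transform(compose(a, b), frames)` gives the same result as `apply_transform(a, apply_transform(b, frames))` for square linear `a` and `b`, which is the property relied on when a newly estimated transform is composed with the previous one.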
0:33:52 | i |
---|
0:33:52 | i mean that was a point of contention among us as to whether to do it this way |
---|
0:33:56 | but uh |
---|
0:33:57 | some of us felt that it was important to keep the simple cases simple |
---|
0:34:01 | and to avoid having a |
---|
0:34:03 | a framework |
---|
0:34:04 | for the cases where one wasn't necessary |
---|
0:34:07 | uh |
---|
0:34:10 | okay decoders |
---|
0:34:11 | all of the decoders that we currently have use |
---|
0:34:14 | fully expanded fsts, i mean when i say fully expanded i mean down to the |
---|
0:34:18 | hmm state level with |
---|
0:34:20 | self loops represented as uh |
---|
0:34:23 | actual you know fst arcs |
---|
0:34:26 | i know there's a lot of way to do this and initially |
---|
0:34:28 | one of the thoughts we had |
---|
0:34:29 | would be that |
---|
0:34:31 | you know we wouldn't have the self loops or we might even have |
---|
0:34:34 | representations of the states but in the end it was just so much simpler to do it this way |
---|
0:34:38 | this is what we have now |
---|
0:34:41 | we have three decoders but by decoder we mean they uh |
---|
0:34:44 | C plus plus code that does decoding |
---|
0:34:47 | it's not necessarily the same thing as a command line decoding |
---|
0:34:50 | we have three decoders on the spectrum from simple to fast |
---|
0:34:53 | and the reason for this is that |
---|
0:34:54 | once you have a complicated fast decoder it's almost impossible to debug |
---|
0:34:59 | so if something goes wrong you can always just run the simple one |
---|
0:35:02 | you know and you can find out if it's a decoder issue |
---|
0:35:06 | uh |
---|
0:35:07 | decoders |
---|
0:35:08 | we wanted to make it so the decoder doesn't assume too much about what your model is |
---|
0:35:13 | so again the decoder has no idea of gmms or hmms it doesn't even know about features |
---|
0:35:18 | all that |
---|
0:35:20 | all the decoder knows about is give me the likelihood or |
---|
0:35:24 | score level |
---|
0:35:25 | for this |
---|
0:35:26 | uh frame index |
---|
0:35:28 | and this pdf index |
---|
0:35:30 | so the interface that the decoder sees is almost like a matrix |
---|
0:35:35 | the matrix of uh |
---|
0:35:37 | of floats but it's not represented that way because you want to |
---|
0:35:41 | you know you want to compute it on demand |
---|
0:35:43 | i |
---|
0:35:45 | so yeah this is the decodable interface an interface that the |
---|
0:35:49 | it's a very simple interface that says give me the likelihood for this you know time frame and this index and |
---|
0:35:54 | like |
---|
0:35:54 | how many time frames are there |
---|
0:35:57 | and how many pdf indices there are, that's almost all the interface is |
---|
0:36:01 | but this is the interface that the decoder requires so the idea was you implement you know |
---|
0:36:06 | your fantastic new model |
---|
0:36:08 | and you |
---|
0:36:09 | uh |
---|
0:36:10 | i |
---|
0:36:11 | it doesn't really matter what the interface of that model is |
---|
0:36:13 | you create a small object that satisfies the decodable interface |
---|
0:36:17 | and knows how to get the likelihoods from your fantastic new model |
---|
0:36:21 | and then you uh |
---|
0:36:23 | you instantiate the decoder with that or you give it that |
---|
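The interface just described can be sketched roughly as follows. The method names, the wrapper class and the toy "decoder" are assumptions made for illustration; the real C++ interface may differ in names and details. The point is that the consumer only asks for a score given a frame index and a pdf index, plus the two sizes, and the wrapper computes scores on demand rather than storing a full matrix.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence, Tuple


class Decodable(ABC):
    """Everything a decoder is allowed to know about the acoustic model."""

    @abstractmethod
    def log_likelihood(self, frame: int, pdf_index: int) -> float: ...

    @abstractmethod
    def num_frames(self) -> int: ...

    @abstractmethod
    def num_indices(self) -> int: ...


class DecodableFromScorers(Decodable):
    """Wraps any model exposed as one scoring function per pdf index."""

    def __init__(self, features: Sequence[Sequence[float]],
                 scorers: List[Callable[[Sequence[float]], float]]) -> None:
        self._features = features
        self._scorers = scorers
        self._cache: Dict[Tuple[int, int], float] = {}

    def log_likelihood(self, frame: int, pdf_index: int) -> float:
        key = (frame, pdf_index)
        if key not in self._cache:  # computed only when first requested
            self._cache[key] = self._scorers[pdf_index](self._features[frame])
        return self._cache[key]

    def num_frames(self) -> int:
        return len(self._features)

    def num_indices(self) -> int:
        return len(self._scorers)


def best_pdf_per_frame(decodable: Decodable) -> List[int]:
    """Toy stand-in for a decoder: it touches only the Decodable interface."""
    best = []
    for t in range(decodable.num_frames()):
        scores = [decodable.log_likelihood(t, j)
                  for j in range(decodable.num_indices())]
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best
```

Supporting a new kind of model then means writing another small wrapper class, with no change to the code that consumes the interface.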
0:36:27 | so uh |
---|
0:36:31 | the gmm wrapping okay |
---|
0:36:34 | yeah so our command line decoding programs are very simple we don't have like multipass or anything |
---|
0:36:39 | we don't have uh |
---|
0:36:42 | we don't extend them to support multiple types of model |
---|
0:36:46 | an example decoding program is |
---|
0:36:48 | decode with a gmm |
---|
0:36:51 | but |
---|
0:36:52 | with no multi-class adaptation |
---|
0:36:54 | yeah so it does the simple thing |
---|
0:36:55 | and then if you want to support let's say multi-class |
---|
0:36:58 | mllr fmllr |
---|
0:37:00 | we uh have a separate command line program |
---|
0:37:03 | yeah the idea is that |
---|
0:37:04 | there might be people coming into the project who might want to be able to understand that command line program |
---|
0:37:09 | and we don't want to make the barrier to entry too high |
---|
0:37:12 | we'd rather |
---|
0:37:13 | accept the overhead of having to maintain two parallel decoders |
---|
0:37:17 | to keep it relatively simple to understand any given one |
---|
0:37:24 | uh |
---|
0:37:24 | we support the standard types of features |
---|
0:37:27 | mfcc and plp features are quite similar to |
---|
0:37:30 | the htk ones |
---|
0:37:31 | we've |
---|
0:37:32 | we put in a reasonable range of configurability but |
---|
0:37:35 | i mean being realistic with respect to how much people are really working on this stuff i mean i think |
---|
0:37:40 | most people are doing research on this would probably be coming out with their own features |
---|
0:37:44 | so we don't support every possible |
---|
0:37:46 | combination of it |
---|
0:37:47 | for every possible change |
---|
0:37:49 | we only read the wav format because our reasoning is |
---|
0:37:53 | you can always find an external program to convert it and |
---|
0:37:57 | do it as part of a pipe |
---|
0:38:02 | sorry |
---|
0:38:07 | well we cannot htk and i won't from uh we don't there's no more that we support |
---|
0:38:13 | uh |
---|
0:38:15 | yeah |
---|
0:38:15 | i mean |
---|
0:38:16 | our basic concept of how people use the system is |
---|
0:38:19 | as a complete system |
---|
0:38:21 | because once you start supporting model you know conversion it just gets to be work |
---|
0:38:26 | but yeah we read htk features as a special case |
---|
0:38:30 | uh |
---|
0:38:31 | we typically will write features and other large objects to a single very large file, this relates to this archive format |
---|
0:38:37 | so the format of the file is a key, a space, then your object |
---|
0:38:41 | and another key, a space, that object |
---|
0:38:43 | and uh |
---|
0:38:45 | we have efficient mechanisms to read such files |
---|
0:38:48 | the the the two normal cases are firstly sequential access |
---|
0:38:51 | you want to iterate over the things in an archive |
---|
0:38:54 | secondly random access and there are different ways to do that one is |
---|
0:38:58 | you can write a separate file that has little |
---|
0:39:00 | pointers into the file |
---|
0:39:02 | another is that |
---|
0:39:03 | you can kind of simulate random access even though you're really going sequentially |
---|
0:39:07 | if you know that the keys are sorted |
---|
0:39:10 | uh and another way is if the file isn't isn't that big |
---|
0:39:13 | you can do random access by just having the code go through the whole file |
---|
0:39:18 | and store the objects in memory |
---|
0:39:19 | that's not very scalable but |
---|
0:39:21 | for a lot of uh |
---|
0:39:23 | types of object it really doesn't matter |
---|
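A small sketch of the archive idea: objects concatenated in one stream as "key, space, object", read either sequentially or by seeking to a recorded byte offset. The text-line layout and the function names here are simplifications invented for illustration, not the toolkit's actual on-disk format; the offset index plays the role of the separate file of pointers mentioned above.

```python
import io
from typing import BinaryIO, Dict, Iterable, Iterator, List, Tuple


def write_archive(stream: BinaryIO,
                  items: Iterable[Tuple[str, List[float]]]) -> Dict[str, int]:
    """Write "key value value ..." lines and return key -> byte offset."""
    index: Dict[str, int] = {}
    for key, values in items:
        index[key] = stream.tell()
        line = key + " " + " ".join(str(v) for v in values) + "\n"
        stream.write(line.encode("utf-8"))
    return index


def read_sequential(stream: BinaryIO) -> Iterator[Tuple[str, List[float]]]:
    """Iterate over (key, object) pairs from the current position."""
    while True:
        raw = stream.readline()
        if not raw:
            return
        key, *fields = raw.decode("utf-8").split()
        yield key, [float(f) for f in fields]


def read_random(stream: BinaryIO, index: Dict[str, int], key: str) -> List[float]:
    """Jump straight to one object using the offset index."""
    stream.seek(index[key])
    found_key, values = next(read_sequential(stream))
    if found_key != key:
        raise KeyError(key)
    return values


# Usage with an in-memory stream standing in for one large file on disk.
buffer = io.BytesIO()
offsets = write_archive(buffer, [("utt1", [1.0, 2.0]), ("utt2", [3.5])])
buffer.seek(0)
all_items = list(read_sequential(buffer))
```

The third option mentioned in the talk, loading everything into memory, would amount to `dict(read_sequential(stream))`.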
0:39:27 | oh yeah so the feature |
---|
0:39:29 | feature level processing like adding deltas or fmllr |
---|
0:39:32 | typically each one of those is a separate program so you have like a sequence of programs in a pipe |
---|
0:39:37 | and again that's a bit inefficient but |
---|
0:39:39 | it's not like it's really consuming more than ten percent of your C P U so |
---|
0:39:43 | you just don't care that much this has been written with |
---|
0:39:46 | ease of use in mind |
---|
0:39:49 | uh |
---|
0:39:50 | like i said there's a lot of command line tools this is an example of uh |
---|
0:39:54 | a command line and these backslashes are just |
---|
0:39:57 | the shell |
---|
0:39:58 | so uh |
---|
0:40:01 | this this is one of the many programs |
---|
0:40:03 | the plp would be a separate command line |
---|
0:40:06 | this is just you know |
---|
0:40:07 | an option |
---|
0:40:08 | these are the two command line arguments and this uh |
---|
0:40:11 | i'm gonna be explaining more later on about what these mean but this |
---|
0:40:14 | directs it to write these things to |
---|
0:40:16 | an archive on disk |
---|
0:40:18 | key object key object |
---|
0:40:21 | and then |
---|
0:40:22 | i don't know this is the input |
---|
0:40:23 | we have to read it |
---|
0:40:25 | and then this is telling it to write an archive and also |
---|
0:40:28 | an scp file that |
---|
0:40:30 | kind of has little pointers into the archive |
---|
0:40:32 | so that you can efficiently access the features by random access |
---|
0:40:36 | um um |
---|
0:40:39 | so |
---|
0:40:39 | so yeah another feature of this is that there's only one option here, we have |
---|
0:40:44 | no more than a few options on any given command |
---|
0:40:47 | i mean it's a local program i support |
---|
0:40:49 | less the channel |
---|
0:40:51 | it's not very configuration driven it is more driven by how you combine these programs |
---|
0:40:57 | a |
---|
0:40:58 | oh something else about this whole archive uh formalism is that |
---|
0:41:02 | the C plus plus level code in the individual command line tools |
---|
0:41:06 | doesn't have to worry too much about i/o uh |
---|
0:41:10 | you can just treat |
---|
0:41:11 | the uh |
---|
0:41:13 | when it gets something like this there's |
---|
0:41:15 | there's very short uh |
---|
0:41:17 | statements in the C plus plus that will iterate over |
---|
0:41:20 | stuff |
---|
0:41:20 | so it doesn't have to |
---|
0:41:22 | think too much about the error conditions |
---|
0:41:26 | but yep |
---|
0:41:32 | fst generation |
---|
0:41:35 | okay that's another part of the talk later on |
---|
0:41:44 | well |
---|
0:41:45 | for training |
---|
0:41:47 | there's there's a command line program that will |
---|
0:41:49 | kind of do the fst generation for you and generate lots of little fsts, one for each |
---|
0:41:53 | file |
---|
0:41:54 | yeah so for testing |
---|
0:41:56 | it's a script that calls openfst programs and our own versions of openfst programs |
---|
0:42:03 | so |
---|
0:42:05 | i'm gonna go through that script later on in another part of the |
---|
0:42:07 | talk |
---|
0:42:09 | a a are you this decide this is not obvious you know a lot stand the script |
---|
0:42:13 | but this is just to get people some idea |
---|
0:42:16 | oh of uh |
---|
0:42:17 | of how we do do training |
---|
0:42:19 | so you know this is a bash script it's doing a loop over the iterations |
---|
0:42:24 | uh and this one is estimating mllt |
---|
0:42:29 | i suppose this script review the bias and sorry man i |
---|
0:42:33 | but as that we are is the colour i've yet |
---|
0:42:35 | so a |
---|
0:42:38 | so if it's one of the iterations where we do mllt |
---|
0:42:42 | then uh |
---|
0:42:44 | so we have on disk |
---|
0:42:45 | some uh alignments, these are like state level alignments |
---|
0:42:49 | in the archive |
---|
0:42:51 | format that i mentioned |
---|
0:42:52 | so this converts them to posteriors |
---|
0:42:54 | just in a rather trivial way by saying that each |
---|
0:42:57 | each one has a posterior of one |
---|
0:43:00 | this takes the |
---|
0:43:01 | this gives a zero weight to the silence and |
---|
0:43:03 | that would be a |
---|
0:43:04 | this would be a variable in bash |
---|
0:43:07 | uh |
---|
0:43:08 | yeah so this takes away the uh |
---|
0:43:10 | the silences in the posteriors |
---|
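The two steps just described, turning a hard alignment into posteriors of one and then zero-weighting silence, can be sketched like this. The function names and the direct use of integer labels as "silence labels" are assumptions for illustration; the actual command line tools are not reproduced here, and in a real system deciding which labels are silence would need the model.

```python
from typing import List, Set, Tuple

Posterior = List[List[Tuple[int, float]]]  # per frame: (label, weight) pairs


def alignment_to_posteriors(alignment: List[int]) -> Posterior:
    """Each frame gets its single aligned label with a posterior of one."""
    return [[(label, 1.0)] for label in alignment]


def weight_silence(posteriors: Posterior, silence_labels: Set[int],
                   silence_weight: float) -> Posterior:
    """Scale the weight of silence labels; entries scaled to zero are dropped."""
    weighted = []
    for frame in posteriors:
        new_frame = []
        for label, weight in frame:
            if label in silence_labels:
                weight *= silence_weight
            if weight != 0.0:
                new_frame.append((label, weight))
        weighted.append(new_frame)
    return weighted


# Labels 1 and 2 are taken to be silence in this toy alignment.
alignment = [1, 1, 5, 6, 2]
posteriors = alignment_to_posteriors(alignment)
speech_only = weight_silence(posteriors, {1, 2}, 0.0)
```

An accumulation step downstream would then simply skip the frames whose list is empty, so silence contributes nothing to the transform statistics.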
0:43:12 | and this is an accumulation program |
---|
0:43:14 | that uh |
---|
0:43:16 | this would be the model, that's the features, it's a bash variable that would be |
---|
0:43:21 | set elsewhere |
---|
0:43:23 | uh this |
---|
0:43:24 | a a hmmm |
---|
0:43:25 | i think that's refers to the standard input |
---|
0:43:28 | means that's reading an archive from the standard input and that's |
---|
0:43:30 | output by the |
---|
0:43:32 | this |
---|
0:43:32 | means that's writing an archive to standard output |
---|
0:43:35 | so |
---|
0:43:35 | yeah the output of these programs is passed by a pipe |
---|
0:43:41 | uh |
---|
0:43:42 | all of the error and logging output goes to the standard error uh |
---|
0:43:45 | because we've kind of used up the standard output for this pipe stuff |
---|
0:43:49 | so |
---|
0:43:50 | so we're just redirecting the logging output |
---|
0:43:52 | there |
---|
0:43:53 | so then this is a separate program that does the mllt estimation |
---|
0:43:58 | it takes in uh |
---|
0:43:59 | let me see |
---|
0:44:01 | uh it's computing some kind of matrix |
---|
0:44:04 | and then uh |
---|
0:44:06 | because in mllt yeah |
---|
0:44:08 | after you compute the transform |
---|
0:44:10 | you have to change the means of your model so |
---|
0:44:13 | we like to keep everything separate |
---|
0:44:16 | so you know transforming the means is a separate operation so we have a separate program for that |
---|
0:44:21 | and then |
---|
0:44:22 | we have to compose the mllt transform with the previous one |
---|
0:44:26 | so this is another little program that does that |
---|
0:44:29 | so this is setting another bash variable to make |
---|
0:44:32 | the uh features correspond now to the |
---|
0:44:35 | new mllt features |
---|
0:44:38 | so |
---|
0:44:40 | so as you can see this is a variable in bash |
---|
0:44:43 | and it's |
---|
0:44:43 | this would be passed as a command line argument to one of the programs |
---|
0:44:47 | and it's a command involving a pipe that actually involves |
---|
0:44:51 | calling two separate kaldi uh |
---|
0:44:54 | programs |
---|
0:44:55 | each with their own arguments |
---|
0:44:57 | so obviously you can guess from the names of those programs what they're doing |
---|
0:45:01 | and then of course uh it seems to have a features subset |
---|
0:45:04 | oh yeah i think we were estimating the mllt on a subset of features |
---|
0:45:08 | so this is like the same as this but it's the |
---|
0:45:10 | it's using less of |
---|
0:45:12 | the data |
---|
0:45:15 | so i think i |
---|
0:45:17 | i spoke about these issues but for |
---|
0:45:21 | oh yeah so uh |
---|
0:45:24 | we have example scripts for resource management and wall street journal and these run from the ldc |
---|
0:45:29 | distributed disks |
---|
0:45:31 | uh now we found in the literature just some uh |
---|
0:45:35 | some some uh baseline |
---|
0:45:37 | these numbers are numbers are just the basic context system |
---|
0:45:42 | with i think uh mean normalization |
---|
0:45:44 | we have of course more advanced things but |
---|
0:45:47 | those you know because it's hard to find in the literature the same thing |
---|
0:45:50 | we're just giving you the unadapted numbers |
---|
0:45:53 | so it's a |
---|
0:45:54 | slightly better than this number will can right someone a two thousand |
---|
0:45:58 | and the htk paper from ninety four |
---|
0:46:01 | a has a funny but a number for this was the gender dependent system |
---|
0:46:05 | so uh |
---|
0:46:05 | so i think basically we're doing the same as |
---|
0:46:08 | you'd expect given the same algorithms |
---|
0:46:11 | i mean |
---|
0:46:12 | uh |
---|
0:46:13 | i was hoping you know at the start of this whole project that the results would be better |
---|
0:46:17 | for issues relating to the tree and context and stuff but |
---|
0:46:20 | you know in the end we get the same |
---|
0:46:22 | so |
---|
0:46:22 | it's working there's no major bug |
---|
0:46:26 | uh did of the |
---|
0:46:28 | okay next slide |
---|
0:46:32 | uh just a note on speed and decoding issues |
---|
0:46:35 | we quote bigram numbers because the baselines were bigram numbers |
---|
0:46:38 | we can't yet decode with the full |
---|
0:46:41 | with the full uh trigram language model that's |
---|
0:46:43 | distributed with the wall street journal corpus |
---|
0:46:46 | because the fsts uh |
---|
0:46:48 | they get too large |
---|
0:46:50 | we have of course results with a pruned trigram |
---|
0:46:52 | but that's why we're quoting the bigram numbers |
---|
0:46:54 | so |
---|
0:46:56 | hopefully by the summer we're gonna |
---|
0:46:58 | there's a couple of things we can do that we're both working on, one is to have a decoder that |
---|
0:47:01 | does some kind of on the fly |
---|
0:47:03 | expansion so that we can uh |
---|
0:47:05 | decode directly with that |
---|
0:47:07 | and the other is to have lattice generation so we can rescore |
---|
0:47:11 | the decoding speed for these wall street journal numbers is about twice as fast as real time |
---|
0:47:16 | and of course that's on a good machine |
---|
0:47:18 | so i mean this is tuned so that you don't get more than zero point one degradation |
---|
0:47:23 | versus a wide beam |
---|
0:47:26 | the wall street journal script takes a few hours on a single machine |
---|
0:47:30 | we parallelize onto three cpus |
---|
0:47:32 | this is just an example script we didn't want to include things like qsub in the example script |
---|
0:47:37 | because then it wouldn't run on uh everyone's machine |
---|
0:47:40 | obviously it would be faster if you were doing it in parallel |
---|
0:47:44 | yeah |
---|
0:47:47 | uh_huh |
---|
0:47:54 | if it in member it well as well |
---|
0:47:56 | well |
---|
0:47:58 | but ten gig |
---|
0:47:59 | i i mean |
---|
0:48:00 | you know everyone knows that fst compilation tends to blow up a bit |
---|
0:48:04 | it's not like |
---|
0:48:06 | if you have the size of the model you can just about compiler |
---|
0:48:12 | i i don't recall that it's a trigram one for most journal |
---|
0:48:15 | i i |
---|
0:48:17 | and then we go how many was but i think |
---|
0:48:19 | i don't think that our stuff is any worse than you know normal fst |
---|
0:48:23 | setups that fully expand everything |
---|
0:48:27 | oh yeah okay resource management |
---|
0:48:29 | this is a |
---|
0:48:31 | the htk results |
---|
0:48:33 | are taken uh |
---|
0:48:35 | from uh |
---|
0:48:36 | i think this is basically the htk rm recipe but the results are |
---|
0:48:40 | taken from a paper of mine like in ninety nine or something |
---|
0:48:43 | "'cause" |
---|
0:48:44 | i just couldn't find in the read me file from are C K on all of the test |
---|
0:48:49 | and the average as you can see the average is the same |
---|
0:48:52 | so |
---|
0:48:53 | with the same algorithms we're getting the same results as |
---|
0:48:55 | htk |
---|
0:48:56 | uh |
---|
0:48:59 | yeah and the decoding on this setup is about zero point one times real time |
---|
0:49:09 | yeah |
---|
0:49:09 | yeah |
---|
0:49:16 | yeah the test set are quite |
---|
0:49:20 | oh yeah |
---|
0:49:21 | is a very small test that a handful of words that are of |
---|
0:49:25 | uh this is this page is mainly |
---|
0:49:27 | just to give you some idea of the kinds of things that are in our example scripts we have a |
---|
0:49:31 | bunch of |
---|
0:49:32 | different configurations, this is the standard configuration |
---|
0:49:35 | well this is the standard configuration because this is what was in the htk baseline |
---|
0:49:40 | uh |
---|
0:49:41 | um |
---|
0:49:42 | adding mllt doesn't really seem to help |
---|
0:49:45 | sorry adding S T |
---|
0:49:46 | C |
---|
0:49:47 | doesn't really help |
---|
0:49:48 | i |
---|
0:49:51 | i think nine frames plus lda actually makes it worse but then |
---|
0:49:56 | when you do |
---|
0:49:57 | uh S T C on top of that |
---|
0:50:00 | it's actually better than here and so this was the kind of this was the I B M |
---|
0:50:04 | recipe |
---|
0:50:05 | so |
---|
0:50:07 | so this was the I B M recipe so i guess there must have been some interaction between these |
---|
0:50:11 | two parts of the recipe |
---|
0:50:13 | that somehow made it work |
---|
0:50:14 | i don't know if it's gonna generalize to other test sets |
---|
0:50:17 | we're gonna find out |
---|
0:50:19 | uh splice nine frames plus hlda |
---|
0:50:22 | triple deltas plus hlda |
---|
0:50:24 | triple deltas plus |
---|
0:50:26 | lda plus mllt |
---|
0:50:28 | this this |
---|
0:50:29 | quite good |
---|
0:50:30 | uh sgmm systems, these are all unadapted |
---|
0:50:33 | i have a separate slide for uh adapted experiments |
---|
0:50:37 | um |
---|
0:50:39 | if is doing it |
---|
0:50:41 | unless |
---|
0:50:41 | it's stated otherwise, oh yeah okay so this is per utterance adaptation this is per speaker |
---|
0:50:47 | so |
---|
0:50:48 | this was four point five i think before uh |
---|
0:50:51 | adaptation so it really doesn't help if you do it per utterance and that's because there's too many |
---|
0:50:55 | parameters |
---|
0:50:57 | in to model a |
---|
0:50:58 | this is doing the same thing per speaker it gets a lot better uh |
---|
0:51:01 | the exponential transform is |
---|
0:51:03 | again i'm not gonna describe what it is, it's something like vtln |
---|
0:51:07 | uh and it gets quite a bit better uh |
---|
0:51:09 | this is vtln, the kind of linear version of vtln i believe |
---|
0:51:14 | and that seems to improve quite a lot |
---|
0:51:16 | and of course the improvement is more pronounced at the per utterance level because |
---|
0:51:20 | uh |
---|
0:51:21 | you know it's just like a constrained form of fmllr so the only point is |
---|
0:51:26 | to do it |
---|
0:51:27 | to do it when you have less data |
---|
0:51:29 | uh |
---|
0:51:30 | splice nine frames for cell day sex to transform |
---|
0:51:34 | a from well thing i from a lot |
---|
0:51:36 | we only did some of these put speaker because it wouldn't help of the |
---|
0:51:39 | uh |
---|
0:51:41 | as you can see there are a lot of different combinations, this is sgmm including the |
---|
0:51:45 | speaker offsets sets the and thumbs if you member |
---|
0:51:48 | and it does help so |
---|
0:51:50 | so uh i think rick was saying that that is wasn't working for him but it seems to be working |
---|
0:51:54 | for us |
---|
0:51:55 | three point one five goes to uh |
---|
0:51:59 | where is it, two point six eight |
---|
0:52:01 | i must have uh forgotten to fill this line in |
---|
0:52:04 | that's sgmm plus fmllr |
---|
0:52:07 | but no speaker vectors |
---|
0:52:09 | yeah |
---|
0:52:10 | a per speaker |
---|
0:52:12 | yeah |
---|
0:52:13 | i think i have these numbers but i think i must not have put them in, i think the best number was |
---|
0:52:17 | like two point four |
---|
0:52:20 | point three |
---|
0:52:21 | uh |
---|
0:52:23 | so a general plug for kaldi |
---|
0:52:26 | uh |
---|
0:52:27 | i believe it's easy to use i mean i hope the scripts didn't scare you guys off, the |
---|
0:52:31 | attraction is that once you understand them |
---|
0:52:34 | everything becomes quite simple |
---|
0:52:36 | but |
---|
0:52:37 | it kind of does require that you understand how speech works, like if you're someone who just |
---|
0:52:42 | is randomly |
---|
0:52:43 | moving the script you know changing configurations of |
---|
0:52:45 | the |
---|
0:52:46 | you're not uh |
---|
0:52:47 | it's not gonna work |
---|
0:52:48 | it it doesn't like |
---|
0:52:50 | it doesn't automatically know that the features you have are not compatible with your model |
---|
0:52:56 | so you kind of have to know what you're doing from a speech science point of view |
---|
0:53:00 | but |
---|
0:53:01 | it's quite uh |
---|
0:53:02 | it's easy to use at the C plus plus |
---|
0:53:04 | slash |
---|
0:53:06 | software engineer |
---|
0:53:08 | uh |
---|
0:53:08 | it's easy to extend and modify |
---|
0:53:10 | you can redistribute your changes or give them back to |
---|
0:53:13 | the Kaldi group |
---|
0:53:15 | uh |
---|
0:53:15 | we're open to including other people's |
---|
0:53:17 | stuff |
---|
0:53:18 | so that may give you more citations |
---|
0:53:21 | so this |
---|
0:53:21 | is really |
---|
0:53:23 | the end of this first part so |
---|
0:53:26 | you can get up and have a drink and after a few minutes |
---|
0:53:29 | well |
---|
0:53:32 | yeah it has documentation, kaldi.sourceforge.net |
---|
0:53:36 | uh okay it is not as good as HTK's and probably being realistic will never be |
---|
0:53:41 | what we'll do |
---|
0:53:42 | is we'll |
---|
0:53:43 | of able lies the F but the he's to K has use and point people to the he's K documentation |
---|
0:53:48 | so then about eight and say that have he had then |
---|
0:53:51 | yeah i know me |
---|
0:53:52 | i |
---|
0:53:54 | i |
---|
0:53:55 | i |
---|
0:53:55 | i use C |
---|
0:53:58 | see i |
---|
0:53:59 | i |
---|
0:54:02 | but okay |
---|
0:54:03 | we can have a short break, you can have a drink |
---|
0:54:06 | and just disappear if you're not uh |
---|
0:54:09 | that committed to it |
---|
0:54:10 | and then |
---|
0:54:12 | uh uh we've have a gonna talk up to what |
---|
0:54:14 | the fact |
---|