0:00:13 | My name is Jeff Berry, |
---|
0:00:14 | and I'm giving a talk about using ultrasound to visualise the tongue during speech. |
---|
0:00:21 | So first of all, we might want to ask: why would we want to use ultrasound to look at the tongue anyway? |
---|
0:00:30 | One of the main reasons is that there is evidence to suggest that there is a lot more than just audio processing going on during speech perception. |
---|
0:00:42 | For example, there is the well-known McGurk effect, which shows that there is integration between the visual signal and the audio signal during human speech perception. |
---|
0:00:59 | Okay, so I have here a short video demonstrating the McGurk effect. If you look at the lips during the video and you listen, then you perceive different syllables. |
---|
0:01:14 | So let's watch that. |
---|
0:01:19 | ah |
---|
0:01:20 | ah |
---|
0:01:22 | ah |
---|
0:01:24 | ah |
---|
0:01:26 | Okay, so if you were watching the video, you should have heard different sounds, right? |
---|
0:01:33 | But now if you close your eyes, or look away from the video so you don't look at the lips, I'll replay the video and you'll see that they are actually the same sound acoustically. |
---|
0:01:46 | a |
---|
0:01:48 | ah |
---|
0:01:49 | ah |
---|
0:01:51 | a |
---|
0:01:53 | Okay, so if you're not looking at the lips, then you notice that the sound is always the same, it's always "ba". However, if you look at the lips while you hear the sound, then it changes. |
---|
0:02:05 | So this is a really strong effect, and it doesn't matter which language you speak, you still get this effect, at least most people do. This really suggests that our brain makes use of lots of different information. |
---|
0:02:22 | There's a recent study from a lab in Italy showing that there's also use of the motor system, specifically the part of the brain that controls the tongue and the lips, during speech perception. |
---|
0:02:38 | But their effect is only observed during noisy speech. What that means is that in a noisy situation, if you have trouble hearing the person you're listening to, your motor cortex may become involved in helping you to parse the speech you're listening to. |
---|
0:02:59 | So this raises lots of interesting questions to pursue, and we want to be able to use ultrasound to investigate what the tongue is doing during speech. |
---|
0:03:16 | So here is a typical ultrasound image of the tongue. I've added the profile of the face to give you some landmarks for what we're looking at. It's a midsagittal view, and the tongue tip is always to the right of the image. |
---|
0:03:32 | How far back you can see depends on the subject and the probe you're using. You can usually get most of the tongue body; sometimes the tongue tip is not visible, but you can definitely see the tongue body. This bright band represents the surface of the tongue. |
---|
0:03:53 | So here's a typical segment of speech and what it looks like in ultrasound. |
---|
0:04:01 | but |
---|
0:04:03 | i |
---|
0:04:05 | Okay, I'll play that again. |
---|
0:04:07 | i |
---|
0:04:08 | Okay, so you can see the tongue moving around to make the different sounds. |
---|
0:04:17 | So, applications for this: recently a new ultrasound machine has been released that's handheld, a fully operational ultrasound machine, so there are possibilities in the future of having portable ultrasound machines integrated with other sensors to do speech recognition. |
---|
0:04:41 | There's also a large interest in silent speech interfaces, which means being able to measure some sort of articulatory motion without vocalisation and being able to resynthesize speech from that. This could be useful in environments where either there's too much noise, or where you have to be silent but still need to communicate. |
---|
0:05:07 | And there's also the possibility of adding a model of tongue motion to a speech recognition system, so we can use these images to construct such a model. |
---|
0:05:23 | Okay, so in this work we're addressing these questions: first of all, we wanted to classify tongue shapes into phonemes, and also to use this to try to segment the ultrasound video into phoneme segments. |
---|
0:05:44 | The tool we chose to do this is a variation of a deep belief network that we're calling the translational deep belief network. |
---|
0:05:52 | A deep belief network is composed of stacked restricted Boltzmann machines. Restricted Boltzmann machines are probabilistic generative models; each machine consists of a single visible layer and a single hidden layer, so it's a neural network. |
---|
0:06:16 | Basically, in a deep belief network, the hidden layer of the first restricted Boltzmann machine becomes the visible layer of the second one, and you just stack them up. |
---|
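To make the stacking concrete, here is a minimal sketch in Python/NumPy. This is not the speaker's code: the class, function names, layer sizes, and the plain CD-1 training update are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """A restricted Boltzmann machine: one visible layer, one hidden layer."""
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_vis = np.zeros(n_visible)   # visible biases
        self.b_hid = np.zeros(n_hidden)    # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_step(self, v0, lr=0.05):
        """One step of standard CD-1 training on a batch v0."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)        # reconstruction
        h1 = self.hidden_probs(v1)
        n = len(v0)
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += lr * (v0 - v1).mean(axis=0)
        self.b_hid += lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(data, hidden_sizes, epochs=10):
    """Greedy layer-wise pre-training: the hidden activations of one RBM
    become the visible data for the next RBM in the stack."""
    rbms, x = [], data
    for n_hidden in hidden_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_step(x)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)   # feed activations up to the next layer
    return rbms
```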
0:06:31 | These deep belief networks are typically trained in two stages. First, you do an unsupervised pre-training stage, in which the network decorrelates the data: it learns a representation on its own, without any human-labelled knowledge. |
---|
0:07:00 | Then, after that pre-training has been done, a second, discriminative stage optimises the network to output a label. So the human label is not actually added to the training until the second stage. |
---|
0:07:19 | With the translational deep belief network, we wanted to add the label during the pre-training as well. |
---|
0:07:29 | So this is the basic idea of what we're proposing: we have a regular deep net, but we train it with the sensory inputs and the labels concatenated to form the visible layer. |
---|
0:07:50 | We train a regular deep belief network, then we copy the weights of the upper hidden layers over to a new network, and we retrain the bottom layer to accept only the sensor inputs, without a label. Then finally we do backpropagation to add the labels on at the last stage. |
---|
0:08:19 | So, a bit more about that. The first thing you want to do is add the human-labelled data into the pre-training stage, and that helps the network build features that contain information about both the labels and the sensor input. |
---|
0:08:46 | So the first step is to train a deep belief network on both the labels and the sensors. |
---|
0:08:54 | Then we replace the bottom layer of the network with the translational layer, to accept the images only, and then we do discriminative backprop to fine-tune the network. That allows us to extract a label from an unlabelled image. |
---|
0:09:17 | So again, to explain this: we're just copying the weights from the original network over, and then substituting the bottom one with the new network that we train. |
---|
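As a rough sketch of the pipeline just described, reusing the hypothetical RBM class and pretrain_dbn helper from the earlier snippet (the data, layer sizes, and label width are placeholders, not the authors' settings):

```python
import numpy as np

# Hypothetical sizes: a 32x32 image flattened to 1024 pixels, and a label
# block repeated to roughly the same width (the labelling scheme is
# described later in the talk).
n_pixels, n_label = 1024, 1020
rng = np.random.default_rng(0)
images = rng.random((200, n_pixels))   # placeholder ultrasound frames
labels = rng.random((200, n_label))    # placeholder repeated labels

# Stage 1: pre-train a regular DBN whose visible layer is the
# concatenation [image, repeated label].
joint_rbms = pretrain_dbn(np.hstack([images, labels]), [500, 500, 200])

# Stage 2: build the "translational" network.  The upper RBMs are reused
# as-is (their weights are simply carried over); only the bottom layer is
# new, accepts the image alone, and must be retrained with the modified
# contrastive-divergence rule sketched below.
new_bottom = RBM(n_pixels, joint_rbms[0].W.shape[1])
translational_rbms = [new_bottom] + joint_rbms[1:]

# Stage 3: discriminative backpropagation then fine-tunes this image-only
# stack so that it outputs the label for an unlabelled image.
```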
0:09:29 | To train this, we're using a slight variation on the contrastive divergence rule. |
---|
0:09:38 | This is the equation for that. What it says is: you sample the input and the representation in the hidden layer, and you sample from both of those to get a reconstruction. This is where the fact that it's a generative model comes into play: you use the hidden layer to regenerate the visible layer, then use the regenerated visible layer to reconstruct another hidden layer, and then you minimise the difference between those. |
---|
0:10:15 | So here it is diagrammatically. We're sampling from this layer to get our hidden units. Normally you would just sample from those to reconstruct the same visible layer again, but the key difference here is that we're sampling to a new visible layer that contains only the sensor input, and then sampling from that to get reconstructed hidden units. That allows us to apply the contrastive divergence rule. |
---|
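A sketch of that modified update, under my reading of the description. The standard CD-1 rule updates the weights in proportion to <v h>_data minus <v h>_reconstruction; in the translational variant, as described above, the "data" hidden units are inferred from the joint image-plus-label input through the already-trained joint weights, while the reconstruction passes through the new image-only weights. All names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def translational_cd_step(x, y, W_joint, b_hid_joint,
                          W_new, b_vis_new, b_hid_new,
                          lr=0.05, rng=np.random.default_rng(0)):
    """One CD-style update of the image-only bottom layer (W_new).

    x: batch of images (n, n_pixels); y: matching repeated labels.
    W_joint, b_hid_joint: bottom layer of the trained joint DBN, whose
        visible layer is [image, label].
    W_new, b_vis_new, b_hid_new: the replacement bottom layer that must
        learn to produce the same hidden code from the image alone.
    """
    v_joint = np.hstack([x, y])

    # "Data" hidden units: inferred from image+label through joint weights.
    h0 = sigmoid(v_joint @ W_joint + b_hid_joint)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)

    # Key difference from standard CD: reconstruct a NEW visible layer that
    # contains only the sensor (image) input, using the new weights ...
    x1 = sigmoid(h0_sample @ W_new.T + b_vis_new)
    # ... then reconstruct hidden units from that image-only layer.
    h1 = sigmoid(x1 @ W_new + b_hid_new)

    # Contrastive-divergence update on the new, image-only weights.
    n = len(x)
    W_new += lr * (x.T @ h0 - x1.T @ h1) / n
    b_vis_new += lr * (x - x1).mean(axis=0)
    b_hid_new += lr * (h0 - h1).mean(axis=0)
    return W_new, b_vis_new, b_hid_new
```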
0:10:51 | Okay, so for our experiment we used just five phoneme categories. We had a database of 1,893 images that represented prototypical shapes for those categories. We went through by hand and hand-labelled where we thought the prototypical shape was for P, T, K, R, and L, and we also chose images that represented a non-category, or garbage category; those were images that were at least five frames in the video away from a peak image. |
---|
0:11:33 | Then, to feed this to the network, we use this labelling scheme, where we have a one-versus-all representation and we repeat it a number of times, so that there is a similar number of input nodes on the visible layer for images and for labels. That way the network can't just minimise the error on the image and ignore the label; it actually has to take the label into account. |
---|
0:12:05 | So if we had category two, a T, the label would look like this: the second element would be a one, and then we would repeat this string a number of times. |
---|
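A small sketch of that labelling scheme. The six-category count (five phonemes plus garbage) comes from the talk; the image size and the exact repeat count are assumptions.

```python
import numpy as np

N_CATEGORIES = 6  # P, T, K, R, L, and the garbage category

def encode_label(category_index, n_repeats):
    """One-vs-all label, tiled so it is not swamped by the image pixels."""
    one_hot = np.zeros(N_CATEGORIES)
    one_hot[category_index] = 1.0
    return np.tile(one_hot, n_repeats)

def make_visible_vector(image, category_index):
    """Concatenate the flattened image and the repeated label to form the
    DBN's visible layer, as described above."""
    pixels = image.ravel()
    n_repeats = max(1, len(pixels) // N_CATEGORIES)
    return np.concatenate([pixels, encode_label(category_index, n_repeats)])

# Example: category index 1 ("T") gives 0 1 0 0 0 0 repeated many times.
visible = make_visible_vector(np.zeros((32, 32)), 1)
```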
0:12:18 | Then we took the images, scaled the relevant part down to a smaller number of pixels, and did five-fold cross-validation. |
---|
0:12:30 | We had dramatic differences in accuracy between a regular deep belief network and our translational deep belief network. As you can see, on average the standard deep belief network got about forty-two percent, and when we add the label information during pre-training we get a lot higher, in the eighties. |
---|
0:12:56 | We compared this to some other methods. There's been work on constructing what they're calling eigentongues, which is similar to the eigenfaces of Turk and Pentland, and that is just PCA analysis: you have a set of images and you find the principal components of that set, and then you can represent all your images in terms of the coefficients of the first N components. |
---|
0:13:34 | We used that to do dimensionality reduction, so each image is represented with a hundred coefficients, and then used that to train a support vector machine, and that got fifty-three percent accuracy. |
---|
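That baseline is straightforward to sketch with scikit-learn. The arrays here are random placeholders, and the SVM kernel is an assumption since the talk doesn't specify one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# images: (n_samples, n_pixels) flattened ultrasound frames;
# labels: (n_samples,) phoneme categories.  Placeholders only.
rng = np.random.default_rng(0)
images = rng.random((200, 1024))
labels = rng.integers(0, 6, size=200)

# "Eigentongue" baseline: PCA keeps the first 100 components, so each image
# becomes 100 coefficients, and an SVM classifies those coefficients.
eigentongue_svm = make_pipeline(PCA(n_components=100), SVC(kernel="rbf"))
scores = cross_val_score(eigentongue_svm, images, labels, cv=5)
print(scores.mean())
```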
0:13:51 | We also tried segmenting out part of the image by just tracing the tongue surface. We have these ultrasound images, and instead of using the whole image we only use what the human thinks is the relevant part, which is the traced tongue surface. |
---|
0:14:17 | In previous work we showed how you can use this deep belief network to also extract these traces automatically, and I'll talk about that. We used those features instead, just sampled the curves at fifty points, used that to train a support vector machine, and got seventy percent accuracy. |
---|
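The fifty-point contour features mentioned above can be built by resampling each trace to a fixed number of points. A minimal sketch, assuming arc-length-uniform spacing (the talk doesn't specify how the curves were sampled):

```python
import numpy as np

def resample_contour(xs, ys, n_points=50):
    """Resample a traced tongue contour to a fixed number of points, evenly
    spaced along the curve's arc length, so every trace becomes a feature
    vector of the same size (here 50 x,y pairs)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    seg = np.hypot(np.diff(xs), np.diff(ys))            # segment lengths
    arclen = np.concatenate([[0.0], np.cumsum(seg)])    # cumulative length
    targets = np.linspace(0.0, arclen[-1], n_points)
    return np.column_stack([np.interp(targets, arclen, xs),
                            np.interp(targets, arclen, ys)])

# The 50 resampled (x, y) pairs can then be flattened into a feature vector
# and fed to the same kind of SVM as in the previous sketch.
```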
0:14:43 | The nice thing about translational deep belief nets is that they can also be used to do this sort of automatic image segmentation. The way that works is by using, again, the generative properties of the model to construct an auto-encoder. |
---|
0:15:07 | What that means is we first train a network up, just like before, but then we can use the top hidden layer to reconstruct what is on the input. This is an auto-encoder in the sense that what you put in, you can reconstruct on the output. |
---|
0:15:29 | That means when we give it an image and a label, we can reconstruct that image and that label. So we use this property to train an auto-encoder like this. |
---|
0:15:43 | Then, again, we create a new network by copying all of this piece over to the new network, and then use the TRBM to retrain the bottom layer of the network to accept only the image. This allows us to put in an image and get a label out; in this case the label is the tongue trace. So again, the input looks like this: we read this image, put it into the network, and it produces the trace. |
---|
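A sketch of that final forward pass as I understand it: the image goes up through the retrained image-only encoder, back down through the generative (decoder) weights of the auto-encoder, and the label portion of the reconstructed visible layer is the tongue trace. The weight lists are placeholders for a trained network; biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_trace(image, encoder_weights, decoder_weights, n_pixels):
    """Run one image through the translational auto-encoder.

    encoder_weights: bottom-to-top weights of the image-only network
        (the first matrix has n_pixels rows).
    decoder_weights: bottom-to-top weights of the joint network whose
        visible layer is [image, trace]; they are used transposed, in
        reverse order, to generate the visible layer from the top code.
    Returns the trace portion of the reconstructed visible layer.
    """
    h = image.ravel()
    for W in encoder_weights:              # image -> top hidden code
        h = sigmoid(h @ W)
    v = h
    for W in reversed(decoder_weights):    # top code -> [image, trace]
        v = sigmoid(v @ W.T)
    return v[n_pixels:]                    # keep only the trace part
```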
0:16:17 | Okay. |
---|
0:16:19 | The other thing we looked at was segmenting the speech sequence. In this case we used just a regular sequence, which included images that were not part of the original training set; we only trained on prototypical shapes, and the sequence contains lots of transitional shapes as well. |
---|
0:16:44 | When we did that, the deep belief network gives us an activation that can handle multiple categories. It looks like this: for the sequence we actually see the dynamics of the tongue motion, and we can use this to segment the speech stream. For example, on this frame we have activation for the P shape and the T shape at the same time, so it's transitioning between those shapes. |
---|
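A toy sketch of how such per-frame activations could be turned into a segmentation. The threshold, category names, and handling of unlabelled frames are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

PHONES = ["P", "T", "K", "R", "L", "garbage"]

def segment_sequence(activations, threshold=0.5):
    """Turn per-frame category activations (n_frames x n_categories) into a
    rough segmentation.  A frame may activate several categories at once,
    which is how transitions between tongue shapes show up."""
    labels_per_frame = []
    for frame in activations:
        active = [PHONES[i] for i, a in enumerate(frame) if a >= threshold]
        labels_per_frame.append(active or ["transition"])
    return labels_per_frame

# Example: the middle frame activates both P and T, i.e. the tongue is
# moving between those two shapes.
example = np.array([[0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
                    [0.6, 0.7, 0.0, 0.0, 0.0, 0.1],
                    [0.1, 0.9, 0.0, 0.0, 0.0, 0.0]])
print(segment_sequence(example))
```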
0:17:20 | So, putting it all together, we have the original ultrasound, the automatically-extracted contour, and then here will be the label that it shows, and it looks like that. |
---|
0:17:36 | Okay. |
---|
0:17:38 | So that's all I have. Thank you very much. |
---|
0:17:49 | Questions? |
---|
0:18:05 | Was your training set from one person or from several people? There are probably large differences between people. |
---|
0:18:17 | That's right, there are large differences between people. Actually, for the tongue surface extraction we trained with nine speakers, and it was able to generalise to new speakers as long as their tongue shape was around about the same size as one of the nine subjects it was trained on. |
---|
0:18:42 | For the classification we trained on two speakers, and it did well for those people. That's still an ongoing area where we need to see how well it generalises to other speakers, but you're right about that. |
---|
0:18:59 | Thank you. |
---|
0:19:00 | And is that mainly due to the shape of the tongue, or also to the manner of speaking? |
---|
0:19:10 | I think it's mostly to do with the shape of the tongue, and especially during data collection the depth of the scan and things like that make a large difference. |
---|
0:19:20 | Uh-huh, so maybe it could be normalised? |
---|
0:19:23 | Yeah, that's right. |
---|
0:19:25 | Okay. |
---|
0:19:27 | Any other questions? No? Then let's move on to the next talk. |
---|