0:00:14 | Thank you. So, as was just mentioned, my name is Stavros, |
---|
0:00:17 | and I will be presenting our work on |
---|
0:00:20 | the audiovisual classification of vocal outbursts. |
---|
0:00:24 | This is joint work |
---|
0:00:27 | with Florian Eyben and Björn Schuller from TUM, |
---|
0:00:29 | and with our group at Imperial College, which also includes Georgios Tzimiropoulos, Stefanos Zafeiriou and Maja Pantic. |
---|
0:00:36 | So, what are vocal outbursts? |
---|
0:00:38 | They are non-linguistic vocalisations |
---|
0:00:41 | which basically |
---|
0:00:43 | are usually accompanied by facial expressions as well. |
---|
0:00:45 | Examples include laughter, which is probably the most common one, |
---|
0:00:50 | coughing, breathing; |
---|
0:00:52 | but there are also many other types of vocalisations. |
---|
0:00:56 | We may not realise it, |
---|
0:00:58 | but these vocalisations play an important role in face-to-face conversations. For example, |
---|
0:01:03 | it has been shown that laughter punctuates speech, |
---|
0:01:06 | and what this means |
---|
0:01:08 | is that we tend to laugh at places where punctuation would be placed. |
---|
0:01:12 | Another example: |
---|
0:01:15 | when, in a conversation, the two participants laugh simultaneously, |
---|
0:01:19 | it is very likely that this indicates the end of the topic they |
---|
0:01:23 | discuss, and it is very likely that a new topic will start. |
---|
0:01:27 | Apart from laughter, which is probably the most widely studied |
---|
0:01:31 | vocalisation, |
---|
0:01:33 | most of the other vocalisations |
---|
0:01:35 | are used as a feedback mechanism during interaction. |
---|
0:01:39 | As I said before, |
---|
0:01:41 | they are very common, although we don't realise it, in |
---|
0:01:45 | real conversations. |
---|
0:01:48 | There have been several works on laughter recognition and classification from audio only; there have also been a few |
---|
0:01:54 | works on audiovisual classification of laughter, |
---|
0:01:57 | but works on recognising or discriminating between different vocalisations are limited compared to laughter, |
---|
0:02:05 | and one of the main reasons is the lack of data. |
---|
0:02:08 | So what was our goal in this work? |
---|
0:02:11 | We would like to discriminate between different vocalisations, |
---|
0:02:15 | since we had access to a dataset that contains such vocalisations, |
---|
0:02:19 | using not only audio features |
---|
0:02:22 | but also visual features. |
---|
0:02:24 | The idea here is that, |
---|
0:02:25 | since most of the time there is a facial expression involved in the production of a vocalisation, |
---|
0:02:33 | this information can be captured by the visual features, |
---|
0:02:36 | so it can improve the performance |
---|
0:02:39 | when it is added to the audio information. |
---|
0:02:42 | Okay, so: |
---|
0:02:44 | the dataset we used was the Audiovisual Interest Corpus (AVIC) from TUM, |
---|
0:02:48 | which contains twenty-one subjects |
---|
0:02:52 | and 3,901 turns. |
---|
0:02:57 | It is basically a dyadic interaction scenario: there is a presenter and a subject, who |
---|
0:03:03 | interact, |
---|
0:03:05 | and during the interaction there are several vocalisations. |
---|
0:03:10 | The partitioning we used |
---|
0:03:14 | is the one that has also been used in the Speech Paralinguistics Challenge; |
---|
0:03:19 | we use the same partitioning |
---|
0:03:21 | for training, development |
---|
0:03:23 | and testing, |
---|
0:03:24 | although unfortunately on this slide the development column is missing. |
---|
0:03:29 | There are four classes of non-linguistic vocalisations: |
---|
0:03:36 | breathing, consent (which is something like saying "yes"), hesitation, and laughter. |
---|
0:03:39 | There is also another class, the garbage class, which contains other noises and |
---|
0:03:44 | speech. |
---|
0:03:46 | For the experiments |
---|
0:03:48 | I'm going to show you, we have excluded the breath class, because |
---|
0:03:53 | most of the time it is audio only; |
---|
0:03:58 | at least in this dataset, most of the time there is no facial expression at |
---|
0:04:02 | all. |
---|
0:04:03 | Okay, so, |
---|
0:04:08 | let me just show you a few examples. |
---|
0:04:17 | Okay, so this is an example of laughter |
---|
0:04:19 | from the database. |
---|
0:04:23 | What you can see is that, |
---|
0:04:27 | although there is a camera pointed directly at the face of the subject, |
---|
0:04:32 | there is still significant head movement, |
---|
0:04:37 | which is quite common in interactions. For example, we have built a database where |
---|
0:04:42 | we had subjects |
---|
0:04:44 | watch a funny video clip |
---|
0:04:46 | and we recorded their reaction; in that case the subject is static, they just watch something, and |
---|
0:04:52 | there the head movements were smaller, whereas in this case, and also |
---|
0:04:56 | in all real cases, |
---|
0:04:58 | the head movement is always |
---|
0:04:59 | there. |
---|
0:05:01 | Okay, so let me show you an example of a hesitation. |
---|
0:05:08 | So basically it's pretty subtle. |
---|
0:05:12 | And an example of consent: |
---|
0:05:16 | this one. |
---|
0:05:21 | So basically in this one there was not much of an expression, just some |
---|
0:05:25 | head movement. |
---|
0:05:28 | So, |
---|
0:05:29 | back to the presentation. |
---|
0:05:40 | Okay. So we used this dataset |
---|
0:05:43 | and its vocalisations, |
---|
0:05:45 | and we did classification into, |
---|
0:05:52 | well, four classes: |
---|
0:05:56 | the three vocalisations plus garbage. |
---|
0:06:00 | So, we extracted the features. |
---|
0:06:01 | Okay, this is just an overview; I will explain each of these in the next slides. |
---|
0:06:05 | We extracted audio and visual features, |
---|
0:06:07 | and the visual features were upsampled to match the frame rate of the audio features, and then were |
---|
0:06:12 | concatenated for feature-level fusion. |
---|
0:06:15 | Then classification was performed |
---|
0:06:17 | with two different approaches: one was SVMs, |
---|
0:06:20 | and the other one was the long short-term memory |
---|
0:06:23 | recurrent neural network. |
---|
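A minimal sketch of this fusion step, assuming visual features at 25 fps and audio features at 100 fps (array shapes and names are illustrative, not the authors' code):

```python
# Sketch of the feature-level fusion step just described (illustrative only).
# Visual features at 25 fps are upsampled to the 100 fps audio rate by
# repeating each frame, then concatenated frame-by-frame with the audio.
import numpy as np

def fuse_features(audio, visual, rate_ratio=4):
    """audio: (T_a, Da) at 100 fps; visual: (T_v, Dv) at 25 fps; rate_ratio = 100 / 25."""
    visual_up = np.repeat(visual, rate_ratio, axis=0)  # 25 fps -> 100 fps
    T = min(len(audio), len(visual_up))                # guard against off-by-a-few frames
    return np.hstack([audio[:T], visual_up[:T]])       # feature-level fusion

# Example: a 2 s utterance with 39-dim audio LLDs and 10-dim visual features
fused = fuse_features(np.zeros((200, 39)), np.zeros((50, 10)))
print(fused.shape)  # (200, 49)
```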
0:06:25 | Okay. |
---|
0:06:27 | So, the frame rate of the visual features is twenty-five frames per second, which is |
---|
0:06:31 | a common frame rate, |
---|
0:06:33 | and there are two types of features: |
---|
0:06:35 | shape features, which are based on a point distribution model, |
---|
0:06:38 | and appearance features, which are based on PCA and gradient orientations. |
---|
0:06:45 | So, in the beginning we track twenty points on the face; |
---|
0:06:49 | these include points on the mouth, the chin, the eyes |
---|
0:06:54 | and the eyebrows. |
---|
0:06:55 | You can see here an example |
---|
0:06:58 | of a subject laughing, |
---|
0:07:00 | with the tracked points on the face. |
---|
0:07:02 | Okay, as you may see, the tracking |
---|
0:07:07 | is not perfect, and that happens; you shouldn't expect the tracking to be perfect. |
---|
0:07:13 | So we initialise the points, and then the tracker |
---|
0:07:15 | tracks these twenty points. |
---|
0:07:19 | Now, the main problem we have, |
---|
0:07:21 | and this is present for both shape and appearance features, |
---|
0:07:24 | is that we want to decouple |
---|
0:07:27 | head pose from facial expressions. |
---|
0:07:30 | How do we do this? |
---|
0:07:34 | Basically, we use |
---|
0:07:36 | a point distribution model. |
---|
0:07:37 | Each point |
---|
0:07:38 | has just two coordinates, x and y, |
---|
0:07:41 | so if we concatenate all the coordinates we end up with a forty-dimensional vector for each frame. |
---|
0:07:46 | If we now concatenate these vectors from all the frames into a matrix, |
---|
0:07:51 | that is, if we have K frames, we'll end up with a K-by-forty matrix, |
---|
0:07:54 | then we apply PCA on this matrix. |
---|
0:07:58 | It is well known that the greatest variance of the data lies in the first few principal components, |
---|
0:08:02 | and this means |
---|
0:08:03 | that, since there are significant head movements, most of the variance will be captured |
---|
0:08:09 | in the first principal components, whereas facial expressions, which account for a smaller variance, |
---|
0:08:15 | will be encoded in the lower |
---|
0:08:19 | components. |
---|
0:08:22 | So in this case |
---|
0:08:24 | we found that the first four components correspond to head movements |
---|
0:08:28 | and the remaining ones, from five to ten, to facial expressions. |
---|
0:08:31 | But this is not fixed; |
---|
0:08:35 | it depends on the dataset: you could have other datasets with even |
---|
0:08:40 | stronger head movement, |
---|
0:08:44 | where we would consider that the first five or six |
---|
0:08:47 | components correspond to head movement. |
---|
0:08:48 | So basically the features are very simple: they are just the projections of the |
---|
0:08:53 | forty coordinates |
---|
0:08:54 | onto the principal components that correspond to facial expressions. |
---|
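A minimal sketch of the shape-feature computation just described, assuming a 4/6 split between pose and expression components (the split is dataset-dependent, as noted above; names are illustrative):

```python
# Sketch of the shape-feature idea: PCA on the stacked point coordinates, with
# the top components treated as head motion and the next ones as expression.
import numpy as np

def shape_features(points, n_pose=4, n_expr=6):
    """points: (K, 20, 2) tracked points over K frames -> (K, n_expr) expression features."""
    X = points.reshape(len(points), -1)                # K x 40 data matrix
    Xc = X - X.mean(axis=0)                            # center before PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal components
    expr_components = Vt[n_pose:n_pose + n_expr]       # skip pose components 1..n_pose
    return Xc @ expr_components.T                      # project onto expression subspace

feats = shape_features(np.random.rand(100, 20, 2))
print(feats.shape)  # (100, 6)
```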
0:08:59 | Let me give you an example. |
---|
0:09:11 | Okay, so basically, |
---|
0:09:18 | this is not from this database, but just to give you an idea of how this principle |
---|
0:09:23 | works: |
---|
0:09:23 | on the top left |
---|
0:09:25 | you see the video stream; |
---|
0:09:27 | on the top right you see the actual tracked points; on the bottom left |
---|
0:09:32 | you see the reconstruction based on the principal components that correspond to head movements; |
---|
0:09:36 | and on the bottom right |
---|
0:09:38 | you see the reconstruction that corresponds |
---|
0:09:40 | to the principal components |
---|
0:09:42 | related to the expressions. |
---|
0:09:45 | And you can see that when he |
---|
0:09:47 | turns his head, |
---|
0:09:49 | the bottom right always remains frontal |
---|
0:09:52 | and follows the expressions, whereas the bottom left |
---|
0:09:55 | follows the head pose. |
---|
0:10:03 | It is a very simple approach. |
---|
0:10:14 | Okay. We would like to do something similar for the appearance features: we want to remove |
---|
0:10:19 | the head pose, |
---|
0:10:21 | and in this case it's harder. |
---|
0:10:23 | So what we did |
---|
0:10:25 | is the common approach in computer vision: |
---|
0:10:27 | we use a reference frame, which is |
---|
0:10:30 | a neutral expression of the subject, |
---|
0:10:32 | who is also in frontal view, |
---|
0:10:34 | and we compute the affine transformation between each frame and the reference frame. By affine transformation we mean, |
---|
0:10:39 | basically, |
---|
0:10:40 | that we scale, rotate |
---|
0:10:43 | and translate |
---|
0:10:44 | the face |
---|
0:10:45 | so that it comes to a frontal pose. |
---|
0:10:48 | You can see a very simple example at |
---|
0:10:50 | the bottom: |
---|
0:10:53 | on the left |
---|
0:10:55 | the head is |
---|
0:10:57 | a bit rotated, |
---|
0:10:59 | and after applying |
---|
0:11:01 | scaling, translation and rotation, |
---|
0:11:04 | the face becomes frontal. |
---|
0:11:08 | Then we crop |
---|
0:11:10 | an area of the face, |
---|
0:11:14 | and then we apply PCA to the image gradient orientations. |
---|
0:11:18 | Okay, I will not go into the details; you |
---|
0:11:23 | can find more information in this paper. |
---|
0:11:25 | The main idea is that, |
---|
0:11:28 | while it is quite common to apply PCA directly on the pixel intensities, this approach, as |
---|
0:11:33 | discussed in this paper, has some advantages (for example, it is more robust to illumination), |
---|
0:11:37 | and that's why we decided to use this one. |
---|
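A minimal sketch of this registration step using OpenCV; estimating the transform from the tracked points is an assumption here, and the crop box and names are illustrative, not the authors' exact pipeline:

```python
# Sketch of the registration step: estimate a similarity transform (scale,
# rotation, translation) from the current frame's tracked points to those of a
# frontal, neutral reference frame, then warp the frame and crop the face.
import cv2
import numpy as np

def register_face(frame, pts, ref_pts, crop_box):
    """frame: BGR image; pts, ref_pts: (20, 2) float32 point sets; crop_box: (x, y, w, h)."""
    M, _ = cv2.estimateAffinePartial2D(pts, ref_pts)   # scale + rotation + translation
    warped = cv2.warpAffine(frame, M, (frame.shape[1], frame.shape[0]))
    x, y, w, h = crop_box
    return warped[y:y + h, x:x + w]                    # cropped, roughly frontal face
```

PCA on the gradient orientations of this cropped region would then give the appearance features described in the talk.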
0:11:40 | Now, |
---|
0:11:42 | the audio features were computed |
---|
0:11:44 | with openSMILE, which is |
---|
0:11:47 | a toolkit provided by |
---|
0:11:49 | TUM. |
---|
0:11:50 | The audio frame rate is one hundred frames per second, |
---|
0:11:54 | and that's why we need to upsample the visual features, which are extracted at twenty-five frames per second. |
---|
0:11:59 | We use some standard audio features, like |
---|
0:12:03 | the PLP coefficients (the first five coefficients), energy, loudness, |
---|
0:12:07 | fundamental frequency and probability of voicing, |
---|
0:12:11 | together with their first- and second-order delta coefficients. |
---|
0:12:14 | These are pretty standard features. |
---|
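A minimal sketch of the delta computation on such low-level descriptors; note that openSMILE itself uses regression-window deltas, so the simple finite differences below are only an approximation:

```python
# Sketch of adding delta and delta-delta coefficients to the low-level
# descriptors (PLP 0-4, energy, loudness, F0, voicing probability = 9 dims).
import numpy as np

def add_deltas(lld):
    """lld: (T, D) low-level descriptors at 100 fps -> (T, 3*D) with deltas appended."""
    d1 = np.gradient(lld, axis=0)   # first-order deltas (central differences)
    d2 = np.gradient(d1, axis=0)    # second-order deltas
    return np.hstack([lld, d1, d2])

print(add_deltas(np.zeros((200, 9))).shape)  # (200, 27)
```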
0:12:18 | For classification, the first approach was to use long short-term memory recurrent neural networks, |
---|
0:12:26 | which Felix described earlier, |
---|
0:12:28 | so I'm just going to give one slide. |
---|
0:12:32 | This is for dynamic classification; for static classification, |
---|
0:12:36 | the main problem is that each utterance has a different length. |
---|
0:12:41 | So, in order to extract features which do not depend on the length of the utterance, |
---|
0:12:46 | what we simply do is extract some statistics (functionals) of these low-level features over the entire utterance: |
---|
0:12:53 | for example, the mean |
---|
0:12:55 | of a feature over the entire utterance, or the maximum value, or the range. |
---|
0:13:00 | This way we convert |
---|
0:13:01 | each utterance so that it is represented by a |
---|
0:13:04 | feature vector of fixed size, |
---|
0:13:07 | and then classification is performed for the entire utterance using support vector machines. |
---|
0:13:13 | You can see a diagram here; we have the same features (the appearance features, |
---|
0:13:18 | PLP, energy, F0, loudness and probability of voicing), |
---|
0:13:22 | and in the static case |
---|
0:13:25 | we compute the statistics over the entire utterance, |
---|
0:13:28 | we feed them to an SVM, and we get a label for the sequence. |
---|
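A minimal sketch of this static approach in Python; the three functionals shown, and all names and data, are illustrative (the actual functional set in such systems is larger):

```python
# Sketch of the static approach: map each variable-length utterance to a
# fixed-size vector of functionals (mean, max, range) and classify with an SVM.
import numpy as np
from sklearn.svm import SVC

def functionals(lld):
    """lld: (T, D) frame-level features -> fixed-size utterance-level vector."""
    return np.hstack([lld.mean(0), lld.max(0), lld.max(0) - lld.min(0)])

# utterances: list of (T_i, D) arrays; labels: one class per utterance
utterances = [np.random.rand(np.random.randint(50, 200), 9) for _ in range(20)]
labels = np.random.randint(0, 4, size=20)  # e.g. consent / hesitation / laughter / garbage
X = np.vstack([functionals(u) for u in utterances])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:3]))
```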
0:13:34 | Whereas in the second case, when we use the LSTM networks, |
---|
0:13:38 | we simply |
---|
0:13:41 | feed the low-level features (there is no need to compute the functionals) |
---|
0:13:47 | to the LSTM network, |
---|
0:13:49 | which provides a label for each |
---|
0:13:50 | frame, |
---|
0:13:51 | and then we can simply take the majority vote and label the sequence accordingly. |
---|
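A minimal sketch of the decision rule just described; the LSTM itself is omitted, and frame_labels stands in for its per-frame output (integer class labels are an assumption for illustration):

```python
# Sketch of the dynamic approach's decision rule: the network emits one label
# per frame, and the utterance label is the majority vote over the frames.
import numpy as np

def majority_vote(frame_labels):
    """frame_labels: (T,) integer class per frame -> single utterance label."""
    values, counts = np.unique(frame_labels, return_counts=True)
    return values[np.argmax(counts)]

print(majority_vote(np.array([2, 2, 1, 2, 3, 2])))  # -> 2
```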
0:13:55 | So now the results. |
---|
0:13:58 | As you can see, |
---|
0:14:01 | for the weighted average, |
---|
0:14:03 | SVMs |
---|
0:14:04 | provide |
---|
0:14:05 | better performance, |
---|
0:14:07 | whereas for the unweighted average, LSTMs |
---|
0:14:09 | lead to better performance. |
---|
0:14:11 | What this means |
---|
0:14:13 | is that SVMs are good at |
---|
0:14:15 | discriminating, that is, at classifying, the largest |
---|
0:14:20 | class, which in this case is hesitation; |
---|
0:14:22 | it contains more than a thousand examples. |
---|
0:14:25 | But they are not so good at recognising the other classes, |
---|
0:14:30 | whereas the LSTMs |
---|
0:14:31 | do a better job at recognising all the classes; |
---|
0:14:36 | so you see that their |
---|
0:14:38 | unweighted average values are usually much higher. |
---|
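For clarity, a small sketch of the two scores being compared, assuming a confusion matrix with rows as true classes; the numbers are toy values, not the talk's results:

```python
# Weighted accuracy (overall fraction correct, dominated by the largest class)
# versus unweighted accuracy (mean of per-class recalls, all classes equal).
import numpy as np

def wa_ua(confusion):
    """confusion: (C, C) matrix, rows = true class, cols = predicted class."""
    per_class_recall = confusion.diagonal() / confusion.sum(axis=1)
    wa = confusion.diagonal().sum() / confusion.sum()  # weighted by class size
    ua = per_class_recall.mean()                       # each class counts equally
    return wa, ua

# Toy example: one big class recognised well, two small ones recognised poorly
C = np.array([[900, 50, 50], [60, 30, 10], [50, 20, 30]])
print(wa_ua(C))  # WA is high (0.8), UA is much lower (0.5)
```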
0:14:42 | Something which is also interesting |
---|
0:14:44 | is |
---|
0:14:45 | to compare the performance of the audio-only |
---|
0:14:47 | with the audiovisual approach. |
---|
0:14:49 | For audio only, for example, here |
---|
0:14:51 | you see that it is sixty-four point six percent. |
---|
0:14:54 | Now, when we add appearance, it basically goes down. |
---|
0:14:57 | This may sound a bit surprising, |
---|
0:15:00 | because, especially for visual speech recognition, appearance features are considered the state of the art, |
---|
0:15:05 | but there are two reasons. First of all, we use information from the entire face, |
---|
0:15:09 | so basically there is a lot of redundant information, which may |
---|
0:15:13 | prevent better performance. |
---|
0:15:15 | And the second reason is that, since, |
---|
0:15:18 | as I showed you before, there is significant head movement, |
---|
0:15:21 | although we do this registration step |
---|
0:15:23 | to convert all expressions to a frontal pose, |
---|
0:15:28 | it is still not perfect; and especially when there are out-of-plane rotations, which means that the subject |
---|
0:15:34 | is not looking at the camera but is looking somewhere else, |
---|
0:15:37 | then with this approach it is impossible |
---|
0:15:39 | to reconstruct the frontal view. |
---|
0:15:44 | And it is known that the appearance features are much more sensitive |
---|
0:15:48 | to registration errors than shape features. |
---|
0:15:53 | So this could be a reasonable explanation of the worse performance when adding the appearance features. When we add |
---|
0:15:59 | the shape information, |
---|
0:16:00 | we see there is a significant gain, from sixty-four point six to seventy-two percent. |
---|
0:16:07 | Now, if we look at the confusion matrices to see |
---|
0:16:13 | the result per class: |
---|
0:16:15 | what you see (okay, it is a bit small) on the left |
---|
0:16:18 | is the result when using the audio information only, |
---|
0:16:21 | and on the right is when using audio plus shape; this is for the LSTM |
---|
0:16:26 | networks. |
---|
0:16:29 | We see that for consent and laughter there is a |
---|
0:16:33 | significant improvement, |
---|
0:16:36 | from forty-seven to sixty-six, and from sixty-three to seventy-nine, |
---|
0:16:40 | whereas for hesitation the performance |
---|
0:16:42 | goes down |
---|
0:16:46 | when we add this extra visual information. |
---|
0:16:49 | So, to |
---|
0:16:50 | summarise: |
---|
0:16:51 | we saw that the shape features improve the performance for consent and laughter, |
---|
0:16:56 | whereas the appearance features |
---|
0:16:59 | do not seem to do so; I mean, |
---|
0:17:01 | they only seem to help in the case of support vector |
---|
0:17:06 | machines, |
---|
0:17:07 | and still, the improvement is negligible, |
---|
0:17:10 | from fifty-nine point something to fifty-nine point four, |
---|
0:17:13 | whereas when we combine all the features together, then there is |
---|
0:17:16 | some more improvement. |
---|
0:17:17 | So this is regarding the appearance features. |
---|
0:17:21 | Comparing now the LSTM networks with the SVMs: |
---|
0:17:25 | the LSTMs basically do a better job of recognising |
---|
0:17:32 | the different vocalisations, whereas the SVMs |
---|
0:17:35 | mostly recognise |
---|
0:17:38 | the largest class, which is hesitation. |
---|
0:17:40 | Now, for future work, okay, |
---|
0:17:43 | as Felix said, |
---|
0:17:46 | in our experiments we have used pre-segmented sequences, which means we know the start and the end; we extract the sequence |
---|
0:17:52 | and we do classification. |
---|
0:17:54 | A much harder problem is to do spotting of these non-linguistic vocalisations, which means you are given a continuous stream |
---|
0:17:59 | and you don't know the beginning and the end; that is actually our goal, |
---|
0:18:03 | which, especially when using visual information, |
---|
0:18:09 | can be a challenging task, because there are cases where the face may not be visible, so in |
---|
0:18:14 | that case it is likely that we have to turn off, for example, |
---|
0:18:17 | the visual system. |
---|
0:18:19 | And I think this is it. You can have a look at |
---|
0:18:22 | these websites |
---|
0:18:24 | of our two groups. |
---|
0:18:27 | So, thank you very much. |
---|
0:18:29 | Thank you very much. |
---|
0:18:32 | We have time |
---|
0:18:33 | for a couple of questions. |
---|
0:18:46 | In the paper, it looks, |
---|
0:18:50 | as far as I can tell (but you can correct me), |
---|
0:18:52 | like the illumination was pretty much okay, so I'm just wondering: |
---|
0:18:58 | when you go over to more realistic |
---|
0:19:01 | recordings, where the illumination changes, |
---|
0:19:06 | would you expect that |
---|
0:19:08 | you get the same |
---|
0:19:09 | amount of improvement for consent and |
---|
0:19:13 | laughter, or not? Or |
---|
0:19:16 | have you done anything on that? |
---|
0:19:18 | Well, in this case, |
---|
0:19:20 | okay, the appearance features are definitely influenced by illumination; they are sensitive to illumination, unlike the shape features. |
---|
0:19:27 | The question is whether a difference in illumination can affect the tracker. |
---|
0:19:33 | If the tracker works |
---|
0:19:34 | fine even with big changes in illumination, |
---|
0:19:38 | then |
---|
0:19:41 | even the shape alone is going to provide useful information, because the points will be tracked well. |
---|
0:19:47 | But, no, this is basically an open problem, because it has not been solved, and, |
---|
0:19:53 | as you know, computer vision is still behind |
---|
0:19:56 | audio processing, |
---|
0:19:57 | and these are problems where nobody knows the answer. |
---|
0:20:01 | That's why most applications use a simple, controlled environment, |
---|
0:20:04 | basically. |
---|
0:20:06 | For example, in audiovisual speech recognition, |
---|
0:20:09 | the subject is looking directly at the camera and it is always a frontal view of the face, |
---|
0:20:16 | and quite recently there have been some approaches trying to apply these methods to more realistic scenarios. |
---|
0:20:23 | But to apply it in cases like you said, a real environment, |
---|
0:20:28 | at least I don't know an approach that |
---|
0:20:31 | would work well at the moment. |
---|
0:20:36 | Further questions? |
---|
0:20:48 | Basically, yes, all the features were upsampled, for |
---|
0:20:52 | both cases, |
---|
0:20:53 | initially, to the same frame rate, |
---|
0:20:55 | although for the SVM |
---|
0:20:57 | it may not be necessary, since we extract these functionals. (Actually, I was asking about the instants of upsampling.) |
---|
0:21:12 | Okay. |
---|
0:21:16 | Further questions? |
---|
0:21:19 | You showed a couple of clips, and what I wondered is: |
---|
0:21:22 | is there much variance within the classes with respect to the visual |
---|
0:21:31 | features? |
---|
0:21:34 | I mean, I could imagine more than one facial expression |
---|
0:21:41 | paired |
---|
0:21:42 | with a hesitation. |
---|
0:21:44 | Would you be able to actually capture that in training? |
---|
0:21:52 | So, okay, I just showed you one example, but if you look |
---|
0:21:58 | at all the examples, you will see that sometimes there is a big difference, and sometimes, |
---|
0:22:04 | if you watch |
---|
0:22:06 | the video without listening to the audio, |
---|
0:22:08 | it is very likely that even |
---|
0:22:10 | humans |
---|
0:22:11 | would be confused between the different vocalisations, say, when a hesitation comes in. |
---|
0:22:18 | So yes, there is variance, in particular for laughter: |
---|
0:22:23 | we have around three hundred examples of laughter, |
---|
0:22:26 | and there, |
---|
0:22:27 | okay, the variance is high. |
---|
0:22:29 | And how was it divided into a training and a test set? |
---|
0:22:34 | I think this is the official partitioning; maybe Björn can say more. |
---|
0:22:37 | What were the criteria for deciding training and testing? |
---|
0:22:43 | It was actually done just as for the challenge, |
---|
0:22:45 | and that was done to be very transparent: |
---|
0:22:48 | similarly to the AMI corpus, done by speaker ID. |
---|
0:22:53 | So yes, it is clear there is variance, and as I said before, sometimes even |
---|
0:22:57 | if you turn off the audio, |
---|
0:22:59 | you cannot discriminate between those two. |
---|
0:23:21 | I think there is actually also another issue. |
---|
0:23:28 | So, what was the question? About the covariance between... |
---|
0:23:35 | I interpret the question as: if there was high covariance between, for example, |
---|
0:23:43 | the features, could that explain that you didn't get much improvement for hesitations, |
---|
0:23:48 | because the covariance was high between the |
---|
0:23:50 | combined features? |
---|
0:23:58 | Yeah, it could be. I mean, |
---|
0:24:02 | some expressions are similar but come from |
---|
0:24:06 | different classes. |
---|
0:24:13 | I'm not sure; it could be, because these are spontaneous expressions and they are |
---|
0:24:19 | all different. I mean, |
---|
0:24:20 | if you look at all of them, you will not find two that are exactly the same. |
---|
0:24:28 | Okay, thank you. |
---|
0:24:31 | Any other questions? |
---|
0:24:33 | Okay, so thank you again. |
---|