0:00:16 | Oops. I apologise for missing my slot, and I guess I'm moving a little faster here than I expected. |
0:00:21 | My name is David; I'm from HP Labs, and today I'm presenting the work of my colleagues. |
0:00:32 | One of them had to cancel his trip at the last moment because of a situation in the US, so I'm going to do my best to present his work, which is on what is called event classification of photo collections. |
0:00:49 | The idea is to be able to use a collection of photographs from a single device, taken over a short period of time, maybe one or two hours, and to classify the event that the photo collection represents. |
0:01:10 | Examples of events would be Christmas scenes, birthday scenes, Valentine's Day, also sports, and things of this nature. |
0:01:21 | It turns out that this is apparently quite a challenging problem. I think the main difficulty is that the photos are essentially an unconstrained stream; they can just be any collection of photos, taken at any time. |
0:01:38 | And the reason for doing this, you know, is to make the organization and management of personal photos much easier to do, and to tell the stories of people's lives. |
0:01:52 | But the reason why HP is interested in this is that if we can automatically categorise or classify a collection of photos as fitting a certain theme, then the company is able to suggest products that the consumer can buy, like photo books and things like that. So that is HP's interest. |
0:02:17 | So this is a system overview of how my colleagues' work proceeds. |
0:02:26 | A collection of photos is given, and the first thing they do is process a single photo at a time: they extract some metadata and then run a classifier to obtain a prediction of which category that photo belongs to, based on the metadata. |
0:02:51 | Sorry, I didn't realise this slide was running on auto. |
0:02:59 | They also take each single photo, obtain a histogram of visual features, run a classifier, and, likewise as before, obtain soft predictions of which category that photo belongs to. |
0:03:15 | Then all of those metadata predictions and visual predictions from all the images are combined together in an information fusion step, and the event is then classified into one of the categories. |
0:03:29 | So in this talk I'll start by describing what the metadata is and its classifier; I'll also talk about the visual feature extraction process and the classifier for that; and finally the information fusion step and the results. |
0:03:49 | The metadata used here in these experiments actually consists of four different things: the time stamp of each of the photos, an indication of whether the flash was on or off when the photo was taken, the exposure time, and the focal length. |
0:04:09 | You can see that time stamps reveal a lot of information about certain events. |
0:04:17 | For example, if you look here, this is a histogram of all the photos labelled as Christmas in the training set. Zero corresponds to December 25th; negative numbers are days before December 25th and positive numbers are days after. You can see there is a very large spike on December 25th, but there is also a large number of Christmas photos that were taken throughout the month of December. |
0:04:49 | There is also a small number of Christmas-themed photos taken at other times of the year. Perhaps this is due to bad data; the fact that the photographer did not set the time correctly on the camera probably explains that. |
0:05:03 | Flash on or off reveals a lot of information about whether the photo was taken indoors or outdoors, and you can see most Christmas photos were taken with the flash on. Exposure time, likewise, tells you about the light conditions when the photo was taken, and focal length tells you about the relative distance of the scene from the camera. |
0:05:24 | So the metadata is extracted from a single image, and then a classifier that was built offline, based on the random forest technique, is run. |
0:05:41 | I'm certainly no expert on this, so I don't have a lot to say about it, but I understand that the random forest classifier offers very good performance at very low computational cost, and that is why this choice was made. |
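To make the metadata path concrete, here is a minimal sketch of the stage just described: the four EXIF-derived cues feeding a random forest. This is not the speaker's or the authors' code; the exact feature encoding (in particular the signed day offset around December 25th) and all names here are assumptions for illustration.

```python
# A minimal sketch of the metadata path, not the authors' implementation.
# Assumes the EXIF fields are already parsed into a dict; the feature
# encoding (especially the day offset around December 25th) is an assumption.
from datetime import datetime

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def metadata_features(exif: dict) -> np.ndarray:
    """Encode the four metadata cues mentioned in the talk."""
    ts = datetime.strptime(exif["DateTimeOriginal"], "%Y:%m:%d %H:%M:%S")
    # Signed day offset from December 25th of the same year, mirroring the
    # Christmas histogram shown on the slide.
    day_offset = (ts - datetime(ts.year, 12, 25)).days
    return np.array([
        float(day_offset),
        1.0 if exif["Flash"] else 0.0,  # flash on or off: indoors vs. outdoors
        float(exif["ExposureTime"]),    # light conditions
        float(exif["FocalLength"]),     # relative distance to the scene
    ])

# Trained offline on labelled single photos: X_train is (N, 4) and y_train
# holds the event labels; predict_proba gives the soft per-category prediction.
forest = RandomForestClassifier(n_estimators=100)
# forest.fit(X_train, y_train)
# p_meta = forest.predict_proba(metadata_features(exif).reshape(1, -1))[0]
```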
0:05:55 | The visual features are extracted in a different way: they use a bag-of-features approach. What's done by my colleagues here is that they take the original image, filter and downsample it down to a quarter of its size, and then filter and downsample that again down to one sixteenth. |
0:06:18 | The images at the three scales are divided into tiles: the original image into sixteen tiles, the quarter-size image into four tiles, and the one-sixteenth-size image forms a single tile, so there are twenty-one tiles in total, and each tile has the same size. |
0:06:27 | Now, from each tile they take a grid of sample points, and at each grid location they obtain a feature vector that looks something like this: a hundred-and-twenty-eight-dimensional feature vector. Each feature vector is then quantized to one of two hundred words; this dictionary is also trained offline. |
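The talk only says this two-hundred-word dictionary is trained offline. One common way to build such a codebook is k-means over descriptors sampled from the training images; the sketch below assumes that, and the choice of k-means is mine, not something stated in the talk.

```python
# Sketch of offline dictionary training; using k-means is an assumption.
# `training_descriptors` is a hypothetical (M, 128) array of descriptors
# sampled from the training images.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=200, n_init=10)
# kmeans.fit(training_descriptors)
# codebook = kmeans.cluster_centers_  # (200, 128): one row per visual word
```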
0:06:59 | So if you take all the feature vectors from a certain tile, you can then obtain a frequency vector of two hundred elements, because there are two hundred words in the codebook. In total there are twenty-one tiles, with two hundred elements in each frequency vector, and all those frequency vectors are concatenated together to obtain a vector of four thousand two hundred elements. |
0:07:30 | That feature vector is then passed through a support vector machine that produces a prediction of the event category, based on the visual features, for a single image. |
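Putting the visual path together, here is a hedged sketch of the tiled bag-of-features pipeline as I understand it. The local descriptor is left abstract, since the talk only says it is 128-dimensional; `extract_descriptor`, the grid step, and the tile list are hypothetical, and the 16 + 4 + 1 = 21 tile layout follows the reconstruction above.

```python
# Sketch of the per-image visual pipeline, not the authors' code.
# `extract_descriptor(tile, y, x)` is a hypothetical 128-d local descriptor.
import numpy as np
from sklearn.svm import SVC

def tile_histogram(tile: np.ndarray, codebook: np.ndarray,
                   step: int = 8) -> np.ndarray:
    """Quantize densely sampled descriptors against the 200-word codebook."""
    hist = np.zeros(len(codebook))                # codebook: (200, 128)
    for y in range(0, tile.shape[0], step):
        for x in range(0, tile.shape[1], step):
            d = extract_descriptor(tile, y, x)    # hypothetical helper
            word = int(np.argmin(((codebook - d) ** 2).sum(axis=1)))
            hist[word] += 1
    return hist / max(hist.sum(), 1.0)            # frequency (histogram) vector

def image_feature(tiles: list, codebook: np.ndarray) -> np.ndarray:
    """Concatenate the 21 per-tile histograms: 21 x 200 = 4200 dimensions."""
    return np.concatenate([tile_histogram(t, codebook) for t in tiles])

# Offline: an SVM with probability outputs over the 4200-d vectors.
svm = SVC(probability=True)
# svm.fit(X_train, y_train)
# p_visual = svm.predict_proba(image_feature(tiles, codebook).reshape(1, -1))[0]
```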
0:07:46 | Okay. So what I've described so far is using the metadata to provide predictions for a single image, and then using the visual features to provide predictions for a single image. But it turns out that for a single image this might not work very well. |
0:08:02 | So take a look at this image: it's not exactly clear what event the image represents. |
0:08:10 | But if you look at all the surrounding images from that collection, you can see pretty clearly that this is a scene at a state park. |
0:08:18 | And that is the main contribution of this work: to leverage the fact that we have a collection of images, and to see how much my colleagues could do with that. |
0:08:37 | Okay, so that brings us to the information fusion step, and what they've done is actually fairly simple. |
0:08:45 | Suppose we have a collection of images I_1 through I_N. From the previous steps that I've already told you about, we have already obtained probability vectors: p^V(i, j), where the superscript indicates visual features, is the probability that image i is classified as belonging to event j, and likewise p^M(i, j) is the probability that image i is classified as belonging to event j based on the metadata features. |
0:09:16 | So we obtain these vectors of probabilities. But we also have to note that different types of features offer different amounts of confidence for different events. |
0:09:27 | We've already seen that the time-stamp metadata feature is fairly useful in predicting Christmas, but it is not very useful in telling you about birthdays, because birthdays are fairly evenly distributed throughout the calendar year. |
0:09:45 | So these per-event weights need to be obtained; they are also obtained offline during training. Then these probabilities are combined into a single confidence number for a collection of photos I: the confidence that the collection I should be classified as event j. |
0:10:12 | What's done is a linear combination of the probabilities for the single images, weighted by the per-event weights and also by a factor alpha and one minus alpha, which trades off the metadata and the visual feature classifications. |
0:10:32 | If the metadata is not available, as is the case for, I think, approximately twenty-five percent of all the elements in the test set, you can only use the visual data. |
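The fusion rule is given only in words, so the formula here is a reconstruction: per event j, something like c(I, j) = (1/N) * sum_i [ alpha * wV_j * pV(i, j) + (1 - alpha) * wM_j * pM(i, j) ], falling back to the visual term alone when the metadata is missing. A minimal sketch under that assumption, with hypothetical names throughout:

```python
# Sketch of the fusion step as reconstructed from the talk; the authors'
# exact formula may differ. p_vis and p_met are (N, J) per-image soft
# predictions; w_vis and w_met are (J,) per-event weights learned offline;
# alpha trades off the visual and metadata evidence.
from typing import Optional

import numpy as np

def fuse_collection(p_vis: np.ndarray, p_met: Optional[np.ndarray],
                    w_vis: np.ndarray, w_met: np.ndarray,
                    alpha: float) -> int:
    if p_met is None:
        # Metadata missing (roughly 25% of the test set): visual only.
        conf = (w_vis * p_vis).mean(axis=0)
    else:
        conf = (alpha * w_vis * p_vis
                + (1.0 - alpha) * w_met * p_met).mean(axis=0)
    return int(np.argmax(conf))  # index of the winning event category
```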
0:10:46 | So I think we can now start to discuss the experimental results. |
0:10:51 | I think a hundred thousand photos were obtained from online photo folders, and these were manually labelled into eight event types: Christmas, Halloween, Valentine's Day, Fourth of July, sports, birthdays, beach scenes, and none of the above. None of the above turns out to be a very difficult category to deal with. |
0:11:24 | Out of the hundred thousand photos, eight thousand were used for training, and a hundred and fifty-two collections were used for testing; each collection could contain anywhere between five and one hundred images. |
0:11:41 | Here I show the confusion matrices that my colleagues obtained for single-photo classification, based on the metadata on top and the visual features on the bottom. |
0:11:56 | Using metadata, it actually turns out that you can do very well on single images for Christmas, Halloween, Valentine's Day, and Fourth of July, because these have dates associated with them. |
0:12:11 | For the other ones, sports scenes, birthdays, and beach scenes, the metadata is not as useful. |
0:12:18 | The visual classifiers turn out to be pretty good as well; they are very good at sports events and beach events in particular, and that is probably because those have a fairly consistent visual signature, especially a consistent visual composition. |
0:12:40 | The next page of results shows the classification results for the whole collections, after the information fusion has been done. |
0:12:51 | You can see that the results are not too bad: they are getting, I guess, between seventy and ninety percent accuracy for these seven categories. |
0:13:00 | The none-of-the-above category performs less well: what's happening is that images which have nothing to do with any of these events are actually getting mapped to one of the seven event categories. |
0:13:20 | Well, to conclude, what I've been presenting today is my colleagues' work on classifying collections of photos into a number of event categories. |
0:13:36 | What my colleagues are interested in doing next, instead of just extracting features from individual photos and then fusing them later, is to directly extract features from the collection itself. This might require the invention of new types of features. |
0:13:58 | They would also like to explore the fusion of different classifiers. What they are using right now is linear weighting; potentially, I think, some nonlinear approach, maybe one considering different groups of metadata features or different visual features, might provide better fusion than what they are using right now. |
0:14:27 | They would also like to grow the number of categories to a larger number. |
0:14:36 | So that wraps up my talk. I'll do my best to answer your questions, but if there are any that I can't answer directly, I can always forward them to my colleagues. |
0:14:57 | Are there any questions? |
0:15:08 | And this concludes this particular session, so thank you very much. |