0:00:16 | Okay. |
0:00:19 | So, good morning everyone, it's my pleasure to present our work here. My name is [name unclear]; |
0:00:24 | I'm from the Centre for Biomedical Image Analysis at |
0:00:29 | Masaryk University, Brno, and in my talk I would like to present the work of |
0:00:34 | mine and my colleague, [name unclear], who is from |
0:00:39 | this faculty. |
0:00:42 | The outline of my talk is as follows: |
0:00:49 | first, I will shortly recall convolution and how to compute it, |
0:00:57 | and then I will talk about the motivation of our work, which is the use of |
0:01:01 | convolution in image processing and the acceleration of convolution on graphics cards. |
0:01:09 | Then I will present our algorithm and show the results, |
0:01:16 | and finally I will conclude my talk. |
0:01:22 | First, what is convolution? I'm sure most of you are familiar with |
0:01:29 | convolution, or have at least heard about it, but just to recall some basics: |
0:01:37 | it's an essential operation in image and signal processing. |
0:01:45 | To show an example |
0:01:50 | here on the two images: in fact, convolution takes two images as input |
0:01:57 | and it outputs one image. |
0:02:01 | One of the input images is usually called the convolution kernel. What |
0:02:07 | actually happens is that |
0:02:11 | at every pixel of the output image, the value is computed as the sum |
0:02:17 | of the pointwise multiplication of the convolution kernel and an appropriate window of |
0:02:25 | the input image. |
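The per-pixel sum described above can be sketched directly. This is a minimal illustration, not code from the talk; `naive_convolve` is an illustrative name, and only the "valid" output region (where the window fits entirely inside the image) is computed.

```python
import numpy as np

def naive_convolve(image, kernel):
    """Naive 2-D convolution straight from the definition: every output
    pixel is the sum of the pointwise multiplication of the (flipped)
    kernel and the corresponding window of the input image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    flipped = kernel[::-1, ::-1]      # convolution flips the kernel
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * flipped)
    return out
```

The two nested loops over output pixels are exactly why this approach is the slowest one mentioned next.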
0:02:28 | There are several approaches to computing the convolution. One of them is |
0:02:35 | the so-called naive approach, from the definition, which is the slowest one. Then there |
0:02:41 | are some faster solutions, like separable convolution or recursive filters, which are, |
0:02:50 | however, suitable only for some specific kinds of convolution kernels. For a |
0:02:58 | large, general convolution kernel, it turns out that the best solution is so-called fast |
0:03:04 | convolution, which is |
0:03:08 | actually convolution computed in the frequency domain according to the so-called convolution theorem. That means |
0:03:16 | that we compute the Fourier transforms of the input image and the convolution kernel, we compute the pointwise |
0:03:23 | multiplication, and then we compute the inverse Fourier transform of the result. |
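The three steps of fast convolution (transform, pointwise multiply, inverse transform) can be sketched as follows. This is an illustrative NumPy sketch, not the talk's CUDA implementation; `fft_convolve` is an assumed name, and NumPy's FFT stands in for the GPU library calls.

```python
import numpy as np

def fft_convolve(image, kernel):
    """Fast convolution via the convolution theorem: zero-pad both
    inputs to the full output size, FFT them, multiply pointwise,
    and inverse-FFT the product."""
    shape = [i + k - 1 for i, k in zip(image.shape, kernel.shape)]
    F = np.fft.fft2(image, shape)     # FFT of the input image
    G = np.fft.fft2(kernel, shape)    # FFT of the kernel
    return np.real(np.fft.ifft2(F * G))
```

The zero-padding to `image size + kernel size - 1` per dimension is what makes this a full linear (not circular) convolution.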
0:03:31 | Then the question is why to accelerate convolution on a GPU. The answer |
0:03:39 | is simple: nowadays, GPU implementations achieve up to approximately ten times speedup over |
0:03:48 | optimized CPU solutions. |
0:03:51 | And if you look for resources, you can find some convolution |
0:03:59 | implementations in the NVIDIA documentation; for example, you can find naive |
0:04:06 | and separable convolution in the CUDA SDK. You can also find an |
0:04:13 | implementation of the fast Fourier transform in the CUFFT library, which can help you |
0:04:20 | implement the fast convolution. |
0:04:24 | Now, the motivation for our work |
0:04:27 | is |
0:04:29 | the following: in our laboratory we develop a simulator of optical microscopy images. |
0:04:39 | You can try it yourself at this web page. |
0:04:48 | The simulation itself is divided into three phases. In the first step, |
0:04:56 | an artificial image is created; in the second step, the image is blurred according to |
0:05:04 | the configuration of a real optical microscope; and in the final phase, noise |
0:05:12 | is added. The reason why we use this simulator is that |
0:05:18 | we can generate as many images as we want, and |
0:05:24 | with these images we know the ground truth: we know the number of |
0:05:30 | objects, so we can test image processing algorithms on |
0:05:37 | these images and evaluate how they perform. |
0:05:42 | The essential part is the second phase, where the convolution of an input image |
0:05:49 | and a so-called point spread function is computed, and in this phase the convolution is |
0:05:56 | computed over large images. For your information, the input images can |
0:06:05 | have 1000 × 1000 × 100 pixels, and the convolution kernel can have |
0:06:12 | 100 × 100 × 100 pixels, so these images are really big. |
0:06:19 | If one tries to implement this convolution on a GPU, |
0:06:27 | it turns out that the GPU implementation has a significantly lower computation time than CPU implementations, |
0:06:37 | but there are two drawbacks. The first one is that the GPU implementation is |
0:06:45 | basically for complex data; it's not optimized for images which |
0:06:54 | have real values. |
0:06:57 | So if we compare this solution with the optimized CPU solution, the speedup |
0:07:03 | is not that big. The other drawback is that if the input image |
0:07:09 | is too large, then we don't have enough GPU memory to compute the convolution. |
0:07:16 | So the goals are, first, to decompose the problem of convolution, and the second |
0:07:23 | one is to optimize it for real data. |
0:07:27 | As for the decomposition of the problem, there are several approaches, but with respect |
0:07:34 | to the number of operations and the number of data transfers between CPU and GPU, it turns |
0:07:40 | out that the best solution is the so-called decimation-in-frequency algorithm, which is |
0:07:45 | well known from the fast Fourier transform algorithm. |
0:07:52 | So we use this decimation-in-frequency algorithm, and the idea looks basically as follows: |
0:08:00 | on the CPU we actually compute the first step of the fast Fourier transform according |
0:08:09 | to these equations, |
0:08:11 | and this allows us to split the input data into n parts (in this |
0:08:18 | case, two parts), and we can compute the fast Fourier transform, and subsequently the |
0:08:25 | convolution, in two parts separately. |
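The "first step of the FFT on the CPU" can be sketched in one dimension. This is an illustrative sketch of the standard radix-2 decimation-in-frequency stage (the function name `dif_split` is assumed, not from the talk): the two half-length outputs can then be transformed, and convolved, independently.

```python
import numpy as np

def dif_split(x):
    """One decimation-in-frequency stage: a length-N signal is split
    into two length-N/2 signals whose (half-size) FFTs give the
    even- and odd-indexed bins of the full FFT."""
    half = len(x) // 2
    twiddle = np.exp(-2j * np.pi * np.arange(half) / len(x))
    a = x[:half] + x[half:]               # FFT(a) -> X[0], X[2], ...
    b = (x[:half] - x[half:]) * twiddle   # FFT(b) -> X[1], X[3], ...
    return a, b
```

Because `a` and `b` are processed separately, each part needs only half the GPU memory of the full transform.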
0:08:32 | As for the optimization for real data, the algorithm changes slightly: we add |
0:08:40 | two steps into this workflow. We convert the data from |
0:08:48 | real to complex, and in the final phase we combine the results. |
0:08:55 | The first step, the transformation of real data into complex data, is actually |
0:09:02 | trivial, because what we do is take |
0:09:08 | the odd pixels as the real part of a complex number, and we take the even |
0:09:15 | pixels as |
0:09:20 | the imaginary part of the complex image, and what we get is a |
0:09:27 | complex image of half size, |
0:09:30 | without doing any computations or data transfers; we just recast |
0:09:36 | the array in the computer memory. |
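The packing trick described above can be sketched in one dimension. This is an illustrative version of the standard "two halves of a real signal in one complex FFT" technique, with the even samples as the real part and the odd samples as the imaginary part (the talk packs them the other way around, which works the same); `packed_real_fft` is an assumed name.

```python
import numpy as np

def packed_real_fft(x):
    """FFT of a real signal via one half-size complex FFT: pack half
    the samples as the real part and half as the imaginary part (in
    memory this is just a recast, no computation), transform, then
    untangle the two spectra using Hermitian symmetry. Returns bins
    0..N/2, matching np.fft.rfft."""
    N = len(x)
    z = x[0::2] + 1j * x[1::2]         # the "recast" step
    Z = np.fft.fft(z)                  # one FFT of half the length
    Zk = np.concatenate([Z, Z[:1]])    # extend: Z[N/2] == Z[0]
    Zc = np.conj(Zk[::-1])             # conj(Z[-k])
    E = 0.5 * (Zk + Zc)                # spectrum of even samples
    O = -0.5j * (Zk - Zc)              # spectrum of odd samples
    w = np.exp(-2j * np.pi * np.arange(N // 2 + 1) / N)
    return E + w * O
```

The packing itself costs nothing; only the final untangling (the recombination phase in the talk) does any arithmetic.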
0:09:42 | So |
0:09:43 | this is done instantly. The final phase takes some time, |
0:09:51 | but it's not that computationally expensive. |
0:10:00 | To summarize the whole algorithm: |
0:10:05 | we get the input signal and the input kernel, we decompose both of them, and recast |
0:10:13 | them to complex signals; this is done on the CPU. Then we compute all Fourier transforms, |
0:10:21 | pointwise multiplications, and inverse Fourier transforms on the GPU; this is the computationally expensive |
0:10:28 | part, so we try to keep most of the computation on the GPU. |
0:10:33 | And finally, we transfer the data back, compose them together, and recombine them, |
0:10:42 | so that |
0:10:44 | we get the result. |
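The decompose / convolve-per-part / recombine pipeline can be sketched end to end in one dimension. This is an illustrative sketch under stated assumptions (equal, power-of-two lengths, circular convolution); `split_convolve` is an assumed name and plain NumPy FFTs stand in for the GPU step.

```python
import numpy as np

def split_convolve(x, h):
    """Circular convolution decomposed by one decimation-in-frequency
    stage: split both signals into halves (the CPU step), convolve the
    halves independently (the part that would run on the GPU), then
    recombine. Assumes len(x) == len(h), already padded to the full
    convolution size."""
    N = len(x)
    half = N // 2
    w = np.exp(-2j * np.pi * np.arange(half) / N)

    def dif(s):                        # first DIF stage
        return s[:half] + s[half:], (s[:half] - s[half:]) * w

    xa, xb = dif(x)
    ha, hb = dif(h)
    # Each half is convolved separately, so each FFT is half the size
    # and the parts could be sent to the GPU one at a time.
    ya = np.fft.ifft(np.fft.fft(xa) * np.fft.fft(ha))
    yb = np.fft.ifft(np.fft.fft(xb) * np.fft.fft(hb))
    # Recombination: invert the DIF stage.
    y = np.empty(N, dtype=complex)
    y[:half] = (ya + yb / w) / 2
    y[half:] = (ya - yb / w) / 2
    return np.real(y)
```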
0:10:49 | Now I will speak about some experiments. This is the configuration we used, |
0:10:57 | and we used two datasets. |
0:11:01 | In the first dataset, the images have |
0:11:09 | dimensions that are powers of two, which |
0:11:16 | is optimal for the fast Fourier transform, so these images are expected to have |
0:11:24 | good computation times. In the second dataset, the images are slightly larger, and |
0:11:32 | that means that the computation times will be longer. |
0:11:40 | As you can see on this graph, on the x-axis we have image |
0:11:48 | size; on the y-axis we have performance, which is computed as the ratio between the output |
0:11:55 | image size and the computation time, so higher is better. |
0:12:05 | The GPU solution optimized for real data performs best, |
0:12:20 | and even the GPU solution without the optimization for real data performs better than the best |
0:12:28 | CPU solution. But when we optimize for real data, we get about two |
0:12:34 | times higher performance. |
0:12:37 | This graph is for the first dataset, |
0:12:43 | and this graph is for the second dataset. Clearly, the performance goes down when the |
0:12:50 | images don't have dimensions of a specific size, but for both datasets it |
0:12:57 | turns out that the GPU solution is much faster than |
0:13:03 | the CPU one, and the speedup achieved is approximately |
0:13:10 | five to ten, depending on the input image. |
0:13:18 | We also performed tests |
0:13:24 | with a fixed input image, and we computed the performance for different values |
0:13:32 | of the parameter p. |
0:13:38 | The parameter p is actually the number of sub-images the data are split |
0:13:45 | into, so one means that no decomposition is made, |
0:13:51 | and two means that the data are decomposed into two parts, and so on. We can clearly see |
0:13:57 | that the decomposition imposes some |
0:14:03 | performance decrease, |
0:14:06 | but then, with increasing parameter p, the performance doesn't decrease that much, and it |
0:14:17 | stays significantly higher than the performance of the CPU implementation. As for the |
0:14:26 | GPU memory that is required, of course it decreases with p. With p |
0:14:36 | equal to two or p equal to four, the memory requirements |
0:14:42 | are the same, because of the properties of the recombination phase, |
0:14:51 | but you can see that |
0:14:55 | the memory requirements can decrease |
0:15:04 | to any level you need, according to the graphics card |
0:15:11 | you have. |
0:15:13 | As for the numerical precision, it turns out |
0:15:19 | that the error of the output image is under the level of |
0:15:29 | one. That means that |
0:15:35 | the maximum difference between the output image computed on the GPU and the output image computed on the CPU |
0:15:44 | in long double precision |
0:15:46 | is less than one, and that means that the images are |
0:15:53 | essentially the same. But |
0:15:57 | the images that we use have an effective bit depth |
0:16:06 | of about twelve bits, which is common for |
0:16:15 | optical microscopy images. So if one |
0:16:20 | wants to use images of higher bit depth, let's say sixteen or thirty- |
0:16:28 | two, then this precision |
0:16:32 | might not be enough. |
0:16:39 | In that case, one should use not single precision for the computation but, let's |
0:16:45 | say, double precision. Fortunately, modern graphics cards allow the use of double precision. |
0:16:54 | We also tried to implement a multi-GPU solution, |
0:17:01 | and we made some optimizations with synchronization so that |
0:17:09 | the overall time of the multi-GPU solution is as good as possible. However, |
0:17:18 | it turns out that with two graphics cards one definitely does not achieve a two- |
0:17:24 | times speedup. In fact, for some images, two GPUs perform more or less |
0:17:31 | the same as one GPU, and that's because of the overhead of data transfers. |
0:17:38 | For some images, one can achieve maybe a 1.5 times speedup, |
0:17:46 | so it can be interesting for some kinds of input images. |
0:17:53 | To conclude my talk: |
0:17:57 | our algorithm for the computation of convolution on a GPU has actually no |
0:18:05 | limit on the data size. The only limit is actually the CPU |
0:18:10 | memory, but there are some approaches to decompose the data even if you don't |
0:18:18 | have enough CPU memory, so for extremely large input images |
0:18:26 | you can combine our approach with some other approaches to perform |
0:18:33 | the computation. |
0:18:37 | The algorithm is optimized for real data, which are |
0:18:43 | most common in practical applications. It has quite stable performance; that means that with |
0:18:51 | an increasing number of sub-images the data are split into, the performance doesn't |
0:18:57 | decrease significantly. There are no redundant data transfers or computations, |
0:19:04 | and a multi-GPU solution is also possible, so if you have many graphics cards, you |
0:19:11 | can split the data into more parts and send the data to the individual |
0:19:18 | graphics cards; they then compute in parallel. The disadvantages are that, as I told you, |
0:19:26 | the multi-GPU solution is not that efficient, |
0:19:30 | and the numerical precision may not be sufficient for some applications, so one |
0:19:36 | should consider using double precision for the computation. |
0:19:45 | So, that's all. To conclude my talk, thank you very much for listening, and |
0:19:50 | I'm looking forward to your questions. |
0:20:15 | Uh-huh. |
0:20:23 | Uh-huh, yes, thank you, those are definitely interesting questions. |
0:20:30 | I didn't mention it, but it was on the slide: |
0:20:36 | the CPU was a quad-core |
0:20:41 | Intel Core i7, and the GPU was a GeForce GTX |
0:20:47 | 480. |
0:20:50 | And the CPU implementation was based on the FFTW library, which is probably |
0:21:00 | considered the fastest |
0:21:04 | FFT library, so we used |
0:21:11 | as fast a CPU implementation as possible. |
0:21:29 | Yes. Actually, for the sake of simplicity we... |
0:21:44 | I will put it differently: in fast convolution, |
0:21:51 | both the data and the kernel have to be padded |
0:21:56 | to the same size, so in fact it doesn't matter if you have a |
0:22:01 | kernel of one hundred by one hundred or two hundred by two |
0:22:05 | hundred, because |
0:22:09 | at the end everything is padded to the size |
0:22:18 | of the input image plus the size of the kernel, so it is like |
0:22:24 | one thousand plus two hundred, for example. And for the |
0:22:29 | sake of simplicity, during our experiments we used |
0:22:35 | an input image of the same size as the convolution kernel, but |
0:22:42 | it really doesn't matter |
0:22:45 | how large the convolution kernel is. |
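The padding rule from this answer can be checked numerically. This is an illustrative 1-D sketch of the "one thousand plus two hundred" example (the variable names and sizes are assumptions for illustration, and NumPy stands in for the GPU code).

```python
import numpy as np

# In fast convolution, both inputs are zero-padded to the same size,
# namely (input size + kernel size - 1) per dimension, so the kernel
# size barely matters once padding is done.
rng = np.random.default_rng(0)
image = rng.random(1000)
kernel = rng.random(200)
n = len(image) + len(kernel) - 1                  # padded size: 1199
fast = np.real(np.fft.ifft(np.fft.fft(image, n) * np.fft.fft(kernel, n)))
direct = np.convolve(image, kernel)               # reference result
```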
0:22:59 | Uh-huh. |
0:23:16 | Because for the numerical precision test, we wanted to compare single |
0:23:24 | precision on the GPU with the ground truth, |
0:23:29 | because we wanted to make sure that the results are correct. So for this |
0:23:35 | test we used long double precision on the CPU; however, for the performance tests we used |
0:23:42 | single precision both on the CPU and the GPU. |
0:23:49 | I'm sorry if that wasn't clear during the talk. |
0:24:15 | Block convolution? You mean splitting it into small blocks and |
0:24:22 | performing the convolution with overlap? Yeah, that's |
0:24:29 | a known approach. |
0:24:35 | Actually, I was working on |
0:24:42 | this approach myself, but the disadvantage of this approach is that |
0:24:49 | the number of operations, |
0:24:52 | in terms of asymptotic complexity, is higher, |
0:24:57 | and so is the number of data transfers, because of the overlaps. |
0:25:04 | And it turns out that with a GPU implementation, the number of data |
0:25:09 | transfers is essential, |
0:25:11 | so we wanted to keep the number of data transfers as low as |
0:25:16 | possible, and |
0:25:18 | for this reason we used decimation in frequency instead of the overlap- |
0:25:26 | save or overlap-add method, because we got a significantly lower number of data transfers. |
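For comparison with the answer above, the block-convolution alternative can be sketched as overlap-add. This is an illustrative 1-D sketch (the name `overlap_add` is assumed, not from the talk): note that every block needs its own padded transforms, which is the per-block transfer and operation overhead being discussed.

```python
import numpy as np

def overlap_add(x, h, block):
    """Overlap-add block convolution: convolve fixed-size blocks of x
    with h via padded FFTs and sum the overlapping tails."""
    n = block + len(h) - 1                 # padded size per block
    H = np.fft.fft(h, n)                   # kernel spectrum, reused
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        yb = np.real(np.fft.ifft(np.fft.fft(seg, n) * H))
        y[start:start + n] += yb[:len(y) - start]
    return y
```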
0:25:44 | Uh-huh. |
0:25:47 | Yeah, d? Oh, I'm sorry, I didn't mention it: d means the number of parts |
0:25:52 | the data are decomposed into. |
0:25:55 | So in fact, with overlap-save and overlap-add, with an increasing number of |
0:26:02 | parts, you get a higher number of data transfers. |
0:26:17 | Okay. |