So, good morning everyone, it's my pleasure to present our work here. My name is Pavel Karas, I'm from the Centre for Biomedical Image Analysis at Masaryk University, Brno, and in my talk I would like to present the work done together with my colleague David Svoboda, who is from this faculty.
The outline of my talk is as follows: first I will briefly recall convolution and how to compute it; then I will talk about the motivation for our work, which is the use of convolution in image processing and the acceleration of convolution on graphics cards; then I will present our algorithm and show the results; and finally I will conclude my talk.
First, what is convolution? I'm sure most of you are familiar with convolution, or have at least heard about it, but just to recall some basics: it is an essential operation in image and signal processing. To show an example here on the two images: convolution takes two images as an input and produces one image as output. One of the input images is usually called the convolution kernel. What actually happens is that at every pixel of the output image, the value is computed as the sum of the point-wise multiplication of the convolution kernel and an appropriate window of the input image.
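As a concrete illustration of that definition, a minimal NumPy sketch of the direct approach might look as follows (illustrative code only, not the implementation discussed in the talk):

```python
import numpy as np

def naive_convolve(image, kernel):
    """Direct ('naive') convolution: for every output pixel, sum the
    point-wise products of the flipped kernel and the image window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # Full linear convolution: output size is image size + kernel size - 1.
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    padded = np.pad(image, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    flipped = kernel[::-1, ::-1]   # convolution flips the kernel
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out
```

The two nested loops over every output pixel are exactly why this is the slowest approach.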
There are several approaches to computing convolution. One of them is the so-called naive approach, directly from the definition, which is the slowest one. Then there are faster solutions, such as separable convolution or recursive filters, which are, however, suitable only for specific kinds of convolution kernels. For a large, general convolution kernel, it turns out that the best solution is the so-called fast convolution, which is the convolution computed in the frequency domain according to the convolution theorem. That means we compute the Fourier transforms of the input image and the convolution kernel, multiply them point-wise, and then compute the inverse Fourier transform of the result.
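In NumPy terms, that recipe (transform, point-wise multiply, inverse transform) can be sketched like this; `rfft2` is used since the inputs are real, which foreshadows the real-data optimization discussed later:

```python
import numpy as np

def fft_convolve(image, kernel):
    """Fast convolution via the convolution theorem: zero-pad both inputs
    to the full output size, FFT, multiply point-wise, inverse FFT."""
    shape = tuple(i + k - 1 for i, k in zip(image.shape, kernel.shape))
    F = np.fft.rfft2(image, shape)       # forward transforms (zero-padded)
    G = np.fft.rfft2(kernel, shape)
    return np.fft.irfft2(F * G, shape)   # point-wise product, then inverse
```

For large kernels this wins because the FFTs cost O(N log N) while the direct sum costs O(N · K) in the number of pixels.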
Then the question is why to accelerate convolution on a GPU. The answer is simple: nowadays, GPU implementations achieve up to approximately ten times speedup over optimized CPU solutions. If you look for resources, you can find some convolution implementations in the NVIDIA CUDA documentation; for example, you can find naive and separable convolution in the CUDA SDK. You can also find an implementation of the fast Fourier transform in the CUFFT library, which can help you implement the fast convolution.
Now, the motivation for our work is the following: in our laboratory we develop a simulator of optical microscopy images. You can try it yourself at this web page.
The simulation itself is divided into three phases. In the first step, an artificial image is created; in the second step, the image is blurred according to the configuration of a real optical microscope; and in the final phase, noise is added. The reason why we use this simulator is that we can generate as many images as we want, and for these images we know the ground truth, we know the number of objects, so we can test image processing algorithms on them and evaluate how they perform.
The essential part is the second phase, where the convolution of an input image and a so-called point spread function is computed. In this phase, the convolution is computed over large images: for your information, the input images can have 1000 × 1000 × 100 pixels and the convolution kernel can have 100 × 100 × 100 pixels, so these images are really big.
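To see why this is a problem, a back-of-the-envelope memory estimate helps. Assuming sizes of a 1000 × 1000 × 100 image and a 100 × 100 × 100 kernel (my reading of the recording; treat them as illustrative), a single zero-padded single-precision complex buffer already approaches 2 GiB, more than the memory of a typical graphics card of that era:

```python
# Per-dimension padded size for linear convolution: image + kernel - 1.
padded = [1000 + 100 - 1, 1000 + 100 - 1, 100 + 100 - 1]
voxels = padded[0] * padded[1] * padded[2]
bytes_complex32 = voxels * 8          # single-precision complex = 8 bytes
gib = bytes_complex32 / 2**30
print(round(gib, 2))                  # ~1.79 GiB for ONE padded buffer
```

And fast convolution needs at least two such buffers in flight, hence the need for decomposition.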
If one tries to implement this convolution on a GPU, it turns out that the GPU implementation has a significantly lower computation time than CPU implementations, but there are two drawbacks. The first is that the GPU implementation is basically for complex data; it is not optimized for images with real values, so if we compare it with an optimized CPU solution, the speedup is not that big. The other drawback is that if the input image is too large, there is not enough GPU memory to compute the convolution. So our goals are, first, to decompose the convolution problem, and second, to optimize it for real data.
As for the decomposition of the problem, there are several approaches, but with respect to the number of operations and the number of data transfers between CPU and GPU, it turns out that the best solution is the so-called decimation-in-frequency algorithm, well known from the fast Fourier transform. So we use this decimation-in-frequency algorithm, and the idea is basically as follows: on the CPU, we compute the first step of the fast Fourier transform according to these equations. This allows us to split the input data into n parts, in this case two parts, and we can compute the fast Fourier transform, and subsequently the convolution, in the two parts separately.
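That first decimation-in-frequency step can be written down in a few lines. This 1-D sketch (split into two parts) shows that after one butterfly pass on the CPU, the even and odd frequency bins are obtained by two completely independent half-size FFTs:

```python
import numpy as np

def dif_split(x):
    """First decimation-in-frequency step: one butterfly pass splits an
    N-point FFT into two independent N/2-point FFTs."""
    n = len(x)
    half = n // 2
    a, b = x[:half], x[half:]
    w = np.exp(-2j * np.pi * np.arange(half) / n)  # twiddle factors
    even_part = a + b           # FFT of this gives X[0], X[2], ...
    odd_part = (a - b) * w      # FFT of this gives X[1], X[3], ...
    return even_part, odd_part
```

Because the two halves never have to be transformed at the same time, each half can be sent to the GPU on its own, which is exactly what makes the decomposition possible.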
As for the optimization for real data, the algorithm changes slightly: we add two steps to this workflow. We convert the data from real to complex, and in the final phase we combine the results. The first step, the transformation of real data into complex data, is actually trivial, because what we do is take the odd pixels as the real part and the even pixels as the imaginary part of a complex image. What we get is a complex image of half the size, without doing any computations or data transfers; we just recast the array in the computer memory. So this step is done instantly. The final phase takes some time, but it is not that computationally expensive.
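The recast-without-copying trick can be demonstrated in NumPy: viewing a real array as complex interleaves consecutive samples into real and imaginary parts for free, and a short recombination recovers the true spectrum afterwards (1-D sketch; which samples land in the real versus imaginary part is just an indexing convention):

```python
import numpy as np

# A real signal of even length n.
x = np.arange(16, dtype=np.float64)
n = len(x)

# The real-to-complex step is free: reinterpreting the memory pairs
# consecutive samples, z[m] = x[2m] + 1j*x[2m+1] (no copy, no arithmetic).
z = x.view(np.complex128)

# One complex FFT of half the size does the heavy lifting.
Z = np.fft.fft(z)

# Recombination: untangle the two interleaved spectra via Hermitian
# symmetry, then merge them into the first half of the spectrum of x.
k = np.arange(n // 2)
Zc = np.conj(Z[(-k) % (n // 2)])
Xe = 0.5 * (Z + Zc)                       # spectrum of the even samples
Xo = -0.5j * (Z - Zc)                     # spectrum of the odd samples
X_half = Xe + np.exp(-2j * np.pi * k / n) * Xo
```

The packing itself costs nothing; only the recombination at the end involves arithmetic, which matches the point that the final phase is cheap.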
To summarize the whole algorithm: we take the input signal and the input kernel, decompose both of them, and recast them to complex signals; this is done on the CPU. Then we compute all the Fourier transforms, the point-wise multiplications, and the inverse Fourier transforms on the GPU; this is the computationally expensive part, so we try to keep as much of the computation as possible on the GPU. Finally, we transfer the data back, compose them together, and recombine them, so that we get the result.
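Putting the pieces together, a 1-D toy version of the whole decompose / transform / recombine pipeline could look like this (plain NumPy standing in for the GPU part; circular convolution of equal-length, already padded signals is assumed):

```python
import numpy as np

def split_fft_convolve(x, h):
    """Circular convolution of two equal-length signals, computed in two
    independent halves via the decimation-in-frequency decomposition."""
    n = len(x)
    half = n // 2
    w = np.exp(-2j * np.pi * np.arange(half) / n)
    # Step 1 (CPU in the talk): one DIF butterfly splits each input.
    xe, xo = x[:half] + x[half:], (x[:half] - x[half:]) * w
    he, ho = h[:half] + h[half:], (h[:half] - h[half:]) * w
    # Step 2 (GPU in the talk): each half is an independent
    # FFT -> point-wise product -> inverse FFT, so the two halves
    # never need to reside in device memory at the same time.
    ye = np.fft.ifft(np.fft.fft(xe) * np.fft.fft(he))
    yo = np.fft.ifft(np.fft.fft(xo) * np.fft.fft(ho))
    # Step 3 (CPU): inverse butterfly recombines the halves.
    yo = yo / w
    return np.concatenate([(ye + yo) / 2, (ye - yo) / 2])
```

Splitting into more than two parts works the same way, just with more butterfly passes before and after the per-part transforms.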
Now I will speak about some experiments. This is the configuration we used, and we used two datasets. In the first dataset, the images have dimensions that are powers of two, which is optimal for the fast Fourier transform, so these images are expected to have good computation times. In the second dataset, the images are slightly larger, which means the computation times will be longer.
As you can see in this graph, on the x-axis we have the image size and on the y-axis the performance, computed as the ratio between the output image size and the computation time, so higher is better. The GPU implementation optimized for real data performs best. Even the GPU implementation without the optimization for real data performs better than the best CPU solution, but with the optimization for real data we gain roughly a further factor of two. This graph is for the first dataset.
This graph is for the second dataset. Clearly, the performance goes down when the images don't have dimensions of the optimal size, but for both datasets it turns out that the GPU solution is much faster than the CPU one, with a speedup of approximately five to ten, depending on the input image.
We also performed a test with a fixed input image, computing the performance for different values of the parameter p. The p parameter is the number of sub-images the data are split into: p = 1 means no decomposition is made, p = 2 means the data are decomposed into two parts, and so on. We can clearly see that the decomposition imposes some performance decrease, but then, with increasing p, the performance does not drop much further, and it stays significantly higher than the performance of the CPU implementation. As for the GPU memory required, it of course decreases with p; with p = 2 and p = 4 the memory requirements happen to be the same, because of the properties of the recombination phase,
but you can see that the memory requirements can be decreased to any level you need, according to the graphics card you have.
As for the numerical precision, it turns out that the error of the output image is below one. That means the maximum difference between the output image computed on the GPU and the output image computed on the CPU in long double precision is less than one, so the images are essentially the same. The images that we use have an effective bit depth of about twelve bits, which is common for optical microscopy images. If one wants to use images of a higher bit depth, say sixteen or thirty-two bits, then this precision is not enough, and in that case one should use not single precision but, say, double precision for the computation. Fortunately, modern graphics cards support double precision.
We also tried to implement a multi-GPU solution, and we optimized the synchronization so that the overall time of the multi-GPU solution is as low as possible. However, it turns out that with two graphics cards one definitely does not achieve a two-times speedup. In fact, for some images, two GPUs perform more or less the same as one GPU, because of the overhead of the data transfers. For some images one can achieve maybe a 1.5-times speedup, so it can be interesting for certain kinds of input images.
To conclude my talk: our algorithm for computing convolution on a GPU has essentially no limit on the data size; the only limit is the CPU memory. And there are approaches to decompose the data even when you don't have enough CPU memory, so for extremely large input images you can combine our approach with other approaches to perform the computation.
The algorithm is optimized for real data, which are the most common in practical applications. It has quite stable performance, meaning that with an increasing number of sub-images the data are split into, the performance doesn't decrease significantly, and there are no redundant data transfers or computations. A multi-GPU solution is also possible: if you have several graphics cards, you can split the data into more parts, send the parts to the individual cards, and let them compute in parallel. The disadvantage, as I told you, is that the multi-GPU solution is not that efficient.
Also, the numerical precision may not be sufficient for some applications, in which case one should consider using double precision for the computation. So that's all; thank you very much for listening, and I'm looking forward to your questions.
[Answering a question] Yes, thank you, those are definitely interesting questions. I didn't mention it, but it was on the slide: the CPU was a four-core Intel Core i7, and the GPU was an NVIDIA GeForce GTX 480. The CPU implementation was based on the FFTW library, which is probably considered the fastest FFT library, so we used the fastest CPU implementation available.
Yes; let me put it differently. In fast convolution, both the data and the kernel have to be padded to the same size. So in fact it doesn't matter whether you have a kernel of 100 × 100 or 200 × 200, because at the end everything is padded to the size of the input image plus the size of the kernel, which is, say, 1000 plus 200 per dimension. For the sake of simplicity, in our experiments we used an input image of the same size as the convolution kernel, but it really doesn't matter how large the convolution kernel is.
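That point about padding can be made concrete with a small helper (illustrative only): whatever the kernel size, both operands are padded to the size of the full linear convolution, so only the sum of the two sizes drives the FFT cost:

```python
def fft_size(image_size, kernel_size):
    """Per-dimension zero-padded size for linear convolution via FFT."""
    return image_size + kernel_size - 1

# A 1000-pixel dimension with a 100- or 200-pixel kernel:
print(fft_size(1000, 100))   # 1099
print(fft_size(1000, 200))   # 1199
```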
Because for the numerical precision test we wanted to compare single precision on the GPU with the ground truth, to make sure that the results are correct. So for that test we used long double precision on the CPU; for the performance tests, however, we used single precision on both the CPU and the GPU. I'm sorry that this wasn't clear during the talk.
Block convolution? You mean splitting the data into small blocks with overlaps and performing, say, overlap-add? Yes, that approach is well known, and actually a colleague of mine was working on it. But the disadvantage of this approach is that the number of operations, in terms of asymptotic complexity, is higher, and so is the number of data transfers, because of the overlaps. It turns out that for a GPU implementation the number of data transfers is essential, so we wanted to keep it as low as possible, and for this reason we used decimation in frequency instead of the overlap-save or overlap-add methods: we got a significantly lower number of data transfers.
Ah, p — I'm sorry, I didn't mention it: p means the number of parts the data are decomposed into. So in fact, with overlap-save and overlap-add, with an increasing number of sub-parts you get a higher number of data transfers. Okay.