[core] Test GPGPU (OpenCL/CUDA) backend with AMGCL solver #3831
Conversation
Regarding increased memory consumption during compilation: I think we could keep just the declarations of …

dc926f4 seems to work. @RiccardoRossi, can you check whether this reduces the compilation memory requirements? I suspect this may also reduce the overall compile time of KratosCore, since all amgcl headers are now moved to a separate compilation unit.
(force-pushed from 2fd4aab to 7519c45)
7519c45 allows choosing between the OpenCL and CUDA backends with a CMake option.
When compiling with cotire and the flag -DUSE_COTIRE=ON, the following warnings are issued (however, the code compiles and runs): …

Correction: when running the compilation with those flags, the following error is issued: …

I don't think the error is related to the changes here?
Re cotire: sakra/cotire#135

Yes, it looks like the cotire error is the one you indicate... however, I did not really understand how the patch should be applied.

Hmm, this one is also relevant, although it still does not give a solution (I think you are doing what they suggest): …

I did not do anything to solve the cotire issue; I think this should be solved on the cotire side first.
(force-pushed from c0d09cf to 2632dfa)
Rebased onto the current master and squashed the commits.
(force-pushed from 2632dfa to 8324752)
I have been trying things with cotire (including trying out the newest version). Unfortunately, cotire is needed, since we use it in our CI. I verified that the latest master does compile with cotire without any problem, so the problem comes from the local modifications of the CMakeLists.txt. I am 95% sure that the problem comes from … Or can't we use "add_define" instead and define it globally for all of KratosCore?
Modifying global state is considered extremely bad practice in CMake, but since this is how all of Kratos is currently built anyway, I think we can do that.

@RiccardoRossi, is c702ec1 what you had in mind? Does that work?
25048ef applies the cotire patch from sakra/cotire#155. With this, I am able to compile with both …
(force-pushed from cc1ccc1 to 25048ef)
Enable the vexcl backend for the amgcl solver, which makes it possible to use a GPGPU (either CUDA or OpenCL) in order to accelerate the solution.

* CMake option AMGCL_GPGPU (default: OFF) controls whether to compile GPGPU support.
* CMake option AMGCL_GPGPU_BACKEND (default: OpenCL) selects the vexcl backend (OpenCL/CUDA).
* A new setting in the linear solver parameters, `use_gpgpu`, enables GPGPU at runtime.
* The environment variable OCL_DEVICE may be used to select a particular compute device.
(force-pushed from 25048ef to a034674)
I've rebased the PR onto the current master. With this, I am able to compile the Kratos core with …
(force-pushed from 962ef32 to a034674)
Since we are instantiating the templates explicitly, nothing stops us from converting the templates to plain functions.
(force-pushed from bcbe915 to 22f0b38)
Approving... in case AppVeyor builds... :D

It does!

This looks cool, @ddemidov!
You can choose either OpenCL as the GPGPU backend:

    cmake -DAMGCL_GPGPU=ON -DAMGCL_GPGPU_BACKEND=OpenCL ...

or CUDA:

    cmake -DAMGCL_GPGPU=ON -DAMGCL_GPGPU_BACKEND=CUDA ...

For OpenCL to work you need libOpenCL, an OpenCL ICD, and the OpenCL headers. The library and the ICD usually come with the graphics drivers, and the OpenCL headers may be installed as part of an OpenCL SDK or the opencl-headers package. For CUDA you need to install the Nvidia CUDA Toolkit.

The solver uses the vex::Filter::Env device filter, which means you can control which device to use with an environment variable:

    $ clinfo | grep 'Device Name'
    Device Name    Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
    Device Name    Tesla K40c
    Device Name    GeForce GT 610
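As a concrete sketch of the vex::Filter::Env mechanism described above (the device-name substring is taken from the example clinfo output; the run command is illustrative, not part of this PR):

```shell
# Pin the solver to the Tesla K40c from the listing above;
# OCL_DEVICE matches devices by a substring of their name.
export OCL_DEVICE=Tesla

# Then run the Kratos case as usual, e.g. (script name is illustrative):
# python MainKratos.py
```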
I think @RiccardoRossi did some experiments regarding the performance of the GPGPU solver in Kratos.
Yes, I did: I have a GTX 1070, and it is about 3-3.5 times faster, everything included, than my i7-8700 (once again, kudos @ddemidov). One cool thing is that by using OpenCL instead of CUDA you can also use AMD GPUs, if you have one.
BTW, we need to add the usage comments to the wiki.

Ping: it would be nice if you could make a small entry in the Wiki.
Hi, this is very good news and I am trying it out currently. Here are my initial experiences. Mind you, I am not an expert on compiling or on how to exploit the advantages of GPU usage, but I would try to support this endeavor by at least having a go at it.

As @ddemidov mentions, the changes regarding setting it up and using it with Kratos are the following. In the configure files (or the OpenCL version):

    -DAMGCL_GPGPU=ON
    -DAMGCL_GPGPU_BACKEND=CUDA

To be able to use it, the AMGCL solvers are selected with the specific flag in the project parameters:

    "use_gpgpu" : true

In case the Wiki part will be written/updated, one should not forget the changes needed for boost, as it needs to be bootstrapped and installed:

    ./bootstrap.sh
    ./b2 install

as well as properly exported. I needed the following settings:

    export CPP_INCLUDE_PATH=~/<path_to_folder>/boost_1_69_0/include:$CPP_INCLUDE_PATH
    export LD_LIBRARY_PATH=~/<path_to_folder>/boost_1_69_0/lib:$LD_LIBRARY_PATH

As far as my recent experience goes, that is the easier part. I tried to set it up on 2 desktops today: …

It is also not straightforward to know when every component version is correct and installed properly (at least not for me). After the following checks it seems to be ready to use:

    ~$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2017 NVIDIA Corporation
    Built on Fri_Nov__3_21:07:56_CDT_2017
    Cuda compilation tools, release 9.1, V9.1.85

    ~$ nvidia-smi
    Thu Aug  8 17:25:58 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro P400        Off   | 00000000:65:00.0  On |                  N/A |
    | 34%   43C    P8    N/A /  N/A |    548MiB /  1997MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

where the NVIDIA driver number and CUDA version are displayed and are correct. It took some iterations to remove incorrect versions and set everything up.

Compiling Kratos with that one running configuration (hardware + software) works with both the CUDA and OpenCL flags. I did some test runs with the 2D and 3D cylinder cases from Kratos Fluid using monolithic VMS. No extensive performance testing, as I am not aware how this could be done objectively on a local machine. Maybe some hints? @ddemidov

I tried it out in a rather primitive way using … Interestingly, the 3D case (with 140k nodes and 820k elements) went through with … Can it happen that the same hardware manages to handle a case with one type of compilation but not with another? @ddemidov It seems that on my hardware, compiled with the …

Does one still need to take care of the environment variable …?

Specs for NVIDIA Quadro P400: …

Running on a Fujitsu desktop with: …
Right, although I prefer to use the system version of the boost libraries (installed with something like …
I have the same experience. Nvidia has in recent years made it hard for owners of 'old' hardware: they drop driver support for what they consider 'old' very quickly, and the latest CUDA toolkit versions require the latest driver versions. It is much easier to use their OpenCL, though: it does not seem to have this problem. So I would consider using …
I think the most objective way is just to measure the wallclock time needed to complete the whole run; I am not sure how to do this exactly in Kratos, though. Also, if you are using the OpenCL backend, make sure you are not using your CPU instead of the GPU (or the CPU together with the GPU). You can use environment variables to control the compute device choice; see the complete list here: https://vexcl.readthedocs.io/en/latest/initialize.html#common-filters.
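For the wallclock measurement suggested above, a minimal approach on Linux could look like the following (the script name is hypothetical; OCL_TYPE is one of the Env-filter variables listed in the VexCL documentation linked above):

```shell
# Baseline run, with "use_gpgpu": false in the solver settings:
time python MainKratos.py

# GPU run: restrict the OpenCL device selection to GPUs only,
# so the CPU OpenCL platform is never picked by accident.
export OCL_TYPE=GPU
time python MainKratos.py
```

Comparing the two `real` times gives an end-to-end speedup figure that includes setup and transfer costs.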
Not sure what happens here. VexCL uses sparse matrices from the (closed-source) CUSPARSE library with the CUDA backend, so it is possible the matrices require more memory, either during construction or permanently.
This does not look like a very fast GPU (judging by the memory bandwidth). @RiccardoRossi ran his tests on a GTX 1070, which has a bandwidth of 256 GB/s. The performance of the solver should be roughly proportional to the available memory bandwidth, so your GPU should be about 8x slower than Riccardo's.
@ddemidov thanks for the hints! I will give it a try with the environment variables. With respect to memory requirements and consumption, should I understand that GPU computing tries to push the whole computation to the GPU? Or am I still misunderstanding it? @adityaghantasala @AndreasWinterstein it would be worth trying and testing on our more capable desktop machine. @RiccardoRossi any hints and experiences with respect to the problem type and size to run? Have you had similar issues, like not having enough memory?
In my experience it was approximately 3 times faster with the GPU than with the CPU. I have 8 GB of video RAM, so I did not reach that limit, neither with OpenCL nor with CUDA, but admittedly the case only had approx. 1M elements.
AMGCL constructs the AMG hierarchy (the set of coarser and coarser system matrices, together with the inter-level transfer operators) on the CPU, and then transfers the complete hierarchy to the GPU. The whole computation is then done on the GPU.
@ddemidov, I think you should mention that you did implement the whole computation on the GPU, but that after testing, the CPU was faster for the "preparation phase", even including the transfer time.
I don't have an option to do the setup on the GPU, but we did compare performance with the cusp library, which does the complete setup GPU-side. It appeared our approach was faster (and more memory-efficient). See the benchmarks here: https://amgcl.readthedocs.io/en/latest/benchmarks.html#d-poisson-problem.
Enable the vexcl backend for the amgcl solver, which makes it possible to use a GPGPU (either CUDA or OpenCL) in order to accelerate the solution.

* AMGCL_GPGPU (default: OFF) controls whether to compile GPGPU support.
* AMGCL_GPGPU_BACKEND (default: OpenCL) selects the vexcl backend (OpenCL/CUDA).
* use_gpgpu (default: false) enables GPGPU at runtime.
* OCL_DEVICE may be used to select a particular compute device.

VexCL sources are in external_libraries. It is enough to clone and configure VexCL anywhere for cmake to pick it up.
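For an out-of-tree VexCL checkout as described above, the steps might look like this (the clone location is arbitrary and illustrative; the upstream repository is ddemidov/vexcl):

```shell
# VexCL sources also ship in external_libraries; alternatively,
# clone VexCL anywhere and configure it so cmake can pick it up:
git clone https://github.com/ddemidov/vexcl.git ~/src/vexcl

# Then configure Kratos with GPGPU support enabled:
cmake -DAMGCL_GPGPU=ON -DAMGCL_GPGPU_BACKEND=OpenCL ...
```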