CUDA and remote GPU

CUDA is great, as long as you have an Nvidia video card at hand. But what do you do when your favorite laptop has no Nvidia video card? Or when you need to develop inside a virtual machine?


In this article I will look at the rCUDA (Remote CUDA) framework, a solution that helps when an Nvidia video card exists, but is not installed in the machine that is supposed to run CUDA applications.


TL;DR

rCUDA (Remote CUDA) is a framework that implements the CUDA API and allows you to use a video card installed in a remote machine. It is available as a workable beta, and only under Linux. The main goal of rCUDA is full compatibility with the CUDA API: you do not need to modify your code in any way; it is enough to set a few special environment variables.


What is rCUDA


rCUDA (Remote CUDA) is a framework implementing the CUDA API that allows you to use a video card located in a remote machine for CUDA computations, without making any changes to your code. It is developed at the Polytechnic University of Valencia (rcuda-team).


Restrictions


Currently only GNU/Linux systems are supported, although the developers promise Windows support in the future. The current version of rCUDA, 18.03beta, is compatible with CUDA 5-8; that is, CUDA 9 is not supported. The developers declare full compatibility with the CUDA API, with the exception of the graphics interoperability functions.


Possible usage scenarios


  1. Running CUDA applications in a virtual machine, when forwarding the video card is inconvenient or impossible, for example, when the video card is occupied by the host or when there is more than one virtual machine.
  2. A laptop without a discrete video card.
  3. The desire to use several video cards (clustering). In theory, you can use all the video cards available on the network, including jointly; a sketch of such a configuration follows this list.
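As an illustration of the clustering scenario: extrapolating from the environment variables used in the "Client setup" section below, pointing one client at two remote cards might look something like this (the addresses are placeholders):

 export RCUDA_DEVICE_COUNT=2              # two remote video cards in total
 export RCUDA_DEVICE_0=192.168.1.10:0     # <server 1 IP>:<device index on that server>
 export RCUDA_DEVICE_1=192.168.1.11:0     # <server 2 IP>:<device index on that server>

I have not tested a multi-server setup myself, so treat this as a sketch based on the single-server variables shown later.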

Brief instructions


Test configuration


Testing was conducted on the following configuration:


Server:
Ubuntu 16.04, GeForce GTX 660


Client:
Virtual machine with Ubuntu 16.04 on a laptop without a discrete video card.


Getting rCUDA


The most difficult stage. Unfortunately, at the moment the only way to get your own copy of the framework is to fill in the request form on the official website. The developers promise to respond within 1-2 days, though; in my case the distribution was sent to me the same day.


CUDA installation


First you need to install the CUDA Toolkit on the server and on the client (even if the client has no Nvidia video card). To do this, you can download it from the official site or use a repository. The main thing is that the version is not higher than 8. This example uses the .run installer from the official site.


 chmod +x cuda_8.0.61_375.26_linux.run
 ./cuda_8.0.61_375.26_linux.run

Important! On the client, decline the installation of the Nvidia driver. By default the CUDA Toolkit will be available at /usr/local/cuda/. Also install the CUDA Samples, they will be needed later.
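If you prefer a scripted install, the runfile can also be driven non-interactively. A possible invocation for the client, assuming the standard CUDA 8 runfile options (check ./cuda_8.0.61_375.26_linux.run --help on your copy first):

 # toolkit and samples only, no driver: suitable for the client
 ./cuda_8.0.61_375.26_linux.run --silent --toolkit --samples
 # verify the installed toolkit version afterwards
 /usr/local/cuda/bin/nvcc --version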


Install rCUDA


Let's unpack the archive received from the developers into the home directory:


 tar -xvf rCUDA*.tgz -C ~/
 mv ~/rCUDA* ~/rCUDA

You need to do these actions on the server and on the client.


Running the rCUDA daemon on the server


 export PATH=$PATH:/usr/local/cuda/bin
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/home/<XXX>/rCUDA/lib/cudnn
 cd ~/rCUDA/bin
 ./rCUDAd

Replace <XXX> with your username. Use ./rCUDAd -iv if you want to see detailed output.
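To make sure the daemon actually came up, you can check that something is listening on the rCUDA port (8308 by default, see the "Security" section below):

 ss -tlnp | grep 8308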


Client setup


Let's open a terminal on the client, in which we will later run CUDA code. On the client side, we need to "replace" the standard CUDA libraries with the rCUDA libraries, for which we add the appropriate paths to the LD_LIBRARY_PATH environment variable. We also need to specify the number of servers and their addresses (in my example it will be one).


 export PATH=$PATH:/usr/local/cuda/bin
 export LD_LIBRARY_PATH=/home/<XXX>/rCUDA/lib/:$LD_LIBRARY_PATH
 export RCUDA_DEVICE_COUNT=1           # number of remote video cards (servers) to use
 export RCUDA_DEVICE_0=<server IP>:0   # address of the first card
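Before building the samples, a quick sanity check does not hurt. The following minimal program (my own sketch, not part of the rCUDA distribution) simply asks the runtime how many devices are visible; the count should match RCUDA_DEVICE_COUNT:

 // check.cu: verify that the remote device is visible through rCUDA
 #include <stdio.h>
 #include <cuda_runtime.h>

 int main(void)
 {
     int count = 0;
     cudaError_t err = cudaGetDeviceCount(&count);
     if (err != cudaSuccess) {
         printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
         return 1;
     }
     printf("Visible CUDA devices: %d\n", count); // expect 1 in this setup
     return 0;
 }

Build it with the shared CUDA runtime, just like all the examples below:

 /usr/local/cuda/bin/nvcc --cudart=shared check.cu -o check
 ./check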

Build and Run


Let's try to build and run a few examples.


Example 1


Let's start with something simple: deviceQuery, an example that merely prints the parameters of the CUDA-compatible device, in our case the remote GTX 660.


 cd <YYY>/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
 make EXTRA_NVCCFLAGS=--cudart=shared

Important! Without EXTRA_NVCCFLAGS=--cudart=shared the trick will not work.
Replace <YYY> with the path you specified for the CUDA Samples when installing CUDA.


Run the compiled example:


 ./deviceQuery 

If you did everything correctly, the result will be something like this:


Result
 ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 Detected 1 CUDA Capable device(s)

 Device 0: "GeForce GTX 660"
   CUDA Driver Version / Runtime Version          9.0 / 8.0
   CUDA Capability Major/Minor version number:    3.0
   Total amount of global memory:                 1994 MBytes (2090991616 bytes)
   ( 5) Multiprocessors, (192) CUDA Cores/MP:     960 CUDA Cores
   GPU Max Clock rate:                            1072 MHz (1.07 GHz)
   Memory Clock rate:                             3004 Mhz
   Memory Bus Width:                              192-bit
   L2 Cache Size:                                 393216 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 65536
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  2048
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Run time limit on kernels:                     Yes
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Disabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

 deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 660
 Result = PASS

Most importantly, we should see:


Device0 = GeForce GTX 660
Result = PASS

Great! We managed to build and run a CUDA application on a machine without a discrete video card, using a video card installed in a remote server.


Important! If the output of the application starts with lines like:


 mlock error: Cannot allocate memory
 rCUDA warning: 1007.461 mlock error: Cannot allocate memory

it means you need to add the following lines to the file "/etc/security/limits.conf" on both the server and the client:


 * hard memlock unlimited
 * soft memlock unlimited

This allows all users (*) unlimited memory locking (memlock). It would be even better to replace * with the specific user and to choose a less generous limit instead of unlimited.
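For example, a more restrictive variant might look like this (the username and the 8 GB value, expressed in kilobytes, are placeholders to adapt):

 <XXX> hard memlock 8388608
 <XXX> soft memlock 8388608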


Example 2


Now let's try something more interesting: an implementation of the dot product of two vectors using shared memory and synchronization ("CUDA by Example" by J. Sanders and E. Kandrot, section 5.3.1).


In this example, we will compute the dot product of two vectors of dimension 33 * 1024 and compare the answer with the result obtained on the CPU.


dotProd.cu
 #include <stdio.h>

 #define imin(a,b) (a<b?a:b)

 const int N = 33 * 1024;
 const int threadsPerBlock = 256;
 const int blocksPerGrid = imin(32, (N+threadsPerBlock-1) / threadsPerBlock);

 __global__ void dot(float* a, float* b, float* c)
 {
     __shared__ float cache[threadsPerBlock];
     int tid = threadIdx.x + blockIdx.x * blockDim.x;
     int cacheIndex = threadIdx.x;

     float temp = 0;
     while (tid < N) {
         temp += a[tid] * b[tid];
         tid += blockDim.x * gridDim.x;
     }

     // set the cache values
     cache[cacheIndex] = temp;

     // synchronize threads in this block
     __syncthreads();

     // for reductions, threadsPerBlock must be a power of 2
     // because of the following code
     int i = blockDim.x/2;
     while (i != 0) {
         if (cacheIndex < i)
             cache[cacheIndex] += cache[cacheIndex + i];
         __syncthreads();
         i /= 2;
     }

     if (cacheIndex == 0)
         c[blockIdx.x] = cache[0];
 }

 int main(void)
 {
     float *a, *b, c, *partial_c;
     float *dev_a, *dev_b, *dev_partial_c;

     // allocate memory on the cpu side
     a = (float*)malloc(N*sizeof(float));
     b = (float*)malloc(N*sizeof(float));
     partial_c = (float*)malloc(blocksPerGrid*sizeof(float));

     // allocate the memory on the gpu
     cudaMalloc((void**)&dev_a, N*sizeof(float));
     cudaMalloc((void**)&dev_b, N*sizeof(float));
     cudaMalloc((void**)&dev_partial_c, blocksPerGrid*sizeof(float));

     // fill in the host memory with data
     for (int i=0; i<N; i++) {
         a[i] = i;
         b[i] = i*2;
     }

     // copy the arrays 'a' and 'b' to the gpu
     cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

     dot<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_partial_c);

     // copy the array 'c' back from the gpu to the cpu
     cudaMemcpy(partial_c, dev_partial_c, blocksPerGrid*sizeof(float), cudaMemcpyDeviceToHost);

     // finish up on the cpu side
     c = 0;
     for (int i=0; i<blocksPerGrid; i++) {
         c += partial_c[i];
     }

     #define sum_squares(x) (x*(x+1)*(2*x+1)/6)
     printf("GPU - %.6g \nCPU - %.6g\n", c, 2*sum_squares((float)(N-1)));

     // free memory on the gpu side
     cudaFree(dev_a);
     cudaFree(dev_b);
     cudaFree(dev_partial_c);

     // free memory on the cpu side
     free(a);
     free(b);
     free(partial_c);
 }
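For reference, the CPU answer printed by the program comes from a closed-form identity, which is exactly what the sum_squares macro evaluates: since a_i = i and b_i = 2i,

 \sum_{i=0}^{N-1} a_i b_i = 2 \sum_{i=0}^{N-1} i^2 = \frac{(N-1)\,N\,(2N-1)}{3} \approx 2.57236 \times 10^{13}, \qquad N = 33 \cdot 1024.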

Build and Run:


 /usr/local/cuda/bin/nvcc --cudart=shared dotProd.cu -o dotProd
 ./dotProd

The following output tells us that everything is fine:


GPU - 2.57236e+13
CPU - 2.57236e+13

Example 3


Let's run another standard CUDA test, matrixMulCUBLAS (matrix multiplication using cuBLAS).


 cd <YYY>/NVIDIA_CUDA-8.0_Samples/0_Simple/matrixMulCUBLAS
 make EXTRA_NVCCFLAGS=--cudart=shared
 ./matrixMulCUBLAS

Result

[Matrix Multiply CUBLAS] - Starting ...
GPU Device 0: "GeForce GTX 660" with compute capability 3.0


MatrixA (640,480), MatrixB (480,320), MatrixC (640,320)
Computing result using CUBLAS ... done.
Performance= 436.24 GFlop/s, Time= 0.451 msec, Size= 196608000 Ops
Computing result using host CPU ... done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS


NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


The lines of interest to us:


Performance= 436.24 GFlop/s,
Comparing CUBLAS Matrix Multiply with CPU results: PASS

Security


I did not find any mention of an authentication mechanism in the rCUDA documentation [1, 2]. I think the simplest thing you can do at the moment is to open access to the rCUDA port (8308) only from a specific address.


With iptables, it will look like this:


 iptables -A INPUT -m state --state NEW -p tcp -s <client IP> --dport 8308 -j ACCEPT
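Note that an ACCEPT rule by itself only helps if other traffic to that port is rejected. Assuming no restrictive default policy is configured already, a companion rule appended after the one above (order matters in iptables) could be:

 iptables -A INPUT -p tcp --dport 8308 -j DROP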

Otherwise, I will leave the question of security beyond the scope of this post.


Sources and references

[1] http://www.rcuda.net/pub/rCUDA_guide.pdf
[2] http://www.rcuda.net/pub/rCUDA_QSG.pdf
[3] C. Reaño, F. Silla, G. Shainer and S. Schultz, "Local and Remote GPUs Perform Similar with EDR 100G InfiniBand", International Middleware Conference, Vancouver, BC, Canada, December 2015.
[4] C. Reaño and F. Silla, “A Performance Comparison of CUDA Remote GPU Virtualization Frameworks”, in the International Conference on Cluster Computing, Chicago, IL, USA, September 2015.

Source: https://habr.com/ru/post/416127/

