{"id":470,"date":"2014-04-30T13:25:18","date_gmt":"2014-04-30T12:25:18","guid":{"rendered":"http:\/\/numbercrunch.de\/blog\/?p=470"},"modified":"2023-01-18T22:45:24","modified_gmt":"2023-01-18T21:45:24","slug":"cuda-aware-mpi","status":"publish","type":"post","link":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/","title":{"rendered":"CUDA-aware MPI"},"content":{"rendered":"<p style=\"text-align: justify;\">CUDA and MPI provide two different APIs for parallel programming that target very different parallel architectures. While CUDA allows to utilize parallel graphics hardware for general purpose computing, MPI is usually employed to write parallel programs that run on large SMP systems or on cluster computers. In order to improve a cluster&#8217;s overall computational capabilities it is not unusual to equip the nodes of a cluster with graphics cards. This, however, adds also another level of parallelism that a programmer must cope with. Likely one will combine MPI with CUDA when writing programs for such a system. Combining both kinds of parallel programming techniques becomes much easier if the MPI implementation is CUDA-aware. CUDA-aware means that one can send data from device memory of one graphics card directly to the device memory of another card without an intermediate copy to host memory. This magic becomes possible thanks to \u00abunified virtual address space\u00bb that puts all CUDA execution, on CPU and on the GPU, into a single address space. (Unified virtual address space requires CUDA 4.0 or later and a GPU with compute capability 2.0 or higher.) CUDA-aware MPI eases multi-GPU programming a lot and improves performance. Among other implementations, <a href=\"http:\/\/www.open-mpi.org\/faq\/?category=building#build-cuda\">Open MPI<\/a> is CUDA-aware.<\/p>\n<p style=\"text-align: justify;\">The following program below sends a message from one GPU to another via MPI. The program may be compiled with<\/p>\n<pre style=\"text-align: justify;\">mpic++ -o ping_pong ping_pong.cc -lcuda -lcudart<\/pre>\n<p style=\"text-align: justify;\">It should be run on a node with two or more GPUs and all of them set to <a href=\"http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#axzz30Mn2EMkg\">compute mode<\/a> \u00abexclusive process\u00bb. There is an obstacle when starting the program via <code>mpirun<\/code>. <code>mpirun<\/code> may not find the library <code>libcuda.so<\/code> or try to load the wrong library (e.g. a 32-bit library on a 64-bit system). 
<p style="text-align: justify;">The two programs below demonstrate CUDA-aware MPI: the first sends a message from one GPU to another, the second measures the achievable message throughput. Both may be compiled with a command of the form</p>
<pre style="text-align: justify;">mpic++ -o ping_pong ping_pong.cc -lcuda -lcudart</pre>
<p style="text-align: justify;">They should be run on a node with two or more GPUs, all of them set to the <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz30Mn2EMkg">compute mode</a> «exclusive process». There is one obstacle when starting the programs via <code>mpirun</code>: <code>mpirun</code> may not find the library <code>libcuda.so</code> or it may try to load the wrong library (e.g., a 32-bit library on a 64-bit system). In this case set the environment variable <code>LD_LIBRARY_PATH</code> to an appropriate value.</p>
<p style="text-align: justify;">The first program sends a short message from the device memory of one GPU directly to the device memory of another; note that device pointers are passed straight to <code>MPI_Send</code> and <code>MPI_Recv</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="cpp">#include <cstdlib>
#include <cstring>
#include <iostream>
#include "mpi.h"
#include <cuda.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int myrank, nprocs;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  const int str_max_len=32;
  char str[str_max_len];
  char *str_d;
  cudaMalloc(&str_d, str_max_len);  // message buffer in device memory
  if (nprocs>=2 and myrank==0) {
    std::strncpy(str, "Hello world!", str_max_len);
    cudaMemcpy(str_d, str, str_max_len, cudaMemcpyHostToDevice);
    MPI_Send(str_d, str_max_len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);  // send from device memory
  }
  if (nprocs>=2 and myrank==1) {
    MPI_Status stat;
    MPI_Recv(str_d, str_max_len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);  // receive into device memory
    cudaMemcpy(str, str_d, str_max_len, cudaMemcpyDeviceToHost);
    std::cout << "got \"" << str << "\"" << std::endl;
  }
  cudaFree(str_d);
  MPI_Finalize();
  return EXIT_SUCCESS;
}</pre>
<p style="text-align: justify;">The message throughput may be measured by a simple ping-pong test. The benchmark below expects exactly two processes and one of the options <code>--host</code> (messages between host buffers), <code>--device</code> (messages directly between device buffers) or <code>--copy</code> (messages between host buffers plus explicit staging copies to and from device memory).</p>
<pre class="EnlighterJSRAW" data-enlighter-language="cpp">// ping_pong.cc
//
// determine bandwidth as a function of packet size

#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
#include "mpi.h"
#include <cuda.h>
#include <cuda_runtime.h>

const int max_packet_size=0x1000000;  // maximal message size
const int count=250;  // number of messages
char *buff, *buff_2;  // buffers

typedef enum { host, device, copy } benchmark_type;

int main(int argc, char *argv[]) {

  if (argc!=2) {
    std::cerr << "usage: " << argv[0] << " [--host|--device|--copy]\n";
    std::exit(EXIT_FAILURE);
  }

  std::string arg_str(argv[1]);
  if (arg_str!="--host" and
      arg_str!="--device" and
      arg_str!="--copy") {
    std::cerr << "usage: " << argv[0] << " [--host|--device|--copy]\n";
    std::exit(EXIT_FAILURE);
  }

  benchmark_type benchmark;
  if (arg_str=="--host")
    benchmark=host;
  else if (arg_str=="--device")
    benchmark=device;
  else if (arg_str=="--copy")
    benchmark=copy;
  else {  // should never be reached
    std::cerr << "usage: " << argv[0] << " [--host|--device|--copy]\n";
    std::exit(EXIT_FAILURE);
  }

  MPI::Init();  // initialize MPI

  int rank=MPI::COMM_WORLD.Get_rank();  // get my rank
  int size=MPI::COMM_WORLD.Get_size();  // get number of processes

  std::string fname;
  if (benchmark==host) {
    buff=new char[max_packet_size];  // allocate host memory
    fname="ping_pong_host.dat";
  } else if (benchmark==device) {
    cudaMalloc(&buff, max_packet_size);  // allocate GPU memory
    fname="ping_pong_device.dat";
  } else {
    buff=new char[max_packet_size];  // allocate host memory
    cudaMalloc(&buff_2, max_packet_size);  // allocate GPU memory
    fname="ping_pong_copy.dat";
  }
  if (size==2) {  // need exactly two processes
    int device[2];
    if (benchmark!=host) {
      cudaGetDevice(device);  // device used by the local process
      if (rank==0)
        MPI::COMM_WORLD.Recv(&device[1], 1, MPI::INT, 1, 0);
      else
        MPI::COMM_WORLD.Send(&device[0], 1, MPI::INT, 0, 0);
    }
    std::ofstream out;
    if (rank==0) {  // open output file
      out.open(fname.c_str());
      if (!out)
        MPI::COMM_WORLD.Abort(EXIT_FAILURE);
      out << "# bandwidth as a function of packet size\n";
      if (benchmark!=host)
        out << "# process 0 using GPU " << device[0] << "\n"
            << "# process 1 using GPU " << device[1] << "\n";
      out << "# clock resolution " << MPI::Wtick() << " sec.\n"
          << "# packet size\tmean time\tmaximal time\n";
    }

    cudaEvent_t start, stop;
    if (benchmark==copy) {
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
    }

    // try messages of different sizes
    int packet_size=1;
    while (packet_size<=max_packet_size) {
      double t_av=0.0;
      double t_max=0.0;
      // average over several messages
      for (int i=0; i<count; ++i) {
        MPI::COMM_WORLD.Barrier();  // synchronize processes
        if (rank==0) {
          double t=MPI::Wtime();  // start time
          if (benchmark==copy) {  // stage outgoing message from device to host
            cudaEventRecord(start, 0);
            cudaMemcpy(buff, buff_2, packet_size, cudaMemcpyDeviceToHost);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
          }
          MPI::COMM_WORLD.Send(buff, packet_size, MPI::CHAR, 1, 0);
          if (benchmark==copy) {  // copy message to device and back
            cudaEventRecord(start, 0);
            cudaMemcpy(buff_2, buff, packet_size, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventRecord(start, 0);
            cudaMemcpy(buff, buff_2, packet_size, cudaMemcpyDeviceToHost);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
          }
          MPI::COMM_WORLD.Recv(buff, packet_size, MPI::CHAR, 1, 0);
          if (benchmark==copy) {  // stage incoming message from host to device
            cudaEventRecord(start, 0);
            cudaMemcpy(buff_2, buff, packet_size, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
          }
          t=(MPI::Wtime()-t)/2.0;  // half round-trip time
          t_av+=t;
          if (t>t_max)
            t_max=t;
        } else {
          if (benchmark==copy) {
            cudaEventRecord(start, 0);
            cudaMemcpy(buff, buff_2, packet_size, cudaMemcpyDeviceToHost);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
          }
          MPI::COMM_WORLD.Recv(buff, packet_size, MPI::CHAR, 0, 0);
          if (benchmark==copy) {
            cudaEventRecord(start, 0);
            cudaMemcpy(buff_2, buff, packet_size, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventRecord(start, 0);
            cudaMemcpy(buff, buff_2, packet_size, cudaMemcpyDeviceToHost);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
          }
          MPI::COMM_WORLD.Send(buff, packet_size, MPI::CHAR, 0, 0);
          if (benchmark==copy) {
            cudaEventRecord(start, 0);
            cudaMemcpy(buff_2, buff, packet_size, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
          }
        }
      }
      if (rank==0) {  // print results to file
        t_av/=count;
        out << packet_size << "\t\t" << t_av << "\t" << t_max << "\n";
      }
      packet_size*=2;  // double packet size
    }

    if (rank==0)  // close file
      out.close();
  }

  MPI::Finalize();  // finish MPI

  return EXIT_SUCCESS;
}</pre>
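<p style="text-align: justify;">A possible way to run the benchmark in its three modes is sketched below. The <code>nvidia-smi</code> call sets the compute mode «exclusive process» on all GPUs of the node (it requires administrator rights, and the exact invocation may differ between driver versions); the binary name matches the compile command above. Each run writes one data file (<code>ping_pong_host.dat</code>, <code>ping_pong_device.dat</code> or <code>ping_pong_copy.dat</code>) containing the packet size and the mean and maximal half round-trip times; the bandwidth follows as packet size divided by the mean time.</p>
<pre style="text-align: justify;">nvidia-smi -c EXCLUSIVE_PROCESS
mpirun -np 2 ./ping_pong --host
mpirun -np 2 ./ping_pong --device
mpirun -np 2 ./ping_pong --copy</pre>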
<figure id="attachment_489" style="width: 480px" class="wp-caption aligncenter"><a href="http://numbercrunch.de/blog/wp-content/uploads/2014/04/ping_pong.png"><img class="size-full wp-image-489" src="//numbercrunch.de/blog/wp-content/uploads/2014/04/ping_pong.png" alt="Half round-trip time as a function of the message size" width="480" height="362" /></a><figcaption id="caption-attachment-489" class="wp-caption-text">Half round-trip time as a function of the message size. CUDA-aware MPI reduces round-trip times by eliminating a temporary copy to host memory. For small messages, communication via host shared memory is faster than inter-GPU communication. The test was done on a system with two Tesla M2090 cards.</figcaption></figure>
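<p style="text-align: justify;">A final remark on GPU selection: the measurements above rely on the compute mode «exclusive process», in which every MPI rank automatically ends up on a GPU of its own because the CUDA runtime skips devices that are already in use. On systems where the compute mode cannot be changed, the devices can instead be assigned explicitly from the node-local rank. The following is a minimal sketch, not part of the benchmark above; it assumes an MPI-3 library (for <code>MPI_Comm_split_type</code>), and the variable names are purely illustrative.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="cpp">// assign_gpu.cc -- one GPU per MPI rank on each node (sketch)
#include <cstdlib>
#include <iostream>
#include "mpi.h"
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  // group all processes that run on the same node
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);
  // map the node-local rank onto the GPUs of this node
  int device_count;
  cudaGetDeviceCount(&device_count);
  cudaSetDevice(local_rank%device_count);
  std::cout << "local rank " << local_rank << " uses GPU "
            << local_rank%device_count << std::endl;
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return EXIT_SUCCESS;
}</pre>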
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/\"},\"author\":{\"name\":\"Heiko Bauke\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\"},\"headline\":\"CUDA-aware MPI\",\"datePublished\":\"2014-04-30T12:25:18+00:00\",\"dateModified\":\"2023-01-18T21:45:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/\"},\"wordCount\":354,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\"},\"articleSection\":[\"GPU computing\",\"MPI\",\"parallel computing\",\"Performance\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/\",\"url\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/\",\"name\":\"CUDA-aware MPI - Number Crunch\",\"isPartOf\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#website\"},\"datePublished\":\"2014-04-30T12:25:18+00:00\",\"dateModified\":\"2023-01-18T21:45:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.numbercrunch.de\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"CUDA-aware MPI\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#website\",\"url\":\"https:\/\/www.numbercrunch.de\/blog\/\",\"name\":\"Number Crunch\",\"description\":\"A computational science blog.\",\"publisher\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.numbercrunch.de\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\",\"name\":\"Heiko Bauke\",\"logo\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"CUDA-aware MPI - Number Crunch","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/","og_locale":"en_US","og_type":"article","og_title":"CUDA-aware MPI - Number Crunch","og_description":"CUDA and MPI provide two different APIs for parallel programming that target very different parallel architectures. While CUDA allows to utilize parallel graphics hardware for general purpose computing, MPI is usually employed to write parallel programs that run on large SMP systems or on cluster computers. In order to improve a cluster&#8217;s overall computational capabilities&hellip; Continue reading CUDA-aware MPI","og_url":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/","og_site_name":"Number Crunch","article_published_time":"2014-04-30T12:25:18+00:00","article_modified_time":"2023-01-18T21:45:24+00:00","author":"Heiko Bauke","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Heiko Bauke","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#article","isPartOf":{"@id":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/"},"author":{"name":"Heiko Bauke","@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413"},"headline":"CUDA-aware MPI","datePublished":"2014-04-30T12:25:18+00:00","dateModified":"2023-01-18T21:45:24+00:00","mainEntityOfPage":{"@id":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/"},"wordCount":354,"commentCount":0,"publisher":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413"},"articleSection":["GPU computing","MPI","parallel computing","Performance"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/","url":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/","name":"CUDA-aware MPI - Number Crunch","isPartOf":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#website"},"datePublished":"2014-04-30T12:25:18+00:00","dateModified":"2023-01-18T21:45:24+00:00","breadcrumb":{"@id":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.numbercrunch.de\/blog\/2014\/04\/cuda-aware-mpi\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.numbercrunch.de\/blog\/"},{"@type":"ListItem","position":2,"name":"CUDA-aware MPI"}]},{"@type":"WebSite","@id":"https:\/\/www.numbercrunch.de\/blog\/#website","url":"https:\/\/www.numbercrunch.de\/blog\/","name":"Number Crunch","description":"A computational science 
blog.","publisher":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.numbercrunch.de\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413","name":"Heiko Bauke","logo":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts\/470"}],"collection":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/comments?post=470"}],"version-history":[{"count":18,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts\/470\/revisions"}],"predecessor-version":[{"id":984,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts\/470\/revisions\/984"}],"wp:attachment":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/media?parent=470"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/categories?post=470"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/tags?post=470"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}