{"id":3313,"date":"2014-06-19T12:05:55","date_gmt":"2014-06-19T19:05:55","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/parallelforall\/?p=3313"},"modified":"2022-08-21T16:37:06","modified_gmt":"2022-08-21T23:37:06","slug":"cuda-pro-tip-profiling-mpi-applications","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/cuda-pro-tip-profiling-mpi-applications\/","title":{"rendered":"CUDA Pro Tip: Profiling MPI Applications"},"content":{"rendered":"<p>When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it&#8217;s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option <code>--display-map<\/code>, it&#8217;s tedious and error prone. A solution which solves this for the command-line output of <code>nvprof<\/code> is described here <a href=\"http:\/\/www.parallel-computing.pro\/index.php\/9-cuda\/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program\">http:\/\/www.parallel-computing.pro\/index.php\/9-cuda\/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program<\/a> . In this post I will describe how the new output file naming of nvprof to be introduced with CUDA 6.5 can be used to conveniently analyze the performance of a MPI+CUDA application with <code>nvprof<\/code> and the NVIDIA Visual Profiler (<code>nvvp<\/code>).<\/p>\n<h2 id=\"profiling_mpi_applications_with_nvprof_and_nvvp\" >Profiling MPI applications with nvprof and nvvp<a href=\"#profiling_mpi_applications_with_nvprof_and_nvvp\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<h3 id=\"collecting_data_with_nvprof\" >Collecting data with nvprof<a href=\"#collecting_data_with_nvprof\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p><code>nvprof<\/code> supports dumping the profile to a file which can be later imported into <code>nvvp<\/code>. To generate a profile for a MPI+CUDA application I simply start <code>nvprof<\/code> with the MPI launcher and up to CUDA 6 I used the string &#8220;<code>%p<\/code>&#8221; in the output file name. <code>nvprof<\/code> automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, the string &#8220;<code>%q{ENV}<\/code>&#8221; can be used to name the output file of <code>nvprof<\/code>. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (<code>mpirun<\/code> or <code>mpiexec<\/code>). E.g. for OpenMPI <code>OMPI_COMM_WORLD_RANK<\/code> is set to the MPI rank for each launched process.<\/p>\n<pre class=\"prettyprint\">$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof .&#47;simpleMPI\nRunning on 2 nodes\n==18811== NVPROF is profiling process 18811, command: .&#47;simpleMPI\n==18813== NVPROF is profiling process 18813, command: .&#47;simpleMPI\nAverage of square roots is: 0.667279\nPASSED\n==18813== Generated result file: simpleMPI.1.nvprof\n==18811== Generated result file: simpleMPI.0.nvprof<\/pre>\n<p><!--more--><\/p>\n<h3 id=\"analyzing_profiles_with_nvvp\" >Analyzing profiles with nvvp<a href=\"#analyzing_profiles_with_nvvp\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>The output files produced by <code>nvprof<\/code> can be either read by <code>nvprof<\/code> to analyze the profile one rank at a time (using\u00a0<code>--import-profile<\/code>) or imported into <code>nvvp<\/code>. Since CUDA 6 it&#8217;s possible to import multiple files into the same time-line as described <a title=\"here\" href=\"http:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html#import-multi-nvprof-session\">here<\/a>. This significantly improves the usability of <code>nvvp<\/code> for MPI applications.<\/p>\n<figure id=\"attachment_3315\" aria-describedby=\"caption-attachment-3315\" style=\"width: 600px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-3315 \" src=\"\/blog\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming-624x350.png\" alt=\"NVVP time line showing the GPU activity of two MPI processes.\" width=\"600\" height=\"336\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming-624x350.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming-300x168.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming-500x280.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming-160x90.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming-1024x575.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_wo_resource_naming.png 1331w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-3315\" class=\"wp-caption-text\">NVVP time line showing the GPU activity of two MPI processes.<\/figcaption><\/figure>\n<h2 id=\"enhancing_profiles_with_nvtx\" >Enhancing profiles with NVTX<a href=\"#enhancing_profiles_with_nvtx\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>The analysis process can be further improved by using <a title=\"NVTX\" href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx\/\">NVTX<\/a> and naming the CPU threads and CUDA devices according to the MPI rank associated to them.\u00a0With CUDA 7.5 you can name threads just as you name\u00a0output files with the command line options <code>--context-name<\/code> and <code>--process-name<\/code>, by passing a string like <code>\u201cMPI Rank %q{OMPI_COMM_WORLD_RANK}\u201d<\/code> as a parameter. Before CUDA 7.5 you can achieve the same result by using NVTX explicitly from your application:<\/p>\n<pre class=\"prettyprint\">char name&#091;256&#093;;\nsprintf( name, &#34;MPI Rank %d&#34;, rank );\n\nnvtxNameOsThread(pthread_self(), name);\nnvtxNameCudaDeviceA(rank, name);<\/pre>\n<figure id=\"attachment_3316\" aria-describedby=\"caption-attachment-3316\" style=\"width: 600px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-3316 \" src=\"\/blog\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming-624x350.png\" alt=\"NVVP time line with named OS thread and CUDA device showing the GPU activity of two MPI processes.\" width=\"600\" height=\"336\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming-624x350.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming-300x168.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming-500x280.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming-160x90.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming-1024x575.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming.png 1331w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-3316\" class=\"wp-caption-text\">NVVP time line with named OS thread and CUDA device showing the GPU activity of two MPI processes.<\/figcaption><\/figure>\n<p>Instead of naming the CUDA devices it&#8217;s also possible to name the GPU context:<\/p>\n<pre class=\"prettyprint\">char name&#091;256&#093;;\nsprintf( name, &#34;MPI Rank %d&#34;, rank );\nnvtxNameOsThread(pthread_self(), name);\n\nCUcontext ctx;\ncuCtxGetCurrent( &amp;ctx );\nnvtxNameCuContextA( ctx, name );<\/pre>\n<figure id=\"attachment_3317\" aria-describedby=\"caption-attachment-3317\" style=\"width: 600px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-3317 \" src=\"\/blog\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx-624x350.png\" alt=\"NVVP time line with named OS thread and CUDA context showing the GPU activity of two MPI processes.\" width=\"600\" height=\"336\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx-624x350.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx-300x168.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx-500x280.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx-160x90.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx-1024x575.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/06\/NVVP_with_resource_naming_ctx.png 1331w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-3317\" class=\"wp-caption-text\">NVVP time line with named OS thread and CUDA context showing the GPU activity of two MPI processes.<\/figcaption><\/figure>\n<p>To guarantee that <code>cuCtxGetCurrent<\/code> picks the right context, it&#8217;s required that a CUDA Runtime call is made between the calls to <code>cudaSetDevice<\/code> and <code>cuCtxGetCurrent<\/code>.<\/p>\n<h2 id=\"other_tools\" >Other Tools<a href=\"#other_tools\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>To collect application traces and analyze the performance of MPI applications, well established and much more sophisticated tools like <a title=\"Score-P\" href=\"http:\/\/www.vi-hps.org\/projects\/score-p\/\">Score-P<\/a>, <a title=\"Vampir\" href=\"http:\/\/www.vampir.eu\/\">Vampir<\/a> or <a title=\"TAU\" href=\"http:\/\/www.cs.uoregon.edu\/research\/tau\/home.php\">TAU<\/a> exists. These tools use our profiling interface CUPTI to assess MPI+CUDA applications and also offer advanced support to detect MPI and CPU related performance issues.<\/p>\n<h2 id=\"conclusion\" >Conclusion<a href=\"#conclusion\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>Following the above approach many performance issues of MPI+CUDA applications can be identified with NVIDIA tools and NVTX can be used to improve working with these profiles. Besides the NVTX resource naming, everything described here works equally well with MPI+OpenACC applications.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it&#8217;s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure &hellip; <a href=\"https:\/\/developer.nvidia.com\/blog\/cuda-pro-tip-profiling-mpi-applications\/\">Continued<\/a><\/p>\n","protected":false},"author":245,"featured_media":9152,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"602000","discourse_permalink":"https:\/\/forums.developer.nvidia.com\/t\/cuda-pro-tip-profiling-mpi-applications\/148444","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"","footnotes":"","_links_to":"","_links_to_target":""},"categories":[503],"tags":[21,48,52,449,59],"coauthors":[139],"class_list":["post-3313","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-simulation-modeling-design","tag-cuda","tag-mpi","tag-openacc","tag-pro-tip","tag-profiling"],"acf":{"post_industry":[],"post_products":["CUDA","OpenACC"],"post_learning_levels":[],"post_content_types":[],"post_collections":[]},"jetpack_featured_media_url":"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2017\/02\/GPU-Pro-Tip-e1753800348480.png","primary_category":{"category":"Simulation \/ Modeling \/ Design","link":"https:\/\/developer.nvidia.com\/blog\/category\/simulation-modeling-design\/","id":503,"data_source":""},"nv_translations":[],"jetpack_shortlink":"https:\/\/wp.me\/pcCQAL-Rr","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/3313","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/245"}],"replies":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=3313"}],"version-history":[{"count":1,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/3313\/revisions"}],"predecessor-version":[{"id":9492,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/3313\/revisions\/9492"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/9152"}],"wp:attachment":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media?parent=3313"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=3313"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=3313"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=3313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}