{"id":250,"date":"2011-02-23T23:59:23","date_gmt":"2011-02-23T22:59:23","guid":{"rendered":"http:\/\/numbercrunch.de\/blog\/?p=250"},"modified":"2015-08-27T15:58:06","modified_gmt":"2015-08-27T14:58:06","slug":"gpu-fft-performance","status":"publish","type":"post","link":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/","title":{"rendered":"GPU FFT performance"},"content":{"rendered":"<p style=\"text-align: justify;\">In a recent paper (see <a href=\"http:\/\/arxiv.org\/abs\/1012.3911\">arXiv:1012.3911<\/a>) we showed how to solve the time-dependent <a href=\"http:\/\/en.wikipedia.org\/wiki\/Schr%C3%B6dinger_equation\">Schr\u00f6dinger equation<\/a> and the time-dependent <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dirac_equation\">Dirac equation<\/a> by a Fourier split operator method on GPU hardware. For the Fourier split operator method one has to compute a fast Fourier transform (FFT) in each time step, and the FFT dominates the overall computing time. For this reason I evaluated various kinds of GPU hardware for calculating FFTs.<\/p>\n<p style=\"text-align: justify;\">I started with an Nvidia GeForce GTX480 (<a href=\"http:\/\/en.wikipedia.org\/wiki\/GeForce_400_Series\">Fermi<\/a> architecture), which is a consumer graphics card, and got quite satisfactory performance gains compared to traditional CPU implementations of FFTs, see below and <a href=\"http:\/\/arxiv.org\/abs\/1012.3911\">arXiv:1012.3911<\/a> for details. Consumer graphics cards such as the GTX480, however, have reduced double precision performance. <a href=\"http:\/\/www.nvidia.com\/object\/why-choose-tesla.html\">Tesla compute modules<\/a> have a double precision peak performance that is about four times higher than the peak performance of consumer graphics cards based on the Fermi architecture. 
I wondered: do we actually see this four-fold double precision advantage in FFT performance?<\/p>\n<p style=\"text-align: justify;\">Because I do not have a Tesla compute module, I ran a few benchmarks in the <a href=\"http:\/\/aws.amazon.com\/\">Amazon cloud<\/a> on a <a href=\"http:\/\/aws.amazon.com\/ec2\/instance-types\/\">Cluster GPU instance<\/a>. Each Cluster GPU instance is equipped with two Nvidia Tesla M2050 GPUs. Setting up a node in the cloud was easy, but FFT performance was quite disappointing. Especially for small problems, the Tesla M2050 performed quite poorly compared to the GeForce GTX480, see below. I cannot pin down the reason for the observed performance degradation with certainty. However, I speculate that <a href=\"virtualization\">virtualization<\/a> takes its toll here.<\/p>\n<p style=\"text-align: justify;\">Finally, I could also run my benchmark without a virtualization layer on a Tesla S2050 system (which contains four Tesla M2050 GPUs). Many thanks to <a href=\"http:\/\/www.megware.com\/\">Megware<\/a> for providing computing time. Performance measurements on this system support my conjecture that the Amazon Cluster GPU instance was slowed down by the cloud&#8217;s virtualization layer. If my benchmark runs on \u00bbbare metal\u00ab, then Tesla M2050 GPUs give approximately the same FFT performance as GeForce GTX480 GPUs do. However, because a Tesla M2050 GPU has about four times the double precision peak performance of a GeForce GTX480, one might expect even better FFT performance from a Tesla M2050 GPU. That we do not see this indicates that FFT performance is limited by memory bandwidth rather than by arithmetic throughput. In fact, to come close to the GeForce GTX480, we have to turn off error correction (ECC) on the Tesla M2050 GPU. (The GeForce GTX480 memory has no error correction.) 
Even without error correction the Tesla M2050 GPU is slightly slower than a GeForce GTX480 because it has a slightly lower clock rate.<\/p>\n<p style=\"text-align: justify;\">\n<figure id=\"attachment_248\" aria-describedby=\"caption-attachment-248\" style=\"width: 576px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_1d.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-248\" title=\"bench_fft_compare_1d\" src=\"\/\/numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_1d.png\" alt=\"\" width=\"576\" height=\"432\" srcset=\"https:\/\/www.numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_1d.png 576w, https:\/\/www.numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_1d-300x225.png 300w\" sizes=\"(max-width: 576px) 100vw, 576px\" \/><\/a><figcaption id=\"caption-attachment-248\" class=\"wp-caption-text\">Double precision FFT performance (computing time in seconds) for one-dimensional grids of size N on different kinds of Fermi GPUs. Computing time does not include data transfer between host and device memory. 
Some benchmarks ran on multi-GPU systems, but only a single GPU was used to perform FFTs with the CuFFT 3.2 library.<\/figcaption><\/figure>\n<p style=\"text-align: justify;\">\n<figure id=\"attachment_249\" aria-describedby=\"caption-attachment-249\" style=\"width: 576px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_2d.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-249\" title=\"bench_fft_compare_2d\" src=\"\/\/numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_2d.png\" alt=\"\" width=\"576\" height=\"432\" srcset=\"https:\/\/www.numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_2d.png 576w, https:\/\/www.numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_2d-300x225.png 300w\" sizes=\"(max-width: 576px) 100vw, 576px\" \/><\/a><figcaption id=\"caption-attachment-249\" class=\"wp-caption-text\">Double precision FFT performance (computing time in seconds) for two-dimensional grids of size N\u00d7N on different kinds of Fermi GPUs. Computing time does not include data transfer between host and device memory. Some benchmarks ran on multi-GPU systems, but only a single GPU was used to perform FFTs with the CuFFT 3.2 library.<\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>In a recent paper (see arXiv:1012.3911) we showed how to solve\u00a0 the time-dependent\u00a0 Schr\u00f6dinger equation and the time-dependent Dirac equation by a Fourier split operator method on GPU hardware. 
For the Fourier split operator method one has to compute a fast Fourier transform (FFT) in each time step and the FFT dominates the overall computing&hellip; <a href=\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">GPU FFT performance<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13,7],"tags":[],"class_list":["post-250","post","type-post","status-publish","format-standard","hentry","category-gpu-computing","category-performance"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>GPU FFT performance - Number Crunch<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"GPU FFT performance - Number Crunch\" \/>\n<meta property=\"og:description\" content=\"In a recent paper (see arXiv:1012.3911) we showed how to solve\u00a0 the time-dependent\u00a0 Schr\u00f6dinger equation and the time-dependent Dirac equation by a Fourier split operator method on GPU hardware. 
For the Fourier split operator method one has to compute a fast Fourier transform (FFT) in each time step and the FFT dominates the overall computing&hellip; Continue reading GPU FFT performance\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\" \/>\n<meta property=\"og:site_name\" content=\"Number Crunch\" \/>\n<meta property=\"article:published_time\" content=\"2011-02-23T22:59:23+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2015-08-27T14:58:06+00:00\" \/>\n<meta property=\"og:image\" content=\"\/\/numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_1d.png\" \/>\n<meta name=\"author\" content=\"Heiko Bauke\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Heiko Bauke\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\"},\"author\":{\"name\":\"Heiko Bauke\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\"},\"headline\":\"GPU FFT performance\",\"datePublished\":\"2011-02-23T22:59:23+00:00\",\"dateModified\":\"2015-08-27T14:58:06+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\"},\"wordCount\":554,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\"},\"articleSection\":[\"GPU 
computing\",\"Performance\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\",\"url\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\",\"name\":\"GPU FFT performance - Number Crunch\",\"isPartOf\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#website\"},\"datePublished\":\"2011-02-23T22:59:23+00:00\",\"dateModified\":\"2015-08-27T14:58:06+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.numbercrunch.de\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"GPU FFT performance\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#website\",\"url\":\"https:\/\/www.numbercrunch.de\/blog\/\",\"name\":\"Number Crunch\",\"description\":\"A computational science 
blog.\",\"publisher\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.numbercrunch.de\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413\",\"name\":\"Heiko Bauke\",\"logo\":{\"@id\":\"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"GPU FFT performance - Number Crunch","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/","og_locale":"en_US","og_type":"article","og_title":"GPU FFT performance - Number Crunch","og_description":"In a recent paper (see arXiv:1012.3911) we showed how to solve\u00a0 the time-dependent\u00a0 Schr\u00f6dinger equation and the time-dependent Dirac equation by a Fourier split operator method on GPU hardware. 
For the Fourier split operator method one has to compute a fast Fourier transform (FFT) in each time step and the FFT dominates the overall computing&hellip; Continue reading GPU FFT performance","og_url":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/","og_site_name":"Number Crunch","article_published_time":"2011-02-23T22:59:23+00:00","article_modified_time":"2015-08-27T14:58:06+00:00","og_image":[{"url":"\/\/numbercrunch.de\/blog\/wp-content\/uploads\/2011\/02\/bench_fft_compare_1d.png","type":"","width":"","height":""}],"author":"Heiko Bauke","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Heiko Bauke","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#article","isPartOf":{"@id":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/"},"author":{"name":"Heiko Bauke","@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413"},"headline":"GPU FFT performance","datePublished":"2011-02-23T22:59:23+00:00","dateModified":"2015-08-27T14:58:06+00:00","mainEntityOfPage":{"@id":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/"},"wordCount":554,"commentCount":1,"publisher":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413"},"articleSection":["GPU computing","Performance"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/","url":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/","name":"GPU FFT performance - Number 
Crunch","isPartOf":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#website"},"datePublished":"2011-02-23T22:59:23+00:00","dateModified":"2015-08-27T14:58:06+00:00","breadcrumb":{"@id":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.numbercrunch.de\/blog\/2011\/02\/gpu-fft-performance\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.numbercrunch.de\/blog\/"},{"@type":"ListItem","position":2,"name":"GPU FFT performance"}]},{"@type":"WebSite","@id":"https:\/\/www.numbercrunch.de\/blog\/#website","url":"https:\/\/www.numbercrunch.de\/blog\/","name":"Number Crunch","description":"A computational science blog.","publisher":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.numbercrunch.de\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/e73eab65b1721dd0c3d408edb887e413","name":"Heiko 
Bauke","logo":{"@id":"https:\/\/www.numbercrunch.de\/blog\/#\/schema\/person\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts\/250"}],"collection":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/comments?post=250"}],"version-history":[{"count":23,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts\/250\/revisions"}],"predecessor-version":[{"id":273,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/posts\/250\/revisions\/273"}],"wp:attachment":[{"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/media?parent=250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/categories?post=250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.numbercrunch.de\/blog\/wp-json\/wp\/v2\/tags?post=250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}