CUDA Fortran for Scientists and Engineers : Best Practices for Efficient CUDA Fortran Programming / Gregory Ruetsch and Massimiliano Fatica.
- Format:
- Book
- Author/Creator:
- Ruetsch, Gregory, author.
- Fatica, Massimiliano, author.
- Language:
- English
- Subjects (All):
- FORTRAN (Computer program language).
- Physical Description:
- 1 online resource (438 pages)
- Edition:
- Second edition.
- Place of Publication:
- Cambridge, MA : Morgan Kaufmann, [2024]
- Summary:
- CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience and cover the basics along with best practices for efficient GPU computing using CUDA Fortran. To add CUDA Fortran to existing Fortran codes, they explain how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance – all in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code. This second edition provides much-needed updates on how to efficiently program GPUs in CUDA Fortran. It can be used either as a tutorial on GPU programming in CUDA Fortran or as a reference text.
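For readers unfamiliar with CUDA Fortran, a minimal sketch in the spirit of the opening chapter's first program (module, kernel, and variable names here are illustrative, not necessarily the book's own code; assumes the NVIDIA HPC SDK nvfortran compiler):

```fortran
module simple_kernels
contains
  ! Kernel that adds the scalar b to every element of a.
  ! attributes(global) marks this as device code launched from the host.
  attributes(global) subroutine increment(a, b)
    implicit none
    integer, intent(inout) :: a(:)
    integer, value :: b
    integer :: i
    ! Map each thread to one array element using the built-in
    ! blockDim, blockIdx, and threadIdx variables.
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + b
  end subroutine increment
end module simple_kernels

program test_increment
  use cudafor
  use simple_kernels
  implicit none
  integer, parameter :: n = 1024, tpb = 256
  integer :: a(n)
  integer, device :: a_d(n)   ! device-resident copy of a

  a = 1
  a_d = a                     ! host-to-device transfer via assignment
  ! Launch with enough thread blocks of tpb threads to cover n elements.
  call increment<<<(n + tpb - 1) / tpb, tpb>>>(a_d, 3)
  a = a_d                     ! device-to-host transfer
  if (all(a == 4)) then
    print *, 'Test passed'
  else
    print *, 'Test failed'
  end if
end program test_increment
```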
- Contents:
- Front Cover
- CUDA Fortran for Scientists and Engineers
- Copyright
- Contents
- Preface to the Second Edition
- Preface to the First Edition
- Acknowledgments
- 1 CUDA Fortran programming
- 1 Introduction
- 1.1 A brief history of GPU computing
- 1.2 Parallel computation
- 1.3 Basic concepts
- 1.3.1 A first CUDA Fortran program
- 1.3.1.1 CUDA Fortran compilation
- 1.3.2 Extending to larger arrays
- 1.3.3 Multidimensional arrays
- 1.3.4 Interfaces for device code
- 1.3.5 Managed data
- 1.3.6 Kernel loop directives and CUF kernels
- 1.4 Determining CUDA hardware features and limits
- 1.4.1 Choosing a device to run on
- 1.4.2 Floating point precision
- 1.4.2.1 Accommodating variable precision
- 1.5 Error handling
- 1.6 Compiling CUDA Fortran code
- 1.7 CUDA Driver, Toolkit, and compatibility
- 2 Correctness, accuracy, and debugging
- 2.1 Assessing correctness of results
- 2.1.1 Non-associativity of floating point arithmetic
- 2.1.2 Fused-multiply add
- 2.1.3 Flags affecting floating-point accuracy
- 2.2 Debugging
- 2.2.1 Printing from device code
- 2.2.2 Debugging with cuda-gdb
- 2.2.2.1 System requirements
- 2.2.2.2 Compilation
- 2.2.2.3 Setting breakpoints
- 2.2.2.4 Focus - software and hardware coordinates
- 2.2.2.5 CUDA activity status
- 2.2.2.6 Single-stepping in device code
- 2.2.2.7 Examining program state
- 2.2.3 compute-sanitizer
- 3 Performance measurement and metrics
- 3.1 Measuring execution time
- 3.1.1 Host-device synchronization and CPU timers
- 3.1.2 Timing via CUDA events
- 3.1.3 Nsight Systems command-line interface nsys
- 3.1.3.1 Nsight Systems graphical user interface nsys-ui
- 3.1.4 Customizing profiling with nvtx
- 3.1.4.1 Basic NVTX tooling interfaces
- 3.1.4.2 Advanced NVTX tooling interfaces
- 3.1.4.3 Automated NVTX instrumentation
- 3.2 Instruction, bandwidth, and latency bound kernels
- 3.3 Memory bandwidth
- 3.3.1 Theoretical peak bandwidth
- 3.3.2 Effective bandwidth
- 3.3.3 Actual data throughput vs. effective bandwidth
- 4 Synchronization
- 4.1 Synchronization of kernel execution and data transfers
- 4.1.1 Pageable versus pinned host memory
- 4.1.2 Streams
- 4.1.2.1 Creating streams
- 4.1.3 Asynchronous transfers via cudaMemcpyAsync()
- 4.1.4 Synchronization barriers
- 4.1.4.1 cudaDeviceSynchronize()
- 4.1.4.2 cudaStreamSynchronize()
- 4.1.4.3 cudaEventSynchronize()
- 4.1.4.4 Querying streams and events
- 4.1.5 Advanced stream topics
- 4.1.5.1 The default stream
- 4.1.5.2 Non-blocking streams
- 4.1.5.3 Stream priorities
- 4.2 Synchronization of kernel threads on the device
- 4.2.1 Shared memory
- 4.2.2 Synchronizing threads within a block
- 4.2.3 Warps, warp synchronization, and warp-level primitives
- 4.2.3.1 SHFL functions
- 4.2.4 Atomics
- 4.2.5 Memory fences
- 4.2.6 Cooperative groups
- 4.2.6.1 Grid synchronization
- 4.2.6.2 Thread block clusters
- 5 Optimization
- 5.1 Transfers between host and device
- 5.1.1 Pinned memory
- 5.2 Device memory
- 5.2.1 ECC (Error Correcting Code)
- 5.2.2 Global memory
- 5.2.2.1 Declaring global array arguments in kernels
- 5.2.2.2 Coalesced global memory access
- 5.2.3 Local memory
- 5.2.4 Constant memory
- 5.2.5 L1 and L2 caches
- 5.2.6 Shared memory
- 5.2.6.1 Configuring shared memory
- 5.2.6.2 Global memory coalescing through shared memory
- 5.2.6.3 Shared memory bank conflicts
- 5.2.7 Registers
- 5.2.7.1 Exchanging register data between threads in a warp
- 5.3 Execution configuration
- 5.3.1 Thread-level parallelism
- 5.3.2 Instruction-level parallelism
- 5.3.2.1 Asynchronous data transfers between global and shared memory
- 5.3.2.2 Instruction-level parallelism in CUF kernels
- 5.4 Instruction optimization
- 5.4.1 Device intrinsics
- 5.4.1.1 Directed rounding
- 5.4.1.2 C intrinsics
- 5.4.1.3 Fast math intrinsics
- 5.4.1.4 Compiler options
- 5.4.2 Divergent warps
- 6 Porting tips and techniques
- 6.1 CUF kernels
- 6.2 Conditional inclusion of code
- 6.3 Renaming variables
- 6.3.1 Renaming via use statements
- 6.3.2 Renaming via the associate construct
- 6.4 Minimizing memory footprint for work arrays
- 6.5 Array compaction
- 7 Interfacing with CUDA C code and CUDA libraries
- 7.1 Calling user-written CUDA C code
- 7.1.1 The ignore_tkr directive
- 7.2 cuBLAS
- 7.2.1 Legacy cuBLAS API
- 7.2.2 New cuBLAS API
- 7.2.3 Batched cuBLAS routines
- 7.2.4 GEMM with tensor cores
- 7.3 cuSPARSE
- 7.4 cuSOLVER
- 7.5 cuTENSOR
- 7.5.1 Low-level cuTENSOR interfaces
- 7.6 Thrust
- 8 Multi-GPU programming
- 8.1 CUDA multi-GPU features
- 8.1.1 Peer-to-peer communication
- 8.1.1.1 Requirements for peer-to-peer communication
- 8.1.2 Peer-to-peer direct transfers
- 8.1.3 Peer-to-peer transpose
- 8.2 Multi-GPU programming with MPI
- 8.2.1 Assigning devices to MPI ranks
- 8.2.2 MPI transpose
- 8.2.3 GPU-aware MPI transpose
- 2 Case studies
- 9 Monte Carlo method
- 9.1 CURAND
- 9.2 Computing π with CUF kernels
- 9.2.1 IEEE-754 precision
- 9.3 Computing π with reduction kernels
- 9.3.1 Reductions with SHFL instructions
- 9.3.2 Reductions with atomic locks
- 9.3.3 Reductions using the grid_group cooperative group
- 9.4 Accuracy of summation
- 9.5 Option pricing
- 10 Finite difference method
- 10.1 Nine-point 1D finite difference stencil
- 10.1.1 Data reuse and shared memory
- 10.1.2 The x-derivative kernel
- 10.1.2.1 Performance of the x-derivative kernel
- 10.1.3 Derivatives in y and z
- 10.1.4 Nonuniform grids
- 10.2 2D Laplace equation
- 11 Applications of the fast Fourier transform
- 11.1 CUFFT
- 11.2 Spectral derivatives
- 11.3 Convolution
- 11.4 Poisson solver
- 11.4.1 Vortex dynamics
- 12 Ray tracing
- 12.1 Generating an image file
- PPM file format
- Implementation
- 12.2 Vectors in CUDA Fortran
- Implementation of the RGB module
- 12.3 Rays, a simple camera, and background
- First ray-tracing code
- 12.4 Adding a sphere
- Theory
- 12.5 Surface normals and multiple objects
- Surface normals
- Multiple objects
- 12.6 Antialiasing
- CUDA implementation
- 12.7 Material types
- 12.7.1 Diffuse materials
- Random number generation
- Modifications to the color() routine
- 12.7.2 Metal
- Material derived types
- Initialization of spheres and material types
- The scatter() function and Cray pointers
- 12.7.3 Dielectrics
- 12.8 Positionable camera
- 12.9 Defocus blur
- 12.10 Where next?
- 12.11 Triangles
- 12.12 Lights
- 12.13 Textures
- 3 Appendices
- A System and environment management
- A.1 Environment variables
- A.1.1 General
- A.1.2 Just-in-time compilation
- A.2 nvidia-smi - System Management Interface
- A.2.1 Enabling and disabling ECC
- A.2.2 Compute mode
- A.2.3 Persistence mode
- A.2.4 Topology
- References
- Index
- Back Cover.
- Notes:
- Description based on publisher supplied metadata and other sources.
- Description based on print version record.
- Includes bibliographical references and index.
- ISBN:
- 0-443-21976-1
- OCLC:
- 1446222244