Parallel and High Performance Computing.
- Format:
- Book
- Author/Creator:
- Robey, Robert.
- Language:
- English
- Subjects (All):
- Parallel programming (Computer science).
- Electronic data processing.
- Big data.
- C++ (Computer program language).
- Genre:
- Instructional and educational works.
- Physical Description:
- 1 online resource (604 pages)
- Place of Publication:
- New York : Manning Publications Co. LLC, 2021.
- Summary:
- Parallel and High Performance Computing offers techniques guaranteed to boost your code's effectiveness. Complex calculations, like training deep learning models or running large-scale simulations, can take an extremely long time. Efficient parallel programming can save hours, or even days, of computing time. This book shows you how to deliver faster run times, greater scalability, and increased energy efficiency to your programs by mastering parallel techniques for multicore processor and GPU hardware.
- About the technology: Write fast, powerful, energy-efficient programs that scale to tackle huge volumes of data. Using parallel programming, your code spreads data-processing tasks across multiple CPUs for radically better performance. With a little help, you can create software that maximizes both speed and efficiency.
- About the book: You'll learn to evaluate hardware architectures and work with industry-standard tools such as OpenMP and MPI. You'll master the data structures and algorithms best suited for high performance computing and learn techniques that save energy on handheld devices. You'll even run a massive tsunami simulation across a bank of GPUs.
- What's inside:
  - Planning a new parallel project
  - Understanding differences in CPU and GPU architecture
  - Addressing underperforming kernels and loops
  - Managing applications with batch scheduling
- About the reader: For experienced programmers proficient with a high-performance computing language like C, C++, or Fortran.
- About the authors: Robert Robey works at Los Alamos National Laboratory and has been active in the field of parallel computing for over 30 years. Yuliana Zamora is a PhD student and Siebel Scholar at the University of Chicago and has lectured on programming modern hardware at numerous national conferences.
- Table of contents:
  - Part 1: Introduction to parallel computing (1 Why parallel computing?; 2 Planning for parallelization; 3 Performance limits and profiling; 4 Data design and performance models; 5 Parallel algorithms and patterns)
  - Part 2: CPU: The parallel workhorse (6 Vectorization: FLOPs for free; 7 OpenMP that performs; 8 MPI: The parallel backbone)
  - Part 3: GPUs: Built to accelerate (9 GPU architectures and concepts; 10 GPU programming model; 11 Directive-based GPU programming; 12 GPU languages: Getting down to basics; 13 GPU profiling and tools)
  - Part 4: High performance computing ecosystems (14 Affinity: Truce with the kernel; 15 Batch schedulers: Bringing order to chaos; 16 File operations for a parallel world; 17 Tools and resources for better code)
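- Example: The loop-level OpenMP parallelism the summary describes can be seen in a minimal sketch. The following C program is not from the book; it is an illustrative example (array size and values are arbitrary, and it assumes a compiler with OpenMP support, e.g. `gcc -fopenmp`) of spreading a data-processing loop and a reduction across the available CPU cores:

```c
/* A minimal sketch, not from the book: loop-level OpenMP parallelism
 * of the kind the summary describes. Build with: gcc -fopenmp example.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void) {
    double sum = 0.0;

    /* Threads split the iteration space; each initializes a slice. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
    }

    /* reduction(+:sum) gives each thread a private partial sum and
     * combines them safely when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }

    printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```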
- Contents:
- Intro
- Parallel and High Performance Computing
- Copyright
- Dedication
- contents
- front matter
- foreword
- Yulie Zamora, University of Chicago, Illinois
- How we came to write this book
- acknowledgments
- about this book
- Who should read this book
- Part 1 Introduction to parallel computing
- 1 Why parallel computing?
- 1.1 Why should you learn about parallel computing?
- 1.1.1 What are the potential benefits of parallel computing?
- 1.1.2 Parallel computing cautions
- 1.2 The fundamental laws of parallel computing
- 1.2.1 The limit to parallel computing: Amdahl's Law
- 1.2.2 Breaking through the parallel limit: Gustafson-Barsis's Law
- 1.3 How does parallel computing work?
- 1.3.1 Walking through a sample application
- 1.3.2 A hardware model for today's heterogeneous parallel systems
- 1.3.3 The application/software model for today's heterogeneous parallel systems
- 1.4 Categorizing parallel approaches
- 1.5 Parallel strategies
- 1.6 Parallel speedup versus comparative speedups: Two different measures
- 1.7 What will you learn in this book?
- 1.7.1 Additional reading
- 1.7.2 Exercises
- Summary
- 2 Planning for parallelization
- 2.1 Approaching a new project: The preparation
- 2.1.1 Version control: Creating a safety vault for your parallel code
- 2.1.2 Test suites: The first step to creating a robust, reliable application
- 2.1.3 Finding and fixing memory issues
- 2.1.4 Improving code portability
- 2.2 Profiling: Probing the gap between system capabilities and application performance
- 2.3 Planning: A foundation for success
- 2.3.1 Exploring with benchmarks and mini-apps
- 2.3.2 Design of the core data structures and code modularity
- 2.3.3 Algorithms: Redesign for parallel
- 2.4 Implementation: Where it all happens
- 2.5 Commit: Wrapping it up with quality
- 2.6 Further explorations
- 2.6.1 Additional reading
- 2.6.2 Exercises
- 3 Performance limits and profiling
- 3.1 Know your application's potential performance limits
- 3.2 Determine your hardware capabilities: Benchmarking
- 3.2.1 Tools for gathering system characteristics
- 3.2.2 Calculating theoretical maximum flops
- 3.2.3 The memory hierarchy and theoretical memory bandwidth
- 3.2.4 Empirical measurement of bandwidth and flops
- 3.2.5 Calculating the machine balance between flops and bandwidth
- 3.3 Characterizing your application: Profiling
- 3.3.1 Profiling tools
- 3.3.2 Empirical measurement of processor clock frequency and energy consumption
- 3.3.3 Tracking memory during run time
- 3.4 Further explorations
- 3.4.1 Additional reading
- 3.4.2 Exercises
- 4 Data design and performance models
- 4.1 Performance data structures: Data-oriented design
- 4.1.1 Multidimensional arrays
- 4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
- 4.1.3 Array of Structures of Arrays (AoSoA)
- 4.2 Three Cs of cache misses: Compulsory, capacity, conflict
- 4.3 Simple performance models: A case study
- 4.3.1 Full matrix data representations
- 4.3.2 Compressed sparse storage representations
- 4.4 Advanced performance models
- 4.5 Network messages
- 4.6 Further explorations
- 4.6.1 Additional reading
- 4.6.2 Exercises
- 5 Parallel algorithms and patterns
- 5.1 Algorithm analysis for parallel computing applications
- 5.2 Performance models versus algorithmic complexity
- 5.3 Parallel algorithms: What are they?
- 5.4 What is a hash function?
- 5.5 Spatial hashing: A highly parallel algorithm
- 5.5.1 Using perfect hashing for spatial mesh operations
- 5.5.2 Using compact hashing for spatial mesh operations
- 5.6 Prefix sum (scan) pattern and its importance in parallel computing
- 5.6.1 Step-efficient parallel scan operation
- 5.6.2 Work-efficient parallel scan operation
- 5.6.3 Parallel scan operations for large arrays
- 5.7 Parallel global sum: Addressing the problem of associativity
- 5.8 Future of parallel algorithm research
- 5.9 Further explorations
- 5.9.1 Additional reading
- 5.9.2 Exercises
- Part 2 CPU: The parallel workhorse
- 6 Vectorization: FLOPs for free
- 6.1 Vectorization and single instruction, multiple data (SIMD) overview
- 6.2 Hardware trends for vectorization
- 6.3 Vectorization methods
- 6.3.1 Optimized libraries provide performance for little effort
- 6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
- 6.3.3 Teaching the compiler through hints: Pragmas and directives
- 6.3.4 Crappy loops, we got them: Use vector intrinsics
- 6.3.5 Not for the faint of heart: Using assembler code for vectorization
- 6.4 Programming style for better vectorization
- 6.5 Compiler flags relevant for vectorization for various compilers
- 6.6 OpenMP SIMD directives for better portability
- 6.7 Further explorations
- 6.7.1 Additional reading
- 6.7.2 Exercises
- 7 OpenMP that performs
- 7.1 OpenMP introduction
- 7.1.1 OpenMP concepts
- 7.1.2 A simple OpenMP program
- 7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
- 7.2.1 Loop-level OpenMP for quick parallelization
- 7.2.2 High-level OpenMP for better parallel performance
- 7.2.3 MPI plus OpenMP for extreme scalability
- 7.3 Examples of standard loop-level OpenMP
- 7.3.1 Loop-level OpenMP: Vector addition example
- 7.3.2 Stream triad example
- 7.3.3 Loop-level OpenMP: Stencil example
- 7.3.4 Performance of loop-level examples
- 7.3.5 Reduction example of a global sum using OpenMP threading
- 7.3.6 Potential loop-level OpenMP issues
- 7.4 Variable scope importance for correctness in OpenMP
- 7.5 Function-level OpenMP: Making a whole function thread parallel
- 7.6 Improving parallel scalability with high-level OpenMP
- 7.6.1 How to implement high-level OpenMP
- 7.6.2 Example of implementing high-level OpenMP
- 7.7 Hybrid threading and vectorization with OpenMP
- 7.8 Advanced examples using OpenMP
- 7.8.1 Stencil example with a separate pass for the x and y directions
- 7.8.2 Kahan summation implementation with OpenMP threading
- 7.8.3 Threaded implementation of the prefix scan algorithm
- 7.9 Threading tools essential for robust implementations
- 7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
- 7.9.2 Finding your thread race conditions with Intel® Inspector
- 7.10 Example of a task-based support algorithm
- 7.11 Further explorations
- 7.11.1 Additional reading
- 7.11.2 Exercises
- 8 MPI: The parallel backbone
- 8.1 The basics for an MPI program
- 8.1.1 Basic MPI function calls for every MPI program
- 8.1.2 Compiler wrappers for simpler MPI programs
- 8.1.3 Using parallel startup commands
- 8.1.4 Minimum working example of an MPI program
- 8.2 The send and receive commands for process-to-process communication
- 8.3 Collective communication: A powerful component of MPI
- 8.3.1 Using a barrier to synchronize timers
- 8.3.2 Using the broadcast to handle small file input
- 8.3.3 Using a reduction to get a single value from across all processes
- 8.3.4 Using gather to put order in debug printouts
- 8.3.5 Using scatter and gather to send data out to processes for work
- 8.4 Data parallel examples
- 8.4.1 Stream triad to measure bandwidth on the node
- 8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
- 8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
- 8.5 Advanced MPI functionality to simplify code and enable optimizations
- 8.5.1 Using custom MPI data types for performance and code simplification
- 8.5.2 Cartesian topology support in MPI
- 8.5.3 Performance tests of ghost cell exchange variants
- 8.6 Hybrid MPI plus OpenMP for extreme scalability
- 8.6.1 The benefits of hybrid MPI plus OpenMP
- 8.6.2 MPI plus OpenMP example
- 8.7 Further explorations
- 8.7.1 Additional reading
- 8.7.2 Exercises
- Part 3 GPUs: Built to accelerate
- 9 GPU architectures and concepts
- 9.1 The CPU-GPU system as an accelerated computational platform
- 9.1.1 Integrated GPUs: An underused option on commodity-based systems
- 9.1.2 Dedicated GPUs: The workhorse option
- 9.2 The GPU and the thread engine
- 9.2.1 The compute unit is the streaming multiprocessor (or subslice)
- 9.2.2 Processing elements are the individual processors
- 9.2.3 Multiple data operations by each processing element
- 9.2.4 Calculating the peak theoretical flops for some leading GPUs
- 9.3 Characteristics of GPU memory spaces
- 9.3.1 Calculating theoretical peak memory bandwidth
- 9.3.2 Measuring the GPU stream benchmark
- 9.3.3 Roofline performance model for GPUs
- 9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
- 9.4 The PCI bus: CPU to GPU data transfer overhead
- 9.4.1 Theoretical bandwidth of the PCI bus
- 9.4.2 A benchmark application for PCI bandwidth
- 9.5 Multi-GPU platforms and MPI
- 9.5.1 Optimizing the data movement between GPUs across the network
- 9.5.2 A higher performance alternative to the PCI bus
- 9.6 Potential benefits of GPU-accelerated platforms
- 9.6.1 Reducing time-to-solution
- 9.6.2 Reducing energy use with GPUs
- 9.6.3 Reduction in cloud computing costs with GPUs
- 9.7 When to use GPUs
- 9.8 Further explorations
- 9.8.1 Additional reading
- 9.8.2 Exercises
- Notes:
- Description based on publisher-supplied metadata and other sources.
- ISBN:
- 9781638350385
- 1638350388
- OCLC:
- 1262371463