
Multicore and GPU programming : an integrated approach / Gerassimos Barlas.

Available online via O'Reilly Online Learning (Academic/Public Library Edition)

Format:
Book
Author/Creator:
Barlas, Gerassimos, author.
Language:
English
Subjects (All):
Multiprocessors.
Physical Description:
1 online resource (1 volume) : illustrations
Edition:
First edition.
Other Title:
Multicore and Graphics Processing Unit programming
Publication:
Amsterdam : Morgan Kaufmann, [2015]
Language Note:
English
System Details:
text file
Summary:
Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today’s computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
- Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA
- Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
- Particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems
- Download source code, examples, and instructor support materials on the book's companion website
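To give a concrete flavor of the directive-based approach the summary mentions, below is a minimal OpenMP reduction in C++, the language the book uses. This is an illustrative sketch only, not code from the book; the array size and variable names are arbitrary.

    // Minimal OpenMP sketch (illustrative, not from the book):
    // sum an array in parallel using a reduction clause.
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1000000;
        std::vector<double> v(N, 1.0);
        double sum = 0.0;

        // Iterations are split among threads; each thread keeps a private
        // partial sum, and the reduction clause combines them at the end.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < N; ++i)
            sum += v[i];

        std::printf("sum = %.0f\n", sum);
        return 0;
    }

Compiled with OpenMP enabled (e.g., g++ -fopenmp), the loop runs across all available cores; without the flag the pragma is ignored and the program runs sequentially, producing the same result.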
Contents:
Front Cover
Multicore and GPU Programming: An Integrated Approach
Copyright
Dedication
Contents
List of Tables
Preface
What Is in This Book
Using This Book as a Textbook
Software and Hardware Requirements
Sample Code
Chapter 1: Introduction
1.1 The era of multicore machines
1.2 A taxonomy of parallel machines
1.3 A glimpse of contemporary computing machines
1.3.1 The Cell BE processor
1.3.2 Nvidia's Kepler
1.3.3 AMD's APUs
1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi
1.4 Performance metrics
1.5 Predicting and measuring parallel program performance
1.5.1 Amdahl's law
1.5.2 Gustafson-Barsis's rebuttal
Exercises
Chapter 2: Multicore and parallel program design
2.1 Introduction
2.2 The PCAM methodology
2.3 Decomposition patterns
2.3.1 Task parallelism
2.3.2 Divide-and-conquer decomposition
2.3.3 Geometric decomposition
2.3.4 Recursive data decomposition
2.3.5 Pipeline decomposition
2.3.6 Event-based coordination decomposition
2.4 Program structure patterns
2.4.1 Single-program, multiple-data
2.4.2 Multiple-program, multiple-data
2.4.3 Master-worker
2.4.4 Map-reduce
2.4.5 Fork/join
2.4.6 Loop parallelism
2.5 Matching decomposition patterns with program structure patterns
Chapter 3: Shared-memory programming: threads
3.1 Introduction
3.2 Threads
3.2.1 What is a thread?
3.2.2 What are threads good for?
3.2.3 Thread creation and initialization
3.2.3.1 Implicit thread creation
3.2.4 Sharing data between threads
3.3 Design concerns
3.4 Semaphores
3.5 Applying semaphores in classical problems
3.5.1 Producers-consumers
3.5.2 Dealing with termination
3.5.2.1 Termination using a shared data item
3.5.2.2 Termination using messages
3.5.3 The barbershop problem: introducing fairness
3.5.4 Readers-writers
3.5.4.1 A solution favoring the readers
3.5.4.2 Giving priority to the writers
3.5.4.3 A fair solution
3.6 Monitors
3.6.1 Design approach 1: critical section inside the monitor
3.6.2 Design approach 2: monitor controls entry to critical section
3.7 Applying monitors in classical problems
3.7.1 Producers-consumers revisited
3.7.1.1 Producers-consumers: buffer manipulation within the monitor
3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor
3.7.2 Readers-writers
3.7.2.1 A solution favoring the readers
3.7.2.2 Giving priority to the writers
3.7.2.3 A fair solution
3.8 Dynamic vs. static thread management
3.8.1 Qt's thread pool
3.8.2 Creating and managing a pool of threads
3.9 Debugging multithreaded applications
3.10 Higher-level constructs: multithreaded programming without threads
3.10.1 Concurrent map
3.10.2 Map-reduce
3.10.3 Concurrent filter
3.10.4 Filter-reduce
3.10.5 A case study: multithreaded sorting
3.10.6 A case study: multithreaded image matching
Chapter 4: Shared-memory programming: OpenMP
4.1 Introduction
4.2 Your First OpenMP Program
4.3 Variable Scope
4.3.1 OpenMP Integration V.0: Manual Partitioning
4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition
4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking
4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction
4.3.5 Final Words on Variable Scope
4.4 Loop-Level Parallelism
4.4.1 Data Dependencies
4.4.1.1 Flow Dependencies
4.4.1.2 Antidependencies
4.4.1.3 Output Dependencies
4.4.2 Nested Loops
4.4.3 Scheduling
4.5 Task Parallelism
4.5.1 The sections Directive
4.5.1.1 Producers-Consumers in OpenMP
4.5.2 The task Directive
4.6 Synchronization Constructs
4.7 Correctness and Optimization Issues
4.7.1 Thread Safety
4.7.2 False Sharing
4.8 A Case Study: Sorting in OpenMP
4.8.1 Bottom-Up Mergesort in OpenMP
4.8.2 Top-Down Mergesort in OpenMP
4.8.3 Performance Comparison
Chapter 5: Distributed memory programming
5.1 Communicating Processes
5.2 MPI
5.3 Core concepts
5.4 Your first MPI program
5.5 Program architecture
5.5.1 SPMD
5.5.2 MPMD
5.6 Point-to-Point communication
5.7 Alternative Point-to-Point communication modes
5.7.1 Buffered Communications
5.8 Non-blocking communications
5.9 Point-to-Point Communications: Summary
5.10 Error reporting and handling
5.11 Collective communications
5.11.1 Scattering
5.11.2 Gathering
5.11.3 Reduction
5.11.4 All-to-All Gathering
5.11.5 All-to-All Scattering
5.11.6 All-to-All Reduction
5.11.7 Global Synchronization
5.12 Communicating objects
5.12.1 Derived Datatypes
5.12.2 Packing/Unpacking
5.13 Node management: communicators and groups
5.13.1 Creating Groups
5.13.2 Creating Intra-Communicators
5.14 One-sided communications
5.14.1 RMA Communication Functions
5.14.2 RMA Synchronization Functions
5.15 I/O considerations
5.16 Combining MPI processes with threads
5.17 Timing and Performance Measurements
5.18 Debugging and profiling MPI programs
5.19 The Boost.MPI library
5.19.1 Blocking and Non-blocking Communications
5.19.2 Data Serialization
5.19.3 Collective Operations
5.20 A case study: diffusion-limited aggregation
5.21 A case study: brute-force encryption cracking
5.21.1 Version #1: "plain-vanilla" MPI
5.21.2 Version #2: combining MPI and OpenMP
5.22 A Case Study: MPI Implementation of the Master-Worker Pattern
5.22.1 A Simple Master-Worker Setup
5.22.2 A Multithreaded Master-Worker Setup
Chapter 6: GPU programming
6.1 GPU Programming
6.2 CUDA's programming model: threads, blocks, and grids
6.3 CUDA's execution model: streaming multiprocessors and warps
6.4 CUDA compilation process
6.5 Putting together a CUDA project
6.6 Memory hierarchy
6.6.1 Local Memory/Registers
6.6.2 Shared Memory
6.6.3 Constant Memory
6.6.4 Texture and Surface Memory
6.7 Optimization techniques
6.7.1 Block and Grid Design
6.7.2 Kernel Structure
6.7.3 Shared Memory Access
6.7.4 Global Memory Access
6.7.5 Page-Locked and Zero-Copy Memory
6.7.6 Unified Memory
6.7.7 Asynchronous Execution and Streams
6.7.7.1 Stream Synchronization: Events and Callbacks
6.8 Dynamic parallelism
6.9 Debugging CUDA programs
6.10 Profiling CUDA programs
6.11 CUDA and MPI
6.12 Case studies
6.12.1 Fractal Set Calculation
6.12.1.1 Version #1: One thread per pixel
6.12.1.2 Version #2: Pinned host and pitched device memory
6.12.1.3 Version #3: Multiple pixels per thread
6.12.1.4 Evaluation
6.12.2 Block Cipher Encryption
6.12.2.1 Version #1: The case of a standalone GPU machine
6.12.2.2 Version #2: Overlapping GPU communication and computation
6.12.2.3 Version #3: Using a cluster of GPU machines
6.12.2.4 Evaluation
Chapter 7: The Thrust template library
7.1 Introduction
7.2 First steps in Thrust
7.3 Working with Thrust datatypes
7.4 Thrust algorithms
7.4.1 Transformations
7.4.2 Sorting and searching
7.4.3 Reductions
7.4.4 Scans/prefix sums
7.4.5 Data management and manipulation
7.5 Fancy iterators
7.6 Switching device back ends
7.7 Case studies
7.7.1 Monte Carlo integration
7.7.2 DNA Sequence alignment
Exercises
Chapter 8: Load balancing
8.1 Introduction
8.2 Dynamic load balancing: the Linda legacy
8.3 Static Load Balancing: The Divisible Load Theory Approach
8.3.1 Modeling Costs
8.3.2 Communication Configuration
8.3.3 Analysis
8.3.3.1 N-Port, Block-Type, Single-Installment Solution
8.3.3.2 One-Port, Block-Type, Single-Installment Solution
8.3.4 Summary - Short Literature Review
8.4 DLTlib: A library for partitioning workloads
8.5 Case studies
8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing
8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing
Appendix A: Compiling Qt programs
A.1 Using an IDE
A.2 The qmake Utility
Appendix B: Running MPI programs
B.1 Preparatory Steps
B.2 Computing Nodes Discovery for MPI Program Deployment
B.2.1 Host Discovery with the nmap Utility
B.2.2 Automatic Generation of a Hostfile
Appendix C: Time measurement
C.1 Introduction
C.2 POSIX High-Resolution Timing
C.3 Timing in Qt
C.4 Timing in OpenMP
C.5 Timing in MPI
C.6 Timing in CUDA
Appendix D: Boost.MPI
D.1 Mapping from MPI C to Boost.MPI
Appendix E: Setting up CUDA
E.1 Installation
E.2 Issues with GCC
E.3 Running CUDA without an Nvidia GPU
E.4 Running CUDA on Optimus-Equipped Laptops
E.5 Combining CUDA with Third-Party Libraries
Appendix F: DLTlib
F.1 DLTlib Functions
F.1.1 Class Network: Generic Methods
F.1.2 Class Network: Query Processing
F.1.3 Class Network: Image Processing
F.1.4 Class Network: Image Registration
F.2 DLTlib Files
Glossary
Bibliography
Index
Notes:
Bibliographic Level Mode of Issuance: Monograph
Includes bibliographical references and index.
Description based on print version record.
ISBN:
9780124171374
0124171370
OCLC:
903404072
