
Multicore and GPU programming : an integrated approach / Gerassimos Barlas.

Available online via O'Reilly Online Learning (Academic/Public Library Edition)

Format:
Book
Author/Creator:
Barlas, Gerassimos, author.
Language:
English
Subjects (All):
Multiprocessors.
Physical Description:
1 online resource (1 volume) : illustrations
Edition:
First edition.
Other Title:
Multicore and Graphics Processing Unit programming
Publication:
Amsterdam : Morgan Kaufmann, [2015]
Language Note:
English
System Details:
text file
Summary:
Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today’s computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
- Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA
- Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
- Particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems
- Download source code, examples, and instructor support materials on the book's companion website
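To give a concrete flavor of the directive-based approach the summary mentions, below is a minimal OpenMP reduction in C++, the language the book uses. This is an illustrative sketch only, not code from the book; the array size and variable names are arbitrary.

    // Minimal OpenMP sketch (illustrative, not from the book):
    // sum an array in parallel using a reduction clause.
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1000000;
        std::vector<double> v(N, 1.0);
        double sum = 0.0;

        // Iterations are split among threads; each thread keeps a private
        // partial sum, and the reduction clause combines them at the end.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < N; ++i)
            sum += v[i];

        std::printf("sum = %.0f\n", sum);
        return 0;
    }

Compiled with OpenMP enabled (e.g., g++ -fopenmp), the loop runs across all available cores; without the flag the pragma is ignored and the program runs sequentially, producing the same result.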
Contents:
Front Cover
Multicore and GPU Programming: An Integrated Approach
Copyright
Dedication
Contents
List of Tables
Preface
What Is in This Book
Using This Book as a Textbook
Software and Hardware Requirements
Sample Code
Chapter 1: Introduction
1.1 The era of multicore machines
1.2 A taxonomy of parallel machines
1.3 A glimpse of contemporary computing machines
1.3.1 The Cell BE processor
1.3.2 Nvidia's Kepler
1.3.3 AMD's APUs
1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi
1.4 Performance metrics
1.5 Predicting and measuring parallel program performance
1.5.1 Amdahl's law
1.5.2 Gustafson-Barsis's rebuttal
Exercises
Chapter 2: Multicore and parallel program design
2.1 Introduction
2.2 The PCAM methodology
2.3 Decomposition patterns
2.3.1 Task parallelism
2.3.2 Divide-and-conquer decomposition
2.3.3 Geometric decomposition
2.3.4 Recursive data decomposition
2.3.5 Pipeline decomposition
2.3.6 Event-based coordination decomposition
2.4 Program structure patterns
2.4.1 Single-program, multiple-data
2.4.2 Multiple-program, multiple-data
2.4.3 Master-worker
2.4.4 Map-reduce
2.4.5 Fork/join
2.4.6 Loop parallelism
2.5 Matching decomposition patterns with program structure patterns
Chapter 3: Shared-memory programming: threads
3.1 Introduction
3.2 Threads
3.2.1 What is a thread?
3.2.2 What are threads good for?
3.2.3 Thread creation and initialization
3.2.3.1 Implicit thread creation
3.2.4 Sharing data between threads
3.3 Design concerns
3.4 Semaphores
3.5 Applying semaphores in classical problems
3.5.1 Producers-consumers
3.5.2 Dealing with termination
3.5.2.1 Termination using a shared data item
3.5.2.2 Termination using messages
3.5.3 The barbershop problem: introducing fairness
3.5.4 Readers-writers
3.5.4.1 A solution favoring the readers
3.5.4.2 Giving priority to the writers
3.5.4.3 A fair solution
3.6 Monitors
3.6.1 Design approach 1: critical section inside the monitor
3.6.2 Design approach 2: monitor controls entry to critical section
3.7 Applying monitors in classical problems
3.7.1 Producers-consumers revisited
3.7.1.1 Producers-consumers: buffer manipulation within the monitor
3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor
3.7.2 Readers-writers
3.7.2.1 A solution favoring the readers
3.7.2.2 Giving priority to the writers
3.7.2.3 A fair solution
3.8 Dynamic vs. static thread management
3.8.1 Qt's thread pool
3.8.2 Creating and managing a pool of threads
3.9 Debugging multithreaded applications
3.10 Higher-level constructs: multithreaded programming without threads
3.10.1 Concurrent map
3.10.2 Map-reduce
3.10.3 Concurrent filter
3.10.4 Filter-reduce
3.10.5 A case study: multithreaded sorting
3.10.6 A case study: multithreaded image matching
Chapter 4: Shared-memory programming: OpenMP
4.1 Introduction
4.2 Your First OpenMP Program
4.3 Variable Scope
4.3.1 OpenMP Integration V.0: Manual Partitioning
4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition
4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking
4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction
4.3.5 Final Words on Variable Scope
4.4 Loop-Level Parallelism
4.4.1 Data Dependencies
4.4.1.1 Flow Dependencies
4.4.1.2 Antidependencies
4.4.1.3 Output Dependencies
4.4.2 Nested Loops
4.4.3 Scheduling
4.5 Task Parallelism
4.5.1 The sections Directive
4.5.1.1 Producers-Consumers in OpenMP
4.5.2 The task Directive
4.6 Synchronization Constructs
4.7 Correctness and Optimization Issues
4.7.1 Thread Safety
4.7.2 False Sharing
4.8 A Case Study: Sorting in OpenMP
4.8.1 Bottom-Up Mergesort in OpenMP
4.8.2 Top-Down Mergesort in OpenMP
4.8.3 Performance Comparison
Chapter 5: Distributed memory programming
5.1 Communicating Processes
5.2 MPI
5.3 Core concepts
5.4 Your first MPI program
5.5 Program architecture
5.5.1 SPMD
5.5.2 MPMD
5.6 Point-to-Point communication
5.7 Alternative Point-to-Point communication modes
5.7.1 Buffered Communications
5.8 Non-blocking communications
5.9 Point-to-Point Communications: Summary
5.10 Error reporting and handling
5.11 Collective communications
5.11.1 Scattering
5.11.2 Gathering
5.11.3 Reduction
5.11.4 All-to-All Gathering
5.11.5 All-to-All Scattering
5.11.6 All-to-All Reduction
5.11.7 Global Synchronization
5.12 Communicating objects
5.12.1 Derived Datatypes
5.12.2 Packing/Unpacking
5.13 Node management: communicators and groups
5.13.1 Creating Groups
5.13.2 Creating Intra-Communicators
5.14 One-sided communications
5.14.1 RMA Communication Functions
5.14.2 RMA Synchronization Functions
5.15 I/O considerations
5.16 Combining MPI processes with threads
5.17 Timing and Performance Measurements
5.18 Debugging and profiling MPI programs
5.19 The Boost.MPI library
5.19.1 Blocking and Non-blocking Communications
5.19.2 Data Serialization
5.19.3 Collective Operations
5.20 A case study: diffusion-limited aggregation
5.21 A case study: brute-force encryption cracking
5.21.1 Version #1: "plain-vanilla" MPI
5.21.2 Version #2: combining MPI and OpenMP
5.22 A Case Study: MPI Implementation of the Master-Worker Pattern
5.22.1 A Simple Master-Worker Setup
5.22.2 A Multithreaded Master-Worker Setup
Chapter 6: GPU programming
6.1 GPU Programming
6.2 CUDA's programming model: threads, blocks, and grids
6.3 CUDA's execution model: streaming multiprocessors and warps
6.4 CUDA compilation process
6.5 Putting together a CUDA project
6.6 Memory hierarchy
6.6.1 Local Memory/Registers
6.6.2 Shared Memory
6.6.3 Constant Memory
6.6.4 Texture and Surface Memory
6.7 Optimization techniques
6.7.1 Block and Grid Design
6.7.2 Kernel Structure
6.7.3 Shared Memory Access
6.7.4 Global Memory Access
6.7.5 Page-Locked and Zero-Copy Memory
6.7.6 Unified Memory
6.7.7 Asynchronous Execution and Streams
6.7.7.1 Stream Synchronization: Events and Callbacks
6.8 Dynamic parallelism
6.9 Debugging CUDA programs
6.10 Profiling CUDA programs
6.11 CUDA and MPI
6.12 Case studies
6.12.1 Fractal Set Calculation
6.12.1.1 Version #1: One thread per pixel
6.12.1.2 Version #2: Pinned host and pitched device memory
6.12.1.3 Version #3: Multiple pixels per thread
6.12.1.4 Evaluation
6.12.2 Block Cipher Encryption
6.12.2.1 Version #1: The case of a standalone GPU machine
6.12.2.2 Version #2: Overlapping GPU communication and computation
6.12.2.3 Version #3: Using a cluster of GPU machines
6.12.2.4 Evaluation
Chapter 7: The Thrust template library
7.1 Introduction
7.2 First steps in Thrust
7.3 Working with Thrust datatypes
7.4 Thrust algorithms
7.4.1 Transformations
7.4.2 Sorting and searching
7.4.3 Reductions
7.4.4 Scans/prefix sums
7.4.5 Data management and manipulation
7.5 Fancy iterators
7.6 Switching device back ends
7.7 Case studies
7.7.1 Monte Carlo integration
7.7.2 DNA Sequence alignment
Exercises
Chapter 8: Load balancing
8.1 Introduction
8.2 Dynamic load balancing: the Linda legacy
8.3 Static Load Balancing: The Divisible Load Theory Approach
8.3.1 Modeling Costs
8.3.2 Communication Configuration
8.3.3 Analysis
8.3.3.1 N-Port, Block-Type, Single-Installment Solution
8.3.3.2 One-Port, Block-Type, Single-Installment Solution
8.3.4 Summary - Short Literature Review
8.4 DLTlib: A library for partitioning workloads
8.5 Case studies
8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing
8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing
Appendix A: Compiling Qt programs
A.1 Using an IDE
A.2 The qmake Utility
Appendix B: Running MPI programs
B.1 Preparatory Steps
B.2 Computing Nodes Discovery for MPI Program Deployment
B.2.1 Host Discovery with the nmap Utility
B.2.2 Automatic Generation of a Hostfile
Appendix C: Time measurement
C.1 Introduction
C.2 POSIX High-Resolution Timing
C.3 Timing in Qt
C.4 Timing in OpenMP
C.5 Timing in MPI
C.6 Timing in CUDA
Appendix D: Boost.MPI
D.1 Mapping from MPI C to Boost.MPI
Appendix E: Setting up CUDA
E.1 Installation
E.2 Issues with GCC
E.3 Running CUDA without an Nvidia GPU
E.4 Running CUDA on Optimus-Equipped Laptops
E.5 Combining CUDA with Third-Party Libraries
Appendix F: DLTlib
F.1 DLTlib Functions
F.1.1 Class Network: Generic Methods
F.1.2 Class Network: Query Processing
F.1.3 Class Network: Image Processing
F.1.4 Class Network: Image Registration
F.2 DLTlib Files
Glossary
Bibliography
Index
Notes:
Bibliographic Level Mode of Issuance: Monograph
Includes bibliographical references and index.
Description based on print version record.
ISBN:
9780124171374
0124171370
OCLC:
903404072
