Learn OpenAI Whisper : Transform Your Understanding of GenAI Through Robust and Accurate Speech Processing Solutions / Josué R. Batista and Christopher Papile.

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Batista, Josué R., author.
Papile, Christopher, author.
Language:
English
Subjects (All):
Artificial intelligence.
Automatic speech recognition.
Natural language processing (Computer science).
Physical Description:
1 online resource (372 pages)
Edition:
First edition.
Place of Publication:
Birmingham, England : Packt Publishing, [2024]
Summary:
As the field of generative AI evolves, so does the demand for intelligent systems that can understand human speech. Navigating the complexities of automatic speech recognition (ASR) technology is a significant challenge for many professionals. This book offers a comprehensive solution that guides you through OpenAI's advanced ASR system. You’ll begin your journey with Whisper's foundational concepts, gradually progressing to its sophisticated functionalities. Next, you’ll explore the transformer model, understand its multilingual capabilities, and grasp training techniques using weak supervision. The book helps you customize Whisper for different contexts and optimize its performance for specific needs. You’ll also focus on the vast potential of Whisper in real-world scenarios, including its transcription services, voice-based search, and the ability to enhance customer engagement. Advanced chapters delve into voice synthesis and diarization while addressing ethical considerations. By the end of this book, you'll have an understanding of ASR technology and have the skills to implement Whisper. Moreover, Python coding examples will equip you to apply ASR technologies in your projects as well as prepare you to tackle challenges and seize opportunities in the rapidly evolving world of voice recognition and processing.
Contents:
Cover
Title Page
Copyright and Credits
Foreword
Contributors
Table of Contents
Preface
Part 1: Introducing OpenAI's Whisper
Chapter 1: Unveiling Whisper - Introducing OpenAI's Whisper
Technical requirements
Deconstructing OpenAI's Whisper
The marvel of human vocalization - Understanding voice and speech
Understanding the intricacies of speech recognition
OpenAI's Whisper - A technological parallel
The evolution of speech recognition and the emergence of OpenAI's Whisper
Exploring key features and capabilities of Whisper
Speech-to-text conversion
Translation capabilities
Support for diverse file formats
Ease of use
Multilingual capabilities
Large input handling
Prompts for specialized vocabularies
Integration with GPT models
Fine-tunability
Voice synthesis
Speech diarization
Setting up Whisper
Using Whisper via Hugging Face's web interface
Using Whisper via Google Colaboratory
Expanding on the basic usage of Whisper
Summary
Chapter 2: Understanding the Core Mechanisms of Whisper
Delving deeper into ASR systems
Definition and purpose of ASR systems
ASR in the real world
Brief history and evolution of ASR technology
The early days - Pattern recognition approaches
Statistical approaches emerge - Hidden Markov models and n-gram models
The deep learning breakthrough
Ongoing innovations
Exploring the Whisper ASR system
Understanding the trade-offs - End-to-end versus hybrid models
Combining connectionist temporal classification and transformer models in Whisper
The role of linguistic knowledge in Whisper
Understanding Whisper's components and functions
Audio input and preprocessing
Acoustic modeling
Language modeling
Decoding
Postprocessing
Applying best practices for performance optimization
Understanding compute requirements
Optimizing the deployment targets
Managing data flows
Monitoring metrics and optimization
Part 2: Underlying Architecture
Chapter 3: Diving into the Whisper Architecture
Understanding the transformer model in Whisper
Introducing the transformer model
Examining the role of the transformer model in Whisper
Deciphering the encoder-decoder mechanics
Exploring the multitasking and multilingual capabilities of Whisper
Assessing Whisper's ability to handle multiple tasks
Exploring Whisper's multilingual capabilities deeper
Appreciating the importance of multitasking and multilingual capabilities in ASR systems
Training Whisper with weak supervision on large-scale data
Introducing weak supervision
Understanding the role of weak supervision in training Whisper
Recognizing the benefits of using large-scale data for training
Gaining insights into data, annotation, and model training
Understanding the importance of data selection and annotation
Learning how data is utilized in training Whisper
Exploring the process of model training in Whisper
Integrating Whisper with other OpenAI technologies
Understanding the synergies between AI models
Learning how integration augments Whisper's capabilities
Examining examples of applications that benefit from integration with Whisper
Chapter 4: Fine-Tuning Whisper for Domain and Language Specificity
Introducing the fine-tuning process for Whisper
Leveraging the Whisper checkpoints
Milestone 1 - Preparing the environment and data for fine-tuning
Leveraging GPU acceleration
Installing the appropriate Python libraries
Milestone 2 - Incorporating the Common Voice 11 dataset
Expanding language coverage
Improving translation capabilities
Milestone 3 - Setting up Whisper pipeline components
Loading WhisperTokenizer
Milestone 4 - Transforming raw speech data into Mel spectrogram features
Combining to create a WhisperProcessor class
Milestone 5 - Defining training parameters and hardware configurations
Setting up the data collator
Milestone 6 - Establishing standardized test sets and metrics for performance benchmarking
Loading a pre-trained model checkpoint
Defining training arguments
Milestone 7 - Executing the training loops
Milestone 8 - Evaluating performance across datasets
Mitigating demographic biases
Optimizing for content domains
Managing user expectations
Milestone 9 - Building applications that demonstrate customized speech recognition
Part 3: Real-world Applications and Use Cases
Chapter 5: Applying Whisper in Various Contexts
Exploring transcription services
Understanding the role of Whisper in transcription services
Setting up Whisper for transcription tasks
Transcribing audio files with Whisper efficiently
Integrating Whisper into voice assistants and chatbots
Recognizing the potential of Whisper in voice assistants and chatbots
Integrating Whisper into chatbot architectures
Quantizing Whisper for chatbot efficiency and user experience
Enhancing accessibility features with Whisper
Identifying the need for Whisper in accessibility tools
Building an interactive image-to-text application with Whisper
Chapter 6: Expanding Applications with Whisper
Transcribing with precision
Leveraging Whisper for multilingual transcription
Indexing content for enhanced discoverability
Leveraging FeedParser and Whisper to create searchable text
Enhancing interactions and learning with Whisper
Challenges of implementing real-time ASR using Whisper
Implementing Whisper in customer service
Advancing language learning with Whisper
Optimizing the environment to deploy ASR solutions built using Whisper
Introducing OpenVINO
Applying OpenVINO Model Optimizer to Whisper
Generating video subtitles using Whisper and OpenVINO
Chapter 7: Exploring Advanced Voice Capabilities
Leveraging the power of quantization
Quantizing Whisper with CTranslate2 and running inference with Faster-Whisper
Quantizing Distil-Whisper with OpenVINO
Facing the challenges and opportunities of real-time speech recognition
Building a real-time ASR demo with Hugging Face Whisper
Chapter 8: Diarizing Speech with WhisperX and NVIDIA's NeMo
Augmenting Whisper with speaker diarization
Understanding the limitations and constraints of diarization
Bringing transformers into speech diarization
Introducing NVIDIA's NeMo framework
Integrating Whisper and NeMo
An introduction to speaker embeddings
Differentiating NVIDIA's NeMo capabilities
Performing hands-on speech diarization
Setting up the environment
Streamlining the diarization workflow with helper functions
Separating music from speech using Demucs
Transcribing audio using WhisperX
Aligning the transcription with the original audio using Wav2Vec2
Using NeMo's MSDD model for speaker diarization
Mapping speakers to sentences according to timestamps
Enhancing speaker attribution with punctuation-based realignment
Finalizing the diarization process
Chapter 9: Harnessing Whisper for Personalized Voice Synthesis
Technical requirements
Understanding text-to-speech in voice synthesis
Introducing TorToiSe-TTS-Fast
Using Audacity for audio processing
Running the notebook with TorToiSe-TTS-Fast
PVS step 1 - Converting audio files into LJSpeech format
PVS step 2 - Fine-tuning a PVS model with the DLAS toolkit
PVS step 3 - Synthesizing speech using a fine-tuned PVS model
Chapter 10: Shaping the Future with Whisper
Anticipating future trends, features, and enhancements
Improving accuracy and robustness
Expanding language support in OpenAI Whisper
Achieving better punctuation, formatting, and speaker diarization in OpenAI Whisper
Accelerating performance and enabling real-time capabilities in OpenAI Whisper
Enhancing Whisper's integration with other AI systems
Considering ethical implications
Ensuring fairness and mitigating bias in ASR
Protecting privacy and data
Establishing guidelines and safeguards for responsible use
Preparing for the evolving ASR and voice technologies landscape
Embracing emerging architectures and training techniques
Preparing for multimodal interfaces and textless NLP
Index
Other Books You May Enjoy
Notes:
Description based upon print version of record.
Description based on publisher supplied metadata and other sources.
ISBN:
9781835087497
1835087493
OCLC:
1435751064
