Handbook of neural networks for speech processing / Shigeru Katagiri, editor.
LIBRA TK7895.S65 H36 2000
- Format:
- Book
- Series:
- Artech House signal processing library
- Language:
- English
- Subjects (All):
- Automatic speech recognition--Handbooks, manuals, etc.
- Automatic speech recognition.
- Neural networks (Computer science)--Handbooks, manuals, etc.
- Neural networks (Computer science).
- Speech processing systems--Handbooks, manuals, etc.
- Speech processing systems.
- Genre:
- Handbooks and manuals.
- Physical Description:
- xxiii, 522 pages : illustrations ; 24 cm.
- Other Title:
- Neural networks for speech processing
- Place of Publication:
- Boston : Artech House, [2000]
- Summary:
- Comprehensive details on cutting-edge technologies that employ neural networks for speech recognition and speech processing in modern communications. Going far beyond the simple speech recognition technologies on the market today, this book, written by and for speech and signal processing engineers in industry, R&D, and academia, takes you to the forefront of emerging neural-network-based speech processing techniques.
- Contents:
- Part I Fundamentals
- 1.1 Speech Processing 3
- 1.2 Neural Networks 6
- 1.2.1 Fundamentals 6
- 1.2.2 Taxonomy of Neural Networks 8
- 1.2.2.2 Structure 9
- 1.2.2.3 Measurement 11
- 1.2.2.4 Objective Function 13
- 1.2.2.5 Optimization 14
- 1.3 Neural Networks for Speech Processing 15
- 1.4 Handbook Overview 15
- 1.4.1 Part I: Fundamentals 15
- 1.4.2 Part II: Current Issues in Speech Recognition 16
- 1.4.3 Part III: Current Issues in Speech Signal Processing 17
- 2 The Speech Signal and Its Production Model 19
- 2.2 Information Conveyed by Speech 21
- 2.2.1 Linguistic Information 21
- 2.2.1.1 Segmental Features 21
- 2.2.1.2 Suprasegmental Features 22
- 2.2.2 Paralinguistic Information 23
- 2.2.3 Nonlinguistic Information 27
- 2.2.3.1 Idiosyncratic Factors 27
- 2.2.3.2 Emotional Factors 29
- 2.2.4 Hierarchical Speech Production Processes 30
- 2.3 Physical and Physiological Processes in Speech Production 32
- 2.3.1 Respiration System 32
- 2.3.1.1 Normal Breathing Without Speech Production 32
- 2.3.1.2 Expiration in Speech Production 34
- 2.3.2 Phonatory System 34
- 2.3.2.1 Framework of the Larynx 34
- 2.3.2.2 Abduction versus Adduction 34
- 2.3.2.3 F0 Control During Speech 37
- 2.3.3 Articulatory System 39
- 2.3.3.1 Morphology of Articulators 39
- 2.3.3.2 Vowel Production 40
- 2.3.3.3 Consonant Production 41
- 2.4 Models and Theories of Speech Production 49
- 2.4.1 Laryngeal System 49
- 2.4.1.1 Vocal Fold Vibration 49
- 2.4.1.2 F0 Control in Running Speech Production 50
- 2.4.1.3 Vertical Movements of the Larynx 51
- 2.4.2 Dynamic Characteristic of Articulators 52
- 2.4.2.1 Models of Individual Articulators 52
- 2.4.2.2 Articulatory Models of Speech Production 54
- 3 Speech Recognition 63
- 3.2.1 Hearing and Machine Recognition 64
- 3.2.2 Recognition-Oriented Speech Feature Representation 66
- 3.2.2.1 Sound Spectrogram: Time-Frequency-Energy Representation 66
- 3.2.2.2 Acoustic Feature Vector 69
- 3.2.2.3 Static Versus Dynamic Nature 72
- 3.2.3 Variety of Recognition Tasks 73
- 3.2.4 Recognition Mechanism 73
- 3.2.4.1 Example Task Setting 73
- 3.2.4.2 Distance-Based Recognition 76
- 3.2.4.3 Distance Computation Based on Dynamic Time Warping 77
- 3.2.4.4 Remarks 80
- 3.3 Bayes Decision Theory 80
- 3.3.2 Maximum Likelihood Estimation Approach 83
- 3.3.3 Bayesian Approach 85
- 3.3.4 Discriminant Function Approach 86
- 3.3.4.1 Example Task and Decision Rule 86
- 3.3.4.2 Loss 88
- 3.3.4.3 Optimization 88
- 3.3.4.4 Design of Linear Discriminant Function Classifier 90
- 3.3.4.5 Remarks 91
- 3.4 Acoustic Feature Extraction 91
- 3.4.1 Filter-Bank 91
- 3.4.1.1 Artificial Cochlea Filter 93
- 3.4.1.2 Fourier-Transform-Based Filter 94
- 3.4.2 Autoregressive Modeling 94
- 3.4.3 Cepstrum Modeling 98
- 3.4.4 Dynamic Feature Modeling 101
- 3.5 Probabilistic Acoustic Modeling Based on Hidden Markov Model 103
- 3.5.1 Principles of Hidden Markov Model 103
- 3.5.2 Selection of Output Probability Function 106
- 3.5.2.1 Discrete Model 106
- 3.5.2.2 Continuous Model 106
- 3.5.3 MLE-Based Design Method 109
- 3.5.3.1 Forward-Backward Method 109
- 3.5.4 Trellis Algorithm and Viterbi Algorithm 111
- 3.5.5 Discriminative Design Methods 112
- 3.6 Language Modeling 112
- 3.6.1 Role of Language Modeling 112
- 3.6.2 N-Gram Language Modeling 114
- 3.7.2 Selection of Model Units 115
- 3.7.3 Open-Vocabulary Recognition 116
- 3.7.4 Bibliographical Remarks 117
- 4 Speech Coding 121
- 4.2 Attributes of Speech Coders 122
- 4.3 Basic Principles of Speech Coders 123
- 4.4 Quantization 126
- 4.4.1 Scalar Quantization 126
- 4.4.2 Vector Quantization 126
- 4.5 Linear Prediction 128
- 4.5.1 Linear Prediction Principles 128
- 4.5.2 Speech Coding Based on Linear Prediction 129
- 4.5.3 The Analysis-by-Synthesis Principle 131
- 4.5.4 Perceptual Filtering 135
- 4.5.5 Quantization of the Linear Prediction Coefficients 136
- 4.6 Sinusoidal Coding 141
- 4.7 Waveform Interpolation Methods 142
- 4.8 Subband Coding 143
- 4.9 Variable-Rate Coding 144
- 4.9.1 Basics 144
- 4.9.2 Phonetic Segmentation 145
- 4.9.3 Variable Rate Coders for ATM Networks 145
- 4.9.4 Voice over IP 146
- 4.10 Wideband Coders 146
- 4.11 Measuring Speech Coder Performance 147
- 4.12 Speech Coding over Noisy Channels 150
- 4.13 Speech Coding Standards 151
- Part II Current Issues in Speech Recognition
- 5 Discriminative Prototype-Based Methods for Speech Recognition 159
- 5.2 Bayes Decision Theory 161
- 5.2.1 The Bayes Decision Rule 161
- 5.2.2 Discriminant Functions 161
- 5.2.3 Discriminant Functions for Prototype-Based Methods 163
- 5.3 Example-Based Methods 165
- 5.3.1 Density Estimation 166
- 5.3.2 Estimation of Posterior Probabilities 167
- 5.3.3 The k-Nearest-Neighbor Method 168
- 5.3.3.1 The Nearest-Neighbor Rule 168
- 5.3.3.2 Error Bounds for k-Nearest-Neighbor Classification 168
- 5.3.4 Parzen Windows 168
- 5.3.5 Advantages and Limitations of Example-Based Methods 171
- 5.3.6 Smoothing 172
- 5.3.7 Applications to Speech Recognition 173
- 5.4 Prototype-Based Methods for Speech Recognition 173
- 5.5 Prototype-Based Classifier Design Using Minimum Classification Error 175
- 5.5.1 Definition of Discriminant Function 175
- 5.5.2 Definition of Misclassification Measure 176
- 5.5.3 Definition of Local Loss Function 176
- 5.5.4 Overall Loss Function and Optimization 177
- 5.5.5 Modified Newton's Method: The Quickprop Algorithm 178
- 5.5.6 Relation of MCE Loss to the Bayes Error 181
- 5.5.7 Choice of Smoothing Parameters for MCE-Based Optimization 182
- 5.6 Learning Vector Quantization 182
- 5.6.1 Shift-Tolerant LVQ for Speech Recognition 185
- 5.6.1.1 HMM Interpretation of STLVQ 187
- 5.6.1.2 Limitations and Strengths of STLVQ Architecture and Training 187
- 5.6.2 Expanding the Scope of LVQ for Speech Recognition: Incorporation into Hidden Markov Modeling 188
- 5.6.2.1 LVQ-HMM 189
- 5.6.2.2 HMM-LVQ 190
- 5.6.3 Minimum Classification Error Interpretation of LVQ 192
- 5.6.4 Smoothness of MCE Loss 193
- 5.6.5 LVQ Summary 195
- 5.7 Prototype-Based Methods Using Dynamic Programming 195
- 5.7.1 MCE-Trained Prototypes for DTW-Based Speech Recognition 196
- 5.7.1.1 Practical Implementation of MCE/GPD 198
- 5.7.1.2 MCE-DTW Results 199
- 5.7.2 Prototype-Based Minimum Error Classifier 199
- 5.7.2.1 PBMEC State Distance and Discriminant Function 200
- 5.7.2.2 MCE/GPD in the Context of Speech Recognition Using Phoneme Models 202
- 5.7.2.3 PBMEC Results 202
- 5.7.3 Summary of Prototype-Based Methods Using DP 203
- 5.8 Hidden Markov Model Design Based on MCE 204
- 5.8.1 HMM State Likelihood and Discriminant Function 206
- 5.8.2 MCE Misclassification Measure and Loss 207
- 5.8.3 Calculation of MCE Gradient for HMMs 207
- 5.8.3.1 Derivative of Loss with Respect to Misclassification Measure 207
- 5.8.3.2 Derivative of Misclassification Measure with Respect to Discriminant Functions 208
- 5.8.3.3 Derivative of Discriminant Function with Respect to Observation Probability Density Function 208
- 5.8.3.4 Derivative of Observation Probability with Respect to Mixing Weights 209
- 5.8.3.5 Derivative of Observation Probability with Respect to Mean Vectors 209
- 5.8.3.6 Derivative of Observation Probability with Respect to Covariances 210
- 5.8.3.7 Application of the Chain Rule 210
- 5.8.4 MCE-HMM Results 211
- 6 Recurrent Neural Networks for Speech Recognition 217
- 6.1.1 Background and Motivation 217
- 6.2 Speech Recognition Theory 220
- 6.3 Basics of Neural Networks 223
- 6.3.1 Parameter Estimation by Maximum Likelihood 224
- 6.3.2 Problem Classification 225
- 6.3.2.1 Regression 225
- 6.3.2.2 Classification 226
- 6.3.3 Neural Network Training 226
- 6.3.3.1 Gradient Descent Training 227
- 6.3.3.2 RPROP Training 228
- 6.3.3.3 ARPROP Training 229
- 6.3.4 Neural Network Architectures 230
- 6.3.4.1 Multilayer Perceptrons 232
- 6.3.4.2 Time-Delay Neural Networks 232
- 6.4 Recurrent Neural Networks 232
- 6.4.1 Unidirectional Recurrent Neural Network 233
- 6.4.1.1 RNN Architecture 233
- 6.4.1.2 RNN Training 235
- 6.4.2 Bidirectional Recurrent Neural Network 235
- 6.4.2.1 BRNN Architecture 235
- 6.4.2.2 BRNN Training 236
- 6.5 Modeling Phonetic Context 237
- 6.6 System Training and Usage 239
- 6.6.1 Training 239
- 6.6.2 Usage 240
- 6.7.1.1 Training Criterion 240
- 6.7.1.2 Discriminative Training 241
- 6.7.1.3 Distribution of Model Complexity 241
- 7 Time-Delay Neural Networks and NN/HMM Hybrids: A Family of Connectionist Continuous-Speech Recognition Systems 245
- 7.2 MS-TDNNs and NN/HMM Hybrid Approaches 246
- 7.2.1 The Time-Delay Neural Network (TDNN) 247
- 7.2.2 Multistate TDNN 249
- 7.2.3 MS-TDNN Variants 249
- 7.2.4 Hybrid NN/HMM Variants 249
- 7.3 Alphabet Recognition with the MS-TDNN 251
- 7.3.1 Training Procedures 251
- 7.3.2 Duration Modeling 254
- 7.3.3 Experiments 255
- 7.3.3.1 Speaker-Dependent Data 255
- 7.3.3.2 Speaker-Independent Data 255
- 7.3.3.3 Telephone Data 256
- 7.3.4 Searching in Large Name Lists 257
- 7.4 Multimodal Input: Lipreading 260
- 7.4.1 Motivation 260
- 7.4.2 The Recognizer 260
- 7.4.3 Results 264
- 7.5 Modular Neural Networks 265
- 7.5.1 Architecture 266
- 7.5.2 Application to NN/HMM Models 267
- 7.5.3 Experiments with a Hybrid HME/HMM System 268
- 7.6 Context Modeling 269
- 7.6.1 Clustering Context Classes 269
- 7.6.2 Factoring Context-Dependent Posteriors 270
- 7.6.3 Hierarchies of Neural Networks 272
- 7.6.3.1 Manually Structured Hierarchies 273
- 7.6.3.2 Clustering Hierarchies of Neural Networks 274
- 7.6.4 Experiments and Results 275
- 8 Probability-Oriented Neural Networks and Hybrid Connectionist/Stochastic Networks 281
- 8.2 Fundamentals of Probability-Oriented Neural Networks 282
- 8.2.1 The Bayes Decision Framework 282
- 8.2.2 Training Procedures 283
- 8.2.3 Types of PONNs 284
- 8.2.3.1 Radial Basis Function Networks 284
- 8.2.3.2 Probabilistic Neural Networks 286
- 8.3 Learning Methods for PNNs 289
- 8.3.1 Position of the Problem 289
- 8.3.2 MLE and EM Algorithms for PNNs 290
- 8.3.3 MMIE for PNNs 290
- 8.4 Applications to Automatic Speech Recognition 291
- 8.4.1 Speech Recognition 291
- 8.4.2 Speaker Recognition 291
- 8.5 Hybrid Connectionist/Stochastic Models 293
- 8.5.1 Position of the Problem 293
- 8.5.2 Proposed Solutions 293
- 8.5.2.1 ANNs as Front-Ends for HMMs 294
- 8.5.2.2 ANNs as Postprocessors of HMMs 296
- 8.5.2.3 Unified Models 297
- 9 Minimum Classification Error Networks 307
- 9.1.1 Speech Pattern Recognition Using Modular Systems 307
- 9.1.2 Classifier Design 309
- 9.1.3 What Is an Artificial Neural Network? 310
- 9.1.4 Minimum Recognition Error Network 311
- 9.2 Discriminative Pattern Classification 312
- 9.2.1 Bayes Decision Theory 312
- 9.2.2 Minimum Error Rate Classification 314
- 9.2.3 Discriminative Training 314
- 9.3 Generalized Probabilistic Descent Method 317
- 9.3.2 Formalization Fundamentals 319
- 9.3.2.1 Distance Classifier for Classifying Dynamic Patterns: Preparation 319
- 9.3.2.2 Emulation of Decision Process 321
- 9.3.2.3 Selection of Loss Functions 324
- 9.3.2.4 Design Optimality in Practical Situations 326
- 9.3.3 GPD-Based Classifier Design 327
- 9.3.3.1 E-Set Task 327
- 9.3.3.2 P-Set Task 328
- 9.4 Derivatives of GPD 328
- 9.4.2 Segmental GPD for Continuous Speech Recognition 329
- 9.4.3 Minimum Error Training for Open-Vocabulary Speech Recognition 331
- 9.4.3.1 Open-Vocabulary Speech Recognition 331
- 9.4.3.2 Minimum Spotting Error Learning 332
- 9.4.3.3 Discriminative Utterance Verification 334
- 9.4.4 Discriminative Feature Extraction 336
- 9.4.4.1 Fundamentals 336
- 9.4.4.2 An Example Implementation for Cepstrum-Based Speech Recognition 338
- 9.4.4.3 Discriminative Metric Design 339
- 9.4.4.4 Minimum Error Learning Subspace Method 339
- 9.4.5 Speaker Recognition Using GPD 340
- Appendix 1 Probabilistic Descent Theorem for Probability-Based Discriminant Functions 347
- Appendix 2 Relationships Between MCE/GPD and Others 349
- Part III Current Issues in Speech Signal Processing
- 10 Networks for Speaker Recognition 357
- 10.2 Speaker Recognition Overview 359
- 10.3 Discriminative Information 363
- 10.3.1 Supervised Training 363
- 10.3.2 Cohort Normalization 365
- 10.4 Speaker Recognition Networks 366
- 10.4.1 Multilayer Perceptron 366
- 10.4.2 Radial Basis Functions 369
- 10.4.3 Time-Delay Neural Networks 370
- 10.4.4 Recurrent Neural Networks 371
- 10.4.5 Learning Vector Quantization 372
- 10.4.6 Decision Trees 372
- 10.4.7 Neural Tree Network 376
- 10.4.8 Performance Summary 377
- 10.5 Model Combination 377
- 10.5.1 Model Combination Approaches 381
- 10.5.1.1 Linear Opinion Pool 381
- 10.5.1.2 Log Opinion Pool 381
- 10.5.1.3 Voting Methods 381
- 10.5.2 Error Correlation Analysis 382
- 10.5.3 Two-Model Combination 383
- 10.5.4 Three-Model Combination 384
- 11 Neural Networks for Voice Conversion 393
- 11.1 Introduction: Speech and Speaker Characteristics 393
- 11.2 Studies in Voice Conversion 401
- 11.3 Neural Networks for Transformation of Vocal Tract Shapes 405
- 11.3.1 Linear Approximation of Formant Transformation 406
- 11.3.2 Neural Network Models 410
- 11.4 Implementation of Voice Conversion 420
- 11.4.1 Voice Transformation System 420
- 11.4.2 Normalization of Intonational Features 424
- 11.4.3 Evaluation of Voice Transformation 426
- 12 Neural Networks for Speech Coding 433
- 12.2 Source Coding and Neural Networks 433
- 12.2.1 Source Coding 433
- 12.2.2 Neural Networks 435
- 12.2.3 Source Coding with Neural Networks 436
- 12.2.3.1 Vector Quantization with Kohonen Self-Organizing Feature Maps 437
- 12.2.3.2 Multilayer Neural Network as Front-End of a Coder 437
- 12.2.3.3 Codebook-Excited Neural Networks 437
- 12.3 Quantization Performance of Neural Networks 438
- 12.3.1 Kohonen Self-Organizing Feature Maps 438
- 12.3.1.1 Architecture and Training Process 441
- 12.3.1.2 Conditional Histogram Neural Network FSVQ 443
- 12.3.1.3 Nearest-Neighbor Neural Network FSVQ 444
- 12.3.1.4 Simulations 445
- 12.3.2 Coders with Neural Network Front-Ends 447
- 12.3.3 Codebook-Excited Neural Networks 451
- 12.4 Speech Coding with Neural Networks 453
- 12.4.1 Coding Speech Spectrum with Neural Networks 453
- 12.4.2 Nonlinear Prediction Speech Coding 454
- 12.4.2.1 A Neural Model of Nonlinear Prediction 455
- 12.4.2.2 Nonlinear Predictive Vector Quantization 456
- 12.4.2.3 Nonlinear Predictive Quantization Performance 458
- 12.4.3 Code-Excited Nonlinear Predictive Speech Coding 461
- 12.4.3.1 Nonlinear Predictive Filter Tolerance for an Excitation Disturbance 461
- 12.4.3.2 Gain-Adaptive Nonlinear Predictive Coding 463
- 12.4.3.3 Coding Performance 464
- 13 Networks for Speech Enhancement 471
- 13.1.2 Model Structure 472
- 13.2 Neural Time-Domain Filtering Methods 474
- 13.2.1 Direct Time-Domain Mapping 474
- 13.2.2 Extended Kalman Filtering with Predictive Models 478
- 13.3 Neural Transform-Domain Methods 481
- 13.3.1 Spectral Subtraction 482
- 13.3.2 Neural Transform-Domain Mappings 484
- 13.4 State-Dependent Model Switching Methods 488
- 13.4.1 Classification Switched Models 488
- 13.4.2 Hybrid HMM and EKF 489
- 13.5 Online Iterative Methods 490
- 13.5.1 Online Predictive Enhancement 491
- 13.5.2 Maximum-Likelihood Estimation and Dual Kalman Filtering 492
- 13.5.3 Noise-Regularized Adaptive Filtering 496
- Notes:
- Includes bibliographical references and index.
- Local Notes:
- Acquired for the Penn Libraries with assistance from the Anne and Joseph Trachtman Memorial Book Fund.
- ISBN:
- 0890069549
- OCLC:
- 44446874