LARGE LANGUAGE MODEL-BASED SOLUTIONS : how to deliver value with cost-effective generative AI applications / Shreyas Subramanian.

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Subramanian, Shreyas, author.
Series:
Tech Today Series
Language:
English
Subjects (All):
Natural language generation (Computer science).
Artificial intelligence--Computer programs.
Artificial intelligence.
Physical Description:
1 online resource
Edition:
1st ed.
Place of Publication:
Newark : John Wiley & Sons, Incorporated, 2024.
Summary:
Learn to build cost-effective apps using Large Language Models
In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Principal Data Scientist at Amazon Web Services, Shreyas Subramanian, delivers a practical guide for developers and data scientists who wish to build and deploy cost-effective large language model (LLM)-based solutions. In the book, you'll find coverage of a wide range of key topics, including how to select a model, pre- and post-processing of data, prompt engineering, and instruction fine tuning. The author sheds light on techniques for optimizing inference, like model quantization and pruning, as well as different and affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents.
You'll also find:
Effective strategies to address the challenge of the high computational cost associated with LLMs
Assistance with the complexities of building and deploying affordable generative AI apps, including tuning and inference techniques
Selection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models
Perfect for developers and data scientists interested in deploying foundational models, or business leaders planning to scale out their use of GenAI, Large Language Model-Based Solutions will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject.
Contents:
Cover
Contents At A Glance
Title Page
Copyright Page
Dedication Page
About the Author
About the Technical Editor
Contents
Introduction
GenAI Applications and Large Language Models
Importance of Cost Optimization
Challenges and Opportunities
Micro Case Studies
OpenAI: Leading the Way
Hugging Face: Open-Source Community Building
Bloomberg GPT: LLMs in Large Commercial Institutions
Who Is This Book For?
Summary
Chapter 1 Introduction
Overview of GenAI Applications and Large Language Models
The Rise of Large Language Models
Neural Networks, Transformers, and Beyond
GenAI vs. LLMs: What's the Difference?
The Three-Layer GenAI Application Stack
The Infrastructure Layer
The Model Layer
The Application Layer
Paths to Productionizing GenAI Applications
Sample LLM-Powered Chat Application
The Importance of Cost Optimization
Cost Assessment of the Model Inference Component
Cost Assessment of the Vector Database Component
Benchmarking Setup and Results
Other Factors to Consider
Cost Assessment of the Large Language Model Component
Chapter 2 Tuning Techniques for Cost Optimization
Fine-Tuning and Customizability
Basic Scaling Laws You Should Know
Parameter-Efficient Fine-Tuning Methods
Adapters Under the Hood
Prompt Tuning
Prefix Tuning
P-tuning
IA3
Low-Rank Adaptation
Cost and Performance Implications of PEFT Methods
Chapter 3 Inference Techniques for Cost Optimization
Introduction to Inference Techniques
Prompt Engineering
Impact of Prompt Engineering on Cost
Estimating Costs for Other Models
Clear and Direct Prompts
Adding Qualifying Words for Brief Responses
Breaking Down the Request
Example of Using Claude for PII Removal
Conclusion
Providing Context
Examples of Providing Context
RAG and Long Context Models
Recent Work Comparing RAG with Long Content Models
Context and Model Limitations
Indicating a Desired Format
Example of Formatted Extraction with Claude
Trade-Off Between Verbosity and Clarity
Caching with Vector Stores
What Is a Vector Store?
How to Implement Caching Using Vector Stores
Chains for Long Documents
What Is Chaining?
Implementing Chains
Example Use Case
Common Components
Tools That Implement Chains
Comparing Results
Summarization
Summarization in the Context of Cost and Performance
Efficiency in Data Processing
Cost-Effective Storage
Enhanced Downstream Applications
Improved Cache Utilization
Summarization as a Preprocessing Step
Enhanced User Experience
Batch Prompting for Efficient Inference
Batch Inference
Experimental Results
Using the accelerate Library
Using the DeepSpeed Library
Batch Prompting
Example of Using Batch Prompting
Model Optimization Methods
Quantization
Code Example
Recent Advancements: GPTQ
Recap of PEFT Methods
Cost and Performance Implications
References
Chapter 4 Model Selection and Alternatives
Introduction to Model Selection
Motivating Example: The Tale of Two Models
The Role of Compact and Nimble Models
Examples of Successful Smaller Models
Quantization for Powerful but Smaller Models
Text Generation with Mistral 7B
Zephyr 7B and Aligned Smaller Models
CogVLM for Language-Vision Multimodality
Prometheus for Fine-Grained Text Evaluation
Orca 2 and Teaching Smaller Models to Reason
Breaking Traditional Scaling Laws with Gemini and Phi
Phi 1, 1.5, and 2 B Models
Gemini Models
Domain-Specific Models
Step 1 - Training Your Own Tokenizer
Step 2 - Training Your Own Domain-Specific Model
More References for Fine-Tuning
Evaluating Domain-Specific Models vs. Generic Models
The Power of Prompting with General-Purpose Models
Chapter 5 Infrastructure and Deployment Tuning Strategies
Introduction to Tuning Strategies
Hardware Utilization and Batch Tuning
Memory Occupancy
Strategies to Fit Larger Models in Memory
KV Caching
PagedAttention
How Does PagedAttention Work?
Comparisons, Limitations, and Cost Considerations
AlphaServe
How Does AlphaServe Work?
Impact of Batching
Cost and Performance Considerations
S3: Scheduling Sequences with Speculation
How Does S3 Work?
Performance and Cost
Streaming LLMs with Attention Sinks
Fixed to Sliding Window Attention
Extending the Context Length
Working with Infinite Length Context
How Does StreamingLLM Work?
Performance and Results
Cost Considerations
Batch Size Tuning
Frameworks for Deployment Configuration Testing
Cloud-Native Inference Frameworks
Deep Dive into Serving Stack Choices
Batching Options
Options in DJL Serving
High-Level Guidance for Selecting Serving Parameters
Automatically Finding Good Inference Configurations
Creating a Generic Template
Defining a HPO Space
Searching the Space for Optimal Configurations
Results of Inference HPO
Inference Acceleration Tools
TensorRT and GPU Acceleration Tools
CPU Acceleration Tools
Monitoring and Observability
LLMOps and Monitoring
Why Is Monitoring Important for LLMs?
Monitoring and Updating Guardrails
Index
EULA
Notes:
OCLC-licensed vendor bibliographic record.
Description based on publisher supplied metadata and other sources.
ISBN:
9781394240739
1394240732
9781394240746
1394240740
OCLC:
1428895126
