LARGE LANGUAGE MODEL-BASED SOLUTIONS : how to deliver value with cost-effective generative AI applications / Shreyas Subramanian.
- Format:
- Book
- Author/Creator:
- Subramanian, Shreyas, author.
- Series:
- Tech Today Series
- Language:
- English
- Subjects (All):
- Natural language generation (Computer science).
- Artificial intelligence--Computer programs.
- Artificial intelligence.
- Physical Description:
- 1 online resource
- Edition:
- 1st ed.
- Place of Publication:
- Newark : John Wiley & Sons, Incorporated, 2024.
- Summary:
- Learn to build cost-effective apps using large language models. In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Shreyas Subramanian, Principal Data Scientist at Amazon Web Services, delivers a practical guide for developers and data scientists who wish to build and deploy cost-effective large language model (LLM)-based solutions. The book covers a wide range of key topics, including how to select a model, pre- and post-processing of data, prompt engineering, and instruction fine-tuning. The author sheds light on techniques for optimizing inference, such as model quantization and pruning, as well as affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents. You'll also find: effective strategies to address the high computational cost associated with LLMs; assistance with the complexities of building and deploying affordable generative AI apps, including tuning and inference techniques; and selection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models. Perfect for developers and data scientists interested in deploying foundational models, or business leaders planning to scale out their use of GenAI, Large Language Model-Based Solutions will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject.
- Contents:
- Cover
- Contents At A Glance
- Title Page
- Copyright Page
- Dedication Page
- About the Author
- About the Technical Editor
- Contents
- Introduction
- GenAI Applications and Large Language Models
- Importance of Cost Optimization
- Challenges and Opportunities
- Micro Case Studies
- OpenAI: Leading the Way
- Hugging Face: Open-Source Community Building
- Bloomberg GPT: LLMs in Large Commercial Institutions
- Who Is This Book For?
- Summary
- Chapter 1 Introduction
- Overview of GenAI Applications and Large Language Models
- The Rise of Large Language Models
- Neural Networks, Transformers, and Beyond
- GenAI vs. LLMs: What's the Difference?
- The Three-Layer GenAI Application Stack
- The Infrastructure Layer
- The Model Layer
- The Application Layer
- Paths to Productionizing GenAI Applications
- Sample LLM-Powered Chat Application
- The Importance of Cost Optimization
- Cost Assessment of the Model Inference Component
- Cost Assessment of the Vector Database Component
- Benchmarking Setup and Results
- Other Factors to Consider
- Cost Assessment of the Large Language Model Component
- Chapter 2 Tuning Techniques for Cost Optimization
- Fine-Tuning and Customizability
- Basic Scaling Laws You Should Know
- Parameter-Efficient Fine-Tuning Methods
- Adapters Under the Hood
- Prompt Tuning
- Prefix Tuning
- P-tuning
- IA3
- Low-Rank Adaptation
- Cost and Performance Implications of PEFT Methods
- Chapter 3 Inference Techniques for Cost Optimization
- Introduction to Inference Techniques
- Prompt Engineering
- Impact of Prompt Engineering on Cost
- Estimating Costs for Other Models
- Clear and Direct Prompts
- Adding Qualifying Words for Brief Responses
- Breaking Down the Request
- Example of Using Claude for PII Removal
- Conclusion
- Providing Context
- Examples of Providing Context
- RAG and Long Context Models
- Recent Work Comparing RAG with Long Content Models
- Context and Model Limitations
- Indicating a Desired Format
- Example of Formatted Extraction with Claude
- Trade-Off Between Verbosity and Clarity
- Caching with Vector Stores
- What Is a Vector Store?
- How to Implement Caching Using Vector Stores
- Chains for Long Documents
- What Is Chaining?
- Implementing Chains
- Example Use Case
- Common Components
- Tools That Implement Chains
- Comparing Results
- Summarization
- Summarization in the Context of Cost and Performance
- Efficiency in Data Processing
- Cost-Effective Storage
- Enhanced Downstream Applications
- Improved Cache Utilization
- Summarization as a Preprocessing Step
- Enhanced User Experience
- Batch Prompting for Efficient Inference
- Batch Inference
- Experimental Results
- Using the accelerate Library
- Using the DeepSpeed Library
- Batch Prompting
- Example of Using Batch Prompting
- Model Optimization Methods
- Quantization
- Code Example
- Recent Advancements: GPTQ
- Recap of PEFT Methods
- Cost and Performance Implications
- References
- Chapter 4 Model Selection and Alternatives
- Introduction to Model Selection
- Motivating Example: The Tale of Two Models
- The Role of Compact and Nimble Models
- Examples of Successful Smaller Models
- Quantization for Powerful but Smaller Models
- Text Generation with Mistral 7B
- Zephyr 7B and Aligned Smaller Models
- CogVLM for Language-Vision Multimodality
- Prometheus for Fine-Grained Text Evaluation
- Orca 2 and Teaching Smaller Models to Reason
- Breaking Traditional Scaling Laws with Gemini and Phi
- Phi 1, 1.5, and 2 B Models
- Gemini Models
- Domain-Specific Models
- Step 1 - Training Your Own Tokenizer
- Step 2 - Training Your Own Domain-Specific Model
- More References for Fine-Tuning
- Evaluating Domain-Specific Models vs. Generic Models
- The Power of Prompting with General-Purpose Models
- Chapter 5 Infrastructure and Deployment Tuning Strategies
- Introduction to Tuning Strategies
- Hardware Utilization and Batch Tuning
- Memory Occupancy
- Strategies to Fit Larger Models in Memory
- KV Caching
- PagedAttention
- How Does PagedAttention Work?
- Comparisons, Limitations, and Cost Considerations
- AlphaServe
- How Does AlphaServe Work?
- Impact of Batching
- Cost and Performance Considerations
- S3: Scheduling Sequences with Speculation
- How Does S3 Work?
- Performance and Cost
- Streaming LLMs with Attention Sinks
- Fixed to Sliding Window Attention
- Extending the Context Length
- Working with Infinite Length Context
- How Does StreamingLLM Work?
- Performance and Results
- Cost Considerations
- Batch Size Tuning
- Frameworks for Deployment Configuration Testing
- Cloud-Native Inference Frameworks
- Deep Dive into Serving Stack Choices
- Batching Options
- Options in DJL Serving
- High-Level Guidance for Selecting Serving Parameters
- Automatically Finding Good Inference Configurations
- Creating a Generic Template
- Defining a HPO Space
- Searching the Space for Optimal Configurations
- Results of Inference HPO
- Inference Acceleration Tools
- TensorRT and GPU Acceleration Tools
- CPU Acceleration Tools
- Monitoring and Observability
- LLMOps and Monitoring
- Why Is Monitoring Important for LLMs?
- Monitoring and Updating Guardrails
- Index
- EULA
- Notes:
- OCLC-licensed vendor bibliographic record.
- Description based on publisher supplied metadata and other sources.
- ISBN:
- 9781394240739
- 1394240732
- 9781394240746
- 1394240740
- OCLC:
- 1428895126