Pentaho data integration beginner's guide / Maria Carina Roldan.
Available online: EBSCOhost Academic eBook Collection (North America); O'Reilly Online Learning: Academic/Public Library Edition
- Format:
- Book
- Author/Creator:
- Roldan, Maria C.
- Language:
- English
- Subjects (All):
- Data integration (Computer science).
- Database management--Computer programs.
- Database management.
- Physical Description:
- 1 online resource (502 pages) : illustrations
- Edition:
- Second edition.
- Place of Publication:
- Birmingham : Packt Publishing, 2013.
- Language Note:
- English
- System Details:
- text file
- Summary:
- Extract, Transform, and Load (ETL) is the essence of data integration, and this book shows you how to achieve it quickly and efficiently using Pentaho Data Integration. A hands-on guide that you’ll find an indispensable time-saver. Manipulate your data by exploring, transforming, validating, and integrating it. Learn to migrate data between applications. Explore several features of Pentaho Data Integration 5.0. Connect to any database engine, explore the databases, and perform all kinds of operations on databases. In detail: Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs or an investment in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide, Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration. The book starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to get involved with the tool gradually. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce everything you have learned by implementing a simple datamart. With "Pentaho Data Integration Beginner's Guide, Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
- Contents:
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Table of Contents
- Preface
- Chapter 1: Getting Started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Exploring the Pentaho Demo
- Pentaho Data Integration
- Using PDI in real world scenarios
- Loading data warehouses or datamarts
- Integrating data
- Data cleansing
- Migrating information
- Exporting data
- Integrating PDI along with other Pentaho tools
- Installing PDI
- Time for action - installing PDI
- Launching the PDI graphical designer - Spoon
- Time for action - starting and customizing Spoon
- Spoon
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
- Creating your first transformation
- Time for action - creating a hello world transformation
- Directing Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
- Installing MySQL
- Time for action - installing MySQL on Windows
- Time for action - installing MySQL on Ubuntu
- Summary
- Chapter 2: Getting Started with Transformations
- Designing and previewing transformations
- Time for action - creating a simple transformation and getting familiar with the design process
- Getting familiar with editing features
- Using the mouse-over assistance toolbar
- Working with grids
- Understanding the Kettle rowset
- Looking at the results in the Execution Results pane
- The Logging tab
- The Step Metrics tab
- Running transformations in an interactive fashion
- Time for action - generating a range of dates and inspecting the data as it is being created
- Adding or modifying fields by using different PDI steps
- The Select values step
- Getting fields
- Date fields
- Handling errors
- Time for action - avoiding errors while converting the estimated time from string to integer
- The error handling functionality
- Time for action - configuring the error handling to see the description of the errors
- Personalizing the error handling
- Chapter 3: Manipulating Real-world Data
- Reading data from files
- Time for action - reading results of football matches from files
- Input files
- Input steps
- Reading several files at once
- Time for action - reading all your files at a time using a single text file input step
- Time for action - reading all your files at a time using a single text file input step and regular expressions
- Regular expressions
- Troubleshooting reading files
- Sending data to files
- Time for action - sending the results of matches to a plain file
- Output files
- Output steps
- Getting system information
- Time for action - reading and writing matches files with flexibility
- The Get System Info step
- Running transformations from a terminal window
- Time for action - running the matches transformation from a terminal window
- XML files
- Time for action - getting data from an XML file with information about countries
- What is XML?
- PDI transformation files
- Getting data from XML files
- XPath
- Configuring the Get data from XML step
- Kettle variables
- How and when you can use variables
- Chapter 4: Filtering, Searching, and Performing Other Useful Operations with Data
- Sorting data
- Time for action - sorting information about matches with the Sort rows step
- Calculations on groups of rows
- Time for action - calculating football match statistics by grouping data
- Group by Step
- Numeric fields
- Filtering
- Time for action - counting frequent words by filtering
- Time for action - refining the counting task by filtering even more
- Filtering rows using the Filter rows step
- Looking up data
- Time for action - finding out which language people speak
- The Stream lookup step
- Data cleaning
- Time for action - fixing words before counting them
- Cleansing data with PDI
- Chapter 5: Controlling the Flow of Data
- Splitting streams
- Time for action - Browsing new features of PDI by copying a dataset
- Copying rows
- Distributing rows
- Time for action - Assigning tasks by distributing
- Splitting the stream based on conditions
- Time for action - Assigning tasks by filtering priorities with the Filter rows step
- PDI steps for splitting the stream based on conditions
- Time for action - Assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
- Time for action - Gathering progress and merging it all together
- PDI options for merging streams
- Time for action - Giving priority to Bouchard by using the Append Stream
- Treating invalid data by splitting and merging streams
- Time for action - Treating errors in the estimated time to avoid discarding rows
- Treating rows with invalid data
- Chapter 6: Transforming Your Data by Coding
- Doing simple tasks with the JavaScript step
- Time for action - counting frequent words by coding in JavaScript
- Using the JavaScript language in PDI
- Inserting JavaScript code using the Modified JavaScript Value Step
- Adding fields
- Modifying fields
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
- Time for action - changing a list of house descriptions with JavaScript
- Looping over the dataset rows
- Doing simple tasks with the Java Class step
- Time for action - counting frequent words by coding in Java
- Using the Java language in PDI
- Inserting Java code using the User Defined Java Class step
- Sending rows to the next step
- Data types equivalence
- Testing the Java Class using the Test class button
- Transforming the dataset with Java
- Time for action - splitting the field to rows using Java
- Avoiding coding by using purpose built steps
- Chapter 7: Transforming the Rowset
- Converting rows to columns
- Time for action - enhancing the films file by converting rows to columns
- Converting row data to column data by using the Row Denormaliser step
- Aggregating data with a Row Denormaliser step
- Time for action - aggregating football matches data with the Row Denormaliser step
- Using Row Denormaliser for aggregating data
- Normalizing data
- Time for action - enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normaliser step
- Summarizing the PDI steps that operate on sets of rows
- Generating a custom time dimension dataset by using Kettle variables
- Time for action - creating the time dimension dataset
- Getting variables
- Time for action - parameterizing the start and end date of the time dimension dataset
- Using the Get Variables step
- Chapter 8: Working with Databases
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
- Time for action - creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Exploring the Steel Wheels database
- Time for action - exploring the sample database
- A brief word about SQL
- Exploring any configured database with the database explorer
- Querying a database
- Time for action - getting data about shipped orders
- Getting data from the database with the Table input step
- Using the SELECT statement for generating a new dataset
- Making flexible queries using parameters
- Time for action - getting orders in a range of dates using parameters
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
- Time for action - getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries
- Sending data to a database
- Time for action - loading a table with a list of manufacturers
- Inserting new data into a database table with the Table output step
- Inserting or updating data by using other PDI steps
- Time for action - inserting new products or updating existent ones
- Time for action - testing the update of existent products
- Inserting or updating with the Insert/Update Step
- Eliminating data from a database
- Time for action - deleting data about discontinued items
- Deleting records of a database table with the Delete step
- Chapter 9: Performing Advanced Operations with Databases
- Preparing the environment
- Time for action - populating the Jigsaw database
- Exploring the Jigsaw database model
- Looking up data in a database
- Doing simple lookups
- Time for action - using a Database lookup step to create a list of products to buy
- Looking up values in a database with the Database lookup step
- Performing complex lookups
- Time for action - using a Database join step to create a list of suggested products to buy
- Joining data from the database to the stream data by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
- Time for action - loading a region dimension with a Combination lookup/update step
- Time for action - testing the transformation that loads the region dimension
- Describing data with dimensions
- Loading Type I SCD with a Combination lookup/update step
- Notes:
- Includes index.
- Description based on online resource; title from PDF title page (ebrary, viewed November 23, 2013).
- ISBN:
- 9781782165057
- 1782165053
- OCLC:
- 862050189