My Account Log in

3 options

Python web scraping : fetching data from the web / Katharine Jarmul, Richard Lawson.

EBSCOhost Academic eBook Collection (North America) Available online

View online

Ebook Central College Complete Available online

View online

O'Reilly Online Learning: Academic/Public Library Edition Available online

View online
Format:
Book
Author/Creator:
Jarmul, Katharine, author.
Lawson, Richard, author.
Language:
English
Subjects (All):
Python (Computer program language).
Computer programming.
Physical Description:
1 online resource (215 pages) : illustrations (some color)
Edition:
Second edition.
Place of Publication:
Birmingham : Packt, 2017.
System Details:
text file
Biography/History:
Jarmul Katharine: Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups who use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam)
Summary:
Successfully scrape data from any website with the power of Python 3.x About This Book A hands-on guide to web scraping using Python with solutions to real-world problems Create a number of different web scrapers in Python to extract information This book includes practical examples on using the popular and well-maintained libraries in Python for your web scraping needs Who This Book Is For This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principals involved. What You Will Learn Extract data from web pages with simple Python programming Build a concurrent crawler to process web pages in parallel Follow links to crawl a website Extract features from the HTML Cache downloaded HTML for reuse Compare concurrent models to determine the fastest crawler Find out how to parse JavaScript-dependent websites Interact with forms and sessions In Detail The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers. You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites. By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics. Style and approach This hands-on ...
Contents:
Cover
Credits
Copyright
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Python 3
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Scraping versus crawling
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawlers
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Using the requests library
Summary
Chapter 2: Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors and your Browser Console
XPath Selectors
LXML and Family Trees
Comparing performance
Scraping results
Overview of Scraping
Adding a scrape callback to the link crawler
Chapter 3: Caching Downloads
When to use caching?
Adding cache support to the link crawler
Disk Cache
Implementing DiskCache
Testing the cache
Saving disk space
Expiring stale data
Drawbacks of DiskCache
Key-value storage cache
What is key-value storage?
Installing Redis
Overview of Redis
Redis cache implementation
Compression
Exploring requests-cache
Chapter 4: Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementing a multithreaded crawler
Multiprocessing crawler
Performance.
[Python multiprocessing and the GIL]
Python multiprocessing and the GIL
Chapter 5: Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Debugging with Qt
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Selenium and Headless Browsers
Chapter 6: Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with Selenium
Chapter 7: Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical character recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
The 9kw CAPTCHA API
Reporting errors
Integrating with registration
CAPTCHAs and machine learning
Chapter 8: Scrapy
Installing Scrapy
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Different Spider Types
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Scrapy Performance Tuning
Visual scraping with Portia
Installation
Annotation
Running the Spider
Automated scraping with Scrapely
Chapter 9: Putting It All Together
Google search engine
Facebook
The website
Facebook API
Gap
BMW
Index.
Notes:
Includes index.
Description based on online resource; title from PDF title page (ebrary, viewed June 15, 2017).
OCLC:
990784837

The Penn Libraries is committed to describing library materials using current, accurate, and responsible language. If you discover outdated or inaccurate language, please fill out this feedback form to report it and suggest alternative language.

Find

Home Release notes

My Account

Shelf Request an item Bookmarks Fines and fees Settings

Guides

Using the Find catalog Using Articles+ Using your account