Scraping Multiple Pages in Python - Charlotte J. Lloyd
Scraping Multiple Pages in Python
WORKSHOP 3 | CREATOR: CHARLOTTE LLOYD
Outline
I. Recap
II. Workshop Example
III. Verify Data
IV. Celebration, Back-slapping
Recap
PART I
Three Major Ways to Use Python
1. Command Line
2. “IDE”
3. Notebook
Scraping Process // Battle Plan
1. Surveillance
  • Evaluate the page, learn the terrain.
2. Plan of Attack
  • Brainstorm ways to approach the enemy.
3. Write code
  • Be willing to change your strategy if you encounter obstacles or see another “weakness” to exploit.
4. Emerge bloodied, yet victorious.
  • Verify the data before all that syntax evaporates from your short-term memory.
Workshop Example
PART II
GOAL
• http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voters
• Scrape all information about all voters
• Scrape “film details” (except “featuring”) for all films chosen by voters in their “top ten”
• Save data as 2 different CSV files
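A minimal sketch of the last goal, assuming the scraped records are held as lists of dictionaries. The column names are guesses based on the fields named in the plan of attack, and the helpers `rows_to_csv_text` and `save_csv` are hypothetical names, not taken from the workshop code.

```python
import csv
import io

# Assumed column layouts; the ID columns are an illustrative addition.
VOTER_FIELDS = ["voter_id", "name", "type", "info", "country"]
FILM_FIELDS = ["film_id", "director", "country", "year", "genre", "type", "category"]


def rows_to_csv_text(fieldnames, rows):
    """Render a list of dicts as CSV text with a header row.

    Keys missing from a row are written as empty cells.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def save_csv(path, fieldnames, rows):
    """Write the rendered CSV text to disk as UTF-8."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(rows_to_csv_text(fieldnames, rows))
```

With lists named `voters` and `films` in hand, the two files could then be produced with `save_csv("voters.csv", VOTER_FIELDS, voters)` and `save_csv("films.csv", FILM_FIELDS, films)`; the file names are placeholders.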
1. Surveillance
• Voter: http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voter/94
  • special case: http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voter/6
• Film: http://www.bfi.org.uk/films-tv-people/4ce2b6a7a801b
  • special case: http://www.bfi.org.uk/films-tv-people/4ce2b8bb6b693
  • special case: http://www.bfi.org.uk/films-tv-people/4ce2b7d2993a2
2. Plan of Attack: Voters
• What is our strategy to get the judge URLs?
  • exploit the “class=sas-poll” feature to scrape URLs from each of 25 tables
• What is our strategy to get the data for each judge?
  • scrape the name, type, info and country from the main page
  • scrape the 10 films and comment from the judge’s individual page
• How can we handle the special cases?
  • manually create filmIDs for films without webpages
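The voter plan can be sketched roughly as follows. This is an illustration only: it uses the standard library's `html.parser` rather than whatever library the workshop code relies on, the sample markup is invented, and the names `PollLinkParser`, `extract_voter_urls`, `assign_film_id` and the `manual-NNN` ID scheme are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voters"


class PollLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <table class="sas-poll">."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # greater than 0 while inside a matching table
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            # Count nested tables too, so the closing tags balance out.
            if self.table_depth or "sas-poll" in classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and attrs.get("href"):
            self.links.append(urljoin(BASE_URL, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def extract_voter_urls(html):
    """Return absolute URLs for every link found in a sas-poll table."""
    parser = PollLinkParser()
    parser.feed(html)
    return parser.links


def assign_film_id(href, title, manual_ids):
    """Use the last URL segment as the film ID, or invent one per title
    when the film has no webpage (href is empty or None)."""
    if href:
        return href.rstrip("/").rsplit("/", 1)[-1]
    if title not in manual_ids:
        manual_ids[title] = "manual-{:03d}".format(len(manual_ids) + 1)
    return manual_ids[title]


# Invented markup, only to exercise the parser; the real page may differ.
SAMPLE_HTML = """
<table class="sas-poll"><tr><td>
  <a href="/films-tv-people/sightandsoundpoll2012/voter/94">A voter</a>
</td></tr></table>
<table class="other"><tr><td><a href="/elsewhere">ignored</a></td></tr></table>
"""
```

Calling `extract_voter_urls(SAMPLE_HTML)` returns only the link from the `sas-poll` table. Keeping the manual IDs in a dictionary keyed by title means the same film without a webpage gets the same ID each time it appears in a top ten.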
2. Plan of Attack: Films
• What is our strategy for getting the film URLs?
  • save them to a list while we’re scraping the judges
• What is our strategy to get the data for each film? Why do we have to incorporate the special cases directly into the strategy?
  • we need to separately search for cells containing the director, country, year, genre, type, and category info
  • the number of cells in the table varies, so we have to know what they are based on their content and not their position
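One way to identify cells by content rather than position is sketched below, under the assumption that each label cell (for example "Director") is immediately followed by its value cell. The real film pages may be laid out differently; the sample markup and the names `CellTextParser` and `extract_film_details` are hypothetical, and the standard library's `html.parser` stands in for whatever the workshop code uses.

```python
from html.parser import HTMLParser

LABELS = ("Director", "Country", "Year", "Genre", "Type", "Category")


class CellTextParser(HTMLParser):
    """Collect the whitespace-normalised text of every <td>/<th> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None  # list of text chunks while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buffer is not None:
            self.cells.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_film_details(html):
    """Map each known label to the text of the cell that follows it.

    Labels missing from the page stay as empty strings, so a table
    with fewer rows does not shift the other fields.
    """
    parser = CellTextParser()
    parser.feed(html)
    details = {label.lower(): "" for label in LABELS}
    for label_cell, value_cell in zip(parser.cells, parser.cells[1:]):
        key = label_cell.rstrip(":").strip().lower()
        if key in details and not details[key]:
            details[key] = value_cell
    return details


# Invented markup with placeholder values; only two of six rows present.
SAMPLE_HTML = """
<table>
  <tr><th>Director</th><td>Placeholder Director</td></tr>
  <tr><th>Year:</th><td>1900</td></tr>
</table>
"""
```

Because the lookup is keyed on the label text, a page that omits the genre row or adds an extra row still yields a dictionary with the same six keys, which keeps the eventual CSV columns aligned.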
3. Let’s look at the code together
• available at: https://github.com/charlloyd/film-gaze
• First let’s run it in Spyder.
• Then let’s download the Jupyter notebook.
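The repository code is not reproduced here. As an independent sketch of the general shape of a multi-page scrape (not taken from film-gaze), the loop below fetches each URL, hands the HTML to a parsing function, pauses between requests, and records failures instead of stopping. The names `fetch` and `scrape_pages`, the one-second delay, and the 30-second timeout are all assumptions.

```python
import time
import urllib.request


def fetch(url, timeout=30):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_pages(urls, parse, fetch_page=fetch, delay_seconds=1.0):
    """Apply parse(url, html) to every URL.

    Returns (results, failures). A page that raises an error is logged
    in failures as (url, error description) so the special cases can be
    inspected afterwards rather than aborting the whole run.
    """
    results, failures = [], []
    for index, url in enumerate(urls):
        if index and delay_seconds:
            time.sleep(delay_seconds)  # pause between requests
        try:
            results.append(parse(url, fetch_page(url)))
        except Exception as error:
            failures.append((url, repr(error)))
    return results, failures
```

Passing `fetch_page` as a parameter makes it possible to try the loop on saved HTML strings before pointing it at the live site, and the `failures` list gives a starting point for the data verification step.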