Web Scraping for Code-ophobes
![Page 1: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/1.jpg)
Web Scraping for Code-ophobes
@AnnieCushing
![Page 2: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/2.jpg)
What I’m not
![Page 3: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/3.jpg)
What I am
![Page 4: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/4.jpg)
THE WIND BENEATH MY WEB-SCRAPING WINGS
@djchrisle
@ethanlyon
@AnnieCushing
![Page 5: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/5.jpg)
3 WAYS TO SCRAPE IN GOOGLE DOCS
• ImportFeed
• ImportHTML
• ImportXML
![Page 6: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/6.jpg)
=ImportFeed
![Page 7: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/7.jpg)
ImportFeed
=ImportFeed(URL, query, headers, numItems)
http://bit.ly/importfeed
=ImportFeed("http://feeds.searchengineland.com/searchengineland")
OR
=ImportFeed(C4) (my preference: put the URL in a cell and reference it)
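For the curious, here is a rough Python sketch of what a feed import boils down to: read the feed's XML and keep one row per item. This is not how Google implements ImportFeed. The sample feed, the `feed_items` name, and the `num_items` argument (loosely mirroring numItems) are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet standing in for a real feed (no network needed).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title>Third post</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def feed_items(rss_text, num_items=None):
    """Return (title, link) pairs for the feed's items, optionally capped."""
    root = ET.fromstring(rss_text)          # root is the <rss> element
    items = root.findall("./channel/item")  # every <item> in the channel
    if num_items is not None:
        items = items[:num_items]
    return [(item.findtext("title"), item.findtext("link")) for item in items]


rows = feed_items(SAMPLE_RSS, num_items=2)
```

Each tuple in `rows` corresponds to one spreadsheet row the import would fill in.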
![Page 8: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/8.jpg)
![Page 9: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/9.jpg)
http://slidesha.re/stalker-wil
STALKING FOR LINKS
BY @WILREYNOLDS
![Page 10: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/10.jpg)
=ImportHTML
![Page 11: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/11.jpg)
ImportHTML
TWO OPTIONS
• Table
• List
![Page 12: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/12.jpg)
=ImportHtml(URL, query, index)
URL: "http://www.domain.com/whatever" (include the protocol) OR cell reference
query: "table" or "list" OR cell reference
index: If multiple lists or tables, which one (3 = 3rd table)
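To see what the index argument means, here is a small Python sketch that collects every table on a page and returns the one you ask for, counting from 1. It is an illustration only, not Google's implementation. The `TableGrabber` class, the `import_table` function, and the sample HTML are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of every <table> on a page, in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # row being built, or None
        self._cell = None  # text pieces of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index):
    """Return the rows of the index-th table (1-based, like the index argument)."""
    grabber = TableGrabber()
    grabber.feed(html)
    return grabber.tables[index - 1]


# Hypothetical page with two tables.
SAMPLE_HTML = """
<table><tr><th>Fruit</th></tr><tr><td>Apple</td></tr></table>
<table><tr><th>Country</th><th>Capital</th></tr>
<tr><td>France</td><td>Paris</td></tr></table>
"""

second_table = import_table(SAMPLE_HTML, 2)
```

Asking for index 2 skips the fruit table and returns the country table, header row included.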
![Page 13: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/13.jpg)
Table Example of ImportHTML
![Page 14: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/14.jpg)
List Example of ImportHTML
![Page 15: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/15.jpg)
=ImportXML
![Page 16: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/16.jpg)
ImportXML
http://bit.ly/xpath-tutorial
=ImportXML(URL, query)
![Page 17: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/17.jpg)
Simple Explanation of XPath
XPath uses path expressions to select nodes or node-sets in an XML document.
![Page 18: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/18.jpg)
![Page 19: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/19.jpg)
7 Types of Nodes
![Page 20: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/20.jpg)
Simple Explanation of XPath
ELEMENTS
<div> <p> <blockquote> <price> <ul>
![Page 21: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/21.jpg)
PARENT-CHILD NODES
• As you drill down, you separate nodes with /
• Ex: /html/div/ul/li/a
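Drilling down with / can be tried out in a few lines of Python. This is a sketch using the standard library's ElementTree, which understands only a subset of XPath. The sample page is hypothetical, and because a real page has a body between html and div, the path used here includes it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html><body>
  <div>
    <ul>
      <li><a href="/one">One</a></li>
      <li><a href="/two">Two</a></li>
    </ul>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# The full XPath would be /html/body/div/ul/li/a. ElementTree paths start
# from the element you call them on, so the leading /html becomes ".".
links = root.findall("./body/div/ul/li/a")
link_text = [a.text for a in links]
```

Every step in the path names a child of the step before it, so dropping one level (say, the ul) would make the path match nothing.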
![Page 22: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/22.jpg)
ATTRIBUTES
• class
• id
• size
Look for the = sign
![Page 23: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/23.jpg)
Simple Explanation of XPath
KEY CHARACTERS
/: Starts at the root
//: Starts wherever
@: Selects attributes
[]: Answers the question "Which one?"
[*]: All
![Page 24: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/24.jpg)
Let’s Start Simple
![Page 25: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/25.jpg)
Magic!
![Page 26: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/26.jpg)
Grab the URLs
![Page 27: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/27.jpg)
Because it’s an @tribute!
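Here is a hedged Python sketch of grabbing URLs from link attributes. In full XPath the query would typically be //a/@href, but the standard library's ElementTree cannot return attributes directly, so the sketch selects the links and then reads the attribute. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: two real links and one <a> with no href.
PAGE = """<div>
  <p><a href="http://example.com/a">A</a></p>
  <p><a href="http://example.com/b" class="ext">B</a> and <a>no link</a></p>
</div>"""

root = ET.fromstring(PAGE)

# ".//a[@href]" means: any <a>, anywhere below, that has an href attribute.
# The URL lives in the attribute, not in the element's text.
urls = [a.get("href") for a in root.findall(".//a[@href]")]
```

The link text ("A", "B") and the URL are separate things: the text sits between the tags, the URL sits in the attribute.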
![Page 28: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/28.jpg)
Let’s dial it up
http://bit.ly/distilled-xml
![Page 29: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/29.jpg)
![Page 30: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/30.jpg)
What if your child nodes look like this?
![Page 31: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/31.jpg)
Let’s dial it up
![Page 32: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/32.jpg)
Could do it this way
![Page 33: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/33.jpg)
At your own risk
![Page 34: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/34.jpg)
Better plan
![Page 35: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/35.jpg)
The world according to Annie
// = blah blah yada yada
![Page 36: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/36.jpg)
Can even be in the middle of the XPath
//div[@class='main']//blockquote[2]
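A quick Python sketch of a mid-path // in action, using the standard library's ElementTree (which supports this part of XPath). The sample page is hypothetical. Note that blockquote[2] means the second blockquote inside its own parent, not the second one on the whole page.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one quote in a sidebar, two buried inside the main div.
PAGE = """<html><body>
  <div class="sidebar"><blockquote>Sidebar quote</blockquote></div>
  <div class="main">
    <section>
      <blockquote>First quote</blockquote>
      <p>Some text</p>
      <blockquote>Second quote</blockquote>
    </section>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# The leading "." anchors the search at the root element. The second //
# skips over the <section> without having to name it.
matches = root.findall(".//div[@class='main']//blockquote[2]")
quotes = [b.text for b in matches]
```

The sidebar quote is ignored because its div fails the class test, and the section never has to be spelled out in the path.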
![Page 37: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/37.jpg)
Other ways to tell “which one” in XPath
STARTS-WITH
![Page 38: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/38.jpg)
Other ways to tell “which one” in XPath
CONTAINS
![Page 39: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/39.jpg)
Other ways to tell “which one” in XPath
![Page 40: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/40.jpg)
Other ways to tell “which one” in XPath
INDEX VALUE
![Page 41: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/41.jpg)
Other ways to tell “which one” in XPath
LAST()
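The "which one" tests (starts-with, contains, index value, last()) can be sketched in Python as follows. The standard library's ElementTree handles index and last() itself but has no starts-with or contains, so those two are imitated with ordinary Python string checks. The sample list and class names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical list with three items.
PAGE = """<ul>
  <li class="nav-home">Home</li>
  <li class="nav-blog">Blog</li>
  <li class="footer-link">Contact</li>
</ul>"""

root = ET.fromstring(PAGE)  # root is the <ul> element

# Index value: XPath counts from 1, so li[1] is the first <li>.
first = root.find("./li[1]").text

# last(): the final <li>, however many there are.
last = root.find("./li[last()]").text

# starts-with / contains: in full XPath these would be
#   //li[starts-with(@class, 'nav-')]  and  //li[contains(@class, 'blog')]
# ElementTree lacks both functions, so filter with Python string methods.
items = root.findall("./li")
nav = [li.text for li in items if li.get("class", "").startswith("nav-")]
blog = [li.text for li in items if "blog" in li.get("class", "")]
```

Index and last() pick by position, while starts-with and contains pick by what an attribute says, which tends to survive page redesigns better.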
![Page 42: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/42.jpg)
Become a scraping FOOL
@NicoMiceli
• Pull queries from Topsy
• Pull product feeds
• Pull specific elements from a sitemap
• Scrape Twitter followers
• Pull GA metrics
• Scrape HTML tables (e.g., list of countries from Wikipedia)
• Scrape lists (e.g., scraped lists of consumer review sites to create a custom search engine, top sports blogs, etc.)
• Scrape rankings
• Scrape GA codes / Adsense IDs / IPs / IP Country Codes
• Find de-indexed sites
• Scrape directories
• Scrape Yahoo / Google for relevant pages from directory listings
• Scrape title / h1 / meta descriptions
• Scrape page URLs to find if someone is linking to you
• Scrape Google to find snippets of text on a list of domains (for link networks)
• Scrape Quora
![Page 43: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/43.jpg)
SEE IMPORT FUNCTIONS IN THEIR NATURAL HABITAT!
http://bit.ly/annies-gdoc
![Page 44: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/44.jpg)
AWWW YEAHHH!
![Page 45: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/45.jpg)
TO PLAY …
1. Log in
2. File > Make a copy…
3. Poke around and test
![Page 46: Web Scraping for Code-ophobes](https://reader033.fdocuments.us/reader033/viewer/2022051015/5552c23db4c90581158b486e/html5/thumbnails/46.jpg)
RESOURCES
XPath Tutorial: http://bit.ly/xpath-tutorial
Annie’s Gdoc: http://bit.ly/annies-gdoc
Distilled Guide: http://bit.ly/distilled-guide
SEER Cookbook: http://bit.ly/seer-cookbook