Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
![Page 1: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/1.jpg)
Houston Hadoop Meetup, 2/12/14
Nutch + Hadoop with Selenium and Burp
By Mark Kerzner, Elephant Scale
![Page 2: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/2.jpg)
Nutch story
• Created by Doug Cutting to crawl the web
• Not scalable
• Enter HDFS
• Nutch on HDFS
• Nutch on Hadoop
• Nutch 1.x, Nutch 2.x
![Page 3: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/3.jpg)
Nutch 1.x
• Local or HDFS
• Command-line
• Crawl-db
![Page 4: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/4.jpg)
Configuring Nutch
• Edit the file conf/regex-urlfilter.txt and replace the catch-all rule
# accept anything else
+.
• with a regular expression matching the domain you wish to crawl.
• For example, to crawl only the nutch.apache.org domain:
+^http://([a-z0-9]*\.)*nutch.apache.org/
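To see what such a rule lets through, here is a minimal sketch in plain Java (no Nutch dependency) that applies "+"/"-" regex rules to URLs. It assumes the usual reading of regex-urlfilter.txt: rules are tried top to bottom, the first pattern found in the URL decides, and a URL matching no rule is rejected. The class name UrlFilterSketch and its methods are hypothetical, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: mimics "+regex" / "-regex" rules.
public class UrlFilterSketch {

    // One rule: accept (+) or reject (-), plus the compiled pattern.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parse lines written in the style of conf/regex-urlfilter.txt.
    public UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', trimmed.substring(1)));
        }
    }

    // Assumption: first matching rule wins; no match means the URL is rejected.
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "# crawl only the nutch.apache.org domain",
                "+^http://([a-z0-9]*\\.)*nutch.apache.org/"));
        System.out.println(filter.accepts("http://nutch.apache.org/downloads.html"));
        System.out.println(filter.accepts("http://www.example.com/"));
    }
}
```

Note that the rule as written only covers http:// URLs; an https:// address on the same domain would need its own rule or a pattern such as ^https?://.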
![Page 5: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/5.jpg)
Nutch architecture
![Page 6: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/6.jpg)
Solr integration
![Page 7: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/7.jpg)
Solr Application (FreeEed, demo)
![Page 8: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/8.jpg)
Scaling Nutch
• HDFS – scales storage
• MapReduce – scales crawling
• Gora – scales the back end
![Page 9: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/9.jpg)
Gora
• Data Persistence: persisting objects to column stores
such as HBase, Cassandra, Hypertable; key-value stores such as
Voldemort, Redis, etc.; SQL databases such as MySQL, HSQLDB; flat
files in the local file system or Hadoop HDFS
• Data Access: Java-friendly API for accessing the data
regardless of its location
• Indexing: Solr
• Analysis: Apache Pig, Apache Hive and Cascading
• MapReduce support
![Page 10: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/10.jpg)
Passwords? – Oops!
1. Burp + HttpClient
2. Selenium + Java
![Page 11: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/11.jpg)
Burp (with demo)
![Page 12: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/12.jpg)
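HttpClient
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ByteArrayEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
CloseableHttpClient httpclient = HttpClients.createDefault();
try {
    HttpPost httpPost = new HttpPost(getUrl());
    // put in all custom headers
    Map<String, String> headers = getHeaders();
    for (Map.Entry<String, String> header : headers.entrySet()) {
        httpPost.addHeader(header.getKey(), header.getValue());
    }
    HttpEntity entity = new ByteArrayEntity(getPostBody().getBytes("UTF-8"));
    httpPost.setEntity(entity);
    CloseableHttpResponse response = httpclient.execute(httpPost);
    // read the response here, before the client is closed
} finally {
    httpclient.close();
}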
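The same replay idea can be sketched without the Apache library, using the java.net.http client that ships with Java 11 and later. This is a swapped-in technique, not the code from the talk: the class name ReplayLoginSketch, the URL https://example.com/login, and the header and form values are all hypothetical stand-ins for whatever was captured from the real login request.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rebuild a captured login POST with the JDK's own HTTP client.
public class ReplayLoginSketch {

    // Build a POST request carrying the captured headers and body.
    public static HttpRequest buildPost(String url, Map<String, String> headers, String body) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url));
        for (Map.Entry<String, String> header : headers.entrySet()) {
            // Headers the client manages itself (e.g. Host, Content-Length)
            // are rejected by the builder, so leave those out of the map.
            builder.header(header.getKey(), header.getValue());
        }
        return builder
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    // Send the request and return the response body as text.
    public static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/x-www-form-urlencoded");
        headers.put("X-Requested-With", "XMLHttpRequest");

        HttpRequest request = buildPost(
                "https://example.com/login",
                headers,
                "username=your-user-name&password=real-password");

        // Only the request is built and printed here; call send(request) to execute it.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Keeping request construction separate from sending makes it easy to inspect exactly what will go over the wire before pointing the code at a real site.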
![Page 13: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/13.jpg)
Browser interaction? – Oops!
Selenium
Selenium + Java
![Page 14: Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)](https://reader033.fdocuments.us/reader033/viewer/2022061114/54620eeab4af9f531c8b45ca/html5/thumbnails/14.jpg)
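Selenium (with demo)
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
// Go to the login page
driver.get("https://mysite.com");
// put in the username
WebElement query = driver.findElement(By.name("username-element"));
query.sendKeys("your-user-name");
// put in the password
query = driver.findElement(By.name("password-element"));
query.sendKeys("real-password");
// run the site's own login script (the name here is a placeholder)
((JavascriptExecutor) driver).executeScript("javascript:whatever-login-script();");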