Wrappers Kapowtech RoboSuite 6.0 Team number 10 – Vampires, meeting number 2.


Page 1:

Wrappers

Kapowtech RoboSuite 6.0

Team number 10 – Vampires, meeting number 2

Page 2:

Documents

• Lixto WhitePaper

• Wrapper Development Tools

• Piggy Bank

• WebVCR

• Kapow RoboSuite Documentation

Page 3:

Why wrappers?

- HTML is used to display data

- the data is stored inside the HTML

- the Web is designed for human consumption, even when its content is derived from a well-defined database

- a wrapper is a robot that browses the Web and extracts the data
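
To make the idea concrete, here is a minimal sketch of a wrapper in Python. It pulls product names and prices out of an HTML fragment written for human readers. The page fragment, the class names `name` and `price`, and the `PriceWrapper` class are all hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the data is present, but buried in display markup.
PAGE = """
<table>
  <tr><td class="name">Garlic</td><td class="price">1.20</td></tr>
  <tr><td class="name">Wooden stake</td><td class="price">7.50</td></tr>
</table>
"""

class PriceWrapper(HTMLParser):
    """Collects (name, price) records from <td class="name"> and <td class="price"> cells."""

    def __init__(self):
        super().__init__()
        self.records = []   # extracted rows
        self._field = None  # class of the cell currently being read
        self._row = {}      # fields of the row currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field in ("name", "price") and data.strip():
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row:
            self.records.append((self._row["name"], float(self._row["price"])))
            self._row = {}

def extract(html):
    """Run the wrapper over one page and return the extracted records."""
    parser = PriceWrapper()
    parser.feed(html)
    return parser.records
```

Calling `extract(PAGE)` returns the two rows as Python tuples, which can then be stored or compared without any display markup.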

Page 4:

Applications

• online price comparisons
• automatic stock market surveillance
• personalized online news
• flight tickets
• job search
• competitive advantage
• research into new technologies
• …

Page 5:

Lixto WhitePaper

• presented by Duri

• table on the next slide: comparison of wrappers, programming languages and by hand conversion

- criteria such as learning time, expressive power, user friendliness, …

Page 6:

Comparison of wrappers, programming languages and by hand conversion

Page 7:

Wrapper Development Tools

• 3 main functions:
- downloading HTML pages from a website
- searching for, recognizing and extracting data
- saving the extracted data in a suitable format, such as XML, XLS or a database, for later import into other applications
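
The three functions can be sketched as a small pipeline. This is only an illustration under assumptions: the function names are hypothetical, the extracted "data" is simply the links of a page, a regular expression stands in for a real extraction engine, and XML is chosen as the output format.

```python
import re
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def download(url):
    """Function 1: download an HTML page (performs a real HTTP request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(html):
    """Function 2: search for, recognize and extract data (here: link targets and texts)."""
    pattern = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                         re.IGNORECASE | re.DOTALL)
    # Strip nested tags from the link text before returning it.
    return [(href, re.sub(r"<[^>]+>", "", text).strip())
            for href, text in pattern.findall(html)]

def to_xml(records):
    """Function 3: save the extracted data in a suitable format (here: XML)."""
    root = ET.Element("links")
    for href, text in records:
        item = ET.SubElement(root, "link", href=href)
        item.text = text
    return ET.tostring(root, encoding="unicode")

def run_wrapper(url, fetch=download):
    """Chain the three functions; `fetch` can be swapped out for offline use."""
    return to_xml(extract_links(fetch(url)))
```

Passing a custom `fetch` (for example a function that returns a saved page) lets the extraction and export steps be tried without network access.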

Page 8:

Wrapper Development Tools

• Non-commercial tools:
- most of them were developed at universities
- output data: mainly text and XML
- most of them offer an API
- most of them are implemented in Java and are open source
- most of them offer Web crawling
- some of them offer a GUI
- only a few offer an editor (regular expressions, ontologies)

Page 9:

Wrapper Development Tools

• Commercial tools:
- most of them were developed by commercial companies
- output data: mainly XML, tables and text
- most of them offer database connectivity
- most of them offer Web crawling
- most of them offer an API
- all of them offer a GUI
- most of them offer an editor (regular expressions, Perl, VBScript, …)

Page 10:

Piggy Bank

• extension for the Firefox Web browser
• turns it into a Semantic Web browser
• lets users:
- combine information from several web sites and browse it all together
- save information they have found on the Web
- tag each item they save
- share saved information
- browse and search through an existing web site

Page 11:

Piggy Bank – Applications

• You are meeting friends and want to find a restaurant with Chinese cuisine that is close to your favorite coffee shop with a wireless network

• You are moving to a new place and are looking for an apartment close to a school and a subway station, away from crime hotspots, near a hospital, …

Page 12:

Piggy Bank – How it works

• semantic web

• RDF model

• XML information

• screen scraper
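
As a rough illustration of how scraped data fits the RDF model, the sketch below turns one scraped record into subject-predicate-object triples in N-Triples syntax. It is not Piggy Bank code: the `to_ntriples` helper, the `http://example.org/` URIs and the property names are assumptions made for the example.

```python
def literal(value):
    """Encode a Python string as an N-Triples literal (escape \\, quotes and newlines)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def to_ntriples(subject_uri, properties, vocab="http://example.org/vocab#"):
    """Turn one scraped record into RDF triples, one '<s> <p> "o" .' line per property.

    `vocab` is a hypothetical vocabulary namespace used to build predicate URIs.
    """
    lines = []
    for name, value in sorted(properties.items()):
        lines.append(f"<{subject_uri}> <{vocab}{name}> {literal(value)} .")
    return "\n".join(lines)

# Hypothetical record, as a screen scraper might extract it from a restaurant page.
restaurant = {"name": "Golden Dragon", "cuisine": "Chinese"}
print(to_ntriples("http://example.org/restaurant/1", restaurant))
```

Once every site's data is expressed as triples like these, records from different sites can be stored and queried together, which is what the restaurant and apartment scenarios rely on.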

Page 13:

Piggy Bank Example

Page 14:

Piggy Bank - Architecture

• consists of 3 primary parts:
- chrome additions to the browser, including menu commands, toolbars, etc.
- back-end Java code that manages collected information in databases and serves it up through an HTTP interface
- XPCOM components written in JavaScript that bridge the chrome part and the Java part

Page 15:

Piggy Bank - Technologies

• Firefox, as the application platform
• XUL, as the extension's user interface language
• HTML, as the client side user interface language
• JavaScript, as the client side and extension's scripting language
• Java, as the server side core programming language
• Batik, for encoding PNG files
• Informa, for parsing RSS feeds
• Jetty, as the embedded web server
• JTidy and JDom, for applying XSLT on HTML
• Log4j, as the logging framework
• Lucene, as the text indexer
• Sesame, as the RDF access and storage API
• Velocity, as the templating engine for generating HTML

Page 16:

WebVCR

• smart bookmarks

- shortcuts to Web content that requires multiple steps to be retrieved

- hard-to-reach Web content

• VCR style – record, replay and, optionally, step through the user's actions

• no programming required from user, just usual browsing

Page 17:

WebVCR - application

• navigating travelocity.com:
- Juliana plans to attend the WWW9 conference and is looking for flights from Newark to Amsterdam that leave Newark on May 14th and return from Amsterdam on May 20th. She must take the following steps:
  - go to http://www.travelocity.com
  - choose the Find/Book a Flight option
  - log in
  - specify the details of the itinerary
- produced address: http://dps1.travelocity.com:80/airgchoice.ctl?SEQ=94312
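
Juliana's steps can be modelled as a smart bookmark, that is, an ordered list of recorded actions that is replayed later. The sketch below is a hypothetical data structure, not WebVCR's actual format: the `Step` and `SmartBookmark` classes, the action names and the form field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str                                 # e.g. "goto", "click", "login", "fill"
    target: str                                 # URL, link text or form name
    values: dict = field(default_factory=dict)  # form values, if any

@dataclass
class SmartBookmark:
    name: str
    steps: list = field(default_factory=list)

    def record(self, action, target, **values):
        """Recording: append one browsing action to the bookmark."""
        self.steps.append(Step(action, target, values))

    def replay(self, perform):
        """Playback: hand each stored step, in order, to the `perform` callback."""
        return [perform(step) for step in self.steps]

# The four steps from the travelocity.com scenario (field names are illustrative).
bookmark = SmartBookmark("Flights to WWW9")
bookmark.record("goto", "http://www.travelocity.com")
bookmark.record("click", "Find/Book a Flight")
bookmark.record("login", "login form")
bookmark.record("fill", "itinerary", origin="Newark", destination="Amsterdam",
                depart="May 14", back="May 20")
```

Storing the steps rather than the produced address matters here: the final URL carries a sequence number, so bookmarking it directly would not reliably lead back to the same content.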

Page 18:

WebVCR – 3 main steps

• Notification – tracking the user's actions:
- modifying the browser to provide a notification for each action performed
- using a proxy to rewrite each page, replacing all hrefs with calls to a well-known script that provides the notification
- using a proxy to monitor all HTTP commands sent to/from the browser
- attaching JavaScript event handlers to all active objects in the page

• Recording – storing the user's browsing information

• Playback – replaying the user's actions
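
The proxy-based notification option can be sketched as a page rewrite: every href is replaced with a call to a notification script that receives the original target. The script URL `http://proxy.example/notify` and the `target` parameter name are hypothetical, and a regular expression is used for brevity where a real proxy would need a proper HTML parser.

```python
import re
from urllib.parse import quote

NOTIFY_SCRIPT = "http://proxy.example/notify"  # hypothetical "well-known script"

def rewrite_hrefs(html, notify=NOTIFY_SCRIPT):
    """Replace every href="..." with a call to the notification script,
    passing the original link target as a URL-encoded query parameter."""
    def redirect(match):
        prefix, original = match.group(1), match.group(2)
        encoded = quote(original, safe="")
        return f'{prefix}"{notify}?target={encoded}"'
    return re.sub(r'(href\s*=\s*)"([^"]*)"', redirect, html, flags=re.IGNORECASE)

page = '<a href="http://xyz.com/script?x=10&y=12">Go</a>'
print(rewrite_hrefs(page))
```

When the user clicks a rewritten link, the request reaches the script first, which can log the action and then redirect the browser to the original target.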

Page 19:

WebVCR – how to cope with changes

• changes do not pose a problem to a user browsing the Web, since the user can easily determine which link to follow, but they do present a challenge to a system that performs automatic navigation. During playback, the link to follow is chosen as follows:

- Attempt to locate a link in the last retrieved page corresponding to the DOM location stored in the current smart bookmark step. If the link exists, the target of the link matches the bookmark, and either the URL or the text of the retrieved link matches the step, then use that link.

- Otherwise, if there is a unique link in the page whose target, URL, and text match those of the stored link, use that link.

- Otherwise, if there is a unique link in the page whose target and URL match those of the stored link, use that link.

- Otherwise, if there is a unique link in the page whose target and text match those of the stored link, use that link.
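
The four rules above can be sketched as follows. This is an illustration, not WebVCR's implementation: the `Link` record, its field names and the representation of a DOM location as a path string are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    dom_path: str  # position of the link in the page's DOM (hypothetical encoding)
    target: str    # target frame or window of the link
    url: str
    text: str

def unique(links):
    """Return the only element of `links`, or None if there are zero or several."""
    return links[0] if len(links) == 1 else None

def find_link(stored, page_links):
    """Apply the four matching rules in order; return None if none of them applies."""
    # Rule 1: same DOM location, same target, and the URL or the text still matches.
    for link in page_links:
        if (link.dom_path == stored.dom_path and link.target == stored.target
                and (link.url == stored.url or link.text == stored.text)):
            return link
    # Rules 2-4 all require the same target and a unique match in the page.
    same_target = [l for l in page_links if l.target == stored.target]
    rules = (
        lambda l: l.url == stored.url and l.text == stored.text,  # rule 2
        lambda l: l.url == stored.url,                            # rule 3
        lambda l: l.text == stored.text,                          # rule 4
    )
    for rule in rules:
        match = unique([l for l in same_target if rule(l)])
        if match is not None:
            return match
    return None
```

The rules run from the strictest to the loosest, so a link that merely moved within the page is still found, while an ambiguous match (several candidates) is never silently accepted.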

Page 20:

WebVCR – how to cope with changes

• Otherwise, if the link corresponds to a CGI-bin script (e.g., contains "?" in it), then find all links that match the stored URL up to the first occurrence of a "?" and store them in a set of candidate links, which we denote L.

• Eliminate any elements of L whose parameter names do not match the stored version. For instance, if the stored URL is http://xyz.com/script?x=10&y=12 then http://xyz.com/script?x=20&y=32 matches, but http://xyz.com/script?x=10&z=12 does not, since it has a parameter named z that does not appear in the stored version.

• For each parameter in the stored version whose value matches the corresponding parameter value in at least one element of L, eliminate all elements of L with a non-matching value for the same parameter.

Page 21:

WebVCR – how to cope with changes

• If L is a singleton set, use that element.

• Otherwise, the playback can either be aborted, or the link present at the recorded DOM location can be used to try to proceed through the playback (our implementation uses the latter). However, the playback might fail later in the sequence, or the sequence might traverse pages different from what the user had recorded.
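
The CGI-specific part of the heuristic (building the candidate set L and narrowing it down) can be sketched like this. The function names are hypothetical, links are represented simply as URL strings, and each parameter is assumed to appear at most once per URL.

```python
from urllib.parse import parse_qsl

def split_cgi(url):
    """Split a URL at the first '?' into (script, {parameter: value})."""
    script, _, query = url.partition("?")
    return script, dict(parse_qsl(query, keep_blank_values=True))

def choose_cgi_link(stored_url, page_urls):
    """Return the single surviving candidate, or None if L is not a singleton."""
    script, stored_params = split_cgi(stored_url)
    # L: all links that match the stored URL up to the first '?'.
    candidates = [u for u in page_urls if "?" in u and split_cgi(u)[0] == script]
    # Eliminate candidates whose parameter names differ from the stored version.
    candidates = [u for u in candidates
                  if split_cgi(u)[1].keys() == stored_params.keys()]
    # For each stored parameter whose value matches in at least one candidate,
    # eliminate the candidates with a different value for that parameter.
    for name, value in stored_params.items():
        if any(split_cgi(u)[1][name] == value for u in candidates):
            candidates = [u for u in candidates if split_cgi(u)[1][name] == value]
    return candidates[0] if len(candidates) == 1 else None

stored = "http://xyz.com/script?x=10&y=12"
page = ["http://xyz.com/script?x=20&y=32", "http://xyz.com/script?x=10&z=12"]
print(choose_cgi_link(stored, page))
```

With the URLs from the example above, the second link is dropped because of its parameter `z`, leaving a singleton set. A `None` result corresponds to the last case, where the caller either aborts the playback or falls back to the link at the recorded DOM location.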

Page 22:

WebVCR – problems

• HTTP authentication – some user actions cannot be recorded in the client: it is not possible to detect when HTTP authentication takes place, and the values entered by the user are not available through the DOM API

• State information – cookies: the login and password are entered only the first time; after that, access goes straight through by means of cookies

• Signed applets

• Automatic refresh – they assume that auto refresh takes place

• Microsoft IE limitations