The Bixo Web Mining Toolkit

Bixo - Web Mining Toolkit 23 Sep 2009

1

Web Mining Toolkit

Ken Krugler

TransPac Software, Inc.

My background - did a startup called Krugle from 2005 - 2008

Used Nutch to do a vertical crawl of the web, looking for technical software pages.

Mined pages for references to open source projects.

Used experience to create Bixo, an open source web mining toolkit

Built on top of Hadoop, Cascading, Tika.


2

Web Mining 101

Extracting & Processing Web Data

More Than Just Search

Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…

Quick intro to web mining, so we’re on the same page.

Most people think about the big search companies when they think about web mining.

Search is clearly the biggest web mining category, and generates the most revenue.

But other types of web mining have high and growing value.

This is what Bixo focuses on.


3

4 Steps in Mining

Collect - fetch content from web

Parse - extract data from formats

Analyze - tokenize, rate, classify, cluster

Produce - an index, a report

Search

Note - does not include serving up the search results

Why do I bring this up? To help clarify why web mining is not the same as vertical search (next slide).
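
As a rough illustration of the four steps (this is not Bixo code), here is a minimal Python sketch. The page contents, the example.com URLs, and all function names are made up for illustration, and the collect step reads from an in-memory dict instead of fetching anything.

```python
import re
from collections import Counter

def collect(sources):
    """Collect: stand-in for fetching; 'sources' already maps URL -> raw HTML."""
    return list(sources.items())

def parse(fetched):
    """Parse: strip tags to get plain text (a crude stand-in for a real parser)."""
    return [(url, re.sub(r"<[^>]+>", " ", html)) for url, html in fetched]

def analyze(parsed):
    """Analyze: tokenize and count terms per page."""
    return [(url, Counter(re.findall(r"[a-z]+", text.lower())))
            for url, text in parsed]

def produce(analyzed, term):
    """Produce: a small report of pages ranked by mentions of 'term'."""
    rows = [(counts[term], url) for url, counts in analyzed if counts[term] > 0]
    return [(url, n) for n, url in sorted(rows, reverse=True)]

# Hypothetical input data, for illustration only.
PAGES = {
    "http://example.com/a": "<p>Hadoop and more Hadoop</p>",
    "http://example.com/b": "<p>Nothing relevant</p>",
    "http://example.com/c": "<h1>Hadoop</h1>",
}

report = produce(analyze(parse(collect(PAGES))), "hadoop")
print(report)
```

Note that the last step produces a report and stops; nothing here serves results to users.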


4

Vertical Search

Vertical crawl to get specific content

Common use case for Nutch, Heritrix

But web mining often has different outcome

And specialized processing of data

Most people think of vertical search when they think of specialized web mining.

Lots of people have been doing this, using OSS like Nutch & Heritrix.

End result is typically a Lucene index, plus the content, inverted links, etc.

Typical web mining is not the same as vertical search.

Often uses a white list, versus crawling to discover links.

More specialized processing of the data.

And these differences help answer the question of (next slide)…


5

Why Bixo?

Response to needs of commercial projects

– Plug into Cascading-based workflow

– Low IT time/skill requirements

– Run well in AWS EC2 environment

– Flexible I/O support for AWS - S3, HBase

– Toolkit for building custom solutions

• Fetch white list (parse/index, data mine)

• Scrape white list (social popularity)

Does the world really need yet another web crawler?

No, but it does need a web mining toolkit

Two companies agreed to sponsor work on Bixo as an open source project.

On the point of running well in an EC2 environment…

Even though many web mining tasks can be handled on a single computer, you very quickly run into issues of scale if you can’t handle 100M+ pages.


6

Bixo Overview

MIT license open source project

In use by three companies

“Pipe” model for building workflows

Runs on top of Hadoop/Cascading

Full disclosure - Bixo makes heavy use of Cascading, which is under GPL.

So if you want to sell a product based on Bixo, you need to talk to Chris Wensel.

The pipe model comes from our use of Cascading to define the workflows.


7

What is Cascading

API for Hadoop data processing workflows

Operations on tuples with named fields

Workflows created from pipes

Reduces painful low-level MR details

Key for complex/reliable workflows

I know Chris Wensel has previously talked about Cascading here, but just to make sure we’re all on the same page…

A “tuple” is like a row in a database: named fields with values.

Example of a tuple - the result of fetching a page, which has URL, time of fetch, content, headers, response rate, etc.
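
To make the "row with named fields" idea concrete, here is a small Python analogy using namedtuple. This is not the Cascading API (which is Java); the field names follow the example above, and the values and the select helper are hypothetical.

```python
from collections import namedtuple

# Field names taken from the fetch example; values below are invented.
FetchedPage = namedtuple(
    "FetchedPage",
    ["url", "fetch_time", "content", "headers", "response_rate"],
)

page = FetchedPage(
    url="http://example.com/",
    fetch_time=1253664000,          # seconds since the epoch
    content="<html>hello</html>",
    headers={"Content-Type": "text/html"},
    response_rate=52000,            # bytes per second
)

def select(tup, *names):
    """Pick a subset of fields by name, like selecting columns from a row."""
    return tuple(getattr(tup, name) for name in names)

print(select(page, "url", "response_rate"))
```

Operations downstream only need to know the names of the fields they read, not the position of each value.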

Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit.

Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :)

Key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds).

Finds cases where fields aren’t available for operations.

Solves a key problem we ran into when customizing Nutch at Krugle
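
The value of checking field availability before anything runs can be sketched in a few lines of Python. This is only an analogy: it handles a simple linear chain rather than a full DAG, it is not how Cascading's planner is implemented, and the step and field names are invented.

```python
def check_workflow(source_fields, steps):
    """Walk the steps in order and fail fast if a step needs a missing field.

    Each step is (name, fields_it_reads, fields_it_adds).
    Returns the set of fields available at the end of the chain.
    """
    available = set(source_fields)
    for name, needs, adds in steps:
        missing = set(needs) - available
        if missing:
            raise ValueError(
                f"step {name!r} needs missing fields: {sorted(missing)}"
            )
        available |= set(adds)
    return available

# Hypothetical workflow: fetch, then parse, then score.
good = [
    ("fetch", ["url"], ["content"]),
    ("parse", ["content"], ["text"]),
    ("score", ["text"], ["score"]),
]
print(sorted(check_workflow(["url"], good)))

# Same steps with "parse" left out: the error shows up before any job runs.
bad = [good[0], good[2]]
try:
    check_workflow(["url"], bad)
except ValueError as err:
    print(err)
```

Catching this at planning time is much cheaper than finding out hours into a multi-job run.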


8

Architecture

This architecture looks nice and squeaky clean - and in general it is.

One issue is that the fetch phase of Bixo doesn’t fit well into the MR model.

External resource constraints mean you can’t treat it like a regular job.

So lots of threads in a special reduce phase, with corresponding issues

-Stack size

-Error handling
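
A minimal sketch of the threaded, polite fetching idea, assuming one worker per host with a delay between requests to the same host. This is not Bixo's fetcher: the function names, the delay policy, and the injected fetch callable are all assumptions, and errors are simply recorded per URL.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def group_by_host(urls):
    """Group URLs by host so each host can be throttled independently."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)

def fetch_host(urls, fetch, delay_seconds):
    """Fetch one host's URLs sequentially, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        try:
            results.append((url, fetch(url), None))
        except Exception as exc:  # record the failure, keep going
            results.append((url, None, str(exc)))
    return results

def polite_fetch(urls, fetch, delay_seconds=1.0, max_threads=8):
    """Run one task per host in a thread pool; return (url, body, error) rows."""
    groups = group_by_host(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(fetch_host, group, fetch, delay_seconds)
                   for group in groups.values()]
        return [row for future in futures for row in future.result()]
```

The fetch argument is any callable that takes a URL and returns a body, which keeps the sketch testable without touching the network.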


9

HUGMEE

Hadoop

Users who

Generate the

Most

Effective

Emails

Let’s use a real example now of using Bixo to do web mining.

Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community.

In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up.

But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.


10

Helpful Hadoopers

Use mailing list archives for data (collect)

Parse mbox files and emails (parse)

Score based on key phrases (analyze)

End result is score/name pair (produce)

How do you figure out the most helpful Hadoopers?

As we discussed previously, it’s a classic web mining problem

Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files.

How do we score based on key phrases (next slide)?


11

Scoring Algorithm

Very sophisticated point system

“thanks” == 5

“owe you a beer” == 50

“worship the ground you walk on” == 100
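
The point system above is simple enough to sketch directly in Python. The phrases and point values come from the slide; counting every occurrence and matching case-insensitively are assumptions.

```python
# Phrase -> points, as listed on the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

def score_text(text):
    """Sum the points for every occurrence of each key phrase in the text."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())

print(score_text("Thanks, I owe you a beer!"))
```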


12

High Level Steps

Collect emails

– Fetch mod_mbox generated page

– Parse it to extract links to mbox files

– Fetch mbox files

– Split into separate emails

Parse emails

– Extract key headers (messageId, email, etc)

– Parse body to identify quoted text

Parsing the mod_mbox page is simple with Tika’s HtmlParser

Cheated a bit when parsing emails - some users like Owen have many aliases

So hand-generated alias resolution table.
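
A rough Python sketch of the split and parse steps, using the standard library's email package rather than the tools used in the talk. The sample mbox text, the alias table, and the output field names are all invented for illustration; the splitter assumes "From " lines inside message bodies have already been escaped, as mbox writers normally do.

```python
from email import message_from_string
from email.utils import parseaddr

# Hand-built alias table (hypothetical addresses): alias -> canonical address.
ALIASES = {"bob@example.org": "bob@example.com"}

def split_mbox(text):
    """Split mbox text into raw messages at lines starting with 'From '."""
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages

def parse_email(raw):
    """Extract key headers and the non-quoted part of the body."""
    if raw.startswith("From "):  # drop the mbox envelope line
        raw = raw.split("\n", 1)[1] if "\n" in raw else ""
    msg = message_from_string(raw)
    name, address = parseaddr(msg.get("From", ""))
    address = ALIASES.get(address.lower(), address.lower())
    body = "" if msg.is_multipart() else msg.get_payload()
    new_text = "\n".join(line for line in body.splitlines()
                         if not line.lstrip().startswith(">")).strip()
    return {
        "message_id": msg.get("Message-ID", "").strip(),
        "in_reply_to": msg.get("In-Reply-To", "").strip(),
        "name": name,
        "address": address,
        "new_text": new_text,
    }

SAMPLE_MBOX = (
    "From alice@example.com Mon Sep 21 10:00:00 2009\n"
    "From: Alice <alice@example.com>\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "How do I set the number of reducers?\n"
    "\n"
    "From bob@example.org Mon Sep 21 11:00:00 2009\n"
    "From: Bob <Bob@Example.org>\n"
    "Message-ID: <2@example.com>\n"
    "In-Reply-To: <1@example.com>\n"
    "\n"
    "> How do I set the number of reducers?\n"
    "Use setNumReduceTasks.\n"
)

emails = [parse_email(raw) for raw in split_mbox(SAMPLE_MBOX)]
```

Dropping lines that start with ">" is the simplest possible way to separate new text from quoted text; real mail needs more care.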


13

High Level Steps

Analyze emails

– Find key phrases in replies (ignore signoff)

– Score emails by phrases

– Group & sum by message ID

– Group & sum by email address

Produce ranked list

– Toss email addresses with no love

– Sort by summed score

Need to ignore “thanks” in “thanks in advance for doing my job for me” signoff.

Generate two tuples for each email:

-one with messageId/name/address

-one with reply-to messageId/score

Group/sum aspect is classic reduce operation.
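
The analyze and produce steps can be sketched in plain Python, with dicts standing in for the grouping that the reduce phase does. The input records, addresses, and function names are hypothetical, and stripping the literal phrase "thanks in advance" is a simplification of real signoff detection.

```python
from collections import defaultdict

PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

def score_reply(text):
    """Score a reply, ignoring the 'thanks in advance' signoff."""
    cleaned = text.lower().replace("thanks in advance", "")
    return sum(points * cleaned.count(phrase)
               for phrase, points in PHRASE_POINTS.items())

def rank(emails):
    """Credit each reply's score to the author of the message it replies to."""
    author_of = {}                # tuple 1: messageId -> address
    score_for = defaultdict(int)  # tuple 2: reply-to messageId -> summed score
    for e in emails:
        author_of[e["message_id"]] = e["address"]
        if e["in_reply_to"]:
            score_for[e["in_reply_to"]] += score_reply(e["text"])
    totals = defaultdict(int)     # group & sum by email address
    for message_id, score in score_for.items():
        if message_id in author_of:
            totals[author_of[message_id]] += score
    ranked = [(addr, s) for addr, s in totals.items() if s > 0]  # toss no-love
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

def mail(mid, addr, reply_to, text):
    return {"message_id": mid, "address": addr,
            "in_reply_to": reply_to, "text": text}

EMAILS = [
    mail("m1", "ted@example.com", "", "Try increasing the heap size."),
    mail("m2", "ann@example.com", "m1", "Thanks, I owe you a beer"),
    mail("m3", "bob@example.com", "m1", "thanks"),
    mail("m4", "ann@example.com", "", "Anyone seen this error?"),
    mail("m5", "carl@example.com", "m4", "Thanks in advance for any pointers"),
    mail("m6", "dave@example.com", "m2", "thanks"),
]

print(rank(EMAILS))
```

The two dicts mirror the two tuples generated per email: one identifies who wrote a message, the other carries the score earned by the message being replied to.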


14

Workflow

I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs.

OK, actually not so clear, but…

Key point is that only the purple parts are things I actually had to create.

Some lines are purple as well, since that workflow (DAG) is also something I defined - see next page.

But only two custom operations were actually needed - parsing the mod_mbox page and calculating the score.

Running took about 30 minutes - mostly politely waiting until it was OK to do another fetch.

Downloaded 150MB of mbox files

409 unique email addresses with at least one positive reply.


15

Building the Flow

Most of the code needed to create the workflow for this data mining app.

Lots of oatmeal code - which is good. Don’t want to be writing tricky code here.

Could optimize, but that would be a mistake…most web mining is programmer-constrained.

So just use more servers in EC2 - cheaper & faster.


16

mod_mbox Page

Example of the top-level pages that were fetched in first phase.

Then needed to be parsed to extract links to mbox files.


17

Custom Operation

Example of one of the two custom operations

Parsing mod_mbox page

Uses Tika to extract Ids

Emits tuple with URL for each mbox ID
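
For a feel of what such an operation does, here is a Python sketch that uses the standard library's html.parser in place of Tika. It assumes the archive page links to each month with hrefs like "200909.mbox/thread" and that the raw file lives at the same path without the suffix; that link pattern, the example.org URL, and the class and function names are assumptions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class MboxLinkParser(HTMLParser):
    """Collect distinct mbox IDs (e.g. '200909.mbox') from anchor hrefs."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = re.match(r"(\d{6}\.mbox)(/|$)", href)
        if match and match.group(1) not in self.ids:
            self.ids.append(match.group(1))

def mbox_urls(page_url, html):
    """Emit one absolute URL per mbox ID found on the page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [urljoin(page_url, mbox_id) for mbox_id in parser.ids]

# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = """
<a href="200909.mbox/thread">Thread</a> <a href="200909.mbox/date">Date</a>
<a href="200908.mbox/thread">Thread</a> <a href="/about">About</a>
"""

print(mbox_urls("http://example.org/mod_mbox/hadoop-user/", SAMPLE_PAGE))
```

Each URL returned here corresponds to one emitted tuple that feeds the second fetch cycle.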


18

Validate

Curve looks right - exponential decay.

409 unique email addresses that got some love from somebody.


19

This Hug’s for Ted!

And the winner is…Ted Dunning

I know - I should have colored the elephant yellow.


20

Produce

A list of the usual suspects

Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.


21

Use Bixo to…

Find +/- product comments on forums

Compare web site quality

Track social network popularity

Derive optimized SEO terms

Scrape and analyze pricing data

Previous example could be easily changed to “find opinion makers on forums”

Many other use cases

All involve web mining workflow - fetch, parse, analyze, produce


22

Summary

Bixo is a web mining toolkit

Built on Hadoop, Cascading, Tika

Young project but used commercially

Future - Mahout, monitoring, HBase, URL DB, cleanup, bug fixes, rinse, repeat

Lots to be done, of course, but moving fast


23

Resources

Web: http://bixo.101tec.com

List: http://tech.groups.yahoo.com/group/bixo-dev/

Source: http://github.com/emi/bixo/tree

Bugs: http://oss.101tec.com/jira/browse/bixo

URLs to find out more about the Bixo project.

Stefan Groschupf from 101tec helped with initial Bixo coding.

His company provides infrastructure for the project, thus 101tec.com in the URLs above.


24

Any Questions?
