Pay for Placement Search
Copyright GoTo.com, 2/19/2001, 2
Agenda
Search Engines
• Where did they come from?
• How do they work?
• Who’s the biggest?
• Why GoTo is the coolest.
What type of stuff do you need to support the web’s 2nd* largest search engine?
• Architecture, infrastructure, nuts and bolts
• Performance
• Operations
What kind of people (and how many) do you need to do this kind of business?
Where is the Internet going? What's going to happen to search engines?
*Don’t quote me
Ancient History
The Precursors
• Archie (1990) – FTP-based file indexing and retrieval
• Gopher (1992) – document network (non-FTP)
The early ‘bots (1992-1993)
• WWW Wanderer (wandex) – servers, then URLs
• Aliweb – index web like Archie w/site index retrieval
Then came the spiders (1993+)
• WWW Worm
• Excite (Architext), 2/93 from Stanford
All Done? Wrong!
Problems with Spiders:
Get lots of data, but no intelligence to map pages to concept space
Problem still exists today (spamming)
The Solution? Searchable Directories. Human crafted hierarchies.
• Tradewave Galaxy (1/94)
• Yahoo! (4/94), Filo and Yang of Stanford
I Give Up – Let’s Search Everyone!
Here Come the Metasearchers!
• MetaCrawler, go2net, Dogpile (1995)
• Mamma
• Search.com (CNet)
Spray out searches to several engines – combine the results
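The "spray and combine" idea can be sketched in a few lines. This is only an illustration: the slide does not say how any metasearcher merged results, so the reciprocal-rank vote below is an assumption, and `engine_a` and `engine_b` are hypothetical stand-ins that return canned results.

```python
from collections import defaultdict

def metasearch(query, engines):
    """Send one query to every engine and merge the ranked results."""
    scores = defaultdict(float)
    for engine in engines:
        # Each engine is a callable returning URLs, best match first.
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank  # reciprocal-rank vote (an assumption)
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Two stand-in engines with canned results (hypothetical data).
def engine_a(query):
    return ["a.example", "b.example", "c.example"]

def engine_b(query):
    return ["b.example", "d.example"]

if __name__ == "__main__":
    print(metasearch("used cars", [engine_a, engine_b]))
```

A URL returned by several engines accumulates votes, so agreement between engines pushes it up the merged list.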
The Universe Divides (kinda)
The Crawler-based Search Engines
• Lycos (7/94) – the wolf spider
• Infoseek (4/94)
• AltaVista (12/95)
• Inktomi (Slurp) – HotBot (5/96) – the plains Indians spider myth
• Google, Northern Light, Excite, FAST, Direct Hit, and more…
The Directory/Editorial based Search Engines
• Yahoo! (4/94)
• LookSmart (5/95)
• Snap.com
• ODP (NewHoo) -- dmoz (1/98)
• Ask Jeeves (4/97)
• GoTo (6/98)
How Crawlers Work (or don’t)
Start with list of URLs (submitted, generated from somewhere)
For each site
• Get the base page
• ‘Catalog’ the page based on crawler-specific implementation
• Follow links on page and recurse
Some Details
• META tags
<META NAME="ROBOTS" CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">
• Robots.txt
# /robots.txt file for http://goto.com/
# disallow all robots from crawling GoTo
User-agent: *
Disallow: /
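The crawl loop above, including the robots.txt check, can be sketched as follows. This is a toy under stated assumptions: pages come from an in-memory dictionary (`PAGES`, hypothetical) instead of the network, "cataloging" is reduced to collecting the words on a page, and a single robots.txt is applied to every URL. Only `urllib.robotparser` from the Python standard library is real; every other name is illustrative.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

def crawl(seeds, pages, robots_txt, agent="ExampleBot"):
    """Breadth-first crawl over `pages`, honoring the given robots.txt text."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    catalog = {}
    seen = set(seeds)
    queue = deque(seeds)          # start with the list of seed URLs
    while queue:
        url = queue.popleft()
        if url not in pages or not rules.can_fetch(agent, url):
            continue              # unknown page, or excluded by robots.txt
        text, links = pages[url]  # "get the page"
        catalog[url] = sorted(set(text.lower().split()))  # toy cataloging
        for link in links:        # follow links and repeat
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return catalog

# Hypothetical three-page site: url -> (text, outgoing links).
PAGES = {
    "http://site.example/": (
        "Cheap flights and hotels",
        ["http://site.example/deals", "http://site.example/private/admin"],
    ),
    "http://site.example/deals": ("Flight deals", ["http://site.example/"]),
    "http://site.example/private/admin": ("Admin console", []),
}
ROBOTS = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    for url, words in crawl(["http://site.example/"], PAGES, ROBOTS).items():
        print(url, words)
```

With the GoTo robots.txt shown on the slide (`Disallow: /` for all agents), the same loop would catalog nothing.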
Some Search Engine Examples
Inktomi
• Infrastructure only – you pay for the search results
• Used to power Yahoo! (now Google), HotBot, many others
• Now typically a fall-through placement (bidded or other paid inclusion first, then Inktomi results)
Google
• Sergey and Larry
• Powers Yahoo!, virgin.net, some others
• Searching for a revenue model
Inktomi ‘Slurp’ Crawler
Slurp Characteristics
• Starts with active submitted URLs
• Hierarchy of Importance
– Page Title
– Description meta
– Keyword meta
– Text in document (not in images)
• No frames
• Looks for spoofing tricks (drop page)
• 4 week full cycle (constant incremental)
• Many different indices created (for various customers), different depths, etc.
Some Cataloging Approaches (cont.)
Google
Backrub/Googlebot crawler
PageRank™
• Page A, Pages linking to A T1..Tn, Links on A C(A)
• PR(A) = (1-d) + d(PR(T1)/C(T1)+…+PR(Tn)/C(Tn))
• ~probability distribution that random surfer hits a page based on links
Cache the documents (no kidding)
All kinds of tweaks to the PageRank, including:
• Domain tweaks (.org, .gov, .edu)
• Serious bias against large pages
• Bias against dynamic pages (.asp, .jhtml, .jsp)
Check out http://www.searchengineworld.com/google
Original design at http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
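The PageRank formula above can be evaluated by simple repeated substitution. The sketch below does that for a tiny link graph; it covers only the formula as written on the slide, not Google's production system or the tweaks listed. Here `d` is the damping factor from the formula, and the value 0.85, the iteration count, and the three-page graph are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to, so
    C(T) is len(links[T]).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)
    pr = {p: 1.0 for p in pages}  # starting guess
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in inbound[p])
            for p in pages
        }
    return pr

# Hypothetical graph: A and B both link to C, and C links back to A.
GRAPH = {"A": ["C"], "B": ["C"], "C": ["A"]}

if __name__ == "__main__":
    for page, rank in sorted(pagerank(GRAPH).items()):
        print(page, round(rank, 3))
```

C collects links from two pages and ranks highest; B, which nothing links to, is left with the floor value 1-d.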
Who’s the ‘biggest’ Search Engine
What is ‘big’
• Number of documents indexed (SearchEngineWatch, 11/8/200)
KEY: GG=Google, FAST=FAST, WT=WebTop.com, INK=Inktomi, AV=AltaVista, NL=Northern Light, EX=Excite, Go=Go (Infoseek).
Who’s the ‘biggest’ Search Engine
What is ‘big’
Searches/Day – Total Web 500mm/day (ptr estimate)
• Yahoo! – 100mm
• Alta Vista – 50mm (International too)
• Google – 50mm
• Inktomi – 40mm
• Everyone else – 10mm or fewer
Where’s GoTo? Hint
Let’s Talk About GoTo
Basic Business Model – Middlemen for Textual Advertisements (Search Results)
• Advertisers provide us Search Listings (Title, URL, Description, bid) for a search term
• We charge advertisers for user clicks on Search Listings
• We serve search listings to our own site (www.goto.com - 5%) and to partner sites (affiliates like Alta Vista, AOL, Netscape, CNet, etc. – 95%)
Since we make money when people search (and click), we pay for sites to include our listings
Live auction for search results
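A minimal sketch of the model above: listings for a term are ordered by bid, and an advertiser is charged when a listing is clicked. The field names follow the slide (Title, URL, Description, bid); charging exactly the bid amount per click, keeping amounts in cents, and the sample advertisers are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Listing:
    advertiser: str
    title: str
    url: str
    description: str
    bid_cents: int  # assumed: the amount charged for each click

def results_for(term, listings_by_term):
    """Return the listings for a search term, highest bid first."""
    return sorted(
        listings_by_term.get(term, []),
        key=lambda listing: listing.bid_cents,
        reverse=True,
    )

def record_click(listing, balances_cents):
    """Charge the advertiser for one click and return the new balance."""
    balances_cents[listing.advertiser] -= listing.bid_cents
    return balances_cents[listing.advertiser]

# Hypothetical data: two advertisers bidding on one term.
LISTINGS = {
    "used cars": [
        Listing("acme", "Acme Autos", "http://acme.example", "Used cars", 25),
        Listing("zed", "Zed Motors", "http://zed.example", "Certified cars", 40),
    ]
}
BALANCES = {"acme": 1000, "zed": 1000}

if __name__ == "__main__":
    ranked = results_for("used cars", LISTINGS)
    print([listing.advertiser for listing in ranked])
    print(record_click(ranked[0], dict(BALANCES)))
```

Because the order depends only on the current bids, an advertiser who raises a bid moves up the next time the term is searched, which is what makes the auction "live".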
The Scale of Operations
Search Volume – 70mm+/day, capacity for 210mm/day
300mm impressions/day
10mm clicks/day – Med/Large Phone company
6mm+ search listings
40,000+ advertisers
Wow
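To put the daily figures above into per-second terms, a quick back-of-the-envelope calculation helps. It uses only the numbers on this slide and assumes traffic is spread evenly across the day, which real traffic is not, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Daily figures from the slide ("70mm+" and similar are treated as lower bounds).
DAILY = {
    "searches": 70e6,
    "search_capacity": 210e6,
    "impressions": 300e6,
    "clicks": 10e6,
}

def per_second(per_day):
    """Average per-second rate, assuming an even spread over the day."""
    return per_day / SECONDS_PER_DAY

def summary(daily=DAILY):
    return {
        "searches_per_sec": per_second(daily["searches"]),
        "capacity_per_sec": per_second(daily["search_capacity"]),
        "clicks_per_sec": per_second(daily["clicks"]),
        "headroom": daily["search_capacity"] / daily["searches"],
        "impressions_per_search": daily["impressions"] / daily["searches"],
        "clicks_per_impression": daily["clicks"] / daily["impressions"],
    }

if __name__ == "__main__":
    for name, value in summary().items():
        print(f"{name}: {value:.2f}")
```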
Systems Strategic Bombing View
[Diagram: three system groups: Search Serving Systems (searches to www.goto.com and affiliate partners); Advertiser Management Systems; and Event Tracking, Fraud Detection, Data Reporting. Components shown: Advertiser Self-Management on the Web (DTC), Customer Service (Silknet), Editorial Processing, Account Monitoring, Event Repository & Data Marts, Click-Through Protection, Oracle Financials. Data flows shown: Search Listings; Searches, Clicks, etc.]
It Can’t be that Simple, Right?
Right!
[Diagram: deployment map of GoTo's back-office systems across the Sunnyvale, Pasadena, and Reston sites, linked by VPN. It shows the individual hosts and the AM, CRM, DTC, EPS, OLS, Stats, and OF services and databases running on them. Notes on the diagram:]
• All EJBs are accessed via the load-balanced name
• All EJBs access the database
• DTC/OLS Dynamos talk to EJB services
• All eService instances talk to both databases
• eService instances talk directly to the OF database
• All instances of EPS EJBs or Agents talk to the databases
• AM periodically updates finance and gets balance updates from OF
• Stats pushes CTP2.0 data to AM table in live_CRM
• CRM uses Stats for RunRate data
• DTC uses Stats for Reports/prediction
It Can’t be that Simple, Right?
GoTo’s systems seem deceptively simple.
GoTo’s pay-for-performance search product seems simple to execute – advertisers provide the content in the form of search listings, the content is ordered by bid price, and advertisers are charged for resulting clicks.
The complexity of these systems comes from the scale of the problem (number of advertisers, search listings, searches per day, etc.), in addition to some non-apparent complications (e.g., fraud detection).
Architecture Features
High Availability -- Noah’s Ark Approach – no single point of failure
• Load balancers
• State migration
Scalability: no architectural changes to scale serving capacity.
Extensibility: can add search features incrementally.
Distributed content: multiple sites currently serving all partners.
Advertiser Management
Advertiser Tools
DirecTraffic Center®
Functions – manage account balance, report on activity, real-time bid charges, add/modify/delete search listings
ATG/Dynamo (jhtml)/Java, EJB search Listing services (BEA/Weblogic), custom cache reporting scheme based on Oracle 8i
Advertiser Management Systems
Account Monitoring
The real ‘special sauce’
• Listens to real-time clicks and monitors account activity to process notifications, automated changes, status changes
• Manages credit limits, monthly advertiser budgets, activation and de-activation of accounts, and over 300 different business rules around accounts
• EJB – Weblogic
Editorial Processing
We are a publishing business
• 100 editors
• Workflow for 50,000-100,000 work orders a month
• Review all listings (with some help)
• EJB/Desktop App (Swing)
Fraud Detection and Reporting
Event Processing – What Are Events?
LWES – Light Weight Event Systems
UDP-multicast based events thrown by front end systems
Events include
• Searches
• Clicks (redirects)
• Navigation
Events are Key/Value pairs
‘Caught’ by separate Journaling Systems
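The event flow above can be sketched as follows: a front-end system packs an event as key/value pairs and throws it at a UDP multicast group, where a separate journaling process would catch it. This is not the LWES wire format. The URL-encoded text encoding, the group address `239.1.1.1`, and port `9191` are all assumptions chosen to keep the example short.

```python
import socket
import struct
from urllib.parse import parse_qsl, urlencode

def encode_event(event_type, fields):
    """Pack an event as URL-encoded key/value pairs (illustrative format)."""
    return urlencode({"type": event_type, **fields}).encode("ascii")

def decode_event(datagram):
    """Unpack a datagram into a dict; every value comes back as a string."""
    return dict(parse_qsl(datagram.decode("ascii")))

def emit(datagram, group="239.1.1.1", port=9191):
    """Fire-and-forget send to a multicast group (address is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # TTL 1 keeps the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                        struct.pack("b", 1))
        sock.sendto(datagram, (group, port))
    finally:
        sock.close()

if __name__ == "__main__":
    packet = encode_event("click", {"term": "used cars", "rank": 2})
    print(decode_event(packet))
```

UDP gives no delivery guarantee, so the sender never blocks on the journalers; in exchange, a design like this has to tolerate an occasional lost event.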
What do we do with these events?
Result clicks (i.e., the clicks we charge advertisers for) go to fraud detection
• Patent-pending system that monitors behavior on our web site to detect potentially fraudulent activity. The systems analyze millions of transactions daily for suspicious behavior, whether malicious or benign, and perform sophisticated rule-based and statistically-derived event filtering.
• GoTo’s Fraud Squad of 8 developers and analysts constantly monitors and improves the fraud detection techniques and tools, and manages the issue treatment and resolution processes.
More About Fraud
Fraud Detection -- Attacks and Filters
Attacks
• Inadvertent
– Crawling spiders run amok
– Advertisers testing their own listings
• Malicious
– Stockholder -- the revenue goosers
– Advertiser vs. advertiser
– Bored crackers
Filters
• Deterministic -- rules-based filters covering user sessions, IP addresses and search terms. The deterministic filters catch all the blatant abuses (repetitive clicking, repetitive searching, “speed” clicking).
• Probabilistic -- behavior pattern based, these filters discard anomalous click groupings. The probabilistic filters are very good at catching subtle abuses of advertiser resources: traversal of consecutive paid listings, randomized but obviously scripted clicking, expensive clicking.
• Both deterministic and probabilistic filters are routinely updated to reflect changes in site usage patterns.
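A deterministic filter of the kind described above can be as simple as a counting rule. The sketch below flags repetitive clicking: more than `max_clicks` clicks from one IP address on one listing inside a sliding time window. The thresholds, the tuple layout, and the sample data are assumptions for illustration and are not GoTo's actual rules.

```python
from collections import defaultdict

def flag_repetitive_clicks(clicks, window_seconds=60, max_clicks=3):
    """Flag clicks that exceed `max_clicks` per (ip, listing) per window.

    `clicks` is an iterable of (timestamp, ip, listing_id) tuples,
    sorted by timestamp.
    """
    recent = defaultdict(list)  # (ip, listing) -> timestamps still in window
    flagged = []
    for timestamp, ip, listing in clicks:
        history = recent[(ip, listing)]
        # Drop timestamps that have aged out of the window.
        history[:] = [t for t in history if timestamp - t < window_seconds]
        history.append(timestamp)
        if len(history) > max_clicks:
            flagged.append((timestamp, ip, listing))
    return flagged

# Hypothetical click stream: one address hammers listing L1.
CLICKS = [
    (0, "10.0.0.1", "L1"),
    (0, "10.0.0.2", "L1"),
    (5, "10.0.0.1", "L1"),
    (10, "10.0.0.1", "L1"),
    (15, "10.0.0.1", "L1"),
    (20, "10.0.0.1", "L1"),
    (100, "10.0.0.2", "L1"),
]

if __name__ == "__main__":
    print(flag_repetitive_clicks(CLICKS))
```

Flagged clicks would simply not be billed; the first `max_clicks` clicks in a window are still treated as legitimate.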
How do you do this in near-real-time?
Data Pipeline
• The ‘backbone’ of fraud detection
• A flexible array (~30) of commodity machines that perform simple aggregations and other arithmetic calculations in a networked and coordinated way
• A control and processing language used to describe the required calculations, and processed by the data pipeline machines.
Click Scoring
• Assignment of a click score for click events that classifies them into various ‘buckets’ of validity.
• Formulas that define the ‘buckets’ based on historical patterns of behavior of the site, and analysis of previous fraudulent attempts.
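Click scoring as described above amounts to computing a number for each click and mapping it to a validity bucket. The slide does not give the formulas, so the sketch below substitutes a plain z-score against historical behavior; the thresholds, the bucket names, and the sample history are all hypothetical.

```python
import statistics

def click_score(value, history):
    """Distance of `value` from the historical mean, in standard deviations."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    if spread == 0:
        return 0.0  # no variation in the history, so nothing to compare against
    return abs(value - mean) / spread

def bucket(score, suspect_at=2.0, invalid_at=4.0):
    """Map a score to a validity bucket (thresholds are assumptions)."""
    if score >= invalid_at:
        return "invalid"
    if score >= suspect_at:
        return "suspect"
    return "valid"

# Hypothetical history: clicks per session observed on the site.
HISTORY = [2, 4, 4, 4, 5, 5, 7, 9]

if __name__ == "__main__":
    for clicks_in_session in (5, 10, 15):
        score = click_score(clicks_in_session, HISTORY)
        print(clicks_in_session, score, bucket(score))
```

Keeping the scoring formula separate from the bucket thresholds lets either one be retuned as site usage patterns change.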
Search Serving Systems
Search Serving Systems
[Diagram: serving and back-office tiers, each labeled with its hardware platform and the technology/product used.]
Load Balancing: Foundry ServerIron, Foundry BigIron/FastIron
Application/Web Servers (n): Sun 420R, Solaris 2.6; Apache, mod_perl; GoTo Cache Server; Oracle OCI Drivers; JDBC Connection Pooling; Custom-Developed Load Balancing
Data: Oracle 8i, Quest SharePlex for Oracle; Sun E4500 (Database)
Event Journalers: Sun 420R
Fraud Detection (n): RedHat Linux 6.2, VA Linux; Custom-Developed Fraud Detection
Data Warehouse: Oracle 8i, Informatica, Business Objects; 2x Sun E4500 (Database), 2x Sun E4500 (Informatica ETL), 2x Sun E450 (Data Marts/Business Objects); 10.5 TB SAN (StorageTek/MTI)
Other labels: Internet; Content Load Balancing
Legend: Multiple Sites; Common to All Serving Sites; Backoffice Sites Only
The Nitty-Gritty
Search Serving Platforms:
• 100+ Sun e420R, 450MHz (4), 4GB
• ATG/Dynamo/Java, and Apache/mod_perl
• Gigabit site backbone
• InterNAP
• Multiple (3) co-location facilities
• Search serving feeds include HTML and XML, all through HTTP (1.0 or 1.1)
• Global Load Balancing (Arrowpoint)
• Distributed content caching (Akamai)
Backend Platforms:
• Data repository (16TB) for search and click events – several (4) e4500 Sun/Oracle 8i machines connected to an MTI SAN
• Fraud Detection through an array (3) of Intel/Linux machines, utilizing custom detection systems
• CRM via Silknet (NT/2000)
• N-tier application backbone via EJB (Weblogic) servers – application integration all through XML
• Complete DR site for fast recovery
Facilities
6 Facilities:
Search Serving Sites
• Global Center – Sunnyvale CA
• Cable & Wireless – Reston VA
• ESAT – Dublin, Ireland
Offices
• Pasadena
• San Mateo
• Raleigh-Durham
• London
Development & Test Site
• Qwest CyberCenter – Burbank CA
Backend Processing Site (New)
• Las Vegas, Nevada
Search Serving Performance
Network Operations Center
GoTo Technology Organization
Three Major Technology Groups (groupings):
• Development Groups (4)
• Technical Operations
• Architecture and Planning
About 115 people.
Number/Email to Remember:
Me – 626-685-5743, [email protected]
The perils of an open office plan
The future…
Stickiness models are dead
The vultures are circling…
The end for ‘search engines’
Everyone needs a revenue model
• Search Portal?
• Pay for placement the norm
References
Web Sites about Search Engines
www.searchenginewatch.com
www.searchengineworld.com
Services
www.wordtracker.com
Articles