Ernestina Menasalvas
emenasalvas@fi.upm.es
Facultad de Informática, Universidad Politécnica de Madrid
May 2004

Introduction and motivation
• Internet as a communication channel.
• Technology needed to develop new services, security, infrastructure, analysis.
• Web mining to analyze usage patterns so that the services respond to user needs.
• Most of the web mining projects developed so far have not taken into account the context in which they were developed:
– Competitive society
– Success criteria depend on both:
• User satisfaction
• Increase in sponsors' benefit
• The gap between technology development on the web and business factors is increasing, and as a side effect it generates a separation between what technologists develop and what companies need.
• Knowing that the problem exists is just the beginning…
• Technological projects have to be integrated into the global strategy of the company.

The problem
• Innovative ideas in e-commerce are vaguely defined, so they lose focus and precision.
• New technologies are being applied, consuming resources but without appropriate financial or economic benefits.
• The growth of web activity and its participation in every daily activity (commercial, educational, news, ...) is not being matched by a corresponding number of services.
• Services are considered insufficient.
• Thus, site sponsors have to improve the services offered to satisfy the increasing demand.
• On the other hand, the growth in offers will bring a growth in demand, which will lead consumers to ask for better services.
• Web mining projects have to be planned as one more project in the global strategy of the company.

Web Site personalization
• Optimization and personalization of the user's web experience is crucial for attracting and retaining electronic, web-based commerce customers.
• Try to maintain the one-to-one relationship.
• Identifying future behaviour is crucial for the site to act proactively.
• Information about user experience is captured in clickstream logs: pages viewed, timing, and sequence.
• Solutions given:
– Clustering of users
– Clustering of pages
– Most visited path
– Recommender systems
– …
• The questions:
– How to deploy?
– How has the method been evaluated?
– How does it help the company?
– How does it evolve over time?
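As an illustration of what a clickstream log (pages viewed, timing, sequence) makes possible, here is a minimal Python sketch that groups clicks into sessions and finds the most visited path. The record layout, the sample data, the 30-minute timeout and the function names are assumptions made for this example, not part of the original talk.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream records: (user_id, timestamp_in_seconds, page).
CLICKS = [
    ("u1", 0, "/home"), ("u1", 20, "/catalog"), ("u1", 45, "/cart"),
    ("u2", 5, "/home"), ("u2", 30, "/catalog"),
    ("u3", 10, "/home"), ("u3", 40, "/catalog"), ("u3", 90, "/cart"),
]


def build_sessions(clicks, timeout=1800):
    """Group clicks per user into sessions, splitting on idle gaps > timeout."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def most_visited_path(sessions, length=2):
    """Return (path, count) for the most frequent contiguous page sequence."""
    counts = Counter(
        tuple(s[i:i + length])
        for s in sessions
        for i in range(len(s) - length + 1)
    )
    return counts.most_common(1)[0] if counts else None


sessions = build_sessions(CLICKS)
print(most_visited_path(sessions, length=2))
```

A pattern like this answers none of the four questions above by itself; it only produces the raw material that deployment and evaluation have to work with.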

Web Mining project evaluation
• The criteria used to evaluate the success of a site do not take external (commercial) aspects into account.
• Site aspects such as increased sales volume, fraud reduction, customer retention, and competitive prices are not explicitly tackled.
• Success in web sites is measured in terms of efficiency and quality:
– Efficiency: number of pages accessed during a session, length of the session, and actions performed
– Quality: response time of the site to user requests, page accessibility, visitors per page, …
• Company success is evaluated in terms of:
– Income, outcomes, expenses
– ROI, market presence
• The differences between the criteria used to evaluate the success of any project in the enterprise and those used for a web project are at the root of the incomplete success of web mining.
• Site sponsors do not evaluate commercial and financial aspects and rely only on vague commercial notions.
• Success in terms of use, structure and content has to be linked to the achievement of the company's business goals.
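The site-level measures listed above (pages per session, session length, visitors per page) are simple to compute, which is part of why they dominate site evaluation. The following Python sketch shows one way to derive them; the session layout and names are hypothetical, and sessions are assumed to be non-empty.

```python
from collections import Counter
from statistics import mean

# Hypothetical sessions: each is a list of (timestamp_in_seconds, page) hits.
SESSIONS = [
    [(0, "/home"), (20, "/catalog"), (65, "/cart")],
    [(0, "/home"), (30, "/catalog")],
]


def site_metrics(sessions):
    """Compute usage-oriented efficiency and quality measures for a site."""
    pages_per_session = [len(s) for s in sessions]
    session_lengths = [s[-1][0] - s[0][0] for s in sessions]
    # Count each page at most once per session.
    visitors_per_page = Counter(
        page for s in sessions for page in {p for _, p in s}
    )
    return {
        "avg_pages_per_session": mean(pages_per_session),
        "avg_session_length_secs": mean(session_lengths),
        "visitors_per_page": dict(visitors_per_page),
    }


print(site_metrics(SESSIONS))
```

None of these numbers mentions income, expenses or ROI, which is exactly the gap between site criteria and company criteria that this slide describes.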

Web Mining project management

• An enterprise is a system designed to fulfil certain goals by means of the integration of different resources.
• Subsystems are at the same time interrelated and interdependent.
• When the company uses the Web as a channel, all the services, infrastructure, … have to be seen as one of the subsystems.
• The success of a solution in the web subsystem has to be related to the behaviour of the rest of the subsystems.
• Web mining projects are concerned with the Web subsystem.
• So a web mining project is not only an IT problem.
• Apply a project management methodology to control the process: a project manager is needed -> a different role from the data miner.
• Identify data mining problems.
• For each of them apply CRISP-DM.

Web Mining Project management (cont)

• To properly deal with a data mining project we need explicit information about the company:
– Structure of the company (departments, sections, channels, …)
– Goals of the company and success criteria (both at the higher level and at the department level)
• Company environment, identify:
– Resources, constraints, and any factor that can determine the goal analysis and the development of a web project
– Web project goals and their relationship with the goals of the company
• To evaluate whether the web mining project results contribute to the fulfilment of the company goals:
– The web site is not usually the end but the means.
– It is one of the channels that the company uses to achieve its goals.
– So for a site to be considered successful, the activities carried out through the site must generate value for the company.
• Traditional approaches only analyze the site from the user perspective, but the actions of the users have to generate value for the company.
• It is a CRM project
• Web project plan generation

CRM project – the three legs

[Figure: CRM ecosystem diagram with three legs: Operational CRM, Analytical CRM and Collaborative CRM. Labels in the diagram: Back Office (ERP/ERM, Order Manag., Supply Chain Mgmt., Order Prom., Legacy Systems); Front Office (Sales Automation, Service Automation, Marketing Automation); Mobile Office (Field Service, Mobile Sales); Vertical Apps., Category Mgmt., Marketing Automation, Campaign Mgmt.; Customer Activity, Customers, Products, Data Warehouse; Customer Interaction (Voice (IVR, ACD), Conferencing, Web Conferencing, E-mail, Response Management, Fax, Letter, Direct Interaction); Closed-Loop Processing (EAI Toolkits, Embedded/Mobile Agents).]

Data Mining
[Figure: pyramid showing the increasing potential to support business decisions and the relationship with the roles involved (End User, Business Analyst, Data Analyst, DBA). Layers, from top to bottom:]
– Making Decisions
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration; OLAP, MDA; Statistical Analysis, Querying and Reporting
– Data Warehouses / Data Marts
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Fact Gap

“Fact Gap”: the difference between the available information and the ability to make decisions based on this information. (Gartner Group)

Data Mining gives the intelligence

• Databases give the data.
• But intelligence is needed to explore the data to find the patterns, rules and ideas that explain what is going on and predict what will happen.
• Techniques and tools are needed to add this intelligence to data in order to extract the maximum benefit from data.
• But tools alone (nowadays) do not provide the intelligence; this has to be provided by EXPERTS and translated into the data for better understanding.

Data warehouses and databases are the support

Data Mining standard process model: CRISP-DM

Problem Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

Building the bridge

• In order to provide users with the most appropriate solution, data to be analyzed have to be enriched with business information

• Business problems have to be translated to data mining problems

• Results have to be understandable not only by data mining experts but also by end users

• The semantics underlying the data mining solution have to be settled

Deeper analysis of Personalization
• What is personalization?
• Observe user-web page interactions to identify patterns that:
– indicate high-level user activity,
– anticipate future user activity,
– make it possible to act proactively
• What is going to be personalized?
– The site: this means pages adapted to the user's behaviour or pattern
• Why is personalization needed?
– To improve the site performance
– The web is just another channel
– Site performance has to do with advancing the goals of the company
• Who is the user?
– Navigator
– Customer

Web Data to be analyzed

• In any web mining problem we have data related to:
– Pages
– Navigators and navigation
– Customers and their transactions
• Web logs are just the beginning
• Not only the data has to be taken into account but also all the circumstances under which the data were collected:
• Environment
– General
– Organization-related
– Customer-related

Environment
• Affects both directly and indirectly the way activities occur. Among the factors to take into account:
– Legal conditions
– Technological conditions
– Demography
– Ecological conditions (weather, transport, communications)
– Cultural and social conditions
– Geographical situation
• Take into account the location of the site, of the navigator, …

Information to be added
• Departments:
– The same concept can have different meanings depending on the department
– A product for marketing is not the same as for production
• Products, services:
– Data per se of the object: size, color, …
– Data relevant for the company: profit margin, top ten, …
– How it is presented on the web
• People, consumers in general:
– Static data: gender, demographic information (varies over time but at a particular moment it is static)
– Roles: …
– Behaviour with the company being analyzed: number and kind of transactions he/she performs
– Behavioural data related to the environment (economy, legal constraints, climate, …)
• Navigators:
– Web log: location (IP), time, browser, …
– Behaviour: compared with the “normal” one, if any, to discover mood, different location, …
• Dates:
– A date by itself has no meaning
– Legal and fiscal periods, holidays, weekends
– Opening, closure, …
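Since a date by itself has no meaning, one possible enrichment step is to attach business context to it. The Python sketch below is only an illustration: the holiday set is a hypothetical business calendar, and the calendar quarter stands in for whatever legal or fiscal periods the company actually uses.

```python
from datetime import date

# Hypothetical business calendar; a real one would come from the company.
HOLIDAYS = {date(2004, 5, 1)}


def enrich_date(d, holidays=HOLIDAYS):
    """Attach business-relevant attributes to a raw date."""
    return {
        "date": d.isoformat(),
        "is_weekend": d.weekday() >= 5,      # Saturday = 5, Sunday = 6
        "is_holiday": d in holidays,
        "quarter": (d.month - 1) // 3 + 1,   # stand-in for a fiscal period
    }


print(enrich_date(date(2004, 5, 3)))
```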

Data enrichment
• There is no method, no model to follow. It is more of an art
• It comes only with experience
• Projects for the same domain share the enrichment:
– A model could be established
– Evaluate if data are appropriate to mine
– Evaluate the kind of patterns that can be obtained
– Evaluate if a certain pattern cannot be obtained
• Metadata is needed about the data
– Meaning for the business of each value, attribute, page, action, …
• Metadata for each attribute has to include semantics:
– Meaning: group according to it: demographic, behavioural, environmental, social, cultural
– Business value
– Circumstances
– Constraints
– Relationship with other concepts
• Ontology of concepts ???
• Integrate metadata so the mining activity deals with them.
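One way to picture the per-attribute metadata listed above is as a small record type. This Python sketch is an assumption about how such metadata could be represented, not a structure taken from the talk; the field names mirror the bullet list and the catalog entries are invented.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeMetadata:
    """Semantic description of one attribute in the data to be mined."""
    name: str
    meaning_group: str   # demographic, behavioural, environmental, social, cultural
    business_value: str
    circumstances: str = ""
    constraints: list = field(default_factory=list)
    related_concepts: list = field(default_factory=list)


def group_by_meaning(catalog):
    """Group attribute names by their meaning group."""
    groups = {}
    for meta in catalog:
        groups.setdefault(meta.meaning_group, []).append(meta.name)
    return groups


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    AttributeMetadata("gender", "demographic", "segmentation"),
    AttributeMetadata("pages_per_session", "behavioural", "engagement indicator"),
    AttributeMetadata("weather", "environmental", "demand context"),
    AttributeMetadata("age_range", "demographic", "segmentation"),
]

print(group_by_meaning(CATALOG))
```

Keeping the metadata in a machine-readable form like this is what would let the mining activity use it directly, as the last bullet asks.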

Data Modelling and deployment

• Once data are enriched, the extracted patterns can be interpreted according to:
– User profiles
– Session value (according to certain goals)
– Period of the day
• The solution has to be deployed and integrated in the site structure.
• Patterns evolve over time as new data arrive
• Models have to be refined
• Establish the basis for the model to be refined without a decrease in performance

Web Mining infrastructure

[Figure: architecture diagram. Layers: DECISION LAYER, SEMANTIC LAYER, CRM SERVICES PROVIDER LAYER. Other labels: USERS, User HTTP Client, Original WEBSITE, User Agent, Interface Agent, User Model, Planning Agents, VWi, Operational PLANS, Action Plan, HTTP Request, HTTP Response, Web Logs, Models, Services, Information Agents.]

Case-study: act according to the value of the current session

• Patterns to help:
– Predict user behavior based on current behavior, not identity.
– Abstract user behavior with varying degrees of granularity => subsessions.
– Estimate the value of the session in order to act accordingly.
• Subsessions capture/approximate user state information.
• Key concept: frequent behavior paths.
• Markov model to predict the next set of pages and behaviour
• Webhouse to store information about users
• Modify APACHE: pop-ups and precaching

Case-study

1. Find behavior rules
Partial tree:
• Define break points as decision points in the path. Use them to create rules.
• Knowing PIND allows us to predict a set of pages to follow....

[Figure: partial tree in which a PIND page is followed by PDEP pages, with the break points marked.]
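To make the idea of break points concrete, here is a minimal Python sketch. It assumes that a break point is a path prefix after which the observed paths branch into more than one next page; that reading, the function name and the page names are assumptions for this example, and the talk's exact definition may differ.

```python
from collections import defaultdict


def find_break_points(paths):
    """Return the path prefixes after which observed paths branch."""
    children = defaultdict(set)
    for path in paths:
        for i in range(len(path) - 1):
            children[tuple(path[:i + 1])].add(path[i + 1])
    # A prefix is treated as a break point when more than one page follows it.
    return {
        prefix: sorted(next_pages)
        for prefix, next_pages in children.items()
        if len(next_pages) > 1
    }


# Hypothetical observed paths.
PATHS = [
    ["home", "exams"],
    ["home", "labs", "lab1"],
    ["home", "labs", "lab2"],
]

for prefix, options in find_break_points(PATHS).items():
    print(" > ".join(prefix), "->", options)
```

Each break point and the options that follow it can then be read as a candidate rule of the form "after this prefix, expect one of these pages".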

Behaviour rules
– Main page, Exams board
– Main page, Assignments board, Support material for Assignment 1
– Main page, Assignments board, Support material for Assignment 2

[Figure: tree of these paths starting at the main page, marking decision pages and target pages; numeric labels (5, 2, 4, 3, -3) appear on the nodes.]

2. Find Subsessions
• Sessions may be described in terms of subsessions. E.g., browse catalog, browse shipping information, browse privacy notices, perform purchase.
• Subsessions may be defined in a number of ways, according to the desired semantics. E.g., use breakpoints.

[Figure: path fragment with PDEP, PIND and PDEP pages.]
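As one example of defining subsessions with breakpoints, the Python sketch below cuts a session every time a break page is reached. Treating the break page as the last page of its subsession is an assumption made here; the page names and the function name are hypothetical.

```python
def split_into_subsessions(session, break_pages):
    """Split a page sequence so that each break page closes a subsession."""
    subsessions, current = [], []
    for page in session:
        current.append(page)
        if page in break_pages:
            subsessions.append(current)
            current = []
    if current:  # trailing pages after the last break page
        subsessions.append(current)
    return subsessions


session = ["home", "catalog", "item", "cart", "shipping", "pay"]
print(split_into_subsessions(session, {"catalog", "cart"}))
```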

Click-path Subsession Figure
• Real-time user web page access path, with identified frequent paths
• Web page access path expressed as a sequence of subsessions

3. Markov models to predict behavior and paths

[Figure: sessions (session1 to session6, ...) linked to behaviors (Behavior X, Behavior Y) through break points (BK N, BK M, BK P) and nodes Dep1, Dep2, Dep3.]
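A first-order Markov model is the simplest instance of this idea: the next page is predicted from the current page alone, using transition frequencies estimated from past sessions. The Python sketch below is a generic illustration under that assumption, with invented sessions; the model used in the case study may work at the level of subsessions or behaviors instead of single pages.

```python
from collections import Counter, defaultdict


def fit_markov(sessions):
    """Estimate first-order transition probabilities between pages."""
    counts = defaultdict(Counter)
    for s in sessions:
        for cur, nxt in zip(s, s[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {n: c / sum(nxts.values()) for n, c in nxts.items()}
        for cur, nxts in counts.items()
    }


def predict_next(model, page, k=1):
    """Return the k most likely next pages (ties broken alphabetically)."""
    probs = model.get(page, {})
    return sorted(probs, key=lambda p: (-probs[p], p))[:k]


# Hypothetical training sessions.
SESSIONS = [
    ["home", "labs", "lab1"],
    ["home", "labs", "lab2"],
    ["home", "exams"],
    ["home", "labs", "lab1"],
]

model = fit_markov(SESSIONS)
print(predict_next(model, "home"), model["home"])
```

The predicted set of pages is what a precaching or pop-up mechanism would consume.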

4. Per user analysis: average time spent in page

[Figure: chart of the average time spent per page. X axis: URLs (1 to 23); Y axis: Time (secs), from 0 to 60.]
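The average time per page can be estimated from consecutive timestamps in a user's sessions. The Python sketch below assumes that the time spent on a page is the gap until the next request in the same session, so the last page of each session contributes nothing; the data layout and names are hypothetical.

```python
from collections import defaultdict


def average_time_per_page(sessions):
    """Average seconds spent on each page, from gaps between requests."""
    totals, visits = defaultdict(float), defaultdict(int)
    for hits in sessions:
        for (ts, page), (next_ts, _) in zip(hits, hits[1:]):
            totals[page] += next_ts - ts
            visits[page] += 1
    return {page: totals[page] / visits[page] for page in totals}


# Hypothetical sessions of one user: lists of (timestamp_in_seconds, page).
USER_SESSIONS = [
    [(0, "/home"), (10, "/labs"), (40, "/exams")],
    [(0, "/home"), (30, "/labs"), (50, "/home")],
]

print(average_time_per_page(USER_SESSIONS))
```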

5. Online Value evolution

[Figure: chart of session value against the number of links traversed (1 to 15), with values from -5 to 35, for three sessions (session 1, session 2, session 3).]

Benefits of the algorithm

• Makes it possible to know at any point whether the ongoing navigation would be beneficial for the site, so that the site can be dynamically adjusted accordingly.
• Quantifies the value of a user session while he or she is navigating
• Makes the user-site relationship closer to real-life relationships
• The algorithm integrates the site/department goals:
– Sends pop-ups to students according to the exercises they have already done
– Professors can establish preferences and the rules are changed accordingly
– …
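A very simple way to quantify the value of a session while the user navigates is to keep a running total of per-page values and compare it with a threshold. The Python sketch below illustrates only that idea: the page values, the threshold and the additive scoring are assumptions for this example, not the algorithm presented in the talk.

```python
from itertools import accumulate

# Hypothetical page values assigned according to site/department goals.
PAGE_VALUES = {"home": 0, "labs": 2, "lab1": 3, "exams": -3}


def value_evolution(session, page_values=PAGE_VALUES):
    """Running session value after each traversed link."""
    return list(accumulate(page_values.get(p, 0) for p in session))


def should_act(session, threshold, page_values=PAGE_VALUES):
    """Decide whether the current session value justifies an action."""
    evolution = value_evolution(session, page_values)
    return bool(evolution) and evolution[-1] >= threshold


print(value_evolution(["home", "labs", "lab1", "exams"]))
```

Changing `PAGE_VALUES` is where preferences set by the site owner would enter such a scheme.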

Conclusion

• Without proper project management:
– It is difficult to obtain significant patterns
– Interpretation of the results is difficult
– The potential of the process is minimized
• Site goals have to be integrated
• Algorithms alone are of no use: the best algorithm does not always mean the best result
• The patterns have to be deployed in a proper architecture

THANKS!

QUESTIONS???