Data Warehouse & Mining Notes

Data Warehouse & Mining Notes for MU Engg students

  • ~ Arvind Pandi Dorai

    Lecturer, Computer Dept

    KJSIEIT

  • Chapter 1

    Introduction

    NEED OF A DATA WAREHOUSE

    In the 1960s, computer systems were used to maintain business data.

    As enterprises grew larger, hundreds of computer applications were needed to support business processes.

    In the 1990s, as businesses grew more complex, corporations spread globally & competition intensified, business executives became desperate for information to stay competitive & improve the bottom line.

    Companies need information to formulate business strategies, establish goals, set objectives & monitor results.

  • Data Warehouse

    Definition: A data warehouse is a relational DB that maintains huge volumes of historical data, so as to support strategic analysis & decision making.

    To take a strategic decision we need strong analysis, & for strong analysis we need historical data. Since ERP systems do not maintain historical data, the DW came into the picture.

  • Data Warehouse Features

    Subject oriented - subject-specific data marts.

    Integrated - data integrated into a single uniform format.

    Time variant - the DW maintains data over a wide range of time.

    Non volatile - data is never deleted & rarely updated.

  • Data Warehouse Objects

    Dimension Tables:

    Dimension table key
    Wide
    Textual attributes
    Denormalised
    Drill-down & Roll-up
    Multiple hierarchies

    Fact Tables:

    Foreign key
    Deep
    Numeric facts
    Transaction-level data
    Aggregate data

  • Star Schema

    A large, central fact table and one table for each dimension.

    Every fact points to one tuple in each of the dimensions and has additional attributes.

    Does not capture hierarchies directly.

    De-normalized system.

    Easy to understand, easy to define hierarchies, reduces the no. of joins.

  • Star Schema layout

  • Star Schema Example

  • SnowFlake Schema

    Variant of the star schema model.

    A single, large, central fact table and one or more tables for each dimension.

    Dimension tables are normalized, i.e. dimension table data is split into additional tables.

    The process of making a snowflake schema is called snowflaking.

    Drawbacks: time-consuming joins, slow report generation.

  • Snowflake Schema Layout

  • Fact Constellation

    Multiple fact tables share dimension tables.

    This schema is viewed as a collection of stars, hence it is called a galaxy schema, fact constellation or family of stars.

    Sophisticated applications require such a schema.

  • Fact Constellation

    Sales Fact Table: Store Key, Product Key, Period Key, Units, Price

    Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price

    Store Dimension: Store Key, Store Name, City, State, Region

    Product Dimension: Product Key, Product Desc

    The Sales & Shipping fact tables share the Store & Product dimension tables.

  • Chapter 2

    Metadata

    Meta Data: Data about data

    Types of Metadata:

    Operational Metadata

    Extraction & Transformation Metadata

    End-User Metadata

  • Information Package

    The IP gives special significance to the dimension hierarchies in the business dimensions & the key facts in the fact table.

  • Chapter 3

    DW Architecture

  • DW Architecture

    Data Acquisition: Data Extraction, Data Transformation, Data Staging

    Data Storage: Data Loading, Data Aggregation

    Information Delivery: Reports, OLAP, Data Mining

  • Data Acquisition

    Data Extraction:

    Immediate Data Extraction

    Deferred Data Extraction

    Data Transformation:

    Splitting of fields

    Merging of fields

    Decoding of fields

    De-duplication

    Date-time format conversion

    Computed or derived fields

    Data Staging
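    A minimal pandas sketch of a few of the transformation tasks listed above (decoding of fields, date-time format conversion, de-duplication & a computed field); the column names and data are assumed for illustration.

    import pandas as pd

    # Extracted source records (toy data, hypothetical column names).
    src = pd.DataFrame({
        "cust_name":  ["A. Shah", "A. Shah", "R. Iyer"],
        "gender_cd":  ["M", "M", "F"],
        "order_date": ["01/03/2013", "01/03/2013", "05/03/2013"],
        "qty":        [2, 2, 5],
        "unit_price": [100.0, 100.0, 250.0],
    })

    # Decoding of fields: replace source codes with uniform descriptive values.
    src["gender"] = src["gender_cd"].map({"M": "Male", "F": "Female"})

    # Date-time format conversion: bring source dates into one standard format.
    src["order_date"] = pd.to_datetime(src["order_date"], format="%d/%m/%Y")

    # De-duplication: drop records repeated across source extracts.
    src = src.drop_duplicates()

    # Computed / derived field: a value not stored in the source.
    src["amount"] = src["qty"] * src["unit_price"]

    print(src)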

  • Data Storage

    Data Loading:

    Initial Loading

    Incremental Loading

    Data Aggregation:

    Based on fact tables

    Based on aggregate tables

  • Information Delivery

    Reports: aggregate data

    OLAP: multidimensional analysis

    Data Mining: extracting knowledge from the database

  • Chapter 4

    Principles of Dimensional Modeling

    Dimensional Modeling:

    A logical design technique to structure (arrange) the business dimensions & the fact tables.

    DM is a technique to prepare a star schema.

    Provides the best data access.

    The fact table interacts with each & every business dimension.

    Supports drill-down & roll-up.

  • Fully Additive Facts: When the values of an attribute can be summed up by simple addition to provide meaningful data, they are known as fully additive facts (e.g. sales amount).

    Semi-Additive Facts: When the values of an attribute do not provide meaningful data when simply summed up, but do provide meaningful data when some other mathematical operation is applied to them (e.g. averaging an account balance), they are known as semi-additive facts.

    Factless Fact Table: A fact table in which numeric facts are absent.

  • Chapter 5

    Information Access & Delivery

    OLAP is a technique that allows users to view aggregate data across measurements along with a set of related dimensions.

    OLAP supports multidimensional analysis because data is stored in a multidimensional array.

  • OLAP Operations

    Slice: filtering the OLAP cube by fixing one attribute value.

    Dice: filtering on two (or more) attribute values to select a sub-cube.

    Drill-down: detailing or expanding attribute values.

    Roll-up: aggregating or compressing attribute values.

    Rotate (pivot): rotating the cube to view different dimensions.

    A small illustration in pandas follows.
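    A minimal sketch of slice, dice, roll-up & pivot on a toy sales table using pandas (data assumed); an OLAP server performs the same operations on a multidimensional array.

    import pandas as pd

    cube = pd.DataFrame({
        "product": ["iPod", "iPod", "TV", "TV"],
        "region":  ["East", "West", "East", "West"],
        "year":    [2012, 2012, 2013, 2013],
        "sales":   [100, 80, 150, 120],
    })

    # Slice: fix one dimension value (Product = iPod).
    slice_ = cube[cube["product"] == "iPod"]

    # Dice: fix values on two dimensions to select a sub-cube.
    dice = cube[(cube["product"] == "iPod") & (cube["region"] == "East")]

    # Roll-up: aggregate away the region & year detail.
    rollup = cube.groupby("product")["sales"].sum()

    # Rotate / pivot: view the data as Region x Product.
    pivot = cube.pivot_table(index="region", columns="product", values="sales", aggfunc="sum")

    print(slice_, dice, rollup, pivot, sep="\n\n")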

  • OLAP Operations: Slice and Dice

    [Figure: a Product × Time cube sliced on Product = iPod.]

  • OLAP Operations: Drill Down

    [Figure: drilling down the Product dimension against Time, along the hierarchy Category (e.g. Music Player) → Sub-Category (e.g. MP3) → Product (e.g. iPod).]

  • OLAP Operations: Roll Up

    [Figure: rolling up the Product dimension against Time, from Product (e.g. iPod) → Sub-Category (e.g. MP3) → Category (e.g. Music Player).]

  • OLAP Operations: Pivot

    [Figure: rotating the cube from a Product × Time view to a Product × Region view.]

  • OLAP Server

    An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multidimensional data structures.

    The OLAP servers available are:

    MOLAP server

    ROLAP server

    HOLAP server

  • Chapter 6 Implementation & Maintenance

    IMPLEMENTATION: Monitoring: Sending data from sources

    Integrating: Loading, cleansing, ...

    Processing: Query processing, indexing, ...

    Managing: Metadata, Design, ...

  • Maintenance

    Maintenance is an issue for materialized

    views

    Recomputation

    Incremental updating

  • View and Materialized Views

    View

    Derived relation defined in terms of base

    (stored) relations.

    Materialized views

    A view can be materialized by storing the tuples

    of the view in the database.

    Index structures can be built on the materialized

    view.

  • Overview

    Extracting knowledge

    Perform analysis

    Use DM Algorithms

  • Knowledge Discovery in Databases (KDD)

  • Steps In KDD Process

    Data Cleaning

    Data Integration

    Data Selection

    Data Transformation

    Data mining

    Pattern Evaluation

    Knowledge Presentation

  • Architecture of DM

  • DM Algorithms

    Association: relationships between item sets. Used in market basket analysis.

    Eg: Apriori & FP-tree

    Classification: classify each item into predefined groups.

    Eg: Naïve Bayes & ID3

    Clustering: items are divided into dynamically generated groups.

    Eg: K-means & K-medoids

  • Example: Market Basket Data

    Items frequently purchased together:

    Computer ⇒ Printer

    Uses:

    Placement

    Advertising

    Sales

    Coupons

    Objective: increase sales and reduce costs

    Called Market Basket Analysis, Shopping Cart Analysis

  • Transaction Data: Supermarket Data

    Market basket transactions:

    t1: {bread, cheese, milk}

    t2: {apple, jam, salt, ice-cream}

    tn: {biscuit, jam, milk}

    Concepts: An item: an item/article in a basket

    I: the set of all items sold in the store

    A Transaction: items purchased in a basket; it may have TID (transaction ID)

    A Transactional dataset: A set of transactions

  • Association Rule Definitions

    Association Rule (AR): an implication X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

    Support of AR X ⇒ Y (s): percentage of transactions that contain X ∪ Y.

    Confidence of AR X ⇒ Y (α): ratio of the number of transactions that contain X ∪ Y to the number that contain X.
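    A minimal sketch of the two measures just defined, computed over a toy list-of-sets transaction database (data assumed).

    def support(itemset, transactions):
        # Fraction of transactions containing every item of the itemset.
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(X, Y, transactions):
        # support(X ∪ Y) / support(X) for the rule X ⇒ Y.
        return support(set(X) | set(Y), transactions) / support(X, transactions)

    T = [{"bread", "cheese", "milk"},
         {"bread", "milk"},
         {"cheese", "jam"},
         {"bread", "cheese", "milk", "jam"}]

    print(support({"bread", "milk"}, T))        # 0.75
    print(confidence({"bread"}, {"milk"}, T))   # 1.0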

  • Association Rule Problem

    Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

    Link Analysis

  • Association Rule Mining Task

    Given a set of transactions T, the goal of association rule

    mining is to find all rules having

    support ≥ minsup threshold

    confidence ≥ minconf threshold

    Brute-force approach:

    List all possible association rules

    Compute the support and confidence for each rule

    Prune rules that fail the minsup and minconf thresholds

  • Example

    Transaction data

    Assume:

    minsup = 30%

    minconf = 80%

    An example frequent itemset:

    {Cocoa, Clothes, Milk} [sup = 3/7]

    Association rules from the itemset:

    Clothes ⇒ Milk, Cocoa [sup = 3/7, conf = 3/3]

    Clothes, Cocoa ⇒ Milk [sup = 3/7, conf = 3/3]

    t1: Butter, Cocoa, Milk

    t2: Butter, Cheese

    t3: Cheese, Boots

    t4: Butter, Cocoa, Cheese

    t5: Butter, Cocoa, Clothes, Cheese, Milk

    t6: Cocoa, Clothes, Milk

    t7: Cocoa, Milk, Clothes

  • Mining Association Rules

    Two-step approach:

    1. Frequent Itemset Generation

    Generate all itemsets whose support ≥ minsup

    2. Rule Generation

    Generate high confidence rules from each frequent

    itemset, where each rule is a binary partitioning of a

    frequent itemset

    Frequent itemset generation is still computationally

    expensive

  • Step:1 Generate Candidate & Frequent

    Item Sets

    Let k=1 Generate frequent itemsets of length 1

    Repeat until no new frequent itemsets are identified

    Generate length (k+1) candidate itemsets from length k frequent itemsets

    Prune candidate itemsets containing subsets of length k that are infrequent

    Count the support of each candidate by scanning the DB

    Eliminate candidates that are infrequent, leaving only those that are frequent
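    A minimal sketch of this level-wise loop, run on the transaction data of the earlier example (minsup given as a fraction); the join and prune steps are simplified but follow the Apriori property.

    from itertools import combinations

    def apriori(transactions, minsup):
        # Returns {frozenset(itemset): support count} for all frequent itemsets.
        required = minsup * len(transactions)

        # k = 1: frequent 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= required}
        all_frequent = dict(frequent)

        k = 1
        while frequent:
            # Generate length (k+1) candidates from the items of the frequent k-itemsets.
            items = sorted({i for s in frequent for i in s})
            candidates = [frozenset(c) for c in combinations(items, k + 1)]
            # Prune candidates containing an infrequent length-k subset.
            candidates = [c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k))]
            # Count support with one scan of the DB and keep the frequent ones.
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= required}
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    T = [{"Butter", "Cocoa", "Milk"}, {"Butter", "Cheese"}, {"Cheese", "Boots"},
         {"Butter", "Cocoa", "Cheese"}, {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
         {"Cocoa", "Clothes", "Milk"}, {"Cocoa", "Milk", "Clothes"}]
    freq = apriori(T, minsup=0.3)
    print(freq[frozenset({"Cocoa", "Clothes", "Milk"})])   # 3, i.e. sup = 3/7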

  • Apriori Algorithm Example

  • Step 2: Generating Rules From Frequent

    Itemsets

    Frequent itemsets ≠ association rules: one more step is needed to generate association rules.

    For each frequent itemset X, for each proper nonempty subset A of X, let B = X − A.

    A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, where

    support(A ⇒ B) = support(A ∪ B) = support(X)

    confidence(A ⇒ B) = support(A ∪ B) / support(A)
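    A minimal sketch of this rule-generation step, taking the support counts produced by a frequent-itemset miner (such as the apriori sketch shown earlier) as input.

    from itertools import combinations

    def generate_rules(frequent, minconf):
        # frequent: {frozenset: support count}; returns (A, B, confidence) triples for A ⇒ B.
        rules = []
        for X, sup_X in frequent.items():
            if len(X) < 2:
                continue
            for r in range(1, len(X)):                  # every proper nonempty subset A of X
                for A in map(frozenset, combinations(X, r)):
                    conf = sup_X / frequent[A]          # support(X) / support(A)
                    if conf >= minconf:
                        rules.append((set(A), set(X - A), conf))
        return rules

    # e.g. generate_rules(freq, minconf=0.8) with freq from the apriori sketch above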

  • Generating Rules: An example

    Suppose {2,3,4} is frequent, with sup=50%

    Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},

    with sup=50%, 50%, 75%, 75%, 75%, 75% respectively

    These generate the following association rules:

    2,3 ⇒ 4, confidence = 100%

    2,4 ⇒ 3, confidence = 100%

    3,4 ⇒ 2, confidence = 67%

    2 ⇒ 3,4, confidence = 67%

    3 ⇒ 2,4, confidence = 67%

    4 ⇒ 2,3, confidence = 67%

    All rules have support = 50%

  • Rule Generation

    Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L − f satisfies the minimum confidence requirement.

    If {A,B,C,D} is a frequent itemset, the candidate rules are:

    ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A,

    A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,

    AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD,

    BD ⇒ AC, CD ⇒ AB

    If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L).

  • Generating Rules

    To recap, in order to obtain A ⇒ B, we need to have support(A ∪ B) and support(A).

    All the required information for confidence computation has already been recorded in itemset generation. No need to see the data T any more.

    This step is not as time-consuming as frequent itemsets generation.

  • Rule Generation

    How to efficiently generate rules from frequent itemsets?

    In general, confidence does not have an anti-monotone property

    c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)

    But confidence of rules generated from the same itemset has an anti-monotone property

    e.g., L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)

  • Apriori Advantages/Disadvantages

    Advantages:

    Uses large itemset property.

    Easily parallelized

    Easy to implement.

    Disadvantages:

    Assumes transaction database is memory resident.

    Requires up to m database scans.

  • Mining Frequent Patterns

    Without Candidate Generation

    Compress a large database into a compact, Frequent-

    Pattern tree (FP-tree) structure

    highly condensed, but complete for frequent pattern

    mining

    avoid costly database scans

    Develop an efficient, FP-tree-based frequent pattern

    mining method

    A divide-and-conquer methodology: decompose mining

    tasks into smaller ones

    Avoid candidate generation: sub-database test only!

  • Construct FP-tree From A Transaction DB

    min_support = 0.5

    TID | Items bought             | (Ordered) frequent items
    100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
    200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
    300 | {b, f, h, j, o}          | {f, b}
    400 | {b, c, k, s, p}          | {c, b, p}
    500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

    Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

    [Figure: the resulting FP-tree. From the root {}: path f:4 → c:3 → a:3 → m:2 → p:2, with a:3 also having a child b:1 → m:1 and f:4 a child b:1; a second path c:1 → b:1 → p:1.]

    Steps:

    1. Scan the DB once, find frequent 1-itemsets (single item patterns)

    2. Order frequent items in frequency-descending order

    3. Scan the DB again, construct the FP-tree
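    A minimal sketch of the three construction steps above (count, order, insert), using the same five transactions; the header-table node-links are omitted for brevity, and ties in frequency are broken alphabetically, so tied items may be ordered slightly differently than in the figure.

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}                  # item -> FPNode

    def build_fp_tree(transactions, min_support_count):
        # Step 1: one scan to find the frequent single items.
        freq = Counter(item for t in transactions for item in t)
        freq = {i: c for i, c in freq.items() if c >= min_support_count}
        root = FPNode(None, None)
        # Steps 2 & 3: re-scan, keep frequent items, order by descending frequency, insert.
        for t in transactions:
            ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                child = node.children.get(item)
                if child is None:
                    child = node.children[item] = FPNode(item, node)
                child.count += 1
                node = child
        return root, freq

    T = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
    root, freq = build_fp_tree(T, min_support_count=3)   # min_support 0.5 of 5 transactions
    print(freq)   # same items & counts as the header table: f:4, c:4, a:3, b:3, m:3, p:3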

  • Benefits of the FP-tree Structure

    Completeness:

    never breaks a long pattern of any transaction

    preserves complete information for frequent pattern

    mining

    Compactness

    reduces irrelevant information: infrequent items are gone

    frequency-descending ordering: more frequent items are more likely to be shared

    never larger than the original database (not counting node-links and counts)

  • Mining Frequent Patterns Using FP-tree

    General idea (divide-and-conquer)

    Recursively grow frequent pattern path using the FP-

    tree

    Method

    For each item, construct its conditional pattern-base,

    and then its conditional FP-tree

    Repeat the process on each newly created conditional

    FP-tree

    Until the resulting FP-tree is empty, or it contains only

    one path (single path will generate all the combinations

    of its sub-paths, each of which is a frequent pattern)

  • Major Steps to Mine FP-tree

    1) Construct conditional pattern base for each

    node in the FP-tree

    2) Construct conditional FP-tree from each

    conditional pattern-base

    3) Recursively mine conditional FP-trees and

    grow frequent patterns obtained so far

    If the conditional FP-tree contains a single path,

    simply enumerate all the patterns

  • Step 1: FP-tree to Conditional Pattern Base

    Starting from the header table of the FP-tree, traverse the FP-tree by following the node-link of each frequent item.

    Accumulate all the transformed prefix paths of that item to form its conditional pattern base.

    Conditional pattern bases:

    item | cond. pattern base
    c    | f:3
    a    | fc:3
    b    | fca:1, f:1, c:1
    m    | fca:2, fcab:1
    p    | fcam:2, cb:1


  • Step 2: Construct Conditional FP-tree

    For each pattern base: accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base.

    m-conditional pattern base: fca:2, fcab:1

    m-conditional FP-tree: {} → f:3 → c:3 → a:3

    All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam


  • Mining Frequent Patterns by Creating

    Conditional Pattern-Bases

    Item | Conditional pattern-base  | Conditional FP-tree
    f    | Empty                     | Empty
    c    | {(f:3)}                   | {(f:3)}|c
    a    | {(fc:3)}                  | {(f:3, c:3)}|a
    b    | {(fca:1), (f:1), (c:1)}   | Empty
    m    | {(fca:2), (fcab:1)}       | {(f:3, c:3, a:3)}|m
    p    | {(fcam:2), (cb:1)}        | {(c:3)}|p

  • Step 3: Recursively mine the conditional

    FP-tree

    m-conditional FP-tree: {} → f:3 → c:3 → a:3

    Cond. pattern base of am: (fc:3); am-conditional FP-tree: {} → f:3 → c:3

    Cond. pattern base of cm: (f:3); cm-conditional FP-tree: {} → f:3

    Cond. pattern base of cam: (f:3); cam-conditional FP-tree: {} → f:3

  • Single FP-tree Path Generation

    Suppose an FP-tree T has a single path P.

    The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

    m-conditional FP-tree: {} → f:3 → c:3 → a:3

    All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

  • Classification

    Given old data about customers and payments, predict a new applicant's loan eligibility.

    [Figure: previous customers (Age, Salary, Profession, Location, Customer type) are used to train a classifier, here a decision tree with tests such as Salary > 5 K and Prof. = Exec; a new applicant's data is then classified as good/bad.]

  • Overview of Naive Bayes

    The goal of Naive Bayes is to work out whether a new example is in a class given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification.

    Bayes Rule (H = hypothesis, E = evidence, i.e. the event that has occurred):

    P[H|E] = P[E|H] · P[H] / P[E]

    P[H] is called the prior probability (of the hypothesis). P[H|E] is called the posterior probability (of the hypothesis given the evidence).
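    A minimal sketch of this idea on toy categorical data (attributes and labels assumed): for each class, multiply the prior P[H] by P[attribute value | H] for every attribute and pick the class with the highest score (no smoothing of zero counts).

    from collections import Counter

    def naive_bayes(train_rows, train_labels, new_row):
        classes = Counter(train_labels)
        n = len(train_labels)
        best_class, best_score = None, -1.0
        for c, count_c in classes.items():
            score = count_c / n                               # prior P[H]
            rows_c = [r for r, y in zip(train_rows, train_labels) if y == c]
            for attr, value in new_row.items():               # likelihoods P[E|H], assuming attribute independence
                matches = sum(1 for r in rows_c if r[attr] == value)
                score *= matches / count_c
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
            {"outlook": "rain",  "windy": True},  {"outlook": "rain",  "windy": False}]
    labels = ["play", "play", "no", "play"]
    print(naive_bayes(rows, labels, {"outlook": "rain", "windy": False}))   # "play"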

  • ID3 (Decision Tree Algorithm)

    ID3 was the first proper decision tree algorithm to use the information-gain mechanism.

    Building a decision tree with the ID3 algorithm:

    1. Select the attribute with the highest information gain

    2. Create the subsets for each value of the attribute

    3. For each subset:

    if not all the elements of the subset belong to the same class, repeat steps 1-3 for the subset
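    A minimal sketch of the gain criterion used in step 1: the entropy of the example set minus the weighted entropy of its subsets after splitting on an attribute (toy rows assumed).

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, labels, attr):
        # rows are dicts mapping attribute -> value
        n = len(labels)
        remainder = 0.0
        for v in {r[attr] for r in rows}:
            subset = [y for r, y in zip(rows, labels) if r[attr] == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    rows = [{"salary": ">5K"}, {"salary": ">5K"}, {"salary": "<=5K"}, {"salary": "<=5K"}]
    labels = ["good", "good", "bad", "good"]
    print(round(information_gain(rows, labels, "salary"), 3))   # 0.311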

  • ID3 (Decision Tree Algorithm)

    Function DecisionTreeLearner(Examples, Target_Class, Attributes):

        create a Root node for the tree
        if all Examples are positive, return the single-node tree Root, with label = Yes
        if all Examples are negative, return the single-node tree Root, with label = No
        if the Attributes list is empty,
            return the single-node tree Root, with label = most common value of Target_Class in Examples
        else
            A = the attribute from Attributes with the highest information gain with respect to Examples
            make A the decision attribute for Root
            for each possible value v of A:
                add a new tree branch below Root, corresponding to the test A = v
                let Examples_v be the subset of Examples that have value v for attribute A
                if Examples_v is empty then
                    add a leaf node below this new branch with label = most common value of Target_Class in Examples
                else
                    add the subtree DTL(Examples_v, Target_Class, Attributes - {A})
                end if
            end for
        return Root

  • Decision Trees (Summary)

    Advantages of ID3

    automatically creates knowledge from data

    can discover new knowledge (watch out for counter-intuitive rules)

    avoids knowledge acquisition bottleneck

    identifies most discriminating attribute first

    trees can be converted to rules

    Disadvantages of ID3

    several identical examples have same effect as a single

    example

    trees can become large and difficult to understand

    cannot deal with contradictory examples

    examines attributes individually: does not consider

    effects of inter-attribute relationships

  • CLUSTERING

    Cluster: a collection of data objects

    Similar to one another within the same cluster

    Dissimilar to the objects in other clusters

    Cluster analysis

    Grouping a set of data objects into clusters

    Clustering is unsupervised classification: no predefined classes

    Typical applications

    As a stand-alone tool to get insight into data distribution

    As a preprocessing step for other algorithms

  • Partitional Clustering

    Nonhierarchical

    Creates clusters in one step as opposed to several

    steps.

    Since only one set of clusters is output, the user

    normally has to input the desired number of

    clusters, k.

    Usually deals with static sets.

  • K-Means

    Initial set of clusters randomly chosen.

    Iteratively, items are moved among sets of clusters

    until the desired set is reached.

    High degree of similarity among elements in a cluster is obtained.

    Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is

    mi = (1/m)(ti1 + … + tim)

  • K-Means Example

    Given: {2,4,10,12,3,20,30,11,25}, k=2

    Randomly assign means: m1=3,m2=4

    K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16

    K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18

    K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6

    K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25

    Stop as the clusters with these means are the same.
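    A minimal 1-D sketch of the loop used in this example: assign each point to the nearest mean, recompute the means, and stop when the clusters no longer change.

    def kmeans_1d(points, means, max_iter=100):
        prev = None
        for _ in range(max_iter):
            clusters = [[] for _ in means]
            for x in points:                    # assignment step: nearest mean
                nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
                clusters[nearest].append(x)
            if clusters == prev:                # clusters unchanged: stop
                break
            means = [sum(c) / len(c) for c in clusters]    # update step: recompute means
            prev = clusters
        return clusters, means

    data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
    clusters, means = kmeans_1d(data, means=[3, 4])
    print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
    print(means)      # [7.0, 25.0]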

  • Hierarchical Clustering

    Clusters are created in levels actually creating sets of clusters at each level.

    Agglomerative: Initially each item in its own cluster

    Iteratively clusters are merged together

    Bottom Up

    Divisive: Initially all items in one cluster

    Large clusters are successively divided

    Top Down

  • Hierarchical Clustering

    Use distance matrix as clustering criteria. This method

    does not require the number of clusters k as an input,

    but needs a termination condition

    [Figure: dendrogram over objects a, b, c, d, e. AGNES (agglomerative) merges bottom-up across steps 0-4: {a,b}, {d,e}, {c,d,e}, then {a,b,c,d,e}. DIANA (divisive) produces the same hierarchy top-down, reading the steps in reverse order (4-0).]

  • The K-Medoids Clustering Method

    Find representative objects, called medoids, in clusters

    PAM (Partitioning Around Medoids)

    starts from an initial set of medoids and iteratively

    replaces one of the medoids by one of the non-medoids if

    it improves the total distance of the resulting clustering

    Handles outliers well.

    Ordering of input does not impact results.

    Does not scale well.

    Each cluster represented by one item, called the medoid.

    Initial set of k medoids randomly chosen.

    PAM works effectively for small data sets, but does not scale

    well for large data sets

  • PAM (Partitioning Around Medoids)

    PAM - Use real object to represent the cluster

    Select k representative objects arbitrarily

    For each pair of non-selected object h and selected

    object i, calculate the total swapping cost TCih

    For each pair of i and h,

    If TCih < 0, i is replaced by h

    Then assign each non-selected object to the most

    similar representative object

    repeat steps 2-3 until there is no change
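    A minimal sketch of the swap loop above on 1-D points (toy data and a simple absolute-difference distance assumed): compute the swapping cost TCih for every (medoid i, non-medoid h) pair, apply the best cost-reducing swap, and repeat until no swap has TCih < 0.

    from itertools import product

    def total_cost(points, medoids, dist):
        # each point is assigned to its most similar (nearest) medoid
        return sum(min(dist(p, m) for m in medoids) for p in points)

    def pam(points, k, dist=lambda a, b: abs(a - b)):
        medoids = list(points[:k])              # arbitrary initial medoids
        while True:
            current = total_cost(points, medoids, dist)
            best_swap, best_delta = None, 0.0
            for i, h in product(medoids, points):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                delta = total_cost(points, candidate, dist) - current   # TC_ih
                if delta < best_delta:
                    best_swap, best_delta = (i, h), delta
            if best_swap is None:               # no swap with TC_ih < 0
                return medoids
            i, h = best_swap
            medoids = [h if m == i else m for m in medoids]

    data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
    print(pam(data, k=2))   # two medoids drawn from the data, one per cluster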

  • PAM

  • Web Mining

    Web Mining

    Web Content Mining: identify information within given web pages; e.g. distinguish personal home pages from other web pages.

    Web Structure Mining: uses the interconnections between web pages to give weight to the pages; defines data structures for the links.

    Web Usage Mining: understand access patterns and trends in order to improve the site structure.

  • Crawlers

    Robot (spider) traverses the hypertext structure in the Web.

    Collect information from visited pages

    Used to construct indexes for search engines

    Traditional Crawler visits entire Web and replaces index

    Periodic Crawler visits portions of the Web and updates subset of index

    Incremental Crawler selectively searches the Web and incrementally modifies index

    Focused Crawler visits pages related to a particular subject

  • Web Usage Mining

    Performs mining on Web usage data, or Web logs.

    A Web log is a listing of page reference data, also called a click stream.

    It can be seen from either the server perspective (better web site design) or the client perspective (prefetching of web pages, etc.).

  • Web Usage Mining Applications

    Personalization

    Improve the structure of a site's Web pages

    Aid in caching and prediction of future page references

    Improve design of individual pages

    Improve effectiveness of e-commerce (sales and

    advertising)

  • Web Usage Mining Activities

    Preprocessing the Web log:

    Cleanse: remove extraneous information

    Sessionize (Session: sequence of pages referenced by one user at a sitting)

    Pattern Discovery: count patterns that occur in sessions

    A pattern is a sequence of page references in a session.

    Similar to association rules: Transaction ≈ session, Itemset ≈ pattern (or subset), but here order is important.

    Pattern Analysis
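    A minimal sketch (hypothetical log fields assumed) of the sessionize step: a user's page references are grouped into sessions, and a new session starts whenever the gap between consecutive requests exceeds a timeout.

    def sessionize(log, timeout=1800):
        # log: list of (user, timestamp_in_seconds, page) tuples.
        # Returns {user: [[pages of session 1], [pages of session 2], ...]}.
        sessions, last_seen = {}, {}
        for user, ts, page in sorted(log, key=lambda e: (e[0], e[1])):
            if user not in sessions or ts - last_seen[user] > timeout:
                sessions.setdefault(user, []).append([])     # start a new session
            sessions[user][-1].append(page)
            last_seen[user] = ts
        return sessions

    log = [("u1", 0, "/home"), ("u1", 120, "/products"), ("u1", 5000, "/home"),
           ("u2", 60, "/home")]
    print(sessionize(log))   # u1 gets two sessions, u2 one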

  • Web Structure Mining

    Mine structure (links, graph) of the Web

    Techniques

    PageRank

    CLEVER

    Create a model of the Web organization.

    May be combined with content mining to more

    effectively retrieve important pages.

  • Web as a Graph

    Web pages as nodes of a graph.

    Links as directed edges.

    [Figure: example graph in which "my page", www.uta.edu and www.google.com are nodes and the hyperlinks among them are directed edges.]

  • Link Structure of the Web

    Forward links (out-edges).

    Backward links (in-edges).

    Approximation of importance/quality: a page may

    be of high quality if it is referred to by many other

    pages, and by pages of high quality.

  • PageRank

    Used by Google

    Prioritize pages returned from search by looking at Web structure.

    The importance of a page is calculated based on the number of pages which point to it (backlinks).

    Weighting is used to give more importance to backlinks coming from important pages.
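    A minimal sketch of the PageRank idea described above (simplified power iteration with a damping factor; toy graph assumed): each page spreads its rank over its forward links, so backlinks from important pages carry more weight.

    def pagerank(links, damping=0.85, iterations=50):
        # links: {page: [pages it links to]}
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                targets = outs if outs else pages          # dangling page: spread evenly
                for q in targets:
                    new_rank[q] += damping * rank[p] / len(targets)
            rank = new_rank
        return rank

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(graph))   # C collects backlinks from both A and B, so it ranks highest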

  • HITS Algorithm

    Used to generate good quality authoritative pages

    and hub pages

    Authoritative Page: a page pointed to by many other (hub) pages.

    Hub Page: a page which points to many authoritative pages.

  • HITS Algorithm

    Step 1: Generate Root set

    Step 2: Generate Base set

    Step 3: Build Graph

    Step 4: Retain external links & eliminate internal links

    Step 5: Calculate Authoritative & Hub score

    Step 6: Generate result
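    A minimal sketch of the score calculation in step 5 (toy base-set graph assumed): authority and hub scores reinforce each other and are normalised each round.

    def hits(links, iterations=50):
        # links: {page: [pages it links to]} for the base set.
        pages = list(links)
        auth = {p: 1.0 for p in pages}
        hub = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # authority score: sum of the hub scores of the pages that link to it
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            # hub score: sum of the authority scores of the pages it links to
            hub = {p: sum(auth[q] for q in links[p]) for p in pages}
            a_norm = sum(auth.values()) or 1.0
            h_norm = sum(hub.values()) or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return auth, hub

    graph = {"hub1": ["siteA", "siteB"], "hub2": ["siteA"], "siteA": [], "siteB": []}
    auth, hub = hits(graph)
    print(max(auth, key=auth.get), max(hub, key=hub.get))   # siteA hub1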