Retail Big Data Analytics Solution Blueprint


  • Introduction

    The volume, variety, and velocity of data¹ being produced in all areas of the retail industry are growing exponentially, creating both challenges and opportunities for those diligently analyzing this data to gain a competitive advantage. Although retailers have been using data analytics to generate business intelligence for years, the extreme composition of today's data necessitates new approaches and tools. This is because the retail industry has entered the big data era, having access to more information that can be used to create amazing shopping experiences and forge tighter connections between customers, brands, and retailers.

    A trail of data follows products as they are manufactured, shipped, stocked, advertised, purchased, consumed, and talked about by consumers, all of which can help forward-thinking retailers increase sales and operations performance. This requires an end-to-end retail analytics solution capable of analyzing large datasets populated by retail systems and sensors, enterprise resource planning (ERP), inventory control, social media, and other sources.

    How does one start a big data project? In an attempt to demystify retail data analytics, this paper chronicles a real-world implementation that is producing tangible benefits, such as allowing retailers to:

    Increase sales per visit with a deeper understanding of customers' purchase patterns.

    Learn about new sales opportunities by identifying unexpected trends from social media.

    Improve inventory management with greater visibility into the product pipeline.

    A set of simple analytics experiments was performed to create capabilities and a framework for conducting large-scale, distributed data analytics. These experiments facilitated an understanding of the edge-to-cloud business analytics value proposition and at the same time, provided insight into the technical architecture and integration needed for implementation.

    Getting Started with Big Data Analytics in Retail

    Learn how Intel and Living Naturally* used big data to help a health store increase sales and reduce inventory carrying costs.

    SOLUTION BLUEPRINT
    Big Data Analytics in Retail Data

  • Table of Contents

    Introduction
    Big Data Basics
      Unstructured Data
      Competitive Advantage
      Big Data Technologies
    Real-World Use Cases
      Product Pipeline Tracking
      Social Media Analysis
      Market Basket Analysis
    Problem Definition
      Identify the Problem
      Identify All Data Sources
      Conduct Cause and Effect Analysis
      Determine Metrics That Need Improvement
      Verify the Solution Is Workable
      Consider Business Process Re-engineering
      Calculate an ROI
    Data Preparation
    Algorithm Definition Process
      Social Media Analysis
    Big Data Analytics Solution Implementation
      Capabilities and Software Components
      Social Media Analysis
      Social Media Analysis Details
      Product Pipeline Tracking
    Market Basket Analysis Implementation
      Market Basket Analysis
      Basket Analysis Using Transposition and Encoding
      Physical Infrastructure
    Accelerate Big Data Implementations

    Big Data Basics

    Data has been getting bigger for a while now. The volume of data generated or processed in 2014 alone is expected to exceed six zettabytes;² that is 1,200 times more than all the data ever generated prior to 2003.

    Unstructured Data

    One of the reasons data is getting bigger is that it is continuously being generated from more sources and more devices. Making matters more difficult, much of the data is unstructured, coming from videos, photos, comments on social media forums, reviews on web sites, and so on. As a result, this data is often made up of volumes of text, dates, numbers, and facts that are typically free form by nature and cannot be stored in structured, predefined tables.

    Data from certain sources arrives so fast that there may not be enough time to store it before applying analytics to it. That is why conventional data analytics and tools alone do not enable retail IT to store, manage, process, and analyze all the data they may need to utilize.

    Competitive Advantage

    So what if retailers just ignore big data; after all, is it worth all the effort? It turns out that it is. According to McKinsey* Global Institute, big data has the potential to increase net retailer margins by 60 percent.³ Likewise, companies in the top third of their industry in the use of data-driven decision making were, on average, five percent more productive and six percent more profitable than their competitors, wrote Andrew McAfee and Erik Brynjolfsson in a Harvard Business Review article.⁴

    In order to generate the insights needed to reap substantial business benefits, new innovative approaches and technologies are required. This is because big data in retail is like a mountain, and retailers must uncover the tiny but game-changing golden nuggets of insight and knowledge that can be used to create a competitive advantage.

    Big Data Technologies

    In order for retailers to realize the full potential of big data, they must find a new approach for handling large amounts of data. Traditional tools and infrastructure are struggling to keep up with larger and more varied data sets coming in at high velocity. New technologies are emerging to make big data analytics scalable and cost-effective, such as a distributed grid of computing resources. Processing is pushed out to the nodes where the data resides, in contrast to long-established approaches that retrieve data for processing from a central point.

    Hadoop* is a popular, open-source software framework that enables distributed processing of large data sets on clusters of computers. The framework easily scales on servers based on Intel Xeon processors, as demonstrated by Cloudera*, a provider of Apache* Hadoop-based software, support, services, and training. Although Hadoop has captured a lot of attention, other tools and technologies are also available for working on different types of big data analytics problems.

    Real-World Use Cases

    Intel teamed up with Living Naturally*, a trusted name in retail technology, in order to demonstrate big data use cases for 3,000 natural food stores in the United States. Living Naturally develops and markets a suite of online and mobile programs designed to enhance the productivity and marketing capabilities of retailers and suppliers. Since 1999, Living Naturally has worked with thousands of retail customers and 20,000 major industry brands.

    In a period of three months, a team of four Intel analysts worked with Living Naturally on a retail analytics project spanning four key phases: problem definition, data preparation, algorithm definition, and big data solution implementation. During the project, the following use cases were investigated in detail:


  • Figure 1. User Interface for Retail Buyers

    Product Pipeline Tracking: When inventory levels are out of sync with demand, make recommendations to retail buyers to remedy the situation and maximize return for the store.

    Market Basket Analysis: When an item goes on sale, let retailers know about adjacent products that benefit from a sales increase as well.

    Social Media Analysis: Prior to products going viral on social media, suggest retail buyers increase order size to be more responsive to shifting consumer demand and avoid out-of-stocks.

    A key objective of the team was to ensure the big data solution complemented how people do their jobs. This meant the data had to be presented to retail buyers in an actionable and easy-to-use manner. This is best illustrated by the solution's intuitive user interface, which is shown in Figure 1 and described in the following sections:

    [Figure 1 screenshot: an order screen for Mangos Market (Sarasota, FL), order #1294853, with Associated Buyers (Houston, TX). It shows the product CASHEWS XLG WHL from NUTRIENT DEPOT, previous orders, 120 units in inventory, an A ordering grade, a recommended reorder of 38 units, and the recommendation "Papayas are commonly purchased with Cashews."]

    Product Pipeline Tracking

    Figure 2 defines the fields of the retail buyer's user interface (UI) used to order products. In this case, the product is cashews, and the upper left corner shows the orders (30, 55, 30, 25) the retail buyer made over the last four weeks. Next to the graph is a shopping cart, indicating the buyer plans to order 20 more. Moving right along the UI, the next shipment will contain a partial order of 18 more products arriving from Houston, Texas, as seen in the lower left corner. Currently, there are 120 units in inventory.

    The retail buyer has a good track record ordering this product, getting an A grade and a thumbs up from the tool. The tool makes a suggestion to increase the reorder from 18 currently in the shopping cart to 38 units. The buyer can override the suggestion by moving the smiley face left or right to decrease or increase the amount of the next order.

    [Figure 2 callouts on the order screen: product, previous orders, next shipment, current inventory, ordering track record, order amount in shopping cart, inventory status, recommended reorder quantity, and order quantity selector.]

    Figure 2. Product Ordering


  • Social Media Analysis

    In the previous example, why is the tool suggesting a greater order size when 120 units are already in inventory and the prior orders have more than kept pace with demand? Because the health benefits of cashews were discussed in a recent social media post, which can be read by clicking the Twitter bird in the UI screenshot in Figure 3. The tool estimated the demand increase created by this tweet and other related social media postings (also reflected in the retailer's own increased sales numbers), and recommended a suitable reorder quantity for cashews.

    Market Basket Analysis

    The tool also lets the retail buyer know about associated products for cashews, recommending the buyer increase orders of papayas, which customers are more likely to purchase along with cashews due to another social media-driven microtrend (Figure 4). In other words, if cashew sales are about to spike due to social media postings, papaya sales are likely to increase as well.

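    [Figure 3 callouts: "Significant Twitter* chatter" on the sales graph for the NUTRIENT DEPOT cashews item; a tweet summary reading "May 10: Dr. Smith promoted the health benefits from cashews..."; and "Click for tweet details." The graph legend lists orders, inventory, shrink, change, and sales.]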

    When the team developed the retail buyer UI, these featured use cases were pursued separately. But over time, the team saw the use cases were interconnected and built on each other, resulting in a solution that was greater than the sum of its parts. For example, providing retail buyers a single view for social media postings and associated products (i.e., market basket analysis) enables them to take advantage of multiple profit-maximizing opportunities at the same time.

    The rest of this paper describes a process for carrying out a big data project based on the team's learnings.

    Problem Definition

    Before starting a big data project, it is important to have clarity of purpose, which can be accomplished with the following steps:

    Identify the Problem

    Retailers need to have a clear idea of the problems they want to solve and understand the value of solving them. For instance, a clothing store may find that four of ten customers looking for blue jeans cannot find their size on the shelf, which results in lost sales. Using big data, the retailer's goal could be to reduce the frequency of out-of-stocks to less than one in ten customers, thus increasing the sales potential for jeans by over 50 percent.
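    The arithmetic behind the blue jeans example can be checked with a short calculation. This is only a sketch: the function name is hypothetical, and it assumes every customer who finds the right size makes a purchase.

```python
def sales_uplift(stockout_before: float, stockout_after: float) -> float:
    """Relative increase in customers served when the out-of-stock rate drops.

    Assumption (not from the paper): a customer who finds the item buys it.
    """
    served_before = 1.0 - stockout_before
    served_after = 1.0 - stockout_after
    return (served_after - served_before) / served_before


# Four in ten customers miss their size today; the goal is one in ten.
uplift = sales_uplift(0.4, 0.1)
print(f"Sales potential increases by {uplift:.0%}")
```

    Going from six served customers in ten to nine in ten is a 50 percent increase, so any out-of-stock rate below one in ten yields more than 50 percent.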

    [Figure 4 callouts: "Click to see associated products" on the NUTRIENT DEPOT cashews item, and the recommendation "Papayas are commonly purchased with Cashews" with an ADD button.]

    Figure 4. Associated Products

    Figure 3. Social Media Posting

  • Big data can be used to solve a lot of problems, such as reducing spoilage, increasing margin, increasing transaction size or sales volume, improving new product launch success, or increasing customer dwell time in stores.

    Working with a retail solution provider to 3,000 natural food stores, the Intel team identified an out-of-stocks problem associated with products going viral thanks to social media and Internet postings of a celebrity or expert endorsing them. For example, a popular medical doctor with a health-oriented television show recommended raspberry ketone pills as a weight reducer to his very large following on Twitter*. This caused a run on the product at health stores and ultimately led to empty shelves, which took a long time to restock since there was little inventory in the supply chain.

    Identify All Data Sources

    Retailers collect a variety of data from point-of-sale (POS) systems, including the typical sales by product and sales by customer via loyalty cards. However, there could be less obvious sources of useful data that retailers can find by "walking the store," a process intended to provide a better understanding of a product's end-to-end life cycle. This exercise allows retailers to think about problems with a fresh set of eyes, avoid thinking of their data in siloed terms, and consider new opportunities that present themselves when big data is used to find correlations between existing and new data sources, such as:

    Video: Surveillance cameras and anonymous video analytics on digital signs

    Social Media: Twitter, Facebook*, blogs, and product reviews

    Supply Chain: Orders, shipments, invoices, inventory, sales receipts, and payments

    Advertising: Coupons in flyers and newspapers, and advertisements on TV and in-store signs

    Environment: Weather, community events, seasons, and holidays

    Product Movement: RFID tags and GPS

    The Intel team focused on the data sources and business solutions listed in Table 1. Using social media chatter, the team sent useful product information and desired promotions to customers based on their interests. In addition, social media and supply chain data were analyzed together to minimize out-of-stocks and overstocks, which will be described in more detail in a following section.

    Conduct Cause and Effect Analysis

    After listing all the possible data sources, the next step is to explore the data through queries that help determine which combinations of data could be used to solve specific retail problems. For the Intel team, this meant using social media data to inform health stores up to three weeks before a product is likely to go viral, providing enough time for retail buyers to increase their orders. The team queried social media feeds and compared them to historic store inventory levels and found it was possible to give retail buyers an early warning of potentially rising product demand. In this case, it was important to find a reliable leading indicator that could be used to predict how much product should be ordered and when.
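    One simple way to test whether a signal is a leading indicator is to shift it against sales and see which shift correlates best. The sketch below is an illustration under stated assumptions, not the team's actual query: the weekly series are made up, and `pearson` and `best_lead` are hypothetical helper names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_lead(signal, sales, max_lag):
    """Return (lag, r): the number of weeks by which `signal` best leads `sales`."""
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        # Compare the signal in week t with sales in week t + lag.
        r = pearson(signal[: len(signal) - lag], sales[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Made-up weekly data: sales echo the social media mentions two weeks later.
mentions = [5, 6, 5, 40, 90, 60, 20, 8, 6, 5, 5, 6]
sales = [20, 21, 20, 21, 20, 55, 105, 75, 35, 23, 21, 20]

lag, r = best_lead(mentions, sales, max_lag=4)
print(f"Mentions lead sales by {lag} weeks (r = {r:.2f})")
```

    A high correlation at a positive lag suggests how far ahead the warning can be raised; it does not by itself establish cause and effect.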

    Table 1. Data Sources and Business Solutions

    Data Type: Social Media
    Data Sources: Twitter*, Facebook*
    Business Solutions: Implement 1:1 digital marketing; send personalized promotions; influence customer sentiment

    Data Type: Supply Chain
    Data Sources: Orders, invoices, shipments, receipts, sales transactions
    Business Solutions: Increase new product sales; maximize profit by understanding retail buyer behavior; make product recommendations; measure marketing campaign effectiveness

  • Determine Metrics That Need Improvement

    After clearly defining a problem, it is critical to determine metrics that can accurately measure the effectiveness of the future big data solution. A recommendation is to make sure the metrics can be translated into a monetary value (e.g., income from increased sales, savings from reduced spoilage) and incorporated into a return on investment (ROI) calculation.

    The Intel team focused on metrics related to reducing shrinkage and spoilage, and increasing profitability by suggesting which products a store should pick for promotion.

    Verify the Solution Is Workable

    A workable big data solution hinges on presenting the findings in a clear and acceptable way to employees. As described previously, the Intel team made sure the solution provided retail buyers with timely product ordering recommendations displayed with an intuitive UI running on devices used to enter orders.

    Consider Business Process Re-engineering

    By definition, the outcomes from a big data project can change the way people do their jobs. Will the data simplify or complicate things for employees? Will employees need to be trained on a new device? Will the POS system need to be modified to deliver a different data set? It is important to consider the costs and time of the business process re-engineering that may be needed to fully implement the big data solution.

    Calculate an ROI

    Taking all the inputs previously discussed, retailers should calculate an ROI to ensure their big data project makes financial sense. At this point, the ROI is likely to have some assumptions and unknowns, but hopefully these will have only a second order effect on the financials. If the return is deemed too low, it may make sense to repeat the prior steps and look for other problems and opportunities to pursue. Moving forward, the following steps will add clarity to what can be accomplished and raise the confidence level of the ROI.
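    A first-pass ROI can be as simple as the sketch below. The formula is the generic (benefit minus cost) divided by cost; the function name and all dollar figures are hypothetical placeholders, not numbers from the project.

```python
def simple_roi(added_income: float, savings: float, total_cost: float) -> float:
    """Return ROI as a fraction: (benefits - cost) / cost."""
    return (added_income + savings - total_cost) / total_cost


# Hypothetical inputs: income from increased sales, savings from reduced
# spoilage, and total cost including any business process re-engineering.
roi = simple_roi(added_income=120_000, savings=30_000, total_cost=100_000)

# Because early estimates carry unknowns, it helps to bracket the result.
roi_low = simple_roi(added_income=90_000, savings=20_000, total_cost=110_000)
roi_high = simple_roi(added_income=150_000, savings=40_000, total_cost=95_000)

print(f"ROI: {roi:.0%} (range {roi_low:.0%} to {roi_high:.0%})")
```

    If even the optimistic end of the range is too low, that is the signal to revisit the earlier steps.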

    Data Preparation

    Retail data used for big data analysis typically comes from multiple sources and different operational systems, and may have inconsistencies. In the Intel team's case, transaction data was supplied by several systems, which truncated UPC codes or stored them differently due to special characters in the data field. More often than not, one of the first steps is to cleanse, enhance, and normalize the data.

    The Intel team performed a number of data cleansing tasks:

    Populate missing data elements

    Scrub data for discrepancies

    Make data categories uniform

    Reorganize data fields

    Normalize UPC data
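    As an illustration of the last task, the sketch below normalizes a UPC value to 12 digits and validates it with the standard UPC-A check digit. It is not the team's actual routine: the function names are hypothetical, and it assumes that the special characters are separators and that truncation means dropped leading zeros.

```python
import re
from typing import Optional


def upc_check_digit(first11: str) -> int:
    """Compute the UPC-A check digit from the first 11 digits."""
    odd = sum(int(d) for d in first11[0::2])   # positions 1, 3, ..., 11
    even = sum(int(d) for d in first11[1::2])  # positions 2, 4, ..., 10
    return (10 - (odd * 3 + even) % 10) % 10


def normalize_upc(raw: str) -> Optional[str]:
    """Return a 12-digit UPC-A string, or None if the value cannot be repaired."""
    digits = re.sub(r"\D", "", raw)  # drop dashes, spaces, other characters
    if not digits or len(digits) > 12:
        return None
    digits = digits.zfill(12)        # restore leading zeros a system dropped
    if upc_check_digit(digits[:11]) != int(digits[11]):
        return None                  # flag for manual review
    return digits


print(normalize_upc("0-36000-29145-2"))  # separators removed
print(normalize_upc("36000291452"))      # leading zero restored
print(normalize_upc("036000291453"))     # bad check digit
```

    Records that come back as None would be routed to the "scrub data for discrepancies" task rather than silently dropped.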

    Algorithm Definition Process

    After identifying the retail data sources, a set of algorithms can be selected for data exploration. This is a trial-and-error process because it is very difficult to assess the effectiveness of an algorithm before applying it to a particular data set. Algorithms are like atomic pieces, and figuring out which ones are needed and work together is like assembling a puzzle.

    Table 2 lists commonly used algorithms and the types of business problems they help to solve.


  • Table 2. Algorithm Examples

    Analysis Technique: Classification
    Algorithms: Decision trees, neural networks, and naïve Bayesian networks. A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional independencies.
    Business Problems: Churn Analysis: Learn which customers are most likely to switch to a competitor. Risk Management: Decide whether a loan should be approved for a particular customer. Directed Advertising: Personalize advertisements (in-store, online) to better attract customer attention.

    Analysis Technique: Regression
    Algorithms: Similar to classification, logistic and linear regression techniques look for patterns that determine a numerical value.
    Business Problems: Predict coupon redemption rate based on face value, distribution method, distribution volume, and season.

    Analysis Technique: Clustering/Segmentation
    Algorithms: These algorithms assign a set of observations into subsets (clusters), which are similar according to pre-designated criteria. They allow for different assumptions on the data structure and evaluations based on internal compactness within a cluster and/or separation between different clusters.
    Business Problems: Maximize profit by creating a profile of perfect buyer behaviors. Customer Segmentation: Determine behavioral and descriptive profiles of customers.

    Analysis Technique: Association Rules
    Algorithms: Association rule learning is a method for discovering interesting relations between variables in large databases. The method helps identify items that frequently appear together and determine rules about the associations.
    Business Problems: Identify product adjacency for cross-selling purposes. Determine frequently shopped products. Perform market basket analysis.

    Analysis Technique: Forecasting/Prediction
    Algorithms: This approach estimates the value of a variable of interest at some specified future date. It uses formal statistical methods, employing time series, cross-sectional or longitudinal data, and techniques that deal with seasonality, trending, and noisiness of the data.

    Analysis Technique: Sequence Analysis
    Algorithms: This method finds patterns in a series of events. Sequence and time-series data are similar; both contain adjacent observations that are order-dependent. The difference is that where a time series contains numerical data, a sequence series contains discrete states.
    Business Problems: Model customer purchases as a sequence (e.g., a customer first buys a computer, then buys speakers, and finally buys a webcam).

    Analysis Technique: Anomaly Detection
    Algorithms: These are distance-based techniques: k-nearest neighbor and cluster analysis-based outlier detection, pointing at records that deviate from learned association rules. They detect patterns (i.e., anomalies) in a given data set that do not conform to an established normal behavior and can be used to validate data entry and identify outliers from the expected norm.
    Business Problems: Detect credit card fraud and network intrusion. Conduct manufacturing error analysis.

    Analysis Technique: Computer Vision
    Algorithms: This type of algorithm translates images/high-dimensional data from the real world to provide numerical or symbolic information for decision-making. Models are constructed with the aid of geometry, physics, statistics, and learning theory for vision perception. Automated image analysis is performed on video sequences and multiple camera views, with many applications.
    Business Problems: Detect events (surveillance), model object interactions, reconstruct scenes, recognize objects, etc.
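    To make the association rules entry in Table 2 concrete, the sketch below computes support, confidence, and lift for item pairs from a handful of made-up baskets. It is a generic pair-counting illustration, not the transposition-and-encoding method the paper applies later, and the function name and threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                 # share of baskets with both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]        # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)   # vs. baseline P(rhs)
            rules.append((lhs, rhs, support, confidence, lift))
    return rules


# Made-up transactions for illustration only.
baskets = [
    {"cashews", "papayas"},
    {"cashews", "papayas", "granola"},
    {"cashews", "granola"},
    {"papayas", "cashews"},
    {"granola"},
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence, lift in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

    A lift above 1 means the two items appear together more often than their individual popularity would predict, which is the kind of evidence behind a recommendation such as "Papayas are commonly purchased with Cashews."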

  • [Figure 5 process steps, in order: select natural foods influencer; monitor social influencers for health food related endorsements and recommendations; monitor endorsements and recommendations for viral conditions; monitor Living Naturally* products to message; review sales prior to, during, and after the message becomes viral.]

    [Figure 6 plot: the horizontal axis is Month (1 to 9); one vertical axis is Trend and the other is Sales and Trend Rate; the plotted series are Trend, Trend Rate (d(Trend)/d(Months)), and Sales.]

    Social Media Analysis

    The goal of this data exploration was to determine if social media could be used to identify changes to customer demand for products. The initial premise was that social media analysis could identify emerging social topics and viral events, and correlate them with products and sales. This capability would enable retailers to detect micro trends and events early enough to make informed ordering, pricing, and product promotion decisions. The Intel team focused on known and emerging influencers in the social media world.

    Figure 5 shows the data exploration approach, which basically looked for correlations between social media events and changes in the sales of related products. The input data included Google* Trends, tweets from a popular doctor with a TV show, product attributes, and business-to-consumer (B2C) point-of-sale data.

    An example of the output is presented in Figure 6. The blue line shows the trend for social media activity related to raspberry ketone, a potential weight-reducing compound found in red raspberries. The trend line peaks twice (weeks two and four), which is followed by an increase in sales about a week later, as shown by the green line.
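    Figure 6 also plots a trend rate, labeled d(Trend)/d(Months). A minimal sketch of that idea, assuming the rate is a simple period-over-period difference, is shown below; the trend values are made up and the helper names are hypothetical.

```python
def trend_rate(trend):
    """First difference: change in the trend index from one period to the next."""
    return [later - earlier for earlier, later in zip(trend, trend[1:])]


def peaks(series):
    """Indices where the series is strictly higher than both neighbors."""
    return [
        i for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] > series[i + 1]
    ]


# Made-up social media trend index for nine consecutive periods.
trend = [10, 60, 35, 100, 40, 20, 15, 12, 10]

rate = trend_rate(trend)
peak_periods = [i + 1 for i in peaks(trend)]  # convert to 1-based periods

print("Trend rate:", rate)
print("Trend peaks in periods:", peak_periods)
```

    A sharply positive rate flags a topic that is taking off before the trend itself peaks, which is when an ordering recommendation is most useful.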

    Big Data Analytics Solution Implementation

    The Intel team implemented an analytics solution architecture based on a Cloudera Distribution of the Apache Hadoop platform, which could handle the large volume, velocity, and variety of data that has to be processed. Supply chain and sales data were analyzed along with social media data relevant to the natural foods domain; however, the framework was designed to scale for additional data like weather, sensor data from the Internet of Things (IoT), or other sources that are of value to retailers.

    Figure 7 shows a conceptual solution stack of the big data analytics solution, containing layers of features required to analyze large volumes of data and gain actionable insights from the chosen algorithms. The left side of the figure shows the data inputs, and the right side lists elements of the framework spanning the computing infrastructure through to the retail buyer UI.

    Figure 5. Data Exploration Approach for Social Media Monitoring

    Figure 6. Social Media Impacts on the Sales of Raspberry Ketone

    Figure 7. Big Data Analytics Conceptual Solution Stack

    [Figure 7 content. Data inputs: Twitter* data stream filtered by keywords and usernames; supply chain and transactional data. Big Data Retail Analytics Framework layers: Visualization (user interface and visualization); Analytics (query, machine learning, and statistical algorithms); Compute (parallel computing, text processing, natural language processing, and sentiment analysis); Storage (big data distributed file system, data connectors, and databases); Hardware (server, storage, and network hardware).]

  • Capabilities and Software Components

    Various solution components and associated technologies were used, which are highlighted in Figure 8. At a high level, Twitter data is streamed live using the Firehose API exposed by Twitter, and supply-chain data is normalized, processed, prepared, and stored in relational databases. These databases are loaded into the Hadoop Distributed File System (HDFS) and HBase, which is a NoSQL database. The data is processed using MapReduce programs, machine learning algorithms, and statistical tools, and the output is recommendations for retail buyers as well as actionable interactive visualizations.
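    The MapReduce pattern mentioned above can be illustrated in plain Python, without a cluster. The sketch below counts keyword mentions per day from JSON tweet records; the record fields (`day`, `text`), the keyword list, and the function names are simplified assumptions and do not reflect the actual Twitter payload or the team's jobs.

```python
import json
from collections import defaultdict

KEYWORDS = {"cashews", "papayas"}  # hypothetical product keywords


def map_tweet(line):
    """Map step: emit ((day, keyword), 1) for each keyword found in a tweet."""
    tweet = json.loads(line)
    words = set(tweet["text"].lower().split())
    for keyword in KEYWORDS & words:
        yield (tweet["day"], keyword), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up input records, one JSON document per line.
lines = [
    '{"day": "05-10", "text": "Cashews are great for you"}',
    '{"day": "05-10", "text": "cashews and papayas for breakfast"}',
    '{"day": "05-11", "text": "Nothing to see here"}',
]

counts = reduce_counts(pair for line in lines for pair in map_tweet(line))
print(counts)
```

    On Hadoop, the map and reduce steps would run in parallel on the nodes that hold the data, with the framework grouping the emitted keys between the two phases.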

    Social Media Analysis

    To enable the social media analysis capabilities, a base framework was created to access and store live Twitter streams, depicted by the process flow in Figure 9. The known social media influencers were identified as previously discussed. Public chat trends were tracked by extracting content from tweets, messages were tokenized and tagged using natural language processing techniques, and sentiment scores were calculated using open source libraries. The social media analysis was tied to product sales trends to make recommendations.
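    The tokenize-and-score step can be sketched with a tiny word-list approach. This is a stand-in for illustration only: the paper does not name the open source libraries used, and the word lists, function names, and scoring rule below are all assumptions.

```python
import re

# Toy lexicons, invented for this example.
POSITIVE = {"healthy", "benefits", "great", "love"}
NEGATIVE = {"bad", "allergic", "recall", "avoid"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (positive - negative) word count divided by total tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)


print(sentiment_score("Love the health benefits of cashews!"))
print(sentiment_score("Avoid this brand, bad batch"))
```

    A production pipeline would add part-of-speech tagging and handle negation, which a bare word list cannot do.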

    [Figure 8 content. Big Data Retail Analytics Framework, built on Cloudera* Distribution of Apache* Hadoop* (CDH5). Hardware: 4-node cluster, 2 x Intel Xeon processor @ 2.7 GHz, 24 TB disk space, 126 GB memory per node, 10 Gigabit Ethernet. Storage: HDFS (transaction and raw Twitter data), HBase (processed data), Sqoop (data integration). Compute: MapReduce programs and NLP libraries (tagging and tokenization, sentiment scoring). Analytics: Hive, Pig, Mahout, R-Hadoop. Visualization: visualization in D3.js and the store manager user interface. Twitter* data input: Twitter Firehose API connector (Java application) delivering the Twitter live stream in JSON format filtered by keywords and user names, staged in a Mongo database and loaded via Mongo export. Living Naturally* retail data in SQL Server: loyalty management system, store and retailer details, product catalog, campaigns and promotions data, inventory data, POS transactions, web transactions, and retailer buyer orders, cleaned up and aggregated, then loaded via Sqoop.]

    Figure 8. Solution Components

    [Figure 9 content. Sources: Twitter*, Facebook*, and other social media, filtered by user, keywords, and number of followers. Steps: Influencer (identify known and emerging influencers); Public chat (track public chat trend and sentiment on the relevant topics in influencer tweets: extract features, tokenize and tag the text, sentiment score, topics); Connect with sales (connect with sales trend of the products); Recommendations (combine the data and make recommendations); Visualize (interactive visualization for analysis as well as integration with store manager UI).]

    Figure 9. Social Media Analysis Process Flow

  • Product Pipeline Tracking

    A retailer's profitability is closely tied to supply chain management, a critical function that helps ensure the right amount of inventory is on hand. Without proper product management, even small variations in customer demand can lead to major inventory imbalances as business partners in the supply chain try to make adjustments with imperfect information. Inventory can swing wildly, creating an unstable situation known as the bullwhip effect, which results in stock-outs or excess inventory.

    It is possible to avoid these inventory issues through effective communication and coordination across the supply chain, which in turn depends on connecting the dots between orders, inventory, and sales. For example, retailers can improve information sharing by providing sales insight to the supply chain so that everyone's demand predictions are based on the same information, thereby reducing the bullwhip effect. Essentially, the supply chain should be viewed as a glass pipeline that provides stakeholders with information transparency for data types such as customer demand, available capacity, and inventory levels. Efficiency should increase with better information flow, assuming incentive structures and the necessary technologies are also in place.

    The Intel team developed a glass pipeline analytics capability by first evaluating the availability and integrity of product-oriented data structures shown in Figure 10. The data was bucketed into four categories: unavailable, limited and suspect, available but dirty, and mostly clean data. After merging the various product-oriented data structures, the team was able to generate useful reports such as the inventory-turns graph in Figure 11. This was the first step in creating product-based actionable insights to ensure the right inventory is in the right place at the right time. The glass pipeline also allows the retail buyer to select individual product volumes, pricing, and promotions while ensuring maximum store returns.

    Figure 10. Glass Pipeline Data Structures. The figure labels the pipeline stages (raw material, manufacturer, distributor (Disty), retailer, shelf, and customer) and marks the inventory held along the pipeline together with the PO and forecast, PO/gross, and ship and inventory data exchanged between stages. It also shows POS and web data (price, quantity, customers, non-loyalty transactions), demand generation (price drop, MFG coupon, LG promo), and demand signals (environment, web search, Twitter*). Each data structure is rated on the following scale:

    No data available

    Limited data / highly suspect

    Moderate / larger amount of data requiring scrubbing

    Very strong data requiring limited to no scrubbing

  • The details with respect to this specific analytics solution can be found in a companion solution blueprint. This solution blueprint provides a deep dive discussion on the business and technical aspects of social media analytics.

    Market Basket Analysis Implementation

    Several techniques were applied in order to understand consumer buying patterns and determine the product mixes that best suited consumers and maximized overall return. The analytics is based on association rule learning, a machine learning technique that has been applied to online buying and was extended here to address shopping baskets of much larger size and diversity.

    Market Basket Analysis

    To determine product associations, the Intel team used association rules to mine data and discover relationships between products in large-scale transaction data recorded by point-of-sale systems. Brick-and-mortar retailers, especially grocers, have much more complicated associations between products than online retailers because of the greater average number of items in a consumer's market basket. Other types of associations and relationships may also occur in the store (based on co-location, recipe ingredients, etc.). This called for a fresh approach to providing prioritized market basket association signals to the buyer.

    Market basket association rules analysis employs a well-defined algorithmic method. The following simple grocery example, involving milk, bread, and butter, shows how the method works. In this case, the following rule was tested to determine the likelihood that customers who buy both milk and bread will also buy butter:

    {milk, bread} => {butter} [support, confidence]

    Three measures were used to place constraints on the significance of rules:

    Support: The proportion of transactions in the data set that contain all of the items in question, such as both milk and bread.

    Confidence: The probability of finding butter in a transaction, given that the transaction also contains milk and bread.

    Lift: A measure of how much more often butter occurs together with milk and bread than would be expected if the purchases were independent.

    The conclusion is that, based on this data set, the purchase of milk and bread is accompanied by the purchase of butter 50 percent of the time (the confidence), which is 25 percent more often than would be expected if the purchases were independent (a lift of 1.25).

    To test {milk, bread} => {butter}:

    Support({milk, bread, butter}) = 1/5 = 0.2

    Confidence({milk, bread} => {butter}) = Support({milk, bread, butter}) / Support({milk, bread}) = 0.2 / 0.4 = 0.5

    Lift({milk, bread} => {butter}) = Support({milk, bread, butter}) / (Support({milk, bread}) * Support({butter})) = 0.2 / (0.4 * 0.4) = 1.25

    Source: http://en.wikipedia.org/wiki/Association_rules

    Transaction ID    Milk    Bread    Butter    Beer
    1                 1       1        0         0
    2                 0       0        1         0
    3                 0       0        0         1
    4                 1       1        1         0
    5                 0       1        0         0

    Figure 12. Association Rules Learning Example
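    As a cross-check of the worked example, the following minimal Python sketch computes the three measures for the five transactions in Figure 12. The function names are illustrative and are not taken from the association rules library used in the project.

```python
# The five example transactions from Figure 12, as sets of purchased items.
TRANSACTIONS = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of the combined itemset divided by the support of the left-hand side."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Observed joint support relative to the support expected under independence."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))


lhs, rhs = {"milk", "bread"}, {"butter"}
print(round(support(lhs | rhs, TRANSACTIONS), 2))   # support of {milk, bread, butter}
print(round(confidence(lhs, rhs, TRANSACTIONS), 2))
print(round(lift(lhs, rhs, TRANSACTIONS), 2))
```

    Up to floating-point rounding, the sketch yields the same values as the hand calculation: a support of 0.2, a confidence of 0.5, and a lift of 1.25.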

    Figure 11. Inventory Turns. Left axis: monthly dollars spent/sold ($0 to $120,000). Right axis: monthly inventory turns % (0% to 60%). Horizontal axis: month ending, January 2011 through August 2012. Series: Monthly Total $ Spent, Monthly Total $ Sold, Monthly Inventory Turn.

  • With association rule techniques as the basis for identifying consumer buying patterns and products with natural affinities, the solution took into account the diversity and size of shopping baskets and provided a multi-tiered analysis approach. The idea is to identify emergent basket rules and patterns in transaction datasets in order to detect product affinities. These product affinities were then combined with social media analysis, pricing strategy, promotions, and customer product recommendations. Figure 13 shows a high-level process flow of the solution.

    Basket Analysis Using Transposition and Encoding

    For every product, the solution took the set of transactions in which the product appears, merged these sets for every possible pair of products, and counted the results. The merge need not be restricted to a pair, so multiple products can be considered at a time. The result was then merged with other products to find basket rules at the cost of only counting transactions. The support of each combination was calculated as the number of times the products occurred together, divided by the total number of transactions. The solution performed the following steps:

    For each product, run through the rest of the list.

    Merge the transaction lists together using AND (only keep a transaction if in both lists).

    If the number of transactions in common is greater than the support value, put the new combination in the result at a level equal to the number of keys in the product list, and keep the list of transactions in common.

    When a level is finished, and there are results, repeat the process, but use the result from that level as the input.
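    The steps above can be sketched in Python as a level-wise intersection of per-product transaction lists. This is a simplified illustration under stated assumptions, not the project's MapReduce implementation: the function names, the `min_count` threshold, and the sample baskets are hypothetical, and the encoding step is not shown because the paper does not describe it.

```python
from collections import defaultdict


def transpose(transactions):
    """Map each single-product itemset to the set of transaction IDs containing it."""
    tid_lists = defaultdict(set)
    for tid, products in transactions.items():
        for product in products:
            tid_lists[frozenset([product])].add(tid)
    return dict(tid_lists)


def basket_rules(transactions, min_count):
    """Return {level: {itemset: transaction IDs}} for itemsets that occur
    together in more than min_count transactions."""
    current = {k: v for k, v in transpose(transactions).items() if len(v) > min_count}
    levels, level = {}, 1
    while current:
        levels[level] = current
        keys = sorted(current, key=sorted)
        merged = {}
        for i, first in enumerate(keys):
            for second in keys[i + 1:]:              # run through the rest of the list
                combo = first | second
                if len(combo) != level + 1:          # keep only itemsets one product larger
                    continue
                common = current[first] & current[second]   # AND merge of transaction lists
                if len(common) > min_count:
                    merged[combo] = common
        current, level = merged, level + 1           # the result feeds the next level
    return levels


# Hypothetical sample baskets keyed by transaction ID.
BASKETS = {
    1: {"milk", "bread"},
    2: {"butter"},
    3: {"beer"},
    4: {"milk", "bread", "butter"},
    5: {"bread"},
}

result = basket_rules(BASKETS, min_count=1)
for itemset, tids in result[2].items():
    print(sorted(itemset), sorted(tids), len(tids) / len(BASKETS))
```

    Level 1 holds the single products that pass the threshold, and each later level holds the combinations with one more product. Because the transaction lists are kept with each combination, extending a combination only requires another set intersection rather than a new pass over the raw transactions.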

    The details with respect to this specific analytics solution can be found in a companion solution blueprint, whereas this solution blueprint focuses on the business and technical aspects of social media analytics.

    Figure 13. Market Basket Analysis High-Level Process Flow. The flow runs from the transaction dataset through association rules mining (using an association rules library) to basket rules and, finally, market basket analysis.

    Transaction dataset (sample):

    Transaction ID    Product Category
    100012345         Vegetables
    100012346         Canned Goods
    100012347         Vegetables
    100012348         Bread
    100012349         Chips & Pretzels
    100012350         Canned Goods

    Association rules mining, for a rule X => Y over n transactions (for example, 98% of people who purchased items A and B also purchased item C):

    Support = (X U Y).count / n

    Confidence = (X U Y).count / X.count

    Lift = Support / (Supp(X) * Supp(Y))

    Basket rules (sample item sets and their support):

    1  {DRY GROCERY\SHELF STABLE FOODS\CANNED GOODS, FRESH\PRODUCE\VEGETABLES}  0.005387144
    2  {DESERTS, DRY GROCERY\SHELF STABLE FOODS\COOKIES CRACKERS AND CRISPBREADS}  0.003611720
    3  {DRY GROCERY\SHELF STABLE FOODS\COOKIES CRACKERS AND CRISPBREADS, HERBS BULK}  0.007000243
    4  {DRY GROCERY\SHELF STABLE FOODS\CANNED GOODS, DRY GROCERY\SNACKS\CHIPS AND PRETZELS}  0.003363160
    5  {DRY GROCERY\SHELF STABLE FOODS\SOUP AND BROTH, DRY GROCERY\SNACKS\CHIPS AND PRETZELS}  0.0024992898
    6  {DRY GROCERY\SNACKS\CHIPS AND PRETZELS, FRESH BAKERY\BREAD AND BAKED GOODS\BREAD}  0.003525485
    7  {DRY GROCERY\SHELF STABLE MEAL SOLUTIONS\MEAT POULTRY SEAFOOD, FRESH BAKERY\BREAD AND BAKED GOODS\BREAD}  0.004159565
    8  {DRY GROCERY\SHELF STABLE MEAL SOLUTIONS\MEAT POULTRY SEAFOOD, DRY GROCERY\SNACKS\CHIPS AND PRETZELS}  0.003276926

    Market basket analysis (product categories shown): fish, cheese, meat, vegetables, herbs, poultry, seafood, crackers, bread, milk.

  • Physical Infrastructure

    The physical implementation architecture of the analytics framework is shown in Figure 14, including the hardware configuration, operating systems, and important software components used in the solution.

    Accelerate Big Data Implementations

    Big data presents retailers with game-changing opportunities to solve age-old business problems as well as create a competitive advantage through unique insights. This paper provided an overview of an actual big data analytics implementation, thereby providing a primer for those just getting started in this field.

    To make big data easier to implement, Intel has a team of experts who create data analytics applications and analytics toolkits for retailers so they can focus on solving business problems instead of on system development. Intel has been in the big data business for over 30 years, managing state-of-the-art manufacturing sites and complex supply chain networks. Applying this knowledge to in-store analytics, Intel data analytics consulting services yield actionable data that allows retailers and brands to respond to customers' desires in a more timely and predictive manner.

    Figure 14. Physical Infrastructure

    Twitter* Data Streaming Server: 2 x Intel Xeon processor @ 2.7 GHz, 24 TB disk space, and 126 GB memory; OS: Microsoft* Server 2008 R2; Mongo* database.

    Hadoop* Cluster (four Hadoop nodes): 2 x Intel Xeon processor @ 2.7 GHz, 24 TB disk space, and 126 GB memory per node; Cloudera* Distribution of Hadoop (CDH5).

    Transaction Data Staging Server: 2 x Intel Xeon processor @ 2.4 GHz, 7 TB disk space, and 24 GB memory; OS: Microsoft Server 2008 R2.

    Web Server: 2 x Intel Xeon processor @ 2.4 GHz, 2 TB disk space, and 24 GB memory; hosts the UI and analytics visualization.

    Network: 10 Gigabit Ethernet, connected through a 10 Gigabit Ethernet switch.

    For more information about Intel solutions for the retail industry, visit www.intel.com/retail.

    Visit http://www.cloudera.com for information on Cloudera Big Data solutions.


    1 Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 MetaGroup research publication, 3D data management: Controlling data volume, variety and velocity. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.

    2 Ernst and Young, Ready for takeoff?, 2014, http://www.ey.com/Publication/vwLUAssets/EY_-_Ready_for_takeoff_executive_summary/%24FILE/EY-Ready-for-takeoff-Executive-summary.pdf.

    3 McKinsey & Company, Consumer Marketing Analytics Center, Creating Competitive Advantage from Big Data in Retail, June 2012. http://www.mckinsey.com/client_service/retail/expertise/~/media/mckinsey/dotcom/client_service/retail/articles/cmac_creating_competitive_advantage_from_big_data.

    4 Andrew McAfee and Erik Brynjolfsson, Big Data: The Management Revolution, Harvard Business Review, October 2012.

    Copyright 2014 Intel Corporation. All rights reserved. Intel, the Intel logo and Xeon are trademarks of Intel Corporation in the United States and/or other countries.

    *Other names and brands may be claimed as the property of others. Printed in USA 0614/MB/TM/PDF Please Recycle 330716-001US