
AI WEB ENGINE PROJECT

Phil Hummel, Technical Marketing Consultant, Machine Learning and AI, Dell EMC, philip.hummel@dell.com

Gene Chesser, Dell EMC

Knowledge Sharing Article © 2018 Dell Inc. or its subsidiaries.


Table of Contents

Introduction
AI – An Engine of Innovation for Online Communities
    Identifying opportunities
    Assembling a team
    Using a formal process
The Machine Learning Technology Landscape
    Defining ML, DL and AI
    Letting the data tell the story
Using ML and AI in Online Communities
    Project vision
    Related projects
    Defining the project
        Crawl
        Walk
        Run
        Sustaining Operations for AI
Conclusions
References
Appendix A: Architectural Design
    Goal: Establish Thought leadership in AI-ML-DL
Appendix B: Process Flow
Appendix C: LDA Process Engine
Appendix D: Input Source Engine
Appendix E: Concept Outlines
    Input Sources
    Classification Engine
    Raw Data Deep Content Search and tagging
    Analytics Association Clustering and Grouping Engine
    DL Engine
    Primary Inferencing Engine
    Maintenance Processes Engine


Table of Figures

Figure 1: Using a Formal Process
Figure 2: Architectural Design
Figure 3: Process Flow
Figure 4: LDA Process Engine
Figure 5: Input Source Engine

Disclaimer: The views, processes or methodologies published in this article are those of the authors.

They do not necessarily reflect Dell EMC’s views, processes or methodologies.


Introduction

One of the most significant outcomes from the development of the Internet has been the explosion of

easily available information that simultaneously enriches and overwhelms us. Another significant

development has been the creation of online communities that allow people from around the world to

connect and share information on any topic imaginable. And yet, despite the decades-long evolution of

online communities that so many people would argue have gotten richer and more engaging, it feels like

the information explosion is outpacing our individual and community efforts to filter and digest even a

fraction of the most relevant content that we would like to review. There is still much room for

improvement in the way we consume and share content. Our opinion is that online communities

present a greater opportunity for innovation in combating information overload than do personal

productivity tools.

On a recent business web conference call, one overwhelmed participant typed into the chat window “I

just need the world to slow down for a year so I can catch up on my reading.” We suspect this person is

not alone. We also feel that way – at least once a week. So, what can we do about it? This paper

describes a proposal for enriching the content sharing experience of online communities using a

combination of machine learning (ML), deep learning (DL), artificial intelligence (AI) and plain old human

nature.

Information technology both creates the challenges we face with information overload and can help
us get a handle on it. We have Google News and other "search services" that will automatically run
keyword searches and email us interesting URLs. That increases the size of our reading list but should
help organize the recommendations into topics. We also have community sites like Twitter, LinkedIn and

Facebook. We can easily connect with like-minded people but we can’t influence what they share. The

focus of our socially-generated streams varies widely and cannot match our current interests very well.

A third type of community is the “question and answer” sites such as Quora, Reddit and, for

technologists, Stack Overflow. These sites concentrate information but we still must start with an often-

imprecise keyword search. On the positive side, the results are typically short, information dense and

more action oriented. Finally, sites like Reddit and Stack Overflow introduce point systems to help users

find better answers to questions. Users gain reputation points on Stack Overflow and karma on Reddit

based on how other users rate the quality of their contributions. The theory is that answers or posts

from users with above average point values should be more useful based on the communities’

evaluation of past contributions.

In this paper we use this information combined with current trends in technology to propose a new

design for an online knowledge sharing community that uses many of the existing successful strategies

mixed with some new features. We describe the vision and propose how a prototype could be

developed. Our goal is to stimulate discussion and look for collaborators.


AI – An Engine of Innovation for Online Communities

Identifying opportunities

The goal of any project should be to create change in the world, and that is especially true for AI projects.

For AI projects, that change may be as simple as improving the quality of some data that an organization

is collecting or may be as large as a new product introduction or a social media-powered movement.

The important point is that you should have a vision articulated in writing before you start. Find as many

people from diverse backgrounds as possible to review your plan and encourage feedback. It is far easier

to improve your approach to realizing the vision early in the journey.

The project that we describe in this paper started with a vision, a rather large one. The project creator

developed the concept over a period of months during which he engaged with many people to present

and refine the concept. He also conducted extensive online research to determine how machine

learning (ML), deep learning (DL), and artificial intelligence (AI) were being applied to similar areas of

innovation. Those efforts have been fruitful, once again reinforcing the value of disciplined research and
discussion prior to starting development of the project that will realize the vision.

Assembling a team

Creating meaningful change in the world with AI involves a lot of difficult but rewarding work. It is rarely

achieved through the efforts of a single individual. Anyone who leverages open source tools for ML or DL
is already relying on the efforts of dozens or hundreds of developers and the testing efforts of thousands
to millions of other users to accomplish their goals. Once a person creating a vision understands the need
for and advantages of collaboration, the likelihood of success improves compared to the alternative of
retaining complete control by going solo. Understanding your current strengths in the context of

everything that is required for a successful AI project helps narrow your search for complementary

teammates.

The initial concept for this project was developed by someone with a long history in technology and

online community experience but relatively new to data science. The next person to join the team was

strong in data science with some programming skill and much less domain experience. Together, they

have been able to better describe both the vision and a plan for how to architect a proof-of-concept

than either would have accomplished alone. The next step will be to use those documents to attract

additional team members with other skill sets and eventually investment to turn the design into a

prototype. The primary goals are to learn, foster collaboration with people of diverse backgrounds and

to advance the state-of-the-art for online information sharing communities.

There has been much discussion in the AI, ML and DL industry about the definition of, and availability of,
data scientists. At one extreme are advocates of the importance of the "unicorn" data scientist who
is an expert in statistics, programming, data management, and the problem domain. A recent tongue-in-
cheek blog article recognized by KDnuggets suggests that "all it will take to become a real data scientist
is five PhDs and 87 years of job experience." At the other extreme are those arguing that there is no
shortage of data scientists since the current crop of data science tools is so powerful that "citizen data
scientists" are all that most projects need. Our experience and understanding of the current tools and
breadth of skills required to successfully complete a data science project lead us to conclude that the
most viable staffing strategy is somewhere between the extremes.


There are many successful data scientists who are experts in a few problem domains, with in-
depth knowledge of the important data, data science research, and best practices in those areas.

Therefore, we suggest a strategy for new talent development that mimics this historical observation.

Invest in training that will develop subject matter-specific data scientists. For example, define a role for

an image recognition expert in quality control or security surveillance and then train one or more people

to fill that specific job description. Another role may be created for a natural language/speech

recognition expert in the field of customer service. Roles that have a 6-12 month expected learning

curve are a good test of the potential return on investment from talent development.

Using a formal process

Building business processes and value based on intelligence derived from data can be both rewarding

and risky. There are many published case studies that ended well, as well as many that ended poorly. Just as
the value, quality, and reliability of software assets have been improved through the
application of dedicated management techniques, data science investments that use a formal process
have been shown to be more successful. There are many well-regarded frameworks that can be used for
data science and analytics work, including the model proposed in a 2013 Dell EMC article (Figure 1).

Figure 1: Using a Formal Process

Organizations should review what options are available and pick a framework with good adoption and

try to stay with it for the first few cycles from discovery to production operations. The main advantage

of using a formal process is to build in regular evaluation by a team that shares responsibility. Each

checkpoint that requires team approval before moving to the next stage provides an opportunity to

document the status, risks and goals for both the current and next phase(s).


The Machine Learning Technology Landscape

Defining ML, DL and AI

There are many overlapping and conflicting definitions of machine learning, deep learning and artificial

intelligence everywhere we look, especially on the internet. We felt it was critical to define these terms

for the context of this paper in as clear a manner as possible and then use them consistently throughout

this discussion.

Our definition of AI is software with embedded intelligence features that users perceive as smart or

adaptive. The challenge with such a broad definition is that the expectations of users continue to evolve.

When users first encountered recommendation engines on ecommerce and entertainment web sites

they were considered state-of-the-art smart applications. Today, simple recommendations like "users

who bought this product also bought these items” are considered routine. There are other examples of

recommendation engines that are combined with chatbots that take advantage of the context of the

conversation while making suggestions. In our opinion, the most likely sources of intelligence used in AI

software include people/experts for expert systems, ML and DL for data-driven intelligence, and

reinforcement learning for adaptive games and software. Examples of software applications that have

the potential for being considered AI, highlighting the subjective nature of our designation, include:

Spam filters/content filters

Recommendation/personalization engines

Chatbots

Image/speech recognition

Autonomous driving vehicles

Games that learn during play

For data-driven intelligence used in AI we need to be able to discover associations between two or more

variables. With only a single variable we are limited to descriptive statistics like the average, maximum,

minimum, etc. For AI we need to be able to estimate the most likely value of one variable (the outcome)

based on the values of one or more features. We want to be able to estimate the air temperature based

on the hour of day or we want to estimate someone’s weight given their height, age and gender. We

refer to these relationships between outcomes and features as data models.

The simplest type of data model we can specify assumes a linear (straight line

or plane) relationship between the outcome and the features. For instance, we could assume that the

relationship between the fuel consumption of a vehicle (miles per gallon) and the weight of a vehicle

(pounds) is linear. We can estimate the parameters of the linear model (train the model) for such data

using many types of ML and DL. Statistical theory gives us “tests” that we can compute to determine if

the linear assumption was valid or “good enough”. If the tests determine that the assumption of

linearity was not valid, there are many more sophisticated machine learning models that can be used to

work with more complex relationships. The work of understanding the types of relationships that exist

between variables and how they can be represented (modeled) consumes a significant amount of time

for many data scientists working with machine learning techniques.
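To make this concrete, here is a minimal sketch of training and checking such a linear model with scikit-learn; the miles-per-gallon and vehicle-weight values below are invented for illustration and are not data from this project.

```python
# A minimal sketch, assuming illustrative mpg/weight values, of "training" a
# linear model and checking how well the linear assumption fits.
import numpy as np
from sklearn.linear_model import LinearRegression

weights_lbs = np.array([[2200.0], [2800.0], [3300.0], [3900.0], [4400.0]])  # feature
mpg = np.array([33.0, 28.0, 24.0, 20.0, 17.0])                              # outcome

model = LinearRegression()
model.fit(weights_lbs, mpg)  # estimating the slope and intercept is the "training" step

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("R^2:", model.score(weights_lbs, mpg))  # one simple check of the linear assumption
```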

We therefore use the term machine learning to refer to the application of methods and algorithms that

use statistical theory to support assumptions regarding the statistical properties of the data and


relationships between the variables. This definition is often referred to as “classical” machine learning

by some authors but we will just use machine learning for the remainder of this paper.

When the relationships between the variables in a data set are very complex it is difficult to find an

acceptable machine learning technique where all or most of the standard statistical assumptions for that

modeling technique are satisfied by the data. If we use a model with assumptions that are not valid, the

results will not be reliable.

Even though machine learning algorithms have grown in capability and complexity over the years, the
relationships between the outcome and the data features for problems like image classification and
speech recognition have proven to be too complex for these methods to be acceptably accurate.

Deep learning techniques have emerged that can efficiently model very large data sets with very

complex relationships between the variables. Deep learning techniques allow the relationships between

variables to be discovered during training. Deep learning techniques use a robust optimization approach

that allows even very complex relationships in the data to be discovered without having to make any

assumptions based on statistical properties of the data. The tradeoff is that the computational workload

is greatly increased over most machine learning approaches and they won’t produce reasonable

modeling results without a relatively large amount of data.

When a data set is small and/or the form of the relationships between the variables can be specified,

machine learning models are usually the most efficient and cost-effective approach to modeling the

relationships between variables. For certain types of problems including image and voice analytics as

well as massive amounts of traditional text and numbers, deep learning could be the only option.

Letting the data tell the story

Spending too much up-front time attempting to choose "the best" modeling framework from the vast

array of ML and DL options is a common pitfall for organizations new to data modeling. Data analytics

frameworks, like the one pictured above, are clear about the importance and the resource intensity of

the phases between discovery, data preparation and model evaluation. A good starting strategy is that "the
simplest model that gets the job done" is preferred. The type of data and type of result that you are

hoping to achieve can also be used to inform modeling strategy. Asking what has been tried in the past

and what data scientists are having the most success with currently will help limit the number of dead-

end efforts. The data science community is amazingly open to sharing experience and answering

questions. It is unlikely that anyone will tell you exactly how to solve a problem, but asking a well-

formed specific question that shows that you expended significant effort researching prior to posting

will often result in a better outcome than trying to reinvent every wheel in the data science toolbox.

Using ML and AI in Online Communities

Project vision

The best way to conceptualize the vision is by comparison to an amalgam of existing online experiences.

Imagine a site where:

1. members share content from the web, e.g. LinkedIn, and

2. are awarded community value points by other members based on the quality of their

submissions and reviews, and

3. new curated content streams from an intelligent web crawler, e.g. Google News, and


4. all data and activity on the site fuels a personalization engine, e.g. Spotify.

Did we mention that the vision was massive? The rest of the paper will attempt to expand on this vision

and describe how we would go about building a prototype.

Our goal is to develop a community-focused website that creates an immersive experience by simulating

a virtual reality maze of information and reference material that allows the user to navigate to relevant

sources of content quickly and easily. Community members can get tailored recommendations by

identifying their area(s) of research and/or interest as part of their personal profile; the AI engine then personalizes
the scope for each individual as it learns more and more about the individual's preferences
and continues to update their profile in real time.

It is important that the site design:

provides an aesthetically pleasing and user-friendly experience

is appropriately segmented and relevant in unique and dynamic ways to a highly-technical

audience

The site will include all relevant information pertaining to product and solution value and positioning

but the scope of discoverable content is not limited to any one company.

The AI for the site will be powered by a personalization engine. The engine will provide

recommendations from thousands of dynamically-scanned URLs that contain web articles and

information that have been analyzed using ML and DL techniques to match the content to user
interests and queries. Another goal of the site design is that the personalization engine improves as
users spend more time using the site and rating content, gathering additional user and
content data dynamically and via user input that feeds back to the ML/DL models.

In this concept, there is also a gamification aspect that serves not only as a mechanism for unique,

specialized navigation, but also supports concise data gathering and processing. Gamification of the site also

includes features to support point accumulation from other members of the community coupled with

one or more leader boards for those that like a bit of competition and recognition. Participation in this

aspect of the site can be optional. The proposed concept example below represents one possible

scenario for implementation:

A community member enters the site and is placed in the lobby of a virtual building with multiple doors

in view:

One door leads to the Research Department

A second door is labeled Solutions Expo

Another door opens into the Products and Resources Market

The fourth door is to the Communities Conference portal.

A more detailed description of the Research Department will help explain some of the features that will

only be possible through the application of ML and DL. The Research Department would be largely a

repository of documents prioritized by a content-aware search engine. As a user selects categories to search,

the complexity of the underlying engine is hidden behind an intelligent chat (chatbot) interaction or

other intuitive interfaces to narrow the selection. The available information store is made up of the

results of crawling site maps that contain thousands of links to articles and information related to any


number of technical disciplines such as AI, ML and DL, together with other sociological input sources like Google

Analytics and historical cookie data collected based on what the user chooses to research. Ideally, the AI

engine would be constantly analyzing a list of URLs supplied by members of the community. The AI

engine would then create detailed sitemaps and determine whether the scanned resources are quality AI/ML/DL

content sources to include in the information store. The AI engine would also rate the content and offer

it to appropriate users based on their interest profiles as potential high-value data related to their

search.

As content is discovered and recommended to members, they would be offered the option to read the

item or add it to their personal “document cart”. The cart should function like familiar features found on

software and driver support sites or ecommerce sites like Amazon where you can get previews and drill

down to the details and specifications. Members that are just browsing the site but have not added

content to their cart can also generate data useful to the AI engine using cookies to track what content

they looked at after proper notification regarding the use of cookie tracking. Tracking data to enhance

the user's profile can also be used to offer users an email summary with either links to the recently

visited articles or a zip file of the content for offline viewing. Users would also be able
to add their own articles or blog posts, filling in a set of fields describing author details,
organization, etc., that would create the metadata for keyword searches.

The AI-Web Engine could also email users when new articles come in that match their profile as high-
value information, so that they can read or download the new content. The gamification concepts
mentioned earlier can award users points and status for researching and contributing articles or
blogs, with rewards such as membership in a top-ten group or recognition as a master in specific focus
areas like Oil and Gas or Self-Driving Cars. As more users interact with the AI-Web Engine, the experience
will improve, and the community will take pride in developing a smarter, more fun experience where they
get what they are interested in quickly and easily.

One key to success for this effort is to build on the current academic and open source mindset where

the site becomes a clearing house for members to give and receive as part of the AI community. There

would be sub-sections for specialty groups like ML vs. DL as well as links to the Communities

Conferences area of the site mentioned above that would be focused on industries that could include

finance, medical research or any others in response to members proposing them and participating.

Related projects

During the development of this project vision we have used Internet research to discover other projects

that are related to this area of research and development. We have focused primarily on ML and DL for

text analysis. The areas of personalization and gamification still need a thorough review of current best

practices and trends.

A form of text analysis described as unsupervised machine learning for topic modeling is a compelling

option for the new search engine since it has the attractive feature that it does not require a large

collection of labeled data as input. Our theory is that we can “jump start” the process of organizing

discovered content via an unsupervised approach and improve on the accuracy of that tagging through

member reviews and refinement of the topic assignments in a second stage most likely using DL.

Starting with deep learning for all text analysis has the potential advantage of combining feature

extraction and topic modeling in a single framework. A downside of all deep learning approaches


including those for text analysis is that the models and their sensitivity to inputs are frequently difficult to understand

and therefore explain. The perception of deep learning as “black box” intelligence has been a significant

barrier to acceptance in many situations where business owners want to understand how software with

AI capabilities works.

Both machine learning and deep learning for text analysis require pre-processing of the raw text prior to

doing the actual modeling. This step is often referred to as feature extraction. Our research shows that a

popular starting point for topic modeling as well as many other text analysis techniques is to construct a

“bag of words” representation of the document. The bag-of-words model is mainly used to generate

features used in machine learning and deep learning models. Two of the most common features

extracted from the bag of words are the term frequencies (TF) (number of times a term appears in the

text) and inverse document frequencies (IDF). Used together, they form the TF-IDF statistic, which represents
how important a word is to a document in a collection or corpus.
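To make this feature-extraction step concrete, here is a minimal sketch that builds TF-IDF features with scikit-learn, one of several libraries that implement it; the three short documents are placeholders, not content from the project.

```python
# A minimal sketch, assuming placeholder documents, of the bag-of-words / TF-IDF step.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "deep learning for image classification",
    "topic modeling for text analysis",
    "machine learning for text classification",
]

vectorizer = TfidfVectorizer()          # builds the vocabulary (bag of words) and computes TF-IDF
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per term

print(vectorizer.get_feature_names_out())  # vocabulary learned from the documents
print(tfidf.toarray().round(2))            # TF-IDF weights per document and term
```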

Here is a list of some interesting articles we have reviewed that provide additional background on the

topic of machine learning for text analysis.

Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment: this article provides a

detailed discussion of the techniques for constructing a bag of words from a text document with Python

code samples.

Natural Language Processing in a Kaggle Competition for Movie Reviews: this post will help you

understand some basic NLP (Natural Language Processing) techniques, along with some tips on using

scikit-learn to make your classification models.

Improving Uber Customer Care with NLP & Machine Learning: Discusses building an NLP model that

analyzes text using topic modeling. Even though it does not take into account word ordering, it has been

proven very powerful for tasks such as information retrieval and document classification.

Defining the project

Given the ambitious goals of the project and the limited resources, we have adopted a crawl, walk, run

strategy for defining and building a prototype. In the crawl phase we have examined web presentation

concepts and engaged a set of WordPress developers we have worked with in the past to address

the user experience and investigate how to create an interactive API that can collect user information in

real time and feed it to the AI-Web Engine as well as receive Inferencing recommendations from the AI-

Web Engine to maintain the real-time experience for the user. We have also created the process flow

charts defining the separate execution steps required to collect, classify, collate, and develop metadata

to be associated with the actual articles and feed this information into the ML-DL application processes.

These design concepts and process flows will certainly change as we move into the walk and
run phases, where we develop the individual working models in preparation for the fully working proof-
of-concept project. The concept charts and outline descriptions in the appendices provide a more
detailed description.

Crawl

This is literally the crawl phase where we must put limits on the sources of data we are going to explore

and begin to extract text and metadata from URLs. We do not want to, nor can we, replicate the scope of


Google or Bing. For the initial prototype, we plan to limit the starting URLs for the crawling engine to a

handful of known, content-rich data science sites like Data Science Central, KDnuggets, etc. We can also

limit the scope of exploration by limiting the depth of the crawl and not following links to other

domains. We have also recently explored the idea of doing string matching on the text of URLs against a

target word list that would include keywords such as machine learning, ML, AI etc. If we applied this

logic to the list of URLs in the Related Projects section we would have skipped the Uber Engineering

article based on the limited information in the URL. It is obvious that there are still many design

decisions that need to be made based on the information we gather during prototype development and

testing. The purpose of this initial limited crawl is to get enough relevant content to test and refine the

topic modeling engine.
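The URL string-matching idea described above could start as simply as the following sketch; the keyword list and candidate URLs are illustrative assumptions rather than a settled design.

```python
# A hedged sketch of the URL string-matching idea: keep only URLs whose text contains
# at least one keyword from a target list. Keywords and URLs here are placeholders.
KEYWORDS = ["machine-learning", "machine_learning", "deep-learning", "-ml-", "-ai-", "nlp"]

def url_matches(url: str, keywords=KEYWORDS) -> bool:
    """Return True if any target keyword appears in the lower-cased URL."""
    u = url.lower()
    return any(k in u for k in keywords)

candidate_urls = [
    "https://www.kdnuggets.com/2017/12/machine-learning-new-paradigm.html",
    "https://eng.uber.com/cota/",   # would be skipped, as noted above
]
to_crawl = [u for u in candidate_urls if url_matches(u)]
print(to_crawl)
```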

Our current proposal for unsupervised topic modeling is to use Latent Dirichlet Allocation (LDA), a

generative statistical model for assigning topics to documents. We have also been exploring the use of

an autoencoder neural network for our unsupervised topic modeling learning algorithm. The advantage

of LDA is that it is widely available from a variety of frameworks including version 1.3.0 and higher of

Apache Spark, open source R with the topicmodels and lda packages and NumPy for Python. LDA is also

easy to implement and interpret. To run LDA we will first need to create a bag of words data structure

and compute the TF-IDF statistics, all of which are widely documented and implemented in packages for

both R and Python.
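As a simplified stand-in for those implementations, the sketch below runs LDA with scikit-learn on a few placeholder snippets; the library choice, documents, and topic count are illustrative assumptions and not the prototype's actual pipeline.

```python
# A minimal sketch, assuming placeholder documents and two topics, of unsupervised
# topic modeling with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural networks and deep learning for image recognition",
    "convolutional networks improve image classification accuracy",
    "spark and hadoop for large scale data processing",
    "distributed data pipelines with spark streaming",
]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)                        # bag-of-words term counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                 # per-document topic mixture

terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-4:]]  # highest-weight terms per topic
    print(f"topic {i}: {top_terms}")
print(doc_topics.round(2))
```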

Another reason to start the prototype with LDA is that when documents cover only a small set of
topics and topics use only a small set of words frequently, LDA typically results in a better

disambiguation of words and a more precise assignment of documents to topics compared to other

models. Nonparametric extensions of LDA include the nested Chinese restaurant process which allows

topics to be arranged in a hierarchy whose structure is learnt from data.

Walk

The second phase of the project begins when we have a reasonable number of classified documents.

We are targeting between 2,500 and 5,000 covering 5-10 topics. We will then need a web application to

begin registering members of the community so they can start interacting with the content. The goal in

this phase is to collect user requirements for future development and begin to build additional features

that can be used in the recommendation engine. For the time being, we can only recommend articles

based on the topic(s) assigned in the Crawl phase.
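In code, that interim topic-only recommendation could be as simple as the following sketch; the data structures are illustrative placeholders, not the eventual recommendation engine.

```python
# A hedged sketch of topic-based recommendation: surface articles whose assigned LDA
# topic matches the interests in a member's profile. Fields and values are assumptions.
articles = [
    {"title": "Intro to LDA", "topic": "topic_modeling"},
    {"title": "Tuning Spark jobs", "topic": "big_data"},
    {"title": "Word embeddings explained", "topic": "nlp"},
]

def recommend(member_interests, articles):
    """Return articles whose assigned topic is in the member's interest list."""
    return [a for a in articles if a["topic"] in member_interests]

print(recommend({"nlp", "topic_modeling"}, articles))
```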

We will also begin development of the community point system in this phase. We need a relatively

simple-to-understand but meaningful set of rules. In our experience, the rules for both Stack Overflow

points and Reddit karma are too complex as a starting point. The Everyone Social platform for social

selling and employee advocacy uses a very simple and easy-to-understand point system based on
"engagements". We expect that our initial design will take something from each of several platforms
and let member feedback inform improvements.
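For illustration only, a first cut at an engagement-based rule might look like the sketch below; the event names and point values are assumptions, not the actual design.

```python
# A hedged sketch of a simple engagement-based point rule; the event names and point
# values are illustrative assumptions.
POINTS = {
    "share_article": 5,     # member submits a new piece of content
    "review": 2,            # member rates or comments on content
    "upvote_received": 1,   # another member found the contribution useful
}

def score(events):
    """Total community points for a list of (event_name, count) pairs."""
    return sum(POINTS.get(name, 0) * count for name, count in events)

print(score([("share_article", 3), ("review", 10), ("upvote_received", 42)]))  # 77
```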

Run

This stage of the project is where we tackle the most complex technology challenge, the personalization

engine. To test alternative implementation approaches we will need a lot of data that we don’t have

today. After 3-6 months of operating the community in the Walk stage we will begin to have some

member profiles, engagement activity, member rankings and comments and other data crucial to


building the personalization engine. Our current thinking is that we will be able to use a relatively small

set of actual data to generate a larger data set through sampling and simulation. We can use the

relationships in the actual data and add a degree of randomness during the simulation to provide a large

enough volume of data to test alternative ML and DL algorithms. The advantage of putting

development effort into a data simulation tool is that it allows us to precisely control the form and

complexity of the data relationships to better evaluate which models discover what we already know

about the data.
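A rough sketch of that simulation idea follows; the column names, bootstrap approach, and noise level are our illustrative assumptions rather than a settled design.

```python
# A hedged sketch of data simulation: resample the small set of real engagement records
# and add controlled randomness so known relationships are preserved at a larger volume.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Stand-in for the small set of actual member engagement data.
real = pd.DataFrame({
    "articles_read": [3, 12, 7, 25, 1],
    "minutes_on_site": [10, 55, 30, 120, 4],
})

def simulate(real_df: pd.DataFrame, n_rows: int, noise: float = 0.1) -> pd.DataFrame:
    """Bootstrap rows from the real data and jitter them with Gaussian noise."""
    sampled = real_df.sample(n=n_rows, replace=True, random_state=7).reset_index(drop=True)
    jitter = rng.normal(1.0, noise, size=sampled.shape)  # controlled degree of randomness
    return (sampled * jitter).clip(lower=0).round(1)

big = simulate(real, n_rows=10_000)
print(big.describe())
```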

Sustaining Operations for AI

We have presented a vision and plan for conducting research and development into the use of AI for

improving the ability of online communities to collectively review and share content. We also want to

address some of the issues related to the successful long-term viability of an AI application.

The amount of intelligence/information that is encapsulated in an ML/DL model at the time of estimation

or training is determined by the input data and the specification of the model. If the set of significant

features and the relationships between the features and outcome are stable over time, then the model

will never need to be updated. As you can imagine this is never the case with interesting problems.

In practice there are many factors that create the need for ongoing development and testing. First, the

list of known and important features will most likely change over time. New sources of data are

discovered or developed. Some initial features will lose influence over time. Secondly, new models and

application techniques are constantly evolving and the use of ensemble approaches involving multiple

interrelated models is an endless opportunity for research. Also, any AI system that involves analysis

of human attitudes and behaviors will be impacted by changing social trends, news, behavior and needs.

Complex AI systems are notoriously difficult to monitor for change and oftentimes the first indication

that a system may need to be reevaluated is when the prediction accuracy drops noticeably. A more

serious and difficult situation to assess is when a model suddenly goes from acceptable accuracy to

unacceptable. This frequently indicates there has been a structural change in the data relationships.
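One simplified way to operationalize that kind of monitoring is to compare recent accuracy against a baseline and flag a review when the drop exceeds a threshold; the sketch below is an assumption-laden illustration, with window size and threshold chosen arbitrarily.

```python
# A hedged sketch of an accuracy-drop monitor; window size and threshold are assumptions.
from collections import deque

class AccuracyMonitor:
    def __init__(self, baseline_acc: float, window: int = 500, max_drop: float = 0.05):
        self.baseline = baseline_acc
        self.recent = deque(maxlen=window)   # rolling record of 1 (correct) / 0 (incorrect)
        self.max_drop = max_drop

    def record(self, correct: bool) -> bool:
        """Record one prediction outcome; return True if the model needs review."""
        self.recent.append(1 if correct else 0)
        if len(self.recent) < self.recent.maxlen:
            return False                     # not enough recent data yet
        current = sum(self.recent) / len(self.recent)
        return (self.baseline - current) > self.max_drop
```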

When the project has passed the proof of concept phase and is in the final implementation stage, a

Maintenance Process Engine will be designed to run additional analytics processes so the AI-Web engine

can adapt to new concepts not available at the time of this project definition. The new concepts to

consider are largely expected to come from the actual user community interacting with the AI-Web

engine. If we do a good job designing the frameworks and create a rewarding and fun experience the

project should live on for many years and can easily be adapted to other research subjects in addition to
AI, ML, and DL.

Conclusions

Most people agree that the amount of new information created every day, even for a small set of topics,

is overwhelming our ability to feel acceptably informed. This problem impacts both our professional
and personal lives. The development of new personal productivity tools such as tablets and smart
phones, coupled with intelligent applications, can simultaneously help and hinder our ability to

feel caught up on even the most recent events and developments. The internet era has also fueled the

development of online communities for knowledge sharing and collaboration that both increase the


competition for attention and can help us find and digest some types of information more

efficiently.

In this paper we have begun to describe a vision and plan for improving the capabilities of intelligent

software in the context of online communities to help deal with the information overload that we all

experience. The vision is ambitious but achievable. The research and development plan is evolving and

we recognize that there is much yet to be done. The goal of presenting the ideas and concepts at this

stage is to foster comment and debate. We are committed to continuing the work on the research and

development effort and any feedback we can solicit will surely help to solidify and improve the results.

Our potential to use technology together with our basic human need to share and be recognized for our

efforts is great. Internet technology breaks down the barriers created by distance, time and even

language. We should take advantage of those benefits in an effort to improve our ability to learn and

share from the valuable but overwhelming amount of content that this same technology is enabling.

Online communities powered by AI have the potential to improve our control of information and help us

learn and share together in a way that will save each of us precious hours each week and improve our

feelings of being well informed and connected.

References

Dell EMC 2013 article: David Dietrich, "The Genesis of EMC's Data Analytics Lifecycle."
https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/

Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment: Jason Brownlee, "How to Develop a Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment."
https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/

Natural Language Processing in a Kaggle Competition for Movie Reviews: Jesse Steinweg-Woods.
https://opendatascience.com/blog/natural-language-processing-in-a-kaggle-competition-for-movie-reviews/?imm_mid=0f9a96&cmp=em-data-na-na-newsltr_ai_20171211

Improving Uber Customer Care with NLP & Machine Learning: Huaixiu Zheng, Yi-Chia Wang, and Piero Molino, "COTA: Improving Uber Customer Care with NLP & Machine Learning."
https://eng.uber.com/cota/


Appendix: ML-DL AI Engine for Research and Recommendations

Appendix A: Architectural Design

Figure 2: Architectural Design

[Diagram summary: input sources (site map data, published industry data, ML-DL training data, education and event data, daily publications, blogs, analyst and industry feeds, Wikipedia, and Google Analytics data) feed a WordPress-based web interface with user login, profile management, cookie tracking, and questionnaires; a Linux GPU/FPGA back end runs the data-gathering web crawler, search/index and deep content metadata, classification (inclusion/exclusion, old/new), association linkage and clustering, and the inference recommender and presentation layer.]

Goal: Establish Thought leadership in AI-ML-DL

The goal is to create a site that not only provides an aesthetically pleasing and user-friendly experience,

but that is appropriately segmented and relevant in unique, dynamic ways to a highly-technical

audience. A site that exemplifies and reinforces the solutions' value and positioning throughout – for
instance, by leveraging and promoting an ML-DL search engine where the data in specific sections is

thousands of dynamically-updated URLs that contain articles and information that are perfect subject

data for content addressable search algorithms.


Appendix B: Process Flow

Figure 3: Process Flow

[Diagram summary: on a VMware ESXi 6.5 host with GPU, FPGA, network, and data pool resources, client and WordPress VMs (VM 1-9) provide the user interface, login and profile management, and cookie tracking; Linux VMs host the web crawler input sources (VM 10), Hadoop/Spark (Hortonworks or Cloudera) LDA topic modeling and data labeling via MLlib (VM 11), search/index metadata and deep content (VM 12), ML association linkage and cluster grouping (VM 13), an Intel BigDL deep learning analytics engine that packages and trains neural network models for continuously improving recommendations (VM 20), and the inference recommender and presentation layer (VM 30-31).]


Appendix C: LDA Process Engine

Figure 4: LDA Process Engine

[Diagram summary: on VM 11, site map data is imported into a Hadoop/Spark (Hortonworks or Cloudera) analytics database, and LDA from Spark MLlib classifies and labels URLs in successive passes (pruning excluded URLs, producing web crawl targets for the input sources VM 10, enriching the master database with metadata and raw content, and forwarding labeled data to the search and deep content index on VM 12). Setup notes call for an Ubuntu Server VM with NVIDIA GRID and the CUDA Toolkit plus a Hadoop/Spark distribution before the algorithms are developed.]


Appendix D: Input Source Engine

Figure 5: Input Source Engine

[Diagram summary: on VM 10, data is collected through manual and automated input, site map data and a change log are maintained, URL lists are sent to the LDA VM (VM 11) for labeling and exclusion, raw content is downloaded for the returned targets and stripped of HTML tagging, the master database is updated with raw data and metadata for deep content labeling and scoring, and the automated update process is repeated for each data type.]


Appendix E: Concept Outlines

Input Sources

1. Data Sets

a. Site Map Data

b. Event Data – Focus areas sessions

c. Training Data – Education Material

d. Published existing technology data

e. Industry Publications

f. Blogs

g. Wikipedia

2. Input Processes

a. Web Crawlers

b. ML-DL output sources

c. Google Analytics

d. Other Analytics

Classification Engine

1. ML Algorithms

a. MLLIB

b. LDA – Label process of Site Data URLs

2. Classification categories

a. Inclusion

b. Exclusion

c. Known Data

d. New Data

3. Platforms

a. Hadoop

b. Spark

c. Hortonworks

d. Cloudera


4. Results analysis and output to Web crawler for raw content

Raw Data Deep Content Search and tagging

1. Accept input from LDA to download full content of high priority labeled data

2. Deep content search and labeling (e.g. Autonomy or Elasticsearch)

3. Prepare advanced metadata tagging for each article of data to be used

4. Prepare data for output to analytics association engine

Analytics Association Clustering and Grouping Engine

1. Prepare data with value weighting to compare to user interest categories

2. Establish update process to update available labels to match to user interests

3. Prepare data for DL input

DL Engine

1. Accept Input from multiple sources

a. Analytics Association Data

b. User Real time cookie information

c. Google Analytics

d. Change Data from multiple users

2. Determine training model and continuous update processes

3. Prepare Inferencing output data

4. Flag new data that can be offered to users based on their profile, both in real time and after the user
logs off

5. Establish maintenance flags triggering when data suggests model should be updated

Primary Inferencing Engine

1. Receive real-time input from DL engine

2. Determine if data is for immediate user feedback or future communication

3. Maintain flag logging and threshold management to kick off other processes (e.g. maintenance)

4. Feed Analytics and reporting engine

5. Feed Inferencing Presentation Layer

Maintenance Processes Engine

1. Reporting


2. Sustaining processes

a. Model Update Flagging

b. Structural Changes need analysis

c. Anomaly flagging

3. Clean up process management

4. Process maintenance and documentation

5. Total Analytics Lifecycle Management

a. Future Maintenance concepts list

1. Define models
   1. Known data
   2. Changed data
   3. Inferred data
   4. Sustaining
      1. How do you determine the model needs changing or review?
      2. What are the triggers in data and results that would require updating models?
         1. Example: new technologies not available at the beginning of the project
         2. Data analytics of the actual project
            1. Google Analytics, folded into another data set for the DL engine to make better decisions
            2. Internal analytics as part of the project
b. Detecting structural changes
1. Define structural: things that can change
   1. What we expect
   2. What we think may happen
2. Model needs to be reviewed on the data we have
   1. Consider new sources of data
   2. Consider new technology tools
   3. Adapt structural changes to new concepts, including completely new sources of features
   4. Constant quest for relevant data, a good opportunity for an academic community of thinkers
      1. Build a blog, giving users instant access to feedback tools
      2. Incorporate feedback in all parts of the web experience
      3. Chatbots for interactive communications during the user web experience


Dell EMC believes the information in this publication is accurate as of its publication date. The

information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO

REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS

PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR

FITNESS FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an

applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.