
AI WEB ENGINE PROJECT
Phil Hummel
Technical Marketing Consultant, Machine Learning and AI
Dell EMC
[email protected]
Gene Chesser
Dell EMC
Knowledge Sharing Article © 2018 Dell Inc. or its subsidiaries.

Table of Contents
Introduction
AI – An Engine of Innovation for Online Communities
    Identifying opportunities
    Assembling a team
    Using a formal process
The Machine Learning Technology Landscape
    Defining ML, DL and AI
    Letting the data tell the story
Using ML and AI in Online Communities
    Project vision
    Related projects
    Defining the project
        Crawl
        Walk
        Run
        Sustaining Operations for AI
Conclusions
References
Appendix A: Architectural Design
    Goal: Establish Thought leadership in AI-ML-DL
Appendix B: Process Flow
Appendix C: LDA Process Engine
Appendix D: Input Source Engine
Appendix E: Concept Outlines
    Input Sources
    Classification Engine
    Raw Data Deep Content Search and tagging
    Analytics Association Clustering and Grouping Engine
    DL Engine
    Primary Inferencing Engine
    Maintenance Processes Engine

Table of Figures
Figure 1: Using a Formal Process
Figure 2: Architectural Design
Figure 3: Process Flow
Figure 4: LDA Process Engine
Figure 5: Input Source Engine
Disclaimer: The views, processes or methodologies published in this article are those of the authors.
They do not necessarily reflect Dell EMC’s views, processes or methodologies.

Introduction
One of the most significant outcomes of the development of the Internet has been the explosion of easily available information that simultaneously enriches and overwhelms us. Another significant development has been the creation of online communities that allow people from around the world to connect and share information on any topic imaginable. And yet, despite the decades-long evolution of online communities that many people would argue have grown richer and more engaging, the information explosion is outpacing our individual and community efforts to filter and digest even a fraction of the most relevant content we would like to review. There is still much room for improvement in the way we consume and share content. In our opinion, online communities present a greater opportunity for innovation in combating information overload than personal productivity tools do.
On a recent business web conference call, one overwhelmed participant typed into the chat window “I
just need the world to slow down for a year so I can catch up on my reading.” We suspect this person is
not alone. We also feel that way – at least once a week. So, what can we do about it? This paper
describes a proposal for enriching the content sharing experience of online communities using a
combination of machine learning (ML), deep learning (DL), artificial intelligence (AI) and plain old human
nature.
Information technology both creates the information overload challenge and can help us get a handle on it. Google News and other "search services" will automatically run keyword searches and email us interesting URLs. That increases the size of our reading list, though it should at least help organize the recommendations into topics. We also have community sites like Twitter, LinkedIn and Facebook. We can easily connect with like-minded people, but we cannot influence what they share, so the focus of our socially-generated streams varies widely and rarely matches our current interests.
A third type of community is the "question and answer" site such as Quora, Reddit and, for technologists, Stack Overflow. These sites concentrate information, but we still must start with an often-imprecise keyword search. On the positive side, the results are typically short, information-dense and more action-oriented. Finally, sites like Reddit and Stack Overflow use point systems to help users find better answers to questions. Users gain reputation points on Stack Overflow and karma on Reddit based on how other users rate the quality of their contributions. The theory is that answers or posts from users with above-average point totals should be more useful, based on the community's evaluation of their past contributions.
In this paper we combine these observations with current trends in technology to propose a new design for an online knowledge-sharing community that blends many existing successful strategies with some new features. We describe the vision and propose how a prototype could be developed. Our goal is to stimulate discussion and find collaborators.

AI – An Engine of Innovation for Online Communities
Identifying opportunities
The goal of any project should be to create change in the world, and that is especially true for AI projects. The change may be as simple as improving the quality of some data that an organization is collecting or as large as a new product introduction or a social media-powered movement. The important point is that you should have a vision articulated in writing before you start. Find as many people from diverse backgrounds as possible to review your plan and encourage feedback. It is far easier to improve your approach to realizing the vision early in the journey.
The project that we describe in this paper started with a vision, a rather large one. The project creator
developed the concept over a period of months during which he engaged with many people to present
and refine the concept. He also conducted extensive online research to determine how machine
learning (ML), deep learning (DL), and artificial intelligence (AI) were being applied to similar areas of
innovation. Those efforts have been fruitful, once again reinforcing the value of disciplined research and discussion before starting development of the project that will materialize the vision.
Assembling a team
Creating meaningful change in the world with AI involves a lot of difficult but rewarding work. It is rarely
achieved through the efforts of a single individual. Anyone who leverages open source tools for ML or DL is already relying on the work of dozens or hundreds of developers and the testing efforts of thousands or millions of other users. Once the person creating a vision understands the need for and the advantages of collaboration, the likelihood of success improves compared to the alternative of keeping complete control by going solo. Understanding your current strengths in the context of
everything that is required for a successful AI project helps narrow your search for complementary
teammates.
The initial concept for this project was developed by someone with a long history in technology and
online community experience but relatively new to data science. The next person to join the team was
strong in data science with some programming skill and much less domain experience. Together, they
have been able to better describe both the vision and a plan for how to architect a proof-of-concept
than either would have accomplished alone. The next step will be to use those documents to attract
additional team members with other skill sets and eventually investment to turn the design into a
prototype. The primary goals are to learn, foster collaboration with people of diverse backgrounds and
to advance the state-of-the-art for online information sharing communities.
There has been much discussion in the AI, ML and DL industry about the definition of, and availability of
data scientists. At one extreme are advocates for the importance of the "unicorn" data scientist who is an expert in statistics, programming, data management, and the problem domain. A recent tongue-in-cheek blog article recognized by KDnuggets suggests that "all it will take to become a real data scientist is five PhD's and 87 years of job experience." At the other extreme are those advocating that there is no
shortage of data scientists since the current crop of data science tools are so powerful that “citizen data
scientists” are all that most projects need. Our experience and understanding of the current tools and
breadth of skills required to successfully complete a data science project lead us to conclude that the
most viable strategy to staffing is somewhere between the extremes.

Many successful data scientists are experts in a few problem domains, with in-depth knowledge of the important data, the relevant data science research, and the best practices in those areas. We therefore suggest a strategy for new talent development that mimics this historical observation.
Invest in training that will develop subject matter-specific data scientists. For example, define a role for
an image recognition expert in quality control or security surveillance and then train one or more people
to fill that specific job description. Another role may be created for a natural language/speech
recognition expert in the field of customer service. Roles that have a 6-12 month expected learning
curve are a good test of the potential return on investment from talent development.
Using a formal process
Building business processes and value based on intelligence derived from data can be both rewarding
and risky. There are many published case studies that ended well and many that ended poorly. Just as the value, quality, and reliability of software assets have been improved through the application of dedicated management techniques, data science investments that use a formal process have been shown to be more successful. There are many well-regarded frameworks that can be used for data science and analytics work, including the model proposed in a 2013 Dell EMC article (see References).
Figure 1: Using a Formal Process
Organizations should review the available options, pick a framework with good adoption, and try to stay with it for the first few cycles from discovery to production operations. The main advantage of using a formal process is that it builds in regular evaluation by a team that shares responsibility. Each checkpoint that requires team approval before moving to the next stage creates an opportunity to document the status, risks and goals for both the current and next phase(s).

The Machine Learning Technology Landscape
Defining ML, DL and AI
There are many overlapping and conflicting definitions of machine learning, deep learning and artificial
intelligence everywhere we look, especially on the internet. We felt it was critical to define these terms
for the context of this paper in as clear a manner as possible and then use them consistently throughout
this discussion.
Our definition of AI is software with embedded intelligence features that users perceive as smart or
adaptive. The challenge with such a broad definition is that the expectations of users continue to evolve.
When users first encountered recommendation engines on ecommerce and entertainment web sites, those engines were considered state-of-the-art smart applications. Today, simple recommendations like "users who bought this product also bought these items" are considered routine, while other recommendation engines are combined with chatbots that take advantage of the context of a conversation while making suggestions. In our opinion, the most likely sources of intelligence in AI software are people/experts for expert systems, ML and DL for data-driven intelligence, and reinforcement learning for adaptive games and software. Examples of software applications that have the potential to be considered AI, highlighting the subjective nature of our designation, include:
- Spam filters/content filters
- Recommendation/personalization engines
- Chatbots
- Image/speech recognition
- Autonomous driving vehicles
- Games that learn during play
For data-driven intelligence used in AI we need to be able to discover associations between two or more
variables. With only a single variable we are limited to descriptive statistics like the average, maximum,
minimum, etc. For AI we need to be able to estimate the most likely value of one variable (the outcome)
based on the values of one or more features. We want to be able to estimate the air temperature based
on the hour of day or we want to estimate someone’s weight given their height, age and gender. We
refer to these relationships between outcomes and features as data models.
The simplest type of data model we can specify assumes a linear (straight line or plane) relationship between the outcome and the features. For instance, we could assume that the
relationship between the fuel consumption of a vehicle (miles per gallon) and the weight of a vehicle
(pounds) is linear. We can estimate the parameters of the linear model (train the model) for such data
using many types of ML and DL. Statistical theory gives us “tests” that we can compute to determine if
the linear assumption was valid or “good enough”. If the tests determine that the assumption of
linearity was not valid, there are many more sophisticated machine learning models that can be used to
work with more complex relationships. The work of understanding the types of relationships that exist
between variables and how they can be represented (modeled) consumes a significant amount of time
for many data scientists working with machine learning techniques.
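To make the linear example concrete, here is a minimal sketch of training and checking a linear model with Python's scikit-learn; the vehicle weight and mpg values are illustrative numbers, not real measurements.

```python
# A minimal sketch of training a linear data model: estimating fuel
# consumption (mpg) from vehicle weight (lbs). Data is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

weight = np.array([[2200.0], [2800.0], [3200.0], [3600.0], [4100.0]])  # feature
mpg = np.array([33.0, 27.0, 24.0, 21.0, 17.0])                         # outcome

model = LinearRegression().fit(weight, mpg)  # "training" = estimating parameters
print(model.coef_[0], model.intercept_)      # slope and intercept of the line
print(model.predict(np.array([[3000.0]])))   # estimated mpg for a 3,000 lb car
print(model.score(weight, mpg))              # R^2: one simple check of "good enough"
```

The R-squared score printed at the end is a simple stand-in for the statistical "tests" of the linear assumption mentioned above.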
We therefore use the term machine learning to refer to the application of methods and algorithms that
use statistical theory to support assumptions regarding the statistical properties of the data and

relationships between the variables. This definition is often referred to as “classical” machine learning
by some authors but we will just use machine learning for the remainder of this paper.
When the relationships between the variables in a data set are very complex it is difficult to find an
acceptable machine learning technique where all or most of the standard statistical assumptions for that
modeling technique are satisfied by the data. If we use a model with assumptions that are not valid, the
results will not be reliable.
Even though machine learning algorithms have grown in capability and complexity over the years, the relationships between the outcome and the data features for problems like image classification and speech recognition have proven too complex for these methods to be acceptably accurate.
Deep learning techniques have emerged that can efficiently model very large data sets with very complex relationships between the variables, allowing those relationships to be discovered during training. Deep learning uses a robust optimization approach that lets even very complex relationships in the data be discovered without making any assumptions about the statistical properties of the data. The tradeoff is that the computational workload is greatly increased compared with most machine learning approaches, and deep learning will not produce reasonable modeling results without a relatively large amount of data.
When a data set is small and/or the form of the relationships between the variables can be specified,
machine learning models are usually the most efficient and cost effective approach to modeling the
relationships between variables. For certain types of problems including image and voice analytics as
well as massive amounts of traditional text and numbers, deep learning could be the only option.
Letting the data tell the story
Spending too much up front time attempting to choose “the best” modeling framework from the vast
array of ML and DL options is a common pitfall for organizations new to data modeling. Data analytics
frameworks, like the one pictured above, are clear about the importance and the resource intensity of
the phases between discovery, data preparation and model evaluation. A good starting strategy is to prefer the simplest model that gets the job done. The type of data and type of result that you are
hoping to achieve can also be used to inform modeling strategy. Asking what has been tried in the past and what data scientists are currently having the most success with will help limit the number of dead-end efforts. The data science community is amazingly open to sharing experience and answering
questions. It is unlikely that anyone will tell you exactly how to solve a problem, but asking a well-
formed specific question that shows that you expended significant effort researching prior to posting
will often result in a better outcome than trying to reinvent every wheel in the data science toolbox.
Using ML and AI in Online Communities
Project vision
The best way to conceptualize the vision is by comparison to an amalgam of existing online experiences.
Imagine a site where:
1. members share content from the web (e.g., LinkedIn), and
2. are awarded community value points by other members based on the quality of their submissions and reviews, and
3. new curated content streams in from an intelligent web crawler (e.g., Google News), and
4. all data and activity on the site fuels a personalization engine (e.g., Spotify).
Did we mention that the vision was massive? The rest of the paper will attempt to expand on this vision
and describe how we would go about building a prototype.
Our goal is to develop a community-focused website that creates an immersive experience by simulating a virtual reality maze of information and reference material, allowing the user to navigate to relevant sources of content quickly and easily. Community members get tailored recommendations by identifying their area(s) of research and/or interest in their personal profile; the AI engine then personalizes the scope for each individual as it learns more and more about that individual's preferences and continues to update their profile in real time.
It is important that the site design:
- provides an aesthetically pleasing and user-friendly experience
- is appropriately segmented and relevant in unique and dynamic ways to a highly-technical audience
The site will include all relevant information pertaining to product and solution value and positioning, but the scope of discoverable content is not limited to any one company.
The AI for the site will be powered by a personalization engine. The engine will provide recommendations from thousands of dynamically-scanned URLs containing web articles and information that have been analyzed using ML and DL techniques to match the content to user interests and queries. Another goal of the site design is that the personalization engine improves as users spend more time on the site and rate content, gathering additional user and content data dynamically and via user input that feeds back to the ML/DL models.
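As one illustration of that feedback loop, here is a minimal sketch of how a user's interest profile could be nudged by article ratings; the topic names, rating scale, learning rate and update rule are all our assumptions, not a settled design.

```python
# A minimal sketch of rating feedback updating an interest profile.
# A profile is a weight per topic; each rating nudges the weights.
def update_profile(profile: dict, article_topics: list, rating: float,
                   learning_rate: float = 0.1) -> dict:
    """Move each topic weight toward the rating signal (1..5 stars -> -1..+1)."""
    signal = (rating - 3.0) / 2.0            # 1 star -> -1.0, 5 stars -> +1.0
    for topic in article_topics:
        old = profile.get(topic, 0.0)
        profile[topic] = old + learning_rate * (signal - old)  # moving average
    return profile

profile = {"deep learning": 0.4}
update_profile(profile, ["deep learning", "nlp"], rating=5)
print(profile)  # weights for "deep learning" and "nlp" drift toward +1.0
```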
In this concept, there is also a gamification aspect that serves not only as a mechanism for unique, specialized navigation but also supports concise data gathering and processing. Gamification of the site also
includes features to support point accumulation from other members of the community coupled with
one or more leader boards for those that like a bit of competition and recognition. Participation in this
aspect of the site can be optional. The proposed concept example below represents one possible
scenario for implementation:
A community member enters the site and is placed in the lobby of a virtual building with multiple doors
in view:
- One door leads to the Research Department
- A second door is labeled Solutions Expo
- Another door opens into the Products and Resources Market
- The fourth door leads to the Communities Conference portal.
A more detailed description of the Research Department will help explain some of the features that will
only be possible through the application of ML and DL. The Research Department would be largely a repository of documents prioritized by a content-aware search engine. As a user selects categories to search,
the complexity of the underlying engine is hidden behind an intelligent chat (chatbot) interaction or
other intuitive interfaces to narrow the selection. The available information store is made up of the results of crawling site maps that contain thousands of links to articles and information related to any number of technical disciplines such as AI, ML and DL, supplemented by sociological input sources like Google Analytics and historical cookie data collected based on what the user chooses to research. Ideally, the AI engine would be constantly analyzing a list of URLs supplied by members of the community. The engine would then create detailed sitemaps and determine whether the scanned resources are quality AI/ML/DL content sources to include in the information store. It would also rate the content and offer it to appropriate users, based on their interest profiles, as potential high-value data related to their search.
As content is discovered and recommended to members, they would be offered the option to read the
item or add it to their personal “document cart”. The cart should function like familiar features found on
software and driver support sites or ecommerce sites like Amazon where you can get previews and drill
down to the details and specifications. Members who are just browsing the site without adding content to their cart can also generate data useful to the AI engine: after proper notification regarding the use of cookie tracking, cookies can record what content they looked at. Tracking data that enhances a user's profile can also be used to offer an email summary with either links to recently visited articles or a zip file of the content for offline viewing. Users would also be able to add their own articles or blog posts, filling in a set of fields describing author details, organization, etc., that would create the metadata for keyword searches.
The AI-Web Engine could also email users when new articles come in that match their profile as high-value information they could read or download. The gamification concepts mentioned earlier can award users points and status for researching and contributing articles or blogs, with rewards such as membership in a top-ten group or recognition as a master in a specific focus area like Oil and Gas or Self-Driving Cars. As more users interact with the AI-Web Engine, the experience will improve, and the community can take pride in developing a smarter, more fun experience where members get what they are interested in quickly and easily.
One key to success for this effort is to build on the current academic and open source mindset so that the site becomes a clearinghouse for members to give and receive as part of the AI community. There would be sub-sections for specialty groups (e.g., ML vs. DL) as well as links to the Communities Conference area of the site mentioned above, focused on industries such as finance, medical research or any others that members propose and participate in.
Related projects
During the development of this project vision we have used Internet research to discover other projects
that are related to this area of research and development. We have focused primarily on ML and DL for
text analysis. The areas of personalization and gamification still need a thorough review of current best
practices and trends.
A form of text analysis described as unsupervised machine learning for topic modeling is a compelling
option for the new search engine since it has the attractive feature that it does not require a large
collection of labeled data as input. Our theory is that we can “jump start” the process of organizing
discovered content via an unsupervised approach and improve on the accuracy of that tagging through
member reviews and refinement of the topic assignments in a second stage most likely using DL.
Starting with deep learning for all text analysis has the potential advantage of combining feature
extraction and topic modeling in a single framework. A downside of all deep learning approaches

including text analysis is that the models and sensitivity to inputs are frequently difficult to understand
and therefore explain. The perception of deep learning as “black box” intelligence has been a significant
barrier to acceptance in many situations where business owners want to understand how software with
AI capabilities works.
Both machine learning and deep learning for text analysis require pre-processing of the raw text prior to
doing the actual modeling. This step is often referred to as feature extraction. Our research shows that a
popular starting point for topic modeling, as well as many other text analysis techniques, is to construct a "bag of words" representation of the document. The bag-of-words model is mainly used to generate features for machine learning and deep learning models. Two of the most common features extracted from the bag of words are term frequency (TF), the number of times a term appears in the text, and inverse document frequency (IDF). Used together, TF and IDF form a statistic (TF-IDF) that represents how important a word is to a document in a collection or corpus.
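For illustration, the sketch below builds bag-of-words and TF-IDF features with scikit-learn; in rough terms a word's TF-IDF weight grows with its frequency in a document and shrinks with the number of documents that contain it. The two toy documents are placeholders.

```python
# A minimal sketch of bag-of-words feature extraction with TF-IDF.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "deep learning models need large data sets",
    "machine learning models need statistical assumptions",
]

counts = CountVectorizer().fit_transform(docs)   # raw term frequencies (TF)
tfidf = TfidfVectorizer().fit(docs)              # TF weighted by inverse document frequency
print(sorted(tfidf.vocabulary_))                 # the "bag" of words (order is discarded)
print(tfidf.transform(docs).toarray().round(2))  # TF-IDF matrix: one row per document
```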
Here is a list of some interesting articles we have reviewed that provide additional background on the
topic of machine learning for text analysis.
Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment: this article provides a
detailed discussion of the techniques for constructing a bag of words from a text document with Python
code samples.
Natural Language Processing in a Kaggle Competition for Movie Reviews: this post explains some basic NLP (natural language processing) techniques, along with tips on using scikit-learn to build classification models.
Improving Uber Customer Care with NLP & Machine Learning: Discusses building an NLP model that
analyzes text using topic modeling. Even though topic modeling does not take word ordering into account, it has proven very powerful for tasks such as information retrieval and document classification.
Defining the project
Given the ambitious goals of the project and the limited resources, we have adopted a crawl, walk, run strategy for defining and building a prototype. In the crawl phase we have examined web presentation concepts and engaged a set of WordPress developers we have worked with in the past to address the user experience and to investigate an interactive API that can collect user information in real time, feed it to the AI-Web Engine, and receive inferencing recommendations from the AI-Web Engine to maintain the real-time experience for the user. We have also created process flow charts defining the separate execution steps required to collect, classify, collate, and develop metadata to be associated with the actual articles and to feed this information into the ML-DL application processes. These design concepts and process flows will certainly change as we move into the walk and run phases, where we develop the individual working models in preparation for the fully working proof of concept. The concept charts and outline descriptions in the appendices provide more detail.
Crawl
This is literally the crawl phase, where we must put limits on the sources of data we are going to explore and begin to extract text and metadata from URLs. We do not want to, nor can we, replicate the scope of Google or Bing. For the initial prototype, we plan to limit the starting URLs for the crawling engine to a handful of known, data-science-content-rich sites like Data Science Central, KDnuggets, etc. We can also limit the scope of exploration by limiting the depth of the crawl and not following links to other domains. We have also recently explored the idea of doing string matching on the text of URLs against a target word list that would include keywords such as machine learning, ML, AI, etc. Had we applied this logic to the list of URLs in the Related Projects section, we would have skipped the Uber Engineering article based on the limited information in its URL. There are clearly many design decisions still to be made based on the information we gather during prototype development and testing. The purpose of this initial limited crawl is to get enough relevant content to test and refine the topic modeling engine.
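Below is a minimal sketch of the URL string-matching idea; the target word list is hypothetical, and whole-token matching avoids false hits such as finding "ml" inside "html".

```python
# A minimal sketch of filtering crawl candidates by keywords in the URL text.
import re

TARGET_WORDS = {"machine", "learning", "ml", "ai", "deep", "data", "science", "nlp"}

def url_matches(url: str) -> bool:
    """True if any target word appears as a whole token in the URL text."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return bool(TARGET_WORDS & set(tokens))

urls = [
    "https://www.kdnuggets.com/2017/12/machine-learning-overview.html",
    "https://eng.uber.com/cota/",   # relevant content, but the URL gives no hint
]
print([u for u in urls if url_matches(u)])  # the Uber article is skipped, as noted above
```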
Our current proposal for unsupervised topic modeling is to use Latent Dirichlet Allocation (LDA), a generative statistical model for assigning topics to documents. We have also been exploring the use of an autoencoder neural network as our unsupervised topic modeling algorithm. One advantage of LDA is that it is widely available in a variety of frameworks, including Apache Spark (version 1.3.0 and higher), open source R (the topicmodels and lda packages) and Python libraries such as gensim and scikit-learn. LDA is also easy to implement and interpret. To run LDA we will first need to create a bag-of-words data structure and compute the TF-IDF statistics, all of which are widely documented and implemented in packages for both R and Python.

Other reasons to start the prototype with LDA are that when documents cover only a small set of topics, and topics use only a small set of words frequently, LDA typically produces better disambiguation of words and a more precise assignment of documents to topics than other models. Nonparametric extensions of LDA include the nested Chinese restaurant process, which allows topics to be arranged in a hierarchy whose structure is learned from the data.
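Here is a minimal sketch of LDA topic modeling on toy documents, using scikit-learn rather than the Spark MLlib implementation we plan to use for the prototype; the documents and topic count are placeholders.

```python
# A minimal sketch of LDA: discover topic mixtures from word counts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "neural networks train deep learning models on gpu clusters",
    "deep learning models need large labeled data sets",
    "stock prices and market risk drive portfolio returns",
    "portfolio managers hedge market risk with options",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)                 # LDA operates on raw word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:]]
    print(f"topic {k}: {top}")                   # most probable words per topic
print(lda.transform(counts).round(2))            # per-document topic mixture
```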
Walk
The second phase of the project begins when we have a reasonable number of classified documents.
We are targeting between 2,500 and 5,000 covering 5-10 topics. We will then need a web application to
begin registering members of the community so they can start interacting with the content. The goal in
this phase is to collect user requirements for future development and begin to build additional features
that can be used in the recommendation engine. For the time being, we can only recommend articles
based on the topic(s) assigned in the Crawl phase.
We will also begin development of the community point system in this phase. We need a relatively simple to understand but meaningful set of rules. In our experience, the rules for both Stack Overflow points and Reddit karma are too complex as a starting point. The EveryoneSocial platform for social selling and employee advocacy uses a very simple and easy to understand point system based on "engagements". We expect our initial design to take something from each of several platforms and then let member feedback inform improvements.
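To illustrate, here is a minimal sketch of an engagement-based point system in that spirit; the event names and point values are placeholders, not a finalized rule set.

```python
# A minimal sketch of a simple engagement-based community point system.
POINTS = {
    "share_article": 10,   # member submits a new piece of content
    "review": 5,           # member rates or reviews someone else's submission
    "comment": 2,
    "upvote_received": 1,  # another member endorses your contribution
}

def score(events: list) -> int:
    """Total community value points for a member's engagement history."""
    return sum(POINTS.get(event, 0) for event in events)

history = ["share_article", "upvote_received", "upvote_received", "comment"]
print(score(history))  # 14
```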
Run
This stage of the project is where we tackle the most complex technology challenge, the personalization
engine. To test alternative implementation approaches we will need a lot of data that we don’t have
today. After 3-6 months of operating the community in the Walk stage we will begin to have some
member profiles, engagement activity, member rankings and comments and other data crucial to

building the personalization engine. Our current thinking is that we will be able to use a relatively small
set of actual data to generate a larger data set through sampling and simulation. We can use the
relationships in the actual data and add a degree of randomness during the simulation to provide a large
enough volume of data to test alternative ML and DL algorithms. The advantage of putting
development effort into a data simulation tool is that it allows us to precisely control the form and
complexity of the data relationships to better evaluate which models discover what we already know
about the data.
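A minimal sketch of the simulation idea follows, assuming one simple relationship observed in the real sample (members with more topic interests rate more articles); every rate and distribution here is an assumption, chosen so we know the ground truth the candidate models should rediscover.

```python
# A minimal sketch of generating a larger synthetic engagement data set
# from a relationship observed in a small real sample, plus randomness.
import numpy as np

rng = np.random.default_rng(seed=42)
n_members = 10_000

interests = rng.integers(1, 8, size=n_members)   # topics per member (assumed 1-7)
base_rate = 2.0 * interests                      # assumed expected ratings per month
ratings = rng.poisson(base_rate)                 # add controlled randomness

# Because we chose the generating process, we know the "true" relationship
# and can check whether a candidate model rediscovers it.
print(np.corrcoef(interests, ratings)[0, 1])     # should be clearly positive
```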
Sustaining Operations for AI
We have presented a vision and plan for conducting research and development into the use of AI for
improving the ability of online communities to collectively review and share content. We also want to
address some of the issues related to the successful long-term viability of an AI application.
The amount of intelligence/information that is encapsulated in an ML/DL model at the time of estimation
or training is determined by the input data and the specification of the model. If the set of significant
features and the relationships between the features and outcome are stable over time, then the model
will never need to be updated. As you can imagine this is never the case with interesting problems.
In practice there are many factors that create the need for ongoing development and testing. First, the list of known and important features will most likely change over time as new sources of data are discovered or developed and some initial features lose influence. Second, new models and application techniques are constantly evolving, and ensemble approaches involving multiple interrelated models offer an endless opportunity for research. Finally, any AI system that involves analysis of human attitudes and behaviors will be affected by changing social trends, news, behavior and needs.
Complex AI systems are notoriously difficult to monitor for change, and oftentimes the first indication that a system may need to be reevaluated is that prediction accuracy drops noticeably. A more serious and difficult situation to assess is when a model suddenly goes from acceptable accuracy to unacceptable, which frequently indicates a structural change in the data relationships.
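As one illustration of how such monitoring could work, here is a minimal sketch that compares rolling prediction accuracy against a baseline window and raises a flag when the drop exceeds a threshold; the window size and threshold are assumptions.

```python
# A minimal sketch of an accuracy-drift check for a deployed model.
from collections import deque

class DriftMonitor:
    def __init__(self, window: int = 500, drop_threshold: float = 0.10):
        self.baseline = None                # accuracy frozen after warm-up
        self.recent = deque(maxlen=window)  # rolling record of hit/miss outcomes
        self.drop_threshold = drop_threshold

    def record(self, correct: bool) -> bool:
        """Log one prediction outcome; return True if re-evaluation is advised."""
        self.recent.append(1.0 if correct else 0.0)
        accuracy = sum(self.recent) / len(self.recent)
        if self.baseline is None and len(self.recent) == self.recent.maxlen:
            self.baseline = accuracy        # freeze the baseline once warmed up
            return False
        return (self.baseline is not None
                and self.baseline - accuracy > self.drop_threshold)

monitor = DriftMonitor()
# call monitor.record(prediction == actual) inside the serving loop
```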
When the project has passed the proof of concept phase and is in the final implementation stage, a
Maintenance Process Engine will be designed to run additional analytics processes so the AI-Web engine
can adapt to new concepts not available at the time of this project definition. The new concepts to
consider are largely expected to come from the actual user community interacting with the AI-Web
engine. If we do a good job designing the frameworks and create a rewarding and fun experience, the project should live on for many years and can easily be adapted to other research subjects in addition to AI, ML, and DL.
Conclusions
Most people agree that the amount of new information created every day, even on a small set of topics, is overwhelming our ability to feel acceptably informed. This problem affects both our professional and personal lives. New personal productivity tools such as tablets and smart phones, coupled with intelligent applications, can simultaneously ameliorate and exacerbate our struggle to feel caught up on even the most recent events and developments. The internet era has also fueled the
development of online communities for knowledge sharing and collaboration, which increase the competition for our attention but can also help us find and digest some types of information more efficiently.
In this paper we have begun to describe a vision and plan for improving the capabilities of intelligent
software in the context of online communities to help deal with the information overload that we all
experience. The vision is ambitious but achievable. The research and development plan is evolving and
we recognize that there is much yet to be done. The goal of presenting the ideas and concepts at this
stage is to foster comment and debate. We are committed to continuing the work on the research and
development effort and any feedback we can solicit will surely help to solidify and improve the results.
The potential of technology combined with our basic human need to share and be recognized for our efforts is great. Internet technology breaks down the barriers created by distance, time and even language. We should take advantage of those benefits to improve our ability to learn from and share the valuable but overwhelming amount of content that this same technology enables.
Online communities powered by AI have the potential to improve our control of information and help us
learn and share together in a way that will save each of us precious hours each week and improve our
feelings of being well informed and connected.
References
Dell EMC 2013 article: David Dietrich, "The Genesis of EMC's Data Analytics Lifecycle."
https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/

Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment: Jason Brownlee, "How to Develop a Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment."
https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/

Natural Language Processing in a Kaggle Competition for Movie Reviews: Jesse Steinweg-Woods.
https://opendatascience.com/blog/natural-language-processing-in-a-kaggle-competition-for-movie-reviews/?imm_mid=0f9a96&cmp=em-data-na-na-newsltr_ai_20171211

Improving Uber Customer Care with NLP & Machine Learning: Huaixiu Zheng, Yi-Chia Wang & Piero Molino, "COTA: Improving Uber Customer Care with NLP & Machine Learning."
https://eng.uber.com/cota/

Appendix: ML-DL AI Engine for Research and Recommendations
Appendix A: Architectural Design
Figure 2: Architectural Design
[Figure 2 is an architecture diagram, rendered here only as extracted text fragments. Recoverable elements: input sources (site map data, published industry ML-DL data, training/education data, event and focus-area sessions, daily publications, blogs, analyst and industry feeds, Wikipedia and Google Analytics data); a WordPress-framework web interface to users with a resource and research portal, public and private areas (customer councils, focus groups, analysts), user login and profile management, cookie tracking and questionnaires, with web access internal initially and public later; and a Linux back end with GPU & FPGA hardware running frameworks, algorithms and jobs for data-gathering web crawlers, search/index and deep content metadata, classification (inclusion/exclusion, old/new), association linkages and cluster info, and an inference recommender feeding the inference presentation layer.]
Goal: Establish Thought leadership in AI-ML-DL
The goal is to create a site that not only provides an aesthetically pleasing and user-friendly experience, but that is appropriately segmented and relevant in unique, dynamic ways to a highly-technical audience; a site that exemplifies and reinforces the solution's value and positioning throughout – for instance, by leveraging and promoting an ML-DL search engine whose data in specific sections comprises thousands of dynamically-updated URLs containing articles and information that are ideal subject data for content-addressable search algorithms.
Appendix B: Process Flow
Figure 3: Process Flow
[Figure 3 is a process flow diagram, rendered here only as extracted text fragments. Recoverable elements: client VMs 1-8 run the WordPress web interface (resource and research portal, public and private areas, user login and profile management, cookie tracking, questionnaires); VM 9 provides web access (internal initially, public later); VM 10 (Linux) runs input-source web crawlers with classification (inclusion/exclusion, old/new); VM 11 (Linux) runs ML/LDA data-labeling algorithms on Hadoop & Spark (Hortonworks/Cloudera) using Spark in-memory tables and the MLlib LDA model type for topic modeling based on groups of words that creates metadata; VM 12 (Linux) handles search/index and deep content metadata; VM 13 (Linux) handles association linkage and cluster data groups; VM 20 (Linux, Intel BigDL) is the deep learning analytics engine that packages and trains neural network models to continuously learn and make better recommendations; VM 30 hosts the inference presentation layer and VM 31 the inference recommender. The platform is VMware ESXi 6.5 with GPU & FPGA and networked data pools; additional inputs include Google Analytics, Wikipedia and user cookie data from VM 9, and the user experience is framed as a VR or game, library or lab.]

Appendix C: LDA Process Engine
Figure 4: LDA Process Engine
[Figure 4 is a flow diagram of the LDA process on VM 11 (classification and labeling; adding metadata to the master database), rendered here only as extracted text fragments. Recoverable steps: import site map data and format it for the master database; move data to the analytics database (Hadoop & Spark on Hortonworks/Cloudera, Spark in-memory table representation); connect LDA to MLlib and use LDA to classify, label and create rich metadata; a first LDA pass classifies the data, a second pass excludes/prunes data and produces web crawl targets fed to the input sources VM 10, which downloads raw content, exports to HTML and returns it to the database, expanding it with metadata and raw content; a third LDA pass labels the raw data and forwards it to the search and deep content index on VM 12. Implementation notes from the diagram: start with existing data and develop the input engine later; create an Ubuntu server VM with NVIDIA GRID and the CUDA Toolkit; find a VM image with Hadoop & Spark (Hortonworks or Cloudera); install the framework and application; develop the algorithms.]

Appendix D: Input Source Engine
Figure 5: Input Source Engine
[Figure 5 is a flow diagram of the input source engine on VM 10 (Linux), rendered here only as extracted text fragments. Recoverable steps: collect data via manual and automated input (site map data, industry published existing data, training/education material, event and focus-area sessions, publications, blogs, Wikipedia); download raw content and create an automated process similar to the Visio site map creation (metadata stage); update the site map data and change log; send URLs to LDA VM 11 for labeling and exclusion, receive the list to download, import the raw data and export it without HTML tagging; format the data, update the master database and add the raw data to the site map data; send to LDA VM 11 for labeling and scoring of deep content data; develop an automated update process and repeat for each data type. The engine runs alongside the WordPress VMs on VMware ESXi 6.5 with GPU & FPGA and networked data pools and continues to process changed data for further AI inference.]

Appendix E: Concept Outlines
Input Sources
1. Data Sets
a. Site Map Data
b. Event Data – Focus areas sessions
c. Training Data – Education Material
d. Published existing technology data
e. Industry Publications
f. Blogs
g. Wikipedia
2. Input Processes
a. Web Crawlers
b. ML-DL output sources
c. Google Analytics
d. Other Analytics
Classification Engine
1. ML Algorithms
a. MLLIB
b. LDA – Label process of Site Data URLs
2. Classification categories
a. Inclusion
b. Exclusion
c. Known Data
d. New Data
3. Platforms
a. Hadoop
b. Spark
c. Hortonworks
d. Cloudera

4. Results analysis and output to Web crawler for raw content
Raw Data Deep Content Search and tagging
1. Accept input from LDA to download full content of high priority labeled data
2. Deep Content Search and labeling (e.g., Autonomy or Elasticsearch)
3. Prepare advanced metadata tagging for each article of data to be used
4. Prepare data for output to analytics association engine
Analytics Association Clustering and Grouping Engine
1. Prepare data with value weighting to compare to user interest categories
2. Establish update process to keep available labels matched to user interests
3. Prepare data for DL input
DL Engine
1. Accept Input from multiple sources
a. Analytics Association Data
b. User Real time cookie information
c. Google Analytics
d. Change Data from multiple users
2. Determine training model and continuous update processes
3. Prepare Inferencing output data
4. Flag new data that can be offered to users based on their profile both real time and after user
logs off
5. Establish maintenance flags triggering when data suggests model should be updated
Primary Inferencing Engine
1. Receive real time input from DL engine
2. Determine if data is for immediate user feedback or future communication
3. Maintain flag logging and threshold management to kick off other processes (i.e. Maintenance)
4. Feed Analytics and reporting engine
5. Feed Inferencing Presentation Layer
Maintenance Processes Engine
1. Reporting

2. Sustaining processes
a. Model Update Flagging
b. Structural Changes need analysis
c. Anomaly flagging
3. Clean up process management
4. Process maintenance and documentation
5. Total Analytics Lifecycle Management
a. Future Maintenance concepts list
1. Define models
   1. Known data
   2. Changed data
   3. Inferred data
   4. Sustaining
      1. How do you determine that the model needs changing or review
      2. What triggers in the data and results would require updating models
         1. E.g., new technologies not available at the beginning of the project
         2. Data analytics of the actual project
            1. Google Analytics – trunk into another data set for the DL engine to make better decisions
            2. Internal analytics as part of the project
b. Detecting structural changes
   1. Define structural – things that can change
      1. What we expect
      2. What we think may happen
   2. The model needs to be reviewed against the data we have
      1. Consider new sources of data
      2. Consider new technology tools
   3. Adapt structural changes to new concepts – completely new sources of features
   4. Constant quest for relevant data – a good opportunity for an academic community of thinkers
      1. Build a blog – instant access for users to feedback tools
      2. Incorporate feedback in all parts of the web experience
      3. Chat bots for interactive communications during the user web experience

Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.
Use, copying and distribution of any Dell EMC software described in this publication requires an
applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.