Transcript of SDS PODCAST EPISODE 175 WITH GREGORY PIATETSKY-SHAPIRO
Show Notes: http://www.superdatascience.com/175
Kirill Eremenko: This is episode number 175 with President and Editor
at KDNuggets, Gregory Piatetsky-Shapiro.
Welcome to the Super Data Science Podcast. My name
is Kirill Eremenko, data science coach and lifestyle
entrepreneur. Each week, we bring you inspiring
people and ideas to help you build your successful
career in data science. Thanks for being here today,
and now let's make the complex simple.
Welcome back to the Super Data Science Podcast,
ladies and gentlemen. Today, I've got a very exciting
guest for you on the show, the legendary Gregory
Piatetsky-Shapiro, who is the founder of KDNuggets, is
joining us. I actually met Gregory quite a while ago. It
was over a year ago in May 2017 at the ODSC
Conference where we chatted and I invited him to the
podcast, but it took this long for us to organize
everything, and now he's finally come on the show. If
you don't know who Gregory is, then this will just put
things into perspective for you. KDNuggets is one of
the most popular data science resources out there.
They curate news on data science, they provide
their own articles, they conduct polls on data science,
and many, many more exciting things in the space of
data science. They've been around since 1997. Here's
another perspective for you, Gregory has 256,000
followers on LinkedIn, so that should just tell you
what kind of an influencer in the space of data science
Gregory is, and how much he's actually contributed to
the community, how many things he's given back to
the space. Today, we are welcoming him on the
show.
In today's podcast, what will we be talking about?
Today, we're going to cover off quite a few topics. Of
course, we'll go through the foundations of KDNuggets.
A very exciting, very interesting story of how it all
started, where Gregory began his journey into the
space and what KDNuggets has grown into, but also
we will cover off some of the more recent advances that
have been happening in the space of data science that
KDNuggets has been highlighting or has been
participating in.
For instance, we'll talk about the whole concept of data
science being the sexiest profession of the 21st century
and what has it turned into now, and what role is
machine learning playing in there? We'll also talk
about what the new GDPR regulations in Europe mean
for data scientists. The General Data Protection
Regulation came into play in Europe earlier this
year. It's one of the first changes in decades in the
European data protection regulations.
We'll talk about the concept of the citizen data
scientist. We'll talk about reinforcement learning, and
quite a lot of other very exciting things. As you can
imagine, Gregory sees all these new updates in
the space of data science on a daily basis. He is the
editor for KDNuggets, so all these articles that you're
seeing on KDNuggets actually go through him, and
today he's sharing his best and most exciting insights
with us.
All in all, a very exciting episode full of the most recent
technological advancements and interesting stories
on how this all came to be. Can't wait for you to check
it out, so let's dive straight into it, and without further
ado, I bring to you, Gregory Piatetsky-Shapiro, founder
and editor at KDNuggets.
Welcome, ladies and gentlemen to the Super Data
Science Podcast. Today, I've got a very exciting guest,
Gregory Piatetsky-Shapiro on the phone. Gregory,
welcome to the show. How are you today?
Gregory P. S.: Thank you, Kirill. I'm excited to be here. It's a pleasure
to be on your podcast.
Kirill Eremenko: It's so wonderful to have you. We met in May of,
what was it, 2017? Yeah, I think May 2017. Last
year in May, and it's been over a year, and I've been
wanting to get you on the show for a year now, and
finally we're here. This is super, super, exciting.
Gregory, where are you located right now?
Gregory P. S.: I am in Boston, Massachusetts.
Kirill Eremenko: Is that-?
Gregory P. S.: Actually, I'm-
Kirill Eremenko: Yep.
Gregory P. S.: ... working at home, so we have beautiful sunny
weather, and all of my cats, I think, are outside. As a
data scientist in the daytime, I do have the cats, but
hopefully they don't interfere in the middle of this
conversation.
Kirill Eremenko: Fantastic. Yeah, I was just about to ask that. That is
your home base, Boston. Is that correct?
Gregory P. S.: Yes.
Kirill Eremenko: Wonderful. It's so great to hear that you've got sunny
weather in Boston today. Last time I was there, it was
in May last year, it was surprisingly chilly. Yeah, so it's
good to hear that the weather is nice today. All right,
so let's dive straight into the podcast. Gregory, you are
the Founder and Director or President and Editor of
KDNuggets, a very popular data science media outlet
and news aggregator and a platform that shares
research about data science. You've been running this
platform for 21 years now. Tell us a little bit about how
it all started. Where did this idea come from?
Gregory P. S.: Yes, thank you. Probably, I started when I was a kid, I
was very fascinated by science fiction, and I loved
stories about robots, especially, from Isaac Asimov and
other writers like Stanislaw Lem and [inaudible
00:06:39] that was known in the West. I was
always curious about the idea of AI, and this probably
motivated me to learn computers when they first
appeared. In my first year in college when computers
were still programmed with punch cards, I remember
spending several weeks of my free time in the summer,
writing a program to play battleships, which was still a
very advanced program for that period. And then I
used APL. That was a special language developed by
IBM. It's A Programming Language, and it had special
symbols for every different array operation.
You can think of it as like R but with Greek letters.
After spending several weeks programming it, I played
one game and I was very soundly defeated by my own
program. I think as a result, I became much more
interested in creating programs than playing them. I
studied for an undergraduate
degree in Mathematics, then I came to the United States to
study computer science at NYU, and I got my PhD in
applying machine learning to databases. I think the
idea was a self-organizing database system that
automatically selects different indices and does
something intelligent.
Then I worked at GTE as a researcher. GTE was a
large telephone company in the United States. Now, it is
part of Verizon, which is an even larger telecom
company. I remember around 1986 or so, I attended a
workshop, which was called Expert Database Systems.
That was a very interesting name, but the concept was
very fuzzy, and the workshop papers and talks were all
over the place. I thought we could focus on something
more clearly defined, analyzing databases and finding
interesting patterns. In one of our projects on
applying some intelligence to databases,
I discovered that a particular query would run
10,000 times faster if we knew that there was a particular
rule, a kind of functional constraint that always
held. There was some oversimplification. Can you
find some useful rules in databases?
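
A minimal sketch of what "finding a useful rule in a database" can look like, assuming the rule has the form "column X always determines column Y" (a functional dependency). The episode does not say how the rule was discovered or how the query used it; the function name and the toy table below are hypothetical.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if each `determinant` value maps to exactly one `dependent` value."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical toy table, not data from the episode.
rows = [
    {"zip": "02139", "city": "Cambridge", "plan": "A"},
    {"zip": "02139", "city": "Cambridge", "plan": "B"},
    {"zip": "10003", "city": "New York", "plan": "A"},
]

print(holds_functional_dependency(rows, "zip", "city"))   # zip determines city here
print(holds_functional_dependency(rows, "plan", "city"))  # plan does not
```

Once such a rule is known to hold, a query that looks up the city for a zip code could in principle stop at the first matching row instead of scanning the whole table.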
I was, at that time, young, energetic and naïve, and I
thought that I could organize a better workshop. At
that time, a popular term was data mining. It's
interesting to note just as an aside how the
terminology changes and reflects the time. It went
from data fishing and data dredging, which were bad
terms, and data mining became the second popular term.
Now, the popular term is data science or maybe until
last year. Now, it's machine learning and artificial
intelligence, but in any case, so I organized a
workshop. I thought data mining was not sexy enough,
so I came up with the name Knowledge Discovery in
Data or KDD. That was the first workshop back in
1989, which attracted, I think, about 70 people
including several leading researchers.
Kirill Eremenko: Wow.
Gregory P. S.: Then I organized a couple more workshops, and later
in 1994, one of my best ideas was to stop doing it
myself and to recruit Usama Fayyad, who was then
just a fresh PhD from Ann Arbor, and
Ramasamy Uthurusamy, who was then a researcher at
General Motors, and they agreed to run the '94 workshop.
Then, next year, that workshop turned into a conference,
and later with the help of Won Kim, who was the chair of
SIGMOD and was very experienced with ACM,
a leading professional organization, the Association
for Computing Machinery, we created a special
interest group, SIGKDD, that runs the KDD
conference, and it's still running today.
I think we've had about 22 KDD conferences since
then. I'm very pleased to say that KDD remains the
leading research conference in the field based on
citations and other indices. Now, I can stand back
after many years of organizing [inaudible 00:11:50]
like a grandparent, enjoy the baby doing really well.
That was kind of one track of my activity.
How did I get to where I am? After the third KDD
workshop, I decided to send a newsletter to people who
attended the workshop, and I called it Knowledge
Discovery Nuggets. The first issue, which is still online,
went to, I think, about 50 people who attended that
workshop. Now, it's almost 25 years, actually 25 years
[inaudible 00:12:32] so KDNuggets has about 200,000
subscribers and followers across email,
Facebook, and LinkedIn, and our website gets about
500,000 visitors a month.
Kirill Eremenko: Wow, congratulations. That's huge.
Gregory P. S.: Big goals. Thank you. But we're focusing on analytics,
data science, and machine learning. If I try to talk to
my people you realize that as a data scientist at heart,
I just tried to select a few interesting things to write
about or select things on the web that we can publish.
I guess that was a second track in my career.
In parallel, organizing conferences and
publishing the newsletter was not a full-time activity, and
all the conference organizing that I've done was
always as a volunteer [inaudible 00:13:43] never
received any payment for it, but it probably was one of
the more rewarding things that I've done because I
enjoyed doing it with interesting people and helping to
put good things together. But another interesting thing
that I've been doing in terms of research and data
mining involved consulting and being involved in the
world of startups. In 1997, which was still very much part of
the [Dot-com Rush 00:14:23], I left the GTE research
lab and I joined a startup that was doing analytics
and data mining consulting for the financial industry, mainly
banks and insurance companies.
We worked with the largest names like Credit Suisse,
Chase Manhattan, Citibank. I was a chief scientist,
and managed a small team of perhaps about 10
people. Then around 2000, our smaller startup was
bought by a big startup. For a very short period of
time, the value of the big startup exceeded $1 billion.
Kirill Eremenko: Wow.
Gregory P. S.: It became the wanted unicorn, but before anyone,
including me, could do anything foolish with the stock
options, the larger startup's stock crashed almost all
the way down to zero. I left it in 2001. I think maybe
a couple of months before that stock went all the way to
zero. I was self-employed since about 2001, mainly
publishing KDNuggets and doing consulting and data
mining.
I think one interesting question for all the younger
people listening is synergy. In my case, I've done these
three parallel and mutually supporting activities: as a
researcher and consultant in data mining, as a
founder and chair of KDD conferences, and as publisher and
editor of KDNuggets news and website. Each one of
those activities was in some way helping the others. I
know, Kirill, that you're also teaching courses and you
have a very nice book, Confident Data Skills.
Kirill Eremenko: Thank you, yes.
Gregory P. S.: And probably doing other things. I guess probably a
helpful suggestion for young people that try to do
interesting things is to think: is there a synergy of
this activity with some other [inaudible 00:16:42] if
there is not, then maybe it's not the best thing to do.
Where there is synergy, it generally helps you to succeed.
Just to finish on this, in the last few years, I think,
maybe riding the big data and data science wave,
KDNuggets became so popular that I
stopped consulting. Now, I only publish
KDNuggets, and we have another excellent full-time
editor [inaudible 00:17:17] based in Canada. We have
several interns based in London and other places.
KDNuggets is global in its reach.
Kirill Eremenko: Gotcha. Wow, that's such an interesting career, and I
love that you mentioned that wonderful takeaway for
your career [inaudible 00:17:41] about synergy. I can
totally agree with that: when you're working on A
and B, you should be aiming to make sure that A plus
B is more than just A plus B. It's A plus B plus an
extra value. So it's not one plus one equals two. If you
truly have a synergy in the things that you're working
on, one plus one equals three or four or five, because
they complement each other, and they help your
audience, and they help you propel your career
forward. That's a very interesting takeaway, and
definitely, I can agree that looking back unconsciously,
I've probably done that. I can see I've done that in my
own career, but that was always unconscious. That
was just like a gut feel, but if you think about it
consciously, I think you can make much faster
progress in the things that you're doing and how
you're going forward.
Thank you, and it's really exciting to hear that
KDNuggets has got so many followers, 200,000
subscribers and 500,000 visitors per month. Those are
truly astonishing numbers. You mentioned that you
select those blog posts. How many blog posts do you
publish on KDNuggets? How frequently do they come
out?
Gregory P. S.: Well, we publish every weekday, and we try to select
maybe two or three interesting blog posts a day. Now,
we get a lot of submissions. Occasionally, Matthew
Mayo and I also write our own
editorial pieces, and if we see some interesting blog
posts around the web, then we'll also ask the authors
about reposting those as guest blogs on KDNuggets, but
there's so much stuff on the web that we try to select
only a small number, maybe two or three per day.
Kirill Eremenko: That's quite a lot as well. Already, that makes it 10 or
15 or more per week. How do you find the time to go
through all of them? You probably get a ton of
submissions sent to you. How many submissions do
you get, just out of curiosity?
Gregory P. S.: Well, it's hard to say, but I think we probably get
something like three to five submissions per day, not a
very large number because we have clear guidelines,
and we also focus on more technical submissions. Our
audience is mainly data scientists, and machine
learning engineers, so we'll not publish something like
why your business should use data science. I assume
our readers already know, but we would publish
something that explains how to create a pipeline in
Python or some ideas how to use Python [inaudible
00:20:38] or maybe some interesting polls that I run
every month or so. There're some interesting
observations like our recent poll, most popular annual
poll on what is the software that you use?
I've been running this poll, actually, since 2001,
amazingly.
Kirill Eremenko: Wow.
Gregory P. S.: Yeah, this is the 19th such poll. Now, the latest poll
results show that there is kind of a clear ecosystem
emerging around Python, Spark, Anaconda and
TensorFlow. Now, it's becoming an integral part of
the data science toolbox. Python seems to have more
significant [inaudible 00:21:31] ahead of R. There're a
lot more tools that use Python than R. There are some
other interesting observations that your readers can
see on KDNuggets.
Kirill Eremenko: Wonderful. Is it just like on the main page of the blog
or is there a specific page for all these insights? 'Cause
I-
Gregory P. S.: Well, on the main menu, we have a section called top
stories, and if you scroll there, then you will find more
interesting things.
Kirill Eremenko: That's so cool.
Gregory P. S.: Yeah, being data scientists, we always analyze the
results, so we always like to see what's more popular,
and publish separate posts with just the top stories.
Kirill Eremenko: Gotcha. Wow, this is really cool. I'm on the page right
now, and I highly recommend for people to check it
out. It's kdnuggets.com, and then you can, at the top,
find top stories and look through those. All right, well,
that's really interesting, very powerful insight.
Actually, before today's podcast, I was reading your
most recent blog about why data science is no longer
the sexiest profession of the 21st century, even though
it's still satisfactory, there's a new profession that is
the sexiest. Do you mind sharing a little bit on that
with us?
Gregory P. S.: Sure. We recently did a poll of our readers, and I think
we asked them basically, "What's your title and how
satisfied are you?" [inaudible 00:23:05] very satisfied,
which we converted to +2, to very unsatisfied, which
we converted to -2, and surprisingly, the profession
with the highest job satisfaction was machine learning
engineer, which, well, and as a researcher I have to
say that the average satisfaction was like 0.7, and the
standard deviation was around 1.0, so it's not like all
the machine learning engineers were highly satisfied.
There were still a lot of unsatisfied ones, but on
average, I think there was a significant difference
between the job satisfaction for this profession,
machine learning engineer, and the second and third
place, which were researcher and data scientist.
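
The scoring described above can be sketched in a few lines of Python. Only the +2 and -2 endpoints are stated in the episode; the intermediate labels, the `SCORES` mapping, and the sample responses are assumptions made purely for illustration and are not the poll data.

```python
from statistics import mean, pstdev

# Only "very satisfied" = +2 and "very unsatisfied" = -2 come from the episode;
# the three middle labels are assumed.
SCORES = {
    "very satisfied": 2,
    "satisfied": 1,
    "neutral": 0,
    "unsatisfied": -1,
    "very unsatisfied": -2,
}


def summarize(responses):
    """Map survey answers to numeric scores and return (mean, standard deviation)."""
    scores = [SCORES[answer] for answer in responses]
    return mean(scores), pstdev(scores)


# Hypothetical answers for one job title.
responses = ["very satisfied", "satisfied", "satisfied", "neutral", "unsatisfied"]
average, spread = summarize(responses)
print(f"mean={average:.2f} std={spread:.2f}")
```

A positive mean with a standard deviation near 1.0 on a five-point scale is compatible with a sizeable share of unsatisfied respondents, which is the caveat raised in the answer.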
Data scientist is still the most common job title. I see
that on the web and [inaudible 00:24:10] and on job
[inaudible 00:24:12] to get on KDNuggets, but kind of
there is more coming, more requests, more demand,
coming for people with machine learning engineer
skills. I guess the difference, as I would describe it, is that a machine
learning engineer is building machine learning
systems, probably now using deep learning, and
data scientists perhaps do more work on analyzing and
then trying to understand what is happening with
companies, not necessarily building production
systems.
Kirill Eremenko: Gotcha. Very interesting. That's a little hint, I guess, to
our listeners. If you're looking for the new data
scientist, the job that's coming to take over from the data
scientist, it might be machine learning engineer. Very
interesting. Thank you for that. All right, so I wanted
to ask you a couple of questions. You've obviously had
a very diverse and interesting, like a career filled with
lots of different roles and different engagements, and
different things that you've worked on, that you've
done, I just wanted to find out some of the highlights.
What is a recent win that you can share with us?
Something that you've had [inaudible 00:25:43]?
Gregory P. S.: I will mention maybe a couple of interesting things,
maybe they're not as recent but they're still very
instructive. I think one of the most interesting projects
that I worked on when I was still at GTE Laboratories
was called Key Findings Reporter, which we
called KEFIR. It was a system for analysis and
summarization of key changes in large databases, and
we applied it to healthcare data. Healthcare in the United
States is a scandal and also very, very expensive. I
think we spend here twice as much as other
industrialized countries, with no better
results, and trying to understand where all that money
goes is an essential part of the equation.
Our system automatically analyzed changes in
[inaudible 00:26:47] variables and it selected the
important ones, and it was combined with a small
expert system to add recommendations on what to
do about the changes. Like for example, if you have a
particular type of medical problem, then the expert
system will recommend how to solve it. It presented
visualization and it looked at changes in trends. One
good way to identify what findings are more important
is always to look at changes. For example, if you just look
at the associations, you can find a huge number of
significant associations in data. How do you find the
important ones? You look at the ones that change over
time: what is true this period and was not true in the
previous period.
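
The "look at what changed between periods" idea can be sketched as follows. This is only an illustration under stated assumptions, not the KEFIR implementation: the function name, the 10% threshold, and the measure names and values are all hypothetical.

```python
def key_changes(previous, current, min_relative_change=0.10):
    """Return (measure, relative change) pairs, largest absolute change first."""
    findings = []
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old == 0:
            continue  # a relative change cannot be computed for this measure
        change = (new - old) / old
        if abs(change) >= min_relative_change:
            findings.append((name, change))
    findings.sort(key=lambda item: abs(item[1]), reverse=True)
    return findings


# Hypothetical healthcare measures for two consecutive periods.
previous = {"admissions_per_1000": 80.0, "avg_length_of_stay": 5.0, "cost_per_day": 1000.0}
current = {"admissions_per_1000": 82.0, "avg_length_of_stay": 6.0, "cost_per_day": 1300.0}

for name, change in key_changes(previous, current):
    print(f"{name}: {change:+.0%}")
```

Measures that barely moved are filtered out, so the report lists only the findings that differ from the previous period.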
It was all combined in one very nice system, and it was
applied to all our GTE healthcare data and it identified
some significant potential savings. We did win the highest
technical award from GTE. Unfortunately, I guess I
would still regard it as a failure because the system
was not deployed.
Kirill Eremenko: Why is that?
Gregory P. S.: Probably that's connected to another question we
discussed. What's the most difficult thing to do? I think the
most difficult challenge in my work as a data scientist
was getting the results deployed because that requires
change in organizational culture and support from the
top. In the case of KEFIR, the system was working technically but
there was no place for it in the organization. It was not clear
who would use it, how it would affect them, the work of
people who were analyzing healthcare data. That is
probably the fate of many data science projects. You
can easily build a great prototype but unless there is
a clear way to deployment and support from the
organization, it is still a failure.
Kirill Eremenko: I see.
Gregory P. S.: That's, I guess, another interesting story I can tell. I
worked on many different projects. Probably the ones I
enjoyed the most were working on bioinformatics data.
I had one project where we worked with mass
spectrometry data trying to develop early indicators of
Alzheimer's. The problem with analyzing biological data
is you have a huge number of variables. You could
have 20,000 different compounds, but you don't have
a large number of patients. Typically you could get
maybe several hundred patients. Imagine you have 100
[inaudible] and you have 2,000
variables; it creates very significant problems in
determining what's significant and what is just
random noise. In that particular case, we did discover
very strong biomarkers, and they were 100% accurate.
There were, I think, quite a dozen of them.
One of them actually had biological significance
because it was like vitamin C, so our initial results
suggested that people who had more vitamin C were
less likely to get Alzheimer's. Even so, my intuition as
a data scientist told me, "Beware of perfect results."
This was [inaudible 00:30:46] it was 100% correct, so
it doesn't matter how you split the data, if it's 100%
correct, it will still be 100% correct. Myself and my
friends, we all started to drink more orange juice and
vitamin C, but were still skeptical about the results.
The only way to test them was to get another
population. We did that and we found that probably
the original data was contaminated in some form. I
guess don't trust the results if they're too good. That
could be a useful lesson.
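
A small simulation shows why results found with many variables and few patients need a second population. It is a sketch under stated assumptions, not the study from the episode: every "biomarker" here is pure random noise, and the cohort sizes, feature count, and seed are arbitrary. The episode attributes the perfect results to contaminated data; this sketch shows a different but related pitfall, chance agreement.

```python
import random


def simulate(n_patients=20, n_features=2000, n_validation=200, seed=0):
    """Pick the best of many noise features, then re-test it on a fresh cohort."""
    rng = random.Random(seed)

    # Discovery cohort: random diagnoses and pure-noise binary "biomarkers".
    labels = [rng.randint(0, 1) for _ in range(n_patients)]
    best_train = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_patients)]
        accuracy = sum(f == y for f, y in zip(feature, labels)) / n_patients
        best_train = max(best_train, accuracy)

    # Second cohort: the winning marker is still noise, so it takes fresh random values.
    new_labels = [rng.randint(0, 1) for _ in range(n_validation)]
    new_feature = [rng.randint(0, 1) for _ in range(n_validation)]
    validation = sum(f == y for f, y in zip(new_feature, new_labels)) / n_validation
    return best_train, validation


best_train, validation = simulate()
print(f"best accuracy on discovery cohort: {best_train:.2f}")
print(f"same marker on a new cohort:       {validation:.2f}")
```

With thousands of candidate variables, some noise feature will match a small cohort very well by chance, while on new patients it falls back to roughly coin-flip accuracy.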
Probably, the most success that I had in my career of
data mining [inaudible 00:31:41] was when we had to
help organizations make some strategic decisions. We
would examine whether they should use this
particular strategy or that one. Some of that work
was deployed, but as a consultant, I cannot tell
you the details, unfortunately, but I know that there
was a pay-off of, I think, seven digits based on
our results. Those results were easy to deploy
because it was like, do decision A or decision B; it
did not require change in the entire organization
structure.
Kirill Eremenko: Gotcha. Thank you very much. That's interesting. We
just talked about the wins and the challenges, and I
appreciate you sharing your experience. It's sometimes
difficult to share experiences, especially if it's a project
like the one you worked on, the Key Findings
Reporter, where you worked on it for a long time
and you're really proud of the results but it's not
deployed, but it is a great example for our listeners,
especially for those starting out of some challenges
that they might come across. In this case the takeaway
is that even if your project is great and you see that it's
got a lot of value, the situation might occur in such a
way that it might not be deployed in the end, and that
shouldn't ... Of course, you should do as much as you
can in order to avoid the situation, but if it does
happen, then don't let it bring you down. It sometimes
happens, even to the best people in the industry.
Also, the other example is also great where the results
are too good. Even in data science, sometimes
intuition plays an important role. Like you said, when
the results for that vitamin C example were too good,
your intuition was saying, "Don't trust the results"
[inaudible 00:33:50] I think it's also a good thing to
look out for if your results are too good to be true, then
find another place to check them, verify them, and
make sure that the test or the example is repeatable.
All right, so we talked about the wins
and we talked about the challenges. How about what is your one
most favorite thing about being a data scientist?
What's the one most favorite thing that's kept you
going through this career for more than 20 years?
Gregory P. S.: Well, I really enjoy the process of exploratory data
analysis and visualization. Analyzing the data, running
the algorithms: what does the data reveal? It's
like discovery of new and unknown realms. I think
curiosity is an essential trait for a good data scientist.
Along with discovering something, now I try to see
what's the best way to visualize and present it.
Especially, for example, if I'm looking at data for
recent KDNuggets posts, there're many ways to
organize it and thinking of what is a good story that
the data tells and what is a good image that is worth
1,000 words in a story. Generally, I think probably the
most useful thing to read, and I think I read a
study somewhere that confirmed it, is the captions on
images.
If a picture is worth 1,000 words, then a good caption
on that image may be worth 10,000 words. Think of
how to present the data, present the story and
visualize it and describe the image that you just
presented.
Kirill Eremenko: Thank you. It's definitely one of my favorite parts as
well of data science. Well, Gregory, I know that you will
need to go very soon, so I wanna really jump to the
part where I'm very curious, as you said, an important
part of being a data scientist is curiosity. I'm very
curious to get your answer to the following question.
It's a philosophical question, one I ask very often in
the podcast almost every time. I always get different
answers. Different people have different perspectives.
The reason why I'm so curious to get your perspective
on this is because of the amount of experience you
have in the field, your worldview and how it's
developed over time. On top of that, you just interact
with so many people, over hundreds of thousands of
followers, you influence them, you reply to their
comments on KDNuggets, you get these emails, you
have aggregated so much, such a wealth of
information in the space.
Here it goes. From all this experience, from everything
you've seen in the field of data science, where do you think
the field of data science and analytics is going, and
what should our listeners prepare for to be ready for
the future that's coming?
Gregory P. S.: Thank you, Kirill. I think that's a great question. I
guess as data scientists, we should always try to
predict the future, and as a data scientist with a lot of
experience, I can say that we're not very good at
predicting human trends, but I'll try nevertheless.
Kirill Eremenko: All right.
Gregory P. S.: What I see now is data science is becoming part of a
larger machine learning and AI field, which is really
progressing very fast. Capabilities especially in deep
learning are growing at an amazing rate, like every day we
see some really amazing stuff, like this recent Google
Duplex Demonstration, where they had completely
human quality calls with unsuspecting humans, but I
think AI hype is growing even faster than AI
capabilities, so beware of the hype, I guess that could
be one warning.
A second recent important event is the GDPR. This is the
European General Data Protection Regulation that took
effect May 25th this year. It has pushed companies
even outside of Europe to revise their privacy policies to
conform to GDPR. The good part for consumers is that it
offers more protections. It gives consumers some rights
about their data, such as receiving their data,
and it potentially makes life more complicated for
companies because GDPR also gives consumers a right
to something like explanation, and exactly what it
means is, I think, still under debate. I think interested
listeners can read my blog called "Does GDPR Make
Machine Learning Illegal?" which looks into that. I
think the answer is no, it doesn't make machine
learning illegal, but the right to explanation may make
machine learning more difficult, and exactly how it will
play out will, I think, be determined by courts.
The first lawsuits against Google and Facebook were
filed in the first couple of hours after GDPR took
effect, so we'll see.
Another interesting trend that I'm watching is what's
being called citizen data scientist. I think this term
was introduced by Gartner a couple of years ago, and
the idea was that tools would become so good that any citizen
can use them and do data science. I have been
very skeptical of citizen data scientist. I think do you
want a citizen dentist to work on your teeth or a
citizen pilot to fly your airplane, probably not. I think
data science can either be fully automated, and this
was a direction taken by companies like DataRobot,
H2O and others that offer kind of fully automated
solutions, or you can have positions that require
training and expertise in data science. But
having people with no training use tools that are
semi-automated, I think, is very dangerous because
you can easily reach wrong conclusions. Just think of
my example with vitamin C and Alzheimer's, where
citizen data scientists would say that was a correct result
but would lack the training and intuition to warn when
they're going in a wrong direction.
Now, I think there's a golden age for data science.
There're amazing tools that allow one person to do
what hundreds of people could not do 10 years ago,
but data science, like most data-driven activities with
some relatively clear rules and goals, is also becoming
automated. We had a poll recently on KDNuggets that
asked readers when data science will be automated,
and the median answer was 2025. For our data
science listeners: enjoy this great period but beware of
Show Notes: http://www.superdatascience.com/175 22
coming automation. In terms of the future trends, of
course, [WebID 00:42:30] has heard many times about
deep learning. Another important technology that I
think now is coming into forefront is reinforcement
learning, and especially Deep reinforcement learning.
Data science involves really from data that has already
been recorded, kind of learning from the past, whereas
reinforcement learning is applied to agents that are
active in their work, data experiments, and can learn
from their experiments. This was the keep summaries
and successes like AlphaGo that defeated the world
champion in Go by essentially learning this by playing
with itself. If I can make one more interesting
observation about the future, so this AlphaGo was
developed initially from learning with human masters
or experts in Go, and later, people at DeepMind
developed a more general version which they called
Alpha Zero, Zero to indicate that it started with zero
human knowledge, essentially just with itself using
reinforcement learning and deep learning. It achieved
in about four hours, the super human level in chess.
That was very disappointing for me as a former chess
player.
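To make the contrast above concrete, here is a minimal sketch of an agent that learns only from its own experiments rather than from previously recorded data. It is an epsilon-greedy multi-armed bandit, a far simpler setting than the deep reinforcement learning and self-play discussed in the episode, and it is not the method DeepMind used. The function name `run_bandit`, the win rates, and all parameter values are assumptions chosen purely for illustration.

```python
import random


def run_bandit(true_win_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that learns action values by trial and error.

    true_win_rates is hidden from the agent; it is only used by the
    simulated environment to decide whether an action is rewarded.
    """
    rng = random.Random(seed)
    n_actions = len(true_win_rates)
    pulls = [0] * n_actions        # how many times each action was tried
    estimates = [0.0] * n_actions  # running average reward per action

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: run an experiment with a randomly chosen action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment responds with a reward of 1 or 0.
        reward = 1.0 if rng.random() < true_win_rates[action] else 0.0

        # Incremental mean update of the estimate for the chosen action.
        pulls[action] += 1
        estimates[action] += (reward - estimates[action]) / pulls[action]

    return estimates, pulls


# Hypothetical environment with three actions of differing quality.
estimates, pulls = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated values:", [round(e, 2) for e in estimates])
print("times tried:", pulls)
print("best action found:", best_action)
```

The agent starts with no data at all. Every number it learns comes from an action it chose itself, which is the distinction drawn above between learning from the past and learning from experiments.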
It took it, I think, three days to achieve that
superhuman level in Go. This Alpha Zero version
played the strongest chess player in the world, which
is no longer human. I think the strongest chess player
in the world is now a computer. There was a
program called Stockfish, which was programmed
old-fashioned style with [inaudible 00:44:38] human
opponents and [inaudible 00:44:40] millions of
positions [inaudible 00:44:43], and when Alpha Zero
played Stockfish, it defeated it something like 10 to
0.
I looked at some of the games, and it made completely
inhuman moves. I don't know Go but I do know chess,
so I could appreciate how amazing those moves were.
If humans made them, we'd call them amazing
examples of human intuition and creativity, but I
think somebody described it as, "It would be like aliens
landed on Earth and they learned to play chess." I
guess, kind of looking forward, this gives us a
sneak preview into artificial general intelligence. I've
got no idea when it will be achieved, but people
who interact with it will probably be hard pressed
to understand why it does what it does. That is the
experience of chess masters looking at how the
superhuman Alpha Zero works.
It has a completely different intuition, and people who
understand Go report similar things, that it plays in a
completely different way that humans have never even
thought about. Not always, there are still moves
that humans can understand, but occasionally it makes
a completely superhuman move. That's kind of a
preview, a small window into artificial general
intelligence.
Kirill Eremenko: Well, fantastic. Thank you very much. I noticed that
you have a blog post about this as well, which is very
exciting, so if there're any chess players listening or
even if you're just interested in artificial intelligence,
Gregory has got a blog post about data science in 30
minutes, artificial general intelligence and answers to
your questions, so you can read more about this, and
I'm definitely curious about this. I'm gonna jump onto
this and check it out, 'cause I'm also a chess player
myself. It's a very good lens to put it in. I've heard
about the developments of Google DeepMind in the
game of Go and AlphaGo Zero, and how it was able to
win with a huge advantage.
In the same way, I don't play Go. I'm not a Go player
so it's quite hard to relate, but with this chess
situation, I definitely would like to know a bit more
about that inhuman move with the knight and things
like that. I'll have a look at that. Thank you so much
for sharing. Yeah, it's definitely an interesting area,
and of course, I'd also like to just recap the things
that you mentioned about the trends. I knew this was
a good question. Gregory, you were a great person to
answer the question, and you did give us so many tips.
So, ladies and gentlemen listening to this podcast, here
are some takeaways from Gregory's answer to our
question of what to prepare for in the future. AI
capabilities are growing, and machine learning as well,
but beware of the AI hype.
GDPR, so look at that, the European General Data
Protection Regulation, which went into effect May 25th
this year. Does it make machine learning illegal or not?
There's a blog post on KDNuggets about that as well.
Citizen data scientists, that's a concept that was
introduced by Gartner, but is it really a good thing, or
is it something that sounds good but might actually
cause more problems if people don't really know what
they're doing? How is that related to the automation of
data science, things that companies like DataRobot
and H2O are looking into?
Then the fourth thing was data science and
automation, moving on from that. You had a
poll that asked your readers, and the median answer
was 2025. That's when data science will be fully
automated, so something to look into as well, and keep
following the trends on KDNuggets to see how that
changes, and if it does. Finally, a new addition into
this whole mix of AI, deep learning, machine learning
and data science is reinforcement learning. It's picking
up more and more these days, so another important
technology to look out for in the future.
Gregory, all I can say is a huge thank you. I know
we've gone a bit over the available time you had.
Before you go, could you please let our listeners know
how they can contact you, find you, follow you, get in
touch, or just learn all these amazing things that
you're sharing with the world?
Gregory P. S.: Well, thank you Kirill. Well, our listeners can find
the website KDNuggets. They can contact me by email,
editor1, the "editor" followed by digit 1,
[email protected], or tweet to @kdnuggets, and
they can also like our Facebook, KDNuggets, or join
our KDNuggets LinkedIn group. We welcome readers'
comments, submissions or blogs. We always look for
good technical submissions. As I mentioned we
publish two, three blogs per day, although currently I
have to say we already scheduled all the blogs until
July 2nd, but good blogs will certainly get published.
Kirill Eremenko: Gotcha.
Gregory P. S.: Kirill, thank you very much. I enjoyed the discussion,
and hope to see you again at another conference
somewhere.
Kirill Eremenko: Thank you. Thank you very much, Gregory. Very lovely
having you on the show, and I do also hope we'll catch
up soon.
There you have it. That was Gregory Piatetsky-
Shapiro, and all of his amazing and exciting and
insightful stories from his years of experience in data
science and all the people he's interacted with, all of
the articles and news that he's aggregated through
KDNuggets and all of the amazing events that he's
been through.
I'll be interested to find out what your favorite part of
today's podcast was. For me personally, it was the
example that Gregory gave about KEFIR, that situation
where a technically excellent system was developed
but it wasn't used because it didn't have a place in the
organization. A very telling example and something
that can happen to anybody, it can happen on any
project, so it's always important to understand, I
guess, what you're working towards, and to learn from
experiences such as this one. Even when they're not
your own, you can still learn from them and understand
that situations like that can happen, and how you can
try to avoid them in your own career. Of course,
among other things, there were a lot of very valuable
insights that Gregory shared with us.
On that note, we're gonna wrap up. I highly encourage
you to check out KDNuggets and follow that website,
and follow the news that they're sharing. Get onto
their email list, so you get all the updates, all the very
important and most recent updates of data science. Of
course, follow Gregory himself, connect with him, if
you're not following him already on LinkedIn, I'm sure
he'll be happy to get in touch and stay in touch. Of
course, you can find all of the show notes for today's
episode at www.superdatascience.com/175. We'll also
include a ton of links that we mentioned on the show
so head on over to superdatascience.com/175, and
check them all out, look up those articles, look at
those polls, and see where the world of data science is
going. I can't wait to see you back here next time.
Until then, happy analyzing.