Networks in AI - Washington University in St. Louism.neumann/fl2019/cse416/... · 2019-11-26 ·...

Networks in AI
Rajagopal Venkat
Ph.D. Candidate, Computer Science
Advisor: Dr. Yevgeniy Vorobeychik
[email protected]


Challenges in AI

v Knowledge Representation

v Machine Learning / Data Mining

v Planning / Reward-based Learning / Multiagent Systems

v Natural Language Processing

v Social and Commonsense Reasoning

v Adversarial Learning / Security

Why would analysis of networks be a good tool to apply to these domains?


Why Networks?

v Powerful representation – captures relations between entities.

v Potential algorithmic benefits to exploiting structure.

v Highly interpretable!


In Today’s Lecture…

v Knowledge Representation

v Machine Learning / Data Mining

v Planning / Reward-based Learning / Multiagent Systems

v Natural Language Processing

v Social and Commonsense Reasoning

v Adversarial Learning / Security


Natural Language Processing


a. Applications to NLP

1. Topic Modeling (for short texts)

• Given a set of documents, identify a topic each document pertains to.

• A topic is characterized by a set of words.

Venkatesaramani, Rajagopal, et al. "A Semantic Cover Approach for Topic Modeling." Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019). 2019.


An Example

Topic 1 : [Russia, Clinton, campaign]

Topic 2 : [Page, Mueller, Witch, Hunt]


Overview

v Cluster similar documents together.

v For each cluster, extract a small set of representative keywords.

[Figure: pipeline diagram with stages labeled "Semantic Cover (Keyword Extraction)" and "Topics"]


How can we cluster documents?

Hint: This is 416A!


Graph of Documents

v Construct a network, where nodes are documents, and weighted edges represent document similarity.

v TF-IDF – a term-weighting scheme; cosine similarity between TF-IDF vectors gives a notion of document similarity that captures word overlap.

v Run Spectral Clustering over this graph!
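The graph construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: whitespace tokenization, the plain log(N/df) IDF, and the helper names (tfidf_vectors, cosine, similarity_graph) are all assumptions made for the example. Spectral clustering would then be run on the resulting weighted edges.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(docs):
    """Weighted edges {(i, j): similarity} between documents with any overlap."""
    vectors = tfidf_vectors(docs)
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            weight = cosine(vectors[i], vectors[j])
            if weight > 0:
                edges[(i, j)] = weight
    return edges


# Toy documents (made up for this sketch).
docs = [
    "my cat ruined my pizza",
    "i have a dog and a cat",
    "there is a new book on trump",
    "trump wrote a new book",
]
graph = similarity_graph(docs)
```

Documents with no shared words get no edge, so the graph is sparse when texts are short.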


Finding Topic Words

v For each cluster, we want to extract a set of words that best summarizes the documents in that cluster.

v We need a notion of similarity between words and documents.

Any ideas?


Word Vector Embedding

v Word2Vec – places words that occur in similar contexts closer to each other.

v Uses a shallow (two-layer) neural network to learn the representation.


Word Vector Embedding

v Represent documents as weighted centroids of their word vectors.

v Sample Documents:

1. My cat ruined my pizza.
2. I have a dog and a cat.
3. Well, Trump doesn’t like cats.
4. There’s a new book on Trump.
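A weighted centroid is just a weighted average of word vectors. The sketch below uses made-up 2-D embeddings and made-up weights purely for illustration; real Word2Vec vectors have hundreds of dimensions, and the choice of weights (for example TF-IDF) is an assumption, not something the slide specifies.

```python
def weighted_centroid(words, embeddings, weights):
    """Weighted average of the vectors of the words that have an embedding."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in embeddings:
            continue  # skip out-of-vocabulary words
        w = weights.get(word, 1.0)
        for k, x in enumerate(embeddings[word]):
            total[k] += w * x
        weight_sum += w
    if weight_sum == 0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical 2-D embeddings and weights, for illustration only.
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "pizza": [0.0, 1.0],
    "trump": [-1.0, 0.0],
    "book": [-0.5, 0.5],
}
weights = {"cat": 2.0, "pizza": 1.0}

doc_vec = weighted_centroid(["my", "cat", "ruined", "my", "pizza"], embeddings, weights)
```

Here the document vector lands between "cat" and "pizza", pulled toward "cat" because of its larger weight.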


Semantic Cover

v A word covers a document if it is one of the k nearest word vectors to the document.
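The cover relation can be computed by ranking word vectors by distance to the document vector. This is a sketch under assumptions: Euclidean distance is used here (cosine distance would be an equally plausible choice), and the embeddings are hypothetical 2-D toy values.

```python
import math


def covering_words(doc_vec, embeddings, k):
    """Return the k words whose vectors are nearest (Euclidean) to doc_vec."""
    ranked = sorted(embeddings, key=lambda w: math.dist(embeddings[w], doc_vec))
    return set(ranked[:k])


# Hypothetical 2-D embeddings, for illustration only.
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "pizza": [0.0, 1.0],
    "trump": [-1.0, 0.0],
    "book": [-0.5, 0.5],
}

covers = covering_words([0.67, 0.33], embeddings, k=2)
```

Repeating this for every document in a cluster yields a bipartite word-document relation: each word is linked to the documents it covers.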


Semantic Cover

v Find a small set of words that cover all documents in each cluster.

v Finding the minimum such cover is 𝒩𝒫-hard.

How could we (approximately) solve this with networks?


Greedy Minimum Cover

[Figure: bipartite graph with words w1–w6 on one side and documents d1–d5 on the other. Over successive frames the topic set grows from { } to {w2} to {w2, w6}.]
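The greedy strategy repeatedly picks the word that covers the most still-uncovered documents. The sketch below is a generic greedy set cover, not the authors' code; the word-document edges are hypothetical (the edges in the original figure are not recoverable) and were chosen so that the result matches the topic set {w2, w6} shown on the slides.

```python
def greedy_cover(covers):
    """covers: dict mapping each word to the set of documents it covers.

    Returns a list of words that together cover every coverable document.
    """
    uncovered = set().union(*covers.values())
    topic = []
    while uncovered:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(covers), key=lambda w: len(covers[w] & uncovered))
        topic.append(best)
        uncovered -= covers[best]
    return topic


# Hypothetical word -> covered-documents relation.
covers = {
    "w1": {"d1"},
    "w2": {"d1", "d2", "d3"},
    "w3": {"d2"},
    "w4": {"d3", "d4"},
    "w5": {"d4"},
    "w6": {"d4", "d5"},
}

topic_set = greedy_cover(covers)
```

Greedy set cover is not optimal in general, but it is a standard logarithmic-factor approximation for this NP-hard problem.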


Semantic Cover

v Each cluster represents a topic.

v Each topic is characterized by a small set of words.

Putting it all Together


Semantic Cover In Action

Disneyland measles outbreak linked to low vaccination rates

More measles cases tied to Disneyland Illinois day care

Amid US measles outbreak few rules on teacher vaccinations

US measles count rises to 121; most linked to Disneyland

Measles cases turn attention to bounty of childhood vaccines

FDA commissioner says measles outbreak alarming


Applications to NLP

2. WordNet

• Lexical graph database of words in the English language.

• “Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.”

[Figure: example words grouped into synsets, with labels Car, Automobile, Chair, Armchair, Seat, Shut, Close, Observe, See, Note, Perceive, Jot]

Princeton University. "About WordNet." WordNet. Princeton University. 2010.


Applications to NLP

2. WordNet

• “Synsets are interlinked by means of conceptual-semantic and lexical relations.”

• “The majority of the WordNet’s relations connect words from the same part of speech (POS). Thus, WordNet really consists of four sub-nets.”

Princeton University "About WordNet." WordNet. Princeton University. 2010.

[Figure: linked noun synsets with labels Furniture, Seat, Chair, Armchair, Bed, Cot]


Can you think of some uses for WordNet?


Using WordNet

v Morphological links can be used for stemming/lemmatization.

• Children – Child

• Ate – Eat

v Graph distances between words in WordNet are often used as a semantic similarity measure!

Pedersen, Ted, Siddharth Patwardhan, and Jason Michelizzi. "WordNet::Similarity: Measuring the relatedness of concepts." Demonstration Papers at HLT-NAACL 2004. Association for Computational Linguistics, 2004.
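A graph-distance similarity can be sketched with breadth-first search. The toy graph below reuses the furniture words from the WordNet slides, but the edges themselves are an assumption made for illustration, and 1 / (1 + distance) is just one simple way to turn a path length into a similarity score.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path, or None if unreachable (BFS)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Similarity in (0, 1]: closer words score higher; unreachable scores 0."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


# Hypothetical undirected links between concepts (edges assumed).
links = [
    ("furniture", "seat"),
    ("seat", "chair"),
    ("chair", "armchair"),
    ("furniture", "bed"),
    ("bed", "cot"),
]
graph = {}
for a, b in links:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

near = path_similarity(graph, "chair", "armchair")
far = path_similarity(graph, "chair", "cot")
```

In this toy graph "chair" is one step from "armchair" but four steps from "cot", so the first pair scores higher.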


Commonsense Reasoning


b. Commonsense Reasoning

v Modeling human ability to reason about everyday situations.

v Reasoning about properties – physical, purpose, etc. – of people and objects.

v Reasoning about intentions and outcomes.

v Commonsense Knowledge – information an intelligent agent is assumed to know to be able to reason about the above.


Introducing ConceptNet

Speer, Robert, Joshua Chin, and Catherine Havasi. "ConceptNet 5.5: An open multilingual graph of general knowledge." Thirty-First AAAI Conference on Artificial Intelligence. 2017.

v Developed at the MIT Media Lab (crowdsourced).

v Based on Open Mind Common Sense database.

v Based on multiple sources: Wiktionary, Open Multilingual WordNet.

v 28 million relations, all binary properties.

v Weighted, directed.


Playing 20-Questions Using ConceptNet

Work in collaboration with Joel Michelson, Vanderbilt University

v 2-player interactive game.

v Player A thinks of something, Player B tries to guess using up to 20 yes-or-no questions.

v Players typically think of everyday objects, famous people, etc.


Why do we care?

v Good example of everyday intelligent activity.

v Requires efficient storage and search of common-sense or encyclopedic knowledge.

v Lends itself well to semantic networks and serves as a test of search-space optimization.

v Questions should be formed on-the-fly.


20q.net

v A set number of pre-defined questions.

v Agreeability metric.

v Bi-modal neural network

• Rank objects.

• Find questions.


Multi-Layered Graphs

[Figure: example graph on nodes 1–6]

v The graph G is represented as a hash table, where keys are nodes and the corresponding values are neighborhoods.

v Each neighborhood is a list of lists, where each sub-list represents a layer in the graph.

1 : [ ( ), (3), ( ) ]
2 : [ ( ), ( ), (3) ]
3 : [ (6), (1, 4), (2, 5) ]
4 : [ ( ), (3), ( ) ]
5 : [ ( ), ( ), (3) ]
6 : [ (3), ( ), ( ) ]
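The listing above maps directly onto a Python dict of lists. This is a minimal sketch of that representation; the helper names (neighbors, degree) are hypothetical and layers are indexed from 0.

```python
# Node -> neighborhood; each neighborhood has one neighbor list per layer.
graph = {
    1: [[], [3], []],
    2: [[], [], [3]],
    3: [[6], [1, 4], [2, 5]],
    4: [[], [3], []],
    5: [[], [], [3]],
    6: [[3], [], []],
}


def neighbors(graph, node, layer):
    """Neighbors of a node within a single layer (relation type)."""
    return graph[node][layer]


def degree(graph, node):
    """Total number of neighbors of a node across all layers."""
    return sum(len(layer) for layer in graph[node])


layer1_neighbors_of_3 = neighbors(graph, 3, 1)
total_degree_of_3 = degree(graph, 3)
```

Looking up a node's neighbors for one relation type is then a hash lookup followed by a list index, which is what makes per-relation queries cheap.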


Generating Questions with ConceptNet

Given a graph $G$ with $n$ relations, a list of positive answers $P = [(v_1, e_1^i), \dots]$ and a list of negative answers $N = [(v_1, e_1^i), \dots]$,

$$V_+ = \{\, v \in V(G) \mid (v, u_j) \in E_j^i(G),\ \forall (u_j, e_j^i) \in P \,\}; \quad V_- = \{\, v \in V(G) \mid (v, u_j) \notin E_j^i(G),\ \forall (u_j, e_j^i) \in N \,\}$$

$$V_{candidate} = V_+ \cap V_-$$

Solve

$$\arg\max_{v \in V,\ i \in [1, n]} \delta^i_{candidate}(v)$$

where $\delta^i_{candidate}(v)$ is the number of edges of the $i$th relation type originating from a candidate vertex and terminating in $v$.
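The candidate filtering and question selection can be sketched as follows. This is an illustration under assumptions, not the authors' code: the graph is stored as a set of (source, relation, target) triples, answers are (target, relation) pairs, and the sketch skips questions that were already asked (a detail the slide does not spell out). The relation names and entities are toy data.

```python
from collections import Counter


def candidates(edges, nodes, positives, negatives):
    """Nodes consistent with every 'yes' answer and every 'no' answer."""
    v_plus = {v for v in nodes if all((v, r, u) in edges for u, r in positives)}
    v_minus = {v for v in nodes if all((v, r, u) not in edges for u, r in negatives)}
    return v_plus & v_minus


def next_question(edges, cands, asked):
    """Pick the (target, relation) pair hit by the most candidate edges."""
    counts = Counter(
        (t, r) for s, r, t in edges if s in cands and (t, r) not in asked
    )
    if not counts:
        return None
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(counts), key=counts.get)


# Toy knowledge graph (hypothetical triples).
edges = {
    ("cat", "IsA", "animal"),
    ("dog", "IsA", "animal"),
    ("tiger", "IsA", "animal"),
    ("cat", "IsA", "pet"),
    ("dog", "IsA", "pet"),
    ("cat", "CapableOf", "purr"),
    ("brick", "IsA", "object"),
}
nodes = {s for s, _, _ in edges} | {t for _, _, t in edges}

positives = [("animal", "IsA")]       # "Is it an animal?" -> yes
negatives = [("purr", "CapableOf")]   # "Can it purr?"     -> no
cands = candidates(edges, nodes, positives, negatives)
question = next_question(edges, cands, asked=set(positives) | set(negatives))
```

With these answers the remaining candidates are "dog" and "tiger", and the next question becomes "Is it a pet?".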


Generating Questions with ConceptNet

v At each ‘yes’ answer, save the list L of items directly connected by that relation.

v Continue candidate generation.

v If the number of questions left (out of 20) <= 5, or if we are out of candidate questions, start random guessing from the last stored list.

v Instead of guessing with uniform probability, we want to bias guessing towards common objects.


Generating Questions with ConceptNet

v Intuition : As ConceptNet is partly crowdsourced, common entities are likely to have high degree.

v RelatedTo edges were entirely crowdsourced – good indicator.

v For each entry in last updated L, we find the number of RelatedTo edges it has.

v Calculate the probability of guessing the $i$th item as:

$$P(u_i) = \frac{\delta_{RT}(u_i) + \delta_{RT}^{*}(u_i)}{\sum_{u_j \in L} \left( \delta_{RT}(u_j) + \delta_{RT}^{*}(u_j) \right)}$$
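Degree-biased guessing amounts to normalizing per-item counts into a probability distribution and sampling from it. In the sketch below the two counts per item are taken to be outgoing and incoming RelatedTo edges; that reading of the two terms in the formula is an assumption, and the counts themselves are made up.

```python
import random


def guess_probabilities(items, out_degree, in_degree):
    """Probability of guessing each item, proportional to its two counts."""
    scores = {u: out_degree.get(u, 0) + in_degree.get(u, 0) for u in items}
    total = sum(scores.values())
    if total == 0:
        # No degree information at all: fall back to uniform guessing.
        return {u: 1.0 / len(items) for u in items}
    return {u: s / total for u, s in scores.items()}


def biased_guess(items, out_degree, in_degree, rng=random):
    """Sample one item, favoring those with more RelatedTo edges."""
    probs = guess_probabilities(items, out_degree, in_degree)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical RelatedTo edge counts for the last stored list L.
L = ["dog", "capybara"]
out_degree = {"dog": 6, "capybara": 1}
in_degree = {"dog": 3}

probs = guess_probabilities(L, out_degree, in_degree)
guess = biased_guess(L, out_degree, in_degree, rng=random.Random(0))
```

With these counts "dog" is guessed nine times out of ten on average, reflecting the intuition that common entities have high degree.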


Pop Quiz!

Which of the following is a person?

v Crab

v Hippo

v Brick

v Fish

v Tiger

v Computer

All of these!


Performance

Target                Number of Questions / Result / Reason
Water                 17
Brick                 Fail. Brick has low degree.
American White Oak    14, if answer no to ‘Is it an Angiosperm?’
Stephen Hawking       18
Capybara              20
Megalodon             Fail. Too many fish to guess from.
Mosquito              14
Scissors              Fail. Guesses more common objects.
Nashville             Fail. Not a city in ConceptNet.
Mountain Lily         14


Is it still useful?

v This method is an excellent detector of incorrect/missing edges.

v Can be reformulated into a Game with a Purpose!

v Can clean up crowdsourced datasets… with crowdsourcing.


A Comedy of Errors

v Computers NotCapableOf believe_in_jesus

v Televangelists NotCapableOf live_like_jesus

v Shed IsA where_store_dead_bodies

v Squirrel IsA holding_nuts

v Horses IsA cool_and_can_jump_and_stuff

v Businessman IsA male_animal


Adversarial Learning and Privacy


De-Anonymization

v Several sources of anonymized data are created and published every day.

v These datasets are susceptible to re-identification attacks, in combination with external sources of information.

v Huge privacy and security implications!


The AOL leak

v In August 2006, a researcher at AOL released search queries of ~650k users for academic use.

v User information was stripped from the dataset and replaced with numeric IDs.

Deng, Hongbo, Michael R. Lyu, and Irwin King. "A generalized co-hits algorithm and its application to bipartite graphs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.


The AOL leak

v Within weeks, user no. 4417749 was identified as Thelma Arnold from Georgia.

v Her queries, in conjunction with local phone directory records, made it very easy to trace her data back to her.


Netflix Recommender Challenge

v Netflix released a massive anonymized dataset of user movie-ratings.

v Offered a $1 million prize for the recommender system that most improved on their existing model (by at least 10%).

v Researchers showed that people could be identified.

Any ideas how you’d do this?


Genetic Privacy

v Anonymized genetic data (SNPs) are shared between medical centers.

v Companies like 23andMe store such genetic data for millions of people.

v With some knowledge of phenotypes, it may be possible to re-identify people!


Bipartite Matching

[Figure: bipartite graph with weighted edges, labeled w11, w13, ...]

v Each edge represents a probability that the two end-points correspond to the same individual.

v The aim is to find a subset of edges in which each vertex is incident to exactly one edge and the sum of edge weights is maximized.

Hungarian Algorithm
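To make the objective concrete, the sketch below finds a maximum-weight perfect matching by brute force over all permutations. This is deliberately not the Hungarian algorithm: it is an exponential-time stand-in that is only usable for tiny inputs, whereas the Hungarian algorithm solves the same problem in polynomial time. The weight matrix is made up for illustration.

```python
from itertools import permutations


def best_matching(weights):
    """Maximum-weight perfect matching on a square weight matrix.

    weights[i][j] is the score that record i on one side and record j on
    the other side correspond to the same individual.
    Returns (list of (i, j) pairs, total weight).
    """
    n = len(weights)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(enumerate(best_perm)), best_score


# Hypothetical match probabilities between 3 anonymized and 3 known records.
weights = [
    [0.9, 0.1, 0.0],
    [0.4, 0.5, 0.1],
    [0.8, 0.1, 0.3],
]

pairs, total = best_matching(weights)
```

Note that row 2 on its own would prefer column 0 (0.8), but the best overall assignment gives column 0 to row 0; this is why the matching is optimized jointly rather than edge by edge.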


In Summary,

v Networks are essential models in various subdomains in AI (and CS in general).

v There’s a lot more to do than social network analysis!

Questions?

[email protected]