The Unreasonable Effectiveness of Data

Alon Halevy, Peter Norvig and Fernando Pereira (Google)

2011. 10. 24, Eun-Sol Kim

• The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve.

- Eugene Wigner, The Unreasonable Effectiveness of Mathematics in the Natural Sciences

• Essentially, all models are wrong, but some are useful.

- George Box

Two approaches to AI

• GOFAI (Good Old-Fashioned Artificial Intelligence)
– Based on logic
– Symbolic AI

• SML (Statistical Machine Learning)
– Based on empirical data (sensor data or databases)
– Inductive inference based on data: generalize data to rules, predict on future data

The power of data

• Scene completion using millions of photographs
– Hays et al., CMU, SIGGRAPH 2007

Learning from Text at Web Scale

• Brown Corpus
– 1 million English words
– Complete sentences, no spelling errors, no grammatical errors

• Google's trillion-word corpus
– About a million times larger than the Brown Corpus
– Frequency counts for all sequences up to 5 words long
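
A minimal sketch of what "frequency counts for all sequences up to 5 words long" means in practice. The toy corpus and the function name `ngram_counts` are assumptions made up for illustration; a trillion-word corpus would need a distributed pipeline rather than an in-memory counter.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, invented for this example only.
corpus = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(corpus)

print(counts[("the",)])        # unigram count for "the"
print(counts[("the", "cat")])  # bigram count for "the cat"
```

Each key is a tuple of words, so sequences of every length share one table and can be looked up the same way.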

Some lessons of web-scale learning

1. Use available large-scale data rather than annotated data

– We can find useful semantic relationships automatically from the statistics of search queries and the corresponding results, or from the accumulated evidence of web-based text patterns, without annotated data.

2. Memorization is a good policy

– Memorizing specific phrases is more effective than relying on general patterns.

– Machine translation example: large memorized phrase tables that give candidate mappings between specific source- and target-language phrases.

– For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
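
A rough sketch of the phrase-table idea from the machine translation example. The German-to-English entries, the name `PHRASE_TABLE`, and the greedy longest-match strategy are all assumptions for illustration; a real system would learn the table from parallel text and score the competing candidates instead of taking the first one.

```python
# Hypothetical toy phrase table: source phrase -> candidate target phrases.
PHRASE_TABLE = {
    ("guten", "morgen"): ["good morning"],
    ("guten",): ["good"],
    ("morgen",): ["tomorrow", "morning"],
    ("wie", "geht", "es", "dir"): ["how are you"],
}

def translate(words, table, max_len=4):
    """Greedy longest-match lookup; unknown words are copied through unchanged."""
    out, i = [], 0
    while i < len(words):
        # Try the longest memorized phrase starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in table:
                out.append(table[phrase][0])  # assumption: take the first candidate
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(translate("guten morgen wie geht es dir".split(), PHRASE_TABLE))
# prints "good morning how are you"
```

Because the two-word entry is tried before the single words, "guten morgen" comes out as "good morning" rather than the word-by-word "good tomorrow", which is the sense in which memorized phrases beat general rules here.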

Two conventional approaches to NLP

• Deep approach
– Hand-coded grammars and ontologies
– Complex networks of relations

• Statistical approach
– Learning n-gram statistics from large corpora

New approaches to NLP

• Combination of the two conventional approaches
– Statistical relational learning
  • Represent relations between objects with rules (first-order logic)
  • Model built by statistical learning

Semantic interpretation

• Semantic web
– A convention for formal representation languages that lets software services interact with each other

• Semantic interpretation
– Imprecise, ambiguous natural languages
– Embodied in human cognitive and cultural processes whereby linguistic expression elicits expected responses and expected changes in cognitive states

The challenges for achieving accurate semantic interpretation

• Interpreting the content
– We need methods to infer relationships between column headers or mentions of entities in the world.

• Web-scale data might be an important part of the solution.
– Hundreds of millions of independently created tables
– Tables represent structured data
– With tables, we can resolve semantic heterogeneity.
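
One way to picture how a large collection of tables could help relate column headers, sketched under stated assumptions: the header rows below are invented, and the co-occurrence heuristic (headers that keep the same company are likely to mean the same thing) is an illustrative choice, not a method given in the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical header rows from independently created tables (made-up data).
tables = [
    ["company", "address", "phone"],
    ["company name", "address", "phone"],
    ["company", "location", "phone"],
    ["make", "model", "year"],
]

def header_cooccurrence(tables):
    """Count how often each pair of headers appears in the same table."""
    pair_counts = Counter()
    for headers in tables:
        for a, b in combinations(sorted(set(headers)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def context(header, pair_counts):
    """Headers that appear alongside `header`, with their counts."""
    ctx = Counter()
    for (a, b), c in pair_counts.items():
        if a == header:
            ctx[b] += c
        elif b == header:
            ctx[a] += c
    return ctx

pairs = header_cooccurrence(tables)
print(context("company", pairs))
print(context("company name", pairs))
```

In this toy data "company" and "company name" never share a table but both co-occur with "address" and "phone", which hints that they may name the same attribute. With hundreds of millions of tables, such counts would carry far more evidence.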

Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.
