Q & A Based Internet Information Acquisition

70
Q & A Based Internet Information Acquisition Xiaoyan Zhu Tsinghua University [email protected] 22/6/20 1

description

Q & A Based Internet Information Acquisition. Xiaoyan Zhu Tsinghua University [email protected]. Tsinghua University found in 1911. 13 Schools and 54 Departments. With 7,080 Faculties and 25,173 students. National Lab of Information Science and Technology. 1,000 people. - PowerPoint PPT Presentation

Transcript of Q & A Based Internet Information Acquisition

Page 1: Q & A Based  Internet Information Acquisition

 Q & A Based 

Internet Information Acquisition

Xiaoyan ZhuTsinghua University

[email protected]

23/4/21 1

Page 2: Q & A Based  Internet Information Acquisition

Tsinghua University found in 1911

23/4/21 2

Page 3: Q & A Based  Internet Information Acquisition

13 Schools and 54 Departments 

With 7,080 Faculties and 25,173 studentsWith 7,080 Faculties and 25,173 students

23/4/21 3

Page 4: Q & A Based  Internet Information Acquisition

National Lab of Information Science and Technology

1,000 people

23/4/21 4

Page 5: Q & A Based  Internet Information Acquisition

Department of CS 

23/4/21 5

Faculty member: 120Students Undergraduate 661 Master 545 PhD 298

Page 6: Q & A Based  Internet Information Acquisition

Joint Research CenterTsinghua – Waterloo Research Center for Internet Information Acquisition

• Title of the project– Break barrier to access internet

• Mission– Enable the internet information to be widely and conveniently accessed by the 

disadvantaged: people who do not read English, people who do not have the internet, and visually impaired people. 

• Goal – To minimize language barriers and allow over 19.2 billion English internet pages accessible 

to 1.5 billion Chinese people.– To enable 580 million Chinese cell phone users access the internet, even if they do not have 

internet connection.– To enable 16 million visually impaired people in China, and many more in Canada and the 

world, access the internet conveniently.

• Supported by International Research Chair Initiative project       from International Development Research Center, Canada, (1 million Canadian 

dollar) 

23/4/21 6

Page 7: Q & A Based  Internet Information Acquisition

http://ciia.cs.tsinghua.edu.cn/Project_WebSite/index.jsp

23/4/21 7

Page 8: Q & A Based  Internet Information Acquisition

TALK OUTLINE

• How we get information from internet• Research topics

– Information similarity measure– information extraction– Information summarization

• Future work

23/4/21 8

Page 9: Q & A Based  Internet Information Acquisition

From http://en.disitu.com/2008/03/10/how-many-website-in-the-world/

There are more than 20 billion web pages in the world, and 

increasing in a speed of 200,000 a day.

23/4/21 9

Page 10: Q & A Based  Internet Information Acquisition

General search engines

23/4/21 10

Page 11: Q & A Based  Internet Information Acquisition

23/4/21 11

more than 20billion+5billion (70?)web pages indexed.

almost 4, 000, 000, 000user queries per day.

almost 1, 000, 000, 000of user-generated data(MB) every hour.

almost 17, 600, 000daily visitors.

Page 12: Q & A Based  Internet Information Acquisition

Vertical search engines• Travel Search• Shopping (or Product) Search• Employment Search• …… 

23/4/21 12

Page 13: Q & A Based  Internet Information Acquisition

23/4/21 13

15, 000, 000registered users.

1, 000, 000

daily visitors

20, 000, 000services provided every year.

Page 14: Q & A Based  Internet Information Acquisition

Community Q&A Systems

23/4/21 14

Page 15: Q & A Based  Internet Information Acquisition

23/4/21 15

Page 16: Q & A Based  Internet Information Acquisition

Q&A SystemsInput:  natural  language questionsReturn: sophisticated and tidy answers!

23/4/21 16

Page 17: Q & A Based  Internet Information Acquisition

QUANTA− Q&A based information acquisition 

Q&A      Knowledge     Power

RequirementRequirement KeywordsKeywords URLURL Web pagesWeb pages InformationInformation

Refine keywordsRefine keywords

RequirementRequirement KeywordsKeywords URLURL Web pagesWeb pages InformationInformation

Natural language queryNatural language query QUANTA

23/4/21 17

To make information acquisition more intelligent and convenient and effectiveTo minimize language barriers and allow over 19.2 billion English internet pages accessible to 1.5 billion Chinese people. To enable 580 million Chinese cell phone users access the internet, even if they do not have internet connection.

Page 18: Q & A Based  Internet Information Acquisition

THU QUANTA

23/4/21 18

http://166.111.138.87/quanta/index.jsp

Page 19: Q & A Based  Internet Information Acquisition

QUANTA’s result

23/4/21 19

Page 20: Q & A Based  Internet Information Acquisition

•  – Key word search and more powerful for the shorter query.

•  – Question answering and more useful for detail question in 

a specific domain.

•  – Question answering and  more popular for various 

complex questions.

•  – Complement to web keyword search.– Enhance existing cQA and search services.– Leverage existing knowledge in the question and answer 

forms and their authors.23/4/21 20

General engines

Vertical engines

Community Q&A

Q&A systems

Get answer automatically from internet at anytime, anywhere for everyone!

Summary

Page 21: Q & A Based  Internet Information Acquisition

Problems • General search engines, Key word search

– powerful for short query but long query and question• Loss of  information  Return irrelevant results

• Vertical search engines, question answering– Powerful for specified questions  Domain limited 

• cQA system, question answering– The best answer is really best?– Too many redundant answers for the same question.– The answer is not in real time for a new question.

23/4/21 21

Page 22: Q & A Based  Internet Information Acquisition

Problems (cont.)

23/4/21 22

Complexity of the question limited QUANTA

Database based Wolfram alpha

Knowledge based

Powerset

Q&A Systems

Page 23: Q & A Based  Internet Information Acquisition

TALK OUTLINE

• How we can get information from internet• Research topics

– Information similarity measure– information extraction– Information summarization

• Future work

23/4/21 23

Page 24: Q & A Based  Internet Information Acquisition

Research topics ongoing

• Information Similarity • Information Extraction• Sentiment analysis, opinion mining• Information summarization• Question analysis and classification• Candidate generation, ranking, 

evaluation• Recommendation for related 

content 

Theo

retic

al S

tudy

Application

23/4/21 24

Page 25: Q & A Based  Internet Information Acquisition

Our work• Information theory based information similarity measure

– Conditional information distance• Distance between named entities under the specific environment

– Min distance measure• Distance between two objects for partial matching problem

– Information Distance among Many Objects• Comprehensive and typical information selection, e.g. review mining; multi-

document summarization

• Information extraction– Relationship extraction– Redundant removing

• Summarization– Based on Information Distance– Based on Graph Centrality

23/4/21 25

Page 26: Q & A Based  Internet Information Acquisition

TALK OUTLINE

• How we can get information from internet• Research topics

– Information similarity measure– information extraction– Information summarization

• Future work

23/4/21 26

Page 27: Q & A Based  Internet Information Acquisition

Information Distance

Kolmogorov Complexity

Dmax(x,y)

Dmax(x,y|c)dmax(x,y|c)dmax(x,y)

Dmin(x,y)dmin(x,y)

23/4/21

Information Distance

Dmax(x1,x2,…)

27

Page 28: Q & A Based  Internet Information Acquisition

Information Distance

• Original theories:– Information distance: Dmax(x, y)

– Normalized information distance: dmax(x, y) 

• Proposed by our group:– Conditional Information Distance(CID/NCID): 

Dmax(x, y|c), dmax(x, y|c)

– min Distance:  Dmin(x, y), dmin(x,y)

– Information Distance among Many Objects (IDMO): Dmax(x1, …,xn) 

2823/4/21

Page 29: Q & A Based  Internet Information Acquisition

Motivation of NCIDNormalized Conditional Information Distance 

• Information Distance (Bennett C H, G´acs P, Li M, Vitanyi P, and Zurek W. , 1998):

This is normalized ID, named NID, where, K(x) is the Kolmogorov complexity of x, while K(x|y) is that condition to y.

NID is an absolute, universal, and application independent distance measure between any two sequences, which was applied in evolution tree creation, language classification, music classification, plagiarism detection, data mining for images and time sequences such as heart rhythm data etc..

• However, while people try to use it with Google, it becomes difficult and meaningless sometimes. For example, “fan” “CPU” and “star”? The NIDs of them are 0.60 and 0.58, respectively which are almost same and mean nothing.

29

)}(),(max{

)}|(),|(max{),(

yKxK

xyKyxKyxd

23/4/21

Page 30: Q & A Based  Internet Information Acquisition

Experiment results of NCIDx y condition NID

fan CPU - 0.6076

fan star - 0.5832

fan CPU temperature 0.3527

fan star temperature 0.6916

fan CPU Hollywood 0.8258

fan star Hollywood 0.6598

xCondition

ISANo Condition

fish salmon salmon

fish carp tail

fish shark shark

fish whale whale

fish cuttlefish carp

fish fin cuttlefish

fish tail fin

Regular expression

23/4/21 30

Page 31: Q & A Based  Internet Information Acquisition

Approximation of NCID

where, f(x) is the number of elements in which x occurs, and f(x,c) is number of elements in which x and c both occur; f(y,c) and f(x,y,c) are similarly defined.

From definition, 0 ≤ f(x,y,c) ≤ f(x,c) / f(y,c) ≤ f(c),

when f(x,y,c) = 0,

if f(x,c) * f(y,c) > 0, then dc(x,y) = ∞ (infinite);

otherwise, dc(x; y) is undefined.

)},(log),,(min{log)(log

)},(log),,(max{log),,(log),(),(

cyfcxfcf

cyfcxfcyxfyxNCIDyxd cc

3123/4/21

Page 32: Q & A Based  Internet Information Acquisition

Motivation of min Distance

Partial matching problem– Example: What city is Lake Washington by? (Seattle, Bellevue,

Kirkland)

• Seattle, correct and most popular answer, has much more private information comparing with the other answers, in contrast with the information shared with “Lake Washington” .

– Problem of max distance• The shared information between question and the answer is “diluted”

by the private information of the answer, which makes the system selected an unpopular candidate, Bellevue which has “dense” shared information.

• Can we remove the irrelevant information in a coherent theory and give the most popular city Seattle a chance?3223/4/21

Page 33: Q & A Based  Internet Information Acquisition

min Distance measure• Define Dmin and dmin

• Characters of min Distance• Universal• No partial matching problem• Nonnegative and symmetrical• It does not satisfy triangle inequality

Publications: ACM KDD-07, Bioinformatics, JCST

)}(),(min{

)}|(),|(min{),()},|(),|(min{),( minmin yKxK

xyKyxKyxdxyKyxKyxD

)()},(),,(min{

)},(),,(max{),,()|,(min cKcyKcxK

cyKcxKcyxKcyxd

3323/4/21

Page 34: Q & A Based  Internet Information Acquisition

d(human,horse) > d(human,Centaur)+d(Centaur,horse)

minmin Distance’s problem Distance’s problem

3423/4/21

An information distance must reflect what “we think ” to be similar. And what “we think” to be similar apparently does not really satisfy triangle inequality. min Distance is reasonable and successful in the applications.

Page 35: Q & A Based  Internet Information Acquisition

Motivation of IDMO

3504/21/23

• In many data mining applications, we are more interested in mining information from many, not just two, information carrying entities, for example:– What is the public opinion on the United States presidential

election, from the blogs?

– What do the customers say about a product, from the reviews?

– Which article, among many, covers the news most comprehensively? Or typical in one particular news item?

23/4/21 35

Page 36: Q & A Based  Internet Information Acquisition

IDMO measure• Define Dmax(x1, …,xn) 

3604/21/23

Dmax(x1,...,xn) =Em(x1,...,xn) = min { |p| : U(xi,p,j) = xj, for all i,j}

1 2 1 maxmin ( ... | , ) ( ,..., | ) min ( , | )n i m n i ki i

k i

K x x x x c E x x c D x x c

• Conditional Dmax(x1, …,xn) 

Dmax(x1,...,xn|c) =Em(x1,...,xn|c) = min { |p| : U(xi,p,j|c) = xj, for all i,j}

• Most representative object

– Left-hand side: the most “comprehensive” object that contains the most information about all of the others

– Right-hand side: the most “typical” object which is similar to all of the others

23/4/21 36

Page 37: Q & A Based  Internet Information Acquisition

Document Summarization by IDMO

Multi-document Summarization

– To generate the most “typical” summary according to the right-hand side of Em

– Ranked top 1 under the measurement of overall responsiveness both on TAC 2008(33 research groups, 58 submissions) and 2009 (26 research groups, 52 submissions)

3704/21/2323/4/21 37

Publications: ICDM-09

Page 38: Q & A Based  Internet Information Acquisition

Review Mining by IDMO

Comprehensive and typical review selection

– To select the most “comprehensive” and the most “typical” reviews

– We have studied the relationship between a review’s sentiment rating and its textual content and developed a rating estimation system

– The system based on our theory is very helpful for customers

3804/21/23

Publications: ACM CIKM-08, WI-09

23/4/21 38

Page 39: Q & A Based  Internet Information Acquisition

TALK OUTLINE

• How we can get information from internet• Research topics

– Information similarity measure– Information extraction– Information summarization

• Future work

23/4/21 39

Page 40: Q & A Based  Internet Information Acquisition

Information Extraction

• Relationship extraction–Supervised learning–Unsupervised learning

23/4/21 40

Page 41: Q & A Based  Internet Information Acquisition

Interaction extraction

• Statistical algorithms– low precision, high recall

• Pattern matching algorithms– Manual pattern generation

• High precision, low recall

• Bad at generalization

– Automatic pattern generation• Good balance between precision and recall

• Good at generalization

23/4/21 41

Page 42: Q & A Based  Internet Information Acquisition

Pattern generation and optimization

• Pattern generation: extract patterns automatically.– dynamic alignment/dynamic programming algorithm

• Pattern optimization: reduce and merge patterns to increase generalization power, and hence the recall and precision rates.– Supervised Machine Learning algorithm

• approach based on MDL principle

• data labeling is required

– Semisupervised Machine Learning algorithm• approach with a ranking function, and a heuristic evaluation algorithm

• Relatively, little data labeling is required

23/4/21 42

Page 43: Q & A Based  Internet Information Acquisition

Pattern Set

• Best pattern set should satisfy: – least amount of error in output and least amount of

redundancy in pattern set

– maximum number of sentences matched by at least one pattern.

• The problem is how to take the balance. That is a trade-off between the complexity of the model (pattern set) and the fitness of the model to the data (shown by the performance of the system).

23/4/21 43

Page 44: Q & A Based  Internet Information Acquisition

Pattern optimization (MDL based)

• What is Minimum Description Length principle– Proposed by Rissanen, in 1978, as a tool to solve the tradeoff

problem between generalization and accuracy.

– MDL principle can be applied without the analytical form of the risk function.

– The MDL principle states that given some data D, the best model (or theory) MMDL in the set M of all models consistent with the data is the one that minimizes the sum of the length in bits of description of the model, and the length in bits of the description of the data with the aid of the model.

where l(M) and l(D/M) denote, respectively, the description length of the model M and that of data D using model M.

)|()(minarg MDlMlMM

MDL

23/4/21 44

Page 45: Q & A Based  Internet Information Acquisition

Implement of MDL principle

• The MDL principle can be viewed from the point of view of Kolmogorov complexity (Li and Vitanyi, 1997):

where K() is Kolmogorov complexity which is also not computable. The MDL principle looks for an optimal balance between the regularities (in the model) and the randomness remaining in the data, that is, a trade-off between the complexity of the model and the fitness of the model to the data.

• The problem is how to make the principle computable.

K(D|M)K(M)MM

MDL minarg

  ,

23/4/21 45

Page 46: Q & A Based  Internet Information Acquisition

MDL based algorithm

• Examples of pattern (a tag sequence):– {PTN VBZ IN PTN: *; binds, associates; to, with; *}– {PTN VBZ PTN: *; binds, associates, activate; *}Pattern set P, P={p1, p2, …, pn}, and pattern pi, pi=mi

1mi2…mi

Ji

• Finally, the optimal pattern P* is obtained as follows:

where, γ is a constant, and c(mij) is the number of words involved in jth

components of pattern pi. I* is expected interaction set while I is extracted by pattern set P. So d(I, I*) is the number of difference between I and I*.

d(I , I*)K(P)PP

2logminarg*  

n

iii IIIId

1

*),(*),(

n

i j

jimPK

1

||)(

othersmc

PTNmm

ji

jij

i),(/

,1||

),( *

ii II

*

*

,0

,1

ii

ii

II

II

=

23/4/21 46

Page 47: Q & A Based  Internet Information Acquisition

Experiment results (1)

700

800

900

1000

1100

1200

1 22 43 64 85 106 127 148 169 190Number of Pattern Del eted

MDL

0. 2

0. 3

0. 4

0. 5

0. 6

0. 7

0. 8

0. 9

1 22 43 64 85 106 127 148 169 190Number of Pattern Del eted

Preci si onRecal l

• X-coordinate is the number of deleted pattern;

• Y-coordinates are the value of MDL, precision and recall of the system, respectively.

23/4/21 47

Page 48: Q & A Based  Internet Information Acquisition

Experiment results (2)

Algorithm Pattern Number

Training Set

Precision Recall F-Score

Original 192 64.6% 57.2% 60.68%

ERM134 82.7% 55.7% 66.57%

MDL 14 80.7% 61.1% 69.55%

Algorithm Pattern NumberTest Set

Precision Recall F-Score

Original 192 63.5% 57.3% 60.24%

ERM 134 77.9% 53.3% 63.29%

MDL 14 79.8% 59.5% 68.17%

Publications: Bioinformatics 2004,  200523/4/21 48

Page 49: Q & A Based  Internet Information Acquisition

Pattern optimization (semi-supervised)

• Why is semisupervised algorithm proposed– Many kinds of relationship between protein, gene,

and disease.

– Data annotation is too expensive.

• Key point is ranking function and evaluation algorithm

23/4/21 49

Page 50: Q & A Based  Internet Information Acquisition

Semi-supervised Algorithms

• Novel Ranking function

– Where p. positive indicates the number of correct instances matched by the pattern p and p. negative denotes the number of false instances. The parameter β is a threshold that controls p/n. If , HD is an increasing function of (p+n), which means if several patterns have the same p/n that exceeds , a pattern with larger (p+n) has a higher rank and is more possible to be saved. If , the first term is negative, which means that a pattern with larger (p+n) will have a lower rank. Thus different ranking strategies are used when different p/n are met.

• Heuristic Evaluation Algorithm (HEA)– To reduce redundancy among patterns with an optimization function

2

. 0.5( ) ( log )*ln( . . 1)

. 0.5

p positiveHD p p positive p negative

p negative

2/ np2

2/ np

23/4/21 50

Page 51: Q & A Based  Internet Information Acquisition

Experiment results

F1 score

0.24

0.26

0.28

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44

0.46

0.48

2480

2280

2080

1880

1680

1480

1280

1080

880

680

480

280 80

Number of Patterns

Ripper

Riloff

HD

HEA

Method Patterns

Precision

Recall F1

score

Impr. of F1

Baseline (raw)

2480

19.0% 46.5%

27.0% –

Ripper 1626

41.0% 40.8%

40.9% +49.3%

Riloff 88

40.8% 44.5% 42.6%

+55.5%

HD 92

52.5% 38.8% 44.5%

+62.4%

HEA 7243.5% 45.9% 45.5%

+66.1%

Published on Bioinformatics,

Medical Informatics,

and APBC’07 

23/4/21 51

Page 52: Q & A Based  Internet Information Acquisition

TALK OUTLINE

• How we can get information from internet• Research topics

– Information similarity measure– information extraction– Information summarization

• Future work

23/4/21 52

Page 53: Q & A Based  Internet Information Acquisition

Background

• Too many information– Concise & coherent summary

• Generative summarization– Language generation (re-phrase meanings)– Very difficult to express semantics

• Extractive summarization– Extract key sentences

23/4/21 53

Page 54: Q & A Based  Internet Information Acquisition

Previous Studies

• Statistical approaches (Nomoto,2001; …)• Linguistic techniques (Nakao,2000; …)• Graph-based methods

– LexRank (Erkan&Radev, 2004;)– TextRank (Mihalcea&Tarau, 2004;)– Query specific document summarizer 

(Varadarajan&Hristidis, 2006)– Many more …

23/4/21 54

Page 55: Q & A Based  Internet Information Acquisition

System I: Information Distance Based

• Problem Reformulation    Given cluster A with m documents  A1,A2,…,Am, 

the update sum task for cluster B={B1,B2,…,Bn} should:

Min{Dmax(S, B1B2…Bn| A1A2…Am)}, |S|≤Θ

          where, S={s1,s2,…,sk} is a set in which each si is 

a sentence selected for the summary 23/4/21 55

Page 56: Q & A Based  Internet Information Acquisition

Problem Reformulation

•  K(AB)=K(A∪B)   K(A|B)=K(A\B)

• Dmax(S, B1B2…Bn| A1A2…Am)

=K((B1B2…Bn\A1A2…Am)\S1S2…Sk)

• Min{Dmax(S, B1B2…Bn| A1A2…Am)}

=Max{K(S1S2…Sk)}

23/4/21 56

Page 57: Q & A Based  Internet Information Acquisition

Approximation

• How to compute K(S1S2…Sk)?• Assumption: each important term carries one 

unit of information, then                  K(S)=|S|  (the cardinality)

S={t1,t2,t3,…,tn}  tk: important terms• Important terms

– Non stop-words– Named entities (person, org., loc., date, …)– With high document frequency

23/4/21 57

Page 58: Q & A Based  Internet Information Acquisition

Approximation (cont.)

• Select one representative sentence s for each document D by:        argmins{Dmax(s,T), s∈D}

T: the union of topic title and narrativeRemove redundant representative sentences 

—8 continuous common words—60% common words

23/4/21 58

Page 59: Q & A Based  Internet Information Acquisition

Generate Summary

• With the representative sentences, select a subset that makes max{K(…)}– Compute all combinations of sentences within the 

length limit;

23/4/21 59

Publication: IEEE ICDM-09

Page 60: Q & A Based  Internet Information Acquisition

Results

Evaluation Method Best Our Result RankAverage Modified Score 0.336 0.309 5/58Macroaverage Modified Score with 3 models

0.331 0.304 5/58

Average Linguistic Quality 3.333 2.958 3/58Average Overall Responsiveness 2.667 2.667 1/58

23/4/21 60

Cluster Traditional UpdateEvaluation Method Results Rank Results RankAVG Modified Score 0.311 9 0.296 4/52MacroAVG Modified 0.316 9 0.292 4/52AVG Linguistic Qualilty 5.682 3 5.886 1/52AVG Overall Resp. 4.955 2 5.023 1/52

TAC 2008, Update Summarization

TAC 2009, Summarization

Page 61: Q & A Based  Internet Information Acquisition

System II: graph model based• It may be a new solution for text presentation 

– Bag-of-Words

• Iterating on the graph can propagate very distant dependence 

• Key points: define nodes\edges\computationA

B

C

D

E

A E

23/4/21 61

Page 62: Q & A Based  Internet Information Acquisition

Graph-based Update Summarization

① Select most salient terms② Build the term-sentence matrix W③ Use the LSI sentence-sentence similarity matrix SIM④ Construct a graph based on SIM⑤ Compute the graph centrality (power iteration 

algorithm)⑥ Select the top 15 sentences with high centrality⑦ Compute all combinations with the length limit⑧ Score a summary as a whole, and keep the best⑨ Re-order the sentences within the summary

23/4/21 62

Page 63: Q & A Based  Internet Information Acquisition

Graph centrality

• Centrality measure: which is the most important node in a graph?– Degree centrality– Eigenvector centrality

• Suppose the centrality of sentence s is Cs, then

– Important connections make the node itself more important

s r s rr D r s

C Sim C

23/4/21 63

Page 64: Q & A Based  Internet Information Acquisition

Tailored to Update Summarization

• Problem: given cluster A, summarize cluster B

• Sentence in cluster B should be penalized 

• Matrix form:

AA AB

BA BB

SIM SIMSIM

SIM SIM

B A

s r s r r s rr D r s r D r s

C Sim C Sim C

' AA AB

BA BB

SIM SIMSIM

SIM SIM

23/4/21 64

Page 65: Q & A Based  Internet Information Acquisition

How to score term, summary?

• Score a term– The position of a word (headline, first sentence)– With manual tuning parameters: 

• Score(w)=tf(w)0.4*Fd(df(w))

• Score a summary– The term frequency of each word– The centrality of each sentence

0 15( )

( ) 1 0 45 2Score w

Power wmaxscore

( )( ) ( )Power ww

w S

SumScore S count

23/4/21 65

Page 66: Q & A Based  Internet Information Acquisition

Update Summarization Results (TAC 2008)

Evaluation Method Best Result Our Result Rank

Average Modified Score

0.336 0.304 7 /58

Macroaverage Modified Score with 3 models

0.331 0.299 7 /58

Average Linguistic Quality

3.333 3.073 2 /58

Average Overall Responsiveness

2.667 2.667 1 /58

23/4/21 66

Page 67: Q & A Based  Internet Information Acquisition

TALK OUTLINE

• How we can get information from internet• Research topics

– Information similarity measure– information extraction– Information summarization

• Future work

23/4/21 67

Page 68: Q & A Based  Internet Information Acquisition

Challenges • If there is an universal semantic information similarity 

measure for any types of information.• How to organize and represent the knowledge in order 

to reuse it.• How to combine rules and statistical methods to get 

sophisticated and tidy answers with comprehensive information.

• How to integrate all kinds of information resources to achieve a domain independent QA system.  

• ……

– Goal: provide a scalable question and answering service

23/4/21 68

Page 69: Q & A Based  Internet Information Acquisition

Acknowledgement 

• Prof. Ming Li, Dr. Minlie Huang, Dr. Yu Hao• Xian Zhang, Chong Long, Feng Jin, Zhicheng 

Zheng, Fan Bu• Jianshu Sun, Yang Tang• Shouyuan Chen, Yuanming Yu 

23/4/21 69

Page 70: Q & A Based  Internet Information Acquisition

Q&A

Thanks !

23/4/21 70