NLP and Big Data
Shanxi HPC Research Center
Xiaoge LI, lixg@xupt.edu.cn
WBDB2013, Xi’an, China
Introduction
The Internet is a big, unstructured knowledge base
NLP & IE "understand" human language
Unstructured data → structured data
Problems
Human language changes
"Let's Google it!"
Net language (LOL, 给力)
Compound words (JFK airport)
Domain knowledge
Domain-specific training sets
Chinese tokenization
小菊/nr 的/u 生活/vn 很/d 给/v 力/vg
小菊/nr 的/u 生活/vn 很/d 给力/a
(小菊的生活很给力: "Xiaoju's life is awesome"; 给力 is net slang for "awesome")
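The difference between the two segmentations above can be reproduced with a dictionary-based segmenter. The following is a minimal sketch using forward maximum matching, which is a stand-in technique chosen for illustration and not necessarily the tokenizer used in this work. The function name `forward_max_match` and the toy lexicon are assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "小菊的生活很给力"

# Hypothetical toy lexicon that does not yet know the new word 给力.
base_lexicon = {"小菊", "的", "生活", "很", "给", "力"}

without_new_word = forward_max_match(sentence, base_lexicon)
with_new_word = forward_max_match(sentence, base_lexicon | {"给力"})

print(without_new_word)  # 给 and 力 come out as separate tokens
print(with_new_word)     # 给力 comes out as one token
```

Adding one entry to the lexicon is enough to change the segmentation, which is why acquiring new words from large corpora matters for tokenization quality.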
NLP needs big data
Unsupervised (weakly supervised) learning
Knowledge acquisition
Relationships
New words
NE gazetteer
System Architecture
Linux cluster
HDFS
Knowledge acquisition
NLP & IE
MapReduce
HBase
Entity graph
Information fusion
Knowledge acquisition
Large-scale corpus from the Web
Weakly supervised learning
Bootstrapping technique
MapReduce, HBase
Location NE and new words: P = 87.28%, 72.1%
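The bootstrapping idea named above can be sketched as follows: start from a few seed entities, collect the contexts they occur in, keep contexts that recur, and use those contexts to harvest new entities. This is a minimal, single-machine illustration under assumed simplifications (a context is just the previous and next token; the corpus and seeds are invented toy data). It is not the system's actual algorithm, which the slides do not detail.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_pattern_count=2):
    """Grow an entity set from seeds using (previous token, next token) contexts."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: count the contexts in which known entities appear.
        patterns = Counter()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in known:
                    patterns[(tokens[i - 1], tokens[i + 1])] += 1
        trusted = {p for p, count in patterns.items() if count >= min_pattern_count}

        # Step 2: every token that fills a trusted context becomes a candidate.
        found = set()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in trusted:
                    found.add(tokens[i])

        if found <= known:
            break  # nothing new was learned in this round
        known |= found
    return known


# Hypothetical toy corpus, already tokenized.
corpus = [
    "he lives in Beijing with family".split(),
    "she lives in Shanghai with friends".split(),
    "we stayed in Xian with colleagues".split(),
]
locations = bootstrap(corpus, seeds={"Beijing", "Shanghai"})
print(sorted(locations))
```

Here the context ("in", "with") is seen around both seeds, so it is trusted and picks up "Xian" as a new location candidate. Loose contexts also pick up wrong candidates, which is one reason precision is reported for this step.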
Chinese NLP & IE engine
Pipeline
FST & statistical mixture model
Input: plain text
Output: structured XML
MapReduce
Speed: 500 KB/s on 10 nodes
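To show how a per-document engine fits the MapReduce model, here is a small in-memory simulation of the map, shuffle, and reduce phases. Everything in it is an assumption for illustration: the gazetteer lookup stands in for the real FST and statistical engine, and `map_document`, `reduce_counts`, and `run_job` are hypothetical names, not Hadoop APIs.

```python
from collections import defaultdict

# Hypothetical toy gazetteer standing in for the real NE tagger.
GAZETTEER = {
    "Beijing": "Location",
    "Xi'an": "Location",
    "Huawei": "Organization",
}


def map_document(doc_id, text):
    """Map phase: emit ((entity, type), 1) for each recognized entity mention."""
    for token in text.split():
        entity_type = GAZETTEER.get(token)
        if entity_type is not None:
            yield (token, entity_type), 1


def reduce_counts(key, values):
    """Reduce phase: sum the mention counts for one entity."""
    return key, sum(values)


def run_job(documents):
    """Simulate the shuffle by grouping mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_document(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


documents = {
    "d1": "Huawei opened an office in Beijing",
    "d2": "Beijing and Xi'an host Huawei labs",
}
entity_counts = run_job(documents)
print(entity_counts)
```

Because each document is mapped independently, the map phase can be spread across nodes, which is the property the reported multi-node throughput relies on.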
Information Object
Named Entity
• Person
• Organization
• Location
• Product
• Time
Event
• Pre-defined Event
• General Event
Profile and Event
Example Profile
In a concept-based profile, the attributes are filled by the profiles of its participants.
Information Network
NLP
• Tokenization
• POS
• Shallow parsing
• Deep parsing
IE
• NE tag
• CE linkage
• NE Profile
• Profile Merge
Cross-Document Information Fusion
Hierarchical clustering
MapReduce, HBase
Half a million profiles
Computing complexity
P = 94.65%, R = 88.24%, F = 91.33%
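As an illustration of merging profiles by hierarchical clustering, the sketch below runs naive single-link agglomerative clustering over profiles represented as attribute sets, using Jaccard similarity. The representation, the similarity measure, the threshold, and the toy profiles are all assumptions, since the slides do not specify them. The last function checks that the reported F value follows from the reported P and R.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(c1, c2, profiles):
    """Single-link similarity: the best similarity over all cross-cluster pairs."""
    return max(jaccard(profiles[i], profiles[j]) for i in c1 for j in c2)


def agglomerate(profiles, threshold=0.5):
    """Repeatedly merge the two most similar clusters until none reach the threshold."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = cluster_similarity(clusters[x], clusters[y], profiles)
                if best is None or score > best[0]:
                    best = (score, x, y)
        if best[0] < threshold:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical profiles extracted from four documents for the same name.
profiles = [
    {"Li Ming", "professor", "Xi'an"},
    {"Li Ming", "professor", "Xi'an", "NLP"},
    {"Li Ming", "athlete", "Beijing"},
    {"Li Ming", "athlete", "Beijing", "swimming"},
]
merged = agglomerate(profiles)
print(merged)

reported_f = round(100 * f_measure(0.9465, 0.8824), 2)
print(reported_f)  # 91.33, matching the reported F
```

This naive version compares every pair of clusters in every round. With half a million profiles there are about 1.25 × 10^11 profile pairs, which is presumably the computing-complexity concern that motivates running the fusion step on MapReduce.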
Information Graph: multi-dimensional
Orange: Location
Gray: Organization
Blue: Person
Source: 2012 People’s Daily
Query: China Agricultural University
Expand 1 level
Organization-Organization Network
Query: China Agricultural University, filter: Organization
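The "expand 1 level" and type-filter operations on the entity graph can be sketched as a bounded breadth-first expansion over an adjacency map. The function `expand`, its parameters, and the toy graph are hypothetical; the real graph is stored in HBase and its schema is not given in the slides.

```python
def expand(graph, node_types, query, levels=1, keep_type=None):
    """Return nodes reachable from `query` within `levels` hops.

    If `keep_type` is set, nodes of other types are neither returned nor traversed.
    """
    seen = {query}
    frontier = {query}
    for _ in range(levels):
        next_frontier = set()
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor in seen:
                    continue
                if keep_type is not None and node_types.get(neighbor) != keep_type:
                    continue
                seen.add(neighbor)
                next_frontier.add(neighbor)
        frontier = next_frontier
    return seen - {query}


# Hypothetical toy graph; node names other than the query are placeholders.
graph = {
    "China Agricultural University": ["Beijing", "Ministry of Education", "Person A"],
    "Ministry of Education": ["Organization B"],
}
node_types = {
    "China Agricultural University": "Organization",
    "Beijing": "Location",
    "Ministry of Education": "Organization",
    "Person A": "Person",
    "Organization B": "Organization",
}

all_neighbors = expand(graph, node_types, "China Agricultural University")
org_neighbors = expand(graph, node_types, "China Agricultural University",
                       keep_type="Organization")
print(sorted(all_neighbors))
print(sorted(org_neighbors))
```

With `keep_type="Organization"` the result corresponds to an organization-organization view; raising `levels` widens the neighborhood one hop at a time.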
Location-Person Network
Query: 青岛港 (Port of Qingdao), filter: Location
Person-Location Network
Query: 金日成 (Kim Il-sung)
Future Work
Query language
Graph mining
Enhance NLP engine
Visualization
Questions?
Thank you