
CS6604 Course Project, Fall 2019

Automatic Classification of Arabic ETDs

Eman Abdelrahman and Fatimah Alotaibi

Supervised by: Dr. Edward Fox

December 10th, 2019

Virginia Tech, Blacksburg, 24061

Acknowledgements:

● We would like to deeply thank Dr. Fox for his continuous support.
● We also thank our colleague Palakh Jude for the guidelines and assistance she provided us.
● We would also like to thank our colleague Bill Ingram for adding us to his ARC allocation.
● Special thanks to Saudi Digital Libraries for providing an account to Fatimah Alotaibi, which made this project possible.
● Thanks to the Institute of Museum and Library Services (IMLS), LG-37-19-0078-19.

Outline:

● Motivation.
● NLP in Arabic language.
● Related work.
● Dataset.
● Preprocessing.
● Experiment and results.
● Insights and future work.

Motivation

● Electronic theses and dissertations (ETDs) are becoming a new genre.

● They need classification for better browsing and accessibility.

● An increasing number of universities ask their graduate students to deposit an Arabic translation of their ETD, or at least of its title and abstract.

● No prior machine learning research has been done on Arabic ETDs due to:
○ Data availability.
○ Complexity of the Arabic language.

NLP in Arabic Language

According to Nizar Y. Habash, “Introduction to Arabic Natural Language Processing” (Morgan & Claypool Publishers, 2010):

● The vast majority of Arabic words are morphologically complex.

● Arabic is a highly inflectional and derivational language.

● Arabic has rich and complex grammatical structures.

These properties pose significant challenges to many Natural Language Processing (NLP) applications.

Related Work

● Performance comparison of classification models.
● Building a new system and comparing it with other existing systems.
● Classification with no preprocessing.

Dataset:

● United Arab Emirates University “Scholarworks @ UAEU”.
● Challenge:

Dataset:

● Saudi Digital Library
○ AskZad Library
○ 12 categories
■ Total 518 documents
■ 124,320 words
● Challenge:

Categories:

● Mapping to the ProQuest categorization system

Preprocessing:

1. Stopwords removal
a. NLTK
2. Lemmatization
a. By Farasa API

Lemmatization works better than stemming for data mining and information retrieval, especially in Arabic, as it is a highly inflectional language.
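The two preprocessing steps can be sketched as a small pipeline. This is a minimal illustration, not the project's code: the stopword set is a tiny hand-picked stand-in for NLTK's Arabic list, and the lemma table `TOY_LEMMAS` is a hypothetical placeholder for a real lemmatizer such as Farasa, whose API is not shown here. The names `preprocess`, `ARABIC_STOPWORDS`, and `TOY_LEMMAS` are assumptions made for this sketch.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch); the slides
# used NLTK's Arabic stopword list instead.
ARABIC_STOPWORDS = {"في", "من", "على", "إلى", "عن", "هذا", "هذه", "التي", "الذي"}

# Hypothetical surface-form -> lemma table standing in for a real lemmatizer
# such as Farasa. The three entries are for illustration only.
TOY_LEMMAS = {"الطلاب": "طالب", "الجامعات": "جامعة", "دراسات": "دراسة"}

# Runs of basic Arabic letters (U+0621..U+064A); diacritics, digits, and
# punctuation act as token separators.
TOKEN_RE = re.compile(r"[\u0621-\u064A]+")


def preprocess(text, stopwords=ARABIC_STOPWORDS, lemmas=TOY_LEMMAS):
    """Tokenize, drop stopwords, then map each remaining token to its lemma."""
    tokens = TOKEN_RE.findall(text)
    kept = [tok for tok in tokens if tok not in stopwords]
    # Tokens missing from the toy table are passed through unchanged.
    return [lemmas.get(tok, tok) for tok in kept]


# Three tokens in; the stopword is removed and the other two are lemmatized.
print(preprocess("دراسات في الجامعات"))
```

Removing stopwords before lemmatizing keeps the lookup small. In a real pipeline the order may need to change if the stopword list is itself stored in lemmatized form.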

Experiments and Preliminary Results

● Multiclass classification performed poorly:
○ Average Accuracy ~ 24%
● Binary classification performed better:
○ Average Accuracy ~ 68% per Category

Experiments and Preliminary Results (Contd.):

● Multi-class Classification:

Classifier             Accuracy
SVM                    0.237
Decision Trees         0.244
Random Forest          0.252
Ensemble Classifier    0.259
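The slides do not say how the ensemble classifier combined its members. One common choice is hard (plurality) voting, sketched here purely as an assumption. The function names and the category labels in the example are hypothetical and are not taken from the project.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by plurality vote.

    Ties go to the label that appears first among the models, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions from three models on three documents.
svm_pred = ["law", "medicine", "education"]
tree_pred = ["law", "education", "education"]
forest_pred = ["medicine", "medicine", "education"]

ensemble_pred = majority_vote([svm_pred, tree_pred, forest_pred])
print(ensemble_pred)
print(accuracy(["law", "medicine", "education"], ensemble_pred))
```

With 12 categories of equal size, uniform random guessing would score about 1/12, or roughly 8%, so the accuracies in the table are above chance but still low.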

Experiments and Preliminary Results (Contd.):

● Binary Classification:
○ Random Forest
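A per-category binary setup can be sketched as a one-vs-rest relabeling followed by an average over categories. This is a minimal sketch under assumptions: the slides do not describe how the binary labels were built or how the documents were sampled, and the category names and predictions below are hypothetical.

```python
def one_vs_rest_labels(labels, category):
    """Map multi-class labels to 1 (in category) or 0 (any other category)."""
    return [1 if label == category else 0 for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_binary_accuracy(labels, binary_preds):
    """Score each category's 0/1 predictions, then average across categories.

    binary_preds maps a category name to the 0/1 predictions of the binary
    classifier trained for that category.
    """
    scores = {
        cat: accuracy(one_vs_rest_labels(labels, cat), preds)
        for cat, preds in binary_preds.items()
    }
    return scores, sum(scores.values()) / len(scores)


# Hypothetical true categories for four documents.
labels = ["law", "medicine", "law", "education"]
# Hypothetical outputs of two per-category binary classifiers.
binary_preds = {"law": [1, 0, 0, 0], "medicine": [0, 1, 0, 0]}

scores, mean_score = mean_binary_accuracy(labels, binary_preds)
print(scores)
print(mean_score)
```

Keeping the per-category scores next to the mean makes it easy to see which categories pull the average down.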

Insights and Future work

● Investigate why accuracy differs so much across categories in the Binary Classification.
● Investigate the low performance of the Multi-class Classification:
○ Parameter tuning
● Increase the size of the corpus:
○ Sketch Engine
● Run each classifier against both Arabic and English abstracts separately.
● Use word embeddings.

Questions
