Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations...

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 7: Text Operations

Alexander Gelbukh

www.Gelbukh.com

Previous chapter: Conclusions

Modeling of text helps predict the behavior of systems

o Zipf's law, Heaps' law

Formally describing the structure of documents allows part of their meaning to be processed automatically, e.g., in search

Languages to describe document syntax

o SGML, too expensive

o HTML, too simple

o XML, good combination

Text operations

Linguistic operations

Document clustering

Compression

Encryption (not discussed here)

Linguistic operations

Purpose: convert words to “meanings”

Synonyms or related words

o Different words, same meaning. Morphology

o Foot / feet, woman / female

Homonyms

o Same word, different meanings. Word senses

o River bank / financial bank

Stopwords

o Words with no meaning of their own. Function words

For good or for bad?

More exact matching

o Less noise, better recall

Unexpected behavior

o Difficult for users to grasp

o Harmful if it introduces errors

More expensive

o Adds a whole new technology

o Maintenance; language dependent

o Slows down

Good if done well, harmful if done badly

Document preprocessing

Lexical analysis (punctuation, case)

o Simple, but must be done carefully

Stopwords. Reduces index size and processing time

Stemming: connected, connection, connections, ...

o Multiword expressions: hot dog, B-52

o Here, all the power of linguistic analysis can be used

Selection of index terms

o Often nouns; noun groups: computer science

Construction of thesaurus

o synonymy: network of related concepts (words or phrases)
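
The first preprocessing steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list, the multiword table, and the name `preprocess` are invented for the example and are not taken from the lecture or the book.

```python
import re

# Hypothetical, tiny stopword list; real lists hold a few hundred words.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to", "is"}

# Hypothetical table of multiword expressions kept as single index terms.
MULTIWORDS = {("hot", "dog"): "hot_dog"}


def preprocess(text):
    """Lexical analysis, multiword joining, and stopword removal."""
    # Lexical analysis: lowercase; keep letters, digits, inner hyphens (B-52).
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORDS:
            # Join the two tokens into one index term.
            terms.append(MULTIWORDS[pair])
            i += 2
        else:
            if tokens[i] not in STOPWORDS:
                terms.append(tokens[i])
            i += 1
    return terms
```

Stemming and index term selection would follow as further stages on the returned list.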

Stemming

Methods

o Linguistic analysis: complex, expensive maintenance

o Table lookup: simple, but needs data

o Statistical (Avetisyan): no data, but imprecise

o Suffix removal

Suffix removal

o Porter algorithm. Martin Porter. Ready-to-use code on his website

o Substitution rules: sses → ss, s → (nothing)

o stresses → stress
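
As an illustration of suffix substitution, here is a sketch of step 1a of the Porter algorithm only (the full algorithm has several more steps and conditions). The function name `porter_step_1a` is chosen for this example. Besides the two rules above, step 1a also contains ies → i and ss → ss; the latter protects words such as "stress" from the plain s rule.

```python
# Step 1a of the Porter algorithm: the longest matching suffix wins,
# so the rules are listed from longest to shortest.
STEP_1A_RULES = [
    ("sses", "ss"),  # stresses -> stress
    ("ies", "i"),    # ponies -> poni
    ("ss", "ss"),    # stress -> stress (blocks the plain "s" rule)
    ("s", ""),       # cats -> cat
]


def porter_step_1a(word):
    """Apply the first matching step 1a rule; return the word unchanged otherwise."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```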

Better stemming

The whole range of problems of computational linguistics

POS disambiguation

o well: adverb or noun? Oil well.

o Statistical methods. Brill tagger

o Syntactic analysis. Syntactic disambiguation

Word sense disambiguation

o bank1 and bank2 should be different stems

o Statistical methods

o Dictionary-based methods. Lesk algorithm

o Semantic analysis
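
A dictionary-based method can be sketched with a simplified variant of the Lesk algorithm: choose the sense whose dictionary definition shares the most words with the context. The two glosses below are invented for this example, and the original algorithm compares definitions of neighboring words rather than the raw context.

```python
# Hypothetical two-sense mini-dictionary; a real system uses a full dictionary.
GLOSSES = {
    "bank1": "sloping land beside a river or lake",
    "bank2": "financial institution that accepts deposits and lends money",
}


def simplified_lesk(context, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

In practice the overlap count should ignore stopwords, otherwise words like "a" or "the" dominate the score.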

Thesaurus

Terms (controlled vocabulary) and relationships

Terms

o used for indexing

o represent a concept. One word or a phrase. Usually nouns

o sense. Definition or notes to distinguish senses: key (door).

Relationships

o Paradigmatic: synonymy, hierarchical (is-a, part), non-hierarchical

o Syntagmatic: collocations, co-occurrences

WordNet. EuroWordNet

o synsets

Use of thesaurus

To help the user formulate the query

o Navigation in the hierarchy of words

o Yahoo!

For the program, to collate related terms

o woman ↔ female

o fuzzy comparison: woman ≈ 0.8 * female. Path length
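
One possible way to turn path length into a fuzzy match weight is sketched below. The toy hierarchy, the function names, and the weighting rule (each link multiplies the match by 0.8) are assumptions made for this example; the slide does not specify a formula.

```python
# Hypothetical is-a hierarchy: child -> parent.
IS_A = {"woman": "female", "man": "male", "female": "person", "male": "person"}


def ancestors(term):
    """Return the chain term, parent, grandparent, ... up to the root."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def path_length(a, b):
    """Number of is-a links on the shortest path between a and b, or None."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            # First shared node is the lowest common ancestor.
            return steps_a + chain_b.index(node)
    return None


def similarity(a, b, decay=0.8):
    """Assumed weighting: every link on the path multiplies the match by decay."""
    length = path_length(a, b)
    return 0.0 if length is None else decay ** length
```

With this hierarchy, `similarity("woman", "female")` is 0.8, matching the weight on the slide, and more distant terms get smaller weights.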

Yahoo! vs. thesaurus

The book says Yahoo! is based on a thesaurus.

I disagree

Thesaurus: words of the language organized in a hierarchy

Document hierarchy: documents attached to a hierarchy

This is word sense disambiguation

I claim that Yahoo! is based on (manual) WSD

Also uses a thesaurus for navigation

Document clustering

Operation on the whole collection

Global vs. local

Global: whole collection

o At compile time, one-time operation

Local

o Cluster the results of a specific query

o At runtime, with each query

Local clustering is more of a query transformation operation

o Already discussed in Chapter 5

Compression

Gain: storage, transmission, search

Lost: time on compressing/decompressing

In IR: need for random access.

o Blocks do not work

Also: pattern matching on compressed text

Compression methods

Statistical

Huffman: fixed code for each symbol.

o More frequent symbols get shorter codes

o Allows starting decompression from any symbol

Arithmetic: dynamic coding

o Need to decompress from the beginning

o Not for IR

Dictionary

Pointers to previous occurrences. Lempel-Ziv

o Again not for IR

Compression ratio

Compressed size / uncompressed size

Huffman, units = words: up to 2 bits per char

o Close to the limit = entropy. Only for large texts!

o Other methods: similar ratio, but no random access

Shannon: the optimal code length for a symbol with probability p is -log2 p bits

Entropy: limit of compression

o Average length with optimal coding

o Property of model
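
The entropy of a text under the simplest model (symbols drawn independently with their observed frequencies) can be computed directly from Shannon's formula. This is a small sketch; the function name is chosen for the example, and the symbols may be characters or words.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Zero-order entropy: the average of -log2 p over the observed distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, `entropy_bits_per_symbol("aabb")` gives 1.0 bit per character, while passing `"to be or not to be".split()` measures bits per word. A different model of the same text gives a different entropy, which is why entropy is a property of the model.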

Modeling

Find the probability of the next symbol

Adaptive, static, semi-static

o Adaptive: good compression, but need to start from the beginning

o Static (for language): poor compression, random access

o Semi-static (for specific text; two-pass): both OK

Word-based vs. character-based

o Word-based: better compression and search

Huffman coding

Each symbol is encoded, sequentially

More frequent symbols have shorter codes

No code is a prefix of another one

How to build the tree: see the book

Byte codes are better

o Allow for sequential search
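
For readers without the book at hand, the standard tree construction (repeatedly merge the two least frequent subtrees) can be sketched as follows. This is the bit-oriented textbook version, not the byte-oriented variant recommended above, and the function name is chosen for the example. Passing a list of words gives a word-based semi-static code.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(symbols):
    """Build a binary Huffman code; returns {symbol: bit string}."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new node.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]
```

Because no code is a prefix of another, the encoded text can be decoded symbol by symbol without separators.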

Dictionary-based methods

Static (simple, poor compression), dynamic, semi-static.

Lempel-Ziv: references to previous occurrences

o Adaptive

Disadvantages for IR

o Need to decode from the very beginning

o New statistical methods perform better

Comparison of methods

Compression of inverted files

Inverted file: words + lists of docs where they occur

Lists of docs are ordered. Can be compressed

Seen as lists of gaps.

o Short gaps occur more frequently

o Statistical compression
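
The gap representation can be sketched as below. The Elias gamma code is used here only as one well-known example of a code that gives short codewords to small numbers; the slide does not say which code is meant, and all function names are chosen for the example. Document numbers are assumed to start at 1.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of document numbers into first id + differences."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the document numbers."""
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


def elias_gamma(n):
    """Elias gamma code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    # Prefix of zeros tells the decoder how many bits the number occupies.
    return "0" * (len(binary) - 1) + binary
```

For the list 3, 5, 6, 20 the gaps are 3, 2, 1, 14; the frequent short gaps cost only a few bits each.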

Our work: order the docs for better compression

o We code runs of docs

o Minimize the number of runs

o Distance: # of different words

o Traveling salesman problem (TSP).

Research topics

All computational linguistics

o Improved POS tagging

o Improved WSD

Uses of thesaurus

o for user navigation

o for collating similar terms

Better compression methods

o Searchable compression

o Random access

Conclusions

Text transformation: meaning instead of strings

o Lexical analysis

o Stopwords

o Stemming

o POS, WSD, syntax, semantics

o Ontologies to collate similar stems

Text compression

o Searchable

o Random access

o Word-based statistical methods (Huffman)

Index compression

Thank you!

Till the compensation lecture
