Text mining - mycourses.aalto.fi

Text mining Course: Methods of Data Mining Helena Ahonen-Myka, 6. 10. 2020


Page 1

Text mining

Course: Methods of Data Mining

Helena Ahonen-Myka, 6. 10. 2020

Page 2

Outline

1. Introduction

2. Text as data

3. Tasks and methods

4. Text mining process in practice

Page 3

1. Introduction

• What is text mining?

• Hard to formulate a definition for text mining

• Text mining combines aspects from many traditions (data mining, machine learning, natural language processing, information retrieval, corpus linguistics, …)

• Text mining is used as a nice label for many commercial products which process text data

• Three (overlapping) goals:

• Overview of content

• Applications

• Discovery of new knowledge

Page 4

Goal: overview of content

• A lot of text that is relevant in some situation may be available, but we don’t have time to read it all

• Electronic health records contain a lot of free-text information collected about patients’ visits, medications, and allergies (in many places, over the years)

• When a patient comes to visit a doctor, the doctor should be able to retrieve all the relevant information (but nothing else)

• Problem-oriented view

• Clustering by symptoms, body parts etc (which have to be recognized)

• Visualization e.g. on a timeline

Page 5

Goal: Applications

• Many applications involve language and text content

• Search engines, question-answering systems, chatbots, machine translation, speech recognition, document classification, and sentiment analysis

• Traditional fields: natural language processing, information retrieval,…

• These applications can be seen as text mining applications, but they also need representations of text

• These representations can be provided by other text mining methods

• Or the applications need other applications

• Gathering parallel texts for machine translation (e.g. the same text in English and French)

Page 6

Goal: discovery of new knowledge

• Text has many layers -> focus can be on language or content

• Extract entities in text and discover connections between them

• Find frequent patterns

• On the language level:• Collocations

• Grammatical structures -> building of parsers

• Author or genre analysis

• "Which words did Shakespeare use? How does this set differ from the words used by J.K. Rowling in her Harry Potter books?"

Page 7

Goal: discovery of new knowledge, content level

• Assume a factory that runs 24/7

• After each shift, the engineers write down a short note about the problems that occurred and what they did to solve them

• The next shift can see what happened and act accordingly (primary use of data)

• These diary notes have been collected for years, and they could be used to (secondary use of data)

• predict when a part will break

• learn what the experienced engineers do when this problem happens

• optimize steps in the process

Page 8

Some aspects/dimensions of text data

• Applications (or text mining problems) are also characterized by several dimensions, which then influence the methods needed

• A set of documents vs. continuous stream of text

• Short text fragments (e.g. tweets), vs long documents (e.g. books)

• Legacy data, produced by organizations and software that no longer exist

• New data: may contain many words (e.g. names, terms) never seen before

• Others’ data vs own data (with some background information)

Page 9

Other dimensions

• Language of text

• English: a lot of data for training models, a lot of linguistic and other tools, many other resources (dictionaries, terminologies, pre-trained models), simple morphology (not much inflection)

• Most other languages have far fewer tools and resources, whereas their linguistic features can be more challenging (rich morphology, word boundaries)

• Endangered languages, dead languages

• Multilingual documents/applications

• More complex, but languages can also help each other (knowledge transfer)

Page 10

Other dimensions

• Text as major focus vs text as small part

• Text as small part

• E.g. a text field in a database

• Important to handle efficiently

• For instance, contact information for our business customers

• Information collected by many people over several years

• Many duplicates: various ways to write the names of companies, addresses, and contact persons

Page 11

Other dimensions

• Bad vs clean data…

• ”Bad” data is often found in old systems, on the web, and in user-generated content

• Text created for primary use may not be suitable for secondary use of data

• Formal vs. informal language

• Formal: books, news articles

• No errors, words found in dictionaries: many linguistic tools available

• Informal: social media postings, text messages

• ”Errors”, new creations: statistical methods needed

• Structured vs unstructured data

• “Normal” text vs XML or HTML documents

Page 12

2. Text as data: words and documents

• Usually we see text as a sequence of words

• But we seldom process text data as such

• There are many ways to express any idea -> a lot of variety in text

• To make text easier to process, we usually include pre-processing steps that

• reduce variation

• add external knowledge

• Most of the pre-processing choices have something to do with the frequencies of words, or more precisely their frequency distribution

Page 13

Text as data: words

• If the words of any document (or collection) are ranked by their frequency

• A small set of words occurs very frequently

• Frequencies of other words decrease fast

• There is a long tail of infrequent words

• Rare words are not rare…

• Usually the middle-frequency words are considered the most informative

• Some of these words may describe a domain, some are more specific

• Whether this distinction is relevant depends on the task
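
The rank-frequency behaviour described above can be seen with a few lines of Python (a toy sketch on a made-up sentence; the effect is much clearer on a real collection):

```python
from collections import Counter

def rank_frequencies(text):
    """Rank the words of a text by descending frequency (a Zipf-style view)."""
    return Counter(text.lower().split()).most_common()

sample = "the cat sat on the mat and the dog sat on the rug"
for rank, (word, freq) in enumerate(rank_frequencies(sample), start=1):
    print(rank, word, freq)
```

On real text the same few lines show the pattern on the slide: a handful of very frequent words, a fast drop, and a long tail of words seen once.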

Page 14

Pre-processing, words to terms

• In general, we can try to find more informative terms by modifying words to move them into the middle-frequency range:

• Combining very frequent words with other words (‘can_be_found_in’)

• Collapsing different forms, e.g. inflected forms to one base form or stem

• Replacing low-frequency words with a more general synonym or a more generic term, maybe from a controlled vocabulary (e.g. WordNet)

• Replacing low-frequency words with generic tags (“this is a noun”, “this is a person name”)

• Converting all words to lower-case

• Masking numbers (“2020” -> “dddd”)
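
A few of these steps can be sketched in Python (the stopword list and the digit-masking rule are illustrative choices, not part of the slides):

```python
import re

STOPWORDS = {"a", "an", "the", "is", "in", "of"}  # tiny illustrative list

def normalize(token):
    token = token.lower()                # lower-casing
    return re.sub(r"\d", "d", token)     # mask digits: "2020" -> "dddd"

def preprocess(text):
    return [normalize(t) for t in text.split()
            if normalize(t) not in STOPWORDS]

print(preprocess("The meeting is on 6.10.2020 in Otaniemi"))
# -> ['meeting', 'on', 'd.dd.dddd', 'otaniemi']
```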

Page 15

Pre-processing, numbers

• Consider the numeric expressions in the following sentence from the MedLine Corpus:

• The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.

• Should we say that the numeric expression 4.53 +/- 0.15% is three words?

• Or should we say that it's a single compound word?

• Or should we say that it is actually nine words, since it's read "four point five three, plus or minus zero point fifteen percent"?

• Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Should we mask it? How?

Page 16

Pre-processing: lower levels (string/bytes)

• Words are not the only way to represent text

• Particularly with informal language, we may need to process text, at least first, as strings of characters

• There are efficient search methods for searching patterns/substrings in text

• Character encodings are unfortunately still often a problem

• Byte-Pair Encoding, WordPiece, SentencePiece

• Motivated by the intuition that rare and unknown words can often be decomposed into multiple subwords, BPE finds the best word segmentation by iteratively and greedily merging frequent pairs of characters.
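
The greedy merge loop of BPE can be sketched as follows (a toy version over a three-word corpus; real implementations also record the learned merge rules so new text can be segmented):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs over all words (each word a tuple of symbols)."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# toy corpus: word (split into characters) -> frequency
vocab = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(2):  # two greedy merges: first ('w','e'), then ('we','r')
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
print(vocab)
```

After the two merges the frequent subword "wer" has emerged, which is exactly how BPE builds units for rare words out of pieces shared with frequent ones.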

Page 17

Feature extraction: from words to features

• Representation of text has to be suitable for the task

• Defining the computational problem and representation of data are essential

• There are often many ways to do this, and the most straightforward way is not always the most efficient

• It may be possible to find a short representation instead of including all words

• Some application-specific pre-processing could be applied to words

Page 18

Document and word representations

• Most methods have

• A vocabulary (or lexicon) V

• Representation of a document using members of V

• In many methods, the vocabulary is fixed

• Unknown (out-of-vocabulary) words need to be taken into account in the methods

• Document representation: a vector with |V| dimensions

• Representation of the kth word of V: a vector of |V| dimensions, with just one non-zero dimension (one-hot encoding)

• Value: present/not present, tf*idf (term frequency * inverse document frequency)

• Document = bag of words
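
A minimal sketch of such |V|-dimensional tf*idf document vectors (the two example documents are made up for illustration):

```python
import math
from collections import Counter

docs = [
    "text mining combines data mining and language processing".split(),
    "data mining finds patterns in data".split(),
]
vocab = sorted({w for d in docs for w in d})   # fixed vocabulary V
index = {w: k for k, w in enumerate(vocab)}    # word -> dimension k

def tfidf_vector(doc):
    """A |V|-dimensional document vector with tf * idf weights."""
    vec = [0.0] * len(vocab)
    for w, tf in Counter(doc).items():
        df = sum(1 for d in docs if w in d)    # document frequency
        vec[index[w]] = tf * math.log(len(docs) / df)
    return vec

vec = tfidf_vector(docs[1])
# 'data' occurs in every document, so its idf (and weight) is zero
```

Note that a word occurring in every document gets idf = log(1) = 0, which is why very frequent words carry little weight in this scheme.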

Page 19

Document and word representations

• Document as a bag of words

• The order of words in the document is not significant

• Sparse vector: the vocabulary is usually large, but a single document contains just a few words

• Sparse vector can be implemented with a dense vector

• If the order of words is important, n-grams can be used instead of words

• Bigrams, trigrams

• Combinations possible

• Relevant n-grams can also be found in pre-processing

• Only these are added
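
Extracting n-grams is a one-liner; a sketch:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "thank you for inviting me".split()
print(ngrams(tokens, 2))
# -> [('thank', 'you'), ('you', 'for'), ('for', 'inviting'), ('inviting', 'me')]
```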

Page 20

Document and word representations

• Word embeddings

• Bag of words does not take into account the similarity of words (‘car’, ‘auto’)

• Word2vec: a shallow neural model that represents words by their co-occurring words; different meanings of a word are not taken into account

• “I ate an apple.”, “I have an Apple phone.” -> same “meaning” for ‘apple’

• Other models take also meanings into account

• Fewer dimensions, density increased

Page 21

Document and word representations

• Word embeddings are trained first, and then used particularly in deep learning methods

• The newest deep learning models are pre-trained in a way that already includes word representations -> separate training for finding word representations is no longer needed

Page 22

3. Tasks and methods

• Learning tasks typical of natural language processing

• Classification

• Span prediction

• Generative tasks

• Finding frequent patterns in text

Page 23

Classification

• Task of choosing the correct class label for a given input

• The set of labels is defined in advance

• Examples:

• document classification ("Is this story about sports or economics?")

• named entity recognition ("Is this string of words a name of a person? Or place, or organization, or date?")

• topic detection and tracking ("Is this news story about something new?", "Is this story about some old event, and if so, which?").
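
As an illustration of the document-classification case, here is a minimal bag-of-words Naive Bayes classifier with add-one smoothing (a from-scratch toy, not a method the slides prescribe; the two training documents are invented):

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal bag-of-words Naive Bayes with add-one smoothing (a toy sketch)."""
    def fit(self, docs, labels):
        self.labels = set(labels)
        self.word_counts = {c: Counter() for c in self.labels}
        self.class_counts = Counter(labels)
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc.lower().split())
        self.vocab = {w for c in self.labels for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        def log_prob(c):
            lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in doc.lower().split():
                lp += math.log((self.word_counts[c][w] + 1) / total)
            return lp
        return max(self.labels, key=log_prob)

clf = NaiveBayes().fit(
    ["the team won the match", "stocks fell on market news"],
    ["sports", "economics"],
)
print(clf.predict("the match was exciting"))  # -> sports
```

With realistic data one would of course use a library implementation and proper pre-processing, but the structure (count words per class, compare smoothed log-probabilities) is the whole idea.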

Page 24

Classification, learning a model from labeled examples

Page 25

Span prediction

• Span-based question answering is a task where you have two texts, one called the context and the other called the question. The goal is to extract the answer to the question from the context, if it exists.

• context: “Today is going to be a rainy day and tomorrow it will be snowy.”

• question: “What is the weather like today?”

• extracted answer: “rainy”

• Summaries

• Given a text, select the substrings of it that summarize the content

Page 26

Generative tasks

• Machine translation

• Abstractive summarization

• The summary is freely generated text that presents the gist of a text

• Content generation• Generation of news, fiction

• GPT-3: language model that produces human-like text, given some text to start with• The Guardian Weekly ordered an ”opinion from a robot”:

https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3

Page 27

Text-to-label vs text-to-text

• Google’s Text-To-Text Transfer Transformer (T5) framework:

• Reframing of all NLP tasks into a unified text-to-text format where the input and output are always text strings

• In contrast to other models that can only output either a class or a span of the input

• Pre-training with a huge data set causes the model to develop general-purpose abilities and knowledge

• After that, the model can be fine-tuned on smaller labeled datasets, often resulting in (far) better performance than training on the labeled data alone

Page 28

T5: one model, many tasks

Page 29

Text-to-text transfer models

• Pre-training has to be self-supervised (the text alone has to be enough for training)

• Possible models:

• Masked language model: predicting missing words

• Original: ”Thank you for inviting me to your party last week.”

• Input for training: ”Thank you <X> me to your party <Y> week.”

• Target: ”<X> for inviting <Y> last <Z>”

• Next sentence prediction

• Sample sentence pairs (A, B) so that 50% of the time B follows A, and 50% of the time it does not

• Target: yes or no
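
The masked-language-model example above can be reproduced mechanically; a sketch of T5-style span corruption (the span positions are chosen by hand here, whereas real pre-training samples them randomly):

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace the given (start, end) token spans with sentinel tokens,
    producing an (input, target) pair as in T5-style pre-training."""
    inp, tgt, prev = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inp += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    tgt += [sentinels[len(spans)]]   # closing sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(inp)  # -> Thank you <X> me to your party <Y> week .
print(tgt)  # -> <X> for inviting <Y> last <Z>
```

This matches the slide's example: the model sees the input with sentinels and must generate the target, i.e. the dropped spans in order.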

Page 30

Finding frequent phrases in text

• Problem: Find the maximal-length word sequences that are frequent (e.g. occur more than 3 times) in the sentences of a collection. There can be gaps between the words in the maximal sequence.

• Given a set of sentences:

• Thank you for inviting me to your party last week.

• Thank you for sending me your nice feedback last week.

• Thank you for calling me.

• Given a frequency threshold of 2,

• Thank you for me your last week is a maximal frequent sequence

• Thank you for me is not maximal, but may be useful
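
Counting the support of such gapped sequences can be sketched directly (using the slide's three sentences, lower-cased):

```python
def occurs_with_gaps(pattern, sentence):
    """True if the pattern's words appear in the sentence in order, gaps allowed."""
    it = iter(sentence)
    return all(word in it for word in pattern)  # 'in' consumes the iterator

def support(pattern, sentences):
    return sum(occurs_with_gaps(pattern, s) for s in sentences)

sentences = [
    "thank you for inviting me to your party last week".split(),
    "thank you for sending me your nice feedback last week".split(),
    "thank you for calling me".split(),
]
print(support("thank you for me your last week".split(), sentences))  # -> 2
print(support("thank you for me".split(), sentences))                 # -> 3
```

The two supports match the slide: the long sequence is frequent at threshold 2, and its prefix "thank you for me" is even more frequent, just not maximal.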

Page 31

Finding frequent phrases in text

• The maximal sequences can be seen as patterns

• Further processing may find common features in the words in the gaps:

• Thank you for <V_ing> me <V_ing>: any verb ending with -ing

• Possible application for language learning

• The most frequent structures and phrases can be found in textbooks and dictionaries

• But language has many lexical ”rules” that are not based on grammar

• Which preposition is used with this verb and noun?

• Frequent sequences could be used to generate gap filling tasks automatically

Page 32

Finding frequent phrases in text, method

• An ”apriori-inspired” method can be used

• First find frequent words, then generate candidate pairs using these words

• Check the frequency of the candidates, and select the frequent ones

• Generate k-grams from frequent k-1 grams, check the frequency, etc.

• But: a pure bottom-up approach would generate too many candidates

• Sequences (not just sets), sequences can be long, gaps allowed

• Solution: candidates are also expanded directly, as long as there are frequent k-grams that fit into an existing frequent sequence

• Plus some pruning etc.
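
The first, apriori-style level (frequent words -> frequent ordered pairs) can be sketched like this (a simplification that ignores maximality and longer sequences):

```python
from collections import Counter

def frequent_words(sentences, minfreq):
    """Words occurring in at least minfreq sentences."""
    counts = Counter(w for s in sentences for w in set(s))
    return {w for w, c in counts.items() if c >= minfreq}

def frequent_ordered_pairs(sentences, minfreq):
    """Apriori step: build candidate pairs only from frequent words; count a
    pair once per sentence where it occurs in order (gaps allowed)."""
    freq = frequent_words(sentences, minfreq)
    counts = Counter()
    for s in sentences:
        kept = [w for w in s if w in freq]   # infrequent words cannot help
        seen = set()
        for i, a in enumerate(kept):
            for b in kept[i + 1:]:
                seen.add((a, b))
        counts.update(seen)
    return {p for p, c in counts.items() if c >= minfreq}

sentences = [
    "thank you for inviting me to your party last week".split(),
    "thank you for sending me your nice feedback last week".split(),
    "thank you for calling me".split(),
]
pairs = frequent_ordered_pairs(sentences, 2)  # e.g. ('thank', 'you'), ('your', 'week')
```

Dropping infrequent words before pair generation is exactly the apriori pruning: a pair containing an infrequent word can never be frequent itself.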

Page 33

Finding frequent phrases, lessons learned

• Not just preprocessing + processing, but several rounds

• Each round does something simple, producing a better starting point for the next round (hopefully with less computing)

• The frequency distribution of text influences the steps

• A lot of pairs to process (but pairs are short…)

• Frequency of sequences drops fast as the sequences get longer

• -> different methods possible

• Various frequency levels can be processed separately

• For some rounds: remove stopwords and mask low-frequency words (<RARE>), then find frequent patterns

• Find sequences of stopwords separately, and merge them with the middle-frequency patterns

• Check if, in some contexts, some <RARE> words are frequent (local patterns)

Page 34

Finding frequent phrases, lessons learned

• Even if processing ordered sequences, the sequences can first be processed as sets of words

• Frequency(set of words in s) >= frequency(sequence s)

• A set is faster to process; if it is not frequent, the sequence is not frequent

• Would be interesting to have a more dynamic process

• For each word, more features (specific -> generic)

• The process would generalize word representations as needed

• Frequent words as such

• Rare words dynamically replaced by more generic features

• Many possibilities to explore…

Page 35

4. Text mining process in practice

• If you have a new project with customer data and some task, how do you start?

• Often you might have only ”potentially interesting” data, with no idea what could be done with it

• First, get familiar with the data

• List the most frequent words and their distribution in the collection

• List n-grams, collocations,…

• Use some simple methods to find baselines

• Clustering, classification

• Methods and tools in libraries (nltk.org, pytorch, trax)
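
For the "list n-grams, collocations" step, a quick first pass needs nothing beyond the standard library; here adjacent pairs are ranked by pointwise mutual information, keeping only pairs seen at least minfreq times (PMI with a frequency cutoff is one common choice, not the only one):

```python
import math
from collections import Counter

def collocations(tokens, minfreq=2, topn=5):
    """Adjacent word pairs ranked by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def pmi(pair):
        a, b = pair
        return math.log(bigrams[pair] * n / (unigrams[a] * unigrams[b]))

    frequent = [p for p, c in bigrams.items() if c >= minfreq]
    return sorted(frequent, key=pmi, reverse=True)[:topn]

tokens = "data mining and text mining use data mining methods".split()
print(collocations(tokens))  # -> [('data', 'mining')]
```

The frequency cutoff matters: without it, PMI rewards pairs of rare words that happen to co-occur once, which is rarely what you want in a first look at the data.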

Page 36

Text mining process in practice

• Experiment with preprocessing

• Cleaning of data: normalization, removal of duplicates and extra information, possibly correcting errors

• Often preprocessing takes a lot of (your) time, and is frustrating…

• Define learning or discovery tasks

• Design representations for words and documents

• Starting with: What is a word? What is a document in our context?

• Select/develop methods; use existing efficient implementations if possible

• Evaluate, visualize results, iterate (do not give up!)

Page 37

5. Summary

• Discussion on text mining applications and characteristics

• Text as data: word and document representations

• Tasks and methods

• Text mining process in practice