Text Preprocessing
A taxonomy of text preprocessing tasks
Text Normalization
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
Tokenization: the task of segmenting running text into words
Type vs. Token
- Word types: the distinct words in a text (its vocabulary)
- Word tokens: the running occurrences of words in a text, counting every repeat
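To make the type/token distinction concrete, here is a minimal Python sketch. The function name and the lowercase, letters-only tokenizer are assumptions made for illustration; a different tokenizer would give different counts.

```python
import re

def count_types_and_tokens(text):
    """Return (number of word types, number of word tokens)."""
    # Crude tokenizer (an assumption): lowercase, then keep maximal runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Types are the distinct tokens; tokens are all occurrences.
    return len(set(tokens)), len(tokens)

n_types, n_tokens = count_types_and_tokens(
    "The cat sat on the mat because the mat was warm"
)
print(n_types, n_tokens)
```

Here "the" and "mat" each count once as a type but several times as tokens.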
Simple Tokenization in UNIX
- Step 1. Tokenizing
- Step 2. Sorting
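The original commands are not shown in these notes. A pipeline of this kind is typically written as `tr -sc 'A-Za-z' '\n' < file | sort | uniq -c`. The sketch below mirrors those two steps in Python; the function names and the sample text are assumptions for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: treat every run of non-letters as a separator,
    # similar in spirit to tr -sc 'A-Za-z' '\n'.
    return re.findall(r"[A-Za-z]+", text)

def sorted_counts(tokens):
    # Step 2: count duplicates and sort alphabetically,
    # similar in spirit to sort | uniq -c.
    return sorted(Counter(tokens).items())

tokens = tokenize("From fairest creatures we desire increase, that we may see")
for word, count in sorted_counts(tokens):
    print(count, word)
```

Note that this crude letters-only rule splits "I'm" into "I" and "m", which is exactly the kind of problem the punctuation issues describe.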
Punctuation Issues
- Word-internal punctuation: Ph.D., 555,500.50
- Clitic contractions: What’re, I’m
- Multi-token words: New York, rock ’n’ roll
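One way to handle the first two cases is a regular-expression tokenizer with ordered alternatives. The pattern below is an illustrative assumption, not a standard: it keeps numbers and period-internal abbreviations whole and leaves clitics attached to their host word. Multi-token words such as New York are not handled and would need a lexicon.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_PATTERN = re.compile(r"""
      \$?\d+(?:,\d{3})*(?:\.\d+)?%?     # numbers, prices, percentages: 555,500.50
    | (?:[A-Za-z]+\.){2,}               # abbreviations with internal periods: Ph.D.
    | [A-Za-z]+(?:['’][A-Za-z]+)*       # words, keeping clitics attached: I'm, What’re
    | \S                                # any other non-space character on its own
""", re.VERBOSE)

def tokenize(text):
    # All groups are non-capturing, so findall returns the full matches.
    return TOKEN_PATTERN.findall(text)

print(tokenize("She has a Ph.D., earns $555,500.50, and says I'm fine."))
```

Whether a clitic such as ’m should stay attached or become its own token is a design choice; this sketch keeps it attached.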
Language Issues
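- French: how should an elided article be split off, as L, L’, or Le?
- German: noun compounds are written without spaces, e.g. Lebensversicherungsgesellschaftsangestellter ("life insurance company employee")
- Chinese and Japanese: no spaces between words, e.g. フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)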
Normalization
- Word normalization: the task of putting words/tokens in a standard format by choosing a single normal form for words that have multiple forms
- Information retrieval: indexed text and query terms must have the same form
- Most commonly, equivalence classes of terms are defined implicitly, e.g. by deleting periods in a term or by case folding
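A minimal sketch of implicit equivalence classes in a retrieval setting: the same normalizer is applied to indexed text and to query terms, so different surface forms land on the same index key. The function names and the two normalization rules (case folding, period deletion) are assumptions chosen for illustration.

```python
def normalize(term):
    # Case folding plus period deletion: "U.S.A." and "usa" share one class.
    return term.lower().replace(".", "")

def build_index(docs):
    # Map each normalized term to the set of document ids containing it.
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(normalize(term), set()).add(doc_id)
    return index

def search(index, query_term):
    # The query goes through the same normalizer as the indexed text.
    return index.get(normalize(query_term), set())

docs = {1: "Made in U.S.A.", 2: "usa today"}
index = build_index(docs)
print(search(index, "USA"))
```

The trade-off is that aggressive normalization can merge terms that should stay distinct, for example "US" and "us" under case folding.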
Lemmatization
- Lemma: a set of lexical forms that share the same stem, major part of speech, and rough word sense
- Lemmatization: reducing each word form to its correct dictionary headword (e.g. am, are, is → be)
Morphological parsing
- The most sophisticated lemmatization methods rely on full morphological parsing of a word into its morphemes (stems and affixes)
Stemming
- Crude chopping of affixes
- A simpler, rougher alternative to lemmatization
Porter’s algorithm
- The most common algorithm for stemming English
- Examples: rewrite rules, some of which are sensitive to the measure of the word (roughly, the number of vowel-consonant sequences in the stem)
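The sketch below illustrates both kinds of rule on a tiny subset of Porter's algorithm: the plain suffix rewrites of step 1a, and a rule that fires only when the measure m of the remaining stem is large enough. It is not a full stemmer, and the helper names are assumptions.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' acts as a vowel when it follows a consonant (e.g. "by").
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # m in the form [C](VC)^m[V]: count vowel-to-consonant transitions.
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

# Step 1a rewrite rules, longest suffix first; the first match wins.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def apply_measure_rule(word, suffix, replacement, min_measure):
    # Rewrite only if the stem left after removing the suffix has m > min_measure.
    if word.endswith(suffix):
        stem = word[: len(word) - len(suffix)]
        if measure(stem) > min_measure:
            return stem + replacement
    return word

print(apply_step_1a("caresses"), apply_step_1a("ponies"))
print(apply_measure_rule("relational", "ational", "ate", 0))
print(apply_measure_rule("rational", "ational", "ate", 0))
```

The measure condition is what keeps "rational" intact while "relational" is rewritten: the stem "r" has m = 0, whereas "rel" has m = 1.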
Sentence Segmentation
- Also called sentence boundary detection or sentence splitting
- A crucial first step in text processing
Decision Tree Version
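The tree itself is not reproduced in these notes. As a rough illustration only, the sketch below hand-codes a sequence of yes/no decisions for whether a token closes a sentence. The rules and the abbreviation list are assumptions, not a learned decision tree.

```python
# Hypothetical abbreviation list; a real system would use a much larger lexicon.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "inc.", "etc.", "ph.d."}

def is_sentence_end(token, next_token):
    """Hand-written decision rules for whether `token` closes a sentence."""
    if token.endswith(("?", "!")):
        return True           # '?' and '!' are relatively unambiguous
    if not token.endswith("."):
        return False          # no sentence-final punctuation at all
    if token.lower() in ABBREVIATIONS:
        return False          # the period belongs to a known abbreviation
    if next_token is None:
        return True           # end of text
    return not next_token[0].islower()  # a lowercase continuation suggests no boundary

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_sentence_end(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

for s in split_sentences("Dr. Smith earned a Ph.D. in 1999. Did it cost $4.50? Yes!"):
    print(s)
```

A known weakness of these rules is an abbreviation that also ends a sentence (e.g. a sentence-final "etc."), which is one reason features like these are usually fed to a trained classifier instead.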
POS Tagging
- The process of assigning a part-of-speech or lexical class marker to each word in a collection.
Probabilistic Model
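- The hat (^) means "our estimate of the best one"
- argmax_x f(x) means "the x such that f(x) is maximized"
- Intuition of Bayesian classification: use Bayes' rule to rewrite the posterior over tag sequences in terms of probabilities that are easier to estimate
- Bigram (2-gram) Hidden Markov Model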
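The equations for this section are not reproduced in these notes. The block below is a reconstruction under the standard bigram HMM tagging formulation, with the symbols w_{1:n} (the word sequence) and t_{1:n} (the tag sequence) introduced here as assumptions.

```latex
\begin{align*}
\hat{t}_{1:n}
  &= \operatorname*{argmax}_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) \\
  &= \operatorname*{argmax}_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
     && \text{(Bayes' rule)} \\
  &= \operatorname*{argmax}_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
     && \text{(the denominator does not depend on } t_{1:n}\text{)} \\
  &\approx \operatorname*{argmax}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(bigram HMM assumptions)}
\end{align*}
```

The last step uses two simplifying assumptions: each word depends only on its own tag, and each tag depends only on the previous tag.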
Quizlet
https://quizlet.com/_89qocm?x=1qqt&i=184b21