
[Text Mining] Text Preprocessing

by Air’s Big Data 2020. 4. 2.

Text Preprocessing

 

A taxonomy of text preprocessing tasks


Text Normalization

  • Tokenizing (segmenting) words

  • Normalizing word formats

  • Segmenting sentences

 

Tokenization : the task of segmenting running text into words

 

 

Type vs. Token

  • Word types : distinct words in a text (the size of its vocabulary)

  • Word tokens : individual occurrences of words in a text, counting repeats
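To make the distinction concrete, here is a minimal Python sketch that counts both. The regex tokenizer and the function name `count_types_and_tokens` are illustrative assumptions, not part of the original lecture; a different tokenizer would give different counts.

```python
import re

def count_types_and_tokens(text):
    # Lowercase, then treat each run of letters (with an optional
    # internal apostrophe) as one token. This is a simplifying assumption.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    types = set(tokens)  # distinct words only
    return len(tokens), len(types)

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
n_tokens, n_types = count_types_and_tokens(sentence)
print(n_tokens, n_types)
```

The word "the" occurs three times in the sentence, so it contributes three tokens but only one type.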

 

Simple Tokenization in UNIX

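  • STEP 1. Tokenizing : split the running text into one word per line

  • STEP 2. Sorting : sort the words so that identical tokens are adjacent and can be counted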
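The two steps above are commonly run as a shell pipeline along the lines of `tr -sc 'A-Za-z' '\n' < file | sort | uniq -c`; that exact command is an assumption here, since the original figure is not available. A rough Python equivalent, with the hypothetical helper name `word_frequencies`, might look like this:

```python
import re
from collections import Counter

def word_frequencies(text):
    # Step 1, tokenizing: every maximal run of ASCII letters is a token
    # (what tr -sc 'A-Za-z' '\n' does by turning everything else into newlines).
    tokens = re.findall(r"[A-Za-z]+", text)
    # Step 2, sorting: identical tokens end up next to each other.
    tokens.sort()
    # Counting duplicates, as uniq -c would on the sorted stream.
    counts = Counter(tokens)
    return sorted(counts.items())

for word, count in word_frequencies("the cat and the hat"):
    print(count, word)
```

Like the shell version, this sketch is case-sensitive, so "The" and "the" are counted separately unless the text is lowercased first.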

 

Punctuation Issues

  • Word-internal punctuation : Ph.D., 555,500.50

  • Clitic contractions : What’re, I’m

  • Multi-token words : New York, Rock ‘n’ roll
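A regular-expression tokenizer can handle the first two cases by trying more specific patterns before generic ones. The sketch below is only one possible set of patterns, chosen for illustration; `TOKEN_RE` and `tokenize` are hypothetical names. Multi-token words such as "New York" are not handled, since they would typically need a lexicon of multiword expressions.

```python
import re

# Alternatives are tried left to right, so the specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    (?:[A-Za-z]+\.){2,}              # abbreviations with internal periods: Ph.D., U.S.A.
  | \d+(?:[.,]\d+)*                  # numbers with internal separators: 555,500.50
  | [A-Za-z]+(?:['’][A-Za-z]+)*      # words, keeping clitics attached: I’m, What’re
  | [^\w\s]                          # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    # findall returns the full match because all groups are non-capturing.
    return TOKEN_RE.findall(text)

print(tokenize("I’m paying $555,500.50 for a Ph.D., OK?"))
```

Whether a clitic such as "’m" should stay attached or become its own token depends on the downstream task; this sketch keeps it attached.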

Language Issues

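  • French : how should an elided article such as the L’ in L’ensemble be tokenized? As L, L’, or Le?

  • German : compound nouns are written without spaces, e.g. Lebensversicherungsgesellschaftsangestellter ("life insurance company employee")

  • Chinese and Japanese : no spaces between words, e.g. フォーチュン500社は情報不足のため時間あた$500K(6,000万円)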

 

Normalization

  • Word normalization : the task of putting words/tokens in a standard format, choosing a single normal form for words with multiple forms

  • Information Retrieval : indexed text and query terms must have the same form (e.g. U.S.A. should match USA)

  • Most commonly, equivalence classes of terms are defined implicitly (e.g. by deleting periods in a term)
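As a small illustration of implicit equivalence classes, the sketch below applies the same normalization to indexed text and to query terms. The specific rules (period deletion and case folding) and the names `normalize_term`, `build_index`, and `search` are assumptions made for this example.

```python
def normalize_term(term):
    # Delete periods and fold case, so that "U.S.A.", "USA" and "usa"
    # all fall into the same equivalence class.
    return term.replace(".", "").lower()

def build_index(docs):
    # Map each normalized term to the set of document ids containing it.
    index = {}
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index.setdefault(normalize_term(term), set()).add(doc_id)
    return index

def search(index, query_term):
    # The query goes through the same normalization as the indexed text.
    return index.get(normalize_term(query_term), set())

docs = ["made in U.S.A.", "USA today", "made in Korea"]
index = build_index(docs)
print(search(index, "usa"))
```

Aggressive normalization can also merge terms that should stay distinct, so the rules are usually chosen with the application in mind.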

 

Lemmatization

  • Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense

  • Lemmatization: the task of finding the correct dictionary headword form of a word (e.g. am, are, is → be)

Morphological parsing

  • The most sophisticated lemmatization methods parse each word into its morphemes (stems and affixes)
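A toy lemmatizer can show the idea of mapping surface forms to headwords. The tables below (`IRREGULAR`, `LEXICON`, `SUFFIXES`) are tiny, hand-made assumptions for illustration only; a real system would rely on a full dictionary and a proper morphological analyzer.

```python
# Hypothetical, hand-made resources for illustration only.
IRREGULAR = {"am": "be", "are": "be", "is": "be", "was": "be", "went": "go", "mice": "mouse"}
LEXICON = {"be", "go", "mouse", "car", "walk", "sing"}
SUFFIXES = ["ing", "ed", "s"]

def lemmatize(word):
    word = word.lower()
    # Irregular forms are looked up directly.
    if word in IRREGULAR:
        return IRREGULAR[word]
    # A word that is already a headword is returned unchanged.
    if word in LEXICON:
        return word
    # Otherwise strip a suffix only if the remainder is a known headword.
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            return stem
    return word

print([lemmatize(w) for w in ["are", "cars", "walking", "sing", "went"]])
```

Because every suffix rule is checked against the lexicon, "sing" is not reduced to "s", which is the kind of error crude suffix chopping can make.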

 

Stemming

  • Crude chopping of affixes

  • A simpler but cruder alternative to lemmatization; the result need not be a dictionary word

 

Porter’s algorithm 
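Porter's algorithm applies a cascade of suffix rewrite rules. The sketch below is not the full algorithm: it applies only a handful of rules in that spirit, and the rule list `RULES` and the function `crude_stem` are assumptions chosen for illustration.

```python
import re

# (suffix, replacement, stem_must_contain_vowel); the first matching rule wins.
RULES = [
    ("sses", "ss", False),     # caresses -> caress
    ("ies", "i", False),       # ponies   -> poni
    ("ational", "ate", False), # relational -> relate
    ("ing", "", True),         # walking  -> walk, but sing stays sing
    ("ed", "", True),          # walked   -> walk
    ("ss", "ss", False),       # caress   -> caress
    ("s", "", False),          # cats     -> cat
]

def crude_stem(word):
    word = word.lower()
    for suffix, replacement, needs_vowel in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Only strip "ing"/"ed" when a vowel remains in the stem.
            if needs_vowel and not re.search(r"[aeiou]", stem):
                return word
            return stem + replacement
    return word

print([crude_stem(w) for w in ["caresses", "ponies", "relational", "walking", "sing", "cats"]])
```

Outputs such as "poni" are not dictionary words, which is the typical trade-off of stemming against lemmatization.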

 

Sentence Segmentation

  • Sentence boundary detection, sentence splitting

  • A crucial first step in text processing

Decision Tree Version
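A decision tree for sentence boundaries can be read as a cascade of yes/no questions about each token. The sketch below hand-codes a few such questions; the specific tests, the abbreviation list `ABBREVIATIONS`, and the function names are assumptions for illustration, not the tree from the original figure.

```python
# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "etc.", "inc.", "ph.d."}

def is_sentence_end(token, next_token):
    # Question 1: does the token end in "?" or "!"?
    if token.endswith(("?", "!")):
        return True
    # Question 2: does it end in a period at all?
    if not token.endswith("."):
        return False
    # Question 3: is it a known abbreviation?
    if token.lower() in ABBREVIATIONS:
        return False
    # Question 4: does the next token start with a lowercase letter?
    if next_token is not None and next_token[0].islower():
        return False
    return True

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_sentence_end(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith paid $4.50 for tea. Was it good? Yes!"))
```

In practice the questions and their order are usually learned from labeled data rather than written by hand.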


POS Tagging

  • The process of assigning a part-of-speech or lexical class marker to each word in a collection.

 

Probabilistic Model

 

 

  • The hat ^ means “our estimate of the best one”

  • argmax_x f(x) means “the x such that f(x) is maximized”

  • Intuition of Bayesian classification: use Bayes’ rule to turn P(tags | words) into probabilities that are easier to estimate

  • (2-gram) Hidden Markov Model
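Written out, the standard bigram HMM tagging estimate is (reconstructed from the usual textbook derivation, since the original figures are not available):

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$

The argmax is usually computed with the Viterbi algorithm. Below is a minimal sketch; all probabilities in `START_P`, `TRANS_P`, and `EMIT_P` are made-up toy values, and `START_P` stands in for the transition out of a start state.

```python
# Toy model: every number below is invented for illustration.
TAGS = ["DT", "NN", "VB"]
START_P = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS_P = {  # TRANS_P[prev][cur] = P(cur | prev)
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT_P = {  # EMIT_P[tag][word] = P(word | tag)
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"dog": 0.05, "barks": 0.6},
}

def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best sequence
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p].get(t, 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P))
```

Real taggers work with log probabilities and smoothed estimates to avoid underflow and zero counts; both are left out here to keep the sketch short.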


Quizlet

https://quizlet.com/_89qocm?x=1qqt&i=184b21

 
