Data Science Arts: Text preprocessing for Text Mining in R (stemming)

Before analyzing text, we need to preprocess it. Text data contains white space, punctuation, stop words, and so on. These elements convey little information and add noise to the analysis. For example, English stop words like “an”, “the”, “is”, and “are” tell you little about the sentiment of the text, the entities mentioned in it, or the relationships between those entities. Depending on the task at hand, we handle such elements differently. Removing or normalizing them helps focus text mining in R on the important words. Common preprocessing steps are:

Convert the text to lower case, so that words like “wrong” and “Wrong” are considered the same word for analysis
Remove numbers
Remove English stopwords, e.g. “are”, “is”, “of”, etc.
Remove punctuation, e.g. “,”, “?”, etc.
Eliminate extra white spaces
Stemming text

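Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, for example changing “laptop”, “laptops”, “laptop’s”, and “laptops’” to “laptop”. It also helps with regular verb forms that share a meaning, such as “walk”, “walked”, and “walking”. Irregular forms such as “see”, “saw”, and “seen” are generally not merged by a suffix-stripping stemmer; handling them requires lemmatization.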

One very useful library to perform the above steps and text mining in R is the “tm” package. The main structure for managing documents in tm is called a Corpus, which represents a collection of text documents.
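As a minimal sketch of what a corpus looks like, the snippet below builds one from a small character vector and inspects it. The sample sentences and the name `textdata` are hypothetical placeholders, and the tm package is assumed to be installed.

```r
library(tm)

# Hypothetical sample data: one element per document
textdata <- c("The laptop is wrong.",
              "Are 2 laptops better than 1?",
              "Wrong   spacing   here")

# VectorSource treats each element of the vector as one document
docs <- Corpus(VectorSource(textdata))

# Number of documents in the corpus
n_docs <- length(docs)
print(n_docs)

# Text of the first document, still unprocessed at this point
first_doc <- as.character(docs[[1]])
print(first_doc)
```

Each element of the vector becomes one document, so a data frame column of reviews or tweets can be passed in the same way.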

Cleaning text in R

# Transform and clean the text

library("tm")

# textdata is assumed to be a character vector with one element per document
docs <- Corpus(VectorSource(textdata))

# Convert the text to lower case

docs <- tm_map(docs, content_transformer(tolower))

# Remove numbers

docs <- tm_map(docs, removeNumbers)

# Remove common English stopwords

docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove punctuation

docs <- tm_map(docs, removePunctuation)

# Eliminate extra white spaces

docs <- tm_map(docs, stripWhitespace)
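To see the effect of these transformations end to end, here is a small self-contained sketch that runs the same steps on two made-up sentences and pulls the cleaned text back out. The sample input and the variable name `cleaned` are assumptions for illustration only.

```r
library(tm)

# Hypothetical sample input; substitute your own character vector
textdata <- c("The 3 Laptops are WRONG, aren't they?",
              "Is   this    the 2nd laptop?")

docs <- Corpus(VectorSource(textdata))
docs <- tm_map(docs, content_transformer(tolower))       # lower case
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stopwords
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, stripWhitespace)                    # collapse runs of spaces

# Extract the cleaned text, one string per document
cleaned <- vapply(seq_along(docs),
                  function(i) paste(as.character(docs[[i]]), collapse = " "),
                  character(1))
print(cleaned)
```

Note that `stripWhitespace` collapses runs of white space into a single space but does not trim the ends, so a leading or trailing space may remain; base R's `trimws()` can remove it if needed.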

To stem text, we also need another library, SnowballC, which I will cover in a separate blog post.
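In the meantime, a rough sketch of what stemming does, assuming the SnowballC package is installed. The word lists and variable names here are hypothetical examples, not part of the pipeline above.

```r
library(SnowballC)

# Hypothetical example words
nouns <- c("laptop", "laptops")
verbs <- c("walk", "walks", "walked", "walking")

# wordStem applies a Snowball stemmer to each word in a character vector
noun_stems <- wordStem(nouns, language = "english")
verb_stems <- wordStem(verbs, language = "english")

print(noun_stems)
print(verb_stems)

# Inside a tm pipeline, the analogous step would be:
#   docs <- tm_map(docs, stemDocument)
# which relies on SnowballC being installed.
```

Stems are not always dictionary words, since the stemmer only strips suffixes by rule, so it is worth inspecting the output before building further analysis on it.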
