Text preprocessing for Text Mining in R (stemming)


Before analyzing text, we need to preprocess it. Raw text contains extra white space, punctuation, stop words, and other elements that carry little information and make processing harder. For example, English stop words like “an”, “the”, “is”, and “are” tell you little about the sentiment of the text, the entities mentioned in it, or the relationships between those entities. How we handle these elements depends on the task at hand, but removing them helps focus text mining in R on the important words. Typical preprocessing steps are:

  • Convert the text to lower case, so that words like “wrong” and “Wrong” are considered the same word for analysis
  • Remove numbers
  • Remove English stopwords, e.g. “are”, “is”, “of”
  • Remove punctuation, e.g. “,”, “?”
  • Eliminate extra white space
  • Stem the text
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, e.g. changing “laptop”, “laptops”, “laptop’s”, and “laptops’” to “laptop”. It also collapses regular verb forms with the same meaning, such as “walk”, “walked”, and “walking”. Irregular forms such as “see”, “saw”, and “seen” are generally not merged by a suffix-stripping stemmer; they need lemmatization instead.
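To make the idea concrete, here is a minimal sketch of word-level stemming. It assumes the SnowballC package is installed, and the word list is made up for illustration. Regular plurals and verb endings should collapse to a shared stem, while an irregular form such as "saw" is left as it is, because the stemmer only strips suffixes.

```r
library(SnowballC)  # assumption: SnowballC is installed

# Illustrative words only; not taken from any real dataset
words <- c("laptop", "laptops", "walk", "walked", "walking", "saw")

# wordStem() applies the Snowball stemmer for the chosen language
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```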
One very useful library to perform the above steps and text mining in R is the “tm” package. The main structure for managing documents in tm is called a Corpus, which represents a collection of text documents.

Cleaning text in R

# Transform and clean the text
library("tm")
# textdata is assumed to be a character vector with one document per element
docs <- Corpus(VectorSource(textdata))
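
The snippet above does not show where `textdata` comes from. As a rough sketch, assuming it is a plain character vector, the following builds a corpus from two made-up sentences and checks what ended up inside it. The sample sentences are hypothetical stand-ins for your own data.

```r
library(tm)

# Hypothetical sample input; replace with your own character vector
textdata <- c("The laptop is Wrong!", "Are 2 laptops better than 1?")

# Each element of the vector becomes one document in the corpus
docs <- Corpus(VectorSource(textdata))

length(docs)        # number of documents in the corpus
content(docs[[1]])  # raw text of the first document
```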


Using the TM library to process text

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
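
To see the effect of these steps, it can help to run them on one short example and compare the text before and after. The sketch below wraps the same transformations in a helper called `clean_corpus`; that name and the sample sentence are hypothetical and not part of tm.

```r
library(tm)

# Hypothetical helper (not part of tm): applies the cleaning steps in order
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))        # lower case
  corpus <- tm_map(corpus, removeNumbers)                       # drop digits
  corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop stopwords
  corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
  corpus <- tm_map(corpus, stripWhitespace)                     # collapse spaces
  corpus
}

# Made-up sentence for illustration
sample_text <- c("The laptop is Wrong! Are 2 laptops better than 1?")
raw_docs    <- Corpus(VectorSource(sample_text))
clean_docs  <- clean_corpus(raw_docs)

content(raw_docs[[1]])    # text before cleaning
content(clean_docs[[1]])  # text after cleaning
```

Stopwords are removed before punctuation here, as in the original steps, so that contractions such as "don't" can still match entries in the stopword list.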
To stem the text, we also need another library, SnowballC, which I will cover in a separate blog post.
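
As a brief preview, and assuming both tm and SnowballC are installed, stemming can be applied to a corpus in the same way as the other steps, using tm's `stemDocument` transformation. The input text below is made up and already lower-cased and free of punctuation, so this is only a sketch of the final step, not a full treatment.

```r
library(tm)
library(SnowballC)  # assumption: installed; tm uses it for stemming

# Made-up, already-cleaned text for illustration
cleaned <- c("laptops walked walking")
docs <- Corpus(VectorSource(cleaned))

# Stem every word in every document
docs <- tm_map(docs, stemDocument)

content(docs[[1]])  # stemmed text of the first document
```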
