What is stemming?

Stemming in natural language processing (NLP) reduces words to their root form, or stem, which may not itself be a valid word. For example, "arguing" and "argued" are often both stemmed to "argu," which isn't a real word. Stemming works primarily by removing suffixes, which groups different forms of the same word together so computers can process them more efficiently. This reduces the number of unique words to consider, improving the accuracy and efficiency of many NLP tasks.

Key takeaways

  • What it is: Stemming is a fast, rule-based process in NLP for cutting words down to their root form (for example, "running" becomes "run")
  • Purpose: It reduces word variations to improve the efficiency of search engines and text analysis models
  • Key consideration: Stemming is faster but less accurate than lemmatization, as its output may not be a real word (for example, "arguing" becomes "argu")
  • Common algorithms: The most well-known types are the Porter, Snowball, and Lancaster stemmers

What is the purpose of stemming?

The main purpose of stemming is to reduce the variations of a word that a machine has to process. By reducing words to their base form, machines can treat different forms of the same word as a single entity. For example, "running," "runs," and "runner" would all be reduced to the stem "run." This simplification can help improve the accuracy and efficiency of various NLP tasks.

Some key purposes of stemming include:

  • Information retrieval: Stemming enables search engines to retrieve relevant documents even if the search query uses different forms of the words present in the documents
  • Text mining: Stemming helps identify patterns and trends in large text datasets by grouping together different forms of the same word
  • Machine translation: Stemming can potentially improve the accuracy of machine translation by reducing the number of words that need to be translated

How does stemming work in NLP?

Stemming algorithms use a set of rules to identify and remove suffixes from words. These rules are often based on linguistic patterns or statistical analysis of large collections of text. The algorithms generally work in a series of steps, each removing a specific type of suffix. For example, a simple stemming rule might be to remove the suffix "-ing" from words ending in "-ing." The process is usually fast and computationally inexpensive, making it suitable for processing large amounts of text data.
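The suffix-stripping idea can be sketched in a few lines of Python. This is a toy illustration only (the suffix list and the doubled-consonant rule are invented for the example); real algorithms such as Porter's apply many more rules with careful conditions:

```python
# Toy rule-based stemmer: strips the first matching suffix, longest first.
# Illustrative only -- not a real algorithm like Porter or Snowball.
SUFFIXES = ["ing", "ers", "ies", "ed", "es", "er", "s"]

def simple_stem(word: str, min_stem_len: int = 3) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            # Undo a doubled final consonant ("runn" -> "run").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

print(simple_stem("running"))  # -> "run"
print(simple_stem("argued"))   # -> "argu" (not a valid word, as noted above)
```

Note how "argued" yields "argu" — a stem that is useful for grouping but is not itself an English word.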

Stemming and conflation

One important concept related to stemming is conflation, which involves treating different words or phrases as semantic matches because they refer to the same central idea. For example, "decided" and "decidable" might not be synonyms but could be treated as similar in certain contexts, such as when analyzing topics related to decision-making processes. Stemming can be seen as a type of conflation that focuses on reducing inflectional variations of words.

Stemming also plays an important role in term conflation, which is a more general process of reducing lexical variations in text. Term conflation aims to reduce different forms of words (like stemming and lemmatization), as well as variations in meaning, grammar, or spelling. By reducing these differences, stemming can make text analysis and searching for information more effective.

Types of stemming algorithms

The foundation for stemming algorithms was laid in 1968 by Julie Beth Lovins, who developed the first published stemmer. Since then, several different stemming algorithms have been created, each with its own strengths and weaknesses:

Porter stemmer

The Porter stemmer is one of the oldest and most widely used stemming algorithms, developed by Martin Porter in 1980. It uses a series of rules to remove suffixes from English words. It's known for its simplicity and speed but can sometimes over-reduce words, leading to inaccuracies, and may not perform well for languages other than English. For example, a Porter stemmer might reduce "university," "universal," and "universities" all to the same stem: "univers." This demonstrates the aggressive nature of the algorithm and the potential loss of meaning.
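To give a flavor of how Porter's rules look, here is step 1a of the algorithm as described in Porter's 1980 paper, transcribed into Python. The full stemmer applies several further steps, each guarded by conditions on the remaining stem:

```python
# Step 1a of the Porter stemmer (plural handling). The full algorithm
# has several more steps with extra conditions on the stem's structure.
def porter_step_1a(word: str) -> str:
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

print(porter_step_1a("ponies"))  # -> "poni"
```

Each rule fires only if no earlier, longer suffix matched — this ordering is what keeps "caress" from losing its final "s."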

Snowball stemmer

The Snowball stemmer was developed as an improvement on the Porter stemmer. It supports multiple languages (not just English) and is generally considered more accurate, though it doesn't always avoid over-stemming. As a more sophisticated algorithm, it captures more linguistic nuances, produces more semantically meaningful stems, and offers a better balance between accuracy and speed. This can be helpful in applications where preserving the context and meaning of words is essential, such as information retrieval and machine translation.

Lancaster stemmer

The Lancaster stemmer is another popular algorithm known for its more aggressive reduction of words. While this can lead to faster processing, it can often result in more stemming errors compared to the Porter or Snowball stemmers. The increased speed, while helpful in certain situations like processing large volumes of text where time is of the essence, might not outweigh the potential loss of accuracy in many applications.

Stemming and lemmatization

While stemming and lemmatization are two methods used to reduce words to their basic form, they're not the same. Lemmatization is a more advanced process that takes into account the word's context and grammar. It uses a dictionary and morphological analysis to determine the word's dictionary form, also known as its lemma. Lemmatization typically produces a valid word (the lemma), unlike stemming, which may not. While lemmatization is generally more accurate than stemming, it's also more computationally expensive, since the dictionary lookups and morphological analysis take additional time.
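The contrast can be shown with a toy example. Here the lemma table is a hand-made illustration (a real lemmatizer uses a full lexicon plus morphological analysis), and the stemmer is a naive suffix stripper:

```python
# Toy contrast between suffix stripping and dictionary-based lemmatization.
# LEMMAS is a hand-made illustration, not a real lexicon.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def toy_stem(word: str) -> str:
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    # Fall back to the word itself when it's not in the table.
    return LEMMAS.get(word, word)

print(toy_stem("studies"))       # -> "stud"  (not a valid word)
print(toy_lemmatize("studies"))  # -> "study" (the dictionary form)
```

The stemmer is a single string operation per word, while even this toy lemmatizer needs a lookup table — a hint at why lemmatization costs more at scale.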

| Feature    | Stemming                | Lemmatization       |
|------------|-------------------------|---------------------|
| Complexity | Lower                   | Higher              |
| Accuracy   | Lower                   | Higher              |
| Speed      | Faster                  | Slower              |
| Output     | May not be a valid word | Always a valid word |

Applications of stemming

Stemming can be used in a variety of NLP tasks:

Information retrieval

Information retrieval systems, such as search engines, desktop search tools, retrieval-augmented generation (RAG), and document management systems, can greatly benefit from stemming. By applying stemming to search terms and the documents being searched, these systems can more effectively match queries with relevant content, even when the wording is not identical.
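A minimal sketch of this matching, using an invented three-document corpus and a naive suffix stripper standing in for a real algorithm such as Porter's:

```python
# Stem-based query matching over a toy corpus. stem() is a naive
# suffix stripper used for illustration, not a production algorithm.
def stem(word: str) -> str:
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

docs = {
    1: "hiker climbs the ridge",
    2: "she hiked all day",
    3: "rain delayed the trip",
}

def search(query: str) -> list[int]:
    query_stems = {stem(w) for w in query.lower().split()}
    return [doc_id for doc_id, text in docs.items()
            if query_stems & {stem(w) for w in text.lower().split()}]

print(search("hiking"))  # docs 1 and 2 match via the shared stem "hik"
```

Without stemming, the query "hiking" would match neither document, since the literal token never appears in the corpus.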

Text classification

Stemming can help improve the accuracy of text classification algorithms by reducing the number of features or attributes of the text data and increasing the likelihood that related words are grouped together. This makes it easier for the algorithm to identify patterns and classify texts accurately.

Text summarization

Text summarization can leverage stemming to help identify the most important words and reduce redundancy. By grouping related words together, stemming helps create more concise and informative summaries.

Sentiment analysis

Stemming can help determine whether a text is positive, negative, or neutral by reducing words to a common form. For example, "happy" and "happiness" might both be reduced to the stem "happi," making it easier to detect the overall sentiment without confusion from different word forms. However, stemming can sometimes introduce mistakes if it removes important information or shortens words incorrectly. Still, it generally makes sentiment analysis faster and more effective by focusing on the core meaning of words rather than their grammar.

Benefits of stemming

Using stemming can provide several potential advantages:

Improved model performance

Stemming can help boost the performance of your NLP models by reducing the number of unique words. This may lead to faster training times and improved prediction accuracy. By grouping related words, stemming strengthens the signal for pattern identification in the text. As a result, you may see more robust and accurate models, especially for tasks like text classification and sentiment analysis. For example, in Vertex AI, using stemming as a preprocessing step can improve the accuracy of your sentiment analysis models by reducing the impact of minor word variations.

Reduced dimensionality

Stemming directly reduces data dimensionality by decreasing the number of unique words to process. This can significantly reduce the resources required for tasks such as creating term-frequency matrices or building a vocabulary index. The lower dimensionality can also translate to faster processing speeds and lower memory consumption.
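The vocabulary reduction can be measured directly on a toy corpus. The stemmer here is a naive suffix stripper invented for the example; note that it still leaves the imperfect stem "runn" for "running," so the shrink is real but not perfect:

```python
# Measuring vocabulary reduction with a naive suffix stripper
# (a stand-in for a real stemming algorithm).
def stem(word: str) -> str:
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["run", "runs", "running", "jumped", "jumping", "jumps"]
raw_vocab = set(tokens)
stemmed_vocab = {stem(t) for t in tokens}

# 6 unique tokens shrink to 3 stems: {"run", "runn", "jump"}
print(len(raw_vocab), len(stemmed_vocab))
```

In a term-frequency matrix, this halving of the vocabulary halves the number of columns every downstream step has to handle.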

Improved search recall

In information retrieval systems, stemming can significantly improve recall. Someone searching for "hiking poles," for instance, might also find documents containing "hikes," "hiker," or "hiked." Stemming bridges the gap between different forms of the same word, ensuring relevant documents aren't missed due to minor variations in wording. This enhanced recall can be crucial for ensuring comprehensive search results, although it might come at the cost of more irrelevant results.

Enhanced clustering and topic modeling

Document clustering and topic modeling can be improved through stemming. By reducing words to their root forms, stemming helps group documents based on their underlying semantic meaning rather than superficial variations in word forms. This can result in more coherent and meaningful clusters or topics.

Simplified text preprocessing

Stemming can greatly simplify the overall text preprocessing pipeline. It reduces the number of unique terms that need to be considered in subsequent steps like stop word removal, feature extraction (TF-IDF, word embeddings), and data normalization. A cleaner, more concise data representation is often easier to manage and analyze, helping save development time and resources.

Reduced data sparsity and overfitting

In machine learning models that deal with text data, stemming can help reduce data sparsity by grouping together different forms of the same word. This may prevent overfitting, where the model memorizes specific word forms instead of learning generalizable patterns.

Limitations of stemming

Despite its benefits, stemming also has some possible limitations:

  • Over-stemming: This occurs when a stemming algorithm removes too much of a word, resulting in a stem that is not a valid word or that has a different meaning than the original word
  • Under-stemming: This can happen when a stemming algorithm fails to remove enough of a word, resulting in different forms of the same word being treated as different words
  • Loss of information: Stemming can sometimes result in a loss of information, as the suffixes that are removed may contain important grammatical or semantic information
  • Contextual errors: Stemming algorithms typically operate without considering the context of the word, which can lead to errors in cases where the same word has different meanings depending on the context
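Both failure modes are easy to reproduce with a naive stemmer. The rules below are hypothetical, chosen to make the errors visible:

```python
# Hypothetical rules that expose over- and under-stemming.
def naive_stem(word: str) -> str:
    for suffix in ("ity", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# Over-stemming: words with distinct meanings collapse to one stem.
print(naive_stem("university"), naive_stem("universal"))  # univers univers

# Under-stemming: related forms fail to conflate (irregular inflection).
print(naive_stem("ran"), naive_stem("run"))  # ran run
```

Irregular forms like "ran"/"run" are exactly where dictionary-based lemmatization outperforms suffix stripping.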
