What Is Lemmatization?
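Lemmatization is a text normalization technique in natural language processing. It uses vocabulary and morphological analysis to reduce each word to its dictionary form, or lemma, rather than simply stripping affixes. For example, “building has floors” reduces to “build have floor” upon lemmatization. Here “building” is treated as a verb form; a lemmatizer that uses part-of-speech information would typically keep the noun “building” unchanged.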
Lemmatization is often used for:
- Information retrieval, where matching on lemmas expands a search to all inflected forms of a word
- Reducing the dimensionality of text classification, sentiment analysis, or topic modeling problems
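The dimensionality point can be illustrated with a minimal Python sketch. It assumes a tiny, hypothetical lemma table (LEMMA_TABLE) written only for this example; a real lemmatizer relies on a full vocabulary and morphological analysis. The sketch counts distinct tokens before and after mapping each token to its lemma.

```python
from collections import Counter

# Hypothetical lemma table for this example only; not a real vocabulary.
LEMMA_TABLE = {"floors": "floor", "tiles": "tile", "has": "have"}


def lemmatize_tokens(tokens):
    """Map each token to its lemma; tokens not in the table pass through."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]


tokens = "the floors have tiles and the floor has a tile".split()

raw_counts = Counter(tokens)
lemma_counts = Counter(lemmatize_tokens(tokens))

# Fewer distinct terms means fewer features in a bag-of-words model.
print(len(raw_counts), "distinct tokens before lemmatization")
print(len(lemma_counts), "distinct tokens after lemmatization")
```

Because “floors” and “floor” (and likewise “has” and “have”) collapse to one term, the same lookup also lets a search for one form match documents containing the other.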
Lemmatization vs. Stemming
Stemming, a related approach, strips affixes using simple heuristic rules. It often produces roots or word fragments that are not actual words, whereas lemmatization returns valid dictionary words.
Examples of lemmatization and stemming are shown below.
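As a rough illustration of the contrast, the Python sketch below places a toy suffix-stripping stemmer next to a toy dictionary-based lemmatizer. The suffix list (SUFFIXES), the mini-vocabulary (LEMMAS), and both function names are hypothetical and exist only for this example; production stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only: not a real stemming or lemmatization algorithm.

# Hypothetical suffix rules, tried in order; the first match is stripped.
SUFFIXES = ("ies", "ing", "es", "ed", "s")

# Hypothetical mini-vocabulary mapping inflected forms to their lemmas.
LEMMAS = {
    "studies": "study",
    "running": "run",
    "was": "be",
    "better": "good",
    "floors": "floor",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the vocabulary; unknown words pass through."""
    return LEMMAS.get(word, word)


for word in ["studies", "running", "was", "better", "floors"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

In this sketch the stemmer turns “studies” into “stud” and “running” into “runn”, neither of which is a word, and it cannot relate “was” to “be” or “better” to “good”. The vocabulary lookup handles those irregular forms, at the cost of needing a dictionary.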
In MATLAB®, you can lemmatize text using the normalizeWords function with the 'Style' option set to 'lemma'. To learn more about lemmatization and about building predictive models from text data with MATLAB, see Text Analytics Toolbox™.
See also: natural language processing, sentiment analysis, word2vec, stemming, n-gram, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™