Why do stop words still appear in the topics generated by LDA topic modeling, even though I removed them with removeStopWords?
Hello and good day to you.
I am doing topic modeling with Latent Dirichlet Allocation (LDA), which requires preprocessing (cleaning) the data first. I applied the following preprocessing steps, in this order:
1- Tokenize the text using tokenizedDocument.
2- Add part-of-speech details using addPartOfSpeechDetails.
3- Lemmatize the words using normalizeWords.
4- Erase punctuation using erasePunctuation.
5- Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
6- Remove words with 2 or fewer characters using removeShortWords.
7- Remove words with 15 or more characters using removeLongWords.
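The steps above can be sketched roughly as follows. This is only a minimal sketch of the pipeline as described, not the exact script: the variable names (textData, documents, bag) and the sample sentences are hypothetical placeholders, and it assumes the Text Analytics Toolbox is installed.

```matlab
% Hypothetical sample data; replace with the real text.
textData = [
    "The quick brown fox jumps over the lazy dog."
    "Topic models are fitted to collections of documents and their words."];

documents = tokenizedDocument(textData);                 % 1. tokenize
documents = addPartOfSpeechDetails(documents);           % 2. add part-of-speech details
documents = normalizeWords(documents,'Style','lemma');   % 3. lemmatize
documents = erasePunctuation(documents);                 % 4. erase punctuation
documents = removeStopWords(documents);                  % 5. remove stop words
documents = removeShortWords(documents,2);               % 6. drop words with 2 or fewer characters
documents = removeLongWords(documents,15);               % 7. drop words with 15 or more characters

% Bag-of-words model built from the cleaned documents; this is the
% object that would then be passed to fitlda.
bag = bagOfWords(documents);
```

The important point for the question is that the same cleaned documents variable must be the one used to build the bag-of-words model that is given to the LDA fit.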
However, when the LDA model generates topics (where a topic in LDA is a collection of probably related words), one topic contains stop words, even though they were removed from the data in step 5 and therefore should not be present in the data modeled by LDA. Why do these stop words still appear in one of the resulting topics, although they do not even exist in the Vocabulary of the model?
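One way to verify the claim about the vocabulary is to compare the fitted model's Vocabulary property and the top words of each topic against the default stop word list. The snippet below is a hedged sketch: the sample text, the number of topics, and the variable names (mdl, leftover, k) are assumptions for illustration, and it assumes that removeStopWords was called with its default list, which is the one returned by stopWords.

```matlab
% Hypothetical sample data; replace with the real documents.
textData = [
    "The cat and the dog are sleeping on the sofa."
    "A topic model of the documents groups related words."
    "The dog chased the cat around the garden."
    "Words in documents are grouped into topics by the model."];

documents = tokenizedDocument(textData);
documents = erasePunctuation(documents);
documents = removeStopWords(documents);
documents = removeEmptyDocuments(documents);

bag = bagOfWords(documents);
mdl = fitlda(bag,2,'Verbose',0);   % 2 topics is an arbitrary choice here

% Stop words that are still present in the model vocabulary (if any).
leftover = intersect(lower(mdl.Vocabulary), stopWords);
disp(leftover)

% Stop words among the top words of each topic (if any).
k = min(5, numel(mdl.Vocabulary));
for topicIdx = 1:mdl.NumTopics
    tbl = topkwords(mdl,k,topicIdx);
    isStop = ismember(lower(tbl.Word), stopWords);
    fprintf("Topic %d: %d stop word(s) among the top %d words\n", ...
        topicIdx, nnz(isStop), k);
end
```

If leftover is empty but a displayed topic still shows stop words, the topic words being displayed are probably not coming from this model or from this cleaned set of documents.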
Please help!