Vector estimate of importance of each word in a text
What is a simple, widely used, practical way to estimate the 'importance' of each word in a book-length text? I need to produce a vector of estimated importance for every occurrence of every word, so the output has the same length as the number of words in the text in sequence (counting each occurrence separately).
For example, I would like to be able to do something like this:
Imp = wordImportance(longString);
where Imp will be a two-column table with one row per word of longString in sequence (including each occurrence separately, and including stopwords), giving each word's estimated importance in context, likely based on n-grams.
I have access to the Text Analytics Toolbox, but I don't have sufficient knowledge of how to use it for this.
There appears to be an entire sub-field of text analysis devoted to this problem. I am aware that there are many approaches and that selecting one depends on the details of what you are trying to achieve. That said, I need a basic importance estimate for each word to get started.
The examples for rakeKeywords and textrankKeywords seem relevant, but they don't ultimately produce a vector of estimated importance for each word in a document.
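To illustrate the output I'm after, here is a rough, untested sketch that spreads textrankKeywords scores back over every token. It assumes the function returns a table with Keyword and Score variables, as in its documentation examples, and I don't know whether giving every word of a keyword phrase the phrase's score is a sound measure:
doc = tokenizedDocument(longString);
kw  = textrankKeywords(doc);             % keyword phrases with scores
td  = tokenDetails(doc);                 % one row per token, in reading order
Imp = zeros(height(td),1);               % tokens outside any keyword score 0
for k = 1:height(kw)
    words = split(kw.Keyword(k));        % words of the k-th keyword phrase
    hit = ismember(lower(td.Token), lower(words));
    Imp(hit) = max(Imp(hit), kw.Score(k));
end
ImpTable = table(td.Token, Imp, 'VariableNames', {'Word','Importance'});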
Thank you for your help
5 Comments
Rik
on 22 Aug 2022
That's a fairly concise yardstick: 'can I omit all mentions of MathWorks products, along with any posted code, without materially changing the meaning?' I will be 'borrowing' that one.
On topic: I personally interpreted your question more along the lines of 'What importance measures are there in Text Analytics Toolbox?'. In such cases I click around on the 'related functions' section and on the categories in the sidebar. That might lead you to pages like the one Steven linked you.
The documentation is very good compared to all other systems I've worked with. Use it to your advantage.
Christopher Creutzig
on 1 Sep 2022
Could you specify which “well-used metrics” you would like to use?
I am not aware of any widely applicable metric that works on a single document by itself; you generally need a background training corpus from the field in question to provide a baseline against which relevant deviations can be recognized.
Depending on your concrete problem at hand, any of the following may or may not be useful:
- Co-occurrence analysis, finding highly connected words, possibly after removing stopwords.
- Named entity recognition.
- Splitting the text into chapters, paragraphs, or even sentences, treating those as a corpus, and then looking at tf-idf scores (see the sketch below), topic mixtures (cf. fitlda), keyword extraction, extractSummary, and similar operations.
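For the tf-idf route, a minimal untested sketch: it assumes the book text is in longString (the variable name from the question) and takes each word's maximum tf-idf over paragraphs as its score, which is only one of several reasonable aggregations:
paras = splitParagraphs(longString);      % treat each paragraph as a document
docs  = tokenizedDocument(lower(paras));
bag   = bagOfWords(docs);
M     = tfidf(bag);                       % paragraphs-by-vocabulary score matrix
vocabScore = full(max(M,[],1)).';         % max tf-idf per vocabulary word
% Expand to the per-token vector the question asks for
td = tokenDetails(tokenizedDocument(lower(longString)));
[found,loc] = ismember(td.Token, bag.Vocabulary);
Imp = zeros(height(td),1);
Imp(found) = vocabScore(loc(found));
ImpTable = table(td.Token, Imp, 'VariableNames', {'Word','Importance'});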
Answers (1)
Constantino Carlos Reyes-Aldasoro
on 2 Sep 2022
Why don't you start with a simpler problem, say counting the occurrences of certain words? That is easy to validate manually and will help you get started. Then you can increase the complexity step by step towards importance: for example, define importance as similarity to a certain word, or as membership in a certain group (colours, fruits, etc.), and assign a value to every word that belongs to the group.
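For instance, a minimal sketch of that stepping stone (it assumes the text is in longString, as in the question; the word group is just an example):
doc     = tokenizedDocument(lower(longString));
td      = tokenDetails(doc);                % one row per token, in order
targets = ["red" "green" "blue"];           % example group of interest
counts  = arrayfun(@(w) nnz(td.Token == w), targets);
table(targets', counts', 'VariableNames', {'Word','Count'})   % easy to check by hand
Imp = double(ismember(td.Token, targets));  % 1 if the token is in the group, else 0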
0 Comments