Vector estimate of importance of each word in a text

What is a simple, widely used, practical way to estimate the 'importance' of each word in a book-length text? I need to produce a vector of estimated importance for each occurrence of each word: the output should have one entry per word in the text, in sequence (counting each occurrence separately), giving an importance estimate for that word.
For example, I would like to be able to do something like this:
Imp = wordImportance(longString);
where Imp is a two-column table with one row per word in longString (counting each occurrence separately, and including stopwords etc.), giving each word and an estimate of its importance in context within the text, likely based on n-grams.
I have access to the Text Analytics Toolbox, but I don't have sufficient knowledge of how to use it.
There appears to be an entire subfield of text analysis devoted to this problem. I am aware that there are many approaches and that choosing one depends on the details of what you are trying to achieve. That said, I need a basic importance estimate for each word to get started.
The examples for rakeKeywords and textrankKeywords seem relevant, but they don't ultimately produce a vector of the estimated importance of each word in a document.
Thank you for your help
  5 Comments
Rik on 22 Aug 2022
That's a fairly concise yardstick: 'can I omit all mentions of MathWorks products along with any posted code without materially changing the meaning?' I will be 'borrowing' that one.
On topic: I personally interpreted your question more along the lines of 'What importance measures are there in Text Analytics Toolbox?' In such cases I click around in the 'related functions' section and on the categories in the sidebar. That might lead you to pages like the one Steven linked you to.
The documentation is very good compared to all the other systems I've worked with. Use it to your advantage.
Christopher Creutzig on 1 Sep 2022
Could you specify which “well-used metrics” you would like to use?
I am not aware of any widely applicable metric that works on a single document without some background training corpus from the field in question to serve as a baseline against which to recognize what is relevant.
Depending on your concrete problem at hand, any of the following may or may not be useful:
  • Co-occurrence analysis, finding highly connected words, possibly after removing stopwords.
  • Named Entity recognition.
  • Splitting the text into chapters, paragraphs, or even sentences, then treating those as a corpus and looking at tfidf scores, topic mixtures (cf. fitlda), keyword extraction, extractSummary, and similar operations (a sketch of the tfidf variant follows after this list).
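
A minimal sketch of the sentence-level tfidf idea, as a starting point rather than a definitive importance measure: it assumes the book is in the string longString (as in the question), and the choice to average each word's nonzero tfidf scores across sentences is my assumption, not part of the suggestion above.
% Sketch: treat each sentence as a document, score words with tfidf,
% then map a score back onto every word occurrence in the text.
sentences = splitSentences(longString);   % one string per sentence
docs = tokenizedDocument(sentences);      % tokenize each sentence
bag = bagOfWords(docs);
M = tfidf(bag);                           % sentences-by-vocabulary score matrix
vocab = lower(bag.Vocabulary);
td = tokenDetails(tokenizedDocument(longString));
tokens = string(td.Token);                % one entry per word occurrence
score = zeros(numel(tokens),1);
for k = 1:numel(tokens)
    [~,col] = ismember(lower(tokens(k)),vocab);
    if col > 0
        v = nonzeros(M(:,col));           % scores of sentences containing the word
        if ~isempty(v)
            score(k) = mean(v);           % assumption: average over those sentences
        end
    end
end
Imp = table(tokens,score,'VariableNames',{'Word','Importance'});
The result has one row per word occurrence (stopwords included), which matches the output shape asked for in the question.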


Answers (1)

Constantino Carlos Reyes-Aldasoro
Why don't you start with a simpler problem: say you want to count the occurrences of certain words. That is a problem that is easy to validate manually and that will help you get started. Then you can increase the complexity step by step towards importance. For example, define importance as similarity to a certain word, or as membership in a certain group (colours, fruits, etc.), so that a word receives a value when it belongs to the group.
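
For instance, here is a minimal sketch of that starting point, assuming the text is in longString (as in the question) and using a hypothetical word group targetWords: it counts occurrences of the target words, then builds the per-occurrence table from the question with importance 1 for group members and 0 otherwise.
% Sketch: count target words, then assign a 0/1 "importance" to every
% word occurrence based on membership in the group.
targetWords = ["red" "green" "blue"];     % hypothetical word group
doc = tokenizedDocument(longString);
td = tokenDetails(doc);
tokens = string(td.Token);                % one entry per word occurrence
for k = 1:numel(targetWords)
    fprintf("%s: %d\n",targetWords(k),nnz(lower(tokens) == targetWords(k)));
end
isMember = double(ismember(lower(tokens),targetWords));
Imp = table(tokens,isMember,'VariableNames',{'Word','Importance'});
Validating the counts by hand on a short passage is easy, which is the point of starting here before moving to more elaborate importance measures.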

Release

R2022a
