For each document - whether it be a page from an Issue Crawler network or text submitted by the user - the Issue Discovery Tool does the following:
- Make a phrase list of noun phrases and capitalized sequences (resulting in a list of proper nouns, acronyms, ...)
- Add to the phrase list a list of significant words or phrases extracted from a larger source set of content by using the Yahoo Term Extraction Web Service
- Adjust the resulting phrase list as follows:
  - Lowercase all phrases in the list (for easy comparison)
  - Remove phrases that have a length less than 3
- Weight each phrase found in the previous steps as follows: count the number of times the phrase appears in the document. If the phrase comes from Yahoo, add 1 to that count (this favors Yahoo's presumed robustness). If the phrase does not come from Yahoo but consists of multiple terms, add 2 to that count (this assumes a preference for multi-term phrases over single terms when they did not come from Yahoo).
- Remove phrases that are on the stop word list.
- Remove phrases that are also part of a longer phrase in the list.
- Sum the weights of each phrase across all documents into one large list.
- Rank the list.
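
The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the tool's actual code: the function names `weigh_document` and `rank`, the contents of `STOP_WORDS`, and the way phrases are passed in are all hypothetical. It also assumes that "length less than 3" means fewer than 3 characters, that occurrences are counted by plain substring matching, and that "part of a longer phrase" means containment on word boundaries. Noun phrase extraction and the Yahoo call are left out; their results are supplied as plain lists.

```python
from collections import Counter

# Hypothetical stop word list; the real list is not given in this document.
STOP_WORDS = {"the", "and", "this"}


def weigh_document(text, extracted_phrases, yahoo_phrases):
    """Return a {phrase: weight} dict for a single document."""
    text = text.lower()
    extracted = {p.lower() for p in extracted_phrases}
    yahoo = {p.lower() for p in yahoo_phrases}

    # Assumption: "length" is measured in characters.
    phrases = {p for p in extracted | yahoo if len(p) >= 3}

    weights = {}
    for phrase in phrases:
        weight = text.count(phrase)  # naive substring count (assumption)
        if phrase in yahoo:
            weight += 1  # bonus for phrases returned by Yahoo
        elif len(phrase.split()) > 1:
            weight += 2  # bonus for multi-term phrases not from Yahoo
        weights[phrase] = weight

    # Remove phrases that are on the stop word list.
    weights = {p: w for p, w in weights.items() if p not in STOP_WORDS}

    # Remove phrases that also occur inside a longer phrase in the list.
    return {
        p: w
        for p, w in weights.items()
        if not any(p != q and f" {p} " in f" {q} " for q in weights)
    }


def rank(documents):
    """Sum the weights over all documents and rank by descending weight.

    `documents` is an iterable of (text, extracted_phrases, yahoo_phrases).
    """
    totals = Counter()
    for text, extracted, yahoo in documents:
        totals.update(weigh_document(text, extracted, yahoo))
    return totals.most_common()
```

Because shorter phrases are removed per document before the weights are summed, a phrase such as "water policy" can still appear in the final ranking if it was the longest matching phrase in at least one document.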
The Issue Discovery Tool is not designed to 'give proper weight' to items. It is more of a heuristic, a data exploration tool, than an empirical tool.