
I have a list of about 18,000 unique words scraped from a database of government transcripts that I would like to make searchable in a web app. The catch: This web app must be client-side. (AJAX is permissible.)

All the original transcripts are in neat text files on my server, so the index file of words will list which files contain each word and how many times, like so:

ADMINSTRATION   {"16": 4, "11": 5, "29": 4, "14": 2}
ADMIRAL {"34": 12, "12": 2, "15": 9, "16": 71, "17": 104, "18": 37, "19": 23}
AMBASSADOR  {"2": 15, "3": 10, "5": 37, "8": 5, "41": 10, "10": 2, "16": 6, "17": 6, "50": 4, "20": 5, "22": 17, "40": 10, "25": 14}
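For context, a rough sketch of how such an index could be generated server-side in Python (the transcript file naming, e.g. `16.txt` in a `transcripts/` directory, is an assumption for illustration):

```python
import json
import re
from collections import Counter, defaultdict
from pathlib import Path

def build_index(transcript_dir):
    """Build {WORD: {file_id: count}} from plain-text transcript files."""
    index = defaultdict(dict)
    for path in sorted(Path(transcript_dir).glob("*.txt")):
        file_id = path.stem  # e.g. "16" for 16.txt
        words = re.findall(r"[A-Za-z']+", path.read_text(encoding="utf-8"))
        for word, count in Counter(w.upper() for w in words).items():
            index[word][file_id] = count
    return index

if __name__ == "__main__":
    index = build_index("transcripts")  # directory name is a placeholder
    for word in sorted(index):
        print(word, json.dumps(index[word]))
```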

I have this reduced to a trie structure in its final form to save space and speed up retrieval, but even so, the 18K words come to about 5MB of data once the file locations are included, even with stop words removed. And no one is reasonably going to search for out-of-context adjectives or subordinating conjunctions.
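For illustration, a nested-dict sketch of the kind of trie I mean (the `"$"` end-of-word key is an arbitrary choice, not anything standard):

```python
def index_to_trie(index):
    """Fold {WORD: postings} into a nested-dict trie; postings sit under "$"."""
    trie = {}
    for word, postings in index.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = postings  # arbitrary end-of-word marker
    return trie

def lookup(trie, query):
    """Return the postings dict for query, or None if it is absent."""
    node = trie
    for ch in query.upper():
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$")
```

Serialized as JSON, shared prefixes collapse into shared nodes, which is where the size savings come from.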

I realize this is as much a language question as a coding question, but I'm wondering if there is a common solution in NLP for reducing a text to the words that are meaningful out of context.

I tried running each word through the Python NLTK POS tagger, but there's a high error rate when the words stand by themselves, as one would expect.
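For the curious, this is roughly what I mean by tagging words in isolation (assuming the standard NLTK tokenizer/tagger models are already downloaded); with a one-word "sentence" the tagger has no context to disambiguate with:

```python
import nltk  # assumes the default POS tagger model has been downloaded

# pos_tag expects a tokenized sentence; a single-word list gives it
# no context, so ambiguous words get one arbitrary tag.
for word in ["admiral", "returned", "board", "while"]:
    print(word, nltk.pos_tag([word]))
```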


2 Answers


NLP is my field, but unfortunately there is only one way to do this reliably: first POS-tag each sentence of the transcripts, then extract statistics over the (word, pos-tag) tuples. That way you can distinguish instances of, say, "returned" as an adjective from cases where the word is used as a verb. Finally, decide what to keep and what to discard (for example, keep only nouns and verbs and drop everything else).
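A rough sketch of that pipeline with NLTK (the tag whitelist and the "most frequent tag wins" rule are illustrative choices of mine, and the tokenizer/tagger models are assumed to be downloaded):

```python
from collections import Counter

import nltk

# Penn Treebank noun and verb tags; adjust to taste.
KEEP = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def content_words(text):
    """POS-tag full sentences, then keep words whose dominant tag is a noun or verb."""
    tag_counts = Counter()
    for sentence in nltk.sent_tokenize(text):
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if word.isalpha():
                tag_counts[(word.upper(), tag)] += 1

    # For each word, pick the tag it received most often across the corpus.
    best = {}
    for (word, tag), count in tag_counts.items():
        if count > best.get(word, (0, ""))[0]:
            best[word] = (count, tag)

    return {word for word, (count, tag) in best.items() if tag in KEEP}
```

Running this over the transcripts before building the index should strip most of the words that are only meaningful in context.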

answered 2013-07-17 at 16:50:48