c# - What is the best way to select a text portion to cut based on keywords?

Question

When you search for something on Stack Overflow, the results show the excerpt of each question's body that best matches your search terms, with those terms highlighted.

I am wondering what the best way is to do this manually in C#, that is, without the help of a full-text search engine.

The main problem is how to select the best portion of the text quickly.

What I did so far is:

I record the index of every space in the text. This tells me where the words begin, so I can start my substring tests from those positions.

From each of those indexes, I take the next 300 characters and count how many occurrences of the keywords they contain.

I assume that the 300-character portion with the most occurrences is the best one, so I cut it from the original text.
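The steps above can be sketched roughly as follows. This is only an illustration of the approach as described, not a tuned implementation: the names `SnippetSelector`, `CountOccurrences`, and `BestWindow` are hypothetical, matching is plain case-insensitive substring search, and ties go to the earliest window.

```csharp
using System;
using System.Collections.Generic;

public static class SnippetSelector
{
    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    public static int CountOccurrences(string text, string keyword)
    {
        if (string.IsNullOrEmpty(keyword)) return 0;
        int count = 0, index = 0;
        while ((index = text.IndexOf(keyword, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += keyword.Length;
        }
        return count;
    }

    // Returns the window, starting at a word boundary, with the most keyword hits.
    public static string BestWindow(string text, IEnumerable<string> keywords, int windowLength = 300)
    {
        // Candidate starts: position 0 plus the character after each space.
        var starts = new List<int> { 0 };
        for (int i = 0; i < text.Length - 1; i++)
            if (text[i] == ' ') starts.Add(i + 1);

        int bestStart = 0, bestScore = -1;
        foreach (int start in starts)
        {
            int length = Math.Min(windowLength, text.Length - start);
            string window = text.Substring(start, length);
            int score = 0;
            foreach (string keyword in keywords)
                score += CountOccurrences(window, keyword);
            if (score > bestScore) // strict '>' keeps the earliest window on ties
            {
                bestScore = score;
                bestStart = start;
            }
        }
        return text.Substring(bestStart, Math.Min(windowLength, text.Length - bestStart));
    }
}
```

Each candidate window is rescanned from scratch here, so the cost grows with the number of word starts times the window length times the number of keywords.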

Is this a good approach? Is there a faster way? Is counting the number of occurrences the best way to find the most relevant portion?

score 1 · Accepted Answer

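With this approach you will often find that the best match has its keywords very close to the start or end of the excerpt, which means there is not much context around those keywords. I would add a condition that there must be n words on either side of any keyword near the start or end of the match.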
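One possible reading of that context condition is sketched below: a candidate window is accepted only if at least `minContextWords` words precede its first keyword hit and follow its last one. The names `ContextCheck` and `HasEnoughContext` are hypothetical, and the sketch assumes whole-word matching against a case-insensitive set.

```csharp
using System;
using System.Collections.Generic;

public static class ContextCheck
{
    // True when the window contains a keyword and has at least minContextWords
    // words before its first keyword hit and after its last keyword hit.
    public static bool HasEnoughContext(string window, ISet<string> keywords, int minContextWords)
    {
        string[] words = window.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int first = -1, last = -1;
        for (int i = 0; i < words.Length; i++)
        {
            // Trim surrounding punctuation so "fox," still matches "fox".
            string word = words[i].Trim(',', '.', ';', ':', '!', '?', '"', '(', ')');
            if (keywords.Contains(word))
            {
                if (first < 0) first = i;
                last = i;
            }
        }
        if (first < 0) return false; // no keyword in this window
        return first >= minContextWords && (words.Length - 1 - last) >= minContextWords;
    }
}
```

A keyword at the very beginning or end of the whole text can never satisfy this check, so a real implementation would probably relax the rule at the document edges.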

You could consider breaking the excerpt at more convenient places, such as punctuation marks or joining words, instead of at spaces.
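Breaking at punctuation might look like the sketch below, which nudges a tentative start or end index to a nearby punctuation mark when one lies within `maxShift` characters. `BoundarySnapper`, `SnapStart`, `SnapEnd`, and the particular set of break characters are assumptions made for illustration; joining words are not handled.

```csharp
using System;

public static class BoundarySnapper
{
    private static readonly char[] Breaks = { '.', '!', '?', ';', ':', ',' };

    // Moves a tentative start back to just after the nearest punctuation mark,
    // looking at most maxShift characters behind it; otherwise returns it unchanged.
    public static int SnapStart(string text, int start, int maxShift)
    {
        start = Math.Min(start, text.Length);
        if (start <= 0) return 0;
        int count = Math.Min(maxShift, start);
        int hit = text.LastIndexOfAny(Breaks, start - 1, count);
        if (hit < 0) return start;
        int snapped = hit + 1;
        // Skip the spaces that follow the punctuation mark.
        while (snapped < text.Length && text[snapped] == ' ') snapped++;
        return snapped;
    }

    // Moves a tentative (exclusive) end forward to just after the next punctuation
    // mark, looking at most maxShift characters ahead; otherwise returns it unchanged.
    public static int SnapEnd(string text, int end, int maxShift)
    {
        if (end >= text.Length) return text.Length;
        int count = Math.Min(maxShift, text.Length - end);
        int hit = text.IndexOfAny(Breaks, end, count);
        return hit < 0 ? end : hit + 1;
    }
}
```

The snapped indexes can then be passed to `Substring(start, end - start)` to cut an excerpt that begins and ends at a natural pause.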

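You could also look at term frequency-inverse document frequency (TF-IDF), rather than just the raw frequency of the keywords, so that different keywords carry different weights.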
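A minimal sketch of that weighting, assuming the full set of searchable texts is available as `documents`: each keyword gets an inverse document frequency, and a window's score is the sum of keyword counts multiplied by those weights. The smoothed formula `ln((1 + N) / (1 + df)) + 1` is just one common variant chosen here, and `KeywordWeights`, `InverseDocumentFrequency`, and `Score` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordWeights
{
    // Computes a smoothed IDF per keyword: ln((1 + N) / (1 + df)) + 1,
    // where N is the number of documents and df the number containing the keyword.
    public static Dictionary<string, double> InverseDocumentFrequency(
        IReadOnlyList<string> documents, IEnumerable<string> keywords)
    {
        var idf = new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);
        foreach (string keyword in keywords)
        {
            int df = 0;
            foreach (string document in documents)
                if (document.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0) df++;
            idf[keyword] = Math.Log((1.0 + documents.Count) / (1.0 + df)) + 1.0;
        }
        return idf;
    }

    // Scores a window as the sum over keywords of (occurrences in window) * idf.
    // Matching is by substring, so "the" also matches inside "other".
    public static double Score(string window, IReadOnlyDictionary<string, double> idf)
    {
        double score = 0;
        foreach (var pair in idf)
        {
            if (pair.Key.Length == 0) continue; // avoid an endless loop on empty keys
            int count = 0, index = 0;
            while ((index = window.IndexOf(pair.Key, index, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                count++;
                index += pair.Key.Length;
            }
            score += count * pair.Value;
        }
        return score;
    }
}
```

With this scoring, a window containing one rare keyword can outrank a window containing several very common ones, which a plain occurrence count cannot express.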
