java - Parsing a lot of text based on a constant set of search terms

Question

I have a set of search terms like [+dog -"jack russels" +"fox terrier"], [+cat +persian -tabby]. These could be quite long with maybe 30 sub-terms making up each term.

I now have some online news articles extracts such as ["My fox terrier is the cutest dog in the world..."] and ["Has anyone seen my lost persian cat? He went missing ..."]. They're not too long, perhaps 500 characters at most each.

In traditional search engines one expects a huge amount of articles that are pre-processed into indexes, allowing for speed-ups when searching given 'search terms', using set theory/boolean logic to reduce articles to only ones that match the phrases. In this situation, however, the order of my search terms is ~10^5, and I'd like to be able to process a single article at a time, to see ALL the sets of search terms that article would be matched with (i.e. all the + terms are in the text and none of the - terms).

I have a possible solution using two maps (one for the positive sub-phrases, one for the negative sub-phrases), but I don't think it'll be very efficient.

First prize would be a library that solves this problem, second prize is a push in the right direction towards solving this.

Kind regards,

score 1 · Accepted Answer

一致にはすべての正のサブ用語が必要であると仮定します。

検索用語のすべてのサブ用語をハッシュテーブルに入れます。サブ用語がキーで、値は完全な検索用語データ構造へのポインタです (一意の ID とブール値へのサブ用語のマップを含む必要があります)。

さらに、ニュース項目を処理するときは、用語 ID でインデックス付けされた「候補」マップを作成します。各候補構造には、用語定義へのポインタ、表示されたサブ用語を含むセット、および「拒否」フラグがあります。

ニュース記事の言葉を繰り返します。

ヒットごとに、候補エントリを検索します。そこにない場合は、空のものを作成して追加します。

候補拒否フラグが設定されている場合は、完了です。

それ以外の場合は、用語データ構造からサブ用語を検索します。負の場合は、拒否フラグを設定します。正の場合、サブ用語を表示されたサブ用語のセットに追加します。

最後に、候補を反復処理します。拒否されず、見られたセットのサイズがその用語の正のサブ用語の数に等しいすべての候補がヒットです。

実装: https://docs.google.com/document/d/1boieLJboLTy7X2NH1Grybik4ERTpDtFVggjZeEDQH74/edit

実行時間は O(n * m) です。ここで、n は記事内の単語数、m は同じサブ用語を共有する用語の最大数です (比較的小さいと予想されます)。

score 0 · Accepted Answer

まず、ドキュメントのサフィックスツリーを作成すると、一度構築する必要があるため、検索がはるかに高速になると思いますが、クエリの長さだけ何度でも使用できます。

次に、すべての検索用語 (+ と - の両方) を繰り返して、答えが「はい」(ドキュメントがクエリに一致する) かどうかを確認する必要があります。ただし、「いいえ」の答えは必要ありません。答えが「いいえ」の場合、ドキュメントに対して検索用語を照合する順序が非常に重要になります。つまり、ある注文は別の注文よりも早く「いいえ」になる場合があります。ここでの質問は、「高速 NO を取得する最適な順序は?」です。アプリケーションによって異なりますが、「赤い大きな猫」などの複数単語の用語は、「猫」などの短い用語に比べてドキュメント内で繰り返されることが少なく、その逆も同様です。したがって、最初に +"Loo ooo ooo ooo ooo ong" と -"short" 用語を使用します。

java - Parsing a lot of text based on a constant set of search terms

2 に答える 2

Related

Reference