dtsearch - How do I get dtSearch to highlight one hit per phrase, rather than one hit per word in the phrase?

Question

I'm using dtSearch to highlight text search matches within a document. The code to do this, minus some details and cleanup, is roughly along these lines:

SearchJob sj = new SearchJob();
sj.Request = "\"audit trail\""; // the user query
sj.FoldersToSearch.Add(path_to_src_document);
sj.Execute();
FileConverter fileConverter = new FileConverter();
fileConverter.SetInputItem(sj.Results, 0);
fileConverter.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>";
fileConverter.AfterHit = "</b>";
fileConverter.Execute();
string myHighlightedDoc = fileConverter.OutputString;

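If I give dtSearch a quoted phrase query like this:

"audit trail"

then dtSearch will do hit highlighting like this:

An <a name="HH_0"/>audit <a name="HH_1"/>trail is a fun thing to have an <a name="HH_2"/>audit <a name="HH_last"/>trail about!

Note that each word of the phrase is highlighted separately. Instead, I would like phrases to be highlighted as whole units, like this:

An <a name="HH_0"/>audit trail is a fun thing to have an <a name="HH_last"/>audit trail about!

This would A) make the highlighting look better, B) improve the behavior of my JavaScript that helps users navigate from hit to hit, and C) give more accurate counts of total hits.

Is there a good way to make dtSearch highlight phrases this way?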

score 2 · Accepted Answer

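Note: I think the text and code here could use a bit more work. If people want to help revise the answer or the code, this could probably become a community wiki.

I asked dtSearch about this (April 26, 2010). Their response had two parts.

First, it is not possible to get the desired highlighting behavior just by, say, changing a flag.

Second, it is possible to get lower-level hit information in which phrase matches are treated as wholes. In particular, if you set both the dtsSearchWantHitsByWord and the dtsSearchWantHitsArray flags on your SearchJob, the search results will be annotated with the word offsets at which each word or phrase in the query matched. For example, if your input document is

An audit trail is a fun thing to have an audit trail about!

and your query is

"audit trail"

then (in the .NET API), sj.Results.CurrentItem.HitsByWord[0] will contain a string like this:

audit trail (2 11 )

indicating that the phrase "audit trail" was found starting at the 2nd word and at the 11th word of the document.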
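To make the format concrete, here is a minimal sketch that splits one such entry into the phrase text and its list of word offsets. The class and method names (HitsByWordSketch, Parse) are hypothetical, and the entry format is an assumption based only on the samples shown in this answer: the form "audit trail (2 11 )" above, and the form with a count such as "apple pie, 7 (482 499 )" that appears in the unit tests further down.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HitsByWordSketch
{
    // Splits an entry such as "audit trail (2 11 )" into the phrase text and
    // the word offsets at which the phrase starts. An optional ", <count>"
    // part before the parentheses is tolerated. The entry format is assumed
    // from the samples in this answer, not taken from the dtSearch docs.
    public static KeyValuePair<string, List<int>> Parse(string entry)
    {
        Match m = Regex.Match(entry, @"^([^,(]*?)\s*(?:,\s*\d*\s*)?\(([^)]*)\)");
        if (!m.Success)
            throw new ArgumentException("Unrecognized HitsByWord entry: " + entry);

        // Every run of digits inside the parentheses is one word offset.
        List<int> offsets = new List<int>();
        foreach (Match d in Regex.Matches(m.Groups[2].Value, @"\d+"))
            offsets.Add(int.Parse(d.Value));

        return new KeyValuePair<string, List<int>>(m.Groups[1].Value.Trim(), offsets);
    }
}
```

Calling HitsByWordSketch.Parse("audit trail (2 11 )") yields the phrase "audit trail" with the offsets 2 and 11.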

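One thing you can do with this information is build a "skip list" indicating which of the dtSearch highlights are insignificant (that is, which ones are phrase continuations rather than the start of a word or phrase). For example, if your skip list were [4, 7, 9], that might mean that the 4th, 7th, and 9th hits were insignificant, whereas the other hits were legitimate. A "skip list" of this sort can be used in at least two ways:

1. You can change the code that navigates from hit to hit so that it skips hit number i whenever skipList.contains(i).
2. Depending on your requirements, you might also be able to rewrite the HTML generated by the dtSearch FileConverter. In my case, I have dtSearch annotate hits with something like <a name="HH_1"/>hitword, and I use the A tags (and the fact that they are numbered sequentially: HH_1, HH_2, HH_3, etc.) as the basis for hit navigation. So what I've tried, with some success, is to walk the HTML and strip out all the A tags where the i in HH_i is contained in the skip list. Depending on your hit navigation code, you will probably need to renumber the A tags so that there are no gaps between, say, HH_1 and HH_3.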
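Both uses can be sketched in a few lines. This is only an illustration, not the code I actually run: the names HitAnchorSketch, StripSkippedAnchors, and NextHit are hypothetical, the anchors are assumed to look exactly like the <a name="HH_..."/> markup in the question (including the HH_last form), and skip list entries are assumed to be 0-based positions of the anchors in document order.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HitAnchorSketch
{
    // Matches anchors produced by a BeforeHit template such as
    // <a name="HH_%%ThisHit%%"/>, including the "HH_last" form.
    private static readonly Regex AnchorPattern =
        new Regex("<a name=\"HH_(?:\\d+|last)\"\\s*/>");

    // Removes the anchors whose 0-based position (in document order) is in
    // skipList and renumbers the remaining anchors consecutively from 0.
    // Any <b>...</b> highlighting around the hit words is left untouched.
    public static string StripSkippedAnchors(string html, ICollection<int> skipList)
    {
        int seen = 0; // anchors encountered so far (word-by-word count)
        int kept = 0; // anchors kept so far (phrase-by-phrase count)
        return AnchorPattern.Replace(html, m =>
        {
            int hitNum = seen++;
            if (skipList.Contains(hitNum))
                return string.Empty;
            return "<a name=\"HH_" + (kept++) + "\"/>";
        });
    }

    // Returns the next hit number after current that is not in skipList,
    // or -1 if there is none. For use when the HTML is left as dtSearch wrote it.
    public static int NextHit(int current, int totalHits, ICollection<int> skipList)
    {
        for (int i = current + 1; i < totalHits; i++)
        {
            if (!skipList.Contains(i))
                return i;
        }
        return -1;
    }
}
```

With the four-anchor markup from the question and a skip list of [1, 3], StripSkippedAnchors leaves two anchors, HH_0 and HH_1, one in front of each "audit trail". Note that this sketch renumbers every anchor numerically, so the final anchor is no longer called HH_last; if your navigation code relies on that name, you would need to restore it.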

Assuming these "skip lists" are indeed useful, how would you generate them? Well, here is some code that mostly works:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using NUnit.Framework;

public class DtSearchUtil
{
    /// <summary>
    /// Makes a "skip list" for the dtSearch result document with the specified
    /// WordArray data. The skip list indicates which hits in the dtSearch markup
    /// should be skipped during hit navigation. The reason to skip some hits
    /// is to allow navigation to be phrase aware, rather than forcing the user
    /// to visit each word in the phrase as if it were an independent hit.
    /// The skip list consists of 0-indexed hit numbers, counted word by word
    /// (see the unit tests below). 1, for example, would mean that the second
    /// hit should be skipped during hit navigation.
    /// </summary>
    /// <param name="dtsHitsByWordArray">dtSearch HitsByWord data. You'll get this from SearchResultItem.HitsByWord
    /// if you did your search with the dtsSearchWantHitsByWord and dtsSearchWantHitsArray
    /// SearchFlags.</param>
    /// <param name="userHitCount">How many total hits there are, if phrases are counted
    /// as one hit each.</param>
    /// <returns>The skip list, in ascending order.</returns>
    public static List<int> MakeHitSkipList(string[] dtsHitsByWordArray, out int userHitCount)
    {
        List<int> skipList = new List<int>();
        userHitCount = 0;

        int curHitNum = 0; // like the dtSearch doc-level highlights, this counts hits word-by-word, rather than phrase by phrase
        List<PhraseRecord> hitRecords = new List<PhraseRecord>();
        foreach (string dtsHitsByWordString in dtsHitsByWordArray)
        {
            hitRecords.Add(PhraseRecord.ParseHitsByWordString(dtsHitsByWordString));
        }
        int prevEndOffset = -1;

        while (true)
        {
            int nextOffset = int.MaxValue;
            foreach (PhraseRecord rec in hitRecords)
            {
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;

                nextOffset = Math.Min(nextOffset, rec.OffsetList[rec.CurOffset]);
            }
            if (nextOffset == int.MaxValue)
                break;

            userHitCount++;

            PhraseRecord longestMatch = null;
            for (int i = 0; i < hitRecords.Count; i++)
            {
                PhraseRecord rec = hitRecords[i];
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;
                if (nextOffset == rec.OffsetList[rec.CurOffset])
                {
                    if (longestMatch == null ||
                        longestMatch.LengthInWords < rec.LengthInWords)
                    {
                        longestMatch = rec;
                    }
                }
            }

            // skip subsequent words in the phrase
            for (int i = 1; i < longestMatch.LengthInWords; i++)
            {
                skipList.Add(curHitNum + i);
            }

            prevEndOffset = longestMatch.OffsetList[longestMatch.CurOffset] +
                (longestMatch.LengthInWords - 1);

            longestMatch.CurOffset++;

            curHitNum += longestMatch.LengthInWords;

            // skip over any unneeded, overlapping matches (i.e. at the same offset)
            for (int i = 0; i < hitRecords.Count; i++)
            {
                while (hitRecords[i].CurOffset < hitRecords[i].OffsetList.Count &&
                    hitRecords[i].OffsetList[hitRecords[i].CurOffset] <= prevEndOffset)
                {
                    hitRecords[i].CurOffset++;
                }
            }
        }

        return skipList;
    }

    // Parsed form of the phrase-aware hit offset stuff that dtSearch can give you 
    private class PhraseRecord
    {
        public string PhraseText;

        /// <summary>
        /// Offsets into the source text at which this phrase matches. For example,
        /// offset 300 would mean that one of the places the phrase matches is
        /// starting at the 300th word in the document. (Words are counted according
        /// to dtSearch's internal word breaking algorithm.)
        /// See also:
        /// http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html?frmname=topic&frmfile=dtSearch__Engine__SearchFlags.html
        /// </summary>
        public List<int> OffsetList;

        // BUG: We calculate this with a whitespace tokenizer. This will probably
        // cause bad results in some places. (Better to figure out how to count
        // the way dtSearch would.)
        public int LengthInWords
        {
            get
            {
                return Regex.Matches(PhraseText, @"[^\s]+").Count;
            }
        }

        public int CurOffset = 0;

        public static PhraseRecord ParseHitsByWordString(string dtsHitsByWordString)
        {
            Match m = Regex.Match(dtsHitsByWordString, @"^([^,]*),\s*\d*\s*\(([^)]*)\).*");
            if (!m.Success)
                throw new ArgumentException("Bad dtsHitsByWordString. Did you forget to set the dtsSearchWantHitsByWord flag in dtSearch?");

            string phraseText = m.Groups[1].Value;
            string parenStuff = m.Groups[2].Value;

            PhraseRecord hitRecord = new PhraseRecord();
            hitRecord.PhraseText = phraseText;
            hitRecord.OffsetList = GetMatchOffsetsFromParenGroupString(parenStuff);
            return hitRecord;
        }

        static List<int> GetMatchOffsetsFromParenGroupString(string parenGroupString)
        {
            List<int> res = new List<int>();
            MatchCollection matchCollection = Regex.Matches(parenGroupString, @"\d+");
            foreach (Match match in matchCollection)
            {
                string digitString = match.Groups[0].Value;
                res.Add(int.Parse(digitString));
            }
            return res;
        }
    }
}


[TestFixture]
public class DtSearchUtilTests
{
    [Test]
    public void TestMultiPhrasesWithoutFieldName()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 );",
            @"bana*, 4 (490 505 689 713 )"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: banana-something@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: banana-something@505
        // 6: apple@552
        // 7: pie@553 [should skip]
        // 8: apple@578
        // 9: pie@579 [should skip]
        // 10: apple@589
        // 11: pie@590 [should skip]
        // 12: apple@683
        // 13: pie@684 [skip]
        // 14: banana-something@689
        // 15: apple@706
        // 16: pie@707 [skip]
        // 17: banana-something@713

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(11, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(7, skipList[2]);
        Assert.AreEqual(9, skipList[3]);
        Assert.AreEqual(11, skipList[4]);
        Assert.AreEqual(13, skipList[5]);
        Assert.AreEqual(16, skipList[6]);
        Assert.AreEqual(7, skipList.Count);
    }

    [Test]
    public void TestPhraseOverlap1()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 );",
            @"apple, 4 (482 490 499 552)"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: apple@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    }

    [Test]
    public void TestPhraseOverlap2()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 );",
            @"pie, 4 (483 490 500 553)"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: pie@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    }

    // TODO: test "apple pie" and "apple", plus "apple pie" and "pie"

    // "subject" should not freak it out
    [Test]
    public void TestSinglePhraseWithFieldName()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 ), subject" };

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(7, userHitCount);

        Assert.AreEqual(7, skipList.Count);
        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(3, skipList[1]);
        Assert.AreEqual(5, skipList[2]);
        Assert.AreEqual(7, skipList[3]);
        Assert.AreEqual(9, skipList[4]);
        Assert.AreEqual(11, skipList[5]);
        Assert.AreEqual(13, skipList[6]);
    }
}
