matlab - アルファベット順に並べられたリスト内の文字列マッチング (MATLAB の方法)

Question

アルファベット順に並べられた文字列の大きなセル配列 (〜 495,000) があり、多くの重複があります (アルファベット順であるため、互いに隣り合っています)。

与えられたルックアップ文字列について、渡したものと一致するリスト内のすべての文字列を見つける必要があります。

私はこれstrcmp(lookUpString,list)を行うために使用してきましたが、これは非常に遅いです.リスト内の各値を調べて比較していると思います.アルファベット順にソートされていることを認識していないからです.

while ループを記述してリストを反復処理し、必要な文字列のブロックが見つかるまで各文字列を比較するstrcmp(そして停止する) ことができますが、これを行う "matlab" 方法があるかどうか疑問に思っていました (つまり、論理を実行します)。ソートされた配列での比較演算)。

ご協力いただきありがとうございます！

score 4 · Accepted Answer

更新：以前の「方法3」に満足できなかったので、パフォーマンスを向上させるために少し再調整しました。今では、ナイーブのほぼ10倍の速度で実行されますstrcmp。

strcmp私のマシンで勝ちます（Linux Mint 12では2011b）。特に、よりもはるかにうまく機能しますismember。ただし、手動で事前に並べ替えを行うと、速度が少し向上します。次の速度テストを検討してください。

NumIter = 100;
N = 495000;
K = N / 20;
List = cell(N, 1);
for i = 1:20
    List(i*K - K + 1:i*K) = cellstr(char(i+96));
end

StrToFind = cell(NumIter, 1);
for j = 1:NumIter
    StrToFind{j} = char(round(rand * 20) + 96);
end

%# METHOD 1 (ismember)
tic
for j = 1:NumIter
    Index1 = ismember(List, StrToFind{j});
    Soln1 = List(Index1);
end
toc

%#METHOD 2 (strcmp)
tic
for j = 1:NumIter
    Index2 = strcmp(StrToFind{j}, List);
    Soln2 = List(Index2);
end
toc

%#METHOD 3 (strmp WITH MANUAL PRE-SORTING)    
tic
for j = 1:NumIter
    CurStrToFind = StrToFind{j};
    K = 100;
    I1 = zeros(K, 2); I1(1, :) = ones(1, 2);
    I2 = zeros(K, 2); I2(end, 1) = 1; I2(end, 2) = N;
    KDiv = floor(N/K);
    for k = 2:K-1
        CurSearchNum = k * KDiv;
        CurListItem = List{CurSearchNum};
        if CurListItem < CurStrToFind; I1(k, 1) = 1; end;
        if CurListItem > CurStrToFind; I2(k, 1) = 1; end;
        I1(k, 2) = CurSearchNum; I2(k, 2) = CurSearchNum;
    end
    a = find(I1(:, 1), 1, 'last');
    b = find(I2(:, 1), 1, 'first');
    ShortList = List(I1(a, 2):I2(b, 2));
    Index3 = strcmp(CurStrToFind, ShortList);
    Soln3 = ShortList(Index3);
end
toc

出力は次のとおりです。

Elapsed time is 6.411537 seconds.
Elapsed time is 1.396239 seconds.
Elapsed time is 0.150143 seconds.

score 1 · Accepted Answer

1

ismemberはあなたの友達です。線形検索の代わりに、二分検索を行います。

于 2012-10-03T00:32:03.780 に答える

score 0 · Accepted Answer

二分探索を試してみてください。

ほぼ 13(!) 倍高速です。

Elapsed time is 7.828260 seconds.    % ismember
Elapsed time is 0.775260 seconds.    % strcmp
Elapsed time is 0.113533 seconds.    % strmp WITH MANUAL PRE-SORTING
Elapsed time is 0.008243 seconds.    % binsearch

これは私が使用しているビン検索コードです:

function ind = binSearch(key, cellstr)
    % BINSEARCH  that find index i such that cellstr(i)<= key <= cellstr(i+1)
    %
    % * Synopsis: ind = binSearch(key, cellstr)
    % * Input   : key = what to search for
    %           : cellstr = sorted cell-array of string (others might work, check strlexcmp())
    % * Output  : ind = index in x cellstr such that cellstr(i)<= key <= cellstr(i+1)
    % * Depends : strlexcmp() from Peter John Acklam’s string-utilities, 
    %             at: http://home.online.no/~pjacklam/matlab/software/util/strutil/
    %
    % Transcoded from a Java version at: http://googleresearch.blogspot.it/2006/06/extra-extra-read-all-about-it-nearly.html
    % ankostis, Aug 2013

    low = 1;
    high = numel(cellstr);

    while (low <= high)
        ind = fix((low + high) / 2);
        val = cellstr{ind};

        d = strlexcmp(val, key);
        if (d < 0)
            low = ind + 1;
        elseif (d > 0)
            high = ind - 1;
        else
            return;     %% Key found.
        end
    end
    ind = -(low);       %% Not found!
end

strlexcmp()Peter John Acklam の文字列ユーティリティ ( http://home.online.no/~pjacklam/matlab/software/util/strutil/ ) から入手できます。

最後に、これは私が使用したテストスクリプトです。

%#METHOD 4 (WITH BIN-SEARCH)    
tic
for j = 1:NumIter
    Index1 = binsearch(StrToFind{j}, List);
    Soln4 = List(Index1);
end

長い文字列では結果が異なる場合があることに注意してください。

matlab - アルファベット順に並べられたリスト内の文字列マッチング (MATLAB の方法)

3 に答える 3

Related

Reference