c++ - 有限状態マシンの実行に時間がかかるのはなぜですか？

Question

次の形式の関数呼び出しを抽出することになっているステートマシンで作業しています

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

抽出されたデータはpref("this.is.a.string.which\"can have QUOTES\"", 123456); ファイルからのものです。現在、41kbのファイルを処理するには、このプロセスに1分半近くかかります。この有限状態マシンについて、私がここで真剣に誤解していることはありますか？

#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, L'\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

ビリー3

編集：まあ、これは私が今まで見た中で最も奇妙なものの1つです。プロジェクトを再構築したところ、かなり正常に実行されています。奇数。

score 3 · Accepted Answer

41 kbのファイルのほとんどがコメントまたは設定でない限り、ほとんどの時間は状態0で費やされます。また、状態0の各文字について、少なくとも2回の関数呼び出しを行います。

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

現在の文字が「p」であるかどうかを事前にテストすることで、これを高速化できます。

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

文字が「p」でない場合は、関数呼び出しを行う必要はありません。特に、おそらく時間が費やされている場所であるイテレータを作成しないでください。

score 1 · Accepted Answer

これが問題の一部であるかどうかはわかりませんが、ケース0でタイプミスがあり、「perf」のつづりが「pref」と間違っています。

score 1 · Accepted Answer

見ただけではわかりにくいですが、検索アルゴリズムがやっていると思います。なぜFSM内を検索しているのですか？定義上、一度に1文字ずつ与えることになっています。さらに状態を追加します。また、結果をベクトルではなくリストにしてみてください。多くのコピーが進行中

vector<string>

しかし、ほとんどの場合：プロファイルを作成してください。

score 0 · Accepted Answer

解析を行っている場合は、パーサーライブラリを使用しないでください。

私は通常、Boost.Spirit.Qiを念頭に置いています。

文法を EBNF のような表現で表現すると、保守が確実に容易になります。
これはヘッダーのみのライブラリであるため、バイナリ全体を混在させても問題はありません。

最小限のアプローチは評価できますが、残念ながら、有限状態マシンを独自にコーディングするというあなたの考えは賢明ではありません。これはおもちゃの例ではうまく機能しますが、要件が増えると巨大なswitchものになり、何が起こっているのかを理解するのがますます複雑になります。

そして、進化しないことを知っているなんて言わないでください: 私は神託を信じていません ;)

score 0 · Accepted Answer

有限状態マシンは実行可能なソリューションですが、テキスト処理の場合は、高度に最適化された有限状態マシンジェネレーターを使用するのが最善です。この場合、正規表現です。これは Perl の正規表現です。

# first clean the comments
$source =~ s|//.*$||;      # replace "// till end of line" with nothing
$source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { 
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

ほとんどの正規表現ライブラリは Perl と互換性があるため、構文を翻訳するのは難しくありません。検索がより複雑になる場合は、パーサーが適しています。

c++ - 有限状態マシンの実行に時間がかかるのはなぜですか？

5 に答える 5

Related

Reference