c++ - c++11の正規表現はPythonよりも遅い

Question

こんにちは私は、正規表現を使用して分割文字列分割を行う次のコードがなぜあるのかを理解したいと思います

#include<regex>
#include<vector>
#include<string>

std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +");
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();
    auto res = std::vector<std::string>(rit, rend);
    return res;
}

int main(){
    for(auto i=0; i< 10000; ++i)
       split("a b c", " ");
    return 0;
}

次のPythonコードよりも遅い

import re
for i in range(10000):
    re.split(' +', 'a b c')

これが

> python test.py  0.05s user 0.01s system 94% cpu 0.070 total
./test  0.26s user 0.00s system 99% cpu 0.296 total

osxでclang++を使用しています。

-O3でコンパイルすると、0.09s user 0.00s system 99% cpu 0.109 total

score 112 · Accepted Answer

知らせ

この回答も参照してください: https://stackoverflow.com/a/21708215は、ここの下部にあるEDIT 2のベースでした。

より良いタイミング測定を得るために、ループを 1000000 に増やしました。

これは私のPythonのタイミングです:

real    0m2.038s
user    0m2.009s
sys     0m0.024s

これはあなたのコードに相当するものですが、少しきれいです:

#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s, const std::regex &r)
{
    return {
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator()
    };
}

int main()
{
    const std::regex r(" +");
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r);
    return 0;
}

タイミング：

real    0m5.786s
user    0m5.779s
sys     0m0.005s

これは、ベクターオブジェクトと文字列オブジェクトの構築/割り当てを回避するための最適化です。

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
    auto rend = std::sregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);
    return 0;
}

タイミング：

real    0m3.034s
user    0m3.029s
sys     0m0.004s

これにより、ほぼ 100% のパフォーマンスが向上します。

ベクトルはループの前に作成され、最初の繰り返しでメモリを増やすことができます。その後、によるメモリの割り当て解除はありません。ベクトルはメモリを維持し、文字列をその場でclear()構築します。

別のパフォーマンスの向上は、構築/破棄をstd::string完全に回避することであり、したがって、そのオブジェクトの割り当て/割り当て解除を回避します。

これは、この方向の暫定的なものです。

#include <regex>
#include <vector>
#include <string>

void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

タイミング：

real    0m2.509s
user    0m2.503s
sys     0m0.004s

最終的な改善は、各 char ポインターが元のc 文字列自体の内部の部分文字列を指すように、 std::vectorofを返すことです。問題は、それらのそれぞれが null で終了しないため、それを行うことができないことです (これについては、後のサンプルで C++1y の使用法を参照してください)。const char *s string_ref

この最後の改善は、次の方法でも実現できます。

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v); // the constant string("a b c") should be optimized
                             // by the compiler. I got the same performance as
                             // if it was an object outside the loop
    return 0;
}

-O3 を指定して (trunk から) clang 3.3 でサンプルをビルドしました。他の正規表現ライブラリの方がパフォーマンスが向上する可能性がありますが、いずれにせよ、割り当て/割り当て解除はパフォーマンスヒットになることがよくあります。

Boost.Regex

これは、 C 文字列引数のサンプルのboost::regexタイミングです。

real    0m1.284s
user    0m1.278s
sys     0m0.005s

このサンプルの同じコードboost::regexとstd::regexインターフェイスは同じで、名前空間とインクルードを変更するだけで済みます。

時間が経つにつれて改善されることを願っています。C++ stdlib 正規表現の実装はまだ初期段階にあります。

編集

完成させるために、私はこれを試しました（上記の「究極の改善」の提案）が、同等のstd::vector<std::string> &vバージョンのパフォーマンスをまったく改善しませんでした：

#include <regex>
#include <vector>
#include <string>

template<typename Iterator> class intrusive_substring
{
private:
    Iterator begin_, end_;

public:
    intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}

    Iterator begin() {return begin_;}
    Iterator end() {return end_;}
};

using intrusive_char_substring = intrusive_substring<const char *>;

void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear(); // This can potentially be optimized away by the compiler because
               // the intrusive_char_substring destructor does nothing, so
               // resetting the internal size is the only thing to be done.
               // Formerly allocated memory is maintained.
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->second);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<intrusive_char_substring> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);

    return 0;
}

これは、array_ref および string_ref の提案と関係があります。これを使用したサンプルコードは次のとおりです。

#include <regex>
#include <vector>
#include <string>
#include <string_ref>

void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->length());
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string_ref> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);

    return 0;
}

また、 with vector returnの場合、コピーではstring_refなくvector を返す方が安くなります。stringsplit

編集2

この新しいソリューションは、リターンによって出力を取得できます。https://github.com/mclow/string_viewにあるMarshall Clow のstring_view(string_ref名前が変更された) libc++ 実装を使用しました。

#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>

using namespace std;
using namespace std::experimental;
using namespace boost;

string_view stringfier(const cregex_token_iterator::value_type &match) {
    return {match.first, static_cast<size_t>(match.length())};
}

using string_view_iterator =
    transform_iterator<decltype(&stringfier), cregex_token_iterator>;

iterator_range<string_view_iterator> split(string_view s, const regex &r) {
    return {
        string_view_iterator(
            cregex_token_iterator(s.begin(), s.end(), r, -1),
            stringfier
        ),
        string_view_iterator()
    };
}

int main() {
    const regex r(" +");
    for (size_t i = 0; i < 1000000; ++i) {
        split("a b c", r);
    }
}

タイミング：

real    0m0.385s
user    0m0.385s
sys     0m0.000s

これが以前の結果と比較してどれほど速いかに注目してください。もちろん、ループ内で a を埋めるわけではありません(また、おそらく事前に何かを照合することもvectorありません) が、とにかく範囲を取得します。forvector

元の(またはnull で終了する string ) の createに及ぶため、これは非常に軽量になり、不要な文字列割り当てが生成されることはありませんiterator_range。string_viewstring

この実装を使用して比較するだけですsplitが、実際には次のように入力vectorできます。

int main() {
    const regex r(" +");
    vector<string_view> v;
    v.reserve(10);
    for (size_t i = 0; i < 1000000; ++i) {
        copy(split("a b c", r), back_inserter(v));
        v.clear();
    }
}

これはブーストレンジコピーアルゴリズムを使用して、各反復でベクトルを埋めます。タイミングは次のとおりです。

real    0m1.002s
user    0m0.997s
sys     0m0.004s

string_viewご覧のとおり、最適化された出力パラメーターバージョンと比較して大きな違いはありません。

このように機能する a の提案があるstd::splitことにも注意してください。

score 5 · Accepted Answer

最適化では、一般に、次の 2 つのことを回避する必要があります。

不要なもののために CPU サイクルを消費する
何かが起こるのをぼんやりと待っている (メモリの読み取り、ディスクの読み取り、ネットワークの読み取りなど)

すべてをメモリにキャッシュするよりも何かを計算する方が高速になる場合があるため、この2つは正反対になる可能性があります...したがって、バランスのゲームです。

コードを分析しましょう。

std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +"); // only computed once

    // search for first occurrence of rsplit
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);

    auto rend = std::sregex_token_iterator();

    // simultaneously:
    // - parses "s" from the second to the past the last occurrence
    // - allocates one `std::string` for each match... at least! (there may be a copy)
    // - allocates space in the `std::vector`, possibly multiple times
    auto res = std::vector<std::string>(rit, rend);

    return res;
}

もっとうまくやれるか？メモリの割り当てと割り当て解除を維持する代わりに、既存のストレージを再利用できれば、大幅な改善が見られるはずです [1]。

// Overwrites 'result' with the matches, returns the number of matches
// (note: 'result' is never shrunk, but may be grown as necessary)
size_t split(std::string const& s, std::vector<std::string>& result){
    static const std::regex rsplit(" +"); // only computed once

    auto rit = std::cregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::cregex_token_iterator();

    size_t pos = 0;

    // As long as possible, reuse the existing strings (in place)
    for (size_t max = result.size();
         rit != rend && pos != max;
         ++rit, ++pos)
    {
        result[pos].assign(rit->first, rit->second);
    }

    // When more matches than existing strings, extend capacity
    for (; rit != rend; ++rit, ++pos) {
        result.emplace_back(rit->first, rit->second);
    }

    return pos;
} // split

サブマッチの数が反復間で一定である、実行するテストでは、このバージョンが打ち負かされる可能性はほとんどありません。最初の実行時にのみメモリを割り当て (rsplitとの両方result)、その後は既存のメモリを再利用し続けます。

[1]: 免責事項、私はこのコードが正しいことを証明しただけで、テストはしていません (Donald Knuth が言うように)。

score 3 · Accepted Answer

このバージョンはどうですか？正規表現ではありませんが、分割をかなり速く解決します...

#include <vector>
#include <string>
#include <algorithm>

size_t split2(const std::string& s, std::vector<std::string>& result)
{
    size_t count = 0;
    result.clear();
    std::string::const_iterator p1 = s.cbegin();
    std::string::const_iterator p2 = p1;
    bool run = true;
    do
    {
        p2 = std::find(p1, s.cend(), ' ');
        result.push_back(std::string(p1, p2));
        ++count;

        if (p2 != s.cend())
        {
            p1 = std::find_if(p2, s.cend(), [](char c) -> bool
            {
                return c != ' ';
            });
        }
        else run = false;
    } while (run);
    return count;
}

int main()
{
    std::vector<std::string> v;
    std::string s = "a b c";
    for (auto i = 0; i < 100000; ++i)
        split2(s, v); 
    return 0;
}

$ time splittest.exe

実 0m0.132s ユーザー 0m0.000s システム 0m0.109s

score 0 · Accepted Answer

C++11 の正規表現は、perl よりも、おそらく python よりもはるかに遅いと言えます。

パフォーマンスを適切に測定するには、自明ではない式を使用してテストを実行するか、正規表現以外のすべてを測定することをお勧めします。

たとえば、C++11 と perl を比較するには

C++11 テストコード

  #include <iostream>
  #include <string>
  #include <regex>
  #include <chrono>

  int main ()
  {
     int veces = 10000;
     int count = 0;
     std::regex expres ("([^-]*)-([^-]*)-(\\d\\d\\d:\\d\\d)---(.*)");

     std::string text ("some-random-text-453:10--- etc etc blah");
     std::smatch macha;

     auto start = std::chrono::system_clock::now();

     for (int ii = 0;  ii < veces; ii ++)
        count += std::regex_search (text, macha, expres);

     auto milli = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();

     std::cout << count << "/" << veces << " matches " << milli << " ms --> " << (milli / (float) veces) << " ms per regex_search" << std::endl;
     return 0;
  }

私のコンピューターでgcc 4.9.3でコンパイルすると、出力が得られます

 10000/10000 matches 1704 ms --> 0.1704 ms per regex_search

perl テストコード

  use Time::HiRes qw/ time sleep /;

  my $veces = 1000000;
  my $conta = 0;
  my $phrase = "some-random-text-453:10--- etc etc blah";

  my $start = time;

  for (my $ii = 0; $ii < $veces; $ii++)
  {   
     if ($phrase =~ m/([^-]*)-([^-]*)-(\d\d\d:\d\d)---(.*)/)
     {
        $conta = $conta + 1;
     }
  }
  my $elapsed = (time - $start) * 1000.0;
  print $conta . "/" . $veces . " matches " . $elapsed . " ms --> " . ($elapsed / $veces) . " ms per regex\n";

コンピューターで perl v5.8.8 を再び使用する

  1000000/1000000 matches 2929.55303192139 ms --> 0.00292955303192139 ms per regex

したがって、このテストでは、C++11 / perl の比率は

   0.1704 / 0.002929 = 58.17 times slower than perl

実際のシナリオでは、比率が約 100 倍から 200 倍遅くなります。したがって、たとえば、100 万行の大きなファイルを解析するには、perl では約 1 秒かかりますが、正規表現を使用する C++11 プログラムではさらに数分 (!) かかる場合があります。

c++ - c++11の正規表現はPythonよりも遅い

4 に答える 4

知らせ

Boost.Regex

編集

編集2

Related

Reference