c++ - いくつかの入力ファイルに基づいて「ランダムな」テキスト生成で次の文字を予測する

Question

マルコフモデルに基づいてランダムなテキストを生成するプログラムを書いています。単語の間に多くのスペースがあるファイルで、最初のシードがスペースであるように見えるという問題が発生しています。問題は、次のすべての文字もスペースとして表示されるため、nextChosenCharは常にスペースであるため、生成されるランダムなテキストは単なる空白のドキュメントであるということです。

誰かがこの問題の解決策を提案できますか？

以下のコードの後半に見られるように、私は解決策を考え出そうとしましたが、役に立ちませんでした。

char ChooseNextChar(string seed, int order, string fileName){
    Map<string, Vector<char> > nextCharMap;
    ifstream inputStream;
    inputStream.open(fileName.c_str());
    int offset = 0;
    Vector<char> charsFollingSeedVector;
    inputStream.clear();
    char* buffer = new char [order + 1];
    char charFollowingSeed;
    static int consecutiveSpaces = 0;
    while (!inputStream.eof()) {    
        inputStream.seekg(offset);
        inputStream.read(buffer, order + 1);
        string key(buffer, order);
        if (equalsIgnoreCase(key, seed)) {
            //only insert key if not present otherwise overwriting old info 
            if (!nextCharMap.containsKey(seed)) {
                nextCharMap.put(seed, charsFollingSeedVector);
            }
            //read the char directly following seed
            charFollowingSeed = buffer[order];
            nextCharMap[seed].push_back(charFollowingSeed);
        }
        offset++;
    }
    //case where no chars following seed
    if (nextCharMap[seed].isEmpty()) {
        return EOF;
    }
    //determine which is the most frequent following char
    char nextChosenChar = MostFequentCharInVector(seed, nextCharMap);

    //TRYING TO FIX PROBLEM OF ONLY OUTPUTTING SPACES**********
     if (nextChosenChar == ' ') {
        consecutiveSpaces++;
        if (consecutiveSpaces >= 1) {
            nextChosenChar = nextCharMap[seed].get(randomInteger(0, nextCharMap[seed].size()-1));
            consecutiveSpaces = 0;
        }
    }
    return nextChosenChar;
}

score 2 · Accepted Answer

文字ベースのモデルが本当に必要な場合は、出力として非常に自然に見えるテキストを取得することはできませんが、それは間違いなく可能であり、そのモデルは基本的に空白文字のシーケンスも処理できます。それらをテキストの自然な部分と見なす場合は、入力からそれらを削除する必要はありません。

重要なのは、マルコフモデルが、特定の段階で最も確率が高い1つの文字の予測に常にフォールバックするとは限らないということです。代わりに、可能な文字の確率分布全体を調べて、ランダムに1つを選択する必要があります。

ここで、ランダムとは、プログラマーによって事前に決定されていない文字を選択することを意味します。それでも、ランダム分布は一様分布ではありません。つまり、すべての文字が同じように発生するわけではありません。考えられるさまざまなキャラクターの相対的な確率を考慮に入れる必要があります。これを行う1つの方法は、文字の累積確率分布を生成することです。たとえば、確率が

p('a') == 0.2
p('b') == 0.4
p('c') == 0.4

私たちはそれらを次のように表します

p('a') == 0.2
p('b') == p('a') + 0.4 == 0.6
p('c') == p('a') + p('b') == 1.0

次に、乱数を生成するために、最初に0から1の間に均一に分布した乱数Nを生成し、次に累積確率がN以上である最初の文字を選択します。

以下のサンプルコードでこれを実装しました。このtrain()手順では、トレーニング入力のすべての文字について、次の文字の累積確率分布が生成されます。'predict（）'プロシージャは、これを適用してランダムなテキストを生成します。

完全な実装の場合、これにはまだ欠けています。

最初の文字の確率分布の表現。'main（）'関数でわかるように、私の出力は常に't'で始まります。
出力文字列の長さ、または最後の文字の表現。'main（）'は、常に長さ100の文字列を生成するだけです。

コードは、Linux上のGCC 4.7.0（C ++ 11オプション）でテストされました。以下の出力例。

#include <iostream>
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <numeric>
#include <algorithm>
#include <random>

template <typename Char>
class Markov
{
public:
  /* Data type used to count the frequencies (integer!) of
     characters. */
  typedef std::map<Char,unsigned>            CharDistributionMap;

  /* Data type used to represent a cumulative probability (float!)
     distribution. */
  typedef std::vector<std::pair<Char,float>> CharDistribution;

  /* Data type used to represent the Markov model. Each character is
     mapped to a probality distribution of the characters that follow
     it. */
  typedef std::map<Char,CharDistribution>    MarkovModel;


  /* The model. */
  MarkovModel  _model;

  /* Training procedure. */
  template <typename Iterator>
  void train(Iterator from, Iterator to)
  {
    _model = {};
    if (from == to)
      return;

    std::map<Char,CharDistributionMap> proto_model {};

    /* Count frequencies. */
    Char current = *from;
    while (true) {
      ++from;
      if (from == to)
        break;
      Char next = *from;
      proto_model[current][next] += 1;
      current = next;
    }

    /* Transform into probability distribution. */
    for (const auto &entry : proto_model) {
      const Char current              = entry.first;
      const CharDistributionMap &freq = entry.second;

      /* Calculate total frequency of current character. */
      unsigned total =
         std::accumulate(std::begin(freq),std::end(freq),0,
           [](unsigned res,const std::pair<Char,unsigned> &p){
                   return res += p.second;
               });

      /* Determine the probability distribution of characters that
         follow the current character. This is calculated as a cumulative
         probability. */
      CharDistribution dist {};
      float probability { 0.0 };
      std::for_each(std::begin(freq),std::end(freq),
             [total,&probability,&dist](const std::pair<Char,unsigned> &p){
                   // using '+=' to get cumulative probability:
                   probability += static_cast<float>(p.second) / total; 
                   dist.push_back(std::make_pair(p.first,probability));
             });

      /* Add probability distribution for current character to the model. */
      _model[current] = dist;
    }
  }


  /* Predict the next character, assuming that training has been
     performed. */
  template <typename RandomNumberGenerator>
  Char predict(RandomNumberGenerator &gen, const Char current)
  {
    static std::uniform_real_distribution<float> generator_dist { 0, 1 };

    /* Assume that the current character is known to the model. Otherwise,
       an std::out_of_range exception will be thrown. */
    const CharDistribution &dist { _model.at(current) };

    /* Generate random number between 0 and 1. */
    float random { generator_dist(gen) };

    /* Identify the character that has the greatest cumulative probabilty
       smaller than the random number generated. */
    auto res =
         std::lower_bound(std::begin(dist),std::end(dist),
                          std::make_pair(Char(),random),
             [](const std::pair<Char,float> &p1, const std::pair<Char,float> &p2) {
                    return (p1.second < p2.second);
             });
    if (res == std::end(dist))
      throw "Empty probability distribution. This should not happen.";
    return res->first;
  }

};

int main()
{
  /* Initialize random-number generator. */
  std::random_device rd;
  std::mt19937 gen(rd());


  std::string input { "this   is    some   input text   with   many spaces." };

  if (input.empty())
    return 1;

  /* We append the first character to the end, to ensure that even the
     last character of the text gets a non-empty probability
     distribution. A more proper way of dealing with character that
     have empty distributions would be _smoothing_. */
  input += input[0];

  Markov<char> markov {};
  markov.train(std::begin(input),std::end(input));

  /* We set the initial character. In a real stochastic model, there
     would have to be a separate probality distribution for initial
     character and we would choose the initial character randomly,
     too. */
  char current_char { 't' };

  for (unsigned i = 0 ; i < 100 ; ++i) {
    std::cout << current_char;
    current_char = markov.predict(gen,current_char);
  }
  std::cout << current_char << std::endl;
}

このプログラムによって生成された出力例：

t  mext s.t th   winy  iny  somaces      sputhis inpacexthispace te  iny            me   mext mexthis

tes    is  manputhis.th is  wis.th with it    is  is.t  s   t   winy    it mext    is        ispany

this  maces      somany  t    s        it this  winy sputhisomacext manput    somanputes  macexte iso

t   wispanpaces maces  tesomacexte s  s  mes.th     isput t wit   t   somanputes   s  withit  sput ma

ご覧のとおり、スペース文字の分布は、当然のことながら、入力テキストに見られる分布に従います。

score 0 · Accepted Answer

1つの解決策は、ファイルから文字を1つずつストリーミングして、読み取りループが次のようになるようにすることです。

char buffer[order];
inputStream.get(buffer,order);

char next_char;
while ( inputStream.get(next_char) )
{
   string key(buffer, order);
   if (equalsIgnoreCase(key, seed)) {
   // only insert key if not present otherwise overwriting old info 
   if (!nextCharMap.containsKey(seed)) {
      nextCharMap[seed] = Vector(charFollowingSeed);
   }
   else
   {
     nextCharMap[seed].push_back(charFollowingSeed);
   }
   // Update the buffer.
   for(unsigned int i=1; i<order; ++i) buffer[i-1]=buffer[i];
   buffer[order-1]=next_char;
}

次に、次のように余分なスペースを破棄できます。

....
while ( inputStream.get(next_char) )
{
   //Remove multiple spaces from input.
   if( next_char==' ' and buffer[order-1]==' ') continue

   string key(buffer, order);
   ....

c++ - いくつかの入力ファイルに基づいて「ランダムな」テキスト生成で次の文字を予測する

2 に答える 2

Related

Reference