java - Getting the word count and character count of a string

Question

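I am trying to write a super-efficient method that accepts a String and operates in two "modes" (WORD and CHARACTER), telling me either the number of words in it (separated by one or more whitespace characters) or the number of characters (non-whitespace characters).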

public int getCount(String toExamine, boolean wordMode) {
    int count = 0;

    if(wordMode) {
        // Return the number of words.
    }
    else {
        // Return the number of characters.
    }

    return count;
}

I know I can accomplish the WORD mode version using StringTokenizer:

StringTokenizer tokenizer = new StringTokenizer(toExamine, " ");
count = tokenizer.countTokens();

But I am completely at a loss as to what to use for CHARACTER mode (the number of non-whitespace characters). I am sure I could use something crude like:

for(int i = 0; i < toExamine.length(); i++)
    if(!Character.isWhitespace(toExamine.charAt(i)))
        count++;

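But that is kind of ugly and probably not the most efficient way to do this (the same goes for how StringTokenizer works). Can I use a regex here, or some other Java String/character madness, to get what I need in a super-efficient way? I am working with tens of millions of strings here. Thanks in advance.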

score 0 · Accepted Answer

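Iterate over the characters with a for loop (the code below indexes with charAt; converting to a char array first works as well):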

int charCount =0;
for(int i=0; i<sentence.length(); i++) {
    if(!Character.isWhitespace(sentence.charAt(i))) {
        charCount++;
    }
}

Alternatively, remove all whitespace and take the length of what is left, as in the code below:

int charcount = 0;
String newSentence =sentence.replaceAll("\\s+", "");
charcount = newSentence.length();
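
Putting the loop into the getCount signature from the question, a minimal sketch might look like the following. The class name CountSketch and the main method exist only for this illustration, and a "word" is assumed to be a maximal run of characters for which Character.isWhitespace returns false.

```java
public class CountSketch {

    // Counts words (wordMode == true) or non-whitespace characters
    // (wordMode == false) in a single pass over the string.
    public static int getCount(String toExamine, boolean wordMode) {
        int count = 0;
        boolean inWord = false; // true while the previous character was non-whitespace

        for (int i = 0; i < toExamine.length(); i++) {
            if (Character.isWhitespace(toExamine.charAt(i))) {
                inWord = false;
            } else {
                // In word mode count only the first character of each run;
                // in character mode count every non-whitespace character.
                if (!wordMode || !inWord) {
                    count++;
                }
                inWord = true;
            }
        }

        return count;
    }

    public static void main(String[] args) {
        String sample = "  two  words\t";
        System.out.println(getCount(sample, true));  // 2
        System.out.println(getCount(sample, false)); // 8
    }
}
```

Both modes share one loop here, so no intermediate String or char[] is created.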

score 0 · Accepted Answer

This is not faster than a for loop, but if you need to use a regex, you can try something like this:

int noSpaces=toExamine.split("\\s+").length-1;

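The number of characters is then as follows. Note that split counts runs of whitespace and drops trailing empty strings, so the result is exact only when every run of whitespace is a single character and the string does not end with whitespace.

int noChar=toExamine.length()-noSpaces;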
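
To see where a split-based count and an exact count diverge, here is a small sketch. The class name SplitCountSketch and both helper names are made up for this illustration; the exact variant counts matches of a precompiled \s pattern, which is one possible regex alternative and not necessarily the fastest one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitCountSketch {

    // Compiled once and reused for every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // The split-based approach: counts runs of whitespace, not characters.
    static int nonSpaceViaSplit(String s) {
        int noSpaces = s.split("\\s+").length - 1;
        return s.length() - noSpaces;
    }

    // Counts every single whitespace character matched by \s.
    static int nonSpaceViaMatcher(String s) {
        Matcher m = WHITESPACE.matcher(s);
        int spaces = 0;
        while (m.find()) {
            spaces++;
        }
        return s.length() - spaces;
    }

    public static void main(String[] args) {
        System.out.println(nonSpaceViaSplit("a b c"));     // 3
        System.out.println(nonSpaceViaMatcher("a b c"));   // 3
        System.out.println(nonSpaceViaSplit("a  b c "));   // 5 (double space and trailing space are miscounted)
        System.out.println(nonSpaceViaMatcher("a  b c ")); // 3
    }
}
```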

score 0 · Accepted Answer

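The test program below produces the following results. The program prints 5 such sets of results, but only one is shown here. Lines starting with // are my annotations and are not part of the program's output.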

// The proportion of non-space characters is about 0.857
// The total length of the generated string is 1 075 662
0.857 1075662
// Name_of_method (結果): 15_Runs_In_Microseconds | Average_In_Microseconds
countWords_1 (131489): 20465 20240 21045 20193 20000 19972 20551 39489 19859 19971 19889 19877 20049 19900 19949 | 21429
countWords_2 (131489): 255500 258723 254543 255956 253606 263549 254096 254402 254191 254296 253752 261501 260788 261574 254178 | 256710
countWords_3 (131489): 26225 25022 24830 24829 24545 24819 25459 24625 25628 24700 24936 24794 24794 24849 25026 | 25005
countWords_4 (131489): 24537 24169 25283 24862 23863 23902 24068 23906 51472 23731 23889 23844 23832 24275 23896 | 25968
countWords_5 (131489): 81087 112095 80008 81290 81472 80581 80717 80460 79870 80557 80694 80923 145686 80564 80849 | 87123
countWords_6 (131489): 114391 114146 111946 111873 112331 167207 134117 118217 112843 112804 113533 111834 112830 112392 118181 | 118576
countChars_1 (922546): 150507 109102 150453 111352 149753 108099 153842 109034 150817 117258 149219 108194 152839 110340 149524 | 132022
countChars_2 (922546): 28779 29473 52499 27182 26519 27743 26717 27161 26451 27060 26307 27309 26350 62824 33134 | 31700
countChars_3 (922546): 25408 25127 24980 24832 24624 24671 24848 24712 24634 24622 24607 24613 24661 24765 24883 | 24799
countChars_4 (922546): 81489 82246 80906 80718 80803 81147 81113 81798 81030 81024 108508 80768 80780 80671 80753 | 82916
countChars_5 (922546): 26086 25546 24846 43734 25016 25083 24894 25530 25031 25041 25114 24935 25358 24895 43498 | 27640
countChars_6 (922546): 102559 102257 101381 101589 103432 101739 102794 129472 101305 101834 103124 101486 101254 102874 101481 | 103905

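countWords_2 and countWords_6 are one-liner methods that use a trick with regex and replaceAll, and they are very slow compared with the other methods. countWords_5 uses a precompiled Pattern to do the matching and is faster than the replaceAll one-liners, but it is still slow compared with the rest.

countWords_3 and countWords_4 are simple loops with slight differences. The timings show no conclusive difference between them. (I look for consistency in whether one timing is larger or smaller than the other, and the difference should be at least about 5 milliseconds.)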

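countWords_1 uses StringTokenizer with its default delimiters, which do not include the Unicode whitespace characters. Its semantics are therefore quite different, and it is not a proper comparison here.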
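
The semantic difference is easy to reproduce. In the sketch below (the class name TokenizerVsLoop and the sample string are assumptions made for illustration), the input contains U+2003 EM SPACE, which Character.isWhitespace treats as whitespace but which is not among StringTokenizer's default delimiters (space, tab, newline, carriage return, form feed).

```java
import java.util.StringTokenizer;

public class TokenizerVsLoop {

    // Same logic as countWords_3: a word is a run of characters
    // for which Character.isWhitespace returns false.
    static int countWithLoop(String input) {
        int count = 0;
        boolean in = false;
        for (int i = 0; i < input.length(); i++) {
            if (!Character.isWhitespace(input.charAt(i))) {
                if (!in) {
                    in = true;
                    count++;
                }
            } else {
                in = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "one" and "two" are separated by an EM SPACE (U+2003),
        // "two" and "three" by an ordinary space.
        String input = "one\u2003two three";
        System.out.println(new StringTokenizer(input).countTokens()); // 2
        System.out.println(countWithLoop(input));                     // 3
    }
}
```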

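For counting words (defined as sequences of non-whitespace characters), a simple loop is faster than any regex method I can think of.

countChars_1 and countChars_6 are one-liners using regex and replaceAll. Again, they are slower than countChars_4, which uses a compiled Pattern. And again, all of the regex solutions are slower than a simple loop.

countChars_2, countChars_3 and countChars_5 are variations of the simple loop. Over the many runs I observed, the difference between countChars_3 and countChars_5 is not very consistent and therefore not conclusive. countChars_2, however, is usually slightly slower, probably because new memory has to be allocated for the char[] returned by toCharArray.

I do not guarantee that the methods here are the fastest, but they give an idea of how a simple loop compares with the regex solutions.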

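You can run this test program and judge for yourself. I have written the test so that you are free to do the following.

Change the length of the generated test string and the frequency of whitespace characters. Currently, the length of the test string is random between 700 000 and 1 300 000 characters, and the ratio of non-space to space characters varies between 4:1 and 9:1 (my guess for typical text). Setting the FLUCTUATION constants to 0 fixes the length or the ratio, which is very useful for testing edge cases.

Replace how the test string is generated (real data instead of randomly generated strings). Currently, a subset of ASCII characters is used: about 64 non-space characters, with space, newline, tab and carriage return as the whitespace characters. Unicode whitespace characters exist, but they are not included in the current test.

Add new methods to test, marked with the @Test annotation.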

import java.util.regex.Pattern;
import java.util.regex.Matcher;

import java.util.Arrays;
import java.util.ArrayList;
import java.util.Random;
import java.util.StringTokenizer;

import java.lang.reflect.Method;

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.annotation.ElementType;

class TestStringProcessing_15028652 {

  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  private @interface Test {};

  // From 0.80 - 0.90 (4:1 to 9:1 non-space:space characters ratio)
  private static final double NON_SPACE_RATIO = 0.85;
  private static final double NON_SPACE_RATIO_FLUCTUATION = 0.05;

  // With the way the test is written, it is not going to work well with small input (1000 is NOT enough)
  // Currently set to 700 000 - 1 300 000 characters
  private static final int NUM_CHARS = 1000000;
  private static final int NUM_CHARS_FLUCTUATION = 300000;

  // Some whitespace characters
  private static final char WHITESPACES[] = {' ', '\t', '\r', '\n'};

  // Number of times to run all methods
  private static final int NUM_OUTER = 5;
  // Number of times to run each method
  private static final int NUM_REPEAT = 15;

  static {
    for (int i = 0; i < WHITESPACES.length; i++) {
      assert(Character.isWhitespace(WHITESPACES[i]));
    }
  }

  private static Random random = new Random();

  private static String generateInput() {

    double nonSpaceRatio = NON_SPACE_RATIO + random.nextDouble() * 2 * NON_SPACE_RATIO_FLUCTUATION - NON_SPACE_RATIO_FLUCTUATION;
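    // The +1 keeps the bound positive, so a fluctuation of 0 yields a fixed length
    // instead of an IllegalArgumentException from nextInt(0).
    int numChars = NUM_CHARS + random.nextInt(2 * NUM_CHARS_FLUCTUATION + 1) - NUM_CHARS_FLUCTUATION;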

    System.out.printf("%.3f %d\n", nonSpaceRatio, numChars);

    StringBuilder output = new StringBuilder();

    for (int i = 0; i < numChars; i++) {
      if (random.nextDouble() < nonSpaceRatio) {
        output.append((char) (random.nextInt(64) + '0'));
      } else {
        output.append(WHITESPACES[random.nextInt(WHITESPACES.length)]);
      }
    }

    return output.toString();
  }

  private static ArrayList<Method> getTestMethods() {
    Class<?> klass = null;
    try {
      klass = Class.forName(Thread.currentThread().getStackTrace()[1].getClassName());
    } catch (Exception e) {
      e.printStackTrace();
      System.err.println("Something really bad happened. Bailing out...");
      System.exit(1);
    }
    Method[] methods = klass.getMethods();
    // System.out.println(klass);
    // System.out.println(Arrays.toString(methods));

    ArrayList<Method> testMethods = new ArrayList<Method>();

    for (Method method: methods) {
        if (method.isAnnotationPresent(Test.class)) {
          testMethods.add(method);
        }
    }

    return testMethods;
  }


  public static void runTestReflection() {
    ArrayList<Method> methods = getTestMethods();

    for (int t = 0; t < NUM_OUTER; t++) {
      String input = generateInput();

      for (Method method: methods) {

        try {
          System.out.print(method.getName() + " (" + method.invoke(null, input) + "): ");
        } catch (Exception e) {
          e.printStackTrace();
        }

        long sum = 0;
        for (int i = 0; i < NUM_REPEAT; i++) {
          long start, end;
          Object result;

          try {
            start = System.nanoTime();
            result = method.invoke(null, input);
            end = System.nanoTime();

            System.out.print((end - start) / 1000 + " ");
            sum += (end - start) / 1000;
          } catch (Exception e) {
            e.printStackTrace();
          }
        }

        System.out.println("| " + sum / NUM_REPEAT);
      }

      System.out.println();
    }
  }

  public static void main(String args[]) {
    runTestReflection();
  }

  @Test
  public static int countWords_1(String input) {
    // WARNING: This is NOT the same as isWhitespace, since isWhitespace
    // also considers Unicode whitespace characters.
    return new StringTokenizer(input).countTokens();
  }

  @Test
  public static int countWords_2(String input) {
    return input.replaceAll("\\S+", "$0 ").length() - input.length();
  }

  @Test
  public static int countWords_3(String input) {
    int count = 0;
    boolean in = false;

    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) {
        if (!in) {
          in = true;
          count++;
        }
      } else {
        in = false;
      }
    }

    return count;
  }

  @Test
  public static int countWords_4(String input) {
    int count = 0;

    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) {
        do {
          i++;
        } while (i < input.length() && !Character.isWhitespace(input.charAt(i)));
        count++;
      }
    }

    return count;
  }

  @Test
  public static int countWords_5(String input) {
    int count = 0;
    Matcher m = p.matcher(input);

    while (m.find()) {
      count++;
    }

    return count;
  }

  @Test
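  public static int countWords_6(String input) {
    // NOTE: a non-empty input made up only of whitespace matches nothing,
    // so its own length is returned instead of 0.
    return input.replaceAll("\\s*+\\S++\\s*+", " ").length();
  }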

  @Test
  public static int countChars_1(String input) {
    return input.replaceAll("\\s+", "").length();
  }

  @Test
  public static int countChars_2(String input) {
    int count = 0;
    for (char c: input.toCharArray()) {
      if (!Character.isWhitespace(c)) {
        count++;
      }
    }

    return count;
  }

  @Test
  public static int countChars_3(String input) {
    int count = 0;
    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) {
        count++;
      }
    }

    return count;
  }

  private static Pattern p = Pattern.compile("\\S+");

  @Test
  public static int countChars_4(String input) {
    Matcher m = p.matcher(input);
    int count = 0;

    while (m.find()) {
      count += m.end() - m.start();
    }

    return count;
  }

  @Test
  public static int countChars_5(String input) {
    int count = input.length();
    for (int i = 0; i < input.length(); i++) {
      if (Character.isWhitespace(input.charAt(i))) {
        count--;
      }
    }

    return count;
  }

  @Test
  public static int countChars_6(String input) {
    return input.length() - input.replaceAll("\\S+", "").length();
  }
}
