java - Equivalent of StringTokenizer with multi-character delimiters

Question

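I am trying to split a string into tokens.

The token delimiters are not single characters, and some delimiters are contained in others (for example, & and &&). I also need the delimiters themselves to be returned as tokens.
StringTokenizer cannot handle multi-character delimiters. I assume it is possible with String.split, but I cannot work out the magic regular expression that fits my needs.

Any ideas?

Example: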

Token delimiters: "&", "&&", "=", "=>", " "  
String to tokenize: a & b&&c=>d  
Expected result: a string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"

--- Edit ---
Thanks to everyone for the help; Dasblinkenlight gave me the solution. Here is the "ready to use" code I wrote with his help.

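import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String[] wonderfulTokenizer(String string, String[] delimiters) {
  // First, create a regular expression that matches the union of the delimiters.
  // When one delimiter is a prefix of another (for example & and &&), the longer
  // one must come before the shorter one in the alternation, or the regex engine
  // will match && as two separate &.
  // Reverse lexicographic order guarantees this, because a string always sorts
  // after its own prefixes. Note that this sorts the caller's array in place.
  Arrays.sort(delimiters, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
      return o2.compareTo(o1);
    }
  });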
  // Build a string that will contain the regular expression
  StringBuilder regexpr = new StringBuilder();
  regexpr.append('(');
  for (String delim : delimiters) { // For each delimiter
    if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
    // Quote the delimiter so that any regex metacharacters in it are matched
    // literally. (Prefixing every character with a backslash is not safe:
    // "\d" or "\t", for instance, would change meaning.)
    regexpr.append(Pattern.quote(delim));
  }
  regexpr.append(')'); // Close the union
  Pattern p = Pattern.compile(regexpr.toString());

  // Now, search for the tokens
  List<String> res = new ArrayList<String>();
  Matcher m = p.matcher(string);
  int pos = 0;
  while (m.find()) { // While there's a delimiter in the string
    if (pos != m.start()) {
      // If there's something between the current and the previous delimiter
      // Add it to the tokens list
      res.add(string.substring(pos, m.start()));
    }
    res.add(m.group()); // add the delimiter
    pos = m.end(); // Remember end of delimiter
  }
  if (pos != string.length()) {
    // If some characters remain in the string after the last delimiter,
    // add them to the token list
    res.add(string.substring(pos));
  }
  // Return the result
  return res.toArray(new String[res.size()]);
}

If you have many strings to tokenize, this can be optimized by building the pattern only once.
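
One possible way to build the pattern only once is sketched below. The class name DelimiterTokenizer and its methods are made up for this illustration and do not come from the original code. The sketch assumes that every delimiter is a non-empty string, sorts a copy of the delimiters longest-first instead of reordering the caller's array, and quotes each delimiter with Pattern.quote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the original post): compiles the
// delimiter pattern once and reuses it for every string to tokenize.
public class DelimiterTokenizer {
    private final Pattern pattern;

    // Assumption: every delimiter is a non-empty string.
    public DelimiterTokenizer(String... delimiters) {
        // Work on a copy so the caller's array is not reordered.
        String[] sorted = delimiters.clone();
        // Longest first, so that "&&" is tried before "&".
        Arrays.sort(sorted, (a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder regex = new StringBuilder();
        for (String delim : sorted) {
            if (regex.length() > 0) regex.append('|');
            regex.append(Pattern.quote(delim)); // match the delimiter literally
        }
        this.pattern = Pattern.compile(regex.toString());
    }

    // Returns the tokens and the delimiters, in the order they appear.
    public List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int pos = 0;
        while (m.find()) {
            if (pos != m.start()) tokens.add(input.substring(pos, m.start()));
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos != input.length()) tokens.add(input.substring(pos));
        return tokens;
    }

    public static void main(String[] args) {
        DelimiterTokenizer t = new DelimiterTokenizer("&", "&&", "=", "=>", " ");
        System.out.println(t.tokenize("a & b&&c=>d"));
        System.out.println(t.tokenize("x=y")); // same compiled pattern, second string
    }
}
```

A compiled Pattern is immutable and can be shared between calls; only the Matcher is created per string.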

score 4 · Accepted Answer

You can use Pattern and a simple loop to get the result you are looking for.

List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
    if (pos != m.start()) {
        res.add(s.substring(pos, m.start()));
    }
    res.add(m.group());
    pos = m.end();
}
if (pos != s.length()) {
    res.add(s.substring(pos));
}
for (String t : res) {
    System.out.println("'"+t+"'");
}

This produces the following result:

's'
'='
'a'
'&'
'=>'
'b'
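
The pattern above relies on greedy quantifiers ({1,2} and >?), so the longer form of a delimiter is preferred automatically. If the alternation is written out from literal delimiters instead, the order of the alternatives matters, because Java tries them from left to right and takes the first one that matches. A small sketch to illustrate this (the class and method names are made up for the demonstration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows how alternation order changes the matches.
public class AlternationOrderDemo {
    // Collects every match of the given regex in the input, in order.
    public static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) found.add(m.group());
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("&|&&", "b&&c"));     // prints [&, &]
        System.out.println(matches("&&|&", "b&&c"));     // prints [&&]
        System.out.println(matches("[&]{1,2}", "b&&c")); // prints [&&]
    }
}
```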

score 2 · Answer

split will not do this, because it removes the delimiters. You probably need to tokenize the string yourself (i.e. with a for loop) or use a framework such as http://www.antlr.org/.
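
As a rough idea of what tokenizing the string yourself could look like without regular expressions, here is a minimal sketch. The class ManualTokenizer is hypothetical; at each position it picks the longest delimiter that starts there, so the order in which the delimiters are passed does not matter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner (an illustration, not from the thread).
public class ManualTokenizer {
    public static List<String> tokenize(String input, String... delimiters) {
        List<String> tokens = new ArrayList<>();
        int tokenStart = 0; // start of the current non-delimiter token
        int i = 0;
        while (i < input.length()) {
            // Find the longest delimiter that begins at position i, if any.
            String best = null;
            for (String d : delimiters) {
                if (!d.isEmpty() && input.startsWith(d, i)
                        && (best == null || d.length() > best.length())) {
                    best = d;
                }
            }
            if (best == null) { // ordinary character, keep scanning
                i++;
                continue;
            }
            // Emit the text before the delimiter, then the delimiter itself.
            if (tokenStart != i) tokens.add(input.substring(tokenStart, i));
            tokens.add(best);
            i += best.length();
            tokenStart = i;
        }
        // Emit whatever follows the last delimiter.
        if (tokenStart != input.length()) tokens.add(input.substring(tokenStart));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a & b&&c=>d", "&", "&&", "=", "=>", " "));
    }
}
```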

score 1 · Answer

Try this:

String test = "a & b&&c=>d=A";
String regEx = "(&[&]?|=[>]?)";

String[] res = test.split(regEx);
for(String s : res){
    System.out.println("Token: "+s);
}

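I added "=A" at the end to show that it is parsed as well.

As mentioned in another answer, if you need the atypical behavior of keeping the delimiters in the result, you will probably have to write the parser yourself... but in that case you really need to think about what a "delimiter" is in your code.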
