c# - Regular expression to match differently formatted lines in C#

Question

File format:

POS ID         PosScore NegScore    SynsetTerms                          Gloss
a   00001740    0.125   0           able#1                              "able to swim"; "she was able to program her computer";
a   00002098    0       0.75        unable#1                            "unable to get to town without a car"; 
a   00002312    0       0           dorsal#2 abaxial#1                  "the abaxial surface of a leaf is the underside or side facing away from the stem"
a   00002843    0       0           basiscopic#1                         facing or on the side toward the base
a   00002956    0       0.23        abducting#1 abducent#1               especially of muscles; drawing away from the midline of the body or from an adjacent part
a   00003131    0       0           adductive#1 adducting#1 adducent#1   especially of muscles;

From this file I want to extract the ID, PosScore, NegScore, and SynsetTerms fields. Extracting the ID, PosScore, and NegScore fields is easy; I use the following code for them:

Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");

var results = expression.Matches(input);
foreach (Match match in results)
{

    Console.WriteLine(match);
}
Console.ReadLine();

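This gives the correct result, but the SynsetTerms field causes a problem: on some lines it contains two or more words. How can I separate the words and get the PosScore and NegScore for each one?

For example, line 5 has two words, abducting#1 and abducent#1, and both have the same scores.

For such a line, the regex should yield each word with its scores, like this: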

  Word                PosScore          NegScore 
  abducting#1         0                 0.23
  abducent#1          0                 0.23

score 5 · Accepted Answer

A non-regex version based on string splitting may be simpler:

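// lines holds the whole file content as a single string.
var data =
   lines.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
        .Skip(1)                            // skip the header row
        .Select(line => line.Split('\t'))   // assumes tab-separated fields
        // emit one record per whitespace-separated term in SynsetTerms
        .SelectMany(parts => parts[4].Split().Select(word => new
            {
                ID = parts[1],
                Word = word,
                PosScore = decimal.Parse(parts[2]),
                NegScore = decimal.Parse(parts[3])
            }));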
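As a self-contained illustration of the splitting approach above, here is a minimal sketch. The class name SynsetSplitter, the Parse helper, and the sample text are assumptions made for this example, not part of the answer. It also passes CultureInfo.InvariantCulture to decimal.Parse so that "0.23" is read the same way regardless of the machine's regional settings, and it accepts both "\r\n" and "\n" line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class SynsetSplitter
{
    // Hypothetical helper: expands each tab-separated data line
    // into one row per term in the SynsetTerms field.
    public static List<(string Id, string Word, decimal PosScore, decimal NegScore)> Parse(string text)
    {
        return text
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Skip(1)                               // skip the header row
            .Select(line => line.Split('\t'))      // assumes tab-separated fields
            .Where(parts => parts.Length >= 5)     // ignore lines with too few fields
            .SelectMany(parts => parts[4]
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(word => (
                    Id: parts[1],
                    Word: word,
                    PosScore: decimal.Parse(parts[2], CultureInfo.InvariantCulture),
                    NegScore: decimal.Parse(parts[3], CultureInfo.InvariantCulture))))
            .ToList();
    }

    public static void Main()
    {
        // Illustrative sample: a header row plus one data row with two terms.
        string sample =
            "POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss\n" +
            "a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles\n";

        foreach (var row in Parse(sample))
        {
            Console.WriteLine("{0}\t{1}\t{2}", row.Word, row.PosScore, row.NegScore);
        }
    }
}
```

Each term on a line becomes its own row and inherits that line's ID and scores, which is the shape of output the question asks for.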

score 1 · Accepted Answer

You can use this regex:

^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$
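To make the pattern easier to follow, here is a sketch of the same expression spread over several lines with RegexOptions.IgnorePatternWhitespace, one named group per line. The class name PatternDemo, the MatchLine helper, and the sample line are assumptions made for this example. In that mode an unescaped # starts a comment, so the literal # of the original pattern is written as \# here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Same pattern as above, laid out one field per line.
    private const string VerbosePattern = @"
        ^(?<pos>\w+)                      # part-of-speech tag, e.g. a
        \s+(?<id>\d+)                     # numeric ID
        \s+(?<pscore>\d+(?:\.\d+)?)       # PosScore, integer or decimal
        \s+(?<nscore>\d+(?:\.\d+)?)       # NegScore, integer or decimal
        \s+(?<terms>(?:.*?\#[^\s]*)+)     # one or more word#sense terms
        \s+(?<gloss>.*)$                  # rest of the line
    ";

    // Hypothetical helper: matches a single data line against the pattern.
    public static Match MatchLine(string line)
    {
        return Regex.Match(
            line,
            VerbosePattern,
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    }

    public static void Main()
    {
        Match m = MatchLine("a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles");

        foreach (string name in new[] { "pos", "id", "pscore", "nscore", "terms", "gloss" })
        {
            Console.WriteLine("{0} = {1}", name, m.Groups[name].Value);
        }
    }
}
```

The terms group relies on every term containing a #; a gloss that itself contains a # would be pulled into terms, so this is best treated as a sketch for data shaped like the sample in the question.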

You can build a list like this:

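// regex holds the pattern shown above. RegexOptions.Multiline is needed so that
// ^ and $ anchor at every line of input instead of only at its start and end.
var lst=Regex.Matches(input,regex,RegexOptions.Multiline)
             .Cast<Match>()
             .Select(x=>
             new 
             {
                 pos=x.Groups["pos"].Value,
                 // split the SynsetTerms field into its individual words
                 terms=Regex.Split(x.Groups["terms"].Value,@"\s+"),
                 gloss=x.Groups["gloss"].Value
             }
        );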

And now you can iterate over it:

foreach(var temp in lst)
{
    Console.WriteLine(temp.pos);
    //you can now iterate over terms
    foreach(var t in temp.terms)
    {
        Console.WriteLine(t);
    }
}
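The snippets in this answer do not read the pscore and nscore groups, which the question asks for. A minimal sketch that combines the pattern with those groups and yields one row per term might look like the following. The class name SynsetRegexParser, the Parse helper, and the sample input are assumptions made for this example; the scores are parsed with CultureInfo.InvariantCulture so the decimal point is interpreted consistently.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class SynsetRegexParser
{
    // Pattern copied from the answer above.
    private const string Pattern =
        @"^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$";

    // Hypothetical helper: returns one (Word, PosScore, NegScore) row per term.
    public static List<(string Word, decimal PosScore, decimal NegScore)> Parse(string input)
    {
        // Multiline makes ^ and $ match at the start and end of every line.
        return Regex.Matches(input, Pattern, RegexOptions.Multiline)
            .Cast<Match>()
            .SelectMany(m => Regex.Split(m.Groups["terms"].Value, @"\s+")
                .Where(word => word.Length > 0)
                .Select(word => (
                    Word: word,
                    PosScore: decimal.Parse(m.Groups["pscore"].Value, CultureInfo.InvariantCulture),
                    NegScore: decimal.Parse(m.Groups["nscore"].Value, CultureInfo.InvariantCulture))))
            .ToList();
    }

    public static void Main()
    {
        // Illustrative sample: the header row and two data rows from the question.
        string input =
            "POS ID         PosScore NegScore    SynsetTerms                          Gloss\n" +
            "a   00002098    0       0.75        unable#1                            \"unable to get to town without a car\";\n" +
            "a   00002956    0       0.23        abducting#1 abducent#1               especially of muscles\n";

        foreach (var row in Parse(input))
        {
            Console.WriteLine("{0}\t{1}\t{2}", row.Word, row.PosScore, row.NegScore);
        }
    }
}
```

The header row is skipped without special handling because its second column is not numeric, so the pattern does not match it.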
