parsing - Parsing a file in D

Question

I am new to D and would like to parse a biological file of the following form:

>name1
acgcgcagagatatagctagatcg
aagctctgctcgcgct
>name2
acgggggcttgctagctcgatagatcga
agctctctttctccttcttcttctagagaga
>name3
gag ggagag

I want to capture the "headers" name1, name2, name3 together with their corresponding "sequence" data (..acgcg... and so on).

Right now I have this, but it only iterates over the file line by line:

import std.stdio;
import std.stream;
import std.regex;


int main(string[] args){
  auto filename = args[1];
  auto entry_name = regex(r"^>(.*)"); //captures header only
  auto fasta_regex = regex(r"(\>.+\n)([^\>]+\n)"); //captures header and corresponding sequence

  try {
    Stream file = new BufferedFile(filename);
    foreach(ulong n, char[] line; file) {
      auto name_capture = match(line,entry_name);
      writeln(name_capture.captures[1]);
    }

    file.close();
  }
  catch (FileException xy){
    writefln("Error reading the file: ");
  }

  catch (Exception xx){
    writefln("Exception occurred: " ~ xx.toString());
  }
  return 0;
}

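I would like to know a good way to extract the headers and the sequence data, so that I can build an associative array in which each item corresponds to one entry in the file: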

[name1:acgcgcagagatatagctagatcgaagctctgctcgcgct,name2:acgggggcttgctagctcgatagatcgaagctctctttctccttcttcttctagagaga,.....]

score 8 · Accepted Answer

The header is on its own line, right? So why not check for it and use an appender to build up the value:

// needs: import std.array;
auto current = std.array.appender!(char[])();
string name;
foreach(ulong n, char[] line; file) {
      auto entry = match(line,entry_name);
      if(!entry.empty){//we are in a header line

          if(name.length){//store what was collected for the previous header
              map[name]=current.data.idup;//idup because current's buffer is reused
          }
          name = entry.captures[1].idup;//group 1 is the name without the leading '>'
          current.clear();
      }else{
          current.put(line);
      }
}
map[name]=current.data.idup;//remember last capture

map is where the values get stored (a string[string] will do).
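For reference, here is a self-contained sketch of the same appender idea. It is not the answer's exact code: parseFasta is a hypothetical helper name, it uses matchFirst rather than match, and it works on an in-memory array of lines instead of a stream so that it can be tried without a file. Lines that appear before the first header are skipped, which is an assumption on my part.

```d
import std.array : appender;
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// Hypothetical helper (not from the answer): builds name -> sequence
// from a range of lines that carry no trailing newline.
string[string] parseFasta(Lines)(Lines lines)
{
    auto headerRe = regex(`^>(.*)`);
    string[string] map;
    auto current = appender!(char[])();
    string name;
    bool inEntry = false;   // assumption: ignore lines before the first header

    foreach (line; lines)
    {
        auto m = matchFirst(line, headerRe);
        if (!m.empty)                              // header line
        {
            if (inEntry)
                map[name] = current.data.idup;     // copy: the buffer is reused
            name = m[1].idup;                      // group 1 excludes the '>'
            current.clear();
            inEntry = true;
        }
        else if (inEntry)
        {
            current.put(line);                     // sequence line: accumulate
        }
    }
    if (inEntry)
        map[name] = current.data.idup;             // store the last entry
    return map;
}

void main()
{
    // First entry of the sample data from the question.
    auto lines = [">name1", "acgcgcagagatatagctagatcg", "aagctctgctcgcgct"];
    writeln(parseFasta(lines)["name1"]);
}
```

Keeping the parsing in a function that accepts any range of lines means the same code could be fed from a file reader later without changes to the loop.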

score 4 · Answer

Here is my solution without regular expressions (I don't think a regex is needed for input this simple):

import std.stdio;
import std.stream;

int main(string[] args) {
  int ret = 0;
  string fileName = args[1];
  string header;
  char[] sequence;
  string[string] content;
  try {  
    auto file = new BufferedFile(fileName);
    foreach(ulong lineNumber, char[] line; file) {
      if (line.length > 0 && line[0] == '>') { // length check avoids a range error on empty lines
        if (header.length > 0) {
          content[header] = sequence.idup;
          sequence.length = 0;
        } // if
        // we have a new header, and new sequence will start after it
        header = line[1..$].idup;
        content[header] = "";
      } else {
          sequence ~= line;
      } // else
    } // foreach
    if (header.length > 0) content[header] = sequence.idup; // store the last entry, if any
    file.close();
  }
  catch (OpenException oe){
    writeln("Error opening file: ", oe.toString());
    ret = 1;
  }
  catch (Exception e){
    writeln("Exception: ", e.toString());
    ret = 1;
  }
  writeln(content);
  return ret;
} // main() function

/+ -------------------------- BEGIN OUTPUT ------------------------------- +
["name3":"gag ggagag", "name1":"acgcgcagagatatagctagatcgaagctctgctcgcgct", "name2":"acgggggcttgctagctcgatagatcgaagctctctttctccttcttcttctagagaga"]
 + -------------------------- END OUTPUT --------------------------------- +/
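Note that std.stream has since been deprecated and removed from Phobos, so the program above will not build with recent compilers. A rough equivalent based on std.stdio.File.byLine might look like the sketch below. readFasta is a hypothetical helper name, and skipping any lines that come before the first header is an assumption, not something the original program does.

```d
import std.stdio : File, stderr, writeln;

// Hypothetical helper: reads the file at `path` into name -> sequence.
string[string] readFasta(string path)
{
    string[string] content;
    string header;
    bool haveHeader = false;   // assumption: ignore lines before the first header

    // byLine strips the trailing '\n' by default and reuses its buffer,
    // so anything that is kept has to be copied with idup.
    foreach (line; File(path).byLine)
    {
        if (line.length > 0 && line[0] == '>')
        {
            header = line[1 .. $].idup;    // drop the leading '>'
            content[header] = "";
            haveHeader = true;
        }
        else if (haveHeader)
        {
            content[header] ~= line.idup;  // join sequence lines
        }
    }
    return content;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("usage: ", args[0], " <fasta-file>");
        return 1;
    }
    try
    {
        writeln(readFasta(args[1]));
    }
    catch (Exception e)   // File throws if the file cannot be opened
    {
        stderr.writeln("Error: ", e.msg);
        return 1;
    }
    return 0;
}
```

If the input may have Windows line endings, a trailing '\r' would remain on each line with this sketch and would need to be stripped separately.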
