java - Java 文字列検索中の突然のスローダウンと java.lang.OutOfMemoryError

Question

私は、ほとんどが機能する RNA 配列のパターン発見のためのプログラムを書いています。シーケンス内の「パターン」を見つけるために、いくつかの可能なパターンを生成し、それらのすべてのシーケンスの入力ファイルをスキャンしています (アルゴリズムには他にもありますが、これは壊れているビットです)。生成される可能性のあるパターンは、ユーザーが指定した長さです。

これは、最大 8 文字までのすべてのシーケンス長でうまく機能します。次に 9 で、プログラムが長時間実行され、java.lang.OutOfMemoryError が発生します。いくつかのデバッグの後、弱点はパターン生成方法であることがわかりました。

/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {

ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field

/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
   alphabet and the whole length of the string is equal to the input parameter
   'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/

for (int i = 1; i < length-1; i++) { // for each potential position of C2

    // for each alphabet character to be C1
    for (int first = 0; first < alphabet.length; first++) { 

    // for each alphabet character to be C3
    for (int last = 0; last < alphabet.length; last++) {

        // make blank pattern, i.e. no C2
        Substring s_blank = new Substring(-1, alphabet[first],
                          '0', alphabet[last]);

        // get its frequency in the input string
        s_blank.occurrences = search_sequences(s_blank.toString());

        // if blank ep is found frequently enough in the input string, store it
        if (s_blank.frequency()>=nP) ep_subs.add(s_blank);

        // when C2 is present, for each character it could be
        for (int mid = 0; mid < alphabet.length; mid++) {

        // make pattern C1,C2,C3
        Substring s = new Substring(i, alphabet[first],
                        alphabet[mid],
                        alphabet[last]);

        // search input string for pattern s
        s.occurrences = search_sequences(s.toString());

        // if s is frequent enough, store it
        if (s.frequency()>=nP) ep_subs.add(s);
        }
    }
    }
}
}

何が起こるか: search_sequences への呼び出しの時間を測定すると、呼び出しはそれぞれ約 40 ～ 100 ミリ秒で開始され、最初のパターンでそのように実行されます。その後、数百回のパターン ('C.....GC' 前後) の後、これらの呼び出しに突然約 10 倍の 1000 ～ 2000 ミリ秒の時間がかかり始めます。その後、時間は着実に増加し、約 12000ms ('C......TA') で次のエラーが発生します。

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
    at java.util.Scanner.match(Scanner.java:1270)
    at java.util.Scanner.hasNextLine(Scanner.java:1478)
    at PatternFinder4.search_sequences(PatternFinder4.java:217)
    at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
    at PatternFinder4.main(PatternFinder4.java:62)

これは search_sequences メソッドです:

/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {

/* arraylist returned holding int arrays with coordinates of the places where 'sub'
 was found, i.e. {l,i} l = lines number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;

String line = "";
while (s.hasNextLine()) {
    line = s.nextLine();
    pattern = Pattern.compile(sub);
    matcher = pattern.matcher(line);
    pattern = null; // all the =nulls were intended to help memory management, had no effect

    int index = 0;

    // for each occurrence of 'sub' in the line being scanned
    while (matcher.find(index)) {
    int start = matcher.start(); // get the index of the next occurrence
    int[] occurrence = {line_index, start}; // make up the coordinate array
    occurrences.add(occurrence); // store that occurrence
    index = start+1; // start looking from after the last occurence found
    }
    matcher=null;
    line=null;
    line_index++;

}
s=null;

return occurrences;
}

速度の異なる 2 台のコンピューターでこのプログラムを試してみましたが、実際の時間は search_sequence の完了時間はより高速なコンピューターでは小さくなりますが、相対的な時間は同じです。ほぼ同じ反復回数で、search_sequence の完了に 10 倍の時間がかかり始めます。

BufferedReader などのさまざまな入力ストリームのメモリ効率と速度についてグーグルで調べてみましたが、一般的なコンセンサスは、それらがすべて Scanner とほぼ同等であるということです。このバグが何であるか、または自分でそれを理解する方法について何かアドバイスはありますか?

誰かがコードをもっと見たい場合は、尋ねてください。

編集：

1 - 入力ファイル 'sequences' は、約 200 文字のさまざまな長さの 1000 のタンパク質配列 (それぞれが 1 行にある) です。また、このプログラムは/最大 9 の長さのパターンまで/動作する必要があることにも言及する必要があります。

2 - 上記のコードで使用されている Substring クラスのメソッドは次のとおりです。

static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring

public Substring(int inresidue, char infront, char inmid, char inend) {
    occurrences = new ArrayList<int[]>();
    residue = inresidue;
    front = infront;
    mid = inmid;
    end = inend;
    setString(); // makes the string representation using characters and their positions
}

/* gets the frequency of the substring given the places it occurs in 'sequences'. 
   This only counts the substring /once per line ist occurs in/. */
public int frequency() {
    return PatternFinder.frequency(occurrences);
}

public String toString() {
    return string;
}

/* makes the string representation using the substring's characters and their positions */
private void setString() {
    if (residue>-1) {
    String left_mid = "";
    for (int j = 0; j < residue-1; j++) left_mid += ".";
    String right_mid = "";
    for (int j = residue+1; j < length-1; j++) right_mid += ".";
    string = front + left_mid + mid + right_mid + end;
    } else {
    String mid = "";
    for (int i = 0; i < length-2; i++) mid += ".";
    string = front + mid + end;
    }
}
 }

... および PatternFinder.frequency メソッド (Substring.frequency() で呼び出される) :

public static int frequency(ArrayList<int[]> occurrences) {
    HashSet<String> lines_present = new HashSet<String>();
    for (int[] occurrence : occurrences) {
        lines_present.add(new String(occurrence[0]+""));
    }
    return lines_present.size();
    }

score 0 · Accepted Answer

明らかなメモリリークを見つけることはできませんが、プログラムには多くの非効率性があります。以下にいくつかの推奨事項を示します。

コードを適切にインデントします。あなたにとっても他の人にとっても、それを読むのがずっと簡単になります。現在の形式では、非常に読みにくいです。
メンバー変数を参照している場合は、その前にを付けますthis.。そうしないと、コードスニペットの読者は何を参照しているのかわからなくなります。
どうしても必要な場合を除き、静的メンバーとメソッドは避けてください。それらを参照するときClassname.membernameは、同じ理由でフォームを使用してください。
のコードはfrequency()just とどう違うのreturn occurrences.size()ですか?
ではsearch_sequences()、正規表現文字列subは定数です。一度だけコンパイルする必要がありますが、すべての行で再コンパイルしています。
入力文字列 ( sequences) を 1 行に分割し、配列またはに格納しますArrayList。内部で再分割しないでくださいsearch_sequences()。分割されたコレクションを渡します。

修正すべき点はおそらく他にもありますが、これが飛び出すリストです。

これらすべてを修正しても問題が解決しない場合は、プロファイラーを使用して何が起こっているのかを調べる必要があります。

score 0 · Accepted Answer

アルファベットとは？どのような正規表現を与えていますか? 保存しているオカレンスの数を確認しましたか? 指数関数的な数の検索を行っているため、出現を単に保存するだけでメモリ不足になる可能性があります。

あなたのアルゴリズムには、指数関数的なリソース使用量が隠されているようです。何をしようとしているのかを再考する必要があります。

また、ローカル変数を null に設定しても役に立ちません。JVM は既にデータフローと活性分析を行っているからです。

編集:これは、短い正規表現でも実行に指数関数的な時間がかかることを説明するページです。

java - Java 文字列検索中の突然のスローダウンと java.lang.OutOfMemoryError

2 に答える 2

Related

Reference