java - 大きなテキストの最長共通部分文字列

Question

私は、最長の共通部分文字列を見つけるためのコードを書くように求める学校の課題を持っています。私はそれを実行しましたが、それほど大きくないテキストでのみ機能し、Moby Dick と War And Peace の共通部分文字列を見つけるよう求められています。私が間違っていることを正しい方向に向けていただければ幸いです。コンパイラは、SuffixArray を作成するために MyString クラスの部分文字列メソッドを呼び出したときに、エラーが MyString クラスの部分文字列メソッドにあると不平を言っています。

package datastructuresone;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.Scanner;


 class SuffixArray
 {

private final MyString[] suffixes;
private final int N;

public SuffixArray(String s)
{
    N = s.length();
    MyString snew = new MyString(s);
    suffixes = new MyString[N];
    for (int i = 0; i < N; i++)
    {
        suffixes[i] = snew.substring(i);
    }
    Arrays.sort(suffixes);
}

public int length()
{
    return N;
}

public int index(int i)
{
    return N - suffixes[i].length();
}

public MyString select(int i)
{
    return suffixes[i];
}

// length of longest common prefix of s and t
private static int lcp(MyString s, MyString t)
{
    int N = Math.min(s.length(), t.length());
    for (int i = 0; i < N; i++)
    {
        if (s.charAt(i) != t.charAt(i))
        {
            return i;
        }
    }
    return N;
}

// longest common prefix of suffixes(i) and suffixes(i-1)
public int lcp(int i)
{
    return lcp(suffixes[i], suffixes[i - 1]);
}

// longest common prefix of suffixes(i) and suffixes(j)
public int lcp(int i, int j)
{
    return lcp(suffixes[i], suffixes[j]);

}
}

public class DataStructuresOne
{

public static void main(String[] args) throws FileNotFoundException
{
    Scanner in1 = new Scanner(new File("./build/classes/WarAndPeace.txt"));
    Scanner in2 = new Scanner(new File("./build/classes/MobyDick.txt"));

    StringBuilder sb = new StringBuilder();
    StringBuilder sb1 = new StringBuilder();

    while (in1.hasNextLine())
    {
        sb.append(in1.nextLine());

    }
    while (in2.hasNextLine())
    {
        sb1.append(in2.nextLine());
    }

    String text1 = sb.toString().replaceAll("\\s+", " ");
    String text2 = sb1.toString().replaceAll("\\s+", " ");

    int N1 = text1.length();
    int N2 = text2.length();

    SuffixArray sa = new SuffixArray(text1 + "#" + text2);
    int N = sa.length();

    String substring = "";
    for (int i = 1; i < N; i++)
    {

        // adjacent suffixes both from second text string
        if (sa.select(i).length() <= N2 && sa.select(i - 1).length() <= N2)
        {
            continue;
        }

        // adjacent suffixes both from first text string
        if (sa.select(i).length() > N2 + 1 && sa.select(i - 1).length() > N2 + 1)
        {
            continue;
        }

        // check if adjacent suffixes longer common substring
        int length = sa.lcp(i);
        if (length > substring.length())
        {
            substring = sa.select(i).toString().substring(0, length);
            System.out.println(substring + " ");
        }
    }

    System.out.println("The length of the substring " + substring.length() + "length on    first N " + N1 + " length of Second N " + N2
            + "The length of the array sa: " + N);
   System.out.println("'" + substring + "'");

   final class MyString implements Comparable<MyString>
{

public MyString(String str)
{
    offset = 0;
    len = str.length();
    arr = str.toCharArray();
}

public int length()
{
    return len;
}

public char charAt(int idx)
{
    return arr[ idx + offset];
}

public int compareTo(MyString other)
{
    int myEnd = offset + len;
    int yourEnd = other.offset + other.len;
    int i = offset, j = other.offset;

    for (; i < myEnd && j < yourEnd; i++, j++)
    {
        if (arr[ i] != arr[ j])
        {
            return arr[ i] - arr[ j];
        }
    }

    // reached end.  Who got there first?
    if (i == myEnd && j == yourEnd)
    {
        return 0;   // identical strings
    }
    if (i == myEnd)
    {
        return -1;
    } else
    {
        return +1;
    }
}


public MyString substring(int beginIndex, int endIndex)
{
    return new MyString(arr, beginIndex + offset, endIndex - beginIndex);
}

public MyString substring(int beginIndex)
{
    return substring(beginIndex, offset + len);
}

public boolean equals(Object other)
{
    return (other instanceof MyString) && compareTo((MyString) other) == 0;
}

public String toString()
{
    return new String(arr, offset, len);
}

private MyString(char[] a, int of, int ln)
{
    arr = a;
    offset = of;
    len = ln;
}
private char[] arr;
private int offset;
private int len;
}

score 3 · Accepted Answer

ここ：

for (int i = 0; i < N; i++)
{
    suffixes[i] = snew.substring(i);
}

長い文字列全体だけでなく、文字列全体 - 1 文字、文字列全体 - 2 文字などを保存しようとしています。これらはすべて別々に保存されます。

文字列が 10 文字のみの場合、10 個の異なる文字列に合計 55 文字相当の文字を格納することになります。

1000 文字で、合計 500500 文字を格納しています。

より一般的には、長さ * (長さ + 1)/2 文字を処理する必要があります。

余談ですが、『戦争と平和』の登場人物数はわかりませんが、ページ数は約 1250 で、典型的な単語数/ページの見積もりは 250 で、平均単語の長さは約 5 文字です。

(1250 * 250 * 5)*(1250 * 250 * 5 + 1)/2 = 1.2207039 * 10^12 文字。

メモリ内の char のサイズは 2 バイトなので、サイズは約 2.22 TB になります (小説のテキストだけの場合は 1.49 MB です)。

score 1 · Accepted Answer

コードの最初の数行で、両方のテキストの少なくとも 3 つのコピーを数えます。ここにいくつかのアイデアがあります

巨大な文字列になった後ではなく、各行を読み取るときにスペースを変換します。行頭と行末のスペースも忘れずに。
String の代わりに StringBuilder をベースとして使用して MyString クラスを構築します。可能であれば、ネイティブメソッドを使用して StringBuilder 内のすべての検索を行います。
必要以上に文字列を抽出しないでください。

-Xmx Java ランタイムオプションを調べて、ヒープ領域をデフォルトより大きく設定してください。覚えてないのでググってください。-Xmx=1024M には最後に M が必要であることに注意してください。(ファイルサイズを見て、2 冊の本のサイズを確認してください。)

score 0 · Accepted Answer

MyString を作成するときarr = str.toCharArray();は、文字列の文字データの新しいコピーを作成するを呼び出します。しかし、Java では、文字列は不変です。では、データのコピーではなく、文字列への参照を格納してみませんか?

一度にすべての接尾辞を作成しますが、一度に参照できるのは 1 つ (つまり 2 つ) だけです。現在関心のあるサフィックスのみを参照するようにソリューションを再コーディングし、それらが必要な場合にのみそれらを構築する (その後それらへの参照を失う) 場合、それらは Java によってガベージコレクションされる可能性があります。これにより、メモリが不足する可能性が低くなります。2 つの文字列を格納するメモリオーバーヘッドと、数十万の文字列を格納するメモリオーバーヘッドを比較してください :)

score 0 · Accepted Answer

このプログラムは Scala で書きました。多分あなたはそれをJavaに翻訳することができます。

class MyString private (private val string: String, startIndex: Int, endIndex: Int) extends Comparable[MyString] {
  def this(string: String) = this(string, 0, string.length)
  def length() = endIndex-startIndex
  def charAt(i: Int) = {
    if(i >= length) throw new IndexOutOfBoundsException
    string.charAt(startIndex + i)
  }
  def substring(start: Int, end: Int): MyString = {
    if(start < 0 || end > length || end < start) throw new IndexOutOfBoundsException
    new MyString(string, startIndex + start, startIndex + end)
  }
  def substring(start: Int): MyString = substring(start, length)
  def longestCommonSubstring(other: MyString): MyString = {
    var index = 0
    val len = math.min(length, other.length)
    while(index < len && charAt(index) == other.charAt(index)) index += 1
    substring(0, index)
  }
  def compareTo(other: MyString): Int = {
    val len = math.min(length, other.length)
    for(i <- 0 until len) {
      if(charAt(i) > other.charAt(i)) return 1
      if(charAt(i) < other.charAt(i)) return -1
    }
    length-other.length
  }
  def >(other: MyString) = compareTo(other) > 0
  def <(other: MyString) = compareTo(other) < 0
  override def equals(other: Any) = other.isInstanceOf[MyString] && compareTo(other.asInstanceOf[MyString]) == 0
  override def toString() = "\"" + string.substring(startIndex, endIndex) + "\""
}

def readFile(name: String) = new MyString(io.Source.fromFile(name).getLines.mkString(" ").replaceAll("\\s+", " "))
def makeList(str: MyString) = (0 until str.length).map(i => str.substring(i)).toIndexedSeq

val string1 = readFile("WarAndPeace.txt")
val string2 = readFile("MobyDick.txt")

val (list1, list2) = (makeList(string1).sorted, makeList(string2).sorted)

var longestMatch = new MyString("")
var (index1, index2) = (0,0)
while(index1 < list1.size && index2 < list2.size) {
  val lcs = list1(index1).longestCommonSubstring(list2(index2)) 
  if(lcs.length > longestMatch.length) longestMatch = lcs
  if(list1(index1) < list2(index2)) index1 += 1
  else index2 += 1
}

println(longestMatch)

java - 大きなテキストの最長共通部分文字列

4 に答える 4

Related

Reference