java - 2つの配列で同一のバイト[]オブジェクトを同時に見つける方法は?

Question

ハッシュに衝突攻撃を実装しようとしています (コース「暗号化」にアクセスしています)。したがって、ハッシュ (= byte-sequences byte[]) の 2 つの配列があり、両方の配列に存在するハッシュを見つけたいと考えています。いくつかの調査と多くの検討の後、シングルコアマシンでの最善の解決策はHashSet(最初の配列のすべての要素を追加しcontains、2 番目の配列の要素が既に存在するかどうかを確認する) ことであると確信しています。

ただし、8 コアと 12 GB RAM を搭載したマシンにアクセスできるため、同時実行ソリューションを実装したいと考えています。私が考えることができる最善の解決策は、Collections.newSetFromMap(new ConcurrentHashMap<A,B>()). このデータ構造を使用すると、最初の配列のすべての要素を並列に追加でき、すべての要素が追加された後contains、同一のハッシュを同時にチェックできます。

だから私の質問は、この正確な問題のために設計されたアルゴリズムを知っていますか? そうでない場合、そのような ConcurrentHashSet を問題や効果的なランタイムの複雑さに関して使用した経験はありますか? または、私を助けることができる別の事前構築されたデータ構造をお勧めできますか?

PS: 誰かが詳細に興味を持っている場合: Skandiumを使用してプログラムを並列化する予定です。

score 5 · Accepted Answer

のどの形式を使用しても、完全に時間の無駄になると思いますHashMap。さまざまなデータのマルチバイトハッシュを計算していると思いますが、これらは既にhashes であり、これ以上ハッシュを実行する必要はありません。

あなたはそれを述べていませんが、あなたのハッシュはbyteシーケンスであると推測しています。明らかに、これらを保管するには、トライまたはドーグのいずれかが理想的です。

したがって、 a を実装し、trie/dawgそれを使用してすべてのハッシュを最初の配列に格納することをお勧めします。次に、すべての計算能力を並行して使用して、 this の 2 番目の配列の各要素を検索できますtrie。ロックは必要ありません。

追加した

Dawgこれが私が一緒にノックした簡単な実装です。うまくいくようです。

public class Dawg {
  // All my children.
  Dawg[] children = new Dawg[256];
  // Am I a leaf.
  boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    Dawg loc = find ( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add(word.getBytes());
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte [] word ) {
    // Finds its location, no growing allowed.
    Dawg d = find ( word, 0, false );
    return d != null && d.isLeaf; 
  }

  // String form.
  public boolean contains ( String word ) {
    return contains(word.getBytes());
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private Dawg find ( byte [] word, int i, boolean grow ) {
    Dawg child = children[word[i]];
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new Dawg();
        children[word[i]] = child;
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find(word, i+1, grow);
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    Dawg d = new Dawg();
    d.add("H");
    d.add("Hello");
    d.add("World");
    d.add("Hell");
    System.out.println("Hello is "+(d.contains("Hello")?"in":"out"));
    System.out.println("World is "+(d.contains("World")?"in":"out"));
    System.out.println("Hell is "+(d.contains("Hell")?"in":"out"));
    System.out.println("Hal is "+(d.contains("Hal")?"in":"out"));
    System.out.println("Hel is "+(d.contains("Hel")?"in":"out"));
    System.out.println("H is "+(d.contains("H")?"in":"out"));
  }
}

追加した

これは、同時ロックフリーバージョンの良いスタートになる可能性があります。これらのことはテストが難しいことで有名なので、これが機能することを保証することはできませんが、私の考えでは、確実に機能するはずです.

import java.util.concurrent.atomic.AtomicReferenceArray;


public class LFDawg {
  // All my children.
  AtomicReferenceArray<LFDawg> children = new AtomicReferenceArray<LFDawg> ( 256 );
  // Am I a leaf.
  boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    LFDawg loc = find( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add( word.getBytes() );
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte[] word ) {
    // Finds its location, no growing allowed.
    LFDawg d = find( word, 0, false );
    return d != null && d.isLeaf;
  }

  // String form.
  public boolean contains ( String word ) {
    return contains( word.getBytes() );
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private LFDawg find ( byte[] word, int i, boolean grow ) {
    LFDawg child = children.get( word[i] );
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new LFDawg();
        if ( !children.compareAndSet( word[i], null, child ) ) {
          // Someone else got there before me. Get the one they set.
          child = children.get( word[i] );
        }
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find( word, i + 1, grow );
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    LFDawg d = new LFDawg();
    d.add( "H" );
    d.add( "Hello" );
    d.add( "World" );
    d.add( "Hell" );
    System.out.println( "Hello is " + ( d.contains( "Hello" ) ? "in" : "out" ) );
    System.out.println( "World is " + ( d.contains( "World" ) ? "in" : "out" ) );
    System.out.println( "Hell is " + ( d.contains( "Hell" ) ? "in" : "out" ) );
    System.out.println( "Hal is " + ( d.contains( "Hal" ) ? "in" : "out" ) );
    System.out.println( "Hel is " + ( d.contains( "Hel" ) ? "in" : "out" ) );
    System.out.println( "H is " + ( d.contains( "H" ) ? "in" : "out" ) );
  }
}

score 0 · Accepted Answer

より単純なアプローチは、最初の配列を N 個の等しい (またはほぼ等しい) 部分に分割することです (8 コアの場合、n=8 が妥当と思われます)。次に、2番目の配列のハッシュがN個の小さなサブファースト配列に存在するかどうかを調べることにより、「通常の」方法でプログラムを解決します。これは並行して行うことができます。

そうは言っても、try/dawgs について聞いたことがなく、主な議論が魅力的で有益であることがわかりました。（私は主に言葉ではなく数字を扱います）

これは、元のファイルを実際に分割して並列処理できるように、byte[] ハッシュが有限で短い長さであることを前提としています。そうですか？

編集が追加されました

このアイデアの例については、 Wen-Mei W. Hwu が編集したGPU Graphics Gemsの第 11 章、Ligowski、Rudnicki、Liu、および Schmidt による記事を参照してください。彼らは、巨大な単一データベースを多くの小さな部分に分割することで、大規模なタンパク質配列データベース検索を並列化し、各サブピースに対して通常のアルゴリズムを実行します。私はこの引用が好きです. 「説明されているアルゴリズムは恥ずかしいほど並列です」. 彼らの場合、彼らは CUDA を使用し、多くのメモリ最適化を行う必要がありましたが、原則はマルチコアマシンにも当てはまるはずです。

半疑似コードが続きます。受信する byte[] ハッシュに Lists を使用します。問題ないことを願っています

独自の1芯方式

originalProcess(List<byte[]> list1, List<byte[]> list2) {
   HashSet<byte[]> bigHugeHashOfList1 = new HashSet<byte[]>();
   bigHugeHashOfList1.addAll(list1);
   for (byte[] hash : list2)
      if (bigHugeHashOfList1.contains(hash)
         // do something
}

新しい方法。まったく同じプロセス方法を使用します（後で）。ここには DAWGS や TRIES はありません...

preprocess(List<byte[]> list1, List<byte[]> list2) {
   List<byte[]>[] splitLists = new ArrayList<byte[]>[8];
   for (int i=0; i<8; i++)
      splitLists[i] = new ArrayList<byte[]>();
   for (byte[] hash : list1) {
      int idx = hash[0]&7; // I'm taking the 3 low order bits, YMMV
      splitLists[idx].add(hash);
      // a minor speedup would be to create the HashSet here instead of in originalProcess()
   }

   // now, using your favorite parallel/concurrency technique,
   // do the equivalent of
   for (int i=0; i<8; i++)
      originalProcess(splitLists[i], list2);
}

java - 2つの配列で同一のバイト[]オブジェクトを同時に見つける方法は?

2 に答える 2

Related

Reference