java - メモリ効率の高い複数値マップ

Question

こんにちは、次の問題があります。文字列と対応する整数値のリストをにMultiValueMap<String, Integer> 格納しています。約 13 000 000 百万の文字列を格納しています。1 つの文字列には最大 500 以上の値を含めることができます。単一の値ごとに、マップ上でランダムにアクセスできます。したがって、最悪のケースは 13 000 000* 500 プットコールです。これで、マップの速度は向上しましたが、メモリオーバーヘッドがかなり高くなります。AMultiValueMap<String, Integer>は a 以外の何物でもありませんHashMap/TreeMap<String, <ArrayList<Integer>>。HashMap と TreeMap の両方に、かなりのメモリオーバーヘッドがあります。完了したらマップを変更するつもりはありませんが、プログラムでのランダムアクセスのために、マップを高速かつできるだけ小さくする必要があります。(私はそれをディスクに保存し、起動時にロードしています。シリアル化されたマップファイルは約 600 MB を占めますが、メモリでは約 3 GB ですか?)

最もメモリ効率の良い方法は、文字列をソートされた文字列配列に格納し、値に対応する 2 次元の int 配列を持つことです。したがって、アクセスは文字列配列のバイナリ検索であり、対応する値を取得します。

現在、そこに到達する方法は 3 つあります。

作成フェーズでは、並べ替えられた MultivalueMap (TreeMap) を使用してすべてを格納します。すべての値の取得が完了したらmap.keyset().toArray(new String[0]);、[2 次元の int 配列を作成し、multivaluemap からすべての値を取得する] を呼び出して文字列配列を取得します。長所: 実装が簡単で、作成中も高速です。短所: マップから配列へのコピー中に、さらに多くのメモリを消費します。
最初から配列またはおそらく ArrayList を使用し、そこにすべてを格納します Pro: 最小のメモリオーバーヘッド。短所：新しいキーを追加するたびに配列をソート/コピーする必要があるため、これは非常に遅くなります。また、対応するint配列を同じ順序に保つために、独自の（おそらくさらに遅い）ソートを実装する必要があります。弦。実装が難しい
配列と MultivalueMap をバッファーとして使用します。プログラムが作成フェーズの 10% または 20% を終了したら、配列に値を追加してそれらを整理し、新しいマップを開始します。長所: おそらくまだ十分に高速で、メモリ効率も十分です。短所：実装が難しい。

これらの解決策のどれも、私にとって本当に正しいとは思えません。この問題に対する他の解決策、おそらくメモリ効率の良い (MultiValue)Map 実装を知っていますか?

データベースを使用できることはわかっているので、わざわざ回答として投稿しないでください。データベースを使用せずにこれを行う方法を知りたいです。

score 5 · Accepted Answer

Guava の Multimap に切り替えた場合 (アプリケーションでそれが可能かどうかはわかりません)、Trove を使用して取得できる可能性があります。

ListMultimap<String, Integer> multimap = Multimaps.newListMultimap(
  new HashMap<String, Collection<Integer>>(),
  new Supplier<List<Integer>>() {
    public List<Integer> get() {
      return new TIntListDecorator();
    }
  });

これにより、 aListMultimapを使用して配列に基づく値HashMapにマップされるa が作成されます。これはメモリ効率が高いはずですが、ボクシングのためにわずかな速度のペナルティが発生します。についても同様のことができるかもしれませんが、それがどのライブラリからのものかはわかりません。Listint[]MultiValueMap

score 2 · Accepted Answer

キー文字列、特に一般的なルートにパターンがある場合、aaTrieは大幅に少ないデータを保存する効果的な方法である可能性があります。

動作するTrieMapのコードは次のとおりです。

注意： sEntrySetを反復処理するために使用することに関する通常のアドバイスは、sには適用されMapませんTrie。それらは非常に非効率的であるTrieため、可能な限り要求することは避けてください。

/**
 * Implementation of a Trie structure.
 * 
 * A Trie is a compact form of tree that takes advantage of common prefixes
 * to the keys. 
 * 
 * A normal HashSet will take the key and compute a hash from it, this hash will
 * be used to locate the value through various methods but usually some kind
 * of bucket system is used. The memory footprint resulting becomes something
 * like O(n).
 * 
 * A Trie structure essentuially combines all common prefixes into a single key.
 * For example, holding the strings A, AB, ABC and ABCD will only take enough 
 * space to record the presence of ABCD. The presence of the others will be 
 * recorded as flags within the record of ABCD structure at zero cost.
 * 
 * This structure is useful for holding similar strings such as product IDs or
 * credit card numbers.
 *  
 */
public class TrieMap<V> extends AbstractMap<String, V> implements Map<String, V> {


  /**
   * Map each character to a sub-trie.
   * 
   * Could replace this with a 256 entry array of Tries but this will handle
   * multibyte character sets and I can discard empty maps. 
   * 
   * Maintained at null until needed (for better memory footprint).
   * 
   */
  protected Map<Character, TrieMap<V>> children = null;

  /**
   * Here we store the map contents.
   */
  protected V leaf = null;

  /**
   * Set the leaf value to a new setting and return the old one.
   * 
   * @param newValue
   * @return old value of leaf.
   */
  protected V setLeaf(V newValue) {
    V old = leaf;
    leaf = newValue;
    return old;
  }

  /**
   * I've always wanted to name a method something like this.
   */
  protected void makeChildren () {
    if ( children == null ) {
      // Use a TreeMap to ensure sorted iteration.
      children = new TreeMap<Character, TrieMap<V>>();
    }
  }

  /**
   * Finds the TrieMap that "should" contain the key.
   * 
   * @param key 
   * 
   * The key to find.
   * 
   * @param grow 
   * 
   * Set to true to grow the Trie to fit the key.
   * 
   * @return 
   * 
   * The sub Trie that "should" contain the key or null if key was not found and
   * grow was false.
   */
  protected TrieMap<V> find(String key, boolean grow) {
    if (key.length() == 0) {
      // Found it!
      return this;
    } else {
      // Not at end of string.
      if (grow) {
        // Grow the tree.
        makeChildren();
      }
      if (children != null) {
        // Ask the kids.
        char ch = key.charAt(0);
        TrieMap<V> child = children.get(ch);
        if (child == null && grow) {
          // Make the child.
          child = new TrieMap<V>();
          // Store the child.
          children.put(ch, child);
        }
        if (child != null) {
          // Find it in the child.
          return child.find(tail(key), grow);
        }
      }
    }
    return null;

  }

  /**
   * Remove the head (first character) from the string.
   * 
   * @param s
   * 
   * The string.
   * 
   * @return 
   * 
   * The same string without the first (head) character.
   * 
   */
  // Suppress warnings over taking a subsequence
  private String tail(String s) {
    return s.substring(1, s.length());
  }

  /**
   * 
   * Add a new value to the map.
   * 
   * Time footprint = O(s.length).
   * 
   * @param s
   * 
   * The key defining the place to add.
   * 
   * @param value
   * 
   * The value to add there.
   * 
   * @return
   * 
   * The value that was there, or null if it wasn't.
   * 
   */
  @Override
  public V put(String key, V value) {
    V old = null;

    // If empty string.
    if (key.length() == 0) {
      old = setLeaf(value);
    } else {
      // Find it.
      old = find(key, true).put("", value);
    }

    return old;
  }

  /**
   * Gets the value at the specified key position.
   * 
   * @param o
   * 
   * The key to the location.
   * 
   * @return 
   * 
   * The value at that location, or null if there is no value at that location.
   */
  @Override
  public V get(Object o) {
    V got = null;
    if (o != null) {
      String key = (String) o;
      TrieMap<V> it = find(key, false);
      if (it != null) {
        got = it.leaf;
      }
    } else {
      throw new NullPointerException("Nulls not allowed.");
    }
    return got;
  }

  /**
   * Remove the value at the specified location.
   * 
   * @param o
   * 
   * The key to the location.
   * 
   * @return 
   * 
   * The value that was removed, or null if there was no value at that location.
   */
  @Override
  public V remove(Object o) {
    V old = null;
    if (o != null) {
      String key = (String) o;
      if (key.length() == 0) {
        // Its me!
        old = leaf;
        leaf = null;
      } else {
        TrieMap<V> it = find(key, false);
        if (it != null) {
          old = it.remove("");
        }
      }
    } else {
      throw new NullPointerException("Nulls not allowed.");
    }
    return old;
  }

  /**
   * Count the number of values in the structure.
   * 
   * @return 
   * 
   * The number of values in the structure.
   */
  @Override
  public int size() {
    // If I am a leaf then size increases by 1.
    int size = leaf != null ? 1 : 0;
    if (children != null) {
      // Add sizes of all my children.
      for (Character c : children.keySet()) {
        size += children.get(c).size();
      }
    }
    return size;
  }

  /**
   * Is the tree empty?
   * 
   * @return 
   * 
   * true if the tree is empty.
   * false if there is still at least one value in the tree.
   */
  @Override
  public boolean isEmpty() {
    // I am empty if I am not a leaf and I have no children 
    // (slightly quicker than the AbstaractCollection implementation).
    return leaf == null && (children == null || children.isEmpty());
  }

  /**
   * Returns all keys as a Set.
   * 
   * @return 
   * 
   * A HashSet of all keys.
   * 
   * Note: Although it returns Set<S> it is actually a Set<String> that has been
   * home-grown because the original keys are not stored in the structure 
   * anywhere.
   */
  @Override
  public Set<String> keySet() {
    // Roll them a temporary list and give them a Set from it.
    return new HashSet<String>(keyList());
  }

  /**
   * List all my keys.
   * 
   * @return 
   * 
   * An ArrayList of all keys in the tree.
   * 
   * Note: Although it returns List<S> it is actually a List<String> that has been
   * home-grown because the original keys are not stored in the structure 
   * anywhere.
   * 
   */
  protected List<String> keyList() {
    List<String> contents = new ArrayList<String>();

    if (leaf != null) {
      // If I am a leaf, a null string is in the set.
      contents.add((String) "");
    }

    // Add all sub-tries.
    if (children != null) {
      for (Character c : children.keySet()) {
        TrieMap<V> child = children.get(c);
        List<String> childContents = child.keyList();
        for (String subString : childContents) {
          // All possible substrings can be prepended with this character.
          contents.add((String) (c + subString.toString()));
        }
      }
    }

    return contents;
  }

  /**
   * Does the map contain the specified key.
   * 
   * @param key
   * 
   * The key to look for.
   * 
   * @return 
   * 
   * true if the key is in the Map.
   * false if not.
   */
  public boolean containsKey(String key) {
    TrieMap<V> it = find(key, false);
    if (it != null) {
      return it.leaf != null;
    }
    return false;
  }

  /**
   * Represent me as a list.
   * 
   * @return 
   * 
   * A String representation of the tree.
   */
  @Override
  public String toString() {
    List<String> list = keyList();
    //Collections.sort((List<String>)list);
    StringBuilder sb = new StringBuilder();
    Separator comma = new Separator(",");
    sb.append("{");
    for (String s : list) {
      sb.append(comma.sep()).append(s).append("=").append(get(s));
    }
    sb.append("}");
    return sb.toString();
  }

  /**
   * Clear down completely.
   */
  @Override
  public void clear() {
    children = null;
    leaf = null;
  }

  /**
   * Return a list of key/value pairs.
   * 
   * @return 
   * 
   * The entry set.
   */
  public Set<Map.Entry<String, V>> entrySet() {
    Set<Map.Entry<String, V>> entries = new HashSet<Map.Entry<String, V>>();
    List<String> keys = keyList();
    for (String key : keys) {
      entries.add(new Entry<String,V>(key, get(key)));
    }
    return entries;
  }

  /**
   * An entry.
   * 
   * @param <S>
   * 
   * The type of the key.
   * 
   * @param <V> 
   * 
   * The type of the value.
   */
  private static class Entry<S, V> implements Map.Entry<S, V> {

    protected S key;
    protected V value;

    public Entry(S key, V value) {
      this.key = key;
      this.value = value;
    }

    public S getKey() {
      return key;
    }

    public V getValue() {
      return value;
    }

    public V setValue(V newValue) {
      V oldValue = value;
      value = newValue;
      return oldValue;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof TrieMap.Entry)) {
        return false;
      }
      Entry e = (Entry) o;
      return (key == null ? e.getKey() == null : key.equals(e.getKey()))
              && (value == null ? e.getValue() == null : value.equals(e.getValue()));
    }

    @Override
    public int hashCode() {
      int keyHash = (key == null ? 0 : key.hashCode());
      int valueHash = (value == null ? 0 : value.hashCode());
      return keyHash ^ valueHash;
    }

    @Override
    public String toString() {
      return key + "=" + value;
    }
  }
}

score 2 · Accepted Answer

まず、整数が使用するメモリについて考えます。範囲は約0〜4000000になるとおっしゃいました。16777216 個の異なる値を表すには、24 ビットで十分です。それが許容できる場合は、整数ごとに 3 バイトのバイト配列を使用して、25% 節約できます。次のような配列にインデックスを付ける必要があります。

int getPackedInt(byte[] array, int index) {
  int i = index*3;
  return ((array[i] & 0xFF)<<16) + ((array[i+1] & 0xFF) <<8) + (array[i+2] & 0xFF);
}
int storePackedInt(byte[] array, int index, int value) {
  assert value >= 0 && value <= 0xFFFFFF;
  int i = index*3;
  array[i]   = (byte)((value>>16) & 0xFF);
  array[i+1] = (byte)((value>>8)  & 0xFF);
  array[i+2] = (byte)(value & 0xFF);
}

整数の分布について何か言えますか? それらの多くが 16 ビットに収まる場合は、数値ごとに可変バイト数のエンコーディングを使用できます (UTF-8 が文字を表すために行うようなものです)。

次に、文字列のメモリを節約できるかどうかを検討します。ストリングスの特徴は？彼らは通常どのくらいの期間ですか？多くの文字列がプレフィックスを共有しますか? アプリケーションの特性に合わせて調整された圧縮スキームは、多くのスペースを節約できます (falsarella が指摘したように)。または、多くの文字列がプレフィックスを共有する場合は、それらをある種の検索トライに格納する方が効率的です。(このアプリケーションに適した「パトリシア」と呼ばれるタイプのトライがあります。) おまけとして、トライで文字列を検索する方が、ハッシュマップを検索するよりも高速であることに注意してください (ただし、確認するにはベンチマークを実行する必要があります)。それがあなたのアプリケーションに当てはまる場合）。

文字列はすべて ASCII になりますか? その場合、Javacharは 16 ビットであるため、文字列に使用されるメモリの 50% が無駄になります。この場合も、バイト配列の使用を検討できます。

格納された文字列を反復処理するのではなく、文字列を検索するだけでよい場合は、型にはまらない方法を検討することもできます。つまり、文字列をハッシュし、ハッシュのみを保持します。異なる文字列が同じ値にハッシュされる可能性があるため、格納されていない文字列が検索によって「見つかる」可能性があります。しかし、ハッシュ値 (および優れたハッシュ関数) に十分なビット数を使用すると、その可能性を非常に小さくすることができるため、推定された宇宙の寿命の中でほぼ確実に発生することはありません。

最後に、文字列と整数を保持する構造体自体のメモリがあります。私はすでにトライを使用することを提案しましたが、そうしないことにした場合、並列配列よりも少ないメモリを使用するものは何もありません.整数の配列。二分検索を実行して String 配列へのインデックスを見つけた後、同じインデックスを使用して整数配列の配列にアクセスできます。

構造を構築しているときに、検索トライが適切な選択であると判断した場合は、それを直接使用します。それ以外の場合は、2 つのパスを実行できます。1 つは文字列のセットを構築し (次にそれらを配列に入れて並べ替える)、2 番目のパスは整数の配列を追加します。

score 2 · Accepted Answer

圧縮された文字列を使用して、メモリ使用量を大幅に削減できます。

さらに、他のより抜本的な解決策があります (再実装が必要になります)。

NoSQL データベースに関するメモリディスクベースのリストの実装または提案。

score 2 · Accepted Answer

マップに格納する Integer 値によっては、プリミティブな int 値よりもはるかに多くの RAM を占有する個別の Integer インスタンスが原因で、大量のヒープメモリオーバーヘッドが発生する場合があります。

Integer インスタンスの配列に基づくのではなく、基本的にint array に基づく List を実装する( ColtやJava のプリミティブコレクションなど)に浮かぶ多くの実装のいずれかを使用することMapを検討してください。StringIntArrayList

java - メモリ効率の高い複数値マップ

5 に答える 5

Related

Reference