python - Fastest way to compute entropy in Python

Question

In my project I need to compute the entropy of 0-1 vectors many times. Here is my code:

import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return - np.sum(probs * np.log(probs)) / np.log(n_classes)

Is there a faster way?

score 39 · Accepted Answer

With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is straightforward:

import pandas as pd
import scipy.stats

def ent(data):
    """Calculates entropy of the passed `pd.Series`
    """
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy

Note: scipy.stats.entropy normalizes the data it is given, so you do not need to do this explicitly. In other words, passing an array of counts works fine.
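To illustrate the normalization note above, here is a minimal sketch. The counts are made up for illustration. Raw counts and hand-normalized probabilities should give the same result, and the `base` argument of `scipy.stats.entropy` switches from nats to bits.

```python
import numpy as np
import scipy.stats

counts = np.array([2, 2, 4])      # hypothetical raw counts per class
probs = counts / counts.sum()     # the same distribution, normalized by hand

h_counts = scipy.stats.entropy(counts)        # natural log (nats) by default
h_probs = scipy.stats.entropy(probs)          # identical, since counts get normalized
h_bits = scipy.stats.entropy(counts, base=2)  # same entropy expressed in bits

print(h_counts, h_probs, h_bits)
```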

score 18 · Accepted Answer

An answer that does not rely on numpy either:

import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()

    for d in data:
        counts[d] += 1

    ent = 0

    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent

This accepts any data type you throw at it (the items only need to be hashable so that Counter can count them):

>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005

>>> eta([c for c in "mary had a little lamb"])
2.311097886212714
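The same idea can be written more compactly by letting `Counter` do the counting in its constructor. This is only a sketch: `eta_compact` and its `base` parameter are hypothetical names, not part of the answer above.

```python
import math
from collections import Counter

def eta_compact(data, base=math.e):
    """Shannon entropy of an iterable of hashable items (hypothetical helper)."""
    data = list(data)            # allow generators as well as sequences
    n = len(data)
    if n <= 1:
        return 0.0
    counts = Counter(data)       # counts every distinct item in one pass
    # every count is at least 1, so no probability is zero here
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())
```

For example, `eta_compact([0, 1, 0, 1], base=2)` measures the entropy in bits instead of nats.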

The answer provided by @Jarad also suggested timings. To that end:

import timeit
import numpy as np  # only used to average the timings below

repeat_number = 1000000
e = timeit.repeat(
    stmt='''eta(labels)''', 
    setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''', 
    repeat=3, 
    number=repeat_number)

timeit results (I believe this is ~4x faster than the best numpy approach):

print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))

Method: eta, Avg.: 10.461799
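To compare approaches on your own data, a small harness along these lines may help. It is a sketch under assumptions: `entropy_counter` and `entropy_bincount` are simplified stand-ins for the functions in this thread, and `best_time` is a hypothetical helper. It reports the minimum over the repeats, which the `timeit` documentation recommends over the mean.

```python
import math
import timeit
from collections import Counter

import numpy as np

def entropy_counter(labels):
    """Pure-Python entropy in nats (stand-in for the Counter-based answer)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def entropy_bincount(labels):
    """NumPy entropy in nats; expects small non-negative ints."""
    counts = np.bincount(labels)
    probs = counts[counts > 0] / len(labels)
    return -np.sum(probs * np.log(probs))

def best_time(func, labels, number=1000, repeat=3):
    """Best total time in seconds for `number` calls, over `repeat` runs."""
    return min(timeit.repeat(lambda: func(labels), number=number, repeat=repeat))

short = [1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5]  # same labels as in the setup above

if __name__ == "__main__":
    for func in (entropy_counter, entropy_bincount):
        print('Method: {}, Best: {:.6f}'.format(func.__name__, best_time(func, short)))
```

Results depend heavily on the input length and type, so it is worth timing with data shaped like your own rather than relying on a 12-element list.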

score 12 · Accepted Answer

Following the suggestion from unutbu, I created a pure Python implementation (only the counting step still uses np.bincount):

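from math import log

import numpy as np

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute standard entropy, normalized by using n_classes as the log base.
    for i in probs:
        if i > 0:  # skip empty bins; log(0) is undefined
            ent -= i * log(i, n_classes)  # math.log takes the base positionally

    return ent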

The point I was missing is that labels is a large array, whereas probs is only 3 or 4 elements long. Using pure Python for that part made my application twice as fast.

score 9 · Accepted Answer

My favorite function for entropy is the following:

def entropy(labels):
    prob_dict = {x:labels.count(x)/len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))

    return - probs.dot(np.log2(probs))

I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I will comment again if I find one.
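One possible way to skip the dict -> values -> list -> np.array chain is `np.unique` with `return_counts=True`, which returns the counts as an array directly. This is a sketch rather than the answerer's method, and `entropy_unique` is a hypothetical name.

```python
import numpy as np

def entropy_unique(labels):
    """Shannon entropy in bits via np.unique (hypothetical variant)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    # return_counts=True yields the occurrence count of each distinct value
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / labels.size
    return float(-probs.dot(np.log2(probs)))
```

Unlike `labels.count(x)` inside a comprehension, this counts all values in a single call, and it also works for non-integer labels such as strings.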

score 2 · Accepted Answer

This is the fastest Python implementation I have found so far:

import numpy as np

def entropy(labels):
    ps = np.bincount(labels) / len(labels)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])
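Since the question is specifically about 0-1 vectors, the entropy can also be written in closed form from the fraction of ones. The sketch below assumes the input contains only 0s and 1s; `binary_entropy` is a hypothetical name, and whether it is actually faster should be checked with a timing on your own data.

```python
import math

import numpy as np

def binary_entropy(labels):
    """Entropy in bits of a 0-1 vector (sketch; assumes only 0s and 1s)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = np.count_nonzero(labels) / n   # fraction of ones
    if p == 0.0 or p == 1.0:
        return 0.0                     # a single class carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

When both classes are present, this agrees with the function in the question, because dividing by `np.log(n_classes)` with two classes converts the result to bits.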

score 1 · Accepted Answer

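BiEntropy will not be the fastest way of computing entropy, but it is rigorous and builds on Shannon entropy in a well-defined way. It has been tested in various fields, including image-related applications. It is implemented in Python on GitHub.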

score 0 · Accepted Answer

The answers above are good, but if you need a version that can operate along different axes, here is a working implementation.

import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is 
    an array-like of nonnegative ints whose max value is approximately 
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A                   # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A) # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0. # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
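    elif axis == 0:
        # a list is needed here: in Python 3, map() returns an iterator,
        # which np.array does not expand into an array of values
        entropies = [entropy(col) for col in A.T]
        return np.array(entropies)
    elif axis == 1:
        entropies = [entropy(row) for row in A]
        return np.array(entropies).reshape((-1, 1))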
    else:
        raise ValueError("unsupported axis: {}".format(axis))
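As an alternative to branching on the axis by hand, `np.apply_along_axis` can apply a 1-D entropy function to every slice along a given axis. This is a sketch with hypothetical helper names (`entropy_1d`, `entropy_along`). It assumes a non-empty array of small non-negative ints, and unlike the version above it returns a flat array for `axis=1` as well.

```python
import numpy as np

def entropy_1d(values):
    """Shannon entropy in bits of a 1-D array of small non-negative ints."""
    counts = np.bincount(values)
    counts = counts[counts > 0]
    if counts.size <= 1:
        return 0.0               # one class (or no data): zero entropy
    probs = counts / values.size
    return float(-np.sum(probs * np.log2(probs)))

def entropy_along(A, axis=0):
    """Apply entropy_1d to every 1-D slice of A along `axis` (hypothetical helper)."""
    A = np.asarray(A)
    return np.apply_along_axis(entropy_1d, axis, A)
```

With `A = np.array([[0, 0], [1, 1]])`, `entropy_along(A, axis=0)` computes one entropy per column and `entropy_along(A, axis=1)` one per row.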
