python - N チャンク以下で最小限の長さでファイルを分割する

Question

ファイルがあり、それを最大 N 個の小さなファイルに分割する必要があり、最小のチャンクには少なくとも X バイトが必要であり、すべてのファイルが (ほぼ) 同じサイズである必要があるとします。

たとえば、N=4 で X=3 の文字列 'abcdefghij' を使用すると、['abcd', 'efg', 'hij'] が返されます。

3 chunks < 4 chunks
4 chars > 3 chars

x分割関数を作成しましたが、余分な文字列が 1 つ作成されることがあるため、そこで計算する代わりに値を渡す必要があります。

def split(string, n):
    x = len(string)//n
    return [string[i:i+x] for i in range(0, len(string), x)]

本当の問題は、最小バイト数でファイルをカットするためのチャンク数を計算する方法です。

def calculate(length, max_n, min_x):
    n, x = ...
    return n, x

この種のアクションを実行するための単純な既知のアルゴリズムはありますか?

実際には、チャンクの数を最大化したいので、ファイルが 1 バイト異なる必要はありません。

score 0 · Accepted Answer

「最小バイト数でファイルを切り取る」とはどういう意味ですか? 問題を完全に説明していないか、独自の解決策がありません。

あなたのソリューションが示すように、それは分割の問題です:Lが全長の場合、任意nののチャンクに分割できます。残り (必然的にn未満) は、他の文字よりも 1 文字多いチャンクの数を示します。たとえば、あなたの例では、3 つのチャンクのうちの 1 つが長くなっています。ただし、分割(剰余 3) して 7 つのチャンクを取得し、そのうちの 3 つが長くなります (長さは 1 ではなく 2)。または、あなたが書いているように、「チャンクの数を最大化する」ことに本当に傾倒している場合は、長さ1の10個のチャンクだけです。 n < L10 % 3 = 110 % 7

より一般的には、指定した任意の長さについてm、を選択するN = L // mと、チャンクの長さがmとm+1(または余りがないm場合L // mはだけ) になります。私が言ったように、それは単なる分割の問題です。

score 0 · Accepted Answer

シンプルか既知かはわかりませんが、これでうまくいくようです。セット内の以前の文字列に割り当てられた余分な文字を含む N 個の文字列を返します。

import itertools as it
s = 'abcdefhijklm'
def hunks(s, n):
    size, extra = divmod(len(s), n)
    i = 0
    extras = it.chain(it.repeat(1, extra), it.repeat(0))
    while i < len(s):
        e = next(extras)
        yield s[i:i + size + e]
        i += size + e
list(hunks(s, 4))

score 0 · Accepted Answer

def calculate(L, N, X):
    n = min(L//X, N)
    return n, L//n

編集：

def spread(seq, N=None, X=1):
    """Yield successive subsequences of seq having at least X elements.

    If N is specified, the number of subsequences yielded will not exceed N.

    The first L % X subsequences yielded (where L = len(seq)) will be longer
    by 1 than the remaining ones.

    >>> list(spread('abcdefghij', 4, 3))
    ['abcd', 'efg', 'hij']
    >>> list(spread('abcdefghijklmnopqrstuvwxyz', 4, 7))
    ['abcdefghi', 'jklmnopqr', 'stuvwxyz']

    seq    any object supporting len(...) and slice-indexing
    N      a positive integer (default: L)
    X      a positive integer not greater than L (default: 1)
    """

    # All error-checking code omitted

    L = len(seq)       # length of seq
    assert 0 < X <= L

    if N is None: N = L
    assert 0 < N

    # A total of n subsequences will be yielded, the first r of which will 
    # have length x + 1, and the remaining ones will have length x.

    # if we insist on using calculate()...
    # n, x = calculate(L, N, X)
    # r = L % n

    # ...but this entails separate computations of L//n and L%n; may as well
    # do both with a single divmod(L, n)
    n = min(L//X, N)
    x, r = divmod(L, n)

    start = 0
    stride = x + 1    # stride will revert to x when i == r
    for i in range(n):
        if i == r: stride = x
        finish = start + stride
        yield seq[start:finish]
        start = finish
    assert start == L

python - N チャンク以下で最小限の長さでファイルを分割する

3 に答える 3

Related

Reference