python - Grouping an iterable by a predicate in Python

Question

I am parsing a file like this:

- header -
data1
data2
- header -
data3
data4
data5
- header -
- header -
...

And I want groups like this:

[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]

so that I can iterate over them like this:

for grp in group(open('file.txt'), lambda line: 'header' in line):
    for item in grp:
        process(item)

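and keep the group-detection logic separate from the group-processing logic.

However, I need an iterable of iterables, because the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. This seems like a common task, but I can't find an efficient Pythonic implementation.

Here is the dumb append-to-a-list implementation: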

def group(iterable, isstart=lambda x: x):
    """Group `iterable` into groups starting with items where `isstart(item)` is true.

    Start items are included in the group.  The first group may or may not have a 
    start item.  An empty `iterable` results in an empty result (zero groups)."""
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items)

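I feel there must be a nice itertools version, but it eludes me. The "obvious" (?!) groupby solution doesn't seem to work, because there can be adjacent headers and they need to go into separate groups. The best I can come up with is (ab)using groupby with a key function that keeps a counter: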

def igroup(iterable, isstart=lambda x: x):
    def keyfunc(item):
        if isstart(item):
            keyfunc.groupnum += 1       # Python 2's closures leave something to be desired
        return keyfunc.groupnum
    keyfunc.groupnum = 0
    return (group for _, group in itertools.groupby(iterable, keyfunc))

But I feel Python can do better, and sadly this is even slower than the dumb list version:

# ipython
%time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s

%time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s
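
The adjacency problem with a plain groupby call is easy to see on a tiny input. This is a Python 3 sketch; the sample `lines` list is made up for illustration:

```python
from itertools import groupby

lines = ["header", "data1", "data2", "header", "header", "data3"]
isstart = lambda line: "header" in line

# Keying directly on the predicate only separates headers from non-headers,
# so the two adjacent headers collapse into a single run.
naive = [list(run) for _, run in groupby(lines, isstart)]
print(naive)
# [['header'], ['data1', 'data2'], ['header', 'header'], ['data3']]
```

The wanted result is [['header', 'data1', 'data2'], ['header'], ['header', 'data3']], which is why the key function has to change its value at every start item instead of just reporting the predicate.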

To make it easy for you, here is some unit test code:

class Test(unittest.TestCase):
    def test_group(self):
        MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
        isstart = lambda x: x == 0
        self.assertEqual(next(igroup([], isstart), None), None)
        self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
        self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
        self.assertEqual(len(list(igroup([0,1,2] * 3, isstart))), 3)        # Catch hangs when groups are not consumed
        for _ in xrange(NUMTRIALS):
            expected, items = itertools.tee(itertools.starmap(random.randint, itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
            for grpnum, grp in enumerate(igroup(items, isstart)):
                start = next(grp)
                self.assertTrue(isstart(start) or grpnum == 0)
                self.assertEqual(start, next(expected))
                for item in grp:
                    self.assertFalse(isstart(item))
                    self.assertEqual(item, next(expected))

So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?

score 5 · Accepted Answer

"How can I subgroup an iterable by a predicate elegantly and efficiently in Python?"

Here's a concise, memory-efficient implementation that is very similar to the one from your question:

from itertools import groupby, imap
from operator import itemgetter

def igroup(iterable, isstart):
    def key(item, count=[False]):
        if isstart(item):
            count[0] = not count[0]  # start new group
        return count[0]
    return imap(itemgetter(1), groupby(iterable, key))

It supports infinite groups.
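
A boolean key is enough because groupby only compares each key with the previous one, so flipping it at every start item opens a new run. As a hedged illustration, here is a Python 3 port of the same idea (using `nonlocal` instead of the mutable default argument, which is my substitution) run against inputs that never end:

```python
from itertools import count, groupby, islice

def igroup(iterable, isstart):
    flag = False

    def key(item):
        nonlocal flag
        if isstart(item):
            flag = not flag  # flip the key so groupby starts a new run
        return flag

    return (grp for _, grp in groupby(iterable, key))

# count() never ends, yet only the items actually requested are pulled.
groups = igroup(count(), lambda x: x % 3 == 0)
first_three = [list(g) for g in islice(groups, 3)]
print(first_three)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# A single group may itself be endless; it is still consumed lazily.
endless = next(igroup(count(1), lambda x: False))
head = list(islice(endless, 5))
print(head)  # [1, 2, 3, 4, 5]
```

Each group has to be consumed before the outer iterator is advanced, as with any groupby-based code.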

A tee-based solution is slightly faster, but it consumes memory for the current group (similar to the list-based solution from the question):

from itertools import islice, tee

def group(iterable, isstart):
    it, it2 = tee(iterable)
    count = 0
    for item in it:
        if isstart(item) and count:
            gr = islice(it2, count)
            yield gr
            for _ in gr:  # skip to the next group
                pass
            count = 0
        count += 1
    if count:
        gr = islice(it2, count)
        yield gr
        for _ in gr:  # skip to the next group
            pass

The groupby-based solution could be implemented in pure Python:

def igroup_inline_key(iterable, isstart):
    it = iter(iterable)

    def grouper():
        """Yield items from a single group."""
        while not p[START]:
            yield p[VALUE]  # each group has at least one element (a header)
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])

    p = [None]*2 # workaround the absence of `nonlocal` keyword in Python 2.x
    START, VALUE = 0, 1
    p[VALUE] = next(it)
    while True:
        p[START] = False # to distinguish EOF and a start of new group
        yield grouper()
        while not p[START]: # skip to the next group
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])

To avoid repeating the code, the while True loop could be written as:

while True:
    p[START] = False  # to distinguish EOF and a start of new group
    g = grouper()
    yield g
    if not p[START]:  # skip to the next group
        for _ in g:
            pass
        if not p[START]:  # EOF
            break

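Though the previous variant might be more explicit and readable.

I think a general memory-efficient solution in pure Python won't be significantly faster than the groupby-based one.

If process(item) is fast compared to igroup() and a header can be found efficiently in a string (e.g., a fixed static header), then you could improve performance by reading your file in large chunks and splitting on the header value. That should make your task IO-bound.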
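
A rough Python 3 sketch of that chunked approach follows. The names `chunked_groups`, `_split_complete` and `chunk_size` are hypothetical, and it assumes the header is a fixed string that cannot overlap with itself. Note that it gives up laziness inside a group: each group is materialized as one string.

```python
import io

def _split_complete(text, header):
    """Split text that ends on a group boundary into whole groups."""
    pieces = text.split(header)
    if pieces[0]:
        yield pieces[0]            # data that precedes the first header
    for piece in pieces[1:]:
        yield header + piece       # put the header back in front of its data

def chunked_groups(stream, header, chunk_size=1 << 16):
    """Yield each group as one string, reading `stream` in large chunks."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        cut = buf.rfind(header)    # start of the last, possibly unfinished, group
        if cut > 0:
            yield from _split_complete(buf[:cut], header)
            buf = buf[cut:]        # a header split across chunks stays in buf
    if buf:
        yield from _split_complete(buf, header)

sample = "- header -\ndata1\ndata2\n- header -\ndata3\n- header -\n- header -\n"
for grp in chunked_groups(io.StringIO(sample), "- header -\n", chunk_size=7):
    print(repr(grp))
```

The tiny `chunk_size` in the usage lines is only there to exercise the case where a header straddles two reads; for a real file a much larger value would be the point of the exercise.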

score 4 · Accepted Answer

I didn't read all of your code, but I think this might be of some help:

from itertools import izip, tee, chain


def pairwise(iterable):
    a, b = tee(iterable)
    return izip(a, chain(b, [next(b, None)]))


def group(iterable, isstart):

    pairs = pairwise(iterable)

    def extract(current, lookahead, pairs=pairs, isstart=isstart):
        yield current
        if isstart(lookahead):
            return
        for current, lookahead in pairs:
            yield current
            if isstart(lookahead):
                return

    for start, lookahead in pairs:
        gen = extract(start, lookahead)
        yield gen
        for _ in gen:
            pass


for gen in group(xrange(4, 16), lambda x: x % 5 == 0):
    print '------------------'
    for n in gen:
        print n

print [list(g) for g in group([], lambda x: x % 5 == 0)]

Result:

$ python gen.py
------------------
4
------------------
5
6
7
8
9
------------------
10
11
12
13
14
------------------
15
[]

Edit:

And here's another solution, similar to the one above, but using a sentinel instead of pairwise(). I don't know which one is faster:

def group(iterable, isstart):

    sentinel = object()

    def interleave(iterable=iterable, isstart=isstart, sentinel=sentinel):
        for item in iterable:
            if isstart(item):
                yield sentinel
            yield item

    items = interleave()

    def extract(item, items=items, isstart=isstart, sentinel=sentinel):
        if item is not sentinel:
            yield item
        for item in items:
            if item is sentinel:
                return
            yield item

    for lookahead in items:
        gen = extract(lookahead)
        yield gen
        for _ in gen:
            pass

Both now pass the test case, thanks to J.F. Sebastian's idea of exhausting skipped subgroup generators.

score 2 · Accepted Answer

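The crucial thing is that you have to write a generator that yields sub-generators. My solution is similar in concept to the one by @pillmuncher, but is more self-contained because it avoids using itertools machinery to make ancillary generators. The downside is that I have to use a somewhat inelegant temporary list. In Python 3 this could perhaps be done more nicely with nonlocal.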

def grouper(iterable, isstart):
    it = iter(iterable)
    last = [next(it)]
    def subgroup():
        while True:
            toYield = last[0]
            try:
                last.append(next(it))
            except StopIteration, e:
                last.pop(0)
                yield toYield
                raise StopIteration
            else:
                yield toYield
                last.pop(0)
            if isstart(last[0]):
                raise StopIteration
    while True:
        sg = subgroup()
        yield sg
        if len(last) == 2:
            # subgenerator was aborted before completion, let's finish it
            for a in sg:
                pass
        if last:
            # sub-generator left next element waiting, next sub-generator will yield it
            pass
        else:
            # sub-generator left "last" empty because source iterable was exhausted
            raise StopIteration

>>> for g in grouper([0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0], lambda x: x==0):
...     print "Group",
...     for i in g:
...         print i,
...     print
Group 0 1 1
Group 0 1
Group 0 1 1 1 1
Group 0

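I don't know what this is like performance-wise; I just did it because it was an interesting thing to try.

Edit: I ran your unit test on your original two versions and on mine. Mine looks a bit faster than your igroup, but it is still slower than the list-based version. It seems natural that you have to make a tradeoff between speed and memory here: if you know the groups won't be terribly large, use the list-based version for speed; if the groups could be huge, use a generator-based version to keep memory usage down.

Edit: The edited version above handles breaking in a different way. If you break out of a sub-generator but resume the outer generator, it skips the remainder of the aborted group and begins with the next group: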

>>> for g in grouper([0, 1, 2, 88, 3, 0, 1, 88, 2, 3, 4, 0, 1, 2, 3, 88, 4], lambda x: x==0):
...     print "Group",
...     for i in g:
...         print i,
...         if i==88:
...             break
...     print
Group 0 1 2 88
Group 0 1 88
Group 0 1 2 3 88
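
The remark about nonlocal can be sketched as follows. This is a Python 3 rewrite with hypothetical names (`igroup_nonlocal`, `pending`, `_END`), not the code from this answer: it keeps one look-ahead item in a closure variable instead of a temporary list, and drains any sub-generator the consumer abandoned.

```python
def igroup_nonlocal(iterable, isstart):
    it = iter(iterable)
    _END = object()                  # marks exhaustion of the source
    pending = next(it, _END)         # first item of the next group, if any

    def subgroup():
        nonlocal pending
        # The first item always belongs to the group, start item or not.
        item, pending = pending, next(it, _END)
        yield item
        while pending is not _END and not isstart(pending):
            item, pending = pending, next(it, _END)
            yield item

    while pending is not _END:
        sg = subgroup()
        yield sg
        for _ in sg:                 # skip whatever the consumer left unread
            pass

result = [list(g) for g in igroup_nonlocal([0, 1, 1, 0, 1, 0], lambda x: x == 0)]
print(result)  # [[0, 1, 1], [0, 1], [0]]
```

Using `next(it, _END)` with a private sentinel also sidesteps letting StopIteration escape from inside a generator, which Python 3.7 and later turn into a RuntimeError.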

score 0 · Accepted Answer

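So here's another version that tries to stitch together pairs of subgroups from groupby with chain. It's noticeably faster for the performance test given, but much slower when there are many small groups (say isstart = lambda x: x % 2 == 0). It cheats and buffers repeated headers (you could get around this with read-all-but-last iterator tricks). It's also a step backward in the elegance department, so I think I still prefer the original.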

def group2(iterable, isstart=lambda x: x):
    groups = itertools.groupby(iterable, isstart)
    start, group = next(groups)
    if not start:                   # Deal with initial non-start group
        yield group
        _, group = next(groups)
    groups = (grp for _, grp in groups)
    while True:                     # group will always be start item(s) now      
        group = list(group)         
        for item in group[0:-1]:    # Back-to-back start items... and hope this doesn't get very big.  :)
            yield iter([item])      
        yield itertools.chain([group[-1]], next(groups, []))       # Start item plus subsequent non-start items
        group = next(groups)

%time deque(group2(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 3.13 s, sys: 0.00 s, total: 3.13 s
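
To re-check how group size affects timings like these on a current interpreter, a small timeit harness along these lines might help. It is a Python 3 sketch with made-up names (`group_list`, `group_lazy`, `consume`) and a smaller input than the runs above; unlike those runs it also consumes every inner group, and no timings are claimed here.

```python
import timeit
from collections import deque
from itertools import groupby

def group_list(iterable, isstart):
    """List-buffering version: stores one whole group at a time."""
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items)

def group_lazy(iterable, isstart):
    """groupby version: the key flips at every start item."""
    flag = False

    def key(item):
        nonlocal flag
        if isstart(item):
            flag = not flag
        return flag

    return (grp for _, grp in groupby(iterable, key))

def consume(groups):
    for grp in groups:
        deque(grp, maxlen=0)         # drain each group without storing it

N = 10 ** 5
cases = [("large groups", lambda x: x % 1000 == 0),
         ("small groups", lambda x: x % 2 == 0)]
for name, pred in cases:
    for fn in (group_list, group_lazy):
        secs = timeit.timeit(lambda: consume(fn(range(N), pred)), number=3)
        print(f"{name:12} {fn.__name__:10} {secs:.3f}s")
```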
