Parsing an iterable without listifying each chunk

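Suppose I want to split a Python iterable without listifying each chunk, similar to `itertools.groupby`, whose chunks are lazy. But I want to split on a condition more sophisticated than equality of a key. So it is more like a parser.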

For example, suppose I want to use odd numbers as delimiters in an iterable of integers, like `more_itertools.split_at(lambda x: x % 2 == 1, xs)`. (But `more_itertools.split_at` listifies each chunk.)

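In parser-combinator terms this might be called `sepBy1(odd, many(even))`. In Haskell, the `Parsec`, `pipes-parse`, and `pipes-group` libraries address this kind of problem. For instance, it would be sufficient, and interesting, to write an `itertools.groupby`-like version of `groupsBy'` from `Pipes.Group` (see here).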

There is probably some clever jiu-jitsu to be done with `itertools.groupby`, perhaps applying `itertools.pairwise`, then `itertools.groupby`, and then going back to single elements.

I guess I could write it myself as a generator, but writing `itertools.groupby` in Python (below) is already fairly involved. It is also not easy to generalise.

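It seems there should be something more general for this, such as a relatively painless way of writing parsers and combinators for streams of any type.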

# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))
    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

Answer:

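Here are a couple of simple iterator splitters, which I wrote in a fit of boredom. I don't think they are particularly profound, but perhaps they will help in some way.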

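I didn't spend much time thinking about useful interfaces, optimisations, or implementing multiple interacting sub-features. All of that could be added if desired.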

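These are basically modelled on `itertools.groupby`, whose interface might be considered a bit odd. That is a consequence of Python not really being a functional programming language. Python's generators (and other objects that implement the iterator protocol) are stateful, and there is no facility for saving and restoring generator state. So these functions return an iterator that successively produces iterators, each of which yields values from the original iterator. But the returned iterators all share the underlying iterator, which is the one passed to the original call. That means that when you advance the outer iterator, any unconsumed values in the current inner iterator are discarded without notice.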
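The sharing is easy to observe with `itertools.groupby` itself. In this small demonstration (the variable names are only for illustration), one value is taken from the first group, the outer iterator is advanced, and the rest of the first group is gone:

```python
from itertools import groupby

groups = groupby("AAABBC")

key_a, group_a = next(groups)
first_a = next(group_a)         # take one value from the first group

key_b, group_b = next(groups)   # advancing the outer iterator skips the remaining A's

leftover_a = list(group_a)      # the old group yields nothing more
rest_b = list(group_b)

print(first_a, leftover_a, rest_b)  # A [] ['B', 'B']
```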

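There are (fairly expensive) ways to avoid discarding those values, but since the most obvious one, listifying, was ruled out from the start, I went with the `groupby` interface despite the awkwardness of documenting its behaviour accurately. It would be possible to wrap the inner iterators with `itertools.tee` to make them independent of the original iterator, but that comes at a price similar to (or possibly slightly higher than) listifying. It still requires each sub-iterator to be fully generated before the next sub-iterator is started, but it does not require a sub-iterator to be fully generated before you start using its values.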
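A minimal sketch of how the `tee` idea might be wired up, as one reading of the paragraph above rather than code that has been used in practice. The function name `split_at_independent` is hypothetical. Each chunk is duplicated with `tee`; the caller gets one copy, and the other copy is drained when the outer iterator advances, so anything the caller has not yet read is buffered inside `tee` instead of being lost.

```python
from collections import deque
from itertools import tee

def split_at_independent(iterable, pred):
    """Hypothetical variant: chunks stay readable after the outer iterator
    has moved on, at the cost of buffering their unread values."""
    it = iter(iterable)
    done = False

    def subiter():
        nonlocal done
        for value in it:
            if pred(value):
                return
            yield value
        done = True

    while not done:
        for_caller, for_flush = tee(subiter())
        yield for_caller
        # Draining the second copy advances the shared source to the next
        # delimiter; tee keeps whatever for_caller has not consumed yet.
        deque(for_flush, maxlen=0)

# Materialise the outer iterator first, then read the chunks afterwards.
chunks = list(split_at_independent([2, 4, 1, 6, 3, 8], lambda x: x % 2 == 1))
result = [list(chunk) for chunk in chunks]
print(result)  # [[2, 4], [6], [8]]
```

In the worst case the buffer holds an entire chunk, which is where the cost comparable to listifying comes from.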

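For simplicity (according to me :-) ), I implemented these functions as generators rather than as objects in the manner of `itertools` and `more_itertools`. The outer generator yields each successive sub-iterator and then collects and discards any values remaining in it before yielding the next sub-iterator [Note 1]. I imagine that most of the time the sub-iterator will be fully exhausted before the outer loop tries to flush it, so the extra call is a bit wasteful, but it is simpler than the `itertools.groupby` code you cite.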

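It is still necessary for the sub-iterator to communicate back the fact that the original iterator was exhausted, since that is not something you can ask an iterator. I use a `nonlocal` declaration to share state between the outer and inner generators. In some ways, maintaining state in an object, as `itertools.groupby` does, might be more flexible and might even be considered more Pythonic, but `nonlocal` worked for me.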
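For comparison, here is a hedged sketch of what the object-based alternative could look like, with the exhaustion flag stored on an instance instead of in a `nonlocal` variable. The class name `SplitAt` and its attribute names are invented for this illustration; it is not taken from `itertools` or `more_itertools`.

```python
from collections import deque

class SplitAt:
    """Hypothetical object-based splitter: same behaviour as a generator
    version with nonlocal state, but the state lives on the instance."""

    def __init__(self, iterable, pred):
        self._it = iter(iterable)
        self._pred = pred
        self._done = False      # set once the underlying iterator runs out
        self._current = None    # the chunk most recently handed out

    def __iter__(self):
        return self

    def __next__(self):
        if self._current is not None:
            # Flush whatever the caller left unread in the previous chunk.
            deque(self._current, maxlen=0)
        if self._done:
            raise StopIteration
        self._current = self._chunk()
        return self._current

    def _chunk(self):
        for value in self._it:
            if self._pred(value):
                return          # delimiter: end of this chunk
            yield value
        self._done = True       # source exhausted

chunks = [list(chunk) for chunk in SplitAt([2, 4, 1, 6, 3, 8], lambda x: x % 2 == 1)]
print(chunks)  # [[2, 4], [6], [8]]
```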

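I implemented `more_itertools.split_at` (without the `maxsplit` and `keep_separator` options) and what I think is the equivalent of `groupsBy'` from `Pipes.Group`, renamed `split_between` to indicate that it splits between two consecutive elements if they satisfy some condition.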

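Note that `split_between` always forces the first value from the supplied iterator before that value is requested by running the first sub-iterator. The rest of the values are generated lazily. I tried a few ways of deferring the first object, but in the end I went with this design because it is a lot simpler. The consequence is that `split_at`, which does not do the initial force, always returns at least one sub-iterator, even if the supplied argument is empty, whereas `split_between` does not. I would have to try both on some real problem to decide which interface I prefer; if you have a preference, by all means express it (but no guarantees about changes).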

from collections import deque

def split_at(iterable, pred=lambda x: x is None):
    '''Produces an iterator which returns successive sub-iterations of 
       `iterable`, delimited by values for which `pred` returns
       truthiness. The default predicate returns True only for the
       value None.

       The sub-iterations share the underlying iterable, so they are not 
       independent of each other. Advancing the outer iterator will discard
       the rest of the current sub-iteration.

       The delimiting values are discarded.
    '''

    done = False
    iterable = iter(iterable)

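    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value):
                return          # delimiter reached: end this chunk, drop the delimiter
            yield value
        done = True             # only reached when the underlying iterator is exhausted

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)      # discard whatever the caller left unconsumed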

def split_between(iterable, pred=lambda before, after: before + 1 != after):
    '''Produces an iterator which returns successive sub-iterations of 
       `iterable`, delimited at points where calling `pred` on two
       consecutive values produces truthiness. The default predicate
       returns True when the two values are not consecutive, making it
       possible to split a sequence of integers into contiguous ranges.

       The sub-iterations share the underlying iterable, so they are not 
       independent of each other. Advancing the outer iterator will discard
       the rest of the current sub-iteration.
    '''
    iterable = iter(iterable)

    try:
        before = next(iterable)
    except StopIteration:
        return

    done = False

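    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return          # split point: `before` now holds the first value of the next chunk

        yield before            # underlying iterator exhausted: emit the last pending value
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)      # discard whatever the caller left unconsumed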
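A short usage sketch may help to show the behaviour described above. To keep the snippet runnable on its own, it repeats condensed copies of the two functions (docstrings omitted); the example inputs are arbitrary.

```python
from collections import deque

# Condensed copies of split_at and split_between, repeated only so that
# this snippet runs standalone.
def split_at(iterable, pred=lambda x: x is None):
    done = False
    iterable = iter(iterable)
    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value):
                return
            yield value
        done = True
    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)

def split_between(iterable, pred=lambda before, after: before + 1 != after):
    iterable = iter(iterable)
    try:
        before = next(iterable)
    except StopIteration:
        return
    done = False
    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return
        yield before
        done = True
    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)

# Odd numbers as delimiters, as in the question.
odd_split = [list(c) for c in split_at([2, 4, 1, 6, 3, 8], lambda x: x % 2 == 1)]
print(odd_split)       # [[2, 4], [6], [8]]

# Default predicate: split integers into contiguous runs.
runs = [list(c) for c in split_between([1, 2, 3, 5, 6, 9])]
print(runs)            # [[1, 2, 3], [5, 6], [9]]

# Empty input: split_at yields one empty chunk, split_between yields none.
empty_at = [list(c) for c in split_at([])]
empty_between = [list(c) for c in split_between([])]
print(empty_at, empty_between)   # [[]] []

# Materialising the outer iterator first discards every chunk's values.
discarded = [list(c) for c in list(split_at([2, 4, 1, 6], lambda x: x % 2 == 1))]
print(discarded)       # [[], []]
```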

Notes

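1. `collections.deque(g, maxlen=0)` is, I believe, currently the most efficient way of discarding the remaining values of an iterator, although it looks a bit mysterious. Credit to `more_itertools` for pointing me to that solution, and to the related expression for counting the number of objects produced by a generator: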
```
cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0
```
I don't mean to blame `more_itertools` for the monstrosity above, though. (They do it with an `if` statement, not a walrus.)
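Spelled out as two small helpers, the same tricks might look like the sketch below. The names `drain` and `count_items` are made up here and are not the names `more_itertools` uses; the counting helper uses a plain `if` statement instead of the walrus.

```python
from collections import deque

def drain(iterator):
    """Consume an iterator to the end, keeping nothing."""
    deque(iterator, maxlen=0)

def count_items(iterable):
    """Count the items an iterable produces without storing them."""
    # A deque with maxlen=1 retains only the final (index, item) pair.
    last = deque(enumerate(iterable, 1), maxlen=1)
    if last:
        return last[0][0]
    return 0

numbers = iter([1, 2, 3])
drain(numbers)
print(next(numbers, "exhausted"))              # exhausted
print(count_items(x * x for x in range(5)))    # 5
print(count_items([]))                         # 0
```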
