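I am trying to remove duplicate or similar lines while keeping only the last occurrence; every earlier duplicate or similar line should be selected.
This is the text I want to clean up (ignore the line-number prefixes; they are only there so I can refer to specific lines):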
l1:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris
l2:
l3:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information". The painter and critic Maurice Denis shared a sense of bewilderment about Cézanne's revoluti
l4:
l5:Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
l6:
l7:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
l8:
l9:He overturned centuries of theories about how the eye works by depicting a world constantly in motion, affected by the passing of time and infused with the artist's own memories and emotions.
l10:
l11:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
In this example, I want only the last occurrence, on line 11, to remain unselected. That last line contains this text:
In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
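Lines 1, 3, 5 and 7 contain similar or identical text and should be matched (selected) by the regex. The text on a line can be anything up to the newline, and the regex should also detect other such cases elsewhere in the file.
I am using the regex below, but it does not work as intended: it selects only l1 and l7, when it should also select l3 and l5. Here is the example: https://regex101.com/r/gd0Z3V/1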
(?sm)(^[^\r\n]*)[\r\n](?=.*^\1)
Answer from a uj5u.com user:
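The main problem here is that regular expressions do not understand human logic. There is no notion of "looks alike" in a regex. So the first requirement is to translate the human logic into regex logic.
We can do that by specifying how many characters must be exactly identical for two lines to count as a match.
Here I chose 100 characters. (You can of course change this number, but it works for your sample text.)
Now we can build a regex that matches a whole line if a run of 100 characters from that line is repeated further down in the text: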
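/^.*(.{100}).*$(?=[\s\S]+\1)/gm
Explanation:
^.* - starting at the beginning of the line, match zero or more characters
(.{100}) - create group 1, matching 100 characters
.*$ - match the rest of the line
(?=[\s\S]+\1) - look ahead for one or more of ANY character (including newlines), followed by the text captured in group 1
The result is that the whole line is matched if any run of 100 characters from it is repeated further down.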
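To try the pattern outside a regex tester, here is a minimal JavaScript sketch. The helper name findLinesRepeatedLater, the shortened sample text, and the 20-character window (used instead of 100 so the sample can stay short) are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: builds the answer's pattern with a configurable
// window size n and returns every line that shares an n-character run
// with some text further down.
function findLinesRepeatedLater(text, n) {
  const pattern = new RegExp(`^.*(.{${n}}).*$(?=[\\s\\S]+\\1)`, "gm");
  return text.match(pattern) || [];
}

// Shortened stand-in for the question's text (assumed sample, same layout:
// content on odd lines, empty lines in between).
const sampleLines = [
  "In 1881, Paul Gauguin joked about how to extract",
  "",
  "In 1881, Paul Gauguin joked about how to extract Cézanne's methods. Maurice Denis shared a sen",
  "",
  "joked about how to extract Cézanne's methods.",
  "",
  "In 1881, Paul Gauguin joked about how to extract Cézanne's methods.",
  "",
  "He overturned centuries of theories about how the eye works.",
  "",
  "In 1881, Paul Gauguin joked about how to extract Cézanne's methods.",
];

const selected = findLinesRepeatedLater(sampleLines.join("\n"), 20);

// Logs the four earlier variants; the final line and the unrelated
// "He overturned..." line are left unselected.
console.log(selected);
```

The window size is the only tuning knob: a smaller value catches shorter overlaps but also raises the chance of selecting lines that merely share a common phrase.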
I created a test case for you here: JSRegExpBuilder (it uses JavaScript, but it should work in most flavors).
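Since the goal in the question is to delete the earlier near-duplicates and keep only the last one, the matches can be passed to String.prototype.replace. This is a minimal sketch, assuming "\n" line endings and that collapsing the leftover empty lines is acceptable; the function name removeEarlierSimilarLines, the sample text, and the 15-character window are hypothetical.

```javascript
// Hypothetical helper: removes every line that shares an n-character run
// with text further down, so only the last occurrence survives.
function removeEarlierSimilarLines(text, n) {
  const pattern = new RegExp(`^.*(.{${n}}).*$(?=[\\s\\S]+\\1)`, "gm");
  return text
    .replace(pattern, "")        // blank out each line that is repeated later
    .replace(/\n{3,}/g, "\n\n")  // collapse the runs of empty lines left behind
    .trim();                     // drop leading/trailing whitespace
}

// Assumed sample: two partial copies, one unrelated line, one full copy.
const messyText = [
  "come straight to Paris",
  "",
  "ply him with drugs and come straight to Paris to share",
  "",
  "He overturned centuries of theories.",
  "",
  "ply him with drugs and come straight to Paris to share the information.",
].join("\n");

const cleanedText = removeEarlierSimilarLines(messyText, 15);
console.log(cleanedText);
```

Because the lookahead rescans the rest of the text for every candidate window, this can get slow on large files; for those, splitting into lines and comparing them in code may be the more practical route.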