Python: finding when two words are N words apart

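I have a dictionary of names (where the key is a name and the value is a list of aliases associated with that name) and a body of text. I want to find and count the number of times a pair of names occurs within N words of each other in the body of text.

For example: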

found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}

text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'

N = 6

We would get a result along these lines:

[
    ('Jack', 'Jim', 1),  # [jack] my name is [jim] - 4 words after Jack
    ('Jim', 'Chris', 1), # [jim]. my friend [evans] - 3 words after Jim, evans refers to an alias of Chris hence Chris being in the result
    ('Chris', 'James', 2), # [evans] has seen you with [bond]. -- first occurrence, [bond]. [evans] -- second occurrence
    ('Chris', 'Jack', 1), # [evans] says they call you [ripper] - 5 words apart and ripper is an alias for Jack
]

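Note that a name does not need to be paired with itself, so the result has no (Chris, Chris). Also, the two names can be at most 6 words apart, not exactly 6 words apart.

Is there a way to do this efficiently? I was thinking of finding the position of every name from the name list in the body of text and storing it as a dictionary, where the key is the name and the value is a list of indices corresponding to the name's positions in the text.

I don't know what to do after that, so any help is appreciated. My code so far: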

N = 16
t = text.split(' ')
pos_dct = {} # key is going to be the name, value is going to be the list of positions
name_lst = [[k] + v for k,v in found_names.items()]
for i,w in enumerate(t):
    if w in name_lst:
        if w in pos_dct:
            pos_dct[w].append(i)
        else:
            pos_dct[w] = [i]
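
The position index described above could be built roughly as follows. This is only a sketch: the reverse lookup `alias_to_name` is a hypothetical helper that is not part of the original code, and it assumes that matching should ignore case and surrounding punctuation.

```python
import string
from collections import defaultdict

found_names = {
    'jack': ['ripper', 'drought'],
    'jim': ['carrey', 'gaffigan'],
    'chris': ['hemsworth', 'evans', 'pratt', 'brown'],
    'james': ['bond'],
}
text = ('Hello Jack, my name is Jim. My friend Evans has seen you with Bond. '
        'Evans says they call you ripper')

# Hypothetical helper: map every name and every alias to its canonical name,
# so each word needs only one dictionary lookup.
alias_to_name = {}
for name, aliases in found_names.items():
    for alias in [name] + aliases:
        alias_to_name[alias.lower()] = name

# key: canonical name, value: list of word positions in the text
positions = defaultdict(list)
for i, token in enumerate(text.split()):
    word = token.strip(string.punctuation).lower()
    if word in alias_to_name:
        positions[alias_to_name[word]].append(i)

print(dict(positions))
# {'jack': [1, 19], 'jim': [5], 'chris': [8, 14], 'james': [13]}
```

With the positions stored per name, counting pairs comes down to comparing index lists and checking whether two indices differ by no more than the chosen N.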

A uj5u.com user replied:

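The main idea here is to (1) include the name itself in the lookup array in found_names, then (2) convert your input string into a dictionary keyed by word index; an index gets a name attached only if the word at that index is a name (or an alias).

After that, (3) for each index, we check whether any index greater than the current one falls within range (given by N); if so, we increment the counter for the pair (current name, other name).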

# 0. Variables setup
from collections import defaultdict
import string

N = 6
s = "Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper"
found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}

# 1. Includes the name itself in the lookup array
# (new lists are built so the alias lists in found_names are not mutated)
found_names_full = {k: a + [k.lower()] for k, a in found_names.items()}
found_names_full  # {'jack': ['ripper', 'drought', 'jack'], 'jim': ['carrey', 'gaffigan', 'jim'], 'chris': ['hemsworth', 'evans', 'pratt', 'brown', 'chris'], 'james': ['bond', 'james']}

# 2. Check, word by word, if it's included in the lookup array.
# If so, store for the current index the name (key)
s2 = dict()
for i, word in enumerate(s.lower().split()):
    word = word.strip(string.punctuation)
    for k,a in found_names_full.items():
        if word not in a:
            continue
        s2[i] = k
        
s2  # {1: 'jack', 5: 'jim', 8: 'chris', 13: 'james', 14: 'chris', 19: 'jack'}

# 3. Get the list of indices of matched words
s3 = defaultdict(int) 
s2k = list(s2.keys())

for i,k in enumerate(s2k):
    # 3.1 Given an index, get the sublist of all indices greater than 
    # current index (the slice is empty for the last index)
    k2l = s2k[i + 1:]

    # 3.2 For each index greater than current index, check if
    # is found in range
    for k2 in k2l:
        if k2-k < N:
            # 3.3 Get names found in current position (k)
            # and index greater than current but in range N (k2)
            n1 = s2[k]
            n2 = s2[k2]
            # Skip a name paired with itself, e.g. (chris, chris)
            if n1 == n2:
                continue
            # 3.4 Get a sorted key 
            key = tuple(sorted([n1,n2]))
            # 3.5 And add 1 to counter
            s3[key] += 1

s3  # defaultdict(<class 'int'>, {('jack', 'jim'): 1, ('chris', 'jim'): 1, ('chris', 'james'): 2, ('chris', 'jack'): 1})

If you want the output to look like yours, then you need:

s4 = [(*k, v) for k, v in s3.items()]
s4  # [('jack', 'jim', 1), ('chris', 'jim', 1), ('chris', 'james', 2), ('chris', 'jack', 1)]
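
Since the question asks about efficiency, here is one possible variant of the same idea, offered as a sketch rather than a definitive implementation. Instead of comparing each matched index with every later one, it keeps a sliding window of recent matches in a `deque`. The function name `count_close_pairs` and the reverse lookup `alias_to_name` are hypothetical, and the window uses the same `k2 - k < N` condition as the code above.

```python
import string
from collections import Counter, deque

def count_close_pairs(text, found_names, n):
    """Count pairs of distinct names that occur fewer than n words apart."""
    # Hypothetical reverse lookup: name or alias -> canonical name
    alias_to_name = {
        alias.lower(): name
        for name, aliases in found_names.items()
        for alias in [name] + aliases
    }
    counts = Counter()
    window = deque()  # (word index, canonical name) of recent matches
    for i, token in enumerate(text.split()):
        name = alias_to_name.get(token.strip(string.punctuation).lower())
        if name is None:
            continue
        # Drop earlier matches that are now n or more words behind
        while window and i - window[0][0] >= n:
            window.popleft()
        # Pair the current match with every earlier match still in range
        for _, earlier in window:
            if earlier != name:
                counts[tuple(sorted((earlier, name)))] += 1
        window.append((i, name))
    return counts

found_names = {
    'jack': ['ripper', 'drought'],
    'jim': ['carrey', 'gaffigan'],
    'chris': ['hemsworth', 'evans', 'pratt', 'brown'],
    'james': ['bond'],
}
text = ('Hello Jack, my name is Jim. My friend Evans has seen you with Bond. '
        'Evans says they call you ripper')

print(dict(count_close_pairs(text, found_names, 6)))
# {('jack', 'jim'): 1, ('chris', 'jim'): 1, ('chris', 'james'): 2, ('chris', 'jack'): 1}
```

Each matched word is only compared with the matches still inside the window, so long texts with sparse names do far less work than an all-pairs comparison.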

A uj5u.com user replied:

I would try this. If the dictionary keys are lowercase, I would add the lower() function to word.strip(string.punctuation), giving word.strip(string.punctuation).lower().

import string


text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'

found_names = {
    'Jack' : ['ripper', 'drought'],
    'Jim' : ['carrey', 'gaffigan'],
    'Chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'James' : ['bond']
}


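def names():
    # Yield every unordered pair of distinct dictionary keys once
    x_position = 0
    for x in found_names.keys():
        x_position += 1
        y_position = 0
        for y in found_names.keys():
            y_position += 1
            if y_position > x_position:
                yield (x, y)

def pair_count_list():
    for names_pair in names():
        # Count the words in the text that equal either key of the pair
        # (aliases and the distance N are not taken into account here)
        counted = sum(    
            1
            for word in text.split()
            if word.strip(string.punctuation) in names_pair
        )
        yield (names_pair + (counted,))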


values = [x for x in pair_count_list()]
print(values)

Output:

[
    ('Jack', 'Jim', 2)
    , ('Jack', 'Chris', 1)
    , ('Jack', 'James', 1)
    , ('Jim', 'Chris', 1)
    , ('Jim', 'James', 1)
    , ('Chris', 'James', 0)
]
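
The lowercase tweak mentioned at the start of this answer might look like the sketch below. The helper name `normalize` is hypothetical; it simply combines the `strip(string.punctuation)` and `lower()` calls so that tokens such as 'Jack,' can be compared with lowercase dictionary keys.

```python
import string

def normalize(token):
    # Remove leading/trailing punctuation, then lowercase the word
    return token.strip(string.punctuation).lower()

text = ('Hello Jack, my name is Jim. My friend Evans has seen you with Bond. '
        'Evans says they call you ripper')

tokens = [normalize(t) for t in text.split()]
print(tokens[:6])  # ['hello', 'jack', 'my', 'name', 'is', 'jim']
```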

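A uj5u.com user replied:

I recently ran into a similar problem in a natural language processing course in Python, so I thought I would share an alternative solution to the problem above.

The ngrams module in the nltk library provides an elegant and concise way to solve this. 'ngrams' are all the sequences of words of length n in the text. So we simply iterate over the ngrams, checking the first and last word of each one against the names/dictionary of "found names".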

from itertools import combinations
from nltk import ngrams, word_tokenize
from pprint import PrettyPrinter
pp = PrettyPrinter()

found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}

# Get pairs of different names without repetitions
name_pairs = list(combinations(sorted(found_names.keys()), 2))

text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'
# Get list of words in sentence, lowercased and ignoring punctuation
words = [word.lower() for word in word_tokenize(text) if word.isalpha()]

N = 6
# ngrams are sequences of words of length n from the text
n_grams = list(ngrams(words, N))

result = []

for name1, name2 in name_pairs:
    count = 0
    for n_gram in n_grams:
        first_word, last_word = (n_gram[0], n_gram[-1])
        # Check both orders
        if first_word == name1 or first_word in found_names[name1]:
            if last_word == name2 or last_word in found_names[name2]:
                count += 1
        elif last_word == name1 or last_word in found_names[name1]:
            if first_word == name2 or first_word in found_names[name2]:
                count += 1
    result.append((name1, name2, count))

pp.pprint(result)

Output:

[('chris', 'jack', 1),
('chris', 'james', 1),
('chris', 'jim', 0),
('jack', 'james', 0),
('jack', 'jim', 0),
('james', 'jim', 0)]
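
The output above only counts pairs that sit at the two ends of a window of exactly N words. To cover names that are closer together, one option is to repeat the check for every window length from 2 up to N. The sketch below makes some assumptions: `ngrams_of` is a plain-Python stand-in for `nltk.ngrams`, `count_within` is a hypothetical helper, and tokens are produced with `split()` and `strip()` rather than `word_tokenize`.

```python
import string
from itertools import combinations

def ngrams_of(words, n):
    """All runs of n consecutive words (a plain-Python stand-in for nltk.ngrams)."""
    return list(zip(*(words[i:] for i in range(n))))

def count_within(words, found_names, max_len):
    """Count name pairs at the two ends of any window of 2..max_len words."""
    def matches(word, name):
        return word == name or word in found_names[name]

    result = {}
    for name1, name2 in combinations(sorted(found_names), 2):
        count = 0
        for size in range(2, max_len + 1):
            for gram in ngrams_of(words, size):
                first, last = gram[0], gram[-1]
                # Check both orders of the pair
                if (matches(first, name1) and matches(last, name2)) or \
                   (matches(first, name2) and matches(last, name1)):
                    count += 1
        result[(name1, name2)] = count
    return result

found_names = {
    'jack': ['ripper', 'drought'],
    'jim': ['carrey', 'gaffigan'],
    'chris': ['hemsworth', 'evans', 'pratt', 'brown'],
    'james': ['bond'],
}
text = ('Hello Jack, my name is Jim. My friend Evans has seen you with Bond. '
        'Evans says they call you ripper')
# Assumption: simple whitespace tokenization with punctuation stripped
words = [w.strip(string.punctuation).lower() for w in text.split()]

print(count_within(words, found_names, 6))
# {('chris', 'jack'): 1, ('chris', 'james'): 2, ('chris', 'jim'): 1,
#  ('jack', 'james'): 0, ('jack', 'jim'): 1, ('james', 'jim'): 0}
```

Each window length catches the pairs at one specific distance, so summing over all lengths gives the "at most" behaviour the question asks for, at the cost of scanning the text once per window length.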

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/402272.html
