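I have a dictionary of names (the keys are names, the values are lists of aliases associated with that name) and a body of text. I want to find and count the number of times a pair of names occurs within N words of each other in the text.
For example: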
found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}
text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'
N = 6
We would get a result something like:
[
    ('Jack', 'Jim', 1),    # [jack] my name is [jim] - 4 words after Jack
    ('Jim', 'Chris', 1),   # [jim]. my friend [evans] - 3 words after Jim, evans refers to an alias of Chris hence Chris being in the result
    ('Chris', 'James', 2), # [evans] has seen you with [bond]. -- first occurrence, [bond]. [evans] -- second occurrence
    ('Chris', 'Jack', 1),  # [evans] says they call you [ripper] - 5 words apart and ripper is an alias for Jack
]
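Note that a name does not need to be paired with itself within the window, so the result contains no (Chris, Chris). Also, the two names may be at most 6 words apart, not exactly 6 words apart.
Is there an efficient way to do this? I was thinking of finding the position of every name from the name list in the text and storing them in a dictionary, where the key is the name and the value is the list of indices at which that name occurs in the text.
I don't know what to do after that; can anyone help? My code so far: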
N = 16
t = text.split(' ')
pos_dct = {}  # key is going to be the name, value is going to be the list of positions
name_lst = [[k] + v for k, v in found_names.items()]
for i, w in enumerate(t):
    if w in name_lst:
        if w in pos_dct:
            pos_dct[w].append(i)
        else:
            pos_dct[w] = [i]
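A minimal sketch of the position dictionary described above. Note that name_lst is a list of lists, so `w in name_lst` never matches a single word; this sketch instead assumes a flat alias-to-name lookup (`alias_to_name` is an illustrative name, not from the original post) and strips punctuation and case before matching.

```python
import string
from collections import defaultdict

found_names = {
    'jack': ['ripper', 'drought'],
    'jim': ['carrey', 'gaffigan'],
    'chris': ['hemsworth', 'evans', 'pratt', 'brown'],
    'james': ['bond'],
}
text = ('Hello Jack, my name is Jim. My friend Evans has seen you with Bond. '
        'Evans says they call you ripper')

# Hypothetical helper: map every name and alias to its canonical name.
alias_to_name = {}
for name, aliases in found_names.items():
    for alias in [name] + aliases:
        alias_to_name[alias.lower()] = name

# Canonical name -> list of word positions in the text.
pos_dct = defaultdict(list)
for i, w in enumerate(text.split()):
    w = w.strip(string.punctuation).lower()
    if w in alias_to_name:
        pos_dct[alias_to_name[w]].append(i)

print(dict(pos_dct))
# {'jack': [1, 19], 'jim': [5], 'chris': [8, 14], 'james': [13]}
```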
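Answer from a uj5u.com community member:
The main idea here is (1) to include the name itself in the lookup array in found_names, then (2) to convert your input string into a dictionary of indices; an index gets exactly one name attached if the word at that index is a name (or alias).
After that, (3) for each index, we check whether any index greater than the current one falls within range (given by N); if so, we increment the counter for the pair (current name, other name).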
# 0. Variables setup
from collections import defaultdict
import string
N = 6
s = "Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper"
found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}
# 1. Include the name itself in the lookup array.
#    New lists are built here because dict.copy() is shallow and
#    appending to its values would also modify found_names.
found_names_full = {k: a + [k.lower()] for k, a in found_names.items()}
found_names_full  # {'jack': ['ripper', 'drought', 'jack'], 'jim': ['carrey', 'gaffigan', 'jim'], 'chris': ['hemsworth', 'evans', 'pratt', 'brown', 'chris'], 'james': ['bond', 'james']}
# 2. Check, word by word, if it's included in the lookup array.
#    If so, store for the current index the name (key)
s2 = dict()
for i, word in enumerate(s.lower().split()):
    word = word.strip(string.punctuation)
    for k, a in found_names_full.items():
        if word not in a:
            continue
        s2[i] = k
s2  # {1: 'jack', 5: 'jim', 8: 'chris', 13: 'james', 14: 'chris', 19: 'jack'}
# 3. Get the list of indices of matched words
s3 = defaultdict(int)
s2k = list(s2.keys())
for i, k in enumerate(s2k):
    # 3.1 Given an index, get the sublist of all indices greater than
    #     the current index (the slice is empty for the last index)
    k2l = s2k[i + 1:]
    # 3.2 For each index greater than the current index, check if
    #     it is found in range
    for k2 in k2l:
        if k2 - k < N:
            # 3.3 Get names found in current position (k)
            #     and index greater than current but in range N (k2)
            n1 = s2[k]
            n2 = s2[k2]
            # Skip a name paired with itself, e.g. (chris, chris)
            if n1 == n2:
                continue
            # 3.4 Get a sorted key
            key = tuple(sorted([n1, n2]))
            # 3.5 And add 1 to counter
            s3[key] += 1
s3  # defaultdict(<class 'int'>, {('jack', 'jim'): 1, ('chris', 'jim'): 1, ('chris', 'james'): 2, ('chris', 'jack'): 1})
If you want the output to look like yours, then you need:
s4 = [(*k, v) for k, v in s3.items()]
s4  # [('jack', 'jim', 1), ('chris', 'jim', 1), ('chris', 'james', 2), ('chris', 'jack', 1)]
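Since the question asks about efficiency, here is a hedged single-pass variant of the same idea. It is a sketch, not the answer's method: the function name `count_name_pairs` and the alias map are assumptions made for illustration. It keeps only recent matches in a deque, so each new match is compared only against matches fewer than n positions behind it, and it uses the same `< n` distance test as step 3.2.

```python
import string
from collections import Counter, deque

def count_name_pairs(text, found_names, n):
    """Count pairs of different names whose positions differ by less than n."""
    # Map every name and alias to its canonical name (illustrative helper).
    alias_to_name = {}
    for name, aliases in found_names.items():
        for alias in [name] + aliases:
            alias_to_name[alias.lower()] = name

    counts = Counter()
    window = deque()  # (position, name) of recent matches, oldest first
    for i, word in enumerate(text.split()):
        name = alias_to_name.get(word.strip(string.punctuation).lower())
        if name is None:
            continue
        # Drop matches that are too far behind the current word.
        while window and i - window[0][0] >= n:
            window.popleft()
        # Pair the current name with every different name still in range.
        for _, other in window:
            if other != name:
                counts[tuple(sorted((other, name)))] += 1
        window.append((i, name))
    return counts

found_names = {
    'jack': ['ripper', 'drought'],
    'jim': ['carrey', 'gaffigan'],
    'chris': ['hemsworth', 'evans', 'pratt', 'brown'],
    'james': ['bond'],
}
text = ('Hello Jack, my name is Jim. My friend Evans has seen you with Bond. '
        'Evans says they call you ripper')
result = count_name_pairs(text, found_names, 6)
print([(*pair, count) for pair, count in result.items()])
# [('jack', 'jim', 1), ('chris', 'jim', 1), ('chris', 'james', 2), ('chris', 'jack', 1)]
```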
Answer from a uj5u.com community member:
I would try this. If the dictionary keys were lowercase, I would add lower() to word.strip(string.punctuation), giving word.strip(string.punctuation).lower().
import string
text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'
found_names = {
    'Jack' : ['ripper', 'drought'],
    'Jim' : ['carrey', 'gaffigan'],
    'Chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'James' : ['bond']
}
def names():
    # Yield each unordered pair of keys exactly once
    x_position = 0
    for x in found_names.keys():
        x_position += 1
        y_position = 0
        for y in found_names.keys():
            y_position += 1
            if y_position > x_position:
                yield (x, y)
def pair_count_list():
    for names_pair in names():
        # Count the words in the text that equal either name of the pair
        counted = sum(
            1
            for word in text.split()
            if word.strip(string.punctuation) in names_pair
        )
        yield (names_pair + (counted,))
values = [x for x in pair_count_list()]
print(values)
Output:
[
    ('Jack', 'Jim', 2),
    ('Jack', 'Chris', 1),
    ('Jack', 'James', 1),
    ('Jim', 'Chris', 1),
    ('Jim', 'James', 1),
    ('Chris', 'James', 0)
]
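As a side note, the pairs produced by names() above can also be generated with itertools.combinations from the standard library. A small sketch, reusing the capitalized keys from this answer:

```python
from itertools import combinations

found_names = {
    'Jack': ['ripper', 'drought'],
    'Jim': ['carrey', 'gaffigan'],
    'Chris': ['hemsworth', 'evans', 'pratt', 'brown'],
    'James': ['bond'],
}

# Each unordered pair of keys exactly once, in dictionary order.
pairs = list(combinations(found_names, 2))
print(pairs)
# [('Jack', 'Jim'), ('Jack', 'Chris'), ('Jack', 'James'),
#  ('Jim', 'Chris'), ('Jim', 'James'), ('Chris', 'James')]
```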
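Answer from a uj5u.com community member:
I recently ran into a similar problem in a natural language processing course in Python, so I thought I would share an alternative solution to the problem above.
The ngrams function in the nltk library provides an elegant and concise way to solve this. 'ngrams' are all sequences of words of length n in the text. So we simply iterate over the ngrams, checking the first and last word of each one against the names and aliases in the 'found names' dictionary.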
from itertools import combinations
from nltk import ngrams, word_tokenize
from pprint import PrettyPrinter
pp = PrettyPrinter()
found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}
# Get pairs of different names without repetitions
name_pairs = list(combinations(sorted(found_names.keys()), 2))
text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'
# Get list of words in sentence, lowercased and ignoring punctuation
words = [word.lower() for word in word_tokenize(text) if word.isalpha()]
N = 6
# ngrams are sequences of words of length n from the text
n_grams = list(ngrams(words, N))
result = []
for name1, name2 in name_pairs:
    count = 0
    for n_gram in n_grams:
        first_word, last_word = (n_gram[0], n_gram[-1])
        # Check both orders
        if first_word == name1 or first_word in found_names[name1]:
            if last_word == name2 or last_word in found_names[name2]:
                count += 1
        elif last_word == name1 or last_word in found_names[name1]:
            if first_word == name2 or first_word in found_names[name2]:
                count += 1
    result.append((name1, name2, count))
pp.pprint(result)
Output:
[('chris', 'jack', 1),
('chris', 'james', 1),
('chris', 'jim', 0),
('jack', 'james', 0),
('jack', 'jim', 0),
('james', 'jim', 0)]
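To make the window behaviour concrete, here is a small sketch that reproduces n-grams in plain Python (`word_ngrams` is an illustrative stand-in, not the nltk function). The first and last word of an N-gram are always N - 1 positions apart, so checking only the ends of 6-grams finds names exactly 5 words apart. That is consistent with the output above reporting ('chris', 'james', 1) rather than 2. Looping over every window size from 2 to N, as assumed below, also covers the closer pairs.

```python
def word_ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return list(zip(*(words[i:] for i in range(n))))

words = 'evans has seen you with bond evans says'.split()

# Ends of the 6-grams only: 'evans' ... 'bond' is found,
# but the adjacent 'bond' 'evans' is not.
ends_exact = {(g[0], g[-1]) for g in word_ngrams(words, 6)}
print(('evans', 'bond') in ends_exact)  # True
print(('bond', 'evans') in ends_exact)  # False

# Ends of every window size from 2 to 6: both are found.
ends_up_to = {
    (g[0], g[-1])
    for size in range(2, 7)
    for g in word_ngrams(words, size)
}
print(('bond', 'evans') in ends_up_to)  # True
```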