Code repository: First Individual Project - Paper Plagiarism Checker

PSP Table
Design and Implementation of the Computation Module Interface
Performance Analysis
Unit Testing
- Code Coverage
- Unit Tests

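PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	120	100
· Estimate	· Estimate how long the task will take	480	600
Development	Development	180	240
· Analysis	· Requirements analysis (including learning new technologies)	120	240
· Design Spec	· Write the design document	60	50
· Design Review	· Review the design	30	20
· Coding Standard	· Coding standard (define a suitable standard for this development)	20	10
· Design	· Detailed design	90	120
· Coding	· Coding	40	40
· Code Review	· Code review	30	40
· Test	· Testing (self-test, fix code, commit changes)	120	210
Reporting	Reporting	30	50
· Test Report	· Test report	20	15
· Size Measurement	· Measure the amount of work	20	10
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	30	35
	· Total	480	600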

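Design and Implementation of the Computation Module Interface

This project uses SimHash and the Hamming distance to compute the duplication rate between two articles. The principle follows: https://blog.csdn.net/wxgxgp/article/details/104106867
Word segmentation and weight calculation follow the jieba library's GitHub page: https://github.com/fxsjy/jieba

Segment the text with jieba, then remove stop words manually.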

import jieba

# Punctuation and whitespace tokens treated as stop words
stopWords = [' ', '!', ',', '.', '?', '！', '？', '，', '\n', '\t', '\b', '"', '“', '”', '：', '《', '》', '<', '>']

# Remove stop words from a list of tokens
def del_stopWords(split_sentence, stopWords):
    return [word for word in split_sentence if word not in stopWords]

# source is the input text
splitWords = jieba.lcut(source)
splitWords = del_stopWords(splitWords, stopWords)

Convert each token into a 64-bit string of 0s and 1s with a hash function, and use jieba's TF-IDF algorithm to extract the keywords and compute their weights.

import jieba.analyse
import numpy as np

# Hash function: takes a single token, returns a 64-character string of 0s and 1s
def string_hash(source):
    if source == "":
        return "0" * 64
    x = ord(source[0]) << 7
    m = 1000003
    mask = 2**128 - 1
    for c in source:
        x = ((x * m) ^ ord(c)) & mask
    x ^= len(source)
    # Keep the low 64 bits, left-padded with zeros
    return bin(x).replace('0b', '').zfill(64)[-64:]

# Compute the weights and apply them to the hashes. Takes the token list;
# extract_tags also deduplicates the tokens and sorts them by weight.
def count_weight(split_sentence, listSize):
    keyWords = jieba.analyse.extract_tags("|".join(split_sentence), topK=listSize, withWeight=True)
    list_weightPluse = list()

    for word, weight in keyWords:
        bits = list(map(int, string_hash(word)))
        signed = np.subtract(np.multiply(bits, 2), 1)   # map 0/1 to -1/+1
        list_weightPluse.append(np.multiply(signed, weight))

    return list_weightPluse

Merge the weighted hash arrays and reduce them to a fingerprint.

# Merge the weighted hashes; takes the list of weighted hash vectors
def mergeHash(list_weightPluse):
    mergeHash_list = [0] * 64
    for weighted in list_weightPluse:
        mergeHash_list = np.add(mergeHash_list, weighted)

    return mergeHash_list

# Reduce: entries greater than 0 become 1, entries less than or equal to 0 become 0
def reduction(mergeHash_list):
    return [1 if value > 0 else 0 for value in mergeHash_list]
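
To make the merge and reduction steps concrete, here is a small worked example using only the standard library. The 4-bit hashes and the weights are made up for illustration (the real fingerprints are 64 bits wide), and the helper names are hypothetical, not part of the original code.

```python
def weighted_vector(bits, weight):
    """Map a bit string such as '1010' to [+w, -w, +w, -w]."""
    return [weight if b == "1" else -weight for b in bits]

def merge(vectors):
    """Sum the weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]

def reduce_to_bits(merged):
    """Positive sums become 1, everything else becomes 0."""
    return [1 if value > 0 else 0 for value in merged]

# Hypothetical 4-bit hashes and weights, chosen only for illustration.
features = [("1010", 3.0), ("0110", 1.0), ("1100", 2.0)]

merged = merge([weighted_vector(bits, w) for bits, w in features])
fingerprint = reduce_to_bits(merged)

print(merged)       # [4.0, 0.0, 2.0, -6.0]
print(fingerprint)  # [1, 0, 1, 0]
```

The second position sums to exactly 0 and therefore maps to 0, which matches the "less than or equal to 0" rule used above.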

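Compute the Hamming distance with bitwise XOR, then convert it to a similarity.

# Hamming distance: count the positions where the two bit lists differ
def getDistance(list_1, list_2):
    distance = 0

    for bit_1, bit_2 in zip(list_1, list_2):
        if (bit_1 ^ bit_2) == 1:
            distance += 1

    return distance

# Convert the Hamming distance to a similarity percentage
similarity = round((64 - hanmingDistance) / 64 * 100, 2)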
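
As an alternative sketch of the same calculation, the two bit lists can be packed into integers so that a single XOR followed by a bit count gives the distance. The function names `hamming_distance` and `similarity_percent` are hypothetical and not part of the original code; the formula is the one shown above.

```python
def hamming_distance(bits_a, bits_b):
    """Count differing positions by XOR-ing the packed integers."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a = int("".join(map(str, bits_a)), 2)
    b = int("".join(map(str, bits_b)), 2)
    return bin(a ^ b).count("1")

def similarity_percent(distance, width=64):
    """Share of matching bits, as a percentage rounded to 2 decimals."""
    return round((width - distance) / width * 100, 2)

# Two 64-bit fingerprints that differ in exactly three positions.
fp_a = [1, 0, 1, 0] * 16
fp_b = list(fp_a)
for position in (0, 10, 63):
    fp_b[position] ^= 1

distance = hamming_distance(fp_a, fp_b)
print(distance)                      # 3
print(similarity_percent(distance))  # 95.31
```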

Read the input files and write the result file.

import sys

address_orig = sys.argv[1]
address_copy = sys.argv[2]
address_out = sys.argv[3]

# Read the original file and compute its simhash
with open(address_orig, encoding='UTF-8') as file_1:
    s1 = file_1.read()

list_1 = simhash(s1, stopWords)

# Write the similarity to a newly created file
with open(address_out, 'w', encoding='UTF-8') as file_3:
    file_3.write(str(similarity))
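
The `simhash` function called above is not shown in the post. The following is a self-contained sketch of how the whole pipeline could fit together, under loudly stated assumptions: it uses a regex tokenizer instead of jieba, raw term counts instead of TF-IDF weights, and a SHA-256 based hash instead of `string_hash`, so its scores will differ from the author's program. All function names here are hypothetical.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64

def tokenize(text):
    """Stand-in tokenizer (assumption): runs of word characters.
    The original code segments with jieba.lcut instead."""
    return re.findall(r"\w+", text.lower())

def token_hash(token):
    """64-bit integer taken from the first 8 bytes of a SHA-256 digest."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def simhash(text):
    """Weighted bit voting: each token adds or subtracts its count per bit."""
    totals = [0] * FINGERPRINT_BITS
    for token, count in Counter(tokenize(text)).items():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            totals[bit] += count if (h >> bit) & 1 else -count
    # Bits with a positive total are set in the fingerprint.
    return sum(1 << bit for bit, total in enumerate(totals) if total > 0)

def similarity(text_a, text_b):
    """Similarity percentage derived from the Hamming distance."""
    distance = bin(simhash(text_a) ^ simhash(text_b)).count("1")
    return round((FINGERPRINT_BITS - distance) / FINGERPRINT_BITS * 100, 2)

def compare_files(orig_path, copy_path, out_path):
    """Read both files, write the similarity to out_path, and return it."""
    with open(orig_path, encoding="utf-8") as f:
        original = f.read()
    with open(copy_path, encoding="utf-8") as f:
        copy = f.read()
    score = similarity(original, copy)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(str(score))
    return score
```

A script entry point would then call `compare_files(sys.argv[1], sys.argv[2], sys.argv[3])`, mirroring the three command-line arguments used above.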

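Sample run output

Performance Analysis

Unit Testing

Code Coverage

Unit Tests

Because segmentation accuracy directly determines the accuracy of the final result, these tests focus on whether the segmentation step behaves correctly.

The results suggest that when a complete word is interrupted by a space or another symbol, jieba does not seem to recognize it reliably. If an important word is added, removed, or altered in this way, the similarity score is likely to be affected.
Removing the stop words before segmentation instead would glue English words together whenever the article contains English, so those words could no longer be recognized correctly.
A better approach may exist, for example distinguishing the different cases and removing the interfering tokens with a method suited to each.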
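
One possible way to distinguish the cases mentioned above, offered only as a sketch and not as the author's solution: before segmentation, delete whitespace that sits between two Chinese characters, and leave whitespace next to Latin letters alone so English words stay separated. The function name `close_cjk_gaps` and the character range (the basic CJK Unified Ideographs block) are assumptions, and other interrupting symbols are not handled here.

```python
import re

# Basic CJK Unified Ideographs block (assumption: enough for ordinary Chinese text).
CJK = r"\u4e00-\u9fff"

# Whitespace with a CJK character immediately before and after it.
_CJK_GAP = re.compile(rf"(?<=[{CJK}])\s+(?=[{CJK}])")

def close_cjk_gaps(text):
    """Remove whitespace between Chinese characters; keep it elsewhere."""
    return _CJK_GAP.sub("", text)

cleaned = close_cjk_gaps("今天 天氣 good morning 很好")
print(cleaned)  # 今天天氣 good morning 很好
```

The cleaned text can then be passed to the segmenter, so that a word split by a stray space is seen whole while the spaces that delimit English words survive.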

