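Using SQL Server, how should I run a fuzzy, ranked search across all rows of a large table for similarity to a long phrase in one column?
In other words, if my data looks like this:
ID | Data |
---|---|
1 | the quick brown fox jumps over the lazy dog |
2 | the quick brown cat jumps over the lazy frog |
3 | the lazy quick brown frog jumps over the cat |
4 | lorem ipsum dolor sit amet |
and I search for "the quick brown ox jumps over a lazy dog", I would like results roughly like:
ID | Score |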
---|---|
1 | 95 |
2 | 80 |
3 | 40 |
4 | 0 |
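The actual data will have many more rows and longer phrases.
Obviously I don't want an exact string match, so using LIKE or CONTAINS won't work.
Word order matters, so searching for each word separately doesn't work either.
Full-text indexes and sounds-like (SOUNDEX-style) indexes only seem useful for substring similarity, so I haven't seen a way to apply them to phrase similarity. For example, how would you query them in a way that gives a decent score to similar phrases with missing or added words?
I have tested edit distance (Levenshtein, Jaro-Winkler, etc.), but it is far too slow for a large set of long strings: a single query takes minutes. It sounds like it is only meant for smaller data, so I think a different approach is needed here.
I have seen TF-IDF and cosine similarity mentioned, but I'm not sure whether they are the right fit here, or how they would be implemented on SQL Server.
Also, since we are running SQL Server on Linux, CLR support is limited. It appears to be allowed as long as it doesn't require unsafe or external-access permissions.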
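Answer:
A relatively fast way to find the best-matching strings with fuzzy-match logic is to count the 3-grams that the strings have in common.
It can take advantage of a pre-built SQL function and an indexed table to speed up the search. In particular, it does not have to compute the distance from the search string to every string in the dataset.
First, for convenience, create a table-valued function that breaks a string into 3-letter tokens.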
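DROP FUNCTION IF EXISTS dbo.get_triplets;
GO
CREATE FUNCTION dbo.get_triplets
(
    @data varchar(1000)
)
RETURNS TABLE AS RETURN
(
    -- Number generator: 1, 2, 3, ... (one number per row of sys.all_objects)
    WITH Nums AS
    (
        SELECT n = ROW_NUMBER() OVER (ORDER BY [object_id]) FROM sys.all_objects
    )
    -- c = occurrences of each triplet; triplet_count = total triplets in @data
    SELECT triplet, COUNT(*) c, LEN(@data) - 2 triplet_count
    FROM (
        -- every 3-character substring starting at positions 1 .. LEN(@data) - 2
        SELECT SUBSTRING(@data, n, 3) triplet
        FROM (SELECT TOP (LEN(@data) - 2) n FROM Nums ORDER BY n) n
    ) triplets
    GROUP BY triplet
)
GO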
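For readers who want to check the tokenization outside the database, here is a minimal Python sketch of what the function above is meant to compute: a count of every overlapping 3-character substring. The Python name `get_triplets` simply mirrors the SQL function for illustration; it is not part of the answer's code.

```python
from collections import Counter

def get_triplets(text: str) -> Counter:
    """Count every overlapping 3-character substring of text.

    Mirrors the intent of dbo.get_triplets; strings shorter than
    3 characters simply yield an empty Counter here.
    """
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = get_triplets("the the")
print(counts["the"])         # 2  ("the" occurs twice)
print(sum(counts.values()))  # 5  (len("the the") - 2 triplets in total)
```

Keeping per-triplet counts (rather than just a set) matters later, because the score counts a repeated triplet only as many times as it appears in both strings.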
Create the string dataset:
drop table if exists #data;
select * into #data
from (
values
(1, 'the quick brown fox jumps over the lazy dog'),
(2, 'the quick brown cat jumps over the lazy frog'),
(3, 'the lazy quick brown frog jumps over the cat'),
(4, 'lorem ipsum dolor sit amet')
) a(id,data);
Create the index table of 3-letter tokens:
drop table if exists #triplets;
select id,triplet,c,triplet_count data_triplet_count
into #triplets
from #data d
cross apply dbo.get_triplets(d.data);
CREATE unique CLUSTERED INDEX IX_triplet_index ON #triplets(triplet,id);
Then I would expect an efficient fuzzy search for matches to a given string with a query like this:
DECLARE @string_to_search varchar(1000) = 'the quick brown ox jumps over a lazy dog';
SELECT matched.*, d.data,
    -- score = shared triplets / triplet count of the longer string
    CAST(
        CAST(matched_triplets AS float)
        /
        CAST(CASE WHEN data_triplet_count > string_triplet_count
                  THEN data_triplet_count
                  ELSE string_triplet_count
             END AS float)
    AS decimal(4,3)) score
FROM (
    -- per row, sum the overlap (the smaller of the two counts) over shared triplets
    SELECT id, SUM(CASE WHEN a.c < b.c THEN a.c ELSE b.c END) matched_triplets,
        MAX(a.data_triplet_count) data_triplet_count,
        MAX(b.triplet_count) string_triplet_count
    FROM #triplets a
    JOIN dbo.get_triplets(@string_to_search) b
        ON a.triplet = b.triplet
    GROUP BY id
) matched
JOIN #data d
    ON d.id = matched.id
ORDER BY score DESC;
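The same lookup-and-score logic can be sketched in Python to make the idea easier to follow. This is only an illustration under assumptions: `build_index` and `search` are hypothetical helper names, an in-memory dictionary stands in for the indexed #triplets table, and the score is the one used in the query above (shared triplets divided by the triplet count of the longer string).

```python
from collections import Counter, defaultdict

def get_triplets(text):
    """Count every overlapping 3-character substring of text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def build_index(rows):
    """Map each triplet to {row id: occurrences}; also keep per-row totals."""
    index = defaultdict(dict)
    totals = {}
    for row_id, text in rows.items():
        counts = get_triplets(text)
        totals[row_id] = sum(counts.values())
        for triplet, c in counts.items():
            index[triplet][row_id] = c
    return index, totals

def search(index, totals, query):
    """Return (row id, score) pairs, best match first."""
    q_counts = get_triplets(query)
    q_total = sum(q_counts.values())
    matched = Counter()
    # Only rows sharing at least one triplet with the query are ever visited.
    for triplet, qc in q_counts.items():
        for row_id, dc in index.get(triplet, {}).items():
            matched[row_id] += min(qc, dc)
    scores = {rid: m / max(totals[rid], q_total) for rid, m in matched.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rows = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "the quick brown cat jumps over the lazy frog",
    3: "the lazy quick brown frog jumps over the cat",
    4: "lorem ipsum dolor sit amet",
}
index, totals = build_index(rows)
for row_id, score in search(index, totals, "the quick brown ox jumps over a lazy dog"):
    print(row_id, round(score, 3))
```

Note that triplet counts only capture local character order, so a phrase with the same words in a different order (row 3) can still score fairly close to a phrase with a few words changed (row 2).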
Answer:
Use the function F_INFERENCE_BASIQUE that I wrote...
https://sqlpro.developpez.com/cours/sql/comparaisons-motifs/#LVI
Demo:
CREATE TABLE T_TEST (id int, val VARCHAR(256));
INSERT INTO T_TEST VALUES
(1, 'the quick brown fox jumps over the lazy dog'),
(2, 'the quick brown cat jumps over the lazy frog'),
(3, 'the lazy quick brown frog jumps over the cat'),
(4, 'lorem ipsum dolor sit amet');
SELECT id, 100.0 * dbo.F_INFERENCE_BASIQUE(val, 'the quick brown fox jumps over the lazy dog')
/ LEN('the quick brown fox jumps over the lazy dog') AS PERCENT_MATCH
FROM T_TEST
id PERCENT_MATCH
----------- ---------------------------------------
1 100.000000000000
2 46.511627906976
3 81.395348837209
4 6.976744186046
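You can adapt the code to your needs, for example by dropping the two-way comparison and comparing in one direction only. In that case the match scores are: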
id PERCENT_MATCH
----------- ---------------------------------------
1 100.000000000000
2 46.511627906976
3 25.581395348837
4 4.651162790697
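Pretty close to what you need!
Answer:
There are many methods and algorithms for fuzzy string comparison.
In my system I use a CLR function that calculates the Jaro-Winkler distance. I use it when a user tries to create a new company. Before creating the new entry, I calculate the Jaro-Winkler distance between the new company name and every existing company in the database, to see whether it already exists while allowing for some typos and slightly different spellings.
I show the user the top few matches, in the hope that they won't create a duplicate.
This is how it works on your example: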
DECLARE @T TABLE (id int, val VARCHAR(256));
INSERT INTO @T VALUES
(1, 'the quick brown fox jumps over the lazy dog'),
(2, 'the quick brown cat jumps over the lazy frog'),
(3, 'the lazy quick brown frog jumps over the cat'),
(4, 'lorem ipsum dolor sit amet'),
(12, 'the quick brown ox jumps over the lazy dog'),
(13, 'the quick brown fox jumps over a lazy dog'),
(14, 'the quick brown ox jumps over a lazy dog')
;
SELECT
T.*
,dbo.StringSimilarityJaroWinkler(T.val,
'the quick brown ox jumps over a lazy dog') AS dist
FROM @T AS T
ORDER BY dist desc
;
+----+----------------------------------------------+-------------------+
| id | val                                          | dist              |
+----+----------------------------------------------+-------------------+
| 14 | the quick brown ox jumps over a lazy dog     | 1                 |
| 13 | the quick brown fox jumps over a lazy dog    | 0.995121951219512 |
| 12 | the quick brown ox jumps over the lazy dog   | 0.975586080586081 |
| 1  | the quick brown fox jumps over the lazy dog  | 0.971267143709004 |
| 2  | the quick brown cat jumps over the lazy frog | 0.931560196560197 |
| 3  | the lazy quick brown frog jumps over the cat | 0.836212121212121 |
| 4  | lorem ipsum dolor sit amet                   | 0.584472934472934 |
+----+----------------------------------------------+-------------------+
Here is the C# code for the CLR function:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
    /*
    The Winkler modification will not be applied unless the percent match
    was at or above the WeightThreshold percent without the modification.
    Winkler's paper used a default value of 0.7
    */
    private static readonly double m_dWeightThreshold = 0.7;
    /*
    Size of the prefix to be considered by the Winkler modification.
    Winkler's paper used a default value of 4
    */
    private static readonly int m_iNumChars = 4;
    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlDouble StringSimilarityJaroWinkler(SqlString string1, SqlString string2)
    {
        if (string1.IsNull || string2.IsNull)
        {
            return 0.0;
        }
        return GetStringSimilarityJaroWinkler(string1.Value, string2.Value);
    }
    private static double GetStringSimilarityJaroWinkler(string string1, string string2)
    {
        int iLen1 = string1.Length;
        int iLen2 = string2.Length;
        if (iLen1 == 0)
        {
            return iLen2 == 0 ? 1.0 : 0.0;
        }
        // characters only count as matching within this distance of each other
        int iSearchRange = Math.Max(0, Math.Max(iLen1, iLen2) / 2 - 1);
        bool[] Matched1 = new bool[iLen1];
        for (int i = 0; i < Matched1.Length; i++)
        {
            Matched1[i] = false;
        }
        bool[] Matched2 = new bool[iLen2];
        for (int i = 0; i < Matched2.Length; i++)
        {
            Matched2[i] = false;
        }
        // count characters of string1 that have an unused match in string2
        int iNumCommon = 0;
        for (int i = 0; i < iLen1; i++)
        {
            int iStart = Math.Max(0, i - iSearchRange);
            int iEnd = Math.Min(i + iSearchRange + 1, iLen2);
            for (int j = iStart; j < iEnd; j++)
            {
                if (Matched2[j]) continue;
                if (string1[i] != string2[j]) continue;
                Matched1[i] = true;
                Matched2[j] = true;
                iNumCommon++;
                break;
            }
        }
        if (iNumCommon == 0) return 0.0;
        // count matched characters that appear in a different order
        int iNumHalfTransposed = 0;
        int k = 0;
        for (int i = 0; i < iLen1; i++)
        {
            if (!Matched1[i]) continue;
            while (!Matched2[k])
            {
                k++;
            }
            if (string1[i] != string2[k])
            {
                iNumHalfTransposed++;
            }
            k++;
            // even though length of Matched1 and Matched2 can be different,
            // number of elements with true flag is the same in both arrays
            // so, k will never go outside the array boundary
        }
        int iNumTransposed = iNumHalfTransposed / 2;
        // Jaro similarity: average of the three ratios
        double dWeight =
            (
                (double)iNumCommon / (double)iLen1
                + (double)iNumCommon / (double)iLen2
                + (double)(iNumCommon - iNumTransposed) / (double)iNumCommon
            ) / 3.0;
        if (dWeight > m_dWeightThreshold)
        {
            // Winkler modification: boost the score for a shared prefix
            int iComparisonLength = Math.Min(m_iNumChars, Math.Min(iLen1, iLen2));
            int iCommonChars = 0;
            while (iCommonChars < iComparisonLength && string1[iCommonChars] == string2[iCommonChars])
            {
                iCommonChars++;
            }
            dWeight = dWeight + 0.1 * iCommonChars * (1.0 - dWeight);
        }
        return dWeight;
    }
};
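To make the final prefix step of the function easier to check by hand, here is a small Python sketch of the Winkler adjustment on its own. It assumes the same constants as the C# code (threshold 0.7, prefix length 4, scaling factor 0.1); `winkler_adjust` is a hypothetical helper name used only for this illustration, and the Jaro score is passed in rather than computed.

```python
def winkler_adjust(jaro, s1, s2, threshold=0.7, max_prefix=4, scaling=0.1):
    """Apply the Winkler prefix boost to an already computed Jaro score."""
    if jaro <= threshold:
        # Below the threshold the score is returned unchanged.
        return jaro
    limit = min(max_prefix, len(s1), len(s2))
    prefix = 0
    while prefix < limit and s1[prefix] == s2[prefix]:
        prefix += 1
    # Each shared leading character closes 10% of the gap to 1.0.
    return jaro + scaling * prefix * (1.0 - jaro)

# A Jaro score of 0.9 with a 4-character shared prefix becomes roughly 0.94.
print(winkler_adjust(0.9, "the quick", "the quiet"))
```

Because the boost depends only on the first few characters, long phrases that differ at the very start get no boost, while phrases that share a beginning are pushed closer to 1.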