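I am trying to write a regular expression in C# that splits text into sentences.
My regex is not working correctly: it splits the sentences fine, but the last character of each one is always removed. Any tips?
For example, if I want to split this text into sentences:
Lorem ipsum dolor sit amet. Nam autem doloribus ut perspiciatis omnis est ratione quidem!
My regex splits it into:
Lorem ipsum dolor sit ame
Nam autem doloribus ut perspiciatis omnis est ratione quide
It should be: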
Lorem ipsum dolor sit amet
Nam autem doloribus ut perspiciatis omnis est ratione quidem
Sample code
My regex is the string variable named pattern:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using static System.Net.Mime.MediaTypeNames;

namespace L4_17
{
    internal class Program
    {
        static void Main(string[] args)
        {
            const string firstBookData = "first.txt";
            string firstFileData = File.ReadAllText(firstBookData);
            string pattern = "[^\\.\\!\\?] *[\\.\\!\\?]";
            List<string> allSentencesInFirstDataFile = Regex.Split(firstFileData, pattern).ToList();
            foreach (string sentence in allSentencesInFirstDataFile)
            {
                Console.WriteLine(sentence);
            }
        }
    }
}
Answer from a uj5u.com user:
I suggest using a different pattern:
[.!?]+\s*(?=\p{Lu}|$)
Explanation:
[.!?]+ - one or more of ., !, ? (so that runs such as ??, ..., ?! are handled as well)
\s* - zero or more whitespace characters
(?=\p{Lu}|$) - lookahead: either the uppercase letter that starts the next sentence, or the end of the string
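As for why the pattern in the question loses a character: its leading `[^\.\!\?]` matches the last letter of the sentence, and `Regex.Split` throws away everything the pattern matches. The lookahead in the suggested pattern only inspects the next character without consuming it. Here is a minimal sketch of the difference, assuming the suggested pattern is written with the `+` quantifier ("one or more") that the explanation describes. `SplitComparison` and `Show` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitComparison
{
    // Pattern from the question: the leading [^.!?] matches the last letter
    // of the sentence, so Regex.Split discards it along with the punctuation.
    public const string OriginalPattern = "[^\\.\\!\\?] *[\\.\\!\\?]";

    // Suggested pattern, with "+" for "one or more". The lookahead (?=...)
    // only checks what follows; it does not consume the capital letter.
    public const string SuggestedPattern = @"[.!?]+\s*(?=\p{Lu}|$)";

    // Hypothetical helper: split and wrap every piece in brackets so that
    // empty or space-padded pieces are visible.
    public static string Show(string text, string pattern)
    {
        string[] parts = Regex.Split(text, pattern);
        return "[" + string.Join("][", parts) + "]";
    }

    public static void Main()
    {
        const string text = "Lorem ipsum dolor sit amet. Nam autem doloribus!";
        Console.WriteLine(Show(text, OriginalPattern));
        Console.WriteLine(Show(text, SuggestedPattern));
    }
}
```

With the original pattern the pieces end in "ame" and "doloribu". With the suggested pattern they end in "amet" and "doloribus". Both produce a trailing empty piece because the text ends with a terminator.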
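Code:
using System;
using System.Text.RegularExpressions;

// [.!?]+ needs the "+" so that runs such as "???" and "?!" count as one terminator.
var text = "Lorem ipsum dolor sit amet. Nam etc. autem??? Doloribus ut perspiciatis?! Omnis est ratione quidem!";
var lines = Regex.Split(text, @"[.!?]+\s*(?=\p{Lu}|$)");
Console.WriteLine(string.Join(Environment.NewLine, lines));
Output: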
Lorem ipsum dolor sit amet
Nam etc. autem # <- note etc. is not the end of the sentence
Doloribus ut perspiciatis
Omnis est ratione quidem
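One detail to watch for: when the text ends with a terminator, `Regex.Split` also returns an empty string after the last sentence. The sketch below wraps the split in a helper that trims each piece and drops the empty ones. It assumes the `[.!?]+` form of the pattern, and `SentenceSplitter` and `SplitSentences` are hypothetical names, not part of any library. The same helper could be applied to the file contents read with `File.ReadAllText` in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // One or more terminators, optional whitespace, then either an
    // uppercase letter (not consumed) or the end of the text.
    private const string Pattern = @"[.!?]+\s*(?=\p{Lu}|$)";

    // Hypothetical helper: split into sentences, trim each piece and
    // drop the empty entry that Regex.Split leaves after a final terminator.
    public static List<string> SplitSentences(string text)
    {
        return Regex.Split(text, Pattern)
            .Select(piece => piece.Trim())
            .Where(piece => piece.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        const string text =
            "Lorem ipsum dolor sit amet. Nam autem doloribus ut perspiciatis omnis est ratione quidem!";
        foreach (string sentence in SplitSentences(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

This approach only splits where the next sentence starts with an uppercase letter, so a sentence that begins with a digit, a quotation mark, or a lowercase letter stays attached to the previous one.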