Using python-ucto
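Ucto is a rule-based, multilingual tokeniser that also performs sentence boundary detection. Although it is written in C++, a Python binding, python-ucto, is available for interacting with it.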
import ucto
# Set a file to use as tokeniser rules. This one is for English; other languages are available too.
settingsfile = "/usr/local/etc/ucto/tokconfig-en"
# Initialise the tokeniser. Options are passed as keyword arguments; the defaults are:
#   lowercase=False, uppercase=False, sentenceperlineinput=False,
#   sentenceperlineoutput=False,
#   sentencedetection=True, paragraphdetection=True, quotedetection=False,
#   debug=False
tokenizer = ucto.Tokenizer(settingsfile)
# Pass the text to the tokeniser
tokenizer.process("This is a sentence. This is another sentence. More sentences are better!")
# Print each detected sentence
for sentence in tokenizer.sentences():
    print(sentence)
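To give a feel for what "rule-based" tokenisation with sentence boundary detection means, here is a toy sketch in pure Python. It does not use ucto and does not reproduce ucto's rules: the names `TOKEN_RULE`, `SENTENCE_END` and `toy_tokenize` are hypothetical, and the single regular expression is only an assumption chosen for illustration. A real rule set would also need to handle cases such as abbreviations, which this sketch ignores.

```python
import re

# Toy rule, NOT ucto's actual rule set: a token is either a run of word
# characters (optionally joined by internal apostrophes or hyphens) or a
# single non-space punctuation character.
TOKEN_RULE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

# Toy sentence-boundary rule: these tokens always end a sentence.
SENTENCE_END = {".", "!", "?"}


def toy_tokenize(text):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    for match in TOKEN_RULE.finditer(text):
        token = match.group()
        current.append(token)
        if token in SENTENCE_END:
            # Close the current sentence at the boundary token.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that lack a final boundary token.
        sentences.append(current)
    return sentences


text = "This is a sentence. This is another sentence. More sentences are better!"
for sentence in toy_tokenize(text):
    print(" ".join(sentence))
```

The sketch splits punctuation from words and starts a new sentence after every `.`, `!` or `?`. That is enough for the example text above, but it would wrongly split on something like "Dr. Smith", which is the kind of case a dedicated tokeniser's rules are meant to cover.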