Python: How to prevent splitting specific words or phrases and numbers in NLTK?

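I run into text-matching problems when tokenization splits specific words, dates, and numbers. When tokenizing words in NLTK, how can I prevent phrases such as "runs in my family", "30 minute walk", or "4x a day" from being split?

They should not end up as:

['runs', 'in', 'my', 'family', '4x', 'a', 'day']

For example:

Yes 20-30 minutes a day on my bike, it works great!!

gives:

['yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike', ',', 'it', 'works', 'great']

I want "20-30 minutes" to be treated as a single word. How can I get this behavior?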

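Related discussion

Great first question! I think it is worth cleaning the question up a little, with punctuation and grammar, because I don't think this is a simple task. I have provided a solution, but I worry it may be computationally very expensive. It would be a great help to have some other users weigh in on this.
Good question! There are also functions written in nltk that work slightly differently from the spaCy linguistic-steps/regex-pattern approach.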

You can use MWETokenizer:

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# word_tokenize keeps '20-30' as a single token, so the MWE tuple must list it that way
tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')])
tokenizer.tokenize(word_tokenize('Yes 20-30 minutes a day on my bike, it works great!!'))

[out]:

['Yes', '20-30_minutes_a_day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!', '!']
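MWETokenizer also takes a `separator` argument and has an `add_mwe` method, which helps when the default underscore is unwanted or when expressions are collected one at a time. The following is a minimal sketch, not part of the original answer. It passes a hand-written token list, an assumption standing in for the output of `word_tokenize`, so that it runs without any tokenizer model data. The extra expression `('on', 'my', 'bike')` is purely illustrative.

```python
from nltk.tokenize import MWETokenizer

# Hand-written tokens, assumed to match what word_tokenize returns for the
# example sentence (the question shows '20-30' kept as one token).
tokens = ['Yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike', ',',
          'it', 'works', 'great', '!', '!']

# separator=' ' joins the matched tokens with a space instead of the default '_'.
tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')], separator=' ')

# add_mwe registers a further expression after construction (illustrative only).
tokenizer.add_mwe(('on', 'my', 'bike'))

merged = tokenizer.tokenize(tokens)
print(merged)
# ['Yes', '20-30 minutes a day', 'on my bike', ',', 'it', 'works', 'great', '!', '!']
```

Matching is done on exact tokens and is case-sensitive, so a lowercased corpus needs lowercased expressions.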

A more principled approach, since you don't know how word_tokenize will split the words you want to keep:

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

def multiword_tokenize(text, mwe):
    # Initialize the MWETokenizer
    protected_tuples = [word_tokenize(word) for word in mwe]
    protected_tuples_underscore = ['_'.join(word) for word in protected_tuples]
    tokenizer = MWETokenizer(protected_tuples)
    # Tokenize the text.
    tokenized_text = tokenizer.tokenize(word_tokenize(text))
    # Replace the underscored protected words with the original MWE
    for i, token in enumerate(tokenized_text):
        if token in protected_tuples_underscore:
            tokenized_text[i] = mwe[protected_tuples_underscore.index(token)]
    return tokenized_text

mwe = ['20-30 minutes a day', '!!']
print(multiword_tokenize('Yes 20-30 minutes a day on my bike, it works great!!', mwe))

[out]:

['Yes', '20-30 minutes a day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!!']
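If depending on how `word_tokenize` splits things is the concern, a different technique is to skip it and tokenize with a single regular expression that tries the protected phrases before falling back to ordinary words and punctuation. This is a minimal sketch using only the standard `re` module, not the answer's method. The function name `protected_tokenize` and the fallback pattern are assumptions chosen for illustration, and the fallback is far simpler than NLTK's tokenizer.

```python
import re

def protected_tokenize(text, phrases):
    """Split text into tokens, keeping each phrase in `phrases` whole."""
    # Longest phrases first, so overlapping candidates prefer the longer match.
    ordered = sorted(phrases, key=len, reverse=True)
    protected = '|'.join(re.escape(p) for p in ordered)
    # Fallback: a word (hyphens allowed inside) or one punctuation character.
    fallback = r'\w+(?:-\w+)*|[^\w\s]'
    pattern = f'(?:{protected})|{fallback}' if protected else fallback
    return re.findall(pattern, text)

tokens = protected_tokenize(
    'Yes 20-30 minutes a day on my bike, it works great!!',
    ['20-30 minutes a day', '!!'],
)
print(tokens)
# ['Yes', '20-30 minutes a day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!!']
```

One limitation of this sketch: a protected phrase is matched wherever it begins, even in the middle of a longer word, so short phrases may need explicit boundary checks.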

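As far as I know, you will have a hard time preserving n-grams of various lengths while tokenizing, but you can find those n-grams as shown below. You can then replace the items in the corpus with the n-grams, joined by some connecting character (such as a dash).

This is one example solution, but there are probably many ways to do it. Important note: I provide a way to find the n-grams that are common in the text. You will probably want more than one, so I put in a variable that lets you decide how many n-grams to collect. You may want a different number for each size, but for now I have given only one variable. This may miss n-grams that you consider important. For those, you can add what you want to find to user_grams; these will be added to the search.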

import nltk

#an example corpus
corpus = '''A big tantrum runs in my family 4x a day, every week.
A big tantrum is lame. A big tantrum causes strife. It runs in my family
because of our complicated history. Every week is a lot though. Every week
I dread the tantrum. Every week…Here is another ngram I like a lot'''.lower()

#tokenize the corpus
corpus_tokens = nltk.word_tokenize(corpus)

#create ngrams from n=2 to 5
bigrams = list(nltk.ngrams(corpus_tokens, 2))
trigrams = list(nltk.ngrams(corpus_tokens, 3))
fourgrams = list(nltk.ngrams(corpus_tokens, 4))
fivegrams = list(nltk.ngrams(corpus_tokens, 5))

This part finds the most common n-grams, up to 5-grams.

#if you change this to zero you will only get the user chosen ngrams
n_most_common = 1 #how many of the most common n-grams do you want.

fdist_bigrams = nltk.FreqDist(bigrams).most_common(n_most_common) #n most common bigrams
fdist_trigrams = nltk.FreqDist(trigrams).most_common(n_most_common) #n most common trigrams
fdist_fourgrams = nltk.FreqDist(fourgrams).most_common(n_most_common) #n most common four grams
fdist_fivegrams = nltk.FreqDist(fivegrams).most_common(n_most_common) #n most common five grams

#concat the ngrams together (each x is an (ngram_tuple, count) pair)
fdist_bigrams = [' '.join(x[0]) for x in fdist_bigrams]
fdist_trigrams = [' '.join(x[0]) for x in fdist_trigrams]
fdist_fourgrams = [' '.join(x[0]) for x in fdist_fourgrams]
fdist_fivegrams = [' '.join(x[0]) for x in fdist_fivegrams]

#next 4 lines create a single list with important ngrams
n_grams = list(fdist_bigrams)
n_grams.extend(fdist_trigrams)
n_grams.extend(fdist_fourgrams)
n_grams.extend(fdist_fivegrams)
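The per-size steps above repeat the same pattern four times, so they can be folded into a loop over n. The sketch below is one possible way to do that, not the answer's own code. The helper name `top_ngrams` and its parameters are hypothetical, and the sample tokens come from a plain `str.split()` rather than `nltk.word_tokenize`, so the example runs without tokenizer data.

```python
from nltk import FreqDist, ngrams

def top_ngrams(tokens, sizes=(2, 3, 4, 5), n_most_common=1):
    """Return the most frequent n-grams of each size as space-joined strings."""
    found = []
    for n in sizes:
        # FreqDist counts how often each n-gram tuple occurs.
        counts = FreqDist(ngrams(tokens, n))
        found.extend(' '.join(gram) for gram, _count in counts.most_common(n_most_common))
    return found

# Whitespace-split sample tokens (an assumption made to keep the sketch self-contained).
sample_tokens = 'a big tantrum is lame . a big tantrum causes strife .'.split()
print(top_ngrams(sample_tokens, sizes=(3,)))
# ['a big tantrum']
```

Setting `n_most_common=0` returns an empty list, which mirrors the "user chosen n-grams only" case described in the comments above.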

This part lets you add your own n-grams to the list.

#Another option here would be to make your own list of the ones you want
#in this example I add some user ngrams to the ones found above
user_grams = ['ngram1 I like', 'ngram 2', 'another ngram I like a lot']
user_grams = [x.lower() for x in user_grams]

n_grams.extend(user_grams)

The last part does the processing, so that you can tokenize again and get the n-grams as tokens.

#initialize the corpus that will have combined ngrams
corpus_ngrams = corpus

#here we go through the ngrams we found and replace them in the corpus with
#a version connected with dashes. That way we can find them when we tokenize.
for gram in n_grams:
    gram_r = gram.replace(' ', '-')
    corpus_ngrams = corpus_ngrams.replace(gram, gram_r)

#retokenize the new corpus so we can find the ngrams
corpus_ngrams_tokens = nltk.word_tokenize(corpus_ngrams)

print(corpus_ngrams_tokens)

Out: ['a-big-tantrum', 'runs-in-my-family', '4x', 'a', 'day', ',', 'every-week', '.', 'a-big-tantrum', 'is', 'lame', '.', 'a-big-tantrum', 'causes', 'strife', '.', 'it', 'runs-in-my-family', 'because', 'of', 'our', 'complicated', 'history', '.', 'every-week', 'is', 'a', 'lot', 'though', '.', 'every-week', 'i', 'dread', 'the', 'tantrum', '.', 'every-week', '…']
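A plain `str.replace` also rewrites an n-gram that happens to sit inside a longer word ("every week" inside "every weekend"), and the result depends on the order in which the n-grams are replaced. One way around both issues is a single regular-expression pass with boundary checks, sketched below with the standard `re` module only. This swaps in a different technique from the loop above, and the name `join_ngrams` and its `joiner` parameter are hypothetical.

```python
import re

def join_ngrams(text, n_grams, joiner='-'):
    """Replace each n-gram found in `text` with a single joined token."""
    # Longest n-grams first, so 'runs in my family' wins over 'my family'.
    ordered = sorted(set(n_grams), key=len, reverse=True)
    if not ordered:
        return text
    alternatives = '|'.join(re.escape(g) for g in ordered)
    # (?<!\w) and (?!\w) stop a match from starting or ending inside a word.
    pattern = re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')
    return pattern.sub(lambda m: m.group(0).replace(' ', joiner), text)

text = 'every week i dread every weekend. it runs in my family.'
joined = join_ngrams(text, ['every week', 'my family', 'runs in my family'])
print(joined)
# every-week i dread every weekend. it runs-in-my-family.
```

The joined text can then be tokenized again as before, since the dashed n-grams no longer contain spaces.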

I think this is actually a really good question.
