How to prevent splitting specific words or phrases and numbers in NLTK?

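I have a text-matching problem: when I tokenize text, specific words, dates and numbers get split apart. How can I prevent phrases such as "runs in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK?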

They should not result in:

['runs', 'in', 'my', 'family', '4x', 'a', 'day']

For example:

Yes 20-30 minutes a day on my bike, it works great!!

gives:

['yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike', ',', 'it', 'works', 'great']

I want '20-30 minutes' to be treated as a single word. How can I get this behavior?



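Related discussion

  • A confusing first question! I think it would be worth cleaning it up a little with punctuation and grammar, because I don't think this is a trivial task. I provided a solution, but I worry that it could be very computationally expensive. It would help a lot to have some other users weigh in.
  • Good question! There are also some functions written in nltk that work slightly differently from the spacy linguistic-step / regex-pattern approach.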


You can use the MWETokenizer:

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# word_tokenize keeps '20-30' as a single token, so the multi-word
# expression has to list '20-30' as one element for the match to succeed.
tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')])
tokenizer.tokenize(word_tokenize('Yes 20-30 minutes a day on my bike, it works great!!'))

[out]:

['Yes', '20-30_minutes_a_day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!', '!']
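If the underscore joiner is unwanted, MWETokenizer also takes a separator argument, and further phrases can be registered afterwards with add_mwe. The following is only a minimal sketch: it assumes the text is already split into tokens (plain str.split() stands in for word_tokenize here, so no tokenizer data has to be downloaded), and the phrase ('my', 'bike') is an illustrative addition that is not in the answer above.

```python
from nltk.tokenize import MWETokenizer

# Pre-split tokens; str.split() is an assumption standing in for word_tokenize.
tokens = "Yes 20-30 minutes a day on my bike".split()

# separator=' ' joins the matched tokens with a space instead of '_'.
tokenizer = MWETokenizer([("20-30", "minutes", "a", "day")], separator=" ")

# More multi-word expressions can be added after construction.
tokenizer.add_mwe(("my", "bike"))

merged = tokenizer.tokenize(tokens)
print(merged)
```

With a space as separator, the merged token reads like the original phrase, so no post-processing step is needed to restore it.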

A more principled approach, since you don't know how `word_tokenize` will split the words you want to keep:

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

def multiword_tokenize(text, mwe):
    # Initialize the MWETokenizer
    protected_tuples = [word_tokenize(word) for word in mwe]
    protected_tuples_underscore = ['_'.join(word) for word in protected_tuples]
    tokenizer = MWETokenizer(protected_tuples)
    # Tokenize the text.
    tokenized_text = tokenizer.tokenize(word_tokenize(text))
    # Replace the underscored protected words with the original MWE
    for i, token in enumerate(tokenized_text):
        if token in protected_tuples_underscore:
            tokenized_text[i] = mwe[protected_tuples_underscore.index(token)]
    return tokenized_text

mwe = ['20-30 minutes a day', '!!']
print(multiword_tokenize('Yes 20-30 minutes a day on my bike, it works great!!', mwe))

[out]:

['Yes', '20-30 minutes a day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!!']
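The same idea of protecting phrases before the text is split can also be sketched without NLTK, using a single regular expression that tries the protected phrases first. This is not the answer's method, only an illustration: protected_tokenize is a hypothetical helper, and its fallback pattern is a rough stand-in that will not reproduce word_tokenize's handling of contractions, abbreviations and similar cases.

```python
import re

def protected_tokenize(text, phrases):
    """Split text into tokens, keeping each phrase in `phrases` whole."""
    # Longest phrases first, so a long phrase wins over one of its prefixes.
    ordered = sorted(phrases, key=len, reverse=True)
    protected = "|".join(re.escape(p) for p in ordered)
    # Fallback: a word (internal hyphens allowed) or one punctuation mark.
    fallback = r"\w+(?:-\w+)*|[^\w\s]"
    pattern = re.compile(f"{protected}|{fallback}" if protected else fallback)
    return pattern.findall(text)

tokens = protected_tokenize(
    "Yes 20-30 minutes a day on my bike, it works great!!",
    ["20-30 minutes a day", "!!"],
)
print(tokens)
```

Because the phrases are matched as literal text, a phrase can also match inside a longer word; adding word-boundary checks would be needed where that matters.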

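As far as I know, you will have a hard time preserving n-grams of various lengths while tokenizing, but you can find those n-grams as shown below. You can then replace the items in the corpus with the n-grams, joined by some connecting character such as a dash.

This is one example solution, but there are probably many ways to do it. Important note: I provide a way to find the n-grams that are common in the text (you will probably want more than one, so I put in a variable that lets you decide how many n-grams to collect. You might want a different number for each size, but I have only given one variable for now). This may miss n-grams that you consider important. For those, you can add what you are looking for to user_grams. They will be added to the search.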

import nltk

# An example corpus
corpus = '''A big tantrum runs in my family 4x a day, every week.
A big tantrum is lame. A big tantrum causes strife. It runs in my family
because of our complicated history. Every week is a lot though. Every week
I dread the tantrum. Every week…Here is another ngram I like a lot'''.lower()

# Tokenize the corpus
corpus_tokens = nltk.word_tokenize(corpus)

# Create n-grams from n=2 to 5
bigrams = list(nltk.ngrams(corpus_tokens, 2))
trigrams = list(nltk.ngrams(corpus_tokens, 3))
fourgrams = list(nltk.ngrams(corpus_tokens, 4))
fivegrams = list(nltk.ngrams(corpus_tokens, 5))

This part finds the most common n-grams, up to 5-grams.

# If you change this to zero you will only get the user-chosen n-grams
n_most_common = 1  # how many of the most common n-grams you want

fdist_bigrams = nltk.FreqDist(bigrams).most_common(n_most_common)      # n most common bigrams
fdist_trigrams = nltk.FreqDist(trigrams).most_common(n_most_common)    # n most common trigrams
fdist_fourgrams = nltk.FreqDist(fourgrams).most_common(n_most_common)  # n most common four-grams
fdist_fivegrams = nltk.FreqDist(fivegrams).most_common(n_most_common)  # n most common five-grams

# Join each n-gram tuple back into a space-separated string
fdist_bigrams = [' '.join(gram) for gram, _ in fdist_bigrams]
fdist_trigrams = [' '.join(gram) for gram, _ in fdist_trigrams]
fdist_fourgrams = [' '.join(gram) for gram, _ in fdist_fourgrams]
fdist_fivegrams = [' '.join(gram) for gram, _ in fdist_fivegrams]

# The next 4 lines create a single list with the important n-grams
n_grams = fdist_bigrams
n_grams.extend(fdist_trigrams)
n_grams.extend(fdist_fourgrams)
n_grams.extend(fdist_fivegrams)
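The four near-identical blocks above could also be folded into one loop over the n-gram sizes. The sketch below uses only the standard library (collections.Counter in place of nltk.FreqDist, and zip in place of nltk.ngrams). The helper name top_ngrams and its parameters are hypothetical and not part of the answer.

```python
from collections import Counter

def top_ngrams(tokens, sizes=(2, 3, 4, 5), n_most_common=1):
    """Return the most common n-grams of each size as space-joined strings."""
    phrases = []
    for n in sizes:
        # Zipping n staggered views of the token list yields every n-gram.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts = Counter(grams)
        phrases.extend(" ".join(gram) for gram, _ in counts.most_common(n_most_common))
    return phrases

demo_tokens = "every week is fine every week".split()
print(top_ngrams(demo_tokens, sizes=(2,)))  # → ['every week']
```

As in the code above, setting n_most_common to 0 returns an empty list, which leaves only the user-supplied n-grams.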

This part lets you add your own n-grams to the list.

# Another option here would be to make your own list of the ones you want.
# In this example I add some user n-grams to the ones found above.
user_grams = ['ngram1 I like', 'ngram 2', 'another ngram I like a lot']
user_grams = [x.lower() for x in user_grams]

n_grams.extend(user_grams)

The last part does the processing so that you can tokenize again and get the n-grams as tokens.

# Initialize the corpus that will have combined n-grams
corpus_ngrams = corpus

# Here we go through the n-grams we found and replace them in the corpus with
# a version connected with dashes. That way we can find them when we tokenize.
for gram in n_grams:
    gram_r = gram.replace(' ', '-')
    corpus_ngrams = corpus_ngrams.replace(gram, gram_r)

# Retokenize the new corpus so we can find the n-grams
corpus_ngrams_tokens = nltk.word_tokenize(corpus_ngrams)

print(corpus_ngrams_tokens)

Out: ['a-big-tantrum', 'runs-in-my-family', '4x', 'a', 'day', ',', 'every-week', '.', 'a-big-tantrum', 'is', 'lame', '.', 'a-big-tantrum', 'causes', 'strife', '.', 'it', 'runs-in-my-family', 'because', 'of', 'our', 'complicated', 'history', '.', 'every-week', 'is', 'a', 'lot', 'though', '.', 'every-week', 'i', 'dread', 'the', 'tantrum', '.', 'every-week', '…']
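One thing to watch with plain str.replace is that it also matches inside longer words (for example 'a big' inside 'a bigger'), and the result depends on the order of the n-gram list. A possible variation, sketched with the standard re module, matches the longest phrases first and only at word boundaries. The helper join_phrases and its joiner parameter are hypothetical names used for illustration.

```python
import re

def join_phrases(text, phrases, joiner="-"):
    """Replace the spaces inside each known phrase with `joiner`."""
    # Longest first, so 'a big tantrum' is tried before its prefix 'a big'.
    ordered = sorted(set(phrases), key=len, reverse=True)
    if not ordered:
        return text
    alternatives = "|".join(re.escape(p) for p in ordered)
    # (?<!\w) and (?!\w) stop a phrase from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: m.group(0).replace(" ", joiner), text)

joined = join_phrases(
    "every week i dread a big tantrum, but a bigger one is rare",
    ["a big", "a big tantrum", "every week"],
)
print(joined)
```

The joined text can then be passed to nltk.word_tokenize as in the last step above.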

I think this is actually a very good question.



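Related discussion

  • Thanks. If I want to match the n-grams that I found in my dataset, should I make my own list to match against and keep only the n-grams in that list? Would that be more time-consuming?
  • I included that option in the code. If you don't also want to find the most common ones, just change n_most_common=1 to n_most_common=0. I did want my solution to be self-contained and verifiable, though. I'll edit that in as a comment. You can then add the n-grams you want to the user_grams list.
  • Also, it seems unlikely that there are many n-grams that are both uncommon and important. In other words, if you are going to bother tokenizing these n-grams, it should be because they matter to you, but if they don't occur often, that in itself makes them less important. You should get all the important common ones with the first method and only need to add a few that are specific to your study, but that is just my hunch.
  • Also, if this answers your question, please consider upvoting it and marking it as accepted. See: What should I do when someone answers my question?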

