NLP for Beginners: A Guide to the Essential Library spaCy

Coggle数据科学 2022-03-28

Introduction to spaCy

spaCy

https://spacy.io/

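spaCy is an open-source natural language processing library designed to handle NLP tasks efficiently by implementing the most effective general-purpose algorithms.

For many NLP tasks, spaCy offers only one implementation, choosing the most efficient algorithm currently available. This means you usually cannot pick an alternative algorithm.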

NLTK

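NLTK is a very popular open-source library. It was first released in 2001, which makes it much older than spaCy (released in 2015). It also offers many features, but some of its implementations are less efficient.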

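spaCy vs. NLTK

For many common NLP tasks spaCy is faster and more efficient, at the cost of not letting the user choose the algorithm implementation. However, spaCy does not ship pre-built models for some applications, such as sentiment analysis.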

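spaCy Basics

spaCy is an open-source Python library that can parse and "understand" large volumes of text. Separate models are available for specific languages (English, French, German, and so on).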

Workflow

Using spaCy involves a few key steps:

Loading the language library
Building a pipeline object
Using tokens
Part-of-speech tagging
Understanding token attributes

Tokenization

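The first step in processing text is to split all of its components (words and punctuation) into "tokens". These tokens are annotated inside the Doc object with descriptive information.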
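As a minimal sketch of tokenization on its own, the snippet below uses spacy.blank("en"), which (assuming spaCy v3 is installed) builds an English pipeline containing only the rule-based tokenizer, so no trained model has to be downloaded. The sample sentence is just an illustration.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no tagger, parser, or NER).
nlp = spacy.blank("en")

doc = nlp("Tesla will start selling cars in India next year.")

# Iterating over a Doc yields Token objects; punctuation becomes its own token.
tokens = [token.text for token in doc]
print(tokens)
print(len(doc))  # number of tokens in the Doc
```

Note that the final period is split off as a separate token rather than staying attached to "year".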

spaCy Objects

Pipeline

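When we run nlp, our text enters a processing pipeline that first breaks the text down and then performs a series of operations to tag, parse, and describe the data.

We can inspect the components currently present in the pipeline. In later sections we will learn how to disable components and add new ones as needed.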
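Inspecting and modifying the pipeline might look roughly like the sketch below. It assumes the spaCy v3 API (add_pipe with a component name, select_pipes as a context manager) and starts from a blank pipeline so that it runs without a downloaded model. With a trained model such as en_core_web_sm, pipe_names would list its tagger, parser, NER, and other components instead.

```python
import spacy

nlp = spacy.blank("en")

# A blank pipeline has no components after the tokenizer.
names_initial = list(nlp.pipe_names)

# Add a built-in component by its registered name (spaCy v3 style).
nlp.add_pipe("sentencizer")
names_after_add = list(nlp.pipe_names)

# Temporarily disable a component; it is restored when the block exits.
with nlp.select_pipes(disable=["sentencizer"]):
    names_while_disabled = list(nlp.pipe_names)

names_restored = list(nlp.pipe_names)

print(names_initial, names_after_add, names_while_disabled, names_restored)
```

Disabling components that a task does not need is a common way to speed up processing of large text collections.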

Part-of-Speech Tagging (POS)

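After the text has been split into tokens, the next step is to assign parts of speech. In this section's example, Tesla is recognized as a proper noun. This requires some statistical modeling. For example, the word that follows "the" is usually a noun.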

Dependencies

The example also shows the syntactic dependency assigned to each token. Tesla is identified as nsubj, the nominal subject of the sentence.

Additional Token Attributes

We will see these again in upcoming lectures. For now, we just want to show some of the other information spaCy assigns to a Token:

Attribute	Description
`.text`	original word text
`.lemma_`	base form of the word
`.pos_`	coarse part-of-speech tag
`.tag_`	detailed (fine-grained) part-of-speech tag
`.is_alpha`	whether the token consists of alphabetic characters
`.is_stop`	whether the token is in the stop list
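The lexical attributes in this table can be tried without a trained model. The sketch below assumes spaCy v3 and uses a blank English pipeline, so only .text, .is_alpha, and .is_stop are shown. The .lemma_, .pos_, and .tag_ attributes need a trained pipeline such as en_core_web_sm to be filled in. The sample sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The car costs 30000 dollars.")

# Collect (text, is_alpha, is_stop) for every token.
rows = [(token.text, token.is_alpha, token.is_stop) for token in doc]

for text, is_alpha, is_stop in rows:
    print(f"{text:10}{is_alpha!s:8}{is_stop!s:8}")
```

Here "The" is alphabetic and a stop word, while the number and the final period are neither.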

Spans

Large Doc objects can sometimes be hard to work with. A span is a slice of a Doc object of the form Doc[start:stop].
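A small sketch of slicing a Doc into a Span follows. It assumes spaCy v3 and a blank English pipeline. Indices count tokens, not characters, and the stop index is exclusive, as with Python lists.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Tesla will start selling cars in India next year.")

# Tokens 3, 4, 5 and 6: "selling", "cars", "in", "India".
span = doc[3:7]

print(span.text)
print(type(span).__name__, span.start, span.end, len(span))
```

The Span keeps a reference to its parent Doc, so its tokens carry the same annotations as in the full document.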

Sentences

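Some tokens inside a Doc object may also receive a "start of sentence" tag. While this does not immediately build a list of sentences, these tags make it possible to generate sentence segments through Doc.sents.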
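To illustrate sentence boundaries without a trained parser, the sketch below adds the rule-based sentencizer component to a blank pipeline (assuming spaCy v3). With a full model such as en_core_web_sm, the boundaries would normally come from the dependency parser instead. The sample text is an illustration only.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

doc = nlp("This is the first sentence. This is the second one. Here is a third.")

# Tokens flagged as the start of a sentence.
starts = [token.text for token in doc if token.is_sent_start]

# Doc.sents is a generator of Span objects, so build a list to index into it.
sentences = [sent.text for sent in doc.sents]

print(starts)
for sentence in sentences:
    print(sentence)
```

Because Doc.sents is a generator, an expression like doc.sents[0] raises a TypeError. Use list(doc.sents)[0] when a specific sentence is needed.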

import spacy

# 1. Load the language library
#    (install the model first: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# 2. Run the pipeline on the text to build a Doc object
doc = nlp('''
Tesla will start selling cars in India next year, government says.
Elon Musk (CEO of Tesla) is now the richest man in the world.
''')

# 3. Use the tokens: text, part-of-speech tag, dependency label, lemma
for token in doc:
    print(f"{token.text:12}{token.pos_:12}{token.dep_:12}{token.lemma_}")

# Named entities: text, label, and a human-readable explanation of the label
for entity in doc.ents:
    print(f"{entity.text:-<20}{entity.label_:-<20}{spacy.explain(entity.label_)}")

# Noun chunks
for chunk in doc.noun_chunks:
    print(chunk.text)

# Built-in visualizers (jupyter=True renders inline in a notebook)
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

# Visualize the entity recognizer output
displacy.render(doc, style='ent', jupyter=True)

Stemming

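When searching text for a keyword, it often helps if the search also returns variants of that word. For example, searching for "boat" might also return "boats" and "boating". Here, "boat" is the stem of [boat, boater, boating, boats].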

from nltk.stem.porter import PorterStemmer

words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']

# Porter stemmer: the classic rule-based suffix-stripping algorithm
p_stemmer = PorterStemmer()

for word in words:
    print(f"{word} --------> {p_stemmer.stem(word)}")

from nltk.stem.snowball import SnowballStemmer

# Snowball stemmer: a later revision of the Porter algorithm
s_stemmer = SnowballStemmer(language='english')

for word in words:
    print(f"{word} --------> {s_stemmer.stem(word)}")

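Text Similarity

spaCy can compare two objects and predict how similar they are: Doc.similarity(), Span.similarity(), and Token.similarity(). Each takes another object and returns a similarity score, typically between 0 and 1, where a higher score means the two objects are more similar.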

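import en_core_web_md

# Load a larger model that includes word vectors
# (install it first: python -m spacy download en_core_web_md)
nlp = en_core_web_md.load()

# Compare two documents
doc_1 = nlp("I like fast food")
doc_2 = nlp("I like pizza")

# The score is symmetric, so both calls print the same value
print(doc_1.similarity(doc_2))
print(doc_2.similarity(doc_1))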
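By default, spaCy estimates similarity as the cosine similarity between vectors, and a Doc's vector defaults to the average of its token vectors. The pure-Python sketch below illustrates that calculation. The helper names cosine_similarity and mean_vector and the tiny 3-dimensional vectors are hypothetical, made up for illustration; they are not spaCy functions or real word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero when a vector is all zeros
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical toy "word vectors" (not taken from any real model).
toy_vectors = {
    "I": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "pizza": [0.0, 0.0, 1.0],
    "pasta": [0.0, 0.2, 0.9],
}

doc_a = mean_vector([toy_vectors[w] for w in ["I", "like", "pizza"]])
doc_b = mean_vector([toy_vectors[w] for w in ["I", "like", "pasta"]])

score_ab = cosine_similarity(doc_a, doc_b)
score_ba = cosine_similarity(doc_b, doc_a)
print(score_ab, score_ba)
```

Cosine similarity is symmetric, which is why swapping the two documents gives the same score. In general it ranges from -1 to 1, although scores for real text are usually positive.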
