spaCy pipeline component for adding emoji metadata to Doc, Token and Span objects.

Project description

spaCy v2.0 extension and pipeline component for adding emoji metadata to Doc objects. It detects emoji consisting of one or more unicode characters, and can optionally merge multi-char emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and you can provide an optional lookup table with your own descriptions. The extension sets the custom Doc, Token and Span attributes ._.is_emoji, ._.emoji_desc, ._.has_emoji and ._.emoji. You can read more about custom pipeline components and extension attributes in the spaCy processing pipelines documentation.

Emoji are matched using spaCy’s PhraseMatcher, and looked up in the data table provided by the “emoji” package.
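The matching mechanism can be sketched in plain spaCy, using a hand-picked emoji list in place of the full table from the "emoji" package (shown with the current matcher.add signature; this is a simplified illustration, not the extension's actual code):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Hand-picked sample; spacymoji builds patterns from the full "emoji" data table.
emoji_chars = ["😻", "👍", "👍🏿"]
matcher.add("EMOJI", [nlp.make_doc(e) for e in emoji_chars])

doc = nlp("This is a test 😻 👍🏿")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
```

Because PhraseMatcher patterns are Docs themselves, multi-char emoji match regardless of how the tokenizer splits them: the pattern and the text are tokenized identically.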

Disclaimer: This extension only works in spaCy v2.0 (currently in alpha) and is still experimental.

⏳ Installation

spacymoji requires spacy-nightly v2.0.0a17 or higher.

pip install spacymoji

☝️ Usage

Import the component and initialise it with the shared nlp object (i.e. an instance of Language), which is used to initialise the PhraseMatcher with the shared vocab and to create the match patterns. Then add the component anywhere in your pipeline.

import spacy
from spacymoji import Emoji

nlp = spacy.load('en')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)

doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji
assert doc[2:5]._.has_emoji
assert not doc[0]._.is_emoji
assert doc[4]._.is_emoji
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

spacymoji only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model. If you're loading a model and your pipeline includes a tagger, parser and entity recognizer, make sure to add the emoji component with first=True, so the spans are merged right after tokenization and before the document is parsed. If your text contains a lot of emoji, this might even give you a nice boost in parser accuracy.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.

Token._.is_emoji (bool): Whether the token is an emoji.

Token._.emoji_desc (unicode): A human-readable description of the emoji.

Doc._.has_emoji (bool): Whether the document contains emoji.

Doc._.emoji (list): (emoji, index, description) tuples of the document’s emoji.

Span._.has_emoji (bool): Whether the span contains emoji.

Span._.emoji (list): (emoji, index, description) tuples of the span’s emoji.

Settings

On initialisation of Emoji, you can define the following settings:

nlp (Language): The shared nlp object. Used to initialise the matcher with the shared Vocab, and to create Doc match patterns.

attrs (tuple): Attributes to set on the ._ property. Defaults to ('has_emoji', 'is_emoji', 'emoji_desc', 'emoji').

pattern_id (unicode): ID of the match pattern, defaults to 'EMOJI'. Can be changed to avoid ID conflicts.

merge_spans (bool): Merge spans containing multi-character emoji, defaults to True. Will only merge combined emoji resulting in one icon, not sequences.

lookup (dict): Optional lookup table that maps emoji unicode strings to custom descriptions, e.g. translations or other annotations.

emoji = Emoji(nlp, attrs=('has_e', 'is_e', 'e_desc', 'e'), lookup={u'👨‍🎤': u'David Bowie'})
nlp.add_pipe(emoji)
doc = nlp(u"We can be 👨‍🎤 heroes")
assert doc[3]._.is_e
assert doc[3]._.e_desc == u'David Bowie'
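What merge_spans does can be illustrated with spaCy's retokenizer on a hand-built Doc (a simplified sketch, not the extension's actual implementation; the split into base emoji plus skin tone modifier is constructed manually for the example):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Simulate a multi-char emoji that came out of the tokenizer in two pieces.
doc = Doc(nlp.vocab, words=["Nice", "👍", "🏿"], spaces=[True, False, False])

with doc.retokenize() as retokenizer:
    # Merge base emoji + skin tone modifier back into a single token.
    retokenizer.merge(doc[1:3])
```

After merging, the document has two tokens and the combined emoji renders as one icon.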

🛣 Roadmap

This extension is still experimental, but here are some features that might be cool to add in the future:

  • Add match patterns and attributes for emoji shortcodes, e.g. :+1:. The shortcodes could optionally be merged into one token, and receive a NORM attribute with the unicode emoji. The NORM is used as a feature for training, so :+1: and 👍 would automatically receive similar representations.

  • Add support for the Unicode Emoji Annotations project. The JavaScript package also comes with pre-compiled JSON data, including both standardised and community-contributed annotations in English and German.
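The shortcode idea from the first point could be prototyped outside spaCy with a small lookup table (the SHORTCODES table and normalize_shortcodes function below are hypothetical sketches, not part of spacymoji):

```python
import re

# Hypothetical sample table; a real implementation would use a full shortcode list.
SHORTCODES = {":+1:": "👍", ":smile:": "😄"}
SHORTCODE_RE = re.compile(r":[+\-\w]+:")

def normalize_shortcodes(text):
    # Replace known shortcodes with their unicode emoji; leave unknown ones alone.
    return SHORTCODE_RE.sub(lambda m: SHORTCODES.get(m.group(), m.group()), text)

result = normalize_shortcodes("Great work :+1:")
```

In a pipeline component, the unicode emoji would instead be written to the token's NORM attribute rather than substituted into the text.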
