Natural Language Processing (NLP) is one of the fields that gained momentum with the advances in neural networks and their applications. spaCy is an open-source library that supports many NLP tasks such as NER, POS tagging, dependency parsing etc., using a CNN-based model. Let's save building neural nets with PyTorch for the next story. Here our focus is on NLP concepts and how spaCy helps implement them.
Every NLP method requires data in text format. A body of such data is called a corpus; multiple datasets together are called corpora. In most languages we separate text into words by splitting on white space or punctuation; these separated units are called tokens. The process of splitting text into tokens is called tokenization.
Tokenization can become tough depending on the language we are operating on. For agglutinative languages, just splitting on white space and punctuation is not sufficient.
If we consider text from tweets, we need to preserve #hashtags, @handles and symbols as units as well.
In spaCy, tokenizing English text is simple. We call the load function of the spaCy library with an English model as the parameter and store the result in a variable. Then we pass our text to that variable, which returns the tokens as output. It sounds confusing when you read it, but look at the example below.
If you get an error like "can't load en_core_web_sm", then run this command in your Anaconda prompt inside your environment:
python -m spacy download en_core_web_sm
For Twitter data we cannot apply the spaCy-trained model; instead we use TweetTokenizer from the NLTK library. This will keep #hashtags, @handles and symbols as separate tokens.
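A quick sketch with NLTK's TweetTokenizer (no extra NLTK data downloads are needed for this class):

```python
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tweet = "@spacy_io makes #NLP fun :)"

# Handles, hashtags and emoticons survive as single tokens
print(tknzr.tokenize(tweet))
# → ['@spacy_io', 'makes', '#NLP', 'fun', ':)']
```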
Unigrams, Bigrams, Trigrams…. N-grams
N-grams are N consecutive tokens in the text. A unigram is every single word on its own, a bigram is two consecutive words, and so on. What is the use of N-grams?
N-grams are useful in deciding which words occur together so that we can chunk them into a single entity. They are also useful in next-word prediction, where the prediction is based on N-gram probability (the probability of N tokens occurring together, e.g. "San Francisco"), and helpful in spell correction. Sometimes we also use character N-grams (characters occurring together).
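Extracting N-grams from a token list needs only a few lines of plain Python. The helper `n_grams` below is not part of spaCy; it is just a sketch for illustration:

```python
def n_grams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["I", "love", "San", "Francisco"]
print(n_grams(tokens, 2))
# → [('I', 'love'), ('love', 'San'), ('San', 'Francisco')]
```

Counting how often a bigram like `('San', 'Francisco')` appears in a corpus is the basis of the N-gram probabilities mentioned above.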
Lemmatization and stemming
Lemmatization and stemming have the same goal but work differently. The point is to extract the root word for the given words, e.g. "fly" is the root word for "flew", "flying", "flown" etc. Stemming utilizes pre-defined rules to trim the edges off a word; the resulting root words are called stems. Lemmatization instead looks words up in a vocabulary (NLTK's lemmatizer, for example, uses the WordNet dictionary) to extract root words called lemmas, which spaCy exposes on every token. These processes are done to keep the dimensionality of the vectors as low as possible. The two most used stemmers are the Porter and Snowball stemmers.
Parts Of Speech Tagging
Parts-of-speech tagging is a kind of classification task: we tag every word with its part of speech. How is it useful? It is used in creating parse trees and building NERs. To extract meaning or build any relationship between words, POS tagging is an important step. It's simple in spaCy, as it comes with a pre-trained model.
Chunking
Chunking means reducing long sentences into small units called chunks. For any given sentence we can extract noun phrases or verb phrases. With spaCy it is even simpler: a document exposes its noun phrases through the noun_chunks attribute.
Named Entity Recognition
As the name tells, this means identifying named entities like persons, places, organizations, brands, monetary values etc. NER is useful for extracting the key elements in unstructured data so we can sort documents based on the related key units in the text.
These are a few of the entity labels used in spaCy: PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports etc.), ORG (organizations), GPE (countries, cities etc.), LANGUAGE (named languages), DATE, MONEY.
These are a few traditional NLP concepts and how to code them using spaCy. Very simple and understandable.
The next story will be on the Perceptron in PyTorch, along with loss functions and activation functions.
Don’t forget to give us your 👏 !