Natural Language Processing
Sources Overview
Sample Use Cases
- Information search
- Information retrieval
Basics
In NLP important concept is a vector. Characters can be represented as vectors, words, or documents can be represented as vectors.
For example, two documents can be represented as vectors based on the number of occurences of specific words.
Example
doc1 = "meeting ... management ... meeting ... management ... meeting" # (1)
doc1 += "... management ... meeting ... meeting"
vector = [0, 0] # (2)
for word in doc1.split(" "):
if word == "management":
vector[0] = vector[0] + 1 # (3)
if word == "meeting":
vector[1] = vector[1] + 1 # (4)
print(vector) # (5)
- represents a document based on keywords only
- initializes array vector
- count for "management" is incremented in cell 0
- count for "meeting" is incremented in cell 1.
- [3, 5]
Two vectors can be compared between each other by either length or by measuring angle between them.
Info
Cosine similarity estimates the similarity between two nonzero vectors in space (or two texts represented by such vectors) on the basis of the angle between these vectors — for example, the cosine of 0° equals 1, which denotes the maximum similarity, and the cosine of 180° equals –1, which is the lowest value. Unlike Euclidean distance, this measure is not affected by vector length.