Text Processing with Embeddings

1 Text Data

  • Vocabulary
  • Tokenizer

2 Vocabulary and Tokenization

text_data = """
We are thankful to be welcome on these lands in 
friendship.  The lands we are situated on are
covered by the Williams Treaties and are the
traditional territory of the Mississaugas, a
branch of the greater Anishinaabeg Nation,
including Algonquin, Ojibway, Odawa and
Pottawatomi. These lands remain home to many
Indigenous nations and peoples.

We acknowledge this land out of respect for the 
Indigenous nations who have cared for Turtle
Island, also called North America, from before the
arrival of settler peoples until this day. Most
importantly, we acknowledge that the history of
these lands has been tainted by poor treatment and
a lack of friendship with the First Nations who
call them home.

This history is something we are all affected by 
because we are all treaty people in Canada.  We
all have a shared history to reflect on, and each
of us is affected by this history in different
ways.  Our past defines our present, but if we
move forward as friends and allies, then it does
not have to define our future.
""".lower()
#
# Building a tokenizer
#
import torchtext.data
tokenizer = torchtext.data.get_tokenizer('basic_english')
tokens = tokenizer(text_data)
print(tokens[:7], "...", len(tokens), "tokens")
['we', 'are', 'thankful', 'to', 'be', 'welcome', 'on'] ... 193 tokens
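The basic_english tokenizer lowercases its input and separates punctuation into its own tokens. A small illustration on a made-up sentence (the comment shows the expected result):

tokenizer("Hello, world! We are friends.")
# expected: ['hello', ',', 'world', '!', 'we', 'are', 'friends', '.']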
#
# Building a vocabulary
#
from collections import defaultdict
# Count how many times each token occurs in the text.
counter = defaultdict(int)
for token in tokens:
    counter[token] += 1
key        value
we             8
are            6
thankful       1
to             4
be             1
...          ...
it             1
does           1
not            1
define         1
future         1

106 rows × 1 columns
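An equivalent and arguably more idiomatic way to build the same counts is collections.Counter (a small alternative sketch, not used in the rest of this section):

from collections import Counter

counter = Counter(tokens)     # maps each token to its frequency
counter.most_common(5)        # the five most frequent tokens and their counts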

A vocabulary manages the mapping between tokens and integer indices.

import torchtext.vocab

vocab = torchtext.vocab.build_vocab_from_iterator(
    [tokens],
    min_freq=1,
    specials=['<unk>', '<s>'],
)
len(vocab)
108
The vocabulary has 108 entries: the 106 distinct tokens plus the two special tokens <unk> and <s>.

WARNING

build_vocab_from_iterator expects an iterator over batches of tokens, not an iterator over individual tokens: each element of the iterator must itself be a list of tokens. This is why we pass [tokens] rather than tokens above.
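For a corpus with more than one document, the usual pattern is to pass a generator that yields one token list per document. A sketch, where corpus is a hypothetical list of raw strings:

corpus = [
    "we are all treaty people",
    "our past defines our present",
]  # hypothetical two-document corpus
vocab2 = torchtext.vocab.build_vocab_from_iterator(
    (tokenizer(doc) for doc in corpus),  # one list of tokens per document
    specials=['<unk>', '<s>'],
)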

vocab['we']
5
vocab.lookup_indices(['we', 'are', 'thankful'])
[5, 8, 93]
vocab.lookup_tokens([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
['<unk>', '<s>', ',', 'the', '.', 'we', 'of', 'and', 'are', 'by']
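By default, looking up a token that is not in the vocabulary raises an error. A common follow-up, not shown above, is to route unknown tokens to the <unk> special with set_default_index (a sketch):

vocab.set_default_index(vocab['<unk>'])
vocab['ontario']   # out-of-vocabulary word, now maps to the <unk> index, 0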

3 Embedding Vectors

With respect to a fixed tokenizer and vocabulary, we can treat any piece of text data as a sequence of integers.

seq_int = vocab.lookup_indices(tokenizer(text_data))
seq_int[:10]
[5, 8, 93, 13, 38, 105, 19, 21, 11, 17]
vocab.lookup_tokens(seq_int[:10])
['we', 'are', 'thankful', 'to', 'be', 'welcome', 'on', 'these', 'lands', 'in']

3.1 Embedding vectors

  • Neural networks such as RNNs can only process vectors in a high-dimensional space, not integer token IDs.
  • Thus, we want to represent each token as a high-dimensional vector via a lookup table:

token       vector in \(\mathbb{R}^d\)
0           \(E[0]\)
1           \(E[1]\)
2           \(E[2]\)
3           \(E[3]\)
\(\vdots\)  \(\vdots\)

The lookup table can be represented by a tensor: \(E\in\mathbb{R}^{V\times d}\) where \(V\) is the vocabulary size, and \(d\) the dimensionality of the vector representations.
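Looking up row \(i\) of \(E\) is equivalent to multiplying a one-hot vector for token \(i\) by \(E\); the lookup is simply the cheaper way to compute it. A small check with hypothetical sizes:

import torch

V, d = 5, 3
E = torch.randn(V, d)
i = 2
one_hot = torch.nn.functional.one_hot(torch.tensor(i), num_classes=V).float()
torch.allclose(one_hot @ E, E[i])   # True: the lookup equals the one-hot product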

4 Building a custom embedding layer

Let’s try to build a custom embedding layer as a torch module. It will map sequences of integer tokens to sequences of vectors.

import torch
import torch.nn as nn

class MyEmbedding(nn.Module):
    def __init__(self, vocab_size, dim_embed):
        super().__init__()
        # E is a trainable (vocab_size x dim_embed) lookup table;
        # row i holds the embedding vector of token i.
        self.E = nn.Parameter(
            torch.randn((vocab_size, dim_embed))
        )

We initialize the embedding vectors randomly. The embedding layer will be used as part of a larger neural network, so that end-to-end training optimizes the embedding vectors for the learning task.
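The add_method decorator used below is not part of PyTorch; it is assumed to be a small notebook helper that attaches a function to an existing class after the fact. A minimal sketch of such a helper:

def add_method(cls):
    # Hypothetical helper: register the decorated function as a method of cls.
    def decorator(func):
        setattr(cls, func.__name__, func)
        return func
    return decorator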

@add_method(MyEmbedding)
def forward(self, token_sequence):
    # Look up the row of E for each integer token, then stack the rows
    # into a (sequence length, dim_embed) tensor.
    vectors = [
        self.E[i] for i in token_sequence
    ]
    return torch.stack(vectors)

This model performs embedding of unbatched token sequences.

token_seq = seq_int[:10]
token_seq
[5, 8, 93, 13, 38, 105, 19, 21, 11, 17]
emb = MyEmbedding(len(vocab), 3)
emb(token_seq)
tensor([[ 0.3816,  0.5266, -1.9311],
        [-0.3788, -0.7445, -1.7510],
        [ 0.3957, -0.7010, -2.0687],
        [-0.5363, -0.6186,  0.2826],
        [ 1.5242,  0.1881,  2.3600],
        [-0.0276,  0.5454,  0.1715],
        [ 0.0680, -0.1322,  0.4900],
        [ 0.8703, -1.2796,  2.1559],
        [-0.2499,  0.8778,  0.0142],
        [-0.8260, -0.4679, -0.5666]], grad_fn=<StackBackward0>)
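As a quick sanity check that the lookup table is trainable, we can backpropagate an arbitrary scalar through the output and inspect the gradient on emb.E (a sketch; the loss here is just a placeholder):

out = emb(token_seq)
loss = out.sum()       # placeholder scalar standing in for a real task loss
loss.backward()
emb.E.grad.shape       # torch.Size([108, 3]); only the rows of used tokens are nonzero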

5 Introducing the built-in Torch Embedding layer

PyTorch provides a built-in implementation, torch.nn.Embedding(...), which performs embedding of batched input sequences.

embedding = nn.Embedding(
    num_embeddings=len(vocab),
    embedding_dim=3,
)
inputs = torch.tensor([
    seq_int[:10],
    seq_int[10:20],
    seq_int[20:30],
], dtype=torch.int64)
inputs
tensor([[  5,   8,  93,  13,  38, 105,  19,  21,  11,  17],
        [ 25,   4,   3,  11,   5,   8,  89,  19,   8,  48],
        [  9,   3, 106,  98,   7,   8,   3,  97,  92,   6]])
embedding(inputs)
tensor([[[-0.2248,  0.1991,  0.8424],
         [ 1.2828, -0.7545, -0.5431],
         [ 0.4633, -0.1176, -1.5061],
         [ 1.7210, -0.7146,  0.8112],
         [ 0.3721,  0.6878, -0.4865],
         [ 0.3733,  1.0964,  2.7223],
         [-0.8063,  0.8842, -0.2131],
         [ 0.4032,  0.2207,  1.2382],
         [ 0.3583,  0.6511,  0.8397],
         [ 1.0241, -0.3042,  0.5910]],

        [[ 0.9165,  0.8446,  1.1294],
         [-1.3930,  1.2435,  0.3012],
         [-1.8170,  1.0054,  0.0941],
         [ 0.3583,  0.6511,  0.8397],
         [-0.2248,  0.1991,  0.8424],
         [ 1.2828, -0.7545, -0.5431],
         [ 0.8073, -0.0172,  0.4395],
         [-0.8063,  0.8842, -0.2131],
         [ 1.2828, -0.7545, -0.5431],
         [ 0.1293, -0.7778,  1.2692]],

        [[-1.0630, -0.4735, -1.8777],
         [-1.8170,  1.0054,  0.0941],
         [-1.5959,  0.8005,  0.4276],
         [ 0.0459,  0.8838, -1.8087],
         [ 2.0139,  0.7576, -0.6153],
         [ 1.2828, -0.7545, -0.5431],
         [-1.8170,  1.0054,  0.0941],
         [ 0.0703,  0.6909,  0.6928],
         [-0.1059,  0.7258, -0.5090],
         [ 0.3869,  0.7944,  0.1100]]], grad_fn=<EmbeddingBackward0>)
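The output has shape (batch size, sequence length, embedding dimension), which is a quick way to confirm that the batch dimension was handled:

embedding(inputs).shape
torch.Size([3, 10, 3])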