= """
text_data We are thankful to be welcome on these lands in
friendship. The lands we are situated on are
covered by the Williams Treaties and are the
traditional territory of the Mississaugas, a
branch of the greater Anishinaabeg Nation,
including Algonquin, Ojibway, Odawa and
Pottawatomi. These lands remain home to many
Indigenous nations and peoples.
We acknowledge this land out of respect for the
Indigenous nations who have cared for Turtle
Island, also called North America, from before the
arrival of settler peoples until this day. Most
importantly, we acknowledge that the history of
these lands has been tainted by poor treatment and
a lack of friendship with the First Nations who
call them home.
This history is something we are all affected by
because we are all treaty people in Canada. We
all have a shared history to reflect on, and each
of us is affected by this history in different
ways. Our past defines our present, but if we
move forward as friends and allies, then it does
not have to define our future.
""".lower()
Text Processing With Embedding
1 Text Data
- Vocabulary
- Tokenizer
2 Vocabulary and Tokenization
#
# Building a tokenizer
#
import torchtext.data

tokenizer = torchtext.data.get_tokenizer('basic_english')
tokens = tokenizer(text_data)
print(tokens[:7], "...", len(tokens), "tokens")
['we', 'are', 'thankful', 'to', 'be', 'welcome', 'on'] ... 193 tokens
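The 'basic_english' tokenizer lowercases the text and splits punctuation into separate tokens (which is why ',' and '.' show up in the vocabulary below). A quick check on a made-up sentence:
tokenizer("Turtle Island, also called North America.")
# roughly: ['turtle', 'island', ',', 'also', 'called', 'north', 'america', '.']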
#
# Building a vocabulary
#
from collections import defaultdict

counter = defaultdict(int)
for token in tokens:
    counter[token] += 1
key | value |
---|---|
we | 8 |
are | 6 |
thankful | 1 |
to | 4 |
be | 1 |
... | ... |
it | 1 |
does | 1 |
not | 1 |
define | 1 |
future | 1 |
106 rows × 1 columns
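The same counts can be obtained more directly with collections.Counter; a small sketch (the name token_counts is ours):
from collections import Counter

token_counts = Counter(tokens)   # token -> frequency, same counts as above
token_counts.most_common(5)      # the most frequent tokens first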
A vocabulary manages the tokens: it maps each token to an integer index and back.
import torchtext.vocab
= torchtext.vocab.build_vocab_from_iterator(
vocab
[tokens],=1,
min_freq=['<unk>', '<s>'],
specials
)len(vocab)
108
build_vocab_from_iterator expects an iterator over batches of tokens, not an iterator over individual tokens. So, each element in the iterator must be a list of tokens.
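For example, with several documents you would pass one token list per document (doc_a and doc_b below are made-up examples):
doc_a = tokenizer("we are all treaty people in canada .")
doc_b = tokenizer("our past defines our present .")
vocab_small = torchtext.vocab.build_vocab_from_iterator(
    [doc_a, doc_b],                  # one token list per document
    specials=['<unk>', '<s>'],
)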
vocab['we']
5
vocab.lookup_indices(['we', 'are', 'thankful'])
[5, 8, 93]
vocab.lookup_tokens([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
['<unk>', '<s>', ',', 'the', '.', 'we', 'of', 'and', 'are', 'by']
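One caveat (an assumption about recent torchtext versions): looking up a token that is not in the vocabulary raises an error unless a default index is set, which is what the '<unk>' special is for.
# map out-of-vocabulary tokens to '<unk>' instead of raising an error
vocab.set_default_index(vocab['<unk>'])
vocab['some-unseen-token']   # now returns the index of '<unk>'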
3 Embedding Vectors
With respect to a fixed tokenizer and vocabulary, we can treat any piece of text data as a sequence of integers.
seq_int = vocab.lookup_indices(tokenizer(text_data))
seq_int[:10]
[5, 8, 93, 13, 38, 105, 19, 21, 11, 17]
vocab.lookup_tokens(seq_int[:10])
['we', 'are', 'thankful', 'to', 'be', 'welcome', 'on', 'these', 'lands', 'in']
3.1 Embedding vectors
- Neural networks such as RNNs can only process vectors in a high-dimensional space, not discrete token indices.
- Thus, we want to represent each token as a high-dimensional vector via a lookup table:
token | vector in \(\mathbb{R}^d\) |
---|---|
0 | \(E[0]\) |
1 | \(E[1]\) |
2 | \(E[2]\) |
3 | \(E[3]\) |
\(\vdots\) | \(\vdots\) |
The lookup table can be represented by a tensor: \(E\in\mathbb{R}^{V\times d}\) where \(V\) is the vocabulary size, and \(d\) the dimensionality of the vector representations.
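Before wrapping this in a module, note that the lookup itself is just row indexing into the tensor \(E\); a minimal sketch with a made-up dimensionality d = 4:
import torch

E = torch.randn(len(vocab), 4)   # V x d lookup table, here d = 4
E[vocab['we']]                   # the d-dimensional vector for the token 'we'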
4 Building a custom embedding layer
Let's try to build a custom embedding layer as a torch module. It will map sequences of integer tokens to sequences of vectors.
import torch
import torch.nn as nn
class MyEmbedding(nn.Module):
    def __init__(self, vocab_size, dim_embed):
        super().__init__()
        self.E = nn.Parameter(
            torch.randn((vocab_size, dim_embed))
        )
We initialize the embedding vectors randomly. We will use the embedding layer as part of a larger neural network, so that end-to-end learning optimizes the embedding vectors for the learning task.
# add_method: helper (defined elsewhere in these notes) that attaches this
# function as a method of MyEmbedding
@add_method(MyEmbedding)
def forward(self, token_sequence):
    vectors = [
        self.E[i] for i in token_sequence
    ]
    return torch.stack(vectors)
This model performs embedding of unbatched token sequences.
token_seq = seq_int[:10]
token_seq
[5, 8, 93, 13, 38, 105, 19, 21, 11, 17]
emb = MyEmbedding(len(vocab), 3)
emb(token_seq)
tensor([[ 0.3816, 0.5266, -1.9311],
[-0.3788, -0.7445, -1.7510],
[ 0.3957, -0.7010, -2.0687],
[-0.5363, -0.6186, 0.2826],
[ 1.5242, 0.1881, 2.3600],
[-0.0276, 0.5454, 0.1715],
[ 0.0680, -0.1322, 0.4900],
[ 0.8703, -1.2796, 2.1559],
[-0.2499, 0.8778, 0.0142],
[-0.8260, -0.4679, -0.5666]], grad_fn=<StackBackward0>)
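Because E is wrapped in nn.Parameter, it is registered with the module, so an optimizer built from emb.parameters() will update the embedding vectors during end-to-end training; a quick check:
for name, p in emb.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# E (108, 3) True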
5 Introducing the built-in Torch Embedding layer
PyTorch provides a built-in implementation of embedding, torch.nn.Embedding(...), which performs embedding of batched input sequences.
embedding = nn.Embedding(
    num_embeddings=len(vocab),
    embedding_dim=3,
)
inputs = torch.tensor([
    seq_int[:10],
    seq_int[10:20],
    seq_int[20:30],
], dtype=torch.int64)
inputs
tensor([[ 5, 8, 93, 13, 38, 105, 19, 21, 11, 17],
[ 25, 4, 3, 11, 5, 8, 89, 19, 8, 48],
[ 9, 3, 106, 98, 7, 8, 3, 97, 92, 6]])
embedding(inputs)
tensor([[[-0.2248, 0.1991, 0.8424],
[ 1.2828, -0.7545, -0.5431],
[ 0.4633, -0.1176, -1.5061],
[ 1.7210, -0.7146, 0.8112],
[ 0.3721, 0.6878, -0.4865],
[ 0.3733, 1.0964, 2.7223],
[-0.8063, 0.8842, -0.2131],
[ 0.4032, 0.2207, 1.2382],
[ 0.3583, 0.6511, 0.8397],
[ 1.0241, -0.3042, 0.5910]],
[[ 0.9165, 0.8446, 1.1294],
[-1.3930, 1.2435, 0.3012],
[-1.8170, 1.0054, 0.0941],
[ 0.3583, 0.6511, 0.8397],
[-0.2248, 0.1991, 0.8424],
[ 1.2828, -0.7545, -0.5431],
[ 0.8073, -0.0172, 0.4395],
[-0.8063, 0.8842, -0.2131],
[ 1.2828, -0.7545, -0.5431],
[ 0.1293, -0.7778, 1.2692]],
[[-1.0630, -0.4735, -1.8777],
[-1.8170, 1.0054, 0.0941],
[-1.5959, 0.8005, 0.4276],
[ 0.0459, 0.8838, -1.8087],
[ 2.0139, 0.7576, -0.6153],
[ 1.2828, -0.7545, -0.5431],
[-1.8170, 1.0054, 0.0941],
[ 0.0703, 0.6909, 0.6928],
[-0.1059, 0.7258, -0.5090],
[ 0.3869, 0.7944, 0.1100]]], grad_fn=<EmbeddingBackward0>)
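The batched output has shape (batch_size, sequence_length, embedding_dim), matching the three sequences of length 10 above:
embedding(inputs).shape
# torch.Size([3, 10, 3])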