
Byte Pair Encoding (BPE) and Subword Tokenization


In almost every NLP application, text is a part of the data. The raw input is generally a list of words or sentences like “We will live on Mars soon”, but we feed text to a model as a sequence of tokens. The tokens can be characters, space-separated words, groups of words, or even parts of a word (subwords). Over time, different tokenization approaches have been taken. In the early stages, one-hot-encoding-like techniques were used. Later, we tried to preserve contextual information in the token representations, and the word2vec model gradually became the approach of choice.

In word2vec, semantically similar tokens lie close together in the vector space. For example, the words “old”, “older”, and “oldest” will have nearby positions. But, as discussed in this post, this representation tells us nothing about the relationship between “tall”, “taller”, and “tallest”. If we instead use subwords like ‘er’ or ‘est’, the model can learn that the same relationship applies to other words as well.

As the name suggests, in subword tokenization we break a word into frequent subwords and tokenize those subwords. For example, we may break the word “tallest” into “tall” and “est”. Now the question arises: how do we get these subwords, i.e. how do we know into which subwords to break a word? This is where Byte Pair Encoding, or BPE, comes into play!

Byte Pair Encoding

BPE was first described as a data compression algorithm in 1994 by Philip Gage. It repeatedly finds the most frequent byte pair in the data and replaces all occurrences of that pair with a byte that is not used in the data. Let’s consider the text aaabdaaabac. The byte pair aa is the most frequent, so we replace it with Z, i.e. Z = aa, and the text becomes ZabdZabac. We keep track of this in a replacement table, so in the table we put Z = aa. Now ab is the most frequent pair, so we can replace it with Y = ab and the text becomes ZYdZYac. Further, ZY can be replaced by X, resulting in the text XdXac and the replacement table Z = aa, Y = ab, X = ZY.
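To make the idea concrete, here is a minimal Python sketch of this compression step (the helper name most_frequent_pair and the replacement symbols used are my own choices, not from Gage’s paper):

```python
def most_frequent_pair(text):
    """Return the most frequent pair of adjacent characters in `text`."""
    counts = {}
    for a, b in zip(text, text[1:]):
        counts[a + b] = counts.get(a + b, 0) + 1
    return max(counts, key=counts.get)

text = "aaabdaaabac"
replacements = {}                  # the replacement table
for symbol in "ZYX":               # symbols that do not occur in the data
    pair = most_frequent_pair(text)
    text = text.replace(pair, symbol)
    replacements[symbol] = pair

# Note: when two pairs are tied (after the first merge, 'Za' and 'ab' both
# appear twice), max() simply picks one of them, so the exact merges can
# differ from the walkthrough above.
print(text, replacements)
```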

In NLP, Byte Pair Encoding was first used in machine translation. In the paper titled “Neural Machine Translation of Rare Words with Subword Units”, the authors adapted the BPE algorithm for subword tokenization. Instead of replacing the most common pair with a new byte, they merged the pair into a new subword token. Let’s dive a bit deeper into the idea and its implementation.

BPE for Subword Tokenization

For the implementation, I have followed this post’s code, which is a modification of the code released by the authors of the previously mentioned paper. Well, let’s start. We begin with a text corpus. It can be the text of a book (like this) or, for now, a simple paragraph.

At first, we count the frequency of each word in the corpus. To each word, we append a special token (like <\w>) indicating the end of the word, so the word this becomes this<\w>. Initially, we split each word into characters, treating the end token as a single character; for this we get [t, h, i, s, <\w>]. Finally, after counting all the words in the corpus, we get a vocabulary of space-separated characters with the end token, along with word frequencies. It may look like this: {'t h i s <\w>': 10, 'b o o k <\w>': 2, 't h o r <\w>': 7, 's o o n <\w>': 3}. We can implement this step with the following code. Here I have assumed that we are creating the vocab from a text file.
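A minimal sketch of this step, assuming the corpus is a plain UTF-8 text file (the function name get_vocab and the file handling are my own; the idea follows the post referenced above):

```python
import collections

def get_vocab(filename):
    """Split every word into characters, append the end-of-word marker,
    and count how often each word occurs in the corpus."""
    vocab = collections.defaultdict(int)
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            for word in line.strip().split():
                vocab[' '.join(word) + ' <\\w>'] += 1
    return vocab
```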

Now we start the training. At each iteration, we count the frequency of each consecutive token pair. The pairs can be (t, h), (i, s), (s, <\w>), (o, o), etc. For example, the pair (t, h) is present in 't h i s <\w>' with count 10 and in 't h o r <\w>' with count 7, so the frequency of (t, h) is 10 + 7 = 17. We can get the pair frequencies with the following lines of code.
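A sketch of the pair counting, along the lines of the get_stats function released with the paper:

```python
import collections

def get_stats(vocab):
    """Count the frequency of every adjacent token pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs
```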

After counting the frequencies of all pairs, we choose the most frequent pair and merge it into a single token. For example, if the most frequent pair is (t, h), after merging we get a new token 'th' and replace all occurrences of 't h' with 'th'. We can do the merging with the following function:
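A sketch of the merge step, closely following the merge_vocab function released with the paper; the regex ensures we only merge the pair when both symbols appear as whole tokens:

```python
import re

def merge_vocab(pair, vocab_in):
    """Merge every occurrence of the given token pair into a single token."""
    vocab_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only when it is surrounded by whitespace or the
    # start/end of the word, i.e. both symbols are complete tokens.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab_in.items():
        vocab_out[pattern.sub(''.join(pair), word)] = freq
    return vocab_out
```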

We will continue this for a fixed number of iterations. In each iteration, we find the most frequent token pair, merge those tokens, and update the vocabulary. After all iterations, we sort the tokens by length and create a ‘word to tokens’ map. We can do the training with the following function:
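A sketch of the training loop, built on the get_stats and merge_vocab sketches above (the names train, get_tokens_from_vocab, and num_merges are my own placeholders):

```python
import collections

def get_tokens_from_vocab(vocab):
    """Collect token frequencies and a map from each word to its tokens."""
    tokens = collections.defaultdict(int)
    word_to_tokens = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for symbol in symbols:
            tokens[symbol] += freq
        word_to_tokens[''.join(symbols)] = symbols
    return tokens, word_to_tokens

def train(vocab, num_merges=100):
    """Run the BPE merge loop for a fixed number of iterations."""
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:                      # nothing left to merge
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
    tokens, word_to_tokens = get_tokens_from_vocab(vocab)
    # Sort tokens from longest to shortest so that, when tokenizing an
    # unknown word later, the longest subwords are tried first.
    sorted_tokens = sorted(tokens, key=len, reverse=True)
    return sorted_tokens, word_to_tokens
```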

Why that ‘<\w>’?

You might be wondering why we add the <\w> or stop token. In subword tokenization, this stop token plays a significant role: it tells us whether a subword sits at the end of a word or not. Say there is a subword token 'sis' without the stop token. It appears at the start of the word 'sister' and at the end of the word 'thesis', two words with quite different meanings. But the token with the stop marker, 'sis<\w>', indicates to the model that it comes from the end of a word like 'thesis', not from the beginning of 'sister'.

Tokenizing a word

While tokenizing a word, we first add the stop token at the end; so, to tokenize the word 'python', we convert it to 'python<\w>'. Then we check whether it is a known word, i.e. present in the vocabulary. If so, we return the corresponding tokens.

Now, if the vocabulary does not contain the word, we iterate over the tokens in the vocabulary and check whether each token is a substring of the given word. We start from the longest token, which is why we sorted the tokens at the end of training. The tokenization can be done with the following lines of code.
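A simplified sketch of this lookup-then-search tokenizer. The recursive tokenize_word here is my own condensed version of the function in the referenced post, and '<unk>' is a placeholder for pieces no known token covers:

```python
def tokenize_word(word, sorted_tokens, unknown_token='<unk>'):
    """Split `word` using known subword tokens, longest tokens first.
    `sorted_tokens` must be sorted from longest to shortest."""
    if word == '':
        return []
    for i, token in enumerate(sorted_tokens):
        pos = word.find(token)
        if pos == -1:
            continue
        # Tokenize the part before the match (this token and the longer ones
        # cannot occur there), keep the matched token, then tokenize the rest.
        left = tokenize_word(word[:pos], sorted_tokens[i + 1:], unknown_token)
        right = tokenize_word(word[pos + len(token):], sorted_tokens[i:], unknown_token)
        return left + [token] + right
    return [unknown_token]                 # no known token matches this piece

def tokenize(word, sorted_tokens, word_to_tokens):
    """Append the end-of-word marker, use the known-word map if possible,
    otherwise fall back to the subword search."""
    word = word + '<\\w>'
    if word in word_to_tokens:
        return word_to_tokens[word]
    return tokenize_word(word, sorted_tokens)
```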

Tokenization in this approach is generally computationally costly. That’s why we created the ‘words to tokens’ map, which lets us skip this search for any word that already appeared in the training corpus, even a rare one.

You can find the full code with examples and sample data here. Don’t forget to leave a star! 😀

That’s all for this post on Byte Pair Encoding and subword tokenization. I am a learner, trying out new things and sharing them with others. Let me know your thoughts on this post. Get more machine learning-related posts here.
