Byte-pair encoding

In computing, byte-pair encoding (BPE),^[1]^[2] or digram coding,^[3] is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into smaller strings by creating and using a translation table.^[4] A slightly modified version of the algorithm is used in large language model tokenizers.

The original version of the algorithm focused on compression. It replaces the highest-frequency pair of bytes with a new byte that was not contained in the initial dataset. A lookup table of the replacements is required to rebuild the initial dataset. The modified version builds "tokens" (units of recognition) that match varying amounts of source text, from single characters (including single digits or single punctuation marks) to whole words (even long compound words).^[5]^[6]^[7]

Original algorithm

The original BPE algorithm iteratively replaces the most common pair of adjacent bytes in a target text with an unused 'placeholder' byte. The iteration ends when no pair occurs more than once, leaving the target text effectively compressed. Decompression reverses this process: each placeholder is looked up in a table and replaced by the pair it stands for. In the original paper, this lookup table is encoded and stored alongside the compressed text.

Example

Suppose the data to be encoded is:^[8]

aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:

ZabdZabac
Z=aa

Then the process is repeated with byte pair "ab", replacing it with "Y":

ZYdZYac
Y=ab
Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte-pair encoding, replacing "ZY" with "X":

XdXac
X=ZY
Y=ab
Z=aa

This data cannot be compressed further by byte-pair encoding because there are no pairs of bytes that occur more than once.
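
The walkthrough above can be sketched in Python. This is a minimal illustration, not Gage's implementation: it works on a character string rather than raw bytes, the function name `bpe_compress` and the placeholder alphabet are hypothetical, overlapping occurrences are counted naively, and ties between equally frequent pairs go to the lexicographically larger pair (an assumption chosen so that "ab" wins over "Za", as in the example).

```python
from collections import Counter

def bpe_compress(text, placeholders="ZYXWVUTSRQ"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    table = []  # (placeholder, pair), in the order the replacements were made
    unused = [p for p in placeholders if p not in text]
    while unused:
        # Count every adjacent pair (overlapping occurrences are counted naively).
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not counts:
            break
        # Assumption: ties on frequency go to the lexicographically larger pair.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < 2:
            break  # no pair occurs more than once, so nothing more to gain
        symbol = unused.pop(0)
        text = text.replace(pair, symbol)
        table.append((symbol, pair))
    return text, table

compressed, table = bpe_compress("aaabdaaabac")
print(compressed)  # XdXac
print(table)       # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
```

A different tie-breaking rule can produce a different replacement table for the same input, so the table must always be stored with the compressed data.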

To decompress the data, simply perform the replacements in the reverse order.
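
As a rough sketch, decompression only needs the replacement table and the compressed string. The function name `bpe_decompress` and the list-of-tuples table layout are assumptions made for illustration; the table lists replacements in the order they were made, so it is walked backwards.

```python
def bpe_decompress(text, table):
    """Undo the replacements, most recent first."""
    for symbol, pair in reversed(table):
        text = text.replace(symbol, pair)
    return text

# Table from the example above, in the order the replacements were made.
table = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
print(bpe_decompress("XdXac", table))  # aaabdaaabac
```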

Modified algorithm

The original BPE algorithm is modified for use in language modeling, especially for large language models based on neural networks. Compared to the original BPE, the modified BPE does not aim to maximally compress text, but rather, to encode plaintext into "tokens", which are natural numbers.^[9] All the unique tokens found in a corpus are listed in a token vocabulary, the size of which, in the case of GPT-3.5 and GPT-4, is 100256.^[10]

The modified tokenization algorithm initially treats the set of unique characters as 1-character-long n-grams (the initial tokens). Then, successively, the most frequent pair of adjacent tokens is merged into a new, longer n-gram and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be constructed from final vocabulary tokens and initial-set characters.^[11]
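
The training loop described above might look as follows. This is a simplified sketch under stated assumptions: the function name `train_bpe` is hypothetical, the corpus is a single string with no pre-tokenization, and ties between equally frequent pairs are resolved in favor of the pair seen first. Production tokenizers differ in all three respects.

```python
from collections import Counter

def train_bpe(text, vocab_size):
    """Merge the most frequent adjacent token pair until the vocabulary is full."""
    tokens = list(text)
    vocab = list(dict.fromkeys(tokens))  # unique characters, first-seen order
    while len(vocab) < vocab_size and len(tokens) > 1:
        counts = Counter(zip(tokens, tokens[1:]))
        # Assumption: among equally frequent pairs, the first one seen wins.
        (left, right), _ = counts.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 2] == [left, right]:
                merged.append(left + right)  # replace the pair with the new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        if left + right not in vocab:
            vocab.append(left + right)
    return vocab, tokens

vocab, tokens = train_bpe("aaabdaaabac", vocab_size=5)
print(vocab)   # ['a', 'b', 'd', 'c', 'aa']
print(tokens)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Because the initial characters stay in the vocabulary, any string over those characters can still be tokenized after training, even if none of the learned merges apply to it.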

This modified BPE approach has been extended from spoken language to sign language in recent years.^[12]

Example

Suppose we are encoding the previous example, "aaabdaaabac", with a specified vocabulary size of 6. It would first be encoded as "0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 3" with a vocabulary of "a=0, b=1, d=2, c=3". Merging would then proceed as before, yielding "4, 5, 2, 4, 5, 0, 3" with a vocabulary of "a=0, b=1, d=2, c=3, aa=4, ab=5".

So far this is essentially the same as before. However, if we had specified a vocabulary size of only 5, the process would stop at the vocabulary "a=0, b=1, d=2, c=3, aa=4", and the example would be encoded as "4, 0, 1, 2, 4, 0, 1, 0, 3". Conversely, if we had specified a vocabulary size of 8, it would be encoded as "7, 6, 0, 3", with a vocabulary of "a=0, b=1, d=2, c=3, aa=4, ab=5, aaab=6, aaabd=7". This is not maximally efficient, but the modified BPE does not aim to compress a dataset maximally; it aims to encode it efficiently for language model training.^[13]
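
The three encodings above can be reproduced with a short sketch. The names `BASE`, `MERGES` and `encode` are hypothetical, and the merge order is copied from the example rather than recomputed from pair frequencies, because several pairs tie on frequency and an implementation is free to break those ties differently.

```python
BASE = ["a", "b", "d", "c"]  # ids 0..3, as in the example
# Merge order taken from the example above; each merge gets the next free id.
MERGES = [("a", "a"), ("a", "b"), ("aa", "ab"), ("aaab", "d")]

def encode(text, vocab_size):
    """Encode text using only as many merges as the vocabulary size allows."""
    merges = MERGES[:max(0, vocab_size - len(BASE))]
    vocab = {tok: i for i, tok in enumerate(BASE + [l + r for l, r in merges])}
    tokens = list(text)
    for left, right in merges:  # apply merges in the order they were learned
        merged, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 2] == [left, right]:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return [vocab[t] for t in tokens]

print(encode("aaabdaaabac", 5))  # [4, 0, 1, 2, 4, 0, 1, 0, 3]
print(encode("aaabdaaabac", 6))  # [4, 5, 2, 4, 5, 0, 3]
print(encode("aaabdaaabac", 8))  # [7, 6, 0, 3]
```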

Byte-level BPE

In the above example, the output of the BPE is a vocabulary, which can be used to encode any text that is written with the letters "abcd". It cannot encode text containing other symbols, such as "no". Even if each of the 26 letters were given an entry in the vocabulary, some symbols would inevitably remain unencodable, since the world's languages use many different scripts.

One solution is to replace any unencodable symbol with a special symbol named UNK ("unknown").
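
A minimal sketch of this fallback, working at the character level for simplicity. The spelling "<UNK>", its id, and the function name `encode_with_unk` are assumptions made for illustration.

```python
UNK = "<UNK>"  # hypothetical spelling of the unknown-symbol token

def encode_with_unk(text, vocab):
    """Map each character to its id, falling back to the UNK id."""
    unk_id = vocab[UNK]
    return [vocab.get(ch, unk_id) for ch in text]

vocab = {"a": 0, "b": 1, "d": 2, "c": 3, UNK: 4}
print(encode_with_unk("bad", vocab))  # [1, 0, 2]
print(encode_with_unk("no", vocab))   # [4, 4]
```

The mapping is lossy: once "n" and "o" have both become the same UNK id, the original text cannot be recovered from the token sequence.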

Byte-level BPE is another approach. It first converts the text into UTF-8 and then treats it as a stream of bytes. This guarantees that any text encoded in UTF-8 can be encoded by the BPE. It has been used in BERT-like models such as RoBERTa, BART, and DeBERTa, and in GPT-like models such as GPT-2.^[14]^[15]^[16]
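
The first step of the byte-level approach can be sketched as follows. The function names are hypothetical, and the sketch covers only the initial byte-level ids (before any merges), using Python's built-in UTF-8 codec.

```python
def to_byte_ids(text):
    """Initial token ids for byte-level BPE: one id (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    """Inverse of to_byte_ids, valid as long as the ids form well-formed UTF-8."""
    return bytes(ids).decode("utf-8")

print(to_byte_ids("no"))  # [110, 111]
print(to_byte_ids("é"))   # [195, 169]  (one character, two bytes)
print(from_byte_ids(to_byte_ids("naïve café")))  # naïve café
```

With a base vocabulary of all 256 byte values, no UNK token is needed; the merge procedure then runs over these ids exactly as it would over characters.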

References

[1] Gage, Philip (1994). "A New Algorithm for Data Compression". The C Users Journal.
[2] "A New Algorithm for Data Compression". Dr. Dobb's Journal. 1 February 1994. Retrieved 10 August 2020.
[3] Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1994). Managing Gigabytes. New York: Van Nostrand Reinhold. ISBN 978-0-442-01863-4.
[4] "Byte Pair Encoding". Archived from the original on 2025-08-06.
[5] Sennrich, Rico; Birch, Alexandra; Haddow, Barry (2015). "Neural Machine Translation of Rare Words with Subword Units". arXiv:1508.07909 [cs.CL].
[6] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
[7] "google/sentencepiece". Google. 2025-08-06. Retrieved 2025-08-06.
[8] Campesato, Oswald (2025-08-06). Large Language Models for Developers: A Prompt-based Exploration of LLMs. Walter de Gruyter GmbH. ISBN 978-1-5015-2095-2.
[9] Zhang, Xiang; Cao, Juntai; You, Chenyu (2024). "Counting Ability of Large Language Models and Impact of Tokenization". arXiv:2410.19730 [cs.CL].
[10] Raschka, Sebastian (2025-08-06). "Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch". Sebastian Raschka, PhD. Retrieved 2025-08-06.
[11] Paaß, Gerhard; Giesselbach, Sven (2022). "Pre-trained Language Models". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 19–78. doi:10.1007/978-3-031-23190-2_2. ISBN 9783031231902.
[12] Miyazaki, Taro; Tan, Sihan; Uchida, Tsubasa; Kaneko, Hiroyuki (May 25, 2024). "Sign Language Translation with Gloss Pair Encoding" (PDF). Proceedings of the 11th Workshop on the Representation and Processing of Sign Languages.
[13] Pai, Suhas (2025-08-06). Designing Large Language Model Applications: A Holistic Approach to LLMs. O'Reilly Media. ISBN 978-1-0981-5046-4.
[14] "Byte-Pair Encoding tokenization". Hugging Face NLP Course. Retrieved 2025-08-06.
[15] Yıldırım, Savaş; Chenaghlu, Meysam Asgari (2025-08-06). Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques. Packt Publishing Ltd. ISBN 978-1-80107-889-4.
[16] Wang, Changhan; Cho, Kyunghyun (2020). "Neural Machine Translation with Byte-Level Subwords". Proceedings of the AAAI Conference on Artificial Intelligence. 34 (5): 9154–9160. arXiv:1909.03341. doi:10.1609/aaai.v34i05.6451. ISSN 2374-3468.



