BLT's Entropy-based Patcher vs. Tokenizer Visualization

Enter text to visualize its segmentation according to different methods:

  1. Byte Latent Transformer (BLT): Entropy-based patching plot and patched text. Spaces are replaced by '_' for visualization purposes. Uses blt_main_entropy_100m_512w.
  2. Tiktoken (GPT-4o): Text segmented by o200k_base tokens.
  3. Llama 3: Text segmented by the meta-llama/Meta-Llama-3-8B tokenizer.
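
To give a feel for what entropy-based patching means in method 1, here is a minimal sketch in Python. It is not the BLT implementation: the trained blt_main_entropy_100m_512w model is replaced by a toy byte-bigram model estimated from the input text itself, and the function names (`toy_entropies`, `entropy_patches`) and the `threshold` parameter are hypothetical names chosen for illustration. The idea shown is only the general one: estimate how uncertain the next byte is at each position, and start a new patch wherever that uncertainty exceeds a threshold.

```python
import math
from collections import Counter, defaultdict


def entropy_bits(counter: Counter) -> float:
    """Shannon entropy (in bits) of the empirical distribution in `counter`."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def toy_entropies(data: bytes) -> list[float]:
    """Per-position next-byte entropy from a toy bigram model.

    ASSUMPTION: this stands in for a trained entropy model. The entropy at
    position i is that of the bytes observed to follow data[i-1] anywhere in
    `data`. Position 0 gets 0.0 because it always opens the first patch.
    """
    followers = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        followers[prev][nxt] += 1
    return [0.0] + [entropy_bits(followers[prev]) for prev in data[:-1]]


def entropy_patches(text: str, threshold: float = 1.0) -> list[str]:
    """Split `text` into patches, opening a new patch at every byte whose
    estimated entropy exceeds `threshold` (a hypothetical default value).

    Spaces are shown as '_' to mirror the visualization described above.
    """
    data = text.encode("utf-8")
    if not data:
        return []
    entropies = toy_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    # errors="replace" guards against a boundary landing inside a
    # multi-byte UTF-8 character; ASCII input is unaffected.
    return [p.decode("utf-8", errors="replace").replace(" ", "_") for p in patches]


if __name__ == "__main__":
    for patch in entropy_patches("the cat sat on the mat", threshold=1.0):
        print(patch)
```

Raising `threshold` yields fewer, longer patches, and lowering it yields more, shorter ones. Unlike the fixed vocabularies used in methods 2 and 3, the boundaries here depend on the estimated predictability of each byte in context.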

Companion blog post can be found here.