
Infini-Transformer - Pytorch

Implementation of Infini-Transformer in Pytorch. The authors use a linear attention scheme to compress past key / values into a fixed-size memory and demonstrate multiple SOTA results on long-context benchmarks.

Although unlikely to beat Ring Attention, I think it is worth exploring, as the techniques are orthogonal.

Yannic Kilcher's explanation
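
At its core, the compressive memory is a linear-attention fast-weight matrix per head. The sketch below paraphrases the paper's retrieval and update equations (the delta-rule write corresponds to the use_mem_delta_rule flag in the usage example further down). The function names and shapes are illustrative only, not this repository's internals.

import torch
import torch.nn.functional as F

def sigma(x):
    # elu + 1 keeps the kernel feature map positive
    return F.elu(x) + 1.

def retrieve(q, memory, norm):
    # read: sigma(q) @ M, normalized by sigma(q) @ z
    return (sigma(q) @ memory) / (sigma(q) @ norm).clamp(min = 1e-6).unsqueeze(-1)

def update(k, v, memory, norm, use_delta_rule = True):
    sk = sigma(k)

    if use_delta_rule:
        # delta rule: write only the part of v the memory cannot already retrieve
        v = v - retrieve(k, memory, norm)

    memory = memory + sk.t() @ v
    norm = norm + sk.sum(dim = 0)
    return memory, norm

dim_head = 128
memory = torch.zeros(dim_head, dim_head)  # compressive memory (one per head in practice)
norm = torch.zeros(dim_head)              # normalization term z

k = torch.randn(1024, dim_head)
v = torch.randn(1024, dim_head)
q = torch.randn(1024, dim_head)

memory, norm = update(k, v, memory, norm)
out = retrieve(q, memory, norm)           # (1024, 128)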

Install

$ pip install infini-transformer-pytorch

Usage

import torch
from infini_transformer_pytorch import InfiniTransformer

transformer = InfiniTransformer(
    num_tokens = 256,
    dim = 512,
    depth = 8,
    dim_head = 128,  # high head dimension may be part of the reason they got good results (kv has high capacity)
    heads = 8,
    use_mem_delta_rule = True
)

x = torch.randint(0, 256, (1, 1024))

logits1, _, mem1 = transformer(x, return_new_memories = False)
logits2, _, mem2 = transformer(x, past_memories = mem1, return_new_memories = False)
logits3, _, mem3 = transformer(x, past_memories = mem2, return_new_memories = True)
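
For sequences longer than a single forward pass, one way to use the returned memories is to step through fixed-size segments and feed each segment's memories into the next call. A minimal sketch, assuming past_memories accepts None (or is simply omitted) for the first segment:

long_seq = torch.randint(0, 256, (1, 4096))

past_memories = None

for segment in long_seq.split(1024, dim = -1):
    logits, _, past_memories = transformer(
        segment,
        past_memories = past_memories,
        return_new_memories = True
    )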

Training a transformer with recurrence usually trips up a lot of researchers, so to make it easy, just wrap the model with InfiniTransformerWrapper.

import torch

from infini_transformer_pytorch import (
    InfiniTransformer,
    InfiniTransformerWrapper
)

# model and wrapper

model = InfiniTransformer(
    num_tokens = 256,
    dim = 512,
    depth = 8,
    dim_head = 128,
    heads = 8,
    use_mem_delta_rule = True
)

wrapper = InfiniTransformerWrapper(
    model,
    segment_length = 512,
    detach_mems_every_num_segments = 2 # greater than 1 so the network can learn how to 'write' to the fast weight memories
).cuda()

# mock input

seq = torch.randint(0, 256, (2, 10000)).cuda() # can be an arbitrarily long sequence

# training

loss = wrapper(
    seq,
    backward = True # will automatically segment and accumulate gradients when it detaches the memories
)
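
# in a real run, the call above sits inside a loop paired with an optimizer step.
# rough sketch only - Adam and the random batches are assumptions, not part of this repo

optim = torch.optim.Adam(wrapper.parameters(), lr = 3e-4)

for _ in range(100):
    data = torch.randint(0, 256, (2, 10000)).cuda()  # stand-in for batches from a real dataset

    loss = wrapper(data, backward = True)            # segments, accumulates grads, backwards internally

    optim.step()
    optim.zero_grad()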

# after much data...

# calculating eval loss

with torch.no_grad():
    wrapper.eval()
    eval_loss = wrapper(seq)

# generating is as easy as

output = wrapper.generate(seq_len = 8192, prompt = seq[:, :1])

output.shape # (2, 8192 - 1)

Testing

Train an autoregressive language model on enwik8

$ python train.py

Todo

  • detach_mems_every_num_segments hyperparameter is too confusing, get rid of it
  • experiment with enhanced recurrence, perhaps with a linear projection (talking heads on kv, or separate linear projections on k and v) before sending the memories to the preceding layer
  • working example with enwik8

Citations

@inproceedings{Munkhdalai2024LeaveNC,
    title   = {Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention},
    author  = {Tsendsuren Munkhdalai and Manaal Faruqui and Siddharth Gopal},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:269033427}
}
