Getting Started
Installation
pip install deltatensors
pip install torch safetensors # for loading from safetensors directories
Basic usage
Save a delta
import deltatensors as dt
dt.save_delta_from_paths(
    "checkpoint.wdelta",
    "qwen-wiki/",  # fine-tuned model directory
    "qwen-base/",  # base model directory
    strategy="int4",
    outlier_fraction=0.01,
)
This streams tensor pairs from disk one at a time, so peak RAM is O(1 tensor) rather than O(two full models). For models that fit comfortably in RAM, see "In-memory usage (small models)" below.
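To make the memory claim concrete, here is a minimal sketch of the streaming pattern using a generator. It is not deltatensors' internal code: `stream_pairs` and the two loader callables are hypothetical names, standing in for whatever reads a single tensor from a model directory.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")  # any tensor type, e.g. np.ndarray


def stream_pairs(
    names: Iterable[str],
    load_finetuned: Callable[[str], T],
    load_base: Callable[[str], T],
) -> Iterator[Tuple[str, T, T]]:
    """Yield (name, finetuned, base) lazily, one tensor pair at a time.

    Hypothetical helper: nothing is loaded until the consumer asks for the
    next item, so only one pair needs to be resident at once.
    """
    for name in names:
        yield name, load_finetuned(name), load_base(name)
```

A consumer would compute `finetuned - base` for each pair, compress it, write it out, and drop its references before pulling the next pair. That is what keeps peak memory proportional to the largest single tensor.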
Reconstruct
recon_sd = dt.load_delta_from_paths(
    "checkpoint.wdelta",
    "qwen-base/",
    verify=True,
)
Returns a Dict[str, np.ndarray]. Passing verify=True checks the base model's SHA-256 hash against the one stored in the .wdelta file. This is recommended: applying a delta to the wrong base raises no error, it just silently produces garbage weights.
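For intuition about what a base-model hash check involves, the sketch below hashes tensor names and raw bytes in a fixed order and compares the digest with an expected value. The hashing scheme (sorted names, length prefix) and the names `hash_tensors` and `check_base` are assumptions for illustration. The scheme deltatensors actually uses is not described here and may differ.

```python
import hashlib
from typing import Mapping


def hash_tensors(raw: Mapping[str, bytes]) -> str:
    """SHA-256 over tensor names and raw bytes, visited in sorted-name order.

    Assumed scheme for illustration only. With numpy, the raw bytes could
    come from arr.tobytes().
    """
    h = hashlib.sha256()
    for name in sorted(raw):
        data = raw[name]
        h.update(name.encode("utf-8"))
        # Length prefix so adjacent tensors cannot blur into each other.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()


def check_base(raw_base: Mapping[str, bytes], expected_hash: str) -> None:
    """Raise ValueError if the base tensors do not match the expected hash."""
    actual = hash_tensors(raw_base)
    if actual != expected_hash:
        raise ValueError(
            f"base model hash mismatch: expected {expected_hash[:12]}, "
            f"got {actual[:12]}"
        )
```

Sorting the names makes the digest independent of dictionary order, so the same weights always hash to the same value.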
Load into a HuggingFace model
load_delta_from_paths gives you a numpy state dict. To run inference you need to patch it into a model. The trick is to do it in-place so you don't hold a full second copy in RAM:
from transformers import AutoModelForCausalLM
from deltatensors.format import read_wdelta
from deltatensors.compress import decompress
import torch
model = AutoModelForCausalLM.from_pretrained("qwen-base/", dtype=torch.float32)
sd = model.state_dict()
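with open("checkpoint.wdelta", "rb") as f:
    _, _, compressed_tensors = read_wdelta(f)

# state_dict() tensors share storage with the model's parameters,
# so add_ below updates the model weights in place.
for name, payload in compressed_tensors.items():
    if name not in sd:
        continue  # skip tensors the model does not have
    delta = torch.from_numpy(decompress(payload))
    sd[name].add_(delta.to(sd[name].dtype))
    del delta  # free the delta before decompressing the next one

model.load_state_dict(sd, strict=False)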
Peak RAM here is one loaded model + one delta tensor at a time.
Inspect without loading anything
info = dt.inspect("checkpoint.wdelta")
print(info)
# {
#     'path': 'checkpoint.wdelta',
#     'size_mb': 294.2,
#     'parent_hash': 'e1810a...',
#     'strategy': 'int4',
#     'n_tensors': 290,
#     'tensors': {
#         'model.embed_tokens.weight': {'shape': [151936, 896], 'dtype': 'float32'},
#         ...
#     }
# }
Useful for checking what base model a .wdelta was built against (parent_hash) before you bother loading anything.
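As a sketch of how the inspect output might be used in a pre-flight check, the helpers below work on the dictionary shown above. They rely only on the `parent_hash`, `strategy`, and `tensors` fields from that example output. The function names are hypothetical, and matching on a hash prefix is a convenience assumed here, not library behavior.

```python
from math import prod
from typing import Any, Dict


def summarize(info: Dict[str, Any]) -> Dict[str, Any]:
    """Condense an inspect() result into a few headline numbers."""
    n_params = sum(prod(t["shape"]) for t in info["tensors"].values())
    return {
        "parent_hash": info["parent_hash"],
        "strategy": info["strategy"],
        "n_params": n_params,
    }


def built_against(info: Dict[str, Any], expected_hash_prefix: str) -> bool:
    """True if the recorded parent hash starts with the expected prefix."""
    return info["parent_hash"].startswith(expected_hash_prefix)
```

Typical use would be `info = dt.inspect("checkpoint.wdelta")`, followed by a call to `built_against(info, ...)` with the hash of the base model you have on disk, before paying the cost of loading that base model.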
Choosing a strategy
int4 is the default recommendation: it gave a 0.58% perplexity difference at 3.2x compression on Qwen2.5-0.5B. Use sparse if you want to tune the quality/compression tradeoff manually via sparsity=. quantized is the most aggressive and will show more quality loss.
| Strategy | Use when |
|---|---|
| int4 | Best compression with near-lossless quality |
| sparse | Tunable tradeoff via sparsity=0.0 to 0.99 |
| quantized | Maximum compression, more quality loss |
In-memory usage (small models)
If your models fit in RAM you can skip the path-based API and pass state dicts directly:
finetuned_sd = {...} # Dict[str, np.ndarray] or Dict[str, torch.Tensor]
base_sd = {...}
dt.save_delta("checkpoint.wdelta", finetuned_sd, base_sd, strategy="int4")
recon_sd = dt.load_delta("checkpoint.wdelta", base_sd, verify=True)
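Since the compression strategies are lossy, it can be worth checking how far the reconstruction drifts from the fine-tuned weights. The helper below is a small sketch for that, assuming both state dicts hold numpy arrays. `max_abs_errors` is a hypothetical name, not part of deltatensors.

```python
from typing import Dict, Mapping

import numpy as np


def max_abs_errors(
    recon_sd: Mapping[str, np.ndarray],
    reference_sd: Mapping[str, np.ndarray],
) -> Dict[str, float]:
    """Per-tensor maximum absolute difference between two state dicts."""
    report: Dict[str, float] = {}
    for name, ref in reference_sd.items():
        if name not in recon_sd:
            raise KeyError(f"missing tensor in reconstruction: {name}")
        rec = np.asarray(recon_sd[name], dtype=np.float64)
        ref = np.asarray(ref, dtype=np.float64)
        if rec.shape != ref.shape:
            raise ValueError(f"shape mismatch for {name}: {rec.shape} vs {ref.shape}")
        report[name] = float(np.max(np.abs(rec - ref))) if ref.size else 0.0
    return report
```

Calling `max_abs_errors(recon_sd, finetuned_sd)` after the round trip above and sorting the result by value points at the tensors that lost the most precision.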