Simplest possible content-addressable file store for blobs.

Simplest Possible Content-Addressable Blob Store

This is a simple content-addressable blob store. It stores blobs of data along with associated metadata. Each blob is stored in a directory hierarchy derived from the base58 encoding of its SHA-256 hash, and metadata is stored in sibling files alongside the blob.
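
The defining property of content addressing is that a blob's identity is a pure function of its bytes, so storing the same content twice always yields the same address. A minimal illustration using the standard library (hex digests here for brevity; GrugStore itself uses base58):

```python
import hashlib

# Content addressing: the blob's address is derived from its bytes alone.
# Identical content hashes to the same address, so duplicates collapse
# into a single stored file.
def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

a = content_hash(b"Hello, World!")
b = content_hash(b"Hello, World!")
c = content_hash(b"Goodbye!")

assert a == b  # same content, same address
assert a != c  # different content, different address
```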

Quick Start

from grugstore import GrugStore

# Create a GrugStore instance
gs = GrugStore('some-dir', hierarchy_depth=3)

# Store a blob
hash_str, file_path = gs.store(b'Hello, World!')

# Check if a blob exists
if gs.exists(hash_str):
    # Load the blob
    blob = gs.load_bytes(hash_str)

Core Methods

Store Metadata

# Set a README for the store
gs.set_readme("This store contains user avatars and profile images")

# Get the README content
readme_content = gs.get_readme()

Storing and Loading Data

# Store raw bytes - returns (hash_string, file_path)
hash_str, file_path = gs.store(b'Hello, World!')

# Stream from a file-like object (e.g., for large files)
with open('large_file.bin', 'rb') as f:
    hash_str = gs.stream(f)

# Load data back
data = gs.load_bytes(hash_str)

# Read data using context manager (for streaming large files)
with gs.read(hash_str) as f:
    content = f.read()  # or read in chunks

# Write data using context manager with automatic hashing
with gs.write() as (f, get_hash):
    f.write(b'Hello, World!')
    f.write(b' More data...')
# After the context exits, get the hash
hash_str = get_hash()
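
A write context manager like the one above can make the hash available immediately after writing by hashing incrementally: each chunk is fed into a running SHA-256 digest as it is written, so no second pass over the data is needed. A stdlib sketch of that mechanism (an illustration, not GrugStore's actual implementation):

```python
import hashlib

# Incremental hashing: update a running digest per chunk, so the final
# hash is available the moment the last write completes.
hasher = hashlib.sha256()
for chunk in (b"Hello, World!", b" More data..."):
    hasher.update(chunk)

incremental = hasher.hexdigest()

# Equivalent to hashing the concatenated bytes in a single pass:
one_shot = hashlib.sha256(b"Hello, World! More data...").hexdigest()
assert incremental == one_shot
```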

Working with Sibling Files

# Store metadata/sibling files
gs.store_sibling(hash_str, 'json', b'{"key": "value"}')
gs.store_sibling(hash_str, 'txt', b'Additional notes')

# Load sibling data
metadata = gs.load_sibling_bytes(hash_str, 'json')
notes = gs.load_sibling_bytes(hash_str, 'txt')

Checking Existence

# Check if main blob exists
if gs.exists(hash_str):
    print("Blob exists!")

# Check if sibling file exists
if gs.exists(hash_str, 'json'):
    metadata = gs.load_sibling_bytes(hash_str, 'json')

Path Operations

# Get path to a blob (without loading it)
blob_path = gs.path_to(hash_str)

# Get path to a sibling file
metadata_path = gs.path_to(hash_str, 'json')

Copying and Moving Files

# Copy an external file into the store
# Returns (hash_string, file_path) - original file remains unchanged
hash_str, store_path = gs.copy_file('/path/to/source/file.pdf')

# Move an external file into the store
# Returns (hash_string, file_path) - original file is deleted
hash_str, store_path = gs.move_file('/path/to/source/file.pdf')

# Both methods:
# - Calculate the file's SHA-256 hash efficiently
# - Create the appropriate directory structure
# - Handle duplicates (won't overwrite existing files)
# - Support both string and Path objects as input
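
Hashing a file "efficiently" typically means reading it in fixed-size chunks rather than loading it into memory at once. A stdlib sketch of that pattern (the chunk size and `hash_file` helper are illustrative, not GrugStore internals):

```python
import hashlib
import tempfile
from pathlib import Path

# Hash a file in fixed-size chunks so memory use stays constant
# regardless of file size.
def hash_file(path: Path, chunk_size: int = 64 * 1024) -> str:
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()

with tempfile.TemporaryDirectory() as d:
    payload = b"x" * 200_000  # larger than one chunk
    p = Path(d) / "file.bin"
    p.write_bytes(payload)
    digest = hash_file(p)

assert digest == hashlib.sha256(payload).hexdigest()
```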

Iteration and Validation

# Iterate over all blobs (excluding siblings)
for hash_str, file_path in gs.iter_files(no_sibling=True):
    print(f"Found blob: {hash_str}")

# Iterate with sibling information
for hash_str, file_path, sibling_extensions in gs.iter_files():
    print(f"Blob: {hash_str}")
    print(f"Siblings: {sibling_extensions}")  # e.g., {'json', 'txt'}

# Validate integrity of all blobs
for invalid_path in gs.validate_tree():
    print(f"Corrupted file: {invalid_path}")

# Auto-delete corrupted files
for invalid_path in gs.validate_tree(auto_delete=True):
    print(f"Deleted corrupted file: {invalid_path}")

# Auto-delete corrupted files and their siblings
for invalid_path in gs.validate_tree(auto_delete=True, delete_siblings=True):
    print(f"Deleted corrupted file: {invalid_path}")
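
Conceptually, validation in a content-addressable store reduces to one check: re-hashing a file's contents must reproduce the name it was stored under. A stdlib sketch of that idea (hex names stand in for GrugStore's base58 names; this is not GrugStore's implementation):

```python
import hashlib
import tempfile
from pathlib import Path

# A stored file is valid iff re-hashing its bytes reproduces its name.
def is_valid(path: Path) -> bool:
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name

with tempfile.TemporaryDirectory() as d:
    data = b"Hello, World!"
    blob = Path(d) / hashlib.sha256(data).hexdigest()
    blob.write_bytes(data)
    intact = is_valid(blob)

    blob.write_bytes(data + b"!")  # simulate bit rot / corruption
    corrupted = is_valid(blob)

assert intact and not corrupted
```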

Filtering and Copying

# Create a filtered copy of the store
def size_filter(hash_str, file_path):
    # Only copy files smaller than 1MB
    return file_path.stat().st_size < 1024 * 1024

# Create a new store with only small files
filtered_gs = gs.filtered_copy('filtered-dir', size_filter)

# The filtered store contains the same hierarchy depth and README
print(f"Hierarchy depth: {filtered_gs.hierarchy_depth}")
print(f"README: {filtered_gs.get_readme()}")

# Example: Copy only specific file types based on sibling extensions
def has_json_metadata(hash_str, file_path):
    # Check if this blob has a JSON sibling
    return gs.exists(hash_str, 'json')

json_only_gs = gs.filtered_copy('json-only-dir', has_json_metadata)

# Example: Copy files matching certain hash patterns
def hash_prefix_filter(hash_str, file_path):
    # Only copy files whose hash starts with 'Q'
    return hash_str.startswith('Q')

q_files_gs = gs.filtered_copy('q-files-dir', hash_prefix_filter)

String Representations

# Get a human-readable string representation
print(gs)  # Output: GrugStore(/path/to/store)

# Get a detailed representation (useful for debugging)
print(repr(gs))  # Output: GrugStore(base_dir=PosixPath('/path/to/store'), hierarchy_depth=3)

File Layout

GrugStore organizes files in a hierarchical directory structure based on the base58-encoded SHA-256 hash of the content. Here's an example of what a GrugStore directory looks like with hierarchy_depth=2:

some-dir/
├── _meta/
│   └── README          # Optional store-level documentation
├── _tmp/                  # Temporary directory for atomic file operations
├── 2/
│   └── X/
│       ├── 2XaBcD...xyz  # The actual blob file (no extension)
│       └── 2XaBcD...xyz.json  # Sibling metadata file
├── 5/
│   └── K/
│       ├── 5Kj9Yz...abc  # Another blob
│       ├── 5Kj9Yz...abc.json  # JSON sibling
│       └── 5Kj9Yz...abc.txt   # Text sibling
└── 8/
    └── R/
        └── 8Rm4Qp...def  # Blob without any sibling files

Directory Structure Details

  • Hash-based hierarchy: Files are organized using prefixes of their base58-encoded hash. With hierarchy_depth=2, the first character of the hash becomes the first directory level and the second character becomes the second level.
  • Blob files: The main content files have no extension and are named with their full hash.
  • Sibling files: Related metadata or additional content files share the same hash name but include an extension (e.g., .json, .txt).
  • _meta/ directory: Contains store-level metadata like README files.
  • _tmp/ directory: Used internally for atomic file operations. Files are first written here and then moved to their final location to ensure write atomicity and prevent partial file corruption.
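
The hash-to-path mapping above can be sketched in a few lines of stdlib Python: hash the content, base58-encode the digest, then peel off one character per hierarchy level. The base58 alphabet below is Bitcoin's, and the simplified encoder drops leading-zero handling; whether either matches GrugStore exactly is an assumption.

```python
import hashlib
from pathlib import Path

# Bitcoin-style base58 alphabet (no 0, O, I, l) -- assumed, not confirmed
# to be GrugStore's exact alphabet.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58(data: bytes) -> str:
    # Simplified encoder: treat the bytes as one big integer and convert
    # to base 58 (leading zero bytes are not preserved in this sketch).
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = ALPHABET[r] + out
    return out or ALPHABET[0]

def blob_path(base_dir: str, content: bytes, hierarchy_depth: int = 2) -> Path:
    # One directory level per leading character of the encoded hash,
    # then the full hash as the file name.
    h = base58(hashlib.sha256(content).digest())
    return Path(base_dir, *h[:hierarchy_depth], h)

p = blob_path("some-dir", b"Hello, World!")
# e.g. some-dir/<c0>/<c1>/<full-hash>
assert p.parts[0] == "some-dir"
assert p.name[0] == p.parts[1] and p.name[1] == p.parts[2]
```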
