Python classes representing different file formats, for use in type hinting in data workflows

FileFormats provides a library of file-format types implemented as Python classes. The file-format types are designed to be used in type validation during the construction of data workflows (e.g. Pydra, Fastr), and also provide some basic data handling methods (e.g. loading data to dictionaries) and conversions between some equivalent types when the “extended” install option is provided.

File-format types are typically identified by a combination of file extension and “magic numbers” where applicable. However, unlike many other file-type Python packages, FileFormats supports multi-file data formats (“file sets”) often found in scientific workflows, e.g. with separate header/data files. FileFormats also provides a flexible framework to add custom identification routines for exotic file formats, e.g. formats that require inspection of headers to locate data files, directories containing certain file types, or peeking at metadata fields to define specific sub-types (e.g. functional MRI DICOM file set).
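
As a rough illustration of the identification approach described above (file extension plus magic number), and not of FileFormats' actual implementation, a standard-library-only check might look like the following. The helper name looks_like and its parameters are hypothetical.

```python
from pathlib import Path


def looks_like(path, ext, magic, offset=0):
    """Return True if `path` ends with `ext` and holds `magic` bytes at `offset`.

    Hypothetical helper for illustration only; it is not part of FileFormats.
    """
    path = Path(path)
    # Cheap check first: compare the file extension case-insensitively
    if path.suffix.lower() != ext.lower():
        return False
    # Then read only as many bytes as the magic number needs
    with path.open("rb") as f:
        f.seek(offset)
        return f.read(len(magic)) == magic
```

A call such as looks_like("/path/to/image/file.png", ".png", b"\x89PNG") would then return a boolean. Multi-file formats need more than this, since a second file (e.g. a header) has to be located alongside the first.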

See the extension template for instructions on how to design FileFormats extension modules to augment the standard file types implemented in the main repository with custom domain/vendor-specific file-format types.

Notes on MIME-type coverage

Support for all non-vendor standard MIME types (i.e. ones not matching */vnd.* or */x-*) has been added to FileFormats by semi-automatically scraping the IANA MIME types website for file extensions and magic numbers. As such, many of the formats in the library have not been properly tested on real data and so should be treated with some caution. If you encounter any issues with an implemented file type, please raise an issue in the GitHub tracker.
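
The vendor/non-vendor split described above can be sketched with glob-style matching from the standard library. This is only an illustration of the */vnd.* and */x-* rule, not the scraping code used by FileFormats; the names VENDOR_PATTERNS and is_standard_mime are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns named in the text above for vendor and experimental MIME types
VENDOR_PATTERNS = ("*/vnd.*", "*/x-*")


def is_standard_mime(mime):
    """Return True if `mime` matches none of the vendor/experimental patterns."""
    return not any(fnmatchcase(mime, pattern) for pattern in VENDOR_PATTERNS)


candidates = ["image/png", "application/vnd.ms-excel", "application/x-tar"]
standard = [m for m in candidates if is_standard_mime(m)]
```

Here standard keeps only "image/png", since the other two entries match a vendor or experimental pattern.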

Adding support for vendor formats will be relatively straightforward; it just requires someone to manually curate the scraped data (a day's work or so). Please get in touch if you are interested in helping out with this.

Installation

FileFormats can be installed for Python >= 3.7 from PyPI with

$ python3 -m pip install fileformats

Support for converter methods between a few select formats can be installed by passing the ‘extended’ install extra, e.g.

$ python3 -m pip install fileformats[extended]

Examples

Using the WithMagicNumber mixin class, the Png format can be defined concisely as

from fileformats.generic import File
from fileformats.core.mixin import WithMagicNumber

class Png(WithMagicNumber, File):
    binary = True
    ext = ".png"
    iana_mime = "image/png"
    magic_number = b"\x89PNG"  # the PNG signature starts with byte 0x89, not an ASCII "."

Files can then be checked to see whether they are of PNG format by

png = Png("/path/to/image/file.png")  # Checks the extension and magic number

which will raise a FormatMismatchError if initialisation or validation fails. For a boolean check instead, use the matches method

if Png.matches(a_path_to_a_file):
    ...  # handle the case where the file is a PNG

Format Conversion

While not implemented in the main FileFormats package itself, FileFormats provides hooks for other packages to implement extra behaviour such as format conversion. The fileformats-extras package implements a number of converters between standard file-format types, e.g. archive types to/from generic files/directories, which, if installed, can be called using the convert() method.

from fileformats.application import Zip
from fileformats.generic import Directory

zip_file = Zip.convert(Directory("/path/to/a/directory"))
extracted = Directory.convert(zip_file)
copied = extracted.copy_to("/path/to/output")
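
For readers unfamiliar with what such a round trip involves, the following standard-library sketch archives a directory and extracts it again. It only approximates the effect of the conversion shown above and is not how the fileformats-extras converters are implemented; the function names zip_directory and unzip_to_directory are hypothetical.

```python
import shutil
from pathlib import Path


def zip_directory(directory, out_dir):
    """Archive the contents of `directory` into `<out_dir>/<name>.zip`."""
    directory = Path(directory)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" itself and returns the archive path as a string
    archive = shutil.make_archive(
        str(out_dir / directory.name), "zip", root_dir=str(directory)
    )
    return Path(archive)


def unzip_to_directory(zip_path, out_dir):
    """Extract `zip_path` into `<out_dir>/<archive stem>` and return that path."""
    zip_path = Path(zip_path)
    target = Path(out_dir) / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(zip_path), str(target), format="zip")
    return target
```

The convert() method shown above hides these steps behind the format classes, so calling code deals with format types such as Zip and Directory instead of paths and archive options.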

The converters are implemented in the Pydra dataflow framework, and can be linked into wider Pydra workflows by creating a converter task

import pydra
from pydra.tasks.mypackage import MyTask
from fileformats.application import Json, Yaml

wf = pydra.Workflow(name="a_workflow", input_spec=["in_json"])
wf.add(
    Yaml.get_converter(Json, name="json2yaml", in_file=wf.lzin.in_json)
)
wf.add(
    MyTask(
        name="my_task",
        in_file=wf.json2yaml.lzout.out_file,
    )
)
...

Alternatively, the conversion can be executed outside of a Pydra workflow with

json_file = Json("/path/to/file.json")
yaml_file = Yaml.convert(json_file)

License

This work is licensed under a Creative Commons Attribution 4.0 International License
