Simple tool to compare directory contents.

Directory Content Difference

This project provides simple tools to compare the content of a directory against a reference directory.

This is useful to check the results of a process that generates several files, like a luigi workflow for example.

Installation

This package should be installed using pip:

pip install dir-content-diff

Usage

The dir-content-diff package introduces a framework to compare two directories. A comparator is associated with each file extension, and each file in the reference directory is compared to the file with the same relative path in the compared directory. A few comparators are provided by default for common file types, but others can be associated with new file extensions or can even replace the default ones. A comparator should report the differences between two files accurately, stating which elements of the data differ. When an extension has no associated comparator, a default comparator is used. It only compares the raw binary content of the files, so it cannot report which values are different.
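
The dispatch described above can be pictured with a minimal standard-library sketch. It is not the package's implementation: `COMPARATORS`, `text_comparator`, `binary_comparator` and `compare_file` are hypothetical names used only to show the idea of one comparator per extension with a binary fallback.

```python
import tempfile
from pathlib import Path


def binary_comparator(ref_path, comp_path):
    """Fallback: compare raw bytes, so it cannot say which values differ."""
    if Path(ref_path).read_bytes() == Path(comp_path).read_bytes():
        return None
    return f"The files '{ref_path}' and '{comp_path}' are different."


def text_comparator(ref_path, comp_path):
    """Line-based comparator that reports which lines differ."""
    ref_lines = Path(ref_path).read_text().splitlines()
    comp_lines = Path(comp_path).read_text().splitlines()
    changed = [
        f"line {i}: {r!r} != {c!r}"
        for i, (r, c) in enumerate(zip(ref_lines, comp_lines), start=1)
        if r != c
    ]
    if len(ref_lines) != len(comp_lines):
        changed.append(f"line count: {len(ref_lines)} != {len(comp_lines)}")
    return "\n".join(changed) or None


# Hypothetical registry: one comparator per file extension.
COMPARATORS = {".txt": text_comparator}


def compare_file(ref_path, comp_path):
    """Pick the comparator from the extension, else fall back to bytes."""
    comparator = COMPARATORS.get(Path(ref_path).suffix, binary_comparator)
    return comparator(ref_path, comp_path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ref.txt").write_text("a\nb\n")
    (root / "comp.txt").write_text("a\nc\n")
    (root / "ref.bin").write_bytes(b"\x00\x01")
    (root / "comp.bin").write_bytes(b"\x00\x02")

    # The .txt pair gets a detailed report, the .bin pair only a generic one.
    txt_report = compare_file(root / "ref.txt", root / "comp.txt")
    bin_report = compare_file(root / "ref.bin", root / "comp.bin")
```

In this sketch the registered comparator can name the differing line, while the fallback can only say that the two files differ.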

Compare two directories

Basic Directory Comparison

Let's compare two directories with the following structures:

└── reference_dir
    ├── sub_dir_1
    |   ├── sub_file_1.a
    |   └── sub_file_2.b
    └── file_1.c
└── compared_dir
    ├── sub_dir_1
    |   ├── sub_file_1.a
    |   ├── sub_file_2.b
    |   └── sub_file_3.b
    └── file_1.c

The reference directory contains all the files that should be checked in the compared directory, which means that extraneous files in the compared directory are just ignored.

These two directories can be compared with the following code:

import dir_content_diff

dir_content_diff.compare_trees("reference_dir", "compared_dir")

[!WARNING] The order of the parameters is important: the first path is considered as the reference directory while the second one is the compared directory. Inverting the parameters may return a different result (in this example it would return that the file sub_file_3.b is missing).

If all the files are identical, this code will return an empty dictionary because no difference was detected. As mentioned previously, this is because dir-content-diff is only looking for files in the compared directory that are also present in the reference directory, so the file sub_file_3.b is just ignored in this case.
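
The asymmetry between the two arguments can be illustrated without the package. The sketch below only checks for missing files and uses a hypothetical helper, `missing_files`, that walks the first directory and looks each file up in the second; it assumes nothing about how dir-content-diff is implemented internally.

```python
import tempfile
from pathlib import Path


def missing_files(ref_dir, comp_dir):
    """Return reference files (as relative POSIX paths) absent from comp_dir."""
    ref_dir, comp_dir = Path(ref_dir), Path(comp_dir)
    return sorted(
        p.relative_to(ref_dir).as_posix()
        for p in ref_dir.rglob("*")
        if p.is_file() and not (comp_dir / p.relative_to(ref_dir)).is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    ref, comp = Path(tmp, "reference_dir"), Path(tmp, "compared_dir")
    for root in (ref, comp):
        (root / "sub_dir_1").mkdir(parents=True)
        for name in ("sub_dir_1/sub_file_1.a", "sub_dir_1/sub_file_2.b", "file_1.c"):
            (root / name).write_text("same content")
    # Extra file that exists only in the compared directory.
    (comp / "sub_dir_1" / "sub_file_3.b").write_text("extra")

    forward = missing_files(ref, comp)   # the extra file is ignored
    backward = missing_files(comp, ref)  # roles swapped: it is now reported
```

Here `forward` is empty, while `backward` lists `sub_dir_1/sub_file_3.b`, which mirrors the behaviour described in the warning above.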

Using Custom Comparators

If reference_dir/file_1.c is the following JSON-like file:

{
    "a": 1,
    "b": [1, 2]
}

And compared_dir/file_1.c is the following JSON-like file:

{
    "a": 2,
    "b": [10, 2, 0]
}

The following code registers the JsonComparator for the file extension .c and compares the two directories:

import dir_content_diff

dir_content_diff.register_comparator(".c", dir_content_diff.JsonComparator())
dir_content_diff.compare_trees("reference_dir", "compared_dir")

The previous code will output the following dictionary:

{
    'file_1.c': (
        'The files \'reference_dir/file_1.c\' and \'compared_dir/file_1.c\' are different:\n'
        'Added the value(s) \'{"2": 0}\' in the \'[b]\' key.\n'
        'Changed the value of \'[a]\' from 1 to 2.\n'
        'Changed the value of \'[b][0]\' from 1 to 10.'
    )
}

Assertion-based Comparison

It is also possible to check whether the two directories are equal or not with the following code:

import dir_content_diff

dir_content_diff.register_comparator(".c", dir_content_diff.JsonComparator())
dir_content_diff.assert_equal_trees("reference_dir", "compared_dir")

Which will output the following AssertionError:

AssertionError: The files 'reference_dir/file_1.c' and 'compared_dir/file_1.c' are different:
Added the value(s) '{"2": 0}' in the '[b]' key.
Changed the value of '[a]' from 1 to 2.
Changed the value of '[b][0]' from 1 to 10.

Advanced Configuration Options

The comparators have parameters that can be passed either to be used for all files of a given extension or only for a specific file:

import dir_content_diff

# Get the default comparators
comparators = dir_content_diff.get_comparators()

# Replace the comparators for JSON files to perform the comparison with a given tolerance
comparators[".json"] = dir_content_diff.JsonComparator(default_diff_kwargs={"tolerance": 0.1})

# Use a specific tolerance for the file ``sub_dir_1/sub_file_1.a``
# In this case, the kwargs are used to compute the difference by default, except the following
# specific kwargs: ``return_raw_diffs``, ``load_kwargs``, ``format_data_kwargs``, ``filter_kwargs``,
# ``format_diff_kwargs``, ``sort_kwargs``, ``concat_kwargs`` and ``report_kwargs``.
specific_args = {"sub_dir_1/sub_file_1.a": {"tolerance": 0.5}}

dir_content_diff.assert_equal_trees(
    "reference_dir",
    "compared_dir",
    comparators=comparators,
    specific_args=specific_args,
)

Each comparator has different arguments that are detailed in the documentation.

File-specific Comparators

It's also possible to specify an arbitrary comparator for a specific file:

specific_args = {
    "sub_dir_1/sub_file_1.a": {
        "comparator": dir_content_diff.JsonComparator(),
        "tolerance": 0.5,
    }
}

Pattern-based Configuration

Another possibility is to use regular expressions to associate specific arguments to a set of files:

specific_args = {
    "all files with *.a or *.b extensions": {
        "patterns": [r".*\.[ab]$"],
        "comparator": dir_content_diff.BaseComparator(),
    }
}
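
A pattern of this kind can be checked against a few relative paths with the standard `re` module before it is used. This is only a sketch: that the patterns are applied to relative paths with `re.match` is an assumption, and the file `odd_file.,` is made up to show that a comma inside a character class is matched literally.

```python
import re

paths = ["sub_dir_1/sub_file_1.a", "sub_dir_1/sub_file_2.b", "file_1.c", "odd_file.,"]

strict = re.compile(r".*\.[ab]$")  # extensions .a or .b only
loose = re.compile(r".*\.[a,b]$")  # the comma is a literal member of the class

strict_hits = [p for p in paths if strict.match(p)]
loose_hits = [p for p in paths if loose.match(p)]
```

With these inputs, `strict_hits` contains only the `.a` and `.b` files, while `loose_hits` also contains `odd_file.,`.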

File Filtering

Last but not least, it's possible to filter files from the reference directory (for example because the reference directory contains temporary files that should not be compared). For example, the following code only compares the files whose name starts with file_ and does not end with _tmp.yaml; all other files are ignored:

import dir_content_diff

dir_content_diff.compare_trees(
    "reference_dir",
    "compared_dir",
    include_patterns=[r"file_.*"],
    exclude_patterns=[r".*_tmp\.yaml"],
)
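
The combined effect of the two pattern lists can be sketched with the standard `re` module. The helper `keep` is hypothetical, and its semantics (a file must fully match at least one include pattern, if any are given, and no exclude pattern) are an assumption about the filtering rather than a description of the package's code.

```python
import re


def keep(rel_path, include_patterns=(), exclude_patterns=()):
    """Assumed rule: match an include pattern (if any) and no exclude pattern."""
    included = not include_patterns or any(
        re.fullmatch(p, rel_path) for p in include_patterns
    )
    excluded = any(re.fullmatch(p, rel_path) for p in exclude_patterns)
    return included and not excluded


names = ["file_1.c", "file_2_tmp.yaml", "notes.txt", "file_3.yaml"]
kept = [n for n in names if keep(n, [r"file_.*"], [r".*_tmp\.yaml"])]
```

Under this assumption, `file_2_tmp.yaml` is dropped by the exclude pattern and `notes.txt` is dropped because it matches no include pattern.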

Parallel Execution

By default, dir-content-diff runs file comparisons sequentially. However, for improved performance when comparing large numbers of files, parallel execution is available using either thread-based or process-based concurrency.

Configuration Options

Parallel execution can be configured using the following parameters:

  • executor_type: Controls the type of parallel execution:

    • "sequential" (default): No parallel execution, files are compared one by one
    • "thread": Uses ThreadPoolExecutor (recommended for I/O-bound tasks)
    • "process": Uses ProcessPoolExecutor (recommended for CPU-intensive comparisons)
  • max_workers: Maximum number of worker threads/processes. If None (default), it defaults to min(32, (os.cpu_count() or 1) + 4).
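
To make the default worker count and the thread-based mode concrete, here is a standard-library sketch. It does not call dir-content-diff: `default_max_workers` and `compare_pair` are hypothetical helpers, and the in-memory strings stand in for file contents.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_max_workers():
    """Worker count given above for max_workers=None."""
    return min(32, (os.cpu_count() or 1) + 4)


def compare_pair(pair):
    """Toy comparison: return None when equal, else a short message."""
    ref, comp = pair
    return None if ref == comp else f"{ref!r} != {comp!r}"


# Relative path -> (reference content, compared content).
pairs = {"file_1.c": ("1", "2"), "sub_dir_1/sub_file_1.a": ("x", "x")}

with ThreadPoolExecutor(max_workers=default_max_workers()) as pool:
    # pool.map preserves input order, so results line up with the keys.
    results = dict(zip(pairs, pool.map(compare_pair, pairs.values())))

# Keep only the files for which a difference was found.
differences = {name: msg for name, msg in results.items() if msg is not None}
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` in such a sketch would additionally require the comparison function and its arguments to be picklable.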

Usage Examples

Enable thread-based parallel execution:

import dir_content_diff

dir_content_diff.compare_trees(
    "reference_dir",
    "compared_dir",
    executor_type="thread",
    max_workers=8
)

Enable process-based parallel execution with automatic worker count:

import dir_content_diff

dir_content_diff.compare_trees(
    "reference_dir",
    "compared_dir",
    executor_type="process"
)

Using a configuration object:

import dir_content_diff

config = dir_content_diff.ComparisonConfig(
    executor_type="thread",
    max_workers=4
)

dir_content_diff.compare_trees(
    "reference_dir",
    "compared_dir",
    config=config
)

Performance Considerations

  • Thread-based execution (executor_type="thread") is generally recommended for most use cases as file comparisons are typically I/O-bound operations
  • Process-based execution (executor_type="process") may be beneficial when using computationally intensive comparators or when dealing with very large files
  • Parallel execution is skipped automatically when only one file needs to be compared; the comparison then runs sequentially
  • The optimal number of workers depends on your system's capabilities and the nature of your files; too many workers may actually decrease performance due to overhead

Export formatted data

Some comparators have to format the data before comparing them. For example, when comparing data that contain file paths, it's likely that only a relative part of each path is relevant, not the entire absolute path. To handle this, a specific comparator can be defined with a custom format_data() method, which is automatically called after the data are loaded but before they are compared. It is then possible to export the data just after they have been formatted, for checking purposes for example. To do this, the export_formatted_files argument of the dir_content_diff.compare_trees and dir_content_diff.assert_equal_trees functions can be set to True. All the files processed by a comparator with a save() method are then exported to a new directory, whose name is the name of the compared directory with a suffix appended. By default, the suffix is _FORMATTED, but it can be overridden by passing a non-empty string to the export_formatted_files argument.
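
The kind of normalization a format_data() step might perform can be sketched independently of the package. The functions below are hypothetical stand-ins, not the comparator API: they assume the loaded data is a flat dictionary of strings and that the root directory of each run is known.

```python
from pathlib import PurePosixPath


def relativize(value, root):
    """Return value relative to root, or unchanged if it is not under root."""
    try:
        return PurePosixPath(value).relative_to(root).as_posix()
    except ValueError:
        return value


def format_data(data, root):
    """Strip the run-specific root from every path-like value."""
    return {key: relativize(value, root) for key, value in data.items()}


# Two runs that wrote the same result under different absolute roots.
ref = format_data(
    {"output": "/home/alice/run/out/result.h5", "label": "demo"},
    "/home/alice/run",
)
comp = format_data(
    {"output": "/tmp/ci/run/out/result.h5", "label": "demo"},
    "/tmp/ci/run",
)
```

After formatting, both dictionaries hold `out/result.h5`, so a comparison no longer reports a difference that only comes from where each run was executed.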

Pytest plugin

This package can be used as a pytest plugin. When pytest is run and dir-content-diff is installed, it is automatically detected and registered as a plugin. It is then possible to trigger the export of formatted data with the following pytest option: --dcd-export-formatted-data. It is also possible to define a custom suffix for the new directory with the following option: --dcd-export-suffix.

Funding & Acknowledgment

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

For license and authors, see LICENSE.txt and AUTHORS.md respectively.

Copyright © 2021-2023 Blue Brain Project/EPFL
