HDX Data Freshness

The Python implementation of HDX freshness reads all the datasets from the Humanitarian Data Exchange website (using the HDX Python library) and then iterates through them one by one, performing the following sequence of steps.

  1. It gets the dataset’s update frequency if it has one. If that update frequency is Never, then the dataset is always fresh.

  2. Otherwise, it checks whether the dataset and resource metadata have changed, since a metadata change counts as an update from a freshness perspective. It then compares the time elapsed since the last update with the update frequency and sets a status: fresh, due, overdue or delinquent.

  3. If the dataset is not fresh based on metadata, then the URLs of the resources are examined. If they are internal URLs (data.humdata.org for the HDX filestore, manage.hdx.rwlabs.org for CPS), then no further checking is possible, because the HDX metadata is updated whenever the files behind these URLs change.

  4. If they are URLs with an ad hoc update frequency (proxy.hxlstandard.org, ourairports.com), then freshness cannot be determined. Currently, HDX has no mechanism for specifying ad hoc update frequencies, although there is a proposal to add one to the update frequency options. For now, the freshness value for ad hoc datasets is based on whatever has been set for update frequency, but these datasets can easily be identified and excluded from results if needed.

  5. If the URL is externally hosted and not ad hoc, then we make an HTTP GET request for the file and check the returned headers for the Last-Modified field. If that field exists, we read the date and time from it and check whether it is more recent than the dataset or resource metadata modification date. If it is, we recalculate freshness.

  6. If the resource is not fresh by this measure, then we download the file and calculate an MD5 hash for it. We store previous hash values in our database, so we can check whether the hash has changed since it was last calculated.

  7. There are some resources whose hash changes constantly because they connect to an API that generates a file on the fly. To identify these, we hash again and check whether the hash has changed in the few seconds since the previous hash calculation.
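
The core checks in the steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the use of None to stand for an update frequency of Never, and the thresholds (due, overdue and delinquent at one, two and three times the update frequency) are all assumptions made for the example, since the description above does not state the real cut-offs.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Hypothetical thresholds, as multiples of the update frequency.
# The real cut-offs are not given in this description.
DUE_FACTOR = 1
OVERDUE_FACTOR = 2
DELINQUENT_FACTOR = 3


def freshness_status(last_update: datetime, now: datetime,
                     frequency_days: Optional[int]) -> str:
    """Steps 1-2: classify a dataset by the time since its last update.

    frequency_days=None is this sketch's stand-in for 'Never'.
    """
    if frequency_days is None:
        return "fresh"
    age = now - last_update
    if age >= timedelta(days=frequency_days * DELINQUENT_FACTOR):
        return "delinquent"
    if age >= timedelta(days=frequency_days * OVERDUE_FACTOR):
        return "overdue"
    if age >= timedelta(days=frequency_days * DUE_FACTOR):
        return "due"
    return "fresh"


def last_modified_if_newer(header_value: Optional[str],
                           metadata_modified: datetime) -> Optional[datetime]:
    """Step 5: return the Last-Modified time if it is later than the
    metadata modification time (a timezone-aware datetime), else None."""
    if not header_value:
        return None
    try:
        modified = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return None  # header present but not a parseable date
    if modified.tzinfo is None:
        modified = modified.replace(tzinfo=timezone.utc)
    return modified if modified > metadata_modified else None


def md5_of(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()


def hash_verdict(previous_hash: Optional[str], first_download: bytes,
                 second_download: bytes) -> str:
    """Steps 6-7: compare against the stored hash, then against a second
    download taken moments later to spot files generated on the fly.

    With no stored hash, this simplified sketch reports 'updated'.
    """
    first_hash = md5_of(first_download)
    if first_hash == previous_hash:
        return "unchanged"
    if md5_of(second_download) != first_hash:
        return "volatile"  # content differs between back-to-back downloads
    return "updated"
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the status calculation deterministic and easy to test.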

Since URLs can suffer temporary connection and download issues, the code retries failed requests multiple times with increasing delays. Also, as there are many requests to be made, they are executed concurrently rather than one by one, using the asynchronous functionality added in recent versions of Python.
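
As a rough illustration of retrying with increasing delays while running requests concurrently, the sketch below uses only the standard library's asyncio. The names fetch_with_retries and check_all, the fetch coroutine passed in by the caller, the linear delay schedule and the concurrency limit are all hypothetical choices for this example. They are not taken from the project, which may use a different HTTP client and retry policy.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable

Fetch = Callable[[str], Awaitable[Any]]


async def fetch_with_retries(fetch: Fetch, url: str, attempts: int = 3,
                             base_delay: float = 1.0) -> Any:
    """Call fetch(url), retrying on OSError with linearly increasing delays."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay, then 2 * base_delay, and so on.
            await asyncio.sleep(base_delay * (attempt + 1))


async def check_all(fetch: Fetch, urls: Iterable[str],
                    max_concurrent: int = 10,
                    base_delay: float = 1.0) -> Dict[str, Any]:
    """Fetch all URLs concurrently, at most max_concurrent at a time.

    Maps each URL to its result, or to the exception from its last attempt.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str):
        async with semaphore:
            try:
                result = await fetch_with_retries(fetch, url,
                                                  base_delay=base_delay)
            except OSError as exc:
                result = exc
            return url, result

    return dict(await asyncio.gather(*(bounded(url) for url in urls)))
```

A caller would supply its own fetch coroutine (for example, one wrapping an asynchronous HTTP client) and run asyncio.run(check_all(fetch, urls)). Recording a failure as a value instead of raising it means one unreachable URL does not abort the checks on the others.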

Usage

python run.py

This version

1.4.3
