An anonymization tool for production databases

pynonymizer

pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.

This can help you support GDPR/data protection compliance in your organization without compromising on the quality of your testing data.

Why are anonymized databases important?

The primary source of information on how your database is used is your production database. In most situations, the production dataset is significantly larger than any development copy and contains a wider range of data.

From time to time, it is prudent to run a new feature or stage a test against this dataset, rather than one that is artificially created by developers or by testing frameworks. Anonymized databases allow us to use the structures present in production, while stripping them of any personally identifiable data that would constitute a breach of privacy for end users and, subsequently, a breach of GDPR.

With anonymized databases, copies can be processed regularly and distributed easily, leaving your developers and testers with a rich source of information on the volume and general makeup of the system in production. These copies can be used to run better staging environments and integration tests, and even to simulate database migrations.

Below is an excerpt from an anonymized database:

id  salutation  firstname  surname    email                              dob
1   Dr.         Bernard    Gough      tnelson@powell.com                 2000-07-03
2   Mr.         Molly      Bennett    clarkeharriet@price-fry.com        2014-05-19
3   Mrs.        Chelsea    Reid       adamsamber@clayton.com             1974-09-08
4   Dr.         Grace      Armstrong  tracy36@wilson-matthews.com        1963-12-15
5   Dr.         Stanley    James      christine15@stewart.net            1976-09-16
6   Dr.         Mark       Walsh      dgardner@ward.biz                  2004-08-28
7   Mrs.        Josephine  Chambers   hperry@allen.com                   1916-04-04
8   Dr.         Stephen    Thomas     thompsonheather@smith-stevens.com  1995-04-17
9   Ms.         Damian     Thompson   yjones@cox.biz                     2016-10-02
10  Miss        Geraldine  Harris     porteralice@francis-patel.com      1910-09-28
11  Ms.         Gemma      Jones      mandylewis@patel-thomas.net        1990-06-03
12  Dr.         Glenn      Carr       garnervalerie@farrell-parsons.biz  1998-04-19

How does it work?

pynonymizer replaces personally identifiable data in your database with realistic pseudorandom data, generated by the Faker library or by other functions. A wide variety of data types is available, so you can pick one that suits the column in question, for example:

  • unique_email
  • company
  • file_path
  • [...]

For a full list of data generation strategies, see the docs on strategyfiles.
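To make the idea of column replacement concrete, here is a minimal sketch that overwrites an email column with unique pseudorandom values. It uses only the standard library as a stand-in for Faker, and it is not pynonymizer's implementation: the names `make_unique_email_generator` and `anonymize_rows`, the `example.com` domain, and the in-memory rows are all assumptions made for illustration.

```python
import random
import string


def make_unique_email_generator(seed=None, domain="example.com"):
    """Return a function that produces a different fake email on every call."""
    rng = random.Random(seed)
    seen = set()

    def unique_email():
        # Retry on the (unlikely) collision so no value is ever repeated.
        while True:
            local = "".join(rng.choice(string.ascii_lowercase) for _ in range(10))
            email = f"{local}@{domain}"
            if email not in seen:
                seen.add(email)
                return email

    return unique_email


def anonymize_rows(rows, column, generator):
    """Return copies of the rows with `column` overwritten by generated values."""
    return [{**row, column: generator()} for row in rows]


# Hypothetical rows standing in for a table with a sensitive column.
rows = [
    {"id": 1, "email": "real.person@corp.example"},
    {"id": 2, "email": "another.person@corp.example"},
]
for row in anonymize_rows(rows, "email", make_unique_email_generator(seed=42)):
    print(row)
```

The non-sensitive `id` column is left untouched, which is what keeps the anonymized copy structurally faithful to production.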

Process outline

  1. Restore from dumpfile to temporary database.
  2. Anonymize temporary database with strategy.
  3. Dump resulting data to file.
  4. Drop temporary database.
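The four steps above can be sketched as a plan of `mysql` and `mysqldump` commands. This is an illustration under stated assumptions, not pynonymizer's actual code: the `Step` type, the `plan_run` function, the default host and user, and the example UPDATE statement are hypothetical, and the sketch only builds the commands without running them. Credentials are left out on purpose; they would normally come from an option file or the environment.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    description: str
    argv: List[str]
    stdin_path: Optional[str] = None  # file to feed on standard input, if any


def plan_run(dumpfile, output, db_name, anonymize_sql, host="127.0.0.1", user="root"):
    """Build (but do not execute) the commands for one anonymization run."""
    conn = ["--host", host, "--user", user]
    steps = [
        # 1. Restore from dumpfile to a temporary database.
        Step("create temporary database",
             ["mysql", *conn, "--execute", f"CREATE DATABASE `{db_name}`"]),
        Step("restore dumpfile", ["mysql", *conn, db_name], stdin_path=dumpfile),
    ]
    # 2. Anonymize the temporary database, one SQL statement per command.
    steps += [
        Step("anonymize", ["mysql", *conn, "--execute", sql, db_name])
        for sql in anonymize_sql
    ]
    # 3. Dump the resulting data to a file.
    steps.append(
        Step("dump result", ["mysqldump", *conn, f"--result-file={output}", db_name])
    )
    # 4. Drop the temporary database.
    steps.append(
        Step("drop temporary database",
             ["mysql", *conn, "--execute", f"DROP DATABASE `{db_name}`"])
    )
    return steps


# Hypothetical file names, database name, and SQL, for illustration only.
plan = plan_run(
    "prod.sql",
    "anonymized.sql",
    "tmp_anon",
    ["UPDATE users SET surname = 'REDACTED'"],
)
for step in plan:
    print(step.description, step.argv)
```

Each `argv` could be passed to `subprocess.run`, with `stdin_path` opened and supplied as `stdin` for the restore step.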

Supported Databases

  • mysql
  • More coming soon!

Requirements

  • Python >= 3.6

mysql

  • mysql/mysqldump: You will need these utilities on your PATH.
  • An active database connection (mysql >= 5.5), either local or remote, to restore, anonymize, and dump from.
  • A backup in single-file mysqldump output (schema and data).
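A quick way to confirm the first requirement before a run is to look the utilities up on the PATH. The sketch below uses the standard library's `shutil.which`; the `missing_tools` helper and its `which` parameter are hypothetical names introduced here, not part of pynonymizer.

```python
import shutil

REQUIRED_TOOLS = ("mysql", "mysqldump")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required executables that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("mysql and mysqldump are available.")
```

The `which` parameter exists only so the lookup can be swapped out when testing.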

Getting Started

Usage

  1. Write a strategyfile for your database
  2. Run pynonymizer with the options below:
usage: pynonymizer [-h] [--input INPUT] [--strategy STRATEGYFILE]
                   [--output OUTPUT] [--db-type DB_TYPE] [--db-host DB_HOST]
                   [--db-name DB_NAME] [--db-user DB_USER]
                   [--db-password DB_PASSWORD] [--fake-locale FAKE_LOCALE]
                   [-v]

A tool for writing better anonymization strategies for your production
databases.

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        The source dumpfile to read from. [file.sql,
                        file.sql.gz] [$PYNONYMIZER_INPUT]
  --strategy STRATEGYFILE, -s STRATEGYFILE
                        A strategyfile to use during anonymization.
                        [$PYNONYMIZER_STRATEGY]
  --output OUTPUT, -o OUTPUT
                        The destination to write the dumped output to.
                        [file.sql, file.sql.gz] [$PYNONYMIZER_OUTPUT]
  --db-type DB_TYPE, -t DB_TYPE
                        Type of database to interact with. More databases will
                        be supported in future versions. default: mysql
                        [$PYNONYMIZER_DB_TYPE]
  --db-host DB_HOST, -d DB_HOST
                        Database hostname or IP address.
                        [$PYNONYMIZER_DB_HOST]
  --db-name DB_NAME, -n DB_NAME
                        Name of database to restore and anonymize in. If not
                        provided, a unique name will be generated from the
                        strategy name. This will be dropped at the end of the
                        run. [$PYNONYMIZER_DB_NAME]
  --db-user DB_USER, -u DB_USER
                        Database credentials: username. [$PYNONYMIZER_DB_USER]
  --db-password DB_PASSWORD, -p DB_PASSWORD
                        Database credentials: password. Recommended: use
                        environment variables to avoid exposing secrets in
                        production environments. [$PYNONYMIZER_DB_PASSWORD]
  --fake-locale FAKE_LOCALE, -l FAKE_LOCALE
                        Locale setting to initialize fake data generation.
                        Affects Names, addresses, formats, etc. [$FAKE_LOCALE]
  -v, --version         show program's version number and exit
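As a sketch of how the options above might be combined in a script, the helper below builds a command line from the documented `--input`, `--strategy`, `--output`, and `--db-user` flags and passes the password through the documented `$PYNONYMIZER_DB_PASSWORD` variable, following the help text's recommendation to keep secrets off the command line. The `build_invocation` function and all file names and credentials shown are hypothetical.

```python
import os


def build_invocation(input_path, strategy_path, output_path, db_user, db_password,
                     base_env=None):
    """Return (argv, env) for a pynonymizer run without executing it."""
    argv = [
        "pynonymizer",
        "--input", input_path,
        "--strategy", strategy_path,
        "--output", output_path,
        "--db-user", db_user,
    ]
    # Copy the environment so the caller's os.environ is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # The password travels via the environment, not via --db-password.
    env["PYNONYMIZER_DB_PASSWORD"] = db_password
    return argv, env


# Hypothetical paths and credentials, for illustration only.
argv, env = build_invocation(
    "prod.sql.gz", "strategy.yml", "anonymized.sql.gz", "root", "s3cret"
)
print(" ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```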

License

MIT
