An anonymization tool for production databases

pynonymizer

⚠ Mirror

I'm currently reviewing moving this from GitLab to GitHub. For now this is an official mirror, but both issue trackers/PRs will be monitored.

I'd appreciate any thoughts on this or help converting my GitLab pipeline!

pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.

This can help you support GDPR/Data Protection in your organization without compromising on quality testing data.

Why are anonymized databases important?

The primary source of information on how your database is used is your production database. In most situations, the production dataset is significantly larger than any development copy and contains a wider range of data.

From time to time, it is prudent to run a new feature or stage a test against this dataset, rather than one that is artificially created by developers or by testing frameworks. Anonymized databases allow us to use the structures present in production, while stripping them of any personally identifiable data that would constitute a breach of privacy for end users and, subsequently, a breach of GDPR.

With anonymized databases, copies can be processed regularly and distributed easily, leaving your developers and testers with a rich source of information on the volume and general makeup of the system in production. These copies can be used to run better staging environments and integration tests, and even to simulate database migrations.

Below is an excerpt from an anonymized database:

id	salutation	firstname	surname	email	dob
1	Dr.	Bernard	Gough	tnelson@powell.com	2000-07-03
2	Mr.	Molly	Bennett	clarkeharriet@price-fry.com	2014-05-19
3	Mrs.	Chelsea	Reid	adamsamber@clayton.com	1974-09-08
4	Dr.	Grace	Armstrong	tracy36@wilson-matthews.com	1963-12-15
5	Dr.	Stanley	James	christine15@stewart.net	1976-09-16
6	Dr.	Mark	Walsh	dgardner@ward.biz	2004-08-28
7	Mrs.	Josephine	Chambers	hperry@allen.com	1916-04-04
8	Dr.	Stephen	Thomas	thompsonheather@smith-stevens.com	1995-04-17
9	Ms.	Damian	Thompson	yjones@cox.biz	2016-10-02
10	Miss	Geraldine	Harris	porteralice@francis-patel.com	1910-09-28
11	Ms.	Gemma	Jones	mandylewis@patel-thomas.net	1990-06-03
12	Dr.	Glenn	Carr	garnervalerie@farrell-parsons.biz	1998-04-19

How does it work?

pynonymizer replaces personally identifiable data in your database with realistic pseudorandom data, from the Faker library or from other functions. There is a wide variety of data types available to suit the column in question, for example:

unique_email
company
file_path
[...]

For a full list of data generation strategies, see the docs on strategyfiles
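
To make the replacement idea concrete, here is a minimal sketch in plain Python. It is not pynonymizer's implementation: it uses the standard-library `random` module as a stand-in for Faker, and the helper names (`fake_company`, `fake_unique_email`, `anonymize_rows`), the word lists, and the sample rows are all hypothetical.

```python
import random

# Hypothetical word lists standing in for Faker's much larger providers.
FIRST_PARTS = ["north", "blue", "green", "red", "silver"]
LAST_PARTS = ["field", "stone", "bridge", "works", "labs"]


def fake_company(rng):
    """Return a made-up company name."""
    return f"{rng.choice(FIRST_PARTS).title()}{rng.choice(LAST_PARTS)} Ltd"


def fake_unique_email(rng, seen):
    """Return a made-up email address that has not been issued before."""
    while True:
        local = f"user{rng.randrange(10**6):06d}"
        email = f"{local}@{rng.choice(FIRST_PARTS)}{rng.choice(LAST_PARTS)}.example"
        if email not in seen:
            seen.add(email)
            return email


def anonymize_rows(rows, seed=0):
    """Replace sensitive columns in each row; leave other columns untouched."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    seen_emails = set()
    result = []
    for row in rows:
        clean = dict(row)  # copy, so the input rows are not modified
        clean["email"] = fake_unique_email(rng, seen_emails)
        clean["company"] = fake_company(rng)
        result.append(clean)
    return result


rows = [
    {"id": 1, "email": "alice@real-customer.com", "company": "Real Customer Inc"},
    {"id": 2, "email": "bob@real-customer.com", "company": "Real Customer Inc"},
]
anonymized = anonymize_rows(rows)
```

The structure of each row (keys, ids, row count) survives, while the sensitive values are swapped for generated ones. Tracking issued addresses in a set is one simple way to get the uniqueness that a strategy such as `unique_email` implies.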

Process outline

Restore from dumpfile to temporary database.
Anonymize temporary database with strategy.
Dump resulting data to file.
Drop temporary database.

If this workflow doesn't work for you, see process control to see if it can be adjusted to suit your needs.
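
The outline above, together with the `--start-at`, `--stop-at` and `--skip-steps` options described under Usage, can be pictured as selecting a slice of an ordered step list. The following is a rough sketch under stated assumptions, not the tool's actual code: the step names (`restore`, `anonymize`, `dump`, `drop`) are hypothetical labels derived from the outline, and `select_steps` and `run` are illustrative helpers.

```python
# Hypothetical step names taken from the outline above; the real tool may
# use different identifiers.
STEPS = ["restore", "anonymize", "dump", "drop"]


def select_steps(start_at=None, stop_at=None, skip_steps=()):
    """Return the steps that would run, in order.

    start_at and stop_at are both inclusive; skip_steps removes individual
    steps from whatever range remains.
    """
    first = STEPS.index(start_at) if start_at else 0
    last = STEPS.index(stop_at) if stop_at else len(STEPS) - 1
    return [s for s in STEPS[first:last + 1] if s not in skip_steps]


def run(actions, **selection):
    """Call the action for each selected step and report what ran."""
    ran = []
    for step in select_steps(**selection):
        actions[step]()
        ran.append(step)
    return ran


# Stand-in actions that only record what they would do.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in STEPS}

# Anonymize a database that is already restored, and keep it afterwards.
ran = run(actions, start_at="anonymize", stop_at="dump")
```

With this selection only the anonymize and dump steps run, so the temporary database is neither restored again nor dropped.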

Requirements

Python >= 3.6

mysql

mysql/mysqldump must be in $PATH
Backup file in plain .sql/.sql.gz (schema and data)
Local or remote mysql >= 5.5

mssql

Requires extra dependencies: install package pynonymizer[mssql]
MSSQL >= 2008
Due to backup/restore limitations, you must be running pynonymizer on the same server as the database engine.
A backup in .bak format

postgres

psql/pg_dump must be in $PATH
Backup file in plain .sql/.sql.gz (schema and data)
Local or remote postgres server
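
Before a run, it can be handy to confirm that the client tools listed above are actually on `$PATH`. This is a small sketch using only the standard library; the `REQUIRED_TOOLS` mapping and the `missing_tools` helper are hypothetical and simply restate the requirements above (the mssql entry is empty because no command-line tools are listed for it).

```python
import shutil

# Client executables each database type needs on $PATH, per the lists above.
REQUIRED_TOOLS = {
    "mysql": ["mysql", "mysqldump"],
    "postgres": ["psql", "pg_dump"],
    "mssql": [],
}


def missing_tools(db_type, which=shutil.which):
    """Return the required executables that cannot be found on $PATH."""
    return [tool for tool in REQUIRED_TOOLS[db_type] if which(tool) is None]


# Example: report anything missing for a postgres run.
missing = missing_tools("postgres")
if missing:
    print("Not found on $PATH:", ", ".join(missing))
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is `shutil.which`, which searches `$PATH` the same way a shell would.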

Getting Started

Usage

Write a strategyfile for your database
See below:

usage: pynonymizer [-h] [--input INPUT] [--strategy STRATEGYFILE]
                   [--output OUTPUT] [--db-type DB_TYPE] [--db-host DB_HOST]
                   [--db-name DB_NAME] [--db-user DB_USER]
                   [--db-password DB_PASSWORD] [--fake-locale FAKE_LOCALE]
                   [--start-at STEP] [--skip-steps STEP [STEP ...]]
                   [--stop-at STEP] [--seed-rows SEED_ROWS]
                   [--mssql-backup-compression] [--mysql-dump-opts MYSQL_DUMP_OPTS] [-v]

A tool for writing better anonymization strategies for your production
databases.

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        The source dumpfile to read from. [$PYNONYMIZER_INPUT]
  --strategy STRATEGYFILE, -s STRATEGYFILE
                        A strategyfile to use during anonymization.
                        [$PYNONYMIZER_STRATEGY]
  --output OUTPUT, -o OUTPUT
                        The destination to write the dumped output to.
                        [$PYNONYMIZER_OUTPUT]
  --db-type DB_TYPE, -t DB_TYPE
                        Type of database to interact with. More databases will
                        be supported in future versions. default: mysql
                        [$PYNONYMIZER_DB_TYPE]
  --db-host DB_HOST, -d DB_HOST
                        Database hostname or IP address.
                        [$PYNONYMIZER_DB_HOST]
  --db-port DB_PORT, -P DB_PORT
                        Database port. Defaults to provider default.
                        [$PYNONYMIZER_DB_PORT]
  --db-name DB_NAME, -n DB_NAME
                        Name of database to restore and anonymize in. If not
                        provided, a unique name will be generated from the
                        strategy name. This will be dropped at the end of the
                        run. [$PYNONYMIZER_DB_NAME]
  --db-user DB_USER, -u DB_USER
                        Database credentials: username. [$PYNONYMIZER_DB_USER]
  --db-password DB_PASSWORD, -p DB_PASSWORD
                        Database credentials: password. Recommended: use
                        environment variables to avoid exposing secrets in
                        production environments. [$PYNONYMIZER_DB_PASSWORD]
  --fake-locale FAKE_LOCALE, -l FAKE_LOCALE
                        Locale setting to initialize fake data generation.
                        Affects Names, addresses, formats, etc.
                        [$PYNONYMIZER_FAKE_LOCALE]
  --start-at STEP       Choose a step to begin the process (inclusive).
                        [$PYNONYMIZER_START_AT]
  --skip-steps STEP [STEP ...]
                        Choose one or more steps to skip.
                        [$PYNONYMIZER_SKIP_STEPS]
  --stop-at STEP        Choose a step to stop at (inclusive).
                        [$PYNONYMIZER_STOP_AT]
  --seed-rows SEED_ROWS
                        Specify a number of rows to populate the fake data
                        table used during anonymization.
                        [$PYNONYMIZER_SEED_ROWS]
  --mssql-backup-compression
                        [MSSQL] Use compression when backing up the database.
                        [$PYNONYMIZER_MSSQL_BACKUP_COMPRESSION]
  --mysql-dump-opts MYSQL_DUMP_OPTS
                        [MYSQL] pass additional arguments to the mysqldump process (advanced use only!).
                        [$PYNONYMIZER_MYSQL_DUMP_OPTS]
  -v, --version         show program's version number and exit
  --verbose             Increases the verbosity of the logging feature, to
                        help when troubleshooting issues.
                        [$PYNONYMIZER_VERBOSE]
  --dry-run             Instruct pynonymizer to skip all process steps. Useful
                        for testing safely. [$PYNONYMIZER_DRY_RUN]
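
As the help text recommends, the password is better supplied through an environment variable than on the command line. The sketch below builds such an invocation from Python without running it. The flags and the `PYNONYMIZER_DB_PASSWORD` variable come from the help output above; the `build_invocation` helper, the file names, and the credentials are hypothetical.

```python
import os


def build_invocation(input_path, strategy_path, output_path,
                     db_type, db_user, db_password):
    """Return (argv, env) suitable for subprocess.run(argv, env=env)."""
    argv = [
        "pynonymizer",
        "--input", input_path,
        "--strategy", strategy_path,
        "--output", output_path,
        "--db-type", db_type,
        "--db-user", db_user,
    ]
    # The password travels in the environment, not on the command line,
    # so it does not show up in process listings or shell history.
    env = dict(os.environ, PYNONYMIZER_DB_PASSWORD=db_password)
    return argv, env


# Hypothetical file names and credentials, for illustration only.
argv, env = build_invocation(
    "dump.sql.gz", "strategy.yml", "anonymized.sql.gz",
    db_type="mysql", db_user="root", db_password="s3cret",
)
# To actually run it: subprocess.run(argv, env=env, check=True)
```

Passing `argv` as a list avoids shell quoting problems with unusual file names.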

License

MIT

Download files

Source Distribution

pynonymizer-1.10.0.tar.gz (42.2 kB)

Uploaded Jul 22, 2020 Source

Built Distribution

pynonymizer-1.10.0-py3-none-any.whl (61.5 kB)

Uploaded Jul 22, 2020 Python 3

File details

Details for the file pynonymizer-1.10.0.tar.gz.

File metadata

Download URL: pynonymizer-1.10.0.tar.gz
Upload date: Jul 22, 2020
Size: 42.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.0

File hashes

Hashes for pynonymizer-1.10.0.tar.gz
Algorithm	Hash digest
SHA256	`87bd7f1d84594b994d7cce5661dfa1b636badd82f9d728e49a0ee2aaf7852b33`
MD5	`239ad3bcf44c879bde9472d8861530d3`
BLAKE2b-256	`60be241fb2fc56c7b2076e68da2b4f51e1681af0983498a5f850c6b1da2233f1`

File details

Details for the file pynonymizer-1.10.0-py3-none-any.whl.

File metadata

Download URL: pynonymizer-1.10.0-py3-none-any.whl
Upload date: Jul 22, 2020
Size: 61.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.0

File hashes

Hashes for pynonymizer-1.10.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6cfd55b2b2de5f2247557b2abdb56df4d391658bb249642852b8f74b1dc0eba`
MD5	`b3be038b5216ec4d04b80ab4ad221c57`
BLAKE2b-256	`4edc05be5afbe113708dfc81a8f2fad6b2f6af386a1c328c63f53232350f2bd6`
