A python package to manage metadata
Project description
mojap-metadata
This python package allows users to read and alter our metadata schemas (using the metadata module) as well as convert our metadata schemas to other schema definitions utilised by other tools (these are defined in the converters module and are defined as Converters).
Installation
Make sure you are using a new version of pip (>=20.0.0)
pip install git+https://github.com/moj-analytical-services/mojap-metadata
To install additional dependencies that will be used by the converters (e.g. etl-manager
and arrow
extras)
pip install 'mojap-metadata[etl-manager,arrow] @ git+https://github.com/moj-analytical-services/mojap-metadata'
Metadata
This module creates a class called Metadata
which allows you to interact with our agnostic metadata schemas. The Metadata
class deals with parsing, manipulating and validating metadata json schemas.
The Schema
Our metadata schemas are used to define a table. The idea of these schemas are to define the contexts of a table with generic metadata schemas. If you want to use this schema to interact with Oracle, PyArrow or AWS Glue for example, then you can create a Converter class to take the metadata and converter it to a schema that works with that tool (or vice versa).
When adding a parameter to the metadata config first thing is to look if it exists in json-schema. For example enum
, pattern
and type
are parameters in our column types but come from json schema naming definitions.
An example of a basic metadata schema:
{
"$schema" : "$schema": "https://moj-analytical-services.github.io/metadata_schema/mojap_metadata/v1.0.0.json",
"name": "employees",
"description": "table containing employee information",
"file_format": "parquet",
"columns": [
{
"name": "employee_id",
"type": "int64",
"type_desc": "integer",
"description": "an ID for each employee",
"minimum": 1000,
"maximum": 9999
},
{
"name": "employee_name",
"type": "string",
"type_string": "string",
"description": "name of the employee"
},
{
"name": "employee_dob",
"type": "date64",
"type_desc": "date",
"description": "date of birth for the employee in ISO format",
"pattern": "^\\d{4}-([0]\\d|1[0-2])-([0-2]\\d|3[01])$"
}
]
}
Schema Properties
-
name: String that can be whatever you want to name the table. Best to avoid spaces as most systems do not like that but it will let you do this.
-
file_format: String denoting the file format.
-
columns: List of objects where each object descibes a column in your table. Each column object must have at least a
name
and a (type
ortype_description
).- name: String denoting the name of the column.
- type: String specifing the type the data is in. We use data types from the Apache Arrow project. We use their type names as it seems to comprehensively cover most of the data types we deal with.
- type_category: These group different sets of
type
properties into a single superset. These are:integer
,float
,string
,timestamp
,bool
,list
,struct
. For example we classint8, int16, int32, int64, uint8, uint16, uint32, uint64
asinteger
. It allows users to give more generic types if your data is not coming from a system or output with strict types (i.e. data exported from Excel or an unknown origin). The Metadata class has default type values for each giventype_category
. See thedefault_type_category_lookup
attribute of theMetadata
class to see said defaults. This field is required iftype
is not set. - description: Description of the column.
- enum: List of what values that column can take. (Same as the standardised json schema keyword).
- pattern: Regex pattern that value has to to match (for string type_categories only). (Same as the standardised json schema keyword).
- minLength / maxLength: The minimum and maximum length of the string (for string type_categories only). (Same as the standardised json schema keyword).
- minimum / maximum: The minumum and maximum value a numerical type can take (for integer and float type_categories only).
-
partitions: List of what columns in your dataset are partitions.
Additional Schema Parameters
We allow users to add addition parameters to the table schema object or any of the columns in the schema. If there are specific parameters / tags you want to add to your schema it should still pass validation (as long as the additional parameters are not the same name of ones already used in the schema).
Usage
from mojap_metadata import Metadata
# Generate basic Metadata Table from dict
meta1 = Metadata(name="test", columns=[{"name": "c1", "type": "int64"}, {"name": "c2", "type": "string"}])
print(meta1.name) # test
print(meta1.columns[0]) # {"name": "c1", "type": "int64"}
print(meta1.description) # ""
# Generate meta from dict
d = {
"name": "test",
"columns": [
{"name": "c1", "type": "int64"},
{"name": "c2", "type": "string"}
]
}
meta2 = Metadata.from_dict(d)
# Read / write to json
meta3 = Metadata.read_json("path/to/metadata_schema.json")
meta3.name = "new_table"
meta3.to_json("path/to/new_metadata_schema.json")
Converters
Converters takes a Metadata object and generates something else from it (or can convert something to a Metadata object). Most of the time your converter will convert our schema into another systems schema.
Usage
For example the ArrowConverter
takes our schemas and converts them to a pyarrow schema:
from mojap_metadata import Metadata
from mojap_metadata.converters.arrow_converter import ArrowConverter
d = {
"name": "test",
"columns": [
{"name": "c1", "type": "int64"},
{"name": "c2", "type": "string"},
{"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
],
"file_format": "jsonl"
}
meta = Metadata.from_dict(d)
ac = ArrowConverter()
arrow_schema = ac.generate_from_meta(meta)
print(arrow_schema) # Could use this schema to read in data as arrow dataframe and cast it to the correct types
Another example is the GlueConverter
which takes our schemas and converts them to a dictionary that be passed to the glue_client to deploy a schema on AWS Glue.
import boto3
from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueConverter
d = {
"name": "test",
"columns": [
{"name": "c1", "type": "int64"},
{"name": "c2", "type": "string"},
{"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
],
"file_format": "jsonl"
}
meta = Metadata.from_dict(d)
gc = ArrowConverter()
boto_dict = gc.generate_from_meta(meta, )
boto_dict = gc.generate_from_meta(meta, database_name="test_db", table_location="s3://bucket/test_db/test/")
print(boto_dict)
glue_client = boto3.client("glue")
glue_client.create_table(**boto_dict) # Would deploy glue schema based on our metadata
All converter classes are sub classes of the mojap_metadata.converters.BaseConverter
. This BaseConverter
has no actual functionality but is a boilerplate class that ensures standardised attributes for all added Converters
these are:
-
generate_from_meta: (function) takes a Metadata object and returns whatever the converter is producing .
-
generate_to_meta: (function) takes Any object (normally another schema for another system or package) and returns our Metadata object. (i.e. the reverse of generate_from_meta).
-
options: (Data Class) that are the options for the converter. The base options have a
suppress_warnings
parameter but it doesn't mean call converters use this. To get a better understanding of setting options see theGlueConverter
class or thetests/test_glue_converter.py
to see how they are set.
## Further Usage
See the mojap-aws-tools repo which utilises the converters a lot in different tutorials.
Contributing and Design Considerations
Each new converter (if not expanding on existing converters) should be added as a new submodule within the parent converters
module. This is especially true if the new converter has additional package dependencies. By design the standard install of this package is fairly lightweight. However if you needed the ArrowConverter
you would need to install the additional package dependencies for the arrow converter:
pip install 'mojap-metadata[arrow] @ git+https://github.com/moj-analytical-services/mojap-metadata'
This means we can continuely add converters (as submodules) and add optional package dependencies (see pyproject.toml ) without making the default install any less lightweight. mojap_metadata
would only error if someone tries to import a converter subclass that with having the additional dependencies dependencies installed).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for mojap_metadata-1.1.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a6a96afa882402e4773330a4ecfe4ffc2a32877b7fcad6e617fef0dd0400d054 |
|
MD5 | 051eb08046670e159bee6e51403e3038 |
|
BLAKE2b-256 | 719f405f897bdae7026a14ddcca3be31251a4b45267aac5b91872f2c9e5aef93 |