A small set of graph functions to be used from PySpark, built on top of networkx and graphframes

splink_graph

splink_graph is a small graph utility library meant to be used in the Apache Spark environment. It works with graph data structures such as those created from the output of splink's data linking processes (candidate pair results).

The main aim of splink_graph is to offer a small set of functions that work on top of established graph packages such as graphframes and networkx, and that help with graph analysis of the output of probabilistic data linkage tools.

Calculations are performed per cluster and in parallel, with the underlying help of pyArrow.

How to install

For dependencies and other important information needed to run these functions without issues, please consult INSTALL.md in this repo.

Functionality offered

For a primer on the terminology used, please see the TERMINOLOGY.md file in this repo.

Cluster metrics

Cluster metrics usually take as input a Spark edge list dataframe that also includes the component_id (cluster_id) of the cluster each edge belongs to. The output is one row of one or more metrics per cluster.
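
To illustrate the input and output shape only, here is a plain-Python sketch. It is not splink_graph's API: the column names src, dst and cluster_id, the sample edges and the cluster_summary helper are all assumptions made for illustration, and the library itself works on a Spark dataframe rather than a Python list.

```python
from collections import defaultdict

# Hypothetical edge list rows: (src, dst, cluster_id).
# The column names and the data are assumptions, not taken from splink_graph.
edges = [
    ("a", "b", 1),
    ("b", "c", 1),
    ("a", "c", 1),
    ("x", "y", 2),
    ("y", "z", 2),
]


def cluster_summary(edge_rows):
    """Return one summary row per cluster: node count, edge count, density."""
    by_cluster = defaultdict(list)
    for src, dst, cluster_id in edge_rows:
        by_cluster[cluster_id].append((src, dst))

    rows = []
    for cluster_id, cluster_edges in sorted(by_cluster.items()):
        nodes = {node for edge in cluster_edges for node in edge}
        n, m = len(nodes), len(cluster_edges)
        # Density of a simple undirected graph: 2m / (n * (n - 1)).
        density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
        rows.append(
            {"cluster_id": cluster_id, "nodes": n, "edges": m, "density": density}
        )
    return rows


summary = cluster_summary(edges)
for row in summary:
    print(row)
```

Each cluster is processed independently of the others, which is what makes a per-cluster calculation straightforward to run in parallel.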

Cluster metrics currently offered:

diameter (the longest shortest path between any two nodes in a cluster)
transitivity (or Global Clustering Coefficient in the related literature)
cluster triangle clustering coefficient (or Local Clustering Coefficient in the related literature)
cluster square clustering coefficient (useful for bipartite networks)
cluster node connectivity
cluster edge connectivity
cluster efficiency
cluster modularity
cluster average edge betweenness
cluster Weisfeiler-Lehman graph hash (in order to quickly test for graph isomorphisms)
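
As a rough guide to what the first two metrics in the list measure, the sketch below computes diameter and transitivity for a single cluster in plain Python. It assumes an unweighted, undirected, connected cluster given as (src, dst) pairs. The function names are hypothetical and this is not how splink_graph computes them (the library builds on networkx and graphframes).

```python
from collections import defaultdict, deque
from itertools import combinations


def build_adjacency(edge_pairs):
    """Build an undirected adjacency map from (src, dst) pairs."""
    adj = defaultdict(set)
    for u, v in edge_pairs:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def diameter(adj):
    """Largest shortest-path length, in hops, within a connected cluster."""
    longest = 0
    for start in adj:
        # Breadth-first search gives hop distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        longest = max(longest, max(dist.values()))
    return longest


def transitivity(adj):
    """Fraction of connected triples that are closed into triangles."""
    closed = 0   # triples whose two end nodes are also linked
    triples = 0  # all pairs of neighbours around a centre node
    for neighbours in adj.values():
        for u, v in combinations(neighbours, 2):
            triples += 1
            if v in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical clusters: a four-node chain, and a triangle with one extra node.
chain = build_adjacency([("a", "b"), ("b", "c"), ("c", "d")])
triangle_with_tail = build_adjacency(
    [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
)

print(diameter(chain), transitivity(chain))
print(diameter(triangle_with_tail), transitivity(triangle_with_tail))
```

A chain has a large diameter and zero transitivity, while a tightly linked cluster has a small diameter and transitivity close to 1.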

Cluster metrics are really helpful for finding the needle (for example, clusters with possible linking errors) in the haystack (the whole set of clusters produced by the data linking process).
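
One way to picture this is a simple filter over per-cluster metric rows. The sketch below is illustrative only: the rows are made up, the flag_suspicious helper is hypothetical, and the thresholds are arbitrary assumptions that would need tuning for real data.

```python
# Hypothetical per-cluster metric rows; the values are invented for illustration.
cluster_metrics = [
    {"cluster_id": 1, "nodes": 3, "diameter": 1, "transitivity": 1.0},
    {"cluster_id": 2, "nodes": 9, "diameter": 6, "transitivity": 0.1},
    {"cluster_id": 3, "nodes": 4, "diameter": 2, "transitivity": 0.6},
]


def flag_suspicious(rows, max_diameter=4, min_transitivity=0.3):
    """Return ids of clusters that look long and loosely connected.

    The thresholds are assumptions, not recommendations from splink_graph.
    """
    return [
        row["cluster_id"]
        for row in rows
        if row["diameter"] > max_diameter or row["transitivity"] < min_transitivity
    ]


flagged = flag_suspicious(cluster_metrics)
print(flagged)
```

The idea is that a long, chain-like cluster may have been stitched together by a few weak links, so it is a reasonable candidate for manual review.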

Node metrics

Node metrics take as input a Spark edge list dataframe that also includes the component_id (cluster_id) of the cluster each edge belongs to. The output is one row of one or more metrics per node.

Node metrics currently offered:

Eigenvector centrality
Harmonic centrality
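
For intuition, harmonic centrality of a node is the sum of the reciprocals of its shortest-path distances to every other node. The plain-Python sketch below assumes an unweighted, undirected cluster and an unnormalised score. The function name and sample data are hypothetical, and this is not splink_graph's implementation.

```python
from collections import defaultdict, deque


def harmonic_centrality(edge_pairs):
    """Return {node: sum of 1 / hop distance to every other reachable node}."""
    adj = defaultdict(set)
    for u, v in edge_pairs:
        adj[u].add(v)
        adj[v].add(u)

    scores = {}
    for start in adj:
        # Breadth-first search gives hop distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        scores[start] = sum(1.0 / d for node, d in dist.items() if node != start)
    return scores


# Hypothetical star-shaped cluster: one hub linked to three leaves.
star = [("hub", "leaf1"), ("hub", "leaf2"), ("hub", "leaf3")]
scores = harmonic_centrality(star)
print(scores)
```

The hub is one hop from every leaf, so it scores highest; each leaf is one hop from the hub and two hops from the other leaves.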

Edge metrics

Edge metrics take as input a Spark edge list dataframe that also includes the component_id (cluster_id) of the cluster each edge belongs to. The output is one row of one or more metrics per edge.

Edge metrics currently offered:

Edge betweenness
Bridge edges
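
A bridge is an edge whose removal disconnects its two endpoints. The sketch below finds bridges by brute force: it removes each edge in turn and checks whether one endpoint can still reach the other. This is a deliberately simple check rather than a linear-time algorithm, it assumes a simple undirected graph with no duplicate edges, and the function names are hypothetical rather than part of splink_graph.

```python
from collections import defaultdict, deque


def reachable(adj, start, skipped_edge):
    """Return nodes reachable from `start` without crossing `skipped_edge`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if {node, neighbour} == skipped_edge:
                continue  # pretend this edge has been removed
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def bridge_edges(edge_pairs):
    """Return the edges whose removal separates their two endpoints."""
    adj = defaultdict(set)
    for u, v in edge_pairs:
        adj[u].add(v)
        adj[v].add(u)

    bridges = []
    for u, v in edge_pairs:
        if v not in reachable(adj, u, frozenset((u, v))):
            bridges.append((u, v))
    return bridges


# Hypothetical cluster: a triangle (a, b, c) with node d hanging off c.
cluster_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
bridges = bridge_edges(cluster_edges)
print(bridges)
```

In a linkage setting, a bridge between two otherwise dense groups of records is a natural place to look for a false link.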

Contributing

Feel free to contribute by:

Forking the repository to suggest a change, and/or
Opening an issue.

Release history

Version	Release date
0.8.2	Feb 20, 2023
0.5.0	Mar 14, 2022
0.4.21	Oct 19, 2021
0.4.20	Sep 7, 2021
0.4.19	Aug 31, 2021
0.4.18	Aug 13, 2021
0.4.17	Aug 11, 2021
0.4.16	Aug 10, 2021
0.4.15	Aug 4, 2021
0.4.14	Aug 4, 2021
0.4.13	Aug 2, 2021
0.4.12	Aug 1, 2021
0.4.11	Jul 23, 2021
0.4.10	Jul 22, 2021
0.4.9	Jul 21, 2021
0.4.8	Jul 20, 2021
0.4.7	Jul 19, 2021
0.4.6	Jul 18, 2021
0.4.4	Jul 17, 2021
0.4.3	Jul 16, 2021 (this version)
0.4.2	Jul 14, 2021
0.4.1	Jul 13, 2021
0.4.0	Jul 12, 2021
0.3.19	Jul 8, 2021
0.3.17	Jul 7, 2021
0.3.16	Jun 15, 2021
0.3.15	Jun 15, 2021
0.3.14	Jun 15, 2021
0.3.13	Jun 14, 2021
0.3.12	Jun 8, 2021
0.3.11	May 26, 2021
0.3.10	May 24, 2021
0.3.9	May 24, 2021
0.3.8	May 17, 2021
0.3.7	May 13, 2021
0.3.6	May 13, 2021
0.3.5	May 13, 2021
0.3.4	May 13, 2021
0.3.3	May 13, 2021
0.3.2	May 13, 2021
0.3.1	May 11, 2021
Download files

Source Distribution

splink_graph-0.4.3.tar.gz (9.2 kB)

Uploaded Jul 16, 2021 Source

Built Distribution

splink_graph-0.4.3-py3-none-any.whl (9.8 kB)

Uploaded Jul 16, 2021 Python 3

Hashes for splink_graph-0.4.3.tar.gz

Algorithm	Hash digest
SHA256	`f49f3d777e904f0ffee47792d98952a4f0f06881a4f0255daa80aaec1eef60de`
MD5	`822d8fc5f36e4729a86be0f7c81fd0dd`
BLAKE2b-256	`9d37b14bb77496a4687478a2cbc16920027ea0cc95f759ea01c659731f7ba4ae`

Hashes for splink_graph-0.4.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`31be79bcedc01ece6dcc244551f4a2adf1a37dd484cd11f3e04034d5971ccbc1`
MD5	`49bb69f0d1ba9351479d4696932a11c1`
BLAKE2b-256	`02e4b6776f897cfb9ef421ea0d2974b4e5ffcbe3968cb47abbae1d84d5a1e565`