A small set of graph functions to be used from PySpark on top of networkx and graphframes.

splink_graph

splink_graph is a small graph utility library for the Apache Spark environment. It works with graph data structures based on the graphframes package, such as those created from the outputs (candidate pair results) of splink data linking processes.

The main aim of splink_graph is to offer a small set of functions that work on top of established graph packages such as graphframes and networkx, and that can help with the process of data linkage.

Using Pandas UDFs in Python: prerequisites

This package uses Pandas UDFs for certain functionality. Pandas UDFs are built on top of Apache Arrow and bring the best of both worlds: UDFs that are defined entirely in Python, yet run with low overhead and high performance.

With Apache Arrow, data can be exchanged directly between the JVM and the Python driver/executors at near-zero (de)serialization cost. However, there are some things to be aware of if you want to use these functions.

Since Arrow 0.15.0, a change in the binary IPC format means that an environment variable must be set to stay compatible with Arrow versions <= 0.14.1. This is only necessary for PySpark users on versions 2.3.x and 2.4.x who have manually upgraded PyArrow to 0.15.0 or later. The following can be added to conf/spark-env.sh to use the legacy Arrow IPC format:

ARROW_PRE_0_15_IPC_FORMAT=1

Another way is to pass the following .config settings when building the Spark session:

.config("spark.sql.execution.arrow.pyspark.enabled", "true")
.config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")

This instructs PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that ships with Spark 2.3.x and 2.4.x. Not setting this environment variable will lead to an error similar to the one described in SPARK-29367 when running pandas_udfs or toPandas() with Arrow enabled.

In short: either PyArrow must be at version 0.14.1 or lower, or, if that is not possible, the above settings need to be active.
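The rule above can be written down as a small check. This is only a sketch: the helper names `parse_version`, `needs_legacy_arrow_ipc` and `legacy_arrow_conf` are hypothetical and not part of splink_graph, and the version strings are assumed to be supplied by the caller. The two settings returned are the ones quoted above.

```python
def parse_version(text):
    """Turn '0.15.1' into (0, 15, 1); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_legacy_arrow_ipc(spark_version, pyarrow_version):
    """True when Spark 2.3.x/2.4.x is combined with PyArrow 0.15 or later."""
    spark = parse_version(spark_version)
    arrow = parse_version(pyarrow_version)
    return spark[:2] in {(2, 3), (2, 4)} and arrow[:2] >= (0, 15)


def legacy_arrow_conf(spark_version, pyarrow_version):
    """Return the settings quoted above when they are needed, else an empty dict."""
    if needs_legacy_arrow_ipc(spark_version, pyarrow_version):
        return {
            "spark.sql.execution.arrow.pyspark.enabled": "true",
            "spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT": "1",
        }
    return {}


if __name__ == "__main__":
    # Hypothetical version pairs, chosen only to exercise both branches.
    print(legacy_arrow_conf("2.4.5", "0.15.1"))
    print(legacy_arrow_conf("2.4.5", "0.14.1"))
```

Each key/value pair returned could then be passed to `.config(key, value)` when building the Spark session.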

Terminology

Like any discipline, graph theory comes with its own nomenclature. The following descriptions are intentionally simplified; more mathematically rigorous definitions can be found in any graph theory textbook.

Graph

— A data structure G = (V, E) where V is a set of vertices/nodes and E is a set of edges.
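As an illustration of G = (V, E), the sketch below stores a tiny undirected graph in plain Python and derives an adjacency mapping from it. The vertex names and the helper `build_adjacency` are hypothetical, and the sketch assumes there are no self-loops.

```python
# V: a set of vertex identifiers (hypothetical record ids).
V = {"a", "b", "c", "d"}

# E: undirected edges, stored as frozensets so that {a, b} equals {b, a}.
E = {frozenset(("a", "b")), frozenset(("b", "c"))}


def build_adjacency(vertices, edges):
    """Map every vertex to the set of vertices it shares an edge with."""
    adjacency = {v: set() for v in vertices}
    for edge in edges:
        u, w = tuple(edge)  # assumes exactly two distinct endpoints per edge
        adjacency[u].add(w)
        adjacency[w].add(u)
    return adjacency


adjacency = build_adjacency(V, E)
```

Here vertex "d" belongs to V but appears in no edge, so its neighbour set is empty.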

Vertex/Node

— Represents a single entity such as a person or an object.

Edge

— Represents a relationship between two vertices (e.g., are these two vertices friends on a social network?).

Directed Graph vs. Undirected Graph

— Denotes whether the relationship represented by the edges is symmetric (undirected) or not (directed).

Weighted vs Unweighted Graph

— In weighted graphs, edges have a weight that could represent the cost of traversing the edge, a similarity score, or a distance score.

— In unweighted graphs, edges have no weight and simply show connections. Example: course prerequisites.
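The difference between the two can be sketched with plain tuples. The edge list, the scores and both helper functions below are hypothetical and only illustrate the idea of a weight attached to each edge, for example a similarity score between two records.

```python
# Weighted edges as (source, destination, weight) tuples.
weighted_edges = [
    ("a", "b", 0.95),
    ("b", "c", 0.40),
    ("c", "d", 0.88),
]


def drop_weights(edges):
    """Reduce a weighted edge list to an unweighted one (connections only)."""
    return [(src, dst) for src, dst, _ in edges]


def edges_at_or_above(edges, threshold):
    """Keep only the edges whose weight is at least the given threshold."""
    return [edge for edge in edges if edge[2] >= threshold]


unweighted_edges = drop_weights(weighted_edges)
strong_edges = edges_at_or_above(weighted_edges, 0.8)
```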

Subgraph

— A set of vertices and edges that are a subset of the full graph's vertices and edges.
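One common kind of subgraph is the induced subgraph: pick some vertices and keep every edge whose two endpoints were both picked. The sketch below shows that case only; the function name and the sample data are hypothetical.

```python
def induced_subgraph(vertices, edges, keep):
    """Return (sub_vertices, sub_edges) for the vertices listed in `keep`."""
    sub_vertices = set(vertices) & set(keep)
    sub_edges = [
        (u, w) for u, w in edges
        if u in sub_vertices and w in sub_vertices
    ]
    return sub_vertices, sub_edges


vertices = {"a", "b", "c", "d"}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
sub_vertices, sub_edges = induced_subgraph(vertices, edges, {"a", "b", "c"})
```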

Degree

— A vertex/node measurement quantifying the number of connected edges
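For an undirected graph, the degree of each vertex can be counted directly from the edge list. This is a minimal sketch with a hypothetical helper name; it assumes every edge endpoint is listed in `vertices`.

```python
def degrees(vertices, edges):
    """Count, for every vertex, how many edge endpoints touch it."""
    counts = {v: 0 for v in vertices}
    for u, w in edges:
        counts[u] += 1
        counts[w] += 1
    return counts


vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c")]
degree_by_vertex = degrees(vertices, edges)
```

A vertex that appears in no edge, such as "d" here, has degree 0.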

Connected Component

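— A connected subgraph, meaning that every vertex can reach every other vertex in the subgraph, that cannot be made larger by adding more vertices. (For directed graphs, a strongly connected component requires reachability in both directions.)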
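For an undirected graph, connected components can be found with a breadth-first search from each vertex that has not been visited yet. The sketch below is a plain-Python illustration, not the implementation used by splink_graph or graphframes; the function name is hypothetical and every edge endpoint is assumed to be listed in `vertices`.

```python
from collections import deque


def connected_components(vertices, edges):
    """Return a list of vertex sets, one per connected component."""
    adjacency = {v: set() for v in vertices}
    for u, w in edges:
        adjacency[u].add(w)
        adjacency[w].add(u)

    seen = set()
    components = []
    for start in sorted(adjacency):  # sorted only to make the order repeatable
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


components = connected_components(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
```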

Shortest Path

— The lowest number of edges required to traverse between two specific vertices/nodes.
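In an unweighted, undirected graph this number of edges can be found with a breadth-first search, which reaches vertices in order of increasing distance from the source. The function below is a hypothetical sketch under those assumptions; it returns None when no path exists.

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """Return the fewest edges between source and target, or None if unreachable."""
    adjacency = {}
    for u, w in edges:
        adjacency.setdefault(u, set()).add(w)
        adjacency.setdefault(w, set()).add(u)

    if source == target:
        return 0

    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                if neighbour == target:
                    return distance[neighbour]
                queue.append(neighbour)
    return None


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("x", "y")]
hops_a_to_c = shortest_path_length(edges, "a", "c")
```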

Contributing

Feel free to contribute by:

forking the repository to suggest a change, and/or
starting an issue.

Release history

Version	Release date
0.8.2	Feb 20, 2023
0.5.0	Mar 14, 2022
0.4.21	Oct 19, 2021
0.4.20	Sep 7, 2021
0.4.19	Aug 31, 2021
0.4.18	Aug 13, 2021
0.4.17	Aug 11, 2021
0.4.16	Aug 10, 2021
0.4.15	Aug 4, 2021
0.4.14	Aug 4, 2021
0.4.13	Aug 2, 2021
0.4.12	Aug 1, 2021
0.4.11	Jul 23, 2021
0.4.10	Jul 22, 2021
0.4.9	Jul 21, 2021
0.4.8	Jul 20, 2021
0.4.7	Jul 19, 2021
0.4.6	Jul 18, 2021
0.4.4	Jul 17, 2021
0.4.3	Jul 16, 2021
0.4.2	Jul 14, 2021
0.4.1	Jul 13, 2021
0.4.0	Jul 12, 2021
0.3.19 (this version)	Jul 8, 2021
0.3.17	Jul 7, 2021
0.3.16	Jun 15, 2021
0.3.15	Jun 15, 2021
0.3.14	Jun 15, 2021
0.3.13	Jun 14, 2021
0.3.12	Jun 8, 2021
0.3.11	May 26, 2021
0.3.10	May 24, 2021
0.3.9	May 24, 2021
0.3.8	May 17, 2021
0.3.7	May 13, 2021
0.3.6	May 13, 2021
0.3.5	May 13, 2021
0.3.4	May 13, 2021
0.3.3	May 13, 2021
0.3.2	May 13, 2021
0.3.1	May 11, 2021

Download files

Source Distribution

splink_graph-0.3.19.tar.gz (10.1 kB)

Uploaded Jul 8, 2021 Source

Built Distribution

splink_graph-0.3.19-py3-none-any.whl (8.9 kB)

Uploaded Jul 8, 2021 Python 3

Hashes for splink_graph-0.3.19.tar.gz
Algorithm	Hash digest
SHA256	`b62a1af6a042ab4c8334734644c83a92fee59cf2bd3a1e2805e02157d2cdac98`
MD5	`ed81fbe5d0d9b9b6742cbf5963bc14c0`
BLAKE2b-256	`9ee4dccb00942d2579f3693e20d20dec8b8859f516c63e261f718507900e91b4`

Hashes for splink_graph-0.3.19-py3-none-any.whl
Algorithm	Hash digest
SHA256	`99eba7bdf6e9a21ba0bffb23acfd70b7218dfeaf142e78986e13d99cc191367f`
MD5	`9c18667fb75eefdcc6bd49b4bc2a7fea`
BLAKE2b-256	`7a6461dedcecf15335b5a6ebc82d1101ad323f4c2362b978887b670683982e4b`