
This is a repository for investigating implementations of GNNs for unsupervised clustering.


UGLE: Unsupervised GNN Learning Environment


Introduction

ugle is a library built on PyTorch for comparing implementations of GNNs for unsupervised clustering. It includes a wide range of methods, with the original implementations referenced in the source code. We provide an experiment abstraction to compare different models and visualisation tools to plot results. Any method can be trained individually via the main script using a specified config and the trainer objects. Optuna is used to optimise hyperparameters, although models can also be trained with manually specified parameters.

Installation

To use this repository, you need to install pytorch_geometric, so adjust the variables below for the appropriate TORCH version and CUDA option. Please refer to the PyTorch Geometric installation guide if the following doesn't work for you, as pytorch_geometric can be tricky to install on an M-series Mac or on Windows. Please use a virtual environment manager and a version of python>=3.9.12. If you want a different version of PyTorch, raise a pull request that passes the version number through to setup.py.

CUDA=cpu  # one of: cpu, cu102, cu113, cu116
pip install --extra-index-url https://data.pyg.org/whl/torch-1.12.0+${CUDA}.html ugle
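
For example, a CPU-only install (assuming no CUDA toolkit is available) would be:

pip install --extra-index-url https://data.pyg.org/whl/torch-1.12.0+cpu.html ugle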

Quick Tour

Here is an example of how to use the Python API from this repo.

import ugle
import numpy as np
n_nodes = 1000
n_features = 200
n_clusters = 3

# demo of evaluating an in-memory dataset
dataset = {'features': np.random.rand(n_nodes, n_features),
           'adjacency': np.random.rand(n_nodes, n_nodes),
           'label': np.random.randint(0, n_clusters, size=n_nodes)}  # labels in [0, n_clusters)

# load the dmon default hyperparameters and evaluate on in memory dataset
cfg = ugle.utils.load_model_config(override_model="dmon_default")
Trainer = ugle.trainer.ugleTrainer("dmon", cfg)
results = Trainer.eval(dataset)

# evaluate dmon with hyperparameter optimisation (hpo)
Trainer = ugle.trainer.ugleTrainer("dmon")
Trainer.cfg.dataset = "cora"
Trainer.cfg.trainer.n_trials_hyperopt = 2 # this is how you change the config
Trainer.cfg.args.max_epoch = 250
results = Trainer.eval(dataset)

Changing the cfg property in the Trainer object will change the training and optimisation procedure as demonstrated above. Detailed below are all the options that can be changed.

ugle/configs/config.yaml

trainer:
  gpu: "0" # GPU index (-1 = CPU)
  show_config: False # prints the working config to cmd line
  
  results_path: './results/' # where to save results to
  data_path: './data/' # where to look for data
  models_path: './models/' # where to save models 

  # if load_existing_test = True then the trainer will look in {load_hps_path}{cfg.dataset}_{cfg.model}.pkl
  # for previous found hyperparameters
  load_existing_test: False
  load_hps_path: './found_hps/'
  
  save_hpo_study: False # whether to save the hpo investigation 
  save_model: False # whether to save the best model

  # hyperparameter options
  test_metrics: ['f1', 'nmi', 'modularity', 'conductance'] # metrics to evaluate test data 
  valid_metrics: ['f1', 'nmi', 'modularity', 'conductance'] # metrics used for hpo and model selection
  validate_every_nepochs: 5 # how many training epochs per validation 
  n_trials_hyperopt: 5 # how many hpo trials
  max_n_pruned: 20 # patience for repeated hpo trials

  training_to_testing_split: 0.2 # percentage of testing data compared to total of training+validation 
  train_to_valid_split: 0.2 # percentage of validation data as proportion of the whole dataset
  split_scheme: 'drop_edges' # one of drop_edges, split_edges, all_edges, no_edges (see ugle.datasets.split_adj() for more info)

  # logging options
  calc_time: True # whether to calculate the time taken for evaluation
  calc_memory: False # whether to calculate the memory used by evaluation 
  log_interval: 5 # how often to refresh the progress bar

args: 
  random_seed: 42 # the random seed to set
  max_epoch: 5000 # the number of epochs to train for 
  # model-specific hyperparameters are also loaded in here

If you want to do the same on the command line, you can pass some of these arguments when running python main.py.

--model=<NAME-OF-MODEL> 
--dataset=<NAME-OF-DATASET> 
--seed=<SEED-ID> 
--max_epoch=<MAX-EPOCHS-FOR-TRAINING> 
--gpu=<GPU-INDEX>
--load_existing_test # if you specify a model such as "dmon_custom" and have written custom args in the appropriate config file, use this argument to load them
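
For example, a single run of dmon on cora (the values here are only illustrative) could be launched with:

python main.py --model=dmon --dataset=cora --seed=42 --max_epoch=250 --gpu=0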

Running Experiments over Multiple Seeds/Datasets/Algorithms

If you want to run a study over multiple seeds/datasets/algorithms, use the model_evaluations.py script alongside the experiment config files. See the ugle/configs/testing/ directory for examples of different experiment configs. If the null option is set for both datasets and algorithms in your config file, you can run the config over a single dataset and algorithm combination by using the -da argument, which puts this combination into the dataset_algo_combinations option. This can also be used to define a list of combinations to evaluate.

python3 model_evaluations.py -ec=<PATH-TO-EXP-CONFIG> -da=<DATASET>_<MODEL>
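
For instance, to evaluate only the cora/dmon combination (the experiment config path below is hypothetical):

python3 model_evaluations.py -ec=ugle/configs/testing/my_experiment.yaml -da=cora_dmon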

Existing GNN Implementations

Existing Datasets

  • acm
  • amac
  • amap
  • bat
  • citeseer
  • cora
  • cocs
  • dblp
  • eat
  • uat
  • pubmed
  • texas
  • wisc
  • cornell
  • Physics
  • CS
  • Photo
  • Computers

How to add a new model: <MODEL-NAME>

To create a model with the name <MODEL-NAME>, you need to create at minimum two files:

  • create a file for the model: ugle/models/<MODEL-NAME>.py
  • create a file to hold the hyperparameters: ugle/configs/models/<MODEL-NAME>/<MODEL-NAME>.yaml
  • (optional) create a file to hold default hyperparameters: ugle/configs/models/<MODEL-NAME>/<MODEL-NAME>_default.yaml
  • (optional) create a new .yaml file to hold other variations: ugle/configs/models/<MODEL_NAME>/<MODEL_NAME>_myparameters.yaml
  • run an experiment or a single model run by referencing the name "<MODEL_NAME>_myparameters"
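
As a rough sketch, a hyperparameter file such as ugle/configs/models/<MODEL-NAME>/<MODEL-NAME>.yaml could mirror the args: block from config.yaml above; the option names below are placeholders for illustration, not real ugle hyperparameters:

args:
  learning_rate: 0.001  # placeholder model hyperparameter
  hidden_dim: 64        # placeholder model hyperparameter
  dropout: 0.5          # placeholder model hyperparameter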

Inside ugle/models/<MODEL-NAME>.py, define the four hooks that make up the whole training and testing process:

from ugle.trainer import ugleTrainer # this is the base class for training any model in this framework

class <MODEL-NAME>_trainer(ugleTrainer):

    def preprocess_data(self, features, adjacency):
        # here we process the data into a required format 
        # some models use pytorch_geometric layers, others use custom layers
        # this allows us to create new representation formats for features/adjacency matrix
        # additional data structures needed can be returned for access under the tuple: processed_data
        
        return (features, adj, ... )

    def training_preprocessing(self, args, processed_data):
        (features, adj, ...) = processed_data
        
        # here the model and optimisers are defined as follows
        
        self.model = ... Model()
        self.optimizers = [optimizer]
        
        return

    def training_epoch_iter(self, args, processed_data):
    
        # this is the training iteration hook, which defines the forward pass for the model
        # the only requirement is that a loss is returned.
        # If the processed data changes and is needed for the next iteration, 
        # then, return the tuple you need in place of processed data
        # If this is not needed then None must be returned instead
    
        return loss, (None/updated_processed_data)

    def test(self, processed_data):
    
        # the testing hook: defines how this model produces predictions
        # return a 1d numpy array with a cluster prediction for each node
    
        return preds
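
As an illustration only, here is a minimal sketch of a concrete trainer that follows the four hooks above: it trains a single linear layer to reconstruct the node features and clusters the learned embedding with k-means. The use of scikit-learn and the self.cfg.args.n_clusters attribute are assumptions made for this example, not part of the ugle API.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans
from ugle.trainer import ugleTrainer

class linear_kmeans_trainer(ugleTrainer):

    def preprocess_data(self, features, adjacency):
        # convert features to a float tensor; the adjacency is unused by this toy model
        features = torch.as_tensor(features, dtype=torch.float32)
        return (features, adjacency)

    def training_preprocessing(self, args, processed_data):
        features, _ = processed_data
        # a single linear layer trained to reconstruct the features (toy objective)
        self.model = torch.nn.Linear(features.shape[1], features.shape[1])
        self.optimizers = [torch.optim.Adam(self.model.parameters(), lr=1e-3)]
        return

    def training_epoch_iter(self, args, processed_data):
        features, _ = processed_data
        # simple reconstruction loss; processed_data is unchanged, so return None
        loss = F.mse_loss(self.model(features), features)
        return loss, None

    def test(self, processed_data):
        features, _ = processed_data
        with torch.no_grad():
            embedding = self.model(features).cpu().numpy()
        # cluster the embedding; n_clusters is a hypothetical config value here
        preds = KMeans(n_clusters=self.cfg.args.n_clusters).fit_predict(embedding)
        return preds.astype(np.int64)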

