
A simple light-weight extendable command line tool for managing jobs on DIAG's SOL cluster.


Solitude


A simple light-weight command line tool for managing jobs on the SOL cluster

Features

  • Querying status of a specified list of slurm jobs and presenting them in a nice list overview
  • Tools to manage the specified jobs (starting/stopping/extending)
  • Cross platform due to using ssh (paramiko) for querying and issuing commands
  • Extendable and customizable through pluggy plugins

Setup and configuration

  1. Install through pip using: $ pip install solitude
  2. Configure the tool through: $ solitude config create and fill out the prompts.
  3. The previous step should have generated a configuration file in the proper location (the installation directory or the user's home directory). It should contain a target cluster machine and the login credentials, which are used to query and issue commands. Its contents and location can be queried using solitude config status, and it should look something like:
{
    "defaults": {
        "user": "username",
        "workers": 8
    }, 
    "ssh":{
        "server" : "dlc-machine.umcn.nl",
        "username" : "user",
        "password" : "*******"
    },
    "plugins":[    
    ]
}
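As a rough illustration of what the configuration needs to contain, the following sketch checks a JSON config string for the three ssh entries shown above. The helper missing_ssh_keys and the constant REQUIRED_SSH_KEYS are hypothetical names made up for this example; solitude's own Config class may validate the file differently.

```python
import json
from typing import List

# Hypothetical helper, not part of solitude: the keys are taken from the
# example configuration above.
REQUIRED_SSH_KEYS = ("server", "username", "password")


def missing_ssh_keys(config_text: str) -> List[str]:
    """Return the ssh keys that are absent or empty in a JSON config string."""
    config = json.loads(config_text)
    ssh = config.get("ssh", {})
    return [key for key in REQUIRED_SSH_KEYS if not ssh.get(key)]


example = """
{
    "defaults": {"user": "username", "workers": 8},
    "ssh": {"server": "dlc-machine.umcn.nl", "username": "user", "password": ""},
    "plugins": []
}
"""

# Lists every ssh entry that still has to be filled in.
print(missing_ssh_keys(example))
```

An empty list means all three ssh entries are present and non-empty.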

The tool is now ready for use. See the examples below.

Example usage

Create a file for your deep learning project with a list of jobs (here we call this commands.sol) using the following format:

# Test jobs 
# (commented lines and empty lines will be ignored)

./c-submit --require-mem=1g --require-cpus=1 --gpu-count=0 {user} test 1 hello-world
./c-submit --require-mem=1g --require-cpus=1 --gpu-count=0 {user} test 1 ubuntu /usr/bin/sleep 500
./c-submit --require-mem=1g --require-cpus=1 --gpu-count=0 {user} test 1 ubuntu /usr/bin/echo "CUDA_ERROR"
./c-submit --require-mem=1g --require-cpus=1 --gpu-count=0 {user} test 1 {sol-docker-repo}/sil/hello-world

This format supports the special {user} macro, which is substituted with the default user name. It also supports the {sol-docker-repo} macro, which is substituted with the actual SOL docker repository URL (i.e. doduo1.umcn.nl). Should this URL ever change, using the macro keeps your command files portable.
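The file format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not solitude's actual parser: the function read_commands and the MACROS mapping are hypothetical names, and the macro values are taken from the examples in this README. Plain string replacement is used so that any other braces in a command are left untouched.

```python
from typing import Dict, List


def read_commands(text: str, macros: Dict[str, str]) -> List[str]:
    """Return the commands in a sol file, skipping comments and empty lines.

    Hypothetical sketch: every "{name}" macro found in a command is replaced
    with the matching value from `macros`.
    """
    commands = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # commented and empty lines are ignored
        for name, value in macros.items():
            stripped = stripped.replace("{" + name + "}", value)
        commands.append(stripped)
    return commands


SOL_TEXT = """\
# Test jobs

./c-submit --gpu-count=0 {user} test 1 hello-world
./c-submit --gpu-count=0 {user} test 1 {sol-docker-repo}/sil/hello-world
"""

# Assumed values: the default user from the config and the repository URL
# mentioned above.
MACROS = {"user": "username", "sol-docker-repo": "doduo1.umcn.nl"}

for cmd in read_commands(SOL_TEXT, MACROS):
    print(cmd)
```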

After creating the commands file, we can run a one-time setup command (supported since v0.3.8) to set the currently active sol file:

$ solitude config set cmdfile /path/to/commands.sol

Now all our subsequent solitude job commands will use the configured commands file.

We can list the commands:

$ solitude job list

Running specific jobs (e.g. 1-3 with high priority) can be achieved with:

$ solitude job run -i 1-3 --priority=high
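To make the index notation concrete, here is a small sketch of how a value such as 1-3 could be expanded into individual job indices. The function expand_job_ids is a hypothetical name used for illustration; how solitude itself parses the -i option is not shown in this README and may differ.

```python
from typing import List


def expand_job_ids(spec: str) -> List[int]:
    """Expand "1-3" into [1, 2, 3]; a single index such as "1" gives [1].

    Hypothetical sketch: only the two forms used in this README are handled.
    """
    spec = spec.strip()
    if "-" in spec:
        start, end = spec.split("-", 1)
        return list(range(int(start), int(end) + 1))  # inclusive range
    return [int(spec)]


print(expand_job_ids("1-3"))
print(expand_job_ids("1"))
```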

To stop or extend running jobs, use the solitude job stop and solitude job extend commands, respectively.

To get the full active log file for job 1 you can run:

$ solitude job getlog -i1

To get only the last few lines of the log file for a running job and follow it in real time, run:

$ solitude job getlog -i1 -p

For more help on any command add the --help option, e.g.:

$ solitude job run --help
$ solitude job --help
$ solitude --help

Plugins

The supported commands can be tweaked and extended by writing custom pluggy plugins. Plugins can change how commands are treated, which information is retrieved, and so on. The pluggy documentation explains in detail how to create and package your own plugins: https://pluggy.readthedocs.io/en/latest/

Here is a brief extract on how to do this for solitude.

First make a separate project folder and create the following files:

solitude-exampleplugin/solitude_exampleplugin.py

import solitude
from typing import Dict, List


@solitude.hookimpl
def matches_command(cmd: str) -> bool:
    """Should this command be processed by this plugin?

    :param cmd: the command to test
    :return: True if command matches False otherwise
    """    
    return "custom_command" in cmd


@solitude.hookimpl
def get_command_hash(cmd: str) -> str:
    """Computes the hash for the command
    This is used to uniquely link job status to commands.
    So if the exact same command is found they both link to the same job.
    Therefore it is recommended to remove parts from cmd that do not change
    the final results for the job.
    If you are uncertain what to do just return `cmd` as hash

    :param cmd: the command to compute the hash for
    :return: the command hash
    """
    return cmd


@solitude.hookimpl
def retrieve_state(cmd: str) -> Dict:
    """Retrieve state for the job which can be set in a dictionary

    :param cmd: the command to test
    :return: a dictionary with the retrieved state (used in other calls)
    """
    return {}


@solitude.hookimpl
def is_command_job_done(cmd: str, state: Dict) -> bool:
    """Checks if the command has finished

    :param cmd: the command to test
    :param state: the retrieved state dictionary for this job
    :return: True if job is done False otherwise
    """
    return False


@solitude.hookimpl
def get_command_status_str(cmd: str, state: Dict) -> str:
    """Build the status string that is displayed for the job

    :param cmd: the command to test
    :param state: the retrieved state dictionary for this job
    :return: a string containing job information and progress status
    """
    return cmd


@solitude.hookimpl
def get_errors_from_log(log: str) -> List[str]:
    """Checks the log for errors

    :param log: the log string to parse
    :return: A list of error messages, empty list if no errors were found
    """
    errors = []
    return errors

solitude-exampleplugin/setup.py

from setuptools import setup

setup(
    name="solitude-exampleplugin",
    install_requires="solitude",
    entry_points={"solitude": ["exampleplugin = solitude_exampleplugin"]},
    py_modules=["solitude_exampleplugin"],
)

Now let's install the plugin and test it:

$ pip install --editable solitude-exampleplugin
$ solitude job list -f your_test_commands.sol 
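The example plugin above only returns placeholder values. As a sketch of what two of the hook bodies might look like, the plain functions below normalize a command before hashing and scan a log for error lines. The names command_hash_without_resources and errors_from_log, the choice of flags to strip, and the CUDA_ERROR marker are all assumptions for illustration. To use them in a plugin they would have to be named get_command_hash and get_errors_from_log and decorated with @solitude.hookimpl, as in the file above.

```python
import re
from typing import List

# Assumption: these resource flags do not change a job's final results, so
# they are left out of the hash.
RESOURCE_FLAG = re.compile(r"--(?:require-mem|require-cpus|gpu-count)=\S+\s*")


def command_hash_without_resources(cmd: str) -> str:
    """Return cmd with the resource flags removed and whitespace normalized."""
    return " ".join(RESOURCE_FLAG.sub("", cmd).split())


def errors_from_log(log: str) -> List[str]:
    """Return every log line that contains the assumed CUDA_ERROR marker."""
    return [line.strip() for line in log.splitlines() if "CUDA_ERROR" in line]


small = "./c-submit --require-mem=1g --require-cpus=1 --gpu-count=0 user test 1 hello-world"
large = "./c-submit --require-mem=8g --require-cpus=4 --gpu-count=0 user test 1 hello-world"

# Both variants map to the same hash, so they would link to the same job.
print(command_hash_without_resources(small) == command_hash_without_resources(large))
print(errors_from_log("step 1 ok\nCUDA_ERROR: out of memory\ndone"))
```

Stripping too much from the hash would make genuinely different jobs collide, so returning cmd unchanged remains the safe default suggested in the docstring.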

Contributing

Fork the solitude repository.

Set up your forked repository locally as an editable installation:

$ cd ~
$ git clone https://github.com/yourproject/solitude
$ pip install --editable solitude

Now you can work locally and create your own pull requests.

Maintainer

Sil van de Leemput

History

0.3.9 (2022-05-03)
  • (hot fix) Long usernames caused issues with some commands, but this is fixed #82
  • (code enhancement) Black version has been increased to 22.3.0 #80
  • (code enhancement) More enumerations and code cleanup #76
0.3.8 (2021-09-14)
  • (new feature) New CLI option to set current command file(s) config set cmdfile #75
  • (new feature) Added node override option to the job run command #73
0.3.7 (2021-08-09)
  • (bug fix) Job status query has been fixed to work properly now #71
0.3.6 (2021-06-24)
  • (new feature) Lines in sol files mapping to the same hash will now be grouped in a single command in the job list for clarity #67
  • (code enhancement) A context manager has been added for action time logging #68
  • (code enhancement) The assignment of default values has been made explicit in the Config class #69
0.3.5 (2021-06-22)
  • (optimization) Querying job status is much faster by using a single SSH query #63
  • (stability) Made a retry mechanism for SSH channel errors for improved stability #65
0.3.4 (2021-03-04)
  • (new feature) Added -poll option to job getlog for active polling of logs over ssh #61
0.3.3 (2021-02-18)
  • (bugfix / enhancement) Cleaned job list output with --input-dir and --output-dir, so it becomes more readable #59
  • (enhancement) Code cleanup of the Config class, added a DEFAULT_CONFIG module level variable #57
0.3.2 (2021-02-16)
  • (enhancement) Added job getlog action to get log associated with a slurmjob/command #55
  • (bugfix / enhancement) csubmit plugin now includes --input-dir and --output-dir to hash generation by default #51
  • (bugfix / enhancement) Fixed --duration option since it was broken #54
0.3.1 (2021-02-11)
  • (enhancement) Cleans job list output by removing log url column #52
  • (enhancement) Cleans redundant print statement which showed on job list output #53
0.3.0 (2021-02-11)
  • (HOTFIX) Since oni dashboard will no longer be supported, from this version onward all job information will be queried directly from the cluster nodes using the SSHClient. #46
0.2.1 (2021-01-16)
  • (enhancement) Cache messages are now squelched by default but are enabled in verbose mode #44
  • (bugfix / enhancement) If job run returns unexpected output, a more meaningful error message is displayed #42
0.2.0 (2020-11-23)
  • NOTE: Upgrading from 0.1.10 and below will break (one time) the existing cached joblinks. These can be manually relinked to commands after upgrading using job link.
  • (enhancement) {sol-docker-repo} macro has been added to support better lookup in sol files #21
  • (enhancement) Default verbosity of output has been reduced for job list and can be re-enabled with -v #35
  • (enhancement) JSON Cache and Config files now have better readability #37
  • (enhancement) Added auto-cleanup of slurm jobs in Cache using an expiration date #39
  • (enhancement) Changed internal os.path references with pathlib.Path objects and added click.Path tests #31
  • (enhancement) JobPriority Enum has been added #23
0.1.10 (2020-10-20)
  • (enhancement) Basic github actions CI tests have been added #16
  • (enhancement) New style NamedTuples classes are now used for the Config class #15
0.1.9 (2020-10-14)
  • (enhancement) Added type hints and MyPy pre-commit hook #12
  • (enhancement) Added better defaults for config create #10
0.1.8 (2020-10-01)
  • (bugfix / enhancement) Fixed a bug with c-submit and interactive plugin pattern matching #7
0.1.7 (2020-09-24)
  • (enhancement) Added better logic and error messages for SSH connectivity issues #6
  • (new feature) Added SSH connectivity testing on config create and config test #6
  • (refactor) Centralized all ssh related commands in a SSHClient class #6
  • (enhancement) Writing to cache/config files has been made atomic to avoid potential data corruption #3
0.1.6 (2020-08-31)
  • (new feature) Added job link and unlink commands
  • (refactor) Extracted SlurmJob from CommandBase
  • Added high-level pytests for CLI calls
0.1.5 (2020-07-15)
  • Changed to using the *.sol extension for the command lists in the README
  • CLI fixed --ignore-errors typo on solitude job run
  • Added black code style and pre-commit hooks
0.1.4 (2020-07-10)
  • HOTFIX cache wasn't properly created from scratch if folder didn't exist
  • CLI Warnings have been added if run/extend/stop commands are issued without jobids
0.1.3 (2020-07-09)
  • Added support for default command files option
  • Renamed plugin interface get_command_hash
  • Added job group to CLI interface
  • Added support for defaults in config create
  • Improved docs and added history section
