miniwdl AWS backend (Batch+EFS)

miniwdl AWS plugin

Extends miniwdl to run workflows on AWS Batch and EFS

This plugin enables miniwdl to submit AWS Batch jobs that execute WDL tasks. It uses EFS for work-in-progress file I/O, with optional S3 rails for workflow-level I/O.

Use within Amazon SageMaker Studio

Start with the companion miniwdl-aws-studio recipe to install miniwdl for interactive use within Amazon SageMaker Studio, a web IDE with a terminal and filesystem browser. You can use the terminal to operate miniwdl run against AWS Batch, the filesystem browser to manage the inputs and outputs on EFS, and the Jupyter notebooks to further analyze the outputs.

That's the best way to try miniwdl-aws and get familiar with how it works with Batch and EFS. Read on for non-interactive deployment, which is a bit more complicated.

Unattended operations

For non-interactive use, a command-line wrapper miniwdl-aws-submit launches miniwdl in its own small Batch job to orchestrate the workflow. This workflow job then spawns task jobs as needed, without needing the submitting computer (e.g. your laptop) to remain connected for the duration. Separate Batch compute environments handle workflow & task jobs, using lightweight Fargate resources for workflow jobs. (See below for detailed infra specs.)

Submitting workflow jobs

First pip3 install miniwdl-aws locally to make the miniwdl-aws-submit program available. The following example launches a viral genome assembly that should run in 10-15 minutes:

miniwdl-aws-submit \
  https://github.com/broadinstitute/viral-pipelines/raw/v2.1.28.0/pipes/WDL/workflows/assemble_refbased.wdl \
  reads_unmapped_bams=https://github.com/broadinstitute/viral-pipelines/raw/v2.1.19.0/test/input/G5012.3.testreads.bam \
  reference_fasta=https://github.com/broadinstitute/viral-pipelines/raw/v2.1.19.0/test/input/ebov-makona.fasta \
  sample_name=G5012.3 \
  --workflow-queue miniwdl_workflow \
  --task-queue miniwdl_task \
  --fsap fsap-xxxx \
  --s3upload s3://MY_BUCKET/assemble_refbased_test \
  --follow

The command line resembles miniwdl run's with extra AWS-related arguments:

command-line argument	equivalent environment variable	description
`--workflow-queue`	`MINIWDL__AWS__WORKFLOW_QUEUE`	Batch job queue on which to schedule the workflow job
`--task-queue`	`MINIWDL__AWS__TASK_QUEUE`	Batch job queue on which to schedule task jobs
`--fsap`	`MINIWDL__AWS__FSAP`	EFS Access Point ID, which workflow and task jobs will mount at `/mnt/efs`
`--s3upload`		(optional) S3 URI prefix under which to upload the workflow products, including the log and output files
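
Since the queue and access point arguments each have an environment variable equivalent, a session can export them once and keep later submissions shorter. The following is a minimal sketch that reuses the placeholder values from the example command; `submit_example` and the `SUBMIT` override are hypothetical helpers for previewing the command, not part of miniwdl-aws:

```shell
# Placeholder values copied from the example command; substitute your own.
export MINIWDL__AWS__WORKFLOW_QUEUE=miniwdl_workflow
export MINIWDL__AWS__TASK_QUEUE=miniwdl_task
export MINIWDL__AWS__FSAP=fsap-xxxx

# Hypothetical helper: submits the example workflow without repeating the
# --workflow-queue, --task-queue, and --fsap arguments.
# Set SUBMIT=echo to print the arguments instead of running the tool.
submit_example() {
  "${SUBMIT:-miniwdl-aws-submit}" \
    https://github.com/broadinstitute/viral-pipelines/raw/v2.1.28.0/pipes/WDL/workflows/assemble_refbased.wdl \
    reads_unmapped_bams=https://github.com/broadinstitute/viral-pipelines/raw/v2.1.19.0/test/input/G5012.3.testreads.bam \
    reference_fasta=https://github.com/broadinstitute/viral-pipelines/raw/v2.1.19.0/test/input/ebov-makona.fasta \
    sample_name=G5012.3 \
    --s3upload s3://MY_BUCKET/assemble_refbased_test \
    --follow
}
```

Running `SUBMIT=echo; submit_example` prints the arguments for review before anything is sent to Batch.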

Adding --wait makes the tool await the workflow job's success or failure, reproducing miniwdl's exit code. --follow does the same and also live-streams the workflow log. Without --wait or --follow, the tool displays the workflow job UUID and exits immediately.
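
Because `--wait` reproduces miniwdl's exit code, a calling script can branch on the result. A minimal sketch, assuming the queues and access point are already set through the environment variables; `submit_and_report` and the `SUBMIT` override are hypothetical names used only for illustration:

```shell
# Hypothetical helper: submit a workflow, block until it finishes, and
# propagate the exit code that --wait reports.
# All arguments are passed to miniwdl-aws-submit ahead of --wait.
submit_and_report() {
  if "${SUBMIT:-miniwdl-aws-submit}" "$@" --wait; then
    echo "workflow succeeded"
  else
    status=$?   # exit status of the submit command tested above
    echo "workflow failed with exit code $status" >&2
    return "$status"
  fi
}
```

For example, `submit_and_report workflow.wdl input=value || exit 1` would stop a larger script when the workflow fails.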

Arguments not consumed by miniwdl-aws-submit are passed through to miniwdl run inside the workflow job. So are environment variables whose names begin with MINIWDL__, which allows overriding any miniwdl configuration option (disable with --no-env). See miniwdl_aws.cfg for various options preconfigured in the workflow job container.
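
As an illustration of both pass-through mechanisms, the sketch below forwards one miniwdl run flag and one configuration override. It assumes that `--no-cache` is a miniwdl run flag that miniwdl-aws-submit does not consume itself, and that `[scheduler] task_concurrency` is a valid option in your miniwdl version; the value 10 is arbitrary, and `submit_with_overrides` and `SUBMIT` are hypothetical names:

```shell
# Hypothetical helper: submit with a miniwdl configuration override and an
# extra miniwdl run flag.
submit_with_overrides() {
  # The MINIWDL__ variable is set only for this command; per the text above,
  # such variables are forwarded to miniwdl run inside the workflow job.
  # --no-cache is assumed to be left unconsumed and passed to miniwdl run.
  MINIWDL__SCHEDULER__TASK_CONCURRENCY=10 \
    "${SUBMIT:-miniwdl-aws-submit}" "$@" --no-cache
}
```

Adding `--no-env` to the arguments would suppress the environment variable forwarding.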

Run directories on EFS

Miniwdl runs the workflow in a directory beneath /mnt/efs/miniwdl_run (override with --dir). The outputs also remain cached there for potential reuse in future runs.

Given the EFS-centric I/O model, you'll need a way to manage the filesystem contents remotely. Deploy an instance or container that mounts your EFS, and access it via SSH or a web app (e.g. JupyterHub, Cloud Commander, VS Code server).

You can also automate cleanup of EFS run directories by setting miniwdl-aws-submit --s3upload and:

--delete-after success to delete the run directory immediately after successful output upload
--delete-after failure to delete the directory after failure
--delete-after always to delete it in either case
(or set environment variable MINIWDL__AWS__DELETE_AFTER_S3_UPLOAD)

Deleting a run directory after success prevents the outputs from being reused in future runs. Deleting it after failures can make debugging more difficult (although logs are retained, see below).
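
A submission that uploads its products to S3 and then removes the run directory only on success might look like the following sketch. The helper name `submit_with_cleanup` and the `SUBMIT` override are hypothetical, and the queues and access point are assumed to be set through environment variables:

```shell
# Hypothetical helper: the first argument is the S3 URI prefix, and the rest
# are the workflow and its inputs.
submit_with_cleanup() {
  s3_prefix=$1
  shift
  # A failed run keeps its directory on EFS, which helps with debugging.
  "${SUBMIT:-miniwdl-aws-submit}" "$@" \
    --s3upload "$s3_prefix" \
    --delete-after success \
    --wait
}
```

Choosing `success` rather than `always` trades some EFS storage for easier troubleshooting of failed runs.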

Logs & troubleshooting

If the terminal log isn't available (through Studio or miniwdl-aws-submit --follow) to trace a workflow failure, look for miniwdl's usual log files written in the run directory on EFS.
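
From an instance or container that mounts the EFS, the most recent workflow log can be located with a small helper such as the one below. This sketch assumes miniwdl's usual layout, in which each run directory has a timestamp-prefixed name and contains a `workflow.log`; `latest_workflow_log` is a hypothetical name:

```shell
# Hypothetical helper: print the path of the newest workflow.log beneath the
# run root (default /mnt/efs/miniwdl_run).
latest_workflow_log() {
  run_root=${1:-/mnt/efs/miniwdl_run}
  latest=""
  # The glob expands in sorted order, so with timestamp-prefixed directory
  # names the last existing match belongs to the most recent run.
  for log in "$run_root"/*/workflow.log; do
    if [ -f "$log" ]; then
      latest=$log
    fi
  done
  if [ -n "$latest" ]; then
    echo "$latest"
  else
    echo "no workflow.log found under $run_root" >&2
    return 1
  fi
}
```

For example, `tail -n 50 "$(latest_workflow_log)"` shows the end of that log.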

Each task job's log is also forwarded to CloudWatch Logs under the /aws/batch/job group and a log stream name reported in miniwdl's log. When using miniwdl-aws-submit, the workflow job's log is forwarded as well. CloudWatch Logs indexes the logs for structured search through the AWS Console & API.
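
Given a log stream name copied from miniwdl's log, the AWS CLI can print the corresponding messages. A minimal sketch, assuming the AWS CLI is installed and configured for the right region; `task_log` is a hypothetical helper name:

```shell
# Hypothetical helper: print the messages of one Batch job's log stream.
# $1 is the log stream name reported in miniwdl's log.
task_log() {
  # --start-from-head returns the earliest events first; the query keeps
  # only the message text, one event per line.
  aws logs get-log-events \
    --log-group-name /aws/batch/job \
    --log-stream-name "$1" \
    --start-from-head \
    --query 'events[*].[message]' \
    --output text
}
```

A single call returns one page of events, so very long logs may need the CloudWatch console or pagination.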

Misconfigured infrastructure might prevent logs from being written to EFS or CloudWatch at all. In that case, use the AWS Batch console/API to find status messages for the workflow or task jobs.
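
The Batch API route can be sketched with the AWS CLI as follows, assuming the CLI is installed and configured; `job_status` is a hypothetical helper name, and the job ID would be the one displayed at submission or found in the Batch console:

```shell
# Hypothetical helper: show the name, status, and status reason of a Batch job.
# $1 is the Batch job ID.
job_status() {
  aws batch describe-jobs \
    --jobs "$1" \
    --query 'jobs[*].[jobName,status,statusReason]' \
    --output text
}
```

The status reason often points to the misconfiguration, such as a failure to mount EFS or to pull the container image.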

Appendix: expected AWS infrastructure

Requirements:

VPC in desired region
EFS + Access Point providing uid=0 access
- set --fsap or MINIWDL__AWS__FSAP to access point ID (fsap-xxxx)
Task execution:
- Batch Compute Environment
  - spot instances (including AmazonEC2SpotFleetTaggingRole)
  - instance role
    - service-role/AmazonEC2ContainerServiceforEC2Role
    - AmazonElasticFileSystemClientReadWriteAccess
    - AmazonEC2ContainerRegistryReadOnly
    - s3:Get*, s3:List* on any desired S3 buckets
    - s3:Put* on S3 bucket(s) for --s3upload
- Batch Job Queue connected to compute environment
  - set --task-queue or MINIWDL__AWS__TASK_QUEUE to queue name
Unattended workflow execution with miniwdl-aws-submit:
- Batch Compute Environment
  - Fargate resources
  - execution role
    - service-role/AmazonECSTaskExecutionRolePolicy
    - AWSBatchFullAccess
    - AmazonElasticFileSystemFullAccess
    - AmazonEC2ContainerRegistryPowerUser
    - s3:Get*, s3:List* on any desired S3 buckets
    - s3:Put* on S3 bucket(s) for --s3upload
- Batch Job Queue connected to compute environment
  - tag the queue with WorkflowEngineRoleArn set to the execution role's ARN
  - set --workflow-queue or MINIWDL__AWS__WORKFLOW_QUEUE to queue name
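
The queue tag in the workflow requirements can be applied with the AWS CLI. A minimal sketch, assuming the queue and execution role already exist; `tag_workflow_queue` is a hypothetical helper name, and both ARNs are supplied by the caller:

```shell
# Hypothetical helper: tag the workflow job queue with the execution role ARN.
# $1 is the job queue ARN, $2 is the execution role ARN.
tag_workflow_queue() {
  # JSON syntax for --tags avoids ambiguity with the punctuation in ARNs.
  aws batch tag-resource \
    --resource-arn "$1" \
    --tags "{\"WorkflowEngineRoleArn\": \"$2\"}"
}
```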

Recommendations:

Compute environments
- Modify spot instance launch template to relocate docker operations onto NVMe or an autoscaling volume instead of the root EBS volume
- Set task environment allocation strategy to SPOT_CAPACITY_OPTIMIZED and spot bid policies as needed
- Set task environment spot bidPercentage as needed
- Temporarily set task environment minvCpus to reduce scheduling delays during periods of interactive use
EFS
- Create in One Zone mode, and restrict compute environments to the same zone (trading off availability for cost)
- Create in Max I/O mode
- Configure EFS Access Point to use a nonzero user ID, with an owned filesystem root directory
- Temporarily provision throughput if starting an intensive workload without much data already stored
Use non-default VPC security group for EFS & compute environments
- EFS must be accessible to all containers through TCP port 2049
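
Two of these recommendations map directly onto AWS CLI calls. The sketch below assumes separate security groups for EFS and for the compute environments; the helper names are hypothetical, and all IDs, names, and values are supplied by the caller:

```shell
# Hypothetical helper: allow NFS traffic (TCP port 2049) into the EFS
# security group ($1) from the compute environments' security group ($2).
allow_efs_access() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 2049 \
    --source-group "$2"
}

# Hypothetical helper: set minvCpus on the task compute environment ($1)
# to the given value ($2), e.g. during a period of interactive use.
set_task_min_vcpus() {
  aws batch update-compute-environment \
    --compute-environment "$1" \
    --compute-resources "minvCpus=$2"
}
```

Calling `set_task_min_vcpus` again with 0 afterwards lets the environment scale back down when idle.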

This version: 0.0.2 (Aug 5, 2021)
