
Big Data Smart Socket client


The growing size of datasets used in scientific computing has made it difficult or impossible for researchers to store all of their data at the compute site where they process it. As a result, data transfer has become a key consideration in experimental design. Accordingly, scientific data repositories such as NCBI have begun to offer services such as dedicated data transfer machines and advanced transfer clients. Despite this, many researchers continue familiar but suboptimal practices, such as using slow transfer clients like a web browser or scp, or transferring data over wireless networks.

BDSS aims to alleviate this problem by shifting the burden of learning about alternative file mirrors, transfer clients, and tuning parameters from the end-user researcher to a group of “data curators”. It consists of three parts:

Components

  • Metadata repository

      • Central database managed by data curators

      • Matches patterns of data file URLs and maps them to alternate sources

      • Includes information about the transfer tool to use to retrieve the data

  • BDSS transfer client

      • Consumes information from metadata repository

      • Invokes transfer tools

      • Reports analytics to metadata repository

  • Integration as a Galaxy data transfer tool

Get Started

Examples

All examples here require a metadata repository configured to support them. The default metadata repository at http://bdss.bioinfo.wsu.edu/ supports these examples, and the necessary configuration is listed with each one.

NCBI SRA archive

NCBI makes files available for transfer using Aspera Connect, a tool with “improved data transfer characteristics” compared with FTP or HTTP. If ascp is installed on your machine, BDSS can build the appropriate command for you.

Without BDSS:

ascp -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh -T anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR039/SRR039885/SRR039885.sra ./

With BDSS:

bdss transfer -u 'ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR039/SRR039885/SRR039885.sra'

Metadata repository configuration:

{
  "data_sources": [
    {
      "description": "",
      "label": "NCBI Sequence Read Archive with FTP",
      "test_files": [],
      "transfer_mechanism": {
        "options": {},
        "type": "curl"
      },
      "transforms": [
        {
          "for_destinations": [],
          "options": {
            "new_scheme": "aspera"
          },
          "target": "NCBI Sequence Read Archive with Aspera",
          "type": "change_scheme"
        }
      ],
      "url_matchers": [
        {
          "options": {
            "pattern": "^ftp://ftp\\.ncbi\\.nlm\\.nih\\.gov/sra"
          },
          "type": "regular_expression"
        }
      ]
    },
    {
      "description": "",
      "label": "NCBI Sequence Read Archive with Aspera",
      "test_files": [],
      "transfer_mechanism": {
        "options": {
          "disable_encryption": true,
          "username": "anonftp"
        },
        "type": "aspera"
      },
      "transforms": [],
      "url_matchers": [
        {
          "options": {
            "pattern": "^aspera://ftp\\.ncbi\\.nlm\\.nih\\.gov/sra"
          },
          "type": "regular_expression"
        }
      ]
    }
  ],
  "destinations": []
}
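
To make the configuration above more concrete, the following is a minimal Python sketch of how a client might interpret it: match the requested URL against each data source's url_matchers, apply any change_scheme transforms to reach alternate sources, and build the ascp command for the Aspera source. This is not the BDSS client's actual code. The function names (find_source, alternatives, ascp_command) and the key_path and dest parameters are hypothetical, and the order in which BDSS tries alternatives is not stated here, so the sketch simply lists them.

```python
import re
from urllib.parse import urlsplit

# Trimmed copy of the configuration above (only the fields this sketch reads).
DATA_SOURCES = [
    {
        "label": "NCBI Sequence Read Archive with FTP",
        "transfer_mechanism": {"type": "curl", "options": {}},
        "transforms": [{
            "type": "change_scheme",
            "options": {"new_scheme": "aspera"},
            "target": "NCBI Sequence Read Archive with Aspera",
        }],
        "url_matchers": [{
            "type": "regular_expression",
            "options": {"pattern": r"^ftp://ftp\.ncbi\.nlm\.nih\.gov/sra"},
        }],
    },
    {
        "label": "NCBI Sequence Read Archive with Aspera",
        "transfer_mechanism": {
            "type": "aspera",
            "options": {"disable_encryption": True, "username": "anonftp"},
        },
        "transforms": [],
        "url_matchers": [{
            "type": "regular_expression",
            "options": {"pattern": r"^aspera://ftp\.ncbi\.nlm\.nih\.gov/sra"},
        }],
    },
]


def find_source(url):
    """Return the first data source whose regular_expression matcher matches url."""
    for source in DATA_SOURCES:
        for matcher in source["url_matchers"]:
            if (matcher["type"] == "regular_expression"
                    and re.search(matcher["options"]["pattern"], url)):
                return source
    return None


def alternatives(url):
    """Return (label, url) pairs: the matched source, then each transform target."""
    source = find_source(url)
    if source is None:
        return []
    found = [(source["label"], url)]
    for transform in source["transforms"]:
        if transform["type"] == "change_scheme":
            # Swap only the scheme; host and path stay the same.
            rest = url.split("://", 1)[1]
            new_url = transform["options"]["new_scheme"] + "://" + rest
            found.append((transform["target"], new_url))
    return found


def ascp_command(url, options, key_path, dest="./"):
    """Build an ascp argument list from an aspera:// URL and mechanism options."""
    parts = urlsplit(url)
    cmd = ["ascp", "-i", key_path]
    if options.get("disable_encryption"):
        cmd.append("-T")  # -T is the flag used in the "Without BDSS" command
    cmd.append(f"{options['username']}@{parts.hostname}:{parts.path}")
    cmd.append(dest)
    return cmd
```

Calling alternatives() with the ftp:// URL from the example yields the original FTP source followed by the same path under the aspera scheme. Passing that second URL and the Aspera source's options to ascp_command() reproduces the arguments of the "Without BDSS" command.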

JGI Genome Portal

To download files from the JGI Genome Portal, you must first authenticate. BDSS can prompt for credentials and handle storing your session cookies.

Without BDSS:

curl 'https://signon.jgi.doe.gov/signon/create' --data-urlencode 'login=USER_NAME' --data-urlencode 'password=USER_PASSWORD' -c cookies > /dev/null
curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=PhytozomeV10' -b cookies > get-directory

With BDSS:

bdss transfer -u 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=PhytozomeV10'
JGI Genome Portal username?USER_NAME
JGI Genome Portal password?USER_PASSWORD

Metadata repository configuration:

{
  "data_sources": [
    {
      "description": "",
      "label": "JGI Genome Portal",
      "test_files": [],
      "transfer_mechanism": {
        "options": {
          "auth_url": "https://signon.jgi.doe.gov/signon/create",
          "password_field": "password",
          "password_prompt": "JGI Genome Portal password?",
          "username_field": "login",
          "username_prompt": "JGI Genome Portal username?"
        },
        "type": "session_authenticated_curl"
      },
      "transforms": [],
      "url_matchers": [
        {
          "options": {
            "pattern": "http:\\/\\/genome\\.jgi\\.doe\\.gov\\/ext-api"
          },
          "type": "regular_expression"
        }
      ]
    }
  ],
  "destinations": []
}
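
As a rough illustration of how the session_authenticated_curl options above relate to the two manual curl commands, the sketch below builds both argument lists from the configuration. It is an assumption about the mapping, not the BDSS client's implementation. The function name curl_commands and the cookie_jar and output parameters are hypothetical, and curl's -o option stands in for the shell redirections used in the manual commands.

```python
import os

# transfer_mechanism options copied from the configuration above.
JGI_OPTIONS = {
    "auth_url": "https://signon.jgi.doe.gov/signon/create",
    "password_field": "password",
    "password_prompt": "JGI Genome Portal password?",
    "username_field": "login",
    "username_prompt": "JGI Genome Portal username?",
}


def curl_commands(url, options, username, password,
                  cookie_jar="cookies", output="get-directory"):
    """Return (login, fetch) curl argument lists for a cookie-based session."""
    # Step 1: post the credentials and save the session cookies.
    login = [
        "curl", options["auth_url"],
        "--data-urlencode", f"{options['username_field']}={username}",
        "--data-urlencode", f"{options['password_field']}={password}",
        "-c", cookie_jar,
        "-o", os.devnull,  # discard the sign-on response body
    ]
    # Step 2: request the file, sending the saved cookies.
    fetch = ["curl", url, "-b", cookie_jar, "-o", output]
    return login, fetch
```

A caller could collect the credentials with input(options["username_prompt"]) and getpass.getpass(options["password_prompt"]), then run each list in order with subprocess.run(cmd, check=True).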

