Proposal for a Unifying REST API for GMI-Compliant Repositories

The goal of the Global Microbial Identifier (GMI) initiative is to create a global platform for storing whole-genome sequencing data and epidemiological metadata, and to implement an analysis platform for detecting outbreaks and emerging pathogens.

A major component of such a global resource, as envisioned by the GMI, is a repository for storing next-generation sequencing data and epidemiological metadata. Several repositories already exist for storing sequencing data and metadata, unified by the International Nucleotide Sequence Database Collaboration (INSDC). Each participant in the INSDC implements their repository independently and mirrors other participating repositories on a regular basis. In this document, we propose a new, unifying REST API for GMI-Compliant Repositories, including, but not limited to, the INSDC repositories.

First, we provide a brief review of the existing INSDC repositories and the available methods for accessing them. Then we describe, at a high level, a hierarchical data model that follows the existing INSDC data model. We then provide a basic description of REST and some motivation for the creation of a common REST API, followed by a detailed description of the metadata, links and media types associated with the proposed data model. Finally, we show some example usage scenarios of how an analytical tool developer might interact with the common REST API.

Please note that this document is a DRAFT. Comments, errata, discussion, and constructive criticism are welcome.

Review

The INSDC, consisting of the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ), is the current de facto standard for storing sequence read data in a centralized, public repository.

The members of the INSDC exchange and synchronize data on a regular basis, so that each location (NCBI in North America, EBI in Europe, and DDBJ in Japan) holds a complete mirror of the sequence data submitted to the other read archives.

While each of these repositories shares the same data, the process of submitting and accessing data is inconsistent across repositories for both human and software clients. Furthermore, both the NCBI and EBI offer XML schemas for each type of resource in their systems, but the schemas differ across institutions. Finally, the automated submission process also differs from site to site: the NCBI requires submission of XML formats to CGI-based web services, whereas the EBI offers REST web services for the submission of XML formats.

Unifying these APIs with a single, modern API will reduce the complexity of accessing these repositories on the part of developers.

REST API

Representational State Transfer (REST) is an architectural style formalized by Roy Fielding in his Ph.D. thesis “Architectural Styles and the Design of Network-based Software Architectures”. In short, REST is an architectural style that models resources and their relatedness, and promotes the use of the features exposed by a communication protocol (often HTTP) for creating, reading, and modifying those resources.

The REST architectural style has several core tenets:

  1. Identification of resources (a URI in HTTP),
  2. Manipulation of resources (methods like GET, POST, DELETE, etc. in HTTP),
  3. Self-describing messages (Content-Type headers in HTTP),
  4. Hypermedia driving application state (named links between resources).

By implementing a common REST API, we will improve ease of use for developers working with GMI-compliant repositories (analytical tools accessing resources, sequencers and sequencing facilities creating new resources in the repository), because developers need only write software that targets a single, common API. Furthermore, the GMI can simplify access to GMI-compliant repositories by providing per-language libraries that interact with all of them.

In addition to improving developer productivity, implementing a common REST API (using formats like JSON and XML) provides an easy-to-extend model. We propose that the GMI define a minimal set of properties for each of the resources exposed by the common REST API. Each individual GMI-compliant repository is then free to extend that model using its own custom JSON or XML media types, provided that the GMI-defined minimal metadata is included when a client requests a resource using the media types defined by this document.

Next, the documentation for the REST API can be confined to a single location. The NCBI BioProject interface currently has documentation spread across several different locations.

Finally, by proposing a unified REST API for the GMI project, all interested participants in the GMI project can guide the construction of the unified REST API.

REST API Description

Technical Notes

* In metadata property descriptions, properties in italics are optional properties.
* The metadata and data model described in this document adhere strictly to the INSDC Sequence Read Archive data model (see: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/). Some of the resources might better be described as composite resources (i.e., the resource consists of several sub-resources). The data model could be refactored to move these sub-resources into their own, uniquely addressable resource. Sub-resources that are candidates for extraction are in bold.
* Links are prefixed using http://www.g-m-i.org/ to guarantee uniqueness of the relation name; the prefix is free to change, provided that uniqueness can be guaranteed. For example, the prefix could be http://www.insdc.org/, http://www.ncbi.nlm.nih.gov, http://www.sra.ebi.ac.uk, etc., provided that everyone implementing a GMI-compliant repository agrees on a single namespace.

Authentication

For resources that require authentication, specifically for the creation of new resources, we propose using basic HTTP authentication. Please see RFC 2617 for a description of basic HTTP authentication. Storing user or client credentials is left to the implementor.
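
As a minimal sketch of the basic HTTP authentication proposed above, the following Python function builds the value of an Authorization header from a username and password. The function name is hypothetical and the credentials are placeholders (the well-known example pair from RFC 2617); they are not part of this proposal.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    # The credentials are joined with a colon and base64-encoded. This is an
    # encoding, not encryption, so the header offers no secrecy by itself.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials, used only for illustration.
print(basic_auth_header("Aladdin", "open sesame"))
# prints: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because the encoded credentials can be trivially decoded, an implementor would likely want to serve authenticated resources over HTTPS.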

Going forward, GMI-compliant repositories may use OAuth2, OpenID Connect, Persona, or other types of authentication/authorization schemes for accessing resources.

Proposed Media Types

Implementors must implement the minimal complement of media types defined in the sections below. The proposed media types include a version number so that the minimal metadata can be changed over time without breaking clients targeting specific media type versions.
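
To illustrate how a client might use the version number embedded in a media type, the sketch below parses names of the form application/vnd.gmi.study-v1+json, the form used in the examples later in this document. The general pattern (resource name, version, and a json or xml suffix) is inferred from those examples and is an assumption, as are the function and type names.

```python
import re
from typing import NamedTuple


class GmiMediaType(NamedTuple):
    resource: str  # e.g. "study"
    version: int   # e.g. 1
    syntax: str    # "json" or "xml"


# Assumed pattern: application/vnd.gmi.<resource>-v<version>+<syntax>
_PATTERN = re.compile(r"^application/vnd\.gmi\.([a-z]+)-v(\d+)\+(json|xml)$")


def parse_gmi_media_type(value: str) -> GmiMediaType:
    """Split a GMI media type into resource, version, and syntax."""
    # Drop any parameters such as "; charset=utf-8" before matching.
    bare = value.split(";", 1)[0].strip().lower()
    match = _PATTERN.match(bare)
    if match is None:
        raise ValueError(f"not a GMI media type: {value!r}")
    return GmiMediaType(match.group(1), int(match.group(2)), match.group(3))


parsed = parse_gmi_media_type("application/vnd.gmi.study-v1+json")
print(parsed.resource, parsed.version, parsed.syntax)  # prints: study 1 json
```

A client written this way can refuse, or fall back gracefully, when a repository returns a version it does not understand.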

We propose the use of two formats as a basis for the metadata media types exposed by GMI-compliant REST APIs:

  1. XML (see: http://www.w3.org/standards/xml/), and
  2. JSON (see: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf)

We propose XML because a large volume of existing clients can already parse the XML documents provided by INSDC participants. We propose the addition of JSON as a convenience for software developers targeting GMI-compliant REST APIs.

Furthermore, we propose the adoption of the existing XML media types provided by the INSDC for the Sequence Read Archive (SRA) initiative (the schemas can be found on the EBI FTP Archive). The only modification that we propose to these XML formats is the replacement of the custom LINK-type elements with standard XML link elements. We also suggest that the identifiers specified in the schemas be replaced with fully-qualified URLs so that clients do not need to know how to construct the URLs themselves.

Data Model

Members of the INSDC implement a six-level, hierarchical data model (See: Kodama, Y. et al.):

  1. Study (BioProject at NCBI and DDBJ)
  2. Sample (BioSample at NCBI and DDBJ)
  3. Experiment
  4. Run
  5. Analysis
  6. Submission

We do not propose to replace this model; the existing data model is widely accepted by software developers and biologists alike, and is already implemented across the members of the INSDC. Instead, we propose to use the existing data model as the basis for a new REST API.

Resources

Below we describe a possible set of properties and metadata for the data model described above. The properties and metadata here are generally used for data submission and resource creation.

Study

From NCBI submission portal: A BioProject (study) is a collection of biological data related to a single initiative, originating from a single organization or from a consortium of coordinating organizations.

Metadata

The metadata for a study is already defined by the INSDC. Please see the XML schema for study hosted at the EBI. A summary of the metadata for a study is listed below:

Acceptable media types

Sample

From NCBI submission portal: A BioSample (sample) is a description of the biological source materials used in experimental assays.

Metadata

The metadata for a sample is already defined by the INSDC. Please see the XML schema for sample hosted at the EBI. A summary of the metadata for a sample is listed below:

Acceptable media types

Experiment

From NCBI SRA Handbook: An experiment is a consistent set of laboratory operations on input material with an expected result.

Metadata

The metadata for an experiment is already defined by the INSDC. Please see the XML schema for experiment hosted at the EBI. A summary of the metadata for an experiment is listed below:

Acceptable media types

Run

From NCBI SRA Handbook: Results are called runs. Runs comprise the data gathered for a sample or sample bundle and refer to a defining experiment.

Metadata

The metadata for a run is already defined by the INSDC. Please see the XML schema for run hosted at the EBI. A summary of the metadata for a run is listed below:

Links for run resources are only exposed in media types that support hypermedia (i.e., JSON or XML file formats).

Acceptable media types

Analysis

Metadata

The metadata for an analysis is already defined by the INSDC. Please see the XML schema for analysis hosted at the EBI. A summary of the metadata for an analysis is listed below:

Links for analysis resources are only exposed in media types that support hypermedia (i.e., non-sequence file formats).

Acceptable media types

File Upload Considerations

Many large-scale, widely available products that deal with large, binary files (notably YouTube, Google Drive, and Dropbox) provide file uploads over HTTP. Nevertheless, uploading large, binary data files over HTTP raises some practical concerns:

  1. Transfer speed: is the performance of uploading over HTTP worse than FTP or some other file transfer protocol? NCBI uses Aspera Connect as a file transfer protocol for uploading files. Given the limited bandwidth of some centers, the choice of file transfer protocol may matter little for transfer performance. Conversely, with a sufficiently fast connection to the API, a client may overload the web server by uploading large files.
  2. Reliability: what happens if the file transfer is interrupted during transmission? Is the client expected to re-submit a very large set of files to the server upon failure? Other file transfer protocols are more resilient than plain HTTP when it comes to resuming file transmission after an interruption.
  3. Browser limitations: some popular web browsers have a limit of 2 GB for file uploads (notably, Firefox and Internet Explorer). While the intended audience of this REST API explicitly does not include web browsers, this is a realistic consideration.
  4. Web server limitations: systems administrators may not feel comfortable allowing large files to be uploaded via the web server. Most web servers allow the administrator to specify the maximum file upload size, and the maximum file upload size may be much smaller than the type of files needed to be submitted to a sequence data repository.

Possible solutions to the problems exposed by uploading files over HTTP:

  1. Simply use an alternative protocol more suited to transferring large, binary files (like FTP, SFTP, SCP/SSH, BitTorrent, Aspera Connect, etc.) to upload the files, then include a link to the created resource in the metadata package to be submitted. One possible problem with using alternative protocols compared to using HTTP/HTTPS is that some networks prohibit or hinder the use of other types of protocols, notably BitTorrent, for transferring data. The ports used by HTTP and HTTPS are often left without such prohibition. Furthermore, adding another protocol alongside HTTP for the REST API increases the complexity of clients, as each client would need to understand how to work with all necessary protocols (possibly irrelevant; most programming languages have libraries available for natively communicating over a variety of protocols).
  2. Adopt a resumable HTTP upload protocol. Google and YouTube allow the transfer of large, binary video files over HTTP using a wide variety of network links. The YouTube Data API describes a protocol for uploading video files to YouTube over HTTP in a resumable fashion. Chunking the large file addresses the reliability and web server limitations. Web browsers cannot work with files at such a low level. Chunking the large file does not address the problem of transfer speed.
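
The chunking idea in the second option can be sketched as follows. This is not the YouTube protocol itself, only an illustration of splitting a file into byte ranges, labelling each with a standard HTTP Content-Range value, and deciding which chunks to re-send after an interruption. The function names and the chunk size are hypothetical.

```python
from typing import Iterator, List, Tuple

Chunk = Tuple[int, int, str]  # (first byte, last byte, Content-Range value)


def chunk_ranges(total_size: int, chunk_size: int) -> Iterator[Chunk]:
    """Yield the byte range and Content-Range value for each chunk."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1  # inclusive offset
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1


def remaining_chunks(chunks: List[Chunk], last_byte_received: int) -> List[Chunk]:
    """Return the chunks that the server has not fully received yet."""
    return [chunk for chunk in chunks if chunk[1] > last_byte_received]


chunks = list(chunk_ranges(total_size=10, chunk_size=4))
for _, _, content_range in chunks:
    print(content_range)
# prints:
# bytes 0-3/10
# bytes 4-7/10
# bytes 8-9/10
```

After an interruption, a client that learns the server holds bytes 0 through 3 would only need to re-send the chunks returned by remaining_chunks(chunks, 3), rather than the whole file.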

HTTP Verbs

Each resource type can be represented as a resource collection or as an individual resource. This section outlines the HTTP verbs that can be invoked on resource collections and individual resources.

Resource Collections

Clients can invoke the following HTTP verbs on a resource collection (studies, samples, experiments, etc.) and expect the corresponding response codes outlined below.

Individual resource

Individual resources are accessed via the Location header upon successful creation of a resource, by navigating to a specific resource via a parent collection, or by bookmarking the address of a resource.

Examples

Creating a Study

This example shows the HTTP conversation between a client intending to create a new study and the repository. Note: the links and information shown in this example are not real; the links and names were chosen arbitrarily.

Initial request for creation of study:

POST /studies HTTP/1.1
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
Content-Type: application/vnd.gmi.study-v1+json

{
    "external-database-identifiers": {
        "ncbi/genbank": {"href": "http://www.ncbi.nlm.nih.gov/bioproject/18649"},
        "ebi/ena": {"href": "https://www.ebi.ac.uk/ena/data/view/SRP029705"}
    },
    "description": {
        "title": "A very interesting study.",
        "internal-name": "Internal study name.",
        "abstract": "This is a very interesting study, completed by very interesting people.",
        "description": "A supremely interesting study, this is a very long-winded description of what this project is about.",
        "type": "Whole Genome Sequencing",
        "related-studies": [
            "http://www.ncbi.nlm.nih.gov/bioproject/18651",
            "http://www.ncbi.nlm.nih.gov/bioproject/58825"
        ]
    },
    "related-resources": [
        "http://www.google.com"
    ],
    "additional-properties": {
        "not-used-internally": "Properties in this section are not parsed internally; submitters should use this as an opportunity to define metadata not officially required or specified by GMI-compliant repositories."
    },
    "submitter": [
        { "href": "http://repository.g-m-i.org/submitters/cdc" }
    ]
}

Response (study has been accepted for human review):

HTTP/1.1 202 Accepted
Location: http://repository.g-m-i.org/studies-to-review/123
Content-Type: application/json

{
    "response": "Your study has been accepted for review. Automated parsers have briefly verified the metadata supplied with your study and have found no errors, however human review is required. Please see http://repository.g-m-i.org/studies-to-review/123 to monitor the review progress of your study."
}

Client begins to monitor progress of review:

GET /studies-to-review/123 HTTP/1.1
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
Accept: application/json

Repository responds with the status of the request for creation:

HTTP/1.1 200 OK
Content-Type: application/json
Last-Modified: Mon, 23 Dec 2013 19:43:31 GMT

{
    "status": {
        "message": "pending review"
    }
}

Client continues to monitor progress of review:

GET /studies-to-review/123 HTTP/1.1
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
Accept: application/json
If-Modified-Since: Mon, 23 Dec 2013 19:43:31 GMT

Content is unchanged on server; server does not send back complete response:

HTTP/1.1 304 Not Modified
Date: Mon, 23 Dec 2013 20:43:31 GMT

Client continues to monitor progress of review:

GET /studies-to-review/123 HTTP/1.1
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
Accept: application/json
If-Modified-Since: Mon, 23 Dec 2013 20:43:31 GMT

Server responds with completed review, and a hyperlink to the location of the study:

HTTP/1.1 200 OK
Content-Type: application/json
Last-Modified: Mon, 23 Dec 2013 21:43:31 GMT

{
    "status": {
        "message": "review complete",
        "http://www.g-m-i.org/links/study": "http://repository.g-m-i.org/studies/123"
    }
}
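
The monitoring conversation above can be sketched as a polling loop that uses conditional requests. The HTTP transport is passed in as a function so that the sketch stays independent of any particular HTTP library; that function, the names used here, and the polling interval are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# A fetch function takes request headers and returns
# (status code, response headers, parsed JSON body or None).
Fetch = Callable[[Dict[str, str]], Tuple[int, Dict[str, str], Optional[dict]]]

STUDY_REL = "http://www.g-m-i.org/links/study"


def wait_for_review(fetch: Fetch, interval: float = 60.0,
                    max_polls: int = 100) -> Optional[str]:
    """Poll a review resource and return the study URL once it appears."""
    last_modified: Optional[str] = None
    for _ in range(max_polls):
        headers = {"Accept": "application/json"}
        if last_modified is not None:
            # Ask the server to skip the body if nothing has changed.
            headers["If-Modified-Since"] = last_modified
        status, response_headers, body = fetch(headers)
        if status == 200:
            last_modified = response_headers.get("Last-Modified", last_modified)
            link = (body or {}).get("status", {}).get(STUDY_REL)
            if link is not None:
                return link
        elif status != 304:
            raise RuntimeError(f"unexpected status {status}")
        time.sleep(interval)
    return None  # Review did not complete within max_polls attempts.
```

Injecting the fetch function also makes the loop easy to exercise against canned responses before pointing it at a real repository.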

Adding a Sample to a Study

In the last example we showed the general process required to add a study to a repository. In this example, we show how a client can add a new sample to an existing study.

Given the location of the study (http://repository.g-m-i.org/studies/123), the client will ask the server for more information about the resource:

GET /studies/123 HTTP/1.1
Accept: application/vnd.gmi.study-v1+json

The server will respond with the study metadata in the JSON format requested:

HTTP/1.1 200 OK
Content-Type: application/vnd.gmi.study-v1+json
Last-Modified: Mon, 23 Dec 2013 21:43:31 GMT

{
    "links": [
        { "rel": "self", "href": "http://repository.g-m-i.org/studies/123" },
        { "rel": "http://www.g-m-i.org/links/study/samples", "href": "http://repository.g-m-i.org/studies/123/samples" },
        { "rel": "http://www.g-m-i.org/links/submitter", "href": "http://repository.g-m-i.org/users/1" },
        { "rel": "http://www.g-m-i.org/links/study/related-studies", "href": "http://repository.g-m-i.org/studies/123/related-studies" },
        { "rel": "http://www.g-m-i.org/links/related-resources", "href": "http://repository.g-m-i.org/studies/123/related-resources" }
    ],
    // Other project metadata...
}

From the response, the client can find the link for the collection of samples associated with the study by looking for a link with a rel of http://www.g-m-i.org/links/study/samples. Once the client has found the link, it can get the complete list of samples associated with the study by issuing a GET request for the URL in the href part of the link:

GET /studies/123/samples HTTP/1.1
Accept: application/json

The server will respond with the complete set of samples associated with the study:

HTTP/1.1 200 OK
Content-Type: application/json
Last-Modified: Mon, 23 Dec 2013 21:43:31 GMT

[
    {
        "sampleName": "sample",
        "links": [
            { "rel": "self", "href": "http://repository.g-m-i.org/studies/123/samples/1" }
        ],
        // Other sample metadata...
    }
]
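
The link lookup used in this example, where the client locates the samples collection by its rel rather than by constructing a URL, might be sketched as follows. The helper name is hypothetical; the link structure mirrors the study response shown earlier.

```python
from typing import Iterable, Mapping, Optional

SAMPLES_REL = "http://www.g-m-i.org/links/study/samples"


def find_link(links: Iterable[Mapping[str, str]], rel: str) -> Optional[str]:
    """Return the href of the first link with the given rel, or None."""
    for link in links:
        if link.get("rel") == rel:
            return link.get("href")
    return None


# A trimmed copy of the study representation from the example above.
study = {
    "links": [
        {"rel": "self", "href": "http://repository.g-m-i.org/studies/123"},
        {"rel": SAMPLES_REL, "href": "http://repository.g-m-i.org/studies/123/samples"},
    ]
}

print(find_link(study["links"], SAMPLES_REL))
# prints: http://repository.g-m-i.org/studies/123/samples
```

Following links by rel keeps the client independent of each repository's URL layout, which is the point of the hypermedia tenet described earlier.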

The client can also add new samples to the study by issuing a POST request to the same href:

POST /studies/123/samples HTTP/1.1
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
Content-Type: application/vnd.gmi.sample-v1+json

{
    "sampleName": "sample",
    // Other sample metadata...
}

The server might respond immediately indicating success:

HTTP/1.1 201 Created
Location: http://repository.g-m-i.org/studies/123/samples/456

Or, the server might respond immediately indicating failure, with an invalid property name, for example:

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
    "message": "Invalid property fields named.",
    "invalidFields": [
        "sampleName"
    ]
}
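
A client might handle the two possible outcomes above along these lines. The sketch assumes the error body has the shape shown in the 400 response (a message and an invalidFields list); the function and exception names are hypothetical.

```python
import json
from typing import List, Mapping


class InvalidFieldsError(Exception):
    """Raised when the repository rejects a submission with 400 Bad Request."""

    def __init__(self, message: str, invalid_fields: List[str]):
        super().__init__(message)
        self.invalid_fields = invalid_fields


def created_location(status: int, headers: Mapping[str, str], body: str = "") -> str:
    """Return the URL of the new resource, or raise if creation failed."""
    if status == 201:
        # On success the new resource is identified by the Location header.
        return headers["Location"]
    if status == 400:
        error = json.loads(body) if body else {}
        raise InvalidFieldsError(error.get("message", "Bad Request"),
                                 error.get("invalidFields", []))
    raise RuntimeError(f"unexpected status {status}")


location = created_location(
    201, {"Location": "http://repository.g-m-i.org/studies/123/samples/456"})
print(location)  # prints: http://repository.g-m-i.org/studies/123/samples/456
```

Surfacing the invalid field names as structured data lets a submission tool point the user at the exact properties to correct.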

References

  1. NCBI SRA Handbook
    1. Submission Quick Start Guide
  2. Shumway, M. et al. Archiving next generation sequencing data.
  3. Leinonen, R. et al. The Sequence Read Archive.
  4. Kodama, Y. et al. The sequence read archive: explosive growth of sequencing data.
  5. IETF RFC 6838
  6. IETF RFC 2617
  7. NCBI BioProject Core XML Schema
  8. IANA Link Relations
  9. IETF RFC 2616 - Section 10, Status Codes Definition

Additional Resources on REST APIs