Introduction

LAPIS (Lightweight API for Sequences) is an open web application programming interface (API) allowing easy querying of genomic sequencing data. Originally developed for SARS-CoV-2 and used by CoV-Spectrum, it is now also available for monkeypox. The API for monkeypox uses all monkeypox data on NCBI GenBank and from authors who shared them directly with us. Please note that we cannot provide sequences within LAPIS which are from databases not in the public domain (such as e.g. GISAID EpiPox) due to re-sharing restrictions. The provided data were pre-proceessed and aligned by the Nextstrain team. GenBank allows automatic pulling of data such that LAPIS offers always the latest data. The trees on https://nextstrain.org/monkeypox are based on data pulled through LAPIS. The core features are:

Filter sequences by metadata or mutations
Aggregate data by any metadata field you like
Get the full metadata
Get the sequences as FASTA (aligned or unaligned)
Responses can be formatted as JSON and as CSV

In the following, we demostrate the core features enabled by the API. On the left, we present the basic syntax of the API and on the right, we show how to use it for queries. In the section "Use Cases", we provide examples how to use the API to generate statistics, create plots, or download sequences for further analysis based on the publically available monkeypox sequencing data.

Overview

The API has five main endpoints related to samples. These endpoints provide different types of data:

/sample/aggregated - to get summary data aggregated across samples
/sample/details - to get per-sample metadata
/sample/nuc-mutations - to get the common nucleotide mutations (shared by at least 5% of the sequences)
/sample/fasta - to get original (unaligned) sequences
/sample/fasta-aligned - to get aligned sequences

The API returns a response (data) based on a query to one of the endpoints. You can view a response in your browser, or use the data programmatically. We'll provide some examples in R.

Query Format

Query example:

Get the total number of available sequences:
/sample/aggregated

To query an endpoint, use the web link with prefix https://mpox-lapis.gen-spectrum.org/v1 and the suffix for the relevant endpoint. In the examples, we only show the suffixes to keep things simple, but a click takes you to the full link in your browser.

Response Format

Response example:

{
  "info":{"apiVersion":1,"dataVersion":1653160874,"deprecationDate":null,"deprecationInfo":null,"acknowledgement":null},
  "errors":[],
  "data":[{"count":84}]
}

The responses can be formatted in JSON or CSV. The default is JSON. To get CSV responses, append the query parameter dataFormat=csv.

Responses returned in the JSON format have three top level attributes:

"info" - data about the API itself
"errors" - an array (hopefully empty!) of things that went wrong
"data" - the actual resposne data

Filters

Examples:

Get the number of all samples from Nigeria since 2000:
/sample/aggregated?country=Nigeria&dateFrom=2000-01-01

{
  "info":...,
  "errors":[],
  "data":[{"count":5}]
}

Get metadata of samples of the West Africa clade:
/sample/details?clade=WA

{
  "info": ...,
  "errors": [],
  "data": [
    {
      "date":"2017-11-30",
      "country":"Nigeria",
      "host":"human",
      "clade":"WA",
      "sraAccession":"MK783030",
      "strain":"3025",
      ...
    },
    ...
  ]
}

We can adapt the query to filter to only samples of interest. The syntax for adding filters is <attribute1>=<valueA>&<attribute2>=<valueB>.

All five sample endpoints can be filtered by the following attributes:

dateFrom (see section "Date handling")
dateTo
yearFrom
yearTo
yearMonthFrom
yearMonthTo
region
country
division
host
clade
nucMutations (see section "Filter Mutations")

The endpoints details, nuc-mutations, fasta, and fasta-aligned can additionally be filtered by these attributes:

sraAccession
strain

To determine which values are available for each attribute, see the example in section "Aggregation".

Filter Mutations

Get the total number of samples with the nucleotide mutations 913T and 5986T:
/sample/aggregated?nucMutations=913T,5986T

It is possible to filter for nucleotide bases/mutations. Multiple mutations can be provided by specifying a comma-separated list.

A nucleotide mutation has the format <position><base>. A "base" can be one of the four nucleotides A, T, C, and G. It can also be - for deletion and N for unknown.

The <base> can be omitted to filter for any mutation. You can write a . for the <base> to filter for sequences for which it is confirmed that no mutation occurred, i.e., has the same base as the reference genome at the specified position.

Aggregation

Examples:

Get the number of samples per country:
/sample/aggregated?fields=country

{
  "info": {"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors": [],
  "data": [
    {"country":"France","count":1},
    {"country":"Portugal","count":1}
    ...
  ]
}

Get the number of samples per host and country from the 2022:
/sample/aggregated?dateFrom=2022-01-01&fields=host,country

{
  "info": {"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors": [],
  "data": [
    {"host":"human","country":"USA","count":1},
    {"host":"human","country":"Portugal","count":1},
    ...
  ]
}

Above, we used the /sample/aggregated endpoint to get the total counts of sequences with or without filters. Using the query parameter fields, we can group the samples and get the counts per group. For example, we can use it to get the number of samples per country. We can also use it to list the available values for each attribute.

fields accepts a comma-separated list. The following values are available:

date (see section "Date handling")
year
month
region
country
division
host
clade

Date handling

The date field returns and the dateFrom and dateTo parameters expect a string formatted as YYYY-MM-DD (e.g., 2022-05-29). There are however samples for which we do not know the exact date but only a partial date: e.g., only the year or the year and the month. In those cases, the date is considered as unknown and will return a null. That means that the query dateFrom=2022-01-01 will not return samples for which we do not know the exact date but only that it is from May 2022.

To support partial dates, LAPIS additionally has the fields year and month. They are returned by the details endpoint and can be used as an aggregation field (e.g., fields=year,month is possible). Further, LAPIS offers yearFrom, yearTo, yearMonthFrom and yearMonthTo filters. yearMonth has to be formatted as YYYY-MM. For example, the queries yearFrom=2022 and yearMonthFrom=2022-05 will include all samples from May 2022.

Background

Why is the query dateFrom=2022-01-01 not returning samples from May 2022 that don't have an exact date? The reason is that the following (desirable) property would be violated:

For t0 < t1:

aggregated(dateFrom=t0) = aggregated(dateFrom=t0,dateTo=t1) + aggregated(dateFrom=t1+1) = sum(aggregated(dateFrom=t0,fields=date))

Use Cases

We demonstrate an example for this API in R.

Plot the global distribution of all sequences

library(jsonlite)
library(ggplot2)

# Query the API
response <- fromJSON("https://mpox-lapis.gen-spectrum.org/v1/sample/aggregated?fields=country")
data <- response$data

# Make a plot
ggplot(
  data,
  aes(x = "", y = count, fill = country)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start = 0) + 
  theme_minimal() + 
  theme(
    panel.grid=element_blank(),
    panel.border = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(), 
    axis.title.y = element_blank(),
    axis.text.x = element_blank())

About

LAPIS is being developed and maintained in the Computational Evolution group of ETH Zürich in Switzerland bsse.ethz.ch/cevo (Chaoran Chen and Tanja Stadler). The monkeypox data is pre-processed and aligned by members of the NextStrain team. We acknowledge the teams around the world sharing data openly on genbank in real time during this outbreak. We express our sincere gratitude.