COVID-19 Open API

8 min read · Apr 2, 2020

COVID-19 (COronaVIrus Disease 2019) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in Wuhan (Hubei province, China) at the end of 2019 and was declared a pandemic on March 11th, 2020.

Many universities, public and private entities, and other organizations are publishing data visualizations (charts, tables, geographical heatmaps, etc.) and datasets that track how the disease is evolving day by day.

I have to confess I am very curious, and I like to build these representations of data on my own, either to experiment and see if “it works” or simply to get a different, or custom, point of view that the existing visualizations don’t offer.

The starting point

Almost from the beginning of the COVID-19 outbreak, I started looking around for datasets on Google, and right away I found myself in front of scattered data from the WHO (World Health Organization), the European CDC (Center for Disease Control), the China CDC, the United States CDC, the Italian Ministry of Health, and many others.

Collecting and properly merging all of this data is a real challenge for a number of reasons, for instance having to deal with different formats (raw data, tabular data, PDF reports, etc.). Fortunately, the Johns Hopkins University CSSE (Center for Systems Science and Engineering) did an excellent job of carrying out this task for the entire world, and they were the first to do it.

The Johns Hopkins University CSSE has developed an interactive real-time map of the COVID-19 spread around the world, and they have become “famous” for that. But guess what? Behind this map there is exactly the data I was looking for, stored in a beautiful public GitHub repository. Bingo!

The problem

Figure 1: A snapshot of what the files look like in the repository

Cool, I have data and I can do whatever I want with them. I am a huge fan of R (for those who don’t know what R is: R is a programming language and environment for statistical computing and graphics) and I started coding like a fool to get, process, and produce an output from all of those CSV (comma-separated values) files.

Figure 2: Data process and output generation workflow outline

Looking at the figure above, I immediately understood how time-consuming (and boring) the data preparation process can be, and how easily it can pull you off track from the real goal you are supposed to reach.

It’s a huge overhead before the main course, so I thought: I want to make a report, or a dashboard, or a web page, or simply some data visualization. I don’t care about building the entire pipeline to get my data nice and clean, I just want data ready to go!

The idea

An API. How beautiful would it be to have an API that serves the data directly to whatever frontend we want to build? That would be awesome, but to make data available via an API we need a real database to store it, and we also need to switch from R to another language/framework to build the API.

Technology

For the database, I decided to use CouchDB because it already has an embedded RESTful API that our backend can use, it is a good JSON-based NoSQL database, and it fits the size and needs of this project.

For the backend, I have selected my beloved Node.js in combination with a great and powerful framework which is Nest.js.

The system

As we have said at the beginning, the Johns Hopkins University CSSE is collecting data from the most important health organizations around the world and they are making these data available on GitHub.

Note: the whole system I’ve built is completely open-source and available on my GitHub.

Things to know before you run the code

There are a bunch of environment variables you have to set:

DB_AUTH="Basic YOUR_KEY"
DB_HOSTNAME=Your CouchDB hostname
DB_PORT=Your CouchDB Port
DB_NAME=Your data database 
DB_NAME_CONFIG=Your configuration database
DAILY_REPORTS_PATH=The path where the daily report CSV files are (currently csse_covid_19_data/csse_covid_19_daily_reports)

The data database contains the data, while the configuration database contains the latest version and the latest update date-time of your data (plus anything else you want to add). The “importer” automatically updates the “config” document inside the configuration database.
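As a rough illustration of how these variables might be consumed, here is a minimal TypeScript sketch. The `loadConfig` helper, the `DbConfig` shape, and the `http://` scheme are assumptions made for this example; they are not taken from the repository.

```typescript
// Hypothetical helper: names and shape are illustrative, not the project's code.
interface DbConfig {
  auth: string;             // value of the Authorization header, e.g. "Basic ..."
  baseUrl: string;          // CouchDB base URL built from host and port
  dataDb: string;           // database holding the imported records
  configDb: string;         // database holding the "config" document
  dailyReportsPath: string; // where the daily report CSV files live
}

type Env = Record<string, string | undefined>;

const REQUIRED_KEYS = [
  "DB_AUTH",
  "DB_HOSTNAME",
  "DB_PORT",
  "DB_NAME",
  "DB_NAME_CONFIG",
  "DAILY_REPORTS_PATH",
];

function loadConfig(env: Env): DbConfig {
  // Fail early, listing every missing variable at once.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    auth: env["DB_AUTH"] as string,
    // Assumption: CouchDB is reached over plain HTTP.
    baseUrl: `http://${env["DB_HOSTNAME"]}:${env["DB_PORT"]}`,
    dataDb: env["DB_NAME"] as string,
    configDb: env["DB_NAME_CONFIG"] as string,
    dailyReportsPath: env["DAILY_REPORTS_PATH"] as string,
  };
}
```

In a Node.js process you would typically call `loadConfig(process.env)` once at startup, so a missing variable stops the importer before it touches the database.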

Data importer

The most important thing for our system to work properly is data and we have to put in place a process that gets data from GitHub, processes them, and uploads them into our database.

The process described above has been developed inside the folder “importer”. This is an npm project and you can execute this command to prepare data and upload them into a CouchDB database:

npm run leech

The data importer is driven by an array of rules (importer/data-processor/models/rules/rules.ts) that define what the CSV files look like. The reason behind this small layer of abstraction is that the CSV file structure can change over time and, indeed, has already done so several times. These rules make the data importer configurable enough to keep running without any code modification (unless the Johns Hopkins University CSSE decides to dramatically change everything).

Each rule has a validity date (in UNIX time, expressed in milliseconds) beyond which it is no longer considered. The validity date is compared against the date in the CSV file name, which is the day the file was produced.
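To make the validity check more concrete, here is a small sketch of how a rule could be picked for a given file. It assumes the daily report files are named MM-DD-YYYY.csv and that dates are compared in UTC; `fileNameToUnixMs` and `selectRule` are hypothetical names, and the real importer may select rules differently.

```typescript
// Assumption: daily report files are named MM-DD-YYYY.csv.
function fileNameToUnixMs(fileName: string): number {
  const match = /^(\d{2})-(\d{2})-(\d{4})\.csv$/.exec(fileName);
  if (!match) {
    throw new Error(`Unexpected file name: ${fileName}`);
  }
  // Date.UTC expects a zero-based month.
  return Date.UTC(Number(match[3]), Number(match[1]) - 1, Number(match[2]));
}

// Returns the earliest-expiring rule that is still valid for the file,
// or undefined when every rule has expired.
function selectRule<T extends { validUntil: number }>(
  rules: T[],
  fileName: string,
): T | undefined {
  const fileDate = fileNameToUnixMs(fileName);
  return [...rules]
    .sort((a, b) => a.validUntil - b.validUntil)
    .find((rule) => fileDate <= rule.validUntil);
}
```

Sorting a copy of the array keeps the selection independent of the order in which rules are declared, which seems convenient when new rules get appended over time.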

The content of the rule consists of an object that is mapping and typing the CSV fields and treating some of the values (like string replacements). Here is how a rule looks:

{
  validUntil: 1584835199000,
  columns: {
    provinceState: { header: 'Province/State', type: TYPES.STRING },
    country: { header: 'Country/Region', type: TYPES.STRING, replace: [{ what: 'Mainland China', str: 'China'}] },
    lastUpdate: { header: 'Last Update', type: TYPES.DATETIME },
    confirmed: { header: 'Confirmed', type: TYPES.NUMBER },
    deaths: { header: 'Deaths', type: TYPES.NUMBER },
    recovered: { header: 'Recovered', type: TYPES.NUMBER },
    latitude: { header: 'Latitude', type: TYPES.GEO },
    longitude: { header: 'Longitude', type: TYPES.GEO },
  }
}

The properties of the “columns” object define what the object stored in our data database will look like.
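The following sketch shows one way a rule like this could turn a parsed CSV row into a document. The `TYPES` enum and the `Rule` interface are local stand-ins modelled on the rule shown above, and `applyRule` and `convert` are hypothetical helpers. The conversions are deliberately simplified; in particular, `Date.parse` is only reliable for ISO-like timestamps, so the real importer probably handles dates more carefully.

```typescript
// Local stand-ins for the project's TYPES and rule shape (assumed, simplified).
enum TYPES { STRING, NUMBER, DATETIME, GEO }

interface Replacement { what: string; str: string }
interface Column { header: string; type: TYPES; replace?: Replacement[] }
interface Rule { validUntil: number; columns: Record<string, Column> }

type CsvRow = Record<string, string>;
type DataDoc = Record<string, string | number | null>;

// Converts one raw CSV cell according to its column definition.
function convert(raw: string | undefined, column: Column): string | number | null {
  if (raw === undefined || raw.trim() === "") return null; // missing cell
  let value = raw.trim();
  // Apply the configured string replacements, e.g. 'Mainland China' -> 'China'.
  for (const { what, str } of column.replace ?? []) {
    value = value.split(what).join(str);
  }
  switch (column.type) {
    case TYPES.NUMBER:
    case TYPES.GEO: {
      const n = Number(value);
      return Number.isNaN(n) ? null : n;
    }
    case TYPES.DATETIME: {
      // Assumption: the timestamp is in a format Date.parse understands.
      const t = Date.parse(value);
      return Number.isNaN(t) ? null : new Date(t).toISOString();
    }
    default:
      return value;
  }
}

// Builds the stored object: one property per key of rule.columns.
function applyRule(rule: Rule, row: CsvRow): DataDoc {
  const doc: DataDoc = {};
  for (const [field, column] of Object.entries(rule.columns)) {
    doc[field] = convert(row[column.header], column);
  }
  return doc;
}
```

With this shape, a CSV column that is absent from a given file simply yields `null` for that property, which is the kind of gap the geocoder can later fill in.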

API interface

Now that we have data we can make them available. Inside the folder “api-be” there is the backend exposing the interface to find data.

As you might guess, this is another npm project and you can get it running with this command:

npm run start

Note: the official API reference is contained in the Swagger UI that this backend exposes by itself at the /api path.

Finding data is the API that anyone building any sort of frontend will use. Using it is very simple because it mirrors the “_find” API available in CouchDB (see the CouchDB documentation). The available properties in the request JSON are:

selector: any - JSON object describing the criteria used to select documents (selector syntax). Required.
limit: number - Maximum number of results returned. Default is 50.
skip: number - Skip the first ‘n’ results, where ‘n’ is the value specified.
sort: any[] - JSON array following sort syntax.
fields: string[] - JSON array specifying which fields of each object should be returned.
bookmark: string - A string that enables you to specify which page of results you require. Every query returns an opaque string under the bookmark key that can then be passed back in a query to get the next page of results. If any part of the selector query changes between requests, the results are undefined.
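Here is a minimal sketch of how a client might build such a request body and page through results with the bookmark. The `FindRequest` interface mirrors the properties listed above; `buildFindRequest` and `nextPage` are hypothetical helpers, not part of the project.

```typescript
// Request body for the find endpoint, mirroring the properties listed above.
interface FindRequest {
  selector: Record<string, unknown>;
  limit?: number;
  skip?: number;
  sort?: unknown[];
  fields?: string[];
  bookmark?: string;
}

type FindOptions = Omit<FindRequest, "selector">;

// Builds a request body, rejecting obviously invalid paging values.
function buildFindRequest(
  selector: Record<string, unknown>,
  options: FindOptions = {},
): FindRequest {
  for (const key of ["limit", "skip"] as const) {
    const value = options[key];
    if (value !== undefined && (!Number.isInteger(value) || value < 0)) {
      throw new Error(`${key} must be a non-negative integer`);
    }
  }
  return { selector, ...options };
}

// Requests the next page: everything stays the same, only the bookmark changes,
// because changing the selector between pages gives undefined results.
function nextPage(previous: FindRequest, bookmark: string): FindRequest {
  return { ...previous, bookmark };
}

// Example: confirmed cases in Italy, 10 records per page.
const firstPage = buildFindRequest(
  { country: "Italy", confirmed: { $gt: 0 } },
  { limit: 10, fields: ["country", "lastUpdate", "confirmed"] },
);
```

The resulting object can be serialized with `JSON.stringify` and sent as the body of a POST to the find endpoint. The bookmark returned with each response is then fed to `nextPage` to fetch the following page.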

Geocoder

If you look into the CSV files produced by the Johns Hopkins University CSSE, you will notice that the geographical location is missing in some of the records. We can “fix” these records by updating them. In the folder “geocoder” there is another npm project that looks for records with a missing geo-location in the database and updates them.

The geocoder is based on the Nominatim API. Nominatim is a tool to search OpenStreetMap data by name and address and to generate synthetic addresses of OpenStreetMap points. You can run it with this command:

npm run geocoder
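To give an idea of what a lookup might involve, here is a hedged sketch of the two pure steps: building a Nominatim search URL and reading coordinates from the response. It assumes the public Nominatim search endpoint with JSON output, which returns `lat` and `lon` as strings. `buildSearchUrl` and `toCoordinates` are hypothetical names, and the project's geocoder may query Nominatim differently.

```typescript
// Assumption: the public Nominatim search endpoint with JSON output.
const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

// Only the fields this sketch needs; Nominatim returns them as strings.
interface NominatimHit { lat: string; lon: string }

interface Coordinates { latitude: number; longitude: number }

// Builds a free-text search URL such as "Hubei, China", limited to one result.
function buildSearchUrl(country: string, provinceState?: string): string {
  const parts = provinceState ? [provinceState, country] : [country];
  const query = encodeURIComponent(parts.join(", "));
  return `${NOMINATIM_SEARCH}?q=${query}&format=json&limit=1`;
}

// Extracts numeric coordinates from the first hit, or null when there is none.
function toCoordinates(hits: NominatimHit[]): Coordinates | null {
  const first = hits[0];
  if (!first) return null;
  const latitude = Number(first.lat);
  const longitude = Number(first.lon);
  if (Number.isNaN(latitude) || Number.isNaN(longitude)) return null;
  return { latitude, longitude };
}
```

Between these two functions sits an HTTP GET of the URL. If you use the public Nominatim instance, keep in mind that its usage policy asks for an identifying User-Agent and a low request rate, so records should be geocoded one at a time.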

Full system view

Cool cool, now our system has all the components to work properly and looks like this:

The example

I have created a little web application containing a chart that shows how things are progressing in Italy.

The application is located inside the folder “frontend” and it is an Angular application.

The gloves are off

I am glad if you got this far and it means a lot to me that you have given me this much time!

You can play with this system, give me any suggestions, and also contribute to this project. Fork it and make PRs :-). There is a roadmap I would like to hit and it is public on this project repository.

Backend: http://api.covid19-opendata.online:30000/
Frontend: http://www.covid19-opendata.online/
Swagger UI: http://api.covid19-opendata.online:30000/api

You can also find a downloadable Postman collection in my repository.

Quick examples

Hello world:
curl --location --request GET 'http://api.covid19-opendata.online:30000/config'

Config:
curl --location --request GET 'http://api.covid19-opendata.online:30000/config'

Find:
curl --location --request POST 'http://api.covid19-opendata.online:30000/data/find' \
--header 'Content-Type: application/json' \
--data-raw '{
 "selector": {
  "country": "Italy"
 }
}'

Special thanks go to Mattia Tupone and Marco Bucci. They have taken care of all the Kubernetes infrastructure that is hosting this project. Thank you so much.

References

CSSEGISandData/COVID-19 (github.com): This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University…

Stay safe, we will be back again soon.

The coronavirus outbreak is continuously and rapidly evolving. For updates, check the World Health Organization as well as your local health department.