A comparison of four major COVID-19 data sources

How journalists can use data from Johns Hopkins University, COVID Tracking Project, USAFacts, and The New York Times


A variety of datasets can help journalists track the spread of COVID-19. This series of small multiples from Politico uses data from the COVID Tracking Project.

The COVID-19 pandemic is a data-driven story. Daily counts of cases and deaths help reporters let the public know where, when, and how the contagion is spreading. Over the past month, a variety of datasets have sprung up to help track the virus. But which should journalists rely on? What’s the difference between them? What are the advantages and disadvantages of each? This guide will walk you through four major COVID-19 data sources: Johns Hopkins University, COVID Tracking Project, USAFacts, and The New York Times.

Overview

All four sources provide data on confirmed cases and COVID-19 deaths as a daily time series. COVID Tracking Project provides a current snapshot as well. They differ in the geographic level of detail (international, state, or county) and types of metrics included beyond cases and deaths, such as recovered patients, testing outcomes, and hospitalizations.

Each source generally pulls data from the same government agencies but at different times, which can lead to variation in daily totals. These differences can be significant, with national case counts varying by the thousands on a given day. The New York Times and Johns Hopkins tend to report the highest case counts because they update later than the others. COVID Tracking Project updates its data earlier, so it tends to report the lowest case counts. USAFacts falls somewhere in between.

One issue to watch going forward is how probable coronavirus deaths are handled. A probable death is one that was clinically diagnosed as COVID-19 but did not have a positive test for the virus. For example, on April 14, the New York City Health Department added more than 3,700 presumed coronavirus deaths to the city’s total. (The total number of probable deaths in NYC has since increased to more than 5,000.) As of May 6, 2020, Johns Hopkins University, USAFacts, and The New York Times are including all probable cases and deaths. COVID Tracking Project is not yet including presumed deaths if they are reported separately. This could change any day. Keep in mind that each state has its own methodology for counting COVID-19 deaths, including how it handles probable cases and whether it reports them separately from laboratory-confirmed cases. (NOTE: This paragraph was updated May 6, 2020, to reflect The New York Times’ inclusion of probable cases and deaths.)

How each source handles specific geographic areas can vary. For example, USAFacts reports New York City data broken down into the five boroughs that make up the city, while other sources report one number for the entire city. The cruise-ship cases are also treated differently, with some assigning cases to a specific county or state and others not attributing them to a particular area.

So which data source should you use for your story? Overall, these four sources show similar trends for the U.S. over time. But when broken down by state or county, individual data points can vary depending on the update schedule, geographic definitions, and other factors. While it can be frustrating when data sources differ, it’s a good thing to have multiple sources to compare against each other. Local reporters should spot-check each of these sources against their own local agency websites to help decide which is best for them.
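A spot-check like the one described above can be as simple as lining up each source's count for one place and date against the local agency's number. The sketch below is only an illustration: the counts are hypothetical placeholders, and `compare_to_reference` is a helper name invented here, not part of any of these datasets.

```python
# Hypothetical counts for one county on one date. Replace them with values
# pulled from each source and from the local health agency's website.
counts = {
    "Johns Hopkins University": 1250,
    "USAFacts": 1238,
    "The New York Times": 1250,
    "Local health agency": 1244,
}


def compare_to_reference(counts, reference):
    """Return each source's difference from the reference source's count."""
    base = counts[reference]
    return {
        source: value - base
        for source, value in counts.items()
        if source != reference
    }


differences = compare_to_reference(counts, "Local health agency")

# Print the sources from closest to farthest from the local number.
for source, diff in sorted(differences.items(), key=lambda item: abs(item[1])):
    print(f"{source}: {diff:+d}")
```

Repeating the check over several dates gives a better sense of whether a source is consistently ahead of or behind the local agency, rather than just off on one day.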

This guide has details about each of these four COVID-19 data sources, along with major caveats and exceptions. It also offers guidance on easy ways to get the data and decide what to use.

Johns Hopkins University

A dashboard made with data from the Johns Hopkins University Center for Systems Science and Engineering.

Get the data from Johns Hopkins University

What is this data and how is it gathered?

The Johns Hopkins University Center for Systems Science and Engineering COVID-19 Dataset was the first to track the coronavirus and the only one discussed here that has international data. It provides daily time-series reports for countries around the world and several cruise ships starting on January 22. The data goes down to the county level in the United States, the state level in Australia, and the province level in China and Canada.

Three types of files are provided: daily reports for all global locations, daily state reports for U.S. states, and time-series summaries at the global and U.S. county level. All files contain totals of confirmed cases, deaths, and recovered cases, as well as geographic information like latitude and longitude, FIPS codes, and ISO3 codes (standardized codes for countries). U.S. daily state reports also contain derived statistics including incident rate, mortality rate, testing rate, and hospitalization rate. Time-series summaries are divided out into separate files containing confirmed cases, deaths, and recovered cases. (There are no recovered cases for U.S. counties.)

The data are pulled from a variety of sources including the World Health Organization, the U.S. Centers for Disease Control and Prevention, the European Center for Disease Prevention and Control, the National Health Commission of the People’s Republic of China, local media reports, and local health departments.

When is this data updated?

Starting April 23, all files are updated once a day between 3:30 and 4:00 UTC (11:30 p.m. and 12:00 a.m. EDT).

Caveats

The JHU CSSE data has undergone a number of changes since its creation in January, which are important to keep in mind. In early March, reporting for U.S. locations changed from cities to counties, and in late March the time-series data were restructured. Additional changes to data are also noted in the data repository. Below are observations collected from journalists who have worked with the data.

  • As of April 14, probable cases and deaths in the U.S. are included.
  • Recovered cases are currently being reported for only a handful of countries, and are available for some U.S. states, not counties.
  • Each day, daily totals are appended to the time-series files as new columns, not rows.
  • Although the dataset is updated daily, the rate at which data is pulled from local health agencies varies.
  • Case-reporting methodology is not uniform across countries. For example, starting on February 13, both clinically and lab-confirmed cases were included in the count of confirmed cases for Hubei province.
  • Recovered totals for Canada and the U.S. are included as individual rows in the daily reports and not assigned to a state.
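Because the time-series files add a new column for each day, many charting and analysis tools will need the data reshaped so that each row is one place on one date. Here is a minimal sketch of that reshaping using only Python's standard library. The column names (`FIPS`, `Admin2`, `Province_State`), the date format, and the numbers in the sample are assumptions made for illustration, so check them against the actual file before relying on this.

```python
import csv
import io

# Illustrative sample in the wide layout described above: one row per place,
# one column per date. Column names and numbers are made up for this sketch.
SAMPLE = """FIPS,Admin2,Province_State,4/1/20,4/2/20,4/3/20
53033,King,Washington,100,120,150
17031,Cook,Illinois,80,95,130
"""

# Columns that identify the place; every other column is treated as a date.
ID_COLUMNS = ["FIPS", "Admin2", "Province_State"]


def wide_to_long(csv_file, id_columns, value_name="confirmed"):
    """Turn one-column-per-date rows into one record per place per date."""
    records = []
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column in id_columns:
                continue
            record = {name: row[name] for name in id_columns}
            record["date"] = column
            record[value_name] = int(value)
            records.append(record)
    return records


long_rows = wide_to_long(io.StringIO(SAMPLE), ID_COLUMNS)
print(len(long_rows))  # 2 places x 3 dates = 6 records
print(long_rows[0])
```

To use it on a downloaded file, pass an open file object in place of `io.StringIO(SAMPLE)` and list every non-date column in `ID_COLUMNS`.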

Geographic exceptions

  • All cases for the five boroughs of New York City are assigned to one area called New York City. The five counties that make up the city (New York, Kings, Queens, Bronx and Richmond counties) are still listed in the data but will show zero cases and deaths.
  • All COVID-19 cases in repatriated U.S. citizens from the Diamond Princess and Grand Princess cruise ships are reported separately and not assigned to a specific state or county.
  • Michigan Department of Corrections (MDOC) and the Michigan Federal Correctional Institution (FCI) are both assigned to the state but not a specific county.

Additional information

JHU also provides an interactive map and dashboard with global COVID-19 data that is updated closer to real time.

COVID Tracking Project

The Wall Street Journal made this chart with data from the COVID Tracking Project.

Get the data from COVID Tracking Project

What is this data and how is it gathered?

The COVID Tracking Project reports COVID-19 data for U.S. states and territories, collected primarily from local government health agencies. It is the only source that currently provides details on test results, but does not go down to the county level. There are four CSV files released per day: current national data, current state data, national time-series data, and state time-series data. Each file contains a cumulative count of positive/negative/pending test results, deaths, recovered patients, hospitalizations, ICU patients, and ventilated patients.

When is this data updated?

Data for individual locations are updated at various times throughout the day as states release their latest information. The current data files are updated every five minutes with a timestamp showing when each state was last checked. The time-series data are updated once a day, usually by 5 p.m. EDT.

Caveats

  • Probable cases and deaths that are reported separately are not included. This could change.
  • Data reporting varies across states—some states don’t report total hospitalized patients, some states include presumptive positives in their positives count, others include test results from both public and commercial labs, some states include deaths in their positives, etc. To fully understand the limitations of a state’s data, make sure to review the latest description for each state.
  • Data quality varies across states. COVID Tracking Project awards each state a letter grade from A to D based on the reliability and completeness of the state’s reporting system. This information is available in the current state data file.
  • There is often a lag between negative, pending, and total tests compared to positive tests.
  • Because data reporting of ICU admissions and ventilator use varies from state to state, the Project announced on April 6 that it would no longer be including those statistics in its national, aggregate-level datasets.
  • States are constantly updating how they report their data. In some of these cases, prior data was not updated.
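One common use of the positive and negative test counts is to calculate the share of completed tests that came back positive. The sketch below shows one cautious way to do that, returning nothing when a state has not reported one of the two numbers. The field names `positive` and `negative`, the state codes, and the counts are placeholders assumed for this example, and `share_positive` is a helper name invented here.

```python
def share_positive(positive, negative):
    """Return positive tests as a fraction of completed tests, or None.

    None is returned when either count is missing or no tests are recorded,
    so a gap in reporting is not mistaken for a rate of zero.
    """
    if positive is None or negative is None:
        return None
    completed = positive + negative
    if completed == 0:
        return None
    return positive / completed


# Hypothetical state rows; "AA" and "BB" are placeholder state codes.
state_rows = [
    {"state": "AA", "positive": 500, "negative": 4500},
    {"state": "BB", "positive": 200, "negative": None},
]

rates = {
    row["state"]: share_positive(row["positive"], row["negative"])
    for row in state_rows
}
print(rates)
```

Given the lag between negative and positive results noted above, a rate calculated for a single day can be misleading, so it is safer to compare rates over a longer period and to read each state's data description first.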

Geographic exceptions

  • American Samoa, Guam, Northern Mariana Islands, and Virgin Islands have N/A data grades for a variety of reasons—for example, American Samoa does not have testing facilities.
  • Guam’s total reported cases include 11 people tested at the Naval base in San Diego.

Additional information

The Project recently launched the COVID Racial Data Tracker to make race and ethnicity data more accessible. It is also interested in data on several subpopulations, such as nursing-home residents, health-care workers, and tribes.

Some of the Project’s team leads created a Google Spreadsheet that pulls in data from the Project’s API, making it easy for any journalist to zero in on their state. In addition, the team has created a Google Document of suggested questions that local journalists can use as a starting point while reporting with their data.

USAFacts

USAFacts displays maps and charts that use the data it collects.

Get the data from USAFacts

What is this data and how is it gathered?

USAFacts time-series data has U.S. confirmed cases and deaths down to the county level. Two CSV files are provided: One has confirmed cases and the other has deaths. Each file contains county FIPS code, county name, state abbreviation, and state FIPS code, followed by one column for each date, beginning with January 22, with the number of cases or deaths. All county-level data is either scraped or manually gathered by USAFacts from state or local public-health agencies.

When is this data updated?

The data files are updated once a day, usually around 12 a.m. PDT.

Caveats

  • Probable cases and deaths are included as of April 28. Probable deaths in NYC that cannot be allocated to one of the five boroughs are listed under “New York City Unallocated/Probable.”
  • Confirmed cases are only counted if reported by a state or local government agency.
  • In some states, cases that cannot be allocated to a specific county will count as “unallocated” for that state. If an unallocated case is later assigned to a specific county or a case is moved to a different county, USAFacts says it will update prior days’ counts accordingly.
  • County names can be inconsistent between the case and death files, so when matching to other data sources, journalists should only use FIPS codes (numeric codes for geographic areas, including counties).
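The advice to match on FIPS codes rather than county names can be sketched in a few lines. In this illustration the two small samples disagree on a county name but share the same code, and the codes are padded to five digits because leading zeros are often lost when a file passes through a spreadsheet. The column name `countyFIPS`, the date column, and the counts are assumptions for this example, not a description of the actual files.

```python
import csv
import io

# Made-up samples: the name for county 1001 differs between the two files.
CASES = """countyFIPS,County Name,State,4/1/20
1001,Autauga County,AL,10
53033,King County,WA,100
"""
DEATHS = """countyFIPS,County Name,State,4/1/20
1001,Autauga,AL,1
53033,King County,WA,5
"""


def normalize_fips(value):
    """Return a county FIPS code as a five-character, zero-padded string."""
    return str(value).strip().zfill(5)


def index_by_fips(csv_file, fips_column="countyFIPS"):
    """Read a CSV and key each row by its normalized FIPS code."""
    return {
        normalize_fips(row[fips_column]): row
        for row in csv.DictReader(csv_file)
    }


cases = index_by_fips(io.StringIO(CASES))
deaths = index_by_fips(io.StringIO(DEATHS))

# Join the two files on FIPS code only; county names are ignored.
merged = {
    fips: {
        "cases": int(cases[fips]["4/1/20"]),
        "deaths": int(deaths[fips]["4/1/20"]),
    }
    for fips in cases.keys() & deaths.keys()
}
for fips in sorted(merged):
    print(fips, merged[fips])
```

This approach keeps only one row per code, so if any rows in a file share a code (unallocated entries, for instance), they would need to be handled separately before joining.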

Geographic exceptions

  • New York City cases are allocated to the five boroughs (New York, Kings, Queens, Bronx and Richmond counties) when possible. Some unallocated NYC cases are assigned to “New York City Unallocated/Probable” until they can be assigned to one of the five counties.
  • The 21 cases from the Grand Princess cruise ship are allocated to California but not to a specific county.
  • All of Kansas City’s cases and deaths are assigned to Jackson County, Missouri, even though portions of the city extend into Cass, Clay, and Platte counties.
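Since the other sources report a single number for New York City, comparing them with USAFacts means adding the five borough counties back together, along with any unallocated cases. Here is a minimal sketch of that sum. The FIPS codes are the standard codes for the five counties, while the counts and the `nyc_total` helper are hypothetical.

```python
# Standard FIPS codes for the five counties that make up New York City.
NYC_COUNTY_FIPS = {
    "36061": "New York",
    "36047": "Kings",
    "36081": "Queens",
    "36005": "Bronx",
    "36085": "Richmond",
}


def nyc_total(county_counts, unallocated=0):
    """Sum counts for the five NYC counties, plus any unallocated cases.

    county_counts maps five-digit FIPS strings to counts for a single date.
    """
    boroughs = sum(county_counts.get(fips, 0) for fips in NYC_COUNTY_FIPS)
    return boroughs + unallocated


# Hypothetical counts for one date; 36059 (Nassau County) is not part of
# the city and is ignored by the sum.
sample = {
    "36061": 100,
    "36047": 200,
    "36081": 300,
    "36005": 150,
    "36085": 50,
    "36059": 999,
}
citywide = nyc_total(sample, unallocated=25)
print(citywide)  # 825
```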

Additional information

USAFacts has been collecting additional metrics—recovered patients, hospitalizations, and cases based on race, gender, and age—and plans on publishing some of these soon.

The New York Times

The New York Times uses its dataset for features like this map.

Get the data from The New York Times

What is this data and how is it gathered?

This is time-series data for U.S. confirmed cases and deaths down to the county level. There are three CSV files released once a day: national-level data, state-level data, and county-level data. Each file contains date, county/state name, FIPS code, number of confirmed cases, and number of deaths going back to January 21. All data is gathered by New York Times reporters from state and local health departments or official government announcements.

When is this data updated?

The data is updated once a day in the morning, usually around 8 a.m. EDT, with data from the day before. Occasionally, the Times makes additional updates during the day to add revisions or corrections.

Caveats

  • Probable cases and deaths are included as of May 6, 2020.
  • A confirmed case is a positive coronavirus test as reported by a federal, state, territorial, or local government agency.
  • All cases and deaths are counted on the date they are first announced.
  • If government agencies revise earlier data, prior entries will be updated.
  • If a patient’s county is unknown or pending, it will be allocated to “Unknown” for that state. The case will later be reallocated to a specific county when it becomes known.
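Stories about new cases per day need the change from one day to the next rather than the running total. The sketch below assumes the case and death columns are cumulative counts and that the file has the fields listed above. The sample rows are made up, and `daily_new` is a helper name invented for this example.

```python
import csv
import io

# Made-up rows for a single county, assuming cumulative counts.
SAMPLE = """date,county,state,fips,cases,deaths
2020-04-01,King,Washington,53033,100,2
2020-04-02,King,Washington,53033,120,3
2020-04-03,King,Washington,53033,118,3
"""


def daily_new(rows, field="cases"):
    """Return (date, change) pairs from one place's cumulative counts.

    Rows are sorted by date first. The earliest date is skipped because
    there is no prior day to subtract.
    """
    changes = []
    previous = None
    for row in sorted(rows, key=lambda r: r["date"]):
        current = int(row[field])
        if previous is not None:
            changes.append((row["date"], current - previous))
        previous = current
    return changes


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
new_cases = daily_new(rows)
print(new_cases)  # [('2020-04-02', 20), ('2020-04-03', -2)]
```

A negative change, like the one on the last sample date, usually points to a revision or a reallocation between counties rather than a real drop, so it is worth checking before publishing. The function expects rows for one place at a time, so filter by FIPS code or by county and state first.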

Geographic exceptions

  • All cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties) are assigned to one area called New York City. There is a large jump in the number of deaths on April 6, when the data source for deaths switched from New York City to New York state. For all counties in New York state, starting on April 8, deaths are reported by place of death instead of the individual’s residence.
  • In Kansas City, Mo., four counties (Cass, Clay, Jackson and Platte) overlap the municipality. The cases and deaths reported for these four counties are for only the portions outside of Kansas City. Cases and deaths for Kansas City are reported on their own.
  • Counts for Alameda County in California include cases and deaths from Berkeley and the Grand Princess cruise ship.
  • Counts for Douglas County in Nebraska include cases brought to the state from the Diamond Princess cruise ship.
  • The dataset includes a full list of geographic exceptions.

Additional information

If more counties begin reporting additional metrics like recovered patients, hospitalizations, demographics, or sub-county geography, the New York Times says it will consider adding those data points.

Download the data in one place

Big Local News, a collaborative effort and data-sharing platform for journalists, has been downloading and archiving all four data sources since mid-March. Journalists can log in to the free platform to access all of the case tracking data discussed here as well as other COVID-related data on hospital beds, associated demographic data, vulnerable communities, nursing homes and more.

Big Local News also has published a map of the daily case counts and deaths that is embeddable at the state and local level, using the New York Times data.

Choosing the right dataset for your story

All four datasets report the number of cases and deaths, and some of them also include data with other useful features. Here’s a table that summarizes the main features present in each dataset (note that these features may vary between data sheets within the same dataset):

U.S. states | U.S. counties | Global cases | Total tests | ICU / Ventilator use | Hospitalizations
Johns Hopkins University *
COVID Tracking Project
USAFacts
The New York Times

* Johns Hopkins University uses testing and hospitalization data from COVID Tracking Project

Find more step-by-step COVID-19 data story recipes like this one. If you have questions about a story you’re working on, our free peer data review program is here to help.

Programs like these are part of the OpenNews COVID-19 community care package. If you’re using this story recipe, please let us know — we’d love to promote your work! If you’ve got a story recipe idea, we’d love to hear about it. Drop us a line at source@opennews.org.

Credits

  • Irena Fischer-Hwang

    Irena Fischer-Hwang is a graduate student in the Master of Journalism program at Stanford University. Previously, she was at The Dallas Morning News. She completed her Ph.D. in electrical engineering at Stanford University, and obtained her B.S. and M.Eng. degrees in electrical engineering from the Massachusetts Institute of Technology in 2011 and 2012.

  • Justin Mayo

    Justin Mayo is a data journalist with Big Local News, a project of Stanford’s Journalism and Democracy Initiative. He spent 20 years as an investigative reporter with The Seattle Times, specializing in database reporting, and was part of two Pulitzer Prize-winning teams: the 2015 Breaking News Pulitzer Prize for coverage of a mudslide near Oso, Wash., that killed 43 people and the 2012 Pulitzer Prize for Investigative Reporting for exposing the state of Washington’s financially motivated practice of routinely prescribing a deadly pain drug for people in state-subsidized health care.
