We recently sat down with Dr. Lucy Li, a data scientist at the Chan Zuckerberg Biohub (CZB), to discuss her latest research on estimating unreported COVID-19 cases. The research is supported by the AWS Diagnostic Development Initiative, a global program to support organizations working to bring better, more accurate diagnostic solutions to market faster.
Tell us more about the mission of the Chan Zuckerberg Biohub and your role.
I'm a data scientist at the Chan Zuckerberg Biohub (CZB), and my background is in infectious disease epidemiology. CZB is a nonprofit research organization that aims to set the standard for collaborative science—where leaders in science and technology come together to drive discovery and support the bold vision to cure, prevent, and manage disease within the next century. Our goal is to understand the fundamental mechanisms underlying disease and to develop new technologies that will lead to actionable diagnostics and effective therapies. It's a regional research endeavor with international reach, where the Bay Area's leading institutions—University of California San Francisco (UCSF), Stanford, and Berkeley—joined forces with CZB to catalyze impact, benefitting people and partnerships around the world.
Can you tell us about your new COVID-19 research?
One thing that makes COVID-19 challenging to track is that not all individuals who have it exhibit symptoms, so I was very interested in estimating the true number of infections. The virus genome mutates at a fairly constant rate as it spreads across the population, even when it's spreading among asymptomatic individuals. That means that every time someone new is infected, the virus changes a little bit. So even if we aren't able to test everyone in the population, as long as we know how quickly the virus mutates, we can infer the likely number of undetected transmission events between people who were tested. For this research, I created a mathematical model to estimate the number of undetected infections at 12 locations in Asia, Europe, and the U.S. over the course of the pandemic.
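The molecular-clock reasoning above can be illustrated with some rough arithmetic. This is only a sketch, not the model used in the study: the substitution rate, the serial interval, and the function names are illustrative assumptions, and a real analysis would account for uncertainty in all of them.

```python
# Illustrative assumptions (not values from the study):
SUBSTITUTION_RATE = 1e-3   # substitutions per site per year (assumed, order of magnitude)
GENOME_LENGTH = 29_903     # sites in the SARS-CoV-2 reference genome
SERIAL_INTERVAL_DAYS = 5   # assumed average time between successive infections


def expected_mutations(days):
    """Expected substitutions accumulated along a lineage over `days`."""
    return SUBSTITUTION_RATE * GENOME_LENGTH * days / 365.0


def days_separating(mutation_count):
    """Evolutionary time (in days) implied by a number of differing sites."""
    return mutation_count / (SUBSTITUTION_RATE * GENOME_LENGTH) * 365.0


def implied_transmission_links(mutation_count):
    """Rough number of person-to-person steps along the path between two genomes."""
    return days_separating(mutation_count) / SERIAL_INTERVAL_DAYS


if __name__ == "__main__":
    # Two sampled genomes that differ at 4 sites (a made-up example).
    print(round(days_separating(4), 1), "days of evolution separate the samples")
    print(round(implied_transmission_links(4), 1), "transmission steps, roughly")
```

If only two of the people on that path were tested, the remaining steps are candidates for undetected infections, which is the intuition the model formalizes.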
What were the findings?
I found that the proportion of infections that went undetected varied widely across these locations, reaching over 90 percent in Shanghai. We also found that the probability of detecting a case changed significantly over time. When the virus was first transmitted to these 12 locations, over 98 percent of infections were undetected during the initial couple of weeks, indicating that the epidemic was already taking off by the time intense testing started.
What are the practical implications of the research—how can it help us now?
Knowing how many individuals have been infected has significant implications for understanding the scope of the pandemic. While the number of confirmed infections is very high, understanding the additional number of infections that have occurred on top of confirmed cases can help us understand how much of the population has already been affected by the virus. These numbers are also useful for evaluating the efficacy of public health surveillance systems.
To understand how well testing strategies are working, you can look at the change in the proportion of undetected infections over time. The more testing and contact tracing is done, the smaller the number of undetected infections relative to those reported to the healthcare system. That information is also useful for designing effective public health responses and interventions because it highlights locations within your country or state that might require more testing resources.
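As a small illustration of tracking that proportion over time, the sketch below computes the undetected fraction from confirmed counts and model-estimated totals. The function name and the weekly numbers are hypothetical and chosen only to show the calculation.

```python
def undetected_fraction(confirmed, estimated_total):
    """Share of all estimated infections that were never confirmed by testing."""
    if estimated_total <= 0:
        raise ValueError("estimated_total must be positive")
    if confirmed > estimated_total:
        raise ValueError("confirmed cases cannot exceed estimated total infections")
    return 1.0 - confirmed / estimated_total


if __name__ == "__main__":
    # Hypothetical weekly figures: (confirmed cases, estimated total infections).
    weeks = [(2, 100), (30, 400), (200, 1000)]
    for week, (confirmed, total) in enumerate(weeks, start=1):
        print(f"week {week}: {undetected_fraction(confirmed, total):.0%} undetected")
```

A falling percentage across weeks would suggest that testing and contact tracing are catching a larger share of infections.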

What role do AWS cloud services play in helping your team advance its research?

Amazon Web Services (AWS) provided computational support through credits and also offered the expertise of the AWS Professional Services team, who helped scale up this analysis using Amazon Elastic Compute Cloud (Amazon EC2) and AWS Batch. These resources provided a framework that CZB can use to continue this work in the future for other data sets. Essentially, each analysis that we conduct takes a long time and is computationally intensive. For each of the 12 data sets that I worked with, I had to test thousands of different parameter sets, use each one to simulate what the epidemic should look like, and compare the simulation to the data I had at hand. That process can take hours or sometimes days. With the support of the AWS Professional Services team, I was able to better parallelize the process so that I could complete the work in a reasonable time frame and report on the data in a matter of days, rather than months.
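The simulate-and-compare loop described above is naturally parallel, because each parameter set can be scored independently. The following is a minimal local sketch of that pattern using Python's standard library; it is not the study's code and does not use AWS Batch. The toy growth model, the observed counts, and all function names are assumptions made for illustration.

```python
import random
from concurrent.futures import ProcessPoolExecutor

# Made-up weekly reported case counts, used only for illustration.
OBSERVED_WEEKLY_CASES = [3, 8, 20, 45, 90]


def simulate_weekly_cases(growth, detection_prob, seed, initial=10):
    """Toy epidemic: infections multiply by `growth` each week, and each
    infection is reported independently with probability `detection_prob`."""
    rng = random.Random(seed)
    infections = initial
    reported = []
    for _ in OBSERVED_WEEKLY_CASES:
        reported.append(sum(rng.random() < detection_prob for _ in range(infections)))
        infections = max(1, round(infections * growth))
    return reported


def score(params):
    """Squared error between one simulation and the observed counts."""
    growth, detection_prob, seed = params
    simulated = simulate_weekly_cases(growth, detection_prob, seed)
    error = sum((s - o) ** 2 for s, o in zip(simulated, OBSERVED_WEEKLY_CASES))
    return error, growth, detection_prob


def sweep_serial(param_sets):
    """Score every parameter set one after another; return the best tuple."""
    return min(map(score, param_sets))


def sweep_parallel(param_sets, max_workers=None):
    """Same sweep, with parameter sets spread across worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return min(pool.map(score, param_sets, chunksize=16))
```

To try it, build a grid such as `[(g / 10, p / 10, 0) for g in range(10, 31) for p in range(1, 10)]` and call `sweep_parallel(grid)` from inside an `if __name__ == "__main__":` block. On a service like AWS Batch, the same idea would apply at a larger scale, with each job handling a slice of the parameter sets.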
How are you using machine learning, specifically?
In order to infer the number of undetected infections, I used a mathematical model to describe how coronaviruses spread from one person to another. I trained the model on available data—the viral genomes from each of the 12 locations, in addition to the time series of confirmed cases in each of the locations. The output of that model was the total number of infections—both confirmed cases and undetected infections. This model also helped us understand some interesting epidemiological parameters, such as the reproductive number and the role 'super-spreading' plays in contributing to this pandemic.
CZB had a major infectious disease initiative long before COVID-19. What impact has COVID-19 had on the organization's work overall?
Most people working on infectious disease projects at CZB have turned their focus to coronavirus testing and research over the last couple of months. In addition, the Biohub has been partnering with UCSF and our sister organization, the Chan Zuckerberg Initiative, to carry out antibody testing and polymerase chain reaction (PCR) tests, which are used to directly detect the presence of the virus's genetic material. Both the laboratory and computational methods that the Biohub has developed in responding to this coronavirus outbreak will not only improve our understanding of COVID-19 in the short term, but will also have utility in the infectious disease space more generally.
Do you have any plans to build on the research on this study?
I'm definitely interested in continuing this type of analysis for different states and counties in the U.S. and repeating that on a regular basis. Since I first started my analysis, there have been many more viral genomes deposited online. So I think the analysis that I would do this month would provide much more precise estimates of those infection numbers than the ones that I reported on in my recent paper. It's a growing effort at the Biohub to do more viral sequencing in California in the coming weeks and months. The end goal is to make these results available for local public health departments, so they have another metric to track the number of infections, even when population-wide testing is not available.
Anything else you’d like to highlight about your research?
One interesting result is that we were able to quantify how much variation there is in transmissibility. You may have heard of the concept of a "reproductive number," which describes how many additional infections each infected individual causes. That number is just the average, though, and it doesn't give a full picture of how variable people's individual reproductive numbers are. With this genomics-informed approach, I was able to quantify that variability. In the research, I estimated that around 80 percent of the infections were caused by the top 30 percent most infectious people. That figure has been estimated for other infectious diseases before, and it's on par with something like pandemic influenza, but not as extreme as the 2003 SARS outbreak. In 2003, a lot of the SARS epidemics were caused by extreme "super-spreading" events in which a single individual caused hundreds of infections. In this coronavirus outbreak, there are still super-spreading events, but they don't seem to play as big a role in driving the pandemic forward. Thus, while there are still individual super-spreaders, perhaps more important for the current coronavirus pandemic is the contribution of super-spreading events where large numbers of people congregate in close proximity.
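One common way to express this kind of variability, sketched here under assumptions rather than taken from the study, is to give each person an individual reproductive number drawn from a gamma distribution with mean R0 and a dispersion parameter k (smaller k means more variation). The values of R0 and k below, and the function name, are illustrative only.

```python
import random


def share_from_top(r0, k, top_fraction=0.3, n=50_000, seed=1):
    """Fraction of expected transmission attributable to the most infectious
    `top_fraction` of people, when individual reproductive numbers follow a
    gamma distribution with mean `r0` and shape (dispersion) `k`."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) has mean alpha * beta, so scale = r0 / k.
    individual_r = sorted(
        (rng.gammavariate(k, r0 / k) for _ in range(n)), reverse=True
    )
    top_count = int(n * top_fraction)
    return sum(individual_r[:top_count]) / sum(individual_r)


if __name__ == "__main__":
    # Illustrative dispersion values: smaller k concentrates transmission
    # in fewer people.
    for k in (0.1, 0.5, 1.0, 10.0):
        print(f"k={k}: top 30% account for {share_from_top(2.5, k):.0%}")
```

Comparing the printed shares across values of k shows how the same average reproductive number can hide very different patterns of who does the transmitting.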