Studying COVID-19 Spread Through Spatiotemporal Mapping and Phylogenetics – OSCAR Celebration of Student Scholarship and Impact

Author(s): Steven Tai

Mentor(s): Hamdi Kavak, Computational & Data Sciences Department

Abstract

Several studies have examined the spread of COVID-19, alongside the construction of its phylogenetic tree. Phylogenetic trees let us view how strains branch from the root, where each strain comes from, and how strains are connected to each other. The relationship between phylogeny and disease spread is not explicit, but combining the two may reveal underlying information that helps researchers gain more insight. GISAID provides large datasets for various diseases, including sample data for COVID-19 cases, each with location information at varying levels of detail and a strain identifier. GISAID also provides a large phylogenetic tree with the same strain identifiers on its leaf edges. Sample, tree, coordinate, and other metadata can be combined through data-frame merging and then mapped. This relationship can tell researchers not only where strains are but also where they came from.

Audio Transcript

Hello, welcome to my presentation!

My name is Steven Tai, and I am an undergraduate biology major with a concentration in bioinformatics. My research project for OSCAR is about studying the spread of COVID-19 using spatiotemporal mapping and phylogenetics.

My mentor is Dr. Hamdi Kavak, and he specializes in modeling and simulation, machine/statistical learning, and network science.

For this presentation, I will give a brief overview of the project and go over some key definitions.

Next, I will discuss the methods and resources that I’ve used.

Then, I’ll show results and conclusions.

Finally, I’ll discuss any issues and possible solutions.

Let’s start with an overview of the project.

Researchers have been coming up with ways of understanding the spread of the disease ever since it started, and many have used different methods to predict and track the spread.
As the virus mutated, a phylogenetic tree was constructed to represent its strains.

For this research, spatiotemporal data and mapping can be combined with the relationships in the phylogenetic tree to gain insight into not only where a strain is but also where it came from.

The data for this project comes from the GISAID Initiative, an organization dedicated to providing rapid access to large datasets for all influenza and, most recently, COVID-19.

GISAID also provides an updated phylogenetic tree in Newick format.

Next, we’ll go over the methods.

Four steps were taken to preprocess the data, and I’ll go over each of them.

Only samples from human hosts were used.
The data was also filtered for complete dates, with year, month, and day.
The phylogenetic tree was also downloaded from GISAID.

Duplicate entries and samples not from human hosts were then filtered out again using a Python script.
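
The filtering step above could be sketched roughly as follows. This is a minimal illustration using pandas, not the script from the project: the column names `strain`, `host`, and `date` and the `YYYY-MM-DD` date layout are assumptions made for the example.

```python
import pandas as pd


def filter_samples(samples: pd.DataFrame) -> pd.DataFrame:
    """Keep human-host samples with complete dates and drop duplicate strains."""
    # Hypothetical column names: "strain", "host", "date".
    is_human = samples["host"].str.strip().str.lower() == "human"
    # A complete date has year, month, and day; partial dates such as
    # "2020-03" do not match the full pattern and are dropped.
    has_full_date = samples["date"].str.fullmatch(r"\d{4}-\d{2}-\d{2}", na=False)
    kept = samples[is_human & has_full_date]
    # Keep the first row seen for each strain identifier.
    return kept.drop_duplicates(subset="strain").reset_index(drop=True)
```

Filtering on the strain identifier rather than on whole rows is a design choice for the sketch; it treats two records with the same strain name as duplicates even if other fields differ.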

The geopy package for Python was used to find the coordinates of the geolocations.

A list of unique geo_location values from our dataset was passed into the script,
and the output gives the geolocation names with matching coordinates.

Names that could not be matched were collected in an error list, which was manually corrected and fed back into the script, while unsalvageable names were discarded.

The final corrected list was then saved into a dataframe.
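
The geocoding loop with its error list might look something like the sketch below. The presentation only says that geopy was used, so the choice of the Nominatim geocoder, the `user_agent` string, and the helper names `geocode_names` and `make_nominatim_lookup` are all assumptions for illustration. The lookup is passed in as a function so the loop can be tried without network access.

```python
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]


def geocode_names(names, lookup: Callable[[str], Optional[Coordinates]]):
    """Split unique location names into resolved coordinates and an error list."""
    resolved, errors = {}, []
    for name in sorted(set(names)):
        try:
            coords = lookup(name)
        except Exception:
            coords = None  # treat a service failure like a missing match
        if coords is None:
            errors.append(name)  # to be corrected by hand and retried
        else:
            resolved[name] = coords
    return resolved, errors


def make_nominatim_lookup(user_agent: str = "covid-mapping-sketch"):
    """Build a lookup backed by geopy's Nominatim geocoder (assumed choice)."""
    # Imported lazily so geocode_names can be used without geopy installed.
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    # Wait at least one second between requests to the public service.
    geocode = RateLimiter(Nominatim(user_agent=user_agent).geocode, min_delay_seconds=1)

    def lookup(name: str) -> Optional[Coordinates]:
        location = geocode(name)
        return None if location is None else (location.latitude, location.longitude)

    return lookup
```

Calling `geocode_names(unique_names, make_nominatim_lookup())` would return the matched coordinates together with the names that need manual correction.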

An inner join was then performed between the sample data and the coordinate data, matching on the unique geo_location names.

This returns only the records whose geo_location appears in both tables and excludes everything else.
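
A small pandas sketch of the inner join described above. The rows and coordinate values are made-up placeholders, and the column names other than `geo_location` are assumptions.

```python
import pandas as pd

# Toy sample table: one row per sample (placeholder values).
samples = pd.DataFrame({
    "strain": ["s1", "s2", "s3"],
    "geo_location": ["Virginia", "Ontario", "Atlantis"],
})

# Toy coordinate table: one row per unique location (approximate placeholders).
coordinates = pd.DataFrame({
    "geo_location": ["Virginia", "Ontario", "Bavaria"],
    "latitude": [37.5, 50.0, 48.9],
    "longitude": [-78.6, -85.0, 11.4],
})

# how="inner" keeps only geo_location values present in both frames;
# validate="many_to_one" raises if a location appears twice in coordinates.
located = samples.merge(coordinates, on="geo_location", how="inner", validate="many_to_one")
```

Here `s3` is dropped because "Atlantis" has no coordinates, and "Bavaria" is dropped because no sample refers to it.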

The data was then uploaded to a PostgreSQL database hosted on a local machine, allowing for easy access to the dataset.

The phylogenetic tree was parsed from Newick format and converted into a dataframe so that it could be inner joined with the dataset.
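
To illustrate how a Newick string can become table rows, here is a minimal hand-written parser. It is a sketch under simplifying assumptions (no quoted labels, no bracketed comments), not the parser used in the project; the function name `newick_to_rows` and the row fields are hypothetical. Libraries such as Biopython's `Bio.Phylo` can also read Newick files.

```python
def newick_to_rows(newick: str):
    """Return one dict per node: node_id, parent_id, label, length, is_leaf."""
    text = newick.strip().rstrip(";")
    rows = []
    pos = 0

    def parse_node(parent_id):
        nonlocal pos
        row = {"node_id": len(rows), "parent_id": parent_id,
               "label": "", "length": None, "is_leaf": True}
        rows.append(row)
        if pos < len(text) and text[pos] == "(":
            row["is_leaf"] = False
            pos += 1  # consume "("
            while True:
                parse_node(row["node_id"])
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses in Newick string")
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # this node's children are complete
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # An optional label follows the children (or is the whole leaf).
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        row["label"] = text[start:pos].strip()
        # An optional branch length follows a colon.
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in "(),":
                pos += 1
            row["length"] = float(text[start:pos])

    # Recursion depth grows with tree depth; a very deep tree would need
    # an iterative version instead.
    parse_node(None)
    return rows
```

The returned list can be passed to `pandas.DataFrame(rows)`, and the leaf rows can then be joined to the sample table on the strain identifier stored in `label`.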

Here is the number of samples at the start versus the end of cleaning and joining.
We started with 10,697,090 downloaded samples.
After preprocessing and joining, the number came to 10,524,757 samples.
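
As a quick check on the two counts above, the number of removed samples and the retained share can be computed directly. Only the two figures come from the presentation; the variable names are illustrative.

```python
downloaded = 10_697_090  # samples downloaded
retained = 10_524_757    # samples left after preprocessing and joining

removed = downloaded - retained
retained_share = retained / downloaded

print(f"removed:  {removed:,}")          # removed:  172,333
print(f"retained: {retained_share:.2%}")  # retained: 98.39%
```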

Finally, we can see the resulting mapping.

This is a preview of the mapping of the sample data using a custom program.
The biggest issue is that the program can only take up to 10,000 samples due to memory limitations.
Also, more inference of transmission using internal nodes is needed.

I’ll now discuss some issues and solutions.

Internal nodes represent inferred branching within the tree, but the location of each branching event cannot be observed directly.

One way to estimate these locations is to use an ancestral character state reconstruction (ACSR) algorithm.

One such algorithm is ACCTRAN, which performs two passes:
a downward pass that assigns character states from the tips to the root,
and an upward pass that goes from root to tip, reassessing the character states.
By assigning the character states using the locations of each known tip, internal nodes can be annotated with the most likely location predicted by the algorithm.
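
The two passes described above can be illustrated with a simplified Fitch-style parsimony sketch. This is not phangorn's `acctran` function and not a complete ACCTRAN implementation: ties are broken alphabetically rather than by ACCTRAN's rule, and nodes with more than two children are handled with the same intersection-or-union step, which is a simplification. The tree layout (a dict mapping each internal node to its children) and the function name are assumptions.

```python
def fitch_two_pass(children, tip_states, root):
    """Assign one state (e.g. a location) to every node of a rooted tree."""
    # Pre-order listing: every parent appears before its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Pass 1, tips to root: build the set of candidate states per node.
    candidates = {}
    for node in reversed(order):  # descendants are handled before parents
        kids = children.get(node, [])
        if not kids:
            candidates[node] = {tip_states[node]}
            continue
        kid_sets = [candidates[k] for k in kids]
        shared = set.intersection(*kid_sets)
        # Use the states all children agree on, otherwise keep every option.
        candidates[node] = shared if shared else set.union(*kid_sets)

    # Pass 2, root to tips: settle on one state per node.
    parent = {kid: node for node, kids in children.items() for kid in kids}
    assigned = {}
    for node in order:
        options = candidates[node]
        inherited = assigned.get(parent.get(node))
        # Keep the parent's state when possible; otherwise break the tie
        # alphabetically (an arbitrary choice made for this sketch).
        assigned[node] = inherited if inherited in options else min(options)
    return assigned
```

For a tree where tips A and C were sampled in "VA" and tip B in "NY", with A and B as sisters, the sketch assigns "VA" to both the root and the internal node joining A and B.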

Here is a diagram of a proposed pipeline for finding internal node locations.

It takes as input a Newick tree and a list of locations, each paired with its leaf edge.

Using phangorn’s acctran function, it annotates the internal nodes with the most likely location based on character states.

This approach isn’t without its issues.
When running the full tree, the memory usage was too large.
A possible solution is to split the tree into smaller subtrees and run them separately through ACCTRAN.
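
One possible way to split a tree, sketched here only as an illustration, is to walk down from the root and cut off a clade as soon as it contains no more than a chosen number of tips. The presentation does not say how the subtrees were created, so the function name, the `max_tips` parameter, and the dict-of-children tree layout are assumptions.

```python
def split_into_clades(children, root, max_tips):
    """Return the roots of the largest clades holding at most max_tips tips."""
    if max_tips < 1:
        raise ValueError("max_tips must be at least 1")

    # Pre-order listing so that reversed(order) visits descendants first.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Count the tips below every node.
    tips = {}
    for node in reversed(order):
        kids = children.get(node, [])
        tips[node] = 1 if not kids else sum(tips[k] for k in kids)

    # Walk down from the root and cut as soon as a clade is small enough.
    clade_roots, stack = [], [root]
    while stack:
        node = stack.pop()
        if tips[node] <= max_tips:
            clade_roots.append(node)
        else:
            stack.extend(children[node])
    return clade_roots
```

Each returned node is the root of a subtree that could be processed on its own. Clades cut this way no longer see their sister clades, so a reconstruction run on them separately loses that context.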

It was estimated that creating subtrees and running them through ACCTRAN would take about 60 hours in total.

Another concern is that breaking down the tree decreases the accuracy of the locations, since a subtree could be cut off from its sister clades at different points, preventing comparison between edges. However, given the limited resources, this may be a trade-off we must accept.

Other algorithms could also be tested and could be explored in later research.

That concludes my presentation. I would like to give a big thank you to Dr. Hamdi Kavak for providing excellent mentorship throughout the program. Thank you so much for listening. I hope this presentation gives insight into what I’ve been working on, and I hope to continue this research in the future.
