Author(s): Steven Tai
Mentor(s): Hamdi Kavak, Computational & Data Sciences Department
Abstract: My name is Steven Tai, and I am an undergraduate biology major with a concentration in bioinformatics. My research project for OSCAR studies the spread of COVID-19 using spatiotemporal mapping and phylogenetics.
My mentor is Dr. Hamdi Kavak, and he specializes in modeling and simulation, machine/statistical learning, and network science.
For this presentation, I will give a brief overview of the project and go over some key definitions.
Next, I will discuss the methods and resources that I’ve used.
Then, I’ll show results and conclusions.
Finally, I’ll discuss any issues and possible solutions.
Let’s start with an overview of the project.
Researchers have been developing ways to understand the spread of the disease ever since it started, and many have used different methods to predict and track that spread.
As the virus mutates, a phylogenetic tree is constructed to represent the relationships between its strains.
For this research, we combine spatiotemporal data and mapping with the relationships in the phylogenetic tree to gain insight into not only where a strain is, but also where it came from.
The data for this project comes from the GISAID Initiative, an organization dedicated to providing rapid access to large datasets for influenza and, most recently, COVID-19.
They also provide an up-to-date phylogenetic tree in Newick format.
Next, we’ll go over the methods.
Four steps were taken to preprocess the data, and I’ll go over each of them.
Only samples that contain human strains were used.
Data was also filtered for complete dates, with year, month, and day.
The phylogenetic tree was also downloaded from GISAID.
Duplicate entries and non-human samples were filtered out again using a Python script.
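As a rough illustration of this filtering step, here is a minimal pandas sketch. The column names (host, date, strain), the YYYY-MM-DD date format, and the example rows are all assumptions made for illustration; they are not the actual GISAID field names or the project's script.

```python
import pandas as pd


def clean_samples(df: pd.DataFrame) -> pd.DataFrame:
    """Keep human samples with complete dates and drop duplicate strains."""
    # Hypothetical column names: "host", "date", "strain".
    is_human = df["host"].str.strip().str.lower() == "human"
    # A complete date has year, month, and day, e.g. 2021-03-15.
    has_full_date = df["date"].str.match(r"^\d{4}-\d{2}-\d{2}$", na=False)
    cleaned = df[is_human & has_full_date]
    # Keep the first occurrence of each strain name.
    return cleaned.drop_duplicates(subset="strain").reset_index(drop=True)


# Made-up example rows.
example = pd.DataFrame(
    {
        "strain": ["s1", "s1", "s2", "s3", "s4"],
        "host": ["Human", "Human", "Human", "Bat", "Human"],
        "date": ["2021-03-15", "2021-03-15", "2021-03", "2021-01-01", "2020-12-31"],
    }
)
cleaned_example = clean_samples(example)
print(cleaned_example)
```

In this toy input, the duplicate s1 row, the sample with only a year and month, and the bat sample are dropped.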
The geopy package for Python was used to find the coordinates of the geolocations.
A list of unique geo_location values from our dataset was input into the script, and the output gives the geolocation names with matching coordinates.
Names that failed to resolve were collected in an error list, which was manually corrected and fed back into the script, while unsalvageable names were discarded.
The final corrected list was then saved into a dataframe.
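The lookup-and-retry loop described here can be sketched as follows. This is only an illustration: the helper name geocode_names, the stand-in gazetteer, and its coordinates are hypothetical, and the commented geopy usage assumes the Nominatim geocoder, which the presentation does not specify.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


def geocode_names(
    names: Iterable[str],
    lookup: Callable[[str], Optional[Coords]],
) -> Tuple[Dict[str, Coords], List[str]]:
    """Return (name -> coordinates, names that could not be resolved)."""
    found: Dict[str, Coords] = {}
    errors: List[str] = []
    for name in names:
        try:
            coords = lookup(name)
        except Exception:  # e.g. a timeout; treated as unresolved in this sketch
            coords = None
        if coords is None:
            errors.append(name)
        else:
            found[name] = coords
    return found, errors


# With geopy, a lookup could be built like this (not run here, needs network):
#   from geopy.geocoders import Nominatim
#   geolocator = Nominatim(user_agent="placeholder-app-name")
#   def lookup(name):
#       loc = geolocator.geocode(name)
#       return None if loc is None else (loc.latitude, loc.longitude)

# Offline stand-in for the geocoder, with made-up approximate coordinates.
fake_gazetteer = {"USA / Virginia": (37.4, -78.7), "Japan / Tokyo": (35.7, 139.7)}
found, errors = geocode_names(
    ["USA / Virginia", "Japan / Tokoy"], fake_gazetteer.get
)

# Manually corrected names are fed back in; anything still failing is discarded.
corrections = {"Japan / Tokoy": "Japan / Tokyo"}
fixed, still_bad = geocode_names(
    [corrections[name] for name in errors], fake_gazetteer.get
)
found.update(fixed)
```

Passing the geocoder in as a function keeps the retry logic testable without network access; a real run would also need to respect the geocoding service's rate limits.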
An inner join was then performed between the sample data and the coordinate data, matching on the unique geo_location names.
This returns only the records whose geolocations appear in both tables and excludes everything else.
The data was then uploaded to a PostgreSQL database hosted on a local machine, allowing for easy access to the dataset.
The phylogenetic tree was parsed from Newick format and converted into a dataframe, which was then inner joined with the dataset.
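To make the inner join concrete, here is a small pandas sketch. The table contents and the column names other than geo_location are invented for illustration, and the coordinates are rough values only.

```python
import pandas as pd

# Made-up sample rows; "Atlantis" has no match in the coordinate table.
samples = pd.DataFrame(
    {
        "strain": ["s1", "s2", "s3"],
        "geo_location": ["USA / Virginia", "Japan / Tokyo", "Atlantis"],
    }
)

# Made-up coordinate rows; "Kenya / Nairobi" has no match in the sample table.
coords = pd.DataFrame(
    {
        "geo_location": ["USA / Virginia", "Japan / Tokyo", "Kenya / Nairobi"],
        "lat": [37.4, 35.7, -1.3],
        "lon": [-78.7, 139.7, 36.8],
    }
)

# how="inner" keeps only geo_location values present in both tables.
joined = samples.merge(coords, on="geo_location", how="inner")
print(joined)
```

Only s1 and s2 survive the join, since their locations are the only ones present in both tables.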
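One way to turn a Newick string into rows suitable for a dataframe is to emit one parent/child edge per row. The parser below is a simplified sketch written for this illustration, not the parser used in the project: it ignores quoted labels and bracketed comments, invents names such as internal_1 for unlabeled nodes, and uses recursion, so it is only suitable for small trees.

```python
from typing import Dict, List, Optional, Tuple


def newick_to_edges(newick: str) -> List[Dict[str, object]]:
    """Parse a simple Newick string into parent/child edge rows."""
    text = newick.strip().rstrip(";")
    rows: List[Dict[str, object]] = []
    pos = 0
    unnamed = 0

    def read_until(stops: str) -> str:
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos].strip()

    def parse_node() -> Tuple[str, Optional[float], bool]:
        nonlocal pos, unnamed
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if pos >= len(text) or text[pos] not in ",)":
                    raise ValueError("malformed Newick string")
                closing = text[pos] == ")"
                pos += 1  # consume "," or ")"
                if closing:
                    break
        label = read_until(",():")
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":"
            length = float(read_until(",()"))
        if not label:
            unnamed += 1
            label = f"internal_{unnamed}"  # hypothetical naming scheme
        for child, child_length, child_is_tip in children:
            rows.append(
                {
                    "parent": label,
                    "child": child,
                    "branch_length": child_length,
                    "is_tip": child_is_tip,
                }
            )
        return label, length, not children

    parse_node()
    return rows


edges = newick_to_edges("((A:0.1,B:0.2)X:0.3,C:0.4);")
for row in edges:
    print(row)
```

The resulting list of dictionaries can be passed directly to pandas.DataFrame, after which the tip rows can be joined to the sample table on the strain name.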
These are the sample counts at the start versus the end of cleaning and joining.
We started with 10,697,090 downloaded samples.
After preprocessing and joining, the number comes to 10,524,757 samples.
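For a sense of scale, the difference between the two counts can be worked out directly. The variable names below are just for illustration; the two numbers are the ones reported above.

```python
downloaded = 10_697_090  # samples at the start
retained = 10_524_757    # samples after preprocessing and joining

removed = downloaded - retained
retained_pct = 100 * retained / downloaded

print(f"removed {removed:,} samples; retained {retained_pct:.2f}% of the download")
```

In other words, cleaning and joining removed 172,333 samples, or roughly 1.6% of the original download.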
Finally, we can see the resulting mapping.
This is a preview of the mapping of the sample data using a custom program.
The biggest issue is that the program can only take up to 10,000 samples due to memory limitations.
Also, more inference of transmission using internal nodes is needed.
I’ll now discuss some issues and solutions.
Internal nodes represent inferred branching events within the tree, but the location of each branching event is not directly known.
One way to estimate these locations is to use an ancestral character state reconstruction (ACSR) algorithm.
One such algorithm is ACCTRAN, which performs a downward pass that assigns character states from the tips to the root, followed by an upward pass from root to tip that reassesses the character states.
By assigning the character states using the locations of each known tip, internal nodes can be annotated with the most likely location predicted by the algorithm.
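The two passes can be illustrated with a small parsimony sketch in Python. This is a simplified Fitch-style reconstruction in the spirit of ACCTRAN, written for illustration only; it is not phangorn's implementation. It assumes a bifurcating tree stored as a hypothetical dictionary of children, breaks ties alphabetically (an arbitrary choice), and uses made-up tip locations.

```python
from typing import Dict, List, Set


def two_pass_locations(
    children: Dict[str, List[str]],
    tip_location: Dict[str, str],
    root: str,
) -> Dict[str, str]:
    """Assign one location to every node using a downward and an upward pass."""
    prelim: Dict[str, Set[str]] = {}

    # Downward pass (tips to root): build a preliminary state set per node.
    def downpass(node: str) -> None:
        kids = children.get(node, [])
        if not kids:
            prelim[node] = {tip_location[node]}
            return
        for kid in kids:
            downpass(kid)
        kid_sets = [prelim[kid] for kid in kids]
        shared = set.intersection(*kid_sets)
        # Use the shared states if any exist, otherwise the union.
        prelim[node] = shared if shared else set.union(*kid_sets)

    assigned: Dict[str, str] = {}

    # Upward pass (root to tips): keep the parent's state when the node's
    # preliminary set allows it, so changes are placed close to the root.
    def uppass(node: str, parent_state: object) -> None:
        options = prelim[node]
        if parent_state in options:
            assigned[node] = parent_state
        else:
            assigned[node] = min(options)  # arbitrary alphabetical tie-break
        for kid in children.get(node, []):
            uppass(kid, assigned[node])

    downpass(root)
    uppass(root, None)
    return assigned


# Toy tree ((A,B),(C,D)) with made-up tip locations.
children = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
tip_location = {"A": "USA", "B": "USA", "C": "UK", "D": "USA"}
assigned = two_pass_locations(children, tip_location, "root")
print({node: assigned[node] for node in ("root", "n1", "n2")})
```

In this toy tree all three internal nodes are assigned USA, which implies a single location change on the branch leading to tip C.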
Here is a diagram of a proposed pipeline for finding internal node locations.
It takes as input a Newick tree and a list of locations, each paired with its leaf edge.
Using phangorn’s acctran function, it annotates the internal nodes with the most likely location based on character states.
This approach isn’t without its issues.
When running the full tree, the memory usage was too large.
A possible solution is to split the tree down into smaller ones and run them separately through ACCTRAN.
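One possible way to split the tree is to cut it wherever a clade becomes small enough to process on its own. The sketch below is an assumption about how such a split could work, not the procedure used in the project: the function name, the dictionary-of-children tree representation, and the max_tips threshold are all hypothetical, and the recursion would need to be made iterative for a tree with millions of tips.

```python
from typing import Dict, List


def split_into_subtrees(
    children: Dict[str, List[str]],
    root: str,
    max_tips: int,
) -> List[str]:
    """Return roots of the largest clades that have at most max_tips tips."""
    if max_tips < 1:
        raise ValueError("max_tips must be at least 1")

    tip_count: Dict[str, int] = {}

    # Count the tips below every node.
    def count(node: str) -> int:
        kids = children.get(node, [])
        tip_count[node] = 1 if not kids else sum(count(kid) for kid in kids)
        return tip_count[node]

    subtree_roots: List[str] = []

    # Walk down from the root and stop at the first clade that is small enough.
    def descend(node: str) -> None:
        if tip_count[node] <= max_tips:
            subtree_roots.append(node)
        else:
            for kid in children[node]:
                descend(kid)

    count(root)
    descend(root)
    return subtree_roots


# Toy tree ((A,B),(C,(D,E))) split into clades of at most two tips.
toy_children = {
    "root": ["n1", "n2"],
    "n1": ["A", "B"],
    "n2": ["C", "n3"],
    "n3": ["D", "E"],
}
subtree_roots = split_into_subtrees(toy_children, "root", max_tips=2)
print(subtree_roots)
```

Each returned node is the root of a subtree that could be run through ACCTRAN separately. Note that tip C ends up on its own here, cut off from its sister clade.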
It was estimated that creating subtrees and running them through ACCTRAN takes about 60 hours in total.
Another concern is that breaking down the tree decreases the accuracy of the locations, since a subtree could be cut off from its sister clades at different points, preventing comparison between edges. However, given the limited resources, this may be a trade-off we must accept.
Other algorithms could also be tested and revisited in later research.
That concludes my presentation. I would like to give a big thank you to Dr. Hamdi Kavak for providing excellent mentorship throughout the program. Thank you so much for listening. I hope this presentation gives insight into what I’ve been working on, and I hope to continue this research in the future.