Covid-19 is the disease caused by SARS-CoV-2, a coronavirus closely related to the virus behind severe acute respiratory syndrome (SARS). It spreads mainly through respiratory droplets (saliva or mucus) from an infected person. Common symptoms include fever, cough, sore throat, headache, muscle aches, and fatigue. Several problems related to the Covid-19 pandemic can be approached using machine learning and data science techniques. In this blog post, we will look at some Covid-19 use cases that can be addressed with machine learning classification and clustering techniques.
What Covid-19 datasets are publicly available?
One of the datasets available for studying Covid-19 is the GISAID data (https://www.gisaid.org/), which holds millions of viral genome sequences of SARS-CoV-2, the virus that causes Covid-19. The data volume is high: the SARS-CoV-2 genome has around 30,000 nucleotides, and more than 2.5 million such sequences are available in GISAID alone. In March 2020, when the World Health Organization (WHO) declared Covid-19 a pandemic, only a few thousand sequences were available. GISAID collects sequences from all over the world. They come from heterogeneous sequencing technologies and centers, which leads to varying levels of veracity.
The genomic sequence of a virus encodes its functions and shapes traits such as virulence and transmissibility. Variation in this genomic sequence is what defines the different variants of SARS-CoV-2, such as Alpha, Delta, and Gamma.
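Before a clustering or classification model can use a genomic sequence, the sequence has to be turned into numbers. One common representation is a k-mer frequency vector, which counts every overlapping substring of length k. The sketch below is only an illustration under stated assumptions: the function name `kmer_vector`, the value of k, and the toy sequence are made up for this post and are not taken from GISAID or any specific study.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=3):
    """Return the normalized k-mer frequency vector of a nucleotide sequence."""
    sequence = sequence.upper()
    # Fixed feature order: all 4**k k-mers over A, C, G, T, in lexicographic order.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows with ambiguous bases (such as N) are not in the vocabulary, so they are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Hypothetical toy fragment, not a real SARS-CoV-2 sequence.
toy_sequence = "ATGTTTGTTTTT"
vector = kmer_vector(toy_sequence, k=2)
print(len(vector))  # one feature per possible 2-mer
```

Every sequence maps to a vector of the same length (4 to the power k), whatever its original length, which is what most clustering and classification models expect as input.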
The following Covid-19 datasets are also publicly available:
- Covid-19 datasets from data.world
- Covid-19 datasets from the European Centre for Disease Prevention and Control
- Covid-19 datasets from Johns Hopkins
- Covid-19 datasets from the National Institutes of Health
- Covid-19 datasets from Kaggle (Dataset 1, Dataset 2)
What are some machine learning use cases related to the Covid-19 pandemic?
The following are a few machine learning use cases that can help deal with the Covid-19 pandemic:
- Identify emerging variants: A clustering model (such as K-means), retrained on the genomic sequence data at regular intervals, could identify a new and rapidly emerging variant as a cluster that grows abnormally quickly, allowing scientists to focus on that cluster. Variants such as Alpha, Delta, and Gamma could in principle be surfaced in this way.
- Classify the spread of variants: A classification model (such as logistic regression, Naive Bayes, etc.), on the other hand, could help track the spread of known variants in new municipalities, regions, countries, and continents. For example, the USA had a wave of the Alpha variant from the UK in early 2021, and later, a wave of the Delta variant from India and/or via other intermediaries, such as the UK. Classifying such patterns of spread is important because it can reveal information about the underlying transmission networks between different countries, or even between parts of different countries. It can also help compensate for uneven veracity in the data, such as the widely varying degree to which different countries are represented in the sequencing data because of sampling bias.
- Identify emerging clusters of Covid-19 infections: Covid-19 tends to spread in local outbreaks. By analyzing the genomic sequence data, one could identify new clusters of Covid-19 infections in terms of geographical location or transmission network.
- Identify vaccine targets: Covid-19 can be prevented with vaccines, which are expensive and time-consuming to develop. Vaccines would ideally target critical SARS-CoV-2 proteins that cannot mutate without compromising the virus’ ability to survive in human hosts. In machine learning terms, these protein targets can be thought of as the virus’ “features”. Machine learning models trained on a Covid-19 dataset could help identify these features. This would give Covid-19 vaccine developers an efficient way to prioritize which SARS-CoV-2 sequences are most important for vaccine target identification and design.
- Identify genomic sequences that are resistant to Covid-19 drugs: Covid-19 is treated with several antiviral drugs. Machine learning models can identify SARS-CoV-2 sequences that carry resistance-associated features. This information can be used to select viruses for further laboratory studies and drug development, rather than using SARS-CoV-2 sequences that lack these resistance features.
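As a rough illustration of the first use case, the sketch below flags clusters whose share of sequences grows quickly from one week to the next. It assumes the cluster labels have already been produced by a clustering model (such as K-means) and that the same label refers to the same cluster in both weeks. The function names, the thresholds `min_ratio` and `min_share`, and the weekly counts are all hypothetical.

```python
from collections import Counter


def cluster_shares(labels):
    """Fraction of sequences assigned to each cluster in one time window."""
    if not labels:
        raise ValueError("each time window needs at least one label")
    counts = Counter(labels)
    return {cluster: n / len(labels) for cluster, n in counts.items()}


def fast_growing_clusters(labels_by_week, min_ratio=2.0, min_share=0.05):
    """Flag clusters whose share grew by at least min_ratio over the last two weeks."""
    if len(labels_by_week) < 2:
        raise ValueError("need at least two weeks of labels")
    previous = cluster_shares(labels_by_week[-2])
    current = cluster_shares(labels_by_week[-1])
    flagged = []
    for cluster, share in sorted(current.items()):
        if share < min_share:
            continue  # too few sequences to judge growth reliably
        old_share = previous.get(cluster, 0.0)
        # A cluster that was absent last week counts as fast growing.
        if old_share == 0.0 or share / old_share >= min_ratio:
            flagged.append(cluster)
    return flagged


# Hypothetical labels: cluster 1 goes from 10% to 30% of sequences.
week_1 = [0] * 90 + [1] * 10
week_2 = [0] * 70 + [1] * 30
print(fast_growing_clusters([week_1, week_2]))
```

Working with shares instead of raw counts keeps the comparison meaningful when the number of sequences submitted per week changes, although it does not correct for sampling bias between countries.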
When training classification/clustering models on genomic sequence data, feature selection and feature extraction are key because the number of features is so large: each genome has around 30,000 positions. Supervised and unsupervised methods could help improve the overall predictive performance of the models. Lasso regression can drive the weights of uninformative features to zero, ridge regression shrinks weights without removing features, and principal component analysis (PCA) extracts a smaller set of combined features. One can also try kernel methods to capture non-linear structure, although they have their downsides: with a data volume this large, they tend to scale poorly with the number of sequences.
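Before applying any of these methods, a much cruder unsupervised filter can already shrink the feature space: dropping alignment positions where almost every sequence has the same base. The sketch below is a simple stand-in and not an implementation of lasso, ridge, or PCA. It assumes the sequences are already aligned to the same length, and the function names, the `min_minor_fraction` threshold, and the toy alignment are hypothetical.

```python
from collections import Counter


def variable_positions(aligned_sequences, min_minor_fraction=0.01):
    """Return alignment columns where non-majority bases are frequent enough."""
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    n = len(aligned_sequences)
    kept = []
    for position in range(length):
        column = Counter(seq[position] for seq in aligned_sequences)
        majority_count = column.most_common(1)[0][1]
        # Fraction of sequences that differ from the most common base.
        minor_fraction = 1 - majority_count / n
        if minor_fraction >= min_minor_fraction:
            kept.append(position)
    return kept


def select_columns(aligned_sequences, positions):
    """Keep only the given columns of each aligned sequence."""
    return ["".join(seq[p] for p in positions) for seq in aligned_sequences]


# Hypothetical toy alignment, not real SARS-CoV-2 data.
alignment = ["ACGTAC", "ACGTAC", "ACTTAC", "ACTTAA"]
positions = variable_positions(alignment, min_minor_fraction=0.2)
print(positions)
print(select_columns(alignment, positions))
```

The remaining columns can then be encoded numerically and passed to a supervised method such as lasso or an unsupervised one such as PCA.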
The WHO declared Covid-19 a pandemic in March 2020. The variants of SARS-CoV-2 can be difficult to tell apart and track, but machine learning and data science techniques like clustering and classification models are helping experts make sense of Covid-19 genomic sequence data. This post will be updated from time to time with more Covid-19 machine learning use cases. If you want to learn more about the different techniques, please feel free to reach out.