Lung diseases, including chronic obstructive pulmonary disease (COPD), are a leading cause of death worldwide. Early detection and treatment are critical for improving patient outcomes, but diagnosing lung diseases can be challenging. Machine learning (ML) models are transforming the field of pulmonology by enabling faster and more accurate prediction of lung diseases, including COPD. In this blog, we’ll discuss the challenges of detecting and predicting lung diseases using machine learning, the clinical datasets used in research, and the supervised learning method used for building the models.
Challenges in Detecting Lung Diseases with Machine Learning
Detecting and predicting lung diseases using machine learning can be challenging due to a lack of labeled data. Supervised learning requires large datasets of labeled samples, but creating labels for medical data is generally time-consuming and expensive. In the case of lung diseases, creating labels requires medical experts to review and interpret clinical measurements, such as spirograms. Additionally, lung diseases like COPD often go undiagnosed, so many individuals who have the disease are not labeled as having it. Together, these challenges make it difficult to build labeled datasets for training ML models.
Lung Disease Prediction Dataset: Clinical Data, Spirogram
The UK Biobank is a large national effort that has created a publicly available dataset of petabytes of imaging, metabolic tests, and medical records spanning 500,000 individuals. The dataset provides researchers with a rich source of data to study the links between environment, genetics, and disease. The following classes of clinical data can be used to train a machine learning classification model for predicting lung diseases, and different types of features can be extracted from each:
- Hospital inpatient record: This dataset includes information about patients who have been admitted to a hospital for treatment of a lung disease or related condition. The data in this dataset may include information such as the patient’s medical history, the type of lung disease they have, the treatments they received, and their outcomes.
- Primary care records: This dataset includes information about patients who have visited their primary care physician for a lung-related issue or for routine check-ups. The data in this dataset may include information such as the patient’s medical history, lung function test results, smoking status, and medication history.
- Self-report records: This dataset includes information provided by patients themselves about their health status and any symptoms they may be experiencing. The data in this dataset may include information such as the patient’s smoking history, exposure to environmental pollutants, and family history of lung disease.
- Spirogram: A spirogram is a clinical measurement used to assess lung function and diagnose lung diseases, including COPD. It is a graphical representation of the volume of air exhaled over time, measured using a device called a spirometer. The image (courtesy Google AI blog) displays spirograms from lung function tests (to capture COPD status), including a forced expiratory volume-time spirogram on the left, a forced expiratory flow-time spirogram in the middle, and an interpolated forced expiratory flow-volume spirogram on the right. The spirogram profiles of individuals without COPD differ from those of individuals with the disease.
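The three spirogram views mentioned above are related by simple transformations: flow is the rate of change of volume over time, and the flow-volume view pairs each flow value with the volume exhaled so far. The following is a minimal sketch of those transformations, not the processing used in the research. The exponential volume curve, its parameters, and all function names are assumptions made for illustration; a real forced exhalation also has a brief rise to peak flow that this toy curve ignores.

```python
import math


def synthetic_volume_time(plateau=4.0, tau=0.5, duration=6.0, dt=0.01):
    """Hypothetical exhalation curve V(t) = plateau * (1 - exp(-t / tau)), in litres."""
    n_samples = int(round(duration / dt)) + 1
    times = [i * dt for i in range(n_samples)]
    volumes = [plateau * (1.0 - math.exp(-t / tau)) for t in times]
    return times, volumes


def flow_from_volume(times, volumes):
    """Flow-time curve: forward finite difference of volume over time (L/s)."""
    return [
        (volumes[i + 1] - volumes[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


def interpolate_flow_volume(volumes, flows, n_points=50):
    """Resample flow onto an evenly spaced volume grid by linear interpolation."""
    vols = volumes[: len(flows)]  # flows[i] is paired with volumes[i]
    v_min, v_max = vols[0], vols[-1]
    grid = [v_min + (v_max - v_min) * k / (n_points - 1) for k in range(n_points)]
    resampled = []
    j = 0
    for v in grid:
        # Advance to the segment [vols[j], vols[j + 1]] that contains v.
        while j < len(vols) - 2 and vols[j + 1] < v:
            j += 1
        v0, v1 = vols[j], vols[j + 1]
        w = (v - v0) / (v1 - v0) if v1 > v0 else 0.0
        resampled.append(flows[j] + w * (flows[j + 1] - flows[j]))
    return grid, resampled


times, volumes = synthetic_volume_time()
flows = flow_from_volume(times, volumes)
volume_grid, flow_on_grid = interpolate_flow_volume(volumes, flows)
print(f"{len(flows)} flow samples resampled to {len(volume_grid)} volume bins")
```

Resampling onto a fixed volume grid gives every individual a flow-volume vector of the same length, which is one convenient way to turn curves of varying duration into fixed-size model inputs.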
Machine Learning Classification Model for Lung Disease Prediction
Machine learning classification models can be used to phenotype lung diseases, specifically COPD, accurately and at scale. Clinical records and spirogram data can be used to train the classification model. The image (courtesy: Google AI blog) shows the architecture for training a machine learning model that outputs a risk or liability score indicating whether a person has COPD. A similar architecture can be used to train other types of models for predicting other lung diseases. Note that the image represents a supervised learning method, which requires each training sample to be associated with a label.
COPD is a lung disease characterized by airway inflammation and impeded airflow that can progressively reduce lung function. The current guidelines for determining COPD status from spirograms use only a few specific data points on the curve and apply fixed thresholds to those values. In contrast, the ML models shown in the image above were trained on the entire spirogram curve together with additional clinical data.
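To make the fixed-threshold approach concrete, here is a rough sketch of one such rule. It assumes the commonly cited fixed-ratio criterion, where airflow obstruction is flagged when FEV1 (volume exhaled in the first second) divided by FVC (total volume exhaled) falls below 0.70. That specific threshold comes from general guideline practice rather than from this post, and the synthetic curves and function names are hypothetical. It is an illustration, not a diagnostic tool.

```python
import math


def synthetic_volume_time(plateau, tau, duration=6.0, dt=0.01):
    """Hypothetical exhalation curve V(t) = plateau * (1 - exp(-t / tau)), in litres."""
    n_samples = int(round(duration / dt)) + 1
    times = [i * dt for i in range(n_samples)]
    volumes = [plateau * (1.0 - math.exp(-t / tau)) for t in times]
    return times, volumes


def volume_at(times, volumes, t):
    """Linearly interpolate the exhaled volume at time t (seconds)."""
    for i in range(len(times) - 1):
        if times[i] <= t <= times[i + 1]:
            w = (t - times[i]) / (times[i + 1] - times[i])
            return volumes[i] + w * (volumes[i + 1] - volumes[i])
    raise ValueError("t is outside the recorded time range")


def fev1_fvc_ratio(times, volumes):
    """FEV1 divided by FVC: only two points of the whole curve are used."""
    fev1 = volume_at(times, volumes, 1.0)
    fvc = max(volumes)
    return fev1 / fvc


def flags_obstruction(times, volumes, threshold=0.70):
    """Fixed-threshold rule (assumed value): flag when FEV1/FVC < threshold."""
    return fev1_fvc_ratio(times, volumes) < threshold


# Two made-up curves: a fast exhalation and a slow, obstructed one.
fast_t, fast_v = synthetic_volume_time(plateau=4.0, tau=0.5)
slow_t, slow_v = synthetic_volume_time(plateau=3.0, tau=1.5)
print("fast curve flagged:", flags_obstruction(fast_t, fast_v))
print("slow curve flagged:", flags_obstruction(slow_t, slow_v))
```

The rule reduces hundreds of samples to a single yes/no answer, which is exactly the information loss that training on the full curve tries to avoid.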
Google researchers trained the model (shown in the picture above) to predict COPD status by using a variety of widely available sources of medical record information to create labels, without medical expert review. These labels are noisy and less reliable because of gaps in the medical records and undiagnosed COPD cases. However, the models trained with this data showed high accuracy. The model predictions were treated as a quantitative COPD liability or risk score, which improved the ability to predict COPD outcomes. Classification models were trained to predict a variety of binary COPD outcomes (for example, an individual’s COPD status, or whether they were hospitalized for COPD or died from it). For greater detail, read the Google AI blog post "An ML based approach to better characterize lung diseases".
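The idea of training on noisy, record-derived labels and reading the output as a continuous risk score can be illustrated with a toy example. The sketch below is not the researchers' model: it swaps in a one-feature logistic regression fitted by plain gradient descent, and the cohort, the feature (an FEV1/FVC-like ratio), the 30% prevalence, and the 40% undiagnosed rate are all invented for illustration. In practice, an ML library would normally be used instead of hand-written training code.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_cohort(n=1000, undiagnosed_rate=0.4, seed=0):
    """Return (feature, has_copd, record_label) tuples. All values are synthetic."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        has_copd = rng.random() < 0.3
        # Hypothetical spirogram summary feature (an FEV1/FVC-like ratio).
        feature = rng.gauss(0.55 if has_copd else 0.80, 0.08)
        # Record-derived label: a share of true cases is never diagnosed.
        diagnosed = has_copd and rng.random() >= undiagnosed_rate
        cohort.append((feature, has_copd, 1 if diagnosed else 0))
    return cohort


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean_value = sum(values) / len(values)
    std = math.sqrt(sum((v - mean_value) ** 2 for v in values) / len(values))
    return [(v - mean_value) / std for v in values]


def train_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b


def mean(values):
    return sum(values) / len(values)


cohort = make_synthetic_cohort()
xs = standardize([row[0] for row in cohort])
noisy_labels = [row[2] for row in cohort]
w, b = train_logistic(xs, noisy_labels)

# Treat the predicted probability as a continuous liability (risk) score.
risk = [sigmoid(w * x + b) for x in xs]
undiagnosed = [r for r, row in zip(risk, cohort) if row[1] and row[2] == 0]
healthy = [r for r, row in zip(risk, cohort) if not row[1]]
print(f"mean risk, undiagnosed cases: {mean(undiagnosed):.2f}")
print(f"mean risk, no COPD:           {mean(healthy):.2f}")
```

Even though the undiagnosed individuals carry a negative label during training, their score depends only on the measured feature, so they still tend to receive a higher risk than people without the disease. This is the intuition behind using the score as a quantitative liability rather than a hard yes/no prediction.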
AI and machine learning are transforming the field of pulmonology by enabling faster, more accurate diagnosis and treatment of lung diseases like COPD. Despite the shortage of labeled datasets, researchers are finding innovative ways to train supervised ML classification models by using the rich data found in spirograms and other clinical measurements. The clinical datasets used in research, including spirograms and various types of medical records, provide a rich source of data to study the links between environment, genetics, and disease. As healthcare becomes increasingly data-driven, machine learning models will become an essential tool for pulmonologists, researchers, and patients, improving patient outcomes and reducing healthcare costs.