Lung diseases, including chronic obstructive pulmonary disease (COPD), are among the leading causes of death worldwide. Early detection and treatment are critical for improving patient outcomes, but diagnosing lung diseases can be challenging. Machine learning (ML) models are changing pulmonology by enabling faster and more accurate prediction of lung diseases, including COPD. In this blog, we'll discuss the challenges of detecting and predicting lung diseases with machine learning, the clinical datasets used in research, and the supervised learning method used to build the models.
Detecting and predicting lung diseases using machine learning is challenging mainly because labeled data is scarce. Supervised learning requires large datasets of labeled samples, and creating labels for medical data is generally time-consuming and expensive. For lung diseases, labeling requires medical experts to review and interpret clinical measurements such as spirograms (the curves recorded during a breathing test). In addition, lung diseases like COPD often go undiagnosed, so many individuals who have the disease are never labeled as having it. Together, these issues make it difficult to build reliable labeled datasets for training ML models.
Large clinical datasets can be used to train machine learning classification models for predicting lung diseases, and different types of features can be extracted from them. One example is the UK Biobank, a large national effort that has created a dataset of petabytes of imaging, metabolic tests, and medical records spanning 500,000 individuals, made available to researchers. It gives researchers a rich source of data for studying the links between environment, genetics, and disease.
Machine learning classification models can be used to phenotype lung diseases, specifically COPD, accurately and at scale. Clinical data and spirogram data can both be used to train the classification model. The image here (courtesy: Google AI blog) shows the architecture for training a machine learning model that outputs a risk or liability score indicating whether a person has COPD. A similar architecture can be used to train other models for predicting other kinds of lung diseases. Note that the image represents a supervised learning method, which requires every training sample to be associated with a label.
COPD is a lung disease characterized by airway inflammation and impeded airflow that can progressively reduce lung function. The current guidelines for determining COPD status from spirograms use only a few specific data points on the curve and apply fixed thresholds to those values. In contrast, the ML models shown in the image above were trained on the entire spirogram curve together with additional clinical data.
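To make the "few data points and fixed thresholds" idea concrete, here is a minimal sketch of a threshold rule applied to a volume-time spirogram. It reads off just two values, the volume exhaled in the first second (FEV1) and the total volume exhaled (FVC), and compares their ratio to a cutoff. The function names, the sample curve, and the 0.70 cutoff are assumptions chosen for illustration (0.70 is a commonly cited fixed-ratio criterion, but the article does not name a specific value), so treat this as an illustration rather than clinical logic.

```python
def volume_at(times_s, volumes_l, t):
    """Linearly interpolate exhaled volume (litres) at time t (seconds).

    Assumes times_s is strictly increasing and matches volumes_l in length.
    """
    if len(times_s) != len(volumes_l) or len(times_s) < 2:
        raise ValueError("need at least two matching time/volume samples")
    if t <= times_s[0]:
        return volumes_l[0]
    if t >= times_s[-1]:
        return volumes_l[-1]
    for i in range(len(times_s) - 1):
        t0, t1 = times_s[i], times_s[i + 1]
        if t0 <= t <= t1:
            frac = (t - t0) / (t1 - t0)
            return volumes_l[i] + frac * (volumes_l[i + 1] - volumes_l[i])
    raise ValueError("times must be sorted in increasing order")


def fixed_threshold_flag(times_s, volumes_l, ratio_threshold=0.70):
    """Apply a fixed FEV1/FVC cutoff to one spirogram.

    ratio_threshold=0.70 is an illustrative assumption, not a value
    taken from the article.
    """
    fev1 = volume_at(times_s, volumes_l, 1.0)  # volume exhaled after 1 second
    fvc = max(volumes_l)                       # largest volume reached
    if fvc <= 0:
        raise ValueError("spirogram contains no exhaled volume")
    ratio = fev1 / fvc
    return {"fev1": fev1, "fvc": fvc, "ratio": ratio,
            "flagged": ratio < ratio_threshold}


if __name__ == "__main__":
    # Hypothetical curve: time in seconds, cumulative exhaled volume in litres.
    times = [0.0, 0.5, 1.0, 2.0, 6.0]
    volumes = [0.0, 1.5, 2.4, 3.2, 4.0]
    print(fixed_threshold_flag(times, volumes))
```

Everything else on the curve (its shape between and beyond those two points) is ignored by a rule like this, which is exactly the information a model trained on the full spirogram can use.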
Google researchers trained the model (shown in the picture above) to predict COPD status by making use of a variety of widely available medical record sources to create labels without medical expert review. These labels are noisy and less reliable because of gaps in the medical records and undiagnosed COPD cases. Even so, the models trained on this data showed high accuracy. The model predictions were treated as a quantitative COPD liability or risk score, which improved the ability to predict COPD outcomes. Classification models were trained to predict a variety of binary COPD outcomes (for example, an individual's COPD status, or whether they were hospitalized for COPD or died from it). For greater detail, read the Google AI blog post "An ML based approach to better characterize lung diseases".
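The idea of training on noisy labels and reading the prediction as a continuous risk score can be sketched on synthetic data. This is not the researchers' model, which learned from full spirogram curves. Here a small logistic regression, written in plain Python, is swapped in, and it is trained on a single made-up feature (an FEV1/FVC-like ratio). The data generator, the 60% "diagnosed" rate, and all function names are assumptions for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def scale(ratio):
    """Center and scale the hypothetical ratio feature."""
    return (ratio - 0.70) / 0.10


def make_noisy_dataset(n=400, diagnosed_rate=0.6, seed=0):
    """Synthetic data: true cases have ratio < 0.70, but only a fraction
    of them carry a positive label (the rest mimic undiagnosed cases)."""
    rng = random.Random(seed)
    ratios, noisy_labels, true_labels = [], [], []
    for _ in range(n):
        ratio = rng.uniform(0.40, 0.95)
        has_copd = ratio < 0.70
        recorded = has_copd and rng.random() < diagnosed_rate
        ratios.append(ratio)
        noisy_labels.append(1 if recorded else 0)
        true_labels.append(1 if has_copd else 0)
    return ratios, noisy_labels, true_labels


def train_logistic(ratios, labels, lr=0.1, epochs=500):
    """Fit weight w and bias b by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(ratios)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for ratio, y in zip(ratios, labels):
            x = scale(ratio)
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def risk_score(ratio, w, b):
    """Predicted probability, read as a continuous risk score."""
    return sigmoid(w * scale(ratio) + b)


if __name__ == "__main__":
    ratios, noisy, _ = make_noisy_dataset()
    w, b = train_logistic(ratios, noisy)
    for r in (0.50, 0.70, 0.90):
        print(f"ratio={r:.2f} risk={risk_score(r, w, b):.3f}")
```

Even though many true cases are labeled negative, the score still ranks lower ratios as higher risk, which is why a noisy-label model can remain useful as a risk score rather than a hard yes/no diagnosis.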
AI and machine learning are changing pulmonology by enabling faster, more accurate diagnosis and treatment of lung diseases like COPD. Despite the shortage of labeled datasets, researchers are finding innovative ways to train supervised ML classification models using the rich data found in spirograms and other clinical measurements. The clinical datasets used in research, including spirograms and various types of medical records, provide a rich source of data for studying the links between environment, genetics, and disease. As healthcare becomes increasingly data-driven, machine learning models are likely to become an important tool for pulmonologists, researchers, and patients, helping to improve patient outcomes and reduce healthcare costs.