Research with Mpirik — An Opportunity to Explore

Miguel Sotelo
Sep 29, 2020

Most healthcare software companies aim to improve patient care, patient outcomes, hospital revenue, and physician efficiency. However, these companies have a tremendous opportunity to also contribute to the scientific community through big data and interdisciplinary teams.

Mpirik is a healthcare software company focused on cardiovascular disease. Of the many categories of heart disease, one of Mpirik’s first targets is structural heart disease. Severe aortic stenosis (AS) is one structural heart disease that affects over 250,000 patients in the US every year. Extensive research has been done to understand the clinical and echocardiographic variables associated with AS severity and the likelihood of adverse clinical outcomes due to disease progression. Available treatments for this disease, such as transcatheter aortic valve replacement, have been shown to drastically improve patient outcomes and quality of life. However, it has also been widely reported that a main contributor to worsening symptoms and poor clinical outcomes is undertreatment (patients going untreated for aortic stenosis). This undertreatment is the central problem that Mpirik aims to solve.

The vision is to analyze healthcare data to help clinicians ensure that every patient with a diagnostic study showing signs of severe AS is given the chance to be evaluated for treatment. Because structural heart diseases are progressive, knowing when to refer and when to treat is a critical decision for clinicians. Machine learning models can analyze large data sets and provide predictions for clinicians to consider when making treatment decisions. Thus,

Data-driven, accurate and interpretable machine learning models that can consider the high volume and dynamically changing patient data to produce a probability of outcome or disease progression would help increase patient referrals, thus improving patient care and outcomes, hospital revenue, and physician efficiency.

In addition to machine learning models, the combination of big data and expert physicians presents an opportunity to conduct advanced research and expand our understanding of these diseases, especially by applying the same methodologies to lesser-studied valvular heart diseases (VHD) such as mitral regurgitation (MR) and tricuspid regurgitation (TR). Mpirik sees this opportunity and has created a research division dedicated to helping physician partners conduct studies with their data.

The remainder of this post will introduce you to Mpirik’s research strategy, from data collection to preprocessing and model development, as well as productized models and peer-reviewed manuscripts.

Data Collection, Preprocessing, and Variable Selection

Mpirik’s Cardiac Intelligence™ software analyzes echocardiogram reports to identify patients with characteristics of cardiac disease. Diagnostic results from the hospital EMR are securely transferred in real time to Mpirik, structured using Natural Language Processing (NLP), and analyzed by Mpirik’s algorithms against criteria for cardiovascular diseases. The primary focus is to help cardiovascular service lines improve their disease detection while also tracking patients through the progression of the disease to eliminate undertreatment.

In addition to echocardiographic variables, we typically collect additional data from clinicians such as reasons for hospitalizations, associated comorbidities, lab tests, and death records.

The hypothesis outlines the methodology, and a good methodology produces reliable results.

Typically, our research questions revolve around predicting clinical outcomes and disease progression in a diseased population using logistic regression and survival analysis. For hypotheses involving survival and time-to-event analyses, it is also important to select an appropriate right bound (end date) for data collection, as this determines which patients are right-censored.

After extracting the data, it is important to scrutinize it and clean it up. Continuous variables with more than 70% missing values are removed; however, we have recently explored multiple imputation methods based on regression and clustering techniques to deal with missing data appropriately. Predictor (independent) and response (dependent) variables that are categorical, such as presence of atrial fibrillation or NYHA class, are encoded as integers.
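As a rough illustration of the cleaning step described above, here is a minimal pandas sketch. It is not Mpirik’s actual pipeline: the column names, the toy records, and the helper functions drop_sparse_columns and encode_categoricals are all hypothetical, and for simplicity the missingness filter is applied to every column rather than to continuous variables only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.70):
    """Drop columns whose fraction of missing values exceeds max_missing."""
    missing_fraction = df.isna().mean()
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df[keep]


def encode_categoricals(df, ordered_levels):
    """Encode categorical columns as integers using an explicit level order."""
    out = df.copy()
    for col, levels in ordered_levels.items():
        cat = pd.Categorical(out[col], categories=levels, ordered=True)
        out[col] = cat.codes  # values outside `levels` (or missing) become -1
    return out


# Hypothetical toy records; real variable names will differ.
records = pd.DataFrame({
    "mean_gradient": [22.0, 35.0, np.nan, 41.0, 28.0],       # 20% missing
    "rarely_measured": [np.nan, np.nan, np.nan, np.nan, 1.2],  # 80% missing
    "afib": ["no", "yes", "no", "no", "yes"],
    "nyha": ["II", "III", "I", "IV", "II"],
})

cleaned = drop_sparse_columns(records)
encoded = encode_categoricals(
    cleaned,
    {"afib": ["no", "yes"], "nyha": ["I", "II", "III", "IV"]},
)
print(encoded)
```

Passing the level order explicitly keeps the integer codes stable across data pulls, which matters for an ordinal scale such as NYHA class.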

Raw data is preprocessed, ensuring that variables are represented as either continuous or factors, and variable selection reduces the high-dimensionality problem into a list of probable predictor variables based on data and hypothesis testing.

Variables may be collinear or redundant with one another. Thus, we employ strict variable selection methods to reduce dimensionality. For instance, principal component analysis (PCA) combines two or more correlated variables (e.g., aortic mean gradient and jet velocity) into one or more orthogonal components, each capturing a distinct share of the variance in the original variables.
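To make the PCA idea concrete, the sketch below builds two synthetic, strongly correlated variables standing in for aortic mean gradient and jet velocity, then computes principal components with a singular value decomposition. The data and the linear relationship between the two variables are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two correlated echo measurements (made-up values).
jet_velocity = rng.normal(3.5, 0.6, n)                           # m/s
mean_gradient = 10.0 * jet_velocity - 5.0 + rng.normal(0, 2, n)  # mmHg

# Standardize so both variables contribute on the same scale.
X = np.column_stack([mean_gradient, jet_velocity])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Rows of Vt are the principal directions; S**2 is proportional to variance.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)
scores = Z @ Vt.T  # component scores, uncorrelated with each other

print("explained variance ratio:", explained)
```

Because the two inputs are nearly collinear, the first component carries most of the variance and could replace both variables in a downstream model.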

Variable selection for the final multivariate model relies on hypothesis-based trial and error. After dimensionality reduction, we report descriptive statistics comparing the distributions of continuous and categorical variables between two populations. We compare the means of continuous variables using a Student’s t-test, and the prevalence of categorical variables using a chi-square test or a two-proportion z-test. Additionally, we perform univariate logistic regression to identify key predictors associated with the response variable. The list of significant variables is then reviewed with clinical experts to check concordance with our working hypothesis.
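The two-population comparisons described above might look like the following with SciPy. The groups, the ejection fraction values, and the atrial fibrillation counts are synthetic and chosen only so the example runs; the univariate logistic regression step is not shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic continuous variable (e.g., ejection fraction, %) in two groups.
ef_no_event = rng.normal(60, 5, 50)
ef_event = rng.normal(50, 5, 50)

# Student's t-test comparing the group means (equal variances assumed).
t_result = stats.ttest_ind(ef_no_event, ef_event, equal_var=True)
t_p = t_result.pvalue

# Synthetic 2x2 table: rows are groups, columns are afib present / absent.
afib_table = np.array([[30, 20],
                       [15, 35]])

# Chi-square test of independence (Yates correction is applied by default
# for 2x2 tables).
chi2, chi_p, dof, expected = stats.chi2_contingency(afib_table)

print(f"t-test p-value: {t_p:.3g}")
print(f"chi-square p-value: {chi_p:.3g} (dof={dof})")
```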

We are currently investigating the feasibility of other techniques, such as the smoothly clipped absolute deviation (SCAD) penalty, which shrinks small coefficient estimates toward zero while leaving large, significant coefficients relatively untouched.
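The behavior of SCAD is easiest to see in its thresholding rule, which gives the penalized estimate for a single coefficient in the simplified case of an orthonormal design. The sketch below implements that rule with the commonly used default a = 3.7; the input values and the tuning parameter lam are arbitrary, and this is an illustration of the penalty rather than a description of how it would be fitted in practice.

```python
import numpy as np


def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding of unpenalized estimates z (orthonormal design).

    |z| <= 2*lam        : soft-thresholding (small values shrink to zero)
    2*lam < |z| <= a*lam: shrinkage tapers off linearly
    |z| > a*lam         : estimate is left unchanged
    """
    z = np.asarray(z, dtype=float)
    abs_z = np.abs(z)
    soft = np.sign(z) * np.maximum(abs_z - lam, 0.0)
    middle = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(abs_z <= 2.0 * lam, soft,
                    np.where(abs_z <= a * lam, middle, z))


raw_estimates = np.array([0.5, 1.5, 3.0, 5.0, -0.8, -6.0])
shrunk = scad_threshold(raw_estimates, lam=1.0)
print(shrunk)
```

With lam = 1.0, the estimates 0.5 and -0.8 are set to zero, 1.5 and 3.0 are partially shrunk, and 5.0 and -6.0 pass through unchanged.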

Multivariate Model Development and Iteration

A final multivariate model is selected from the list of significant univariate candidates. Although automated stepwise techniques exist for trimming down the list of possible predictors, we prefer manually iterating over combinations of variables based on clinical knowledge, because unsupervised stepwise variable selection may produce confidence intervals that are too narrow, artificially inflating the significance of predictor variables and our confidence in the model. Plus, supervised variable selection allows us to actually think about the hypothesis as we make these decisions.

Even with a carefully constructed hypothesis, selecting the appropriate model is not always straightforward.

For instance, a survival analysis could employ a descriptive approach (Kaplan-Meier), a semi-parametric approach (Cox proportional hazards model), or a parametric approach that forces us to assume (and test the validity of) a distribution, typically a Weibull distribution. A benefit of the latter is the ability to extrapolate, which is particularly important when we have narrow data collection time frames.
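As a small worked example of the descriptive approach, the sketch below applies a right bound to follow-up times (administrative censoring) and then computes a Kaplan-Meier survival curve by hand with NumPy. The follow-up times, the 8-month right bound, and the helper names censor_at and kaplan_meier are hypothetical.

```python
import numpy as np


def censor_at(durations, observed, right_bound):
    """Apply a right bound: events after the bound become censored at it."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    still_observed = observed & (durations <= right_bound)
    return np.minimum(durations, right_bound), still_observed


def kaplan_meier(durations, observed):
    """Return event times and the Kaplan-Meier survival estimate at each."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)              # still under follow-up
        events = np.sum((durations == t) & observed)  # events exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Hypothetical months to event (True) or loss to follow-up (False).
months = [2, 3, 3, 5, 10, 12]
had_event = [True, True, False, True, True, True]

# Data collection stops at 8 months, so later events are right-censored.
months_c, event_c = censor_at(months, had_event, right_bound=8)
times, surv = kaplan_meier(months_c, event_c)
print(times, surv)
```

Moving the right bound changes which patients count as events and which as censored, which is why the choice of bound shapes the resulting curve.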

Developing a multivariate model for disease progression or clinical outcomes depends on the list of input univariate predictors, in- and out-of-sample model evaluation, and clinical feedback.

After a model is developed, we can test its accuracy and precision by performing in- and out-of-sample evaluations. Receiver operating characteristic (ROC) curves are our preferred method because of their wide acceptance in the clinical field and the intuitive interpretability of the area under the curve (AUC). The curve plots sensitivity (true positive rate) against 1 − specificity (false positive rate) across the cut-off thresholds applied to the model’s predicted probability. Optimizing over the ROC curve can yield a recommended cut-off threshold that is useful for predicting an outcome when the resulting model is deployed as a product.
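The sketch below shows one way to compute ROC points, the AUC, and a recommended cut-off from predicted probabilities using only NumPy. The labels and scores are made up, and Youden’s J statistic (sensitivity + specificity − 1) is used here as an assumed optimization criterion; the post does not say which criterion Mpirik uses.

```python
import numpy as np


def roc_points(y_true, scores):
    """False/true positive rates for each distinct score used as a cut-off."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)[::-1]  # high to low
    tpr = np.array([np.mean(scores[y_true] >= t) for t in thresholds])
    fpr = np.array([np.mean(scores[~y_true] >= t) for t in thresholds])
    return fpr, tpr, thresholds


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule, anchored at (0, 0)."""
    x = np.concatenate([[0.0], fpr, [1.0]])
    y = np.concatenate([[0.0], tpr, [1.0]])
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


# Made-up outcomes (1 = event) and model-predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.55, 0.90]

fpr, tpr, thresholds = roc_points(y, p)
auc = auc_trapezoid(fpr, tpr)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR (assumed criterion).
best_threshold = float(thresholds[np.argmax(tpr - fpr)])
print(f"AUC = {auc:.3f}, recommended cut-off = {best_threshold}")
```

In practice the cut-off would be chosen on held-out data and reviewed with clinicians, since the relative cost of false positives and false negatives depends on the clinical use.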

Model Publication and Implementation

The outcome of our research is a publication in a peer-reviewed journal, an abstract and poster presentation at a conference, or a productized model deployed using Amazon Web Services and SageMaker. We currently have a number of publications and abstracts under review, and we have deployed disease progression models.

Other projects in the works range from developing a model for AS patient stratification in the age of COVID, to determining clinical and echocardiographic predictors of disease progression and adverse clinical outcomes in moderate AS (Gada et al., 2020), MR and TR (in review), to recent projects focused on disparities in care.

The secret to our success is the involvement of clinical champions

Mpirik is differentiated by the ability to quickly extract and effectively analyze data for any given research question. However, the secret to success is the involvement of clinical champions who contribute their clinical knowledge and expertise while reviewing our results during model development and deployment.

Through collaboration between clinical experts and companies with the data science expertise to analyze big data effectively, improvements in patient care and advances in scientific knowledge can be attained.

To learn more, visit Mpirik’s website or contact us at info@mpirik.com.

References:

Gada, H., Vora, A., Ramlawi, B., O’Hair, D., Sotelo, M., Rogers, C., Wagner, L., Brigman, L., Kohli, N. Clinical and Echocardiographic Predictors of Aortic Stenosis Disease Progression and Clinical Outcomes in Patients with Moderate Aortic Stenosis [abstract]. In: Proceedings of the 69th Annual Meeting of the American College of Cardiology; 2020 March 28–30; Chicago, IL.