Blog entry by The CAIT Center

In the world of healthcare innovation, we talk a lot about the power of algorithms. However, an algorithm is only as good as the data used to train it. For the average clinician, the day-to-day reality of data is not a clean spreadsheet; it is a chaotic mix of missing entries, human error, and inherent biases. To move from being a passive user of technology to an expert evaluator, you have to understand the specific data traps that can compromise patient care.

The "Missingness" Problem: Why Silence is a Data Point

One of the most frequent issues we encounter with healthcare data is missingness. Not everyone inputs everything they are supposed to; clinicians are busy, they forget, or the information simply does not feel relevant at the moment. Different institutions also have different requirements for what must be reported, meaning that when you pull information from across the country, it rarely lines up perfectly.

In data science, the crux of the issue is "why" the data is missing. It is rarely random. For example, a clinician might not record a patient's heart rate if the patient is in cardiac arrest because they are busy performing CPR. If you use this data to train a predictive model, the AI will not see the actual reason the data is missing. This reduces the "statistical power" of the model and can lead to flawed insights.
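One defensive pattern is to make the silence visible: record missingness itself as an explicit flag so a downstream model can learn from it rather than having it silently dropped. Here is a minimal sketch in plain Python; the records and field names are invented for illustration, not drawn from any real EHR.

```python
# Illustrative vital-sign records; None marks a value that was never charted.
records = [
    {"patient": "A", "heart_rate": 88},
    {"patient": "B", "heart_rate": None},  # e.g., not recorded during a code
    {"patient": "C", "heart_rate": 110},
]

# Add an explicit flag so "missing" becomes a data point in its own right,
# visible to any model trained on these records.
for r in records:
    r["heart_rate_missing"] = r["heart_rate"] is None
```

With the flag in place, a model can at least associate "heart rate absent" with outcomes, even if it never learns the clinical reason behind the absence.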

To address this, we sometimes use imputation, where we fill in the gaps with the mean or median values of the rest of the population, or use models to predict what the missing value might have been. However, as a clinician, you must always ask: Does the fact that this data is missing mean something clinically relevant?
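As a toy sketch of the simplest form of imputation described above, filling gaps with the mean of the recorded values might look like this (the heart-rate numbers are invented for illustration):

```python
from statistics import mean

heart_rates = [72, 88, None, 95, None, 80]  # None = never charted

# Compute the mean of the values we actually observed...
observed = [v for v in heart_rates if v is not None]
fill = mean(observed)

# ...and substitute it wherever a value is missing.
imputed = [v if v is not None else fill for v in heart_rates]
```

Note what this hides: every imputed patient now looks perfectly average, which is exactly why the clinical question "does this gap mean something?" has to come before the statistical fix.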

The Bias Trap: Apples, Oranges, and Survivorship

Just because we collect data does not mean it reflects objective reality. Data often carries the same biases found in the clinic:

  • Indication Bias: Comparing patients prescribed insulin to those who are not is like comparing "apples and oranges". People on insulin usually have more severe diabetes, meaning their medical history is inherently different from those who are not on the medication.
  • Measurement Bias: Sometimes the data is simply wrong because a machine is broken or a reading was performed poorly.
  • Recording Bias: Clinical literature shows that diagnoses can be influenced by a patient's age or race. While age is a factor in an Alzheimer’s diagnosis, human bias in how things are recorded can bleed into the AI.

A classic example of this is survivorship bias. During World War II, researchers examined returning planes with bullet holes in their wings and suggested adding armor there. But the real answer was to put armor where there were no bullet holes, because the planes shot in the engine never made it home to be counted. In healthcare, we must constantly ask who is missing from our data and why.

Standardizing the Language: SNOMED, LOINC, and FHIR

For AI to work across different hospital systems, we have to speak the same language. If one person types "diabetes" in lowercase and another in uppercase, a computer might see them as two different things. To solve this, we rely on healthcare data standards:

  • Terminology: Standards such as SNOMED CT (for clinical findings and diagnoses) and LOINC (for laboratory tests) ensure that a diagnosis or test means the same thing everywhere.
  • Medications: RxNorm provides a standard way to record prescriptions.
  • Data Exchange: HL7 and FHIR (Fast Healthcare Interoperability Resources) enable different EHR systems to "plug and play," allowing data to be shared without months of manual work.
  • Coding: ICD-10 and CPT codes provide a specific language for diagnoses and procedures, ensuring everyone is talking about the same clinical reality.
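The "diabetes" capitalization problem above can be illustrated in a few lines of Python. The code-to-term lookup below is purely hypothetical, not a real SNOMED CT table; the point is only that normalizing free text before matching lets different spellings resolve to one standard concept.

```python
# Hypothetical local term -> standard code lookup (NOT real SNOMED CT codes).
TERM_TO_CODE = {
    "diabetes": "DM-001",
    "type 2 diabetes": "DM-002",
}

def normalize(term):
    """Map a free-text diagnosis to a standard code, ignoring case and
    surrounding whitespace, so "Diabetes" and "DIABETES" match."""
    return TERM_TO_CODE.get(term.strip().lower())
```

Real terminology services do far more than lowercase matching (synonyms, hierarchies, versioning), but the principle is the same: map messy local language onto one shared code.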

Transforming Numbers into Insights

Finally, there is the work of data transformation. Raw numbers are not always the most meaningful way to describe a patient. For example, a BMI is a "continuous" number, but it might be more useful for a model to see it as a "categorical" label like "Overweight".
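That continuous-to-categorical step can be sketched directly, using the standard WHO-style BMI cut-points:

```python
def bmi_category(bmi):
    """Bucket a continuous BMI value into the familiar clinical labels
    (WHO cut-points: <18.5, 18.5-24.9, 25-29.9, >=30)."""
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"
```

A model fed these labels sees the clinically meaningful boundaries directly, instead of having to rediscover them from raw numbers.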

Similarly, scaling data helps models perform better. Numbers like a troponin or a D-dimer can jump from very low to very high almost instantly. Is a D-dimer of 100,000 twice as "bad" as 50,000? Not necessarily. By scaling these numbers, we can help AI models identify clinically meaningful trends.
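One common way to tame those jumps is a logarithmic transform, sketched here with invented D-dimer values:

```python
import math

d_dimers = [500, 5_000, 50_000, 100_000]  # illustrative values only

# On a log10 scale, 100,000 sits only ~0.3 above 50,000 rather than
# "twice as bad" -- extreme values are compressed toward a range where
# relative (fold) changes, not raw differences, drive the model.
scaled = [math.log10(v) for v in d_dimers]
```

The choice of transform is itself a clinical judgment: it encodes the belief that a tenfold rise matters about the same amount wherever it occurs on the scale.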

The Turning Point: Why Your Clinical Voice is Irreplaceable

There is a common fear that data science will eventually automate the "human" out of healthcare, but the reality is the exact opposite. Because healthcare data is so messy and biased, a clinician's "sanity check" is the only thing that keeps the system grounded.

An algorithm can find a pattern, but only a clinician can tell you if that pattern is a medical breakthrough or a statistical error. You are the one who understands why a heart rate was not recorded during a code. You know that two "similar" patient groups are actually vastly different. When you step into the world of AI, you are not leaving your clinical expertise behind; you are using it to ensure the technology stays focused on what is plausible and what is safe.

About the CAIT Center

The Collaborative AI Technology Center (CAIT Center) is a research-based partnership between the GW Biomedical Informatics Center and the University of Maryland Eastern Shore School of Pharmacy. We help healthcare professionals bridge the gap between clinical practice and data science.

Get Certified in AI Literacy

If you are ready to master these data concepts, our free certificate program is the perfect place to start.

Course: Demystifying AI for Health Professionals

Learn how to perform data quality checks and "sanity checks" on the tools you use in your daily workflow.

  • No Math Required: We focus on the clinical meaning of data, not the underlying calculus.
  • Credly Certification: Earn a digital badge to formally showcase your expertise in medical informatics and data quality.
  • Evidence-Informed Guidance: Learn from George Washington University faculty specializing in healthcare data research.