Headline: Data Quality Audit – Driving Evidence-Based Healthcare Decisions.
Abstract:
I recently led a comprehensive data quality audit (DQA) of COVID-19 data in Zambia’s Copperbelt Province, in line with the national vision for digital health transformation. Using statistical analysis and a structured assessment framework, we identified key strengths and areas for improvement in data accuracy, completeness, and correctness. The findings will inform targeted interventions to improve data quality and support evidence-based decision-making for better healthcare outcomes. I am passionate about using data-driven solutions to strengthen health information systems and contribute to a more resilient and responsive healthcare landscape. This work was carried out on the Global Fund-funded C19RM Project, under ICAP at Columbia University in Zambia.

Conducting a DQA at Twapia Urban Clinic in Ndola, Copperbelt Province, Zambia. In the picture (L>R): Michael Muyambango, John Small, Beatrice Kangwa, Kakusa Kakusa and Paul Zimba.
Key Skills:
- Data Quality Audit (DQA)
- Health Information Systems (HIS)
- Health Management Information Systems (HMIS)
- Statistical Analysis (STATA, Python, R)
- Data Visualization (Power BI, STATA, Python, R)
- Project Management
- Capacity Building
- Evidence-based reporting
1. METHODOLOGY
i. Site Selection Criteria:
High-volume facilities were selected with guidance from the Provincial Health Office (PHO). Key selection metrics included:
a) Population size (catchment area).
b) Immunization Target.
c) Record of undesired outcomes related to COVID-19 (optional).
A total of 19 health facilities were selected and visited.
ii. Assessment Framework:
The Total Data Quality Management (TDQM) framework was adopted for this DQA. Data sources were identified and data themes were mapped from the routine C19RM assessment tool, with a focus on the M&E and HMIS components. The framework incorporates both quantitative and qualitative assessment.
iii. Data Collection:
Data were collected from facility COVID-19 immunization registers and from the National COVAX Tracker. In-depth one-on-one interviews were conducted with facility staff who work, or previously worked, with COVID-19 data, including focal point persons, data clerks/health information officers, surveillance officers, and facility in-charges. The review period ran from 1 January 2020 to 31 March 2024.
iv. Analysis Techniques:
STATA 17 and Python were used as the main analysis software. R was also used for comparison and visualization purposes.
a) Data Cleaning and Summary Statistics:
– Descriptive Statistics were used to summarize findings.
– Frequency distributions were used to analyze and identify missing data, inconsistencies, and potential errors.
– Data profiling: We profiled the data to generate summary statistics for all variables in the dataset, providing a comprehensive overview of data quality.
– Consistency checks: Data values were compared within and across fields to identify inconsistencies and anomalies.
– Completeness checks: Data sources were reviewed to identify missing values and incomplete records.
– Comparative and triangulation analysis: Registers, hard-copy forms, and electronic (soft-copy) data were compared to identify discrepancies and inconsistencies.
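The completeness and triangulation checks above can be sketched roughly as follows. This is a minimal illustration rather than the audit's actual code: the field names, the sample rows, the tracker count, and the helper names `completeness` and `triangulate` are all hypothetical, and the scoring rule (share of required cells filled in) is an assumed definition.

```python
# Hypothetical register rows; field names and values are illustrative only.
REQUIRED_FIELDS = ["client_id", "vaccine", "dose_date"]

register = [
    {"client_id": "A01", "vaccine": "J&J", "dose_date": "2022-03-01"},
    {"client_id": "A02", "vaccine": "", "dose_date": "2022-03-01"},
    {"client_id": "A03", "vaccine": "Pfizer", "dose_date": None},
    {"client_id": "A04", "vaccine": "Pfizer", "dose_date": "2022-03-02"},
]


def completeness(rows, fields):
    """Share of required cells that are filled in (neither None nor empty)."""
    total = len(rows) * len(fields)
    if total == 0:
        return float("nan")
    filled = sum(
        1 for row in rows for field in fields if row.get(field) not in (None, "")
    )
    return filled / total


def triangulate(register_count, tracker_count):
    """Ratio of the recounted register total to the electronically reported total."""
    if tracker_count == 0:
        return float("nan")
    return register_count / tracker_count


completeness_score = completeness(register, REQUIRED_FIELDS)
# Assumed tracker total of 5 records for the same facility and period.
ratio = triangulate(register_count=len(register), tracker_count=5)
print(f"completeness={completeness_score:.3f}, register/tracker ratio={ratio:.2f}")
```

A ratio below 1 would suggest over-reporting in the electronic system relative to the register, and a ratio above 1 would suggest under-reporting.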
b) Advanced Statistical Methods/Techniques:
– Hypothesis testing: Used to test assumptions about the data, such as normality or independence.
– Correlation analysis: Used to measure relationships between variables and to identify potential dependencies and inconsistencies.
– Outlier detection: Used to identify data points that deviate significantly from the rest of the data.
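As one possible illustration of the outlier detection step, the sketch below flags facility scores using the common 1.5 × IQR rule. The rule itself is an assumption, since the specific method used in the audit is not stated here, and the scores and the helper name `iqr_outliers` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative facility-level scores, not the audit data.
scores = [0.95, 0.97, 1.0, 0.96, 0.98, 0.94, 1.0, 0.99, 0.55, 0.97]
outliers = iqr_outliers(scores)
print(outliers)
```

Flagged facilities would then be candidates for follow-up, for example through root cause analysis.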
c) Visualizations:
– Data visualization: Visual representations of the data were created to identify patterns, trends, and anomalies.
– Root cause analysis was used to investigate the underlying causes of data quality issues.
2. FINDINGS:
i. Distribution Analysis:
The histograms in figure 4 above illustrate the distribution of Accuracy, Completeness, and Correctness across the 19 facilities visited.
a) Accuracy: The distribution is skewed left (negatively skewed): most facilities have high accuracy scores, with a cluster at perfect accuracy (1.0) and a tail of a few lower-scoring facilities, highlighting potential areas for improvement.
b) Completeness: The distribution is also skewed left, with the majority of facilities demonstrating high completeness. Several facilities have perfect completeness (1.0), while a few have lower scores, suggesting some degree of missing data.
c) Correctness: The distribution leans heavily towards the higher end, implying that most facilities have a high proportion of correct data among the collected data. There are a few facilities with lower correctness scores, indicating potential issues with data entry or validation.
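The direction of skew in distributions like these can be checked numerically. The sketch below computes a simple moment-based skewness coefficient; a negative value indicates a longer tail toward low scores, which is the shape produced by a cluster near 1.0 with a few low-scoring facilities. The scores and the helper name `sample_skewness` are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: negative means a longer tail toward low values."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical, so there is no skew
    return m3 / m2 ** 1.5


# Illustrative scores clustered near 1.0 with a few low values, not the audit data.
scores = [1.0, 1.0, 1.0, 0.98, 0.97, 0.95, 0.93, 0.90, 0.75, 0.60]
skew = sample_skewness(scores)
print(f"skewness={skew:.2f}")
```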
The scatter plot visualizes the relationship between the number of ‘Missing’ values and ‘Accuracy’ for each facility. Accuracy generally decreases as the number of missing values increases, which suggests that missing data may be associated with a higher likelihood of errors in the recorded data.
However, there are also some facilities with a considerable number of missing values but still maintaining high accuracy. This indicates that the impact of missing data on accuracy might vary depending on other factors, such as the nature of the missing data and the specific data collection processes.
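The strength of this kind of relationship can be summarized with a Pearson correlation coefficient. The sketch below is a minimal illustration using made-up values, not the audit data; the helper name `pearson_r` is hypothetical, and Pearson's r is an assumed choice, since the specific coefficient used is not stated here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative per-facility values, not the audit data.
missing = [0, 2, 5, 8, 12, 20, 3, 15]
accuracy = [1.00, 0.98, 0.95, 0.90, 0.97, 0.70, 0.99, 0.80]
r = pearson_r(missing, accuracy)
print(f"r={r:.2f}")
```

A clearly negative r is consistent with the downward trend in the scatter plot, while facilities such as the fifth one in this example (12 missing values, accuracy 0.97) show why the association is not perfect.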
3. INFERENCE:
To draw inferences about the approximately 400 health facilities in the Copperbelt Province from the sample of 19 facilities, confidence intervals were calculated for the mean Accuracy, Completeness, and Correctness. A 95% confidence level was assumed for these estimates.
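The interval calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the audit's actual code: it assumes a t-based interval for the mean, hard-codes the standard two-sided 95% critical value for 18 degrees of freedom (about 2.101, for n = 19), and optionally applies a finite population correction for a population of 400. The scores are made-up placeholders, and the helper name `mean_ci` is hypothetical.

```python
import math
import statistics

# Two-sided 95% t critical value for 18 degrees of freedom (n = 19).
T_CRIT_95_DF18 = 2.101


def mean_ci(values, t_crit, population_size=None):
    """Return (lower, mean, upper) for a t-based confidence interval of the mean.

    If population_size is given, the standard error is scaled by the
    finite population correction sqrt((N - n) / (N - 1)).
    """
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    margin = t_crit * se
    return mean - margin, mean, mean + margin


# Illustrative scores for 19 facilities, not the audit data.
accuracy_scores = [
    1.0, 1.0, 1.0, 0.98, 0.97, 0.96, 0.95, 0.95, 0.93, 0.92,
    0.90, 0.90, 0.88, 0.85, 0.82, 0.80, 0.75, 0.70, 0.60,
]

low, mean, high = mean_ci(accuracy_scores, T_CRIT_95_DF18)
low_fpc, _, high_fpc = mean_ci(accuracy_scores, T_CRIT_95_DF18, population_size=400)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
print(f"with finite population correction: ({low_fpc:.3f}, {high_fpc:.3f})")
```

With 19 of roughly 400 facilities sampled, the correction narrows the interval only slightly, so the choice matters little at this sampling fraction.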
3.1 INTERPRETATION:
Assuming the dataset is a representative sample of the approximately 400 health facilities on the Copperbelt (Zambia Health Facilities Registry, 2024), we can infer the following with 95% confidence:
– The average Accuracy of data across all 400 facilities lies between 80% and 93%.
– The average Completeness of data across all facilities is between 92% and 98%.
– The average Correctness of data across all facilities is between 86% and 95%.
These intervals provide a range within which we can reasonably expect the true population means for these metrics to fall. They suggest that overall data quality, in terms of accuracy, completeness, and correctness, is likely to be quite high across the entire population of health facilities. However, it is important to remember that a 95% confidence level means intervals constructed this way capture the true mean in about 95% of repeated samples, so any single interval may still miss it.
4. RECOMMENDATIONS FOR PROGRAM IMPROVEMENT:
- Training and Capacity Building: As a country, we should prioritize training programs for facility staff on the use of the COVID-19 Tracker and updated clinical guidelines. We must ensure that SI units at ART clinics have access to, and are trained on, the COVAX Tracker and eIDSR.
- Data Management and Documentation: Strengthen data management practices, particularly regarding bi-directional screening for TB/COVID. Ensure the availability and proper use of data capture tools and registers.
- Communication and Advocacy: Continue to promote and support facilities’ participation in providing information to the public about COVID-19.
- Monitoring and Evaluation: Maintain regular data review meetings and strengthen M&E capacity at facilities to track progress, identify challenges, and inform continuous improvement efforts.
- Collaboration and Support: Continue to foster collaboration with implementing partners to provide necessary support and resources to facilities.
- Continued Technical Support Services: Continue to provide targeted technical support services (TSS) to the facilities, districts, and the province as a whole.
By addressing these areas and leveraging the identified strengths, the program can enhance its overall performance, improve data quality, and ultimately contribute to more effective COVID-19 response and management efforts.