
A primary strength of the proposed CNICS data repository is that it comprises diverse, multi-site data collected during routine clinical care and, as such, reflects the experience of many HIV-infected individuals in the US. Another strength of the CNICS collaboration is the novel and dynamic data collection instrument, which allows “real time” clinical data to be included in the overall data repository directly from site EMR databases. Uploaded data are checked for adherence to the existing CNICS metadata and coding standards. Synchronous validation occurs before data are loaded into the repository and verifies that all data elements are reported using a valid format and value. The Biostatistics Core coordinates development of data definitions and analysis variables to ensure that these data are uniformly defined and consistently collected across all sites. The Core also collaborates with CNICS investigators to define analysis variables, e.g., level of adherence to treatment regimen, that can be used uniformly across analyses of the CNICS data repository to maintain consistency in published results.
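As an illustration only, synchronous validation of format and value could be sketched as below. The field names, site codes, and rules are hypothetical and are not taken from the CNICS metadata standards. The sketch is written in Python for readability and does not represent the actual CNICS loading software.

```python
from datetime import datetime


def _is_iso_date(value):
    """Return True when value is a string in YYYY-MM-DD form naming a real date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical coding standard: field name -> function returning True when valid.
RULES = {
    "patient_id": lambda v: isinstance(v, str) and v.strip() != "",
    "site": lambda v: isinstance(v, str) and v in {"A", "B", "C"},  # illustrative codes
    "test_date": _is_iso_date,
    "hiv_rna_copies_ml": lambda v: (
        isinstance(v, (int, float)) and not isinstance(v, bool) and v >= 0
    ),
}


def validate_record(record):
    """Return a list of (field, problem) pairs; an empty list means the record may be loaded."""
    problems = []
    for field, is_valid in RULES.items():
        if field not in record:
            problems.append((field, "missing"))
        elif not is_valid(record[field]):
            problems.append((field, "invalid value"))
    return problems
```

In this sketch a record is rejected before loading if any field is missing or fails its rule, which mirrors the idea that only data in a valid format and value reach the repository.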
Asynchronous validation involves applications used after data are loaded into the CNICS repository to monitor data quality centrally. For example, patient data are evaluated for potentially invalid date ranges: deceased patients should not have clinical events occurring after their date of death, and start and stop dates for courses of antiretroviral therapy and for episodes of clinical conditions must produce positive durations. HIV-1 RNA results can be examined to identify potentially missing data: patients should have HIV-1 RNA measurements done routinely, and test results should be found within specific windows of time following the initiation of antiretroviral therapy. Beyond querying these data to identify anomalies, an additional strength of centralized monitoring of the CNICS cohort is the ability to compare data patterns across sites. Comparing rates of events across sites allows CNICS to validate these rates and to identify potential data problems at sites that vary greatly from overall patterns.
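The date checks and the cross-site comparison described above can be sketched as follows. This is a minimal Python illustration, not the CNICS monitoring application: the function names, the input layouts, and the two-fold threshold used to flag an outlying site are all assumptions made for the example.

```python
from datetime import date
from statistics import median


def events_after_death(death_dates, events):
    """Flag clinical events dated after the patient's recorded date of death.

    death_dates: {patient_id: date}; events: iterable of (patient_id, event_date).
    """
    return [(pid, d) for pid, d in events
            if pid in death_dates and d > death_dates[pid]]


def nonpositive_durations(courses):
    """Flag therapy courses or condition episodes whose stop date is not after the start date.

    courses: iterable of (patient_id, start_date, stop_date).
    """
    return [(pid, start, stop) for pid, start, stop in courses if stop <= start]


def outlying_sites(site_rates, ratio=2.0):
    """Flag sites whose event rate differs from the median site rate by more than ratio-fold.

    site_rates: {site: rate}. The ratio threshold is an arbitrary illustrative choice.
    """
    mid = median(site_rates.values())
    return sorted(site for site, rate in site_rates.items()
                  if mid > 0 and (rate > mid * ratio or rate < mid / ratio))
```

A flagged record or site in this sketch indicates only that the data warrant review; it does not by itself establish an error.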
Automated data quality assurance does an exceptional job with “cross-sectional” data but often has difficulty evaluating the quality of longitudinal data or of complex data that involve relationships among a number of sources at a single time point. Once these data are accepted into the data repository, the Biostatistics Core develops a set of SAS QA checks to detect subtle problems in complex data that cannot be detected with the synchronous and asynchronous validation procedures.
Since data from any multi-site study, including randomized clinical trials, have the potential for bias and differential reporting across sites, identifying potential confounders due to site differences is of utmost importance. Standard reports will be developed to identify potential differential confounders by site, such as the frequency of outcomes of interest, completeness of follow-up, differential treatment practices by site or region, and the distribution of all known potential confounders such as race and gender. Based on the potential bias and differential reporting identified through these reports, quality control of analysis results will include consideration of the handling of missing data, the definition of the analysis cohort, cross-checks between raw data summaries, and the plausibility of output.
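A standard by-site report of the kind described above might be sketched as follows. The record layout (keys `site`, `gender`, `had_outcome`, `followed_up`) is hypothetical, and Python is used only for illustration. An actual report would cover the full set of outcomes and confounders named in the analysis plan.

```python
from collections import Counter, defaultdict


def site_summary(patients):
    """Summarize outcome frequency, follow-up completeness, and gender distribution by site.

    patients: iterable of dicts with hypothetical keys
    'site', 'gender', 'had_outcome' (bool), and 'followed_up' (bool).
    """
    by_site = defaultdict(list)
    for patient in patients:
        by_site[patient["site"]].append(patient)

    report = {}
    for site, rows in sorted(by_site.items()):
        n = len(rows)
        report[site] = {
            "n": n,
            # Proportion of patients at the site with the outcome of interest.
            "outcome_frequency": sum(r["had_outcome"] for r in rows) / n,
            # Proportion of patients at the site with complete follow-up.
            "follow_up_completeness": sum(r["followed_up"] for r in rows) / n,
            # Counts of each recorded gender value at the site.
            "gender_distribution": dict(Counter(r["gender"] for r in rows)),
        }
    return report
```

Placing the per-site summaries side by side makes differences in outcome frequency or follow-up completeness visible before any adjusted analysis is run.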