Getting started
A quick introduction to incorporating validated indicators, code lists and scripts for measuring intervenable and clinically relevant ACEs using EHRs.
We are passionate about using EHRs to advance healthcare, social justice, and policy for families affected by adversity. However, ACEs in EHRs are not easily accessible and can be challenging to implement without guidance.
This page provides a guide to get you started incorporating validated indicators to identify ACEs in EHRs.
1. Prerequisites & Governance
Everything the researcher needs before touching the data. To use the ACE indicators, you will need:
- Data Access: Obtain authorised, de-identified EHR data for children and parents for research purposes. Most ACEs are captured in primary care. Several data sources in the UK provide linked primary care data for children and parents, including the Clinical Practice Research Datalink (CPRD), The Health Improvement Network (THIN), and QResearch. Access to CPRD requires protocol approval. Please visit the Health Data Research Innovation Gateway for more information. Alternatively, access locally stored EHR data: many NHS trusts provide streamlined processes to access local data sources such as the Clinical Record Interactive Search (1, 2, 3). These often have a data structure consistent with national data sources like HES-APC.
- Secure Data Environments (SDE): Ensure that your work is conducted within a secure data environment approved for working with EHRs to protect privacy and meet compliance standards.
- Data Linkage: Implement robust linkage of child and parent records. Providers like CPRD allow you to access established mother-child linkages. For other linkage methods, please see existing methodologies (1, 2, 3). For paternal linkage, see methods described here.
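As a minimal sketch of what linked records look like in practice, the snippet below attaches a mother's identifier to each child record. The column names `babypatid` and `mumpatid` are illustrative placeholders; CPRD's mother-baby link file uses its own schema, so adapt the join accordingly.

```r
library(dplyr)

# Hypothetical mother-baby link file (column names are placeholders)
mum_baby_link <- data.frame(babypatid = c(101, 102), mumpatid = c(901, 902))

# Toy child event data
child_events <- data.frame(patid = c(101, 101, 102), medcode = c("a1", "b2", "c3"))

# Attach the mother's patid to every child record for think-family analyses
linked <- child_events %>%
  left_join(mum_baby_link, by = c("patid" = "babypatid"))
```

Once linked, maternal (and, where available, paternal) records can be screened for ACE codes alongside the child's own record.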
2. The ACE Framework
Understanding the methodology underpinning the variables.
Domains & Indicators
We developed two measures of ACEs for electronic health records (EHRs):
- Domains: Grouped, high-level indicators.
- Indicators: Specific grouped codes or measures.
The "Think-Family" Approach
Unless specified, indicators refer to information recorded in the child's, the mother's and the father's records. Properly studying ACEs requires a "think-family" approach. We recognise that the health and well-being of children are intricately tied to their parents' experiences and health outcomes.
Sample Size Guidelines
The ability to use specific indicators depends on your sample size. We recommend restricting the disaggregation of specific indicators to those present in 250 or more unique children. For many studies, it will be most appropriate to use the six broad ACE domains only.
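The threshold check above can be sketched as follows, assuming a long-format table with one row per recorded ACE code per child (column names `patid` and `indicator` are illustrative).

```r
library(dplyr)

# Toy long-format data: one row per recorded ACE code per child
ace_events <- data.frame(
  patid     = c(1, 2, 3, 3),
  indicator = c("parental_alcohol", "parental_alcohol",
                "parental_alcohol", "child_maltreatment")
)

# Keep indicators recorded in at least `threshold` unique children
# (threshold is 2 here purely for the toy data; use 250 in practice)
threshold <- 2
usable <- ace_events %>%
  group_by(indicator) %>%
  summarise(n_children = n_distinct(patid), .groups = "drop") %>%
  filter(n_children >= threshold)
```

Indicators falling below the threshold should be collapsed into their parent domain rather than reported separately.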
3. Working with Code Lists
The translation layer between the framework and the raw data.
The ACE indicators are mapped onto codes stored in code lists. The current ACE indicators include ICD-9/10 codes (hospital/death records), Read codes, SNOMED-CT codes, medcodes, prodcodes, gemscripts, and specific field codes for HES datasets.
Code List Dictionary
| Variable | Explanation |
|---|---|
| Code | Analysis-ready code. Contains an added prefix to prevent accidental de-duplication of identical code strings across different coding systems. |
| Code original | Original code exactly as entered into the respective system. |
| Indicator code | Name of the specific indicator the code maps to. |
| Severity | Severity classification of the code description (e.g. for alcohol misuse: `mild`, `moderate`, `severe`). |
| Scale | 1 = Code/severity are dependent on extra clinical information (helper functions required). |
| Reference | 1 = part of family violence reference standard, 2 = broader measure. |
| Individual | 1 = child only, 2 = mother only, 3 = mother or child, 4 = female children only. |
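A minimal sketch of how these dictionary fields can be used to subset a code list. The toy rows below are illustrative (`x_001` is a made-up code); in practice you would load the master code list file and filter on the same columns.

```r
library(dplyr)

# Toy code list illustrating the dictionary fields above ("x_001" is made up)
codelist <- data.frame(
  code       = c("d_727", "e_5713", "x_001"),
  indicator  = c("antidepressant", "alcoholic_liver_damage", "physical_abuse"),
  reference  = c("2", "2", "1"),
  individual = c("2", "2", "1")
)

# Codes in the family violence reference standard (reference == 1)
# recorded in the child's record only (individual == 1)
child_reference_codes <- codelist %>%
  filter(reference == "1", individual == "1")
```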
Prefix Conversion Table
A substantial proportion of codes share the same alphanumeric sequence (e.g., "11246" is a medication in prodcode, but a diagnosis in medcode). To prevent accidental deduplication, we added prefixes to the master lists. You will need to convert your raw data codes to match:
| Data source & Coding system | Prefix | Example |
|---|---|---|
| CPRD GOLD: Prodcode | d_ | d_727 - Sertraline |
| ONS Mortality ≤2000 & HES-APC ≤1998 | e_ | e_5713 - Alcoholic liver damage |
| HES-APC: OPCS-4 | p_ | p_X66 - CBT Therapy |
| HES-A&E: Specific diagnosis system | aed_ | aed_144 - Poisoning |
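The prefix conversion itself is a one-line transformation. Below is a minimal sketch for CPRD GOLD prodcodes using the `d_` prefix from the table above; the data frame and column names are toy examples.

```r
library(dplyr)

# Toy raw CPRD GOLD therapy extract
therapy <- data.frame(patid = c(1, 2), prodcode = c("727", "11246"))

# Prepend the "d_" prefix so raw prodcodes match the master code lists
therapy <- therapy %>%
  mutate(prodcode = paste0("d_", prodcode))
```

Apply the corresponding prefix (`e_`, `p_`, `aed_`, etc.) to each data source before matching against the master lists.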
4. The Data Pipeline (Preparation)
Getting the messy raw data ready for analysis.
- Extraction & Restructuring: Extract your raw files via SQL. Restructure all files and variables into a consistent "long format". Drop irrelevant data fields to reduce the memory footprint.
- Cleaning & Standardization: Remove white space/punctuation from codes and standardize date formats.
- Code list data extraction of larger files: Implement robust strategies to filter and extract patient data against specific code lists without crashing your system's memory.
Example: Cleaning codes and standardizing dates (R)
This uses the stringr and lubridate packages. It converts your date string into a genuine Date object (which natively formats as YYYY-MM-DD) and cleans the codes in one step, without altering the rest of your columns.

```r
library(dplyr)
library(stringr)
library(lubridate)

aces_data <- aces_data %>%
  mutate(
    # Clean codes: remove both spaces and periods in a single pass
    medcode = str_remove_all(medcode, "[\\s\\.]"),
    # Standardize dates: parse DD-MM-YYYY into a proper R Date object
    date_var = dmy(date_var)
  )
```
Data Extraction Strategies: In-Memory vs. Chunking
Depending on your server’s memory capacity and the size of your datasets, you will need to choose the right extraction strategy. Below are two approaches for filtering patient data against your specific code lists.
Option 1: In-Memory Filtering (For Standard Datasets)
When your server has enough RAM to hold the entire dataset at once (e.g., loading a 20GB file on a machine with 64GB+ of RAM), a standard in-memory approach using data.table, dplyr, and fastmatch is the fastest method. This example script loads the full file, filters it against your code list, and saves the output before clearing the memory for the next file.
```r
# ==============================================================================
# OPTION 1: IN-MEMORY EXTRACTION - EXAMPLE CPRD GOLD
# (Note: Using a master file approach where all CPRD GOLD files are combined)
# Use ONLY if your system RAM comfortably exceeds your file sizes.
# ==============================================================================
library(data.table)
library(dplyr)
library(fastmatch)

# 1. PREPARE & CLEAN THE CODELIST ----------------------------------------------
cat("Loading and cleaning codelist...\n")

# Read codelist, filter specific systems, remove empty codes
codelist_file <- "INSERT PATH TO FILE"
codelist <- fread(codelist_file, colClasses = "character", header = TRUE) %>%
  filter(system %in% c("read", "prodcode", "OXMIS")) %>%
  filter(code != "")

# Clean codes: strip spaces and punctuation
codelist$code <- gsub('\\s+', '', codelist$code)
codelist$code <- gsub('\\.', '', codelist$code)

# Keep only distinct codes to speed up filtering
codelist <- codelist %>% distinct(code, .keep_all = TRUE)

# Garbage collection to free memory before loading large data
gc()

# Define exactly which columns to extract (dropping unused columns saves RAM)
target_cols <- c("patid", "medcode", "eventdate", "enttype", "data1", "data2",
                 "data3", "data4", "data5", "data6", "data7", "source")

# 2. PROCESS DATA (Example: Mother Data) ---------------------------------------
cat("Processing Mother Data...\n")

# Load the entire dataset into RAM
all_mum <- fread("cprd_all_mum_data.txt", colClasses = "character",
                 header = TRUE, nThread = 8, select = target_cols)

# Filter the full dataset against the cleaned codelist
mother_aces_data <- all_mum %>% filter(medcode %fin% codelist$code)

# Write filtered data to a new file
fwrite(mother_aces_data, "mother_aces_data_cprd_gold.txt", sep = "\t", nThread = 8)

# CRITICAL: Delete large tables from RAM and clear memory before moving on
rm(all_mum, mother_aces_data)
gc()

# -> Repeat Step 2 for subsequent files (e.g., Baby data, Dad data)
```
Option 2: Filtering Massive Files (Chunking Approach)
We recommend a high-performance chunking approach when extracting data from extremely large datasets (e.g., 100GB+) that cannot fit into RAM. This approach iterates through the massive file in fixed-size batches, filtering each batch against your medical code list and appending the matches to a new file, so you never exhaust your system's memory.
```r
# ==============================================================================
# OPTIMIZED R SCRIPT: CODELIST EXTRACTION ONLY (CHUNKING)
# Strategy:
#   1. Read huge file in chunks.
#   2. Filter ONLY based on medical codelist.
#   3. Append matches to the result file.
# ==============================================================================

# 1. LIBRARIES -----------------------------------------------------------------
library(data.table)
library(bit64)   # Essential for CPRD ID handling

# 2. SETUP PARAMETERS ----------------------------------------------------------
# INPUT: the massive 100GB+ file
input_large_file <- "INSERT LARGE FILE PATH HERE"

# OUTPUT: where the filtered data will be saved
output_final_file <- "INSERT OUTPUT FILE PATH HERE"

# FILTER: codelist only
codelist_file <- "INSERT ACES CODELIST FILE PATH HERE"

# CONFIGURATION:
# Column name for codes in the LARGE data (usually medcodeid or prodcodeid)
target_code_col <- "medcodeid"
# Column name for codes in your CODELIST file
codelist_col_name <- "code"
# CHUNK SIZE: 5 million rows is a sweet spot for speed vs memory.
chunk_size <- 5000000

# 3. PREPARE FILTER (CODELIST ONLY) --------------------------------------------
cat("Preparing codelist lookup...\n")
# Read codes as character so they match the character columns read below
codes_raw <- fread(codelist_file, colClasses = "character")
target_codes <- unique(codes_raw[[codelist_col_name]])
rm(codes_raw)
cat(paste0("Ready: Filtering for ", length(target_codes), " medical codes.\n"))

# 4. INITIALIZE LOOP -----------------------------------------------------------
# Read header row first to get column names
header_row <- names(fread(input_large_file, nrows = 0))
if (!(target_code_col %in% header_row)) stop(paste0("Column '", target_code_col, "' not found."))

current_skip <- 0
first_write <- TRUE
chunk_counter <- 1
found_rows_total <- 0

# 5. EXECUTION LOOP ------------------------------------------------------------
cat("Starting High-Performance Chunking...\n")
repeat {
  # --- STEP A: READ CHUNK ---
  if (chunk_counter %% 10 == 0) cat(paste0("...Processing Batch ", chunk_counter, "...\n"))
  if (current_skip == 0) {
    # First chunk: read header normally
    chunk_dt <- fread(input_large_file, nrows = chunk_size, header = TRUE,
                      nThread = 8, colClasses = "character")
  } else {
    # Subsequent chunks: skip rows, no header
    chunk_dt <- fread(input_large_file, skip = current_skip, nrows = chunk_size,
                      header = FALSE, nThread = 8, colClasses = "character")
    if (nrow(chunk_dt) == 0) break
    setnames(chunk_dt, header_row)
  }
  rows_read <- nrow(chunk_dt)

  # --- STEP B: FILTERING ---
  chunk_dt <- chunk_dt[get(target_code_col) %in% target_codes]

  # --- STEP C: WRITE RESULTS ---
  matches_n <- nrow(chunk_dt)
  if (matches_n > 0) {
    cat(paste0("  Batch ", chunk_counter, ": Found ", matches_n, " matches. Appending.\n"))
    fwrite(chunk_dt,
           file = output_final_file,
           append = !first_write,
           sep = "\t",
           col.names = first_write,
           nThread = 8)
    first_write <- FALSE
    found_rows_total <- found_rows_total + matches_n
  }

  # --- STEP D: CLEANUP & INCREMENT ---
  rm(chunk_dt)
  if (chunk_counter %% 5 == 0) gc()
  if (rows_read < chunk_size) {
    cat("End of file reached.\n")
    break
  }
  if (chunk_counter == 1) {
    # Account for the header line consumed by the first read
    current_skip <- rows_read + 1
  } else {
    current_skip <- current_skip + rows_read
  }
  chunk_counter <- chunk_counter + 1
}
```
5. Applying Algorithms
Refining the data to ensure accuracy.
Time Restrictions
The validated ACE indicators are time sensitive and generally apply at any time between 2 years before birth and 10 years after birth. However, most child maltreatment algorithms are strictly limited to the window from 2 years before birth to 5 years after birth. Ascertaining correct exposure times is essential.
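A minimal sketch of applying the general exposure window, assuming per-child `dob` and event `eventdate` columns stored as Date values (names are illustrative).

```r
library(dplyr)
library(lubridate)

# Toy events: one inside and one outside the exposure window
events <- data.frame(
  patid     = c(1, 1),
  dob       = as.Date(c("2010-06-01", "2010-06-01")),
  eventdate = as.Date(c("2009-01-15", "2021-07-01"))
)

# Keep events between 2 years before and 10 years after birth
in_window <- events %>%
  filter(eventdate >= dob - years(2),
         eventdate <= dob + years(10))
```

For child maltreatment algorithms, tighten the upper bound to `dob + years(5)`.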
Helper Functions & Rule-Based Criteria: Some codes require extra clinical information or co-occurrence rules. Use control flow logic to define these parameters.
Example: Depression requiring a symptom/medication code AND a diagnosis code within 2 years. This is an illustrative sketch: the code-list vectors (`depression_symptoms`, `depression_medications`, `depression_diagnoses`) and column names are placeholders.

```r
library(dplyr)

# Placeholder code lists and column names (adapt to your data)
symptoms <- ehr_data %>%
  filter(medcode %in% c(depression_symptoms, depression_medications))
diagnoses <- ehr_data %>%
  filter(medcode %in% depression_diagnoses)

# Keep patients whose symptom/medication record co-occurs with a
# diagnosis recorded within 2 years (365 * 2 days)
depression_cases <- symptoms %>%
  inner_join(diagnoses, by = "patid", suffix = c("_sym", "_dx")) %>%
  filter(abs(as.numeric(eventdate_sym - eventdate_dx)) <= 365 * 2) %>%
  distinct(patid)
```
6. Downloads & Resources
The actual files researchers need to execute the work.
ACEsinEHRs control documentation & release information
Implementation Note
Rule-based algorithms, implemented with control flow methods, must be applied to specific indicators (mainly HRP-CM) to prevent misclassification. These include age restrictions and exclusions for accidental injuries, genetic predispositions, traumatic birth injuries, and maternal-child transmissions.
Master Code Lists
Right-click on a link and choose "Save link as" to save the ".txt" file.