Getting started

This page provides a guide to get you started incorporating validated indicators to identify ACEs in EHRs.

1. Prerequisites & Governance

Everything the researcher needs before touching the data. To use the ACE indicators, you will need:

Data Access: Securing authorised EHR data. Obtain authorised and de-identified data of children and parents for research purposes. Most ACEs are captured in primary care. Several data sources in the UK provide linked primary care data of children and parents, including the Clinical Practice Research Datalink (CPRD), The Health Improvement Network (THIN), and QResearch. Access to CPRD requires protocol approval. Please visit the Health Data Research Innovation Gateway for more information.

OR: Access locally stored EHR data: Many NHS trusts provide streamlined processes to access local data sources like the Clinical Record Interactive Search (1, 2, 3). These often have a data structure consistent with national data sources like HES-APC.
Secure Data Environments (SDE): Ensure that your work is conducted within a secure data environment approved for working with EHRs to protect privacy and meet compliance standards.
Data Linkage: Implement robust linkage of child and parent records. Providers like CPRD allow you to access established mother-child linkages. For other linkage methods, please see existing methodologies (1, 2, 3). For paternal linkage, see methods described here.

2. The ACE Framework

Understanding the methodology underpinning the variables.

Domains & Indicators

We developed two measures of ACEs for electronic health records (EHRs):

Domains: Grouped, high-level indicators.
Indicators: Specific grouped codes or measures.

Indicator 2

Indicator 1

Most specific indicator

Code (alphanumerical code attached to event)

The "Think-Family" Approach

Unless specified, indicators refer to information recorded in the child's, the mother's and the father's records. Properly studying ACEs requires a "think-family" approach. We recognise that the health and well-being of children are intricately tied to their parents' experiences and health outcomes.

Sample Size Guidelines

The ability to use specific indicators depends on your sample size. We recommend restricting the disaggregation of specific indicators to those present in 250 or more unique children. For many studies, it will be most appropriate to use the six broad ACE domains only.

3. Working with Code Lists

The translation layer between the framework and the raw data.

The ACE indicators are mapped onto codes stored in code lists. The current ACEs indicators include ICD-9/10 codes (hospital/death records), Read codes, SNOMED-CT codes, medcodes, prodcodes, gemscripts, and specific field codes for HES datasets.

Code List Dictionary

Variable	Explanation
Code	Data ready code. Contains added prefixes to prevent de-duplication across different systems.
Code original	Original code exactly as entered into the respective system.
Indicator code	Name of the specific indicator the code maps to.
Severity	Severity classification of the code description (e.g. for alcohol misuse: `mild`, `moderate`, `severe`).
Scale	`1` = Code/severity are dependent on extra clinical information (helper functions required).
Reference	`1` = part of family violence reference standard, `2` = broader measure.
Individual	`1` = child only, `2` = mother only, `3` = mother or child, `4` = female children only.

Prefix Conversion Table

A substantial proportion of codes share the same alphanumeric sequence (e.g., "11246" is a medication in prodcode, but a diagnosis in medcode). To prevent accidental deduplication, we added prefixes to the master lists. You will need to convert your raw data codes to match:

Data source & Coding system	Prefix	Example
CPRD GOLD: Prodcode	`d_`	d_727 - Sertraline
ONS Mortality ≤2000 & HES-APC ≤1998	`e_`	e_5713 - Alcoholic liver damage
HES-APC: OPCS-4	`p_`	p_X66 - CBT Therapy
HES-A&E: Specific diagnosis system	`aed_`	aed_144 - Poisoning

4. The Data Pipeline (Preparation)

Getting the messy raw data ready for analysis.

Extraction & Restructuring: Extract your raw files via SQL. Restructure all files and variables into a consistent "long format". Drop irrelevant data fields to reduce memory weight.
Cleaning & Standardization: Remove white space/punctuation from codes and standardize date formats.
Code list data extraction of larger files: Implement robust strategies to filter and extract patient data against specific code lists without crashing your system's memory.

Example: Cleaning codes and standardizing dates (R)

This uses the modern stringr and lubridate packages. It cleanly converts your date string into a genuine Date object (which natively formats as YYYY-MM-DD) and cleans the codes in one step, without ruining the rest of your columns.

R Script

library(dplyr)
library(stringr)
library(lubridate)

aces_data <- aces_data %>%
  mutate(
    # Clean codes: removes both spaces and periods in a single pass
    medcode = str_remove_all(medcode, "[\\s\\.]"),
    
    # Standardize dates: smartly parses DD-MM-YYYY into a proper R Date object
    date_var = dmy(date_var) 
  )

Data Extraction Strategies: In-Memory vs. Chunking

Depending on your server’s memory capacity and the size of your datasets, you will need to choose the right extraction strategy. Below are two approaches for filtering patient data against your specific code lists.

Option 1: In-Memory Filtering (For Standard Datasets)

When your server has enough RAM to hold the entire dataset at once (e.g., loading a 20GB file on a machine with 64GB+ of RAM), a standard in-memory approach using data.table, dplyr, and fastmatch is the fastest method. This example script loads the full file, filters it against your code list, and saves the output before clearing the memory for the next file.

R Script

# ==============================================================================
# OPTION 1: IN-MEMORY EXTRACTION - EXAMPLE CPRD GOLD
# (Note: Using a master file approach where all CRPD GOLD files are combined)
# Use ONLY if your system RAM comfortably exceeds your file sizes.
# ==============================================================================
library(data.table)
library(dplyr)
library(fastmatch)

# 1. PREPARE & CLEAN THE CODELIST ----------------------------------------------
cat("Loading and cleaning codelist...\n")

# Read codelist, filter specific systems, remove NA’s
codelist_file <- "INSERT PATH TO FILE"

codelist <- fread(codelist_file, colClasses = "character", header = TRUE) %>% 
  filter(system %in% c("read", "prodcode", "OXMIS")) %>%
  filter(code != "") 

# Clean code: strip spaces and punctuation
codelist$code <- gsub('\\s+', '', codelist$code)
codelist$code <- gsub('\\.', '', codelist$code)

# Keep only distinct codes to speed up filtering
codelist <- codelist %>% distinct(code, .keep_all = TRUE)

# Garbage collection to free memory before loading large data
gc() 

# Define exactly which columns to extract (dropping unused saves RAM)
target_cols <- c("patid", "medcode", "eventdate", "enttype", "data1", "data2", "data3", "data4", "data5", "data6", "data7", "source")

# 2. PROCESS DATA (Example: Mother Data) ---------------------------------------
cat("Processing Mother Data...\n")

# Load the entire dataset into RAM
all_mum <- fread("cprd_all_mum_data.txt", colClasses = "character", header = TRUE, nThread = 8, select = target_cols)

# Filter the massive dataset against our cleaned codelist
mother_aces_data <- all_mum %>% filter(medcode %fin% codelist$code)

# Write filtered data to a new file
fwrite(mother_aces_data, "mother_aces_data_cprd_gold.txt", sep = "\t", nThread = 7)

# CRITICAL: Delete large tables from RAM and clear memory before moving on
rm(all_mum, mother_aces_data)
gc()

# -> Repeat Step 2 for subsequent files (e.g., Baby data, Dad data)

Option 2: Filtering Massive Files (Chunking Approach)

We recommend a high-performance chunking approach when extracting data from extremely large datasets (e.g., 100GB+) that cannot fit into RAM. This iterates through the massive file in fixed-size batches—filtering against your medical code list and appending the matches to a new file—ensuring you never crash your system's memory.

R Script

# ==============================================================================
# OPTIMIZED R SCRIPT: CODELIST EXTRACTION ONLY (CHUNKING)
# Strategy:
# 1. Read huge file in chunks.
# 2. Filter ONLY based on Medical codelist.
# 3. Append to result file.
# ==============================================================================

# 1. LIBRARIES -----------------------------------------------------------------
library(data.table)
library(bit64) # Essential for CPRD ID handling

# 2. SETUP PARAMETERS ----------------------------------------------------------

# INPUT: The massive 100GB+ file
input_large_file <- "INSERT LARGE FILE PATH HERE"

# OUTPUT: where the filtered data will be saved
output_final_file <- "INSERT OUTPUT FILE PATH HERE"

# FILTER: Codelist only
codelist_file <- "INSERT ACES CODELIST FILE PATH HERE"

# CONFIGURATION:
# Column name for codes in the LARGE data (usually medcodeid or prodcodeid)
target_code_col <- "medcodeid"
# Column name for codes in your CODELIST file
codelist_col_name <- "code"

# CHUNK SIZE: 5 million rows is a sweet spot for speed vs memory.
chunk_size <- 5000000

# 3. PREPARE FILTER (CODELIST ONLY) --------------------------------------------
cat("Preparing codelist lookup...\n")

codes_raw <- fread(codelist_file)
target_codes <- unique(codes_raw[[codelist_col_name]])
rm(codes_raw)

cat(paste0("Ready: Filtering for ", length(target_codes), " medical codes.\n"))

# 4. INITIALIZE LOOP -----------------------------------------------------------

# Read header row first to get column names
header_row <- names(fread(input_large_file, nrows = 0))

if(!(target_code_col %in% header_row)) stop(paste0("Column '", target_code_col, "' not found."))

current_skip <- 0
first_write <- TRUE
chunk_counter <- 1
found_rows_total <- 0

# 5. EXECUTION LOOP ------------------------------------------------------------
cat("Starting High-Performance Chunking...\n")

repeat {

  # --- STEP A: READ CHUNK ---
  if(chunk_counter %% 10 == 0) cat(paste0("...Processing Batch ", chunk_counter, "...\n"))

  if(current_skip == 0) {
    # First chunk: Read header normally
    chunk_dt <- fread(input_large_file, nrows = chunk_size, header = TRUE, nThread = 8, colClasses = "character")
  } else {
    # Subsequent chunks: Skip rows, No header
    chunk_dt <- fread(input_large_file, skip = current_skip, nrows = chunk_size, header = FALSE, nThread = 8, colClasses = "character")

    if (nrow(chunk_dt) == 0) break
    setnames(chunk_dt, header_row)
  }

  rows_read <- nrow(chunk_dt)

  # --- STEP B: FILTERING ---
  chunk_dt <- chunk_dt[get(target_code_col) %in% target_codes]

  # --- STEP C: WRITE RESULTS ---
  matches_n <- nrow(chunk_dt)

  if (matches_n > 0) {
    cat(paste0(" Batch ", chunk_counter, ": Found ", matches_n, " matches. Appending.\n"))

    fwrite(chunk_dt,
           file = output_final_file,
           append = !first_write,
           sep = "\t",
           col.names = first_write,
           nThread = 8)

    first_write <- FALSE
    found_rows_total <- found_rows_total + matches_n
  }

  # --- STEP D: CLEANUP & INCREMENT ---
  rm(chunk_dt)
  if(chunk_counter %% 5 == 0) gc()

  if(rows_read < chunk_size) {
    cat("End of file reached.\n")
    break
  }

  if (chunk_counter == 1) {
    current_skip <- rows_read + 1
  } else {
    current_skip <- current_skip + rows_read
  }

  chunk_counter <- chunk_counter + 1
}

5. Applying Algorithms

Refining the data to ensure accuracy.

Time Restrictions

The validated ACE indicators are time sensitive and generally apply anytime between 2 years before birth and 10 years after birth. However, most child maltreatment algorithms are strictly limited to 2 years before birth up to 5 years after birth. Ascertaining correct exposure times is essential.

Helper Functions & Rule-Based Criteria: Some codes require extra clinical information or co-occurrence rules. Use control flow logic to define these parameters.

Example: Depression requiring symptom/medication AND a diagnosis within 2 years

depression_cases <- ehr_data %>%
  filter(
    # Has symptoms or medications
    any_of(depression_symptoms) | any_of(depression_medications),
    
    # AND co-occurs with a previous code within 2 years (365*2 days)
    any_of(previous_codes) & visit_date >= (Sys.Date() - 365 * 2)
  )

6. Downloads & Resources

The actual files researchers need to execute the work.

ACEsinEHRs control documentation & release information

Implementation Note

The indicators use control flow methods to implement rule-based algorithms must be applied to specific indicators (mainly HRP-CM) to prevent misclassification including age-restrictions, exclusions of accidental injuries, genetic predispositions, traumatic birth injuries or maternal-child transmissions.

Master Code Lists

Right-click on link to save as a ".txt" file.

All ACEs Complete Bundle

Total included codes: 23,164

Primary Care GP Bundle (CPRD / AURUM)

medcodes, prodcodes, measures only (20,925 codes)

Hospital Records Bundle

ICD, OPCS and HES-APC specific fields (2,239 codes)

Domain-Specific Lists

Child maltreatment

3,852 codes

Intimate partner violence

3,281 codes

High-risk presentation of CM

1,774 codes

Adverse family environment

2,308 codes

Parental mental health

9,063 codes

Parental substance misuse

2,886 codes

Exclusion & Context Code Lists

Accidents

3,356 codes

Osteoporosis - Bone disease

406 codes

Birth injuries/complications

238 codes

Mother-to-child-transmissions

15 codes

Covariates (Non-ACEs)

Used to add context/info to risk prediction models (8,808 codes)

Getting started

1. Prerequisites & Governance

2. The ACE Framework

Domains & Indicators

The "Think-Family" Approach

Sample Size Guidelines

3. Working with Code Lists

Code List Dictionary

Prefix Conversion Table

4. The Data Pipeline (Preparation)

5. Applying Algorithms

Time Restrictions

6. Downloads & Resources

Implementation Note

Master Code Lists

All ACEs Complete Bundle

Primary Care GP Bundle (CPRD / AURUM)

Hospital Records Bundle

Domain-Specific Lists

Child maltreatment

Intimate partner violence

High-risk presentation of CM

Adverse family environment

Parental mental health

Parental substance misuse

Exclusion & Context Code Lists

Accidents

Osteoporosis - Bone disease

Birth injuries/complications

Mother-to-child-transmissions

Covariates (Non-ACEs)

Explore ACE Domains

Browse Code Lists