Creating an early identification tool for child mental health
Research type
Research Study
Full title
Pilot study to create an anonymous linked dataset for children's health & social care to enable foundation work towards the development of an early identification tool for children’s mental health problems
IRAS ID
277542
Contact name
Anna Moore
Contact email
Sponsor organisation
Cambridgeshire & Peterborough NHS Foundation Trust and the University of Cambridge
Duration of Study in the UK
1 year, 11 months, 31 days
Research summary
We will link routinely collected, de-identified electronic record (ER) data relating to all individuals aged 0 to 18 years who have an ER at any of Cambridgeshire & Peterborough NHS Foundation Trust (CPFT), Cambridgeshire & Peterborough Local Authority (CPLA), Cambridgeshire University Hospitals (CUH), or Cambridgeshire Community Services (CCS) between 2012 and 2020.
We will use this dataset to: (1) test different linkage methods, including probabilistic and deterministic approaches, and establish the extent of non-matching and linkage bias; (2) establish which statistical methods, including traditional and machine learning approaches, most accurately predict mental health problems in children and young people (CYP); and (3) develop natural language processing (NLP) algorithms to generate novel structured variables from unstructured data, testing whether these improve the accuracy of predictive models.
An automated system will de-identify data at source using validated software (CRATE; Cardinal 2017, PubMed ID 28441940). Identifiable data (unique identifiers, postcodes, names and addresses) will be removed and replaced by person-specific, non-identifiable pseudonyms. In this pilot, de-identification will be irreversible, creating a de-identified dataset with no facility for re-identification.
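As an illustration of replacing identifiers with person-specific pseudonyms, the following is a minimal sketch only. It is not CRATE's implementation: the field names, the `pseudonymise` helper and the choice of HMAC-SHA256 as the keyed hash are all assumptions made for this example. A keyed hash is used because NHS numbers come from a small space of values, so an unkeyed hash could be reversed by trying every possible number.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names; a real source system will use its own schema.
IDENTIFYING_FIELDS = ("nhs_number", "name", "address", "postcode")


def pseudonymise(record, key):
    """Return a copy of `record` with identifiers dropped and a pseudonym added.

    The pseudonym is a keyed hash (HMAC-SHA256) of the NHS number, so the
    same person always receives the same pseudonym under the same key.
    """
    digest = hmac.new(
        key, record["nhs_number"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Keep only the non-identifying fields, then attach the pseudonym.
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["pseudonym"] = digest
    return cleaned


# Assumption: one secret key is generated per extraction run.
secret_key = secrets.token_bytes(32)

# Entirely made-up example record.
example = {
    "nhs_number": "0000000000",
    "name": "Example Person",
    "address": "1 Example Street",
    "postcode": "ZZ1 1ZZ",
    "referral_reason": "example value",
}
deidentified = pseudonymise(example, secret_key)
```

In this sketch the de-identified record carries no lookup table back to the original identifiers, and the pseudonym cannot be recomputed without the secret key.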
Linkage will take a two-phased approach: (1) linking NHS records together, using a deterministic approach based on the hashed NHS number as a unique identifier, and (2) linking this linked NHS dataset to CPLA records using a probabilistic approach based on hashed name, date of birth and postcode.
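The two phases can be sketched as follows. This is a hedged illustration, not the study's linkage code: the field names (`nhs_hash`, `name_hash`, `dob_hash`, `postcode_hash`), the m/u probabilities and the threshold are hypothetical, and the probabilistic step uses a simple Fellegi-Sunter style agreement weight as one common way to score candidate pairs. Because the fields are hashed, only exact agreement or disagreement can be compared.

```python
import math

# Illustrative values only: m = P(field agrees | same person),
# u = P(field agrees | different people). Real values would be estimated.
M_U = {
    "name_hash": (0.95, 0.01),
    "dob_hash": (0.98, 0.003),
    "postcode_hash": (0.90, 0.001),
}


def link_deterministic(a_records, b_records):
    """Phase 1: exact join of two NHS datasets on the hashed NHS number."""
    index = {r["nhs_hash"]: r for r in b_records}
    return [(a, index[a["nhs_hash"]]) for a in a_records if a["nhs_hash"] in index]


def match_weight(a, b):
    """Sum of log2 agreement/disagreement weights over the hashed fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def link_probabilistic(nhs_records, la_records, threshold):
    """Phase 2: link each NHS record to its best-scoring local authority
    record, keeping the pair only if the weight reaches the threshold."""
    links = []
    for a in nhs_records:
        best = max(la_records, key=lambda b: match_weight(a, b), default=None)
        if best is not None and match_weight(a, best) >= threshold:
            links.append((a, best))
    return links
```

Pairs scoring below the threshold are left unlinked in this sketch; counting them is one simple way to examine non-matching.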
The de-identified linked dataset will be used to undertake proof-of-concept studies that will inform the building of a ‘live’ linked database for direct care purposes in the future, and to train algorithms to identify mental health problems in CYP. We expect the predictive power of algorithms using only structured electronic health record data to be low. We will test whether the use of novel NLP-generated variables (such as presenting needs or particular symptoms) improves the models’ accuracy.
REC name
East Midlands - Leicester Central Research Ethics Committee
REC reference
20/EM/0299
Date of REC Opinion
10 Dec 2020
REC opinion
Favourable Opinion