Benchmarking domain-specific pretrained language models to identify the best model for methodological rigor in clinical studies
Zhou, Fangwen and Parrish, Rick and Afzal, Muhammad and Saha, Ashirbani and Haynes, R. Brian and Iorio, Alfonso and Lokker, Cynthia (2025) Benchmarking domain-specific pretrained language models to identify the best model for methodological rigor in clinical studies. Journal of Biomedical Informatics, 166. p. 104825. ISSN 1532-0464
Full text (PDF, 2MB): 1-s2.0-S1532046425000541-main.pdf, Published Version, available under a Creative Commons Attribution Non-commercial No Derivatives licence.
Abstract
Objective
Encoder-only transformer-based language models have shown promise in automating critical appraisal of clinical literature. However, a comprehensive evaluation of these models for classifying the methodological rigor of randomized controlled trials is needed to identify the most robust ones. This study benchmarks several state-of-the-art transformer-based language models using a diverse set of performance metrics.
Methods
Seven transformer-based language models were fine-tuned, under different configurations, on the titles and abstracts of 42,575 articles published from 2003 to 2023 in McMaster University’s Premium LiteratUre Service database. The studies reported in these articles addressed questions of treatment, prevention, or quality improvement, for which randomized controlled trials are the gold standard and criteria for rigorous methods are well defined. Models were evaluated on the validation set under 12 schemes spanning metrics such as cross-entropy loss, Brier score, AUROC, average precision, sensitivity, specificity, and accuracy. Threshold tuning was performed to optimize threshold-dependent metrics. Models that achieved the best performance in one or more schemes on the validation set were further tested on hold-out and external datasets.
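As a concrete illustration of the threshold-tuning step, the sketch below selects a decision threshold from validation-set probabilities. This is a minimal sketch, not the paper's implementation: the variable names, the synthetic data, the candidate-threshold grid, and the use of Youden's J (sensitivity + specificity − 1) as the tuning objective are assumptions; the study tunes thresholds separately for each threshold-dependent metric.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def tune_threshold(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold maximizing Youden's J = sensitivity + specificity - 1.

    The objective is an illustrative choice; any threshold-dependent
    metric (e.g., accuracy) could be maximized the same way.
    """
    best_t, best_j = 0.5, -np.inf
    for t in grid:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        if sens + spec - 1.0 > best_j:
            best_t, best_j = t, sens + spec - 1.0
    return best_t

# Hypothetical validation labels and probabilities, standing in for a
# fine-tuned encoder's outputs on the validation split.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_prob = np.clip(y_true * 0.4 + rng.random(500) * 0.6, 0.0, 1.0)

print("AUROC:", roc_auc_score(y_true, y_prob))     # threshold-independent
print("Brier:", brier_score_loss(y_true, y_prob))  # threshold-independent
print("Tuned threshold:", tune_threshold(y_true, y_prob))
```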
Results
A total of 210 models were fine-tuned. Six models achieved top performance in one or more evaluation schemes: three BioLinkBERT models outperformed the others on 8 of the 12 schemes, while BioBERT, BiomedBERT, and SciBERT were best on 1, 1, and 2 schemes, respectively. Model performance remained robust on the hold-out test set but declined on external datasets. Class-weight adjustments improved performance in most instances.
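The class-weight adjustment mentioned above can be realized as a weighted cross-entropy loss during fine-tuning. The following is a minimal sketch assuming PyTorch; the class counts, the inverse-frequency weighting formula, and the batch are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an imbalanced rigor-classification dataset;
# the paper's actual class distribution is not reproduced here.
n_neg, n_pos = 38_000, 4_575

# Inverse-frequency weights: the minority (rigorous) class gets a larger weight.
total = n_neg + n_pos
weights = torch.tensor([total / (2 * n_neg), total / (2 * n_pos)],
                       dtype=torch.float32)

loss_fn = nn.CrossEntropyLoss(weight=weights)

# Stand-ins for a batch of encoder logits and gold labels during fine-tuning.
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = loss_fn(logits, labels)  # errors on the minority class cost more
print(loss.item())
```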
Conclusion
BioLinkBERT generally outperformed the other models. Using comprehensive evaluation metrics together with threshold tuning optimizes model selection for real-world applications. Future work should assess generalizability to other datasets, explore alternative class-imbalance strategies, and examine training on full-text articles.
| Item Type: | Article |
| --- | --- |
| Identification Number: | 10.1016/j.jbi.2025.104825 |
| Dates: | Accepted: 3 April 2025; Published online: 15 April 2025 |
| Uncontrolled Keywords: | Deep learning, Encoder-only transformer, Natural language processing, Text classification, Evidence-based medicine, Critical appraisal |
| Subjects: | CAH11 - computing > CAH11-01 - computing > CAH11-01-01 - computer science |
| Divisions: | Architecture, Built Environment, Computing and Engineering > Computer Science |
| Depositing User: | Gemma Tonks |
| Date Deposited: | 19 Aug 2025 14:12 |
| Last Modified: | 19 Aug 2025 14:12 |
| URI: | https://www.open-access.bcu.ac.uk/id/eprint/16605 |