Fine-Tuning and Benchmarking Transformer Models for Multiclass Classification of Clinical Research Papers: Retrospective Modeling Study
Zhou, Fangwen and Lokker, Cynthia and Parrish, Rick and Haynes, R Brian and Iorio, Alfonso and Saha, Ashirbani and Afzal, Muhammad (2026) Fine-Tuning and Benchmarking Transformer Models for Multiclass Classification of Clinical Research Papers: Retrospective Modeling Study. JMIR AI, 5. e77311. ISSN 2817-1705
ai-2026-1-e77311.pdf (Published Version, 933 kB). Available under a Creative Commons Attribution License.
Abstract
Background
The exponential growth of digital information has led to an unprecedented expansion in the volume of unstructured text data. Efficient classification of these data is critical for timely evidence synthesis and informed decision-making in health care. Machine learning techniques have shown considerable promise for text classification tasks. However, multiclass classification of papers by study publication type has been largely overlooked compared to binary or multilabel classification. Addressing this gap could significantly enhance knowledge translation workflows and support systematic review processes.
Objective
This study aimed to fine-tune and evaluate domain-specific transformer-based language models on a gold-standard dataset for multiclass classification of clinical literature into mutually exclusive categories: original studies, reviews, evidence-based guidelines, and nonexperimental studies.
Methods
The titles and abstracts of 162,380 papers from McMaster’s Premium Literature Service (PLUS) dataset were used to fine-tune seven domain-specific transformer models. Clinical experts classified the papers into four mutually exclusive publication types. PLUS data were split in an 80:10:10 ratio into training, validation, and testing sets, with the Clinical Hedges dataset used for external validation. A grid search evaluated the impact of class weight (CW) adjustments, learning rate (LR), batch size (BS), warmup ratio, and weight decay (WD), totaling 1890 configurations. Models were assessed using 10 metrics, including the area under the receiver operating characteristic curve (AUROC), the F1-score (harmonic mean of precision and recall), and the Matthews correlation coefficient (MCC). Per-class performance was assessed using a one-vs-rest approach, and overall performance was assessed using the macro average. The optimal models identified from the validation results were further tested on both PLUS and Clinical Hedges, with calibration assessed visually.
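The grid search over CW, LR, BS, warmup ratio, and WD can be sketched as an enumeration of hyperparameter combinations. This is a minimal illustration only: the tuned dimensions come from the text, but the specific grid values below are assumptions (the paper's exact grids, which total 1890 configurations, are not given here), so this toy grid yields a different count.

```python
# Hypothetical sketch of a hyperparameter grid search over the five
# dimensions named in the Methods. All grid values are assumed, not
# taken from the paper.
from itertools import product

grid = {
    "class_weight":  ["none", "balanced"],        # assumed options
    "learning_rate": [1e-5, 3e-5, 5e-5],          # assumed values
    "batch_size":    [16, 32, 64, 128],           # assumed values
    "warmup_ratio":  [0.0, 0.06, 0.1],            # assumed values
    "weight_decay":  [0.005, 0.01, 0.1],          # assumed values
}

def configurations(grid):
    """Yield every hyperparameter combination as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(grid))
print(len(configs))  # 2 * 3 * 4 * 3 * 3 = 216 combinations for this toy grid
```

Each resulting dict would parameterize one fine-tuning run; in practice these values map onto a trainer's configuration (e.g., learning rate, batch size, warmup ratio, and weight decay arguments).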
Results
The 10 best-performing models achieved macro AUROC ≥0.99, F1-score ≥0.89, and MCC ≥0.88 on the validation and testing sets. Performance declined on Clinical Hedges. Models were consistently better at classifying original studies and reviews. Models based on BioBERT (Bidirectional Encoder Representations from Transformers fine-tuned on biomedical text) had superior calibration, especially for original studies and reviews. Optimal configurations from the search included lower LRs (1 × 10⁻⁵ and 3 × 10⁻⁵), midrange BSs (32-128), and lower WD (0.005-0.010). CW adjustments improved recall but generally reduced performance on other metrics. Models generally struggled to accurately classify nonexperimental and guideline studies, potentially due to class imbalance and content heterogeneity.
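The macro-averaged, one-vs-rest F1-score reported in these results can be sketched as follows. This is a minimal pure-Python illustration with toy labels, not the paper's evaluation code; the class names and example predictions are assumptions for demonstration.

```python
# Minimal sketch of one-vs-rest F1 with macro averaging over the four
# publication types. Labels and predictions below are toy examples.
CLASSES = ["original", "review", "guideline", "nonexperimental"]

def binary_f1(y_true, y_pred, positive):
    """F1-score treating `positive` as the positive class (one-vs-rest)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # F1 is 0 whenever there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted average of the one-vs-rest F1-scores over all classes."""
    return sum(binary_f1(y_true, y_pred, c) for c in classes) / len(classes)

y_true = ["original", "review", "guideline", "original", "nonexperimental", "review"]
y_pred = ["original", "review", "original", "original", "nonexperimental", "review"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.7 for this toy example
```

Because the macro average weights every class equally, a model that misses the rare guideline class (as in this toy example) is penalized as heavily as one that misses the abundant original-study class, which is why class imbalance shows up clearly in the macro metrics.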
Conclusions
This study used a comprehensive hyperparameter search to demonstrate the effectiveness of fine-tuned transformer models, notably BioBERT variants, for multiclass clinical literature classification. Although class weighting generally decreased overall performance, addressing class imbalance through alternative methods, such as hierarchical classification or targeted resampling, warrants future exploration. Hyperparameter configuration was crucial for robust performance, in line with the previous literature. These findings support future modeling research and the practical deployment of such models in human-in-the-loop systems for knowledge synthesis and translation workflows.
| Item Type: | Article |
|---|---|
| Identification Number: | 10.2196/77311 |
| Dates: | Accepted: 17 March 2026; Published Online: 29 April 2026 |
| Uncontrolled Keywords: | classification, deep learning, information science, medical informatics, natural language processing |
| Subjects: | CAH11 - computing > CAH11-01 - computing > CAH11-01-01 - computer science |
| Divisions: | Architecture, Built Environment, Computing and Engineering > Computer Science |
| Depositing User: | Gemma Tonks |
| Date Deposited: | 14 May 2026 12:32 |
| Last Modified: | 14 May 2026 12:32 |
| URI: | https://www.open-access.bcu.ac.uk/id/eprint/17044 |