TechRxiv
Classification of cancer pathology reports: a large-scale comparative study

preprint
posted on 29.06.2020 by Stefano Martina, Leonardo Ventura, Paolo Frasconi
We report on the application of state-of-the-art deep learning techniques to the automatic and interpretable assignment of ICD-O3 topography and morphology codes to free-text cancer reports. We present results on a large dataset (more than 80 000 labeled and 1 500 000 unlabeled anonymized reports written in Italian and collected from hospitals in Tuscany over more than a decade) with a large number of classes (134 morphological classes and 61 topographical classes). Approval was obtained from the institutional ethics committee (CEAV 14081 oss 27/11/2018). We compare alternative architectures in terms of prediction accuracy and interpretability and show that our best model achieves a multiclass accuracy of 90.3% on topography site assignment and 84.8% on morphology type assignment. We found that, in this context, hierarchical models are not better than flat models and that an element-wise maximum aggregator is slightly better than attentive models on site classification. Moreover, the maximum aggregator offers a way to interpret the classification process.
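
To illustrate how an element-wise maximum aggregator can support interpretation, the following is a minimal sketch and not the authors' implementation. It assumes each token of a report has already been encoded as a feature vector; the function names, the toy tokens, and the toy values are all hypothetical. Each pooled feature comes from exactly one token, so counting how many features a token supplies gives a rough indication of its influence.

```python
import numpy as np


def max_aggregate(token_vectors):
    """Element-wise maximum over the token axis.

    token_vectors: array of shape (n_tokens, n_features).
    Returns (pooled, winners), where pooled[j] is the maximum of
    feature j and winners[j] is the index of the token supplying it.
    """
    token_vectors = np.asarray(token_vectors, dtype=float)
    pooled = token_vectors.max(axis=0)
    winners = token_vectors.argmax(axis=0)
    return pooled, winners


def token_importance(winners, n_tokens):
    """Fraction of pooled features contributed by each token."""
    counts = np.bincount(winners, minlength=n_tokens)
    return counts / winners.size


# Toy example (hypothetical tokens and feature values).
tokens = ["carcinoma", "duttale", "mammella"]
vectors = np.array([
    [0.9, 0.1, 0.2, 0.8],
    [0.2, 0.3, 0.1, 0.1],
    [0.1, 0.7, 0.6, 0.2],
])

pooled, winners = max_aggregate(vectors)
importance = token_importance(winners, len(tokens))

for token, score in zip(tokens, importance):
    print(f"{token}: {score:.2f}")
```

In this toy case the first and third tokens each supply half of the pooled features and the second supplies none. An attention-based aggregator would instead produce a weighted sum, in which every token contributes to every feature.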

Funding

Italian Ministry of Education, University, and Research, Grant 2017TWNMH2.

Email Address of Submitting Author

stefano.martina@unifi.it

ORCID of Submitting Author

0000-0001-6024-1752

Submitting Author's Institution

University of Florence

Submitting Author's Country

Italy
