Classification of cancer pathology reports: a large-scale comparative study

posted on 29.06.2020, 17:45 by Stefano Martina, Leonardo Ventura, Paolo Frasconi
We report on the application of state-of-the-art deep learning techniques to the automatic and interpretable assignment of ICD-O3 topography and morphology codes to free-text cancer reports. We present results on a large dataset (more than 80 000 labeled and 1 500 000 unlabeled anonymized reports written in Italian and collected from hospitals in Tuscany over more than a decade) with a large number of classes (134 morphological classes and 61 topographical classes). Approval for this dataset was obtained from the institutional ethics committee (CEAV 14081 oss 27/11/2018). We compare alternative architectures in terms of prediction accuracy and interpretability and show that our best model achieves a multiclass accuracy of 90.3% on topography site assignment and 84.8% on morphology type assignment. We found that, in this context, hierarchical models are not better than flat models and that an element-wise maximum aggregator is slightly better than attentive models on site classification. Moreover, the maximum aggregator offers a way to interpret the classification process.
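To make the element-wise maximum aggregator concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy vectors, and the "share of winning features" importance score are all assumptions chosen for illustration. The abstract states only that the maximum aggregator offers a way to interpret the classification; one plausible reading is that each feature of the document vector can be traced back to the single token that supplied its maximum.

```python
def max_aggregate(token_vectors):
    """Element-wise maximum over a list of equal-length token vectors.

    Returns (doc_vector, winners), where winners[j] is the index of the
    token that supplied feature j of the document vector.
    """
    dim = len(token_vectors[0])
    doc_vector, winners = [], []
    for j in range(dim):
        column = [vec[j] for vec in token_vectors]
        # Index of the token with the largest value for feature j.
        best = max(range(len(column)), key=column.__getitem__)
        doc_vector.append(column[best])
        winners.append(best)
    return doc_vector, winners


def token_importance(winners, num_tokens):
    """Hypothetical score: fraction of features each token 'wins'."""
    counts = [0] * num_tokens
    for t in winners:
        counts[t] += 1
    return [c / len(winners) for c in counts]


# Toy example: 3 tokens with 4 features each (made-up values).
H = [
    [0.1, 0.9, 0.2, 0.0],
    [0.8, 0.3, 0.7, 0.1],
    [0.2, 0.1, 0.4, 0.6],
]
doc_vector, winners = max_aggregate(H)
importance = token_importance(winners, num_tokens=len(H))
```

In this toy case the second token supplies two of the four features, so it receives the highest score. An attentive aggregator would instead produce a weighted average of the token vectors, in which no single feature maps back to exactly one token.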


Italian Ministry of Education, University, and Research, Grant 2017TWNMH2.


Email Address of Submitting Author

ORCID of Submitting Author


Submitting Author's Institution

University of Florence

Submitting Author's Country