Abstract
We report on the application of state-of-the-art deep learning
techniques to the automatic and interpretable assignment of ICD-O3
topography and morphology codes to free-text cancer reports. We present
results on a large dataset (more than 80 000 labeled and 1 500 000
unlabeled anonymized reports, written in Italian and collected from
hospitals in Tuscany over more than a decade) with a large number of
classes (134 morphological and 61 topographical). Approval was obtained
from the institutional ethics committee (CEAV 14081 oss 27/11/2018).
We compare alternative architectures in
terms of prediction accuracy and interpretability and show that our best
model achieves a multiclass accuracy of 90.3% on topography site
assignment and 84.8% on morphology type assignment. We find that, in
this context, hierarchical models do not outperform flat models and
that an element-wise maximum aggregator performs slightly better than
attentive models on site classification. Moreover, the maximum
aggregator offers a way to interpret the classification process.
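
To illustrate why an element-wise maximum aggregator lends itself to
interpretation, the following minimal sketch pools per-token vectors
by taking the maximum of each hidden dimension and records which token
supplied it. The function names, the toy input, and the "share of
winning dimensions" importance score are assumptions made for
illustration only; they are not taken from the models evaluated here.

```python
import numpy as np


def max_aggregate(token_vectors):
    """Element-wise maximum over the token axis.

    token_vectors: array-like of shape (num_tokens, hidden_dim).
    Returns the pooled vector (hidden_dim,) and, for each hidden
    dimension, the index of the token that supplied the maximum.
    """
    h = np.asarray(token_vectors, dtype=float)
    pooled = h.max(axis=0)
    winners = h.argmax(axis=0)
    return pooled, winners


def token_importance(token_vectors):
    """Hypothetical importance score: the fraction of hidden
    dimensions for which each token supplies the maximum."""
    h = np.asarray(token_vectors, dtype=float)
    _, winners = max_aggregate(h)
    counts = np.bincount(winners, minlength=h.shape[0])
    return counts / counts.sum()


# Toy example: 3 tokens, 4 hidden dimensions (made-up values).
H = np.array([
    [0.1, 0.9, 0.2, 0.0],
    [0.8, 0.3, 0.7, 0.5],
    [0.2, 0.1, 0.1, 0.4],
])
pooled, winners = max_aggregate(H)
importance = token_importance(H)
```

Because every component of the pooled vector is copied from exactly one
token, each feature seen by the classifier can be traced back to a
specific word in the report. An attention-weighted sum, by contrast,
mixes contributions from all tokens.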