Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR

Abstract

Systems that generate sentences from (abstract) meaning representations (AMRs) are typically evaluated using automatic surface matching metrics that compare the generated texts to the texts that were originally given to human annotators to construct the AMRs. However, besides the well-known issues from which such metrics suffer (Callison-Burch et al., 2006; Novikova et al., 2017), we show that an additional problem arises when they are applied to AMR-to-text evaluation: mapping from the more abstract domain of AMR to the more concrete domain of sentences allows for manifold sentence realizations. In this work we aim to alleviate these issues and propose $\mathcal{M}\mathcal{F}_\beta$, an automatic metric that builds on two pillars. The first pillar is the principle of meaning preservation $\mathcal{M}$: it measures to what extent the original AMR graph can be reconstructed from the generated sentence. We implement this principle by i) automatically constructing an AMR from the generated sentence using state-of-the-art AMR parsers and ii) applying fine-grained, principled AMR metrics to measure the distance between the original and the reconstructed AMR. The second pillar is the principle of (grammatical) form $\mathcal{F}$: it measures the linguistic quality of the generated sentences, and we implement it using state-of-the-art language models. We show, both theoretically and experimentally, that fulfillment of both principles offers several benefits for the evaluation of AMR-to-text systems, including the explainability of scores.
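
The two pillars can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the helper names `triple_f1` and `combine_mf` are hypothetical, exact triple overlap is a simplified stand-in for a real AMR graph metric (which would also search for a variable alignment), the form score is passed in as a plain number rather than computed by a language model, and the combination is assumed to be a weighted harmonic mean in the style of an F-beta score, as the notation $\mathcal{M}\mathcal{F}_\beta$ suggests.

```python
from typing import Set, Tuple

# An AMR graph is represented here as a set of (source, relation, target) triples.
Triple = Tuple[str, str, str]


def triple_f1(original: Set[Triple], reconstructed: Set[Triple]) -> float:
    """Simplified meaning score: F1 over exactly matching triples.

    Assumes both graphs already share variable names; a real AMR metric
    would additionally search for the best variable alignment.
    """
    overlap = len(original & reconstructed)
    if overlap == 0:
        return 0.0
    precision = overlap / len(reconstructed)
    recall = overlap / len(original)
    return 2 * precision * recall / (precision + recall)


def combine_mf(meaning: float, form: float, beta: float = 1.0) -> float:
    """Assumed combination: weighted harmonic mean of meaning and form.

    With this formula, beta < 1 gives more weight to the meaning score
    and beta > 1 gives more weight to the form score.
    """
    denominator = beta ** 2 * meaning + form
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * meaning * form / denominator


# Original AMR for "The boy wants ..." and a reconstruction that got one edge wrong.
original_amr = {
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("w", "ARG0", "b"),
}
reconstructed_amr = {
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("w", "ARG0", "g"),
}

meaning_score = triple_f1(original_amr, reconstructed_amr)
form_score = 0.9  # placeholder value; a language model would supply this
final_score = combine_mf(meaning_score, form_score, beta=1.0)
```

Because the final score is built from two separately inspectable parts, a low value can be traced back to either a meaning mismatch or poor form, which is the kind of explainability the abstract refers to.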

URL

https://arxiv.org/abs/2008.08896

PDF

https://arxiv.org/pdf/2008.08896.pdf