#### Assessing Reliability on Annotations (1): Theoretical Considerations

*Jens Stegmann and Andy Lücking*
###### Abstract

This is the first part of a two-report mini-series focussing on issues in the evaluation
of annotations. In this theoretically-oriented report we lay out the relevant statistical
background for reliability studies, evaluate some pertinent approaches and also sketch
some arguments that may lend themselves to the development of an original statistic.
A description of the project background, including the documentation of the annotation
scheme at stake and the empirical data collected, as well as results from the practical
application of the relevant statistics and the discussion of our respective results are
contained in the second, more empirically-oriented report [Lücking and Stegmann, 2005].
The following points are dealt with in detail here: we summarize and contribute to an
argument by Gwet [2001] which indicates that the popular pi and kappa statistics [Carletta,
1996] are generally not appropriate for assessing the degree of agreement between
raters on categorical type-ii data. We propose the use of AC1 [Gwet, 2001] instead, since
it has desirable mathematical properties that make it more appropriate for assessing the
results of expert raters in general. As far as type-i data are concerned, we make use
of conventional correlation statistics which, unlike their AC1 and kappa cousins, do not
deliver results that are adjusted with respect to agreements due to chance. Furthermore,
we discuss issues in the interpretation of the results of the different statistics. Finally,
we take up some loose ends from the previous chapters and sketch some advanced ideas
pertaining to inter-rater agreement statistics. Therein, some differences as well as common
ground concerning Gwet's perspective and our own stance will be highlighted. We
conclude with some preliminary suggestions regarding the development of the original
statistic omega that will be different in nature from those discussed before.
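
The contrast between the chance-corrected statistics named above can be illustrated with a minimal Python sketch. It assumes the common two-rater textbook forms of Cohen's kappa, Scott's pi and Gwet's AC1, which may differ in detail from the variants used in the report; the function name `agreement_stats` and the toy ratings are hypothetical and serve illustration only.

```python
from collections import Counter


def agreement_stats(ratings_a, ratings_b):
    """Chance-corrected agreement for two raters on categorical data.

    Assumption: standard two-rater formulas; not taken from the report.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories must occur in the data")

    # Observed agreement: share of items on which both raters coincide.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Per-rater category proportions and their mean.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    prop_a = {k: count_a[k] / n for k in categories}
    prop_b = {k: count_b[k] / n for k in categories}
    mean_prop = {k: (prop_a[k] + prop_b[k]) / 2 for k in categories}

    # Chance-agreement estimates; this is where the statistics differ.
    pe_kappa = sum(prop_a[k] * prop_b[k] for k in categories)
    pe_pi = sum(mean_prop[k] ** 2 for k in categories)
    pe_ac1 = sum(mean_prop[k] * (1 - mean_prop[k]) for k in categories) / (q - 1)

    def corrected(p_e):
        # Shared form: (observed - chance) / (1 - chance).
        return (p_a - p_e) / (1 - p_e)

    return {
        "observed": p_a,
        "kappa": corrected(pe_kappa),
        "pi": corrected(pe_pi),
        "ac1": corrected(pe_ac1),
    }


# Toy data (hypothetical): 100 items, heavily skewed towards category "x".
ratings_a = ["x"] * 90 + ["x"] * 5 + ["y"] * 5
ratings_b = ["x"] * 90 + ["y"] * 5 + ["x"] * 5
stats = agreement_stats(ratings_a, ratings_b)
```

On this skewed toy sample the raters agree on 90% of the items, yet kappa and pi both come out slightly negative (about -0.05), whereas AC1 is about 0.89. Divergence of this kind on data with one dominant category is what motivates the comparison of the statistics.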


Anke Weinberger, 2006-05-29,
2006-07-13