We present a methodology motivated by a controlled trial designed to validate SPOT GRADE, a novel surgical bleeding severity scale. Briefly, the study was designed to quantify inter‐ and intra‐surgeon agreement in characterizing the severity of surgical bleeds via a Kappa statistic. Multiple surgeons were presented with a randomized sequence of controlled bleeding videos and asked to apply the rating system to characterize each wound. Each video was shown multiple times to quantify intra‐surgeon reliability, creating clustered data. In addition, videos within the same category may have had different classification probabilities due to variation in blood flow rates and wound sizes. In this work, we propose a new variance estimator for the Kappa statistic that accommodates both clustered data and heterogeneity among items within the same classification category. We then apply this methodology to data from the SPOT GRADE trial.
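For orientation, the Kappa statistic contrasts observed agreement with the agreement expected by chance. A minimal sketch of the classical two‐rater, $K$‐category form, which we take as the starting point for the clustered extension (the symbols $p_o$, $p_e$, and $p_{rk}$ are our notation, not taken from the trial protocol), is

\[
\hat{\kappa} \;=\; \frac{p_o - p_e}{1 - p_e}, \qquad p_e \;=\; \sum_{k=1}^{K} p_{1k}\,p_{2k},
\]

where $p_o$ is the observed proportion of items on which the two raters agree and $p_{rk}$ is the marginal proportion of items that rater $r$ assigns to category $k$. Repeated showings of the same video induce correlated ratings, and items within a category may differ in their classification probabilities; both features invalidate the usual independent‐items variance formula for $\hat{\kappa}$, which is the gap the proposed estimator addresses.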