We present methodology motivated by a controlled trial designed to validate SPOT GRADE, a novel surgical bleeding severity scale (Spotnitz et al., 2018). Briefly, the study was designed to quantify inter- and intra-surgeon agreement in characterizing the severity of surgical bleeds via a kappa statistic. Multiple surgeons were presented with a randomized sequence of controlled bleeding videos and asked to apply the rating system to characterize each wound. Each video was shown multiple times to characterize intra-surgeon reliability, creating clustered data. In addition, videos within the same category may have had different classification probabilities due to differences in blood flow rates and wound sizes. In this work, we propose a new bootstrap-based variance estimator for the kappa statistic that accommodates both clustered data and heterogeneity among items within the same classification category. We then apply this methodology to data from the SPOT GRADE trial. We also investigate how the SPOT GRADE trial might have been made more efficient by running it within a sequential sampling framework.
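To make the clustered resampling idea concrete, the sketch below illustrates one way a video-level (cluster) bootstrap standard error for a kappa statistic could be organized. It uses the two-rater Cohen's kappa and an illustrative data layout as assumptions for exposition; it is not the estimator developed in this work, which handles multiple raters, repeated viewings, and within-category heterogeneity.

```python
import numpy as np

def cohen_kappa(r1, r2, n_categories):
    """Cohen's kappa for two raters' category labels coded 0..n_categories-1."""
    n = len(r1)
    p_o = np.mean(r1 == r2)                                  # observed agreement
    p1 = np.bincount(r1, minlength=n_categories) / n          # rater 1 marginals
    p2 = np.bincount(r2, minlength=n_categories) / n          # rater 2 marginals
    p_e = np.sum(p1 * p2)                                     # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

def cluster_bootstrap_kappa_se(ratings, video_ids, n_categories,
                               n_boot=2000, seed=0):
    """
    Cluster bootstrap standard error for kappa.

    ratings   : (n_obs, 2) array of category labels from two raters; repeated
                viewings of the same video appear as separate rows.
    video_ids : (n_obs,) array giving the video (cluster) for each row.

    Videos, not individual viewings, are resampled with replacement so that
    repeated viewings of the same video stay together, preserving the
    within-cluster dependence created by showing each video multiple times.
    """
    rng = np.random.default_rng(seed)
    videos = np.unique(video_ids)
    boot_kappas = np.empty(n_boot)
    for b in range(n_boot):
        sampled = rng.choice(videos, size=len(videos), replace=True)
        rows = np.concatenate([np.flatnonzero(video_ids == v) for v in sampled])
        boot_kappas[b] = cohen_kappa(ratings[rows, 0], ratings[rows, 1],
                                     n_categories)
    return boot_kappas.std(ddof=1)
```

The key design choice, resampling whole videos rather than individual ratings, is what distinguishes this cluster bootstrap from a naive bootstrap that would treat repeated viewings as independent observations.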