Sounds like a very good plan. Is the intention to do this kind of annotation only for the raw scan, or also for intermediate stages of dirt cleaning?
We’ll start with a lossy encode of the raw scan (e.g., maximum-quality JPEGs of all frames, rendered at something like 2K) and see where things go from there. Annotating an intermediate stage would make sense too, but we might not get enough people involved in the first place, so we should feel it out first.