You didn't mention any fence.
Ah, yes, you probably need a fence too, I suppose.
I cannot see how this can possibly be sound without a fence, unless all involved writes (including the data writes) are volatile.
I'm not sure what "including the data writes" means here. Volatile is about telling the compiler "you really do have to perform this access, so just do it", and I can't imagine that the DMA unit would, say, copy 100 words but skip the 56th one because it somehow decided that one wasn't needed. In that sense, I suppose all modifications that the DMA unit makes should be considered volatile, if I follow your meaning.
Did you omit some details or do you think this should be sound without a fence?
Yes, sorry! The device I normally use that has a DMA unit is old enough (manufactured 2001) that there's no out-of-order execution, no memory cache, and no parallelism. Accordingly, it has no fencing. Sometimes I do end up forgetting the additional requirements that code for a newer device would have to meet.
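For what it's worth, here is a rough Rust sketch of the ordering I think is being asked for on a newer device: plain (non-volatile) writes to the data buffer, then a fence, then volatile writes to the DMA registers. Everything named here is made up for illustration: `DmaRegs`, its fields, and `start_transfer` are hypothetical, and the "registers" are an ordinary struct in RAM rather than real memory-mapped hardware. I'm also assuming a release fence is the right barrier; actual hardware may need an architecture-specific barrier or cache maintenance on top of this, so treat it as a sketch rather than a recipe.

```rust
use std::ptr;
use std::sync::atomic::{fence, Ordering};

/// Hypothetical stand-in for a memory-mapped DMA control block.
/// On real hardware this layout would come from the device's datasheet.
#[repr(C)]
pub struct DmaRegs {
    pub src_addr: usize,
    pub len: usize,
    pub start: u32,
}

/// Fills `buf` with `fill`, then programs and "starts" the (simulated) DMA unit.
///
/// # Safety
/// `regs` must point to a valid, writable `DmaRegs`.
pub unsafe fn start_transfer(regs: *mut DmaRegs, buf: &mut [u32], fill: u32) {
    // Plain data writes. Without a barrier, nothing stops the compiler
    // (or an out-of-order CPU) from moving these past the register writes below.
    for word in buf.iter_mut() {
        *word = fill;
    }

    // Assumption: a release fence is enough to keep the data writes ahead of
    // the "start" write. Real devices may require a stronger or different barrier.
    fence(Ordering::Release);

    unsafe {
        // Volatile writes: the compiler must perform each one and may not
        // reorder them relative to each other.
        ptr::write_volatile(ptr::addr_of_mut!((*regs).src_addr), buf.as_ptr() as usize);
        ptr::write_volatile(ptr::addr_of_mut!((*regs).len), buf.len());
        ptr::write_volatile(ptr::addr_of_mut!((*regs).start), 1);
    }
}
```

On something like my 2001-era device the fence would be doing nothing at the hardware level, which is probably why I forgot about it, but it still tells the compiler not to shuffle the buffer writes past the register writes.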