I've been away from USB 3 for a couple of months and I've forgotten most of what I knew already.
What's the host in this system?
To achieve throughput like this, every layer in the system's driver stack has to be tuned for performance. If any layer isn't expecting to transfer at this rate, it may well not manage it. I mentioned that the system I worked on achieved 460MB/s, but that took careful attention to detail to get all layers up to higher performance. That was also using multiple streams; if you only have one stream, things may be more difficult.
So what host are you using? How good is its driver? The XHCI driver has to set several different fields in the endpoint context to make large bursts happen. The device driver also has to request large enough transfers, and it's probably helpful to make sure you have several outstanding at any time. You don't want to give any other part of the system an excuse not to be fast.
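To put rough numbers on the burst side of this, here is a small sketch of how the burst size relates to the transfer size the device driver requests. It assumes a SuperSpeed bulk endpoint (1024-byte max packet) and the zero-based bMaxBurst value from the endpoint companion descriptor. The function names are made up for illustration and don't come from any real driver.

```python
# Illustrative arithmetic only; these helper names are hypothetical.
SS_BULK_MAX_PACKET = 1024  # bytes; assumed SuperSpeed bulk endpoint


def burst_payload_bytes(b_max_burst, max_packet=SS_BULK_MAX_PACKET):
    """Bytes that fit in one burst. bMaxBurst is zero-based: 0..15 means 1..16 packets."""
    if not 0 <= b_max_burst <= 15:
        raise ValueError("bMaxBurst must be in the range 0..15")
    return (b_max_burst + 1) * max_packet


def bursts_per_transfer(transfer_bytes, b_max_burst):
    """Number of full-size bursts needed to move one transfer (rounded up)."""
    payload = burst_payload_bytes(b_max_burst)
    return -(-transfer_bytes // payload)  # ceiling division


if __name__ == "__main__":
    for burst in (0, 3, 15):
        print(burst, burst_payload_bytes(burst), bursts_per_transfer(64 * 1024, burst))
```

If the requested transfers are smaller than one burst payload, the endpoint can never fill a burst no matter how the host controller is configured, which is one reason to check the transfer size as well as the endpoint context.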
Also, if I'm remembering right (and I'm not sure), can't either end of the bus issue an NRDY? If that's true, which end of the bus are the NRDYs coming from?