Author Topic: usb 1.1 device runs faster when plugged into usb 2.0 hub than usb 1.1 port  (Read 13983 times)

dsmtoday

  • Member
  • ***
  • Posts: 7
I have a legacy USB 1.1 device whose software I am updating.  My development machine is a Z68-based PC, which no longer has UHCI (USB 1.1) controllers; it only has two EHCI (USB 2.0) controllers.  The chipset puts an internal hub between the external connectors and the EHCI controller, so my 1.1 device effectively looks like it is plugged into a 2.0 hub when connected to this development machine.

Anyway, the software mods I made to the device required increasing its bulk packet bandwidth from 800 kB/s to 920 kB/s.  This new rate has always worked just fine on my development machine.  But then I plugged the device into the computers we are currently shipping, which are pre-Sandy Bridge, and the device's internal queue started getting overrun, resulting in lost data.  This older computer has six UHCI (1.1) ports and two EHCI (2.0) ports.  Looking at Device Manager, the device gets connected directly to a UHCI port.

The device worked on these older machines with UHCI ports for several years when the bandwidth was only 800 kB/s.

But if I take an external USB 2.0 hub, plug it into this older machine, and plug our device into the hub, the device no longer gets queue overruns at the new 920 kB/s rate!

For some bizarre reason, this USB 1.1 device can get more throughput when it is connected to a USB 2.0 hub than it can if it is directly connected to a UHCI 1.1 port!

If you have any ideas why this might be happening, please let me know.  Also looking for any workarounds/hacks that don't involve physical hardware.

* Win7 64bit
* device is using 3 bulk pipes to send 64-byte packets of real-time data to the computer
* using WinUSB and overlapped I/O with a 12*64-byte buffer size and 100 of these buffers queued (increasing the buffer size above 12*64 did not seem to improve anything, and sizes below 12*64 made things worse); a rough sketch of this setup follows the list
* when using a 2.0 hub, I could reduce buffer size to 4*64 with no loss
* no isochronous pipes
* no interrupt pipes
* no other USB traffic during bulk transfer
* device is a low-volume scientific instrument
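
For reference, here is roughly the shape of my read loop in C.  It's a minimal sketch: the WinUsb_Initialize() plumbing is omitted, and the function and variable names (plus the pipe ID you'd pass in) are placeholders, not my actual code.

    #include <windows.h>
    #include <winusb.h>

    #define BUF_COUNT 100          /* outstanding requests per pipe */
    #define BUF_SIZE  (12 * 64)    /* multiple of the 64-byte MaxPacketSize */

    /* Keep BUF_COUNT overlapped reads queued on one bulk IN pipe and
       re-submit each buffer as it completes, in FIFO order. */
    static void ReadPump(WINUSB_INTERFACE_HANDLE hUsb, UCHAR pipeId)
    {
        static UCHAR buf[BUF_COUNT][BUF_SIZE];   /* too big for the stack */
        OVERLAPPED ov[BUF_COUNT] = {0};
        DWORD got;
        int i;

        for (i = 0; i < BUF_COUNT; i++) {
            ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            WinUsb_ReadPipe(hUsb, pipeId, buf[i], BUF_SIZE, NULL, &ov[i]);
        }
        for (i = 0; ; i = (i + 1) % BUF_COUNT) {
            if (!WinUsb_GetOverlappedResult(hUsb, &ov[i], &got, TRUE))
                break;             /* device unplugged, or an error */
            /* ...consume got bytes from buf[i] here... */
            WinUsb_ReadPipe(hUsb, pipeId, buf[i], BUF_SIZE, NULL, &ov[i]);
        }
        /* (cleanup of the event handles omitted) */
    }

I run one of these per bulk pipe; the point is simply that BUF_COUNT requests are outstanding at all times, so the host always has work queued.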

Thanks for any help!
-todd-
« Last Edit: June 24, 2012, 08:12:53 pm by dsmtoday »

dsmtoday

  • Member
  • ***
  • Posts: 7
I used RAW_IO in WinUSB and that fixed the above problem.  Somehow, it bypasses the brain damage of UHCI ports.  All my buffers were already multiples of 64, so literally all I had to do was set the pipe policy to RAW_IO and the UHCI port stopped dropping packets.
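
In case it helps anyone else, the whole fix was one pipe-policy call before starting the reads.  A minimal sketch, assuming a WINUSB_INTERFACE_HANDLE from WinUsb_Initialize() and a placeholder bulk IN pipe ID:

    #include <windows.h>
    #include <winusb.h>   /* link against winusb.lib */

    /* Enable RAW_IO on one bulk IN pipe.  Under RAW_IO, every read must
       be a multiple of the endpoint's MaxPacketSize (64 bytes for this
       device) and no larger than the MAXIMUM_TRANSFER_SIZE the stack
       reports. */
    static BOOL EnableRawIo(WINUSB_INTERFACE_HANDLE hUsb, UCHAR pipeId)
    {
        UCHAR on = 1;
        return WinUsb_SetPipePolicy(hUsb, pipeId, RAW_IO, sizeof(on), &on);
    }

I call it once per bulk IN pipe, before queuing any reads.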

Barry Twycross

  • Frequent Contributor
  • ****
  • Posts: 263
I wouldn't be entirely surprised by a hub increasing the available bandwidth. With the right sort of hub (a multi-TT one), each port gets its own USB 1.1 bus's worth of bandwidth to use. If you're connected to a UHCI controller, it's sharing bandwidth with the other port on the UHCI.

Now, you do mention "* no other USB traffic during bulk transfer", which would have been my next question: is something else on the UHCI soaking up bandwidth?

What I'd do is take a look at the bus traffic, using a bus analyser, and see if anything stands out as causing the problem.

dsmtoday

  • Member
  • ***
  • Posts: 7
I verified in Device Manager that no other device was connected to the same UHCI port my device was using.

I also used a CATC USB analyzer to capture bus traffic.  There was no other bus traffic other than my device's traffic.

I compared USB traffic captures between being hooked up to a 2.0 hub and being directly connected to a UHCI port.  What I found is simply that the UHCI port is "lazy": the time between packets was generally 10-20% higher than with the 2.0 hub.  And often, the UHCI port would simply "take a break" between packets (mind you, these packets are all 64 bytes long, no short packets in the bunch) and slack off for a while.  Why it was doing this, I have no idea.

As I mentioned before, turning on WinUSB RAW_IO solved this issue.  I tested for two hours at my application's maximum bandwidth (~920 kB/s) without a single dropped packet.

This whole thing is kinda counter-intuitive: I can use a smaller buffer size and non-RAW_IO with a 2.0 port and get better performance to a 1.1 device than I can with larger buffers and RAW_IO with a 1.1 port.  But it all comes down to what Microsoft is doing in the depths of their driver and its scheduling, and that is obviously very different for a 1.1 port than for a 2.0 port.  That is the biggest problem when trying to squeeze performance out of bulk transfers: you are at the mercy of Microsoft's USB scheduler, which is neither standardized nor published.

Guido Koerber

  • Frequent Contributor
  • ****
  • Posts: 72
That has not so much to do with the driver implementation as with the host controller design. If you plug the USB 1.1 device into a root port, that root port is disconnected from the 2.0 host controller and connected to a USB 1.1 host controller.

When you insert a USB 2.0 hub, the communication is done by the high-speed host controller, which does not really care about the actual speed of your device. The Transaction Translators in some USB 2.0 hubs are faster at turning USB 1.1 transfers into USB 2.0 transfers than USB 1.1 hosts are at doing the low/full-speed transfers themselves, so you can see increased performance.

This is most significant for low-speed control transfers. We have seen more than 5 control transfers per millisecond when talking to low-speed devices through a hub's TT; OHCI hosts usually max out at 3 transfers per millisecond, and UHCI hosts at 3 milliseconds per transfer.

Tsuneo

  • Frequent Contributor
  • ****
  • Posts: 145
To add to Guido's explanation:

Quote
dsmtoday:
And often, the UHCI port would simply "take a break" between packets (and mind you, these packets are all 64-bytes long, no short packets in the bunch) and slack off for a while.

The "break" is caused by deferred completion interrupt on the host controller.
When PC device driver (like WinUSB) gets a large transfer request, it divides the large request into shorter chunks, such as 4K bytes (PCI page size), to assign "real" memory for DMA. And then, PC driver sends the chunk requests to a host controller (HC), one by one. When a chunk request completes on the HC, its completion interrupt is deferred until next SOF timing. Receiving this interrupt, PC driver sends next chunk request to the HC, and transactions re-starts on the bus.

UHCI works on a 1 ms frame, while EHCI (high speed) works on a 125 µs microframe. For this reason, you'll see longer "breaks" between chunk transactions on UHCI.

When the RAW_IO policy is enabled on a WinUSB bulk IN endpoint, WinUSB queues two or more chunks to the HC at a time. The HC still defers the completion interrupt, but it moves on to the next chunk seamlessly, without intervention from WinUSB.
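
To put rough numbers on this for dsmtoday's case (using his 12*64-byte buffers, and assuming the USB 2.0 spec ceiling of 19 bulk packets per full-speed frame):

    one chunk       = 12 x 64 bytes = 768 bytes = 12 packets
    one UHCI frame  = 1 ms, with room for up to 19 x 64-byte bulk packets
    -> each chunk finishes mid-frame, its completion interrupt waits for
       the next SOF, and the bus sits idle for the rest of the frame
       until the driver hands the HC the next chunk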

Tsuneo

dsmtoday

  • Member
  • ***
  • Posts: 7
Quote
When the RAW_IO policy is enabled on a WinUSB bulk IN endpoint, WinUSB queues two or more chunks to the HC at a time. The HC still defers the completion interrupt, but it moves on to the next chunk seamlessly, without intervention from WinUSB.

Thanks for that explanation.  I had always thought that the main advantage of RAW_IO in WinUSB was bypassing checks and buffer copying, and so was more of a CPU speedup.  I didn't know another advantage was better turnaround between buffers on the host controller.

Thanks to all who replied on this thread.  I really appreciate all the information.

Barry Twycross

  • Frequent Contributor
  • ****
  • Posts: 263
Quote
dsmtoday:
I compared USB traffic captures between being hooked up to a 2.0 hub and being directly connected to a UHCI port.  What I found is simply that the UHCI port is "lazy": the time between packets was generally 10-20% higher than with the 2.0 hub.  And often, the UHCI port would simply "take a break" between packets (mind you, these packets are all 64 bytes long, no short packets in the bunch) and slack off for a while.  Why it was doing this, I have no idea.
How long a break, compared to a regular inter-packet gap? And is it anywhere interesting? Tsuneo's idea would give you a break just before an SOF (and it'd be at the end of a transfer). If it's between every packet, that could mean the UHCI has other endpoints to service, even if they're not actually transferring data. If it's sporadic, it could be that the host is going into some doze mode and the UHCI has to wake it up again.

Barry Twycross

  • Frequent Contributor
  • ****
  • Posts: 263
Also, if the break is always before an SOF, your UHCI driver isn't doing bandwidth reclamation properly.

dsmtoday

  • Member
  • ***
  • Posts: 7
My device does have three endpoints, each generating about 305 kB/s of data, according to the CATC timing calculator.  There are also never any NAKs occurring.

Looking at the CATC trace, it seems that RAW_IO gives a pretty solid average of 53 µs from the beginning of one packet transaction to the beginning of the next.  Without RAW_IO, this drifts around a bit, wobbling from 56 µs to 63 µs but averaging around 60 µs.  I think the difference between the two (one solid, one dithering) is explained by Tsuneo's post about the HC having multiple buffers queued versus one buffer queued while it waits for the PC to hand it the next.
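
Those averages line up with the full-speed bulk ceiling, if I assume the spec figure of 19 bulk transactions per 1 ms frame:

    RAW_IO:     1000 µs / 19 packets ~= 52.6 µs per packet (the solid 53 µs)
    non-RAW_IO: ~60 µs per packet -> only ~16 packets actually fit per frame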

Aha!  Being encouraged to look at the SOF packets (I usually hide them in the viewer) led me to find places in my capture where an entire frame is empty; by that, I mean back-to-back SOFs without a packet in between.  I don't know how I missed this before just from the inter-packet idle timing numbers; I should have seen it.  But that's the problem right there.

It gets about 165 kBytes into the transfer and then starts going into a mode where it issues three frames of sixteen 64-byte packets, then an empty frame, then three full frames again, then another empty frame.  It does that for a while, then returns to normal transfers for a while, then gets back into the every-fourth-frame-empty mode for a while.  This is odd because my overlapped I/O buffer size at the time was 256 bytes, which is 4x the packet size, not three.  I tried increasing the buffer sizes to 8x, 20x, and 100x, but the problem persisted (I don't have captures of these later sizes).  The same non-RAW_IO, 4x-packet-size setup plugged into a USB 2.0 hub worked with no drops.
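
A back-of-the-envelope check on why that pattern overruns the queue, using the ~305 kB/s per endpoint figure above:

    needed:   3 endpoints x ~305 kB/s ~= 915 kB/s
    getting:  3 full frames x (16 x 64 bytes) + 1 empty frame
              = 3072 bytes per 4 ms = 768 kB/s
    768 kB/s < 915 kB/s, so the device's internal queue fills up and overflows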