I am using a SanMax InClose USB2Dock, PMD-96-USB2, for backup and restore. I am seeing intermittent failures when rsync writes data to the disk. From /var/log/messages on my test machine:
------------------------------------------------------------------------------
Nov 12 21:34:20 dhcp-192 smartd[3286]: Device: /dev/hde, SMART Usage Attribute: 194 Temperature_Celsius changed from 93 to 97
Nov 12 22:03:48 dhcp-192 kernel: hub.c: USB device not accepting new address (error=-71)
Nov 12 22:03:53 dhcp-192 kernel: usb-storage: host_reset() requested but not implemented
Nov 12 22:04:03 dhcp-192 kernel: scsi: device set offline - command error recover failed: host 0 channel 0 id 0 lun 0
Nov 12 22:04:03 dhcp-192 kernel: ev 08:0e, sector 1923432
Nov 12 22:04:03 dhcp-192 kernel: I/O error: dev 08:0e, sector 1923584
Nov 12 22:04:03 dhcp-192 kernel: I/O error: dev 08:0e, sector 1923904
-----------------------------------------------------------------------------
... and pages more
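The "dev 08:0e" field in the log above is a hexadecimal major:minor pair. Major 8 is the first SCSI disk major, with 16 minors per disk, so 08:0e works out to partition 14 of /dev/sda. A minimal sketch for pulling the failing device and sectors out of the log follows; the function names are my own, and it only handles major 8 and complete "I/O error" lines.

```python
import re

# Matches kernel lines such as "I/O error: dev 08:0e, sector 1923584".
IO_ERROR = re.compile(
    r"I/O error: dev ([0-9a-f]{2}):([0-9a-f]{2}), sector (\d+)"
)


def decode_scsi_dev(major_hex, minor_hex):
    """Map a hex major:minor pair to a /dev/sdXN name (major 8 only)."""
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != 8:
        raise ValueError("this sketch only handles major 8 (sd)")
    disk, part = divmod(minor, 16)  # 16 minors per SCSI disk
    name = "/dev/sd" + chr(ord("a") + disk)
    return name if part == 0 else name + str(part)


def failed_sectors(log_text):
    """Return (device, sector) for every complete I/O error line."""
    return [
        (decode_scsi_dev(m.group(1), m.group(2)), int(m.group(3)))
        for m in IO_ERROR.finditer(log_text)
    ]
```

Sorting the resulting sector numbers would show whether the errors follow the order of the writes or are scattered across the disk.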
After the failure, the drive appears empty. I can restore the system to working order with two steps: (1) unplug the USB cable, wait 5 seconds, and replug it; (2) rmmod usb-storage, then insmod usb-storage. The two steps can be done in either order, or interleaved; the drive needs resetting by the cable interruption, and usb-storage has some remnant state that needs to be cleared out.
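The recovery steps above can be wrapped in a small helper so they are done the same way each time. This is only a sketch: the function name and the prompt text are my own, the two commands are exactly the ones listed above, and it must be run as root.

```python
import subprocess


def recover_usb_storage(run=subprocess.check_call, prompt=input):
    """Walk through the manual cable reset, then reload usb-storage.

    `run` and `prompt` are parameters only so the sequence can be
    exercised without touching a real system.
    """
    # Step 1 is manual: the cage itself only resets on a cable interruption.
    prompt("Unplug the USB cable, wait 5 seconds, replug it, then press Enter: ")
    # Step 2 clears the remnant state held by the usb-storage module.
    run(["rmmod", "usb-storage"])
    run(["insmod", "usb-storage"])


if __name__ == "__main__":
    recover_usb_storage()
```

Since the two steps also work in the other order, the prompt could equally well come after the module reload.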
Note that a second usb-storage drive is not affected by the failure of the first. It is also interesting to note that if drive 1 is plugged in first as /dev/sda, then drive 2 is plugged in second as /dev/sdb, then both drives are unplugged, then drive 2 plugged in first will stay /dev/sdb and drive 1 plugged in second will stay /dev/sda. This is the right behavior, and clever, but it shows how remnant state is cleared only by removing and re-installing the usb-storage module.
The test machine is running Fedora Core 1 (2.4.22-1.2088.nptl), with 1GB RAM, an ASUS P4T-E motherboard, a 1.9GHz P4 processor, and a Maxtor (re-branded Promise) ATA-133 controller with a 200GB boot drive on /dev/hde.
The USB cage test drive is connected through an Adaptec USB2Connect PCI card (AUA-4000B, assy 2012006-01) with the NEC D720101GJ chipset, using port J2. This is connected to the SanMax InClose PMD-96-USB2 Cage, which is based on a Cypress CY7C68013 (8051 internal) chip with a 24MHz clock and an external 8Kx8 NVRAM. This chip has many sets of built-in hardware control, and a large internal FIFO.
Note that the NVRAM has a programming jumper, and it appears that the unit is programmed over the USB2 bus, and has a lot of general purpose wires connected to the IDE cage. I imagine this could be turned into a nice general purpose USB2 prototyping system. But I digress...
I am still working on finding a way to cause repeatable failures. So far, the only thing that fails is restoring a 120GB drive with rsync, which is a complex process. It can fail anywhere from one minute to one hour into the restore.
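One simpler experiment than a full rsync restore would be a plain sequential write that records how far it got before the first error. The sketch below is an assumption about what such a test could look like, not something I have shown to trigger the bug; the function name, the 1MB block size, and the fsync after every block are all my own choices.

```python
import os
import time


def write_until_failure(path, total_bytes, block_size=1 << 20):
    """Write random blocks to `path` until done or the first I/O error.

    Returns (bytes_written, elapsed_seconds, error_or_None).
    """
    block = os.urandom(block_size)
    written = 0
    start = time.time()
    try:
        with open(path, "wb") as out:
            while written < total_bytes:
                out.write(block)
                out.flush()
                os.fsync(out.fileno())  # push each block out to the device
                written += block_size
    except OSError as err:
        return written, time.time() - start, err
    return written, time.time() - start, None
```

Pointing `path` at a file on the mounted USB drive and comparing the byte count at failure across runs might show whether the failure depends on the volume of data or on elapsed time.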
I have tried the following permutations:
A different system running RedHat 9 with kernel 2.4.20-19.9. Zomax USB card (same NEC chipset). Different motherboard, video, etc. Same bug.
Three different units of the same PMD-96-USB2 cage. Same bug.
Different hard drives, and different USB2 cables. Same bug.
Different USB2 PCI card, with a VIA VT6202 chipset. Same bug.
Two other USB2 swap systems, a ViPower VP-1028LSF and an ADS USBX-804 external case. These other systems do not exhibit the problem. However, they are both 50% slower than the PMD-96-USB2. This makes me suspect that some buffer in the Linux kernel is overflowing.
I have been in contact with SanMax, and got a nice reply from designer Jack Mich ( jacklife2001 atYahooDotCom ). He does not see the problem on his Windows system, after a 2 hour test running at 1GB/minute. He notes that the buffer burst rate of the Cypress chip is 96MB/sec, so the USB2 rate of 60MB/sec is not going to overflow the GPIF hardware FIFO controller on the Cypress chip.
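For comparison, the rates quoted above can be put on the same scale. The arithmetic below takes 1GB as 1000MB and uses the 480 Mbit/sec USB2 signalling rate; the variable names are my own.

```python
# USB2 high-speed signalling rate, converted from Mbit/sec to MB/sec.
usb2_mb_per_s = 480 / 8

# Sustained rate of the 2 hour Windows test: 1GB/minute, with 1GB = 1000MB.
windows_test_mb_per_s = 1000 / 60

# Burst rate quoted for the Cypress chip's buffer.
cypress_burst_mb_per_s = 96

print("USB2 bus:       %.1f MB/sec" % usb2_mb_per_s)
print("Windows test:   %.1f MB/sec" % windows_test_mb_per_s)
print("Cypress burst:  %.1f MB/sec" % cypress_burst_mb_per_s)
print("Test used %.0f%% of the bus rate"
      % (100 * windows_test_mb_per_s / usb2_mb_per_s))
```

So the Windows test averaged well under a third of the nominal bus rate. That is consistent with the Cypress FIFO having headroom, but it does not by itself show that the test exercised the same burst pattern the Linux host produces.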
I can write code and recompile kernels, and even connect a slow digital oscilloscope to the right wires. If there are any other experiments I should be running to track down this bug, let me know.
last revision November 12, 2003