My son is a computer geek like me. He was a slow starter, not getting into it until he was a teenager, but he has an annoying habit of doing things before I do, reminding me all the time that I am slowing down…
He was the first in our family:
- to build a PC from off-the-shelf (OTS) parts,
- to use a 3 GHz CPU,
- to use more than four hard drives in a box,
- to use a multi-core CPU, and
- to use a dual-socket motherboard.
Of course, I am proud of him but it still bugs the geek in me that I cannot stay ahead of him.
However, there is occasional joy in my family relations, like when, a few times a year, he writes, “Dad, how do you…?”. Hah! I am still useful!
Last week, after I returned from the annual teaching battle, my son said, “Dad, I am getting this weird message and Google does not help…”.
too many iterations (6) in nv_nic_irq
Lo, it was a problem with the forcedeth driver that I had encountered and overcome two years ago. I gave him hints that promptly ended an annoying instability in a server running GNU/Linux. Apparently, the information I found to solve the problem back then is no longer on the web, so we will document it here.
The problem is that forcedeth.c, a reverse-engineered driver for the Nvidia on-board NIC, sometimes has trouble dismissing an interrupt, that is, clearing the interrupt flag, so the CPU is interrupted again the moment the driver returns control. To keep this tight loop from spinning forever, the author put in an iteration limit, which wreaks havoc on some production systems. There are module parameters that can be configured to control the interrupt behaviour.
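To picture why the message appears, here is a toy, userspace-only sketch of the loop's shape. Everything in it (the fake_nic struct, read_and_clear_status, service_nic) is made up for illustration; the real handler, nv_nic_irq, reads and clears hardware status registers and services the rx/tx rings. The point is the bail-out: with the default max_interrupt_work of 5, a status bit that never clears trips the limit on the seventh pass, printing a count of 6, which matches the message my son saw.

/* Toy model of the bounded interrupt-service loop. All names are
 * invented for illustration, not taken from the driver.
 * Compile with: cc -o irqloop irqloop.c
 */
#include <stdio.h>

static int max_interrupt_work = 5;  /* same default as the driver */

/* Pretend device: reports "events pending" a given number of times.
 * A well-behaved device eventually goes quiet; a stuck status bit
 * never does. */
struct fake_nic {
    int pending;    /* how many status reads will report work to do */
};

static int read_and_clear_status(struct fake_nic *nic)
{
    if (nic->pending <= 0)
        return 0;   /* nothing pending: interrupt fully dismissed */
    nic->pending--;
    return 1;
}

/* Shape of the handler: service events in a loop, but give up after
 * max_interrupt_work iterations in case a bit in the irq mask is stuck. */
static void service_nic(struct fake_nic *nic)
{
    int i;
    for (i = 0; ; i++) {
        if (!read_and_clear_status(nic))
            break;  /* device is quiet, return normally */
        /* ... the real driver would process rx/tx work here ... */
        if (i > max_interrupt_work) {
            printf("too many iterations (%d) in nv_nic_irq\n", i);
            break;  /* bail out rather than loop forever */
        }
    }
}

int main(void)
{
    struct fake_nic quiet = { .pending = 3 };   /* normal case */
    struct fake_nic stuck = { .pending = 999 }; /* "stuck bit" case */

    service_nic(&quiet);  /* returns silently after three passes */
    service_nic(&stuck);  /* trips the limit and prints the message */
    return 0;
}

Bumping max_interrupt_work just raises the bar; if the flag genuinely sticks, no finite limit is large enough, which is why the fix further down changes strategy instead.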
Here are the relevant excerpts from forcedeth.c.
/*
 * Known bugs:
 * We suspect that on some hardware no TX done interrupts are generated.
 * This means recovery from netif_stop_queue only happens if the hw timer
 * interrupt fires (100 times/second, configurable with NVREG_POLL_DEFAULT)
 * and the timer is active in the IRQMask, or if a rx packet arrives by chance.
 * If your hardware reliably generates tx done interrupts, then you can remove
 * DEV_NEED_TIMERIRQ from the driver_data flags.
 * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few
 * superfluous timer interrupts from the nic.
 */

/*
 * Maximum number of loops until we assume that a bit in the irq mask
 * is stuck. Overridable with module param.
 */
static int max_interrupt_work = 5;

/*
 * Optimization can be either throughput mode or cpu mode
 * Throughput Mode: Every tx and rx packet will generate an interrupt.
 * CPU Mode: Interrupts are controlled by a timer.
 */
#define NV_OPTIMIZATION_MODE_THROUGHPUT 0
#define NV_OPTIMIZATION_MODE_CPU 1
static int optimization_mode = NV_OPTIMIZATION_MODE_THROUGHPUT;

/*
 * Poll interval for timer irq
 * This interval determines how frequent an interrupt is generated.
 * This value is determined by [(time_in_micro_secs * 100) / (2^10)]
 * Min = 0, and Max = 65535
 */
static int poll_interval = -1;
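If I read that last comment correctly, poll_interval = (time_in_micro_secs × 100) / 2^10, so a setting of 100 works out backwards to 100 × 1024 / 100 = 1024 µs between timer interrupts, roughly a thousand per second. Keep that figure in mind; it is the value we end up using below.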
So, the usual recommendation is to bump up the max_interrupt_work parameter. I found that did not work well because these machines are so fast; what value would be large enough? Trial and error improved things, but it was still flaky. This was on a machine with eight processors, so we could easily afford to poll the device to service the NIC. We put
options forcedeth optimization_mode=1 poll_interval=100
in /etc/modprobe.d/options (optimization_mode=1 selects NV_OPTIMIZATION_MODE_CPU from the listing above, so the timer drives interrupts instead of every packet generating one), then ran rmmod forcedeth; modprobe forcedeth, and everything was sweetness and light. No more freezes or crashes serving files at gigabit speeds.
How cool is it that old Dad could steer a young man towards the answer to a geeky problem in seconds?