Hi,

I spent the last day mulling things over and doing research. It seems to me that the patch as first posted is correct and solves the deadlock, except that some uses of __GFP_MEMALLOC in __dev_alloc_skb may escape into contexts where the reserve is not guaranteed to be reclaimed. It may be that this does not actually happen, but there is enough varied usage that I would rather err on the side of caution just now and offer a variant called, say, dev_memalloc_skb, so that drivers have to choose explicitly to use it (or supply the flag to __dev_alloc_skb themselves). This is just a stack of inline functions, so there should be no extra object code; a sketch appears below. The dev_memalloc_skb variant can go away in time, but for now it does no harm.

A minor cleanup: somebody (Rik) complained about his bleeding eyes after reading my side-effecty alloc_skb expression, so I rearranged it in a way that should optimize to the same thing.

On the "first write the patch, then do the research" principle, the entire thread on this topic from the ksummit-2005-discuss mailing list is a good read:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/thread.html#186

Matt, you are close to the truth here:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000242.html

but this part isn't right:

   "it's important to note here that "making progress" may require M acknowledgements to N packets representing a single IO. So we need separate send and acknowledge pools for each SO_MEMALLOC socket so that we don't find ourselves wedged with M-1 available mempool slots when we're waiting on ACKs"

This erroneously assumes that mempools throttle the block IO traffic. In fact, the throttling _must_ take place higher up, in the block IO stack. The block driver that submits the network IO must pre-account any per-request resources and block until sufficient resources become available. So the accounting covers both transmit and acknowledge space, and the network block IO protocol must be designed to obey that accounting (see the second sketch below). I will wave my hands, just for now, at the question of how low-level components communicate their resource needs to high-level throttlers.

Andrea, also getting close:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000200.html

But there is no need to be random. Short of actually overflowing the input ring buffer, we can be precise about accepting all block IO packets and dropping non-blockio traffic as necessary.

Rik, not bad:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000218.html

particularly for deducing it from first principles without actually looking at the network code ;-) He even hit on the same socket flag name I settled on (SO_MEMALLOC). But step 4 veers off course: out of order does not matter. And the conclusion that we can throttle at that point by dropping non-blockio packets is not right: packets drawn from the reserve can still live an arbitrary time in the protocol stack, so we could still exhaust the reserve and be right back in the same old deadlock.

Everybody noticed that dropping non-blockio packets is key, and everybody missed the fact that softnet introduces additional queues that need either throttling (which can't be done sanely) or bypassing. Almost everybody noticed that throttling in the block IO submission path is non-optional. Everybody thought that mempool is the one true way of reserving memory. I am not so sure, though I still intend to produce a mempool variant of the patch.
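Here is the sort of thing I mean by the dev_memalloc_skb variant. Take it as a sketch only: it assumes the __GFP_MEMALLOC flag from the patch, and the exact name may still change.

        /*
         * Thin inline wrapper: a driver has to ask for reserve memory
         * explicitly, and no extra object code is generated.
         */
        static inline struct sk_buff *dev_memalloc_skb(unsigned int length)
        {
                return __dev_alloc_skb(length, GFP_ATOMIC | __GFP_MEMALLOC);
        }

And here, equally hand-waved, is the shape of the throttling I have in mind for the block IO submission path. All the names below are invented for illustration; the point is just that the driver budgets its reserve in whole-request units, where one unit covers both the transmit and the acknowledge traffic of a single request, and the submission path blocks until a unit is free.

        #include <asm/semaphore.h>

        #define NBIO_RESERVE_UNITS 32   /* made-up reserve size */

        static struct semaphore nbio_reserve;

        static void nbio_throttle_init(void)
        {
                sema_init(&nbio_reserve, NBIO_RESERVE_UNITS);
        }

        /* Called before submitting a request over the network. */
        static void nbio_throttle(void)
        {
                down(&nbio_reserve);    /* sleep until a unit is free */
        }

        /* Called when the request is fully complete, ack included. */
        static void nbio_unthrottle(void)
        {
                up(&nbio_reserve);
        }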
One problem I see with mempool is that it not only reserves resources, it pins them. If the kernel is full of mempools pinning memory pages here and there, physical defragmentation gets that much harder and the buddy tree fragments that much sooner. The __GFP_MEMALLOC interface does not have this problem, because pages stay in the global pool. So the jury is still out on which method is better.

Obviously, to do the job properly, __GFP_MEMALLOC would need a way of resizing the memalloc reserve as users are loaded and unloaded. Such an interface can be very lightweight; I will cook one up just to demonstrate this (a rough sketch is in the P.S. below).

Now, the scheme in my patch does the job, and I think it does it in a way that works for all drivers, even e1000 (by method 1 in the thread above). But we could tighten this up a little by noticing that it does not actually matter which socket buffer we return to the pool, as long as we are sure to return the same amount of memory as we withdrew. So we could simply account the number of pages alloc_skb withdraws and the number that freeing a packet returns, and the e1000 driver would look at that number to decide whether to mark a packet as from_reserve or not. That way, the driver could set things in motion to release reserve resources sooner, instead of waiting for certain specially flagged skbs to work their way around the rx ring. It also makes it easier for related subsystems (such as delivery hooks) to draw from a single pool with accurate accounting. For now I will follow the current simple per-skb approach all the way to the end, on the "perfect is the enemy of good enough" principle, but in the long run page accounting is the way to go.

Next, on to actually trying this.

Regards,

Daniel
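P.S.  To show what I mean by "very lightweight", here is roughly the resize interface I have in mind for the memalloc reserve. It is only a sketch: none of these names exist anywhere yet, and the part where the allocator's low-water marks get adjusted is still hand-waved.

        #include <linux/spinlock.h>

        static spinlock_t memalloc_lock = SPIN_LOCK_UNLOCKED;
        static unsigned long memalloc_reserve_pages;    /* extra pages held in reserve */

        /* A user such as a network block driver registers its worst-case need. */
        void memalloc_reserve(unsigned long pages)
        {
                spin_lock(&memalloc_lock);
                memalloc_reserve_pages += pages;
                spin_unlock(&memalloc_lock);
                /* ...then tell the allocator to raise its low-water marks by
                   the new total, so the global pool really does grow. */
        }

        /* And gives the pages back when it unloads. */
        void memalloc_unreserve(unsigned long pages)
        {
                spin_lock(&memalloc_lock);
                memalloc_reserve_pages -= pages;
                spin_unlock(&memalloc_lock);
                /* ...and lower the watermarks again. */
        }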