Hi,

I spent the last day mulling things over and doing research. It seems to me that the patch as first posted is correct and solves the deadlock, except that some uses of __GFP_MEMALLOC in __dev_alloc_skb may escape into contexts where the reserve is not guaranteed to be reclaimed. It may be that this does not actually happen, but there is enough varied usage that I would rather err on the side of caution just now and offer a variant called, say, dev_memalloc_skb, so that drivers have to choose explicitly to use it (or supply the flag to __dev_alloc_skb themselves). This is just a stack of inline functions, so there should be no extra object code; a sketch appears below. The dev_memalloc_skb variant can go away in time, but for now it does no harm.

A minor cleanup: somebody (Rik) complained about his bleeding eyes after reading my side-effecty alloc_skb expression, so I rearranged it in a way that should optimize to the same thing.

On the "first write the patch, then do the research" principle, the entire thread on this topic from the ksummit-2005-discuss mailing list is a good read:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/thread.html#186

Matt, you are close to the truth here:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000242.html

but this part isn't right:

   "it's important to note here that "making progress" may require M acknowledgements to N packets representing a single IO. So we need separate send and acknowledge pools for each SO_MEMALLOC socket so that we don't find ourselves wedged with M-1 available mempool slots when we're waiting on ACKs"

This erroneously assumes that mempools throttle the block IO traffic. In fact, the throttling _must_ take place higher up, in the block IO stack. The block driver that submits the network IO must pre-account any per-request resources and block until sufficient resources become available. So the accounting covers both transmit and acknowledge space, and the network block IO protocol must be designed to obey that accounting (see the second sketch below). I will wave my hands, just for now, at the question of how low-level components communicate their resource needs to high-level throttlers.

Andrea, also getting close:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000200.html

But there is no need to be random. Short of actually overflowing the input ring buffer, we can be precise about accepting all block IO packets and dropping non-blockio traffic as necessary.

Rik, not bad:

   http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000218.html

particularly for deducing it from first principles without actually looking at the network code ;-) He even hit on the same socket flag name I settled on (SO_MEMALLOC). But step 4 veers off course: out of order does not matter. And the conclusion that we can throttle at that point by dropping non-blockio packets is not right: packets drawn from the reserve can still live an arbitrary time in the protocol stack, so we could still exhaust the reserve and be right back in the same old deadlock.

Everybody noticed that dropping non-blockio packets is key, and everybody missed the fact that softnet introduces additional queues that need either throttling (which can't be done sanely) or bypassing. Almost everybody noticed that throttling in the block IO submission path is non-optional. Everybody thought that mempool is the one true way of reserving memory. I am not so sure, though I still intend to produce a mempool variant of the patch.
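Here is the sort of thing I mean by the dev_memalloc_skb variant. Take it as a sketch only: it assumes the __GFP_MEMALLOC flag from the patch, and the exact name may still change.

        /*
         * Thin inline wrapper: a driver has to ask for reserve memory
         * explicitly, and no extra object code is generated.
         */
        static inline struct sk_buff *dev_memalloc_skb(unsigned int length)
        {
                return __dev_alloc_skb(length, GFP_ATOMIC | __GFP_MEMALLOC);
        }

And here, equally hand-waved, is the shape of the throttling I have in mind for the block IO submission path. All the names below are invented for illustration; the point is just that the driver budgets its reserve in whole-request units, where one unit covers both the transmit and the acknowledge traffic of a single request, and the submission path blocks until a unit is free.

        #include <asm/semaphore.h>

        #define NBIO_RESERVE_UNITS 32   /* made-up reserve size */

        static struct semaphore nbio_reserve;

        static void nbio_throttle_init(void)
        {
                sema_init(&nbio_reserve, NBIO_RESERVE_UNITS);
        }

        /* Called before submitting a request over the network. */
        static void nbio_throttle(void)
        {
                down(&nbio_reserve);    /* sleep until a unit is free */
        }

        /* Called when the request is fully complete, ack included. */
        static void nbio_unthrottle(void)
        {
                up(&nbio_reserve);
        }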
One problem I see with mempool is that it not only reserves resources, it pins them. If the kernel is full of mempools pinning memory pages here and there, physical defragmentation gets that much harder and the buddy tree fragments that much sooner. The __GFP_MEMALLOC interface does not have this problem, because pages stay in the global pool. So the jury is still out on which method is better.

Obviously, to do the job properly, __GFP_MEMALLOC would need a way of resizing the memalloc reserve as users are loaded and unloaded. Such an interface can be very lightweight; I will cook one up just to demonstrate this (a rough sketch is in the P.S. below).

Now, the scheme in my patch does the job, and I think it does it in a way that works for all drivers, even e1000 (by method 1 in the thread above). But we could tighten this up a little by noticing that it does not actually matter which socket buffer we return to the pool, as long as we are sure to return the same amount of memory as we withdrew. So we could simply account the number of pages alloc_skb withdraws and the number that freeing a packet returns, and the e1000 driver would look at that number to decide whether to mark a packet as from_reserve or not. That way, the driver could set things in motion to release reserve resources sooner, instead of waiting for certain specially flagged skbs to work their way around the rx ring. It also makes it easier for related subsystems (such as delivery hooks) to draw from a single pool with accurate accounting. For now I will follow the current simple per-skb approach all the way to the end, on the "perfect is the enemy of good enough" principle, but in the long run page accounting is the way to go.

Next, on to actually trying this.

Regards,

Daniel
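P.S.  To show what I mean by "very lightweight", here is roughly the resize interface I have in mind for the memalloc reserve. It is only a sketch: none of these names exist anywhere yet, and the part where the allocator's low-water marks get adjusted is still hand-waved.

        #include <linux/spinlock.h>

        static spinlock_t memalloc_lock = SPIN_LOCK_UNLOCKED;
        static unsigned long memalloc_reserve_pages;    /* extra pages held in reserve */

        /* A user such as a network block driver registers its worst-case need. */
        void memalloc_reserve(unsigned long pages)
        {
                spin_lock(&memalloc_lock);
                memalloc_reserve_pages += pages;
                spin_unlock(&memalloc_lock);
                /* ...then tell the allocator to raise its low-water marks by
                   the new total, so the global pool really does grow. */
        }

        /* And gives the pages back when it unloads. */
        void memalloc_unreserve(unsigned long pages)
        {
                spin_lock(&memalloc_lock);
                memalloc_reserve_pages -= pages;
                spin_unlock(&memalloc_lock);
                /* ...and lower the watermarks again. */
        }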