Hi Christopher,
This eliminates the failure as expected (at the source).
I do not think such a solution is required, and it would probably affect performance.
As Matthew said, slab objects should not be used in sk_buff fragments.
The source of these is my kernel TCP sockets, where kernel_sendpage() is called with a slab payload.
I eliminated this and the failure disappeared, although with this kind of fine timing issue the absence of a failure does not prove anything.
Moreover, I tried triggering on slab pages in sk_buff fragments, and nothing came up.
So far:
1) Using a slab payload in kernel_sendpage() is not polite, even though we do not BUG on it and the documentation does not say it is wrong.
2) The RX path cannot produce sk_buffs with slab fragments: drivers use alloc_pagexxx or page_frag_alloc().
What I am still wondering about (and investigating) is how kernel_sendpage() with a slab payload results in a slab payload on another socket's RX path.
Do you see how page ref-counting can be broken with extra references taken on a slab page containing the fragments, and dropped when networking is done with them?
Thanks,
Anton