On 06/11/2024 14.25, Jesper Dangaard Brouer wrote:
>
> On 26/10/2024 09.33, Yunsheng Lin wrote:
>> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
>>
>> ...
>>
>>>
>>>>> You and Jesper seem to be mentioning a possible fact that there might
>>>>> be 'hundreds of gigs of memory' needed for inflight pages. It would
>>>>> be nice to provide more info or reasoning about why 'hundreds of gigs
>>>>> of memory' is needed here, so that we don't do an over-designed thing
>>>>> to support recording unlimited in-flight pages, if the driver unbound
>>>>> stalling turns out to be impossible and the in-flight pages do need
>>>>> to be recorded.
>>>>
>>>> I don't have a concrete example of a use that will blow the limit you
>>>> are setting (but maybe Jesper does), I am simply objecting to the
>>>> arbitrary imposing of any limit at all. It smells a lot like "640k
>>>> ought to be enough for anyone".
>>>>
>>>
>>> As I wrote before: in *production* I'm seeing TCP memory reach 24 GiB
>>> (on machines with 384 GiB memory). I have attached a grafana screenshot
>>> to prove what I'm saying.
>>>
>>> As my co-worker Mike Freemon has explained to me (more details in the
>>> blogpost[1]), it is no coincidence that the graph has a "ceiling" close
>>> to 24 GiB (on machines with 384 GiB total memory). This is because the
>>> TCP network stack goes into a memory "under pressure" state when 6.25%
>>> of total memory is used by the TCP stack, and 6.25% of 384 GiB is
>>> exactly 24 GiB. (Detail: the system stays in that mode until allocated
>>> TCP memory falls below 4.68% of total memory, here roughly 18 GiB.)
>>>
>>>   [1] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>>
>> Thanks for the info.
>
> Some more info from production servers.
>
> (I'm amazed at what we can do with a simple bpftrace script; Cc Viktor)
>
> In the bpftrace script/oneliner below I'm extracting the inflight count
> for all page_pool's in the system, and storing that in a histogram hash.
>
> sudo bpftrace -e '
>  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>   @cnt_total[probe]=count();
>   $pool=(struct page_pool*)arg0;
>   $release_cnt=(uint32)arg2;
>   $hold_cnt=$pool->pages_state_hold_cnt;
>   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>   @inflight=hist($inflight_cnt);
>  }
>  interval:s:1 {time("\n%H:%M:%S\n");
>   print(@cnt); clear(@cnt);
>   print(@inflight);
>   print(@cnt_total);
>  }'
>
> The page_pool behavior depends on how the NIC driver uses it, so I've
> run this on two prod servers with drivers bnxt and mlx5, on a 6.6.51
> kernel.
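A note on the $inflight_cnt line in the script above: both counters are
32-bit values that wrap, so the signed cast is what keeps the computed
distance correct. In C terms (a minimal sketch to illustrate the trick,
not the exact kernel code):

  #include <stdint.h>

  /* Distance between two wrapping u32 counters, as a signed value.
   * Same idea as the (int32) cast in the bpftrace script: the result
   * stays correct even after hold_cnt wraps past 2^32. */
  static inline int32_t inflight_pages(uint32_t hold_cnt,
                                       uint32_t release_cnt)
  {
          return (int32_t)(hold_cnt - release_cnt);
  }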
>
> Driver: bnxt_en
> - Kernel: 6.6.51
>
> @cnt[rawtracepoint:page_pool_state_release]: 8447
> @inflight:
> [0]             507 |                                        |
> [1]             275 |                                        |
> [2, 4)          261 |                                        |
> [4, 8)          215 |                                        |
> [8, 16)         259 |                                        |
> [16, 32)        361 |                                        |
> [32, 64)        933 |                                        |
> [64, 128)      1966 |                                        |
> [128, 256)   937052 |@@@@@@@@@                               |
> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)     73908 |                                        |
> [1K, 2K)    1220128 |@@@@@@@@@@@@                            |
> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
> [8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
> [16K, 32K)   858585 |@@@@@@@@                                |
> [32K, 64K)   693893 |@@@@@@                                  |
> [64K, 128K)  170625 |@                                       |
>
> Driver: mlx5_core
>  - Kernel: 6.6.51
>
> @cnt[rawtracepoint:page_pool_state_release]: 1975
> @inflight:
> [128, 256)         28293 |@@@@                               |
> [256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [512, 1K)              0 |                                   |
> [1K, 2K)            4671 |                                   |
> [2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
> [16K, 32K)         25133 |@@@                                |
> [32K, 64K)          8274 |@                                  |
>
>
> The key thing to notice is that we have up to 128,000 pages in flight
> on these random production servers. The NIC has 64 RX queues
> configured, thus also 64 page_pool objects.
>

I realized that we primarily want to know the maximum number of
in-flight pages, so I modified the bpftrace oneliner to track the max
for each page_pool in the system.

sudo bpftrace -e '
 rawtracepoint:page_pool_state_release { @cnt[probe]=count();
  @cnt_total[probe]=count();
  $pool=(struct page_pool*)arg0;
  $release_cnt=(uint32)arg2;
  $hold_cnt=$pool->pages_state_hold_cnt;
  $inflight_cnt=(int32)($hold_cnt - $release_cnt);
  $cur=@inflight_max[$pool];
  if ($inflight_cnt > $cur) { @inflight_max[$pool]=$inflight_cnt; }
 }
 interval:s:1 {time("\n%H:%M:%S\n");
  print(@cnt); clear(@cnt);
  print(@inflight_max);
  print(@cnt_total);
 }'

I've attached the output from the script. For an unknown reason, this
system had 199 page_pool objects.
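To put the earlier bnxt peak into bytes (a back-of-envelope; I'm
assuming order-0 4 KiB pages here, which undercounts if a driver uses
higher-order pages):

  131,072 pages * 4 KiB = 512 MiB

I.e. a single page_pool can briefly have half a GiB of pages in flight.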
The top 20 users:

$ cat out02.inflight-max | grep inflight_max | tail -n 20
@inflight_max[0xffff88829133d800]: 26473
@inflight_max[0xffff888293c3e000]: 27042
@inflight_max[0xffff888293c3b000]: 27709
@inflight_max[0xffff8881076f2800]: 29400
@inflight_max[0xffff88818386e000]: 29690
@inflight_max[0xffff8882190b1800]: 29813
@inflight_max[0xffff88819ee83800]: 30067
@inflight_max[0xffff8881076f4800]: 30086
@inflight_max[0xffff88818386b000]: 31116
@inflight_max[0xffff88816598f800]: 36970
@inflight_max[0xffff8882190b7800]: 37336
@inflight_max[0xffff888293c38800]: 39265
@inflight_max[0xffff888293c3c800]: 39632
@inflight_max[0xffff888293c3b800]: 43461
@inflight_max[0xffff888293c3f000]: 43787
@inflight_max[0xffff88816598f000]: 44557
@inflight_max[0xffff888132ce9000]: 45037
@inflight_max[0xffff888293c3f800]: 51843
@inflight_max[0xffff888183869800]: 62612
@inflight_max[0xffff888113d08000]: 73203

Adding all the values together:

$ grep inflight_max out02.inflight-max | \
   awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1
total:1707129

Worst case we need a data structure holding 1,707,129 pages.
Fortunately, we don't need a single data structure, as this will be
split between the 199 page_pool's.
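For scale, a back-of-envelope on that worst case (same order-0 4 KiB
page assumption as above, and an illustrative 8 bytes per tracked
page):

  1,707,129 pages * 4 KiB = ~6.5 GiB of memory in flight
  1,707,129 * 8 bytes     = ~13 MiB just to record those pages
  1,707,129 / 199 pools   = ~8,600 pages per page_pool on average

--Jesper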