On 9/18/23 17:54, Jan Kara wrote:
> On Mon 18-09-23 07:59:03, Yury Norov wrote:
>> On Mon, Sep 18, 2023 at 02:46:02PM +0200, Mirsad Todorovac wrote:
>>> --------------------------------------------------------
>>>  lib/find_bit.c | 33 +++++++++++++++++----------------
>>>  1 file changed, 17 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/lib/find_bit.c b/lib/find_bit.c
>>> index 32f99e9a670e..56244e4f744e 100644
>>> --- a/lib/find_bit.c
>>> +++ b/lib/find_bit.c
>>> @@ -18,6 +18,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>  /*
>>>   * Common helper for find_bit() function family
>>> @@ -98,7 +99,7 @@ out: \
>>>   */
>>>  unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
>>>  {
>>> -	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
>>> +	return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size);
>>>  }
>>>  EXPORT_SYMBOL(_find_first_bit);
>>>  #endif
>>
>> ...
>>
>> That doesn't look correct. READ_ONCE() implies that there's another
>> thread modifying the bitmap concurrently. This is not true for the
>> vast majority of bitmap API users, and I expect that forcing
>> READ_ONCE() would affect performance for them.
>>
>> Bitmap functions, with a few rare exceptions like set_bit(), are not
>> thread-safe and require users to perform locking/synchronization where
>> needed.
>
> Well, for xarray the write side is synchronized with a spinlock but the read
> side is not (only RCU protected).
>
>> If you really need READ_ONCE, I think it's better to implement a new
>> flavor of the function(s) separately, like:
>> find_first_bit_read_once()
>
> So yes, xarray really needs READ_ONCE(). And I don't think READ_ONCE()
> imposes any real performance overhead in this particular case because for
> any sane compiler the generated assembly with & without READ_ONCE() will be
> exactly the same. For example I've checked disassembly of _find_next_bit()
> using READ_ONCE(). The main loop is:
>
>    0xffffffff815a2b6d <+77>:	inc    %r8
>    0xffffffff815a2b70 <+80>:	add    $0x8,%rdx
>    0xffffffff815a2b74 <+84>:	mov    %r8,%rcx
>    0xffffffff815a2b77 <+87>:	shl    $0x6,%rcx
>    0xffffffff815a2b7b <+91>:	cmp    %rcx,%rax
>    0xffffffff815a2b7e <+94>:	jbe    0xffffffff815a2b9b <_find_next_bit+123>
>    0xffffffff815a2b80 <+96>:	mov    (%rdx),%rcx
>    0xffffffff815a2b83 <+99>:	test   %rcx,%rcx
>    0xffffffff815a2b86 <+102>:	je     0xffffffff815a2b6d <_find_next_bit+77>
>    0xffffffff815a2b88 <+104>:	shl    $0x6,%r8
>    0xffffffff815a2b8c <+108>:	tzcnt  %rcx,%rcx
>
> So you can see the value we work with is copied from the address (rdx) into
> a register (rcx) and the test and __ffs() happens on a register value and
> thus READ_ONCE() has no practical effect. It just prevents the compiler
> from doing some stupid de-optimization.
>
> 								Honza

If I may also add, the centralised READ_ONCE() version fixed a couple of
hundred instances of KCSAN data races in dmesg.

The _find_*_bit() functions and/or macros cause quite a number of KCSAN BUG
warnings:

     95 _find_first_and_bit (lib/find_bit.c:114 (discriminator 10))
     31 _find_first_zero_bit (lib/find_bit.c:125 (discriminator 10))
    173 _find_next_and_bit (lib/find_bit.c:171 (discriminator 2))
    655 _find_next_bit (lib/find_bit.c:133 (discriminator 2))
      5 _find_next_zero_bit

Finding every find_bit_*() call and replacing it with find_bit_*_read_once()
could be time-consuming and challenging.

However, I will do both versions so you can compare, if you'd like.

Note that in the PoC version I have only implemented
find_next_bit_read_once() ATM, to see if this works.

Regards,
Mirsad
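
P.S. To make the "separate flavor" idea concrete, here is a rough userspace sketch of what a read-once variant could look like. It is not the PoC patch and not kernel code: READ_ONCE_UL() is a hypothetical stand-in for the kernel's READ_ONCE() (assumed here to be a plain volatile load), the function body is my own illustration of the name Yury suggested, and __builtin_ctzl() is the GCC/Clang builtin used in place of __ffs().

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical userspace stand-in for READ_ONCE(): force a single volatile load. */
#define READ_ONCE_UL(x) (*(const volatile unsigned long *)&(x))

/*
 * Illustrative sketch only: return the index of the first set bit in the
 * first 'size' bits of 'addr', or 'size' if no bit is set.
 */
static unsigned long find_first_bit_read_once(const unsigned long *addr,
					      unsigned long size)
{
	unsigned long idx, val, bit;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		/* Load the word once; the test and the ctz use the same snapshot. */
		val = READ_ONCE_UL(addr[idx]);
		if (val) {
			bit = idx * BITS_PER_LONG +
			      (unsigned long)__builtin_ctzl(val);
			/* Clamp hits that fall beyond the requested size. */
			return bit < size ? bit : size;
		}
	}
	return size;
}
```

The point of the local variable is the same one Honza makes from the disassembly: the word is loaded into a register once and both the zero test and the bit scan operate on that copy, so a concurrent writer cannot make the two steps see different values.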