linux-mm.kvack.org archive mirror
* [linux-next:master] [hugetlb]  003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: kernel test robot @ 2024-07-16 13:30 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, Aristeu Rozanski, David Hildenbrand, Vishal Moola,
	oliver.sang



Hello,

kernel test robot noticed "libhugetlbfs-test.32/64bit.counters_sh.fail" on:

commit: 003af997c8a945493859dd1a2d015cc9387ff27a ("hugetlb: force allocating surplus hugepages on mempolicy allowed nodes")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: libhugetlbfs-test
version: libhugetlbfs-test-x86_64-1bb2d54-1_20240614
with the following parameters:

	pagesize: 2MB



compiler: gcc-13
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


The failures are intermittent in our tests, but the parent commit stays clean:

64548bc5347bd051 003af997c8a945493859dd1a2d0
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :20          55%          11:20    libhugetlbfs-test.32bit.counters_sh.fail
           :20          65%          13:20    libhugetlbfs-test.64bit.counters_sh.fail



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407162105.5eab5c8f-oliver.sang@intel.com



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240716/202407162105.5eab5c8f-oliver.sang@intel.com


counters.sh (2M: 32):	FAIL Line 349: Bad HugePages_Total: expected 1, actual 2
counters.sh (2M: 64):	FAIL Line 349: Bad HugePages_Total: expected 1, actual 2
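
For readers unfamiliar with the counter being checked: HugePages_Total is one
of the hugetlb fields the kernel exports in /proc/meminfo. The sketch below
shows one way to extract those fields from meminfo-style text. The function
name `parse_hugepage_counters` and the sample text are hypothetical and are
not taken from libhugetlbfs or from the attached logs.

```python
import re

# Hugetlb counter fields as they appear in /proc/meminfo.
FIELDS = ("HugePages_Total", "HugePages_Free", "HugePages_Rsvd", "HugePages_Surp")


def parse_hugepage_counters(meminfo_text):
    """Return a dict of the HugePages_* counters found in meminfo-style text."""
    counters = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"^(HugePages_\w+):\s+(\d+)\s*$", line)
        if match and match.group(1) in FIELDS:
            counters[match.group(1)] = int(match.group(2))
    return counters


# Illustrative sample only (assumed values, not copied from the robot's logs).
SAMPLE = """\
HugePages_Total:       2
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        1
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    # On a Linux machine, pass open("/proc/meminfo").read() instead of SAMPLE.
    print(parse_hugepage_counters(SAMPLE))
```

Parsing from a string rather than opening /proc/meminfo directly keeps the
helper usable on machines without hugetlb support.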


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [linux-next:master] [hugetlb]  003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: Aristeu Rozanski @ 2024-07-16 15:44 UTC (permalink / raw)
  To: kernel test robot
  Cc: Aristeu Rozanski, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Muchun Song, David Hildenbrand, Vishal Moola

Taking a look at this.

-- 
Aristeu




* Re: [linux-next:master] [hugetlb]  003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: Aristeu Rozanski @ 2024-07-17 20:45 UTC (permalink / raw)
  To: kernel test robot
  Cc: Aristeu Rozanski, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Muchun Song, David Hildenbrand, Vishal Moola

On Tue, Jul 16, 2024 at 11:44:27AM -0400, Aristeu Rozanski wrote:
> Taking a look on this.

So this looks like a timing issue. Adding instrumentation code to figure out
exactly which sequence triggers the failure makes the error go away in the
'counters' test. More specifically, one of the sequences:

	/* touched, shared mmap */
	map(SL_TEST, 1, MAP_SHARED);
	touch(SL_TEST, 1, MAP_SHARED);
	unmap(SL_TEST, 1, MAP_SHARED);

fails because it's expecting:

	HugePages_{Total,Free} = 1
	HugePages_Surp = 0

but gets:

	HugePages_{Total,Free} = 2
	HugePages_Surp = 1

which seems to be caused by a surplus page taking too long to be freed, so
timing makes a difference here.

I'm not sure why it would take longer with my patch applied, but I will keep
digging.
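
The counter values above are consistent with exactly one idle surplus page
still sitting in the pool: HugePages_Total includes surplus pages, so Total and
Free are each off by the same amount as Surp. A small sketch of that check
follows. The helper names are hypothetical, and the dictionaries simply restate
the numbers quoted above.

```python
# Values restated from the expected/observed counters quoted above.
EXPECTED = {"total": 1, "free": 1, "surp": 0}
OBSERVED = {"total": 2, "free": 2, "surp": 1}


def persistent_pages(total, surplus):
    """Pages in the pool that are not surplus (Total includes surplus pages)."""
    return total - surplus


def differs_only_by_idle_surplus(expected, observed):
    """True if observed equals expected plus N surplus pages that are free."""
    extra = observed["surp"] - expected["surp"]
    return (
        extra > 0
        and observed["total"] - expected["total"] == extra
        and observed["free"] - expected["free"] == extra
    )


if __name__ == "__main__":
    print(persistent_pages(OBSERVED["total"], OBSERVED["surp"]))
    print(differs_only_by_idle_surplus(EXPECTED, OBSERVED))
```

If the check holds, the persistent pool size matches what the test expects, and
only the lingering surplus page accounts for the mismatch.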

-- 
Aristeu




* Re: [linux-next:master] [hugetlb]  003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: Aristeu Rozanski @ 2024-07-24 19:48 UTC (permalink / raw)
  To: kernel test robot
  Cc: Aristeu Rozanski, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Muchun Song, David Hildenbrand, Vishal Moola

It really seems to be a matter of a small timing difference. Even profiling
with perf perturbs the timing enough that the problem no longer reproduces. I
will get in contact with the libhugetlbfs folks: the counters test may need
logic that, when surplus pages are still around, waits briefly to give them a
chance to be freed. I believe we're good to go.
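
The kind of wait described above could look roughly like the sketch below. This
is only an illustration in Python, not the libhugetlbfs change itself: the
function names, the timeout, and the polling interval are all assumptions. The
only real interface it relies on is the HugePages_Surp field of /proc/meminfo.

```python
import time


def read_surplus_from_meminfo(path="/proc/meminfo"):
    """Read HugePages_Surp from meminfo; returns 0 if the field is absent."""
    with open(path) as f:
        for line in f:
            if line.startswith("HugePages_Surp:"):
                return int(line.split()[1])
    return 0


def wait_for_surplus_drain(read_surplus, timeout_s=1.0, interval_s=0.01,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll until read_surplus() returns 0 or timeout_s elapses.

    Returns True if the surplus count reached zero, False on timeout.
    The sleep and clock parameters exist so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if read_surplus() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical usage: wait before comparing counters in the test.
    drained = wait_for_surplus_drain(read_surplus_from_meminfo)
    print("surplus drained" if drained else "surplus still present after timeout")
```

A bounded poll keeps the test deterministic in the common case while still
failing if a surplus page is never freed.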

Comments?

-- 
Aristeu


