* [linux-next:master] [hugetlb] 003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
@ 2024-07-16 13:30 kernel test robot
2024-07-16 15:44 ` Aristeu Rozanski
0 siblings, 1 reply; 4+ messages in thread
From: kernel test robot @ 2024-07-16 13:30 UTC (permalink / raw)
To: Aristeu Rozanski
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, Aristeu Rozanski, David Hildenbrand, Vishal Moola,
oliver.sang
Hello,
kernel test robot noticed "libhugetlbfs-test.32/64bit.counters_sh.fail" on:
commit: 003af997c8a945493859dd1a2d015cc9387ff27a ("hugetlb: force allocating surplus hugepages on mempolicy allowed nodes")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
in testcase: libhugetlbfs-test
version: libhugetlbfs-test-x86_64-1bb2d54-1_20240614
with the following parameters:
	pagesize: 2MB
compiler: gcc-13
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
the failures are intermittent in our tests, but the parent commit stays clean:
64548bc5347bd051 003af997c8a945493859dd1a2d0
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :20          55%          11:20     libhugetlbfs-test.32bit.counters_sh.fail
           :20          65%          13:20     libhugetlbfs-test.64bit.counters_sh.fail
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407162105.5eab5c8f-oliver.sang@intel.com
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240716/202407162105.5eab5c8f-oliver.sang@intel.com
counters.sh (2M: 32): FAIL Line 349: Bad HugePages_Total: expected 1, actual 2
counters.sh (2M: 64): FAIL Line 349: Bad HugePages_Total: expected 1, actual 2
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linux-next:master] [hugetlb] 003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: Aristeu Rozanski @ 2024-07-16 15:44 UTC (permalink / raw)
To: kernel test robot
Cc: Aristeu Rozanski, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Muchun Song, David Hildenbrand, Vishal Moola
Taking a look at this.
--
Aristeu
* Re: [linux-next:master] [hugetlb] 003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: Aristeu Rozanski @ 2024-07-17 20:45 UTC (permalink / raw)
To: kernel test robot
Cc: Aristeu Rozanski, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Muchun Song, David Hildenbrand, Vishal Moola
On Tue, Jul 16, 2024 at 11:44:27AM -0400, Aristeu Rozanski wrote:
> Taking a look at this.
This looks like a timing issue. Adding debug code to pin down exactly which
sequence triggers the failure makes the error go away in the 'counters' test.
More specifically, one of the sequences:
	/* touched, shared mmap */
	map(SL_TEST, 1, MAP_SHARED);
	touch(SL_TEST, 1, MAP_SHARED);
	unmap(SL_TEST, 1, MAP_SHARED);
fails because it expects:
	HugePages_{Total,Free} = 1
	HugePages_Surp = 0
but gets:
	HugePages_{Total,Free} = 2
	HugePages_Surp = 1
which seems to be caused by a surplus page taking too long to be freed, so
timing makes a difference here.
I'm not sure why it would take longer with my patch applied, but I will keep
digging.
--
Aristeu
* Re: [linux-next:master] [hugetlb] 003af997c8: libhugetlbfs-test.32/64bit.counters_sh.fail
From: Aristeu Rozanski @ 2024-07-24 19:48 UTC (permalink / raw)
To: kernel test robot
Cc: Aristeu Rozanski, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Muchun Song, David Hildenbrand, Vishal Moola
It really does seem to be a matter of a small timing difference. Even poking at
it with perf is enough to make the problem no longer reproducible. I will get in
contact with the libhugetlbfs folks; the counters test may need some logic to
wait a little when surplus pages are around, to give them a chance to be
freed. I believe we're good to go.
Comments?
--
Aristeu
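For illustration only, the kind of wait suggested above for the counters test
could look roughly like the sketch below. It is not code from libhugetlbfs and
not necessarily the fix that was adopted: the helper names
(parse_meminfo_field, read_meminfo_counter, wait_for_counter), the 10 ms poll
interval and the 8 KiB buffer size are all assumptions. The sketch polls a
/proc/meminfo-style file until a counter such as HugePages_Surp reaches the
expected value or a timeout expires.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: find "<field>:" at the start of a line in a
 * /proc/meminfo-style buffer and parse the integer that follows.
 * Returns 0 on success, -1 if the field is absent or has no number.
 */
int parse_meminfo_field(const char *text, const char *field, long *out)
{
	size_t len;
	const char *line = text;

	assert(text && field && out);
	len = strlen(field);

	while (line && *line) {
		if (strncmp(line, field, len) == 0 && line[len] == ':') {
			const char *num = line + len + 1;
			char *end;
			long val = strtol(num, &end, 10);

			if (end == num)
				return -1;	/* no digits after the colon */
			*out = val;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical helper: read one counter from a meminfo-style file.
 * Assumes the whole file fits in 8 KiB.
 */
int read_meminfo_counter(const char *path, const char *field, long *out)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_meminfo_field(buf, field, out);
}

/*
 * Poll every 10 ms until the counter equals `expected` or roughly
 * `timeout_ms` of sleeping has elapsed.
 * Returns 0 once the value matches, -1 on timeout or read error.
 */
int wait_for_counter(const char *path, const char *field,
		     long expected, long timeout_ms)
{
	const struct timespec step = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	long waited_ms = 0;
	long val;

	for (;;) {
		if (read_meminfo_counter(path, field, &val) != 0)
			return -1;
		if (val == expected)
			return 0;
		if (waited_ms >= timeout_ms)
			return -1;
		nanosleep(&step, NULL);
		waited_ms += 10;
	}
}
```

A test could call wait_for_counter("/proc/meminfo", "HugePages_Surp", 0, 1000)
before comparing the counters, so that a surplus page still in the process of
being freed does not produce a spurious failure. Whether a bounded wait like
this is acceptable in the test is for the libhugetlbfs maintainers to decide.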