From: David Hildenbrand <david@redhat.com>
To: Balbir Singh <balbirs@nvidia.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Liam.Howlett@oracle.com, airlied@gmail.com,
akpm@linux-foundation.org, apopple@nvidia.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, byungchul@sk.com, dakr@kernel.org,
dev.jain@arm.com, dri-devel@lists.freedesktop.org,
francois.dugast@intel.com, gourry@gourry.net,
joshua.hahnjy@gmail.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, lorenzo.stoakes@oracle.com, lyude@redhat.com,
matthew.brost@intel.com, mpenttil@redhat.com, npache@redhat.com,
osalvador@suse.de, rakie.kim@sk.com, rcampbell@nvidia.com,
ryan.roberts@arm.com, simona@ffwll.ch,
ying.huang@linux.alibaba.com, ziy@nvidia.com,
kvm@vger.kernel.org, linux-s390@vger.kernel.org,
linux-next@vger.kernel.org
Subject: Re: linux-next: KVM/s390x regression
Date: Sat, 18 Oct 2025 00:41:23 +0200
Message-ID: <cb85aaa3-e456-4fd8-b323-46c75d453a02@redhat.com>
In-Reply-To: <3a2db8fc-d289-415b-ae67-5a35c9c32a76@redhat.com>

On 18.10.25 00:15, David Hildenbrand wrote:
> On 17.10.25 23:56, Balbir Singh wrote:
>> On 10/18/25 04:07, David Hildenbrand wrote:
>>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>>
>>>>
>>>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>>>
>>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>>
>>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>>
>>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>>
>>>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>>>> related to us disabling THP for the kvm address space?
>>>>>
>>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>>
>>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>>
>>>>>
>>>>> I assume your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>>
>>>> yes.
>>>>
>>>>>
>>>>> I'd rule out copy_huge_pmd(), zap_huge_pmd() as well.
>>>>>
>>>>>
>>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>>
>>>> That partial revert seems to fix the issue
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index 0c847cdf4fd3..567e2d084071 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>          if (pmdvalp)
>>>>                  *pmdvalp = pmdval;
>>>> -        if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>> +        if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>>
>>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>>
>>> And I would expect that it's a page table, because otherwise the change
>>> wouldn't make a difference.
>>>
>>> And the weird thing is that this only triggers sometimes, because if
>>> it would always trigger nothing would ever work.
>>>
>>> Is there some weird scenario where s390x might set a leaf page table mapped in a PMD to non-present?
>>>
>>
>> Good point
>>
>>> Staring at the definition of pmd_present() on s390x it's really just
>>>
>>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>>
>>>
>>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>>
>>
>>
>> I am not an s390 expert, but just looking at the code
>>
>> So the check on s390 is effectively
>>
>> segment_entry/present = false or segment_entry_empty/invalid = true
>
> pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
>
> because
>
> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>
> is the same as
>
> return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
>
> But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
>
> I suspect that can only be the gmap tables.
>
> Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
> because it's a software managed bit for "ordinary" page tables, not gmap
> tables.
>
> Which raises the question why someone would wrongly use
> pte_offset_map()/__pte_offset_map() on the gmap tables.
>
> I cannot immediately spot any such usage in kvm/gmap code, though.
>
Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.

Oh my.

So we're mapping a user PTE table that is linked into the gmap tables
through a PMD table that does not have the right sw bits set we would
expect in a user PMD table.

What's also scary is that pte_alloc_map_lock() would try to pte_alloc()
a user page table in the gmap, which sounds completely wrong?

Yeah, when walking the gmap and wanting to lock the linked user PTE
table, we should probably never use the pte_*map variants but obtain
the lock through pte_lockptr().

All magic we end up doing with RCU etc in __pte_offset_map_lock()
does not apply to the gmap PMD table.
--
Cheers
David / dhildenb