Date: Mon, 3 Feb 2025 11:10:56 +0100
From: Oscar Salvador
To: David Hildenbrand
Cc: lsf-pc@lists.linux-foundation.org, Peter Xu, Muchun Song, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
References: <4c50a439-e2b8-4f54-ba3d-366d0e2961b2@redhat.com>
In-Reply-To: <4c50a439-e2b8-4f54-ba3d-366d0e2961b2@redhat.com>

On Fri,
Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
> On 30.01.25 22:36, Oscar Salvador wrote:
> Hi Oscar,

Hi David

> >
> > HugeTLB has its own way of dealing with things.
> > E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> > huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
> >
> > One of the challenges that this raises is that if we want pmd/pud walkers to
> > be able to make sense of hugetlb stuff, we need to implement pud/pmd
> > (maybe some pmd we already have because of THP) variants of those.
>
> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
> abstracted by hugetlb to be a single unit ("hugetlb pte").

Yes, that is a PITA to be honest.
E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
single-unit assumption.
I was checking how we could make cont-{pmd,pte} work for hugetlb in that area
(just because it happens to be the first area I am trying to convert).
E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.

I came up with something like the following (completely untested, just a PoC)
to make smaps_pte_range work for cont-ptes.
We would also need to change any references to PAGE_SIZE to the actual size of
the folio (if there are any).

But looking closer, folio_pte_batch is only declared in mm/internal.h, so I am
not sure I can make use of it outside the mm/ realm.
If not, how could we work around that?
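To make the batching idea concrete, here is a toy userspace model in C. It is
only a sketch under loud assumptions: struct toy_pte, toy_pte_batch() and
toy_batch_size() are made-up names for illustration, not kernel interfaces,
and the real folio_pte_batch() performs more checks than this. The model
counts how many consecutive entries map consecutive PFNs, folds young/dirty
over that run, and reports the size of the run in bytes instead of assuming
one PAGE_SIZE entry per call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page table entry; NOT the kernel's pte_t. */
struct toy_pte {
	unsigned long pfn;
	bool present;
	bool young;
	bool dirty;
};

/*
 * Hypothetical helper: count how many consecutive entries, starting at
 * ptes[0], are present and map consecutive PFNs. young/dirty are OR-ed
 * over the counted entries only.
 */
static int toy_pte_batch(const struct toy_pte *ptes, int max_nr,
			 bool *any_young, bool *any_dirty)
{
	int nr;

	assert(ptes != NULL && max_nr > 0);

	*any_young = false;
	*any_dirty = false;

	for (nr = 0; nr < max_nr; nr++) {
		/* Stop at the first hole or non-contiguous PFN. */
		if (!ptes[nr].present || ptes[nr].pfn != ptes[0].pfn + nr)
			break;
		*any_young |= ptes[nr].young;
		*any_dirty |= ptes[nr].dirty;
	}

	return nr;
}

/* Bytes covered by a batch, rather than one hardcoded page per call. */
static unsigned long toy_batch_size(int nr, unsigned long page_size)
{
	return (unsigned long)nr * page_size;
}
```

With four present entries whose PFNs are 100, 101, 102 and 200, toy_pte_batch()
stops at the fourth entry and reports a run of three, which toy_batch_size()
turns into three pages' worth of bytes to account in one go. The actual PoC
against fs/proc/task_mmu.c follows.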
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 193dd5f91fbf..21fb122ebcac 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
 #endif
 
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
-		struct mm_walk *walk)
+		struct mm_walk *walk, int nr_batch_entries)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
@@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (pte_present(ptent)) {
 		page = vm_normal_page(vma, addr, ptent);
-		young = pte_young(ptent);
-		dirty = pte_dirty(ptent);
+		if (nr_batch_entries) {
+			int nr;
+			nr = folio_pte_batch(page_folio(page), addr, pte,
+					     ptent, nr_batch_entries, 0, NULL,
+					     &young, &dirty);
+			if (nr != nr_batch_entries) {
+				walk->action = ACTION_AGAIN;
+				return;
+			}
+		} else {
+			young = pte_young(ptent);
+			dirty = pte_dirty(ptent);
+		}
 		present = true;
 	} else if (is_swap_pte(ptent)) {
 		swp_entry_t swpent = pte_to_swp_entry(ptent);
@@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			   struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	pte_t *pte;
+	pte_t *pte, ptent;
 	spinlock_t *ptl;
 
 	ptl = pmd_huge_lock(pmd, vma);
@@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
+
+	ptent = ptep_get(pte);
+	if (pte_present(ptent)) {
+		struct folio *folio = vm_normal_folio(vma, addr, ptent);
+		if (folio && folio_test_large(folio)) {
+			/* Let us use pte batching */
+			smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
+			pte_unmap_unlock(pte, ptl);
+
+			return 0;
+		}
+	}
+
 	for (; addr != end; pte++, addr += PAGE_SIZE)
-		smaps_pte_entry(pte, addr, walk);
+		smaps_pte_entry(pte, addr, walk, 0);
 	pte_unmap_unlock(pte - 1, ptl);
 out:
 	cond_resched();

> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
> managed: you can modify any PTE part of the cont-gang and it will just
> work as expected, transparently.

But that is because, AFAIU, on arm64 it knows whether it's dealing with a
cont-pte and it queries the status of the whole "gang", right?

I see that huge_ptep_get/set_huge_pte_at check whether they are dealing with
cont-ptes and modify/query all sub-ptes.
How is that done for non-hugetlb pages?

> Not so with hugetlb, where you have to modify (or even query) the whole thing.

I see.

>
> For GUP it was easier, because it was able to grab all information it needed
> from the sub-ptes fairly easily, and it doesn't modify any page tables.
>
>
> I ran into this problem with folio_walk, and had to document it rather nastily:
>
>  * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
>  * For example, PMD page table sharing might require prior unsharing. Also,
>  * logical hugetlb entries might span multiple physical page table entries,
>  * which *must* be modified in a single operation (set_huge_pte_at(),
>  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
>  * not correspond to the first physical entry of a logical hugetlb entry.
>
> I wanted to use it to rewrite the uprobe code to also handle hugetlb with
> less special casing, but that work stalled so far. I think my next attempt would rule
> out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.
>
> It all gets weird with things like:
>
> commit 0549e76663730235a10395a7af7ad3d3ce6e2402
> Author: Christophe Leroy
> Date: Tue Jul 2 15:51:25 2024 +0200
>
>     powerpc/8xx: rework support for 8M pages using contiguous PTE entries
>     In order to fit better with standard Linux page tables layout, add support
>     for 8M pages using contiguous PTE entries in a standard page table. Page
>     tables will then be populated with 1024 similar entries and two PMD
>     entries will point to that page table.
>     The PMD entries also get a flag to tell it is addressing an 8M page, this
>     is required for the HW tablewalk assistance.
>
> Where we are walking a PTE table, but actually there is another PTE table we
> have to modify in the same go.
>
>
> Very hard to make that non-hugetlb aware, as it's simply completely different compared
> to ordinary page table walking/modifications today.
>
> Maybe there are ideas to tackle that, and I'd be very interested in them.

Yes, that one is very nasty, and as I recall from other conversations with
Christophe, we ran out of bits (powerpc-8xx) to flag somehow that the entry is
special, so I am not really sure either.
Although it would be a pity if that were the thing that ends up stalling this.

-- 
Oscar Salvador
SUSE Labs