From: Jann Horn <jannh@google.com>
Date: Sat, 22 Mar 2025 01:14:39 +0100
Subject: Re: [RFC PATCH 1/7] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
To: Lorenzo Stoakes
Cc: Andrew Morton, Vlastimil Babka, "Liam R. Howlett", Suren Baghdasaryan,
Howlett" , Suren Baghdasaryan , Matthew Wilcox , David Hildenbrand , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: tukhtw5kkpxt9631r1td6u8r9mpcb87j X-Rspam-User: X-Rspamd-Queue-Id: E12F8140011 X-Rspamd-Server: rspam08 X-HE-Tag: 1742602517-483547 X-HE-Meta: U2FsdGVkX18IejcFJOcF67vthR9DLPxHlJd11MNaYMm/vYNRaiD/7BdaYE6E8blNzdv5dAHtl3AzwSL+y16ECVsINbELOR79vGWUmxw9M08ZYQSw+5Ko+ix0/zT0+QvI3Vfce8yEpWXXSUzU4R1HqQkJGD5NC4bosmawrJb30jVjMAEOBcV10sm/BXraCNX5RQJGHQ/4730v1DpIi6eRCTTAed/4i26o2fWJZctigTvzPfBB+hSeCIIZYl95eBTPCUOGvo356kRlstnNFbTXqo/6l7EWpycWffFJaHYgMUJJM0LuHMBBsXorTcPFshT1vzsUuUiHWEKHn3kHomO4bw6hL+ojA8a3Juri3UIsyhJlQGeWXXcOvPh20m439bCyjDPzmhO5IInWRMbu1Q8AfP87upr2mXd+VGROE7tDTx0nhWOZVYVnxTr1qpurTz/uOmBhen9yxxz/TLvsVuOJOArEE0qB90d7p245UOYCCBovy6+lKt5UH74nObIZVMgI3g7wNF/6dVpGOs2rqIxpsTVQ8/6uo+4yKCe1uktt9zj1Hs+JzFZnGay5XsIN9aEFrpl7kKJ7zFZW3A3soLFkwhf1otwdAge+IRN2YHZn66kjW7nxTr2mqnnBuUAl6Ad/js3ima6MaROTQv79LOR5MMfUtmA05pE4u7xJ9SV3cITrfwLkw2avrEucPUkaoj1r8tZkB24+m2wtV2RvH2f8wuXSJ5K9BO5NeSUmWWMhOhuKw/BXAT6y/B+uI0qTnHXXF9UjAykCtIh4uAsRe9SzNww47DNK/H2Q7MbAf9ElN5cOx5sY3A0Hc+x1Z/ddDwcm9CUy1jRHvrYmFxUFMMPbrUz1237WTiB/HKKfXZTsUVxJYAtEA2cLGQilz4iFE1ARFHKUch44Xkgro7lcKNX6p2s0fmroMZaPqLHib9eo/ahHVvj8k3dci78uznj7y3mKno69YkD3OYZiR2PVfrn mpWCJQCH aALVoHvpCfNHCqoa2nAA1LnxVVvqSALS3U04hWJHlYVxIGrfCzRA/YHDxOrVLv+n6ekLKqwRxPawtJFFoMDKaPKKgkXm9NvRtyQj3dAVJfJSaDf6iOI+VpxhEBEofNQDWF9d5689eoRbVODcxf4WqHwfOUU1gxQE/+Migl2gq7I7/zB+cKsxd3XVK1bzPngFIst4HHzcZtIqdQPbe1F1jM16k7X/XzvZan9H93zz1KwiMaGuZjQG641ZQw5Cri+fpops5J0z1D0KScwA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000085, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Mar 21, 2025 at 10:54=E2=80=AFPM Lorenzo Stoakes wrote: > diff --git a/mm/mremap.c b/mm/mremap.c > index 0865387531ed..bb67562a0114 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c [...] > +/* > + * If the folio mapped at the specified pte entry can have its index and= mapping > + * relocated, then do so. > + * > + * Returns the number of pages we have traversed, or 0 if the operation = failed. > + */ > +static unsigned long relocate_anon(struct pagetable_move_control *pmc, > + unsigned long old_addr, unsigned long new_addr, pte_t pte= , > + bool undo) > +{ > + struct page *page; > + struct folio *folio; > + struct vm_area_struct *old, *new; > + pgoff_t new_index; > + unsigned long ret =3D 1; > + > + old =3D pmc->old; > + new =3D pmc->new; > + > + /* Ensure we have truly got an anon folio. */ > + page =3D vm_normal_page(old, old_addr, pte); > + if (!page) > + return ret; > + folio =3D page_folio(page); > + folio_lock(folio); > + > + /* no-op. */ > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) > + goto out; > + > + /* > + * This should not happen as we explicitly disallow this, but che= ck > + * anyway. > + */ > + if (folio_test_large(folio)) { > + ret =3D 0; > + goto out; > + } Do I understand correctly that you assume here that the page is exclusively mapped? Maybe we could at least WARN_ON(folio_mapcount(folio) !=3D 1) or something like that? (I was also wondering if the PageAnonExclusive bit is somehow relevant, but we should probably not look at or touch that here, unless we want to think about cases where we _used to_ have a child from which the page may have been GUP'd...) 
> +        if (!undo)
> +                new_index = linear_page_index(new, new_addr);
> +        else
> +                new_index = linear_page_index(old, old_addr);
> +
> +        /*
> +         * The PTL should keep us safe from unmapping, and the fact the folio is
> +         * a PTE keeps the folio referenced.
> +         *
> +         * The mmap/VMA locks should keep us safe from fork and other processes.
> +         *
> +         * The rmap locks should keep us safe from anything happening to the
> +         * VMA/anon_vma.

"The rmap locks"? I only see that we're holding the rmap lock on the
new anon_vma; are we also holding a lock on the old anon_vma?

> +         * The folio lock should keep us safe from reclaim, migration, etc.
> +         */
> +        folio_move_anon_rmap(folio, undo ? old : new);
> +        WRITE_ONCE(folio->index, new_index);
> +
> +out:
> +        folio_unlock(folio);
> +        return ret;
> +}
[...]
> +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> +                unsigned long extent, pmd_t *pmd, bool undo)
> +{
> +        struct mm_struct *mm = current->mm;
> +        struct pte_state state = {
> +                .old_addr = pmc->old_addr,
> +                .new_addr = pmc->new_addr,
> +                .old_end = pmc->old_addr + extent,
> +        };
> +        spinlock_t *ptl;
> +        pte_t *ptep_start;
> +        bool ret;
> +        unsigned long nr_pages;
> +
> +        ptep_start = pte_offset_map_lock(mm, pmd, pmc->old_addr, &ptl);
> +        /*
> +         * We prevent faults with mmap write lock, hold the rmap lock and should
> +         * not fail to obtain this lock. Just give up if we can't.
> +         */
> +        if (!ptep_start)
> +                return false;
> +
> +        state.ptep = ptep_start;
> +
> +        for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> +                pte_t pte = ptep_get(state.ptep);
> +
> +                if (pte_none(pte) || !pte_present(pte)) {
> +                        nr_pages = 1;
> +                        continue;
> +                }
> +
> +                nr_pages = relocate_anon(pmc, state.old_addr, state.new_addr, pte, undo);
> +                if (!nr_pages) {
> +                        ret = false;
> +                        goto out;
> +                }
> +        }
> +
> +        ret = true;
> +out:
> +        pte_unmap_unlock(ptep_start, ptl);
> +        return ret;
> +}

Just to make sure I understand correctly: This function is changing
the ->index and ->mapping of folios while they are still mapped in
the old VMA, right? And if that fails midway through for whatever
reason, we go and change all the already-changed ->index and
->mapping values back?

[...]

> @@ -1132,6 +1347,42 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
>          }
>  }
>
> +/*
> + * Should we attempt to relocate anonymous folios to the location that the VMA
> + * is being moved to by updating index and mapping fields accordingly?
> + */
> +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> +        struct pagetable_move_control *pmc)
> +{
> +        struct vm_area_struct *old = vrm->vma;
> +
> +        /* Currently we only do this if requested. */
> +        if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> +                return false;
> +
> +        /* We can't deal with special or hugetlb mappings. */
> +        if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> +                return false;
> +
> +        /* We only support anonymous mappings. */
> +        if (!vma_is_anonymous(old))
> +                return false;
> +
> +        /* If no folios are mapped, then no need to attempt this. */
> +        if (!old->anon_vma)
> +                return false;
> +
> +        /*
> +         * If the old VMA is a child (i.e. has been forked), then the index
> +         * references multiple VMAs, we have to bail.
> +         */
> +        if (!list_is_singular(&old->anon_vma_chain))
> +                return false;

I think this is wrong: list_is_singular(&old->anon_vma_chain) tells
you whether pages in the VMA might be shared due to this mm_struct
being forked from a parent mm_struct; but it won't tell you whether
pages in the VMA might be shared due to this mm_struct being the
parent of another mm_struct (other way around).

I guess checking old->anon_vma->root->num_children could maybe work...

> +
> +        /* Otherwise, we're good to go! */
> +        return true;
> +}
> +
>  /*
>   * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the
>   * process. Additionally handle an error occurring on moving of page tables,
> @@ -1151,9 +1402,11 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>          struct vm_area_struct *new_vma;
>          int err = 0;
>          PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len);
> +        bool relocate_anon = should_relocate_anon(vrm, &pmc);
>
> +again:
>          new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
> -                        &pmc.need_rmap_locks);
> +                        &pmc.need_rmap_locks, &relocate_anon);
>          if (!new_vma) {
>                  vrm_uncharge(vrm);
>                  *new_vma_ptr = NULL;
> @@ -1163,12 +1416,66 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>          pmc.old = vma;
>          pmc.new = new_vma;
>
> +        if (relocate_anon) {
> +                /*
> +                 * We have a new VMA to reassign folios to. We take a lock on
> +                 * its anon_vma so reclaim doesn't fail to unmap mappings.
> +                 *
> +                 * We have acquired a VMA write lock by now (in vma_link()), so
> +                 * we do not have to worry about racing faults.
> +                 */
> +                anon_vma_lock_write(new_vma->anon_vma);
> +                pmc.relocate_locked = new_vma;
> +
> +                if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> +                        unsigned long start = new_vma->vm_start;
> +                        unsigned long size = new_vma->vm_end - start;
> +
> +                        /* Undo if fails. */
> +                        relocate_anon_folios(&pmc, /* undo= */true);

This relocate_anon_folios() has to make sure to only operate on the
subset of PTEs that we already edited successfully, right?

> +                        vrm_stat_account(vrm, vrm->new_len);
> +
> +                        anon_vma_unlock_write(new_vma->anon_vma);
> +                        pmc.relocate_locked = NULL;
> +
> +                        do_munmap(current->mm, start, size, NULL);
> +                        relocate_anon = false;
> +                        goto again;
> +                }
> +        }
[...]
> diff --git a/mm/vma.c b/mm/vma.c
> index 5418eef3a852..09027448753f 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
[...]
> @@ -1821,6 +1834,14 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>                          new_vma->vm_ops->open(new_vma);
>                  if (vma_link(mm, new_vma))
>                          goto out_vma_link;
> +                /*
> +                 * If we're attempting to relocate anonymous VMAs, we
> +                 * don't want to reuse an anon_vma as set by
> +                 * vm_area_dup(), or copy anon_vma_chain or anything
> +                 * like this.
> +                 */
> +                if (*relocate_anon && __anon_vma_prepare(new_vma))
> +                        goto out_vma_link;

Is this bailout really okay? We go to the same label as if vma_link()
failed, but we're in a very different state: for example, the VMA has
already been inserted into the VMA tree with vma_iter_store() as part
of vma_link() (unless that changed in one of the prerequisite
patches).

>                  *need_rmap_locks = false;
>          }
>          return new_vma;
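In case it's useful, the shape of the num_children check I was
imagining above, next to the list_is_singular() check in
should_relocate_anon() (untested sketch; whether a never-forked root
reads as exactly 1 here because of its self-link, and whether the
locks held at this point keep the value stable, would both need
double-checking):

        /*
         * list_is_singular() above only rules out our being a fork
         * child; also bail if our anon_vma tree has child anon_vmas,
         * i.e. some other mm may still hold references to these folios.
         */
        if (old->anon_vma->root->num_children > 1)
                return false;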