From: Yu Zhao <yuzhao@google.com>
Date: Wed, 10 Jul 2024 17:07:57 -0600
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: Catalin Marinas
Cc: Nanyong Sun, will@kernel.org, mike.kravetz@oracle.com,
 muchun.song@linux.dev, akpm@linux-foundation.org,
 anshuman.khandual@arm.com, willy@infradead.org,
 wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240113094436.2506396-1-sunnanyong@huawei.com>

On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas wrote:
>
> On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote:
> > On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas wrote:
> > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote:
> > > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas wrote:
> > > > > If I did the maths right, for a 2MB hugetlb page, we have about 8
> > > > > vmemmap pages (32K). Once we split a 2MB vmemmap range,
> > > >
> > > > Correct.
> > > >
> > > > > whatever else needs to be touched in this range won't require a
> > > > > stop_machine().
> > > >
> > > > There might be some misunderstandings here.
> > > >
> > > > To do HVO:
> > > > 1. we split a PMD into 512 PTEs;
> > > > 2. for every 8 PTEs:
> > > >    2a. we allocate an order-0 page for PTE #0;
> > > >    2b. we remap PTE #0 *RW* to this page;
> > > >    2c. we remap PTEs #1-7 *RO* to this page;
> > > >    2d. we free the original order-3 page.
> > >
> > > Thanks. I now remember why we reverted such support in 060a2c92d1b6
> > > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The
> > > main problem is that point 2c also changes the output address of the
> > > PTE (and the content of the page slightly). The architecture requires
> > > a break-before-make in such a scenario, though it would have been
> > > nice if it was more specific on what could go wrong.
> > >
> > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I
> > > assume these 8 vmemmap pages may be accessed, and that's why we can't
> > > do a break-before-make safely.
> >
> > Correct.
> >
> > > I was wondering whether we could make the PTEs RO first and then
> > > change the output address, but we have another rule that the content
> > > of the page should be the same. I don't think entries 1-7 are
> > > identical to entry 0 (though we could ask the architects for
> > > clarification here). Also, can we guarantee that nothing writes to
> > > entry 0 while we would do such remapping?
> >
> > Yes, it's already guaranteed.
> >
> > > We know entries 1-7 won't be written as we mapped them as RO, but
> > > entry 0 contains the head page. Maybe it's ok to map it RO
> > > temporarily until the newly allocated hugetlb page is returned.
> >
> > We can do that, but I don't understand how this could elide BBM. After
> > the above, we would still need to do:
> > 3. remap entry 0 from RO to RW, mapping the `struct page` page that
> >    will be shared with entries 1-7;
> > 4. remap entries 1-7 from their respective `struct page` pages to that
> >    of entry 0, while they remain RO.
>
> The Arm ARM states that we need a BBM if we change the output address
> and: the old or new mapping is RW, *or* the content of the page
> changes. Ignoring the latter (page content), we can turn the PTEs RO
> first without changing the pfn, followed by changing the pfn while they
> are RO. Once that's done, we make entry 0 RW, with, of course,
> additional TLBIs between all these steps.

Aha! This is easy to do -- the RO part is already guaranteed, as I
mentioned earlier.

Just to make sure I fully understand the workflow:
1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct
   page` area.
2. TLBI once, after pmd_populate_kernel().
3. For every 8 PTEs, remap PTEs 1-7 to the 4KB `struct page` area of
   PTE 0, while they remain RO.
4. TLBI once, after set_pte_at() on PTEs 1-7.
5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page`
   area.
6. TLBI once, after set_pte_at() on PTE 0.

No BBM required, regardless of FEAT_BBM level 2. Is this correct?
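Something like this minimal sketch of steps 3-6 is what I have in mind,
assuming steps 1-2 have already run (hvo_remap_range() and
VMEMMAP_PTES_PER_HPAGE are made-up names, and freeing the now-unused
pages behind PTEs 1-7 is omitted):

        #include <linux/mm.h>
        #include <linux/pgtable.h>
        #include <asm/tlbflush.h>

        /* 32K of struct pages per 2MB hugetlb page */
        #define VMEMMAP_PTES_PER_HPAGE  8

        static void hvo_remap_range(pmd_t *pmdp, unsigned long start,
                                    unsigned long end)
        {
                pte_t *ptep = pte_offset_kernel(pmdp, start);
                unsigned long addr;
                int i;

                /* Step 3: point PTEs 1-7 at PTE 0's page, still RO. */
                for (addr = start; addr < end;
                     addr += VMEMMAP_PTES_PER_HPAGE * PAGE_SIZE,
                     ptep += VMEMMAP_PTES_PER_HPAGE) {
                        unsigned long pfn = pte_pfn(ptep[0]);

                        for (i = 1; i < VMEMMAP_PTES_PER_HPAGE; i++)
                                set_pte_at(&init_mm, addr + i * PAGE_SIZE,
                                           &ptep[i],
                                           pfn_pte(pfn, PAGE_KERNEL_RO));
                }
                /* Step 4: one TLBI for the whole range. */
                flush_tlb_kernel_range(start, end);

                /* Step 5: flip each PTE 0 to RW, same pfn. */
                ptep = pte_offset_kernel(pmdp, start);
                for (addr = start; addr < end;
                     addr += VMEMMAP_PTES_PER_HPAGE * PAGE_SIZE,
                     ptep += VMEMMAP_PTES_PER_HPAGE)
                        set_pte_at(&init_mm, addr, ptep,
                                   pfn_pte(pte_pfn(ptep[0]), PAGE_KERNEL));

                /* Step 6: final TLBI so entry 0 is visibly RW again. */
                flush_tlb_kernel_range(start, end);
        }

The key property is that no step changes the pfn and the permission of
the same PTE at once: the pfn changes only while the PTEs are RO, and
the RO->RW flip keeps the pfn.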
> Can we leave entry 0 RO? This would save an additional TLBI.

Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
refcount on any hugeTLB page.

> Now, I wonder if all this is worth it. What are the scenarios where the
> 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> hugetlb page for example is pretty well defined - 8 x 4K pages, aligned.
>
> > > If we could get the above to work, it would be a lot simpler than
> > > thinking of stop_machine() or other locks to wait for such remapping.
> >
> > Steps 3/4 would not require BBM somehow?
>
> If we ignore the 'content' requirement, I think we could skip the BBM,
> but we need to make sure we don't change the permission and the pfn at
> the same time.

Gotcha.

> > > > To do de-HVO:
> > > > 1. for every 8 PTEs:
> > > >    1a. we allocate 7 order-0 pages;
> > > >    1b. we remap PTEs #1-7 *RW* to those pages, respectively.
> > >
> > > Similar problem in 1b, changing the output address. Here we could
> > > force the content to be the same.
> >
> > I don't follow the "the content to be the same" part. After HVO, we
> > have:
> >
> > Entry 0 -> `struct page` page A, RW
> > Entry 1 -> `struct page` page A, RO
> > ...
> > Entry 7 -> `struct page` page A, RO
> >
> > To de-HVO, we need to make them:
> >
> > Entry 0 -> `struct page` page A, RW
> > Entry 1 -> `struct page` page B, RW
> > ...
> > Entry 7 -> `struct page` page H, RW
> >
> > I assume "the same content" means PTE_0 == PTE_1/.../7?
>
> That's the content of the page at the corresponding pfn before and
> after the pte change. I'm pretty sure the Arm ARM states this so that,
> in case the hardware starts a load (e.g. unaligned) from one page and
> completes it from another, the software does not see any difference.
> But for the fields we care about in struct page, I assume they'd be
> the same (or that we just don't care about inconsistencies during this
> transient period).

Thanks for the explanation. I'll cook up something if my understanding
above is correct.
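For the de-HVO direction, a similarly rough sketch under the same two
rules (identical page content across the pfn change; never change the
permission and the pfn in one step) might look like the following.
Again, hvo_restore_group() is a made-up name and error unwinding is
omitted:

        static int hvo_restore_group(unsigned long addr, pte_t *ptep)
        {
                unsigned long end = addr + VMEMMAP_PTES_PER_HPAGE * PAGE_SIZE;
                struct page *head = pte_page(ptep[0]);
                int i;

                for (i = 1; i < VMEMMAP_PTES_PER_HPAGE; i++) {
                        struct page *page = alloc_page(GFP_KERNEL);

                        if (!page)
                                return -ENOMEM; /* real code would unwind */

                        /* Keep the content identical across the pfn change... */
                        copy_page(page_address(page), page_address(head));
                        /* ...and change the pfn while the PTE stays RO. */
                        set_pte_at(&init_mm, addr + i * PAGE_SIZE, &ptep[i],
                                   mk_pte(page, PAGE_KERNEL_RO));
                }
                flush_tlb_kernel_range(addr, end);

                /* Now flip PTEs 1-7 to RW without touching their pfns. */
                for (i = 1; i < VMEMMAP_PTES_PER_HPAGE; i++)
                        set_pte_at(&init_mm, addr + i * PAGE_SIZE, &ptep[i],
                                   pfn_pte(pte_pfn(ptep[i]), PAGE_KERNEL));
                flush_tlb_kernel_range(addr, end);

                return 0;
        }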