From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 93EFAC4332F for ; Wed, 14 Dec 2022 00:05:15 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id EE8398E0003; Tue, 13 Dec 2022 19:05:14 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E97E88E0002; Tue, 13 Dec 2022 19:05:14 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D5FB78E0003; Tue, 13 Dec 2022 19:05:14 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id C67878E0002 for ; Tue, 13 Dec 2022 19:05:14 -0500 (EST) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 9936A160C31 for ; Wed, 14 Dec 2022 00:05:14 +0000 (UTC) X-FDA: 80238966948.25.107B1C1 Received: from mail-wm1-f48.google.com (mail-wm1-f48.google.com [209.85.128.48]) by imf29.hostedemail.com (Postfix) with ESMTP id E796312000D for ; Wed, 14 Dec 2022 00:05:12 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=Phu5PfAM; spf=pass (imf29.hostedemail.com: domain of jthoughton@google.com designates 209.85.128.48 as permitted sender) smtp.mailfrom=jthoughton@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1670976313; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=JL7cSmFEiYtsVCTU7fmtFqqTwzhbwsAl9I0/oTYsi9I=; b=FHjKyUOxE6U3f03c1lFhdvqRqSljw9ZNCFDUthZl8GFwakDnnmgCfHArTgywB9R4Yy/LP9 el0oC0ZLlENSxZ9/Y4FV7ZiOwPQNtBgBtELMi7x1hcYhTH6E4j5EQhNJDY6sXSjoVDh2Me JpulxVvr3maxGdnq5+m19rJRFVhVbZs= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=Phu5PfAM; spf=pass (imf29.hostedemail.com: domain of jthoughton@google.com designates 209.85.128.48 as permitted sender) smtp.mailfrom=jthoughton@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1670976313; a=rsa-sha256; cv=none; b=e+cgDvWzJOtkH6Kfwi31gygyI0qs2VuZQIIw4T12DFMf13Gwal/8E01nkW0/YtzNbUapbL wNdfPBRXlZ9ZG/Zgi9a7reeXPsWDTrJCz7MXLt3YJRgey+AQ8yhZNhDv6Xf/51+DZg81Y8 y1CTab2TaYH3kak8DbBRcp/Wj/6N004= Received: by mail-wm1-f48.google.com with SMTP id f13-20020a1cc90d000000b003d08c4cf679so9474074wmb.5 for ; Tue, 13 Dec 2022 16:05:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=JL7cSmFEiYtsVCTU7fmtFqqTwzhbwsAl9I0/oTYsi9I=; b=Phu5PfAMI2MfzaZp31GDkVJ7xaQtq9gOVS+Tv0zGH2u1UIYxJwt/sfBuojIDZIyx/E N+blm4DqXM5h3hnRLXSSHqLYlunADBpbIx1UZDdLALwdLNPynpmEFtjfXNFs1noQFo30 IPWwtZOVPW65GKav8kECjl43qTyCx7fZSMXLHu5Es0kbbzmctv8kgnIZD9zaQIhX5dUn /3bV1xuOa3cD7/n7XgrrwR9N5tlyBwta8yQCKQvT2Z3ZrbQZ8fxxSaSdAafwrTyWiT4v Xj5qZ8rhi+hnH9yK+BGxJ7QCV0YILYUT5LdgKqnVNkL+g39FlnnmvoryodEccyxyOEAs SwJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=JL7cSmFEiYtsVCTU7fmtFqqTwzhbwsAl9I0/oTYsi9I=; b=X7aC2zSmc+fAQ2jwA3zxO7fYUEk73b34scUmaDhUvUq0QpRLJRTRI9Al5XZe1Sh/8j M5O4IkAyhqNYCStxHh6YwJIlBEbjLVv7tLh9rjw1oyatoQmsYi+adjQZE58joiI8XDPX QzDzGrbf5dHovmRO2bEhZhhtPqKIGn3l7XAeXCsMNOsjHozbm8K5rXqj2W+qoPbh3GyK kE+/gHH5E3Iqd7hTRIONZXCdzBoJiULHny4K7c0u6zlk7fHbb8uc16wmjS6cETymu1WY Xj6cH8T/wE4c8IPXyMr3MFXZXPwyjbf2WEtL09p5WJfhBHG5nT78Wd0Oi2ryB+iysdNN aF/A== X-Gm-Message-State: ANoB5plVaxBVQ+dDN0+P0K7+4bvIPW6kwM83qdGwRkilIwe8hgSC84Fw kFjiKhUGgqRELoyHOc+23VTYO6PeZuPuyRubURDNbw== X-Google-Smtp-Source: AA0mqf7XcNjJEgbYavgVfYk4ZW3z7Y+39XY4q5ceurlaJTAT0OB2S728p9AmtspqhSbS4SLPj9AtIK2/pnXycAJUbxk= X-Received: by 2002:a05:600c:2211:b0:3cf:6c05:b4d5 with SMTP id z17-20020a05600c221100b003cf6c05b4d5mr28989wml.120.1670976311306; Tue, 13 Dec 2022 16:05:11 -0800 (PST) MIME-Version: 1.0 References: <20221021163703.3218176-1-jthoughton@google.com> <20221021163703.3218176-12-jthoughton@google.com> In-Reply-To: From: James Houghton Date: Tue, 13 Dec 2022 19:04:58 -0500 Message-ID: Subject: Re: [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc To: Mike Kravetz Cc: Muchun Song , Peter Xu , David Hildenbrand , David Rientjes , Axel Rasmussen , Mina Almasry , "Zach O'Keefe" , Manish Mishra , Naoya Horiguchi , "Dr . David Alan Gilbert" , "Matthew Wilcox (Oracle)" , Vlastimil Babka , Baolin Wang , Miaohe Lin , Yang Shi , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: E796312000D X-Stat-Signature: 5oise5nix33yudpdxyof8f76fumayk3g X-Rspam-User: X-HE-Tag: 1670976312-562650 X-HE-Meta: U2FsdGVkX197uAcl0acWdqO6cQEmjSMTvnKnNMfP7ZUGoIIbyN0lmltSpzNi1BUcDZNYROEHu0xEi3VDLdk6BBtUyK9gtAM59515aihEuRjXii8SiI+KsENc2E8Vshz4Pyut4ITPpkQ0ESbMssk+z3wgRfDaYEbSsGzFtDCb8REKam/Pcewfimadq4WI6SA97hjc5eEb+0NsKSuBQnTgQskqM5NIPW2/tD4kxC7OJNGLIHppjzl6wifTJQ38tKVr/1bsWh7HhEK4PZ4GNLfoXnGB8jySAApl8Pmo9Rc+Mjupz3XJcKT/g/wlL6M6Gaok94Y3T+osvFON1fvn2oeDIb5A3YBA/m2jd7kpmPTGhcpZyn/vr/H9h94kq9Vwq+HRkm0ZJR7x77DYM1lAKAfYMJhOftm5TL8RWv/XITPc+bp65sPYoNC4pt+IKAQ96YgxnD1g2eBXzmWQRFpddNveMDaa8XsJTs2oWsgykOfg8AS7MrwRun1GOkT5kGrZcpkp6k1vYuv/tZVQ552LwmBfKirsm+4rZLVzsBaE7ZIxwA2z8QLHU7dydEVbFACuYezE5zLpAv4VGFLkOccEsaTVpqisw6fF85W8ftVXIOOLxTf/1dXdO9uzOvwJAnXx9LZnXzOZBPKzzVZxIi/Y63ckRXZHrBnal3+D21MpiSPOx8kT4yF9xrOaLV2yxEkE99L4WVjzhNTBrTMN+cu4u3kaW5YcaoDgi0wU+5RwGRFyipRCHbLYIQpVspIyOlBlohNv1GR2XftGDFjHTIa6yR7hGZ3/nj3OSn5asC/LA0FNnUmjS8yZXyi2N+8E6Xa5SKjUloAvusfkidAECNTuJRrB46oZ6hPqPapR6UMddeRtjzOTjoOszXPPl/WygnJM8a8dLuXmJwSIhog8h63B7aBS23teDfgsLsxU4R862bQJHu/wjbBzkRbV7MGv23/W96u3XQphVSb3Qmr3CVZ+r// C5FULxWk objkfJBuVRPWHtKjUbbpFieIXLSUmLHlq32XoEc3MW1r8yuU4hCrUyb/gNnMIUcTeNN9P X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Dec 13, 2022 at 3:18 PM James Houghton wrote: > > On Tue, Dec 13, 2022 at 2:32 PM Mike Kravetz wrote: > > > > On 10/21/22 16:36, James Houghton wrote: > > > These functions are used to allocate new PTEs below the hstate PTE. This > > > will be used by hugetlb_walk_step, which implements stepping forwards in > > > a HugeTLB high-granularity page table walk. > > > > > > The reasons that we don't use the standard pmd_alloc/pte_alloc* > > > functions are: > > > 1) This prevents us from accidentally overwriting swap entries or > > > attempting to use swap entries as present non-leaf PTEs (see > > > pmd_alloc(); we assume that !pte_none means pte_present and > > > non-leaf). > > > 2) Locking hugetlb PTEs can different than regular PTEs. (Although, as > > > implemented right now, locking is the same.) > > > 3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB > > > HGM won't use HIGHPTE, but the kernel can still be built with it, > > > and other mm code will use it. > > > > > > When GENERAL_HUGETLB supports P4D-based hugepages, we will need to > > > implement hugetlb_pud_alloc to implement hugetlb_walk_step. > > > > > > Signed-off-by: James Houghton > > > --- > > > include/linux/hugetlb.h | 5 +++ > > > mm/hugetlb.c | 94 +++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 99 insertions(+) > > > > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > > index d30322108b34..003255b0e40f 100644 > > > --- a/include/linux/hugetlb.h > > > +++ b/include/linux/hugetlb.h > > > @@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src) > > > > > > bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte); > > > > > > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte, > > > + unsigned long addr); > > > +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte, > > > + unsigned long addr); > > > + > > > struct hugepage_subpool { > > > spinlock_t lock; > > > long count; > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index a0e46d35dabc..e3733388adee 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg, > > > #endif > > > } > > > > > > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte, > > > + unsigned long addr) > > > > A little confused as there are no users yet ... Is hpte the 'hstate PTE' > > that we are trying to allocate ptes under? For example, in the case of > > a hugetlb_pmd_alloc caller hpte would be a PUD or CONT_PMD size pte? > > The hpte is the level above the level we're trying to allocate (not > necessarily the 'hstate PTE'). I'll make that clear in the comments > for both functions. > > So consider allocating 4K PTEs for a 1G HugeTLB page: > - With the hstate 'PTE' (PUD), we make a hugetlb_pte with that PUD > (let's call it 'hpte') > - We call hugetlb_pmd_alloc(hpte) which will leave 'hpte' the same, > but the pud_t that hpte->ptep points to is no longer a leaf. > - We call hugetlb_walk_step(hpte) to step down a level to get a PMD, > changing hpte. The hpte->ptep is now pointing to a blank pmd_t. > - We call hugetlb_pte_alloc(hpte) to allocate a bunch of PTEs and > populate the pmd_t. > - We call hugetlb_walk_step(hpte) to step down again. Erm actually this isn't entirely accurate. The general flow is about right, but hugetlb_pmd_alloc/hugetlb_pte_alloc are actually part of hugetlb_walk_step. (See hugetlb_hgm_walk for the ground truth :P) - James > > This is basically what hugetlb_hgm_walk does (in the next patch). We > only change 'hpte' when we do a step, and that is when we populate > 'shift'. The 'sz' parameter for hugetlb_walk_step is what > architectures can use to populate hpte->shift appropriately (ignored > for x86). > > For arm64, we can use 'sz' to populate hpte->shift with what the > caller wants when we are free to choose (like if all the PTEs are > none, we can do CONT_PTE_SHIFT). See [1]'s implementation of > hugetlb_walk_step for what I *think* is correct for arm64. > > [1] https://github.com/48ca/linux/commit/bf3b8742e95c58c2431c80c5bed5cb5cb95885af > > > > > > +{ > > > + spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte); > > > + pmd_t *new; > > > + pud_t *pudp; > > > + pud_t pud; > > > + > > > + if (hpte->level != HUGETLB_LEVEL_PUD) > > > + return ERR_PTR(-EINVAL); > > > > Ah yes, it is PUD level. However, I guess CONT_PMD would also be valid > > on arm64? > > The level is always PGD, P4D, PUD, PMD, or PTE. CONT_PTE is on > HUGETLB_LEVEL_PTE, CONT_PMD is on HUGETLB_LEVEL_PMD. > > These functions are supposed to be used for all architectures (in > their implementations of 'hugetlb_walk_step'; that's why they're not > static, actually. I'll make that clear in the commit description). > > > > > > + > > > + pudp = (pud_t *)hpte->ptep; > > > +retry: > > > + pud = *pudp; > > > > We might want to consider a READ_ONCE here. I am not an expert on such > > things, but recall a similar as pointed out in the now obsolete commit > > 27ceae9833843. > > Agreed. Will try to change all PTE reading to use READ_ONCE, though > they can be easy to miss... :( > > Thanks very much for the reviews so far, Mike! > > - James > > > > > -- > > Mike Kravetz > >