From: Muchun Song <songmuchun@bytedance.com>
Date: Wed, 6 Sep 2023 17:11:08 +0800
Subject: Re: [External] Re: [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
To: Mike Kravetz, Joao Martins
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Oscar Salvador,
 David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
 Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
 Andrew Morton, muchun.song@linux.dev
In-Reply-To: <0b0609d8-bc87-0463-bafd-9613f0053039@linux.dev>
References: <20230905214412.89152-1-mike.kravetz@oracle.com>
 <20230905214412.89152-10-mike.kravetz@oracle.com>
 <0b0609d8-bc87-0463-bafd-9613f0053039@linux.dev>

On Wed, Sep 6, 2023 at 4:25 PM Muchun Song wrote:
>
>
>
> On 2023/9/6 05:44, Mike Kravetz wrote:
> > From: Joao Martins
> >
> > In an effort to minimize the number of TLB flushes, batch all PMD
> > splits belonging to a range of pages in order to perform only one
> > (global) TLB flush.
> >
> > Rebased and updated by Mike Kravetz
> >
> > Signed-off-by: Joao Martins
> > Signed-off-by: Mike Kravetz
> > ---
> >   mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
> >   1 file changed, 68 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index a715712df831..d956551699bc 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
> >   	struct list_head *vmemmap_pages;
> >   };
> >
> > -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> > +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
> >   {
> >   	pmd_t __pmd;
> >   	int i;
> > @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> >   		/* Make pte visible before pmd. See comment in pmd_install(). */
> >   		smp_wmb();
> >   		pmd_populate_kernel(&init_mm, pmd, pgtable);
> > -		flush_tlb_kernel_range(start, start + PMD_SIZE);
> > +		if (flush)
> > +			flush_tlb_kernel_range(start, start + PMD_SIZE);
> >   	} else {
> >   		pte_free_kernel(&init_mm, pgtable);
> >   	}
> > @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> >   	do {
> >   		int ret;
> >
> > -		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> > +		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
> > +					     walk->remap_pte != NULL);
>
> It is better to only make @walk->remap_pte indicate whether we should
> go down to the last page table level. I suggest reusing
> VMEMMAP_NO_TLB_FLUSH to indicate whether we should flush the TLB at
> the PMD level. That would be clearer.
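
To illustrate what I mean, an untested sketch (it assumes a @flags
field in struct vmemmap_remap_walk carrying a VMEMMAP_NO_TLB_FLUSH
bit; the exact names here are only illustrative):

	static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
				     unsigned long end,
				     struct vmemmap_remap_walk *walk)
	{
		...
		do {
			int ret;

			/*
			 * @walk->remap_pte only decides whether we descend
			 * to the PTE level; whether the split flushes the
			 * TLB per PMD is decided by the flag alone.
			 */
			ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
					!(walk->flags & VMEMMAP_NO_TLB_FLUSH));
			...

A split-only walk would then set both .remap_pte = NULL and
VMEMMAP_NO_TLB_FLUSH, keeping the two decisions independent.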
>
> >   		if (ret)
> >   			return ret;
> >
> >   		next = pmd_addr_end(addr, end);
> > +
> > +		/*
> > +		 * We are only splitting, not remapping the hugetlb vmemmap
> > +		 * pages.
> > +		 */
> > +		if (!walk->remap_pte)
> > +			continue;
> > +
> >   		vmemmap_pte_range(pmd, addr, next, walk);
> >   	} while (pmd++, addr = next, addr != end);
> >
> > @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> >   			return ret;
> >   	} while (pgd++, addr = next, addr != end);
> >
> > -	flush_tlb_kernel_range(start, end);
> > +	if (walk->remap_pte)
> > +		flush_tlb_kernel_range(start, end);
> >
> >   	return 0;
> >   }
> > @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >   	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> >   }
> >
> > +/**
> > + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> > + *                      backing PMDs of the directmap into PTEs
> > + * @start:	start address of the vmemmap virtual address range that we want
> > + *		to remap.
> > + * @end:	end address of the vmemmap virtual address range that we want to
> > + *		remap.
> > + * @reuse:	reuse address.
> > + *
> > + * Return: %0 on success, negative error code otherwise.
> > + */
> > +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> > +			       unsigned long reuse)
> > +{
> > +	int ret;
> > +	struct vmemmap_remap_walk walk = {
> > +		.remap_pte	= NULL,
> > +	};
> > +
> > +	/* See the comment in the vmemmap_remap_free(). */
> > +	BUG_ON(start - reuse != PAGE_SIZE);
> > +
> > +	mmap_read_lock(&init_mm);
> > +	ret = vmemmap_remap_range(reuse, end, &walk);
> > +	mmap_read_unlock(&init_mm);
> > +
> > +	return ret;
> > +}
> > +
> >   /**
> >    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> >    *                      to the page which @reuse is mapped to, then free vmemmap
> > @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >   	free_vmemmap_page_list(&vmemmap_pages);
> >   }
> >
> > +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> > +{
> > +	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> > +	unsigned long vmemmap_reuse;
> > +
> > +	if (!vmemmap_should_optimize(h, head))
> > +		return;
> > +
> > +	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> > +	vmemmap_reuse	= vmemmap_start;
> > +	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
> > +
> > +	/*
> > +	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
> > +	 * @vmemmap_end]
> > +	 */
> > +	vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> > +}
> > +
> >   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> >   {
> >   	struct folio *folio;
> >   	LIST_HEAD(vmemmap_pages);
> >
> > +	list_for_each_entry(folio, folio_list, lru)
> > +		hugetlb_vmemmap_split(h, &folio->page);
>
> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
> to indicate whether it succeeded. If it fails, it must be OOM, in which
> case there is no sense in continuing to split the page tables and
> optimize the vmemmap pages subsequently, right?

Sorry, correcting myself: it is reasonable to continue optimizing the
vmemmap pages even after a split failure, since optimizing the folios
whose vmemmap has already been split should still succeed. In fact, we
should keep optimizing once hugetlb_vmemmap_split() fails, because each
successful optimization frees vmemmap pages and so gives us more memory
to continue splitting with. But that will make
hugetlb_vmemmap_optimize_folios() a little more complex (a rough sketch
of the simpler shape is at the bottom of this mail). I'd like to hear
you guys' opinions here.

Thanks.

>
> Thanks.
>
> > +
> > +	flush_tlb_all();
> > +
> >   	list_for_each_entry(folio, folio_list, lru)
> >   		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
> >
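
For reference, an untested sketch of the simple direction (stop
splitting on the first failure, but still optimize everything that was
already split; the int return of hugetlb_vmemmap_split() is assumed, as
is the final free of the batched pages):

	void hugetlb_vmemmap_optimize_folios(struct hstate *h,
					     struct list_head *folio_list)
	{
		struct folio *folio;
		LIST_HEAD(vmemmap_pages);

		list_for_each_entry(folio, folio_list, lru) {
			/*
			 * A failure can only be -ENOMEM, so stop splitting;
			 * keep going below, though: optimizing the folios
			 * that were already split frees vmemmap pages and
			 * may give the allocator headroom again.
			 */
			if (hugetlb_vmemmap_split(h, &folio->page))
				break;
		}

		/* One global flush covers all the batched PMD splits. */
		flush_tlb_all();

		list_for_each_entry(folio, folio_list, lru)
			__hugetlb_vmemmap_optimize(h, &folio->page,
						   &vmemmap_pages);

		free_vmemmap_page_list(&vmemmap_pages);
	}

The "more complex" variant would go back and retry the remaining splits
after the optimize pass has freed memory; the above only shows the
minimal shape.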