From: Muchun Song
Date: Tue, 10 Nov 2020 14:40:54 +0800
Subject: Re: [External] Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
To: Oscar Salvador
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com,
    bp@alien8.de, x86@kernel.org, hpa@zytor.com,
    dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra,
    viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
    Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com,
    jroedel@suse.de, Mina Almasry, David Rientjes, Matthew Wilcox,
    Michal Hocko, Xiongchun duan, linux-doc@vger.kernel.org, LKML,
    Linux Memory Management List, linux-fsdevel
References: <20201108141113.65450-1-songmuchun@bytedance.com>
    <20201108141113.65450-10-songmuchun@bytedance.com>
    <20201109185138.GD17356@linux>
In-Reply-To: <20201109185138.GD17356@linux>

On Tue, Nov 10, 2020 at 2:51 AM Oscar Salvador wrote:
>
> On Sun, Nov 08, 2020 at 10:11:01PM +0800, Muchun Song wrote:
> > +static inline int freed_vmemmap_hpage(struct page *page)
> > +{
> > +	return atomic_read(&page->_mapcount) + 1;
> > +}
> > +
> > +static inline int freed_vmemmap_hpage_inc(struct page *page)
> > +{
> > +	return atomic_inc_return_relaxed(&page->_mapcount) + 1;
> > +}
> > +
> > +static inline int freed_vmemmap_hpage_dec(struct page *page)
> > +{
> > +	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
> > +}
>
> Are these relaxed variants any different from the normal ones on x86_64?
> I got confused following the macros.

A PTE page table can map the struct page structures of 64 2MB HugeTLB
pages: on x86_64 each 2MB HugeTLB page needs 512 struct pages, which is
8 pages of vmemmap (at 64 bytes per struct page), and one PTE table maps
512 vmemmap pages, so 512 / 8 = 64. I use freed_vmemmap_hpage to count
how many of those HugeTLB pages have had their vmemmap pages freed to
the buddy allocator. Once the vmemmap pages of a HugeTLB page are freed,
we call freed_vmemmap_hpage_inc; when a HugeTLB page is freed back to
the buddy allocator, we call freed_vmemmap_hpage_dec. If
freed_vmemmap_hpage drops to zero when freeing a HugeTLB page, we try to
merge the PTE table back into a huge PMD (for now only gigantic pages
are supported). See [PATCH v3 19/21] mm/hugetlb: Merge pte to huge pmd
only for gigantic page.
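
To make the bookkeeping concrete, the intended lifecycle is roughly the
following (an illustrative sketch of mine, not code from the patch; the
merge step itself lands in patch 19/21). Note that page->_mapcount
starts at -1 on a freshly allocated page, so the "+ 1" bias in the
helpers makes the counter read 0 initially:

    pgtable_t pgtable = vmemmap_pgtable_withdraw(head);	/* PTE-table page */

    /* freed_vmemmap_hpage(pgtable) == 0: no HugeTLB page covered by
     * this PTE table has had its vmemmap freed yet. */

    /* After freeing the vmemmap pages of one HugeTLB page: */
    freed_vmemmap_hpage_inc(pgtable);		/* counter: 0 -> 1 */

    /* When that HugeTLB page goes back to the buddy allocator: */
    if (freed_vmemmap_hpage_dec(pgtable) == 0)	/* counter: 1 -> 0 */
        /* no user left: try to merge the PTE table back into a PMD */;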

Thanks.

>
> > +static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
> > +					 unsigned long start,
> > +					 unsigned int nr_free,
> > +					 struct list_head *free_pages)
> > +{
> > +	/* Make the tail pages mapped read-only. */
> > +	pgprot_t pgprot = PAGE_KERNEL_RO;
> > +	pte_t entry = mk_pte(reuse, pgprot);
> > +	unsigned long addr;
> > +	unsigned long end = start + (nr_free << PAGE_SHIFT);
>
> See below.

>
> > +static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
> > +					 unsigned long addr,
> > +					 struct list_head *free_pages)
> > +{
> > +	unsigned long next;
> > +	unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
> > +	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
> > +	struct page *reuse = NULL;
> > +
> > +	addr = start;
> > +	do {
> > +		unsigned int nr_pages;
> > +		pte_t *ptep;
> > +
> > +		ptep = pte_offset_kernel(pmd, addr);
> > +		if (!reuse)
> > +			reuse = pte_page(ptep[-1]);
>
> Can we define a proper name for that instead of -1?
>
> e.g.: TAIL_PAGE_REUSE or something like that.

OK, will do.

>
> > +
> > +		next = vmemmap_hpage_addr_end(addr, end);
> > +		nr_pages = (next - addr) >> PAGE_SHIFT;
> > +		__free_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
> > +					     free_pages);
>
> Why not pass next instead of nr_pages? I think it makes more sense.
> As a bonus we can kill the variable.

Good catch. We can pass next instead of nr_pages. Thanks.
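
Something like the following, I think (an untested sketch of that
change, keeping the names from v3):

    static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
    					 unsigned long start,
    					 unsigned long end,
    					 struct list_head *free_pages)
    {
    	/* Make the tail pages mapped read-only. */
    	pgprot_t pgprot = PAGE_KERNEL_RO;
    	pte_t entry = mk_pte(reuse, pgprot);
    	unsigned long addr;

    	/* "end" is now passed in directly instead of being derived
    	 * from nr_free, so the shift computation goes away here. */
    	...
    }

and the caller in __free_huge_page_pmd_vmemmap() becomes:

    	next = vmemmap_hpage_addr_end(addr, end);
    	__free_huge_page_pte_vmemmap(reuse, ptep, addr, next, free_pages);

with the nr_pages variable (and its declaration) dropped.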

>
> > +static void split_vmemmap_huge_page(struct hstate *h, struct page *head,
> > +				    pmd_t *pmd)
> > +{
> > +	pgtable_t pgtable;
> > +	unsigned long start = (unsigned long)head & VMEMMAP_HPAGE_MASK;
> > +	unsigned long addr = start;
> > +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +	while (nr-- && (pgtable = vmemmap_pgtable_withdraw(head))) {
>
> The same as with the previous patches: I would scrap "nr" and its use.
>
> > +		VM_BUG_ON(freed_vmemmap_hpage(pgtable));
>
> I guess here we want to check whether we already called free_huge_page_vmemmap
> on this range?
> For this to have happened, the locking should have failed, right?

Only the first HugeTLB page needs to split the PMD into a PTE table; the
other 63 HugeTLB pages whose struct pages are mapped by the same PMD do
not need to split it again. Here I want to make sure that we are the
first.

>
> > +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > +	pmd_t *pmd;
> > +	spinlock_t *ptl;
> > +	LIST_HEAD(free_pages);
> > +
> > +	if (!free_vmemmap_pages_per_hpage(h))
> > +		return;
> > +
> > +	pmd = vmemmap_to_pmd(head);
> > +	ptl = vmemmap_pmd_lock(pmd);
> > +	if (vmemmap_pmd_huge(pmd)) {
> > +		VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));
>
> I think that checking for free_vmemmap_pages_per_hpage is enough.
> In the end, pgtable_pages_to_prealloc_per_hpage uses
> free_vmemmap_pages_per_hpage.

Checking free_vmemmap_pages_per_hpage alone is not enough. See my
comments above about splitting the PMD. Thanks.

>
> --
> Oscar Salvador
> SUSE L3

--
Yours,
Muchun