From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0CAA8C433F5 for ; Sun, 8 May 2022 09:19:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7B07A6B0072; Sun, 8 May 2022 05:19:11 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7610D6B007B; Sun, 8 May 2022 05:19:11 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6272F6B007D; Sun, 8 May 2022 05:19:11 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 53A1F6B0072 for ; Sun, 8 May 2022 05:19:11 -0400 (EDT) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 21A2520BDC for ; Sun, 8 May 2022 09:19:11 +0000 (UTC) X-FDA: 79442026902.07.121C33E Received: from out30-45.freemail.mail.aliyun.com (out30-45.freemail.mail.aliyun.com [115.124.30.45]) by imf25.hostedemail.com (Postfix) with ESMTP id BFE11A0093 for ; Sun, 8 May 2022 09:18:53 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01424;MF=baolin.wang@linux.alibaba.com;NM=1;PH=DS;RN=30;SR=0;TI=SMTPD_---0VCZtzMG_1652001541; Received: from 30.15.195.77(mailfrom:baolin.wang@linux.alibaba.com fp:SMTPD_---0VCZtzMG_1652001541) by smtp.aliyun-inc.com(127.0.0.1); Sun, 08 May 2022 17:19:03 +0800 Message-ID: <1fad03a6-98cf-1b0e-e012-82dc6466c7d2@linux.alibaba.com> Date: Sun, 8 May 2022 17:19:43 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1 Subject: Re: [PATCH 2/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when migration From: Baolin Wang To: Mike Kravetz , akpm@linux-foundation.org, catalin.marinas@arm.com, will@kernel.org Cc: tsbogend@alpha.franken.de, James.Bottomley@HansenPartnership.com, deller@gmx.de, mpe@ellerman.id.au, benh@kernel.crashing.org, paulus@samba.org, hca@linux.ibm.com, gor@linux.ibm.com, agordeev@linux.ibm.com, borntraeger@linux.ibm.com, svens@linux.ibm.com, ysato@users.sourceforge.jp, dalias@libc.org, davem@davemloft.net, arnd@arndb.de, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-sh@vger.kernel.org, sparclinux@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org References: <11b92502b3df0e0bba6a1dc71476d79cab6c79ba.1651216964.git.baolin.wang@linux.alibaba.com> <5cab0eca-9630-a7c6-4f5d-5cb45ff82c83@oracle.com> <21b11024-e893-8c11-9b98-ab1d13413b61@linux.alibaba.com> <85bd80b4-b4fd-0d3f-a2e5-149559f2f387@oracle.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Stat-Signature: nwoeha938dcf8ph91idrmtgae6hy137e X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: BFE11A0093 Authentication-Results: imf25.hostedemail.com; dkim=none; spf=pass (imf25.hostedemail.com: domain of baolin.wang@linux.alibaba.com designates 115.124.30.45 as permitted sender) smtp.mailfrom=baolin.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com X-Rspam-User: X-HE-Tag: 1652001533-691937 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 5/7/2022 10:33 AM, Baolin Wang wrote: > > > On 5/7/2022 1:56 AM, Mike Kravetz wrote: >> On 5/5/22 20:39, Baolin Wang wrote: >>> >>> On 5/6/2022 7:53 AM, Mike Kravetz wrote: >>>> On 4/29/22 01:14, Baolin Wang wrote: >>>>> On some architectures (like ARM64), it can support CONT-PTE/PMD size >>>>> hugetlb, which means it can support not only PMD/PUD size hugetlb: >>>>> 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page >>>>> size specified. >>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 6fdd198..7cf2408 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -1924,13 +1924,15 @@ static bool try_to_migrate_one(struct folio >>>>> *folio, struct vm_area_struct *vma, >>>>>                        break; >>>>>                    } >>>>>                } >>>>> + >>>>> +            /* Nuke the hugetlb page table entry */ >>>>> +            pteval = huge_ptep_clear_flush(vma, address, pvmw.pte); >>>>>            } else { >>>>>                flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); >>>>> +            /* Nuke the page table entry. */ >>>>> +            pteval = ptep_clear_flush(vma, address, pvmw.pte); >>>>>            } >>>> >>>> On arm64 with CONT-PTE/PMD the returned pteval will have dirty or >>>> young set >>>> if ANY of the PTE/PMDs had dirty or young set. >>> >>> Right. >>> >>>> >>>>> -        /* Nuke the page table entry. */ >>>>> -        pteval = ptep_clear_flush(vma, address, pvmw.pte); >>>>> - >>>>>            /* Set the dirty flag on the folio now the pte is gone. */ >>>>>            if (pte_dirty(pteval)) >>>>>                folio_mark_dirty(folio); >>>>> @@ -2015,7 +2017,10 @@ static bool try_to_migrate_one(struct folio >>>>> *folio, struct vm_area_struct *vma, >>>>>                pte_t swp_pte; >>>>>                  if (arch_unmap_one(mm, vma, address, pteval) < 0) { >>>>> -                set_pte_at(mm, address, pvmw.pte, pteval); >>>>> +                if (folio_test_hugetlb(folio)) >>>>> +                    set_huge_pte_at(mm, address, pvmw.pte, pteval); >>>> >>>> And, we will use that pteval for ALL the PTE/PMDs here.  So, we >>>> would set >>>> the dirty or young bit in ALL PTE/PMDs. >>>> >>>> Could that cause any issues?  May be more of a question for the >>>> arm64 people. >>> >>> I don't think this will cause any issues. Since the hugetlb can not >>> be split, and we should not lose the the dirty or young state if any >>> subpages were set. Meanwhile we already did like this in hugetlb.c: >>> >>> pte = huge_ptep_get_and_clear(mm, address, ptep); >>> tlb_remove_huge_tlb_entry(h, tlb, ptep, address); >>> if (huge_pte_dirty(pte)) >>>      set_page_dirty(page); >>> >> >> Agree that it 'should not' cause issues.  It just seems inconsistent. >> This is not a problem specifically with your patch, just the handling of >> CONT-PTE/PMD entries. >> >> There does not appear to be an arm64 specific version of huge_ptep_get() >> that takes CONT-PTE/PMD into account.  So, huge_ptep_get() would only >> return the one specific value.  It would not take into account the dirty >> or young bits of CONT-PTE/PMDs like your new version of >> huge_ptep_get_and_clear.  Is that correct?  Or, am I missing something. > > Yes, you are right. > >> >> If I am correct, then code like the following may not work: >> >> static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, >>                  unsigned long addr, unsigned long end, struct mm_walk >> *walk) >> { >>          pte_t huge_pte = huge_ptep_get(pte); >>          struct numa_maps *md; >>          struct page *page; >> >>          if (!pte_present(huge_pte)) >>                  return 0; >> >>          page = pte_page(huge_pte); >> >>          md = walk->private; >>          gather_stats(page, md, pte_dirty(huge_pte), 1); >>          return 0; >> } > > Right, this is inconsistent with current huge_ptep_get() interface like > you said. So I think we can define an ARCH-specific huge_ptep_get() > interface for arm64, and some sample code like below. How do you think? After some investigation, I send out a RFC patch set[1] to address this issue. We can talk about this issue in that thread. Thanks. [1] https://lore.kernel.org/all/cover.1651998586.git.baolin.wang@linux.alibaba.com/