Date: Thu, 4 Aug 2022 19:53:31 +0800
From: Muchun Song <songmuchun@bytedance.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: linux-mm@kvack.org, Mike Kravetz, Andrew Morton
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
In-Reply-To: <1579c011-eb6f-09ad-bf9b-5b3df3e1b182@oracle.com>

On Thu, Aug 04, 2022 at 10:23:48AM +0100, Joao Martins wrote:
> On 8/4/22 08:17, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
> >>
> >> On 8/3/22 11:44, Muchun Song wrote:
> >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >>>> On 8/3/22 05:11, Muchun Song wrote:
> >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that
> >>>>>> is freed back to the page allocator is as follows: for a 2M
> >>>>>> hugetlb page it will reuse the first 4K vmemmap page to remap the
> >>>>>> remaining 7 vmemmap pages, and for a 1G hugetlb page it will
> >>>>>> remap the remaining 4095 vmemmap pages. Essentially, that means
> >>>>>> it breaks the first 4K of a potentially contiguous chunk of
> >>>>>> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb
> >>>>>> pages). For this reason, the memory that is freed back to the
> >>>>>> page allocator cannot be used by hugetlb to allocate huge pages
> >>>>>> of the same size, but only huge pages of a smaller size:
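For reference, the arithmetic behind those page counts (a sketch assuming
the typical 64-byte struct page and 4K base pages; neither figure is
stated explicitly in the quoted description):

    /*
     * 2M hugetlb page = 512 base pages
     *   512 * sizeof(struct page) = 512 * 64 = 32K = 8 vmemmap pages
     *   -> reuse 1, remap and free the other 7
     *
     * 1G hugetlb page = 262144 base pages
     *   262144 * 64 = 16M = 4096 vmemmap pages
     *   -> reuse 1, remap and free the other 4095
     */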
> >>>>>
> >>>>> Hi Joao,
> >>>>>
> >>>>> Thanks for your work on this. I admit you are right. The current
> >>>>> mechanism prevents the freed vmemmap pages from being merged into
> >>>>> a potentially contiguous page. Allocating a new head page is a
> >>>>> straightforward approach; however, it is very dangerous at runtime
> >>>>> after the system has booted. Why dangerous? Because you must first
> >>>>> 1) copy the content from the head vmemmap page to the target
> >>>>> (newly allocated) page, and then 2) change the PTE entry to the
> >>>>> new page. However, the content (especially the refcount) of the
> >>>>> old head vmemmap page could be changed elsewhere (e.g. by other
> >>>>> modules) between steps 1) and 2), and then the newly allocated
> >>>>> vmemmap page ends up corrupted. Unfortunately, we don't have an
> >>>>> easy way to prevent that.
> >>>>>
> >>>> OK, I see what I missed. You mean the refcount (or any other data)
> >>>> on the struct pages preceding this head struct page, unrelated to
> >>>> the hugetlb page being remapped. Meaning when the first struct page
> >>>> in the old vmemmap page is *not* the head page we are trying to
> >>>> remap, right?
> >>>>
> >>>> See further below in your patch, but I wonder if we could actually
> >>>> check this from the hugetlb head va being aligned to PAGE_SIZE,
> >>>> meaning that we would be checking that this head page is the first
> >>>> struct page in the vmemmap page and that we are still safe to
> >>>> remap. If so, the patch could be simpler, more like mine, without
> >>>> the special freeing path you added below.
> >>>>
> >>>> If I'm right, see at the end.
> >>>
> >>> I am not sure we are on the same page (it seems that we are not,
> >>> after seeing your patch below).
> >>
> >> Even though I misunderstood you, it might still be a possible
> >> scenario.
> >>
> >>> So let me clarify.
> >>
> >> Thanks
> >>
> >>> CPU0:                                  CPU1:
> >>>
> >>> vmemmap_remap_free(start, end, reuse)
> >>>   // alloc a new page to replace the head vmemmap page
> >>>   page = alloc_pages_node();
> >>>   memcpy(page_address(page), reuse, PAGE_SIZE);
> >>>                                        // Now the @reuse address is
> >>>                                        // mapped to the original page
> >>>                                        // frame, so the change will
> >>>                                        // be reflected on the
> >>>                                        // original page frame.
> >>>                                        get_page(reuse);
> >>>   vmemmap_remap_pte();
> >>>     // remap to the newly allocated page above
> >>>     set_pte_at();
> >>>   flush_tlb_kernel_range();
> >>
> >> note-to-self: totally missed to change the flush_tlb_kernel_range()
> >> to include the full range.
> >
> > Right. I have noticed that as well.
> >
> >>>                                        // Now the @reuse address is
> >>>                                        // mapped to the newly
> >>>                                        // allocated page frame, so
> >>>                                        // the change will be
> >>>                                        // reflected on the new page
> >>>                                        // frame, which is thereby
> >>>                                        // corrupted.
> >>>                                        put_page(reuse);
> >>>
> >>> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on
> >>> CPU0; however, that is not easy.
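In code terms, the window sits between the copy and the PTE switch. A
minimal sketch of the remap step, loosely modelled on vmemmap_remap_pte()
in mm/hugetlb_vmemmap.c (the function name and the head_page field are
illustrative, not the kernel's actual API):

    /* Illustrative only: the range walk, locking and error handling of
     * the real code are omitted; head_page is a made-up field name. */
    static void remap_head_page(pte_t *pte, unsigned long addr,
                                struct vmemmap_remap_walk *walk)
    {
            struct page *new = walk->head_page;  /* preallocated target */

            /* step 1: copy the old head vmemmap page */
            memcpy(page_address(new), (void *)walk->reuse_addr, PAGE_SIZE);

            /*
             * Race window: a get_page()/put_page() through the old
             * mapping (e.g. GUP, memory failure) landing here updates
             * the old page frame, and that update is lost once the PTE
             * below points at the copy.
             */

            /* step 2: switch the kernel mapping over to the copy */
            set_pte_at(&init_mm, addr, pte, mk_pte(new, PAGE_KERNEL));

            /* step 3: flush_tlb_kernel_range() runs after the walk and,
             * per the note above, must cover the full remapped range. */
    }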
> >> OK, I understand what you mean now. However, I am trying to
> >> understand whether this race is possible. Note that, given your
> >> previous answer above, I am assuming in your race scenario that the
> >> vmemmap page only ever stores metadata (struct page) related to the
> >> hugetlb page currently being remapped. If this assumption is wrong,
> >> then your race would be possible (but it wouldn't be from a
> >> get_page() on the reuse_addr).
> >>
> >> So, how would we get into doing a get_page() on the head page that
> >> we are remapping (and its put_page() for that matter) from somewhere
> >> else ... considering we are at prep_new_huge_page() when we call
> >> vmemmap_remap_free(), and hence ... we already got it from the page
> >> allocator ... but hugetlb hasn't yet finished initializing the head
> >> page metadata, so it isn't yet accounted for someone to grab, e.g.
> >> in demote/page-fault/migration/etc?
> >>
> >
> > As I know, there are at least two places which could get the
> > refcount: 1) GUP and 2) memory failure.
> >
> So I am aware of the means to grab the page refcount. It's how we get
> into such a situation in the first place that I wasn't sure about:
>
> > For 1), you can refer to commit
> > 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
>
> Good pointer. This one sheds some light too:
>
> https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
>
> I wonder how we get into a situation of doing GUP on a user VA
> referring to a just-allocated *hugetlb* page from buddy (not yet in
> the hugetlb free lists), even if only temporarily. Unless the page was
> still sitting in page tables that were about to be torn down. Maybe it
> is specific to gigantic pages. Hmm
>

It is not specific to gigantic pages. I think the link above gives a
good description of how this could happen to a non-gigantic page as
well.

Thanks.

> > For 2), I think you can refer to the function __get_unpoison_page().
>
> Ah, yes. Good point.
>
> > Both places could grab an extra refcount on the struct page of the
> > HugeTLB page being processed.
>
> Thanks for the pointers ;)
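For completeness, both of those refcount sources rely on the same
speculative pattern: the page is found by PFN rather than through a
reference handed out by its owner. A sketch of that pattern (inferred
from the general shape of GUP-fast and __get_unpoison_page(), not copied
from either):

    /* Speculative pin: the page was located via pfn_to_page(), so its
     * "owner" never handed out this reference. Sketch only. */
    struct page *page = pfn_to_page(pfn);

    if (!get_page_unless_zero(page))
            return NULL;    /* refcount was zero: page is being freed */

    /*
     * Success only means the refcount was non-zero -- true for a
     * hugetlb page fresh out of the buddy allocator, even before
     * prep_new_huge_page() has finished with it. That is how an extra
     * reference can appear "out of nowhere" during the vmemmap remap.
     */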