Date: Thu, 4 Aug 2022 19:53:31 +0800
From: Muchun Song <songmuchun@bytedance.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: linux-mm@kvack.org, Mike Kravetz, Andrew Morton
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
In-Reply-To: <1579c011-eb6f-09ad-bf9b-5b3df3e1b182@oracle.com>

On Thu, Aug 04, 2022 at 10:23:48AM +0100, Joao Martins wrote:
> On 8/4/22 08:17, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
> >>
> >> On 8/3/22 11:44, Muchun Song wrote:
> >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >>>> On 8/3/22 05:11, Muchun Song wrote:
> >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that
> >>>>>> is freed back to the page allocator is as follows: for a 2M
> >>>>>> hugetlb page it will reuse the first 4K vmemmap page to remap the
> >>>>>> remaining 7 vmemmap pages, and for a 1G hugetlb page it will
> >>>>>> remap the remaining 4095 vmemmap pages. Essentially, that means
> >>>>>> it breaks the first 4K of a potentially contiguous chunk of
> >>>>>> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb
> >>>>>> pages). For this reason, the memory that is freed back to the
> >>>>>> page allocator cannot be used by hugetlb to allocate huge pages
> >>>>>> of the same size, but only huge pages of a smaller size:
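For reference, the arithmetic behind those page counts (a sketch assuming
the typical 64-byte struct page and 4K base pages; neither figure is
stated explicitly in the quoted description):

    /*
     * 2M hugetlb page = 512 base pages
     *   512 * sizeof(struct page) = 512 * 64 = 32K = 8 vmemmap pages
     *   -> reuse 1, remap and free the other 7
     *
     * 1G hugetlb page = 262144 base pages
     *   262144 * 64 = 16M = 4096 vmemmap pages
     *   -> reuse 1, remap and free the other 4095
     */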
> >>>>>
> >>>>> Hi Joao,
> >>>>>
> >>>>> Thanks for your work on this. I admit you are right. The current
> >>>>> mechanism prevents the freed vmemmap pages from being merged into
> >>>>> a potentially contiguous page. Allocating a new head page is a
> >>>>> straightforward approach; however, it is very dangerous at runtime
> >>>>> after the system has booted. Why dangerous? Because you must first
> >>>>> 1) copy the content from the head vmemmap page to the target
> >>>>> (newly allocated) page, and then 2) change the PTE entry to the
> >>>>> new page. However, the content (especially the refcount) of the
> >>>>> old head vmemmap page could be changed elsewhere (e.g. by other
> >>>>> modules) between steps 1) and 2), and then the newly allocated
> >>>>> vmemmap page ends up corrupted. Unfortunately, we don't have an
> >>>>> easy way to prevent that.
> >>>>>
> >>>> OK, I see what I missed. You mean the refcount (or any other data)
> >>>> on the struct pages preceding this head struct page, unrelated to
> >>>> the hugetlb page being remapped. Meaning when the first struct page
> >>>> in the old vmemmap page is *not* the head page we are trying to
> >>>> remap, right?
> >>>>
> >>>> See further below in your patch, but I wonder if we could actually
> >>>> check this from the hugetlb head va being aligned to PAGE_SIZE,
> >>>> meaning that we would be checking that this head page is the first
> >>>> struct page in the vmemmap page and that we are still safe to
> >>>> remap. If so, the patch could be simpler, more like mine, without
> >>>> the special freeing path you added below.
> >>>>
> >>>> If I'm right, see at the end.
> >>>
> >>> I am not sure we are on the same page (it seems that we are not,
> >>> after seeing your patch below).
> >>
> >> Even though I misunderstood you, it might still be a possible
> >> scenario.
> >>
> >>> So let me clarify.
> >>
> >> Thanks
> >>
> >>> CPU0:                                  CPU1:
> >>>
> >>> vmemmap_remap_free(start, end, reuse)
> >>>   // alloc a new page to replace the head vmemmap page
> >>>   page = alloc_pages_node();
> >>>   memcpy(page_address(page), reuse, PAGE_SIZE);
> >>>                                        // Now the @reuse address is
> >>>                                        // mapped to the original page
> >>>                                        // frame, so the change will
> >>>                                        // be reflected on the
> >>>                                        // original page frame.
> >>>                                        get_page(reuse);
> >>>   vmemmap_remap_pte();
> >>>     // remap to the newly allocated page above
> >>>     set_pte_at();
> >>>   flush_tlb_kernel_range();
> >>
> >> note-to-self: totally missed to change the flush_tlb_kernel_range()
> >> to include the full range.
> >
> > Right. I have noticed that as well.
> >
> >>>                                        // Now the @reuse address is
> >>>                                        // mapped to the newly
> >>>                                        // allocated page frame, so
> >>>                                        // the change will be
> >>>                                        // reflected on the new page
> >>>                                        // frame, which is thereby
> >>>                                        // corrupted.
> >>>                                        put_page(reuse);
> >>>
> >>> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on
> >>> CPU0; however, that is not easy.
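In code terms, the window sits between the copy and the PTE switch. A
minimal sketch of the remap step, loosely modelled on vmemmap_remap_pte()
in mm/hugetlb_vmemmap.c (the function name and the head_page field are
illustrative, not the kernel's actual API):

    /* Illustrative only: the range walk, locking and error handling of
     * the real code are omitted; head_page is a made-up field name. */
    static void remap_head_page(pte_t *pte, unsigned long addr,
                                struct vmemmap_remap_walk *walk)
    {
            struct page *new = walk->head_page;  /* preallocated target */

            /* step 1: copy the old head vmemmap page */
            memcpy(page_address(new), (void *)walk->reuse_addr, PAGE_SIZE);

            /*
             * Race window: a get_page()/put_page() through the old
             * mapping (e.g. GUP, memory failure) landing here updates
             * the old page frame, and that update is lost once the PTE
             * below points at the copy.
             */

            /* step 2: switch the kernel mapping over to the copy */
            set_pte_at(&init_mm, addr, pte, mk_pte(new, PAGE_KERNEL));

            /* step 3: flush_tlb_kernel_range() runs after the walk and,
             * per the note above, must cover the full remapped range. */
    }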
> >> OK, I understand what you mean now. However, I am trying to
> >> understand whether this race is possible. Note that, given your
> >> previous answer above, I am assuming in your race scenario that the
> >> vmemmap page only ever stores metadata (struct page) related to the
> >> hugetlb page currently being remapped. If this assumption is wrong,
> >> then your race would be possible (but it wouldn't be from a
> >> get_page() on the reuse_addr).
> >>
> >> So, how would we get into doing a get_page() on the head page that
> >> we are remapping (and its put_page() for that matter) from somewhere
> >> else ... considering we are at prep_new_huge_page() when we call
> >> vmemmap_remap_free(), and hence ... we already got it from the page
> >> allocator ... but hugetlb hasn't yet finished initializing the head
> >> page metadata, so it isn't yet accounted for someone to grab, e.g.
> >> in demote/page-fault/migration/etc?
> >>
> >
> > As I know, there are at least two places which could get the
> > refcount: 1) GUP and 2) memory failure.
> >
> So I am aware of the means to grab the page refcount. It's how we get
> into such a situation in the first place that I wasn't sure about:
>
> > For 1), you can refer to commit
> > 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
>
> Good pointer. This one sheds some light too:
>
> https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
>
> I wonder how we get into a situation of doing GUP on a user VA
> referring to a just-allocated *hugetlb* page from buddy (not yet in
> the hugetlb free lists), even if only temporarily. Unless the page was
> still sitting in page tables that were about to be torn down. Maybe it
> is specific to gigantic pages. Hmm
>

It is not specific to gigantic pages. I think the link above gives a
good description of how this could happen to a non-gigantic page as
well.

Thanks.

> > For 2), I think you can refer to the function __get_unpoison_page().
>
> Ah, yes. Good point.
>
> > Both places could grab an extra refcount on the struct page of the
> > HugeTLB page being processed.
>
> Thanks for the pointers ;)
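For completeness, both of those refcount sources rely on the same
speculative pattern: the page is found by PFN rather than through a
reference handed out by its owner. A sketch of that pattern (inferred
from the general shape of GUP-fast and __get_unpoison_page(), not copied
from either):

    /* Speculative pin: the page was located via pfn_to_page(), so its
     * "owner" never handed out this reference. Sketch only. */
    struct page *page = pfn_to_page(pfn);

    if (!get_page_unless_zero(page))
            return NULL;    /* refcount was zero: page is being freed */

    /*
     * Success only means the refcount was non-zero -- true for a
     * hugetlb page fresh out of the buddy allocator, even before
     * prep_new_huge_page() has finished with it. That is how an extra
     * reference can appear "out of nowhere" during the vmemmap remap.
     */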