From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <0d502268-ebdc-8462-d88c-e6a41578d9ae@redhat.com>
Date: Fri, 4 Aug 2023 23:30:49 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.13.0
Subject: Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
To: Yu Zhao <yuzhao@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 Matthew Wilcox <willy@infradead.org>, Yin Fengwei <fengwei.yin@intel.com>,
 Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>,
 Anshuman Khandual <anshuman.khandual@arm.com>, Yang Shi
 <shy828301@gmail.com>, "Huang, Ying" <ying.huang@intel.com>,
 Zi Yan <ziy@nvidia.com>, Luis Chamberlain <mcgrof@kernel.org>,
 Itaru Kitayama <itaru.kitayama@gmail.com>,
 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org
References: <20230726095146.2826796-1-ryan.roberts@arm.com>
 <20230726095146.2826796-3-ryan.roberts@arm.com>
 <c02a95e9-b728-ad64-6942-f23dbd66af0c@arm.com>
 <CAOUHufaHH3Ctu3JRHSbmebHJ7XPnBEWTQ4mwOo+MGXU9yKvwbA@mail.gmail.com>
 <5e595904-3dca-0e15-0769-7ed10975fd0d@arm.com>
 <b936041c-08a7-e844-19e7-eafc4ddf63b9@redhat.com>
 <CAOUHufafd4GNna2GKdSyQdW6CLVh0gxhNgeOc6t+ZOphwgw7tw@mail.gmail.com>
 <259ad8fc-c12b-69b9-ba16-adb9e3e6d672@redhat.com>
 <CAOUHufbbrDrSv2Ak0tyyaw7qrekkQ-p2vjCqWsXFG7b-+EP=5g@mail.gmail.com>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
In-Reply-To: <CAOUHufbbrDrSv2Ak0tyyaw7qrekkQ-p2vjCqWsXFG7b-+EP=5g@mail.gmail.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 04.08.23 23:26, Yu Zhao wrote:
> On Fri, Aug 4, 2023 at 3:13 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.08.23 23:00, Yu Zhao wrote:
>>> On Fri, Aug 4, 2023 at 2:23 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 04.08.23 10:27, Ryan Roberts wrote:
>>>>> On 04/08/2023 00:50, Yu Zhao wrote:
>>>>>> On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> + Kirill
>>>>>>>
>>>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>>>> counting, rmap management, lru list management) are also significantly
>>>>>>>> reduced since those ops now become per-folio.
>>>>>>>>
>>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>>>> which defaults to disabled for now; the long-term aim is for this to
>>>>>>>> default to enabled, but there are some risks around internal
>>>>>>>> fragmentation that need to be better understood first.
>>>>>>>>
>>>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>>>> fragmentation so we honour that request.
>>>>>>>>
>>>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>>>>>> bigger). This allows for a performance boost without requiring any
>>>>>>>> explicit opt-in from the workload while limiting internal
>>>>>>>> fragmentation.
>>>>>>>>
>>>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>>>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>>>> +             (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>>>> +
>>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>>>> +{
>>>>>>>> +     int order;
>>>>>>>> +
>>>>>>>> +     /*
>>>>>>>> +      * If THP is explicitly disabled for either the vma, the process or the
>>>>>>>> +      * system, then this is very likely intended to limit internal
>>>>>>>> +      * fragmentation; in this case, don't attempt to allocate a large
>>>>>>>> +      * anonymous folio.
>>>>>>>> +      *
>>>>>>>> +      * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>>>>> +      * size preferred by the arch. Or if the arch requested a very small
>>>>>>>> +      * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>>>>> +      * which still meets the arch's requirements but means we still take
>>>>>>>> +      * advantage of SW optimizations (e.g. fewer page faults).
>>>>>>>> +      *
>>>>>>>> +      * Finally if thp is enabled but the vma isn't eligible, take the
>>>>>>>> +      * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>>>> +      * This ensures workloads that have not explicitly opted-in take benefit
>>>>>>>> +      * while capping the potential for internal fragmentation.
>>>>>>>> +      */
>>>>>>>> +
>>>>>>>> +     if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>>>> +         test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>>>> +         !hugepage_flags_enabled())
>>>>>>>> +             order = 0;
>>>>>>>> +     else {
>>>>>>>> +             order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>>> +
>>>>>>>> +             if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>>> +                     order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>>> +     }
>>>>>>>> +
>>>>>>>> +     return order;
>>>>>>>> +}
>>>>>>>
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>>>> meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
>>>>>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>>>>>> Kirill.
>>>>>>>
>>>>>>> In summary, I think we are converging on the approach that is already coded, but
>>>>>>> I'd like confirmation.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The THP situation today
>>>>>>> -----------------------
>>>>>>>
>>>>>>>     - At system level: THP can be set to "never", "madvise" or "always"
>>>>>>>     - At process level: THP can be "never" or "defer to system setting"
>>>>>>>     - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>>>>>
>>>>>>> That gives us this table to describe how a page fault is handled, according to
>>>>>>> process state (columns) and vma flags (rows):
>>>>>>>
>>>>>>>                 | never     | madvise   | always
>>>>>>> ----------------|-----------|-----------|-----------
>>>>>>> no hint         | S         | S         | THP>S
>>>>>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>>>
>>>>>>> Legend:
>>>>>>> S       allocate single page (PTE-mapped)
>>>>>>> LAF     allocate large anon folio (PTE-mapped)
>>>>>>> THP     allocate THP-sized folio (PMD-mapped)
>>>>>>> >       fallback (usually because vma size/alignment insufficient for folio)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Principles for Large Anon Folios (LAF)
>>>>>>> --------------------------------------
>>>>>>>
>>>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>>>>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>>>>>> not honoured.
>>>>>>
>>>>>> I don't remember David saying this. I think he was referring to UFFD,
>>>>>> not MADV_NOHUGEPAGE, when discussing what we need to absolutely
>>>>>> respect.
>>>>>
>>>>> My understanding was that MADV_NOHUGEPAGE was being applied to regions *before*
>>>>> UFFD was being registered, and the app relied on MADV_NOHUGEPAGE to not back any
>>>>> unfaulted pages. It's not completely clear to me how not honouring
>>>>> MADV_NOHUGEPAGE would break things though. David?
>>>>
>>>> Sorry, I'm still lagging behind on some threads.
>>>>
>>>> Imagine the following for VM postcopy live migration:
>>>>
>>>> (1) Set MADV_NOHUGEPAGE on guest memory and discard all memory (e.g.,
>>>>        MADV_DONTNEED), to start with a clean slate.
>>>> (2) Migrate some pages during precopy from the source and store them
>>>>        into guest memory on the destination. Some of the memory locations
>>>>        will have pages populated.
>>>> (3) At some point, decide to enable postcopy: enable userfaultfd on
>>>>        guest memory.
>>>> (4) Discard *selected* pages again that have been dirtied in the
>>>>        meantime on the source. These are pages that have been migrated
>>>>        previously.
>>>> (5) Start running the VM on the destination.
>>>> (6) Anything that's not populated will trigger userfaultfd missing
>>>>        faults. Then, you can request them from the source and place them.
>>>>
>>>> Assume you would populate more than required during (2): you can end up
>>>> not getting userfaultfd faults during (6) and corrupt your guest state.
>>>> It works if during (2) you migrated all guest memory, or if during (4)
>>>> you zap everything that still needs migration.
>>>
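
The failure mode in the steps quoted above can be reproduced with a small
userspace model. This is only a sketch: struct guest, its helpers and the
16-page guest are made up for illustration, and nothing here calls madvise()
or userfaultfd. It just tracks which pages have a PTE and which hold content
that matches the source, with the fault granularity as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 16	/* hypothetical guest size, in base pages */

/* Hypothetical bookkeeping for the destination's guest memory. */
struct guest {
	bool populated[NPAGES];	/* a PTE exists, so no missing fault */
	bool valid[NPAGES];	/* content matches the source */
};

/* Step (1): discard everything to start with a clean slate. */
static void guest_discard_all(struct guest *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Step (2): store one migrated page. The write fault populates 2^order
 * naturally aligned pages; only idx receives real content, the other
 * pages of the folio are just zero-filled.
 */
static void guest_precopy_store(struct guest *g, size_t idx, unsigned int order)
{
	size_t n = (size_t)1 << order;
	size_t start = idx & ~(n - 1);

	assert(idx < NPAGES && start + n <= NPAGES);
	for (size_t i = start; i < start + n; i++)
		g->populated[i] = true;
	g->valid[idx] = true;
}

/* Step (4): discard one page that was dirtied again on the source. */
static void guest_discard_page(struct guest *g, size_t idx)
{
	assert(idx < NPAGES);
	g->populated[idx] = false;
	g->valid[idx] = false;
}

/* Step (6): pages that are populated but stale never raise a missing fault. */
static size_t guest_silently_stale(const struct guest *g)
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++)
		if (g->populated[i] && !g->valid[i])
			n++;
	return n;
}

/* Precopy migrates pages 0 and 5; page 5 is then dirtied on the source. */
size_t run_postcopy(unsigned int fault_order)
{
	struct guest g;

	guest_discard_all(&g);
	guest_precopy_store(&g, 0, fault_order);
	guest_precopy_store(&g, 5, fault_order);
	guest_discard_page(&g, 5);
	return guest_silently_stale(&g);
}
```

In this model run_postcopy(0) leaves no silently stale pages, because every
page that was never migrated still raises a missing fault. With
run_postcopy(2), the neighbours of pages 0 and 5 were populated as a side
effect, so the guest would read zeroes from them instead of faulting.
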
>>> I see what you mean now. Thanks.
>>>
>>> Yes, in this case we have to interpret MADV_NOHUGEPAGE as nothing >4KB.
>>
>> Note that it's still unclear to me why we want to *not* call these
>> things THP. It would certainly make everything less confusing if we call
>> them THP, but with additional attributes.
>>
>> I think that is one of the first things we should figure out because it
>> also indirectly tells us what all these toggles mean and how/if we
>> should redefine them (and if they even apply).
>>
>> Currently THP == PMD size
>>
>> In 2016, Hugh already envisioned PUD/PGD THP (see 49920d28781d ("mm:
>> make transparent hugepage size public")) when he explicitly exposed
>> "hpage_pmd_size". Not "hpage_size".
>>
>> For hugetlb on arm64 we already support various sizes that are < PMD
>> size and *not* call them differently. It's a huge(tlb) page. Sometimes
>> we refer to them as cont-PTE hugetlb pages.
>>
>>
>> So, nowadays we do have "PMD-sized THP", someday we might have
>> "PUD-sized THP". Can't we come up with a name to describe sub-PMD THP?
>>
>> Is it really of value if we invent a new term for them? Yes, I was not
>> enjoying "Flexible THP".
>>
>>
>> Once we have figured that out, we should figure out if MADV_HUGEPAGE meant
>> "only PMD-sized THP" or anything else?
>>
>> Also, we can then figure out if MADV_NOHUGEPAGE meant "only PMD-sized
>> THP" or anything else?
>>
>>
>> The simplest approach to me would be "they imply any THP, and once we
>> need more tunables we might add some", similar to what Kirill also raised.
>>
>>
>> Again, it's all unclear to me at this point and I'm happy to hear
>> opinions, because I really don't know.
> 
> I agree these points require more discussion. But I don't think we
> need to conclude them now, unless they cause correctness issues like
> ignoring MADV_NOHUGEPAGE would. My concern is that if we decide to go
> with "they imply any THP" and *expose this to userspace now*, we might
> regret it later.

If we decide they are not THP, MADV_NOHUGEPAGE probably should not
apply, and we should be ready to find other ways to deal with the mess we
eventually create. If we want to go down that path, sure.

If they are THP, to me there is not really a question whether
MADV_NOHUGEPAGE applies to them. Unless we want to build a confusing
piece of software ;)
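
For reference, the "MADV_NOHUGEPAGE applies" reading is what the quoted
anon_folio_order() already does. Below is a rough userspace model of that
policy, not the kernel code: the enums, model_anon_folio_order() and
max_order_unhinted() are hypothetical names, the arch-preferred order and
PAGE_SHIFT are passed in as plain parameters, and the eligibility test is a
deliberate simplification of hugepage_vma_check() that only looks at the
sysfs setting and the madvise hint.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_64K			0x10000UL
#define PAGE_ALLOC_COSTLY_ORDER	3	/* same value as in mainline */

/* Hypothetical stand-ins for the system setting and the per-vma hint. */
enum thp_sysfs { THP_NEVER, THP_MADVISE, THP_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	assert(v != 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Models ANON_FOLIO_MAX_ORDER_UNHINTED for a given PAGE_SHIFT. */
int max_order_unhinted(unsigned int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long cap = page_size > SZ_64K ? page_size : SZ_64K;

	return (int)(ilog2_ul(cap) - page_shift);
}

int model_anon_folio_order(enum thp_sysfs sys, bool process_disabled,
			   enum vma_hint hint, int arch_order,
			   unsigned int page_shift)
{
	int order;
	bool eligible;

	/* Explicit opt-out at vma, process or system level: order-0 only. */
	if (hint == HINT_NOHUGEPAGE || process_disabled || sys == THP_NEVER)
		return 0;

	order = arch_order > PAGE_ALLOC_COSTLY_ORDER ?
		arch_order : PAGE_ALLOC_COSTLY_ORDER;

	/* Simplified: eligible if "always", or "madvise" plus MADV_HUGEPAGE. */
	eligible = sys == THP_ALWAYS || hint == HINT_HUGEPAGE;
	if (!eligible) {
		int cap = max_order_unhinted(page_shift);

		if (order > cap)
			order = cap;
	}
	return order;
}
```

With 4K pages (page_shift = 12) the unhinted cap works out to order 4, so
an unhinted vma under thp=madvise is limited to 64K folios whatever the arch
prefers, while any vma with MADV_NOHUGEPAGE stays at order 0.
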

-- 
Cheers,

David / dhildenb