From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 52121EB64D7
	for <linux-mm@archiver.kernel.org>; Wed, 28 Jun 2023 11:06:48 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id DD0098D0003; Wed, 28 Jun 2023 07:06:47 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D82218D0001; Wed, 28 Jun 2023 07:06:47 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C48108D0003; Wed, 28 Jun 2023 07:06:47 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id B86AB8D0001
	for <linux-mm@kvack.org>; Wed, 28 Jun 2023 07:06:47 -0400 (EDT)
Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id 72B15B0504
	for <linux-mm@kvack.org>; Wed, 28 Jun 2023 11:06:47 +0000 (UTC)
X-FDA: 80951878854.19.6E6E0C9
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by imf19.hostedemail.com (Postfix) with ESMTP id 6890A1A000B
	for <linux-mm@kvack.org>; Wed, 28 Jun 2023 11:06:45 +0000 (UTC)
Authentication-Results: imf19.hostedemail.com;
	dkim=none;
	spf=pass (imf19.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1687950405;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=snNyCiPvK7dBA1vSaL3zUHFVj9zKEnnVVRxH/KkL504=;
	b=vNRyCMi+08TmJZOZsyMCTVWf0DIAKu6Cz7L4zz5LcoPkb663ioUb3m5Xq1eNiumM3vSxhJ
	3dR7AaKNC5w7jM2HOvhX3eJ8+X4ll7W1VlcmONjxfNtinrUbRzoV0wjxulnHN2EHhVBC5a
	K/82+cpkHtw5QJQt/U6rnIzce0Gy6Bk=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1687950405; a=rsa-sha256;
	cv=none;
	b=j6jWKsB+ZE3qUH8RGjxmHKlap+ocn9jBWgxzbgQ8Do470m4sQBwW/UY/3fVawmxFv8YUo7
	8Z6Gey6G22XoUuVKX+qOplefES8RRzUEUODKlmEIaYqNonbKqVr0VFKO0+/0QxPvgBXcZw
	jswfsTfUY3VUhZRRrkvM3SRs+7Zyeew=
ARC-Authentication-Results: i=1;
	imf19.hostedemail.com;
	dkim=none;
	spf=pass (imf19.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id E52A7C14;
	Wed, 28 Jun 2023 04:07:27 -0700 (PDT)
Received: from [10.57.76.180] (unknown [10.57.76.180])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id BF06E3F663;
	Wed, 28 Jun 2023 04:06:40 -0700 (PDT)
Message-ID: <5ad4f4de-1751-0320-5b8e-52bd6bd23d95@arm.com>
Date: Wed, 28 Jun 2023 12:06:38 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
 Gecko/20100101 Thunderbird/102.12.0
Subject: Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
To: Yin Fengwei <fengwei.yin@intel.com>, Yu Zhao <yuzhao@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
 David Hildenbrand <david@redhat.com>,
 Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>,
 Geert Uytterhoeven <geert@linux-m68k.org>,
 Christian Borntraeger <borntraeger@linux.ibm.com>,
 Sven Schnelle <svens@linux.ibm.com>, Thomas Gleixner <tglx@linutronix.de>,
 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
 Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com>,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org,
 linux-s390@vger.kernel.org
References: <20230626171430.3167004-1-ryan.roberts@arm.com>
 <20230626171430.3167004-4-ryan.roberts@arm.com>
 <CAOUHufZKM+aS_hYQ5nDUHh74UQwWipJ27Na5Sw4n+RDqnwyWHA@mail.gmail.com>
 <CAOUHufZeFTjzO6nSFz7Y=5rBGPzY+_eeN3f8W+g0u6AqosdmuQ@mail.gmail.com>
 <ba282a84-1a0d-4ffd-0b22-ac9510a820ef@arm.com>
 <8ab18141-8091-6691-ddbd-cff834a8d4d0@intel.com>
From: Ryan Roberts <ryan.roberts@arm.com>
In-Reply-To: <8ab18141-8091-6691-ddbd-cff834a8d4d0@intel.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Rspamd-Queue-Id: 6890A1A000B
X-Rspam-User: 
X-Stat-Signature: b9p58jjga3xfsz7ufgjkrfh6e37hiepq
X-Rspamd-Server: rspam03
X-HE-Tag: 1687950405-594736
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 28/06/2023 03:32, Yin Fengwei wrote:
> 
> 
> On 6/27/23 15:56, Ryan Roberts wrote:
>> On 27/06/2023 06:29, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Opportunistically attempt to allocate high-order folios in highmem,
>>>>> optionally zeroed. Retry with lower orders all the way to order-0, until
>>>>> success. Although, of note, order-1 allocations are skipped since a
>>>>> large folio must be at least order-2 to work with the THP machinery. The
>>>>> user must check what they got with folio_order().
>>>>>
>>>>> This will be used to opportunistically allocate large folios for
>>>>> anonymous memory with a sensible fallback under memory pressure.
>>>>>
>>>>> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
>>>>> high latency due to reclaim, instead preferring to just try for a lower
>>>>> order. The same approach is used by the readahead code when allocating
>>>>> large folios.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 33 insertions(+)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 367bbbb29d91..53896d46e686 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>>>         return 0;
>>>>>  }
>>>>>
>>>>> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>>> +{
>>>>> +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
>>>>> +
>>>>> +       if (zeroed)
>>>>> +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
>>>>> +       else
>>>>> +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
>>>>> +                                                               vaddr, false);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Opportunistically attempt to allocate high-order folios, retrying with lower
>>>>> + * orders all the way to order-0, until success. order-1 allocations are skipped
>>>>> + * since a folio must be at least order-2 to work with the THP machinery. The
>>>>> + * user must check what they got with folio_order(). vaddr can be any virtual
>>>>> + * address that will be mapped by the allocated folio.
>>>>> + */
>>>>> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>>> +{
>>>>> +       struct folio *folio;
>>>>> +
>>>>> +       for (; order > 1; order--) {
>>>>> +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
>>>>> +               if (folio)
>>>>> +                       return folio;
>>>>> +       }
>>>>> +
>>>>> +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>>>>> +}
>>>>
>>>> I'd drop this patch. Instead, in do_anonymous_page():
>>>>
>>>>   if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr,
>>>>                                 CONFIG_ARCH_WANTS_PTE_ORDER);
>>>>
>>>>   if (!folio)
>>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
>>>
>>> I meant a runtime function arch_wants_pte_order() (Its default
>>> implementation would return 0.)
>>
>> There are a bunch of things which you are implying here which I'll try to make
>> explicit:
>>
>> I think you are implying that we shouldn't retry allocation with intermediate
>> orders; but only try the order requested by the arch (arch_wants_pte_order())
>> and 0. Correct? For arm64 at least, I would like the VMA's THP hint to be a
>> factor in determining the preferred order (see patches 8 and 9). So I would add
>> a vma parameter to arch_wants_pte_order() to allow for this.
>>
>> For the case where the THP hint is present, then the arch will request 2M (if
>> the page size is 16K or 64K). If that fails to allocate, there is still value in
>> allocating a 64K folio (which is order 2 in the 16K case). Without the retry
>> with intermediate orders logic, we would not get this.
>>
>> We can't just blindly allocate a folio of arch_wants_pte_order() size because it
>> might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
>> number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
>> implying that if there is any kind of issue like this, then we should go
>> directly to order 0? I can kind of see the argument from a minimizing
>> fragmentation perspective, but for best possible performance I think we are
>> better off "packing the bin" with intermediate orders.
> 
> One drawback of the retry is that it could introduce large tail latency (by
> memory zeroing, memory reclaiming or existing populated PTEs). That may not
> be appreciated by some applications. Thanks.

Good point. Based on all the discussion, I think the conclusion is:

 - ask the arch for its preferred folio order via a runtime function
 - check that the folio will fit (racy); if it does not, fall back to order-0
 - allocate the folio
 - take the ptl
 - check that the folio still fits (not racy); if it does not, fall back to
   order-0

So in the worst case the latency will be that of allocating and zeroing a large
folio, then allocating and zeroing an order-0 folio, which is obviously better
than iterating through every order from preferred to 0.
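
To make the bound concrete, here is a rough userspace model of that flow. It is
only a sketch: every name in it is hypothetical (none of it is kernel API),
"fits" stands in for the overlap and VMA-bounds checks, and falling back to
order-0 when the large allocation itself fails is my assumption rather than
something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. Every name here is hypothetical; none of this is
 * kernel API. "Fits" stands for: no populated PTEs in range, inside the VMA.
 */
struct fault_sim {
	int preferred_order;   /* what the arch's runtime hook would return */
	bool fits_unlocked;    /* result of the racy pre-check */
	bool fits_locked;      /* result of the re-check under the ptl */
	bool large_alloc_ok;   /* can the allocator satisfy preferred_order? */
	int attempts;          /* allocate-and-zero attempts made so far */
};

/* Model one allocate-and-zero attempt; order-0 is assumed to always succeed. */
static bool sim_alloc(struct fault_sim *s, int order)
{
	s->attempts++;
	return order == 0 || s->large_alloc_ok;
}

/* Returns the order of the folio that ends up mapped. */
static int sim_anon_fault(struct fault_sim *s)
{
	int order = s->preferred_order;

	/* Racy check, no lock held: skip the large attempt if it cannot fit. */
	if (order > 0 && !s->fits_unlocked)
		order = 0;

	/* Assumption: a failed large allocation also falls back to order-0. */
	if (!sim_alloc(s, order)) {
		order = 0;
		sim_alloc(s, 0);
	}

	/* "Take the ptl", then repeat the fit check, which is now stable. */
	if (order > 0 && !s->fits_locked) {
		/* Drop the large folio and fall back to order-0. */
		order = 0;
		sim_alloc(s, 0);
	}

	return order;
}

/* For comparison: attempts made by the v1 loop if every large order fails. */
static int v1_worst_case_attempts(int preferred_order)
{
	int attempts = 0;

	for (int order = preferred_order; order > 1; order--)
		attempts++;	/* each order from preferred down to 2 */

	return attempts + 1;	/* plus the final order-0 allocation */
}
```

In this model sim_anon_fault() never makes more than two allocation attempts,
whatever the preferred order is, whereas the v1 loop can make one attempt per
order from the preferred order down to 2, plus the final order-0 attempt.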

I'll work this flow into a v2.

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> You're also implying that a runtime arch_wants_pte_order() function is better
>> than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
>> think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
>> would probably want runtime logic to determine if you are on an appropriate AMD
>> CPU as part of the decision in that function?
>>
>> The real reason for the existence of try_vma_alloc_movable_folio() is that I'm
>> reusing it on the other fault paths (which are no longer part of this series).
>> But I guess that's not a good reason to keep this until we get to those patches.