From: David Hildenbrand
Organization: Red Hat
To: Oscar Salvador, Andrew Morton
Cc: Mike Kravetz, Vlastimil Babka, Michal Hocko, Muchun Song, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
Date: Wed, 14 Apr 2021 14:26:21 +0200
Message-ID: <71cce295-91d3-21d8-5075-04a2e828d0f2@redhat.com>
In-Reply-To: <20210413104747.12177-6-osalvador@suse.de>
References: <20210413104747.12177-1-osalvador@suse.de> <20210413104747.12177-6-osalvador@suse.de>

> +static inline int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	return -ENOMEM;

Without CONFIG_HUGETLB_PAGE, there is no way someone could possibly pass
in something valid. Although it doesn't matter too much, -EINVAL or
similar sounds a bit better.
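Untested, but what I have in mind is really just swapping the error code
in the stub (the function name and signature are the ones from this
patch):

static inline int isolate_or_dissolve_huge_page(struct page *page)
{
	/*
	 * Without CONFIG_HUGETLB_PAGE there cannot be a hugetlb page to
	 * isolate or dissolve, so flag the request as invalid input
	 * rather than pretending we ran out of memory.
	 */
	return -EINVAL;
}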
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					   unsigned long addr,
>  					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index eeba4668c22c..89426b6d1ea3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>   * Isolate all pages that can be migrated from the range specified by
>   * [low_pfn, end_pfn). The range is expected to be within same pageblock.
>   * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> - * or 0.
> + * -ENOMEM in case we could not allocate a page, or 0.
>   * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
>   * equal to or more that end_pfn).
>   *
> @@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  	bool skip_on_failure = false;
>  	unsigned long next_skip_pfn = 0;
>  	bool skip_updated = false;
> +	bool fatal_error = false;

Can't we use "ret == -ENOMEM" instead of fatal_error?
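Something like the following (untested sketch, relying on ret only
staying -ENOMEM in the fatal case, because the -EBUSY path below resets
it to 0):

	/* in the hunk at isolate_fail further down */
	if (!skip_on_failure && ret != -ENOMEM)
		continue;

	/* and at the very end of the pfn loop body */
	if (ret == -ENOMEM)
		break;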
>  	int ret = 0;
>  
>  	cc->migrate_pfn = low_pfn;
> @@ -907,6 +908,33 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			valid_page = page;
>  		}
>  
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			ret = isolate_or_dissolve_huge_page(page);
> +
> +			/*
> +			 * Fail isolation in case isolate_or_dissolve_huge_page
> +			 * reports an error. In case of -ENOMEM, abort right away.
> +			 */
> +			if (ret < 0) {
> +				/*
> +				 * Do not report -EBUSY down the chain.
> +				 */
> +				if (ret == -ENOMEM)
> +					fatal_error = true;
> +				else
> +					ret = 0;
> +				low_pfn += (1UL << compound_order(page)) - 1;
> +				goto isolate_fail;
> +			}
> +
> +			/*
> +			 * Ok, the hugepage was dissolved. Now these pages are
> +			 * Buddy and cannot be re-allocated because they are
> +			 * isolated. Fall-through as the check below handles
> +			 * Buddy pages.
> +			 */
> +		}
> +
>  		/*
>  		 * Skip if free. We read page order here without zone lock
>  		 * which is generally unsafe, but the race window is small and
> @@ -1066,7 +1094,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			put_page(page);
>  
>  isolate_fail:
> -		if (!skip_on_failure)
> +		if (!skip_on_failure && !fatal_error)
>  			continue;
>  
>  		/*
> @@ -1092,6 +1120,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			 */
>  			next_skip_pfn += 1UL << cc->order;
>  		}
> +
> +		if (fatal_error)
> +			break;
>  	}
>  
>  	/*
> @@ -1145,7 +1176,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   * @end_pfn:   The one-past-last PFN.
>   *
>   * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> - * or 0.
> + * -ENOMEM in case we could not allocate a page, or 0.
>   */
>  int
>  isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0607b2b71ac6..4a664d6e82c1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2266,6 +2266,121 @@ static void restore_reserve_on_error(struct hstate *h,
>  	}
>  }
>  
> +/*
> + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
> + * @h: struct hstate old page belongs to
> + * @old_page: Old page to dissolve
> + * Returns 0 on success, otherwise negated error.
> + */
> +
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> +{
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +	int nid = page_to_nid(old_page);
> +	struct page *new_page;
> +	int ret = 0;
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one for the
> +	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
> +	 * not having to deal with prep_new_page() and avoids dealing of any
> +	 * counters. This simplifies and let us do the whole thing under the
> +	 * lock.
> +	 */
> +	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> +	if (!new_page)
> +		return -ENOMEM;
> +
> +retry:
> +	spin_lock_irq(&hugetlb_lock);
> +	if (!PageHuge(old_page)) {
> +		/*
> +		 * Freed from under us. Drop new_page too.
> +		 */
> +		goto free_new;
> +	} else if (page_count(old_page)) {
> +		/*
> +		 * Someone has grabbed the page, fail for now.
> +		 */
> +		ret = -EBUSY;
> +		goto free_new;
> +	} else if (!HPageFreed(old_page)) {
> +		/*
> +		 * Page's refcount is 0 but it has not been enqueued in the
> +		 * freelist yet. Race window is small, so we can succeed here if
> +		 * we retry.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		cond_resched();
> +		goto retry;
> +	} else {
> +		/*
> +		 * Ok, old_page is still a genuine free hugepage. Remove it from
> +		 * the freelist and decrease the counters. These will be
> +		 * incremented again when calling __prep_account_new_huge_page()
> +		 * and enqueue_huge_page() for new_page. The counters will remain
> +		 * stable since this happens under the lock.
> +		 */
> +		remove_hugetlb_page(h, old_page, false);
> +
> +		/*
> +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> +		 * enqueue it then to place it in the freelists. After this,
> +		 * counters are back on track. Free hugepages have a refcount of 0,
> +		 * so we need to decrease new_page's count as well.
> +		 */
> +		__prep_new_huge_page(new_page);
> +		__prep_account_new_huge_page(h, nid);
> +		page_ref_dec(new_page);
> +		enqueue_huge_page(h, new_page);
> +
> +		/*
> +		 * Pages have been replaced, we can safely free the old one.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		update_and_free_page(h, old_page);
> +	}
> +
> +	return ret;
> +
> +free_new:
> +	spin_unlock_irq(&hugetlb_lock);
> +	__free_pages(new_page, huge_page_order(h));
> +
> +	return ret;
> +}
> +
> +int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h;
> +	struct page *head;
> +
> +	/*
> +	 * The page might have been dissolved from under our feet, so make sure
> +	 * to carefully check the state under the lock.
> +	 * Return success when racing as if we dissolved the page ourselves.
> +	 */
> +	spin_lock_irq(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	} else {
> +		spin_unlock(&hugetlb_lock);
> +		return 0;
> +	}
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	/*
> +	 * Fence off gigantic pages as there is a cyclic dependency between
> +	 * alloc_contig_range and them. Return -ENOME as this has the effect

s/-ENOME/-ENOMEM/

> +	 * of bailing out right away without further retrying.
> +	 */
> +	if (hstate_is_gigantic(h))
> +		return -ENOMEM;
> +
> +	return alloc_and_dissolve_huge_page(h, head);
> +}
> +
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {
> 

Complicated stuff, but looks good to me.

-- 
Thanks,

David / dhildenb