From: David Hildenbrand
Organization: Red Hat
To: Oscar Salvador, Andrew Morton
Cc: Mike Kravetz, Vlastimil Babka, Michal Hocko, Muchun Song, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
Date: Wed, 14 Apr 2021 14:26:21 +0200
Message-ID: <71cce295-91d3-21d8-5075-04a2e828d0f2@redhat.com>
In-Reply-To: <20210413104747.12177-6-osalvador@suse.de>
References: <20210413104747.12177-1-osalvador@suse.de> <20210413104747.12177-6-osalvador@suse.de>

> +static inline int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	return -ENOMEM;

Without CONFIG_HUGETLB_PAGE, there is no way someone could possibly pass
in something valid. Although it doesn't matter too much, -EINVAL or
similar sounds a bit better.
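Untested, but what I have in mind is really just swapping the error code
in the stub (the function name and signature are the ones from this
patch):

static inline int isolate_or_dissolve_huge_page(struct page *page)
{
	/*
	 * Without CONFIG_HUGETLB_PAGE there cannot be a hugetlb page to
	 * isolate or dissolve, so flag the request as invalid input
	 * rather than pretending we ran out of memory.
	 */
	return -EINVAL;
}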
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					   unsigned long addr,
>  					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index eeba4668c22c..89426b6d1ea3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>   * Isolate all pages that can be migrated from the range specified by
>   * [low_pfn, end_pfn). The range is expected to be within same pageblock.
>   * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> - * or 0.
> + * -ENOMEM in case we could not allocate a page, or 0.
>   * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
>   * equal to or more that end_pfn).
>   *
> @@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  	bool skip_on_failure = false;
>  	unsigned long next_skip_pfn = 0;
>  	bool skip_updated = false;
> +	bool fatal_error = false;

Can't we use "ret == -ENOMEM" instead of fatal_error?
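Something like the following (untested sketch, relying on ret only
staying -ENOMEM in the fatal case, because the -EBUSY path below resets
it to 0):

	/* in the hunk at isolate_fail further down */
	if (!skip_on_failure && ret != -ENOMEM)
		continue;

	/* and at the very end of the pfn loop body */
	if (ret == -ENOMEM)
		break;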
>  	int ret = 0;
>  
>  	cc->migrate_pfn = low_pfn;
> @@ -907,6 +908,33 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			valid_page = page;
>  		}
>  
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			ret = isolate_or_dissolve_huge_page(page);
> +
> +			/*
> +			 * Fail isolation in case isolate_or_dissolve_huge_page
> +			 * reports an error. In case of -ENOMEM, abort right away.
> +			 */
> +			if (ret < 0) {
> +				/*
> +				 * Do not report -EBUSY down the chain.
> +				 */
> +				if (ret == -ENOMEM)
> +					fatal_error = true;
> +				else
> +					ret = 0;
> +				low_pfn += (1UL << compound_order(page)) - 1;
> +				goto isolate_fail;
> +			}
> +
> +			/*
> +			 * Ok, the hugepage was dissolved. Now these pages are
> +			 * Buddy and cannot be re-allocated because they are
> +			 * isolated. Fall-through as the check below handles
> +			 * Buddy pages.
> +			 */
> +		}
> +
>  		/*
>  		 * Skip if free. We read page order here without zone lock
>  		 * which is generally unsafe, but the race window is small and
> @@ -1066,7 +1094,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			put_page(page);
>  
>  isolate_fail:
> -		if (!skip_on_failure)
> +		if (!skip_on_failure && !fatal_error)
>  			continue;
>  
>  		/*
> @@ -1092,6 +1120,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			 */
>  			next_skip_pfn += 1UL << cc->order;
>  		}
> +
> +		if (fatal_error)
> +			break;
>  	}
>  
>  	/*
> @@ -1145,7 +1176,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   * @end_pfn:   The one-past-last PFN.
>   *
>   * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> - * or 0.
> + * -ENOMEM in case we could not allocate a page, or 0.
>   */
>  int
>  isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0607b2b71ac6..4a664d6e82c1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2266,6 +2266,121 @@ static void restore_reserve_on_error(struct hstate *h,
>  	}
>  }
>  
> +/*
> + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
> + * @h: struct hstate old page belongs to
> + * @old_page: Old page to dissolve
> + * Returns 0 on success, otherwise negated error.
> + */
> +
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> +{
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +	int nid = page_to_nid(old_page);
> +	struct page *new_page;
> +	int ret = 0;
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one for the
> +	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
> +	 * not having to deal with prep_new_page() and avoids dealing of any
> +	 * counters. This simplifies and let us do the whole thing under the
> +	 * lock.
> +	 */
> +	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> +	if (!new_page)
> +		return -ENOMEM;
> +
> +retry:
> +	spin_lock_irq(&hugetlb_lock);
> +	if (!PageHuge(old_page)) {
> +		/*
> +		 * Freed from under us. Drop new_page too.
> +		 */
> +		goto free_new;
> +	} else if (page_count(old_page)) {
> +		/*
> +		 * Someone has grabbed the page, fail for now.
> +		 */
> +		ret = -EBUSY;
> +		goto free_new;
> +	} else if (!HPageFreed(old_page)) {
> +		/*
> +		 * Page's refcount is 0 but it has not been enqueued in the
> +		 * freelist yet. Race window is small, so we can succeed here if
> +		 * we retry.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		cond_resched();
> +		goto retry;
> +	} else {
> +		/*
> +		 * Ok, old_page is still a genuine free hugepage. Remove it from
> +		 * the freelist and decrease the counters. These will be
> +		 * incremented again when calling __prep_account_new_huge_page()
> +		 * and enqueue_huge_page() for new_page. The counters will remain
> +		 * stable since this happens under the lock.
> +		 */
> +		remove_hugetlb_page(h, old_page, false);
> +
> +		/*
> +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> +		 * enqueue it then to place it in the freelists. After this,
> +		 * counters are back on track. Free hugepages have a refcount of 0,
> +		 * so we need to decrease new_page's count as well.
> +		 */
> +		__prep_new_huge_page(new_page);
> +		__prep_account_new_huge_page(h, nid);
> +		page_ref_dec(new_page);
> +		enqueue_huge_page(h, new_page);
> +
> +		/*
> +		 * Pages have been replaced, we can safely free the old one.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		update_and_free_page(h, old_page);
> +	}
> +
> +	return ret;
> +
> +free_new:
> +	spin_unlock_irq(&hugetlb_lock);
> +	__free_pages(new_page, huge_page_order(h));
> +
> +	return ret;
> +}
> +
> +int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h;
> +	struct page *head;
> +
> +	/*
> +	 * The page might have been dissolved from under our feet, so make sure
> +	 * to carefully check the state under the lock.
> +	 * Return success when racing as if we dissolved the page ourselves.
> +	 */
> +	spin_lock_irq(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	} else {
> +		spin_unlock(&hugetlb_lock);
> +		return 0;
> +	}
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	/*
> +	 * Fence off gigantic pages as there is a cyclic dependency between
> +	 * alloc_contig_range and them. Return -ENOME as this has the effect

s/-ENOME/-ENOMEM/

> +	 * of bailing out right away without further retrying.
> +	 */
> +	if (hstate_is_gigantic(h))
> +		return -ENOMEM;
> +
> +	return alloc_and_dissolve_huge_page(h, head);
> +}
> +
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {
> 

Complicated stuff, but looks good to me.

-- 
Thanks,

David / dhildenb