Subject: Re: [Question] Should direct reclaim time be bounded?
To: Mel Gorman
Cc: Mel Gorman, Vlastimil Babka, Michal Hocko, "linux-mm@kvack.org",
 linux-kernel, Andrea Arcangeli, Johannes Weiner
References: <20190423071953.GC25106@dhcp22.suse.cz>
 <04329fea-cd34-4107-d1d4-b2098ebab0ec@suse.cz>
 <20190701085920.GB2812@suse.de>
 <80036eed-993d-1d24-7ab6-e495f01b1caa@oracle.com>
 <20190703094325.GB2737@techsingularity.net>
From: Mike Kravetz
Message-ID: <571d5557-2153-59ea-334b-8636cc1a49c9@oracle.com>
Date: Wed, 3 Jul 2019 16:54:35 -0700
In-Reply-To: <20190703094325.GB2737@techsingularity.net>

On 7/3/19 2:43 AM, Mel Gorman wrote:
> Indeed. I'm getting knocked offline shortly so I didn't give this the
> time it deserves but it appears that part of this problem is
> hugetlb-specific when one node is full and can enter into this continual
> loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
> nr_scanned to be zero.

Yes, I am not aware of any other large order allocations consistently made
with __GFP_RETRY_MAYFAIL. But, I did not look too closely. Michal believes
that hugetlb page allocations should use __GFP_RETRY_MAYFAIL.

> Have you considered one of the following as an option?
>
> 1. Always use the on-stack nodes_allowed in __nr_hugepages_store_common
>    and copy nodes_states if necessary. Add a bool parameter to
>    alloc_pool_huge_page that is true when called from set_max_huge_pages.
>    If an allocation from alloc_fresh_huge_page, clear the failing node
>    from the mask so it's not retried, bail if the mask is empty. The
>    consequences are that round-robin allocation of huge pages will be
>    different if a node failed to allocate for transient reasons.

That seems to be a more aggressive form of 3 below.

> 2. Alter the condition in should_continue_reclaim for
>    __GFP_RETRY_MAYFAIL to consider if nr_scanned < SWAP_CLUSTER_MAX.
>    Either raise priority (will interfere with kswapd though) or
>    bail entirely. Consequences may be that other __GFP_RETRY_MAYFAIL
>    allocations do not want this behaviour. There are a lot of users.

Due to the high number of users, I am avoiding such a change. It would be
hard to validate that such a change does not impact other users.

> 3. Move where __GFP_RETRY_MAYFAIL is set in a gfp_mask in mm/hugetlb.c.
>    Strip the flag if an allocation fails on a node. Consequences are
>    that setting the required number of huge pages is more likely to
>    return without all the huge pages set.

We are actually using a form of this in our distro kernel. It works quite
well on the older (4.11 based) distro kernel. My plan was to push this
upstream. However, when I tested this on recent upstream kernels, I
encountered long stalls associated with the first __GFP_RETRY_MAYFAIL
allocation failure. That is what prompted me to ask this question/start
this thread. The distro kernel would see stalls taking tens of seconds;
upstream would see stalls of several minutes. Much has changed since 4.11,
so I was trying to figure out what might be causing this change in behavior.

BTW, here is the patch I was testing. It actually has additional code to
switch between __GFP_RETRY_MAYFAIL and __GFP_NORETRY and back, to hopefully
take transient conditions into account.

From 528c52397301f02acb614c610bd65f0f9a107481 Mon Sep 17 00:00:00 2001
From: Mike Kravetz
Date: Wed, 3 Jul 2019 13:36:24 -0700
Subject: [PATCH] hugetlbfs: don't retry when pool page allocations start to
 fail

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the
pages will be interleaved between all nodes of the system. If nodes are
not equal, it is quite possible for one node to fill up before the others.
When this happens, the code still attempts to allocate pages from the full
node. This results in calls to direct reclaim and compaction which slow
things down considerably.

When allocating pool pages, note the state of the previous allocation for
each node. If the previous allocation failed, do not use the aggressive
retry algorithm on successive attempts. The allocation will still succeed
if there is memory available, but it will not try as hard to free up memory.

Signed-off-by: Mike Kravetz
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3c50344a9b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1405,12 +1405,27 @@ pgoff_t __basepage_index(struct page *page)
 }
 
 static struct page *alloc_buddy_huge_page(struct hstate *h,
-                gfp_t gfp_mask, int nid, nodemask_t *nmask)
+                gfp_t gfp_mask, int nid, nodemask_t *nmask,
+                nodemask_t *node_alloc_noretry)
 {
         int order = huge_page_order(h);
         struct page *page;
+        bool alloc_try_hard = true;
 
-        gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+        /*
+         * By default we always try hard to allocate the page with
+         * __GFP_RETRY_MAYFAIL flag. However, if we are allocating pages in
+         * a loop (to adjust global huge page counts) and previous allocation
+         * failed, do not continue to try hard on the same node. Use the
+         * node_alloc_noretry bitmap to manage this state information.
+         */
+        if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
+                alloc_try_hard = false;
+        gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+        if (alloc_try_hard)
+                gfp_mask |= __GFP_RETRY_MAYFAIL;
+        else
+                gfp_mask |= __GFP_NORETRY;
         if (nid == NUMA_NO_NODE)
                 nid = numa_mem_id();
         page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
@@ -1419,6 +1434,22 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
         else
                 __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
+        /*
+         * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
+         * indicates an overall state change. Clear bit so that we resume
+         * normal 'try hard' allocations.
+         */
+        if (node_alloc_noretry && page && !alloc_try_hard)
+                node_clear(nid, *node_alloc_noretry);
+
+        /*
+         * If we tried hard to get a page but failed, set bit so that
+         * subsequent attempts will not try as hard until there is an
+         * overall state change.
+         */
+        if (node_alloc_noretry && !page && alloc_try_hard)
+                node_set(nid, *node_alloc_noretry);
+
         return page;
 }
 
@@ -1427,7 +1458,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
  * should use this function to get new hugetlb pages
  */
 static struct page *alloc_fresh_huge_page(struct hstate *h,
-                gfp_t gfp_mask, int nid, nodemask_t *nmask)
+                gfp_t gfp_mask, int nid, nodemask_t *nmask,
+                nodemask_t *node_alloc_noretry)
 {
         struct page *page;
 
@@ -1435,7 +1467,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
                 page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
         else
                 page = alloc_buddy_huge_page(h, gfp_mask,
-                                nid, nmask);
+                                nid, nmask, node_alloc_noretry);
 
         if (!page)
                 return NULL;
@@ -1450,14 +1482,16 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
  * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
  * manner.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+                                nodemask_t *node_alloc_noretry)
 {
         struct page *page;
         int nr_nodes, node;
         gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
         for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-                page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
+                page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed,
+                                             node_alloc_noretry);
                 if (page)
                         break;
         }
@@ -1601,7 +1635,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
                 goto out_unlock;
         spin_unlock(&hugetlb_lock);
 
-        page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+        page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
         if (!page)
                 return NULL;
 
@@ -1637,7 +1671,7 @@ struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
         if (hstate_is_gigantic(h))
                 return NULL;
 
-        page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+        page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
         if (!page)
                 return NULL;
 
@@ -2207,13 +2241,31 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
         unsigned long i;
+        nodemask_t *node_alloc_noretry;
+
+        if (!hstate_is_gigantic(h)) {
+                /*
+                 * bit mask controlling how hard we retry per-node
+                 * allocations.
+                 */
+                node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
+                                             GFP_KERNEL | __GFP_NORETRY);
+        } else {
+                /* allocations done at boot time */
+                node_alloc_noretry = NULL;
+        }
+
+        /* bit mask controlling how hard we retry per-node allocations */
+        if (node_alloc_noretry)
+                nodes_clear(*node_alloc_noretry);
 
         for (i = 0; i < h->max_huge_pages; ++i) {
                 if (hstate_is_gigantic(h)) {
                         if (!alloc_bootmem_huge_page(h))
                                 break;
                 } else if (!alloc_pool_huge_page(h,
-                                         &node_states[N_MEMORY]))
+                                         &node_states[N_MEMORY],
+                                         node_alloc_noretry))
                         break;
                 cond_resched();
         }
@@ -2225,6 +2277,9 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
                         h->max_huge_pages, buf, i);
                 h->max_huge_pages = i;
         }
+
+        if (node_alloc_noretry)
+                kfree(node_alloc_noretry);
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -2323,6 +2378,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
                               nodemask_t *nodes_allowed)
 {
         unsigned long min_count, ret;
+        NODEMASK_ALLOC(nodemask_t, node_alloc_noretry,
+                       GFP_KERNEL | __GFP_NORETRY);
+
+        /* bit mask controlling how hard we retry per-node allocations */
+        if (node_alloc_noretry)
+                nodes_clear(*node_alloc_noretry);
 
         spin_lock(&hugetlb_lock);
 
@@ -2356,6 +2417,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
         if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
                 if (count > persistent_huge_pages(h)) {
                         spin_unlock(&hugetlb_lock);
+                        if (node_alloc_noretry)
+                                NODEMASK_FREE(node_alloc_noretry);
                         return -EINVAL;
                 }
                 /* Fall through to decrease pool */
@@ -2388,7 +2451,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
                 /* yield cpu to avoid soft lockup */
                 cond_resched();
 
-                ret = alloc_pool_huge_page(h, nodes_allowed);
+                ret = alloc_pool_huge_page(h, nodes_allowed,
+                                           node_alloc_noretry);
                 spin_lock(&hugetlb_lock);
                 if (!ret)
                         goto out;
@@ -2429,6 +2493,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
         h->max_huge_pages = persistent_huge_pages(h);
         spin_unlock(&hugetlb_lock);
 
+        if (node_alloc_noretry)
+                NODEMASK_FREE(node_alloc_noretry);
+
         return 0;
 }
-- 
2.20.1
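
For reference, the allocation loop this patch tunes is driven from userspace
by writing a pool size to /proc/sys/vm/nr_hugepages, which is serviced by
set_max_huge_pages() and the alloc_pool_huge_page() loop above. Below is a
minimal sketch of that trigger; the target count of 4096 pages and the use of
the default hstate are illustrative assumptions only, and it must run as root.

/*
 * Sketch: request 4096 pages in the default hugetlb pool by writing to
 * /proc/sys/vm/nr_hugepages. The kernel handles the write in
 * set_max_huge_pages(), which calls alloc_pool_huge_page() in a loop --
 * the path modified by the patch above. The count is an arbitrary example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/proc/sys/vm/nr_hugepages";
        const char *count = "4096";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open nr_hugepages");
                return 1;
        }
        if (write(fd, count, strlen(count)) != (ssize_t)strlen(count)) {
                perror("write nr_hugepages");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}

Watching HugePages_Total in /proc/meminfo while this runs shows how far the
pool grows before per-node allocation failures start and, without the patch,
where the long direct reclaim stalls begin.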