From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 968C3ECE58A for ; Tue, 1 Oct 2019 21:58:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3591A21A4C for ; Tue, 1 Oct 2019 21:58:17 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3591A21A4C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=suse.cz Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 917B88E0005; Tue, 1 Oct 2019 17:58:16 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8C6EC8E0001; Tue, 1 Oct 2019 17:58:16 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 78F4F8E0005; Tue, 1 Oct 2019 17:58:16 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0253.hostedemail.com [216.40.44.253]) by kanga.kvack.org (Postfix) with ESMTP id 5685C8E0001 for ; Tue, 1 Oct 2019 17:58:16 -0400 (EDT) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with SMTP id E2DC7181AC9B4 for ; Tue, 1 Oct 2019 21:58:15 +0000 (UTC) X-FDA: 75996579750.24.toys16_dda095984c54 X-HE-Tag: toys16_dda095984c54 X-Filterd-Recvd-Size: 11887 Received: from mx1.suse.de (mx2.suse.de [195.135.220.15]) by imf24.hostedemail.com (Postfix) with ESMTP for ; Tue, 1 Oct 2019 21:58:15 +0000 (UTC) X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id BE58FAD3E; Tue, 1 Oct 2019 21:58:12 +0000 (UTC) Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages To: David Rientjes Cc: Michal Hocko , Linus Torvalds , Andrea Arcangeli , Andrew Morton , Mel Gorman , "Kirill A. Shutemov" , Linux Kernel Mailing List , Linux-MM References: <20190904205522.GA9871@redhat.com> <20190909193020.GD2063@dhcp22.suse.cz> <20190925070817.GH23050@dhcp22.suse.cz> <20190927074803.GB26848@dhcp22.suse.cz> <20190930112817.GC15942@dhcp22.suse.cz> <20191001054343.GA15624@dhcp22.suse.cz> From: Vlastimil Babka Autocrypt: addr=vbabka@suse.cz; prefer-encrypt=mutual; keydata= mQINBFZdmxYBEADsw/SiUSjB0dM+vSh95UkgcHjzEVBlby/Fg+g42O7LAEkCYXi/vvq31JTB KxRWDHX0R2tgpFDXHnzZcQywawu8eSq0LxzxFNYMvtB7sV1pxYwej2qx9B75qW2plBs+7+YB 87tMFA+u+L4Z5xAzIimfLD5EKC56kJ1CsXlM8S/LHcmdD9Ctkn3trYDNnat0eoAcfPIP2OZ+ 9oe9IF/R28zmh0ifLXyJQQz5ofdj4bPf8ecEW0rhcqHfTD8k4yK0xxt3xW+6Exqp9n9bydiy tcSAw/TahjW6yrA+6JhSBv1v2tIm+itQc073zjSX8OFL51qQVzRFr7H2UQG33lw2QrvHRXqD Ot7ViKam7v0Ho9wEWiQOOZlHItOOXFphWb2yq3nzrKe45oWoSgkxKb97MVsQ+q2SYjJRBBH4 8qKhphADYxkIP6yut/eaj9ImvRUZZRi0DTc8xfnvHGTjKbJzC2xpFcY0DQbZzuwsIZ8OPJCc LM4S7mT25NE5kUTG/TKQCk922vRdGVMoLA7dIQrgXnRXtyT61sg8PG4wcfOnuWf8577aXP1x 6mzw3/jh3F+oSBHb/GcLC7mvWreJifUL2gEdssGfXhGWBo6zLS3qhgtwjay0Jl+kza1lo+Cv BB2T79D4WGdDuVa4eOrQ02TxqGN7G0Biz5ZLRSFzQSQwLn8fbwARAQABtCBWbGFzdGltaWwg QmFia2EgPHZiYWJrYUBzdXNlLmN6PokCVAQTAQoAPgIbAwULCQgHAwUVCgkICwUWAgMBAAIe AQIXgBYhBKlA1DSZLC6OmRA9UCJPp+fMgqZkBQJcbbyGBQkH8VTqAAoJECJPp+fMgqZkpGoP /1jhVihakxw1d67kFhPgjWrbzaeAYOJu7Oi79D8BL8Vr5dmNPygbpGpJaCHACWp+10KXj9yz fWABs01KMHnZsAIUytVsQv35DMMDzgwVmnoEIRBhisMYOQlH2bBn/dqBjtnhs7zTL4xtqEcF 1hoUFEByMOey7gm79utTk09hQE/Zo2x0Ikk98sSIKBETDCl4mkRVRlxPFl4O/w8dSaE4eczH LrKezaFiZOv6S1MUKVKzHInonrCqCNbXAHIeZa3JcXCYj1wWAjOt9R3NqcWsBGjFbkgoKMGD usiGabetmQjXNlVzyOYdAdrbpVRNVnaL91sB2j8LRD74snKsV0Wzwt90YHxDQ5z3M75YoIdl byTKu3BUuqZxkQ/emEuxZ7aRJ1Zw7cKo/IVqjWaQ1SSBDbZ8FAUPpHJxLdGxPRN8Pfw8blKY 8mvLJKoF6i9T6+EmlyzxqzOFhcc4X5ig5uQoOjTIq6zhLO+nqVZvUDd2Kz9LMOCYb516cwS/ Enpi0TcZ5ZobtLqEaL4rupjcJG418HFQ1qxC95u5FfNki+YTmu6ZLXy+1/9BDsPuZBOKYpUm 3HWSnCS8J5Ny4SSwfYPH/JrtberWTcCP/8BHmoSpS/3oL3RxrZRRVnPHFzQC6L1oKvIuyXYF rkybPXYbmNHN+jTD3X8nRqo+4Qhmu6SHi3VquQENBFsZNQwBCACuowprHNSHhPBKxaBX7qOv KAGCmAVhK0eleElKy0sCkFghTenu1sA9AV4okL84qZ9gzaEoVkgbIbDgRbKY2MGvgKxXm+kY n8tmCejKoeyVcn9Xs0K5aUZiDz4Ll9VPTiXdf8YcjDgeP6/l4kHb4uSW4Aa9ds0xgt0gP1Xb AMwBlK19YvTDZV5u3YVoGkZhspfQqLLtBKSt3FuxTCU7hxCInQd3FHGJT/IIrvm07oDO2Y8J DXWHGJ9cK49bBGmK9B4ajsbe5GxtSKFccu8BciNluF+BqbrIiM0upJq5Xqj4y+Xjrpwqm4/M ScBsV0Po7qdeqv0pEFIXKj7IgO/d4W2bABEBAAGJA3IEGAEKACYWIQSpQNQ0mSwujpkQPVAi T6fnzIKmZAUCWxk1DAIbAgUJA8JnAAFACRAiT6fnzIKmZMB0IAQZAQoAHRYhBKZ2GgCcqNxn k0Sx9r6Fd25170XjBQJbGTUMAAoJEL6Fd25170XjDBUH/2jQ7a8g+FC2qBYxU/aCAVAVY0NE YuABL4LJ5+iWwmqUh0V9+lU88Cv4/G8fWwU+hBykSXhZXNQ5QJxyR7KWGy7LiPi7Cvovu+1c 9Z9HIDNd4u7bxGKMpn19U12ATUBHAlvphzluVvXsJ23ES/F1c59d7IrgOnxqIcXxr9dcaJ2K k9VP3TfrjP3g98OKtSsyH0xMu0MCeyewf1piXyukFRRMKIErfThhmNnLiDbaVy6biCLx408L Mo4cCvEvqGKgRwyckVyo3JuhqreFeIKBOE1iHvf3x4LU8cIHdjhDP9Wf6ws1XNqIvve7oV+w B56YWoalm1rq00yUbs2RoGcXmtX1JQ//aR/paSuLGLIb3ecPB88rvEXPsizrhYUzbe1TTkKc 4a4XwW4wdc6pRPVFMdd5idQOKdeBk7NdCZXNzoieFntyPpAq+DveK01xcBoXQ2UktIFIsXey uSNdLd5m5lf7/3f0BtaY//f9grm363NUb9KBsTSnv6Vx7Co0DWaxgC3MFSUhxzBzkJNty+2d 10jvtwOWzUN+74uXGRYSq5WefQWqqQNnx+IDb4h81NmpIY/X0PqZrapNockj3WHvpbeVFAJ0 9MRzYP3x8e5OuEuJfkNnAbwRGkDy98nXW6fKeemREjr8DWfXLKFWroJzkbAVmeIL0pjXATxr +tj5JC0uvMrrXefUhXTo0SNoTsuO/OsAKOcVsV/RHHTwCDR2e3W8mOlA3QbYXsscgjghbuLh J3oTRrOQa8tUXWqcd5A0+QPo5aaMHIK0UAthZsry5EmCY3BrbXUJlt+23E93hXQvfcsmfi0N rNh81eknLLWRYvMOsrbIqEHdZBT4FHHiGjnck6EYx/8F5BAZSodRVEAgXyC8IQJ+UVa02QM5 D2VL8zRXZ6+wARKjgSrW+duohn535rG/ypd0ctLoXS6dDrFokwTQ2xrJiLbHp9G+noNTHSan ExaRzyLbvmblh3AAznb68cWmM3WVkceWACUalsoTLKF1sGrrIBj5updkKkzbKOq5gcC5AQ0E Wxk1NQEIAJ9B+lKxYlnKL5IehF1XJfknqsjuiRzj5vnvVrtFcPlSFL12VVFVUC2tT0A1Iuo9 NAoZXEeuoPf1dLDyHErrWnDyn3SmDgb83eK5YS/K363RLEMOQKWcawPJGGVTIRZgUSgGusKL NuZqE5TCqQls0x/OPljufs4gk7E1GQEgE6M90Xbp0w/r0HB49BqjUzwByut7H2wAdiNAbJWZ F5GNUS2/2IbgOhOychHdqYpWTqyLgRpf+atqkmpIJwFRVhQUfwztuybgJLGJ6vmh/LyNMRr8 J++SqkpOFMwJA81kpjuGR7moSrUIGTbDGFfjxmskQV/W/c25Xc6KaCwXah3OJ40AEQEAAYkC PAQYAQoAJhYhBKlA1DSZLC6OmRA9UCJPp+fMgqZkBQJbGTU1AhsMBQkDwmcAAAoJECJPp+fM gqZkPN4P/Ra4NbETHRj5/fM1fjtngt4dKeX/6McUPDIRuc58B6FuCQxtk7sX3ELs+1+w3eSV rHI5cOFRSdgw/iKwwBix8D4Qq0cnympZ622KJL2wpTPRLlNaFLoe5PkoORAjVxLGplvQIlhg miljQ3R63ty3+MZfkSVsYITlVkYlHaSwP2t8g7yTVa+q8ZAx0NT9uGWc/1Sg8j/uoPGrctml hFNGBTYyPq6mGW9jqaQ8en3ZmmJyw3CHwxZ5FZQ5qc55xgshKiy8jEtxh+dgB9d8zE/S/UGI E99N/q+kEKSgSMQMJ/CYPHQJVTi4YHh1yq/qTkHRX+ortrF5VEeDJDv+SljNStIxUdroPD29 2ijoaMFTAU+uBtE14UP5F+LWdmRdEGS1Ah1NwooL27uAFllTDQxDhg/+LJ/TqB8ZuidOIy1B xVKRSg3I2m+DUTVqBy7Lixo73hnW69kSjtqCeamY/NSu6LNP+b0wAOKhwz9hBEwEHLp05+mj 5ZFJyfGsOiNUcMoO/17FO4EBxSDP3FDLllpuzlFD7SXkfJaMWYmXIlO0jLzdfwfcnDzBbPwO hBM8hvtsyq8lq8vJOxv6XD6xcTtj5Az8t2JjdUX6SF9hxJpwhBU0wrCoGDkWp4Bbv6jnF7zP Nzftr4l8RuJoywDIiJpdaNpSlXKpj/K6KrnyAI/joYc7 Message-ID: Date: Tue, 1 Oct 2019 23:54:14 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 10/1/19 10:31 PM, David Rientjes wrote: > On Tue, 1 Oct 2019, Vlastimil Babka wrote: >=20 >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 4ae967bcf954..2c48146f3ee2 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct v= m_area_struct *vma, >> nmask =3D policy_nodemask(gfp, pol); >> if (!nmask || node_isset(hpage_node, *nmask)) { >> mpol_cond_put(pol); >> + /* >> + * First, try to allocate THP only on local node, but >> + * don't reclaim unnecessarily, just compact. >> + */ >> page =3D __alloc_pages_node(hpage_node, >> - gfp | __GFP_THISNODE, order); >> + gfp | __GFP_THISNODE | __GFP_NORETRY, order); >=20 > The page allocator has heuristics to determine when compaction should b= e=20 > retried, reclaim should be retried, and the allocation itself should re= try=20 > for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER). Yes, and flags like __GFP_NORETRY and __GFP_RETRY_MAYFAIL affect these heuristics according to the caller's willingness to wait vs availability of some kind of fallback. =20 > PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior=20 > when reclaim itself is unlikely -- or disruptive enough -- in making th= at=20 > amount of contiguous memory available. The __GFP_NORETRY check for costly orders is one of the implementation details of that poor allocator behavior avoidance. > Rather than papering over the poor feedback loop between compaction and= =20 > reclaim that exists in the page allocator, it's probably better to impr= ove=20 > that and determine when an allocation should fail or it's worthwhile to= =20 > retry. That's a separate topic from NUMA locality of thp. I don't disagree. But we are doing that 'papering over' already and I sim= ply believe it can be done better, until we can drop it. Right now at least t= he hugetlb allocations are regressing, as Michal showed. > In other words, we should likely address how compaction and reclaim is=20 > done for all high order-allocations in the page allocator itself rather= =20 > than only here for hugepage allocations and relying on specific gfp bit= s=20 > to be set. Well, with your patches, decisions are made based on pageblock order (affecting also hugetlbfs) and a specific __GFP_IO bit. I don't see how using a __GFP_NORETRY and costly order is worse - both are concepts already used to guide reclaim/compaction decisions. And it won't affect hugetlbfs. Also with some THP defrag modes, __GFP_NORETRY is also already being used. > Ask: if the allocation here should not retry regardless of why=20 > compaction failed, why should any high-order allocation (user or kernel= )=20 > retry if compaction failed and at what order we should just fail? If that's refering to the quoted code above, the allocation here should n= ot retry because it would lead to reclaiming just the local node, effectivel= y a node reclaim mode, leading to the problems Andrea and Mel reported. That's because __GFP_THISNODE makes it rather specific, so we shouldn't conclude anything for the other kinds of allocations, as you're asking. > If=20 > hugetlb wants to stress this to the fullest extent possible, it already= =20 > appropriately uses __GFP_RETRY_MAYFAIL. Which doesn't work anymore right now, and should again after this patch. > The code here is saying we care more about NUMA locality than hugepages That's why there's __GFP_THISNODE, yes. =20 > simply because that's where the access latency is better and is specifi= c=20 > to hugepages; allocation behavior for high-order pages needs to live in= =20 > the page allocator. I'd rather add the __GFP_NORETRY to __GFP_THISNODE here to influence the allocation behavior, than reintroduce any __GFP_THISNODE checks in the page allocator. There are not that many __GFP_THISNODE users, but there are plenty of __GFP_NORETRY users, so it seems better to adjust behavior based on the latter flag. IIRC in the recent past you argued for putting back __GFP_NORETRY to all THP allocations including madvise, so why is th= ere a problem now? >> =20 >> /* >> - * If hugepage allocations are configured to always >> - * synchronous compact or the vma has been madvised >> - * to prefer hugepage backing, retry allowing remote >> - * memory as well. >> + * If that fails, allow both compaction and reclaim, >> + * but on all nodes. >> */ >> - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) >> + if (!page) >> page =3D __alloc_pages_node(hpage_node, >> - gfp | __GFP_NORETRY, order); Also this __GFP_NORETRY was added by your patch, so I really don't see why do you think it's wrong in the first allocation attempt. >> + gfp, order); >> =20 >> goto out; >> } >=20 > We certainly don't want this for non-MADV_HUGEPAGE regions when thp=20 > enabled bit is not set to "always". We'd much rather fallback to local= =20 > native pages because of its improved access latency. This is allowing = all=20 > hugepage allocations to be remote even without MADV_HUGEPAGE which is n= ot=20 > even what Andrea needs: qemu uses MADV_HUGEPAGE. =20 OTOH it's better to use a remote THP than a remote base page, in case the local node is below watermark. But that would probably require a larger s= urgery of the allocator. But what you're saying is keep the gfp & __GFP_DIRECT_RECLAIM condition. I still suggest to drop the __GFP_NORETRY as that goes against MADV_HUGEP= AGE.