From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: ** X-Spam-Status: No, score=2.4 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B9E41C433DF for ; Sat, 13 Jun 2020 04:17:06 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 4BEA620739 for ; Sat, 13 Jun 2020 04:17:06 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="SORAo8Ln" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4BEA620739 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id D33D48D00DC; Sat, 13 Jun 2020 00:17:05 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CE3DD8D00A0; Sat, 13 Jun 2020 00:17:05 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BD3598D00DC; Sat, 13 Jun 2020 00:17:05 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0034.hostedemail.com [216.40.44.34]) by kanga.kvack.org (Postfix) with ESMTP id A187A8D00A0 for ; Sat, 13 Jun 2020 00:17:05 -0400 (EDT) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 61F50180AD817 for ; Sat, 13 Jun 2020 04:17:05 +0000 (UTC) X-FDA: 76922878410.26.teeth21_430158926de2 Received: from filter.hostedemail.com 
(10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin26.hostedemail.com (Postfix) with ESMTP id 37D401804B65C for ; Sat, 13 Jun 2020 04:17:05 +0000 (UTC) X-HE-Tag: teeth21_430158926de2 X-Filterd-Recvd-Size: 36354 Received: from mail-oi1-f196.google.com (mail-oi1-f196.google.com [209.85.167.196]) by imf07.hostedemail.com (Postfix) with ESMTP for ; Sat, 13 Jun 2020 04:17:04 +0000 (UTC) Received: by mail-oi1-f196.google.com with SMTP id t25so10690263oij.7 for ; Fri, 12 Jun 2020 21:17:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=VNY9jFPUd3ibpTjPfNZEPMUJlOBIbXNjQ7Z9KSNzfDA=; b=SORAo8LnkkYJIGRKGibozOirjezTazEoaFTpRe9OnffTuQWXrO0IQiHPshL8oFgfoy A+ygvKNMpcLwVTH0iOAOzpJx5bFB4ZTyOzHmv0V6hWPGQZ1C0Ma3baxEOAC0dPkDyY1i 5iViOwCeNTjp+oNSIDMI7fuLBxV3SY4d3Q+WR2mIMP29tGpXkPlVyU4ZfqOq4pMmGBsr 8cm75cYAjkYo8I/N1nen+VbfRPAN70wuKyrFspenBcKT1jzeaRY2Cw+Kg6R22Izf5Krj 8rsaUNynC+WWZRzZDrpkGwVHOF9IabapGTD5B0wO/mABgBF5hV/BeQpyJ0jfu2XicMcz Ec+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=VNY9jFPUd3ibpTjPfNZEPMUJlOBIbXNjQ7Z9KSNzfDA=; b=krrqZ/NW/rEGnOLFXRF4JJ2JqPwpcAu0cbroYGYRtg+qjE9XPmN1cB+A47r9zDa4BY f+z+ml1QLBELR0Cs4rh7Vf9wJFtT4MJLQZ+DuSrV3Ksnu6GB5ga1SHwi7eBbL/GoW+cs taPbuJ/BFpBjjgTNMnj3k+wHdXrn3sofCJXpXsDfZCN00cCiWKyP5svSLROWC4KQCd93 Y8DGUdakXOMpEYobdCUJySISVyc5A8fqJOqt1dvl8G/uutTsWP2FWalJU7K3BOExeKQ/ WQgZqr2FLcTRgz+tA8Okr8jMqzJTeGXrFZaXZMriPWHAFsnNDt1jU4A8XiBITb66NlIX 26XQ== X-Gm-Message-State: AOAM533UMgmXN6CzjINzMVGbxBvGNzxzXOQlnv7+eqc0XreFSv5eN+gK IyBOLpBvYCY73FaeSurj9SEVqePJ6L0tib3g+mk= X-Google-Smtp-Source: ABdhPJyN2A3omjApJpE+bRKI8Ef7UBNaNkitOcG3Rpvox3jWBM7G7TnfPaPm69URdJwX8DWHwhnOCR+psPz5nC5Tkyk= X-Received: by 2002:aca:5310:: with SMTP id h16mr1513513oib.163.1592021823745; Fri, 12 Jun 2020 21:17:03 -0700 (PDT) MIME-Version: 
1.0
References: <20200613025102.12880-1-jaewon31.kim@samsung.com>
In-Reply-To:
From: Jaewon Kim
Date: Sat, 13 Jun 2020 13:16:50 +0900
Message-ID:
Subject: Re: [PATCH v2] page_alloc: consider highatomic reserve in wmartermark fast
To: Vlastimil Babka
Cc: Jaewon Kim , mgorman@techsingularity.net, minchan@kernel.org, mgorman@suse.de, hannes@cmpxchg.org, Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, ytk.lee@samsung.com, cmlaika.kim@samsung.com
Content-Type: multipart/alternative; boundary="00000000000047723c05a7ef775a"
X-Rspamd-Queue-Id: 37D401804B65C
X-Spamd-Result: default: False [0.00 / 100.00]
X-Rspamd-Server: rspam05
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:

--00000000000047723c05a7ef775a
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Fri, Jun 12, 2020 at 11:34 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 6/13/20 4:51 AM, Jaewon Kim wrote:
> > zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
> > page_alloc: shortcut watermark checks for order-0 pages"). The commit
> > simply checks if free pages is bigger than watermark without additional
> > calculation such like reducing watermark.
> >
> > It considered free cma pages but it did not consider highatomic
> > reserved. This may incur exhaustion of free pages except high order
> > atomic free pages.
> >
> > Assume that reserved_highatomic pageblock is bigger than watermark min,
> > and there are only few free pages except high order atomic free. Because
> > zone_watermark_fast passes the allocation without considering high order
> > atomic free, normal reclaimable allocation like GFP_HIGHUSER will
> > consume all the free pages. Then finally order-0 atomic allocation may
> > fail on allocation.
>
> I don't understand why order-0 atomic allocation will fail. Is it because of
> watermark check, or finding no suitable pages?
> - watermark check should be OK as atomic allocations can use reserves
> - suitable pages should be OK, even if all free pages are in the highatomic
> reserves, because rmqueue() contains:
>
> if (alloc_flags & ALLOC_HARDER)
>         page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>
> So what am I missing?
>

Hello,

An order-0 atomic allocation can fail because the suitable free pages have been depleted.
The watermark check passes for the order-0 atomic allocation, but the allocation then fails to find a free page.
__rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) can be used only for high-order allocations.

> > This means watermark min is not protected against non-atomic allocation.
> > The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
> > failed. Additionally the __GFP_MEMALLOC allocation with
> > ALLOC_NO_WATERMARKS also can be failed.
> >
> > To avoid the problem, zone_watermark_fast should consider highatomic
> > reserve. If the actual size of high atomic free is counted accurately
> > like cma free, we may use it. On this patch just use
> > nr_reserved_highatomic. Additionally introduce
> > __zone_watermark_unusable_free to factor out common parts between
> > zone_watermark_fast and __zone_watermark_ok.
> >
> > This is trace log which shows GFP_HIGHUSER consumes free pages right
> > before ALLOC_NO_WATERMARKS.
> >
> >   <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> > kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> >
> > This is an example of ALLOC_HARDER allocation failure.
> >
> > <4>[ 6207.637280] [3: Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> > <4>[ 6207.637311] [3: Binder:9343_3:22875] Call trace:
> > <4>[ 6207.637346] [3: Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
> > <4>[ 6207.637356] [3: Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
> > <4>[ 6207.637365] [3: Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
> > <4>[ 6207.637374] [3: Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
> > <4>[ 6207.637381] [3: Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
> > <4>[ 6207.637387] [3: Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
> > <4>[ 6207.637396] [3: Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
> > <4>[ 6207.637404] [3: Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
> > <4>[ 6207.637412] [3: Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
> > <4>[ 6207.637421] [3: Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
> > <4>[ 6207.637430] [3: Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
> > <4>[ 6207.637442] [3: Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
> > <4>[ 6207.637569] [3: Binder:9343_3:22875] Mem-Info:
> > <4>[ 6207.637595] [3: Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
> > <4>[ 6207.637595] [3: Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
> > <4>[ 6207.637595] [3: Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
> > <4>[ 6207.637595] [3: Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
> > <4>[ 6207.637595] [3: Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
> > <4>[ 6207.637595] [3: Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
> > <4>[ 6207.637616] [3: Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > <4>[ 6207.637627] [3: Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> > <4>[ 6207.637632] [3: Binder:9343_3:22875] lowmem_reserve[]: 0 0
> > <4>[ 6207.637637] [3: Binder:9343_3:22875] Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
>
> OK this shows we are well above min watermark, and indeed only free pages are
> highatomic. Why doesn't the rmqueue() part quoted above work as expected and
> allow this allocation to use those highatomic blocks?
>

Because the highatomic free pages are reserved only for high-order atomic allocations, in practice those with ALLOC_HARDER.
An order-0 atomic allocation cannot use the highatomic free pages.

> > <4>[ 6207.637655] [3: Binder:9343_3:22875] 138826 total pagecache pages
> > <4>[ 6207.637663] [3: Binder:9343_3:22875] 5460 pages in swap cache
> > <4>[ 6207.637668] [3: Binder:9343_3:22875] Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
> >
> > This is an example of ALLOC_NO_WATERMARKS allocation failure.
> >
> > <6>[ 156.701551] [4: kswapd0: 1209] kswapd0 cpuset=/ mems_allowed=0
> > <4>[ 156.701563] [4: kswapd0: 1209] CPU: 4 PID: 1209 Comm: kswapd0 Tainted: G        W       4.14.113-18113966 #1
> > <4>[ 156.701572] [4: kswapd0: 1209] Call trace:
> > <4>[ 156.701605] [4: kswapd0: 1209] [<0000000000000000>] dump_stack+0x68/0x90
> > <4>[ 156.701612] [4: kswapd0: 1209] [<0000000000000000>] warn_alloc+0x104/0x198
> > <4>[ 156.701617] [4: kswapd0: 1209] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> > <4>[ 156.701623] [4: kswapd0: 1209] [<0000000000000000>] zs_malloc+0x148/0x3d0
> > <4>[ 156.701630] [4: kswapd0: 1209] [<0000000000000000>] zram_bvec_rw+0x250/0x568
> > <4>[ 156.701634] [4: kswapd0: 1209] [<0000000000000000>] zram_rw_page+0x8c/0xe0
> > <4>[ 156.701640] [4: kswapd0: 1209] [<0000000000000000>] bdev_write_page+0x70/0xbc
> > <4>[ 156.701645] [4: kswapd0: 1209] [<0000000000000000>] __swap_writepage+0x58/0x37c
> > <4>[ 156.701649] [4: kswapd0: 1209] [<0000000000000000>] swap_writepage+0x40/0x4c
> > <4>[ 156.701654] [4: kswapd0: 1209] [<0000000000000000>] shrink_page_list+0xc3c/0xf54
> > <4>[ 156.701659] [4: kswapd0: 1209] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> > <4>[ 156.701664] [4: kswapd0: 1209] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> > <4>[ 156.701668] [4: kswapd0: 1209] [<0000000000000000>] shrink_node+0x1c8/0x304
> > <4>[ 156.701673] [4: kswapd0: 1209] [<0000000000000000>] kswapd+0x680/0x7c4
> > <4>[ 156.701679] [4: kswapd0: 1209] [<0000000000000000>] kthread+0x110/0x120
> > <4>[ 156.701684] [4: kswapd0: 1209] [<0000000000000000>] ret_from_fork+0x10/0x18
> > <4>[ 156.701689] [4: kswapd0: 1209] Mem-Info:
> > <4>[ 156.701712] [4: kswapd0: 1209] active_anon:88690 inactive_anon:88630 isolated_anon:0
> > <4>[ 156.701712] [4: kswapd0: 1209]  active_file:99173 inactive_file:169305 isolated_file:32
> > <4>[ 156.701712] [4: kswapd0: 1209]  unevictable:48292 dirty:538 writeback:38 unstable:0
> > <4>[ 156.701712] [4: kswapd0: 1209]  slab_reclaimable:15131 slab_unreclaimable:47762
> > <4>[ 156.701712] [4: kswapd0: 1209]  mapped:274654 shmem:2824 pagetables:25088 bounce:0
> > <4>[ 156.701712] [4: kswapd0: 1209]  free:2489 free_pcp:444 free_cma:3
> > <4>[ 156.701728] [4: kswapd0: 1209] Node 0 active_anon:354760kB inactive_anon:354520kB active_file:396692kB inactive_file:677220kB unevictable:193168kB isolated(anon):0kB isolated(file):128kB mapped:1098616kB dirty:2152kB writeback:152kB shmem:11296kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > <4>[ 156.701738] [4: kswapd0: 1209] Normal free:9956kB min:7428kB low:93440kB high:97032kB active_anon:355176kB inactive_anon:354580kB active_file:396196kB inactive_file:677284kB unevictable:193168kB writepending:2304kB present:4081664kB managed:3593324kB mlocked:193168kB kernel_stack:55008kB pagetables:100352kB bounce:0kB free_pcp:1776kB local_pcp:656kB free_cma:12kB
> > <4>[ 156.701741] [4: kswapd0: 1209] lowmem_reserve[]: 0 0
> > <4>[ 156.701747] [4: kswapd0: 1209] Normal: 196*4kB (H) 141*8kB (H) 109*16kB (H) 63*32kB (H) 20*64kB (H) 8*128kB (H) 2*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 9000kB
>
> Same here, although here AFAICS ALLOC_NO_WATERMARKS doesn't imply ALLOC_HARDER,
> so the rmqueue() check wouldn't kick in? That would be something to fix... but
> doesn't explain the GFP_ATOMIC case above.

I may have missed something in the log above, but AFAIK it was an order-0 allocation.
For an order-0 allocation with ALLOC_NO_WATERMARKS, rmqueue() will search for a free
page but cannot find one, because other allocations that were allowed to direct reclaim
had already taken the suitable free pages without performing direct reclaim.

>
> ...
>
> > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> >         /*
> >          * Fast check for order-0 only. If this fails then the reserves
> >          * need to be calculated. There is a corner case where the check
> >          * passes but only the high-order atomic reserve are free. If
> >          * the caller is !atomic then it'll uselessly search the free
> >          * list. That corner case is then slower but it is harmless.
> >          */
>
> The comment stops being true after this patch? It also suggests that Mel
> anticipated this corner case, but that it should only cause a false positive
> zone_watermark_fast() and then rmqueue() fails for !ALLOC_HARDER as it cannot
> use MIGRATE_HIGHATOMIC blocks. It expects atomic order-0 still works. So what's
> going on?

As Mel also agreed in the v1 mail thread, the highatomic reserve should
be considered even in the fast watermark check.

I think the comment may need to be changed. Prior to this patch, a non-highatomic
allocation may do a useless search, but it can also take ALL of the non-highatomic
free pages.

With this patch, a non-highatomic allocation will NOT do a useless search. Instead,
it may be forced into direct reclaim even when some non-highatomic free pages remain.

For example, in the following situation the watermark check fails (9MB - 8MB < 4MB)
even though there are enough free pages (9MB - 4MB > 4MB). If this really matters,
we need to count highatomic free pages accurately.

min : 4MB
highatomic reserved : 8MB
Total free : 9MB
  actual highatomic free : 4MB
  non highatomic free : 5MB

>
> > -     if (!order && (free_pages - cma_pages) >
> > -                             mark + z->lowmem_reserve[highest_zoneidx])
> > -             return true;
> > +     if (!order) {
> > +             long fast_free = free_pages - unusable_free;
> > +
> > +             if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> > +                     return true;
> > +     }
> >
> >       return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
> >                                       free_pages);
> >
>

--00000000000047723c05a7ef775a
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--00000000000047723c05a7ef775a--
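
The 4MB / 8MB / 9MB situation described in the reply can be worked through with a small userspace model. This is only a sketch under stated assumptions: the struct and function names below are hypothetical and are not the kernel's, quantities are in MB rather than pages, CMA is ignored, and lowmem_reserve is taken as 0 as in the logs. It compares the order-0 fast check before the patch, with the patch (subtracting the whole highatomic reserve), and with an accurate count of highatomic free pages.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the order-0 fast watermark check.
 * Names are illustrative, not the kernel's. All values are in MB.
 */
struct zone_model {
	long free;                /* total free memory in the zone */
	long reserved_highatomic; /* size of the highatomic reserve */
	long highatomic_free;     /* free memory actually inside that reserve */
	long lowmem_reserve;      /* lowmem reserve for the requested zone index */
};

/* Before the patch: the highatomic reserve is not subtracted at all. */
static bool fast_check_before(const struct zone_model *z, long mark)
{
	return z->free > mark + z->lowmem_reserve;
}

/* With the patch: subtract the whole reserve, an upper bound on its free part. */
static bool fast_check_patched(const struct zone_model *z, long mark)
{
	long fast_free = z->free - z->reserved_highatomic;

	return fast_free > mark + z->lowmem_reserve;
}

/* If highatomic free pages were counted accurately instead. */
static bool fast_check_accurate(const struct zone_model *z, long mark)
{
	long fast_free = z->free - z->highatomic_free;

	return fast_free > mark + z->lowmem_reserve;
}

/* The situation from the reply: min 4MB, reserve 8MB, 9MB free, 4MB of it highatomic. */
static const struct zone_model example_zone = {
	.free = 9,
	.reserved_highatomic = 8,
	.highatomic_free = 4,
	.lowmem_reserve = 0,
};
```

With mark set to 4 (watermark min), the pre-patch check passes because 9 > 4, the patched check fails because 9 - 8 = 1 is not above 4, and the accurate variant passes because 9 - 4 = 5 > 4. That is the trade-off described in the reply: the patched check is conservative and may push a non-highatomic allocation into direct reclaim while 5MB of usable free memory remains.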