From: Hillf Danton <hdanton@sina.com>
To: kernel test robot
Cc: Hillf Danton, linux-mm, Andrew Morton, Chris Down, Tejun Heo,
	Roman Gushchin, Shakeel Butt, Minchan Kim, Mel Gorman,
	linux-kernel, kbuild test robot
Subject: Re: [memcg] 1fc14cf673: invoked_oom-killer:gfp_mask=0x
Date: Sat, 9 Nov 2019 20:19:11 +0800
Message-Id: <20191109121911.6492-1-hdanton@sina.com>
In-Reply-To: <20191026110745.12956-1-hdanton@sina.com>

Hey Rong,

On Thu, 7 Nov 2019 17:02:34 +0800 Rong Chen wrote:
>
> FYI, we noticed the following commit (built with gcc-7):
>
> commit: 1fc14cf67325190e0075cf3cd5511965499fffb4 ("[RFC v2] memcg: add memcg lru for page reclaiming")
> url: https://github.com/0day-ci/linux/commits/Hillf-Danton/memcg-add-memcg-lru-for-page-reclaiming/20191029-143906
>
>
> in testcase: vm-scalability
> with following parameters:
>
> 	runtime: 300s
> 	test: lru-file-mmap-read
> 	cpufreq_governor: performance
> 	ucode: 0x500002b
>
> test-description: The motivation behind this suite is to exercise
> functions and regions of the mm/ of the Linux kernel which are of
> interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>
>
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> +--------------------------------------------------+------------+------------+
> |                                                  | 8005803a2c | 1fc14cf673 |
> +--------------------------------------------------+------------+------------+
> | boot_successes                                   | 2          | 4          |
> | boot_failures                                    | 11         |            |
> | WARNING:at_fs/iomap/direct-io.c:#iomap_dio_actor | 10         |            |
> | RIP:iomap_dio_actor                              | 10         |            |
> | BUG:kernel_hang_in_boot_stage                    | 1          |            |
> | last_state.OOM                                   | 0          | 4          |
> +--------------------------------------------------+------------+------------+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot
>
>
> user  :notice: [   51.667771] 2019-11-06 23:56:11 ./usemem --runtime 300 -f /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-71 --readonly 22906492245
>
> user  :notice: [   51.697549] 2019-11-06 23:56:11 ./usemem --runtime 300 -f /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-72 --readonly 22906492245
>
> kern  :warn  : [   51.715513] usemem invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
>
> user  :notice: [   51.724161] 2019-11-06 23:56:11 truncate /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-73 -s 22906492245
>
> kern  :warn  : [   51.727992] CPU: 11 PID: 3618 Comm: usemem Not tainted 5.4.0-rc5-00020-g1fc14cf673251 #2
> user  :notice: [   51.744101] 2019-11-06 23:56:11 ./usemem --runtime 300 -f /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-73 --readonly 22906492245
>
> kern  :warn  : [   51.752655] Call Trace:
> kern  :warn  : [   51.752666]  dump_stack+0x5c/0x7b
> user  :notice: [   51.771480] 2019-11-06 23:56:11 truncate /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-74 -s 22906492245
>
> kern  :warn  : [   51.775027]  dump_header+0x4a/0x220
> kern  :warn  : [   51.775029]  oom_kill_process+0xe9/0x130
> kern  :warn  : [   51.775031]  out_of_memory+0x105/0x510
> kern  :warn  : [   51.775037]  __alloc_pages_slowpath+0xa3f/0xdb0
> kern  :warn  : [   51.775040]  __alloc_pages_nodemask+0x2f0/0x340
> kern  :warn  : [   51.775044]  pte_alloc_one+0x13/0x40
> kern  :warn  : [   51.775048]  __handle_mm_fault+0xe9d/0xf70
> kern  :warn  : [   51.775050]  handle_mm_fault+0xdd/0x210
> kern  :warn  : [   51.775054]  __do_page_fault+0x2f1/0x520
> kern  :warn  : [   51.775056]  do_page_fault+0x30/0x120
> user  :notice: [   51.782517] 2019-11-06 23:56:11 truncate /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-75 -s 22906492245
>
> kern  :warn  : [   51.792048]  page_fault+0x3e/0x50
> kern  :warn  : [   51.792051] RIP: 0033:0x55c6ced07cfc
> user  :notice: [   51.798308] 2019-11-06 23:56:11 ./usemem --runtime 300 -f /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-74 --readonly 22906492245
>
> kern  :warn  : [   51.799413] Code: 00 00 e8 37 f6 ff ff 48 83 c4 08 c3 48 8d 3d 74 23 00 00 e8 56 f6 ff ff bf 01 00 00 00 e8 bc f6 ff ff 85 d2 74 08 48 8d 04 f7 <48> 8b 00 c3 48 8d 04 f7 48 89 30 b8 00 00 00 00 c3 48 89 f8 48 29
> kern  :warn  : [   51.799415] RSP: 002b:00007ffe889ebfe8 EFLAGS: 00010202
> user  :notice: [   51.808045] 2019-11-06 23:56:11 ./usemem --runtime 300 -f /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-75 --readonly 22906492245
>
> kern  :warn  : [   51.809437] RAX: 00007fd4c5400000 RBX: 00000000085cc600 RCX: 0000000000000018
> kern  :warn  : [   51.809438] RDX: 0000000000000001 RSI: 00000000085cc600 RDI: 00007fd48259d000
> kern  :warn  : [   51.809440] RBP: 00000000085cc600 R08: 000000005dc2ed1f R09: 00007ffe889ebfa0
> user  :notice: [   51.818030] 2019-11-06 23:56:11 truncate /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-76 -s 22906492245
>
> kern  :warn  : [   51.820780] R10: 00007ffe889ebfa0 R11: 0000000000000246 R12: 0000000042e63000
> kern  :warn  : [   51.820781] R13: 00007fd48259d000 R14: 00007ffe889ec08c R15: 0000000000000001
> kern  :warn  : [   51.820813] Mem-Info:
> user  :notice: [   51.829016] 2019-11-06 23:56:11 ./usemem --runtime 300 -f /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-76 --readonly 22906492245
>
> kern  :warn  : [   51.830751] active_anon:68712 inactive_anon:29360 isolated_anon:0
>                               active_file:497 inactive_file:48481807 isolated_file:32
>                               unevictable:259869 dirty:2 writeback:0 unstable:0
>                               slab_reclaimable:130937 slab_unreclaimable:70163
>                               mapped:48488420 shmem:30398 pagetables:97884 bounce:0
>                               free:169055 free_pcp:20966 free_cma:0
> user  :notice: [   51.838463] 2019-11-06 23:56:11 truncate /tmp/vm-scalability-tmp/vm-scalability/sparse-lru-file-mmap-read-77 -s 22906492245
>
> kern  :warn  : [   51.840634] Node 0 active_anon:109476kB inactive_anon:1400kB active_file:76kB inactive_file:47988152kB unevictable:281836kB isolated(anon):0kB isolated(file):0kB mapped:47993516kB dirty:4kB writeback:0kB shmem:1512kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
>

---8<---
Subject: [RFC v1] memcg: make memcg lru reclaim dirty pages
From: Hillf Danton <hdanton@sina.com>

The memcg lru was added on top of the high work, with the aim of
bypassing soft limit reclaim by hooking into kswapd's logic. Because
the memcg high work is currently unable to reclaim dirty pages, the
memcg lru adds the risk of premature OOM even for order-0
allocations, so being able to handle dirty pages is a must-have.

To add that capability, the memcg lru no longer goes through the
high work route but embeds in kswapd's page reclaim logic: it hands
the reclaimer the victim memcg, and kswapd takes care of the rest.
The hook function mem_cgroup_reclaim_high() is split into two parts,
begin and end, for a better round robin across memcgs with an eye on
over-reclaim.

Thanks to Rong Chen for testing.
Changes since v0
- fix build error
- split hook function into two parts

Reported-by: kernel test robot
Reported-by: kbuild test robot
Signed-off-by: Hillf Danton
---

--- b/include/linux/memcontrol.h
+++ d/include/linux/memcontrol.h
@@ -742,7 +742,8 @@ static inline void mod_lruvec_page_state
 	local_irq_restore(flags);
 }

-void mem_cgroup_reclaim_high(void);
+struct mem_cgroup *mem_cgroup_reclaim_high_begin(void);
+void mem_cgroup_reclaim_high_end(struct mem_cgroup *memcg);

 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
@@ -1130,7 +1131,11 @@ static inline void __mod_lruvec_slab_sta
 	__mod_node_page_state(page_pgdat(page), idx, val);
 }

-static inline void mem_cgroup_reclaim_high(void)
+static inline struct mem_cgroup *mem_cgroup_reclaim_high_begin(void)
+{
+	return NULL;
+}
+static inline void mem_cgroup_reclaim_high_end(struct mem_cgroup *memcg)
 {
 }

--- b/mm/memcontrol.c
+++ d/mm/memcontrol.c
@@ -2362,12 +2362,34 @@ static struct mem_cgroup *memcg_pinch_lr
 	return NULL;
 }

-void mem_cgroup_reclaim_high(void)
+struct mem_cgroup *mem_cgroup_reclaim_high_begin(void)
 {
-	struct mem_cgroup *memcg = memcg_pinch_lru();
+	struct mem_cgroup *memcg, *victim;

-	if (memcg)
-		schedule_work(&memcg->high_work);
+	memcg = victim = memcg_pinch_lru();
+	if (!memcg)
+		return NULL;
+
+	while ((memcg = parent_mem_cgroup(memcg)))
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			memcg_memory_event(memcg, MEMCG_HIGH);
+			memcg_add_lru(memcg);
+			break;
+		}
+
+	return victim;
+}
+
+void mem_cgroup_reclaim_high_end(struct mem_cgroup *memcg)
+{
+	while (memcg) {
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			memcg_memory_event(memcg, MEMCG_HIGH);
+			memcg_add_lru(memcg);
+			return;
+		}
+		memcg = parent_mem_cgroup(memcg);
+	}
 }

 static void reclaim_high(struct mem_cgroup *memcg,
--- b/mm/vmscan.c
+++ d/mm/vmscan.c
@@ -2932,6 +2932,29 @@ static inline bool compaction_ready(stru
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }

+#ifdef CONFIG_MEMCG
+static void mem_cgroup_reclaim_high(struct pglist_data *pgdat,
+				    struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_reclaim_high_begin();
+	if (memcg) {
+		unsigned long ntr = sc->nr_to_reclaim;
+
+		sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
+		shrink_node_memcg(pgdat, memcg, sc);
+		sc->nr_to_reclaim = ntr;
+	}
+	mem_cgroup_reclaim_high_end(memcg);
+}
+#else
+static void mem_cgroup_reclaim_high(struct pglist_data *pgdat,
+				    struct scan_control *sc)
+{
+}
+#endif
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2996,8 +3019,8 @@ static void shrink_zones(struct zonelist
 			if (zone->zone_pgdat == last_pgdat)
 				continue;

-			mem_cgroup_reclaim_high();
-			continue;
+			mem_cgroup_reclaim_high(zone->zone_pgdat, sc);
+			continue;

 			/*
 			 * This steals pages from memory cgroups over softlimit
@@ -3693,7 +3716,7 @@ restart:
 		if (sc.priority < DEF_PRIORITY - 2)
 			sc.may_writepage = 1;

-		mem_cgroup_reclaim_high();
+		mem_cgroup_reclaim_high(pgdat, &sc);
 		goto soft_limit_reclaim_end;

 		/* Call soft limit reclaim before calling shrink_node. */
--
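
A note for readers skimming the hunks: below is the new kswapd-side hook
restated as plain code, with comments added to spell out the round robin
described in the changelog. It is only an annotated restatement of the
mm/vmscan.c hunk above; shrink_node_memcg(), struct scan_control and the
other symbols are kernel internals, so treat it as an illustration of
the control flow rather than a standalone unit.

#ifdef CONFIG_MEMCG
static void mem_cgroup_reclaim_high(struct pglist_data *pgdat,
				    struct scan_control *sc)
{
	/*
	 * begin: detach one victim memcg from the memcg lru and
	 * re-queue its first still-over-high ancestor, so the lru
	 * keeps rotating while this victim is being shrunk.
	 */
	struct mem_cgroup *memcg = mem_cgroup_reclaim_high_begin();

	if (memcg) {
		unsigned long ntr = sc->nr_to_reclaim;

		/*
		 * Cap the batch at SWAP_CLUSTER_MAX so a single round
		 * cannot over-reclaim; the victim is revisited on a
		 * later round if it is still over its high limit.
		 */
		sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
		shrink_node_memcg(pgdat, memcg, sc);
		sc->nr_to_reclaim = ntr;
	}

	/*
	 * end: re-queue the victim, or its first over-high ancestor,
	 * on the memcg lru if one batch was not enough, completing
	 * the round robin.
	 */
	mem_cgroup_reclaim_high_end(memcg);
}
#endif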