Subject: Re: [PATCH 1/2] mm: hwpoison: don't drop slab caches for offlining non-LRU page
From: David Hildenbrand
To: Yang Shi, naoya.horiguchi@nec.com, osalvador@suse.de, tdmackey@twitter.com, akpm@linux-foundation.org, corbet@lwn.net
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20210816180909.3603-1-shy828301@gmail.com> <2ea04811-a9a3-0fe6-38aa-222e79ded09a@redhat.com>
In-Reply-To: <2ea04811-a9a3-0fe6-38aa-222e79ded09a@redhat.com>
Organization: Red Hat
Date: Mon, 16 Aug 2021 21:04:48 +0200

On 16.08.21 21:02, David Hildenbrand wrote:
> On 16.08.21 20:09, Yang Shi wrote:
>> In the current implementation of soft offline, if non-LRU page is met,
>> all the slab caches will be dropped to free the page then offline.
>> But if the page is not slab page all the effort is wasted in vain. Even
>> though it is a slab page, it is not guaranteed the page could be freed
>> at all.
>>
>> However the side effect and cost is quite high. It does not only drop
>> the slab caches, but also may drop a significant amount of page caches
>> which are associated with inode caches. It could make the most
>> workingset gone in order to just offline a page. And the offline is not
>> guaranteed to succeed at all, actually I really doubt the success rate
>> for real life workload.
>>
>> Furthermore the worse consequence is the system may be locked up and
>> unusable since the page cache release may incur huge amount of works
>> queued for memcg release.
>>
>> Actually we ran into such unpleasant case in our production environment.
>> Firstly, the workqueue of memory_failure_work_func is locked up as
>> below:
>>
>> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
>>     in-flight: 409271:memory_failure_work_func
>>     pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
>> workqueue mm_percpu_wq: flags=0x8
>>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>     pending: vmstat_update
>> workqueue cgroup_destroy: flags=0x0
>>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
>>     pending: css_release_work_fn
>>
>> There were over 12K css_release_work_fn queued, and this caused a few
>> lockups due to the contention of worker pool lock with IRQ disabled, for
>> example:
>>
>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
>> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
>> Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
>> Workqueue: events memory_failure_work_func
>> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
>> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
>> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
>> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
>> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
>> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
>> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
>> FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
>> Call Trace:
>>  __queue_work+0xd6/0x3c0
>>  queue_work_on+0x1c/0x30
>>  uncharge_batch+0x10e/0x110
>>  mem_cgroup_uncharge_list+0x6d/0x80
>>  release_pages+0x37f/0x3f0
>>  __pagevec_release+0x1c/0x50
>>  __invalidate_mapping_pages+0x348/0x380
>>  ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
>>  inode_lru_isolate+0x10a/0x160
>>  ? iput+0x1d0/0x1d0
>>  __list_lru_walk_one+0x7b/0x170
>>  ? iput+0x1d0/0x1d0
>>  list_lru_walk_one+0x4a/0x60
>>  prune_icache_sb+0x37/0x50
>>  super_cache_scan+0x123/0x1a0
>>  do_shrink_slab+0x10c/0x2c0
>>  shrink_slab+0x1f1/0x290
>>  drop_slab_node+0x4d/0x70
>>  soft_offline_page+0x1ac/0x5b0
>>  ? dev_mce_log+0xee/0x110
>>  ? notifier_call_chain+0x39/0x90
>>  memory_failure_work_func+0x6a/0x90
>>  process_one_work+0x19e/0x340
>>  ? process_one_work+0x340/0x340
>>  worker_thread+0x30/0x360
>>  ? process_one_work+0x340/0x340
>>  kthread+0x116/0x130
>
> Just curious, who actually ends up calling soft_offline_page() ? I
> cannot really make sense of this, looking at upstream Linux.
>
> I can spot
>
> a) drivers/base/memory.c: /sys/devices/system/memory/soft_offline_page
> seems to be a testing interface
>
> b) MADV_SOFT_OFFLINE seems to be a testing interface as well
>
> c) arch/parisc/kernel/pdt.c doesn't apply to your case I guess?
>
> I'm just wondering who ends up calling soft_offline_page() in a
> production environment and via which call path. I'm most probably
> missing something.
>

... and I missed memory_failure_work_func() with MF_SOFT_OFFLINE :)
Ignore my question :)

-- 
Thanks,

David / dhildenb