Date: Sun, 6 Feb 2022 14:08:47 -0800 (PST)
From: David Rientjes
To: Mel Gorman
cc: Andrew Morton, Hugh Dickins, Michal Hocko, Vlastimil Babka, Rik van Riel, Linux-MM, LKML
Subject: Re: [PATCH] mm: vmscan: remove deadlock due to throttling failing to make progress
In-Reply-To: <20220203100326.GD3301@suse.de>
References: <20220203100326.GD3301@suse.de>

On Thu, 3 Feb 2022, Mel Gorman wrote:

> A soft lockup bug in kcompactd was reported in a private bugzilla with
> the following visible in dmesg;
>
> [15980.045209][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479]
> [16008.044989][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479]
> [16036.044768][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479]
> [16064.044548][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479]
>
> The machine had 256G of RAM with no swap and an earlier failed allocation
> indicated that node 0 where kcompactd was run was potentially
> unreclaimable;
>
> Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB
> inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB
> mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp:
> 0kB shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB
> kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes
>
> Vlastimil Babka investigated a crash dump and found that a task migrating pages
> was trying to drain PCP lists;
>
> PID: 52922 TASK: ffff969f820e5000 CPU: 19 COMMAND: "kworker/u128:3"
> #0 [ffffaf4e4f4c3848] __schedule at ffffffffb840116d
> #1 [ffffaf4e4f4c3908] schedule at ffffffffb8401e81
> #2 [ffffaf4e4f4c3918] schedule_timeout at ffffffffb84066e8
> #3 [ffffaf4e4f4c3990] wait_for_completion at ffffffffb8403072
> #4 [ffffaf4e4f4c39d0] __flush_work at ffffffffb7ac3e4d
> #5 [ffffaf4e4f4c3a48] __drain_all_pages at ffffffffb7cb707c
> #6 [ffffaf4e4f4c3a80] __alloc_pages_slowpath.constprop.114 at ffffffffb7cbd9dd
> #7 [ffffaf4e4f4c3b60] __alloc_pages at ffffffffb7cbe4f5
> #8 [ffffaf4e4f4c3bc0] alloc_migration_target at ffffffffb7cf329c
> #9 [ffffaf4e4f4c3bf0] migrate_pages at ffffffffb7cf6d15
> #10 [ffffaf4e4f4c3cb0] migrate_to_node at ffffffffb7cdb5aa
> #11 [ffffaf4e4f4c3da8] do_migrate_pages at ffffffffb7cdcf26
> #12 [ffffaf4e4f4c3e88] cpuset_migrate_mm_workfn at ffffffffb7b859d2
> #13 [ffffaf4e4f4c3e98] process_one_work at ffffffffb7ac45f3
> #14 [ffffaf4e4f4c3ed8] worker_thread at ffffffffb7ac47fd
> #15 [ffffaf4e4f4c3f10] kthread at ffffffffb7acbdc6
> #16 [ffffaf4e4f4c3f50] ret_from_fork at ffffffffb7a047e2
>
> The root of the problem is that kcompact0 is not rescheduling on a CPU
> while a task that has isolated a large number of the pages from the
> LRU is waiting on kcompact0 to reschedule so the pages can be released.
> While shrink_inactive_list() only loops once around too_many_isolated,
> reclaim can continue without rescheduling if sc->skipped_deactivate ==
> 1 which could happen if there was no file LRU and the inactive anon list
> was not low.
>
> Debugged-by: Vlastimil Babka
> Signed-off-by: Mel Gorman

Acked-by: David Rientjes