From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 13 Apr 2022 09:24:28 -0700
From: Omar Sandoval
To: Baoquan He
Cc: Chris Down, linux-mm@kvack.org, kexec@lists.infradead.org,
 Andrew Morton, Uladzislau Rezki, Christoph Hellwig, x86@kernel.org,
 kernel-team@fb.com
Subject: Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after
 reading from /proc/vmcore
References: <52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com>
On Fri, Apr 08, 2022 at 11:02:47AM +0800, Baoquan He wrote:
> On 04/07/22 at 03:36pm, Chris Down wrote:
> > Omar Sandoval writes:
> > > From: Omar Sandoval
> > >
> > > Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
> > > vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
> > > lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
> > > purge the vmap areas instead of doing it lazily.
> > >
> > > Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
> > > context") moved the purging from the vunmap() caller to a worker thread.
> > > Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
> > > (possibly forever). For example, consider the following scenario:
> > >
> > > 1. A thread reads from /proc/vmcore. This eventually calls
> > >    __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
> > >    vmap_lazy_nr to lazy_max_pages() + 1.
> > > 2. It then calls free_vmap_area_noflush() (via iounmap()), which adds 2
> > >    pages (one page plus the guard page) to the purge list and to
> > >    vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so
> > >    drain_vmap_work is scheduled.
> > > 3. The thread returns from the kernel and is scheduled out.
> > > 4. The worker thread is scheduled in and calls drain_vmap_area_work().
> > >    It frees the 2 pages on the purge list. vmap_lazy_nr is now
> > >    lazy_max_pages() + 1.
> > > 5. This is still over the threshold, so it tries to purge areas again,
> > >    but doesn't find anything.
> > > 6. Repeat step 5.
> > >
> > > If the system is running with only one CPU (which is typical for kdump)
> > > and preemption is disabled, then this never makes forward progress:
> > > there aren't any more pages to purge, so the worker hangs. If there is
> > > more than one CPU or preemption is enabled, then the worker thread
> > > spins forever in the background. (Note that if there were already
> > > pages to be purged when set_iounmap_nonlazy() was called, this bug is
> > > avoided.)
> > >
> > > This can be reproduced with anything that reads from /proc/vmcore
> > > multiple times, e.g., vmcore-dmesg /proc/vmcore.
> > >
> > > It turns out that improvements to vmap() over the years have obsoleted
> > > the need for this "optimization". I benchmarked
> > > `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a
> > > system with a 32GB vmcore. The test was run on 5.17, on 5.18-rc1 with
> > > a fix that avoided the hang, and on 5.18-rc1 with
> > > set_iounmap_nonlazy() removed entirely:
> > >
> > >       |  5.17|5.18+fix|5.18+removal
> > >     4k|40.86s|  40.09s|      26.73s
> > >     1M|24.47s|  23.98s|      21.84s
> > >
> > > The removal was the fastest (by a wide margin with 4k reads). This
> > > patch removes set_iounmap_nonlazy().
> > >
> > > Signed-off-by: Omar Sandoval
> >
> > It probably doesn't matter, but it may be worth adding a Fixes tag just
> > to make sure anyone getting this without context understands that
> > 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context")
> > shouldn't reach further rcs without this. It's unlikely that would
> > happen anyway, though.
> >
> > Nice use of a bug as an impetus to clean things up :-) Thanks!
>
> Since the redhat mail server is having issues, the body of the patch
> shows up empty in my mail client, so I'm replying here to comment.
>
> As I replied to Omar in v1, I think this is a great fix.
> It would be
> great to also state whether this is a real issue that breaks things
> (then add a 'Fixes' tag and Cc stable like "Cc: # 5.17"), or an
> improvement found by code inspection.
>
> I bring this up because in distros, e.g. in our rhel8, we maintain old
> kernels and backport necessary patches into them, and patches with a
> 'Fixes' tag are definitely good candidates. This is important for LTS
> kernels too.
>
> Thanks
> Baoquan

Hi, Baoquan,

Sorry I missed your replies. I'll answer the questions from your first
email here.

> I am wondering if this is a real issue you met, or you just found it
> by code inspecting

I hit this issue with the test suite for drgn
(https://github.com/osandov/drgn). We run the test cases in a virtual
machine on various kernel versions
(https://github.com/osandov/drgn/tree/main/vmtest). Part of the test
suite crashes the kernel in order to run some tests against /proc/vmcore
(https://github.com/osandov/drgn/blob/13144eda119790cdbc11f360c15a04efdf81ae9a/setup.py#L213,
https://github.com/osandov/drgn/blob/main/vmtest/enter_kdump.py,
https://github.com/osandov/drgn/tree/main/tests/linux_kernel/vmcore).
When I tried v5.18-rc1 configured with !SMP and !PREEMPT, that part of
the test suite got stuck, which is how I found this issue.

> I am wondering how your vmcore dumping is handled. Asking this because
> we usually use makedumpfile utility

In production at Facebook, we don't run drgn directly against
/proc/vmcore. We use makedumpfile and inspect the captured file with
drgn after rebooting.

> While using makedumpfile, we use mmap which is 4M at one time by
> default, then process the content. So the copy_oldmem_page() may only
> be called during elfcorehdr and notes reading.

We also use vmcore-dmesg
(https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/vmcore-dmesg)
on /proc/vmcore before calling makedumpfile.
From what I can tell, that uses read()/pread()
(https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/util_lib/elf_info.c),
so it would also hit this issue.

I'll send a v3 adding Fixes: 690467c81b1a ("mm/vmalloc: Move draining
areas out of caller context"). I don't think a stable tag is necessary,
since this was introduced in v5.18-rc1 and hasn't been backported as far
as I can tell.

Thanks,
Omar