From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 13 Apr 2022 09:24:28 -0700
From: Omar Sandoval
To: Baoquan He
Cc: Chris Down, linux-mm@kvack.org, kexec@lists.infradead.org,
 Andrew Morton, Uladzislau Rezki, Christoph Hellwig, x86@kernel.org,
 kernel-team@fb.com
Subject: Re: [PATCH v2] mm/vmalloc: fix spinning drain_vmap_work after
 reading from /proc/vmcore
References: <52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com>
On Fri, Apr 08, 2022 at 11:02:47AM +0800, Baoquan He wrote:
> On 04/07/22 at 03:36pm, Chris Down wrote:
> > Omar Sandoval writes:
> > > From: Omar Sandoval
> > >
> > > Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
> > > vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
> > > lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
> > > purge the vmap areas instead of doing it lazily.
> > >
> > > Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
> > > context") moved the purging from the vunmap() caller to a worker thread.
> > > Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
> > > (possibly forever). For example, consider the following scenario:
> > >
> > > 1. A thread reads from /proc/vmcore. This eventually calls
> > >    __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
> > >    vmap_lazy_nr to lazy_max_pages() + 1.
> > > 2. It then calls free_vmap_area_noflush() (via iounmap()), which adds 2
> > >    pages (one page plus the guard page) to the purge list and to
> > >    vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so
> > >    drain_vmap_work is scheduled.
> > > 3. The thread returns from the kernel and is scheduled out.
> > > 4. The worker thread is scheduled in and calls drain_vmap_area_work().
> > >    It frees the 2 pages on the purge list. vmap_lazy_nr is now
> > >    lazy_max_pages() + 1.
> > > 5. This is still over the threshold, so it tries to purge areas again,
> > >    but doesn't find anything.
> > > 6. Repeat step 5.
> > >
> > > If the system is running with only one CPU (which is typical for kdump)
> > > and preemption is disabled, then this never makes forward progress:
> > > there aren't any more pages to purge, so the worker hangs. If there is
> > > more than one CPU or preemption is enabled, then the worker thread
> > > spins forever in the background. (Note that if there were already
> > > pages to be purged when set_iounmap_nonlazy() was called, this bug is
> > > avoided.)
> > >
> > > This can be reproduced with anything that reads from /proc/vmcore
> > > multiple times, e.g., vmcore-dmesg /proc/vmcore.
> > >
> > > It turns out that improvements to vmap() over the years have obsoleted
> > > the need for this "optimization". I benchmarked
> > > `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a
> > > system with a 32GB vmcore. The test was run on 5.17, on 5.18-rc1 with
> > > a fix that avoided the hang, and on 5.18-rc1 with
> > > set_iounmap_nonlazy() removed entirely:
> > >
> > >       |  5.17|5.18+fix|5.18+removal
> > >     4k|40.86s|  40.09s|      26.73s
> > >     1M|24.47s|  23.98s|      21.84s
> > >
> > > The removal was the fastest (by a wide margin with 4k reads). This
> > > patch removes set_iounmap_nonlazy().
> > >
> > > Signed-off-by: Omar Sandoval
> >
> > It probably doesn't matter, but it may be worth adding a Fixes tag just
> > to make sure anyone getting this without context understands that
> > 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context")
> > shouldn't reach further rcs without this. It's unlikely that would
> > happen anyway, though.
> >
> > Nice use of a bug as an impetus to clean things up :-) Thanks!
>
> Since the redhat mail server is having issues, the body of the patch
> shows up empty in my mail client, so I'm replying here to comment.
>
> As I replied to Omar in v1, I think this is a great fix.
> It would be
> great to also state whether this is a real issue that breaks things
> (then add a 'Fixes' tag and Cc stable like "Cc: # 5.17"), or an
> improvement found by code inspection.
>
> I bring this up because in distros, e.g. in our rhel8, we maintain old
> kernels and backport necessary patches into them, and patches with a
> 'Fixes' tag are definitely good candidates. This is important for LTS
> kernels too.
>
> Thanks
> Baoquan

Hi, Baoquan,

Sorry I missed your replies. I'll answer the questions from your first
email here.

> I am wondering if this is a real issue you met, or you just found it
> by code inspecting

I hit this issue with the test suite for drgn
(https://github.com/osandov/drgn). We run the test cases in a virtual
machine on various kernel versions
(https://github.com/osandov/drgn/tree/main/vmtest). Part of the test
suite crashes the kernel in order to run some tests against /proc/vmcore
(https://github.com/osandov/drgn/blob/13144eda119790cdbc11f360c15a04efdf81ae9a/setup.py#L213,
https://github.com/osandov/drgn/blob/main/vmtest/enter_kdump.py,
https://github.com/osandov/drgn/tree/main/tests/linux_kernel/vmcore).
When I tried v5.18-rc1 configured with !SMP and !PREEMPT, that part of
the test suite got stuck, which is how I found this issue.

> I am wondering how your vmcore dumping is handled. Asking this because
> we usually use makedumpfile utility

In production at Facebook, we don't run drgn directly against
/proc/vmcore. We use makedumpfile and inspect the captured file with
drgn after rebooting.

> While using makedumpfile, we use mmap which is 4M at one time by
> default, then process the content. So the copy_oldmem_page() may only
> be called during elfcorehdr and notes reading.

We also use vmcore-dmesg
(https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/vmcore-dmesg)
on /proc/vmcore before calling makedumpfile.
From what I can tell, that uses read()/pread()
(https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/util_lib/elf_info.c),
so it would also hit this issue.

I'll send a v3 adding Fixes: 690467c81b1a ("mm/vmalloc: Move draining
areas out of caller context"). I don't think a stable tag is necessary,
since this was introduced in v5.18-rc1 and hasn't been backported as far
as I can tell.

Thanks,
Omar