From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 898DBEE49A8 for ; Mon, 21 Aug 2023 03:58:26 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 13F7F900003; Sun, 20 Aug 2023 23:58:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0EFB36B007D; Sun, 20 Aug 2023 23:58:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EF979900003; Sun, 20 Aug 2023 23:58:25 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id E0BCE6B007B for ; Sun, 20 Aug 2023 23:58:25 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id B8B67B203B for ; Mon, 21 Aug 2023 03:58:25 +0000 (UTC) X-FDA: 81146754570.24.92D7269 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) by imf06.hostedemail.com (Postfix) with ESMTP id AF38518001A for ; Mon, 21 Aug 2023 03:58:23 +0000 (UTC) Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=WMEplBHX; spf=none (imf06.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1692590304; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CGxdXSBzJWwWURY8ZP4wMj116bZ6iqKmFjUPQ49ecdc=; b=BlK+zo2tdVxgdyVOXb6vEJ54bt2FJ1Ub+VWUD6cYSi0rNb85NYuvpFT4s3Z3Co0r6wN9d1 DCrLH2hdVsB+Z37LnJILYidXsTkU2ySsuQwyA3K3FReWfYZhDtoJTBTDC6uVABRH0iyWQs XzGJZOYUy8r2k8Jo4MZxjCO86uJidD8= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1692590304; a=rsa-sha256; cv=none; b=ONx6oobwEGTuNiBVCx1H62JKNzgh4aMNHCXjiDaHtE3leRUHux7UY4NdBUBKA/lQWP8XYH TCS9MNl7GBXLUnmUPR1BiM/cSyPfMrwnCcGVjbRwW95O9lh/dnP+jQN40crrKA3WWRVzwI boXqeeIQPfU2wqD9FQU8BC6IdDeBPEA= ARC-Authentication-Results: i=1; imf06.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=WMEplBHX; spf=none (imf06.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org; dmarc=none DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=CGxdXSBzJWwWURY8ZP4wMj116bZ6iqKmFjUPQ49ecdc=; b=WMEplBHXTiLDJzjGH2iUVo8B68 0O85ITVzV8qv9LTrWCJsuU9eDYguuJ1kVNabXeAHn4rZibgcQWCnXaS9xeBJRAxdxadEz8bUFaBYT /xiJkoUWV9t2VaSEiVvGKzSczS1MRlYUPxJFPY1B2IHZVr69q4S3qKpmaurO/aYHE0j9GOJ2NCh0f efTk5KQYLLsfpalDH2BJZ3B6zoCGQwQ7E5Kmy/HZrEuIR8YTVJgtO+MxGo3ymm6oW4Wjefdm3dYEC SGyBGxYmvvc1WoZFyPYSn5CtflFaLXxmLnk2Dw+nz1t7cmd/JSYV4mCuUzuIL7Ei6j7Plel4OLhmr Tne/Ox/Q==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1qXw3W-007z4Y-Nm; Mon, 21 Aug 2023 03:58:18 +0000 Date: Mon, 21 Aug 2023 04:58:18 +0100 From: Matthew Wilcox To: Mateusz Guzik Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH] mm: remove unintentional voluntary preemption in get_mmap_lock_carefully Message-ID: References: <20230820104303.2083444-1-mjguzik@gmail.com> <20230821011303.hoeqjbmjaxajh255@f> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230821011303.hoeqjbmjaxajh255@f> X-Stat-Signature: 1m5m5kbghbqgj1oi9zw733ihe9ar69qs X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: AF38518001A X-Rspam-User: X-HE-Tag: 1692590303-719319 X-HE-Meta: U2FsdGVkX1/hXkds4QhwLlDERpnj1jYZ9XlVLGDKteo8BuCUruMkLJPDd5Zi2Y+Dp3D7OWzxWl6AbVpLMojaVm8KLGqnpGQOq1iGalpL0HSpl9dv3t/TbHOG7i2GYocOUR9Tjr5GggFgnXoycgnxILdmna/7j2ClEobEY1dNF/HdUKUSJOF+ReneA/QV3ombFte0MptWDXm6grBhzXd36U4zmsfLmDtTLIIec9CFpB9El5uAsmaESMIsOnagsicRDXNaAImGOGDodgxAdELRhzAEn4XSHqc32ph6lJEukJIRMy6KTKCo2GuowbMmjeQw1dAxpx8Dw6E1eggt9Jyxr2+xL5Hitc/vkXjQvpp96A5Z61LTjvwKt7UBUHlPa18W6M1x7gUOHakUhwtNtHlHd5V28JszcH7DwwdGD9hbUplmIPuHTR9VFlinpqU3Td0ZcrQ2zhMtEIAAz/qPxb5Ryhtfsj2YHoSYk22QfmxImTRMvCBGfmO0igjiWZvLWsNIH+7BT9jzU7A0OlhJu5W9lIdQNphA/jR+Hinqdm2vj6Wb/OorgA5GaIiqhupSO+8296IPQ+XQhmpRWYfT/sKmm3Egv9E1Q3RlZv2oJ0l8Nn7HVbPEZIjfaKnS6PPXD4TSsULH33I3FJb3JWMbo4unCybLG5J6XbjANeEd7KBZGOlGC1XetUbFTSDraS5N8K/RLDDlzZ5ReCYYbP9f9GoQvW33FKQtjJJk9u/mE/1MIi0hPAfw2HIWIrEHMo7+koySusG7xO3uq1/UpUHkQ0rPikqcpiZsnGbO4spyv7Qf3qrXEkBNNiI+XhZNKxyEFv+lKYwwCtaDacxmh13yP77Phvsu7yXUOhaTcd+P0W446upgSbyiTvW/3JZehUvFf26U/Xxkkux+Suplj1aSEHxnJY9WECjuynXROrcs5EdzzUK3Q7pQL4B9/nG5dnYgaQ6aGYb783YGs8aEVdHn/vh 3bDBrThn 6Mqz9+m3V6yTBlO3NUnpY/dpfuUOZEyRGzQO7t1dxLi94YOR9FjTWX9ztcM3J9xc603AtVag9loSRSKSC0qHWUT7UmQ3z71XHVe7n2WSZTM6kkd9bH8fZVjaJmsMg95yoVtfLG5Hzztn3iaaMyB/0FzxpiMG/sv89duXbcvmALnDm9BFfAn+z56cQdQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Aug 21, 2023 at 03:13:03AM +0200, Mateusz Guzik wrote: > On Sun, Aug 20, 2023 at 07:12:16PM +0100, Matthew Wilcox wrote: > > On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote: > > > Found by checking off-CPU time during kernel build (like so: > > > "offcputime-bpfcc -Ku"), sample backtrace: > > > finish_task_switch.isra.0 > > > __schedule > > > __cond_resched > > > lock_mm_and_find_vma > > > do_user_addr_fault > > > exc_page_fault > > > asm_exc_page_fault > > > - sh (4502) > > > > Now I'm awake, this backtrace really surprises me. Do we not check > > need_resched on entry? It seems terribly unlikely that need_resched > > gets set between entry and getting to this point, so I guess we must > > not. > > > > I suggest the version of the patch which puts might_sleep() before the > > mmap_read_trylock() is the right one to apply. It's basically what > > we've done forever, except that now we'll be rescheduling without the > > mmap lock held, which just seems like an overall win. > > > > I can't sleep and your response made me curious, is that really safe > here? > > As I wrote in another email, the routine is concerned with a case of the > kernel faulting on something it should not have. For a case like that I > find rescheduling to another thread to be most concerning. Hmm, initially I didn't see it, but you're concerned with something like: foo->bar = NULL; spin_lock(&foo->lock); printk("%d\n", foo->bar->baz); And yeah, scheduling away in that case would be bad. > That said I think I found a winner -- add need_resched() prior to > trylock. > > This adds less work than you would have added with might_sleep (a func > call), still respects the preemption point, dodges exception table > checks in the common case and does not switch away if the there is > anything fishy going on. > > Or just do that might_sleep. The might_sleep() is clearly safe, but I thought of a different take on the problem you've found, which is that we used to check need_resched on _every_ page fault, because we used to take the mmap_lock on every page fault. Now we only check it on the minority of page faults which can't be handled under the VMA lock. But we can't just slam a might_resched() into the start of the fault handler, because of the problem you outlined above. So how about something like this: +++ b/arch/x86/mm/fault.c @@ -1365,6 +1365,7 @@ void do_user_addr_fault(struct pt_regs *regs, if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) vma_end_read(vma); + might_resched(); if (!(fault & VM_FAULT_RETRY)) { count_vm_vma_lock_event(VMA_LOCK_SUCCESS); goto done; We found a VMA, so we know it isn't a NULL pointer dereference. And we've released the VMA lock at this point, so we won't be blocking anything from making progress. I'm not thrilled about having to replicate this in each architecture, but I also don't love putting it in lock_vma_under_rcu() (since someone might call that who actually can't schedule -- it certainly wouldn't be obvious from the function name). Then we can leave the might_sleep() exactly where it is in get_mmap_lock_carefully(); it's really unlikely to trigger a reschedule.