From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AAB96C6FD1C for ; Fri, 24 Mar 2023 03:30:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B9A636B0072; Thu, 23 Mar 2023 23:30:34 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B4AB66B0074; Thu, 23 Mar 2023 23:30:34 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A11E26B0075; Thu, 23 Mar 2023 23:30:34 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 8ED7A6B0072 for ; Thu, 23 Mar 2023 23:30:34 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 4BEA51205B5 for ; Fri, 24 Mar 2023 03:30:34 +0000 (UTC) X-FDA: 80602364388.27.C277D9F Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) by imf05.hostedemail.com (Postfix) with ESMTP id E6F9E100014 for ; Fri, 24 Mar 2023 03:30:31 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=RvtSOW0r; dmarc=none; spf=none (imf05.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1679628632; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=5HhTrG9WcEQAcau8U0gWMW/XrW+irv7dgvDxj7LMXiI=; b=0wZkRsbvUAGIt44BXrS6TvgDl2G5OY0vOElX09roeW3VZFYdab2qIlND8e4HKCTbSON6Jw sbvb1KP6sYMOyc7uthu5C3oq1DE7Wrrx2k4hi0TGUbj93/fnE2pTbeF80MndEn3xeY40EY bTr7k+BsCZ/LBR7z2uSoXCfH/BCFlGo= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=RvtSOW0r; dmarc=none; spf=none (imf05.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1679628632; a=rsa-sha256; cv=none; b=Jo1VxSKR3FgHvuJzV1Cn+EWjujl9LdNjzCHtKp/3QKeSjnllflzw6WUnFcnvTmJUPtHhoM aYMaH10mvJIfZsK2/y4CMEyh9ca5R71xOiu8fr9bY6Od0mY88CDSe0gvGHB9s8PjA8Rp8c 6Ye26W9okU7HTMKsG7kwT4lqvt8OE5Y= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Transfer-Encoding: Content-Type:MIME-Version:References:Message-ID:Subject:Cc:To:From:Date: Sender:Reply-To:Content-ID:Content-Description; bh=5HhTrG9WcEQAcau8U0gWMW/XrW+irv7dgvDxj7LMXiI=; b=RvtSOW0rmTphxl0QHdUuMRLfxA MVbWm4euMSexZgthyfOOjs5NnDNFFuwYQzrClXzhEIi1/CHbbOYlAgEp12LFvmVSeM6p+0sC45qv+ MmhtYIwDHpFai/J0QNRGn9Tx40uWmD0tO0oDq9ti6CiKalBL+XwSjvy3P9vRP1xOCiO6WIFOcqcHR UZKyS0m+etof/p/S7mwvpjtqLEjVhwFR8creVzx4T+qf7Lf4W2B6ibMgAoGPjfgiTgpelheJ6o502 hJf4/Dm3Yg1zZ/aeiov2u+2OS/O/pxMyQH6RIlcnQCrVISap107u7f+Pxp/8+CNJa6FeEawG1PluM 6ahb9Apw==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1pfY8A-004X9T-Tg; Fri, 24 Mar 2023 03:30:19 +0000 Date: Fri, 24 Mar 2023 03:30:18 +0000 From: Matthew Wilcox To: Hugh Dickins Cc: Song Liu , Song Liu , Jiri Olsa , David Stevens , linux-mm@kvack.org, Andrew Morton , Peter Xu , "Kirill A . Shutemov" , Yang Shi , David Hildenbrand , Jiaqi Yan , linux-kernel@vger.kernel.org Subject: Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag Message-ID: References: <20230307052036.1520708-1-stevensd@google.com> <20230307052036.1520708-4-stevensd@google.com> <866d1a75-d462-563-dfd7-1aa2971a285b@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: E6F9E100014 X-Stat-Signature: xzwp3akme5bzb56nzht3pnib9m3qus4y X-HE-Tag: 1679628631-787208 X-HE-Meta: U2FsdGVkX19AnvHfYOSsp+JIuwlQVpIKXxa9gBHFFEMGhfKejrkrzDhqMyplEbHkaYsgzcfW++La+f9SeIlCc9gsUpEPziGPvPrGHxhlZQ/Uj6q5vY13FN26fmpPlytkJfiS2pnavPiqmT6sdMJUcvGwVMkAnB7a9ZM7Gp8NMNTiiwTbvBte/mo7/BAECiGdbitEcIEFaNvUHNREEO6R3JQIZqKqoA6bTayboBhHtDadvrh7Pxq2JZlYePeLvO8owc2jl5ioGxJ+TFLwpCztgCOn4LtxVA5oLPl7e4gn7Y19qJS7rMG0TF9T5re55EO8goA2J0sthqoC+x4mhJyMlxZm67xb5lRvpq0dBhrUEIpwc/TCDWXegipc2K04xVUQThKfk+4IVD6E6Z5AGD6Ad1tspau28H9IlHURBXnhhHByAdDpT++oM5qo1JO/eX48Pp6sUwj/KNQ54l5U0isoazyGoSCk+uuDVaN6zjVmE8fYGmwjnx+peZW610JGRcRAFme9be5bQBwAF3T4Two4suK62gyMkQzwfnEvuNeAoflP/jiLA3HRa+nFC0ukevlxyE7R0Xd71hAyRc6YQ/qCcmJSTejlWezrxFsp0jPHDQ/vubAoUlpL7CXKUoQFoiQQE/KNIFHoOdWYJ4HD/ACq+Z6Gs5zbbO590Nq4Grb6RwyRzxcbzhU89Y02AVCSwr6RkiDmHg8JUX50xHDtQaVQ7yh4RC+urYUwk1nu3UyC5hWizwoOfOXXHE66nFknG4bhTKAkm+3Nr7sos4ipNkTIhpydo6lEreE1gPUgh/baXRGqNrnGtwJl8v6991CceUYx4OfDGpYKgKMjAWjUvr8q18R6+ctx6Y0y6jug4QIrNynLQmXSXPvgsML8MitgOIVw1+MjscaD5L0ps0LXHZPSrVTtNDIAP0dKVztPU7af8QzwNiqmYoaAnczhME1IOFAghST6ag/GWN5NjjHN0LM cfq/VqkG vZNSIKchrKmVvMP6IkHj6LeRPo73lXzeuABLP/4jSruECUFD27K+Le3nmksHDgZoIXEIZ0/L/Pd9IaxAmI432tb/3GAdaC9Btur/2+dj3SSMnWMWgnNPHEz460GPnOfDUyPZf1QdbxWKx6rJvUOdOH7fQZa8kgK7f35xg04i0xrLYDD3pj+ZMOV+3V2uic5vS8mR1H+iXcfRw7ZVxHRWkmh6Mcg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Mar 23, 2023 at 06:56:34PM -0700, Hugh Dickins wrote: > On Thu, 23 Mar 2023, Song Liu wrote: > > On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox wrote: > > > > > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote: > > > > On an earlier audit, for different reasons, I did also run across > > > > lib/buildid.c build_id_parse() using find_get_page() without checking > > > > PageUptodate() - looks as if it might do the wrong thing if it races > > > > with khugepaged collapsing text to huge, and should probably have a > > > > similar fix. > > > > > > That shouldn't be using find_get_page(). It should probably use > > > read_cache_folio() which will actually read in the data if it's not > > > present in the page cache, and return an ERR_PTR if the data couldn't > > > be read. > > > > build_id_parse() can be called from NMI, so I don't think we can let > > read_cache_folio() read-in the data. > > Interesting. > > This being the same Layering_Violation_ID which is asking for a home in > everyone's struct file? (Okay, I'm being disagreeable, no need to answer!) Yes, that's the one. Part of the BPF "splatter stuff all across the kernel that we don't really understand" (see, I can be just as disagreeable!) > I think even the current find_get_page() is unsafe from NMI: imagine that > NMI interrupting a sequence (maybe THP collapse or splitting, maybe page > migration, maybe others) when the page refcount has been frozen to 0: > you'll just have to reboot the machine? Correct; if it's been frozen by this CPU, we'll spin forever. I think the other conditions where we retry are temporary and caused by something _another_ CPU is doing. For example, if _this_ CPU is in the middle of modifying the tree when an NMI happens, we won't hit the xas_retry() case. That can only happen if we've observed something another CPU did, and then a second write happened from that same other CPU. The easiest example of that would be that we hit this kind of race: CPU 0 CPU 1 load root of tree, points to node store entry in root of tree wmb(); store RETRY entry in node load entry from node, observe RETRY entry The retry will happen and we'll observe the new state of the tree with the entry we're looking for in the root. If CPU 1 takes an NMI and interrupts itself, it will always see a consistent tree. > I guess the RCU-safety of find_get_page() implies that its XArray basics > are safe in NMI; but you need a low-level variant (xas_find()?) which > does none of the "goto retry"s, and fails immediately if anything is > wrong - including !PageUptodate. The Uptodate flag check needs to be done by the caller; the find_get_page() family return !uptodate pages. But find_get_page() does not advertise itself as NMI-safe. And I think it's wrong to try to make it NMI-safe. Most of the kernel is not NMI-safe. I think it's incumbent on the BPF people to get the information they need ahead of taking the NMI. NMI handlers are not supposed to be doing a huge amount of work! I don't really understand why it needs to do work in NMI context; surely it can note the location of the fault and queue work to be done later (eg on irq-enable, task-switch or return-to-user)