Date: Mon, 28 Dec 2020 20:35:06 -0800 (PST)
From: Hugh Dickins
To: "Kirill A. Shutemov"
cc: Linus Torvalds, Hugh Dickins, Matthew Wilcox, "Kirill A. Shutemov",
    Will Deacon, Linux Kernel Mailing List, Linux-MM, Linux ARM,
    Catalin Marinas, Jan Kara, Minchan Kim, Andrew Morton,
    Vinayak Menon, Android Kernel Team
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting
In-Reply-To: <20201228221237.6nu75kgxq7ikxn2a@box>
References: <20201226224016.dxjmordcfj75xgte@box>
    <20201227234853.5mjyxcybucts3kbq@box>
    <20201228125352.phnj2x2ci3kwfld5@box>
    <20201228220548.57hl32mmrvvefj6q@box>
    <20201228221237.6nu75kgxq7ikxn2a@box>

Got it at last, sorry it's taken so long.

On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov wrote:
> > > >
> > > > So far I only found one more pin leak and always-true check. I don't see
> > > > how can it lead to crash or corruption. Keep looking.

Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.

> > >
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > >
> > > No, I think the problem is the
> > >
> > >         pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >
> > > at the end of filemap_map_pages().
> > >
> > > Why?
> > >
> > > Because we've been updating vmf->pte as we go along:
> > >
> > >         vmf->pte += xas.xa_index - last_pgoff;
> > >
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> >
> > Well, if it's true we have bigger problem: we set up a pte entry without
> > relevant PTL.
> >
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.

Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.

> >
> > It made me look at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
>
> Ugh.. Wrong place. Need to sleep.
>
> I'll look into your idea tomorrow.
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
>  	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	rcu_read_unlock();
> +	vmf->address = address;
> +	vmf->pte = NULL;
>  	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>  
>  	return ret;
> -- 

And that made no (noticeable) difference either.
But at last I realized, it's absolutely on the right track, but missing
the couple of early returns at the head of filemap_map_pages(): add

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
 
 	rcu_read_lock();
 	head = first_map_page(vmf, &xas, end_pgoff);
-	if (!head) {
-		rcu_read_unlock();
-		return 0;
-	}
+	if (!head)
+		goto out;
 
 	if (filemap_map_pmd(vmf, head)) {
-		rcu_read_unlock();
-		return VM_FAULT_NOPAGE;
+		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
 		put_page(head);
 	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
 	rcu_read_unlock();
 	vmf->address = address;
-	vmf->pte = NULL;
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
-- 

and then the corruption is fixed. It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.

You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
it with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).

So, the mystery is solved; but I don't think any of these patches
should be applied. Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and has given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.

I've grown rather jaded, and am questioning the value of the rework:
I don't think I want to look at or test another for a week or so.

Hugh
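
For anyone following the thread, the shape of the bug and of the fix can
be sketched in user space. This is a toy model, not kernel code: struct
toy_fault, toy_map_buggy(), toy_map_fixed() and the TOY_* values are all
hypothetical names. It also assumes that the address field no longer holds
the faulting address by the time the early checks run, which is what the
message implies but which the toy simply arranges for itself.

```c
/* Toy stand-in for the fault descriptor; the caller expects 'address'
 * to hold the faulting address again when the helper returns. */
struct toy_fault {
	unsigned long address;
};

enum { TOY_OK = 0, TOY_NOPAGE = 1 };	/* hypothetical return codes */

/* Buggy shape: the early returns skip the restore at the bottom. */
static int toy_map_buggy(struct toy_fault *tf, unsigned long address,
			 unsigned long start, int have_head, int is_huge)
{
	int i;

	tf->address = start;		/* toy assumption: field reused as a cursor */
	if (!have_head)
		return TOY_OK;		/* tf->address left pointing at 'start' */
	if (is_huge)
		return TOY_NOPAGE;	/* same problem on this path */

	for (i = 0; i < 4; i++)
		tf->address += 4096;	/* stand-in for mapping one page */

	tf->address = address;		/* only reached on the long path */
	return TOY_OK;
}

/* Fixed shape: every exit funnels through one label that restores. */
static int toy_map_fixed(struct toy_fault *tf, unsigned long address,
			 unsigned long start, int have_head, int is_huge)
{
	int ret = TOY_OK;
	int i;

	tf->address = start;
	if (!have_head)
		goto out;
	if (is_huge) {
		ret = TOY_NOPAGE;
		goto out;
	}

	for (i = 0; i < 4; i++)
		tf->address += 4096;
out:
	tf->address = address;		/* runs on all three exit paths */
	return ret;
}
```

Calling toy_map_buggy() with have_head == 0 hands the descriptor back with
address still equal to start, while toy_map_fixed() hands it back restored,
which is the difference the second hunk above makes by placing "out:" ahead
of the restore.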