Date: Thu, 19 May 2022 21:12:21 -0400
From: Peter Xu
To: Zach O'Keefe
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
    Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
    Vlastimil Babka, Yang Shi, Zi Yan, linux-mm@kvack.org,
    Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
    Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
    Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
    "Kirill A. Shutemov", Matt Turner, Max Filippov, Miaohe Lin,
    Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer
Subject: Re: [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
Message-ID:
References: <20220504214437.2850685-1-zokeefe@google.com>
 <20220504214437.2850685-2-zokeefe@google.com>
In-Reply-To:

On Thu, May 19, 2022 at 02:06:25PM -0700, Zach O'Keefe wrote:
> Thanks again for the review, Peter.
>
> On Wed, May 18, 2022 at 11:41 AM Peter Xu wrote:
> >
> > On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > +				   unsigned long address,
> > > +				   pmd_t **pmd)
> > > +{
> > > +	pmd_t pmde;
> > > +
> > > +	*pmd = mm_find_pmd_raw(mm, address);
> > > +	if (!*pmd)
> > > +		return SCAN_PMD_NULL;
> > > +
> > > +	pmde = pmd_read_atomic(*pmd);
> >
> > It seems correct to use the atomic fetcher here.  Though it's
> > irrelevant to this patchset: does it also mean we're missing that in
> > mm_find_pmd()?  I meant a separate fix like this one:
> >
> > ---8<---
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 69416072b1a6..61309718640f 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >  	 * without holding anon_vma lock for write.  So when looking for a
> >  	 * genuine pmde (in which to find pte), test present and !THP together.
> >  	 */
> > -	pmde = *pmd;
> > +	pmde = pmd_read_atomic(pmd);
> >  	barrier();
> >  	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> >  		pmd = NULL;
> > ---8<---
> >
> > Otherwise it seems mm_find_pmd() is also prone to PAE race conditions
> > when reading the pmd out, but I could be missing something.
>
> This is a good question. I took some time to look into this, but it's
> very complicated and unfortunately I couldn't reach a conclusion in
> the time I allotted to myself. My working (unverified) assumption is
> that mm_find_pmd() is called in places that don't care whether the
> pmd is read atomically.

Frankly, I'm still not sure about that.  Say the immediate check in
mm_find_pmd() afterwards is pmd_trans_huge(), which AFAICT is:

static inline int pmd_trans_huge(pmd_t pmd)
{
	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

Where we have:

#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
#define _PAGE_BIT_SOFTW4	58	/* available for programmer */

I think that means we're checking bits in both the lower 32 bits (bit 7)
and the upper 32 bits (bit 58) of the whole 64-bit pmd on PAE, and I
can't explain how that would not require atomicity.

pmd_read_atomic() plays a trick (see the sketch below) whereby we'll
either:

  - get an atomic result for none or pgtable pmds (which is stable), or
  - get a possibly non-atomic result for thps (unstable).
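To make that concrete, the 32-bit PAE implementation reads the low
32 bits first and only then the high 32 bits, roughly as below
(paraphrased from arch/x86/include/asm/pgtable-3level.h; the exact
in-tree code and comments may differ slightly):

static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
{
	pmdval_t ret;
	u32 *tmp = (u32 *)pmdp;

	/*
	 * Low half first: a none pmd has a zero low half, so reading it
	 * alone is enough to classify it and the high half is never
	 * touched in that case.
	 */
	ret = (pmdval_t) (*tmp);
	if (ret) {
		/*
		 * For a non-none pmd the high half is read separately, so
		 * a concurrent writer can make the two halves disagree:
		 * _PAGE_PSE (bit 7, low half) and _PAGE_DEVMAP (bit 58,
		 * high half) may end up coming from different pmd values.
		 */
		smp_rmb();
		ret |= ((pmdval_t)*(tmp + 1)) << 32;
	}

	return (pmd_t) { ret };
}

That is why the result is only guaranteed to be stable for none or
pgtable pmds, not for thp pmds.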
However, here we're checking exactly for thps.  I think we don't need a
stable PFN, but we do need stable _PAGE_PSE|_PAGE_DEVMAP bits.  That
shouldn't have been a problem when pmd_read_atomic() was introduced
(in 2012, 26c191788f18129), and AFAIU DEVMAP only came in 2016+.  IOW,
my feeling is that pmd_read_atomic() used to be bug-free, but after
devmap it's harder to justify, at least it seems so to me.

So far I don't see a clean way to fix it other than using
atomic64_read(), because pmd_read_atomic() isn't atomic for thp checks
via pmd_trans_huge() AFAICT.  However, the last piece of the puzzle is
that atomic64_read() fails somehow on Xen, and that's why we came up
with the pmd_read_atomic() trick of reading the lower and then the
upper 32 bits:

commit e4eed03fd06578571c01d4f1478c874bb432c815
Author: Andrea Arcangeli
Date:   Wed Jun 20 12:52:57 2012 -0700

    thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE

    In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the
    mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under
    Xen.

I believe Andrea had a solid reason to do so at the time, but I'm still
not sure why Xen won't work with atomic64_read(), since that isn't
mentioned in the commit message.
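For illustration only, the atomic64_read()-based read meant above would
look roughly like the sketch below.  The helper name pmd_read_atomic64()
is made up for this example and is not in the tree; the point is just
that a single 64-bit atomic load keeps the PSE and DEVMAP bits
consistent with each other, at the cost of the cmpxchg8b-under-Xen
problem the commit above describes:

/* Hypothetical sketch, not in-tree code. */
static inline pmd_t pmd_read_atomic64(pmd_t *pmdp)
{
	/* One atomic 64-bit load: both halves come from the same pmd value. */
	return __pmd(atomic64_read((atomic64_t *)pmdp));
}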
Please go ahead with your patchset without being blocked by this,
because AFAICT pmd_read_atomic() is already better than pmde = *pmdp
anyway, and it seems to be a more general problem even if it does
exist, or I could also have missed something.

> If so, does that also mean MADV_COLLAPSE is safe? I'm not sure.
> These i386 PAE + THP racing issues were most recently discussed when
> considering if READ_ONCE() should be used instead of
> pmd_read_atomic() [1].
>
> [1] https://lore.kernel.org/linux-mm/594c1f0-d396-5346-1f36-606872cddb18@google.com/
>
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > +	barrier();
> > > +#endif
> > > +	if (!pmd_present(pmde))
> > > +		return SCAN_PMD_NULL;
> > > +	if (pmd_trans_huge(pmde))
> > > +		return SCAN_PMD_MAPPED;
> >
> > Would it be safer to check pmd_bad()?  I think not all mm pmd paths
> > check that.  Frankly, I don't really know what the major cause of a
> > bad pmd is (either software bugs or corrupted memory), but just to
> > check with you, because potentially a bad pmd can be read as
> > SCAN_SUCCEED and go through.
>
> Likewise, I'm not sure what the cause of "bad pmds" is.
>
> Do you mean to check pmd_bad() instead of pmd_trans_huge()?  I.e.
> because a pmd-mapped thp counts as "bad" (at least on x86, since PSE
> is set)?  Or do you mean to additionally check pmd_bad() after the
> pmd_trans_huge() check?
>
> If it's the former, I'd say we can't claim !pmd_bad() == memory
> already backed by thps / our job here is done.
>
> If it's the latter, I don't see it hurting much (but I can't argue
> intelligently about why it's needed) and can include the check in v6.

The latter.  I don't think it's strongly necessary, but it looks good
to have if you also agree.

Thanks,

-- 
Peter Xu