From: David Hildenbrand
Organization: Red Hat
To: Chih-En Lin
Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, "Matthew Wilcox (Oracle)",
 Christophe Leroy, John Hubbard, Nadav Amit, Barry Song, Steven Rostedt,
 Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, Yang Shi,
 Peter Xu, Vlastimil Babka, Zach O'Keefe, Yun Zhou, Hugh Dickins,
 Suren Baghdasaryan, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
 Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin, Gautam Menghani,
 Catalin Marinas, Mark Brown, Will Deacon, Vincenzo Frascino,
 Thomas Gleixner, "Eric W. Biederman", Andy Lutomirski,
 Sebastian Andrzej Siewior, "Liam R. Howlett", Fenghua Yu, Andrei Vagin,
 Barret Rhoden, Michal Hocko, "Jason A. Donenfeld", Alexey Gladkov,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
 linux-perf-users@vger.kernel.org, Dinglan Peng, Pedro Fonseca,
 Jim Huang, Huichun Feng
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
Date: Tue, 14 Feb 2023 18:59:50 +0100
Message-ID: <28f1e75a-a1fc-a172-3628-83575e387f9a@redhat.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
 <62c44d12-933d-ee66-ef50-467cd8d30a58@redhat.com>

On 14.02.23 18:54, Chih-En Lin wrote:
>>>
>>>> (2) break_cow_pte() can fail, which means that we can fail some
>>>>     operations (possibly silently halfway through) now. For example,
>>>>     looking at your change_pte_range() change, I suspect it's wrong.
>>>
>>> Maybe I should add WARN_ON() and skip the failed COW PTE.
>>
>> One way or the other we'll have to handle it. WARN_ON() sounds wrong for
>> handling OOM situations (e.g., if only that cgroup is OOM).
>
> Or we could do the same thing you mentioned:
> "
> For example, __split_huge_pmd() is currently not able to report a
> failure. I assume that we could sleep in there. And if we're not able to
> allocate any memory in there (with sleeping), maybe the process should
> be zapped either way by the OOM killer.
> "
>
> But instead of zapping the process, we just skip the failed COW PTE.
> I don't think users will expect their process to be killed for
> changing the protection.

The process is consuming more memory than it is allowed to consume.
The process most probably would have died earlier without the PTE
optimization.

But yeah, it all gets tricky ...

>
>>>
>>>> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>>>>     double-checking: we temporarily clear the PMD, to reset it
>>>>     afterwards. I am not sure if that is correct. For example, what
>>>>     stops another page fault stumbling over that pmd_none() and
>>>>     allocating an empty page table? Maybe there are some locking details
>>>>     missing or they are very subtle such that we had better document
>>>>     them. I recall that THP played quite some tricks to make such cases
>>>>     work ...
>>>
>>> I think that holding mmap_write_lock may be enough (I added
>>> mmap_assert_write_locked() in the fault function btw). But, I might
>>> be wrong. I will look at the THP stuff to see how they work. Thanks.
>>>
>>
>> Ehm, but page faults don't hold the mmap lock writable? And neither do
>> other callers, like MADV_DONTNEED or MADV_FREE.
>>
>> handle_pte_fault()->handle_cow_pte_fault()->mmap_assert_write_locked()
>> should bail out.
>>
>> Either I am missing something or you didn't test with lockdep enabled :)
>
> You're right. I thought I had enabled lockdep.
> And I don't know why I had it in my mind that the page fault takes the
> mmap lock for writing.
> The page fault holds the mmap lock for reading, not for writing.
> ;-)
>
> I should check/test all the locks again.
> Thanks.

Note that we have other ways of traversing page tables, especially
using the rmap, which does not hold the mmap lock. Not sure if there are
similar issues when suddenly finding no page table where there logically
should be one. Or when a page table gets replaced and modified while
rmap code still walks the shared copy. Hm.

-- 
Thanks,

David / dhildenb
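
The lockdep point discussed in the thread can be illustrated with a small
userspace sketch. This is not kernel code and not taken from the patch
series: every toy_* name is hypothetical, and the "lock" only records the
mode in which it is held. It assumes, as the thread says, that the fault
path takes the mmap lock for reading, so a handler that asserts a write
hold fails there, while a path that takes the lock for writing passes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mmap lock: tracks only the hold mode. */
struct toy_mmap_lock {
    int readers;   /* number of read-side holders */
    bool writer;   /* true while held for writing */
};

static void toy_read_lock(struct toy_mmap_lock *l)    { l->readers++; }
static void toy_read_unlock(struct toy_mmap_lock *l)  { l->readers--; }
static void toy_write_lock(struct toy_mmap_lock *l)   { l->writer = true; }
static void toy_write_unlock(struct toy_mmap_lock *l) { l->writer = false; }

/*
 * Models the check a write-lock assertion performs. It returns false
 * instead of warning, so the result can be inspected by the caller.
 */
static bool toy_assert_write_locked(const struct toy_mmap_lock *l)
{
    return l->writer;
}

/* Models a fault handler that expects the lock to be held for writing. */
static bool toy_cow_pte_fault(const struct toy_mmap_lock *l)
{
    return toy_assert_write_locked(l);
}

/* Fault path: holds the lock for reading, so the assertion fails. */
static bool toy_page_fault(struct toy_mmap_lock *l)
{
    toy_read_lock(l);
    bool ok = toy_cow_pte_fault(l);
    toy_read_unlock(l);
    return ok;
}

/* A path that does take the lock for writing: the assertion holds. */
static bool toy_write_path(struct toy_mmap_lock *l)
{
    toy_write_lock(l);
    bool ok = toy_cow_pte_fault(l);
    toy_write_unlock(l);
    return ok;
}
```

On a zero-initialized struct toy_mmap_lock, toy_page_fault() returns false
and toy_write_path() returns true. The sketch says nothing about rmap
walkers, which per the note above do not hold the mmap lock at all.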