From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Date: Mon, 18 Jul 2022 14:59:46 +0200
Message-ID: <1d3cdad0-b3a7-ec25-1652-efa7c39d1705@redhat.com>
Subject: Re: [PATCH v2 0/9] Add support for shared PTEs across processes
To: Khalid Aziz, Andrew Morton, Mike Kravetz
Cc: willy@infradead.org, aneesh.kumar@linux.ibm.com, arnd@arndb.de,
 21cnbao@gmail.com, corbet@lwn.net, dave.hansen@linux.intel.com,
 ebiederm@xmission.com, hagen@jauu.net, jack@suse.cz, keescook@chromium.org,
 kirill@shutemov.name, kucharsk@gmail.com, linkinjeon@kernel.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, longpeng2@huawei.com, luto@kernel.org,
 markhemm@googlemail.com, pcc@google.com, rppt@kernel.org,
 sieberf@amazon.com, sjpark@amazon.de, surenb@google.com,
 tst@schoebel-theuer.de, yzaikin@google.com
References: <20220701212403.77ab8139b6e1aca87fae119e@linux-foundation.org>
 <0864a811-53c8-a87b-a32d-d6f4c7945caa@redhat.com>
 <357da99d-d096-a790-31d7-ee477e37c705@oracle.com>
 <397f3cb2-1351-afcf-cd87-e8f9fb482059@redhat.com>
[sorry for not being as responsive as I usually am]

>>
>> They share a *mm* including a consistent virtual memory layout (VMA
>> list). Page table sharing is just a side product of that. You could even
>> call page tables just an implementation detail to produce that
>> consistent virtual memory layout -- described for that MM via a
>> different data structure.
>
> Yes, sharing an mm and vma chain does make it different from an
> implementation point of view.
>
>>
>>> A number of people have commented on potential usefulness of this concept
>>> and implementation.
>>
>> ... and a lot of people raised concerns. Yes, page table sharing to
>> reduce memory consumption/tlb misses/... is something reasonable to
>> have. But that doesn't require mshare, as hugetlb has proven.
>>
>> The design might be useful for a handful of corner (!) cases, but as the
>> cover letter only talks about memory consumption of page tables, I'll
>> not care about those. Once these corner cases are explained and deemed
>> important, we might want to think of possible alternatives to explore
>> the solution space.
>
> Memory consumption by page tables is turning out to be a significant
> issue. I mentioned one real-world example from a customer where a 300GB
> SGA on a 512GB server resulted in OOM when 1500+ processes tried to map
> parts of the SGA into their address space. Some customers are able to
> solve this issue by switching to hugetlbfs, but that is not feasible for
> everyone.

Yes. Another use case I am aware of is KVM-based virtual machines, where
VM memory (shmem, file-backed) is not only mapped into the emulator
process, but also into other processes used to carry out I/O (e.g.,
vhost-user). In that case, it's tempting to simply share the page tables
between all processes for the shared mapping -- automatically, just like
shmem/hugetlb already does.

[...]

>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly.
>>> Conclusion from that discussion and earlier discussion on the mailing
>>> list was that hugetlb PMD sharing is built with special-case code in
>>> too many places in the kernel, and it is better to replace it with
>>> something more general purpose than to build even more on it. Mike can
>>> correct me if I got that wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi-?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...) to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>> I'd be really interested in what major roadblocks/downsides file-based
>> page table sharing has, because I am not convinced that a mechanism like
>> mshare() -- that has to be explicitly implemented+used by user space --
>> is required for that.
>>
>
> I see two parts to what you are suggesting (please correct me if I get
> this wrong):
>
> 1. Implement a generic page table sharing mechanism
> 2. Implement a way to use this mechanism from userspace

Yes. Whereby 2) would usually just be some heuristic (e.g., file > X MiB
-> start sharing), with an additional way to explicitly disable or enable
it. But yes, most of that stuff should just be automatic.
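Something like the following is what I'd imagine for the automatic part.
A rough, untested sketch only: pt_share_suitable() and PT_SHARE_MIN_BYTES
are invented names, nothing like this exists in the tree, and the
criteria are the ones I get into below (alignment, no per-process state
in the page tables):

#include <linux/mm.h>
#include <linux/userfaultfd_k.h>

/* Invented threshold for the "file > X MiB -> start sharing" heuristic. */
#define PT_SHARE_MIN_BYTES	(8UL << 20)

/* Could this VMA use shared page tables? (hypothetical helper) */
static bool pt_share_suitable(struct vm_area_struct *vma)
{
	/*
	 * Only shared, file-backed mappings: page tables of private
	 * (CoW) mappings inherently carry per-process state.
	 */
	if (!(vma->vm_flags & VM_SHARED) || !vma->vm_file)
		return false;

	/*
	 * Corner cases that do store per-process state in page tables
	 * (uffd-wp markers, softdirty bits) cannot share them.
	 */
	if (userfaultfd_armed(vma) || (vma->vm_flags & VM_SOFTDIRTY))
		return false;

	/*
	 * The VA range and the file offset have to line up at PMD
	 * granularity, so the same PTE table can be plugged into the
	 * same slot in every process.
	 */
	if (!IS_ALIGNED(vma->vm_start, PMD_SIZE) ||
	    !IS_ALIGNED(vma->vm_end, PMD_SIZE) ||
	    !IS_ALIGNED(vma->vm_pgoff << PAGE_SHIFT, PMD_SIZE))
		return false;

	/* Heuristic: only bother for reasonably large mappings. */
	return vma->vm_end - vma->vm_start >= PT_SHARE_MIN_BYTES;
}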
>
> For 1, your suggestion seems to be to extract the page table sharing
> code from hugetlb and make it generic. My approach is to create a
> special mm struct to host the shared page tables and create a minimal
> set of changes to simply get PTEs from this special mm struct whenever a
> shared VMA is accessed. There may be value to extracting hugetlb page
> table sharing code and recasting it into this framework of a special mm
> struct. I will look some more into it.

The basic idea would be that whenever a MAP_SHARED VMA has a reasonable
size, is aligned in a suitable way (including MAP offset), and
protections match, you can just share PTE tables and even PMD tables. As
page tables of shared mappings usually don't really store per-process
information (exceptions I am aware of are userfaultfd and softdirty
tracking), we can simply share/unshare page tables of shared mappings
fairly easily.

Then, you'd have e.g., 2 sets of page tables cached by the fd that can be
mapped into processes:

1) PROT_READ|PROT_WRITE
2) PROT_READ

On VMA protection changes, one would have to unshare (unmap the page
table) and either map another shared one, or map a private one. I don't
think there would be a need to optimize e.g., for PROT_NONE, but of
course, other combinations could make sense to cache. PROT_NONE and
other corner cases (softdirty tracking) would simply not use shared page
tables.

Shared page tables would have to be refcounted, and one could e.g.,
implement a shrinker that frees unused page tables in the fd cache when
memory reclaim kicks in.

With something like that in place, page table consumption could be
reduced and vmscan/rmap walks could turn out more efficient.
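Again just as a brain dump, the fd cache + shrinker could look roughly
like the following. Every name below is invented, shared_pt_set_free()
is an imaginary helper, and all the interesting locking is hand-waved
away; only the two callbacks follow the stock struct shrinker interface:

#include <linux/mm.h>
#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/shrinker.h>
#include <linux/spinlock.h>

/* One set of shared PTE tables for a file, for one protection. */
struct shared_pt_set {
	pgprot_t		prot;		/* R/W set vs. R/O set */
	struct page		**pte_tables;	/* one table per PMD-sized chunk */
	unsigned long		nr_tables;
	refcount_t		mappers;	/* processes mapping this set */
	struct list_head	lru;		/* linked when unused, for reclaim */
};

/* Hung off the file (struct address_space, say) -- invented. */
struct shared_pt_cache {
	spinlock_t		lock;
	struct shared_pt_set	*rw_set;	/* PROT_READ|PROT_WRITE */
	struct shared_pt_set	*ro_set;	/* PROT_READ */
};

static LIST_HEAD(pt_cache_lru);			/* sets with no mappers */
static DEFINE_SPINLOCK(pt_cache_lru_lock);
static atomic_long_t pt_cache_nr_unused;

/* Invented: frees the set's PTE tables, returns how many were freed. */
static unsigned long shared_pt_set_free(struct shared_pt_set *set);

static unsigned long pt_cache_count(struct shrinker *sh,
				    struct shrink_control *sc)
{
	return atomic_long_read(&pt_cache_nr_unused);
}

static unsigned long pt_cache_scan(struct shrinker *sh,
				   struct shrink_control *sc)
{
	struct shared_pt_set *set, *next;
	unsigned long freed = 0;

	spin_lock(&pt_cache_lru_lock);
	list_for_each_entry_safe(set, next, &pt_cache_lru, lru) {
		if (freed >= sc->nr_to_scan)
			break;
		/* Only sets no process currently maps sit on this list. */
		list_del_init(&set->lru);
		atomic_long_dec(&pt_cache_nr_unused);
		freed += shared_pt_set_free(set);
	}
	spin_unlock(&pt_cache_lru_lock);

	return freed ? freed : SHRINK_STOP;
}

/* Registered from some init path via register_shrinker(). */
static struct shrinker pt_cache_shrinker = {
	.count_objects	= pt_cache_count,
	.scan_objects	= pt_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};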
>
> As for 2, is it fair to say you are not fond of explicit opt-in from
> userspace and would rather have the sharing be file based like hugetlb?
> That is worth considering, but is limiting page table sharing to just
> file objects reasonable? A goal for the mshare mechanism was to allow
> shared objects to be files, anonymous pages, RDMA buffers, whatever.
> Idea being if you can map it, you can share it with shared page tables.
> Maybe that is too ambitious a goal and I am open to course correction.

We can glue it to the file or anything else that's shared, I think -- I
don't think we particularly care, as long as it's something shared
between processes to be mapped. And to be quite honest, whenever I read
about anonymous memory (i.e., MAP_PRIVATE) I hear my inner voice
screaming: just use *shared* memory when you want to *share* memory
between processes, and optimize that if anything is missing.

Having said that, I understood from previous discussions that there is a
use case of efficient read-only protection across many processes/VMAs. I
was wondering if that could be handled on the fs-level (pte_mkwrite). I
remember I raised the idea before: if one could have a
userfaultfd-wp-style (overlay?) file (system?), user-space could
protect/unprotect file pages via a different mechanism (ioctl) and get
notified about write access via something similar to userfaultfd
user-space handlers, not via signals. Instead of adjusting VMAs, one
would only adjust file page mappings to map the relevant pages R/O when
protecting -- if page tables are shared, that would be efficient.

Now, that is just a very vague brain dump to get it out of my
(overloaded) system. What I think the overall message is: let's try not
to design new features around page table sharing; let's use page table
sharing as an rmap performance optimization and as a mechanism to reduce
page table overhead.

I hope what I said makes any sense; I might just be wrong.

-- 
Thanks,

David / dhildenb