From: David Hildenbrand
Organization: Red Hat
To: Matthew Wilcox
Cc: Rongwei Wang, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "xuyu@linux.alibaba.com"
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
Date: Mon, 31 Jul 2023 19:06:05 +0200
References: <74fe50d9-9be9-cc97-e550-3ca30aebfd13@linux.alibaba.com> <9faea1cf-d3da-47ff-eb41-adc5bd73e5ca@linux.alibaba.com>

On 31.07.23 18:54, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:48:47PM +0200, David Hildenbrand wrote:
>> On 31.07.23 18:38, Matthew Wilcox wrote:
>>> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>>>> Assume we do do the page table sharing at mmap time, if the flags are right.
>>>> Let's focus on the most common:
>>>>
>>>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>>>
>>>> And doing the same in each and every process.
>>>
>>> That may be the most common in your usage, but for a database, you're
>>> looking at two usage scenarios.  Postgres calls mmap() on the database
>>> file itself so that all processes share the kernel page cache.
>>> Some Commercial Databases call mmap() on a hugetlbfs file so that all
>>> processes share the same userspace buffer cache.  Other Commercial
>>> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
>>> same reason.
>>
>> I remember you said that postgres might be looking into using shmem as well,
>> maybe I am wrong.
>
> No, I said that postgres was also interested in sharing page tables.
> I don't think they have any use for shmem.
>
>> memfd/hugetlb/shmem could all be handled alike, just "arbitrary filesystems"
>> would require more work.
>
> But arbitrary filesystems was one of the original use cases; where the
> database is stored on a persistent memory filesystem, and neither the
> kernel nor userspace has a cache.  The Postgres & Commercial Database
> use-cases collapse into the same case, and we want to mmap the files
> directly and share the page tables.

Yes, and transparent page table sharing can be achieved otherwise.

I guess what you imply is that they want to share page tables and have a
single mprotect(PROT_READ) to modify the shared page tables.

>
>>> This is why I proposed mshare().  Anyone can use it for anything.
>>> We have such a diverse set of users who want to do stuff with shared
>>> page tables that we should not be tying it to memfd or any other
>>> filesystem.  Not to mention that it's more flexible; you can map
>>> individual 4kB files into it and still get page table sharing.
>>
>> That's not what the current proposal does, or am I wrong?
>
> I think you're wrong, but I haven't had time to read the latest patches.
>

Maybe I misunderstood what the MAP_SHARED_PT actually does.

"
This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
This flag can be specified along with MAP_SHARED by a process
to hint to kernel that it wishes to share page table entries
for this file mapping mmap region with other processes. Any
other process that mmaps the same file with MAP_SHARED_PT flag
can then share the same page table entries. Besides specifying
MAP_SHARED_PT flag, the processes must map the files at a PMD
aligned address with a size that is a multiple of PMD size and
at the same virtual addresses. This last requirement of same
virtual addresses can possibly be relaxed if that is the
consensus.
"

Reading this, I'm confused how 4k files would interact with the PMD size
requirement.

Probably I got it all wrong.

>> Also, I'm curious, is that a real requirement in the database world?
>
> I don't know.  It's definitely an advantage that falls out of the design
> of mshare.

Okay, just checking if there is an important use case I'm missing, I'm
also not aware of any.

Anyhow, I have other work to do. Happy to continue the discussion once
someone is actually working on this (again).

-- 
Cheers,

David / dhildenb
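
To make the PMD size requirement from the quoted cover letter concrete,
here is a minimal sketch of the arithmetic only. It assumes x86-64 with
4 KiB base pages, where one PMD entry maps 2 MiB; other architectures and
page sizes differ. The helper names and the example address are
hypothetical, and nothing here calls mmap() or uses MAP_SHARED_PT, which
exists only in the RFC patches.

```python
# Assumption: x86-64 with 4 KiB base pages, so one PMD entry maps 2 MiB.
PAGE_SIZE = 4 * 1024
PMD_SIZE = 2 * 1024 * 1024


def meets_pmd_requirement(addr: int, length: int) -> bool:
    """True if the mapping starts on a PMD boundary and spans whole PMDs."""
    return length > 0 and addr % PMD_SIZE == 0 and length % PMD_SIZE == 0


def round_up_to_pmd(length: int) -> int:
    """Smallest multiple of PMD_SIZE that is >= length (ceiling division)."""
    return -(-length // PMD_SIZE) * PMD_SIZE


if __name__ == "__main__":
    addr = 0x7F0000000000  # hypothetical PMD-aligned virtual address

    # A mapping the size of a single 4 KiB file versus one full PMD.
    print("4 KiB mapping qualifies:", meets_pmd_requirement(addr, PAGE_SIZE))
    print("2 MiB mapping qualifies:", meets_pmd_requirement(addr, PMD_SIZE))

    # How many base pages a 4 KiB file would need to be padded out to.
    print("pages per PMD:", round_up_to_pmd(PAGE_SIZE) // PAGE_SIZE)
```

Under these assumptions a single 4 KiB file covers one of the 512 base
pages behind a PMD, so a mapping of just that file would not satisfy the
size requirement as quoted.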