From: Joao Martins
Subject: [LSF/MM TOPIC] Guest memory without struct page
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Message-ID: <1be38ae3-d51e-2661-d0ab-6ad8baefe804@oracle.com>
Date: Fri, 14 Feb 2020 21:32:11 +0000

All system RAM is tracked by a metadata structure called 'struct page',
which is 64 bytes in size and describes one page of memory at the base page
granularity. On x86 (or any system where PAGE_SIZE is 4K) this data
structure amounts to roughly 1.5% of total capacity.

For hypervisors (especially those that use only VFs, without
vhost/PV-devices), persistent/volatile memory is largely assigned to
userspace without the kernel taking part in any of its I/O paths, except for
VFIO. 1.5% may not seem like much, but it is still a total of 16G per TB
just for struct page, which is a lot considering that the hypervisor won't
need it; that memory should instead be used to create more guests
(=Happy Users).
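
The overhead figures above can be checked with a few lines of arithmetic.
The sketch below is only an illustration: the function name and its
parameters are hypothetical, and it assumes binary units (1 TB = 2^40 bytes)
together with the 64-byte struct page and 4K PAGE_SIZE stated in the text.

```python
# Illustrative arithmetic only; names and defaults are assumptions taken
# from the figures quoted in the text (64-byte struct page, 4K pages).

def struct_page_overhead(ram_bytes, page_size=4096, struct_page_size=64):
    """Return the bytes of struct page metadata needed to track ram_bytes."""
    pages = ram_bytes // page_size
    return pages * struct_page_size

TIB = 1 << 40
GIB = 1 << 30

overhead = struct_page_overhead(TIB)
# Prints: 16 GiB per TiB (1.5625%)
print(f"{overhead / GIB:.0f} GiB per TiB ({overhead / TIB:.4%})")
```

The exact ratio is 64/4096 = 1.5625%, which the text rounds to 1.5%.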

The RFC patches submitted here [0] approach this through device-dax, given
the interface it already provides for VMMs, and given that this too is a
source of overhead for non-volatile memory assigned to guests. Essentially,
it extends device-dax to create a PFNMAP vma with special pages (while
adding support for huge special pages). Host memory would be limited through
some form of mem=X, efi_fake_mem=Y@X:0x40000 or memmap=Y@X-1+0xefffffff,
i.e. dedicating Y amount of memory to guests. Should vhost-{net,scsi,etc} be
used, we copy from/to guest memory (which works today for vhost-net, and is
easily adjusted for vhost-scsi), or perhaps explore dynamically
creating/freeing struct pages on GUP temporary pinning.

This topic would be to brainstorm the idea/proposal and also discuss
alternatives/pitfalls/limitations/other-usecases(*).

Regards,
Joao

(*) To some extent there might be a similarity to the '"Secret" memory
userspace APIs' subitem of this previously submitted topic [1], given that
the guest memory in the described topic isn't part of the direct map.

[0] https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20200206165900.GD17499@linux.ibm.com/