Date: Tue, 21 Feb 2023 15:05:35 +0000
From: Matthew Wilcox
To: Pasha Tatashin
Cc: lsf-pc@lists.linux-foundation.org, linux-mm
Subject: Re: [LSF/MM/BPF TOPIC] Single Owner Memory

On Tue, Feb 21, 2023 at 09:37:17AM -0500, Pasha Tatashin wrote:
> Hey Matthew,
>
> Thank you for looking into this.
>
> On Tue, Feb 21, 2023 at 8:46 AM Matthew Wilcox wrote:
> >
> > On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > > Within Google the vast majority of memory, over 90% has a single
> > > owner. This is because most of the jobs are not multi-process but
> > > instead multi-threaded.
> > > The examples of single owner memory
> > > allocations are all tcmalloc()/malloc() allocations, and
> > > mmap(MAP_ANONYMOUS | MAP_PRIVATE) allocations without forks. On the
> > > other hand, the struct page metadata that is shared for all types of
> > > memory takes 1.6% of the system memory. It would be reasonable to find
> > > ways to optimize memory such that the common som case has a reduced
> > > amount of metadata.
> > >
> > > This would be similar to HugeTLB and DAX that are treated as special
> > > cases, and can release struct pages for the subpages back to the
> > > system.
> >
> > DAX can't, unless something's changed recently. You're referring to
> > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>
> DAX has a similar optimization:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.2&id=e3246d8f52173a798710314a42fea83223036fc8

Oh, devdax, not fsdax.

> > > The proposal is to discuss a new som driver that would use HugeTLB as
> > > a source of 2M chunks. When user creates a som memory, i.e.:
> > >
> > > mmap(MAP_ANONYMOUS | MAP_PRIVATE);
> > > madvise(mem, length, MADV_DONTFORK);
> > >
> > > A vma from the som driver is used instead of regular anon vma.
> >
> > That's going to be "interesting". The VMA is already created with
> > the call to mmap(), and madvise has not traditionally allowed drivers
> > to replace a VMA. You might be better off creating a /dev/som and
> > hacking the malloc libraries to pass an fd from that instead of passing
> > MAP_ANONYMOUS.
>
> I do not plan to replace VMA after madvise(), I showed the syscall
> sequence to show how Single Owner Memory can be enforced today.
> However, in the future we either need to add another mmap() flag for
> single owner memory if that is proved to be important or as you
> suggested use ioctl() through /dev/som.

Not ioctl(). Pass an fd from /dev/som to mmap and have the som driver
set up the VMA.
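
To make the two options concrete, here is a rough userspace sketch of what
an allocator might do. It is only an illustration: /dev/som does not exist
in any kernel today, and the helper names (som_alloc_dev, som_alloc_anon,
som_alloc) and the choice of MAP_PRIVATE for the device mapping are
assumptions of this sketch, not part of any proposal. The anonymous path
uses only mmap() and madvise(MADV_DONTFORK), which is the sequence quoted
above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Today's sequence: private anonymous memory that a child never inherits. */
static void *som_alloc_anon(size_t len)
{
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	if (mem == MAP_FAILED)
		return NULL;
	if (madvise(mem, len, MADV_DONTFORK) != 0) {
		munmap(mem, len);
		return NULL;
	}
	return mem;
}

/*
 * Hypothetical path: pass an fd from /dev/som to mmap() so that the
 * driver's mmap handler sets up the VMA. The device node is an
 * assumption; on current kernels open() fails and we return NULL.
 */
static void *som_alloc_dev(size_t len)
{
	int fd = open("/dev/som", O_RDWR | O_CLOEXEC);
	void *mem;

	if (fd < 0)
		return NULL;
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	return mem == MAP_FAILED ? NULL : mem;
}

/* Prefer the (hypothetical) driver, fall back to anon + MADV_DONTFORK. */
static void *som_alloc(size_t len)
{
	void *mem;

	assert(len > 0);
	mem = som_alloc_dev(len);
	return mem ? mem : som_alloc_anon(len);
}
```

A malloc library would call som_alloc(2UL << 20) where it currently calls
mmap() with MAP_ANONYMOUS, and munmap() the region as usual.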
> > > The discussion should include the following topics:
> > > - Interaction with folio and the proposed struct page {memdesc}.
> > > - Handling for migrate_pages() and friends.
> > > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > > - What type of madvise() properties the som memory should handle
> >
> > Obviously once we get to dynamically allocated memdescs, this whole
> > thing goes away, so I'm not excited about making big changes to the
> > kernel to support this.
>
> This is why the changes that I am thinking about are going to be
> mostly localized in a separate driver and do not alter the core mm
> much. However, even with memdesc, today the Single Owner Memory is not
> singled out from the rest of memory types (shared, anon, named), so I
> do not expect the memdescs can provide saving or optimizations for
> this specific use case.

With memdescs, let's suppose the malloc library asks for a 256kB
allocation. You end up using 8 bytes per page for the memdesc pointer
(512 bytes) plus around 96 bytes for the folio that's used by the anon
memory (assuming appropriate hinting / heuristics that say "Hey, treat
this as a single allocation"). So that's 608 bytes of overhead for a
256kB allocation, or 0.23% overhead. About half the overhead of 8kB
per 2MB (plus whatever overhead the SOM driver has to track the 256kB
of memory).

If 256kB isn't the right size to be doing this kind of analysis on, we
can rerun it on whatever size you want. I'm not really familiar with
what userspace is doing these days.

> > The savings you'll see are 6 pages (24kB) per 2MB allocated (1.2%).
> > That's not nothing, but it's not huge either.
>
> This depends on the scale, in our fleet 1.2% savings are huge.

Then 1.4% will be better, yes? ;-)
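
For anyone who wants to rerun the overhead arithmetic above on a different
allocation size, a small sketch follows. The 8-byte memdesc pointer, the
roughly 96-byte folio, and the 8kB left per 2MB all come from the numbers
in this mail; the 4kB page size and the function names are assumptions of
the sketch.

```c
#include <assert.h>
#include <stdio.h>

enum {
	PAGE_SZ        = 4096,	/* assumed base page size */
	MEMDESC_PTR_SZ = 8,	/* bytes per page for the memdesc pointer */
	FOLIO_SZ       = 96,	/* approximate size of one folio */
	SOM_LEFT_2M    = 8192,	/* metadata left per 2MB after freeing 6 pages */
};

/* Metadata bytes for one allocation tracked by a single folio. */
static unsigned long memdesc_overhead(unsigned long alloc_bytes)
{
	assert(alloc_bytes % PAGE_SZ == 0);
	return alloc_bytes / PAGE_SZ * MEMDESC_PTR_SZ + FOLIO_SZ;
}

/* Overhead as a percentage of the memory it describes. */
static double percent(unsigned long overhead, unsigned long total)
{
	return 100.0 * (double)overhead / (double)total;
}

/* Print both figures side by side for a given allocation size. */
static void print_overheads(unsigned long alloc_bytes)
{
	unsigned long md = memdesc_overhead(alloc_bytes);

	printf("memdesc: %lu bytes per %lukB (%.2f%%)\n",
	       md, alloc_bytes / 1024, percent(md, alloc_bytes));
	printf("som:     %d bytes per 2MB (%.2f%%)\n",
	       SOM_LEFT_2M, percent(SOM_LEFT_2M, 2UL << 20));
}
```

Calling print_overheads(256 * 1024) reproduces the 608 bytes and 0.23%
quoted above; pass another page-aligned size to see how the folio cost
amortizes.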