From: "T.J. Mercier" <tjmercier@google.com>
Date: Wed, 3 Jul 2024 15:59:07 -0700
Subject: Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
To: Huan Yang
Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
 Muchun Song, Andrew Morton, "Matthew Wilcox (Oracle)",
 David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
 Kairui Song, cgroups@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Christian Brauner,
 opensource.kernel@vivo.com
In-Reply-To: <27a62e44-9d85-4ef2-b833-e977af039758@vivo.com>
References: <20240702084423.1717904-1-link@vivo.com>
 <27a62e44-9d85-4ef2-b833-e977af039758@vivo.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, Jul 2, 2024 at 7:23 PM Huan Yang wrote:
>
>
> On 2024/7/3 3:27, Roman Gushchin wrote:
> > On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
> >> This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).
> >>
> >> Background
> >> ===
> >>
> >> Modern computer systems always have performance gaps between hardware
> >> components, such as the performance differences between CPU, memory,
> >> and disk. Because of the principle of locality of reference in data
> >> access:
> >>
> >> - Programs often access data that has been accessed before.
> >> - Programs access the next piece of data after accessing a particular
> >>   piece of data.
> >>
> >> As a result:
> >> 1. CPU caches are used to speed up access to already-accessed data
> >>    in memory.
> >> 2. Disk prefetching techniques are used to prepare the next set of
> >>    data to be accessed in advance (avoiding direct disk access).
> >>
> >> This basic use of locality greatly enhances computer performance.
> >>
> >> PMC (per-MEMCG-cache) is similar, using the principle of locality to
> >> enhance program performance.
> >>
> >> In modern computers, especially smartphones, services are provided to
> >> users on a per-application basis (such as Camera, Chat, etc.),
> >> where an application is composed of multiple processes working
> >> together to provide the service.
> >>
> >> The basic unit for managing resources in a computer is the process,
> >> which in turn uses threads to share memory and accomplish tasks.
> >> Memory is shared among the threads within a process.
> >>
> >> However, modern computers have the following locality deficiencies:
> >>
> >> 1. Different forms of memory exist and are not interconnected
> >>    (anonymous pages, file pages, special memory such as DMA-BUF,
> >>    various memory allocated in kernel mode, etc.).
> >> 2. Memory is isolated between processes; apart from explicitly shared
> >>    memory, processes do not exchange it with each other.
> >> 3. When functionality shifts within an application, one process
> >>    usually releases memory while another process requests memory, and
> >>    the requester has to obtain it from the lowest level through
> >>    competition.
> >>
> >> Take the camera application as an example:
> >>
> >> Camera applications typically provide a photo capture service as well
> >> as a photo preview service.
> >> The photo capture process usually uses DMA-BUF to share image data
> >> between the CPU and DMA devices.
> >> Image preview typically involves multiple algorithm processes working
> >> on the image data, which may also involve heap memory and other
> >> resources.
> >>
> >> When switching between photo capture and preview, the application
> >> typically needs to release DMA-BUF memory, and then the algorithms
> >> need to allocate heap memory. The flow of system memory during this
> >> process is managed by the PCP/BUDDY system.
> >>
> >> However, the PCP and BUDDY systems are shared, so subsequently
> >> requested memory may not be available because previously freed memory
> >> has already been taken for other uses (such as file reading),
> >> requiring a competitive process (memory reclaim) to obtain it.
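The allocation pattern motivated above, where pages freed by one process in a group stay available, with priority, to the group's next allocation, can be sketched in miniature. This is an illustrative model only: the class names and the deque-based free lists are invented for the sketch, while the real patchset works on order-0 pages inside the kernel page allocator.

```python
from collections import deque

class SharedPool:
    """Stand-in for the shared PCP/BUDDY free lists."""
    def __init__(self, pages):
        self.free = deque(pages)

    def alloc(self):
        # In the real allocator, an empty free list means entering
        # the reclaim slow path; here we just return None.
        return self.free.popleft() if self.free else None

    def release(self, page):
        self.free.append(page)

class MemcgCache:
    """A per-group cache sitting in front of the shared pool (PMC-like)."""
    def __init__(self, shared, limit):
        self.shared = shared
        self.cache = deque()
        self.limit = limit  # cap on cached pages, like PMC's `limit` key

    def release(self, page):
        # Pages freed by this group go to the group's own cache first.
        if len(self.cache) < self.limit:
            self.cache.append(page)
        else:
            self.shared.release(page)

    def alloc(self):
        # Allocations are satisfied from the group's cache before
        # falling back to the shared (contended) pool.
        if self.cache:
            return self.cache.popleft()
        return self.shared.alloc()
```

In the camera example, the capture side's freed DMA-BUF pages are the `release()` calls and the preview algorithms' requests are the `alloc()` calls, so the preview side can be served without competing for the shared pool.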
> >>
> >> So, if the released memory could be allocated with high priority
> >> within the same application, this would satisfy the locality
> >> requirement, improve performance, and avoid unnecessary memory
> >> reclaim.
> >>
> >> PMC is similar to PCP in that both establish cache pools according to
> >> certain rules.
> >>
> >> Why base it on MEMCG?
> >> ===
> >>
> >> The MEMCG container can assign selected processes to a MEMCG
> >> according to some grouping strategy (typical examples include
> >> grouping by app or by UID). Processes within the same MEMCG can then
> >> be used for statistics, upper limits, and reclaim control.
> >>
> >> All processes within a MEMCG are considered a single memory unit,
> >> sharing memory among themselves. As a result, when one process
> >> releases memory, another process within the same group can obtain it
> >> with the highest priority, fully exploiting the locality of memory
> >> allocation within the MEMCG (such as APP grouping).
> >>
> >> In addition, MEMCG provides feature interfaces that can be toggled
> >> dynamically and are fully policy-controlled. This gives greater
> >> flexibility and has no performance impact when disabled (controlled
> >> through a static key).
> >>
> >> About the PMC implementation
> >> ===
> >>
> >> A cache switch is provided for each MEMCG (not on root).
> >> When the user enables the cache, processes within the MEMCG share
> >> memory through this cache.
> >>
> >> The cache pool sits in front of the PCP. All order-0 pages released
> >> by processes in the MEMCG are released to the cache pool first, and
> >> when memory is requested, it is likewise obtained from the cache pool
> >> first.
> >>
> >> `memory.cache` is the sole entry point for controlling PMC; it
> >> accepts these nested keys:
> >> 1. "enable=[y|n]" enables or disables the targeted MEMCG's cache.
> >> 2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" controls an
> >>    already enabled PMC's behavior:
> >>    a) `nid` targets a single node whose keys should change;
> >>       otherwise all nodes are affected.
> >>    b) `watermark` controls caching behavior: pages are cached on
> >>       release only when a zone's free pages exceed the zone's high
> >>       watermark plus this watermark. (Unit: bytes; default 50MB,
> >>       min 10MB, per-node-all-zone.)
> >>    c) `reaper_time` controls the reaper interval; when it elapses,
> >>       all cache in this MEMCG is reaped. (Unit: us; default 5s;
> >>       0 disables the reaper.)
> >>    d) `limit` caps the maximum memory used by the cache pool.
> >>       (Unit: bytes; default 100MB, max 500MB, per-node-all-zone.)
> >>
> >> Performance
> >> ===
> >>
> >> PMC is based on MEMCG, so measuring its performance requires complex
> >> workloads shared between application processes. Therefore, at the
> >> moment, we are unable to provide a better testing methodology for
> >> this patchset.
> >>
> >> Here are our internal test results, using the camera application as
> >> an example. (1-NODE-1-ZONE-8G-RAM)
> >>
> >> Test case: capture in rear portrait HDR mode
> >> 1. Test mode: rear portrait HDR mode. This scene needs more than
> >>    800MB of RAM, with memory types including dmabuf (470MB),
> >>    PSS (150MB) and APU (200MB).
> >> 2. Test steps: take a photo, then click the thumbnail to view the
> >>    full image.
> >>
> >> The overall time from clicking the shutter button to showing the
> >> whole image improves by 500ms, and the total slowpath cost across
> >> all camera threads is reduced from 958ms to 495ms.
> >> In particular, for shot2shot in this mode, the preview delay of each
> >> frame improves significantly.
> >
> > Hello Huan,
> >
> > thank you for sharing your work.
>
> thanks
>
> > Some high-level thoughts:
> > 1) Naming is hard, but it took me quite a while to realize that you're talking
>
> Haha, sorry for my poor english
>
> > about free memory.
> > Cache is obviously an overloaded term, but per-memcg-cache
> > can mean absolutely anything (pagecache? cpu cache? ...), so maybe it's not
>
> Currently, my idea is that all memory released by processes under a
> memcg will go into the `cache`, the original attributes will be ignored,
> and it can be freely requested by processes under the memcg
> (so dma-buf, page cache, heap, driver memory, and so on). Maybe the name
> PMP would be friendlier? :)
>
> > the best choice.
> > 2) Overall, an idea to have a per-memcg free memory pool makes sense to me,
> > especially if we talk 2MB or 1GB pages (or order > 0 in general).
>
> I like it too :)
>
> > 3) You absolutely have to integrate the reclaim mechanism with the generic
> > memory reclaim mechanism, which is driven by memory pressure.
>
> Yes, I have thought about it.
>
> > 4) You claim a ~50% performance win in your workload, which is a lot. It's not
> > clear to me where it's coming from. It's hard to believe the page
> > allocation/release paths are taking 50% of the cpu time. Please, clarify.
>
> Let me describe it more specifically. In our test scenario, we have 8GB
> of RAM, and our camera application has a complex set of algorithms, with
> a peak memory requirement of up to 3GB.
>
> Therefore, in a scenario with multiple applications in the background,
> starting the camera and taking photos creates very high memory pressure.
> Any released memory is quickly used by other processes (such as for file
> pages).
>
> So, during the switch from camera capture to preview, DMA-BUF memory is
> released while the memory for the preview algorithm is simultaneously
> requested.
>
> We have to take the slow path many times to obtain enough memory for the
> preview algorithm, and the just-released DMA-BUF memory does not seem to
> provide much help.

Hi Huan, I find this part surprising.
Assuming the dmabuf memory doesn't first go into a page pool (used for
some buffers, not all) and actually does get freed synchronously with
fput, this would mean it gets sucked up by other, supposedly background,
processes before it can be allocated by the preview process. I thought
the preview process was the one most desperate for memory? You mention
file pages, but where is this newly-freed memory actually going, if not
to the preview process?

My initial reaction was the same as Roman's: the PMC should be hooked up
to reclaim instead of depending on the reaper. But I think this might
suggest that wouldn't work, because the system is under such high memory
pressure that reclaim would likely have emptied the PMCs before the
preview process could use them.

One more thing I find odd: for this to work, a significant portion of
your dmabuf pages would have to be order 0, but we're talking about a
~500M buffer. Does whatever exports this buffer not try to use higher
order pages, like here?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54

Thanks!
-T.J.

> But using PMC (let's call it that for now), we are able to quickly meet
> the memory needs of the subsequent preview process with the
> just-released DMA-BUF memory, without having to go through the slow
> path, resulting in a significant performance improvement.
> (Of course, breaking the migrate type may not be good.)
>
> > There are a lot of other questions, and you highlighted some of them below
> > (and these are indeed the right questions to ask), but let's start with
> > something.
> >
> > Thanks
>
> Thanks
>
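P.S. For anyone following the system_heap link above: the exporter builds each buffer by repeatedly taking the largest page order that still fits, from a fixed list of orders. A rough sketch of that strategy follows; the function and the dict-based free-page model are invented for illustration, and only the order list {8, 4, 0} comes from the linked source.

```python
# Orders tried by the system heap exporter, largest first (from the
# linked system_heap.c; the rest of this sketch is illustrative).
ORDERS = [8, 4, 0]

def alloc_largest_available(free_chunks, remaining_pages):
    """Return the order of the largest available chunk that still fits.

    free_chunks: dict mapping order -> number of free chunks of that order.
    remaining_pages: order-0 pages still needed for the buffer.
    Returns the chosen order, or None if nothing suitable is available.
    """
    for order in ORDERS:
        if (1 << order) > remaining_pages:
            continue  # chunk larger than what the buffer still needs
        if free_chunks.get(order, 0) > 0:
            free_chunks[order] -= 1
            return order
    return None
```

With a free list like {8: 1, 4: 2, 0: ...}, a 512-page buffer is assembled as one order-8 chunk, then order-4 chunks, falling back to order-0 pages only at the end, which is why a mostly order-0 dmabuf would be surprising here.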