Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Roman Gushchin <roman.gushchin@linux.dev>
To: Huan Yang <link@vivo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Chris Li <chrisl@kernel.org>,
	Dan Schatzberg <schatzberg.dan@gmail.com>,
	Kairui Song <kasong@tencent.com>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Christian Brauner <brauner@kernel.org>,
	opensource.kernel@vivo.com
Subject: Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
Date: Tue, 2 Jul 2024 19:27:54 +0000	[thread overview]
Message-ID: <ZoRUukQUNqGHn_x1@google.com> (raw)
In-Reply-To: <20240702084423.1717904-1-link@vivo.com>

On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
> This patchset like to talk abount a idea about PMC(PER-MEMCG-CACHE).
> 
> Background
> ===
> 
> Modern computer systems always have performance gaps between hardware,
> such as the performance differences between CPU, memory, and disk.
> Due to the principle of locality of reference in data access:
> 
>   Programs often access data that has been accessed before
>   Programs access the next set of data after accessing a particular data
> As a result:
>   1. CPU cache is used to speed up the access of already accessed data
>      in memory
>   2. Disk prefetching techniques are used to prepare the next set of data
>      to be accessed in advance (to avoid direct disk access)
> The basic utilization of locality greatly enhances computer performance.
> 
> PMC (per-MEMCG-cache) is similar, utilizing a principle of locality to enhance
> program performance.
> 
> In modern computers, especially in smartphones, services are provided to
> users on a per-application basis (such as Camera, Chat, etc.),
> where an application is composed of multiple processes working together to
> provide services.
> 
> The basic unit for managing resources in a computer is the process,
> which in turn uses threads to share memory and accomplish tasks.
> Memory is shared among threads within a process.
> 
> However, modern computers have the following issues, with a locality deficiency:
> 
>   1. Different forms of memory exist and are not interconnected (anonymous
>      pages, file pages, special memory such as DMA-BUF, various memory alloc in
>      kernel mode, etc.)
>   2. Memory isolation exists between processes, and apart from specific
>      shared memory, they do not communicate with each other.
>   3. During the transition of functionality within an application, a process
>      usually releases memory, while another process requests memory, and in
>      this process, memory has to be obtained from the lowest level through
>      competition.
> 
> For example abount camera application:
> 
> Camera applications typically provide photo capture services as well as photo
> preview services.
> The photo capture process usually utilizes DMA-BUF to facilitate the sharing
> of image data between the CPU and DMA devices.
> When it comes to image preview, multiple algorithm processes are typically
> involved in processing the image data, which may also involve heap memory
> and other resources.
> 
> During the switch between photo capture and preview, the application typically
> needs to release DMA-BUF memory and then the algorithms need to allocate
> heap memory. The flow of system memory during this process is managed by
> the PCP-BUDDY system.
> 
> However, the PCP and BUDDY systems are shared, and subsequently requested
> memory may not be available due to previously allocated memory being used
> (such as for file reading), requiring a competitive (memory reclamation)
> process to obtain it.
> 
> So, if it is possible to allow the released memory to be allocated with
> high priority within the application, then this can meet the locality
> requirement, improve performance, and avoid unnecessary memory reclaim.
> 
> PMC solutions are similar to PCP, as they both establish cache pools according
> to certain rules.
> 
> Why base on MEMCG?
> ===
> 
> The MEMCG container can allocate selected processes to a MEMCG based on certain
> grouping strategies (typical examples include grouping by app or UID).
> Processes within the same MEMCG can then be used for statistics, upper limit
> restrictions, and reclamation control.
> 
> All processes within a MEMCG are considered as a single memory unit,
> sharing memory among themselves. As a result, when one process releases
> memory, another process within the same group can obtain it with the
> highest priority, fully utilizing the locality of memory allocation
> characteristics within the MEMCG (such as APP grouping).
> 
> In addition, MEMCG provides feature interfaces that can be dynamically toggled
> and are fully controllable by the policy.This provides greater flexibility
> and does not impact performance when not enabled (controlled through static key).
> 
> 
> Abount PMC implement
> ===
> Here, a cache switch is provided for each MEMCG(not on root).
> When the user enables the cache, processes within the MEMCG will share memory
> through this cache.
> 
> The cache pool is positioned before the PCP. All order0 page released by
> processes in MEMCG will be released to the cache pool first, and when memory
> is requested, it will also be prioritized to be obtained from the cache pool.
> 
> `memory.cache` is the sole entry point for controlling PMC, here are some
> nested keys to control PMC:
>   1. "enable=[y|n]" to enable or disable targeted MEMCG's cache
>   2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" to control already
>   enabled PMC's behavior.
>     a) `nid` to targeted a node to change it's key. or else all node.
>     b) The `watermark` is used to control cache behavior, caching only when
>        zone free pages above the zone's high water mark + this watermark is
>        exceeded during memory release. (unit byte, default 50MB,
>        min 10MB per-node-all-zone)
>     c) `reaper_time` to control reaper gap, if meet, reaper all cache in this
>         MEMCG(unit us, default 5s, 0 is disable.)
>     d) `limit` is to limit the maximum memory used by the cache pool(unit bytes,
>        default 100MB, max 500MB per-node-all-zone)
> 
> Performance
> ===
> PMC is based on MEMCG and requires performance measurement through the
> sharing of complex workloads between application processes.
> Therefore, at the moment, we unable to provide a better testing solution
> for this patchset.
> 
> Here is the internal testing situation we provide, using the camera
> application as an example. (1-NODE-1-ZONE-8GRAM)
> 
> Test Case: Capture in rear portrait HDR mode
> 1. Test mode: rear portrait HDR mode. This scene needs more than 800M ram
>    which memory types including dmabuf(470M), PSS(150M) and APU(200M)
> 2. Test steps: take a photo, then click thumbnail to view the full image
> 
> The overall performance benefit from click shutter button to showing whole
> image improves 500ms, and the total slowpath cost of all camera threads reduced
> from 958ms to 495ms. 
> Especially for the shot2shot in this mode, the preview dealy of each frame have
> a significant improve.

Hello Huan,

thank you for sharing your work.

Some high-level thoughts:
1) Naming is hard, but it took me quite a while to realize that you're talking
about free memory. Cache is obviously an overloaded term, but per-memcg-cache
can mean absolutely anything (pagecache? cpu cache? ...), so maybe it's not
the best choice.
2) Overall an idea to have a per-memcg free memory pool makes sense to me,
especially if we talk 2MB or 1GB pages (or order > 0 in general).
3) You absolutely have to integrate the reclaim mechanism with a generic
memory reclaim mechanism, which is driven by the memory pressure.
4) You claim a ~50% performance win in your workload, which is a lot. It's not
clear to me where it's coming from. It's hard to believe the page allocation/release
paths are taking 50% of the cpu time. Please, clarify.

There are a lot of other questions, and you highlighted some of them below
(and these are indeed right questions to ask), but let's start with something.

Thanks

next prev parent reply	other threads:[~2024-07-02 19:28 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-02  8:44 Huan Yang
2024-07-02  8:44 ` [RFC PATCH 1/4] mm: memcg: pmc framework Huan Yang
2024-07-02  8:44 ` [RFC PATCH 2/4] mm: memcg: pmc support change attribute Huan Yang
2024-07-02  8:44 ` [RFC PATCH 3/4] mm: memcg: pmc: support reaper Huan Yang
2024-07-02  8:44 ` [RFC PATCH 4/4] mm: memcg: pmc: support oom release Huan Yang
2024-07-02 19:27 ` Roman Gushchin [this message]
2024-07-03  2:23   ` [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
2024-07-03 17:27     ` Shakeel Butt
2024-07-04  2:49       ` Huan Yang
2024-07-03 22:59     ` T.J. Mercier
2024-07-04  2:29       ` Huan Yang
2024-07-09  0:11         ` T.J. Mercier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZoRUukQUNqGHn_x1@google.com \
    --to=roman.gushchin@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=chrisl@kernel.org \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kasong@tencent.com \
    --cc=link@vivo.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=opensource.kernel@vivo.com \
    --cc=ryan.roberts@arm.com \
    --cc=schatzberg.dan@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox