From: Lance Yang <ioworker0@gmail.com>
To: "Zach O'Keefe" <zokeefe@google.com>
Cc: akpm@linux-foundation.org, Michal Hocko <mhocko@suse.com>,
	 Yang Shi <shy828301@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	songmuchun@bytedance.com,  peterx@redhat.com,
	mknyszek@google.com, minchan@kernel.org,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()
Date: Sat, 27 Jan 2024 16:03:27 +0800
Message-ID: <CAK1f24kc9pnOktT9zG68FykViGNnL+B9h+jCkF28YSn0PM7N9A@mail.gmail.com>
In-Reply-To: <CAAa6QmRZP5pL_5O7BpfjQf5LZ_ADGqYF_xdAYEbKXkqMViAwLw@mail.gmail.com>

Hey Zach,

Thanks for taking time to look into this!

On Sat, Jan 27, 2024 at 7:47 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> > I’d like to add another real use case.
> >
> > In our company, we deploy applications using offline-online
> > hybrid deployment. This approach leverages the distinctive
> > resource utilization patterns of online services, utilizing idle
> > resources during various time periods by filling them with
> > offline jobs. This helps reduce the growing cost expenditures
> > for the enterprise.
> >
> > Whether for online services or offline jobs, their requirements
> > for THP can be roughly categorized into three types:
> >
> > * The first type aims to use huge pages as much as possible
> > and tolerates unpredictable stalls caused by direct reclaim
> > and/or compaction.
> > * The second type attempts to use huge pages but is relatively
> > latency-sensitive and cannot tolerate unpredictable stalls.
> > * The third type prefers not to use huge pages at all and is
> > extremely latency-sensitive.
> >
> > After careful consideration, we decided to prioritize the
> > requirements of the first type and modify the THP settings
> > as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> >
> > With the introduction of MADV_COLLAPSE into the kernel,
> > it is no longer dependent on any sysfs setting under
> > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> > offers the potential for fine-grained synchronous control over
> > the huge page allocation mechanism, marking a significant
> > enhancement for THP.
> >
> > If the kernel supports a more relaxed (opportunistic)
> > MADV_COLLAPSE, we will modify the THP settings as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>
> [corrected, via 2 previous mails, to:
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]
>
>
> > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> > to address the requirements of the second type.
> >
> > Why don't we favor madvise(MADV_COLLAPSE) for the first type
> > of requirements?
> > The main reason is that these requirements are typically for offline
> > jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> > which run primarily on the JVM. [..]
>
> Hey Lance,
>
> Thanks for providing this context, it's very helpful.
>
> Though, couldn't you use enabled=always, defrag=defer+madvise, then
> just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the
> behaviour you want? i.e.

prctl(PR_SET_THP_DISABLE) is a good choice that can fully meet
the needs of type-3 workloads.
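
For reference, a minimal sketch of that per-process opt-out in C. The
helper name thp_disable_self() is made up for illustration; only the
prctl() call itself comes from the discussion above.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: opt the calling process out of THP.
 * Returns 0 on success or a negative errno value on failure.
 */
static int thp_disable_self(void)
{
    /* arg2 = 1 sets the THP-disable flag; the remaining args must be 0. */
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
        return -errno;
    return 0;
}
```

Since the flag is inherited across fork(2) and preserved across
execve(2), a small launcher could set it before exec'ing a type-3
workload that cannot be modified itself.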

I might prefer enabled=madvise, since it lets applications call
madvise(MADV_HUGEPAGE) to request huge pages selectively. With
enabled=always, applications that are not optimized for huge pages,
or that do not benefit from them, would still get them for every
eligible allocation, which could lead to suboptimal performance.
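
As an illustration of such a selective request, here is a rough sketch
in C. The helper map_with_thp_hint() and its out-parameter are
assumptions made for the example, not something from the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous region and hint that it should
 * be backed by huge pages. Returns the mapping, or NULL if mmap() fails.
 * *hint_err is 0 if the hint was accepted, else the errno from madvise().
 */
static void *map_with_thp_hint(size_t len, int *hint_err)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The hint is advisory: the mapping stays usable even if it fails. */
    *hint_err = (madvise(p, len, MADV_HUGEPAGE) == 0) ? 0 : errno;
    return p;
}
```

With enabled=madvise, only regions hinted this way are eligible for
THP, so the rest of the process keeps using base pages.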

>
> type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
> type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
> kswapd+kcompactd otherwise

Sorry, I did not express myself clearly. The type-2 requirement
should read:
type 2: apply MADV_HUGEPAGE with defrag=defer, or use a more
relaxed (opportunistic) MADV_COLLAPSE.

> type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs
>
> Or am I missing something? It sounds like a confounding issue is that
> these are external workloads, or ones you don't have the ability to
> modify? But that would preclude MADV_COLLAPSE (unless you're using
> process_madvise()).

Sorry, my previous explanation was unclear. What I meant is that
the requirements of type-1 workloads can be met independently of
any sysfs setting by using madvise(MADV_COLLAPSE).
So why haven't I used it? The reason is that I currently lack the
ability to modify the JVM or PyTorch to make them call
madvise(MADV_COLLAPSE).
Therefore, the needs of type-1 workloads still rely on sysfs settings.
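
To make that concrete, this is roughly the call a runtime would need to
add, sketched in C. The helper name try_collapse() is hypothetical, and
the fallback define assumes the asm-generic UAPI value for toolchains
whose headers predate MADV_COLLAPSE.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* assumed: asm-generic UAPI value (Linux 6.1+) */
#endif

/*
 * Hypothetical helper: synchronously ask the kernel to collapse
 * [addr, addr + len) into huge pages.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_collapse(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) != 0)
        return -errno;
    return 0;
}
```

A caller should treat failure as non-fatal: kernels without
MADV_COLLAPSE return EINVAL, and a collapse that cannot be completed
can fail with errors such as EAGAIN or ENOMEM while the memory remains
usable as base pages.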

>
> Appreciate the help understanding the use case. I'm not opposed to the
> idea in general, but IMO would be great to have a clear need for it

I appreciate your perspective!

Thanks again for your valuable insights and your suggestions!
Lance

> (and right now, we don't currently have alignment with the original
> motivating usecase (Go) in that regard w.r.t their plans).
>
> Thanks,
> Zach


