From: Lance Yang
Date: Sat, 27 Jan 2024 16:03:27 +0800
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()
To: "Zach O'Keefe"
Cc: akpm@linux-foundation.org, Michal Hocko, Yang Shi, David Hildenbrand,
 songmuchun@bytedance.com, peterx@redhat.com, mknyszek@google.com,
 minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-api@vger.kernel.org
References: <20240118120347.61817-1-ioworker0@gmail.com>

Hey Zach,

Thanks for taking time to look into this!

On Sat, Jan 27, 2024 at 7:47 AM Zach O'Keefe wrote:
>
> > I’d like to add another real use case.
> >
> > In our company, we deploy applications using offline-online
> > hybrid deployment.
> > This approach leverages the distinctive
> > resource utilization patterns of online services, utilizing idle
> > resources during various time periods by filling them with
> > offline jobs. This helps reduce the growing cost expenditures
> > for the enterprise.
> >
> > Whether for online services or offline jobs, their requirements
> > for THP can be roughly categorized into three types:
> >
> > * The first type aims to use huge pages as much as possible
> >   and tolerates unpredictable stalls caused by direct reclaim
> >   and/or compaction.
> > * The second type attempts to use huge pages but is relatively
> >   latency-sensitive and cannot tolerate unpredictable stalls.
> > * The third type prefers not to use huge pages at all and is
> >   extremely latency-sensitive.
> >
> > After careful consideration, we decided to prioritize the
> > requirements of the first type and modify the THP settings
> > as follows:
> >
> >     echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> >     echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> >
> > With the introduction of MADV_COLLAPSE into the kernel,
> > it is no longer dependent on any sysfs setting under
> > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> > offers the potential for fine-grained synchronous control over
> > the huge page allocation mechanism, marking a significant
> > enhancement for THP.
> >
> > If the kernel supports a more relaxed (opportunistic)
> > MADV_COLLAPSE, we will modify the THP settings as follows:
> >
> >     echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> >     echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>
> [corrected, via 2 previous mails, to:
>     echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
>     echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]
> >
> > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> > to address the requirements of the second type.
> >
> > Why don't we favor madvise(MADV_COLLAPSE) for the first type
> > of requirements?
> > The main reason is that these requirements are typically for offline
> > jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> > which run primarily on the JVM. [..]
>
> Hey Lance,
>
> Thanks for providing this context, it's very helpful.
>
> Though, couldn't you use enabled=always, defrag=defer+madvise, then
> just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the
> behaviour you want? i.e.

prctl(PR_SET_THP_DISABLE) is a good choice that can fully meet
the needs of type-3 workloads.

I might prefer using enabled=madvise, as this would allow
applications to implement specific calls to madvise to
request huge pages selectively. If we set enabled=always,
some applications may not be optimized for or may not benefit
from huge pages. In such cases, using huge pages for all
allocations could lead to suboptimal performance.

>
> type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
> type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
> kswapd+kcompactd otherwise

Sorry, I did not express myself clearly. The type-2 requirements
should be:

type 2: apply MADV_HUGEPAGE with defrag=defer, or use a more
relaxed (opportunistic) MADV_COLLAPSE.

> type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs
>
> Or am I missing something? It sounds like a confounding issue is that
> these are external workloads, or you don't have ability to modify? But
> that would preclude MADV_COLLAPSE (unless you're using
> process_madvise()).

Sorry, my previous explanation was unclear.

What I meant is that the requirements of type-1 workloads
can be independent of any sysfs setting and can be addressed
using madvise(MADV_COLLAPSE). In this scenario, why haven't
I utilized it? The reason is that I currently lack the capability
to modify the JVM or PyTorch to make them compatible with
madvise(MADV_COLLAPSE).
Therefore, the needs of type-1 workloads still rely on sysfs settings.

>
> Appreciate the help understanding the use case. I'm not opposed to the
> idea in general, but IMO would be great to have a clear need for it

I appreciate your perspective!

Thanks again for your valuable insights and your suggestions!
Lance

> (and right now, we don't currently have alignment with the original
> motivating usecase (Go) in that regard w.r.t their plans).
>
> Thanks,
> Zach
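
For reference, the three-way split Zach suggests above (system-wide
enabled=always, defrag=defer+madvise) could be sketched in C roughly as
follows. This is only an illustrative sketch, not code from the patch or
from either author: the enum thp_class labels and the apply_thp_policy()
helper are hypothetical names, while madvise(MADV_HUGEPAGE) and
prctl(PR_SET_THP_DISABLE) are the calls named in the thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Hypothetical labels for the three workload types from the thread. */
enum thp_class {
    THP_EAGER = 1,         /* type 1: wants THP, tolerates stalls */
    THP_OPPORTUNISTIC = 2, /* type 2: wants THP, latency-sensitive */
    THP_NEVER = 3          /* type 3: no THP at all */
};

/*
 * Apply the per-type policy, assuming the system-wide settings are
 * enabled=always and defrag=defer+madvise.
 * Returns 0 on success, -1 with errno set on failure.
 */
int apply_thp_policy(enum thp_class cls, void *addr, size_t len)
{
    /* A non-empty range needs a base address. */
    assert(len == 0 || addr != NULL);

    switch (cls) {
    case THP_EAGER:
        /* Hint the range so faults may defrag synchronously to get a THP. */
        return madvise(addr, len, MADV_HUGEPAGE);
    case THP_OPPORTUNISTIC:
        /* No hint: use a THP if available, otherwise leave it to kswapd+kcompactd. */
        return 0;
    case THP_NEVER:
        /* Process-wide opt-out of THP. */
        return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
    }

    /* Unknown class: reject it rather than guess a policy. */
    errno = EINVAL;
    return -1;
}
```

A type-1 worker would call apply_thp_policy(THP_EAGER, buf, len) on a
freshly mmap()ed region before touching it, and a type-3 service would
call apply_thp_policy(THP_NEVER, NULL, 0) once at startup. On a kernel
built without THP support the madvise() call is expected to fail with
EINVAL, so a -1 return is probably best treated as "no hint applied"
rather than as a fatal error.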