Date: Fri, 1 Mar 2024 11:11:48 +0800
From: Peter Xu
To: James Houghton
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Muchun Song
Subject: Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications

Hey, James,

On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> On Thu, Feb 22, 2024 at 12:50 AM Peter Xu wrote:
> >
> > I want to propose a session to discuss how we should unify hugetlb into
> > core mm.
> >
> > Due to legacy reasons, hugetlb has plenty of its own code paths that are
> > plugged into core mm, causing itself even more special than shmem. While
> > it is a pretty decent and useful file system, efficient on supporting large
> > & statically allocated chunks of memory, it also added maintenance burden
> > due to having its own specific code paths spread all over the place.
>
> Thank you for proposing this topic. HugeTLB is very useful (1G
> mappings, guaranteed hugepages, saving struct page overhead, shared
> page tables), but it is special in ways that make it a headache to
> modify (and making it harder to work on other mm features).
>
> I haven't been able to spend much time with HugeTLB since the LSFMM
> talk last year, so I'm not much of an expert anymore. But I'll give my
> two cents anyway.
>
> > It went into a bit of a mess, and it is messed up enough to become a reason
> > to not accept new major features like what used to be proposed last year to
> > map hugetlb pages in smaller sizes [1].
> >
> > We all seem to agree something needs to be done to hugetlb, but it seems
> > still not as clear on what exactly, then people forgot about it and move
> > on, until hit it again. The problem didn't yet go away itself even if
> > nobody asks.
> >
> > Is it worthwhile to spend time do such work? Do we really need a fresh new
> > hugetlb-v2 just to accept new features? What exactly need to be
> > generalized for hugetlb? Is huge_pte_offset() the culprit, or what else?
> > To what extent hugetlb is free to accept new features?
>
> I think the smaller unification that has been done so far is great
> (thank you!!), but at some point additional unification will require a
> pretty heavy lift. Trying to enumerate some possible challenges:
>
> What does HugeTLB do differently than main mm?
> - Page table walking, huge_pte_offset/etc., of course.
> - "huge_pte" as a concept (type-erased p?d_t), though it shares its
>   type with pte_t.
> - Completely different page fault path (hugetlbfs doesn't implement
>   vm_ops->{huge_,}fault).
> - mapcount
> - Reservation/MAP_NORESERVE
> - HWPoison handling
> - Synchronization (hugetlb_fault_mutex_table, VMA lock for PMD sharing)
> - more...
>
> What does HugeTLB do that main mm doesn't do?
> - It keeps pools of hugepages that cannot be used for anything else.
> - It has PMD sharing (which can hopefully be replaced with mshare())
> - It has HVO (which can hopefully be dropped in a memdesc world)
> - more...?
>
> Page table sharing and HVO are both important, but they're not
> fundamental to HugeTLB, so it's not impossible to make progress on
> drastic cleanup without them.
>
> No matter what, we'll need to add (more) PUD support into the main mm,
> so we could start with that, though it won't be easy. Then we would
> need at least...
>
> (1) ...a filesystem that implements huge_fault for PUDs
>
> It's not inconceivable to add support for this in shmem (where 1G
> pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> This could be done in hugetlbfs, but then you'd have to make sure that
> the huge_fault implementation stays compatible with everything else in
> hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> you could create hugetlbfs-v2. I'm honestly not sure which of these is
> the least difficult -- probably the shmem route?

IMHO the hugetlb fault path can be the last thing to tackle; there seem to
be other lower-hanging fruits that are good candidates for such unification
work.

For example, what if we can reduce the customized hugetlb paths from
20 -> 2, where the customized fault() will be 1 out of the 2? To further
reduce those 2 paths we may need a new file system, but if it's good enough
maybe we don't need v2, at least not for someone looking for a cleanup: a
v2 is more suitable for someone who can properly define the new interface
first, and it can be much more work than a unification effort, and also
orthogonal in some way.

>
> (2) ...a mapcount (+refcount) system that works for PUD mappings.
>
> This discussion has progressed a lot since I last thought about it;
> I'll let the experts figure this one out[1].

I hope there will be a solid answer there.

Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
underneath. I still think it's a good plan, which may not apply to mTHP
but could be perfectly efficient & simple for hugetlb. The complexity lies
elsewhere, not in the counting itself, but I have a feeling it's still a
workable solution.

>
> Anyway, I'm oversimplifying things, and it's been a while since I've
> thought hard about this, so please take this all with a grain of salt.
> The main motivating use-case for HGM (to allow for post-copy live
> migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> other ways[2].

Do you know how far David went in that direction? When will there be a
prototype? Would it easily work with MISSING faults (not MINOR)?

I will be more than happy to see whatever solution comes up from the kernel
that will resolve that pain for VMs first. It's unfortunate that KVM will
have its own solution for hugetlb small mappings, but I also understand
there's more than one demand for that besides hugetlb on 1G (even though
I'm not 100% sure of that demand when I think about it again today: is it a
worry that the pgtable pages will take a lot of space when trapping
minor-faults? I haven't yet got time to revisit David's proposal there in
the past two months; nor do I think I fully digested the details back
then).

The answer to the above could also help me to prioritize my work, e.g.,
hugetlb unification is probably something we should do regardless, at least
for the sake of a healthy mm code base.
I have a plan to move HGM, or whatever it will be called, upstream if
necessary, but that can also depend on how fast the other project goes.
Personally I don't worry about hugetlb hwpoison yet (at least QEMU's
hwpoison handling is still pretty much broken.. which is pretty
unfortunate), but maybe any serious cloud provider should still care.

>
> > The goal of such a session is trying to make it clearer on answering above
> > questions.
>
> I hope we can land on a clear answer this year. :)

Yes. :)

Thanks for the write-up and summary.

>
> - James
>
> [1]: https://lore.kernel.org/linux-mm/049e4674-44b6-4675-b53b-62e11481a7ce@redhat.com/
> [2]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/
>

--
Peter Xu