Date: Thu, 6 Jun 2024 17:33:31 -0400
From: Peter Xu <peterx@redhat.com>
To: Matthew Wilcox
Cc: James Houghton, Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song,
 linux-mm@kvack.org, Michal Hocko, Yu Zhao
Subject: Re: Unifying page table walkers
On Thu, Jun 06, 2024 at 09:04:53PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote:
> > Today the VM_HUGETLB flag tells the fault handler to call into
> > hugetlb_fault() (there are many other special cases, but this one is
> > probably the most important). How should faults on VMAs without
> > VM_HUGETLB that map HugeTLB folios be handled? If you handle faults
> > with the main mm fault handler without getting rid of hugetlb_fault(),
> > I think you're basically implementing a second, more tmpfs-like
> > hugetlbfs... right?
> >
> > I don't really have anything against this approach, but I think the
> > decision was to reduce the number of special cases as much as we can
> > first before attempting to rewrite hugetlbfs.
> >
> > Or maybe I've got something wrong and what you're asking doesn't
> > logically end up at a hugetlbfs v2.
>
> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> Once there, we'll do:
>
>	vmf.pud = pud_alloc(mm, p4d, address);
>	if (pud_none(*vmf.pud) &&
>	    thp_vma_allowable_order(vma, vm_flags,
>				TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
>		ret = create_huge_pud(&vmf);
>
> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>
> So all we need to do is implement huge_fault in hugetlb_vm_ops. I
> don't think that's the same as creating a hugetlbfs2 because it's just
> another entry point. You can mmap() the same file both ways and it's
> all cache coherent.

Matthew, could you elaborate more on how hugetlb_vm_ops.huge_fault()
could start to inject hugetlb pages without a hugetlb VMA? I mean, at
least currently what I read is this, where a hugetlb VMA is always
defined as:

	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
	vma->vm_ops = &hugetlb_vm_ops;

So any VMA that uses hugetlb_vm_ops will have VM_HUGETLB set for sure..

If you're talking about some other VMAs, it sounds to me like this
huge_fault() should belong to that new VMA's vm_ops? Then it sounds
like a way for some non-hugetlb VMA to reuse the hugetlb
allocator/pool of huge pages. I'm not sure I understand it right,
though..
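For concreteness, here is a rough, untested sketch of what wiring
huge_fault into hugetlb_vm_ops might look like along the lines Matthew
describes. hugetlb_huge_fault() is an invented name for illustration;
alloc_hugetlb_folio() and vmf_insert_pfn_pud() are existing functions,
but vmf_insert_pfn_pud() today serves DAX-style VM_PFNMAP/VM_MIXEDMAP
mappings, so a real implementation would need a hugetlb-aware PUD
insertion helper:

	/* Illustrative sketch only, not actual kernel code. */
	static vm_fault_t hugetlb_huge_fault(struct vm_fault *vmf,
					     unsigned int order)
	{
		struct vm_area_struct *vma = vmf->vma;
		struct folio *folio;

		/* Only handle the PUD (1G) case from the path quoted above. */
		if (order != PUD_ORDER)
			return VM_FAULT_FALLBACK;

		/*
		 * Take a 1G folio from the hugetlb reservation pool;
		 * alloc_hugetlb_folio() returns an ERR_PTR() on failure.
		 */
		folio = alloc_hugetlb_folio(vma, vmf->address & PUD_MASK, 0);
		if (IS_ERR(folio))
			return VM_FAULT_FALLBACK;

		/*
		 * Map the whole folio with a single PUD entry.  As noted,
		 * vmf_insert_pfn_pud() currently expects DAX-style VMAs,
		 * so this exact call is illustrative only.
		 */
		return vmf_insert_pfn_pud(vmf, pfn_to_pfn_t(folio_pfn(folio)),
					  vmf->flags & FAULT_FLAG_WRITE);
	}

	static const struct vm_operations_struct hugetlb_vm_ops = {
		.fault		= hugetlb_vm_op_fault,
		.huge_fault	= hugetlb_huge_fault,
		/* ...existing open/close/pagesize ops... */
	};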
Regarding the 4k mapping plan on hugetlb.. I talked to Michal, Yu, and
some other people during LSF/MM, and so far it seems to me the best
way to go is to allow shmem to provide 1G pages. Again, IMHO I'd be
totally fine if we finished the cleanup and just added HGM on top of
hugetlb v1, but it looks like I'm the only person who thinks like
that..

If we can introduce 1G pages to shmem, then as long as those 1G pages
can be as stable as a hugetlb v1 1G page, it's good enough for the VM
use case (which I care about, and I believe most cloud providers who
care about postcopy do too), and then adding 4k mappings on top of
that can avoid all the hugetlb concerns people have (even though I
think most of the logic that HGM wants will still be there).

That also kind of matches TAO's design, where we may have a better
chance of THPs being allocated dynamically within the 1G range;
however, the point in the VM context is that we'll want reliable 1G
pages, not split-able ones. We may want them split-able only at the
pgtable level, not the folios.

However, I don't yet think any of these are solid ideas. It would be
interesting to know how your thoughts correlate with this, since I
think you mentioned the 4k mapping somewhere. I'm also taking the
liberty of copying relevant people, just in case this could be a
relevant discussion for them.

[PS: I'll need to be off work tomorrow, so please expect a delay on
follow-up emails..]

Thanks,

-- 
Peter Xu