Date: Thu, 5 Sep 2019 20:50:48 +0200
From: Dominique Martinet
To: Matthew Wilcox
Cc: linux-mm@kvack.org
Subject: Re: How to use huge pages in drivers?
Message-ID: <20190905185048.GA23588@nautica>
References: <20190903182627.GA6079@nautica>
 <20190903184230.GJ29434@bombadil.infradead.org>
 <20190903212815.GA7518@nautica>
 <20190904170056.GA9825@nautica>
 <20190904175032.GL29434@bombadil.infradead.org>
 <20190905154400.GA30549@nautica>
 <20190905181555.GQ29434@bombadil.infradead.org>
In-Reply-To: <20190905181555.GQ29434@bombadil.infradead.org>

Matthew Wilcox wrote on Thu, Sep 05, 2019:
> On Thu, Sep 05, 2019 at 05:44:00PM +0200, Dominique Martinet wrote:
> > Question though - is it ok to insert small pages if the huge_fault
> > handler is called with PE_SIZE_PMD ?
> > (I think the pte insertion will automatically create the pmd, but would
> > be good to confirm)
>
> No, you need to return VM_FAULT_FALLBACK, at which point the generic code
> will create a PMD for you and then call your ->fault handler which can
> insert PTEs.

Hmm, that's a shame actually.
There is a rather costly round-trip between linux and mckernel to
determine what page size is used for this virtual address on the remote
side and to get the corresponding physical address, so basically when we
get the fault we do not know if this will be a PMD or PTE.

I'd rather avoid having to do one round-trip at the PMD stage, get told
this is a PTE, temporarily give up, wait to be called again with
PE_SIZE_PTE and do a second round-trip in this case.
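
To make the cost concrete, here is a rough userspace model of the
fallback flow described above. It is not kernel code: the enum values,
struct remote_side and every model_* / remote_* function are
hypothetical names made up for this sketch, and the PUD -> PMD -> PTE
ordering is a simplification of what the generic fault path does.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: names echo the kernel's, values are made up. */
enum page_entry_size { PE_SIZE_PTE, PE_SIZE_PMD, PE_SIZE_PUD };
enum fault_result { FAULT_NOPAGE, FAULT_FALLBACK };

/* Hypothetical stand-in for the remote (mckernel) side. */
struct remote_side {
	bool maps_with_pmd;	/* what the remote side would answer */
	int round_trips;	/* queries made so far */
};

/* Hypothetical costly query: page size + physical address lookup. */
static bool remote_query_is_pmd(struct remote_side *r)
{
	r->round_trips++;
	return r->maps_with_pmd;
}

/* Model of a ->huge_fault handler that follows the FALLBACK rule. */
static enum fault_result model_huge_fault(struct remote_side *r,
					  enum page_entry_size sz)
{
	if (sz == PE_SIZE_PUD)
		return FAULT_FALLBACK;	/* assume no 1GB mapping: no query */
	assert(sz == PE_SIZE_PMD);
	if (remote_query_is_pmd(r))
		return FAULT_NOPAGE;	/* PMD entry would be inserted here */
	return FAULT_FALLBACK;		/* small page: give up for now */
}

/* Model of the ->fault handler, called at PTE level after fallback. */
static enum fault_result model_fault(struct remote_side *r)
{
	remote_query_is_pmd(r);		/* has to ask the remote side again */
	return FAULT_NOPAGE;		/* PTE would be inserted here */
}

/* Model of the generic code: try PUD, then PMD, then PTE.
 * Returns the number of round-trips the fault cost. */
static int model_handle_fault(struct remote_side *r)
{
	if (model_huge_fault(r, PE_SIZE_PUD) != FAULT_FALLBACK)
		return r->round_trips;
	if (model_huge_fault(r, PE_SIZE_PMD) != FAULT_FALLBACK)
		return r->round_trips;
	model_fault(r);
	return r->round_trips;
}
```

In this model a fault on an address the remote side maps with small
pages costs two round-trips, while a PMD-mapped address costs one.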

I didn't see anything in the vm_fault struct that I could piggy-back on
to remember something from the previous call, and I'm pretty sure it
would be a bad idea to use the vma's vm_private_data here because there
could be multiple faults in parallel on other threads.

Looking at vmf_insert_pfn(), it will allocate a pmd because of
insert_pfn's get_locked_pte, so it does end up working (I never return a
page - we always return VM_FAULT_NOPAGE on success, so I do not see the
harm in doing it early if we can).
Following the code in __handle_mm_fault, assuming the pmd fault would
have returned fallback, I do not see any harm here - the pmd has
actually already been allocated at this point (at pmd level fault), it's
just set to none.

Not exactly pretty, though, and very definitely no guarantee it'll keep
working... I'll stick a comment saying what we should do at least :P

> It works the same way from PUDs to PMDs by the way, in case you ever
> have a 1GB mapping ;-)

Yes, already returning fallback in this case - but I'm just assuming
that won't happen so no round-trip here :)

> > Now that I've set it as dax I think it actually makes sense as in
> > "there's memory here that points to something linux no longer manages
> > directly, just let it be" and we might benefit from the other exceptions
> > dax have, I'll need to look at what this implies in more details...
>
> I think that should be fine, but I don't really know RHEL 7.3 all that
> well ;-)

Good enough for me, tests will tell me what I broke :)

> No problem ... these APIs are relatively new and not necessarily all
> that intuitive.

Looking at a recent vanilla linux in the evening and rhel's kernel at
work didn't help on my side (some fun differences like the VM_HUGE_FAULT
flag in the vma, but now that I understand it was added for abi
compatibility it does make sense - on an older module the function
pointer could just have been left uninitialized and thus non-null yet
not valid).
It definitely did help to be pointed at huge_fault() again.

Thanks,
-- 
Dominique