From: Ryan Roberts <ryan.roberts@arm.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
Andrey Ryabinin <ryabinin.a.a@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
David Hildenbrand <david@redhat.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
John Hubbard <jhubbard@nvidia.com>, Zi Yan <ziy@nvidia.com>,
Barry Song <21cnbao@gmail.com>,
Alistair Popple <apopple@nvidia.com>,
Yang Shi <shy828301@gmail.com>,
Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <christophe.leroy@csgroup.eu>,
"Aneesh Kumar K.V" <aneesh.kumar@kernel.org>,
"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
linux-arm-kernel@lists.infradead.org, x86@kernel.org,
linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
Date: Tue, 13 Feb 2024 13:03:53 +0000 [thread overview]
Message-ID: <d2ca1de5-d833-4806-a5a2-75410e1f731a@arm.com> (raw)
In-Reply-To: <ZctaaeVFF8TpjA8Z@FVFF77S0Q05N.cambridge.arm.com>
On 13/02/2024 12:02, Mark Rutland wrote:
> On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote:
>> On 12/02/2024 12:00, Mark Rutland wrote:
>>> Hi Ryan,
>
> [...]
>
>>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	/*
>>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>>> +	 * this should never even be called. Regardless do our best to service
>>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>>> +	 * top of an existing contig range.
>>>> +	 */
>>>> +	pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>>> +}
>>>> +
>>>> +#define set_ptes set_ptes
>>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>>> +{
>>>> +	pte = pte_mknoncont(pte);
>>>
>>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>>> set_pte(), or do we expect callers to legitimately call this with the
>>> contiguous bit set in 'pte'?
>>>
>>> I think you explained this to me in-person, and IIRC we don't expect callers to
>>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>>> have to clear it here to defer the decision of whether to set/clear it when
>>> modifying entries. It would be nice if we could have a description of why/when
>>> we need to clear this, e.g. in the 'public API' comment block above.
>>
>> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
>> private to the architecture code and is never set directly by core code. If the
>> public API ever receives a pte that happens to have the PTE_CONT bit set, it
>> would be bad news if we then accidentally set that in the pgtable.
>>
>> Ideally, we would just unconditionally clear the bit before a getter returns
>> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
>> way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
>> and can therefore never accidentally pass such a pte into a setter function.
>> However, there is existing functionality that relies on being able to get a pte,
>> then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
>> to determine how big the leaf is. This is used in perf_get_pgtable_size().
>>
>> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
>> decided to allow PTE_CONT to leak through the getters and instead
>> unconditionally clear the bit when a pte is passed to any of the setters.
>>
>> I'll add a (slightly less verbose) comment as you suggest.
>
> Great, thanks!
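
For reference, the dependency that forces the getters to leak PTE_CONT is
pte_leaf_size(); on arm64 that is roughly the below (quoting from memory, so
treat it as a sketch rather than the exact definition):

  #define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)

perf_get_pgtable_size() calls this on a locklessly-read pte to report the real
mapping granule, which is why the getters can't just strip the bit.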
>
> [...]
>
>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>> +	 */
>>>> +	return mm != &init_mm;
>>>> +}
>>>
>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>> that while it is live, and I'm not sure if that needs any special handling.
>>
>> Well we never need this function in the hot (order-0 folio) path, so I think I
>> could add a check for efi_mm here without any performance implication. It's probably
>> safest to explicitly exclude it? What do you think?
>
> That sounds ok to me.
>
> Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
> mm (e.g. because of the way we construct that), I'd be happy with a comment.
We crossed streams - as per my other email, I'm confident that this is safe so
will just add a comment.
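
For completeness, the explicit exclusion we discussed would look something like
the below - a hypothetical sketch only, since I intend to keep mm_is_user() as
posted and just document why efi_mm never reaches this path:

static inline bool mm_is_user(struct mm_struct *mm)
{
	/*
	 * Kernel mappings can't tolerate the transient faults that
	 * folding/unfolding the contig bit can cause, so init_mm is
	 * never treated as a user mm; the same reasoning would cover
	 * efi_mm (declared in <linux/efi.h>) if we excluded it here.
	 */
	if (mm == &init_mm)
		return false;
#ifdef CONFIG_EFI
	if (mm == &efi_mm)
		return false;
#endif
	return true;
}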
>
> Probably best to Cc Ard for whatever we do here.
Ard is already on CC.
>
>>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>>> +{
>>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>>
>>> I think this can be:
>>>
>>> static inline pte_t *contpte_align_down(pte_t *ptep)
>>> {
>>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>>> }
>>
>> Yep - that's much less ugly - thanks!
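
(For the record: with 4K pages, CONT_PTES is 16 and sizeof(pte_t) is 8, so both
forms simply round ptep down to the 128-byte boundary at which the contpte
block's entries start within the page table page.)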
>>
>>>
>>>> +
>>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>> +			pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> +	unsigned long start_addr;
>>>> +	pte_t *start_ptep;
>>>> +	int i;
>>>> +
>>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> +		if (pte_dirty(ptent))
>>>> +			pte = pte_mkdirty(pte);
>>>> +
>>>> +		if (pte_young(ptent))
>>>> +			pte = pte_mkyoung(pte);
>>>> +	}
>>>
>>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>>
>>> 	bool dirty = false, young = false;
>>>
>>> 	for (...) {
>>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> 		dirty |= pte_dirty(ptent);
>>> 		young |= pte_young(ptent);
>>> 	}
>>>
>>> 	if (dirty)
>>> 		pte = pte_mkdirty(pte);
>>> 	if (young)
>>> 		pte = pte_mkyoung(pte);
>>>
>>> I suspect that might generate slightly better code, but I'm also happy with the
>>> current form if people think that's more legible (I have no strong feelings
>>> either way).
>>
>> I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
>> We also had the same comment against David's batching patches recently, and he
>> opted to stick with the former version:
>>
>> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
>>
>> So I'm inclined to leave it as is, since you're not insisting :)
>
> That rationale is reasonable, and I'm fine with this as-is.
>
> [...]
>
>>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>> +{
>>>> +	/*
>>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>>> +	 * set then that is considered consistent on its own because it is not
>>>> +	 * part of a contpte range.
>>>> +	 */
>>>> +
>>>> +	pgprot_t orig_prot;
>>>> +	unsigned long pfn;
>>>> +	pte_t orig_pte;
>>>> +	pgprot_t prot;
>>>> +	pte_t *ptep;
>>>> +	pte_t pte;
>>>> +	int i;
>>>> +
>>>> +retry:
>>>> +	orig_pte = __ptep_get(orig_ptep);
>>>> +
>>>> +	if (!pte_valid_cont(orig_pte))
>>>> +		return orig_pte;
>>>> +
>>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>>> +	ptep = contpte_align_down(orig_ptep);
>>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> +		pte = __ptep_get(ptep);
>>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> +
>>>> +		if (!pte_valid_cont(pte) ||
>>>> +		    pte_pfn(pte) != pfn ||
>>>> +		    pgprot_val(prot) != pgprot_val(orig_prot))
>>>> +			goto retry;
>>>> +
>>>> +		if (pte_dirty(pte))
>>>> +			orig_pte = pte_mkdirty(orig_pte);
>>>> +
>>>> +		if (pte_young(pte))
>>>> +			orig_pte = pte_mkyoung(orig_pte);
>>>> +	}
>>>> +
>>>> +	return orig_pte;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>>
>>> I'm struggling to convince myself that this is safe in general, as it really
>>> depends on how the caller will use this value. Which caller(s) actually care
>>> about the access/dirty bits, given those could change at any time anyway?
>>
>> I think your points below are valid, and agree we should try to make this work
>> without needing access/dirty if possible. But can you elaborate on why you don't
>> think it's safe?
>
> Having mulled this over, I think it is safe as-is, and I was being overly
> cautious.
>
> I had a general fear of potential problems stemming from the fact that (a) the
> accumulation of access/dirty bits isn't atomic and (b) the loop is unbounded.
> From looking at how this is used today, I think (a) is essentially the same as
> reading a stale non-contiguous entry, and I'm being overly cautious there. For
> (b), I think that's largely a performance concern and it would only retry
> indefinitely in the presence of mis-programmed entries or consistent racing
> with a writer under heavy contention.
>
> I think it's still desirable to avoid the accumulation in most cases (to avoid
> redundant work and to minimize the potential for unbounded retries), but I'm
> happy with that being a follow-up improvement.
Great! I'll do the conversion to ptep_get_lockless_nosync() as a follow up series.
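
Roughly what I have in mind for that variant - just a sketch, with the name and
exact semantics still to be agreed in that series:

/*
 * Sketch only: read a single entry without holding the PTL and without
 * accumulating access/dirty bits from the rest of a contpte block.
 * Callers that only need the pfn/permissions, or only need to know
 * whether the entry is a swap entry, would avoid the retry loop.
 */
static inline pte_t ptep_get_lockless_nosync(pte_t *ptep)
{
	return __ptep_get(ptep);
}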
>
>>> I took a quick scan, and AFAICT:
>>
>> Thanks for enumerating these; Saves me from having to refresh my memory :)
>>>
>>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>> and has the contig bit set. We could clean that up with a new interface, e.g.
>>> something like a new ptep_get_size_lockless().
>>>
>>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>> we look at the pte to start with, since we only care whether we can logically
>>> write to the page at that point.
>>>
>>> I see that we later follow up with:
>>>
>>> 	pte_val(pte) != pte_val(ptep_get(ptep))
>>>
>>> ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>> bits? So that shape of lockless-try...locked-compare sequence works?
>>>
>>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>> so this doesn't seem to matter.
>>>
>>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>> which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>>
>>> * For pte_range_none() the access/dirty bits don't matter.
>>>
>>> * For handle_pte_fault() I think we have the same shape of
>>> lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>> care about the access/dirty bits before we reach the locked compare step.
>>>
>>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>> report the access/dirty bits separately for each PTE, as we have done until
>>> now, to give an accurate representation of the contents of the translation
>>> tables.
>>>
>>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>> swap entry, the access/dirty bits don't matter.
>>>
>>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>>> be accumulated). Have I understood that correctly?
>>
>> Yes, I agree with what you are saying. My approach was to try to implement the
>> existing APIs accurately though, the argument being that it reduces the chances
>> of getting it wrong. But if you think the implementation is unsafe, then I guess
>> it blows that out of the water...
>
> I think your approach makes sense, and as above I'm happy to defer the API
> changes/additions to avoid the accumulation of access/dirty bits.
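
(For context, the lockless-try...locked-compare shape being referred to is
roughly the following - paraphrased from mm/gup.c rather than quoted exactly:

	pte = ptep_get_lockless(ptep);
	...
	/* later, after pinning the page / taking the lock */
	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep))))
		goto bail;	/* raced with an update; back out */

so if the lockless getter stopped accumulating access/dirty bits, that recheck
would need to compare modulo those bits, as discussed below.)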
>
>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>> bits,
>>
>> I think that would work - but will need to think a bit more on it.
>>
>>> and leave ptep_get_lockless() only reading a single entry?
>>
>> I think we will need to do something a bit less fragile. ptep_get() does collect
>> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
>> we will likely want to rename the function and make its documentation explicit
>> that it does not return those bits.
>>
>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>
>> Of course if I could convince you the current implementation is safe, I might be
>> able to sidestep this optimization until a later date?
>
> Yep. :)
>
> Mark.