mail archive of the barebox mailing list
* [PATCH 0/5] ARM64: make barebox compatible with KVM
@ 2024-10-09  6:05 Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 1/5] ARM64: io: implement I/O accessors in assembly Ahmad Fatoum
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:05 UTC (permalink / raw)
  To: barebox; +Cc: ejo

A patch series[1] for OpenEmbedded-core revealed that neither barebox
nor U-Boot reach their shell when tested on a native AArch64 server
with Qemu running with KVM enabled.

After investigation and help from the Qemu/KVM maintainers[2], it turns
out that software that wants to run under KVM needs to take care which
instructions it uses for MMIO accesses:

Accessing an MMIO region will trap, and KVM will check the exception
syndrome register and extract the information it needs to emulate the
access.

However, not all instructions that access memory can be described by
this Instruction Specific Syndrome. Notably, the pre- and post-index
variants of ldr/str don't have valid syndromes on trap; this is because
these instructions not only access the memory specified by the
register, but also increment the address held in the register. KVM
would need to decode them to be able to emulate that, which it
currently does not do.

Linux doesn't suffer from this issue, because it implements readl/writel
in assembly and makes sure they only use KVM-friendly instructions.
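
For example (reusing the MMIO address from patch 1), reading two
consecutive registers with a post-index load has no valid syndrome:

	ldr	x0, =0x9000fe0
	ldr	w1, [x0], #4	// no valid ISS: KVM cannot emulate this

while the KVM-friendly equivalent splits the access and the increment:

	ldr	w1, [x0]	// valid ISS: KVM can emulate this
	add	x0, x0, #4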

This series does the same for barebox and also fixes other crashes
and slow-downs related to I/O that were noticed along the way.

[1]: https://lore.kernel.org/all/b77f2c6737c330ef9ecce325d50a4aaa25b3e536.camel@linuxfoundation.org/
[2]: https://lore.kernel.org/all/89f184d6-5b61-4c77-9f3b-c0a8f6a75d60@pengutronix.de/

Ahmad Fatoum (5):
  ARM64: io: implement I/O accessors in assembly
  ARM64: board-dt-2nd: grow stack down from start of binary
  mtd: cfi-flash: use I/O accessors for reads/writes of MMIO regions
  ARM64: mmu: flush cacheable regions prior to remapping
  virtio: don't use DMA API unless required

 arch/arm/cpu/board-dt-2nd-aarch64.S |   2 +-
 arch/arm/cpu/mmu_64.c               | 105 ++++++++++++++++++++++++++--
 arch/arm/include/asm/io.h           |   5 +-
 arch/arm/include/asm/io64.h         |  99 ++++++++++++++++++++++++++
 drivers/mtd/nor/cfi_flash.c         |   2 +-
 drivers/mtd/nor/cfi_flash.h         |  15 +++-
 drivers/virtio/virtio_ring.c        |  85 +++++++++++++++++++---
 include/linux/virtio_ring.h         |   1 +
 8 files changed, 293 insertions(+), 21 deletions(-)
 create mode 100644 arch/arm/include/asm/io64.h

-- 
2.39.5





* [PATCH 1/5] ARM64: io: implement I/O accessors in assembly
  2024-10-09  6:05 [PATCH 0/5] ARM64: make barebox compatible with KVM Ahmad Fatoum
@ 2024-10-09  6:05 ` Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 2/5] ARM64: board-dt-2nd: grow stack down from start of binary Ahmad Fatoum
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:05 UTC (permalink / raw)
  To: barebox; +Cc: ejo, Peter Maydell, Ahmad Fatoum

We currently implement the I/O accessors in C with volatile accesses.
With I/O memory being mapped strongly ordered, we are thus assured that
the actual memory access size and order are as specified in the C code.

This was sufficient so far, but triggers problems when barebox is run
under KVM: KVM expects that MMIO regions are only accessed via
instructions that set the ISV bit in the ESR_EL1 register.

Examples of instructions that don't set the ISV bit are the pre-index
and post-index forms of ldr/str, e.g.:

	ldr     x0, =0x9000fe0
	ldr     w1, [x0], #4

This will fail when virtualized with KVM, because KVM would have to
decode the instruction on trap to handle the post-increment of x0,
which is currently unimplemented.

Let's fix this by implementing readl/writel and friends in assembly
unconditionally like Linux does.

This introduces some slight overhead, because the post-index variants
reduce code size when reading multiple registers sequentially, but this
ought to be negligible.
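
Driver-facing usage is unchanged; a minimal sketch (device base and
register offsets are hypothetical) of accesses that now compile to
loads/stores without writeback, i.e. with valid syndromes:

	void __iomem *base = IOMEM(0x09000000);	/* hypothetical UART base */
	u32 status;

	status = readl(base + 0x18);	/* ldr w_, [x_] - valid ISS */
	writel('A', base + 0x00);	/* str w_, [x_] - valid ISS */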

Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Link: https://lore.kernel.org/all/CAFEAcA_Yv2a=XCKw80y9iyBRoC27UL6Sfzgy4KwFDkC1gbzK7w@mail.gmail.com/
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 arch/arm/include/asm/io.h   |  5 +-
 arch/arm/include/asm/io64.h | 99 +++++++++++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm/include/asm/io64.h

diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index 9e9b13ad18c6..0b9bb9bfaf9a 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -5,7 +5,7 @@
 
 #include <linux/compiler.h>
 
-#ifndef CONFIG_CPU_64
+#ifdef CONFIG_ARM32
 /*
  * Generic IO read/write.  These perform native-endian accesses.  Note
  * that some architectures will want to re-define __raw_{read,write}w.
@@ -25,6 +25,9 @@ void __raw_readsl(const volatile void __iomem *addr, void *data, int longlen);
 #define writesb(p,d,l)		__raw_writesb(p,d,l)
 #define writesw(p,d,l)		__raw_writesw(p,d,l)
 #define writesl(p,d,l)		__raw_writesl(p,d,l)
+
+#elif defined(CONFIG_ARM64)
+#include <asm/io64.h>
 #endif
 
 #define	IO_SPACE_LIMIT	0
diff --git a/arch/arm/include/asm/io64.h b/arch/arm/include/asm/io64.h
new file mode 100644
index 000000000000..6e3cba97f7ca
--- /dev/null
+++ b/arch/arm/include/asm/io64.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Based on arch/arm/include/asm/io.h
+ *
+ * Copyright (C) 1996-2000 Russell King
+ * Copyright (C) 2012 ARM Ltd.
+ */
+#ifndef __ASM_IO64_H
+#define __ASM_IO64_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <asm/byteorder.h>
+
+/*
+ * Generic IO read/write.  These perform native-endian accesses.
+ */
+static __always_inline void ___raw_writeb(u8 val, volatile void __iomem *addr)
+{
+	volatile u8 __iomem *ptr = addr;
+	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+}
+
+static __always_inline void ___raw_writew(u16 val, volatile void __iomem *addr)
+{
+	volatile u16 __iomem *ptr = addr;
+	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+}
+
+static __always_inline void ___raw_writel(u32 val, volatile void __iomem *addr)
+{
+	volatile u32 __iomem *ptr = addr;
+	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+}
+
+static __always_inline void ___raw_writeq(u64 val, volatile void __iomem *addr)
+{
+	volatile u64 __iomem *ptr = addr;
+	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+}
+
+static __always_inline u8 ___raw_readb(const volatile void __iomem *addr)
+{
+	u8 val;
+	asm volatile("ldrb %w0, [%1]" : "=r" (val) : "r" (addr));
+	return val;
+}
+
+static __always_inline u16 ___raw_readw(const volatile void __iomem *addr)
+{
+	u16 val;
+
+	asm volatile("ldrh %w0, [%1]" : "=r" (val) : "r" (addr));
+	return val;
+}
+
+static __always_inline u32 ___raw_readl(const volatile void __iomem *addr)
+{
+	u32 val;
+	asm volatile("ldr %w0, [%1]" : "=r" (val) : "r" (addr));
+	return val;
+}
+
+static __always_inline u64 ___raw_readq(const volatile void __iomem *addr)
+{
+	u64 val;
+	asm volatile("ldr %0, [%1]" : "=r" (val) : "r" (addr));
+	return val;
+}
+
+
+#ifdef __LINUX_IO_STRICT_PROTOTYPES__
+#define __IOMEM(a)	(a)
+#else
+#define __IOMEM(a)	((void __force __iomem *)(a))
+#endif
+
+#define __raw_writeb(v, a) ___raw_writeb(v, __IOMEM(a))
+#define __raw_writew(v, a) ___raw_writew(v, __IOMEM(a))
+#define __raw_writel(v, a) ___raw_writel(v, __IOMEM(a))
+#define __raw_writeq(v, a) ___raw_writeq(v, __IOMEM(a))
+
+#define __raw_readb(a) ___raw_readb(__IOMEM(a))
+#define __raw_readw(a) ___raw_readw(__IOMEM(a))
+#define __raw_readl(a) ___raw_readl(__IOMEM(a))
+#define __raw_readq(a) ___raw_readq(__IOMEM(a))
+
+/*
+ * io{read,write}{16,32,64}be() macros
+ */
+#define ioread16be(p)		({ __u16 __v = be16_to_cpu((__force __be16)__raw_readw(p)); __v; })
+#define ioread32be(p)		({ __u32 __v = be32_to_cpu((__force __be32)__raw_readl(p)); __v; })
+#define ioread64be(p)		({ __u64 __v = be64_to_cpu((__force __be64)__raw_readq(p)); __v; })
+
+#define iowrite16be(v,p)	({ __raw_writew((__force __u16)cpu_to_be16(v), p); })
+#define iowrite32be(v,p)	({ __raw_writel((__force __u32)cpu_to_be32(v), p); })
+#define iowrite64be(v,p)	({ __raw_writeq((__force __u64)cpu_to_be64(v), p); })
+
+#endif	/* __ASM_IO64_H */
-- 
2.39.5





* [PATCH 2/5] ARM64: board-dt-2nd: grow stack down from start of binary
  2024-10-09  6:05 [PATCH 0/5] ARM64: make barebox compatible with KVM Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 1/5] ARM64: io: implement I/O accessors in assembly Ahmad Fatoum
@ 2024-10-09  6:05 ` Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 3/5] mtd: cfi-flash: use I/O accessors for reads/writes of MMIO regions Ahmad Fatoum
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:05 UTC (permalink / raw)
  To: barebox; +Cc: ejo, Ahmad Fatoum

When first added, the ARM64 generic DT image started with `adr x1, 0'.
With the addition of the optional EFI entry point, a nop with an
optional MZ magic needed to be added at the start of the binary, and so
`adr x1, 0' moved further down.

This 0, however, is interpreted relative to the program counter, and
thus the stack was now set up not from the start of the image down, but
from the location at which the adr instruction is located.

This location happens to be offset 0x48, so the stack not only
overwrites the header during execution, but is also not aligned to 16
bytes.

This issue has gone unnoticed so far, because the stack is only used
to find out the available memory (either from the FDT or the EFI boot
services), after which the stack is set up at a properly aligned fixed
location for the remainder of barebox' execution.

Under KVM, however, this quickly crashes on the first stack access:

        ldr     x0, =0x4007fff8
        mov     sp, x0
        stp     x0, x1, [sp] // <-- data abort

Thus fix the code to grow the stack down from the first address.
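
For illustration, assume a hypothetical load address of 0x40080000,
with the adr instruction at image offset 0x48; the constant operand is
applied PC-relative:

	adr	x1, 0		// old: x1 = PC = 0x40080048 (misaligned, inside the image)
	adr	x1, _text - .	// new: x1 = PC + (_text - .) = 0x40080000 (image start)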

Fixes: 742e78976dd4 ("ARM64: add optional EFI stub")
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 arch/arm/cpu/board-dt-2nd-aarch64.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/cpu/board-dt-2nd-aarch64.S b/arch/arm/cpu/board-dt-2nd-aarch64.S
index 030366c1cbf5..2ea13d20b450 100644
--- a/arch/arm/cpu/board-dt-2nd-aarch64.S
+++ b/arch/arm/cpu/board-dt-2nd-aarch64.S
@@ -22,7 +22,7 @@ ENTRY("start_dt_2nd")
 	.int   .Lpe_header_offset  /* reserved (PE-COFF offset) */
 	.asciz "barebox"	   /* unused for now */
 2:
-	adr x1, 0
+	adr x1, _text - .
 	mov sp, x1
 	/* Stack now grows into the 0x80000 image load offset specified
 	 * above. This is more than enough until FDT /memory is decoded.
-- 
2.39.5





* [PATCH 3/5] mtd: cfi-flash: use I/O accessors for reads/writes of MMIO regions
  2024-10-09  6:05 [PATCH 0/5] ARM64: make barebox compatible with KVM Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 1/5] ARM64: io: implement I/O accessors in assembly Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 2/5] ARM64: board-dt-2nd: grow stack down from start of binary Ahmad Fatoum
@ 2024-10-09  6:05 ` Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 4/5] ARM64: mmu: flush cacheable regions prior to remapping Ahmad Fatoum
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:05 UTC (permalink / raw)
  To: barebox; +Cc: ejo, Ahmad Fatoum

The memcpy() routine can use post-indexed memory access instructions
on ARM that would trigger data aborts under KVM.

To fix this, use memcpy_fromio/memcpy_toio or the (read|write)[bwlq]
family of functions, as these are safe to call on MMIO addresses.

Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 drivers/mtd/nor/cfi_flash.c |  2 +-
 drivers/mtd/nor/cfi_flash.h | 15 ++++++++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/mtd/nor/cfi_flash.c b/drivers/mtd/nor/cfi_flash.c
index 2cb3d5538fbd..ea2373a01827 100644
--- a/drivers/mtd/nor/cfi_flash.c
+++ b/drivers/mtd/nor/cfi_flash.c
@@ -892,7 +892,7 @@ static int cfi_mtd_read(struct mtd_info *mtd, loff_t from, size_t len,
 {
 	struct flash_info *info = container_of(mtd, struct flash_info, mtd);
 
-	memcpy(buf, info->base + from, len);
+	memcpy_fromio(buf, info->base + from, len);
 	*retlen = len;
 
 	return 0;
diff --git a/drivers/mtd/nor/cfi_flash.h b/drivers/mtd/nor/cfi_flash.h
index 5d3053f97156..a80a979f8c0f 100644
--- a/drivers/mtd/nor/cfi_flash.h
+++ b/drivers/mtd/nor/cfi_flash.h
@@ -260,7 +260,11 @@ static inline void flash_write32(u32 value, void *addr)
 
 static inline void flash_write64(u64 value, void *addr)
 {
-	memcpy((void *)addr, &value, 8);
+#if BITS_PER_LONG >= 64
+	__raw_writeq(value, addr);
+#else
+	memcpy_toio(addr, &value, 8);
+#endif
 }
 
 static inline u8 flash_read8(void *addr)
@@ -280,8 +284,13 @@ static inline u32 flash_read32(void *addr)
 
 static inline u64 flash_read64(void *addr)
 {
-	/* No architectures currently implement readq() */
-	return *(volatile u64 *)addr;
+	u64 value;
+#if BITS_PER_LONG >= 64
+	value = __raw_readq(addr);
+#else
+	memcpy_fromio(&value, addr, 8);
+#endif
+	return value;
 }
 
 /*
-- 
2.39.5





* [PATCH 4/5] ARM64: mmu: flush cacheable regions prior to remapping
  2024-10-09  6:05 [PATCH 0/5] ARM64: make barebox compatible with KVM Ahmad Fatoum
                   ` (2 preceding siblings ...)
  2024-10-09  6:05 ` [PATCH 3/5] mtd: cfi-flash: use I/O accessors for reads/writes of MMIO regions Ahmad Fatoum
@ 2024-10-09  6:05 ` Ahmad Fatoum
  2024-10-09  6:05 ` [PATCH 5/5] virtio: don't use DMA API unless required Ahmad Fatoum
  2024-10-15  6:54 ` [PATCH 0/5] ARM64: make barebox compatible with KVM Sascha Hauer
  5 siblings, 0 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:05 UTC (permalink / raw)
  To: barebox; +Cc: ejo, Marc Zyngier, Ahmad Fatoum

We currently invalidate memory ranges after remapping, because they may
contain stale dirty cache lines that interfere with further usage, e.g.
when running a memory test on a DRAM area that was again remapped
cached, stale cache lines may remain from an earlier run where the
region was cached.

The invalidation was done unconditionally for all memory regions being
remapped, even if they weren't previously cacheable. The CPU is within
its rights to upgrade the invalidate to a clean+invalidate, and it
doesn't know about the state of the caches at the time of the
permission check.

This led barebox, when run under qemu -enable-kvm on ARM64, to trigger
data aborts when executing DC IVAC on the cfi-flash MMIO region after
remapping it.

Fix this by replacing the unconditional invalidation after remapping
with flushing only the cacheable pages before remapping them
non-cacheable.
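
To illustrate the coalescing with a hypothetical layout:

	/*
	 * Remapping six pages, where pages 0-2 and 4-5 are currently
	 * mapped cacheable and page 3 is device memory:
	 *
	 *   [C][C][C][D][C][C]
	 *
	 * flush_cacheable_pages() merges neighbouring cacheable pages
	 * and issues exactly two v8_flush_dcache_range() calls, one
	 * covering pages 0-2 and one covering pages 4-5; the device
	 * page is skipped rather than invalidated.
	 */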

The way we implement it, with the previously unused find_pte()
function, is less optimal than it could be, but optimizing it further
(probably at the cost of readability) is left as a future exercise.

Cc: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/8634l97cfs.wl-maz@kernel.org/
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 arch/arm/cpu/mmu_64.c | 105 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 98 insertions(+), 7 deletions(-)

diff --git a/arch/arm/cpu/mmu_64.c b/arch/arm/cpu/mmu_64.c
index 7854f71f4cb6..94a19b1aec3c 100644
--- a/arch/arm/cpu/mmu_64.c
+++ b/arch/arm/cpu/mmu_64.c
@@ -63,15 +63,13 @@ static uint64_t *alloc_pte(void)
 }
 #endif
 
-static __maybe_unused uint64_t *find_pte(uint64_t addr)
+static uint64_t *__find_pte(uint64_t *ttb, uint64_t addr, int *level)
 {
-	uint64_t *pte;
+	uint64_t *pte = ttb;
 	uint64_t block_shift;
 	uint64_t idx;
 	int i;
 
-	pte = get_ttb();
-
 	for (i = 0; i < 4; i++) {
 		block_shift = level2shift(i);
 		idx = (addr & level2mask(i)) >> block_shift;
@@ -83,9 +81,17 @@ static __maybe_unused uint64_t *find_pte(uint64_t addr)
 			pte = (uint64_t *)(*pte & XLAT_ADDR_MASK);
 	}
 
+	if (level)
+		*level = i;
 	return pte;
 }
 
+/* This is currently unused, but useful for debugging */
+static __maybe_unused uint64_t *find_pte(uint64_t addr)
+{
+	return __find_pte(get_ttb(), addr, NULL);
+}
+
 #define MAX_PTE_ENTRIES 512
 
 /* Splits a block PTE into table with subpages spanning the old block */
@@ -168,6 +174,91 @@ static void create_sections(uint64_t virt, uint64_t phys, uint64_t size,
 	tlb_invalidate();
 }
 
+static size_t granule_size(int level)
+{
+	switch (level) {
+	default:
+	case 0:
+		return L0_XLAT_SIZE;
+	case 1:
+		return L1_XLAT_SIZE;
+	case 2:
+		return L2_XLAT_SIZE;
+	case 3:
+		return L3_XLAT_SIZE;
+	}
+}
+
+static bool pte_is_cacheable(uint64_t pte)
+{
+	return (pte & PTE_ATTRINDX_MASK) == PTE_ATTRINDX(MT_NORMAL);
+}
+
+/**
+ * flush_cacheable_pages - Flush only the cacheable pages in a region
+ * @start: Starting virtual address of the range.
+ * @end:   Ending virtual address of the range.
+ *
+ * This function walks the page table and flushes the data caches for the
+ * specified range only if the memory is marked as normal cacheable in the
+ * page tables. If a non-cacheable or non-normal page is encountered,
+ * it's skipped.
+ */
+static void flush_cacheable_pages(void *start, size_t size)
+{
+	u64 flush_start = ~0ULL, flush_end = ~0ULL;
+	u64 region_start, region_end;
+	size_t block_size;
+	u64 *ttb;
+
+	region_start = PAGE_ALIGN_DOWN((ulong)start);
+	region_end = PAGE_ALIGN(region_start + size);
+
+	ttb = get_ttb();
+
+	/*
+	 * TODO: This loop could be made more optimal by inlining the page walk,
+	 * so we need not restart address translation from the top every time.
+	 *
+	 * The hope is that with the page tables being cached and the
+	 * windows being remapped being small, the overhead compared to
+	 * actually flushing the ranges isn't too significant.
+	 */
+	for (u64 addr = region_start; addr < region_end; addr += block_size) {
+		int level;
+		u64 *pte = __find_pte(ttb, addr, &level);
+
+		block_size = granule_size(level);
+
+		if (!pte || !pte_is_cacheable(*pte))
+			continue;
+
+		if (flush_end == addr) {
+			/*
+			 * While it's safe to flush the whole block_size,
+			 * it's unnecessary time waste to go beyond region_end.
+			 */
+			flush_end = min(flush_end + block_size, region_end);
+			continue;
+		}
+
+		/*
+		 * We don't have a previous contiguous flush area to append to.
+		 * If we recorded any area before, let's flush it now
+		 */
+		if (flush_start != ~0ULL)
+			v8_flush_dcache_range(flush_start, flush_end);
+
+		/* and start the new contiguous flush area with this page */
+		flush_start = addr;
+		flush_end = min(flush_start + block_size, region_end);
+	}
+
+	/* The previous loop won't flush the last cached range, so do it here */
+	if (flush_start != ~0ULL)
+		v8_flush_dcache_range(flush_start, flush_end);
+}
+
 static unsigned long get_pte_attrs(unsigned flags)
 {
 	switch (flags) {
@@ -201,10 +292,10 @@ int arch_remap_range(void *virt_addr, phys_addr_t phys_addr, size_t size, unsign
 	if (attrs == ~0UL)
 		return -EINVAL;
 
-	create_sections((uint64_t)virt_addr, phys_addr, (uint64_t)size, attrs);
+	if (flags != MAP_CACHED)
+		flush_cacheable_pages(virt_addr, size);
 
-	if (flags == MAP_UNCACHED)
-		dma_inv_range(virt_addr, size);
+	create_sections((uint64_t)virt_addr, phys_addr, (uint64_t)size, attrs);
 
 	return 0;
 }
-- 
2.39.5





* [PATCH 5/5] virtio: don't use DMA API unless required
  2024-10-09  6:05 [PATCH 0/5] ARM64: make barebox compatible with KVM Ahmad Fatoum
                   ` (3 preceding siblings ...)
  2024-10-09  6:05 ` [PATCH 4/5] ARM64: mmu: flush cacheable regions prior to remapping Ahmad Fatoum
@ 2024-10-09  6:05 ` Ahmad Fatoum
  2024-10-14 12:48   ` Sascha Hauer
  2024-10-14 13:06   ` [PATCH] fixup! " Ahmad Fatoum
  2024-10-15  6:54 ` [PATCH 0/5] ARM64: make barebox compatible with KVM Sascha Hauer
  5 siblings, 2 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:05 UTC (permalink / raw)
  To: barebox; +Cc: ejo, Ahmad Fatoum

We have no Virt I/O drivers that make use of the streaming DMA API, but
the Virt queues are currently always allocated using the coherent DMA
API.

The coherent DMA API (dma_alloc_coherent/dma_free_coherent) doesn't yet
take a device pointer in barebox, unlike Linux, and as such it
unconditionally allocates uncached memory.

When normally run under Qemu, this doesn't matter. But once we enable
KVM, using uncached memory for the Virtqueues has considerable
performance impact.

To avoid this, let's mimic what Linux does and just sidestep the DMA
API if the Virt I/O device tells us that this is ok.

Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 drivers/virtio/virtio_ring.c | 85 ++++++++++++++++++++++++++++++++----
 include/linux/virtio_ring.h  |  1 +
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 0efe1e002506..787b04a766e9 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -299,14 +299,81 @@ static struct virtqueue *__vring_new_virtqueue(unsigned int index,
 	return vq;
 }
 
-static void *vring_alloc_queue(size_t size, dma_addr_t *dma_handle)
+/*
+ * Modern virtio devices have feature bits to specify whether they need a
+ * quirk and bypass the IOMMU. If not there, just use the DMA API.
+ *
+ * If there, the interaction between virtio and DMA API is messy.
+ *
+ * On most systems with virtio, physical addresses match bus addresses,
+ * and it _shouldn't_ particularly matter whether we use the DMA API.
+ *
+ * However, barebox' dma_alloc_coherent doesn't yet take a device pointer
+ * as argument, so even for dma-coherent devices, the virtqueue is mapped
+ * uncached on ARM. This has considerable impact on the Virt I/O performance,
+ * so we really want to avoid using the DMA API if possible for the time being.
+ *
+ * On some systems, including Xen and any system with a physical device
+ * that speaks virtio behind a physical IOMMU, we must use the DMA API
+ * for virtio DMA to work at all.
+ *
+ * On other systems, including SPARC and PPC64, virtio-pci devices are
+ * enumerated as though they are behind an IOMMU, but the virtio host
+ * ignores the IOMMU, so we must either pretend that the IOMMU isn't
+ * there or somehow map everything as the identity.
+ *
+ * For the time being, we preserve historic behavior and bypass the DMA
+ * API.
+ *
+ * TODO: install a per-device DMA ops structure that does the right thing
+ * taking into account all the above quirks, and use the DMA API
+ * unconditionally on data path.
+ */
+
+static bool vring_use_dma_api(const struct virtio_device *vdev)
 {
-	return dma_alloc_coherent(size, dma_handle);
+	return !virtio_has_dma_quirk(vdev);
 }
 
-static void vring_free_queue(size_t size, void *queue, dma_addr_t dma_handle)
+static void *vring_alloc_queue(struct virtio_device *vdev,
+			       size_t size, dma_addr_t *dma_handle)
 {
-	dma_free_coherent(queue, dma_handle, size);
+	if (vring_use_dma_api(vdev)) {
+		return dma_alloc_coherent(size, dma_handle);
+	} else {
+		void *queue = memalign(PAGE_SIZE, PAGE_ALIGN(size));
+
+		if (queue) {
+			phys_addr_t phys_addr = virt_to_phys(queue);
+			*dma_handle = (dma_addr_t)phys_addr;
+
+			/*
+			 * Sanity check: make sure we didn't truncate
+			 * the address.  The only arches I can find that
+			 * have 64-bit phys_addr_t but 32-bit dma_addr_t
+			 * are certain non-highmem MIPS and x86
+			 * configurations, but these configurations
+			 * should never allocate physical pages above 32
+			 * bits, so this is fine.  Just in case, throw a
+			 * warning and abort if we end up with an
+			 * unrepresentable address.
+			 */
+			if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
+				free(queue);
+				return NULL;
+			}
+		}
+		return queue;
+	}
+}
+
+static void vring_free_queue(struct virtio_device *vdev,
+			     size_t size, void *queue, dma_addr_t dma_handle)
+{
+	if (vring_use_dma_api(vdev))
+		dma_free_coherent(queue, dma_handle, size);
+	else
+		free(queue);
 }
 
 struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
@@ -327,7 +394,7 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
 
 	/* TODO: allocate each queue chunk individually */
 	for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
-		queue = vring_alloc_queue(vring_size(num, vring_align), &dma_addr);
+		queue = vring_alloc_queue(vdev, vring_size(num, vring_align), &dma_addr);
 		if (queue)
 			break;
 	}
@@ -337,7 +404,7 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
 
 	if (!queue) {
 		/* Try to get a single page. You are my only hope! */
-		queue = vring_alloc_queue(vring_size(num, vring_align), &dma_addr);
+		queue = vring_alloc_queue(vdev, vring_size(num, vring_align), &dma_addr);
 	}
 	if (!queue)
 		return NULL;
@@ -347,7 +414,7 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
 
 	vq = __vring_new_virtqueue(index, vring, vdev);
 	if (!vq) {
-		vring_free_queue(queue_size_in_bytes, queue, dma_addr);
+		vring_free_queue(vdev, queue_size_in_bytes, queue, dma_addr);
 		return NULL;
 	}
 	vq_debug(vq, "created vring @ (virt=%p, phys=%pad) for vq with num %u\n",
@@ -355,13 +422,15 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
 
 	vq->queue_dma_addr = dma_addr;
 	vq->queue_size_in_bytes = queue_size_in_bytes;
+	vq->use_dma_api = vring_use_dma_api(vdev);
 
 	return vq;
 }
 
 void vring_del_virtqueue(struct virtqueue *vq)
 {
-	vring_free_queue(vq->queue_size_in_bytes, vq->vring.desc, vq->queue_dma_addr);
+	vring_free_queue(vq->vdev, vq->queue_size_in_bytes,
+			 vq->vring.desc, vq->queue_dma_addr);
 	list_del(&vq->list);
 	free(vq);
 }
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bdef47b0fa6c..7bf667611938 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -103,6 +103,7 @@ struct virtqueue {
 	unsigned int num_free;
 	struct vring vring;
 	bool event;
+	bool use_dma_api;
 	unsigned int free_head;
 	unsigned int num_added;
 	u16 last_used_idx;
-- 
2.39.5





* Re: [PATCH 5/5] virtio: don't use DMA API unless required
  2024-10-09  6:05 ` [PATCH 5/5] virtio: don't use DMA API unless required Ahmad Fatoum
@ 2024-10-14 12:48   ` Sascha Hauer
  2024-10-14 13:05     ` Ahmad Fatoum
  2024-10-14 13:06   ` [PATCH] fixup! " Ahmad Fatoum
  1 sibling, 1 reply; 10+ messages in thread
From: Sascha Hauer @ 2024-10-14 12:48 UTC (permalink / raw)
  To: Ahmad Fatoum; +Cc: barebox, ejo

On Wed, Oct 09, 2024 at 08:05:11AM +0200, Ahmad Fatoum wrote:
> We have no Virt I/O drivers that make use of the streaming DMA API, but
> the Virt queues are currently always allocated using the coherent DMA
> API.
> 
> The coherent DMA API (dma_alloc_coherent/dma_free_coherent) doesn't yet
> take a device pointer in barebox, unlike Linux, and as such it
> unconditionally allocates uncached memory.
> 
> When normally run under Qemu, this doesn't matter. But once we enable
> KVM, using uncached memory for the Virtqueues has considerable
> performance impact.
> 
> To avoid this, let's mimic what Linux does and just sidestep the DMA
> API if the Virt I/O device tells us that this is ok.
> 
> Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
> ---
>  drivers/virtio/virtio_ring.c | 85 ++++++++++++++++++++++++++++++++----
>  include/linux/virtio_ring.h  |  1 +
>  2 files changed, 78 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 0efe1e002506..787b04a766e9 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -299,14 +299,81 @@ static struct virtqueue *__vring_new_virtqueue(unsigned int index,
>  	return vq;
>  }
>  
> -static void *vring_alloc_queue(size_t size, dma_addr_t *dma_handle)
> +/*
> + * Modern virtio devices have feature bits to specify whether they need a
> + * quirk and bypass the IOMMU. If not there, just use the DMA API.
> + *
> + * If there, the interaction between virtio and DMA API is messy.
> + *
> + * On most systems with virtio, physical addresses match bus addresses,
> + * and it _shouldn't_ particularly matter whether we use the DMA API.
> + *
> + * However, barebox' dma_alloc_coherent doesn't yet take a device pointer
> + * as argument, so even for dma-coherent devices, the virtqueue is mapped
> + * uncached on ARM. This has considerable impact on the Virt I/O performance,
> + * so we really want to avoid using the DMA API if possible for the time being.
> + *
> + * On some systems, including Xen and any system with a physical device
> + * that speaks virtio behind a physical IOMMU, we must use the DMA API
> + * for virtio DMA to work at all.
> + *
> + * On other systems, including SPARC and PPC64, virtio-pci devices are
> + * enumerated as though they are behind an IOMMU, but the virtio host
> + * ignores the IOMMU, so we must either pretend that the IOMMU isn't
> + * there or somehow map everything as the identity.
> + *
> + * For the time being, we preserve historic behavior and bypass the DMA
> + * API.
> + *
> + * TODO: install a per-device DMA ops structure that does the right thing
> + * taking into account all the above quirks, and use the DMA API
> + * unconditionally on data path.
> + */
> +
> +static bool vring_use_dma_api(const struct virtio_device *vdev)
>  {
> -	return dma_alloc_coherent(size, dma_handle);
> +	return !virtio_has_dma_quirk(vdev);
>  }
>  
> -static void vring_free_queue(size_t size, void *queue, dma_addr_t dma_handle)
> +static void *vring_alloc_queue(struct virtio_device *vdev,
> +			       size_t size, dma_addr_t *dma_handle)
>  {
> -	dma_free_coherent(queue, dma_handle, size);
> +	if (vring_use_dma_api(vdev)) {
> +		return dma_alloc_coherent(size, dma_handle);
> +	} else {
> +		void *queue = memalign(PAGE_SIZE, PAGE_ALIGN(size));
> +
> +		if (queue) {
> +			phys_addr_t phys_addr = virt_to_phys(queue);
> +			*dma_handle = (dma_addr_t)phys_addr;
> +
> +			/*
> +			 * Sanity check: make sure we didn't truncate
> +			 * the address.  The only arches I can find that
> +			 * have 64-bit phys_addr_t but 32-bit dma_addr_t
> +			 * are certain non-highmem MIPS and x86
> +			 * configurations, but these configurations
> +			 * should never allocate physical pages above 32
> +			 * bits, so this is fine.  Just in case, throw a
> +			 * warning and abort if we end up with an
> +			 * unrepresentable address.
> +			 */
> +			if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
> +				free(queue);
> +				return NULL;
> +			}
> +		}
> +		return queue;
> +	}
> +}
> +
> +static void vring_free_queue(struct virtio_device *vdev,
> +			     size_t size, void *queue, dma_addr_t dma_handle)
> +{
> +	if (vring_use_dma_api(vdev))
> +		dma_free_coherent(queue, dma_handle, size);
> +	else
> +		free(queue);
>  }
>  
>  struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
> @@ -327,7 +394,7 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
>  
>  	/* TODO: allocate each queue chunk individually */
>  	for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
> -		queue = vring_alloc_queue(vring_size(num, vring_align), &dma_addr);
> +		queue = vring_alloc_queue(vdev, vring_size(num, vring_align), &dma_addr);
>  		if (queue)
>  			break;
>  	}
> @@ -337,7 +404,7 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
>  
>  	if (!queue) {
>  		/* Try to get a single page. You are my only hope! */
> -		queue = vring_alloc_queue(vring_size(num, vring_align), &dma_addr);
> +		queue = vring_alloc_queue(vdev, vring_size(num, vring_align), &dma_addr);
>  	}
>  	if (!queue)
>  		return NULL;
> @@ -347,7 +414,7 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
>  
>  	vq = __vring_new_virtqueue(index, vring, vdev);
>  	if (!vq) {
> -		vring_free_queue(queue_size_in_bytes, queue, dma_addr);
> +		vring_free_queue(vdev, queue_size_in_bytes, queue, dma_addr);
>  		return NULL;
>  	}
>  	vq_debug(vq, "created vring @ (virt=%p, phys=%pad) for vq with num %u\n",
> @@ -355,13 +422,15 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
>  
>  	vq->queue_dma_addr = dma_addr;
>  	vq->queue_size_in_bytes = queue_size_in_bytes;
> +	vq->use_dma_api = vring_use_dma_api(vdev);

What's vq->use_dma_api good for? It's unused.

Sascha

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |




* Re: [PATCH 5/5] virtio: don't use DMA API unless required
  2024-10-14 12:48   ` Sascha Hauer
@ 2024-10-14 13:05     ` Ahmad Fatoum
  0 siblings, 0 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-14 13:05 UTC (permalink / raw)
  To: Sascha Hauer; +Cc: barebox, ejo

Hello Sascha,

On 14.10.24 14:48, Sascha Hauer wrote:
> On Wed, Oct 09, 2024 at 08:05:11AM +0200, Ahmad Fatoum wrote:
>> We have no Virt I/O drivers that make use of the streaming DMA API, but
>> the Virt queues are currently always allocated using the coherent DMA
>> API.
>>  	vq->queue_dma_addr = dma_addr;
>>  	vq->queue_size_in_bytes = queue_size_in_bytes;
>> +	vq->use_dma_api = vring_use_dma_api(vdev);
> 
> What's vq->use_dma_api good for? It's unused.

Linux uses it to decide, e.g. inside vring_map_single(), whether the
DMA API should be used. I think we may need it in barebox in the future
too, but it can always be added later. Please see the fixup I just sent
out.
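
For reference, this is roughly what Linux does with the flag (a sketch
from memory of drivers/virtio/virtio_ring.c, not barebox code):

	static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
					   void *cpu_addr, size_t size,
					   enum dma_data_direction direction)
	{
		/* bypass the DMA API if the device accesses memory directly */
		if (!vq->use_dma_api)
			return (dma_addr_t)virt_to_phys(cpu_addr);

		return dma_map_single(vring_dma_dev(vq), cpu_addr, size, direction);
	}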

Sorry for the inconvenience,
Ahmad

> 
> Sascha
> 


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |




* [PATCH] fixup! virtio: don't use DMA API unless required
  2024-10-09  6:05 ` [PATCH 5/5] virtio: don't use DMA API unless required Ahmad Fatoum
  2024-10-14 12:48   ` Sascha Hauer
@ 2024-10-14 13:06   ` Ahmad Fatoum
  1 sibling, 0 replies; 10+ messages in thread
From: Ahmad Fatoum @ 2024-10-14 13:06 UTC (permalink / raw)
  To: barebox; +Cc: Ahmad Fatoum

Linux uses the member to decide, e.g. inside vring_map_single(),
whether the DMA API should be used. I think we may need it in barebox
in the future too, but it can always be added later.

Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
---
 drivers/virtio/virtio_ring.c | 1 -
 include/linux/virtio_ring.h  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e5431dd3ee4..b0b402834aca 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -422,7 +422,6 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, unsigned int num,
 
 	vq->queue_dma_addr = dma_addr;
 	vq->queue_size_in_bytes = queue_size_in_bytes;
-	vq->use_dma_api = vring_use_dma_api(vdev);
 
 	return vq;
 }
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 7bf667611938..bdef47b0fa6c 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -103,7 +103,6 @@ struct virtqueue {
 	unsigned int num_free;
 	struct vring vring;
 	bool event;
-	bool use_dma_api;
 	unsigned int free_head;
 	unsigned int num_added;
 	u16 last_used_idx;
-- 
2.39.5





* Re: [PATCH 0/5] ARM64: make barebox compatible with KVM
  2024-10-09  6:05 [PATCH 0/5] ARM64: make barebox compatible with KVM Ahmad Fatoum
                   ` (4 preceding siblings ...)
  2024-10-09  6:05 ` [PATCH 5/5] virtio: don't use DMA API unless required Ahmad Fatoum
@ 2024-10-15  6:54 ` Sascha Hauer
  5 siblings, 0 replies; 10+ messages in thread
From: Sascha Hauer @ 2024-10-15  6:54 UTC (permalink / raw)
  To: barebox, Ahmad Fatoum; +Cc: ejo


On Wed, 09 Oct 2024 08:05:06 +0200, Ahmad Fatoum wrote:
> A patch series[1] for OpenEmbedded-core revealed that neither barebox
> nor U-Boot reach their shell when tested on a native AArch64 server
> with Qemu running with KVM enabled.
> 
> After investigation and help from the Qemu/KVM maintainers[2], it turns
> out that software that wants to run under KVM needs to take care which
> instructions it uses for MMIO accesses:
> 
> [...]

Applied, thanks!

[1/5] ARM64: io: implement I/O accessors in assembly
      https://git.pengutronix.de/cgit/barebox/commit/?id=ad6062f654a4 (link may not be stable)
[2/5] ARM64: board-dt-2nd: grow stack down from start of binary
      https://git.pengutronix.de/cgit/barebox/commit/?id=cf7081c0771f (link may not be stable)
[3/5] mtd: cfi-flash: use I/O accessors for reads/writes of MMIO regions
      https://git.pengutronix.de/cgit/barebox/commit/?id=133903c41840 (link may not be stable)
[4/5] ARM64: mmu: flush cacheable regions prior to remapping
      https://git.pengutronix.de/cgit/barebox/commit/?id=b8454cae3b1e (link may not be stable)
[5/5] virtio: don't use DMA API unless required
      https://git.pengutronix.de/cgit/barebox/commit/?id=3ebd05809a49 (link may not be stable)

Best regards,
-- 
Sascha Hauer <s.hauer@pengutronix.de>



