
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.

Summary:

We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.

Even using it as a bounce buffer is untenably slow. IIUC, ARM caches are not DMA-coherent, so I would really appreciate some insight on how to do the following (a rough kernel-side sketch of what I mean by item 1 follows the list):

  1. Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
  2. Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
  3. Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
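For concreteness, here's roughly what I mean by item 1. This is only a sketch based on what I've found in the docs, not something we have working; the addresses are placeholders, and I'm not sure memremap() (which seems to have appeared in kernel 4.3) is even the right call:

```c
#include <linux/io.h>

/* placeholders for our reserved region; the real addresses come from the
 * boot-time reservation */
#define RSVD_PHYS 0x1c000000UL
#define RSVD_SIZE (64UL << 20)

static void __iomem *rsvd_uncached;  /* what we get today via ioremap() */
static void *rsvd_cached;            /* what I think we actually want */

static int map_reserved(void)
{
    /* ioremap() gives a Device/uncacheable mapping on ARM -- this is the
     * mapping whose memcpy we measured at ~70 MB/s */
    rsvd_uncached = ioremap(RSVD_PHYS, RSVD_SIZE);

    /* memremap(..., MEMREMAP_WB) asks for a normal, write-back cacheable
     * mapping of RAM instead */
    rsvd_cached = memremap(RSVD_PHYS, RSVD_SIZE, MEMREMAP_WB);

    return (rsvd_uncached && rsvd_cached) ? 0 : -ENOMEM;
}
```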

More info:

I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.

Since this is an SoC, a lot of the boot setup is hard-coded in u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before control is handed to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it really does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
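For example (with made-up numbers, not our real ones), on a board with 1 GB of DRAM the u-boot environment passes something like this, so the kernel never touches the top 64 MB:

```
# illustrative only: cap the kernel's view of DRAM at 960 MB, leaving the
# top 64 MB unmanaged so it can be used as the DMA buffer
setenv bootargs 'console=ttyPS0,115200 root=/dev/ram rw mem=960M'
```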

Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We also use /dev/mem to map it into userspace, and I've timed memcpy there at around 70 MB/s.
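The userspace side is essentially the following (simplified, with placeholder addresses and no error handling; copies like this are what I timed at ~70 MB/s):

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RSVD_PHYS 0x1c000000UL   /* placeholder for our reserved region */
#define RSVD_SIZE (64UL << 20)

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);

    /* on ARM, /dev/mem mappings of memory the kernel doesn't manage come
     * back uncached, which I believe is a big part of the problem */
    uint8_t *buf = mmap(NULL, RSVD_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, RSVD_PHYS);

    static uint8_t bounce[4096];
    memcpy(bounce, buf, sizeof(bounce));   /* painfully slow from here */

    munmap(buf, RSVD_SIZE);
    close(fd);
    return 0;
}
```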

Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea), ioremap is not supposed to be used for this purpose; there are DMA-related APIs that should be used instead. Unfortunately, DMA buffer allocation in those APIs appears to be totally dynamic, and I haven't figured out how to tell them, "here's a physical address that's already reserved -- use that."
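In other words, everything I find looks like the following, where the kernel picks the address rather than taking mine (sketch only; names are placeholders):

```c
#include <linux/dma-mapping.h>

static void *buf_virt;
static dma_addr_t buf_bus;

/* the allocation style I keep finding: the kernel chooses both the CPU and
 * the bus address; there's no obvious way to say "use the 64MB region I
 * already reserved at a known physical address" */
static int alloc_dma_buffer(struct device *dev, size_t size)
{
    buf_virt = dma_alloc_coherent(dev, size, &buf_bus, GFP_KERNEL);
    return buf_virt ? 0 : -ENOMEM;
}
```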

One document I looked at is this one, but it's way too x86 and PC-centric: https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt

And this question also comes up at the top of my searches, but there's no real answer: "Get the physical address of a buffer under Linux"

Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and want a device structure, which in every example I've found is a PCI device. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate one, but that smells to me like abusing the API, not using it as intended.
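For reference, this is the sort of call I mean; the closest thing to a struct device I can see on our system would be the platform device for the fabric peripheral, but I don't know whether that's the intended use (sketch only):

```c
#include <linux/dma-mapping.h>
#include <linux/platform_device.h>

/* the call I keep seeing in PCI drivers, applied to the only struct device
 * we seem to have: the platform device for our FPGA peripheral */
static int example_probe(struct platform_device *pdev)
{
    return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
}
```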

BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.

Thank you a million for any help you can provide!

UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I skip the ioremap call the region won't be marked uncacheable, but then I have to work out how to do explicit cache management on ARM, which I haven't managed yet. One of the problems is that memcpy in userspace performs terribly. Is there a memcpy implementation optimized for uncached memory that I can use? Maybe I could write one; I still have to figure out whether this processor has NEON.
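If it turns out the Cortex-A9 here does have NEON, something along these lines is what I had in mind, written straight from the intrinsics documentation and completely untested:

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* untested sketch: copy in 64-byte chunks with NEON loads/stores, in the
 * hope of getting wider bursts out of an uncached mapping; n must be a
 * multiple of 64 */
static void neon_copy64(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 64) {
        uint8x16_t a = vld1q_u8(src + i);
        uint8x16_t b = vld1q_u8(src + i + 16);
        uint8x16_t c = vld1q_u8(src + i + 32);
        uint8x16_t d = vld1q_u8(src + i + 48);
        vst1q_u8(dst + i,      a);
        vst1q_u8(dst + i + 16, b);
        vst1q_u8(dst + i + 32, c);
        vst1q_u8(dst + i + 48, d);
    }
}
```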

4 Answers

If you want the mapping to be cached, I think you need a driver that implements mmap().

For this we use two device drivers, portalmem and zynqportal. In the Connectal project, we call the connection between user-space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable since Linux kernel version 3.8.x.

The portalmem driver provides an ioctl for allocating reference-counted chunks of memory, returning a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.

At allocation time, the application can choose a cached or an uncached mapping of the memory. On x86, the mapping is always cached. The implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot with pgprot_writecombine(), which enables buffering of writes but disables caching.
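The shape of that mmap() handler is roughly the following. This is a simplified sketch rather than the actual portalmem source; the per-buffer fields (paddr, size, cached) are illustrative:

```c
#include <linux/fs.h>
#include <linux/mm.h>

struct example_buffer {
    phys_addr_t paddr;   /* physical base of the allocated chunk */
    size_t size;
    bool cached;         /* chosen by the application at allocation time */
};

static int example_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct example_buffer *buf = filp->private_data;
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > buf->size)
        return -EINVAL;

    /* cached: leave vm_page_prot as normal write-back memory;
     * uncached: write-combining, so writes are buffered but nothing is
     * allocated in the data caches */
    if (!buf->cached)
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start, buf->paddr >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}
```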

The portalmem driver also provides an ioctl that invalidates data cache lines for a region and, where necessary, writes them back.
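In generic kernel terms, that maintenance boils down to the streaming-DMA sync calls; a sketch (the real driver works on the dma-buf's pages rather than a single handle):

```c
#include <linux/dma-mapping.h>

/* after the device has DMA'd fresh data into the buffer: discard stale
 * cache lines so the CPU reads what the hardware wrote */
static void sync_after_device_write(struct device *dev, dma_addr_t handle,
                                    size_t len)
{
    dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
}

/* after the CPU has filled the buffer: write dirty lines back to DRAM so
 * the device reads what the CPU wrote */
static void sync_before_device_read(struct device *dev, dma_addr_t handle,
                                    size_t len)
{
    dma_sync_single_for_device(dev, handle, len, DMA_TO_DEVICE);
}
```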

The portalmem driver itself knows nothing about the FPGA. For that we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA, so that the FPGA can use logically contiguous addresses and have them translated to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.

For PCI Express-attached FPGAs, we use the same portalmem driver together with pcieportal, with no changes to the user software.

Answered 2016-02-02T15:14:56.807