After a reboot following an upgrade, my TrueNAS system crashed with the following error:
```
scsi host0: Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI id=7, 32/253 SCBs
aic7xxx 0000:05:02.1: enabling device (0000 -> 0003)
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA read NO PASID] Request device [05:00.0] fault addr 0x12174000 [Fault reason 0x02] Present bit in context entry is clear
```
This was followed by a large number of SCB errors in an endless loop.
The problem is fairly straightforward if you know where to look. First, let’s break down the error message:
– DMAR – DMA remapping
– DRHD – DMA Remapping Hardware Unit Definition [1]
Because of the DMAR reference, the error is related to Intel's IOMMU technology (VT-d). What does the IOMMU do? When a device accesses memory via DMA, the IOMMU translates the address the device uses through tables that the kernel sets up for each device, and it blocks any access those tables do not allow. Without an IOMMU, a faulty device or a compromised driver could read or overwrite memory regions it shouldn't, silently corrupting data or exposing it.
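To make the mechanism more concrete, here is a deliberately simplified Python model. It is a sketch, not how the kernel or the VT-d hardware actually stores its tables: a single dictionary keyed by requester ID (bus, device, function) stands in for the root and context tables, and each entry holds a small page table. The names ToyIommu, map, translate, DmarFault and fmt_bdf, as well as the addresses, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # 4 KiB pages


class DmarFault(Exception):
    """Raised when the toy IOMMU blocks a DMA request."""


def fmt_bdf(bdf):
    """Format a (bus, device, function) tuple the way lspci prints it."""
    bus, dev, fn = bdf
    return f"{bus:02x}:{dev:02x}.{fn:x}"


@dataclass
class ToyIommu:
    # Requester ID (bus, device, function) -> that device's page table, which
    # maps I/O virtual page numbers to physical page numbers. Real VT-d walks a
    # root table (by bus) and a context table (by device/function); one dict
    # stands in for both here.
    context: dict = field(default_factory=dict)

    def map(self, bdf, iova, phys):
        """Let device `bdf` reach the page holding `phys` through address `iova`."""
        table = self.context.setdefault(bdf, {})
        table[iova // PAGE_SIZE] = phys // PAGE_SIZE

    def translate(self, bdf, iova):
        """Return the physical address for a DMA access or raise DmarFault."""
        table = self.context.get(bdf)
        if table is None:
            # Nothing set up for this requester at all (cf. fault reason 0x02).
            raise DmarFault(f"{fmt_bdf(bdf)}: no context entry for requester")
        page = table.get(iova // PAGE_SIZE)
        if page is None:
            # The device is known, but this address was never mapped for it.
            raise DmarFault(f"{fmt_bdf(bdf)}: address {iova:#x} not mapped")
        return page * PAGE_SIZE + iova % PAGE_SIZE


if __name__ == "__main__":
    iommu = ToyIommu()
    # Hypothetical mapping the driver might set up for function 05:02.1.
    iommu.map((0x05, 0x02, 1), 0x1000, 0x200000)
    print(hex(iommu.translate((0x05, 0x02, 1), 0x1010)))
    for bdf, iova in [((0x05, 0x02, 1), 0x5000), ((0x05, 0x00, 0), 0x1000)]:
        try:
            iommu.translate(bdf, iova)
        except DmarFault as err:
            print("blocked:", err)
```

Running it prints the translated address 0x200010, followed by two "blocked" lines: one for an address that was never mapped for the device, and one for a requester ID that has no entry at all, which is the kind of fault shown in the log above.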
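The root cause in my case was a PCIe-to-PCI bridge. If you look closely at the PCI device tree, you can see that the Adaptec controller (05:02.0 and 05:02.1) is not attached to the PCIe hierarchy directly: it sits behind a bridge at 04:00.0, whose secondary bus is 05: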
```
lspci -tv
-[0000:00]-+-00.0  Intel Corporation Gemini Lake Host Bridge
           ...
           +-13.0-[01-05]----00.0-[02-05]--+-03.0-[03]--+-00.0  QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA
           |                               |            \-00.1  QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA
           |                               \-07.0-[04-05]----00.0-[05]--+-02.0  Adaptec AHA-3960D / AIC-7899A U160/m
           |                                                            \-02.1  Adaptec AHA-3960D / AIC-7899A U160/m
           +-13.1-[06]--
           ...
```
So what happens?
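A device attempts to access memory via DMA. However, because the request passes through the bridge, it reaches the IOMMU with a requester ID that belongs to the bridge instead of the ID of the device that actually issued it. You can see this in the original error message: the faulting request came from 05:00.0 (bus 05, device 0, function 0, an ID a PCIe-to-PCI/PCI-X bridge can use for traffic it forwards from its secondary bus), while the actual PCI devices are 05:02.0 and 05:02.1. The IOMMU has no context entry for 05:00.0 ("Present bit in context entry is clear", fault reason 0x02), so it blocks the access.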
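The aliasing can be sketched in a few lines of Python. This is a toy model under two stated assumptions: the bridge substitutes the PCIe-to-PCI/PCI-X alias (its secondary bus number with device 0 and function 0) for the real requester, and MAPPED is a hypothetical stand-in for the requester IDs that have a context entry. The log only tells us that 05:00.0 was not among them. The function names are illustrative, not kernel APIs.

```python
# Toy model of PCIe requester-ID aliasing behind a bridge; not kernel code.


def parse_bdf(text):
    """Parse an lspci-style 'bus:device.function' string (hex) into a tuple."""
    bus, rest = text.split(":")
    dev, fn = rest.split(".")
    return int(bus, 16), int(dev, 16), int(fn, 16)


def fmt_bdf(bdf):
    """Format a (bus, device, function) tuple the way lspci prints it."""
    bus, dev, fn = bdf
    return f"{bus:02x}:{dev:02x}.{fn:x}"


def requester_id(bdf):
    """Pack a BDF into the 16-bit requester ID: bus[15:8], device[7:3], function[2:0]."""
    bus, dev, fn = bdf
    return (bus << 8) | (dev << 3) | fn


def bridge_alias(secondary_bus):
    """Assumed alias a PCIe-to-PCI/PCI-X bridge uses for DMA it forwards."""
    return (secondary_bus, 0, 0)


# Hypothetical set of requester IDs with a context entry in the IOMMU.
MAPPED = {parse_bdf("05:02.0"), parse_bdf("05:02.1")}

if __name__ == "__main__":
    for name in ("05:02.0", "05:02.1"):
        issuer = parse_bdf(name)
        # The Adaptec functions sit directly on bus 05, the bridge's secondary bus.
        seen = bridge_alias(issuer[0])
        verdict = "ok" if seen in MAPPED else "DMAR fault: no context entry"
        print(f"{name} (RID {requester_id(issuer):#06x}) arrives as "
              f"{fmt_bdf(seen)} (RID {requester_id(seen):#06x}): {verdict}")
```

Both Adaptec functions end up presenting the same ID, 05:00.0, so whatever the IOMMU has set up for 05:02.0 and 05:02.1 never gets consulted.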
How to solve this?
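Disable the Intel IOMMU entirely with a kernel parameter. Keep in mind that this also gives up the DMA protection described above, as well as PCI passthrough to virtual machines, which depends on the IOMMU: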
intel_iommu=off
How to do this permanently?
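In TrueNAS, you can add a kernel parameter with the following command. Because the option is managed by the TrueNAS middleware instead of being edited into the boot configuration by hand, it survives subsequent upgrades. Note that the call replaces whatever is currently stored in kernel_extra_options, so include any options you already use: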
```
midclt call system.advanced.update '{"kernel_extra_options": "intel_iommu=off"}'
```
Links:
[1]: https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt