If you want something specific to Mathematica then I don't know, but in general:
Let $\sigma = \rho_{AD} = \textrm{Tr}_{BC}(\rho)$. $\rho$ is an operator on four subsystems, so it has four inputs and four outputs making it a rank 8 tensor. Let $i,i'$ be the indices corresponding to the input and output of $\rho$ on subsystem $A$, let $j,j'$ correspond to $B$, etc. The components of $\sigma$ are then $\sigma_{ii'll'} = \sum_{jk} \rho_{ii'jjkkll'}$.
If $\rho$ is sparse you only need to sum the entries which are nonzero. If $\rho$ is Hermitian then $\sigma$ will be Hermitian and it is enough to only compute the upper triangle (the lower triangle is then the conjugate of that). I don't believe there are any other optimizations available unless you know of further structure on $\rho$. The sum is trivially parallelizable - each combination $ii'll'$ is completely independent of the others.
This post has been migrated from (A51.SE)