Re: How does cache coherency work on such a system
All of the glueless designs (up to 8 sockets) just use standard Intel UPI (formerly QPI) links between the processors - same as in a 4-socket box, except that you get extra NUMA latency domains because there aren't enough UPI links to connect every CPU directly to every other one.
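To make the "extra NUMA latency domains" point concrete, here is a rough Python sketch. It assumes three UPI links per socket and a simple cube-style wiring, which is purely illustrative (real 8-socket boards may be wired differently, and the names `links`, `hops_from` and `latency_domains` are made up for this example). It just counts how many UPI hops separate socket 0 from every other socket.

```python
from collections import deque

SOCKETS = 8

# Hypothetical wiring: each socket has 3 links, arranged as a cube
# (socket s is linked to the sockets whose number differs in one bit).
# This is an assumption for illustration, not a vendor's actual topology.
links = {s: [s ^ 1, s ^ 2, s ^ 4] for s in range(SOCKETS)}


def hops_from(src):
    """Breadth-first search: minimum number of link hops from src to each socket."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in links[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def latency_domains(src):
    """Group sockets by hop count from src; each group is one latency domain."""
    domains = {}
    for sock, d in hops_from(src).items():
        domains.setdefault(d, []).append(sock)
    return {d: sorted(socks) for d, socks in sorted(domains.items())}


if __name__ == "__main__":
    for hops, socks in latency_domains(0).items():
        print(f"{hops} hop(s): sockets {socks}")
```

With this particular wiring, socket 0 sees local memory plus three remote distance classes (1, 2 and 3 hops). A fully connected 4-socket box would only have local and 1 hop.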
For the boxes that scale beyond 8S (like the KunLun and HPE's Superdome Flex), typically one of the UPI links on each processor is connected to custom silicon (think FPGAs) that acts as an agent/proxy to filter the coherency protocol traffic and, in some cases, cache data as well. This is why these systems can typically use Intel Gold CPUs as well as Platinum CPUs - the limit of 4 UPI device IDs per group of processors on Gold CPUs isn't a problem, as they reach the other processors via the agent/proxy silicon.
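As a very rough picture of what "filtering the coherency protocol" can mean, here is a toy directory-style snoop filter in Python. It is not how the KunLun or Superdome Flex silicon actually works; the class and method names are hypothetical, and it only shows the general idea: the proxy remembers which chassis may hold a cache line, so a write only has to snoop those chassis instead of broadcasting to every processor in the system.

```python
class ToySnoopFilter:
    """Illustrative only: tracks which chassis may hold a copy of each cache line."""

    def __init__(self):
        # cache line address -> set of chassis ids that may hold the line
        self.sharers = {}

    def record_read(self, addr, chassis):
        """A chassis fetched the line, so remember it as a possible sharer."""
        self.sharers.setdefault(addr, set()).add(chassis)

    def snoop_targets(self, addr, requester):
        """Chassis that would need a snoop if requester wants exclusive access."""
        return self.sharers.get(addr, set()) - {requester}

    def record_write(self, addr, chassis):
        """Writer takes ownership; returns the chassis that had to be invalidated."""
        targets = self.snoop_targets(addr, chassis)
        self.sharers[addr] = {chassis}
        return targets


if __name__ == "__main__":
    sf = ToySnoopFilter()
    sf.record_read(0x1000, chassis=0)
    sf.record_read(0x1000, chassis=2)
    # Chassis 1 writes: only chassis 0 and 2 are snooped, not the whole system.
    print(sorted(sf.record_write(0x1000, chassis=1)))
```

A line that nobody else has touched produces an empty target set, so no snoop traffic leaves the requesting chassis at all. That is the filtering benefit in a nutshell.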