Their "tail latencies less than 80μs at the 99th percentile" translates to ~40μs for one way, and honestly for customer VMs a lot of that happens in the customer kernel + virtualization layer. Third, Google publishes an SLA on round-trip network latency between customer VMs at. Additional latency from intermediate switches is measured in nanoseconds. You'll see better latencies between machines in the same rack than between two racks, but this is a matter of single microseconds rather than milliseconds. Second, assuming a reasonably competent design, adding more machines to a network doesn't significantly increase the latency of that network. You will never be able to replicate the performance of Google's network by buying some rackmounted servers from Dell and plumbing them together with Cisco switches. A datacenter at Google scale is architecturally similar to a supercomputer cluster running on InfiniBand. I used to work at Google in Tech Infra, so I'll offer an alternate perspective while trying not to spill secrets.įirst, Google has enough money that they can build their entire network out of custom hardware, custom firmware, and patch the kernel + userspace. The other replies are assuming networking in a big network is inherently slower than in a small network. Google Persistent Disk is a kind of NAS anyway, so that perfectly aligns with the paragraph above. Assuming each write requires updating at least one block - 4kb - the setup needs to sustain at least 185 MB/s, which is clearly beyond a single HDD. They get 4 billion messages per day, which is roughly 46k per second. In this specific case, it's pretty well played. That's a large critical pain in the butt, but this is necessary for industrial grade reliability, and also allows making your "slow drive" faster in the future. For proper mirroring, you need another array of HDDs, which will end up being a NAS with its own redundancy. So you mostly don't want to use a single HDD as your mirror. The whole setup slows down when the write queue is full, reading from the write-mostly device can get be extremely slow thanks to all the pending writes, and the slow drive will wear out quicker thanks to the sustained high load (though this one should not apply in this specific case). The slow drive does cause issues, for it is literally slow. TBH, in general, I don't think it's a good option for databases. This is an old dark magic that prevents (though not entirely) reading from a specific drive. I mean, that's the only method that I can think of here. The magic here is `-write-behind`/`-write-mostly` in `mdadm`. NVME RAID0 is too dangerous and RAID 1 is too expensive, but, by pouring few hundred more bucks, one can gain a marginal safety while enjoying the blazing fast RAID0. This reminds me of mirroring an SSD array to a HDD! I believe this is what some college kids get their hands on, since many motherboards come with 2 NVMEs. Many interesting presentations on the official nvme website. Many of these are able to provide significantly lower latencies than a millisecond.Īs a concrete example, AWS `io2-express` has latencies of ~0.25-0.5ms, though i'm not sure which technology it's using. But you can also have NVME-over-TCPIP, or NVME-over-fiber-channel. For example, local NVME SSDs will use PCIe transport. It is orthogonal to whether the disk is attached to your motherboard via pcie, or it lives somewhere else in the datacenter and uses a different transport layer. 
The other replies are assuming networking in a big network is inherently slower than in a small network. I used to work at Google in Tech Infra, so I'll offer an alternate perspective while trying not to spill secrets.

First, Google has enough money that they can build their entire network out of custom hardware and custom firmware, and patch the kernel + userspace. A datacenter at Google scale is architecturally similar to a supercomputer cluster running on InfiniBand. You will never be able to replicate the performance of Google's network by buying some rackmounted servers from Dell and plumbing them together with Cisco switches.

Second, assuming a reasonably competent design, adding more machines to a network doesn't significantly increase the latency of that network. You'll see better latencies between machines in the same rack than between two racks, but this is a matter of single microseconds rather than milliseconds. Additional latency from intermediate switches is measured in nanoseconds.

Third, Google publishes an SLA on round-trip network latency between customer VMs. Their "tail latencies less than 80μs at the 99th percentile" translates to ~40μs one way, and honestly, for customer VMs a lot of that happens in the customer kernel + virtualization layer.
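Finally, to make the first comment's point about transports concrete: with `nvme-cli` on Linux, attaching a remote NVMe/TCP namespace is just a matter of passing a different transport argument, and the result shows up as a regular block device. This is a sketch with made-up placeholder values, not a working target.

```python
# Sketch: attaching a remote NVMe/TCP namespace with nvme-cli on Linux.
# The address, port, and NQN are made-up placeholders for a real target.
cmd = [
    "nvme", "connect",
    "--transport", "tcp",                        # could also be rdma or fc
    "--traddr", "10.0.0.5",                      # hypothetical target IP
    "--trsvcid", "4420",                         # conventional NVMe/TCP port
    "--nqn", "nqn.2024-01.example.com:storage",  # hypothetical subsystem NQN
]
print(" ".join(cmd))  # printed here rather than executed; run as root to attach

# Once attached, the remote namespace shows up in `nvme list` as a regular
# /dev/nvmeXnY block device, same as a local PCIe-attached SSD.
```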