Glenn Sizemore recently joined the vSAN team from one of the big blue storage vendors. His firsthand knowledge of traditional storage, and now of vSAN, gives him an interesting perspective. Check out Glenn’s most recent post as he discusses vSAN through the eyes of a storage administrator.
Today I would like to discuss vSAN from a slightly different perspective, that of the dedicated storage administrator. In my experience working with storage admins, I sometimes come across a group who, in my opinion, are overly dismissive of HCI solutions. The subtext is that HCI solutions are not capable of handling “real work.” Having transitioned from a significant storage vendor to the Storage and Availability team here at VMware, I was delighted to confirm that those preconceived notions are rooted in pride and tradition, not technical distinction. I don’t begrudge anyone who shares this out-of-date perspective, but I would like to confront several of the talking points often used within the industry when discussing HCI solutions and, more specifically, address how several of these criticisms have been mistakenly levied against vSAN.
HCI is fine for dev/test data, but it’s not reliable enough for production data.
Nope, this one is just FUD, aimed at making the consumer overly cautious because, as we all know, losing data in this business is a resume-generating event. The fact is that vSAN had a different starting point when it comes to availability, as it utilizes RAIN (a redundant array of independent nodes) in place of the more traditional RAID. While not genuinely unique, the shared-nothing architecture of vSAN does enable an enticing set of powerful capabilities. These capabilities are a direct result of the architectural choices that went into vSAN.
Fundamentally, there are only two ways to make a piece of data highly available: either provide multiple redundant paths to a shared media source, or store multiple copies of the data. vSAN, being a shared-nothing architecture, necessitates storing multiple copies of the data; however, it isn’t nearly as pedestrian as one may think. You see, vSAN is an object store, not a distributed file system. This means that availability and durability are implemented at a per-object (component) level and can be adjusted as the realities of a deployment change.
Speaking of durability, it’s worth calling out that vSAN implements all the bit-rot detection and self-healing subsystems one would find in a dedicated storage OS. At the end of the day, as long as the number of nodes lost does not exceed the failures to tolerate (FTT) policy setting, the data is still available. The real maturation process that vSAN has undergone over the past three years and six releases has been a refinement of the underlying systems which oversee this process. This helps ensure administrators don’t unknowingly put themselves in an exposed position.
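As a minimal sketch of the FTT rule, here is a simplified model in Python. It assumes plain mirroring, where an FTT of n keeps n + 1 full copies of an object on separate nodes, and it ignores witness components, fault domains, and rebuilds. The function names are made up for illustration and are not part of any vSAN API.

```python
def mirror_copies(ftt: int) -> int:
    """Full copies kept under mirroring (assumption: FTT + 1 copies)."""
    return ftt + 1


def data_available(ftt: int, failed_nodes: int) -> bool:
    """An object stays available while failures do not exceed its FTT."""
    return failed_nodes <= ftt


# With FTT=1 there are two copies, so losing one node is survivable
# and losing two is not.
one_failure = data_available(ftt=1, failed_nodes=1)
two_failures = data_available(ftt=1, failed_nodes=2)
```

Because the policy is applied per object, two VMs on the same cluster can carry different FTT values and therefore different exposure to the same outage.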
HCI is fine so long as you don’t need to scale.
This will sometimes be stated more directly as: vSAN is fine, but it doesn’t scale. This particular claim is a pet peeve of mine because it’s rarely an actual concern. When we are honest, the question we ask is not whether it will scale, but whether it scales sufficiently to solve the problem at hand. To answer that question, let’s take a brief look at the current scale points of a vSAN cluster. The vSAN 6.6 release currently supports up to 5 disk groups per host. Each disk group contains one cache device and up to 7 capacity drives. Since only the capacity drives contribute to usable capacity, that gives us a maximum of 35 capacity drives per vSAN node. Finally, we can combine up to 64 hosts into a single cluster.
Assuming 1.9TB flash drives, that would give us roughly 4PB of raw capacity in my theoretical cluster. Usable capacity is dependent on the policies applied to the component objects being stored. Therefore, I’d like to set that aside for a moment and instead continue to explore the scalability of vSAN. The final point is that, as of the date of this post, up to 2000 vSphere hosts can potentially be placed into a single management domain for a combined total of ~125PB of raw capacity. That is a LOT of flash storage, and to be honest, there are other restrictions which kick in and limit a deployment before the 125PB maximum in my example. If additional capacity were needed, we could easily use larger drives. Suffice it to say I believe we can safely move past the capacity argument: while not limitless, vSAN 6.6 is already able to support more hardware than all but a handful of deployments can afford or justify.
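The arithmetic behind those raw-capacity figures can be sketched in a few lines of Python. The limits are the ones quoted in this post; the constant and function names are my own, and decimal units (1PB = 1000TB) are an assumption.

```python
# Scale points quoted in this post for vSAN 6.6.
DISK_GROUPS_PER_HOST = 5
CAPACITY_DRIVES_PER_DISK_GROUP = 7
HOSTS_PER_CLUSTER = 64
HOSTS_PER_MANAGEMENT_DOMAIN = 2000
DRIVE_TB = 1.9  # example drive size used in the post


def raw_capacity_pb(hosts: int, drive_tb: float = DRIVE_TB) -> float:
    """Raw capacity in PB for a host count (assumes 1 PB = 1000 TB)."""
    drives = hosts * DISK_GROUPS_PER_HOST * CAPACITY_DRIVES_PER_DISK_GROUP
    return drives * drive_tb / 1000


cluster_pb = raw_capacity_pb(HOSTS_PER_CLUSTER)           # about 4.26 PB
domain_pb = raw_capacity_pb(HOSTS_PER_MANAGEMENT_DOMAIN)  # about 133 PB
```

The exact product for 2000 hosts lands a little above the rounded ~125PB figure, which is what you get by multiplying the rounded 4PB per cluster by 2000 / 64 clusters. Either way, the conclusion is the same.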
What about performance?
It depends, and it’s not productive to try to address performance any further than that in this post, as it’s just too involved. Again, though, there are sufficient proof points that vSAN is more than capable when it comes to performance and can exceed the actual required performance of all but the most demanding customer workloads.
Object count?!
Again, this is a simple misunderstanding. As of vSAN 6.0, every node added to a cluster can support 9000 component objects. A single component can be up to 255GB. Therefore, a full 64-node all-flash vSAN cluster would support 576,000 component objects. Mind you, this is merely the current upper limit. Assuming all components were consumed in support of VMDK objects, there is ~140PB of addressable space in a vSAN cluster to spread across 6000 VMs.
So, to circle back around to the beginning of the conversation, just how much “scale” does one need out of a single vSphere cluster? I wouldn’t be comfortable with having that many eggs in a single fault domain, regardless of the storage technology in use.
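For anyone who wants to check the component math, here is a quick Python sketch using the per-node and per-component limits quoted above. The variable names are mine, and the unit conversion is an assumption: the ~140PB figure lines up with binary units (1PB = 1024 × 1024GB), while decimal units give a slightly larger number.

```python
COMPONENTS_PER_HOST = 9000   # per-node limit quoted in the post
HOSTS_PER_CLUSTER = 64
MAX_COMPONENT_GB = 255       # maximum size of a single component

total_components = COMPONENTS_PER_HOST * HOSTS_PER_CLUSTER
addressable_gb = total_components * MAX_COMPONENT_GB

# Same quantity expressed with two unit conventions.
addressable_pb_decimal = addressable_gb / 1000**2  # 1 PB = 10^6 GB
addressable_pb_binary = addressable_gb / 1024**2   # 1 PB = 2^20 GB
```

This treats every component as full size and used only for VMDK data, so it is an upper bound rather than something a real deployment would hit.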
Space Efficiency then! HCI solutions can’t compete with dedicated storage operating systems when it comes to space efficiency technologies.
You may have noticed how we’ve moved past questions and into accusations. That’s because that’s how these conversations, unfortunately, tend to go. Nevertheless, it’s worth addressing.
vSAN 6.2 introduced erasure coding as a failure toleration mechanism. This was a game changer, as it enabled vSAN customers to realize the media efficiency of a RAID deployment with the flexibility and composable nature of HCI. It is a very compelling meet-in-the-middle approach which allows customers to tune their capacity pool as they see fit. Customers who value raw performance can utilize traditional full-copy RAID-1 mirroring.
Heck, they can even sacrifice additional capacity to further stripe a component across a RAID-0 under the RAID-1, granting additional queues and media to a given workload.
When capacity utilization is a priority, the customer can opt for a RAID-5/RAID-6 erasure-coded deployment, which consumes up to 50% less raw capacity than mirroring at the same failure tolerance, with comparatively little capacity lost to parity. The ability to configure how data is protected on a per-object basis, while standard for an object storage system, is unheard of in storage arrays. Traditionally, storage arrays build the parity mechanism into the media pool itself, so supporting multiple availability targets requires separate media pools.
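To put rough numbers on the trade-off between mirroring and erasure coding, here is a small Python sketch. The fragment layouts are assumptions not stated in this post: RAID-5 as 3 data + 1 parity, RAID-6 as 4 data + 2 parity, and mirroring as one full copy per tolerated failure plus the original. The policy labels and function names are illustrative only.

```python
from fractions import Fraction

# (data fragments, redundancy fragments) per policy; assumed layouts.
POLICIES = {
    "RAID-1, FTT=1": (1, 1),  # original + 1 mirror copy
    "RAID-1, FTT=2": (1, 2),  # original + 2 mirror copies
    "RAID-5, FTT=1": (3, 1),  # 3 data + 1 parity
    "RAID-6, FTT=2": (4, 2),  # 4 data + 2 parity
}


def raw_per_usable(policy: str) -> Fraction:
    """Raw capacity consumed per unit of usable capacity."""
    data, redundancy = POLICIES[policy]
    return Fraction(data + redundancy, data)


def savings_vs_mirroring(ec_policy: str, mirror_policy: str) -> Fraction:
    """Fraction of raw capacity saved by erasure coding at equal FTT."""
    return 1 - raw_per_usable(ec_policy) / raw_per_usable(mirror_policy)


raid5_savings = savings_vs_mirroring("RAID-5, FTT=1", "RAID-1, FTT=1")  # 1/3
raid6_savings = savings_vs_mirroring("RAID-6, FTT=2", "RAID-1, FTT=2")  # 1/2
```

Under these assumptions, the 50% figure corresponds to RAID-6 compared with mirroring at FTT=2, and RAID-5 saves about a third compared with mirroring at FTT=1.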
Oh yeah, and vSAN 6.2 also added support for deduplication and compression at the disk-group level. After spending way too much time arguing over whose compression and dedupe works the best… I would just like to summarize this by saying your results will vary, but not that much from vendor to vendor. Not all workloads are compatible, which is why the erasure coding RAID implementations are far more significant, in my opinion. However, if your workloads benefit from dedupe and/or compression, then you will see those benefits reflected in vSAN for a surprisingly minimal performance impact.
This post carried on a little longer than I had intended. If you made it this far, I would like to thank you for your time. If you disagree with any of the points made, please reach out and let me know. I’m open to changing my mind. I have come by my current opinion only after building and selling CI and HCI solutions for customers of all sizes for the past several years. While doing so, I watched vSAN from the outside as the engineering teams iterated through the problem space. Now on the inside, with a full view of the sausage factory, I’ve concluded that while vSAN is a new way of solving the VM storage problem, it offers some very compelling capabilities in addition to the undeniable advantages in ease of administration. If you’re on the fence and unsure how vSAN satisfies a particular concern addressed by your current solution, please drop me a line. I’d love to continue this conversation.
After this post went live, we invited Glenn onto the Virtually Speaking Podcast to add some color to these points. Enjoy!