Listen to them in your car, watch them over lunch: these short interviews with virtualization management thought leaders, plus demonstrations by vExperts and Virt Gurus will certainly enrich your industry knowledge. Subscribe today and don't miss a thing!
Wednesday, May 16, 2012 | by Lauren Bonaca | (comments: 0)

VKernel’s Mattias Sundling discusses The Experts Conference event with MVPs Hans Vredevoort and Anil Desai. Topics include highlights of the technical sessions presented by Microsoft, Quest and industry experts as well as updates and highlights of Windows Server 2012 and Hyper-V3 advances.
Speakers
Mattias Sundling
Evangelist & vExpert at VKernel
Twitter: @msundling
Hans Vredevoort
Consultant & MVP at Inovativ
@hvredevoort
Blog: http://www.hyper-v.nu/
Anil Desai
Consultant and MVP at Anil Desai, Inc
@anildesai
Blog: http://anildesai.net/
For more Hyper-V and Microsoft podcasts, white papers, blogs and more, check out VKernel’s Hyper-V Performance Resources page and for more information about The Experts Conference, go to http://www.theexpertsconference.com.
Scott Herold: Okay, I want to thank you guys for having me out here. This is my second Minneapolis VMUG, the first one was probably four or five years ago back in the days when I was actually a Vizioncore employee. So I have been with the Quest Software family of companies since 2006; since we only had vRanger and vCharter as our only applications out there.
So I am out here, I think VKernel actually paid for the presentation, so you see things in VKernel orange today. Normally marketing comes up with some form of description of what I’m going to talk about. The interesting thing about marketing is they know exactly when they are lying to you, whereas the sales rep doesn’t really know that he is lying to you. So, whatever they actually said we are going to talk about is probably not the case. I like to do VMUGs, I like to have a little bit of fun, I’m not going to bore you with product demos, I’m not going to do click-next-to-continue type of stuff here. So, I am going to do a presentation that I started doing at VMworld this last year, talking about some of the challenges in IT, why this new dynamic data center really changes how and why we have to look at the complete infrastructure differently, and then going into some more educational type detail around capacity management, capacity planning in particular. Just give you guys some good information about why it’s important, what it actually entails and how to take advantage of it in your organization overall.
So with that one of the things that we see is, this is a kind of a typical organizational chart that we see in a lot of companies that we have talked to. It’s our marketing budget, you know because of the acquisition, you have to spend a little bit less money, so I had to create all the graphics myself. I am not an artist and it’s not my trade, so please excuse any amateur graphics here. What happens is everything is kind of initiated from that end-user. These are the people that are consuming your infrastructure, these are the VM consumers. These are the guys that when you provision a virtual machine they are going to say, it takes me one second longer to do what used to take you know half a second before. When these guys were unhappy, it causes problems across the organization, normally at different levels.
The first one is kind of the IT executive. Say I’m at Amazon, and Amazon.com goes down because of, say, some network misconfiguration that takes down the entire EC2 cloud. If I’m Jeff Bezos, instead of, you know, swimming in a big pile of money like Scrooge McDuck, I’m sitting there with a brown bag of wine, you know, trying to drink away my pains. When that happens, there is that person within the organization, and you couldn’t pay me enough to be this individual: that’s the application owner. This is the person that sits between IT and the consumer to be able to deliver services, ensure application uptime, meet service levels, all that fun stuff that normally within IT, on the infrastructure layer, we don’t really care about. This guy or girl, this individual, is one of the most stressed people out there. When something goes wrong, she or he is responsible for getting that infrastructure problem solved, getting that application back online and making the end users happy; very highly stressful. They interface a lot with the IT manager. The IT manager has a hell of a time trying to work with his team, trying to get the storage and network guys to stop fighting with each other, trying to help that virtualization administrator understand what’s happening in the environment. That’s one of the challenges that virtualization, cloud computing, this whole dynamic data center brings: troubleshooting problems becomes significantly more complex.
You know that additional layer of complexity, shared resources, a lot of things can happen that make it, you know, what normally would take you know a couple of hours to troubleshoot can turn it into a week long process. And there are several reasons for that and that’s kind of as we progress through the infrastructure at different layers, how things have changed over the last few years. The biggest change is obviously the server relationship, before servers used to run on their own, didn’t really matter what they did. If they consumed too much resources or not enough resources they really only kind of impact themselves, could be doing next to nothing, practically sleeping, partying like a 13th century Scotsman, taking all the resources that it wanted to.
This was originally done at VMworld, so keeping with the theme from, like, Caesar’s Palace: originally it was Julius Caesar. Brutus yelled “Sic semper tyrannis” before he killed Julius Caesar; it also turns out it’s what John Wilkes Booth yelled before shooting Lincoln. So, not too soon for a Caesar joke, a little too soon for an Abraham Lincoln joke. But again, what these servers are doing inside doesn’t really impact other things on the outside.
I kind of equate virtualization to that family road trip. First few hours of the trip everyone’s excited, we’re going to go see Disneyland, I get to go see Mickey Mouse. But two hours into that trip, the kids start to piss you off. “Stop touching me, I’m not crossing the line, I have to pee, are we there yet”, all these questions. My parents used to threaten me, saying I’m going to turn this car around, of course they never did. Unfortunately the hypervisors aren’t like that, if you start to misbehave, if you start to cross the line, if you’re doing too much, if you’re consuming too much, the hypervisor will actually pull back resources from you, it will turn that car around, and when it does that it causes problems across the broader infrastructure. And that’s something that’s difficult to troubleshoot, especially since every resource is shared, CPU, memory, disk and network resources.
So it’s trying to understand when the hypervisor is trying to protect the other workloads, and if that’s actually impacting things that are more critical than others. Another aspect that’s changed recently is the storage infrastructure. I can do an unofficial poll of pretty much any room and say, what is it that causes 85% of your performance problems in a virtual environment? Everyone’s going to jump up and down and say, storage, storage, storage. I’ve been doing virtualization for a long time and remember those times, those instances when I was first doing it. This was the early 2004-2005 timeframe, when ESX 1.5.2 with that wonderful brown MUI interface was still around. Going to the storage admin and saying, I need a 400 gigabyte datastore so I can provision my own storage and throw a whole bunch of virtual machines up there. And they would look at you like you practically killed their mother. It was just unthinkable to ask for that much data. And that’s kind of become the standard. And one of the things that we see is, I can go talk to a virtualization admin, and he’s going to say, man, that storage infrastructure is breaking my virtualization. I can also turn the tables and go to the storage administrator, who’s had a storage infrastructure longer than virtualization has existed.
And he is going to say, virtualization broke my storage infrastructure. It used to be very simple. It was a one-to-one method. I used to be able to take, you know, one or multiple data paths, present them to a server, and I knew exactly what was running on it. That’s a very direct point A to point B communication path. If something went wrong I knew how to troubleshoot it. Now when you combine server virtualization with storage virtualization you get that image on the right: you don’t know how things are getting from point A to point B, you don’t know how to track that through the infrastructure, you have to work with multiple teams. Oftentimes you have to get the network team involved because there is some form of IP-based storage in the mix. With those communication paths, now you have 3 teams trying to work together, most of whom spend 80% of the time trying to blame each other for something else or trying to fight with each other, so it becomes much more complicated and much more critical to understand the storage infrastructure. Again, because of that shared nature, and because that’s changed so drastically from how it was even 5, 6, 7 years ago.
From a network perspective, networking is inherently chatty; you can see what happens with anything on the physical wire, very easy to see and understand. Virtualization has kind of upped the ante on the importance of networking, again with storage communication being important, and with the need to still see and understand from an end-to-end standpoint what’s happening as a part of that communication. Where that starts to break down is when you have instances in which something doesn’t touch the physical wire, and there are plenty of instances in which virtual machines running on the same host, on the same virtual switch, won’t ever actually touch the physical wire; so how do you actually watch that traffic?
Now you need to look at different capabilities. VMware recently, with vSphere 5, officially supported NetFlow, so you can actually see the type of communication, the amount of communication and the starting and end points, so you can start to see what’s happening in that virtual layer that you didn’t have visibility into from a physical standpoint. So you need the right tools that can actually help you see both on the wire and off the wire, to really be able to get in and understand what’s happening from an end-to-end perspective across the actual network environment.
But what all this leads to is a change in how applications are ultimately built to run. In a, you know, large organization, you know, 10, 11, 12 years ago: I used to work for an insurance company, and the whole concept of rolling out an application update to a group of users scared everybody in IT. That model is that every single application gets a client installed on a desktop that talks to some form of application tier, which has some form of service tier, which ultimately talks to a database. It’s an archaic model. It was great at the time, it was all that there was, but what we are seeing is a demand from users to be able to get access to that data from anywhere, from any device, and the easiest way to do that is through web services, HTML5, Web 2.0 type technologies.
In that insurance example, the old way of doing this was a fat client on every single workstation, with massive upgrade costs and support costs to be able to do that, and slowness in actually being able to roll out these updates, maybe once a year if you are lucky. With web-based applications you can make an update by fixing the server and everyone is updated at once. You don’t have to worry about legacy versions of the software out there running around, and it allows you to very rapidly scale. So, if we have some form of natural disaster, how does that insurance company scale up the necessary resources to be able to get additional insurance agents answering phone calls and entering claims into the system without having to re-provision, you know, 200 desktops and actually get them out to the users? And this is where the applications are ultimately moving; it’s, again, making it so that the consumers of that infrastructure have access to the resources at any time, can rapidly scale and have the best services possible.
From the end user standpoint, again this is something that’s really starting to change, we saw all the capability that VMware is putting around, mobility of that user. Before end users all had a desktop or laptop computer. Of course, when you are running windows you get blue screens, occasionally computers will catch on fire, viruses all over the place because someone clicks on the e-mail that says naked picture of latest you know Justin Bieber type celebrity. And then that’s how they actually access our applications. If something happens to that hardware, the time that it takes to service that hardware becomes extremely complex. You can take them out of service for a couple of days trying to get things resolved.
Now with mobility, with desktop virtualization, with just how we deliver applications (Citrix has been doing this since time began), we are changing how you deliver those applications to the users so they have access to them. Even if they lose their hardware, you have that capability to give them what they need, where they need it and from the device they need it from. So, a great use case of this is, you know, just changing how IT can actually deliver service to the users. So Dropbox: nobody is supposed to use Dropbox, everybody does. IT yells at me about it all the time, but I turn around and tell my IT department, until you can give me a way to get access to all of my data and all my systems from any of my devices, I’m going to use Dropbox and there is nothing you can do about it. And that’s the case right now; it’s a problem that VMware is trying to solve with their project. We see some cloud-based vendors trying to help solve that problem as well, to make sure that people have their data but still make it secure for the corporate IT environment, to get them more comfortable with a service that they simply can’t provide, because of cost, because of complexity, because of security, but that their users are actually demanding and in many cases requiring.
So, being able to transition that into how users need access to this data, need access to the application and need all this information, ultimately that is what’s driving a lot of this change in infrastructure, introducing the world of cloud computing, but wreaking havoc on internal IT. So, what happens is we get this. It’s everybody’s nature to try to blame somebody else. You hear VMware talking about, you know, what happens when I have an application problem. I go yell at the virtualization admin; he tells me it’s my application that sucks, not his infrastructure. That’s what we see at pretty much every organization. It’s human nature to try to blame somebody else so I don’t have to do the hard work, or if I do have to do the hard work, try to find an excuse not to have to do it until next week.
And that’s what we get. End users have a problem. They are going to start interfacing with their line of business. The line of business has to try to sort it out. Is it the virtual team? He is going to blame the storage guy. Storage is going to blame the network, network is going to say it’s a database problem. Database is going to try and say the user is stupid, he doesn’t know what he is doing, and this is what we see in IT almost every single day. And this is something that we have to find a way to get around. So we started to see, as long as, you know, several years ago, that kind of accepted infrastructure team: a team of core individuals responsible for that infrastructure at all costs. A network guy, a storage guy, OS guy, hardware guy; get them on the same team, so they have the same set of goals. With the segmentation, the siloization of IT, where the networking team reports to a different director and the storage guy is separate from the hardware team, communication falls apart, and the ability to solve challenges in the organization is almost impossible unless you get the right organization set up. Until you can work adequately as a team, you’re never going to be able to get to the point where you’re not just trying to blame somebody else so you don’t have to do the work. So with that, that’s kind of the front portion of the presentation. Now it’s going to be the educational aspect of it.
I’ll still try to keep a little bit of fun, but I want to talk more about the importance of capacity management in your environment overall. What is it, you know, why is it important, and try to break it down to a fundamental level, to help you guys just get a better understanding of why you need to be concerned about capacity. Because of these shifts, it’s a little bit more important than traditional server monitoring. Yes, I hit 75% utilization on one of my ESX servers. Great! That’s good in the modern PC world or the modern server world with virtualization. But what does that mean for how much more I can add? When am I going to run out? These are the things that are shifting the concern from what you’ve had to look at in the past to what you need to look at going forward, because of the demands on IT, the fact that you can click a couple of mouse buttons and provision a new system for your users. That’s what they have come to expect. If you have to say, give me six weeks to order more hardware, that defeats a major purpose of why virtualization was put into a lot of organizations in the first place. So with that, just trying to understand fundamentally, what is capacity management? At the most basic level, capacity management is about the number of objects that can be added to a container that ultimately consumes resources.
This doesn’t just apply to a compute model. It can apply to several different things. In a very specific example, the number of VMs I can put on a host before I run out of CPU, memory or disk resources. From an Exchange environment, how many more users can I put on the mail store before I run out of memory or disk. So again you apply this exact same concept; actually, you know, fortunately for us, the exact same calculations can be used to determine these different types of scenarios. How many requests per second can I take on a web server before I run out of resources. And I’m not sure if anybody is familiar with Ron Oglesby, I think he may have been up here, he now works for Unidesk, he was my partner in crime several years ago in consulting, I wrote a couple of books with him. Big guy, former Navy Seal, not the kind of guy you want to piss off. But I did have the displeasure of rebooting his servers once. So the number of times you can reboot his servers, that tests his patience, before he wants to throttle you, and actually that answer is one. So the capacity of that is relatively small and you can fill it up very quickly.
So, when looking at capacity management, what are the key questions that are being asked of you as administrators, what is it that IT management is really concerned about, what is it that you need to be concerned about as you are looking at your environment? Again, it’s not, you know, how can I get 100 virtual machines onto a single server. Cramming everything into a box isn’t the right answer. It’s, you know, how much more can I do with my current capacity, how quickly am I filling it up, how much more can I do with that existing capacity, when am I going to run out; I guess those kind of answer the same questions there. What happens if I add or remove capacity, what happens if I change the growth patterns, what happens if I change how I use my resources, where is the best place to actually run my workloads. These are the things that you have to think about every single day. Every time somebody makes a request: do I have enough room, where do I have enough room, and am I going to be able to fulfill that request within a reasonable timeframe?
In order to really start understanding capacity, it all starts with performance, and performance is measured in utilization. So, utilization is a point-in-time snapshot of used versus capacity in my environment, normally displayed as a percentage value. The key question that you really have to look at around utilization is how many total objects can my environment hold? That’s a very, very important metric to understand. If someone makes a request for 100 virtual machines, can I actually fulfill that request without having to order more hardware, going through a longer process? How many objects am I currently holding in that environment, how many more objects can I add, and if I want to add X more objects, will they fit? Four very key questions in the environment that, you know, at a very simplistic level I can answer with just a very basic utilization image.
So, how many total objects can my environment hold in this case? Six slots. How many am I currently holding? Four. I can add two more, and if I want to add three or four more I can’t do it. Very, very simple, about as basic as you get from a capacity standpoint; but something that is actually very often overlooked by organizations is understanding what their systems are actually capable of. So, in reality, from an application standpoint within a virtual environment, you simply can’t look at capacity in a separate silo from performance. Again, it’s not about cramming 100 virtual machines into an ESX host; it’s a horrible metric. I wish it would die in a fire, because it’s just not right. The right question is how much more can I do in my environment without impacting the performance of everything else running in it. So when you take a look at not just physical capacity but also performance capacity, storage is a great example. I have 30 terabytes of storage attached to my ESX server, but as soon as I start running 10, 15, 20 virtual machines, I may only be able to fill up 20% of that before my disk latency hits 50, 60, 70 milliseconds.
That means I’m wasting a lot of space that I can’t use because how much I can actually put on there, as it relates to how my communication flows. CPU ready, one of the most critical values has been for a long time, how much time is my CPU waiting to run because it’s waiting for other processes to finish. I could have plenty of availability, in terms of CPU I’m only using 70, you know 75%, I should be able to run the other processes without a problem. Well depends on what type of calculations are occurring inside of the OS, system mode internal processes will consume CPU differently than user mode processes. And if you start to see that you’re overusing internal processes you’re going to see your CPU ready start to spike faster than in other systems, and being able to understand when you hit 7, 8, 9% CPU ready, you are going to be waiting for CPU cycles to become available for some of your workloads; even though you may have some 30- 40% of your total CPU available. So understanding that correlation of what’s a physical capacity limitation and what’s a performance capacity limitation is vital to really understanding how much more you can do in the environment, and you have to have visibility and correlate performance with capacity to get that right level of understanding.
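The two ideas above, counting free slots and then checking whether performance runs out before physical capacity does, can be sketched in a few lines of Python. This is only an illustration: the function names and the threshold defaults (80% CPU used, 5% CPU ready) are assumptions made for the example, not figures from the talk or from any product.

```python
def slot_capacity(total_slots, used_slots, requested):
    """Answer the four basic utilization questions for a fixed-size container."""
    if total_slots <= 0 or not 0 <= used_slots <= total_slots:
        raise ValueError("need total_slots > 0 and 0 <= used_slots <= total_slots")
    remaining = total_slots - used_slots
    return {
        "total": total_slots,
        "used": used_slots,
        "remaining": remaining,
        "utilization_pct": 100.0 * used_slots / total_slots,
        "request_fits": requested <= remaining,
    }


def has_headroom(cpu_used_pct, cpu_ready_pct, max_used_pct=80.0, max_ready_pct=5.0):
    """Room exists only if BOTH the physical and the performance limit are respected.

    The default thresholds are hypothetical; pick values that match your own
    service levels.
    """
    physical_ok = cpu_used_pct < max_used_pct
    performance_ok = cpu_ready_pct < max_ready_pct
    return physical_ok and performance_ok


# The six-slot example: four used, a request for three more does not fit.
report = slot_capacity(total_slots=6, used_slots=4, requested=3)

# Plenty of raw CPU left (65% used), but CPU ready is already at 8%.
cpu_bound_host = has_headroom(cpu_used_pct=65.0, cpu_ready_pct=8.0)
```

Here `report["remaining"]` is 2 and `report["request_fits"]` is `False`, while `cpu_bound_host` is `False` even though a third of the CPU is nominally free, which is the point of correlating performance with capacity.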
Once you have utilization locked down, it’s all about stacking utilization at different points in time to establish a trend. How has my environment been growing, how has it been changing over a period of time. Once you can formulate that pattern, you can start to actually answer that question, how has my utilization changed over time. Trending is very, very rarely static; you always have some form of anomaly in the data. Quarter-end processing will cause it to spike. If you’re a retailer, if you’re Best Buy or Target, they’re big in this area, every October, funnily enough, my utilization starts to go through the roof and it dies off at the end of December. I’m an online retailer, of course it’s going to increase over the Christmas season. So it’s being able to track these utilizations in these different time slices, understanding trends, not just at a micro level, looking at, well, over the last 8 hours I’ve slowly been increasing, but looking at broader patterns to encompass regular processing and annual trends that you see in your environment. And the more data that you have, the more accurately you can actually build up those trend lines and understand what’s normal behavior for your environment.
Once you get into understanding what that trend is, you can start to do some forecasting. Forecasting, just like the weather, is always a prediction; there always will be some margin of error when you try to predict what’s going to happen in the future. The questions that this portion of it really answers are: how many objects will I be consuming at a point in time in the future, and when will I fill up that capacity? In reality, the further out you forecast, the less accurate that prediction is going to be. It doesn’t matter how you slice up that data or what type of equations you apply to it: the further out, the less accurate. And different algorithms modify the accuracy of those predictions. If you just do a basic linear trend, you are going to miss curves, peaks, valleys, different things in that data, even using any of the products in the market, whether it’s ours or whether it’s VMware’s. If you change from logarithmic to algorithmic to polynomial to linear, you can have something that says, I can have 5 more virtual machines, or I can have 500 more virtual machines, depending on how it’s looking at that data, doing nothing more than simply changing the statistical calculation that you are using to figure out what that forecasting line looks like.
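To make the point about the choice of statistical calculation concrete, here is a minimal sketch that fits the same history two ways and asks when capacity is reached. The sample data, the capacity of 20 and the function names are invented for illustration, and the exponential curve is just one convenient non-linear shape, not a claim about what any particular product uses.

```python
import math


def least_squares(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_at_capacity_linear(weeks, counts, capacity):
    """Week index at which a straight-line fit reaches capacity."""
    slope, intercept = least_squares(weeks, counts)
    return (capacity - intercept) / slope


def week_at_capacity_exponential(weeks, counts, capacity):
    """Week index at which an exponential fit (linear in log space) reaches capacity."""
    rate, log_start = least_squares(weeks, [math.log(c) for c in counts])
    return (math.log(capacity) - log_start) / rate


# Hypothetical history: VM counts observed at weeks 0..3.
weeks = [0, 1, 2, 3]
counts = [2, 3, 5, 8]

linear_week = week_at_capacity_linear(weeks, counts, capacity=20)
exponential_week = week_at_capacity_exponential(weeks, counts, capacity=20)
```

With this data the straight line reaches 20 VMs at about week 9, while the exponential fit gets there around week 5. Same history, same capacity, and the answer nearly halves just by changing the model.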
In this very example, what we have is a linear trend. Starting off two weeks ago, I’m running 2 virtual machines. I added one last week, so I’m running 3. Today I added another one for a total of 4. Based on that I can say I am adding one virtual machine per week; next week we’re going to be at 5, and in two weeks we’d be at 6. So I can add two more to my environment; in two weeks I’ll be at full capacity. I’d better order more hardware now so I have enough time to get it up and working, so that 3 weeks out from now I don’t have to tell my customer, “sorry, I can’t add anything else until hardware comes.” Very, very simplistic example, but trying to show how the accuracy of this data can change, again based on the types of calculations that you are applying to it. Linear will give you different projections than polynomial. Polynomials can be a little bit more accurate. These are just, again, very basic hand-drawn images to show how just a slight curve in that trend line can change what that forecast ultimately shows.
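The arithmetic in this example can be written out as a small sketch. The function name and the three-week hardware lead time are assumptions for illustration; the history (2, 3, 4 VMs) and the six-slot capacity come from the example itself.

```python
def weeks_until_full(history, capacity):
    """Estimate weeks until capacity from equally spaced weekly counts (oldest first).

    Uses the average growth per week between the first and last observation,
    which is the simplest possible linear trend.
    """
    if len(history) < 2:
        raise ValueError("need at least two data points")
    growth_per_week = (history[-1] - history[0]) / (len(history) - 1)
    if growth_per_week <= 0:
        return None  # not growing, so this trend never fills the container
    return (capacity - history[-1]) / growth_per_week


# Two weeks ago: 2 VMs, last week: 3, today: 4, with room for 6 in total.
weeks_left = weeks_until_full([2, 3, 4], capacity=6)

# Hypothetical lead time for getting new hardware delivered and racked.
hardware_lead_time_weeks = 3
order_now = weeks_left is not None and weeks_left <= hardware_lead_time_weeks
```

`weeks_left` comes out as 2.0, and because that is shorter than the assumed lead time, `order_now` is `True`: the order has to go in before the container is actually full.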
From my standpoints the adjustment of resources is where thing gets really-really interesting. This is all the variables that you can think of that change how you are using your utilization. How much capacity you ultimately have, any number of things that change that equation of how many more objects can I add to a container that consumes resources. I change my container size, I have increased it, I have decreased it because I lost one of my ESX servers in my cluster. All these different types of things ultimately change what that impact is of those long term questions. If I add more capacity how many more objects can I add because of that; being able to tell my management, if we buy 2 more ESX servers for our cluster we will be able to run 45 more virtual machines. Very important information that can be relatively simply captured and gathered using a right set of tools.
So if we take a look, the adjustments can occur in multiple ways, either increasing capacity or decreasing capacity. Increasing capacity could be: I added more memory to a host because I was running low. Over time my utilization is steadily increasing. I order my memory today, when I’m at 66% utilization. Based on my growth trend, next week I’m really pushing the limit; I pop in more memory and all of a sudden I’ve dropped from 83% utilization to 75% utilization, and I can run more virtual machines. The same thing can be applied to adding an additional host to a cluster, or creating an additional cluster to accommodate more workloads. Provisioning additional LUNs or using different storage increases my overall throughput and capacity from a storage standpoint. All these things change how I use each of these different resources individually. Just the same, if I’m looking at decreasing capacity, what happens in my cluster? In a normal scenario I want to make sure that I plan for N plus 1 failover of any host in my cluster. Sometimes N plus 2, N plus 3, depending on how critical that information is.
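The effect of adding capacity on utilization is simple division, sketched below. The gigabyte figures are made up so that the percentages line up with the 83% and 75% mentioned above; the talk does not give the underlying memory sizes.

```python
def utilization_pct(used, capacity):
    """Utilization as a percentage of capacity."""
    return 100.0 * used / capacity


def capacity_needed(used, target_pct):
    """Capacity required to bring the current usage down to a target utilization."""
    return used * 100.0 / target_pct


# Hypothetical host: 60 GB of memory in use.
used_gb = 60.0
before = utilization_pct(used_gb, 72.0)      # 72 GB installed
after = utilization_pct(used_gb, 80.0)       # after adding 8 GB
target_capacity = capacity_needed(used_gb, target_pct=75.0)
```

Under these assumed numbers `before` is about 83.3%, `after` is 75.0%, and `capacity_needed` confirms that 80 GB is what it takes to sit at the 75% target. The same two functions apply unchanged to hosts in a cluster or storage capacity.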
Being able to understand and set my service level: if I lose one server in my environment, I still don’t want to be more than 75% utilized. Being able to understand what those calculations mean and how that ultimately applies to being able to determine, if this host goes down, will I suffer a performance impact on my existing workloads? In the top use case you’d be at 100% utilization if you lost one of the hosts while running 12 virtual machines; if you are running 13 virtual machines and you lose the host, you are going to run into a problem where you can’t fully accommodate the load that’s being demanded by your existing workloads. Being able to understand this information is vital to maintaining that level of performance, meeting the service levels for your users and not causing a problem that could have been avoided by just having the right level of planning and insight into when you are ultimately going to fill up your capacity.
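An N plus 1 check like the one described here can be sketched as follows. The cluster layout (three hosts with room for six VMs each) is an assumption chosen so that twelve slots survive a single host failure, matching the 12 and 13 VM figures above; the slide itself is not available, so treat the layout as hypothetical.

```python
def utilization_after_failures(host_capacities, total_demand, failures=1):
    """Utilization (%) of the surviving hosts, assuming the largest hosts fail first."""
    if failures >= len(host_capacities):
        raise ValueError("cannot lose every host in the cluster")
    surviving = sorted(host_capacities)[: len(host_capacities) - failures]
    return 100.0 * total_demand / sum(surviving)


def meets_service_level(host_capacities, total_demand, failures=1, max_pct=75.0):
    """True if the cluster stays at or under max_pct after the given failures."""
    return utilization_after_failures(host_capacities, total_demand, failures) <= max_pct


cluster = [6, 6, 6]  # hypothetical: VM slots per host

at_twelve = utilization_after_failures(cluster, total_demand=12)    # exactly full
at_thirteen = utilization_after_failures(cluster, total_demand=13)  # over-committed
safe_at_nine = meets_service_level(cluster, total_demand=9)         # 9 of 12 surviving slots
```

With 12 VMs the surviving hosts are at 100%, with 13 they are past it, and only at 9 VMs or fewer does this cluster keep to the 75% service level after losing a host. Passing `failures=2` gives the N plus 2 version of the same check.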
You can also make, you know, adjustments in how you use your utilization. There are several ways that this can happen, you know, an increase in utilization or a decrease in utilization. From an increase standpoint, let’s say Quest Software goes and buys a company like VKernel. Instead of provisioning, you know, 10 user mailboxes a week, now we are provisioning 15. At the insurance company, back when I was doing it, Zurich Insurance was merging with Farmers Insurance, owned by the same company in Switzerland. Well, now all of a sudden we went from having to add 100 virtual machines a month to 200 virtual machines a month. And we had to scale that infrastructure across the LA and Chicago data centers adequately. A lot of planning had to go into that to make sure that we could continue to service both sides of that business.
One of the things that we ultimately want to push people towards is not continuing to just throw crap at virtualization to the point that it breaks, but maintaining an optimal environment. Optimize those resources. Make sure that when that application developer comes to you and says, I need 8 gigabytes of memory for my application, and when you’re done laughing at him, you give him 2 gigabytes. Make sure that you have the right level of information to say, why are 6 gigabytes of your memory just filled with 0s according to the hypervisor? I’m going to pull that memory away, optimizing it so the hypervisor doesn’t have to schedule that resource utilization and can give it to other virtual machines. Especially if it’s a Linux operating system, which is just going to suck up all the gigabytes and then dish them out as it feels necessary. Any database application will suck up as much memory as possible and do its own memory management. These are things that really aren’t using that memory, which can be more efficiently given out to other virtual machines without impacting how the scheduler has to manage those memory resources, without, you know, overconsuming. And with the new pricing, make sure that you are allocating or granting the right amount of memory to those resources, to get the most out of your environment without having to pay additional licensing fees over time.
One of the things that you have to be careful of when looking at a change in utilization is: is it a real change in utilization, or is it some form of anomaly? Again, if I make an acquisition and all of a sudden I’m doubling how much I’m provisioning, as I look at the change in the trend, with 4 data points it’s just not enough data to be able to tell whether this is a permanent trend or a temporary one. Going from adding one workload last week to adding two workloads this week: does that mean that I am going to add two more again next week? Does it mean I’m going to add three more next week? Or was that just an anomaly, and I am going to add one more next week? You need more data points. You need time for that data to normalize, to be able to determine if it really was a change and, if it was a change, to make sure that it’s a static change and actually normalize that data. So when you do see some of these anomalies and changes, it does take time. We can’t just instantly turn around and say, “Oh, you doubled how much you provisioned this week, so we are going to double it for the rest of time.” We have to see that information, put it into the statistical models and give you the right level of information to see if it truly is an overall change in how you utilize your servers.
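The "you need more data points" idea can be sketched as a trend estimate that refuses to answer until it has enough history. This is an illustration only, not the statistical model the speaker refers to. The least-squares slope is a swapped-in, simple technique, and the function name and the `min_points=8` threshold are arbitrary assumptions.

```python
def weekly_trend(counts, min_points=8):
    """Least-squares slope of workloads added per week.

    Returns None when there are fewer than min_points weekly samples,
    since a handful of points cannot separate a real change from an
    anomaly. min_points=8 is an assumed, illustrative threshold.
    """
    n = len(counts)
    if n < min_points:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


print(weekly_trend([1, 1, 2, 2]))              # too few points to call a trend
print(weekly_trend([1, 1, 1, 1, 1, 1, 1, 1]))  # flat provisioning rate
```

With only four weeks of data the function declines to report a trend; once enough weeks have accumulated, a slope near zero suggests the earlier jump was an anomaly rather than a permanent change.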
So again, one of the key takeaways is that capacity is not about how much you can cram into your hosts and clusters. That’s the absolute wrong way to look at it. What we’re seeing as an industry average right now is that on a typical 2-CPU server, with any amount of memory (it doesn’t seem to matter), you get about 10 virtual machines per ESX host. That’s an industry-wide average that we’re seeing right now. Again, I have some people that will stand up and say, “I’m actually getting 40 without performance problems.” I have some people that will stand up and say, “I am getting four before I run out of capacity.” It is all about how much more you can put in there before you start to swap memory, before you start running into high disk latency, before you start to hit ridiculous amounts of CPU percent ready time. It’s about understanding what’s the right value and balance for you.
And one of the key things that’s kind of on the backend, that seems to drive a lot of this, is understanding your virtual CPU to physical core ratio. So if I am running dual-core systems versus six-core systems, how does that impact how many virtual CPUs I can actually assign to that system? Once you get to a ratio higher than 1:1, you have to be sharing CPU resources at some point. You’ll likely get to that point beforehand, because the hypervisor doesn’t necessarily drop one vCPU on a single core and keep a balance that way. It’s constantly shifting everything. Last time I checked, I think it was literally every 20 milliseconds that it’s recalculating where the best place for individual workloads is. It’s doing a lot of work to do that, and once you start getting past a certain point, there is hypervisor overhead that is churning to try to figure out: I’ve got a 2:1 ratio of virtual CPUs assigned to physical cores, so how do I keep up with that while still maintaining the performance of the virtual machine? So a lot of intelligence, a lot of backend processing is needed to go and see and understand this, and this is all information that we can help provide insight into, provide access to, and help you understand what’s the right fit for your environment and for the types of workloads that you run. Because I’m sure I can go into any company and everyone is doing something just a little bit differently, and there is no one answer that really fits what everyone is doing.
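The vCPU to physical core ratio itself is simple arithmetic. The sketch below is illustrative only: the function name and the host and VM sizes are assumptions, and no target ratio is implied, since the talk stresses that the right value depends on the workload.

```python
def vcpu_to_core_ratio(vcpus_per_vm, sockets, cores_per_socket):
    """Ratio of assigned virtual CPUs to physical cores on one host."""
    physical_cores = sockets * cores_per_socket
    if physical_cores <= 0:
        raise ValueError("host must have at least one physical core")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host load: ten VMs with 2 vCPUs each.
vms = [2] * 10
print(vcpu_to_core_ratio(vms, sockets=2, cores_per_socket=2))  # dual-core CPUs
print(vcpu_to_core_ratio(vms, sockets=2, cores_per_socket=6))  # six-core CPUs
```

The same ten VMs sit at 5:1 on the dual-core host and under 2:1 on the six-core host, which is why core count changes how many vCPUs you can reasonably assign.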
At the end of the day, capacity is more than just applying that simple mathematical calculation. You need that best practice; you need to understand how the hypervisor works. You need to understand what it really means if I have 14 different ESX servers, over multiple paths, all accessing the same data store. What type of contention does that cause, and by simply spreading that workload, without changing any resources, how can I actually get better performance out of that environment? That’s stuff that, obviously, as a systems management vendor, we’re out there to help you solve. At the end of the day, you are always going to have a next bottleneck, and you don’t want to sit there and focus on saying, “Oh no, I need to go optimize memory, I need to optimize storage.” If it’s not causing a performance problem, or it’s not at risk of causing an immediate problem, there are many, many more things you guys probably need to worry about on a day-to-day basis. So don’t get hung up on the fact that you have a next bottleneck. There is always a next bottleneck. You are always going to run out of some resource first, and understanding when you are going to run out of that resource, and understanding your options for what you can do to prevent or mitigate that, are really the important questions.
So, with that, that concludes my quick, short, simple presentation. One of the things that we wanted to do is put out the paper cards. If you do fill one of these cards out, I will take it back with me and get it mailed off to Boston, where they are going to do a drawing that’s posted on the website listed here. One winner from this event will be announced early next week, and one out of every 10 downloads and activations from that link will actually win an Iomega 2 terabyte StorCenter ix2 device. So, great deal. During the acquisition process, when I was actually doing a trial of the software, after just one download I got notified that I was a winner. Of course I was polite and actually emailed them back and said, “Hey, I’m actually from Quest, that was my personal email address,” so they pulled it away from me. But it does go to show that you can actually win this thing. It’s a great little device, works very well, and anybody who downloads has a 10% chance of actually winning one via the link below. So, with that, I’m open for any questions at this point. Yes?
Male Speaker: It has the ability to add multiple sockets and cores at the VM level. Has that made the capacity [Inaudible] [0:32:40.6] more complex?
Scott Herold: From our standpoint… the question is whether the multiple-core, multiple-socket capabilities of the most recent version of the hypervisor cause additional problems. The way that we really look at it is at a more fundamental level, looking at just the overall utilization, and there are certain triggers that we can look for to determine when something is causing a problem. And those triggers existed before there were multiple-core SMP virtual machines, when it was just virtual SMP, and that’s been around for a long time. So from our standpoint it didn’t cause a major shift in how we look at capacity. In terms of some of those metrics, like the virtual CPU to core ratio, it does modify that a little bit, in how it will ultimately impact your environment and change how you actually want to provision some of those resources, but overall I don’t think it changes the overall capacity of your system. With the additional enhancements to the hypervisor and to the hardware, everything about it has made it so that you still get increased performance and increased capacity, even though we are doing a lot of things that are more intensive for the hypervisor itself. Yes.
Male Speaker: [Inaudible] [0:33:57.6]?
Scott Herold: Making it really difficult, especially when there are different service levels or different thresholds at which different types of storage normally run into issues. NFS storage has different settings around latency before it starts to cause a major problem versus fiber-based storage. It depends on whether I’m running fiber drives versus solid-state drives versus SAS drives.
Male Speaker: [Inaudible] [0:34:28.2]
Scott Herold: And yeah, when you have to look at all of that, again, all we can really do is kind of look at those triggers and make the software customizable enough. We have some best practices that will do out of the box, but 90% of the time we’re wrong, so we make it so you can actually say, “Hey, in my environment, when this type of storage hits a certain latency, that’s a problem for me.”
Male Speaker: [Inaudible] [0:34:52.6]
Scott Herold: You know, VDI is really unique. From a VDI standpoint, I continue to be a fan of this: run your [stateless] images on local SSD, and keep your user data, your custom data, the data that changes, in your central storage infrastructure. If an OS or a server goes down, you lose a couple of images that were on it, but you can quickly spin those up on another system. You can still manage that capacity accurately, and if your user data, the data that’s actually critical, can still be accessed externally, it doesn’t matter if users are booting up one image on this server one time and another image on a different server the next time. So the risk of losing an OS image is mitigated if you properly plan for user data in that external storage subsystem. There are all kinds of different scenarios and situations, and I’m sure if somebody from [NetApp or EMC] is in the crowd, they’ll stand up and fiercely disagree with me and say that everything must be on centralized storage because it’s fast enough for everything.
But it’s all about mitigating risk and cost, especially associated with VDI, and of course storage is one of the largest costs of any type of VDI project and can really prevent you from even getting off the ground with that type of solution. So one last thing I’ll comment on, it’s across the slide up front: VKernel was recently acquired by Quest Software. One of the things that we are trying to do with the acquisition overall is maintain the VKernel business model of providing easy-to-use software, very consumable software, and focusing on the questions and the use cases that customers want answered. But at the same time, with the Quest Software solutions that we’ve had, we have a highly scalable architecture and a highly customizable, very powerful solution. So for us it’s a matter of combining the best of both of those solutions to create something that truly is unique in the market, something that on the lower end still allows us to compete with some of the smaller vendors like SolarWinds and Veeam and still rapidly deliver new solutions to market, while at the same time still being able to scale up to the largest organizations and the needs that they have, without sacrificing one portion of that market for another. Because virtualization, regardless of the fact that it has been around for several years, still really is in its infancy. We see a lot of larger organizations that are still 30, 35, 40% virtualized.
We’re still hearing stories of small businesses that have never heard of server virtualization. So our ability to continue to capture that market allows us to remain flexible and provide that easy-to-use solution, but combining it with the traditional Quest Software still allows us to scale and compete with the big 4 vendors at the same time, as we’re used to doing around application and database performance management, and to move up a level in that same market. So that was the logic and the reasoning behind our acquisition, and ultimately it makes us a very serious player across the entire market for server virtualization. Combined with our data protection capabilities from the historic vRanger product that we had, as well as some of our recent acquisitions there with BakBone Software, we’re creating a real data protection platform and at the same time creating a performance, capacity and management platform for not just the server virtualization environment but the infrastructure that virtualization relies upon, with storage and network, as well as the workloads that people run as they consume virtualization for their application tiers. So that’s the approach that we’re taking, and so far it has been a great ride. This acquisition has been one of the best moves for us and continues to drive us in the market.
Wednesday, April 25, 2012 | by Lauren Bonaca | (comments: 0)
Windows Server 8 is now Windows Server 2012. With it, Hyper-V gets big new features such as:
• Increase in cluster size from 16 nodes to 63.
• New VHDX virtual disk format increases the maximum disk size to 16TB, up from 2TB
• SANless live migration
• More extensible/manageable vSwitch
• In-box, native NIC teaming
But are these and other new features enough to shift VMware customers over from vSphere to Hyper-V? Get the facts from Greg Shields, vExpert, MVP, CTP, of Concentrated Technology in this installment of VK.TV with VKernel Product Marketing Manager Alex Rosemblat.
For more Hyper-V podcasts, white papers, blogs and more, check out our Hyper-V Performance Resource page.
Scott Herold: Okay, I want to thank you guys for having me out here. This is my second Minneapolis VMUG; the first one was probably four or five years ago, back in the days when I was actually a Vizioncore employee. So I have been with the Quest Software family of companies since 2006, back when vRanger and vCharter were our only applications out there.
So I am out here, I think VKernel actually paid for the presentation, so you’ll see things in VKernel orange today. Normally marketing comes up with some form of description of what I’m going to talk about. The interesting thing about marketing is they know exactly when they are lying to you, whereas the sales rep doesn’t really know that he is lying to you. So, whatever they actually said we are going to talk about is probably not the case. I like to do VMUGs, I like to have a little bit of fun. I’m not going to bore you with product demos, I’m not going to do “click next to continue” type of stuff here. So, I am going to do a presentation that I started doing at VMworld this last year, talking about some of the challenges in IT and why this new dynamic data center really changes how and why we have to look at the complete infrastructure differently, and then going into some more educational type of detail around capacity management, capacity planning in particular. I just want to give you guys some good information about why it’s important, what it actually entails and how to take advantage of it in your organization overall.
So with that, one of the things that we see is this: a kind of typical organizational chart that we see in a lot of companies that we have talked to. With our marketing budget, you know, because of the acquisition, we have to spend a little bit less money, so I had to create all the graphics myself. I am not an artist and it’s not my trade, so please excuse any amateur graphics here. What happens is everything is kind of initiated from that end user. These are the people that are consuming your infrastructure; these are the VM consumers. These are the guys that, when you provision a virtual machine, are going to say, “It takes me one second longer to do what used to take half a second before.” When these guys are unhappy, it causes problems across the organization, normally at different levels.
The first one is kind of the IT executive. Say I’m at Amazon, and Amazon.com goes down because of, say, some network misconfiguration that takes down the entire EC2 cloud. If I’m Jeff Bezos, instead of swimming in a big pile of money like Scrooge McDuck, I’m sitting there with a brown bag of wine, trying to drink away my pains. When that happens, there is that person within the organization, and you couldn’t pay me enough to be this individual, that’s the application owner. This is the person that sits between IT and the consumer to deliver services, ensure application uptime, meet service levels, all that fun stuff that normally, within IT on the infrastructure layer, we don’t really care about. This guy or girl, this individual, is one of the most stressed people out there. When something goes wrong, she or he is responsible for getting that infrastructure problem solved, getting that application back online and making the end users happy. Very highly stressful. That person interfaces a lot with the IT manager, and the IT manager has a hell of a time trying to work with his team, trying to get the storage and network guys to stop fighting with each other, trying to help that virtualization administrator understand what’s happening in the environment. That’s one of the challenges that virtualization, cloud computing, this whole dynamic data center brings: troubleshooting problems becomes significantly more complex.
With that additional layer of complexity and shared resources, a lot of things can happen so that what normally would take a couple of hours to troubleshoot can turn into a week-long process. And there are several reasons for that, and that’s kind of, as we progress through the infrastructure at different layers, how things have changed over the last few years. The biggest change is obviously the server relationship. Before, servers used to run on their own, and it didn’t really matter what they did. If they consumed too many resources or not enough resources, they really only kind of impacted themselves. A server could be doing next to nothing, practically sleeping, or partying like a 13th century Scotsman, taking all the resources that it wanted to.
This was originally done at VMworld, so keeping with the theme from, like, Caesars Palace: Brutus yelled “Sic semper tyrannis” before he killed Julius Caesar. It also turns out it’s what John Wilkes Booth yelled before shooting Lincoln. So, not too soon for a Caesar joke, a little too soon for an Abraham Lincoln joke. But again, what these servers are doing inside doesn’t really impact other things on the outside.
I kind of equate virtualization to that family road trip. First few hours of the trip everyone’s excited, we’re going to go see Disneyland, I get to go see Mickey Mouse. But two hours into that trip, the kids start to piss you off. “Stop touching me, I’m not crossing the line, I have to pee, are we there yet”, all these questions. My parents used to threaten me, saying I’m going to turn this car around, of course they never did. Unfortunately the hypervisors aren’t like that, if you start to misbehave, if you start to cross the line, if you’re doing too much, if you’re consuming too much, the hypervisor will actually pull back resources from you, it will turn that car around, and when it does that it causes problems across the broader infrastructure. And that’s something that’s difficult to troubleshoot, especially since every resource is shared, CPU, memory, disk and network resources.
So you’re trying to understand when the hypervisor is trying to protect the other workloads, and if that’s actually impacting things that are more critical than others. Another aspect that’s changed recently is the storage infrastructure. I can do an unofficial poll of pretty much any room and say, “What is it that causes 85% of your performance problems in a virtual environment?” Everyone’s going to jump up and down and say, “Storage, storage, storage.” I’ve been doing virtualization for a long time, and I remember those times when I was first doing it. This was the early 2004-2005 timeframe, when ESX 1.5.2 with that wonderful brown MUI interface was still around. You’d go to the storage admin and say, “I need a 400 gigabyte data store so I can provision my own storage and throw a whole bunch of virtual machines up there.” And they would look at you like you practically killed their mother. It was just unthinkable to ask for that much data. And that’s kind of become the standard. And one of the things that we see is, I can go talk to a virtualization admin, and he’s going to say, “Man, that storage infrastructure is breaking my virtualization.” I can also turn the tables and go to the storage administrator, who’s had a storage infrastructure longer than virtualization has existed.
And he is going to say, “Virtualization broke my storage infrastructure. It used to be very simple. It was a one-to-one method. I used to be able to take one or multiple data paths, present them to a server, and I knew exactly what was running on it.” That’s a very direct point A to point B communication path. If something went wrong, I knew how to troubleshoot it. Now when you combine server virtualization with storage virtualization, you get that image on the right. You don’t know how things are getting from point A to point B, you don’t know how to track that through the infrastructure, and you have to work with multiple teams. Oftentimes you have to get the network team involved, because there is some form of IP-based storage in the mix of communication paths. Now you have 3 teams trying to work together, most of whom spend 80% of the time trying to blame each other for something or trying to fight with each other, so it becomes much more complicated and much more critical to understand the storage infrastructure. Again, that’s because of that shared nature, and because it has changed so drastically from how it was even 5, 6, 7 years ago.
From a network perspective, networking is inherently [chatty] [06:59]. You can see anything that happens on the physical wire; it’s very easy to see and understand. Virtualization has kind of upped the ante on the importance of networking, again with storage communication being important, and with the ability to still see and understand, from an end-to-end standpoint, what’s happening as a part of that communication. Where that starts to break down is when you have instances in which something doesn’t touch the [physical] wire, and there are plenty of instances in which virtual machines running on the same host, on the same virtual switch, won’t ever actually touch the physical wire. So how do you actually watch that traffic?
Now you need to look at different capabilities. VMware recently, with vSphere 5, officially supported NetFlow, so you can actually see the type of communication, the amount of communication and the start and end points, so you can start to see what’s happening in that virtual layer that you didn’t have visibility into from a physical standpoint. So you need the right tools that can actually help you see both on the wire and off the wire, to really be able to get in and understand what’s happening from an end-to-end perspective across the actual network environment.
But what all this leads to is a change in how applications are ultimately built to run. In a large organization 10, 11, 12 years ago (I used to work for an insurance company), the whole concept of rolling out an application update to a group of users scared everybody in IT. That’s the model where every single application gets a client installed on a desktop that talks to some form of application tier, which has some form of service tier, which ultimately talks to a database. It’s an archaic model. It was great at the time, and it was all that there was, but what we are seeing is a demand from users to be able to get access to that data from anywhere, from any device, and the easiest way to do that is through web services, HTML5, Web 2.0 type technologies.
In that insurance example, the old way of doing this meant a fat client on every single workstation, and massive upgrade costs and support costs to be able to do that. There was the slowness of actually being able to roll out these updates, maybe once a year if you were lucky. With web-based applications you can make an update by fixing the server, and everyone is updated at once. You don’t have to worry about legacy versions of the software running around out there, and it allows you to scale very rapidly. So, say we have some form of natural disaster: how does that insurance company scale up the necessary resources to get additional insurance agents answering phone calls and entering claims into the system, without having to provision 200 desktops and actually get them out to the users? And this is where the applications are ultimately moving. Again, it’s making it so that the consumers of that infrastructure have access to the resources at any time, can rapidly scale and have the best services possible.
From the end user standpoint, again, this is something that’s really starting to change. We saw all the capability that VMware is putting around the mobility of that user. Before, end users all had a desktop or laptop computer. Of course, when you are running Windows you get blue screens, occasionally computers will catch on fire, and there are viruses all over the place because someone clicks on the e-mail that says “naked picture of” the latest Justin Bieber type celebrity. And that’s how they actually access our applications. If something happens to that hardware, servicing that hardware becomes extremely complex and time-consuming. You can take them out of service for a couple of days trying to get things resolved.
Now there’s mobility, there’s desktop virtualization, there’s just how we deliver applications (Citrix has been doing this since time began), changing how you deliver those applications to the users so they have access to them. Even if they lose their hardware, you have that capability to give them what they need, where they need it and from the device they need it from. So a great use case of this is just changing how IT can actually deliver service to the users. Take Dropbox: nobody is supposed to use Dropbox, everybody does. I get yelled at all the time, but I turn around and tell my IT department, “Until you can give me a way to get access to all of my data and all my systems from any of my devices, I’m going to use Dropbox and there is nothing you can do about it.” And that’s the case right now. It’s a problem that VMware is trying to solve with their project. We see some cloud-based vendors trying to help solve that problem as well, to make sure that people have their data but still make it secure for the corporate IT environment, to get them more comfortable with a service that they simply can’t provide, because of cost, because of complexity, because of security, but that their users are actually demanding and in many cases requiring.
So, how users need access to this data, need access to the application and need all this information is ultimately what’s driving a lot of this change in infrastructure, introducing the world of cloud computing, but wreaking havoc on internal IT. So, what happens is we get this. It’s everybody’s nature to try to blame somebody else. You hear VMware talking about what happens when I have an application problem. I go yell at the virtualization admin, and he tells me that it’s my application that sucks, not his infrastructure. That’s what we see at pretty much every organization. It’s human nature to try to blame somebody else so I don’t have to do the hard work, or, if I do have to do the hard work, to try to find an excuse not to have to do it until next week.
And that’s what we get. End users have a problem. They are going to start interfacing with their line of business. The line of business has to try to disseminate it. Is it the virtual team? He is going to blame the storage guy. Storage is going to blame the network, and network is going to say it’s a database problem. Database is going to try and say the user is stupid and doesn’t know what he is doing. This is what we see in IT almost every single day, and this is something that we have to find a way to get around. So we started to see, as long as several years ago, that kind of accepted infrastructure team: a team of core individuals responsible for that infrastructure at all costs. A network guy, a storage guy, an OS guy, a hardware guy; get them on the same team, so they have the same set of goals. With the segmentation, the siloization of IT, where the networking team reports to a different director and the storage guy is on a different team than the hardware team, communication falls apart, and the ability to solve challenges in the organization is almost impossible unless you get the right organization set up. Until you can work adequately as a team, you’re never going to be able to get to the point where you’re not just trying to blame somebody else so you don’t have to do the work. So with that, that’s kind of the front portion of the presentation. Now it’s going to be the educational aspect of it.
I’ll still try to keep a little bit of fun, but I want to talk more about the importance of capacity management in your environment overall. What is it, why is it important, and I’ll try to break it down to a fundamental level, to help you guys just get a better understanding of why you need to be concerned about capacity. Because of the shifts, it’s a little bit more important than traditional server monitoring. “Yes, I hit 75% utilization on one of my ESX servers.” Great! That’s good in the modern PC world or the modern server world with virtualization. But what does that mean for how much more I can add? When am I going to run out? These are the things that are shifting the concern from what you’ve had to look at in the past to what you need to look at going forward, because of the demands on IT and the fact that you can click a couple of mouse buttons and provision a new system for users. That’s what they are coming to expect. If you have to say, “Give me six weeks to order more hardware,” that defeats a major purpose of why virtualization was put into a lot of organizations in the first place. So with that, let’s just try to understand fundamentally what capacity management is. At the most basic level, capacity management is about the number of objects that can be added to a container, objects that ultimately consume resources.
This doesn’t just apply to a compute model. It can apply to several different things. In a very specific example, it’s the number of VMs I can put on a host before I run out of CPU, memory or disk resources. In an Exchange environment, it’s how many more users I can put on the mail store before I run out of memory or disk. So again, we apply this exact same concept, and fortunately for us, the exact same calculations, to determine these different types of scenarios. How many requests per second can I put on a web server before I run out of resources? And I’m not sure if anybody is familiar with Ron Oglesby. I think he may have been up here; he now works for Unidesk. He was my partner in crime several years ago in consulting, and I wrote a couple of books with him. Big guy, former Navy SEAL, not the kind of guy you want to piss off. But I did have the displeasure of rebooting his servers once. So there’s the number of times you can reboot his servers and test his patience before he wants to throttle you, and actually that answer is one. So the capacity of that is relatively small, and you can fill it up very quickly.
So, when looking at capacity management, what are the key questions that are being asked of you as administrators? What is it that IT management is really concerned about? What is it that you need to be concerned about as you are looking at your environment? Again, it’s not, “How can I get 100 virtual machines onto a single server?” Cramming everything into a box isn’t the right answer. It’s: how much more can I do with my current capacity? How quickly am I filling it up? When am I going to run out? I guess that kind of answers the same question there. What happens if I add or remove capacity? What happens if I change my growth patterns? What happens if I change how I use my resources? Where is the best place to actually run my workloads? These are the things that you have to think about every single day, every time somebody makes a request: do I have enough room, where do I have enough room, and when somebody makes a request, am I going to be able to fulfill it within a reasonable timeframe?
In order to really start understanding capacity, it all starts with performance, and performance is measured in utilization. Utilization is a point-in-time snapshot of used versus capacity in my environment, normally displayed as a percentage value. The key questions that you really have to look at around utilization start with: how many total objects can my environment hold? That’s a very, very important metric to understand. If somebody makes a request for 100 virtual machines, can I actually fulfill that request without having to order more hardware and go through a longer process? How many objects am I currently holding in that environment? How many more objects can I add? And if I want to add X more objects, will they fit? Those are four very key questions in the environment that, at a very simplistic level, I can answer with just a very basic utilization image.
So, how many total objects can my environment hold in this case? Six slots. How many am I currently holding? Four. I can add two more, and if I want to add three or four more I can’t do it. Very, very simple, about as basic as you get from a capacity standpoint; but something that is actually very often overlooked by organizations is understanding what their systems are actually capable of. In reality, from an application standpoint within a virtual environment, you simply can’t look at capacity in a separate silo from performance. Again, it’s not about cramming 100 virtual machines into an ESX host; it’s a horrible metric. I wish it would die in a fire because it’s just not right. The right question is how much more can I do in my environment without impacting the performance of everything else running in it. So you take a look at not just physical capacity, but also performance capacity, and storage is a great example. I have 30 terabytes of storage attached to my ESX server, but as soon as I start running 10, 15, 20 virtual machines, I may only be able to fill up 20% of that before my disk latency hits 50, 60, 70 milliseconds.
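The six-slot picture can be written down as a few lines of Python. This is only a sketch of the arithmetic described above; the function name `slot_summary` and its fields are made up for illustration and do not come from any product.

```python
def slot_summary(total_slots, used_slots, requested):
    """Answer the four basic utilization questions for a fixed-size container.

    All names here are hypothetical; this is plain arithmetic, not a product API.
    """
    free = total_slots - used_slots
    return {
        "total": total_slots,                                   # how many objects can I hold?
        "used": used_slots,                                     # how many am I holding now?
        "free": free,                                           # how many more can I add?
        "utilization_pct": round(100 * used_slots / total_slots, 1),
        "request_fits": requested <= free,                      # will X more objects fit?
    }


# The example from the talk: six slots, four in use, a request for three more.
summary = slot_summary(total_slots=6, used_slots=4, requested=3)
print(summary)
```

Note that this only covers physical capacity. The performance side of the question (for example disk latency climbing long before the disks are full) would need its own limit on top of the slot count.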
That means I’m wasting a lot of space that I can’t use because how much I can actually put on there, as it relates to how my communication flows. CPU ready, one of the most critical values has been for a long time, how much time is my CPU waiting to run because it’s waiting for other processes to finish. I could have plenty of availability, in terms of CPU I’m only using 70, you know 75%, I should be able to run the other processes without a problem. Well depends on what type of calculations are occurring inside of the OS, system mode internal processes will consume CPU differently than user mode processes. And if you start to see that you’re overusing internal processes you’re going to see your CPU ready start to spike faster than in other systems, and being able to understand when you hit 7, 8, 9% CPU ready, you are going to be waiting for CPU cycles to become available for some of your workloads; even though you may have some 30- 40% of your total CPU available. So understanding that correlation of what’s a physical capacity limitation and what’s a performance capacity limitation is vital to really understanding how much more you can do in the environment, and you have to have visibility and correlate performance with capacity to get that right level of understanding.
Once you have utilization locked down, it’s all about stacking utilization at different points in time to establish a trend. How has my environment been growing, how has it been changing over a period of time? Once you can formulate that pattern, you can start to actually answer that question: how has my utilization changed over time? Here, trending is very, very rarely static; you always have some form of anomaly in the data. Quarter-end processing will cause it to spike. If you’re a retailer, if you’re Best Buy or Target, and they’re big in this area, every October, funnily enough, my utilization starts to go through the roof and it dies at the end of December. I’m an online retailer, of course it’s going to increase over the Christmas season. So it’s about being able to track utilization in these different time slices and understanding trends, not just at a micro level, looking at, well, over the last 8 hours I’ve slowly been increasing, but looking at broader patterns to encompass regular processing and the annual trends that you see in your environment. And the more data that you have, the more accurately you can actually build up those trend lines and understand what’s normal behavior for your environment.
Once you get into understanding what that trend is, you can start to do some forecasting. Forecasting is just like the weather: it’s always a prediction, and there always will be some margin of error when you try to predict what’s going to happen in the future. The questions that this portion of it really answers are: how many objects will I be consuming at a point in time in the future, and when will I fill up that capacity? In reality, the further out you forecast, the less accurate that prediction is going to be. It doesn’t matter how you slice up that data or what type of equations you apply to it: the further out, the less accurate. And different algorithms modify the accuracy of those predictions. If you just do a basic linear trend you are going to miss curves, peaks, valleys, different things in that data, even using any of the products in the market, whether it’s ours or whether it’s VMware’s. If you change from logarithmic to algorithmic to polynomial to linear, you can have something that says I can have 5 more virtual machines or I can have 500 more virtual machines depending on how it’s looking at that data, doing nothing more than simply changing the statistical calculation that you are using to figure out what that forecasting line looks like.
In this very example, what we have is a linear trend. Starting off two weeks ago, I’m running 2 virtual machines. I added one last week, so I’m running 3. Today I added another one for a total of 4. Based on that I can say I am adding one virtual machine per week, so next week we’re going to be at 5. The week after that would be 6. So I can add two more to my environment, and in two weeks I’ll be at capacity. I better order more hardware now, so I have enough time to get it up and working 3 weeks out from now, and I don’t have to tell my customer, “sorry, I can’t add anything else until hardware comes.” It’s a very, very simplistic example, but it’s trying to show how the accuracy of this data can change, again based on the types of calculations that you are applying to it. Linear projections will give you different data than polynomials. Polynomials can be a little bit more accurate. These are, again, just very basic hand-drawn images to show how just a slight curve in that trend line can change what that forecast ultimately shows.
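The linear example can be checked with an ordinary least-squares line. This is a minimal sketch assuming one sample per week; `linear_fit` and `weeks_until_full` are hypothetical helper names, and a real tool would use far more data points and possibly a different curve.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Assumes one sample per week.
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_full(ys, capacity):
    """Weeks from the latest sample until the fitted line reaches capacity.

    Returns None when the trend is flat or shrinking.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    current_week = len(ys) - 1
    full_week = (capacity - intercept) / slope
    return max(0.0, full_week - current_week)


# Two weeks ago: 2 VMs, last week: 3, today: 4. Capacity is six slots.
vm_counts = [2, 3, 4]
growth_per_week, _ = linear_fit(vm_counts)
remaining_weeks = weeks_until_full(vm_counts, capacity=6)
print(growth_per_week, remaining_weeks)
```

With only three points the fit is exact, which is also why it says little about the future; swapping the line for a curve would move the "full" date without any change in the data.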
From my standpoint, the adjustment of resources is where things get really, really interesting. This is all the variables that you can think of that change how you are using your resources: how much capacity you ultimately have, any number of things that change that equation of how many more objects I can add to a container that consumes resources. I change my container size; I have increased it, or I have decreased it because I lost one of the ESX servers in my cluster. All these different types of things ultimately change the answers to those long-term questions. If I add more capacity, how many more objects can I add because of that? Being able to tell my management, “if we buy 2 more ESX servers for our cluster, we will be able to run 45 more virtual machines,” is very important information that can be relatively simply captured and gathered using the right set of tools.
So if we take a look, the adjustments can occur in multiple ways, either increasing capacity or decreasing capacity. Increasing capacity could be: I added more memory to a host because I was running low. Over time my utilization is steadily increasing. I order my memory today, when I’m at 66% utilization. Based on my growth trend, next week I’m really pushing up on this. I pop in more memory, and all of a sudden I’ve dropped from 83% utilization to 75% utilization because I can run more virtual machines. The same thing can be applied to adding an additional host to a cluster, or creating an additional cluster to accommodate more workloads. Provisioning additional LUNs, or using different storage, increases my overall throughput and capacity availability from a storage standpoint. All these things change how I use each of these different resources individually. Just the same, if I’m looking at decreasing capacity, what happens in my cluster? In a normal scenario I want to make sure that I plan for N plus 1 failover of any host in my cluster. Sometimes N plus 2 or N plus 3, depending on how critical that information is.
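The 66%, 83% and 75% figures in the memory example are consistent with the six-slot picture used earlier, if one assumes the upgrade grows the container from six slots to eight. That mapping is an assumption made here for illustration, not something stated in the talk; the sketch below just works the arithmetic.

```python
def utilization_pct(used, capacity):
    """Utilization as a percentage of capacity."""
    return 100 * used / capacity


# Assumed mapping onto the slot model (not stated in the talk):
today = utilization_pct(used=4, capacity=6)          # about 66% when the order is placed
next_week = utilization_pct(used=5, capacity=6)      # about 83% one VM later
after_upgrade = utilization_pct(used=6, capacity=8)  # 75% once capacity grows to 8

print(round(today), round(next_week), round(after_upgrade))
```

The point of the example is that the same growth trend produces a lower percentage once the denominator changes, so any forecast has to be recomputed after a capacity adjustment.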
It’s about being able to understand and set my service level: if I lose one server in my environment, I still don’t want to be more than 75% utilized. It’s about understanding what those calculations mean and how that ultimately applies to determining, if this host goes down, whether I will suffer a performance impact on my existing workloads. In the top use case you’d be at 100% utilization if you lost one of the hosts while running 12 virtual machines; if you are running 13 virtual machines and you lose the host, you are going to run into a problem where you can’t fully accommodate the load that’s being demanded by your existing workloads. Being able to understand this information is vital to maintaining that level of performance, meeting the service levels for your users, and not causing a problem that could have been avoided by just having the right level of planning and insight into when you are ultimately going to fill up your capacity.
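The N plus 1 check can be sketched as follows. The cluster layout (four hosts with room for four VMs each) is an assumption chosen so that 12 VMs land exactly at 100% after one host failure, as in the example above; the function names are hypothetical.

```python
def utilization_after_failures(host_capacities, total_demand, failures=1):
    """Worst-case utilization (%) after losing the `failures` largest hosts."""
    if failures >= len(host_capacities):
        return float("inf")  # nothing left to run on
    surviving = sorted(host_capacities)[: len(host_capacities) - failures]
    return 100 * total_demand / sum(surviving)


def meets_service_level(host_capacities, total_demand, max_pct=75, failures=1):
    """True if the cluster stays at or below max_pct after the failures."""
    return utilization_after_failures(host_capacities, total_demand, failures) <= max_pct


# Assumed cluster: four identical hosts, each able to hold four VMs.
cluster = [4, 4, 4, 4]

with_12_vms = utilization_after_failures(cluster, total_demand=12)  # exactly full
with_13_vms = utilization_after_failures(cluster, total_demand=13)  # over capacity
ok_at_9_vms = meets_service_level(cluster, total_demand=9, max_pct=75)
print(with_12_vms, round(with_13_vms, 1), ok_at_9_vms)
```

Dropping the largest hosts first is a deliberately pessimistic choice; with identical hosts it makes no difference, but in a mixed cluster it avoids planning around the failure you would prefer to have.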
You can also make, you know, adjustments in how you use your resources. There are several ways that this can happen: an increase in utilization or a decrease in utilization. From an increase standpoint, let’s say Quest Software goes and buys a company like VKernel. Instead of provisioning, you know, 10 user mailboxes a week, now we are provisioning 15. Or take an insurance company: back when I was doing it, Zurich Insurance was merging with Farmers Insurance, owned by the same company in Switzerland. Well, now all of a sudden we went from having to add 100 virtual machines a month to 200 virtual machines a month. And we had to scale that infrastructure across the LA and Chicago data centers adequately. A lot of planning had to go into that to make sure that we could continue to service both sides of that business.
One of the things that we ultimately want to push people towards is not continuing to just throw crap at virtualization to the point that it breaks, but maintaining an optimal environment. Optimize those resources. When that application developer comes to you and says, “I need 8 gigabytes of memory for my application,” and when you’re done laughing at him, you give him 2 gigabytes. Make sure that you have the right level of information to say, “why are 6 gigabytes of your memory just filled with 0s according to the hypervisor?” I’m going to pull that memory away, optimizing it so the hypervisor doesn’t have to schedule that resource and can give it to other virtual machines. Especially if it’s a Linux operating system, which is just going to suck up all the gigabytes and then dish them out as it feels necessary. Any database application will suck up as much memory as possible and do its own memory management. There are things that really aren’t using any of that memory, and it can be more efficiently given out to other virtual machines without impacting how the scheduler has to manage those memory resources and without, you know, over-consuming. And with the new pricing, it’s about making sure that you are allocating or granting the right amount of memory to those resources to get the most out of your environment without having to pay additional licensing fees on top.
One of the things that you have to be careful of when looking at a change in utilization is: is it a real change in utilization or is it some form of anomaly? Again, if I make an acquisition and all of a sudden I’m doubling how much I’m provisioning, as I look at the change in the trend, with, you know, 4 data points, it’s just not enough data to be able to tell whether this is a permanent trend or a temporary one. Say I go from adding one workload last week to adding two workloads this week. Does that mean that I am going to add two more again next week? Does it mean I’m going to add three more next week? Or was that just an anomaly, and I am going to add one more next week? You need more data points. You need time for that data to normalize, to be able to determine if it really was a change. If it was a change, you need to make sure that it’s a static change and actually normalize that data. So when you do see some of these anomalies and changes, it does take time. We can’t just instantly turn around and say, “oh, you doubled how much you provisioned this week, so we are going to double it for the rest of time.” We have to see that information, put it into the statistical models and give you the right level of information to see if it truly is an overall change in how you utilize your servers.
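As a deliberately naive stand-in for "you need more data points," the sketch below only accepts a new provisioning rate once it has been observed for several consecutive weeks. The rule, the `min_points` default and the function name are all assumptions made for illustration; the statistical models mentioned in the talk are not described and are certainly more involved.

```python
def is_sustained_change(weekly_additions, baseline_rate, min_points=4):
    """Treat a rate change as real only after min_points consecutive weeks above baseline.

    Hypothetical heuristic: anything shorter is treated as a possible anomaly.
    """
    recent = weekly_additions[-min_points:]
    if len(recent) < min_points:
        return False  # not enough history to judge
    return all(added > baseline_rate for added in recent)


# One workload last week, two this week: too early to call it a trend.
too_early = is_sustained_change([1, 2], baseline_rate=1)

# Four straight weeks at the higher rate: treat it as the new normal.
new_normal = is_sustained_change([1, 2, 2, 2, 2], baseline_rate=1)

# A single spike followed by the old rate: an anomaly.
one_off = is_sustained_change([1, 2, 1, 1, 1], baseline_rate=1)
print(too_early, new_normal, one_off)
```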
So again, just some of the key takeaways: capacity is not about how much you can cram into your hosts and clusters. That’s the absolute wrong way to look at it. What we’re seeing as an industry average right now is that on a typical 2-CPU server, with any amount of memory, it doesn’t seem to matter, you get about 10 virtual machines per ESX host. That’s an industry-wide average that we’re seeing right now. Again, I have some people that will stand up and say, “I’m actually getting 40 without performance problems.” I have some people that will stand up and say, “I am getting four before I run out of capacity.” It is all about how much more you can put in there before you start to swap memory, before you start running into high disk latency, before you start to hit ridiculous amounts of CPU percent ready time; it’s about understanding what the right value and balance is for you.
And one of the key things that’s kind of on the back end, that seems to drive a lot of this, is understanding your virtual CPU to physical core ratio. So if I am running dual-core systems versus six-core systems, how does that impact how many virtual CPUs I can actually assign to that system? Once you get to a ratio higher than 1:1, you have to be sharing CPU resources at some point. Likely you’ll get to that point beforehand, because, you know, the hypervisor doesn’t necessarily drop one vCPU on a single core and keep a balance that way. It’s constantly shifting everything. Last time I checked, I think it was literally every 20 milliseconds that it’s recalculating where the best place for individual workloads is. It’s doing a lot of work to do that, and once you start getting past a certain point, there is hypervisor overhead from churning to figure out: I’ve got a 2:1 ratio of virtual CPUs to physical cores, so how do I keep up with that while still maintaining the performance of the virtual machine? So a lot of intelligence and a lot of back-end processing is needed to go and see and understand this, and this is all information that we can help provide insight and access to, and help you understand what’s the right fit for your environment and for the types of workloads that you run. Because I’m sure I can go into any company and everyone is doing something just a little bit differently, and there is no one answer that really fits what everyone is doing.
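The vCPU to physical core ratio itself is simple arithmetic. The sketch below uses a hypothetical two-socket host and a made-up set of VM sizes to show how the same workload lands at very different ratios on dual-core versus six-core processors; what ratio is acceptable depends on the workload and is not something this calculation can tell you.

```python
def vcpu_to_core_ratio(vm_vcpus, sockets, cores_per_socket):
    """Total assigned vCPUs divided by physical cores on the host."""
    physical_cores = sockets * cores_per_socket
    return sum(vm_vcpus) / physical_cores


# Hypothetical host contents: four 2-vCPU VMs and four 1-vCPU VMs (12 vCPUs total).
vm_vcpus = [2, 2, 2, 2, 1, 1, 1, 1]

dual_core_ratio = vcpu_to_core_ratio(vm_vcpus, sockets=2, cores_per_socket=2)
six_core_ratio = vcpu_to_core_ratio(vm_vcpus, sockets=2, cores_per_socket=6)
print(dual_core_ratio, six_core_ratio)
```

On the dual-core host the ratio is above 1:1, so the scheduler has to share cores among vCPUs; on the six-core host the same VMs sit at 1:1.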
At the end of the day, capacity is more than just applying that simple mathematical calculation. You need those best practices, and you need to understand how the hypervisor works. You need to understand, you know, what it really means if I have 14 different ESX servers, down multiple paths, all accessing the same datastore. What type of contention does that cause, and by simply spreading that workload without changing any resources, how can I actually get better performance out of that environment? That’s stuff that obviously, as a systems management vendor, we’re out there to help you solve. At the end of the day, you are always going to have a next bottleneck. You don’t want to sit there and focus and say, “oh no, I need to go optimize memory, I need to optimize storage.” If it’s not causing a performance problem, or it’s not at risk of causing an immediate problem, there are many, many more things you guys probably need to worry about on a day-to-day basis. So don’t get hung up on the fact that you have a next bottleneck. There is always a next bottleneck. You are always going to run out of some resource first, and understanding when you are going to run out of that resource, and understanding your options for what you can do to prevent or mitigate that, those are really the important questions.
So, with that, that concludes my quick, short, simple presentation. One of the things that we do is we actually wanted to put out the paper cards. If you do fill one of these cards out, I will take it back with me and get it mailed off to Boston, where they are going to do a drawing that’s posted on the website listed here. One winner from this event will be announced early next week, and one out of every 10 downloads and activations that come from that link will actually win an Iomega 2 terabyte StorCenter ix2 device. So, great deal. During the acquisition process, when I was actually doing a trial of the software, after just one download I actually got notified that I was a winner. Of course I was polite and actually emailed them back and said, “hey, I’m actually from Quest, that was my personal email address,” so they pulled it away from me. But it does go to show that you can actually win this thing. It’s a great little device, it works very well, and anybody who downloads has a 10% chance of actually winning one based off the link below. So, with that, I’m open for any questions at this point. Yes?
Male Speaker: It has the ability to add multiple sockets and cores at the VM level. Has that made the capacity [Inaudible] [0:32:40.6] more complex?
Scott Herold: From our standpoint… the question is, do the multiple-core, multiple-socket capabilities of the most recent version of the hypervisor cause additional problems? The way that we really look at it is at a more fundamental level, looking at just the overall utilization, and there are certain triggers that we can look for to determine when something is causing a problem. Those triggers existed before there were multiple-core SMP virtual machines, when it was just virtual SMP, and that’s been around for a long time. So from our standpoint it didn’t cause a major shift in how we look at capacity. In terms of some of those metrics, like the virtual CPU to core ratio, it does modify that a little bit, along with how it will ultimately impact your environment and change how you actually want to provision some of those resources. But overall I don’t think it changes the overall capacity of your system. With the additional enhancements to the hypervisor and to the hardware, you still get increased performance and increased capacity, even though we are doing a lot of things that are more intensive for the hypervisor itself. Yes.
Male Speaker: [Inaudible] [0:33:57.6]?
Scott Herold: It makes it really difficult, especially when there are different service levels or different thresholds at which different types of storage normally run into issues. NFS storage has different settings around latency before it starts to cause a major problem versus fiber-based storage. It depends on whether I’m running fiber drives versus solid-state drives versus SAS drives.
Male Speaker: [Interviewer] [0:34:28.2]
Scott Herold: And yeah, when you have to look at all of that, again, all we can really do is kind of look at those triggers and make the software customizable enough, where we have some best practices that will do out of the box, but 90% of the time we’re wrong, so make it so you can actually say, you know, “hey, in my environment, when this type of storage hits a certain latency, that’s a problem for me.”
Male Speaker: [inaudible] [0:34:52.6]
Scott Herold: You know, VDI is really unique. From a VDI standpoint I continue to be a fan of this: run your OS images on local SSD, and keep your user data, your custom data, the data that changes, in your central storage infrastructure. If an OS or a server goes down, you lose a couple of images that were on it, but you can quickly spin those up on another system. You can still manage that capacity accurately, and if your user data, the data that’s actually critical, can still be accessed externally, it doesn’t matter if they are booting up on one image on this server one time and another image on a different server the next time. So the risk of losing an OS image is mitigated if you properly plan for user data in that external storage subsystem. There are all kinds of different scenarios and situations, and I’m sure if somebody from NetApp or EMC is in the crowd they’ll stand up and fiercely disagree with me and say that everything must be on centralized storage because it’s fast enough for everything.
But it’s all about mitigating risk and cost, especially with VDI, and of course storage is one of the largest costs of any type of VDI project, one that can really prevent you from even getting off the ground with that type of solution. So one last thing I’ll comment on, since it’s across the slides up front: VKernel was recently acquired by Quest Software. One of the things that we are trying to do with the acquisition overall is maintain the VKernel business model of providing easy-to-use, very consumable software, and focusing on the questions and the use cases that customers need answered. But at the same time, with the Quest Software solutions that we’ve had, we have a highly scalable architecture and a highly customizable, very powerful solution. So for us it’s a matter of combining the best of both of those solutions to create something that truly is unique in the market, something that still, on the lower end, allows us to compete with some of the smaller vendors like SolarWinds and Veeam, and still be able to rapidly deliver new solutions to market, while at the same time still being able to scale up to the largest organizations and the needs that they have, without sacrificing one portion of that market for another. Because virtualization, regardless of the fact that it has been around for several years, still really is in its infancy. We see a lot of larger organizations that are still 30, 35, 40% virtualized.
We’re still hearing stories of small businesses that have never heard of server virtualization. So our ability to continue to capture that market allows us to remain flexible and provide that easy-to-use solution, but combining it with the traditional Quest Software still allows us to scale and compete with the big 4 vendors at the same time, which we’re used to around application and database performance management, and to move up a level in that same market. So that was the logic and the reasoning behind our acquisition, and ultimately it makes us a very serious player across the entire market for server virtualization. Combined with our data protection capabilities from the historic vRanger product that we had, as well as some of our recent acquisitions there with BakBone Software, we are creating a real data protection platform, and at the same time creating a performance, capacity and management platform for not just the server virtualization environment, but the infrastructure that virtualization relies upon, with storage and network, as well as the workloads that people use as they consume virtualization for their application tiers. So that’s the approach that we’re taking, and so far it has been a great ride. This acquisition has been one of the best moves for us and continues to drive us in the market.
Wednesday, April 18, 2012 | by Lauren Bonaca | (comments: 0)

In this podcast Quest Software Systems Consultant, Makis Koiliakoudis, speaks with VKernel Senior Systems Engineer, Jonathan Klick, to demystify CPU Ready (%RDY) as a performance metric. Specifically, Makis and Jonathan provide answers to the following important questions:
For more vCPU and CPU podcasts, blogs, white papers and more, visit the VKernel vCPU and CPU Management Resource page.
Scott Herold: Okay, I want to thank you guys for having me out here. This is my second Minneapolis VMUG, the first one was probably four or five years ago back in the days when I was actually a Vizioncore employee. So I have been with the Quest Software family of companies since 2006; since we only had vRanger and vCharter as our only applications out there.
So I am out here, I think VKernel actually paid for the presentation, so you see things in VKernel orange today. Normally marketing comes up with some form of description of what I’m going to talk about. The interesting thing about the marketing is they know exactly when they are lying to you or the sales rep doesn’t really know that he is lying to you. So, whatever they actually said, we are going to talk about is probably not the case. I like to do VMUGs, I like to have a little bit of fun, I’m not going to bore you with product demos, I’m not going to do click next to continue type of stuff here. So, I am going to do presentation that I started doing at the VMworld this last year talking about some of the challenges in IT, why this new dynamic data center really changes how and why we have to look at the complete infrastructure differently and then going to some more educational type detail around, capacity management, capacity planning in particular. Just give you guys some good information about why it’s important, what it actually entails and how to take advantage of it in your organization overall.
So with that one of the things that we see is, this is a kind of a typical organizational chart that we see in a lot of companies that we have talked to. It’s our marketing budget, you know because of the acquisition, you have to spend a little bit less money, so I had to create all the graphics myself. I am not an artist and it’s not my trade, so please excuse any amateur graphics here. What happens is everything is kind of initiated from that end-user. These are the people that are consuming your infrastructure, these are the VM consumers. These are the guys that when you provision a virtual machine they are going to say, it takes me one second longer to do what used to take you know half a second before. When these guys were unhappy, it causes problems across the organization, normally at different levels.
The first one is kind of the IT executive. If I’m Amazon and Amazon.com goes down because of, say, some network misconfiguration that takes down the entire EC2 cloud, and if I’m Jeff Bezos, then instead of, you know, swimming in a big pile of money like Scrooge McDuck, I’m sitting there with a brown bag of wine, you know, trying to drink away my pains. When that happens, there is that person within the organization, and you couldn’t pay me enough to be this individual: that’s the application owner. This is the person that sits between IT and the consumer to be able to deliver services, ensure application uptime, meet service levels, all that fun stuff that normally within IT, on the infrastructure layer, we don’t really care about. This guy or girl, this individual, is one of the most stressed people out there. When something goes wrong, she or he is responsible for getting that infrastructure problem solved, getting that application back online and making the end users happy. It’s very highly stressful. This person interfaces a lot with the IT manager, and the IT manager has a hell of a time trying to work with his team, trying to get the storage and network guys to stop fighting with each other, and trying to help that virtualization administrator understand what’s happening in the environment. That’s one of the challenges that virtualization, cloud computing, this whole dynamic data center brings: troubleshooting problems becomes significantly more complex.
You know that additional layer of complexity, shared resources, a lot of things can happen that make it, you know, what normally would take you know a couple of hours to troubleshoot can turn it into a week long process. And there are several reasons for that and that’s kind of as we progress through the infrastructure at different layers, how things have changed over the last few years. The biggest change is obviously the server relationship, before servers used to run on their own, didn’t really matter what they did. If they consumed too much resources or not enough resources they really only kind of impact themselves, could be doing next to nothing, practically sleeping, partying like a 13th century Scotsman, taking all the resources that it wanted to.
This was originally done at VMworld, so keeping with the theme from, like, Caesar’s Palace, it originally was Julius Caesar. Brutus yelled “Sic semper tyrannis” before he killed Julius Caesar; it also turns out it’s what John Wilkes Booth yelled before shooting Lincoln. So, not too soon for a Caesar joke, a little too soon for an Abraham Lincoln joke. But again, what these servers are doing inside doesn’t really impact other things on the outside.
I kind of equate virtualization to that family road trip. First few hours of the trip everyone’s excited, we’re going to go see Disneyland, I get to go see Mickey Mouse. But two hours into that trip, the kids start to piss you off. “Stop touching me, I’m not crossing the line, I have to pee, are we there yet”, all these questions. My parents used to threaten me, saying I’m going to turn this car around, of course they never did. Unfortunately the hypervisors aren’t like that, if you start to misbehave, if you start to cross the line, if you’re doing too much, if you’re consuming too much, the hypervisor will actually pull back resources from you, it will turn that car around, and when it does that it causes problems across the broader infrastructure. And that’s something that’s difficult to troubleshoot, especially since every resource is shared, CPU, memory, disk and network resources.
So you’re trying to understand when the hypervisor is trying to protect the other workloads, and whether that’s actually impacting things that are more critical than others. Another aspect that’s changed recently is the storage infrastructure. I can do an unofficial poll of pretty much any room and say, “what is it that causes 85% of your performance problems in a virtual environment?” Everyone’s going to jump up and down and say, “storage, storage, storage.” I’ve been doing virtualization for a long time and I remember those times, those instances when I was first doing it. This was the early 2004-2005 timeframe, when ESX 1.5.2 with that wonderful brown MUI interface was still around. I’d go to the storage admin and say, “I need a 400 gigabyte datastore so I can provision my own storage and throw a whole bunch of virtual machines up there.” And they would look at you like you practically killed their mother. It was just unthinkable to ask for that much storage. And that’s kind of become the standard. And one of the things that we see is, I can go talk to a virtualization admin, and he’s going to say, “man, that storage infrastructure is breaking my virtualization.” I can also turn the tables and go to the storage administrator, who’s had a storage infrastructure longer than virtualization’s existed.
And he is going to say, “virtualization broke my storage infrastructure. It used to be very simple. It was a one-to-one method. I used to be able to take, you know, one or multiple data paths, present them to a server, and I knew exactly what was running on it.” That’s a very direct point A to point B communication path. If something went wrong, I knew how to troubleshoot it. Now when you combine server virtualization with storage virtualization, you get that image on the right: you don’t know how things are getting from point A to point B, you don’t know how to track that through the infrastructure, and you have to work with multiple teams. Oftentimes you have to get another team involved because there is some form of IP-based storage in the mix. With those communication paths, now you have 3 teams trying to work together, most of whom spend 80% of the time trying to blame each other for something or trying to fight with each other, so it becomes much more complicated and much more critical to understand the storage infrastructure. Again, that’s because of that shared nature, and because it’s changed so drastically from how it was even 5, 6, 7 years ago.
From a network perspective, networking is inherently [chatty] [06:59], you can see what happens anything on the physical wire, very easy to see and understand. Virtualization is kind of up the ante in the importance of networking, again storage communication being important, the ability to still see and understand from an end-to-end standpoint what’s happening as a part of that communication. What happens when that starts to break down is when you have instances on which something doesn’t touch the [physical] wire and there are plenty of instances in which virtual machines running on the same host, on the same virtual switch won’t ever actually touch physical wire; so how do you actually watch that traffic?
Now you need to look at different capabilities. VMware recently, with vSphere 5, officially supported NetFlow, so you can actually see the type of communication, the amount of communication, and the start and end points. You can start to see what’s happening in that virtual layer that you didn’t have visibility into from a physical standpoint. So you need the right tools that can actually help you see both on the wire and off the wire, to really be able to get in and understand what’s happening from an end-to-end perspective across the actual network environment.
But what all this leads to is a change in how applications are ultimately built to run in, you know, a large organization. You know, 10, 11, 12 years ago I used to work for an insurance company, and the whole concept of rolling out an application update to a group of users scared everybody in IT. In that model every single application gets a client installed on a desktop that talks to some form of application tier, which has some form of service tier, which ultimately talks to a database. It’s an archaic model. It was great at the time, it was all that there was, but what we are seeing is a demand for users to be able to get access to that data from anywhere, from any device, and the easiest way to do that is through web services, HTML5, Web 2.0 type technologies.
In that insurance example, the old way of doing this was a fat client on every single workstation, with massive upgrade costs and support costs to be able to do that, and a slowness to actually roll out these updates, maybe once a year if you were lucky. With web-based applications you can make an update by fixing the server and everyone is updated at once. You don’t have to worry about legacy versions of the software out there running around, and it allows you to scale very rapidly. So, if we have some form of natural disaster, how does that insurance company scale up the necessary resources to get additional insurance agents answering phone calls and entering claims into the system, without having to reprovision, you know, 200 desktops and start actually getting them out to the users? And this is where the applications are ultimately moving. It’s, again, making it so that the consumers of that infrastructure have access to the resources at any time, can rapidly scale, and have the best services possible.
From the end user standpoint, again, this is something that’s really starting to change; we saw all the capability that VMware is putting around the mobility of that user. Before, end users all had a desktop or laptop computer. Of course, when you are running Windows you get blue screens, occasionally computers will catch on fire, and there are viruses all over the place because someone clicks on the e-mail that says naked picture of the latest, you know, Justin Bieber type celebrity. And that’s how they actually access our applications. If something happens to that hardware, servicing it becomes extremely complex. You can take users out of service for a couple of days trying to get things resolved.
Now with mobility, with desktop virtualization, with just how we deliver applications (Citrix has been doing this since time began), we are changing how you deliver those applications to the users so they have access to them. Even if they lose their hardware, you have that capability to give them what they need, where they need it, and from the device they need it from. So, a great use case of this is, you know, just changing how IT can actually deliver service to the users. So Dropbox: nobody is supposed to use Dropbox, everybody does. I get yelled at for that all the time, but I turn around and tell my IT department, until you can give me a way to get access to all of my data and all my systems from any of my devices, I’m going to use Dropbox and there is nothing you can do about it. And that’s right now the case. It’s a problem that VMware is trying to solve with their project. We see some cloud-based vendors trying to help solve that problem as well, to make sure that people have their data but still make it secure for the corporate IT environment, to get them more comfortable with the service that they simply can’t provide, because of cost, because of complexity, because of security, that their users are actually demanding and in many cases requiring.
So, being able to transition that into how users need access to this data, need access to the application, and need all this information is ultimately what’s driving a lot of this change in infrastructure, introducing the world of cloud computing, but wreaking havoc on internal IT. So, what happens is we get this. It’s everybody’s nature to try to blame somebody else. You hear VMware talking about, you know, what happens when I have an application problem. I go yell at the virtualization admin, I tell him that it’s his application that sucks, not my infrastructure. That’s what we see at pretty much every organization. It’s human nature to try to blame somebody else so I don’t have to do the hard work, or, if I do have to do the hard work, to try to find an excuse not to have to do it until next week.
And that’s what we get. End users have a problem. They are going to start interfacing with their line of business. The line of business has to try to disseminate it. It goes to the virtual team, and he is going to blame a storage guy. The storage guy is going to blame the network, the network is going to say it’s a database problem. The database guy is going to try and say the user is stupid, he doesn’t know what he is doing. This is what we see in IT almost every single day, and this is something that we have to find a way to get around. So we started to see, as long as, you know, several years ago, that kind of accepted infrastructure team: a team of core individuals responsible for that infrastructure at all costs. A network guy, a storage guy, an OS guy, a hardware guy; get them on the same team, so they have the same set of goals. With the segmentation, the siloization of IT, where the networking team reports to a different director and the storage guy is different than the hardware team, communication falls apart, and the ability to solve challenges in the organization is almost impossible unless you get the right organization set up. Until you can work adequately as a team, you’re never going to be able to get to the point where you’re not just trying to blame somebody else so you don’t have to do the work. So with that, that’s kind of the front portion of the presentation. Now it’s going to be the educational aspect of it.
I’ll still try to keep a little bit of fun, but I want to talk more about the importance of capacity management in your environment overall. What is it, you know, why is it important, and try to break it down to a fundamental level, to help you guys just get a better understanding of why you need to be concerned about capacity. Because of these shifts, it’s a little bit more important than traditional server monitoring. Yes, I hit 75% utilization on one of my ESX servers. Great! That’s good in the modern PC world or the modern server world with virtualization. But what does that mean for how much more I can add? When am I going to run out? These are the things that are shifting the concern from what you’ve had to look at in the past to what you need to look at going forward, because of the demands on IT and the fact that you can click a couple of mouse buttons and provision a new system for users. That’s what they are coming to expect. If you have to say, give me six weeks to order more hardware, that defeats a major purpose of why virtualization was put into a lot of organizations in the first place. So with that, let’s just try to understand fundamentally what capacity management is. At the most basic level, capacity management is the number of objects that can be added to a container that ultimately consumes resources.
This doesn’t just apply to a compute model. It can apply to several different things. In a very specific example, it’s the number of VMs I can put on a host before I run out of CPU, memory, or disk resources. In an Exchange environment, how many more users can I put on the mail store before I run out of memory or disk? So again you apply this exact same concept, and actually, you know, fortunately for us, the exact same calculations, to determine these different types of scenarios. How many requests per second can I put on a web server before I run out of resources? And I’m not sure if anybody is familiar with Ron Oglesby; I think he may have been up here, he now works for Unidesk. He was my partner in crime several years ago in consulting, and I wrote a couple of books with him. Big guy, former Navy SEAL, not the kind of guy you want to piss off. But I did have the displeasure of rebooting his servers once. So, the number of times you can reboot his servers and test his patience before he wants to throttle you: that answer is actually one. So the capacity of that is relatively small, and you can fill it up very quickly.
So, when looking at capacity management, what are the key questions that are being asked of you as administrators? What is it that IT management is really concerned about, and what is it that you need to be concerned about as you are looking at your environment? Again, it’s not, you know, how can I get 100 virtual machines onto a single server. Cramming everything into a box isn’t the right answer. It’s, you know, how much more can I do with my current capacity? How quickly am I filling it up? How much more can I do with that existing capacity? When am I going to run out? I guess that kind of answers the same question there. What happens if I add or remove capacity? What happens if I change the growth patterns? What happens if I change how I use my resources? Where is the best place to actually run my workloads? These are the things that you have to think about every single day, every time somebody makes a request: do I have enough room, where do I have enough room, and, when someone makes a request, am I going to be able to fulfill it within a reasonable timeframe?
In order to really start understanding capacity, it all starts with performance, and performance is measured in utilization. Utilization is a point-in-time snapshot of used versus capacity in my environment, normally displayed as a percentage value. The key questions that you really have to look at around utilization are these. How many total objects can my environment hold? That’s a very, very important metric to understand. If someone makes a request for 100 virtual machines, can I actually fulfill that request without having to order more hardware and go through a longer process? How many objects am I currently holding in that environment? How many more objects can I add? And if I want to add X more objects, will they fit? Four very key questions in the environment that, you know, at a very simplistic level I can answer with just a very basic utilization image.
So, how many total objects can my environment hold in this case? Six slots. How many am I currently holding? Four. I can add two more, and if I want to add three or four more I can’t do it. Very, very simple, about as basic as you get from a capacity standpoint, but something that is actually very often overlooked by organizations in understanding what their systems are actually capable of. In reality, from an application standpoint within a virtual environment, you simply can’t look at capacity in a separate silo from performance. Again, it’s not about cramming 100 virtual machines into an ESX host; it’s a horrible metric. I wish it would die in a fire because it’s just not right. The right question is, how much more can I do in my environment without impacting the performance of everything else running in it? So you take a look at not just physical capacity, but also performance capacity. Storage is a great example. I have 30 terabytes of storage attached to my ESX server, but as soon as I start running 10, 15, 20 virtual machines, I may only be able to fill up 20% of that before my disk latency hits 50, 60, 70 milliseconds.
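The six-slot example can be written out as a minimal sketch. The function names `headroom` and `fits` are hypothetical, chosen only for illustration; the numbers (six slots, four in use) come from the example in the talk.

```python
def headroom(total_slots: int, used_slots: int) -> int:
    """Return how many more objects fit in the container."""
    return total_slots - used_slots


def fits(total_slots: int, used_slots: int, requested: int) -> bool:
    """Return True if `requested` more objects fit without adding capacity."""
    return requested <= headroom(total_slots, used_slots)


total, used = 6, 4                      # six slots, four in use
print(f"utilization: {used / total:.0%}")   # utilization: 67%
print("can add:", headroom(total, used))    # can add: 2
print("3 more fit?", fits(total, used, 3))  # 3 more fit? False
```

This only covers physical capacity. It says nothing about whether the workloads already in the container are performing well, which is the point the talk turns to next.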
That means I’m wasting a lot of space that I can’t use, because how much I can actually put on there relates to how my communication flows. CPU ready has been one of the most critical values for a long time: how much time is my virtual CPU waiting to run because it’s waiting for other processes to finish? I could have plenty of availability in terms of CPU; I’m only using 70, you know, 75%, so I should be able to run the other processes without a problem. Well, it depends on what type of calculations are occurring inside of the OS; system mode internal processes will consume CPU differently than user mode processes. If you start to see that you’re overusing internal processes, you’re going to see your CPU ready start to spike faster than in other systems. You need to understand that when you hit 7, 8, 9% CPU ready, you are going to be waiting for CPU cycles to become available for some of your workloads, even though you may have some 30-40% of your total CPU available. So understanding that correlation between what’s a physical capacity limitation and what’s a performance capacity limitation is vital to really understanding how much more you can do in the environment, and you have to have visibility into, and correlate, performance with capacity to get that right level of understanding.
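One way to picture the difference between a physical limit and a performance limit is a check that looks at both. This is a rough sketch under stated assumptions: `HostSample` and `check_headroom` are hypothetical names, the 7% CPU ready and 50 ms latency thresholds are simply the low ends of the ranges mentioned in the talk, and the 80% CPU utilization ceiling is an assumption added for illustration. None of these are vendor defaults.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    cpu_used_pct: float     # CPU utilization, percent
    cpu_ready_pct: float    # CPU ready time, percent
    disk_latency_ms: float  # observed disk latency, milliseconds


def check_headroom(sample, max_cpu_used_pct=80.0,
                   max_cpu_ready_pct=7.0, max_latency_ms=50.0):
    """Return (ok, limits): ok is False if any threshold is reached."""
    limits = []
    if sample.cpu_used_pct >= max_cpu_used_pct:
        limits.append("cpu utilization")   # physical capacity limit
    if sample.cpu_ready_pct >= max_cpu_ready_pct:
        limits.append("cpu ready")         # performance capacity limit
    if sample.disk_latency_ms >= max_latency_ms:
        limits.append("disk latency")      # performance capacity limit
    return (len(limits) == 0, limits)


# 70% CPU used looks like plenty of room, but 8% CPU ready says otherwise.
host = HostSample(cpu_used_pct=70.0, cpu_ready_pct=8.0, disk_latency_ms=12.0)
ok, limits = check_headroom(host)
print(ok, limits)  # False ['cpu ready']
```

The thresholds are parameters on purpose, since the right values differ by environment and by storage type.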
Once you have utilization locked down, it’s all about stacking utilization at different points in time to establish a trend. How has my environment been growing, how has it been changing over a period of time? Once you can formulate that pattern, you can start to actually answer that question: how has my utilization changed over time? Here, trending is very, very rarely static; you always have some form of anomaly in the data. Quarter-end processing will cause it to spike. If you’re a retailer, if you’re Best Buy or Target (they’re big in this area), every October, funnily enough, my utilization starts to go through the roof, and it dies at the end of December. I’m an online retailer; of course it’s going to increase over the Christmas season. So it’s about being able to track utilization in these different time slices and understanding trends, not just at a micro level, looking at, well, over the last 8 hours I’ve slowly been increasing, but looking at broader patterns to encompass the regular processing and annual trends that you see in your environment. And the more data that you have, the more accurately you can build up those trend lines and understand what’s normal behavior for your environment.
Once you understand what that trend is, you can start to do some forecasting. Forecasting, just like the weather, is always a prediction; there always will be some margin of error when you try to predict what’s going to happen in the future. The questions that this portion of it really answers are, how many objects will I be consuming at a point in time in the future, and when will I fill up that capacity? In reality, the further out you forecast, the less accurate that prediction is going to be. It doesn’t matter how you slice up that data or what type of equations you apply to it: the further out, the less accurate. And different algorithms modify the accuracy of those predictions. If you just do a basic linear trend, you are going to miss curves, peaks, valleys, different things in that data. Even using any of the products in the market, whether it’s ours or whether it’s VMware’s, if you change from logarithmic to algorithmic to polynomial to linear, you can have something that says I can have 5 more virtual machines or I can have 500 more virtual machines, depending on how it’s looking at that data, doing nothing more than simply changing the statistical calculation that you are using to figure out what that forecasting line looks like.
In this very example, what we have is a linear trend. Starting off two weeks ago, I’m running 2 virtual machines. I added one last week, so I’m running 3. Today I added another one for a total of 4. Based on that I can say I am adding one virtual machine per week; next week we’re going to be at 5, and two weeks from now we would be at 6. So I can add two more to my environment, and in two weeks I’ll be at full capacity. I’d better order more hardware now, so I have enough time to get it in place for the work 3 weeks out from now. That way I don’t have to tell my customer, “sorry, I can’t add anything else until the hardware comes.” Very, very simplistic example, but it’s trying to show how the accuracy of this data can change, again, based on the types of calculations that you are applying to it. Linear will give you different projections than polynomial. Polynomials can be a little bit more accurate. These are just, again, very basic hand-drawn images to show how just a slight curve in that trend line can change what that forecast ultimately shows.
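The linear example above can be reproduced with an ordinary least-squares line. This is a minimal sketch, not how any particular product forecasts; `linear_fit` and `weeks_until_full` are hypothetical names, and the six-slot capacity is carried over from the earlier utilization example.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_full(weeks, vm_counts, capacity):
    """Weeks from the latest sample until the trend line hits capacity.

    Returns None when the trend is flat or shrinking.
    """
    slope, intercept = linear_fit(weeks, vm_counts)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - weeks[-1]


weeks = [-2, -1, 0]      # two weeks ago, last week, today
vm_counts = [2, 3, 4]    # VMs running at each point
print(weeks_until_full(weeks, vm_counts, capacity=6))  # 2.0
```

Swapping the straight line for a curve would change the answer for the same three data points, which is the caution the talk raises about relying on a single statistical model.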
From my standpoint, the adjustment of resources is where things get really, really interesting. These are all the variables that you can think of that change how you are using your utilization and how much capacity you ultimately have: any number of things that change that equation of how many more objects I can add to a container that consumes resources. I change my container size; I have increased it, or I have decreased it because I lost one of the ESX servers in my cluster. All these different types of things ultimately change the answers to those long-term questions. If I add more capacity, how many more objects can I add because of that? Being able to tell my management, if we buy 2 more ESX servers for our cluster we will be able to run 45 more virtual machines, is very important information that can be relatively simply captured and gathered using the right set of tools.
So if we take a look at the adjustments, they can occur in multiple ways, either increasing capacity or decreasing capacity. Increasing capacity could be that I added more memory to a host because I was running low. Over time my utilization is steadily increasing. I order my memory today; I’m at 66% utilization. Based on my growth trend, next week I’m really pushing up on this. I pop in more memory, and all of a sudden I’ve dropped from 83% utilization to 75% utilization, and I can run more virtual machines. The same thing can be applied to adding an additional host to a cluster, or creating an additional cluster to accommodate more workloads. Provisioning additional LUNs using different storage increases my overall throughput and capacity availability from a storage standpoint. All these things change how I use each of these different resources individually. Just the same, if I’m looking at decreasing capacity, what happens in my cluster? In a normal scenario I want to make sure that I plan for N plus 1 failover of any host in my cluster. Sometimes N plus 2 or N plus 3, depending on how critical that information is.
It’s about being able to understand and set my service level: if I lose one server in my environment, I still don’t want to be more than 75% utilized. It’s about understanding what those calculations mean and how they ultimately apply to determining, if this goes down, will I suffer a performance impact on my existing workloads? In the top use case, you’d be at 100% utilization if you lost one of the hosts while running 12 virtual machines. If you are running 13 virtual machines and you lose a host, you are going to run into a problem where you can’t fully accommodate the load that’s being demanded by your existing workloads. Being able to understand this information is vital to maintaining that level of performance, meeting the service levels for your users, and not causing a problem that could have been avoided by just having the right level of planning and insight into when you are ultimately going to fill up your capacity.
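The N plus 1 arithmetic can be sketched as follows. The cluster shape here is an assumption: four hosts with four VM slots each, chosen only because it makes 12 virtual machines come out at 100% after one host is lost, as in the talk. The function names are hypothetical.

```python
import math


def utilization_after_failures(vm_count, hosts, slots_per_host, failures=1):
    """Utilization (1.0 == 100%) on the hosts that survive `failures` losses."""
    surviving_slots = (hosts - failures) * slots_per_host
    if surviving_slots <= 0:
        raise ValueError("no surviving capacity")
    return vm_count / surviving_slots


def max_vms_for_target(hosts, slots_per_host, target_utilization, failures=1):
    """Most VMs that keep post-failure utilization at or below the target."""
    surviving_slots = (hosts - failures) * slots_per_host
    return math.floor(surviving_slots * target_utilization)


# Assumed cluster: 4 hosts x 4 slots, planning for the loss of one host.
print(utilization_after_failures(12, hosts=4, slots_per_host=4))  # 1.0
print(utilization_after_failures(13, hosts=4, slots_per_host=4) > 1.0)  # True
print(max_vms_for_target(4, 4, target_utilization=0.75))  # 9
```

Passing `failures=2` or `failures=3` gives the N plus 2 and N plus 3 variants mentioned above.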
You can also make, you know, adjustments in how you use your utilization. There are several ways that this can happen: an increase in utilization or a decrease in utilization. From an increase standpoint, let’s say Quest Software goes and buys a company like VKernel. Instead of provisioning, you know, 10 user mailboxes a week, now we are provisioning 15. If I go back to the insurance company, back when I was doing it, Zurich Insurance was merging with Farmers Insurance, owned by the same company in Switzerland. Well, now all of a sudden we went from having to add 100 virtual machines a month to 200 virtual machines a month. And we had to scale that infrastructure across the LA and Chicago data centers adequately. A lot of planning had to go into that to make sure that we could continue to service both sides of that business.
One of the things that we ultimately want to push people towards is not continuing to just throw crap at virtualization to the point that it breaks, but maintaining an optimal environment. Optimize those resources. When that application developer comes to you and says, I need 8 gigabytes of memory for my application, and you’re done laughing at him and you give him 2 gigabytes, make sure that you have the right level of information to say, why are 6 gigabytes of your memory just filled with 0s according to the hypervisor? I’m going to pull that memory away, optimizing it so the hypervisor doesn’t have to schedule that resource utilization and can give it to other virtual machines. Especially if it’s a Linux operating system, which is just going to suck up all the gigabytes and then dish them out as it feels necessary. Any database application will suck up as much memory as possible and do its own memory management. These are things that really aren’t using any of that memory, which can be more efficiently given out to other virtual machines without impacting how the scheduler has to manage those memory resources, without, you know, overconsuming. And with the new pricing, it’s about making sure that you are allocating or granting the right amount of memory to those resources, to get the most out of your environment without having to pay additional licensing fees on top.
One of the things that you have to be careful of when looking at a change in utilization is, is it a real change in utilization or is it some form of anomaly? Again, if I make an acquisition and all of a sudden I’m doubling how much I’m provisioning, as I look at the change in the trend, is this something permanent? You know, with 4 data points, it’s just not enough data to be able to tell whether this is a permanent trend or whether it is temporary. Going from two weeks ago, I added one workload last week, and I’m adding two workloads this week. Does that mean that I am going to add two more again next week? Does it mean I’m going to add three more next week? Or was that just an anomaly, and I am going to add one more next week? You need more data points. You need time for that data to normalize to be able to determine if it really was a change and, if it was a change, to make sure that it’s a static change and actually normalize that data. So when you do see some of these anomalies and changes, it does take time. We can’t just instantly turn around and say, “oh, you doubled how much you provisioned this week, so we are going to double it for the rest of time.” We have to see that information, put it into the statistical models, and give you that right level of information to see if it truly is an overall change in how you utilize your servers.
So again, one of the key takeaways is that capacity is not about how much you can cram into your hosts and clusters. That’s the absolute wrong way to look at it. What we’re seeing as an industry average right now is that on a typical 2 CPU server, where the amount of memory doesn’t seem to matter, you get about 10 virtual machines per ESX host. That’s an industry-wide average that we’re seeing right now. Again, I have some people that will stand up and say, “I’m actually getting 40 without performance problems.” I have some people that will stand up and say, “I am getting four before I run out of capacity.” It is all about how much more you can put in there before you start to swap memory, before you start running into high disk latency, before you start to hit ridiculous amounts of CPU percent ready time. It’s about understanding what that right value and balance is for you.
And one of the key things that’s kind of on the backend, that seems to drive a lot of this, is understanding your virtual CPU to physical core ratio. So if I am running dual core systems versus six core systems, how does that impact how many virtual CPUs I can actually assign to that system? Once you get to a ratio higher than 1:1, you have to be sharing CPU resources at some point. Likely you’ll get to that point beforehand because, you know, the hypervisor doesn’t necessarily drop one vCPU on a single core and keep a balance that way. It’s constantly shifting everything. Last time I checked, I think it was literally every 20 milliseconds that it’s recalculating where the best place for individual workloads is. It’s doing a lot of work to do that, and once you start getting past a certain point, there is hypervisor overhead that is churning to try to figure out, I’ve got a 2:1 ratio of virtual CPUs to assign to physical cores, so how do I keep up with that while still maintaining the performance of the virtual machines? So a lot of intelligence, a lot of backend processing, is needed to go and see and understand this, and this is all information that we can help provide insight into, provide access to, and help you understand what’s the right fit for your environment, for the types of workloads that you run. Because I’m sure I can go into any company and everyone is doing something just a little bit differently, and there is no one answer that really fits what everyone is doing.
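The virtual CPU to physical core ratio is simple to compute once the host layout is known. A minimal sketch, with assumptions called out: `vcpu_to_core_ratio` is a hypothetical helper, and the workload of ten virtual machines with two vCPUs each is made up for illustration (ten per host echoes the industry average quoted earlier; the vCPU count per VM is an assumption).

```python
def vcpu_to_core_ratio(vcpus_per_vm, sockets, cores_per_socket):
    """Total assigned vCPUs divided by physical cores on the host."""
    physical_cores = sockets * cores_per_socket
    if physical_cores <= 0:
        raise ValueError("host must have at least one physical core")
    return sum(vcpus_per_vm) / physical_cores


vms = [2] * 10  # assumed workload: ten VMs, two vCPUs each

# Same workload on a two-socket dual-core host vs a two-socket six-core host.
print(vcpu_to_core_ratio(vms, sockets=2, cores_per_socket=2))            # 5.0
print(round(vcpu_to_core_ratio(vms, sockets=2, cores_per_socket=6), 2))  # 1.67
```

Anything above 1.0 means cores are being shared, so the ratio is best read alongside CPU ready rather than on its own.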
At the end of the day, capacity is more than just applying that simple mathematical calculation. You need those best practices; you need to understand how the hypervisor works. You need to understand, you know, what does it really mean if I have 14 different ESX servers, down multiple paths, all accessing the same datastore? What type of contention does that cause, and by simply spreading that workload without changing any resources, how can I actually get better performance out of that environment? That’s stuff that obviously, as a systems management vendor, we’re out there to help you solve. At the end of the day, you are always going to have a next bottleneck, and you don’t want to sit there focusing on it and saying, oh no, I need to go optimize memory, I need to optimize storage. If it’s not causing a performance problem, or it’s not at risk of causing an immediate problem, there are many, many more things you guys probably need to worry about on a day-to-day basis. So don’t get hung up on the fact that you have a next bottleneck. There is always a next bottleneck. You are always going to run out of some resource first, and understanding when you are going to run out of that resource, and understanding your options for what you can do to prevent or mitigate that, those are really the important questions.
So, with that, that concludes my quick, short, simple presentation. One of the things that we do is we put out these paper cards. If you do fill one of these cards out, I will take it back with me and get it mailed off to Boston, where they are going to do a drawing that’s posted on the website listed here. One winner from this event will be announced early next week, and one out of every 10 downloads and activations that come from that link will actually win an Iomega StorCenter ix2 2 terabyte device. So, great deal. During the acquisition process, when I was actually doing a trial of the software, I did one download and I actually got notified that I was a winner. Of course I was polite and emailed them back and said, hey, I’m actually from Quest, that was my personal email address, so they pulled it away from me. But it does go to show that you can actually win this thing, and it’s a great little device, works very well, and anybody who downloads has a 10% chance of actually winning one based off the link below. So, with that, I’m open for any questions at this point. Yes?
Male Speaker: It has the ability to add multiple sockets and cores at the VM level? Has that made the capacity [Inaudible] [0:32:40.6] more complex?
Scott Herold: From our standpoint… the question is, do the multiple core, multiple socket capabilities of the most recent version of the hypervisor cause additional problems? The way that we really look at it is at a more fundamental level, looking at just the overall utilization, and there are certain triggers that we can look for to determine when something is causing a problem. Those triggers existed before there were multiple core SMP virtual machines, when it was just virtual SMP, which has been around for a long time. So from our standpoint it didn’t cause a major shift in how we look at capacity. In terms of some of those metrics, like the virtual CPU to core ratio, it does modify that a little bit, along with how it will ultimately impact your environment, and it changes how you actually want to provision some of those resources. But overall I don’t think it changes the overall capacity of your system. With the additional enhancements to the hypervisor and to the hardware, everything about it has made it so that you still get increased performance and increased capacity, even though we are doing a lot of things that are more intensive for the hypervisor itself. Yes.
Male Speaker: [Inaudible] [0:33:57.6]?
Scott Herold: It makes it really difficult, especially when there are different service levels or different thresholds at which different types of storage normally run into issues. NFS storage has different settings around latency before it starts to cause a major problem versus fiber-based storage. It depends on whether I’m running fiber drives versus solid-state drives versus SAS drives.
Male Speaker: [Interviewer] [0:34:28.2]
Scott Herold: And yeah, when you have to look at all of that, again, all we can really do is kind of look at those triggers and make the software customizable enough. We have some best practices that we’ll do out of the box, but 90% of the time we’re wrong, so we make it so you can actually say, you know, hey, in my environment, when this type of storage hits a certain latency, that’s a problem for me.
Male Speaker: [inaudible] [0:34:52.6]
Scott Herold: You know, VDI is really unique. Yeah, from a VDI standpoint I continue to be a fan of running your stateless images on local SSD [takes another picture]; keep your user data, your custom data, the data that changes, in your central storage infrastructure. If an OS or a server goes down, you lose a couple of images that were on it, but you can quickly spin those up on another system. You can still manage that capacity accurately, and if your user data, the data that’s actually critical, can still be accessed externally, it doesn’t matter if users are booting up on one image on this server one time and another image on a different server the next time. So the risk of losing an OS image is mitigated if you properly plan for user data in that external storage subsystem. There are all kinds of different scenarios and situations, and I’m sure if somebody from NetApp or EMC is in the crowd they’ll stand up and fiercely disagree with me and say that everything must be on centralized storage because it’s fast enough for everything.
But it’s all about mitigating risk and cost, especially associated with VDI, and of course storage is one of the largest costs of any type of VDI project; it can really prevent you from even getting off the ground with that type of solution. So, one last thing I’ll mention, it’s across the slides on the front: VKernel was recently acquired by Quest Software. One of the things that we are trying to do with the acquisition overall is maintain the VKernel business model of providing easy to use, very consumable software, focused on the questions and the use cases that customers need answered. But at the same time, with the Quest Software solutions that we’ve had, we have a highly scalable architecture, a highly customizable, very powerful solution. So for us it’s a matter of combining the best of both of those solutions to create something that truly is unique in the market, something that on the lower end still allows us to compete with some of the smaller vendors like SolarWinds and Veeam and still rapidly deliver new solutions to market, while at the same time scaling up to the largest organizations and the needs that they have, without sacrificing one portion of that market for another. Because virtualization, regardless of the fact that it has been around for several years, still really is in its infancy. We see a lot of larger organizations that are still 30-35-40% virtualized.
We’re still hearing stories of small businesses that have never heard of server virtualization. So our ability to continue to capture that market allows us to remain flexible and provide that easy-to-use solution, while combining it with the traditional Quest Software still allows us to scale and compete at the same time with the big 4 vendors that we’re used to competing with around application and database performance management, moving up a level in that same market. So that was the logic and the reasoning behind our acquisition, and ultimately it makes us a very serious player across the entire market for server virtualization. It’s combined with our data protection capabilities from the historic vRanger product that we had, as well as some of our recent acquisitions there, like BakBone Software, creating a real data protection platform, and at the same time creating a performance, capacity, and management platform for not just the server virtualization environment, but the infrastructure that virtualization relies upon, with storage and network, as well as the workloads that people use as they consume virtualization for their application tiers. So that’s the approach that we’re taking, and so far it has been a great ride. This acquisition has been one of the best moves for us and continues to drive us in the market.

