Friday, December 12, 2014

ESXi - hp-ams memory leak (Fix critical AMS issue)

A couple of months ago we had a strange, critical problem with our hosts (ESXi 5.1) in one of our vCenters (v5.1).

All hosts (around 20) started to behave strangely. VMs could not be started, stopped or migrated with vMotion.
On the hosts we could not start or stop any service.
When we tried to open the ESXi console/shell we got "can't fork".

Some of the symptoms that can happen:

SYMPTOM: ESXi host hangs during virtual machine vMotion. VMware termed this the HP-AMS 9.6 memory leak issue.
SYMPTOM: Virtual machines keep working fine.
SYMPTOM: ESXi host is hung and does not accept any command, either through PuTTY or the iLO IRC.
SYMPTOM: Unable to fetch vm-support logs, as the server is hung.
SYMPTOM: Cannot perform a vMotion to or from an ESXi host.
SYMPTOM: Cannot enable services on the ESXi host.
SYMPTOM: When attempting to enable services or vMotion, the ESXi host fails.
SYMPTOM: When logging in to the ESXi shell, this message is seen: can't fork

SYMPTOM: When pressing Alt+F12 at the DCUI, this error is seen:
WARNING: Heap: 2677: Heap globalCartel-1 already at its maximum size. Cannot expand

After a lot of troubleshooting, and some help from VMware support (this happened before it was a known and reported issue), we found that the problem was in hp-ams (HP Agentless Management Service).

CAUSE: The HP-AMS service creates lots of zombie processes in the background, which take up all the memory and make the host hang (host services or VM actions).

This is a known issue in hp-ams versions 500.9.6.0-12.434156 and 550.9.6.0-12.1198610.
HP states that the problem can be found on ESXi 5.0, 5.1 and also 5.5, with all hp-ams v9.6.x and some 10.0.1.x versions.

So the option is to update/upgrade hp-ams to the latest version (10.0.1-2.x).

First we need to download the proper versions for our HP Server and also for our ESXi version.

You can check both on the HP support site:

Latest versions:

For ESXi 5.0/5.1 AMS Offline Bundle: Here
Full Bundle: Here
 
For ESXi 5.5 AMS Offline Bundle: Here
Full Bundle: Here

Note: I recommend that you apply the full bundle so that you can update all your HP VIBs.

More details in the related VMware KB article.

After you download the proper HP offline bundle and copy it to your host, you need to remove the old version and install the new one.

Here is how to:

On your ESXi console run these commands

##check hp-ams version
esxcli software vib list | grep -i hp-ams
##stop hp-ams service
/etc/init.d/hp-ams.sh stop

##Remove old hp-ams
esxcli software vib remove -n hp-ams

Note: Even though it is not mandatory, I always reboot the host before installing the new version.

##Install the new hp-ams (example for the ESXi 5.0/5.1 full bundle)
esxcli software vib install -d /fullpath/hp-ams-esxi5.0-bundle-10.0.1-2.zip -f

If it is just the AMS file, you can run the same command, just changing the file name.
*fullpath is the VMFS datastore and folder where you copied the file on the ESXi host.

After installing, you need to reboot your host.

This should fix your ams issues.
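To confirm which hp-ams build a host is running before and after the update, here is a minimal sketch. It assumes the version is the second column of the "esxcli software vib list" output and that an affected build has ".9.6." in its version string, as in the two versions named above. The function names are my own, and it does not try to detect the affected 10.0.1.x builds.

```shell
#!/bin/sh
# Print the version column of the hp-ams VIB.
# Reads `esxcli software vib list` output on stdin (assumes version is column 2).
hp_ams_version() {
    awk '$1 == "hp-ams" { print $2 }'
}

# Return 0 if the version string belongs to the 9.6.x series
# (assumption: affected builds contain ".9.6.", e.g. 500.9.6.0-12.434156).
is_leaky_hp_ams() {
    case "$1" in
        *.9.6.*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Only runs on a host where esxcli is available.
if command -v esxcli >/dev/null 2>&1; then
    ver=$(esxcli software vib list | hp_ams_version)
    if is_leaky_hp_ams "$ver"; then
        echo "hp-ams $ver is in the affected 9.6.x series"
    else
        echo "hp-ams ${ver:-not installed}: not in the 9.6.x series"
    fi
fi
```

Running it again after the reboot should report the new version.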

Hope this can help.

Monday, June 16, 2014

ESXi 5.x HP NICs not found after firmware update.



When applying the HP Smart Update Manager (HP SUM) or the HP Service Pack for ProLiant on your HP servers, your network cards may stop working. In ESXi 5.x the NICs will not be found after the firmware update.

The problem occurs in these models:

    HP NC373T PCIe Multifunction Gig Server Adapter
    HP NC373F PCIe Multifunction Gig Server Adapter
    HP NC373i Multifunction Gigabit Server Adapter
    HP NC374m PCIe Multifunction Adapter
    HP NC373m Multifunction Gigabit Server Adapter
    HP NC324i PCIe Dual Port Gigabit Server Adapter
    HP NC326i PCIe Dual Port Gigabit Server Adapter
    HP NC326m PCI Express Dual Port Gigabit Server Adapter
    HP NC325m PCIe Quad Port Gigabit Server Adapter
    HP NC320i PCIe Gigabit Server Adapter
    HP NC320m PCI Express Gigabit Server Adapter
    HP NC382i DP Multifunction Gigabit Server Adapter
    HP NC382T PCIe DP Multifunction Gigabit Server Adapter
    HP NC382m DP 1GbE Multifunction BL-c Adapter
    HP NC105i PCIe Gigabit Server Adapter

I have seen this problem occur in 3 ways.
  1.     Applying HP Service Pack for ProLiant (HP SPP) Version 2014.02.0
  2.     Applying directly the firmware update CP021404(particularly in DL360-G5)
  3.     Applying HP Service Pack for ProLiant (HP SPP) Version 2014.02.0 (B)
1 - Applying HP Service Pack for ProLiant (HP SPP) Version 2014.02.0

This is the HP Advisory:

           DESCRIPTION
"On certain HP ProLiant servers, certain HP Broadcom-Based Network adapters listed in the Scope may become non-functional when they are updated with the Comprehensive Configuration Management (CCM) firmware Version 7.8.21 using firmware smart component
If the server has already stopped responding or boots to a black screen, power-cycle the server to recover and then use the smart components CP021847.scexe and CP021848.scexe Version 2.11.20 to remove the CCM from the affected adapters as described below. If this condition continues to occur, contact HP. Please reference Document ID c04258318 when speaking with HP Support"

This issue can be fixed by running the smart components CP021847.scexe and CP021848.scexe on the ESXi host.
Download the CP, upload it to your ESXi host and then run it (sh ./CPxxxxx).

But I noticed that sometimes this does not fix the problem. If it does not fix the issue with your network cards, go to step 2 for the solution.

2 - Applying directly the firmware update CP021404(particularly in DL360-G5)

We can apply the interface card firmware update directly in the ESXi console by running "sh ./CP021404.exe"
Download the CP, upload it to your ESXi host and then run it (sh ./CPxxxxx).

Even though HP states this can only happen when applying the SPP, I have also noticed this issue when applying the previous firmware update (CP021404) through the console on DL360-G5 servers.
The NIC's boot code, PXE/UEFI, IPMI, UMP, CLP, iSCSI, NCSI, FCoE and CCM code are corrupted and/or the MAC address of the network adapter has changed (on a dual-port card both ports have the same MAC address). Correcting this is not so easy.

Before the solution I would like to thank all the guys in the HP Forum who provided information and some of the steps to fix this issue.

To fix the issue we need to fix the NIC/LOM and prepare the Non-volatile Memory for New NIC/LOM.

1. Download all the tools that we need to fix the NIC/LOM

- download FreeDOS
- download XDIAG.exe
- download bc08c740.bin
- read all the information in SETUP.TXT

2. Prepare the FreeDOS.iso
- After downloading open the ISO  with a tool like UltraISO
- Add the XDIAG.exe and the bc08c740.bin to the ISO using UltraISO
- Save the ISO with a new name
- Burn it or mount it with iLO

3. Boot from CD 
4. Run xdiag in engineering mode, type: xdiag -b06eng
5. On command prompt type: device 1
6. On command prompt: nvm fill 0 0x600 0
7. On command prompt: nvm upgrade -bc "/bc08c740.bin" (use the full path to the bin file with "")
8. On command prompt: nvm cfg and type default, then type 16=10, which sets the BAR size to 32 for this NIC (see the guide).
9. Save
10. Then let's change to the second network port (device): on command prompt type device 2 and repeat steps 6-8, then run the command 1=00:00:18:xx:xx:xx (change the last digit so that device 2 gets a different MAC address).
11. Save
12. Exit
13. Afterwards do a full POWER CYCLE (unplug) of the server
      Note: You can boot FreeDOS again, run xdiag in engineering mode again, and confirm that the MAC and RAM size are still changed.

14. DONE

Double check the NICs' MAC addresses (you can do this in ESXi with the vSphere Client) to confirm they are set to the MAC addresses you entered in step 10.
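The same check can be done from the ESXi shell. This is only a minimal sketch, assuming that "esxcli network nic list" prints one row per vmnic with the MAC address as one of the whitespace-separated fields. The function name duplicate_macs is my own. It prints every MAC address that shows up on more than one NIC, which is the symptom described above for dual-port cards.

```shell
#!/bin/sh
# Print each MAC address that appears more than once in the input.
# Reads `esxcli network nic list` output on stdin and picks out any
# field that looks like aa:bb:cc:dd:ee:ff, whatever column it is in.
duplicate_macs() {
    awk '
    BEGIN {
        h  = "[0-9A-Fa-f][0-9A-Fa-f]"
        re = "^" h ":" h ":" h ":" h ":" h ":" h "$"
    }
    {
        for (i = 1; i <= NF; i++)
            if ($i ~ re) count[tolower($i)]++
    }
    END {
        for (mac in count)
            if (count[mac] > 1) print mac
    }'
}

# Only runs on a host where esxcli is available.
if command -v esxcli >/dev/null 2>&1; then
    dups=$(esxcli network nic list | duplicate_macs)
    if [ -n "$dups" ]; then
        echo "Duplicate MAC addresses found:"
        echo "$dups"
    else
        echo "No duplicate MAC addresses"
    fi
fi
```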

And that's it, this should fix your HP network cards. If you need more details and help, please go to the HP Forum (linked above) and read all the posts, or drop a message here.

3 - But this can also happen with the new SPP update: Advisory

Any ProLiant server using HP Smart Update Manager from the HP Service Pack for ProLiant (SPP) Version 2014.02.0 (B) with any of the following adapters:
HP NC371i Multifunction Gigabit Server Adapter
HP NC373i Multifunction Gigabit Server Adapter
HP NC324i PCIe Dual Port Gigabit Server Adapter
HP NC326i PCIe Dual Port Gigabit Server Adapter
  DESCRIPTION
"When an attempt is made to rewrite the Universal Management Port (UMP) firmware on certain HP Broadcom adapters using HP Smart Update Manager (HP SUM) from the HP Service Pack for ProLiant (SPP) Version 2014.02.0 (B), the message, "Update returned an error" may be displayed:
This occurs because the flash installer is not setting the force update flag when attempting to rewrite the UMP firmware even when the "action" tag in the discovery XML is set to "rewrite." As a result, when a rewrite is attempted using Broadcom's flash tools (without the "force" flag), it returns a non-zero return code which is treated as an error by the installer and will eventually result in the "Update returned an error" message in HP SUM."

In this case we disable the update in the SPP and run CP023219 directly in the console.
Download the CP, upload it to your ESXi host and then run it (sh ./CPxxxxx).

Note: All of the above solutions/information are provided without any warranty. These are just the solutions/steps that I have tried and that fixed my issues, particularly on my DL360-G5 servers.
If you are not comfortable using any of them, please contact HP for support.

Friday, May 30, 2014

VMware ESXi requires No-Execute Memory Protection enabled

Today, while building my new virtual home lab, I had an issue with my new HP DL360-G5.
Note: VMware doesn't support ESXi 5.5 on the DL360-G5 (in this case with Intel Xeon 5150) and it is not listed in the HCL, but it runs smoothly without any issues.

When I was installing ESXi 5.5 (HP-5.73.21 build) on this server I got a purple screen with this warning:


VMware ESXi 5.5 (VMKernel Release Build 1623387)

The system has found a problem on your machine and cannot continue.

VMware ESXi requires the Execute Disable/No Execute CPU feature to be enabled

This means that your server has the No-Execute Memory Protection option disabled.

You can check the VMware KB regarding this issue here.

It is very easy to fix.
Restart your server, enter the BIOS and change the No-Execute Memory Protection option to Enabled.


Then you can re-run the ESXi install.

Tuesday, May 6, 2014

vCenter 5.5 SSO one-way Trusts between Domains/Forests Bug

There is a bug in vCenter 5.5 with AD and SSO that we found, and it is a hassle for big environments with several domains that have only one-way trusts.

I will try to use simple examples that you can map to more realistic environments.

Example:
You have a global domain xpto.com and several subdomains (say, on different continents, plus country subdomains): emea.xpto.com, epac.xpto.com, etc. There is only a one-way trust across most of the domains and forests. In this case it was a one-way trust from our internal domain (country.xpto.com) to the global domain (xpto.com).

All your users are from the global domain. For the vCenter permissions you have groups from your internal subdomain (country.xpto.com), to which you add users from the global domain (xpto.com) and maybe from other domains such as emea.xpto.com and epac.xpto.com.

AD configurations for the vCenter permissions.

AD Group vCenter Admin (admins from your internal domain, but also from the global domain)
AD Group Sales Rep (users from the internal domain, but also from emea.xpto.com, epac.xpto.com).

Those groups have rights to the vSphere Client, and also to the vSphere Web Client.

Here is the problem: using groups from the local domain and adding global users (or users from another one-way-trust subdomain) to them.

Users from other domains inside groups from the internal domain will not be able to connect to the vSphere Client (no permissions). They can connect to the vSphere Web Client, but will not see any vCenter.

Solution/workaround? Just add the users directly (from any domain); then they can log in and have the proper permissions.
If you add the users directly to the vCenter objects (Clusters, Pools, Folders, etc.), they can log in.

In our case this was a big, big problem: we have hundreds of users who log in to the vCenters from different projects and different parts of the world, and we needed to add them one by one to Clusters, Pools, Folders, etc.

This is not a proper way to manage permissions with groups/users, but it was the only way other than rolling back to 5.0.

After we contacted VMware support, they acknowledged the bug (after a lot of tests, emails and remote sessions) and promised that it will be fixed in the future (maybe vCenter 5.5 Update 2).

Check the VMware KB regarding AD trusts, http://kb.vmware.com/kb/2064250, and check VMware's note: VMware is aware of both of these limitations with vSphere 5.5 and is working towards resolving them.

Tuesday, February 18, 2014

ESXi 5.0 HP Smart Array P420i Controller issue (no local Storage and no Array Controller)

Today we had an issue with an ESXi 5.0 host. It is an HP DL360p Gen8, and we needed to reinstall ESXi 5.0.

This ESXi host has 4TB of local storage, so we needed to reinstall ESXi without destroying the local VMFS storage.

We downloaded the latest ESXi 5.0.0.update03-1311175.x86_64 and reinstalled ESXi on the flash card (we have our ESXi hosts installed on internal 4GB flash cards).

During the reinstall, the installer recognized an existing ESXi 5.0 installation and we chose to install (not upgrade). After the install we had a surprise: no local storage on the ESXi host.
Troubleshooting the problem, we noticed that the hpsa driver for the HP Smart Array P420i Controller was the one from the original ESXi 5.0 ISO and doesn't work, so we needed to use the HP bundle driver/VIB.

We have some options here:

First option: Use the HP customized VMware ESXi image, which installs all the HP drivers (VIBs) and bundles. The ISO already includes the hpsa driver that fixes the issue (you should get this from your vendor, or HP account support).

Second option: Install an offline bundle from HP. This offline bundle will update, install and remove old VIBs/drivers.

How to install the HP offline bundle.

First you need to download the latest from HP Software Delivery Repository: http://downloads.linux.hp.com/SDR/index.html

For ESXi 5.0
For ESXi 5.5

In our case it was hp-esxi5.0uX-bundle-1.5-39.zip

Before any changes or installs, we need to put the host in Maintenance Mode.

After downloading the bundle we need to copy it to the ESXi host; again we will use WinSCP for this. You can copy it directly to /var/log/vmware (if it doesn't exist, create it).
Or copy it to any folder and then copy it to /var/log/vmware:

cp offline-bundle.zip /var/log/vmware

After we copy the file we can install the VIB.

Run this command to install drivers using the offline bundle (this requires an absolute path):

For example(in our case):

esxcli software vib install -d /var/log/vmware/hp-esxi5.0uX-bundle-1.5-39.zip

After it finishes, you should see a list of the VIBs that were installed and removed (and also the ones that were skipped).

Next reboot the ESXi host.
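The steps above can be sketched as a small helper that prints the command sequence for review instead of running it. This is only a sketch: the function name bundle_install_plan is my own, and the "esxcli system maintenanceMode set" line is my assumption for doing from the shell what the post does from the client, so check it against your ESXi version before using it.

```shell
#!/bin/sh
# Print (do not run) the commands for installing an HP offline bundle.
# Usage: bundle_install_plan /absolute/path/to/bundle.zip
bundle_install_plan() {
    bundle=$1
    # The install command requires an absolute path, so reject anything else.
    case "$bundle" in
        /*) ;;
        *)  echo "error: bundle path must be absolute: $bundle" >&2
            return 1 ;;
    esac
    # Assumed shell equivalent of entering Maintenance Mode from the client.
    echo "esxcli system maintenanceMode set --enable true"
    echo "esxcli software vib install -d $bundle"
    echo "reboot"
}

bundle_install_plan /var/log/vmware/hp-esxi5.0uX-bundle-1.5-39.zip
```

Printing the plan first makes it easy to catch a relative path or a wrong file name before anything changes on the host.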

After the server has rebooted you can list the VIBs that are installed.

Run the command:

esxcli software vib list | grep -i hp

You should get something like this:

char-hpcru 5.0.3.09-1OEM.500.0.0.434156 Hewlett-Packard PartnerSupported 2013-04-16
char-hpilo 500.9.0.0.9-1OEM.500.0.0.434156 Hewlett-Packard PartnerSupported 2013-04-15
hp-ams 500.9.3.5-02.434156 Hewlett-Packard PartnerSupported 2013-04-26
hp-build 5.20.43-434156 Hewlett-Packard PartnerSupported 2013-04-15
hp-smx-provider 500.03.02.10.4-434156 Hewlett-Packard VMwareAccepted 2013-04-26
hpacucli 9.40-12.0 Hewlett-Packard PartnerSupported 2013-04-16
hpbootcfg 01-01.02 Hewlett-Packard PartnerSupported 2013-04-15
hponcfg 04-00.10 Hewlett-Packard PartnerSupported 2013-04-15
scsi-hpsa 5.0.0-28OEM.500.0.0.472560 Hewlett-Packard VMwareCertified 2013-04-15
scsi-hpvsa 5.0.0-22OEM.500.0.0.406165 Hewlett-Packard PartnerSupported 2013-04-15
vmware-esx-hp_vaaip_p2000 2.1.0-2 Hewlett-Packard VMwareAccepted 2013-04-15
ata-pata-hpt3x2n 0.3.4-3vmw.500.1.11.623860 VMware VMwareCertified 2013-04-15
hpnmi 2.0.11-434156 hp PartnerSupported 2013-04-15

Note: The scsi-hpsa entry is the hpsa driver that we needed to change to fix the HP Smart Array P420i Controller issue.
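To check only the hpsa driver instead of reading the whole list, here is a minimal sketch. It assumes the column order shown in the listing above (name, version, vendor) and that the HP build has "OEM" in its version string, as the scsi-hpsa line above does. The function name hpsa_vib is my own.

```shell
#!/bin/sh
# Print "version vendor" for the scsi-hpsa VIB.
# Reads `esxcli software vib list` output on stdin
# (assumes columns: name, version, vendor, ...).
hpsa_vib() {
    awk '$1 == "scsi-hpsa" { print $2, $3 }'
}

# Only runs on a host where esxcli is available.
if command -v esxcli >/dev/null 2>&1; then
    info=$(esxcli software vib list | hpsa_vib)
    case "$info" in
        *OEM*) echo "scsi-hpsa is an OEM build: $info" ;;
        "")    echo "scsi-hpsa VIB not found" ;;
        *)     echo "scsi-hpsa is not an OEM build: $info" ;;
    esac
fi
```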

Connect to your vCenter (or host) with the vSphere Client and check your datastores on the host and/or Hardware Status (only with vCenter). In the Storage sensors you should see your disks and also the HP Smart Array P420i Controller at the bottom.

Exit Maintenance Mode.

Third option: Install only the hpsa driver (this was what we initially decided to test, to see if it fixed the issue):

Same procedure as the second option, but in this case we install only the hpsa VIB.

You can download from HERE

After you download it, just follow the same tasks as in the option above.

After you install it, you should see a list of the VIB(s) that were removed and the new one(s) that were installed.

Note: To install/remove VIBs you can also check http://kb.vmware.com/kb/2005205

Wednesday, February 12, 2014

Troubleshooting ESXi 5.5 snapshots issues

Today we had some issues with some VMs regarding snapshots and some VM operations.

We had upgraded several ESXi 5.0 hosts to ESXi 5.5 and put them in a new vCenter 5.5b. Consequently the VM hardware was upgraded to v10.

I don't know if these snapshot issues were related to these migrations, but we saw them after the migrations. Out of around 200 VMs (50% of them with snapshots) we found 10 to 15 with these problems.

1st problem:

When trying to revert to any snapshot we got "The operation is not supported on the object".
This one was easy to fix: just remove the VM from the inventory and add it to vCenter again. That fixed the problem.

We may also use the vim-cmd vmsvc/reload vmid option here (the second option in this article).

2nd problem:

When trying to use any of the snapshot options ('Take Snapshot', 'Snapshot Manager' and 'Revert to Current Snapshot'), the virtual machine operations were grayed out. This happens when there are virtual machine tasks running in the background (maybe a previous task hanging).

So first we need to check which tasks are running on the VM.

Just connect to your host (SSH) and list all VMs and their vmid.

Run: vim-cmd vmsvc/getallvms

You will get all VMs and their associated vmid. After you get your VM's vmid, run:

vim-cmd vmsvc/get.tasklist vmid

The output is similar to:

(ManagedObjectReference) [
'vim.Task:haTask-112-vim.VirtualMachine.createSnapshot-3887']


If there is no task running on that virtual machine, you see output:

(ManagedObjectReference) []

But even if you don't see any tasks running, the VM may still have a process running (or hanging). So we should reload the .vmx file (this re-reads the VM configuration without removing the VM from the inventory), and this should be enough to fix the issue.

Run:

vim-cmd vmsvc/reload vmid

Now the Virtual Machine should be ok and you should have all options available again.
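The steps above can be sketched as a small script. It is only a sketch: it assumes the VM name has no spaces (so it is the second field of the getallvms output), the function names are my own, and VM_NAME with its default "myvm" is a placeholder. It only reloads the VM when no task is listed.

```shell
#!/bin/sh
# Print the Vmid for an exact VM name.
# Reads `vim-cmd vmsvc/getallvms` output on stdin; skips the header line.
# Assumes the VM name contains no spaces (field 2).
vmid_for_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Return 0 if `vim-cmd vmsvc/get.tasklist` output on stdin lists a task.
has_tasks() {
    grep -q 'vim\.Task:'
}

# Only runs on a host where vim-cmd is available.
VM_NAME=${VM_NAME:-myvm}   # placeholder VM name
if command -v vim-cmd >/dev/null 2>&1; then
    vmid=$(vim-cmd vmsvc/getallvms | vmid_for_name "$VM_NAME")
    if [ -z "$vmid" ]; then
        echo "VM $VM_NAME not found"
    elif vim-cmd vmsvc/get.tasklist "$vmid" | has_tasks; then
        echo "VM $VM_NAME (vmid $vmid) still has tasks listed, not reloading"
    else
        vim-cmd vmsvc/reload "$vmid"
    fi
fi
```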

More information about these commands here:

http://kb.vmware.com/kb/1013003
http://kb.vmware.com/kb/2048748

UPDATE: Tested the vim-cmd vmsvc/reload on the first issue, and it works.