How Veeam Backup on vVOL can kill your PROD Env (SQL) on HPE Alletra

When a backup takes your SQL servers offline: an HPE Alletra vVol folder-space incident

An incident walkthrough, the root cause, how to diagnose it, and what I'd do differently — for storage and VMware engineers.


Summary

Two production SQL Server VMs sharing one vVol folder on an HPE Alletra 6000 kept going offline "randomly." The trigger turned out to be the weekly Veeam backup: the VMware snapshot delta, written into the same array folder as the live volumes, pushed the folder past its space limit. NimbleOS then set every volume in the folder to non-writable to protect data — resulting in an instant SQL outage on both servers at once. The real long-term constraint wasn't the folder limit at all; it was the pool free space sitting behind it. Below is the whole journey: the symptom, the dead ends, the confirmation, and the monitoring I built so it never surprises us again.


1. The war story

The symptom was maddeningly vague: every so often — not every week, not predictably — both AUSSDC1PRDSQL20 and AUSSDC1PRDSQL21 would lose write access at the same moment. SQL would fall over. The fix each time was a call to HPE Support, who would "unlock" the storage, and we'd be back. No one could say why.

A few things made it hard to chase:

  • It wasn't every backup. Plenty of weekly runs completed fine, so "the backup did it" felt too simple.
  • The environment is segmented. The Veeam server, the management box that can reach the array, and the array's own management interface all live in different network zones. No single host could see everything, which made evidence-gathering awkward.
  • By the time anyone looked, it was healthy again. The array showed all volumes online, and vCenter showed plenty of free datastore space. Nothing was obviously wrong.

That last point is the trap. The failure is transient by nature — it spikes during the backup and recovers minutes after the snapshot is removed. If you only look between backup windows, you see a clean system and conclude it was a fluke. It is not a fluke.


2. The technical root cause

What actually happens

These two SQL VMs are vVol-backed, and on the Alletra they live inside a single folder called SDC2VVOL. A folder on NimbleOS/Alletra has its own space limit — and that limit is independent of, and can be lower than, the datastore capacity vCenter reports.

The weekly Veeam job uses standard VMware-snapshot backup (CBT-based; not storage-snapshot integration, not VSS in this job's config). While that snapshot is open:

  1. SQL keeps writing heavily to its databases.
  2. Every overwritten block has to be preserved, so the snapshot delta grows — and that delta is written into the same SDC2VVOL folder.
  3. Folder usage = base volume data + snapshot delta. During a long backup over a busy SQL box, the delta is large.
  4. If folder usage crosses the folder's space limit, NimbleOS sets the folder's volumes non-writable.

Because all the VMDKs for both SQL servers live in that one folder, they all go read-only at the same instant. That is the simultaneous outage.

When the backup finishes and the snapshot is consolidated and removed, folder usage drops straight back to baseline. Plot folder usage over time and you get a clean sawtooth: flat baseline, sharp spike during backup, fall back after. The outage happens on the weeks where the spike clips 100%.
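The sawtooth comes down to simple arithmetic. Here is a minimal sketch, assuming peak folder usage is just the baseline plus the open snapshot's delta. The function name and the weekly delta figures are hypothetical, chosen only to illustrate a ~36 TiB baseline against the old ~50 TiB limit; they are not measured values.

```powershell
# Hypothetical helper: would a given backup push the folder past its limit?
# Simplified model: peak usage = baseline + snapshot delta.
function Get-FolderPeak {
    param(
        [double]$BaselineTiB,   # folder usage between backups
        [double]$DeltaTiB,      # snapshot delta accumulated while the backup runs
        [double]$LimitTiB       # folder space limit
    )
    $peak = $BaselineTiB + $DeltaTiB
    [pscustomobject]@{
        PeakTiB     = $peak
        PercentUsed = [math]::Round(100 * $peak / $LimitTiB, 1)
        ClipsLimit  = $peak -ge $LimitTiB
    }
}

# Illustrative deltas only: most weeks fit under the limit, one does not.
$weeklyDeltasTiB = 6, 9, 11, 15
$results = foreach ($delta in $weeklyDeltasTiB) {
    Get-FolderPeak -BaselineTiB 36 -DeltaTiB $delta -LimitTiB 50
}
$results | Format-Table
```

With these made-up inputs only the last week has ClipsLimit set to True, which is the "not every backup" pattern: the outage depends on how much SQL overwrote while the snapshot was open.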


The folder-usage chart tells the whole story in one picture:

  • Blue sawtooth — each spike is one weekly backup inflating folder usage, then recovering to the ~36 TiB baseline.
  • Orange dashed line (~50 TiB) — the old folder limit. You can see the spikes creeping up against it.
  • Red OUTAGE point — the bad week where the spike crossed the old limit and the array set the volumes non-writable.
  • Vertical divider — where the limit was raised to 65 TiB.
  • Green band — the headroom that the raise bought; same-size spikes now clear the ceiling comfortably.

The chart deliberately leaves the pool constraint out. It shows only the folder-level sawtooth, which is the most intuitive entry point for the story.

The detail that surprised me: there's no grace

The folder had its overdraft limit set to 0%. The overdraft is a NimbleOS feature that lets a folder burst past its space limit by a configurable percentage (0–200%) before volumes go non-writable. At 0%, there is no buffer — the instant usage hits the limit, volumes lock. That explains why the outage was always abrupt rather than gradual.
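To make the overdraft concrete, here is a small sketch of the usage at which volumes would go non-writable. It assumes the overdraft is applied as a percentage on top of the folder space limit, as described above. The function name is hypothetical, and the 50 TiB figure is only an example.

```powershell
# Sketch: lock threshold = folder limit + (overdraft % of the folder limit).
# Get-LockThresholdTiB is an illustrative name, not an array or module command.
function Get-LockThresholdTiB {
    param(
        [double]$LimitTiB,          # folder space limit
        [double]$OverdraftPercent   # overdraft limit, as a percentage
    )
    $LimitTiB + ($LimitTiB * $OverdraftPercent / 100)
}

Get-LockThresholdTiB -LimitTiB 50 -OverdraftPercent 0    # 50: no buffer at all
Get-LockThresholdTiB -LimitTiB 50 -OverdraftPercent 10   # 55: 5 TiB of burst room
```

At 0% the threshold and the limit are the same number, which is why the lock felt instantaneous.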


The mistake I made in my own analysis

My first hypothesis was a per-volume snapshot quota: I assumed the array was locking a single volume when its snapshot reserve filled. That was wrong, and the evidence corrected me. Every volume's Reserve % was 0, and the array's own alert log named the real mechanism in plain text: the folder overdraft limit.

Folder-level space exhaustion, not snapshot quota. Worth stating plainly because it's an easy wrong turn: on these arrays, folders have space limits, and that's a different object from per-volume snapshot reserves.

The bigger trap: the limit you raise may be one the pool can't honour

Here's the part that matters most for anyone copying this fix. The immediate remedy is obvious — raise the folder space limit to buy headroom. We did (≈40 TiB → ≈50 TiB → 65 TiB over time). But raising a folder limit only helps if the physical pool behind it can actually supply that space.

In our case, the folder limit now stands at 65 TiB, but the pool behind it has only ~40 TiB free, and that pool also feeds ~50 other volumes. The folder can never physically reach its own limit; the pool would run dry first. So the protective ceiling silently moved from "folder limit" to "pool free space." And a pool-full event is worse than the original problem: it takes every volume on the array offline, not just the two SQL servers in one folder.

That's the headline lesson: on a thin-provisioned, over-committed pool, the folder limit is a soft ceiling and pool free space is the hard one. Monitor the hard one.
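The soft-versus-hard distinction can be checked with one comparison. This is a rough sketch under loud assumptions: it treats folder growth as consuming pool space one-for-one, ignoring data reduction and growth in the pool's other volumes. The function name and the numbers are hypothetical.

```powershell
# Sketch: which ceiling does the folder hit first, its own limit or the pool?
# Simplification: 1 TiB of folder growth is assumed to cost 1 TiB of pool space.
function Get-BindingConstraint {
    param(
        [double]$FolderUsedTiB,
        [double]$FolderLimitTiB,
        [double]$PoolFreeTiB
    )
    $folderHeadroom = $FolderLimitTiB - $FolderUsedTiB
    if ($PoolFreeTiB -lt $folderHeadroom) {
        $binding    = 'pool'
        $growthRoom = $PoolFreeTiB
    } else {
        $binding    = 'folder'
        $growthRoom = $folderHeadroom
    }
    [pscustomobject]@{
        FolderHeadroomTiB = $folderHeadroom
        PoolFreeTiB       = $PoolFreeTiB
        Binding           = $binding
        GrowthRoomTiB     = $growthRoom
    }
}

# Illustrative values only: the folder is allowed 70 TiB more, the pool has 45.
$check = Get-BindingConstraint -FolderUsedTiB 30 -FolderLimitTiB 100 -PoolFreeTiB 45
$check
```

When Binding comes back as 'pool', raising the folder limit further buys nothing; only freeing or adding pool capacity does.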


3. How to diagnose it (read-only)

Everything here is non-destructive — --list / --info CLI verbs and GET-only REST. Nothing changes array state. A few practical notes from doing this in a locked-down environment:

Array side (NimbleOS / Alletra CLI over SSH)



Gotchas I hit:
  • array --info needs an array name on NimbleOS 6.1 — use array --list if you just want the inventory.
  • SSH single-command mode doesn't play nicely with the array's keyboard-interactive auth and restricted shell. In PowerShell, Posh-SSH (Invoke-SSHCommand) handles it cleanly where native ssh.exe "<cmd>" does not.
  • The REST API is on port 5392, which is a separate firewall rule from the 443 web UI. The web UI being reachable tells you nothing about whether 5392 is open from your host. If 5392 is blocked, fall back to SSH — folder --info over SSH gives the same numbers the REST CSV would.
  • Field labels matter for parsing. Don't loose-regex "used" / "limit": folder --info has both "Volume mapped usage" and "Total volume and snapshot usage," and "Overdraft Limit (%)" is a percentage (value 0), not a size. Match the exact labels or you'll divide by zero.
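The last gotcha is easier to see in code. Below is a minimal sketch of exact-label matching. The sample text is invented and assumes a simple "Label: value" layout with unitless values; only the three label names come from the real output, so check the layout and units on your own array before relying on it. Get-LabeledValue is a hypothetical helper.

```powershell
# Sketch: pull a value by its exact label rather than by a loose keyword.
function Get-LabeledValue {
    param([string]$Text, [string]$Label)
    # Anchor the full label at the start of a line; Escape() keeps "(%)" literal.
    $pattern = '(?m)^\s*' + [regex]::Escape($Label) + '\s*:\s*(\S+)'
    $match = [regex]::Match($Text, $pattern)
    if (-not $match.Success) { throw "Label not found: $Label" }
    $match.Groups[1].Value
}

# Invented sample, NOT real array output (assumed "Label: value" layout).
$sample = @'
Volume mapped usage: 1000
Total volume and snapshot usage: 1500
Overdraft Limit (%): 0
'@

# A loose pattern grabs the first line that mentions "usage", the wrong field.
$loose     = [regex]::Match($sample, 'usage:\s*(\S+)').Groups[1].Value
$total     = Get-LabeledValue -Text $sample -Label 'Total volume and snapshot usage'
$overdraft = Get-LabeledValue -Text $sample -Label 'Overdraft Limit (%)'
"loose=$loose total=$total overdraft=$overdraft"
```

The helper throws when a label is missing instead of returning an empty value, so a changed output format fails loudly rather than feeding a zero into a percentage calculation.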


vCenter side (PowerCLI, read-only)

$vms = Get-VM -Name AUSSDC1PRDSQL20, AUSSDC1PRDSQL21   # resolve both SQL VMs once

Get-Snapshot -VM $vms                                 # any snapshot during normal ops is suspicious

Get-VIEvent -Entity $vms -MaxSamples 1000 | Where-Object FullFormattedMessage -match 'snapshot|consolidat|datastore|space'   # default is only 100 events

Get-Datastore -Name *SDC2VVOL*                        # capacity vs free


Veeam side (PowerShell)

Two things to check, both read-only:

  • Job config — is it using VMware snapshots vs storage-snapshot integration vs VSS? (UseChangeTracking, VSS flags.) This tells you what is creating the delta.
  • Session timing — start/end of each weekly run. Line these timestamps up against the array's folder-usage spike and the offline alert. When the Veeam runtime overlaps the array spike overlaps the offline alert, that's your proof.

Note: Veeam B&R v13's PowerShell module requires PowerShell 7+ — it will not load under Windows PowerShell 5.1. PowerCLI and Posh-SSH are fine on 5.1; only the Veeam module forces pwsh.

The correlation that confirms it

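Don't rely on any single source. The confirmation is three timelines lining up:

  • Veeam: the start and end of the weekly job session.
  • Array: the folder-usage spike.
  • Array: the alert for the volumes going non-writable.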
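The three-way correlation can be sketched as a pair of interval checks. All timestamps below are hypothetical placeholders; in practice they would come from the Veeam session, the array's usage history, and the array's alert log. Test-Overlap is an illustrative helper name.

```powershell
# Sketch: two time windows overlap if each one starts before the other ends.
function Test-Overlap {
    param([datetime]$StartA, [datetime]$EndA, [datetime]$StartB, [datetime]$EndB)
    ($StartA -le $EndB) -and ($StartB -le $EndA)
}

# Hypothetical timestamps for illustration only.
$backupStart = [datetime]'2024-01-06T22:00:00'
$backupEnd   = [datetime]'2024-01-07T03:30:00'
$spikeStart  = [datetime]'2024-01-06T22:10:00'
$spikeEnd    = [datetime]'2024-01-07T03:40:00'
$alertTime   = [datetime]'2024-01-07T02:15:00'

$spikeDuringBackup = Test-Overlap $backupStart $backupEnd $spikeStart $spikeEnd
$alertDuringBackup = ($alertTime -ge $backupStart) -and ($alertTime -le $backupEnd)
$alertDuringSpike  = ($alertTime -ge $spikeStart)  -and ($alertTime -le $spikeEnd)

# Confirmed only when all three line up.
$confirmed = $spikeDuringBackup -and $alertDuringBackup -and $alertDuringSpike
"Backup caused the outage: $confirmed"
```

Because the three sources sit in different network zones, make sure their clocks and time zones agree before comparing timestamps.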



4. The fix and the monitoring

Immediate

  • Raise the folder space limit to restore backup headroom (done). This is the right first lever and matches what HPE does when they "unlock" you.
  • Enable a small overdraft (we set 10%) so the folder-limit failure mode becomes a brief burst window instead of an instant lock. Be honest about its value, though: once the folder limit exceeds pool free space, the overdraft sits at a ceiling the pool can't reach, so it mostly won't engage. It's a cheap belt, not the fix.

The actual safety net: monitor pool free space

I wrote a read-only PowerShell monitor (SSH + Posh-SSH, since REST/5392 is blocked from our host) that runs on a schedule and tracks both dimensions:

  • Folder usage vs limit (warn 85% / crit 92%).
  • Pool free space (warn < 25 TiB / crit < 12 TiB) — the binding constraint.
  • Pool used % to match the GUI header exactly, so the script and the console tell the identical story.
  • Appends a CSV history row each run for trending, and emails on threshold breach.
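The threshold logic is the simplest part of the monitor. What follows is a cut-down sketch rather than the script itself: the function name, the CSV column names, the history path and the sample readings are all hypothetical, and the SSH collection and email steps are left out.

```powershell
# Sketch: classify one reading against the thresholds listed above.
function Get-Severity {
    param([double]$FolderPercent, [double]$PoolFreeTiB)
    $folder = if ($FolderPercent -ge 92) { 'CRIT' } elseif ($FolderPercent -ge 85) { 'WARN' } else { 'OK' }
    $pool   = if ($PoolFreeTiB -lt 12)  { 'CRIT' } elseif ($PoolFreeTiB -lt 25)  { 'WARN' } else { 'OK' }
    # Overall severity is the worse of the two dimensions.
    $rank    = @{ OK = 0; WARN = 1; CRIT = 2 }
    $overall = if ($rank[$folder] -ge $rank[$pool]) { $folder } else { $pool }
    [pscustomobject]@{ Folder = $folder; Pool = $pool; Overall = $overall }
}

# Hypothetical reading: folder at 88.5% of its limit, 40 TiB free in the pool.
$severity = Get-Severity -FolderPercent 88.5 -PoolFreeTiB 40

# Append one history row per run (the path is an arbitrary example location).
$historyPath = Join-Path ([System.IO.Path]::GetTempPath()) 'alletra-capacity-history.csv'
$row = [pscustomobject]@{
    Timestamp     = (Get-Date).ToString('s')
    FolderPercent = 88.5
    PoolFreeTiB   = 40
    Severity      = $severity.Overall
}
$row | Export-Csv -Path $historyPath -Append -NoTypeInformation
```

Taking the worse of the two severities means a pool problem is never masked by a healthy-looking folder percentage.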

A typical line:


Run it every few minutes around the backup window and you'll watch the sawtooth live — which is exactly the early warning the old setup never had.

The real long-term fix

For multi-TB SQL, stop putting giant VMDK snapshot deltas in the folder at all. Move SQL data protection to native SQL backups (full + transaction log) via Veeam's application-aware processing. That removes the snapshot-delta-in-the-folder problem at its source. Bigger change, correct destination.


5. Lessons learned / what I'd do differently

  1. Believe the alert log before your own theory. The array told us "folder overdraft limit" in plain English from day one. I spent time on a snapshot-quota hypothesis the evidence didn't support. Read the array's own words first.
  2. For transient capacity issues, capture during the event. A snapshot of a healthy system between backups proves nothing. The data that mattered was the alert-log history and a run timed to overlap the backup.
  3. Distinguish soft limits from hard limits. Folder space limit = soft, configurable, easy to raise. Pool free space = hard, physical. Raising a soft limit above the hard one just moves the failure mode to a worse place. Always check the pool behind the folder.
  4. A folder is a blast radius. Co-locating both SQL servers in one vVol folder meant one space event took both down together. Worth asking whether critical peers should share a folder at all.
  5. Over-provisioning is fine until it isn't. Thin provisioning and strong data reduction (our pool runs ~5x reduction) make over-committed limits tolerable — but reduction ratios aren't guaranteed as data grows. Over-commit with a monitor, not on faith.
  6. Mind the platform footguns. Veeam v13 needs PowerShell 7. The REST API port is separate from the web UI port. Array SSH wants keyboard-interactive handling. array --info needs an argument. None are hard once known; all cost time when not.
  7. Monitor the number that can actually hurt you. We now alert on pool free space, not just the folder percentage. The thing that takes the whole array down is the thing on the dashboard.

Environment: HPE Alletra 6000 (NimbleOS 6.1), vVol over Fibre Channel, vCenter 8, Veeam B&R v13.

The scripts I used are linked below. Use them to gather logs and correlate them with an outage, or to monitor proactively by analysing the stats.

https://github.com/cloudmigrator/Veeam
