The Incident

What Happened

In December 2025, one of our production FTP servers went down hard. The server — running Windows Server 2016 — was responsible for ingesting high volumes of files around the clock. At some point during a peak processing window, all write operations on the D: drive failed simultaneously.

The symptom was deceptively simple: the operating system reported 0 bytes available on the D: drive. Applications failed, file transfers aborted, and alerts fired across the board.

The first instinct of anyone on the team was obvious: the disk is full. Except it wasn't.

The outage was caused by something most infrastructure engineers have never encountered, and that is exactly what makes it dangerous. NTFS metadata exhaustion is invisible to standard monitoring, looks exactly like a disk space problem, and cannot be solved by adding disk space.

The Investigation

When we RDP'd into the server and opened Windows Explorer, the D: drive showed over 500GB of free space on a 3TB volume. Task Manager and diskpart agreed. The disk was not full by any conventional measure.

We started working down the checklist:

• Disk free space — fine (500GB+ available)

• File system errors — none reported by chkdsk surface scan

• Volume Shadow Copy Service — not consuming excessive space

• Disk health — SMART status clean

• Permissions — service accounts had full control

Nothing obvious. I started digging into the NTFS filesystem metadata layer using fsutil — a command-line tool that exposes internals that Windows Explorer and standard monitoring tools never surface.

This is where we found the real problem. The file count was around 70 million in a single FTP folder.

Prequel to Root Cause

Root Cause: MFT and USN Journal Exhaustion

NTFS maintains a data structure called the USN Journal — the Update Sequence Number journal. It is a change log that records every file and directory operation on a volume: creates, deletes, renames, attribute changes, and writes. The journal exists primarily to support backup software, file replication, antivirus engines, and search indexing — all of which use it to efficiently detect what changed since they last ran.

On Windows Server 2016, the USN journal has a critical characteristic: it is static. The journal is allocated a fixed maximum size at volume format time. When the allocated journal space fills up, NTFS does not gracefully expand it. Instead — and this is the critical detail — NTFS begins failing write operations on the volume entirely.

Critical Finding: The server was not out of disk space. The NTFS USN journal had consumed all of its pre-allocated metadata space. NTFS responded by locking writes at the filesystem level, even though hundreds of gigabytes of raw storage remained unused.

Running fsutil against the affected drive revealed this immediately:

fsutil usn queryjournal D:

Usn Journal ID : 0x01d9a3f5b2c40000

First Usn : 0x0000000000000000

Next Usn : 0x0000000100000000 <- allocated space exhausted

Lowest Valid Usn : 0x0000000000000000

Max Size : 0x0000000100000000 <- 4096 MB maximum

Allocation Delta : 0x0000000040000000 <- 1024 MB delta

The Next USN value had reached the Max Size ceiling. The journal was full. NTFS had nowhere to write the change records for new operations, so it refused to allow new writes at all.
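The hexadecimal fields are easy to misread by a digit, so it helps to convert them to megabytes before comparing them. Below is a minimal Python sketch of that conversion. The sample text is illustrative only (a 4096 MB ceiling and a 1024 MB delta written as hex), and the helper names parse_queryjournal and to_mb are hypothetical, not part of the scanner described later.

```python
import re

# Illustrative sample only: a 4096 MB ceiling and a 1024 MB delta written as hex.
SAMPLE_OUTPUT = """\
Usn Journal ID   : 0x01d9a3f5b2c40000
First Usn        : 0x0000000000000000
Next Usn         : 0x0000000100000000
Lowest Valid Usn : 0x0000000000000000
Max Size         : 0x0000000100000000
Allocation Delta : 0x0000000040000000
"""


def parse_queryjournal(text):
    """Return {field name: integer value} for every 'Name : 0x...' line."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def to_mb(value_in_bytes):
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return value_in_bytes / (1024 * 1024)


journal = parse_queryjournal(SAMPLE_OUTPUT)
print(f"Max Size: {to_mb(journal['Max Size']):.0f} MB")
print(f"Allocation Delta: {to_mb(journal['Allocation Delta']):.0f} MB")
```

Parsing into integers first means the later comparisons work on numbers rather than on strings of zeros.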

A second server running the same workload was found in the same state within the same week.

I will try to explain it as simply as possible using a bank account analogy.

Think of the file system like a bank system:

• MFT (Master File Table) → the main account database (stores every file and folder record)

• USN Journal → the audit log (tracks every change for backup, antivirus, auditing, replication, etc.)

• NTFS Transaction Log ($LogFile) → the transaction ledger (ensures data consistency)

Every file operation (create, delete, rename, modify) is treated as a financial transaction:

You cannot update an account unless both the audit log (USN) and transaction ledger ($LogFile) can also be written.

What Went Wrong

Over time, the system accumulated tens of millions of small files, which caused:

• The MFT to grow extremely large (~23.4 GB)

• Heavy fragmentation of filesystem metadata

• The USN audit log to reach its size and growth limits (by default, it is set to 512 MB)

When Windows attempted normal operations (such as deleting or writing files), NTFS tried to:

1. Update the account database (MFT)

2. Write an audit log entry (USN Journal)

3. Record the transaction (NTFS log)

At this point, the audit log (USN journal) could no longer expand because of metadata fragmentation and the growth boundary (the 512 MB limit).

Just like a bank: if the audit log is full, all transactions are blocked, even if money is still available.

So, Windows blocked all file changes to protect filesystem consistency.

This caused:

• The disk to report “0 bytes free”

• File deletion failures

• Folder errors such as “directory not empty”

• The volume to appear locked

All of this happened even though nearly 900 GB of physical disk space was still available.

Root Cause (Technical Summary)

The failure was caused by:

• Extremely high file counts (~24 million)

• Very large MFT size (~23.4 GB)

• USN Journal size too small for enterprise-scale workloads

• NTFS metadata fragmentation preventing journal expansion

This triggered NTFS metadata exhaustion, a known behavior in large-scale file systems.

This was an architectural scaling limit being reached.

Permanent Fix Implemented (tested and confirmed write-lock removed)

Rebuilt and resized the NTFS change journal:

• Increased USN Journal size from 512 MB → 4 GB

• Increased allocation growth from 128 MB → 1 GB

This:

• Eliminates fragmentation risk

• Allows safe metadata expansion

• Prevents recurrence of this failure mode

Additionally:

• MFT growth reservation is configured optimally

After expanding the USN Journal, I was able to unlock the D: drive, which immediately allowed the MFT to expand and removed the write lock.

Why Windows 2016 Specifically?

This is not a bug that affects all versions of Windows equally. The USN journal behaviour changed significantly between OS versions:

OS Version	USN Journal Type	Behaviour at Max Size
Windows Server 2016	Static (fixed allocation)	Fails ALL write operations when full
Windows Server 2019	Semi-static	Improved but still requires monitoring
Windows Server 2022	Dynamic (auto-expanding)	Self-managing, much lower risk profile

Our affected servers were both running Windows Server 2016. The high volume of file operations — thousands of files per hour being created, processed, and deleted — generated an enormous number of USN journal entries. The journal filled up faster than anyone anticipated.
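To get a feel for how quickly a busy server writes journal data, a back-of-envelope estimate helps. Every input below is an assumption chosen for illustration, not a measurement from our servers: 5,000 files per hour, four journal records per file over its lifetime, and 128 bytes per record. The function names are hypothetical.

```python
def bytes_per_day(files_per_hour, records_per_file, bytes_per_record):
    """Journal bytes written per day under the stated assumptions."""
    return files_per_hour * 24 * records_per_file * bytes_per_record


def days_to_write(journal_bytes, files_per_hour, records_per_file, bytes_per_record):
    """Days needed to write journal_bytes worth of change records."""
    daily = bytes_per_day(files_per_hour, records_per_file, bytes_per_record)
    return journal_bytes / daily


FOUR_GB = 4096 * 1024 * 1024

# All three workload figures are illustrative assumptions.
days = days_to_write(FOUR_GB, files_per_hour=5000, records_per_file=4, bytes_per_record=128)
print(f"{days:.1f} days to write 4096 MB of journal records")
```

Plugging in measured values for a real workload would show whether a given maximum size buys weeks or years of headroom.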

The Fix

The immediate fix was straightforward once the cause was identified. We resized and recreated the USN journal using fsutil with a larger maximum size and a larger allocation delta:

# Delete the existing journal

fsutil usn deletejournal /D D:

# Recreate with 8GB max size and 512MB allocation delta

fsutil usn createjournal m=8589934592 a=536870912 D:

Write operations resumed immediately after the journal was recreated. No data loss occurred — the journal exhaustion only blocked writes; it did not corrupt existing data.
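Because fsutil takes both sizes in bytes, it is easy to mistype a digit. The following Python sketch builds the createjournal command line from megabyte values. The function names are hypothetical, and the check that the maximum size exceeds the delta is a sanity check of my own rather than a documented fsutil rule.

```python
def mb_to_bytes(megabytes):
    """Convert megabytes to bytes (1 MB = 1024 * 1024 bytes)."""
    return megabytes * 1024 * 1024


def build_createjournal_command(max_size_mb, delta_mb, volume):
    """Build the fsutil command line; fsutil expects both sizes in bytes."""
    # Sanity check only (an assumption, not an fsutil requirement).
    if delta_mb <= 0 or max_size_mb <= delta_mb:
        raise ValueError("max size must be larger than a positive allocation delta")
    return (
        f"fsutil usn createjournal m={mb_to_bytes(max_size_mb)} "
        f"a={mb_to_bytes(delta_mb)} {volume}"
    )


print(build_createjournal_command(8192, 512, "D:"))
```

The sketch only produces the string. Running the command is still a deliberate, manual step on the affected server.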

The same fix was applied to the second affected server. Both returned to normal operation within minutes.

Key Lesson: The fix took minutes. Finding the cause took hours. The real work was building something to ensure it never happens silently again.

Understanding the Problem Deeply

What is the MFT and Why Does It Matter?

Beyond the USN journal, there is a second NTFS metadata structure worth understanding: the Master File Table, or MFT. The MFT is a database that contains one record for every file and directory on the volume. Each file — regardless of its size — has an MFT entry. A volume containing ten million small files will have a very large MFT even if those files are tiny.

The MFT grows as files are created. In environments with high file turnover — where files are constantly created and deleted — the MFT can grow substantially over time because NTFS does not compact or shrink the MFT when files are deleted. Space freed by deleted file records may be reused, but the overall MFT size rarely decreases.

A very large MFT is not immediately dangerous, but it is an early indicator of a high-churn environment that is also at elevated risk from USN journal exhaustion. We therefore monitor both metrics together.
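The relationship between file count and MFT size can be approximated with simple arithmetic. This sketch assumes 1 KB (1024-byte) file records, which is common but not universal; the actual value for a volume is reported by fsutil fsinfo ntfsinfo as Bytes Per FileRecord Segment. The function names are hypothetical.

```python
MFT_RECORD_BYTES = 1024  # assumption: 1 KB file records; check the volume's real value


def estimate_mft_bytes(record_count, record_bytes=MFT_RECORD_BYTES):
    """Lower-bound MFT size: one record per file or directory."""
    return record_count * record_bytes


def to_gib(value_in_bytes):
    """Convert bytes to GiB (1 GiB = 1024**3 bytes)."""
    return value_in_bytes / 1024 ** 3


for files in (1_000_000, 10_000_000, 24_000_000):
    size = to_gib(estimate_mft_bytes(files))
    print(f"{files:>12,} records -> {size:.1f} GiB")
```

This is a floor rather than a prediction, because records left behind by deleted files stay allocated inside the MFT.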

Why Standard Monitoring Missed It

Our existing monitoring infrastructure — like most enterprise environments — was tracking the right things for the wrong abstraction layer. We had alerts on:

• Disk free space (GB and percentage)

• Volume read/write errors

• CPU and memory utilisation

• Service availability and response times

None of these metrics reach into the NTFS metadata layer. Disk free space is a measure of raw storage — it knows nothing about journal allocation. A server with 500GB free and an exhausted USN journal looks perfectly healthy to every standard monitoring tool. This is a genuinely invisible failure mode.

The fsutil command is the only built-in Windows tool that surfaces USN journal allocation details. It is not integrated into Performance Monitor, Windows Admin Center, Event Viewer alerts, or most third-party monitoring platforms.

The Risk Profile Going Forward

After the incident, we audited the rest of our Windows server fleet. We had 14 servers running various workloads, exposing 36 NTFS volumes in total. Several were on Windows Server 2016 with high file operation counts. The audit revealed:

• 2 servers with USN journal configurations identical to the failed server (4096MB max, 1024MB delta on Win2016)

• 3 servers showing measurable MFT growth trends over the previous weeks

• 1 server with an MFT already at 23GB — the original incident server — reflecting years of accumulated file churn

We needed a systematic approach. Manual audits are a point-in-time snapshot. What we needed was continuous monitoring with trend analysis.

Building the Monitoring Solution

Architecture Overview

I built the monitoring solution in two layers:

1. Data Collection — A PowerShell script that runs daily via the existing Veeam backup infrastructure, connects to all Windows VMs via VMware PowerCLI, collects NTFS metadata using fsutil, and saves daily CSV snapshots.

2. Analysis and Alerting — A second PowerShell script that reads all historical CSVs, computes trends, assesses risk, generates a self-contained HTML dashboard, and emails it to the team.

Data Collection: The Scanner Script

The scanner runs from a management server that already has VMware PowerCLI installed. It connects to vCenter, enumerates all running Windows VMs, and uses Invoke-VMScript to run fsutil commands remotely inside each guest — no agent installation, no firewall changes, no new dependencies.

For each NTFS volume on each server, it captures:

• MFT Valid Data Length: the current size of the Master File Table in bytes

• USN Journal Maximum Size: the ceiling beyond which writes will fail

• USN Allocation Delta: the chunk size used when allocating journal space

• USN Allocated Size: how much journal space is currently in use

• Dirty bit state: whether NTFS has flagged the volume as possibly inconsistent and due for a chkdsk run

The data is saved to a daily CSV file in a History folder, creating a time series that accumulates day by day:

D:\Veeam\Scan-FileSystem\Logs\History\20260210.csv

D:\Veeam\Scan-FileSystem\Logs\History\20260211.csv

D:\Veeam\Scan-FileSystem\Logs\History\...

It also saves FSHealth-Summary.csv (today's state) and FSHealth-Predictive.csv (risk predictions from comparing yesterday to today).
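The daily-file pattern is simple to reproduce. Here is a minimal Python sketch of writing one snapshot per day and reading the whole history back in date order. The column names, the server name, and the sample values are all hypothetical, since the real scanner is a PowerShell script whose CSV layout is not shown in this post.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical column names; the real CSV layout is not shown in this post.
FIELDS = ["Server", "Drive", "OS", "MftValidDataLengthBytes",
          "UsnMaxSizeBytes", "UsnAllocationDeltaBytes", "Dirty"]


def append_snapshot(history_dir, day, rows):
    """Append rows to <history_dir>/YYYYMMDD.csv, writing the header on first use."""
    history_dir = Path(history_dir)
    history_dir.mkdir(parents=True, exist_ok=True)
    path = history_dir / f"{day:%Y%m%d}.csv"
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return path


def load_history(history_dir):
    """Return (YYYYMMDD, row) pairs from every daily file, oldest first."""
    pairs = []
    for path in sorted(Path(history_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            pairs.extend((path.stem, row) for row in csv.DictReader(handle))
    return pairs


# Demo in a temporary folder with one made-up reading per day.
demo_dir = Path(tempfile.mkdtemp()) / "History"
sample = {"Server": "EXAMPLE-FTP01", "Drive": "D:", "OS": "WS2016",
          "MftValidDataLengthBytes": 1073741824, "UsnMaxSizeBytes": 4294967296,
          "UsnAllocationDeltaBytes": 1073741824, "Dirty": "No"}
append_snapshot(demo_dir, date(2026, 2, 10), [sample])
append_snapshot(demo_dir, date(2026, 2, 11), [sample])
history = load_history(demo_dir)
print([day for day, _ in history])
```

Naming files YYYYMMDD means a plain alphabetical sort is also a chronological sort, which keeps the trend code free of date parsing.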

Risk Detection: What the Script Looks For

The predictive engine compares today's readings against yesterday's and flags conditions that indicate approaching failure. The key detection rules are:

HIGH RISK (Immediate Action): Windows Server 2016 + USN Max >= 4096MB. This is the exact configuration that caused the December 2025 outage. Any server matching this profile gets a HIGH flag regardless of other metrics.

MEDIUM RISK (Monitor Closely): MFT growing consistently on Windows Server 2016 (>20MB over tracking period), OR any server showing >5MB MFT growth regardless of OS version.

The script is deliberately conservative. A server that is fine today but is trending toward the danger zone gets flagged early, giving the team time to investigate before a crisis.
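The two rules above can be expressed as a small classifier. This is a Python sketch of the logic as stated in this post, not the production PowerShell. The DriveReading type and its field names are hypothetical.

```python
from dataclasses import dataclass

USN_MAX_HIGH_MB = 4096    # HIGH rule threshold from the text
MFT_GROWTH_2016_MB = 20   # MEDIUM rule for Windows Server 2016
MFT_GROWTH_ANY_MB = 5     # MEDIUM rule for any OS version


@dataclass
class DriveReading:
    os_version: str       # e.g. "Windows Server 2016"
    usn_max_mb: float     # USN journal maximum size in MB
    mft_growth_mb: float  # MFT growth over the tracking period in MB


def classify(reading):
    """Return "HIGH", "MEDIUM" or "OK" for one drive."""
    is_2016 = "2016" in reading.os_version
    if is_2016 and reading.usn_max_mb >= USN_MAX_HIGH_MB:
        return "HIGH"
    if is_2016 and reading.mft_growth_mb > MFT_GROWTH_2016_MB:
        return "MEDIUM"
    if reading.mft_growth_mb > MFT_GROWTH_ANY_MB:
        return "MEDIUM"
    return "OK"


print(classify(DriveReading("Windows Server 2016", 4096, 0.0)))
print(classify(DriveReading("Windows Server 2019", 512, 24.75)))
print(classify(DriveReading("Windows Server 2022", 512, 1.0)))
```

As written, any drive that trips the 20MB rule also trips the 5MB rule, so the two MEDIUM checks differ only in the reason that would be reported.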

What 19 Days of Data Revealed

After collecting 19 days of continuous monitoring data across all 14 servers and 36 drives, the picture became clear:

Risk	Server / Drive	OS	MFT Size	Finding
HIGH	AUSSDC1FTP001 D:	WS2016	23,928 MB	Exact Dec 2025 incident profile. USN Max = 4096MB, Delta = 1024MB. MFT at 23GB from years of churn.
HIGH	AUSSDC1STGFTP20 D:	WS2016	4,650 MB	Same USN profile. USN resize detected Feb 24 — post-incident remediation applied but still at risk.
MED	AUSSDC1MFT001 E:	WS2016	4,634 MB	MFT growing +21.75MB over 19 days (~1.8MB/day). Consistent upward trend.
MED	AUSMFTCOLD01 E:	WS2019	590 MB	Fastest MFT growth in fleet: +24.75MB in 19 days (~8.25MB/day). Investigate workload.
MED	AUSMFTCOLD01 C:	WS2019	559 MB	Step-change growth detected on Feb 24 — correlates with USN resize event.
MED	AUSSDC1FTP001 C:	WS2016	759 MB	Slow but steady MFT growth +8.5MB on the original incident server's system drive.
OK	All others (30)	Various	—	Stable. Win2022 servers with dynamic USN journal show no risk indicators.

The AI-Powered Dashboard

Why Add AI to Infrastructure Monitoring?

The PowerShell script is good at detecting known patterns — we told it exactly what to look for. But infrastructure risk is rarely that clean. Static threshold rules have well-known limitations:

• A threshold that is appropriate for one server may be meaningless for another with a different workload profile

• A server growing 10MB/day may be completely normal if it has always grown at that rate, or alarming if it suddenly accelerated from 1MB/day

• Correlating multiple signals simultaneously (MFT size + USN configuration + OS version + growth rate + workload type) is difficult to express as simple if/then rules

• Raw fsutil output contains additional fields — repair stream size, $LogFile allocation, VSS overhead — that the script does not yet parse but which carry diagnostic value

An AI model, given the full 19-day history and context about the December 2025 incident, can reason across all of these dimensions simultaneously. It can identify patterns that deviate from each server's own baseline, distinguish genuine risk escalation from normal variation, and explain its reasoning in plain English that management can act on.
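Comparing a server against its own baseline does not strictly need a model. A rough version can be sketched in a few lines of Python, which is useful as a cross-check on whatever the AI reports. The seven-day window and the function names are assumptions for illustration, and the two series are made-up data.

```python
from statistics import mean


def daily_deltas(sizes_mb):
    """Day-over-day change for a series of daily MFT sizes."""
    return [later - earlier for earlier, later in zip(sizes_mb, sizes_mb[1:])]


def acceleration_ratio(sizes_mb, recent_days=7):
    """Recent average daily growth divided by the earlier baseline average.

    Returns None when there is too little history or no positive baseline.
    """
    deltas = daily_deltas(sizes_mb)
    if len(deltas) <= recent_days:
        return None
    baseline = mean(deltas[:-recent_days])
    recent = mean(deltas[-recent_days:])
    if baseline <= 0:
        return None
    return recent / baseline


# Made-up series: one grows 1 MB/day throughout, one jumps to 10 MB/day.
steady = [100.0 + day for day in range(15)]
accelerating = [100.0 + day for day in range(8)] + [107.0 + 10 * day for day in range(1, 8)]

print(acceleration_ratio(steady))
print(acceleration_ratio(accelerating))
```

A ratio near 1 means the drive is growing the way it always has. A ratio well above 1 is the "suddenly accelerated" case described in the list above.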

More practically: our management team and some client stakeholders are not infrastructure engineers. They need the system to tell them in clear language what is wrong, how serious it is, and what to do about it — not hand them a CSV and expect them to interpret it.

Dashboard Architecture

The dashboard is generated daily by a PowerShell script (Generate-FSHealthReport.ps1) that runs automatically after the scan. It reads all accumulated History CSVs, computes risk across all server-drive pairs, and outputs a single self-contained HTML file with all data embedded.

The file requires no web server, no database, no dependencies beyond a browser. It can be opened directly from a network share, emailed as an attachment, or served from any file server. Everything — the data, the charts, the AI integration — is in one file.
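The single-file approach boils down to serialising the data as JSON and writing it into a script block. The real generator is PowerShell, so the following Python sketch only illustrates the idea. The function name, the row fields, and the tiny list renderer are all hypothetical.

```python
import json
from html import escape


def build_report_html(title, rows):
    """Return one HTML document with the data embedded as a JSON literal."""
    # Escape "</" so a value containing "</script>" cannot end the script block early.
    payload = json.dumps(rows).replace("</", "<\\/")
    safe_title = escape(title)
    return "\n".join([
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{safe_title}</title></head>",
        f"<body><h1>{safe_title}</h1>",
        '<ul id="drives"></ul>',
        "<script>",
        f"const DATA = {payload};",
        "const list = document.getElementById('drives');",
        "for (const row of DATA) {",
        "  const item = document.createElement('li');",
        "  item.textContent = row.server + ' ' + row.drive + ': ' + row.risk;",
        "  list.appendChild(item);",
        "}",
        "</script>",
        "</body></html>",
    ])


rows = [{"server": "EXAMPLE-FTP01", "drive": "D:", "risk": "HIGH"}]
report = build_report_html("FS Health Report", rows)
print(len(report), "characters")
```

Because the data travels inside the page, the file keeps working when it is detached from the server that produced it, which is what makes emailing it practical.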

Key Features

• Fleet overview bar showing HIGH / MEDIUM / OK / Total counts at a glance

• Filterable server list sorted by risk level, with inline MFT size and OS indicators

• 19-day MFT trend chart that updates instantly when a different server is selected

• USN parameter cards flagging servers that match the December 2025 incident profile

• Copy for Copilot button that generates a complete, server-specific AI analysis prompt

• Direct API mode for organisations with Anthropic API access

The Copy for Copilot Feature

This is the feature designed specifically for our environment, where Microsoft Copilot is available to staff through standard Microsoft 365 licensing but direct API access is limited.

When a user selects any server-drive pair and clicks Copy Prompt, the dashboard generates a complete analysis prompt containing:

1. The full incident background: what happened in December 2025, how the USN journal works, what the failure mode looks like

2. The selected server's current state: MFT size, growth total, average daily rate, USN max, USN delta, OS version, risk flags

3. The complete 19-day historical record for that specific drive, with every data point

4. Fleet context: which other servers are HIGH or MEDIUM risk, total drives monitored

5. Structured analysis requests: what the AI should assess, compare, and recommend

The user copies that prompt, opens Microsoft Copilot in their browser, pastes it, and receives a detailed technical analysis with prioritised remediation steps — without ever needing API credentials or a developer to help them.

Critically, the prompt is generated fresh for whichever server is currently selected. Every server in the fleet gets its own contextual analysis. It is not a generic template — it contains that specific server's actual data.
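The prompt structure can be sketched as plain string assembly. The dashboard's builder runs in the browser as JavaScript, so this Python version only illustrates the shape of the output. The section titles, argument names, and sample values are hypothetical.

```python
def build_prompt(background, drive, history, fleet_flags, requests):
    """Assemble the five prompt sections into one block of text."""
    lines = ["INCIDENT BACKGROUND", background, "", "SELECTED DRIVE"]
    lines += [f"{key}: {value}" for key, value in drive.items()]
    lines += ["", "HISTORY (date, MFT MB)"]
    lines += [f"{day}, {mft_mb}" for day, mft_mb in history]
    lines += ["", "FLEET CONTEXT"]
    lines += [f"{name}: {risk}" for name, risk in fleet_flags]
    lines += ["", "ANALYSIS REQUESTED"]
    lines += [f"{number}. {text}" for number, text in enumerate(requests, start=1)]
    return "\n".join(lines)


# Made-up sample values for illustration.
prompt = build_prompt(
    background="In December 2025 an FTP server stopped accepting writes.",
    drive={"Server": "EXAMPLE-FTP01", "Drive": "D:", "OS": "WS2016", "USN Max MB": 4096},
    history=[("20260210", 1024.0), ("20260211", 1025.5)],
    fleet_flags=[("EXAMPLE-FTP02 D:", "MEDIUM")],
    requests=["Assess the growth trend.", "Recommend remediation steps."],
)
print(prompt)
```

Keeping the sections in a fixed order makes the responses easier to compare from one server to the next.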

What AI Analysis Adds Over Static Scripts

We asked Copilot to analyse the flagged servers using the 19-day data. Its observations went beyond what the PowerShell thresholds could detect:

• AUSSDC1FTP001 D: has an MFT of 23GB suggesting tens of millions of files have passed through this server over its lifetime. Even if the USN journal is resized, the MFT size indicates this server is operating at a fundamentally different scale than the others and may need architectural review, not just a parameter change.

• AUSMFTCOLD01 E: is growing at 8.25MB/day average but the rate is not constant — it accelerated in the second week of the monitoring period. A static threshold would flag it the same way regardless of acceleration. The AI identified the rate change as a more urgent signal than the absolute value.

• The USN resize events detected on Feb 24 (AUSSDC1STGFTP20 and AUSMFTCOLD01) both coincide — suggesting a remediation run was performed that day, but the script treating them as independent events had no way to correlate them. The AI noted this as a pattern suggesting a deliberate change rather than an anomaly.
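The correlation in the last observation is also something a script can do once events are grouped by date. Here is a small Python sketch. The event tuples, the event type names, and the year are assumptions for illustration.

```python
from collections import defaultdict


def co_occurring_events(events, minimum_drives=2):
    """Group (day, drive, kind) events and keep groups seen on several drives."""
    by_key = defaultdict(list)
    for day, drive, kind in events:
        by_key[(day, kind)].append(drive)
    return {key: drives for key, drives in by_key.items()
            if len(drives) >= minimum_drives}


# Illustrative events; the kind labels are hypothetical.
events = [
    ("2026-02-24", "AUSSDC1STGFTP20 D:", "usn_resize"),
    ("2026-02-24", "AUSMFTCOLD01 C:", "usn_resize"),
    ("2026-02-20", "EXAMPLE-APP01 E:", "mft_step_change"),
]

for (day, kind), drives in co_occurring_events(events).items():
    print(f"{day} {kind}: {', '.join(drives)}")
```

Events that land on the same day across several drives are more likely a planned change than independent anomalies, and grouping them keeps the alert list short.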

Email Integration

The report generator also sends a daily HTML email to the infrastructure team and relevant stakeholders. The email contains:

• A summary bar with HIGH / MEDIUM / OK / Total drive counts

• A full table of every HIGH risk drive with server name, drive letter, OS, MFT size, growth, and the specific risk flag

• A full table of every MEDIUM risk drive with the same fields

• An all-clear message when no risks are detected

• The full interactive HTML report as an attachment

The subject line adapts to urgency — ACTION REQUIRED when HIGH risks are present, MONITOR when only MEDIUM risks exist, and All Systems OK when everything is clean. This means recipients can triage from their inbox without opening anything.
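The subject-line rule is a three-way branch on the risk counts. This Python sketch shows the logic. The prefix and the count suffix are hypothetical additions, and only the three status phrases come from the text above.

```python
def email_subject(high_count, medium_count, prefix="FS Health"):
    """Pick the subject line from the number of HIGH and MEDIUM drives."""
    if high_count > 0:
        status = "ACTION REQUIRED"
    elif medium_count > 0:
        status = "MONITOR"
    else:
        status = "All Systems OK"
    return f"{prefix}: {status} ({high_count} HIGH, {medium_count} MEDIUM)"


print(email_subject(2, 4))
print(email_subject(0, 4))
print(email_subject(0, 0))
```

HIGH always wins over MEDIUM, so a single HIGH drive is enough to raise the whole email to ACTION REQUIRED.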

SMTP configuration is inherited directly from the existing scan script, so there is no new configuration to maintain. The same relay server, same from address, same recipient list.

Remediation Playbook

For HIGH Risk Servers (Win2016 + USN Max >= 4GB)

These servers match the exact profile that caused the December 2025 outage. Treat them as urgent regardless of whether they are currently experiencing issues. The journal filling up is a when, not an if, on high-churn servers.

Step 1: Assess current journal usage

fsutil usn queryjournal D:

Compare Next Usn against Max Size. If Next Usn is within 20% of Max Size, treat as critical.
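The 20% rule of thumb can be checked numerically once the two hex values are converted to integers. This Python sketch applies the rule exactly as stated above. The function name is hypothetical and the sample values are made up.

```python
def journal_headroom(next_usn, max_size, critical_fraction=0.20):
    """Return (remaining fraction of Max Size, True if within the critical band)."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    remaining = max(max_size - next_usn, 0) / max_size
    return remaining, remaining <= critical_fraction


MAX_SIZE = 0x100000000  # 4096 MB

# Made-up readings: one close to the ceiling, one far from it.
print(journal_headroom(0xD0000000, MAX_SIZE))
print(journal_headroom(0x40000000, MAX_SIZE))
```

The first element is the fraction of Max Size still unused, and the second says whether the reading falls inside the critical band.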

Step 2: Resize the journal

# Delete the existing journal (brief impact on VSS/backup tools)

fsutil usn deletejournal /D D:

# Recreate with larger values

# m = max size in bytes (8GB = 8589934592)

# a = allocation delta in bytes (512MB = 536870912)

fsutil usn createjournal m=8589934592 a=536870912 D:

Step 3: Verify

fsutil usn queryjournal D:

Confirm Max Size now reflects the new value. Monitor Next Usn over the following week to ensure the growth rate is sustainable within the new ceiling.

Planning Note: Deleting the USN journal is a brief operation but causes backup software, antivirus, and search indexing to lose their change-tracking checkpoint. They will typically perform a full rescan on next run. Coordinate with backup teams before executing on production servers.

For MEDIUM Risk Servers (Growing MFT)

These servers are not in immediate danger but are showing a trajectory worth investigating. Recommended actions:

1. Identify what is generating the file churn by sampling recent change records with fsutil usn readjournal and reviewing application logs

2. Check whether the workload has changed recently: new processes, new integrations, increased throughput

3. If MFT growth is legitimate and expected, document the baseline so future growth can be compared against it

4. Consider scheduling a maintenance window to run chkdsk /f to verify and repair filesystem metadata; note that chkdsk does not shrink or defragment the MFT

5. Plan OS upgrade to Windows Server 2022 where possible, since the dynamic USN journal eliminates this risk class entirely

The Longer-Term Fix: OS Upgrade

The most durable solution is to eliminate the static USN journal entirely by upgrading affected servers from Windows Server 2016 to Windows Server 2022. On 2022, the USN journal is dynamic — it expands automatically as needed and never causes the write-lock failure mode described in this post.

This is not a quick fix, but it is the right architectural decision for any server running high file throughput workloads. The December 2025 incident and the two additional HIGH-risk servers we found during the monitoring audit are strong justification for accelerating the upgrade timeline on the affected 2016 servers.

Lessons Learned

Technical Lessons

• Assumption broken: Disk free space is not the same as filesystem health.

NTFS can run out of internal metadata space while gigabytes of raw storage remain available. Standard monitoring that only tracks free space percentage will miss this failure mode entirely.

• Tool gap: fsutil is criminally underused.

The information needed to detect and prevent this outage has always been available in Windows. fsutil usn queryjournal exposes the journal state in real time. The problem was that no one was routinely collecting and trending this data.

• OS version matters: Windows 2016 has known USN journal limitations that are fixed in 2022.

This is documented by Microsoft but not prominently enough that most teams are aware of it. Any Windows 2016 server running high file-throughput workloads should be treated as a USN journal risk until upgraded or explicitly monitored.

• Monitoring design: Static thresholds are blind to context.

A rule that fires when MFT exceeds 5GB is meaningless without knowing whether that is the server's normal state or a sudden change. Trend monitoring — tracking the rate of change over time — is far more valuable for this class of problem.

Operational Lessons

• Post-incident monitoring should be automated, not periodic.

After any significant outage, the instinct is to do a one-time audit and close the ticket. What we built instead was a permanent, automated daily check that grows smarter as it accumulates more historical data. The audit never ends; it just runs quietly in the background.

• Infrastructure visibility tools should speak the language of the audience.

The PowerShell script is for engineers. The dashboard is for anyone who needs to understand the risk and make decisions. AI-assisted analysis bridges that gap — it translates raw infrastructure metrics into plain language risk assessments that a non-technical stakeholder can act on.

• Correlation across time is more valuable than point-in-time snapshots.

The 19-day history revealed patterns — acceleration, step-changes, co-occurring events on the same date — that a single day's reading would never show. Building time-series collection from the start was the right decision.

What We Built: Summary

Components

Component	Technology	Purpose
FSHealth Scanner	PowerShell + VMware PowerCLI	Daily collection of NTFS metadata from all Windows VMs via vCenter
History Archive	Daily CSV files	Accumulating time-series of MFT and USN data per server-drive
Predictive Engine	PowerShell (built into scanner)	Compares yesterday vs today, flags USN and MFT risk conditions
HTML Dashboard	Standalone HTML/JS/CSS	Interactive fleet overview with trend charts, server selection, AI prompt generation
Copy for Copilot	JavaScript prompt builder	Generates server-specific AI analysis prompts for use in Microsoft Copilot or Claude
Daily Email Report	PowerShell Send-MailMessage	HTML email with risk tables sent to team, with full dashboard as attachment

Daily Workflow

1. Veeam scheduler triggers the FSHealth scan script each morning

2. Scanner connects to vCenter, collects fsutil data from all 14 servers via PowerCLI

3. Data saved to History\YYYYMMDD.csv and FSHealth-Predictive.csv

4. Report generator reads all history CSVs, computes trends, builds HTML dashboard

5. Dashboard saved to Reports\FSHealth-Report.html (updated in place, same URL daily)

6. Email sent to team with risk summary tables and HTML report as attachment

7. Team opens email, reviews high-risk servers, opens dashboard for detail

8. For any server requiring deeper analysis: select it in the dashboard, click Copy Prompt, paste into Microsoft Copilot

9. Copilot returns a detailed technical analysis with remediation recommendations

10. Engineer executes remediation, validates fix, adds notes to ticket

Closing Thoughts

The December 2025 outage was caused by something that most infrastructure engineers have never encountered — and that is precisely what makes it dangerous. NTFS metadata exhaustion is a failure mode that looks like a disk space problem, behaves like a disk space problem, but cannot be solved by adding disk space. The underlying cause is invisible to every standard monitoring tool.

The solution we built is not particularly complex. The PowerShell collection script is a few hundred lines. The report generator is straightforward. What makes the system effective is the combination of daily automated collection, trend analysis over time, and AI-assisted interpretation that surfaces risk in language that both engineers and management can act on.

If you are running Windows Server 2016 with high file-throughput workloads — file transfer servers, FTP services, large-scale file processing pipelines — I would strongly encourage you to run the following command today:

fsutil usn queryjournal D:

Compare the Next Usn value against Max Size. If they are close, you are closer to a production outage than you realise. Do not wait for the write lock to find out.

When Free Disk Space Lies: How I Solved a Silent NTFS Filesystem Exhaustion Outage and Built an AI-Powered Early Warning System
