When Free Disk Space Lies: How I Solved a Silent NTFS Filesystem Exhaustion Outage and Built an AI-Powered Early Warning System
The Incident
What Happened
In December 2025,
one of our production FTP servers went down hard. The server — running Windows
Server 2016 — was responsible for ingesting high volumes of files around the
clock. At some point during a peak processing window, all write operations on
the D: drive failed simultaneously.
The symptom was
deceptively simple: the operating system reported 0 bytes available on the D:
drive. Applications failed, file transfers aborted, and alerts fired across the
board.
The team's first instinct was obvious: the disk is full. Except it wasn't.
The outage was caused by something most infrastructure engineers have never encountered, and that is exactly what makes it dangerous. NTFS metadata exhaustion is invisible to standard monitoring, looks exactly like a disk space problem, and cannot be solved by adding disk space.
The Investigation
When we RDP'd
into the server and opened Windows Explorer, the D: drive showed over 500GB of
free space on a 3TB volume. Task Manager and diskpart agreed. The disk was not
full by any conventional measure.
We started
working down the checklist:
• Disk free space: fine (500GB+ available)
• File system errors: none reported by chkdsk surface scan
• Volume Shadow Copy Service: not consuming excessive space
• Disk health: SMART status clean
• Permissions: service accounts had full control
Nothing obvious. I started digging into the NTFS filesystem metadata layer using
fsutil — a command-line tool that exposes internals that Windows Explorer and
standard monitoring tools never surface.
This is where we found the real problem. The file count was around 70 million, all in a single FTP folder.
Prequel to Root Cause
Root Cause: MFT and USN Journal Exhaustion
NTFS maintains a
data structure called the USN Journal — the Update Sequence Number journal. It
is a change log that records every file and directory operation on a volume:
creates, deletes, renames, attribute changes, and writes. The journal exists
primarily to support backup software, file replication, antivirus engines, and
search indexing — all of which use it to efficiently detect what changed since
they last ran.
On Windows Server
2016, the USN journal has a critical characteristic: it is static. The journal
is allocated a fixed maximum size at volume format time. When the allocated
journal space fills up, NTFS does not gracefully expand it. Instead — and this
is the critical detail — NTFS begins failing write operations on the volume
entirely.
Critical Finding: The server was not out of disk space. The NTFS USN journal had consumed all of its pre-allocated metadata space. NTFS responded by locking writes at the filesystem level, even though hundreds of gigabytes of raw storage remained unused.
Running fsutil against the affected drive revealed this immediately:
fsutil usn queryjournal D:
Usn Journal ID   : 0x01d9a3f5b2c40000
First Usn        : 0x0000000000000000
Next Usn         : 0x0000000100000000 <- allocated space exhausted
Lowest Valid Usn : 0x0000000000000000
Max Size         : 0x0000000100000000 <- 4096 MB maximum
Allocation Delta : 0x0000000040000000 <- 1024 MB delta
The Next USN value had reached the Max Size ceiling. The journal was full. NTFS had nowhere
to write the change records for new operations, so it refused to allow new
writes at all.
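To make that comparison repeatable, the output can be turned into numbers. The following is a minimal Python sketch, not the PowerShell used in our scanner; it assumes the "Name : 0xHEX" layout shown above, and the sample text and function names (parse_fsutil_fields, to_mb) are illustrative only.

```python
import re

# Sample text in the "Name : 0xHEX" layout; values are illustrative and
# chosen to match a 4096 MB maximum and a 1024 MB allocation delta.
SAMPLE = """\
Usn Journal ID   : 0x01d9a3f5b2c40000
First Usn        : 0x0000000000000000
Next Usn         : 0x0000000100000000
Lowest Valid Usn : 0x0000000000000000
Max Size         : 0x0000000100000000
Allocation Delta : 0x0000000040000000
"""


def parse_fsutil_fields(text):
    """Turn 'Name : 0xHEX' lines into a {name: int} dictionary."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            # int() with base 16 accepts the leading "0x" prefix.
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def to_mb(value_bytes):
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return value_bytes / (1024 * 1024)


journal = parse_fsutil_fields(SAMPLE)
print(to_mb(journal["Max Size"]))          # 4096.0
print(to_mb(journal["Allocation Delta"]))  # 1024.0
```

Lines that do not carry a hexadecimal value are skipped, so the same helper should tolerate extra fields in the output.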
A second server
running the same workload was found in the same state within the same week.
The simplest way I can explain it is with a bank account analogy. Think of the file system as a bank:
- MFT (Master File Table) → the main account database (stores every file and folder record)
- USN Journal → the audit log (tracks every change for backup, antivirus, auditing, replication, etc.)
- NTFS Transaction Log ($LogFile) → the transaction ledger (ensures data consistency)
Every file operation (create, delete, rename, modify) is treated as a financial transaction: you cannot update an account unless both the audit log (USN) and the transaction ledger ($LogFile) can also be written.
What Went Wrong
Over time, the system accumulated tens of millions of small files, which caused:
- The MFT to grow extremely large (~23.4 GB)
- Heavy fragmentation of filesystem metadata
- The USN audit log to reach its size and growth limits (by default, it is set to 512 MB)
When Windows attempted normal operations (such as deleting or writing files), NTFS tried to:
- Update the account database (MFT)
- Write an audit log entry (USN Journal)
- Record the transaction (NTFS log)
At this point, the audit log (USN journal) could no longer expand because of metadata fragmentation and the growth boundary (512 MB limit).
Just like a bank: if the audit log is full, all transactions are blocked, even if money is still available.
So Windows blocked all file changes to protect filesystem consistency. This caused:
- Disk reported “0 bytes free”
- File deletion failures
- Folder errors such as “directory not empty”
- Volume appeared locked
All of this happened even though nearly 900 GB of physical disk space was still available.
Root Cause (Technical Summary)
The failure was caused by:
- Extremely high file counts (~24 million)
- Very large MFT size (~23.4 GB)
- USN Journal size too small for enterprise-scale workloads
- NTFS metadata fragmentation preventing journal expansion
This triggered NTFS metadata exhaustion, a known behavior in large-scale file systems. It was an architectural scaling limit being reached.
Permanent Fix Implemented (tested and confirmed write-lock removed)
Rebuilt and resized the NTFS change journal:
- Increased USN Journal size from 512 MB → 4 GB
- Increased allocation growth from 128 MB → 1 GB
This:
- Eliminates fragmentation risk
- Allows safe metadata expansion
- Prevents recurrence of this failure mode
Additionally:
- MFT growth reservation is configured optimally
After expanding the USN Journal, I was able to unlock the D: drive, which immediately allowed the MFT to expand and removed the write lock.
Why Windows 2016 Specifically?
This is not a bug
that affects all versions of Windows equally. The USN journal behaviour changed
significantly between OS versions:
| OS Version | USN Journal Type | Behaviour at Max Size |
| Windows Server 2016 | Static (fixed allocation) | Fails ALL write operations when full |
| Windows Server 2019 | Semi-static | Improved but still requires monitoring |
| Windows Server 2022 | Dynamic (auto-expanding) | Self-managing, much lower risk profile |
Our affected
servers were both running Windows Server 2016. The high volume of file
operations — thousands of files per hour being created, processed, and deleted
— generated an enormous number of USN journal entries. The journal filled up
faster than anyone anticipated.
The Fix
The immediate fix
was straightforward once the cause was identified. We resized and recreated the
USN journal using fsutil with a larger maximum size and a larger allocation
delta:
# Delete the existing journal
fsutil usn deletejournal /D D:
# Recreate with 8GB max size and 512MB allocation delta
fsutil usn createjournal m=8589934592 a=536870912 D:
Write operations
resumed immediately after the journal was recreated. No data loss occurred —
the journal exhaustion only blocked writes; it did not corrupt existing data.
The same fix was
applied to the second affected server. Both returned to normal operation within
minutes.
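The m= and a= values are plain byte counts, which are easy to mistype. Here is a small Python sketch, offered as an illustration rather than part of our tooling, that derives them from human-readable sizes and assembles the command text; the helper names are hypothetical, and it only builds the string without running anything.

```python
# Binary units, matching the byte values used in the fsutil commands.
UNITS = {"MB": 1024 ** 2, "GB": 1024 ** 3}


def size_to_bytes(amount, unit):
    """Convert a size such as (8, 'GB') to a byte count."""
    return amount * UNITS[unit]


def build_createjournal_command(max_size_bytes, delta_bytes, volume):
    """Return the fsutil command text for recreating the USN journal."""
    if delta_bytes <= 0 or max_size_bytes <= delta_bytes:
        raise ValueError("max size must be larger than a positive delta")
    return f"fsutil usn createjournal m={max_size_bytes} a={delta_bytes} {volume}"


command = build_createjournal_command(
    size_to_bytes(8, "GB"), size_to_bytes(512, "MB"), "D:"
)
print(command)  # fsutil usn createjournal m=8589934592 a=536870912 D:
```

Generating the command this way keeps the sizes reviewable in a change ticket before anyone touches a production volume.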
Key Lesson: The fix took minutes. Finding the cause took hours. The real work was building something to ensure it never happens silently again.
Understanding the Problem Deeply
What is the MFT and Why Does It Matter?
Beyond the USN
journal, there is a second NTFS metadata structure worth understanding: the
Master File Table, or MFT. The MFT is a database that contains one record for
every file and directory on the volume. Each file — regardless of its size —
has an MFT entry. A volume containing ten million small files will have a very
large MFT even if those files are tiny.
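As a rough worked example, the MFT size can be translated into an approximate record count. This Python sketch assumes a 1,024-byte file record segment, which is a common NTFS default but should be checked against the "Bytes Per FileRecord Segment" value from fsutil fsinfo ntfsinfo on your own volume; the function name is illustrative.

```python
def estimate_mft_records(mft_bytes, record_size=1024):
    """Estimate how many file records fit in an MFT of the given size.

    record_size is an assumption (1,024 bytes); the result counts record
    slots, including slots freed by deleted files, so it is an upper
    bound on the number of live files and directories.
    """
    return mft_bytes // record_size


# The 23,928 MB MFT measured on the incident server's D: drive.
mft_bytes = 23928 * 1024 * 1024
print(estimate_mft_records(mft_bytes))  # 24502272
```

Under that assumption, an MFT of this size corresponds to roughly 24.5 million record slots, which is in line with the file counts discussed in the root cause summary.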
The MFT grows as
files are created. In environments with high file turnover — where files are
constantly created and deleted — the MFT can grow substantially over time
because NTFS does not compact or shrink the MFT when files are deleted. Space
freed by deleted file records may be reused, but the overall MFT size rarely
decreases.
A very large MFT
is not immediately dangerous, but it is an early indicator of a high-churn
environment that is also at elevated risk from USN journal exhaustion. We
therefore monitor both metrics together.
Why Standard Monitoring Missed It
Our existing
monitoring infrastructure — like most enterprise environments — was tracking
the right things for the wrong abstraction layer. We had alerts on:
• Disk free space (GB and percentage)
• Volume read/write errors
• CPU and memory utilisation
• Service availability and response times
None of these
metrics reach into the NTFS metadata layer. Disk free space is a measure of raw
storage — it knows nothing about journal allocation. A server with 500GB free
and an exhausted USN journal looks perfectly healthy to every standard
monitoring tool. This is a genuinely invisible failure mode.
The fsutil
command is the only built-in Windows tool that surfaces USN journal allocation
details. It is not integrated into Performance Monitor, Windows Admin Center,
Event Viewer alerts, or most third-party monitoring platforms.
The Risk Profile Going Forward
After the
incident, we audited the rest of our Windows server fleet. We had 14 servers
running various workloads, exposing 36 NTFS volumes in total. Several were on
Windows Server 2016 with high file operation counts. The audit revealed:
• 2 servers with USN journal configurations identical to the failed server (4096MB max, 1024MB delta on Win2016)
• 3 servers showing measurable MFT growth trends over the previous weeks
• 1 server with an MFT already at 23GB (the original incident server), reflecting years of accumulated file churn
We needed a
systematic approach. Manual audits are a point-in-time snapshot. What we needed
was continuous monitoring with trend analysis.
Building the Monitoring Solution
Architecture Overview
I built the monitoring solution in two layers:
1. Data Collection: A PowerShell script that runs daily via the existing Veeam backup infrastructure, connects to all Windows VMs via VMware PowerCLI, collects NTFS metadata using fsutil, and saves daily CSV snapshots.
2. Analysis and Alerting: A second PowerShell script that reads all historical CSVs, computes trends, assesses risk, generates a self-contained HTML dashboard, and emails it to the team.
Data Collection: The Scanner Script
The scanner runs
from a management server that already has VMware PowerCLI installed. It
connects to vCenter, enumerates all running Windows VMs, and uses
Invoke-VMScript to run fsutil commands remotely inside each guest — no agent
installation, no firewall changes, no new dependencies.
For each NTFS
volume on each server, it captures:
• MFT Valid Data Length: the current size of the Master File Table in bytes
• USN Journal Maximum Size: the ceiling beyond which writes will fail
• USN Allocation Delta: the chunk size used when allocating journal space
• USN Allocated Size: how much journal space is currently in use
• Dirty bit state: whether the volume is flagged as dirty (potentially inconsistent, with a disk check scheduled at the next restart)
The data is saved
to a daily CSV file in a History folder, creating a time series that
accumulates day by day:
D:\Veeam\Scan-FileSystem\Logs\History\20260210.csv
D:\Veeam\Scan-FileSystem\Logs\History\20260211.csv
D:\Veeam\Scan-FileSystem\Logs\History\...
It also saves
FSHealth-Summary.csv (today's state) and FSHealth-Predictive.csv (risk
predictions from comparing yesterday to today).
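The value of the History folder comes from reading the daily files back as one time series per server-drive pair. Below is a minimal Python sketch of that idea; the real report generator is PowerShell, and the column names used here (Server, Drive, MftBytes) are assumptions for illustration, not the scanner's actual CSV headers.

```python
import csv
import io
from pathlib import Path


def add_snapshot(series, day, csv_text):
    """Append one day's rows to {(server, drive): [(day, mft_bytes), ...]}."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Server"], row["Drive"])
        series.setdefault(key, []).append((day, int(row["MftBytes"])))


def load_history(history_dir):
    """Read every YYYYMMDD.csv in a folder, oldest first."""
    series = {}
    # File names sort chronologically because they are YYYYMMDD.
    for csv_path in sorted(Path(history_dir).glob("*.csv")):
        # utf-8-sig tolerates a byte-order mark at the start of the file.
        add_snapshot(series, csv_path.stem, csv_path.read_text(encoding="utf-8-sig"))
    return series


# Small in-memory demonstration with made-up numbers.
series = {}
add_snapshot(series, "20260210", "Server,Drive,MftBytes\nSRV1,D:,1000\n")
add_snapshot(series, "20260211", "Server,Drive,MftBytes\nSRV1,D:,1500\n")
points = series[("SRV1", "D:")]
print(points[-1][1] - points[0][1])  # 500
```

Keeping one list of (day, size) points per drive makes growth totals and daily rates a matter of simple subtraction.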
Risk Detection: What the Script Looks For
The predictive
engine compares today's readings against yesterday's and flags conditions that
indicate approaching failure. The key detection rules are:
HIGH RISK (Immediate Action): Windows Server 2016 + USN Max >= 4096MB. This is the exact configuration that caused the December 2025 outage. Any server matching this profile gets a HIGH flag regardless of other metrics.
MEDIUM RISK (Monitor Closely): MFT growing consistently on Windows Server 2016 (>20MB over tracking period), OR any server showing >5MB MFT growth regardless of OS version.
The script is
deliberately conservative. A server that is fine today but is trending toward
the danger zone gets flagged early, giving the team time to investigate before
a crisis.
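The two rules above can be expressed in a few lines. This is a Python sketch of the logic as described, not the scanner's PowerShell; the function name, parameter names, and the way the OS string is matched are assumptions.

```python
def classify_risk(os_version, usn_max_mb, mft_growth_mb):
    """Apply the HIGH / MEDIUM rules to one server-drive pair.

    os_version     free-text OS name, e.g. "Windows Server 2016"
    usn_max_mb     configured USN journal maximum size in MB
    mft_growth_mb  MFT growth over the tracking period in MB
    """
    is_2016 = "2016" in os_version
    # HIGH: the configuration that matched the December 2025 outage.
    if is_2016 and usn_max_mb >= 4096:
        return "HIGH"
    # MEDIUM: sustained growth on 2016, or notable growth on any OS.
    if (is_2016 and mft_growth_mb > 20) or mft_growth_mb > 5:
        return "MEDIUM"
    return "OK"


print(classify_risk("Windows Server 2016", 4096, 0.0))    # HIGH
print(classify_risk("Windows Server 2019", 512, 24.75))   # MEDIUM
print(classify_risk("Windows Server 2022", 8192, 1.0))    # OK
```

Because any growth above 20MB is also above 5MB, the 2016-specific clause never changes the outcome on its own; it is kept here to mirror the rules as written.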
What 19 Days of Data Revealed
After collecting
19 days of continuous monitoring data across all 14 servers and 36 drives, the
picture became clear:
| Risk | Server / Drive | OS | MFT Size | Finding |
| HIGH | AUSSDC1FTP001 D: | WS2016 | 23,928 MB | Exact Dec 2025 incident profile. USN Max = 4096MB, Delta = 1024MB. MFT at 23GB from years of churn. |
| HIGH | AUSSDC1STGFTP20 D: | WS2016 | 4,650 MB | Same USN profile. USN resize detected Feb 24: post-incident remediation applied but still at risk. |
| MED | AUSSDC1MFT001 E: | WS2016 | 4,634 MB | MFT growing +21.75MB over 19 days (~1.8MB/day). Consistent upward trend. |
| MED | AUSMFTCOLD01 E: | WS2019 | 590 MB | Fastest MFT growth in fleet: +24.75MB in 19 days (~8.25MB/day). Investigate workload. |
| MED | AUSMFTCOLD01 C: | WS2019 | 559 MB | Step-change growth detected on Feb 24, which correlates with USN resize event. |
| MED | AUSSDC1FTP001 C: | WS2016 | 759 MB | Slow but steady MFT growth +8.5MB on the original incident server's system drive. |
| OK | All others (30) | Various | - | Stable. Win2022 servers with dynamic USN journal show no risk indicators. |
The AI-Powered Dashboard
Why Add AI to Infrastructure Monitoring?
The PowerShell
script is good at detecting known patterns — we told it exactly what to look
for. But infrastructure risk is rarely that clean. Static threshold rules have
well-known limitations:
• A threshold that is appropriate for one server may be meaningless for another with a different workload profile
• A server growing 10MB/day may be completely normal if it has always grown at that rate, or alarming if it suddenly accelerated from 1MB/day
• Correlating multiple signals simultaneously (MFT size + USN configuration + OS version + growth rate + workload type) is difficult to express as simple if/then rules
• Raw fsutil output contains additional fields (repair stream size, $LogFile allocation, VSS overhead) that the script does not yet parse but which carry diagnostic value
An AI model,
given the full 19-day history and context about the December 2025 incident, can
reason across all of these dimensions simultaneously. It can identify patterns
that deviate from each server's own baseline, distinguish genuine risk
escalation from normal variation, and explain its reasoning in plain English
that management can act on.
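One of those baseline ideas, growth that speeds up relative to a server's own history, can also be approximated without a model. The Python sketch below compares the average daily growth in the first and second halves of a series; the function names and the half-and-half split are illustrative choices, not how the dashboard or the AI analysis works internally.

```python
def daily_deltas(sizes_mb):
    """Return day-over-day changes for a list of daily MFT sizes."""
    return [later - earlier for earlier, later in zip(sizes_mb, sizes_mb[1:])]


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def acceleration_ratio(sizes_mb):
    """Late-period average growth divided by early-period average growth.

    Returns None when there is too little data or no early growth to
    compare against. A ratio well above 1.0 suggests growth is speeding up.
    """
    deltas = daily_deltas(sizes_mb)
    if len(deltas) < 2:
        return None
    half = len(deltas) // 2
    early, late = mean(deltas[:half]), mean(deltas[half:])
    if early <= 0:
        return None
    return late / early


print(acceleration_ratio([100, 101, 102, 103, 104]))  # 1.0
print(acceleration_ratio([100, 101, 102, 105, 108]))  # 3.0
```

Both example series end at a similar size, but only the second would stand out, which is the distinction a fixed size threshold cannot make.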
More practically:
our management team and some client stakeholders are not infrastructure
engineers. They need the system to tell them in clear language what is wrong,
how serious it is, and what to do about it — not hand them a CSV and expect
them to interpret it.
Dashboard Architecture
The dashboard is
generated daily by a PowerShell script (Generate-FSHealthReport.ps1) that runs
automatically after the scan. It reads all accumulated History CSVs, computes
risk across all server-drive pairs, and outputs a single self-contained HTML
file with all data embedded.
The file requires
no web server, no database, no dependencies beyond a browser. It can be opened
directly from a network share, emailed as an attachment, or served from any
file server. Everything — the data, the charts, the AI integration — is in one
file.
Key Features
• Fleet overview bar showing HIGH / MEDIUM / OK / Total counts at a glance
• Filterable server list sorted by risk level, with inline MFT size and OS indicators
• 19-day MFT trend chart that updates instantly when a different server is selected
• USN parameter cards flagging servers that match the December 2025 incident profile
• Copy for Copilot button that generates a complete, server-specific AI analysis prompt
• Direct API mode for organisations with Anthropic API access
The Copy for Copilot Feature
This is the
feature designed specifically for our environment, where Microsoft Copilot is
available to staff through standard Microsoft 365 licensing but direct API
access is limited.
When a user
selects any server-drive pair and clicks Copy Prompt, the dashboard generates a
complete analysis prompt containing:
1. The full incident background: what happened in December 2025, how the USN journal works, what the failure mode looks like
2. The selected server's current state: MFT size, growth total, average daily rate, USN max, USN delta, OS version, risk flags
3. The complete 19-day historical record for that specific drive, every data point
4. Fleet context: which other servers are HIGH or MEDIUM risk, total drives monitored
5. Structured analysis requests: what the AI should assess, compare, and recommend
The user copies
that prompt, opens Microsoft Copilot in their browser, pastes it, and receives
a detailed technical analysis with prioritised remediation steps — without ever
needing API credentials or a developer to help them.
Critically, the
prompt is generated fresh for whichever server is currently selected. Every
server in the fleet gets its own contextual analysis. It is not a generic
template — it contains that specific server's actual data.
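The prompt assembly itself is simple string building. In the dashboard it is done in JavaScript; the Python sketch below only illustrates the shape of the idea, and every field name, the wording of the background text, and the sample server are hypothetical.

```python
def build_prompt(drive, history, fleet):
    """Assemble a plain-text analysis prompt for one server-drive pair."""
    lines = [
        "Background: in December 2025 a Windows Server 2016 FTP server stopped",
        "accepting writes on D: while free disk space was still available.",
        "The cause was traced to NTFS metadata (MFT and USN journal) limits.",
        "",
        f"Selected drive: {drive['server']} {drive['letter']} ({drive['os']})",
        f"Risk flag: {drive['risk']}",
        f"USN max: {drive['usn_max_mb']} MB, USN delta: {drive['usn_delta_mb']} MB",
        "",
        "MFT history (date, size in MB):",
    ]
    # One line per data point so the model sees the full record.
    lines += [f"  {day}: {size_mb}" for day, size_mb in history]
    lines += [
        "",
        f"Fleet context: {fleet['total_drives']} drives monitored, "
        f"HIGH: {', '.join(fleet['high']) or 'none'}, "
        f"MEDIUM: {', '.join(fleet['medium']) or 'none'}",
        "",
        "Please assess the growth trend, compare it with the incident profile,",
        "and recommend prioritised remediation steps.",
    ]
    return "\n".join(lines)


# Hypothetical sample data, not taken from the real fleet.
drive = {"server": "EXAMPLE-FTP01", "letter": "D:", "os": "WS2016",
         "risk": "HIGH", "usn_max_mb": 4096, "usn_delta_mb": 1024}
history = [("20260210", 4600), ("20260211", 4602)]
fleet = {"total_drives": 36, "high": ["EXAMPLE-FTP01 D:"], "medium": []}
prompt = build_prompt(drive, history, fleet)
print(prompt)
```

Because the prompt is rebuilt from the selected drive's data each time, nothing in it needs to be edited by hand before pasting.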
What AI Analysis Adds Over Static Scripts
We asked Copilot to analyse the two HIGH-risk servers using the 19-day data. These observations went beyond what the PowerShell thresholds could detect:
• AUSSDC1FTP001 D: has an MFT of 23GB, suggesting tens of millions of files have passed through this server over its lifetime. Even if the USN journal is resized, the MFT size indicates this server is operating at a fundamentally different scale than the others and may need architectural review, not just a parameter change.
• AUSMFTCOLD01 E: is growing at 8.25MB/day on average, but the rate is not constant: it accelerated in the second week of the monitoring period. A static threshold would flag it the same way regardless of acceleration. The AI identified the rate change as a more urgent signal than the absolute value.
• The USN resize events detected on Feb 24 (AUSSDC1STGFTP20 and AUSMFTCOLD01) coincide, suggesting a remediation run was performed that day. The script treated them as independent events and had no way to correlate them. The AI noted this as a pattern suggesting a deliberate change rather than an anomaly.
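The same-day correlation in the last observation is something a script could also surface once events are grouped by date. A minimal Python sketch, assuming resize events are available as (date, server) pairs; the function name and the third sample event are hypothetical.

```python
from collections import defaultdict


def shared_event_dates(events):
    """Return {date: [servers]} for dates where more than one server changed."""
    by_date = defaultdict(set)
    for date, server in events:
        by_date[date].add(server)
    return {
        date: sorted(servers)
        for date, servers in by_date.items()
        if len(servers) > 1
    }


events = [
    ("Feb 24", "AUSSDC1STGFTP20"),
    ("Feb 24", "AUSMFTCOLD01"),
    ("Feb 18", "EXAMPLE-SRV01"),  # hypothetical unrelated event
]
print(shared_event_dates(events))
# {'Feb 24': ['AUSMFTCOLD01', 'AUSSDC1STGFTP20']}
```

A date shared by several servers is more likely a planned change window than a set of independent anomalies, which is the inference the AI drew.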
Email Integration
The report generator also sends a daily HTML email to the infrastructure team and relevant stakeholders. The email contains:
• A summary bar with HIGH / MEDIUM / OK / Total drive counts
• A full table of every HIGH risk drive with server name, drive letter, OS, MFT size, growth, and the specific risk flag
• A full table of every MEDIUM risk drive with the same fields
• An all-clear message when no risks are detected
• The full interactive HTML report as an attachment
The subject line adapts to urgency — ACTION REQUIRED when HIGH risks are present, MONITOR when only MEDIUM risks exist, and All Systems OK when everything is clean. This means recipients can triage from their inbox without opening anything.
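The subject selection comes down to a priority order over the risk counts. A short Python sketch of that rule follows; the function name is illustrative, and the real report generator implements this in PowerShell.

```python
def subject_status(high_count, medium_count):
    """Pick the subject-line status from the number of flagged drives."""
    if high_count > 0:
        return "ACTION REQUIRED"
    if medium_count > 0:
        return "MONITOR"
    return "All Systems OK"


print(subject_status(2, 4))  # ACTION REQUIRED
print(subject_status(0, 4))  # MONITOR
print(subject_status(0, 0))  # All Systems OK
```

HIGH always wins over MEDIUM, so a single urgent drive is never hidden behind a longer list of lower-priority ones.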
SMTP
configuration is inherited directly from the existing scan script, so there is
no new configuration to maintain. The same relay server, same from address,
same recipient list.
Remediation Playbook
For HIGH Risk Servers (Win2016 + USN Max >= 4GB)
These servers
match the exact profile that caused the December 2025 outage. Treat them as
urgent regardless of whether they are currently experiencing issues. The
journal filling up is a when, not an if, on high-churn servers.
Step 1: Assess current journal usage
fsutil usn queryjournal D:
Compare Next Usn
against Max Size. If Next Usn is within 20% of Max Size, treat as critical.
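The 20% rule can be checked mechanically once both values are converted from hexadecimal. This Python sketch follows the comparison exactly as described in this playbook (Next Usn measured against Max Size); the function names and the sample values are illustrative.

```python
def journal_headroom(next_usn, max_size):
    """Fraction of Max Size still remaining above Next Usn (0.0 to 1.0)."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return max(0.0, (max_size - next_usn) / max_size)


def is_critical(next_usn, max_size, margin=0.20):
    """True when Next Usn is within `margin` of Max Size."""
    return journal_headroom(next_usn, max_size) <= margin


max_size = 0x100000000  # 4096 MB
print(journal_headroom(0xD0000000, max_size))  # 0.1875
print(is_critical(0xD0000000, max_size))       # True
print(is_critical(0x80000000, max_size))       # False
```

The margin is a parameter so that busier servers can be given an earlier warning than the default 20%.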
Step 2: Resize the journal
# Delete the existing journal (brief impact on VSS/backup tools)
fsutil usn deletejournal /D D:
# Recreate with larger values
# m = max size in bytes (8GB = 8589934592)
# a = allocation delta in bytes (512MB = 536870912)
fsutil usn createjournal m=8589934592 a=536870912 D:
Step 3: Verify
fsutil usn queryjournal D:
Confirm Max Size
now reflects the new value. Monitor Next Usn over the following week to ensure
the growth rate is sustainable within the new ceiling.
Planning Note: Deleting the USN journal is a brief operation but causes backup software, antivirus, and search indexing to lose their change-tracking checkpoint. They will typically perform a full rescan on next run. Coordinate with backup teams before executing on production servers.
For MEDIUM Risk Servers (Growing MFT)
These servers are not in immediate danger but are showing a trajectory worth investigating. Recommended actions:
1. Identify what is generating the file churn by reading recent change records with fsutil usn readjournal and reviewing application logs
2. Check whether the workload has changed recently: new processes, new integrations, increased throughput
3. If MFT growth is legitimate and expected, document the baseline so future growth can be compared against it
4. Consider scheduling a maintenance window to run chkdsk /f to verify metadata consistency if fragmentation is significant (chkdsk repairs errors; it does not shrink or defragment the MFT)
5. Plan OS upgrade to Windows Server 2022 where possible, since the dynamic USN journal eliminates this risk class entirely
The Longer-Term Fix: OS Upgrade
The most durable
solution is to eliminate the static USN journal entirely by upgrading affected
servers from Windows Server 2016 to Windows Server 2022. On 2022, the USN
journal is dynamic — it expands automatically as needed and never causes the
write-lock failure mode described in this post.
This is not a
quick fix, but it is the right architectural decision for any server running
high file throughput workloads. The December 2025 incident and the two
additional HIGH-risk servers we found during the monitoring audit are strong
justification for accelerating the upgrade timeline on the affected 2016
servers.
Lessons Learned
Technical Lessons
• Assumption broken: Disk free space is not the same as filesystem health.
NTFS can run out of internal metadata space while gigabytes of raw storage remain available. Standard monitoring that only tracks free space percentage will miss this failure mode entirely.
• Tool gap: fsutil is criminally underused.
The information needed to detect and prevent this outage has always been available in Windows. fsutil usn queryjournal exposes the journal state in real time. The problem was that no one was routinely collecting and trending this data.
• OS version matters: Windows 2016 has known USN journal limitations that are fixed in 2022.
This is documented by Microsoft but not prominently enough that most teams are aware of it. Any Windows 2016 server running high file-throughput workloads should be treated as a USN journal risk until upgraded or explicitly monitored.
• Monitoring design: Static thresholds are blind to context.
A rule that fires when MFT exceeds 5GB is meaningless without knowing whether that is the server's normal state or a sudden change. Trend monitoring, which tracks the rate of change over time, is far more valuable for this class of problem.
Operational Lessons
• Post-incident monitoring should be automated, not periodic.
After any significant outage, the instinct is to do a one-time audit and close the ticket. What we built instead was a permanent, automated daily check that grows smarter as it accumulates more historical data. The audit never ends; it just runs quietly in the background.
• Infrastructure visibility tools should speak the language of the audience.
The PowerShell script is for engineers. The dashboard is for anyone who needs to understand the risk and make decisions. AI-assisted analysis bridges that gap: it translates raw infrastructure metrics into plain language risk assessments that a non-technical stakeholder can act on.
• Correlation across time is more valuable than point-in-time snapshots.
The 19-day history revealed patterns (acceleration, step-changes, co-occurring events on the same date) that a single day's reading would never show. Building time-series collection from the start was the right decision.
What We Built: Summary
Components
| Component | Technology | Purpose |
| FSHealth Scanner | PowerShell + VMware PowerCLI | Daily collection of NTFS metadata from all Windows VMs via vCenter |
| History Archive | Daily CSV files | Accumulating time-series of MFT and USN data per server-drive |
| Predictive Engine | PowerShell (built into scanner) | Compares yesterday vs today, flags USN and MFT risk conditions |
| HTML Dashboard | Standalone HTML/JS/CSS | Interactive fleet overview with trend charts, server selection, AI prompt generation |
| Copy for Copilot | JavaScript prompt builder | Generates server-specific AI analysis prompts for use in Microsoft Copilot or Claude |
| Daily Email Report | PowerShell Send-MailMessage | HTML email with risk tables sent to team, with full dashboard as attachment |
Daily Workflow
1. Veeam scheduler triggers the FSHealth scan script each morning
2. Scanner connects to vCenter, collects fsutil data from all 14 servers via PowerCLI
3. Data saved to History\YYYYMMDD.csv and FSHealth-Predictive.csv
4. Report generator reads all history CSVs, computes trends, builds HTML dashboard
5. Dashboard saved to Reports\FSHealth-Report.html (updated in place, same URL daily)
6. Email sent to team with risk summary tables and HTML report as attachment
7. Team opens email, reviews high-risk servers, opens dashboard for detail
8. For any server requiring deeper analysis: select it in the dashboard, click Copy Prompt, paste into Microsoft Copilot
9. Copilot returns a detailed technical analysis with remediation recommendations
10. Engineer executes remediation, validates fix, adds notes to ticket
Closing Thoughts
The December 2025
outage was caused by something that most infrastructure engineers have never
encountered — and that is precisely what makes it dangerous. NTFS metadata
exhaustion is a failure mode that looks like a disk space problem, behaves like
a disk space problem, but cannot be solved by adding disk space. The underlying
cause is invisible to every standard monitoring tool.
The solution we
built is not particularly complex. The PowerShell collection script is a few
hundred lines. The report generator is straightforward. What makes the system
effective is the combination of daily automated collection, trend analysis over
time, and AI-assisted interpretation that surfaces risk in language that both
engineers and management can act on.
If you are
running Windows Server 2016 with high file-throughput workloads — file transfer
servers, FTP services, large-scale file processing pipelines — I would strongly
encourage you to run the following command today:
fsutil usn queryjournal D:
Compare the Next
Usn value against Max Size. If they are close, you are closer to a production
outage than you realise. Do not wait for the write lock to find out.