A few weeks ago I found a .doc file on an old backup drive. Created in 1994. Password protected. An old document I had written myself — and I had absolutely no idea what the password was anymore.

Here’s how I got it back.

Disclaimer: These techniques are for recovering your own files only. Applying them to files you do not own or have no permission to access is illegal. The author accepts no responsibility for misuse.


The file wouldn’t even open

Modern Word threw an error. Not “wrong password” — just a generic “Word experienced an error trying to open the file.” LibreOffice did nothing. The file was 145 KB, so it wasn’t empty. Something was in there.

First thing: check what the file actually is. A .doc extension means nothing.

with open('DB.DOC', 'rb') as f:
    print(f.read(4).hex())

Output: d0cf11e0, the first four bytes of the OLE compound document signature (d0cf11e0a1b11ae1). So it’s a genuine OLE container. Then list the internal streams:

import olefile
ole = olefile.OleFileIO('DB.DOC')
for entry in ole.listdir():
    print(entry)

Only three streams: \x01CompObj, \x05SummaryInformation, WordDocument. No 0Table or 1Table — those only exist in Word 97+. This was Word 6.0, from 1993. The metadata confirmed it:

m = ole.get_metadata()
print(m.creating_application)  # b'Microsoft Word 6.0'

Step 1: check entropy before doing anything else

Before committing to an attack strategy, compute the Shannon entropy of the stream body. This tells you immediately what you’re dealing with:

  • ~4–5 bits/byte → plain text
  • ~6–7.5 bits/byte → XOR obfuscation (Word 6.0 style)
  • ~7.9–8.0 bits/byte → real encryption (AES, RC4)

import math
from collections import Counter

def byte_entropy(data):
    counts = Counter(data)
    total = len(data)
    return -sum((c/total) * math.log2(c/total) for c in counts.values())

If you see 7.9+, you need a completely different approach. Word 6.0 XOR lands around 6.5–7.2, which is exactly what this file showed.
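To make those bands concrete, here is a minimal sketch that turns the entropy figure into a rough verdict. The cut-offs at 6.0 and 7.9 simply mirror the ranges listed above, the classify helper is my own naming, and anything that falls in the gaps between bands is lumped into the nearest lower band. Treat it as a heuristic, not a detector.

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())

def classify(data: bytes) -> str:
    """Map entropy onto the rough bands above (thresholds are assumptions)."""
    h = byte_entropy(data)
    if h >= 7.9:
        return "real encryption"
    if h >= 6.0:
        return "XOR obfuscation"
    return "plain text"

# Two synthetic extremes: repetitive English text and random bytes.
sample_text = b"the quick brown fox jumps over the lazy dog " * 200
sample_random = os.urandom(1 << 16)
print(f"{byte_entropy(sample_text):.2f}", classify(sample_text))
print(f"{byte_entropy(sample_random):.2f}", classify(sample_random))
```

On a real file you would pass in the body of the WordDocument stream instead of the synthetic samples.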


Step 2: find the XOR key length

Word 6.0 doesn’t use real encryption. It derives a fixed-length key from your password and XORs the entire document body with it, repeating. The key length is a property of Word’s key derivation, not of the password — a 3-character password and a 15-character password both produce the same key length.

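To find that length, use a coincidence-counting test (a close cousin of Kasiski examination): for each candidate length L, count how many positions satisfy ciphertext[i] == ciphertext[i+L]. At the true key length, both bytes were XORed with the same key byte, so they match whenever the underlying plaintext bytes match, and the count spikes.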

Key length 16: 14919 matches  <-- clear winner
Key length  4:  9013 matches
Key length  7:  8621 matches

16 had nearly twice the matches of anything else. That’s the derived key length — not the password length. The actual password could be anywhere from 1 to 15 characters.
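A minimal sketch of that coincidence count, run against synthetic data rather than my actual file. The function names, the 0.8 tolerance, the word list, and the demo key are all assumptions for illustration. Multiples of the true length (32, 48, ...) spike as well, so the sketch picks the smallest shift that comes close to the best rate.

```python
import random

def coincidence_rates(ciphertext, max_len=40):
    """For each shift L, the fraction of positions where ciphertext[i] == ciphertext[i + L]."""
    rates = {}
    for shift in range(1, max_len + 1):
        pairs = len(ciphertext) - shift
        if pairs <= 0:
            break
        matches = sum(a == b for a, b in zip(ciphertext, ciphertext[shift:]))
        rates[shift] = matches / pairs
    return rates

def guess_key_length(ciphertext, max_len=40, tolerance=0.8):
    """Smallest shift whose rate is within `tolerance` of the best rate."""
    rates = coincidence_rates(ciphertext, max_len)
    if not rates:
        raise ValueError("ciphertext too short")
    best = max(rates.values())
    return min(shift for shift, rate in rates.items() if rate >= tolerance * best)

# Synthetic demo: random short words, XORed with a made-up 16-byte key.
rng = random.Random(0)
words = ["the", "and", "of", "to", "in", "is", "it", "was",
         "for", "on", "with", "file", "word", "key"]
demo_plain = " ".join(rng.choice(words) for _ in range(3000)).encode()
demo_key = bytes((i * 37 + 11) % 256 for i in range(16))
demo_cipher = bytes(b ^ demo_key[i % 16] for i, b in enumerate(demo_plain))

print(guess_key_length(demo_cipher))
```

Comparing rates instead of raw counts keeps long shifts from being penalised for having slightly fewer positions to compare.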


Step 3: recover the key bytes

For each of the 16 key positions, collect all ciphertext bytes at that column. In Dutch (or English) prose, the most common plaintext character is a space (0x20). So the most common byte in each column, XORed with 0x20, gives the key byte at that position:

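from collections import Counter

# ciphertext: the obfuscated body of the WordDocument stream, as bytes
key = []
for i in range(16):
    # every 16th byte, starting at i, was XORed with the same key byte
    column = ciphertext[i::16]
    most_common = Counter(column).most_common(1)[0][0]
    key.append(most_common ^ 0x20)  # assumes the most common plaintext byte is a space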

Then decrypt:

decrypted = bytes(ciphertext[i] ^ key[i % 16] for i in range(len(ciphertext)))

The first readable chunk came through immediately — actual Dutch prose, fully legible.

Thirty years. Done in an afternoon.


Other attack angles worth knowing

Crib-dragging is faster than pure frequency analysis when you know something about the content. Slide a known word — “de”, “het”, a month name, a person’s name — across the ciphertext. At each offset, XOR the crib against the ciphertext to produce candidate key bytes. Accumulate votes across many cribs and the key emerges with fewer assumptions. It also acts as a check: if both methods agree on a key byte, you can be confident in it.
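Here is a rough sketch of crib-dragging with voting, again on synthetic data. One addition of mine that the description above does not spell out: a guess only gets to vote if every key byte it implies decrypts its whole column to mostly letters and spaces, which filters out the noise from offsets where the crib does not actually occur. The TEXT_BYTES alphabet, the 0.9 ratio, the cribs, and the demo key are assumptions.

```python
import random
from collections import Counter

# Bytes counted as "text-like" when sanity-checking a candidate key byte.
TEXT_BYTES = frozenset(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ")

def crib_drag(ciphertext, cribs, key_len=16, min_text_ratio=0.9):
    """Slide each crib over the ciphertext and vote on the implied key bytes."""
    columns = [ciphertext[p::key_len] for p in range(key_len)]
    cache = {}

    def plausible(pos, key_byte):
        # Does this key byte turn its column into mostly letters and spaces?
        if (pos, key_byte) not in cache:
            column = columns[pos]
            hits = sum((c ^ key_byte) in TEXT_BYTES for c in column)
            cache[(pos, key_byte)] = hits >= min_text_ratio * len(column)
        return cache[(pos, key_byte)]

    votes = [Counter() for _ in range(key_len)]
    for crib in cribs:
        for offset in range(len(ciphertext) - len(crib) + 1):
            guess = [((offset + j) % key_len, ciphertext[offset + j] ^ crib[j])
                     for j in range(len(crib))]
            if all(plausible(pos, kb) for pos, kb in guess):
                for pos, kb in guess:
                    votes[pos][kb] += 1
    # None marks key positions that no crib ever covered.
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Synthetic demo: random short words, XORed with a made-up 16-byte key.
rng = random.Random(0)
words = ["the", "and", "of", "to", "in", "is", "it", "was",
         "for", "on", "with", "file", "word", "key"]
crib_plain = " ".join(rng.choice(words) for _ in range(3000)).encode()
crib_key = bytes((i * 37 + 11) % 256 for i in range(16))
crib_cipher = bytes(b ^ crib_key[i % 16] for i, b in enumerate(crib_plain))

recovered = crib_drag(crib_cipher, [b" the ", b" with "])
print(recovered == list(crib_key))
```

On a real document the text-likeness threshold would need to be looser, because formatting bytes are mixed in with the prose. Including the surrounding spaces in a crib makes it more distinctive than the bare word.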

Targeting text-bearing sections also helps. Word documents mix formatted text with style tables, font tables, and metadata. Frequency analysis works best when applied to the text body only. If results are noisy, scan the stream at different offsets to find where the readable content actually starts — 0x400 is the standard but not always correct.
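The offset scan can be sketched like this: for each candidate start offset, run the column frequency analysis on everything from that offset onward, decrypt only the first window, and accept the first offset where that window looks like text. The helper names, the 256-byte window, the 0.9 threshold, and the synthetic stream (random junk followed by XORed text) are assumptions for illustration.

```python
import random
from collections import Counter

# Printable ASCII plus tab, LF, CR count as "readable" here.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def recover_key(body, key_len=16):
    """Column-wise frequency analysis, assuming space is the most common plaintext byte."""
    return bytes(Counter(body[i::key_len]).most_common(1)[0][0] ^ 0x20
                 for i in range(key_len))

def text_ratio(data):
    """Fraction of bytes that look like readable text."""
    return sum(b in TEXT_BYTES for b in data) / len(data) if data else 0.0

def find_text_start(stream, offsets, key_len=16, window=256, threshold=0.9):
    """Return the first offset whose decrypted opening window looks readable, else None."""
    for offset in offsets:
        body = stream[offset:]
        if len(body) < key_len:
            continue
        key = recover_key(body, key_len)
        head = bytes(b ^ key[i % key_len] for i, b in enumerate(body[:window]))
        if text_ratio(head) >= threshold:
            return offset
    return None

# Synthetic stream: 0x300 bytes of junk "header", then XOR-obfuscated text.
rng = random.Random(1)
words = ["the", "and", "of", "to", "in", "is", "it", "was",
         "for", "on", "with", "file", "word", "key"]
scan_plain = " ".join(rng.choice(words) for _ in range(3000)).encode()
scan_key = bytes((i * 37 + 11) % 256 for i in range(16))
scan_cipher = bytes(b ^ scan_key[i % 16] for i, b in enumerate(scan_plain))
demo_stream = bytes(rng.randrange(256) for _ in range(0x300)) + scan_cipher

print(hex(find_text_start(demo_stream, range(0, 0x800, 0x100))))
```

Scoring only the opening window matters: a key recovered from the whole remainder is usually right even when the scan starts too early, so scoring the full buffer would hide the junk at the front.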


Can you recover the actual password?

Sort of, but not really. What frequency analysis gives you is the derived encryption key — the 16 bytes Word uses to XOR the content. That’s not the password itself. Word 6.0 ran the password through a one-way key derivation step first.

You could brute-force it: generate candidate passwords, run each through Word 6.0’s key derivation algorithm, check if the result matches your known key. That’s faster than trying to open the file each time, but it’s still brute force — and for longer passwords completely impractical on a CPU. A tool like hashcat on a GPU can test billions of candidates per second, but Word 6.0 isn’t a supported hash mode, so formatting the hash correctly for hashcat requires extra work.
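The shape of that search is simple enough to sketch. Big caveat: toy_derive below is a placeholder (a truncated SHA-256) and is not Word 6.0’s real key derivation. You would have to swap in a faithful implementation of that algorithm for this to do anything useful against a real key. The function names and the tiny alphabet are assumptions too.

```python
import hashlib
import itertools

def toy_derive(password: bytes) -> bytes:
    """Placeholder only, NOT Word 6.0's derivation: maps a password to 16 bytes."""
    return hashlib.sha256(password).digest()[:16]

def search_space(alphabet_size: int, max_len: int) -> int:
    """Number of candidates of length 1..max_len."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))

def brute_force(target_key, derive, alphabet, max_len):
    """Try every password up to max_len; return the first whose derived key matches."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=length):
            candidate = bytes(combo)
            if derive(candidate) == target_key:
                return candidate
    return None

# Toy run: three-letter alphabet, passwords up to three characters.
found = brute_force(toy_derive(b"cab"), toy_derive, b"abc", 3)
print(found, search_space(26, 8))
```

The second number printed is the candidate count for lowercase passwords up to eight characters, which gives a feel for why longer passwords get out of hand on a CPU.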

For most purposes — recovering your own content — you don’t need the password at all. You need the key, and that’s what the analysis gives you.


Does this work for other Word versions?

It depends on the version.

Word 6.0 / Word 95 (1993–1995): Fully vulnerable. XOR with a derived key, no real cryptography. The method above works.

Word 97 / 2000 (1997–1999): Slightly stronger — RC4 with a 40-bit key. Still breakable. office2john.py extracts the hash, hashcat with -m 9700 cracks it. Minutes to hours depending on password complexity. One nuance: some corporate installs enabled stronger CSP settings, which changes the picture considerably.

Word 2002 / 2003 (XP era): RC4 with SHA-1 key derivation. Harder, but still crackable. Hashcat mode -m 9800.

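Word 2007 and later (.docx): AES-128 or AES-256. Properly encrypted. Brute force is the only option, and with a strong password it’s effectively impossible. Note that a password-protected .docx is not a plain zip: the zip package is wrapped in an encrypted OLE container, so zip2john won’t work on it. Use office2john.py to extract the hash, then hashcat with -m 9400, 9500, or 9600 for the Office 2007, 2010, and 2013 formats.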

Older is weaker. Anything before Word 2007 has real structural weaknesses. The jump at Office 2007 was massive — it’s essentially a different world.
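A quick way to tell which era you are dealing with is to look at the stream names, the same listing used at the start. This is a rough sketch: guess_word_era is my own helper, and it takes the list that olefile’s listdir() returns. It relies on the table streams only existing from Word 97 on, and on a password-protected .docx being stored as an OLE container with an EncryptedPackage stream. It identifies the format era, not whether a Word 97-2003 file is actually encrypted.

```python
def guess_word_era(stream_paths):
    """Guess the Word generation from OLE stream paths (lists of name components)."""
    names = {path[-1] for path in stream_paths if path}
    if "EncryptedPackage" in names:
        return "Word 2007+ (encrypted .docx inside an OLE container)"
    if "WordDocument" in names:
        if "0Table" in names or "1Table" in names:
            return "Word 97-2003"
        return "Word 6.0/95"
    return "unknown"

# The three streams found in the file from this post.
print(guess_word_era([["\x01CompObj"], ["\x05SummaryInformation"], ["WordDocument"]]))
```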


One thing that confused me

Zeroing out the password hash bytes at offset 0x2C in the WordDocument stream removes the password prompt but doesn’t decrypt the content. Word 6.0 uses the password to derive the XOR key — no password, no key, no decryption. The file opens but looks like garbage.

Zeroing the hash is enough for write-protected documents that aren’t actually encrypted. For encrypted content you need the key regardless. Old Office formats separated authentication from encryption badly — and that separation is exactly what makes them exploitable.


The script

Available on my GitHub. Edit four lines at the top and run it — no command line needed. It supports seven modes:

Mode          What it does
xor           Frequency analysis: recovers the XOR key directly (default)
crib          Known-plaintext attack using guessed words
brute         Password guessing with prefixes and suffix combinations
really_brute  Exhaustive search over the full character space, length by length
raw           Blind string extraction, no key needed
entropy       Measure stream entropy to choose the right attack
text_offset   Scan offsets to find where readable content starts

DOC_FILE = r"C:/path/to/your/file.doc"
OUT_FILE = r"C:/path/to/output.txt"
MODE = "xor"
XOR_KEY_LENGTH_OVERRIDE = None  # or set to 16 to skip auto-detection

The content was all there, intact. Thirty years is a long time to keep a secret from yourself — turns out you just need a frequency table to get it back.