I had a folder with thousands of PDFs. Invoices next to yoga manuals next to scanned maps next to bank statements. Filenames like xkbf2291.pdf and 00183774.pdf. No structure whatsoever.
Manually sorting them wasn’t happening. So I built a Python script that reads each file, sends the content to Claude, and moves everything into topic folders automatically. It handles PDFs, Word docs, Excel files, plain text, images — basically whatever you throw at it.
The cost? About $0.50 for 600 files. Around $8 per 10,000 if you use the Batch API with Haiku (Claude’s cheapest model). That’s not a typo.
The script is on GitHub: https://github.com/rcsmit/python_scripts_rcsmit/blob/master/sort_files_by_topic.py
Here’s how it works.
What the script does
- Scans a folder recursively for supported files
- Extracts text from each one (first 2 pages for PDFs, first 50 rows for spreadsheets, OCR for images)
- Sends batches of 10 files to Claude for classification
- Moves each file into a subfolder named after its topic
- If the filename is meaningless (pure digits, UUID, random letters), Claude suggests a proper name — included in the same API call, no extra cost
You run it with one command:
python sort_files_by_topic.py
That’s it. No arguments needed.
Supported file types
- PDF via
pdfplumber - DOCX via
python-docx - DOC (legacy Word) via
antiword - XLSX / XLS via
openpyxl - TXT / CSV / RTF / HTML / HTM
- Images (JPG, PNG, TIFF, BMP, GIF, WEBP) via Tesseract OCR
Setup
1. Install Python dependencies
pip install pdfplumber anthropic tqdm python-docx openpyxl pillow pytesseract striprtf
2. Install system tools (optional but recommended)
For faster PDF extraction (5–10x faster than pdfplumber on large files):
# Mac
brew install poppler
# Ubuntu / Debian
sudo apt install poppler-utils
For legacy .doc files:
# Mac
brew install antiword
# Ubuntu / Debian
sudo apt install antiword
For image OCR:
# Mac
brew install tesseract
# Ubuntu / Debian
sudo apt install tesseract-ocr
3. Get an Anthropic API key
Go to console.anthropic.com, create an account, add a payment method, and generate a key under API Keys. Copy it immediately — you won’t see it again.
New accounts get ~$5 in free credits. For 10,000 files you need about $8, so add €10–15 to be safe.
Set the key as an environment variable:
# Mac / Linux
export ANTHROPIC_API_KEY="sk-ant-..."
# Windows Command Prompt
set ANTHROPIC_API_KEY=sk-ant-...
# PowerShell
$env:ANTHROPIC_API_KEY="sk-ant-..."
To make it permanent, add the export line to your ~/.bashrc or ~/.zshrc.
Configure the script
Open sort_files_by_topic.py and set two lines near the top:
INPUT_DIR: Path = Path(r"C:\path\to\your\files") # where your files live
OUTPUT_DIR: Path = Path(r"C:\path\to\sorted") # created automatically
The output directory is created if it doesn’t exist. All sorted files land there in topic subfolders.
You can also customize the topic list. The default covers 20 topics:
DEFAULT_TOPICS: list[str] = [
"Finance & Accounting",
"Legal & Contracts",
"Medical & Health",
"Technical & Engineering",
"Science & Research",
"Business & Management",
"Education & Training",
"Government & Policy",
"Human Resources",
"Marketing & Sales",
"Real Estate",
"Sheet music",
"Travel & Tourism",
"Art & Culture",
"Spiritual & Yoga",
"Service Quality",
"Bank accounts",
"Manuals",
"Tourism",
"Maps",
"Other",
]
Change these to whatever fits your collection.
Run it
python sort_files_by_topic.py
The script uses the Anthropic Batch API by default. This means it submits all the classification requests in one go, waits for Anthropic to process them (usually 15–60 minutes for large batches), then downloads the results and moves the files. The Batch API costs 50% less than real-time API calls — that’s where the $0.50 / 600 files number comes from.
If you want results immediately and don’t mind paying double:
python sort_files_by_topic.py --standard
Other useful flags:
python sort_files_by_topic.py --dry-run # preview without moving anything
python sort_files_by_topic.py --copy # copy instead of move (keep originals)
python sort_files_by_topic.py --verbose # see what's happening in detail
Resume support
The script writes a sort_progress.json file to the output directory. If it crashes or you stop it, re-running picks up exactly where it left off. Files already processed are skipped. Don’t delete this file unless you want to start over.
Smart renaming
Files with useless names get renamed automatically. The script detects:
- Pure digit filenames (
00183774.pdf) - UUIDs (
3f2504e0-4f89-11d3-9a0c-0305e82c3301.pdf) - Random character strings (
xkbf2291.pdf) - Names with fewer than 40% letters
For those, Claude suggests a descriptive name (max 30 characters) based on the file content. This happens inside the same API call as the classification — no extra tokens, no extra cost.
Cost breakdown
Based on real usage and current Haiku 4.5 pricing:
| Files | Batch API | Standard API |
|---|---|---|
| 600 | ~$0.50 | ~$1.00 |
| 1,000 | ~$0.80 | ~$1.60 |
| 5,000 | ~$4.00 | ~$8.00 |
| 10,000 | ~$8.00 | ~$16.00 |
The main variables are average file size and how much text gets extracted. Scanned image-only PDFs (no text layer) use almost no tokens. Dense text documents use more.
What doesn’t work well
OCR on low-quality scans is hit or miss. Tesseract handles clean scans fine but struggles with handwriting, skewed pages, or photos of documents.
Old .doc files only work if you have antiword installed. Without it, those files are skipped.
Files where no text can be extracted at all (blank pages, pure image PDFs without OCR) get classified as “Other” based on the filename alone.
The script is on GitHub: https://github.com/rcsmit/python_scripts_rcsmit/blob/master/sort_files_by_topic.py
