ai-automation
Building a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek
How I Built a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek
The short answer is: I used DeepSeek to create a Python CLI tool that cleans messy Excel files in under 5 seconds per file. The AI wrote 90% of the code, but I had to fix critical gaps in file handling and error recovery. Here’s the battle-tested process.
The Problem: My Spreadsheet Nightmare
Every Monday, I get 12-15 Excel files from field teams filled with:
- Merged cells that break pandas
- Inconsistent date formats (MM/DD/YYYY vs DD-MM-YY)
- Empty header rows
- “N/A”, “NULL”, and ”---” all representing missing data
Manual cleaning took 3 hours weekly. I needed a tool where I could just run:
cleanxls --input messy_file.xlsx --output clean_file.csv
The AI Attempt: First Prompt to DeepSeek
I started with this exact prompt:
"""
Prompt: Write a Python CLI tool using Click that:
1. Takes an Excel file as input
2. Handles merged cells by unmerging and filling values downward
3. Standardizes dates to YYYY-MM-DD
4. Replaces all null markers (N/A, NULL, ---) with empty strings
5. Outputs a clean CSV
Include error handling for corrupt files.
"""
DeepSeek output a 127-line script using:
pandasfor data manipulationclickfor CLI interfaceopenpyxlto handle merged cells
First red flag: It used openpyxl’s merged_cells property, which only works on .xlsx files, not .xls.
Where It Broke: 3 Critical Failures
-
File Format Blindness: Crashed on older
.xlsfiles with:AttributeError: 'Workbook' object has no attribute 'merged_cells' -
Date Disaster: Tried to force-convert everything resembling a date, turning “Order #2024-1001” into a datetime object.
-
Silent Corruption: When encountering password-protected files, it created empty CSVs without warning.
What I Had to Fix
1. The File Format War
I modified the prompt to specify:
"""
Add support for both .xls (use xlrd) and .xlsx (openpyxl).
First detect file extension, then use appropriate engine.
"""
2. Date Detection Logic
Added a regex check before date conversion:
if re.match(r'^\d{1,4}[-\/]\d{1,2}[-\/]\d{1,4}$', str(cell_value)):
# Attempt date conversion
3. Error Handling Overhaul
Added explicit checks for:
- Password-protected files
- Binary corruption (using magic numbers)
- Permission issues
The Working Result
Final command structure:
cleanxls --input messy.* --output_dir ./clean \
--date-format "%Y-%m-%d" \
--null-markers "NULL,N/A,---"
Sample output:
Processing 15 files...
✔ Success: 12 files (avg 4.2s/file)
✖ Failed: 3 files (password protected)
Clean files saved to ./clean/
Key features added:
- Batch processing with wildcards (
messy.*) - Configurable null markers
- Progress bar using
tqdm
What I Learned
-
AI Misses Edge Cases
DeepSeek didn’t consider that.xlsand.xlsxrequire different libraries. Always specify file formats. -
Prompt Iteration Beats Debugging
Instead of fixing the code, I refined the prompt:""" Modify to handle both .xls and .xlsx by: 1. Checking file extension 2. Using xlrd for .xls 3. Using openpyxl for .xlsx """This got me 80% there vs. manual coding.
-
Validation is Non-Negotiable
The AI-generated script would silently overwrite files. I added:if os.path.exists(output_path): raise click.Confirm("Overwrite?")
The Exact Prompt
Here’s the final prompt that worked:
"""
Write a Python CLI tool using Click that:
1. Accepts multiple input files via wildcard (e.g., 'data/*.xls*')
2. Detects file extension to use correct engine (xlrd for .xls, openpyxl for .xlsx)
3. Unmerges cells by filling values downward
4. Converts only valid dates to YYYY-MM-DD (skip strings with letters)
5. Replaces configurable null markers (default: NULL,N/A,---)
6. Validates output directory exists
7. Shows progress bar with tqdm
8. Skips password-protected files with warning
Output clean CSVs with '_clean' suffix.
"""
Related Guides
- Ga4 Data Delayed Or Missing Check Measurement Id Consent
- Ga4 Events Automatic Recommended Custom Tracking Guide
FAQ
1. Can this handle 100MB+ Excel files?
Yes, but add --chunksize 10000 to process in chunks. The initial AI code loaded everything into memory.
2. How do I add custom cleaning rules?
Extend the clean_column() function. The AI structured this well for modifications.
3. Why Click instead of argparse?
DeepSeek defaulted to Click for better CLI UX. I kept it for the auto-help generation.
4. Can this run on schedule?
Yes! I combined it with a cron job to auto-process files in our SFTP drop folder.
5. What about Google Sheets?
That’s my next experiment-using gspread with API auth.
What spreadsheet headache would you automate with this approach? Drop a comment with your worst data cleaning war story.
Praveen
Technology enthusiast helping people work smarter with practical guides and AI workflows.
Explore more: Browse all ai automation guides or check related articles below.