Part of our ai automation guide series

ai-automation

Building a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek

Praveen 4 min read
yellow and black handle hammer and screw driver
Photo by Julie Molliver on Unsplash

How I Built a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek

The short answer is: I used DeepSeek to create a Python CLI tool that cleans messy Excel files in under 5 seconds per file. The AI wrote 90% of the code, but I had to fix critical gaps in file handling and error recovery. Here’s the battle-tested process.

The Problem: My Spreadsheet Nightmare

Every Monday, I get 12-15 Excel files from field teams filled with:

  • Merged cells that break pandas
  • Inconsistent date formats (MM/DD/YYYY vs DD-MM-YY)
  • Empty header rows
  • “N/A”, “NULL”, and ”---” all representing missing data

Manual cleaning took 3 hours weekly. I needed a tool where I could just run:

cleanxls --input messy_file.xlsx --output clean_file.csv

The AI Attempt: First Prompt to DeepSeek

I started with this exact prompt:

"""
Prompt: Write a Python CLI tool using Click that:
1. Takes an Excel file as input
2. Handles merged cells by unmerging and filling values downward
3. Standardizes dates to YYYY-MM-DD
4. Replaces all null markers (N/A, NULL, ---) with empty strings
5. Outputs a clean CSV
Include error handling for corrupt files.
"""

DeepSeek output a 127-line script using:

  • pandas for data manipulation
  • click for CLI interface
  • openpyxl to handle merged cells

First red flag: It used openpyxl’s merged_cells property, which only works on .xlsx files, not .xls.

Where It Broke: 3 Critical Failures

  1. File Format Blindness: Crashed on older .xls files with:

    AttributeError: 'Workbook' object has no attribute 'merged_cells'
  2. Date Disaster: Tried to force-convert everything resembling a date, turning “Order #2024-1001” into a datetime object.

  3. Silent Corruption: When encountering password-protected files, it created empty CSVs without warning.

What I Had to Fix

1. The File Format War
I modified the prompt to specify:

"""
Add support for both .xls (use xlrd) and .xlsx (openpyxl). 
First detect file extension, then use appropriate engine.
"""

2. Date Detection Logic
Added a regex check before date conversion:

if re.match(r'^\d{1,4}[-\/]\d{1,2}[-\/]\d{1,4}$', str(cell_value)):  
    # Attempt date conversion

3. Error Handling Overhaul
Added explicit checks for:

  • Password-protected files
  • Binary corruption (using magic numbers)
  • Permission issues

The Working Result

Final command structure:

cleanxls --input messy.* --output_dir ./clean \
         --date-format "%Y-%m-%d" \
         --null-markers "NULL,N/A,---"

Sample output:

Processing 15 files...
✔ Success: 12 files (avg 4.2s/file)
✖ Failed: 3 files (password protected)
Clean files saved to ./clean/

Key features added:

  • Batch processing with wildcards (messy.*)
  • Configurable null markers
  • Progress bar using tqdm

What I Learned

  1. AI Misses Edge Cases
    DeepSeek didn’t consider that .xls and .xlsx require different libraries. Always specify file formats.

  2. Prompt Iteration Beats Debugging
    Instead of fixing the code, I refined the prompt:

    """
    Modify to handle both .xls and .xlsx by:
    1. Checking file extension
    2. Using xlrd for .xls
    3. Using openpyxl for .xlsx
    """

    This got me 80% there vs. manual coding.

  3. Validation is Non-Negotiable
    The AI-generated script would silently overwrite files. I added:

    if os.path.exists(output_path):
        raise click.Confirm("Overwrite?")

The Exact Prompt

Here’s the final prompt that worked:

"""
Write a Python CLI tool using Click that:
1. Accepts multiple input files via wildcard (e.g., 'data/*.xls*')
2. Detects file extension to use correct engine (xlrd for .xls, openpyxl for .xlsx)
3. Unmerges cells by filling values downward
4. Converts only valid dates to YYYY-MM-DD (skip strings with letters)
5. Replaces configurable null markers (default: NULL,N/A,---)
6. Validates output directory exists
7. Shows progress bar with tqdm
8. Skips password-protected files with warning
Output clean CSVs with '_clean' suffix.
"""

FAQ

1. Can this handle 100MB+ Excel files?
Yes, but add --chunksize 10000 to process in chunks. The initial AI code loaded everything into memory.

2. How do I add custom cleaning rules?
Extend the clean_column() function. The AI structured this well for modifications.

3. Why Click instead of argparse?
DeepSeek defaulted to Click for better CLI UX. I kept it for the auto-help generation.

4. Can this run on schedule?
Yes! I combined it with a cron job to auto-process files in our SFTP drop folder.

5. What about Google Sheets?
That’s my next experiment-using gspread with API auth.

What spreadsheet headache would you automate with this approach? Drop a comment with your worst data cleaning war story.

P

Praveen

Technology enthusiast helping people work smarter with practical guides and AI workflows.

Explore more: Browse all ai automation guides or check related articles below.