πŸ“¦ simonw / csv-diff

Python CLI tool and library for diffing CSV and JSON files

β˜… 328 stars β‘‚ 52 forks πŸ‘ 328 watching βš–οΈ Apache License 2.0
clickcsvcsv-diffdatasette-iodatasette-tooldiffgit-scrapingtsv-diff
πŸ“₯ Clone https://github.com/simonw/csv-diff.git
HTTPS git clone https://github.com/simonw/csv-diff.git
SSH git clone git@github.com:simonw/csv-diff.git
CLI gh repo clone simonw/csv-diff
Simon Willison Simon Willison Bump Docker to Python 3.12, refs #11 26903b7 1 years ago πŸ“ History
πŸ“‚ main View all commits β†’
πŸ“ .github
πŸ“ csv_diff
πŸ“ tests
πŸ“„ .gitignore
πŸ“„ Dockerfile
πŸ“„ LICENSE
πŸ“„ README.md
πŸ“„ setup.py
πŸ“„ README.md

csv-diff

PyPI Changelog Tests License

Tool for viewing the difference between two CSV, TSV or JSON files. See Generating a commit log for San Francisco’s official list of trees (and the sf-tree-history repo commit log) for background information on this project.

Installation

pip install csv-diff

Usage

Consider two CSV files:

one.csv

id,name,age 1,Cleo,4 2,Pancakes,2

two.csv

id,name,age 1,Cleo,5 3,Bailey,1

csv-diff can show a human-readable summary of differences between the files:

$ csv-diff one.csv two.csv --key=id 1 row changed, 1 row added, 1 row removed

1 row changed

Row 1 age: "4" => "5"

1 row added

id: 3 name: Bailey age: 1

1 row removed

id: 2 name: Pancakes age: 2

The --key=id option means that the id column should be treated as the unique key, to identify which records have changed.

The tool will automatically detect if your files are comma- or tab-separated. You can over-ride this automatic detection and force the tool to use a specific format using --format=tsv or --format=csv.

You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use --format=json if your input files are JSON.

Use --show-unchanged to include full details of the unchanged values for rows with at least one change in the diff output:

% csv-diff one.csv two.csv --key=id --show-unchanged 1 row changed

id: 1 age: "4" => "5"

Unchanged: name: "Cleo"

JSON output

You can use the --json option to get a machine-readable difference:

$ csv-diff one.csv two.csv --key=id --json { "added": [ { "id": "3", "name": "Bailey", "age": "1" } ], "removed": [ { "id": "2", "name": "Pancakes", "age": "2" } ], "changed": [ { "key": "1", "changes": { "age": [ "4", "5" ] } } ], "columns_added": [], "columns_removed": [] }

Adding templated extras

You can specify additional keys to be displayed in the human-readable format using the --extra option:

--extra name "Python format string with {id} for variables"

For example, to output a link to https://news.ycombinator.com/latest?id={id} for each item with an ID, you could use this:

csv-diff one.csv two.csv --key=id \
  --extra latest "https://news.ycombinator.com/latest?id={id}"
These extras display something like this:
1 row changed

  id: 41459472
    points: "24" => "25"
    numComments: "5" => "6"
  extras:
    latest: https://news.ycombinator.com/latest?id=41459472

As a Python library

You can also import the Python library into your own code like so:

from csvdiff import loadcsv, compare diff = compare( load_csv(open("one.csv"), key="id"), load_csv(open("two.csv"), key="id") )

diff will now contain the same data structure as the output in the --json example above.

If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows.

As a Docker container

Build the image

$ docker build -t csvdiff .

Run the container

$ docker run --rm -v $(pwd):/files csvdiff

Suppose current directory contains two csv files : one.csv two.csv

$ docker run --rm -v $(pwd):/files csvdiff one.csv two.csv

Alternatives

  • csvdiff is a "fast diff tool for comparing CSV files" - you may get better results from this than from csv-diff against larger files.