imdb-rename
===========
A command line tool to rename media files based on titles from IMDb.
imdb-rename downloads the official IMDb data set and creates a local index to
use for fast fuzzy searching.

[![Linux build status](https://api.travis-ci.org/BurntSushi/imdb-rename.svg)](https://travis-ci.org/BurntSushi/imdb-rename)
[![Windows build status](https://ci.appveyor.com/api/projects/status/github/BurntSushi/imdb-rename?svg=true)](https://ci.appveyor.com/project/BurntSushi/imdb-rename)
[![](http://meritbadge.herokuapp.com/imdb-rename)](https://crates.io/crates/imdb-rename)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).


### Installation

**[Archives of precompiled binaries for imdb-rename are available for Windows,
macOS and Linux.](https://github.com/BurntSushi/imdb-rename/releases)**

Otherwise, users are expected to compile imdb-rename from source:

```
$ git clone https://github.com/BurntSushi/imdb-rename
$ cd imdb-rename
$ cargo build --release
$ ./target/release/imdb-rename --help
```

Alternatively, if you have
[Cargo installed](https://rustup.rs),
then you can install imdb-rename directly from
[crates.io](https://crates.io):

```
$ cargo install imdb-rename
```

imdb-rename's minimum supported Rust version is **1.28.0**.

#### Arch Linux

An AUR package is available: [imdb-rename](https://aur.archlinux.org/packages/imdb-rename/).

### Quick example

Ever since Season 1 of The Simpsons came out on DVD, I've been collecting them
and ripping them onto my hard drive. My process is somewhat manual, but I
wind up with a directory that looks like this:

```
S18E01.mkv  S18E05.mkv  S18E09.mkv  S18E13.mkv  S18E17.mkv  S18E21.mkv
S18E02.mkv  S18E06.mkv  S18E10.mkv  S18E14.mkv  S18E18.mkv  S18E22.mkv
S18E03.mkv  S18E07.mkv  S18E11.mkv  S18E15.mkv  S18E19.mkv
S18E04.mkv  S18E08.mkv  S18E12.mkv  S18E16.mkv  S18E20.mkv
```

It would be much nicer if these files had their proper episode titles.
imdb-rename can rename these files automatically using episode titles from
IMDb:

```
$ imdb-rename -q 'the simpsons {show}' *.mkv
```

This command ran a query with the `-q` flag to identify the TV show, provided
the files to rename, and... presto!

```
S18E01 - The Mook, the Chef, the Wife and Her Homer.mkv
S18E02 - Jazzy & The Pussycats.mkv
S18E03 - Please Homer, Don't Hammer 'Em.mkv
S18E04 - Treehouse of Horror XVII.mkv
S18E05 - G.I. (Annoyed Grunt).mkv
S18E06 - Moe'N'a Lisa.mkv
S18E07 - Ice Cream of Margie: With the Light Blue Hair.mkv
S18E08 - The Haw-Hawed Couple.mkv
S18E09 - Kill Gil, Vol. 1 & 2.mkv
S18E10 - The Wife Aquatic.mkv
S18E11 - Revenge Is a Dish Best Served Three Times.mkv
S18E12 - Little Big Girl.mkv
S18E13 - Springfield Up.mkv
S18E14 - Yokel Chords.mkv
S18E15 - Rome-old and Juli-eh.mkv
S18E16 - Homerazzi.mkv
S18E17 - Marge Gamer.mkv
S18E18 - The Boys of Bummer.mkv
S18E19 - Crook and Ladder.mkv
S18E20 - Stop or My Dog Will Shoot.mkv
S18E21 - 24 Minutes.mkv
S18E22 - You Kent Always Say What You Want.mkv
```
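
To make the renaming step concrete, here is a minimal sketch of how a season/episode pair could be read from a name like `S18E13.mkv` and combined with a title. This is not imdb-rename's actual parser: the function names `parse_season_episode` and `renamed` are hypothetical, and the sketch assumes the marker sits at the very start of the file name.

```rust
/// Parse a leading `S<season>E<episode>` marker (case-insensitive).
/// Hypothetical helper for illustration only.
fn parse_season_episode(file_name: &str) -> Option<(u32, u32)> {
    let upper = file_name.to_ascii_uppercase();
    // The name must start with 'S', followed by digits, then 'E'.
    let rest = upper.strip_prefix('S')?;
    let e_pos = rest.find('E')?;
    let season: u32 = rest[..e_pos].parse().ok()?;
    // The episode number is the run of ASCII digits right after 'E'.
    let after_e = &rest[e_pos + 1..];
    let digits_end = after_e
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(after_e.len());
    let episode: u32 = after_e[..digits_end].parse().ok()?;
    Some((season, episode))
}

/// Build a new file name in the `S18E13 - Title.ext` style shown above.
/// Returns `None` if the marker or the extension is missing.
fn renamed(file_name: &str, title: &str) -> Option<String> {
    let (season, episode) = parse_season_episode(file_name)?;
    let (_, ext) = file_name.rsplit_once('.')?;
    Some(format!("S{:02}E{:02} - {}.{}", season, episode, title, ext))
}
```

Calling `renamed("S18E13.mkv", "Springfield Up")` yields the same shape of name as in the listing above. In the real tool the title comes from the IMDb index rather than from the caller.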


### Fancier example

imdb-rename isn't limited to just renaming TV episodes based on season/episode
numbers. It can also perform a fuzzy match based on the contents of the
file name. For example, given this file:

```
Thor.Ragnarok.2017.1080p.WEB-DL.DD5.1.H264-FGT.mkv
```

We can "clean it up" and rename it to a nice title like so:

```
$ imdb-rename Thor.Ragnarok.2017.1080p.WEB-DL.DD5.1.H264-FGT.mkv
```

which gives us:

```
Thor: Ragnarok (2017).mkv
```
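
As a rough illustration of what "cleaning up" a release-style name involves, the sketch below guesses a title query and a year from the file name before any index lookup happens. It is an assumption about one plausible approach, not a description of imdb-rename's internals. The function name `guess_title_and_year` is hypothetical, and the heuristic (treat the last four-digit token starting with `19` or `20` as the year) is deliberately simplistic.

```rust
/// Guess a (title query, year) pair from a release-style file name.
/// Hypothetical helper for illustration only.
fn guess_title_and_year(file_name: &str) -> Option<(String, u32)> {
    // Release names usually separate words with dots, underscores or spaces.
    let tokens: Vec<&str> = file_name
        .split(|c: char| c == '.' || c == '_' || c == ' ')
        .collect();
    // Take the last token that looks like a year, so that a title which
    // itself starts with a number is less likely to be mistaken for one.
    let year_pos = tokens.iter().rposition(|t| {
        t.len() == 4
            && t.chars().all(|c| c.is_ascii_digit())
            && (t.starts_with("19") || t.starts_with("20"))
    })?;
    if year_pos == 0 {
        return None; // Nothing before the year to use as a title.
    }
    let year: u32 = tokens[year_pos].parse().ok()?;
    Some((tokens[..year_pos].join(" "), year))
}
```

The resulting query (here, the words before the year) would then be handed to the fuzzy search, which is what turns an approximate string into the canonical title `Thor: Ragnarok`.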


### Freeform searching

We can also use imdb-rename to search IMDb, which is the default behavior
when `-q/--query` is provided without any file names:

```
$ imdb-rename -q 'homey loves flanders'
#     score  id         kind       title                   year  tv
1     1.000  tt0773646  tvEpisode  Homer Loves Flanders    1994  S05E16 The Simpsons
2     0.646  tt2101691  tvEpisode  Tiny Loves Flowers      N/A   S02E08 Dinosaur Train
3     0.568  tt3203408  tvEpisode  Courtney Loves Love     2014  S01E05 Courtney Loves Dallas
4     0.561  tt1722576  short      In Flanders Fields      2010
5     0.561  tt2253780  tvSeries   In Vlaamse Velden       2014
6     0.555  tt4528474  video      My Lovely Homeland      2011
7     0.551  tt0220646  tvMovie    Moll Flanders           1975
[... results truncated ...]
```

Notice that our query had a typo in it. imdb-rename does its best to find the
most relevant results. It is also fast. Even though the above query searches
through all 6 million names in IMDb, it runs in under 100ms. This is thanks to
using an inverted index memory mapped from disk.


### How does it work?

imdb-rename works by downloading
[approved datasets from IMDb](https://www.imdb.com/interfaces/),
and creating an inverted index based on ngrams extracted
from the names in IMDb's data. The inverted index provides a
quick way to search and rank results using techniques from
[information retrieval](https://nlp.stanford.edu/IR-book/)
such as
[Okapi-BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
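
The idea can be sketched in a few dozen lines. The toy below builds an in-memory inverted index from character ngrams and ranks names with the standard Okapi BM25 formula. It is only an illustration under stated assumptions: the names `ngrams`, `Index`, `build` and `search` are hypothetical, the parameters `k1 = 1.2` and `b = 0.75` are common textbook defaults, and imdb-rename's real index format, tokenization and scoring options differ.

```rust
use std::collections::HashMap;

/// Split text into overlapping, lowercased character ngrams of length `n`.
fn ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// Toy inverted index: ngram -> postings of (document id, term frequency).
struct Index {
    postings: HashMap<String, Vec<(usize, f64)>>,
    doc_lens: Vec<f64>, // number of ngrams in each document
    n: usize,           // ngram size used at build time
}

impl Index {
    fn build(names: &[&str], n: usize) -> Index {
        let mut postings: HashMap<String, Vec<(usize, f64)>> = HashMap::new();
        let mut doc_lens = Vec::with_capacity(names.len());
        for (id, name) in names.iter().enumerate() {
            let grams = ngrams(name, n);
            doc_lens.push(grams.len() as f64);
            // Count how often each ngram occurs in this name.
            let mut tf: HashMap<String, f64> = HashMap::new();
            for g in grams {
                *tf.entry(g).or_insert(0.0) += 1.0;
            }
            for (g, f) in tf {
                postings.entry(g).or_default().push((id, f));
            }
        }
        Index { postings, doc_lens, n }
    }

    /// Rank documents by summed BM25 score over the query's ngrams, best first.
    fn search(&self, query: &str) -> Vec<(usize, f64)> {
        let (k1, b) = (1.2, 0.75);
        let num_docs = self.doc_lens.len() as f64;
        let avg_len = self.doc_lens.iter().sum::<f64>() / num_docs.max(1.0);
        let mut scores: HashMap<usize, f64> = HashMap::new();
        for g in ngrams(query, self.n) {
            if let Some(list) = self.postings.get(&g) {
                let df = list.len() as f64;
                // Rare ngrams get a higher inverse document frequency.
                let idf = ((num_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
                for &(id, tf) in list {
                    // Length normalization: longer names are penalized slightly.
                    let norm = 1.0 - b + b * self.doc_lens[id] / avg_len;
                    *scores.entry(id).or_insert(0.0) +=
                        idf * tf * (k1 + 1.0) / (tf + k1 * norm);
                }
            }
        }
        let mut ranked: Vec<(usize, f64)> = scores.into_iter().collect();
        ranked.sort_by(|x, y| y.1.total_cmp(&x.1).then(x.0.cmp(&y.0)));
        ranked
    }
}
```

Because matching happens on ngrams rather than whole words, a misspelled query such as `homey loves flanders` still shares most of its trigrams with `Homer Loves Flanders`, which is why that name ranks first in a small index built from a handful of similar titles.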


### Motivation

My motivation for building this tool is somewhat idiosyncratic, but three-fold:

1. I find it very convenient to have a tool to rename media files
   automatically. imdb-rename is my third iteration on this tool. The first was
   an unpublished hodge podge of Python scripts and a MySQL database. The
   second was a
   [Go program with a PostgreSQL database](https://github.com/BurntSushi/goim).
   The Go program served me well, but IMDb retired their old data format, which
   required me to build a new tool to adapt.
2. I've been working on a low-level information retrieval library off-and-on
   for a couple years, and initially built this tool on top of that library as
   a form of dogfooding. It didn't work out as well as I'd hoped, so I scrapped
   the generic library and built out a specific solution tailored to IMDb. I'm
   no longer dogfooding directly, but I've established a useful baseline.
3. I want more people to learn about information retrieval, and I believe this
   tool can serve to teach others. In particular, imdb-rename is a complete
   end-to-end information retrieval system that is fast, solves a real problem,
   is only a few thousand lines of code and comes with a built-in
   evaluation that is easy to run.

This tool is perhaps a bit over-engineered, but I had fun with it. Believe it
or not, parts of imdb-rename are intentionally simple at the cost of both query
speed and size on disk!


### Evaluation

It is possible to run an evaluation to compare the various parameters available
for searching. The evaluation system is available as a separate tool called
imdb-eval, which is included in this repository. To use it, we must first build
it:

```
$ git clone https://github.com/BurntSushi/imdb-rename
$ cd imdb-rename
$ cargo build --release --all
$ ./target/release/imdb-eval --help
```

Running an evaluation is simple. We can run an evaluation on all combinations
of scorer and similarity function, along with ngram sizes of 3 and 4, like so
(this uses the truth data that is built into the `imdb-eval` binary):

```
$ ./target/release/imdb-eval --ngram-size 3 --ngram-size 4 | tee eval.csv
```

This will output the results of running a search on every item in the truth
data. The results include the rank of the expected answer. The results can be
summarized into a single score called the
[Mean Reciprocal Rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)
(which is itself a specific instance of MAP, or mean average precision)
with the `--summarize` flag like so:

```
$ ./target/release/imdb-eval --summarize eval.csv
```
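
For readers unfamiliar with the metric, Mean Reciprocal Rank averages `1 / rank` of the expected answer over all queries. The sketch below shows the arithmetic only. The function name `mean_reciprocal_rank` is hypothetical, and scoring a missing answer as `0` is an assumption here, not a statement about how `imdb-eval` handles that case.

```rust
/// Mean Reciprocal Rank over a set of queries.
///
/// Each entry is the 1-based rank at which the expected answer appeared,
/// or `None` if it did not appear in the results at all.
fn mean_reciprocal_rank(ranks: &[Option<usize>]) -> f64 {
    if ranks.is_empty() {
        return 0.0;
    }
    let total: f64 = ranks
        .iter()
        .map(|r| match r {
            // A rank of 1 contributes 1.0, a rank of 2 contributes 0.5, etc.
            Some(rank) if *rank > 0 => 1.0 / *rank as f64,
            // Assumption: missing (or invalid) ranks contribute nothing.
            _ => 0.0,
        })
        .sum();
    total / ranks.len() as f64
}
```

For example, ranks of 1, 2, "not found" and 4 give `(1 + 0.5 + 0 + 0.25) / 4 = 0.4375`. A score of 1.0 means the expected answer was always the top result.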

If you have [xsv](https://github.com/BurntSushi/xsv) installed, then the
results can be easily sorted and formatted:

```
$ ./target/release/imdb-eval --summarize eval.csv | xsv sort -R -s mrr | xsv table
```

If you want to tweak the truth data, then you might consider starting with the
bundled truth data (assuming you're at the root of the imdb-rename repository):

```
$ $EDITOR data/eval/truth.toml
$ ./target/release/imdb-eval --ngram-size 3 --ngram-size 4 --truth data/eval/truth.toml
```


### What does this tool not do?

imdb-rename is a tool for renaming media files, and to the extent that searching
IMDb facilitates renaming files, it is also a search tool. There is no
intent to develop this further to explore all IMDb data, such as cast/crew
information.

Folks interested in building a different type of IMDb tool may be interested
in the [`imdb-index`](https://docs.rs/imdb-index) crate, which provides
programmatic access to the index created by imdb-rename.


### IMDb licensing

The data used by imdb-rename is retrieved from
[IMDb datasets](https://www.imdb.com/interfaces/).
In particular, imdb-rename will never scrape imdb.com, and only uses the data
provided by IMDb in the `tsv` files.

Additionally, imdb-rename must only be used for non-commercial and personal
uses.