;doc:import: rewrite; date skipping -> overlap detection

2024-11-29 15:45:19 -08:00 · 2024-11-29 15:45:19 -08:00 · 3d0aec7e8b
commit 3d0aec7e8b
parent a38c09be98
1 changed files with 79 additions and 62 deletions
--- a/hledger/Hledger/Cli/Commands/Import.md
+++ b/hledger/Hledger/Cli/Commands/Import.md
@@ -11,99 +11,116 @@ Flags:
 This command detects new transactions in each FILE argument since it was last run, 
 and appends them to the main journal.
-Or with `--dry-run`, it just print the transactions that would be added.
+Or with `--dry-run`, it just prints a preview of the new transactions that would be added.
 Or with `--catchup`, it just marks all of the FILEs' current transactions as already imported.
 This is one of the few hledger commands that writes to the journal file (see also `add`).
-It only appends; existing data will not be changed.
+It only appends to the journal; existing entries will not be changed.
-The input files are specified as arguments, so to import one or more
+The data files are specified as arguments, so to import one or more
-CSV files to your main journal, you will run `hledger import bank.csv`
+CSV files to your main journal, you will run 
-or perhaps `hledger import *.csv`.
+`hledger import bank1.csv ...` or perhaps `hledger import *.csv`.
 Note you can import from any input file format, eg journal files;
 but CSV/SSV/TSV files are the most common import source.
-Note you can import from any file format, though CSV files are the
+The import destination is the main journal file,
-most common import source, and these docs focus on that case.
+which can be specified in the usual way with `$LEDGER_FILE` or `-f/--file`.
-The target file (main journal) should be in journal format.
+It should be in journal format.
-### Date skipping
+### Overlap detection
-`import` tries to import only the transactions which are new since the last import, ignoring any that it has seen in previous runs.
+You could convert and append new bank transactions without `import`, by doing `hledger -f bank.csv print >>$LEDGER_FILE`.
-So if your bank's CSV includes the last three months of data, you can download and `import` it every month (or week, or day) 
+But the `import` command has a useful feature: it tries to avoid re-importing transactions it has already seen on previous runs. 
-and only the new transactions will be imported each time.
+This means you don't have to worry about overlapping data in successive downloads of your bank CSV.
 Just download and import it as often as you like, and only the new transactions will be imported each time.
-It works as follows: for each imported `FILE`,
+We don't call this "deduplication", because it's generally not possible to reliably detect duplicates in bank CSV.
 Instead, `import` remembers the latest date processed from each CSV file (saving it in a hidden file).
 This is a simple mechanism that works well for most real-world CSV, where:
- It tries to read the latest date previously seen, from `.latest.FILE` in the same directory
+1. the data file name is stable (does not change) across imports
- Then it processes `FILE`, ignoring transactions on or before that date
+2. the item dates are stable across imports
 3. the order of same-date items is stable across imports
 4. the newest items have the newest dates
-And after a successful import, unless `--dry-run` was used, it updates the `.latest.FILE`(s) for next time.
+If the downloaded file name does change, you could use the rules file
-This is a simple system that works for most real-world CSV files;
+(with a `source` glob rule) as the import source instead.
-it assumes the following are true, or true enough:
+Also if there is occasional instability in item dates/order, it is usually harmless.
 (You can reduce the chance of disruption by downloading and importing more often.)
-1. the name of the input file is stable across successive downloads
+If overlap detection does go wrong, it's not too hard to recover from:
 2. new items always have the newest dates
 3. item dates are stable across downloads
 4. the order of same-date items is stable across downloads.
-Tips:
+- You'll notice it when you try to reconcile your hledger balances with your bank.
 - `hledger print FILE.csv` will show all recently downloaded transactions.
  Compare these with your journal and copy/paste if needed.
 - You can manually update or remove the `.latest.FILE`, or use `--catchup`.
 - You can use `--dry-run` to preview what will be imported.
 - Download and import more often, eg twice a week, at least while you are learning.
  It's easier to review and troubleshoot when there are fewer transactions.
- To help ensure a stable file name, remember you can use a CSV rules file as an input file.
+Here's how it works in detail:
- If you have a bank whose CSV dates or ordering occasionally change,
+For each `FILE` being imported with `hledger import FILE ...`,
  you can reduce the chance of this happening in new transactions by importing more often.
  (If it happens in old transactions, that's harmless.)
-Note this is just one kind of "deduplication": not reprocessing the same dates across successive runs.
+1. hledger reads a `.latest.FILE` file in the same directory, if any.
-`import` doesn't detect other kinds of duplication, such as 
+  This file contains the latest record date previously imported from FILE, in YYYY-MM-DD format.
-the same transaction appearing multiple times within a single run,
+  If multiple records with that date were imported, the date is repeated on N lines.
 or a new transaction that looks identical to a transaction already in the journal.
 (Because these can happen legitimately in real-world data.)
-Here's a situation where you need to run `import` with care:
+2. hledger reads records from FILE.
-say you download but forget to import `bank.1.csv`, and a week later you download `bank.2.csv` with some overlapping data.
+  If a latest date was found in step 1, it skips the records before and on that date
-You should not process both of these as a single import (`hledger import bank.1.csv bank.2.csv`),
+  (or the first N records on that date).
 because the overlapping transactions would not be deduplicated.
 Instead, import one file at a time, using the same filename each time:
-```cli
+3. After a successful import of all FILE arguments, without error and without `--dry-run`,
-$ mv bank.1.csv bank.csv; hledger import bank.csv
+   hledger saves the new latest dates in each FILE's `.latest.FILE` for next time.
 $ mv bank.2.csv bank.csv; hledger import bank.csv
 ```
-Normally you don't need to think about `.latest.*` files, 
+<!--
-but you can create or modify them to catch up to a certain date,
+Related: 
-or delete them to mark all transactions as new.
+[CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing)
-Their format is a single ISO-format `YYYY-MM-DD` date, optionally repeated on multiple lines,
+-->
 meaning "I have seen the transactions before this date, and this many of them on this date".
 [`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is less often used.
 Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
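
The three steps described in the new text can be sketched in Python. This is an illustrative model only, not hledger's actual (Haskell) implementation: the function names `read_latest`, `new_records` and `write_latest` are hypothetical, records are assumed to be `(date, description)` tuples sorted oldest first, and the new state is assumed to be computed over every record in the file.

```python
from datetime import date

def read_latest(text):
    """Parse .latest.FILE contents: a YYYY-MM-DD date repeated on N lines.

    Returns (latest_date, N), or (None, 0) if the file is empty or missing.
    """
    dates = [date.fromisoformat(line.strip())
             for line in text.splitlines() if line.strip()]
    if not dates:
        return None, 0
    latest = max(dates)
    return latest, dates.count(latest)

def new_records(records, latest, count):
    """Skip records before `latest`, and the first `count` records on it."""
    if latest is None:
        return list(records)
    kept, seen = [], 0
    for rec in records:
        if rec[0] < latest:
            continue                      # older than anything imported
        if rec[0] == latest and seen < count:
            seen += 1                     # already imported on the latest date
            continue
        kept.append(rec)
    return kept

def write_latest(records):
    """Build new .latest contents (assumption: computed over the whole file)."""
    if not records:
        return ""
    latest = max(d for d, _ in records)
    n = sum(1 for d, _ in records if d == latest)
    return f"{latest.isoformat()}\n" * n

records = [
    (date(2024, 11, 1), "rent"),
    (date(2024, 11, 2), "coffee"),
    (date(2024, 11, 2), "coffee"),
    (date(2024, 11, 3), "books"),
]
latest, count = read_latest("2024-11-02\n")   # one record seen on 11-02
fresh = new_records(records, latest, count)   # second coffee, then books
```

Note that the second "coffee" record is kept even though it is identical to the first: the state file only counts how many records on the latest date were seen, it does not compare their contents.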
-### Import testing
+### Import preview
 With `--dry-run`, the transactions that will be imported are printed
-to the terminal, without updating your journal or state files.
+to standard output as a preview, without updating your journal or .latest files.
-The output is valid journal format, like the print command, so you can re-parse it.
+
-Eg, to see any importable transactions which CSV rules have not categorised:
+The output is valid journal format, like the print command, so hledger can re-parse it.
 So you could check for new transactions not yet categorised by your CSV rules, like so:
 ```cli
-$ hledger import --dry bank.csv | hledger -f- -I print unknown
+$ hledger import --dry-run bank.csv | hledger -f- -I print unknown
 ```
-or (live updating):
+And you could watch this while you update your rules file, eg like so:
 ```cli
-$ ls bank.csv* | entr bash -c 'echo ====; hledger import --dry bank.csv | hledger -f- -I print unknown'
+$ watchexec -- 'hledger import --dry-run data.csv | hledger -f- -I print unknown'
 ```
-Note: when importing from multiple files at once, it's currently possible for
+There is another command which does the same kind of overlap detection: [`hledger print --new`](#print).
-some .latest files to be updated successfully, while the actual import fails
+But generally `import` or `import --dry-run` are used instead.
-because of a problem in one of the files, leaving them out of sync (and causing
+
-some transactions to be missed).
+### Import special cases
-To prevent this, do a --dry-run first and fix any problems before the real import.
+
 As mentioned, general "deduplication" is not what `import` does.
 For example, here are two cases which will not be deduplicated
 (and normally should not be, since these can happen legitimately in financial data):
 - Two or more of the new CSV records are identical.
 - Or a new CSV record generates a journal entry identical to one already in the journal.
 Separately, here's a situation where you need to run `import` with care:
 say you download `bank.csv`, but forget to import it or delete it.
 And next month you download it again. This time your web browser may save it as `bank (2).csv`.
 So now each of these may have data not included in the other.
 You should import from each one in turn, in the correct order, taking care to use the same filename each time:
 ```cli
 $ hledger import bank.csv
 $ mv 'bank (2).csv' bank.csv
 $ hledger import bank.csv
 ```
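
A small Python model may help show why the same filename matters here. It is a simplified sketch, not hledger's code: `import_once` is a hypothetical helper, the per-file state is reduced to a single latest date (the same-date counting is ignored), and the sample records are invented for illustration.

```python
from datetime import date

def import_once(records, latest):
    """Return (new_records, new_latest), skipping records on or before `latest`."""
    fresh = [r for r in records if latest is None or r[0] > latest]
    seen_dates = [r[0] for r in records]
    if latest is not None:
        seen_dates.append(latest)
    return fresh, max(seen_dates, default=None)

# Two downloads with one overlapping transaction ("b").
first_download = [(date(2024, 10, 1), "a"), (date(2024, 10, 15), "b")]
second_download = [(date(2024, 10, 15), "b"), (date(2024, 11, 5), "c")]

# Same filename each time: both imports share one saved state.
state = None
imported_1, state = import_once(first_download, state)
imported_2, state = import_once(second_download, state)
shared = imported_1 + imported_2

# Different filenames: each file starts with its own empty state,
# so the overlapping transaction is imported twice.
separate = import_once(first_download, None)[0] + import_once(second_download, None)[0]
```

In this model `shared` contains "a", "b", "c" once each, while `separate` contains "b" twice, which is the duplication that renaming each download to `bank.csv` avoids.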
 ### Importing balance assignments