hledger/Import.md at 3d0aec7e8bcfe2fd62d3e43fccc8198f6d039396

Simon Michael 3d0aec7e8b ;doc:import: rewrite; date skipping -> overlap detection

2024-11-29 15:45:19 -08:00


import

Import new transactions from one or more data files to the main journal.

Flags:
     --catchup              just mark all transactions as already imported
     --dry-run              just show the transactions to be imported

This command detects new transactions in each FILE argument since it was last run, and appends them to the main journal.

Or with --dry-run, it just prints a preview of the new transactions that would be added.

Or with --catchup, it just marks all of the FILEs’ current transactions as already imported.

This is one of the few hledger commands that writes to the journal file (see also add). It only appends to the journal; existing entries will not be changed.

The data files are specified as arguments, so to import one or more CSV files into your main journal, you would run hledger import bank1.csv ... or perhaps hledger import *.csv. Note that you can import from any input file format, eg journal files, but CSV/SSV/TSV files are the most common import source.

The import destination is the main journal file, which can be specified in the usual way with $LEDGER_FILE or -f/--file. It should be in journal format.

Overlap detection

You could convert and append new bank transactions without import, by doing hledger -f bank.csv print >>$LEDGER_FILE. But the import command has a useful feature: it tries to avoid re-importing transactions it has already seen on previous runs. This means you don’t have to worry about overlapping data in successive downloads of your bank CSV. Just download and import it as often as you like, and only the new transactions will be imported each time.

We don’t call this “deduplication”, because it’s generally not possible to reliably detect duplicates in bank CSV. Instead, import remembers the latest date processed from each CSV file (saving it in a hidden file). This is a simple mechanism that works well for most real-world CSV, where:

- the data file name is stable (does not change) across imports
- the item dates are stable across imports
- the order of same-date items is stable across imports
- the newest items have the newest dates

If the downloaded file name does change, you could use the rules file (with a source glob rule) as the import source instead. Also if there is occasional instability in item dates/order, it is usually harmless. (You can reduce the chance of disruption by downloading and importing more often.)

If overlap detection does go wrong, it’s not too hard to recover from:

- You’ll notice it when you try to reconcile your hledger balances with your bank.
- hledger -f FILE.csv print will show all recently downloaded transactions. Compare these with your journal and copy/paste if needed.
- You can manually update or remove the .latest.FILE, or use --catchup.
- You can use --dry-run to preview what will be imported.
- Download and import more often, eg twice a week, at least while you are learning. It’s easier to review and troubleshoot when there are fewer transactions.
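
As an illustration of manually updating the hidden file, here is a small Python sketch. It assumes only what this page states: the file is named .latest.FILE, lives in the same directory as FILE, and holds one YYYY-MM-DD date repeated once per record already imported on that date. The helper names `latest_file_for` and `write_latest` are hypothetical, and the trailing newline is an assumption.

```python
from pathlib import Path

def latest_file_for(data_file):
    """Return the path of the hidden .latest file next to data_file."""
    p = Path(data_file)
    return p.with_name(".latest." + p.name)

def write_latest(data_file, iso_date, count=1):
    """Record that `count` records dated `iso_date` were already imported.

    Assumption: one date per line, each line newline-terminated.
    """
    path = latest_file_for(data_file)
    path.write_text((iso_date + "\n") * count)
    return path

# Hypothetical usage: pretend everything up to and including the single
# 2024-11-01 record in bank.csv has been imported.
# write_latest("bank.csv", "2024-11-01", count=1)
```

Deleting the file instead makes the next import treat every record in FILE as new, so preview with --dry-run first.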

Here’s how it works in detail:

For each FILE being imported with hledger import FILE ...,

1. hledger reads a .latest.FILE file in the same directory, if any. This file contains the latest record date previously imported from FILE, in YYYY-MM-DD format. If multiple records with that date were imported, the date is repeated on N lines.
2. hledger reads records from FILE. If a latest date was found in step 1, it skips all records dated before that date, plus the first N records dated on it (where N is the number of times the date appears in .latest.FILE).
3. After a successful import of all FILE arguments, without error and without --dry-run, hledger saves the new latest dates in each FILE’s .latest.FILE for next time.
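
The steps above can be sketched in Python roughly as follows. This is a simplified model, not hledger's actual implementation: `split_new_records` is a hypothetical name, records are reduced to (date, description) pairs taken in file order, and the rule for computing the new latest dates is an assumption based on the description above.

```python
from datetime import date

def split_new_records(records, latest_dates):
    """Return (new_records, new_latest_dates).

    records:      list of (date, description) tuples, in file order.
    latest_dates: dates read from .latest.FILE (one date repeated N times),
                  or an empty list if there is no such file.
    """
    latest = latest_dates[0] if latest_dates else None
    already_seen = len(latest_dates)  # N records on the latest date

    new_records = []
    for rec in records:
        rec_date = rec[0]
        if latest is not None:
            if rec_date < latest:
                continue              # before the latest date: skip
            if rec_date == latest and already_seen > 0:
                already_seen -= 1     # one of the first N on that date: skip
                continue
        new_records.append(rec)

    if not new_records:
        return new_records, latest_dates  # nothing new, keep the old state

    # Assumed rule: remember the newest date in the file, once per record
    # carrying that date.
    newest = max(rec[0] for rec in records)
    count = sum(1 for rec in records if rec[0] == newest)
    return new_records, [newest] * count

recs = [
    (date(2024, 11, 1), "rent"),
    (date(2024, 11, 2), "groceries"),
    (date(2024, 11, 2), "coffee"),
    (date(2024, 11, 3), "salary"),
]
# Last run saw one record dated 2024-11-02, so "coffee" and "salary" are new.
new, latest = split_new_records(recs, [date(2024, 11, 2)])
```

Note that only dates and counts are compared, never descriptions or amounts, which is why the stability assumptions listed earlier matter.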

Import preview

With --dry-run, the transactions that will be imported are printed to standard output as a preview, without updating your journal or .latest files.

The output is valid journal format, like the print command, so hledger can re-parse it. So you could check for new transactions not yet categorised by your CSV rules, like so:

$ hledger import --dry-run bank.csv | hledger -f- -I print unknown

And you could watch this while you update your rules file, eg like so:

$ watchexec -- 'hledger import --dry-run data.csv | hledger -f- -I print unknown'

There is another command which does the same kind of overlap detection: hledger print --new. But generally import or import --dry-run is used instead.

Import special cases

As mentioned, general “deduplication” is not what import does. For example, here are two cases which will not be deduplicated (and normally should not be, since these can happen legitimately in financial data):

- Two or more of the new CSV records are identical.
- Or a new CSV record generates a journal entry identical to one already in the journal.
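
A minimal Python sketch of the first case, assuming the date-based skipping described under Overlap detection. The function `new_since` and the sample records are hypothetical; dates are ISO strings, which compare correctly as text.

```python
def new_since(records, latest, seen_on_latest):
    """Keep the records not covered by the remembered latest date."""
    kept = []
    for rec in records:
        rec_date = rec[0]
        if rec_date < latest:
            continue                  # older than the remembered date
        if rec_date == latest and seen_on_latest > 0:
            seen_on_latest -= 1       # already imported on a previous run
            continue
        kept.append(rec)              # descriptions and amounts are never compared
    return kept

records = [
    ("2024-11-01", "rent", "-900.00"),
    ("2024-11-02", "coffee", "-3.50"),
    ("2024-11-02", "coffee", "-3.50"),  # a legitimate second purchase
]
imported = new_since(records, latest="2024-11-01", seen_on_latest=1)
print(imported)
```

Both coffee records are kept, because they are newer than the remembered date and nothing compares their contents.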

Separately, here’s a situation where you need to run import with care: say you download bank.csv, but forget both to import it and to delete it. Next month you download it again, and this time your web browser may save it as bank (2).csv. Now each of these files may have data not included in the other. You should import from each one in turn, in the correct order, taking care to use the same filename each time:

$ hledger import bank.csv
$ mv 'bank (2).csv' bank.csv
$ hledger import bank.csv

Importing balance assignments

Entries added by import will have their posting amounts made explicit (like hledger print -x). This means that any balance assignments in imported files must be evaluated, but imported files don’t get to see the main file’s account balances. As a result, importing entries with balance assignments (eg from an institution that provides only balances and not posting amounts) will probably generate incorrect posting amounts. To avoid this problem, use print instead of import:

$ hledger -f IMPORTFILE print [--new] >> $LEDGER_FILE

(If you think import should leave amounts implicit like print does, please test it and send a pull request.)
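
To make the arithmetic concrete, here is a small Python sketch of why the amounts come out wrong. It is a simplified model, not hledger code: a balance assignment is treated as "the amount needed to bring the account to the stated balance", `assignment_amount` is a hypothetical helper, and the figures are invented for illustration.

```python
from decimal import Decimal

def assignment_amount(asserted_balance, balance_before):
    """Amount a balance assignment must post to reach asserted_balance."""
    return asserted_balance - balance_before

# Invented figures: the main journal already has 900 in the account,
# and the imported entry says the balance is now 1000.
main_journal_balance = Decimal("900")
asserted = Decimal("1000")

# Evaluated with knowledge of the main journal: a posting of 100.
correct = assignment_amount(asserted, main_journal_balance)

# Evaluated inside the imported file alone, where the account starts
# from zero: a posting of 1000, which is then written out explicitly.
as_imported = assignment_amount(asserted, Decimal("0"))

print(correct, as_imported)
```

Appending with print keeps the assignment implicit, so it is evaluated later against the main journal’s real balances.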

Import and commodity styles

Amounts in entries added by import will be formatted according to the journal’s canonical commodity styles, as declared by commodity directives or inferred from the journal’s amounts.

Related: CSV > Amount decimal places.
