feat:csv: support data generating scripts and rewrite the rules reader.
parent c7878e88da
commit 97899f9a9b
@@ -79,7 +79,7 @@ import qualified Data.Text.IO as T
 import Data.Time ( Day, TimeZone, UTCTime, LocalTime, ZonedTime(ZonedTime),
   defaultTimeLocale, getCurrentTimeZone, localDay, parseTimeM, utcToLocalTime, localTimeToUTC, zonedTimeToUTC, utctDay)
 import Safe (atMay, headMay, lastMay, readMay)
-import System.Directory (createDirectoryIfMissing, doesFileExist, getHomeDirectory, getModificationTime, renameFile, removeFile)
+import System.Directory (createDirectoryIfMissing, doesFileExist, getHomeDirectory, getModificationTime, removeFile)
 -- import System.Directory (createDirectoryIfMissing, doesFileExist, getHomeDirectory, getModificationTime, listDirectory, renameFile, doesDirectoryExist)
 import System.Exit (ExitCode(..))
 import System.FilePath (stripExtension, takeBaseName, takeDirectory, takeExtension, takeFileName, (<.>), (</>))
@@ -118,7 +118,7 @@ getDownloadDir = do
   return $ home </> "Downloads" -- XXX

 -- | Read, parse and post-process a "Journal" from the given rules file, or give an error.
--- This particular reader also provides some extra features like data-cleaning and archiving.
+-- This particular reader also provides some extra features like data cleaning/generating commands and data archiving.
 --
 -- The provided input file handle, and the --rules option, are ignored by this reader.
 -- Instead, a data file (or data-generating command) is usually specified by the @source@ rule.
@@ -133,19 +133,18 @@ getDownloadDir = do
 --
 -- The source rule can specify a data-cleaning command, after a @|@ separator: @source foo*.csv | sed -e 's/USD/$/g'@.
 -- This command is executed by the user's default shell, receives the data file's content on stdin,
--- and should output CSV data suitable for hledger's conversion rules.
+-- and should output CSV data suitable for the conversion rules.
 -- A # character can be used to comment out the data-cleaning command: @source foo*.csv # | ...@.
 --
--- When using the source rule, if the archive rule is also present, some behaviours change:
+-- Or the source rule can specify just a data-generating command, with no file pattern: @source | foo-csv.sh@.
+-- In this case the command receives no input; it should output CSV data suitable for the conversion rules.
 --
--- - The import command:
--- will move the data file to an archive directory after a successful read
--- (renamed like the rules file, date-stamped, to an auto-created data/ directory next to the rules file).
--- And it will read the oldest data file, not the newest, if the glob pattern matches multiple files.
--- If there is a data-cleaning command, only the original uncleaned data is archived, currently.
---
--- - Other commands:
--- will read the newest archived data file, if any, as a fallback if the glob pattern matches no data files.
+-- If the archive rule is present:
+-- After successfully reading the data file or data command and converting to a journal, while doing a non-dry-run import:
+-- the data will be archived in an auto-created data/ directory next to the rules file,
+-- with a name based on the rules file and the data file's modification date and extension
+-- (or for a data-generating command, the current date and the ".csv" extension).
+-- And import will prefer the oldest file matched by a glob pattern (not the newest).
 --
 -- Balance assertions are not checked by this reader.
 --
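For illustration, the source-rule forms described in the comment above (with hypothetical file and script names) are: a file pattern with a data-cleaning command,

    source acme-checking*.csv | sed -e 's/USD/$/g'

or a data-generating command with no file pattern,

    source | acme-export.sh

plus, optionally, an archive rule so that import archives the data afterwards:

    archive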
@@ -153,93 +152,150 @@ parse :: InputOpts -> FilePath -> Handle -> ExceptT String IO Journal
 parse iopts rulesfile h = do
   lift $ hClose h -- We don't need it (XXX why ?)

-  -- XXX higher-than usual debug level for file reading to bypass excessive noise from elsewhere, normally 6 or 7
-  rules <- readRulesFile $ dbg4 "reading rules file" rulesfile
-
-  let
-    -- XXX How can we know when the command is import, and if it's a dry run ? In a hacky way, currently.
-    args = progArgs
-    -- XXX Difficult to identify the command name reliably here,
-    -- Cli.hs's moveFlagsAfterCommand would help but is not importable here.
-    -- Just look for import or imp appearing anywhere in the arguments.
-    cmdisimport = dbg7 "cmdisimport" $ any (`elem` args) ["import", "imp"]
-    dryrun = dbg7 "dryrun" $ any (`elem` args) ["--dry-run", "--dry"]
-    importing = dbg7 "importing" $ cmdisimport && not dryrun
-    archive = dbg7 "archive" $ isJust (getDirective "archive" rules)
-    archiving = dbg7 "archiving" $ importing && archive
-    rulesdir = dbg7 "rulesdir" $ takeDirectory rulesfile
-    archivedir = dbg7 "archivedir" $ rulesdir </> "data"
-
-  mdatafileandcmd <- liftIO $ do
-    dldir <- getDownloadDir -- look here for the data file if it's specified without a directory
-    let
-      msourcearg = getDirective "source" rules
-
-      -- Surrounding whitespace is removed from the whole source argument and from each part of it.
-      -- A # before | makes the rest of line a comment.
-      -- A # after | is left for the shell to interpret; it could be part of the command or the start of a comment.
-      stripspaces = T.strip
-      stripcommentandspaces = stripspaces . T.takeWhile (/= '#')
-      msourceandcmd = T.breakOn "|" . stripspaces <$> msourcearg
-      msource = T.unpack . stripcommentandspaces . fst <$> msourceandcmd
-      mcmd = msourceandcmd >>= \sc ->
-        let c = T.unpack . stripspaces . T.drop 1 . snd $ sc
-        in if null c then Nothing else Just c
-
-    datafiles <- case msource of
-      Nothing -> return [maybe err (dbg4 "inferred source") $ dataFileFor rulesfile] -- shouldn't fail, f has .rules extension
-        where err = error' $ "could not infer a data file for " <> rulesfile
-      Just glb -> do
-        let (dir,desc) = if isFileName glb then (dldir," in download directory") else (rulesdir,"")
-        expandGlob dir (dbg4 "source rule" glb) >>= sortByModTime <&> dbg4 ("matched files"<>desc<>", oldest first")
-        -- XXX disable for now, too much complication: easy review of recent imported data:
-        -- `archive` also affects non-`import` commands reading the rules file:
-        -- when the `source` rule's glob pattern matches no files (no new downloads are available),
-        -- they will use the archive as a fallback (reading the newest archived file, if any).
-        -- if the source rule matched no files and we are reading not importing, use the most recent archived file.
-        -- case globmatches of
-        --   [] | archive && not cmdisimport -> do
-        --     archivesFor archivedir rulesfile <&> take 1 <&> dbg4 "latest file in archive directory"
-        --   _ -> return globmatches -- XXX don't let it be cleaned again
-
-    return $ case datafiles of
-      [] -> (Nothing, Nothing)
-      [f] | cmdisimport -> dbg4 "importing" (Just f , mcmd)
-      [f] -> dbg4 "reading" (Just f , mcmd)
-      fs | cmdisimport && archiving -> dbg4 "importing oldest file" (headMay fs, mcmd)
-      fs | cmdisimport -> dbg4 "importing newest file" (lastMay fs, mcmd)
-      fs -> dbg4 "reading newest file" (lastMay fs, mcmd)
-
-  case mdatafileandcmd of
-    (Nothing, _) -> return nulljournal -- data file specified by source rule was not found
-    (Just datafile, mcmd) -> do
-      exists <- liftIO $ doesFileExist datafile
-      if not (datafile=="-" || exists)
-      then return nulljournal -- data file inferred from rules file name was not found
-      else do
-        datafileh <- liftIO $ openFileOrStdin datafile
-        rawdata <- liftIO $ readHandlePortably datafileh
-        cleandata <- liftIO $ maybe (return rawdata) (\c -> runFilterCommand rulesfile c rawdata) mcmd
-        cleandatafileh <- liftIO $ inputToHandle cleandata
-        do
-          readJournalFromCsv (Just $ Left rules) datafile cleandatafileh Nothing
-            -- apply any command line account aliases. Can fail with a bad replacement pattern.
-            >>= liftEither . journalApplyAliases (aliasesFromOpts iopts)
-            -- journalFinalise assumes the journal's items are
-            -- reversed, as produced by JournalReader's parser.
-            -- But here they are already properly ordered. So we'd
-            -- better preemptively reverse them once more. XXX inefficient
-            . journalReverse
-            >>= journalFinalise iopts{balancingopts_=(balancingopts_ iopts){ignore_assertions_=True}} rulesfile ""
-            >>= \j -> do
-              when archiving $ liftIO $ saveToArchive archivedir rulesfile datafile (mcmd <&> const cleandata)
-              return j
-
--- | Run the given shell command, passing the given text as input, and return the output.
--- Or if the command fails, raise an informative error.
-runFilterCommand :: FilePath -> String -> Text -> IO Text
-runFilterCommand rulesfile cmd input = do
-  let process = (shell cmd) { std_in = CreatePipe, std_out = CreatePipe, std_err = CreatePipe }
+  -- The rules reader does a lot; we must be organised.
+
+  -- 1. gather contextual info
+  -- gives: import flag, dryrun flag, rulesdir
+
+  let
+    args = progArgs
+    import_ = dbg2 "import" $ any (`elem` args) ["import", "imp"]
+    dryrun = dbg2 "dryrun" $ any (`elem` args) ["--dry-run", "--dry"]
+    rulesdir = takeDirectory rulesfile
+
+  -- 2. parse the source and archive rules
+  -- needs: rules file
+  -- gives: file pattern, data cleaning/generating command, archive flag
+
+  -- XXX higher-than usual logging priority for file reading (normally 6 or 7), to bypass excessive noise from elsewhere
+  rules <- readRulesFile $ dbg1 "reading rules file" rulesfile
+  let
+    msourcearg = getDirective "source" rules
+    -- Nothing -> error' $ rulesfile ++ " source rule must specify a file pattern or a command"
+
+    -- Surrounding whitespace is removed from the whole source argument and from each part of it.
+    -- A # before | makes the rest of line a comment.
+    -- A # after | is left for the shell to interpret; it could be part of the command or the start of a comment.
+    stripspaces = T.strip
+    stripcommentandspaces = stripspaces . T.takeWhile (/= '#')
+    mpatandcmd = T.breakOn "|" . stripspaces <$> msourcearg
+    mpat = dbg2 "file pattern" $ -- a non-empty file pattern, or nothing
+      case T.unpack . stripcommentandspaces . fst <$> mpatandcmd of
+        Just s | not $ null s -> Just s
+        _ -> Nothing
+    mcmd = dbg2 "data command" $ -- a non-empty command, or nothing
+      mpatandcmd >>= \sc ->
+        let c = T.unpack . stripspaces . T.drop 1 . snd $ sc
+        in if null c then Nothing else Just c
+
+    archive = isJust (getDirective "archive" rules)
+
+  -- 3. find the file to be read, if any
+  -- needs: file pattern, data command, import flag, archive flag, downloads dir
+  -- gives: data file, data file description
+
+  (mdatafile, datafiledesc) <- dbg2 "data file found ?" <$> case (mpat, mcmd) of
+    (Nothing, Nothing) -> error' $ "to make " ++ rulesfile ++ " readable,\n please add a 'source' rule with a non-empty file pattern or command"
+    (Nothing, Just _) -> return (Nothing, "")
+    (Just pat, _) -> do
+      dldir <- liftIO getDownloadDir -- look here for the data file if it's specified without a directory
+      let
+        (startdir, dirdesc)
+          | isFileName pat = (dldir, " in download directory")
+          | otherwise = (rulesdir, "")
+      fs <- liftIO $
+        expandGlob startdir pat
+        >>= sortByModTime
+        <&> dbg2 ("matched files"<>dirdesc<>", oldest first")
+      return $
+        if import_ && archive
+        then (headMay fs, " oldest file")
+        else (lastMay fs, " newest file")
+
+  -- 4. log which file we are reading/importing/cleaning/generating
+  -- needs: data file, data file description, import flag
+
+  case (mdatafile, datafiledesc) of
+    (Just f, desc) -> dbg1IO ("trying to " ++ (if import_ then "import" else "read") ++ desc) f
+    (Nothing, _) -> return ()
+
+  -- 5. read raw, cleaned or generated data
+  -- needs: file pattern, data file, data command
+  -- gives: clean data (possibly empty)
+
+  mexistingdatafile <- maybe (return Nothing) (\f -> liftIO $ do
+    exists <- doesFileExist f
+    return $ if exists then Just f else Nothing
+    ) $ mdatafile
+  cleandata <- dbg1With (\t -> "read "++(show $ length $ T.lines t)++" lines") <$> case (mpat, mexistingdatafile, mcmd) of
+
+    -- file pattern, but no file found
+    (Just _, Nothing, _) -> -- trace "file pattern, but no file found" $
+      return ""
+
+    -- file found, and maybe a data cleaning command
+    (_, Just f, mc) -> -- trace "file found" $
+      liftIO $ do
+        raw <- openFileOrStdin f >>= readHandlePortably
+        maybe (return raw) (\c -> runCommandAsFilter rulesfile (dbg0Msg ("running: "++c) c) raw) mc
+
+    -- no file pattern, but a data generating command
+    (Nothing, _, Just cmd) -> -- trace "data generating command" $
+      liftIO $ runCommand rulesfile $ dbg0Msg ("running: " ++ cmd) cmd
+
+    -- neither a file pattern nor a data generating command
+    (Nothing, _, Nothing) -> -- trace "no file pattern or data generating command" $
+      error' $ rulesfile ++ " source rule must specify a file pattern or a command"
+
+  -- 6. convert the clean data to a (possibly empty) journal
+  -- needs: clean data, rules, rules file, data file if any
+  -- gives: journal
+
+  j <- do
+    cleandatah <- liftIO $ inputToHandle cleandata
+    readJournalFromCsv (Just $ Left rules) (fromMaybe "(cmd)" mdatafile) cleandatah Nothing
+      -- apply any command line account aliases. Can fail with a bad replacement pattern.
+      >>= liftEither . journalApplyAliases (aliasesFromOpts iopts)
+      -- journalFinalise assumes the journal's items are
+      -- reversed, as produced by JournalReader's parser.
+      -- But here they are already properly ordered. So we'd
+      -- better preemptively reverse them once more. XXX inefficient
+      . journalReverse
+      >>= journalFinalise iopts{balancingopts_=(balancingopts_ iopts){ignore_assertions_=True}} rulesfile ""
+
+  -- 7. if non-empty, successfully read and converted, and we're doing a non-dry-run archiving import: archive the data
+  -- needs: import/archive/dryrun flags, rules directory, rules file, data file if any, clean data
+
+  when (not (T.null cleandata) && import_ && archive && not dryrun) $
+    liftIO $ saveToArchive (rulesdir </> "data") rulesfile mdatafile cleandata
+
+  return j
+
+-- | For the given rules file, run the given shell command, in the rules file's directory.
+-- If the command fails, raise an error and show its error output;
+-- otherwise return its output, and show any error output as a warning.
+runCommand :: FilePath -> String -> IO Text
+runCommand rulesfile cmd = do
+  let process = (shell cmd) { cwd = Just $ takeDirectory rulesfile, std_out = CreatePipe, std_err = CreatePipe }
+  withCreateProcess process $ \_ mhout mherr phandle -> do
+    case (mhout, mherr) of
+      (Just hout, Just herr) -> do
+        out <- T.hGetContents hout
+        err <- hGetContents' herr
+        exitCode <- waitForProcess phandle
+        case exitCode of
+          ExitSuccess -> do
+            unless (null err) $ warnIO err
+            return out
+          ExitFailure code ->
+            error' $ "in " ++ rulesfile ++ ": command \"" ++ cmd ++ "\" failed with exit code " ++ show code
+              ++ (if null err then "" else ":\n" ++ err)
+      _ -> error' $ "in " ++ rulesfile ++ ": failed to create pipes for command execution"
+
+-- | For the given rules file, run the given shell command, in the rules file's directory, passing the given text as input.
+-- Return the output, or if the command fails, raise an informative error.
+runCommandAsFilter :: FilePath -> String -> Text -> IO Text
+runCommandAsFilter rulesfile cmd input = do
+  let process = (shell cmd) { cwd = Just $ takeDirectory rulesfile, std_in = CreatePipe, std_out = CreatePipe, std_err = CreatePipe }
   withCreateProcess process $ \mhin mhout mherr phandle -> do
     case (mhin, mhout, mherr) of
       (Just hin, Just hout, Just herr) -> do
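A rough sketch of how the mpatandcmd/mpat/mcmd bindings above split a source argument, in GHCi with OverloadedStrings (the argument text is hypothetical):

    ghci> T.breakOn "|" "acme-checking*.csv | sed -e 's/USD/$/g'"
    ("acme-checking*.csv ","| sed -e 's/USD/$/g'")

After stripping, this gives mpat = Just "acme-checking*.csv" and mcmd = Just "sed -e 's/USD/$/g'". An argument starting with "|" leaves the pattern part empty (mpat = Nothing), and an argument with no "|" leaves the command part empty (mcmd = Nothing).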
@@ -254,46 +310,50 @@ runFilterCommand rulesfile cmd input = do
               ++ (if null err then "" else ":\n" ++ err)
       _ -> error' $ "in " ++ rulesfile ++ ": failed to create pipes for command execution"

--- | Save some successfully imported data to the given archive directory,
--- autocreating that if needed, and showing informational output on stderr.
--- The remaining arguments are: the rules file path (for naming), the original data file,
--- and if there was a data-cleaning command, the cleaned data from that file.
--- The archive file name will be RULESFILEBASENAME.DATAFILEMODDATE.DATAFILEEXT.
--- When there is cleaned data, currently only that is saved (not the original data).
-saveToArchive :: FilePath -> FilePath -> FilePath -> Maybe Text -> IO ()
-saveToArchive archivedir rulesfile datafile mcleandata = do
-  createDirectoryIfMissing True archivedir
-  hPutStrLn stderr $ "archiving " <> datafile
-  (_origname, cleanname) <- archiveFileName rulesfile datafile
-  let
-    cleanarchive = archivedir </> cleanname
-    -- origarchive = archivedir </> origname
-  case mcleandata of
-    Just cleandata -> do
-      -- disabled for simplicity:
-      -- the original data is also saved, as RULESFILEBASENAME.orig.DATAFILEMODDATE.DATAFILEEXT.
-      -- hPutStrLn stderr $ " as " <> origarchive
-      -- renameFile datafile origarchive
-      -- hPutStrLn stderr $ " and " <> cleanarchive
-      hPutStrLn stderr $ " as " <> cleanarchive
-      T.writeFile cleanarchive cleandata
-      removeFile datafile
-    Nothing -> do
-      hPutStrLn stderr $ " as " <> cleanarchive
-      renameFile datafile cleanarchive
+type DirPath = FilePath

--- | Figure out the file names to use when archiving, for the given rules file, the given data file.
+-- | Save some successfully imported data
+-- (more precisely: data that was successfully read and maybe cleaned, or that was generated, during an import)
+-- to the given archive directory, autocreating that if needed, and show informational output on stderr.
+-- The arguments are:
+-- the archive directory,
+-- the rules file (for naming),
+-- the data file name, if any,
+-- the data that was read, cleaned, or generated.
+-- The archive file name will be RULESFILEBASENAME.DATAFILEMODDATEORCURRENTDATE.DATAFILEEXTORCSV.
+-- Note for a data generating command, where there's no data file, we use the current date
+-- and a .csv file extension (meaning "character-separated values" in this case).
+saveToArchive :: DirPath -> FilePath -> Maybe FilePath -> Text -> IO ()
+saveToArchive archivedir rulesfile mdatafile cleandata = do
+  createDirectoryIfMissing True archivedir
+  (_, cleanname) <- archiveFileName rulesfile mdatafile
+  let cleanarchive = archivedir </> cleanname
+  hPutStrLn stderr $ "archiving " <> cleanarchive
+  T.writeFile cleanarchive cleandata
+  maybe (return ()) removeFile mdatafile
+
+-- | Figure out the file names to use when archiving, for the given rules file and the given data file if any.
 -- The second name is for the final (possibly cleaned) data; the first name has ".orig" added,
 -- and is used if both original and cleaned data are being archived. They will be like this:
 -- ("RULESFILEBASENAME.orig.DATAFILEMODDATE.DATAFILEEXT", "RULESFILEBASENAME.DATAFILEMODDATE.DATAFILEEXT")
-archiveFileName :: FilePath -> FilePath -> IO (String, String)
-archiveFileName rulesfile datafile = do
-  moddate <- (show . utctDay) <$> getModificationTime datafile
-  let (base, ext) = (takeBaseName rulesfile, takeExtension datafile)
-  return (
-     base <.> "orig" <.> moddate <.> ext
-    ,base <.> moddate <.> ext
-    )
+archiveFileName :: FilePath -> Maybe FilePath -> IO (String, String)
+archiveFileName rulesfile mdatafile = do
+  let base = takeBaseName rulesfile
+  case mdatafile of
+    Just datafile -> do
+      moddate <- (show . utctDay) <$> getModificationTime datafile
+      let ext = takeExtension datafile
+      return (
+         base <.> "orig" <.> moddate <.> ext
+        ,base <.> moddate <.> ext
+        )
+    Nothing -> do
+      let ext = "csv"
+      curdate <- show <$> getCurrentDay
+      return (
+         base <.> "orig" <.> curdate <.> ext
+        ,base <.> curdate <.> ext
+        )

 -- -- | In the given archive directory, if it exists, find the paths of data files saved for the given rules file.
 -- -- They will be reverse sorted by name, ie newest first, assuming normal archive file names.
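For illustration, assuming a hypothetical acme.rules and a data file Checking1.csv last modified on 2025-06-01, archiveFileName above returns ("acme.orig.2025-06-01.csv","acme.2025-06-01.csv"), and saveToArchive then writes the (possibly cleaned) data to data/acme.2025-06-01.csv and removes the original file. With no data file (a data-generating command), the current date and a ".csv" extension are used instead.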
@@ -3294,23 +3294,54 @@ All this enables a convenient workflow where can you just download CSV files, th

 See also ["Working with CSV > Reading files specified by rule"](#reading-files-specified-by-rule).

-### Data cleaning
+<!--
+The source rule supports ~ for home directory: `source ~/Downloads/foo.csv`.
+
+If the argument is a bare filename, its directory is assumed to be ~/Downloads: `source foo.csv`.
+
+Otherwise if it is a relative path, it is assumed to be relative to the rules file's directory: `source new/foo.csv`.
+
+The source rule can specify a glob pattern: `source foo*.csv`.
+
+If the glob pattern matches multiple files, the newest (last modified) file is used (with one exception, described below).
+
+The source rule can specify a data-cleaning command, after a `|` separator: `source foo*.csv | sed -e 's/USD/$/g'`.
+This command is executed by the user's default shell, receives the data file's content on stdin,
+and should output CSV data suitable for the conversion rules.
+A # character can be used to comment out the data-cleaning command: `source foo*.csv # | ...`.
+
+Or the source rule can specify a data-generating command, with no file pattern: `source | foo-csv.sh`.
+In this case the command receives no input; it should output CSV data suitable for the conversion rules.
+-->
+
+### Data cleaning / generating commands

 After `source`'s file pattern, you can write `|` (pipe) and a data cleaning command.
 If hledger's CSV rules aren't enough, you can pre-process the downloaded data here with a shell command or script, to make it more suitable for conversion.
-The command will be executed by your default shell, will receive the data file's content as standard input,
-and should output zero or more lines of character-separated-values, ready for conversion by the CSV rules.
+The command will be executed by your default shell, in the directory of the rules file, will receive the data file's content as standard input,
+and should output zero or more lines of character-separated-values, suitable for conversion by the CSV rules.
+
+Or, after `source` you can write `|` and a data generating command (with no file pattern before the `|`).
+This command receives no input, and should output zero or more lines of character-separated values, suitable for conversion by the CSV rules.
+
+Whenever hledger runs one of these commands, it will print the command on stderr.
+If the command produces error output, but exits successfully, hledger will show the error output as a warning.
+If the command fails, hledger will fail and show the error output in the error message.

 *Added in 1.50; experimental.*

 ## `archive`

-Adding `archive` to a rules file causes the `import` command
-to archive (move and rename) each imported data file, in a nearby `data/` directory.
-Also, `import` will prefer the oldest of the `source` rule's glob-matched files rather than the newest.
+With `archive` added to a rules file, the `import` command
+will archive each successfully processed data file or data command output in a nearby `data/` directory.
+The archive file name will be based on the rules file and the data file's modification date and extension
+(or for a data-generating command, the current date and the ".csv" extension).
+The original data file, if any, will be removed.
+
+Also, in this mode `import` will prefer the oldest file matched by the `source` rule's glob pattern, not the newest.
 (So if there are multiple downloads, they will be imported and archived oldest first.)

-Archiving imported data is optional, but it can be useful for
+Archiving is optional, but it can be useful for
 troubleshooting your CSV rules,
 regenerating entries with improved rules,
 checking for variations in your bank's CSV,
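To make the documented behaviour concrete, a hypothetical acme.rules using a data-generating command and archiving might contain:

    source | acme-export.sh
    archive

A non-dry-run `hledger import acme.rules` would then run acme-export.sh in the rules file's directory, convert its CSV output, and on success save that output as e.g. `data/acme.2025-06-01.csv` next to the rules file (using the current date, since there is no data file).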
|
|||||||
Loading…
Reference in New Issue
Block a user