datalad addurls(1)          General Commands Manual         datalad addurls(1)

NAME

       datalad addurls - create and update a dataset from a list of URLs.

SYNOPSIS

       datalad addurls [-h] [-d DATASET] [-t TYPE] [-x REGEXP] [-m FORMAT]
              [--key FORMAT] [--message MESSAGE] [-n] [--fast]
              [--ifexists {overwrite|skip}] [--missing-value VALUE] [--nosave]
              [--version-urls] [-c PROC] [-J NJOBS] [--drop-after]
              [--on-collision {error|error-if-different|take-first|take-last}]
              [--version] URL-FILE URL-FORMAT FILENAME-FORMAT

DESCRIPTION

   Format specification
       Several arguments take format strings. These are similar to normal
       Python format strings where the names from `URL-FILE` (column names
       for a comma- or tab-separated file or properties for JSON) are
       available as placeholders. If `URL-FILE` is a CSV or TSV file, a
       positional index can also be used (i.e., "{0}" for the first column).
       Note that a placeholder cannot contain a ':' or '!'.

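       For instance, under the assumption that a hypothetical "urls.csv"
       lists the download URL in its first column and the desired file name
       in its second column, positional placeholders alone would be enough:

         $ datalad addurls urls.csv '{0}' '{1}'
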
       In addition, the `FILENAME-FORMAT` argument has a few special
       placeholders.

       - _repindex

         The constructed file names must be unique across all rows. To
         avoid collisions, the special placeholder "_repindex" can be added
         to the formatter. Its value will start at 0 and increment every
         time a file name repeats.

       - _url_hostname, _urlN, _url_basename*

         Various parts of the formatted URL are available. Take
         "http://datalad.org/asciicast/seamless_nested_repos.sh" as an
         example.

         "datalad.org" is stored as "_url_hostname". Components of the
         URL's path can be referenced as "_urlN". "_url0" and "_url1" would
         map to "asciicast" and "seamless_nested_repos.sh", respectively.
         The final part of the path is also available as "_url_basename".

         This name is broken down further. "_url_basename_root" and
         "_url_basename_ext" provide access to the root name and extension.
         These values are similar to the result of os.path.splitext, but,
         in the case of multiple periods, the extension is identified using
         the same length heuristic that git-annex uses. As a result, the
         extension of "file.tar.gz" would be ".tar.gz", not ".gz". In
         addition, the fields "_url_basename_root_py" and
         "_url_basename_ext_py" provide access to the result of
         os.path.splitext.

       - _url_filename*

         These are similar to the _url_basename* fields, but they are
         obtained with a server request. This is useful if the file name is
         set in the Content-Disposition header.

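       As an illustration of these placeholders, for the example URL above
       a FILENAME-FORMAT of '{_url_hostname}/{_url_basename}' would expand
       to "datalad.org/seamless_nested_repos.sh", creating the file under a
       "datalad.org" directory.
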
   Examples
       Consider a file "avatars.csv" that contains:

         who,ext,link
         neurodebian,png,https://avatars3.githubusercontent.com/u/260793
         datalad,png,https://avatars1.githubusercontent.com/u/8927200

       To download each link into a file name composed of the 'who' and
       'ext' fields, we could run:

         $ datalad addurls -d avatar_ds avatars.csv '{link}' '{who}.{ext}'

       The `-d avatar_ds` is used to create a new dataset in
       "$PWD/avatar_ds".

       If we were already in a dataset and wanted to create a new
       subdataset in an "avatars" subdirectory, we could use "//" in the
       `FILENAME-FORMAT` argument:

         $ datalad addurls avatars.csv '{link}' 'avatars//{who}.{ext}'

       If the information is represented as JSON lines instead of
       comma-separated values or a JSON array, you can use a utility like
       jq to transform the JSON lines into an array that addurls accepts:

         $ ... | jq --slurp . | datalad addurls - '{link}' '{who}.{ext}'

       NOTE

        For users familiar with 'git annex addurl': A large part of this
        plugin's functionality can be viewed as transforming data from
        `URL-FILE` into a "url filename" format that is fed to 'git annex
        addurl --batch --with-files'.

OPTIONS

       URL-FILE
              A file that contains URLs or information that can be used to
              construct URLs. Depending on the value of --input-type, this
              should be a comma- or tab-separated file (with a header as
              the first row) or a JSON file (structured as a list of
              objects with string values). If '-', read from standard
              input, taking the content as JSON when --input-type is at its
              default value of 'ext'.

       URL-FORMAT
              A format string that specifies the URL for each entry. See
              the 'Format Specification' section above.

       FILENAME-FORMAT
              Like `URL-FORMAT`, but this format string specifies the file
              to which the URL's content will be downloaded. The name
              should be a relative path and will be taken as relative to
              the top-level dataset, regardless of whether it is specified
              via --dataset or inferred. The file name may contain
              directories. The separator "//" can be used to indicate that
              the left-side directory should be created as a new
              subdataset. See the 'Format Specification' section above.

       -h, --help, --help-np
              show this help message. --help-np forcefully disables the use
              of a pager for displaying the help message

       -d DATASET, --dataset DATASET
              Add the URLs to this dataset (or possibly subdatasets of this
              dataset). An empty or non-existent directory is passed to
              create a new dataset. New subdatasets can be specified with
              `FILENAME-FORMAT`. Constraints: Value must be a Dataset or a
              valid identifier of a Dataset (e.g. a path) or value must be
              NONE

       -t TYPE, --input-type TYPE
              Whether `URL-FILE` should be considered a CSV file, TSV file,
              or JSON file. The default value, "ext", means to consider
              `URL-FILE` as a JSON file if it ends with ".json" or a TSV
              file if it ends with ".tsv". Otherwise, treat it as a CSV
              file. Constraints: value must be one of ('ext', 'csv', 'tsv',
              'json') [Default: 'ext']

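              For example, to force JSON parsing for an input file whose
              name does not end in ".json" (here a hypothetical
              "records.txt" with 'link' and 'name' fields), the type can be
              given explicitly:

                $ datalad addurls -t json records.txt '{link}' '{name}'
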
       -x REGEXP, --exclude-autometa REGEXP
              By default, metadata field=value pairs are constructed with
              each column in `URL-FILE`, excluding any single column that
              is specified via `URL-FORMAT`. This argument can be used to
              exclude columns that match a regular expression. If set to
              '*' or an empty string, automatic metadata extraction is
              disabled completely. This argument does not affect metadata
              set explicitly with --meta.

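              For instance, with the "avatars.csv" file from the Examples
              section, the following should keep the automatic metadata but
              drop the 'ext' column from it (a sketch):

                $ datalad addurls -x '^ext$' avatars.csv '{link}' '{who}.{ext}'
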
       -m FORMAT, --meta FORMAT
              A format string that specifies metadata. It should be
              structured as "<field>=<value>". As an example,
              "location={3}" would mean that the value for the "location"
              metadata field should be set to the value of the fourth
              column. This option can be given multiple times.

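              For instance, with the "avatars.csv" file from the Examples
              section, a hypothetical 'filetype' metadata field could be
              filled from the 'ext' column (a sketch):

                $ datalad addurls -m 'filetype={ext}' avatars.csv \
                    '{link}' '{who}.{ext}'
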
       --key FORMAT
              A format string that specifies an annex key for the file
              content. In this case, the file is not downloaded; instead
              the key is used to create the file without content. The value
              should be structured as
              "[et:]<input backend>[-s<bytes>]--<hash>". The optional "et:"
              prefix, which requires git-annex 8.20201116 or later, signals
              to toggle extension state of the input backend (i.e., MD5 vs
              MD5E). As an example, "et:MD5-s{size}--{md5sum}" would use
              the 'md5sum' and 'size' columns to construct the key,
              migrating the key from MD5 to MD5E, with an extension based
              on the file name. Note: If the *input* backend itself is an
              annex extension backend (i.e., a backend with a trailing
              "E"), the key's extension will not be updated to match the
              extension of the corresponding file name. Thus, unless the
              input keys and file names are generated from git-annex, it is
              recommended to avoid using extension backends as input. If an
              extension is desired, use the plain variant as input and
              prepend "et:" so that git-annex will migrate from the plain
              backend to the extension variant.

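              For instance, assuming a hypothetical "checksums.csv" with
              'link', 'name', 'size', and 'md5sum' columns, files could be
              registered by key without downloading their content:

                $ datalad addurls --key 'et:MD5-s{size}--{md5sum}' \
                    checksums.csv '{link}' '{name}'
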
       --message MESSAGE
              Use this message when committing the URL additions.
              Constraints: value must be NONE or value must be a string

       -n, --dry-run
              Report which URLs would be downloaded to which files and then
              exit.

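              For example, to preview the mapping for the "avatars.csv"
              file from the Examples section without downloading anything:

                $ datalad addurls -n avatars.csv '{link}' '{who}.{ext}'
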
       --fast If True, add the URLs, but don't download their content.
              WARNING: ONLY USE THIS OPTION IF YOU UNDERSTAND THE
              CONSEQUENCES. If the content of the URLs is not downloaded,
              then datalad will refuse to retrieve the contents with
              `datalad get <file>` by default because the content of the
              URLs is not verified. Add
              `annex.security.allow-unverified-downloads = ACKTHPPT` to
              your git config to bypass the safety check. Underneath, this
              passes the `--fast` flag to `git annex addurl`.

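              One way to set that configuration in the dataset's local git
              config (a sketch; adjust the scope as needed) is:

                $ git config annex.security.allow-unverified-downloads \
                    ACKTHPPT
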
       --ifexists {overwrite|skip}
              What to do if a constructed file name already exists. The
              default behavior is to proceed with the `git annex addurl`,
              which will fail if the file size has changed. If set to
              'overwrite', remove the old file before adding the new one.
              If set to 'skip', do not add the new file. Constraints: value
              must be one of ('overwrite', 'skip')

       --missing-value VALUE
              When an empty string is encountered, use this value instead.
              Constraints: value must be NONE or value must be a string

       --nosave
              by default all modifications to a dataset are immediately
              saved. Giving this option will disable this behavior.

       --version-urls
              Try to add a version ID to the URL. This currently only has
              an effect on HTTP URLs for AWS S3 buckets. s3:// URL
              versioning is not yet supported, but any URL that already
              contains a "versionId=" parameter will be used as is.

       -c PROC, --cfg-proc PROC
              Pass this --cfg_proc value when calling CREATE to make
              datasets.

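              For example, assuming the built-in 'text2git' configuration
              procedure is available, newly created datasets could be set
              up with it:

                $ datalad addurls -d avatar_ds -c text2git avatars.csv \
                    '{link}' '{who}.{ext}'
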
       -J NJOBS, --jobs NJOBS
              how many parallel jobs (where possible) to use. "auto"
              corresponds to the number defined by
              'datalad.runtime.max-annex-jobs' configuration item.
              Constraints: value must be convertible to type 'int' or value
              must be NONE or value must be one of ('auto',)

       --drop-after
              drop files after adding to annex.

       --on-collision {error|error-if-different|take-first|take-last}
              What to do when more than one row produces the same file
              name. By default an error is triggered. "error-if-different"
              suppresses that error if rows for a given file name collision
              have the same URL and metadata. "take-first" or "take-last"
              indicate to instead take the first row or last row from each
              set of colliding rows. Constraints: value must be one of
              ('error', 'error-if-different', 'take-first', 'take-last')
              [Default: 'error']

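              For instance, if a hypothetical "releases.csv" listed several
              revisions of a file under the same target name, keeping only
              the last listed row per name could look like this:

                $ datalad addurls --on-collision take-last releases.csv \
                    '{url}' '{name}'
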
       --version
              show the module and its version which provides the command

AUTHORS

       datalad is developed by The DataLad Team and Contributors
       <team@datalad.org>.


datalad addurls 0.19.3            2023-08-11                datalad addurls(1)