datalad addurls(1)          General Commands Manual          datalad addurls(1)

       datalad addurls - create and update a dataset from a list of URLs.

       datalad addurls [-h] [-d DATASET] [-t TYPE] [-x REGEXP] [-m FORMAT]
              [--key FORMAT] [--message MESSAGE] [-n] [--fast]
              [--ifexists {overwrite|skip}] [--missing-value VALUE] [--nosave]
              [--version-urls] [-c PROC] [-J NJOBS] [--drop-after]
              [--on-collision {error|error-if-different|take-first|take-last}]
              [--version] URL-FILE URL-FORMAT FILENAME-FORMAT

   Format specification
       Several arguments take format strings. These are similar to normal
       Python format strings where the names from `URL-FILE` (column names for
       a comma- or tab-separated file or properties for JSON) are available as
       placeholders. If `URL-FILE` is a CSV or TSV file, a positional index
       can also be used (e.g., "{0}" for the first column). Note that a
       placeholder cannot contain a ':' or '!'.
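
       A minimal sketch of how such a format string could be expanded against
       one row of a CSV `URL-FILE`, using only Python's standard library. It
       illustrates the placeholder semantics described above and is not
       datalad's actual implementation; the helper name `expand` is
       hypothetical, and the sample rows are borrowed from the Examples
       section of this page.

```python
import csv
import io

# Sample rows in the same shape as the "avatars.csv" example in this page.
CSV_TEXT = """who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200
"""


def expand(fmt, header, row):
    """Fill `fmt` with named ({who}) and positional ({0}) placeholders.

    Hypothetical helper: column names become keyword arguments and the
    column values, in order, become positional arguments of str.format.
    """
    return fmt.format(*row, **dict(zip(header, row)))


reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
rows = list(reader)

for row in rows:
    url = expand("{link}", header, row)
    filename = expand("{who}.{ext}", header, row)
    print(url, "->", filename)

# The positional form addresses the same column as its name.
assert expand("{0}", header, rows[0]) == expand("{who}", header, rows[0])
```

       In Python's format syntax, ':' starts a format specification and '!'
       starts a conversion, which is consistent with the restriction on
       placeholder names noted above.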

       In addition, the `FILENAME-FORMAT` argument has a few special
       placeholders.

       - _repindex

           The constructed file names must be unique across all rows. To
           avoid collisions, the special placeholder "_repindex" can be added
           to the formatter. Its value will start at 0 and increment every
           time a file name repeats.

       - _url_hostname, _urlN, _url_basename*

           Various parts of the formatted URL are available. Take
           "http://datalad.org/asciicast/seamless_nested_repos.sh" as an
           example.

           "datalad.org" is stored as "_url_hostname". Components of the
           URL's path can be referenced as "_urlN". "_url0" and "_url1" would
           map to "asciicast" and "seamless_nested_repos.sh", respectively.
           The final part of the path is also available as "_url_basename".

           This name is broken down further. "_url_basename_root" and
           "_url_basename_ext" provide access to the root name and extension.
           These values are similar to the result of os.path.splitext, but,
           in the case of multiple periods, the extension is identified using
           the same length heuristic that git-annex uses. As a result, the
           extension of "file.tar.gz" would be ".tar.gz", not ".gz". In
           addition, the fields "_url_basename_root_py" and
           "_url_basename_ext_py" provide access to the result of
           os.path.splitext.

       - _url_filename*

           These are similar to _url_basename* fields, but they are obtained
           with a server request. This is useful if the file name is set in
           the Content-Disposition header.
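
       The "_repindex" and "_url*" placeholders above can be approximated
       with the standard library. The sketch below is illustrative only:
       `url_fields` and `with_repindex` are hypothetical helpers, not datalad
       functions. The git-annex extension heuristic behind
       "_url_basename_ext" is not reproduced (only the "_py" variants, which
       are plain os.path.splitext), and the "_url_filename*" fields are
       omitted because they require a server request.

```python
import os.path
from collections import Counter
from urllib.parse import urlparse


def url_fields(url):
    """Return a dict of URL-derived placeholder values (hypothetical helper)."""
    parsed = urlparse(url)
    # Path components, ignoring empty pieces from leading/trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    fields = {"_url_hostname": parsed.hostname}
    for index, part in enumerate(parts):
        fields["_url{}".format(index)] = part
    basename = parts[-1] if parts else ""
    root, ext = os.path.splitext(basename)
    fields["_url_basename"] = basename
    fields["_url_basename_root_py"] = root
    fields["_url_basename_ext_py"] = ext
    return fields


def with_repindex(names):
    """Pair each name with a counter that starts at 0 and grows on repeats."""
    seen = Counter()
    result = []
    for name in names:
        result.append((name, seen[name]))
        seen[name] += 1
    return result


fields = url_fields("http://datalad.org/asciicast/seamless_nested_repos.sh")
print(fields["_url_hostname"], fields["_url0"], fields["_url1"])

# Using the counter to make repeated names unique, as in "{name}_{_repindex}".
for name, repindex in with_repindex(["scan", "scan", "notes"]):
    print("{}_{}".format(name, repindex))
```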

   Examples
       Consider a file "avatars.csv" that contains::

           who,ext,link
           neurodebian,png,https://avatars3.githubusercontent.com/u/260793
           datalad,png,https://avatars1.githubusercontent.com/u/8927200

       To download each link into a file name composed of the 'who' and 'ext'
       fields, we could run::

           $ datalad addurls -d avatar_ds avatars.csv '{link}' '{who}.{ext}'

       The `-d avatar_ds` is used to create a new dataset in "$PWD/avatar_ds".

       If we were already in a dataset and wanted to create a new subdataset
       in an "avatars" subdirectory, we could use "//" in the
       `FILENAME-FORMAT` argument::

           $ datalad addurls avatars.csv '{link}' 'avatars//{who}.{ext}'

       If the information is represented as JSON lines instead of
       comma-separated values or a JSON array, you can use a utility like jq
       to transform the JSON lines into an array that addurls accepts::

           $ ... | jq --slurp . | datalad addurls - '{link}' '{who}.{ext}'
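
       For readers without jq, the same conversion from JSON lines to a JSON
       array can be sketched in Python. This only illustrates what
       `jq --slurp .` does to the input; the function name `slurp_json_lines`
       and the sample records are made up for the example.

```python
import io
import json


def slurp_json_lines(stream):
    """Collect one JSON object per non-empty line into a single list."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical JSON-lines input with the same fields as avatars.csv.
json_lines = io.StringIO(
    '{"who": "neurodebian", "ext": "png", '
    '"link": "https://avatars3.githubusercontent.com/u/260793"}\n'
    '{"who": "datalad", "ext": "png", '
    '"link": "https://avatars1.githubusercontent.com/u/8927200"}\n'
)

records = slurp_json_lines(json_lines)
# A JSON array of objects with string values, the JSON shape URL-FILE expects.
print(json.dumps(records, indent=2))
```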

       NOTE

       For users familiar with 'git annex addurl': A large part of this
       plugin's functionality can be viewed as transforming data from
       `URL-FILE` into a "url filename" format that is fed to 'git annex
       addurl --batch --with-files'.

       URL-FILE
              A file that contains URLs or information that can be used to
              construct URLs. Depending on the value of --input-type, this
              should be a comma- or tab-separated file (with a header as the
              first row) or a JSON file (structured as a list of objects with
              string values). If '-', read from standard input, taking the
              content as JSON when --input-type is at its default value of
              'ext'.

       URL-FORMAT
              A format string that specifies the URL for each entry. See the
              'Format specification' section above.

       FILENAME-FORMAT
              Like `URL-FORMAT`, but this format string specifies the file to
              which the URL's content will be downloaded. The name should be a
              relative path and will be taken as relative to the top-level
              dataset, regardless of whether it is specified via --dataset or
              inferred. The file name may contain directories. The separator
              "//" can be used to indicate that the left-side directory should
              be created as a new subdataset. See the 'Format specification'
              section above.


       -h, --help, --help-np
              show this help message. --help-np forcefully disables the use of
              a pager for displaying the help message

       -d DATASET, --dataset DATASET
              Add the URLs to this dataset (or possibly subdatasets of this
              dataset). Pass an empty or non-existent directory to create a
              new dataset. New subdatasets can be specified with
              `FILENAME-FORMAT`. Constraints: value must be a Dataset or a
              valid identifier of a Dataset (e.g. a path) or value must be
              NONE

       -t TYPE, --input-type TYPE
              Whether `URL-FILE` should be considered a CSV file, TSV file, or
              JSON file. The default value, "ext", means to consider
              `URL-FILE` as a JSON file if it ends with ".json" or a TSV file
              if it ends with ".tsv". Otherwise, treat it as a CSV file.
              Constraints: value must be one of ('ext', 'csv', 'tsv', 'json')
              [Default: 'ext']
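
              The "ext" rule can be sketched as a small function. It mirrors
              only the behavior described above, plus the URL-FILE note that
              '-' is read as JSON; `guess_input_type` is a hypothetical name
              and not part of datalad.

```python
def guess_input_type(url_file, input_type="ext"):
    """Resolve the effective input type following the documented "ext" rule."""
    if input_type != "ext":
        # An explicit 'csv', 'tsv', or 'json' is used as given.
        return input_type
    if url_file == "-":
        # Standard input is taken as JSON when the type is left at 'ext'.
        return "json"
    if url_file.endswith(".json"):
        return "json"
    if url_file.endswith(".tsv"):
        return "tsv"
    # Anything else falls back to CSV.
    return "csv"


for name in ["avatars.csv", "avatars.tsv", "avatars.json", "avatars.txt", "-"]:
    print(name, "->", guess_input_type(name))
```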

       -x REGEXP, --exclude-autometa REGEXP
              By default, metadata field=value pairs are constructed with each
              column in `URL-FILE`, excluding any single column that is
              specified via `URL-FORMAT`. This argument can be used to exclude
              columns that match a regular expression. If set to '*' or an
              empty string, automatic metadata extraction is disabled
              completely. This argument does not affect metadata set
              explicitly with --meta.

       -m FORMAT, --meta FORMAT
              A format string that specifies metadata. It should be structured
              as "<field>=<value>". As an example, "location={3}" would mean
              that the value for the "location" metadata field should be set
              to the value of the fourth column. This option can be given
              multiple times.

       --key FORMAT
              A format string that specifies an annex key for the file
              content. In this case, the file is not downloaded; instead the
              key is used to create the file without content. The value should
              be structured as "[et:]<input backend>[-s<bytes>]--<hash>". The
              optional "et:" prefix, which requires git-annex 8.20201116 or
              later, signals that the extension state of the input backend
              should be toggled (i.e., MD5 vs MD5E). As an example,
              "et:MD5-s{size}--{md5sum}" would use the 'md5sum' and 'size'
              columns to construct the key, migrating the key from MD5 to
              MD5E, with an extension based on the file name. Note: If the
              *input* backend itself is an annex extension backend (i.e., a
              backend with a trailing "E"), the key's extension will not be
              updated to match the extension of the corresponding file name.
              Thus, unless the input keys and file names are generated from
              git-annex, it is recommended to avoid using extension backends
              as input. If an extension is desired, use the plain variant as
              input and prepend "et:" so that git-annex will migrate from the
              plain backend to the extension variant.
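
              A sketch of how the example key format expands, assuming a row
              that carries 'size' and 'md5sum' columns (computed here from
              stand-in bytes for illustration). It only demonstrates the
              string layout "[et:]<input backend>[-s<bytes>]--<hash>"; the
              migration to MD5E is performed by git-annex and is not modeled.

```python
import hashlib

# Stand-in content; in practice the size and checksum come from URL-FILE.
content = b"hello datalad\n"
row = {
    "size": str(len(content)),
    "md5sum": hashlib.md5(content).hexdigest(),
}

# The format string from the description above, filled from the row.
key_format = "et:MD5-s{size}--{md5sum}"
key = key_format.format(**row)
print(key)
```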

       --message MESSAGE
              Use this message when committing the URL additions. Constraints:
              value must be NONE or value must be a string

       -n, --dry-run
              Report which URLs would be downloaded to which files and then
              exit.

       --fast If True, add the URLs, but don't download their content.
              WARNING: ONLY USE THIS OPTION IF YOU UNDERSTAND THE
              CONSEQUENCES. If the content of the URLs is not downloaded, then
              datalad will refuse to retrieve the contents with `datalad get
              <file>` by default because the content of the URLs is not
              verified. Add `annex.security.allow-unverified-downloads =
              ACKTHPPT` to your git config to bypass the safety check.
              Underneath, this passes the `--fast` flag to `git annex addurl`.

       --ifexists {overwrite|skip}
              What to do if a constructed file name already exists. The
              default behavior is to proceed with the `git annex addurl`,
              which will fail if the file size has changed. If set to
              'overwrite', remove the old file before adding the new one. If
              set to 'skip', do not add the new file. Constraints: value must
              be one of ('overwrite', 'skip')

       --missing-value VALUE
              When an empty string is encountered, use this value instead.
              Constraints: value must be NONE or value must be a string
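
              The substitution can be pictured with a short sketch.
              `fill_missing` is a hypothetical helper written for this
              illustration; what datalad does with empty strings when
              --missing-value is not given is not modeled here.

```python
def fill_missing(row, missing_value=None):
    """Return a copy of `row` with empty strings replaced by `missing_value`.

    When no replacement is given, this sketch returns the row unchanged.
    """
    if missing_value is None:
        return dict(row)
    return {k: (missing_value if v == "" else v) for k, v in row.items()}


row = {
    "who": "datalad",
    "ext": "",  # empty field in the input
    "link": "https://avatars1.githubusercontent.com/u/8927200",
}
filled = fill_missing(row, missing_value="unknown")
print("{who}.{ext}".format(**filled))
```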

       --nosave
              by default all modifications to a dataset are immediately saved.
              Giving this option will disable this behavior.

       --version-urls
              Try to add a version ID to the URL. This currently only has an
              effect on HTTP URLs for AWS S3 buckets. s3:// URL versioning is
              not yet supported, but any URL that already contains a
              "versionId=" parameter will be used as is.

       -c PROC, --cfg-proc PROC
              Pass this --cfg-proc value when calling CREATE to make datasets.

       -J NJOBS, --jobs NJOBS
              how many parallel jobs (where possible) to use. "auto"
              corresponds to the number defined by the
              'datalad.runtime.max-annex-jobs' configuration item.
              Constraints: value must be convertible to type 'int' or value
              must be NONE or value must be one of ('auto',)

       --drop-after
              drop files after adding to annex.

       --on-collision {error|error-if-different|take-first|take-last}
              What to do when more than one row produces the same file name.
              By default an error is triggered. "error-if-different"
              suppresses that error if the colliding rows for a given file
              name have the same URL and metadata. "take-first" and
              "take-last" instead take the first or last row from each set of
              colliding rows. Constraints: value must be one of ('error',
              'error-if-different', 'take-first', 'take-last') [Default:
              'error']
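
              The four policies can be sketched as follows. This is an
              illustration of the rules described above, not datalad's code:
              `resolve_collisions` and the row layout (dicts with
              'filename', 'url', and 'meta' keys) are assumptions made for
              the example, as are the example.com URLs.

```python
def resolve_collisions(rows, on_collision="error"):
    """Reduce rows to one per file name according to the chosen policy."""
    # Group rows by the file name they produce, keeping input order.
    groups = {}
    for row in rows:
        groups.setdefault(row["filename"], []).append(row)

    kept = []
    for filename, group in groups.items():
        if len(group) > 1:
            if on_collision == "error":
                raise ValueError("file name collision: " + filename)
            if on_collision == "error-if-different":
                first = (group[0]["url"], group[0]["meta"])
                if any((r["url"], r["meta"]) != first for r in group[1:]):
                    raise ValueError("conflicting rows for: " + filename)
        # take-last keeps the final row; every other surviving case keeps
        # the first row (identical rows are interchangeable).
        kept.append(group[-1] if on_collision == "take-last" else group[0])
    return kept


rows = [
    {"filename": "a.png", "url": "https://example.com/1", "meta": {"v": "1"}},
    {"filename": "a.png", "url": "https://example.com/2", "meta": {"v": "2"}},
    {"filename": "b.png", "url": "https://example.com/3", "meta": {}},
]
print([r["url"] for r in resolve_collisions(rows, "take-first")])
print([r["url"] for r in resolve_collisions(rows, "take-last")])
```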

       --version
              show the module and its version which provides the command

       datalad is developed by The DataLad Team and Contributors
       <team@datalad.org>.

datalad addurls 0.19.3             2023-08-11             datalad addurls(1)