doc(xsv): reorganized and updated with examples

2026-02-14 17:49:53 +00:00 · 2023-10-16 23:07:55 +00:00
parent b138dd5ef2
commit 8d4c099bc7
1 changed files with 564 additions and 13 deletions
--- a/xsv/README.md
+++ b/xsv/README.md
@@ -5,7 +5,7 @@ tagline: |
  xsv: A fast CSV command line toolkit written in Rust.
 ---

-To update or switch versions, run `webi xsv@stable` (or `@v2`, `@beta`, etc).
+To update or switch versions, run `webi xsv@stable` (or `@v0.13`, `@beta`, etc).

 ### Files

@@ -15,7 +15,6 @@ install:
 ```text
 ~/.config/envman/PATH.env
 ~/.local/bin/xsv
-~/.local/opt/xsv
 ```

 ## Cheat Sheet
@@ -25,34 +24,586 @@ install:
 > simplicity and speed, it is an essential tool for anyone working with CSV
 > data.

+Get the canonical sample data:
+
+```sh
+curl -o ./worldcitiespop.csv -L https://burntsushi.net/stuff/worldcitiespop.csv
+```
+
+(or check "The Data Science Toolkit": <https://github.com/petewarden/dstkdata>)
+
+Show the CSV's headers:
+
+```sh
+xsv headers ./worldcitiespop.csv
+```
+
+Query the CSV data any which way:
+
+```sh
+xsv search '^(John|Jane)$' --select 'First Name' ./address-book.csv |
+    xsv select 'ID,First Name,Last Name' |
+    xsv sort --select 'Last Name,First Name' |
+    xsv slice -s 0 -l 5 |
+    xsv table
+```
+
+(shows the first 5 rows whose "First Name" is exactly "John" or "Jane", sorted
+by "Last Name" then "First Name", as a table)
+
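+The pipeline above reads `./address-book.csv`, which is not part of the sample
+download. A minimal sketch for creating one to experiment with (the rows are
+made up; only the column names come from the example):
+
+```sh
+# Hypothetical address book; the column names match the pipeline above.
+cat > ./address-book.csv <<'EOF'
+ID,First Name,Last Name
+1,John,Smith
+2,Jane,Doe
+3,Alice,Jones
+4,John,Adams
+EOF
+
+# Confirm the headers, but only when xsv is installed.
+if command -v xsv > /dev/null 2>&1; then
+    xsv headers ./address-book.csv
+fi
+```
+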
 ### Basic Usage

-To count the number of rows in a CSV file:
+- Subcommand Help & Global Options
+- View Headers
+- Sample Data
+- Select Columns
+- Select Rows
+- View Vertically (`\G`)
+- View as a Table
+
+**Subcommand Help & Global Options**

 ```sh
-xsv count data.csv
+xsv <subcommand> --help
 ```

-### Joining Two CSV Files
+Subcommands are: `cat`, `count`, `fixlengths`, `flatten`, `fmt`, `frequency`,
+`headers`, `help`, `index`, `input`, `join`, `sample`, `search`, `select`,
+`slice`, `sort`, `split`, `stats`, `table`.

-To perform an inner join on two CSV files based on a common column:
+```text
+Common options:
+    -h, --help             Display this message
+    -o, --output <file>    Write output to <file> instead of stdout.
+    -n, --no-headers       When set, the first row will not be interpreted
+                           as headers. (i.e., They are not searched, analyzed,
+                           sliced, etc.)
+    -d, --delimiter <arg>  The field delimiter for reading CSV data.
+                           Must be a single character. (default: ,)
+```
+
+**View Headers**
+
+Shows column name and index (1-based).

 ```sh
-xsv join column1 file1.csv column2 file2.csv
+xsv headers ./worldcitiespop.csv
 ```

-### Sample Data
+```text
+1   Country
+2   City
+3   AccentCity
+4   Region
+5   Population
+6   Latitude
+7   Longitude
+```

-To randomly sample rows from a CSV file:
+**Sample Data**

 ```sh
-xsv sample 100 data.csv
+xsv sample 2 ./worldcitiespop.csv
 ```

-### Analyzing Data
+```text
+Country,City,AccentCity,Region,Population,Latitude,Longitude
+us,provo,Provo,UT,105764,40.2338889,-111.6577778
+lv,riga,Riga,25,742570,56.95,24.1
+```

-To display basic statistics for each column in a CSV file:
+**Select Columns**

 ```sh
-xsv stats data.csv
+xsv sample 2 ./worldcitiespop.csv |
+    xsv select Country,City
+```
+
+```text
+Country,City
+us,provo
+lv,riga
+```
+
+```sh
+xsv sample 2 ./worldcitiespop.csv |
+    xsv select --no-headers 1,2
+```
+
+```text
+us,provo
+lv,riga
+```
+
+**Select Rows**
+
+Row indices are **0-based** and skip the header row (which is still printed),
+unless `--no-headers` is given.
+
+```sh
+# xsv slice -s <start> -l <howmany>
+xsv slice -s 0 -l 2 ./worldcitiespop.csv |
+    xsv table
+```
+
+```text
+Country  City        AccentCity  Region  Population  Latitude    Longitude
+ad       aixas       Aixàs       06                  42.4833333  1.4666667
+ad       aixirivali  Aixirivali  06                  42.4666667  1.5
+```
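+
+To make the indexing concrete, here is a rough `awk` stand-in for
+`xsv slice -s <start> -l <howmany>`. The names `slice_rows` and `tiny.csv` are
+hypothetical, the sketch assumes a CSV with no quoted newlines, and it is not
+how xsv itself is implemented:
+
+```sh
+# Hypothetical helper: slice_rows <start> <howmany> <file>
+slice_rows() {
+    awk -v s="$1" -v l="$2" '
+        # The header row always passes through.
+        NR == 1 { print; next }
+        # The first data row (line 2 of the file) is row 0.
+        { row = NR - 2 }
+        row >= s && row < s + l { print }
+    ' "$3"
+}
+
+# Made-up sample file with three data rows.
+printf 'Country,City\nad,aixas\nad,aixirivali\nad,aixovall\n' > ./tiny.csv
+
+# Prints the header plus rows 0 and 1 (aixas and aixirivali).
+slice_rows 0 2 ./tiny.csv
+```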
+
+**View Vertically**
+
+Like `\G` in the MySQL client.
+
+```sh
+xsv sample 2 ./worldcitiespop.csv |
+    xsv select Country,City |
+    xsv flatten
+```
+
+```text
+Country  us
+City     provo
+#
+Country  lv
+City     riga
+```
+
+**View as a Table**
+
+Aligns the columns, truncating any field longer than the `-c N` limit.
+
+```sh
+xsv search 'Provo|Riga' ./worldcitiespop.csv |
+    xsv table -c 10
+```
+
+```text
+Country  City           AccentCity     Region  Population  Latitude    Longitude
+us       provo          Provo          UT      105764      40.2338889  -111.65777...
+lv       riga           Riga           25      742570      56.95       24.1
+```
+
+### How to Query & Analyze Data
+
+**Count Number of Rows**
+
+```sh
+xsv count ./worldcitiespop.csv
+```
+
+```text
+3173958
+```
+
+**Search by Column Value**
+
+```sh
+# xsv search <RegExp> [--select 'Column 1,Column 2']
+xsv search '^(provo|riga)$' --select 'City' ./worldcitiespop.csv |
+    xsv search '^(us|lv)$' --select Country
+```
+
+**Join CSVs**
+
+Like a SQL INNER JOIN (`--left`, `--right`, `--full`, and `--cross` are also
+available).
+
+```sh
+# The equivalent of
+# SELECT *
+# FROM "countries-by-name"
+# INNER JOIN "countries-by-code"
+# ON "Country Code" = "Code"
+
+xsv join \
+    "Country Code" ./countries-by-name.csv \
+    "Code" ./countries-by-code.csv
+```
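+
+Neither file ships with the sample data, so here is a sketch that creates two
+tiny stand-ins to try the join on. The rows and the `Name` and `Continent`
+columns are made up; only `Country Code` and `Code` come from the example
+above:
+
+```sh
+# Hypothetical sample data for the join example.
+cat > ./countries-by-name.csv <<'EOF'
+Name,Country Code
+Andorra,ad
+Latvia,lv
+United States,us
+EOF
+
+cat > ./countries-by-code.csv <<'EOF'
+Code,Continent
+lv,Europe
+us,North America
+EOF
+
+# Run the join only when xsv is installed.
+if command -v xsv > /dev/null 2>&1; then
+    xsv join \
+        "Country Code" ./countries-by-name.csv \
+        "Code" ./countries-by-code.csv |
+        xsv table
+fi
+```
+
+The `ad` row has no match in the second file, so the inner join should leave
+it out; with `--left` it should appear padded with empty fields.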
+
+**Basic Statistics**
+
+```sh
+xsv stats ./worldcitiespop.csv |
+    xsv table -c 11
+```
+
+```text
+field       type     sum             min             max             min_length  max_length  mean            stddev
+Country     Unicode                  ad              zw              2           2
+City        Unicode                   bab el ahm...  Þykkvibaer      1           91
+AccentCity  Unicode                   Bâb el Ahm...  ïn Bou Chel...  1           91
+Region      Unicode                  00              Z9              0           2
+Population  Integer  2289584999      7               31480498        0           8           47719.57063...  302885.5592...
+Latitude    Float    86294096.37...  -54.933333      82.483333       1           12          27.18816580...  21.95261384...
+Longitude   Float    117718483.5...  -179.983333...  180             1           14          37.08885989...  63.22301045...
+```
+
+### Extended Help
+
+We don't generally print the full help, but `xsv` has so many subcommands that
+it's useful to be able to search it all on a single page:
+
+```text
+cat         Concatenate by row or column
+count       Count records
+fixlengths  Makes all records have same length
+flatten     Show one field per line
+fmt         Format CSV output (change field delimiter)
+frequency   Show frequency tables
+headers     Show header names
+index       Create CSV index for faster access
+input       Read CSV data with special quoting rules
+join        Join CSV files
+sample      Randomly sample CSV data
+search      Search CSV data with regexes
+select      Select columns from CSV
+slice       Slice records from CSV
+sort        Sort CSV data
+split       Split CSV data into many files
+stats       Compute basic statistics
+table       Align CSV data into columns
+```
+
+**How to view ALL help, at once**
+
+```sh
+xsv --list |
+    tail -n +2 |
+    grep -v '^$' |
+    cut -c 5- |
+    cut -d' ' -f1 |
+    xargs -I '{}' xsv '{}' --help
+```
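+
+To see what each stage of that pipeline contributes, here is the same
+name-extraction run against a stand-in for the `xsv --list` output. The
+listing shape (a heading line, then entries indented by 4 spaces) is an
+assumption, trimmed to three commands:
+
+```sh
+# Assumed shape of `xsv --list` output, shortened for illustration.
+list_output='Installed commands:
+    cat         Concatenate by row or column
+    count       Count records
+    table       Align CSV data into columns
+'
+
+# tail drops the heading line, grep drops blank lines,
+# the first cut strips the 4-space indent,
+# and the second cut keeps only the subcommand name.
+subcommands="$(
+    printf '%s\n' "$list_output" |
+        tail -n +2 |
+        grep -v '^$' |
+        cut -c 5- |
+        cut -d' ' -f1
+)"
+
+# Prints cat, count and table, one per line.
+echo "$subcommands"
+```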
+
+**All Usage and Options**
+
+(descriptions and common help omitted)
+
+#### `xsv cat --help`
+
+```text
+Usage:
+    xsv cat rows    [options] [<input>...]
+    xsv cat columns [options] [<input>...]
+    xsv cat --help
+
+cat options:
+    -p, --pad              When concatenating columns, this flag will cause
+                           all records to appear. It will pad each row if
+                           other CSV data isn't long enough.
+```
+
+#### `xsv count --help`
+
+```text
+Usage:
+    xsv count [options] [<input>]
+```
+
+#### `xsv fixlengths --help`
+
+```text
+Usage:
+    xsv fixlengths [options] [<input>]
+
+fixlengths options:
+    -l, --length <arg>     Forcefully set the length of each record. If a
+                           record is not the size given, then it is truncated
+                           or expanded as appropriate.
+```
+
+#### `xsv flatten --help`
+
+```text
+Usage:
+    xsv flatten [options] [<input>]
+
+flatten options:
+    -c, --condense <arg>  Limits the length of each field to the value
+                           specified. If the field is UTF-8 encoded, then
+                           <arg> refers to the number of code points.
+                           Otherwise, it refers to the number of bytes.
+    -s, --separator <arg>  A string of characters to write after each record.
+                           When non-empty, a new line is automatically
+                           appended to the separator.
+                           [default: #]
+```
+
+#### `xsv fmt --help`
+
+```text
+Usage:
+    xsv fmt [options] [<input>]
+
+fmt options:
+    -t, --out-delimiter <arg>  The field delimiter for writing CSV data.
+                               [default: ,]
+    --crlf                     Use '\r\n' line endings in the output.
+    --ascii                    Use ASCII field and record separators.
+    --quote <arg>              The quote character to use. [default: "]
+    --quote-always             Put quotes around every value.
+    --escape <arg>             The escape character to use. When not specified,
+                               quotes are escaped by doubling them.
+```
+
+#### `xsv frequency --help`
+
+```text
+Usage:
+    xsv frequency [options] [<input>]
+
+frequency options:
+    -s, --select <arg>     Select a subset of columns to compute frequencies
+                           for. See 'xsv select --help' for the format
+                           details. This is provided here because piping 'xsv
+                           select' into 'xsv frequency' will disable the use
+                           of indexing.
+    -l, --limit <arg>      Limit the frequency table to the N most common
+                           items. Set to '0' to disable a limit.
+                           [default: 10]
+    -a, --asc              Sort the frequency tables in ascending order by
+                           count. The default is descending order.
+    --no-nulls             Don't include NULLs in the frequency table.
+    -j, --jobs <arg>       The number of jobs to run in parallel.
+                           This works better when the given CSV data has
+                           an index already created. Note that a file handle
+                           is opened for each job.
+                           When set to '0', the number of jobs is set to the
+                           number of CPUs detected.
+                           [default: 0]
+```
+
+#### `xsv headers --help`
+
+```text
+Usage:
+    xsv headers [options] [<input>...]
+
+headers options:
+    -j, --just-names       Only show the header names (hide column index).
+                           This is automatically enabled if more than one
+                           input is given.
+    --intersect            Shows the intersection of all headers in all of
+                           the inputs given.
+```
+
+#### `xsv index --help`
+
+```text
+Usage:
+    xsv index [options] <input>
+    xsv index --help
+
+index options:
+    -o, --output <file>    Write index to <file> instead of <input>.idx.
+                           Generally, this is not currently useful because
+                           the only way to use an index is if it is specially
+                           named <input>.idx.
+```
+
+#### `xsv input --help`
+
+```text
+Usage:
+    xsv input [options] [<input>]
+
+input options:
+    --quote <arg>          The quote character to use. [default: "]
+    --escape <arg>         The escape character to use. When not specified,
+                           quotes are escaped by doubling them.
+    --no-quoting           Disable quoting completely.
+```
+
+#### `xsv join --help`
+
+```text
+Usage:
+    xsv join [options] <columns1> <input1> <columns2> <input2>
+    xsv join --help
+
+join options:
+    --no-case              When set, joins are done case insensitively.
+    --left                 Do a 'left outer' join. This returns all rows in
+                           first CSV data set, including rows with no
+                           corresponding row in the second data set. When no
+                           corresponding row exists, it is padded out with
+                           empty fields.
+    --right                Do a 'right outer' join. This returns all rows in
+                           second CSV data set, including rows with no
+                           corresponding row in the first data set. When no
+                           corresponding row exists, it is padded out with
+                           empty fields. (This is the reverse of 'outer left'.)
+    --full                 Do a 'full outer' join. This returns all rows in
+                           both data sets with matching records joined. If
+                           there is no match, the missing side will be padded
+                           out with empty fields. (This is the combination of
+                           'outer left' and 'outer right'.)
+    --cross                USE WITH CAUTION.
+                           This returns the cartesian product of the CSV
+                           data sets given. The number of rows return is
+                           equal to N * M, where N and M correspond to the
+                           number of rows in the given data sets, respectively.
+    --nulls                When set, joins will work on empty fields.
+                           Otherwise, empty fields are completely ignored.
+                           (In fact, any row that has an empty field in the
+                           key specified is ignored.)
+```
+
+#### `xsv sample --help`
+
+```text
+Usage:
+    xsv sample [options] <sample-size> [<input>]
+    xsv sample --help
+
+Common options:
+    -h, --help             Display this message
+    -o, --output <file>    Write output to <file> instead of stdout.
+    -n, --no-headers       When set, the first row will be consider as part of
+                           the population to sample from. (When not set, the
+                           first row is the header row and will always appear
+                           in the output.)
+    -d, --delimiter <arg>  The field delimiter for reading CSV data.
+                           Must be a single character. (default: ,)
+```
+
+#### `xsv search --help`
+
+```text
+Usage:
+    xsv search [options] <regex> [<input>]
+    xsv search --help
+
+search options:
+    -i, --ignore-case      Case insensitive search. This is equivalent to
+                           prefixing the regex with '(?i)'.
+    -s, --select <arg>     Select the columns to search. See 'xsv select -h'
+                           for the full syntax.
+    -v, --invert-match     Select only rows that did not match
+```
+
+#### `xsv select --help`
+
+```text
+  Select the first and fourth columns:
+  $ xsv select 1,4
+
+  Select the first 4 columns (by index and by name):
+  $ xsv select 1-4
+  $ xsv select Header1-Header4
+
+  Ignore the first 2 columns (by range and by omission):
+  $ xsv select 3-
+  $ xsv select '!1-2'
+
+  Select the third column named 'Foo':
+  $ xsv select 'Foo[2]'
+
+  Re-order and duplicate columns arbitrarily:
+  $ xsv select 3-1,Header3-Header1,Header1,Foo[2],Header1
+
+  Quote column names that conflict with selector syntax:
+  $ xsv select '"Date - Opening","Date - Actual Closing"'
+
+Usage:
+    xsv select [options] [--] <selection> [<input>]
+    xsv select --help
+```
+
+#### `xsv slice --help`
+
+```text
+Usage:
+    xsv slice [options] [<input>]
+
+slice options:
+    -s, --start <arg>      The index of the record to slice from.
+    -e, --end <arg>        The index of the record to slice to.
+    -l, --len <arg>        The length of the slice (can be used instead
+                           of --end).
+    -i, --index <arg>      Slice a single record (shortcut for -s N -l 1).
+```
+
+#### `xsv sort --help`
+
+```text
+Usage:
+    xsv sort [options] [<input>]
+
+sort options:
+    -s, --select <arg>     Select a subset of columns to sort.
+                           See 'xsv select --help' for the format details.
+    -N, --numeric          Compare according to string numerical value
+    -R, --reverse          Reverse order
+```
+
+#### `xsv split --help`
+
+```text
+Usage:
+    xsv split [options] <outdir> [<input>]
+    xsv split --help
+
+split options:
+    -s, --size <arg>       The number of records to write into each chunk.
+                           [default: 500]
+    -j, --jobs <arg>       The number of spliting jobs to run in parallel.
+                           This only works when the given CSV data has
+                           an index already created. Note that a file handle
+                           is opened for each job.
+                           When set to '0', the number of jobs is set to the
+                           number of CPUs detected.
+                           [default: 0]
+    --filename <filename>  A filename template to use when constructing
+                           the names of the output files.  The string '{}'
+                           will be replaced by a value based on the value
+                           of the field, but sanitized for shell safety.
+                           [default: {}.csv]
+```
+
+#### `xsv stats --help`
+
+```text
+Usage:
+    xsv stats [options] [<input>]
+
+stats options:
+    -s, --select <arg>     Select a subset of columns to compute stats for.
+                           See 'xsv select --help' for the format details.
+                           This is provided here because piping 'xsv select'
+                           into 'xsv stats' will disable the use of indexing.
+    --everything           Show all statistics available.
+    --mode                 Show the mode.
+                           This requires storing all CSV data in memory.
+    --cardinality          Show the cardinality.
+                           This requires storing all CSV data in memory.
+    --median               Show the median.
+                           This requires storing all CSV data in memory.
+    --nulls                Include NULLs in the population size for computing
+                           mean and standard deviation.
+    -j, --jobs <arg>       The number of jobs to run in parallel.
+                           This works better when the given CSV data has
+                           an index already created. Note that a file handle
+                           is opened for each job.
+                           When set to '0', the number of jobs is set to the
+                           number of CPUs detected.
+                           [default: 0]
+```
+
+#### `xsv table --help`
+
+```text
+Usage:
+    xsv table [options] [<input>]
+
+table options:
+    -w, --width <arg>      The minimum width of each column.
+                           [default: 2]
+    -p, --pad <arg>        The minimum number of spaces between each column.
+                           [default: 2]
+    -c, --condense <arg>  Limits the length of each field to the value
+                           specified. If the field is UTF-8 encoded, then
+                           <arg> refers to the number of code points.
+                           Otherwise, it refers to the number of bytes.
 ```