pigpen.core

The core PigPen operations. These are the primary functions that you use to
build a PigPen query.

cogroup

macro

added in 0.1.0

(cogroup selects f)(cogroup selects f opts)
Joins many relations together by a common key. Each relation specifies a
key-selector function on which to join. A combiner function is applied to each
join key and all values from each relation that match that join key. This is
similar to join, without flattening the data. Optionally takes a map of options.

  Example:

    (pig/cogroup [(foo :on :a)
                  (bar :on :b, :type :required, :fold (fold/count))]
                 (fn [key foos bar-count] ...)
                 {:parallel 20})

In this example, foo and bar are other pig queries and :a and :b are the
key-selector functions for foo and bar, respectively. These can be any
functions - not just keywords. There can be more than two select clauses.
By default, a matching key value from eatch source relation is optional,
meaning that keys don't have to exist in all source relations to be part of the
output. To specify a relation as required, add 'required' to the select clause.
The third argument is a function used to consolidate matching key values. For
each uniqe key value, this function is called with the value of the key and all
values with that key from foo and bar. As such, foos and bars are both
collections. The last argument is an optional map of options. A fold function
can be specified to aggregate groupings in parallel. See pigpen.fold for more
info on fold functions.

  Options:

    :parallel - The degree of parallelism to use (pig only)
    :join-nils - Whether nil keys from each relation should be treated as equal

  See also: pigpen.core/join, pigpen.core/group-by

concat

added in 0.1.0

(concat relations+)
Concatenates all relations provided. Does not guarantee any ordering of the
relations. Identical to pigpen.core/union-multiset.

  Example:

    (pig/concat
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 5]

  See also: pigpen.core/union, pigpen.core/distinct, pigpen.core/union-multiset

constantly

added in 0.1.0

(constantly data)
Returns a function that takes any number of arguments and returns a constant
set of data as if it had been loaded by pigpen. This is useful for testing,
but not supported in generated scripts. The parameter 'data' must be a sequence.
The values of 'data' can be any clojure type.

  Example:

    (pig/constantly [1 2 3])
    (pig/constantly [{:a 123} {:b 456}])

  See also: pigpen.core/return

difference

added in 0.1.0

(difference opts? relations+)
Performs a set difference on all relations provided and returns the distinct
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/difference
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2])
      (pig/return [3]))

    => [4 5]

  Options:

    :parallel - The degree of parallelism to use (pig only)

  See also: pigpen.core/difference-multiset, pigpen.core/intersection

difference-multiset

added in 0.1.0

(difference-multiset opts? relations+)
Performs a multiset difference on all relations provided and returns all
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/difference-multiset
      (pig/return [1 2 2 3 3 3 3 4 5])
      (pig/return [1 2 3])
      (pig/return [1 2 3]))

    => [3 3 4 5]

  Options:

    :parallel - The degree of parallelism to use (pig only)

  See also: pigpen.core/difference, pigpen.core/intersection

distinct

macro

added in 0.1.0

(distinct relation)(distinct opts relation)
Returns a relation with the distinct values of relation. Optionally takes a
map of options.

  Example:

    (pig/distinct foo)
    (pig/distinct {:parallel 20} foo)

  Options:

    :parallel - The degree of parallelism to use (pig only)
    :partition-by - A partition function to use. Should take the form:
      (fn [n key] (mod (hash key) n)) Where n is the number of partitions and
      key is the key to partition.

  See also: pigpen.core/union, pigpen.core/union-multiset, pigpen.core/filter

dump

added in 0.3.0

(dump query)(dump opts query)
Executes a script locally and returns the resulting values as a clojure
sequence. This command is very useful for unit tests.

  Example:

    (->>
      (pig/load-clj "input.clj")
      (pig/map inc)
      (pig/filter even?)
      (pig/dump)
      (clojure.core/map #(* % %))
      (clojure.core/filter even?))

    (deftest test-script
      (is (= (->>
               (pig/load-clj "input.clj")
               (pig/map inc)
               (pig/filter even?)
               (pig/dump))
             [2 4 6])))

  Note: pig/store commands return the output data
        pig/store-many commands merge their results

  Note: The original rx pigpen.core/dump command is now pigpen.rx/dump. This
        implementation uses lazy seqs instead.

filter

macro

added in 0.1.0

(filter pred relation)
Returns a relation that only contains the items for which (pred item)
returns true.

  Example:

    (pig/filter even? foo)
    (pig/filter (fn [x] (even? (* x x))) foo)

  See also: pigpen.core/remove, pigpen.core/take, pigpen.core/sample,
            pigpen.core/distinct, pigpen.core/filter-by

filter-by

macro

added in 0.2.3

(filter-by key-selector keys relation)(filter-by key-selector keys opts relation)
Filters a relation by the keys in another relation. The key-selector function
is applied to each element of relation. If the resulting key is present in keys,
the value is kept. Otherwise it is dropped. nils are dropped or preserved based
on whether there is a nil value present in keys. This operation is referred to
as a semi-join in relational databases.

  Example:

    (let [keys (pig/return [1 3 5])
          data (pig/return [{:k 1, :v "a"}
                            {:k 2, :v "b"}
                            {:k 3, :v "c"}
                            {:k 4, :v "d"}
                            {:k 5, :v "e"}])]
      (pig/filter-by :k keys data))

    => (pig/dump *1)
    [{:k 1, :v "a"}
     {:k 3, :v "c"}
     {:k 5, :v "e"}]

  Options:

    :parallel - The degree of parallelism to use (pig only)

  Note: keys must be distinct before this is used or you will get duplicate values.
  Note: Unlike filter, this joins relation with keys and can be potentially expensive.

  See also: pigpen.core/filter, pigpen.core/remove-by, pigpen.core/intersection

fold

macro

added in 0.2.0

(fold reducef relation)(fold combinef reducef relation)
Computes a parallel reduce of the relation. This is done in multiple stages
using reducef and combinef. First, combinef is called with no args to produce a
seed value. Then, reducef reduces portions of the data using that seed value.
Finally, combinef is used to reduce each of the intermediate values. If combinef
is not specified, reducef is used for both. Fold functions defined using
pigpen.fold/fold-fn can also be used.

  Example:

    (pig/fold + foo)
    (pig/fold + (fn [acc _] (inc acc)) foo)
    (pig/fold (fold/fold-fn + (fn [acc _] (inc acc))) foo)

  See pigpen.fold for more info on fold functions.

  Note: Folding an empty sequence will always return an empty sequence:

		=> (->>
		     (pig/return [])
		     (pig/fold (fold/count))
		     (pig/dump))
		[]

group-by

macro

added in 0.1.0

(group-by key-selector relation)(group-by key-selector opts relation)
Groups relation by the result of calling (key-selector item) for each item.
This produces a sequence of map entry values, similar to using seq with a
map. Each value will be a lazy sequence of the values that match key.
Optionally takes a map of options, including :parallel and :fold.

  Example:

    (pig/group-by :a foo)
    (pig/group-by count {:parallel 20} foo)

  Options:

    :parallel - The degree of parallelism to use (pig only)

  See also: pigpen.core/cogroup

  See pigpen.fold for more info on :fold options.

intersection

added in 0.1.0

(intersection opts? relations+)
Performs an intersection on all relations provided and returns the distinct
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/intersection
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 2 3]

  Options:

    :parallel - The degree of parallelism to use (pig only)

  See also: pigpen.core/intersection-multiset, pigpen.core/difference

intersection-multiset

added in 0.1.0

(intersection-multiset opts? relations+)
Performs a multiset intersection on all relations provided and returns all
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/intersection-multiset
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 2 2 3 3]

  Options:

    :parallel - The degree of parallelism to use (pig only)

  See also: pigpen.core/intersection, pigpen.core/difference

into

macro

added in 0.1.0

(into to relation)
Returns a new relation with all values from relation conjoined onto to.

Note: This operation uses a single reducer and won't work for large datasets.

See also: pigpen.core/reduce

Note: Reducing an empty sequence will always return an empty sequence:

=> (->>
     (pig/return [])
     (pig/into {})
     (pig/dump))
[]

join

macro

added in 0.1.0

(join selects f)(join selects f opts)
Joins many relations together by a common key. Each relation specifies a
key-selector function on which to join. A function is applied to each join
key and each pair of values from each relation that match that join key.
Optionally takes a map of options.

  Example:

    (pig/join [(foo :on :a)
               (bar :on :b :type :optional)]
              (fn [f b] ...)
              {:parallel 20})

In this example, foo and bar are other pig queries and :a and :b are the
key-selector functions for foo and bar, respectively. These can be any
functions - not just keywords. There can be more than two select clauses.
By default, a matching key value from eatch source relation is required,
meaning that they must exist in all source relations to be part of the output.
To specify a relation as optional, add 'optional' to the select clause. The
third argument is a function used to consolidate matching key values. For each
uniqe key value, this function is called with each set of values from the cross
product of each source relation. By default, this does a standard inner join.
Use 'optional' to do outer joins. The last argument is an optional map of
options.

  Options:

    :parallel - The degree of parallelism to use (pig only)
    :join-nils - Whether nil keys from each relation should be treated as equal

  See also: pigpen.core/cogroup, pigpen.core/union

keys-fn

macro

added in 0.3.0

(keys-fn & body)
Creates a named anonymous function. Useful as a terse substitute for keys
destructuring.

Similar to an anonymous function, which uses positional names for arg:

  #(assoc %1 :foo %2 :bar %3)

`keys-fn` uses named variables that are keys in the input map:

  (keys-fn
    (assoc %
      :foo-copy %foo
      :bar-x-2  (* %bar 2)))

To compare, this is the same function using keys destructuring:

  (fn [{:keys [foo bar], :as value}]
    (assoc value
      :foo-copy foo
      :bar-x-2  (* bar 2)))

When using a large number of destructured variables, this can make a noticeable
difference in code size and readability. This macro simply re-writes the first
form into the second.

load-clj

added in 0.1.0

(load-clj location)
Loads clojure data from a file. Each line should contain one value and will
be parsed using clojure.edn/read-string into a value.

  Example:

    (pig/load-clj "input.clj")

  See also: pigpen.core/load-string, pigpen.core/load-tsv, pigpen.core/load-json

  See: https://github.com/edn-format/edn

load-csv

added in 0.2.12

(load-csv location)(load-csv location separator quotor)
Loads data from a csv file. Each line is returned as a vector of strings,
split according to RFC4180(*). The default separator is \, and quote is \".

  Note: Newlines within cells are not supported due to line-based splitting of files.

  Example:

    (pig/load-csv "input.csv")
    (pig/load-tsv "input.csv" \, \")

  See also: pigpen.core/load-string, pigpen.core/load-tsv, pigpen.core/load-clj, pigpen.core/load-json

load-json

macro

added in 0.2.3

(load-json location)(load-json location opts)
Loads json data from a file. Each line should contain one value and will be
parsed using clojure.data.json/read-str into a value. Options can be passed to
read-str as a map. The default options used are {:key-fn keyword}.

  Example:

    (pig/load-json "input.json")

  See also: pigpen.core/load-string, pigpen.core/load-tsv, pigpen.core/load-clj

load-lazy

added in 0.1.0

(load-lazy location)(load-lazy location delimiter)
Loads data from a tsv file. Each line is returned as a lazy seq, split by
the specified delimiter. The default delimiter is \t.

  See also: pigpen.core/load-tsv

load-string

added in 0.2.3

(load-string location)
Loads data from a file. Each line is returned as a string.

Example:

  (pig/load-string "input.txt")

See also: pigpen.core/load-tsv, pigpen.core/load-clj, pigpen.core/load-json

load-tsv

added in 0.1.0

(load-tsv location)(load-tsv location delimiter)
Loads data from a tsv file. Each line is returned as a vector of strings,
split by the specified regex delimiter. The default delimiter is #"\t".

  Example:

    (pig/load-tsv "input.tsv")
    (pig/load-tsv "input.csv" #",")

  See also: pigpen.core/load-string, pigpen.core/load-clj, pigpen.core/load-json

map

macro

added in 0.1.0

(map f relation)
Returns a relation of f applied to every item in the source relation.
Function f should be a function of one argument.

  Example:

    (pig/map inc foo)
    (pig/map (fn [x] (* x x)) foo)

  Note: Unlike clojure.core/map, pigpen.core/map takes only one relation. This
is due to the fact that there is no defined order in pigpen. See pig/join,
pig/cogroup, and pig/union for combining sets of data.

  See also: pigpen.core/mapcat, pigpen.core/map-indexed, pigpen.core/join,
            pigpen.core/cogroup, pigpen.core/union

map-indexed

macro

added in 0.1.0

(map-indexed f relation)(map-indexed f opts relation)
Returns a relation of applying f to the the index and value of every item in
the source relation. Function f should be a function of two arguments: the index
and the value. If you require sequential ids, use option {:dense true}.

  Example:

    (pig/map-indexed (fn [i x] (* i x)) foo)
    (pig/map-indexed vector {:dense true} foo)

  Options:

    :dense - force sequential ids (pig only)

  Note: If you require sorted data, use sort or sort-by immediately before
        this command.

  Note: Pig will assign the same index to any equal values, regardless of how
        many times they appear.

  Note: The cascading implementation of map-indexed uses a single reducer

  See also: pigpen.core/sort, pigpen.core/sort-by, pigpen.core/map, pigpen.core/mapcat

mapcat

macro

added in 0.1.0

(mapcat f relation)
Returns the result of applying concat, or flattening, the result of applying
f to each item in relation. Thus f should return a collection.

  Example:

    (pig/mapcat (fn [x] [(dec x) x (inc x)]) foo)

  See also: pigpen.core/map, pigpen.core/map-indexed

reduce

macro

added in 0.1.0

(reduce f relation)(reduce f val relation)
Reduce all items in relation into a single value. Follows semantics of
clojure.core/reduce. If a sequence is returned, it is kept as a single value
for further processing.

  Example:

    (pig/reduce + foo)
    (pig/reduce conj [] foo)

  Note: This operation uses a single reducer and won't work for large datasets.
        Use pig/fold to do a parallel reduce.

  See also: pigpen.core/fold, pigpen.core/into

  Note: Reducing an empty sequence will always return an empty sequence:

		=> (->>
		     (pig/return [])
		     (pig/reduce +)
		     (pig/dump))
		[]

remove

macro

added in 0.1.0

(remove pred relation)
Returns a relation without items for which (pred item) returns true.

Example:

  (pig/remove even? foo)
  (pig/remove (fn [x] (even? (* x x))) foo)

See also: pigpen.core/filter, pigpen.core/take, pigpen.core/sample,
          pigpen.core/distinct, pigpen.core/remove-by

remove-by

macro

added in 0.2.3

(remove-by key-selector keys relation)(remove-by key-selector keys opts relation)
Filters a relation by the keys in another relation. The key-selector function
is applied to each element of relation. If the resulting key is _not_ present in
keys, the value is kept. Otherwise it is dropped. nils are dropped or preserved
based on whether there is a nil value present in keys. This operation is
referred to as an anti-join in relational databases.

  Example:

    (let [keys (pig/return [1 3 5])
          data (pig/return [{:k 1, :v "a"}
                            {:k 2, :v "b"}
                            {:k 3, :v "c"}
                            {:k 4, :v "d"}
                            {:k 5, :v "e"}])]
      (pig/remove-by :k keys data))

    => (pig/dump *1)
    [{:k 2, :v "b"}
     {:k 4, :v "d"}]

  Options:

    :parallel - The degree of parallelism to use (pig only)

  Note: Unlike remove, this joins relation with keys and can be potentially expensive.

  See also: pigpen.core/remove, pigpen.core/filter-by, pigpen.core/difference

return

added in 0.1.0

(return data)
Returns a constant set of data as a pigpen relation. This is useful for
testing, but not supported in generated scripts. The parameter 'data' must be a
sequence. The values of 'data' can be any clojure type.

  Example:

    (pig/constantly [1 2 3])
    (pig/constantly [{:a 123} {:b 456}])

  See also: pigpen.core/constantly

sample

added in 0.1.0

(sample p relation)
Samples the input records by p percentage. This is non-deterministic;
different values may selected on subsequent runs. p should be a value
between 0.0 and 1.0

  Example:

    (pig/sample 0.01 foo)

  Note: This is potentially an expensive operation when run locally.

  See also: pigpen.core/filter, pigpen.core/take

sort

macro

added in 0.1.0

(sort relation)(sort comp relation)(sort comp opts relation)
Sorts the data with an optional comparator. Takes an optional map of options.

Example:

  (pig/sort foo)
  (pig/sort :desc foo)
  (pig/sort :desc {:parallel 20} foo)

Notes:
  The default comparator is :asc (ascending sort order).
  Only :asc and :desc are supported comparators.
  The values must be primitive values (string, int, etc).
  Maps, vectors, etc are not supported.

Options:

  :parallel - The degree of parallelism to use (pig only)

Note: The cascading implementation of sort uses a single reducer

See also: pigpen.core/sort-by

sort-by

macro

added in 0.1.0

(sort-by key-fn relation)(sort-by key-fn comp relation)(sort-by key-fn comp opts relation)
Sorts the data by the specified key-fn with an optional comparator. Takes an
optional map of options.

  Example:

    (pig/sort-by :a foo)
    (pig/sort-by #(count %) :desc foo)
    (pig/sort-by (fn [x] (* x x)) :desc {:parallel 20} foo)

  Notes:
    The default comparator is :asc (ascending sort order).
    Only :asc and :desc are supported comparators.
    The key-fn values must be primitive values (string, int, etc).
    Maps, vectors, etc are not supported.

  Options:

    :parallel - The degree of parallelism to use (pig only)

  Note: The cascading implementation of sort-by uses a single reducer

  See also: pigpen.core/sort

store-clj

added in 0.1.0

(store-clj location relation)
Stores the relation into location using edn (clojure format). Each value is
written as a single line.

  Example:

    (pig/store-clj "output.clj" foo)

  See also: pigpen.core/store-string, pigpen.core/store-tsv, pigpen.core/store-json

  See: https://github.com/edn-format/edn

store-json

macro

added in 0.2.3

(store-json location relation)(store-json location opts relation)
Stores the relation into location using clojure.data.json. Each value is
written as a single line. Options can be passed to write-str as a map.

  Example:

    (pig/store-json "output.json" foo)

  See also: pigpen.core/store-string, pigpen.core/store-tsv, pigpen.core/store-clj

store-many

added in 0.1.0

(store-many outputs+)
Combines multiple store commands into a single script. This is not required
if you have a single output.

  Example:

    (pig/store-many
      (pig/store-tsv "foo.tsv" foo)
      (pig/store-clj "bar.clj" bar))

  Note: When run locally, this will merge the results of any source relations.

store-string

added in 0.2.3

(store-string location relation)
Stores the relation into location as a string. Each value is written as a
single line.

  Example:

    (pig/store-string "output.txt" foo)

  See also: pigpen.core/store-tsv, pigpen.core/store-clj, pigpen.core/store-json

store-tsv

added in 0.1.0

(store-tsv location relation)(store-tsv location delimiter relation)
Stores the relation into location as a tab-delimited file. Thus, each input
value must be sequential. Complex values are stored as edn (clojure format).
Single string values are not quoted. You may optionally pass a different delimiter.

  Example:

    (pig/store-tsv "output.tsv" foo)
    (pig/store-tsv "output.csv" "," foo)

  See also: pigpen.core/store-string, pigpen.core/store-clj, pigpen.core/store-json

take

added in 0.1.0

(take n relation)
Limits the number of records to n items.

Example:

  (pig/take 200 foo)

Note: This is potentially an expensive operation when run on the server.

See also: pigpen.core/filter, pigpen.core/sample

union

added in 0.1.0

(union opts? relations+)
Performs a union on all relations provided and returns the distinct results.
Optionally takes a map of options as the first parameter.

  Example:

    (pig/union
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 2 3 4 5]

  Options:

    :parallel - the degree of parallelism to use (pig only)

  See also: pigpen.core/union-multiset, pigpen.core/distinct

union-multiset

added in 0.1.0

(union-multiset relations+)
Performs a union on all relations provided and returns all results.
Identical to pigpen.core/concat.

  Example:

    (pig/union-multiset
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 5]

  See also: pigpen.core/union, pigpen.core/distinct, pigpen.core/concat