pigpen.core.op

*** ALPHA - Subject to change ***

  The raw pigpen operators. These are the basic building blocks that platforms
implement. All higher level operators are defined in terms of these operators.
These should be used to build custom PigPen operators. In these examples, fields
refers to the fields that the underlying platform is aware of. Usually this is a
single user field that represents arbitrary Clojure data.

  Note: You most likely don't want this namespace. Unless you are doing advanced
things, stick to pigpen.core.

bind$

added in 0.3.0

(bind$ func opts relation)
(bind$ requires func opts relation)
Inputs: ([func opts relation] [requires func opts relation])
  Returns: m/Bind$

  The way to apply user code to a relation. `func` should be a function that
takes a collection of arguments, and returns a collection of result tuples.
Optionally takes a collection of namespaces to require before executing user
code.

  Example:

    (bind$
      (fn [args]
        ;; do stuff to args, return a value like this:
        [[foo-value bar-value]   ;; result 1
         [foo-value bar-value]   ;; result 2
         ...
         [foo-value bar-value]]) ;; result N
      {:args  '[x y]
       :alias '[foo bar]}
      relation)

  In this example, our function takes `args` which is a tuple of argument values
from the previous relation. Here, this selects the fields x and y. The function
then returns 0-to-many result tuples. Each of those tuples maps to the fields
specified by the alias option. If not specified, args defaults to the fields of
the input relation and alias defaults to a single field `value`. All field names
should be symbols.
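
  The contract described above can be sketched in plain Clojure, without any
PigPen dependency. The helper `apply-bind` and the function `sum-and-product`
are hypothetical names used only for illustration; they model records as maps
from field symbol to value, which is an assumption rather than PigPen's
internal representation.

```clojure
;; Plain-Clojure sketch; `apply-bind` is a hypothetical helper, not part of PigPen.
(defn apply-bind
  "Applies the bind-style function `f` to each record (assumed here to be a map
  of field symbol to value). The :args fields are selected as the input tuple,
  and each result tuple is labelled with the :alias fields."
  [f {:keys [args alias]} records]
  (for [record records
        tuple  (f (mapv record args))]
    (zipmap alias tuple)))

;; A bind function: takes a tuple of args, returns zero or more result tuples.
(defn sum-and-product [[x y]]
  [[(+ x y) (* x y)]])

(apply-bind sum-and-product
            {:args '[x y] :alias '[foo bar]}
            [{'x 1 'y 2} {'x 3 'y 4}])
;; => ({foo 3, bar 2} {foo 7, bar 12})
```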

  Several bind helper functions are provided. For example, map->bind takes a
normal map function of one arg to one result and converts it to a bind function.

    (bind$
      (map->bind (fn [x] (* x x)))
      {}
      data)

  See also: pigpen.core.fn/map->bind, pigpen.core.fn/mapcat->bind,
            pigpen.core.fn/filter->bind, pigpen.core.fn/process->bind,
            pigpen.core.fn/key-selector->bind,
            pigpen.core.fn/keyword-field-selector->bind,
            pigpen.core.fn/indexed-field-selector->bind

code$

added in 0.3.0

(code$ udf init func args)
Inputs: [udf init func args]
  Returns: m/CodeExpr

  Encapsulation for user code. Used with projection-func$ and project$. You
probably want bind$ instead of this.

The parameter `udf` should be one of:

  :seq    - returns zero or more values
  :fold   - apply a fold aggregation

The parameter `init` is code to be executed once before the user code, `func` is
the user code to execute, and `args` specifies which fields should be passed to
`func`. `args` can also contain strings, which are passed through as constants
to the user code. The result of `func` should be in the same format as bind$.

  Example:

    (code$ :seq '(require 'my-ns.core) '(fn [args] ...) ['c 'd])

  See also: pigpen.core.op/project$, pigpen.core.op/projection-func$

concat$

added in 0.3.0

(concat$ opts ancestors)
Inputs: [opts ancestors]
  Returns: (s/either m/Concat$ m/Op)

  Concatenates the set of ancestor relations together. The fields produced by
the concat operation are the fields of the first relation.

  Example:

    (concat$ {} [relation1 relation2])

  See also: pigpen.core.op/distinct$

distinct$

added in 0.3.0

(distinct$ opts relation)
Inputs: [opts relation]
Returns: m/Distinct$

Returns the distinct values in relation.

Example:

  (distinct$ {} relation)

See also: pigpen.core.op/concat$

filter->bind

added in 0.3.0

(filter->bind f)
For use with pigpen.core.op/bind$

Takes a filter-style function (one that takes a single input and returns a
boolean output) and returns a bind function that performs the same logic.

  Example:

    (filter->bind (fn [x] (even? x)))
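
To make the conversion concrete, here is a plain-Clojure sketch of how a
filter-style function could be wrapped as a bind function. The name
`filter->bind-sketch` is a hypothetical stand-in; the real filter->bind may
differ in its details.

```clojure
;; Hypothetical stand-in for illustration; PigPen's own filter->bind may differ.
(defn filter->bind-sketch [f]
  (fn [args]
    (if (apply f args)
      [args]   ; keep: one result tuple, the input tuple itself
      [])))    ; drop: zero result tuples

(def keep-even (filter->bind-sketch even?))

(keep-even [2]) ;; => [[2]]
(keep-even [3]) ;; => []
```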

  See also: pigpen.core.op/bind$

group$

added in 0.3.0

(group$ field-dispatch join-types opts ancestors)
Inputs: [field-dispatch join-types opts ancestors]
  Returns: m/Group$

  Performs a cogroup on the ancestors provided. The parameter `field-dispatch`
should be one of the following, and produces the following output fields:

  :group - [group r0/key r0/value ... rN/key rN/value]
  :join  - [r0/key r0/value ... rN/key rN/value]
  :set   - [r0/value ... rN/value]

The parameter `join-types` is a vector of keywords (:required or :optional)
specifying if each relation is required or optional. The length of join-types
must match the number of relations passed.

  Example:

    (group$
      :group
      [:required :optional]
      {}
      [relation1 relation2])

In this example, the operation performs a cogroup on relation1 and relation2.
The `:group` field-dispatch means that each of those relations will provide the
fields `key` and `value`, and the operation will add a `group` field.
The first relation is marked as required and the second is optional.

  See also: pigpen.core.op/join$

indexed-field-selector->bind

added in 0.3.0

(indexed-field-selector->bind n f)
For use with pigpen.core.op/bind$

Selects the first n fields and projects them as fields. The input relation
should have a single field, which is sequential. Applies f to the remaining args.

  Example:

    (indexed-field-selector->bind 2 pr-str)
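
The description above can be sketched in plain Clojure as follows. The name
`indexed-field-selector->bind-sketch` is hypothetical, and the sketch assumes
the single input field arrives as the only element of the args tuple.

```clojure
;; Hypothetical stand-in for illustration; not PigPen's implementation.
(defn indexed-field-selector->bind-sketch [n f]
  (fn [[values]]
    ;; one result tuple: the first n values, then f applied to the rest
    [(conj (vec (take n values))
           (f (drop n values)))]))

((indexed-field-selector->bind-sketch 2 pr-str) [[1 2 3 4]])
;; => [[1 2 "(3 4)"]]
```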

  See also: pigpen.core.op/bind$

join$

added in 0.3.0

(join$ field-dispatch join-types opts ancestors)
Inputs: [field-dispatch join-types opts ancestors]
  Returns: m/Join$

  Performs a join on the ancestors provided. The parameter `field-dispatch`
should be one of the following, and produces the following output fields:

  :group - [group r0/key r0/value ... rN/key rN/value]
  :join  - [r0/key r0/value ... rN/key rN/value]
  :set   - [r0/value ... rN/value]

The parameter `join-types` is a vector of keywords (:required or :optional)
specifying if each relation is required or optional. The length of join-types
must match the number of relations passed.

  Example:

    (join$
      :join
      [:required :optional]
      {}
      [relation1 relation2])

In this example, the operation performs a join on relation1 and relation2.
The `:join` field-dispatch means that each of those relations will provide the
fields `key` and `value`. The first relation is marked as required and the
second is optional.
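
As a rough illustration of what `[:required :optional]` implies, the following
plain-Clojure sketch joins two in-memory relations of `[key value]` tuples. The
name `join-sketch` is hypothetical, and it assumes that a :required relation
must contain the key for a row to be emitted while an :optional one is padded
with nils (a left outer join in this example). It only covers two relations
and the `:join` output shape.

```clojure
;; Hypothetical in-memory model of a two-way join; not PigPen's implementation.
(defn join-sketch
  [[left-type right-type] left right]
  (let [l  (group-by first left)
        r  (group-by first right)
        ks (distinct (concat (map first left) (map first right)))]
    (for [k ks
          :let  [ls (get l k)
                 rs (get r k)]
          ;; a missing side is only tolerated when that side is :optional
          :when (and (or ls (= left-type :optional))
                     (or rs (= right-type :optional)))
          [lk lv] (or ls [[nil nil]])
          [rk rv] (or rs [[nil nil]])]
      ;; output shape: [r0/key r0/value r1/key r1/value]
      [lk lv rk rv])))

(join-sketch [:required :optional]
             [[1 :a] [2 :b]]
             [[1 :x] [3 :z]])
;; => ([1 :a 1 :x] [2 :b nil nil])
```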

  See also: pigpen.core.op/group$

key-selector->bind

added in 0.3.0

(key-selector->bind f)
For use with pigpen.core.op/bind$

Creates a key-selector function based on `f`. The resulting bind function
returns a tuple of [(f x) x]. This is generally used to separate a key for
subsequent use in a sort, group, or join.

  Example:

    (key-selector->bind (fn [x] (:foo x)))
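
A plain-Clojure sketch of the `[(f x) x]` behaviour described above; the name
`key-selector->bind-sketch` is a hypothetical stand-in, not the library
function.

```clojure
;; Hypothetical stand-in for illustration; not PigPen's implementation.
(defn key-selector->bind-sketch [f]
  (fn [[x]]
    ;; one result tuple: the extracted key, then the original value
    [[(f x) x]]))

((key-selector->bind-sketch (fn [x] (:foo x))) [{:foo 1 :bar 2}])
;; => [[1 {:foo 1, :bar 2}]]
```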

  See also: pigpen.core.op/bind$

keyword-field-selector->bind

added in 0.3.0

(keyword-field-selector->bind fields)
For use with pigpen.core.op/bind$

Selects a set of fields from a map and projects them as native fields. The
bind function takes a single arg, which is a map with keyword keys. The
parameter `fields` is a sequence of keywords to select. The input relation
should have a single field that is a map value.

  Example:

    (keyword-field-selector->bind [:foo :bar :baz])
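
The selection can be sketched in plain Clojure as below. The name
`keyword-field-selector->bind-sketch` is hypothetical; the sketch assumes the
map arrives as the only element of the args tuple.

```clojure
;; Hypothetical stand-in for illustration; not PigPen's implementation.
(defn keyword-field-selector->bind-sketch [fields]
  (fn [[m]]
    ;; one result tuple holding the selected values, in the order of `fields`
    [(mapv m fields)]))

((keyword-field-selector->bind-sketch [:foo :bar :baz])
 [{:foo 1 :bar 2 :baz 3 :qux 4}])
;; => [[1 2 3]]
```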

  See also: pigpen.core.op/bind$

load$

added in 0.3.0

(load$ location storage fields opts)
Inputs: [location storage fields opts]
  Returns: m/Load$

  Load the data specified by `location`, a string. The parameter `storage` is a
keyword such as :string, :parquet, or :avro that specifies the type of storage
to use. Each platform is responsible for dispatching on storage as appropriate.
The parameters `fields` and `opts` specify what fields this will produce and any
options to the command.

  Example:

    (load$ "input.tsv" :string '[value] {})

  See also: pigpen.core.op/store$

map->bind

added in 0.3.0

(map->bind f)
For use with pigpen.core.op/bind$

Takes a map-style function (one that takes a single input and returns a
single output) and returns a bind function that performs the same logic.

  Example:

    (map->bind (fn [x] (* x x)))
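
For intuition, a map-style function could be wrapped as a bind function along
these lines. The name `map->bind-sketch` is a hypothetical stand-in; the real
map->bind may differ in its details.

```clojure
;; Hypothetical stand-in for illustration; not PigPen's implementation.
(defn map->bind-sketch [f]
  (fn [args]
    ;; exactly one result tuple containing exactly one value
    [[(apply f args)]]))

((map->bind-sketch (fn [x] (* x x))) [3])
;; => [[9]]
```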

  See also: pigpen.core.op/bind$

mapcat->bind

added in 0.3.0

(mapcat->bind f)
For use with pigpen.core.op/bind$

Takes a mapcat-style function (one that takes a single input and returns zero
to many outputs) and returns a bind function that performs the same logic.

  Example:

    (mapcat->bind (fn [x] (seq x)))
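
A plain-Clojure sketch of the same idea for mapcat-style functions, where each
output value becomes its own result tuple. The name `mapcat->bind-sketch` is a
hypothetical stand-in, not the library function.

```clojure
;; Hypothetical stand-in for illustration; not PigPen's implementation.
(defn mapcat->bind-sketch [f]
  (fn [args]
    ;; wrap every output value in its own single-element tuple
    (mapv vector (apply f args))))

((mapcat->bind-sketch (fn [x] (seq x))) [[1 2 3]])
;; => [[1] [2] [3]]
```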

  See also: pigpen.core.op/bind$

noop$

added in 0.3.0

(noop$ opts relation)
Inputs: [opts relation]
Returns: m/NoOp$

A no-op command. This is used to introduce a unique id for a command.

Example:

  (noop$ {} relation)

project$

added in 0.3.0

(project$ projections opts relation)
Inputs: [projections opts relation]
  Returns: m/Project$

  Used to manipulate the fields of a relation, either by aliasing them or
applying functions to them. Usually you want bind$ instead of project$, as
PigPen will compile many of the former into one of the latter.

  Example:

    (project$
      [(projection-field$ 'a 'b)
       (projection-func$ 'e
         (code$ :seq
                '(require 'my-ns.core)
                '(fn [args] ...)
                ['c 'd]))]
      {}
      relation)

In the example above, we apply two operations to the input relation. First, we
alias the input field 'a as 'b. Second, we apply the user code specified by
code$ to the fields 'c and 'd to produce the field 'e. This implies that the
input relation has three fields, 'a, 'c, and 'd, and that the output fields of
this relation are 'b and 'e. If multiple projections are provided that flatten
results, the cross product of those is returned.
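
The cross-product behaviour of multiple flattening projections can be
illustrated in plain Clojure. The helper `flatten-cross-sketch` is a
hypothetical name, and modelling a record as a map of field symbol to
collection is an assumption made for the sketch.

```clojure
;; Hypothetical in-memory model; not PigPen's implementation.
(defn flatten-cross-sketch
  "Flattens each of `fields` in `record`, returning one output row for every
  combination of the flattened values."
  [record fields]
  (reduce (fn [rows field]
            (for [row rows
                  v   (get record field)]
              (assoc row field v)))
          [{}]
          fields))

(flatten-cross-sketch {'a [1 2] 'b [:x :y]} '[a b])
;; => ({a 1, b :x} {a 1, b :y} {a 2, b :x} {a 2, b :y})
```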

  See also: pigpen.core.op/projection-field$, pigpen.core.op/projection-func$,
            pigpen.core.op/code$

projection-field$

added in 0.3.0

(projection-field$ field)
(projection-field$ field alias)
(projection-field$ field alias flatten)
Project a single field into another, optionally providing an alias for the
new field. If an alias is not specified, the input field name is used. If the
field represents a collection, specify `flatten` as true to flatten the values
of the field into individual records.

  Examples:

    (projection-field$ 'a)         ;; copy the field a as a
    (projection-field$ 'a 'b)      ;; copy the field a as b
    (projection-field$ 'a 'b true) ;; copy the field a as b and flatten a
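
The effect of the three arities on a single record can be sketched in plain
Clojure. `project-field-sketch` is a hypothetical helper that works on an
in-memory map; it only illustrates the aliasing and flattening described
above.

```clojure
;; Hypothetical in-memory model; not PigPen's implementation.
(defn project-field-sketch
  ([record field]
   (project-field-sketch record field field false))
  ([record field alias]
   (project-field-sketch record field alias false))
  ([record field alias flatten?]
   (let [v (get record field)]
     (if flatten?
       (mapv (fn [x] {alias x}) v)   ; one output record per element
       [{alias v}]))))               ; one output record, value copied as-is

(project-field-sketch {'a [1 2]} 'a)         ;; => [{a [1 2]}]
(project-field-sketch {'a [1 2]} 'a 'b)      ;; => [{b [1 2]}]
(project-field-sketch {'a [1 2]} 'a 'b true) ;; => [{b 1} {b 2}]
```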

  See also: pigpen.core.op/project$, pigpen.core.op/projection-func$

projection-func$

added in 0.3.0

(projection-func$ alias code)
(projection-func$ alias flatten code)
Inputs: ([alias code :- m/CodeExpr] [alias flatten code :- m/CodeExpr])

  Apply code to a set of fields, optionally flattening the result. See code$
for details regarding how to express user code.

  Examples:

    (projection-func$ 'a (code$ ...))      ;; scalar result
    (projection-func$ 'a true (code$ ...)) ;; flatten the result collection

  See also: pigpen.core.op/code$, pigpen.core.op/project$,
            pigpen.core.op/projection-field$

rank$

added in 0.3.0

(rank$ opts relation)
Inputs: [opts relation]
  Returns: m/Rank$

  Rank the input relation. Adds a new field ('index), a long, to the fields of
the input relation.

  Example:

    (rank$ {} relation)
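
As an in-memory illustration, ranking can be thought of as attaching a running
index to each record. The helper `rank-sketch` is a hypothetical name, and
starting the index at 0 is an assumption of this sketch, not something the
text above specifies.

```clojure
;; Hypothetical in-memory model; the starting index of 0 is an assumption.
(defn rank-sketch [records]
  (map-indexed (fn [i record] (assoc record 'index i)) records))

(rank-sketch [{'value :a} {'value :b}])
;; => ({value :a, index 0} {value :b, index 1})
```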

  See also: pigpen.core.op/sort$

reduce$

added in 0.3.0

(reduce$ opts relation)
Inputs: [opts relation]
  Returns: m/Reduce$

  Reduce the entire relation into a single record that is the collection of all
records.

  Example:

    (reduce$ {} relation)
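
In plain Clojure terms, the result resembles collecting every record into one
collection-valued record. The helper `reduce-sketch` and the single-field
record shape are assumptions made for illustration.

```clojure
;; Hypothetical in-memory model; not PigPen's implementation.
(defn reduce-sketch [field records]
  ;; a single output record whose value is the collection of all inputs
  [{field (mapv #(get % field) records)}])

(reduce-sketch 'value [{'value 1} {'value 2} {'value 3}])
;; => [{value [1 2 3]}]
```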

  See also: pigpen.core.op/group$

return$

added in 0.3.0

(return$ fields data)
Inputs: [fields data]
  Returns: m/Return$

  Return the data as a PigPen relation. The parameter `fields` specifies what
fields the data will contain.

  Example:

    (return$ ['value] [{'value 42} {'value 37}])

  See also: pigpen.core.op/load$

sample$

added in 0.3.0

(sample$ p opts relation)
Inputs: [p opts relation]
Returns: m/Sample$

Samples the input relation, keeping approximately the fraction p of its
records, where (<= 0.0 p 1.0).

Example:

  (sample$ 0.5 {} relation)
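
One common way to model this kind of sampling in memory is to keep each record
independently with probability p. The helper `sample-sketch` below is a
hypothetical illustration of that idea; how a given platform actually samples
is not specified here.

```clojure
;; Hypothetical in-memory model; each record is kept with probability p.
(defn sample-sketch [p records]
  {:pre [(<= 0.0 p 1.0)]}
  (filter (fn [_] (< (rand) p)) records))

(count (sample-sketch 1.0 (range 10))) ;; => 10, every record is kept
(count (sample-sketch 0.0 (range 10))) ;; => 0, no record is kept
```

For values of p strictly between 0.0 and 1.0 the output size varies from run
to run.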

See also: pigpen.core.op/take$

sort$

added in 0.3.0

(sort$ key comp opts relation)
Inputs: [key comp opts relation]
  Returns: m/Sort$

  Sort the data in relation. The parameter `key` specifies the field that
should be used to sort the data. The sort field should be a native type; not
serialized. `comp` is either :asc or :desc.

  Example:

    (sort$ 'key :asc {} relation)
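
The roles of the key field and the :asc / :desc flag can be sketched against
in-memory records. The helper `sort-sketch` and its parameter names are
hypothetical; records are modelled as maps of field symbol to value.

```clojure
;; Hypothetical in-memory model; not PigPen's implementation.
(defn sort-sketch [field direction records]
  (let [cmp (case direction
              :asc  compare
              :desc (fn [a b] (compare b a)))]
    (sort-by #(get % field) cmp records)))

(sort-sketch 'key :asc  [{'key 2} {'key 3} {'key 1}])
;; => ({key 1} {key 2} {key 3})
(sort-sketch 'key :desc [{'key 2} {'key 3} {'key 1}])
;; => ({key 3} {key 2} {key 1})
```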

store$

added in 0.3.0

(store$ location storage opts relation)
Inputs: [location storage opts relation]
  Returns: m/Store$

  Store the data in `relation` at `location`, a string. The parameter `storage` is
a keyword such as :string, :parquet, or :avro that specifies the type of storage
to use. Each platform is responsible for dispatching on storage as appropriate.
The parameter `opts` specifies any options to the command. This command can only
be passed to store-many$ commands or to platform generation commands.

  Example:

    (store$ "output.tsv" :string {} relation)

  See also: pigpen.core.op/load$

store-many$

(store-many$ outputs)
Inputs: [outputs]
  Returns: m/StoreMany$

  Combines multiple store$ commands into a single command. This command can
only be passed to other store-many$ commands or to platform generation commands.

  Example:

    (store-many$ [(store$ "output1.tsv" :string {} relation1)
                  (store$ "output2.tsv" :string {} relation2)])

  See also: pigpen.core.op/store$

take$

added in 0.3.0

(take$ n opts relation)
Inputs: [n opts relation]
Returns: m/Take$

Returns the first n records from relation.

Example:

  (take$ 100 {} relation)

See also: pigpen.core.op/sample$