<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# HOWTOs

## How to update the version of Rust used in CI tests

Make a PR to update the [rust-toolchain] file in the root of the repository.

[rust-toolchain]: https://github.com/apache/datafusion/blob/main/rust-toolchain.toml

## Adding new functions

**Implementation**

| Function type | Location to implement     | Trait to implement                             | Macros to use                                    | Example              |
| ------------- | ------------------------- | ---------------------------------------------- | ------------------------------------------------ | -------------------- |
| Scalar        | [functions][df-functions] | [`ScalarUDFImpl`]                              | `make_udf_function!()` and `export_functions!()` | [`advanced_udf.rs`]  |
| Nested        | [functions-nested]        | [`ScalarUDFImpl`]                              | `make_udf_expr_and_func!()`                      |                      |
| Aggregate     | [functions-aggregate]     | [`AggregateUDFImpl`] and an [`Accumulator`]    | `make_udaf_expr_and_func!()`                     | [`advanced_udaf.rs`] |
| Window        | [functions-window]        | [`WindowUDFImpl`] and a [`PartitionEvaluator`] | `define_udwf_and_expr!()`                        | [`advanced_udwf.rs`] |
| Table         | [functions-table]         | [`TableFunctionImpl`] and a [`TableProvider`]  | `create_udtf_function!()`                        | [`simple_udtf.rs`]   |

- The macros reduce boilerplate, for example by ensuring that a DataFrame API compatible function is also created.
- Ensure new functions are properly exported through the subproject's
  `mod.rs` or `lib.rs`.
- Functions should preferably provide documentation via the `#[user_doc(...)]` attribute so that it
  can be included in the SQL reference documentation (see the Documentation section below).
- Scalar functions are further grouped into modules for families of functions (e.g. string, math, datetime).
  Functions should be added to the relevant module; if a new module needs to be created, then a new [Rust feature]
  should also be added to allow DataFusion users to conditionally compile the modules as needed.
- Aggregate functions can optionally implement a [`GroupsAccumulator`] for better performance.
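
As a rough illustration of the feature gating mentioned above, the sketch below shows how a module can be compiled only when a Cargo feature is enabled. The feature name `my_family`, the module, and both functions are hypothetical and exist only for this example; the actual layout of the DataFusion function crates may differ.

```rust
// `my_family` is a hypothetical feature name used only for illustration; it
// would be declared under `[features]` in the crate's `Cargo.toml`.
#[cfg(feature = "my_family")]
pub mod my_family {
    /// Stand-in for a function that belongs to this family.
    pub fn double(x: i64) -> i64 {
        x * 2
    }
}

/// Lists the (hypothetical) function families compiled into this build.
pub fn enabled_families() -> Vec<&'static str> {
    let mut families = Vec::new();
    // `cfg!` evaluates to a compile-time `true` or `false`.
    if cfg!(feature = "my_family") {
        families.push("my_family");
    }
    families
}
```

When the feature is not enabled, the `my_family` module is not compiled at all, so users who do not need that family avoid building it.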

Spark-compatible functions are [located in a separate crate][df-spark] but otherwise follow the same steps, though all
function types (e.g. scalar, nested, aggregate) are grouped together in a single location.

[df-functions]: https://github.com/apache/datafusion/tree/main/datafusion/functions
[functions-nested]: https://github.com/apache/datafusion/tree/main/datafusion/functions-nested
[functions-aggregate]: https://github.com/apache/datafusion/tree/main/datafusion/functions-aggregate
[functions-window]: https://github.com/apache/datafusion/tree/main/datafusion/functions-window
[functions-table]: https://github.com/apache/datafusion/tree/main/datafusion/functions-table
[df-spark]: https://github.com/apache/datafusion/tree/main/datafusion/spark
[`scalarudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
[`aggregateudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html
[`accumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html
[`groupsaccumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html
[`windowudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.WindowUDFImpl.html
[`partitionevaluator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.PartitionEvaluator.html
[`tablefunctionimpl`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableFunctionImpl.html
[`tableprovider`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html
[`advanced_udf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/udf/advanced_udf.rs
[`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/udf/advanced_udaf.rs
[`advanced_udwf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/udf/advanced_udwf.rs
[`simple_udtf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/udf/simple_udtf.rs
[rust feature]: https://doc.rust-lang.org/cargo/reference/features.html

**Testing**

Prefer adding `sqllogictest` integration tests where the function is called via SQL against
well-known data and returns an expected result. Check the existing [test files][slt-test-files] for
an appropriate file to add test cases to; otherwise create a new file. See the
[`sqllogictest` documentation][slt-readme] for details on how to construct these tests.
Ensure that edge cases and `null` inputs are covered in these tests.

If a behaviour cannot be tested via `sqllogictest` (e.g. testing `simplify()`, behaviour that needs to be
tested in isolation from the optimizer, or input that is difficult to construct exactly via `sqllogictest`),
then tests can be added as Rust unit tests in the implementation module, though these should be
kept minimal where possible.
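
For cases that do need a Rust unit test, the following is a minimal sketch of the usual `#[cfg(test)]` layout inside an implementation module. The function `char_len` is a hypothetical stand-in, not a DataFusion API; it only shows how a small test can cover `null` (modelled here as `None`) and edge-case inputs.

```rust
/// Hypothetical stand-in for the core logic of a function under test:
/// counts characters, passing `None` (SQL `NULL`) through unchanged.
fn char_len(input: Option<&str>) -> Option<usize> {
    input.map(|s| s.chars().count())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn handles_null_and_edge_cases() {
        // Null input stays null.
        assert_eq!(char_len(None), None);
        // Empty string edge case.
        assert_eq!(char_len(Some("")), Some(0));
        // A multi-byte character counts as one character.
        assert_eq!(char_len(Some("h\u{e9}llo")), Some(5));
    }
}
```

Keeping the test module next to the implementation lets it reach private helpers through `use super::*`.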

[slt-test-files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files
[slt-readme]: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md

**Documentation**

Run the documentation update script `./dev/update_function_docs.sh`, which updates the relevant
markdown documents [here][fn-doc-home] (see the documents for [scalar][fn-doc-scalar],
[aggregate][fn-doc-aggregate] and [window][fn-doc-window] functions).

- You _should not_ manually update the markdown documents after running the script, as those manual
  changes would be overwritten on the next execution.
- See the [GitHub issue] that introduced this behaviour.

[fn-doc-home]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql
[fn-doc-scalar]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md
[fn-doc-aggregate]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md
[fn-doc-window]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/window_functions.md
[github issue]: https://github.com/apache/datafusion/issues/12740

## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically
rendered using [Graphviz](https://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file:

```rust
use std::{fs::File, io::Write};

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?; // propagate I/O errors
```

Then, use the `dot` command line tool to render it into a file that
can be displayed. For example, the following command creates a
`/tmp/plan.pdf` file:

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```
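
If it is more convenient to drive both steps from Rust, the sketch below uses only the standard library. The helper names `write_dot` and `dot_command` are hypothetical and not part of DataFusion; `write_dot` accepts any `Display` value, so the value returned by `plan.display_graphviz()` could be passed to it, and `dot_command` only builds the `dot` invocation without running it.

```rust
use std::fmt::Display;
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Writes any `Display` value to `path`. The value returned by
/// `plan.display_graphviz()` could be passed here directly.
fn write_dot(path: &Path, graph: impl Display) -> io::Result<()> {
    fs::write(path, graph.to_string())
}

/// Builds the equivalent of `dot -T<format> -o <output> <input>`.
/// Nothing is executed until the caller invokes e.g. `.status()` on the result.
fn dot_command(input: &Path, output: &Path, format: &str) -> Command {
    let mut cmd = Command::new("dot");
    cmd.arg(format!("-T{format}")).arg("-o").arg(output).arg(input);
    cmd
}
```

Calling `dot_command(input, output, "pdf").status()` would then run Graphviz, assuming the `dot` binary is available on the `PATH`.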

## How to format `.md` documents

We use [`prettier`] to format `.md` files.

You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary.
Using `npx` requires a working node environment. Upgrading to the latest prettier is recommended (by adding
`--upgrade` to the `npm` command).

```bash
$ prettier --version
2.3.0
```

After you've confirmed your prettier version, you can format all the `.md` files:

```bash
prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
```

[`prettier`]: https://prettier.io/

## How to format `.toml` files

We use [`taplo`] to format `.toml` files.

To install via cargo:

```sh
cargo install taplo-cli --locked
```

> Refer to the [taplo installation documentation][taplo-install] for other ways to install it.

```bash
$ taplo --version
taplo 0.9.0
```

After you've confirmed your `taplo` version, you can format all the `.toml` files:

```bash
taplo fmt
```

[`taplo`]: https://taplo.tamasfe.dev/
[taplo-install]: https://taplo.tamasfe.dev/cli/installation/binary.html

## How to update protobuf/gen dependencies

For the `proto` and `proto-common` crates, the prost/tonic code is generated by running their respective `./regen.sh` scripts,
which in turn invoke the Rust binary located in `./gen`.

This is necessary after modifying the protobuf definitions or altering the dependencies of `./gen`, and requires a
valid installation of [protoc] (see [installation instructions] for details).

```bash
# From repository root
# proto-common
./datafusion/proto-common/regen.sh
# proto
./datafusion/proto/regen.sh
```

[protoc]: https://github.com/protocolbuffers/protobuf#protocol-compiler-installation
[installation instructions]: https://datafusion.apache.org/contributor-guide/getting_started.html#protoc-installation