<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Introduction

We welcome and encourage contributions of all kinds, from all levels, such as:

1. Tickets with issue reports or feature requests
2. Discussions
3. Documentation improvements
4. Code, both PRs and (especially) PR reviews.

In addition to submitting new PRs, we have a healthy tradition of community
members reviewing each other's PRs. Doing so is a great way to help the
community as well as get more familiar with Rust and the relevant codebases.

## Development Environment

Set up your development environment [here](development_environment.md), and learn
how to test the code [here](testing.md).

## Finding and Creating Issues to Work On

You can find a curated [good-first-issue] list to help you get started.
You can read about how we plan larger projects in the [Roadmap and Improvement Proposals](roadmap.md) section.

[good-first-issue]: https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22

### Open Contribution and Assigning tickets

DataFusion is an open contribution project, and thus there is no
project-imposed deadline for completing issues, no restriction on who can
work on an issue, and no limit on how many people can work on an issue at the same time.

Contributors drive the project forward based on their own priorities and
interests and thus you are free to work on any issue that interests you.

If someone is already working on an issue that you want or need but hasn't
been able to finish it yet, you should feel free to work on it as well. In
general, it is polite to leave a note on an issue when you start working on it,
and doing so helps avoid unnecessary duplication of work.

If you want to work on an issue that is not already assigned to someone else
and there is no comment indicating that someone is already working on it,
you can assign the issue to yourself by submitting the single-word
comment `take`. If you are unable to make progress, you should unassign
yourself by commenting the single word `untake`.

# Developer's guide

## Pull Request Overview

We welcome pull requests (PRs) from anyone in the community.

DataFusion is a rapidly evolving project and we try to review and merge PRs quickly.

Review bandwidth is currently our most limited resource, and we highly encourage reviews by the broader community. If you are waiting for your PR to be reviewed, consider helping review other PRs that are waiting. Such reviews help the reviewer learn the codebase and build expertise, and they help identify issues in the PR (such as lack of test coverage) that can be addressed to make future reviews faster and more efficient.

The lifecycle of a PR is:

1. Create a PR targeting the `main` branch.
2. For new contributors, a committer must first trigger the CI tasks. Please mention one of the [committers] in the PR to ask them to trigger CI.
3. Your PR will be reviewed. Please respond to all feedback on the PR: you don't have to change the code, but you should acknowledge the feedback. PRs that have been waiting on a response to feedback for more than a few days will be marked as draft.
4. Once the PR is approved, one of the [committers] will merge your PR, typically within 24 hours. We leave approved "major" changes (see below) open for 24 hours prior to merging, and sometimes leave "minor" PRs open for the same time to permit additional feedback.

Note that the above time frames are estimates. Due to limited committer
bandwidth, it may take longer to merge your PR. Please wait
patiently. If it has been several days, you can politely ping the
committer who approved your PR to remind them to merge it.

[committers]: https://people.apache.org/phonebook.html?unix=datafusion

## Creating Pull Requests

When possible, we recommend splitting your contributions into multiple smaller focused PRs rather than large PRs (500+ lines) because:

1. The PR is more likely to be reviewed quickly -- our reviewers struggle to find the contiguous time needed to review large PRs.
2. The PR discussions tend to be more focused and less likely to get lost among several different threads.
3. It is often easier to accept and act on feedback when it comes early on in a small change, before a particular approach has been polished too much.

If you are concerned that a larger design will be lost in a string of small PRs, creating a large draft PR that shows how they all work together can help.

Note that all commits in a PR are squashed when merged to the `main` branch, so there is one commit per PR after merge.

## Conventional Commits & Labeling PRs

We generate change logs for each release using an automated process that will categorize PRs based on the title
and/or the GitHub labels attached to the PR.

We follow the [Conventional Commits] specification to categorize PRs based on the title. This most often simply means
looking for titles starting with prefixes such as `fix:`, `feat:`, `docs:`, or `chore:`. We do not enforce this
convention but encourage its use if you want your PR to feature in the correct section of the changelog.

The change log generator will also look at GitHub labels such as `bug`, `enhancement`, or `api change`, and labels
do take priority over the conventional commit approach, allowing maintainers to re-categorize PRs after they have been merged.
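
The precedence described above (labels first, then the title prefix) can be sketched as follows. This is an illustrative sketch only, not the project's actual changelog generator: the `Section` enum, the function names, and the mapping from each label or prefix to a section are all assumptions made for the example.

```rust
/// Changelog sections used in this sketch (hypothetical names).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Section {
    BreakingChange,
    BugFix,
    Feature,
    Documentation,
    Other,
}

/// Map a GitHub label to a section, if it is one of the labels named in the text.
/// The label-to-section mapping is an assumption for illustration.
fn section_from_label(label: &str) -> Option<Section> {
    match label {
        "api change" => Some(Section::BreakingChange),
        "bug" => Some(Section::BugFix),
        "enhancement" => Some(Section::Feature),
        _ => None,
    }
}

/// Map a Conventional Commits style title (e.g. "fix: ...") to a section.
fn section_from_title(title: &str) -> Section {
    // The type is the text before the first ':'; titles without ':' are uncategorized.
    let prefix = match title.split_once(':') {
        Some((prefix, _rest)) => prefix,
        None => return Section::Other,
    };
    // Drop an optional scope such as "(parquet)" and an optional trailing "!".
    let kind = prefix
        .split('(')
        .next()
        .unwrap_or(prefix)
        .trim_end_matches('!')
        .trim();
    match kind {
        "fix" => Section::BugFix,
        "feat" => Section::Feature,
        "docs" => Section::Documentation,
        _ => Section::Other,
    }
}

/// Labels take priority; the title prefix is only a fallback.
fn categorize(title: &str, labels: &[&str]) -> Section {
    labels
        .iter()
        .find_map(|label| section_from_label(label))
        .unwrap_or_else(|| section_from_title(title))
}
```

With this sketch, `categorize("docs: fix typo", &["bug"])` yields `Section::BugFix` because the label wins over the `docs:` prefix, which mirrors how a maintainer can re-categorize a PR by changing its labels.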

[conventional commits]: https://www.conventionalcommits.org/en/v1.0.0/

## Reviewing Pull Requests

Some helpful links:

- [PRs Waiting for Review] on GitHub
- [Approved PRs Waiting for Merge] on GitHub

[prs waiting for review]: https://github.com/apache/datafusion/pulls?q=is%3Apr+is%3Aopen+-review%3Aapproved+-is%3Adraft+
[approved prs waiting for merge]: https://github.com/apache/datafusion/pulls?q=is%3Apr+is%3Aopen+review%3Aapproved+-is%3Adraft

When reviewing PRs, our primary goal is to improve DataFusion and its community together. PR feedback should be constructive with the aim to help improve the code as well as the understanding of the contributor.

Please ensure any issue you raise contains a rationale and a suggested alternative -- it is frustrating to be told "don't do it this way" without any clear reason or alternative provided.

Some things to specifically check:

1. Is the feature or fix covered sufficiently with tests (see the [Testing](testing.md) section)?
2. Is the code clear, and does it fit the style of the existing codebase?

## Performance Improvements

Performance improvements are always welcome: performance is a key DataFusion
feature.

In general, the performance improvement from a change should be "enough" to
justify any added code complexity. How much is "enough" is a judgement made by
the committers, but generally means that the improvement should be noticeable in
a real-world scenario and is greater than the noise of the benchmarking system.
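
As a rough illustration of "greater than the noise", the sketch below compares the mean improvement between two sets of timings with the run-to-run spread of those timings. It is a simple heuristic under stated assumptions, not a statistical test and not a criterion the committers apply: the function names and the choice of standard deviation as the noise measure are hypothetical.

```rust
/// Arithmetic mean of a non-empty slice of timings (for example, milliseconds).
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Sample standard deviation (n - 1 denominator).
/// Returns 0.0 when there are fewer than two samples.
fn std_dev(samples: &[f64]) -> f64 {
    if samples.len() < 2 {
        return 0.0;
    }
    let m = mean(samples);
    let variance = samples.iter().map(|x| (x - m).powi(2)).sum::<f64>()
        / (samples.len() - 1) as f64;
    variance.sqrt()
}

/// True when the candidate is faster on average by more than the larger of the
/// two run-to-run spreads. The threshold is an assumption made for this sketch.
fn exceeds_noise(baseline: &[f64], candidate: &[f64]) -> bool {
    let improvement = mean(baseline) - mean(candidate);
    let noise = std_dev(baseline).max(std_dev(candidate));
    improvement > noise
}
```

For example, baseline timings of `[100.0, 102.0, 98.0, 101.0, 99.0]` against candidate timings of `[80.0, 82.0, 79.0, 81.0, 78.0]` pass this check, while a 1 ms average gain on runs that vary by 10 ms does not.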

To help committers evaluate the potential improvement, performance PRs should
in general be accompanied by benchmark results that demonstrate the improvement.

The best way to demonstrate a performance improvement is with the existing
benchmarks:

- [System level SQL Benchmarks](https://github.com/apache/datafusion/tree/main/benchmarks)
- Microbenchmarks such as those in [functions/benches](https://github.com/apache/datafusion/tree/main/datafusion/functions/benches)

If there is no suitable existing benchmark, you can create a new one. It helps
to isolate the effects of your change by creating a separate PR with the
benchmark, and then a PR with the code change that improves the benchmark.

[system level sql benchmarks]: https://github.com/apache/datafusion/tree/main/benchmarks
[functions/benches]: https://github.com/apache/datafusion/tree/main/datafusion/functions/benches

## "Major" and "Minor" PRs

Since we are a worldwide community, we have contributors in many timezones who review and comment. To ensure anyone who wishes has an opportunity to review a PR, our committers try to ensure that at least 24 hours pass between when a "major" PR is approved and when it is merged.

A "major" PR means there is a substantial change in design or a change in the API. Committers apply their best judgment to determine what constitutes a substantial change. A "minor" PR might be merged without a 24 hour delay, again subject to the judgment of the committer. Examples of potential "minor" PRs are:

1. Documentation improvements/additions
2. Small bug fixes
3. Non-controversial build-related changes (clippy, version upgrades etc.)
4. Smaller non-controversial feature additions

The good thing about open code and open development is that any issues in one change can almost always be fixed with a follow-on PR.

## Stale PRs

Pull requests will be marked with a `stale` label after 60 days of inactivity and then closed 7 days after that.
Commenting on the PR will remove the `stale` label.
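
The timeline above can be sketched as a small state function. This is a minimal sketch assuming inactivity is counted in whole days and that the thresholds are inclusive. The names are hypothetical, and the real automation may treat the boundaries differently.

```rust
/// Thresholds taken from the text above.
const STALE_AFTER_DAYS: u32 = 60;
const CLOSE_AFTER_STALE_DAYS: u32 = 7;

/// States a PR moves through in this sketch (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
enum PrState {
    Active,
    Stale,
    Closed,
}

/// Classify a PR by the number of days since its last activity.
/// A new comment counts as activity, which resets `days_inactive` to 0.
/// Inclusive boundaries (>=) are an assumption of this sketch.
fn pr_state(days_inactive: u32) -> PrState {
    if days_inactive >= STALE_AFTER_DAYS + CLOSE_AFTER_STALE_DAYS {
        PrState::Closed
    } else if days_inactive >= STALE_AFTER_DAYS {
        PrState::Stale
    } else {
        PrState::Active
    }
}
```

Under these assumptions, a PR that has been inactive for 60 days is `Stale`, and one left untouched for 67 days is `Closed`.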

## AI-Assisted contributions

DataFusion has the following policy for AI-assisted PRs:

- The PR author should **understand the core ideas** behind the implementation **end-to-end**, and be able to justify the design and code during review.
- The PR author should **call out unknowns and assumptions**. It's okay to not fully understand some bits of AI-generated code. You should comment on these cases and point them out to reviewers so that they can use their knowledge of the codebase to clear up any concerns. For example, you might comment "calling this function here seems to work but I'm not familiar with how it works internally, I wonder if there's a race condition if it is called concurrently".

### Why fully AI-generated PRs without understanding are not helpful

Today, AI tools cannot reliably make complex changes to DataFusion on their own, which is why we rely on pull requests and code review.

The purposes of code review are:

1. Finish the intended task.
2. Share knowledge between authors and reviewers, as a long-term investment in the project. For this reason, even if someone familiar with the codebase can finish a task quickly, we're still happy to help a new contributor work on it even if it takes longer.

An AI dump for an issue doesn’t meet these purposes. Maintainers could finish the task faster by using AI directly, and submitters gain little knowledge if they act only as a pass-through AI proxy without understanding the change.

Please understand that the project's reviewing capacity is **very limited**, so large PRs that appear to lack the requisite understanding might not get reviewed, and may eventually be closed or redirected.

### Better ways to contribute than an “AI dump”

It's recommended to write a high-quality issue with a clear problem statement and a minimal, reproducible example. This can make it easier for others to contribute.