= ADR029: Standardize database connections
Razvan Mihai <razvan.mihai@stackable.tech>
v0.1, 2022-12-08
:status: accepted

* Status: {status}
* Deciders:
** Felix Hennig
** Lukas Voetmand
** Malte Sander
** Razvan Mihai
** Sascha Lautenschläger
** Sebastian Bernauer
* Date: 2022-12-08

Technical Story: https://github.com/stackabletech/issues/issues/238
== Context and Problem Statement

Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database connections. In addition, some Stackable operators expect database credentials to be provided inline and in plain text in the cluster definitions.

A quick analysis of the status quo shows how differently the operators currently handle database connection definitions:

* Apache Hive: the cluster custom resource defines a field called "database" with access credentials in clear text.
* Apache Airflow and Apache Superset: use a field called "credentialSecret" that contains multiple different database connection definitions. Even worse, it contains credentials not related to a database at all, such as a secret used to encrypt cookies. In the case of Airflow, this secret only supports the Celery executor.
* Apache Druid: uses a field called "metadataStorageDatabase" where access credentials are expected to be inline and in plain text.
== Decision Drivers

Here we attempt to standardize the way database connections are defined across the Stackable platform in such a way that:

* Different database systems are supported.
* Access credentials are defined in Kubernetes `Secret` objects.
* Product configuration only allows databases supported by the product ...
* ... but there is a generic way to configure additional database systems.
* Misconfigured connections are rejected as early as possible in the product lifecycle.
* Generated CRD documentation is easy for users to follow.

Initially we thought that database connections should be implemented as stand-alone Kubernetes resources referenced from cluster definitions. This idea was discarded mostly because sharing database connections across products is not good practice and we shouldn't encourage it.
== Considered Options

1. (rejected) `DatabaseConnection`: a generic resource definition.
2. (rejected) Database-driver-specific resource definitions.
3. (accepted) Product-supported and generic DB specifications.

=== 1. (rejected) `DatabaseConnection`: a generic resource definition
The first idea was to introduce a new Kubernetes resource called `DatabaseConnection` with the following fields:

[cols="1,1"]
|===
|Field name |Description

|credentials
|A string with the name of a `Secret` containing at least a user name and a password field. Additional fields are allowed.

|driver
|A string with the database driver name. This is a generic field that identifies the type of the database used.

|protocol
|The protocol prefix of the final connection string. Most Java-based products will use `jdbc:`.

|host
|A string with the host name to connect to.

|instance
|A string with the database instance to connect to. Optional.

|port
|A positive integer with the TCP port used for the connection. Optional.

|properties
|A dictionary of additional properties for driver tuning, like the number of client threads, various buffer sizes and so on. Some drivers, like `derby`, use this to define the database name and whether the database should be created automatically. Optional.
|===

The `Secret` object referenced by `credentials` must contain two fields named `USER_NAME` and `PASSWORD`, but can contain additional fields like first name, last name, email, user role and so on.
=== Examples

These examples showcase the spec change required from the current status.

The current Druid metadata database connection

[source,yaml]
----
metadataStorageDatabase:
  dbType: postgresql
  connString: jdbc:postgresql://druid-postgresql/druid
  host: druid-postgresql
  port: 5432
  user: druid
  password: druid
----

becomes

[source,yaml]
----
metadataStorageDatabase: druid-metadata-connection
----

where `druid-metadata-connection` is a standalone `DatabaseConnection` resource defined as follows:
[source,yaml]
----
apiVersion: db.stackable.tech/v1alpha1
kind: DatabaseConnection
metadata:
  name: druid-metadata-connection
spec:
  driver: postgresql
  host: druid-postgresql
  port: 5432
  protocol: jdbc:postgresql
  instance: druid
  credentials: druid-metadata-credentials
----

and the `credentials` field contains the name of a Kubernetes `Secret` defined as:
[source,yaml]
----
apiVersion: v1
kind: Secret
metadata:
  name: druid-metadata-credentials
type: Opaque
stringData:
  USER_NAME: druid
  PASSWORD: druid
----

NOTE: This idea was discarded because it didn't satisfy all acceptance criteria. In particular, it wouldn't be possible to catch misconfigurations at cluster creation time.
=== 2. (rejected) Database-driver-specific resource definitions

In an attempt to address the issues of the first option above, a more detailed specification is necessary. Here, database-specific configurations are possible that can be better validated, as in the example below.
[source,yaml]
----
apiVersion: databaseconnection.stackable.tech/v1alpha1
kind: DatabaseConnection
metadata:
  name: druid-metadata-connection
  namespace: default
spec:
  database:
    postgresql:
      host: druid-postgresql # mandatory
      port: 5432 # defaults to some port number, depending on whether TLS is enabled
      schema: druid # defaults to druid
      credentials: druid-postgresql-credentials # mandatory. keys: username and password
      parameters: {} # optional
    redis:
      host: airflow-redis-master # mandatory
      port: 6379 # defaults to some port number, depending on whether TLS is enabled
      schema: druid # defaults to druid
      credentials: airflow-redis-credentials # optional. key: password
      parameters: {} # optional
    derby:
      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
      parameters: # optional
        create: "true"
    genericConnectionString:
      driver: postgresql
      format: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
      secret: ... # optional, with keys:
        SUPERSET_DB_USER: ...
        SUPERSET_DB_PASS: ...
        SUPERSET_DB_PORT: ...
    generic:
      driver: postgresql
      host: superset-postgresql.default.svc.cluster.local # optional
      port: 5432 # optional
      protocol: pgsql123 # optional
      instance: superset # optional
      credentials: name-of-secret-with-credentials # optional
      parameters: {...} # optional
      connectionStringFormat: "{protocol}://{credentials.user_name}:{credentials.credentials}@{host}:{port}/{instance}&[parameters,;]"
      tls: # optional
        verification:
          ca_cert:
            ...
----

In addition, a second generic DB type (`genericConnectionString`) is introduced. This specification allows templating connection URLs with variables defined in secrets, and it is not restricted to user credentials only.

NOTE: This proposal was rejected for the same reason as the first proposal. In addition, it fails to make DB configurations product-specific.
=== 3. (accepted) Product-supported and generic DB specifications

It seems that a unique, platform-wide mechanism to describe database connections that also fulfills all acceptance criteria is not feasible. Database drivers and product configurations are too diverse and cannot be forced into a type-safe specification.

Thus the single, global connection manifest needs to be split into two different categories, each covering a subset of the acceptance criteria:

1. A database-specific mechanism. This allows catching misconfigurations early, and it promotes good documentation and uniformity across the platform.
2. An operator-specific mechanism. This is a wildcard that can be used to configure database connections that are not officially supported by the products but that can still be partially validated early.

The first mechanism requires the operator framework to provide predefined structures and supporting functions for widely available database systems such as PostgreSQL, MySQL, MariaDB, Oracle, SQLite, Derby, Redis and so on. This doesn't mean that all products can be configured with all DB implementations: the product definitions will only allow the subset that is officially supported by the product.

The second mechanism is operator/product-specific and contains mostly a pass-through list of relevant **product properties**. There is at least one exception, and that is the handling of user credentials, which still need to be provisioned in a secure way (as long as the product supports it).
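As an illustration of what such a predefined structure with early validation could look like, here is a minimal sketch. Python is used purely for illustration; the class and field names are assumptions derived from the manifests in this ADR, not an actual framework API.

```python
from dataclasses import dataclass, field

@dataclass
class PostgresqlConnection:
    """Hypothetical predefined structure for a PostgreSQL connection.

    Field names mirror the database-specific manifests in this ADR;
    the class itself is illustrative, not a real framework type.
    """
    host: str                  # mandatory
    instance: str              # mandatory
    credentials: str           # mandatory: Secret name with username/password keys
    port: int = 5432           # optional, defaults to 5432
    parameters: dict = field(default_factory=dict)  # optional driver tuning

    def validate(self) -> None:
        # Catch misconfigurations as early as possible (a decision driver above).
        if not self.host:
            raise ValueError("host must not be empty")
        if not 0 < self.port < 65536:
            raise ValueError(f"invalid TCP port: {self.port}")
        if not self.credentials:
            raise ValueError("credentials Secret name must not be empty")
```

Validation along these lines could run during reconciliation, rejecting a bad manifest before the product itself ever starts.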
==== Database-specific manifests

Support for the following database systems is planned. Additional systems may be added in the future.

1. PostgreSQL

[source,yaml]
----
postgresql:
  host: postgresql # mandatory
  port: 5432 # optional, default is 5432
  instance: my-database # mandatory
  credentials: my-application-credentials # mandatory. keys: username and password
  parameters: {} # optional
  tls: secure-connection-class-name # optional
  auth: authentication-class-name # optional. authentication class to use.
----

PostgreSQL supports multiple authentication mechanisms as described https://www.postgresql.org/docs/9.1/auth-pg-hba-conf.html[here].
2. MySQL

[source,yaml]
----
mysql:
  host: mysql # mandatory
  port: 3306 # optional, default is 3306
  instance: my-database # mandatory
  credentials: my-application-credentials # mandatory. keys: username and password
  parameters: {} # optional
  tls: secure-connection-class-name # optional
  auth: authentication-class-name # optional. authentication class to use.
----

MySQL supports multiple authentication mechanisms as described https://dev.mysql.com/doc/refman/8.0/en/socket-pluggable-authentication.html[here].
3. Derby

Derby is often used as an embedded database for testing and prototyping ideas and implementations. It is not recommended for production use cases.

[source,yaml]
----
derby:
  location: /tmp/my-database/ # optional, defaults to /tmp/derby-<some-suffix>/derby.db
----
==== Product-specific manifests

1. Apache Druid

Apache Druid clusters can be configured with any of the DB-specific manifests from above. In addition, a DB-generic configuration can be specified.

The following example shows how to configure the metadata storage for a Druid cluster using either one of the supported back-ends or a generic system. In a production setting, only the PostgreSQL or MySQL manifests should be used.

[source,yaml]
----
generic:
  driver: postgresql # mandatory
  uri: jdbc:postgresql://<host>/druid?foo;bar # mandatory
  credentialsSecret: my-secret # mandatory. keys: username and password
----
The above is translated into the following Java properties:

[source]
----
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://<host>/druid?foo;bar
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid
----
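The translation from the generic manifest to these properties is mechanical. A hedged sketch of it (hypothetical helper name; the username and password are assumed to have already been read from the referenced Secret):

```python
def druid_metadata_properties(generic: dict, username: str, password: str) -> dict:
    """Render the generic DB manifest into Druid metadata storage properties.

    Illustrative only; `generic` mirrors the YAML fields above.
    """
    return {
        "druid.metadata.storage.type": generic["driver"],
        "druid.metadata.storage.connector.connectURI": generic["uri"],
        "druid.metadata.storage.connector.user": username,
        "druid.metadata.storage.connector.password": password,
    }

props = druid_metadata_properties(
    {"driver": "postgresql", "uri": "jdbc:postgresql://<host>/druid?foo;bar"},
    username="druid",
    password="druid",
)
```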
2. Apache Superset

NOTE: Superset supports a very wide range of database systems, as described https://superset.apache.org/docs/databases/installing-database-drivers[here]. Not all of them are suitable for metadata storage.

Connections to Apache Hive, Apache Druid and Trino clusters deployed as part of the SDP platform can be automated by using discovery config maps. In this case, the only attribute to configure is the name of the discovery config map of the appropriate system.

In addition, a generic way to configure a database connection looks as follows:
[source,yaml]
----
generic:
  secret: superset-metadata-secret # mandatory. A Secret with one entry called "key". Used to encrypt metadata and session cookies.
  template: postgresql://{{SUPERSET_DB_USER}}:{{SUPERSET_DB_PASS}}@postgres.default.svc.local/superset&param1=value1&param2=value2 # mandatory
  templateSecret: my-secret # optional, with keys:
    SUPERSET_DB_USER: ...
    SUPERSET_DB_PASS: ...
----

The `template` attribute allows specifying the full connection string as required by Superset (and the underlying SQLAlchemy framework). Variables in the template are written between `{{` and `}}` markers, and their contents are replaced with the corresponding field of the `templateSecret` object.
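The substitution itself can be sketched as follows. This is a minimal illustration assuming the `templateSecret` has been decoded into a plain dictionary; the actual operator implementation is not prescribed by this ADR.

```python
import re

def render_connection_string(template: str, secret_data: dict) -> str:
    # Replace every {{VAR}} marker with the matching key from the Secret,
    # failing early if a variable has no corresponding Secret entry.
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in secret_data:
            raise ValueError(f"template variable {key!r} missing from templateSecret")
        return secret_data[key]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

uri = render_connection_string(
    "postgresql://{{SUPERSET_DB_USER}}:{{SUPERSET_DB_PASS}}@postgres.default.svc.local/superset",
    {"SUPERSET_DB_USER": "superset", "SUPERSET_DB_PASS": "s3cret"},
)
# uri == "postgresql://superset:s3cret@postgres.default.svc.local/superset"
```

Failing on a missing key rather than substituting an empty string keeps misconfigurations visible as early as possible, in line with the decision drivers.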
3. Apache Hive

For production environments we recommend the PostgreSQL back-end; for development, Derby.

A generic connection can be configured as follows:

[source,yaml]
----
generic:
  driver: org.postgresql.Driver # mandatory
  uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
  credentialsSecret: my-secret # mandatory (?). keys: username and password
----
4. Apache Airflow

A generic Airflow database connection can be configured in a similar fashion to Superset:

[source,yaml]
----
generic:
  template: postgresql://{{AIRFLOW_DB_USER}}:{{AIRFLOW_DB_PASS}}@postgres.default.svc.local/airflow&param1=value1&param2=value2 # mandatory
  templateSecret: my-secret # optional, with keys:
    AIRFLOW_DB_USER: ...
    AIRFLOW_DB_PASS: ...
----
The resulting CRDs look like:

[source,yaml]
----
kind: DruidCluster
spec:
  clusterConfig:
    metadataDatabase:
      postgresql:
        host: postgresql # mandatory
        port: 5432 # defaults to some port number, depending on whether TLS is enabled
        database: druid # mandatory
        credentials: postgresql-credentials # mandatory. keys: username and password
        parameters: {} # optional BTreeMap<String, String>
      mysql:
        host: mysql # mandatory
        port: XXXX # defaults to some port number, depending on whether TLS is enabled
        database: druid # mandatory
        credentials: mysql-credentials # mandatory. keys: username and password
        parameters: {} # optional BTreeMap<String, String>
      derby:
        location: /tmp/derby/ # optional, defaults to /tmp/derby-<some-suffix>/derby.db
      generic:
        driver: postgresql # mandatory
        uri: jdbc:postgresql://<host>/druid?foo;bar # mandatory
        credentialsSecret: my-secret # mandatory. keys: username and password
        # druid.metadata.storage.type=postgresql
        # druid.metadata.storage.connector.connectURI=jdbc:postgresql://<host>/druid
        # druid.metadata.storage.connector.user=druid
        # druid.metadata.storage.connector.password=druid
---
kind: SupersetCluster
spec:
  clusterConfig:
    metadataDatabase:
      postgresql:
        host: postgresql # mandatory
        port: 5432 # defaults to some port number, depending on whether TLS is enabled
        database: superset # mandatory
        credentials: postgresql-credentials # mandatory. keys: username and password
        parameters: {} # optional BTreeMap<String, String>
      mysql:
        host: mysql # mandatory
        port: XXXX # defaults to some port number, depending on whether TLS is enabled
        database: superset # mandatory
        credentials: mysql-credentials # mandatory. keys: username and password
        parameters: {} # optional BTreeMap<String, String>
      sqlite:
        location: /tmp/sqlite/ # optional, defaults to /tmp/sqlite-<some-suffix>/sqlite.db
      generic:
        uriSecret: my-secret # mandatory. key: uri
        # postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require
---
kind: HiveCluster
spec:
  clusterConfig:
    metadataDatabase:
      postgresql:
        host: postgresql # mandatory
        port: 5432 # defaults to some port number, depending on whether TLS is enabled
        database: hive # mandatory
        credentials: postgresql-credentials # mandatory. keys: username and password
        parameters: {} # optional BTreeMap<String, String>
      derby:
        location: /tmp/derby/ # optional, defaults to /tmp/derby-<some-suffix>/derby.db
      # Missing: MS-SQL Server, Oracle
      generic:
        driver: org.postgresql.Driver # mandatory
        uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
        credentialsSecret: my-secret # mandatory (?). keys: username and password
        # <property>
        #   <name>javax.jdo.option.ConnectionURL</name>
        #   <value>jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
        #   <description>PostgreSQL JDBC driver connection URL</description>
        # </property>
        # <property>
        #   <name>javax.jdo.option.ConnectionDriverName</name>
        #   <value>org.postgresql.Driver</value>
        #   <description>PostgreSQL metastore driver class name</description>
        # </property>
        # <property>
        #   <name>javax.jdo.option.ConnectionUserName</name>
        #   <value>database_username</value>
        #   <description>the username for the DB instance</description>
        # </property>
        # <property>
        #   <name>javax.jdo.option.ConnectionPassword</name>
        #   <value>database_password</value>
        #   <description>the password for the DB instance</description>
        # </property>
----

modules/contributor/partials/current_adrs.adoc

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,12 @@
1717
**** xref:adr/ADR018-product_image_versioning.adoc[]
1818
**** xref:adr/ADR019-trino_catalog_definitions.adoc[]
1919
**** xref:adr/ADR020-trino_catalog_usage.adoc[]
20-
**** xref:adr/ADR021-stackablectl_stacks_inital_version.adoc[]
20+
**** xref:adr/ADR021-stackablectl_stacks_initial_version.adoc[]
2121
**** xref:adr/ADR022-spark-history-server.adoc[]
2222
**** xref:adr/ADR023-product-image-selection.adoc[]
2323
**** xref:adr/ADR024-out-of-cluster_access.adoc[]
2424
**** xref:adr/ADR025-logging_architecture.adoc[]
2525
**** xref:adr/ADR026-affinities.adoc[]
2626
**** xref:adr/ADR027-status.adoc[]
2727
**** xref:adr/ADR028-automatic-stackable-version.adoc[]
28+
**** xref:adr/ADR029-database-connection.adoc[]

0 commit comments

Comments
 (0)