Should we really use fluentd or perhaps something else?
Instead of using the fluentd logging driver, we could also keep using the default 'json-file' logging driver and collect the logs from the files on disk. That collection could even be done with fluentd itself, e.g. by tailing the JSON log files.
As long as our requirements aren't strict, we might be fine either way.
The current setup in the branch, however, does use the fluentd logging driver. => use fluentd logging driver
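As a reference, enabling the driver for a single service in a compose file could look like the following sketch; the service name and image are placeholders, and the address assumes fluentd is reachable on the host:

```yaml
services:
  renderer:                                         # hypothetical service name
    image: registry.gitlab.eox.at/esa/prism/vs/core # hypothetical image
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: "docker.{{.Name}}"   # tag logs with the container name
```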
Behavior if logging is down
The current config in the branch doesn't allow containers to start if fluentd is down. If fluentd goes down at some later point, the containers seem to keep working; I'm not sure whether logs could get lost during that time. => OK
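If containers should be able to start while fluentd is down, the docker fluentd driver has an async mode that connects in the background instead of failing at startup. A minimal sketch, with option values that are assumptions to be tuned:

```yaml
services:
  renderer:                             # hypothetical service name
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        fluentd-async-connect: "true"   # don't block container start on fluentd
        fluentd-retry-wait: "1s"        # assumed value
        fluentd-max-retries: "30"       # assumed value
```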
Where do we put the fluentd image?
Fluentd requires you to extend the base image with your own plugins. For now we could push it to registry.gitlab.eox.at/esa/prism/vs/fluentd and build it automatically via GitLab CI.
In the future, we could have a company-wide image for that (or we move everything to EOxHub, where we have the Bitnami image). => OK, build image automatically in gitlab-ci
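A sketch of what such a job could look like in .gitlab-ci.yml; the job name, the stage, and the fluentd/ directory containing the Dockerfile are assumptions:

```yaml
build-fluentd:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    # CI_REGISTRY_* are GitLab's predefined CI variables
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t registry.gitlab.eox.at/esa/prism/vs/fluentd fluentd/
    - docker push registry.gitlab.eox.at/esa/prism/vs/fluentd
```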
Replication of Elasticsearch
For HA, backups, and upgrades with little to no downtime, it's recommended to run multiple replicas of ES. That would, however, mean more maintenance (more moving parts).
If the requirements aren't really strict here, I'd just go for a single-node setup. => use single-node setup, no guarantee on long-term log storage
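For reference, a single-node ES service could be declared like this; the image tag is an assumption:

```yaml
services:
  elasticsearch:
    image: elasticsearch:7.9.0      # tag is an assumption
    environment:
      discovery.type: single-node   # explicitly opt out of clustering
```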
Where are the actual logs stored?
Is it enough to have named volumes in docker? If I understand correctly, they are only available per node, so we have to pin ES to a single node (or use multiple services and pin each to one node).
How much disk space is there?
If we need to persist logs long-term, we should also save them somewhere else, but we probably don't need that now. => use docker volume on master node, i.e. pin ES to that node
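Extending the sketch above, pinning ES to the manager node and giving it a named volume might look like this in swarm mode (names are assumptions, apart from the default ES data directory):

```yaml
services:
  elasticsearch:
    image: elasticsearch:7.9.0
    environment:
      discovery.type: single-node
    volumes:
      - es-data:/usr/share/elasticsearch/data   # default ES data path
    deploy:
      placement:
        constraints:
          - node.role == manager   # the named volume only exists on this node
volumes:
  es-data:
```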
Is it ok to expose ports on the docker nodes?
The setup in the branch works by exposing the port 24224 for fluentd on every node (manager and worker). The docker daemon on each node then connects to fluentd via the host system. Is this port on the docker nodes reachable from outside?
(Note that fluentd doesn't need to be deployed on every node: the swarm routing mesh opens the published port on every node and redirects to some instance. It might still be a good idea to run it everywhere.) => exposing ports should be fine because the nodes are not reachable from outside. Check briefly if docker can send logs directly to fluentd running in docker
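For illustration, publishing the fluentd port could look like this; `mode: global` (one instance per node) is optional, since the routing mesh forwards the port either way:

```yaml
services:
  fluentd:
    image: registry.gitlab.eox.at/esa/prism/vs/fluentd
    ports:
      - "24224:24224"       # published on every node via the routing mesh
      - "24224:24224/udp"
    deploy:
      mode: global          # optional: one fluentd instance per node
```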
Should we start with no log parsing at all?
It should be simple to add log parsing, but the containers need to emit logs with a fixed schema. For now, I would disable parsing for the output of the cache, since there are still messages on stdout that don't follow the Apache access log format, which leads to errors. => we want log parsing for cache and core. All stdout logs must have the same schema. @mallingerb handles this for cache, @fabian.schindler for core
Do we enable fluentd per container or globally in docker?
We could start by enabling fluentd logging on certain containers, and once we feel comfortable with it in production, enable it globally in the docker daemon config of each swarm node.
However, if we want to enable it on dev machines, we have to put the parameter in the compose file in any case. => either way is fine
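For the global variant, the daemon config on each node (typically /etc/docker/daemon.json) would set fluentd as the default driver; a minimal sketch:

```json
{
  "log-driver": "fluentd",
  "log-opts": {
    "fluentd-address": "localhost:24224"
  }
}
```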
Should kibana be reachable also via subdomain and basic auth?
elasticsearch probably doesn't need to be reachable, except for management tasks, where an SSH tunnel could suffice (and e.g. allow running tools like this in the browser).
For development, kibana can be reached at localhost:5601 => kibana should be reachable at kibana.pdas.prism.eox.at via traefik and protected by apiauth
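A rough sketch of how this could look with Traefik v2 labels in swarm mode; the router/middleware names and the htpasswd entry are placeholders, and plain basicauth stands in for the actual apiauth setup:

```yaml
services:
  kibana:
    image: kibana:7.9.0   # tag is an assumption
    deploy:
      labels:
        - "traefik.http.routers.kibana.rule=Host(`kibana.pdas.prism.eox.at`)"
        - "traefik.http.routers.kibana.middlewares=kibana-auth"
        # placeholder htpasswd entry; '$' is doubled for compose interpolation
        - "traefik.http.middlewares.kibana-auth.basicauth.users=user:$$apr1$$placeholder"
        - "traefik.http.services.kibana.loadbalancer.server.port=5601"
```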
Can we have downtime for deployment? => brief downtime during configuration should be fine
Index patterns have to be set up manually for local development => add info to the README