Loki, the latest open source project from the Grafana Labs team, is a horizontally scalable, highly available, multi-tenant log aggregation system. It is designed to be very cost effective and easy to operate, because it does not index the contents of the logs but instead indexes a set of tags for each log stream. The project was inspired by Prometheus, and its official description is: "Like Prometheus, but for logs".
Overview
Unlike other logging systems, Loki only indexes your log metadata tags (just like Prometheus labels) and does not perform full-text indexing of the raw log data. The log data itself is compressed and stored as chunks (blocks) in an object store (such as S3 or GCS) or even on a local filesystem. A small index and highly compressed chunks greatly simplify operations and reduce the cost of running Loki.
A more detailed version of this overview can be found in the Loki architecture documentation.
Multi-tenant
Loki supports a multi-tenant model in which the data of different tenants is completely separated. Multi-tenancy is implemented with a tenant ID (an alphanumeric string). When multi-tenancy mode is disabled, a fake tenant ID is generated internally and applied to all requests.
Operation Mode
Loki can be run locally at a small scale or scaled out horizontally. Loki comes with a single-process mode that runs all of the required microservices in one process. Single-process mode is ideal for testing Loki or for small-scale deployments. For horizontal scaling, Loki's microservices can be split into separate processes, allowing them to scale independently.
Components
Distributor
The distributor service is responsible for processing logs written by clients; it is essentially the first stop in the log write path. Once the distributor receives log data, it splits the data into batches and sends them to multiple ingesters in parallel.
The distributor communicates with the ingesters over gRPC. Distributors are stateless, so they can be scaled up and down as needed.
Hashing
The distributor uses consistent hashing together with a configurable replication factor to determine which ingester instances should receive the log data.
The hash is computed from the log's tags and the tenant ID.
A hash ring stored in Consul is used to implement the consistent hashing; each ingester registers its own set of tokens into the ring. The distributor then finds the token that most closely matches the hash of the log and sends the data to the holder of that token.
Consistency
Since all distributors share the same hash ring, a write request can be sent to any distributor.
To ensure consistent query results, Loki uses Dynamo-style quorum consistency for reads and writes. This means the distributor waits for a positive response from at least half of the ingesters plus one before acknowledging the write to the client that sent the samples.
Ingester
The ingester service is responsible for writing log data to the long-term storage backend (DynamoDB, S3, Cassandra, etc.).
The ingester verifies that the log lines it receives arrive in increasing timestamp order (i.e., each line has a later timestamp than the one before it). When the ingester receives a line that does not follow this order, the line is rejected and an error is returned to the user. For more details, see the Timestamp ordering section.
The logs for each unique set of tags are built up into chunks in memory and then flushed to the backend storage.
If an ingester process crashes or exits abruptly, all data that has not yet been flushed to storage is lost. Loki is therefore usually configured with multiple replicas (typically 3) of each ingester to reduce this risk.
Timestamp ordering
In general, every log line pushed to Loki must have a newer timestamp than the line received before it. However, there may be cases where multiple log lines have exactly the same nanosecond timestamp; these are handled as follows.
- If the incoming line exactly matches the previously received line (both timestamp and log text match), the incoming line is considered an exact duplicate and will be ignored.
- If the timestamp of the incoming line is the same as the timestamp of the previous line but the log text is different, the line is accepted. This means it is possible to have two different log lines with the same timestamp.
Handoff
By default, when an ingester shuts down and tries to leave the hash ring, it waits to see whether a new ingester attempts to join before it flushes its data, and tries to initiate a handoff instead. The handoff transfers all of the tokens and in-memory chunks owned by the leaving ingester to the new ingester.
This process avoids flushing all chunks at shutdown, which is slow and time-consuming.
File system support
The ingester supports writing to the filesystem via BoltDB, but this only works in single-process mode, because the querier needs access to the same backend storage, and BoltDB only allows one process to hold a lock on the database at a time.
Querier
The querier service is responsible for handling LogQL queries and evaluating the log data stored in long-term storage.
It first queries all ingesters for in-memory data before falling back to loading data from the backend storage.
Query frontend
This optional service sits in front of a pool of queriers. It schedules incoming requests fairly across them, parallelizes them as much as possible, and caches requests.
Chunk store
The chunk store is Loki's long-term data store, designed to support interactive queries and sustained writes without the need for background maintenance tasks. It consists of the following components.
- An index for the chunks, which can be backed by DynamoDB, Bigtable, or Cassandra.
- A key-value (KV) store for the chunk data itself, which can be DynamoDB, Bigtable, Cassandra, or additionally an object store such as S3.
Unlike the other core components of Loki, the chunk store is not a separate service, job, or process, but a library embedded in the ingesters and queriers that need to access Loki data.
The chunk store relies on a unified "NoSQL" store interface (DynamoDB, Bigtable, and Cassandra) that can be used to back the chunk store index. The interface assumes that the index is a collection of entries keyed as follows.
- A hash key, which is required for all reads and writes.
- A range key, which is required for writes and can be omitted for reads; it can be queried by prefix or range.
The supported databases implement this interface slightly differently.
- DynamoDB supports range and hash keys natively, so index entries are modeled directly as DynamoDB rows, with the hash key as the distribution key and the range as the range key.
- For Bigtable and Cassandra, index entries are modeled as individual column values: the hash key becomes the row key and the range key becomes the column key.
A set of schemas is used to map the matchers and tag sets used on reads and writes into the appropriate operations on the index. New schemas are added as Loki evolves, mainly in an attempt to better balance writes and improve query performance.
Comparison with other logging systems
EFK (Elasticsearch, Fluentd, Kibana) is used to collect, visualize, and query logs from various sources.
The data in Elasticsearch is stored on disk as unstructured JSON objects. Both the keys of each object and the contents of each key are indexed. Data can then be queried by defining queries with JSON objects (called the Query DSL) or through the Lucene query language.
In contrast, Loki in single-binary mode can store data on disk, but in horizontally scalable mode the data must be stored in a cloud storage system such as S3, GCS, or Cassandra. Logs are stored as plain text and tagged with a set of tag names and values, where only the tags are indexed. This tradeoff makes Loki cheaper to operate than a full index. Logs in Loki are queried using LogQL. Because of this design tradeoff, LogQL queries that filter on content (i.e., the text within log lines) must load all chunks within the search window that match the tags defined in the query.
Fluentd is typically used to collect logs and forward them to Elasticsearch. Fluentd is known as a data collector: it can gather logs from many sources, process them, and then forward them to one or more destinations.
Promtail, by contrast, is tailored for Loki. Its primary mode of operation is to discover log files stored on disk and forward them to Loki associated with a set of tags. Promtail can do service discovery for Kubernetes Pods running on the same node, act as a Docker log driver, read logs from a specified folder, and fetch systemd logs continuously.
Loki represents logs by a set of tags in a similar way to how Prometheus represents metrics. When deployed in an environment with Prometheus, the logs from Promtail typically have the same tags as your application metrics, because the same service discovery mechanism is used. Having consistently labeled logs and metrics allows users to seamlessly switch between metrics and logs, which helps with root cause analysis.
Kibana is used to visualize and search Elasticsearch data and is very powerful for analyzing that data. Kibana provides many visualization tools for data analysis, such as maps, machine learning for anomaly detection, and relationship graphs. Alerts can also be configured to notify users when something unexpected happens.
In contrast, Grafana is specifically tailored to time series data from data sources such as Prometheus and Loki. Dashboards can be set up to visualize metrics (with logging support coming soon), and data can also be queried on an ad hoc basis using exploration views. Like Kibana, Grafana also supports alerts based on your metrics.
Installation
The official recommendation is to use Tanka for installation; it is a re-implementation of Ksonnet that Grafana Labs uses for its own production deployments. However, Tanka is not yet widely used and few people are familiar with it, so we will not cover it here. Instead, the following three methods are introduced.
Installing Loki with Helm
Prerequisites
First, make sure you have a Kubernetes cluster deployed and the Helm client installed and configured, then add Loki's chart repository:
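The repository URL below is the one that was documented for this generation of the Loki chart; newer chart releases live in the grafana/helm-charts repository, so adjust it accordingly.

```bash
helm repo add loki https://grafana.github.io/loki/charts
```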
The chart repository can be updated using the following command.
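For example:

```bash
helm repo update
```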
Deploy Loki
Deploy with default configuration
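Assuming the repository was added under the name loki as above:

```bash
helm upgrade --install loki loki/loki
```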
Specify namespace
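For example, to install into a namespace called loki (the namespace name is only an example):

```bash
helm upgrade --install loki --namespace=loki loki/loki
```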
Specify configuration
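The keys and values below are placeholders for whatever chart settings you want to override:

```bash
helm upgrade --install loki loki/loki --set "key1=val1,key2=val2,..."
```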
Deploy Loki tool stack (Loki, Promtail, Grafana, Prometheus)
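The loki-stack chart bundles Loki and Promtail and can optionally enable Grafana and Prometheus; the exact value names depend on the chart version you are installing:

```bash
helm upgrade --install loki loki/loki-stack \
  --set grafana.enabled=true,prometheus.enabled=true
```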
Deploying Grafana
To install Grafana to a Kubernetes cluster using Helm, you can use the command shown below.
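For example, with Helm 3 and the old stable charts repository (in newer setups the chart is grafana/grafana from https://grafana.github.io/helm-charts):

```bash
helm install loki-grafana stable/grafana
```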
To obtain the Grafana administrator password, you can use the command shown below.
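The password is stored in a Kubernetes Secret named after the release (loki-grafana here); adjust the namespace to wherever Grafana was installed:

```bash
kubectl get secret --namespace <YOUR-NAMESPACE> loki-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```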
To access the Grafana UI page, you can use the following command.
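One simple option is to port-forward the Grafana service to your local machine (the service name again assumes the loki-grafana release):

```bash
kubectl port-forward --namespace <YOUR-NAMESPACE> service/loki-grafana 3000:80
```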
Then open http://localhost:3000 in your browser and log in with the username admin and the password output above. Then follow the prompts to add the Loki data source, using http://loki:3100 as the Loki address.
Accessing Loki using HTTPS Ingress
If Loki and Promtail are deployed on different clusters, you can add an Ingress object in front of Loki. For security, the Ingress can be exposed over HTTPS by adding a certificate, and Basic Auth authentication can be enabled on it.
In Promtail, set the corresponding client values so that it communicates with Loki over HTTPS using Basic Auth authentication, as sketched below.
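A minimal sketch of the Promtail client configuration, assuming a hypothetical loki.example.com hostname and placeholder credentials:

```yaml
clients:
  - url: https://loki.example.com/loki/api/v1/push
    basic_auth:
      username: <username>
      password: <password>
```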
An example of an Ingress object for Loki is sketched below.
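A minimal sketch, assuming the NGINX ingress controller, a hypothetical loki.example.com hostname, and pre-created loki-tls and loki-basic-auth secrets:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: loki
  annotations:
    # Basic Auth handled by the NGINX ingress controller
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: loki-basic-auth
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - loki.example.com
      secretName: loki-tls
  rules:
    - host: loki.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: loki
                port:
                  number: 3100
```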
Installing Loki with Docker
We can use Docker or Docker Compose to install Loki for evaluation, testing, or development, but for production environments we recommend the Tanka or Helm methods.
Prerequisites
- Docker
- Docker Compose (optional, installation is only required if you use the Docker Compose method)
Installing with Docker
Linux
The commands below download the loki-config.yaml and promtail-config.yaml configuration files into the current directory, and the Docker containers then use these configuration files to run Loki and Promtail.
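A sketch of the commands, assuming the 1.6.0 release; the config file URLs and image tags below should be adjusted to the release you are actually installing:

```bash
# Download the example Loki and Promtail configs, then run both containers
wget https://raw.githubusercontent.com/grafana/loki/v1.6.0/cmd/loki/loki-local-config.yaml -O loki-config.yaml
docker run -v $(pwd):/mnt/config -p 3100:3100 grafana/loki:1.6.0 -config.file=/mnt/config/loki-config.yaml
wget https://raw.githubusercontent.com/grafana/loki/v1.6.0/cmd/promtail/promtail-docker-config.yaml -O promtail-config.yaml
docker run -v $(pwd):/mnt/config -v /var/log:/var/log grafana/promtail:1.6.0 -config.file=/mnt/config/promtail-config.yaml
```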
Installing with Docker Compose
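A sketch using the docker-compose file shipped in the Loki repository; the path and version are illustrative and may differ for other releases:

```bash
wget https://raw.githubusercontent.com/grafana/loki/v1.6.0/production/docker-compose.yaml -O docker-compose.yaml
docker-compose -f docker-compose.yaml up
```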
Local installation of Loki
Binary files
Each release includes binaries for Loki, which can be found on GitHub’s Release page.
openSUSE Linux Installer
The community provides Loki packages for openSUSE Linux, which can be installed using the following.
- Add the repository https://download.opensuse.org/repositories/security:/logging/ to the system. For example, if you are using Leap 15.1, execute sudo zypper ar https://download.opensuse.org/repositories/security:/logging/openSUSE_Leap_15.1/security:logging.repo ; sudo zypper ref
- Install the Loki package with zypper in loki
- Start and enable the Loki and Promtail services:
  systemctl start promtail && systemctl enable promtail
  systemctl start loki && systemctl enable loki
- Modify the configuration files as required: /etc/loki/promtail.yaml and /etc/loki/loki.yaml
Manual build
Prerequisite
- Go version 1.13+
- Make
- Docker (for updating protobuf files and yacc files)
Build
Clone the Loki source code into the $GOPATH/src/github.com/grafana/loki path:
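For example:

```bash
git clone https://github.com/grafana/loki $GOPATH/src/github.com/grafana/loki
```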
Then switch to the code directory and execute the make loki
command.
Getting Started with Loki
Loki configuration in Grafana
Grafana has built-in support for Loki in versions 6.0 and above. It is recommended that you use 6.3 or later to have access to the new LogQL features.
- Log in to your Grafana instance. If this is the first time you are running Grafana, the username and password both default to admin.
- In Grafana, go to Configuration > Data Sources via the icon in the left sidebar.
- Click the + Add data source button.
- Select Loki from the list.
- The Http URL field is the address of your Loki server. For example, it might be http://localhost:3100 when running locally or with Docker using port mapping. When running with docker-compose or Kubernetes, the address is likely to be http://loki:3100.
- To view the logs, click Explore on the sidebar, select the Loki data source in the top-left drop-down menu, and then use the Log labels button to filter the log streams.
Querying Loki with LogCLI
If you prefer a command-line interface, LogCLI allows you to run LogQL queries against a Loki server.
Installation
Binary (recommended)
The release package downloaded from the Release page contains the logcli binaries.
Source code installation
You can also compile directly from source with Go, using the go get command shown below to fetch logcli; the binary will appear in the $GOPATH/bin directory.
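For example:

```bash
go get github.com/grafana/loki/cmd/logcli
```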
Usage examples
Assuming you are currently using Grafana Cloud, you need to set the following environment variables.
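A sketch of the variables; the address and credentials depend on your own Grafana Cloud stack:

```bash
export LOKI_ADDR=https://logs-us-west1.grafana.net
export LOKI_USERNAME=<your Grafana Cloud user ID>
export LOKI_PASSWORD=<your Grafana Cloud API key>
```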
If you are running Loki locally, you can point LogCLI directly at the local instance without a username and password:
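For example:

```bash
export LOKI_ADDR=http://localhost:3100
```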
Note: if you add a proxy server in front of Loki and configure authentication on it, you still need to set the corresponding LOKI_USERNAME and LOKI_PASSWORD values.
Once configured, you can use some of the logcli
commands as shown below.
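A few illustrative commands (the job tag value is just an example):

```bash
logcli labels job
logcli query '{job="syslog"}'
logcli query '{job="syslog"}' --limit=200 --since=1h
```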
Batch Search
Starting with Loki 1.6.0, logcli sends log queries to Loki in batches.
If you set the query’s -limit
parameter (default is 30) to a larger number, say 10000, then logcli will automatically send this request to Loki in batches, with the default batch size being 1000.
Loki has a server-side limit on the maximum number of rows returned in a query (default is 5000). Batch sending allows you to issue requests larger than the server-side limit, as long as the -batch
size is smaller than the server limit.
Note that query metadata is printed on stderr for each batch, and this can be stopped by setting the --quiet
parameter.
Configured values take effect in order of increasing precedence: environment variables first, then command-line flags.
Command Details
Detailed usage information for the logcli command-line tool can be printed with its built-in help, as shown below.
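For example:

```bash
logcli --help
logcli query --help
```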
Label
Label tags are key-value pairs that can describe anything; we like to call them the metadata that describes a log stream. If you are familiar with Prometheus, you will already have some understanding of labels. They are also defined in Loki's scrape configuration and serve the same purpose as in Prometheus; these tags make it very easy to correlate application metrics with log data.
The tags in Loki perform a very important task: they define a stream. More specifically, the combination of each tag key and value defines the stream. If just one tag value changes, this will create a new stream.
If you are familiar with Prometheus, the corresponding term there is series, and Prometheus has one additional dimension: the metric name. Loki simplifies this; since there are no metric names, only labels, it was eventually decided to use streams instead of series.
Label example
The following example will illustrate the basic use and concepts of Label tags in Loki.
First, take a look at the following example.
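A minimal Promtail scrape configuration along these lines (the paths and label values are illustrative):

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
```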
This configuration will tail the log file and attach a job=syslog tag to its data, which we can query like this.
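For example:

```
{job="syslog"}
```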
This will create one stream in Loki. Now let's add some more job configurations.
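For instance, extending the scrape configuration above with a second job (again, paths and label values are illustrative):

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
  - job_name: apache
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache
          __path__: /var/log/apache.log
```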
Now there are two streams in Loki. We can query these streams in several ways.
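Some possible queries, assuming the job tags defined above:

```
{job="apache"}
{job="syslog"}
{job=~"apache|syslog"}
```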
The last query uses a regex tag matcher to get the logs of streams whose job tag value is either apache or syslog. Next, let's see how to use an additional tag.
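For instance, an env tag could be added to both jobs (the value dev is illustrative):

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          env: dev
          __path__: /var/log/syslog
  - job_name: apache
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache
          env: dev
          __path__: /var/log/apache.log
```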
To get the logs of these two tasks you can use the following instead of the regex approach.
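Assuming the env tag above:

```
{env="dev"}
```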
By using one tag it is possible to query many log streams, and by combining several different tags it is possible to create very flexible log queries.
Label tags are indices of Loki log data, and they are used to find the compressed log content, which is stored separately as blocks. Each unique combination of labels and values defines a stream , a stream of logs that are batched, compressed, and stored as blocks.
Cardinality
The previous example uses a statically defined Label tag with a single value; however, there are ways to define the label dynamically. For example, we have log data like the following.
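For instance, an Apache-style access log line such as the following (the values are made up):

```
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 ..."
```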
We can use the following to parse this log data.
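A sketch of a Promtail regex pipeline stage for this format (the expression is simplified and would need to match your actual log format):

```yaml
scrape_configs:
  - job_name: apache
    pipeline_stages:
      - regex:
          expression: '^(?P<ip>\S+) (?P<identd>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<action>\S+)\s?(?P<path>\S+)?\s?(?P<protocol>\S+)?" (?P<status_code>\d{3}|-) (?P<size>\d+|-)'
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache
          env: dev
          __path__: /var/log/apache.log
```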
This regex matches each component of the log line and extracts the value of each component into a capture group. Inside the pipeline code, this data is placed into a temporary data structure that allows it to be used for other processing while the log line is being processed (at which point the temporary data is discarded).
From this regex, we then use two of these capture groups to dynamically set two tags based on the content of the log line itself.
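A sketch of a labels stage added after the regex stage above; an empty value tells Promtail to use the extracted value with the same name:

```yaml
    pipeline_stages:
      - regex:
          expression: '...'   # the expression shown above
      - labels:
          action:
          status_code:
```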
Suppose we have the following lines of log data:
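For example (timestamps and values are made up):

```
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932
11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932
11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932
11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932
```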
Then, after the logs are collected in Loki, they are created as a stream as shown below.
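Roughly speaking, the resulting streams would look like this (only the action and status_code capture groups become tags):

```
{job="apache",env="dev",action="GET",status_code="200"}  11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932
{job="apache",env="dev",action="POST",status_code="200"} 11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932
{job="apache",env="dev",action="GET",status_code="400"}  11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932
{job="apache",env="dev",action="POST",status_code="400"} 11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932
```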
These 4 log lines will become 4 separate streams and start populating 4 separate blocks. Any additional log lines that match these tag/value
combinations will be added to the existing streams. If another unique label combination comes in (e.g. status_code=“500”) another new stream will be created.
For example, if we set a Label tag for the IP, not only will each request from the user become a unique stream, but each request with a different action or status_code from the same user will get its own stream.
If there are 4 common actions (GET, PUT, POST, DELETE) and 4 common status codes (there can easily be more than 4!), this would already be 16 streams and 16 separate chunks. Now multiply that by every user: if we use an IP tag, you will quickly have thousands or tens of thousands of streams.
That Cardinality is high enough to cripple Loki.
When we talk about Cardinality, we mean the combination of tags and values and the number of streams they create. High Cardinality means using tags with a large range of possible values, such as IP, or a combination that requires other tags, even if they have a small and limited set, such as status_code and action.
High Cardinality causes Loki to build a huge index (💰💰💰💰) and to flush thousands of tiny chunks to object storage (slow). Loki currently performs very poorly in this configuration, and it is very uneconomical to run and use this way.
Loki Performance Optimization
Now that we know that it’s bad to use a lot of tags or tags with a lot of values, how should I query my logs? If none of the data is indexed, won’t the query be really slow?
We see that people coming to Loki are used to other index-heavy solutions, and they feel they need to define a lot of tags in order to query their logs efficiently; after all, many other logging solutions are built around indexing, and that is the usual way of thinking.
When using Loki, you may need to forget what you know and see how the problem can be solved with parallelization instead. The superpower of Loki is breaking queries into small chunks and scheduling them in parallel, so that you can query huge amounts of log data in a small amount of time.
Large indexes are complex and expensive. Typically, a full-text index of your log data is the same size as or larger than the log data itself. To query your log data, the index needs to be loaded, probably into memory for performance, which makes it very difficult to scale; as you collect more logs, your index grows quickly.
Now let's talk about Loki: its index is usually an order of magnitude smaller than the amount of logs you are collecting. So if you do a good job of keeping your streams to a minimum, the index grows very slowly compared to the logs collected.
Loki will effectively keep your static costs as low as possible (index size, memory requirements, and static log storage) and allow query performance to be controlled by horizontal scaling at runtime.
To see how this works, let's go back to the example above where we query access log data for a specific IP address. Rather than using a label to store the IP, we use a filter expression to query for it.
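Assuming the apache job from the earlier examples:

```
{job="apache"} |= "11.11.11.11"
```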
Behind the scenes, Loki breaks that query into smaller shards and opens each chunk of the streams whose tags match, then starts looking for that IP address.
The size of these shards and the amount of parallelization are configurable and based on the resources you provide. If you want, you can configure the shard interval to 5m, deploy 20 queriers, and process gigabytes of logs in seconds. Or you can go even crazier, configure 200 queriers, and process terabytes of logs!
This tradeoff, a smaller index plus parallel queries versus a larger and faster full-text index, is what makes Loki so cost effective relative to other systems. The cost and complexity of operating a large index is high and usually fixed: you pay for it 24 hours a day, whether you are querying it or not.
The advantage of this design is that you can decide how much query power you want to have, and you can change it on demand. Query performance becomes a function of how much you want to spend. At the same time, the data is heavily compressed and stored in low-cost object stores like S3 and GCS. This minimizes fixed operational costs while still providing incredibly fast querying capabilities.