
Open source monitoring system Prometheus

2022-06-24 12:46:29 PHP Development Engineer

Prometheus is the second open source project to graduate from the CNCF, after Kubernetes (k8s), and it is modeled on Google's Borgmon. Starting from the question of what monitoring is, this article explains in plain terms the core design points of Prometheus: its architecture, target discovery, metric model, aggregation and queries, and more.

One. Preface

I have worked with many kinds of monitoring systems, including the open source CAT, Zipkin, and Pinpoint, and have done in-depth secondary development on them. I have also used paid cloud APM systems, so I have a good understanding of the strengths and limitations of each kind of monitoring.

Last October we quickly rolled out a business monitoring platform that is easy to use, flexible, and has some distinctive features. It is built on Prometheus. From the technology selection stage onward, Prometheus and its ecosystem left a strong impression, so let's talk about monitoring design and Prometheus.

A monitoring system usually covers collection (information sources: logs, metrics), reporting (protocols: HTTP, TCP), aggregation, storage, visualization, and alerting. Collection and reporting are the core functions of the client. Common approaches are periodic external probing (as in the early Nagios and Zabbix), manually woven AOP code (instrumentation points), and automatic bytecode weaving (no manual instrumentation).

Two. What is monitoring

A service system or solution, built as a complete set of products, for managing technology and business quantitatively.

This product mainly solves two problems (its product value):

  • Technology: turn the system's functions, states, and other technical performance into data and visualize it, to ensure the stability and security of the technical system.
  • Business: turn business performance into data and visualize it, so that it can be analyzed and acted on in time, to keep the business running efficiently.

Three. The basic principles of monitoring

  • Monitor early: monitoring must be considered during architecture design, not left until after deployment.
  • What to monitor: take a global view, working from the top (the business) down. For a typical business, start by monitoring the places closest to the user. A good user experience drives business growth, so these are also the most sensitive and important places.
  • User friendly: the monitoring service should be easy to use and easy to integrate, and automated as far as possible.
  • Information source: it should serve as a source of information for technical and business staff, and help locate and resolve faults.
  • Visualization: show all kinds of data clearly (with a variety of charts), and keep records of alerts and other information.
  • Alerting: Which problems need a notification? (e.g. those that are meaningful and require human intervention.) Who is notified? (e.g. the frontline owner of the system.) How are they notified? (e.g. SMS, phone, or other channels; the message should be clear, accurate, and actionable.) How often? (e.g. every 5 minutes.) When do notifications stop, and when are they escalated to someone else? (e.g. stop once the system is back to normal; if the problem has not been resolved after two hours, escalate to the next level of management.)
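The alerting questions above can be turned into a small decision function. This is only a sketch: the 5-minute repeat interval and the 2-hour escalation threshold come from the examples in the list, while the function name, its parameters, and the action strings are hypothetical and do not belong to any Prometheus or Alertmanager API.

```python
from datetime import timedelta

REPEAT_INTERVAL = timedelta(minutes=5)  # "How often?" example from the list
ESCALATE_AFTER = timedelta(hours=2)     # "When to escalate?" example from the list


def next_action(firing, firing_for, since_last_notification):
    """Decide what to do with one alert (hypothetical policy sketch).

    firing: whether the problem is still present.
    firing_for: how long the alert has been active.
    since_last_notification: time since the last message, or None if none was sent.
    """
    if not firing:
        return "stop"  # back to normal: stop notifying
    if since_last_notification is not None and since_last_notification < REPEAT_INTERVAL:
        return "wait"  # too soon to repeat the notification
    if firing_for >= ESCALATE_AFTER:
        return "escalate"  # unresolved for two hours: notify the next level
    return "notify_owner"  # notify the frontline owner of the system
```

For instance, an alert that has been firing for 10 minutes and was never announced goes to the system owner, while one that has been firing for 2 hours with the last message sent 5 minutes ago is escalated.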

Four. Prometheus design analysis

Prometheus focuses on what is happening right now rather than on tracking data from weeks ago, on the premise that "most monitoring queries and alerts cover data from within one day". Practice bears this out: 85% of time series queries cover data from the last 26 hours.

In short, Prometheus is a near-real-time monitoring system with time series data capabilities.

1.  The overall architecture

Prometheus architecture diagram (from the official Prometheus website)

A simplified view of the architecture is as follows:

Prometheus mainly uses a pull model to fetch the time series data exposed by the monitored programs (targets/exporters). It also provides the Pushgateway service, so a small amount of data can be sent in push mode as well.
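To make the pull model concrete, here is a minimal sketch of both sides using only the Python standard library: a target that exposes metrics as plain text over HTTP, and a function that pulls and parses them once. The metric names, the values, and the helper names are hypothetical. The path /prometheus/metrics mirrors the one used in this article's configuration examples. The parser handles only simple "name value" lines and is not a complete implementation of the Prometheus text format.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric values; a real service would update these.
METRICS = {"demo_requests_total": 42, "demo_memory_bytes": 1048576}


def render_metrics(metrics):
    """Render name/value pairs as one 'name value' line per metric."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def parse_exposition(text):
    """Parse simple 'name value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


class MetricsHandler(BaseHTTPRequestHandler):
    """The monitored side: it only answers when someone pulls."""

    def do_GET(self):
        if self.path != "/prometheus/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def start_target(port=0):
    """Serve metrics on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(url):
    """The monitoring side: pull the current values from one target."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_exposition(response.read().decode("utf-8"))
```

Calling start_target() and then scrape() against the returned server's address performs one pull. The target never needs to know where the monitoring system lives, which is the main design difference from push-based reporting.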

2.  Target discovery

Prometheus pulls metric data from services, so how does it discover those services?

There are several ways to handle the discovery of target resources:

2.1 Manually maintained configuration list

Add a static configuration by hand that specifies the services to monitor, as in the following targets blocks:



scrape_configs:
  .....
  # Activity service
  - job_name: 'xxxxxxactivity-wap'
    metrics_path: /prometheus/metrics
    static_configs:
    - targets: ['10.xx.xx.xx:8080',
                ......  ......]
  # Coupon service
  - job_name: 'xxxxxxshop-coupon'
    metrics_path: /prometheus/metrics
    static_configs:
    - targets: ['10.xx.xx.xx:8080',
                ......  ......]
  # Marketing
  - job_name: 'xxxxxx-sales-api'
    metrics_path: /prometheus/metrics
    static_configs:
    - targets: ['10.xx.xx.xx:8080',
                ......  ......
               ]
  ......

Obviously, although this approach is very simple, maintaining a long list of service hosts by hand amid busy day-to-day work is neither scalable nor elegant. A dynamic, large-scale environment makes it unsustainable.

2.2 File-based discovery

Specify the directory or files to load. Changes to these files are detected through disk watches, and Prometheus applies them immediately. As a fallback, the file contents are also re-read at the configured interval (refresh_interval), and any changes found take effect.

An example:



......
# Monitor the order center OMS-API
scrape_configs:
  - job_name: 'oms-api'
    metrics_path: /prometheus/metrics
    file_sd_configs:
    - files:
      - 'conf/oms-targets.json'
      # Defaults to 5 minutes
      refresh_interval: 5m
......

The conf/oms-targets.json file (changes to this file are watched; it is usually generated by another program, for example from a CMDB):



[
  {
    "labels": {
      "job": "oms-api"
    },
    "targets": [
      "ip1:8080", "ip2:8080", ......
    ]
  }
]
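Since the targets file is usually produced by another program, a generator can be very small. The sketch below builds the same structure as the JSON above and writes it atomically, so that a reader never sees a half-written file. The function names and the idea of passing in a plain host list (for example exported from a CMDB) are assumptions for illustration.

```python
import json
import os
import tempfile


def build_target_groups(job, hosts, port=8080):
    """Build the list-of-groups structure used by the targets file."""
    targets = [f"{host}:{port}" for host in hosts]
    return [{"labels": {"job": job}, "targets": targets}]


def write_targets_file(path, groups):
    """Write to a temporary file first, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```

A scheduled job could call build_target_groups("oms-api", hosts) with the current host list and write the result to conf/oms-targets.json. The file watch or the refresh interval described above then picks up the change.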

2.3 API-based autodiscovery

The service discovery mechanisms supported natively at present include Amazon EC2, Azure, Consul, Kubernetes, and others.

Take Consul as an example. Once an instance has started successfully, a script (or some other means) can register the current node with Consul, much like writing the current node's information to ZooKeeper or Redis after startup. Prometheus picks up changes in Consul's data in real time and hot-reloads its targets automatically.



# Monitor the order center OMS-API
- job_name: 'oms-api'
  consul_sd_configs:
    # Consul address; by default all services are discovered
    - server: 'xxxxxx'
      services: []
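The registration step mentioned above (an instance signing itself up with Consul after startup) might look like the following sketch. It prepares, but does not send, a request to the Consul agent's service registration endpoint. The service name, address, health check interval, and helper names are hypothetical, and the payload uses only the basic fields of the agent API as I understand it.

```python
import json
import urllib.request


def build_registration(name, address, port, metrics_path="/prometheus/metrics"):
    """Build the JSON body describing this instance to the Consul agent."""
    return {
        "ID": f"{name}-{address}-{port}",  # hypothetical ID scheme, one per instance
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            # Assumption: reuse the metrics URL as a simple HTTP health check.
            "HTTP": f"http://{address}:{port}{metrics_path}",
            "Interval": "10s",
        },
    }


def build_register_request(consul_url, registration):
    """Prepare the PUT request for the local Consul agent without sending it."""
    return urllib.request.Request(
        url=f"{consul_url}/v1/agent/service/register",
        data=json.dumps(registration).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

A startup script would pass the prepared request to urllib.request.urlopen once the instance is ready to serve traffic. After that, the consul_sd_configs job above can discover the instance without any change to the Prometheus configuration.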

Note: Consul is an open source tool written in Go that provides service registration, service discovery, and configuration management for distributed, service-oriented systems. Consul offers service registration and discovery, health checks, key/value storage, multi-datacenter support, and distributed consistency guarantees.

2.4 DNS-based autodiscovery

When none of the previous methods fit, DNS service discovery lets you specify a list of DNS names and then query the records under those names to find the target list. It is used less often, so we won't go into detail.

Once the monitored targets have been discovered, you can view them in the built-in web UI, as shown (local simulation environment):

3.  Metric collection and aggregation

Prometheus pulls time series metrics from external processes (exporters). Users can configure the details of the pull: the scrape frequency, early aggregation rules, how the target process exposes its metrics (HTTP URL), how to connect, connection authentication, and so on.


A metric is a quantitative measurement of some attribute of software or hardware. Unlike log-based monitoring with ELK, Prometheus works with four types of metrics:

(1) Gauge: a number that can go up or down (essentially a snapshot of a measurement). A common example is memory usage.

(2) Counter: a value that only increases and never decreases, unless it is reset to 0. An example is the number of HTTP requests a system has served.

(3) Histogram: samples observations of the monitored metric and shows the frequency distribution of the data.

The figure above highlights why distributions matter for understanding metrics such as latency. Suppose the SLO (service level objective) is 150ms. An average latency of 137ms then looks acceptable. In reality, however, 1 in every 10 requests takes longer than 193ms, so 10 out of every 100 requests miss the objective. (As shown: the 90th and 99th percentile lines both miss the target.)
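The gap between the average and the tail is easy to reproduce with a few lines of arithmetic. The sample below is hypothetical and is not the data behind the figure. It is chosen so that the mean comes out at 137ms against a 150ms SLO while roughly one request in ten is slow.

```python
SLO_MS = 150

# Hypothetical sample of 100 request latencies in milliseconds:
# 89 fast requests and 11 slow ones.
LATENCIES_MS = [115] * 89 + [315] * 11


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceiling division without floats
    return ordered[rank - 1]


def fraction_over(values, threshold):
    """Share of values strictly above the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


print(mean(LATENCIES_MS))                   # 137.0, under the SLO
print(percentile(LATENCIES_MS, 90))         # 315, over the SLO
print(fraction_over(LATENCIES_MS, SLO_MS))  # 0.11
```

The mean alone would report a healthy service, while the 90th percentile and the share of requests over the threshold show that about one user in ten has a poor experience.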

(4) Summary: very similar to a histogram. The main difference is that a summary computes its quantiles on the client side, while a histogram's quantiles are computed on the server side. A summary is therefore only suitable for per-instance metrics that do not need centralized aggregation (such as GC-related metrics).

Three rules of thumb:

  1. If you need to aggregate or summarize data from multiple collection nodes, choose a histogram;
  2. If you need to observe the data distribution across multiple collection nodes, choose a histogram;
  3. If you don't need to consider the cluster as a whole (e.g. GC-related information), you can choose a summary, which provides more accurate quantiles.
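The reason behind the first two rules is that bucket counts from different nodes can simply be added together, whereas quantiles that each node has already computed cannot be combined meaningfully. The sketch below illustrates this with hypothetical bucket bounds and data. The quantile estimate uses linear interpolation inside a bucket, which is a simplified stand-in and not the exact server-side algorithm.

```python
import bisect

BOUNDS = [50, 100, 200, 400]  # hypothetical bucket upper bounds in ms


def bucket_counts(latencies):
    """Count observations per bucket; the last slot holds values above all bounds."""
    counts = [0] * (len(BOUNDS) + 1)
    for value in latencies:
        counts[bisect.bisect_left(BOUNDS, value)] += 1
    return counts


def merge(*histograms):
    """Histograms from several nodes aggregate by adding counts bucket by bucket."""
    return [sum(column) for column in zip(*histograms)]


def estimate_quantile(counts, q):
    """Estimate a quantile by interpolating linearly inside the matching bucket."""
    rank = q * sum(counts)
    cumulative = 0
    for i, count in enumerate(counts):
        if count and cumulative + count >= rank:
            if i == len(BOUNDS):
                return float(BOUNDS[-1])  # overflow bucket: report the highest bound
            lower = BOUNDS[i - 1] if i > 0 else 0
            return lower + (BOUNDS[i] - lower) * (rank - cumulative) / count
        cumulative += count
    raise ValueError("empty histogram")


node_a = bucket_counts([40] * 90)   # a busy node with fast requests
node_b = bucket_counts([300] * 10)  # a quiet node with slow requests
cluster = merge(node_a, node_b)     # [90, 0, 0, 10, 0]
```

With this data the per-node medians are estimated at 25.0 and 300.0, and averaging them gives 162.5, although 90% of all requests took 40ms. The median estimated from the merged buckets falls in the first bucket, which matches the real distribution.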

4.  Aggregation and queries

Prometheus has a built-in query DSL, PromQL, which supports aggregation and many forms of query and can be used directly in the browser through the web UI. In our practice, Grafana is more practical and better looking for visualization.

For more PromQL syntax, see the official documentation; we won't go into it here.

About metric aggregation

Prometheus provides several functions for aggregating metrics. Typical aggregate measures include:

  • Mean
  • Median
  • Percentiles (for example the 99th percentile line in the figure below: 99% of values are below 12s)
  • Standard deviation (measures the spread of a data set; 0 means every value equals the mean, and the larger the value, the greater the spread)
  • Rate of change
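To show what these measures compute, here is a plain Python sketch over hypothetical sample values. It is not PromQL and does not reproduce the exact behavior of PromQL's avg, quantile, stddev, or rate. In particular, the rate function below only handles counter resets and does no extrapolation.

```python
import statistics


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceiling division without floats
    return ordered[rank - 1]


def summarize(values):
    """Mean, median, 99th percentile, and population standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p99": percentile(values, 99),
        "stddev": statistics.pstdev(values),
    }


def counter_rate(samples):
    """Average per-second increase of a counter given (timestamp, value) samples.

    A counter only goes up, so a drop means it was reset to 0; the new value
    is then counted as the increase since the reset.
    """
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
print(counter_rate([(0, 100), (20, 130), (40, 10), (60, 30)]))  # 1.0
```

In the rate example the counter rises by 30, resets and reaches 10, then rises by another 20, for a total increase of 60 over 60 seconds.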

5.  Data model

Like other mainstream time series databases, Prometheus defines its data model in terms of a metric name, one or more labels (the same meaning as tags in InfluxDB), and a metric value.

Represented as JSON, a raw data point in a time series database looks like this:


## A time series data point represented as JSON
{
  "timestamp": 1346846400,            // timestamp
  "metric": "total_website_visits",   // metric name
  "tags": {                           // label set
    "instance": "aaa",
    "job": "job001"
  },
  "value": 18                         // metric value
}

The metric name plus a set of labels uniquely identifies a time series (that is, a timeline). Once a label changes, a new time series is created, and any configuration based on the original time series no longer applies. Queries can select time series by label conditions, both simple and complex.

The figure above is a simple view of how all the data points are distributed: the horizontal axis is time, the vertical axis is the time series, and every point in the area is a data point. Each time Prometheus receives data, it receives one vertical line in the figure. This is an apt picture: at any given moment each time series produces only one data point, but many time series produce data at that moment, and joining those points gives a vertical line. This property matters because it shapes the optimization strategies for data writes and compression.
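The identity rule and the write pattern described above can be sketched with a small in-memory store. This is an illustration only and says nothing about how Prometheus actually stores data. The class and method names are hypothetical, and the sample values reuse the JSON example.

```python
def series_key(metric, labels):
    """A series is identified by its metric name plus its full label set."""
    return (metric, tuple(sorted(labels.items())))


class SeriesStore:
    def __init__(self):
        self.series = {}  # series key -> list of (timestamp, value)

    def append_scrape(self, timestamp, samples):
        """Store one scrape: one data point per series, all sharing a timestamp."""
        for metric, labels, value in samples:
            key = series_key(metric, labels)
            self.series.setdefault(key, []).append((timestamp, value))

    def select(self, metric, **matchers):
        """Return the series of a metric whose labels equal every matcher."""
        result = {}
        for (name, labels), points in self.series.items():
            label_map = dict(labels)
            if name == metric and all(label_map.get(k) == v for k, v in matchers.items()):
                result[(name, labels)] = points
        return result


store = SeriesStore()
visits = "total_website_visits"
store.append_scrape(1346846400, [(visits, {"instance": "aaa", "job": "job001"}, 18)])
store.append_scrape(1346846415, [(visits, {"instance": "aaa", "job": "job001"}, 20)])
# Changing a label value starts a new series instead of extending the old one.
store.append_scrape(1346846430, [(visits, {"instance": "bbb", "job": "job001"}, 3)])
print(len(store.series))  # 2
```

Each call to append_scrape corresponds to one vertical line in the picture: every series receives exactly one point with the same timestamp.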

Retention time

Prometheus is designed for short-term monitoring and alerting, so by default it keeps only 15 days of time series data. If you need longer retention, consider storing the data on a separate platform. Our current solution is remote storage: the data Prometheus pulls is written to InfluxDB, which gives better storage elasticity and persists the data in real time.

6.  The Prometheus open source ecosystem

The Prometheus ecosystem includes the alerting engine and alert manager Alertmanager, Pushgateway for data reported in push mode, Grafana for a more elegant visual interface, RemoteStoreAdapter for remote storage, mtail for converting logs to metrics, and more.

There is also a range of exporters (which can be thought of as monitoring agents). They can be installed and used directly to monitor applications, machines, mainstream databases, message queues, and so on.

The ecosystem also has client libraries for mainstream programming languages such as Java, C, and Python.

In short, the Prometheus ecosystem is fairly complete and its community is active, so its future looks promising.


Source: "Open source world". To reprint, please credit the source; otherwise legal liability may be pursued.
