Relationship with Open Source Monitoring Systems
There are many open source monitoring systems available today, such as Zabbix, Prometheus, Nagios, and Open-Falcon. How does Tingyun Infrastructure relate to them?
Tingyun Infrastructure collects monitoring metrics from operating systems and components, correlates them with APM application metrics, and applies AI techniques to help users locate problems quickly. It also aims to minimize deployment work for users, so the original design goal was to integrate natively with the monitoring systems users already run, such as Zabbix and Prometheus. During architectural design, however, we found that the architectures of these systems differ greatly, and that correlating data from arbitrary open source monitoring systems with Tingyun application and microservice metrics would impose a very large configuration burden on users. Tingyun Infrastructure therefore ultimately extends the open source Prometheus system, with the aim of making it more integrated, making agent management simpler and easier to use, and improving scalability.
Tingyun Infrastructure and open source monitoring systems are therefore not mutually exclusive: Tingyun Infrastructure is itself an extension built on the Prometheus monitoring ecosystem. We conducted extensive research when selecting a foundation. The open source monitoring systems most commonly used in the industry today are Zabbix and Prometheus. Zabbix's advantages are a platform that is simple to install and maintain, a long development history, and mature host monitoring overall; its disadvantages are poor support for components and containers and complex script configuration. Prometheus's advantages are a rich set of community-maintained component agents and good container support; its disadvantages are a weak frontend (the UI depends on integrating Grafana), complex management, and a large number of configuration files. For large-scale deployments it also needs to be integrated with systems such as Kafka, InfluxDB, and Elasticsearch, which further increases operational complexity.
In summary, we decided to modify community Prometheus monitoring in several ways: reduce the number of agents that must be deployed, strengthen agent management, provide management through web pages, and integrate Kafka and Druid storage so that data can be retained long term and the system can scale horizontally. The goal is a product that "just works" for users.
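To make the idea of feeding Prometheus-style metrics into a message queue more concrete, here is a minimal sketch in Python. It is an illustration only and not Tingyun Infrastructure's implementation. It parses a simple subset of the Prometheus text exposition format and serializes each sample as a JSON message of the kind that could be published to a queue such as Kafka. The function names `parse_samples` and `to_messages`, the `host` field, and the sample metric values are all hypothetical, and the actual publishing step is left out.

```python
import json
import re

# Matches a simple sample line: metric name, optional {labels}, then the value.
# This covers only a subset of the exposition format (trailing timestamps are ignored).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label="value" pair; escape sequences are kept as written, not decoded.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Turn exposition-format text into a list of sample dictionaries."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        records.append({"metric": name, "labels": labels, "value": float(value)})
    return records


def to_messages(records, host):
    """Attach a (hypothetical) host identifier and serialize each sample as JSON."""
    return [json.dumps({**record, "host": host}, sort_keys=True) for record in records]


# Hypothetical scrape output used only for demonstration.
EXAMPLE = '''# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1024
'''

records = parse_samples(EXAMPLE)
messages = to_messages(records, host="web-01")
```

Each string in `messages` is a self-describing record, so a downstream store can index it by metric name, label, and host. A shared identifier such as the host name is one plausible way to line infrastructure samples up with application metrics, although the source does not describe how the product actually does this.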
As the technology continues to evolve, if it becomes possible for users to integrate natively with their existing monitoring systems while keeping configuration work to a minimum, Tingyun Infrastructure will add that support as soon as possible.