
Log Management on LOGIQ vs ELK Stack

Machine log collection, management, and analytics are essential for a company to thrive in the information-age cloud business.  I recently learned how a top-performing cloud company architects its log management infrastructure.  The underlying corporate log service evolves continuously with business needs and generates critical business value.  This write-up first describes the log system implementation on AWS.  It then discusses the architectural decisions made to address resource limitations, control the operating budget, and reduce operational disruption.

Centralized log management is the preferred approach for handling enterprise logging.  Application, IT system, and cloud microservice logs are collected and managed in one central location, for example, Amazon Elasticsearch Service (Amazon ES) on AWS.

DevOps and automation teams play a central role in developing and maintaining the log management infrastructure.  The log collection and analytics data preparation phases are all in place; see [1].  As good as the system looks from a high level, there are engineering overheads in addressing the infrastructure's compute, storage, and budget limitations.

The ingestion pipeline first filters the log data, extracting the useful portion to reduce logging noise: deciding which logs, or which parts of a log, to retain or remove.  The trimming directives usually come from the log data end users, such as data scientists or system analysts, who work with the DevOps/automation team to create customized log extraction filters for deployment.  The goal is to control both the volume and the content of the ingested logs.  The DevOps/automation team builds these ad-hoc, best-effort log filters by hand, and the process is laborious and error-prone.
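To make the idea concrete, here is a minimal sketch of such a log reduction filter.  The keep/drop patterns are purely illustrative stand-ins for the trimming directives an end user might request; real deployments would typically express these rules in the log shipper's own configuration (e.g., Fluentd or Logstash filters) rather than application code.

```python
import re

# Hypothetical trimming directives from the log end users: retain only
# ERROR/WARN/FATAL lines and drop known noise such as healthcheck chatter.
# Both pattern lists are assumptions for illustration, not real directives.
KEEP_PATTERNS = [re.compile(r"\b(ERROR|WARN|FATAL)\b")]
DROP_PATTERNS = [re.compile(r"healthcheck|heartbeat", re.IGNORECASE)]

def filter_log_lines(lines):
    """Retain only the useful portion of a log stream to cut ingestion volume."""
    kept = []
    for line in lines:
        if any(p.search(line) for p in DROP_PATTERNS):
            continue  # noise the end user asked to remove
        if any(p.search(line) for p in KEEP_PATTERNS):
            kept.append(line)
    return kept
```

The laborious part is not the code itself but keeping rules like these in sync with ever-changing applications and end-user requirements, which is exactly why the process is error-prone.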

This log reduction filtering exists because of the log data infrastructure's limitations.  In this example, the infrastructure is Amazon Elasticsearch Service (Amazon ES).  Both the compute and storage resources used for indexing need to be monitored to keep the logging system healthy.  To maintain a stable, performant operating state, total log ingestion is capped at 30 GB/day (see [8]), and logs are retained for two weeks.  After that, data are backed up to economical Amazon S3 storage.

Ingested log data are critical to company operations.  Each business unit deals with different log data and different usage.  For example, the performance and capacity team extracts metrics from logs to model system usage trends and forecast future demand.  The customer business unit extracts metrics to derive customer insight and create business value, for example, by predicting customer churn.  DevOps maintains the system's operating state against SLA (Service Level Agreement) requirements.  The log metrics extraction and subsequent analytics are highly customized, flexible, and fluid; each pipeline is created to solve a specific business problem.  It is highly desirable to apply AI/ML techniques and methodology here; see [7].  For example, the log ingestion pipeline can be enhanced with tags or labels to facilitate later AI/ML analysis, and all mutable fields in the logs can be extracted automatically for analysis.  This subject will be revisited in another blog post.

As mentioned earlier, the deployed Amazon ES resources need to be monitored and controlled for a performant log system.  Amazon ES can scale out, but the effort is no walk in the park and often requires some degree of trial and error.  New log ingestion is capped at 30 GB/day with a two-week retention period to stay within the operating budget and maintain system performance, so the system always holds about 500 GB of recent log records for processing.  Overflow logs are backed up to Amazon S3 at $25/TB-month for an indefinite period.  AWS hosting for such a setup runs around $60k/year, not including engineering and operator costs.
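A quick back-of-the-envelope check of these figures, using only the rates stated above (30 GB/day ingest, two-week retention, S3 at $25/TB-month):

```python
# All rates are taken from the article; the arithmetic is a rough sanity check.
INGEST_GB_PER_DAY = 30
RETENTION_DAYS = 14
S3_USD_PER_TB_MONTH = 25

# Hot working set held in Amazon ES: 30 * 14 = 420 GB,
# roughly the "about 500 GB" cited (indexing overhead adds to the raw total).
working_set_gb = INGEST_GB_PER_DAY * RETENTION_DAYS

# Each month, ~0.9 TB of aged-out logs rolls over to S3.
overflow_tb_per_month = INGEST_GB_PER_DAY * 30 / 1000
s3_cost_first_month_usd = overflow_tb_per_month * S3_USD_PER_TB_MONTH

print(working_set_gb, round(s3_cost_first_month_usd, 2))
```

Note that because the S3 archive is retained indefinitely, its monthly cost grows linearly over time, while the ES working set stays fixed.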

Here is a similar log infrastructure setup using LOGIQ building blocks.  The system becomes simpler because the resource limitations are alleviated by the native use of S3 storage.  See the figure below.

The LOGIQ log management infrastructure removes the built-in engineering overhead.  The new construct is efficient and straightforward.  The table below lists the infrastructure resource engineering overheads; the tags refer to the earlier drawing.

| Tag | Description | Overhead | LOGIQ |
| [4] | Need to reduce log ingestion to save log processing resources; consult and communicate with log end users. | The process is error-prone because logs, apps, and requirements change. | Simple log data ingestion pipeline. |
| [5] | DevOps implement log filters to reduce the ingested log count. | Implement filters and validate the stored logs with end users. | No need to trim log data; store unredacted logs in S3. |
| [8] | Maintain constant overhead and scale out ELK if needed. | Elasticsearch Service does not scale seamlessly. | LOGIQ scales easily using K8s pods. |
| [9] | Back up the oldest logs to S3 daily to maintain a stable log working-set size. | Adds a backup step, with no easy process for reusing the backed-up log data. | Operate directly on S3 storage: searching, event capture, AI/ML analysis, etc. |

In summary, this article described an operating log infrastructure setup on AWS and presented a similar setup built on LOGIQ.  The LOGIQ log management infrastructure removes the induced engineering overhead thanks to its competitive advantage: the native use of S3 storage.

There are plenty of references on the web about scaling Elasticsearch, and the common consensus is that such a task is not for the faint of heart.

Pepe Juan

Tsai-Chi (Pepe) Huang is a highly accomplished and innovative System Research Engineer and the Founding Engineer at LOGIQ.AI. With a passion for driving technological advancement through scientific research and scalable design, Pepe has a proven track record of success in challenging engineering, system implementation, performance optimization, and system analysis.
