The world knows how to be cruel, and it becomes crueler when you have no answers in adverse situations. For product companies, this is no myth. Consider this article a guide for startups running an internet business that face a lot of post-launch risk because of an inefficient server infrastructure plan.
NOTE: This article is not aimed at big data administrators and developers who work with distributed systems, Hadoop, HBase or other such technologies.
Businesses run on high aspirations and at huge cost. Even smaller setups need to maintain a lot of trust and deliver outstanding service to keep customers pleased. While a business runs within its predefined rules, everything goes as planned. However, once the business multiplies, or let’s say the traffic on the website or application increases, uncertainties create doomsdays for stakeholders.
If you are reading this article, I assume you are a business owner, a product manager or a product developer who knows how an internet business works and what an internet product is.
It is very important for system administrators and product managers to understand what makes a system reliable, and that undoubtedly includes a focus on system performance, operations and critical notifications.
Unfortunately, young companies start with a lot of ambition but with loose ends technically. Building a product is a dream come true, and it takes a lot of hard work to make it kickass. My focus is precisely on planning and taking the required measures throughout the life cycle of your product, and especially during critical market testing.
Most companies somehow convert their ideas into mature products, with different levels of war footing. However, the real test, or I would say the big leap, starts post launch, with keeping the product running smoothly through its early tenure.
Here are 8 quick questions to help startup companies weigh their server infrastructure plan and make sure that post-launch challenges have been addressed:
- How important is it for the product to be live 24x7x365?
- How critical is the information for the business?
- Is there a backup plan if a server or data center fails?
- Is there any disaster recovery plan for the product?
- Do you monitor your servers to take appropriate measures for avoiding downtime?
- Do you have plans to automate fail-over?
- Do you have a notification system for different business cases, so that you can take appropriate measures before a critical failure happens?
- Is your server architecture audited by an expert?
I know most of the questions look daunting at first. But rest assured that answering them takes no alien mathematics, and you will have the same opinion by the end of this article.
Building a robust small-enterprise production server infrastructure plan that handles fail-overs
Staging and production servers can differ widely in performance and response time. Here my aim is to describe the post-launch “Production Server” setup, and I am guessing that you can handle your “Staging” environment yourselves.
I will begin by recommending the best general-case answers to all the questions I have framed above. To begin with, follow the diagram below.
In most cases, the answers to the first two questions are the same: the product needs to be live all the time, and the information on the servers is critical and must be kept safe. That is because a business exists only as long as it is available with the information its customers need. So, taking these common answers to the first two questions as given, we will address questions 3 to 8 along the way.
The above diagram shows how a small business or a young product can cover its risk to a reasonable degree and reduce downtime with smart planning. If you look closely at the chart, you will see that we are talking about an application backed by a MySQL database. Here is a quick walk-through of the illustration:
- Application Servers: The core application is deployed to two different servers that are identical in versions and patches. Think of it as the same application deployed twice, in the same manner, on two different servers.
- Load balancer: A load balancer uses the round robin method to send each request to the servers in circular order. For example, assume for the above chart that all odd-numbered requests go to “APP Server 1” and all even-numbered requests go to “APP Server 2”.
- Handling fail-over for the application servers: If one server fails, the load balancer knows that it needs to send all further requests to the other server, and so downtime at the application layer can be avoided.
- Separate data storage server: The recommended way to manage an APP server is to make sure it holds nothing but static business logic, with no dynamic content or database hosted on it. All dynamic data (uploads) should go to a separate dedicated server, such as CDN-backed data storage.
- Core database configuration file: Moving ahead, let’s assume that each APP server holds a configuration file that tells it how to connect to the master database.
- Handling fail-over for the database server: Now let’s say the master database fails; then there has to be a way to switch to the backup database. For this we create a custom job that checks whether the database server is live. But why can’t this be as straightforward as it was with the APP servers? The reason is simple: each piece of data should be written to a single database at a time, because writing every change to two separate databases would roughly double the application’s response time. So we write to one master database and sync all the data to the backup database at a defined interval. For a less critical application, 5 minutes could be a decent interval. Therefore, if the master database fails, the MySQL ping job updates the configuration file on both APP servers to use the backup database (which, with the interval we assumed, may be missing up to the last 5 minutes of data).
- Disaster Recovery: We also need to make sure that if the APP servers, the database servers or the data storage server go down permanently, whether because of a natural disaster or anything else, we have a backup to restore from. Backups should be taken at specific intervals from all the servers and stored on a separate server as part of the recovery plan.
- JOB Server: The most important part of the whole infrastructure plan is the JOB Server. It is an isolated server that runs multiple scripts and is responsible for the following tasks:
  - To pull data from both APP servers, the database servers and the data storage server, and create a backup copy on “Backup server 1”.
  - To create another backup, named “Backup server 2”, on a different server from “Backup server 1”, so that recovery is still quick if “Backup server 1” fails.
  - To run a sync script that pulls all data from the master database and pushes it to the backup database.
  - To run a database ping to check whether the database is available and, if it is not, to switch to the backup database by updating the configuration file on both APP servers.
- Email Notifications: Each critical and warning event should trigger an email to the stakeholders with the status of the servers, so that any manual intervention that is required can be taken care of immediately.
- Audited and tested: The server architecture needs to be audited by professionals and tested on the production system as a “beta release” before it goes fully live. This is to make sure that the product is backed by a plan and infrastructure robust enough that it does not face frequent critical downtime.
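To make the load balancer behaviour from the list above concrete, here is a minimal Python sketch of round robin with fail-over. It is only an illustration under my own assumptions: the class name `RoundRobinBalancer` and the server names are hypothetical, and a real setup would normally rely on a dedicated load balancer product rather than hand-written code.

```python
class RoundRobinBalancer:
    """Hands out servers in circular order, skipping servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the server to try on the next request

    def mark_down(self, server):
        # Called when a health check decides the server has failed.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Called when a failed server has recovered.
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting where the last call stopped.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy application server available")


# Hypothetical server names, standing in for the two APP servers.
balancer = RoundRobinBalancer(["app-server-1", "app-server-2"])
first_four = [balancer.pick() for _ in range(4)]    # alternates between both

balancer.mark_down("app-server-1")                  # simulate a failure
after_failure = [balancer.pick() for _ in range(3)] # only app-server-2 is used
```

While both servers are healthy the requests alternate between them, and once one is marked down every request goes to the remaining server, which is the fail-over behaviour described for the APP servers.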
NOTE: You will observe that each server sits at a different location in the production environment architecture diagram. This is a disaster countermeasure: it is very unlikely that all four locations will face downtime at the same time. So keeping servers at different locations is always a good practice.
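The database ping job that runs on the JOB Server could look roughly like the Python sketch below. Everything specific in it is an assumption made for illustration: the configuration is taken to be a JSON file with a `db_host` key, the files of both APP servers are assumed to be reachable as paths from the JOB Server, the host names are made up, and a plain TCP connection to port 3306 stands in for a real MySQL ping (which would need a MySQL client library).

```python
import json
import socket
import tempfile
from pathlib import Path


def database_is_reachable(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to the database port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def switch_to_backup_if_master_down(config_paths, backup_host,
                                    is_reachable=database_is_reachable):
    """Rewrite each config file to use the backup host if its current
    database host does not respond. Returns the paths that were changed."""
    changed = []
    for path in map(Path, config_paths):
        config = json.loads(path.read_text())
        current = config["db_host"]  # assumed key name
        if current == backup_host or is_reachable(current):
            continue  # already switched, or the master is still alive
        config["db_host"] = backup_host
        path.write_text(json.dumps(config, indent=2))
        changed.append(str(path))
    return changed


# Demo with temporary files standing in for the two APP server configs.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ("app1.json", "app2.json"):
        p = Path(tmp) / name
        p.write_text(json.dumps({"db_host": "master.example.internal"}))
        paths.append(p)

    # Pretend the master is down by injecting a check that always fails.
    changed = switch_to_backup_if_master_down(
        paths, "backup.example.internal", is_reachable=lambda host: False)
    hosts_after = [json.loads(p.read_text())["db_host"] for p in paths]
    second_run = switch_to_backup_if_master_down(
        paths, "backup.example.internal", is_reachable=lambda host: False)
```

In practice a scheduler such as cron would call this at a short interval, and each switch is also a natural place to trigger the email notification to stakeholders. The second call in the demo changes nothing, because both files already point at the backup database.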