What causes data center network outages?

2013-03-27 20:45:56
At the end of 2012 the comedy "Lost in Thailand" (known in Chinese as "Tai Jiong", roughly "Thai embarrassment") broke the Chinese film box office record. In IT, meanwhile, data center failures have been erupting frequently and repeatedly shaking the confidence of enterprise users. Let us look at the security problems of the data center, so that we are not embarrassed again.
Cloud computing is touted as the savior of IT in this era: every service can be moved to the "cloud". However, when many companies rushed to be the first to eat the crab, they found that they themselves were often the most vulnerable. In recent years an endless stream of cloud service outages has alarmed the industry.

People are gradually returning to reason and seeing the true face of cloud computing more clearly. However lofty the vision, cloud services are ultimately transmitted from one data center to another, and that process cannot escape the need for computers, networks, power and storage to work together. As a result, errors and vulnerabilities can appear anywhere in the process, and natural disasters add to the risk. So if you adopt cloud services, you must be mentally prepared and also have a backup plan ready.

Here the editor reviews the causes behind a series of network outages that occurred between 2009 and 2012. Perhaps they will show you that computer errors seem inevitable, and that even layered safeguards can only keep security incidents within a small probability range.

Outage cause 1: system failure

Typical event 1: Amazon AWS Christmas Eve outage

Cause of failure: Elastic Load Balancing service failure

On December 24, 2012, Christmas Eve, Amazon did not give its customers a peaceful holiday. The AWS data center in the eastern United States failed and the Elastic Load Balancing service was interrupted, affecting sites such as Netflix and Heroku. Heroku had also been affected by earlier failures in the AWS US East region. By some coincidence, however, Amazon Prime Instant Video, Amazon's own service and a rival of Netflix, was not affected by this failure.

The December 24 interruption was not the first AWS outage and, of course, it will not be the last.

On October 22, 2012, AWS services in Northern Virginia were interrupted, for a reason similar to the one above. The sites affected included Reddit, Pinterest and other well-known websites. The interruption hit Elastic Beanstalk first, followed by the Elastic Beanstalk console, Relational Database Service, ElastiCache, Elastic Compute Cloud (EC2) and CloudSearch. The incident led many people to believe that Amazon should upgrade its Northern Virginia data center infrastructure.

On April 22, 2011, servers in Amazon's cloud data center went down on a large scale; the event is considered the most serious cloud computing security incident in Amazon's history. Because the outage hit the cloud computing center in Northern Virginia, sites including the Q&A service Quora, the news service Reddit, Hootsuite and the location tracking service FourSquare were affected. Amazon's official report attributed the incident to vulnerabilities and design flaws in its EC2 system, and said it would continue to fix known vulnerabilities and defects to improve the competitiveness of EC2 (Amazon Elastic Compute Cloud).

In January 2010, almost 68,000 Salesforce.com users experienced at least one hour of downtime. Because of a "systemic error" in Salesforce.com's own data center, all services, including backup, were briefly paralyzed. The outage also exposed the lock-in strategy that Salesforce.com prefers not to talk about: its PaaS platform, Force.com, cannot be used outside Salesforce.com. Once Salesforce.com has problems, Force.com has the same problems. When a service interruption lasts a long time, the problem becomes very tricky.

Outage cause 2: natural disasters

Typical event 1: Amazon Dublin, Ireland data center downtime

Cause of failure: lightning struck a transformer at the data center in Dublin

On August 6, 2011, lightning in Dublin, Ireland caused a large-scale power outage and downtime at the European cloud computing data centers of Amazon and Microsoft. The lightning struck a transformer near the Dublin data center and caused it to explode. The explosion triggered a fire, which temporarily halted the utility services and brought the entire data center down.

This data center is Amazon's only data storage site in Europe, which means that EC2 cloud computing customers had no other data center to fall back on during the accident. The outage interrupted many websites on the Amazon EC2 cloud platform for as long as two days.

Typical event 2: Calgary data center fire

Cause of failure: data center fire

Calgary data center fire, July 11, 2012: a fire broke out in the Calgary, Alberta data center of the Canadian communications service provider Shaw Communications Inc., delaying hundreds of surgeries at local hospitals. The data center hosted management systems for emergency services, and the fire affected both the primary and the backup systems that supported critical public services. The event sounded the alarm for a series of government agencies: they must ensure timely recovery, have failover systems in place, and adopt disaster management plans.

Typical event 3: Hurricane Sandy hits data centers

Cause of failure: storms and floods forced data centers to stop running

Super hurricane Sandy, October 29, 2012: data centers in New York and New Jersey were hit by the hurricane. The damage included flooding in Lower Manhattan, the shutdown of some facilities, and generator failures at data centers in the surrounding area. The impact of Sandy went beyond an ordinary single outage and brought disruption of unprecedented scale to the data center industry in the affected areas. Diesel fuel became the lifeblood of data center recovery: with backup power systems taking over the entire load, operators had to take special measures to keep the generators fueled. As the immediate focus gradually shifts to post-disaster reconstruction, long-term discussion of data center siting, engineering and disaster recovery is needed, and it may last for months or even years.

Outage cause 3: human factors

Typical event 1: Hosting.com service outage

Cause of failure: the service provider executed the circuit breaker operating sequence incorrectly, which shut down the UPS

The Hosting.com outage, July 28, 2012: human error is often considered one of the leading factors in data center downtime. The July outage at Hosting.com, which interrupted service for 1,100 customers, is an example. The shutdown happened while preventive maintenance was being carried out on the UPS systems at the company's data center in Newark, Delaware. "An incorrect circuit breaker operating sequence executed by the service provider caused the UPS to shut down, which was one of the key factors in the loss of power to the data center suite," said Art Zeile, CEO of Hosting.com. "There was no failure of any important power system or standby power system; it was caused entirely by human error."

Typical event 2: Microsoft BPOS service interruptions

Cause of failure: a configuration setting error in Microsoft data centers in the United States, Europe and Asia

In September 2010, Microsoft apologized to users for at least three interruptions of its hosted services in the western United States within a few weeks. It was the first major cloud computing incident at Microsoft to become public.

During the incident, customers who accessed BPOS (Business Productivity Online Suite) through Microsoft's North American facilities could run into problems; the failure lasted two hours. Although Microsoft engineers later claimed to have solved the problem, they had not fixed the root cause, and service was interrupted again on September 3 and September 7.

Microsoft's Clint Patterson said the data breach was caused by a configuration setting error in Microsoft data centers in the United States, Europe and Asia. Under "very specific circumstances", the offline address book in the BPOS software was available to unauthorized users. The address book contains contact information.

Microsoft said the error was fixed two hours after it was discovered. Microsoft also said it has tracking facilities that let it contact the people who downloaded the data in error, so that the data can be deleted.

Outage cause 4: system failure

Typical event 1: GoDaddy DNS server outage

Cause of failure: corrupted data tables in a series of routers caused a network outage

The GoDaddy DNS server outage, September 10, 2012: the domain name giant GoDaddy is one of the most important DNS server providers, hosting 5,000,000 websites and managing over 50 million domain names. This is why the outage of September 10 was one of the most devastating events of 2012.

Some even speculated that the outage, which lasted up to six hours, was the result of a denial of service attack, but GoDaddy later said it was caused by corrupted router table data. "The service interruption was not caused by external influences," said GoDaddy interim CEO Scott Wagner. "It was not a hack and it was not a denial of service attack (DDoS). We have determined that the service outage was due to a series of internal network events that corrupted router data tables."

Typical event 2: Shengda Yun storage outage

Cause of failure: a damaged disk in a physical server at the data center

At 8:10 pm on August 6, 2012, Shengda Yun (Shanda Cloud) published a public statement on its official microblog about the loss of user data caused by a cloud host failure. The statement said that on August 6, at the Shengda Yun data center in Wuxi, a disk in a single physical server was damaged, resulting in the loss of individual users' data. Shengda Yun said it was making every effort to help the users recover their data.

As to why one damaged physical server disk led to the loss of individual users' data, Shengda Yun technicians gave their own explanation. A virtual machine's disk can be provisioned in two ways. One is to use the host's physical disk directly. In this case, if the host's physical disk fails, the cloud host inevitably loses data, which is what caused this incident. The other is to use remote storage, namely Shengda's hard disk product. This approach saves the user's data to a remote cluster and keeps multiple backups at the same time, so that even a host failure does not affect the data of the cloud host. Because physical damage to machines is difficult to avoid, the technicians recommend that users also back up their data outside the cloud host to avoid accidental loss.
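The difference between the two disk modes described above can be sketched with a toy model. This is only an illustration under stated assumptions: the class names, the replica count of 3, and the in-memory dictionaries standing in for disks are all hypothetical and are not Shengda Yun's actual design.

```python
class LocalDiskVolume:
    """Hypothetical model: the VM disk lives only on the host's physical disk."""

    def __init__(self):
        self.disk = {}
        self.disk_ok = True

    def write(self, key, value):
        self.disk[key] = value

    def fail_disk(self):
        # A physical disk failure destroys the only copy of the data.
        self.disk_ok = False
        self.disk.clear()

    def read(self, key):
        if not self.disk_ok:
            raise IOError("host disk failed and no other copy exists")
        return self.disk[key]


class ReplicatedVolume:
    """Hypothetical model: the VM disk is stored on a remote cluster with several copies."""

    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]
        self.healthy = [True] * replicas

    def write(self, key, value):
        # Every healthy replica receives the write.
        for ok, replica in zip(self.healthy, self.replicas):
            if ok:
                replica[key] = value

    def fail_replica(self, index):
        self.healthy[index] = False
        self.replicas[index].clear()

    def read(self, key):
        # Any surviving replica can serve the read.
        for ok, replica in zip(self.healthy, self.replicas):
            if ok and key in replica:
                return replica[key]
        raise IOError("all replicas lost")
```

In this model a single failure makes `LocalDiskVolume.read` raise an error, while `ReplicatedVolume.read` still returns the data as long as one replica survives.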

Typical event 3: Google App Engine service interruption

Cause of failure: network latency

Google App Engine (GAE) is a web application development and hosting platform whose data centers are managed by Google. The interruption occurred on October 26 and lasted 4 hours: the service suddenly became slow to respond and returned errors. As a result, 50% of GAE requests failed.

Google said that no data was lost and that application behavior was backed up and could be restored. To apologize, Google announced a concession for users in November. Google also said it is strengthening its network services to cope with network latency: "We have enhanced our traffic routing capabilities and adjusted our configuration; these changes will effectively prevent such problems from happening again."

Outage cause 5: system bugs

Typical event 1: Azure global service interruption

Cause of the accident: a software bug that miscalculated leap-year dates

On February 28, 2012, a "leap year bug" caused a large-scale global interruption of Microsoft's Azure service that lasted more than 24 hours. Although Microsoft said the outage was caused by a software bug that miscalculated leap-year dates, the incident provoked a strong reaction from many users, and many people asked Microsoft for a more reasonable explanation.
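To show how a leap-year date calculation can go wrong, here is a minimal sketch. The article only says the bug was a miscalculated leap-year time; the specific pattern below (computing "one year later" by adding 1 to the year) is an assumed example of this class of bug, and both function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(d):
    # Assumed buggy pattern: keep month and day, add 1 to the year.
    # For February 29 this builds a date that does not exist and raises ValueError.
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d):
    # Same intent, but fall back to February 28 when the target year
    # has no February 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)
```

On an ordinary day both functions agree. On February 29, 2012 the naive version raises `ValueError`, while the safe version returns February 28, 2013. Whether to clamp to February 28 or roll over to March 1 is a design choice that depends on what the date is used for.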

Typical event 2: global Gmail failure

Cause of the accident: side effects of new code during routine data center maintenance

On February 24, 2009, Google's Gmail suffered a global failure, and service was interrupted for up to 4 hours. Google explained the cause of the accident: during routine maintenance of a data center in Europe, some new program code (which tried to keep data geographically close to its owners) had side effects that overloaded another data center in Europe. The knock-on effect then spread to other data centers, and ultimately the service went down globally and the other data centers could not work properly.
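The knock-on effect described above can be illustrated with a very simplified simulation. It rests on assumptions that are not in Google's explanation: traffic is spread evenly over the surviving data centers, a data center drops out as soon as its share exceeds its capacity, and the names and numbers are invented for the example.

```python
def simulate_cascade(capacity, load, offline):
    """Return the data centers that end up down, in the order they fail.

    capacity: dict of data center name -> maximum load it can serve
    load:     dict of data center name -> load it normally serves
    offline:  name of the data center taken out for maintenance
    """
    up = {dc for dc in capacity if dc != offline}
    down = [offline]
    total = sum(load.values())
    while up:
        # Assumption: total traffic is split evenly across survivors.
        share = total / len(up)
        overloaded = sorted(dc for dc in up if share > capacity[dc])
        if not overloaded:
            break
        up -= set(overloaded)
        down.extend(overloaded)
    return down
```

With four hypothetical sites each serving 80 units, capacities of 100 for the two European sites and 120 for the others, taking one European site offline pushes the other over its limit, and the remaining two then fail as well. With a capacity of 150 everywhere, the same maintenance takes down only the site being serviced, which is why spare capacity matters for failover.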

Typical event 3: the "5.19" network outage

Cause of the accident: a bug in client-side software made Internet terminals send frequent DNS requests, which triggered DNS congestion

At 21:50 on May 19, 2009, users in six provinces, including Jiangsu, Anhui, Guangxi, Hainan, Gansu and Zhejiang, reported that websites were slow or inaccessible. After investigating, the communications unit of the ministry said that the network disruption in the six provinces was caused by a defect in client software released by a company. When the company's authoritative domain name servers were working abnormally, Internet terminals with the software installed frequently initiated DNS requests, which triggered DNS congestion and left a large number of users with slow website access or pages that would not open.

Among those involved, DNSPod is one of the leading DNS service providers and provides DNS service for a number of well-known websites. The attack paralyzed 6 DNS servers belonging to DNSPod, which directly paralyzed the DNS systems of a number of network service providers, including Storm. This in turn led to network congestion and left a large number of users unable to use the Internet normally. The Ministry of Industry and Information Technology pointed out that the incident exposed the domain name service as a weak link in network security, and instructed all units to strengthen the security of the domain name service.
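The congestion described above came from clients that kept re-sending DNS requests when resolution failed. A common mitigation for retry storms in general is exponential backoff with a cap and optional jitter. The sketch below is not the vendor's fix; the function names, the 1 second base and the 60 second cap are all assumptions chosen for illustration.

```python
import random


def backoff_delays(retries, base=1.0, cap=60.0, rng=None):
    """Delays in seconds before each retry: doubling, capped, optionally jittered."""
    delays = []
    for attempt in range(retries):
        ceiling = min(cap, base * 2 ** attempt)
        # With an rng, pick a random delay up to the ceiling so that many
        # clients do not retry at the same instant.
        delays.append(rng.uniform(0, ceiling) if rng else ceiling)
    return delays


def queries_in_window(delays, window):
    """Count the queries (first attempt plus retries) one client sends within `window` seconds."""
    sent, elapsed = 1, 0.0
    for delay in delays:
        elapsed += delay
        if elapsed > window:
            break
        sent += 1
    return sent
```

For comparison, a client that retries every 0.5 seconds sends 121 queries in one minute, while a client following the un-jittered schedule from `backoff_delays` sends 6. Multiplied across millions of installed clients, that difference decides whether a failing DNS server gets room to recover.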

Summary

Companies adopt cloud services largely because such services are more convenient and cost-effective. However, if those benefits come at the cost of reduced security, it is likely that the bosses of many companies will not agree. The endless stream of cloud outages has raised concerns about the safety of the cloud.

The problem can now be approached from several angles. Enterprise customers of cloud services should regularly back up their cloud data and keep a second solution ready for contingencies. Cloud service providers, since outages of various kinds are inevitable, must plan countermeasures that minimize their users' losses and improve the efficiency of their response to outages.
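As a small illustration of the advice to back up data regularly, the sketch below copies a file to a second location and checks that the copy is intact. It is an assumed, minimal example using only the Python standard library; the function names and the idea of a local backup directory are hypothetical, and a real setup would normally copy to a separate site or provider.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(src, backup_dir):
    """Copy `src` into `backup_dir` and confirm the copy matches the original."""
    src = Path(src)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dst = backup_dir / src.name
    shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError(f"backup of {src} is corrupt")
    return dst
```

Verifying the checksum after copying matters because a backup that cannot be restored gives no protection when the primary copy is lost.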

Government departments have responsibilities of oversight and reminder: they should introduce and continually improve laws and regulations on cloud services, and remind users that one hundred percent reliable cloud computing services do not currently exist.