Protecting your SOA application (and your job) from remote failures

SOA – for Service Oriented Architecture – is the buzzword du jour. Organizations in all industries want to realize its promises, which include sharing information more openly and coherently across the enterprise and increasing organizational agility by making it easier to assemble new applications from existing components. Although there are fierce debates about the details (SOAP vs REST, contract first vs schema first vs code first, etc.) there is little disagreement about the general principles. Software should be constructed in the form of loosely coupled, coarse-grained, stateless services which can be invoked remotely over a network, and which form the basis of enterprise code reuse. This is a huge topic about which volumes have been written, but for now I’m going to focus on one issue: The importance of designing your SOA services to be resilient against network failures.

Let’s start with an example. Say you have a legacy system containing important data, and you want new applications to be able to create, read, and update records in it. You decide to create synchronous web services called createRecord, getRecord, and updateRecord which perform the required operations on the legacy database. Then you start invoking these web services from a new desktop client application. So far so good. Your application is running on a high-performance corporate LAN, the latency for each web service call ranges from 10 to 2000 milliseconds, and your users are happy. Every once in a while a web service call fails due to a network problem, but by and large the overall solution is adequate.
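A minimal sketch of this synchronous style, to make the failure mode concrete. Everything here is illustrative: `LegacyClient` stands in for a real SOAP/HTTP client, and the simulated latency is an assumption, not a measurement. The key property is that every call blocks the caller for the full round trip.

```python
import time

class LegacyClient:
    """Hypothetical stand-in for a SOAP/HTTP client to the legacy system."""

    def __init__(self, latency_s=0.01):
        self.latency_s = latency_s  # simulated network round-trip time
        self.records = {}
        self.next_id = 1

    def create_record(self, data):
        time.sleep(self.latency_s)  # caller blocks for the full round trip
        rid = self.next_id
        self.next_id += 1
        self.records[rid] = dict(data)
        return rid

    def get_record(self, rid):
        time.sleep(self.latency_s)  # blocks again; a remote failure here
        return self.records[rid]    # would surface as a timeout or error

client = LegacyClient()
rid = client.create_record({"name": "Acme"})
print(client.get_record(rid))  # {'name': 'Acme'}
```

If the legacy system is slow or down, every caller in this design simply waits; there is nothing between the client and the remote system to absorb the failure.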

The problems start when a second development group wants to create a web service that also needs to create these legacy records. The obvious thing for them to do is to call your web service. Duplicate code is avoided and everyone is happy, except that the latency for the new group’s service is roughly double that of your original web service (there’s the delay to call their web service, plus the additional delay for their web service to call yours).

Now let’s fast-forward into the future. Imagine development teams throughout your enterprise have been using each other’s services freely, and the information systems have become a huge web of interconnected network services. Transaction processing latencies will be highly unpredictable, as an innocent web service request might trigger a flurry of dependent requests. Furthermore, the slightest network problem or the failure of a single component may cause a large-scale outage.

The solution is to define a rule throughout your SOA: Service implementations should not wait on responses from remote services. If a service must send a request to another service, a decoupling design pattern should be used.

For example, say your createCustomer web service needs to send a request to a remote system called CustomerIMS to carry out its function. One decoupling design pattern would be to have it save the request to a database table called CUSTOMERIMS_OUTBOUND and return immediately. Then other threads or processes can arrange for the request to be sent from the database to the CustomerIMS system, and for responses to be stored back in the CUSTOMERIMS_OUTBOUND table. The client can poll for results periodically, or an asynchronous notification can be arranged. If the CustomerIMS system temporarily goes down, this arrangement fails gracefully: instead of timing out, requests are buffered in the database. Retries can be arranged if desired, and the database table provides transparency, as monitoring applications can query it to track the progress of requests.
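The pattern above can be sketched in a few lines. This is a hedged illustration, not a production implementation: SQLite stands in for the enterprise database, the table schema is invented for the example, and `dispatch_pending` plays the role of the background thread or process that forwards requests to CustomerIMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the enterprise database
conn.execute("""
    CREATE TABLE CUSTOMERIMS_OUTBOUND (
        id       INTEGER PRIMARY KEY,
        request  TEXT NOT NULL,
        status   TEXT NOT NULL DEFAULT 'PENDING',  -- PENDING or DONE
        response TEXT
    )
""")

def create_customer(request):
    """Service side: buffer the request and return immediately."""
    cur = conn.execute(
        "INSERT INTO CUSTOMERIMS_OUTBOUND (request) VALUES (?)", (request,))
    conn.commit()
    return cur.lastrowid  # handle the client can poll with later

def dispatch_pending(send):
    """Background worker: forward buffered requests to CustomerIMS."""
    rows = conn.execute(
        "SELECT id, request FROM CUSTOMERIMS_OUTBOUND "
        "WHERE status = 'PENDING'").fetchall()
    for rid, req in rows:
        try:
            resp = send(req)  # the actual remote call to CustomerIMS
        except OSError:
            continue          # system down: row stays PENDING, retried later
        conn.execute(
            "UPDATE CUSTOMERIMS_OUTBOUND SET status = 'DONE', response = ? "
            "WHERE id = ?", (resp, rid))
    conn.commit()

def poll_result(rid):
    """Client side: check whether the response has arrived yet."""
    return conn.execute(
        "SELECT status, response FROM CUSTOMERIMS_OUTBOUND WHERE id = ?",
        (rid,)).fetchone()

rid = create_customer("create customer Acme")
dispatch_pending(lambda req: "ok: " + req)  # pretend CustomerIMS is up
print(poll_result(rid))  # ('DONE', 'ok: create customer Acme')
```

Note how the failure case costs nothing extra: if `send` raises because CustomerIMS is down, the row simply remains `PENDING` and a later dispatch pass picks it up, and any monitoring application can measure the backlog with a single query on the table.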

As more and more services are rolled out and become interdependent, it is increasingly important to decouple SOA services to protect against network failures and remote service outages. Otherwise we risk building a generation of unreliable applications, which will be followed by an eventual backlash against the whole idea of SOA.

As a side point – as SOA continues its ascent, I predict a resurgence of interest in traditional relational databases as a stable, central point of staging and communication, as well as new interest in distributed caching solutions such as Tangosol. Good time to buy Oracle stock.
