Handling HTTP caching is one of the most important aspects of scaling a web application. Used well, it can be your best friend; used badly, it may be your worst enemy.

I’m not going to explain the basics of caching here; there is already a lot of great material about it. Instead, I’m going to focus on a specific problem.

In a previous post, I wrote about the Globosat Play architecture. As explained there, it evolved into a microservices architecture, and as such we ended up making a lot of HTTP requests to our internal services. So we needed to manage those requests very well.

Suppose you access a page like the Combate channel home page. To fill in all the information on that page, we need to query data from:

  • a videos API, to fetch the list of available channels and the latest videos from the Combate channel
  • a highlights API, to check the latest highlights selected by an editor
  • an events API, to check a list of previous and next UFC events

That means a user request could be represented by something like this:

Requests without cache

Now imagine one of these services is unavailable. Or very slow. Or returning an unexpected response. If we didn’t consider these scenarios, we would end up with a brittle application, susceptible to a lot of issues.

Michael Nygard, in Release It!, says we must develop cynical systems:

Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.

That means we shouldn’t trust anybody. Don’t assume a service is up, available, fast and correct. Even if you know and trust the maintainers of this service, consider that it may have problems (and it will, eventually!). One of the defense mechanisms against this is caching.

At Globosat Play, we decided to implement two levels of cache. We call them performance and stale.

Performance cache

The performance cache is meant to avoid a flood of unnecessary requests to a single resource in a short period of time. Going back to the Combate home page example, one of the services our back-end requests is the list of upcoming UFC events. This doesn’t change often; only when a new event is created or when an event finishes, once every couple of weeks. That means it’s very wasteful to hit that service for every user accessing the Combate home page. Suppose the events API response changes once a week; if that page gets 100,000 hits in that period, I would make 100,000 requests to that API when I could make just one and keep the result in cache, which is much faster.

The solution for this is keeping a performance cache for a specific period of time. Suppose I set my cache for 5 minutes. The decision flow for this would be:

  • cache available? Respond with the cached content
  • cache unavailable? Make the request, write the response to the cache with a TTL (time to live) of 5 minutes, and respond
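
For illustration, here’s a minimal sketch of that decision flow in Ruby, using an in-memory ActiveSupport cache store. The URL, cache key and helper name are hypothetical, not the exact implementation we use:

require "active_support"
require "active_support/cache"
require "net/http"

# Illustrative cache store; in production this would typically be Redis or Memcached.
CACHE = ActiveSupport::Cache.lookup_store(:memory_store)

def fetch_upcoming_events(url)
  # `fetch` returns the cached value when present; otherwise it runs the block,
  # stores the result with a 5-minute TTL and returns it.
  CACHE.fetch("performance:#{url}", expires_in: 300) do
    Net::HTTP.get(URI(url))
  end
end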

That means I would hit that API only once every 5 minutes, independently of how many users are accessing my home page right now. I’m not only avoiding wasteful requests, but also protecting my internal services and giving faster responses - it’s much faster to access the cache than to make an HTTP request. The diagram below depicts this scenario:

Requests with cache

The problem with this scenario is that, even if I’m sure my events API only changes once a week, I can’t set my cache TTL to 1 week. Imagine I do that and the cache expires a few minutes before a new event is registered: I won’t see the new event until the next week! You need to carefully evaluate the performance cache time for each service you depend on.

Even if you have a service that can’t be cached for that long, you can still benefit greatly from caching its responses for at least a few seconds. Imagine an application handling 10,000 requests/s. If you set the back-end service cache TTL to 1 second, you make a single request to your service every second instead of 10,000!

Stale cache

The second cache level is stale. It’s a safety net against problems like network instability or an unavailable service. Let’s use the latest videos API as an example. Suppose my application back-end tries to access this service and gets a 500 HTTP status code. If I have a stale cached version of the response, I can use it to give a valid answer to the client. The stale data may be outdated by a few minutes or hours, but it’s still better than giving no response at all - of course, that depends on the case. For some kinds of services an outdated response may not be acceptable, like showing the wrong balance when your client accesses their bank account. But in most cases, stale cache is a great alternative.

Usually we set the performance cache time to a few minutes and the stale cache to a few hours. Our standard setup is 5 minutes and 6 hours, respectively.
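
Here is a minimal sketch of how the two levels can work together, assuming those TTLs. The key names, helper and error handling are illustrative, not the gem’s actual internals:

require "active_support"
require "active_support/cache"
require "net/http"

CACHE = ActiveSupport::Cache.lookup_store(:memory_store)

PERFORMANCE_TTL = 5 * 60      # 5 minutes
STALE_TTL       = 6 * 60 * 60 # 6 hours

def fetch_with_stale(url)
  CACHE.fetch("performance:#{url}", expires_in: PERFORMANCE_TTL) do
    response = Net::HTTP.get_response(URI(url))
    raise "unexpected status #{response.code}" unless response.is_a?(Net::HTTPSuccess)
    # Keep a longer-lived copy to fall back on when the service misbehaves.
    CACHE.write("stale:#{url}", response.body, expires_in: STALE_TTL)
    response.body
  end
rescue StandardError
  # Network failure or bad status: serve the stale copy if we still have one.
  CACHE.read("stale:#{url}") or raise
end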

Implementing cache levels in Ruby

To implement the performance and stale cache levels in Ruby applications, we created and open-sourced a gem called Content Gateway. It makes managing both cache levels much easier.

After installing it, you need to configure the default request timeout, the performance and stale cache expiration times and the cache backend, among other optional settings:

config = OpenStruct.new(
  timeout: 2.seconds,
  cache_expires_in: 5.minutes,
  cache_stale_expires_in: 6.hours,
  cache: ActiveSupport::Cache.lookup_store(:memory_store)
)

gateway = ContentGateway::Gateway.new("My API", config)

With this basic configuration, you can start making HTTP requests. You can also override the default configuration for each request:

# Params are added via query string
gateway.get("https://www.goodreads.com/search.xml", key: YOUR_KEY, q: "Ender's Game") # => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<GoodreadsResponse>\n  <Request>..."

# Specific configuration params are supported, like "timeout" and "skip_cache"
gateway.get_json("https://api.cdnjs.com/libraries/jquery", timeout: 1.second, skip_cache: true) # => {"name"=>"jquery", "filename"=>"jquery.min.js", "version"=>"3.1.1", ...

It supports POST, PUT and DELETE as well. For every verb there are two methods for making the request: one is simply the name of the verb and the other has a _json suffix. The former treats the response body as a String; the latter parses it as a Hash.

gateway.post_json("https://api.dropboxapi.com/2/files/copy", headers: { Authorization: "Bearer ACCESS_TOKEN" }, payload: { from_path: "path1", to_path: "path2" })
gateway.put_json("https://a.wunderlist.com/api/v1/list_positions/id", payload: { values: [4567, 4568, 9876, 234], revision: 123 })
gateway.delete("https://a.wunderlist.com/api/v1/tasks/id")

You can also make a few other customizations. Check out the project page on GitHub for more information and examples.


Nerdcast is an amazing podcast (in Portuguese) about nerdy stuff in general. One of its latest episodes talked about the digital artist profession. The guests were animators who worked on feature films like Moana and Doctor Strange. They talked a lot about what they do, and I saw a lot of analogies between their work and software development.

One of the podcast guests said that he had been working on a specific scene for Doctor Strange for about 2 months when his boss suddenly called to tell him the director had decided to cut that scene from the film. Those 2 months of work turned into garbage. The lesson he learned from that is that you shouldn’t get attached to the project you’re working on. It’s not your project, it’s your company’s project and you happen to be working on it, which is very different. The analogy here is very clear, because that happens a lot in software projects: sometimes your client decides the feature you’ve been working on for a few weeks or months is not that important, or even worse, that the whole project shouldn’t be maintained anymore. It’s very hard not to be affected by this kind of news, but if you don’t work on letting go, you will frequently get frustrated. The project is meant to deliver value to your client, not to you. Always remember that you are not paid to write code.

If you want a project to be really yours, and to have total freedom to decide what to do and how to do it, you need a personal project. But it’s important to note that, if someday you decide to turn it into a business - a startup, or some kind of product or service - you will start to have clients. When that happens, you need to be ready to let go of your ideas; if nobody wants your product, be ready to pivot (or even discontinue the whole product). A/B tests are also a great way to learn how to let go of your ideas and beliefs: if a hypothesis is proven worse than the default behavior, just delete it.

Another important topic mentioned in the podcast: the guests, as artists, have a hard time deciding when to stop improving their work. They start working on a scene and iterate a few times to make it better. As perfectionists, they want to keep polishing their work. But sometimes the scene already looks so good that further improvements won’t be noticed by the audience, so they no longer deliver value to the client. The problem is not knowing when to stop. We can make an analogy here with refactoring: sometimes we develop a new feature, and even after it’s implemented and well tested, we decide to refactor. The objective could be making the code clearer for anyone who may touch it later - to implement a new feature or fix a bug - or maybe extracting a part of it to remove duplication from a similar feature you already had. In both cases, the refactoring won’t deliver value in the short term, but it will in the medium or long term: you will have lower maintenance costs. But we may have the same problem as the animators: it’s very hard to know when to stop polishing the code. At some point, the refactoring doesn’t deliver value anymore, and we refactor just to please ourselves. As said before, you need to remember the project is not yours, and you are not paid to write code!


This post is at least one year late. Since I gave a few talks about the Globosat Play architecture (slides in Portuguese), I have intended to write a more detailed post, but kept procrastinating.

Globo.com is the internet arm of the largest media conglomerate in Brazil, and one of the largest in the world. One of the areas of the company is responsible for our video platform, which includes encoding, distribution and streaming for any website of the group that needs videos.

About 5 years ago, one of our video teams developed a video product called globo.tv (recently discontinued and replaced with a newer product, Globo Play). Its content was focused on Rede Globo (our main broadcast TV network) shows, like news, sports and the famous Brazilian telenovelas. Most of globo.tv’s content consisted of short scenes from these TV shows, open to every user, but it also offered full episodes for paying subscribers.

globo.tv
globo.tv home page

The original architecture was a single monolithic Rails app, with a Unicorn application server, a MongoDB database and a Redis instance for cache, all behind an nginx HTTP server. This architecture served us very well for some time.

globo.tv architecture

Then came 2012, with new demands for globo.tv. We needed to start offering live streaming of several events, like UFC, Big Brother Brasil, soccer championships and the Winter Olympic Games. We also needed to start offering the collection of videos from the Combate channel, which is focused on MMA.

At this point, we realized the old monolithic architecture wouldn’t serve us anymore, so we needed to break it into smaller parts. The first step was identifying smaller subdomains inside the videos domain. The first ones we identified and split out were live streaming, for the new live events demand, and VoD (video on demand), for the Combate channel videos.

Clearly these new subdomains deserved their own projects, but the problem was that they needed to share a few business rules and data with the original globo.tv project, and also with each other. That meant we needed to extract a few services from globo.tv into their own project: globotv-api was born!

globo.tv architecture, 2.0

At this point, we had:

  • globotv-api, which offered a few basic video services, like the most recent videos from a specific program and the most watched programs
  • globotv, the remainder of the original project, which started to consume globotv-api services. It was responsible for serving the original pages, like the home, program, video and search pages
  • globotv-events, a new project responsible for live streaming of different kinds of events. It also consumed globotv-api and offered its own API with specific services, like a list of live events happening right now
  • globotv-vod, another new project, which served the collection of videos for Combate channel. It also consumed globotv-api and offered its own API with specific services, like a list of competitions and videos from a specific fighter

The break-up of the monolith was a very important move. If we hadn’t split it at that time, we would have ended up with a larger and larger project, which would soon have become a monster - harder to understand, harder to maintain, harder to evolve. The split allowed us to share the small part of our domain that was common to these new requirements and the original project.

To split the front-end among different projects, we used nginx as a router. We were already using it in front of our application server; with multiple projects, we created an upstream for each one and configured a location for each URL pattern, like this:

upstream globotv {
  server globotv.internal.globo.com;
}

upstream globotv-events {
  server globotv-events.internal.globo.com;
}

upstream globotv-vod {
  server globotv-vod.internal.globo.com;
}

server {
  listen 80;
  server_name globotv.globo.com;

  location ~ (.+)/ao-vivo/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-events;
    break;
  }

  location ~ ^/combate/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-vod;
    break;
  }

  location ~ /  {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv;
    break;
  }
}

The internal domains aren’t publicly exposed; the only way to access the projects is through nginx. With a configuration similar to this, nginx routes incoming requests for the globotv.globo.com domain to different projects, according to the URL pattern: URLs that contain the “/ao-vivo/” pattern (“live” in Portuguese) are routed to the globotv-events project; URLs starting with “/combate/” are forwarded to the globotv-vod project. The last location matches every other URL to the original globotv project. The globotv-api project doesn’t appear in this configuration, as it isn’t publicly accessible (it’s only accessed by the other projects). This configuration allowed us to serve different pages from different projects, transparently to the user.

Despite its many benefits, this architectural change brought new challenges. The first one was keeping a consistent visual identity among pages served by different projects. For example, the video thumb component used on the home page should be just like the video thumb on the search page; the product header should be the same on every page. The split into multiple projects was an architectural decision; the user didn’t need to know about it, because globo.tv was still a single product.

The solution to this problem was creating a component library, called globotv-ui. With it, we were able to share visual components made of HTML, JS and CSS. They were standardized and documented, which made it very easy to create new components and share them among these projects - as all of them were Ruby on Rails projects, we delivered the globotv-ui library as a Ruby gem.
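
As an illustration only - globotv-ui’s real API isn’t shown here - a shared component gem for Rails projects could expose view helpers through a Railtie, roughly like this (the module, helper and video attributes are hypothetical):

# lib/globotv_ui.rb - hypothetical layout; the real gem's structure may differ.
require "rails/railtie"

module GlobotvUi
  # Helpers like this render standardized markup, so every project
  # produces the same video thumb component.
  module ThumbHelper
    def video_thumb(video)
      content_tag(:div, class: "gtv-thumb") do
        image_tag(video.thumbnail_url, alt: video.title) +
          content_tag(:span, video.title, class: "gtv-thumb__title")
      end
    end
  end

  # Makes the helpers available to every Rails app that bundles the gem.
  class Railtie < Rails::Railtie
    initializer "globotv_ui.helpers" do
      ActiveSupport.on_load(:action_view) { include GlobotvUi::ThumbHelper }
    end
  end
end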

examples of components from globotv-ui library

Fast forward a few years, and a new video product emerged: Globosat Play. It was very similar to globo.tv in some aspects - the idea was to offer VoD and live streaming from Globosat channels (TV channels available only to paying subscribers) - but with a few differences: we also needed to offer movies, and the focus was now on subscribers, not on free users anymore.

Globosat Play home page

The main challenge at that point was how to share components and services between these two products without one limiting the other’s evolution and requirements. We needed to re-evaluate our architecture and business domain to solve this issue. We realized that, as both products had many similarities and some differences, we needed to create a common, shared layer of services, while keeping some services specific. The new architecture became something like this:

Globosat Play architecture

This diagram is simplified compared to the previous ones - the new projects are also Rails apps with their own MongoDB and Redis instances. The green boxes are projects that serve web pages, and the blue ones are APIs. We split our projects into 3 groups: the top one in the image is specific to the globo.tv product; the bottom one is specific to Globosat Play; and the middle one represents shared services. Notice that most of the projects created before (globotv-api, globotv-events and globotv-vod) became shared, as Globosat Play had those same requirements.

Besides that, we created a few other projects. Some of them, like movies, served a specific subdomain that didn’t exist before; others, like globotv-search, were extracted from the original globotv project - these features already existed, but now we needed to share them between globo.tv and Globosat Play. Also, globotv-api remained our main source of basic video services.

This evolution also required a few new configurations on our nginx server:

upstream movies {
  server movies.internal.globo.com;
}

upstream globotv-search {
  server globotv-search.internal.globo.com;
}

upstream globosat-play {
  server globosatplay.internal.globo.com;
}

upstream globotv-events {
  server globotv-events.internal.globo.com;
}

upstream globotv-vod {
  server globotv-vod.internal.globo.com;
}

server {
  listen 80;
  server_name globosatplay.globo.com;

  location ~ (.+)/ao-vivo/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-events;
    break;
  }

  location ~ ^/telecine/ {
    proxy_set_header Host $http_host;
    proxy_pass http://movies;
    break;
  }

  location ~ ^/busca/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-search;
    break;
  }

  location ~ /  {
    proxy_set_header Host $http_host;
    proxy_pass http://globosat-play;
    break;
  }
}

In the end, we realized the microservices architecture brought a lot of advantages:

  • smaller, easier to manage projects: each subdomain is separated into its own project, which helps keep the code smaller and easier to understand. It also allows different teams to work on different projects
  • faster builds: with many small projects, the time needed to build and run the test suite of each one gets smaller, which gives faster feedback cycles. That encourages developers to run the test suites more frequently
  • smaller and less risky deploys: each project is responsible for a small part of the product, which means bugs only affect that small subdomain. Suppose you introduce a bug in the search project; that would affect only the search page. Your users would still be able to access the home page and watch videos. That gives the team confidence to deploy to production more frequently, which reduces the risk of bugs even further
  • flexible infrastructure: every service is a REST API over HTTP, using JSON. That means you could write each one in a different language with a different database technology, if you wanted or needed to. You can select the best tools for the job
  • easier incremental changes: suppose you want to migrate your application server from Unicorn to Puma. With a monolithic application, you would need to flip the switch all at once. With many small services, you can choose one to try Puma with - maybe the least critical one. If the migration is successful, you can continue the process one project at a time

But we also had a few disadvantages:

  • more complex architecture: when a new developer joins your team, it’s much harder to explain how the architecture works and what each project is responsible for
  • harder local environment setup: another problem for new developers is setting up the local environment. With many projects, each one with its own requirements, that’s much harder
  • harder to update dependencies, like newer gems: when you need to update a dependency, like a gem that fixes a critical security flaw, you need to do that once for each project. With a monolith, you would just need to do that a single time
  • harder to test: when each service depends on many others, setting up the test environment is much harder. You would need to set up a complex environment for integration tests, create fake APIs to replace real ones, or maybe record and replay API responses using something like VCR (see the sketch after this list)
  • heterogeneous environment: the flexibility earned with microservices might result in projects in different languages, with different databases and other dependencies. That makes maintenance harder, because not every developer will understand the whole set of technologies
  • more failure points, harder to debug: when a service fails, the problem may be a bug or database failure, for example, but may also be the result of a failure in another service it depends on. The debugging process gets harder and slower
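
To illustrate the testing point above, here is a minimal sketch of recording and replaying an API response with VCR in an RSpec suite. The cassette name, URL and expectation are hypothetical:

require "vcr"
require "net/http"

VCR.configure do |c|
  c.cassette_library_dir = "spec/cassettes" # where recorded responses are stored
  c.hook_into :webmock                      # intercept HTTP calls via WebMock
end

RSpec.describe "events API client" do
  it "lists upcoming events" do
    # The first run records the real response into the cassette;
    # subsequent runs replay it without hitting the service.
    VCR.use_cassette("events_api/upcoming") do
      body = Net::HTTP.get(URI("https://events-api.example.com/upcoming"))
      expect(body).to include("events")
    end
  end
end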

In the end, the migration from a monolith to microservices was very successful for us. But the main point here is realizing that microservices aren’t a magical solution to every problem - a messy monolith, when split, would generate many messy microservices, as illustrated in the image below. You should analyze the characteristics of your project before deciding whether microservices are the best option.


When I was younger, I used to get a lot of phone calls on my birthday, from relatives and friends. But now the situation is quite different.

A few days ago I got older, and the “happy birthday” wishes came through many different mediums. So I decided to compile some curious (and completely useless) statistics about this:

Happy Birthday

Sometimes technology makes life more complicated, instead of simplifying it.


Last weekend I attended DevDay 2015, in Belo Horizonte, where I presented a talk about the “evolution of a distributed architecture” (in Portuguese).

The last talk of the event presented Stack Overflow’s architecture. It was very controversial, and I felt I needed to write something to share my point of view on the subject.

A lot is being said in the community about the value of software engineering best practices. For the last few decades, people like Uncle Bob, Kent Beck, Martin Fowler and many others have put great effort into promoting these practices through books, posts and talks. These professionals have done, and still do, a great job guiding new developers, so we can have well-tested software projects and well-designed architectures, focused on aspects like maintainability, scalability, security and quality.

Despite this, we know that, in many companies, management pressures developers to deliver as fast as possible, forgetting about software quality and maintainability - even knowing they will be responsible for maintaining the project, at least in the short term. This habit goes against every best practice described above and is very harmful to new developers. Those who start their careers in companies with this mindset, even if they’ve read about best practices, end up believing they are utopian and that, in practice, it’s impossible to design a sustainable architecture, refactor legacy code or automate tests because of the pressure for fast delivery of new features. That’s why it’s essential to spread the word about best practices, to show these new developers that focusing on quality is not only possible, but also essential for the project’s evolution and maintenance.

The talk that closed DevDay 2015 presented Stack Overflow’s architecture. The numbers shown are impressive: it’s one of the top 50 sites in the world, with millions of page views per day, all supported by only 9 physical servers, each one running at around 5% load, plus 2 database servers. With this structure, the average load time is only 18 ms. How is this possible?

The secret, according to the talk, is their obsession with performance. Every new feature must have the best possible performance. When a library or tool they use isn’t considered fast enough, they rewrite it from scratch. As they consider layered architectures slow, every database query is written by hand, directly in the controller.

Stack Overflow’s code has low testability because, for instance, you can’t create mocks to replace the database connection. That’s why the project has very few automated tests (it was even mentioned that some developers don’t run the tests). As they have a massive number of active and engaged users, any bug introduced by a deploy is quickly found and reported on Meta Stack Overflow. To update the operating system, they just remove a server from the pool, apply the update and put the server back live. They assume that, if a new bug appears, users will soon find and report it.

The speaker made it very clear that modeling, architecture and tests are good things, but that they’re not for everyone. I personally disagree.

Stack Overflow is a very particular case. As their audience is made up of developers, who are very engaged and passionate about the product, bugs in production are considered acceptable, because the main focus is performance. But what’s the point of performance without quality? Would you buy the fastest car in the world knowing it doesn’t have seat belts or airbags, and that it doesn’t support replacing a flat tire or a defective part? The analogy is exaggerated - a bug on the site doesn’t put lives at risk - but my point is that, if you focus only and exclusively on performance, you give up other aspects like quality and security. It’s the same line of thought as those managers I mentioned before, who apply a lot of pressure for fast delivery regardless of quality.

Fast food
Does every fast food need to be like this?

In my opinion, even if your project focuses on performance, quality can’t be abandoned. The other extreme - over-engineering - is also bad; if you have a small application with only a couple of users, it doesn’t make sense to create a complex architecture based on the possibility that it might expand one day. That would be creating a solution for a problem that doesn’t exist.

I have a real example of this: I track all my expenses in a Google Spreadsheet. I wanted to share this data with my wife, but as the spreadsheet is very large, I created a small app that extracts the data and displays a very simple dashboard, with only the information she is interested in. This app has only two users - me and her - and there’s no chance it will grow. In this case, it doesn’t make sense to think about a scalable architecture. But when we’re talking about a product with tens of millions of users, the situation is very different. Currently, Stack Overflow doesn’t have a competitor to match it, but if one rises with a better user experience or new features, they will have a hard time keeping up.

My main concern while watching this talk was the impact these ideas can have on the audience. Most of them were very young, probably students or professionals starting their careers. A talk like this, spreading the idea that software quality is optional, can be very harmful to them. And it seems Facebook has a similar issue with code quality.

To wrap up, I want to make it clear that I’m not making a personal attack on Stack Overflow or the speaker. I admire her courage to take the stage and present such controversial ideas, even though I disagree with them. I’m a Stack Overflow user and will continue to be one, after all.