Incidents like the WannaCry ransomware attack expose the importance of backups, something many people still neglect.

When someone talks about backing up personal files, we usually think of services like Dropbox and Google Drive. But they get fairly expensive once you have more than a couple of GB of data. There are plenty of other solutions available. Most of them are cheaper, but not always reliable - imagine backing up all your personal data to a small, unknown service, and a few months later the company goes out of business. Or a security flaw exposes all your personal data! Of course this could also happen with Dropbox and Google Drive, but it’s much less likely, since they are two large, established companies.

One alternative to them is Amazon Glacier, a lesser-known Amazon service for data archiving. Note that it works differently from the usual backup solutions. When you sign up for Dropbox, for instance, you can install an app on your computer or mobile phone, or use the web interface to instantly access your files and upload new ones. Glacier is much more low level: it doesn’t have a web interface, an app, or even a command line tool! There’s only an API, which you use to list, upload and download your files.

And there’s more: the download and upload rates are very slow. To download a file, you first have to request a file retrieval job; the download only becomes available a couple of hours later!

This seems like a terrible service, so why use it? Because it’s very, very cheap! You pay only US$ 0.004 per GB per month for storage, plus additional costs for requests. And even though it’s slow and hard to use, it’s a service offered by Amazon, which gives you confidence it won’t suddenly disappear.

Having said that, Glacier isn’t a service for data you may need immediately, but it’s ideal for things you probably won’t need to access anytime soon. Think about your family pictures: when you want to access them, you probably don’t need them right away; you’re fine waiting a couple of hours.

Glacier is also a great option for “backups of backups”. If you want to be neurotic about backups (and you should!), you can archive a copy of your backups there.

Usage

The easiest way to use Glacier is with a third-party client. I like amazon-glacier-cmd-interface. After setting up the basic configuration, you can create a vault and upload your files:

$ glacier-cmd mkvault my-disaster-backup
$ glacier-cmd upload my-disaster-backup my-file1 my-file2 ...

To list archives in a vault:

$ glacier-cmd inventory <vaultname>

The inventory retrieval job takes a couple of hours to be processed. You can check its status with:

$ glacier-cmd listjobs <vaultname>
+------------------------------------------------------+---------------+--------------+--------------------+--------------------------+------------+
|                      VaultARN                        |    Job ID     | Archive ID   |       Action       |        Initiated         |   Status   |
+------------------------------------------------------+---------------+--------------+--------------------+--------------------------+------------+
| arn:aws:glacier:us-west-2:483413266890:vaults/backup | QYqdvM4k8q... |    None      | InventoryRetrieval | 2017-07-24T15:47:48.310Z | InProgress |
+------------------------------------------------------+---------------+--------------+--------------------+--------------------------+------------+

When the job status changes to Succeeded, run the inventory command again to see the archive list.

To download an archive, you first need to find its ID in the inventory:

$ glacier-cmd inventory <vaultname>
Inventory of vault: arn:aws:glacier:us-west-2:483413266890:vaults/backup
Inventory Date: 2017-07-05T11:22:15Z

Content:
+---------------+---------------------+----------------------+------------------+------------+
|  Archive ID   | Archive Description |       Uploaded       | SHA256 tree hash |    Size    |
+---------------+---------------------+----------------------+------------------+------------+
| uFg2FE_guu... | file1.tar.gz        | 2017-03-31T14:29:17Z | b41922e1a2...    | 1342622251 |
| 43Wjk63Dcu... | file2.tar.gz        | 2017-03-31T17:18:28Z | 2346170d22...    | 2347810677 |
+---------------+---------------------+----------------------+------------------+------------+
This vault contains 2 items, total size 2.5 GB.

Then, create an archive retrieval job using the archive ID:

$ glacier-cmd getarchive <vaultname> <archive id>
+-----------+---------------+
|   Header  |    Value      |
+-----------+---------------+
|   JobId   | Xa17IAadQG... |
| RequestId | cPcomv_vTf... |
+-----------+---------------+

When the download is available (you can check its status with glacier-cmd listjobs <vaultname>), download it with:

$ glacier-cmd download <vaultname> <archive id> --outfile <filename>

Globosat Play is a video product for pay TV subscribers, where you can catch up on programs you missed on TV. It’s an umbrella for a number of different channels. One of the most popular is SporTV, one of the largest sports channels in Brazil.

Last year we had the Olympic Games, a major sports event here in Rio. As SporTV was going to provide extensive coverage of the event, we decided to rethink the user experience for SporTV Play (the SporTV channel offering inside Globosat Play). We chose to focus on improving the live TV experience, which is responsible for most of the audience.

Besides rethinking the user experience, we decided to also rethink our front-end architecture, and address a couple of the issues we had.

I’ve already written about the Globosat Play architecture before. Its front-end is basically a set of Rails apps using regular ERB templates. These apps share components through a component library called globotv-ui. It’s about 4 years old, created well before newer component technologies arose and became popular.

As described in the previous post, this component library allowed us to avoid rewriting the same components over and over for each app. We were able to share our components, but the main problem is that this was an in-house solution. We defined our component structure to fit our own needs, so it was really hard to share these components outside our product.

We also had a few problems with our front-end architecture. When we updated a component that was used on pages served by different apps, we needed to “synchronize” the deploys. If we deployed one app and took too long to deploy another, in the meantime both would be running different versions of that component. That could create inconsistencies in our product, and it was one of the issues we wanted to address with the new solution.

In late 2015, we started a series of technical discussions and some proofs of concept, and decided to adopt React in our front-end. A few arguments led us to this decision:

Standardized component structure

Even after creating a dozen components in globotv-ui, we could still find differences among them. That’s because the structure was loose and not well documented. We usually started a new component by looking at an existing one and replicating its structure. But many different developers worked on that library, each with their own preferences, so we really didn’t have a pattern for components. They were well organized and tested, but in the end they were just a bunch of JS and CSS files, sometimes with a template for generating the HTML (in Handlebars for client-side rendering, or ERB templates with Rails helpers for server-side).

The clear and well-known React component structure helps keep a consistent pattern across components. The component lifecycle lets us manage how our components should behave. It allows us not only to open source some generic components, but also to look for ready-made components instead of recreating everything we need.

As an example of this, we wanted to keep our header sticky after the user scrolls down the page. Instead of implementing this behavior, we used react-headroom. Problem solved!

Declarative programming model

Another benefit React brings is its declarative programming model, as opposed to the traditional imperative model. Here is a simple example: a text area with a “Tweet” button that should be disabled while the text area is empty. Here is the imperative implementation using jQuery:

// Initially disable the button
$("button").prop("disabled", true)

// When the value of the text area changes...
$("textarea").on("input", function() {
  // If there's at least one character...
  if ($(this).val().length > 0) {
    // Enable the button.
    $("button").prop("disabled", false)
  } else {
    // Else, disable the button.
    $("button").prop("disabled", true)
  }
})

Now the React version:

import React, { Component } from "react"

class TweetBox extends Component {
  state = {
    text: ""
  }

  // Arrow function class property keeps `this` bound when used as an event handler
  handleChange = (event) => {
    this.setState({ text: event.target.value })
  }

  render() {
    return (
      <div>
        <textarea onChange={this.handleChange}></textarea>
        <button disabled={this.state.text.length === 0}>Tweet</button>
      </div>
    )
  }
}

Globo Play release

The third strong argument in favor of React was the release of Globo Play in late 2015. It’s another video product developed here at Globo.com, very similar to Globosat Play, and it already used React. So when we started developing the new interface for SporTV Play for the Olympic Games, the team that built Globo Play already had solid experience with React to help our adoption.

The new architecture

As described in a previous post, our architecture was already microservices-based:

Globosat Play original architecture

In the front-end, we had several apps serving different pages of our product, with an nginx server in front of them. The nginx server proxies requests to each app according to the request path. Our idea was to add a new rule to it, forwarding all requests for the SporTV Play home and live signal pages to a new app. This new project would use React and, ideally, share components with Globo Play.

The first step was thinking about the API for this new front-end app. We already had a couple of APIs serving our current apps, so we didn’t strictly need to create a new one. We could have just used what we already had, but we decided to follow the Back-end for Front-end pattern instead.

The APIs we already had served many of our apps well, and some of them were legacy. They expose a lot of services, most of them fine-grained. To serve the new front-end app, we would probably need to make many requests to gather all the data we needed.

Also, we thought it was a good idea to separate the new app from the old ones. Doing this, there would be less chance of coupling old and new apps. Suppose both consumed the same service; that would create a coupling between them. If the new app required a change in that service’s contract, we couldn’t make it without risking breaking the old app.

Besides that, we know mobile consumption is rising, and users sometimes suffer with terrible 3G/4G connections. With a new API specifically designed to meet the needs of the new app, we could deliver the smallest possible payload (just the data we were going to use). We could also create more specific services, reducing the number of requests to a minimum.

We followed two principles from the Back-end for Front-end (BFF) pattern: the front-end consumes services from its BFF and nothing more; and the front-end is the only client of its BFF. They are tightly coupled, but that isn’t a problem, because both should be maintained by the same team. It’s as if we split our app in two: the BFF is responsible for orchestrating requests to internal, fine-grained services, applying some business rules and delivering data ready to be consumed by the front-end, which just consumes these coarse-grained services and takes care of the presentation layer.

BFF architecture
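
To make the idea concrete, here’s a minimal Ruby sketch of what a coarse-grained BFF endpoint could look like - this is not our actual code; the URLs and field names are hypothetical, and it assumes Sinatra just to keep the example short:

require "sinatra"
require "net/http"
require "json"

# Hypothetical internal, fine-grained services
SCHEDULE_API = "http://schedule-api.internal.example.com/current".freeze
SIGNALS_API  = "http://signals-api.internal.example.com/live".freeze

def fetch_json(url)
  JSON.parse(Net::HTTP.get(URI(url)))
end

# A single coarse-grained endpoint, shaped for the front-end: it orchestrates
# the internal requests and returns only the fields the UI actually needs.
get "/bff/live" do
  schedule = fetch_json(SCHEDULE_API)
  signals  = fetch_json(SIGNALS_API)

  content_type :json
  {
    now_playing: schedule["current_program"],
    signals: signals.map { |s| { id: s["id"], title: s["title"], thumb: s["thumb"] } }
  }.to_json
end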

One downside of the BFF is code replication. Some of the services we created in the BFF were new and very specific to our new SporTV Play front-end app (like the current schedule and the list of live signals). Others were already available in our internal APIs. But to follow the rule where the front-end can’t access any service outside its BFF, we needed to add a new route to the BFF that essentially proxies requests to our internal APIs. Like any decision in software engineering, it’s a tradeoff. There is no perfect solution for everything, and for us the benefits outweighed this issue.

For more details and discussions about the BFF pattern, check out Sam Newman’s and Phil Calçado’s articles.

This is the first post of a series. In the next ones, I intend to write a little more about React components, state management, CSS architecture and components sharing.


Handling HTTP cache is one of the most important aspects of scaling a web application. Used well, it can be your best friend; used badly, it may be your worst enemy.

I’m not going to explain the basics of caching here; there is already plenty of great material about it. Instead, I’m going to discuss a specific problem.

In a previous post, I wrote about the Globosat Play architecture. As explained there, it evolved into a microservices architecture, and as a result we ended up making a lot of HTTP requests to our internal services. So we needed to manage those requests very well.

Suppose you access a page like the Combate channel home page. To fill in all the information on that page, we need to query data from:

  • a videos API, to bring up the list of available channels and the latest videos from the Combate channel
  • a highlights API, to fetch the latest highlights selected by an editor
  • an events API, to fetch the list of previous and upcoming UFC events

That means a user request could be represented by something like this:

Requests without cache

Now imagine one of these services is unavailable. Or very slow. Or returning an unexpected answer. If we didn’t consider these scenarios, we would end up with a brittle application, susceptible to all kinds of issues.

Michael Nygard, in Release It!, says we must develop cynical systems:

Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.

That means we shouldn’t trust anybody. Don’t assume a service is up, available, fast and correct. Even if you know and trust the maintainers of this service, consider that it may have problems (and it will, eventually!). One of the defense mechanisms against this is caching.

At Globosat Play, we decided to implement two levels of cache. We call them performance and stale.

Performance cache

The performance cache is meant to avoid a flood of unnecessary requests to a single resource in a short period of time. Going back to the Combate home page example, one of the services our back-end requests is the list of upcoming UFC events. This doesn’t change often; only when a new event is created or an event finishes, once every couple of weeks. That means it’s very wasteful to hit that service for every user accessing the Combate home page. Suppose the events API response changes once a week; if that page gets 100,000 hits in that period, I would make 100,000 requests to that API, when I could make just one and keep the result in cache, which is much faster.

The solution for this is keeping a performance cache for a specific period of time. Suppose I set my cache to 5 minutes. The decision flow (sketched in code after the list) would be:

  • cache available? Respond with the cached content
  • cache unavailable? Make the request, write the response to the cache, set its TTL (time to live) to 5 minutes, and respond
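
As a rough illustration of this flow (not the actual implementation - we use the Content Gateway gem shown later), here is a Ruby sketch using ActiveSupport’s in-memory cache store; fetch_from_events_api is a hypothetical HTTP call:

require "active_support"
require "active_support/cache"

CACHE = ActiveSupport::Cache::MemoryStore.new

# Performance cache: reuse a recent response instead of hitting the service again.
def next_events
  cached = CACHE.read("events_api:next_events")
  return cached if cached # cache available? respond with it

  # cache unavailable? make the request, store it with a 5-minute TTL, respond
  response = fetch_from_events_api # hypothetical HTTP call to the events API
  CACHE.write("events_api:next_events", response, expires_in: 300)
  response
end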

That means I hit that API only once every 5 minutes, independently of how many users are accessing my home page right now. I’m not only avoiding wasteful requests, but also protecting my internal services and giving faster responses - it’s much faster to read from the cache than to make an HTTP request. The diagram below depicts this scenario:

Requests with cache

The problem with this scenario is that even if I’m sure my events API only changes once a week, I can’t set my cache TTL to one week. Imagine if I do that and the cache expires a few minutes before a new event is registered: I won’t see the new event until the next week! You need to carefully evaluate the performance cache time for each service you depend on.

Even for a service that can’t be cached for that long, you can still benefit greatly from caching the request for at least a few seconds. Imagine an application handling 10,000 requests/s. If you set the back-end service cache TTL to 1 second, you make roughly one request per second to your service instead of 10,000!

Stale cache

The second cache level is stale. It’s a safety net against problems like network instability or an unavailable service. Let’s use the latest videos API as an example. Suppose my application back-end tries to access this service and gets a 500 HTTP status code. If I have a stale cached version of the response, I can use it to give a valid answer to my client. The stale data may be outdated by a few minutes or hours, but it’s still better than giving no response at all - of course, that depends on the case. For some kinds of services an outdated response may not be acceptable, like showing the wrong balance when your client accesses his bank account. But in most cases, stale cache is a great alternative.
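
Keeping the same hypothetical helpers from the sketch above, the stale level can be sketched as a fallback: every successful response is also written to a long-lived stale key, which is only served when the service fails:

# Stale cache: keep a long-lived copy to fall back on when the service fails.
def latest_videos
  response = fetch_from_videos_api # hypothetical HTTP call, assumed to raise on errors/5xx
  CACHE.write("videos_api:latest:stale", response, expires_in: 6 * 60 * 60) # 6 hours
  response
rescue StandardError
  # Service down, timing out or returning 5xx: serve the stale copy if available.
  CACHE.read("videos_api:latest:stale") or raise
end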

Usually we set the performance cache time to a few minutes and the stale cache to a few hours. Our standard setup is 5 minutes and 6 hours, respectively.

Implementing cache levels in Ruby

To implement the performance and stale cache levels in Ruby applications, we created and open sourced a gem called Content Gateway. With it, managing cache levels becomes much easier.

After installing it, you need to configure the default request timeout, the performance and stale cache expiration times and the cache backend, among other optional settings:

config = OpenStruct.new(
  timeout: 2.seconds,
  cache_expires_in: 5.minutes,
  cache_stale_expires_in: 6.hours,
  cache: ActiveSupport::Cache.lookup_store(:memory_store)
)

gateway = ContentGateway::Gateway.new("My API", config)

With this basic configuration, you can start to make HTTP requests. You can also override the default configurations for each request:

# Params are added via query string
gateway.get("https://www.goodreads.com/search.xml", key: YOUR_KEY, q: "Ender's Game") # => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<GoodreadsResponse>\n  <Request>..."

# Specific configuration params are supported, like "timeout" and "skip_cache"
gateway.get_json("https://api.cdnjs.com/libraries/jquery", timeout: 1.second, skip_cache: true) # => {"name"=>"jquery", "filename"=>"jquery.min.js", "version"=>"3.1.1", ...

It supports POST, PUT and DELETE as well. For each verb there are two methods for making the request: one is simply named after the verb, and the other has a _json suffix. The former treats the response body as a string; the latter parses it into a Hash.

gateway.post_json("https://api.dropboxapi.com/2/files/copy", headers: { Authorization: "Bearer ACCESS_TOKEN" }, payload: { from_path: "path1", to_path: "path2" })
gateway.put_json("https://a.wunderlist.com/api/v1/list_positions/id", payload: { values: [4567, 4568, 9876, 234], revision: 123 })
gateway.delete("https://a.wunderlist.com/api/v1/tasks/id")

You can also make a few other customizations. Check out the project page on GitHub for more information and examples.


Nerdcast is an amazing podcast (in Portuguese) about nerdy stuff in general. One of its latest episodes was about the digital artist profession. The guests were animators who worked on feature films like Moana and Doctor Strange. They talked a lot about what they do, and I saw many analogies between their work and software development.

One of the podcast guests said he had been working on a specific scene for Doctor Strange for about 2 months when his boss called to tell him the director had decided to cut that scene from the film. Those 2 months of work turned into garbage. The lesson he learned is that you shouldn’t get attached to the project you’re working on. It’s not your project; it’s your company’s project, and you happen to be working on it, which is very different. The analogy here is very clear, because this happens a lot in software projects: sometimes your client decides the feature you’ve been working on for weeks or months is not that important, or even worse, that the whole project shouldn’t be maintained anymore. It’s very hard not to be affected by news like this, but if you don’t learn to let go, you will frequently get frustrated. The project is meant to deliver value to your client, not to you. Always remember that you are not paid to write code.

If you want a project to be really yours, and have total freedom to decide what to do and how to do it, you need a personal project. But it’s important to notice that, if someday you decide to make that a business, turning that into a startup or selling some kind of product or service, you will start to have clients. When that happens, you need to be ready to let go of your ideas; if nobody wants your product, be ready to pivot (or even discontinue the whole product). A/B tests are also a great way to learn how to let go of your ideas and beliefs: if a hypothesis is proven worse than the default behavior, just delete it.

Another important topic mentioned in the podcast: the guests, as artists, have a hard time deciding when to stop improving their work. They start working on a scene and iterate a few times to make it better. As perfectionists, they want to keep polishing. But sometimes the scene already looks so good that further improvements won’t be noticed by the audience, so they won’t deliver value to the client anymore. The problem is not knowing when to stop. We can make an analogy with refactoring: sometimes we develop a new feature, and even after it’s implemented and well tested, we decide to refactor. The objective could be making the code clearer for anyone who may touch it later - to implement a new feature or fix a bug - or maybe extracting a part of it to remove duplication from a similar feature you already had. In both cases, the refactoring won’t deliver value in the short term, but it will in the medium or long term: you will have lower maintenance costs. But we may have the same problem as the animators: it’s very hard to know when to stop polishing the code. At some point, the refactoring doesn’t deliver value anymore, and we refactor just to please ourselves. As said before, you need to remember the project is not yours, and you are not paid to write code!


This post is at least one year late. Ever since I gave a few talks about the Globosat Play architecture (slides in Portuguese), I have intended to write a more detailed post, but I kept procrastinating.

Globo.com is the internet arm of the largest media conglomerate in Brazil, and one of the largest in the world. One area of the company is responsible for our video platform, which includes encoding, distribution and streaming for any website of the group that needs video.

About 5 years ago, one of our video teams developed a video product called globo.tv (recently discontinued and replaced by a newer product, Globo Play). Its content was focused on Rede Globo (our main broadcast TV network) shows, like news, sports and the famous Brazilian telenovelas. Most of globo.tv’s content consisted of short scenes from these TV shows, open to every user, but it also offered full episodes to paying subscribers.

globo.tv home page

The original architecture was a single monolithic Rails app, with a Unicorn application server, a MongoDB database and a Redis instance for caching, all behind an nginx HTTP server. This architecture served us very well for some time.

globo.tv architecture

Then came 2012, with new demands for globo.tv. We needed to start offering live streaming of several events, like UFC, Big Brother Brasil, soccer championships and the Winter Olympic Games. We also needed to start offering the video collection of the Combate channel, focused on MMA.

At this point, we realized the old monolithic architecture wouldn’t do anymore, so we needed to break it into smaller parts. The first step was identifying smaller subdomains inside the videos domain. The first ones we identified and split out were live streaming, for the new live event demands, and VoD (video on demand), for the Combate channel videos.

Clearly these new subdomains deserved their own projects. The problem was that they needed to share a few business rules and data with the original globo.tv project, and also between themselves. That meant we needed to extract a few services from globo.tv into their own project: globotv-api was born!

globo.tv architecture, 2.0

At this point, we had:

  • globotv-api, which offered a few basic video services, like the most recent videos from a specific program and the most watched programs
  • globotv, the remainder of the original project, which started to consume globotv-api services. It was responsible for serving the original pages, like home, program page, video page and search
  • globotv-events, a new project responsible for live streaming of different kinds of events. It also consumed globotv-api and offered its own API with specific services, like a list of live events happening right now
  • globotv-vod, another new project, which served the collection of videos for Combate channel. It also consumed globotv-api and offered its own API with specific services, like a list of competitions and videos from a specific fighter

Breaking up the monolith was a very important move. If we hadn’t split it at that time, we would have ended up with an ever-growing project that would soon become a monster - harder to understand, harder to maintain, harder to evolve. The split allowed us to share the small part of our domain that was common to these new requirements and the original project.

To split the front-end among different projects, we used nginx as a router. We already used it in front of our application server; with multiple projects, we created an upstream for each one and configured a location for each URL pattern, like this:

upstream globotv {
  server globotv.internal.globo.com;
}

upstream globotv-events {
  server globotv-events.internal.globo.com;
}

upstream globotv-vod {
  server globotv-vod.internal.globo.com;
}

server {
  listen 80;
  server_name globotv.globo.com;

  location ~ (.+)/ao-vivo/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-events;
    break;
  }

  location ~ ^/combate/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-vod;
    break;
  }

  location ~ /  {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv;
    break;
  }
}

The internal domains aren’t publicly exposed; the only way to access the projects is through nginx. With a configuration similar to this, nginx routes incoming requests for the globotv.globo.com domain to different projects, according to the URL pattern: URLs containing the “/ao-vivo/” pattern (“live” in Portuguese) are routed to the globotv-events project; URLs starting with “/combate/” are forwarded to the globotv-vod project. The last location matches every other URL and sends it to the original globotv project. The globotv-api project doesn’t appear in this configuration, as it isn’t publicly accessible (it’s only accessed by the other projects). This configuration allowed us to serve different pages from different projects, transparently to the user.

Despite its many benefits, this architectural change brought new challenges. The first one was keeping a consistent visual identity across pages served by different projects. For example, the video thumb component used on the home page should be just like the video thumb on the search page; the product header should be the same on every page. The split into multiple projects was an architectural decision; the user shouldn’t have to notice it, because globo.tv was still a single product.

The solution to this problem was creating a component library, called globotv-ui. With it, we were able to share visual components composed of HTML, JS and CSS. They were standardized and documented, which made it very easy to create new components and share them among these projects - as all of them were Ruby on Rails projects, we delivered the globotv-ui library as a Ruby gem.

Examples of components from the globotv-ui library

Fast forward a few years, and a new video product emerged: Globosat Play. It was very similar to globo.tv in a few aspects - the idea was to offer VoD and live streaming from the Globosat channels (TV channels available only to paying subscribers) - but with a few differences: we also needed to offer movies, and the focus was now on subscribers rather than free users.

Globosat Play home page

The main challenge at that point was how to share components and services between these two products without one limiting the other’s evolution and requirements. We needed to re-evaluate our architecture and business domain to solve this. We realized that, as both products had many similarities and some differences, we needed to create a common, shared layer of services, while keeping some services specific. The new architecture became something like this:

Globosat Play architecture

This diagram is simplified compared to the previous ones - the new projects are also Rails apps with their own MongoDB and Redis instances. The green boxes are projects that serve web pages, and the blue ones are APIs. We split our projects into 3 groups: the top one in the image is specific to globo.tv; the bottom one is specific to Globosat Play; and the middle one represents shared services. Notice that most of the projects created before (globotv-api, globotv-events and globotv-vod) became shared, as Globosat Play had the same requirements.

Besides that, we created a few other projects. Some of them, like movies, served a specific subdomain that didn’t exist before; others, like globotv-search, were extracted from the original globotv project - these features already existed, but now we needed to share them between globo.tv and Globosat Play. Also, globotv-api remained our main source of basic video services.

This evolution also required a few new configurations on our nginx server:

upstream movies {
  server movies.internal.globo.com;
}

upstream globotv-search {
  server globotv-search.internal.globo.com;
}

upstream globosat-play {
  server globosatplay.internal.globo.com;
}

upstream globotv-events {
  server globotv-events.internal.globo.com;
}

upstream globotv-vod {
  server globotv-vod.internal.globo.com;
}

server {
  listen 80;
  server_name globosatplay.globo.com;

  location ~ (.+)/ao-vivo/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-events;
    break;
  }

  location ~ ^/telecine/ {
    proxy_set_header Host $http_host;
    proxy_pass http://movies;
    break;
  }

  location ~ ^/busca/ {
    proxy_set_header Host $http_host;
    proxy_pass http://globotv-search;
    break;
  }

  location ~ /  {
    proxy_set_header Host $http_host;
    proxy_pass http://globosat-play;
    break;
  }
}

In the end, we realized the microservices architecture brought a lot of advantages:

  • smaller, easier-to-manage projects: each subdomain lives in its own project, which helps keep the code smaller and easier to understand. It also allows different teams to work on different projects
  • faster builds: with many small projects, the time needed to build and run the test suite of each one gets smaller, which gives faster feedback cycles. That encourages developers to run the test suites more frequently
  • smaller and less risky deploys: each project is responsible for a small part of the product, which means bugs only affect that small subdomain. Suppose you introduce a bug in the search project; that would affect only the search page. Your users would still be able to access the home page and watch videos. That gives the team confidence to deploy to production more frequently, which further reduces the risk of bugs
  • flexible infrastructure: every service is a REST API over HTTP, using JSON. That means you could write each one in a different language, with a different database technology, if you wanted or needed to. You can pick the best tool for each job
  • easier incremental changes: suppose you want to migrate your application server from Unicorn to Puma. With a monolithic application, you would need to flip the switch all at once. With many small services, you can choose one to try Puma on - maybe the least critical one. If the migration is successful, you can continue the process one project at a time

But we also had a few disadvantages:

  • more complex architecture: when a new developer joins your team, it’s much harder to explain how the architecture works and what each project is responsible for
  • harder local environment setup: another problem for new developers is setting up the local environment. With many projects, each with its own requirements, this is much harder
  • harder to update dependencies, like newer gems: when you need to update a dependency, like a gem that fixes a critical security flaw, you need to do that once for each project. With a monolith, you would only need to do it once
  • harder to test: when each service depends on many others, setting up the test environment is much harder. You need either a complex environment for integration tests, fake APIs replacing the real ones, or mocked API responses, using something like VCR (see the sketch after this list)
  • heterogeneous environment: the flexibility gained with microservices may result in projects in different languages, with different databases and other dependencies. That makes maintenance harder, because not every developer will understand the whole set of technologies
  • more failure points, harder to debug: when a service fails, the problem may be a bug or a database failure, for example, but it may also be caused by a failure in another service it depends on. The debugging process gets harder and slower
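
As an example of the last testing approach, here’s a minimal sketch of how VCR can record an HTTP interaction once and replay it in later test runs - it assumes the vcr and webmock gems, and the URL is hypothetical:

require "vcr"
require "net/http"

VCR.configure do |config|
  config.cassette_library_dir = "spec/cassettes" # where recorded responses are stored
  config.hook_into :webmock                      # intercept HTTP calls through WebMock
end

# The first run hits the real service and records the response to a cassette;
# later runs replay the cassette, so the test doesn't depend on the service being up.
VCR.use_cassette("events_api_next_events") do
  response = Net::HTTP.get(URI("http://events-api.internal.example.com/next-events"))
  puts response
end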

In the end, the migration from a monolith to microservices was very successful for us. But the main point here is realizing that microservices aren’t a magical solution to every problem - splitting a messy monolith would just generate many messy microservices, as illustrated in the image below. You should analyze the characteristics of your project before deciding whether microservices are the best option.