Nerdcast is an amazing podcast (in Portuguese) about nerdy stuff in general. One of its latest episodes talked about the digital artist profession. The guests were animators who worked on feature films like Moana and Doctor Strange. They talked a lot about what they do, and I saw a lot of analogies between their work and software development.
One of the podcast guests said that he had been working on a specific scene for Doctor Strange for about 2 months when his boss suddenly called him and told him the director had decided to cut that scene from the film. Those 2 months of work turned into garbage. The lesson he learned from that is that you shouldn’t get attached to the project you’re working on. It’s not your project; it’s your company’s project and you happen to be working on it, which is very different. The analogy here is very clear, because that happens a lot in software projects: sometimes your client decides the feature you’ve been working on for a few weeks or months is not that important, or even worse, that the whole project shouldn’t be maintained anymore. It’s very hard not to be affected by this kind of news, but if you don’t learn to handle it, you will frequently get frustrated. The project is meant to deliver value to your client, not to you. Always remember that you are not paid to write code.
If you want a project to be really yours, with total freedom to decide what to do and how to do it, you need a personal project. But it’s important to note that if someday you decide to turn it into a business, as a startup or by selling some kind of product or service, you will start to have clients. When that happens, you need to be ready to let go of your ideas; if nobody wants your product, be ready to pivot (or even discontinue the whole product). A/B tests are also a great way to learn how to let go of your ideas and beliefs: if a hypothesis turns out to perform worse than the default behavior, just delete it.
Another important topic mentioned in the podcast was that the guests, as artists, have a hard time deciding when to stop improving their work. They start working on a scene and iterate a couple of times to make it better. As perfectionists, they want to keep polishing their work. But sometimes the scene already looks so good that further improvements won’t be noticed by the audience, so they won’t deliver any more value to the client. The problem is not knowing when to stop. We can make an analogy here with refactoring: sometimes we develop a new feature, and even after it’s implemented and well tested, we decide to refactor. The objective could be making the code clearer for anyone who may touch it later - to implement a new feature or fix a bug - or maybe extracting a part of it to remove duplication with a similar feature you already had. In both cases, the refactoring won’t deliver value in the short term, but it will in the medium or long term: you will have lower maintenance costs. But we may have the same problem as the animators: it’s very hard to know when to stop polishing the code. At some point, the refactoring won’t deliver value anymore, and we just refactor to please ourselves. As said before, you need to remember the project is not yours, and you are not paid to write code!
This post is at least one year late. Since I gave a few talks about the Globosat Play architecture (slides in Portuguese), I have intended to write a more detailed post, but I kept procrastinating.
Globo.com is the internet arm of the largest media conglomerate in Brazil, and one of the largest in the world. One of the areas of the company is responsible for our video platform, which includes encoding, distribution and streaming for any website of the group that needs videos.
About 5 years ago, one of our video teams developed a video product called globo.tv (recently discontinued and replaced with a newer product, Globo Play). Its content was focused on shows from Rede Globo (our main broadcast TV network), like news, sports and the famous Brazilian telenovelas. Most of the globo.tv content consisted of short scenes from these TV shows, open to every user, but it also offered full episodes for paying subscribers.
The original architecture was a single monolithic Rails app, with a Unicorn application server, a MongoDB database and a Redis instance for cache, all behind an nginx HTTP server. This architecture served very well for some time.
Then came 2012 with new demands for globo.tv. We needed to start offering live streaming of a couple of events, like UFC, Big Brother Brasil, soccer championships and the Winter Olympic Games. Also, we needed to start offering a collection of videos from the Combate channel, which focuses on MMA.
At this point, we realized the old monolithic architecture wouldn’t serve us anymore, so we needed to break it into smaller parts. The first step was identifying smaller subdomains inside videos. The first ones we identified and split were live streaming, for the new live events demand, and VoD (video on demand), for the Combate channel videos.
Clearly these new subdomains deserved their own projects, but the problem was that they needed to share a few business rules and data with the original globo.tv project, and also with each other. That meant we needed to extract a few services from globo.tv into their own project: globotv-api was born!
At this point, we had:
globotv-api, which offered a few basic video services, like the most recent videos from a specific program and the most watched programs
globotv, the remainder of the original project, which started to consume globotv-api services. It was responsible for serving the original pages, like home, program page, video page and search
globotv-events, a new project responsible for live streaming of different kinds of events. It also consumed globotv-api and offered its own API with specific services, like a list of live events happening right now
globotv-vod, another new project, which served the collection of videos for Combate channel. It also consumed globotv-api and offered its own API with specific services, like a list of competitions and videos from a specific fighter
The break-up of the monolith was a very important move. If we hadn’t split it at that time, we would have ended up with a larger and larger project, which would soon have become a monster - harder to understand, harder to maintain, harder to evolve. The split allowed us to share the small part of our domain that was common to these new requirements and the original project.
To split the front-end among different projects, we used nginx as a router. We already used it in front of our application server, but with multiple projects, we created an upstream for each one, and configured a location for each URL pattern, like this:
The internal domains aren’t publicly exposed; the only way to access the projects is through nginx. With a configuration similar to this, nginx routes incoming requests for the globotv.globo.com domain to different projects, according to the URL pattern: URLs that contain the “/ao-vivo/” pattern (“live” in Portuguese) are routed to the globotv-events project; URLs starting with “/combate/” are forwarded to the globotv-vod project. The last location matches every other URL and sends it to the original globotv project. The globotv-api project doesn’t appear in this configuration, as it isn’t publicly accessible (it’s only accessed from the other projects). This configuration allowed us to serve different pages from different projects transparently for the user.
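The routing rules described above can be modeled in a few lines of Ruby. This is only a sketch of the decision logic, not our nginx configuration: the project names come from the text, but the regular expressions and the simple "first matching rule wins" ordering are assumptions made for illustration (nginx has its own location-matching algorithm).

```ruby
# Ordered routing table: the first rule whose pattern matches the path wins.
# Patterns are illustrative assumptions based on the URL rules in the text.
ROUTES = [
  [%r{/ao-vivo/},   "globotv-events"], # URLs containing "/ao-vivo/" (live)
  [%r{\A/combate/}, "globotv-vod"],    # URLs starting with "/combate/"
  [//,              "globotv"]         # catch-all: every other URL
].freeze

# Returns the name of the project that should serve the given path.
def upstream_for(path)
  _pattern, upstream = ROUTES.find { |pattern, _| pattern.match?(path) }
  upstream
end

puts upstream_for("/ao-vivo/ufc")        # globotv-events
puts upstream_for("/combate/lutadores/") # globotv-vod
puts upstream_for("/busca")              # globotv
```

Note that globotv-api has no entry in the table, which is what keeps it unreachable from the outside in this model.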
Despite many benefits, this architectural change brought new challenges. The first one was keeping a consistent visual identity among pages served from different projects. For example, the video thumb component used on the home page should look just like the video thumb on the search page; the product header should be the same on every page. The split into multiple projects was an architectural decision; the user doesn’t need to know about it, because globo.tv was still a single product.
The solution for this problem was creating a components library, called globotv-ui. With this solution, we were able to share visual components, composed of HTML, JS and CSS. They were standardized and documented, which made it very easy to create new components and share them among these projects - as all of them were Ruby on Rails projects, we delivered the globotv-ui library as a Rubygem.
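To give an idea of what a shared component could look like, here is a minimal sketch of a video thumb packaged as plain Ruby. The module name, method signature, markup and CSS class names are all hypothetical; the real globotv-ui components also shipped JS and CSS, which are left out here.

```ruby
require "erb"

# Hypothetical shared component, as it might live inside a gem like globotv-ui.
module GlobotvUi
  module VideoThumb
    # Escapes user-visible text so titles and URLs cannot break the markup.
    def self.h(text)
      ERB::Util.html_escape(text)
    end

    # Returns the same markup no matter which project calls it.
    def self.render(title:, url:, image_url:)
      <<~HTML
        <a class="video-thumb" href="#{h(url)}">
          <img src="#{h(image_url)}" alt="#{h(title)}">
          <span class="video-thumb__title">#{h(title)}</span>
        </a>
      HTML
    end
  end
end

puts GlobotvUi::VideoThumb.render(
  title: "UFC & friends", url: "/v/1", image_url: "/thumbs/1.jpg"
)
```

Because every project renders the thumb through the same method, a visual change only has to be made once, in the gem.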
Fast forward a few years, and a new video product emerged: Globosat Play. It was very similar to globo.tv in a few aspects - the idea was to offer VoD and live streaming from Globosat channels (TV channels available only for paying subscribers), but with a few differences: we also needed to offer movies, and now the focus was on subscribers, not on free users anymore.
The main challenge at that point was how to share components and services between these two products without one limiting the other’s evolution and requirements. We needed to re-evaluate our architecture and business domain to solve this issue. We realized that, as both products had many similarities and some differences, we needed to create a common, shared layer of services, but also keep some services specific. The new architecture became something like this:
This diagram is simplified compared to the previous ones - the new projects are also Rails apps with their own MongoDB and Redis instances. The green boxes are projects that serve web pages, and the blue ones are APIs. We split our projects into 3 parts: the top one in the image is specific to the globo.tv product; the bottom one is specific to Globosat Play; and the middle one represents shared services. Notice that most of the projects created before (globotv-api, globotv-events and globotv-vod) started being shared, as Globosat Play also had those same requirements.
Besides that, we created a few other projects. Some of them, like movies, served a specific subdomain that didn’t exist before; others, like globotv-search, were extracted from the original globotv project - these features already existed, but now we needed to share them between globo.tv and Globosat Play. Also, globotv-api remained our main source for basic video services.
This evolution also required a few new configurations on our nginx server:
In the end, we realized the microservices architecture brought a lot of advantages:
smaller, easier to manage projects: each subdomain is separated into its own project, which helps keep the code smaller and easier to understand. Also, this allows different teams to work on different projects
faster builds: with many small projects, the time needed to build and run the test suite for each one gets shorter, which gives faster feedback cycles. That encourages developers to run the test suites more frequently
smaller and less risky deploys: each project is responsible for a small part of the product; that means a bug only affects that small subdomain. Suppose you introduce a bug in the search project; that would affect only the search page. Your users would still be able to access the home page and watch videos. That gives the team confidence to deploy to production more frequently, which further reduces the risk of bugs
flexible infrastructure: every service is a REST API over HTTP, using JSON format. That means you could write each one in a different language with a different database technology, if you wanted or needed to. You could select the best tools for the job
easier incremental changes: suppose you want to migrate your application server from Unicorn to Puma. With a monolithic application, you would need to flip the switch all at once. With many small services, you could choose one to try Puma - maybe the least critical one. If the migration is successful, you could continue with this process one project at a time
But we also had a few disadvantages:
more complex architecture: when a new developer joins your team, it’s much harder to explain to them how the architecture works and what each project is responsible for
harder local environment setup: another problem for new developers is setting up the local environment. With many projects, each one with its own requirements, that’s much harder
harder to update dependencies, like newer gems: when you need to update a dependency, like a gem that fixes a critical security flaw, you need to do that once for each project. With a monolith, you would just need to do that a single time
harder to test: when each service depends on many others, the setup for the test environment is much harder. You would need to set up a complex environment for integration tests, or create fake APIs to replace real ones, or maybe mock API responses, using something like VCR
heterogeneous environment: the flexibility gained with microservices might result in projects in different languages, with different databases and other dependencies. That makes the whole system harder to maintain, because not every developer may understand the whole set of technologies
more failure points, harder to debug: when a service fails, the problem may be a bug or database failure, for example, but may also be the result of a failure in another service it depends on. The debugging process gets harder and slower
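The "harder to test" item above mentions fake APIs as one way to avoid a full integration environment. A minimal sketch of that idea, using only the Ruby standard library: the client class, the endpoint path and the response shape are hypothetical and do not describe the real globotv-api; the point is only that the HTTP transport is injected, so a test can swap in a fake.

```ruby
require "json"

# Hypothetical client for a service like globotv-api.
class VideoApiClient
  # transport: any object responding to #get(path) and returning a JSON string.
  def initialize(transport)
    @transport = transport
  end

  # Returns the titles of the most recent videos of a program.
  def most_recent_titles(program_id)
    body = @transport.get("/programs/#{program_id}/videos/recent")
    JSON.parse(body).map { |video| video["title"] }
  end
end

# Fake API used in tests: serves canned responses instead of doing HTTP.
class FakeTransport
  def initialize(responses)
    @responses = responses
  end

  # Raises KeyError for paths the test did not prepare, exposing wrong URLs.
  def get(path)
    @responses.fetch(path)
  end
end

canned = { "/programs/42/videos/recent" =>
           JSON.generate([{ "title" => "Episode 1" }, { "title" => "Episode 2" }]) }
client = VideoApiClient.new(FakeTransport.new(canned))
p client.most_recent_titles(42) # ["Episode 1", "Episode 2"]
```

The trade-off is that a fake can drift away from the real service, so this approach complements integration tests and does not replace them.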
In the end, the migration from monolith to microservices was very successful for us. But the main point here is realizing that microservices aren’t a magical solution to every problem - splitting a messy monolith would just generate many messy microservices, as illustrated in the image below. You should analyze the characteristics of your project before deciding if microservices are the best option.
The last talk of the event presented the Stack Overflow architecture. This talk was very controversial, and I felt I needed to write something to show my point of view on this subject.
A lot is being said in the community about the value of software engineering best practices. For the last few decades, people like Uncle Bob, Kent Beck, Martin Fowler and many others have been making a great effort to promote these practices, through books, posts and talks. These professionals did and still do a great job guiding new developers, so we can have well tested software projects and well designed architectures, focusing on aspects like maintainability, scalability, security and quality.
Despite this, we know that, in many companies, management puts pressure on developers to deliver as fast as possible, forgetting about software quality and maintainability - even knowing they will be responsible for the project maintenance, at least in the short term. This habit goes against every best practice described above, and is very harmful to new developers. Those who start their careers in companies with this mindset, even if they’ve read about best practices, end up believing that these practices are utopian, and that in practice it’s impossible to design a sustainable architecture, refactor legacy code or automate tests, because of the pressure for fast delivery of new features. That’s why it’s essential to spread the word about best practices, to show these new developers that focusing on quality is not only possible, but also essential for the project’s evolution and maintenance.
In the talk that closed DevDay 2015, the Stack Overflow architecture was presented. The numbers shown are impressive: it’s one of the top 50 sites in the world, with millions of page views per day, and all of that is supported by only 9 physical servers, each one working at around 5% load, plus 2 database servers. With this structure, the average load time is only 18 ms. How is this possible?
The secret, according to this talk, is their obsession with performance. Every new feature must have the best possible performance. When a library or tool they use isn’t considered fast enough, they rewrite it from scratch. Since they consider layered architectures slow, every database query is written manually, directly in the controller.
Stack Overflow code has low testability, because you can’t create mocks to replace the database connection, for instance. That’s why the project has very few automated tests (and it was mentioned that some developers don’t even run the tests). As they have a massive number of active and engaged users, any bug that’s introduced by a deploy is quickly found and reported on Meta Stack Overflow. To update the operating system version, they just remove a server from the pool, apply the update and put the server back live. They assume that, if a new bug arises, users will soon find and report it.
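To make the testability argument concrete, here is a small sketch of the layered alternative, where the controller receives its data source instead of querying the database directly. This is not Stack Overflow’s code, and every class and method name is hypothetical; it only shows why an injected dependency can be replaced in a test, at the cost of one extra layer of indirection.

```ruby
# Hypothetical controller that depends on an injected repository.
class QuestionsController
  # repository: any object responding to #find(id).
  def initialize(repository)
    @repository = repository
  end

  # Returns a [status, body] pair, a simplified stand-in for an HTTP response.
  def show(id)
    question = @repository.find(id)
    return [404, "Not found"] if question.nil?

    [200, question[:title]]
  end
end

# In-memory stand-in for the database layer, usable in tests.
class InMemoryQuestionRepository
  def initialize(rows)
    @rows = rows
  end

  # Returns the row for the id, or nil when it does not exist.
  def find(id)
    @rows[id]
  end
end

repo = InMemoryQuestionRepository.new(1 => { title: "How do I exit Vim?" })
controller = QuestionsController.new(repo)
p controller.show(1) # [200, "How do I exit Vim?"]
p controller.show(2) # [404, "Not found"]
```

With the query written inline in the controller, the same test would need a real database connection.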
The speaker made it very clear that modeling, architecture and tests are good stuff, but they’re not for everyone. I personally disagree.
Stack Overflow is a very particular case. As their audience is made of developers, and they’re very engaged and passionate about the product, bugs in production are considered acceptable, because the main focus is performance. But what’s the point of performance without quality? Would you buy the fastest car in the world even knowing it doesn’t have seat belts and air bags, and that it doesn’t support replacing a flat tire or a defective part? The analogy is very exaggerated - a bug in the site doesn’t put lives at risk - but what I mean is, if you focus only and exclusively on performance, you give up other aspects like quality and security. It’s the same line of thought as that of the managers I mentioned before, who put a lot of pressure on the team for fast delivery, regardless of quality.
In my opinion, even if your project focuses on performance, quality can’t be abandoned. The other extreme - over-engineering - is also bad; if you have a small application, with only a couple of users, it doesn’t make sense to create a complex architecture just because it might grow one day. This would be creating a solution for a problem that doesn’t exist.
I have a real example of this: I register all my expenses in a Google spreadsheet. I wanted to share this data with my wife, but as the spreadsheet is very large, I created a small app that extracts the data and displays a very simple dashboard, with only the information that she is interested in. This app has only two users - me and her - and there is no chance that it will grow. In this case, it doesn’t make sense for me to think about a scalable architecture. But when we talk about a product with tens of millions of users, the situation is very different. Currently, Stack Overflow doesn’t have a competitor to match, but if one arises with a better user experience or new features, they will have a hard time keeping up.
My main concern while watching this talk was the impact that these ideas can have on the audience. The majority of them were very young, probably students or professionals starting their careers. A talk like this, spreading the idea that software quality is optional, can be very harmful to them. And it seems like Facebook has a similar issue with code quality.
To wrap up, I want to make it clear that I’m not making a personal attack against Stack Overflow or the speaker. I admire her courage to take the stage and present such controversial ideas, even though I disagree with them. I’m a Stack Overflow user and will continue to be one, after all.
After a few years maintaining this blog, I decided to start writing in English. I feel like I can reach a wider audience this way. Sometimes when I comment on a blog post or GitHub issue, or answer a question on Stack Overflow, I want to link to something I wrote in my blog, but as I wrote it in Portuguese, I can’t do that.
I don’t think this is going to be a problem for most developers - at least I hope so; if you are a software developer and don’t understand English, you should!
I may occasionally write something in Portuguese again, if I have a reason for that, so I tagged every old post with portuguese. And every post in English will be tagged with english. I also added direct links in the sidebar, for easy access.