The Fall

2022-07-02

Here is how we failed at another access control system. And also how we destroyed a successful startup.

What to do and what not to do

The project was great. Another ambitious one. The bank already had an access control system. More than 4000 branches around the entire country. The main goal was simpler. We needed to build a web application that would interact with the doors of the branches. Easy: we had a private network from one of the biggest banks in the country, so it would never be slow like MonKey was. MonKey’s problem was a network problem; it could never be bad, untested code. But we would do things differently. To prevent the frontend from being slow, we used Ember.js, not AngularJS anymore. Now we would get a performant frontend. Besides network issues, MonKey had framework issues. AngularJS was not good enough for our application. The application was too complex for AngularJS, which was the reason behind the framework change.

Of course, we chose a better solution than C++ for the backend. No one codes in C++ anymore. We had Python and NodeJS as options. Java was too heavy and slow. PHP? No. We chose NodeJS. If it happened today, I’m pretty sure we would have chosen Deno… Or, Rust…

Well, we were prepared to work on the frontend and backend. We had a PostgreSQL instance just for us. We had development, testing, QA, and production environments. We had Azure DevOps, with automated pipelines for each environment. We had environment variables. We had secrets so we would not expose our credentials and security stuff. Oh, we had SSH access to development, testing, and QA! We had a nice web page that showed us our build and deployment status, so sweet!

But we had one problem. We had more than 4000 branches using two different versions of a device. One thing the bank told us was to use the same hardware. It would be too expensive to replace them all with new ones. And they trusted the hardware already installed because it was already working. No problem. Let’s look at the two versions of the device without a manual or anything helpful to understand how they operate. No source code available, absolutely nothing. It needed reverse engineering. I needed to open both versions, analyze the circuit, and discover which ports to write bytes to and read bytes from. Well, you get the idea. Take something and rebuild it from scratch without knowing anything about it.

The deal breaker

The bank can’t use the software if it does not work like it is supposed to. If it’s an access control system, it is supposed to control access and interact with hardware. In real time. If we had no hardware, we had no access control. Simple as that.

I was really familiar with hardware stuff. It had been my focus since we started the mini-door thing, the OmniKey. And I got pretty good at it. One of the best classes I took in college was Architecture and Computer Organization. We learned how CPUs work, and how to build single-cycle, multicycle, and pipelined processor architectures using SystemVerilog. During the course, we built 3 MIPS processors, one for each architecture. The final project was to create a game in [MIPS Assembly](https://en.wikipedia.org/wiki/MIPS_architecture) language and run it on an FPGA with our pipelined MIPS processor in it. My group and I implemented a Snake game in Assembly. We ran it on our pipelined processor. This was so amazing. We struggled a lot, but our group got the maximum grade. Our pipelined CPU worked, and the game had no visible bugs.

This class made me learn a lot about CPUs, microcontrollers, and microprocessors. I learned assembly. I was the hardware guy. My final project at college was a Recurrent Neural Network implemented on FPGA using Stochastic Computing. I’ll leave this for another time.

The first thing to do was to decide how I would build this new thing. When I opened the device, I could see that the hardware was actually simple. It was a Raspberry Pi 1 attached to a bigger board, which routed the circuit to the buzzer, LEDs, fingerprint sensor, serial bus, keyboard, and LCD screen. The two versions of the hardware differed only in their fingerprint readers. So if I built software for one, it was supposed to work on the other. I would just need to identify the fingerprint reader. Dope.

I created another operating system image based on the old one that was in use. I updated the packages, cleaned up the operating system, and left only what I would use. Since time was short and I needed to be fast, I chose NodeJS as the primary language. NodeJS was a really great choice for that. After understanding how to interact with all the peripherals on the board, I could write C++ code to get low-level access to resources. And then I could compile the C++ code and call it from the NodeJS code. Pretty sick.

Hello world

While other people were building the frontend and backend, and crossing their fingers to get a device to connect, I was creating simple applications as proofs of concept to show stuff working on the hardware. The LEDs blinking, the buzzer buzzing, and even the fingerprints being read. On both devices. I created a structure to identify the fingerprint sensor at runtime and instantiate the correct handler for it. If there was any problem, it threw an error, and you’d see the red LED blinking intermittently. I was ready to start coding for it. The hardest part was done. We had our operating system fully prepared to get some NodeJS code running.
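
The runtime identification described above can be sketched roughly like this. It is a minimal illustration, not the original code: the handler classes, the sensor IDs (`FP-A`, `FP-B`), the `probe` function, and the `led.blinkRed()` call are all hypothetical names chosen for the example.

```javascript
// Hypothetical handler classes: one per fingerprint reader model.
class ReaderAHandler {
  constructor(port) { this.port = port; }
  name() { return "reader-a"; }
}
class ReaderBHandler {
  constructor(port) { this.port = port; }
  name() { return "reader-b"; }
}

// Maps the ID a sensor reports about itself to its handler class.
// The IDs are made up for illustration.
const HANDLERS = new Map([
  ["FP-A", ReaderAHandler],
  ["FP-B", ReaderBHandler],
]);

class UnknownSensorError extends Error {}

// `probe` is any function that asks the attached sensor to identify itself.
function createFingerprintHandler(probe, port) {
  const sensorId = probe(port);
  const Handler = HANDLERS.get(sensorId);
  if (!Handler) {
    throw new UnknownSensorError(`Unsupported fingerprint sensor: ${sensorId}`);
  }
  return new Handler(port);
}

// Startup: on failure, signal the problem on the red LED and rethrow.
function boot(probe, port, led) {
  try {
    return createFingerprintHandler(probe, port);
  } catch (err) {
    led.blinkRed();
    throw err;
  }
}
```

The rest of the application only talks to whatever handler comes back, so supporting a third reader would mean adding one class and one map entry.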

I implemented a local database for the device, secure endpoints, and secure connections. I handled concurrent operations. I handled errors. I designed a state machine for it, which was implemented as a Singleton, and this machine managed all the states. It had a debug mode, where I could force the machine to go to the state I wanted, or I could type some numbers on the device and execute a state. Then I started building blocks, identifying which states I would use according to the project’s requirements.

Working with this device was one of the most exciting experiences I’ve ever had in my life. From a totally unknown device I could build something new, something even better and faster. It was totally scalable (state machines can be scalable), debuggable, and, to everyone’s surprise, testable. I must confess that I did not implement unit tests for it. I built testing states. For each state I had, I also had a testing state. It was triggered from the command line. And it produced a report at the end. If I changed or created a state, I could see whether any state was buggy, or whether a transition between states was correct or not. It was not perfect, but it was good enough.
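
To make the idea concrete, here is a small sketch of a singleton state machine with a debug-only "force" and a per-state test report. It is an assumption-laden simplification: the class and method names are invented, and each state carries a `test` function here instead of having a separate testing state as the real device did.

```javascript
class StateMachine {
  static #instance = null;

  // Singleton access: every caller shares the same machine.
  static getInstance() {
    if (!StateMachine.#instance) StateMachine.#instance = new StateMachine();
    return StateMachine.#instance;
  }

  constructor() {
    this.states = new Map(); // name -> { run, test }
    this.current = null;
    this.debug = false;
  }

  // `run(ctx)` returns the name of the next state; `test()` returns true on success.
  register(name, run, test) {
    this.states.set(name, { run, test });
  }

  // Normal operation: run the current state and follow its transition.
  step(ctx) {
    const next = this.states.get(this.current).run(ctx);
    if (!this.states.has(next)) throw new Error(`Unknown state: ${next}`);
    this.current = next;
    return next;
  }

  // Debug mode only: jump straight to any registered state.
  force(name) {
    if (!this.debug) throw new Error("force() requires debug mode");
    if (!this.states.has(name)) throw new Error(`Unknown state: ${name}`);
    this.current = name;
  }

  // Run every state's test and collect a report keyed by state name.
  runTests() {
    const report = {};
    for (const [name, { test }] of this.states) {
      try {
        report[name] = test() === true ? "ok" : "fail";
      } catch (err) {
        report[name] = `error: ${err.message}`;
      }
    }
    return report;
  }
}

// Example wiring with two made-up states.
const machine = StateMachine.getInstance();
machine.register("idle", (ctx) => (ctx.fingerPresent ? "reading" : "idle"), () => true);
machine.register("reading", () => "idle", () => true);
machine.current = "idle";
```

Calling `machine.runTests()` after changing a state gives a quick per-state report, which is the same feedback loop the testing states provided.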

Connecting pieces together

Frontend and backend were already communicating with each other. The frontend was sending HTTP requests, and the backend queried the database and returned data. Now we needed actual data coming from the devices. And we needed to send data to the devices. We used Socket.io for that. For some reason, we were not supposed to use fancy tools for IoT. We could just use sockets and HTTP requests. Brokers, message queues, and stuff like that were not an option. To this day, I don’t know why the IT people blocked us on that. They only allowed software that they had authorized. No Kafka. No MQTT. No RabbitMQ. Why? No idea. IT people from the bank decided that. We explained why we needed those tools, but they refused. So we chose Socket.io. And we imagined something like chat rooms for devices, where the backend server would manage those rooms and connections.
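
A rough sketch of the "chat rooms for devices" idea, assuming a Socket.IO v3+ server object is passed in as `io`. The handshake fields (`deviceId`, `branchId`), the room naming, and the helper names are hypothetical; the real system may have grouped devices differently.

```javascript
// `io` is expected to be a Socket.IO server instance, e.g. `new Server(httpServer)`.
function attachDeviceRooms(io) {
  const online = new Map(); // deviceId -> room name

  io.on("connection", (socket) => {
    // Hypothetical handshake payload: each device announces its id and branch.
    const { deviceId, branchId } = socket.handshake.auth || {};
    if (!deviceId || !branchId) {
      socket.disconnect(true); // reject connections that do not identify themselves
      return;
    }

    const room = `branch:${branchId}`;
    socket.join(room);
    online.set(deviceId, room);

    // Forget the device when its connection drops.
    socket.on("disconnect", () => online.delete(deviceId));
  });

  return {
    // Send a command to every device in one branch.
    sendToBranch(branchId, event, payload) {
      io.to(`branch:${branchId}`).emit(event, payload);
    },
    // Used by the frontend to show a green or red device status.
    isOnline(deviceId) {
      return online.has(deviceId);
    },
  };
}
```

With this shape, the backend never tracks individual sockets: it addresses a branch by room name and lets Socket.IO deliver the event to whichever devices are currently connected.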

The system was working. We integrated the device with the backend using sockets. Data was coming and going. The frontend looked great. We had two devices (one for each version) and used them for our development. I had my personal device to do my job. The team used the other two devices so they could work on their stuff.

Things were good, but the relationships within the team were not. Money was not coming in, and people were stressed, upset, and hurt. Our associates were fighting and provoking each other all the time. The partnership was almost gone, and so was respect. We were not friends anymore. Things were weird between us all. Our investor was not paying us, and we were getting angry. He promised us that he would pay us when we reached some checkpoints. There was no contract, just words. We were naive. Or just really tired. Or both. He promised the money, but in the meantime, we could be hired by his company, and that way we would receive some money to keep working. Well, it’s clear what was happening, right? Also, there were some strange things happening…

Private conversations

We were 5 associates, and I considered them all great friends. My best friends.

- There was the design guy, very funny and creative. He delivered beautiful work.
- There was the administrator guy, who led us and managed stuff. Great guy.
- The high guy, very hyperactive and creative, a fast programmer. Great guy as well.
- The mid guy. I really admired him and respected him a lot. I took him as my mentor at the time. I wanted to be like him.
- And… me.

Well, the high and mid guys had been fighting a lot since the Mandrill VMS failure. They were friends and stuff, but they were constantly fighting. The other two guys and I tried to make things work, but no luck. We could see that we were no longer the team we had been in the past. It was like each of us was working for ourselves. The mid guy was constantly talking to the investor without us, like he was trying to do something, trying to get out or to get a solo opportunity. I don’t know. He always told the investor that my work and the high guy’s work were terrible, and that it was better to dump it all in the trash. But his code was perfect. Then, the investor came to us complaining that we were not doing a good job. It happened many times.

We started facing lies between us, and people omitting things. We entered the worst phase of our business relationship and friendship. Things became unsustainable. People started going to work when other people were not working so they could avoid seeing each other. And still no money. Actually, yes, we had money. We were employees of our own investor, who was going to pay us later, but no contract was signed.

We had some project managers on the project. They were employees of the investor’s company. We had 6 or 7 different people managing the project in less than 2 years. I think less than 1 year. Anyway, things were really bad. The client (the bank) was not satisfied with the project. It was constantly buggy: the web app did not load, or the data was incorrect. The pressure on the project managers was always immense, and the team was not a team of employees. I mean, it was, as long as the investor did not pay us. But the project managers didn’t have any authority over us, so they were not respected and ended up quitting.

The new project manager

There was one project manager. He was the last one before we deployed our system to production. He was the PM guy. He knew how to talk to people and make things work. He managed to be friends with everyone on the team except the high guy. The high guy hated him. The PM was always trying to get things working correctly, and the high guy was not comfortable following the process. He was an incredibly skilled developer. He did not need to follow the best practices of software development. He just needed coffee, headphones, and code. The mid guy, on the other hand, loved the PM. It turns out that this PM came from a huge and pretty important project inside the bank, one that brought in a lot of money for the bank. The mid guy was interested in other projects. He did not care anymore about us and our partnership. He wanted to grow and go it alone. So he became BFFs with the PM. And the PM loved him. He saw the mid guy as a god of software development.

I liked the PM; he was nice and was trying to become my friend. He saw that I had some skills, and he was looking for skilled people for his projects. I had no problems with him. As long as I had peace to do my work and the client was happy, I was happy. Our leader, the manager guy, didn’t like the PM. Actually, they didn’t like each other, and the PM was always provoking the manager guy. Tough times. The client was satisfied, the project looked to be going forward, and things were good.

The firing

Well, it was too good to be true. In another crucial meeting with the client to demo our system, we found many ridiculous bugs. It was unbelievable that we hadn’t gotten it working properly for the client. We had bugs the client had reported weeks earlier that were not fixed. Or they were badly fixed, or worse… We listed almost 2 pages of bugs we had in the system. Most of them (if not every one of them) were frontend bugs. The PM asked the high guy how much time he would need to clear those bugs out. It turns out that the high guy was so skilled that he did not need to give estimates. He was a pro coder; he didn’t follow those protocols. He was a hacker. There were fights between the PM and the high guy. Lots of fights. Until the day the PM decided to fire him and kick him off the project. He would no longer be one of us. And the mid guy was making his way to this enormous project that the PM had.

The investor didn’t talk to us either. We barely communicated. He proposed to end our partnership. The mid guy also asked to leave because he was going to pursue different things for his career. The high guy left sometime later. It made no sense for him to remain in our partnership. The design guy went to work at a great startup; it was an excellent opportunity for him. So it was just me and the manager guy. We still needed to move forward with the access control system for the bank. Team-less. Almost team-less, actually. It was me, the manager guy, and the fullstack guy. We had a woman who QA-ed the system. She was an excellent QA. But from the original company, from those almost 30 people, just us 3 were left. To deliver a vast system. To a bank. Over 4000 branches.

Pilot

We were ready to deploy our first branch. Everything was already scheduled. I would go to the bank branch and update the device’s operating system to ours, and it was supposed to connect to our production backend. The whole process was pretty quick, but from my perspective, time froze. I was so nervous. We already had a bunch of failures in the past (actually, we did not have any success). And this was it: the proof of whether our work was successful or not. Well, after starting up the device, the production frontend showed green for the device status. It was online. I ran the test suite for the peripherals, did the registration process for creating credentials for the security guards, and everything worked fine.

We had some goals. First we needed 1 device running. Then 5, 10, 15, 20, 30, 40, 50… And so on. Until we reached those 4000+ and finished the whole migration to our system. We fixed bugs when the client reported them. We trained people to install the new operating system and start the device. It was pretty simple, and they did wonderfully. We trained people to use the software, identify problems, and deal with them. People loved it. So beautiful, full of graphs and numbers, green and red device statuses. Databases were syncing when devices lost connection. Everything was great.

The client was extremely happy. People got promoted and started making more money in their new positions. They had new shoes, new watches, and new fancy sunglasses. I still had my old dirty jeans, black shirts, and old shoes. I didn’t get promoted. My manager partner didn’t get promoted either. Neither did the fullstack guy. Well, we were still working for our investor, who cheated us and lied to us. We never received any percentage of the project (we knew how much the project cost the bank, so we could easily estimate how much we were supposed to receive). We had a good partnership with the client; he really liked us and cared for us. He constantly bought pizza for us. He was one of the best people I’ve ever met. He treated us so well. And he saw what happened to us. He knew everything from the beginning. He knew we got cheated and never received any money. Then, he orchestrated something incredible.

Winds of change

It didn’t make sense to work for our investor. He was still making money from our work and paying us ridiculously small amounts of money. The company was just me, the manager guy, and the fullstack guy, who is one of the best developers I’ve ever seen. He was my first choice for a coding partner, and he still is to this day.

Well, the client guy managed to change the company responsible for the access control project. The previous company, which I will call the bad company, had several contracts with this bank. Many of them were cut off, and other companies were assigned to them. Since the bad company treated us like trash, the bank did the same to them. It cost them a lot of money, probably much more money than they owed us. We were hired by a new company, which was now responsible for this access control system. We got an excellent salary. We got a great team and a true architect to lead us.

It turns out that, besides the incompetence of the bad company, there was a technical problem. The device migrations had stopped, and the system could not hold more than 60 connected devices. It made the bank spend a lot of time and money. We had a huge problem, the same problem we had in the past. The frontend and backend had terrible performance issues. The current architecture was so poorly designed that it could not handle a bunch of logged-in users and 60 devices. The new architect guy, the fullstack guy, and I started studying the horrendous codebase. The architect guy was not as experienced with JavaScript and NodeJS as we were, but he knew how to handle performance issues and bottlenecks. He showed us how to debug and test the performance of the system. He taught us techniques that we didn’t have and that were essential for us to refactor almost all the code.

To this day, I have never seen worse code. There’s a reason why we have asynchronous code in JavaScript, async and await. There’s a reason why some decisions need to be made when we have single-threaded applications. The high and mid guys just ignored those good practices. The code was entirely synchronous… Using asynchronous functions. For almost every statement, we had an await. Our event loop was blocked virtually all the time. We had a monolith application. It was nicely organized but poorly coded. We started breaking the system into smaller applications, each responsible for one thing only. It was still a monolithic application, but it had environment variables that controlled which code each instance would run. We could start as many instances as we wanted, and control which features each instance would take care of. Each instance logged to its own file so that we could find problematic features. And we could see metrics for each instance, and identify memory leaks, CPU-intensive operations, and stuff like that. We reduced the complexity of the system, created interfaces, and applied design patterns. We abstracted stuff. It wasn’t a sophisticated approach, but it brought the desired results.
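
To illustrate the "an await for almost every statement" problem, here is a small sketch, not taken from the real codebase. `fakeIo` is a made-up stand-in for a database query or HTTP call. Awaiting independent calls one by one makes the total time roughly the sum of all of them, while starting them together and awaiting once with `Promise.all` makes it roughly the time of the slowest one.

```javascript
// Stand-in for an I/O call (database query, HTTP request); resolves after `ms`.
const fakeIo = (value, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// One await per statement: each call waits for the previous one to finish.
async function loadDashboardSequential() {
  const devices = await fakeIo(["d1", "d2"], 50);
  const users = await fakeIo(["alice"], 50);
  const events = await fakeIo([], 50);
  return { devices, users, events };
}

// The three calls are independent, so start them together and await once.
async function loadDashboardConcurrent() {
  const [devices, users, events] = await Promise.all([
    fakeIo(["d1", "d2"], 50),
    fakeIo(["alice"], 50),
    fakeIo([], 50),
  ]);
  return { devices, users, events };
}

// Small helper to compare how long each version takes.
async function timed(fn) {
  const start = Date.now();
  const result = await fn();
  return { result, ms: Date.now() - start };
}
```

Both functions return the same data. Only the waiting pattern changes, which is why this kind of refactor is usually safe wherever the calls do not depend on each other.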

The production environment was a joke. The mid guy had configured our runner to use the default NodeJS settings. We had a machine with 26GB of memory and 16 cores. The runner was starting one NodeJS process on a single core, with 512MB of memory allocated. Besides bad code, we had a bad infrastructure configuration. The more we dug, the worse things looked. We needed to fix this production environment and use all the resources we had for our application. Changes were made. We now had one NodeJS process for each core. We distributed the necessary amount of memory to each Node process. We deployed our refactored frontend and backend. We scaled the system to handle every single branch and hundreds of users connected simultaneously. Data was coming and going, and all branches were opening and closing correctly. And it was possible to watch it all in real time using our web app, full of numbers and graphs.
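
One way to get "one NodeJS process per core" is Node's built-in `cluster` module. The sketch below is an assumption about how such a setup could look, not the configuration we actually shipped: `planWorkers`, `startCluster`, the 2048MB of reserved headroom, and the even memory split are all illustrative choices.

```javascript
const cluster = require("node:cluster");
const os = require("node:os");

// Split the machine's memory evenly across one worker per core,
// leaving some headroom (in MB) for the OS and other processes.
function planWorkers(cores, totalMemMb, reservedMb = 2048) {
  const heapMb = Math.floor((totalMemMb - reservedMb) / cores);
  return { workers: cores, heapMb };
}

// Fork the planned workers; `startServer` is the application's own entry point.
function startCluster(startServer) {
  if (cluster.isPrimary) {
    const totalMemMb = Math.floor(os.totalmem() / (1024 * 1024));
    const plan = planWorkers(os.cpus().length, totalMemMb);
    // Give each worker an explicit V8 old-space limit instead of the default.
    cluster.setupPrimary({ execArgv: [`--max-old-space-size=${plan.heapMb}`] });
    for (let i = 0; i < plan.workers; i++) cluster.fork();
    // Replace a worker if it dies.
    cluster.on("exit", () => cluster.fork());
  } else {
    startServer();
  }
}
```

For a machine like ours, `planWorkers(16, 26 * 1024)` gives 16 workers with 1536MB each. Note that with Socket.IO, running several workers typically also requires sticky sessions or a shared adapter so that rooms keep working across processes.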