May 24, 2020

MAD-API: A real world Digital Transformation experiment

Legacy corporates abound, and they see the Internet unicorns stealing their lunch. They know that if you can't beat 'em, you join 'em. So they also want some of the fancy stuff that the Googles and Facebooks are doing, hoping that will prevent them from falling into irrelevance.

The above is pretty much the context when I started a new role as API Architect in the Digital team of a large, multi-national, monolithic, legacy telco about 3 years ago.  I was to help build an API economy, to be used by internal and external developers and partners.

As the name implies, the rest of the organisation was not 'Digital', so a separate department was needed that could be - a common anti-pattern found in corporates. I recall the first one-on-one coffee I had with our CTIO, where he said our Digital team should do things differently, and even fail and learn as we went along, as we tried to transform the business.

I was extremely fortunate - it's not that often that you get to be part of a Digital Transformation, and better yet to be at the forefront, leading and defining it. To make it even better, for this specific API platform, we were starting a greenfields implementation - we didn't have to re-write any legacy systems.

What was scary was that neither I, nor anyone else on the team or in the company, had ever done this before. That's why I called this an "experiment" - there was no defined plan on how to transform a legacy organisation, so in true agile fashion, we were going to continually experiment: try a few things, see how they worked, and pivot based on the results. On the other hand, I had read many tales of both success and woe from unicorns and other legacy organisations, and I was determined to learn from those lessons. Based on my previous DevOps experience, I felt ready for the challenge. My mindset matched that of Stephen from Ahead in the Cloud:

I was in the right place at the right time — this time with just enough ambition and naiveté to think that I might be able to help the company address some of the issues I highlight above, and transform the role IT played in the company

I am moving on to a new role, so as I come to the end of this amazing journey, I take a look back at the lessons learnt, the things we accomplished, and the challenges that still lie ahead. In a way, this is my personal DevOps-style post-mortem of the whole 30-month journey, with the intent to review what worked and what didn't, so that we can improve in future. Each of the main points below is accompanied by a link to a blog post of mine that discusses that point as it happened in the timeline. This post is therefore almost an index page of the different posts I wrote along the journey, and hopefully links them all into a single cohesive story.

The Foundation

My starting point was the tale of Xerox in my book review of Fumbling the Future, which gives the details of my email and conversation with the then CTIO. These were key in forming the tenets we set out for the program (which I later found out resembled many of the key objectives in Ahead in the Cloud):

  • Become a software company again, by in-sourcing key development talent
  • Not be (completely) outsourced and overly vendor-reliant (we are still a legacy corporate, after all)
  • Use smaller SI partners and vendors, who are much faster, more flexible, and focussed on specific and niche skills, rather than the big SI partners who are bulky, claim to be really good at all things, but manage to do most things OK, and some things not-so-OK
  • Use a decentralised development model of small Product/Feature teams in the OpCos, and break away from the factory mentality of having a single large centralised dev team that develops for all OpCos
  • Build a platform based on best-of-breed layers, not best-of-stack. Typically, corporates are risk averse, so when procuring new platforms, it's deemed safer to go with the full Oracle stack (or SAP, or Microsoft). Much like the big SI partners point above, this means that Oracle might have a solution that's really good as a DB, but not so good as middleware or a web layer. By going for the whole Oracle stack, you're stuck with a sucky web layer on top of a decent DB. So we chose to go for the best individual layer, not the best overall stack. This is different to the [Platform vs Product approach from Cloud Strategy](http://gcp.hacksaw.co.za/blog/book-review-cloud-strategy-gregor-hohpe/), but I think for valid reasons
  • Leverage much more open source instead of proprietary or custom vendor-written systems
  • Become Cloud-first - which means shifting the burden of proof from "why should we build this on cloud?" to "why shouldn't this be on cloud?"

Our intended architecture for building an API ecosystem, where APIs are the actual product, included a developer portal for developers to discover and self-serve, and a microservices architecture to build and deploy the code that serves the APIs. Each API would consist of multiple containers, hence we wanted Kubernetes as the container management platform. All of this was very much in opposition to the legacy architecture, and the typical way of doing SOA and serving APIs with a big fat ESB in the middle.
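To make the contrast with the ESB world a little more concrete, below is a minimal sketch of what one of these API microservices could look like: a small, self-contained service that owns a single API resource, exposes a health endpoint for Kubernetes probes, and is packaged into its own container image. This is purely illustrative - the endpoint, fields and framework choice here are hypothetical, not our actual platform code.

```python
# Illustrative sketch of a single-purpose API microservice (hypothetical
# endpoint and data). Each such service is built into its own container image
# and deployed and scaled independently on Kubernetes, fronted by the API
# gateway, rather than configured as another flow inside a central ESB.
from flask import Flask, jsonify

app = Flask(__name__)

# Health endpoint used by Kubernetes liveness/readiness probes
@app.route("/health")
def health():
    return jsonify(status="UP")

# The API resource this microservice owns and exposes via the gateway
@app.route("/v1/loans/<msisdn>")
def get_loan_offers(msisdn):
    # In reality this would call back to on-premise source systems;
    # a canned response keeps the sketch self-contained.
    return jsonify(msisdn=msisdn, offers=[{"amount": 100, "currency": "ZAR"}])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The important part is the shape: each API is a small, independently deployable unit, which is what made Kubernetes a natural fit for running many of them side by side.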

We also realised that to implement DevOps, we needed agile infrastructure to match the speed of development. Using the public cloud would have been ideal, but not for all cases: while we used public cloud to host the initial platform, which served a few of the initial markets, we knew that many OpCos in Africa have strict regulatory laws preventing the use of public cloud. I therefore wrote a detailed proposal for building a hybrid cloud architecture that would allow us to provision IaaS and PaaS services on both public and an internal cloud, in a standardised fashion. This proposal eventually failed because, as the Digital department, in true legacy siloed fashion, we were not responsible for infrastructure - there was a dedicated Infrastructure team, who claimed that their existing virtualisation was private cloud. Our senior management also lacked the foresight to realise that the lack of this hybrid cloud would lead us to spend far too much time and effort on undifferentiated plumbing, building VMs and networking for each OpCo. And even though we had hosted the initial Group-wide platform on public cloud, we lacked a consistent method to connect back to on-premise source systems, which resulted in us building VPNs per system, per OpCo.

Another one of my key realisations was that we needed to become a software company again, by in-sourcing key development talent. I spent months on this proposal, and tried lobbying any Exec I could find (I once waited until late in the evening to walk an Exec to his car in order to pitch the idea and secure time in his calendar for a more detailed presentation). After many rounds of presentations, we eventually got approval for headcount to hire staff developers.

Because we were a large corporate, weighed down with governance, before we could begin we first needed to get approval for, and then procure, the new systems and partners we needed on the journey - very different to the cloud approach discussed in Cloud Strategy. We therefore launched 3 RFPs for:

1. An API Management Platform: after an almost year-long process, we chose to go with Google Apigee

2. A PaaS - Cloud Native Managed Kubernetes Platform: we eventually settled on using Pivotal Cloud Foundry

3. Multiple SI partners, to do the development and operations of the code and platforms above: we made a conscious decision not to go with the usual suspects (Accenture, Deloitte, IBM), and chose 3 dev partners. In the end we didn't go with the truly small, niche, boutique dev houses that I had envisaged, but it was good enough to start with.

The real start

The above procurement processes were just formalities, but they still took longer than anticipated. So after 9 months (argh! I can't believe I can say that with a straight face) we got tired of waiting for procurement, and decided to choose interim partners and platforms to start the journey. We chose a particular use-case: an API needed to solve some regulatory requirements for a customer-facing app, originally planned to take 6 weeks. Based on that, we quickly met with a few small dev houses and chose one, together with cloud solutions for the platforms we needed (Openshift Online, Apigee, Atlassian, Postman). With only 6 weeks of the year left to deliver this system and its APIs to production, we hit the ground running, hard! We started with 3 developers, and shared other roles, like Product Owner and architect, amongst other members. We were responsible for writing the code, setting up the platform to run the code, and the pipelines to take code from the developer's laptop to production, with the required quality gates. I think this was the most important thing we did - if we had waited for procurement, it would have taken another year, and more importantly, starting out small allowed us to learn and experiment. By starting small, with a defined focus on a project that was urgent but not crucial, we could practically test out the planned architecture - the perfect way to start a digital journey on the cloud.

Within 6 weeks, we met all targets. We stood up the platform, deployed the code, and the mobile app had consumed the API by the 2nd sprint. For the next few sprints, we continued to add more features to the API. The procurement process was still ongoing, so we decided to extend this team for a few more months. (This was potentially the cause of the major issues we later had with scaling - we misled ourselves at this point into thinking we could simply expand beyond the initial use-case without investing in a proper structure.) One of our biggest wins was building a Loans API for a large social networking app to consume - it went from idea to production in 2 sprints. And like that, for the first 6 months, we had some noticeable wins. We invited stakeholders from across the business to the demos, which were very well received. We started receiving requests to build new APIs from across the business.

These few initial wins became our hero project:

Our newfound ability to deliver technology to the business quickly became our hero project, and it helped us encourage my team and executive stakeholders to come on the journey with us.

At this point, we had the key ingredients for a true digital transformation, which allowed us to build the platform to host our API products, and other products:

  • Agile way of working, using Scrumban, with the associated ceremonies of 2-week Sprints as a way to keep focussed, and a retrospective and demo to show what we had built and take feedback to improve in the next sprint.
  • DevOps culture, tools and processes.
  • Flexible architecture, based on microservices, that allowed for independent deployment and scaling.
  • Cloud-based platform that was elastic and could scale on demand, where we paid for only what we used, and didn't have to look after servers and OSs.
  • Business support to solve the problem, and introduce a new way of working.

One of my biggest sources of satisfaction was the DevOps culture we had built. We had developers building pipelines and containers, and deploying their own code. I recall when we had our first outage on a weekend - the whole team dived in to help, and when we returned on Monday, the post-mortem identified the lack of QA as the cause of the failure, and we pivoted to address that.

But alas, as the journey progressed after these few initial wins, we started to hit the limits of senior leadership's legacy mindset, and the threshold of digital change they could endure, expressed as a significant lack of vision and air cover, which led to many issues.

Growing Pains

At this stage, about 6 months after we began, the cracks started to show. Top leadership, caught up in the headiness of our early wins, over-promised what we could deliver to all BUs and all regions, without investing in scaling the team. Without scaling the team, building a proper structure, and investing in more people and tooling, they wanted us to expand to other BUs and regions, going beyond our mandate of doing PoCs and showing quick wins. They also started demanding that we build a certain (randomly chosen) number of APIs per sprint - to them it didn't matter what, we just had to build an ever-expanding catalogue of APIs that they could show off to other Execs as proof of progress (in typical legacy fashion, they valued quantity over quality). It was at this time that one of the senior managers walked into our developer space and started ranting about how we needed to deliver more, faster, and care less about quality and "architecture":

We need to put less effort on quality so we can build more features for our next release

The senior management in our team didn't have the ability to stand up to the Execs and provide us with the air cover we required to continue our work and insulate us from the political in-fighting. All of this resulted in a major decrease in productivity, especially in serving the most valuable customers, those close to us who could most benefit from the platform. The team lost a lot of motivation due to priorities changing mid-sprint, and the demand to just push out APIs, irrespective of whether anyone actually wanted to use them. Our sprint burndown clearly showed the slowdown and loss of productivity, but management's response was the clichéd "just work harder". Sigh!

Beginning to mature, but still lacking top-down support

As the journey progressed, and as budgets became available in the new financial year, we eventually expanded and grew. We onboarded new Dev teams in a few OpCos, to cater for development in those OpCos. We had a number of teething issues with the OpCos, some of which were still recurring much later.


As the backlog grew, and the screams of non-delivery from our customers grew louder, management started throwing developers at the problem. In order to scale, we split the one large team into separate Platform and Product/Feature teams. However, management still didn't quite understand this - frequently defaulting to the siloed, centralised factory mentality. This is possibly because we used the incorrect definition of a pipeline, and therefore didn't focus on optimising the first part of the pipeline. Gary Gruver, in Starting and Scaling DevOps, defines the pipeline as starting with the business idea and requirements:

I have worked with one organization that moved to a more just-in-time approach for requirements and that has transformed their planning processes from taking 20% or more of their capacity to less than 5%. They eliminated waste and freed up 15% of the capacity of their organization to focus on creating value for the business. This was done by limiting long-term commitments of over a year to less than 50% of capacity and committing additional capacity in shorter timeframe horizons. The details of how this worked are in Chapter 5 of Leading the Transformation by Gary Gruver and Tommy Mouser. This was a big shift that freed up more capacity, and it also improved the speed of value through the system because new ideas could move quickly into development if they were of the highest priority instead of waiting in queue behind a lot of lower-priority ideas that were previously planned. This move is a big cultural change for most organizations. It requires software/IT and business executives to think differently about how they manage software. They really need to change their focus from optimizing the system for accuracy in plans to optimizing it for throughput of value for the customer. They need to be clear about the business decisions they need to support and work with the organization to limit the investment in requirements just to the level of detail required to support those decisions.

Even though we had some in-house staff developers (far too few to really make a difference), they were still very junior, and therefore we needed to rely on SI partners for most of the development skills. This led to another anti-pattern: outsourcing DevOps. While contractors and experts are vital for bringing in new skills, the ownership cannot be outsourced. If the team does not see their own execs taking ownership, there is less motivation to overcome cultural inertia.

There were a few important wins against the old thinking along the way, like when the Change Management team wanted us to attend CAB to request approval before deploying to production, which we quickly managed to avoid.

KPIs in legacy orgs are generally done so badly that they encourage legacy behaviour and thinking, so I fought quite hard to get our KPIs based on the solid research of high performers from Accelerate.
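For reference, the Accelerate research boils delivery performance down to four key metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service. The sketch below shows roughly how these could be calculated from a log of deployments - the record format and field names are hypothetical, not our actual reporting code.

```python
# Rough sketch of calculating the four Accelerate (DORA) metrics from a log of
# deployment records. The input format and field names are hypothetical.
from datetime import datetime
from statistics import mean

deployments = [
    # when it went live, when the change was committed, whether it caused a
    # production failure, and how long service restoration took (in hours)
    {"deployed_at": datetime(2020, 5, 1), "committed_at": datetime(2020, 4, 29),
     "caused_failure": False, "restore_hours": 0},
    {"deployed_at": datetime(2020, 5, 3), "committed_at": datetime(2020, 5, 2),
     "caused_failure": True, "restore_hours": 4},
]

period_days = 30
deployment_frequency = len(deployments) / period_days  # deploys per day
lead_time_days = mean(
    (d["deployed_at"] - d["committed_at"]).days for d in deployments)
change_failure_rate = (
    sum(d["caused_failure"] for d in deployments) / len(deployments))
time_to_restore_hours = mean(
    d["restore_hours"] for d in deployments if d["caused_failure"])

print(deployment_frequency, lead_time_days, change_failure_rate,
      time_to_restore_hours)
```

Measuring these from the start would have given us an objective way to show whether delivery was actually improving, rather than arguing about arbitrary API counts per sprint.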

The feedback from the business was also encouraging - a few told us how this new way had increased their ability to deliver. One of the most memorable events was when I was asked to travel to our largest OpCo, to assist with their platform and partner selections. Late one evening, after many presentations from partners, I went up to the marketing department. I argued that they should consider choosing multiple partners, based on the niche skills each has per market segment, which would allow them to deliver truly differentiated services, rather than a single partner who would do an average job across the different market segments. I recall the Senior Manager saying "a very different and new way of thinking - I like it, let's talk to procurement." Very satisfying indeed.

Where we went wrong, and what we can do to improve

In hindsight, our failings were many, but the key ones were:

  • Right up front, we failed to create any KPIs or measures of success, or to establish technical, cultural and leadership metrics, so that along the journey we could tell whether we were improving or getting worse
  • Lack of Transformational Leadership - strong leadership and vision in the team to provide air cover from the rest of the legacy organisation. Almost any technologist's dream come true is defined as:
Significant top-down support with overwhelmingly strong vision and air cover—along with financial backing...
  • We failed to anticipate and cater for the crucial 6th Step mentioned in Ahead in the Cloud:
SCALE AND RE-ORGANIZE - Once you have a few initial projects delivered successfully using the newer approaches and practices, the rest of the organization should become eager to leverage the services, tools, and expertise of the CoE for their specific needs and problems. You have to carefully plan for this critical last step of scaling the CoE function across the rest of the organization. In our case, we were a little late to find out that the CoE had become a bottleneck for rest of the organization to adopt Cloud and DevOps practices. Eventually, we built federated teams and built DevOps capabilities within each application team to scale out the CoE’s function.
  • Lack of hybrid cloud infrastructure caused us to spend too much effort on VMs and firewalls in the OpCos
  • Lack of the ability to integrate the cloud with our internal MPLS meant we needed multiple VPNs per system
  • Skills shortages in the rest of Africa meant we had developers who could code in Java, but who were not familiar with DevOps, containers, k8s, etc.

My personal takeaways

It was here that I learnt the importance of time to value and a Bias for Action, and of not asking for permission to get things done. That was the only way we managed to bypass the procurement circus and actually get going.

The tech part is easy. The biggest thing holding back legacy orgs is cultural inertia.