Last year, Button launched Button Evolution, ushering a new era of personalization-driven performance in the affiliate world. Since then, our customers have leveraged deeper insights on how to spend smarter, driving higher performance and happier customers.
But how did we accomplish this? We built the entire data platform on new infrastructure and analytics from the ground up. Here is the story of the data team’s journey to build Evolution at Button.
Going into 2019 we were like any regular growing data team: Our stack consisted of a small Redshift cluster; Airflow built by a product engineer (who originally moonlighted as a Data Engineer during a hackathon); Firehose; and Tableau. To push new analytics out, our usual workflow involved creating airflow jobs to manage each ETL step.
As the team grew to add Data Analysts (DA), Data Engineers (DE) and Data Scientists (DS), the number of pipelines skyrocketed, but the development and debugging process became painfully slow. This entire process was clunky and took precious time from DA and DS to do meaningful work. For example, the average time to push a new analysis or dashboard was creeping towards 1-2 weeks.
Our Previous Tech Stack
We ran Redshift at near full capacity and were constantly hit by performance issues. The amount of data we could store was fairly limited and had to be constantly pruned. Scaling up was expensive and did not give us an equal improvement in performance. As the data team and the product needs grew, some queries took over an hour to run or crashed due to lack of memory. They caused the database to hang and crash until a Data Engineer killed the queries from the console.
Furthermore, our Data Scientists were tied to their laptops for work. They were limited to the RAM available and their queries could return terabytes of data. Any disruption to their laptops (like getting disconnected or closing their laptops) means losing the queries that might have run for hours.
We also needed a tool that allowed data scientists and analysts to quickly build pipelines in SQL. Airflow, while it supported major ETL needs for DE, was a major source of frustration for non-DE, requiring high familiarity and time that took away from other roadmap goals. Other companies have encountered similar frustrations as there are a myriad of solutions to abstract away Airflow’s development process ranging from YAML configurations to drag-and-drop UIs; these solutions have been reinvented over and over again.
Overall, every step of the original process cost precious time and generated enough fire drills that we needed a better solution to level up the entire team.
The Evolution (of data at Button)
As we looked at the data-heavy roadmap for the year ahead, it was clear it was time to take action. We had to make 4 major changes: Warehousing, Transformation, Analytics, Computing.
In the end, we chose BigQuery, dbt, Looker, and AI Notebooks respectively.
First, we needed to unlock DE resources to work on high priority projects rather than be bogged down by mundane tasks. A major time sink was database maintenance, requiring precious hours each week worrying about performance, utilization, and migrations.
We chose BigQuery because it solved most of our initial issues. A query that took 45 minutes to run in Redshift, now took only 1-2 minutes to run in BigQuery. Results are stored in temporary tables so there is no need to worry about losing them. It became much easier to play around and switch partition and cluster keys, which in Redshift meant making sure that the server had enough space to make a full copy of the table to be altered which translated to scaling up by increasing the number of nodes. BigQuery is also a managed service so we no longer had to play an active role as DBAs.
As an added bonus, GCP has data transfer tools that made the migration from Redshift to BigQuery absolutely painless.
The above switch meant that we could now focus on ingesting new data sources, improve existing tooling, and even onboard dbt (data build tool), a new tool by Fishtown Analytics that “enables analytics engineers to transform data in their warehouses by simply writing select statements".
Introducing dbt meant that those DA and DS workflows that took weeks to complete, now could be done reasonably within days (or less!) and be fully tested. Airflow still plays a role in orchestrating the dbt executions and handling more complex transformations in python. All the jobs in Airflow which generated aggregations were ported to dbt. During the process, this gave us the opportunity to fully document all the metrics and source fields.
A common problem that data teams face is that data dictionaries are standalone documents or spreadsheets that often become neglected or outdated, Another consequence is that the origins of aggregate fields become challenging to trace. dbt is able to solve this problem by defining models for each table, their lineage to others, and their fields can be documented directly in code. Now we had codified data dictionaries which, as we add or remove fields, these definitions get carried over.
In addition, dbt can generate HTML documentation that can be hosted. Thanks to this, now we have democratized data. Anyone at Button could explore and understand our metrics, their definitions as well as their origins.
We also took the opportunity to bring in Looker, which gave us a simpler and more intuitive experience to build dashboards, benefitting both data analysts (maintainers) and non-data stakeholders.
Looker enables robust analytics workflows and leverages data modeling best practices which were already enforced with dbt. Among these, we have one central view of all the underlying code behind each dashboard (in contrast to using Tableau where dashboards are simple query wrappers). We use Github to review and version control any changes to the LookML. And we leverage the ability to join different tables directly in the LookML.
In addition, Looker has enabled more self-service among non-data internal users through powerful data export functionality, an alerting system that can be set from the dashboards, transparency in the data available through Explores, and by exposing term definitions.
Now that we had all the foundation set up, there still remained the problem of data scientists struggling to do their job on their laptops. Previously, we had explored Sagemaker Notebook instances. However, there were issues with permissions and authentication into Redshift. Since to make the most out of it, data scientists needed to share their scripts and models with each other. With a shared Notebook instance, they could accomplish that, but AWS has no built-in authentication mechanism. The only way was to have the credentials as plaintext.
Google solved that issue with AI Notebooks, that contained all the previous features but also had built-in authentication. The instances could be scaled up or down as needed. With this, our data scientists found they could run everything from the cloud without worrying about running low in memory or disconnecting from the internet. They could pick any tool or language they deemed useful.
Our Current State and What We Have Gained
Overall, our re-platform has been a big success. Some of the results we've seen:
- 100x more data and 50x faster queries in BigQuery vs Redshift
- 120 models created through dbt vs 19 handcrafted models via Airflow
- 35 hours of development time saved per task on average
- More autonomy for DA, DS, and non-data stakeholders
- Product teams can self-serve by creating their own queries and dashboards
This project took us 7 months to complete and it was thanks to the valiant efforts from the entire data team that we are now in a much stronger position as a data-driven company.