The original version of Placement Analysis utilized MySQL to execute and aggregate data. The UI and middle tier would generate a query that would combine the necessary data and then group and sort as needed. When tables were small and our data model relatively simplistic, this performed as needed but as the number of ad placements increased and our data model fragmented across many tables, performance would degrade linearly, if not worse. The reason is the Cartesian joins across the 7-10 large tables was simply too expensive. In addition, it only allowed for grouping and sorting at one tier and in the new Placement Analysis the requirement was to be able to do this on multiple levels or tiers (for instance group by audience, then creative, and then time).
- High performance
- Decouple data model from query engine
- Allow for multi-tier pivoting
- High availability
- Linear scalability
Storing the raw data in MySQL is still useful as the vast majority of the data is relational in nature and MySQL provides ACID transactions. To achieve the performance required and support multi-tier pivoting, Scarecrow loads all of the required MySQL data into memory but breaks it into two types:
- Placement metadata such as creative, and audience
- Performance time based data
Placement metadata is always kept in memory and is immutable. Loading this data often requires joins across many tables and can be slow, however, once in memory it can be accessed very quickly. Recent performance data is always kept in memory while older data is retained based on user access. Changes to data are then streamed into Scarecrow to ensure the query results represent real-time metrics.
The query engine acts upon the data in memory using data agnostic actions (i.e. group, sort, where, having). It is important to note that the data model and data access layers are completely decoupled from the query engine. The paths given to the query engine refer to fields within the object using reflection and support expressions against those fields. This allows for using Scarecrow to analyze other future data objects beyond just placements.
The Scarecrow query engine uses a procedural language to process the data called Chirp. Here is an example:
In this case, placement metadata for site 1234 is utilized and first filters out any placements whose audience min age is 15. The results are then grouped by audience gender and creative title and sorted by image and bid type. Attach then combines metadata with today’s time based performance data and groups it by hour. Finally, the having step filters any aggregate data for each hour to only include those whose click through rate is greater than 50%. Data is represented as a tree with aggregate summaries at each non-leaf node.
The result is that loading Placement Analysis for a customer in the old solution with 1 million placements would take between 20-30 minutes, while Scarecrow will execute a more complex query in under 2 seconds.
Scale & Availability
The remaining two items (linear scalability and high availability) are addressed using a peer-to-peer cluster. Data is divided throughout the cluster by the site that owns it. These “shards” of data are then distributed to a cluster of Scarecrow instances so that the data usage is as evenly distributed as possible as our most scarce resource is memory. The assumption is that a site’s data will easily fit in the memory space of a given node (in practice even our largest site easily fits on a node). As the number of sites increases (and thus the total data space), we simply add nodes to the cluster and rebalance. The peer-to-peer nature of the cluster allows for high availability by self-monitoring. If a node goes down, the other nodes will load its shards and evenly distribute. Client requests are sent to a load balancer which distributes requests randomly through the cluster. The node receiving the request then forwards to the node owning the shard of data responsible for building the response. This allows for the addition or removal of nodes from the cluster without impacting the client. Because we only return one page of data at a time, the cost of tunneling through a node is insignificant.
In addition, code deployment is done via rolling restart where a node is stopped, its data distributed throughout the cluster, the node restarted, and the cluster rebalanced. This eliminates any downtime and the need to fetch data from the database.
The old methodology of storing data in a centralized location such as MySQL is no longer feasible as data scale explodes. Standard NoSQL solutions attempt to address this problem while providing high availability however their ability to answer ad hoc queries is severely limited. Scarecrow utilizes the best of both worlds by understanding and storing the relational nature of data with the ability execute ad hoc queries. This is possible by distributing the data across a peer-to-peer cluster and storing the data in optimized format completely in memory.