Editor’s note: Today we are hearing from Ben Hizak, co-founder of Cherre. He shares Cherre’s approach to consolidating multiple data sets into one data warehouse to power BI and analytics for their customers. He explains how the company uses Google Cloud and Cloud SQL to bring real estate datasets together and help customers make better data-driven decisions.
Cherre’s quest to transform real estate investing and underwriting into a science requires us to connect thousands of datasets and make them accessible for investment, management and underwriting decisions. By delivering actionable insights based on fresh data, our clients can answer questions that they could never answer before. Each of our clients is interested in a different combination of public and paid datasets, which is then to be combined with that client’s private data. All of this has to be done in a secure, scalable and repeatable manner. We process this data in BigQuery and then store it in Cloud SQL for PostgreSQL, a single source of truth that our clients can use to train machine learning models, consume via an API which runs on Google Kubernetes Engine (GKE), or visualize using BI tools like Looker.
Before Cherre, this kind of consolidated data work was simply not available in the real estate industry. Before Cherre, datasets were never in one place, and each dataset contained different properties with different attributes. For example, data on transactions from one vendor and data on securitized debt from another lived in completely different worlds prior to Cherre. And private data vendors never before let raw data leave their systems. Building that trust so those vendors now allow their data onto our platform has been a unique honor and responsibility.
Our clients are large organizations such as large real estate funds, banks, and insurance companies, which manage billions of dollars each. For them, the difference between a right decision and a wrong decision can make the difference between a successful business and a failed business. Their need for speed is insatiable. We enable our clients to make better decisions faster by putting clean, connected data at their fingertips. Cherre customers can evaluate opportunities and trends faster and more accurately, and efficiently manage their portfolios, while saving millions of dollars in manual data collection and analytics costs.
Consolidating data to turn investing into a science
Consolidating thousands of datasets and millions of data points requires both technical capabilities and human sensitivities. We serve as a center point for two sets of stakeholders — for large institutions that require the highest levels of security and for providers of proprietary data that stake their reputations on our ability to deliver their data on comparable or better infrastructure than they provide to their own clients. Our reputation relies on the strength of our technical infrastructure.
When choosing a database to back our API, Postgres was the natural solution. It is feature-rich, fast, and has a strong community. Postgres also has strong geospatial abilities, which is an absolute requirement for clients in our industry.
Postgres and GraphQL combine well to serve different use cases in a ‘wide data’ scenario
To understand our stack, it helps to understand our data. Most data engineers are working in a “thin and fast” scenario; a small number of fields have to be processed very rapidly. Our case is the opposite. Real estate data is “wide and slow(er)”; our input consists of many thousands of tables. We reduce these to something more manageable, just over 100 tables and thousands of columns for certain clients, which they then can consume via our API and visualize in Looker. Some of our clients even run their calculations directly on our infrastructure
REST doesn’t cut it
Every client asks different questions from their data. While some might care about specifics in a certain building — number of tenants, average occupancy or length of occupancy — others will be more interested in general information, such as fire safety codes. . Our clients expect to be able to filter on any column and traverse datasets without having to understand how they join together. In the classic world of REST APIs, clients would have to pull all the objects and know how to assemble them on the client’s side. However, that would be antithetical to our purpose — to make data ingestions as simple as possible. Data wrangling is our job. We had to look for a different approach, which required a very different architecture.
GraphQL—a more modern type of API — comes in handy. Not only does it prevent over-fetching and under-fetching of data, but it also joins tables and traverses relationships (one-to-many, many-to-one, many-to-many) in a way that is seamless for the client. Real estate data can be difficult to wrangle in that it comes from thousands of authorities, in various formats and qualities. The data is built on various objects, often lacking a consistent key or identical data point, which makes for complex and very subtle joins. Cherre wanted to take on that burden, and not pass it to our clients. This is where GraphQL shines. Employing GraphQL lets Cherre focus on the minutiae so that our clients can focus on the big picture.
GraphQL allows each of our clients to query only the objects and fields that they care about, and have them be joined, filtered and aggregated on the server side, yielding a response that is easy to consume. Take the following example in which we conduct a one-to-many join: buildings with their apartments (units), and filtering on both entities:
In this example, we can see a many-to-one query, because each apartment belongs to only one building:
And in this example we can see a one-to-many query that pulls apartments based on the building they belong to (in this case buildings that were built after 2010 and have a doorman).
Disposable tables – Because you can’t always get it right the first time
Since Cherre is working on thousands of datasets and tens of thousands of input fields, the chances that we will misunderstand and miscalculate a field are so high as to be a certainty. To preempt this, we wanted to be able to rebuild our output from scratch. We were looking for something like “disposable infrastructure” but for datasets.
We decided that Cherre would not operate on existing data and instead, would always recompute what we need from scratch. This approach is often called immutable tables. If there’s a mistake, you can fix the code and rerun — voila, mistake corrected.
To achieve this, Cherre builds immutable tables that live for a short time. We then funnel some or all of the client traffic to the new table. We keep the old table in case we need to do a zero-downtime rollback, but ultimately delete the old table to conserve space. We delete data with full confidence that we can rebuild it from scratch, by running the appropriate version of the code.
Immutable tables are more costly and take longer to compute, but they allow Cherre to iterate more quickly with the confidence that nothing is ever lost. It also allows us to compare two versions of our output.
Why Cloud SQL
As you can see, our database needs are not trivial, which is why we chose Cloud SQL for PostgreSQL. We need to control our costs, and we do it using Cloud SQL committed-use discounts (CUDs and) and a cost monitoring tool from DoIT international. We also need high availability, read replicas, and backup management. Cloud SQL does all that for us so we don’t have to spend development time on it. It also saves us time we’d have otherwise had to spend patching and upgrading the software.
We use Google Cloud projects to implement robust information security, so having a database that sits well within the Google projects system enables us to be consistent across the organization with the way we manage permissions and security. And we leverage Config Connector, which allows us to provision Google Cloud resources through Kubernetes. Since we want to use infrastructure-as-code for everything and we are using Google Kubernetes Engine (GKE), it was natural for us to use Config Connector for our operations.
We’re extremely proud of what we’ve built, but there’s always room for improvement. As technology continues to evolve and as we continue to develop, we’re seeking out new opportunities to become more efficient. We’re also looking to leverage deeper database optimization, which is much easier to do now, thanks to Query Insights. More information is becoming available every day. The buyers, sellers and other real estate professionals that can make smarter decisions based on that data will gain an edge in the marketplace. We’re committed to making the connections they need to get ahead.
Cloud BlogRead More