Summary
Most of the time when you think about a data pipeline or ETL job, what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial-strength solution for it, and how it actually works under the hood. After listening to this, you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Started as a physicist and moved into data science.
Can you start by giving a brief recap of what Cherre is and the types of data that you deal with?
Cherre is a company that connects data.
We’re not primarily a data vendor, in that we don’t sell data.
We help companies connect and make sense of their data.
The real estate market is historically closed, gut-led, and behind on tech.
What are the biggest challenges that you deal with in your role when working with real estate data?
Lack of a standard domain model in real estate.
Ontology: what is a property? Each data source thinks about properties in a very different way, yielding data that looks similar but is materially different.
Quality: even if the datasets are talking about the same thing, there are different levels of accuracy and freshness.
Hierarchy: when is one source better than another?
What are the teams and systems that rely on address information?
Any company that needs to clean or organize (make sense of) its data needs to identify people, companies, and properties.
Our clients use address resolution in multiple ways: via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it.
Can you give an example of the problems involved in entity resolution?
Known entity example.
Empire State Building.
To resolve addresses in a way that makes sense for the client, you need to capture the real-world entities: lots, buildings, and units.
Identify the type of the object (lot, building, unit)
Tag the object with all the relevant addresses
Relations to other objects (lot, building, unit)
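The three steps above (identify the type, tag with addresses, relate to other objects) can be sketched as a small data model. This is only an illustrative sketch, not Cherre's actual schema: the names `EntityType`, `PropertyEntity`, `find_by_address`, and `ancestors` are hypothetical, the single-parent relation is an assumption, and the sample addresses are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class EntityType(Enum):
    LOT = "lot"
    BUILDING = "building"
    UNIT = "unit"


@dataclass
class PropertyEntity:
    entity_id: str
    entity_type: EntityType
    # Every address string known to refer to this entity.
    addresses: List[str] = field(default_factory=list)
    # Assumed relation to the containing object: unit -> building -> lot.
    parent_id: Optional[str] = None


def find_by_address(address: str, entities: Dict[str, PropertyEntity]) -> List[PropertyEntity]:
    """Return every entity tagged with the given address string."""
    return [e for e in entities.values() if address in e.addresses]


def ancestors(entity_id: str, entities: Dict[str, PropertyEntity]) -> List[PropertyEntity]:
    """Walk parent links upward, e.g. from a unit to its building to its lot."""
    chain = []
    current = entities[entity_id].parent_id
    while current is not None:
        chain.append(entities[current])
        current = entities[current].parent_id
    return chain


# Made-up sample data, for illustration only.
entities = {
    "lot-1": PropertyEntity("lot-1", EntityType.LOT, ["1 Example Ave", "2 Sample St"]),
    "bldg-1": PropertyEntity("bldg-1", EntityType.BUILDING, ["1 Example Ave", "2 Sample St"], "lot-1"),
    "unit-1": PropertyEntity("unit-1", EntityType.UNIT, ["1 Example Ave Apt 3"], "bldg-1"),
}
```

Here one building carries two addresses, so a lookup on either address reaches the same lot and building, and `ancestors("unit-1", entities)` walks from the unit up to its building and lot.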
What are some examples of the kinds of edge cases or messiness that you encounter in addresses?
First class is string problems.
Second class is component problems.
Third class is geocoding.
I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved?
What is the need for the service? The main requirement here is connecting an address to a lot, building, or unit with latitude and longitude coordinates.
How were you satisfying this requirement previously?
Before we built our model and dedicated service, we had a basic prototype in the pipeline that handled only NYC addresses.
What were the motivations for designing and implementing this as a service?
Need to expand nationwide and to deal with client queries in real time.
What are some of the other data sources that you rely on to be able to perform this normalization and resolution?
Lot data, building data, unit data, footprints, and address point datasets.
What challenges do you face in managing these other sources of information?
Accuracy, hierarchy, standardization, a unified solution, persistent IDs and primary keys.
Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it?
String cleaning, parse and tokenize, standardize, match.
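The four stages listed above can be sketched as a chain of small functions. This is a deliberately simplified illustration that swaps in lookup tables and exact-key matching; the system discussed in the episode uses natural language processing and entity resolution, which this sketch does not attempt to reproduce. The function names, the abbreviation tables, and the `reference` index are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Tiny illustrative abbreviation tables (assumed; a real table is far larger).
SUFFIXES = {"street": "st", "st": "st", "avenue": "ave", "ave": "ave", "av": "ave"}
DIRECTIONS = {"west": "w", "w": "w", "east": "e", "e": "e",
              "north": "n", "n": "n", "south": "s", "s": "s"}


def clean(raw: str) -> str:
    """String cleaning: lowercase, replace punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(cleaned: str) -> List[str]:
    """Parse and tokenize: split the cleaned string into word tokens."""
    return cleaned.split(" ")


def standardize(tokens: List[str]) -> List[str]:
    """Standardize: map known suffix and direction variants to one form."""
    return [SUFFIXES.get(t, DIRECTIONS.get(t, t)) for t in tokens]


def resolve(raw: str, reference: Dict[str, str]) -> Optional[str]:
    """Match: look up the standardized key in a reference index of entity IDs."""
    key = " ".join(standardize(tokenize(clean(raw))))
    return reference.get(key)


# Made-up reference index mapping a canonical address key to an entity ID.
reference = {"12 w main st": "building-001"}
```

With this index, both `resolve("12 West Main Street.", reference)` and `resolve("  12 W. MAIN ST ", reference)` return `"building-001"`, while an address with no canonical entry returns `None`. Exact-key matching is the weakest part of the sketch: misspellings and missing components are the cases where a learned parser and fuzzy matching would be needed.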
What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion?
Our named entity solution with connection to knowledge graph and owner unmasking.
What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system?
Scaling: the NYC geocode example. The NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure.
Now that you have this system running in production, if you were to start over today what would you do differently?
A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to change or completely replace any given part of it without breaking anything client-facing.
What are some of the other projects that you are excited to work on going forward?
Named entity resolution and Knowledge Graph
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
BigQuery is a huge asset, UDFs in particular, but they don’t support API calls or Python scripts.
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.
Links
Photonics
Knowledge Graph
Entity Resolution
BigQuery
NLP == Natural Language Processing
dbt
Podcast Episode
Airflow
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA