Every organization has its own unique data culture and capabilities. Yet each is expected to use technology trends and solutions in the same way as everyone else. Your organization may be built on years of legacy applications, you may have developed a considerable amount of expertise and knowledge, yet you may be asked to adopt a new approach based on a technology trend. On the other hand, you may be on the other side of the spectrum, a digitally native organization built with engineering principles from scratch without legacy systems but expected to follow the same principles as process driven, established organizations. The question is, should we treat these organizations in the same way when it comes to data processing? In this series of blogs and papers this is what we are exploring: how to set up an organization from the first principles from data analyst, data engineering and data science point of view. In reality, there is no such organization that is solely driven by one of these but it is likely to be a combination of multiple types. What type of organization you become is then driven by how much you are influenced by each of these principles.
When you are considering what data processing technology encompasses, take a step back and make a strategic decision based on your key goals. This can be whether you optimize for performance, cost, reduction in operational overhead, increase in operational excellence, integration of new analytical and machine learning approaches. Or perhaps you’re looking to leverage existing employees’ skills while meeting all your data governance and regulatory requirements. We will be exploring these different themes and will focus on how they guide your decision-making process. You may be coming from technologies which are solving some of the past problems and some of the terminologies may be more familiar, however they don’t scale your capabilities. There is also the opportunity cost of prioritizing legacy and new issues that arise from a transformation effort, and as a result your new initiative can set you further behind on your core business while you play catch up to an ever changing technology landscape.
Data value chain
The key for any ingestion and transformation tool is to extract data from a source and start acting on it. The ultimate goal is to reduce the complexity and increase the timeliness of the data. Without data, it is impossible to create a data driven organization and act on the insights. As a result, data needs to be transformed, enriched, joined with other data sources, and aggregated to make better decisions. In other words, insights on good timely data mean good decisions.
While deciding on the data ingestion pipeline, one of the best approaches is to look into the volume of data, the velocity of the data, and type of data that is arriving. Other considerations include the number of different data sources you are managing, whether you need to scale to thousands of sources using generic pipelines, whether you want to create one generic pipeline but then apply data quality rules and governance. ETL tools are ideal for this use case as generic pipelines can be written and then parameterized.
On the other hand, consider the data source. Can the data be directly ingested without transforming and formatting the data? If the data does not need to be transformed and can be ingested directly into the data warehouse as a managed solution. This not only reduces the operational costs but also allows for more timely data delivery. If the data is coming in through an unstructured format such as XML or in a format such as EBCDIC and needs to be transformed and formatted, then a tool with ETL Capabilities can be used depending on the speed of the data arrival.
It is also important to understand the speed and time of arrival of the data. Think about your SLAs and time durations/windows that are relevant for your data ingestion plans. This would not only drive the ingestion profiles but would also dictate which framework to use. As discussed above, velocity requirements would drive the decision-making process.
Type of Organization
Different organizations can be successful by employing different strategies based on the talent that they have. Just like in sports, each team plays with a different strategy with the ultimate goal of winning.
Organizations often need to decide on what’s the best strategy to take in respect to data ingestion and processing – whether you need to hire an expensive group of data engineers, or exploit your data wizards and analysts to enrich and transform data that can be acted on, or whether it would be more realistic to train the current workforce to do more functional/high value work rather than to focus on building generally understood and available foundational pieces.
On the other hand, the transformation part of ETL pipelines as we know it, dictates where the load will be. All of these are made a reality in the cloud native world where data can be enriched, aggregated, and joined. Loading data into a powerful and modern data warehouse means that you can already join and enrich the data using ELT. Consequently, ETL isn’t really needed in its strict terms anymore if the data can be loaded directly into the data warehouse.
All of the above was not possible in the traditional, siloed, and static data warehouses and data ecosystems whereby systems would not talk to each other or there were capacity constraints in respect to both storing and processing the data in the expensive Data Warehouse. This is no longer the case in the BigQuery world as storage is now cheap and transformations are now much more capable without constraints of virtual appliances.
If your organization is already heavily invested into an ETL tool, one option is to use them to load BigQuery and transform the data initially within the ETL tool. Once the as-is and to-be are verified to be matching, then with the improved knowledge and expertise one can start moving workloads into BigQuery SQL, and effectively do ELT.
Furthermore, if your organization is coming from a more traditional data warehouse that extensively relies on stored procedures and scripting, then the question that one may ask is, do I continue leveraging these skills and expertise and use these capabilities that are also provided in BigQuery? ELT with BigQuery is more natural, similar to what’s already in Teradata BTEQ, Oracle PL/SQL but migrating from ETL to ELT requires changes. This change then enables exploiting streaming use cases, such as real-time use cases in retail. This is because there is no preceding step before data is loaded and made available.
Organizations can be broadly classified under 3 types as Data Analyst Driven, Data Engineering driven, and Blended organization. We will be covering a Data Science driven organization within the Blended category.
Data Analyst Driven
Analysts understand the business and are used to using SQL/spreadsheets. Allowing them to do advanced analytics through interfaces that they are accustomed to enables scaling. As a result, easy to use ETL tooling to bring data quickly into the target system becomes a key driver. Ingesting data directly from a source or staging area then also becomes critical as it allows analysts to exploit their key skills using ELT and increases timeliness of the data. This is commonplace with traditional EDWs and realized by extended capabilities of using Stored Procedures and Scripting. Data is enriched, transformed, and cleansed using SQL and ETL tools act as the orchestration tools.
The capabilities brought by cloud computing on separation of data and computation changes the face of the EDW as well. Rather than creating complex ingestion pipelines, the role of the ingestion becomes, bringing data close to the cloud, staging on a storage bucket or on a messaging system before being ingested into the cloud EDW. This then releases data analysts to focus on looking into data insights using tools and interfaces that they are accustomed to.
Data Engineering / Data Science Driven
Building complex data engineering pipelines is expensive but enables increased capabilities. This allows creating repeatable processes and scaling the number of sources. Once complemented with cloud it enables agile data processing methodologies. On the other hand, data science organizations allow carrying out experiments and producing applications that work for specific use cases but are not often productionised or generalized.
Real-time analytics enables immediate responses and there are specific use cases where low latency anomaly detection applications are required to run. In other words, business requirements would be such that it has to be acted upon as the data arrives on the fly. Processing this type of data or application requires transformation done outside of the target.
All the above usually requires custom applications or state-of-the-art tooling which is achieved by organizations that excel with their engineering capabilities. In reality, there are very few organizations that can be truly engineering organizations. Many fall into what we call here as the blended organization.
Blended org
The above classification can be used on tool selection for each project. For example, rather than choosing a single tool, choose the right tool for the right workload, because this would reduce operational cost, license cost and use the best of the tools available. Let the deciding factor be driven by business requirements: each business unit or team would know the applications they need to connect with to get valuable business insights. This coupled with the data maturity of the organization would be the key to making sure the right data processing tool would be the right fit.
In reality, you are likely to be somewhere on a spectrum. Digital native organizations are likely to be closer to being engineering driven, due to their culture and business that they are in. However, brick and mortar organizations would be closer to being analyst driven due to the significant number of legacy systems and processes they possess. These organizations are either considering or working toward digital transformation with an aspiration of having a data engineering / software engineering culture like Google.
The blended organization with strong skills around data engineering, would have built the platform and built frameworks, to increase reusable patterns would increase productivity and then reduce costs. Data engineers focus on running Spark on Kubernetes whereas infrastructure engineers focus on container work. This in turn provides unparalleled capabilities as application developers focus on the data pipelines and even the underlying technologies or platforms changes code stays the same. As a result, security issues, latency requirements, cost demands and portability are addressed at multiple layers.
Conclusion – What type of organization are you?
Often an organization’s infrastructure is not flexible enough to react to a fast changing technological landscape. Whether you are part of an organization which is engineering driven or analyst driven, organizations frequently look at technical requirements that inform which architecture to implement. But a key, and frequently overlooked, component needed to truly become a data-driven organization is the impact of the architecture on your data users. When you take into account the responsibilities, skill sets, and trust of your data users, you can create the right data platform to meet the needs of your IT department as well as your business.
To become a truly data-driven organization, the first step is to design and implement an analytics data platform that meets your technical and business needs. The reality is that each organization is different and has a different culture, different skills, and capabilities. Key is to leverage its strengths to stay competitive while adopting new technologies when it is needed and as it fits to your organization.
To learn more about the elements of how to build an analytics data platform depending on the organization you are, read our paper here.
Cloud BlogRead More