Our analogy is like when you have one machine learning algorithm: the machine learning algorithm is training, training, and training. In the same way, we think that together we are going to learn, learn, and learn across the different areas. It is really interesting because the data is the common factor. Whether you are in engineering, in data science, a machine learning engineer, or a data analyst, at the end of the day, all together, we are working with data.
This week we welcome Joseph Arriola. He is the Data Engineering and AI lead at Yalo Chat, and he is also the host of the Big Data Guatemala meet-up group. Joseph is passionate about technology related to the power of data. His experience as an engineer started in development in different languages; he later found interest in how technology meets business through data, so he migrated to the world of analytics.
Joseph chats with us about chatbots, TensorFlow models, and the handoff between his engineering team and the data science team. Follow us for more great data engineering conversations.
Intro
Sean Anderson 0:43
Hello, and welcome to another episode of the Sources and Destinations podcast, where we talk about everything data and data engineering. This week, we’re triangulating the globe with our guest from Guatemala, as well as myself here from Austin, Texas, and, as always, my co-host Dash Desai coming to us from California. If you’re not familiar with the podcast and want to catch up on what we’re doing, you can access it through any of the popular podcast platforms. We’re on SoundCloud, Spotify, Apple Podcasts, as well as many other platforms. So we invite you to check out some of our previous episodes and get familiar with the Sources and Destinations podcast. But for this week, I’m gonna hand it over to Dash to tell us a little bit more about our guest.
Dash Desai 1:33
Thanks, Sean. So this week, we’re super excited to welcome Joseph Arriola. He’s the data engineering and AI lead at Yalo Chat, and he’s also the host of the Big Data Guatemala meet-up group. Joseph is passionate about technology related to the power of data. His experience as an engineer started with developing in different coding languages. He later found interest in how technology meets business, so he migrated to the world of analytics. Hey, Joseph, thanks for joining us today and welcome to the Sources and Destinations podcast. So to kick it off, can you please share one of your favorite data engineering projects that you’re currently working on?
Joseph Arriola 2:20
Hi, Dash. Hi, Sean. Thank you for this great opportunity to have a conversation about the awesome world of data. And yeah, I would like to share that my favorite project right now is Yalo. Yalo is the third company that I have worked with, and I have been there for two years now. It is really interesting for me, because we are building different data pipelines to integrate the information, both structured and unstructured data. I am also doing some consulting with customers around Central America. Those projects all revolve around data, like migrating technology from on-premises to the cloud. And as you know, in that kind of migration, the data pipelines should be the best way to integrate both worlds, on-premises and cloud.
Building Chatbots
Sean Anderson 3:19
That’s great, Joseph, and we’re really interested in the work you’ve created over there at Yalo Chat. For those of our listeners who are not intimately familiar with chatbots, how would you explain chatbot technology and the type of data that you collect? And how do you implement data for a chatbot solution?
Joseph Arriola 3:40
Thanks, Sean. Well, at Yalo, it is a fact that we are creating chatbots. But at the end of the day, we are building a conversational relationship manager. Our goal is to create great relationships between the users and the companies. So we create these different chatbots using artificial intelligence to understand what the users say and what the intent is, and then we build different integrations between the systems to create the connections and really engage the users with the companies.
Dash Desai 4:26
That’s awesome. So can you talk about some of the challenges that you’re running into on these projects?
Joseph Arriola 4:33
I think when you are in the world of data, one of the biggest challenges is how you can do the deep integration between the different data sources, so I think this is one of the most important challenges. The second one is how you can make the time to develop and deploy the different data pipelines efficient. Once you have these data pipelines, the next thing is how you handle the operations and the code maintenance. I think this is the really important thing here, because at the end of the day, you will have a lot of data pipelines appearing, and you will have to operate all of them. So I think it’s really important to look for a good strategy, and also a good platform, to maintain all these different data pipelines.
Sources and Destinations
Sean Anderson 5:33
So Joseph, I think that brings up a good point. I understand that the variety of sources presents its own challenges, and then you also have to operationalize it all. So what are some of the tools that you look at in terms of tackling those various forms of sources, as well as operationalizing and maintaining the pipelines once you deploy them?
Joseph Arriola 5:55
Well, Sean, it’s really interesting, because I will say that the first tool that we use is StreamSets Data Collector. We are using the open source version and we are really happy with it, because Data Collector has different impressive features. For example, you can see some metrics around the data pipelines, which really helps us. We are using other tools like Python, and we are also using a cloud provider, GCP, Google Cloud Platform. At the end of the day, our main tool is StreamSets, to integrate the different data sources that we have. I think another important thing for us is using a cloud provider, or a cloud strategy, because as you know, in this new age of data, you have to have a good way, a good strategy, to scale, and to do the initial deployment of the different tools or virtual machines, for example.
Sean Anderson 7:16
So you just said something that brings up a really interesting question. When you’re developing a pipeline and thinking about the logic of how the data should flow through that pipeline, are you always aware of what the destination is going to be? Or even the source? Or are there times when you’re building the pipeline and you don’t necessarily know what the source data is going to look like?
Joseph Arriola 7:37
Well, to be honest, when we started with StreamSets, we started to be more efficient. We started to do more design of the data architecture. I mean, right now, we don’t worry about the destination or the source, because when you are using Data Collector, you have a big catalog of different data sources and different destinations. In our case, it has a lot of connectors for the principal services that GCP has. So I think the best part of using StreamSets, in this case, is that we now have more time to do the design instead of hand coding. I think this is really important for us, because we spend our time more efficiently on design.
Data Engineering in the Cloud
Dash Desai 8:36
So you mentioned GCP. Are there any other cloud platforms that you guys are using as well?
Joseph Arriola 8:40
Right now? No, I will say. But in the past, we had AWS and also GCP. We did this migration maybe two years ago, I think. This is really funny, because for the different topics related to data, we used StreamSets to do this migration, because we had some data, some large archived logs, in S3. We wanted to do it in the best way, and we saw a good opportunity to use StreamSets to migrate some of the data that we had in different places in AWS. So I like this integration between cloud providers. For example, you can have information in S3, but you may have to move this information from S3 to BigQuery. You can connect the different services in the different clouds using StreamSets. This is really interesting because StreamSets is agnostic of the cloud provider, so you can implement and do this integration across the different clouds.
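For listeners who want to picture what that S3-to-BigQuery hop looks like when it is hand coded, here is a minimal sketch assuming the boto3 and google-cloud-bigquery client libraries; the bucket, file, and table names are hypothetical, not Yalo’s actual setup. In a StreamSets pipeline, an Amazon S3 origin and a BigQuery destination configured in the UI would replace all of this.

```python
# Hypothetical bucket, object key, dataset, and table names throughout.
import boto3
from google.cloud import bigquery

s3 = boto3.client("s3")
bq = bigquery.Client()

# Pull one archived log file down from S3.
s3.download_file("archived-logs", "2019/03/events.json", "/tmp/events.json")

# Load the newline-delimited JSON file into a BigQuery table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema
)
with open("/tmp/events.json", "rb") as f:
    load_job = bq.load_table_from_file(f, "my_dataset.events", job_config=job_config)
load_job.result()  # block until the load job completes
```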
Sean Anderson 10:05
So, Joseph, you know, it sounds like cloud is a pretty integral part of your strategy. Have you guys always been cloud? Or did you start off on-premises and then migrate to the cloud? And is Yalo Chat predominantly on the cloud in terms of both its data and its applications?
Joseph Arriola 10:25
Well, we started in the cloud, so for us, cloud is the strategy. We started as a startup, and we are growing and growing. I mean, we have our headquarters in Mexico. Our strategy has always been cloud, because we saw from the beginning that the strategy is to scale, and we like being dynamic in a way that lets us test other technologies faster. So we saw this opportunity in the cloud, and yeah, we started in the cloud.
Dash Desai 11:07
Sounds good. Can you please share some of the reasons why you picked some of these technologies?
Joseph Arriola 11:14
Yeah, let me start with StreamSets, which we use for data integration because it allows us to be fast and efficient. We are using Airflow; well, the managed Airflow in GCP is called Composer. We are using Cloud Composer to do the orchestration of the different data pipelines. We are using BigQuery because we wanted a modern data warehouse, and it doesn’t require much maintenance and operation. So as you can see, at the end of the day, our strategy is thinking about how we can save time in maintenance and operation. And of course, less code, because we believe the talent of the people should be invested in the design and in creating architectures.
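As an illustration of that orchestration layer, here is a minimal sketch of a Cloud Composer (Airflow) DAG that schedules a BigQuery job, assuming the Google provider package for Airflow is installed; the DAG name, SQL, and table names are hypothetical, not Yalo’s actual jobs.

```python
# DAG name, SQL, project, dataset, and table names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="daily_warehouse_refresh",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a scheduled transformation inside BigQuery and overwrite a summary table.
    refresh_summary = BigQueryInsertJobOperator(
        task_id="refresh_summary_table",
        configuration={
            "query": {
                "query": "SELECT user_id, COUNT(*) AS messages "
                         "FROM `my_dataset.raw_events` GROUP BY user_id",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "my_dataset",
                    "tableId": "events_summary",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )
```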
Sean Anderson 12:20
So that’s great. I wanted to double-click on something you said. You mentioned hand coding. We’ve kind of found out through these interviews that hand coding can often be part and parcel of what needs to be done. How reliant is your team on hand coding, and to what degree are you guys trying to supplement that with other tools and expand it to people who don’t have that hand-coding expertise?
Joseph Arriola 12:46
Well, it’s interesting, Sean, because, as I said, our strategy is to invest the time in designing things. I mean, designing the architecture and designing the different components in the cloud. And for us, the important thing to have for hand coding, in our case, is SQL, because we are using BigQuery, and with BigQuery you will use SQL. Another important tool for us is Python, but we only use it in some cases. Again, for us, it’s more important to know SQL, because we can do different procedures in BigQuery. We are looking for tools that improve our time, and for us it is really important to use StreamSets, because we save a lot of time creating the different data pipelines. If, for example, you want to push some messages to Kafka, maybe you are going to write, I don’t know, 10 or 20 lines of code. But if you are using StreamSets, maybe you only have to drag and drop the source and the destination, and that’s it, you can start to push messages to Kafka. So for us, it’s more important to use these kinds of tools to save time and make the time of the different engineers more efficient for creating data architectures.
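To make that comparison concrete, here is roughly what those 10 or 20 hand-written lines can look like, sketched with the confluent-kafka Python client; the broker address, topic name, and payload below are hypothetical, not Yalo’s actual setup.

```python
# Broker address, topic, and event payload are hypothetical examples.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

event = {"user_id": 42, "intent": "order_status"}
producer.produce(
    "conversation-events",
    value=json.dumps(event).encode("utf-8"),
    callback=delivery_report,
)
producer.flush()  # block until outstanding messages are delivered
```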
Dynamics with Data Science
Sean Anderson 14:35
You see, people, it’s all about the sources and destinations; Joseph’s here confirming it. So Joseph, I wanted to kind of switch gears a little bit, and you’ve already touched on it. It seems like you guys are focused on data pipelines and data engineering, and you’re also maintaining the architectures, in some cases building those architectures in the cloud. What does the team look like? What are the team dynamics? Do you guys pair up with the data science and analytics teams? You said that you’re using NLP, so how do you interact with the groups that are doing the more data science work?
Joseph Arriola 15:08
Well, it’s a good question, because we are looking to integrate the different areas that we already have. So, the work that the data engineering team does is to integrate the data and then prepare the data. Then we look for a strategy, a relationship, with the data science team. What we think, again, our philosophy, is that we all learn from everybody, and each day we are learning more and more and more. Our analogy is like when you have a machine learning algorithm: the machine learning algorithm is training, training, training. In the same way, we think that together we are going to learn and learn and learn across the different areas. It’s really interesting, because as you can see, the data is the common factor. Whether you are in engineering, in data science, a machine learning engineer, or a data analyst, at the end of the day, all together, we are working with the data, of course with a different focus, with a different perspective. But at the end of the day, I think engineering is like the core in the middle of the different services. So, yep.
Dash Desai 16:33
Oh, one of my favorite topics is data science and machine learning. So Joseph, did you know that you can load TensorFlow models in Data Collector?
Joseph Arriola 16:45
Well, Dash, to be honest, I saw it, and I also did some tests with TensorFlow models. We are looking forward to using it in some use cases; we already have some things in the lab. I mean, we don’t have anything in production right now using StreamSets with TensorFlow models. But in theory, and from the things that I tried, I really love it, because it could be a really easy way to implement. I mean, this is a good strategy: a machine learning engineer can build a model in TensorFlow, and then you can use it in StreamSets. I think it is really interesting that there is this collaboration between the machine learning engineers and the data engineers. And you can use it in real time, because any time you can see the data in a streaming way, you can send it to the machine learning model. I think it’s a really interesting integration; it’s a really good feature that StreamSets has.
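As a sketch of the handoff Joseph describes, a machine learning engineer might export a trained model in TensorFlow’s SavedModel format, which is what Data Collector’s TensorFlow Evaluator loads; the toy model and export path below are hypothetical.

```python
# The model here is a hypothetical toy; the export path is also made up.
import tensorflow as tf

# Stand-in for whatever model the ML engineer actually trained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Export as a SavedModel directory; a data engineer can then point the
# pipeline's TensorFlow stage at this path.
tf.saved_model.save(model, "/models/intent_classifier/1")
```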
Machine Learning
Sean Anderson 18:00
Yeah, I think that’s a really interesting topic. I love how you talked about the way that the machine learning teams are working with the data engineering teams. It reminds me of a quote from Josh Wills. He said, “If you want to get a data engineer to do an ETL job, just tell them it’s a machine learning model.” How much do those worlds collide? How many of your data engineers are actually getting their hands dirty with the machine learning technology? I mean, I imagine they have to at least understand the machine learning requirements in order to work with the team.
Joseph Arriola 18:30
Well, we started the strategy by creating the data warehouse, this modern data warehouse. Right now we are moving on to the strategy for machine learning. So we are looking to integrate the different visions, because the new Vice President that we have knows a lot about data practices. He is really passionate about data, and he is looking for a new strategy. We have started to create a strategy for machine learning. Right now the data engineers know about the end-to-end process of machine learning, but we are looking to grow this new machine learning area, and our engineers with it.
That One Weird Thing
Dash Desai 19:27
That’s awesome. Thanks for sharing all these great tidbits. Now to one of my favorite segments of the podcast. I’m calling it That One Weird Thing. So as you work on data engineering projects, no matter how big or small, you always run into that one weird thing, right? A great example is dates and time zones. Another great example is SQL Boolean expressions that involve NULL. Do you have anything like that you want to share with our audience today?
Joseph Arriola 19:59
Well, it’s an interesting segment of the podcast, Dash. I think when you are in the world of data, of course, you are going to see some weird things working with the data. For me, I think it was when I started to get data from API services. Most of the time, the data comes in JSON format. So imagine when you want to parse this kind of data. When it comes in JSON format, sometimes it is really difficult, or it takes a lot of lines of code to do it. Sometimes, just because it’s JSON, it’s really difficult to parse this kind of data. So with StreamSets, with Data Collector, I found an easy way to just use some transform logic, and you can do this on flat files or on JSON files.
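For comparison, here is the kind of hand-coded JSON flattening Joseph is describing, sketched in Python with pandas; the endpoint URL and field names below are made up for illustration.

```python
# The endpoint URL and field names below are hypothetical.
import pandas as pd
import requests

response = requests.get("https://api.example.com/v1/conversations")
payload = response.json()

# Flatten nested records such as {"user": {"id": 1, "name": "Ana"}, ...}
# into tabular columns like "user.id" and "user.name".
df = pd.json_normalize(payload["results"], sep=".")
print(df.head())
```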
Dash Desai 21:30
Yep. It’s one of those things. Thanks for sharing that, Joseph.
Find out More
Sean Anderson 21:35
So Joseph, I think we’re rounding up on our time here. Before we leave, we understand that you’re also active in the community and in the ecosystem. You run the Big Data Guatemala meet-up group; can you tell people where they can find out a little bit more about Big Data Guatemala or Yalo Chat?
Joseph Arriola 21:55
Yeah. Thank you, Sean. Well, if you want to talk more about data, you can find me on LinkedIn as Joseph Arriola, or you can find the Big Data Guatemala community online, where I am the meet-up organizer. If you want to find out more about Yalo Chat, you can visit our webpage, www.yalochat.com, where you can see different articles and different things that we share with the community.
Sean Anderson 22:24
Sounds good, Joseph. I want to thank you. You’ve been a longtime proponent of us here at StreamSets, and we’ve always really enjoyed your conversation and your activities in the open source community, as well as your collaboration with other ecosystems. So we’re glad you’re around. We’re glad that you have developed such a data-centric team at Yalo Chat, and we hope that people take the time to learn about what you guys are doing at Yalo and in Guatemala in general. So Joseph, thank you so much for being our guest today. As always, I want to thank my co-host Dash for getting into that one weird thing, and we will be back with you in just two weeks for another episode of the Sources and Destinations podcast.