This week we welcome Jordan Volz, Field Engineering Lead at Continual.ai. Jordan has been working in the data and machine learning space for over a decade at companies like Cloudera and Dataiku. A mathematician by training, Jordan found himself at the forefront of the big data and machine learning revolution and speaks about the waves of adoption in data science and machine learning. In this podcast, Jordan chats about the merits of Python and SQL, the waves of adoption he has seen in the machine learning space, and the direction in which he sees the practice of data science moving. Jordan lives on the front lines, working directly with practitioners to solve the problems of tomorrow.
Introducing Jordan Volz
Sean Anderson 0:34
Hello, hello, and welcome to another episode of the Sources and Destinations Podcast. A summer episode of the Sources and Destinations Podcast. And this week, we’re excited to introduce you to a guest. He’s been a longtime friend of mine and just somebody I’ve really looked up to in the machine learning data science community. So really excited to be picking his brain today. But first, I just want to give everybody a heads up. Just last week, we officially opened up registration for the DataOps Summit here at the end of September. It’s especially relevant today because Jordan, who we’re going to be talking to, is actually going to be speaking at the DataOps Summit. So far, we’ve got about 30 breakout speakers spanning from topics around data engineering, machine learning, data architectures, and dataops. So I’m really excited to continually share the lineup with you guys. And we’re going to be doing a very special extended episode of the Sources and Destinations Podcast, leading up to the DataOps Summit. We’re going to be interviewing some of our past guests and some of the speakers of the show, so I’m really excited for you guys to check out that episode as well.
All right, so let’s get on to our guest today. Today, we’re happy to host Jordan Volz. He’s the Director of Field Engineering at Continual.ai, and he’s been working in the data and machine learning space for over a decade. Jordan is a good friend of mine. In fact, I’ve spoken with him at various Strata and Hadoop World conferences in the past, and I’ve always admired his thoughts on the ecosystem and on everything I was encountering as some of these data platforms really started to explode. So Jordan, thanks for joining us today. And as always, I’m joined by my faithful co-host, Dash Desai. So in order to get to know Jordan a little bit better, I thought it might be prudent to hand it over to Dash and let you kick it off.
Dash Desai 2:33
Thanks, Sean. Thanks, Jordan, for joining us today. So the first time we spoke, you talked a lot about the passion you have for data engineering. Can you share why or how you got into data engineering and things like that with the audience?
Jordan Volz 2:48
Hi, guys, thanks a lot for having me. It’s really exciting to be here and to talk to you guys. In terms of data engineering, in my mind it is just very clearly the number one thing that you have to have in order on the data side of your organization to really accomplish any of your use cases. Early in my career, I was a software engineer working for a company that provided electronic medical record software for its clients. One of the things that I worked on was maintaining an offline database that hospice nurses could use to go visit patients and provide care to them, and then document what had happened and have that flow into the online system. This seemingly very rudimentary workflow was actually a pretty complex technology problem, and it required a lot of data engineering because the online database was a hierarchical database, which I think is super interesting. You could talk about that for an entire podcast in and of itself. And then the offline database was the standard relational model that we use today. So there was a lot of coercing of the data that had to go on between one and the other, either going from the online system to the offline system, or, once the nurse had documented stuff, going back into the online system. So, that was really my first taste of data engineering. And, surprisingly, it was as a software developer, and I think those of us who work in the data industry often forget that developers do a lot of work with data. And they’re sometimes at the forefront fighting a lot of these data problems as well. Like, sure, there are data scientists, data analysts, and other users who need data and get a lot of value out of data. But before any of these roles existed, there were just people building software trying to use data. So it’s always been interesting that I kind of saw that approach as my career grew as well.
Sean Anderson 5:06
And Jordan, speaking of your career, I’d love to hear a little bit more about the pathway to where you are now at Continual.ai. And then I think later in the show, I definitely want to dig into what you guys are doing. Because I think from the data science aspect, it’s really compelling. But, you know, talk us through how you started with hospice data and ended up where you are today.
From Mathematics to Software Engineering
Jordan Volz 5:29
Yeah, sure. I’d love to take you through that. You know, so as I said, I started out of grad school as a software developer. Academically, I’m actually just trained in pure mathematics. And I’ve always thought that’s a really good base for people in technology, because mathematicians essentially just learn to solve problems, like that’s what you do in math, you learn to solve problems. So working in technology, every day is a new problem to solve. And hopefully you have some idea of how to solve it. But a lot of times, you have to figure out the best way to solve this problem and how to tackle this new challenge. So you know, anybody out there who is in the mathematics realm, I always think technology is a very good place for people who are trained in that. So out of school, I started with software development. And I really liked the data aspect of the job the most. I actually almost took a data job with that company, but they really needed a software developer. So I decided to do the software developer role for about two years. From there, I moved into other roles. I worked for a company doing enterprise search technology. I worked for Cloudera, which was the leader of the Hadoop and big data movement. I worked at Dataiku for a few years. Now I’m at Continual.ai doing machine learning work. So, throughout all of it I’ve had a variety of roles. I’ve worked as everything from software developer to data scientist to solution architect to analyst. I kind of feel like I’ve worn most of the hats you can wear in the startup world. Usually, I’m working for software vendors as well. I typically prefer to work at smaller companies, where you actually do have the opportunity to explore and do a lot of different things in your career path. So it’s nice in the sense that your job never gets stale. You’re not doing the same thing every day for every company that you work for. And, you know, how did I end up at Continual.ai? I think that was the original question.
So our CEO, I worked with him at Cloudera. He had been the founder of a company that we acquired, and I had done a lot of work with them, as I was the ML field lead for a while at Cloudera. We were coordinating a lot of, like, how do we go to market with our product? And how do we make customers successful with it? So, you know, he left Cloudera a little after me, decided he was starting up his own company, and got in touch. It sounded like a really good idea, so I decided that I wanted to come join.
Sean Anderson 8:24
That’s awesome, Jordan. That’s actually where I know Jordan from. We first started collaborating at Cloudera. And it was an interesting time because it was that inflection point where the Hadoop ecosystem started to fully embrace things like the ML capabilities of Apache Spark, and later on, Cloudera Data Science Workbench, which I worked closely with Jordan on developing. So, I was happy to see more of that kind of ML focus continue both at Cloudera and outside of those walls. So I just had one quick question for you, Jordan. And you brought this up just now when you were talking. You said you’ve always come from a mathematical, kind of more scientific background. But yet you’ve had these roles in kind of more of the engineering arena. And we’ve even talked about this with previous guests. We were talking to one guest, Lorna, who said that engineers and developers often think very differently than data scientists and kind of more mathematically trained coders. What are your thoughts on that?
Jordan Volz 9:36
That is a really good question. And I think that kind of gets to the heart of what a lot of organizations struggle with, to be honest with you. I think there are kind of two breeds of, quote-unquote, data scientist or machine learning engineer. I think you have people who, like myself, kind of came up through the engineering background. Maybe I’m a little different because I had the academic training on which data science is founded, with all the math and statistics. After that, I worked in engineering for a while. My career path eventually took me into data science. I’m not in college today, but it seems to me some colleges now offer data science programs. Now you can just learn data science itself as your discipline and then go out and get a data science job. My observation about this other group of data scientists is that they definitely lack a lot of the engineering skills that some of the older, more established data scientists have. And I think that can create problems because they are missing the context of how do you build a production system? How do you make good software? How do you make good products? There’s a lot more to data science than just being adept at Python and a Jupyter notebook, which, for junior data scientists, seems to be one of the main ways in which they operate. It’s very interesting. This morning, I was actually talking to a consulting company, and their head of data science basically said he doesn’t recommend that his clients hire data scientists, because data scientists are useless since they don’t know how to do any production work. I feel like that may be kind of an extreme view. But I think it is getting at this dichotomy where you do have people coming out into this data science field who are maybe lacking some of the more traditional tech skills that you would expect graduates to have.
The Machine Learning Engineer
Dash Desai 11:55
Hey, just a quick follow-up question on that. Do you think that’s what gave birth to this new role? Which is a machine learning engineer?
Jordan Volz 12:09
I think that is part of it. Yeah, I do. I haven’t spent much time looking at data science job postings versus machine learning job postings. But I do think that when I see machine learning engineer job postings, they tend to hint at this idea that you actually can do the engineering work as well. Whereas some data science job postings are focused more on Python notebook type use. Depending on who you are and your organization’s needs, your data scientists may be perfectly fine. But I feel that a lot of organizations haven’t understood that this dichotomy exists. You can’t hire a data scientist and expect them to do a machine learning engineer’s work because they just don’t have the background. It’d be like hiring a data analyst to do software engineering, right? They’re just not in that discipline. It’s not the fault of the data scientists. I don’t think they’re doing what they want to do. I think sometimes they’re hired for jobs that they’re not prepared for.
Dash Desai 13:19
So obviously, you’ve worked a lot with machine learning platforms. Can you give us your take on the landscape as it stands today?
Machine Learning: Waves of Adoption
Jordan Volz 13:28
I’d be happy to. I certainly have had a ton of experience across a lot of the different tools, so I can definitely give you my view. And this is something I’ve actually been working on for a blog post lately, but I’ve been very interested in how the machine learning platform market has been evolving over the past decade or so. I can give you a brief synopsis of that. At the beginning, you could look at the machine learning workflow and ask, if I had to do this from scratch, what do I need? You definitely need data for that. You would need code. You’d have to write some code to produce your model, and you would get a model out of it. So you have an artifact like a model, and maybe you consider the predictions the model creates as part of this as well. And you need some kind of environment to run it in. Usually, Python seems to be the predominant language and environment. You need those four things: data, code, model, and environment. If you look at the early platforms, they’re very much focused on the notebook and being a collaborative tool around that. Some of them also kind of have an infrastructure component, where they’re focused on both the code and the environment. But the idea is, data scientists use notebooks. Let’s give them a tool that allows them to easily collaborate and share ideas. So these platforms, to their credit, are very good for development workflows. They’re very generalist, like you can come in, code, and do whatever you want. What they’re not so great at is really getting to production, right? In general, the notebook is a terrible mechanism for doing production machine learning work. A lot of these platforms have really failed to make that step into being a really good operational AI platform. So that’s more or less the first wave. And Sean and I worked at Cloudera, where we both supported the Cloudera Data Science Workbench, and that’s very much in this first wave.
It was very much a collaborative data science tool where everybody would come in and work on notebooks. In the second wave, these tools are more model-based, right? The first wave was a code-based environment; the second wave is a model-based tool. Now you’re kind of moving up, and you’re saying, all right, well, now we can do things like automated machine learning, or auto ML. Or we can do things like machine learning operations, or ML Ops, where the focus is really on the model itself. How do we democratize workflows a little bit to make it easier for data scientists to run experiments and to track experiments? A lot of these solutions are very narrow and very much focused on one specific part of the workflow. Some of them have grown since then, and they’ve tried to expand out and be more of an end-to-end experience. My general categorization of the second wave tools is that they’re usually very good at what they were built to do. But as they start to grow, it’s kind of hard to grow out of that very narrow scope and become a full end-to-end tool that you’d want to really use for production purposes. Companies using the second wave tools also, historically, and this is both from my own use and from talking to hundreds and hundreds of companies who use these tools, struggle to get real production value out of them. So, what is the third wave? The third wave in my mind is really putting data at the center of the machine learning workflow. And this is actually a very new space. There aren’t a lot of people doing it today, in terms of commercialized vendors, but there have been a lot of success stories from fintech companies using this approach to operationalize AI within the organization. The idea is, we now have first wave tools. We have second wave tools. We have ML systems that can do auto ML and ML Ops and can just automate all of that process, so can’t we really just have the user focus on the data?
If you look at machine learning in a super abstract way, and let me go back to my mathematical roots and abstract this out as much as possible, you can say that machine learning is a black box. What goes into it are just inputs, that is, data. And what comes out of it is an output, that is, predictions. So if you have this machine learning box that understands your problem well enough to produce good and robust models, what your data workers really should be working on is making sure you have high-quality inputs into this system. And I think this is really the heart of what a data-first tool is. What’s exciting about it is I can give this platform to a data analyst or data engineer, or a business user. They can work with the system in an entirely, what I would say is, declarative or configuration-based approach, where they’re able to specify the entirety of their ML workflow through configuration. Then the system reads that and performs exactly those actions. I wouldn’t say that this third wave solves 100% of the problems in the ML space. But I always adhere to this idea of a 95/5 split, where 95% of your ML problems are fairly easy, straightforward, and well known. Then the other 5% are super hard, if possible at all. I think that the data-first approach is good for that 95% of problems, where you actually could have people go in and automate all this stuff in this declarative fashion with the correct platform. For that other 5%, you’re probably going to need tools like those first wave notebook tools or the second wave ML Ops type tools because they require a lot of TLC from machine learning engineers or data scientists.
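Editor's note: as a rough illustration of the declarative, configuration-based idea Jordan describes, here is a minimal Python sketch. It is not Continual.ai's actual interface. The `ModelSpec` class, its fields, the `plan` function, and the table name are all hypothetical, chosen only to show how a user might state what they want while a platform decides how to do it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical declarative description of one ML use case."""
    name: str
    source_query: str            # SQL that defines the model's input data
    target_column: str           # column the model should learn to predict
    retrain_schedule: str = "daily"


def plan(spec):
    """Expand a spec into the ordered steps a platform might automate."""
    return [
        f"extract: {spec.source_query}",
        f"train: predict '{spec.target_column}' for model '{spec.name}'",
        f"schedule: retrain {spec.retrain_schedule}",
        "publish: write predictions back to the warehouse",
    ]


# The user only declares the data and the target; no training code is written.
churn = ModelSpec(
    name="customer_churn",
    source_query="SELECT * FROM analytics.customers",  # hypothetical table
    target_column="churned",
    retrain_schedule="weekly",
)

for step in plan(churn):
    print(step)
```

The point of the sketch is the division of labor: the spec contains only data-facing decisions (which query, which target, how often), which is the part an analyst or data engineer is well placed to own.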
Dash Desai 19:54
Got it. Thanks. So you talked about ML Ops. Do you think the focus on ML Ops has been given enough thought?
Jordan Volz 20:02
Yeah. In terms of ML Ops, it’s certainly a very hot market right now. I feel like 9 out of 10 new companies that come out are ML Ops to some degree. And there’s also so much talk online, where you see headlines saying that 80% of data science projects fail, or that one out of 20 models makes it to production. Part of the problem is that software vendors often give companies the tools they want, not the tools they need. In data science, this has resulted in products essentially being made for the data scientist, who, as we talked about earlier, is not really someone who’s well versed in production processes. The data scientists see the product, and then they’ll say, “Oh, I like it, let’s buy it”, and you get that positive feedback. But that is actually not a tool that is very helpful for them in getting to production and getting actual value out to the organization. There’s kind of this weird effect going on in the software vendor market around this. Whereas the question should really be, how do we help organizations do production machine learning? Not, how do we get data scientists to buy our tool? And when you take that approach, I think you start to see that this third wave, the data-first ML platform, is kind of the answer in my mind to how we actually operationalize AI for the enterprise. ML Ops certainly plays a part in that. But I think that there’s a lot more to it. Buying an ML Ops tool will not necessarily solve all your problems in terms of operationalizing your AI stack.
Continual.ai
Sean Anderson 21:59
That’s interesting, Jordan. So maybe that’s a good lead into what you’re doing at Continual.ai. Is Continual.ai really set up to address some of these modern challenges or to help usher in some of these new phases of data science and machine learning?
Jordan Volz 22:15
Yeah, exactly. This is essentially the system we’re trying to build out at Continual.ai. We’re currently marketing ourselves as operational AI for your cloud data warehouse. The operational aspect of that is that we’re very much focused on automation, making your data workers successful, and getting models into production. The data warehouse part is hopefully self-explanatory. We’re primarily focused on assisting customers who have most or all of their data in a cloud data warehouse. There are still companies who are doing a lot of on-prem work, but our focus as an early stage startup is on cloud data warehouses. Our approach is a declarative, data-first approach where you could take a data engineer or a data analyst, and they essentially interact with the system via the SQL language. They can come in and say, here’s a SQL query that represents my data, give it to the system, and ask it to build models with the resulting data. And not only that, but then there’s this continual aspect of it, which is that machine learning isn’t a one-and-done workflow. We have to be able to rebuild models weekly, daily, hourly, whatever the use case calls for, and then put those predictions back into your data warehouse. So you have this closed loop of taking the data out of the data warehouse, building models, and then generating predictions that go back into it. So for data analysts and data engineers, it’s exactly the workflow they usually look for. And then you can plug your reporting tool into your data warehouse, as you hopefully already have, and you can consume those predictions downstream.
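Editor's note: the closed loop described above (query the warehouse, rebuild a model, write predictions back) can be sketched in a few lines of Python. This is a toy under stated assumptions, not Continual.ai's implementation: an in-memory SQLite database stands in for a cloud data warehouse, the "model" is just a per-region average, and the table names, `train`, and `refresh` are hypothetical.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite stands in for a cloud data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "east", 30.0), (3, "west", 50.0)],
)

# The user supplies a SQL query that defines the training data.
TRAINING_QUERY = "SELECT customer_id, region, amount FROM orders"


def train(rows):
    """Toy 'model': mean amount per region, a placeholder for real training."""
    totals = defaultdict(lambda: [0.0, 0])
    for _customer_id, region, amount in rows:
        totals[region][0] += amount
        totals[region][1] += 1
    return {region: total / count for region, (total, count) in totals.items()}


def refresh(conn):
    """One pass of the loop: read data, rebuild the model, write predictions back."""
    rows = conn.execute(TRAINING_QUERY).fetchall()
    model = train(rows)
    conn.execute("DROP TABLE IF EXISTS order_predictions")
    conn.execute(
        "CREATE TABLE order_predictions (customer_id INTEGER, predicted_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO order_predictions VALUES (?, ?)",
        [(customer_id, model[region]) for customer_id, region, _amount in rows],
    )
    conn.commit()


refresh(conn)  # a scheduler would call this weekly, daily, or hourly
print(conn.execute("SELECT * FROM order_predictions ORDER BY customer_id").fetchall())
```

Because the predictions land in an ordinary table, a reporting tool that already reads from the warehouse can consume them with no extra integration, which is the downstream benefit Jordan points to.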
That One Weird Thing
Dash Desai 24:01
Thanks, Jordan. So that brings us to our last segment of the podcast. We call it that one weird thing. So as you work on different projects, data engineering, machine learning and what have you, no matter how big or small, there’s always that one weird thing that you run into. A very simple example is working with different data formats or time zones across different systems. Do you mind sharing something that you’ve come across? Maybe a new thing that’s kind of weird that you want to share tips and tricks around?
Jordan Volz 24:33
One thing I can speak to that I think is interesting is, I would classify it as kind of a SQL-versus-Python-ism. I feel like online, there’s a lot of talk about how SQL is better than Python. I usually side with Occam’s razor, like the simplest solution is the best. I think SQL is simpler for more users. So if you can do something in SQL, you probably should, and you can use Python when the use case calls for it. Obviously, don’t write thousands of lines of SQL for something you could do in a couple of lines of Python. People who may be maintaining whatever you do in the future are probably more likely to be able to read SQL. But Python is sometimes very nice for doing things that SQL can’t do. And one thing that I feel like I’m constantly upset about in SQL is that most data warehouses don’t have a really elegant way for you to do a select star and also, in the process, do a simple replacement of one of the columns. I think for Python users, this is super straightforward and something you do all the time, which is you load something into a pandas data frame and then you just do a simple operation on a column. For example, maybe you have a timestamp in a string format and you want to convert it to a timestamp format in pandas. You can do a conversion just directly on the column. And that’s like one line of code in pandas that’s super easy. For a lot of SQL vendors, this isn’t actually very easy, and you have to do way more work than you’d like. I’d like this to be a one-line thing where you just say select star and replace this column with something else. And there actually is one vendor who does this, and it’s BigQuery. BigQuery does have a select star replace syntax, which I only learned about recently, actually. But with a lot of the other vendors, there’s not a really elegant way, and instead of a select star you end up having to write out all the column names and then modify the column you want to modify. That’s fine if you have five columns, but when you have like 500 columns, it’s now a really big annoyance.
So, I think the select star replacement is something that I always find to be super weird. And I guess it’s kind of a data science specific thing that hasn’t come up a lot in engineering workflows, which is why more vendors don’t have out-of-the-box support for it. But I would definitely love it if more people got on board with the select star replace. I think that’d be super useful.
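Editor's note: for readers who have not seen the two idioms Jordan contrasts, the pandas version is typically a single assignment such as `df["created_at"] = pd.to_datetime(df["created_at"])`, and BigQuery's form is `SELECT * REPLACE (expression AS column_name) FROM table`. On engines without that syntax, one possible workaround is to generate the explicit column list from table metadata. The sketch below does this with Python's built-in `sqlite3` module as a stand-in for a warehouse; the `select_star_replace` helper and the `events` table are hypothetical, and the helper trusts its table and column names rather than escaping them.

```python
import sqlite3


def select_star_replace(conn, table, replacements):
    """Build a SELECT listing every column, with some swapped for expressions.

    `replacements` maps a column name to the SQL expression that should take
    its place. Names are trusted here and not escaped (illustration only).
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    select_list = [
        f"{replacements[col]} AS {col}" if col in replacements else col
        for col in columns
    ]
    return f"SELECT {', '.join(select_list)} FROM {table}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT, payload TEXT)")
conn.execute("INSERT INTO events VALUES (1, '2021-07-01 12:30:00', 'hello')")

# Replace only created_at; every other column passes through untouched.
query = select_star_replace(conn, "events", {"created_at": "date(created_at)"})
print(query)
print(conn.execute(query).fetchall())
```

The generated query grows with the table, but nobody has to type or maintain the 500 column names by hand, which addresses the annoyance Jordan describes for engines that lack a native replace clause.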
Find Out More
Sean Anderson 27:26
Thanks for that, Jordan. So just real quick, can you give everybody some information on how they can find out more about you and Continual.ai right now?
Jordan Volz 27:35
Yeah, sure. Our website is Continual.ai. Come to our website and join us. We’re currently in closed beta. So if you’re listening to this in the summer, you can fill out a form and we’ll contact you and try to work out access to the platform. We will be opening up self-service in the future. So if you’re listening to this in like a year, and it’s 2022, we’re very likely self-service, and you can just go to our website, sign up, and start using the platform. But yeah, check out our website, see what we’re all about. And as Sean says, I’ll be speaking at the upcoming conference (DataOps Summit 2021). I’ll be talking about operational AI, going through why a lot of previous approaches to operational AI have not succeeded, and trying to spell out a general framework for how you’d want to build an operational AI system today.
Sean Anderson 28:28
Jordan, so that’s a perfect segue into my closing thoughts here. Thanks, everybody, for joining us. I want to thank our guest today, Jordan Volz, and also my co-host Dash. And as Jordan mentioned, if you’d like to join all of us at the DataOps Summit, it’s completely free to register and be part of this. We’ve got a really exciting lineup from keynotes to breakout speakers, and some really nice technical deep dives into various data engineering patterns. To sign up for that, go to dataopssummit-sf.com. You can listen to all of the episodes at www.streamsets.com/community. We look forward to talking more about data engineering & machine learning with you guys. So tune in. Until next time.
The post Sources and Destinations Podcast Episode #7 Jordan Volz appeared first on StreamSets.