Episode 7

Contracts and Code: The Realities of AI Development

About This Episode

In this episode, Valentino Stoll and Joe Leo unpack the widening gap between headline-grabbing AI salaries and the day-to-day realities of building sustainable AI products. From sports-style contracts stuffed with equity to the true cost of running large models, they explore why incremental gains often matter more than hype. The conversation dives into the messy art of benchmarking LLMs, the fresh evaluation tools emerging in the Ruby ecosystem, and new OpenAI features that change how prompts, tools, and reasoning tokens are handled. Along the way, they weigh the business math of switching models, debate standardisation versus playful experimentation in Ruby, and highlight frameworks like RubyLLM, Phoenix, and Leva that are reshaping how developers ship AI features.

Takeaways

The importance of marketing oneself in the tech industry.

Disparity in AI salaries reflects market demand and hype.

AI contracts often include equity, complicating true value assessment.

Full Transcript

Valentino Stoll (00:01)
Hey everybody, welcome to another episode of the Ruby AI podcast. I'm your host Valentino Stoll and joined by co-host Joe Leo.

Joe Leo (00:09)
Hi, I'm Joe Leo. I have one question for my co-host. Whose contract is larger? Juan Soto's contract with the New York Mets or your contract to be an AI engineer at Gusto?

Valentino Stoll (00:25)
You know, I wish I could be categorized with all these AI engineers. Unfortunately, my frame of reference is not deep enough in the ML aspect of things to qualify, I think, for $100 million. Yeah, I'm pretty disappointed.

Joe Leo (00:41)
That is a shame. Yeah, that seems like a matter of positioning, you know, where you kind of...

they teach you this when you're going out to the market with a product. It's like, OK, well, if people view you as this, then maybe the product's only worth like 10 bucks. But if they view you as this, you get a 200 million dollar signing bonus. That's all marketing.

Valentino Stoll (01:00)
It's all marketing. Like, it should just be taught in

all high schools, right? Part of the curriculum. How to market yourself.

Joe Leo (01:06)
Yeah. Yeah, exactly. I've... yeah, marketing. God, that's horrible.

I was thinking... I came across this thing. It's called the Mettis list, and it's, you know, a ranking of the world's top AI researchers. I'll put it here in the show notes. And so it's just, you know, this ranked list, and you could see, like,

you know, Ilya there at the top, but then there's also all these names. So I, you know, I put my email in, I don't know what they're going to send me, but I can now see the top 100 AI engineers ranked. And, you know, it's like trading cards, you can try and catch them all.

Valentino Stoll (01:50)
You

So what are you gathering up from all this? Just, like, how many hundreds of billions of dollars are in this list?

Joe Leo (02:02)
Yeah,

you know, I think it's interesting to note the disparity. And I'm not begrudging anybody on this list, I want to be clear about that. You know, we have talked on the show before, I think, about how lucky we are, people of a certain age who, you know, came to software development from a place of like, hey, this is interesting and I could

probably make a decent career. And then all of a sudden, a few years in, software development became the hottest thing, and all of the salaries, you know, increased, and all of the demand for us increased through, you know, no real work on my own part. The industry changed. Now nobody ever offered me 100 million dollars, but, you know, I was able to live a nice life. And so I think that's great. Now I'm seeing this kind of turning.

And you know, there's obviously hype in this, right? Nobody is worth 100 million dollars. It's a matter of, you know, scarcity and demand on the parts of these companies that have hundreds of billions of dollars at stake.

Valentino Stoll (03:16)
Yeah, you know, when I first saw these deals come out, like Facebook or whatever, you've got to think, all these deals have to be tied up in equity, just like they would be if you were joining a startup, right? Like, are these deals really that incredible from that perspective, right? Or is it really just the same deal with just more stock? You know.

Joe Leo (03:29)
I'd imagine. Yeah, yeah.

It's a, that's a great question. And that also

is a good alignment with sports, because sports deals are always, it's like, it's all about the number, right? And then when we dig in, we're like, well, this is really just a $20 million deal with a bunch of voidable years at the end of it. And yeah, I think probably the same thing is true here. It's like, yeah, you'll get a couple hundred million dollars if Meta wins the race, which nobody actually knows what that looks like. Nobody I've talked to. Maybe I should talk to these hundred people.

Valentino Stoll (03:49)
Right.

Joe Leo (04:10)
They're the smartest people ever. But nobody knows what it looks like to win this AI race or why. Nobody's been able to satisfactorily answer me why it matters that one company does win. Because so far with the advent of LLMs, we've seen either incremental or tremendous improvements. But after each improvement, there's always a short period of time when the next model by the next

company catches up. It doesn't seem like there is a huge gap or a huge way that somebody can build a moat and become the LLM that the human race uses.

Valentino Stoll (04:51)
Yeah, you know, I feel like that was recently kind of proven with the DeepSeek releases, right? Like everyone's just like, well, DeepSeek now is just as good as OpenAI, everybody's going to use it. And it's just not the case, right? And not that DeepSeek isn't useful. DeepSeek has great features, and there are tasks that are really well performed by it.

Joe Leo (04:59)
Yeah, yeah.

Yeah, well that's true. That's true, that's not the case.

Valentino Stoll (05:20)
But yeah, I feel like, you know, it's all about the vendor lock-in, right? And once you get people using it, it necessitates them continuing to use it.

Joe Leo (05:27)
Mm-hmm. Yeah.

Yeah, I think that's a good point. And we were talking about this right before the show started, that what OpenAI built is a consumer tech product, and the usage, or sorry, the efficacy of the LLM behind it really is secondary to the experience that people have using it as a chatbot. And so

that DeepSeek example is a perfect example. You and I might go out and use the best model for whatever job is at hand, and we might evaluate the different models, and we might even find a way to objectively determine which one is best. But most people are just reaching for OpenAI, because that was the early leader, and everybody went out, myself included, and got themselves an OpenAI subscription so they could use ChatGPT.

Valentino Stoll (06:32)
Yeah, so I'm curious, you know, we've talked before about, okay, well, how do we test out all of our processes on different kinds of models, or, like, watch what might happen. Like, how are you doing that with Phoenix today?

Joe Leo (06:44)
Yeah.

Yeah, it's a really good question. The first thing we did, and this is now going back at least eight or nine months, and, you know, the whole history of Phoenix is just one year. It was called the September experiment when we started, so we're coming up on one year. We built a test harness that would run test generation against a number of different repositories. And we had

a couple of open source repos, we had a couple of Def Method repos that we threw in there, and I think we had a customer repo that consented to have us do it. And then we could run it against all of these different models. That still exists, but we've found it lacking in the kind of efficacy that we needed, because there is so much nuance to the test generation process. And so it's great to say,

okay, this works, this one is the best model for generating specs for a code base. But, you know, we just got off a call with a prospect, and they've got one to 1.2 million lines of code. We're not going to just go and generate tests for the entire application. It would, you know, it would cost an absolute fortune. It would take, I don't know how long. And it's not necessary, right? We want to start with

Valentino Stoll (08:02)
you

Joe Leo (08:13)
high Flog scores, high complexity, high dependencies. And so then it becomes, okay, how well does it do in these different circumstances? And then you throw in what dependencies are being used in that test, what needs to be mocked, what factories are available versus what is not available. And I find it, maybe my engineers don't find it like this, but I find it to be an overwhelming number of variables to accurately say, okay, this one is the best.

And so then what happens is there's a lot of circumstantial evidence. Well, this one worked really well for me. It's, you know, like, this is my guy, ChatGPT-5. That's my guy, right? But, you know, then I talk to somebody else and their guy is, you know, Gemini. So it's hard to know.

Valentino Stoll (08:59)
Yeah, you make a great point. It's funny, if you dive into the open source large language model world, the first thing you learn is that every model has its own unique way of crafting prompts. There are tricks that you can add for each different one that help signal the model in different ways.

Joe Leo (09:25)
All

Valentino Stoll (09:28)
The same is true even of these bigger ones, right? Your GPT-5 prompt might not perform as well as on o3-mini or something, just because it was structured in a way that reasoning models would benefit from. And maybe the GPT-5 model isn't as good at reasoning about the reasoning section of the prompt.

Joe Leo (09:39)
Mm-hmm.

Right.

Mm-hmm.

Mm-hmm.

Valentino Stoll (09:54)
And then

you're just chalking it up as, oh, this model isn't as good. But really it's like, well, the prompt needs to be different in order to justify that. And then at that point, how are we managing that? It goes back to... there was a talk at the last AI Engineer conference where somebody was talking about prompts as a compiled language.

Joe Leo (09:58)
Right.

Yeah.

Valentino Stoll (10:23)
And basically, people are thinking of them as just kind of text that gets hydrated, but really it's like compiled source code in a way, with instructions for an LLM to complete. Once you make an update to that, it changes and recompiles, and it needs to work differently. And you then can't take that compiled source code and plug it into Windows 95.

Joe Leo (10:23)
Mm.

Mm-hmm.

Valentino Stoll (10:52)
You know, it's the same idea, so I'm definitely getting behind that. And I don't know, I feel like a lot of the, you know, prompting strategy libraries are maybe misaligned in this way. Where, like, yeah, I was hopeful with the DSPy stuff that we could see maybe some self-improving things, but it's still almost too complicated

Joe Leo (10:52)
Right, right, yeah.

Valentino Stoll (11:21)
from an instruction standpoint to just like blanket apply prompts across the board, right? ⁓

Joe Leo (11:27)
Yeah,

I think you raise a really interesting point, and I'm curious to know, you know, in your role now at Gusto, do you have any benchmarks? Do you have any hard and fast ways of saying, okay, a new model came out, I'm going to determine whether or not it's better or worse for the use case that we have at hand?

Valentino Stoll (11:49)
Yeah, I mean, we have many, many different kinds of evaluation test sets, right? So it's always good to have your core data sets that you're turning to for the efficacy of whatever you're trying to measure against. And so we do build some golden data sets around whatever we're trying to test output for. And yeah, we can swap models in and out. But again, it circles back to, well, are we going to spend the time

Joe Leo (12:04)
Mm-hmm.

Valentino Stoll (12:17)
to try and reprompt all these prompts that we have specifically for the new model? We don't really have that automated yet. You know, now that I'm thinking about it, I probably want to see how possible that is, like, get a prompt to reprompt it. But at the same time, yeah, it's all about measuring output. And so, how close can you get all your evaluations in line? And

Joe Leo (12:19)
Yeah.

Mm-hmm.

Yeah, yeah.

Mm-hmm.

Valentino Stoll (12:43)
like, there's an easy way to just drop in a new model and see, okay, is this going to really have a measurable impact? And typically with these big ones, it's negligible, right? You might see, I don't know, 0.1%, or sometimes 5% for some tasks. But yeah, as they come up with new releases, that may get improved too, right?

Joe Leo (13:07)
Yeah.

Valentino Stoll (13:08)
So then you have to timestamp the fingerprint of the model and, you know, test against that, because it's this whole juggling act. Which, you know, if you find that the task you want to solve is solved by a model, it's almost like, well, why change the model and create all this extra work?
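[Editor's note: here is a minimal sketch of the kind of golden-dataset model comparison Valentino describes, assuming the ruby-openai gem; the model names, dataset shape, and exact-match scoring are illustrative only.]

```ruby
require "openai" # ruby-openai gem (assumed to be installed and configured)

client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])

# A tiny "golden" dataset: fixed inputs with expected outputs.
GOLDEN_SET = [
  { input: "Categorize this ticket: 'Invoice #42 overdue'", expected: "billing" },
  { input: "Categorize this ticket: 'Password reset not working'", expected: "auth" }
]

def score(client, model)
  correct = GOLDEN_SET.count do |example|
    response = client.chat(
      parameters: {
        model: model,
        messages: [{ role: "user", content: example[:input] }],
        temperature: 0
      }
    )
    answer = response.dig("choices", 0, "message", "content").to_s.strip.downcase
    answer.include?(example[:expected]) # crude scoring; a judge model is often used instead
  end
  correct.to_f / GOLDEN_SET.size
end

# Compare two model snapshots against the same fixed dataset.
%w[gpt-4o gpt-4o-mini].each do |model|
  puts "#{model}: #{(score(client, model) * 100).round(1)}% on golden set"
end
```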

Joe Leo (13:12)
Right, that's right.

Right. Yeah. Yeah, that's

a fair point. I do think that this is something that's missing. Like, there's going to be a company that comes along and kind of takes this off our hands, this benchmarking and evaluation. Because what I'm hearing from you, you know, I hear from others as well, and it always feels like, okay, well, we have this hand-rolled thing. And that typically means that there's an opportunity there in the market that's not being addressed today.

Valentino Stoll (13:56)
Yeah, you know, at Doximity too, you know, you'd prefer setting up your pipelining and evaluation mechanisms around a vendor, right? Like, it's just so easy to just buy something and then plug in your data sets and have it churn through the calculations. But really, that's all it's doing. All of these services, you know, they're all built around LLM-as-a-judge. So you pay for the baseline evaluators to get you up and running to calculate something.

Joe Leo (14:06)
Mm-hmm.

Mm-hmm,

yeah.

Valentino Stoll (14:26)
And

then you realize what those somethings are, and you're like, well, we want to add these specific calculations to target something. And then you realize, well, I've got to build out this whole thing. And you'd be surprised how hard it is to create these things, when it's really just an LLM call in the end. And you can even generate synthetic data to help insulate

Joe Leo (14:32)
Yeah.

Yeah.

Valentino Stoll (14:53)
and cushion the data sets that you do have, so you have more data to test with that's similar and generated in similar ways. That helps. But yeah, at the same time, you always circle back to, well, we have too many custom things that we want to measure and test. And so you do just end up rolling your own out of necessity.
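[Editor's note: to make the "it's really just an LLM call" point concrete, here is a minimal LLM-as-a-judge sketch, again assuming the ruby-openai gem; the rubric, the 1-5 scale, and the prompt wording are invented for illustration.]

```ruby
require "openai" # ruby-openai gem (assumed)
require "json"

client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])

# Ask a judge model to grade one output against a custom rubric.
def judge(client, input:, output:)
  prompt = <<~PROMPT
    You are grading an assistant's answer.
    Question: #{input}
    Answer: #{output}
    Score the answer from 1 (wrong) to 5 (excellent) for factual accuracy and relevance.
    Respond with JSON like {"score": 4, "reason": "..."}.
  PROMPT

  response = client.chat(
    parameters: {
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      response_format: { type: "json_object" },
      temperature: 0
    }
  )
  JSON.parse(response.dig("choices", 0, "message", "content"))
end

result = judge(client,
               input: "What is a golden dataset?",
               output: "A fixed set of inputs with known good outputs used for evaluation.")
puts "score=#{result['score']} reason=#{result['reason']}"
```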

Joe Leo (15:13)
Yeah,

no, I don't think it's an easy problem to solve by any means.

Valentino Stoll (15:15)
But at the same time,

so we had Klaus on an episode or two ago, and he presented the Leva gem, an easy way to just test stuff in Rails, really awesome. And so I see kind of more of that augmented service coming out just across the board. Like, I know, yeah, we've mentioned LangSmith before.

Joe Leo (15:38)
Yeah.

Valentino Stoll (15:45)
And that's, kind of, you know, I don't know if they're free anymore. They used to be free. Yeah. But that was the de facto, like, okay, yeah, you could just watch your traces and then measure the impact. And you know, that works great for a while. And they've got some great tools. I'm not going to knock LangSmith, because they do. If you're looking to set up your eval pipeline, it's all there. But yeah, like, how do you...

Joe Leo (15:45)
Sure.

No, they're not. They're not.

Mm-hmm.

Mm-hmm.

Valentino Stoll (16:13)
I do see this as kind of, you know... I was hopeful when OpenAI had their, like, you know, they have their own evaluations. I forget what they call it. Like...

This is going to bother me if I don't look it up.

Yeah, I guess it's just Evals. So they have their own built-in version of this where you can sample all of the requests that come in through their API, and then drop them into datasets, and then run tests against them. And I think that they even have it set up so that you can take it a step further and fine-tune based on what it finds out from some of that evaluation.

Joe Leo (16:36)
Okay, simple enough.

Right.

Mm-hmm.

Valentino Stoll (17:03)
which is kind of neat, but it's like, I don't know. It doesn't, I haven't had the desire to use it. And I don't know if that's just like, just me or maybe there's something like you're mentioning where it's like.

You know, we want to stay model agnostic. And so like, we don't want necessarily tooling that's like tying us back into the thing that's locking us into the models.

Joe Leo (17:22)
Right.

Yeah,

I can easily see that as well. That was kind of the first thing that I thought of with Evals. Because I don't want it to be self-reinforcing, you know, like, OK, I've made the decision to use OpenAI, let me use an OpenAI tool that convinces me that I've got to keep using it. I also think that LangSmith has been good enough for us. I think that, you know, LangSmith has a real advantage in being a first mover.

Valentino Stoll (17:41)
You

Right.

Joe Leo (18:00)
But there are other tools available. We've stayed on LangSmith. I think there's some inertia there. We haven't seen anything that is blowing us away in terms of being better. But it's also, well, you know, it's not Ruby. You know, I mean, it's a SaaS product, but it's a Python-first solution. And that was fine when we first started out, but as more and more of our application is...

Valentino Stoll (18:17)
Yeah, yeah.

Joe Leo (18:29)
built in Ruby, and in our recent turn to integrating RubyLLM into our application, we came up short in trying to do tracing and observability. And so one of our engineers went out and had to kind of roll their own tracing and observability and integrate it between RubyLLM and LangSmith. A PR may be coming soon

to RubyLLM, but it surprised me that, hey, there's nothing out of the box here, right? I think it is something that is just that new. And I know you've got some opinions on this. My take is that, unlike evaluating in an objective manner what model is best for a particular task, that seems like a very big and onerous problem to solve.

This doesn't seem like a big problem to solve. This seems like something that should be solved again and again and again because there's a straightforward way to do it and you know when you've done it correctly.

Valentino Stoll (19:30)
Yeah, I think it goes back to, I feel like, you know, the OpenTracing days, right? Like, I guess we're still in them, but there was this concept that spread wide and is still kind of prevalent, of having a distributed tracing mechanism that you just hook into your, you know, application and your language. And it just works and it tags everything,

Joe Leo (19:39)
Yeah. Yeah.

Yeah.

Valentino Stoll (19:58)
and can do spans if you want to get more focused on details. And we're like...

Joe Leo (20:04)
That is

OpenTelemetry to an extent, right? Yeah.

Valentino Stoll (20:06)
Yeah, OpenTelemetry. I guess this,

yeah, I forget the transition period between each of those. OpenTelemetry. And so, to me, this seems like the same problem, except we have...

Joe Leo (20:12)
Yeah, no, yeah. Yeah. Yeah.

Valentino Stoll (20:23)
The whole streaming thing is busting everything, right? Because there are so many failure modes now that we have to account for. And then how does the tracing tie into those failure modes?

Joe Leo (20:27)
Yes.

Yeah, and you don't want the active tracing itself to start contributing to the failure modes, right? Because with more streaming and more ways to connect, there's more tracing.

Valentino Stoll (20:47)
Right. And then that.

Right. And then the data pipelining too, right? It's not like you're just sifting through logs anymore. You're calculating things based on those logs. Like, maybe post-processing is still a thing. I feel like maybe the logging tooling had a miss here, you know, in terms of adapting to LLM usage.

Joe Leo (21:13)
Hmm, that's interesting.

Yeah.

Valentino Stoll (21:19)
You know, hey, just use your existing OpenTelemetry, and then you can also sample your data and pipe it into some evals. I'm honestly surprised that that doesn't exist.
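[Editor's note: here is a minimal sketch of the hook-it-in-and-it-just-works tracing being described, using the opentelemetry-sdk gem; the span name, attributes, and the llm_call helper are assumptions, not an established convention.]

```ruby
require "opentelemetry/sdk"

# Configure the OpenTelemetry SDK once at boot (exporter config omitted here).
OpenTelemetry::SDK.configure do |c|
  c.service_name = "my-ai-app" # hypothetical service name
end

TRACER = OpenTelemetry.tracer_provider.tracer("llm")

# Wrap any LLM call in a span so latency, model, and payload sizes get recorded.
def llm_call(model:, prompt:)
  TRACER.in_span("llm.chat", attributes: { "llm.model" => model }) do |span|
    response = yield(model, prompt) # the actual client call goes here
    span.set_attribute("llm.prompt_chars", prompt.length)
    span.set_attribute("llm.completion_chars", response.to_s.length)
    response
  end
end

# Usage: llm_call(model: "gpt-4o-mini", prompt: "Hi") { |m, p| my_client.chat(m, p) }
```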

Joe Leo (21:28)
Yeah.

That's a fair point

because rarely do we need this up to the minute. At least most of the time, I don't need to know how much it costs for me to use this amount of inference on a model in real time. But I may want to know at the end of the day, or the end of the week, or the end of a month, what are the hotspots?

And another big part of this, I didn't mention this before, is that there's a real cost-benefit analysis. Whenever we're using the leading models, the big heavy hitters that are paying billions of dollars to make these incremental improvements, well, they're passing some of that cost on to us. And whenever we can do something with one of the other models, whether it's open source or just a generation behind, we're going to save a whole lot of money. And in our case, we're talking about continuous running. This thing is running

all day long on your repository, so it adds up quickly.

Valentino Stoll (22:30)
Right.

Yeah, you know, we were talking with Klaus in that previous episode, and he gave the most incredible talk at the San Francisco Ruby AI meetup, where he was showcasing the Leva gem, and he used it specifically for this purpose you're talking about: to compare some lesser model to see how it performed and also how much money he's saving,

Joe Leo (22:56)
Mm-hmm.

Valentino Stoll (23:02)
so that, as a business owner, he could decide, okay, is the difference in quality worth the money I would spend on this model, right? And I think the example he showed was GPT-4o mini versus, I think it was GPT-4o at the time. And the percentage of quality loss from his evaluations was, I don't know, 1% or 2% or something like that. And so he was just like, but

Joe Leo (23:09)
Yeah. Yeah.

Mm-hmm.

interesting.

Valentino Stoll (23:32)
the cost savings were outrageous. It was, exactly. And so he was like, well, obviously I can make the decision to just use the cheaper model, right? And so it's interesting to see that exposure, that cost actually really does matter. And it's easy, if you're in a bigger organization, to maybe just overlook that.

Joe Leo (23:36)
Yeah, like 100% or 90%. Yeah, yeah.

Yeah.

Mm-hmm.

Valentino Stoll (24:02)
But at the same time, it is needless spending.

Joe Leo (24:06)
It's needless spending.

And, you know, when you're building products with AI, which I think we're going to get into, it used to be that the SaaS model was almost zero marginal cost. You had another user, it cost you just a couple of bytes, a couple extra minutes of EC2 or compute, and you're off and running, right? It used to be like, well, it can cost us, I don't know,

you know, half a million dollars a year in fixed costs to build this thing, and so we need to make that up. But then every user we add that gets us over that is just, you know, it's just profit. That's not the case, I can tell you from firsthand experience, that is not the case for AI products that leverage existing LLMs, especially commercially available LLMs, right? The variable costs now are very high. And

The famous ones in our industry, the Cursors and Clines of the world, they lose money at just as rapid a pace as OpenAI does. They just do it in smaller quantities, right? There's an order of magnitude difference there, but they're all losing money, right? So what do you do? Not everybody wants to lose money, I can also tell you from personal experience. So what does a small or medium-sized business owner do

when they've got a great idea, either internal or external, they want to leverage the LLM, and it's a continuous thing? Now you need to be able to make back that cost. And a lot of times you're doing that through some combination of price engineering, which is difficult because the loss leaders are charging just a few pennies, and managing the costs on the other end, which I think Klaus had a really good idea about,

and it's something that we're starting to incorporate as well.
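[Editor's note: as a back-of-the-envelope illustration of the trade-off Klaus was making, here is a small cost comparison; the per-token prices and traffic numbers below are placeholders you would replace with the provider's current published rates and your own workload.]

```ruby
# Hypothetical per-million-token prices; substitute the real price sheet.
PRICES = {
  "big-model"   => { input: 2.50, output: 10.00 }, # USD per 1M tokens (assumed)
  "small-model" => { input: 0.15, output: 0.60 }
}

# A workload that runs continuously against a repository all day.
daily_requests  = 5_000
input_tok_each  = 4_000
output_tok_each = 800

PRICES.each do |model, price|
  input_cost  = daily_requests * input_tok_each  / 1_000_000.0 * price[:input]
  output_cost = daily_requests * output_tok_each / 1_000_000.0 * price[:output]
  total = input_cost + output_cost
  puts format("%-12s $%.2f/day  (~$%.0f/month)", model, total, total * 30)
end
# If evals show only a 1-2% quality drop, an order-of-magnitude price gap
# usually decides the question, which is the point being made above.
```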

Valentino Stoll (26:04)
Yeah, it's funny. It almost reminds me of, you know, whether you hire somebody right out of college or somebody that has 10 years of experience doing something. You like to think of it as, well, this person that has more experience is going to accelerate, you know, development of my product by this multiplier, and

Joe Leo (26:32)
Mm-hmm.

Valentino Stoll (26:34)
the cost may be more upfront, but the value from that is you get speed. And so it's a similar idea for these models, where you pay for, you know, the more expensive model in hopes that it's just faster at getting to what you want out of it. Even though there are all these free workers behind the scenes that you could be running, it'll just take them a while, you know. Like, you know, if you spin up Claude Code with, you know,

Joe Leo (26:39)
Mm-hmm.

Right. Yeah, yeah, right.

Valentino Stoll (27:04)
Qwen 3 or something on your MacBook, it may take a few days to run something, but you're going to get it, and it's free. So I guess... it's funny, because all of this AI is accelerating everything, right? And so speed is almost the most important part if you're trying to innovate. But if you're truly just solving somebody's problem as a business owner,

Joe Leo (27:08)
Right. Yeah, yeah, but you're gonna get it. Yeah.

Mm-hmm.

Mm-hmm.

Valentino Stoll (27:34)
Do you need that acceleration? Are you gonna be dwarfed by some competitor because they're more innovative solving the same problem? I honestly don't know.

Joe Leo (27:43)
Yeah,

probably not. I mean, that's not how businesses have worked to date, right? The first entrant into the market is not guaranteed to dominate that market forever. It happens sometimes, but that's usually a business that has really worked hard to shore up, you know, what it does best, and its customer service, and all the rest of it. And I still think that matters. I've yet to see a case where that doesn't matter.

And so I think that's a good point. And I think that, you know, with respect to the models themselves, well, I think what we're discussing here today is that the changes have always been incremental. And now with ChatGPT-5, we have this mass realization, or this mass kind of understanding, that,

hey, this was just a little incremental change. Every time we got a new version, everybody was kind of losing their minds, for either good or not-good reasons; they were kind of elated by it. Now we have ChatGPT-5 and it's like, OK, this is an incremental improvement, and it's widely regarded as a failure by many people, I think, outside of the tech community. Whereas in the tech community, we're like, incremental is good. We'll take a little bit of an improvement here. Let's see what it can do.

But the thing is that we really haven't scratched the surface of what these models can do. So getting something that's, I don't know, an order of magnitude better, that could be great, but we're not even using what we have today.

Valentino Stoll (29:19)
Right. Yeah. I mean, that's a great point. And it makes me think, I remember reading the release notes from OpenAI, and there are just so many details of the API updates, as they relate to the model specifically, that I think were just so overlooked. And it makes me think that, okay, well, it's basically the Apple "you're holding it wrong" thing, right? Like,

Joe Leo (29:36)
Mm-hmm.

Yeah.

Valentino Stoll (29:48)
Basically, everybody

has to redo some portions of what they've built in order to get the optimizations that are in play now. And I have a feeling this is going to be a consistent iterative cycle like this, where it's like, the model may seem very underwhelming, but if you use all these other little features we've added, then you're going to be really wowed.

Joe Leo (29:57)
Yeah.

Yeah. Do you, I mean,

do you have any examples of that? Like what, from what you read?

Valentino Stoll (30:19)
yeah, so they have, like... let me bring it up, because I'm going to forget otherwise.

Joe Leo (30:25)
Okay.

Valentino Stoll (30:29)
Here we go.

Yeah, here we go. they... ⁓

So one example is you can now constrain tool outputs to specific grammars. So instead of just saying the structure of this tool is going to return this specific structure, from a parameter or attributes standpoint, you can instead use an actual grammar, like Ruby's. Ruby doesn't have a grammar, unfortunately. But some other

Joe Leo (30:52)
Okay.

Mm-hmm.

Valentino Stoll (31:12)
programming languages have grammars, and you can make your own grammars. They're basically language templates that let you generically describe how something should respond, in a formatted way. And so basically they made a way to make structured outputs very flexible and more language-oriented. So if you wanted something to generate a report in a specific way,

Joe Leo (31:15)
Mm-hmm.

Mm-hmm.

Valentino Stoll (31:40)
you could just define a grammar that is a report formatted in that way, and then all the outputs for that tool would return in that way. That's one example. Another one is they introduced this idea of preambles. You can ask the model to think out loud before it even makes a tool call. You can capture, if,

Joe Leo (31:49)
Yeah, that is very cool.

Yeah.

Mm-hmm.

Valentino Stoll (32:09)
you know, GPT-5 is reasoning about whether it should call a tool, you can have it supply you that information ahead of time, so that when you get your tool responses, when it's like, you should make all of these tool calls, it's also like, and this is the reason why, right? And maybe you could be like, well, that reason we don't really care about, right? Or it could be useful, passing it along to some tool that makes another LLM call.

Joe Leo (32:19)
Yeah.

Okay, yeah.

Yeah. Yeah.

Mm-hmm.

You would need a human in the loop at that point, right? Or another agent that can act that way. Yeah.

Valentino Stoll (32:39)
And so that was... Right, exactly. You could, yeah. I mean, if

you had an agent where, like, you wanted to approve all the tool calls beforehand, by somebody, even the front-end user, you know, you could present that reasoning back to them and be like, this is why I think we need to do that, instead of you having to be like, well, we think this is why we should call this tool. You know, or just nothing, right? We're making it, you know? Right.

Joe Leo (32:48)
Mm-hmm. Mm-hmm.

Mm-hmm.

Yeah. Yeah. Or we already did it and we'll send

you a bill.
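[Editor's note: a rough sketch of that approval flow, kept provider-agnostic since the exact preamble field names vary by API; the ToolCall struct and the approve? prompt here are entirely hypothetical.]

```ruby
# Hypothetical shape: each pending call carries the model's stated reasoning ("preamble").
ToolCall = Struct.new(:name, :arguments, :preamble, keyword_init: true)

pending_calls = [
  ToolCall.new(name: "delete_branch", arguments: { branch: "old-spike" },
               preamble: "The branch is merged and unreferenced, so cleanup is safe.")
]

# Show the model's reasoning to a human (or front-end user) before executing anything.
def approve?(call)
  puts "Tool: #{call.name}(#{call.arguments.inspect})"
  puts "Model's reasoning: #{call.preamble}"
  print "Run it? [y/N] "
  gets.to_s.strip.downcase == "y"
end

pending_calls.each do |call|
  if approve?(call)
    puts "-> executing #{call.name}" # dispatch to the real tool implementation here
  else
    puts "-> skipped #{call.name}"
  end
end
```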

Valentino Stoll (33:14)
And then another one is like ⁓

See.

yeah, so you can now get reasoning tokens back and feed them back into subsequent calls. So, I know they don't expose the thinking steps you see in ChatGPT, but now you can basically get those for free if you do multi-turn responses. So if you get a response and it has reasoning in it, you can pass that back in your next call, and it will be so close in proximity

Joe Leo (33:38)
Mm-hmm.

Mm-hmm.

Valentino Stoll (33:54)
in the vector space that it can reuse it optimally, and it doesn't have to re-reason about things again. And so, yeah, these are just small examples, I feel like, of what is to come in this kind of space. Because you think, well, this is OpenAI, right? Anthropic has similar API announcement releases, you know, with these other adjustments that you can use their API for, right?
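[Editor's note: for the reasoning-reuse idea, here is a hedged sketch against OpenAI's Responses API over raw HTTP; previous_response_id is the chaining mechanism in that API, but treat the exact payload and response fields as assumptions and check OpenAI's current docs.]

```ruby
require "net/http"
require "json"
require "uri"

API = URI("https://api.openai.com/v1/responses")

def create_response(payload)
  req = Net::HTTP::Post.new(API)
  req["Authorization"] = "Bearer #{ENV['OPENAI_API_KEY']}"
  req["Content-Type"]  = "application/json"
  req.body = payload.to_json
  res = Net::HTTP.start(API.hostname, API.port, use_ssl: true) { |http| http.request(req) }
  JSON.parse(res.body)
end

# First turn: a reasoning model produces reasoning items alongside its answer.
first = create_response(model: "gpt-5", input: "Plan the refactor of this payment module.")

# Second turn: chain to the previous response so prior reasoning can be reused
# instead of being regenerated from scratch.
second = create_response(
  model: "gpt-5",
  previous_response_id: first["id"], # ties the turns together
  input: "Now list the first three pull requests to open."
)

# The response JSON shape varies by model and options; inspect it rather than
# assuming a particular field name here.
puts second["id"]
```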

Joe Leo (33:59)
nice. Yeah, yeah.

Mm-hmm.

Mm-hmm.

Valentino Stoll (34:23)
And so it's back to the compilation thing. Like, well, okay, you can't just try an Anthropic model against GPT-5 and say, hey, can we just use Anthropic, right? Like...

Joe Leo (34:32)
Mm-hmm.

So this

is some way of translating the way that I prompt OpenAI over to the way that I prompt Claude, to try to achieve the same result.

Valentino Stoll (34:49)
Right, exactly. And

even at the API level, maybe you use the API features differently for reasoning in Anthropic than you would in OpenAI, because the models are formatted and structured a different way, and built differently, to be honest. It's funny, people just being like, okay, I entered this thing into ChatGPT, and look at this, what I entered into Claude, and they're completely different. And it's like, well...

Joe Leo (35:06)
Yeah.

All right. Yeah,

yeah, they're totally different spaces. Yeah, yeah.

Valentino Stoll (35:18)
It's not really apples to apples, you know?

And maybe, from a product standpoint, it is. But I feel like there are just, like you mentioned, way too many parameters. And so I guess I wanted to circle back to the Ruby world for a minute, because you did mention RubyLLM, like, integrating that. So how do you weigh that, right, as a product builder? Because there's all this...

Joe Leo (35:29)
Yeah, yeah.

Mm-hmm. Mm-hmm, yeah.

Valentino Stoll (35:47)
You know, I feel like hype lately about...

you know, is AI actually delivering anything, right? And so, as a business owner building AI products, right, how do you decide, okay, well, we should even switch to using this framework, right? Like, what is the value that that's going to provide? Right? Like...

Joe Leo (35:58)
Well, yeah.

Yeah,

yeah, I know what you mean. I mean, there are kind of two sides of it. I'm a business owner with engineers that use AI to build an AI platform, right? And so there's AI all the way down already. And soon they'll just replace me with AI, and then they'll take all the money. But the decision for RubyLLM was, I suppose, twofold. One

Valentino Stoll (36:26)
Right?

Joe Leo (36:43)
was we're doing a lot of different work around the test generation process of Phoenix. We are trying to improve the efficacy and efficiency of Phoenix. And so there are graders that we're building, trying to analyze, okay, how good are these tests? How do we put them back in the loop to make them even better? And we've always had evaluations for the tests that we generate. But in this case, we want to build our own

grader, our own assessment of how good these tests come out, and try to make it as Def Method-esque as possible, right? So we have our own opinions on it. I don't think they're that controversial, but they're, you know, strong opinions loosely held about how tests should be formatted today. So there's that piece. And then there's also this piece of, when we started this 12 whole months ago, there were not a lot of

LLM-integration options in Ruby. And so we built a lot of it in Python. And so, you know, for example, ActiveAgent, we had Justin on the show, that didn't exist when we started Phoenix. And so we've always had a Ruby and Rails application, because there's a web app behind this, but the lower level, the test generation, was done in Python. That is still the case,

but more and more of the application, as we're getting deeper into the end-to-end integration of both test generation and then generating new tests for new code, that's creating Everflame reports where we're evaluating the tests and the code as they come in, those kinds of end-to-end features, more and more we're able to say, okay, well, we want to do this with Ruby, or in some cases it is advantageous for us to do this in Ruby. So we went out and we did an evaluation. We looked at...

We looked at Sublayer, we looked at ActiveAgent, we looked at Roast. We started with Roast for that evaluation piece that I mentioned, which is not part of the web app. It didn't hold up well, and I think that is probably because it's kind of...

What was I going to say? I think it is actually... it's not being sunsetted, but I think it is not being as actively developed as it was. Which is funny. You know, we had OB on here like six weeks ago, and it's like, well, things change, things change quickly. And I think OB himself, I could be wrong about this, but I think he is also more in favor of putting more effort into RubyLLM and seeing that become sort of, you know, the leader for

AI integration. And then when we look at RubyLLM compared to Sublayer and ActiveAgent, and I love both of the maintainers on these projects, there's not as much support, there's not as much use. And so we decided we'd plunge in with RubyLLM. And so far so good, it's only been a couple of weeks. And as I've mentioned, it doesn't have out-of-the-box support for tracing and observability, so we had to roll our own.

But we have felt like it has really added a boost. It's definitely helped us shore up that end-to-end integration. So we're happy with it.
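[Editor's note: for listeners who haven't tried it, a minimal RubyLLM sketch looks roughly like this; the configuration key and model name are from the gem's README as best recalled, so double-check against the current docs.]

```ruby
require "ruby_llm"

RubyLLM.configure do |config|
  config.openai_api_key = ENV["OPENAI_API_KEY"]
  # Other providers (Anthropic, Gemini, etc.) are configured the same way.
end

# Start a chat against a specific model and ask a question.
chat = RubyLLM.chat(model: "gpt-4o-mini")
response = chat.ask("What does convention over configuration mean in Rails?")
puts response.content

# Conversation state is kept on the chat object, so a follow-up question
# can build on the previous answer.
puts chat.ask("Give one example of that convention.").content
```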

Valentino Stoll (40:11)
Yeah, that's awesome to hear. I haven't had a chance personally to play with RubyLLM. And I think the missing structured outputs support was maybe what was preventing me from messing with it, and that's now in there. So I'll probably circle back as well. But it's kind of funny. The Ruby AI gem, or the ruby-openai gem, rather, that was our de facto gold standard.

Joe Leo (40:22)
Yeah. Yeah. Check it out. Yeah.

Mm-hmm.

Yeah.

Valentino Stoll (40:40)
And

now it's, well, how do we get all these other LLMs in there? Langchain.rb seemed promising, and that worked great. And then, as all of these APIs evolve, somebody's like, RubyLLM, I'm going to make all the things that are missing from all these other libraries and release this. And then, at what point

Joe Leo (40:47)
Yeah.

Mm-hmm.

Yeah.

Valentino Stoll (41:09)
does that just keep continuing, and somebody else comes out and is just, I'm making the LLM gem, and it's just, this is what you use?

Joe Leo (41:13)
Yeah.

Yeah, yeah. I mean,

I wrote about this or I talked about this and

probably everything I said back then is obsolete now. But one thing that I stressed to the developers was, you have to make things modular, and you have to reduce the dependencies on your integration points, because you're going to switch them out. And I think that goes for these gems as well. It goes for Roast, it goes for RubyLLM, whatever you're using right now. It might be

Valentino Stoll (41:30)
Right.

Joe Leo (41:57)
the thing that you want to use a year from now, but the chances are not good. And so for that reason, you cannot tightly couple. And we've always been warned about this, right? Like, tight coupling. But now it's like, hey, the thing that you're using today, you may not want to switch, but you want to test something new out. And you're going to want to do that over and over and over again, because the field is evolving so quickly.

Valentino Stoll (42:02)
Right.

Yeah, it does come down to, I feel like, standardization. It's just so boring, especially for AI, which is all this new, new, new, you know? And when it comes down to it and you're maintaining this stuff, you're just like, man, I wish there was a standard, right? I mean, thank God for Anthropic with MCP, and that actually taking hold, which is a little weird because...

Joe Leo (42:29)
Yeah.

You

Yeah, no, I totally agree with you. Yeah.

Mm-hmm.

Valentino Stoll (42:52)
there was the OpenAPI standard, which is essentially the same thing. But at the same time, okay, everybody just get behind this one thing and then we're going to push it hard. I feel like Ruby is great at holding both of these opposing forces, right? Being innovative and just doing everything all different ways, and also getting behind something eventually.

Joe Leo (42:56)
Yeah. Yeah.

Mm-hmm.

Well, because that's what Rails is, right? I mean, we could have just kept creating web apps and deciding where to put our models every single time. And Rails said, actually, you just put them here. They go here. There was a talk at RailsConf where this was brought up, and it was because you want to abstract away the decisions that don't matter. It doesn't matter where your models go. It just matters that they go somewhere and it's in an expected place. It's more important that there's a convention

Valentino Stoll (43:22)
Right.

Right, yeah, exactly.

And you know we still.

Joe Leo (43:52)
than that you can configure it. And I think there's a lot of that in AI and in this world. And we just may not know which is which yet. Like, what's the most important thing to abstract away, and what's the most important thing to just let developers be creative with and explore.

Valentino Stoll (44:10)
Yeah, totally. And it seems at this point that speed is almost trumping, you know, long-term support in a way. I worry a little bit about it, but I feel like, you know, Ruby's very... you could just monkey-patch anything, you know?

Joe Leo (44:36)
Yeah, yeah.

⁓

Valentino Stoll (44:39)
You know, you

just go in and be like, okay, this class, we're going to eval in real time to do something completely different than what it was built for. And I feel like, well, it's maybe not the best thing to do, but it's something you can do, you know? Like, something you can do.

Joe Leo (44:48)
You're right.

It's something you can do. Yeah, it's true. And I wonder how

much... I wonder how different that is. I mean, you know, being in the world of software engineering services for as long as I have, you know, people have, you know, they've committed atrocities of programming so that they could, you know, go from zero to one and get their next round of funding. And the answer is always, well, we'll fix it with the next round of funding, right? And, you know, some people actually do

and commit to it. And some people don't, because it's just more important to chase the next thing and chase the next, you know, vertical curve in user adoption. This definitely accelerates that kind of problem, but I don't think it's a new problem.

Valentino Stoll (45:38)
Right. Yeah, it's definitely not a new problem. And I think...

I think this reminds me... there was this hype cycle over Rails, the hype dropped off, and people are still using it. And now it's not hype, but it's popular. And so there's hype versus popularity. And so where are we heading in this AI space with Ruby,

Joe Leo (46:01)
Mm-hmm.

Mm-hmm.

Valentino Stoll (46:17)
you know, like RubyLLM is a great example of popularity, right? There was hype around it, and now it's really just becoming popular. And so, great, use that tool. So what else is coming out, right? What else are we facing from a hype cycle perspective? You know, we don't have to go into the coding agent world yet.

Joe Leo (46:22)
Mm-mm.

Yes.

It's coming. Yeah, it's coming. Don't worry. Tune in next week. Yeah.

Valentino Stoll (46:44)
Right, but that, I feel like, is its own episode. Yeah.

But where does Ruby fit in that, you know, AI-building business space? Like, you know, it's no longer hype and it's really just popular, so why is it a good use case, right, that we keep coming back to? Right? It's more than just, like, Ruby is... it's like, okay, people are

Joe Leo (47:03)
Yeah.

Valentino Stoll (47:14)
you know, still innovating on things, and it's becoming a useful tool to use for that.

Joe Leo (47:15)
Mm-hmm.

I think it's a good, I mean, it's a great question. I don't know if I have the perfect answer. I would say that we're engineers. If you're listening to this podcast, you're probably an engineer, or you're, you know, my mom. So hi, Mom. But, you know, I think it's actually incumbent on us to rise above the hype, and that's our job, because the hype has always been there. And

hype for Ruby and even Ruby on Rails needed to dissipate. And it's okay to get caught up, but the fact is that at the end of the day, we're building useful tools that we want to stand the test of time. And that's always been, at least for me as a software engineer that, you know, takes himself seriously as a software engineer... I judge my own abilities and success on whether or not the applications I build are still around in a few years.

Right? They're useful. They serve a purpose. Most of the time they either earn money or they are funded, right, if it's a nonprofit, but those tools that I build are still being used. And if they're not, I think, well, you know, I have something to learn from that. And I think the same thing is true for us here. It is easy to get caught up in AI hype, especially when there's a new tool coming out every five minutes that we can go and check out. And I think it's okay to, and this is where Ruby comes in, it's a great place to explore

with. You know, we had Chad Fowler on the show, and he was talking about some other languages that might be better suited for automation or, you know, code gen with AI. And that may be true, but Ruby is still the best for the user experience of the developer: to play with things, to figure things out, to try things out. And I think it's by doing that creative work that

we learn what is actually just hype and not actually going to last the test of time, and what is, as you put it, suitably popular and serving a good function.

Valentino Stoll (49:31)
Right? Yeah, hopefully more things. Because, you know, it is the Ruby nature to make things that are not useful at all and just fun. That is true. Yeah. Which is funny. I just found out about this thing called Whyday, where it's, you know, a memorial event that happens every August 19th

Joe Leo (49:35)
Yeah.

Yeah, I think that's good. That's how you learn though, right? Yeah.

Valentino Stoll (50:01)
to, like, inspire the Ruby community to see how far you can push the weird corners of Ruby.

Joe Leo (50:08)
That's funny,

but you know, the thing is that Whyday exists because everybody got something out of _why the lucky stiff, right? You know, he was being playful and he did something that was great. Okay, I'm sorry. So continue. I'm going to get off my soapbox.

Valentino Stoll (50:14)
Right. Right.

Yeah, so I mean, you know, as an example, I was like, well, let me have Claude, you know, take this description of Whyday and see what it comes up with idea-wise. And it came up with these two incredible ideas. One was around quantum Ruby. So, creating a version of Ruby that sometimes executed and sometimes didn't, based on your observation of it. So, like,

Joe Leo (50:33)
Mm-hmm.

I

love that. Yeah.

Valentino Stoll (50:50)
Basically there would be a quantum state and like

if you were observing the code executing, then it would execute. Otherwise it would be in this weird holding pattern. And then there was the Schrödinger's state of, you know, maybe it does, maybe it doesn't, very much in line with that, right? And another one was just, what was it?

Joe Leo (50:59)
yeah yeah nice yeah

Right, right. Yeah, yeah.

Valentino Stoll (51:18)
It was an emoji rubber duck session. So, like, basically opening up a Ruby console where you can use emojis to define your code. And depending on the kinds of emojis you use, you could make it run faster or slower. Like, you could put it in angry mode, you know, and have more errors raised.

Joe Leo (51:29)
Yeah.

⁓ nice, I like that. Yeah.

Yeah, yeah, I like that. Yeah.

Valentino Stoll (51:46)
So I don't

know, I feel like it's what helps the Ruby community be so popular, but also maybe how it prevents it from being taken seriously in a lot of ways.

Joe Leo (51:53)
Mm-hmm.

I'm okay with that. I mean,

let's put this another way. If you're doing that, and I've got another example as well, but if you're doing that, then maybe you're resisting the urge to do something like create the next billion-dollar company with just one person, right? Like, that's everybody taking themselves, I think, way too seriously. So we could learn a lesson from _why, you know, the _whys of the world, and from your example, V, where

it's like, hey, you know what? I built this thing, and isn't it funny? And isn't it silly? And we did this with AI. Because I think that tinkering is actually really important. And I think it's a lot of fun.

Valentino Stoll (52:33)
Right. ⁓

Yeah,

totally. I mean, yeah, on the serious note, I definitely align with the anti-pattern there. And maybe that's where Ruby really shines, just that it grinds against the draconian seriousness of everything. Right.

Joe Leo (52:55)
Yeah, yeah, right. Look how great this thing is. We all have to become rich. We all have to,

or it's got to take over the world. Like, we don't have to be in this world of absolutes all the time. My thing that I've been checking out, I think I've mentioned this before, but if you're not already reading Scott Werner's Works on My Machine Substack, he has been churning out... I've been following it for a while, but in the last few weeks he's been churning out

Valentino Stoll (53:07)
Right. Right.

so great.

Joe Leo (53:24)
this outstanding content. I think he is one of the most creative thinkers on the usage of AI, how it works in your code, how you can leverage it. And what he released recently were these two new gems, VSM and AirBee. And his post about it, why he did it, and how he feels it can be a template for

artificial intelligence systems that need to iterate and loop into all the things that we talked about on this show. Check that out. We'll put it in the show notes and give them both a try.

Valentino Stoll (54:08)
Yeah, completely. We're going to have to get him back on to talk about these. Like, yeah, you know, this is definitely the future. And I don't know how he comes up with this content, because he must have a goldmine. Yeah, I know. I feel like he just talks about this in his, you know, living room or kitchen. That's how I imagine it happening, you know. But yeah, totally, like the self-building,

Joe Leo (54:11)
Yeah, yeah, I know.

I know. Two in the morning on a lot of coffee, he says. Yeah, I know.

Yeah, yeah, I talked to his young son about it. You know, yeah, yeah.

Valentino Stoll (54:36)
like, runtime-style nature that he's gearing toward. I think it's just what so many people have been talking about but haven't figured out how to do. And this is very, very promising.

Joe Leo (54:42)
Mm-hmm.

Yeah.

Yeah,

I would put money on him to figure it out or at least figure out some large piece of it. You know, I don't know if this is it. He doesn't know if this is it, but this is worth checking out. Yeah.

Valentino Stoll (55:00)
Right, yeah, totally.

All right, well, we've talked about a lot. Is there anything else you wanted to dive into? I feel like we've hit all the corners of, you know, I guess, how do you measure AI in reasonable ways in the Ruby space, tracing, all of this fun stuff that really isn't solved.

Joe Leo (55:09)
Yeah.

Yeah, no, like we can, I think we can.

Mm-hmm. Yeah.

Yeah,

Valentino Stoll (55:35)
All

Joe Leo (55:35)
yeah.

Valentino Stoll (55:37)
right. I don't know. Hopefully somebody listening is going to come out and be like, yeah, you just use this thing now, you know.

Joe Leo (55:46)
yeah, yes. If you

are that person and you're thinking, I can't believe these guys just talked about this for 45 minutes when there is an obvious answer, don't be shy. You can shout that at me on social media or over email. Just let me know.

Valentino Stoll (56:00)
Otherwise we'll just keep talking about how it doesn't exist, you know?

Joe Leo (56:01)
Yeah, yeah. Well, otherwise we'll drive you slowly insane with this

mindless chatter. Oh, man. All right. Well, this was great. Very happy to be on here and talking about this stuff. We'll do it again soon.

Valentino Stoll (56:21)
Yep. All right. Thanks for listening.

Joe Leo (56:24)
Take care everybody, bye.

Want to modernize your Rails system?

Def Method helps teams modernize high-stakes Rails applications without disrupting their business.