Securities

We will observe a battle for the true openness in AI

Description

No technology has as many dual-use challenges as artificial intelligence. The same AI models that create vivid illustrations and visual effects for movies are the exact models that can generate democracy-killing algorithmic propaganda. Code may well be code, but more and more AI leaders are considering how to balance the desire for openness with the need for responsible innovation.

One of those leading companies is Hugging Face (a Lux portfolio company), and part of the weight of AI’s safe future lies there with Carlos Muñoz Ferrandis, a Spanish lawyer and PhD researcher at the Max Planck Institute for Innovation and Competition (Munich). Ferrandis is co-lead of the Legal & Ethical Working Group at BigScience and the AI counsel for Hugging Face. He’s been working on Open & Responsible AI licenses (“OpenRAIL”) that fuse the freedom of traditional open-source licenses with the responsible usage that AI leaders wish to see emerge from the community.

In today’s episode, Ferrandis joins host Danny Crichton to talk about why code and models require different types of licenses, balancing openness with responsibility, how to keep the community adaptive even as AI models are added to more applications, how these new AI licenses are enforced, and what happens when AI models get ever cheaper to train.

Transcript

This is a human-generated transcript; however, it has not been verified for accuracy.

Danny Crichton:
Okay, let's see how it goes. I have some ideas.

Chris Gates:
Okay, let's do it.

Danny Crichton:
Max Chafkin at Bloomberg wrote an incredible piece last week on the $100 billion struggle to create self-driving cars. One of my favorite anecdotes is that there's this woman in San Francisco who was living in her home, she's just sitting there, and one day notices there's this white car blazing with the word "Waymo" on it, with a spinning LIDAR sensor going on top of the car. She's looking out the window, sees the car back up into her driveway, turn around and drive away. She didn't think anything of it until the cars started to trickle in more and more frequently. First there was just one, then a couple, and then at some point, I guess there were dozens and dozens of cars, because she was at the end of this road and the only way to turn around was basically to use her driveway, and every Waymo AI algorithm figured out that her driveway was the single best place in San Francisco to turn around.

I just love this story, because in one way, we talk a lot about AI and technology, and it's super funny. It's not particularly funny if you're one of the investors of a hundred billion dollars behind these companies, but I thought it was such a visceral sort of story and anecdote of, this is what's also going so wrong in AI today. And that gets at something about dual use. We've talked about dual use on the podcast. I've written a newsletter called AI Dual Use Medicine and Bioweapons. I first got connected to dual use 15 years ago when I was doing bioweapons and biodefense research. One of the challenges you have in bio, along with everything in medicine, is that you have the ability to help and protect people, to make people better, but those same tools, those same biological tools, whether it's CRISPR or a PCR machine, can also be used for evil, to actually accentuate the damage that any individual bioweapon can cause.

When you look around the technology world, nothing has as many dual use characteristics as AI. Artificial intelligence can be used for tremendous good, but it can also be used for terrible, terrible things, all the way up to the famous Terminator film. And so I wanted to get into the heads of how technologists today are confronting this dual use technology. And the answer is they're actually trying to do a lot. They're trying to move from a world of purely open technologies that can be used by both good and bad actors to one in which they actually have more control over what people are calling responsible innovation. So I thought, "Let's do an episode on that."

I reached out to one of our portfolio companies here at Lux, which is Hugging Face, which started by just hosting AI transformer models and has now become one of the most popular online hubs and platforms for anyone doing AI research. I asked their AI counsel, and there's not a lot of these people around the world, but there are now official AI counsel lawyers whose full-time job is to figure out how you handle licensing and all the logistics of this new industry. And so today, I want to bring on one special person to talk about dual use and how we can make the world safe for AI. Take a listen.

Danny Crichton:
Hello and welcome to Securities, a podcast and newsletter devoted to science, technology, finance, and the human condition. I'm your host, Danny Crichton, and today we have Carlos Muñoz Ferrandis. Carlos is a Spanish lawyer and PhD researcher at the Max Planck Institute for Innovation and Competition, and is the co-lead of the Legal & Ethical Working Group at BigScience and the AI counsel for Hugging Face. Carlos, welcome to the show.

Carlos Muñoz Ferrandis:
Hi, Danny. Well, thanks.

Danny Crichton:
So the topic of today's discussion is something that I'm super curious about, which is open source licensing. We think of open source software, which has sort of transformed the way that software has worked in Silicon Valley for startups, for corporations, for governments, but we rarely actually get into the nuts and bolts of exactly how this whole framework works. And ultimately it's a legal framework, a model, a set of licenses that allows software developers to use code, to reuse it, to adjust and change it as they need to. That was developed over, I guess, the last 30, 40 years, and it has become a fairly normal, regular part of business. But as we've entered this new AI world, there's been a shift in that we need to update these licenses in order to allow people to use AI models in their work.

I just want to start with the basics of open source licensing, which is, let's talk a little bit about MIT, Apache, and a lot of these tools, why they work for open source code and why they aren't sufficient for AI models.

Carlos Muñoz Ferrandis:
Yeah, no, that's a super interesting discussion and question. So first of all, I think back in even the '80s and '90s, open source licenses were conceived as instruments to promote distribution and incremental innovation of software and of code, and more precisely, source code. Now you have this debate over whether code, on the one hand, and a machine learning model, on the other, are more or less the same thing or the same artifact, which in fact they are not. So on the one hand you have code, and on the other you have models. These are two different artifacts which should be subject to different licensing terms.
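
To see how that split plays out in practice, here is a minimal sketch, assuming the transformers library and the public bigscience/bloom-560m repository (a small BLOOM variant used purely for illustration): the modeling code ships with the Apache-2.0-licensed library, while the weights it downloads are a separate artifact distributed under the BLOOM RAIL license.

```python
# A minimal sketch of the code/model split, assuming the `transformers`
# library. The modeling code here is Apache 2.0; the downloaded weights are
# a separate artifact distributed under the BLOOM RAIL license.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bigscience/bloom-560m"  # small BLOOM variant, for illustration

tokenizer = AutoTokenizer.from_pretrained(repo_id)      # code: Apache 2.0
model = AutoModelForCausalLM.from_pretrained(repo_id)   # weights: RAIL license

print(model.config.model_type)  # "bloom"
```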

Danny Crichton:
So when you look at open source licenses, I mean, one of the interesting things is that they were open, right? Everyone can use them. You can use them for good, for evil, everything in between. When you get into AI models, there's this heightened concern that AI models could be used in negative applications. If we're talking about image generators, the same image generators that are making the fun images and memes that we're seeing on Twitter can be used for propaganda, disinformation purposes, hacks on other governments, whatever the case may be. I'm curious why that shift in philosophy, because in many ways that sort of dual use, that sort of positive and negative use case of the technology, applies equally to code just as much as it does to AI models.

Carlos Muñoz Ferrandis:
So the decision by the BigScience community to use an open and responsible AI license to distribute the BLOOM set of models was an organic one. When we were starting to discuss under which licensing conditions we were going to distribute the output of the project, so the BLOOM set of models, at first we thought we wanted an open source license, so let's choose an Apache 2.0 license, broadly open source, very permissive. Anyone will be able to use the model also for commercial purposes, and further redistribute or distribute derivatives of the model. But then, instantly, we realized that open sourcing would entail potential misuses, so we were placed in a position where a balance had to be struck between open innovation and responsible innovation. And OpenRAILs, so Open and Responsible AI Licenses, are meant to be one of these, let's say, answers at that intersection.

Danny Crichton:
And with OpenRAILs, so Open and Responsible AI Licenses, you have two axes that this is built on. One is the open and then the responsible. So on the open side, it's offering royalty-free access. And then on the responsible part, which I think is the more interesting one, is sort of encoding a set of values into the license itself, saying that the community wants to see its work used in certain contexts and not in others. How do you delimit the level of responsible use cases? Like how gray does that area get? And as is normal with open source licensing, you're locking yourself in long-term: you develop these licenses, and this is how the values are getting set. How does that change over time and evolve as we learn more about use cases and potentially maybe some of the downsides for some of these AI models?

Carlos Muñoz Ferrandis:
The first open and responsible AI license has, if I'm not mistaken right now, 13 different use-based restrictions, to take into account, first of all, the technical limitations of the model, and then also our ethical values. Do we want to promote the use of the model for very specific scenarios where we don't feel very comfortable with third parties using it, such as, for instance, medical uses or medical results interpretation? Why? Because first of all, we are dealing with a pre-trained model which has not been specifically designed or fine-tuned to be used for medical advice or medical results interpretation. Therefore, we didn't feel comfortable with third parties using the model on an open basis for this specific purpose. We didn't want the public to see these use-based restrictions as hindering or foreclosing incremental innovation.

So we believe these are the general rule. However, the licensor, at its own discretion, might admit some exceptions to some of the use-based restrictions, due to the fact that the potential licensee may be able to justify and make the case to the licensor that the use of the model under a specific use-based restriction is legit, or that they have come to a specific technical result that allows the model to be used in this specific field without any concerns, either from the side of the licensee or the side of the licensor.
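
As a concrete illustration of how those terms surface to users, a minimal sketch, assuming the huggingface_hub client library (the repo id and the exact license tag shown are illustrative): license metadata on the Hugging Face Hub is exposed as repo tags, so a downstream pipeline can check what it is about to pull.

```python
# A minimal sketch: inspect the license a model repo declares before use.
# Assumes the `huggingface_hub` client; repo id and tag are illustrative.
from huggingface_hub import model_info

info = model_info("bigscience/bloom")

# License metadata is exposed as "license:<id>" tags on the repo.
licenses = [t.split(":", 1)[1] for t in info.tags if t.startswith("license:")]
print(licenses)  # e.g. ['bigscience-bloom-rail-1.0']

# A downstream pipeline could gate on the license family and point users at
# the use-based restrictions (Attachment A in the RAIL licenses).
if any("rail" in lic for lic in licenses):
    print("RAIL-family license: review the use-based restrictions before deploying.")
```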

Danny Crichton:
I'm listening to this carefully. There's a balance here between trying to allow innovation and opening it up and allowing people to just experiment, play with these new technologies. We're sort of an understanding that if we just sort of willy-nilly use AI models in applications, you mentioned medicine, you could also mention criminal justice. There've been some cases there of AI in sentencing guidelines, that there can be bias, that there can be challenges, that we don't necessarily even know about today. And so there's a little bit of maybe not hesitancy, but sort of much more careful and surgical usage of these models to say, "Let's make sure we know we're doing and getting involved with first, before we just open up these models to the world and use them in a bunch of different applications."

So I want to move on to enforcement, because I think another piece of this, which we'll just mention particularly in the BigScience discussion of its OpenRAIL license, is the enforcement piece, which is, ultimately these models are being posted to the internet, people can download them, people can use them. There's no active enforcement mechanism, and there's sort of this implied potential enforcement mechanism. Maybe you could talk a little bit more about that.

Carlos Muñoz Ferrandis:
Enforcement is one of these core things. How are you as a licensor simply going to enforce your license? That's it. This is a question as old as contracts, I believe. It depends on a case-by-case scenario. If the user of the model is a model developer whose main aim is to develop a derivative version of BLOOM, for instance, or of Stable Diffusion, this could be one type of user: to, for instance, fine-tune it, embed the model in an app, right, and create a specific machine learning app for whatever purposes.

The second type of user could be basically, let's say, an end consumer interested in generating outputs from the model, in generating images from DALL·E Mini, in generating images from DALL·E or from Stable Diffusion, or just text from, for instance, BLOOM. So what we decided here, and you will be able to assess it under paragraph six of the license, is that we do not claim any rights on the output generated by the model, and therefore the user is solely accountable for the use of this output. So we believe that the user should be the one responsible when it comes to the use of the output generated by the model. The output should be used without contravening the license. This means that if the licensor spots a potential output generated by the model which would be deemed to be harmful under our set of use-based restrictions, we could enforce the license and, under paragraph seven, remotely restrict access to the model for this specific user.
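
A hedged sketch of the mechanics that can back this kind of enforcement, assuming the huggingface_hub client (the repo id and token below are placeholders): because weights are served through authenticated downloads, a licensor operating a gated repo can revoke a specific user's access server-side.

```python
# A sketch of enforcement via access control, assuming the `huggingface_hub`
# client. Repo id and token are placeholders, not a real recipe.
from huggingface_hub import snapshot_download
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError

try:
    local_dir = snapshot_download(
        repo_id="bigscience/bloom",  # illustrative repo
        token="hf_...",              # per-user access token (placeholder)
    )
    print(f"Weights downloaded to {local_dir}")
except (GatedRepoError, HfHubHTTPError) as err:
    # If the licensor has restricted this user's access (e.g. under a clause
    # like the license's paragraph seven), the download fails server-side.
    print(f"Access denied: {err}")
```

The license itself does the legal work; the access control is just the practical lever a hub-hosted licensor can pull.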

Danny Crichton:
We're talking about enforcement all these licenses, and most of that comes from this kind of scarcity of models. The ability to train a model is still extraordinarily expensive, in some cases, tens of millions of dollars. It's getting cheaper every month and every year, but at some point, theoretically, it'll be easier and easier to just build models from scratch. And so these models that in some cases cost a hundred million dollars to train, now it's down to tens of millions of dollars, at some point it'll be millions, maybe hundreds of thousands. Does that worry you at all, that there's going to just be a profusion of AI models coming up in the next couple of years and some of those models may have really strict use case restrictions and others will be wide open?

Carlos Muñoz Ferrandis:
Yeah. It's the first time I've heard this question, so thanks a lot. I think it's a challenging one as well as a super interesting one. So first of all, I think, well, as anyone caring about AI governance and how machine learning models are being used, I will be concerned in two years' time for sure. I would be more concerned if, nowadays, we weren't taking all these AI governance steps towards a better and more solid collaborative and collective AI governance system. In two years' time, my vision is that we will be far more solid, as the AI community as a whole, when it comes to, let's say, more proficient or more efficient AI governance tools. The last piece is the regulatory and policy perspective.

Danny Crichton:
So you're thinking of it as both a bottom-up governance model, which is these licenses, OpenRAIL et cetera, plus this kind of top-down nation state or, in the case of the EU, supranational governance, where those two intersect. And yes, you might be able to build a new model and not license it under OpenRAIL, let's say the opposite, the closed and irresponsible AI license, CRAIL, but you'll be able to still enforce it because it's coming down from the EU and the AI Act. And so there's a sort of intersection there that protects this entire industry. Let me ask you, have there been controversies over the design of these sorts of licenses? Because I know, having covered open source for many years, there are obviously very philosophical differences between different types of engineers: people who want things completely open, who want more responsible innovation, who want a lot of restrictions. How do you see the community today? Is there a lot of consensus on where it needs to go? Is there dissensus? Are there different parties that are trying to argue different points of view? How do you see the debate today?

Carlos Muñoz Ferrandis:
That's also another very sharp question, which I love, because my PhD basically deals with the intersection between open source and technical standards from an IP and antitrust perspective. So I deal basically with what I call the battle to capture the concept of openness in the policy spheres. So nowadays you have friction between different versions of openness, and not just nowadays. I mean always, right? For instance, I'll give you an example. You go to the telecoms sector, and you have two different conceptions of openness. On the one hand you have the open source proponents, and on the other you have the traditional intellectual property rights holders. So the traditional IP holders claim their rights and include them in their technical contributions to critical standards such as 5G or 6G. Why? Because once the standard is set, all the intellectual property rights reading on some of these technical specifications will be deemed to be standard essential patents, which means a huge return on investment for these traditional IPR holders.

Now, when they license this specific branch or set of patents, they claim that this is their open approach to opening use of the standard and also the use of their patents, by means of these very reasonable and non-discriminatory terms. On the other hand, you have a different conception of openness. This is the one held by the Open Source Initiative, according to the open source definition and the set of principles they have.

I believe that in the short run, mid run, we are going to observe a battle, a battle for this concept of openness, the true openness in AI, and who holds the truth over this policy-captured concept. I think this is the wrong approach to take. I think we should be very, very respectful of the values behind the open-related initiatives already present in the market.

Under the RAIL initiative, and also having drafted the first open and responsible AI licenses, we do not claim we hold the truth on what's open and what's not. We have our own approach to openness due to our initiative's values and aims, and we totally respect that the Open Source Initiative has their own concept of openness, as Meta or any other stakeholder is going to have. We should co-exist. The main battle is not for open AI. It's for the intersection between open and responsible AI. That's the main battle right now, not just about openness. Openness in the AI realm nowadays is not enough. That's it. We need openness plus responsible use and development of the technology.

Danny Crichton:
I think this is super interesting, particularly in the context of the capabilities that AI offers. I mean, you mentioned telecom, and OpenRAN, or open Radio Access Network technology, has become very popular. It's a huge topic in Washington, DC these days and in geopolitics, and we've talked about it maybe a little bit on the podcast here and there. But what's interesting is there are the standards, which we see with OpenRAN, which we see in the semiconductor space with RISC-V, where there have been standards built together, often around what you mentioned with FRAND, fair, reasonable, and non-discriminatory patent terms, where a group of companies come together and say, we will put our patents up into this cohort, this consortium. We saw this with 4G technology, 5G technology, OpenRAN, et cetera, where the intellectual property comes together and you can license it and take advantage of it.

But that's just the first step of the openness, right? I want to analogize it a little bit to what we saw with COVID-19 and the vaccines, where Pfizer and others put the source code online. You can download the vaccine, so to speak, from GitHub, I believe, but that doesn't mean you can actually make the vaccine. You can't actually produce the actual biological tissue and sample that goes into a vial that then gets placed under your skin and actually creates the vaccine, because that's actually where the intellectual property is, which became, I believe, a huge fight last year when it was sort of open sourced, but then no one actually knows how to produce this stuff in the first place, and so that's really the challenge. Same with RISC-V. People can see the ISA, the Instruction Set Architecture, but they don't actually have the ability to produce a chip in a fab, and that's a totally different block.

What I find interesting about AI is, one, more of us actually have the capability to go do this. If I have a computer, I could probably train a model, probably very slowly, but I could train a model. I could actually go build it. And second, unlike OpenRAN, where ultimately you're getting a telecom network, or RISC-V, where you're building a semiconductor, there's a reasonable sense that there's a sort of basic universality to those. They're used in everything from cars to planes to missiles and other applications, and it's sort of hard to restrict them. AI to me feels like you're building a model, and it's going to be applied to a couple of obvious cases. So it's easier to imagine controlling and putting values on that in a way that, with OpenRAN, is really hard to do. In other words, how do you integrate, say, free speech into OpenRAN, so it's not used in an authoritarian regime to block protests or something like this? It just seems much harder to encase those values in the technology itself, unlike in AI, where I feel like you have much more power to do so.
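
Danny's "I could train a model, probably very slowly" is easy to make literal. A toy sketch of a character-level language model in plain PyTorch; the corpus, sizes, and step count are stand-ins, nothing like a production recipe:

```python
# A toy character-level language model, trainable on any laptop.
import torch
import torch.nn as nn

text = "openness plus responsible use and development of the technology "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Next-character prediction: input is the text, target is the text shifted by one.
x, y = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)
for step in range(200):
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(vocab)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```

On a laptop CPU this runs in seconds; the gap between this and BLOOM is scale, data, and cost, not kind.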

Carlos Muñoz Ferrandis:
Openness, first of all, shouldn't be seen as a binary concept: this is open, this is closed. You have to take a gradual approach to openness. You are going to have different degrees of openness. You are going to have something more open than another thing. That's the first thing. The other thing, having this multidimensional approach to openness, is that traditionally the concept of openness has been focused on intellectual property rights exploitation. If you open the result of it, you get access, in principle, to this technology. But as you were mentioning, you also have to have the technical infrastructure and the know-how to develop and implement this technology. So this is where the other dimension of openness comes into place, which is openness within the development process of the technology. So you have this open approach to the development of such a critical infrastructure, such as a large language model.

If you give the public and the research community, or the AI community at large, the ability or the opportunity to access how you develop and how you train, which are the specific governance considerations to take when developing a large language model, then when you open source it, or when you OpenRAIL it, or when you just release it on an open basis, all this public will have more chances of succeeding in developing, or knowing how to develop and analyze, a large language model. So I think it is very interesting always to have this kind of multidimensional approach to openness, and not just narrowly or rigidly focus on openness as the result of the project.

Danny Crichton:
I want to talk about the transition, I think, in the AI world from closed to open, because I think, if you think about the last 10, 15, 20 years in AI, most models were closed. The big algorithms for all the social networks. The Google search engine is one of the largest AI models, one of the most important AI models, certainly one of the most lucrative, probably if not the most lucrative AI model in the world, generating hundreds of billions in search revenue every year.

But how much do you believe that these open models that you're pioneering at BigScience or with Stable Diffusion will force some of the other closed algorithms and AI models to go more open? Because obviously there's been a lot of pressure on Google, on TikTok, and others to open their models up for inspection, to be able to inspect them and say, are they affecting people's emotions? Are they placing certain corporate interests over the benefits of individual users? They have always been extremely closed, but there has been this kind of transition, or at least demand, to open them up. And I'm curious whether you think any of this sort of OpenRAIL licensing could help migrate some of those closed models into a more open format.

Carlos Muñoz Ferrandis:
Openness as such is not just a value; openness is a core competitive factor in the AI industry, and more precisely in platform competition. If you adopt a very strategic vision or business perspective of openness, what you are trying to generate is massive adoption of your platform, and for your platform to basically become, potentially in the near term, the de facto standard in the market. So this is the very strategic perspective of openness, which we should always be very mindful of.

Danny Crichton:
Well, fantastic. Well, Carlos, thank you so much for joining us.

Carlos Muñoz Ferrandis:
Thanks a lot, Danny. Take care.
