Can AI even be open source? It's complicated

05 August, 2024

Without open source, there is no artificial intelligence (AI). Period. End of statement.

It's not just that AI's early roots spring from the 1960s' open language Lisp; the headline AI generative models, such as ChatGPT, Llama 2, and DALL-E, are built on solid, open-source foundations. However, those models and programs themselves are not open source.

Oh, I know that when Meta CEO Mark Zuckerberg unveiled Llama 3.1 in a Threads post, he said, "Open-source AI is the path forward," and that Meta is "taking the next steps towards open-source AI becoming the industry standard."

At a SIGGRAPH keynote discussion with Nvidia CEO Jensen Huang, Zuckerberg admitted:

We're not pursuing [open source] out of altruism, though I believe it will benefit the ecosystem. We're doing it because we think it will enhance our offerings by creating a strong ecosystem. … this might sound selfish, but after building this company for a while, one of my goals for the next 10 or 15 years is to ensure we can build the fundamental technology for our social experiences.

Zuckerberg is sincere about open source. As we've seen repeatedly, open source is the way to unite technologies. For example, we use a unified Linux now instead of multiple, incompatible versions of Unix because Linus Torvalds open-sourced Linux under GPLv2.

But I've also read Meta's Llama 2 license and the Llama Acceptable Use Policy. It's not open source. It's not even close.

Zuck's not alone, though, in playing fast and loose with open source. From the name, you'd think OpenAI is open source. It was indeed open back when GPT-1 and GPT-2 were state-of-the-art. That was a long time -- and billions in revenue -- ago. Starting with GPT-3, OpenAI closed its doors.

As Mark Dingemanse, a language scientist at Radboud University in Nijmegen, Netherlands, said in a Nature article, some big firms are reaping the benefits of claiming to have open-source models while trying "to get away with disclosing as little as possible."

Indeed, Dingemanse and his colleague Andreas Liesenfeld found only one AI chatbot that could truly be described as open: The Hugging Face-hosted Large-Language Model (LLM) BigScience/BloomZ.

Other LLMs that qualify are Falcon, FastChat-T5, and OpenLLaMA. But most LLMs contain proprietary, copyrighted, or simply unknown information their owners won't tell you about. As the Electronic Frontier Foundation (EFF) observed, "Garbage In, Gospel Out."

Now, much of the innovative software driving AI is open source. TensorFlow, a versatile machine learning framework that supports multiple programming languages, and PyTorch, popular for its dynamic computational graphs and ease of use in deep learning applications, quickly come to mind.

The LLMs and programs built on them are another story. All the most popular AI chatbots and programs are proprietary.

So, why are companies claiming their projects are open source? By "open-washing" their efforts, businesses hope to gild their programs with open source's positive connotations of transparency, collaboration, and innovation. They also hope to con developers into helping advance their own projects. It's all about marketing.

Clearly, we need to devise an open-source definition that fits AI programs to stop these faux-source efforts in their tracks. Unfortunately, that's easier said than done.

While people constantly fuss over the finer details of what's open-source code and what isn't, the Open Source Initiative (OSI) has had the definition, the Open Source Definition (OSD), nailed down for almost twenty years. The convergence of open source and AI is much more complicated.

In fact, Joseph Jacks, founder of the venture capital (VC) firm FOSS Capital, argued there is "no such thing as open-source AI" since "open source was invented explicitly for software source code."

It's true. In addition, open source's legal foundation is copyright law. As Jacks observed, "Neural Net Weights (NNWs) [which are essential in AI] are not software source code -- they are unreadable by humans, nor are they debuggable."

As Stefano Maffulli, OSI executive director, has told me, software and data are mixed in AI, and existing open-source licenses are breaking down. Specifically, trouble emerges when all that data and code are merged in AI/ML artifacts -- such as datasets, models, and weights. "Therefore, we need to make a new definition for open-source AI," said Maffulli.

However, getting there hasn't been easy. The main point of contention is the extent of openness required, particularly regarding training data. While some argue that releasing pre-trained models without the training data is sufficient, others argue that true open-source AI should also include access to the training data.

As julia ferraioli, Amazon Web Services (AWS) Open Source AI/ML Strategist, observed in a blog post, with the current OSI open-source AI definition 0.0.8 draft, "the only aspects of the data that a system desiring to be labeled as 'open source AI' would need to publish are: training methodologies and techniques; training data scope and characteristics; training data provenance (including how data was obtained and selected), training data labeling procedures, and training data cleaning methodology."

None of that, ferraioli continued, "gives the prospective adopter of the AI system insight into the data that was used to train the system." Without this data, can an AI be open? In ferraioli's view, it can't.

She's not the only one who holds that position. She quotes her colleague, AWS Principal Open Source Technical Strategist Tom Callaway, who wrote, "Without requiring the data be open, it is not possible for anyone without the data to fully study or modify the LLM, or distribute all of its source code. You can only use it, tune/tweak it a bit, but you can't dive deep into it to understand why it does what it does."

He has a good point. At its heart, open source is all about understanding the code. In AI's case, that means the data as well. As Maffulli said at the recent United Nations OSPOs for Good Conference, "While there's broad agreement on the overarching principles, it's becoming obvious that the devil is in the details." You can say that again.

At the same conference, Sasha Luccioni, Hugging Face's AI and climate lead, argued, "You can't really expect all companies to be 100% open source as the open source license defines it. You can't expect companies just to give up everything that they're making money off of and do so in a way they're comfortable with."

Still, Luccioni believes "a responsible AI license can exist" -- one that is open source friendly -- where you can define your terms of open source. By tweaking the language a little bit, you can move forward in a way that companies, governments, and academia are all comfortable with instead of saying this project or license is not open source.

Open-source advocates disagreed. I suspect the arguments will continue for years to come.

The OSI is wrestling to come up with a workable definition with the help of 70 others: researchers, lawyers, policymakers, activists, and representatives from big tech companies such as Meta, Google, and Amazon and from groups such as the Linux Foundation and the Alfred P. Sloan Foundation. The goal is to present a stable version of the Open Source AI Definition at the next All Things Open conference in Raleigh, North Carolina, from October 27 to 29.

I'll be there. So strap in, folks. The combination of open-source principles and AI development is driving significant advancements. It's also enabling faster innovation, promoting collaboration, and democratizing access to powerful AI tools. But its evolution promises to be a long, difficult process.