By Sebastian Lema ’25 in Spring 2025
Generative artificial intelligence has taken the world by storm, advancing rapidly and forcing industries to scramble to rewrite policies, rethink operations, and grapple with ethical concerns. But despite this sudden surge, artificial intelligence and large language models aren’t new. They have been around for a long time, subtly working their way into our everyday lives.
One of the first notable conversational AI programs was ELIZA, developed in 1966 by MIT’s Joseph Weizenbaum to simulate a therapist using simple pattern-matching techniques. Since then, AI has taken many forms, evolving from simple rule-based chatbots to complex systems that can analyze images, recommend content, and even generate art. In the 1990s, IBM’s Deep Blue defeated world chess champion Garry Kasparov, showcasing AI’s strategic capabilities. In the 2010s, Apple’s Siri and Amazon’s Alexa brought voice assistants into everyday life, while Google’s AlphaGo stunned the world in 2016 by beating a top human player in the game of Go.
So why the sudden boom? Why has AI advanced so rapidly and become such a fixture of our day-to-day lives? The tipping point came just a few years ago with advances in deep learning, a method where computers learn patterns by analyzing huge amounts of data, similar to how our brains learn through experience. A major breakthrough was the invention of the transformer model by Google researchers in 2017.
Unlike older systems that read language one word at a time, transformers can look at entire sentences—or even paragraphs—all at once. This helps the AI better understand meaning and context, making it much faster and more accurate when learning from massive amounts of text. Major players like Google (via DeepMind and Bard), Meta (with its LLaMA models), Amazon, and Microsoft have since poured billions into AI, weaving it into search engines, e-commerce, and cloud platforms. But it’s OpenAI that has led the charge; with the release of ChatGPT, it not only showcased the raw potential of generative language models but also sparked a global conversation about AI’s role in everything from education to creativity to the future of work.
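To make that idea concrete, here is a minimal sketch in Python with NumPy of the scaled dot-product self-attention step at the core of the transformer. It is not Google’s or OpenAI’s actual code; the function name, toy sentence, and use of identity projections are simplifying assumptions. The key point is that every word is compared against every other word in the sentence in a single pass, rather than one word at a time.

```python
# Minimal self-attention sketch (illustrative only, not production transformer code).
import numpy as np

def self_attention(x):
    """x: (sequence_length, model_dim) embeddings for one sentence.

    Every position attends to every other position in one step,
    instead of reading the sentence word by word.
    """
    d = x.shape[-1]
    # Real transformers use learned query/key/value projection matrices;
    # identity projections keep this sketch short.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)                    # (seq_len, seq_len): every word scored against every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sentence
    return weights @ v                               # context-aware representation of each word

# Toy example: five "words," each an 8-dimensional vector.
sentence = np.random.randn(5, 8)
print(self_attention(sentence).shape)  # (5, 8)
```

Real transformers add learned projections, multiple attention heads, and many stacked layers, but this all-at-once comparison is what lets them capture meaning and context so efficiently.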
One of the most prominent conversations surrounding generative artificial intelligence has focused on its many ethical concerns. From data security and monetization to creative ownership, bias, misinformation, job loss, and AI’s potential environmental impact, the recent rise of AI has brought questions that schools, companies, governments, and lawmakers have yet to answer. One of the biggest concerns revolves around how AI trains. To keep improving, large language models (LLMs) need to ingest enormous amounts of data so they can answer questions more accurately. But where does this data come from?
Companies like OpenAI pay contractors and data providers such as Common Crawl, academic publishers, and third-party aggregators to obtain large datasets containing websites, books, code repositories, and other digitized text from across the internet. These datasets include everything from Wikipedia articles and public-domain books to online forums like Reddit and GitHub code, as well as potentially copyrighted material scraped from the web. This has raised significant concern over fair use, a doctrine in U.S. copyright law that lets people use parts of someone else’s work without permission, provided the use adds new meaning, purpose, or context. Some artists, authors, and photographers argue that training generative AI on their work falls outside fair use: their copyrighted content is used without consent or compensation, in ways that are not transformative and that could compete with or replace their original work, often in the form of copycat content that mimics their unique style.
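For a sense of what these web-scale corpora contain, the sketch below streams a few records from C4, a public dataset derived from Common Crawl, using the Hugging Face datasets library. This is purely illustrative and is not OpenAI’s actual pipeline; the dataset name and configuration are assumptions chosen for the example.

```python
# Illustrative only: peek at a Common Crawl-derived public text corpus.
from datasets import load_dataset

# Stream the corpus so nothing is downloaded in full; "allenai/c4" with the
# English ("en") configuration is an assumed example of such a dataset.
corpus = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(corpus):
    print(example["url"])           # where the page was scraped from
    print(example["text"][:200])    # the kind of raw text an LLM trains on
    if i >= 2:
        break
```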
In fact, OpenAI is facing a high-profile lawsuit from The New York Times, which accuses the company of using its articles without permission to train ChatGPT. The lawsuit argues that this practice not only bypasses copyright protections but also enables AI to generate content that closely mimics Times articles, potentially undermining the paper’s business model. When asked about fair use, OpenAI CFO Sarah Friar used the example of writing a book, explaining that just as an author draws on everything they have read without copying it word for word, AI models learn from a wide range of publicly available texts to generate new, original responses. On this view, training gives the original material a new purpose by using it to build a language model rather than to inform or entertain readers, places it in a new context as part of a machine learning process instead of direct consumption, and creates new meaning by providing users with generated responses that reinterpret or summarize the content in ways tailored to their specific queries.
Publicly available text is only one part of ChatGPT’s training. One of the most powerful things about ChatGPT is how, over a long context window, it can become highly personalized to each individual user. But to do this, ChatGPT must use your data, all of your individual entries, and that data can become part of ChatGPT’s ever-growing dataset, which is used to continually improve and refine its responses. This raises another major concern among users: the fear that their personal information could be misused or accessed without their consent.
Sarah Friar emphasized OpenAI's strong commitment to protecting user information. She stated, “We take [this] incredibly seriously...how do we batten down the hatches within our own environment and make sure that data is safe?” Friar highlighted that OpenAI faces constant attacks, noting, “We know we’re under attack every single day by nation states, largely China, trying to steal our IP.” This shows that OpenAI’s data is extremely valuable—and that protecting this data is not just about privacy, but also about guarding intellectual property that powers its AI. Friar concluded, “We’re viewed to be, if not the best in the world[, in the] top 1%.”
Additionally, OpenAI leans heavily on user customization. If users are still wary about their data and its use, they can opt out of data usage for training purposes. As Friar pointed out, “You can choose...whether or not you want us to train on your data.” This gives users the ability to control their privacy and adjust settings to match their preferences as the product continues to evolve.
The rise of generative AI presents both exciting opportunities and significant challenges. As AI models like ChatGPT become more integrated into our daily lives, questions about fair use, data security, and other ethical concerns will only grow. While OpenAI has taken steps to address these concerns, the debate over how AI uses data, and the potential for copyright violations, remains unresolved. The outcome of The New York Times v. OpenAI will be a major factor in deciding how generative AI models are allowed to train. As this technology continues to evolve, lawmakers and governments must move swiftly to establish regulations that safeguard both innovation and the rights of individuals. The balance they strike will shape the future of AI and its impact on society.