I. Almeida's Blog

September 16, 2023

Navigating the Murky Waters of AI and Copyright

Powerful generative AI systems can now produce stunning works of art, human-sounding text, and original music at the click of a button. This emerging technology holds immense promise, yet it also surfaces intricate legal questions around copyright protection. How exactly should business leaders navigate the complex intersection between AI creation and existing copyright law? A new research paper by legal scholar Dr Andres Guadamuz provides an enlightening analysis of this murky terrain.


Guadamuz explains that modern AI relies heavily on a process called machine learning, in which algorithms are fed vast troves of data (text corpora, images, or audio samples) that they analyze to discern patterns and complete tasks. As an AI ingests more data, its performance improves. This data is the lifeblood that lets systems like ChatGPT, DALL-E 2, and Midjourney produce their creative outputs.


Of course, much of this training data consists of copyrighted works. And herein lies the crux of the issue. Does an AI system infringe copyright through its utilization of such data? Are laws adequately calibrated to protect rights holders while also giving space for AI innovation to blossom? Guadamuz's research suggests we are in a legal gray zone lacking definitive precedents.

Read more here.
Published on September 16, 2023 03:27 Tags: ai-ethics, generative-ai, llm, llms, responsible-ai

Documenting Machine Learning Datasets to Increase Accountability and Inclusivity

Machine learning models rely heavily on their training datasets, inheriting their biases and limitations. Yet unlike fields such as electronics, where datasheets meticulously detail a component's operating parameters, no standard exists for documenting machine learning datasets. This research proposes "datasheets for datasets" to fill that gap, increasing transparency and mitigating risk.


Datasets exert significant yet often opaque influence on models. Biased data produces biased AI, as with résumé-screening tools that disadvantaged women. Undocumented datasets also limit reproducibility. The World Economic Forum thus recommends documenting datasets to avoid discrimination.

The authors devised an initial set of questions for dataset creators to reflect on a dataset's motivation, composition, collection, processing, uses, distribution, and maintenance. For example, which subpopulations does the data represent? Could it directly or indirectly identify individuals? Might it result in unfair treatment of certain groups?
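To make those question categories concrete, here is a minimal sketch of a datasheet as a structured record in Python. The field names and the sample dataset are illustrative only, loosely based on the categories above, not the authors' actual schema:

```python
from dataclasses import dataclass

@dataclass
class Datasheet:
    """Illustrative datasheet record, one field per question category."""
    motivation: str     # why was the dataset created?
    composition: str    # what do instances represent? which subpopulations?
    collection: str     # how was the data acquired?
    processing: str     # cleaning, labeling, filtering applied
    uses: str           # intended and inappropriate uses
    distribution: str   # how and under what license it is shared
    maintenance: str    # who maintains it and how errors are reported

# Hypothetical example entry:
sheet = Datasheet(
    motivation="Benchmark for toxicity-classification research.",
    composition="English social-media posts; demographics not verified.",
    collection="Scraped from public forums in 2022.",
    processing="Deduplicated; labels from three crowd annotators.",
    uses="Research only; not suitable for production moderation.",
    distribution="CC BY-NC 4.0, hosted on a research portal.",
    maintenance="Errata accepted via the project issue tracker.",
)
```

Writing the answers down as a fixed structure, rather than free-form notes, is what makes the documentation auditable and comparable across datasets.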


Read more here.
Published on September 16, 2023 03:25 Tags: datasheets-for-datasets, generative-ai, llm, llms

What is Tokenization? Let's Explore, Using NovelAI's New Tokenizer as a Use Case

Tokenization is a foundational step in natural language processing (NLP) and machine learning.

Large language models are big statistical calculators that work with numbers, not words. Tokenization converts the words into numbers, with each number representing a position in a dictionary of all the possible tokens.
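The idea can be sketched with a toy word-level tokenizer in Python. This is a deliberate simplification with a made-up vocabulary: real LLM tokenizers (including NovelAI's) operate on subword units via methods like byte-pair encoding, but the principle is the same, text in, numbers out:

```python
# Toy word-level tokenizer: each known word maps to an integer ID.
# Unknown words fall back to a special <unk> token.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
inverse = {i: w for w, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Convert words to their vocabulary IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids: list[int]) -> str:
    """Convert IDs back into words."""
    return " ".join(inverse[i] for i in ids)

print(encode("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
print(decode([1, 2, 3]))                 # the cat sat
```

The model only ever sees the lists of integers; the dictionary is what lets us translate its numeric output back into text.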

Read more here.
Published on September 16, 2023 03:21 Tags: tokenization