When we make art with machines, we’re creating inside a frame. Most of the time, we don’t see it.
The words come easily, the images feel infinite, and the models seem magical. But every output is bounded by something silent, structural, and mostly invisible: the massive dataset that taught the machine how to see, the intensive training run that taught it what to value, the examples it inherited… and the voices it never saw.
This is the reality of working with foundation models. They don’t just generate. They replay the world they were taught, not the one you live in.
And for some, those reflections can feel deeply disappointing, like the model never learned to see them at all.
❓ What’s a Foundation Model, and Why Should Creatives Care?
A foundation model is the original, massive AI: a model trained on vast amounts of internet data that powers everything from how we search, shop, and translate language to how we interact with health apps, voice assistants, and customer service bots. These models underpin legal copilots, personal finance planners, and workplace tools. Even when you’re not using AI directly, you’re often feeling the ripple effects of one of these models, because they’re becoming the infrastructure layer of the internet itself.
In creative fields, these models are the invisible engine beneath tools like ChatGPT, Midjourney, Suno, Runway, DALL·E, and many more. A foundation model is not a tool you prompt directly. It’s the operating system your tools run on. Think of it as the pre-trained brain your favorite app borrowed, or the hidden motherboard powering your workflow.
It doesn’t just autocomplete your sentence or paint your picture, it decides what kinds of sentences and pictures “make sense” in the first place. Most creatives never meet it. But they feel it in the defaults, the sameness, and in the answers that almost sound right but still miss your voice.
These models are primarily trained on statistical patterns derived from vast and often unfiltered internet data. They average out language, aesthetics, and story structures found online.
They learn from scraped data, not what truly holds personal meaning. Their understanding is built from volume, not the nuance you bring. They learn from frequency, not from what ignites your creative voice. They are not artists, they are averages.
Foundation models reflect what’s most visible, not what’s most vital. Which means: a Ghanaian grandmother might appear vaguely European. A boundary-pushing sci-fi short defaults to Marvel trailer rhythm. A poem about grief sounds like a Hallmark card. Not because the model is broken, but because the frame it inherited was incomplete.
🖼 The Metaphor: The Invisible Frame
Imagine walking into a white-walled gallery. A single painting hangs in front of you.
You study the colors, admire the brushwork, and start to find meaning. But what if the frame around it, the literal border, was more than just a decorative edge? What if it functioned as an invisible editor, silently shaping what could appear inside it? Imagine it as a subtle filter, pre-determining which color palettes, perspectives, and objects could ever reach your eye.
This frame doesn’t just mark where the picture ends. It’s the silent gatekeeper that has already pre-selected which artistic styles are considered “good,” which symbols are “relevant,” and which emotions “fit,” long before any paint touched the canvas.
That’s what it can feel like creating with AI today. The machine doesn’t just paint within the frame, it decides what belongs inside it. And if your story wasn’t in the data, it gets cropped out of the frame. The frame isn’t neutral, it’s inherited.
Your creative sovereignty begins by seeing this reality clearly.
🔍 What’s Actually In the Frame?
Most major foundation models, like GPT-4, Claude, Gemini, and Stable Diffusion, are trained on sources such as:
Common Crawl: This is like the world’s biggest digital vacuum cleaner, sweeping up billions of web pages. But it doesn’t scrape evenly. It gathers what’s most visible, which usually means highly linked, frequently updated, easy-to-index content. Imagine walking through a library with a flashlight that only shines on the most popular, English-language books on the front shelves while entire wings remain in the dark (roughly 45% to 60% of the crawl is English-language, depending on the specific crawl and the filtering applied by model developers).
We all remember where Hermione Granger got the scoop on Nicolas Flamel and the Sorcerer’s Stone: buried in the Restricted Section of the Hogwarts library, with no index card in sight. The internet equivalent of magic: not gone, just outside the flashlight beam :)
LAION-5B: A massive database with 5.85 billion image–text pairs scraped from the internet with minimal human curation. Think of it like a photo album compiled by a search engine, with billions of snapshots and captions, some clear, some mislabeled, many with the potential to be biased by what’s most uploaded, tagged, and repeated online. It reflects what’s popular or legible to algorithms, not what’s sacred, nuanced, or rare. Approximately 2.3 billion are English-language pairs, while 2.2 billion are from over 100 other languages, and 1 billion are language-agnostic.
Together with Wikipedia, Reddit, GitHub, online news, and other digital sources, these datasets reflect the statistical center of the internet. Not its edges. The model didn’t choose to center certain voices. It simply reflected the voices already amplified online. The result? A machine that mimics coherence but can miss meaning. Something that performs pattern recognition but maybe not the recognition of your lived experience.
🫥 Why “Default” Aesthetics Dominate
When a voice, image, or story form isn’t frequent enough, it can get statistically drowned. Even when you prompt with care and include a myth from your culture, a dialect from your region, or an image shaped by memory, the model may return chrome and symmetry: a tone shaped by Western cinema, a cadence tuned to the English-speaking internet. The model isn’t malicious; it was just never taught to imagine otherwise. What’s statistically probable often dictates aesthetics, and what’s missing becomes unseen.
The most dangerous bias isn’t in what the model says, but in the vast realm of what it was never taught to express.
This is why we build a Creative Firewall™, the boundary that protects your deepest signal from being overwritten by the machine’s averages. And it’s why we reach for Trojan Prompts™, hyper-specific language only you can speak that carries tone, rhythm, memory, and myth in a form the machine wasn’t trained to expect. It goes beyond just naming the culture, it’s about speaking in its voice.
Because the machine won’t ask: What haunts you? It won’t prompt itself with a question only you could carry. But you can.
And that’s how you pierce the frame, not just with aesthetic choices, but by demanding that your story be seen.
🎨 Why This Matters to All Artists
Whether you’re a poet or a fashion designer, a filmmaker or a choreographer, if you’re working with a foundation model, you’re working inside a frame. And if you don’t know what the frame includes, you won’t realize what it excludes. That can lead to:
Cultural erasure through default aesthetics
Story forms treated as “errors”
A feeling that you’re “misusing” the tools when really, the canvas was biased from the start
Creative self-censorship, reshaping your work to fit what the machine understands instead of what you really need to say
Your originality may feel like a glitch, only because the system wasn’t trained to recognize your signal. The cost isn’t just artistic, it’s a missed opportunity to expand the very imagination of AI.
[Table: Inside the Frame vs. Beyond It. What the Model Learns vs. What Artists Must Teach (Aesthetics of the Average vs. Aesthetics of the Sacred)]
🏃🏽♀️ Real-Life Example: A Finish Line Poem
When I asked ChatGPT:
“Write a poem about the moment I crossed the finish line of my first half marathon.”
It gave me this:
The sun was warm, the sky was clear,
I chased my dream, mile after mile.
The crowd erupted, I held back tears,
the finish line was worth each trial.
It’s painfully generic and not me. It reflects the internet’s average perspective: motivational tweets, poetic templates, race blogs. This is what inside the frame looks like: surface-level emotion without personal truth.
So I tried again, but this time, I inserted a Trojan Prompt:
“I once ran competitively, until a knee injury stole that identity. For years, I avoided running, haunted by a past self I couldn’t live up to. But this race became about return, each mile a step from old pain toward the life I’ve built now. The soft trails behind my childhood home were once therapy, red Mizuno shoes pounding dirt as Penn Masala played on repeat. Now those echoes follow me through Los Angeles streets at sunrise and sunset. Still therapy, but no longer running away. Instead, running proudly home to my daughters’ hugs.
^^Write a poem that captures that.”
And this time, the model returned:
The red Mizunos from high school still fit like a memory.
Back then, I ran to stay ahead of things I couldn’t name.
Now I run to feel them fully, and let them go.
The soft trail behind my childhood home
taught me how to be alone without being afraid.
Penn Masala in my ears, holding silence I couldn’t name.
After the injury, I disappeared from myself that flew.
Los Angeles streets under a burnt-orange sky.
My breath, loud and steady.
My pace not fast, but honest.
And at the finish line, not glory,
but four small arms reaching up.
I ran toward that.
Not a perfect poem, and there’s a lot I would change and make my own. But this version is a better reflection of my Creative Firewall because it includes:
Personal memory (injury, identity, backstory)
Sensory grounding (Mizuno, soft dirt, Penn Masala)
Cultural touchstones (Penn Masala soundtrack)
A mythic arc (escape → return)
This is the power of layering Trojan Prompts inside a Creative Firewall, not just instructing the model, but teaching it how to listen. The model initially reflects what it saw. The artist reveals what it missed.
The model performs probability while the artist retrieves memory.
🔧 Creative Practice: Redraw the Frame
Before your next AI-assisted project, try this: Ask the model:
“What kinds of stories, voices, or aesthetics do you think are underrepresented in your training?”
Then, prompt edge cases. Give it stories it wasn’t built to tell. Watch how it overgeneralizes, or blanks. Then teach it. This isn’t just about getting better results, it’s about reshaping the model’s creative edge with your voice at the center.
I once prompted:
“A surreal jungle temple in the style of Wes Anderson meets 90s Bollywood.”
It delivered quirky symmetry, pastels, and playful tracking shots, but missed the opulence, the melodrama, the devotional rhythm of the song. The Bollywood part felt like window dressing. The model had seen the colors but missed the cadence. It had no memory of Shah Rukh Khan, nostalgia, or rain-soaked confession. That’s the limit of default prompting.
Many artists try harder prompts when the AI “doesn’t get it”. But there’s a better way: show it, don’t just tell it. You don’t need to be a machine learning engineer. You just need to know what got left out. Try:
🎨 Training it in your style.
Some tools let you upload your own images or artworks so the AI learns your specific visual language, not just what’s trending online.
➤ Use: DreamBooth, Luma Genie, or OpenAI fine-tuning
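For the text side of this, the workflow looks roughly like the sketch below: collect pairs of a generic ask and how you would actually answer it, save them as JSONL, and start a fine-tuning job. This is a minimal sketch using the OpenAI fine-tuning API; the file name, the example content, and the model name are placeholders, not the exact recipe.

```python
# A minimal sketch of fine-tuning a text model on examples of your own voice.
# Assumes the openai Python package and an API key; "my_voice.jsonl" and the
# example below are placeholders you would replace with your own work.
import json
from openai import OpenAI

client = OpenAI()

# Each example pairs a generic ask with how *you* would actually answer it.
examples = [
    {"messages": [
        {"role": "user", "content": "Describe a sunrise run."},
        {"role": "assistant", "content": "Red Mizunos on soft dirt, Penn Masala on repeat..."},
    ]},
]

with open("my_voice.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the dataset and start the fine-tuning job (the model name may differ).
training_file = client.files.create(file=open("my_voice.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4o-mini-2024-07-18")
print("Fine-tuning job started:", job.id)
```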
📚 Giving it your own references.
Instead of relying on what the internet says, you can feed the AI your past work, like journals, poems, voice notes, or scripts so it grounds its response in your experience and voice.
➤ Use: Custom GPTs with file upload, RAG pipelines, or Perplexity Pages
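If you’re curious what that grounding step looks like under the hood, here is a toy sketch of retrieval: embed your own notes, pull the passages closest to the question, and hand only those to the model. It assumes the openai package and a placeholder file of your notes called my_journal.txt; real RAG pipelines add chunking, a vector store, and citations.

```python
# A toy version of "give it your own references": embed your notes, keep the
# passages closest to the question, and ground the prompt in just those.
# Assumes the openai package and an API key; "my_journal.txt" is a placeholder.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Your own source material, split into small passages.
passages = open("my_journal.txt").read().split("\n\n")
passage_vecs = embed(passages)

question = "Write a poem about crossing the finish line of my first half marathon."
q_vec = embed([question])[0]

# Cosine similarity: keep the three passages closest to the question.
scores = passage_vecs @ q_vec / (np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(q_vec))
context = "\n\n".join(passages[i] for i in scores.argsort()[-3:])

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Ground your answer in these notes:\n{context}\n\n{question}"}],
)
print(answer.choices[0].message.content)
```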
🪜 Building the story step-by-step.
Break complex ideas into a sequence of smaller prompts. That way, the AI can follow your emotional arc or narrative rhythm more naturally.
➤ Use: Prompt chaining with ChatGPT memory, or visual sequencers like Runway or Pika
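In code, chaining can be as simple as a loop that feeds each answer into the next step, so the emotional arc builds in order instead of arriving all at once. A rough sketch, assuming the openai package; the steps themselves are illustrative stand-ins for your own arc.

```python
# Minimal prompt chaining: each step's output becomes context for the next,
# so the model follows your arc in sequence instead of flattening it.
from openai import OpenAI

client = OpenAI()

steps = [
    "In two sentences, describe the knee injury that made me stop running.",
    "Building on what you just wrote, describe the years I avoided running.",
    "Now write the finish-line moment as a return home, using everything above.",
]

context = ""
for step in steps:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (context + "\n\n" + step).strip()}],
    )
    text = reply.choices[0].message.content
    context += "\n\n" + text  # carry the arc forward into the next prompt

print(context.strip())
```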
🧰 Bonus Tools for Creators Shaping the Frame:
Want to go deeper than prompting? These tools help you shape, teach, and debug the frame itself:
Fine-tune a model on your own creative voice.
Use Hugging Face or AutoTrain to remix open-source model weights with your own data. No PhD required!
Teach the model your visual tone.
Try OpenArt’s LoRA Studio to train a visual “style adapter” that reflects your unique aesthetic.
Build prompts that remember and evolve.
Tools like LangChain or PromptLayer help you build richer, more dynamic interactions.
Audit the dataset behind the default look.
PromptHero’s Dataset Explorer lets you see which styles, voices, and artists dominate and which ones are missing.
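If “remix open-source model weights with your own data” sounds abstract, the LoRA piece of it looks roughly like the sketch below, using Hugging Face’s transformers and peft libraries. The base model name and target modules are assumptions, and real training still needs your dataset and a Trainer loop.

```python
# Minimal sketch of adding a LoRA adapter to an open-source model: only a small
# set of extra weights learns your voice while the base model stays frozen.
# Assumes transformers + peft are installed; the model name is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # any open causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8,                                  # rank of the low-rank update (small = cheap)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # usually well under 1% of the base model's weights

# From here you would train on your own writing with a standard Trainer loop,
# then save just the adapter: model.save_pretrained("my-voice-lora")
```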
🧠 Final Thought
Prompting alone won’t get you past the frame. You must actively see it, name its limitations, and sometimes, break through it entirely. The model will always return to what it knows. So you have to teach it what it missed. You won’t know the edges of the frame exist until you make something that doesn’t quite fit. But when you do, and color outside the lines, you are actively architecting an even wider frame. That’s when the real art begins.
Let’s become active curators of the machine’s memory. Your story is the new training set. What will the machine learn to see because of you?
✉️ Want more metaphors, tactics, and tools for protecting your creative voice? Subscribe below and share this with one artist shaping the frame!