Building Your First Real GPT Is Not a Prompting Exercise

I recently built my first non-trivial GPT.

The interesting lesson was not about clever prompting. It was almost the opposite.

The GPT only started to become useful when I stopped treating it as something I could configure with a good instruction and a pile of documents, and started treating it as a small software project: source material, structure, behaviour, tests, iteration, version control.

That sounds obvious after the event. It was not obvious at the start.

The First Attempt Failed in a Familiar Way

My first attempt was probably the default path most people take.

I uploaded a set of existing documents, wrote a reasonably sensible instruction, and expected the GPT to work out the rest.

It did not.

The answers were not terrible. That was part of the problem. They were plausible, broadly relevant, and occasionally useful. But they were also vague, inconsistent, and too willing to drift away from the standards I was trying to enforce.

The real warning sign was repeatability. Ask the same question more than once and the answer would shift in ways that mattered. Nothing was obviously broken, but it was not dependable.

Looking back, the reason is clear. The documents were written for humans, not retrieval. Important guidance was buried inside larger documents. Some assumptions were implicit. The prompt was trying to compensate for weak source material. Behaviour and knowledge were mixed together.

At that point I stopped tuning and started again.

That was the right decision.

The Knowledge Base Matters More Than the Prompt

The second attempt started with the documents, not the GPT instructions.

I took the source material and converted it into Markdown. Then I stripped out noise, removed irrelevant sections, and reorganised the content around topics rather than original documents.

That distinction matters.

A document written for a human reader usually has a narrative structure. It explains, introduces, repeats, and provides context. That is useful when someone is reading from start to finish. It is much less useful when a GPT is trying to retrieve the right fragment of knowledge in response to a specific question.

For retrieval, the unit of structure needs to be smaller and sharper.

Instead of preserving large documents, I split the material into focused modules: one for each area of knowledge that needed to be used accessed and used together - essentially split into topics. Each file had a job. Each one answered a class of question. This was especially important if similar informaiton was split over many source files.

This was the first real improvement. Not a better prompt. Better source structure.

Restructuring Exposes the Gaps

Once the material was split into modules, the gaps became much easier to see.

Some processes were missing steps. Some terms were used inconsistently. Some guidance depended on knowledge that had never been written down because, in the original context, everyone involved already knew it.

That is the awkward thing about turning human knowledge into machine-usable knowledge. You find out how much of the contract was never in the text.

So I added missing sections, standardised terminology, and made assumptions explicit. I did not try to make the documents longer. I tried to make them less ambiguous.

This made a bigger difference than I expected.

The GPT became more stable not because it had been told to be stable, but because the material it retrieved was clearer.

Codex Became the Build Tool

This was not a manual editing exercise.

I used Codex to do the heavy lifting: converting source material into Markdown, splitting files, reorganising sections, identifying missing topics, and checking consistency across the knowledge base.

That changed the economics of the work.

Manually doing this across many files would have been slow and error-prone. With Codex, the work became iterative. I could ask it to restructure a set of files, inspect the result, correct the direction, and run another pass.

This is where the process started to feel less like writing a prompt and more like shaping a codebase.

The files mattered. The structure mattered. The naming mattered. Repetition and ambiguity mattered. The same instincts you use when cleaning up software applied here too.

The Prompt Is the Control Layer

Only after the knowledge base was in reasonable shape did I write the main instruction.

This is the opposite of how I started.

The main prompt should not duplicate the knowledge base. If it tries to do that, it becomes bloated, fragile, and hard to reason about. Its job is to define behaviour:

what the GPT is for
how it should answer
when it should use the knowledge files
what it should do when the answer is uncertain
what standards it should not compromise

I kept the instruction deliberately constrained. The more knowledge I pushed into the prompt, the worse the design became. The prompt is not the application. It is the control layer.

That separation between behaviour and knowledge is probably the most important design principle I took from the exercise.

Testing Has to Start Earlier Than Feels Natural

The other mistake I made was leaving test questions too late.

I eventually built a small test set covering normal questions, edge cases, ambiguous requests, and questions where the GPT should refuse to invent an answer.

That should have existed earlier.

Without a test set, you are just having a conversation with the GPT and deciding whether it feels better. That is not enough. The model can improve in one area while regressing in another. It can produce one excellent answer and still fail the same class of question five minutes later.

Testing gave me a way to see whether the system was becoming more reliable, not just more impressive.

I also found it useful to test in Codex before moving into the GPT builder. Codex was faster for file-level iteration, easier for inspecting the knowledge base, and better suited to making structural changes. But that did not remove the need to test again in the actual GPT runtime.

The two environments behaved differently enough to matter.

Codex and GPT Runtime Are Not the Same Thing

This was one of the more practical lessons.

Something that worked well in Codex did not always behave identically once deployed as a GPT. Retrieval could differ. Tone could shift. A file that seemed obvious in the build environment might not be used in the way I expected at runtime.

So the workflow became a loop:

Build and restructure in Codex
Test in Codex
Deploy to the GPT
Test again
Fix the source material, not just the prompt

That last point matters.

When a GPT gives a weak answer, the temptation is to add another instruction. Sometimes that is right. More often, the problem is lower down: the relevant knowledge file is too vague, too large, badly named, or missing the thing the model needs.

Prompt patches feel quick, but they accumulate into a mess.

Fixing the source is slower in the moment and better over time.

Git Is Not Optional Once This Becomes Serious

As soon as the GPT became a multi-file system, Git became necessary.

I wanted version history for the knowledge files, the main instruction, and the test questions. I wanted to be able to experiment, roll back, compare approaches, and understand why the GPT had changed.

Without Git, the process would have become guesswork very quickly.

This is another reason the exercise felt more like software than prompting. Once you have source files, build steps, tests, and runtime behaviour, you are no longer just configuring a chatbot. You are maintaining a system.

What I Would Do Differently

Next time, I would start with a module map.

In this build, the modules emerged from the documents I already had. That worked, but it was backward. A better approach is to begin with the questions the GPT needs to answer, the domains it needs to understand, and the boundaries between them.

Then the documents can be shaped to fit the model, rather than the model inheriting the accidental structure of the documents.

I would also split files more aggressively. Some of my early modules were still too broad. Smaller files retrieved more cleanly and produced less noise in the answers.

I would separate reference material from process material more deliberately. “How this works” and “how to do this” are different kinds of knowledge. Mixing them makes retrieval less predictable.

And I would define the test cases at the same time as the module map, not after the first working version. Tests are not a final validation step. They are part of the design.

The Real Lesson

The tooling makes GPT creation look simple.

In a trivial case, it probably is.

But if you want a GPT that behaves consistently, reflects a real body of knowledge, and gives answers you are prepared to rely on, the work is more structured than the interface suggests.

You are not just writing a prompt.

You are designing a knowledge base, a behaviour layer, and a feedback loop.

That is much closer to software engineering than most of the current language around GPTs implies.