Effective use of AI
Artificial intelligence is a red hot topic right now, in tech, economics, and politics. There is always something going through a hype-cycle, especially in tech, but this time around the AI hype is at levels beyond the norm.
My business has received multiple federal surveys about our current and planned use of AI. There is federal legislation focused on AI. International trade restrictions. Celebrities making claims like "we will replace all programmers within 5 years" and "this will be as big as the industrial revolution". There was even a letter from my church addressing appropriate uses of AI.
So what exactly is "AI" and how can we use it effectively?
1. What is AI?
First let's clarify some terms. In general, when people speak about "AI" they appear to mean "AGI" (artificial general intelligence). Most assume that what we are seeing with ChatGPT, Grok, Gemini, etc. is AGI now, or is on a path to AGI. But while what ChatGPT, Grok, Gemini, etc. can do now is impressive, and how they do it is fascinating, they are not AGI, and even claiming that we are on a path to AGI is an extraordinary claim. Artificial intelligence looms large in our science fiction: humanoid robots, superintelligence, and a potential competitor to mankind. We are not there, or even about to be there. AGI would require a major breakthrough, probably multiple major breakthroughs. We understand how quantum computers and nuclear fusion work, but we can't yet build generally useful versions of either; we don't yet have even a solid theoretical understanding of intelligence, so AGI feels further away still. What we have seen in recent years is rapid improvement in specific artificial intelligence techniques, particularly large language models (LLMs), transformers, and unsupervised learning. We'll cover these techniques in more depth throughout this article, but from here on out I'll refer to them generally as LLMs rather than as AI, to avoid the verbal and mental assumption of AGI.
It is helpful to realize that we've been here before. In the 1960s it was a common academic and industry belief that writing computer programs was simply too hard for humans, and that we had to invent an artificial intelligence that could do it for us. It was also believed that we were on the cusp of doing exactly that. It was a mainstream topic: Hollywood made movies about our AI future, science fiction dealt heavily with AI, and it became a matter of government policy.
ChatGPT and other current LLMs look surprisingly like the AI from the 1960s, with the equivalent of ChatGPT being ELIZA, an AI "psychologist". (Those were more trusting times.) ELIZA was much like what ChatGPT is now: it was cutting edge research and required expensive, powerful compute to run. However, with the passage of time and growth of computational power, ELIZA is now more of a toy. For example, the text editor that I am writing this in includes a collection of whimsical vintage computer games, one of which is a copy of ELIZA. ELIZA can even fit in a web browser, so you can try it out at http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm
The predominant technique at the time was Expert Systems, and it was hoped that through incremental improvement they would live up to their name. However, they have since been replaced by cheaper unsupervised learning.
By the 1970s most people funding AI research had started to give up, leading to the "AI Winter", in which, with little funding and little progress, AI related research and businesses were wiped out.
It also isn't the case that society gave up too soon, or that if we had simply stuck with it we would have seen ChatGPT level functionality sooner. Some companies did stick with it: Cyc, for example, invested some 40 years, hundreds of millions of dollars, and thousands of man years of effort into the expert systems approach. However, the result has ultimately been a dead end. The progress of the last few years came only because we discovered entirely different techniques. The modern techniques also require levels of computational power, and quantities of data, that haven't been available until the last decade or so.
So what do we have?
The most important recent advancements have been in unsupervised learning, transformers and large language models (LLMs). ChatGPT, Grok, Gemini etc. are all LLMs, and they have been "trained" using "unsupervised learning".
To really understand this we have to step back a bit. Let's start with a simpler "AI" that everyone is likely familiar with: a spell checker. How does your computer "know" that you have misspelled a word, and how does it offer smart suggestions?
The first part is straightforward: we need a dictionary of all the words in the language. While that is trivial now, it was once considered a large amount of data, which was challenging to license (or produce on one's own), and potentially hard to fit into a computer's memory.
Once we have that the computer can compare every word you write to see if it exists in the dictionary. If not, it shows a red squiggle under it so you know you messed up.
Then how do we handle the second part, suggesting what word might have been meant? Similarly to how you likely learned to spell, there are spelling rules like "i before e except after c" that we can "teach" the computer. This is how expert systems were built — humans manually cataloging facts and assembling logical rules for how all information should be handled.
However handcrafting rules is laborious, and expensive. It can work for a well defined domain, such as English spelling, but if you tried ELIZA you can see that it quickly breaks down for harder problems, such as "intelligent English conversation".
Originally spell checking was a hard problem for computers to handle. Not just because of the creation, encoding, and expression of the rules, but because of the very limited amounts of memory and computational power available.
Now that we have additional memory, compute, and data available we can implement a different approach for suggesting spelling corrections. To do so we'll need two pieces: a measurement of how "far apart" words are from each other, and a measurement of how likely a word is to appear in a text. For the first, Levenshtein edit distance will do nicely. For the second, we can use any large corpus of English text; for example, Wiktionary, Project Gutenberg, and Wikipedia all make it easy to download their entire contents. Then we can run that through a program to count word frequencies.
Now that we've done this preparation we can implement our spell checker. We'll first have the computer compare our misspelled word against the dictionary and find us a list of candidate words that require the fewest edits to convert the misspelled word into them, ranking them by edit distance. Thus the computer can easily see that correcting "thes" to "this" or "these" is much more likely than correcting it to "thesaurus": those words are "closer" to what it was given. But what if we are given a misspelling of "thi"? Is it more likely that the user meant "this" or "the"? Here we use our word frequency calculations, see that the word "the" appears far more often than "this", and so suggest both, but rank the correction to "the" higher.
(Real world spell checkers do a lot more than what we've covered here. For example, humans tend to spell phonetically, so we should use an edit distance over word sounds instead of letters, etc. However our simplified mental model is on track. For a more complete look see AI researcher Peter Norvig's article on how to write a spelling corrector.)
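To make the approach concrete, here is a minimal sketch in Python of the spell checker described above. The word list and frequency counts are toy stand-ins of my own; a real implementation would load a full dictionary and count frequencies from a corpus like the ones mentioned earlier.

```python
# A toy spell checker: rank candidates by edit distance, break ties by frequency.
# The frequencies below are made-up illustrative values; a real checker would
# count them from a large corpus (Project Gutenberg, Wikipedia, etc.).
WORD_FREQUENCIES = {"the": 1_000_000, "this": 250_000, "these": 90_000, "thesaurus": 500}

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb), # substitute ca with cb
            ))
        prev = curr
    return prev[-1]

def suggest(word: str, max_distance: int = 2) -> list[str]:
    """Candidate corrections, closest first, more frequent words winning ties."""
    scored = []
    for candidate, freq in WORD_FREQUENCIES.items():
        distance = edit_distance(word, candidate)
        if distance <= max_distance:
            scored.append((distance, -freq, candidate))
    return [candidate for _, _, candidate in sorted(scored)]

print(suggest("thes"))  # ['the', 'this', 'these'] -- all one edit away, ranked by frequency
print(suggest("thi"))   # ['the', 'this'] -- both one edit away, 'the' is more common
```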
We can see a few interesting patterns here, ones that will be repeated as we discuss more complicated systems. We can replace manual human rule building with algorithms and calculations that can "learn", provided we have access to plentiful compute power and data. We can do a lot with probabilities. And we can come up with a mathematical measurement of the "distance" between words (and, as we'll see later, between ideas).
So to build a modern LLM, like ChatGPT, Grok, Gemini, etc., we will need data, compute, and algorithms. Let's dive in and see where each of these comes from, and what they look like.
To start we will need a large quantity of data. Computer scientists deal with big numbers, so what does "large quantity of data" mean? Numbers here are rough, but we mean big, as in "the entire internet isn't enough" big. I've heard that frontier models like ChatGPT are trained on datasets that are about four times the size of the public internet.
This means that companies that are building LLMs have a voracious appetite for data, and this can lead to some friction: should they be allowed to hoover up the entire internet? What about the copyright to the information they are ingesting? Should they be allowed to use private data? Are they allowed to profit from your copyrighted and or private data? etc.
Then you will need sufficient computational resources. Again the "sufficient" here means big, gobsmacking big.
Everyone is familiar with relatively inexpensive "consumer" levels of compute: your phone, laptop, or a desktop computer. You may also be aware that the likes of Google, Facebook, and Amazon have much larger numbers of computers "somewhere". That "somewhere" is in data centers. Large, generally nondescript buildings (think Costco, or larger), located in places with major fiber internet connections, abundant electrical power, and, when possible, in areas with low seismic activity and an absence of severe weather. They typically have their own onsite backup generators, extensive cooling systems, and keep a low profile. The hyperscalers (the biggest internet companies: Amazon, Google, Facebook, Microsoft, Apple, Netflix) operate a large number of these data centers, as do data center companies you have probably never heard of, such as Equinix, Digital Realty, NTT Global Data Centers, CyrusOne, and GDS Holdings, which rent out data center space. Some people are surprised to see Amazon on this list, and while Amazon does use a substantial amount of compute to power Amazon.com and related web properties, they also run "Amazon Web Services" (AWS), the largest cloud services provider.
To get an idea of the scale at which the hyperscalers operate: not only are they good at building the physical structures and setting up semi-truck loads of servers, it is becoming increasingly common for them to start by building or buying a power plant to power them.
The preferred compute for LLM workloads is the GPU (graphical processing unit). As their name suggests, GPUs were originally developed to accelerate graphical applications, especially video games. They are still used for that, but have been in strong demand for other workloads for the last decade or so. First because cryptocurrency mining worked better on GPUs than CPUs (central processing units, the main computing unit in your phone or laptop), and now because AI workloads are much more efficient on GPUs. (The cryptominers have mostly moved on to other techniques, such as ASICs.)
NVIDIA is the largest manufacturer of GPUs, and has ridden the rocket of cryptocurrency and then AI demand to become one of the largest companies in the world. Current estimates are that they have five offers to buy each GPU chip they manufacture, which has kept prices elevated, and the supply of GPUs constrained.
To get a more concrete idea of what a GPU is, the NVIDIA A100 GPU is a common "workhorse" model; it is available on Amazon for $7-10K each. If you have the cash, you could also get a later, more powerful option like the H100 for around $27,000.
If you are a hyperscaler you are going to be deploying tens of thousands to hundreds of thousands of these in a single data center (there is a tidy price tag on that).
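How tidy? A rough back-of-the-envelope calculation, using the approximate per-unit prices above (my assumptions, not vendor quotes):

```python
# Back-of-the-envelope: hardware cost of the GPUs in one large deployment,
# using the rough per-unit prices mentioned above (assumptions, not vendor quotes).
A100_PRICE = 10_000   # ~$7-10K each
H100_PRICE = 27_000   # ~$27K each

for gpu_count in (10_000, 100_000):
    a100_total = gpu_count * A100_PRICE
    h100_total = gpu_count * H100_PRICE
    print(f"{gpu_count:,} GPUs: ${a100_total / 1e9:.1f}B in A100s, "
          f"or ${h100_total / 1e9:.2f}B in H100s")
# 10,000 GPUs: $0.1B in A100s, or $0.27B in H100s
# 100,000 GPUs: $1.0B in A100s, or $2.70B in H100s
```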
You can also rent time on GPUs from most cloud providers, though you are generally looking at a few dollars per minute and (somewhat oddly, given the per-minute pricing) lengthy contract commitments.
We'll dive into how the algorithms work in a minute, but for now let's assume that they are available and review the process at a high level.
Once you have your algorithm, a massive trove of data, and somewhere between a few and a hundred thousand GPUs, the next step is training. In our spell check example the training just involves counting word frequencies in our text corpus, which is something a modern laptop can do in a few minutes. Training a modern LLM is more time consuming. Training is handled as a batch job — we start it, run it in parallel across multiple GPUs, and leave it running uninterrupted for some time. Depending on the model, the amount of data, and the number of GPUs available, this "some time" will be between hours and months. For practical reasons hyperscalers generally increase the number of GPUs used to keep the training period down to weeks or less (you can't make rapid progress if it takes months to see if your changes made an improvement).
The output of training is a set of "weights". We'll cover these in greater depth later, but for now think of it as a big blob of data. "Big" here is much smaller than the training data, in the range of a DVD to a hard drive in size. It is possible to tune the probabilistic model ahead of time to select how big the resulting weights will be. Generally we describe weights by the number of parameters. A full size model may be around 70 billion parameters (though it might scale up to a few hundred billion) and a small model might be around 3.5 billion. A larger model (more parameters, larger weights) will give better responses to queries; a smaller model will require less compute to run and will respond faster to queries, but give less detailed responses. We use smaller models in constrained environments — cell phones, automobiles, etc.
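To see why the weights land "in the range of a DVD to a hard drive", multiply the parameter count by the bytes stored per parameter. A quick sketch, assuming the common case of 16-bit (2-byte) weights:

```python
# Rough size of a model's weights: parameter count x bytes per parameter.
# 16-bit (2-byte) weights are a common choice; quantized models use 1 byte or less.
def weight_size_gb(parameters: float, bytes_per_parameter: float = 2) -> float:
    return parameters * bytes_per_parameter / 1e9   # decimal gigabytes

for name, params in [("small model (3.5B)", 3.5e9), ("full size model (70B)", 70e9)]:
    print(f"{name}: ~{weight_size_gb(params):.0f} GB at 16 bits per parameter")
# Output: roughly 7 GB for the small model (a couple of DVDs)
#         and 140 GB for the full size model (a small hard drive)
```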
In the above I've referred to giving the LLM information as queries, that is not wrong, but in proper parlance what we provide the LLM is called a "prompt" and giving the LLM a prompt is referred to as "prompting". Someone who writes good prompts is a "prompt engineer". I'm skeptical about "engineer" being an appropriate title.
It is significant to note that training is a distinct step. When you are typing something into ChatGPT it isn't learning from your prompts. OpenAI (the makers of ChatGPT) could store all your prompts, and add them to the trove of data that will be used to train the next model, but the model you are using isn't learning from them.
There are some additional terms related to prompts that we should define: context and token. Context rounds to "the prompt". It is possible to use prompts that include substantial amounts of background information, from paragraphs to lengthy documents or books. Most systems also have two prompts, a "system prompt" and a "user prompt"; unless you have programmatic access to a system, you will generally only interact with the user prompt. The system prompt is used to give the LLM instructions which the user cannot override. Tokens are the atomic (indivisible, most primitive) pieces of information that an LLM uses. Each model tends to tokenize differently, and a token could be as small as a single character or as large as a phrase. A reasonable approximation is that a token is 4 characters in length. You can use tools like OpenAI's tokenizer to see how many tokens a particular prompt contains. Tokens for multi-media LLMs (for images, video, etc.) are measured differently.
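As a small illustration, here is a sketch using tiktoken, OpenAI's open source tokenizer library (assuming you have it installed; other models use different tokenizers, but the idea is the same):

```python
# pip install tiktoken  (OpenAI's open source tokenizer library)
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # an encoding used by several OpenAI models

prompt = "The quick brown fox jumps over the lazy dog."
tokens = encoding.encode(prompt)

print(len(prompt))                        # 44 characters
print(len(tokens))                        # roughly 10 tokens: common words are ~1 token each
print(encoding.decode(tokens) == prompt)  # True -- tokenization is reversible
```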
We've named this time period, when you aren't training the LLM but are prompting it and receiving answers, "inference time". While training is run as a batch job, generally across many GPUs, inference can generally only make use of a single GPU for a single prompt. You can generally imagine it as your own (identical) copy of the weights, on your own GPU, responding to your prompts. (Again, for practical reasons the system is actually time shared; those GPUs are expensive and should run at 100%.)
So as a recap:
- We have a probabilistic model
- We have a trove of data
- We have compute, in the form of GPUs.
- We push the data through the probabilistic model, which generates weights. We can plan the resulting weight size when we start training. We use larger models for greater accuracy, smaller ones for lower inference time compute costs and faster responses.
- We generally move those weights to another system/set of machines, again with GPUs, for responding to user prompts. This is called inference time, and generally is not parallelizable for a single prompt. We can handle more prompts by splitting them across more GPUs, but the model weights are limited to what can fit into a single GPU / we can't split a hard prompt across multiple GPUs well.
There is a caveat to that last point. Cutting edge systems are experimenting with running a single prompt through multiple models at the same time, then feeding all their responses into a single model to summarize, ideally creating a better response than any individual model could.
This training process is expensive. There is a significant cost in GPUs, in gathering and storing the training datasets, in power, and in engineering work: the researchers creating the models, the data engineers gathering and loading the data during training, and the engineers creating the systems that let users prompt the model and see the answers. We can presume that the companies doing this work will want to eventually make a profit from it — either through charging directly for access to the LLMs or tangentially, such as through ads. While there are many free offerings now, it looks like the target price is probably around $20/user/month for personal use, and maybe $200/user/month for business uses.
GPUs are a bottleneck for LLMs. For training, the number of GPUs we have available, and their capabilities, determines the time we will need for training, and not being able to obtain a sufficient number or quality will therefore slow training down. At inference time we want the weights to fit in the memory of a single GPU, and so are limited in how big a model we can create. As GPUs get more capable, we expect to see a corresponding improvement in LLM performance, similar to how computers have improved over the last decades.
The US has also attempted to slow rival nations' development of AI by limiting the number and quality of GPUs that they have access to. These export controls have been deployed particularly against China since the first Trump administration, and additional controls are in the works.
Export controls are not a "perfect" technique for the nation deploying them. For example, there are conflicting reports about the degree to which parties outside the US have purchased GPUs for resale to China, thereby circumventing the controls. It does appear that the controls have reduced the availability of higher performance GPUs in China. But in turn this has incentivized the Chinese to develop their own GPU manufacturing capabilities and to emphasize more efficient designs in their LLM research.
Significantly, Chinese researchers announced (and shared) their own cutting edge LLM, "DeepSeek", over the Christmas holiday in 2024. The model is an impressive leap forward, putting the Chinese officially back on the cutting edge of AI development, and importantly it was substantially more efficient than western models at the time. Western researchers are responding, but what the Chinese have done is impressive, and it appears they did it in substantial part because of the export controls.
DeepSeek is a good model. It has been examined extensively for a "Chinese" bias; however, the version shared appears to have been trained just like western models, on the entire internet, and has the same biases as western models. The Chinese have also published everything needed to train DeepSeek on the data of your choice, which would remove any such bias. The DeepSeek models simply require less GPU power to train and less at inference time; they were also developed quickly, and with a smaller team than most US based companies have.
Modern data centers, and LLMs in particular, consume substantial amounts of electricity. This has meant that hyperscalers have been buying power plants, planning data center locations based on the availability of power, recommissioning closed power plants, etc. However, if current growth charts hold, we are in for some interesting times. Electricity usage in the US has mostly remained steady since about 2007. The US has continued to add power generation to the grid, but mostly this added capacity has been used to shut down older, dirtier power sources such as coal. It has been a long time (25 years or so) since we've seen significant year over year increases in power demand. Over that time the regulatory environment for additional power generation has become arduous, dramatically slowing the ability to commission new plants and new transmission lines. Into this environment we are adding a tech industry demand curve: up and to the right. Tech companies target rapid, even exponential growth, and are seeing exponential demand for LLM services with matching rapid increases in the amount of electricity they need. The speed at which we can build data centers, and the demand for electricity, far exceeds our ability to build new power plants and power distribution systems. It is unclear at this point how the projected demand will be met — tech systems / the demand side simply scale far faster than the supply side. Tech companies are searching for options and investing in nuclear power. There is also speculation that this demand is part of what is behind President Trump's declaration of a national energy emergency. If this topic interests you, check out this podcast about the AI energy bottleneck.
While the challenges faced by the hyperscalers to deliver GPU compute on a national level are substantial, it is possible to train and run your own LLM, even at the personal or small business level.
This is possible because of the strong culture of openness in computer science. This spans from free and open publishing of fundamental research, to freely sharing trained models. Large companies will train models using the process and data we described earlier, and then publish variants of them on Hugging Face and Github. They can then be freely downloaded, and either used as is, or further trained on your own data.
GPUs are available and are within the price range of other hardware/physical assets that a business might purchase. The models can also be trained, or run, on CPUs; they will simply train and run slower. I regularly run models on my laptop with reasonably good performance.
If this interests you llamafile is a project that turns models into a single executable — which makes them trivial to download and run.
Here are some llamafile wrapped models to try things out:
- Gemma, the open version of Gemini
- OpenAI's Whisper, for speech to text.
- LLaVA for image to text
- Mistral, a high performance model for its size
While it may be of interest to fewer readers, I think it is important that the fundamental research is also open. For example the paper "Attention is All You Need" that kicked off the current LLM era is freely published by Google's research team.
Let's dig into that fundamental research a little and round out our understanding of what an LLM is "under the hood".
To quote OpenAI: "The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens."
Recall that LLMs don't generally use words, but rather tokens. That distinction is unimportant for now, so we could rephrase the above as:
LLMs learn the statistical relationships between words, and excel at producing the next most likely word.
That is fundamentally what they do, nothing more, nothing less. We humans have what might be a "bug" in our psyche — a strong tendency to anthropomorphize. We readily ascribe human emotions, thought patterns, or attributes to non-humans. So in case it is unclear, LLMs:
- Have no consciousness
- Have no volition
- Only learn during training. They cannot learn or get better on their own
- Generate probability based answers, rather than "think" as a human would.
They are so good at what they do that it can be hard to remember the above points.
You will recall from earlier that the AI systems of the 1960s were rule based: they relied on humans to assemble an ever growing corpus of highly structured rules. Assembling these rules is also called "training", though it is a manual form of it. One of the fundamental breakthroughs was the invention of unsupervised training. We set up the untrained model, feed it large amounts of data, and it works out the pre-existing patterns and "learns". This is amazing. Let's see how it works.
The first thing that we need is a way to convert words into something that the computer can work with. Computers are fundamentally that: computers. They are good at math, or "computing", and fundamentally bad at language, until we have a way to convert language to math. That is what an LLM represents — a language computer.
How could we express words and ideas numerically, so that logic becomes math? We are going to exploit a discovery about language: distributional semantics, which can be summarized as "linguistic items with similar distributions have similar meanings". This means that, similar to our spell checker earlier, if we examine a large corpus of language the computer can derive something on its own. In this case, instead of just word frequencies, we want to represent word meanings, and that is exactly what is encoded in language by the word frequencies and the frequencies of word combinations. This property holds for every human language. We can represent this meaning as mathematical vectors. This is the first step in any LLM's process, and is called "embedding"; one of the algorithms to go from words (or tokens) to vectors is Word2Vec (there are others).
The resulting vectors are not the "simple" two and three dimensional vectors of high school math, but rather higher dimensional vectors; GPT-2 for example uses vectors of 768 dimensions, and later LLMs trend towards more (1024, 2048, etc.)
To give an idea of the geometric "space" that this represents, a vector of 768 dimensions is more than sufficient to give a unique representation to every star in the universe, which gives us an immense canvas to represent ideas.
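To put rough numbers on that claim (the star count below is the commonly cited astronomical estimate of roughly 10^22 to 10^24 stars in the observable universe):

```python
# How roomy is a 768-dimensional space? Even if each dimension held only a
# single bit (real embeddings use floating point values), the number of
# distinct points dwarfs the estimated number of stars in the observable
# universe (commonly cited as roughly 1e22 to 1e24).
distinct_points = 2 ** 768
stars_upper_estimate = 10 ** 24

print(len(str(distinct_points)))              # 232 digits, i.e. on the order of 1e231
print(distinct_points > stars_upper_estimate) # True, by an enormous margin
```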
Once we have these vectors, we can now perform math on ideas. One of the most famous examples from the Word2Vec paper is: king - man + woman → queen. That is, the vector for king, minus the vector for man, plus the vector for woman, lands closest to the vector for queen.
I find this to be such a mathematical tickle. It is delightful.
You can give this a try / find other relationships with the tool at https://calc.datova.ai/
If you try it out, note that word case matters. You will also see that the real world is not as tidy as our example. In reality king - man + woman actually results in a vector that is closest to king; it is only by excluding the first vector that we get a clear winner in queen, and this holds true for many, if not most, comparisons.
It is also straightforward to get answers that are mathematically correct but semantic nonsense by giving nonsense: elephant + car → four-wheeler. And sadly, we can also break things by giving valid, sensible input: king - nation → reign. But then again, is a king without a nation a king? Or is king-ness a distinct trait? I think word2vec would say that a king without a kingdom isn't a king, and if we try that phrasing, king - kingdom → gentleman-at-arms, which sounds right on to me.
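If you would rather experiment locally, the gensim library can load pretrained word vectors and do the same arithmetic. A sketch, assuming gensim is installed and can download the pretrained GloVe vectors (GloVe is a different embedding algorithm than Word2Vec, but supports the same vector arithmetic):

```python
# pip install gensim  -- the first run downloads ~130 MB of pretrained GloVe vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # 100-dimensional vectors, lowercase vocabulary

# king - man + woman: most_similar adds the "positive" vectors, subtracts the
# "negative" ones, and returns the nearest words. The input words themselves
# are excluded from the results, which is exactly the exclusion discussed above.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top result, with a similarity in the 0.7-0.8 range
```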
The takeaway is that while the underlying math is extraordinary, real world use cases require a lot of additional complexity and human effort, and don't cover everything. Also, LLMs are composed of layers, each building on a foothold of functionality and mathematics to create an extraordinary output.
So far we have tokenized our input and converted those tokens into a vector representation of their semantic meaning; now we will feed them into a "transformer".
The transformer design comes from the seminal paper "Attention is All You Need"; prior to that the state of the art was convolutional and recurrent neural networks. The breakthrough with transformers was a better (more computationally efficient) way to handle the increased complexity of handling multiple tokens, i.e. handling sentences, paragraphs, and entire documents. Meaning is of course not entirely stored or transferred in individual words, but instead in sentences, paragraphs, and larger collections of text. The breakthrough was in having a way to "pay attention" to multiple parts of the text at the same time — convolutional and recurrent neural networks were inherently serial, while transformers can perform attention in parallel. A transformer's operation is similar to what we have already covered: it operates on the vector representation of semantic meaning, taking in, and outputting, these high-dimensional vectors. Transformers are layered, the first taking as input the vectors from the embedding step (such as Word2Vec) and outputting a set of vectors, which are then taken as input by another transformer. These are "stacked" to create a system of one to several dozen. Within each transformer is a series of mathematical transformations on the vectors. Using matrix math we try to arrive at a set of "weights" — additional vectors — that, when multiplied by the vectors provided in the training material, will output the remaining values.
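The "attention" operation at the heart of each transformer layer is compact enough to sketch directly. Below is a NumPy sketch of scaled dot-product attention as described in "Attention is All You Need"; the toy inputs are made up, and a real transformer would use separate learned projections of the embeddings for Q, K, and V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Every output row is a weighted mix of the value vectors, so each
    position can "pay attention" to every other position at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # position-to-position attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax across positions
    return weights @ V

# Toy input: a "sentence" of 4 tokens, each embedded as an 8-dimensional vector.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
output = scaled_dot_product_attention(embeddings, embeddings, embeddings)
print(output.shape)   # (4, 8): one updated vector per token, all computed in parallel
```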
For example let's say we are training an LLM on the sentence:
The quick brown fox jumps over the lazy dog.
What we will do in training is take the first part of the sentence, i.e. "the quick brown fox", see what the LLM outputs with our existing weights, and check that against "jumps over the lazy dog." If it doesn't output something that matches, we average what we got with what we wanted, store the results, and keep training.
The end result is an idealized set of vectors that, when given a set of input tokens, will output the expected output tokens.
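A real transformer is far too large to reproduce here, but the "learn statistics, then emit the most likely next token" loop can be shown in miniature with a deliberately tiny stand-in: a bigram model that simply counts which word follows which. This is nothing like an LLM internally, but the shape of training and inference is the same:

```python
from collections import Counter, defaultdict

# "Training": count which word follows which in a (tiny) corpus.
corpus = (
    "the quick brown fox jumps over the lazy dog . "
    "the quick brown fox sleeps under the old tree ."
).split()

next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Return the most frequent next word seen during training."""
    return next_word_counts[word].most_common(1)[0][0]

# "Inference": generate a continuation one word at a time.
word, generated = "the", ["the"]
for _ in range(5):
    word = predict_next(word)
    generated.append(word)
print(" ".join(generated))   # e.g. "the quick brown fox jumps over"
```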
Which brings us full circle back to the quote from OpenAI that we started with:
The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.
Which hopefully makes more sense now. If not, or if you would like to go deeper, there is an excellent visual explanation of a transformer here. Of course real-world use is more complicated than the theoretical ideal (with issues of overfitting, the need to tweak the temperature of the response and/or the sampling methods, etc.)
I still find it amazing that LLMs work at all, and even more so that they work so well. It is also fascinating to contemplate the immense amount of data, and compute that they take as input, and the underlying order to language and meaning that allows them to function.
2. Experience Reports
Now that we have a better understanding of what "AI" or LLMs are, how do we use them effectively? What are they good at, and what are they not good at?
What follows is a collection of real world examples of LLM use, and some conclusions I think we can draw from the results.
2.1. The Project Manager
I was a subcontractor on a government IT modernization project. The team consisted of twelve or so specialists — software engineers, data engineers, domain experts, training & change management, quality assurance, etc.
As is common in IT projects, the work was divided up into two week "sprints". Ideally in each sprint a vertical "slice" of the overall functionality is delivered. The goal is that at the end of a sprint some amount of functionality is complete. It may be a small, but complete, piece, including conception, writing code, creating the user interface, importing of data, integration with the existing system(s), quality assurance, documentation, and training.
Prior to the start of a sprint the project's leader, the "project manager" is expected to determine what work will be delivered. He or she documents what needs to be done, splits the work into tasks and assigns the tasks to the respective team members.
This specific team was newly assembled for this project, with few team members knowing each other. The work started on schedule, all team members working on their respective tasks, etc. And then chaos. The tasks were well written — the spelling was excellent, they were grammatically correct, and on initial reading they appeared pertinent. But on closer examination they lacked substance. The tasks had things like "there shall be a system for doing X" but no description of that system. No details on who operates it, when, what data it uses, what its outputs are, etc., and the tasks didn't combine to create anything.
It is an unfortunate artifact of modern software development that the normal way of running a sprint, while intended to be flexible and effective, is more commonly rigid and unable to adapt to problems encountered during the cycle. (That is a topic for another article, hopefully coming soon) So it is with embarrassment that I have to report that twelve professionals churned hard for two weeks and accomplished zip.
Now at the end of each sprint there is a formal process, a "retrospective" meeting of all team members, that is specifically designed to review the sprint and identify ways the sprint process can be improved. During this meeting it was pretty clear to everyone involved that the task descriptions were the problem.
It was at this point that the project manager admitted that he hadn't written the task descriptions. He was under deadline pressure with the project's start date and so had instead fed a basic description of what was needed into ChatGPT and copy and pasted the responses into the tasks and assigned them out.
I think the issue was aggravated by it being the first sprint, with all team members being unfamiliar with each other, the project, and what to expect in general. This was also several years ago, so no one else on the team had much experience with "AI" and no one recognized the task descriptions as likely AI generated.
What went right?
The existing framework for the work, including team retrospectives, was able to identify the problem (AI generated task descriptions), and the team members, consisting of real "AI" — actually intelligent — people, could then come up with a solution.
The LLM doesn't have anything to its credit in this instance.
What went wrong?
The cost was significant: the average billable rate of these experts is around $150/hr, for 12 people, for two weeks. At this point many of the team members were not yet working on the project full time, so the final cost of the wasted two weeks was probably "only" about $108,000.
The problem was not so much a failing of ChatGPT — even a "perfect" ChatGPT wouldn't have been able to fix this, because the nature of the problem was misunderstood by the project manager. His job was not to "generate tasks with well written descriptions and assign them to each team member prior to the sprint"; that was only the tangible product that his work should have produced. His job was to plan the work, and that involves meeting with stakeholders, determining the best place to start the project, setting priorities, building consensus, consulting with experts as to the feasibility and best implementation strategy, etc.
The failing was compounded by the project manager passively claiming the work as his own. Passively because that was the team's default assumption, and no disclaimer was made.
2.2. The Urgent Care Physician
In my hometown there is a well regarded urgent care facility, run by Dr. Johnson (name changed to protect his privacy). Dr. Johnson is an excellent physician: his patients love him, and lots of people pick his urgent care facility over having a regular physician. The guy is awesome. He has enthusiastically adopted an "AI" powered tool to assist him with charting and maintaining patient records. He carries a tablet with him as he visits with patients. He explains to patients that the tablet contains an "AI" powered app that assists him with maintaining their records, that the app will be "listening in" during their visit, taking notes, and that he will be addressing the app during their visit.
Dr. Johnson says that prior to adopting this system he would come to work every day at 5am and spend three hours charting; now that he has this system he can simply come in at 8am. It has been a huge change for his work/life balance, and has made him much more interested in delaying his retirement.
What went right?
The "AI" use is saving the practice / Dr. Johnson a substantial amount. As a rough estimate, if Dr. Johnson's time is normally billed at $150/hr, and it saves him 3 hours a day, 5 days a week, that is about $9,000/mo. This isn't directly billable work, but it is something that directly contributes to Dr. Johnson's well being, and his desire to remain in a needed and demanding profession.
The "AI" use is in plain sight. Everyone is aware of its use. Additionally its use is very specific — it is not attempting to be general purpose, but rather an (LLM) that is tuned to perform a specific function.
What went wrong?
To my knowledge, little. The system has been in place for some months now and Dr. Johnson has been delighted with it. Some of his patients have privacy concerns about having a device intentionally listening in and recording their visits, but Dr. Johnson's bedside manner is excellent, and he is up front about what the device does. This has generally alleviated any concerns.
2.3. The Startup Entrepreneur
Sam (name changed to protect privacy) is a startup entrepreneur. He has money from his successful sports career, has gone through a startup incubator, and has access to potential angel investors and venture capital. He has an idea for a company, and has invested time over the previous two years to build out a prototype / MVP (minimal viable product) using low-code/no-code tools. He now wants to add features beyond what the low-code tools can handle. He has heard a lot about AI. The hype is so strong at the moment that the best way to get VC money is to add "AI" to your feature list: without AI no one wants to listen to you, with it, everyone wants to hear your pitch. He grabs some free online "AI" programming tools and prompts Gemini to write example code for him. By a process of taking what Gemini outputs, running it, Googling when it fails, and re-prompting Gemini, he has, as someone who has never written code and doesn't have prior programming experience, been able to write code.
What went right?
It was accessible: Sam was able to use free and readily available tools to tackle a task that he didn't have training for or any experience doing. In about a week he had code that mostly worked. Mostly worked was sufficient to refine his ideas about the feature and whether he wanted to continue putting effort into it.
What went wrong?
It wasn't sustainable. "AI" can write code, but its ability to debug code is more limited. Sam was able to make progress for the first few days but then got to where his code wasn't working, and he couldn't tell why, and the "AI" couldn't either. This was the point at which he came to me for help with his code. I could help him get it working, and move on to the next step, but from that point on he and the "AI" were again quickly stumped.
His code was also not ready for "production use": even if it ran, it would have had to be re-written by an experienced human before it could perform reliably enough to be part of a product.
2.4. The Senior Software Engineer
A friend of mine, Dave (name changed to protect his privacy), is the lead engineer for a small company. They are building an IoT (internet of things) device. Consequently they have interesting engineering challenges around network connectivity to hundreds of thousands of devices. They also need to keep the devices up to date, ingest data from them, etc. These devices also generate a large volume of data (for IoT devices), which needs to be handled in real time.
He has inherited the system from another team, so there is a codebase and product in place, but he needs to substantially revise it to handle the growing business.
He knows that he needs to re-architect how the system ingests the data from the devices and wants to game out a few possible architectures and compare implementation costs across two cloud providers (AWS and Cloudflare). So he types up a detailed description of the requirements, what he wants to accomplish, etc. and feeds it into Grok. Grok returns a substantial, multi-page response, detailing suggestions, trade-offs, and estimated performance levels, and makes a recommendation that he go with AWS and a specific architecture. Grok claims that the recommended solution will be cheaper to operate, easier to debug, and is a more "tried and true" solution. Dave and I often bounce ideas off of each other, so he sends me Grok's analysis and asks what I think.
I reviewed the extensive prompt and Grok's response, and pointed out several areas of concern to Dave:
- It looked to me like Dave had made an error in his inputs, adding an extra zero to the number of IoT devices that the system would need to handle.
- Several of the performance numbers that Grok stated and used to back up its arguments didn't look right. I'm not sufficiently familiar with the specific systems Dave is considering to be able to say immediately that the answers are wrong, but from what I know of such systems in general they didn't sound right.
- From personal experience I don't like to trust an "LLM" without it "showing its work". Actually I tend to want other engineers to show me the math too, not just LLMs!
Dave took my feedback and went back to tweaking his prompts to get Grok to show its work and hopefully double check its answer. When pressed to show its work, Grok revised its recommendation, suggesting a Cloudflare solution instead of AWS, but for the same primary reason: cost. Grok had originally said AWS was cheaper, but when pressed said that Cloudflare would cost about $4,959.28/month and that AWS would cost $17,828.99/month. Grok persisted in claiming the AWS solution was superior on other grounds.
What went right?
I would say very little. Grok was unable to back up its position, switching it when pushed for numbers, and persisting only where the measurement of quality was less concrete (what cost/benefit can we assign to "tried and true"?).
Dave has had a streak of AI enthusiasm since before it was popular and sees things in a more rosy light — he says that the LLM helped him to consider different alternatives and arrive at a better final solution.
What went wrong?
There is an issue with "confidence" and AI. When speaking with a fellow human we have a variety of ways of expressing, and judging, the confidence of an answer. As someone responding to a question, you use cues: firmness of voice, posture, tone, and word choice indicate a range of confidence, everything from "I think it might be but I'm unsure" to "I absolutely know this and will stake my life on it". Current generation general-purpose LLMs such as Grok don't do this. This is an active area of research, and the LLM does have some internal "confidence" indicators that might be usable, but the UI (user interface, how you interact with it) lacks any indication of confidence.
Grok simply, "confidently" asserts a well written response as factual. We humans have a "bug" in that anything written is generally given additional confidence over, say something spoken, and well written material over poorly written.
Humans also have an entire external system of confidence indicators — we trust a doctor's responses for medical areas, lawyers for legal questions, and plumbers for the kitchen sink. And via repeated exposure, we can identify which of our friends and family will speak with confidence on a subject they know nothing about, and which are deep experts in regard to their model train collecting hobby.
We don't have a mechanism to develop quite the same confidence in LLMs.
There is also an issue with supplying poor or inaccurate information in the prompt. Because I was familiar with the system Dave was working on, I could identify that the values he entered were off by a factor of ten. Someone else with a background in IoT might or might not have noticed that the numbers seemed off and double checked with Dave; the LLM doesn't have the ability to identify suspicious inputs or to ask how confident we are in them.
There is also the issue of the LLM's reasoning flipping from "use AWS because it is cheaper" to "AWS will cost 4x as much". Current LLM reasoning frequently flips its opinions when pressed.
2.5. Effect of AI on Education
I think the case of AI in education is an important one to review, because of its apparently consistent and large impacts. I don't have the same level of first or second hand experience here, and must rely on other people's accounts.
I have encountered a few people with positive views on AI in education. These have been expressed as hopes that AI may be able to personalize education to every student, or be used to reduce the work needed to grade student work, or to prepare lesson materials.
However these views appear to be hoped for outcomes, rather than current practice. What does appear to be happening, especially in higher education, sounds more like this:
The birth of AI has turbocharged student cheating. No, that doesn’t say it. It is a massive dose of anabolic steroids straight to the heart. No… that’s still underselling it. AI is more like fluorine. Fluorine is the most reactive element, hoovering up electrons from almost everything it gets near. If you add two oxygen atoms to two fluorine atoms, you get dioxygen difluoride. That stuff detonates things at -180C. It sets ice on fire and blows up if you add water vapor. It’s too aggressive to be used as rocket fuel, even though rockets are propelled by explosions. Chemists are terrified of it. AI is a tanker truck full of dioxygen difluoride firehosed straight onto the academic quad.
Taken from Why AI is Destroying Academic Integrity
This does not appear to be restricted to a few institutions, but rather is the general experience across all of academia. See also The Average College Student Today and:
Last year, as a college sophomore, ChatGPT and its counterparts gained traction, and I, just like 80% of students on my campus, began using them. Over the course of the next two years, I noticed that my brain was legitimately turning to mush, and so I've gradually been weaning myself off since... I hate to say it so brusquely, we are no less than drug addicts. And you know what they say, kicking a drug addiction is one of the most challenging things to accomplish. So yeah, this is how things will be. And I hope you see that students and faculty are on the same side of the issue.
From https://hilariusbookbinder.substack.com/p/comments-on-the-average-college-student
What appears to be going on is very rapid adoption of LLM tools by students; but since growth comes from doing the work, if the students don't perform the work themselves they don't grow.
3. Summary of Pitfalls
- Misunderstanding the nature of the work: confusing a work product, such as a document, test, or report, with the work that actually needs to be done and has historically been measured indirectly by the work product.
- Misattribution, or claiming authorship of work that was produced by an LLM.
- Lack of tools to communicate confidence, both from the user of an LLM to the LLM, and even more so, from the LLM to the user.
- Confusing the ability of an LLM to start a task with the ability to perform that task in full, such as conflating the ability to generate prototype code with the ability to debug or alter production code.
- Attempting to perform overly general tasks with LLMs. They appear to perform much better when they are specialized: a specific LLM, tuned to a specific task.
- Inaccurate information and hallucinations. How much can we trust an LLM's output if it is only "mostly" right? What if we need to base a decision on multiple "mostly right" answers in combination? For example, if you need three pieces of information to make a decision, and each piece is 80% likely to be right, the chance that all three are right is only about 50% (0.8 × 0.8 × 0.8 ≈ 0.51) — you might as well flip a coin or use a magic eight ball.
- Without performing the work ourselves, we do not grow. I think this may prove to be one of the key tradeoffs in using an LLM. It is similar to how business theory for the last few decades has emphasized outsourcing the non-essential, but never outsourcing one's core business. Users of LLMs should be careful not to outsource their "core business". Views on what one's "core business" is can differ: my friend Dave considers planning a system's architecture and communicating with humans his "core business", and writing code a great thing to outsource; I consider writing code to be my core business, with communication skills a close second, and neither to be touched by AI.
- It appears that we may replace many heuristics with LLMs. I'm unsure what this will mean, but the LLM based solution will definitely be more expensive in terms of compute, and it is much harder to understand how it operates.
4. Current effects of AI
It isn't necessary for "AI" to be capable of changing the world; widespread belief in its capability is sufficient to create a change, at least in the short term.
Some of the immediate effects of AI include:
- Massive growth for the companies supplying the underlying hardware / GPUs.
- Rapid build out of the physical world infrastructure to support AI.
- A move of capital into AI and AI related ventures. "Slap AI on it" is currently a reasonable business strategy.
- Lots of experimentation, and lots of companies being created. The companies that are successful are seeing faster adoption and faster revenue growth than the previous generation of successful companies. There are lots of unsuccessful companies too; the successes just tend to grow faster.
- AI is being used as part of the rationale for workforce reductions. Hard data that AI is actually improving workforce productivity is scarce, but the belief that it is, or the story that it is, is used as a rationale for workforce reductions.
- Emergence of AI powered fraud. The forms are varied, and growing: deep fakes which impersonate someone's voice or appearance; automation of, and increasing quality of, phishing attacks; and "credential" fraud, where someone uses AI's assistance to pretend to expertise they don't actually possess, to obtain a job, etc.
- Rapid consumer adoption. At this point the majority of adults in the US have used AI. Compare that with, for example, cryptocurrencies, which have been around for more than fifteen years, and have had their own intense hype cycle, but still have much lower penetration.
- A strong negative effect on education, in particular, higher education.
It looks like AI will have substantial effects in other fields, but hard data is still extremely limited, and it is unclear whether those effects are caused by AI's actual capabilities or only by the belief in those capabilities.
5. Looking to the future
Over the short term (the next five years) I think we will see a return to earth around AI expectations along with some disillusionment. People are currently expecting AGI like capabilities and I don't see us getting those for decades to come.
I think we may also see changes/breakthroughs around the hardware used for AI. GPUs are where all the action is right now, but memory bandwidth is more of the issue and we can get a lot more of it by redesigning the chip from the bottom up. See for example what Cerebras Systems has done. Their website doesn't cover their work in much technical detail, for a better overview see this podcast episode.
Over the longer term (ten plus years) I expect AI to make major changes to our world. In general though I don't think those changes will be the ones we expect. We expect I, Robot, but I think what we will get looks more like more accurate weather forecasts, self driving cars, and reduced credit card fraud.
In short I expect that we will see a growth in specific, narrowly applied AI and a reduction in expectations around general use cases.
A handful of examples: Stripe is using AI for fraud detection on billions of credit card transactions. You may not be aware, but credit card transactions cost the merchant accepting the payment a fee; a typical rate is around 2.9%. With a reduction in credit card fraud, that percentage could come down. A percent of what rounds to the US economy is a non-trivial amount of value.
Another example is in audio compression, which has implications for satellite based cell service and 5G deployments. For example Fabrice Bellard has developed an AI audio compression algorithm TSAC. This algorithm could be used to dramatically reduce the bandwidth used by phone and other digital voice communications. That reduction in turn can increase the range that a phone can be from a cell tower or satellite, and the number of calls that can be serviced by a single tower or satellite.
Another example is Whisper, a remarkably good speech to text engine — with uses from medical charting, to signal intelligence, to automatic transcription and search of business meetings, and transcribing court records.
We are also going to need to find ways to deal with the hazards of AI. These are not of the "Skynet will take over the world and rule mankind" variety, but of the more mundane "how will Johnny learn to be an engineer if ChatGPT does his homework" variety. How will we motivate students to learn, or employees to think when they can outsource to AI?
There will be other society wide challenges — if a state has the power to track every individual and deliver individually tailored "guidance" or "influence", how do we avoid totalitarian societies? Note I am less concerned about what existing totalitarian nations may do, and more about what we may need to do to defend our freedoms in the currently non-totalitarian states.
6. In Closing...
It is my experience that technology enables a brighter world, or a darker one, depending on the individual choices we make. Digital technologies have made more information available to more people than ever before in history. Do we use that to make a better, more educated world, or binge on vapid Youtube videos? Similarly the digital age has made encrypted, private communications available to everyone for free, and has also enabled universal surveillance. And note that this is an individual choice — nation states will surveil, but if you want privacy it is there for the effort. We may find that AI will destroy education, or that we will learn how to teach students first why they should learn. Your pick.
The world will hinge not upon the technology available, but upon the choices made by individuals.