

I have been thinking about the true cost of running LLMs (of course, Ed Zitron and others have written about this a lot).
We take it for granted that large parts of the internet are available for free. Sure, a lot of it is plastered with ads, and paywalls are becoming increasingly common, but thanks to economies of scale (and a certain level of intrinsic motivation/altruism/idealism/vanity), it has remained viable to provide information online without charging users for every bit of it. The same appears to be true for the tools used to discover that information (search engines).
Compare this to the estimated true cost of running AI chatbots, which (according to the numbers I’m familiar with) may be tens or even hundreds of dollars a month per user. For this price, users would get unreliable slop, and that slop can only be produced from the (mostly free) information already available online, while its production disincentivizes creators from adding more of it (because search-engine-driven traffic is drying up).
I think the math is really abysmal here, and it may take some time for people to realize how bad it really is. We are used to big numbers from tech companies, but we rarely break them down to the individual user.
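To make that break-down concrete, here is a minimal back-of-envelope sketch; every number in it is a made-up placeholder (not an actual figure for any provider), just to show what the per-user math looks like under such assumptions:

```python
# Back-of-envelope only: every number here is a hypothetical placeholder,
# not a real figure for any provider.
monthly_inference_cost = 700_000_000  # assumed total cost to run the service, USD/month
monthly_active_users   = 20_000_000   # assumed monthly active users
subscription_price     = 20.0         # assumed paid-plan price, USD/month
paying_share           = 0.05         # assumed fraction of users on a paid plan

cost_per_user    = monthly_inference_cost / monthly_active_users
revenue_per_user = subscription_price * paying_share

print(f"cost per user:    ${cost_per_user:.2f}/month")    # -> $35.00/month
print(f"revenue per user: ${revenue_per_user:.2f}/month") # -> $1.00/month
```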
Somehow this reminds me of the astronomical cost of each Bitcoin transaction (especially compared to the tiny cost of processing a single payment through established payment systems).
It’s quite noteworthy how often these shots start out somewhat okay at the first prompt, but then deteriorate markedly over the following seconds.
As a layperson, I would try to explain this as follows: at the beginning, the AI is, to some extent, free to “pick” what the characters and their surroundings look like (while staying within the constraints of the prompt, of course, even if this doesn’t always work out either).
Therefore, the AI can basically “fill in the blanks” from its training data and create something that may look somewhat impressive at first glance.
However, to continue the shot, the AI is now stuck with these characters and surroundings while having to follow a plot that may not be represented in its training data, especially not for the particular characters and surroundings it picked. This is why we frequently see inconsistencies, deviations from the prompt, or just plain nonsense.
If I am right about this assumption, it might be very difficult to improve these video generators (because an unrealistic amount of additional training data would be required).
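For what it’s worth, here is a toy way of picturing the mechanism I am guessing at. This is not how real video generators are implemented; it only illustrates how a process that conditions on its own previous output (rather than on where it started) can drift further and further away:

```python
import random

# Toy model, not a real video generator: each step tries to reproduce the
# previous step's *output*, not the original first frame, so every small
# error is carried forward and new errors stack on top of it.
random.seed(0)

def next_frame(state, step_error=0.05):
    # The "generator" only sees the current state and adds a small error.
    return state + random.uniform(-step_error, step_error)

first_frame = 1.0  # whatever the model freely "picked" at the start
state = first_frame
for t in range(1, 11):
    state = next_frame(state)
    print(f"t={t}: deviation from first frame = {abs(state - first_frame):.3f}")
# The deviation tends to grow over time, because each frame is anchored to
# the previous output rather than to the original pick.
```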
Edit: According to other people, it may also be related to memory/hardware constraints, etc. In that case, my guesses above may not apply. Or maybe it is a mixture of both.