Image transcription humbled me, plus thoughts on the bitter lesson
In my last post, I proclaimed that we could automatically transcribe text from social media screenshots, which would be a quick-and-easy win to improve image descriptions and accessibility everywhere. Today, I admit that, surprise surprise, it was not so easy. An outline:
Promising early results with Tesseract.js
Later challenges
The bitter lesson and how it impacts progress today
—
Promising early results with Tesseract.js
As I mentioned at the end of my previous post,
A huge proportion of images posted on Bluesky are not photos or any kind of visual content; they’re literally just screenshots of text from other social media sites, like X/Twitter, Tumblr, or Reddit.
I noticed this as I reviewed the dataset and kept seeing images like these:
Tesseract.js is an optical character recognition (OCR) library that runs in pure JavaScript, including on the client side. This was appealing to me for a few reasons:
I just wanted to capture text rather than describe visuals, so OCR should be good enough.
Tesseract provides bounding boxes, so it’s possible to detect only images that are mostly text (and ignore other images where there might be important visuals).
Client-side integration makes it easy to run on end users’ devices and present the text to the user to review before posting.
This is the Tesseract.js demo from their website:
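In code, that demo boils down to a single recognize call. A minimal sketch (paraphrased from the tesseract.js README; the image path and the logger are illustrative):

import Tesseract from 'tesseract.js';

Tesseract.recognize(
  'screenshot-of-a-tweet.png',       // path or URL of the image to OCR (placeholder)
  'eng',                             // language to recognize
  { logger: (m) => console.log(m) }  // optional: log recognition progress
).then(({ data: { text } }) => {
  console.log(text);                 // the recognized text
});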
And when I ran similar code over the tweet screenshot above, I got the following output:
2) lauren oes F: @NotABiglerk i'm not falling for that hot take. that's clearly someone with a fetish for getting yelled at. i refuse to participate in that kind of perversion
It’s not perfect, but it’s almost there! The UI elements of the profile picture and three-dots context menu are misrecognized and the username is slightly wrong. But the entire body of the tweet is correct. I was excited about this early result and ran it over the first 10,000 image posts recorded in my earlier dataset (image_posts.csv), and found…
Later challenges
…actually, most of the results were not nearly as good. To find social media screenshots, I filtered to a high confidence level (0.80 or higher) and limited to images where Tesseract’s bounding boxes covered at least 20% of the total image area.
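Roughly, the filter looked like this. It's a sketch rather than my exact script: it assumes you already know each image's pixel dimensions, approximates text coverage by summing word bounding boxes, and maps the 0.80 threshold onto tesseract.js's 0-100 confidence scale; the field names follow older tesseract.js output, so newer versions may require requesting word data explicitly.

import Tesseract from 'tesseract.js';

// Return the recognized text if the image looks like a mostly-text screenshot,
// or null if OCR confidence is low or text covers too little of the image.
async function extractIfMostlyText(imagePath, imageWidth, imageHeight) {
  const { data } = await Tesseract.recognize(imagePath, 'eng');

  // Approximate how much of the image is text by summing word bounding boxes.
  const textArea = data.words.reduce((sum, word) => {
    const { x0, y0, x1, y1 } = word.bbox;
    return sum + (x1 - x0) * (y1 - y0);
  }, 0);
  const coverage = textArea / (imageWidth * imageHeight);

  const confident = data.confidence >= 80; // tesseract.js reports confidence as 0-100
  return confident && coverage >= 0.2 ? data.text : null;
}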
Even with these restrictions, it turns out there are lots of variations on social media screenshots. Keeping it just to tweets, I saw various different issues that tanked the quality of the results:
Variations in UI elements like timestamps, reaction counts, and the “Follow” button, which added more noise to the recognized text.
People often screenshot not just a single tweet in isolation, but a series of tweets as a thread, or quote tweets.
Tesseract.js doesn’t handle emojis directly.
Here’s an assortment of tweets that I thought should be good candidates for automated text recognition, but didn’t work well with my approach:
These are not crazy edge cases. They are common examples of real images that people post in the wild. And for each example, I could imagine good old-fashioned image processing that would produce better results.
But then I took a step back. I had already trimmed my dataset from 10,000 images to a little over 100 by filtering for high-confidence, mostly-text images. I’m sure there are other formats that would require special effort. Was it really worth the time to build out social media-specific image processing to automate transcription for 1% of images posted?
I knew it was possible to do better. But the nagging thought kept coming up: isn't this the bitter lesson?
The bitter lesson and how it impacts progress today
The bitter lesson is Rich Sutton’s observation that, historically, general methods that scale with computation have beaten approaches built on specialized human knowledge.
…researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researcher's time, when, through Moore's law, massive computation became available and a means was found to put it to good use.
Rich Sutton: The Bitter Lesson
Indeed, if you slam these images into multi-modal LLMs like ChatGPT or Claude today, which are based on massive computation and training data, they produce just about perfect image transcriptions. They are slower and more compute-intensive than Tesseract.js, which makes them impractical to run on the client-side today. But I can believe that in a few years, there will be more specialized open-source models that will run just fine, and do a better job.
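For what it’s worth, getting a transcription out of one of these hosted models is only a few lines of code. A sketch using the OpenAI Node SDK (the model name and prompt are illustrative, and it assumes an API key in the environment):

import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a multimodal model to transcribe the text in a screenshot.
async function transcribeScreenshot(imageUrl) {
  const response = await client.chat.completions.create({
    model: 'gpt-4o', // illustrative; any vision-capable model
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe the text in this screenshot verbatim.' },
        { type: 'image_url', image_url: { url: imageUrl } },
      ],
    }],
  });
  return response.choices[0].message.content;
}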
To me, that’s demotivating. Why should I invest now in handcrafted logic to nail automated transcription of these screenshots, when the magic of scaling will catch up in a few years? I could go do something else with my limited time on this planet.
I worry that people feel this way about almost any problem, and that it’s holding us back from real improvements that are available today. As Drew Breunig wrote about the bitter lesson and computer chess (cribbing from Aidan McLaughlin):
Leela is a deep learning model that, “started with no intrinsic chess-specific knowledge other than the basic rules of the game.” It learned by playing chess, at an absurd scale, until it was the best in the world. A true example of the bitter lesson.
Then Stockfish adopted a small, purpose-built search model inside its conventional chess program. Today, Stockfish remains unbeaten – and can run on your iPhone. By not embracing compute as the primary lever, the Stockfish team not only delivered quality, but delivered something everyone can use, often.
Drew Breunig: Does the Bitter Lesson Have Limits?
It feels like a mixed bag that there isn’t truly low-hanging fruit for improving the quality and frequency of image descriptions: good that I can’t make a drive-by improvement in a few hours (which would have indicated a real lack of attention to accessibility), but bad that I didn’t make a real improvement. I have been sufficiently humbled.
Can we do better? Of course it’s possible. But to me, the emotional resistance is just as much of a barrier as the technical challenge.
Working on something similar? I’d love to chat - you can reach me at bobbie@digitalseams.com .