Human-AI collaboration and understanding AGI

In this week’s newsletter, we are going to extend last week’s hypothesis on personal digital assistants to the physical world. We will also explore how text-to-image generation models could be turned into a product.

The Evolution of Agents - from specific tasks to General Intelligence

In today’s rapidly evolving tech landscape, agents have become a hot topic, especially in the wake of the latest Consumer Electronics Show (CES). Giants like Samsung and LG introduced their versions of home robots/agents, and an innovative combination of hardware and software such as ‘Rabbit’ might redefine how we interact with technology. But let’s take a step back and consider something: don’t agents already exist in our homes today? Consider your washing machine: you could imagine it to be an agent that takes in your dirty clothes and delivers them clean, dry, and ready to wear. Or think about your robot vacuum cleaner, diligently mapping your home and cleaning according to your preferences. Okay, calling them agents might be a bit of a stretch, but these devices can perform specific tasks autonomously to make our lives easier.

Can we combine recent advances in the software stack with the hardware stack?

We are now entering an era where the capabilities of these robots are no longer confined to singular tasks. With the advent of Large Language Models (LLMs), these agents are evolving from task-specific tools into systems capable of handling a broad range of responsibilities. This shift could herald a move towards a higher level of abstraction. Imagine a smart home assistant that doesn’t clean or wash itself but manages an array of other task-specific agents, streamlining our lives even further. In last week’s newsletter, we explored this idea with digital assistants, but recent developments indicate a stronger emphasis on intelligent, adaptable software that is tightly integrated with hardware. This week, for example, Figure announced progress on its robot, which has a breadth of capabilities and has learned to make coffee.

Another impressive demo appeared to show ALOHA, an open-source hardware robot, cooking a meal. In many of the demos the robot is being operated remotely, but what is fascinating is its breadth of capabilities. We’re seeing a transition from machines designed for specific tasks to robots that are versatile, general-purpose systems. And it’s open-source!

The Path to AGI: More Than Just Advanced Agents

Imagining the evolution of these agents into more generalized, capable assistants naturally led me to ponder a profound question: are we inching closer to Artificial General Intelligence (AGI), and what is even meant by that term? A paper published recently by researchers at DeepMind tries to provide a taxonomy and a way to think about AGI and what it might constitute.

Instead of seeing AGI as a singular point, the authors propose classifying systems along two axes: capabilities and performance. On the capabilities axis, systems start with a narrow focus and become broader as characteristics of AGI emerge. On the performance axis, systems are compared with humans, on a scale that runs from emerging (somewhat comparable to humans) to superhuman (outperforming every human). Deep Blue and AlphaGo, for example, perform specific tasks at the level of the 99th percentile of humans but are not great at other tasks. On the other hand, a tool like ChatGPT performs a wide variety of tasks but certainly not at the level of the 50th percentile of human performance. Compared to previous tests of AGI like the Turing Test, I found this taxonomy to be more nuanced, with enough room to evolve as new capabilities emerge. In the context of our discussion so far, though, it is worth noting that this paper specifically excludes physical tasks from the definition of AGI, i.e. a system does not need to operate in the physical world to be classified as AGI. The paper also stops short of providing concrete examples or clear tests of generality. I think this indicates how early we are in our understanding of AGI and of how human interactions with such systems will evolve.
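To make the two axes concrete, here is a minimal sketch of how such a classification could be expressed in code. The text above only mentions the emerging and superhuman levels and the 50th and 99th percentiles. The intermediate level names (Competent, Expert, Virtuoso), the 90th percentile cut-off, the choice to treat everything below the 50th percentile as Emerging, and all class and function names are assumptions based on my reading of the paper, not something the paper provides as code.

```python
from dataclasses import dataclass
from enum import Enum


class Performance(Enum):
    """Performance levels, ordered from weakest to strongest (names assumed)."""
    EMERGING = 1
    COMPETENT = 2
    EXPERT = 3
    VIRTUOSO = 4
    SUPERHUMAN = 5


def performance_level(percentile: float) -> Performance:
    """Map the share of humans a system outperforms (0-100) to a level.

    The cut-offs are assumptions; below the 50th percentile is simplified
    to EMERGING.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile >= 100:
        return Performance.SUPERHUMAN
    if percentile >= 99:
        return Performance.VIRTUOSO
    if percentile >= 90:
        return Performance.EXPERT
    if percentile >= 50:
        return Performance.COMPETENT
    return Performance.EMERGING


@dataclass(frozen=True)
class Rating:
    """One cell in the capabilities x performance grid."""
    system: str
    general: bool  # True = broad range of tasks, False = narrow focus
    level: Performance

    def label(self) -> str:
        breadth = "General" if self.general else "Narrow"
        return f"{self.level.name.title()} {breadth}"


def classify(system: str, general: bool, percentile: float) -> Rating:
    """Place a system in the grid from its breadth and a human percentile."""
    return Rating(system, general, performance_level(percentile))


# AlphaGo: narrow, around the 99th percentile (as described above).
alphago = classify("AlphaGo", general=False, percentile=99)
# ChatGPT: general, below the 50th percentile; 40 is only a placeholder value.
chatgpt = classify("ChatGPT", general=True, percentile=40)
print(alphago.label(), "|", chatgpt.label())
```

The point of keeping breadth and performance as separate fields is that a system can move along one axis without moving along the other, which is exactly what distinguishes the narrow game-playing systems from a broad but weaker chat model.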

Revolutionizing Image Generation with Interactive AI

This week I was introduced to a specific form of human-AI collaboration: a transformative approach to image generation. During one of my conversations, a graphics producer laid out their vision of how the creative process should work. Rather than working with multiple standalone text-to-image generation models, imagine interacting with an AI graphics artist through a chat or voice interface to discuss the concept and follow a collaborative, iterative process. You initiate a conversation with the AI, much like briefing a graphic artist. You describe your vision, and the AI provides a draft image. But here’s where it gets interesting: you can then provide feedback, request changes, and refine the image, not by starting from scratch, but by iteratively adjusting the existing draft. It’s about guiding the AI to mold and modify the image based on specific comments and feedback.

I would imagine that implementing this might be more straightforward than anticipated. It would likely involve an orchestration pipeline that integrates multiple types of AI models, starting with an LLM chat model that interprets textual instructions and turns them into an image prompt. This prompt guides an image generation model to produce the first draft, and every subsequent round uses an image-to-image workflow that incorporates your feedback. I could also imagine that specific elements within an image could be identified with the help of the Segment Anything model so that the correct masks are applied for any edits. This approach could potentially be a game-changer in graphic design, offering a digital artist to everyone.
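The pipeline described above could be sketched roughly as follows. This is only an illustration of the control flow, under the assumption that each model is wrapped behind a simple function: every name here (ArtistSession, brief, feedback, and the four stub functions) is hypothetical, images and masks are represented as plain strings, and the stubs would be replaced by calls to a real LLM, a text-to-image model, an image-to-image model, and a segmentation model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interfaces; images and masks are plain strings in this sketch.
PromptWriter = Callable[[str], str]                       # instruction -> image prompt
TextToImage = Callable[[str], str]                        # prompt -> image
ImageToImage = Callable[[str, str, Optional[str]], str]   # image, prompt, mask -> image
Segmenter = Callable[[str, str], Optional[str]]           # image, comment -> mask or None


@dataclass
class ArtistSession:
    """Keeps the draft history so feedback edits the latest draft."""
    write_prompt: PromptWriter
    generate: TextToImage
    edit: ImageToImage
    segment: Segmenter
    drafts: List[str] = field(default_factory=list)

    def brief(self, description: str) -> str:
        """First turn: create a draft from scratch."""
        image = self.generate(self.write_prompt(description))
        self.drafts.append(image)
        return image

    def feedback(self, comment: str) -> str:
        """Later turns: adjust the latest draft instead of regenerating."""
        if not self.drafts:
            raise RuntimeError("call brief() before feedback()")
        current = self.drafts[-1]
        mask = self.segment(current, comment)  # None means edit the whole image
        image = self.edit(current, self.write_prompt(comment), mask)
        self.drafts.append(image)
        return image


# Stub models (assumptions for illustration only).
def stub_prompt(text: str) -> str:
    return f"prompt<{text.strip().lower()}>"


def stub_generate(prompt: str) -> str:
    return f"image[{prompt}]"


def stub_edit(image: str, prompt: str, mask: Optional[str]) -> str:
    region = mask if mask is not None else "whole image"
    return f"{image} + edit({prompt} @ {region})"


def stub_segment(image: str, comment: str) -> Optional[str]:
    return "mask:sky" if "sky" in comment.lower() else None


session = ArtistSession(stub_prompt, stub_generate, stub_edit, stub_segment)
print(session.brief("A lighthouse at dusk"))
print(session.feedback("Make the sky more orange"))
```

Keeping the list of drafts in the session is what makes the process iterative: each round of feedback starts from the previous image, and the history also leaves room for an undo step later.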

Reading through some further research in this area, I came across the Versatile Diffusion research paper, where the authors incorporate the same ideas into a single model that is trained end to end. This is fascinating because it allows the (multimodal) model to blend text and image modalities, share common weights, and remove the need for the multiple individual models I had imagined in my workflow. This idea seems like a viable product that could be developed using any of these techniques, and it is something I look forward to implementing in my next project. More importantly, it shows an example of what human-AI collaboration looks like, and I believe that the same paradigm will extend to many interesting applications in the future, be it robots learning by watching humans or by interacting with us via chat.