Blog

DALL-E 2 from Scratch
/lab/correll/2024/09/16/dall-e-2-scratch
Posted September 16, 2024 by Nicolaus J Correll
[Image: DALL-E from scratch generating clothing-MNIST-like data]

Text-Conditioned Image Generation on FashionMNIST using CLIP Latents

by Matthew Nguyen

Denoising diffusion probabilistic models (DDPMs) are a popular type of generative AI model, introduced by Ho et al. in 2020 (https://arxiv.org/pdf/2006.11239) and improved upon by Nichol et al. in 2021 (https://arxiv.org/pdf/2102.09672). The basic idea behind these models is that noise is added to images in the forward diffusion process in order to train the model to predict the noise that should be removed at a given timestep in the reverse diffusion process. When sampling images, you start with an image containing pure noise and iteratively remove the model's predicted noise at each timestep until you get the final image.

In order to have a DDPM generate multiple types of images while still letting the user choose which type of image they want, the model needs to be conditioned on some input. Ramesh et al. (https://cdn.openai.com/papers/dall-e-2.pdf) introduced one such conditioning method, called unCLIP, which is used in OpenAI's DALL-E 2 model. In the method described by Ramesh et al., the input caption is first passed to a prior network, which uses a trained CLIP model to get the CLIP text embeddings. These text embeddings are then used by a decoder-only transformer to generate possible CLIP image embeddings. The CLIP image embeddings generated by the prior network are then used by a decoder network, which consists of a UNet model, to condition the images that are created.
In this article, we are going to build a simple diffusion model using this process.

Keep reading on Correll Lab on Medium for free: https://medium.com/correll-lab/dall-e-2-from-scratch-c055bf881b9a
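To make the forward and reverse processes described above concrete, here is a minimal, self-contained PyTorch sketch of the two core operations. It is not the article's implementation: the noise schedule follows Ho et al., and model stands in for a conditional UNet whose call signature (noisy image, timestep, conditioning such as a CLIP image embedding) is an assumption made for illustration.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule from Ho et al.
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative products used for closed-form noising

def q_sample(x0, t, noise):
    """Forward diffusion: jump from a clean image x0 directly to the noisy image x_t."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * noise               # during training, the model learns to predict `noise`

@torch.no_grad()
def p_sample_step(model, x_t, t, cond):
    """One reverse-diffusion step: remove the noise the model predicts at timestep t."""
    eps = model(x_t, torch.full((x_t.size(0),), t), cond)   # hypothetical conditional UNet call
    mean = (x_t - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                          # final step returns the denoised image
    return mean + betas[t].sqrt() * torch.randn_like(x_t)   # otherwise add fresh sampling noise

Sampling then simply loops p_sample_step from t = T-1 down to 0, starting from pure Gaussian noise.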
Testing the Field Capabilities of the Unitree Go-1
/lab/correll/2024/07/05/testing-field-capabilities-unitree-go-1
Posted July 5, 2024
[Image: Unitree robotic dog at Rocky Mountain Biological Lab]

Promotional videos are great, but what is the real deal when taking a robotic dog to the field?

Our goal is to find out what a commodity robotic dog can add to a field scientist's set of tools, what it can actually accomplish in the field, and what fundamental research in robotics is needed to enable such robots.

Here are our key findings from a first deployment:

1. The Unitree Go-1 is able to navigate surprisingly rugged terrain.
2. The robot does fail. Its legs can entangle with the stems of forbs and shrubs, and the robot can easily slip even on flat (!) terrain.
3. If the robot fails, it often cannot recover by itself, but needs to be manually disentangled and rebooted.
4. The robot itself is absolutely not rugged and is susceptible to dust and morning dew, requiring additional engineering for field applications.

Continue reading on Towards Data Science: https://medium.com/towards-data-science/testing-the-field-capabilities-of-the-unitree-go-1-de665ae6ef05

Thinking, Fast and Slow, with LLMs and PDDL
/lab/correll/2024/06/10/thinking-fast-and-slow-llms-and-pddl
Posted June 10, 2024
[Image: A simple block-stacking task]

ChatGPT is never shy about pretending to perform deep thought, but — like our brain — it might need additional tools to reason accurately

"ChatGPT can make mistakes. Check important info." is now written right underneath the prompt, and we have all gotten used to the fact that ChatGPT stoically makes up anything from dates to entire references. But what about basic reasoning? Looking at a simple tower-rearranging task from the early days of Artificial Intelligence (AI) research, we will show how large language models (LLMs) reach their limitations, and introduce the Planning Domain Definition Language (PDDL) and symbolic solvers to make up for it. Given that LLMs are fundamentally probabilistic, it is likely that such tools will be built into future versions of AI agents, combining common-sense knowledge with razor-sharp reasoning.

To get the most out of this article, set up your own PDDL environment using VS Code's PDDL extension (https://marketplace.visualstudio.com/items?itemName=jan-dolejsi.pddl) and the planutils planner interface (https://github.com/AI-Planning/planutils), and work along with the examples.

Continue reading on Towards Data Science: https://medium.com/towards-data-science/thinking-fast-and-slow-with-llms-and-pddl-111699f9907e?sk=8792c884cc6498579bdd1cca6c5e00cb
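If you want a file pair to start from while working along, the sketch below writes out a textbook STRIPS blocksworld domain and a small tower-rearranging problem in PDDL. These are generic teaching files of my own, not the exact domain and problem used in the article.

from pathlib import Path

DOMAIN = """
(define (domain blocksworld)
  (:requirements :strips)
  (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x))
  (:action pick-up
    :parameters (?x)
    :precondition (and (clear ?x) (ontable ?x) (handempty))
    :effect (and (holding ?x) (not (ontable ?x)) (not (clear ?x)) (not (handempty))))
  (:action put-down
    :parameters (?x)
    :precondition (holding ?x)
    :effect (and (ontable ?x) (clear ?x) (handempty) (not (holding ?x))))
  (:action stack
    :parameters (?x ?y)
    :precondition (and (holding ?x) (clear ?y))
    :effect (and (on ?x ?y) (clear ?x) (handempty) (not (holding ?x)) (not (clear ?y))))
  (:action unstack
    :parameters (?x ?y)
    :precondition (and (on ?x ?y) (clear ?x) (handempty))
    :effect (and (holding ?x) (clear ?y) (not (on ?x ?y)) (not (clear ?x)) (not (handempty)))))
"""

# Start: block B sits on A, C is on the table.  Goal: the tower A-on-B-on-C.
PROBLEM = """
(define (problem rearrange-tower)
  (:domain blocksworld)
  (:objects a b c)
  (:init (ontable a) (on b a) (clear b) (ontable c) (clear c) (handempty))
  (:goal (and (on a b) (on b c))))
"""

Path("domain.pddl").write_text(DOMAIN)
Path("problem.pddl").write_text(PROBLEM)
# Solve domain.pddl / problem.pddl with any classical planner, for instance one
# installed through planutils, or open them with the VS Code PDDL extension linked above.
print("wrote domain.pddl and problem.pddl")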
Building CLIP From Scratch
/lab/correll/2024/05/16/building-clip-scratch
Posted May 16, 2024
[Image: CLIP overview, from the original CLIP paper]

by Matt Nguyen

Open World Object Recognition on the Clothing MNIST Dataset

Computer vision systems were historically limited to a fixed set of classes. CLIP has been a revolution, allowing open-world object recognition by "predicting which image and text pairings go together". CLIP learns this by maximizing the cosine similarity between matching image and text features over batches of training data. This is shown in the contrastive pre-training portion of Figure 1, where the dot product between the image features {I_1 ... I_N} and the text features {T_1 ... T_N} is taken.

In this tutorial, we are going to build CLIP from scratch and test it on the FashionMNIST dataset. Some of the sections in this article are taken from my vision transformers article (https://medium.com/correll-lab/building-a-vision-transformer-model-from-scratch-a3054f707cc6).

A notebook with the code from this tutorial can be found here: https://colab.research.google.com/drive/1E4sEg7RM8HBv4PkIhjWZuwCXXbB_MinS?usp=sharing

Continue reading on Correll Lab: https://medium.com/correll-lab/building-clip-from-scratch-68f6e42d35f4
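To make "contrastive pre-training" concrete, here is a minimal PyTorch sketch of the symmetric CLIP loss. It is not the tutorial's code: the random tensors merely stand in for encoder outputs, and the temperature value is just a common default.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalizing makes the dot product equal to the cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # N x N similarity matrix between every image and every caption in the batch.
    logits = image_features @ text_features.t() / temperature
    # Matching pairs sit on the diagonal, so the "correct class" for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2

# Random stand-ins for a batch of 8 image and 8 text embeddings of size 512.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))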
Is Open World Vision in Robotic Manipulation Useful?
/lab/correll/2024/05/14/open-world-vision-robotic-manipulation-useful
Posted May 14, 2024
[Image: Example pictures sorted into the confusion matrix]

by Uri Soltz

Google's Open World Localization Vision Transformer (OWL-ViT) in combination with Meta's "Segment Anything" has emerged as the go-to pipeline for zero-shot object recognition — none of the objects have been used in training the classifier — in robotic manipulation. Yet OWL-ViT has been trained on static images from the internet and has limited fidelity in a manipulation context. OWL-ViT returns a non-negligible confusion matrix, and we show that processing the same view from different distances significantly increases performance. Still, OWL-ViT works better for some objects than for others and is thus inconsistent.

Our experimental setup is described in "Exploring MAGPIE: A Force Control Gripper w/ 3D Perception" by Streck Salmon: https://medium.com/@streck0101/exploring-magpie-the-next-generation-of-low-cost-robotic-grippers-dd21e4e4f3b2

Read the full article on Correll Lab: https://medium.com/correll-lab/is-open-world-vision-in-robotic-manipulation-useful-6b7389499dc9
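For readers who want to try this kind of zero-shot query themselves, the sketch below uses the Hugging Face transformers port of OWL-ViT. This is my assumption of a reasonable setup, not the exact pipeline, checkpoint, or thresholds used in the post, and the image file name and label list are hypothetical.

import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("tabletop_scene.jpg")          # hypothetical photo of a manipulation scene
queries = [["a screwdriver", "a coffee mug", "a wooden block"]]  # free-text labels, no retraining

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into detections above a confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])   # (height, width) of the original image
detections = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes)[0]

for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    print(f"{queries[0][label.item()]}: {score:.2f} at {[round(v, 1) for v in box.tolist()]}")

Running the same query on crops taken at different virtual distances and keeping the most confident detection is one simple way to exploit the distance effect reported above.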
MAGPIE: An Open-Source Force Control Gripper With 3D Perception
/lab/correll/2024/05/14/magpie-open-source-force-control-gripper-3d-perception
Posted May 14, 2024
[Image: The MAGPIE gripper and its dependencies]

by Streck Salmon

There are a myriad of robotic arms, but very few choices when it comes to robotic grippers, particularly those with built-in force control and perception. This article explores the outer and inner workings of the MAGPIE gripper (/lab/correll/2023/07/10/versatile-robotic-hand-3d-perception-force-sensing-autonomous-manipulation), an intelligent robotic object manipulator developed at the Correll Lab (/lab/correll/) at the University of Colorado, Boulder. The gripper's hardware design was created by Stephen Otto during his Master's thesis, and the software for planning, perception (utilizing the RealSense with Open3D), and interfacing with the UR5 was developed by Dylan Kriegman as part of his senior thesis. Alongside this, James Watson also made significant contributions to the perception and planning software.

The original paper, published by Correll, Otto, Kriegman, and Watson, can be found here: https://arxiv.org/abs/2402.06018

Read the full article on Correll Lab: https://medium.com/correll-lab/magpie-an-open-source-force-control-gripper-with-3d-perception-dd21e4e4f3b2
Building a Vision Transformer Model From Scratch
/lab/correll/2024/04/04/building-vision-transformer-model-scratch
Posted April 4, 2024
[Image: Padding step in the ViT (Matt Nguyen)]

by Matt Nguyen

The self-attention-based transformer model was first introduced by Vaswani et al. in their 2017 paper "Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf) and has been widely used in natural language processing. A transformer model is what OpenAI uses to create ChatGPT. Transformers work not only on text, but also on images and essentially any sequential data. In 2021, Dosovitskiy et al. introduced the idea of using transformers for computer vision tasks such as image classification in their paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/pdf/2010.11929.pdf). They were able to achieve excellent results with their vision transformer model compared to convolutional networks, while requiring far fewer resources to train.

In this tutorial, we are going to build a vision transformer model from scratch and test it on the MNIST dataset, a collection of handwritten digits that has become a standard benchmark in machine learning.

A notebook with the code from this tutorial can be found here: https://colab.research.google.com/drive/1rabTm93y39FNbu-21tDhlvYh2gp8edVR?usp=sharing

Read the full article on Correll Lab: https://medium.com/correll-lab/building-a-vision-transformer-model-from-scratch-a3054f707cc6
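As a taste of the model's first step, here is a minimal patch-embedding sketch in PyTorch. It is a generic version of my own, sized for 28x28 MNIST images with 7x7 patches, not the exact module from the tutorial (which, for instance, also includes a padding step).

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to a d_model-dim token."""
    def __init__(self, img_size=28, patch_size=7, in_channels=1, d_model=64):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying
        # one shared linear projection ("16x16 words" in the original ViT).
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches + 1, d_model))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, d_model, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, n_patches, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the classification token
        return x + self.pos_embed              # add learned positional embeddings

tokens = PatchEmbedding()(torch.randn(4, 1, 28, 28))
print(tokens.shape)                            # torch.Size([4, 17, 64])

The resulting token sequence is what the transformer encoder blocks then attend over.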
The Future of Robotic Assembly
/lab/correll/2024/03/28/future-robotic-assembly
Posted March 28, 2024
[Image: A humanoid performing assembly. Image by the author via miramuseai.net]

Since the introduction of mass production in 1913, assembly lines are still mostly human — humanoids might change this

Henry Ford is known as the father of mass production, streamlining the production of his "Model T" and enabling cars to become widely affordable. One of the key innovations at the time was to use a conveyor belt in the assembly line that paced the production process. Yet actual labor was mostly manual and still is today, as can be seen, for example, in engine assembly at BMW in 2024.

[Image: Mass production at Henry Ford's Model T factory in 1913 (left, public domain) and engine assembly at BMW in 2024 (right, picture from https://www.bmwgroup-werke.com/steyr/en/highlight/engine-assembly.html)]

Pacing an assembly line by what is known by the German word "Takt", or cycle time, is indeed a key idea for making an assembly process predictable. The throughput of a factory is directly related to its Takt, which in turn is driven by the slowest contributor, and directly relates to the sojourn time of an order in the assembly line.

In a human-driven environment, people might eventually adapt to the cycle time of the processes around them, which is beautifully captured in the figure below, showing the acquisition of speed skill in a cigar manufacturing factory.

Continue reading on Towards Data Science: https://towardsdatascience.com/the-future-of-robotic-assembly-ce3446703de8?source=friends_link&sk=e5d2b10f877383d1146a1d03689b3928
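A back-of-the-envelope sketch of that relationship, using made-up station times: on a fully paced line the slowest station sets the cycle time, the cycle time sets the throughput, and the sojourn time of an order grows with the number of stations.

# Hypothetical processing times per station, in seconds.
station_times_s = [52, 61, 48, 57]
cycle_time_s = max(station_times_s)            # the slowest station paces the whole line
throughput_per_hour = 3600 / cycle_time_s      # one finished unit leaves the line every cycle
sojourn_time_s = cycle_time_s * len(station_times_s)   # paced line, no buffers: each station waits for the Takt

print(f"cycle time: {cycle_time_s} s, throughput: {throughput_per_hour:.1f} units/h, "
      f"sojourn time: {sojourn_time_s / 60:.1f} min")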
Grasping With Common Sense using VLMs and LLMs
/lab/correll/2024/03/10/grasping-common-sense-using-vlms-and-llms
Posted March 10, 2024
[Image: DeliGrasp overview]

How to leverage large language models for robotic grasping and code generation

Grasping and manipulation remain a hard, unsolved problem in robotics. Grasping is not just about identifying points where to put your fingers on an object to create sufficient constraints. Grasping is also about applying just enough force to pick up the object without breaking it, while making sure it can be put to its intended use. At the same time, grasping provides critical sensor input to detect what an object is and what its properties are.

With mobility essentially solved, grasping and manipulation remain the final frontier in unlocking truly autonomous labor replacements.

Continue reading on Towards Data Science: https://towardsdatascience.com/grasping-with-common-sense-bfe21743c02d?source=friends_link&sk=460bc3168cc4f5f2395b19b9a1d1c6bc
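As a toy illustration of "just enough force" (a first-order friction model of my own, not the method from the article): a two-finger grasp has to squeeze hard enough that friction carries the object's weight, and an LLM's common-sense guesses for mass and friction coefficient are exactly the numbers such a calculation needs.

def required_grip_force(mass_kg, friction_coeff, safety_factor=1.5, g=9.81):
    """Normal force per finger for a two-finger antipodal grasp to hold an object against gravity,
    including a safety margin."""
    # Each of the two contacts contributes friction_coeff * F_normal of supporting tangential force.
    return safety_factor * mass_kg * g / (2 * friction_coeff)

# e.g. a 150 g paper cup with an estimated friction coefficient of 0.4
print(f"squeeze with about {required_grip_force(0.150, 0.4):.2f} N per finger")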
Are the Humanoids Here to Stay?
/lab/correll/2024/03/01/are-humanoids-here-stay
Posted March 1, 2024
[Image: A humanoid cleaning up (its own?) mess while preparing a meal. The humanoid form factor holds tremendous promise for seamless integration into existing value creation processes. Image: author via miramuseai.net]

Humanoids might finally solve the "brownfield" problem that plagues robotic adoption, and recent breakthroughs in multi-modal transformers and diffusion models might actually make it happen.

Not a week goes by without a flurry of humanoid companies releasing a new update. Optimus can walk? Digit has just moved an empty tote? So has Figure! It also seems that real companies are finally getting interested. Starting with Tesla, humanoids are now "working" at Amazon and BMW, from which it is only a short way to our households and gardens. But are they really working? The demos we get to see are neither as exciting as Boston Dynamics' Atlas doing parkour, nor do humanoids seem to be very productive. So is the market rightfully excited, and are humanoids up to something?
I'm excited about humanoids for two reasons:

1) Humanoids might finally solve the "brownfield" problem, the main reason so many robot solutions burn in pilot purgatory.

2) Machine learning has made a huge leap in 2023, with computers exhibiting reasoning skills that — for the first time — allow them to operate in open-world settings and perform contact-rich manipulation.

Continue reading on Towards Data Science: https://towardsdatascience.com/are-the-humanoids-here-to-stay-050da171530b?source=friends_link&sk=abfc9adb87dfd585431b280b8beabbd5