Very human-like errors.
by bryanlarsen 5 minutes ago
State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g. knowing that the Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17% accurate when counting in counterfactual images (e.g. counting stripes in a 4-striped Adidas-like logo or counting legs in a 5-legged dog).
by taesiri 4 hours ago
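A minimal sketch of the kind of evaluation described above, in case it helps make the setup concrete. The `query_vlm` helper, the file names, and the prompt are hypothetical stand-ins for whatever model and image pairs are actually used; only the accuracy bookkeeping is real.

```python
def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical wrapper around a VLM; replace with a real client."""
    raise NotImplementedError

def count_accuracy(samples: list[tuple[str, int]], prompt: str) -> float:
    """Fraction of images where the VLM's answer matches the true count."""
    correct = 0
    for image_path, true_count in samples:
        answer = query_vlm(image_path, prompt)
        # Take the first integer in the reply as the predicted count.
        digits = [tok for tok in answer.split() if tok.isdigit()]
        correct += bool(digits) and int(digits[0]) == true_count
    return correct / len(samples)

# Popular subjects (counts the model has likely memorized)
# vs. counterfactual variants of the same subjects.
familiar = [("adidas_logo.png", 3), ("dog.png", 4)]
counterfactual = [("adidas_like_4_stripes.png", 4), ("dog_5_legs.png", 5)]

prompt = "How many stripes/legs are in this image? Answer with a number."
# Per the reported numbers, expect ~100% on `familiar`
# and ~17% on `counterfactual`:
# print(count_accuracy(familiar, prompt), count_accuracy(counterfactual, prompt))
```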
Fun findings related to memorization in AI models. It suggests that LLMs/VLMs don't learn to predict generally but memorize instead. A new perspective on adversarial attack methods.
by shenkha 52 minutes ago
For over-represented concepts, like popular brands, it seems that the model “ignores” the details once it detects that the overall shapes or patterns are similar. Opening up the vision encoders to find out how these images cluster in the embedding space should provide better insights.
by taesiri 23 minutes ago
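One way to probe what the comment above suggests: embed the original and the counterfactual image with a vision encoder and compare them. A minimal sketch using CLIP via Hugging Face transformers; the image file names are hypothetical. If the counterfactual embeds almost on top of the original, the encoder is collapsing the very detail that distinguishes them.

```python
# Sketch: compare how a CLIP vision encoder embeds an original image
# vs. its counterfactual variant. File names are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: str) -> torch.Tensor:
    """L2-normalized CLIP image embedding for one image file."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

original = embed("adidas_logo.png")             # 3 stripes
counterfactual = embed("adidas_4_stripes.png")  # 4 stripes

# Cosine similarity near 1.0 would mean the encoder maps both
# to nearly the same point, i.e. the extra stripe is "ignored".
print((original @ counterfactual.T).item())
```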
This may indicate that while VLMs might possess the necessary capability, their strong biases can cause them to overlook important cues, and their overconfidence in their own knowledge can lead to incorrect answers.
by vokhanhan25 46 minutes ago