Very frustrating to see this, as results on large models have shown that the choice of scalar activation function makes only a tiny difference once the model is wide enough.
https://arxiv.org/abs/2002.05202v1 shows that GLU-based activation functions (two inputs -> one output) almost universally beat their scalar equivalents. IMO there needs to be more work on these kinds of multi-input constructions, as they offer much bigger potential gains.
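For concreteness, here's a minimal sketch of the idea (assuming PyTorch; the SwiGLU variant follows the linked paper, but the class and variable names are mine):

```python
# Minimal sketch, assuming PyTorch: a SwiGLU-style feed-forward block as in
# https://arxiv.org/abs/2002.05202, next to a plain scalar-activation baseline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Two inputs -> one output: one linear branch gates the other elementwise."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class ReluFFN(nn.Module):
    """Scalar baseline: a single elementwise activation, no gating."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.relu(self.w_up(x)))

x = torch.randn(4, 512)
print(SwiGLU(512, 1024)(x).shape)   # torch.Size([4, 512])
print(ReluFFN(512, 1024)(x).shape)  # torch.Size([4, 512])
```

Note the paper keeps parameter counts comparable by shrinking the hidden dimension of the GLU variants, so the gains aren't just from extra parameters.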
E.g. even for cases where the network only needs static routing (tabular data), transformers sometimes perform magically better than MLPs. This suggests there's something special about self-attention as an "activation function". If that magic can be extracted and made sub-quadratic, it could be a paradigm shift in NN design.
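A toy illustration of what I mean by attention as a multi-input activation (again assuming PyTorch; this follows the FT-Transformer recipe of turning each tabular feature into a token, and every name here is illustrative):

```python
# Illustrative sketch, assuming PyTorch: embed each scalar feature as a token and
# let one self-attention layer mix them (FT-Transformer style). Attention acts as a
# data-dependent, multi-input nonlinearity rather than an elementwise scalar one.
import torch
import torch.nn as nn

class TabularAttentionBlock(nn.Module):
    def __init__(self, n_features: int, d_token: int = 32, n_heads: int = 4):
        super().__init__()
        # One learned (weight, bias) embedding per scalar feature.
        self.weight = nn.Parameter(torch.randn(n_features, d_token))
        self.bias = nn.Parameter(torch.zeros(n_features, d_token))
        self.attn = nn.MultiheadAttention(d_token, n_heads, batch_first=True)
        self.head = nn.Linear(n_features * d_token, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) of raw scalar features
        tokens = x.unsqueeze(-1) * self.weight + self.bias  # (batch, n_features, d_token)
        mixed, _ = self.attn(tokens, tokens, tokens)        # each feature attends to all others
        return self.head(mixed.flatten(1))                  # (batch, 1) prediction

x = torch.randn(8, 10)
print(TabularAttentionBlock(n_features=10)(x).shape)  # torch.Size([8, 1])
```

The quadratic cost here is over the number of features, which is usually small for tabular data, but that is exactly the part that would need to become sub-quadratic for the idea to scale.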
The authors of the blog post seem aware of the limitations of their focus.
Thank you for highlighting this research! At first glance, it's interesting that sigmoid functions re-emerge as useful under the approaches evaluated in that article.