overview for nsa

1

pl.aiwright - GPT-4 dialogue for Disco Elysium: The Final Cut (pl.aiwright.dev)

submitted 9 months ago by [email protected] to c/[email protected]

0 comments fedilink

pl.aiwright is an AI-powered dialogue generation tool for interactive narrative games. This is a first step for an open research platform to explore AI-generated dialogue in games.

Notably, it's different from most other AI miiddleware like InWorld since it's open source.

2

What's In My Big Data? (arxiv.org)

submitted 10 months ago by [email protected] to c/[email protected]

0 comments fedilink

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.

3

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI (arxiv.org)

submitted 10 months ago by [email protected] to c/[email protected]

0 comments fedilink

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 72%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.

1

GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems (arxiv.org)

submitted 11 months ago by [email protected] to c/[email protected]

0 comments fedilink

There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples, a wide spread belief in their iterative self-critique capabilities persists. In this paper, we set out to systematically investigate the effectiveness of iterative prompting of LLMs in the context of Graph Coloring, a canonical NP-complete reasoning problem that is related to propositional satisfiability as well as practical problems like scheduling and allocation. We present a principled empirical study of the performance of GPT4 in solving graph coloring instances or verifying the correctness of candidate colorings. In iterative modes, we experiment with the model critiquing its own answers and an external correct reasoner verifying proposed solutions. In both cases, we analyze whether the content of the criticisms actually affects bottom line performance. The study seems to indicate that (i) LLMs are bad at solving graph coloring instances (ii) they are no better at verifying a solution--and thus are not effective in iterative modes with LLMs critiquing LLM-generated solutions (iii) the correctness and content of the criticisms--whether by LLMs or external solvers--seems largely irrelevant to the performance of iterative prompting. We show that the observed increase in effectiveness is largely due to the correct solution being fortuitously present in the top-k completions of the prompt (and being recognized as such by an external verifier). Our results thus call into question claims about the self-critiquing capabilities of state of the art LLMs.

2

A Long Way to Go: Investigating Length Correlations in RLHF (arxiv.org)

submitted 11 months ago by [email protected] to c/[email protected]

0 comments fedilink

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.

4

Think before you speak: Training Language Models With Pause Tokens (arxiv.org)

submitted 11 months ago by [email protected] to c/[email protected]

1 comments fedilink

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18%$ EM score on the QA task of SQuAD, $8%$ on CommonSenseQA and $1%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.

interesting new paper exploring the addition of "pause" tokens for lms. it seems to imply that the computational capacity of lms is bounded by the number of tokens in its context, so simply adding "meaningless" tokens gives the model increased computational capacity leading to increased performance on downstream tasks

2

Language Modeling Is Compression (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

0 comments fedilink

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

I wonder what a paper like this, especially given the title, does for the legal case regarding copyright and generative AI. Haven't had a chance to read the paper yet, so don't know if the findings are relevant to copyright.

6

Retentive Network: A Successor to Transformer for Large Language Models (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

4 comments fedilink

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models, while allowing parallel training and efficient RNN-like inference without the use of attention (it doesn't use a softmax).

It achieves lower perplexity than Transformers models with more than 2B parameters and requires much lower GPU memory and FLOPs compared Transformers for inference.

Abstract:

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time in c/[email protected]

[–] [email protected] 1 points 1 year ago

Averaging model weights seems to help across textual domains as well, see Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models and Scaling Expert Language Models with Unsupervised Domain Discovery. I wonder if the two types of averaging (across hyperparameters and across domains) can be combined to produce even better models.

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing in c/[email protected]

[–] [email protected] 0 points 1 year ago (1 children)

Please don't post links to reddit.

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models in c/[email protected]

[–] [email protected] 2 points 1 year ago

It seems like for creative text generation tasks, metrics have been shown to be deficient; this even holds for the new model-based metrics. That leaves human evaluation (both intrinsic and extrinsic) as the gold standard for those types of tasks. I wonder if the results from this paper (and other future papers that look automatic CV metrics) will lead reviewers to demand more human evaluation in CV tasks like they do for certain NLP tasks.

2

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

1 comments fedilink

Abstract:

We systematically study a wide variety of image-based generative models spanning semantically-diverse datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 16 modern metrics for evaluating the overall performance, fidelity, diversity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization; none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation we release all generated image datasets, human evaluation data, and a modular library to compute 16 common metrics for 8 different encoders at https://github.com/layer6ai-labs/dgm-eval.

Extending Context Window of Large Language Models via Positional Interpolation in c/[email protected]

[–] [email protected] 1 points 1 year ago

hmmm... not sure which model you're referring to. do you have a paper link?

Extending Context Window of Large Language Models via Positional Interpolation in c/[email protected]

[–] [email protected] 1 points 1 year ago

do you have a link?

[@machinelearning](https://kbin.social/m/machinelearning) am I in the right place? Lol in c/[email protected]

[–] [email protected] 1 points 1 year ago

@Koffindodjer indeed you are!

6

Extending Context Window of Large Language Models via Positional Interpolation (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

5 comments fedilink

Interesting technique to increase the context window of language models by finetuning on a small number of samples after pretraining.

(I did a double-take after seeing the heading on the first page of the pdf, but it's not actually an old paper.)

We present Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B. Meanwhile, the extended model by Position Interpolation preserve quality relatively well on tasks within its original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain its original architecture and can reuse most pre-existing optimization and infrastructure.

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks in c/[email protected]

[–] [email protected] 1 points 1 year ago

Also reminds me of this ICLR paper: Linearly Mapping from Image to Text Space.

13

Inverse Scaling: When Bigger Isn't Better (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

0 comments fedilink

Abstract:

Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

7

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (decodingtrust.github.io)

submitted 1 year ago by [email protected] to c/[email protected]

0 comments fedilink

DecodingTrust is the Adversarial GLUE Benchmark. DecodingTrust aims at providing a thorough assessment of trustworthiness in GPT models.

This research endeavor is designed to help researchers and practitioners better understand the capabilities, limitations, and potential risks involved in deploying these state-of-the-art Large Language Models (LLMs).
This project is organized around the following eight primary perspectives of trustworthiness, including:

Toxicity
Stereotype and bias
Adversarial robustness
Out-of-Distribution Robustness
Privacy
Robustness to Adversarial Demonstrations
Machine Ethics
Fairness

Paper: https://arxiv.org/abs/2306.11698
Repo: https://github.com/AI-secure/DecodingTrust

The Curse of Recursion: Training on Generated Data Makes Models Forget in c/[email protected]

[–] [email protected] 0 points 1 year ago (1 children)

If the effect is strong enough, then it could have a very negative effect on LLM training in the near future, considering more and more of the internet contains ChatGPT & GPT-4 content in it and automatic detectors are currently quite poor.