Statistical investigations into the geometry and homology of random programs

Publication: Working paper › Preprint › Research

Standard

Statistical investigations into the geometry and homology of random programs. / Sporring, Jon; Larsen, Ken Friis.

arxiv.org, 2024.


Harvard

Sporring, J & Larsen, KF 2024 'Statistical investigations into the geometry and homology of random programs' arxiv.org. <https://arxiv.org/abs/2407.04854>

APA

Sporring, J., & Larsen, K. F. (2024). Statistical investigations into the geometry and homology of random programs. arxiv.org. https://arxiv.org/abs/2407.04854

Vancouver

Sporring J, Larsen KF. Statistical investigations into the geometry and homology of random programs. arxiv.org. 2024 Jul 5.

Author

Sporring, Jon; Larsen, Ken Friis. / Statistical investigations into the geometry and homology of random programs. arxiv.org, 2024.

Bibtex

@techreport{0484cc0149c14c1ebcb46158eb56046e,
title = "Statistical investigations into the geometry and homology of random programs",
abstract = "AI-supported programming has taken giant leaps with tools such as Meta's Llama and OpenAI's ChatGPT. These are examples of stochastic sources of programs and have already greatly influenced how we produce code and teach programming. If we consider input to such models as a stochastic source, a natural question is: what is the relation between the input and the output distributions, between the ChatGPT prompt and the resulting program? In this paper, we show how the relation between random Python programs generated by ChatGPT can be described geometrically and topologically using tree-edit distances between the programs' syntax trees, without explicit modeling of the underlying space. A popular approach to studying high-dimensional samples in a metric space is low-dimensional embedding using, e.g., multidimensional scaling. Such methods introduce errors that depend on the data and on the dimension of the embedding space. In this article, we propose to restrict such projection methods to visualization purposes only and instead use geometric summary statistics, methods from spatial point statistics, and topological data analysis to characterize the configurations of random programs without relying on embedding approximations. To demonstrate their usefulness, we compare two publicly available models, ChatGPT-4 and TinyLlama, on a simple problem related to image processing. Application areas include understanding how questions should be asked to obtain useful programs, measuring how consistently a given large language model answers, and comparing different large language models as programming assistants. Finally, we speculate that our approach may in the future give new insights into the structure of programming languages.",
author = "Sporring, Jon and Larsen, {Ken Friis}",
year = "2024",
month = jul,
day = "5",
language = "English",
publisher = "arxiv.org",
type = "WorkingPaper",
institution = "arxiv.org",
}
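
The core quantity in the abstract is the tree-edit distance between the syntax trees of generated Python programs. The sketch below is a minimal illustration, not the paper's implementation: it parses two programs with Python's standard ast module and compares them with the Zhang-Shasha algorithm from the third-party zss package; the labelling of nodes by AST class name and the choice of zss are both assumptions.

import ast
from zss import Node, simple_distance  # pip install zss (assumed implementation)

def to_zss(node: ast.AST) -> Node:
    # Convert a Python AST into a zss tree, labelling each node with
    # its AST class name (an illustrative choice, not the paper's).
    z = Node(type(node).__name__)
    for child in ast.iter_child_nodes(node):
        z.addkid(to_zss(child))
    return z

def tree_edit_distance(src_a: str, src_b: str) -> int:
    # Zhang-Shasha tree-edit distance between the two programs' syntax trees.
    return simple_distance(to_zss(ast.parse(src_a)), to_zss(ast.parse(src_b)))

# Example: two one-line programs that differ in a single operator.
print(tree_edit_distance("def f(x):\n    return x + 1\n",
                         "def f(x):\n    return x * 2\n"))

Applying this pairwise over n sampled programs yields the n-by-n distance matrix on which the geometric and topological summaries are computed.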

RIS

TY - UNPB

T1 - Statistical investigations into the geometry and homology of random programs

AU - Sporring, Jon

AU - Larsen, Ken Friis

PY - 2024/7/5

Y1 - 2024/7/5

N2 - AI-supported programming has taken giant leaps with tools such as Meta's Llama and OpenAI's ChatGPT. These are examples of stochastic sources of programs and have already greatly influenced how we produce code and teach programming. If we consider input to such models as a stochastic source, a natural question is: what is the relation between the input and the output distributions, between the ChatGPT prompt and the resulting program? In this paper, we show how the relation between random Python programs generated by ChatGPT can be described geometrically and topologically using tree-edit distances between the programs' syntax trees, without explicit modeling of the underlying space. A popular approach to studying high-dimensional samples in a metric space is low-dimensional embedding using, e.g., multidimensional scaling. Such methods introduce errors that depend on the data and on the dimension of the embedding space. In this article, we propose to restrict such projection methods to visualization purposes only and instead use geometric summary statistics, methods from spatial point statistics, and topological data analysis to characterize the configurations of random programs without relying on embedding approximations. To demonstrate their usefulness, we compare two publicly available models, ChatGPT-4 and TinyLlama, on a simple problem related to image processing. Application areas include understanding how questions should be asked to obtain useful programs, measuring how consistently a given large language model answers, and comparing different large language models as programming assistants. Finally, we speculate that our approach may in the future give new insights into the structure of programming languages.

AB - AI-supported programming has taken giant leaps with tools such as Meta's Llama and OpenAI's ChatGPT. These are examples of stochastic sources of programs and have already greatly influenced how we produce code and teach programming. If we consider input to such models as a stochastic source, a natural question is: what is the relation between the input and the output distributions, between the ChatGPT prompt and the resulting program? In this paper, we show how the relation between random Python programs generated by ChatGPT can be described geometrically and topologically using tree-edit distances between the programs' syntax trees, without explicit modeling of the underlying space. A popular approach to studying high-dimensional samples in a metric space is low-dimensional embedding using, e.g., multidimensional scaling. Such methods introduce errors that depend on the data and on the dimension of the embedding space. In this article, we propose to restrict such projection methods to visualization purposes only and instead use geometric summary statistics, methods from spatial point statistics, and topological data analysis to characterize the configurations of random programs without relying on embedding approximations. To demonstrate their usefulness, we compare two publicly available models, ChatGPT-4 and TinyLlama, on a simple problem related to image processing. Application areas include understanding how questions should be asked to obtain useful programs, measuring how consistently a given large language model answers, and comparing different large language models as programming assistants. Finally, we speculate that our approach may in the future give new insights into the structure of programming languages.

M3 - Preprint

BT - Statistical investigations into the geometry and homology of random programs

PB - arxiv.org

ER -
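
The abstract restricts low-dimensional embeddings such as multidimensional scaling to visualization and computes its summaries directly on the distance matrix. Below is a minimal sketch of that split under stated assumptions: it uses scikit-learn's MDS for the plot-only embedding and the ripser.py package for persistent homology; the paper does not name its tooling, so both library choices are assumptions.

import numpy as np
from sklearn.manifold import MDS
from ripser import ripser  # pip install ripser (assumed TDA backend)

def summarize(D: np.ndarray):
    # (a) 2D embedding of the precomputed tree-edit distance matrix,
    # kept for visualization only, since it introduces embedding error.
    xy = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
    # (b) Vietoris-Rips persistence diagrams computed directly on D,
    # with no embedding approximation involved.
    dgms = ripser(D, distance_matrix=True)["dgms"]
    return xy, dgms

Working directly on D is what lets the summary statistics and homology avoid the data- and dimension-dependent distortion the abstract attributes to projection methods.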
