this post was submitted on 27 Oct 2023
17 points (81.5% liked)

Programming

17668 readers
179 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Rules

  • Follow the programming.dev instance rules
  • Keep content related to programming in some way
  • If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities [email protected]



founded 2 years ago
MODERATORS
 

I am trying to create a program that downloads a JSON file and I'm trying to convert it to sqlite.

Why?:

  • I believe the training data on Deepdanbooru is weak compaired to Rule34 or Gelbooru. I have tried compiling the C# for linux, but I decided it was too hard.

I am a mainly a Web developer who uses basic JS/HTML/CSS. These don't help as I need it to be in C. (Not ++/#). I have tried using AI to assist, but no matter what language I use (Python, Ruby, Java, C), It will fail.

Here is the closest I gotten ( 〔〕 --> ><:

#include 〔stdio.h〕
#include 〔stdlib.h〕
#include 〔string.h〕
#include 〔curl/curl.h〕
#include "cJSON.h"
#include "sqlite3.h"

#define URL "https://danbooru.donmai.us/posts.json?page=1&amp;limit=1000&amp;tags=duck&amp;json=1"

#define DB_NAME "data.db"

static int callback(void *NotUsed, int argc, char **argv, char **azColName) {
    int i;
    for (i = 0; i &lt; argc; i++) {
        printf("%s = %s\n", azColName[i], argv[i] ? argv[i] : "NULL");
    }
    printf("\n");
    return 0;
}

int main() {
    CURL *curl;
    CURLcode res;
    FILE *fp;
    char *url = URL;
    char outfilename[FILENAME_MAX] = "data.json";
    curl = curl_easy_init();
    if (curl) {
        fp = fopen(outfilename, "wb");
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, NULL);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        fclose(fp);
    }
    sqlite3 *db;
    char *err_msg = 0;
    int rc = sqlite3_open(DB_NAME, &amp;db);
    if (rc != SQLITE_OK) {
        fprintf(stderr, "Cannot open database: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }
    char *sql = "CREATE TABLE data (id INT PRIMARY KEY NOT NULL, md5 TEXT NOT NULL, tag_string TEXT NOT NULL, tag_count_general INT NOT NULL);";
    if (rc != SQLITE_OK) {
        fprintf(stderr, "SQL error: %s\n", err_msg);
        sqlite3_free(err_msg);
        sqlite3_close(db);
        return 1;
    }
    cJSON *json;
    cJSON *item;
    cJSON *id;
    cJSON *md5;
    cJSON *tag_string;
    cJSON *tag_count_general;
    char buffer[1024];
    fp = fopen("data.json", "r");
    fread(buffer, 1024, 1, fp);
    fclose(fp);
    json = cJSON_Parse(buffer);
    if (!json) {
        printf("Error before: [%s]\n", cJSON_GetErrorPtr());
        return 1;
    }
    cJSON_ArrayForEach(item, json) {
        id = cJSON_GetObjectItem(item, "id");
        md5 = cJSON_GetObjectItem(item, "md5");
        tag_string = cJSON_GetObjectItem(item, "tag_string");
        tag_count_general = cJSON_GetObjectItem(item, "tag_count_general");
        char insert_query[1024];
        sprintf(insert_query,
                "INSERT INTO data (id, md5, tag_string, tag_count_general) VALUES (%d,'%s','%s',%d);",
                id->valueint,
                md5->valuestring,
                tag_string->valuestring,
                tag_count_general->valueint
               );
        rc = sqlite3_exec(db, insert_query, callback, 0, &amp;err_msg);
        if (rc != SQLITE_OK) {
            fprintf(stderr, "SQL error: %s\n", err_msg);
            sqlite3_free(err_msg);
            sqlite3_close(db);
            return 1;
        }
    }
}

Compile: gcc cJSON.c file.c -lcurl -lsqlite3

Error: Error before: [tag_count_co]

you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 9 points 1 year ago* (last edited 1 year ago) (2 children)
    fp = fopen("data.json", "r");
    fread(buffer, 1024, 1, fp);
    fclose(fp);

You're only reading the first 1024 bytes, that document is much larger than 1024 bytes. So the JSON library gets an incomplete JSON that ends unexpectedly. You need a much larger buffer than 1024 (ideally dynamically allocated to the right size or expandable). fread can also fail or read less than the specified amount of bytes. The correct way to use fread is to keep calling it until it returns <1 and feof returns true, and adjust the pointer in buffer to advance by whatever fread returns since you can't assume it will always return the full requested amount of data. If you have 256b of a file in memory and request to read 1024, it can return you the 256 immediately while it goes fetch the 768 others from disk. It usually doesn't, but it can:

RETURN VALUE
       On success, the number of bytes read is returned (zero indicates end of
       file),  and the file position is advanced by this number.  It is not an
       error if this number is smaller than the  number  of  bytes  requested;
       this  may happen for example because fewer bytes are actually available
       right now (maybe because we were close to end-of-file,  or  because  we
       are reading from a pipe, or from a terminal), or because read() was in‐
       terrupted by a signal.  See also NOTES.

Read man 3 fread and man 2 read for more details, or look it up online.

Any particular reason it has to be C? That would be much much easier in Python or JS since you don't have to worry about that kind of memory management. It's a good learning experience though, C is useful to know even if you never use it, since everything ends up in libc.

[–] alr 3 points 1 year ago (1 children)

For the benefit of anyone reading this later, the function to check end-of-file should be feof, not foef.

[–] [email protected] 1 points 1 year ago

Updated, good catch!

[–] [email protected] 1 points 1 year ago (2 children)

Really? That was the issue.

I wanted to learn C after having an Adruino (and making neat programs), but seeing what fread does, that is a nightmare.

Well, Python is my friend now.

Thanks.

[–] [email protected] 5 points 1 year ago

seeing what fread does, that is a nightmare.

It's really not all that bad considering that you're pretty close to talking directly to the kernel. The C language itself is pretty simple: it doesn't come with any built-in functions or anything. It's code, it gets compiled to a binary. But we need a way to talk to the operating system, and that's where the C standard library or libc comes in. It contains a standard set of operating system utilities and functions that you can expect most operating systems to implement. The C standard library is pretty old, from an era when a megabyte of RAM was a lot of RAM and where every CPU cycle counted. So it doesn't do a whole lot: it's supposed to be the building block for everything, it needs to be fast and flexible and as small as possible.

What you want is nicer C libraries that makes those things easier to work with. Or to write a function whenever you encounter something repetitive, in this case it's probably 20-25 lines to properly implement the necessary malloc, fopen and fread then you're done with it forever. Copy paste it in your next project, or make yourself a library of C gadgets you accumulate over time.

If you're looking for an experience close to C but a little more batteries included, you might want to consider C++ which does still get new modern features to it. Most C code is valid C++ code, it's not like learning an entirely new language. Reading a whole file for example is a lot more straightforward (source):

std::ifstream t("file.txt"); // input file stream of "file.txt"
std::stringstream buffer; // a buffer for an arbitrarily long string
buffer &lt;&lt; t.rdbuf(); // send everything from the read buffer of the file into the string buffer

You can, however, appreciate how close you're working with the hardware and kernel, just like the Arduino: the interface with the kernel (on Linux) to read a file is basically:

  • Program: "please read up to N bytes at address X for this file descriptor"
  • Kernel: puts N bytes at X "here's how many bytes I was able to get for you"

That's the whole, unabridged thing. From your perspective, you made a syscall and the data magically appeared in your memory. The kernel has no idea what your intentions are, you just ask it to read stuff and it gives it to you. If you misplace it or tell it to read too much or not enough, it doesn't know, it obliges. If you want to read the whole file, it's your job to keep asking the kernel for more, just how the kernel will ask the hard drive for more. That's exactly what read(fd, buffer, bufsize) does, nothing more, nothing less. It's the smallest unit of work you can ask, and it's efficient. The data possibly went straight from your NVMe to the memory of your program. Copied exactly once.

And this is why fread is the way it is, as are most libc things. Interestingly, even that has become too inefficient, and now we have things like io_uring, that's even more complicated to use but really really fast.

It's a whole rabbit hole and that's why we use libraries and avoid using those too much.

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago) (2 children)

Well, Python is my friend now.

The Python read() function, which is basically the same as fread() also accepts a length parameter and 1024 is a sensible number to give it.

You can read the entire file in python but you probably shouldn't do that. In the real world, your software will be given files larger than it can handle and it's important to have logic in place to process the file incrementally instead of all at once.

Honestly, it's what makes this particular problem so challenging... the proper way to parse a JSON file is as a "stream" and not by holding the entire string in memory at once unless you can guarantee it's very small.

It's definitely possible to write a C program that reads the entire file. You just won't find many examples of that because it's a bad idea. What if someone gives you a 30GB file?

You said you used an "AI to assist" but it failed? What were you using? I pasted it into ChatGPT 4 and it found eight problems - including this one. Some of the others are even worse... for example the CREATE TABLE query is literally never executed. Oops.

[–] [email protected] 2 points 1 year ago

Streaming JSON parsers are a thing, e.g. pdjson for C. It's, of course, a different approach and it's generally slightly trickier to work with those, but that's what you would use of you have unbound document size and you can process it in chunks.

[–] [email protected] 0 points 1 year ago

While true it should stop reading a 30G file, I am the stupid when it come to this and I just need a program running.

The AI I used was Bing AI. It was "Helpful" since I don't know C at all, but my primitive JS knowledge couldn't save it.

I know small amount of Python and this is the only time I used AI for assistance. I have to say, the context is there, but needs fixing from time to time.