this post was submitted on 29 Jan 2025

Edit

My question was very badly written, but the new title reflects the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466 @tuna @learnbyexample) I was able to find a solution for my files, so thank you guys!!!

For those who randomly come across this post, here are 3 possible ways to achieve the desired result.

Solution 1 (https://lemmy.ml/post/25346014/16383487)

#! /bin/bash
files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (a third-level heading such as 1.2.3 needs an additional [0-9] group)
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Replaces encoded spaces (%20) with hyphens in markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

Solution 2 (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Solution 3 (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Relevant links

https://mike.bailey.net.au/notes/software/apps/obsidian/issues/markdown-heading-anchors/#background


Hi everyone !

I need some help with string manipulation using sed and regex. I spent a whole day on trial and error and searching the web for a solution, but it's way beyond my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand!

From everything I gathered around the web, it seems this is a rather complicated regex and sed substitution. Here we go!

What am I trying to achieve?

I have a lot of markdown guides I want to host on a self-hosted, Forgejo-based git instance. However, the classic markdown links are not the same as the ones on GitHub/Forgejo...

Convert the following string:

[Some text](#Header%20Linking%20MARKDOWN.md)

Into

[Some text](#header-linking-markdown.md)

As you can see, these are the requirements:

  • Pattern: [Some text](#link%20to%20header.md)
  • Only edit what's between parentheses
  • Replace space (%20) with -
  • Everything as lowercase
  • Links are sometimes in nested parentheses
    • e.g. (look here [Some text](#link%20to%20header.md))
  • Do not change a line that begins with https (external links)

While the whole thing is probably a bit complex, the trickiest part is probably the nested parentheses :/

What I tried

The furthest I got was the following:

sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase

sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -

I put these sed/regex substitutions together while roaming the web, but they have a lot of flaws and don't work with nested parentheses. The second one would also change every %20 occurrence in the file (outside of lines containing https), not just the ones inside links.
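To make these flaws concrete, here is a small sketch that runs both substitutions as a pipeline (without `-i`) on two made-up lines. The function name `naive_fix` and the sample lines are hypothetical, and GNU sed is assumed because of `\L`.

```shell
#!/bin/bash
# Hypothetical helper: the two attempted substitutions, applied to stdin.
naive_fix() {
  sed -E 's|\(([^\)]+)\)|\L&|g' | sed '/https/ ! s/%20/-/g'
}

# Nested parentheses: the match starts at the OUTER "(", so the link
# label is lowercased along with the link target.
printf '%s\n' '(look here [Some Text](#Link%20To%20Header))' | naive_fix

# A %20 in ordinary prose is rewritten too, although it is not a link.
printf '%s\n' 'Spaces are encoded as %20 in URLs.' | naive_fix
```

The first line comes out with its label changed to `[some text]`, and the second line loses its literal `%20`, which is why the substitution has to be restricted to the part after `](`.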

The closest solution I found on stackoverflow looks similar, but I wasn't able to make it fit my needs. My lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.


I would appreciate any help, even if a change of tool is needed. However, I'm more interested in the learning process, so a script or CLI alternative is very much appreciated :) Actually, any help is appreciated :D !

Thanks in advance.

[–] [email protected] 2 points 2 days ago* (last edited 2 days ago) (1 children)

I did it!! It also handles the case where an external link and internal link are on the same line :D

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Here is my annotated file

# Begin loop
:l;

# Split the line at its first link: the link target (and the rest of the line)
# stays in pattern space, everything up to and including `(' goes to hold space
# Example: `text [label](file#fragment)'
#   Pattern space: `file#fragment)'
#   Hold space: `text [label]('
# Steps:
#   1. Strategically insert \n
#       1a. If this fails, branch out
#   2. Append to hold space (this creates two \n's. It feels weird for the
#      first iteration, but that's ok)
#   3. Copy hold space to pattern space, remove first \n, then trim off
#      everything past the second \n
#   4. Swap pattern/hold, and trim off everything up to and incl the last \n
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
Te;
H;
g; s/\n//; s/\n.*//;
x; s/.*\n//;

# Modify only if it is an internal link
/^https?:/! {
    # Add hyphens
    :h;
    s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
    th;
    # Make lowercase
    s/(#[^)]*\))/\L\1/;
};

# "conditional" branch so it checks the next conditional again
tl;

# Exit: join pattern space to hold space, then move to pattern space.
# Since the loop uses H instead of h, have to make sure hold space is empty
:e;
H;
z;
x; s/\n//;
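For anyone who wants to run the annotated version rather than the one-liner, here is a minimal sketch that saves a condensed copy of the script to a temporary file and runs it with `sed -E -f`. The temporary file, the function name `fix_links_sed` and the sample line are assumptions made for illustration. GNU sed is assumed, since `\L`, `T` and `z` are GNU extensions.

```shell
#!/bin/bash
# Hypothetical temporary location for the script; removed when the shell exits.
script="$(mktemp)"
trap 'rm -f "$script"' EXIT

# Condensed copy of the annotated script (most comments dropped).
cat >"$script" <<'EOF'
:l
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/
Te
H
g; s/\n//; s/\n.*//
x; s/.*\n//
/^https?:/! {
    :h
    s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/
    th
    s/(#[^)]*\))/\L\1/
}
tl
:e
H
z
x; s/\n//
EOF

# Hypothetical wrapper: reads stdin, writes stdout.
fix_links_sed() { sed -E -f "$script"; }

# One internal and one external link on the same line.
printf '%s\n' 'See [Guide](Another%20file.md#1.3%20Setup) and [Site](https://example.com/#Top).' | fix_links_sed
```

Only the fragment of the internal link is rewritten. The `%20` in the file name before the `#` and the external link are left alone.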
[–] [email protected] 2 points 2 days ago (1 children)

Wow! Thank you! I did a quick test on a test-file.md:

[Just a test](#just-a-test)
[Just a link](https://mylink/%20with%20space.com)
[External link](readme.md#just-a-test)
[Link with numbers](readme.md#1-3-this-is-another-test)
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)

Great job! Thank you very much!!! I'm really impressed by what someone with proper knowledge can do! However, I really do not want to mess around with your regex... That would only invite disaster xD! I will carefully keep your regex and annotated file in my knowledge base. I'm sure I will come back to it some time in the future and try to break it down as a learning process.

Thank you very much !!! 👍

[–] [email protected] 1 points 2 days ago

No problem. I think this is a great "final boss" question for learning sed, because it turns out to be deceptively hard!! You have to understand a lot not only about regex, but also about sed to get it right. I learned a lot about sed just by tackling this problem!

I really do not want to mess around with your regex

It is very delicate for sure, but one part you can safely change is the # Add hyphens part. In the regex you can see (%20|\.). This is a list of "characters" which get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!
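As a concrete sketch of that change, this is the same one-liner with `\+` added to the alternation, wrapped in a function so it can be tried on a sample line. The function name `fix_links_plus` and the sample lines are made up, and GNU sed is assumed.

```shell
#!/bin/bash
# Same one-liner as above, with \+ added to the "Add hyphens" alternation.
# With -E, \+ is a literal plus sign.
fix_links_plus() {
  sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.|\+)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
}

# Both + and %20 in the fragment become hyphens.
printf '%s\n' '[Some text](#Header+Linking%20MARKDOWN)' | fix_links_plus

# External links are still left alone.
printf '%s\n' '[Ext](https://example.com/a+b#Top)' | fix_links_plus
```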

Still, it is not perfect:

  • If the link spans multiple lines, the regex won't match
  • If the link contains escaped characters like \\\\\[LINK](#LINK) or [LINK\]\\\\](#LINK), the regex may match the wrong text
  • If the link is inside a code block ``` it will get changed (which may or may not be intended)

But for a sed-only solution this is about as good as it will get I'm afraid.

Overall I'm very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.