this post was submitted on 29 Jan 2025
29 points (100.0% liked)

Linux

49530 readers
785 users here now

From Wikipedia, the free encyclopedia

Linux is a family of open source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991 by Linus Torvalds. Linux is typically packaged in a Linux distribution (or distro for short).

Distributions include the Linux kernel and supporting system software and libraries, many of which are provided by the GNU Project. Many Linux distributions use the word "Linux" in their name, but the Free Software Foundation uses the name GNU/Linux to emphasize the importance of GNU software, causing some controversy.

Rules

Related Communities

Community icon by Alpár-Etele Méder, licensed under CC BY 3.0

founded 5 years ago
MODERATORS
 

Edit

My question was very badly written but the new title reflect the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466 @tuna @learnbyexample) I was able to find a solution for my files, so thank you guys !!!

For those who will randomly come across this post here are 3 possible ways to achieve the desired results.

Solution 1 (https://lemmy.ml/post/25346014/16383487)

#! /bin/bash
files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

Solution 2 (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Solution 3 (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Relevant links

https://mike.bailey.net.au/notes/software/apps/obsidian/issues/markdown-heading-anchors/#background


Hi everyone !

I'm in need for some assistance for string manipulation with sed and regex. I tried a whole day to trial & error and look around the web to find a solution however it's way over my capabilities and maybe here are some sed/regex gurus who are willing to give me a helping hand !

With everything I gathered around the web, It seems it's rather a complicated regex and sed substitution, here we go !

What Am I trying to achieve?

I have a lot of markdown guides I want to host on a self-hosted forgejo based git markdown. However the classic markdown links are not the same as one github/forgejo...

Convert the following string:

[Some text](#Header%20Linking%20MARKDOWN.md)

Into

[Some text](#header-linking-markdown.md)

As you can see those are the following requirement:

  • Pattern: [Some text](#link%20to%20header.md)
  • Only edit what's between parentheses
  • Replace space (%20) with -
  • Everything as lowercase
  • Links are sometimes in nested parentheses
    • e.g. (look here [Some text](#link%20to%20header.md))
  • Do not change a line that begins with https (external links)

While everything is probably a bit complex as a whole the trickiest part is probably the nested parentheses :/

What I tried

The furthest I got was the following:

sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase

sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -

These sed/regx substitution are what I put together while roaming the web, but it has a lot a flaws and doesn't work with nested parentheses. Also this would change every %20 occurrence in the file.

The closest solution I found on stackoverflow looks similar but wasn't able to fit to my needs. Actually my lack of regex/sed understanding makes it impossible to adapt to my requirements.


I would appreciate any help even if a change of tool is needed, however I'm more into a learning processes, so a script or CLI alternative is very appreciated :) actually any help is appreciated :D !

Thanks in advance.

you are viewing a single comment's thread
view the rest of the comments
[–] learnbyexample 2 points 5 days ago* (last edited 5 days ago) (2 children)

Here's a solution with perl (assuming you don't want to change http/https after the start of ( instead of start of a line):

perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge' ip.txt
  • e flag allows you to use Perl code in the substitution portion.
  • \[[^]]+\]\(\K match square brackets and use \K to mark the start of matching portion (text before that won't be part of $&)
  • (?!https?) don't match if http or https is found
  • [^)]+(?=\)) match non ) characters and assert that ) is present after those characters
  • $&=~s|%20|-|gr change %20 to - for the matching portion found, the r flag is used to return the modified string instead of change $& itself
  • lc is a function to change text to lowercase
[–] [email protected] 2 points 2 days ago (1 children)

Sorry for the late response... I was busy with another user :S My English is so bad I'm not able to response to every one at the same time... Whatever...

I tried your pearl regex substitution and effectively it does what I ask from my post, so thank you very much for your help ! However, I missed a few use cases were your regex breaks... But that's on me, your command works as expected !!!

[Link with numbers](Another%20Markdown%20file.md#1.3%20this%20is%20another%20test.md)

The part before the hashtag need to keeps it's original form (even with %20) because it links to a markdown file directly and not a header (Hope it's comprehensible?). It took me a lot of time with another user and we came to a wrapped up script that does everything:

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

If you are motivated you can still improve your regex If you want :) I'm kinda curious If it's possible with a one-liner ! Thank again for your help and sorry for the late response !!

[–] learnbyexample 2 points 2 days ago (1 children)

This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
[–] [email protected] 2 points 2 days ago (1 children)

Thank you ! It does actually ticks every use case (for my files) looks pretty rad !

This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

I totally agree but I will keep your regex as reference, in the near future I will give it a try to decompose you regex as learning process but it looks rather very complex !

Another user came up with the following solution:

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Just as a little experiment, If you want to spend some time and give me a answer, what do you think? It's a another way to achieve the same kind of results but they are significantly different. I know there a thousand ways to achieve the same results but I'm kinda curious how it looks from an experts eyes :).

Thanks again for your help and the time you took to write up a complex regex for my use case ! 👍

[–] learnbyexample 1 points 2 hours ago

Well, I'm not going to even try understanding the various features used in that sed command. I do know how to use basic loops with labels, but I never bothered with all the buffer manipulation stuff. I'd rather use awk/perl/python for those cases.

[–] [email protected] 2 points 5 days ago (1 children)

I didn't test this, but it will change the whole URL while changes are only needed in its fragment component (after the first #).

[–] learnbyexample 1 points 5 days ago (1 children)

Hmm, OP mentioned "Only edit what’s between parentheses" - don't see anywhere that whole URL shouldn't be changed...

[–] [email protected] 1 points 5 days ago

Paths are constant, only anchors are generated by forgejo.