I'm going to give a lesson on regular expressions now.
 o regular expression
 kk
 i actually used preg_split in that script
* Farkie (~Farkie@farkie.co.uk) has joined #learnphp
 After this lesson, you will hopefully be on your way to personifying this comic: http://xkcd.com/208/
* Starfoxxes (~Starfoxxe@HSI-KBW-078-043-148-193.hsi4.kabel-badenwuerttemberg.de) has joined #learnphp
* Bor0 (~sitnik@unaffiliated/boro/x-000000001) has joined #learnphp
 == PHP Regex ==
* randomdrake (~ddrake@208-84-194-2.rev.ackley.net) has joined #learnphp
 HA!
 PHP's primary regular expression library is PCRE (Perl Compatible Regular Expressions). It uses the same basic syntax as Perl regular expressions, which have become a sort of de facto standard since Larry Wall created Perl in the 1980s.
 These functions all start with the preg_ prefix. Their manual pages can be found at http://us.php.net/manual/en/ref.pcre.php
 Before we get into the functions, however, we need to learn the syntax.
* salathe (salathe@idle.php.net) has joined #learnphp
 (old php also has ereg_)
 Yes, PHP has its own dialect of regex, but those functions have been unofficially deprecated for a long time, and officially since 5.3.
 A good utility for this lesson, and whenever you need to build complex regexes, is http://www.regextester.com/.
 Go there and make sure the PREG dialect is selected.
 == The Basics ==
 Perl regex are always surrounded by the "/" delimeter.
 To search for the word "Foo", you'd use the expression "/Foo/"
 You can also substitute the delimeters with other symbols.
 Say you're dealing with a filename: /etc/php/php.ini
 Your regex string may be "/\/php.ini/" or just "|php.ini|"
 err
 "|/php.ini|"
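A minimal sketch of the delimiter choice above, assuming the file path from the example. The variable names are made up for illustration. Single-quoted PHP strings are used so the backslashes reach the regex engine unchanged, and the dot is escaped so it only matches a literal "." (a small addition to the patterns in the chat).

```php
<?php
// Path taken from the example in the lesson.
$path = '/etc/php/php.ini';

// With "/" as the delimiter, the slash inside the pattern has to be escaped.
$withSlash = preg_match('/\/php\.ini/', $path);

// With "|" as the delimiter, the slash needs no escaping.
$withPipe = preg_match('|/php\.ini|', $path);

var_dump($withSlash, $withPipe); // both are int(1)
```

One trade-off worth knowing: whichever character is used as the delimiter has to be escaped if it also appears inside the pattern.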
 == Special Characters ==
 Now the real power of regular expressions has to do with the various special characters and operators that make finding types of text a breeze.
* randomdrake (~ddrake@208-84-194-2.rev.ackley.net) has left #learnphp
 There are a lot.
 One of the best resources for finding the special character you need that I've found is at http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
 You shouldn't just bookmark this page; you should print it. It makes an invaluable desk reference. I laminated this print out years ago and it really helps when I'm stumped.
 Going in order of the special chars I use most often:
 .      The "." signifies "any character".
 Ex:   /F.o/ would match "Foo", "Fao", "F-o", "F o", "F1o", etc.
 *      Matches zero or more of the preceding pattern.
 Ex:  /F.*o/ would match "F-----asdfasdfasdfo", "Fo", etc.,  but not "Fa"
 +      Matches one or more of the preceding pattern.
 Ex:  /F.+o/ would match "F-----asdfasdfasdfo", "Fao", etc.,  but not "Fa" or "Fo"
 ?      Matches none or one.
 ohh!
 I think I understand.
 Ex:  /F.?o/ would match "Fao", "Fo", etc.,  but not "Fa" or "Faaaaao"
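The examples so far can be checked with a short script. This is only a sketch: the $checks table is a hypothetical structure that pairs each pattern with one of the sample strings from the lesson and the result the lesson predicts.

```php
<?php
// Pattern, subject, and whether the lesson says it should match.
$checks = [
    ['/F.o/',  'Foo',     true],
    ['/F.*o/', 'Fo',      true],
    ['/F.*o/', 'Fa',      false],
    ['/F.+o/', 'Fao',     true],
    ['/F.+o/', 'Fo',      false],
    ['/F.?o/', 'Fo',      true],
    ['/F.?o/', 'Faaaaao', false],
];

$results = [];
foreach ($checks as [$pattern, $subject, $expected]) {
    // preg_match() returns 1 when the pattern is found, 0 when it is not.
    $matched = preg_match($pattern, $subject) === 1;
    $results[] = $matched;
    printf("%-8s %-9s %s\n", $pattern, $subject, $matched ? 'match' : 'no match');
}
```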
 ()     Parentheses have two functions; one of them allows you to group more than one character into a pattern.
 you lost me on delimeter
 salathe, do you know what "delimeter" means?
 You use regular expressions to search for data?
 does it involve sandwiches and a measuring tape?
 companyhen, to search and to search/replace
 salathe, it's basically just a symbol that you put at the beginning and end of a search string.
 tsmith: okay, but why?
 salathe, because that's the way Perl regexes were designed by Larry Wall in the 1980s. I don't know of any technical reason why.
 salathe, PHP's own functions (ereg_) aren't nearly as powerful as the preg_ ones and use non-standard syntax, but they don't require delimiters.
* Starfoxxes (~Starfoxxe@HSI-KBW-078-043-148-193.hsi4.kabel-badenwuerttemberg.de) has left #learnphp
 actually it uses POSIX instead of PCRE
 ok
 thanks Farkie 
 ok continuing...
 ()     Parentheses have two functions; one of them allows you to group more than one character into a pattern.
 tsmith: to be perl-compatible is a good enough reason for me :)
 Ex:  /F(oo)+/  would match "Foo" "Foooo", even "Fooooooool", but not "Fo" or "Fol".
 There are a few more really important special characters:
 pcre regexp are the best :D
 but if you did /Foo/ would that still match foooooooo
 Arth, no
 /Foo/i would
 Arth, yes. 
 sebbu, are you sure?
 What does the i do?
 F != f
* tapeworm has quit (Read error: Connection reset by peer)
 case insensitive
 so, parentheses?
  echo preg_match("/Foo/", "Foooooo");
 1
 it does in fact match Foooo, Arth is correct.
 [21:35:01]  but if you did /Foo/ would that still match foooooooo
 oh I misread, my bad.
 salathe is right... if you put an "i" after the last delimiter, it will do a case-insensitive search.
 I see that in the cheat sheet now
 go me! (but I think you meant sebbu)
 Yes.
 ok let's go on
* sebbu don't use lots of libraries/frameworks, usually i only do (small things?) from scratch
 o so /Foo/ = Foo
 err Foooooo
 Yes
 both
 abcFoodef too
 ok
 Yes
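To settle the back-and-forth above, here is a small sketch using the strings from the discussion. The variable names are made up for illustration.

```php
<?php
// A pattern without anchors matches anywhere inside the subject.
$plain  = preg_match('/Foo/', 'Foooooo');    // 1: "Foo" is found at the start
$inside = preg_match('/Foo/', 'abcFoodef');  // 1: found in the middle
$lower  = preg_match('/Foo/', 'foooooooo');  // 0: no capital F in the subject
$lowerI = preg_match('/Foo/i', 'foooooooo'); // 1: the i modifier ignores case

var_dump($plain, $inside, $lower, $lowerI);
```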
 ^      The caret indicates the beginning of a line.
 Indeed.
 tsmith: sorry to interrupt, what about the second function for parentheses?
 salathe, I will get to that in a bit when I cover preg replacements.
 Ex: /^Foo/ would match "Foo" but not "   Foo".
 $      The dollar sign indicates the end of a line.
 Before the \n
 Ex: /Foo$/ would match "aasdfFoo" but not "Fooasdf"
 Interesting..
 the ^ is caret
 "asdFoosad" would not work with /Foo$/ then
 but /sad$/ would
* PlotCitizen (~PlotCitiz@unaffiliated/plotcitizen) has joined #learnphp
 companyhen, correct.
 and you have to declare $ at the end of your expression and ^ at the beginning?
 i.e. /^Foo/ but not /Foo^/
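A quick sketch of the two anchors, using the sample strings from the lesson plus one extra subject ending in a newline to show the "before the \n" remark. Variable names are illustrative only.

```php
<?php
$startOk  = preg_match('/^Foo/', 'Foo');       // 1
$startBad = preg_match('/^Foo/', '   Foo');    // 0: leading spaces come first
$endOk    = preg_match('/Foo$/', 'aasdfFoo');  // 1
$endBad   = preg_match('/Foo$/', 'Fooasdf');   // 0
$endNl    = preg_match('/Foo$/', "Foo\n");     // 1: $ also matches just before a final newline

var_dump($startOk, $startBad, $endOk, $endBad, $endNl);
```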
 OK Now
 we're going to talk about some complex stuff now.
 Perl Regexes are inherently "greedy".
 That means, by default if they can match more, they will match more, even if you don't really want that.
 Ah, I was just messing around with Regular Expressions and I remember the greedy property, it's quite useful.
 Consider the following:
 String: I  =really= like this book. It =rocks=.
 We want to find the first instance of =word=, so we try:
 preg_match("/=.+=/", "I  =really= like this book. It =rocks=.");
 because of the greediness of regex, we will not end up with "=really=". We will end up with "=really= like this book. It =rocks=".
 In order to overcome this, you want to append the /U (capital U) modifier like:
 preg_match("/=.+=/U", "I  =really= like this book. It =rocks=.");
 That will match "=really=".
 Clear to everyone?
 U = Ungreedy pattern
 Yes, precisely.
 Ungreedy=lazy
 you can also add a ? after the + or the * to have the same effect
 and .+ = everything in between = =
 Yes, I was about to point that out.
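The greedy and ungreedy variants can be compared side by side. A sketch using the sentence from the lesson; the variable names are made up.

```php
<?php
$text = "I  =really= like this book. It =rocks=.";

preg_match('/=.+=/',  $text, $greedy);   // default behaviour: greedy
preg_match('/=.+=/U', $text, $ungreedy); // U modifier inverts greediness for the whole pattern
preg_match('/=.+?=/', $text, $lazy);     // +? makes only this one quantifier lazy

echo $greedy[0], "\n";   // =really= like this book. It =rocks=
echo $ungreedy[0], "\n"; // =really=
echo $lazy[0], "\n";     // =really=
```

The ? form is usually the more portable of the two, since the U modifier is specific to PCRE.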
 There's a more fool proof way to do this that works in far more regex engines than Perl's...
 Here's the website I was messing with http://gskinner.com/RegExr/
 that's to search until the pattern you want...
 You should check it out.
 Whoa.. cool site PlotCitizen
 yeah, that looks awesome
 There's a more fool proof way to do this that works in far more regex engines than Perl's...
 that's to search until the pattern you want...
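One way to read "search until the pattern you want" is a negated character class: match anything that is not the closing "=" and then the "=" itself. This is an assumption about what was meant, sketched with the same sentence as before.

```php
<?php
$text = "I  =really= like this book. It =rocks=.";

// [^=]+ means "one or more characters that are not an equals sign".
preg_match('/=[^=]+=/', $text, $first);
preg_match_all('/=[^=]+=/', $text, $all);

echo $first[0], "\n"; // =really=
print_r($all[0]);     // =really= and =rocks=
```

Because the class can never run past the next "=", the result does not depend on greedy or lazy settings.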
 Oops i forgot negative patterns.
 PlotCitizen, doesn't work without flash
 I know :(
 Negative Matches:
 Sometimes you want to find when a character pattern doesn't exist.
 In this case, you use the pattern: [^c] or [^(pattern)]
 For instance, I wanted to find every line that didn't begin with "    " (e.g. wasn't the start of a paragraph):
 I used /\n[^(    )]/
 \n == "newline"
 In this case, it returned every newline except for those where the next line started with "    "
 You can use negative matches to lazily grab content, too:
 for instance:
 it wouldn't work for the first line
 sebbu, yeah but you could just insert a newline yourself and then it would. That was simplest in my case.
 that doesn't look right
 salathe, i could create a test case real quick
 what do brackets do?
 start a line with a parenthesis
 ah
 companyhen, the pattern [^a] would find stuff that wasn't "a"
 companyhen: inside [] brackets are just brackets
 can you search for parenthises
 the pattern [^(abc)] would find stuff that wasn't "abc"
 so "negative" as you stated
 Yes.
 tsmith: it would find a single character that isn't in the list "()abc"
 Arth: Yes, you can search for parentheses and other utility characters by putting a backslash before it
 salathe, no... i do not believe so. that would be [^abc]
 For example, if you want to search for a number sign, you do \#
 ok cool
 tsmith: no, that would find a single character that isn't in the list a, b or c
 so it would be /\(/
 salathe, so how would you not match "abc"?
 using assertions
 depends what exactly you're trying to do, a negative assertion would be  usual
 is that (?!abc)?
 not any other choice
 [^(abc)] don't find acb
 ok. my bad.
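The difference discussed above can be shown in a few lines. This is a sketch, not the code anyone in the channel ran: the sample strings and variable names are invented, and the last pattern is one possible way to find lines that do not start with four spaces.

```php
<?php
// [^(abc)] is a character class: one character that is not "(", ")", "a", "b" or "c".
$classOnAcb = preg_match('/[^(abc)]/', 'acb'); // 0: every character is in the list
$classOnAbd = preg_match('/[^(abc)]/', 'abd'); // 1: "d" is not in the list

// A negative lookahead asserts that the text "abc" does not come next.
$notAbc = preg_match('/^(?!abc)/', 'abd');     // 1
$isAbc  = preg_match('/^(?!abc)/', 'abcdef');  // 0

// Lines that do not begin with four spaces (m makes ^ and $ work per line).
$lines = "    indented\nplain\n    indented again";
preg_match_all('/^(?! {4}).+$/m', $lines, $unindented);

print_r($unindented[0]); // only "plain"
```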
 Negative lookaheads and lookbehinds are beyond the scope of this lesson.
 i could explain a little of them
 so are character classes it seems :P
 but that's hard
 (sorry, low punch)
 salathe, i haven't gotten there yet.
 tsmith: you haven't?
 []     I don't really know how to describe brackets.
 salathe, evil question : how would i match punctuation in a fully utf-8 compatible way ? :p
 okay i understand there are negative matches and regular matches...
 could you break this down and explain it
 whats the next step? :P
 sebbu, you'd have to pass the "u" unicode modifier.
 "/\r\n|\r|\n/"
 That would match a \r\n Windows line ending, or a lone \r, or a lone \n. The | means "or".
 so \r\n or \r or \n
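In PCRE the | is alternation, so the pattern above matches any one of the three line endings. A small sketch with preg_split(), using an invented sample string that mixes all three.

```php
<?php
// Double quotes here so PHP turns \r and \n into real control characters.
$text = "one\r\ntwo\rthree\nfour";

// \r\n is listed first so a Windows ending is consumed as one separator, not two.
$lines = preg_split('/\r\n|\r|\n/', $text);

print_r($lines); // one, two, three, four
```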
 sebbu: \pP in UTF8 mode
 err
* Spectrael (~josh@pool-72-66-213-191.ronkva.east.verizon.net) has joined #learnphp
 salathe, i've never seen the \p before.
 what does it do?
 http://php.net/regexp.reference.unicode
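A rough sketch of the \pP suggestion, assuming PCRE was built with Unicode support (the usual case). The sample text is invented and written with \u{...} escapes so the file encoding does not matter.

```php
<?php
// "Hello, world!" followed by a Spanish question with an inverted question mark.
$text = "Hello, world! \u{00BF}Qu\u{00E9}?";

// \pP matches any Unicode punctuation character; the u modifier turns on UTF-8 mode.
$count = preg_match_all('/\pP/u', $text, $found);

print_r($found[0]); // the comma, "!", the inverted question mark, and "?"
```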
 thunderstorm gonna knock out my internet i can feel it :\
 then what am i suppose to do?!
 []   is the Range operator.
 Spectrael: sit around a candle telling ghost stories?
 If you put individual characters inside of [] it will match any of them.
 no candles :(
 buy a generator asap
 E.g. /[abc]/ would match any string with at least one instance of the letter "a" and/or "b" and/or "c"
 tsmith: the usual term is "character class" for []
 ok
 ahh
 [a-z] would match the lower case a through z
 [A-Z] would match the upper case
 [a-zA-Z0-9] would match all alphanumeric characters.
 This brings us to Character Classes, which is what salathe was hinting at.
 Several of these ranges (they are ranges, not character classes) can be simplified by just calling their character class equivalents.
 \s     Matches every white space character (spaces, tabs and new lines).
 \S     Matches every non-white space character.
 \d     Matches every digit (0-9)
 \D     Matches every non-digit
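A short sketch of the bracket ranges and the backslash shorthands in use. The sample string and variable names are invented for illustration.

```php
<?php
$s = "Order 66, aisle B7";

preg_match_all('/[a-z]/', $s, $lower);      // single lower-case letters
preg_match_all('/\d+/', $s, $digits);       // runs of digits
$words      = preg_split('/\s+/', $s);      // split on runs of white space
$digitsOnly = preg_replace('/\D/', '', $s); // drop every non-digit
$isAlnum    = preg_match('/^[a-zA-Z0-9]+$/', 'abc123');

print_r($digits[0]);    // 66 and 7
print_r($words);        // Order / 66, / aisle / B7
echo $digitsOnly, "\n"; // 667
```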
 those aren't character classes at all
 (they are ranges, not character classes)
 dammit
 http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
 i'm not the only one who got it wrong lol
 what's the definition of a character class?
 I think we're both right.
 that's the [] syntax
 they are, in fact, ranges.
 I think that just about wraps up simplistic searches.
 a range is denoted by the dash, so A-Z is a range from A through Z inclusive
 a range would be {3,5} for digits at least
 ooo
 ok
 let's move on to regex replacement.
 so [a-z0-9@] is a character class, with two ranges and a single character.
 where do you get @ from salathe
 just above 2 on my keyboard
 (it was a random character choice)
 lo
 it matches any character a-z 0-9 and @
 tsmith, character classes are [:alnum:]
 sebbu: in POSIX-land
 makes sense
 so if you wanted to search for valid domain names;
 (and in PCRE which supports those POSIX classes too)
 Relevant xkcd: http://xkcd.com/208/
 so if you wanted to search for valid domain names, it'd be something like this:
 (in PCRE you'd want to use [[:alnum:]] or [a-zA-Z0-9] to match a single alphanumeric character)
 Search for domains: /([a-z0-9_].)*[a-z0-9_]{2}.[a-z]{2}/
* salathe stops interrupting 
 err
 Search for domains: /([a-z0-9_]*\.)*[a-z0-9_]{2,}\.[a-z]{2,}/
 the {2,} means at least 2
 ( "_" isn't valid for domains)
 crap
* salathe really stops interrupting
 sorry I'll let you get on!
 Search for domains: /([a-z0-9\-]*\.)*[a-z0-9\-]{2,}\.[a-z]{2,}/
 the \- you have to use inside [] to differentiate it from the other "-"
 ( the \ before - isn't necessary there, but is sensible to use)
* salathe sits on his hands
 tsmith, shouldn't there be a + instead of a * in the first parentheses?
 ([a-z0-9\-]*\.)  matches the subdomain, * means "maybe it's there, maybe it isn't"
* sebbu is not sur www....google.fr is  valid
 sure*
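The question about * versus + can be tested directly. This is a sketch under two assumptions that are not in the chat: ^ and $ are added so the whole string has to be a domain, and the host names other than www....google.fr are invented samples. Neither pattern is a complete domain validator.

```php
<?php
// The pattern from the chat, anchored at both ends.
$loose  = '/^([a-z0-9\-]*\.)*[a-z0-9\-]{2,}\.[a-z]{2,}$/';
// Same pattern with + in the subdomain group, so empty labels are rejected.
$strict = '/^([a-z0-9\-]+\.)*[a-z0-9\-]{2,}\.[a-z]{2,}$/';

$hosts = ['example.com', 'www.example.co.uk', 'www....google.fr'];
$report = [];
foreach ($hosts as $host) {
    $report[$host] = [
        'loose'  => preg_match($loose, $host) === 1,
        'strict' => preg_match($strict, $host) === 1,
    ];
}

var_export($report);
```

With * the subdomain group accepts a bare ".", which is why the run of dots gets through the first pattern but not the second.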
 As you guys can see, regex takes lots of refinement.
 It's hard to think in them and it's easy to make mistakes.
 lol its one of the most confusing things
 That's why unit testing is generally recommended.
 lol
 == Regex Replacements ==
 and it's not well-suited to do recursive parsing (including html)
 Yes, you should use simple_html_dom for your HTML parsing needs.
* sebbu prefer sax
 OK let's stop interrupting or i'm never going to get to the good stuff.
 While regex searching is powerful, it pales in comparison to the power of regex replacements.
 tsmith: the regex syntax is usually the easy step; deciding what you really want the regex to do is usually the time-consuming part.
 salathe, maybe for you. 
 I have used preg_match() maybe 20 times in my career, but I've used preg_replace() innumerable times.
 Perl replacement patterns take the form of s/search/replace/.
 But in PHP land, it's more like preg_replace("/search/", "replacement", $subject);
 e.g.:
  $text = "The Pig jumped over The Moon";
 $results = preg_replace("/Pig/", "Cow", $text);
 print_r($results);
 Output:
 The Cow jumped over The Moon
 makes sense :P
 The most powerful thing about regex is that parentheses can be used to hold variables (i'm not sure what they're called, actually)
 for instance:
 What if we just wanted to know what animal jumped over the moon?
 "capturing subpattern reference"
 (the "(blah)" being a capturing subpattern, as opposed to a non-capturing sub pattern)
 Yes.
  $text = "The Pig jumped over The Moon";
 $search = "/(The .+) jumped over The Moon/";
 $replacement = "$1";
 $results = preg_replace($search, $replacement, $text);
 print_r($results);
 The contents of each () are referenced by number; $1 = "The Pig" in this context.
 The output is literally: "The Pig"
 What if we wanted to know the action, too?
 or rather, wanted to return it
  $text = "The Pig jumped over The Moon";
 $search = "/(The .+) (.+) over The Moon/";
 $replacement = "$1: $2";
 $results = preg_replace($search, $replacement, $text);
 print_r($results);
 Output: "The Pig: jumped"
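Captured groups can also be reordered in the replacement. A sketch that extends the example above with a third group; the reworded sentence is invented for illustration. The replacement is single-quoted so PHP never tries to treat the dollar signs as variables.

```php
<?php
$text = "The Pig jumped over The Moon";

// $1 = who, $2 = the action, $3 = what was jumped over.
$search      = '/(The .+) (.+) over (The .+)/';
$replacement = '$3 was $2 over by $1';

$result = preg_replace($search, $replacement, $text);

echo $result, "\n"; // The Moon was jumped over by The Pig
```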
 Via this method, we can do complex substitutions.
 For instance:
 To strip all HTML tags:
 strip_tags() ?
 Yeah
 i'm trying to convert vim regex to php and it's tough lol
 use /<[^>]+>/ ?
 OK it's not exactly PHP syntax, but:
 Given this text: twilight_of_american_culture
 and run through this regex replacement:
 (well, that wouldn't work on > in attributes name or values)
 preg_replace("/^[a-zA-Z0-9].+/", "    <li><a href="...">$0</a> - <a href="...">orig_html</a> - <a href="...">epub</a></li>", $title);
 the output is:
 ouch
 <li>twilight_of_american_culture - epub</li>
 That's an example where regex replacement really helps.
 === Regex Multi Pattern Matches ===
 that's not the output at all
 salathe, i converted it from vim syntax. it may have bugs.
 it's largely right, however.
 apart from the glaring PHP syntax errors and the regex not even nearly matching what you want
 ok shit. the original is : %s/^[a-zA-Z0-9].\+/\t\t\t\t<li><a href="...">\0<\/a> - <a href="...">orig_html<\/a> - <a href="...">epub<\/a><\/li>/gc
 and it works fine in vim.
 preg_replace('/^([a-zA-Z0-9].+)$/', '<li><a href="...">\1</a> - <a href="...">orig_html</a> - <a href="...">epub</a></li>', $title);
 it works
 but I don't think it's what you were really trying to do, but do continue :)
  $text = "twilight";
 echo preg_replace("/.+/", "$0 - $0", $text);
 twilight - twilight
 I don't see the "glaring" bug.
 It's right behind you.
 you fixed the PHP syntax problems in that second example
 you had clashing double quotes
 oh yeah that's understandable. Just needed to replace " with ' lol.
 and you fixed the regex problem (the superfluous ^[a-zA-Z0-9] )
 I think that's pretty good for on-the-fly conversion.
 oh in the original, there was HTML too
 yeah, it's fine :)
 i only wanted to match stuff that didn't have html in it anyway
 === Regex Multi Pattern Matches ===
 Regexes are also very useful for parsing a string into distinct parts for later manipulation and/or analysis.
 You do this by adding an existing array as the third parameter of preg_match().
  $text = "The Pig jumped over The Moon";
 $search = "/(The .+) (.+) over (.+)/";
 $replacement = "$1: $2";
 $matches = array();
 preg_match($search, $text, $matches);
 print_r($matches);
 Output:
 Array
 (
     [0] => The Pig jumped over The Moon
     [1] => The Pig
     [2] => jumped
     [3] => The Moon
 )
 $matches[0] will always be the full text that matched the pattern. [1] will be the results in the first (), [2] in the second, etc.
 Now, I can give you any sentence that started with "The [blah] [did something] over [something else]" and it would pick it up and stick them into the correct array elements.
  $text = "The Blob slid over The drain";
 $search = "/(The .+) (.+) over (.+)/";
 $replacement = "$1: $2";
 $matches = array();
 preg_match($search, $text, $matches);
 print_r($matches);
 Array
 (
     [0] => The Blob slid over The drain
     [1] => The Blob
     [2] => slid
     [3] => The drain
 )
 That's useful for all sorts of stuff.
 the gymnast champion did a backflip over the carpet
 it matches, but is badly split
 Yeah because it doesn't match the pattern.
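Both remarks about the gymnast sentence can be checked. A sketch with the same pattern as above: as typed, with a lower-case "the", the sentence does not match at all; with a capital "The" (an assumed variant) it matches, but the greedy first group swallows more than the subject.

```php
<?php
$search = '/(The .+) (.+) over (.+)/';

// As typed in the chat: lower-case "the", so the literal "The" is never found.
$asTyped = preg_match($search, 'the gymnast champion did a backflip over the carpet');

// Capitalised variant: it matches, but group 1 takes everything up to the last word before "over".
preg_match($search, 'The gymnast champion did a backflip over the carpet', $m);

var_dump($asTyped); // int(0)
print_r($m);        // [1] The gymnast champion did a, [2] backflip, [3] the carpet
```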
 exactly
 (yes, i willfully choose multiple word for each match)
 Regex doesn't help us fully parse the English language, but it helps us get closer. A lot closer than strpos() and substr().
 willingly*
 i once did a html parser with strpos&substr
 You mean you tried ;p
 no, i did
 HTML can't be parsed by string tokenizers like strpos or regex.
 well, i didn't need the full html syntax, nor complex analysis
 Yeah, true enough.
 it was primarily a table with merged cells
 I think that concludes this lesson. Any questions?
 pretty good stuff tsmith!
 I'd like to thank everyone for attending.
 not exactly, but i have several things to say to complete it
 ok sure! sebbu, go ahead.
 first, alternatives
 oh yes.
 good stuff and good timing for the end of the lesson! *leaves work*
 you can do pattern to match lots of things
 but sometimes you have several choices or limited ones
 so a pattern like /(zero|one|two|three|four|five|six|seven|eight|nine)/ will match any english-spelled digit
 That's one more overloaded use of the ()s, btw.
 i'll focus on some ()
 if you wanted to limit the possible subject in the last examples
* companyhen has quit (Quit: companyhen)
 /(The .+) (.+) over (.+)/
 you could try to use that alternative
 but it'll mess around with the matches because there's already ()
 like /(The (human|cat|dog|pig|cow)) (.+) over (.+)/
 there's a syntax to use non-matching ()
 /(The (?:human|cat|dog|pig|cow)) (.+) over (.+)/
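The effect of (?:...) on the matches array can be seen by running both patterns. A sketch with an invented subject; note the lower-case "pig", since the alternatives in the pattern are lower-case.

```php
<?php
$text = "The pig jumped over The Moon";

// The inner (...) captures too, which shifts the later groups by one.
preg_match('/(The (human|cat|dog|pig|cow)) (.+) over (.+)/', $text, $capturing);

// (?:...) groups the alternatives without capturing them.
preg_match('/(The (?:human|cat|dog|pig|cow)) (.+) over (.+)/', $text, $nonCapturing);

print_r($capturing);    // [1] The pig, [2] pig, [3] jumped, [4] The Moon
print_r($nonCapturing); // [1] The pig, [2] jumped, [3] The Moon
```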

    [[Category:PHP_From_Beginner_to_Pro]]