Finding dates and times in text
My hobby is trawling the web pages of a few hundred music groups to produce a schedule of their forthcoming performances anywhere in the world. Many of the announcements are in a narrative posting - so a function is needed to identify and extract the possible future date/time from a mass of text.
It seemed fairly easy at first - yyyy/mm/dd, dd/mm/yy, mm/dd/yy, dd/mm/yyyy. Then came variants using dots or dashes instead of slashes - and some with spaces too. Some had leading zeroes in their months and days, some did not - not to mention the regular typos.
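As an illustration only (this is not the code I actually run), a single regular expression can cover the slash, dot and dash variants with optional spaces and optional leading zeroes. The names NUMERIC_DATE and find_numeric_dates are made up for this sketch, and it deliberately does no validation: it just returns the three numbers in the order they appear.

```python
import re

# Three groups of digits joined by the same separator (slash, dot or dash),
# with optional spaces around the separator. Hypothetical pattern for illustration.
NUMERIC_DATE = re.compile(
    r"(?<!\d)(\d{1,4})\s*([./-])\s*(\d{1,2})\s*\2\s*(\d{1,4})(?!\d)"
)

def find_numeric_dates(text):
    """Return (first, second, third) integer triples for each date-like match."""
    return [
        (int(m.group(1)), int(m.group(3)), int(m.group(4)))
        for m in NUMERIC_DATE.finditer(text)
    ]

print(find_numeric_dates("Gig on 2024/03/09, then 9 - 3 - 24 and 09.03.2024"))
```

A pattern this loose will also pick up things that are not dates, so whatever it returns still has to be checked before it is trusted.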
Many leave out the year - or even the month - although these sometimes appear in a page or section header. A change of year can be signalled by the order of entries - as the month suddenly jumps backwards. However, one page had entries without any year reference, listed in the random order in which the performances were booked.
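The "month jumps backwards" rule can be sketched roughly like this. It assumes the entries really are in date order and that the starting year is known from somewhere (a header, or simply the current year). The function name assign_years is hypothetical, and the sketch would give wrong answers for the page with randomly ordered entries.

```python
from datetime import date

def assign_years(month_days, start_year):
    """Attach years to (month, day) pairs that are listed in chronological order."""
    result = []
    year = start_year
    previous_month = None
    for month, day in month_days:
        # A month smaller than the one before it is taken to mean a new year.
        if previous_month is not None and month < previous_month:
            year += 1
        result.append(date(year, month, day))
        previous_month = month
    return result

for d in assign_years([(11, 20), (12, 5), (1, 14), (2, 2)], 2024):
    print(d.isoformat())
```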
The locale of the group is not a tie breaker for the dd/mm mm/dd variants. A European group page might announce a USA tour with US format dates.
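Since the locale cannot settle it, one option is to keep every reading of the numbers that forms a real calendar date and let a later step (or a human) choose. The sketch below is an assumption-laden example, not my actual code: the name interpretations is invented, a four-digit first number is read as yyyy/mm/dd, and two-digit years are assumed to belong to the 2000s.

```python
from datetime import date

def interpretations(a, b, c):
    """Return every valid calendar date that the numeric triple a/b/c could mean."""
    orders = []
    if a >= 1000:
        orders.append((a, b, c))            # yyyy/mm/dd
    else:
        # Assumption: a two-digit year means 20yy.
        year = c + 2000 if c < 100 else c
        orders.append((year, b, a))         # dd/mm/yy or dd/mm/yyyy
        orders.append((year, a, b))         # mm/dd/yy or mm/dd/yyyy
    found = set()
    for y, m, d in orders:
        try:
            found.add(date(y, m, d))
        except ValueError:
            pass                            # e.g. month 25 or 31 February
    return sorted(found)

print(interpretations(9, 3, 2024))    # two possible readings
print(interpretations(25, 12, 24))    # only one reading is a real date
```

Anything with a day above 12 resolves itself; the rest stay ambiguous until some other clue turns up.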
Then there are the month name variants in many languages - and their different ways of expressing that format. 20 January, January 20, 20 de gener. Plus the month abbreviations and different languages' ordinal suffixes like 1st and 1er. "May" and "March" are a particular nuisance, because in several languages the month name is also a verb.
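One way to reduce the "may" and "march" problem is to accept a month word only when a day number sits right next to it. The following is a minimal sketch along those lines, with a deliberately tiny month table. MONTHS, DAY_FIRST, MONTH_FIRST and find_named_dates are names invented for this example, and a real table would need every language and abbreviation encountered.

```python
import re

# Tiny illustrative table only.
MONTHS = {
    "january": 1, "jan": 1, "janvier": 1, "januar": 1, "gener": 1,
    "march": 3, "mar": 3, "mars": 3,
    "may": 5, "mai": 5,
}

ORDINAL = r"(?:st|nd|rd|th|er|\.)?"
# "20 January", "1er mars", "20 de gener", "1st of May"
DAY_FIRST = re.compile(
    r"\b(\d{1,2})" + ORDINAL + r"\s+(?:de\s+|of\s+)?([^\W\d_]+)", re.IGNORECASE
)
# "January 20", "Jan. 5", "May 1st"
MONTH_FIRST = re.compile(
    r"\b([^\W\d_]+)\.?\s+(\d{1,2})" + ORDINAL + r"\b", re.IGNORECASE
)

def find_named_dates(text):
    """Return (day, month) pairs for month names that have a day number beside them."""
    found = []
    for pattern, day_group, name_group in ((DAY_FIRST, 1, 2), (MONTH_FIRST, 2, 1)):
        for m in pattern.finditer(text):
            month = MONTHS.get(m.group(name_group).lower())
            day = int(m.group(day_group))
            if month and 1 <= day <= 31:
                found.append((m.start(), day, month))
    # Report the hits in the order they appear in the text.
    return [(day, month) for _, day, month in sorted(found)]

print(find_named_dates("20 January, January 21, 22 de gener, 1er mars"))
print(find_named_dates("We may play more shows if you march with us"))
```

The second call finds nothing, because neither "may" nor "march" has a number next to it. A phrase like "20 January 21" would be reported twice by this sketch, which is the sort of thing the validation step has to sort out.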
To cater for all the encountered variants and typos there are now functions numbered 1 to 7 - with awkward supplements of 2A, 5A, and 7A. Each caters for many variations on a theme - picking its first match that satisfies the validation criteria. The human eye is then presented with a selection of best guesses - together with part of their surrounding text.
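The general shape of "first match that passes validation, shown with its surrounding text" might look something like the sketch below. It is a simplified stand-in, not the real functions 1 to 7: the two patterns, the 30-character context window and the names first_valid, EXTRACTORS and best_guesses are all assumptions made for the example.

```python
import re
from datetime import date

ISO = re.compile(r"(?<!\d)(\d{4})[./-](\d{1,2})[./-](\d{1,2})(?!\d)")
DMY = re.compile(r"(?<!\d)(\d{1,2})[./-](\d{1,2})[./-](\d{4})(?!\d)")

# Each extractor pairs a pattern with a function that builds a date from a match.
EXTRACTORS = [
    (ISO, lambda m: date(int(m.group(1)), int(m.group(2)), int(m.group(3)))),
    (DMY, lambda m: date(int(m.group(3)), int(m.group(2)), int(m.group(1)))),
]

def first_valid(pattern, text, build, today, context=30):
    """Return (date, snippet) for the first match that is a real, non-past date."""
    for m in pattern.finditer(text):
        try:
            when = build(m)
        except ValueError:
            continue                      # not a real calendar date, e.g. 31/02
        if when >= today:
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            return when, text[start:end]
    return None

def best_guesses(text, today):
    """Collect one guess per extractor, for a human to choose between."""
    guesses = []
    for pattern, build in EXTRACTORS:
        hit = first_valid(pattern, text, build, today)
        if hit is not None:
            guesses.append(hit)
    return guesses

text = "Last show was 12/05/2020. Next: 31/02/2030 (typo), then 14/06/2030 in Oslo."
for when, snippet in best_guesses(text, date(2025, 1, 1)):
    print(when.isoformat(), "|", snippet)
```

Here the past date and the impossible date are both skipped, and the one surviving guess is shown with enough nearby text for the eye to judge it.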
Times? A sample: 9am/pm, 9a.m, 9 uhr, 9h, 9.00, 9:00, 09:00, 21:00 - with varying spaces, and either preceding or following the date.
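A rough sketch of a pattern for those time variants follows. It is only an example under my own assumptions: TIME and find_times are invented names, a bare number with neither minutes nor a suffix is ignored, and the lookarounds try to avoid mistaking part of a dotted date such as 9.03.2024 for a time.

```python
import re
from datetime import time

TIME = re.compile(
    r"(?<![\w.:/-])(\d{1,2})"           # hour, not glued to a preceding word or date
    r"(?:[:.h](\d{2}))?"                # optional minutes after ':', '.' or 'h'
    r"\s*(a\.?m\.?|p\.?m\.?|uhr|h)?"    # optional suffix, possibly after spaces
    r"(?!\w)(?![./-]\d)",               # must not run on into a word or a date
    re.IGNORECASE,
)

def find_times(text):
    """Return datetime.time values for the clock times found in text."""
    times = []
    for m in TIME.finditer(text):
        hour, minutes, suffix = int(m.group(1)), m.group(2), m.group(3)
        if minutes is None and suffix is None:
            continue                    # a bare number is not a time
        marker = (suffix or "").lower().replace(".", "")
        if marker in ("am", "pm"):
            if not 1 <= hour <= 12:
                continue
            # 12am becomes 0, 12pm stays 12, 9pm becomes 21.
            hour = hour % 12 + (12 if marker == "pm" else 0)
        minute = int(minutes or 0)
        if hour < 24 and minute < 60:
            times.append(time(hour, minute))
    return times

for t in find_times("Doors 9pm, 9a.m, 9 uhr, 9h, 9.00, 9:00, 09:00 or 21:00"):
    print(t.strftime("%H:%M"))
```

Pairing each time with the right date, when it may come before or after it, is a separate problem that this sketch does not attempt.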
No attempt has yet been made to handle relative expressions like "next Saturday", "this Saturday", etc. Too many groups just post an image of the performance poster.
The code is as rambling as this attempt to describe its requirements.